TriPoly: haplotype estimation for polyploids using sequencing data of related individuals

TriPoly: haplotype estimation for polyploids using sequencing data of related individuals Abstract Motivation Knowledge of haplotypes, i.e. phased and ordered marker alleles on a chromosome, is essential to answer many questions in genetics and genomics. By generating short pieces of DNA sequence, high-throughput modern sequencing technologies make estimation of haplotypes possible for single individuals. In polyploids, however, haplotype estimation methods usually require deep coverage to achieve sufficient accuracy. This often renders sequencing-based approaches too costly to be applied to large populations needed in studies of Quantitative Trait Loci. Results We propose a novel haplotype estimation method for polyploids, TriPoly, that combines sequencing data with Mendelian inheritance rules to infer haplotypes in parent-offspring trios. Using realistic simulations of both short and long-read sequencing data for banana (Musa acuminata) and potato (Solanum tuberosum) trios, we show that TriPoly yields more accurate progeny haplotypes at low coverages compared to existing methods that work on single individuals. We also apply TriPoly to phase Single Nucleotide Polymorphisms on chromosome 5 for a family of tetraploid potato with 2 parents and 37 offspring sequenced with an RNA capture approach. We show that TriPoly haplotype estimates differ from those of the other methods mainly in regions with imperfect sequencing or mapping difficulties, as it does not rely solely on sequence reads and aims to avoid phasings that are not likely to have been passed from the parents to the offspring. Availability and implementation TriPoly has been implemented in Python 3.5.2 (also compatible with Python 2.7.3 and higher) and can be freely downloaded at https://github.com/EhsanMotazedi/TriPoly. Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Haplotypes are defined as sequences of consecutive nucleotides over a chromosome, which normally shares high similarity with k–1 other chromosomes in diploid (k = 2) and polyploid (k > 2) organisms. These k homologous chromosomes can nevertheless have important differences in the form of nucleotide substitutions or insertions/deletions, leading to genotypic (and phenotypic) diversity within an outcrossing population, e.g. of the diploid (k = 2) human (Homo sapiens), tetraploid (k = 4) African clawed frog (Xenopus laevis) or tetraploid potato (Solanum tuberosum), or between inbred lines of autogamous species, e.g. hexaploid (k = 6) wheat (Triticum aestivum). The assignment of these variant forms, i.e. alleles, to the chromosomes is called phasing or haplotyping. In this context, phasing may also refer to the set of phased homologues, H={h1,h2,…,hk} with k being the ploidy level and hi(i=1,…,k) being the haplotype corresponding to the ith homologue. As phasing is uninformative at genomic positions with identical nucleotides over all the homologous chromosomes, i.e. at homozygous sites, haplotypes are usually defined as sequences of alleles at heterozygous sites over a chromosome. By this definition, 2n haplotypes are theoretically possible for a region covering n bi-allelic Single Nucleotide Polymorphisms (SNPs), which is the most abundant form of genomic variation among individuals of the same species (Rafalski, 2002). However, often far fewer haplotypes are actually found in a population. While high-throughput genotyping assays, such as SNP arrays, exist for efficient determination of unphased SNPs, direct determination of haplotypes is much more complicated and usually requires laborious and expensive techniques such as bacterial cloning, allele-specific PCR or chromosome microdissection (Doležel et al., 2014; Michalatos-Beloin et al., 1996; Triplett et al., 2012). However, unphased SNPs provide less knowledge about an individual’s phenotype compared to phased SNPs, as both gene expression and protein function can be affected by the heterozygous variants being in cis or trans with other variants (Tewhey et al., 2011). Besides, haplotypes can be used as multi-allelic markers, offering more statistical power compared to single SNPs for genetic linkage and association studies (Simko et al., 2004). Several computational methods have therefore been proposed to indirectly infer the phasing from available genotype data. These can be divided into three main categories. Methods in the first category, such as Merlin (Abecasis et al., 2002) and TetraOrigin (Zheng et al., 2016), target pedigrees and aim to determine the most likely haplotypes using the segregation of marker alleles, taking into account the genetic distances between the marker loci. These methods can be applied to SNPs that are far enough apart to be informative about linkage, and are especially useful with large pedigrees. Methods in the second category, such as Beagle (Browning and Browning, 2007), SHAPEIT (Delaneau et al., 2012) and Eagle (Loh et al., 2016) target populations with unknown pedigrees and are based on coalescence theory, trying to obtain a set of highly frequent haplotypes in the population compatible with the genotype data. Methods in the third category, such as HapCut (Bansal and Bafna, 2008), HapCompass (Aguiar and Istrail, 2013), HapTree (Berger et al., 2014) and SDhaP (Das and Vikalo, 2015), use sequence read data and target single individuals, exploiting the fact that a sequence read that contains at least two SNPs reveals the phasing of the homologue from which it has originated at the contained SNP sites. The aim of these methods is therefore to assign the reads of a single individual to k groups, corresponding to the homologues of a k-ploid, and to obtain the consensus sequence of the reads within each group to reconstruct the haplotypes. All of these approaches have limitations in terms of the ploidy level (k) and the required marker density. For the methods in the third category, sequencing depth and read length are also limiting factors. As an example, Merlin, Beagle, SHAPEIT, Eagle and HapCut can only phase diploids (k = 2) and the TetraOrigin algorithm is only applicable to bi-parental tetraploid populations (k = 4) for which a linkage map is available. Also, HapCompass, HapTree and SDhaP can fail to reconstruct haplotypes with high quality at low sequence depths or at ploidy levels higher than k = 4 (Motazedi et al., 2017). In case parent-progeny relations exist in a sequenced population, it is possible to improve the quality of haplotype estimation by combining the information used in the first and the third categories under a unified scheme. With sequencing experiments becoming cheaper and more efficient, such an approach is of high practical importance as often whole populations are sequenced rather than only genotyped at specific marker loci. An implementation of this unifying framework, called PedMEC, has recently been reported by Garg et al. (2016) for diploid trios, i.e. families with two parents and one offspring. Specifically, PedMEC extends the partial-phasing of sequence reads using their overlaps while penalising meiotic recombination events in each trio. However, the exact dynamic programming approach of Garg et al. (2016) rapidly becomes intractable for polyploids, i.e. with k > 2, as its complexity increases exponentially with an increase in the ploidy level (Section 2). Here, we present a greedy algorithm, TriPoly, for phasing a set of SNPs connected by the sequencing reads in parent-offspring trios. Starting at the SNP site with the smallest genomic coordinate, TriPoly extends the phasing one SNP at a time, keeping only the most likely extended phasings to be worked out in the subsequent extension step. In determining the likelihood of each extension, TriPoly considers its compatibility with the sequence reads, as well as the number of recombination events observed by comparing the parental extensions with that of the offspring. Using quantitative measures, we investigated the quality of haplotype estimates obtained by TriPoly in parent-offspring trios simulated under realistic assumptions with tetraploid × diploid and tetraploid × tetraploid parents. By comparing our results with those obtained using single individual haplotyping methods, we show that TriPoly yields substantially better estimates for the haplotypes of the progeny, especially at low sequencing depths. Finally, we apply TriPoly to phase SNPs on chromosome 5 for a family of tetraploid potato with 2 parents and 37 offspring, sequenced with an RNA capture approach by paired-end Illumina HiSeq-2000 technology. We show that TriPoly phasings differ from those obtained by the other methods mainly in regions with imperfect sequencing or mapping difficulties, as TriPoly does not rely solely on sequence reads and aims to avoid phasings that are not likely to have been passed from the parents to the offspring. 2 Materials and methods 2.1 A Bayesian approach to obtaining phasing probabilities from sequence reads In order to establish a probabilistic model for haplotypes, with the sequence reads as data and the base call error and recombination rates as parameters, we must first determine which reads are informative about the phasing. Informative reads need to cover at least two variants, e.g. SNP sites which are heterozygous for at least one of the trio members (m, f, c), corresponding to mother, father and the offspring (child). As sites that are homozygous in all trio members retain no phasing information, we discard them from the sequence reads and keep only the base-calls corresponding to the variation positions. Therefore, in the first step, the SNP sites, s = 1, 2,…, l, are detected over a genomic region and the genotypes Gs=(Gsm,Gsf,Gsc) are estimated at these sites, using efficient algorithms such as FreeBayes (Garrison and Marth, 2012). The raw reads of each trio member are then replaced by the so-called SNP fragments of length l (Fig. 1) that each correspond to a read and contain the numerically coded alleles, i.e. 0, 1, 2 or 3 representing the reference and alternative nucleotides, at the SNP sites covered by that read and ‘–’ at positions not called or not covered. To reduce sequencing noise, the positions at which the base-calling quality is lower than a desired threshold can be set to ‘–’ as well. Hereafter, by using the term sequence read, r, we refer to SNP fragments that contain at least two determined positions. Fig. 1. View largeDownload slide A set of SNP fragments aligned to a reference and the homologues, h1 and h2, from which the fragments originated. Fragments that have identical variants, specified by 0 (reference) and 1 (alternative), at their overlapping sites are assigned to the same homologue Fig. 1. View largeDownload slide A set of SNP fragments aligned to a reference and the homologues, h1 and h2, from which the fragments originated. Fragments that have identical variants, specified by 0 (reference) and 1 (alternative), at their overlapping sites are assigned to the same homologue In the next step, one should assign the reads to k compatible sets in which all of the reads have the same allele at their overlaps, and obtain the consensus sequence of each set as the phasing. As shown in Figure 1, this process is straightforward for diploids in the absence of sequencing errors. With sequencing errors, however, such an assignment of reads to homologues will be possible only if mismatches are allowed. However, allowing mismatches at sites with no error can lead to incorrect haplotype estimates. Polyploidy results in further complexity, as there may be more than one way to assign the reads to k > 2 sets even when no error is present. This can happen for instance when several haplotypes are identical in a phasing solution, e.g. in a 3 SNP tetraploid phasing consisting of 4 homologues: (100100100011) in which three identical (100) haplotypes are present. In this example, the reads will be compatible with any phasing as long as it contains both (100) and (011) haplotypes regardless of their dosages, for example (100100011011) ⁠. Therefore, probabilistic models must take the uncertainty caused by the presence of several phasing possibilities and sequencing errors into account. To build the probabilistic model, we assume an independent binomial error model at each SNP site (Berger et al., 2014) and assign an error vector, ϵ⃗r ⁠, of length l to each read containing the probability of erroneous base-calling at the SNP sites in that read. Using these error rates, the probabilities of possible maternal, paternal and offspring phasings in a trio, represented by Hm, Hf and Hc, respectively, can be derived from the set of sequence reads associated with the trio, R (consisting of maternal read Rm, paternal reads Rf and offspring reads Rc). In addition to the reads, we consider meiotic recombination probabilities, θs, between SNP s–1 and SNP s, represented by vector θ⃗ for all s > 1 to adjust the probability assigned to each phasing using Mendelian inheritance rules as follows: P(Hm,Hf,Hc|R,ϵ,θ⃗)=P(Hm|Rm,ϵm)×P(Hf|Rf,ϵf)·P(Hc|Rc,Hm,Hf,ϵc,θ⃗)R=Rm∪Rf∪Rcϵ=ϵm∪ϵf∪ϵc (1) where ϵm, ϵf and ϵc are sets of error vectors associated with Rm, Rf and Rc, respectively. Assuming exchangeability of the offspring, it is straightforward to generalize Equation (1) to include n offspring as: P(Hm,Hf,Hc1,…,Hcn|R,ϵ,θ⃗)=P(Hm|Rm,ϵm)P(Hf|Rf,ϵf)∏i=1nP(Hci|Rci,Hm,Hf,ϵci,θ⃗)R=∪i=1nRci∪Rm∪Rfϵ=∪i=1nϵci∪ϵm∪ϵf (2) By calculating the righthand side of Equation (2), one can determine the likelihood of each possible phasing for a trio conditional on its sequence reads. However, as it is instead more convenient to calculate the probability of observing the reads conditional on a phasing (Berger et al., 2014), we obtain each element of this equation using Bayes’ formula according to: P(Hp|Rp,ϵp)=P(Rp|Hp,ϵp)P(Hp)∑Hp′P(Rp|Hp′,ϵp)P(Hp′), p∈{m,f}P(Hci|Rci,ϵci,Hm,Hf,θ⃗)=P(Rci|Hci,ϵci)P(Hci|Hm,Hf,θ⃗)∑Hci′P(Rci|Hci′,ϵci)P(Hci′|Hm,Hf,θ⃗) (3) where P(Hp) is the prior probability of the parental phasing Hp and P(Hci|Hm,Hf,θ⃗) is the prior probability of the phasing Hci for offspring ci conditional on the parental phasings (Hm, Hf) and the recombination probability θ⃗ (see Supplementary Material A for the calculation of the read likelihoods and priors). 2.2 The TriPoly method Following the Bayesian approach explained in Section 2.1, one has to calculate the likelihood of the reads conditional on every phasing possible. The computational cost of this brute-force approach, calculated in Supplementary Material E, grows linearly with the sequencing depth but exponentially with the number of SNPs, l, rapidly rendering the solution intractable. To overcome this problem, we perform SNP-by-SNP reconstruction of haplotypes, starting from the leftmost SNP in the target region and keeping only a few most likely phasing extensions to the next SNP at each step [Fig. 2, Supplementary Material A: Equations (1–5)]. Following this approach, one will end up with a limited number of phasings that have passed the selection criteria during the extension procedure from s = 1 to s = l. Assuming the selection procedure effectively keeps the number of accepted solutions at each extension step bounded above by Em and Ef for the mother and the father, respectively, the number of trio phasings at each extension will be bounded above by (kmkm2)(kfkf2)EmEf and the total complexity will be O(lkmaxΩmax(kmkm2)(kfkf2)EmEf) ⁠, with kmax and Ωmax denoting the maximum ploidy level and the maximum sequencing coverage in the trio, respectively. This greedy method is therefore linear in terms of the number of SNPs, l. With parental ploidy levels, kp (⁠ p∈{m,f} ⁠), in the range of 2 to 12 (covering most of the naturally occurring cases of polyploidy), (kpkp2)<kp2.75 ⁠. Therefore, the computational complexity grows at a rate of kmax6.5 with the ploidy level. Fig. 2. View largeDownload slide Overview of the SNP-by-SNP haplotyping method implemented in TriPoly for a trio consisting of two parents and one offspring, over a region containing l SNPs Fig. 2. View largeDownload slide Overview of the SNP-by-SNP haplotyping method implemented in TriPoly for a trio consisting of two parents and one offspring, over a region containing l SNPs To implement this greedy method, which we call TriPoly, we employ branching and pruning steps similar to those in the HapTree algorithm (Berger et al., 2014; Supplementary Fig. S1). Starting at SNP site s = 1, the k alleles of each parent and the offspring are used as the base parental and offspring phasings, Hbp and Hbc. The base phasings are then extended step by step from SNP s–1 to SNP s for s≥2 ⁠, until all of the SNPs have been phased as outlined in Supplementary Material: Algorithm S1. At each extension step, branching and pruning (Supplementary Material: Procedure S3 and Supplementary Material: Procedure S4) allow the algorithm to work with a limited number of possible phasings. This approach can be easily extended to include several offspring at the same time using Equation (2), a detailed description of which is given in Supplementary Material A. Note that this approach assumes working on the so-called phasing blocks, i.e. genomic regions in which each SNP, s, is connected to at least one other SNP, s′ ⁠, through at least one of the reads in R. In case the sequencing reads do not satisfy this condition for the whole set of SNPs in the region, it is straightforward to divide the SNP set into blocks prior to the phasing and phase each block separately, with the phasing being interrupted between the blocks. 3 Experimental setup 3.1 Simulation of polyploid trios Before evaluating the performance of TriPoly through realistic simulations, we tested it on directly simulated SNP fragments to determine the upper bounds of its accuracy and to clearly show factors that influence its estimation quality. The advantage of this approach is bypassing of the intermediate base-calling, read alignment and variant calling steps that occur in reality and each add an undetermined amount of noise to the haplotyping process. Therefore, the direct simulation of SNP fragments lets us measure the accuracy in ideal situations and focus on the effects of sequencing depth, SNP fragment length and actual error rate in the SNP fragments. For this purpose, we simulated parental haplotypes corresponding to 1 kb regions according to the potato heterozygosity model (Motazedi et al., 2017), and randomly selected half of the haplotypes of each parent to simulate the offspring. The lengths of the SNP fragments, i.e. the number of SNPs contained in each fragment, was randomly chosen from uniform distributions within the ranges [2, 3] and [2, l] (l being the total number of SNPs), resulting in average fragment lengths of 2.5 and l2+1 ⁠, respectively. Alleles at randomly selected SNP sites on a haplotype were included in each fragment and sufficient SNP fragments were generated to assure the specified per homologue sequencing depths, 5-5-2, 5-5-5, 15-15-2 and 15-15-5 (maternal-paternal-offspring). Considering error rates of 0 (no error), 2% and 10%, SNP alleles were flipped by chance to introduce errors in the fragments. For each scenario, 100 trios were simulated and phased by TriPoly. In the next step, we evaluated the performance of TriPoly, as well as three state-of-the-art single individual haplotyping algorithms: HapCompass, SDhaP and HapTree, by simulating realistic genomes and sequence data and following the read alignment and variant calling steps to obtain the SNP fragments for haplotype estimation. To this end, maternal and paternal genomes were independently simulated from a common reference using Haplogenerator (Motazedi et al., 2017), and offspring genomes were generated by passing recombinant parental chromatids at random considering a Poisson stochastic model for meiosis (see Supplementary Material B for the details). In our simulations, we set the recombination rate λ [Supplementary Material: Equation (7)] to 3.07 cM/Mb ⁠, corresponding to the average recombination rate in potato (Bourke et al., 2015; Felcher et al., 2012). Using this approach, genomic regions of length 10 kb were simulated for 100 independent trios of tetraploid (km=kf=kc=4) potato (Solanum tuberosum, 2n=4x=48 ⁠), based on 100 regions randomly selected from PGSC-DM genome, chromosome 5 (release version 4.03) (Genome Sequencing Consortium et al., 2011) using a lognormal model to simulate genomic variation (Motazedi et al., 2017). To fit the lognormal model, the SNP density of each parent was determined from empirical data (Uitdewilligen et al., 2013) as described in (Motazedi et al., 2017), resulting in a mean distance of 21 bp between neighbouring SNPs with a standard deviation (SD) of 27 bp. The proportion of each parental marker type: simplex, duplex, triplex and quadruplex, in the total set of markers was also determined from (Uitdewilligen et al., 2013) to be 0.5, 0.23, 0.14 and 0.13, respectively. We also simulated crosses of tetraploid and diploid banana (Musa acuminata) yielding triploid offspring (kc = 3), with the female parent being the tetraploid (km = 4) and the male parent being the diploid (kf = 2), as the pollen of tetraploid banana is hardly viable (Fortescue and Turner, 2004). In practice, commercial triploid bananas (⁠ 2n=3x=33 ⁠) are produced by such hybridizations, which have high consumer preference as their parthenocarpic fruits lack the large, hard seeds of fertilization-induced fruits of diploid and tetraploid sorts. We used the sequence of chromosome 10 from the reference genome of DH-Pahang, a doubled-haploid M. acuminata (D’Hont et al., 2012), release version 2 (Martin et al., 2016), to simulate banana trios, applying the lognormal model to generate SNPs. To fit the model, we set the average SNP frequency to 1 per 200 bp with a SD of 1194 bp. In the absence of population data like the one used for potato, we chose these compromise values so that we do not get many uninformative reads due to SNP sparsity (Section 2), while the average distance of 1394 bp reported for DH-Pahang SNPs (Droc et al., 2013) lies one SD away from our used average distance and thus could still frequently occur in the simulations. As 1% recombination rate has been reported to correspond to 100 to 400 kb physical distance for banana [except at regions close to the centromere (Pillay et al., 2012, p. 130)], we applied an average recombination rate of 0.04 cM/Mb for the simulation of meiosis. The proportions of parental marker types were set the same as that of potato. For each simulated individual, sequence data were simulated according to Illumina HiSeq-2000 and PacBio CCS technologies, and the read alignment and variant calling steps were performed using conventional tools as explained in Supplementary Material C. 3.2 Application to potato candidate gene sequencing data We used TriPoly, HapCompass, SDhaP and HapTree to estimate the haplotype blocks of chromosome 5 in a mapping population of tetraploid potato consisting of 2 parents and 37 offspring (Supplementary Material D). We used 1417 RNA capture probes for re-sequencing candidate genes by paired-end Illumina HiSeq-2000 technology with a median insert size of 316 bp per sample and a median absolute deviation of 58 bp among the insert sizes of each sample’s reads. The single reads within the paired fragments were 101 bp long. On average, the sequencing coverage for each sample was 58× (SD = 15) on the captured regions. However, the coverage varied markedly with genomic position as expected with an RNA capture approach (Sims et al., 2014), with SDs from the mean over all of the positions ranging from 25.5 to 122 among the samples. The sequence reads were mapped against the PGSC-DM genome (version 4.03) using bwa-mem (Li, 2013), and 9762 SNPs were jointly called for all samples using FreeBayes (Garrison and Marth, 2012). A filtering step removed SNPs whose segregation ratios significantly violated those predicted by Mendelian rules according to Pearson’s χ2 test. In the end, 7994 SNPs were considered for phasing by each haplotyping approach. 3.3 Measures of phasing estimation quality Knowing the true haplotypes in simulations, one can evaluate the performance of haplotyping methods by using measures that directly compare the estimates to the true haplotypes. We used the reconstruction rate (RR) (Geraci, 2010) and the pair-wise phasing accuracy rate (PAR) (Motazedi et al., 2017) to evaluate the accuracy, and the SNP missing rate (SMR) (Motazedi et al., 2017) as well as the number of gaps per SNP (NGPS) to evaluate the completeness and continuity of haplotyping. The first measure, RR, has been defined for diploids as the proportion of correctly phased markers in the phasing estimate of the target region (Geraci, 2010). However, to apply it for polyploids we have to generalize its mathematical formulation as haplotypes are not necessarily complementary in polyploids, making multiple correspondences possible between the original and estimated haplotypes. Let H^={h^1,…,h^k} be the estimated phasing and H={h1,…,hk} be the correct phasing of a region containing l SNPs. We define RR as: RRH^,H=1−min⁡p∈Sk1kl∑i=1kD(hi,h^φ(p,i)) (4) where Sk represents the permutation group on {1,…,k} and φ denotes the group action on {1,…,k} ⁠. In this definition, D(hi,h^φ(p,i)) is the Hamming distance: D(hi,h^φ(p,i))=∑s=1ld(hi,h^φ(p,i),s)d(hi,h^φ(p,i),s)={1his≠h^φ(p,i)s,h^φ(p,i)s≠‘‘-’’0otherwise (5) where h^φ(p,i)s=‘‘-’’ means that SNP s has not been phased in H^ ⁠. As an alternative measure of estimation accuracy, PAR is defined as the proportion of all SNP pairs for which the inferred phasing is correct. It is important to note that PAR takes into account phasings between any two SNPs and is not restricted to pairs of consecutive SNPs. While RR is an overall measure of accuracy based on the Hamming distance between the original haplotype and its estimate, PAR primarily shows the accuracy of long range phasing as it is highly affected by chimeric elongations of the haplotypes during estimation, i.e. the elongation of a homologue by part of another homologue. As the true haplotypes were not known for the empirical dataset (Section 3.2), we used RR and PAR to quantify the similarity between the haplotypes obtained by the various methods. As haplotyping methods sometimes report phasings with high SNP exclusion, which nevertheless can have high RR and PAR, the average proportion of SNPs left out in the phasing estimates of each method (SMR) was calculated, measuring phasing completeness. Besides, in order to learn how fragmented the phasing estimates are for each method, which is not reflected in RR, PAR or SMR, the average number of interruptions, i.e. the number of blocks minus one, in the estimates of each method was calculated and normalized by the number of SNPs, l, as NGPS. Defined in this way, NGPS measures the continuity of phasing. All of the calculations to obtain these quality measures were performed using hapcompare (Motazedi et al., 2017). In order to quantify the differences in haplotyping quality of the methods, we built regression models with the measures of haplotyping quality (RR, PAR, SMR and NGPS) as the dependent variables and the haplotyping method as the factor variable. To take the effect of sequencing depth into account, this was added as covariate to the regression models. We accounted for random variation among the simulated families by including a random intercept in the regression models. 3.4 Computational settings All of the analyses were run using 2.90 GHz Intel Xeon processors. A time-limit of 1500 seconds was set for each haplotyping method during simulations, not to consume too much of the shared computational resources in case estimation became prohibitively difficult (Motazedi et al., 2017). To achieve time-memory efficiency without losing much accuracy, we set the branching threshold of TriPoly, ρ, to 0.2 and its pruning threshold, κ, to 0.94, based on the results of pilot simulations. Besides, we forced TriPoly to keep no more than 11% of all possible phasing extensions at each step in case the pruning had not been able to discard as many with the value chosen for κ. 4 Results To obtain the upper bounds of accuracy for TriPoly and to clearly observe the effects of SNP fragment length, sequencing depth and sequencing errors on its accuracy, we simulated SNP fragments with different lengths and error rates at various depths. We used the reconstruction rate (RR) and the pairwise phasing accuracy rate (PAR) (Section 3.3) to measure the accuracy of TriPoly in these simulations. In order to assess the performance of TriPoly in practical situations and to compare it to HapCompass, HapTree and SDhaP, we next simulated realistic genomes and sequence reads for trios of tetraploid-diploid-triploid banana and tetraploid potato, and estimated the haplotypes by first aligning the reads and calling variants using conventional software. In addition to the measures RR and PAR used with SNP fragment simulations, we used the number of gaps per SNP (NGPS) in each estimate as measure of phasing continuity and the fraction of unphased SNPs, the SNP missing rate (SMR), as measure of phasing completeness in the realistic simulations (both measures were zero in the SNP fragment simulations). The mentioned haplotyping methods were also applied to a mapping population of tetraploid potato with 2 parents and 37 offspring, sequenced with paired-end Illumina HiSeq-2000 technology (Section 3.2). We compared the phasing estimates obtained by TriPoly to those of the other methods using PAR, RR, NGPS and SMR to detect agreements and conflicts. We also investigated which genomic regions are likely to be assigned to different phasings by different methods. 4.1 Performance on simulated data TriPoly yields almost perfect haplotypes in ideal situations and improves the overall quality of phasing in practice Figure 3 shows the performance of TriPoly with the simulated SNP fragments at various coverages and with different error rates and fragment lengths. As seen in Figure 3a and b, the RR and the PAR are very close to 1 with an average of l2+1 SNPs per fragment with low error rates (⁠ ≤2%), hence a high phasing information content in the fragments, indicating the precision of TriPoly method in ideal situations. However, it is also evident from Figure 3c and d that with small fragment lengths (occurring in practice due to limited heterozygosity and restricted read lengths), the precision substantially drops, especially at high error rates, although higher sequencing coverages can compensate for this to some extent. Fig. 3. View largeDownload slide Average RRs and pair-wise PAR obtained by TriPoly using the simulated SNP fragments for 100 trios with 0, 2% and 10% error rates at parent1-parent2-offspring coverages: 5-5-2, 5-5-5, 15-15-2, 15-15-5. (a) and (b) show the results with SNP fragment lengths in the range [2, l] (l being the total number of SNPs), while (c) and (d) show the results for short fragment lengths in the range [2, 3] Fig. 3. View largeDownload slide Average RRs and pair-wise PAR obtained by TriPoly using the simulated SNP fragments for 100 trios with 0, 2% and 10% error rates at parent1-parent2-offspring coverages: 5-5-2, 5-5-5, 15-15-2, 15-15-5. (a) and (b) show the results with SNP fragment lengths in the range [2, l] (l being the total number of SNPs), while (c) and (d) show the results for short fragment lengths in the range [2, 3] Through the realistic simulations of genomes and HiSeq-2000 reads, we showed 11% and 24% increases in RR by using TriPoly compared to the other methods for the banana and potato offspring, respectively (Fig. 4, Supplementary Tables S1 and S2). The obtained average increases in accuracy were 15% and 28% for the simulations with PacBio CCS reads, with RR scores of over 95% with these long reads (Supplementary Figs S5 and S6, Supplementary Tables S3 and S4). These observed improvements in the overall phasing show that parental transmission is informative even for phasing between nearby SNPs, in which case the SNPs can be contained within a single read. However, TriPoly did not increase RR for the parents, especially compared to HapTree (Supplementary Figs S2–S6). Fig. 4. View largeDownload slide Average RRs for the progeny in the 100 trios simulated for (a) potato and (b) banana, obtained by HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads Fig. 4. View largeDownload slide Average RRs for the progeny in the 100 trios simulated for (a) potato and (b) banana, obtained by HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads TriPoly markedly increases the accuracy of phasing between distant SNPs for the offspring The realistic simulations with HiSeq-2000 reads showed 33% and 42% higher PARs obtained by TriPoly for banana and potato offspring, respectively, at the same SMRs compared to the other methods (Fig. 5, Supplementary Fig. S4, Supplementary Tables S1 and S2). This increase was more manifest at low sequencing depths (Fig. 5), as the parental transmission information used by TriPoly is especially advantageous when little information is provided by the reads. By penalising recombination events through the considered small recombination probability [Supplementary Material: Equation (6)], TriPoly tends to reduce the chance of chimeric extensions and markedly improves the precision of phasing between distant SNPs in the offspring. However, TriPoly did not increase PAR for the parents (Supplementary Figs S2 and S3). Fig. 5. View largeDownload slide Average pairwise-PARs for the progeny in the 100 trios simulated for (a) potato and (b) banana, obtained by HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads Fig. 5. View largeDownload slide Average pairwise-PARs for the progeny in the 100 trios simulated for (a) potato and (b) banana, obtained by HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads With the simulated PacBio CCS reads, 31% and 45% increases in the average PAR were obtained by TriPoly for the banana and potato trios with the average PAR scores of TriPoly reaching 90% and 94%, respectively (Supplementary Tables S3 and S4, Supplementary Figs S5 and S6). Fewer phasing interruptions are introduced in the haplotype estimates by TriPoly As explained in Section 3.3, in read-based haplotyping the phasing is interrupted between two SNPs if there is no read that connects the two by covering both. With TriPoly, the blocks are determined once for the whole trio and therefore the SNPs can be connected through reads of any member in the trio. In this manner, two SNPs can be phased in an individual even if they are not connected by any reads of that individual as long as the other members in the trio provide the phasing information through their reads. This is especially beneficial for short Illumina reads, as PacBio reads contain more SNPs and hence often yield uninterrupted haplotypes even with single individual methods. The regression analysis of NGPS for HiSeq-2000 simulations showed that the haplotypes obtained by TriPoly were significantly less interrupted compared to the other approaches, notably for banana (Supplementary Tables S1 and S2, Fig. 6). At lower SNP densities, as the average distance between subsequent SNPs will be larger a higher number of reads will be uninformative for phasing (Section 2) and therefore more interruptions can be introduced in the reconstructed haplotypes (Motazedi et al., 2017). TriPoly proves to be especially beneficial in such situations, explaining the notable decrease in NGPS for banana compared to the slight decrease in NGPS for potato. With PacBio reads, the NGPS was (as expected) much lower on average (Supplementary Tables S3 and S4) and even zero with all of the methods at 5-5-5 coverage for potato (Supplementary Fig. S6). However, a substantial improvement was still observed with TriPoly especially for banana trios (Supplementary Table S3, Supplementary Fig. S5). Finally, the high SD of NGPS for HapTree stands out in Figure 6a, which is a reflection of its high failure rate at low sequencing coverages for tetraploid potato (Motazedi et al., 2017). As all of the SNPs belonging to a failed block are excluded from the final phasing, NGPS varies more across the simulated trios due to chance failures. Fig. 6. View largeDownload slide NGPS in the phasing estimates of the progeny from the 100 trios simulated for (a) potato and (b) banana, using HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads Fig. 6. View largeDownload slide NGPS in the phasing estimates of the progeny from the 100 trios simulated for (a) potato and (b) banana, using HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads TriPoly has the smallest memory consumption with similar running times As processing large genomic regions usually requires considerable amounts of CPU time and memory, it is important for a haplotyping algorithm to be efficient in terms of these two resources. Therefore, we measured the computation time and memory consumption of TriPoly for the simulated potato and banana trios at the applied sequencing depths and compared it to those of HapCompass, HapTree and SDhaP. As shown in the Supplementary Figures S7 and S8, TriPoly is the most memory-efficient algorithm compared to the others, while it requires more time compared to HapCompass and SDhaP for potato with HiSeq-2000 read and more time compared to all of the algorithms for both banana and potato with PacBio CCS reads. However, the amount of time required by TriPoly was still close to that needed by the other algorithms. Analysis of candidate gene sequencing data As the true haplotypes were not known for the real dataset, we evaluated the performance of TriPoly by comparing its estimates to those obtained by HapCompass, SDhaP and HapTree. The previously introduced measures, PAR and RR, were used for this purpose by replacing the true phasing in the original definition of each with the estimates of single individual haplotyping methods. This comparison revealed about 82% agreement between the pairwise phasings obtained by TriPoly and HapTree, and an overall similarity of 94% between the results of the two methods. Whilst almost the same agreement was observed between the estimates of HapCompass and TriPoly (PAR = 80%, RR = 93%), the phasings reported by SDhaP were largely different with only 46% similarity of pairwise phasing and an overall similarity score of 87% (Table 1). Table 1. Comparison of the phasings estimated by TriPoly to those estimated by HapCompass, SDhaP and HapTree for a family of tetraploid potato with 2 parents and 37 offspring PAR RR SMR NGPS TriPoly — — 0.017 0.18 HapCompass 0.80 0.93 0.015 0.18 SDhaP 0.45 0.87 0.025 0.18 HapTree 0.82 0.94 0.14 0.19 PAR RR SMR NGPS TriPoly — — 0.017 0.18 HapCompass 0.80 0.93 0.015 0.18 SDhaP 0.45 0.87 0.025 0.18 HapTree 0.82 0.94 0.14 0.19 Table 1. Comparison of the phasings estimated by TriPoly to those estimated by HapCompass, SDhaP and HapTree for a family of tetraploid potato with 2 parents and 37 offspring PAR RR SMR NGPS TriPoly — — 0.017 0.18 HapCompass 0.80 0.93 0.015 0.18 SDhaP 0.45 0.87 0.025 0.18 HapTree 0.82 0.94 0.14 0.19 PAR RR SMR NGPS TriPoly — — 0.017 0.18 HapCompass 0.80 0.93 0.015 0.18 SDhaP 0.45 0.87 0.025 0.18 HapTree 0.82 0.94 0.14 0.19 Considering the SMR, HapTree suffered substantially higher rates (Table 1) as a result of its failure to generate estimates for many blocks due to instability, as reported before in a simulation study (Motazedi et al., 2017). Among the applied methods, HapCompass, HapTree and SDhaP report the phasing the most compatible or the most likely with regard to the reads of each single individual as its phasing estimate. In contrast, TriPoly attempts to find the most likely haplotypes taking parental transmission probabilities into account in addition to the reads (Supplementary Material A). The large agreement between the phasing estimates of TriPoly and HapTree (as well as HapCompass) suggests that TriPoly estimates are satisfactorily compatible with the reads, but are more accurate in presence of noise in regions with high base-calling error or low mapping/variant calling quality, since TriPoly penalises wrong extensions using the phasing information of the parents about the offspring. To test this hypothesis, we considered for each haplotype block the number of reads that were aligned back to its corresponding genomic region, as an indicator of sequencing depth and mapping success (and so indirectly the variant calling quality) of that region. As seen in Figure 7, the methods disagree mostly in regions with poor mapping, perhaps because of genomic divergence from the used reference [which occurs often in plant studies (Golicz et al., 2016)], or in regions with poor capture success, where TriPoly is more reliable due to its taking parental transmission information into account. Fig. 7. View largeDownload slide Agreement in pairwise SNP phasing, measured by the logarithm of PAR, between Tripoly and (a) HapCompass, (b) HapTree and (c) SDhaP at potato haplotype blocks against the number of aligned reads for each block. The grey curve shows the exponential fit to the minimum log PAR score at each alignment count, revealing an exponential increase in phasing agreement with an increase in the number of reads Fig. 7. View largeDownload slide Agreement in pairwise SNP phasing, measured by the logarithm of PAR, between Tripoly and (a) HapCompass, (b) HapTree and (c) SDhaP at potato haplotype blocks against the number of aligned reads for each block. The grey curve shows the exponential fit to the minimum log PAR score at each alignment count, revealing an exponential increase in phasing agreement with an increase in the number of reads We found a positive association between the phasing agreement scores of each block and its number of supporting reads, especially for distant SNPs as the minimum observed PAR increases exponentially with the number of aligned reads, at a rate corresponding at low coverages to 0.01 per 4 reads, i.e. one more read per chromosome for potato (Fig. 7). Although this result is only descriptive, it points out the impact of sequencing depth and successful alignment on haplotype estimation. 5 Conclusion and discussion We propose a novel approach, called TriPoly, for estimating haplotypes in polyploid parent-offspring trios using sequencing data, while taking haplotype transmission from the parents to the progeny into account. TriPoly reconstructs the phasing of the SNPs over a genomic region simultaneously for the parents and for the offspring, starting from the SNP site with the smallest co-ordinate in the region, adding one SNP to the phasing at each step and greedily selecting the most likely extensions for the next extension step conditional on the sequence reads and recombination events. Through idealistic simulations, we show that TriPoly yields almost perfect haplotypes if the reads are long enough and accurate. Through realistic simulations of both short and long-read sequence data, we show that TriPoly significantly improves the haplotyping accuracy for the offspring compared to single individual approaches: HapCompass, SDhaP and HapTree. Besides, we show that TriPoly estimates are more continuous compared to the other methods when the SNP density is low. TriPoly is also an efficient algorithm in terms of the memory consumption and CPU time. We used TriPoly and the other methods to estimate haplotypes in a mapping population of tetraploid potato with 2 parents and 37 offspring sequenced using an RNA capture complexity reduction approach. We argue that in regions with imperfect sequencing or erroneous read mapping, TriPoly is more reliable compared to the other methods since it takes parental transmission probabilities into account to correct misleading read information. In contrast to HapCompass, SDhaP and HapTree, TriPoly provides an option to keep homozygous or missing SNPs in the phasing estimates of individuals. In this way, haplotypes can be compared over the same set of SNPs in an F1-population and segregation patterns can be easily investigated. Moreover, haplotypes reported in this format can be coded as multi-allelic markers to be used in genetic analyses. Besides, TriPoly accepts input in the more convenient format of multi-sample BAM and VCF files, compared to the other methods that either require one-sample BAM/VCF (HapCompass) or the SNP fragment matrix (SDhaP and HapTree). While TriPoly increases the accuracy of phasing for the offspring in a trio by incorporating parental recombination probabilities in the phasing likelihood [Equation (1)], it assumes exchangeability of the offspring in families with more than one progeny [Equation (2)] and therefore ignores the phasing information conveyed by one offspring about the others. By implementing more complex joint likelihood models, we can expect to achieve an enhancement in haplotyping accuracy for larger families. However, the computational burden is definitely a challenge in implementing such an approach. Another potential improvement in TriPoly is the phasing of the parents, the accuracy of which was shown to be inferior to that obtained by HapTree. An iterative approach of keeping a few surviving TriPoly solutions for the whole target region as the starting point for an Expectation Maximization routine can be a way to tackle this problem, resulting in a refined set of most likely haplotypes in the population to which the reads of each individual can be mapped back to find its specific phasing. Like the joint likelihood approach, the computational challenge will be an important consideration here. Acknowledgements The authors thank the graduate school Experimental Plant Sciences (EPS), Wageningen University and Research (WUR) for funding this work, Jeanne Jacobs and Sagar Datir for initiating the development of the tetraploid potato mapping population and Peter Bourke for performing the investigation of segregation patterns in the mapping population. The authors also wish to thank the project initiative ‘Novel genetic and genomic tools for polyploid crops’ (project number BO-26.03-009-004) for the international promotion and support of polyploid research, from which the present study benefitted a great deal. Conflict of Interest: none declared. References Abecasis G.R. et al. ( 2002 ) Merlin rapid analysis of dense genetic maps using sparse gene flow trees . Nat. Genet ., 30 , 97 . Google Scholar Crossref Search ADS PubMed Aguiar D. , Istrail S. ( 2013 ) Haplotype assembly in polyploid genomes and identical by descent shared tracts . Bioinformatics , 29 , i352 – i360 . Google Scholar Crossref Search ADS PubMed Bansal V. , Bafna V. ( 2008 ) HapCUT: an efficient and accurate algorithm for the haplotype assembly problem . Bioinformatics , 24 , i153 – i159 . Google Scholar Crossref Search ADS PubMed Berger E. et al. ( 2014 ) HapTree: a novel Bayesian framework for single individual polyplotyping using NGS data . PLoS Comput. Biol ., 10 , e1003502. Google Scholar Crossref Search ADS PubMed Bourke P.M. et al. ( 2015 ) The double-reduction landscape in tetraploid potato as revealed by a high-density linkage map . Genetics , 201 , 853 – 863 . Google Scholar Crossref Search ADS PubMed Browning S.R. , Browning B.L. ( 2007 ) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering . Am. J. Hum. Genet ., 81 , 1084 – 1097 . Google Scholar Crossref Search ADS PubMed Das S. , Vikalo H. ( 2015 ) SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming . BMC Genomics , 16 , 260. Google Scholar Crossref Search ADS PubMed Delaneau O. et al. ( 2012 ) A linear complexity phasing method for thousands of genomes . Nat. Methods , 9 , 179. Google Scholar Crossref Search ADS D’Hont A. et al. ( 2012 ) The banana (Musa acuminata) genome and the evolution of monocotyledonous plants . Nature , 488 , 213 – 217 . Google Scholar Crossref Search ADS PubMed Doležel J. et al. ( 2014 ) Advances in plant chromosome genomics . Biotechnol. Adv ., 32 , 122 – 136 . Google Scholar Crossref Search ADS PubMed Droc G. et al. ( 2013 ) The banana genome hub. Database, 2013, bat035. Felcher K.J. et al. ( 2012 ) Integration of two diploid potato linkage maps with the potato genome sequence . PLoS One , 7 , e36347. Google Scholar Crossref Search ADS PubMed Fortescue J. , Turner D. ( 2004 ) Pollen fertility in Musa: viability in cultivars grown in Southern Australia . Crop Pasture Sci ., 55 , 1085 – 1091 . Google Scholar Crossref Search ADS Garg S. et al. ( 2016 ) Read-based phasing of related individuals . Bioinformatics , 32 , i234 – i242 . Google Scholar Crossref Search ADS PubMed Garrison E. , Marth G. ( 2012 ) Haplotype-based variant detection from short-read sequencing . arXiv Preprint arXiv , 1207 , 3907 . Geraci F. ( 2010 ) A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem . Bioinformatics , 26 , 2217 – 2225 . Google Scholar Crossref Search ADS PubMed Golicz A.A. et al. ( 2016 ) The pangenome of an agronomically important crop plant Brassica oleracea . Nat. Commun ., 7 , 13390 . Google Scholar Crossref Search ADS PubMed Li H. ( 2013 ) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv: 1303.3997. Loh P.-R. et al. ( 2016 ) Reference-based phasing using the Haplotype Reference Consortium panel . Nat. Genet ., 48 , 1443. Google Scholar Crossref Search ADS PubMed Martin G. et al. ( 2016 ) Improvement of the banana ‘Musa acuminata’ reference sequence using NGS data and semi-automated bioinformatics methods . BMC Genomics , 17 , 243 . Google Scholar Crossref Search ADS PubMed Michalatos-Beloin S. et al. ( 1996 ) Molecular haplotyping of genetic markers 10 kb apart by allele-specific long-range PCR . Nucleic Acids Res ., 24 , 4841 – 4843 . Google Scholar Crossref Search ADS PubMed Motazedi E. et al. ( 2017 ) Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study. Brief. Bioinform., 19 , 387 – 403 . Pillay M. et al. ( 2012 ) Genetics, Genomics, and Breeding of Bananas . Science Publishers , USA. Potato Genome Sequencing Consortium et al. ( 2011 ) Genome sequence and analysis of the tuber crop potato . Nature , 475 , 189 – 195 . Crossref Search ADS PubMed Rafalski J.A. ( 2002 ) Novel genetic mapping tools in plants: sNPs and LD-based approaches . Plant Sci ., 162 , 329 – 333 . Google Scholar Crossref Search ADS Simko I. et al. ( 2004 ) Mapping genes for resistance to Verticillium albo-atrum in tetraploid and diploid potato populations using haplotype association tests and genetic linkage analysis . Mol. Genet. Genomics , 271 , 522 – 531 . Google Scholar Crossref Search ADS PubMed Sims D. et al. ( 2014 ) Sequencing depth and coverage: key considerations in genomic analyses . Nat. Rev. Genet ., 15 , 121 – 132 . Google Scholar Crossref Search ADS PubMed Tewhey R. et al. ( 2011 ) The importance of phase information for human genomics . Nat. Rev. Genet ., 12 , 215 – 223 . Google Scholar Crossref Search ADS PubMed Triplett J.K. et al. ( 2012 ) Five nuclear loci resolve the polyploid history of switchgrass (panicum virgatum l.) and relatives . PLoS One , 7 , e38702. Google Scholar Crossref Search ADS PubMed Uitdewilligen J.G. et al. ( 2013 ) A next-generation sequencing method for genotyping-by-sequencing of highly heterozygous autotetraploid potato . PLoS One , 8 , e62355. Google Scholar Crossref Search ADS PubMed Zheng C. et al. ( 2016 ) Probabilistic multilocus haplotype reconstruction in outcrossing tetraploids . Genetics , 203 , 119 – 131 . Google Scholar Crossref Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

TriPoly: haplotype estimation for polyploids using sequencing data of related individuals

Loading next page...
 
/lp/ou_press/tripoly-haplotype-estimation-for-polyploids-using-sequencing-data-of-ROPpkMzpAv
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty442
Publisher site
See Article on Publisher Site

Abstract

Abstract Motivation Knowledge of haplotypes, i.e. phased and ordered marker alleles on a chromosome, is essential to answer many questions in genetics and genomics. By generating short pieces of DNA sequence, high-throughput modern sequencing technologies make estimation of haplotypes possible for single individuals. In polyploids, however, haplotype estimation methods usually require deep coverage to achieve sufficient accuracy. This often renders sequencing-based approaches too costly to be applied to large populations needed in studies of Quantitative Trait Loci. Results We propose a novel haplotype estimation method for polyploids, TriPoly, that combines sequencing data with Mendelian inheritance rules to infer haplotypes in parent-offspring trios. Using realistic simulations of both short and long-read sequencing data for banana (Musa acuminata) and potato (Solanum tuberosum) trios, we show that TriPoly yields more accurate progeny haplotypes at low coverages compared to existing methods that work on single individuals. We also apply TriPoly to phase Single Nucleotide Polymorphisms on chromosome 5 for a family of tetraploid potato with 2 parents and 37 offspring sequenced with an RNA capture approach. We show that TriPoly haplotype estimates differ from those of the other methods mainly in regions with imperfect sequencing or mapping difficulties, as it does not rely solely on sequence reads and aims to avoid phasings that are not likely to have been passed from the parents to the offspring. Availability and implementation TriPoly has been implemented in Python 3.5.2 (also compatible with Python 2.7.3 and higher) and can be freely downloaded at https://github.com/EhsanMotazedi/TriPoly. Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Haplotypes are defined as sequences of consecutive nucleotides over a chromosome, which normally shares high similarity with k–1 other chromosomes in diploid (k = 2) and polyploid (k > 2) organisms. These k homologous chromosomes can nevertheless have important differences in the form of nucleotide substitutions or insertions/deletions, leading to genotypic (and phenotypic) diversity within an outcrossing population, e.g. of the diploid (k = 2) human (Homo sapiens), tetraploid (k = 4) African clawed frog (Xenopus laevis) or tetraploid potato (Solanum tuberosum), or between inbred lines of autogamous species, e.g. hexaploid (k = 6) wheat (Triticum aestivum). The assignment of these variant forms, i.e. alleles, to the chromosomes is called phasing or haplotyping. In this context, phasing may also refer to the set of phased homologues, H={h1,h2,…,hk} with k being the ploidy level and hi(i=1,…,k) being the haplotype corresponding to the ith homologue. As phasing is uninformative at genomic positions with identical nucleotides over all the homologous chromosomes, i.e. at homozygous sites, haplotypes are usually defined as sequences of alleles at heterozygous sites over a chromosome. By this definition, 2n haplotypes are theoretically possible for a region covering n bi-allelic Single Nucleotide Polymorphisms (SNPs), which is the most abundant form of genomic variation among individuals of the same species (Rafalski, 2002). However, often far fewer haplotypes are actually found in a population. While high-throughput genotyping assays, such as SNP arrays, exist for efficient determination of unphased SNPs, direct determination of haplotypes is much more complicated and usually requires laborious and expensive techniques such as bacterial cloning, allele-specific PCR or chromosome microdissection (Doležel et al., 2014; Michalatos-Beloin et al., 1996; Triplett et al., 2012). However, unphased SNPs provide less knowledge about an individual’s phenotype compared to phased SNPs, as both gene expression and protein function can be affected by the heterozygous variants being in cis or trans with other variants (Tewhey et al., 2011). Besides, haplotypes can be used as multi-allelic markers, offering more statistical power compared to single SNPs for genetic linkage and association studies (Simko et al., 2004). Several computational methods have therefore been proposed to indirectly infer the phasing from available genotype data. These can be divided into three main categories. Methods in the first category, such as Merlin (Abecasis et al., 2002) and TetraOrigin (Zheng et al., 2016), target pedigrees and aim to determine the most likely haplotypes using the segregation of marker alleles, taking into account the genetic distances between the marker loci. These methods can be applied to SNPs that are far enough apart to be informative about linkage, and are especially useful with large pedigrees. Methods in the second category, such as Beagle (Browning and Browning, 2007), SHAPEIT (Delaneau et al., 2012) and Eagle (Loh et al., 2016) target populations with unknown pedigrees and are based on coalescence theory, trying to obtain a set of highly frequent haplotypes in the population compatible with the genotype data. Methods in the third category, such as HapCut (Bansal and Bafna, 2008), HapCompass (Aguiar and Istrail, 2013), HapTree (Berger et al., 2014) and SDhaP (Das and Vikalo, 2015), use sequence read data and target single individuals, exploiting the fact that a sequence read that contains at least two SNPs reveals the phasing of the homologue from which it has originated at the contained SNP sites. The aim of these methods is therefore to assign the reads of a single individual to k groups, corresponding to the homologues of a k-ploid, and to obtain the consensus sequence of the reads within each group to reconstruct the haplotypes. All of these approaches have limitations in terms of the ploidy level (k) and the required marker density. For the methods in the third category, sequencing depth and read length are also limiting factors. As an example, Merlin, Beagle, SHAPEIT, Eagle and HapCut can only phase diploids (k = 2) and the TetraOrigin algorithm is only applicable to bi-parental tetraploid populations (k = 4) for which a linkage map is available. Also, HapCompass, HapTree and SDhaP can fail to reconstruct haplotypes with high quality at low sequence depths or at ploidy levels higher than k = 4 (Motazedi et al., 2017). In case parent-progeny relations exist in a sequenced population, it is possible to improve the quality of haplotype estimation by combining the information used in the first and the third categories under a unified scheme. With sequencing experiments becoming cheaper and more efficient, such an approach is of high practical importance as often whole populations are sequenced rather than only genotyped at specific marker loci. An implementation of this unifying framework, called PedMEC, has recently been reported by Garg et al. (2016) for diploid trios, i.e. families with two parents and one offspring. Specifically, PedMEC extends the partial-phasing of sequence reads using their overlaps while penalising meiotic recombination events in each trio. However, the exact dynamic programming approach of Garg et al. (2016) rapidly becomes intractable for polyploids, i.e. with k > 2, as its complexity increases exponentially with an increase in the ploidy level (Section 2). Here, we present a greedy algorithm, TriPoly, for phasing a set of SNPs connected by the sequencing reads in parent-offspring trios. Starting at the SNP site with the smallest genomic coordinate, TriPoly extends the phasing one SNP at a time, keeping only the most likely extended phasings to be worked out in the subsequent extension step. In determining the likelihood of each extension, TriPoly considers its compatibility with the sequence reads, as well as the number of recombination events observed by comparing the parental extensions with that of the offspring. Using quantitative measures, we investigated the quality of haplotype estimates obtained by TriPoly in parent-offspring trios simulated under realistic assumptions with tetraploid × diploid and tetraploid × tetraploid parents. By comparing our results with those obtained using single individual haplotyping methods, we show that TriPoly yields substantially better estimates for the haplotypes of the progeny, especially at low sequencing depths. Finally, we apply TriPoly to phase SNPs on chromosome 5 for a family of tetraploid potato with 2 parents and 37 offspring, sequenced with an RNA capture approach by paired-end Illumina HiSeq-2000 technology. We show that TriPoly phasings differ from those obtained by the other methods mainly in regions with imperfect sequencing or mapping difficulties, as TriPoly does not rely solely on sequence reads and aims to avoid phasings that are not likely to have been passed from the parents to the offspring. 2 Materials and methods 2.1 A Bayesian approach to obtaining phasing probabilities from sequence reads In order to establish a probabilistic model for haplotypes, with the sequence reads as data and the base call error and recombination rates as parameters, we must first determine which reads are informative about the phasing. Informative reads need to cover at least two variants, e.g. SNP sites which are heterozygous for at least one of the trio members (m, f, c), corresponding to mother, father and the offspring (child). As sites that are homozygous in all trio members retain no phasing information, we discard them from the sequence reads and keep only the base-calls corresponding to the variation positions. Therefore, in the first step, the SNP sites, s = 1, 2,…, l, are detected over a genomic region and the genotypes Gs=(Gsm,Gsf,Gsc) are estimated at these sites, using efficient algorithms such as FreeBayes (Garrison and Marth, 2012). The raw reads of each trio member are then replaced by the so-called SNP fragments of length l (Fig. 1) that each correspond to a read and contain the numerically coded alleles, i.e. 0, 1, 2 or 3 representing the reference and alternative nucleotides, at the SNP sites covered by that read and ‘–’ at positions not called or not covered. To reduce sequencing noise, the positions at which the base-calling quality is lower than a desired threshold can be set to ‘–’ as well. Hereafter, by using the term sequence read, r, we refer to SNP fragments that contain at least two determined positions. Fig. 1. View largeDownload slide A set of SNP fragments aligned to a reference and the homologues, h1 and h2, from which the fragments originated. Fragments that have identical variants, specified by 0 (reference) and 1 (alternative), at their overlapping sites are assigned to the same homologue Fig. 1. View largeDownload slide A set of SNP fragments aligned to a reference and the homologues, h1 and h2, from which the fragments originated. Fragments that have identical variants, specified by 0 (reference) and 1 (alternative), at their overlapping sites are assigned to the same homologue In the next step, one should assign the reads to k compatible sets in which all of the reads have the same allele at their overlaps, and obtain the consensus sequence of each set as the phasing. As shown in Figure 1, this process is straightforward for diploids in the absence of sequencing errors. With sequencing errors, however, such an assignment of reads to homologues will be possible only if mismatches are allowed. However, allowing mismatches at sites with no error can lead to incorrect haplotype estimates. Polyploidy results in further complexity, as there may be more than one way to assign the reads to k > 2 sets even when no error is present. This can happen for instance when several haplotypes are identical in a phasing solution, e.g. in a 3 SNP tetraploid phasing consisting of 4 homologues: (100100100011) in which three identical (100) haplotypes are present. In this example, the reads will be compatible with any phasing as long as it contains both (100) and (011) haplotypes regardless of their dosages, for example (100100011011) ⁠. Therefore, probabilistic models must take the uncertainty caused by the presence of several phasing possibilities and sequencing errors into account. To build the probabilistic model, we assume an independent binomial error model at each SNP site (Berger et al., 2014) and assign an error vector, ϵ⃗r ⁠, of length l to each read containing the probability of erroneous base-calling at the SNP sites in that read. Using these error rates, the probabilities of possible maternal, paternal and offspring phasings in a trio, represented by Hm, Hf and Hc, respectively, can be derived from the set of sequence reads associated with the trio, R (consisting of maternal read Rm, paternal reads Rf and offspring reads Rc). In addition to the reads, we consider meiotic recombination probabilities, θs, between SNP s–1 and SNP s, represented by vector θ⃗ for all s > 1 to adjust the probability assigned to each phasing using Mendelian inheritance rules as follows: P(Hm,Hf,Hc|R,ϵ,θ⃗)=P(Hm|Rm,ϵm)×P(Hf|Rf,ϵf)·P(Hc|Rc,Hm,Hf,ϵc,θ⃗)R=Rm∪Rf∪Rcϵ=ϵm∪ϵf∪ϵc (1) where ϵm, ϵf and ϵc are sets of error vectors associated with Rm, Rf and Rc, respectively. Assuming exchangeability of the offspring, it is straightforward to generalize Equation (1) to include n offspring as: P(Hm,Hf,Hc1,…,Hcn|R,ϵ,θ⃗)=P(Hm|Rm,ϵm)P(Hf|Rf,ϵf)∏i=1nP(Hci|Rci,Hm,Hf,ϵci,θ⃗)R=∪i=1nRci∪Rm∪Rfϵ=∪i=1nϵci∪ϵm∪ϵf (2) By calculating the righthand side of Equation (2), one can determine the likelihood of each possible phasing for a trio conditional on its sequence reads. However, as it is instead more convenient to calculate the probability of observing the reads conditional on a phasing (Berger et al., 2014), we obtain each element of this equation using Bayes’ formula according to: P(Hp|Rp,ϵp)=P(Rp|Hp,ϵp)P(Hp)∑Hp′P(Rp|Hp′,ϵp)P(Hp′), p∈{m,f}P(Hci|Rci,ϵci,Hm,Hf,θ⃗)=P(Rci|Hci,ϵci)P(Hci|Hm,Hf,θ⃗)∑Hci′P(Rci|Hci′,ϵci)P(Hci′|Hm,Hf,θ⃗) (3) where P(Hp) is the prior probability of the parental phasing Hp and P(Hci|Hm,Hf,θ⃗) is the prior probability of the phasing Hci for offspring ci conditional on the parental phasings (Hm, Hf) and the recombination probability θ⃗ (see Supplementary Material A for the calculation of the read likelihoods and priors). 2.2 The TriPoly method Following the Bayesian approach explained in Section 2.1, one has to calculate the likelihood of the reads conditional on every phasing possible. The computational cost of this brute-force approach, calculated in Supplementary Material E, grows linearly with the sequencing depth but exponentially with the number of SNPs, l, rapidly rendering the solution intractable. To overcome this problem, we perform SNP-by-SNP reconstruction of haplotypes, starting from the leftmost SNP in the target region and keeping only a few most likely phasing extensions to the next SNP at each step [Fig. 2, Supplementary Material A: Equations (1–5)]. Following this approach, one will end up with a limited number of phasings that have passed the selection criteria during the extension procedure from s = 1 to s = l. Assuming the selection procedure effectively keeps the number of accepted solutions at each extension step bounded above by Em and Ef for the mother and the father, respectively, the number of trio phasings at each extension will be bounded above by (kmkm2)(kfkf2)EmEf and the total complexity will be O(lkmaxΩmax(kmkm2)(kfkf2)EmEf) ⁠, with kmax and Ωmax denoting the maximum ploidy level and the maximum sequencing coverage in the trio, respectively. This greedy method is therefore linear in terms of the number of SNPs, l. With parental ploidy levels, kp (⁠ p∈{m,f} ⁠), in the range of 2 to 12 (covering most of the naturally occurring cases of polyploidy), (kpkp2)<kp2.75 ⁠. Therefore, the computational complexity grows at a rate of kmax6.5 with the ploidy level. Fig. 2. View largeDownload slide Overview of the SNP-by-SNP haplotyping method implemented in TriPoly for a trio consisting of two parents and one offspring, over a region containing l SNPs Fig. 2. View largeDownload slide Overview of the SNP-by-SNP haplotyping method implemented in TriPoly for a trio consisting of two parents and one offspring, over a region containing l SNPs To implement this greedy method, which we call TriPoly, we employ branching and pruning steps similar to those in the HapTree algorithm (Berger et al., 2014; Supplementary Fig. S1). Starting at SNP site s = 1, the k alleles of each parent and the offspring are used as the base parental and offspring phasings, Hbp and Hbc. The base phasings are then extended step by step from SNP s–1 to SNP s for s≥2 ⁠, until all of the SNPs have been phased as outlined in Supplementary Material: Algorithm S1. At each extension step, branching and pruning (Supplementary Material: Procedure S3 and Supplementary Material: Procedure S4) allow the algorithm to work with a limited number of possible phasings. This approach can be easily extended to include several offspring at the same time using Equation (2), a detailed description of which is given in Supplementary Material A. Note that this approach assumes working on the so-called phasing blocks, i.e. genomic regions in which each SNP, s, is connected to at least one other SNP, s′ ⁠, through at least one of the reads in R. In case the sequencing reads do not satisfy this condition for the whole set of SNPs in the region, it is straightforward to divide the SNP set into blocks prior to the phasing and phase each block separately, with the phasing being interrupted between the blocks. 3 Experimental setup 3.1 Simulation of polyploid trios Before evaluating the performance of TriPoly through realistic simulations, we tested it on directly simulated SNP fragments to determine the upper bounds of its accuracy and to clearly show factors that influence its estimation quality. The advantage of this approach is bypassing of the intermediate base-calling, read alignment and variant calling steps that occur in reality and each add an undetermined amount of noise to the haplotyping process. Therefore, the direct simulation of SNP fragments lets us measure the accuracy in ideal situations and focus on the effects of sequencing depth, SNP fragment length and actual error rate in the SNP fragments. For this purpose, we simulated parental haplotypes corresponding to 1 kb regions according to the potato heterozygosity model (Motazedi et al., 2017), and randomly selected half of the haplotypes of each parent to simulate the offspring. The lengths of the SNP fragments, i.e. the number of SNPs contained in each fragment, was randomly chosen from uniform distributions within the ranges [2, 3] and [2, l] (l being the total number of SNPs), resulting in average fragment lengths of 2.5 and l2+1 ⁠, respectively. Alleles at randomly selected SNP sites on a haplotype were included in each fragment and sufficient SNP fragments were generated to assure the specified per homologue sequencing depths, 5-5-2, 5-5-5, 15-15-2 and 15-15-5 (maternal-paternal-offspring). Considering error rates of 0 (no error), 2% and 10%, SNP alleles were flipped by chance to introduce errors in the fragments. For each scenario, 100 trios were simulated and phased by TriPoly. In the next step, we evaluated the performance of TriPoly, as well as three state-of-the-art single individual haplotyping algorithms: HapCompass, SDhaP and HapTree, by simulating realistic genomes and sequence data and following the read alignment and variant calling steps to obtain the SNP fragments for haplotype estimation. To this end, maternal and paternal genomes were independently simulated from a common reference using Haplogenerator (Motazedi et al., 2017), and offspring genomes were generated by passing recombinant parental chromatids at random considering a Poisson stochastic model for meiosis (see Supplementary Material B for the details). In our simulations, we set the recombination rate λ [Supplementary Material: Equation (7)] to 3.07 cM/Mb ⁠, corresponding to the average recombination rate in potato (Bourke et al., 2015; Felcher et al., 2012). Using this approach, genomic regions of length 10 kb were simulated for 100 independent trios of tetraploid (km=kf=kc=4) potato (Solanum tuberosum, 2n=4x=48 ⁠), based on 100 regions randomly selected from PGSC-DM genome, chromosome 5 (release version 4.03) (Genome Sequencing Consortium et al., 2011) using a lognormal model to simulate genomic variation (Motazedi et al., 2017). To fit the lognormal model, the SNP density of each parent was determined from empirical data (Uitdewilligen et al., 2013) as described in (Motazedi et al., 2017), resulting in a mean distance of 21 bp between neighbouring SNPs with a standard deviation (SD) of 27 bp. The proportion of each parental marker type: simplex, duplex, triplex and quadruplex, in the total set of markers was also determined from (Uitdewilligen et al., 2013) to be 0.5, 0.23, 0.14 and 0.13, respectively. We also simulated crosses of tetraploid and diploid banana (Musa acuminata) yielding triploid offspring (kc = 3), with the female parent being the tetraploid (km = 4) and the male parent being the diploid (kf = 2), as the pollen of tetraploid banana is hardly viable (Fortescue and Turner, 2004). In practice, commercial triploid bananas (⁠ 2n=3x=33 ⁠) are produced by such hybridizations, which have high consumer preference as their parthenocarpic fruits lack the large, hard seeds of fertilization-induced fruits of diploid and tetraploid sorts. We used the sequence of chromosome 10 from the reference genome of DH-Pahang, a doubled-haploid M. acuminata (D’Hont et al., 2012), release version 2 (Martin et al., 2016), to simulate banana trios, applying the lognormal model to generate SNPs. To fit the model, we set the average SNP frequency to 1 per 200 bp with a SD of 1194 bp. In the absence of population data like the one used for potato, we chose these compromise values so that we do not get many uninformative reads due to SNP sparsity (Section 2), while the average distance of 1394 bp reported for DH-Pahang SNPs (Droc et al., 2013) lies one SD away from our used average distance and thus could still frequently occur in the simulations. As 1% recombination rate has been reported to correspond to 100 to 400 kb physical distance for banana [except at regions close to the centromere (Pillay et al., 2012, p. 130)], we applied an average recombination rate of 0.04 cM/Mb for the simulation of meiosis. The proportions of parental marker types were set the same as that of potato. For each simulated individual, sequence data were simulated according to Illumina HiSeq-2000 and PacBio CCS technologies, and the read alignment and variant calling steps were performed using conventional tools as explained in Supplementary Material C. 3.2 Application to potato candidate gene sequencing data We used TriPoly, HapCompass, SDhaP and HapTree to estimate the haplotype blocks of chromosome 5 in a mapping population of tetraploid potato consisting of 2 parents and 37 offspring (Supplementary Material D). We used 1417 RNA capture probes for re-sequencing candidate genes by paired-end Illumina HiSeq-2000 technology with a median insert size of 316 bp per sample and a median absolute deviation of 58 bp among the insert sizes of each sample’s reads. The single reads within the paired fragments were 101 bp long. On average, the sequencing coverage for each sample was 58× (SD = 15) on the captured regions. However, the coverage varied markedly with genomic position as expected with an RNA capture approach (Sims et al., 2014), with SDs from the mean over all of the positions ranging from 25.5 to 122 among the samples. The sequence reads were mapped against the PGSC-DM genome (version 4.03) using bwa-mem (Li, 2013), and 9762 SNPs were jointly called for all samples using FreeBayes (Garrison and Marth, 2012). A filtering step removed SNPs whose segregation ratios significantly violated those predicted by Mendelian rules according to Pearson’s χ2 test. In the end, 7994 SNPs were considered for phasing by each haplotyping approach. 3.3 Measures of phasing estimation quality Knowing the true haplotypes in simulations, one can evaluate the performance of haplotyping methods by using measures that directly compare the estimates to the true haplotypes. We used the reconstruction rate (RR) (Geraci, 2010) and the pair-wise phasing accuracy rate (PAR) (Motazedi et al., 2017) to evaluate the accuracy, and the SNP missing rate (SMR) (Motazedi et al., 2017) as well as the number of gaps per SNP (NGPS) to evaluate the completeness and continuity of haplotyping. The first measure, RR, has been defined for diploids as the proportion of correctly phased markers in the phasing estimate of the target region (Geraci, 2010). However, to apply it for polyploids we have to generalize its mathematical formulation as haplotypes are not necessarily complementary in polyploids, making multiple correspondences possible between the original and estimated haplotypes. Let H^={h^1,…,h^k} be the estimated phasing and H={h1,…,hk} be the correct phasing of a region containing l SNPs. We define RR as: RRH^,H=1−min⁡p∈Sk1kl∑i=1kD(hi,h^φ(p,i)) (4) where Sk represents the permutation group on {1,…,k} and φ denotes the group action on {1,…,k} ⁠. In this definition, D(hi,h^φ(p,i)) is the Hamming distance: D(hi,h^φ(p,i))=∑s=1ld(hi,h^φ(p,i),s)d(hi,h^φ(p,i),s)={1his≠h^φ(p,i)s,h^φ(p,i)s≠‘‘-’’0otherwise (5) where h^φ(p,i)s=‘‘-’’ means that SNP s has not been phased in H^ ⁠. As an alternative measure of estimation accuracy, PAR is defined as the proportion of all SNP pairs for which the inferred phasing is correct. It is important to note that PAR takes into account phasings between any two SNPs and is not restricted to pairs of consecutive SNPs. While RR is an overall measure of accuracy based on the Hamming distance between the original haplotype and its estimate, PAR primarily shows the accuracy of long range phasing as it is highly affected by chimeric elongations of the haplotypes during estimation, i.e. the elongation of a homologue by part of another homologue. As the true haplotypes were not known for the empirical dataset (Section 3.2), we used RR and PAR to quantify the similarity between the haplotypes obtained by the various methods. As haplotyping methods sometimes report phasings with high SNP exclusion, which nevertheless can have high RR and PAR, the average proportion of SNPs left out in the phasing estimates of each method (SMR) was calculated, measuring phasing completeness. Besides, in order to learn how fragmented the phasing estimates are for each method, which is not reflected in RR, PAR or SMR, the average number of interruptions, i.e. the number of blocks minus one, in the estimates of each method was calculated and normalized by the number of SNPs, l, as NGPS. Defined in this way, NGPS measures the continuity of phasing. All of the calculations to obtain these quality measures were performed using hapcompare (Motazedi et al., 2017). In order to quantify the differences in haplotyping quality of the methods, we built regression models with the measures of haplotyping quality (RR, PAR, SMR and NGPS) as the dependent variables and the haplotyping method as the factor variable. To take the effect of sequencing depth into account, this was added as covariate to the regression models. We accounted for random variation among the simulated families by including a random intercept in the regression models. 3.4 Computational settings All of the analyses were run using 2.90 GHz Intel Xeon processors. A time-limit of 1500 seconds was set for each haplotyping method during simulations, not to consume too much of the shared computational resources in case estimation became prohibitively difficult (Motazedi et al., 2017). To achieve time-memory efficiency without losing much accuracy, we set the branching threshold of TriPoly, ρ, to 0.2 and its pruning threshold, κ, to 0.94, based on the results of pilot simulations. Besides, we forced TriPoly to keep no more than 11% of all possible phasing extensions at each step in case the pruning had not been able to discard as many with the value chosen for κ. 4 Results To obtain the upper bounds of accuracy for TriPoly and to clearly observe the effects of SNP fragment length, sequencing depth and sequencing errors on its accuracy, we simulated SNP fragments with different lengths and error rates at various depths. We used the reconstruction rate (RR) and the pairwise phasing accuracy rate (PAR) (Section 3.3) to measure the accuracy of TriPoly in these simulations. In order to assess the performance of TriPoly in practical situations and to compare it to HapCompass, HapTree and SDhaP, we next simulated realistic genomes and sequence reads for trios of tetraploid-diploid-triploid banana and tetraploid potato, and estimated the haplotypes by first aligning the reads and calling variants using conventional software. In addition to the measures RR and PAR used with SNP fragment simulations, we used the number of gaps per SNP (NGPS) in each estimate as measure of phasing continuity and the fraction of unphased SNPs, the SNP missing rate (SMR), as measure of phasing completeness in the realistic simulations (both measures were zero in the SNP fragment simulations). The mentioned haplotyping methods were also applied to a mapping population of tetraploid potato with 2 parents and 37 offspring, sequenced with paired-end Illumina HiSeq-2000 technology (Section 3.2). We compared the phasing estimates obtained by TriPoly to those of the other methods using PAR, RR, NGPS and SMR to detect agreements and conflicts. We also investigated which genomic regions are likely to be assigned to different phasings by different methods. 4.1 Performance on simulated data TriPoly yields almost perfect haplotypes in ideal situations and improves the overall quality of phasing in practice Figure 3 shows the performance of TriPoly with the simulated SNP fragments at various coverages and with different error rates and fragment lengths. As seen in Figure 3a and b, the RR and the PAR are very close to 1 with an average of l2+1 SNPs per fragment with low error rates (⁠ ≤2%), hence a high phasing information content in the fragments, indicating the precision of TriPoly method in ideal situations. However, it is also evident from Figure 3c and d that with small fragment lengths (occurring in practice due to limited heterozygosity and restricted read lengths), the precision substantially drops, especially at high error rates, although higher sequencing coverages can compensate for this to some extent. Fig. 3. View largeDownload slide Average RRs and pair-wise PAR obtained by TriPoly using the simulated SNP fragments for 100 trios with 0, 2% and 10% error rates at parent1-parent2-offspring coverages: 5-5-2, 5-5-5, 15-15-2, 15-15-5. (a) and (b) show the results with SNP fragment lengths in the range [2, l] (l being the total number of SNPs), while (c) and (d) show the results for short fragment lengths in the range [2, 3] Fig. 3. View largeDownload slide Average RRs and pair-wise PAR obtained by TriPoly using the simulated SNP fragments for 100 trios with 0, 2% and 10% error rates at parent1-parent2-offspring coverages: 5-5-2, 5-5-5, 15-15-2, 15-15-5. (a) and (b) show the results with SNP fragment lengths in the range [2, l] (l being the total number of SNPs), while (c) and (d) show the results for short fragment lengths in the range [2, 3] Through the realistic simulations of genomes and HiSeq-2000 reads, we showed 11% and 24% increases in RR by using TriPoly compared to the other methods for the banana and potato offspring, respectively (Fig. 4, Supplementary Tables S1 and S2). The obtained average increases in accuracy were 15% and 28% for the simulations with PacBio CCS reads, with RR scores of over 95% with these long reads (Supplementary Figs S5 and S6, Supplementary Tables S3 and S4). These observed improvements in the overall phasing show that parental transmission is informative even for phasing between nearby SNPs, in which case the SNPs can be contained within a single read. However, TriPoly did not increase RR for the parents, especially compared to HapTree (Supplementary Figs S2–S6). Fig. 4. View largeDownload slide Average RRs for the progeny in the 100 trios simulated for (a) potato and (b) banana, obtained by HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads Fig. 4. View largeDownload slide Average RRs for the progeny in the 100 trios simulated for (a) potato and (b) banana, obtained by HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads TriPoly markedly increases the accuracy of phasing between distant SNPs for the offspring The realistic simulations with HiSeq-2000 reads showed 33% and 42% higher PARs obtained by TriPoly for banana and potato offspring, respectively, at the same SMRs compared to the other methods (Fig. 5, Supplementary Fig. S4, Supplementary Tables S1 and S2). This increase was more manifest at low sequencing depths (Fig. 5), as the parental transmission information used by TriPoly is especially advantageous when little information is provided by the reads. By penalising recombination events through the considered small recombination probability [Supplementary Material: Equation (6)], TriPoly tends to reduce the chance of chimeric extensions and markedly improves the precision of phasing between distant SNPs in the offspring. However, TriPoly did not increase PAR for the parents (Supplementary Figs S2 and S3). Fig. 5. View largeDownload slide Average pairwise-PARs for the progeny in the 100 trios simulated for (a) potato and (b) banana, obtained by HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads Fig. 5. View largeDownload slide Average pairwise-PARs for the progeny in the 100 trios simulated for (a) potato and (b) banana, obtained by HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads With the simulated PacBio CCS reads, 31% and 45% increases in the average PAR were obtained by TriPoly for the banana and potato trios with the average PAR scores of TriPoly reaching 90% and 94%, respectively (Supplementary Tables S3 and S4, Supplementary Figs S5 and S6). Fewer phasing interruptions are introduced in the haplotype estimates by TriPoly As explained in Section 3.3, in read-based haplotyping the phasing is interrupted between two SNPs if there is no read that connects the two by covering both. With TriPoly, the blocks are determined once for the whole trio and therefore the SNPs can be connected through reads of any member in the trio. In this manner, two SNPs can be phased in an individual even if they are not connected by any reads of that individual as long as the other members in the trio provide the phasing information through their reads. This is especially beneficial for short Illumina reads, as PacBio reads contain more SNPs and hence often yield uninterrupted haplotypes even with single individual methods. The regression analysis of NGPS for HiSeq-2000 simulations showed that the haplotypes obtained by TriPoly were significantly less interrupted compared to the other approaches, notably for banana (Supplementary Tables S1 and S2, Fig. 6). At lower SNP densities, as the average distance between subsequent SNPs will be larger a higher number of reads will be uninformative for phasing (Section 2) and therefore more interruptions can be introduced in the reconstructed haplotypes (Motazedi et al., 2017). TriPoly proves to be especially beneficial in such situations, explaining the notable decrease in NGPS for banana compared to the slight decrease in NGPS for potato. With PacBio reads, the NGPS was (as expected) much lower on average (Supplementary Tables S3 and S4) and even zero with all of the methods at 5-5-5 coverage for potato (Supplementary Fig. S6). However, a substantial improvement was still observed with TriPoly especially for banana trios (Supplementary Table S3, Supplementary Fig. S5). Finally, the high SD of NGPS for HapTree stands out in Figure 6a, which is a reflection of its high failure rate at low sequencing coverages for tetraploid potato (Motazedi et al., 2017). As all of the SNPs belonging to a failed block are excluded from the final phasing, NGPS varies more across the simulated trios due to chance failures. Fig. 6. View largeDownload slide NGPS in the phasing estimates of the progeny from the 100 trios simulated for (a) potato and (b) banana, using HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads Fig. 6. View largeDownload slide NGPS in the phasing estimates of the progeny from the 100 trios simulated for (a) potato and (b) banana, using HapCompass, SDhaP, HapTree and TriPoly at various sequencing depths using HiSeq-2000 reads TriPoly has the smallest memory consumption with similar running times As processing large genomic regions usually requires considerable amounts of CPU time and memory, it is important for a haplotyping algorithm to be efficient in terms of these two resources. Therefore, we measured the computation time and memory consumption of TriPoly for the simulated potato and banana trios at the applied sequencing depths and compared it to those of HapCompass, HapTree and SDhaP. As shown in the Supplementary Figures S7 and S8, TriPoly is the most memory-efficient algorithm compared to the others, while it requires more time compared to HapCompass and SDhaP for potato with HiSeq-2000 read and more time compared to all of the algorithms for both banana and potato with PacBio CCS reads. However, the amount of time required by TriPoly was still close to that needed by the other algorithms. Analysis of candidate gene sequencing data As the true haplotypes were not known for the real dataset, we evaluated the performance of TriPoly by comparing its estimates to those obtained by HapCompass, SDhaP and HapTree. The previously introduced measures, PAR and RR, were used for this purpose by replacing the true phasing in the original definition of each with the estimates of single individual haplotyping methods. This comparison revealed about 82% agreement between the pairwise phasings obtained by TriPoly and HapTree, and an overall similarity of 94% between the results of the two methods. Whilst almost the same agreement was observed between the estimates of HapCompass and TriPoly (PAR = 80%, RR = 93%), the phasings reported by SDhaP were largely different with only 46% similarity of pairwise phasing and an overall similarity score of 87% (Table 1). Table 1. Comparison of the phasings estimated by TriPoly to those estimated by HapCompass, SDhaP and HapTree for a family of tetraploid potato with 2 parents and 37 offspring PAR RR SMR NGPS TriPoly — — 0.017 0.18 HapCompass 0.80 0.93 0.015 0.18 SDhaP 0.45 0.87 0.025 0.18 HapTree 0.82 0.94 0.14 0.19 PAR RR SMR NGPS TriPoly — — 0.017 0.18 HapCompass 0.80 0.93 0.015 0.18 SDhaP 0.45 0.87 0.025 0.18 HapTree 0.82 0.94 0.14 0.19 Table 1. Comparison of the phasings estimated by TriPoly to those estimated by HapCompass, SDhaP and HapTree for a family of tetraploid potato with 2 parents and 37 offspring PAR RR SMR NGPS TriPoly — — 0.017 0.18 HapCompass 0.80 0.93 0.015 0.18 SDhaP 0.45 0.87 0.025 0.18 HapTree 0.82 0.94 0.14 0.19 PAR RR SMR NGPS TriPoly — — 0.017 0.18 HapCompass 0.80 0.93 0.015 0.18 SDhaP 0.45 0.87 0.025 0.18 HapTree 0.82 0.94 0.14 0.19 Considering the SMR, HapTree suffered substantially higher rates (Table 1) as a result of its failure to generate estimates for many blocks due to instability, as reported before in a simulation study (Motazedi et al., 2017). Among the applied methods, HapCompass, HapTree and SDhaP report the phasing the most compatible or the most likely with regard to the reads of each single individual as its phasing estimate. In contrast, TriPoly attempts to find the most likely haplotypes taking parental transmission probabilities into account in addition to the reads (Supplementary Material A). The large agreement between the phasing estimates of TriPoly and HapTree (as well as HapCompass) suggests that TriPoly estimates are satisfactorily compatible with the reads, but are more accurate in presence of noise in regions with high base-calling error or low mapping/variant calling quality, since TriPoly penalises wrong extensions using the phasing information of the parents about the offspring. To test this hypothesis, we considered for each haplotype block the number of reads that were aligned back to its corresponding genomic region, as an indicator of sequencing depth and mapping success (and so indirectly the variant calling quality) of that region. As seen in Figure 7, the methods disagree mostly in regions with poor mapping, perhaps because of genomic divergence from the used reference [which occurs often in plant studies (Golicz et al., 2016)], or in regions with poor capture success, where TriPoly is more reliable due to its taking parental transmission information into account. Fig. 7. View largeDownload slide Agreement in pairwise SNP phasing, measured by the logarithm of PAR, between Tripoly and (a) HapCompass, (b) HapTree and (c) SDhaP at potato haplotype blocks against the number of aligned reads for each block. The grey curve shows the exponential fit to the minimum log PAR score at each alignment count, revealing an exponential increase in phasing agreement with an increase in the number of reads Fig. 7. View largeDownload slide Agreement in pairwise SNP phasing, measured by the logarithm of PAR, between Tripoly and (a) HapCompass, (b) HapTree and (c) SDhaP at potato haplotype blocks against the number of aligned reads for each block. The grey curve shows the exponential fit to the minimum log PAR score at each alignment count, revealing an exponential increase in phasing agreement with an increase in the number of reads We found a positive association between the phasing agreement scores of each block and its number of supporting reads, especially for distant SNPs as the minimum observed PAR increases exponentially with the number of aligned reads, at a rate corresponding at low coverages to 0.01 per 4 reads, i.e. one more read per chromosome for potato (Fig. 7). Although this result is only descriptive, it points out the impact of sequencing depth and successful alignment on haplotype estimation. 5 Conclusion and discussion We propose a novel approach, called TriPoly, for estimating haplotypes in polyploid parent-offspring trios using sequencing data, while taking haplotype transmission from the parents to the progeny into account. TriPoly reconstructs the phasing of the SNPs over a genomic region simultaneously for the parents and for the offspring, starting from the SNP site with the smallest co-ordinate in the region, adding one SNP to the phasing at each step and greedily selecting the most likely extensions for the next extension step conditional on the sequence reads and recombination events. Through idealistic simulations, we show that TriPoly yields almost perfect haplotypes if the reads are long enough and accurate. Through realistic simulations of both short and long-read sequence data, we show that TriPoly significantly improves the haplotyping accuracy for the offspring compared to single individual approaches: HapCompass, SDhaP and HapTree. Besides, we show that TriPoly estimates are more continuous compared to the other methods when the SNP density is low. TriPoly is also an efficient algorithm in terms of the memory consumption and CPU time. We used TriPoly and the other methods to estimate haplotypes in a mapping population of tetraploid potato with 2 parents and 37 offspring sequenced using an RNA capture complexity reduction approach. We argue that in regions with imperfect sequencing or erroneous read mapping, TriPoly is more reliable compared to the other methods since it takes parental transmission probabilities into account to correct misleading read information. In contrast to HapCompass, SDhaP and HapTree, TriPoly provides an option to keep homozygous or missing SNPs in the phasing estimates of individuals. In this way, haplotypes can be compared over the same set of SNPs in an F1-population and segregation patterns can be easily investigated. Moreover, haplotypes reported in this format can be coded as multi-allelic markers to be used in genetic analyses. Besides, TriPoly accepts input in the more convenient format of multi-sample BAM and VCF files, compared to the other methods that either require one-sample BAM/VCF (HapCompass) or the SNP fragment matrix (SDhaP and HapTree). While TriPoly increases the accuracy of phasing for the offspring in a trio by incorporating parental recombination probabilities in the phasing likelihood [Equation (1)], it assumes exchangeability of the offspring in families with more than one progeny [Equation (2)] and therefore ignores the phasing information conveyed by one offspring about the others. By implementing more complex joint likelihood models, we can expect to achieve an enhancement in haplotyping accuracy for larger families. However, the computational burden is definitely a challenge in implementing such an approach. Another potential improvement in TriPoly is the phasing of the parents, the accuracy of which was shown to be inferior to that obtained by HapTree. An iterative approach of keeping a few surviving TriPoly solutions for the whole target region as the starting point for an Expectation Maximization routine can be a way to tackle this problem, resulting in a refined set of most likely haplotypes in the population to which the reads of each individual can be mapped back to find its specific phasing. Like the joint likelihood approach, the computational challenge will be an important consideration here. Acknowledgements The authors thank the graduate school Experimental Plant Sciences (EPS), Wageningen University and Research (WUR) for funding this work, Jeanne Jacobs and Sagar Datir for initiating the development of the tetraploid potato mapping population and Peter Bourke for performing the investigation of segregation patterns in the mapping population. The authors also wish to thank the project initiative ‘Novel genetic and genomic tools for polyploid crops’ (project number BO-26.03-009-004) for the international promotion and support of polyploid research, from which the present study benefitted a great deal. Conflict of Interest: none declared. References Abecasis G.R. et al. ( 2002 ) Merlin rapid analysis of dense genetic maps using sparse gene flow trees . Nat. Genet ., 30 , 97 . Google Scholar Crossref Search ADS PubMed Aguiar D. , Istrail S. ( 2013 ) Haplotype assembly in polyploid genomes and identical by descent shared tracts . Bioinformatics , 29 , i352 – i360 . Google Scholar Crossref Search ADS PubMed Bansal V. , Bafna V. ( 2008 ) HapCUT: an efficient and accurate algorithm for the haplotype assembly problem . Bioinformatics , 24 , i153 – i159 . Google Scholar Crossref Search ADS PubMed Berger E. et al. ( 2014 ) HapTree: a novel Bayesian framework for single individual polyplotyping using NGS data . PLoS Comput. Biol ., 10 , e1003502. Google Scholar Crossref Search ADS PubMed Bourke P.M. et al. ( 2015 ) The double-reduction landscape in tetraploid potato as revealed by a high-density linkage map . Genetics , 201 , 853 – 863 . Google Scholar Crossref Search ADS PubMed Browning S.R. , Browning B.L. ( 2007 ) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering . Am. J. Hum. Genet ., 81 , 1084 – 1097 . Google Scholar Crossref Search ADS PubMed Das S. , Vikalo H. ( 2015 ) SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming . BMC Genomics , 16 , 260. Google Scholar Crossref Search ADS PubMed Delaneau O. et al. ( 2012 ) A linear complexity phasing method for thousands of genomes . Nat. Methods , 9 , 179. Google Scholar Crossref Search ADS D’Hont A. et al. ( 2012 ) The banana (Musa acuminata) genome and the evolution of monocotyledonous plants . Nature , 488 , 213 – 217 . Google Scholar Crossref Search ADS PubMed Doležel J. et al. ( 2014 ) Advances in plant chromosome genomics . Biotechnol. Adv ., 32 , 122 – 136 . Google Scholar Crossref Search ADS PubMed Droc G. et al. ( 2013 ) The banana genome hub. Database, 2013, bat035. Felcher K.J. et al. ( 2012 ) Integration of two diploid potato linkage maps with the potato genome sequence . PLoS One , 7 , e36347. Google Scholar Crossref Search ADS PubMed Fortescue J. , Turner D. ( 2004 ) Pollen fertility in Musa: viability in cultivars grown in Southern Australia . Crop Pasture Sci ., 55 , 1085 – 1091 . Google Scholar Crossref Search ADS Garg S. et al. ( 2016 ) Read-based phasing of related individuals . Bioinformatics , 32 , i234 – i242 . Google Scholar Crossref Search ADS PubMed Garrison E. , Marth G. ( 2012 ) Haplotype-based variant detection from short-read sequencing . arXiv Preprint arXiv , 1207 , 3907 . Geraci F. ( 2010 ) A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem . Bioinformatics , 26 , 2217 – 2225 . Google Scholar Crossref Search ADS PubMed Golicz A.A. et al. ( 2016 ) The pangenome of an agronomically important crop plant Brassica oleracea . Nat. Commun ., 7 , 13390 . Google Scholar Crossref Search ADS PubMed Li H. ( 2013 ) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv: 1303.3997. Loh P.-R. et al. ( 2016 ) Reference-based phasing using the Haplotype Reference Consortium panel . Nat. Genet ., 48 , 1443. Google Scholar Crossref Search ADS PubMed Martin G. et al. ( 2016 ) Improvement of the banana ‘Musa acuminata’ reference sequence using NGS data and semi-automated bioinformatics methods . BMC Genomics , 17 , 243 . Google Scholar Crossref Search ADS PubMed Michalatos-Beloin S. et al. ( 1996 ) Molecular haplotyping of genetic markers 10 kb apart by allele-specific long-range PCR . Nucleic Acids Res ., 24 , 4841 – 4843 . Google Scholar Crossref Search ADS PubMed Motazedi E. et al. ( 2017 ) Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study. Brief. Bioinform., 19 , 387 – 403 . Pillay M. et al. ( 2012 ) Genetics, Genomics, and Breeding of Bananas . Science Publishers , USA. Potato Genome Sequencing Consortium et al. ( 2011 ) Genome sequence and analysis of the tuber crop potato . Nature , 475 , 189 – 195 . Crossref Search ADS PubMed Rafalski J.A. ( 2002 ) Novel genetic mapping tools in plants: sNPs and LD-based approaches . Plant Sci ., 162 , 329 – 333 . Google Scholar Crossref Search ADS Simko I. et al. ( 2004 ) Mapping genes for resistance to Verticillium albo-atrum in tetraploid and diploid potato populations using haplotype association tests and genetic linkage analysis . Mol. Genet. Genomics , 271 , 522 – 531 . Google Scholar Crossref Search ADS PubMed Sims D. et al. ( 2014 ) Sequencing depth and coverage: key considerations in genomic analyses . Nat. Rev. Genet ., 15 , 121 – 132 . Google Scholar Crossref Search ADS PubMed Tewhey R. et al. ( 2011 ) The importance of phase information for human genomics . Nat. Rev. Genet ., 12 , 215 – 223 . Google Scholar Crossref Search ADS PubMed Triplett J.K. et al. ( 2012 ) Five nuclear loci resolve the polyploid history of switchgrass (panicum virgatum l.) and relatives . PLoS One , 7 , e38702. Google Scholar Crossref Search ADS PubMed Uitdewilligen J.G. et al. ( 2013 ) A next-generation sequencing method for genotyping-by-sequencing of highly heterozygous autotetraploid potato . PLoS One , 8 , e62355. Google Scholar Crossref Search ADS PubMed Zheng C. et al. ( 2016 ) Probabilistic multilocus haplotype reconstruction in outcrossing tetraploids . Genetics , 203 , 119 – 131 . Google Scholar Crossref Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal

BioinformaticsOxford University Press

Published: Nov 15, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off