SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors

Rodrigo Goya; Mark G.F. Sun; Ryan D. Morin; Gillian Leung; Gavin Ha; Kimberley C. Wiegand; Janine Senz; Anamaria Crisan; Marco A. Marra; Martin Hirst; David Huntsman; Kevin P. Murphy; Sam Aparicio; Sohrab P. Shah

doi:10.1093/bioinformatics/btq040

SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors

Goya, Rodrigo; Sun, Mark G.F.; Morin, Ryan D.; Leung, Gillian; Ha, Gavin; Wiegand, Kimberley C.; Senz, Janine; Crisan, Anamaria; Marra, Marco A.; Hirst, Martin; Huntsman, David; Murphy, Kevin P.; Aparicio, Sam; Shah, Sohrab P. 2010-02-03 00:00:00 Vol. 26 no. 6 2010, pages 730–736 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btq040 Sequence analysis Advance Access publication February 3, 2010 SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors 1,2 1 2 1 1 Rodrigo Goya , Mark G.F. Sun , Ryan D. Morin , Gillian Leung , Gavin Ha , 3,4 3,4 1 2 Kimberley C. Wiegand , Janine Senz , Anamaria Crisan , Marco A. Marra , 2 3,4 5 1 Martin Hirst , David Huntsman , Kevin P. Murphy , Sam Aparicio and 1,3,4,∗ Sohrab P. Shah Department of Molecular Oncology Breast Cancer Research Program, British Columbia Cancer Research Centre, 2 3 Genome Sciences Centre, British Columbia Cancer Agency, Centre for Translational and Applied Genomics of 4 5 British Columbia Cancer Agency, Provincial Health Services Authority Laboratories and Department of Computer Science, University of British Columbia, Vancouver, BC, Canada Associate Editor: Alex Bateman ABSTRACT 1 INTRODUCTION Motivation: Next-generation sequencing (NGS) has enabled whole 1.1 Single nucleotide variants in cancer genome and transcriptome single nucleotide variant (SNV) discovery Cancer is a disease of genetic alterations. In particular, single in cancer. NGS produces millions of short sequence reads that, once nucleotide variants (SNVs) present as either germline or somatic aligned to a reference genome sequence, can be interpreted for the point mutations are essential drivers of tumorigenesis and cellular presence of SNVs. Although tools exist for SNV discovery from NGS proliferation in many human cancer types. The discovery of data, none are speciﬁcally suited to work with data from tumors, germline mutations established important gene functions in cancer; where altered ploidy and tumor cellularity impact the statistical however, the contribution of single germline alleles to the population expectations of SNV discovery. burden of cancer is relatively low. In contrast, determination of Results: We developed three implementations of a probabilistic tumorigenic mechanisms has focused on somatic mutations. The Binomial mixture model, called SNVMix, designed to infer SNVs somatic mutational landscape of cancer has to date largely been from NGS data from tumors to address this problem. The ﬁrst derived from small-scale or targeted approaches, leading to the models allelic counts as observations and infers SNVs and model discovery of genes affected by somatic mutations in many diverse parameters using an expectation maximization (EM) algorithm and cancer types. More comprehensive studies using Sanger-based is therefore capable of adjusting to deviation of allelic frequencies exon resequencing suggest that the mutational landscape will be inherent in genomically unstable tumor genomes. The second models characterized by relative handfuls of frequently mutated genes and nucleotide and mapping qualities of the reads by probabilistically a long tail of infrequent somatic mutations in many genes (Jones weighting the contribution of a read/nucleotide to the inference of et al., 2008; Stratton et al., 2009). a SNV based on the conﬁdence we have in the base call and the Considering this, unbiased sequencing surveys of tumor read alignment. The third combines ﬁltering out low-quality data in transcriptomes or genomes are expected to reveal mutations in these addition to probabilistic weighting of the qualities. We quantitatively commonly affected cancer genes as well as many novel mutations evaluated these approaches on 16 ovarian cancer RNASeq datasets in genes with no previous implication in cancer. Next-generation with matched genotyping arrays and a human breast cancer genome sequencing (NGS) technology (Shendure and Ji, 2008) has now sequenced to >40× (haploid) coverage with ground truth data and emerged as a practical, high-throughput and low-cost sequencing show systematically that the SNVMix models outperform competing method enabling the full and rapid interrogation of the genomes approaches. and transcriptomes of individual tumors for mutations. As such, Availability: Software and data are available at NGS has presented an unprecedented opportunity for SNV discovery http://compbio.bccrc.ca in cancer. Recent studies involving deeply sequencing the tumor Contact: [email protected] genomes from acute myeloid leukemia patients (Ley et al., 2008; Supplemantary information: Supplementary data are available at Mardis et al., 2009) and a lobular breast cancer patient (Shah et al., Bioinformatics online. 2009b) have revealed numerous novel somatic mutations in genes Received on November 6, 2009; revised on January 14, 2010; that had not been previously reported to harbor abnormalities. In accepted on January 26, 2010 addition, sequencing the transcriptomes of ovarian cancers with RNA-seq (Marioni et al., 2008; Morin et al., 2008; Mortazavi et al., 2008) led to the discovery of a deﬁning mutation in the FOXL2 gene (previously not implicated in cancer) in granulosa cell tumors of the To whom correspondence should be addressed. ovary (Shah et al., 2009a). © The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 730 730–736 Predicting SNVs from next-generation sequencing bb ab AC aattcaggaccca----------------------------- Position i aattcaggacccacacga------------------------ kk aattcaggacccacacgacgggaagacaa------------- -attcaggacaaacacgaagggaagacaagttcatgtacttt ----caggacccacacgacgggtagacaagttcatgtacttt a N k i --------acccacacgacgggtagacaagttcatgtacttt --------acccacacgacgggtagacaagttcatgtacttt Genotype k ----------------gacgggaagacaagttcatgtacttt SNVMix1 model ---------------------------------atgtacttt aattcaggaccaacacgacgggaagacaagttcatgtacttt Position i kk i i µ a z j j Genotype k j {1,...,N } Read SNVMix2 model Fig. 1. (A) Schematic diagram of input data to SNVMix1. We show how allelic counts (bottom) are derived from aligned reads (top). The reference sequence is shown indicated in blue. The arrows indicate positions representing SNVs. The non-reference bases are shown in red. (B) Input data for SNVMix2 that consists of the mapping and base qualities. The darker the background for a read represents a higher quality alignment. The brighter colored nucleotides represent higher quality base calls. Therefore, high contrast nucleotides are more trustworthy than lower contrast nucleotides. (C) SNVMix1 shown as a probabilistic graphical model. Circles represent random variables, and rounded squares represent ﬁxed constants. Shaded notes indicate observed data [the allelic counts and the read depth from (A)]. Unshaded nodes indicate quantities that are inferred during EM. G ∈{aa,ab,bb} represents the genotype, N ∈{0,1,...,} is the i i number of reads and a ∈{0,1,...,N } is the number of reference reads. π is the prior over genotypes and µ is the genotype-speciﬁc Binomial parameter i i k for genotype k.(D) SNVMix2 shown as a probabilistic graphical model. In comparison to SNVMix1, a is unobserved and we expand the input to consider i i i read-speciﬁc information indexed by j where z =1 indicates that read j is correctly aligned, q is the base quality and r is the mapping quality. j j j While these early studies have emerged as proof of principle (Li and Durbin, 2009), SOAP (Li,R. et al., 2008) and that novel SNVs can indeed be discovered using NGS, the study Mosaik (http://bioinformatics.bc.edu/marthlab/Mosaik). We begin of computational methods for their discovery in cancer is under- the discussion by describing two ways of preprocessing aligned data represented in the bioinformatics literature. The analysis of SNVs for input to SNV detection algorithms. The ﬁrst method is shown in from cancer data, where altered ploidy and tumor cellularity impact Figure 1A, where we show an example of aligned data where two the statistical expectations of SNV discovery; and transcriptome SNVs are identiﬁed. The reads are positioned according to their data, where the dynamic range of depth of sequencing is dependent alignment in the genome and the reference genome sequence is on highly variable transcript expression present unique challenges. shown in blue. The ﬁrst step involves transforming the aligned reads In this contribution, we describe a new statistical model for into allelic counts. This method assumes that the reads are correctly identifying SNVs in NGS data generated from cancer genomes and aligned and the nucleotide base calls are correct. Nucleotides that transcriptomes. We demonstrate how its novel features outperform match the reference are shown in black, whereas nucleotides that other available methods. Additionally, we provide a ground truth do not match the reference are shown bolded in red. The ﬁgure dataset (with Sanger validated SNVs) and robust accuracy metrics illustrates how aligned data can be ‘collapsed’ into allelic counts. that will permit future study of computational methods for SNV At each position i in the data, we can count the number of reads a detection in cancer genomes. that match the reference genome and the number of reads b that do not match the reference genome. In the case of rare third alleles, these reads are assumed to be errors. The total number of reads 1.2 NGS data preprocessing for SNV detection overlapping each position (called the depth) is given by N = a +b . i i i In this context, given {a ,b ,N } for all i ∈{1,2,...,T } where T is the The data produced by NGS consists of millions of short reads i i i total number of positions in the genome, the task is to infer which ranging in length from approximately 30–400 nt (although this positions exhibit an SNV. is steadily increasing with ongoing technology development). The second method (Fig. 1B) relaxes the assumption that the Here, we focus explicitly on the problem of inferring SNVs base calls and the alignments are correct and instead considers two once these reads have been aligned to the genome. Numerous types of uncertainty related to determining a and N , namely the methods have been developed for short read alignment including i i uncertainty encoded in the base call q ∈[0,1] which represents the Maq (Li,H. et al., 2008), BowTie (Langmead et al., 2009), ELAND (Illumina), SHRiMP (Rumble et al., 2009), BWA probability that the stated base is correct for read j ∈(1, ...,N ) [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 731 730–736 Aligned reads Reference seq Allelic counts b R.Goya et al. i A B at position i; and r ∈[0,1] representing the probability that read j aligns to its stated position in the genome. Note that although mapping quality is derived in part from base qualities, considering these quantities as independent allows us to encode the fact that base qualities are position speciﬁc, while mapping qualities are constant for all bases in the read. The input data for this method can be visualized as shown in Figure 1B: high mapping quality is shown as dark background and high base quality as bright foreground, high contrast positions indicate positions where the data are more trustworthy. We show in Section 2.4 how to explicitly model these uncertainties to perform soft probabilistic Fig. 2. (A) Theoretical behavior of SNVmix at depths of 2, 3, 5, 10, 15, 20, weighting of the data rather than thresholding the uncertainties to 35, 50 and 100. The plots show how the distribution of marginal probabilities deterministically calculate the allelic counts. We will now describe changes with the number of reference alleles given the model parameters ﬁt to how various authors have approached this problem given {a ,b ,N } i i i a10× breast cancer genome dataset. (B) ROC curves from ﬁtting SNVMix2 i i and optionally, {q ,r }. to synthetic data with increasing levels of certainty in the base call. 1:N 1:N i i changes and other factors. We use the expectation maximization 1.3 Related work (EM) algorithm to ﬁnd a maximum a posteriori (MAP) estimate of A simple way to detect SNV locations would be to compute the the parameters given some training data, allowing the model to adapt fraction f = , and then to call as SNVs those locations where f i i to genomes and transcriptomes that may deviate from the assumed is below some threshold. In the example in Figure 1A, applying distributions for normal genomes and thus model the data more threshold of would successfully discard all columns (including accurately. the two columns which have singleton non-reference reads, which Previous studies have employed stringent thresholding for may be due to base-calling errors), except the two containing the removing poor quality bases and/or reads (Ley et al., 2008; Morin SNVs. A critical ﬂaw with this approach is that it ignores the et al., 2008). We propose that this may throw out informative data, conﬁdence we have in our estimate of f . Intuitively, we can trust and we extend SNVMix1 to explicitly encode base and mapping our estimate more at locations with greater depth (larger N ). This qualities by using them to probabilistically weight the contribution idea has been applied by Morin et al. (2008), wherein read depth of each nucleotide to the posterior probability of a SNV call. In thresholds of N ≥6 and b ≥2 reads supporting the variant allele i i addition, we explore how to optimally combine thresholding and were applied, with an additional requirement that the non-reference probabilistic weighting in order to obtain more accurate results. We allele must be represented by at least 33% of all reads at that show (Section 3) how this extended model, which we call SNVMix2, site. This should eliminate SNVs with weak supporting evidence, confers an increased speciﬁcity in our predictions. but it categorizes the data into two discrete classes—SNV or not, The statistical models we propose in this contribution provide without explicitly providing conﬁdence estimates on the prediction. posterior probabilities on SNV predictions, removing the need for Moreover, in transcriptome data, the number of reads representing a depth thresholds and use an EM learning algorithm to ﬁt the model given transcript expected to be highly variable across all genes and to data removing the need to set model parameters by hand. We thus determining a minimum depth can be difﬁcult. We demonstrate also show how to explicitly model base and mapping qualities, and (Section 3) that applying depth-based thresholds reduces sensitivity explore how quality thresholds can be used in combination with to ﬁnding real SNVs. probabilistic weighting. We show that these attributes of the model To overcome these limitations, we propose a probabilistic result in increased accuracy compared with Maq’s SNV caller and approach based on a Binomial mixture model, called SNVMix1, depth threshold-based methods. We evaluate the model based on which computes posterior probabilities, providing a measure of real data derived from 16 ovarian cancer transcriptomes sequenced conﬁdence on the SNV predictions. The model infers the underlying using NGS, and a lobular breast cancer genome sequenced to >40x genotype at each location. We assume the genotype to be in coverage (Shah et al., 2009b). For all cases, we obtained high- one of three states: aa = homozygous for the reference allele, density genotyping array data for orthogonal comparisons. Finally, ab = heterozygous and bb = homozygous for the non-reference we demonstrate results on 497 positions from the breast cancer allele; the latter two genotypes constituting an SNV. In Figure 2, genome that were subjected to Sanger sequencing and thus constitute we show how the posterior probability of each of these three a ‘ground truth’ dataset for benchmarking. states increases with more depth, which demonstrates the theoretical qualities of our approach. Two other approaches: Maq (Li,H. et al., 2 METHODS 2008) and SOAPSNP (Li,R. et al., 2008) have proposed using Binomial distributions to model genotypes; however, these were 2.1 SNVMix model speciﬁcation developed in the context of sequencing normal genomes, not cancer SNVMix1 is shown as a probabilistic graphical model in Figure 1C. The genomes. Both set parameters for the model assuming expected conditional probability distributions for the model are given in Figure 3 distributions for normal allelic ratios, and apply post-processing and the description of all random variables is listed in Table 1. The input is heuristics to reduce false positives. In our application, we are composed of allelic counts from aligned data and the output of inference is the interested in cancer genomes and transcriptomes, both of which may predicted genotypes. Consider G =k, k ∈{aa,ab,bb}, to be a Multinomial not follow expected distributions due to tumor-normal admixtures random variable representing the genotype at nucleotide position i, where in the sample, within sample tumor heterogeneity, copy number aa is homozygous for the reference allele, ab is heterozygous and bb is [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 732 730–736 Predicting SNVs from next-generation sequencing homozygous for the non-reference allele. At each position, we have an 2.2 Prior distributions observed number of aligned reads N . We let a ∈{0,1} represent whether We assume that π is distributed according to a Dirichlet distribution or not read j ∈{1,...,N } matches the reference at position i. We let a (no j i i parameterized by δ, the so-called pseudocounts. We set δ to be skewed toward index) be the total number of reads that match the reference at i. We assume π assuming that most positions will be homozygous for the reference aa the following likelihood model for the data: allele. µ is conjugately distributed according to a Beta distribution: µ ∼ k k Beta(µ |α ,β ). We set α =1000,β =1; α =500,β =500 and α = k k k aa aa ab ab bb p(a |G =k,N ,µ ) ∼Binom(a |µ ,N ) (1) i i i 1:3 i k i 1,β =1000 assuming that µ should be skewed towards 1, µ should be bb aa ab close to 0.5 and µ should be close to 0. bb where µ is the parameter of a Binomial distribution for genotype k. µ k k models the expectation that for a given genotype k, a randomly sampled allele will be the reference allele. Intuitively, we should expect µ to be aa close to 1, µ to be close to 0.5 and µ to be close to 0. Thus, the key ab bb 2.3 Model ﬁtting and parameter estimation intuition is that for genotype k = aa, the Binomial distribution deﬁned by We ﬁt the model using the EM algorithm. We initialize µ and π(k)to µ should be highly skewed toward the reference allele. Similarly, µ aa bb their prior means. The EM algorithm iterates between the E-step where we would be skewed toward the non-reference allele. For µ , the distribution ab assign the genotypes using Equation 2 and the M-step where we re-estimate would be much more uniform. We impose a prior on the genotypes, G |π ∼ the model parameters. At each iteration, we evaluate the complete data log- Multinomial(G |π,1) where π(k) is the prior probability of genotype k. Given posterior and the algorithm terminates when this quantity no longer increases. knowledge of all the parameters, θ =(µ ,π), we can use Bayes’ rule to infer 1:3 The M-step equations are standard conjugate updating equations: the posterior over genotypes, γ (k) = p(G =k|a ,N ,θ), where: i i i π Binom(a |µ ,N ) k i k i T I(G =k) + δ(k) γ (k) = (2) i i new i=1 π (k) = (3) π Binom(a |µ ,N ) j i j i j=1 I(G = j) + δ(j) j i=1 Our approach to inference involves learning the parameters θ by ﬁtting the model to training data using MAP EM (see below). We demonstrate that this where I(G =k) is an indicator function to signal that G is assigned to state i i produces better results than Maq, which uses ﬁxed parameters (Section 3). k at position i, and: I(G =k) a + α −1 p(π|δ) = Dir(π|δ) new i=1 i µ = (4) T I(G =k) N + α + β −2 k k i=1 i p(G |π) = Multinomial(G |π,1) i i i i p(a |G =k,µ ) = Bern(a |µ ) i k k j j 2.4 Modeling base and mapping qualities i i p(a |G =k,µ ,N ) = Binom(a |µ ,N ) i k i k i i The model shown in Figure 1C assumes that a is observed (it is a shaded node in the graph), and thus assumed correct. However, each nucleotide’s p(µ |α ,β ) = Gam(µ |α ,β ) k k k k k k contribution to the allelic counts has uncertainty associated with it in the i i p(z ) = Bern(z |0.5) form of base and mapping qualities. We propose a soft (or probabilistic) j j weighting scheme, which will down-weight the inﬂuence of low-quality base i i i q if a =1,z =1 ⎨ j j j and mapping calls, but not discard them altogether. To model this, we change i i i i i i i 1 −q if a =0,z =1 a to be an unobserved quantity as shown in Figure 1D, and instead observe p(q |a ,z ) = j j j j j j j i the soft evidence on them in the form of probabilities, which we represent 0.5if z =0 by the observed base qualities q ∈[0,1]. Similarly, we introduce unobserved i i i binary random variables z ∈{0,1} representing whether read j is correctly r if z =1 i i j j p(r |z ) = j j i i aligned, and soft evidence in the form of probabilities which we represent 1 −r if z =0 j j i by the observed mapping qualities r ∈[0,1]. The conditional probability i i i i i distributions for p(q |a ,z ) and p(r |z ) are given in Figure 3. Thus, the input j j j j j 1:T 1:T Fig. 3. Conditional probability distributions of SNVMix model. data is now q ,r and the corresponding likelihood for each location i Table 1. Description of random variables in SNVMix1 and SNVMix2 Parameter Description Value δ Dirichlet prior on π (1000,100,100) π Multinomial distribution over genotypes Estimated by EM (M-step) G Genotype at position i Estimated by EM (E-step) a Indicates whether read j at position i matches the reference Observed in SNVMix1, latent in SNVMix2 z Indicates whether read j aligns to its stated position Latent q Probability that base call is correct Observed (SNVMix2 only) r Probability that alignment is correct Observed (SNVMix2 only) µ Parameter of the Binomial for genotype k Estimated by EM (M-step) α Shape parameter of Beta prior on µ (1000,500,1) β Scale parameter of Beta prior on µ (1,500,1000) [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 733 730–736 R.Goya et al. can be obtained by marginalizing out a,z as follows: 2.7 Accuracy metrics i i p(q ,r |G =k,µ ) (5) i k While comparing with the SNP array data, we deﬁned a true positive (TP) 1:N 1:N i i SNV as an ab or bb CRLMM genotype. A true negative (TN) SNV was i i i i i i i deﬁned as an aa genotype from the SNP array. For the Sanger validated = p(a |G ,µ)p(q |a ,z )p(z |r )p(z ) (6) j j j j j j j positions, a TP was an SNV that was conﬁrmed by Sanger sequencing, a z j=1 whereas a TN was a position that was not conﬁrmed. To evaluate our models against these data, we computed p(SNV ) = γ (ab) + γ (bb) and standard i i i i i i i ∝ 0.5(1 −r ) +r [(1 −q )(1 − µ ) +q µ ] (7) j j j k j k receiver operator characteristic (ROC) curves. The area under the ROC curve j=1 (AUC) was computed as a single numeric metric of accuracy that effectively As before, given the model parameters µ,π, we can infer the genotype at measures the trade-off between sensitivity and speciﬁcity. As an additional 2(precision×recall) each position by modifying Equation (2) as follows: measure, we computed the F-statistic: f = , where precision precision+recall i i π p(q ,r |G =k,µ ) is measured as the proportion of predictions that were true and recall is the k i k 1:N 1:N i i γ (k) = (8) K i i proportion of true SNVs that were predicted. π p(q ,r |G = h,µ ) h i h h=1 1:N 1:N i i The updating equations are unchanged in the M-step of EM. The model- 2.8 Benchmarking experiments ﬁtting algorithm changes only in the E-step by using Equation (8) instead of Equation (2). We have speciﬁed how to encode base and mapping uncertainty To evaluate the effect of estimating parameters, we designed a 4-fold into the model, obviating the need for thresholding these quantities. We call cross-validation study. We permuted the 144 271 positions with matched this version of the model SNVMix2. array-based genotype data from the ovarian cancer data, and divided the In Figure 2, we show the theoretical behavior of this model using simulated positions into four equal parts. We ﬁt the model to three parts (training data with varying base qualities. The model performs equally well for data) using EM and used the converged parameters to calculate p(SNV ) for datasets where the mean certainty of the base calls is ∼80% and higher. This each of the remaining positions (test data). We repeated this 10 times and suggests that thresholding base calls at Phred Q20 [99% certainty (Morin computed the AUC for each of the 16 cases. We also computed AUC from et al., 2008)] or Q30 [99.9% certainty (Ley et al., 2008)] may be overly the results predicted by Maq v0.6.8 and compared the AUC distributions stringent. across the 16 cases to SNVMix1 and SNVMix2. These data also allowed us to determine the range of converged parameter estimates across the folds and 10 replicates. We also tested the effect of depth-based thresholding by 2.5 Implementation and running time running SNVMix1 on the 14 649 positions from the breast cancer genome. The model and inference algorithm is implemented in C and supports To simulate the thresholding, we set p(SNV ) =0 at locations where N was i i both SAMtools (Li et al., 2009) and Maq pileup format. Running EM below some threshold, chosen from the set {0,1,...,7,10}. We compared (SNVMix2) on 14 649 positions for the 40× breast cancer genome took SNVMix1, SNVMix2 and Maq on this data as well. Finally, we evaluated 36 s. Predicting genotypes for the whole 40× genome took 11 min and 38 s. the true positive rate (TPR) and false positive rate (FPR) on the 497 ground (The Maq step cns2snp took 19 min and 9 s.) A script to choose optimal base truth positions from this case for SNVMix1, SNVMix2 and Maq. and mapping quality thresholds, given ground truth [or orthogonal single nucleotide polymorphism (SNP) array] data, is provided in the software package. 3 RESULTS 2.6 Datasets 3.1 Depth heuristics reduce sensitivity We used three datasets to evaluate our models. The ﬁrst (Supplementary We ﬁrst determined the effect that depth thresholding had on 14 649 Dataset 1A and B) consists of 16 ovarian cancer transcriptomes sequenced positions probed using an Affymetrix SNP 6.0 array from genome using the Illumina GA II platform, RNA-Seq paired end protocol. (Note data from the lobular breast cancer by calculating ROC curves and that this data has been generated as part of an ongoing study to proﬁle corresponding AUC values (Section 2) from the output of SNVMix1 ovarian carcinoma subtypes, and the full datasets will be available as part at different cutoff values. The most accurate results were obtained of forthcoming manuscripts. However, all SNV data referenced in this when no depth thresholding was applied. At a threshold of 0, the manuscript is available as Supplementary Materials). For each of these cases, AUC was 0.988 (the highest) and at a threshold of 10 reads, the we obtained Affymetrix SNP 6.0 high-density genotyping arrays from the AUC was 0.614 (the lowest). At a FPR of 0.01, the TPR decreased corresponding DNA. We examined coding positions in the transcriptome data for which there was a corresponding high-conﬁdence (>0.99) genotyping with increasing number of reads required for the threshold without call from the array predicted using the CRLMM algorithm (Lin et al., 2008). exception, suggesting that depth-thresholding under the SNVMix1 This resulted in an average of approximately 9000 positions from each model reduces overall sensitivity without increasing speciﬁcity, and case and a total of 144 271 positions. These data were used in the cross- should therefore be avoided. AUC for thresholds of 1, 3, 5 and 7 validation experiment, described below. The second dataset (Supplementary reads were 0.971, 0.893, 0.782 and 0.707, respectively. Dataset 2A–D) consisted of 497 positions from a lobular breast tumor genome predicted as SNVs using SNVMix1 model from data generated 3.2 Estimating parameters in transcriptome data by using the Illumina GA II platform. These positions were predicted to be model ﬁtting confers better accuracy non-synonymous protein-coding changes and were subsequently sequenced using Sanger capillary-based technology. Of these, 305 were conﬁrmed as Figure 4 shows the AUC distribution over 16 ovarian cancer SNVs and 192 were not conﬁrmed. These 497 positions were considered as transcriptomes (Section 2) for the best and worst cross- the ground truth dataset used for sensitivity and speciﬁcity calculations. In validation runs of SNVMix2 and SNVMix1 as well as the addition, we also generated Affymetrix SNP 6.0 array data for this case and results from the Maq SNV caller with the two recommended considered 14 649 positions (Supplementary Dataset 3A–D) that matched settings of the r parameter (0.001 and 0.02). Both runs of the coding positions and CRLMM prediction criteria outlined above. All SNVMix1 were statistically signiﬁcantly better than the Maq NGS data were aligned to the human genome reference (NCBI build 36.1) runs [analysis of variance (ANOVA) test, P <0.0001], with using Maq’s map tool (v0.6.8). Thus, for all comparisons between Maq and mean AUC of 0.9557 ± 0.0100 and 0.9552 ± 0.0100, compared SNVMix, we used the same baseline set of aligned data. [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 734 730–736 Predicting SNVs from next-generation sequencing with 0.9290 ± 0.0120 and 0.9032 ± 0.0119 for the Maq runs. of 0.9880) and Maq (0.9824 for 40× and 0.9115 for 10×—both Furthermore, SNVMix2 without quality thresholds offers a for the r = 0.001 parameter setting). After ﬁtting the model to the slight performance improvement over SNVMix1. Although the 14 649 positions, we evaluated the performance using 497 candidate improvement of SNVMix2 over SNVMix1 is not statistically mutations originally detected using SNVMix1 at 10×, which were signiﬁcant, it is noteworthy that no thresholds of any kind were validated using Sanger amplicon sequencing (Section 2). These applied to the data and thus probabilistic weighting can eliminate consisted of 305 true SNVs (variants seen in the Sanger traces) the need for arbitrarily thresholding the data (see below). and 192 that could not be conﬁrmed in the Sanger traces. Table 2 shows the sensitivity, precision and F-measure results of SNVMix2, SNVMix1 and SNVMix2 combined with base and mapping quality 3.3 Evaluation of models on a deeply sequenced breast thresholding at both 10× and 40× coverage at a p(SNV) (Section 2) cancer genome with ground truth SNVs threshold determined using a FPR ≤ 0.01. We did not include a We evaluated performance of the models on a lobular breast cancer comparison to Maq at these 497 positions since the results would be sample sequenced to >40× haploid coverage (Shah et al., 2009b). biased toward the SNVMix1 model that led us to identify them in In addition, we compared results obtained from the same genome the ﬁrst place. SNVMix2 and SNVMix1 showed similar F-measure at 10× coverage. We ﬁrst trained the model using 14 649 protein at both 10× and 40× reinforcing that the probabilistic weighting coding positions for which we generated matching Affymetrix confers equal accuracy without having to select arbitrary quality SNP6.0 calls. We computed the AUC for SNVMix1, SNVMix2 thresholds. In addition, both SNVMix2 and SNVMix1 had higher and Maq. Table 2 shows that the highest AUCs were obtained accuracy at 40× and 10× (Table 2). Interestingly, all the models had with SNVMix2 on the 40× genome, followed by SNVMix2 on increased false negative rates in the 40× genome in comparison with 10× genome (AUCs of 0.9929 and 0.9905, respectively). Both the 10× genome. Upon further review of the SNVMix2 positions of these were higher than results achieved for SNVMix1 (AUC predicted at 10×, but not at 40×, we examined that the majority (9 out of 13) were marginally below threshold and signiﬁcant probability mass was indeed on the P(ab) state (>0.99) and would have been predicted with even a slightly less stringent threshold. Three out of the remaining four appear to be the result of DNA copy number ampliﬁcations that are skewing the allelic ratios involved. We elaborate on this point in Section 4. While the SNVMix2 model eliminates the need for thresholding through probabilistic weighting, we explored the effect of applying thresholds to the SNVMix2 input in order to identify a practical balance between pure weighting and thresholding. We compared the results of thresholding mapping qualities at (Q0, Q5, Q10, Q20, Q30, Q40 and Q50) and concomitantly thresholding base qualities at (Q0, Q5, Q10, Q15, Q20 and Q25) and running SNVMix2 on the resulting data from both 10× and 40× genomes. As shown in Table 2, thresholding the mapping qualities at Q50 and base qualities at Q20 (mQ50_bQ20) at 10× performed better than all other 10× runs (F-measure 0.8441). For the 40× data, thresholding the mapping qualities at Q50 and base qualities at Q15 (mQ50_bQ15) performed Maq SNVMix2 SNVMix1 best over all runs (F-measure 0.8658). (See Supplementary Table S1 for all results from runs in increments of Q1 base quality thresholds.) Fig. 4. Distribution of AUC over 16 ovarian cancer transcriptomes This suggests that previously reported base quality thresholds may comparing accuracy of SNV detection for two Maq runs, the best and worst be too stringent. Furthermore, when used with stringent mapping SNVMix1 runs in the cross-validation experiment (middle) and best and quality thresholds, the SNVMix2 model can effectively use the base worst runs for SNVMix2 (mbQ0 = no quality thresholding, MbQ30 = keeping qualities by probabilistic weighting to confer higher accuracy. These only reads with mapping qualities > Q30). SNVMix1 and SNVMix2 runs were statistically more accurate than both Maq runs (ANOVA, P < 0.0001). results indicate that treating mapping and base qualities separately SNVMix2 runs were better than SNVMix1, but not statistically signiﬁcantly. Table 2. Comparison of accuracy of SNVMix1, SNVMix2 and SNVMix combined with base and mapping quality thresholding Model Run Train AUC TP FP TN FN Sens Prec F-measure 10× 0.9880 305 192 0 0 1.0000 0.6137 0.7606 SNVMix1 40× 0.9924 293 107 85 12 0.9607 0.7325 0.8312 10× 0.9905 299 162 30 6 0.9803 0.6486 0.7807 SNVMix2 40× 0.9929 290 107 85 15 0.9508 0.7305 0.8262 mQ50_bQ20 (10×) 0.9882 287 88 104 18 0.9410 0.7653 0.8441 SNVMix2 + thresholding mQ50_bQ15 (40×) 0.9928 287 71 121 18 0.9410 0.8017 0.8658 [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 735 730–736 R.Goya et al. as opposed to taking a minimum over the two (as in Maq) has We noticed that some positions missed by SNVMix1 at 40× were advantages. in what we believe are allele-speciﬁc copy number ampliﬁcations. Future work will involve incorporating copy number data directly into the model to consider such situations where the resultant allelic 4 DISCUSSION bias is expected to mask variants present in the unampliﬁed allele. We have described two statistical models based on Binomial mixture These results will be presented in a forthcoming manuscript. In models to infer SNVs from aligned NGS data obtained from tumors. a similar vein, due to regulatory mutations or epigenetic changes, We demonstrated that a probabilistic approach to modeling allelic transcriptomes can show preferential expression of one allele and counts obviates the need for depth-based thresholding of the data, thus our model will be insensitive to instances of extensively skewed and how ﬁtting the model to real data to estimate parameters allelic expression. Further extensions of the model to consider these is superior to Maq, which uses ﬁxed parameter settings on the factors will be explored. assumption that the data come from a normal human genome. In Finally, we recently demonstrated that intra-tumor heterogeneity addition, we extended the basic Binomial mixture to model mapping can be seen using ultra-deep targeted sequencing (Shah et al., and base qualities by using a probabilistic weighting technique. 2009b). The allelic frequencies of SNVs in rare clones in the This eliminates the need to employ arbitrary thresholds on base and tumor population will likely result in false negative predictions at mapping qualities and instead lets the model determine the strength conventional sequencing depths (i.e. between 20× and 40×), and of contribution of each read to the inference of the genotype. Finally, confound the estimation of the false negative rates of prediction. we showed that even further gains in accuracy can be obtained by Future investigation of all of these problems will be necessary if the combining moderate thresholding and probabilistic weighting of the goal of sequencing studies is to characterize all mutations present base and mapping qualities. Importantly, gains in accuracy by the in the heterogeneous mixture of genomes that make up a tumor. SNVMix models were shown in both transcriptome and genome Funding: The project is supported by the Canadian Institutes for data. Health Research (CIHR). SPS is supported by the Michael Smith Foundation for Health Research (MSFHR) and the Canadian Breast 4.1 Dependence on alignments Cancer Foundation. RG is supported by MSFHR and CIHR. Our results will be highly dependent on the accuracy of alignments Conﬂict of Interest: none declared. as well as the consistency and accuracy of mapping qualities reported by the aligner. Results in Table 2 showed that the combined approach of stringent thresholding on mapping quality and modeling the REFERENCES uncertainty of the remaining reads gave the highest accuracy. Given Jones,S. et al. (2008) Core signaling pathways in human pancreatic cancers revealed that the most gain was obtained in precision, it suggests that false by global genomic analyses. Science, 321, 1801–1806. positive predictions may indeed be reduced with more accurate Langmead,B. et al. (2009) Ultrafast and memory-efﬁcient alignment of short DNA alignments. As read lengths increase with technology development sequences to the human genome. Genome Biol., 10, R25. Ley,T.J.et al. (2008) DNA sequencing of a cytogenetically normal acute myeloid and mapping algorithms improve, we expect that the input to leukaemia genome. Nature, 456, 66–72. SNVMix will be of higher quality, which should yield more accurate Li,H. et al. (2008) Mapping short DNA sequencing reads and calling variants using results. Moreover, alignment using Maq presents a drawback with mapping quality scores. Genome Res., 18, 1851–1858. regard to SNVMix2’s model. When a short read can be aligned Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows- to more than one position in the genome with the same mapping Wheeler transform. Bioinformatics, 25, 1754–1760. Li,H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, quality, this read is dropped, being assigned a mapping quality of 25, 2078–2079. zero. SNVMix2’s design would be able to leverage the read’s quality Li,R. et al. (2008) SOAP: short oligonucleotide alignment program. Bioinformatics, 24, amongst the distinct coordinates and still use the information it 713–714. conveys to predict SNVs. The performance of this will be evaluated Lin,S. et al. (2008) Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays. Genome Biol., 9, R63. in future work. Mardis,E. R.et al. (2009) Recurring mutations found by sequencing an acute myeloid leukemia genome. N. Engl. J. Med., 361, 1058–1066. 4.2 Limitations, extensions and future work Marioni,J.C. et al. (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res., 18, 1509–1517. As stated earlier, a major objective in cancer genome sequencing Morin,R. et al. (2008) Proﬁling the HeLa S3 transcriptome using randomly is to discover somatic mutations. If sequence data from tumor and primed cDNA and massively parallel short-read sequencing. BioTechniques, 45, normal DNA from the same patient is available, candidate somatic 81–94. mutations can be identiﬁed as positions for which p(SNV) is high Mortazavi,A. et al. (2008) Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Methods, 5, 621–628. in the tumor and 1 −p(SNV) is high in the normal data. If only Rumble,S.M. et al. (2009) SHRiMP: accurate mapping of short color-space reads. PLoS tumour data is available, we recommend ﬁltering against dbSNP Comput. Biol., 5, e1000386. and performing targeted validation on the remaining positions in Shah,S.P. et al. (2009a) Mutation of FOXL2 in granulosa-cell tumors of the ovary. New both tumor and normal DNA as described previously (Shah et al., Engl J. Med., 360, 2719–2729. Shah,S.P. et al. (2009b) Mutational evolution in a lobular breast tumor proﬁled at single 2009b). Moreover, the models we have presented assume identically nucleotide resolution. Nature, 461, 809–813. and independently distributed genotypes. As such, the common prior Shendure,J. and Ji,H. (2008) Next-generation DNA sequencing. Nat. Biotechnol., 26, over genotypes π can be indexed by position (i.e. π ) and thus could 1135–1145. encode information about what variants are known for each position Stratton,M.R. et al. (2009) The cancer genome. Nature, 458, 719–724. i in the genome. [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 736 730–736 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Pubmed Central http://www.deepdyve.com/lp/pubmed-central/snvmix-predicting-single-nucleotide-variants-from-next-generation-FWpKM0s3sV

Loading next page...

References (20)

Sohrab Shah, Ryan Morin, J. Khattra, Leah Prentice, Trevor Pugh, A. Burleigh, Allen Delaney, K. Gelmon, Ryan Guliany, J. Senz, C. Steidl, R. Holt, Steven Jones, Mark Sun, G. Leung, Richard Moore, T. Severson, G. Taylor, A. Teschendorff, Kane Tse, G. Turashvili, R. Varhol, R. Warren, P. Watson, Yongjun Zhao, C. Caldas, D. Huntsman, M. Hirst, M. Marra, S. Aparicio (2009)
Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution
Nature, 461
(2008)
Supplemental Material References Open Access
Stephen Rumble, P. Lacroute, Adrian Dalca, M. Fiume, A. Sidow, M. Brudno (2009)
SHRiMP: Accurate Mapping of Short Color-space Reads
PLoS Computational Biology, 5
Ryan Morin, M. Bainbridge, Anthony Fejes, M. Hirst, M. Krzywinski, Trevor Pugh, H. McDonald, R. Varhol, Steven Jones, M. Marra (2008)
Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing.
BioTechniques, 45 1
Ben Langmead, C. Trapnell, Mihai Pop, S. Salzberg (2009)
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
Genome Biology, 10
John Hodgson (2004)
The “cancer”
Catheterization and Cardiovascular Interventions, 61
Li (2008)
Mapping short DNA sequencing reads and calling variants using mapping quality scores
Genome Res., 18
J. Shendure, Hanlee Ji (2008)
Next-generation DNA sequencing
Nature Biotechnology, 26
J. Marioni, C. Mason, S. Mane, M. Stephens, Y. Gilad (2008)
RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays.
Genome research, 18 9
E. Mardis, L. Ding, D. Dooling, D. Larson, M. McLellan, Ken Chen, D. Koboldt, R. Fulton, K. Delehaunty, S. McGrath, L. Fulton, Devin Locke, V. Magrini, R. Abbott, T. Vickery, Jerry Reed, J. Robinson, T. Wylie, Scott Smith, Lynn Carmichael, James Eldred, C. Harris, Jason Walker, Joshua Peck, Feiyu Du, Adam Dukes, Gabriel Sanderson, Anthony Brummett, Eric Clark, Joshua McMichael, R. Meyer, Jonathan Schindler, C. Pohl, J. Wallis, Xiaoqi Shi, Ling Lin, Heather Schmidt, Yuzhu Tang, C. Haipek, Madeline Wiechert, Jolynda Ivy, Joelle Kalicki, Glendoria Elliott, R. Ries, J. Payton, P. Westervelt, M. Tomasson, M. Watson, J. Baty, S. Heath, W. Shannon, R. Nagarajan, D. Link, M. Walter, T. Graubert, J. Dipersio, R. Wilson, T. Ley (2009)
Recurring mutations found by sequencing an acute myeloid leukemia genome.
The New England journal of medicine, 361 11
Sohrab Shah, M. Köbel, J. Senz, Ryan Morin, B. Clarke, Kimberly Wiegand, G. Leung, Abdalnasser Zayed, Erika Mehl, S. Kalloger, Mark Sun, Ryan Giuliany, E. Yorida, Steven Jones, R. Varhol, K. Swenerton, Dianne Miller, P. Clement, Colleen Crane, J. Madore, D. Provencher, P. Leung, A. deFazio, J. Khattra, G. Turashvili, Yongjun Zhao, Thomas Zeng, J. Glover, B. Vanderhyden, Chengquan Zhao, C. Parkinson, M. Jimenez-Linan, D. Bowtell, A. Mes-Masson, J. Brenton, Samuel Aparicio, Niki Boyd, M. Hirst, C. Gilks, M. Marra, D. Huntsman (2009)
Mutation of FOXL2 in granulosa-cell tumors of the ovary.
The New England journal of medicine, 360 26
Heng Li, R. Handsaker, Alec Wysoker, T. Fennell, Jue Ruan, Nils Homer, Gabor Marth, G. Abecasis, R. Durbin (2009)
The Sequence Alignment/Map format and SAMtools
Bioinformatics, 25
Siân Jones, Xiaosong Zhang, D. Parsons, Jimmy Lin, R. Leary, P. Angenendt, P. Mankoo, H. Carter, H. Kamiyama, A. Jimeno, Seung‐Mo Hong, Baojin Fu, Ming-Tseh Lin, E. Calhoun, M. Kamiyama, K. Walter, T. Nikolskaya, Y. Nikolsky, James Hartigan, Douglas Smith, M. Hidalgo, S. Leach, A. Klein, E. Jaffee, M. Goggins, A. Maitra, C. Iacobuzio-Donahue, J. Eshleman, S. Kern, R. Hruban, R. Karchin, N. Papadopoulos, G. Parmigiani, B. Vogelstein, V. Velculescu, K. Kinzler (2008)
Core Signaling Pathways in Human Pancreatic Cancers Revealed by Global Genomic Analyses
Science, 321
A. Mortazavi, B. Williams, Kenneth McCue, Lorian Schaeffer, B. Wold (2008)
Mapping and quantifying mammalian transcriptomes by RNA-Seq
Nature Methods, 5
Natthakan Boongoen, Tossapon Garrett (2004)
BIOINFORMATICS ORIGINAL PAPER
T. Ley, E. Mardis, L. Ding, B. Fulton, M. McLellan, Ken Chen, D. Dooling, Brian Dunford-Shore, S. McGrath, M. Hickenbotham, L. Cook, R. Abbott, D. Larson, D. Koboldt, C. Pohl, Scott Smith, Amy Hawkins, S. Abbott, Devin Locke, L. Hillier, T. Miner, L. Fulton, V. Magrini, T. Wylie, Jarret Glasscock, J. Conyers, Nathan Sander, Xiaoqi Shi, John Osborne, P. Minx, D. Gordon, A. Chinwalla, Yu Zhao, R. Ries, J. Payton, P. Westervelt, M. Tomasson, M. Watson, J. Baty, J. Ivanovich, S. Heath, W. Shannon, R. Nagarajan, M. Walter, D. Link, T. Graubert, J. Dipersio, R. Wilson (2008)
DNA sequencing of a cytogenetically normal acute myeloid leukemia genome
Nature, 456
Shin Lin, B. Carvalho, D. Cutler, D. Arking, A. Chakravarti, R. Irizarry (2008)
Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays
Genome Biology, 9
Li (2009)
Fast and accurate short read alignment with Burrows-Wheeler transform
Bioinformatics, 25
Ruiqiang Li, Yingrui Li, K. Kristiansen, Jun Wang (2008)
SOAP: short oligonucleotide alignment program
Bioinformatics, 24 5
M. Stratton, P. Campbell, Peter Campbell, P. Futreal (2009)
The cancer genome
Nature, 458

Publisher: Pubmed Central
Copyright: © The Author(s) 2010. Published by Oxford University Press.
ISSN: 1367-4803
eISSN: 1367-4811
DOI: 10.1093/bioinformatics/btq040
Publisher site: See Article on Publisher Site

Abstract

Vol. 26 no. 6 2010, pages 730–736 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btq040 Sequence analysis Advance Access publication February 3, 2010 SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors 1,2 1 2 1 1 Rodrigo Goya , Mark G.F. Sun , Ryan D. Morin , Gillian Leung , Gavin Ha , 3,4 3,4 1 2 Kimberley C. Wiegand , Janine Senz , Anamaria Crisan , Marco A. Marra , 2 3,4 5 1 Martin Hirst , David Huntsman , Kevin P. Murphy , Sam Aparicio and 1,3,4,∗ Sohrab P. Shah Department of Molecular Oncology Breast Cancer Research Program, British Columbia Cancer Research Centre, 2 3 Genome Sciences Centre, British Columbia Cancer Agency, Centre for Translational and Applied Genomics of 4 5 British Columbia Cancer Agency, Provincial Health Services Authority Laboratories and Department of Computer Science, University of British Columbia, Vancouver, BC, Canada Associate Editor: Alex Bateman ABSTRACT 1 INTRODUCTION Motivation: Next-generation sequencing (NGS) has enabled whole 1.1 Single nucleotide variants in cancer genome and transcriptome single nucleotide variant (SNV) discovery Cancer is a disease of genetic alterations. In particular, single in cancer. NGS produces millions of short sequence reads that, once nucleotide variants (SNVs) present as either germline or somatic aligned to a reference genome sequence, can be interpreted for the point mutations are essential drivers of tumorigenesis and cellular presence of SNVs. Although tools exist for SNV discovery from NGS proliferation in many human cancer types. The discovery of data, none are speciﬁcally suited to work with data from tumors, germline mutations established important gene functions in cancer; where altered ploidy and tumor cellularity impact the statistical however, the contribution of single germline alleles to the population expectations of SNV discovery. burden of cancer is relatively low. In contrast, determination of Results: We developed three implementations of a probabilistic tumorigenic mechanisms has focused on somatic mutations. The Binomial mixture model, called SNVMix, designed to infer SNVs somatic mutational landscape of cancer has to date largely been from NGS data from tumors to address this problem. The ﬁrst derived from small-scale or targeted approaches, leading to the models allelic counts as observations and infers SNVs and model discovery of genes affected by somatic mutations in many diverse parameters using an expectation maximization (EM) algorithm and cancer types. More comprehensive studies using Sanger-based is therefore capable of adjusting to deviation of allelic frequencies exon resequencing suggest that the mutational landscape will be inherent in genomically unstable tumor genomes. The second models characterized by relative handfuls of frequently mutated genes and nucleotide and mapping qualities of the reads by probabilistically a long tail of infrequent somatic mutations in many genes (Jones weighting the contribution of a read/nucleotide to the inference of et al., 2008; Stratton et al., 2009). a SNV based on the conﬁdence we have in the base call and the Considering this, unbiased sequencing surveys of tumor read alignment. The third combines ﬁltering out low-quality data in transcriptomes or genomes are expected to reveal mutations in these addition to probabilistic weighting of the qualities. We quantitatively commonly affected cancer genes as well as many novel mutations evaluated these approaches on 16 ovarian cancer RNASeq datasets in genes with no previous implication in cancer. Next-generation with matched genotyping arrays and a human breast cancer genome sequencing (NGS) technology (Shendure and Ji, 2008) has now sequenced to >40× (haploid) coverage with ground truth data and emerged as a practical, high-throughput and low-cost sequencing show systematically that the SNVMix models outperform competing method enabling the full and rapid interrogation of the genomes approaches. and transcriptomes of individual tumors for mutations. As such, Availability: Software and data are available at NGS has presented an unprecedented opportunity for SNV discovery http://compbio.bccrc.ca in cancer. Recent studies involving deeply sequencing the tumor Contact: [email protected] genomes from acute myeloid leukemia patients (Ley et al., 2008; Supplemantary information: Supplementary data are available at Mardis et al., 2009) and a lobular breast cancer patient (Shah et al., Bioinformatics online. 2009b) have revealed numerous novel somatic mutations in genes Received on November 6, 2009; revised on January 14, 2010; that had not been previously reported to harbor abnormalities. In accepted on January 26, 2010 addition, sequencing the transcriptomes of ovarian cancers with RNA-seq (Marioni et al., 2008; Morin et al., 2008; Mortazavi et al., 2008) led to the discovery of a deﬁning mutation in the FOXL2 gene (previously not implicated in cancer) in granulosa cell tumors of the To whom correspondence should be addressed. ovary (Shah et al., 2009a). © The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 730 730–736 Predicting SNVs from next-generation sequencing bb ab AC aattcaggaccca----------------------------- Position i aattcaggacccacacga------------------------ kk aattcaggacccacacgacgggaagacaa------------- -attcaggacaaacacgaagggaagacaagttcatgtacttt ----caggacccacacgacgggtagacaagttcatgtacttt a N k i --------acccacacgacgggtagacaagttcatgtacttt --------acccacacgacgggtagacaagttcatgtacttt Genotype k ----------------gacgggaagacaagttcatgtacttt SNVMix1 model ---------------------------------atgtacttt aattcaggaccaacacgacgggaagacaagttcatgtacttt Position i kk i i µ a z j j Genotype k j {1,...,N } Read SNVMix2 model Fig. 1. (A) Schematic diagram of input data to SNVMix1. We show how allelic counts (bottom) are derived from aligned reads (top). The reference sequence is shown indicated in blue. The arrows indicate positions representing SNVs. The non-reference bases are shown in red. (B) Input data for SNVMix2 that consists of the mapping and base qualities. The darker the background for a read represents a higher quality alignment. The brighter colored nucleotides represent higher quality base calls. Therefore, high contrast nucleotides are more trustworthy than lower contrast nucleotides. (C) SNVMix1 shown as a probabilistic graphical model. Circles represent random variables, and rounded squares represent ﬁxed constants. Shaded notes indicate observed data [the allelic counts and the read depth from (A)]. Unshaded nodes indicate quantities that are inferred during EM. G ∈{aa,ab,bb} represents the genotype, N ∈{0,1,...,} is the i i number of reads and a ∈{0,1,...,N } is the number of reference reads. π is the prior over genotypes and µ is the genotype-speciﬁc Binomial parameter i i k for genotype k.(D) SNVMix2 shown as a probabilistic graphical model. In comparison to SNVMix1, a is unobserved and we expand the input to consider i i i read-speciﬁc information indexed by j where z =1 indicates that read j is correctly aligned, q is the base quality and r is the mapping quality. j j j While these early studies have emerged as proof of principle (Li and Durbin, 2009), SOAP (Li,R. et al., 2008) and that novel SNVs can indeed be discovered using NGS, the study Mosaik (http://bioinformatics.bc.edu/marthlab/Mosaik). We begin of computational methods for their discovery in cancer is under- the discussion by describing two ways of preprocessing aligned data represented in the bioinformatics literature. The analysis of SNVs for input to SNV detection algorithms. The ﬁrst method is shown in from cancer data, where altered ploidy and tumor cellularity impact Figure 1A, where we show an example of aligned data where two the statistical expectations of SNV discovery; and transcriptome SNVs are identiﬁed. The reads are positioned according to their data, where the dynamic range of depth of sequencing is dependent alignment in the genome and the reference genome sequence is on highly variable transcript expression present unique challenges. shown in blue. The ﬁrst step involves transforming the aligned reads In this contribution, we describe a new statistical model for into allelic counts. This method assumes that the reads are correctly identifying SNVs in NGS data generated from cancer genomes and aligned and the nucleotide base calls are correct. Nucleotides that transcriptomes. We demonstrate how its novel features outperform match the reference are shown in black, whereas nucleotides that other available methods. Additionally, we provide a ground truth do not match the reference are shown bolded in red. The ﬁgure dataset (with Sanger validated SNVs) and robust accuracy metrics illustrates how aligned data can be ‘collapsed’ into allelic counts. that will permit future study of computational methods for SNV At each position i in the data, we can count the number of reads a detection in cancer genomes. that match the reference genome and the number of reads b that do not match the reference genome. In the case of rare third alleles, these reads are assumed to be errors. The total number of reads 1.2 NGS data preprocessing for SNV detection overlapping each position (called the depth) is given by N = a +b . i i i In this context, given {a ,b ,N } for all i ∈{1,2,...,T } where T is the The data produced by NGS consists of millions of short reads i i i total number of positions in the genome, the task is to infer which ranging in length from approximately 30–400 nt (although this positions exhibit an SNV. is steadily increasing with ongoing technology development). The second method (Fig. 1B) relaxes the assumption that the Here, we focus explicitly on the problem of inferring SNVs base calls and the alignments are correct and instead considers two once these reads have been aligned to the genome. Numerous types of uncertainty related to determining a and N , namely the methods have been developed for short read alignment including i i uncertainty encoded in the base call q ∈[0,1] which represents the Maq (Li,H. et al., 2008), BowTie (Langmead et al., 2009), ELAND (Illumina), SHRiMP (Rumble et al., 2009), BWA probability that the stated base is correct for read j ∈(1, ...,N ) [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 731 730–736 Aligned reads Reference seq Allelic counts b R.Goya et al. i A B at position i; and r ∈[0,1] representing the probability that read j aligns to its stated position in the genome. Note that although mapping quality is derived in part from base qualities, considering these quantities as independent allows us to encode the fact that base qualities are position speciﬁc, while mapping qualities are constant for all bases in the read. The input data for this method can be visualized as shown in Figure 1B: high mapping quality is shown as dark background and high base quality as bright foreground, high contrast positions indicate positions where the data are more trustworthy. We show in Section 2.4 how to explicitly model these uncertainties to perform soft probabilistic Fig. 2. (A) Theoretical behavior of SNVmix at depths of 2, 3, 5, 10, 15, 20, weighting of the data rather than thresholding the uncertainties to 35, 50 and 100. The plots show how the distribution of marginal probabilities deterministically calculate the allelic counts. We will now describe changes with the number of reference alleles given the model parameters ﬁt to how various authors have approached this problem given {a ,b ,N } i i i a10× breast cancer genome dataset. (B) ROC curves from ﬁtting SNVMix2 i i and optionally, {q ,r }. to synthetic data with increasing levels of certainty in the base call. 1:N 1:N i i changes and other factors. We use the expectation maximization 1.3 Related work (EM) algorithm to ﬁnd a maximum a posteriori (MAP) estimate of A simple way to detect SNV locations would be to compute the the parameters given some training data, allowing the model to adapt fraction f = , and then to call as SNVs those locations where f i i to genomes and transcriptomes that may deviate from the assumed is below some threshold. In the example in Figure 1A, applying distributions for normal genomes and thus model the data more threshold of would successfully discard all columns (including accurately. the two columns which have singleton non-reference reads, which Previous studies have employed stringent thresholding for may be due to base-calling errors), except the two containing the removing poor quality bases and/or reads (Ley et al., 2008; Morin SNVs. A critical ﬂaw with this approach is that it ignores the et al., 2008). We propose that this may throw out informative data, conﬁdence we have in our estimate of f . Intuitively, we can trust and we extend SNVMix1 to explicitly encode base and mapping our estimate more at locations with greater depth (larger N ). This qualities by using them to probabilistically weight the contribution idea has been applied by Morin et al. (2008), wherein read depth of each nucleotide to the posterior probability of a SNV call. In thresholds of N ≥6 and b ≥2 reads supporting the variant allele i i addition, we explore how to optimally combine thresholding and were applied, with an additional requirement that the non-reference probabilistic weighting in order to obtain more accurate results. We allele must be represented by at least 33% of all reads at that show (Section 3) how this extended model, which we call SNVMix2, site. This should eliminate SNVs with weak supporting evidence, confers an increased speciﬁcity in our predictions. but it categorizes the data into two discrete classes—SNV or not, The statistical models we propose in this contribution provide without explicitly providing conﬁdence estimates on the prediction. posterior probabilities on SNV predictions, removing the need for Moreover, in transcriptome data, the number of reads representing a depth thresholds and use an EM learning algorithm to ﬁt the model given transcript expected to be highly variable across all genes and to data removing the need to set model parameters by hand. We thus determining a minimum depth can be difﬁcult. We demonstrate also show how to explicitly model base and mapping qualities, and (Section 3) that applying depth-based thresholds reduces sensitivity explore how quality thresholds can be used in combination with to ﬁnding real SNVs. probabilistic weighting. We show that these attributes of the model To overcome these limitations, we propose a probabilistic result in increased accuracy compared with Maq’s SNV caller and approach based on a Binomial mixture model, called SNVMix1, depth threshold-based methods. We evaluate the model based on which computes posterior probabilities, providing a measure of real data derived from 16 ovarian cancer transcriptomes sequenced conﬁdence on the SNV predictions. The model infers the underlying using NGS, and a lobular breast cancer genome sequenced to >40x genotype at each location. We assume the genotype to be in coverage (Shah et al., 2009b). For all cases, we obtained high- one of three states: aa = homozygous for the reference allele, density genotyping array data for orthogonal comparisons. Finally, ab = heterozygous and bb = homozygous for the non-reference we demonstrate results on 497 positions from the breast cancer allele; the latter two genotypes constituting an SNV. In Figure 2, genome that were subjected to Sanger sequencing and thus constitute we show how the posterior probability of each of these three a ‘ground truth’ dataset for benchmarking. states increases with more depth, which demonstrates the theoretical qualities of our approach. Two other approaches: Maq (Li,H. et al., 2 METHODS 2008) and SOAPSNP (Li,R. et al., 2008) have proposed using Binomial distributions to model genotypes; however, these were 2.1 SNVMix model speciﬁcation developed in the context of sequencing normal genomes, not cancer SNVMix1 is shown as a probabilistic graphical model in Figure 1C. The genomes. Both set parameters for the model assuming expected conditional probability distributions for the model are given in Figure 3 distributions for normal allelic ratios, and apply post-processing and the description of all random variables is listed in Table 1. The input is heuristics to reduce false positives. In our application, we are composed of allelic counts from aligned data and the output of inference is the interested in cancer genomes and transcriptomes, both of which may predicted genotypes. Consider G =k, k ∈{aa,ab,bb}, to be a Multinomial not follow expected distributions due to tumor-normal admixtures random variable representing the genotype at nucleotide position i, where in the sample, within sample tumor heterogeneity, copy number aa is homozygous for the reference allele, ab is heterozygous and bb is [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 732 730–736 Predicting SNVs from next-generation sequencing homozygous for the non-reference allele. At each position, we have an 2.2 Prior distributions observed number of aligned reads N . We let a ∈{0,1} represent whether We assume that π is distributed according to a Dirichlet distribution or not read j ∈{1,...,N } matches the reference at position i. We let a (no j i i parameterized by δ, the so-called pseudocounts. We set δ to be skewed toward index) be the total number of reads that match the reference at i. We assume π assuming that most positions will be homozygous for the reference aa the following likelihood model for the data: allele. µ is conjugately distributed according to a Beta distribution: µ ∼ k k Beta(µ |α ,β ). We set α =1000,β =1; α =500,β =500 and α = k k k aa aa ab ab bb p(a |G =k,N ,µ ) ∼Binom(a |µ ,N ) (1) i i i 1:3 i k i 1,β =1000 assuming that µ should be skewed towards 1, µ should be bb aa ab close to 0.5 and µ should be close to 0. bb where µ is the parameter of a Binomial distribution for genotype k. µ k k models the expectation that for a given genotype k, a randomly sampled allele will be the reference allele. Intuitively, we should expect µ to be aa close to 1, µ to be close to 0.5 and µ to be close to 0. Thus, the key ab bb 2.3 Model ﬁtting and parameter estimation intuition is that for genotype k = aa, the Binomial distribution deﬁned by We ﬁt the model using the EM algorithm. We initialize µ and π(k)to µ should be highly skewed toward the reference allele. Similarly, µ aa bb their prior means. The EM algorithm iterates between the E-step where we would be skewed toward the non-reference allele. For µ , the distribution ab assign the genotypes using Equation 2 and the M-step where we re-estimate would be much more uniform. We impose a prior on the genotypes, G |π ∼ the model parameters. At each iteration, we evaluate the complete data log- Multinomial(G |π,1) where π(k) is the prior probability of genotype k. Given posterior and the algorithm terminates when this quantity no longer increases. knowledge of all the parameters, θ =(µ ,π), we can use Bayes’ rule to infer 1:3 The M-step equations are standard conjugate updating equations: the posterior over genotypes, γ (k) = p(G =k|a ,N ,θ), where: i i i π Binom(a |µ ,N ) k i k i T I(G =k) + δ(k) γ (k) = (2) i i new i=1 π (k) = (3) π Binom(a |µ ,N ) j i j i j=1 I(G = j) + δ(j) j i=1 Our approach to inference involves learning the parameters θ by ﬁtting the model to training data using MAP EM (see below). We demonstrate that this where I(G =k) is an indicator function to signal that G is assigned to state i i produces better results than Maq, which uses ﬁxed parameters (Section 3). k at position i, and: I(G =k) a + α −1 p(π|δ) = Dir(π|δ) new i=1 i µ = (4) T I(G =k) N + α + β −2 k k i=1 i p(G |π) = Multinomial(G |π,1) i i i i p(a |G =k,µ ) = Bern(a |µ ) i k k j j 2.4 Modeling base and mapping qualities i i p(a |G =k,µ ,N ) = Binom(a |µ ,N ) i k i k i i The model shown in Figure 1C assumes that a is observed (it is a shaded node in the graph), and thus assumed correct. However, each nucleotide’s p(µ |α ,β ) = Gam(µ |α ,β ) k k k k k k contribution to the allelic counts has uncertainty associated with it in the i i p(z ) = Bern(z |0.5) form of base and mapping qualities. We propose a soft (or probabilistic) j j weighting scheme, which will down-weight the inﬂuence of low-quality base i i i q if a =1,z =1 ⎨ j j j and mapping calls, but not discard them altogether. To model this, we change i i i i i i i 1 −q if a =0,z =1 a to be an unobserved quantity as shown in Figure 1D, and instead observe p(q |a ,z ) = j j j j j j j i the soft evidence on them in the form of probabilities, which we represent 0.5if z =0 by the observed base qualities q ∈[0,1]. Similarly, we introduce unobserved i i i binary random variables z ∈{0,1} representing whether read j is correctly r if z =1 i i j j p(r |z ) = j j i i aligned, and soft evidence in the form of probabilities which we represent 1 −r if z =0 j j i by the observed mapping qualities r ∈[0,1]. The conditional probability i i i i i distributions for p(q |a ,z ) and p(r |z ) are given in Figure 3. Thus, the input j j j j j 1:T 1:T Fig. 3. Conditional probability distributions of SNVMix model. data is now q ,r and the corresponding likelihood for each location i Table 1. Description of random variables in SNVMix1 and SNVMix2 Parameter Description Value δ Dirichlet prior on π (1000,100,100) π Multinomial distribution over genotypes Estimated by EM (M-step) G Genotype at position i Estimated by EM (E-step) a Indicates whether read j at position i matches the reference Observed in SNVMix1, latent in SNVMix2 z Indicates whether read j aligns to its stated position Latent q Probability that base call is correct Observed (SNVMix2 only) r Probability that alignment is correct Observed (SNVMix2 only) µ Parameter of the Binomial for genotype k Estimated by EM (M-step) α Shape parameter of Beta prior on µ (1000,500,1) β Scale parameter of Beta prior on µ (1,500,1000) [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 733 730–736 R.Goya et al. can be obtained by marginalizing out a,z as follows: 2.7 Accuracy metrics i i p(q ,r |G =k,µ ) (5) i k While comparing with the SNP array data, we deﬁned a true positive (TP) 1:N 1:N i i SNV as an ab or bb CRLMM genotype. A true negative (TN) SNV was i i i i i i i deﬁned as an aa genotype from the SNP array. For the Sanger validated = p(a |G ,µ)p(q |a ,z )p(z |r )p(z ) (6) j j j j j j j positions, a TP was an SNV that was conﬁrmed by Sanger sequencing, a z j=1 whereas a TN was a position that was not conﬁrmed. To evaluate our models against these data, we computed p(SNV ) = γ (ab) + γ (bb) and standard i i i i i i i ∝ 0.5(1 −r ) +r [(1 −q )(1 − µ ) +q µ ] (7) j j j k j k receiver operator characteristic (ROC) curves. The area under the ROC curve j=1 (AUC) was computed as a single numeric metric of accuracy that effectively As before, given the model parameters µ,π, we can infer the genotype at measures the trade-off between sensitivity and speciﬁcity. As an additional 2(precision×recall) each position by modifying Equation (2) as follows: measure, we computed the F-statistic: f = , where precision precision+recall i i π p(q ,r |G =k,µ ) is measured as the proportion of predictions that were true and recall is the k i k 1:N 1:N i i γ (k) = (8) K i i proportion of true SNVs that were predicted. π p(q ,r |G = h,µ ) h i h h=1 1:N 1:N i i The updating equations are unchanged in the M-step of EM. The model- 2.8 Benchmarking experiments ﬁtting algorithm changes only in the E-step by using Equation (8) instead of Equation (2). We have speciﬁed how to encode base and mapping uncertainty To evaluate the effect of estimating parameters, we designed a 4-fold into the model, obviating the need for thresholding these quantities. We call cross-validation study. We permuted the 144 271 positions with matched this version of the model SNVMix2. array-based genotype data from the ovarian cancer data, and divided the In Figure 2, we show the theoretical behavior of this model using simulated positions into four equal parts. We ﬁt the model to three parts (training data with varying base qualities. The model performs equally well for data) using EM and used the converged parameters to calculate p(SNV ) for datasets where the mean certainty of the base calls is ∼80% and higher. This each of the remaining positions (test data). We repeated this 10 times and suggests that thresholding base calls at Phred Q20 [99% certainty (Morin computed the AUC for each of the 16 cases. We also computed AUC from et al., 2008)] or Q30 [99.9% certainty (Ley et al., 2008)] may be overly the results predicted by Maq v0.6.8 and compared the AUC distributions stringent. across the 16 cases to SNVMix1 and SNVMix2. These data also allowed us to determine the range of converged parameter estimates across the folds and 10 replicates. We also tested the effect of depth-based thresholding by 2.5 Implementation and running time running SNVMix1 on the 14 649 positions from the breast cancer genome. The model and inference algorithm is implemented in C and supports To simulate the thresholding, we set p(SNV ) =0 at locations where N was i i both SAMtools (Li et al., 2009) and Maq pileup format. Running EM below some threshold, chosen from the set {0,1,...,7,10}. We compared (SNVMix2) on 14 649 positions for the 40× breast cancer genome took SNVMix1, SNVMix2 and Maq on this data as well. Finally, we evaluated 36 s. Predicting genotypes for the whole 40× genome took 11 min and 38 s. the true positive rate (TPR) and false positive rate (FPR) on the 497 ground (The Maq step cns2snp took 19 min and 9 s.) A script to choose optimal base truth positions from this case for SNVMix1, SNVMix2 and Maq. and mapping quality thresholds, given ground truth [or orthogonal single nucleotide polymorphism (SNP) array] data, is provided in the software package. 3 RESULTS 2.6 Datasets 3.1 Depth heuristics reduce sensitivity We used three datasets to evaluate our models. The ﬁrst (Supplementary We ﬁrst determined the effect that depth thresholding had on 14 649 Dataset 1A and B) consists of 16 ovarian cancer transcriptomes sequenced positions probed using an Affymetrix SNP 6.0 array from genome using the Illumina GA II platform, RNA-Seq paired end protocol. (Note data from the lobular breast cancer by calculating ROC curves and that this data has been generated as part of an ongoing study to proﬁle corresponding AUC values (Section 2) from the output of SNVMix1 ovarian carcinoma subtypes, and the full datasets will be available as part at different cutoff values. The most accurate results were obtained of forthcoming manuscripts. However, all SNV data referenced in this when no depth thresholding was applied. At a threshold of 0, the manuscript is available as Supplementary Materials). For each of these cases, AUC was 0.988 (the highest) and at a threshold of 10 reads, the we obtained Affymetrix SNP 6.0 high-density genotyping arrays from the AUC was 0.614 (the lowest). At a FPR of 0.01, the TPR decreased corresponding DNA. We examined coding positions in the transcriptome data for which there was a corresponding high-conﬁdence (>0.99) genotyping with increasing number of reads required for the threshold without call from the array predicted using the CRLMM algorithm (Lin et al., 2008). exception, suggesting that depth-thresholding under the SNVMix1 This resulted in an average of approximately 9000 positions from each model reduces overall sensitivity without increasing speciﬁcity, and case and a total of 144 271 positions. These data were used in the cross- should therefore be avoided. AUC for thresholds of 1, 3, 5 and 7 validation experiment, described below. The second dataset (Supplementary reads were 0.971, 0.893, 0.782 and 0.707, respectively. Dataset 2A–D) consisted of 497 positions from a lobular breast tumor genome predicted as SNVs using SNVMix1 model from data generated 3.2 Estimating parameters in transcriptome data by using the Illumina GA II platform. These positions were predicted to be model ﬁtting confers better accuracy non-synonymous protein-coding changes and were subsequently sequenced using Sanger capillary-based technology. Of these, 305 were conﬁrmed as Figure 4 shows the AUC distribution over 16 ovarian cancer SNVs and 192 were not conﬁrmed. These 497 positions were considered as transcriptomes (Section 2) for the best and worst cross- the ground truth dataset used for sensitivity and speciﬁcity calculations. In validation runs of SNVMix2 and SNVMix1 as well as the addition, we also generated Affymetrix SNP 6.0 array data for this case and results from the Maq SNV caller with the two recommended considered 14 649 positions (Supplementary Dataset 3A–D) that matched settings of the r parameter (0.001 and 0.02). Both runs of the coding positions and CRLMM prediction criteria outlined above. All SNVMix1 were statistically signiﬁcantly better than the Maq NGS data were aligned to the human genome reference (NCBI build 36.1) runs [analysis of variance (ANOVA) test, P <0.0001], with using Maq’s map tool (v0.6.8). Thus, for all comparisons between Maq and mean AUC of 0.9557 ± 0.0100 and 0.9552 ± 0.0100, compared SNVMix, we used the same baseline set of aligned data. [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 734 730–736 Predicting SNVs from next-generation sequencing with 0.9290 ± 0.0120 and 0.9032 ± 0.0119 for the Maq runs. of 0.9880) and Maq (0.9824 for 40× and 0.9115 for 10×—both Furthermore, SNVMix2 without quality thresholds offers a for the r = 0.001 parameter setting). After ﬁtting the model to the slight performance improvement over SNVMix1. Although the 14 649 positions, we evaluated the performance using 497 candidate improvement of SNVMix2 over SNVMix1 is not statistically mutations originally detected using SNVMix1 at 10×, which were signiﬁcant, it is noteworthy that no thresholds of any kind were validated using Sanger amplicon sequencing (Section 2). These applied to the data and thus probabilistic weighting can eliminate consisted of 305 true SNVs (variants seen in the Sanger traces) the need for arbitrarily thresholding the data (see below). and 192 that could not be conﬁrmed in the Sanger traces. Table 2 shows the sensitivity, precision and F-measure results of SNVMix2, SNVMix1 and SNVMix2 combined with base and mapping quality 3.3 Evaluation of models on a deeply sequenced breast thresholding at both 10× and 40× coverage at a p(SNV) (Section 2) cancer genome with ground truth SNVs threshold determined using a FPR ≤ 0.01. We did not include a We evaluated performance of the models on a lobular breast cancer comparison to Maq at these 497 positions since the results would be sample sequenced to >40× haploid coverage (Shah et al., 2009b). biased toward the SNVMix1 model that led us to identify them in In addition, we compared results obtained from the same genome the ﬁrst place. SNVMix2 and SNVMix1 showed similar F-measure at 10× coverage. We ﬁrst trained the model using 14 649 protein at both 10× and 40× reinforcing that the probabilistic weighting coding positions for which we generated matching Affymetrix confers equal accuracy without having to select arbitrary quality SNP6.0 calls. We computed the AUC for SNVMix1, SNVMix2 thresholds. In addition, both SNVMix2 and SNVMix1 had higher and Maq. Table 2 shows that the highest AUCs were obtained accuracy at 40× and 10× (Table 2). Interestingly, all the models had with SNVMix2 on the 40× genome, followed by SNVMix2 on increased false negative rates in the 40× genome in comparison with 10× genome (AUCs of 0.9929 and 0.9905, respectively). Both the 10× genome. Upon further review of the SNVMix2 positions of these were higher than results achieved for SNVMix1 (AUC predicted at 10×, but not at 40×, we examined that the majority (9 out of 13) were marginally below threshold and signiﬁcant probability mass was indeed on the P(ab) state (>0.99) and would have been predicted with even a slightly less stringent threshold. Three out of the remaining four appear to be the result of DNA copy number ampliﬁcations that are skewing the allelic ratios involved. We elaborate on this point in Section 4. While the SNVMix2 model eliminates the need for thresholding through probabilistic weighting, we explored the effect of applying thresholds to the SNVMix2 input in order to identify a practical balance between pure weighting and thresholding. We compared the results of thresholding mapping qualities at (Q0, Q5, Q10, Q20, Q30, Q40 and Q50) and concomitantly thresholding base qualities at (Q0, Q5, Q10, Q15, Q20 and Q25) and running SNVMix2 on the resulting data from both 10× and 40× genomes. As shown in Table 2, thresholding the mapping qualities at Q50 and base qualities at Q20 (mQ50_bQ20) at 10× performed better than all other 10× runs (F-measure 0.8441). For the 40× data, thresholding the mapping qualities at Q50 and base qualities at Q15 (mQ50_bQ15) performed Maq SNVMix2 SNVMix1 best over all runs (F-measure 0.8658). (See Supplementary Table S1 for all results from runs in increments of Q1 base quality thresholds.) Fig. 4. Distribution of AUC over 16 ovarian cancer transcriptomes This suggests that previously reported base quality thresholds may comparing accuracy of SNV detection for two Maq runs, the best and worst be too stringent. Furthermore, when used with stringent mapping SNVMix1 runs in the cross-validation experiment (middle) and best and quality thresholds, the SNVMix2 model can effectively use the base worst runs for SNVMix2 (mbQ0 = no quality thresholding, MbQ30 = keeping qualities by probabilistic weighting to confer higher accuracy. These only reads with mapping qualities > Q30). SNVMix1 and SNVMix2 runs were statistically more accurate than both Maq runs (ANOVA, P < 0.0001). results indicate that treating mapping and base qualities separately SNVMix2 runs were better than SNVMix1, but not statistically signiﬁcantly. Table 2. Comparison of accuracy of SNVMix1, SNVMix2 and SNVMix combined with base and mapping quality thresholding Model Run Train AUC TP FP TN FN Sens Prec F-measure 10× 0.9880 305 192 0 0 1.0000 0.6137 0.7606 SNVMix1 40× 0.9924 293 107 85 12 0.9607 0.7325 0.8312 10× 0.9905 299 162 30 6 0.9803 0.6486 0.7807 SNVMix2 40× 0.9929 290 107 85 15 0.9508 0.7305 0.8262 mQ50_bQ20 (10×) 0.9882 287 88 104 18 0.9410 0.7653 0.8441 SNVMix2 + thresholding mQ50_bQ15 (40×) 0.9928 287 71 121 18 0.9410 0.8017 0.8658 [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 735 730–736 R.Goya et al. as opposed to taking a minimum over the two (as in Maq) has We noticed that some positions missed by SNVMix1 at 40× were advantages. in what we believe are allele-speciﬁc copy number ampliﬁcations. Future work will involve incorporating copy number data directly into the model to consider such situations where the resultant allelic 4 DISCUSSION bias is expected to mask variants present in the unampliﬁed allele. We have described two statistical models based on Binomial mixture These results will be presented in a forthcoming manuscript. In models to infer SNVs from aligned NGS data obtained from tumors. a similar vein, due to regulatory mutations or epigenetic changes, We demonstrated that a probabilistic approach to modeling allelic transcriptomes can show preferential expression of one allele and counts obviates the need for depth-based thresholding of the data, thus our model will be insensitive to instances of extensively skewed and how ﬁtting the model to real data to estimate parameters allelic expression. Further extensions of the model to consider these is superior to Maq, which uses ﬁxed parameter settings on the factors will be explored. assumption that the data come from a normal human genome. In Finally, we recently demonstrated that intra-tumor heterogeneity addition, we extended the basic Binomial mixture to model mapping can be seen using ultra-deep targeted sequencing (Shah et al., and base qualities by using a probabilistic weighting technique. 2009b). The allelic frequencies of SNVs in rare clones in the This eliminates the need to employ arbitrary thresholds on base and tumor population will likely result in false negative predictions at mapping qualities and instead lets the model determine the strength conventional sequencing depths (i.e. between 20× and 40×), and of contribution of each read to the inference of the genotype. Finally, confound the estimation of the false negative rates of prediction. we showed that even further gains in accuracy can be obtained by Future investigation of all of these problems will be necessary if the combining moderate thresholding and probabilistic weighting of the goal of sequencing studies is to characterize all mutations present base and mapping qualities. Importantly, gains in accuracy by the in the heterogeneous mixture of genomes that make up a tumor. SNVMix models were shown in both transcriptome and genome Funding: The project is supported by the Canadian Institutes for data. Health Research (CIHR). SPS is supported by the Michael Smith Foundation for Health Research (MSFHR) and the Canadian Breast 4.1 Dependence on alignments Cancer Foundation. RG is supported by MSFHR and CIHR. Our results will be highly dependent on the accuracy of alignments Conﬂict of Interest: none declared. as well as the consistency and accuracy of mapping qualities reported by the aligner. Results in Table 2 showed that the combined approach of stringent thresholding on mapping quality and modeling the REFERENCES uncertainty of the remaining reads gave the highest accuracy. Given Jones,S. et al. (2008) Core signaling pathways in human pancreatic cancers revealed that the most gain was obtained in precision, it suggests that false by global genomic analyses. Science, 321, 1801–1806. positive predictions may indeed be reduced with more accurate Langmead,B. et al. (2009) Ultrafast and memory-efﬁcient alignment of short DNA alignments. As read lengths increase with technology development sequences to the human genome. Genome Biol., 10, R25. Ley,T.J.et al. (2008) DNA sequencing of a cytogenetically normal acute myeloid and mapping algorithms improve, we expect that the input to leukaemia genome. Nature, 456, 66–72. SNVMix will be of higher quality, which should yield more accurate Li,H. et al. (2008) Mapping short DNA sequencing reads and calling variants using results. Moreover, alignment using Maq presents a drawback with mapping quality scores. Genome Res., 18, 1851–1858. regard to SNVMix2’s model. When a short read can be aligned Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows- to more than one position in the genome with the same mapping Wheeler transform. Bioinformatics, 25, 1754–1760. Li,H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, quality, this read is dropped, being assigned a mapping quality of 25, 2078–2079. zero. SNVMix2’s design would be able to leverage the read’s quality Li,R. et al. (2008) SOAP: short oligonucleotide alignment program. Bioinformatics, 24, amongst the distinct coordinates and still use the information it 713–714. conveys to predict SNVs. The performance of this will be evaluated Lin,S. et al. (2008) Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays. Genome Biol., 9, R63. in future work. Mardis,E. R.et al. (2009) Recurring mutations found by sequencing an acute myeloid leukemia genome. N. Engl. J. Med., 361, 1058–1066. 4.2 Limitations, extensions and future work Marioni,J.C. et al. (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res., 18, 1509–1517. As stated earlier, a major objective in cancer genome sequencing Morin,R. et al. (2008) Proﬁling the HeLa S3 transcriptome using randomly is to discover somatic mutations. If sequence data from tumor and primed cDNA and massively parallel short-read sequencing. BioTechniques, 45, normal DNA from the same patient is available, candidate somatic 81–94. mutations can be identiﬁed as positions for which p(SNV) is high Mortazavi,A. et al. (2008) Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Methods, 5, 621–628. in the tumor and 1 −p(SNV) is high in the normal data. If only Rumble,S.M. et al. (2009) SHRiMP: accurate mapping of short color-space reads. PLoS tumour data is available, we recommend ﬁltering against dbSNP Comput. Biol., 5, e1000386. and performing targeted validation on the remaining positions in Shah,S.P. et al. (2009a) Mutation of FOXL2 in granulosa-cell tumors of the ovary. New both tumor and normal DNA as described previously (Shah et al., Engl J. Med., 360, 2719–2729. Shah,S.P. et al. (2009b) Mutational evolution in a lobular breast tumor proﬁled at single 2009b). Moreover, the models we have presented assume identically nucleotide resolution. Nature, 461, 809–813. and independently distributed genotypes. As such, the common prior Shendure,J. and Ji,H. (2008) Next-generation DNA sequencing. Nat. Biotechnol., 26, over genotypes π can be indexed by position (i.e. π ) and thus could 1135–1145. encode information about what variants are known for each position Stratton,M.R. et al. (2009) The cancer genome. Nature, 458, 719–724. i in the genome. [12:51 3/3/2010 Bioinformatics-btq040.tex] Page: 736 730–736

Journal

Bioinformatics – Pubmed Central

Published: Feb 3, 2010

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors

SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors

SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors

References (20)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies