Somatic Point Mutation Calling in Low Cellularity Tumors

Karin S. Kassahn; Oliver Holmes; Katia Nones; Ann-Marie Patch; David K. Miller; Angelika N. Christ; Ivon Harliwong; Timothy J. Bruxner; Qinying Xu; Matthew Anderson; Scott Wood; Conrad Leonard; Darrin Taylor; Felicity Newell; Sarah Song; Senel Idrisoglu; Craig Nourse; Ehsan Nourbakhsh; Suzanne Manning; Shivangi Wani; Anita Steptoe; Marina Pajic; Mark J. Cowley; Mark Pinese; David K. Chang; Anthony J. Gill; Amber L. Johns; Jianmin Wu; Peter J. Wilson; Lynn Fink; Andrew V. Biankin; Nicola Waddell; Sean M. Grimmond; John V. Pearson

doi:10.1371/journal.pone.0074380

Somatic mutation calling from next-generation sequencing data remains a challenge due to the difficulties of distinguishing true somatic events from artifacts arising from PCR, sequencing errors or mis-mapping. Tumor cellularity or purity, sub-clonality and copy number changes also confound the identification of true somatic events against a background of germline variants. We have developed a heuristic strategy and software (http://www.qcmg.org/bioinformatics/qsnp/) for somatic mutation calling in samples with low tumor content and we show the superior sensitivity and precision of our approach using a previously sequenced cell line, a series of tumor/normal admixtures, and 3,253 putative somatic SNVs verified on an orthogonal platform.

Citation: Kassahn KS, Holmes O, Nones K, Patch A-M, Miller DK, et al. (2013) Somatic Point Mutation Calling in Low Cellularity Tumors. PLoS ONE 8(11): e74380. doi:10.1371/journal.pone.0074380

Editor: I. King Jordan, Georgia Institute of Technology, United States of America

Received April 11, 2013; Accepted July 31, 2013; Published November 8, 2013

Copyright: © 2013 Kassahn et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: No current external funding sources for this study.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected] (SMG); [email protected] (JVP)

PLOS ONE | www.plosone.org | November 2013 | Volume 8 | Issue 11 | e74380

Introduction

The declining cost of next-generation sequencing is enabling an increasing number of tumor sequencing studies [1–3], providing new insights into the mutations driving tumorigenesis. These large-scale efforts are redefining the role of known oncogenes and tumor suppressor genes, identifying new candidate driver genes and providing insights into the mutational mechanisms at play in different tumor types [4,5]. Accurate somatic mutation calling is paramount in these studies.

Despite this growing demand for accurate somatic mutation calls in cancer studies, mutation calling from next-generation sequencing data remains challenging. Early cycle PCR-induced errors, polymerase slippage [6] and the mis-mapping of reads due to homology to multiple genomic regions are some of the most common sources of false positive calls. Inadequate sequence depth in the matched normal sample can also result in germline variants being incorrectly identified as somatic mutations (false positives). Finally, tumor heterogeneity and purity further confound accurate somatic mutation calling, as increased tumor heterogeneity and decreased purity result in lower mutant allele ratios that can make it difficult to distinguish true mutations from background (false negative error). In solid tumors, purity varies widely, with some tumor samples having less than 10% tumor content. Many low purity tumor samples have been excluded from somatic mutation analysis to date due to the analytical challenges associated with accurately calling mutations in these samples and the expected high false negative rate. To keep the sensitivity of the analysis at desired levels, there is a risk of calling an increasing number of false positives.

Several software programs have been developed for variant and somatic mutation calling, including GATK [7], Strelka [8], diBayes (Applied Biosystems BioScope™ software), SomaticSniper [9], VarScan 2 [10] and SNVMix [11]. For cancer genome analysis and to identify somatic events, a tumor sample is compared to its matched normal sample. Current software tools differ in important ways by either performing single or joint sample analysis of the tumor/matched normal sample pair, and by either using Bayesian or heuristic approaches (Table 1). GATK was initially developed in the context of the 1000 Genomes Project [12] to enable variant discovery and genotyping from next-generation sequencing data. GATK performs single sample analysis only. A tumor and matched normal sample pair are thus genotyped independently and somatic events are determined by subtracting calls in the normal from those in the tumor sample. In contrast, Strelka, SomaticSniper and VarScan 2 perform joint sample analysis of a tumor/normal pair and either model tumor as a mixture of normal sample with somatic variation (Strelka), calculate joint diploid genotype likelihoods using the MAQ genotype model (SomaticSniper) or compare read count distributions between the two samples using Fisher's exact test (VarScan 2). Importantly, due to the different statistical models employed, current somatic mutation callers differ in the number of somatic mutation calls and in their overlap. In addition, many somatic mutation callers use a series of post-call filtering steps that further affect the number and type of final mutation calls. Some of these tools also allow analysis of small indels, germline variants and copy number variations (Table 1).

Table 1. Variant calling software tools.
Software | Tumor normal joint analysis / Output germline variants / Indels | Statistical method | Reference
qSNP | X X | empirically determined set of heuristics optimized for sensitivity in low purity tumors | present study
GATK | n/a X | Bayesian model for genotype likelihood, can take into account multiple samples for calibration | [7]
Strelka | X X | Bayesian model of tumor as a mixture of normal sample with somatic variation | [8]
SomaticSniper | X X | Bayesian comparison of genotype likelihoods based on MAQ genotype model | [9]
diBayes | n/a | Bayesian model for presence of non-reference allele (color-space data) | Applied Biosystems BioScope™
VarScan 2 | X X X | heuristics to determine genotypes and Fisher's exact test to examine read count differences, also outputs CNV regions for exome data | [10]
SNVMix | n/a | probabilistic binomial mixture model accounting for tumor ploidy and purity | [11]
doi:10.1371/journal.pone.0074380.t001

There have not been, however, any detailed investigations of the effects of reduced tumor cellularity or purity on the accuracy and sensitivity of somatic mutation calling, although a recent tool, MuTect, has been designed especially with subclonal mutations in mind [13]. A number of factors compromise somatic mutation calling in low purity tumors. First, as sequence coverage and tumor purity decrease, the effects of allele sampling confound the accurate assessment of allele distributions and thus compromise statistical models for determining potential variant sites of interest. Secondly, depending on the statistical models used, low frequency mutations may or may not trigger a variant call, resulting in differences in the number and type of mutations called between different callers. Our interest in pancreatic adenocarcinomas, where over 70% of tumors are of less than 40% purity due to desmoplastic stroma and despite enrichment by histology-guided macrodissection [14], has motivated us to determine optimal strategies for somatic point mutation calling in these tumors. To this end, several mutation calling strategies were tested and extensive verification was performed, in which true positive and false positive mutation calls were inspected to identify common error sources. A heuristics-based single nucleotide variant caller, qSNP, was then implemented using these empirically determined features. Its performance was directly assessed in samples of varying purity that were generated by mixing a tumor cell line and its matched normal sample at varying proportions and sequencing each mixture. The decay in sensitivity as purity decreased in these mixtures was assessed and the performance of our caller was compared to that of two others. Finally, its performance was benchmarked against the COLO-829 cell line, previously sequenced and analyzed by Pleasance et al. [4].

Results

Our somatic mutation calling strategy has been designed to maximize sensitivity in light of low tumor purity. Iterative rounds of verification using benchtop amplicon-based sequencing were performed to develop and refine post-processing checks to control the false discovery rate. The following considerations informed the design of our mutation calling strategy and its software implementation, qSNP.

Joint analysis of the tumor and matched normal sample

qSNP considers sequence data in Binary Sequence Alignment/Map (BAM) format [15] from both tumor and matched normal samples jointly. Classification into germline and somatic calls follows a number of simple rules that were designed to accommodate the expected low mutant allele ratio in low purity tumors (Table 2).

Maximize sensitivity of mutation calling

qSNP currently triggers a variant call if a minimum of 3 reads of the same, non-reference allele are found. We found that this minimum evidence requirement ensures that a variant call is triggered even in regions where Poisson sampling of alleles may have confounded the observed allele distributions. As sequence depth increases, so does the minimum read requirement. At coverage over 20×, a minimum of 4 mutant reads is required; above 50×, at least 5% of reads must be mutant, or at least 2.5% if mutant reads are found on both strands. In addition, the base qualities of the variant reads must be at least 10% of the sum of base qualities at the position, or at least 5% of the sum of base qualities if reads are found on both strands and coverage is over 50×. To determine whether the position is homozygous or heterozygous, the two most common alleles are determined. If both alleles match the evidence criteria above, the position is considered heterozygous, and if not, homozygous.

Post-processing checks to control the false discovery rate

Various factors influence the confidence in a somatic mutation call, including sequence depth in tumor and matched normal, base qualities of alleles, evidence for variant in matched normal sample, number of mutant reads, and mutant allele ratio. The statistical frameworks to encompass all of these factors into a single model and metric are still being developed. Some single-sample SNP callers give a p-value that purely reflects the likelihood for the presence of a non-reference allele.
Furthermore, most mutation calling software is used in combination with a series of post-calling filtering steps to remove likely false positives. This practice means that the original p-values calculated by the mutation caller are overridden by these further checks that ultimately decide whether or not a mutation is considered high confidence. For low purity tumors, Poisson sampling of alleles can confound estimates of their true frequencies, further compromising the calculation of accurate p-values or resulting in positions not exceeding a likelihood threshold.

Table 2. Classification of germline and somatic events.
Normal genotype | Tumor genotype | Details* | Classification
Hom | Het | Variant is reference allele; G/G>A/G | Germline
Hom | Het | Variant novel; A/A>A/G | Somatic
Het | Hom | Tumor allele same; A/G>G/G | Germline
Het | Hom | Tumor allele different; A/G>T/T | Somatic
Hom | Hom | Same; G/G>G/G | Germline
Hom | Hom | Different; A/A>G/G | Somatic
Het | Het | Same; A/G>A/G | Germline
Het | Het | Different; A/G>T/G | Somatic
*All examples assume 'A' as the reference allele, 'G' as the variant, and 'Hom' and 'Het' denote homozygous and heterozygous respectively.
check coverage in normal to exclude under-calling.
could indicate LOH in tumor.
doi:10.1371/journal.pone.0074380.t002

For these reasons we have not made an attempt to estimate a p-value upfront but instead use flags to indicate that a putative somatic mutation call does not meet certain quality criteria or evidence thresholds (Table 3). For example, putative somatic positions are checked for the presence of the variant in the matched normal BAM. If a position has evidence in the normal, the call is annotated as such. Somatic positions are further checked for being a germline variant in another patient, as this can indicate under-sampling of alleles in the matched normal. For this check, we use an in-house database of germline variants, and qSNP can be set up to output high quality germline calls to this database with each iteration of qSNP. Positions that pass all checks are considered to be of highest confidence and we expect these to be true somatic events. They are annotated as PASS in the qSNP output. Positions where the normal sample lacks adequate sequencing coverage are potentially false positive somatic calls and may return germline in verification. These are annotated as COVN12 in qSNP output. All remaining somatic mutations, such as those where there is evidence of the variant also in the normal sample or where only few mutant reads support the variant call, are considered lowest confidence and are expected to include many false positives. These calls are annotated as outlined in Table 3.

Output mutation calls in .vcf and DCC formats

Output in Variant Call Format (VCF) [16] was required as VCF is becoming the standard format for mutation reporting and annotation and allows integration with an ever-expanding set of VCF tools. To enable easy integration with the International Cancer Genome Consortium (ICGC) Data Coordination Centre (DCC), output in DCC format was also implemented.

Fast, easy to run and operating-system independent

Given the continuously increasing throughput of next-generation sequencing platforms, qSNP needed to be efficient in its use of compute resources. To achieve this, qSNP is implemented in Java using the Picard library (version 1.62). qSNP is driven by a single plain-text configuration file in the "Windows INI-file" style and takes as its primary inputs a pair of tumor and normal BAM files that have been duplicate-marked and coordinate-sorted. qSNP implements a fast and flexible read-filtering system and, if filters such as minimum mapping quality or alignment length are specified, qSNP will filter out failing reads prior to analysis. qSNP creates a pileup of bases in tumor and normal to look for evidence of a variant. qSNP has been specifically designed to make use of a compute cluster. It is thus multi-threaded, requiring 5 cores and 20 GB of memory to run efficiently.

Tuning using verification data

To identify common error sources and to refine qSNP, extensive verification of 3,253 putative somatic mutation calls was performed across 65 tumors of 6 to 83% purity (mean 38% purity), including 60 tumors reported in Biankin et al. [14] (Table 4, Table S1). In total, 717 mutations were confirmed as true somatic events, of which 704 had been classified as PASS by qSNP (Table 4). Miscalled somatic mutations were most commonly associated with one of three features: position in regions of sequence homology, support only by non-independent reads or support by low evidence. By designing strategies to eliminate false positives associated with these common error sources, we were able to maintain an accuracy (precision) of 57% at a sensitivity of 98% across these tumors of mean purity of 38% (Table 4). This sensitivity is likely an overestimate of the true sensitivity, as only known, verified mutations called by qSNP at any evidence threshold were chosen for verification; it is possible that there were additional somatic events that were never called. Nevertheless, our strategy is successful in retaining the vast majority of known true positive events (98%), while eliminating false positive calls associated with common error sources.

Sequence homology regions

Regions of sequence homology can cause problems in mapping, and reads may be erroneously mapped to the wrong homologue. This is not always apparent from the mapping quality values, which can remain high, especially if these values reflect pairing quality values that consider the mapping qualities of both reads in a read pair. Nevertheless, these regions can often be identified on the basis of having an excess of putative sequence variants.
To overcome the mapping problems in regions of sequence homology, qSNP has a user-defined BAM filtering option so that only high quality reads will trigger a mutation call. For SOLiD v4 data mapped with Bioscope 2.1, we find the following filters useful:

1) min. 35 bp alignment length or (second of read pair and mapped as a proper pair);
2) min. SM > 15 (single mapping quality);
3) no more than 2 base-space mismatches to the reference;
4) not a PCR duplicate (Picard MarkDuplicates).

For Illumina 100 bp paired-end data mapped with BWA, we use the following filters:

1) min. SM > 10 (single mapping quality);
2) no more than 3 mismatches to the reference;
3) not a PCR duplicate (Picard MarkDuplicates).

These read filters can be specified in the qSNP configuration file using a domain-specific language (DSL). Once a somatic mutation call has been made, the unfiltered non-duplicate pileup from the normal BAM is checked to see if there is any evidence of the variant. These steps help eliminate many of the false positives associated with this common error source.

Table 3. Post-processing checks performed by qSNP.
Annotation | Variant type | Description
PASS | Somatic, Germline | (Passed all post-processing checks) AND (min 5 mutant reads) AND (min 4 novel starts not considering read pair)
COVN12 | Somatic | Less than 12 reads coverage in matched normal sample
COVN8 | Germline | Less than 8 reads coverage in matched normal sample
SAN3 | Germline | Less than 3 reads of same allele in normal
COVT8 | Germline | Less than 8 reads coverage in tumor
SAT3 | Germline | Less than 3 reads of same allele in tumor
GERM | Somatic | Mutation is a germline variant in another patient
MIN | Somatic | Mutation also found in pileup of normal BAM
MIUN | Somatic | Mutation also found in pileup of unfiltered normal BAM
NNS | Somatic, Germline | Less than 4 novel starts not considering read pair
MR | Somatic, Germline | Less than 5 variant reads
MER | Somatic | Mutation same as reference
SBIAS | Somatic | Strand bias (Illumina only)
doi:10.1371/journal.pone.0074380.t003

Non-independent reads

Picard MarkDuplicates (http://picard.sourceforge.net) has become the standard tool for identifying PCR duplicates in next-generation sequencing data. Given that PCR is commonly used to amplify DNA for sequencing, likely PCR duplicates need to be identified so they do not inflate allele counts during mutation calling. To identify duplicates, Picard MarkDuplicates uses the start coordinates and orientations of both reads of a read pair. Within a set of duplicate read pairs, the read pair with the highest base qualities is retained, with the others marked as PCR duplicates. Picard MarkDuplicates does not consider the sequence of the reads, only the alignment start coordinates and orientations.

This strategy of marking PCR duplicates has one drawback. Read pairs where one read maps to a region of sequence homology are sometimes not recognized by Picard as PCR duplicates because these reads often map to different copies of the region of sequence homology, thus disguising the fact that they are indeed all derived from the same PCR molecule. These reads can be easily identified upon visual inspection in the Integrative Genomics Viewer (IGV) [17,18] on the basis of shared start coordinates of one read partner with different chromosome map positions of the other read in the pair (Figure 1). To overcome this challenge, all putative somatic mutation calls are annotated in qSNP with the number of novel read starts not considering the read pair (NNS in the VCF output files). Based on our extensive verification data, we find that a minimum of 4 novel starts using this criterion is a useful lower limit for somatic mutation detection.

Figure 1. Non-independent reads confounding mutation calls. Read pairs are colored by the chromosome map position of the second read in the pair. MarkDuplicates fails to correctly identify these non-independent read pairs as PCR duplicates due to the different map locations of the second read.
doi:10.1371/journal.pone.0074380.g001

Low evidence calls

Finally, mutation calls that are only supported by a few mutant reads are also common false positives. However, as tumor purity decreases, so does the expected mutant allele ratio, making it difficult to distinguish true somatic events from sequencing artifacts. We investigated a number of criteria to improve signal to noise for calls with low evidence. Strand bias proved not to be a useful discriminating feature for SOLiD v4 data, as many true somatic mutations were only supported by reads on one strand. Using results from amplicon-based verification, 363 FP were only on one strand, 171 FP were on both strands, 94 TP were only on one strand, and 610 TP were on both strands. Filtering somatic mutation calls by requiring the mutant allele to be represented by reads on both strands will thus severely impact sensitivity of detection. Mutant allele ratio, i.e. the proportion of mutant reads, also had poor discriminating power, with many true positive calls having very low mutant allele ratios: 112 FP had a mutant allele ratio <10%, 422 FP had a mutant allele ratio >10%, 130 TP had a mutant allele ratio <10% and 574 TP had a mutant allele ratio >10%.

In contrast, there was a strong positive relationship between the likelihood of being a true somatic event and the number of reads with novel starts not considering the read pair supporting the mutation. The higher the number of mutant reads supporting the call, the higher the accuracy. There is a trade-off between sensitivity and accuracy, however. At 5 mutant reads with a minimum of 4 novel starts not considering read pair, we obtain an average accuracy of 57% and sensitivity of 98% (Table 4), which we found useful thresholds for mutation detection and follow-up verification in primary pancreatic adenocarcinomas. By requiring a minimum of 10 mutant reads, our accuracy increases to 94%, but at the cost of reduced sensitivity (53%). These criteria were determined using exome samples that had been sequenced to a depth where 80% of targeted bases had at least 20× coverage (average targeted base coverage of approximately 65×).

Table 4. Details of verification using amplicon-based sequencing on the Ion Torrent.
Verification across 65 primary pancreatic adenocarcinomas with mean tumor purity 38% (range 6 to 83%)
Total verified somatic (TP) | 717
qSNP pass calls: verified somatic | 704
qSNP pass calls: verified germline | 28
qSNP pass calls: verified wild type | 506
Precision TP/(TP+FP) | 57%
Sensitivity TP/(TP+FN) | 98%
doi:10.1371/journal.pone.0074380.t004

Benchmarking variant calling in a controlled mixture experiment

We previously modeled the performance of qSNP in a panel of mixtures where a pancreatic adenocarcinoma cell line and its matched normal were mixed at the following proportions: 0, 10, 20, 40, 60, 80 and 100% cell line DNA [14]. These mixtures were sequenced to an average depth of approximately 65× using the SureSelect exon capture method and SOLiD v4 sequencing. Here, we compare the decay in sensitivity across these mixtures using variant calls from qSNP and GATK (Table 5). All somatic qSNP calls made in the 100% cell line sample were selected for verification by amplicon-based sequencing on the Ion Torrent PGM. The remaining mutation calls were assessed for evidence on an alternate sequencing platform (HiSeq 2000 for calls made on the SOLiD v4 platform and vice versa). A position was considered verified if read depth was at least 20× and the mutation occurred at a frequency of at least 5% with a minimum of 3 variant reads on the alternate sequencing platform. In all following comparisons, GATK and Strelka were run in default mode with no changes to default parameters. qSNP was run in standard mode, requiring a minimum of 3 mutant alleles of the same type to make a variant call prior to applying standard read annotations and post-calling filters as described in the text.

As expected, as purity decreased, so did the sensitivity of detecting true positive somatic mutations. In total, 84 mutations were verified as true somatic events. At 40% tumor purity, qSNP successfully called 57 of 84 (68%) verified somatic mutations with only 1 false positive call (Table 5). At tumor purities of 20% and 10% the sensitivity of detection dropped to 42% and 15%, respectively. By increasing sequencing depth to >150×, the sensitivity of detection in the 20% and 10% samples was improved, but not to a level comparable to that observed in the higher mixtures (Table 5). In comparison, the GATK pipeline called only 50 of 84 (60%) of verified somatic events in the 100% sample and decayed more rapidly as tumor purity decreased, with no successful mutation calls in the 10% mixture (Table 5).

In addition, we re-sequenced 5 of these mixtures to an average depth of 48× on HiSeq 2000 and called mutations using qSNP, GATK and Strelka (Table 6). qSNP detected a greater number of verified somatic events than GATK or Strelka in all mixtures (Table 6).
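As a worked check of the sensitivity figures quoted for the SOLiD v4 mixtures, the percentages can be recomputed from the verified somatic (VS) counts in Table 5 and the 84 verified mutations stated in the text. The sketch below is only an arithmetic illustration; the variable and function names are hypothetical, and only the mixtures sequenced at approximately 65× are included.

```python
TOTAL_VERIFIED = 84  # verified somatic mutations on SOLiD v4 (from the text)

# Verified somatic (VS) calls per tumor percentage, copied from Table 5.
VS_CALLS = {
    "qSNP": {100: 84, 80: 73, 60: 66, 40: 57, 20: 35, 10: 13},
    "GATK": {100: 50, 80: 49, 60: 45, 40: 38, 20: 15, 10: 0},
}


def sensitivity(called, total=TOTAL_VERIFIED):
    """Fraction of verified somatic mutations recovered by a caller."""
    return called / total


for caller, by_purity in VS_CALLS.items():
    for purity, called in by_purity.items():
        print(f"{caller} {purity:>3}% tumor: {sensitivity(called):.0%}")
```

For qSNP this reproduces the 68%, 42% and 15% reported at 40%, 20% and 10% tumor content, and for GATK the 60% reported in the 100% sample.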
Here, a total of 92 mutations were verified as true PLOS ONE | www.plosone.org 5 November 2013 | Volume 8 | Issue 11 | e74380 Novel Somatic Mutation Calling Strategy Table 5. Controlled mixture experiment to assess the effect of reducing tumor purity on somatic mutation detection using the SOLiD v4 platform. qSNP GATK* Mixture (%tumor) Cov. 80% Mean cov. ‘ ‘ ‘ ‘ ‘ ‘ VS FP U VS FP U 100 176 62.16 84 17 2 50 7 1 80 196 72.13 73 5 10 49 1 2 60 186 67.49 66 6 6 45 0 4 40 196 67.67 57 1 8 38 0 3 20 236 81.96 35 3 2 15 0 1 10 226 79.35 13 5 5 0 0 1 20 496 161.11 48 5 6 18 0 8 10 476 152.11 15 4 5 0 0 8 *raw.vcf files were passed through qSNP post-processing checks outlined in Table 3 to remove likely false positives such as positions with evidence in the matched normal. VS verified somatic; FP false positive; U untested. doi:10.1371/journal.pone.0074380.t005 somatic events. At 40% tumor purity, qSNP successfully called 60 COLO-829 whole-genome benchmarking study of 92 (65%) of verified somatic mutations, compared to GATK The melanoma cell line COLO-829 [4] has been used that called 55 (60%), and Strelka that called 56 (61%) verified previously for benchmarking new cancer analysis tools [8,9]. An somatic mutations (Table 6). There was substantial overlap in true aliquot of cell line and matched normal DNA were received and positive somatic calls between the three callers; 68 of a total of 90 whole-genome sequencing was performed on both SOLiD v4 (avg. (76%) verified somatic mutations were called by all three software coverage 326) and HiSeq 2000 (avg. coverage 756). The tools (Figure 2). These positions had an average of 32 mutant reads performance of qSNP was benchmarked against calls previously with an average mutant allele fraction of 0.52 (range 0.12 to 0.93). 
published by the Wellcome Trust Sanger Institute (WTSI) that As tumor purity decreased, so did the number of mutations called included 454 verified somatic mutations, 43 mutations previously by all three software tools (Figure 2). There were no mutations reported in COSMIC and 32,842 untested calls [4]. On the unique to GATK and Strelka that were not also called by qSNP SOLiD v4 platform qSNP called 85% of the 454 previously and for all mixtures qSNP missed the fewest number of true verified somatic mutations as well as 25 novel mutations that were somatic events compared to the other two callers. qSNP and verified using amplicon-based sequencing on the Ion Torrent GATK further called 1 private somatic mutation each that was not platform (Table 7, Table S2). For untested calls there was detected by the other callers, while Strelka called 7 private somatic considerable overlap between those reported by Pleasance et al. [4] mutations undetected by the other callers (Figure 2). and this study. For all variants called and verified by WTSI but not called by qSNP, a detailed breakdown is provided showing why the call was not made. For example, positions with insufficient coverage in the matched normal and which thus did not pass the qSNP PASS criterion are tabulated as well as positions where we observed evidence in the matched normal sample. The majority of positions where qSNP failed to make a call (5,735 or 62% of positions only called by WTSI) had less than 3 reads evidence in our SOLiD v4 sequence data. Using the HiSeq 2000 sequence data, qSNP called 85% of 454 previously reported verified somatic mutations and 26 novel mutations that were verified by Ion Torrent amplicon sequencing (Table 7). Of the positions initially reported by Pleasance et al. [4], the two re-sequencing efforts on SOLiD v4 and HiSeq 2000 identified 3531 positions that had less than 3 reads evidence for a mutant allele on both platforms (Table 7). 
In the COLO-829 benchmark, qSNP also called a substantial number of private mutations on each platform: 6,486 on SOLiD v4 and 13,098 on HiSeq 2000, of which 2,674 were called on both platforms (Table 7).

Germline variants

While qSNP was designed primarily to identify somatic mutations, a comparison of resulting germline calls was made using the COLO-829 sample and calls made by the Illumina Human 1M OmniQuad arrays, selecting all positions from the arrays that showed evidence of a non-reference allele and had a GenCall (GC) score of >0.7. The average genotype concordance at positions with at least 8 reads coverage was 95% (same genotype call), while the variant call concordance was 99% (Table S3). As sequence depth increased, so did accuracy in making the correct genotype call. Positions with >70× sequence coverage had a genotype concordance of 99% and a variant call concordance of 100% (Table S3). The array data have been submitted to Gene Expression Omnibus (GEO), accession number GSE47904.

Figure 2. Overlap in somatic mutation calls. Verified somatic mutation calls were compared across three callers in 5 different tumor purity mixtures. Values are number of calls in 100%, 80%, 60%, 40% and 20% tumor content mixture, from top to bottom. doi:10.1371/journal.pone.0074380.g002

Table 6. Controlled mixture experiment to assess the effect of reducing tumor purity on somatic mutation detection using the HiSeq 2000 platform.

Mixture     Cov.   Mean     qSNP            GATK*           Strelka**
(% tumor)   80%    cov.     VS   FP   U     VS   FP   U     VS   FP   U
100         26×    61.43    82   1    72    80   1    72    77   1    66
80          19×    43.05    77   0    60    76   0    57    75   2    57
60          17×    40.57    65   1    45    62   1    39    60   2    44
40          18×    43.36    60   0    45    55   0    30    56   1    45
20          22×    51.83    47   0    22    37   0    14    48   1    26

* .vcf files were passed through qSNP post-processing checks outlined in Table 3 to remove likely false positives such as positions with evidence in the matched normal.
** Calls from 'pass' category.
VS verified somatic; FP false positive; U untested.
doi:10.1371/journal.pone.0074380.t006

Table 7. Benchmarking qSNP on sequencing data from the SOLiD v4 and HiSeq 2000 platforms using COLO-829 variants verified by either WTSI (WTSI only, qSNP+WTSI) or QCMG (qSNP only).

Caller       Details                        SOLiD v4             HiSeq 2000           SOLiD v4 and HiSeq 2000
                                            VS    C    U         VS    C    U         VS    C    U
qSNP+WTSI                                   381   33   23,544    385   39   23,660    333   30   19,276
WTSI only    <12× coverage in normal        18    5    1,329     0     0    104       0     0    26
             mutation also in normal        8     0    455       19    2    1,105     0     0    19
             germline in another patient    0     0    7         1     0    6         0     0    5
             did not pass post-filters      16    1    1,548     24    0    1,623     1     0    86
             qSNP germline call             0     0    24        0     0    63        0     0    10
             no call - <3 reads evidence    0     0    5,735     22    2    5,945     0     0    3,531
             no call - other                31    4    200       3     0    336       2     0    0
qSNP only*                                  25    0    6,486     26    0    13,098    22    0    2,674

* Min 5 mutant reads and 4 novel starts not considering pair.
VS verified somatic; C cosmic; U untested.
doi:10.1371/journal.pone.0074380.t007

Discussion

The development of cancer genome analysis tools and somatic mutation calling software is an active area of research, but the effects of reduced tumor purity on somatic mutation calling still remain largely unexplored. Here, we present a strategy for somatic point mutation calling in low purity tumors. We have used extensive verification in primary pancreatic adenocarcinoma samples to determine a variant calling strategy that controls the false positive rate while maximizing sensitivity. When directly assessing the accuracy and sensitivity of our approach in a controlled mixture experiment where samples of varying purity were generated and sequenced, we demonstrate superior performance compared to other commonly used somatic mutation callers, for both SOLiD v4 and HiSeq 2000 data. Finally, we have benchmarked our caller against the COLO-829 sample and show substantial overlap with previously published calls and calls made by either the SOLiD v4 or HiSeq 2000 platforms as well as a small number of previously undetected protein-coding somatic mutations.

In the controlled mixture experiment the single sample approach used by GATK had reduced overall sensitivity and a faster decay curve across samples of decreasing tumor purity than the joint sample callers, qSNP and Strelka, consistent with previous reports that joint sample analyses perform better for cancer analysis [9]. Using SOLiD v4 sequence data, qSNP and GATK both achieved a low false positive rate, although GATK called only 60% of known true positives in the 100% purity mixture. Using the HiSeq 2000 platform, the sensitivity of GATK was improved, but at the cost of a high total number of calls, likely due to a high false positive rate that was only improved by applying the same post-processing checks as in the qSNP pipeline, such as excluding positions that had evidence of the mutation in the matched normal sample (Table 3).

The controlled mixture experiment further compared our heuristic caller to a Bayesian approach (Strelka), demonstrating a marginal advantage in sensitivity and false positive rate for qSNP. We believe that the success of our heuristic caller is due to its ability to use minimum evidence to trigger a somatic mutation call and the use of powerful post-processing checks that control the false positive rate. Machine learning approaches such as the classifier of Ding et al. [6] can be a powerful strategy for identifying features discriminating true positive from false positive mutation calls, provided orthogonal verification data are available for training of the classifier. Discriminant features can then be incorporated in the set of heuristics for informing mutation calls. In addition, automated pipelines for amplicon-based verification can be set up using smaller scale sequencers such as the Ion Torrent or MiSeq platform. We have found this a successful strategy in pancreatic adenocarcinomas that vary widely in tumor purity. On the other hand, Bayesian approaches may be more readily transferable across datasets and provide some form of quantitative measure of the confidence for a given mutation call, although as discussed above these will be most useful for high coverage regions and tumors of high purity where allele distributions can be accurately estimated and are not confounded by Poisson sampling effects.

Finally, the controlled mixture experiment demonstrated that no single variant calling strategy is optimal in all aspects. While there was good overlap between callers and the majority of calls were made by at least 2 callers, each caller also identified private mutations that were not called by the others and were verified as somatic. Different callers thus have unique benefits, although qSNP missed the fewest true somatic events. These comparisons show that there is further scope for refinement of either mutation calling strategy to improve accuracy and sensitivity. Where high-density SNP array data are available, we recommend use of a genomic tool for estimating tumor purity prior to variant calling, such as the qPure software [19]. Determining the purity of a tumor will help identify the most useful thresholds for variant calling. For example, samples of high purity are expected to have a lower false negative rate and thus the stringency of variant calling may be increased to lower the false positive rate. Given that the qSNP analysis of a whole-exome dataset of tumor/matched normal takes only 30 minutes, multiple different parameters can be easily trialled to assess their effect on the total number of calls.

We used the COLO-829 sample for benchmarking both germline and somatic mutation calls. Germline calls from qSNP were compared to those made on the Illumina 1M OmniQuad chip, showing that the variant call concordance was over 99% even for positions with only 8 reads coverage. As sequencing depth increased, so did our accuracy in making the correct genotype call. Detailed comparisons of the qSNP somatic mutation calls against the original GAIIx calls of Pleasance et al. [4] showed considerable overlap for re-sequencing data from both the SOLiD v4 and HiSeq 2000 platforms, although there were also some important differences. For example, our re-sequencing efforts identified 3,531 positions that had fewer than 3 reads of evidence for a mutation in both the SOLiD v4 and HiSeq 2000 data, suggesting that these original calls are false positives and may reflect differences in read sampling, mapping or bias of the original sequencing platform. Similarly, calls private to the qSNP pipeline on either the SOLiD v4 or HiSeq 2000 platform likely included a large number of false positive calls, as evidenced by the fact that only 2,674 of these positions unique to our datasets were called on both sequencing platforms. Our calls on the HiSeq 2000 platform appear noisier judging by the total number of private calls on this platform (13,098) compared to calls on the SOLiD v4 platform (6,486). This is likely due to the increased coverage in the HiSeq runs (75× average base coverage compared to 32× in the SOLiD v4 data), which is expected to result in more variant calls when using the same evidence thresholds. We are currently implementing and refining post-processing checks for use with HiSeq whole-genome datasets that are adjusted for coverage and exclude common error sources, such as calls made in repeat regions, low complexity sequence or near indels. These post-filters are becoming increasingly important as analyses are moving from exon-capture to whole-genome sequencing datasets. Nevertheless, the large overlap in calls between the original and the two re-sequencing datasets suggests that the overall sensitivity of detection of qSNP was good, and that the remaining challenge lies in controlling platform- and software-specific error sources.

Conclusions

Accurate and sensitive somatic mutation calling in low purity tumors remains a formidable challenge, but one of great interest to the study of many solid tumors. Here, we have discussed some of the key challenges in this field and strategies we have devised to handle these. Continuous refinement of existing strategies, be they heuristic or Bayesian, as well as comparative analyses and benchmarking on a defined set of samples, will be critical to further improve performance of current somatic mutation callers.

Materials and Methods

Samples

Primary pancreatic adenocarcinoma samples discussed in this study were accrued as part of the Australian Pancreatic Genome Initiative (APGI) (http://www.pancreaticcancer.net.au) using an institutionally approved process for consent. COLO-829 sample aliquots for the melanoma cell line and matched normal were obtained from WTSI. Sample extraction and processing followed those outlined in Biankin et al. [14].

Verification of somatic mutations

Verification of somatic mutation calls was performed by targeted Ion Torrent sequencing using PCR primers to amplify 70–150 bp amplicons overlapping the somatic mutation. Tumor and normal DNA was whole-genome amplified prior to PCR using the Illustra GenomiPhi V2 DNA Amplification Kit (GE; 25-6600-30). PCR reactions and sequencing were performed as outlined in Biankin et al. [14]. Briefly, PCR reactions were set up using 10 ng of amplified gDNA and 5 µM of primer mix. Ion Spheres were generated using the Ion Xpress Template Kit (Life Technologies; 4469001) with approximately 260 million amplicon molecules per emulsion PCR, effectively yielding an emulsion containing 1 amplicon molecule per Ion Sphere. Samples were sequenced using the Ion Sequencing Kit (Life Technologies; 4468997) and the Ion Chip 316 Kit (Life Technologies; 4469496). Verification of somatic mutations was performed by sequence pileup at each mutant position, and a position was considered verified if it had a minimum depth of 100 reads coverage in the tumor and normal, a mutant allele frequency of at least 10% in tumor and less than 0.5% in normal.

Controlled mixture experiment

SOLiD exon capture data for the mixture experiment were taken from Biankin et al. [14].

Illumina exon capture was performed using the TargetSeq Exome Enrichment System (Life Technologies; A14060 and A138230) according to the manufacturer's instructions; however, some modifications were made to the protocol to make the kit compatible with Illumina libraries. SOLiD blocking and PCR oligos were replaced with Illumina TruSeq blocking and PCR oligos derived from the NimbleGen SeqCap EZ Library SR User's Guide v3.0 (Roche; 06588786001). The captured libraries were washed on the Life Technologies Library Builder using an unreleased protocol (Life Technologies), and the final post-capture PCR used the protocol in the NimbleGen SeqCap EZ Library SR User's Guide v3.0 (Roche; 06588786001). The final captured libraries were run on the Agilent BioAnalyser 2100 using the DNA High Sensitivity Kit (Agilent; 5067-4626) to calculate the molarity and assess the size distribution. Cluster generation of the libraries was performed using the TruSeq PE Cluster Kit v3-cBot-HS (Illumina; PE-401-3001), and sequencing was carried out. The SOLiD and HiSeq BAM files have been submitted to the European Genome-phenome Archive, as part of project EGAS00000000078.

COLO-829 whole-genome benchmarking study

Whole-genome sequencing of the COLO-829 tumor and matched normal sample was performed using the SOLiD v4 and Illumina HiSeq 2000 sequencing platforms. For preparation of SOLiD v4 long mate-pair libraries, 13 µg of gDNA was sheared to a mean size of 2.5 kb using the Covaris S2 system. Shearing was completed using the Blue miniTUBEs (Covaris p/n: 520065) using the standard settings for 3 kb as described in Covaris protocol 400069 (http://covarisinc.com/wp-content/uploads/pn_400069.pdf). Following shearing, 1 µL of sheared sample was run on the Agilent BioAnalyser 2100 using the DNA High Sensitivity Kit (Agilent p/n: 5067-4626) to assess the shearing size and distribution. The entire sheared DNA sample was then converted into a SOLiD-compatible Long Mate Pair (LMP) library using the Life Technologies 5500 SOLiD Mate-Paired Library Kit (Invitrogen p/n: 4464418) following the standard protocol (http://tools.invitrogen.com/content/sfs/manuals/cms_093442.pdf) with 10 minutes nick translation and a total of 12 cycles of amplification for the final library. After PCR amplification the libraries were assessed for molarity and size distribution using the Agilent BioAnalyser 2100 with the DNA High Sensitivity Kit. Libraries that passed this QC were prepared for SOLiD sequencing.

For the preparation of Illumina DNA libraries, 1 µg of gDNA was sheared to a mean size of 300 bp in a 130 µL volume using a Covaris microTUBE and the Covaris S2 system according to the standard protocol (Covaris; 010158 Rev C). The sheared sample was prepared into a library using the NEBNext DNA Library Prep Master Mix Set for Illumina (NEB; E6040S) according to the manufacturer's instructions with modifications. Size selection was done using an agarose gel (3% agarose) instead of the AMPure XP Beads size selection. The final libraries were run on the Agilent BioAnalyser 2100 using the DNA High Sensitivity Kit (Agilent; 5067-4626) to calculate the molarity and assess the size distribution. Libraries were then prepared for Illumina cluster generation and sequencing.

Of the qSNP unique calls, 61 protein-coding positions were selected for verification on the Ion Torrent platform using the same verification criteria as outlined above; 30 were confirmed as true somatic events and 31 as false positives (Table S2). In addition, 3 somatic mutations originally identified by WTSI could not be confirmed as somatic events in our verification efforts (Table S2).

Supporting Information

Table S1. Verification of 3,253 putative somatic mutation calls across 65 tumors. (XLSX)

Table S2. A mutation file containing somatic mutations for the COLO-829 data set. (XLSX)

Table S3. Comparison of qSNP germline variant calls to calls from SNP array analysis. (DOCX)

Acknowledgments

We thank the Australian Pancreatic Genome Initiative and associated clinical collaborators (APGI, www.pancreaticcancer.net.au/collaborators) for the sample used in the mixture experiment. We thank the Cancer Genome Project at the Wellcome Trust Sanger Institute for providing the COLO-829 DNA samples for sequencing. We would like to thank Deborah Gwynne, Cathy Axford, Mary-Anne Brancato, Sarah Rowe, Michelle Thomas, Skye Simpson, Marc Jones and Gerard Hammond for central co-ordination of the Australian Pancreatic Cancer Genome Initiative, data management and quality control, and Mona Martyn-Smith, Lisa Braatvedt, Henry Tang, Virginia Papangelis and Maria Beilin for biospecimen acquisition. We also thank John Shepperd, Emma Campbell and Evgeny Glasov for their efforts at the Queensland Centre for Medical Genomics.

Author Contributions

Conceived and designed the experiments: KSK OH NW SMG JVP. Performed the experiments: OH QX DKM IH ANC AS SM SI EN CN TB S. Wood KN AMP NW LF MA CL S. Wani FN SS DT. Analyzed the data: KSK KN AMP NW LF MA CL S. Wani FN SS DT. Contributed reagents/materials/analysis tools: ALJ AVB M. Pajic M. Pinese MJC JW AJG DKC PJW. Wrote the paper: KSK OH SMG JVP.

References

1. Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, et al. (2010) International network of cancer genome projects. Nature 464: 993–998.
2. TCGA (2011) Integrated genomic analyses of ovarian carcinoma. Nature 474: 609–615.
3. TCGA (2012) Comprehensive molecular characterization of human colon and rectal cancer. Nature 487: 330–337.
4. Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, et al. (2010) A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 463: 191–U173.
5. Pleasance ED, Stephens PJ, O'Meara S, McBride DJ, Meynert A, et al. (2010) A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature 463: 184–U166.
6. Ding J, Bashashati A, Roth A, Oloumi A, Tse K, et al. (2012) Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics 28: 167–175.
7. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, et al. (2010) The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20: 1297–1303.
8. Saunders CT, Wong WSW, Swamy S, Becq J, Murray LJ, et al. (2012) Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28: 1811–1817.
9. Larson DE, Harris CC, Chen K, Koboldt DC, Abbott TE, et al. (2012) SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28: 311–317.
10. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, et al. (2012) VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research 22: 568–576.
11. Goya R, Sun MGF, Morin RD, Leung G, Ha G, et al. (2010) SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics 26: 730–736.
12. Altshuler DL, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
13. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, et al. (2013) Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology 31: 213–219.
14. Biankin AV, Waddell N, Kassahn KS, Gingras M-C, Muthuswamy LB, et al. (2012) Pancreatic cancer genomes reveal aberrations in axon guidance pathway genes. Nature advance online publication.
15. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079.
16. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, et al. (2011) The variant call format and VCFtools. Bioinformatics 27: 2156–2158.
17. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, et al. (2011) Integrative genomics viewer. Nature Biotechnology 29: 24–26.
18. Thorvaldsdóttir H, Robinson JT, Mesirov JP (2012) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics.
19. Song S, Nones K, Miller D, Harliwong I, Kassahn KS, et al. (2012) qpure: A tool to estimate tumor cellularity from genome-wide single-nucleotide polymorphism profiles. PLoS One 7.
References (22)

Novel Somatic Mutation Calling Strategy PLOS ONE | www.plosone
P Danecek (2011)
The variant call format and VCFtools
Bioinformatics, 27
James Robinson, H. Thorvaldsdóttir, W. Winckler, M. Guttman, E. Lander, G. Getz, J. Mesirov (2011)
Integrative Genomics Viewer
Nature biotechnology, 29
A. Biankin, N. Waddell, K. Kassahn, M. Gingras, L. Muthuswamy, A. Johns, David Miller, P. Wilson, A. Patch, Jianmin Wu, D. Chang, M. Cowley, B. Gardiner, Sarah Song, Ivon Harliwong, S. Idrisoglu, C. Nourse, Ehsan Nourbakhsh, Suzanne Manning, Shivangi Wani, M. Gongora, M. Pajic, C. Scarlett, A. Gill, Andreia Pinho, I. Rooman, M. Anderson, O. Holmes, C. Leonard, Darrin Taylor, S. Wood, Qinying Xu, K. Nones, J. Fink, Angelika Christ, T. Bruxner, N. Cloonan, G. Kolle, F. Newell, M. Pinese, R. Mead, J. Humphris, W. Kaplan, Marc Jones, E. Colvin, A. Nagrial, Emily Humphrey, A. Chou, V. Chin, L. Chantrill, A. Mawson, J. Samra, J. Kench, Jessica Lovell, R. Daly, N. Merrett, C. Toon, K. Epari, N. Nguyen, A. Barbour, N. Zeps, N. Kakkar, Fengmei Zhao, Y. Wu, Min Wang, D. Muzny, W. Fisher, F. Brunicardi, S. Hodges, J. Reid, J. Drummond, K. Chang, Yi Han, L. Lewis, H. Dinh, C. Buhay, T. Beck, Lee Timms, M. Sam, K. Begley, Andrew Brown, D. Pai, A. Panchal, Nicholas Buchner, R. Borja, R. Denroche, C. Yung, S. Serra, N. Onetto, D. Mukhopadhyay, M. Tsao, P. Shaw, G. Petersen, S. Gallinger, R. Hruban, A. Maitra, C. Iacobuzio-Donahue, R. Schulick, elliot fishman, R. Morgan, R. Lawlor, P. Capelli, V. Corbo, M. Scardoni, G. Tortora, M. Tempero, K. Mann, N. Jenkins, P. Pérez-Mancera, D. Adams, D. Largaespada, L. Wessels, A. Rust, Lincoln Stein, D. Tuveson, N. Copeland, E. Musgrove, A. Scarpa, J. Eshleman, T. Hudson, R. Sutherland, D. Wheeler, J. Pearson, J. McPherson, R. Gibbs, S. Grimmond (2012)
Pancreatic cancer genomes reveal aberrations in axon guidance pathway genes
Nature, 491
E. Pleasance, R. Cheetham, P. Stephens, D. Mcbride, S. Humphray, C. Greenman, I. Varela, Meng‐Lay Lin, G. Ordóñez, G. Bignell, K. Ye, J. Alipaz, Markus Bauer, D. Beare, A. Butler, Richard Carter, Lina Chen, A. Cox, S. Edkins, P. Kokko-Gonzales, N. Gormley, R. Grocock, C. Haudenschild, Matthew Hims, Terena James, Mingming Jia, Z. Kingsbury, Catherine Leroy, J. Marshall, A. Menzies, L. Mudie, Z. Ning, Tom Royce, Ole Schulz-Trieglaff, Anastassia Spiridou, L. Stebbings, L. Szajkowski, J. Teague, David Williamson, L. Chin, M. Ross, Peter Campbell, D. Bentley, P. Futreal, Michael Stratton (2010)
A comprehensive catalogue of somatic mutations from a human cancer genome
Nature, 463
E. Pleasance, P. Stephens, S. O'meara, S. O'meara, D. Mcbride, A. Meynert, David Jones, Meng‐Lay Lin, D. Beare, K. Lau, C. Greenman, I. Varela, S. Nik-Zainal, H. Davies, G. Ordóñez, L. Mudie, Calli Latimer, S. Edkins, L. Stebbings, Lina Chen, Mingming Jia, Catherine Leroy, J. Marshall, A. Menzies, A. Butler, J. Teague, J. Mangion, Yongming Sun, Stephen McLaughlin, H. Peckham, Eric Tsung, G. Costa, Clarence Lee, J. Minna, A. Gazdar, E. Birney, Michael Rhodes, K. McKernan, M. Stratton, P. Futreal, P. Campbell, P. Campbell (2009)
A small cell lung cancer genome reports complex tobacco exposure signatures
Nature, 463
T. Hudson, W. Anderson, Axel Artez, A. Barker, C. Bell, R. Bernabé, M. Bhan, F. Calvo, I. Eerola, D. Gerhard, A. Guttmacher, M. Guyer, F. Hemsley, Jennifer Jennings, D. Kerr, P. Klatt, Patrik Kolar, Jun Kusada, D. Lane, F. Laplace, Lu Youyong, G. Nettekoven, B. Ozenberger, Jane Peterson, T. Rao, J. Remacle, A. Schafer, T. Shibata, M. Stratton, J. Vockley, Koichi Watanabe, Huanming Yang, M. Yuen, B. Knoppers, M. Bobrow, A. Cambon-Thomsen, L. Dressler, S. Dyke, Y. Joly, Kazuto Kato, Karen Kennedy, Pilar Nicolàs, M. Parker, E. Rial‐Sebbag, C. Romeo-Casabona, K. Shaw, S. Wallace, G. Wiesner, N. Zeps, P. Lichter, A. Biankin, C. Chabannon, L. Chin, B. Clement, E. Álava, F. Degos, M. Ferguson, Peter Geary, D. Hayes, A. Johns, A. Kasprzyk, H. Nakagawa, R. Penny, M. Piris, R. Sarin, A. Scarpa, M. Vijver, P. Futreal, H. Aburatani, M. Bayés, David Botwell, P. Campbell, X. Estivill, S. Grimmond, I. Gut, M. Hirst, C. López-Otín, P. Majumder, M. Marra, J. McPherson, Z. Ning, X. Puente, Y. Ruan, H. Stunnenberg, H. Swerdlow, V. Velculescu, R. Wilson, H. Xue, Liu Yang, P. Spellman, Gary Bader, P. Boutros, Paul Flicek, G. Getz, R. Guigó, Guangwu Guo, D. Haussler, S. Heath, T. Hubbard, T. Jiang, Steven Jones, Qibin Li, N. López-Bigas, Ruibang Luo, L. Muthuswamy, B. Ouellette, J. Pearson, V. Quesada, Benjamin Raphael, C. Sander, T. Speed, Lincoln Stein, Joshua Stuart, J. Teague, Y. Totoki, T. Tsunoda, A. Valencia, D. Wheeler, Honglong Wu, Shancen Zhao, Guangyu Zhou, M. Lathrop, G. Thomas, Teruhiko Yoshida, M. Axton, C. Gunter, L. Miller, Junjun Zhang, Syed Haider, Jianxin Wang, C. Yung, A. Cros, Yong Liang, S. Gnaneshan, J. Guberman, J. Hsu, D. Chalmers, K. Hasel, T. Kaan, W. Lowrance, T. Masui, L. Rodriguez, C. Vergely, D. Bowtell, N. Cloonan, A. deFazio, J. Eshleman, D. Etemadmoghadam, B. Gardiner, J. Kench, R. Sutherland, M. Tempero, N. Waddell, P. Wilson, S. Gallinger, M. Tsao, P. Shaw, G. Petersen, D. Mukhopadhyay, R. DePinho, S. Thayer, K. Shazand, Timothy Beck, M. 
Sam, Lee Timms, Vanessa Ballin, Youyong Lu, J. Ji, Xiuqing Zhang, Feng Chen, Xueda Hu, Qi Yang, G. Tian, Lianhai Zhang, Xiaofang Xing, Xianghong Li, Zheng‐gang Zhu, Yingyan Yu, Jun Yu, J. Tost, P. Brennan, I. Holcatova, D. Zaridze, A. Brazma, L. Egevard, E. Prokhortchouk, R. Banks, M. Uhlén, Juris Viksna, F. Pontén, K. Skryabin, E. Birney, Å. Borg, A. Børresen-Dale, C. Caldas, J. Foekens, Sancha Martin, J. Reis-Filho, A. Richardson, C. Sotiriou, G. Thoms, L. Veer, D. Birnbaum, H. Blanché, Pascal Boucher, S. Boyault, Jocelyne Masson-Jacquemier, I. Pauporté, X. Pivot, A. Vincent-Salomon, E. Tabone, C. Theillet, I. Treilleux, P. Bioulac-Sage, T. Decaens, D. Franco, M. Gut, Didier Samuel, J. Zucman‐Rossi, R. Eils, B. Brors, J. Korbel, A. Korshunov, P. Landgraf, H. Lehrach, S. Pfister, B. Radlwimmer, G. Reifenberger, Michael Taylor, C. Kalle, P. Majumder, P. Pederzoli, R. Lawlor, M. Delledonne, A. Bardelli, T. Gress, D. Klimstra, G. Zamboni, Y. Nakamura, S. Miyano, Akihiro Fujimoto, E. Campo, S. Sanjosé, E. Montserrat, M. González-Díaz, P. Jares, H. Himmelbauer, S. Beà, S. Aparicio, D. Easton, F. Collins, C. Compton, E. Lander, W. Burke, A. Green, S. Hamilton, O. Kallioniemi, T. Ley, E. Liu, B. Wainwright (2010)
International network of cancer genome projects
Nature, 464
(2013)
Somatic Point Mutation Calling in Low Cellularity Tumors
D. Bell, A. Berchuck, M. Birrer, J. Chien, D. Cramer, F. Dao, R. Dhir, P. Disaia, H. Gabra, Pat Glenn, A. Godwin, J. Gross, L. Hartmann, M. Huang, D. Huntsman, M. Iacocca, M. Imieliński, S. Kalloger, B. Karlan, D. Levine, G. Mills, C. Morrison, D. Mutch, Narciso Olvera, S. Orsulic, K. Park, N. Petrelli, B. Rabeno, J. Rader, B. Sikic, K. Smith-McCune, A. Sood, D. Bowtell, R. Penny, J. Testa, K. Chang, H. Dinh, J. Drummond, G. Fowler, P. Gunaratne, A. Hawes, C. Kovar, L. Lewis, M. Morgan, I. Newsham, J. Santibanez, J. Reid, L. Treviño, Y. Wu, M. Wang, D. Muzny, D. Wheeler, R. Gibbs, G. Getz, M. Lawrence, K. Cibulskis, A. Sivachenko, C. Sougnez, Douglas Voet, Jane Wilkinson, Toby Bloom, K. Ardlie, T. Fennell, J. Baldwin, S. Gabriel, E. Lander, L. Ding, R. Fulton, D. Koboldt, M. McLellan, T. Wylie, Jason Walker, M. O'Laughlin, D. Dooling, L. Fulton, R. Abbott, N. Dees, Q. Zhang, C. Kandoth, M. Wendl, W. Schierding, D. Shen, C. Harris, H. Schmidt, Joelle Kalicki, K. Delehaunty, C. Fronick, Ryan Demeter, L. Cook, J. Wallis, L. Lin, V. Magrini, J. Hodges, James Eldred, S. Smith, C. Pohl, Fabio Vandin, Benjamin Raphael, G. Weinstock, E. Mardis, R. Wilson, M. Meyerson, W. Winckler, R. Verhaak, Suzie Carter, C. Mermel, G. Saksena, H. Nguyen, R. Onofrio, D. Hubbard, S. Gupta, A. Crenshaw, A. Ramos, L. Chin, A. Protopopov, Juinhua Zhang, T. Kim, I. Perna, Y. Xiao, Hailei Zhang, G. Ren, N. Sathiamoorthy, R. Park, E. Lee, P. Park, R. Kucherlapati, D. Absher, L. Waite, G. Sherlock, J. Brooks, Jun Li, Jin Xu, R. Myers, P. Laird, L. Cope, J. Herman, Hui Shen, D. Weisenberger, H. Noushmehr, F. Pan, T. Triche, B. Berman, D. Berg, J. Buckley, S. Baylin, P. Spellman, E. Purdom, P. Neuvial, H. Bengtsson, L. Jakkula, S. Durinck, J. Han, S. Dorton, H. Marr, Y. Choi, V. Wang, Ninghai Wang, J. Ngai, J. Conboy, B. Parvin, H. Feiler, T. Speed, J. Gray, N. Socci, Yanke Liang, B. Taylor, N. Schultz, L. Borsu, A. Lash, C. Brennan, A. Viale, C. Sander, M. Ladanyi, K. Hoadley, S. Meng, Y. 
Du, Yufeng Shi, Lulin Li, Y. Turman, D. Zang, E. Helms, S. Balu, X. Zhou, Jinhua Wu, M. Topal, D. Hayes, C. Perou, Jun Zhang, Chaowei Wu, S. Shukla, A. Sivachenko, R. Jing, Yueh-Feng Liu, M. Noble, H. Carter, D. Kim, R. Karchin, J. Korkola, Laura Heiser, R. Cho, Zhihao Hu, E. Cerami, A. Olshen, B. Reva, Yevgeniy Antipin, R. Shen, P. Mankoo, R. Sheridan, G. Ciriello, William Chang, J. Bernanke, D. Haussler, C. Benz, Joshua Stuart, S. Benz, J. Sanborn, Charles Vaske, Jiangyu Zhu, C. Szeto, G. Scott, C. Yau, M. Wilkerson, N. Zhang, R. Akbani, K. Baggerly, W. Yung, J. Weinstein, T. Shelton, D. Grimm, M. Hatfield, S. Morris, P. Yena, P. Rhodes, M. Sherman, J. Paulauskis, S. Millis, A. Kahn, J. Greene, R. Sfeir, M. Jensen, James Chen, J. Whitmore, S. Alonso, J. Jordan, A. Chu, Jinghui Zhang, A. Barker, C. Compton, G. Eley, M. Ferguson, P. Fielding, D. Gerhard, R. Myles, C. Schaefer, K. Shaw, J. Vaught, J. Vockley, P. Good, M. Guyer, B. Ozenberger, James Peterson, E. Thomson (2011)
Integrated Genomic Analyses of Ovarian Carcinoma
Nature, 474
D. Muzny, M. Bainbridge, K. Chang, H. Dinh, J. Drummond, G. Fowler, C. Kovar, L. Lewis, M. Morgan, I. Newsham, J. Reid, J. Santibanez, E. Shinbrot, L. Treviño, Yuan-qing Wu, Min Wang, P. Gunaratne, L. Donehower, C. Creighton, D. Wheeler, R. Gibbs, M. Lawrence, Douglas Voet, R. Jing, K. Cibulskis, A. Sivachenko, P. Stojanov, A. McKenna, E. Lander, S. Gabriel, L. Ding, R. Fulton, D. Koboldt, T. Wylie, Jason Walker, D. Dooling, L. Fulton, K. Delehaunty, C. Fronick, Ryan Demeter, E. Mardis, R. Wilson, Andy Chu, Hye-Jung Chun, A. Mungall, E. Pleasance, A. Robertson, D. Stoll, M. Balasundaram, I. Birol, Y. Butterfield, E. Chuah, R. Coope, Noreen Dhalla, R. Guin, Carrie Hirst, M. Hirst, R. Holt, Darlene Lee, H. Li, Michael Mayo, Richard Moore, J. Schein, Jared Slobodan, Angela Tam, N. Thiessen, R. Varhol, Thomas Zeng, Yongjun Zhao, Steven Jones, M. Marra, A. Bass, A. Ramos, G. Saksena, A. Cherniack, S. Schumacher, B. Tabak, S. Carter, Nam Pho, Huy Nguyen, R. Onofrio, A. Crenshaw, K. Ardlie, R. Beroukhim, W. Winckler, M. Meyerson, A. Protopopov, Angela Hadjipanayis, E. Lee, Ruibin Xi, Lixing Yang, X. Ren, N. Sathiamoorthy, Peng-Chieh Chen, Psalm Haseley, Yonghong Xiao, Semin Lee, J. Seidman, L. Chin, P. Park, R. Kucherlapati, J. Auman, K. Hoadley, Ying Du, M. Wilkerson, Yan Shi, C. Liquori, S. Meng, Ling Li, Yidi Turman, M. Topal, Donghui Tan, S. Waring, Elizabeth Buda, Jesse Walsh, Corbin Jones, P. Mieczkowski, Darshan Singh, Junyuan Wu, Anisha Gulabani, Peter Dolina, T. Bodenheimer, A. Hoyle, J. Simons, Matthew Soloway, Lisle Mose, S. Jefferys, S. Balu, Brian O’Connor, J. Prins, Derek Chiang, D. Hayes, C. Perou, T. Hinoue, D. Weisenberger, D. Maglinte, F. Pan, B. Berman, D. Berg, Hui Shen, T. Triche, S. Baylin, P. Laird, G. Getz, M. Noble, Doug Voat, N. Gehlenborg, D. Dicara, Juinhua Zhang, Hailei Zhang, Chang-Jiun Wu, Spring Liu, S. Shukla, Lihua Zhou, Pei Lin, R. Park, Marc-Danie Nazaire, James Robinson, H. Thorvaldsdóttir, J. Mesirov, V. 
Thorsson, Sheila Reynolds, Brady Bernard, R. Kreisberg, Jake Lin, L. Iype, Ryan Bressler, Timo Erkkilä, M. Gundapuneni, Yuexin Liu, Adam Norberg, Thomas Robinson, Da Yang, Wei Zhang, I. Shmulevich, J. Ronde, N. Schultz, E. Cerami, G. Ciriello, A. Goldberg, Benjamin Gross, A. Jacobsen, Jianjiong Gao, B. Kaczkowski, Rileen Sinha, B. Aksoy, Yevgeniy Antipin, B. Reva, R. Shen, B. Taylor, M. Ladanyi, C. Sander, R. Akbani, Nianxiang Zhang, B. Broom, Tod Casasent, A. Unruh, Chris Wakefield, S. Hamilton, R. Cason, K. Baggerly, J. Weinstein, D. Haussler, C. Benz, Joshua Stuart, S. Benz, J. Sanborn, Charles Vaske, Jingchun Zhu, C. Szeto, G. Scott, C. Yau, S. Ng, Theodore Goldstein, K. Ellrott, E. Collisson, A. Cozen, D. Zerbino, C. Wilks, Brian Craft, P. Spellman, R. Penny, T. Shelton, M. Hatfield, S. Morris, P. Yena, C. Shelton, M. Sherman, J. Paulauskis, J. Gastier-Foster, Jay Bowen, N. Ramirez, Aaron Black, R. Pyatt, L. Wise, P. White, M. Bertagnolli, Jennifer Brown, T. Chan, Gerald Chu, Christine Czerwinski, F. Denstman, R. Dhir, A. Dörner, C. Fuchs, J. Guillem, M. Iacocca, H. Juhl, Andrew Kaufman, B. Kohl, X. Le, M. Mariano, Elizabeth Medina, M. Meyers, G. Nash, P. Paty, N. Petrelli, B. Rabeno, W. Richards, D. Solit, P. Swanson, L. Temple, J. Tepper, Richard Thorp, E. Vakiani, M. Weiser, J. Willis, G. Witkin, Z. Zeng, M. Zinner, C. Zornig, M. Jensen, R. Sfeir, A. Kahn, A. Chu, Prachi Kothiyal, Zhining Wang, E. Snyder, J. Pontius, T. Pihl, Brenda Ayala, M. Backus, Jessica Walton, J. Whitmore, J. Baboud, D. Berton, M. Nicholls, Deepak Srinivasan, R. Raman, Stanley Girshik, Peter Kigonya, S. Alonso, Rashmi Sanbhadti, S. Barletta, J. Greene, D. Pot, K. Shaw, Laura Dillon, K. Buetow, Tanja Davidsen, John Demchok, G. Eley, M. Ferguson, P. Fielding, C. Schaefer, Margi Sheth, Liming Yang, M. Guyer, B. Ozenberger, Jacqueline Palchik, Jane Peterson, H. Sofia, E. Thomson (2012)
Comprehensive molecular characterization of human colon and rectal cancer
Nature, 487
Jiarui Ding, A. Bashashati, Andrew Roth, A. Oloumi, Kane Tse, Thomas Zeng, Gholamreza Haffari, M. Hirst, M. Marra, A. Condon, Samuel Aparicio, Sohrab Shah (2011)
Feature-based classifiers for somatic mutation detection in tumour–normal paired sequencing data
Bioinformatics, 28
K. Cibulskis, M. Lawrence, S. Carter, A. Sivachenko, D. Jaffe, C. Sougnez, S. Gabriel, M. Meyerson, E. Lander, G. Getz (2013)
Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples
Nature Biotechnology, 31
C. Saunders, Wendy Wong, Sajani Swamy, J. Becq, L. Murray, R. Cheetham (2012)
Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs
Bioinformatics, 28(14)
D. Koboldt, Qunyuan Zhang, D. Larson, D. Shen, M. McLellan, Ling Lin, Christopher Miller, E. Mardis, L. Ding, R. Wilson (2012)
VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing
Genome Research, 22(3)
Heng Li, R. Handsaker, Alec Wysoker, T. Fennell, Jue Ruan, Nils Homer, Gabor Marth, G. Abecasis, R. Durbin (2009)
The Sequence Alignment/Map format and SAMtools
Bioinformatics, 25
(2011)
The variant call format and VCFtools
Sarah Song, K. Nones, David Miller, Ivon Harliwong, K. Kassahn, M. Pinese, M. Pajic, A. Gill, A. Johns, M. Anderson, O. Holmes, C. Leonard, Darrin Taylor, S. Wood, Qinying Xu, F. Newell, M. Cowley, Jianmin Wu, Peter Wilson, L. Fink, A. Biankin, N. Waddell, S. Grimmond, J. Pearson (2012)
qpure: A Tool to Estimate Tumor Cellularity from Genome-Wide Single-Nucleotide Polymorphism Profiles
PLoS ONE, 7
D. Larson, C. Harris, Ken Chen, D. Koboldt, Travis Abbott, D. Dooling, T. Ley, E. Mardis, R. Wilson, L. Ding (2012)
SomaticSniper: identification of somatic point mutations in whole genome sequencing data
Bioinformatics, 28(3)
G. Abecasis, D. Altshuler, A. Auton, L. Brooks, R. Durbin, R. Gibbs, M. Hurles, G. McVean (2010)
A map of human genome variation from population-scale sequencing
Nature, 467
A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, M. DePristo (2010)
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data
Genome Research, 20(9)
R. Goya, Mark Sun, Ryan Morin, G. Leung, G. Ha, Kim Wiegand, J. Senz, Anamaria Crisan, M. Marra, M. Hirst, D. Huntsman, Kevin Murphy, Samuel Aparicio, Sohrab Shah (2010)
SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors
Bioinformatics, 26
H. Thorvaldsdóttir, James Robinson, J. Mesirov (2012)
Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration
Briefings in Bioinformatics, 14

Publisher: Pubmed Central
Copyright: © 2013 Kassahn et al
ISSN: 1932-6203
eISSN: 1932-6203
DOI: 10.1371/journal.pone.0074380

Abstract

Somatic mutation calling from next-generation sequencing data remains a challenge due to the difficulties of distinguishing true somatic events from artifacts arising from PCR, sequencing errors or mis-mapping. Tumor cellularity or purity, sub-clonality and copy number changes also confound the identification of true somatic events against a background of germline variants. We have developed a heuristic strategy and software (http://www.qcmg.org/bioinformatics/qsnp/) for somatic mutation calling in samples with low tumor content and we show the superior sensitivity and precision of our approach using a previously sequenced cell line, a series of tumor/normal admixtures, and 3,253 putative somatic SNVs verified on an orthogonal platform.

Citation: Kassahn KS, Holmes O, Nones K, Patch A-M, Miller DK, et al. (2013) Somatic Point Mutation Calling in Low Cellularity Tumors. PLoS ONE 8(11): e74380. doi:10.1371/journal.pone.0074380
Editor: I. King Jordan, Georgia Institute of Technology, United States of America
Received April 11, 2013; Accepted July 31, 2013; Published November 8, 2013
Copyright: © 2013 Kassahn et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: No current external funding sources for this study.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected] (SMG); [email protected] (JVP)
PLOS ONE | www.plosone.org | November 2013 | Volume 8 | Issue 11 | e74380 | Novel Somatic Mutation Calling Strategy

Introduction

The declining cost of next-generation sequencing is enabling an increasing number of tumor sequencing studies [1–3], providing new insights into the mutations driving tumorigenesis. These large-scale efforts are redefining the role of known oncogenes and tumor suppressor genes, identifying new candidate driver genes and providing insights into the mutational mechanisms at play in different tumor types [4,5]. Accurate somatic mutation calling is paramount in these studies.

Despite this growing demand for accurate somatic mutation calls in cancer studies, mutation calling from next-generation sequencing data remains challenging. Early cycle PCR-induced errors, polymerase slippage [6] and the mis-mapping of reads due to homology to multiple genomic regions are some of the most common sources of false positive calls. Inadequate sequence depth in the matched normal sample can also result in germline variants being incorrectly identified as somatic mutations (false positives). Finally, tumor heterogeneity and purity further confound accurate somatic mutation calling as increased tumor heterogeneity and decreased purity result in lower mutant allele ratios that can make it difficult to distinguish true mutations from background (false negative error). In solid tumors, purity varies widely with some tumor samples having less than 10% tumor content. Many low purity tumor samples have been excluded from somatic mutation analysis to date due to the analytical challenges associated with accurately calling mutations in these samples and the expected high false negative rate. To keep the sensitivity of the analysis at desired levels, there is a risk of calling an increasing number of false positives.

Several software programs have been developed for variant and somatic mutation calling, including GATK [7], Strelka [8], diBayes (Applied Biosystems BioScope™ software), SomaticSniper [9], VarScan 2 [10] and SNVMix [11]. For cancer genome analysis and to identify somatic events, a tumor sample is compared to its matched normal sample. Current software tools differ in important ways by either performing single or joint sample analysis of the tumor/matched normal sample pair, and by either using Bayesian or heuristic approaches (Table 1). GATK was initially developed in the context of the 1000 Genomes Project [12] to enable variant discovery and genotyping from next-generation sequencing data. GATK performs single sample analysis only. A tumor and matched normal sample pair are thus genotyped independently and somatic events are determined by subtracting calls in the normal from those in the tumor sample. In contrast, Strelka, SomaticSniper and VarScan 2 perform joint sample analysis of a tumor/normal pair and either model tumor as a mixture of normal sample with somatic variation (Strelka), calculate joint diploid genotype likelihoods using the MAQ genotype model (SomaticSniper) or compare read count distributions between the two samples using Fisher's exact test (VarScan2). Importantly, due to the different statistical models employed, current somatic mutation callers differ in the number of somatic mutation calls and in their overlap. In addition, many somatic mutation callers use a series of post-call filtering steps that further affect the number and type of final mutation calls. Some of these tools also allow analysis of small indels, germline variants and copy number variations (Table 1).

Table 1. Variant calling software tools.
Software | Tumor normal joint analysis, Output germline variants, Indels | Statistical method | Reference
qSNP | X X | empirically determined set of heuristics optimized for sensitivity in low purity tumors | present study
GATK | n/a X | Bayesian model for genotype likelihood, can take into account multiple samples for calibration | [7]
Strelka | X X | Bayesian model of tumor as a mixture of normal sample with somatic variation | [8]
SomaticSniper | X X | Bayesian comparison of genotype likelihoods based on MAQ genotype model | [9]
diBayes | n/a | Bayesian model for presence of non-reference allele (color-space data) | Applied Biosystems BioScope™
VarScan2 | X X X | heuristics to determine genotypes and Fisher's exact test to examine read count differences, also outputs CNV regions for exome data | [10]
SNVMix | n/a | probabilistic binomial mixture model accounting for tumor ploidy and purity | [11]
doi:10.1371/journal.pone.0074380.t001

There have not been, however, any detailed investigations of the effects of reduced tumor cellularity or purity on the accuracy and sensitivity of somatic mutation calling, although a recent software, MuTect, has been designed especially with subclonal mutations in mind [13]. A number of factors compromise somatic mutation calling in low purity tumors. As sequence coverage and tumor purity decrease, the effects of allele sampling confound the accurate assessment of allele distributions and thus compromise statistical models for determining potential variant sites of interest. Secondly, depending on the statistic models used, low frequency mutations may or may not trigger a variant call resulting in differences in the number and type of mutations called between different callers. Our interests in pancreatic adenocarcinomas where over 70% of tumors are of less than 40% purity due to desmoplastic stroma and despite enrichment by histology-guided macrodissection [14] have motivated us to determine optimal strategies for somatic point mutation calling in these tumors. To this end, several mutation calling strategies were tested and extensive verification was performed, in which true positive and false positive mutation calls were inspected to identify common error sources. A heuristics-based single nucleotide variant caller, qSNP, was then implemented using these empirically determined features. Its performance was directly assessed in samples of varying purity that were generated by mixing a tumor cell line and its matched normal sample at varying proportions and sequencing each mixture. The decay in sensitivity as purity decreased in these mixtures was assessed and the performance of our caller was compared to that of two others. Finally, its performance was benchmarked against the COLO-829 cell line, previously sequenced and analyzed by Pleasance et al. [4].

Results

Our somatic mutation calling strategy has been designed to maximize sensitivity in light of low tumor purity. Iterative rounds of verification using benchtop amplicon-based sequencing were performed to develop and refine post-processing checks to control the false discovery rate. The following considerations informed the design of our mutation calling strategy and its software implementation, qSNP.

Joint analysis of the tumor and matched normal sample

qSNP considers sequence data in Binary Sequence Alignment/Map (BAM) format [15] from both tumor and matched normal samples jointly. Classification into germline and somatic calls follows a number of simple rules that were designed to accommodate for the expected low mutant allele ratio in low purity tumors (Table 2).

Maximize sensitivity of mutation calling

qSNP currently triggers a variant call if a minimum of 3 reads of the same, non-reference allele are found. We found that this minimum evidence requirement ensures that a variant call is triggered even in regions where Poisson sampling of alleles may have confounded the observed allele distributions. As sequence depth increases, so does the minimum read requirement. At coverage over 20× a minimum of 4 mutant reads are required and above 50× a minimum of 5% of mutant reads or a minimum of 2.5% mutant reads if reads are on both strands. In addition, the base qualities of the variant reads must be at least 10% of the sum of base qualities at the position or at least 5% of the sum of base qualities if reads are found on both strands and coverage is over 50×. To determine whether the position is homozygous or heterozygous, the two most common alleles are determined. If both alleles match the evidence criteria above, the position is considered heterozygous, and if not, homozygous.

Post-processing checks to control the false discovery rate

Various factors influence the confidence in a somatic mutation call, including sequence depth in tumor and matched normal, base qualities of alleles, evidence for variant in matched normal sample, number of mutant reads, and mutant allele ratio. The statistical frameworks to encompass all of these factors into a single model and metric are still being developed. Some single-sample SNP callers give a p-value that purely reflects the likelihood for the presence of a non-reference allele.
Table 2. Classification of germline and somatic events.
Normal genotype | Tumor genotype | Details* | Classification
Hom | Het | Variant is reference allele; G/G>A/G | Germline
Hom | Het | Variant novel; A/A>A/G | Somatic
Het | Hom | Tumor allele same; A/G>G/G | Germline
Het | Hom | Tumor allele different; A/G>T/T | Somatic
Hom | Hom | Same; G/G>G/G | Germline
Hom | Hom | Different; A/A>G/G | Somatic
Het | Het | Same; A/G>A/G | Germline
Het | Het | Different; A/G>T/G | Somatic
*All examples assume 'A' as the reference allele, 'G' as the variant, and 'Hom' and 'Het' denote homozygous and heterozygous respectively.
check coverage in normal to exclude under-calling.
could indicate LOH in tumor.
doi:10.1371/journal.pone.0074380.t002

Furthermore, most mutation calling software is used in combination with a series of post-calling filtering steps to remove likely false positives. This practice means that the original p-values calculated by the mutation caller are overridden by these further checks that ultimately decide whether or not a mutation is considered high confidence. For low purity tumors, Poisson sampling of alleles can confound estimates of their true frequencies, further compromising the calculation of accurate p-values or resulting in positions not exceeding a likelihood threshold.

For these reasons we have not made an attempt to estimate a p-value upfront but instead use flags to indicate that a putative somatic mutation call does not meet certain quality criteria or evidence thresholds (Table 3). For example, putative somatic positions are checked for the presence of the variant in the matched normal BAM. If a position has evidence in the normal, the call is annotated as such. Somatic positions are further checked for being a germline variant in another patient as this can indicate under-sampling of alleles in the matched normal. For this check, we use an in-house database of germline variants and qSNP can be set up to output high quality germline calls to this database with each iteration of qSNP. Positions that pass all checks are considered to be of highest confidence and we expect these to be true somatic events. They are annotated as PASS in the qSNP output. Positions where the normal sample lacks adequate sequencing coverage are potentially false positive somatic calls and may return germline in verification. These are annotated as COVN12 in qSNP output. All remaining somatic mutations such as those where there is evidence of the variant also in the normal sample or where only few mutant reads support the variant call are considered lowest confidence and are expected to include many false positives. These calls are annotated as outlined in Table 3.

Output mutation calls in .vcf and DCC formats

Output in Variant Call Format (VCF) [16] was required as VCF is becoming the standard format for mutation reporting and annotation and allows integration with an ever-expanding set of VCF tools. To enable easy integration with the International Cancer Genome Consortium (ICGC) Data Coordination Centre (DCC), output in DCC format was also implemented.

Fast, easy to run and operating-system independent

Given the continuously increasing throughput of next-generation sequencing platforms, qSNP needed to be efficient in its use of compute resources. To achieve this, qSNP is implemented in JAVA using the Picard library (version 1.62). qSNP is driven by a single plain-text configuration file in the "Windows INI-file" style and takes as its primary inputs, a pair of tumor and normal BAM files that have been duplicate-marked and coordinate-sorted. qSNP implements a fast and flexible read-filtering system and if filters such as minimum mapping quality or alignment length are specified, qSNP will filter out failing reads prior to analysis. qSNP creates a pileup of bases in tumor and normal to look for evidence of a variant. qSNP has been specifically designed to make use of a compute cluster. It is thus multi-threaded, requiring 5 cores and 20 GB of memory to run efficiently.

Tuning using verification data

To identify common error sources and to refine qSNP, extensive verification of 3,253 putative somatic mutation calls was performed across 65 tumors of 6 to 83% purity (mean 38% purity), including 60 tumors reported in Biankin et al. [14] (Table 4, Table S1). In total, 717 mutations were confirmed as true somatic events, of which 704 had been classified as PASS by qSNP (Table 4). Miscalled somatic mutations were most commonly associated with one of three features: position in regions of sequence homology, support only by non-independent reads or support by low evidence. By designing strategies to eliminate false positives associated with these common error sources, we were able to maintain an accuracy of 57% at a sensitivity of 98% across these tumors of mean purity of 38% (Table 4). This sensitivity is likely an overestimate of the true sensitivity as only known, verified mutations called by qSNP at any evidence threshold were chosen for verification; it is possible that there were additional somatic events that were never called. Nevertheless, our strategy is successful in retaining the vast majority of known true positive events (98%), while eliminating false positive calls associated with common error sources.

Sequence homology regions

Regions of sequence homology can cause problems in mapping and reads may be erroneously mapped to the wrong homologue. This is not always apparent from the mapping quality values that can remain high especially if these values reflect pairing quality values that consider the mapping qualities of both reads in a read pair. Nevertheless, these regions can often be identified on the basis of having an excess of putative sequence variants.
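As a worked check of the figures reported under "Tuning using verification data", the short Python sketch below recomputes precision and sensitivity from the counts given in the text and in Table 4. It assumes that the false positives are the PASS calls that verified as germline (28) or wild type (506), and that the false negatives are the verified somatic events not classified as PASS; the variable and function names are illustrative only.

```python
# Counts reported in the text and in Table 4.
verified_somatic_total = 717   # all verified true somatic events
pass_verified_somatic = 704    # PASS calls verified as somatic (TP)
pass_verified_germline = 28    # PASS calls verified as germline (assumed FP)
pass_verified_wild_type = 506  # PASS calls verified as wild type (assumed FP)


def precision(tp, fp):
    """Fraction of calls that are true: TP / (TP + FP)."""
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """Fraction of true events that are called: TP / (TP + FN)."""
    return tp / (tp + fn)


false_positives = pass_verified_germline + pass_verified_wild_type
false_negatives = verified_somatic_total - pass_verified_somatic

prec = precision(pass_verified_somatic, false_positives)
sens = sensitivity(pass_verified_somatic, false_negatives)

print(f"precision   = {prec:.1%}")
print(f"sensitivity = {sens:.1%}")
```

Under these assumptions the two values round to the 57% and 98% quoted in the text.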
To overcome this challenge, qSNP has a user-defined BAM filtering option so that only high quality reads will trigger a mutation call. For SOLiD v4 data, mapped with Bioscope 2.1 we find the following filters useful:

1) min. 35 bp alignment length or (second of read pair and mapped as a proper pair);
2) min. SM>15 (single mapping quality);
3) no more than 2 base-space mismatches to the reference;
4) not a PCR duplicate (Picard MarkDuplicates).

For Illumina 100 bp paired-end data mapped with BWA, we use the following filters:

1) min. SM>10 (single mapping quality);
2) no more than 3 mismatches to the reference;
3) not a PCR duplicate (Picard MarkDuplicates).

These read filters can be specified in the qSNP configuration file using a domain-specific language (DSL). Once a somatic mutation call has been made, the unfiltered non-duplicate pileup from the normal BAM is checked to see if there is any evidence of the variant. These steps help eliminate many of the false positives associated with this common error source.

Table 3. Post-processing checks performed by qSNP.
Annotation | Variant type | Description
PASS | Somatic, Germline | (Passed all post-processing checks) AND (min 5 mutant reads) AND (min 4 novel starts not considering read pair)
COVN12 | Somatic | Less than 12 reads coverage in matched normal sample
COVN8 | Germline | Less than 8 reads coverage in matched normal sample
SAN3 | Germline | Less than 3 reads of same allele in normal
COVT8 | Germline | Less than 8 reads coverage in tumor
SAT3 | Germline | Less than 3 reads of same allele in tumor
GERM | Somatic | Mutation is a germline variant in another patient
MIN | Somatic | Mutation also found in pileup of normal BAM
MIUN | Somatic | Mutation also found in pileup of unfiltered normal BAM
NNS | Somatic, Germline | Less than 4 novel starts not considering read pair
MR | Somatic, Germline | Less than 5 variant reads
MER | Somatic | Mutation same as reference
SBIAS | Somatic | Strand bias (Illumina only)
doi:10.1371/journal.pone.0074380.t003

Non-independent reads

Picard MarkDuplicates (http://picard.sourceforge.net) has become the standard tool for identifying PCR duplicates in next-generation sequencing data. Given that PCR is commonly used to amplify DNA for sequencing, likely PCR duplicates need to be identified so they don't inflate allele counts during mutation calling. To identify duplicates Picard MarkDuplicates uses the start coordinates and orientations of both reads of a read pair. Within a set of duplicate read pairs, the read pair with the highest base qualities is retained with the others marked as PCR duplicates. Picard MarkDuplicates does not consider the sequence of the reads, only the alignment start coordinates and orientations.

This strategy of marking PCR duplicates has one drawback. Read pairs where one read maps to a region of sequence homology sometimes fail to pass the Picard test for being PCR duplicates because these reads often map to different copies of the region of sequence homology, thus disguising the fact that they are indeed all derived from the same PCR molecule. These reads can be easily identified upon visual inspection in the Integrative Genomics Viewer (IGV) [17,18] on the basis of shared start coordinates of one read partner with different chromosome map positions of the other read in the pair (Figure 1). To overcome this challenge, all putative somatic mutation calls are annotated in qSNP with the number of novel read starts not considering the read pair (NNS in the VCF output files). Based on our extensive verification data, we find that a minimum of 4 novel starts using this criterion is a useful lower limit for somatic mutation detection.

Figure 1. Non-independent reads confounding mutation calls. Read pairs are colored by the chromosome map position of the second read in the pair. MarkDuplicates fails to correctly identify these non-independent read pairs as PCR duplicates due to the different map locations of the second read.
doi:10.1371/journal.pone.0074380.g001

Table 4. Details of verification using amplicon-based sequencing on the Ion Torrent.
Verification across 65 primary pancreatic adenocarcinomas with mean tumor purity 38% (range 6 to 83%)
Total verified somatic (TP) | 717
qSNP pass calls: Verified somatic | 704
qSNP pass calls: Verified germline | 28
qSNP pass calls: Verified wild type | 506
Precision TP/(TP+FP) | 57%
Sensitivity TP/(TP+FN) | 98%
doi:10.1371/journal.pone.0074380.t004

Low evidence calls

Finally, mutation calls that are only supported by a few mutant reads are also common false positives. However, as tumor purity decreases, so does the expected mutant allele ratio, making it difficult to distinguish true somatic events from sequencing artifacts. We investigated a number of criteria to improve signal to noise for calls with low evidence. Strand bias proved not to be a useful discriminating feature for SOLiD v4 data as many true somatic mutations were only supported by reads on one strand. Using results from amplicon-based verification, 363 FP were only on one strand, 171 FP were on both strands, 94 TP were only on one strand, and 610 TP were on both strands. Filtering somatic mutation calls by requiring the mutant allele being represented by reads on both strands will thus severely impact sensitivity of detection. Mutant allele ratio, i.e. proportion of mutant reads, also had poor discriminating power with many true positive calls having very low mutant allele ratios: 112 FP had a mutant allele ratio <10%, 422 FP had a mutant allele ratio >10%, 130 TP had a mutant allele ratio <10% and 574 had a mutant allele ratio >10%.

In contrast, there was a strong positive relationship between the likelihood of being a true somatic event and the number of reads with novel starts not considering the read pair supporting the mutation. The higher the number of mutant reads supporting the call, the higher the accuracy. There is a trade-off between sensitivity and accuracy, however. At 5 mutant reads with a minimum of 4 novel starts not considering read pair, we obtain an average accuracy of 57% and sensitivity of 98% (Table 4), which we found useful thresholds for mutation detection and follow-up verification in primary pancreatic adenocarcinomas. By requiring a minimum of 10 mutant reads, our accuracy increases to 94%, but at the cost of reduced sensitivity (53%). These criteria were determined using exome samples that had been sequenced to a depth where 80% of targeted bases had at least 20× coverage (average targeted base coverage of approximately 65×).

Benchmarking variant calling in a controlled mixture experiment

We previously modeled the performance of qSNP in a panel of mixtures where a pancreatic adenocarcinoma cell line and its matched normal were mixed at the following proportions: 0, 10, 20, 40, 60, 80 and 100% cell line DNA [14]. These mixtures were sequenced to an average depth of approximately 65× using the SureSelect exon capture method and SOLiD v4 sequencing. Here, we compare the decay in sensitivity across these mixtures using variant calls from qSNP and GATK (Table 5). All somatic qSNP calls made in the 100% cell line sample were selected for verification by amplicon-based sequencing on the Ion Torrent PGM. The remaining mutation calls were assessed for evidence on an alternate sequencing platform: HiSeq 2000 for calls made on the SOLiD v4 platform and vice versa. A position was considered verified if read depth was at least 20× and the mutation occurred at a frequency of at least 5% with a minimum of 3 variant reads on the alternate sequencing platform. In all following comparisons, GATK and Strelka were run in default mode with no changes to default parameters. qSNP was run in standard mode, requiring a minimum of 3 mutant alleles of the same type to make a variant call prior to applying standard read annotations and post-calling filters as described in the text.

As expected, as purity decreased, so did the sensitivity of detecting true positive somatic mutations. In total, 84 mutations were verified as true somatic events. At 40% tumor purity, qSNP successfully called 57 of 84 (68%) verified somatic mutations with only 1 false positive call (Table 5). At tumor purities of 20% and 10% the sensitivity of detection dropped to 42% and 15%, respectively. By increasing sequencing depth to >150×, the sensitivity of detection in the 20% and 10% samples was improved, but not to a level comparable to that observed in the higher mixtures (Table 5). In comparison, the GATK pipeline called only 50 of 84 (60%) of verified somatic events in the 100% sample and decayed more rapidly as tumor purity decreased with no successful mutation calls in the 10% mixture (Table 5).

In addition, we re-sequenced 5 of these mixtures to an average depth of 48× on HiSeq 2000 and called mutations using qSNP, GATK and Strelka (Table 6). qSNP detected a greater number of verified somatic events than GATK or Strelka in all mixtures (Table 6).
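The novel-starts criterion described under "Non-independent reads", together with the 5-read minimum from "Low evidence calls", can be illustrated with a small Python sketch. It is a simplified stand-in rather than qSNP's own logic: reads are plain dictionaries instead of BAM records, only the variant-carrying read (not its mate) is inspected, and counting a start as "novel" per distinct (start coordinate, strand) combination is an assumption. The flag names MR and NNS follow Table 3.

```python
def novel_starts(variant_reads):
    """Count distinct (start, strand) combinations among variant reads.

    The mate of each read is deliberately ignored, so read pairs that
    escaped duplicate marking because their mates map elsewhere still
    collapse onto a single start.
    """
    return len({(read["start"], read["strand"]) for read in variant_reads})


def low_evidence_flags(variant_reads, min_reads=5, min_novel_starts=4):
    """Return the low-evidence annotations (MR, NNS) that apply to a call."""
    flags = []
    if len(variant_reads) < min_reads:
        flags.append("MR")    # fewer than 5 variant reads
    if novel_starts(variant_reads) < min_novel_starts:
        flags.append("NNS")   # fewer than 4 novel starts
    return flags


if __name__ == "__main__":
    # Six variant reads that share only two start positions: enough reads,
    # but too few independent starts.
    stacked = ([{"start": 100, "strand": "+"}] * 3
               + [{"start": 105, "strand": "-"}] * 3)
    print(low_evidence_flags(stacked))

    # Five variant reads, each with its own start position.
    spread = [{"start": 100 + i, "strand": "+"} for i in range(5)]
    print(low_evidence_flags(spread))
```

A call with an empty flag list here would still need to pass the remaining checks in Table 3 before being annotated as PASS.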
Here, a total of 92 mutations were verified as true somatic events. At 40% tumor purity, qSNP successfully called 60 of 92 (65%) of verified somatic mutations, compared to GATK that called 55 (60%), and Strelka that called 56 (61%) verified somatic mutations (Table 6). There was substantial overlap in true positive somatic calls between the three callers; 68 of a total of 90 (76%) verified somatic mutations were called by all three software tools (Figure 2). These positions had an average of 32 mutant reads with an average mutant allele fraction of 0.52 (range 0.12 to 0.93). As tumor purity decreased, so did the number of mutations called by all three software tools (Figure 2). There were no mutations unique to GATK and Strelka that were not also called by qSNP and for all mixtures qSNP missed the fewest number of true somatic events compared to the other two callers. qSNP and GATK further called 1 private somatic mutation each that was not detected by the other callers, while Strelka called 7 private somatic mutations undetected by the other callers (Figure 2).

Table 5. Controlled mixture experiment to assess the effect of reducing tumor purity on somatic mutation detection using the SOLiD v4 platform.
Mixture (% tumor) | Cov. 80% | Mean cov. | qSNP VS | qSNP FP | qSNP U | GATK* VS | GATK* FP | GATK* U
100 | 17× | 62.16 | 84 | 17 | 2 | 50 | 7 | 1
80 | 19× | 72.13 | 73 | 5 | 10 | 49 | 1 | 2
60 | 18× | 67.49 | 66 | 6 | 6 | 45 | 0 | 4
40 | 19× | 67.67 | 57 | 1 | 8 | 38 | 0 | 3
20 | 23× | 81.96 | 35 | 3 | 2 | 15 | 0 | 1
10 | 22× | 79.35 | 13 | 5 | 5 | 0 | 0 | 1
20 | 49× | 161.11 | 48 | 5 | 6 | 18 | 0 | 8
10 | 47× | 152.11 | 15 | 4 | 5 | 0 | 0 | 8
*raw .vcf files were passed through qSNP post-processing checks outlined in Table 3 to remove likely false positives such as positions with evidence in the matched normal. VS verified somatic; FP false positive; U untested.
doi:10.1371/journal.pone.0074380.t005

Table 6. Controlled mixture experiment to assess the effect of reducing tumor purity on somatic mutation detection using the HiSeq2000 platform.
Mixture (% tumor) | Cov. 80% | Mean cov. | qSNP VS | qSNP FP | qSNP U | GATK* VS | GATK* FP | GATK* U | Strelka** VS | Strelka** FP | Strelka** U
100 | 26× | 61.43 | 82 | 1 | 72 | 80 | 1 | 72 | 77 | 1 | 66
80 | 19× | 43.05 | 77 | 0 | 60 | 76 | 0 | 57 | 75 | 2 | 57
60 | 17× | 40.57 | 65 | 1 | 45 | 62 | 1 | 39 | 60 | 2 | 44
40 | 18× | 43.36 | 60 | 0 | 45 | 55 | 0 | 30 | 56 | 1 | 45
20 | 22× | 51.83 | 47 | 0 | 22 | 37 | 0 | 14 | 48 | 1 | 26
*.vcf files were passed through qSNP post-processing checks outlined in Table 3 to remove likely false positives such as positions with evidence in the matched normal. **calls from 'pass' category. VS verified somatic; FP false positive; U untested.
doi:10.1371/journal.pone.0074380.t006

Figure 2. Overlap in somatic mutation calls. Verified somatic mutation calls were compared across three callers in 5 different tumor purity mixtures. Values are number of calls in 100%, 80%, 60%, 40% and 20% tumor content mixture, from top to bottom.
doi:10.1371/journal.pone.0074380.g002

COLO-829 whole-genome benchmarking study

The melanoma cell line COLO-829 [4] has been used previously for benchmarking new cancer analysis tools [8,9]. An aliquot of cell line and matched normal DNA were received and whole-genome sequencing was performed on both SOLiD v4 (avg. coverage 32×) and HiSeq 2000 (avg. coverage 75×). The performance of qSNP was benchmarked against calls previously published by the Wellcome Trust Sanger Institute (WTSI) that included 454 verified somatic mutations, 43 mutations previously reported in COSMIC and 32,842 untested calls [4]. On the SOLiD v4 platform qSNP called 85% of the 454 previously verified somatic mutations as well as 25 novel mutations that were verified using amplicon-based sequencing on the Ion Torrent platform (Table 7, Table S2). For untested calls there was considerable overlap between those reported by Pleasance et al. [4] and this study. For all variants called and verified by WTSI but not called by qSNP, a detailed breakdown is provided showing why the call was not made. For example, positions with insufficient coverage in the matched normal and which thus did not pass the qSNP PASS criterion are tabulated as well as positions where we observed evidence in the matched normal sample. The majority of positions where qSNP failed to make a call (5,735 or 62% of positions only called by WTSI) had less than 3 reads evidence in our SOLiD v4 sequence data. Using the HiSeq 2000 sequence data, qSNP called 85% of 454 previously reported verified somatic mutations and 26 novel mutations that were verified by Ion Torrent amplicon sequencing (Table 7). Of the positions initially reported by Pleasance et al. [4], the two re-sequencing efforts on SOLiD v4 and HiSeq 2000 identified 3531 positions that had less than 3 reads evidence for a mutant allele on both platforms (Table 7). On both platforms qSNP called a significant number of private mutations, 6486 on SOLiD v4 and 13098 on HiSeq 2000 of which 2674 were called on both platforms.

Germline variants

While qSNP was designed primarily to identify somatic mutations, a comparison of resulting germline calls was made using the COLO-829 sample and calls made by the Illumina Human 1M OmniQuad arrays, selecting all positions from the arrays that showed evidence of a non-reference allele and had a GenCall (GC) score of >0.7. The average genotype concordance at positions with at least 8 reads coverage was 95% (same genotype call), while the variant call concordance was 99% (Table S3). As sequence depth increased so did accuracy in making the correct genotype call. Positions with >70× sequence coverage had a genotype concordance of 99% and a variant call concordance of 100% (Table S3). The array data has been submitted to Gene Expression Omnibus (GEO), accession number GSE47904.

Discussion

The development of cancer genome analysis tools and somatic mutation calling software is an active area of research, but the effects of reduced tumor purity on somatic mutation calling still remain largely unexplored. Here, we present a strategy for somatic point mutation calling in low purity tumors. We have used extensive verification in primary pancreatic adenocarcinoma samples to determine a variant calling strategy that controls the false positive rate while maximizing sensitivity. When directly

substantial overlap with previously published calls and calls made by either the SOLiD v4 or HiSeq2000 platforms as well as a small number of previously undetected protein-coding somatic mutations.

In the controlled mixture experiment the single sample approach used by GATK had reduced overall sensitivity and a faster decay curve across samples of decreasing tumor purity than the joint sample callers, qSNP and Strelka, consistent with previous reports that joint sample analyses perform better for cancer analysis [9]. Using SOLiD v4 sequence data, qSNP and GATK both achieved a low false positive rate, although GATK called only 60% of known true positives in the 100% purity mixture. Using the HiSeq 2000 platform, the sensitivity of GATK was improved, but at the cost of a high total number of calls likely due to a high false positive rate that was only improved by applying the same post-processing checks as in the qSNP pipeline, such as excluding positions that had evidence of the mutation in the matched normal sample (Table 3).
assessing the accuracy and sensitivity of our approach in a The controlled mixture experiment further compared our controlled mixture experiment where samples of varying purity heuristic caller to a Bayesian approach (Strelka), demonstrating a were generated and sequenced, we demonstrate superior perfor- marginal advantage in sensitivity and false positive rate for qSNP. mance compared to other commonly used somatic mutation We believe that the success of our heuristic caller is due to its callers, for both SOLiD v4 and HiSeq 2000 data. Finally, we have ability to use minimum evidence to trigger a somatic mutation call benchmarked our caller against the COLO-829 sample and show and the use of powerful post-processing checks that control the Table 7. Benchmarking qSNP on sequencing data from the SOLiD v4 and HiSeq 2000 platforms using COLO-829 variants verified by either WTSI (WTSI only, qSNP+WTSI) or QCMG (qSNP only). SOLiD v4 HiSeq 2000 SOLiD v4 and HiSeq 2000 Caller Details ‘ ‘ ‘ ‘ ‘ ‘ ‘ ‘ ‘ VS C U VS C U VS C U qSNP+WTSI 381 33 23,544 385 39 23,660 333 30 19,276 WTSI only ,126 coverage in normal 18 5 1,329 0 0 104 0 0 26 mutation also in normal 8 0 455 19 2 1,105 0 0 19 germline in another patient 0 0 7 1 0 6 0 0 5 did not pass post-filters 16 1 1,548 24 0 1,623 1 0 86 qSNP germline call 0 0 24 0 0 63 0 0 10 no call - ,3 reads evidence 0 0 5,735 22 2 5,945 0 0 3,531 no call - other 31 4 200 3 0 336 2 0 0 qSNP only* 25 0 6,486 26 0 13,098 22 0 2,674 *min 5 mutant reads and 4 novel starts not considering pair. VS verified somatic; C cosmic; U untested. doi:10.1371/journal.pone.0074380.t007 PLOS ONE | www.plosone.org 7 November 2013 | Volume 8 | Issue 11 | e74380 Novel Somatic Mutation Calling Strategy false positive rate. Machine learning approaches such as the datasets that are adjusted for coverage and exclude common error classifier of Ding et al. 
[6] can be a powerful strategy for identifying sources, such as calls made in repeat regions, low complexity features discriminating true positive from false positive mutation sequence or near indels. These post-filters are becoming increas- calls, provided availability of orthogonal verification data for ingly important as analyses are moving from exon-capture to training of the classifier. Discriminant features can then be whole-genome sequencing datasets. Nevertheless, the large overlap incorporated in the set of heuristics for informing mutation calls. in calls between the original and the two re-sequencing datasets In addition, automated pipelines for amplicon-based verification suggests that the overall sensitivity of detection of qSNP was good, can be set up using smaller scale sequencers such as the Ion and that the remaining challenge lies in controlling platform- and Torrent or MiSeq platform. We have found this a successful software-specific error sources. strategy in pancreatic adenocarcinomas that vary widely in tumor purity. On the other hand, Bayesian approaches may be more Conclusions readily transferrable across datasets and provide some form of Accurate and sensitive somatic mutation in low purity tumors quantitative measure of the confidence for a given mutation call, remains a formidable challenge, but one of great interest to the although as discussed above these will be most useful for high study of many solid tumors. Here, we have discussed some of the coverage regions and tumors of high purity where allele key challenges in this field and strategies we have devised to handle distributions can be accurately estimated and are not confounded these. Continuous refinement of existing strategies be they by Poisson sampling effects. 
heuristics or Bayesian, as well as comparative analyses and Finally, the controlled mixture experiment demonstrated that benchmarking on a defined set of samples will be critical to further no single variant calling strategy is optimal in all aspects. While improve performance of current somatic mutation callers. there was good overlap between callers and the majority of calls were made by at least 2 callers, each caller also identified private Materials and Methods mutations not called by the others and which were verified as somatic. Different callers thus have unique benefits, although Samples qSNP missed the fewest number of true somatic events. These Primary pancreatic adenocarcinoma samples discussed in this comparisons show that there is further scope for refinement of study were accrued as part of the Australian Pancreatic Genome either mutation calling strategy to improve accuracy and Initiative (APGI) (http://www/pancreaticcancer.net.au) using an sensitivity. Where high-density SNP array data are available, we institutional approved process for consent. COLO-829 sample recommend use of a genomic tool for estimating tumor purity aliquots for the melanoma cell line and matched normal were prior to variant calling, such as the qPure software [19]. obtained from WTSI. Sample extraction and processing followed Determining the purity of a tumor will help identify the most those outlined in Biankin et al. [14]. useful thresholds for variant calling. For example, samples of high purity are expected to have a lower false negative rate and thus the Verification of somatic mutations stringency of variant calling may be increased to lower the false Verification of somatic mutation calls was performed by positive rate. Given that the qSNP analysis of a whole-exome targeted Ion Torrent sequencing using PCR primers to amplify dataset of tumor/matched normal takes only 30 minutes, multiple 70–150 bp amplicons overlapping the somatic mutation. 
Tumor different parameters can be easily trialled to assess their effect on and normal DNA was whole-genome amplified prior to PCR the total number of calls. using the Illustra GenomiPhi V2 DNA Amplification Kit (GE; 25- We used the COLO-829 sample for benchmarking both 6600-30). PCR reactions and sequencing was performed as germline and somatic mutation calls. Germline calls from qSNP outlined in Biankin et al. (2012). Briefly, PCR reactions were set were compared to those made on the Illumina 1M OmniQuad up using 10 ng of amplified gDNA and 5 uM of primers mix. Ion chip, showing that the variant call concordance was over 99% Spheres were generated using the Ion Xpress Template Kit (Life even for positions with only 8 reads coverage. As sequencing depth Technologies; 4469001) with approximately 260 million amplicon increased, so did our accuracy to make the correct genotype call. molecules per emulsion PCR, effectively yielding an emulsion Detailed comparisons of the qSNP somatic mutation calls against containing 1 amplicon molecule per Ion Sphere. Samples were the original GAIIx calls of Pleasance et al. [4] showed considerable sequenced using the Ion Sequencing Kit (Life technologies; overlap for re-sequencing data from both the SOLiD v4 and 4468997) and the Ion Chip 316 Kit (Life Technologies; 4469496). HiSeq 2000 platforms, although there were also some important Verification of somatic mutations was performed by sequence differences. For example, our re-sequencing efforts identified 3,531 pileup at each mutant position and a position was considered positions that had less than 3 reads evidence for a mutation in both verified if it has a minimum depth of 100 reads coverage in the the SOLiD v4 and HiSeq 2000 data, suggesting that these original tumor and normal, a mutant allele frequency of at least 10% in calls are false positives and may reflect differences in read tumor and less than 0.5% in normal. 
sampling, mapping or bias of the original sequencing platform. Similarly, calls private to the qSNP pipeline on either the SOLiD Controlled mixture experiment v4 or HiSeq 2000 platform likely included a large number of false positive calls as evidenced by the fact that only 2,674of these SOLiD exon capture data for the mixture experiment was taken from Biankin et al. [14]. positions unique to our datasets were called on both sequencing platforms. Our calls on the HiSeq 2000 platform appear noisier Illumina exon capture was performed using the TargetSeq judging by the total number of private calls on this platform Exome Enrichment System (Life Technologies; A14060 and (13,098) compared to calls on the SOLiD v4 platform (6,486). This A138230) according to the manufacturer’s instructions, however is likely due to the increased coverage in the HiSeq runs (756 some modifications were made to the protocol to make the kit average base coverage compared to 326 in the SOLiD v4 data), compatible with Illumina libraries. SOLiD blocking and PCR which is expected to result in more variant calls when using the oligos were replaced with Illumina TruSeq blocking and PCR same evidence thresholds. We are currently implementing and oligos derived from the NimbleGen SeqCap EZ Library SR User’s refining post-processing checks for use with HiSeq whole-genome Guide v3.0 (Roche; 06588786001). The captured libraries were PLOS ONE | www.plosone.org 8 November 2013 | Volume 8 | Issue 11 | e74380 Novel Somatic Mutation Calling Strategy washed on the Life Techologies Library Builder using an BioAnalyser 2100 using the DNA High Sensitivity Kit (Agilent; unreleased protocol (Life Technologies), and the final post- 5067-4626) to calculate the molarity and assess the size capture PCR used the protocol in the NimbleGen SeqCap EZ distribution. Libraries were then prepared for Illumina cluster Library SR User’s Guide v3.0 (Roche; 06588786001). The final generation and sequencing. 
captured libraries were run on the Agilent BioAnalyser 2100 Of the qSNP unique calls, 61 protein-coding positions were using the DNA High Sensitivity Kit (Agilent; 5067-4626) to selected for verification on the Ion Torrent platform using the calculate the molarity and assess the size distribution. Cluster same verification criteria as outlined above; 30 were confirmed as generation of the libraries was performed using the TruSeq PE true somatic events and 31 as false positives (Table S2). In Cluster Kit v3-cBot-HS (Illumina; PE-401-3001), and sequenc- addition, 3 somatic mutations originally identified by WTSI could ing carried out. The SOLiD and HiSeq.BAM files have been not be confirmed as somatic events in our verification efforts submitted to the European Genome Archive, as part of project (Table S2). EGAS00000000078. Supporting Information COLO-829 whole-genome benchmarking study Table S1 Verification of 3253 putative somatic mutation Whole-genome sequencing of the COLO-829 tumor and calls across 65 tumors. matched normal sample were performed using the SOLiD v4 (XLSX) and Illumina HiSeq 2000 sequencing platforms. For preparation of SOLiD v4 long mate-pair libraries, 13 mgofgDNAwas Table S2 A mutation file containing somatic mutations sheared to a mean size of 2.5 kb using the Covaris S2 system. for the COLO-829 data set. Shearing was completed using the Blue miniTUBEs (Covaris p/ (XLSX) n: 520065) using the standard settings for 3 kb as described in Table S3 Comparison of qSNP germline variant calls to Covaris protocol 400069 (http://http//covarisinc.com/wp- calls from SNP array analysis. content/uploads/pn_400069.pdf). Following shearing, 1 uL of (DOCX) sheared sample was run on the Agilent BioAnalyser2100 using the DNA High Sensitivity Kit (Agilent p/n: 5067-4626) to assess the shearing size and distribution. 
The entire sheared DNA Acknowledgments sample was then converted into a SOLiDH compatible Long We thank the Australian Pancreatic Genome Initiative and associated Mate Pair (LMP) library using Life Technologies 5500SOLiDH clinical collaborators (APGI, www.pancreaticcancer.net.au/collaborators) Mate-Paired Library Kit (Invitrogen p/n: 4464418) following for the sample used in the mixture experiment. We thank the Cancer the standard protocol (http://tools.invitrogen.com/content/sfs/ Genome Project at the Wellcome Trust Sanger Institute for providing the manuals/cms_093442.pdf) with 10 minutes nick translation and COLO-829 DNA samples for sequencing. We would like to thank Deborah Gwynne, Cathy Axford, Mary-Anne Brancato, Sarah Rowe, a total of 12 cycles of amplification for the final library. After Michelle Thomas, Skye Simpson, Marc Jones and Gerard Hammond for PCR amplification the libraries were assessed for molarity and central co-ordination of the Australian Pancreatic Cancer Genome size distribution using the Agilent BioAnalyser 2100 using the Initiative, data management and quality control, and Mona Martyn- DNA High Sensitivity Kit. Libraries that passed this QC were Smith, Lisa Braatvedt, Henry Tang, Virginia Papangelis and Maria Beilin prepared for SOLiDH sequencing. for biospecimen acquisition. We also thank John Shepperd, Emma For the preparation of Illumina DNA libraries, 1 mgof Campbell and Evgeny Glasov for their efforts at the Queensland Centre gDNA was sheared to a mean size of 300 bp in a 130 mL for Medical Genomics. volume using a Covaris microTUBE and the Covaris S2 system according to the standard protocol (Covaris; 010158 Rev C). Author Contributions The sheared sample was prepared into a library using the Conceived and designed the experiments: KSK OH NW SMG JVP. NEBNext DNA Library Prep Master Mix Set for Illumina Performed the experiments: OH QX DKM IH ANC AS SM SI EN CN (NEB; E6040S) according to the manufacturer’s instructions TB S. 
Wood KN AMP NW LF MA CL S. Wani FN SS DT. Analyzed the with modifications. Size selection was done using an agarose data: KSK KN AMP NW LF MA CL S. Wani FN SS DT. Contributed gel (3% agarose) instead of the AMPure XP Beads size reagents/materials/analysis tools: ALJ AVB M. Pajic M. Pinese MJC JW selection. The final libraries were run on the Agilent AJG DKC PJW. Wrote the paper: KSK OH SMG JVP. References 1. Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, et al. (2010) International 8. Saunders CT, Wong WSW, Swamy S, Becq J, Murray LJ, et al. (2012) Strelka: network of cancer genome projects. Nature 464: 993–998. accurate somatic small-variant calling from sequenced tumor-normal sample 2. TCGA (2011) Integrated genomic analyses of ovarian carcinoma. Nature 474: pairs. Bioinformatics 28: 1811–1817. 609–615. 9. Larson DE, Harris CC, Chen K, Koboldt DC, Abbott TE, et al. (2012) 3. TCGA (2012) Comprehensive molecular characterization of human colon and SomaticSniper: identification of somatic point mutations in whole genome rectal cancer. Nature 487: 330–337. sequencing data. Bioinformatics 28: 311–317. 4. Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, et al. 10. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, et al. (2010) A comprehensive catalogue of somatic mutations from a human cancer (2012) VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research 22: 568– genome. Nature 463: 191–U173. 5. Pleasance ED, Stephens PJ, O’Meara S, McBride DJ, Meynert A, et al. (2010) A 576. small-cell lung cancer genome with complex signatures of tobacco exposure. 11. Goya R, Sun MGF, Morin RD, Leung G, Ha G, et al. (2010) SNVMix: Nature 463: 184–U166. predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics 26: 730–736. 6. Ding J, Bashashati A, Roth A, Oloumi A, Tse K, et al. 
(2012) Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing 12. Altshuler DL, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, et al. data. Bioinformatics 28: 167–175. (2010) A map of human genome variation from population-scale sequencing. 7. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, et al. Nature 467: 1061–1073. 13. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, et al. (2013) (2010) The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20: Sensitive detection of somatic point mutations in impure and heterogeneous 1297–1303. cancer samples. Nature Biotechnology 31: 213–219. PLOS ONE | www.plosone.org 9 November 2013 | Volume 8 | Issue 11 | e74380 Novel Somatic Mutation Calling Strategy 14. Biankin AV, Waddell N, Kassahn KS, Gingras M-C, Muthuswamy LB, et al. 17. Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, et al. (2012) Pancreatic cancer genomes reveal aberrations in axon guidance pathway (2011) Integrative genomics viewer. Nature Biotechnology 29: 24–26. genes. Nature advance online publication. 18. Thorvaldsdo´ttir H, Robinson JT, Mesirov JP (2012) Integrative Genomics 15. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. (2009) The Viewer (IGV): high-performance genomics data visualization and exploration. Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078– Briefings in Bioinformatics. 2079. 19. Song S, Nones K, Miller D, Harliwong I, Kassahn KS, et al. (2012) qpure: A 16. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, et al. (2011) The variant tool to estimate tumor cellularity from genome-wide single-nucleotide polymor- call format and VCFtools. Bioinformatics 27: 2156–2158. phism profiles. PLoS One 7. PLOS ONE | www.plosone.org 10 November 2013 | Volume 8 | Issue 11 | e74380
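Illustrative sketch. The verification rule stated under "Verification of somatic mutations" (at least 100 reads in both tumor and normal, mutant allele frequency of at least 10% in tumor and below 0.5% in normal) and the sensitivity figures quoted from Table 6 can be expressed in a few lines of Python. This is only a minimal sketch of the stated thresholds and is not part of qSNP: the `Pileup` class, the function names and the parameter names are hypothetical, and the read counts in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pileup:
    """Read counts at one candidate position in one sample (hypothetical structure)."""
    depth: int         # total reads covering the position
    mutant_reads: int  # reads supporting the mutant allele

    @property
    def mutant_allele_frequency(self) -> float:
        # Guard against division by zero for uncovered positions.
        return self.mutant_reads / self.depth if self.depth else 0.0


def is_verified_somatic(tumor: Pileup, normal: Pileup,
                        min_depth: int = 100,
                        min_tumor_maf: float = 0.10,
                        max_normal_maf: float = 0.005) -> bool:
    """Apply the verification thresholds quoted in Materials and Methods."""
    # Both samples need sufficient amplicon coverage.
    if tumor.depth < min_depth or normal.depth < min_depth:
        return False
    # Mutant allele must be present in tumor and essentially absent in normal.
    return (tumor.mutant_allele_frequency >= min_tumor_maf
            and normal.mutant_allele_frequency < max_normal_maf)


def sensitivity_percent(called: int, verified_total: int) -> int:
    """Share of verified somatic mutations recovered by a caller, rounded to a whole percent."""
    return round(100 * called / verified_total)
```

With invented counts, `is_verified_somatic(Pileup(400, 60), Pileup(350, 1))` passes (15% in tumor, about 0.3% in normal), while the same tumor paired with `Pileup(350, 2)` fails because the normal fraction is about 0.6%. Applied to the 40% purity row of Table 6, `sensitivity_percent(60, 92)`, `sensitivity_percent(55, 92)` and `sensitivity_percent(56, 92)` reproduce the 65%, 60% and 61% quoted for qSNP, GATK and Strelka.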

Journal

PLoS ONE – Pubmed Central

Published: Nov 8, 2013
