Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Mapping-free variant calling using haplotype reconstruction from k-mer frequencies

Mapping-free variant calling using haplotype reconstruction from k-mer frequencies Motivation: The standard protocol for detecting variation in DNA is to map millions of short sequence reads to a known reference and find loci that differ. While this approach works well, it cannot be applied where the sample contains dense variants or is too distant from known referen- ces. De novo assembly or hybrid methods can recover genomic variation, but the cost of computa- tion is often much higher. We developed a novel k-mer algorithm and software implementation, Kestrel, capable of characterizing densely packed SNPs and large indels without mapping, assem- bly or de Bruijn graphs. Results: When applied to mosaic penicillin binding protein (PBP) genes in Streptococcus pneumo- niae, we found near perfect concordance with assembled contigs at a fraction of the CPU time. Multilocus sequence typing (MLST) with this approach was able to bypass de novo assemblies. Kestrel has a very low false-positive rate when applied to the whole genome, and while Kestrel identified many variants missed by other methods, limitations of a purely k-mer based approach affect overall sensitivity. Availability and implementation: Source code and documentation for a Java implementation of Kestrel can be found at https://github.com/paudano/kestrel. All test code for this publication is located at https://github.com/paudano/kescases. Contact: paudano@gatech.edu or fredrik.vannberg@biology.gatech.edu Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction typing and surveillance methods often require whole genome assembly (Thomsen et al., 2016), a database of known alleles (Li et al., 2016)or Although modern alignment tools are designed to handle errors and multiple reference sequences (Li et al., 2012). These approaches add to variation, a sequence read that differs significantly from the refer- the time required to run the analysis and to maintain allele databases. ence cannot be confidently assigned to the correct location. When The ideal method would handle bacterial diversity with a single refer- these reads are mapped, the low alignment confidence leads to low ence, and it would be capable of placing genomic alterations in a uni- variant call confidence, and it becomes difficult to separate true var- iants from false calls. In some regions, the read may be clipped or form context without switching references. not mapped at all. As a result, variant calling algorithms that rely There are several alternative methods commonly employed, but solely on alignments cannot characterize these events. they are either limited in power or consume far more computing Bacterial genomes are small and relatively simple, but they remain resources than the alignment and variant-calling protocol. Calling one of the hardest informatic targets due to their variability. variants from de novo assembled contigs may work for monoploid Horizontal gene transfer (HGT) is one mechanism that changes sam- organisms, but since it reduces the read coverage to 1, false-calls ples significantly when compared to the reference, and it often leads to cannot easily be corrected (Olson et al., 2015). Building the de drug resistance or increased virulence (Ochman et al., 2000). Bacterial Bruijn graphs that are typically employed in assembly may also V The Author(s) 2017. Published by Oxford University Press. 1659 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 1660 P.A.Audano et al. require many gigabytes of memory or computing clusters (Li et al., 2010). Cortex (Iqbal et al., 2012) was designed to overcome limita- tions of the de novo assembly approach, but it still relies on de Brujin graphs assembled over the whole genome. Platypus (Rimmer et al., 2014) performs local assemblies, however, it incurs the over- head of both read mapping and local assembly if variant loci are not known a priori. One possible approach is to use the information contained within a set of k-mer frequencies over the sequence data. K-mers can be represented numerically, they do not rely on sequence alignments, and k-mer counting methods are fast. To date, sparsely spaced SNPs have been corrected with such an approach (Gardner and Slezak, 2010; Gardner et al., 2013), but a robust variant calling algorithm has never been shown to work. Because dense SNPs and large inser- tions can create many k-mers very different from any reference sequence, how such an approach could work efficiently is not imme- Fig. 1. Overview of the Kestrel process from sequence data to variant call. diately clear. (a) The sequence reads are converted to an IKC file. (b) The reference We constructed an algorithm that relies strictly on k-mer fre- sequence is converted into an array of k-mers and left in reference order. (c) quencies over sample sequence data and ordered k-mers in a refer- K-mer frequencies from the sequence reads (vertical axis) are assigned to the ence sequence to call variants. This method first finds regions of ordered k-mers of the reference (horizontal axis). A decline and recovery of the frequencies bound an active region where one or more variants are variation over the reference using disruptions in the frequency distri- present. The recovery threshold is degraded with an exponential decay func- bution of expected k-mers, and by beginning with unaltered k-mers tion (red) to allow for declining read coverage. (d) Starting from the left anchor at the flanks of such regions, it can greatly simplify the search k-mer (last k-mer with a high frequency), the first base is removed, each possi- through k-mer space while resolving variation. Using this approach, ble base is appended, and the base that recovers the k-mer frequency is we show how Kestrel can re-build altered regions of the genome in a appended to the haplotype. (e) A modified alignment algorithm tracks haplo- targeted manner and resolve regions of dense variation. type reconstruction and terminates the process when an optimal alignment is reached. (f) This algorithm yields an alignment of the reference sequence and haplotype within the active region. (g) Variant calls are extracted from the alignment (Color version of this figure is available at Bioinformatics online.) 2 Materials and methods 2.1 Algorithm overview By relaxing these criteria and by building in either the forward or Using features in KAnalyze (Audano and Vannberg, 2014) version reverse direction, it is possible to call variants up to either end of the 2.0.0, the frequency of each k-mer in the sample is stored in an reference. indexed k-mer count file (IKC) (Fig. 1a). The IKC records are grouped by a k-mer minimizer (Roberts et al., 2004) and sorted. This structure tends to cluster k-mers from the original sequence, 2.2 Active region detection and by reading it as a memory mapped file, k-mers can be rapidly The distribution of k-mer frequencies from a set of sequence data is queried with low memory requirements using search optimizations approximately uniform with a mean equal to the average read depth similar to those implemented in Kraken (Wood and Salzberg, 2014). of the sample, and variants disrupt this distribution over reference From the reference sequence, the frequencies for each k-mer and k-mers. For example, a single SNP causes k k-mers of the sample to its reverse complement are obtained from the IKC file and summed, differ from the reference. Active region detection searches for differ- but these are left in order (Fig. 1b). Kestrel searches the resulting ences between neighbors that are less likely to have occurred by array for loci where the frequency declines and recovers (Fig. 1c), chance. Then by searching for a recovery in the downstream which suggests the k-mers of the reference and sample differ. frequencies, it attempts to resolve the left and right breakpoints in Analogous to the GATK (McKenna et al., 2010) HaplotypeCaller, k-mer space. This region is bounded by the unaltered k-mers at its this low-frequency region is called an active region, and haplotypes flanks, which we label anchor k-mers. The low-frequency and the are reconstructed over it. anchor k-mers together comprise the active region. Starting with the high-frequency k-mer on the left end of the The frequency difference between of two neighboring k-mers, N active region, the first base is removed, all four bases are appended and N , is evaluated and compared to , the threshold required to iþ1 to this (k - 1)-mer, and the k-mer frequency is queried for each of trigger an active region scan (jN  N j >). Because read depth i iþ1 the four possibilities (Fig. 1d). An equivalent process is performed can vary greatly over samples, setting this parameter to some value on the reverse complement k-mer, and the frequencies are summed. is unlikely to perform well on real sequence data. Therefore, Kestrel The base that yields a high frequency k-mer is appended to the hap- sets  to some quantile, Q , of the set of jN  N j for all neighbor- i iþ1 lotype. The new k-mer ending in that base is then used to find the ing k-mers in the reference. The default quantile, 0.90, with an next base by the same process. If more than one of these k-mers has absolute minimum of 5 was found to work well in practice. a high frequency, a haplotype is assembled for each one. Supplementary Section S1.3 discusses active region detection in A modified Smith-Waterman (Smith and Waterman, 1981)algo- more detail. rithm guides the process by aligning the active region and the haplo- Sequencing errors, PCR duplicates, GC biases and other factors type as it is reconstructed (Fig. 1e). By setting an initial score and systematically work to disrupt the uniformity of the frequency distri- disallowing links to zero score states, alignments are anchored on the bution. Therefore, any robust approach must be equipt with heuris- left and are allowed to extend until an optimal alignment is obtained. Variant calls follow trivially from the alignment (Fig. 1f and g). tics to work around such biases. Kestrel uses an exponential decay Kestrel 1661 function over the recovery threshold, and it ignores short peaks in While this algorithm is simple enough to build the haplotype the frequency distribution. sequence, it does not know when to terminate. It could continue When an active region occurs over declining read coverage, the until the right anchor k-mer is found, but such reference-na¨ve ı recovery k-mer frequency may be lower than expected. As the scan reconstruction would certainly waste many CPU cycles extending moves further from the left anchor k-mer, the expected recovery fre- erroneous sequence. Because the active region sequence and the hap- quency should be relaxed. The exponential decay function, f(x) lotype sequences are known, an alignment could be performed as it (Equation 1), is employed to reduce the recovery threshold as the is extended. A global alignment would be ideal except that it would active region extends. f(0) is the anchor k-mer frequency, and have to be recomputed each time a base is added, and performance it approaches a lower bound, f , asymptotically. By default, min of such an algorithm would be unacceptable in all but the most triv- f ¼ 0:55  fðÞ 0 to avoid ending an active region prematurely on a min ial of applications. large heterozygous variant region. f(x) is defined by scaling and An ideal algorithm would align over the active region from end to shifting the standard exponential decay function, h(x)(Equation 2). end, anchor the left ends of the haplotype and active region, and allow the right end of the haplotype to extend. To accomplish this, fxðÞ¼ðÞ fðÞ 0  f hxðÞþ f (1) min min Kestrel employs a modified Smith-Waterman alignment. Two key xk modifications were made to Smith-Waterman; (i) any subalignment hxðÞ¼ e (2) with a score of 0 cannot be extended, and (ii) the alignment must h(x) is parameterized by k, which must be also be set, but Kestrel does begin with a score greater than 0. Because of these modifications, the not configure this parameter directly because it is difficult to know alignment must begin with a non-zero score, and the gap extension how to choose a reasonable value. Instead, k is chosen by a configura- penalty must be non-zero to prevent unbounded extension. ble parameter, a, that is defined as the proportion of the decay range, Smith-Waterman is a dynamic programming approach where fðÞ 0  f ,after 1k-mer (Equation 3). This provides a more intuitive min matrices track scores by updating from shorter sub-alignments as way to define how rapidly the recovery threshold is allowed to decline. the alignment progresses (Eddy, 2004). The active region is posi- tioned over the vertical axis, and the haplotype is positioned over hkðÞ¼ a the horizontal axis of the score matrix. We define the active region kk e ¼ a base in row i as x and the haplotype base in column j as y . As hap- i j (3) kk ¼ logðÞ a lotype bases are added to the alignment, a column is appended to the matrix. Section 2.6 discusses how this is done efficiently. logðÞ a k ¼ For aligned bases, R is added to the score if they agree and match R is added if they disagree. For convenience, we define mismatch At k k-mers, the recovery threshold fkðÞ¼ a ðÞ fðÞ 0  f þ f . min min match(i, j)(Equation 5) to return R if x and y match, or match i j In other words, f(k) has declined in its range from f(0) to f by a min R if they do not. This implementation employs an affine gap mismatch factor of a. This is true for all nk such that fnðÞk ¼ a ðÞ fðÞ 0  f min model that allows for distinct gap open (R ) and gap extension open þf (Equation 4). Supplementary Section S2.2 outlines active min (R ) penalties. This type of model requires three matrices over x and gap region heuristics in more detail. y to track the scores. One matrix, S , contains scores through aligned aln xk (matched or mismatched) bases. Two more score matrices, S and hxðÞ¼ e gact S , contain scores through gaps in the active region and gaps in the logðÞ a ghap ¼ e haplotype, respectively. Using these definitions, the modified Smith- logðÞ a k (4) ¼ e Waterman score update process is defined in Equations 6–8. ¼ a u R : x ¼ y match i j matchðÞ i; j ¼ (5) R : x 6¼ y i j mismatch 2.3 Haplotype alignment > After the endpoints of an active region are found, the actual SðÞ i  1; j  1þ matchðÞ i; j : > aln sequence (haplotype) from the sample must be reconstructed. This SðÞ i  1; j  1 > 0 aln process is similar to a local assembly from k-mers, but it is imple- S ðÞ i  1; j  1þ matchðÞ i; j : S ðÞ i; j ¼ max (6) aln gact mented to keep resource usage at a minimum. Since the reference > S ðÞ i  1; j  1 > 0 sequence over the whole region is known and anchor k-mers have gact been chosen, the search through k-mers in the sample can be greatly > > S ðÞ i  1; j  1þ matchðÞ i; j : ghap simplified. S ðÞ i  1; j  1 > 0 ghap The process begins by initializing the haplotype with the left anchor k-mer. Then by removing the left-most base of the k-mer and appending a new base to the end, the k-mer can be shifted one base > S ðÞ i; j  1þ R þ R : > aln open gap to the right. Each nucleotide can be appended and the frequency checked. The base that produces a k-mer with the highest frequency S ðÞ i; j  1 > 0 aln is appended to the haplotype. If more than one base produces an S ðÞ i; j ¼ max S ðÞ i; j  1þ R : (7) gact gact gap acceptable frequency, then the alignment is split by saving the state S ðÞ i; j  1 > 0 > gact of reconstruction with alternate bases, continuing with the highest- frequency k-mer, and returning to the alternatives after the current > S ðÞ i; j  1þ R þ R : ghap open gap haplotype is built. In this way, multiple haplotypes may be built S ðÞ i; j  1 > 0 ghap over one active region. 1662 P.A.Audano et al. > will always report that the first base was deleted even though the > alignment score would be the same for a deletion at any locus of the S ðÞ i  1; jþ R þ R : > aln open gap > repeat. The effect is to left-align variant calls. S ðÞ i  1; j > 0 > aln Approximate read depth of an active region is estimated by S ðÞ i  1; jþ R þ R : S ðÞ i; j ¼ max gact open gap (8) ghap summing the depth of all haplotypes, which may include the refer- > S ðÞ i  1; j > 0 ence haplotype. Similarly, summing only the haplotypes that sup- gact > port a variant call gives the approximate depth of the variant. > S ðÞ i  1; jþ R : ghap gap : Because haplotypes may share k-mers, we use the minimum k-mer S ðÞ i  1; j > 0 ghap frequency as a conservative estimate of read depth of any single haplotype. These estimates may be useful for filtering variants with To initialize the alignment, the bases of the active region are low support. positioned along the vertical axis of the matrices, and each base of the anchor k-mer creates one column. The first base of the sequences is in row and column 1. Row and column 0 exist for convenience, 2.6 Alignment implementation but they do not align any bases. The initial alignment score, R Because of the nature of the dynamic programming algorithm, only init is assigned (S ðÞ i; i ¼ R ; 0  i  k) where k is the size of aln init the last column of the score matrices (S , S and S ) needs to aln gact ghap k-mers. All other scores in all three matrices are initialized to 0. be stored while the next column is built. Therefore, each of these A fourth matrix, T, contains traceback information from the matrices can be reduced to two arrays where one contains the last end of an alignment to S ðÞ 0; 0 . It is initialized so that aln column, and one contains the new column being added. When alter- TiðÞ ; i ! TiðÞ  1; i  1 ; 0 < i  k. This initialization of S and T aln nate haplotypes are explored and the alignment splits, only one creates a single path for the anchor k-mer in the alignment, and all array for each matrix must be saved. acceptable alignments must enter this path at S ðÞ k; k . The align- aln The traceback matrix, T, is more complex because it is not ment extends from the anchor k-mer as already described. stored as a matrix. Instead, it is a linked-list of alignment states that Supplementary Section S1.4-7 outlines the alignment process and always leads back to S ðÞ 0; 0 . When a non-zero score is added to a aln data structures in more detail. score matrix, a link is added to T. For each non-zero score in score matrices, a link from the matrix to a node in T is stored. 2.4 Alignment termination Since T is a linked list that is only traversed toward S ðÞ 0; 0 , aln Since the alignment must cover all of the active region, only the last one node of the alignment may have several links into it. Therefore, row of S needs to be queried to find the best score for the current aln different haplotypes may link to the same node in T where they split alignment, R (Equation 9). The maximum potential score that max and no part of T needs to be duplicated. Both haplotypes will trace might be obtained by adding more bases is determined by examining back to the point where they diverged and continue toward S aln the last column of S . The best possible score from S ðÞ i; j is the aln aln ðÞ 0; 0 along the same path. case where all subsequent bases of the active region are aligned with The linked list structure has another more subtle property that matched based, maxpot(i, j)(Equation 10), and R is the maxi- maxpot Java uses to keep memory usage low. When an alignment path mum of maxpot(i, j)(Equation 11). If R > R , haplotype max maxpot reaches a dead end, the node at the end of the path has no reference extension terminates. Supplementary Section S1.8-9 further outlines to it. This allows Java Virtual Machine (JVM) garbage collection optimal scores and alignment termination. (GC) to detect and remove these nodes. In other words, GC can automatically prune dead branches of T. This improves scalabilty by R ¼ maxfS ðÞ jxj; j ; 0  j jyjg (9) max aln reducing the memory requirements for large active regions where > 0 : many haplotypes are investigated. S ðÞ i; j ¼ 0 aln maxpotðÞ i; j ¼ (10) S ðÞ i; j þðÞ jxj i R : aln match 3 Results S ðÞ i; j > 0 aln 3.1 S.pneumoniae test case We analyzed the four penicillin binding protein (PBP) genes of R ¼ maxðÞ fmaxpotðÞ i; jyj ; 0  i jxjg (11) maxpot Streptococcus pneumoniae (S.pneumoniae) targeted by b-lactam compounds. According to several studies, 20% or more of a PBP gene may be altered by inter-species recombination (Laible et al., 2.5 Variant calling 1991; Martin et al., 1992), and this can create mosaic PBP genes The variants are interpreted from the alignment, and the locations with a lower b-lactam binding affinity. Because these recombination of the variants are translated with respect to the location of the events often alter hundreds of contiguous bases, variant calling from active region. The alignment provides a convenient way to identify a standard alignment pipeline cannot characterize them. mismatched bases and gaps in either sequence. We obtained 181 samples from 29 serotypes recently released If any cell of trace matrix, T, has more than one path out, then by the Centers for Disease Control and Prevention from NCBI there is more than one optimal alignment, and so there are multiple under BioProject PRJNA284954 (SRR2072210-SRR2072387, ways to translate the alignment to variant calls. When comparing SRR2076738-SRR2076740). These data are whole-genome 250 bp two alignments, the one with the first non-matching base is given a paired-end Illumina sequence reads ranging from 15 Mbp to 1234 higher priority. If the non-matching bases agree (same variant), then Mbp (median ¼ 323 Mbp). Selecting the best reference out of more the next non-matching base is queried. If the non-matching bases do than 10 is often necessary (Li et al., 2012), however, we tested not agree, then alignments are prioritized by mismatch, insertion and deletion, in that order. This gives the algorithm predictable out- Kestrel’s ability to characterize variants using a single reference for put for cases such as a deletion in a homopolymer repeat; Kestrel all samples, TIGR4 (NC_003028.3). Kestrel 1663 We called all variants in four PBP genes (PBP2X, PBP1A, PBP2B Kestrel required an average of 13.3 CPU minutes per million and PBP2A) using three distinct approaches. The first is a standard 250 bp reads (min/M-reads), the alignment approach required an alignment pipeline using BWA (Li and Durbin, 2009, 2010), Picard average of 14.5 min/M-reads, and the assembly approach required and GATK (McKenna et al., 2010) HaplotypeCaller. The second is 106.2 min/M-reads (Fig. 3d). For all steps of the pipeline, Kestrel Kestrel. The third is a de novo assembly pipeline using SPAdes consumed an average of 1.0 GB of memory, GATK consumed an (Bankevich et al., 2012), BWA and SAMtools (Li et al., 2009). average of 2.7 GB, and the assembly approach consumed 9.1 GB. Variants identified by the assembly where the depth of aligned con- Maximum memory consumption was 2.3, 3.6 and 15.7 GB respec- tigs is 1 were defined as the true variants. HaplotypeCaller and tively for Kestrel, GATK and assembly (Fig. 3e). Runtime metrics Kestrel variant calls were compared to the true set with RTG were obtained on a 12 core machine (2 x Intel Xeon E5-2620) with (Cleary et al., 2015) vcfeval. 32 GB of RAM (DDR3-1600), RAID-6 over SATA drives (3 GB/s, Contamination in the PBP genes was identified in SRR2072298, 72 K RPM) and CentOS 6.7. SRR2072306, SRR2072339, SRR2072342, SRR2072351 and SRR2072379, and these samples are not included in our analysis 3.2 MLST test case results. For all of these samples, there were many variant calls in the Multilocus Sequence Typing (MLST) attempts to cluster samples by PBP genes with a relative depth of 0.70 or less, which is unlikely in a analyzing the content of some set of genes. The current method is to monoploid organism. The size of the IKC files is also larger than build a BLAST (Altschul et al., 1990) database from de novo expected, which supports our conclusion that they contain allelic assembled contigs and to use the database for comparing each allele variation not present in the host genome (Fig. 2). Based on the IKC to the sequence data (Jolley and Maiden, 2010; Larsen et al., 2012). files, two more samples (SRR2072345 and SRR2072352) may also A whole-genome assembly is an expensive operation, so we tested contain non-host or extra-chromosomal material, but since we saw Kestrel’s ability to bypass it. no evidence for this in the variant calls over the PBP genes, these We obtained 7 Neisseria meningitidis (N.meningitidis) samples samples were included. Two other samples (SRR2072219 and from ENA study PRJEB3353 (ERR193671-ERR193677) (Reuter SRR2072360) did not have complete sequence data and were also et al., 2013) and allele sequences for 7 house-keeping genes (adk, removed. They contain 10 Mbp and 37 Mbp, respectively, where aroE, abcZ, fumC, gdh, pgm and pdhC) from pubMLST (Jolley and the median has 323 Mbp and the next lowest sample has 92 Mbp. Maiden, 2010). These data are whole-genome 150 bp paired-end Supplementary SectionS 3.4 discusses how removed samples were Illumina reads. For each sample, Kestrel identified the best allele by identified. using each allele as a reference sequence. With the minimum k-mer frequency set to 5, sequence read Sequence reads were assembled with SPAdes using default errors are filtered out of the distribution of k-mers. It is then inter- options, and contigs were used to construct a BLAST database. Each esting to note that a sample has a finite set of solid k-mers, and once this set is reached, the IKC file ceases to grow in size. BAM files allele downloaded from pubMLST for N.meningitidis was used as a must store a record for each read, and so they grow approximately BLAST query, and the best allele for each gene was chosen. With the linearly with read depth. The samples suspected of contamination allele calls, the sequence type (ST) was identified by comparing and low coverage indeed show an increase and a decrease, respec- against sequence type profiles also obtained from pubMLST. This tively, in IKC file size (Fig. 2). set of alleles and the sequence type represents the expected results Kestrel produced 29 806 true positive (TP) variant calls with based on the current methods. 73 false positive (FP) and 100 false negative (FN) calls (Fig. 3a) Sequence reads were also transformed to an IKC file with (sensitivity ¼ 1.00, FDR ¼ 0.00). Because mosaic regions disrupted KAnalyze using 31-mers, a 15-base minimizer, a minimum k-mer the alignments, GATK was only able to produce 17 636 TP variant frequency of 5 and a minimum allele depth of 0.50. Each allele was calls with 12 FP and 11 777 FN calls (Fig. 3b) (sensitivity ¼ 0.60, used as a reference sequence for variant calling. For each gene, the FDR ¼ 0.00). GATK tends to represent dense SNPs as insertion/dele- allele with the fewest variants was chosen as the best match. If no tion pairs, and so the expected true variants differ between the two variants were detected for an allele, the k-mer counts over the allele approaches. This is a diverse set of samples varying in their relation- reference was checked to ensure that it was present in the sequence ship with the reference and mosaic content, and a few samples con- data. The best allele matches were used to identify the ST by com- tribute most of the variants (Fig. 3c). paring them against types from pubMLST. There was 100% concordance between the assembly and Kestrel methods. All samples were called with a minimum k-mer frequency of 5 except ERR193672, which was reduced to 2 to call gene pgm because of low coverage. The assembly-based MLST approach required an average of 51.6 min/M-reads and average of 7.1 GB of memory (max 7.8 GB). The Kestrel MLST approach required 10.4 min/M-reads and 1.6 GB of memory (max 1.7 GB). Fig. 2. Size of BAM files vs IKC files for each sample. Since low frequency k- mers are removed, the size of an IKC file does not continue to grow with read 3.3 E.coli test case depth once a full set of representative k-mers are present. Since BAM files have a record for each read, their size does increase with the number of We tested Kestrel’s ability to call variants on whole genomes using reads. Samples removed for suspected contamination and low coverage are 309 Escherichia coli (E.coli) samples obtained from assembled con- shown in red. Low-coverage samples lack a representative set of k-mers at tigs in the supplementary information published by Salipante et al. sufficient frequency, and so the file size falls below the distribution. Samples (2015) (Dataset S7.tar.gz). The study reports on 312 samples, but with contamination contain k-mers that do not belong to the sample, and so two were not found in the online dataset (upec-240 and upec-52) their size rises above the distribution (Color version of this figure is available at Bioinformatics online.) and one consistently caused RTG vcfeval to crash (upec-9). 1664 P.A.Audano et al. Fig. 3. Results of testing Kestrel and the alignment approach using GATK over 173 S.pneumoniae samples. (a) Variant call summary for Kestrel calls depicting TP, FP, FN calls. (b) Variant call summary for GATK calls depicting TP, FP and FN calls. (c) A plot depicting the phylogeny of all samples by ANI (inner track), the distance from the reference (white) by ANI as a blue-yellow-red heatmap (middle track), and the relative number of variants in each sample (outer track). (d) CPU minutes per million reads. When an assembly is required, the required CPU time increases. (e) Maximum memory usage in gigabytes (GB) for each pipeline (Color version of this figure is available at Bioinformatics online.) The contigs were aligned to an E.coli K-12 reference and a sensitivity of 0.84 If we examine only regions where Kestrel (NC_000913.3), and variants between the reference and contig were had an active region, the sensitivity rose to 0.98 for Kestrel and identified. To minimize the effect of assembly errors, we used ART 0.97 for GATK, and the FDR remained 0.01 for Kestrel and 0.00 (Huang et al., 2012) with contig sequences to simulate 150 bp for GATK. paired-end reads to an average depth of 30. Kestrel and GATK Although the sensitivity was lower for Kestrel, there are regions where Kestrel consistently makes calls that are missed by the GATK were run on the 4.6 Mbp reference using the same tools and pipeline. We compared all Kestrel TP calls that were not within approach as the S.pneumoniae contigs where the contig depth was at least 1 (mean 4.0 Mbp). 50 bp of any GATK TP call, merged the resulting calls within 50 bp, KAnalyze and Kestrel were applied to these simulated reads with and found 780 regions affecting 86 unique genes where calls were the same parameters as the S.pneumoniae reads (k-mer size 31, qual- missed by GATK in more than 20 samples (Supplementary Table S5 ity 30, min depth 5). An additional parameter was given to Kestrel and Supplementary Fig. S7). to discard variants over ambiguous reference bases, such as N, so that RTG vcfeval had an equivalent set of variants from both 4 Discussion approaches. With all variant call sets, RTG vcfeval was used to test the calls. Kestrel must limit the size of active regions when calling on The Kestrel algorithm is a framework for calling variants from whole genomes or it will try to resolve erroneously large regions and sequence data using evidence found only in k-mer space. It provides perform poorly. For larger genomes, increasing the k-mer size may a mechanism for detecting regions of variation, a method for resolv- help reduce k-mer distribution noise. ing the variation to variant calls, and a set of heuristics that make In addition to a VCF, Kestrel can output a SAM file of the haplotypes it work with real data. Inference on k-mers has been applied to it assembles and aligns. The SAM feature was enabled for this experi- RNAseq analysis (Bray et al., 2016; Patro et al., 2014), metagenom- ment so that the locations of haplotypes could be identified. Variant sta- ics (Wood and Salzberg, 2014), phylogenetics (Gardner and Slezak, tistics were calculated twice; once over all regions where the alignment 2010; Gardner et al., 2013) and many other problems in bioinfor- depth was one and once for all regions where there was an assembled matics. To our knowledge, Kestrel is a first-in-class variant calling haplotype. The current implementation of Kestrel discards active regions implementation using only k-mers. with one wildtype haplotype, so the second analysis only applies to Although such a method is unorthodox with respect to current regions where at least one variant was identified (mean 2.3 Mbp). standards, it is able to capture variation in regions where alignments Over the whole genome, Kestrel yielded an FDR of 0.01, but a are too noisy. In these cases, a k-mer approach can displace alterna- sensitivity of only 0.71. GATK yielded an FDR of 0.00 (0.0029) tives that require far more computing resources and time to Kestrel 1665 Bray,N.L. et al. (2016) Near-optimal probabilistic RNA-seq quantification. complete, such as de novo assemblies. By applying this method to Nat. Biotechnol., 34, 888–525527. other domains where assemblies are the current standard, such as Cleary,J.G. et al. (2015) Comparing variant call files for performance bench- MLST and pathogen surveillance pipelines, this approach can have marking of next-generation sequencing variant calling pipelines. bioRxiv, a dramatic effect on computing resources. Despite several advantages, a purely k-mer-based algorithm such Eddy,S.R. (2004) What is dynamic programming? Nat. Biotechnol., 22, 909–910. as this does have limitations. The most significant of these is that Gardner,S.N. et al. (2013) When whole-genome alignments just won’t work: paired-end and sequence read context is lost when shredding kSNP v2 software for alignment-free SNP discovery and phylogenetics of sequence reads into k-mers. This information is valuable for resolv- hundreds of microbial genomes. PLoS One, 8, e81760. ing variants in repetitive, duplicated or low complexity regions of Gardner,S.N. and Slezak,T. (2010) Scalable SNP Analyses of 100þ Bacterial or Viral Genomes. J. Forensic Res., 1, 107. the genome. Some evidence of these events is still present in k-mer Huang,W. et al. (2012) ART: A next-generation sequencing read simulator. space, such as frequency peaks, and although Kestrel attempts to Bioinformatics, 28, 593–594. work through these events, alignments or assemblies will often have Iqbal,Z. et al. (2012) De novo assembly and genotyping of variants using col- higher accuracy. ored de Bruijn graphs. Nat. Genet., 44, 226–232. As a practical tool, Kestrel is a fast alternative to existing meth- Jolley,K.A. and Maiden,M.C.J. (2010) BIGSdb: Scalable analysis of bacterial ods when applied to specific regions of the genome. We expect it to genome variation at the population level. BMC Bioinformatics, 11, 595. find use as a targeted tool, as part of automated surveillance pipe- Ko ¨ ser,C.U. et al. (2012) Routine use of microbial whole genome sequencing in lines or an orthogonal approach for other callers. In its current diagnostic and public health microbiology. PLoS Pathogens, 8, e1002824. form, we cannot purport it to be a whole-genome alternative for Laible,G. et al. (1991) Interspecies recombinational events during the evolu- tion of altered pbp 2x genes in penicillin-resistant clinical isolates of software like Cortex or Platypus. Future developments may improve Streptococcus pneumoniae. Mol. Microbiol., 5, 1993–2002. the limitations of this approach, or the algorithm may find use in a Larsen,M.V. et al. (2012) Multilocus sequence typing of total genome hybrid tool making use of k-mers and other data. For example, the sequenced bacteria. J. Clin. Microbiol., 50, 1355–1361. GATK HaplotypeCaller might fall-back to such a k-mer method to Li,G. et al. (2012) Complete genome sequence of Streptococcus pneumoniae rescue variants in a region where alignments suffer because of dense Strain ST556, a multidrug-resistant isolate from an otitis media patient. SNVs or large indels. J. Bacteriol., 194, 3294–3295. In many cases, algorithms that are free of alignments and Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with assemblies can reduce the demand on computing resources. As Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760. high-volume sequencing technology becomes faster and cheaper, Li,H. and Durbin,R. (2010) Fast and accurate long-read alignment with reducing the cost of data storage and analysis becomes critical Burrows-Wheeler transform. Bioinformatics, 26, 589–595. Li,H. et al. (2009) The Sequence Alignment/Map format and SAMtools. (Ko ¨ ser et al., 2012; Sboner et al., 2011). Kestrel’s contribution was Bioinformatics, 25, 2078–2079. only marginal for calling variants where alignments would suffice, Li,Y. et al. (2010) State of the art de novo assembly of human genomes from but its ability to perform analysis without more costly methods massively parallel sequencing data. Hum. Genomics, 4, 271–277. gives it a significant advantage. The computing resources Kestrel Li,Y. et al. (2016) Penicillin-binding protein transpeptidase signatures for requires is well within the capabilities of a modern laptop, which is tracking and predicting b-lactam resistance levels in Streptococcus pneumo- far less expensive than the powerful machines or computing clus- niae. mBio, 7, 1–9. ters that would be required to do the same work in the same Martin,C. et al. (1992) Relatedness of penicillin-binding protein 1a genes from amount of time. Such a cost savings may conserve vital research different clones of penicillin-resistant Streptococcus pneumoniae isolated in South Africa and Spain. EMBO J., 11, 3831–3836. funds or enable routine analysis where funding and equipment are McKenna,A. et al. (2010) The Genome Analysis Toolkit: a MapReduce frame- not as readily available. work for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303. Ochman,H. et al. (2000) Lateral gene transfer and the nature of bacterial inno- Acknowledgements vation. Nature, 405, 299–304. Ben Metcalf and Bernard Beall from the Centers for Disease Control and Olson,N.D. et al. (2015) Best practices for evaluating single nucleotide variant Prevention provided data and insight for the S.pneumoniae experiment. calling methods for microbial genomics. Front. Genet., 6, 1–15. Dawn Audano designed the graphic for Figure 1. Patro,R. et al. (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol., 32, 462–464. Reuter,S. et al. (2013) Rapid bacterial whole-genome sequencing to enhance diag- Funding nostic and public health microbiology. JAMA Int. Med., 173, 1397–1404. This work was supported by the National Institutes of Health. P.A.’s contri- Rimmer,A. et al. (2014) Integrating mapping-, assembly- and haplotype-based bution funded through T32GM105490, which was made possible by Dr. approaches for calling variants in clinical sequencing applications. Nat. Gregory Gibson. F.O.V. and R.S. are supported by R01EB025022, Genet., 46, 912–918. R01ES026243, and funds awarded to F.O.V. by the Georgia Institute of Roberts,M. et al. (2004) Reducing storage requirements for biological Technology. sequence comparison. Bioinformatics, 20, 3363–3369. Salipante,S.J. et al. (2015) Large-scale genomic sequencing of extraintestinal Conflict of Interest: none declared. pathogenic Escherichia coli strains. Genome Res., 25, 119–128. Sboner,A. et al. (2011) The real cost of sequencing: higher than you think! Genome Biol., 12, 125. References Smith,T. and Waterman,M. (1981) Identification of common molecular subse- Altschul,S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, quences. J. Mol. Biol., 147, 195–197. 403–410. Thomsen,M.C.F. et al. (2016) A bacterial analysis platform: an integrated sys- Audano,P. and Vannberg,F. (2014) KAnalyze: a fast versatile pipelined K-mer tem for analysing bacterial whole genome sequencing data for clinical diag- toolkit. Bioinformatics, 30, 2070–2072. nostics and surveillance. PLoS One, 11, 1–14. Bankevich,A. et al. (2012) SPAdes: a new genome assembly algorithm and its Wood,D.E. and Salzberg,S.L. (2014) Kraken: ultrafast metagenomic sequence applications to single-cell sequencing. J. Comput. Biol., 19, 455–477. classification using exact alignments. Genome Biol., 15, 1–12. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Mapping-free variant calling using haplotype reconstruction from k-mer frequencies

Loading next page...
 
/lp/ou_press/mapping-free-variant-calling-using-haplotype-reconstruction-from-k-mer-0uT5v601U2
Publisher
Oxford University Press
Copyright
© The Author(s) 2017. Published by Oxford University Press.
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btx753
Publisher site
See Article on Publisher Site

Abstract

Motivation: The standard protocol for detecting variation in DNA is to map millions of short sequence reads to a known reference and find loci that differ. While this approach works well, it cannot be applied where the sample contains dense variants or is too distant from known referen- ces. De novo assembly or hybrid methods can recover genomic variation, but the cost of computa- tion is often much higher. We developed a novel k-mer algorithm and software implementation, Kestrel, capable of characterizing densely packed SNPs and large indels without mapping, assem- bly or de Bruijn graphs. Results: When applied to mosaic penicillin binding protein (PBP) genes in Streptococcus pneumo- niae, we found near perfect concordance with assembled contigs at a fraction of the CPU time. Multilocus sequence typing (MLST) with this approach was able to bypass de novo assemblies. Kestrel has a very low false-positive rate when applied to the whole genome, and while Kestrel identified many variants missed by other methods, limitations of a purely k-mer based approach affect overall sensitivity. Availability and implementation: Source code and documentation for a Java implementation of Kestrel can be found at https://github.com/paudano/kestrel. All test code for this publication is located at https://github.com/paudano/kescases. Contact: paudano@gatech.edu or fredrik.vannberg@biology.gatech.edu Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction typing and surveillance methods often require whole genome assembly (Thomsen et al., 2016), a database of known alleles (Li et al., 2016)or Although modern alignment tools are designed to handle errors and multiple reference sequences (Li et al., 2012). These approaches add to variation, a sequence read that differs significantly from the refer- the time required to run the analysis and to maintain allele databases. ence cannot be confidently assigned to the correct location. When The ideal method would handle bacterial diversity with a single refer- these reads are mapped, the low alignment confidence leads to low ence, and it would be capable of placing genomic alterations in a uni- variant call confidence, and it becomes difficult to separate true var- iants from false calls. In some regions, the read may be clipped or form context without switching references. not mapped at all. As a result, variant calling algorithms that rely There are several alternative methods commonly employed, but solely on alignments cannot characterize these events. they are either limited in power or consume far more computing Bacterial genomes are small and relatively simple, but they remain resources than the alignment and variant-calling protocol. Calling one of the hardest informatic targets due to their variability. variants from de novo assembled contigs may work for monoploid Horizontal gene transfer (HGT) is one mechanism that changes sam- organisms, but since it reduces the read coverage to 1, false-calls ples significantly when compared to the reference, and it often leads to cannot easily be corrected (Olson et al., 2015). Building the de drug resistance or increased virulence (Ochman et al., 2000). Bacterial Bruijn graphs that are typically employed in assembly may also V The Author(s) 2017. Published by Oxford University Press. 1659 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 1660 P.A.Audano et al. require many gigabytes of memory or computing clusters (Li et al., 2010). Cortex (Iqbal et al., 2012) was designed to overcome limita- tions of the de novo assembly approach, but it still relies on de Brujin graphs assembled over the whole genome. Platypus (Rimmer et al., 2014) performs local assemblies, however, it incurs the over- head of both read mapping and local assembly if variant loci are not known a priori. One possible approach is to use the information contained within a set of k-mer frequencies over the sequence data. K-mers can be represented numerically, they do not rely on sequence alignments, and k-mer counting methods are fast. To date, sparsely spaced SNPs have been corrected with such an approach (Gardner and Slezak, 2010; Gardner et al., 2013), but a robust variant calling algorithm has never been shown to work. Because dense SNPs and large inser- tions can create many k-mers very different from any reference sequence, how such an approach could work efficiently is not imme- Fig. 1. Overview of the Kestrel process from sequence data to variant call. diately clear. (a) The sequence reads are converted to an IKC file. (b) The reference We constructed an algorithm that relies strictly on k-mer fre- sequence is converted into an array of k-mers and left in reference order. (c) quencies over sample sequence data and ordered k-mers in a refer- K-mer frequencies from the sequence reads (vertical axis) are assigned to the ence sequence to call variants. This method first finds regions of ordered k-mers of the reference (horizontal axis). A decline and recovery of the frequencies bound an active region where one or more variants are variation over the reference using disruptions in the frequency distri- present. The recovery threshold is degraded with an exponential decay func- bution of expected k-mers, and by beginning with unaltered k-mers tion (red) to allow for declining read coverage. (d) Starting from the left anchor at the flanks of such regions, it can greatly simplify the search k-mer (last k-mer with a high frequency), the first base is removed, each possi- through k-mer space while resolving variation. Using this approach, ble base is appended, and the base that recovers the k-mer frequency is we show how Kestrel can re-build altered regions of the genome in a appended to the haplotype. (e) A modified alignment algorithm tracks haplo- targeted manner and resolve regions of dense variation. type reconstruction and terminates the process when an optimal alignment is reached. (f) This algorithm yields an alignment of the reference sequence and haplotype within the active region. (g) Variant calls are extracted from the alignment (Color version of this figure is available at Bioinformatics online.) 2 Materials and methods 2.1 Algorithm overview By relaxing these criteria and by building in either the forward or Using features in KAnalyze (Audano and Vannberg, 2014) version reverse direction, it is possible to call variants up to either end of the 2.0.0, the frequency of each k-mer in the sample is stored in an reference. indexed k-mer count file (IKC) (Fig. 1a). The IKC records are grouped by a k-mer minimizer (Roberts et al., 2004) and sorted. This structure tends to cluster k-mers from the original sequence, 2.2 Active region detection and by reading it as a memory mapped file, k-mers can be rapidly The distribution of k-mer frequencies from a set of sequence data is queried with low memory requirements using search optimizations approximately uniform with a mean equal to the average read depth similar to those implemented in Kraken (Wood and Salzberg, 2014). of the sample, and variants disrupt this distribution over reference From the reference sequence, the frequencies for each k-mer and k-mers. For example, a single SNP causes k k-mers of the sample to its reverse complement are obtained from the IKC file and summed, differ from the reference. Active region detection searches for differ- but these are left in order (Fig. 1b). Kestrel searches the resulting ences between neighbors that are less likely to have occurred by array for loci where the frequency declines and recovers (Fig. 1c), chance. Then by searching for a recovery in the downstream which suggests the k-mers of the reference and sample differ. frequencies, it attempts to resolve the left and right breakpoints in Analogous to the GATK (McKenna et al., 2010) HaplotypeCaller, k-mer space. This region is bounded by the unaltered k-mers at its this low-frequency region is called an active region, and haplotypes flanks, which we label anchor k-mers. The low-frequency and the are reconstructed over it. anchor k-mers together comprise the active region. Starting with the high-frequency k-mer on the left end of the The frequency difference between of two neighboring k-mers, N active region, the first base is removed, all four bases are appended and N , is evaluated and compared to , the threshold required to iþ1 to this (k - 1)-mer, and the k-mer frequency is queried for each of trigger an active region scan (jN  N j >). Because read depth i iþ1 the four possibilities (Fig. 1d). An equivalent process is performed can vary greatly over samples, setting this parameter to some value on the reverse complement k-mer, and the frequencies are summed. is unlikely to perform well on real sequence data. Therefore, Kestrel The base that yields a high frequency k-mer is appended to the hap- sets  to some quantile, Q , of the set of jN  N j for all neighbor- i iþ1 lotype. The new k-mer ending in that base is then used to find the ing k-mers in the reference. The default quantile, 0.90, with an next base by the same process. If more than one of these k-mers has absolute minimum of 5 was found to work well in practice. a high frequency, a haplotype is assembled for each one. Supplementary Section S1.3 discusses active region detection in A modified Smith-Waterman (Smith and Waterman, 1981)algo- more detail. rithm guides the process by aligning the active region and the haplo- Sequencing errors, PCR duplicates, GC biases and other factors type as it is reconstructed (Fig. 1e). By setting an initial score and systematically work to disrupt the uniformity of the frequency distri- disallowing links to zero score states, alignments are anchored on the bution. Therefore, any robust approach must be equipt with heuris- left and are allowed to extend until an optimal alignment is obtained. Variant calls follow trivially from the alignment (Fig. 1f and g). tics to work around such biases. Kestrel uses an exponential decay Kestrel 1661 function over the recovery threshold, and it ignores short peaks in While this algorithm is simple enough to build the haplotype the frequency distribution. sequence, it does not know when to terminate. It could continue When an active region occurs over declining read coverage, the until the right anchor k-mer is found, but such reference-na¨ve ı recovery k-mer frequency may be lower than expected. As the scan reconstruction would certainly waste many CPU cycles extending moves further from the left anchor k-mer, the expected recovery fre- erroneous sequence. Because the active region sequence and the hap- quency should be relaxed. The exponential decay function, f(x) lotype sequences are known, an alignment could be performed as it (Equation 1), is employed to reduce the recovery threshold as the is extended. A global alignment would be ideal except that it would active region extends. f(0) is the anchor k-mer frequency, and have to be recomputed each time a base is added, and performance it approaches a lower bound, f , asymptotically. By default, min of such an algorithm would be unacceptable in all but the most triv- f ¼ 0:55  fðÞ 0 to avoid ending an active region prematurely on a min ial of applications. large heterozygous variant region. f(x) is defined by scaling and An ideal algorithm would align over the active region from end to shifting the standard exponential decay function, h(x)(Equation 2). end, anchor the left ends of the haplotype and active region, and allow the right end of the haplotype to extend. To accomplish this, fxðÞ¼ðÞ fðÞ 0  f hxðÞþ f (1) min min Kestrel employs a modified Smith-Waterman alignment. Two key xk modifications were made to Smith-Waterman; (i) any subalignment hxðÞ¼ e (2) with a score of 0 cannot be extended, and (ii) the alignment must h(x) is parameterized by k, which must be also be set, but Kestrel does begin with a score greater than 0. Because of these modifications, the not configure this parameter directly because it is difficult to know alignment must begin with a non-zero score, and the gap extension how to choose a reasonable value. Instead, k is chosen by a configura- penalty must be non-zero to prevent unbounded extension. ble parameter, a, that is defined as the proportion of the decay range, Smith-Waterman is a dynamic programming approach where fðÞ 0  f ,after 1k-mer (Equation 3). This provides a more intuitive min matrices track scores by updating from shorter sub-alignments as way to define how rapidly the recovery threshold is allowed to decline. the alignment progresses (Eddy, 2004). The active region is posi- tioned over the vertical axis, and the haplotype is positioned over hkðÞ¼ a the horizontal axis of the score matrix. We define the active region kk e ¼ a base in row i as x and the haplotype base in column j as y . As hap- i j (3) kk ¼ logðÞ a lotype bases are added to the alignment, a column is appended to the matrix. Section 2.6 discusses how this is done efficiently. logðÞ a k ¼ For aligned bases, R is added to the score if they agree and match R is added if they disagree. For convenience, we define mismatch At k k-mers, the recovery threshold fkðÞ¼ a ðÞ fðÞ 0  f þ f . min min match(i, j)(Equation 5) to return R if x and y match, or match i j In other words, f(k) has declined in its range from f(0) to f by a min R if they do not. This implementation employs an affine gap mismatch factor of a. This is true for all nk such that fnðÞk ¼ a ðÞ fðÞ 0  f min model that allows for distinct gap open (R ) and gap extension open þf (Equation 4). Supplementary Section S2.2 outlines active min (R ) penalties. This type of model requires three matrices over x and gap region heuristics in more detail. y to track the scores. One matrix, S , contains scores through aligned aln xk (matched or mismatched) bases. Two more score matrices, S and hxðÞ¼ e gact S , contain scores through gaps in the active region and gaps in the logðÞ a ghap ¼ e haplotype, respectively. Using these definitions, the modified Smith- logðÞ a k (4) ¼ e Waterman score update process is defined in Equations 6–8. ¼ a u R : x ¼ y match i j matchðÞ i; j ¼ (5) R : x 6¼ y i j mismatch 2.3 Haplotype alignment > After the endpoints of an active region are found, the actual SðÞ i  1; j  1þ matchðÞ i; j : > aln sequence (haplotype) from the sample must be reconstructed. This SðÞ i  1; j  1 > 0 aln process is similar to a local assembly from k-mers, but it is imple- S ðÞ i  1; j  1þ matchðÞ i; j : S ðÞ i; j ¼ max (6) aln gact mented to keep resource usage at a minimum. Since the reference > S ðÞ i  1; j  1 > 0 sequence over the whole region is known and anchor k-mers have gact been chosen, the search through k-mers in the sample can be greatly > > S ðÞ i  1; j  1þ matchðÞ i; j : ghap simplified. S ðÞ i  1; j  1 > 0 ghap The process begins by initializing the haplotype with the left anchor k-mer. Then by removing the left-most base of the k-mer and appending a new base to the end, the k-mer can be shifted one base > S ðÞ i; j  1þ R þ R : > aln open gap to the right. Each nucleotide can be appended and the frequency checked. The base that produces a k-mer with the highest frequency S ðÞ i; j  1 > 0 aln is appended to the haplotype. If more than one base produces an S ðÞ i; j ¼ max S ðÞ i; j  1þ R : (7) gact gact gap acceptable frequency, then the alignment is split by saving the state S ðÞ i; j  1 > 0 > gact of reconstruction with alternate bases, continuing with the highest- frequency k-mer, and returning to the alternatives after the current > S ðÞ i; j  1þ R þ R : ghap open gap haplotype is built. In this way, multiple haplotypes may be built S ðÞ i; j  1 > 0 ghap over one active region. 1662 P.A.Audano et al. > will always report that the first base was deleted even though the > alignment score would be the same for a deletion at any locus of the S ðÞ i  1; jþ R þ R : > aln open gap > repeat. The effect is to left-align variant calls. S ðÞ i  1; j > 0 > aln Approximate read depth of an active region is estimated by S ðÞ i  1; jþ R þ R : S ðÞ i; j ¼ max gact open gap (8) ghap summing the depth of all haplotypes, which may include the refer- > S ðÞ i  1; j > 0 ence haplotype. Similarly, summing only the haplotypes that sup- gact > port a variant call gives the approximate depth of the variant. > S ðÞ i  1; jþ R : ghap gap : Because haplotypes may share k-mers, we use the minimum k-mer S ðÞ i  1; j > 0 ghap frequency as a conservative estimate of read depth of any single haplotype. These estimates may be useful for filtering variants with To initialize the alignment, the bases of the active region are low support. positioned along the vertical axis of the matrices, and each base of the anchor k-mer creates one column. The first base of the sequences is in row and column 1. Row and column 0 exist for convenience, 2.6 Alignment implementation but they do not align any bases. The initial alignment score, R Because of the nature of the dynamic programming algorithm, only init is assigned (S ðÞ i; i ¼ R ; 0  i  k) where k is the size of aln init the last column of the score matrices (S , S and S ) needs to aln gact ghap k-mers. All other scores in all three matrices are initialized to 0. be stored while the next column is built. Therefore, each of these A fourth matrix, T, contains traceback information from the matrices can be reduced to two arrays where one contains the last end of an alignment to S ðÞ 0; 0 . It is initialized so that aln column, and one contains the new column being added. When alter- TiðÞ ; i ! TiðÞ  1; i  1 ; 0 < i  k. This initialization of S and T aln nate haplotypes are explored and the alignment splits, only one creates a single path for the anchor k-mer in the alignment, and all array for each matrix must be saved. acceptable alignments must enter this path at S ðÞ k; k . The align- aln The traceback matrix, T, is more complex because it is not ment extends from the anchor k-mer as already described. stored as a matrix. Instead, it is a linked-list of alignment states that Supplementary Section S1.4-7 outlines the alignment process and always leads back to S ðÞ 0; 0 . When a non-zero score is added to a aln data structures in more detail. score matrix, a link is added to T. For each non-zero score in score matrices, a link from the matrix to a node in T is stored. 2.4 Alignment termination Since T is a linked list that is only traversed toward S ðÞ 0; 0 , aln Since the alignment must cover all of the active region, only the last one node of the alignment may have several links into it. Therefore, row of S needs to be queried to find the best score for the current aln different haplotypes may link to the same node in T where they split alignment, R (Equation 9). The maximum potential score that max and no part of T needs to be duplicated. Both haplotypes will trace might be obtained by adding more bases is determined by examining back to the point where they diverged and continue toward S aln the last column of S . The best possible score from S ðÞ i; j is the aln aln ðÞ 0; 0 along the same path. case where all subsequent bases of the active region are aligned with The linked list structure has another more subtle property that matched based, maxpot(i, j)(Equation 10), and R is the maxi- maxpot Java uses to keep memory usage low. When an alignment path mum of maxpot(i, j)(Equation 11). If R > R , haplotype max maxpot reaches a dead end, the node at the end of the path has no reference extension terminates. Supplementary Section S1.8-9 further outlines to it. This allows Java Virtual Machine (JVM) garbage collection optimal scores and alignment termination. (GC) to detect and remove these nodes. In other words, GC can automatically prune dead branches of T. This improves scalabilty by R ¼ maxfS ðÞ jxj; j ; 0  j jyjg (9) max aln reducing the memory requirements for large active regions where > 0 : many haplotypes are investigated. S ðÞ i; j ¼ 0 aln maxpotðÞ i; j ¼ (10) S ðÞ i; j þðÞ jxj i R : aln match 3 Results S ðÞ i; j > 0 aln 3.1 S.pneumoniae test case We analyzed the four penicillin binding protein (PBP) genes of R ¼ maxðÞ fmaxpotðÞ i; jyj ; 0  i jxjg (11) maxpot Streptococcus pneumoniae (S.pneumoniae) targeted by b-lactam compounds. According to several studies, 20% or more of a PBP gene may be altered by inter-species recombination (Laible et al., 2.5 Variant calling 1991; Martin et al., 1992), and this can create mosaic PBP genes The variants are interpreted from the alignment, and the locations with a lower b-lactam binding affinity. Because these recombination of the variants are translated with respect to the location of the events often alter hundreds of contiguous bases, variant calling from active region. The alignment provides a convenient way to identify a standard alignment pipeline cannot characterize them. mismatched bases and gaps in either sequence. We obtained 181 samples from 29 serotypes recently released If any cell of trace matrix, T, has more than one path out, then by the Centers for Disease Control and Prevention from NCBI there is more than one optimal alignment, and so there are multiple under BioProject PRJNA284954 (SRR2072210-SRR2072387, ways to translate the alignment to variant calls. When comparing SRR2076738-SRR2076740). These data are whole-genome 250 bp two alignments, the one with the first non-matching base is given a paired-end Illumina sequence reads ranging from 15 Mbp to 1234 higher priority. If the non-matching bases agree (same variant), then Mbp (median ¼ 323 Mbp). Selecting the best reference out of more the next non-matching base is queried. If the non-matching bases do than 10 is often necessary (Li et al., 2012), however, we tested not agree, then alignments are prioritized by mismatch, insertion and deletion, in that order. This gives the algorithm predictable out- Kestrel’s ability to characterize variants using a single reference for put for cases such as a deletion in a homopolymer repeat; Kestrel all samples, TIGR4 (NC_003028.3). Kestrel 1663 We called all variants in four PBP genes (PBP2X, PBP1A, PBP2B Kestrel required an average of 13.3 CPU minutes per million and PBP2A) using three distinct approaches. The first is a standard 250 bp reads (min/M-reads), the alignment approach required an alignment pipeline using BWA (Li and Durbin, 2009, 2010), Picard average of 14.5 min/M-reads, and the assembly approach required and GATK (McKenna et al., 2010) HaplotypeCaller. The second is 106.2 min/M-reads (Fig. 3d). For all steps of the pipeline, Kestrel Kestrel. The third is a de novo assembly pipeline using SPAdes consumed an average of 1.0 GB of memory, GATK consumed an (Bankevich et al., 2012), BWA and SAMtools (Li et al., 2009). average of 2.7 GB, and the assembly approach consumed 9.1 GB. Variants identified by the assembly where the depth of aligned con- Maximum memory consumption was 2.3, 3.6 and 15.7 GB respec- tigs is 1 were defined as the true variants. HaplotypeCaller and tively for Kestrel, GATK and assembly (Fig. 3e). Runtime metrics Kestrel variant calls were compared to the true set with RTG were obtained on a 12 core machine (2 x Intel Xeon E5-2620) with (Cleary et al., 2015) vcfeval. 32 GB of RAM (DDR3-1600), RAID-6 over SATA drives (3 GB/s, Contamination in the PBP genes was identified in SRR2072298, 72 K RPM) and CentOS 6.7. SRR2072306, SRR2072339, SRR2072342, SRR2072351 and SRR2072379, and these samples are not included in our analysis 3.2 MLST test case results. For all of these samples, there were many variant calls in the Multilocus Sequence Typing (MLST) attempts to cluster samples by PBP genes with a relative depth of 0.70 or less, which is unlikely in a analyzing the content of some set of genes. The current method is to monoploid organism. The size of the IKC files is also larger than build a BLAST (Altschul et al., 1990) database from de novo expected, which supports our conclusion that they contain allelic assembled contigs and to use the database for comparing each allele variation not present in the host genome (Fig. 2). Based on the IKC to the sequence data (Jolley and Maiden, 2010; Larsen et al., 2012). files, two more samples (SRR2072345 and SRR2072352) may also A whole-genome assembly is an expensive operation, so we tested contain non-host or extra-chromosomal material, but since we saw Kestrel’s ability to bypass it. no evidence for this in the variant calls over the PBP genes, these We obtained 7 Neisseria meningitidis (N.meningitidis) samples samples were included. Two other samples (SRR2072219 and from ENA study PRJEB3353 (ERR193671-ERR193677) (Reuter SRR2072360) did not have complete sequence data and were also et al., 2013) and allele sequences for 7 house-keeping genes (adk, removed. They contain 10 Mbp and 37 Mbp, respectively, where aroE, abcZ, fumC, gdh, pgm and pdhC) from pubMLST (Jolley and the median has 323 Mbp and the next lowest sample has 92 Mbp. Maiden, 2010). These data are whole-genome 150 bp paired-end Supplementary SectionS 3.4 discusses how removed samples were Illumina reads. For each sample, Kestrel identified the best allele by identified. using each allele as a reference sequence. With the minimum k-mer frequency set to 5, sequence read Sequence reads were assembled with SPAdes using default errors are filtered out of the distribution of k-mers. It is then inter- options, and contigs were used to construct a BLAST database. Each esting to note that a sample has a finite set of solid k-mers, and once this set is reached, the IKC file ceases to grow in size. BAM files allele downloaded from pubMLST for N.meningitidis was used as a must store a record for each read, and so they grow approximately BLAST query, and the best allele for each gene was chosen. With the linearly with read depth. The samples suspected of contamination allele calls, the sequence type (ST) was identified by comparing and low coverage indeed show an increase and a decrease, respec- against sequence type profiles also obtained from pubMLST. This tively, in IKC file size (Fig. 2). set of alleles and the sequence type represents the expected results Kestrel produced 29 806 true positive (TP) variant calls with based on the current methods. 73 false positive (FP) and 100 false negative (FN) calls (Fig. 3a) Sequence reads were also transformed to an IKC file with (sensitivity ¼ 1.00, FDR ¼ 0.00). Because mosaic regions disrupted KAnalyze using 31-mers, a 15-base minimizer, a minimum k-mer the alignments, GATK was only able to produce 17 636 TP variant frequency of 5 and a minimum allele depth of 0.50. Each allele was calls with 12 FP and 11 777 FN calls (Fig. 3b) (sensitivity ¼ 0.60, used as a reference sequence for variant calling. For each gene, the FDR ¼ 0.00). GATK tends to represent dense SNPs as insertion/dele- allele with the fewest variants was chosen as the best match. If no tion pairs, and so the expected true variants differ between the two variants were detected for an allele, the k-mer counts over the allele approaches. This is a diverse set of samples varying in their relation- reference was checked to ensure that it was present in the sequence ship with the reference and mosaic content, and a few samples con- data. The best allele matches were used to identify the ST by com- tribute most of the variants (Fig. 3c). paring them against types from pubMLST. There was 100% concordance between the assembly and Kestrel methods. All samples were called with a minimum k-mer frequency of 5 except ERR193672, which was reduced to 2 to call gene pgm because of low coverage. The assembly-based MLST approach required an average of 51.6 min/M-reads and average of 7.1 GB of memory (max 7.8 GB). The Kestrel MLST approach required 10.4 min/M-reads and 1.6 GB of memory (max 1.7 GB). Fig. 2. Size of BAM files vs IKC files for each sample. Since low frequency k- mers are removed, the size of an IKC file does not continue to grow with read 3.3 E.coli test case depth once a full set of representative k-mers are present. Since BAM files have a record for each read, their size does increase with the number of We tested Kestrel’s ability to call variants on whole genomes using reads. Samples removed for suspected contamination and low coverage are 309 Escherichia coli (E.coli) samples obtained from assembled con- shown in red. Low-coverage samples lack a representative set of k-mers at tigs in the supplementary information published by Salipante et al. sufficient frequency, and so the file size falls below the distribution. Samples (2015) (Dataset S7.tar.gz). The study reports on 312 samples, but with contamination contain k-mers that do not belong to the sample, and so two were not found in the online dataset (upec-240 and upec-52) their size rises above the distribution (Color version of this figure is available at Bioinformatics online.) and one consistently caused RTG vcfeval to crash (upec-9). 1664 P.A.Audano et al. Fig. 3. Results of testing Kestrel and the alignment approach using GATK over 173 S.pneumoniae samples. (a) Variant call summary for Kestrel calls depicting TP, FP, FN calls. (b) Variant call summary for GATK calls depicting TP, FP and FN calls. (c) A plot depicting the phylogeny of all samples by ANI (inner track), the distance from the reference (white) by ANI as a blue-yellow-red heatmap (middle track), and the relative number of variants in each sample (outer track). (d) CPU minutes per million reads. When an assembly is required, the required CPU time increases. (e) Maximum memory usage in gigabytes (GB) for each pipeline (Color version of this figure is available at Bioinformatics online.) The contigs were aligned to an E.coli K-12 reference and a sensitivity of 0.84 If we examine only regions where Kestrel (NC_000913.3), and variants between the reference and contig were had an active region, the sensitivity rose to 0.98 for Kestrel and identified. To minimize the effect of assembly errors, we used ART 0.97 for GATK, and the FDR remained 0.01 for Kestrel and 0.00 (Huang et al., 2012) with contig sequences to simulate 150 bp for GATK. paired-end reads to an average depth of 30. Kestrel and GATK Although the sensitivity was lower for Kestrel, there are regions where Kestrel consistently makes calls that are missed by the GATK were run on the 4.6 Mbp reference using the same tools and pipeline. We compared all Kestrel TP calls that were not within approach as the S.pneumoniae contigs where the contig depth was at least 1 (mean 4.0 Mbp). 50 bp of any GATK TP call, merged the resulting calls within 50 bp, KAnalyze and Kestrel were applied to these simulated reads with and found 780 regions affecting 86 unique genes where calls were the same parameters as the S.pneumoniae reads (k-mer size 31, qual- missed by GATK in more than 20 samples (Supplementary Table S5 ity 30, min depth 5). An additional parameter was given to Kestrel and Supplementary Fig. S7). to discard variants over ambiguous reference bases, such as N, so that RTG vcfeval had an equivalent set of variants from both 4 Discussion approaches. With all variant call sets, RTG vcfeval was used to test the calls. Kestrel must limit the size of active regions when calling on The Kestrel algorithm is a framework for calling variants from whole genomes or it will try to resolve erroneously large regions and sequence data using evidence found only in k-mer space. It provides perform poorly. For larger genomes, increasing the k-mer size may a mechanism for detecting regions of variation, a method for resolv- help reduce k-mer distribution noise. ing the variation to variant calls, and a set of heuristics that make In addition to a VCF, Kestrel can output a SAM file of the haplotypes it work with real data. Inference on k-mers has been applied to it assembles and aligns. The SAM feature was enabled for this experi- RNAseq analysis (Bray et al., 2016; Patro et al., 2014), metagenom- ment so that the locations of haplotypes could be identified. Variant sta- ics (Wood and Salzberg, 2014), phylogenetics (Gardner and Slezak, tistics were calculated twice; once over all regions where the alignment 2010; Gardner et al., 2013) and many other problems in bioinfor- depth was one and once for all regions where there was an assembled matics. To our knowledge, Kestrel is a first-in-class variant calling haplotype. The current implementation of Kestrel discards active regions implementation using only k-mers. with one wildtype haplotype, so the second analysis only applies to Although such a method is unorthodox with respect to current regions where at least one variant was identified (mean 2.3 Mbp). standards, it is able to capture variation in regions where alignments Over the whole genome, Kestrel yielded an FDR of 0.01, but a are too noisy. In these cases, a k-mer approach can displace alterna- sensitivity of only 0.71. GATK yielded an FDR of 0.00 (0.0029) tives that require far more computing resources and time to Kestrel 1665 Bray,N.L. et al. (2016) Near-optimal probabilistic RNA-seq quantification. complete, such as de novo assemblies. By applying this method to Nat. Biotechnol., 34, 888–525527. other domains where assemblies are the current standard, such as Cleary,J.G. et al. (2015) Comparing variant call files for performance bench- MLST and pathogen surveillance pipelines, this approach can have marking of next-generation sequencing variant calling pipelines. bioRxiv, a dramatic effect on computing resources. Despite several advantages, a purely k-mer-based algorithm such Eddy,S.R. (2004) What is dynamic programming? Nat. Biotechnol., 22, 909–910. as this does have limitations. The most significant of these is that Gardner,S.N. et al. (2013) When whole-genome alignments just won’t work: paired-end and sequence read context is lost when shredding kSNP v2 software for alignment-free SNP discovery and phylogenetics of sequence reads into k-mers. This information is valuable for resolv- hundreds of microbial genomes. PLoS One, 8, e81760. ing variants in repetitive, duplicated or low complexity regions of Gardner,S.N. and Slezak,T. (2010) Scalable SNP Analyses of 100þ Bacterial or Viral Genomes. J. Forensic Res., 1, 107. the genome. Some evidence of these events is still present in k-mer Huang,W. et al. (2012) ART: A next-generation sequencing read simulator. space, such as frequency peaks, and although Kestrel attempts to Bioinformatics, 28, 593–594. work through these events, alignments or assemblies will often have Iqbal,Z. et al. (2012) De novo assembly and genotyping of variants using col- higher accuracy. ored de Bruijn graphs. Nat. Genet., 44, 226–232. As a practical tool, Kestrel is a fast alternative to existing meth- Jolley,K.A. and Maiden,M.C.J. (2010) BIGSdb: Scalable analysis of bacterial ods when applied to specific regions of the genome. We expect it to genome variation at the population level. BMC Bioinformatics, 11, 595. find use as a targeted tool, as part of automated surveillance pipe- Ko ¨ ser,C.U. et al. (2012) Routine use of microbial whole genome sequencing in lines or an orthogonal approach for other callers. In its current diagnostic and public health microbiology. PLoS Pathogens, 8, e1002824. form, we cannot purport it to be a whole-genome alternative for Laible,G. et al. (1991) Interspecies recombinational events during the evolu- tion of altered pbp 2x genes in penicillin-resistant clinical isolates of software like Cortex or Platypus. Future developments may improve Streptococcus pneumoniae. Mol. Microbiol., 5, 1993–2002. the limitations of this approach, or the algorithm may find use in a Larsen,M.V. et al. (2012) Multilocus sequence typing of total genome hybrid tool making use of k-mers and other data. For example, the sequenced bacteria. J. Clin. Microbiol., 50, 1355–1361. GATK HaplotypeCaller might fall-back to such a k-mer method to Li,G. et al. (2012) Complete genome sequence of Streptococcus pneumoniae rescue variants in a region where alignments suffer because of dense Strain ST556, a multidrug-resistant isolate from an otitis media patient. SNVs or large indels. J. Bacteriol., 194, 3294–3295. In many cases, algorithms that are free of alignments and Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with assemblies can reduce the demand on computing resources. As Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760. high-volume sequencing technology becomes faster and cheaper, Li,H. and Durbin,R. (2010) Fast and accurate long-read alignment with reducing the cost of data storage and analysis becomes critical Burrows-Wheeler transform. Bioinformatics, 26, 589–595. Li,H. et al. (2009) The Sequence Alignment/Map format and SAMtools. (Ko ¨ ser et al., 2012; Sboner et al., 2011). Kestrel’s contribution was Bioinformatics, 25, 2078–2079. only marginal for calling variants where alignments would suffice, Li,Y. et al. (2010) State of the art de novo assembly of human genomes from but its ability to perform analysis without more costly methods massively parallel sequencing data. Hum. Genomics, 4, 271–277. gives it a significant advantage. The computing resources Kestrel Li,Y. et al. (2016) Penicillin-binding protein transpeptidase signatures for requires is well within the capabilities of a modern laptop, which is tracking and predicting b-lactam resistance levels in Streptococcus pneumo- far less expensive than the powerful machines or computing clus- niae. mBio, 7, 1–9. ters that would be required to do the same work in the same Martin,C. et al. (1992) Relatedness of penicillin-binding protein 1a genes from amount of time. Such a cost savings may conserve vital research different clones of penicillin-resistant Streptococcus pneumoniae isolated in South Africa and Spain. EMBO J., 11, 3831–3836. funds or enable routine analysis where funding and equipment are McKenna,A. et al. (2010) The Genome Analysis Toolkit: a MapReduce frame- not as readily available. work for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303. Ochman,H. et al. (2000) Lateral gene transfer and the nature of bacterial inno- Acknowledgements vation. Nature, 405, 299–304. Ben Metcalf and Bernard Beall from the Centers for Disease Control and Olson,N.D. et al. (2015) Best practices for evaluating single nucleotide variant Prevention provided data and insight for the S.pneumoniae experiment. calling methods for microbial genomics. Front. Genet., 6, 1–15. Dawn Audano designed the graphic for Figure 1. Patro,R. et al. (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol., 32, 462–464. Reuter,S. et al. (2013) Rapid bacterial whole-genome sequencing to enhance diag- Funding nostic and public health microbiology. JAMA Int. Med., 173, 1397–1404. This work was supported by the National Institutes of Health. P.A.’s contri- Rimmer,A. et al. (2014) Integrating mapping-, assembly- and haplotype-based bution funded through T32GM105490, which was made possible by Dr. approaches for calling variants in clinical sequencing applications. Nat. Gregory Gibson. F.O.V. and R.S. are supported by R01EB025022, Genet., 46, 912–918. R01ES026243, and funds awarded to F.O.V. by the Georgia Institute of Roberts,M. et al. (2004) Reducing storage requirements for biological Technology. sequence comparison. Bioinformatics, 20, 3363–3369. Salipante,S.J. et al. (2015) Large-scale genomic sequencing of extraintestinal Conflict of Interest: none declared. pathogenic Escherichia coli strains. Genome Res., 25, 119–128. Sboner,A. et al. (2011) The real cost of sequencing: higher than you think! Genome Biol., 12, 125. References Smith,T. and Waterman,M. (1981) Identification of common molecular subse- Altschul,S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, quences. J. Mol. Biol., 147, 195–197. 403–410. Thomsen,M.C.F. et al. (2016) A bacterial analysis platform: an integrated sys- Audano,P. and Vannberg,F. (2014) KAnalyze: a fast versatile pipelined K-mer tem for analysing bacterial whole genome sequencing data for clinical diag- toolkit. Bioinformatics, 30, 2070–2072. nostics and surveillance. PLoS One, 11, 1–14. Bankevich,A. et al. (2012) SPAdes: a new genome assembly algorithm and its Wood,D.E. and Salzberg,S.L. (2014) Kraken: ultrafast metagenomic sequence applications to single-cell sequencing. J. Comput. Biol., 19, 455–477. classification using exact alignments. Genome Biol., 15, 1–12.

Journal

BioinformaticsOxford University Press

Published: May 15, 2018

References