Z. Stephens, Sk Lee, F. Faghri, R. Campbell, ChengXiang Zhai, Miles Efron, R. Iyer, M. Schatz, S. Sinha, G. Robinson (2015)
Big Data: Astronomical or Genomical? PLoS Biology, 13
Heng Li (2015)
BGT: efficient and flexible genotype query across many samples. Bioinformatics, 32(4)
Sebastian Deorowicz, S. Grabowski (2013)
Data compression for sequencing data. Algorithms for Molecular Biology, 8
Danecek (2011)
The variant call format and VCFtools. Bioinformatics, 27, 2156
Peter Sudmant, T. Rausch, E. Gardner, R. Handsaker, A. Abyzov, J. Huddleston, Yan Zhang, K. Ye, G. Jun, Markus Fritz, Miriam Konkel, A. Malhotra, A. Stütz, Xinghua Shi, Francesco Casale, Jieming Chen, F. Hormozdiari, Gargi Dayama, Ken Chen, M. Malig, Mark Chaisson, Klaudia Walter, Sascha Meiers, Seva Kashin, Erik Garrison, A. Auton, Hugo Lam, Xinmeng Mu, C. Alkan, Danny Antaki, Taejeong Bae, Eliza Cerveira, P. Chines, Zechen Chong, Laura Clarke, E. Dal, L. Ding, Sarah Emery, Xian Fan, Madhusudan Gujral, F. Kahveci, J. Kidd, Yu Kong, Eric-Wubbo Lameijer, Shane McCarthy, Paul Flicek, R. Gibbs, Gabor Marth, C. Mason, A. Menelaou, D. Muzny, Bradley Nelson, Amina Noor, N. Parrish, Matthew Pendleton, Andrew Quitadamo, Benjamin Raeder, E. Schadt, Mallory Romanovitch, A. Schlattl, R. Sebra, A. Shabalin, A. Untergasser, Jerilyn Walker, Min Wang, F. Yu, Chengsheng Zhang, Jing Zhang, Xiangqun Zheng-Bradley, Wanding Zhou, T. Zichner, J. Sebat, M. Batzer, S. Mccarroll, Ryan Mills, M. Gerstein, A. Bashir, O. Stegle, S. Devine, Charles Lee, E. Eichler, J. Korbel (2015)
An integrated map of structural variation in 2,504 human genomes. Nature, 526
Ryan Layer, Neil Kindlon, K. Karczewski, A. Quinlan (2015)
Efficient genotype compression and analysis of large genetic variation datasets. Nature Methods, 13
D. Knuth (1998)
The Art of Computer Programming, Vol. 3: Sorting and Searching, 426
R. Raman, Venkatesh Raman, S. Satti (2007)
Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms (TALG), 3
S. McCarthy, et al. (2016)
A reference panel of 64,976 haplotypes for genome imputation. Nat. Genet., 48, 1279
Xiuwen Zheng, S. Gogarten, Michael Lawrence, A. Stilp, M. Conomos, B. Weir, C. Laurie, David Levine (2017)
SeqArray—a storage-efficient high-performance data format for WGS variant calls. Bioinformatics, 33
Heng Li, R. Handsaker, Alec Wysoker, T. Fennell, Jue Ruan, Nils Homer, Gabor Marth, G. Abecasis, R. Durbin (2009)
The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25
D. Salomon, G. Motta (2009)
Handbook of Data Compression
Simon Gog, Timo Beller, Alistair Moffat, M. Petri (2013)
From Theory to Practice: Plug and Play with Succinct Data Structures. arXiv, abs/1311.1249
D. Johnson, L. McGeoch (1997)
The Traveling Salesman Problem: A Case Study in Local Optimization. In: E. Aarts, J. Lenstra (eds), Local Search in Combinatorial Optimization, 215
Kedar Tatwawadi, M. Hernaez, Idoia Ochoa, T. Weissman (2016)
GTRAC: fast retrieval from compressed collections of genomic variants. Bioinformatics, 32(17)
G. Navarro, Eliana Providel (2012)
Fast, Small, Simple Rank/Select on Bitmaps
M. Lek, et al. (2016)
Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536
Ibrahim Numanagić, J. Bonfield, Faraz Hach, Jan Voges, J. Ostermann, C. Alberti, M. Mattavelli, S. Sahinalp (2016)
Comparison of high-throughput sequencing data compression tools. Nature Methods, 13
Sebastian Deorowicz, Agnieszka Danek, S. Grabowski (2013)
Genome compression: a novel approach for large collections. Bioinformatics, 29(20)
R. Durbin (2014)
Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 30
GTC: how to maintain huge genotype collections in a compressed form

A. Danek and S. Deorowicz

Motivation: Nowadays, genome sequencing is frequently used in many research centers. In projects such as the Haplotype Reference Consortium or the Exome Aggregation Consortium, huge databases of genotypes in large populations are determined. Together with the increasing size of these collections, the need for fast and memory-frugal ways of representing and searching them becomes crucial.

Results: We present GTC (GenoType Compressor), a novel compressed data structure for representation of huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest publicly available database of about 60 000 haplotypes at about 40 million SNPs can be stored in <4 GB, while the queries related to variants are answered in a fraction of a second.

Availability and implementation: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc.

Contact: sebastian.deorowicz@polsl.pl

Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

1 Introduction

In the last two decades, the throughput of genome sequencers increased by a few orders of magnitude. At the same time, the sequencing cost of a single human individual decreased from over 1 billion to about 1 thousand US dollars. Stephens et al. (2015) predicted that in 2025 genomic data will be acquired at a rate of 1 zettabase/year, which means that about 2–40 exabytes/year of genome data should be deposited for future analysis. In addition, the prices of data storage and transfer decrease only moderately (Deorowicz and Grabowski, 2013), which means that keeping the data management costs under control becomes a real challenge.

Recently, tools for compression of sequenced reads have been benchmarked (Numanagić et al., 2016). The ability of the examined utilities to shrink the data about ten times is remarkable. Nevertheless, much more is necessary. The obvious option is to resign from the storage of raw data (in most experiments) and focus just on the results of variant calling, deposited usually in Variant Call Format (VCF) files (Danecek et al., 2011). The famous initiatives, such as the 1000 Genomes Project (1000 GP) (Sudmant et al., 2015) or the 100 000 Genomes Project (https://www.genomicsengland.co.uk/the-100000-genomes-project-by-numbers/), deliver VCF files for thousands of samples. Moreover, the scale of current projects, similar to the Haplotype Reference Consortium (HRC) (McCarthy et al., 2016) or the Exome Aggregation Consortium (ExAC) (Lek et al., 2016), is larger by an order of magnitude. For example, the VCF files of the HRC consist of 64 976 haplotypes at about 39.2 million SNPs and occupy 4.3 TB. It is also clear that much larger databases will be formed in the near future.

VCF files contain a list of variants in a collection of genomes as well as evidence of presence of a reference/nonreference allele at each specific variant position in each genome. As they are intensively searched, the applied compression scheme should support fast queries of various types. The indexed and gzipped VCF files can be effectively queried using VCFtools (Danecek et al., 2011) or BCFtools when the query is about a single variant or a range of them. Unfortunately, retrieving sample data means time-consuming decompression and processing of a whole file.
The recently introduced genotype query tools (GQTs) (Layer et al., 2016) made use of a specialized compression algorithm for VCF files. GQT was designed to compare sample genotypes among many variant loci, but did not allow retrieval of a specified variant or of sample data. Shortly after that, Li (2016) proposed BGT, based on the positional Burrows–Wheeler transform (Durbin, 2014). It offered more than 10-fold better compression than GQT and supported queries for genotypes as well as variants. Moreover, it allowed restricting the range of samples according to some metadata conditions. The SeqArray (SA) library (Zheng et al., 2017) for the R programming language is yet another solution to effectively compress and browse VCF files. The applied compression is based on the LZMA algorithm (Salomon and Motta, 2010).

In this article, we introduce GenoType Compressor (GTC), intended to store genetic variation data from VCF files in a highly compressed form. Our data structure, as confirmed by experiments, is much faster (often by an order of magnitude) in various types of queries. It is also significantly more compact than the competitive solutions. These features allow users to maintain huge collections of genotype data even on typical laptops with high speed of access and could change the way huge collections of VCF files are processed.

2 Materials and methods

2.1 General idea and definitions

GTC is a new tool for compressed representation of genotype data supporting fast queries of various types. It compresses a collection of genotypes in a VCF/BCF format and allows for queries about genotypes:

- at a specified range of variant positions (e.g. a single variant site), referred to as a variant query,
- in case of specified samples (e.g. a single sample), referred to as a sample query,
- with a combination of the above two.

The ploidy of individuals determines the number of haplotypes that make up a single genotype. For diploid organisms, a genotype of an individual is defined by two separate haplotypes.

For a precise description of the proposed algorithm, let us introduce some terms. As an input, we have a VCF/BCF file that describes H haplotypes at V biallelic sites (single variant allele). For any bit vector X, X[i] is the ith bit of this vector and |X| denotes its length.
2.2 Sketch of the algorithm

In the beginning, GTC divides the variants into blocks of some number of consecutive entries b_size (according to preliminary experiments, we picked the value b_size = 3584; a justification for such a choice can be found in the Supplementary Material) and processes each block separately (Fig. 1). Bit vectors representing presence/absence of reference alleles in all haplotypes are formed. In fact, two bit vectors, referred to as variant bit vectors, are necessary to describe each variant, as four possibilities must be covered: (i) a reference allele, (ii) a nonreference allele, (iii) the other nonreference allele or (iv) the unknown allele.

The haplotypes in each block are independently permuted to minimize the number of differences (i.e. the Hamming distance) between successive entries. As determination of the best permutation of haplotypes is equivalent to solving the Traveling Salesperson Problem (Johnson and McGeoch, 1997), it is practically impossible to find the optimal solution in a reasonable amount of time. Thus, the nearest neighbor heuristic (Johnson and McGeoch, 1997) is used to quickly calculate a reasonably good solution. There are better algorithms to perform this task (in terms of minimizing the total number of differences between neighbor haplotypes). Unfortunately, they are too slow to be applied in our case. The description of the found permutation must be stored for each block to allow retrieval of the original data.

The permuted haplotypes are compressed (still within blocks) using a hybrid of specialized techniques inspired by the Ziv–Lempel compression algorithm, run-length encoding and Huffman coding (Salomon and Motta, 2010). The variant bit vectors are processed one by one. In the current variant bit vector, we look for the longest runs of 0s or 1s, as well as the longest matches (same sequences of bits) in the already processed vectors. As a result, we obtain a description of the current variant using the previous variants, which is much shorter than the original variant bit vector. The description is finally Huffman encoded to save even more space. These stages are described in detail in the following subsections, whereas Figure 1 provides an overview of the construction of a GTC archive for an example input VCF file (a more detailed overview can be found in the Supplementary Material).

2.3 Preprocessing the input VCF file

Managing the input VCF file. Unphased genotypes in the input BCF/VCF are arbitrarily phased, while each of the multi-allelic variant sites (described in a single line of VCF) is broken into multiple biallelic variant sites, each described in a separate line, as in the preprocessing of the VCF file in BGT (Li, 2016). Thus, there are four possible allele values for each haplotype: '0' for the reference allele, '1' for the nonreference allele, '2' for the other nonreference allele (stored in a different line of the VCF file) and '.' for the unknown allele (Fig. 1a).

Extraction of metadata. The altered description of V variant sites is stored in a site-only BCF file (the HTSlib library (Li et al., 2009) is used for this task) with a _row variable indicating the site id added in the INFO field. A list of sample names is stored in a separate text file (Fig. 1b).

Initial encoding of genotypes. The allele at each site is represented as a dibit (00 for the reference allele, 01 for the nonreference allele, 11 for the other nonreference allele and 10 for the unknown allele) (Fig. 1c). Each of the V variants is represented by two variant bit vectors of size H: one for the lower and one for the higher bits of the dibits representing successive haplotypes. Together there are 2V variant bit vectors (at this point, information about all genotypes takes 2HV bits in total). The vectors are identified by ids ranging from 0 to 2V−1. The vectors with even ids correspond to the lower bits of the dibits representing haplotype sites, whereas vectors with odd ids correspond to the higher bits. In the next stages, described in detail below, the variant bit vectors are compressed and indexed, making it possible to randomly access an arbitrary vector.
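The dibit encoding just described can be sketched in a few lines (an illustrative reimplementation of the stated mapping; the function and variable names are ours, not from the GTC sources):

```python
# Sketch of GTC's initial genotype encoding (Section 2.3): each allele
# becomes a dibit whose first bit goes into the odd-id (higher-bit)
# variant bit vector and whose second bit goes into the even-id
# (lower-bit) vector. Names here are illustrative, not from GTC.

DIBIT = {
    '0': (0, 0),  # reference allele       -> 00
    '1': (0, 1),  # nonreference allele    -> 01
    '2': (1, 1),  # other nonreference     -> 11
    '.': (1, 0),  # unknown allele         -> 10
}
REV = {v: k for k, v in DIBIT.items()}

def encode_site(alleles):
    """Split the H alleles of one site into (lower, higher) bit vectors."""
    higher = [DIBIT[a][0] for a in alleles]
    lower = [DIBIT[a][1] for a in alleles]
    return lower, higher

def decode_site(lower, higher):
    """Recover the allele values from the two variant bit vectors."""
    return [REV[(h, l)] for h, l in zip(higher, lower)]
```

For V sites this yields the 2V variant bit vectors of size H mentioned above, with even ids holding the lower-bit and odd ids the higher-bit vectors.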
Forming blocks of genotype data. The variant bit vectors representing genotype data are divided into blocks. A single block contains genotype data of b_size = 3584 consecutive variant sites (value chosen experimentally), i.e. 2·b_size = 7168 consecutive variant bit vectors. Thus, there are ⌈V/b_size⌉ blocks (the last may contain data about less than b_size variants). The blocks are processed independently of each other, in parallel (if possible).

Fig. 1. Construction of a GTC archive. (a) The input VCF file with genotypes of 6 diploid samples at 12 variant sites. It is decomposed into metadata and blocks of genotypes. Each block (here: max. five variant sites) is processed separately. (b) Metadata: variant sites description (stored as a site-only BCF) and list of samples. (c) Bit vector representation of genotypes in Block 2. Each variant site is described by two variant bit vectors. The genotype of a haplotype, at each variant site, is described by two bits (located in two successive variant bit vectors). (d) Permutation of haplotypes in Block 2. The resulting order of haplotypes is stored. (e) Factorization of permuted Block 2. Empty and copied vectors are marked. All unique bit vectors are described as a sequence of longest possible 0-runs, 1-runs, matches to previous variant bit vectors in the block and literals. Note: the extended version of this example can be found in Supplementary Figure S1.

2.4 Processing a single block of genotype data

Permutation of haplotypes. The haplotypes in a block are permuted to minimize the number of differences (i.e. the Hamming distance) between neighboring haplotypes. The nearest neighbor heuristic (Johnson and McGeoch, 1997) is used to calculate a reasonably good solution in an acceptable time. The permutation of the haplotypes (their order after permutation) in the ith block is stored in an array P_i.
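The permutation step can be illustrated with a tiny greedy sketch (the paper names the nearest neighbor heuristic; the exact tie-breaking and data layout in GTC may differ):

```python
# Sketch of the per-block haplotype permutation (Section 2.4): order
# haplotypes greedily so that each one is followed by the unvisited
# haplotype closest to it in Hamming distance. Illustrative only.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbour_order(haplotypes):
    """Greedy nearest-neighbour tour over the haplotypes of one block."""
    unvisited = set(range(1, len(haplotypes)))
    order = [0]                      # start arbitrarily at haplotype 0
    while unvisited:
        last = haplotypes[order[-1]]
        nxt = min(unvisited, key=lambda i: hamming(haplotypes[i], last))
        order.append(nxt)
        unvisited.remove(nxt)
    return order                     # the P_i array stored per block

order = nearest_neighbour_order(["0000", "1111", "0001", "1110"])
```

Similar haplotypes end up adjacent, so the subsequent run/match factorization finds longer runs and matches than it would on the input order.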
Categorizing variant bit vectors. The variant bit vectors (representing genotype annotations) are processed one by one, in a sequential manner. They are packed into byte vectors (8 bits in a byte). A byte is then the smallest processing unit in the compression scheme. In the initial phase, each byte vector is categorized either as an empty vector (all bytes unset), a copy of a previously processed vector, or a unique vector, not equal to any of the previously processed vectors. The classification is done with the help of a dictionary structure HT_vec, namely a hash table with linear probing (Knuth, 1998). The hash value is computed for each processed vector, and HT_vec stores the ids of all previously processed unique vectors in the block (notice that the first nonempty byte vector is a unique vector). Four bit vectors, each of size 3584 bits, are formed for the ith block: E_even^i, E_odd^i, C_even^i and C_odd^i. The kth set bit in E_even^i/E_odd^i indicates that the kth even/odd (respectively) variant byte vector in the block is empty. The kth set bit in C_even^i/C_odd^i indicates that the kth even/odd (respectively) variant byte vector in the block is a copy of one of the previous vectors. The C_origin array maps subsequent copied vectors to their equivalents in the set of unique variant bit vectors (keeping the id of the unique vector, uid).

Processing unique variant bit vectors. Every unique variant bit vector, represented by a byte vector (of size ⌈H/8⌉ bytes, padded with zeros), is transformed into a sequence of tuples, which represent literals, runs of zeros or ones, or matches to a previously encoded vector. The encoding is done byte by byte, from left to right, starting at position j = 0. At each analyzed position, we first look for the length of the longest substring of zeros (bytes equal to zero) or ones (bytes equal to 255). If it is of length r_min or more (r_min = 2 by default), we encode the sequence as a run of zeros (or ones). Otherwise, we look for the longest possible match to one of the previously encoded unique vectors (an identical byte substring). To increase the chance of a long match, we search for shared haplotypes, that is, for the longest match starting at position j in any previously processed byte vector. Matches to arbitrary parts of other vectors are accidental and thus unlikely to be long. Moreover, by restricting to matches that start at position j, matches can be encoded in fewer bits (no need to store their positions). To make the search faster, the already processed data from the vectors are indexed using hash tables (HTs). Each HT's key consists of a sequence extracted from a vector and its position. The value is the unique vector id (uid). The HT stores keys with sequences of length h = 5. The minimum match length is equal to h (and it is impossible to find shorter matches using the HT). The matching byte substring (or its parts) in a previously encoded vector can itself be already encoded as a match. However, we restrict the match depth, that is, the number of allowed vectors describing each byte in a match. By default, the maximum allowed match depth (d_max) is 100. A subsequent match to the same vector is encoded with fewer bits, as there is no need to store the id of the previous vector. If, with the match depth restriction, no sufficiently long match can be found, the current byte is encoded as a literal. Runs of literals (between 20 and 252 literals by default) are concatenated and encoded as a separate tuple. The number of unique variant bit vectors will naturally increase with an increasing number of samples. Most of them will be vectors with only one or two set bits, so they should not influence the archive size significantly.
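The empty/copy/unique classification can be sketched as follows (a plain dict stands in for the linear-probing hash table HT_vec; names are ours):

```python
# Sketch of the byte-vector categorization (Section 2.4): each packed
# vector is labelled empty, a copy of an earlier unique vector, or a
# new unique vector. A dict replaces GTC's linear-probing hash table.

def categorize(byte_vectors):
    seen = {}      # vector content -> uid of its first (unique) occurrence
    labels = []    # ('empty', None), ('copy', uid) or ('unique', uid)
    for v in byte_vectors:
        if not any(v):
            labels.append(('empty', None))
        elif bytes(v) in seen:
            labels.append(('copy', seen[bytes(v)]))   # feeds C_origin
        else:
            uid = len(seen)
            seen[bytes(v)] = uid
            labels.append(('unique', uid))
    return labels
```

The per-category labels are exactly what the E and C bit vectors record, while the uids of copies populate C_origin.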
The type of a tuple is indicated by its first field, a flag f_type. Overall, there are six possible tuple types at the current position j:

- ⟨f_zero_run, num_0⟩: a run of num_0 zero bytes; the position j is updated to j + num_0,
- ⟨f_one_run, num_1⟩: a run of num_1 one bytes (all bits set); the position j is updated to j + num_1,
- ⟨f_match, uid, len⟩: a match of length len to the vector with id uid; the position j is updated to j + len,
- ⟨f_match_same, len⟩: a match of length len to the vector with the same id as the previous match; the position j is updated to j + len,
- ⟨f_literal, bv⟩: a literal, where bv is the value of the byte; the position j is increased by 1,
- ⟨f_literal_run, n, bv_1, bv_2, ..., bv_n⟩: a run of n literals, where bv_1, bv_2, ..., bv_n are the values of the consecutive bytes; the position j is increased by n.
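A simplified version of this factorization, covering the zero-run, one-run, match and literal tuples (without the match-depth limit, the literal-run packing and the final Huffman stage), might look like this; the names and structure are ours, not GTC's:

```python
# Sketch of the tuple encoding of a unique byte vector (Section 2.4).
# Runs of 0x00/0xFF bytes of length >= R_MIN become run tuples; else a
# match starting at the same position j in an earlier unique vector is
# taken if it reaches the minimum length H_LEN; else a literal byte.

R_MIN = 2   # minimum run length (r_min in the text)
H_LEN = 5   # minimum match length (h in the text)

def run_len(v, j, byte):
    k = j
    while k < len(v) and v[k] == byte:
        k += 1
    return k - j

def match_len(v, prev, j):
    k = j
    while k < len(v) and k < len(prev) and v[k] == prev[k]:
        k += 1
    return k - j

def encode(v, previous):
    """previous: already encoded unique byte vectors of the block."""
    tuples, j = [], 0
    while j < len(v):
        z, o = run_len(v, j, 0x00), run_len(v, j, 0xFF)
        if z >= R_MIN:
            tuples.append(('zero_run', z)); j += z
        elif o >= R_MIN:
            tuples.append(('one_run', o)); j += o
        else:
            uid, best = None, 0
            for u, prev in enumerate(previous):  # GTC uses hash tables here
                m = match_len(v, prev, j)
                if m > best:
                    uid, best = u, m
            if best >= H_LEN:
                tuples.append(('match', uid, best)); j += best
            else:
                tuples.append(('literal', v[j])); j += 1
    return tuples
```

The linear scan over `previous` is the naive stand-in for the position-keyed hash-table lookup described above.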
2.5 Merging blocks

The ⌈V/b_size⌉ processed blocks of genotype data are gathered and concatenated in the order of their appearance in the input VCF file (each ith block is added, for i = 1 to i = ⌈V/b_size⌉). The bit vectors E_even^i, E_odd^i, C_even^i and C_odd^i are concatenated into four global bit vectors of size V: E_even, E_odd, C_even and C_odd. The vectors are kept in a succinct form. The C_origin array gathers all C_origin^i arrays, adjusting the stored uids (the number of unique vectors in all previous blocks is added to each uid from the current block). For subsequent copied vectors, it then refers to uids of the original unique vectors out of all unique vectors. Every uid in C_origin is then delta coded (the difference between the uid of the next unique vector and the original uid is calculated) with the minimum necessary number of bits.

The encoded unique variant bit vectors are stored in a byte array U. The flags, literals (bv), lengths of matches (len) and lengths of runs of zeros (num_0) and ones (num_1) are compressed (separately) with an entropy coder (we use a Huffman coder) in the context of the total number of set bits in the current variant bit vector (eight separate groups by default). For each literal run, the number of bits it occupies is kept. For matches, the ids of matched unique vectors (uids) are delta coded with the minimum necessary number of bits. The starting positions of all unique vectors in U are stored in a byte array U_pos, where every 1025th unique vector position is stored using 4 bytes, whereas for the 1024 subsequent vectors, only the difference (between the current vector position and the nearest previous position encoded with bytes) is stored, using d bits, where d is the minimum number of bits necessary to store any encoded position difference.

Finally, the permutations of all blocks, P_i, are concatenated into a single array of permutations, P. It is stored with the minimum necessary number of bits. The id of the variant bit vector to decompress is enough to find the right permutation in P.
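The sampled delta coding used for U_pos can be sketched as follows (S is 1025 in GTC; a small value keeps the example readable, and the bit-width computation mirrors the "minimum necessary number of bits" rule):

```python
# Sketch of the sampled delta coding of U_pos (Section 2.5): absolute
# positions are kept every S entries; the remaining entries store only
# the offset from the previous anchor, each in the minimum width d.
# Illustrative names; not taken from the GTC implementation.

S = 4  # sampling interval (1025 in GTC)

def delta_encode(positions):
    anchors, deltas = [], []
    for i, p in enumerate(positions):
        if i % S == 0:
            anchors.append(p)            # stored in full (4 bytes in GTC)
        else:
            deltas.append(p - anchors[-1])
    d = max(deltas, default=0).bit_length()  # bits needed per delta
    return anchors, deltas, d

def delta_decode(anchors, deltas, n):
    out, di = [], 0
    for i in range(n):
        if i % S == 0:
            out.append(anchors[i // S])
        else:
            out.append(anchors[i // S] + deltas[di])
            di += 1
    return out
```

Random access stays cheap: entry i needs only its anchor plus one stored delta, never a scan from the beginning.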
2.6 Design of the compressed data structure

The main components of the GTC data structure are as follows:

- E_even and E_odd: two bit vectors, each of size V, indicating whether subsequent variant bit vectors (out of all vectors for the lower or higher bits of the corresponding haplotypes, respectively) are zero-only vectors,
- C_even and C_odd: two bit vectors, each of size V, indicating whether subsequent variant bit vectors (out of all vectors for the lower or higher bits of the corresponding haplotypes, respectively) are copies of other vectors,
- C_origin: ids of the original vectors (out of all unique vectors) for successive vectors being a copy; delta coding and the minimum necessary number of bits are used to store each id,
- U: byte array storing the unique variant bit vectors, encoded into tuples and compressed (as described above),
- U_pos: byte array with the positions of the subsequent unique variant bit vectors in the U structure; the full position is stored for every 1025th vector, the positions of the remaining vectors are delta coded,
- P: byte array storing the permutations of subsequent blocks; a single permutation is a sequence of ids of haplotypes reflecting the order in which they appear in the block. Each id is stored with the minimum necessary number of bits (exactly ⌈log H⌉ bits).

The E_even, E_odd, C_even and C_odd bit vectors are represented by the compressed structure (Raman et al., 2007; Navarro and Providel, 2012) implemented in the Succinct Data Structure Library (SDSL) (Gog et al., 2014).
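The role of the succinct E and C vectors in lookups can be illustrated with a toy rank function (SDSL provides constant-time rank; the indexing arithmetic below is our inference from the description, not code from GTC):

```python
# Sketch of category lookup over the global E (empty) and C (copy) bit
# vectors (Sections 2.6-2.7): rank counts set bits before position k,
# mapping a vector id to its slot in C_origin or among unique vectors.

def rank1(bits, k):
    """Number of set bits in bits[0:k] (the classic rank query)."""
    return sum(bits[:k])

def locate(k, E, C):
    if E[k]:
        return ('empty', None)
    if C[k]:
        return ('copy', rank1(C, k))          # index into C_origin
    # unique vectors are numbered in order of appearance
    return ('unique', k - rank1(E, k) - rank1(C, k))

E = [0, 1, 0, 0, 1, 0]   # vectors 1 and 4 are empty
C = [0, 0, 1, 0, 0, 1]   # vectors 2 and 5 are copies
```

The 'unique' ordinal is what U_pos is indexed with to find the vector's tuple stream inside U.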
2.7 Queries

By default, the entire collection is decompressed into a VCF/BCF file. It is possible to restrict the query by applying additional conditions. Only variants and samples meeting all specified conditions are decompressed. The following restrictions are possible:

- range condition: it specifies the chromosome and a range of positions within the chromosome,
- sample condition: it specifies a sample or samples,
- alternate allele frequency/count condition: it specifies the minimum/maximum count/frequency of the alternate allele among selected samples for each variant site,
- variant count condition: it specifies the maximum number of variant sites to decompress.

Two decompression algorithms are possible depending on the chosen query parameters. For most queries, a variant-oriented approach is used. A sample-oriented approach is applied in queries about a relatively small (i.e. up to 2000) number of samples, but only if the decompressed range of variant sites spans multiple blocks.

Variant-oriented approach. In this algorithm, the two vectors representing a variant site are fully decompressed. Their ids are known owing to the _row variable in the BCF with the variant sites description. Initially, the E_even, E_odd, C_even and C_odd bit vectors are used to determine the category of the variant bit vector. Decompression of an empty vector is straightforward. For a copied vector, rank operations on the E_even, E_odd, C_even and C_odd bit vectors are used to determine its copy, whereas the id of the original, unique vector is found using the C_origin array. The U_pos array is used to find the position of the unique variant bit vector in the U array. The consecutive bytes of the variant bit vector are decompressed by decompressing and decoding all flags, literals, lengths of runs of zeros and ones, and matches. If a match is encountered, the variant bit vector containing the match is not fully decompressed. Instead, the complete decoding of all irrelevant tuples from the beginning of the vector up to the match position is skipped as far as possible. For example, the stored bit length of a run of literals allows skipping the run without time-consuming literal decoding. The P array helps to find the original order of haplotypes. Only the subset of queried samples, if that is the case, is reported.

If a range of variant sites is queried, the decompression is sped up by keeping track of an adequate number of previous, already decompressed unique variant bit vectors. Moreover, the permutation of haplotypes only needs to be read from the P array at the beginning of a block, not for each variant site separately.

Sample-oriented approach. In this approach, all haplotypes of the queried sample are decompressed. For example, if the sample is diploid, two haplotypes are decompressed. In case of more samples, more haplotypes are decompressed. The decompression starts at the beginning of the block containing the first queried variant site. The P array is used at the start of each block to find the positions of the bytes containing the haplotypes in the permuted variant vectors. Each variant vector is partly decoded once. The bytes containing information about the decompressed haplotypes are fully decoded and stored; the complete decoding of other bytes is skipped, if possible. As the previously decoded bytes are kept, if a match covering the decompressed haplotype is encountered (only a match within the same block is possible), the byte value can be read immediately. The decompressed bytes are used to retrieve the values of the queried haplotypes (a dibit, two single bits in two successive variant bit vectors).
3 Results

3.1 Datasets

To evaluate GTC and its competitors, we picked two large collections of genomes of Homo sapiens: the 1000 GP Phase 3 (2504 diploid genotypes) and the HRC (27 165 diploid genotypes). The 1000 Genomes Project data were downloaded from the project web site (details in the Supplementary Material). The HRC collection used in our experiments was obtained from the European Genome-phenome Archive (accession number: EGAS00000000029; details in the Supplementary Material) and is slightly smaller than the full dataset (containing 32 488 samples; unfortunately unavailable for public use).

3.2 Compression ratios and times

Table 1 summarizes the characteristics of the datasets and the compression capabilities of gzip, BGT, GQT, SA and GTC. The GTC archive appeared to be about two times more compact than the BGT archive and ten times more compact than the gzipped VCF file. The SA archive size is between GTC and BGT for the smaller collection but is noticeably larger than BGT for the larger one.

Table 1. Datasets used in the experiments

Dataset | No. of genot. | No. of var. (M) | VCF (GB) | VCF-gzip (GB) | GQT (GB) | BGT (GB) | SA (GB) | GTC (GB)
1000GP | 2504 | 85.20 | 853 | 17.41 | 14.34 | 3.87 | 2.93 | 1.81
HRC | 27 165 | 40.36 | 4300 | 75.01 | 84.42 | 7.01 | 12.68 | 3.98
HRC Chr11 | 27 165 | 1.94 | 210 | 3.42 | 4.11 | 0.33 | 0.62 | 0.18

Note: The sizes of the compressed representations with gzip, GQT, BGT, SA and GTC are also presented.

For a detailed examination of the compression ratios, we reduced the HRC dataset to collections containing only Chromosome 11 variants and 1000, 2000, etc. samples. Figure 2a shows that the relative sizes of the BGT and GTC archives are similar in the examined range (see also Supplementary Table S1). A more careful examination shows that the compressed sizes of BGT and GTC grow slightly sub-linearly with growing numbers of samples. Moreover, the ratio between BGT and GTC is almost constant for more than 5000 samples, which suggests that it would remain the same for even larger collections. SA seems to scale more poorly than BGT and GTC when the collection size grows.

It is also interesting to see the sizes of the components of the GTC archive for the complete HRC data. As can be observed in Figure 2b, the majority of the archive (62.5%) is for the description of matches, 0-runs and so on (Core) and 30.8% is for the description of permutations in blocks (Permutation). Finally, 6.5% is for the description of variants (position, reference and non-reference alleles, etc.) and 0.2% for the list of sample names (Metadata). For smaller block sizes, the haplotypes can be better compressed, but the description of the permutation consumes more space (Supplementary Fig. S8). In Supplementary Figures S2–S7, we also show the influence of various parameters of GTC (such as block size, minimal match length, etc.) on the compression ratio.

The compression times of BGT, SA and GTC are similar (slightly more than a day for the complete HRC collection on an Intel Xeon-based workstation clocked at 2.3 GHz), as they are dominated by reading and parsing of the huge input gzipped VCF file.

3.3 Queries to the compressed data structure

The great compression would be, however, of little value without the ability of answering queries of various types. Figure 2c–l shows the times of answering various types of queries by GTC, BGT, SA and BCFtools (BCF), for Chromosome 11 data containing various numbers of samples (various collection sizes, Fig. 2c–f) or for the whole collection (27 165 samples, Fig. 2g–l).

The GTC decompression of the whole collection of variants is from 7 times (1000-sample subset) to 13 times (all samples) faster than with BGT and about 3 times faster than with SA (Fig. 2c). The GTC extraction for a single variant or a genomic range of 23 329 bases (the median size of a human gene) is up to six times faster than with BGT (Fig. 2d and e). It is also noteworthy that BCFtools is almost as fast as GTC for a single variant query.

A bit different situation can be observed in Figure 2f, where the decompression times of a single sample are presented. For collections not larger than 5000 samples, GTC is the fastest, but then BGT takes the lead, becoming 1.5 times faster for the complete collection. Nevertheless, both GTC and BGT answer the query in a few seconds. As expected, BCFtools is clearly the slowest.

Figure 2g–l shows the times of answering queries about different ranges of the genome and various numbers of samples for the complete Chromosome 11 collection (27 165 samples). The BGT extraction times are slightly better than those of GTC only when a small number of samples (up to about one thousand samples for small variant ranges and up to a few hundred samples for large variant ranges) is queried. For more samples in a query, GTC takes the lead, whereas the relative performance of BGT decreases (i.e. for a single variant and 10 000 samples extraction, BCFtools is the second best method). When extracting all samples, GTC is about ten times faster than BGT. Moreover, the advantage of GTC grows slightly when the range extends. It can be noted that the relative performance of BCFtools degrades when the range width exceeds a few thousand bases.

Fig. 2. Comparison of compressed data structures for VCF (gzipped, VCF.gz) with BCFtools (BCF) used to access it, GQT, BGT, SA and GTC. (a) Sizes of compressed sampled archives of HRC data. (b) Components of the GTC archive for HRC data. (c) Decompression times for HRC Chromosome 11 data. (d) Single variant range query times for HRC Chromosome 11 data. (e) Variants in a range of size 23 239 bp query times for HRC Chromosome 11 data. (f) Single sample query times for HRC Chromosome 11 data. (g) Single variant for selected samples query times for HRC Chromosome 11 data. (h) Variants in a range of size 10 Mb for selected samples query times for HRC Chromosome 11 data. (i) All variants for selected samples query times for HRC Chromosome 11 data. (j) Single sample for the selected range of chromosome query times for HRC Chromosome 11 data. (k) In total, 1000 samples for the selected range of chromosome query times for HRC Chromosome 11 data. (l) All samples for the selected range of chromosome query times for HRC Chromosome 11 data.

From the pure compression-ratio perspective, it is worth noting that the average size of genotype information in the GTC archive for a single sample is just about 146 KB. An even better result would be possible if we resigned from fast query support. To date, the best pure compressor of VCF files is TGC (Deorowicz et al., 2013). The recently proposed GTRAC (Tatwawadi et al., 2016) is a modified version of TGC, supporting some types of queries, for example, for single (or ranges of) variants or single samples. Unfortunately, the output of GTRAC queries is just a binary file, so no direct comparison with VCF/BCF-producing compressors (BGT, GTC and SA) is possible. Moreover, we were unable to run GTRAC for the examined datasets. Nevertheless, we modified our compressor to produce similar output and support similar query types, and ran both GTC and GTRAC on the 1000 GP Phase 1 data containing 1092 genotypes and 39.7 M variants. The compressed archives were of size 622 MB (GTC) and 1002 MB (GTRAC). The queries for single variants were solved in 7 ms (GTC) and 47 ms (GTRAC). The queries for samples were answered in similar times, about 2 s for Chromosome 11 data.

4 Conclusions

The widely used VCFtools and BCFtools offer relatively quick access to VCF, gzipped VCF or BCF files when we query about a specified variant or variant range. The situation, however, changes when we focus on sample queries, which are slow. Another drawback of using these tools is that the gzipped VCF or BCF files are quite large, which causes problems with storage and transfer. When the user is interested only in the genotype data from the VCF files, much better solutions are possible, that is, the state-of-the-art BGT and SA.

References

Danecek,P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
Deorowicz,S. et al. (2013) Genome compression: a novel approach for large collections. Bioinformatics, 29, 2572–2578.
Deorowicz,S. and Grabowski,S. (2013) Data compression for sequencing data. Algorithms Mol. Biol., 8, 25.
Durbin,R. (2014) Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 30, 1266–1272.
Gog,S. et al. (2014) From theory to practice: plug and play with succinct data structures. Lect. Notes Comput. Sci., 8504, 326–337.
Johnson,D.S. and McGeoch,L.A. (1997) The traveling salesman problem: a case study in local optimization. In: Aarts,E.H.L. and Lenstra,J.K. (eds)
Nevertheless, the GTC, proposed in this article, Local Search in Combinatorial Optimisation. John Wiley and Sons, has several advantages over them. The most significant one is a new London, UK, pp. 215–310. compression algorithm designed with a care of fast queries support. Knuth,D.E. (1998) The art of computer programming. In: Sorting and The data structure is so small that it is possible to store it in the Searching, Vol. 3. Addison-Wesley Professional, USA, pp. 426–458. Layer,R.M. et al. (2016) Efficient genotype compression and analysis of large main memory of even commodity workstations or laptops giving the genetic-variation data sets. Nat. Methods, 13, 63–65. impressive query times. Even when run at powerful workstation, Lek,M. et al. (2016) Analysis of protein-coding genetic variation in 60, 706 GTC often outperforms the competitors in query times by an order humans. Nature, 536, 285–291. of magnitude. The scalability experiments suggest that much larger Li,H. et al. (2009) The Sequence alignment/map (SAM) format and SAMtools. collections can be also maintained effectively. Bioinformatics, 25, 2078–2079. Current work focuses on compression of genotypes only. Li,H. (2016) BGT: efficient and flexible genotype query across many samples. Compressing and querying other data that may be stored in VCF files, Bioinformatics, 32, 590–592. e.g. genotype likelihoods, would be hard to introduce considering the McCarthy,S. et al. (2016) A reference panel of 64, 976 haplotypes for genome wide range of possible values. A quantization could simplify the task, imputation. Nat. Genet., 48, 1279–1283. Navarro,G. and Providel,E. (2012) Fast, small, simple rank/select on bitmaps. but without further study it is hard to tell whether similarities and Lect. Notes Comput. Sci., 7276, 295–306. redundancies in the input data would allow to achieve satisfying Numanagi c,I. et al. (2016) Comparison of high-throughput sequencing data results. 
In our future work, we plan to investigate this issue thor- compression tools. Nat. Methods, 13, 1005–1008. oughly and possibly extend types of VCF fields that can be com- Raman,R. et al. (2007) Succinct indexable dictionaries with applications pressed with our tool. Moreover, we want to take a closer look at all to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, parameters that were chosen experimentally and possibly make them 3, 43. dependable on some additional factors, specific to the input data. Salomon,D. and Motta,G. (2010) Handbook of Data Compression. Springer, London, UK. Stephens,Z.D. et al. (2015) Big Data: astronomical or Genomical. PLOS Biol., Funding 13, e1002195. Sudmant,P.H. et al. (2015) An integrated map of structural variation in 2, 504 This work was supported by the National Science Centre, Poland under project human genomes. Nature, 526, 75–81. [DEC-2015/17/B/ST6/01890]. We used the infrastructure supported by Tatwawadi,K. et al. (2016) GTRAC: fast retrieval from compressed collec- ‘GeCONiI-Upper Silesian Center for Computational Science and Engineering’ tions of genomic variants. Bioinformatics, 32, i479–i486. [POIG.02.03.01-24-099/13]. Zheng,X. et al. (2017) SeqArray—A storage-efficient high-performance data Conflict of Interest: none declared. format for WGS variant calls. Bioinformatics, 33, 2251–2257.
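The headline compression factors quoted in Section 3.2 can be rechecked directly from the archive sizes reported in Table 1. The snippet below is a quick arithmetic sketch; all numbers are copied from the table, and the code only performs the divisions:

```python
# Archive sizes in GB for the 1000 GP Phase 3 dataset, copied from Table 1.
gp1000 = {"VCF": 853, "VCF-gzip": 17.41, "GQT": 14.34,
          "BGT": 3.87, "SA": 2.93, "GTC": 1.81}

# Size of every representation relative to the GTC archive:
# BGT is ~2.1x larger and gzipped VCF ~9.6x larger, matching the
# 'two times' and 'ten times' figures quoted in the text.
for name, gb in gp1000.items():
    print(f"{name:8s} {gb:8.2f} GB  ({gb / gp1000['GTC']:5.1f}x GTC)")

# Per-sample cost for the HRC data: 3.98 GB over 27 165 samples,
# i.e. the 'about 146 KB' per sample quoted in the text.
print(f"{3.98e9 / 27165 / 1e3:.1f} KB per sample")
```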
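The Core component discussed in Section 3.2 stores descriptions of matches and 0-runs between genotype bit vectors. To give a loose sense of why such descriptions compress well (a toy run-length sketch, not the actual GTC encoding, which additionally matches each vector against previously seen ones), consider a sparse genotype row:

```python
def zero_runs(bits):
    """Encode a 0/1 genotype bit vector as (symbol, run_length) pairs.

    Toy run-length encoding for illustration only; the real GTC 'Core'
    stream also describes matches against earlier vectors.
    """
    runs = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

# A sparse row: a rare variant carried by a single haplotype collapses
# from 32 bits to three (symbol, length) pairs.
row = [0] * 20 + [1] + [0] * 11
print(zero_runs(row))  # [(0, 20), (1, 1), (0, 11)]
```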
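The range queries timed in Figure 2d and e amount to locating a genomic interval among sorted variant positions; each of the compared tools does this with some form of index. A minimal sketch of the underlying idea (hypothetical toy positions, plain binary search rather than any tool's actual index structure):

```python
import bisect

# Sorted variant positions on a chromosome (hypothetical toy data).
positions = [5_246_696, 5_247_153, 5_248_329, 5_269_945, 5_301_000]

def variants_in_range(positions, start, end):
    """Return indices of variants with start <= pos <= end."""
    lo = bisect.bisect_left(positions, start)
    hi = bisect.bisect_right(positions, end)
    return range(lo, hi)

# Variants falling inside a gene-sized window.
print(list(variants_in_range(positions, 5_247_000, 5_270_000)))  # [1, 2, 3]
```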
Bioinformatics – Oxford University Press
Published: Jan 16, 2018