GTC: how to maintain huge genotype collections in a compressed form

Abstract

Motivation: Nowadays, genome sequencing is frequently used in many research centers. In projects such as the Haplotype Reference Consortium or the Exome Aggregation Consortium, huge databases of genotypes of large populations are produced. Together with the increasing size of these collections, the need for fast and memory-frugal ways of representing and searching them becomes crucial.

Results: We present GTC (GenoType Compressor), a novel compressed data structure for the representation of huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest publicly available database, of about 60 000 haplotypes at about 40 million SNPs, can be stored in less than 4 GB, while queries related to variants are answered in a fraction of a second.

Availability and implementation: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc.

Contact: sebastian.deorowicz@polsl.pl

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

In the last two decades, the throughput of genome sequencers has increased by a few orders of magnitude. At the same time, the sequencing cost of a single human individual has decreased from over 1 billion to about 1 thousand US dollars. Stephens et al. (2015) predicted that in 2025 genomic data will be acquired at a rate of 1 zettabase/year, which means that about 2-40 exabytes/year of genome data should be deposited for future analysis. Meanwhile, the prices of data storage and transfer decrease only moderately (Deorowicz and Grabowski, 2013), so keeping the data management costs under control becomes a real challenge. Recently, tools for compression of sequenced reads have been benchmarked (Numanagić et al., 2016). The ability of the examined utilities to shrink the data about ten times is remarkable. Nevertheless, much more is necessary. The obvious option is to resign from the storage of raw data (in most experiments) and focus on the results of variant calling, usually deposited in Variant Call Format (VCF) files (Danecek et al., 2011). Well-known initiatives, such as the 1000 Genomes Project (1000 GP) (Sudmant et al., 2015) or the 100 000 Genomes Project (https://www.genomicsengland.co.uk/the-100000-genomes-project-by-numbers/), deliver VCF files for thousands of samples. Moreover, the scale of current projects, such as the Haplotype Reference Consortium (HRC) (McCarthy et al., 2016) or the Exome Aggregation Consortium (ExAC) (Lek et al., 2016), is larger by an order of magnitude. For example, the VCF files of the HRC describe 64 976 haplotypes at about 39.2 million SNPs and occupy 4.3 TB. It is also clear that much larger databases will be formed in the near future. VCF files contain a list of variants in a collection of genomes as well as evidence of the presence of a reference/nonreference allele at each variant position in each genome. As they are intensively searched, the applied compression scheme should support fast queries of various types. Indexed and gzipped VCF files can be effectively queried using VCFtools (Danecek et al., 2011) or BCFtools when the query concerns a single variant or a range of variants. Unfortunately, retrieving sample data means time-consuming decompression and processing of the whole file.
The recently introduced Genotype Query Tools (GQT) (Layer et al., 2016) use a specialized compression algorithm for VCF files. GQT was designed to compare sample genotypes among many variant loci, but it does not allow retrieval of the data of a specified variant or sample. Shortly after, Li (2016) proposed BGT, based on the positional Burrows-Wheeler transform (Durbin, 2014). It offered more than 10-fold better compression than GQT and supported queries for genotypes as well as variants. Moreover, it allowed restricting the set of samples according to metadata conditions. The SeqArray (SA) library (Zheng et al., 2017) for the R programming language is yet another solution for effectively compressing and browsing VCF files. The applied compression is based on the LZMA algorithm (Salomon and Motta, 2010). In this article, we introduce GenoType Compressor (GTC), intended to store genetic variation data from VCF files in a highly compressed form. Our data structure, as confirmed by experiments, is much faster (often by an order of magnitude) in various types of queries. It is also significantly more compact than the competitive solutions. These features allow users to maintain huge collections of genotype data, with fast access, even on typical laptops, and could change the way huge collections of VCF files are processed.

2 Materials and methods

2.1 General idea and definitions

GTC is a new tool for compressed representation of genotype data supporting fast queries of various types. It compresses a collection of genotypes in the VCF/BCF format and allows for queries about genotypes: at a specified range of variant positions (e.g. a single variant site), referred to as a variant query; for specified samples (e.g. a single sample), referred to as a sample query; or with a combination of the above two. The ploidy of individuals determines the number of haplotypes that make up a single genotype. For diploid organisms, a genotype of an individual is defined by two separate haplotypes. For a precise description of the proposed algorithm, let us introduce some terms. As an input, we have a VCF/BCF file that describes H haplotypes at V biallelic sites (a single variant allele per site). For any bit vector X, X[i] is the ith bit of this vector and |X| denotes its length.

2.2 Sketch of the algorithm

In the beginning, GTC divides the variants into blocks of bsize consecutive entries (according to preliminary experiments, we picked the value bsize=3584; a justification for this choice can be found in the Supplementary Material) and processes each block separately (Fig. 1). Bit vectors representing the presence/absence of reference alleles in all haplotypes are formed. In fact, two bit vectors, referred to as variant bit vectors, are necessary to describe each variant, as four possibilities must be covered: (i) the reference allele, (ii) the nonreference allele, (iii) the other nonreference allele or (iv) the unknown allele.
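As an illustration of this representation, the short Python sketch below (not part of GTC; names and layout are simplified for exposition) builds the two variant bit vectors of a single variant site from the alleles of all haplotypes, using the dibit coding defined later in Section 2.3: the lower bits form the even-id vector and the higher bits the odd-id vector.

```python
# Illustrative sketch (not the actual GTC code): encode one variant site,
# given the allele of every haplotype, as two variant bit vectors.
# Allele symbols follow the VCF preprocessing described in Section 2.3:
# '0' reference, '1' nonreference, '2' other nonreference, '.' unknown.

# Dibit per allele: (higher bit, lower bit).
DIBIT = {'0': (0, 0), '1': (0, 1), '2': (1, 1), '.': (1, 0)}

def variant_bit_vectors(alleles):
    """Return (lower_bits, higher_bits) for one variant site.

    `alleles` is a sequence of H allele symbols, one per haplotype.
    """
    lower = [DIBIT[a][1] for a in alleles]
    higher = [DIBIT[a][0] for a in alleles]
    return lower, higher

if __name__ == "__main__":
    # Six haplotypes (three diploid samples) at one site.
    low, high = variant_bit_vectors(['0', '1', '0', '.', '2', '0'])
    print(low)   # [0, 1, 0, 0, 1, 0]  -> even-id variant bit vector
    print(high)  # [0, 0, 0, 1, 1, 0]  -> odd-id variant bit vector
```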
Fig. 1. Construction of a GTC archive. (a) The input VCF file with genotypes of 6 diploid samples at 12 variant sites. It is decomposed into metadata and blocks of genotypes. Each block (here: max. five variant sites) is processed separately. (b) Metadata: description of variant sites (stored as a site-only BCF) and the list of samples. (c) Bit vector representation of genotypes in Block 2. Each variant site is described by two variant bit vectors. The genotype of a haplotype, at each variant site, is described by two bits (located in two successive variant bit vectors). (d) Permutation of haplotypes in Block 2. The resulting order of haplotypes is stored. (e) Factorization of permuted Block 2. Empty and copied vectors are marked. All unique bit vectors are described as a sequence of longest possible 0-runs, 1-runs, matches to previous variant bit vectors in the block and literals. Note: an extended version of this example can be found in Supplementary Figure S1.

The haplotypes in each block are independently permuted to minimize the number of differences (i.e. the Hamming distance) between successive entries. As determining the best permutation of haplotypes is equivalent to solving the Traveling Salesperson Problem (Johnson and McGeoch, 1997), it is practically impossible to find the optimal solution in a reasonable amount of time. Thus, the nearest neighbor heuristic (Johnson and McGeoch, 1997) is used to quickly calculate a reasonably good solution. There are better algorithms for this task (in terms of minimizing the total number of differences between neighboring haplotypes), but they are too slow to be applied in our case. The description of the found permutation must be stored for each block to allow retrieval of the original data. The permuted haplotypes are compressed (still within blocks) using a hybrid of specialized techniques inspired by the Ziv-Lempel compression algorithms, run-length encoding and Huffman coding (Salomon and Motta, 2010). The variant bit vectors are processed one by one. In the current variant bit vector, we look for the longest runs of 0s or 1s, as well as the longest matches (identical sequences of bits) in the already processed vectors. As a result, we obtain a description of the current variant in terms of the previous variants, which is much shorter than the original variant bit vector. The description is finally Huffman encoded to save even more space. These stages are described in detail in the following subsections, whereas Figure 1 provides an overview of the construction of a GTC archive for an example input VCF file (a more detailed overview can be found in the Supplementary Material).
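The following Python sketch (a simplification, not the GTC code) illustrates the nearest neighbor heuristic on the haplotypes of one block: starting from an arbitrary haplotype, it repeatedly appends the unused haplotype closest in Hamming distance to the haplotype appended last. The quadratic-time scan and arbitrary tie-breaking shown here are only meant to convey the idea.

```python
# Illustrative sketch: greedy nearest-neighbor ordering of haplotypes.
# Assumption: haplotypes are given as equal-length bit tuples over the
# variants of one block.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbor_order(haplotypes):
    """Return a permutation (list of haplotype indices) for one block."""
    remaining = set(range(len(haplotypes)))
    order = [0]                      # arbitrary starting haplotype
    remaining.remove(0)
    while remaining:
        last = haplotypes[order[-1]]
        nxt = min(remaining, key=lambda i: hamming(haplotypes[i], last))
        order.append(nxt)
        remaining.remove(nxt)
    return order

if __name__ == "__main__":
    block = [(0, 0, 1, 1), (1, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1)]
    print(nearest_neighbor_order(block))  # [0, 2, 1, 3]
```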
2.3 Preprocessing the input VCF file

Managing the input VCF file. Unphased genotypes in the input VCF/BCF are arbitrarily phased, while each multiallelic variant site (described in a single line of the VCF) is broken into multiple biallelic variant sites, each described in a separate line, as in the preprocessing of VCF files in BGT (Li, 2016). Thus, there are four possible allele values for each haplotype: '0' for the reference allele, '1' for the nonreference allele, '2' for the other nonreference allele (stored in a different line of the VCF file) and '.' for the unknown allele (Fig. 1a).

Extraction of metadata. The altered description of the V variant sites is stored in a site-only BCF file (the HTSlib library (Li et al., 2009) is used for this task), with a _row variable indicating the site id added in the INFO field. A list of sample names is stored in a separate text file (Fig. 1b).

Initial encoding of genotypes. The allele at each site is represented as a dibit: 00 for the reference allele, 01 for the nonreference allele, 11 for the other nonreference allele and 10 for the unknown allele (Fig. 1c). Each of the V variants is represented by two variant bit vectors of size H: one for the lower and one for the higher bits of the dibits representing successive haplotypes. Together there are 2V variant bit vectors (at this point the information about all genotypes takes 2HV bits in total). The vectors are identified by ids ranging from 0 to 2V-1. The vectors with even ids correspond to the lower bits of the dibits representing haplotype sites, whereas vectors with odd ids correspond to the higher bits. In the next stages, described in detail below, the variant bit vectors are compressed and indexed, making it possible to randomly access an arbitrary vector.

Forming blocks of genotype data. The variant bit vectors representing genotype data are divided into blocks. A single block contains genotype data of bsize=3584 consecutive variant sites (value chosen experimentally), i.e. 2×bsize=7168 consecutive variant bit vectors. Thus, there are ⌈V/bsize⌉ blocks (the last may contain data of fewer than bsize variants). The blocks are processed independently of each other, in parallel (if possible).

2.4 Processing a single block of genotype data

Permutation of haplotypes. The haplotypes in a block are permuted to minimize the number of differences (i.e. the Hamming distance) between neighboring haplotypes. The nearest neighbor heuristic (Johnson and McGeoch, 1997) is used to calculate a reasonably good solution in an acceptable time. The permutation of the haplotypes (their order after permutation) in the ith block is stored in an array P^i.

Categorizing variant bit vectors. The variant bit vectors (representing genotype annotations) are processed one by one, in a sequential manner. They are packed into byte vectors (8 bits in a byte); a byte is thus the smallest processing unit in the compression scheme. In the initial phase, each byte vector is categorized either as an empty vector (all bytes unset), a copy of a previously processed vector, or a unique vector, not equal to any of the previously processed vectors. The classification is done with the help of a dictionary structure HTvec, namely a hash table with linear probing (Knuth, 1998). The hash value is computed for each processed vector, and HTvec stores the ids of all previously processed unique vectors in the block (note that the first nonempty byte vector is a unique vector). Four bit vectors, each of size 3584 bits, are formed for the ith block: Eeven^i, Eodd^i, Ceven^i and Codd^i. The kth set bit in Eeven^i/Eodd^i indicates that the kth even/odd (respectively) variant byte vector in the block is empty. The kth set bit in Ceven^i/Codd^i indicates that the kth even/odd (respectively) variant byte vector in the block is a copy of one of the previous vectors. The Corigin^i array maps subsequent copied vectors to their equivalents in the set of unique variant bit vectors (keeping the id of the unique vector, uid).
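The categorization step can be summarized by the following Python sketch (illustrative only; a plain dict stands in for the HTvec hash table with linear probing, and the split into even/odd E and C vectors is omitted for brevity):

```python
# Illustrative sketch of categorizing the byte vectors of one block as
# empty, copy or unique.  Names (E, C, Corigin) mirror the text.

def categorize(byte_vectors):
    """byte_vectors: list of bytes objects (the packed variant bit vectors).
    Returns (E, C, Corigin, uniques)."""
    E, C = [], []              # per-vector empty / copy flags
    Corigin = []               # for every copied vector: uid of its original
    uniques = []               # the unique vectors, in order of appearance
    seen = {}                  # vector content -> uid (stands in for HTvec)
    for v in byte_vectors:
        empty = not any(v)
        copy = (not empty) and (v in seen)
        E.append(empty)
        C.append(copy)
        if copy:
            Corigin.append(seen[v])
        elif not empty:
            seen[v] = len(uniques)     # assign the next uid
            uniques.append(v)
    return E, C, Corigin, uniques

if __name__ == "__main__":
    vecs = [bytes([0, 0]), bytes([5, 1]), bytes([5, 1]), bytes([7, 0])]
    print(categorize(vecs))
    # ([True, False, False, False], [False, False, True, False], [0],
    #  [b'\x05\x01', b'\x07\x00'])
```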
Processing unique variant bit vectors. Every unique variant bit vector, represented by a byte vector (of size ⌈H/8⌉ bytes, padded with zeros), is transformed into a sequence of tuples, which represent literals, runs of zeros or ones, or matches to a previously encoded vector. The encoding is done byte by byte, from left to right, starting at position j = 0. At each analyzed position, we first look for the length of the longest substring of zeros (bytes equal to zero) or ones (bytes equal to 255). If it is of length rmin or more (rmin=2 by default), we encode the sequence as a run of zeros (or ones). Otherwise, we look for the longest possible match (an identical byte substring) to one of the previously encoded unique vectors. To increase the chance of a long match, we search for shared haplotypes, that is, for the longest match starting at position j in any previously processed byte vector. Matches to arbitrary parts of other vectors are accidental and thus unlikely to be long. Moreover, by restricting ourselves to matches that start at position j, matches can be encoded in fewer bits (there is no need to store their positions). To make the search faster, the already processed data from the vectors are indexed using hash tables (HTs). Each HT key consists of a sequence extracted from a vector and its position; the value is the unique vector id (uid). The HT stores keys with sequences of length h = 5. The minimum match length is thus equal to h (it is impossible to find shorter matches using the HT). The matching byte substring (or its parts) in a previously encoded vector may itself already be encoded as a match. Therefore, we restrict the match depth, that is, the number of vectors allowed to describe each byte of a match. By default, the maximum allowed match depth (dmax) is 100. A subsequent match to the same vector is encoded with fewer bits, as there is no need to store the id of the previous vector. If, with the match depth restriction, no sufficiently long match can be found, the current byte is encoded as a literal. Runs of literals (between 20 and 252 literals by default) are concatenated and encoded as a separate tuple. The number of unique variant bit vectors will naturally increase with the number of samples. Most of them will be vectors with only one or two set bits, so they should not influence the archive size significantly. The type of a tuple is indicated by its first field, a flag ftype. Overall, there are six possible tuple types at the current position j (a sketch of the encoding loop is given after this list):

⟨fzero_run, num0⟩: a run of num0 zero bytes; the position j is updated to j+num0,

⟨fone_run, num1⟩: a run of num1 one bytes (all bits set); the position j is updated to j+num1,

⟨fmatch, uid, len⟩: a match of length len to the vector with id uid; the position j is updated to j+len,

⟨fmatch_same, len⟩: a match of length len to the vector with the same id as in the previous match; the position j is updated to j+len,

⟨fliteral, bv⟩: a literal, where bv is the value of the byte; the position j is increased by 1,

⟨fliteral_run, n, bv1, bv2, …, bvn⟩: a run of n literals, where bv1, bv2, …, bvn are the values of the consecutive bytes; the position j is increased by n.
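The sketch below (Python, illustrative only) shows the greedy encoding loop for one unique byte vector. For brevity it scans the previously encoded vectors directly instead of using the hash tables, and it omits the match-depth limit, the fmatch_same tuple and literal-run grouping; rmin and the minimum match length h follow the defaults given above.

```python
# Illustrative sketch of turning one unique byte vector into tuples.
RMIN, H_MIN = 2, 5

def run_length(vec, j, byte):
    k = j
    while k < len(vec) and vec[k] == byte:
        k += 1
    return k - j

def longest_match(vec, j, previous):
    """Longest common substring starting at the same position j in any
    previously encoded unique vector; returns (uid, length)."""
    best_uid, best_len = None, 0
    for uid, prev in enumerate(previous):
        k = j
        while k < len(vec) and vec[k] == prev[k]:
            k += 1
        if k - j > best_len:
            best_uid, best_len = uid, k - j
    return best_uid, best_len

def encode(vec, previous):
    tuples, j = [], 0
    while j < len(vec):
        for byte, flag in ((0x00, 'fzero_run'), (0xFF, 'fone_run')):
            r = run_length(vec, j, byte)
            if r >= RMIN:                     # runs take priority
                tuples.append((flag, r))
                j += r
                break
        else:
            uid, mlen = longest_match(vec, j, previous)
            if mlen >= H_MIN:
                tuples.append(('fmatch', uid, mlen))
                j += mlen
            else:
                tuples.append(('fliteral', vec[j]))
                j += 1
    return tuples

if __name__ == "__main__":
    prev = [bytes([3, 9, 9, 9, 9, 9, 1, 2])]
    cur = bytes([7, 9, 9, 9, 9, 9, 0, 0])
    print(encode(cur, prev))
    # [('fliteral', 7), ('fmatch', 0, 5), ('fzero_run', 2)]
```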
2.5 Merging blocks

The ⌈V/bsize⌉ processed blocks of genotype data are gathered and concatenated in the order of their appearance in the input VCF file (the ith block is added, for i = 1 to i = ⌈V/bsize⌉). The bit vectors Eeven^i, Eodd^i, Ceven^i and Codd^i are concatenated into four global bit vectors of size V: Eeven, Eodd, Ceven and Codd. These vectors are kept in a succinct form. The Corigin array gathers all Corigin^i arrays, adjusting the stored uids (the number of unique vectors in all previous blocks is added to each uid from the current block). For subsequent copied vectors, it refers to the uids of the original unique vectors among all unique vectors. Every uid in Corigin is then delta coded (the difference between the uid of the next unique vector and the original uid is calculated) with the minimum necessary number of bits. The encoded unique variant bit vectors are stored in a byte array U. The flags, literals (bv), lengths of matches (len) and lengths of runs of zeros (num0) and ones (num1) are compressed (separately) with an entropy coder (we use a Huffman coder) in the context of the total number of set bits in the current variant bit vector (eight separate groups by default). For each literal run, the number of bits it occupies is kept. For matches, the ids of the matched unique vectors (uids) are delta coded with the minimum necessary number of bits. The starting positions of all unique vectors in U are stored in a byte array Upos, where the position of every 1025th unique vector is stored using 4 bytes, whereas for the 1024 subsequent vectors only the difference (between the current vector position and the nearest previous position encoded with full bytes) is stored, using d bits, where d is the minimum number of bits necessary to store any encoded position difference. Finally, the permutations of all blocks, P^i, are concatenated into a single array of permutations, P. It is stored with the minimum necessary number of bits. The id of the variant bit vector to decompress is enough to find the right permutation in P.

2.6 Design of the compressed data structure

The main components of the GTC data structure are as follows:

Eeven and Eodd: two bit vectors, each of size V, indicating whether subsequent variant bit vectors (among all vectors for the lower or higher bits of the corresponding haplotypes, respectively) are zero-only vectors,

Ceven and Codd: two bit vectors, each of size V, indicating whether subsequent variant bit vectors (among all vectors for the lower or higher bits of the corresponding haplotypes, respectively) are copies of other vectors,

Corigin: ids of the original vectors (among all unique vectors) for successive vectors being a copy; delta coding and the minimum necessary number of bits are used to store each id,

U: byte array storing the unique variant bit vectors, encoded into tuples and compressed (as described above),

Upos: byte array with the positions of the subsequent unique variant bit vectors in the U structure; the full position is stored for every 1025th vector, the positions of the remaining vectors are delta coded,

P: byte array storing the permutations of subsequent blocks; a single permutation is a sequence of haplotype ids reflecting the order in which they appear in the block. Each id is stored with the minimum necessary number of bits (exactly ⌈log2(H)⌉ bits).

The Eeven, Eodd, Ceven and Codd bit vectors are represented by the compressed structure (Raman et al., 2007; Navarro and Providel, 2012) implemented in the Succinct Data Structure Library (SDSL) (Gog et al., 2014).
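The Upos indexing scheme can be illustrated with the following Python sketch (not the actual byte-level layout; anchors and deltas are returned as plain lists, and the sampling period of 1025 follows the description above):

```python
# Illustrative sketch: store a full position for every 1025th unique vector
# and, for the vectors in between, only the difference from the last full
# position, using d bits, where d is the smallest width fitting every
# difference.  GTC packs these fields into a byte array; here they are kept
# as plain lists, and anchors carry a dummy delta of 0 for simplicity.

SAMPLE_EVERY = 1025

def build_upos(positions):
    """positions: starting offsets of the unique vectors in U (increasing)."""
    anchors, deltas = [], []
    for i, p in enumerate(positions):
        if i % SAMPLE_EVERY == 0:
            anchors.append(p)               # stored on 4 bytes in GTC
            deltas.append(0)
        else:
            deltas.append(p - anchors[-1])  # stored on d bits in GTC
    d = max(deltas).bit_length() if max(deltas) else 1
    return anchors, deltas, d

def lookup(anchors, deltas, i):
    """Recover the position of the ith unique vector."""
    return anchors[i // SAMPLE_EVERY] + deltas[i]

if __name__ == "__main__":
    pos = [0, 17, 40, 41, 90, 130]           # toy offsets in U
    anchors, deltas, d = build_upos(pos)
    print(anchors, deltas, d)                # [0] [0, 17, 40, 41, 90, 130] 8
    print(lookup(anchors, deltas, 4))        # 90
```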
2.7 Queries

By default, the entire collection is decompressed into a VCF/BCF file. It is possible to restrict the query by applying additional conditions; only variants and samples meeting all specified conditions are decompressed. The following restrictions are possible:

range condition: specifies the chromosome and a range of positions within the chromosome,

sample condition: specifies a sample or samples,

alternate allele frequency/count condition: specifies the minimum/maximum count/frequency of the alternate allele among the selected samples for each variant site,

variant count condition: specifies the maximum number of variant sites to decompress.

Two decompression algorithms are possible, depending on the chosen query parameters. For most queries, a variant-oriented approach is used. A sample-oriented approach is applied in queries about a relatively small number of samples (up to 2000), but only if the decompressed range of variant sites spans multiple blocks.

Variant-oriented approach. In this algorithm, the two vectors representing a variant site are fully decompressed. Their ids are known owing to the _row variable in the BCF with the description of variant sites. Initially, the Eeven, Eodd, Ceven and Codd bit vectors are used to determine the category of the variant bit vector. Decompression of an empty vector is straightforward. For a copied vector, rank operations on the Eeven, Eodd, Ceven and Codd bit vectors are used to determine which copy it is, whereas the id of the original, unique vector is found using the Corigin array. The Upos array is used to find the position of the unique variant bit vector in the U array. The consecutive bytes of the variant bit vector are decompressed by decompressing and decoding all flags, literals, lengths of runs of zeros and ones, and matches. If a match is encountered, the variant bit vector containing the match is not fully decompressed. Instead, the complete decoding of all irrelevant tuples from the beginning of that vector up to the match position is skipped as far as possible. For example, the stored bit length of a literal run allows skipping the run without time-consuming literal decoding. The P array helps to find the original order of haplotypes. If only a subset of samples is queried, only those samples are reported. If a range of variant sites is queried, the decompression is sped up by keeping track of an adequate number of previous, already decompressed unique variant bit vectors. Moreover, the permutation of haplotypes only needs to be read from the P array at the beginning of a block, not for each variant site separately.

Sample-oriented approach. In this approach, all haplotypes of the queried samples are decompressed. For example, if a sample is diploid, two haplotypes are decompressed; for more samples, more haplotypes are decompressed. The decompression starts at the beginning of the block containing the first queried variant site. The P array is used at the start of each block to find the positions of the bytes containing the queried haplotypes in the permuted variant vectors. Each variant vector is partly decoded once. The bytes containing information about the decompressed haplotypes are fully decoded and stored; the complete decoding of other bytes is skipped, if possible. As the previously decoded bytes are kept, if a match covering a decompressed haplotype is encountered (only a match within the same block is possible), the byte value can be read immediately. The decompressed bytes are used to retrieve the values of the queried haplotypes (a dibit, i.e. two single bits in two successive variant bit vectors).
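The way the variant-oriented approach locates a single variant bit vector before decoding can be sketched as follows (Python, illustrative only; plain lists and a linear-time rank stand in for the compressed SDSL bitmaps, and the even/odd split of E and C is omitted):

```python
# Illustrative sketch of locating one variant bit vector using the E/C bit
# vectors, Corigin and Upos.  GTC performs the rank operations on
# compressed bitmaps in constant time.

def rank1(bits, i):
    """Number of set bits in bits[0..i-1]."""
    return sum(bits[:i])

def locate(vid, E, C, Corigin, Upos):
    """Classify vector `vid` and return ('empty', None) or
    ('copy'/'unique', offset of its encoding in U)."""
    if E[vid]:
        return 'empty', None
    if C[vid]:
        copy_idx = rank1(C, vid)          # which copy this is
        uid = Corigin[copy_idx]           # id of its original unique vector
    else:
        # Unique vectors are numbered in order of appearance:
        # all vectors so far, minus the empty and copied ones.
        uid = vid - rank1(E, vid) - rank1(C, vid)
    return ('copy' if C[vid] else 'unique'), Upos[uid]

if __name__ == "__main__":
    #        ids:  0      1      2      3
    E       = [True,  False, False, False]
    C       = [False, False, True,  False]
    Corigin = [0]                 # vector 2 copies unique vector 0
    Upos    = [0, 57]             # offsets of the two unique vectors in U
    print(locate(2, E, C, Corigin, Upos))   # ('copy', 0)
    print(locate(3, E, C, Corigin, Upos))   # ('unique', 57)
```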
3 Results

3.1 Datasets

To evaluate GTC and its competitors, we picked two large collections of Homo sapiens genomes: the 1000 GP Phase 3 (2504 diploid genotypes) and the HRC (27 165 diploid genotypes). The 1000 Genomes Project data were downloaded from the project web site (details in the Supplementary Material). The HRC collection used in our experiments was obtained from the European Genome-phenome Archive (accession number: EGAS00000000029; details in the Supplementary Material) and is slightly smaller than the full dataset (containing 32 488 samples; unfortunately unavailable for public use).

3.2 Compression ratios and times

Table 1 summarizes the characteristics of the datasets and the compression capabilities of gzip, GQT, BGT, SA and GTC. The GTC archive turned out to be about two times more compact than the BGT archive and about ten times more compact than the gzipped VCF file. The SA archive size is between GTC and BGT for the smaller collection but is noticeably larger than BGT for the larger one.

Table 1. Datasets used in the experiments. The sizes of the compressed representations produced by gzip, GQT, BGT, SA and GTC are also presented.

Dataset     No. of genot.  No. of var. (M)  VCF (GB)  VCF-gzip (GB)  GQT (GB)  BGT (GB)  SA (GB)  GTC (GB)
1000GP      2504           85.20            853       17.41          14.34     3.87      2.93     1.81
HRC         27 165         40.36            4300      75.01          84.42     7.01      12.68    3.98
HRC Chr11   27 165         1.94             210       3.42           4.11      0.33      0.62     0.18

For a detailed examination of the compression ratios, we reduced the HRC dataset to collections containing only Chromosome 11 variants and 1000, 2000, etc. samples. Figure 2a shows that the relative sizes of the BGT and GTC archives are similar over the examined range (see also Supplementary Table S1). A more careful examination shows that the compressed sizes of BGT and GTC grow slightly sub-linearly with a growing number of samples. Moreover, the ratio between BGT and GTC is almost constant for more than 5000 samples, which suggests that it would remain the same for even larger collections. SA seems to scale worse than BGT and GTC as the collection size grows.

Fig. 2. Comparison of compressed data structures for VCF (gzipped, VCF.gz) with BCFtools (BCF) used to access it, GQT, BGT, SA and GTC. (a) Sizes of compressed sampled archives of HRC data. (b) Components of the GTC archive for HRC data. (c) Decompression times for HRC Chromosome 11 data. (d) Single variant range query times for HRC Chromosome 11 data. (e) Variants in a range of size 23 239 bp query times for HRC Chromosome 11 data. (f) Single sample query times for HRC Chromosome 11 data.
(g) Single variant for selected samples query times for HRC Chromosome 11 data. (h) Variants in a range of size 10 Mb for selected samples query times for HRC Chromosome 11 data. (i) All variants for selected samples query times for HRC Chromosome 11 data. (j) Single sample for a selected range of the chromosome query times for HRC Chromosome 11 data. (k) 1000 samples for a selected range of the chromosome query times for HRC Chromosome 11 data. (l) All samples for a selected range of the chromosome query times for HRC Chromosome 11 data.

It is also interesting to look at the sizes of the components of the GTC archive for the complete HRC data. As can be observed in Figure 2b, the majority of the archive (62.5%) is taken by the description of matches, 0-runs and so on (Core), and 30.8% by the description of the permutations in blocks (Permutation). Finally, 6.5% is taken by the description of variants (position, reference and nonreference alleles, etc.) and 0.2% by the list of sample names (Metadata). For smaller block sizes, the haplotypes can be compressed better, but the description of the permutations consumes more space (Supplementary Fig. S8). In Supplementary Figures S2-S7, we also show the influence of various parameters of GTC (such as the block size, minimal match length, etc.) on the compression ratio. The compression times of BGT, SA and GTC are similar (slightly more than a day for the complete HRC collection on an Intel Xeon-based workstation clocked at 2.3 GHz), as they are dominated by reading and parsing the huge input gzipped VCF file.

3.3 Queries to the compressed data structure

The great compression would, however, be of little value without the ability to answer queries of various types. Figure 2c-l shows the times of answering various types of queries by GTC, BGT, SA and BCFtools (BCF), for Chromosome 11 data containing various numbers of samples (various collection sizes, Fig. 2c-f) or for the whole collection (27 165 samples, Fig. 2g-l). The GTC decompression of the whole collection of variants is from 7 times (1000-sample subset) to 13 times (all samples) faster than with BGT and about 3 times faster than with SA (Fig. 2c). The GTC extraction of a single variant or a genomic range of 23 239 bases (the median size of a human gene) is up to six times faster than with BGT (Fig. 2d and e). It is also noteworthy that BCFtools is almost as fast as GTC for a single variant query.
A somewhat different situation can be observed in Figure 2f, where the times of decompressing a single sample are presented. For collections not larger than 5000 samples, GTC is the fastest, but then BGT takes the lead, becoming 1.5 times faster for the complete collection. Nevertheless, both GTC and BGT answer the query in a few seconds. As expected, BCFtools is clearly the slowest. Figure 2g-l shows the times of answering queries about different ranges of the genome and various numbers of samples for the complete Chromosome 11 collection (27 165 samples). The BGT extraction times are slightly better than those of GTC only when a small number of samples is queried (up to about one thousand samples for small variant ranges and up to a few hundred samples for large variant ranges). For more samples in a query, GTC takes the lead, whereas the relative performance of BGT decreases (e.g. for the extraction of a single variant and 10 000 samples, BCFtools is the second best method). When extracting all samples, GTC is about ten times faster than BGT. Moreover, the advantage of GTC grows slightly as the range extends. It can also be noted that the relative performance of BCFtools degrades when the range width exceeds a few thousand bases. From the pure compression-ratio perspective, it is worth noting that the average size of the genotype information in the GTC archive for a single sample is just about 146 KB. An even better result would be possible if we resigned from supporting fast queries. To date, the best pure compressor of VCF files is TGC (Deorowicz et al., 2013). The recently proposed GTRAC (Tatwawadi et al., 2016) is a modified version of TGC, supporting some types of queries, for example, for single variants (or ranges of variants) or single samples. Unfortunately, the output of GTRAC queries is just a binary file, so no direct comparison with VCF/BCF-producing compressors (BGT, GTC and SA) is possible. Moreover, we were unable to run GTRAC on the examined datasets. Nevertheless, we modified our compressor to produce similar output and support similar query types, and ran both GTC and GTRAC on the 1000 GP Phase 1 data containing 1092 genotypes and 39.7 M variants. The compressed archives were of size 622 MB (GTC) and 1002 MB (GTRAC). Queries for single variants were answered in 7 ms (GTC) and 47 ms (GTRAC). Queries for samples were answered in similar times, about 2 s for Chromosome 11 data.

4 Conclusions

The widely used VCFtools and BCFtools offer relatively quick access to VCF, gzipped VCF or BCF files when we query a specified variant or variant range. The situation, however, changes when we focus on sample queries, which are slow. Another drawback of using these tools is that the gzipped VCF or BCF files are quite large, which causes problems with storage and transfer. When the user is interested only in the genotype data from the VCF files, much better solutions are possible, that is, the state-of-the-art BGT and SA. Nevertheless, GTC, proposed in this article, has several advantages over them. The most significant one is a new compression algorithm designed with fast query support in mind. The data structure is so small that it can be stored in the main memory of even commodity workstations or laptops, giving impressive query times. Even when run on a powerful workstation, GTC often outperforms the competitors in query times by an order of magnitude. The scalability experiments suggest that much larger collections can also be maintained effectively.
Current work focuses on the compression of genotypes only. Compressing and querying other data that may be stored in VCF files, e.g. genotype likelihoods, would be hard to introduce considering the wide range of possible values. Quantization could simplify the task, but without further study it is hard to tell whether the similarities and redundancies in the input data would allow satisfying results to be achieved. In our future work, we plan to investigate this issue thoroughly and possibly extend the types of VCF fields that can be compressed with our tool. Moreover, we want to take a closer look at all parameters that were chosen experimentally and possibly make them dependent on additional factors specific to the input data.

Funding

This work was supported by the National Science Centre, Poland, under project DEC-2015/17/B/ST6/01890. We used the infrastructure supported by 'GeCONiI-Upper Silesian Center for Computational Science and Engineering' (POIG.02.03.01-24-099/13).

Conflict of Interest: none declared.

References

Danecek P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.
Deorowicz S. et al. (2013) Genome compression: a novel approach for large collections. Bioinformatics, 29, 2572-2578.
Deorowicz S., Grabowski S. (2013) Data compression for sequencing data. Algorithms Mol. Biol., 8, 25.
Durbin R. (2014) Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics, 30, 1266-1272.
Gog S. et al. (2014) From theory to practice: plug and play with succinct data structures. Lect. Notes Comput. Sci., 8504, 326-337.
Johnson D.S., McGeoch L.A. (1997) The traveling salesman problem: a case study in local optimization. In: Aarts E.H.L., Lenstra J.K. (eds) Local Search in Combinatorial Optimisation. John Wiley and Sons, London, UK, pp. 215-310.
Knuth D.E. (1998) The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley Professional, USA, pp. 426-458.
Layer R.M. et al. (2016) Efficient genotype compression and analysis of large genetic-variation data sets. Nat. Methods, 13, 63-65.
Lek M. et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536, 285-291.
Li H. et al. (2009) The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics, 25, 2078-2079.
Li H. (2016) BGT: efficient and flexible genotype query across many samples. Bioinformatics, 32, 590-592.
McCarthy S. et al. (2016) A reference panel of 64,976 haplotypes for genome imputation. Nat. Genet., 48, 1279-1283.
Navarro G., Providel E. (2012) Fast, small, simple rank/select on bitmaps. Lect. Notes Comput. Sci., 7276, 295-306.
Numanagić I. et al. (2016) Comparison of high-throughput sequencing data compression tools. Nat. Methods, 13, 1005-1008.
Raman R. et al. (2007) Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, 3, 43.
Salomon D., Motta G. (2010) Handbook of Data Compression. Springer, London, UK.
Stephens Z.D. et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol., 13, e1002195.
Sudmant P.H. et al. (2015) An integrated map of structural variation in 2,504 human genomes. Nature, 526, 75-81.
Tatwawadi K. et al. (2016) GTRAC: fast retrieval from compressed collections of genomic variants. Bioinformatics, 32, i479-i486.
Zheng X. et al. (2017) SeqArray: a storage-efficient high-performance data format for WGS variant calls. Bioinformatics, 33, 2251-2257.

GTC: how to maintain huge genotype collections in a compressed form

Loading next page...
 
/lp/ou_press/gtc-how-to-maintain-huge-genotype-collections-in-a-compressed-form-bLBUDTso99
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty023
Publisher site
See Article on Publisher Site

Abstract

Abstract Motivation Nowadays, genome sequencing is frequently used in many research centers. In projects, such as the Haplotype Reference Consortium or the Exome Aggregation Consortium, huge databases of genotypes in large populations are determined. Together with the increasing size of these collections, the need for fast and memory frugal ways of representation and searching in them becomes crucial. Results We present GTC (GenoType Compressor), a novel compressed data structure for representation of huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest of publicly available database of about 60 000 haplotypes at about 40 million SNPs can be stored in <4 GB, while the queries related to variants are answered in a fraction of a second. Availability and implementation GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc. Contact sebastian.deorowicz@polsl.pl Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction In the last two decades, the throughput of genome sequencers increased by a few orders of magnitude. At the same time, the sequencing cost of a single human individual decreased from over 1 billion to about 1 thousand US dollars. Stephens et al. (2015) predicted that in 2025, the genomic data will be acquired at a rate of 1 zettabases/year, which means that about 2bou exabytes/year of genome data should be deposited for future analysis. In addition, the prices of data storage and transfer decrease moderately (Deorowicz and Grabowski, 2013), which means that keeping the data management costs under control becomes a real challenge. Recently, the tools for compression of sequenced reads have been benchmarked (Numanagić et al., 2016). The ability of the examined utilities to shrink the data about ten times is remarkable. Nevertheless, much more is necessary. The obvious option is to resign from the storage of raw data (in most experiments) and focus just on the results of variant calling, deposited usually in the Variant Call Format (VCF) files (Danecek et al., 2011). The famous initiatives, such as the 1000 Genomes Project (1000 GP) (Sudmant et al., 2015) or the 100 000 Genomes Project (https://www.genomicsengland.co.uk/the-100000-genomes-project-by-numbers/), deliver VCF files for thousands of samples. Moreover, the scale of current projects, similar to the Haplotype Reference Consortium (HRC) (McCarthy et al., 2016) or the Exome Aggregation Consortium (ExAC) (Lek et al., 2016), is larger by an order of magnitude. For example, the VCF files of the HRC consist of 64 976 haplotypes at about 39.2 million SNPs and occupy 4.3 TB. It is also clear that much larger databases will be formed in the near future. VCF files contain a list of variants in a collection of genomes as well as evidence of presence of a reference/nonreference allele at each specific variant position in each genome. As they are intensively searched, the applied compression scheme should support fast queries of various types. The indexed and gzipped VCF files can be effectively queried using VCFtools (Danecek et al., 2011) or BCFtools when the query is about a single variant or a range of them. Unfortunately, retrieving a sample data means time-consuming decompression and processing of a whole file. 
The recently introduced genotype query tools (GQTs) (Layer et al., 2016) made use of some specialized compression algorithm for VCF files. GQT was designed to compare sample genotypes among many variant loci, but did not allow to retrieve the specified variant as well as sample data. Shortly after that, Li (2016) proposed BGT based on the positional Burrows-Wheeler transform (Durbin, 2014). It offered more than 10-fold better compression than GQT and supported queries for genotypes as well as variants. Moreover, it allowed to restrict the range of samples according to some metadata conditions. The SeqArray (SA) library (Zheng et al., 2017) for the R programming language is yet another solution to effectively compress and browse VCF files. The applied compression is based on the LZMA algorithm (Salomon and Motta, 2010). In this article, we introduce GenoType Compressor (GTC) intended to store genetic variation data from VCF files in a highly compressed form. Our data structure, as confirmed by experiments, is much faster (often by an order of magnitude) in various types of queries. It is also significantly more compact than the competitive solutions. These features allow the users to maintain huge collections of genotype data even on typical laptops with high speed of accession and could change the way the huge collections of VCF files are processed. 2 Materials and methods 2.1 General idea and definitions GTC is a new tool for compressed representation of genotype data supporting fast queries of various types. It compresses a collection of genotypes in a VCF/BCF format and allows for queries about genotypes: at specified range of variant position (e.g. a single variant site), referred to as variant query, in case of specified samples (e.g. a single sample), referred to as sample query, with a combination of the above two.The ploidy of individuals determines the number of haplotypes that make up a single genotype. For diploid organisms, a genotype of an individual is defined by two separate haplotypes. For a precise description of the proposed algorithm, let us introduce some terms. As an input, we have a VCF/BCF file that describes H haplotypes at V biallelic sites (single variant allele). For any bit vector X, X[i] is the ith bit of this vector and |X| denotes its length. 2.2 Sketch of the algorithm In the beginning, GTC divides the variants into blocks of some number of consecutive entries bsize (according to the preliminary experiments, we picked the value bsize=3584; a justification for such a choice can be found in the Supplementary Material) and processes each block separately (Fig. 1). Bit vectors representing presence/absence of reference alleles in all haplotypes are formed. In fact, two bit vectors, referred to as variant bit vectors, are necessary to describe each variant, as four possibilities must be covered: (i) a reference allele, (ii) nonreference allele, (iii) the other nonreference allele or (iv) the unknown allele. Fig. 1. View largeDownload slide Construction of GTC archive. (a) The input VCF file with genotypes of 6 diploid samples at 12 variant sites. It is decomposed into metadata and blocks of genotypes. Each block (here: max. five variant sites) is processed separately. (b) Metadata: variant sites description (stored as a site-only BCF) and list of samples. (c) Bit vector representation of genotypes in Block 2. Each variant site is described by two variant bit vectors. 
The genotype of a haplotype, at each variant site, is described by two-bits (located in two successive variant bit vectors). (d) Permutation of haplotypes in Block 2. Resulting order of haplotypes is stored. (e) Factorization of permuted Block 2. Empty and copied vectors are marked. All unique bit vectors are described as a sequence of longest possible 0-runs, 1-runs, matches to previous variant bit vectors in the block and literals. Note: The extended version of this example can be found in Supplementary Figure S1 Fig. 1. View largeDownload slide Construction of GTC archive. (a) The input VCF file with genotypes of 6 diploid samples at 12 variant sites. It is decomposed into metadata and blocks of genotypes. Each block (here: max. five variant sites) is processed separately. (b) Metadata: variant sites description (stored as a site-only BCF) and list of samples. (c) Bit vector representation of genotypes in Block 2. Each variant site is described by two variant bit vectors. The genotype of a haplotype, at each variant site, is described by two-bits (located in two successive variant bit vectors). (d) Permutation of haplotypes in Block 2. Resulting order of haplotypes is stored. (e) Factorization of permuted Block 2. Empty and copied vectors are marked. All unique bit vectors are described as a sequence of longest possible 0-runs, 1-runs, matches to previous variant bit vectors in the block and literals. Note: The extended version of this example can be found in Supplementary Figure S1 The haplotypes in each block are independently permuted to minimize the number of differences (i.e. a Hamming distance) between successive entries. As determination of the best permutation of haplotypes is equivalent to solving the Traveling Salesperson Problem (Johnson and McGeoch, 1997), it is practically impossible to find the optimal solution in a reasonable amount of time. Thus, the nearest neighbor heuristic (Johnson and McGeoch, 1997) is used to quickly calculate a reasonably good solution. There are better algorithms to perform this task (in terms of minimizing the total number of differences between neighbor haplotypes). Unfortunately, they are too slow to be applied in our case. The description of the found permutation must be stored for each block to allow retrieval of the original data. The permuted haplotypes are compressed (still within blocks) using a hybrid of specialized techniques inspired by Ziv-Lempel compression algorithm, run length encoding, and Huffman coding (Salomon and Motta, 2010). The variant bit vectors are processed one by one. In the current variant bit vector, we look for the longest runs of 0 or 1 s, as well as longest matches (same sequences of bits) in the already processed vectors. As a result, we obtain a description of the current variant using the previous variants, which is much shorter than the original variant bit vector. The description is finally Huffman encoded to save even more space. These stages are described in detail in the following subsections, whereas Figure 1 provides an overview of the construction of a GTC archive for an example input VCF file (a more detailed overview can be found in the Supplementary Material). 2.3 Preprocessing the input VCF file Managing the input VCF file. Unphased genotypes in the input BCF/VCF are arbitrarily phased while each of the multi-allelic variant sites (described in a single line of VCF) is broken into multiple biallelic variant sites, each described in a separate line, as in preprocessing of VCF file in BGT (Li, 2016). 
Thus, there are four possible allele values for each haplotype: ‘0’ for the reference allele, ‘1’ for the nonreference allele, ‘2’ for the other non-reference allele (stored in a different line of the VCF file) and ‘.’ for the unknown allele (Fig. 1a). Extraction of metadata. The altered description of V variant sites is stored in a site-only BCF file (the HTSlib (Li et al., 2009) library is used for this task) with _row variable indicating site id added in the INFO field. A list of sample names is stored in a separate text file (Fig. 1b). Initial encoding of genotypes. The allele at each site is represented as a dibit (00 for the reference allele, 01 for the nonreference allele, 11 for the other nonreference allele and 10 for the unknown allele (Fig. 1c). Each of the V variants is represented by two variant bit vectors of size H: one for lower and one for higher bits of the dibits representing successive haplotypes. Together there are 2V variant bit vectors (at this point information about all genotypes takes 2HV bits in total). The vectors are identified by ids ranging from 0 to 2V−1. The vectors with even ids correspond to the lower bits of the dibits representing haplotype sites, whereas vectors with odd ids correspond to the higher bits. In the next stages, described in detail below, the variant bit vectors are compressed and indexed, making it possible to randomly access an arbitrary vector. Forming blocks of genotype data. The variant bit vectors representing genotype data are divided into blocks. A single block contains genotype data of bsize=3584 consecutive variant sites (value chosen experimentally), i.e. 2bsize=7168 consecutive variant bit vectors. Thus, there are ⌈V/bsize⌉ blocks (last may contain data about less than bsize variants). The blocks are processed independently to each other, in parallel (if possible). 2.4 Processing a single block of genotype data Permutation of haplotypes. The haplotypes in a block are permuted to minimize the number of differences (i.e. a Hamming distance) between neighboring haplotypes. The nearest neighbor heuristic (Johnson and McGeoch, 1997) is used to calculate a reasonably good solution in an acceptable time. The permutation of the haplotypes (their order after permutation) in the ith block is stored in an array Pi. Categorizing variant bit vectors. The variant bit vectors (representing genotype annotations) are processed one by one, in a sequential manner. They are packed into byte vectors (8 bits in a byte). A byte is then the smallest processing unit in the compression scheme. In the initial phase, each byte vector is categorized either as an empty vector (all bytes unset), a copy of a previously processed vector, or a unique vector, not equal to any of the previously processed vectors. The classification is done with the help of a dictionary structure HTvec, namely, hash table with linear probing (Knuth, 1998). The hash value is computed for each processed vector, and HTvec stores ids of all previously processed unique vectors in the block (notice that the first nonempty byte vector is a unique vector). Four bit vectors, each of size 3584 bits, are formed for the ith block: Eeveni, Eoddi, Ceveni and Coddi. The kth set bit in Eeveni/ Eoddi indicates that the kth even/odd (respectively) variant byte vector in the block is empty. The kth set bit in Ceveni/ Coddi indicates that the kth even/odd (respectively) variant byte vector in the block is a copy of one of the previous vectors. 
The Corigini array maps subsequent copied vectors with their equivalents in the set of unique variant bit vectors (keeping id of unique vector, uid). Processing unique variant bit vectors. Every unique variant bit vector, represented by byte vector (of size ⌈H/8⌉ bytes, padded with zeros), is transformed into a sequence of tuples, which represents literals, runs of zeros or ones or matches to a previously encoded vector. The encoding is done byte by byte, from left to right, starting at position j = 0. At each analyzed position, we first look for the length of the longest substring of zeros (bytes equal to zero) or ones (bytes equal to 255). If it is of length rmin⁡ or more ( rmin⁡=2 by default), we encode the sequence as a run of zeros (or ones). Otherwise, we look for the longest possible match to one of the previously encoded unique vector (identical byte substring). To increase the chance of a long match, we search for shared haplotypes that is for the longest match starting at position j in any previously processed byte vector. Matches to arbitrary parts of other vectors are accidental and thus unlikely to be long. Moreover, by restricting to matches that start at position j, matches can be encoded in fewer bits (no need to store their positions). To make the search faster, the already processed data from the vectors are indexed using hash tables (HTs). Each HT’s key consists of a sequence extracted from a vector and its position. The value is the unique vector id ( uid). HT stores keys with sequences of length h = 5. The minimum match length is equal to h (and it is impossible to find shorter matches using HT). The matching byte substring (or its parts) in a previously encoded vector can be already encoded as a match. However, we restrict a match depth that is a number of allowed vectors describing each byte in a match. By default, the maximum allowed match depth ( dmax⁡) is 100. A subsequent match to the same vector is encoded with fewer bits, as there is no need to store the id of the previous vector. If, with the match depth restriction, no sufficiently long match can be found, the current byte is encoded as a literal. The runs of literals (between 20 and 252 literals by default) are concatenated and encoded as a separate tuple. The number of unique variant bit vectors will naturally increase with increasing number of samples. Most of them will be vectors with only one or two set bits, so they should not influence the archive size significantly. The type of a tuple is indicated by its first field, a flag ftype. Overall, there are six possible tuple types at the current position j: ⟨fzero_run,num0⟩—a run of num0 zero bytes, the position j is updated to j+num0, ⟨fone_run,num1⟩—a run of num1 one bytes (all bits set), the position j is updated to j+num1, ⟨fmatch,uid,len⟩—a match of length len to a vector with id uid, the position j is updated to j+len, ⟨fmatch_same,len⟩—a match of length len to a vector with the same id as the previous match, the position j is updated to j+len, ⟨fliteral,bv⟩—a literal, where bv is the value of the byte, the position j is increased by 1. ⟨fliteral_run,n,bv1,bv2,…,bvn⟩—a run of n literals, where bv1,bv2,…,bvn are the values of the consecutive bytes, the position j is increased by n. 2.5 Merging blocks The ⌈V/bsize⌉ processed blocks of genotype data are gathered and concatenated in the order of their appearance in the input VCF file (each ith block is added, for i = 1 to i=⌈V/bsize⌉). 
2.5 Merging blocks

The ⌈V/bsize⌉ processed blocks of genotype data are gathered and concatenated in the order of their appearance in the input VCF file (the ith block is added, for i = 1 to i = ⌈V/bsize⌉). The bit vectors E_even^i, E_odd^i, C_even^i and C_odd^i are concatenated into four global bit vectors, each of size V: E_even, E_odd, C_even and C_odd. The vectors are kept in a succinct form. The C_origin array gathers all C_origin^i arrays, adjusting the stored uids (the number of unique vectors in all previous blocks is added to each uid from the current block). For subsequent copied vectors, it refers to the uids of the original unique vectors among all unique vectors. Every uid in C_origin is then delta coded (the difference between the uid of the next unique vector and the original uid is calculated) with the minimum necessary number of bits. The encoded unique variant bit vectors are stored in a byte array U. The flags, literals (bv), lengths of matches (len), and lengths of runs of zeros (num0) and ones (num1) are compressed (separately) with an entropy coder (we use a Huffman coder) in the context of the total number of set bits in the current variant bit vector (eight separate groups by default). For each literal run, the number of bits it occupies is kept. For matches, the ids of the matched unique vectors (uids) are delta coded with the minimum necessary number of bits. The starting positions of all unique vectors in U are stored in a byte array Upos: every 1025th unique vector position is stored using 4 bytes, whereas for the 1024 subsequent vectors only the difference between the current vector position and the nearest previous fully encoded position is stored, using d bits, where d is the minimum number of bits necessary to store any encoded position difference. Finally, the permutations of all blocks, P_i, are concatenated into a single array of permutations, P. It is stored with the minimum necessary number of bits. The id of the variant bit vector to decompress is enough to find the right permutation in P.

2.6 Design of the compressed data structure

The main components of the GTC data structure are as follows:

- E_even and E_odd: two bit vectors, each of size V, indicating whether subsequent variant bit vectors (out of all vectors for the lower or higher bits of the corresponding haplotypes, respectively) are zero-only vectors,
- C_even and C_odd: two bit vectors, each of size V, indicating whether subsequent variant bit vectors (out of all vectors for the lower or higher bits of the corresponding haplotypes, respectively) are copies of other vectors,
- C_origin: the ids of the original vectors (out of all unique vectors) for successive vectors being a copy; delta coding and the minimum necessary number of bits are used to store each id,
- U: a byte array storing the unique variant bit vectors, encoded into tuples and compressed (as described above),
- Upos: a byte array with the positions of the subsequent unique variant bit vectors in the U structure; a full position is stored for every 1025th vector, and the positions of the remaining vectors are delta coded,
- P: a byte array storing the permutations of subsequent blocks; a single permutation is a sequence of ids of haplotypes reflecting the order in which they appear in the block. Each id is stored with the minimum necessary number of bits (exactly ⌈log2 H⌉ bits).

The E_even, E_odd, C_even and C_odd bit vectors are represented by the compressed structure (Raman et al., 2007; Navarro and Providel, 2012) implemented in the Succinct Data Structure Library (SDSL) (Gog et al., 2014).
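To make the Upos layout concrete, the sketch below shows one possible rendering of such a sampled-plus-delta index: a full 32-bit position for every 1025th unique vector and fixed-width d-bit deltas in between, with a lookup of the starting position of an arbitrary unique vector. The class name SparsePositions and the bit-packing details are ours; GTC's actual byte-level layout may differ.

// Minimal sketch of a Upos-like position index (illustration, not GTC's layout).
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

class SparsePositions {
public:
    // 'positions' are the starting positions of the unique vectors in U (non-decreasing).
    explicit SparsePositions(const std::vector<uint32_t>& positions) {
        // Determine d: the minimum number of bits needed for any delta.
        uint32_t max_delta = 0;
        for (size_t i = 0; i < positions.size(); ++i)
            if (i % 1025 != 0)
                max_delta = std::max(max_delta, positions[i] - positions[i - i % 1025]);
        d_ = 1;
        while ((uint64_t(1) << d_) <= max_delta) ++d_;
        // Store a full position every 1025 vectors and d-bit deltas for the rest.
        for (size_t i = 0; i < positions.size(); ++i) {
            if (i % 1025 == 0) full_.push_back(positions[i]);
            else append_bits(positions[i] - positions[i - i % 1025]);
        }
    }

    // Starting position in U of the unique vector with the given uid.
    uint32_t position_of(size_t uid) const {
        const size_t group = uid / 1025, offset = uid % 1025;
        if (offset == 0) return full_[group];
        const size_t bit = (group * 1024 + (offset - 1)) * d_;  // 1024 deltas per group
        return full_[group] + read_bits(bit);
    }

private:
    void append_bits(uint32_t value) {
        for (unsigned b = 0; b < d_; ++b) bits_.push_back((value >> b) & 1u);
    }
    uint32_t read_bits(size_t start) const {
        uint32_t value = 0;
        for (unsigned b = 0; b < d_; ++b) value |= uint32_t(bits_[start + b]) << b;
        return value;
    }

    unsigned d_ = 1;
    std::vector<uint32_t> full_;  // one full position per 1025 vectors
    std::vector<bool> bits_;      // concatenated d-bit deltas
};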
2.7 Queries

By default, the entire collection is decompressed into a VCF/BCF file. It is possible to restrict the query by applying additional conditions; only the variants and samples meeting all specified conditions are decompressed. The following restrictions are possible:

- range condition: it specifies the chromosome and a range of positions within the chromosome,
- sample condition: it specifies a sample or samples,
- alternate allele frequency/count condition: it specifies the minimum/maximum count/frequency of the alternate allele among the selected samples for each variant site,
- variant count condition: it specifies the maximum number of variant sites to decompress.

Two decompression algorithms are possible, depending on the chosen query parameters. For most queries, a variant-oriented approach is used. A sample-oriented approach is applied in queries about a relatively small number of samples (up to 2000), but only if the decompressed range of variant sites spans multiple blocks.

Variant-oriented approach. In this algorithm, the two vectors representing a variant site are fully decompressed. Their ids are known owing to the _row variable in the BCF with the variant site descriptions. Initially, the E_even, E_odd, C_even and C_odd bit vectors are used to determine the category of the variant bit vector. Decompression of an empty vector is straightforward. For a copied vector, rank operations on the E_even, E_odd, C_even and C_odd bit vectors are used to determine its index among the copies, and the id of the original, unique vector is found using the C_origin array. The Upos array is used to find the position of the unique variant bit vector in the U array. The consecutive bytes of the variant bit vector are decompressed by decompressing and decoding all flags, literals, lengths of runs of zeros and ones, and matches. If a match is encountered, the variant bit vector containing the match is not fully decompressed. Instead, the complete decoding of all irrelevant tuples from the beginning of the vector up to the match position is skipped as far as possible; for example, the stored bit length of a literal run allows the run to be skipped without time-consuming literal decoding. The P array helps to find the original order of the haplotypes. If only a subset of samples is queried, only those samples are reported. If a range of variant sites is queried, the decompression is sped up by keeping track of an adequate number of previous, already decompressed unique variant bit vectors. Moreover, the permutation of haplotypes only needs to be read from the P array at the beginning of a block, not for each variant site separately.

Sample-oriented approach. In this approach, all haplotypes of the queried samples are decompressed; for example, for a diploid sample two haplotypes are decompressed, and for more samples correspondingly more haplotypes. The decompression starts at the beginning of the block containing the first queried variant site. The P array is used at the start of each block to find the positions of the bytes containing the queried haplotypes in the permuted variant vectors. Each variant vector is partly decoded once: the bytes containing information about the decompressed haplotypes are fully decoded and stored, while the complete decoding of the other bytes is skipped, if possible. As the previously decoded bytes are kept, if a match covering the decompressed haplotype is encountered (only a match within the same block is possible), the byte value can be read immediately. The decompressed bytes are used to retrieve the values of the queried haplotypes (a dibit, i.e. two single bits in two successive variant bit vectors).
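As a concrete illustration of the category lookup used by the variant-oriented approach, the sketch below resolves the id of a variant bit vector to either an empty vector, a copy (via C_origin), or the index of a unique vector. It is only a sketch of the logic under simplifying assumptions: GTC keeps E_even, E_odd, C_even and C_odd as SDSL compressed bit vectors with constant-time rank and stores C_origin delta coded, whereas here plain std::vector<bool> with a linear-scan rank and a flat, already merged C_origin stand in for them; the names GtcIndex and locate_vector are ours.

// Sketch of locating a variant bit vector by id using rank over the E/C vectors.
#include <cstddef>
#include <cstdint>
#include <vector>

struct GtcIndex {
    std::vector<bool> E_even, E_odd, C_even, C_odd;  // one bit per even/odd vector
    std::vector<uint32_t> C_origin;                  // uid of the original for each copy
};

// Naive rank: number of set bits in v[0..n). GTC uses constant-time succinct rank.
static size_t rank1(const std::vector<bool>& v, size_t n) {
    size_t r = 0;
    for (size_t i = 0; i < n && i < v.size(); ++i) r += v[i];
    return r;
}

enum class Category { empty, copy, unique };

struct Lookup {
    Category category;
    uint32_t uid;  // id of the unique vector to decode (unused for empty vectors)
};

Lookup locate_vector(const GtcIndex& idx, size_t id) {
    const bool even = (id % 2 == 0);
    const size_t k = id / 2;                      // index within its parity
    const auto& E = even ? idx.E_even : idx.E_odd;
    const auto& C = even ? idx.C_even : idx.C_odd;

    if (E[k]) return {Category::empty, 0};

    // Numbers of even/odd vectors with a smaller id.
    const size_t n_even_before = (id + 1) / 2;
    const size_t n_odd_before  = id / 2;

    if (C[k]) {
        // Index of this copy among all copies processed before it.
        const size_t copy_idx = rank1(idx.C_even, n_even_before) +
                                rank1(idx.C_odd,  n_odd_before);
        return {Category::copy, idx.C_origin[copy_idx]};
    }
    // A unique vector: its uid is the number of earlier non-empty, non-copy vectors.
    const size_t uid = id
        - rank1(idx.E_even, n_even_before) - rank1(idx.E_odd, n_odd_before)
        - rank1(idx.C_even, n_even_before) - rank1(idx.C_odd, n_odd_before);
    return {Category::unique, static_cast<uint32_t>(uid)};
}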
3 Results

3.1 Datasets

To evaluate GTC and its competitors, we picked two large collections of genomes of Homo sapiens: the 1000 GP Phase 3 (2504 diploid genotypes) and the HRC (27 165 diploid genotypes). The 1000 Genomes Project data were downloaded from the project web site (details in the Supplementary Material). The HRC collection used in our experiments was obtained from the European Genome-phenome Archive (accession number: EGAS00000000029; details in the Supplementary Material) and is slightly smaller than the full dataset (containing 32 488 samples), which is unfortunately unavailable for public use.

3.2 Compression ratios and times

Table 1 summarizes the characteristics of the datasets and the compression capabilities of gzip, BGT, GQT, SA and GTC. The GTC archive is about two times more compact than the BGT archive and about ten times more compact than the gzipped VCF file. The SA archive size is between GTC and BGT for the smaller collection but is noticeably larger than BGT for the larger one.

Table 1. Datasets used in the experiments

Dataset     No. of genot.  No. of var. (M)  Size (GB)
                                            VCF    VCF-gzip  GQT    BGT    SA     GTC
1000GP      2504           85.20            853    17.41     14.34  3.87   2.93   1.81
HRC         27 165         40.36            4300   75.01     84.42  7.01   12.68  3.98
HRC Chr11   27 165         1.94             210    3.42      4.11   0.33   0.62   0.18

Note: The sizes of the compressed representations with gzip, GQT, BGT, SA and GTC are also presented.

For a detailed examination of the compression ratios, we reduced the HRC dataset to collections containing only Chromosome 11 variants and 1000, 2000, etc. samples. Figure 2a shows that the relative sizes of the BGT and GTC archives are similar in the examined range (see also Supplementary Table S1). A more careful examination shows that the compressed sizes of BGT and GTC grow slightly sub-linearly with a growing number of samples. Moreover, the ratio between BGT and GTC is almost constant for more than 5000 samples, which suggests that it would remain the same for even larger collections. SA seems to scale worse than BGT and GTC as the collection size grows.

Fig. 2. Comparison of compressed data structures for VCF (gzipped, VCF.gz) with BCFtools (BCF) used to access it, GQT, BGT, SA and GTC. (a) Sizes of compressed sampled archives of HRC data. (b) Components of the GTC archive for HRC data. (c) Decompression times for HRC Chromosome 11 data. (d) Single variant range query times for HRC Chromosome 11 data. (e) Variants in a range of size 23 239 bp query times for HRC Chromosome 11 data. (f) Single sample query times for HRC Chromosome 11 data.
(g) Single variant for selected samples query times for HRC Chromosome 11 data. (h) Variants in a range of size 10 Mb for selected samples query times for HRC Chromosome 11 data. (i) All variants for selected samples query times for HRC Chromosome 11 data. (j) Single sample for the selected range of the chromosome query times for HRC Chromosome 11 data. (k) In total, 1000 samples for the selected range of the chromosome query times for HRC Chromosome 11 data. (l) All samples for the selected range of the chromosome query times for HRC Chromosome 11 data.

It is also interesting to see the sizes of the components of the GTC archive for the complete HRC data. As can be observed in Figure 2b, the majority of the archive (62.5%) is used for the description of matches, 0-runs and so on (Core), and 30.8% is used for the description of permutations in blocks (Permutation). Finally, 6.5% is used for the description of variants (position, reference and nonreference alleles, etc.) and 0.2% for the list of sample names (Metadata). For smaller block sizes, the haplotypes can be compressed better, but the description of the permutations consumes more space (Supplementary Fig. S8). In Supplementary Figures S2–S7, we also show the influence of various parameters of GTC (such as block size, minimal match length, etc.) on the compression ratio.

The compression times of BGT, SA and GTC are similar (slightly more than a day for the complete HRC collection on an Intel Xeon-based workstation clocked at 2.3 GHz), as they are dominated by reading and parsing the huge input gzipped VCF file.

3.3 Queries to the compressed data structure

The excellent compression would, however, be of little value without the ability to answer queries of various types. Figure 2c–l shows the times of answering various types of queries by GTC, BGT, SA and BCFtools (BCF), for Chromosome 11 data containing various numbers of samples (various collection sizes, Fig. 2c–f) or for the whole collection (27 165 samples, Fig. 2g–l). The GTC decompression of the whole collection of variants is from 7 times (1000-sample subset) to 13 times (all samples) faster than with BGT and about 3 times faster than with SA (Fig. 2c). The GTC extraction of a single variant or a genomic range of 23 329 bases (the median size of a human gene) is up to six times faster than with BGT (Fig. 2d and e). It is also noteworthy that BCFtools are almost as fast as GTC for a single variant query.
A somewhat different situation can be observed in Figure 2f, where the decompression times of a single sample are presented. For collections not larger than 5000 samples, GTC is the fastest, but then BGT takes the lead, becoming 1.5 times faster for the complete collection. Nevertheless, both GTC and BGT answer the query in a few seconds. As expected, BCFtools are clearly the slowest.

Figure 2g–l shows the times of answering queries about different ranges of the genome and various numbers of samples for the complete Chromosome 11 collection (27 165 samples). The BGT extraction times are slightly better than those of GTC only when a small number of samples is queried (up to about one thousand samples for small variant ranges and up to a few hundred samples for large variant ranges). For more samples in a query, GTC takes the lead, whereas the relative performance of BGT decreases (e.g. for the extraction of a single variant for 10 000 samples, BCFtools are the second best method). When extracting all samples, GTC is about ten times faster than BGT. Moreover, the advantage of GTC grows slightly when the range extends. It can also be noted that the relative performance of BCFtools degrades when the range width exceeds a few thousand bases.

From a pure compression-ratio perspective, it is worth noting that the average size of the genotype information in the GTC archive for a single sample is just about 146 KB. An even better result would be possible if we resigned from supporting fast queries. To date, the best pure compressor of VCF files is TGC (Deorowicz et al., 2013). The recently proposed GTRAC (Tatwawadi et al., 2016) is a modified version of TGC, supporting some types of queries, for example for single variants (or ranges of variants) or single samples. Unfortunately, the output of GTRAC queries is just a binary file, so no direct comparison with the VCF/BCF-producing compressors (BGT, GTC and SA) is possible. Moreover, we were unable to run GTRAC on the examined datasets. Nevertheless, we modified our compressor to produce similar output and support similar query types, and ran both GTC and GTRAC on the 1000 GP Phase 1 data containing 1092 genotypes and 39.7 M variants. The compressed archives were of size 622 MB (GTC) and 1002 MB (GTRAC). The queries for single variants were answered in 7 ms (GTC) and 47 ms (GTRAC). The queries for samples were answered in similar times, about 2 s for Chromosome 11 data.

4 Conclusions

The widely used VCFtools and BCFtools offer relatively quick access to VCF, gzipped VCF, or BCF files when we query about a specified variant or a variant range. The situation, however, changes when we focus on sample queries, which are slow. Another drawback of using these tools is that the gzipped VCF or BCF files are quite large, which causes problems with storage and transfer. When the user is interested only in the genotype data from the VCF files, much better solutions are possible, namely the state-of-the-art BGT and SA. Nevertheless, GTC, proposed in this article, has several advantages over them. The most significant one is a new compression algorithm designed with fast query support in mind. The data structure is so small that it can be stored in the main memory of even commodity workstations or laptops, giving impressive query times. Even when run on a powerful workstation, GTC often outperforms the competitors in query times by an order of magnitude. The scalability experiments suggest that much larger collections can also be maintained effectively.
Current work focuses on the compression of genotypes only. Compressing and querying other data that may be stored in VCF files, e.g. genotype likelihoods, would be hard to support, considering the wide range of possible values. Quantization could simplify the task, but without further study it is hard to tell whether the similarities and redundancies in the input data would allow satisfactory results to be achieved. In our future work, we plan to investigate this issue thoroughly and possibly extend the types of VCF fields that can be compressed with our tool. Moreover, we want to take a closer look at all parameters that were chosen experimentally and possibly make them dependent on additional factors specific to the input data.

Funding

This work was supported by the National Science Centre, Poland under project [DEC-2015/17/B/ST6/01890]. We used the infrastructure supported by ‘GeCONiI-Upper Silesian Center for Computational Science and Engineering’ [POIG.02.03.01-24-099/13].

Conflict of Interest: none declared.

References

Danecek,P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
Deorowicz,S. et al. (2013) Genome compression: a novel approach for large collections. Bioinformatics, 29, 2572–2578.
Deorowicz,S. and Grabowski,S. (2013) Data compression for sequencing data. Algorithms Mol. Biol., 8, 25.
Durbin,R. (2014) Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics, 30, 1266–1272.
Gog,S. et al. (2014) From theory to practice: plug and play with succinct data structures. Lect. Notes Comput. Sci., 8504, 326–337.
Johnson,D.S. and McGeoch,L.A. (1997) The traveling salesman problem: a case study in local optimization. In: Aarts,E.H.L. and Lenstra,J.K. (eds) Local Search in Combinatorial Optimisation. John Wiley and Sons, London, UK, pp. 215–310.
Knuth,D.E. (1998) The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley Professional, USA, pp. 426–458.
Layer,R.M. et al. (2016) Efficient genotype compression and analysis of large genetic-variation data sets. Nat. Methods, 13, 63–65.
Lek,M. et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536, 285–291.
Li,H. et al. (2009) The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics, 25, 2078–2079.
Li,H. (2016) BGT: efficient and flexible genotype query across many samples. Bioinformatics, 32, 590–592.
McCarthy,S. et al. (2016) A reference panel of 64,976 haplotypes for genome imputation. Nat. Genet., 48, 1279–1283.
Navarro,G. and Providel,E. (2012) Fast, small, simple rank/select on bitmaps. Lect. Notes Comput. Sci., 7276, 295–306.
Numanagić,I. et al. (2016) Comparison of high-throughput sequencing data compression tools. Nat. Methods, 13, 1005–1008.
Raman,R. et al. (2007) Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, 3, 43.
Salomon,D. and Motta,G. (2010) Handbook of Data Compression. Springer, London, UK.
Stephens,Z.D. et al. (2015) Big Data: astronomical or genomical? PLoS Biol., 13, e1002195.
Sudmant,P.H. et al. (2015) An integrated map of structural variation in 2,504 human genomes. Nature, 526, 75–81.
Tatwawadi,K. et al. (2016) GTRAC: fast retrieval from compressed collections of genomic variants. Bioinformatics, 32, i479–i486.
Zheng,X. et al. (2017) SeqArray: a storage-efficient high-performance data format for WGS variant calls. Bioinformatics, 33, 2251–2257.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
