Marty Brandon, E. Ruiz‐Pesini, D. Mishmar, V. Procaccio, M. Lott, Kevin Nguyen, Syawal Spolim, Upen Patil, P. Baldi, D. Wallace (2009)
MITOMASTER: a bioinformatics tool for the analysis of mitochondrial DNA sequences. Human Mutation, 30
B. Behzadi, F. Fessant (2005)
DNA Compression Challenge Revisited: A Dynamic Programming Approach
D. Huffman (1952)
A method for the construction of minimum-redundancy codes. Resonance, 11
Mark Thomas, C. Cook, K. Miller, M. Waring, E. Hagelberg (1998)
Molecular instability in the COII-tRNA(Lys) intergenic region of the human mitochondrial genome: multiple origins of the 9-bp deletion and heteroplasmy for expanded repeats. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 353(1371)
(2009)
Human genomes as email attachments
David Johnson, A. Mortazavi, R. Myers, B. Wold (2007)
Genome-Wide Mapping of in Vivo Protein-DNA Interactions. Science, 316
Eray Tuzun, A. Sharp, J. Bailey, R. Kaul, V. Morrison, Lisa Pertz, E. Haugen, H. Hayden, D. Albertson, D. Pinkel, M. Olson, E. Eichler (2005)
Fine-scale structural variation of the human genomeNature Genetics, 37
D. Wheeler, Maithreyan Srinivasan, M. Egholm, Yufeng Shen, Lei Chen, A. McGuire, Wenshe He, Yi-Ju Chen, V. Makhijani, G. Roth, Xavier Gomes, K. Tartaro, K. Tartaro, Faheem Niazi, C. Turcotte, G. Irzyk, J. Lupski, J. Lupski, C. Chinault, Xing-Zhi Song, Yue Liu, Ye Yuan, L. Nazareth, X. Qin, D. Muzny, M. Margulies, G. Weinstock, G. Weinstock, R. Gibbs, R. Gibbs, J. Rothberg, J. Rothberg (2008)
The complete genome of an individual by massively parallel DNA sequencingNature, 452
Marty Brandon, M. Lott, Kevin Nguyen, Syawal Spolim, S. Navathe, P. Baldi, D. Wallace (2004)
MITOMAP: a human mitochondrial genome database—2004 updateNucleic Acids Research, 33
I. Witten, Radford Neal, J. Cleary (1987)
Arithmetic coding for data compressionCommun. ACM, 30
Mathias Winther (1979)
Arithmetic Coding
D. Hinds, L. Stuve, Geoffrey Nilsen, E. Halperin, E. Eskin, D. Ballinger, K. Frazer, D. Cox (2008)
Hinds , in Three Human Populations Whole-Genome Patterns of Common DNA Variation
R. McEliece (1977)
The theory of information and coding
E. Ruiz‐Pesini, M. Lott, V. Procaccio, J. Poole, Marty Brandon, D. Mishmar, Christina Yi, James Kreuziger, P. Baldi, D. Wallace (2006)
An enhanced MITOMAP with a global mtDNA mutational phylogenyNucleic Acids Research, 35
D. Hinds, L. Stuve, Geoffrey Nilsen, E. Halperin, E. Eskin, D. Ballinger, K. Frazer, D. Cox (2005)
Whole-Genome Patterns of Common DNA Variation in Three Human PopulationsScience, 307
T. Cover, Joy Thomas (1991)
Elements of Information Theory
S. Anderson, A. Bankier, B. Barrell, M. Bruijn, A. Coulson, J. Drouin, J. Drouin, I. Eperon, D. Nierlich, D. Nierlich, B. Roe, B. Roe, F. Sanger, P. Schreier, A. Smith, R. Staden, I. Young, I. Young (1981)
Sequence and organization of the human mitochondrial genomeNature, 290
D. Goldstein, G. Cavalleri (2005)
Genomics: Understanding human diversityNature, 437
S. Golomb (1966)
Run-length encodings (Corresp.)IEEE Trans. Inf. Theory, 12
Guoqing Li, Lijia Ma, Chaojie Song, Zhentao Yang, Xiulan Wang, Hui Huang, Yingrui Li, Ruiqiang Li, Xiuqing Zhang, Huanming Yang, J. Wang, Jun Wang (2008)
The YH database: the first Asian diploid genome databaseNucleic Acids Research, 37
C. Feschotte, Ellen Pritham (2007)
DNA transposons and the evolution of eukaryotic genomes.Annual review of genetics, 41
K. Frazer, D. Ballinger, D. Cox, D. Hinds, L. Stuve, R. Gibbs, J. Belmont, A. Boudreau, P. Hardenbol, S. Leal, S. Pasternak, D. Wheeler, T. Willis, F. Yu, Huanming Yang, Changqing Zeng, Yang Gao, Haoran Hu, Weitao Hu, Chaohua Li, Wei Lin, Siqi Liu, Hao Pan, Xiaoli Tang, Jian Wang, Wei Wang, Jun Yu, Bo Zhang, Qingrun Zhang, Hongbin Zhao, Hui-Ping Zhao, Jun Zhou, S. Gabriel, Rachel Barry, B. Blumenstiel, Amy Camargo, M. Defelice, M. Faggart, Marie-Anne Goyette, Supriya Gupta, Jamie Moore, Huy Nguyen, R. Onofrio, Melissa Parkin, J. Roy, E. Stahl, E. Winchester, L. Ziaugra, D. Altshuler, Yan Shen, Zhijian Yao, Wei Huang, X. Chu, Yungang He, Li Jin, Yangfan Liu, Yayun Shen, Weiwei Sun, Haifeng Wang, Yi Wang, Ying Wang, Xiao-yan Xiong, Liang Xu, M. Waye, S. Tsui, H. Xue, J. Wong, L. Galver, Jian-Bing Fan, K. Gunderson, S. Murray, A. Oliphant, M. Chee, A. Montpetit, F. Chagnon, Vincent Ferretti, M. Leboeuf, J. Olivier, M. Phillips, Stéphanie Roumy, C. Sallée, A. Verner, T. Hudson, P. Kwok, Dongmei Cai, D. Koboldt, Raymond Miller, L. Pawlikowska, P. Taillon-Miller, M. Xiao, L. Tsui, W. Mak, You-Qiang Song, P. Tam, Yusuke Nakamura, T. Kawaguchi, T. Kitamoto, Takashi Morizono, A. Nagashima, Y. Ohnishi, A. Sekine, Toshihiro Tanaka, T. Tsunoda, P. Deloukas, C. Bird, Marcos Delgado, E. Dermitzakis, R. Gwilliam, S. Hunt, J. Morrison, Don Powell, B. Stranger, P. Whittaker, D. Bentley, M. Daly, P. Bakker, J. Barrett, Y. Chretien, J. Maller, S. Mccarroll, N. Patterson, I. Pe’er, A. Price, S. Purcell, D. Richter, Pardis Sabeti, R. Saxena, S. Schaffner, P. Sham, P. Varilly, Lincoln Stein, Lalitha Krishnan, A. Smith, M. Tello-Ruiz, Gudmundur Thorisson, A. Chakravarti, Peter Chen, D. Cutler, C. Kashuk, Shin Lin, G. Abecasis, W. Guan, Yun Li, Heather Munro, Zhaohui Qin, D. Thomas, G. McVean, A. Auton, L. Bottolo, Niall Cardin, S. Eyheramendy, C. Freeman, J. Marchini, S. Myers, C. Spencer, M. Stephens, P. Donnelly, L. Cardon, G. Clarke, David Evans, A. Morris, B. Weir, J. Mullikin, S. 
Sherry, M. Feolo, Andrew Skol, Houcan Zhang, I. Matsuda, Y. Fukushima, D. Macer, Eiko Suda, C. Rotimi, C. Adebamowo, I. Ajayi, Toyin Aniagwu, P. Marshall, C. Nkwodimmah, C. Royal, M. Leppert, M. Dixon, A. Peiffer, Renzong Qiu, A. Kent, Kazuto Kato, N. Niikawa, I. Adewole, B. Knoppers, Morris Foster, E. Clayton, Jessica Watkin, D. Muzny, L. Nazareth, E. Sodergren, G. Weinstock, I. Yakub, B. Birren, R. Wilson, L. Fulton, J. Rogers, J. Burton, N. Carter, C. Clee, M. Griffiths, Matthew Jones, K. McLay, R. Plumb, M. Ross, S. Sims, D. Willey, Zhu Chen, Hua Han, L. Kang, M. Godbout, J. Wallenburg, P. L'Archevêque, G. Bellemare, Koji Saeki, Hongguang Wang, Daochang An, Hongbo Fu, Qing Li, Zhen Wang, Ren-hao Wang, A. Holden, L. Brooks, J. Mcewen, M. Guyer, V. Wang, Jane Peterson, Michael Shi, J. Spiegel, L. Sung, Lynn Zacharia, F. Collins, Karen Kennedy, Ruth Jamieson, J. Stewart (2007)
A second generation human haplotype map of over 3.1 million SNPsNature, 449
Rasmus Wernersson, H. Nielsen (2005)
OligoWiz 2.0—integrating sequence feature annotation into the design of microarray probesNucleic Acids Research, 33
Alistair Moffat, V. Anh (2006)
Binary codes for locally homogeneous sequencesInf. Process. Lett., 99
R. Service (2006)
The Race for the $1000 GenomeScience, 311
Toshihiro Tanaka (2003)
The International HapMap ProjectNature, 426
Jun Wang, Wei Wang, Ruiqiang Li, Yingrui Li, G. Tian, L. Goodman, Wei Fan, Junqing Zhang, Jun Li, Juanbin Zhang, Yiran Guo, Binxiao Feng, Heng Li, Yao Lu, X. Fang, Huiqing Liang, Zhenglin Du, Dong Li, Yiqing Zhao, Yujie Hu, Zhenzhen Yang, Hancheng Zheng, Ines Hellmann, M. Inouye, J. Pool, X. Yi, J. Zhao, Jinjie Duan, Yan Zhou, J. Qin, Lijia Ma, Guoqing Li, Zhentao Yang, Guojie Zhang, Bin Yang, Chang Yu, Fang Liang, Wen-jie Li, Shaochuan Li, Dawei Li, Peixiang Ni, Jue Ruan, Qibin Li, Hong-mei Zhu, Dongyuan Liu, Zhike Lu, Ning Li, Guangwu Guo, Jianguo Zhang, Jia Ye, L. Fang, Qin Hao, Quan Chen, Yuxi Liang, Yeyang Su, A. San, Cuo Ping, Shuang Yang, Fang Chen, Li Li, Ke Zhou, Hongkun Zheng, Yuanyuan Ren, Ling Yang, Yang Gao, Guohua Yang, Zhuo Li, Xiaoli Feng, K. Kristiansen, G. Wong, R. Nielsen, R. Durbin, L. Bolund, Xiuqing Zhang, Songgang Li, Huanming Yang, Jian Wang (2008)
The diploid genome sequence of an Asian individualNature, 456
Xin Chen, Ming Li, Bin Ma, J. Tromp (2002)
DNACompress: fast and effective DNA sequence compressionBioinformatics, 18 12
D. Hirschberg, P. Baldi (2008)
Effective Compression of Monotone and Quasi-Monotone Sequences of IntegersData Compression Conference (dcc 2008)
J. Kaiser (2008)
A Plan to Capture Human Diversity in 1000 GenomesScience, 319
Alistair Moffat, Lang Stuiver (2000)
Binary Interpolative Coding for Effective Index CompressionInformation Retrieval, 3
S. Levy, G. Sutton, P. Ng, L. Feuk, A. Halpern, B. Walenz, Nelson Axelrod, Jiaqi Huang, E. Kirkness, Gennady Denisov, Yuan Lin, J. MacDonald, Andy Wing, Chun Pang, M. Shago, Timothy Stockwell, Alexia Tsiamouri, V. Bafna, V. Bansal, S. Kravitz, D. Busam, K. Beeson, T. McIntosh, K. Remington, J. Abril, J. Gill, Jon Borman, Y. Rogers, M. Frazier, S. Scherer, R. Strausberg, J. Venter (2007)
The Diploid Genome Sequence of an Individual HumanPLoS Biology, 5
Jorma Rissanen, Glen Langdon (1979)
Arithmetic CodingIBM J. Res. Dev., 23
P. Elias (1975)
Universal codeword sets and representations of the integersIEEE Trans. Inf. Theory, 21
P. Baldi, Ryan Benz, D. Hirschberg, S. Swamidass (2007)
Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and RetrievalJournal of chemical information and modeling, 47 6
H. Williams, J. Zobel (1997)
Compression of nucleotide databases for fast searchingComputer applications in the biosciences : CABIOS, 13 5
D. Mishmar, E. Ruiz‐Pesini, P. Golik, V. Macaulay, A. Clark, S. Hosseini, Marty Brandon, Kirk Easley, E. Chen, Michael Brown, R. Sukernik, A. Olckers, D. Wallace (2002)
Natural selection shaped regional mtDNA variation in humansProceedings of the National Academy of Sciences of the United States of America, 100
R. Andrews, I. Kubacka, P. Chinnery, R. Lightowlers, D. Turnbull, N. Howell (1999)
Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNANature Genetics, 23
A. Kogelnik, M. Lott, Michael Brown, S. Navathe, D. Wallace (1996)
MITOMAP: a human mitochondrial genome databaseNucleic acids research, 24 1
S. Harihara, M. Hirai, Y. Suutou, K. Shimizu, K. Omoto (1992)
Frequency of a 9-bp deletion in the mitochondrial DNA among Asian populations.Human biology, 64 2
Vol. 25 no. 14 2009, pages 1731–1738 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btp319 Genome analysis Data structures and compression algorithms for genomic sequence data 1,2,3 2,3,4 1,2,4,∗ Marty C. Brandon , Douglas C. Wallace and Pierre Baldi 1 2 3 Department of Computer Science, Institute for Genomics and Bioinformatics, Center for Molecular and Mitochondrial Medicine and Genetics and Department of Biological Chemistry, UCI, Irvine, CA 92697, USA Received on November 25, 2008; revised on April 13, 2009; accepted on May 11, 2009 Advance Access publication May 15, 2009 Associate Editor: Alfonso Valencia ABSTRACT Wheeler et al., 2008) and a project to sequence 1000 human genomes in the next few years is under way (Kaiser, 2008). And so is the race Motivation: The continuing exponential accumulation of full genome for the capability to sequence an individual human genome for less data, including full diploid human genomes, creates new challenges than $1000 within a few years (Service, 2006). Millions of human not only for understanding genomic structure, function and evolution, genome sequences could be generated within a decade or two. but also for the storage, navigation and privacy of genomic data. In addition to the obvious challenges to understand the structure, Here, we develop data structures and algorithms for the efficient function and evolution of genomes, modern high-throughput storage of genomic and other sequence data that may also facilitate sequencing (HTS) methods also raise questions about how to querying and protecting the data. efficiently represent, store, transmit, query and protect the privacy Results: The general idea is to encode only the differences between of sequence information. These questions are further reinforced if a genome sequence and a reference sequence, using absolute one takes into account also progress in synthetic biology and our or relative coordinates for the location of the differences. 
These ability to bioengineer new sequences. locations and the corresponding differential variants can be encoded Currently, publicly available genomes are typically stored as flat into binary strings using various entropy coding methods, from fixed text files in GenBank, but this approach is unlikely to scale up in codes such as Golomb and Elias codes, to variables codes, such as many ways. The storage of the diploid genomes of all currently Huffman codes. We demonstrate the approach and various tradeoffs living humans using this simple approach would take ‘GenBank’, using highly variables human mitochondrial genome sequences as without counting headers or any additional annotations, on the order a testbed. With only a partial level of optimization, 3615 genome of 36 × 10 bytes, or 36M Terabytes, an amount difficult to store sequences occupying 56 MB in GenBank are compressed down or download over the Internet, even using standard compression to only 167 KB, achieving a 345-fold compression rate, using the technologies (e.g. gzip). And even with the progress that can be revised Cambridge Reference Sequence as the reference sequence. expected with Moore’s law for storage and networking in the coming Using the consensus sequence as the reference sequence, the data years, it is likely that security and privacy issues will require can be stored using only 133 KB, corresponding to a 433-fold level additional layers of protection around genomic data. of compression, roughly a 23% improvement. Extensions to nuclear Here, we develop data structures and algorithms to begin genomes and high-throughput sequencing data are discussed. addressing these problems. These data structures allow the Availability: Data are publicly available from GenBank, the HapMap compression of genome and other sequences while facilitating web site, and the MITOMAP database. 
Supplementary materials certain classes of sequence queries by bypassing classical sequence with additional results, statistics, and software implementations alignments and dynamic programming algorithms. The approach are available from http://mammag.web.uci.edu/bin/view/Mitowiki/ is demonstrated primarily using a benchmark dataset comprising ProjectDNACompression. a few thousand individual mitochondrial genomes sequences. Contact: pfbaldi@ics.uci.edu Human mitochondrial sequences provide an excellent testbed for developing and testing efficient data structures and algorithms 1 INTRODUCTION because, unlike nuclear genome sequences, many thousands of fully sequenced mitochondrial genomes are already available, from a As high-throughput genome sequencing technologies continue to diverse population of individuals. In addition, mitochondrial genome improve, genome sequence data continue to accumulate at an sequences pose unique challenges due to their greater variability, as exponential pace. Not only do we already have the genome sequence compared with single nucleotide polymorphism (SNP) data. of thousands of viruses and bacteria and dozens of multicellular organisms from plants to humans, but we are rapidly approaching the stage where sequencing individual diploid human genomes will be 2 GENERAL APPROACH economically affordable. The first diploid human genome sequences were recently produced (Levy et al., 2007; Wang et al., 2008; In the case of multiple genomes from the same species, associated with ‘resequencing’ technologies, the flat text file approach is clearly To whom correspondence should be addressed. wasteful since for the most part the sequences are identical. Thus a © The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org 1731 [14:50 29/6/2009 Bioinformatics-btp319.tex] Page: 1731 1731–1738 M.C.Brandon et al. 
simple approach is to store a reference sequence, and then for each However, no matter what the detailed scenario is, all applications other sequence, encode only the differences (or ‘deltas’) with respect of the basic ideas hinge on a fundamental technical problem: how to the original sequence. More precisely, consider first the sequences to encode integers, representing for instance absolute or relative AACGACTAGTAATTTG and CACGTCTAGTAATGTG which are genomic addresses or read lengths, into binary strings. It is essential identical, except for a substitution in position 1 (A→C),5(A→T) to understand that the naive idea of converting integers to their binary and 14 (T→G). Each SNP can be encoded by a pair (i,X ), where i value, that is, converting a ‘5’ to ‘101’ does not work at all since is an integer encoding the position and X represents the value of the with this encoding one does not know where an integer ends and the substitution relative to the reference. Thus given the first sequence as next one begins. There are no spaces, tabs or commas available to a reference, the second one can be encoded by the string ‘1C5T14G’, separate consecutive integers in the ultimate binary format of any concatenating the coordinates of the locations at which the variations computer where only the symbols 0 and 1 are available. Thus, the occur and the SNP values at these locations. Note that with this data encoding itself must somehow contain the information necessary representation, the questions ‘Is this sequence different from the to uniquely determine the beginning and end of each information reference sequence at position i? And if so how?’ are easy to answer. item. In addition, the plain conversion of integers to binary does not Thus, the same data structure that facilitates compact representation, take into account any entropy considerations. Similarly, a general facilitates also efficient information retrieval. 
purpose compression scheme for text data, such as Lempel-Ziv Other events such as deletions and insertions can easily be (gzip), is likely to be far from optimal for genome and HTS data. accommodated in the same general scheme. For a deletion, imagine In short, we are interested in binary encoding schemes for sequences using two integers (i,l) where the first integer denotes the position of integers that can be parsed automatically and that, consistently where the deletion occurs, and the second integer represents the with information theory, are entropy efficient, in the sense that fewer length of the deletion. Likewise, for an insertion of length l, one bits are used to encode more frequent events. The goal here is not to can use the encoding i,X ...X to denote the insertion of X ...X at prescribe a single strategy to achieve this end, but rather to present 1 l 1 l position i with respect to the reference sequence. a family of related coding strategies and some of the tradeoffs that Although the basic idea is easy to understand, and not new, a would have to be optimized in a practical application, and illustrate precise implementation requires addressing a number of important the approach using highly variable mitochondrial DNA. technical issues. A first observation is that one can use local relative addresses, i.e. intervals, rather than absolute addresses. Using intervals, the above example ‘1C5T14G’ becomes ‘0C4T9G’. 3 SPECIFIC ENCODING STRATEGIES With intervals the dynamic range of the integers to be encoded may To begin with, we illustrate these issues here by considering be considerably smaller than with absolute addresses. The relatively how the integer positions i are ultimately encoded into a binary modest price to pay is that intervals must be added to recover string. From Shannon’s entropy coding theory (Cover and Thomas, absolute coordinates. 
1991; McEliece, 1977), optimal encoding of these integers from A second observation is that if the positions at which variations a compression standpoint depends on their distribution in order to occur in the population are fixed and form a relatively small subset assign shorter binary codes to more probable symbols (integers). For of all possible positions, then additional savings may result by simplicity, we distinguish two broad classes of codes: fixed codes, focusing only on those positions. If in the same schematic example such as Golomb codes (Golomb, 1965) and Elias codes (Elias, 1975), as above, one knew that the population substitutions can occur and variable codes, such as Huffman codes (Huffman, 1952). In a only at positions 1, 5 and 14, then one could, for instance, encode fixed code, the integer i is always encoded in the same way, whereas ‘1C5T14G’ simply by ‘CTG’, at the cost of keeping an additional in a variable code the encoding changes. table storing the coordinates where the variants occur, and using the letter in the reference sequences at positions where the reference sequence and the sequence under consideration are identical. This 3.1 Fixed codes: Golomb and Golomb–Rice codes approach could be suitable, for instance, for the SNP HapMap data Both Golomb codes and Elias codes encode an integer j by (The International HapMap Consortium, 2003, 2007), but may not be concatenating two bit strings: a preamble p(j), that encodes j’s suitable in other situations, where either the location of all possible scale, and a mantissa. Golomb codes were specifically developed to variations occurring in the population under consideration is not encode stationary coin flips with p = 0.5. Thus, they are known to be known in advance, or the number of such locations is very large optimal and asymptotically approach the Shannon limit if the data across the population, but not very large in a typical sequence. 
are generated by random coin flips or, equivalently, if the distribution This is the case, for instance, of mitochondrial DNA which is over the integers is geometric, although they can be used for any characterized by much higher mutation rates than nuclear DNA. other distribution. The more skewed the probability p is (towards Thus, different situations may lead to different variations of the basic 0 or 1) the greater the level of compression that can be achieved. idea. Golomb codes have one integer parameter m. Given m, any An additional technical consideration is the choice of the reference positive integer j can be written using its quotient and remainder sequence. In particular, the reference sequence does not need to be modulo m as j =j/m+ (j mod m). To encode j, the Golomb code an actual genome but can, for instance, correspond to a consensus with parameter m (Table 1) encodes the quotient and remainder by genome. While the resequencing case is of primary interest here using: due to the medical implication associated with resequencing human genomes, the same general ideas can be applied also to the case of j/m 1-bits for the quotient; de novo sequencing by using, for instance, the genome of the closest available species as the reference genome. followed by a 0, as a delimiter (unary encoding of j/m); [14:50 29/6/2009 Bioinformatics-btp319.tex] Page: 1732 1731–1738 Genomic data compression Table 1. Golomb encoding of the integers j = 0 to 8, for different values of Table 2. 
Golomb–Rice encoding of integers j = 0 − 33 with k =2(m = 4) the parameter m and k =3(m = 8) jm = 2 m = 3 m = 4 m = 5 m = 6 Number Encoding (k = 2) Number Encoding (k = 3) 0 00 00 000 000 000 0–3 0xx 0–7 0xxx 1 01 010 001 001 001 4–7 10xx 8–15 10xxx 2 100 011 010 010 0100 8–11 110xx 16–31 110xxx 3 101 100 011 0110 0101 33 11111111001 33 11110001 4 1100 1010 1000 0111 0110 5 1101 1011 1001 1000 0111 Integer j is encoded by concatenating j/2 1-bits, one 0-bit and the k least significant 6 11100 1100 1010 1001 1000 bits of j. 7 11101 11010 1011 1010 1001 8 111100 11011 11000 10110 10100 Table 3. Elias Gamma encoding Number Encoding followed by the phased-in binary code for j mod m for the remainder (described below). 2–3 01x The encoding of integers 0,...,m − 1 normally requires B =logm 4–7 001xx 8–15 0001xxx bits. If m is not a power of two, then one can sometimes use B − 1 16–31 00001xxxx bits. More specifically, in the ‘phased-in’ approach: Each integer j is encoded by concatenating logj 0’s with the binary value of j. if i < 2 −m, then encode i in binary, using (B − 1) bits; B B if i ≥ 2 −m, then encode i by i + 2 −m in binary, using B bits. 3.2 Elias codes For instance, for m = 5, i = 2 is encoded as ‘10’ using 2 (= B − 1) In the Elias Gamma coding scheme, the preamble p(m) is a string of bits, and i = 4 is encode as ‘111’ using 3 (= B) bits (Table 1). Thus zeroes of length logj, and the mantissa m(j) is the binary encoding the encoding of j requires in total j/m+ 1 +logm or j/m+ of j. More precisely, to encode the scale and value of j: 1 +logm bits (Table 1) and the codeword for the integer j +m has write logj 0-bits; one more bit than the codeword for the integer j. Unless otherwise followed by the binary value of j beginning with its most specified, all logarithms are taken to base 2. We use also ‘[logm]’to denote ‘logm or logm’. significant 1-bit. 
The entropy of the geometric distribution of the coin flip run- The length of the encoding of j is 2logj+ 1 (Table 3). The decoding lengths is given by (using q = 1 −p): is obvious: first read n 0-bits until the first 1-bit is encountered, then read n more bits to get the binary representation of j. j j Applying the relationship H (geometric) =− q plog(q p) (1) j=0 −logP(j) ≈ 2logj+ 1 (4) and provides the optimal Shannon coding lower bound on the to the integer probabilities, shows that Elias Gamma encoding −2 expected encoding length l per integer asymptotically approaches the Shannon limit for P(j) ≈ Cj . This is a power–law relationship with exponent −2 and C is a normalizing constant. Note that for both Golomb [Equation (3)] and Elias E(l) ≈ q p j/m+ 1+[logm] (2) Gamma codes [Equation (4)], several different consecutive integers j=0 can be encoded into a bit vector with the same length, hence the relationships −logP(j) ≈ length(j) is only approximate with respect under the coin flip model. Thus, the Golomb code approaches the m to geometric or power–law distributions over the integers. To be Shannon limit when q = 0.5. In particular, this ensures that for each more precise, the optimal distribution associated with the Elias integer j Gamma code can be separated into the product of a probability −l distribution over the length l given by P(l) = 2 and a uniform −logP(j) = log(q p) ≈j/m+ 1+[logm] (3) distribution over the integers having an encoding of length l given −l+1 where P(j) is the probability associated with the integer j. by P(j|l) = 2 . Finally, Golomb–Rice codes are a particularly convenient sub- More recently, new families of efficient fixed codes for integers family of Golomb codes, when m = 2 (Table 2). To encode j,we have been developed (Baldi et al., 2007; Hirschberg and Baldi, 2008; concatenate j/2 1-bits, one 0-bit and the k least significant bits Moffat and Anh, 2006; Moffat and Stuiver, 2000), for instance, in of j. 
The length of the encoding of j is thus j/2 +k + 1. The the case of increasing or quasi increasing sequences of integers, by decoding of Golomb–Rice codes is particularly simple, the position encoding only the deltas of the preambles. For sequence data, the of the 0-bit gives the value of the prefix to be followed by the next absolute addresses are increasing, and the relative addresses could k bits. be made quasi-increasing if one were to apply a fixed permutation [14:50 29/6/2009 Bioinformatics-btp319.tex] Page: 1733 1731–1738 M.C.Brandon et al. to all the sequences to be stored, at the cost of storing and using this reference sequence. Eighty sequences contained ambiguous symbols permutation (Baldi et al., 2007). which, for simplicity, were replaced by the corresponding value in the reference sequence. This replacement is without much loss of 3.3 Decoding and byte arithmetic generality since ambiguous symbols could easily be accommodated into the coding schemes, for instance as additional variation types. While the degree of compression achieved is an important criteria, the complexity and speed of decoding is also important in all the applications to be considered. For all the encoding algorithms 4.2 General statistics described above, we have also described corresponding simple There are 4577 positions along the reference sequence where at least and fast decoding algorithms. Direct implementations of the one of the other sequence deviates from the reference. In aggregate, decoding algorithms process the compressed representations bit-by- there are 122 131 bp that deviate from the reference sequence. bit; however, it is possible to implement even faster decoders, which Besides substitutions, the total number of insertion and deletion decode the compressed data byte-by-byte. These faster decoders events across all the sequences is 7119, the most frequent one being work by looking up information from precomputed tables. 
These 1 bp insertions (4615 occurrences), followed by 2 bp deletions (901). tables are indexed by: (i) all possible bytes B (ranging from 0 to 255); Some well-known variants, such as the ‘Asian-specific 9 bp deletion’ and (ii) a bit-index i (ranging from 0 to 7) which marks the position (Harihara et al., 1992; Thomas et al., 1998), also occur frequently of the decoder within the byte. These tables may store quantities such (255 occurrences). In total, there are 43 different kinds of variation as the binary value of byte B starting from bit i, the number of bits events (Tables 6 and 7). On average, a given sequence deviates from turned on in byte B starting from bit i and the unary value of byte B the reference sequence in 33.8 bp with a SD of 13.43 bp. The average starting from bit i. The exact quantities stored depend on the details number of substitutions (transition/transversions) per sequence is of a particular decoder implementation. In practice, byte arithmetic 30.69 bp. The average number of insertions per sequence is 1.69 bp considerably increases decoding speed, sometimes approaching as and the average number of deletions is 1.37 bp. much as an 8-fold improvement over the corresponding bit-by-bit The distribution of the raw intervals using the rCRS as the implementation. The exact value of the speedup depends on several reference sequence is represented in Figure 1 displaying the factors including the characteristic of the data, the exact compression logarithm of the counts versus the logarithm of the rank (in scheme and the hardware used. decreasing order of frequency). Observed intervals vary from 0 to 14 997 bp, the most frequent one being an interval of 72 (2579 3.4 Variable codes occurrences) (see interpretation in next section), followed by 687 In genomic applications, in general the integers may not have a (2418 occurrences), and followed by 5 (2130 occurrences). 
Overall well-defined distribution, in which case it is always possible to this distribution is not strongly structured. use a general entropy encoding scheme, such as Huffman coding (Cover and Thomas, 1991; Huffman, 1952; McEliece, 1977), which 4.3 Changing the Reference Sequence essentially builds a prefix code by using a binary hierarchical There are no particular reasons, beyond standardization and clustering algorithm starting from the events (integers) with the tradition, for using rCRS as the reference sequence. Furthermore, lowest probability. While Huffman coding achieves compression purely from a compression standpoint, the rCRS may not be optimal close to the entropy limit, the price to pay over fixed coding schemes such as Golomb and Elias Gamma, or the more recent codes mentioned above, is the storage of the Huffman table which can be quite large in some applications. However, this is a fixed cost with respect to the database size, and therefore whether this cost is acceptable or not depends on the specific application. Small gains in compression over Huffman coding may be obtained using arithmetic coding (Rissanen and Langdonr, 1979; Witten et al., 1987), but at a non-trivial price in the complexity of computations. 4 RESULTS 4.1 Data extraction To demonstrate the general approach, 3615 human mitochondrial sequences were downloaded from a recent version of GenBank. We focused on the sequences alone, ignoring any header and any other exogenous information. We first use the revised Cambridge Reference Sequence (rCRS) sequence (GenBank accession number: 0 1 2 3 4 5 6 7 8 AC_000021) as the reference sequence (Brandon et al., 2005; log of runlength rank Ruiz-Pesini et al., 2007). The reference sequence is 16 568 bp long. Among the other sequences, 2671 correspond to complete Fig. 1. Distribution of intervals between variations using a log rank-log genomes, while the remaining 944 correspond only to the coding frequency plot. 
x-axis represents the logarithm of the rank associated region sequence, which is about ∼1100 bp shorter than the full with decreasing interval frequencies. y-axis represents the logarithm of the genome sequence, and extends from position 577 to 16 023 of the corresponding counts. [14:50 29/6/2009 Bioinformatics-btp319.tex] Page: 1734 1731–1738 log of counts Genomic data compression due to biases in data. To illustrate this point, we computed the 4.4 Encoding and compression haplotype distribution of the data using the simplified haplotype We explored and compared different encoding schemes using both classification described in Figure 2 (see also, Brandon et al., 2009; fixed and variable codes. The main sample of results is given in Mishmar et al., 2003). We find the following skewed distribution: Tables 4 and 5 giving the average number of bits required to encode 11.2% African (405 sequences), 26.3% Asian (950 sequences) and an interval or a variant, using Huffman, Golomb or Elias Gamma 62.5% EurAsian (2260 sequences). In addition, it is well known that codes, with the rCRS or the consensus sequence, as well as the total the original Cambridge Reference sequence contains a number of number of bits required to encode the entire data. The Huffman errors and has been revised over the years (Anderson et al., 1981; coding tables for the events are given in Tables 6 and 7 for the rCRS Andrews et al., 1999). (The revisions to the original sequence are and consensus sequence, respectively. described at: http://www.mitomap.org/mitoseq.html.) This alone, for As can be seen in Table 5, Huffman coding achieves slightly better instance, explains why the interval 72 is so frequent with respect to compression rates than Golomb or Elias Gamma coding, with a the rCRS: the rCRS sequence hasaGinthe corresponding position, table storage cost that may be manageable in this case. The raw which is a very rare variant, most likely an error. data takes 56 MB (58 817 584 bytes) of space. 
By concatenating the Thus, it is clear that other reference sequences could be used Huffman codes for the intervals and the variants (H+H), the encoded to improve compression rates and minimize the total number of data requires only 167 KB of space, corresponding to a 345-fold variants. Furthermore, the reference sequence does not need to be level of compression. Using, for instance, Golomb codes for both a sequence from an actual individual, but could be designed using the intervals and the variants (G+G) requires instead 195 KB. The purely statistical considerations. Note that the design of the reference choice of the reference sequence has a noticeable effect. Although sequence impacts not only the variants to be recorded, but also the average number of bits required to encode an interval or a the intervals, and therefore it must also take into consideration any variant is slightly higher for the consensus sequence (Table 4), this constraints a particular implementation may place on the intervals is compensated by a considerable decrease in the total number of and their encodings. A reasonable choice adopted here to try to variants to be encoded. This is true here even with a consensus further improve the compression rate, is to use the consensus sequence that differs from the rCRS sequence by only 11 nt. As sequence, derived by computing the consensus at each position, as shown in Table 5, the same encoding method based on using the reference sequence. two Huffman codes (H+H), applied with the consensus sequence, Using the consensus sequence, observed intervals vary from 0 requires only 133 KB to store the entire data. This corresponds to a to 11 717 bp, the most frequent one being an interval of 5 (2104 433-fold level of compression, roughly a 23% improvement. occurrences), followed by 1 (1251 occurrences) and followed by 259 (895 occurrences). 
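To make the concatenated interval+variant scheme above concrete, the sketch below encodes a toy list of (interval, variant) pairs using a Rice code (a Golomb code whose parameter M = 2^k is a power of two) for intervals and a small hand-made prefix code for events modeled on the first rows of Table 6. The event table and the choice k = 8 are illustrative assumptions, not the parameters used in the paper.

```python
# Sketch of the interval+variant encoding (illustrative only).
# Intervals: Rice code (Golomb with M = 2**k): unary quotient, then k-bit remainder.
# Variants: a tiny prefix code modeled on the first rows of Table 6 (assumed subset).

EVENT_CODE = {"G": "11", "C": "01", "T": "00", "A": "101", "InsC": "1001"}

def rice_encode(n, k):
    """Golomb/Rice code of a non-negative integer n with parameter M = 2**k."""
    return "1" * (n >> k) + "0" + format(n & ((1 << k) - 1), f"0{k}b")

def rice_decode(bits, i, k):
    """Decode one Rice-coded integer at bit offset i; return (value, next offset)."""
    q = 0
    while bits[i] == "1":
        q += 1
        i += 1
    i += 1                                # skip the unary terminator
    return (q << k) | int(bits[i:i + k], 2), i + k

def encode_sequence(pairs, k=8):
    """Concatenate interval and variant codes, as in the G+G/H+H schemes."""
    return "".join(rice_encode(gap, k) + EVENT_CODE[ev] for gap, ev in pairs)

# A toy 'sequence': three variants at gaps 72, 687 and 5 from the previous one.
bits = encode_sequence([(72, "G"), (687, "C"), (5, "A")])
print(len(bits))  # 36 bits, instead of storing the full ~16.5 kb sequence
```

Real Golomb codes also allow non-power-of-two M (with a truncated binary remainder); the power-of-two case is used here only to keep the sketch short.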
Table 4. Comparison of the average bit cost of encoding intervals and events for Huffman, Golomb and Elias Gamma encoding schemes, using the rCRS and the consensus sequence

            Intervals                             Variants
            Huffman   Golomb   Elias Gamma       Huffman   Golomb   Elias Gamma
Cambridge   9.21      11.10    12.93             2.66      2.44     2.77
Consensus   9.75      12.03    13.86             2.44      2.59     2.97

Table 5. Total file size comparison using the rCRS and the consensus sequence, with Huffman encoding for both intervals and variants (H+H), Golomb encoding for both (G+G), or Elias Gamma encoding for both (E+E)

            H+H         G+G         E+E
Cambridge   167 (345)   195 (295)   226 (254)
Consensus   133 (433)   159 (361)   183 (314)

Numbers are given in kilobytes (1024 × 8 bits). In comparison, the raw data takes 56 MB (57 439.05 KB). Compression factors are given in parentheses.

Fig. 2. Simplified haplotype classification used in Brandon et al. (2009).

Tables 6 and 7. Huffman encoding for the event types using the rCRS (Table 6) and the consensus sequence (Table 7). Deletions (Del) are followed by their length; insertions (Ins) by their content. The two tables differ only in the four substitution rows:

Table 6 (rCRS)               Table 7 (consensus)
Variant   Count   Code       Variant   Count   Code
G         42839   11         C         26164   11
C         24753   01         A         19576   01
T         22345   00         G         18002   00
A         21003   101        T         16528   101

All insertion/deletion rows are identical in both tables:

Variant             Count   Code
InsC                3980    1001
Del2bp              901     100011
Del1bp              757     100001
InsCC               360     1000100
InsT                313     1000001
Del9bp              255     1000000
InsA                222     10001011
InsCCC              34      1000101000
InsCCCC             30      10001010110
InsG                29      10001010100
InsCCCCC            16      100010101110
InsACA              15      100010101011
InsCCCCCC           12      100010100110
InsCCCCCCC          8       1000101010101
InsAC               6       1000101001011
InsCCT              5       1000101001010
Del6bp              4       10001010111100
Del8bp              4       10001010111101
InsCCCCCTCTA        3       10001010011111
Del3bp              3       10001010101001
InsGC               3       10001010011101
InsCCCCCCCC         3       10001010101000
InsTT               3       10001010011110
Del4bp              3       10001010011100
InsACAC             2       100010101111100
InsACACA            1       100010100100000
InsCCCCCCCCC        1       100010100100111
InsTA               1       1000101011111110
InsGA               1       1000101011111101
InsGG               1       1000101011111100
InsCA               1       1000101011111011
InsAG               1       1000101011111010
InsGATCACAG         1       100010100100011
Del10bp             1       100010100100010
InsTCTCTGTTCTTTCAT  1       100010100100001
InsACACAC           1       100010100100101
InsAGAA             1       100010100100100
InsCACA             1       1000101011111111
Del5bp              1       100010100100110

5 DISCUSSION

A simple but general data structure and data encoding approach has been developed for the efficient storage of genomic data. The approach specifically leverages the homology between sequences and is different from general compression algorithms for text, or compression algorithms for single-genome data (Behzadi and Fessant, 2005; Chen et al., 2002; Williams and Zobel, 1997). The approach has been demonstrated on mitochondrial genomes, where it leads to a 2–3 orders of magnitude improvement in data storage. From these compact representations, full sequences can be recovered rapidly using the reference sequence. Furthermore, queries regarding the existence and nature of variants at particular coordinate positions, such as those arising in a variety of applications from medicine to forensics, can be answered efficiently. Additional encryption methods may be applied to these representations to protect the security of both the genomic data and the queries. The approach has been used here for lossless compression; however, it could also be used for lossy compression, for instance by ignoring variants that are not medically relevant. The approach is also applicable to other kinds of sequences, such as RNA or protein sequences.

While for demonstration purposes we have used a single reference sequence, it is clear that one could cluster the data and use different reference sequences for different subgroups. In the case of mitochondrial genomes, for instance, Figure 2 would suggest using at least three different reference sequences. Whether the gain in compression that can be expected for each subgroup, akin to the gain achieved by going from the rCRS to the consensus sequence, is worth the cost of having multiple reference sequences rather than a single one cannot be answered in generality: it depends on the details of a particular application, the number of genomes to be stored from each group, and so forth.
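Since Elias Gamma coding recurs throughout Tables 4, 5 and 8, a minimal encoder/decoder may help fix ideas. This is the standard textbook construction (Elias, 1975), not code from the paper; gamma codes are defined for integers n >= 1, so zero-valued intervals would need an offset of one. The read addresses in the example are made up.

```python
def gamma_encode(n):
    """Elias gamma code of n >= 1: (bit-length - 1) zeros, then n in binary."""
    b = format(n, "b")
    return "0" * (len(b) - 1) + b

def gamma_decode(bits, i=0):
    """Decode one gamma-coded integer at offset i; return (value, next offset)."""
    z = 0
    while bits[i + z] == "0":
        z += 1
    return int(bits[i + z:i + 2 * z + 1], 2), i + 2 * z + 1

# The frequent interval 72 costs 13 bits, in line with the ~12.9-bit average
# reported for gamma-coded rCRS intervals in Table 4.
assert len(gamma_encode(72)) == 13

# Relative (delta) addressing, as in Table 8: gamma-code the gaps between
# sorted read addresses instead of the addresses themselves.
addrs = [1000, 1036, 1060, 1302]          # hypothetical sorted read addresses
absolute = sum(len(gamma_encode(a)) for a in addrs)
deltas = [addrs[0]] + [b - a for a, b in zip(addrs, addrs[1:])]
relative = sum(len(gamma_encode(d)) for d in deltas)
print(absolute, relative)                  # relative addressing uses fewer bits
```

Because gamma codes cost roughly 2*log2(n) bits, replacing large absolute addresses with small gaps is what drives the improvement of the 'Elias Gamma relative' row over the 'Elias Gamma absolute' row in Table 8.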
For future work, the same idea of multiple reference sequences can be extended beyond the storage of genomes within a given species to the storage of genomes from multiple species, by using a phylogenetic hierarchy of reference sequences.

Finally, the approach can be extended to human nuclear genomes (see also, Christley et al., 2009) and to HTS data from different technologies and different kinds of experiments. For human SNP variation, data and statistics are readily available (Goldstein and Cavalleri, 2005; Hinds et al., 2005; The International HapMap Consortium, 2007). A comprehensive list of human SNPs is available from the dbSNP database maintained by NCBI; the current release (version 129) contains about 15 million SNPs. This data can readily be compressed using the techniques described here, and additional gains in compression can be achieved by storing separately a fixed table recording the locations of all the SNPs and leveraging the skewed distribution of some of the SNP variants. In preliminary experiments, we have achieved compression factors of over 1000 on the raw HapMap sequence data. Although SNPs account for most genetic variation events between individuals, a much larger fraction of the genome (in terms of the total number of bases) is involved in larger structural variation events, such as copy number variations (CNVs). While there have been studies attempting to derive a preliminary assessment of large-scale genomic complexity and variation (Feschotte and Pritham, 2007; Tuzun et al., 2005), statistics on the frequencies and locations of these more complex structural variations in the human genome are still at an early stage of development. For instance, comparative analysis of the single diploid genome described in Levy et al. (2007) 'revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1 288 319 were novel) included 3 213 401 single nucleotide polymorphisms (SNPs), 53 823 block substitutions (2206 bp), 292 102 heterozygous insertion/deletion events (indels) (1571 bp), 559 473 homozygous indels (182 711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure'. A better statistical understanding of the coding constraints posed by these complex events, and of how to encode them, should become possible as more full human genome sequences become available in the coming years (www.1000genomes.org).

Regarding HTS data, for illustration purposes, here we consider the problem of storing the genomic addresses of the reads from three HTS datasets associated with different HTS technologies. The first dataset is obtained from the laboratory of Dr S. Sandmeyer at UCI and comes from an experiment aimed at mapping retrotransposon Ty3 insertion sites in the yeast genome. It consists of 833 541 sequence reads, all of length 19 bp. The second dataset comes from a chromatin immunoprecipitation assay (ChIP-Seq) used to map the in vivo binding site locations of the neuron-restrictive silencer factor (NRSF) in humans (Johnson et al., 2007). It consists of 1 697 991 sequence reads, all of length 25 bp and mapped to the most recent human genome sequence (hg18). The third dataset corresponds to a full diploid human genome sequencing experiment for an Asian individual (Wang et al., 2008). This is a very large dataset, with enough reads to provide 36-fold average coverage, and we utilize the existing mapping of the reads to the human reference genome provided by the YH database (Li et al., 2009). For illustrative purposes, we report only the results corresponding to the reads associated with chromosome 22. For chromosome 22, there are 31 118 532 reads that vary in length from 30 to 40 bp, for a total of 1 108 701 700 bp of sequence data. While complete details of these experiments will be reported elsewhere, Table 8 shows the resulting compression factors, which are again in the range of 1–3 orders of magnitude, depending on the statistical properties of the datasets.

Table 8. Compression of the read address information from three HTS experiments (see text)

Encoding               DataSet 1           DataSet 2            DataSet 3
Raw Sequence           133 366 560         353 182 128          8 869 613 600
Flat File              75 525 168          185 536 864          8 396 646 344
Elias Gamma absolute   358 402 (210.73)    79 281 140 (2.34)    1 373 892 116 (6.11)
Elias Gamma relative   185 542 (407.05)    27 741 238 (6.69)    340 764 564 (24.64)
Monotone Value (MOV)   169 664 (445.15)    39 528 754 (4.69)    834 672 652 (10.06)

Sizes in bits for the raw sequence data, the corresponding flat text file format for the addresses, and the compressed files for the different compression algorithms. Elias Gamma coding is applied both to the absolute and to the relative addresses. Compression factors with respect to the flat text file format are given in parentheses. MOV is a coding algorithm specifically designed for increasing sequences of integers, described in Baldi et al. (2007).

The same techniques described here can readily be applied to storing also the lengths of the reads, the content of the reads, where they differ from the reference genome, their quality and so forth. Statistical properties of the reads and of the underlying HTS technologies, e.g. increasing error rates towards the end of a read, can also be exploited to achieve efficient compression. Thus, the data structures and compression algorithms described here provide a framework for the management of HTS genomic data that can be flexibly applied in different environments.

ACKNOWLEDGEMENTS

We thank K. Daily, S. Sandmeyer and P. Rigor for useful discussions.

Funding: National Institutes of Health Biomedical Informatics Training (grant LM-07443-01); NSF MRI (grant EIA-0321390); National Science Foundation (grant 0513376 to P.B.); National Institutes of Health (grants AG24373, DK73691, AG13154 and NS21378 to D.W.).

Conflict of Interest: none declared.

REFERENCES

Anderson,S. et al. (1981) Sequence and organization of the human mitochondrial genome. Nature, 290, 457–465.
Andrews,R. et al. (1999) Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat. Genet., 23, 147.
Baldi,P. et al. (2007) Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. J. Chem. Inf. Model., 47, 2098–2109.
Behzadi,B. and Fessant,F. (2005) DNA compression challenge revisited: a dynamic programming approach. Lect. Notes Comput. Sci., 3537, 190–200.
Brandon,M. et al. (2005) MITOMAP: a human mitochondrial genome database - 2004 update. Nucleic Acids Res., 33, D611–D613.
Brandon,M. et al. (2009) MITOMASTER: a bioinformatics tool for the analysis of mitochondrial DNA sequences. Hum. Mutat., 30, 1–6.
Chen,X. et al. (2002) DNACompress: fast and effective DNA sequence compression. Bioinformatics, 18, 1696–1698.
Christley,S. et al. (2009) Human genomes as email attachments. Bioinformatics, 25, 274–275.
Cover,T.M. and Thomas,J.A. (1991) Elements of Information Theory. John Wiley, New York.
Elias,P. (1975) Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory, 21, 194–203.
Feschotte,C. and Pritham,E. (2007) DNA transposons and the evolution of eukaryotic genomes. Annu. Rev. Genet., 41, 331.
Goldstein,D. and Cavalleri,G. (2005) Genomics: understanding human diversity. Nature, 437, 1241–1242.
Golomb,S.W. (1966) Run-length encodings. IEEE Trans. Inf. Theory, 12, 399–401.
Harihara,S. et al. (1992) Frequency of a 9-bp deletion in the mitochondrial DNA among Asian populations. Hum. Biol., 64, 161–166.
Hinds,D.A. et al. (2005) Whole genome patterns of common DNA variation in three human populations. Science, 307, 1072–1079.
Hirschberg,D.S. and Baldi,P. (2008) Effective compression of monotone and quasi-monotone sequences of integers. In Proceedings of the 2008 Data Compression Conference (DCC 08). IEEE Computer Society Press, Los Alamitos, CA.
Huffman,D. (1952) A method for the construction of minimum redundancy codes. Proc. IRE, 40, 1098–1101.
Johnson,D.S. et al. (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science, 316, 1497–1502.
Kaiser,J. (2008) A plan to capture human diversity in 1000 genomes. Science, 319, 395.
Levy,S. et al. (2007) The diploid genome sequence of an individual human. PLoS Biol., 5, e254.
Li,G. et al. (2009) The YH database: the first Asian diploid genome database. Nucleic Acids Res., 37, D1025–D1028.
McEliece,R.J. (1977) The Theory of Information and Coding. Addison-Wesley, Reading, MA.
Mishmar,D. et al. (2003) Natural selection shaped regional mtDNA variation in humans. Proc. Natl Acad. Sci. USA, 100, 171–176.
Moffat,A. and Anh,V. (2006) Binary codes for locally homogeneous sequences. Inf. Process. Lett., 99, 175–180.
Moffat,A. and Stuiver,L. (2000) Binary interpolative coding for effective index compression. Inf. Retr., 3, 25–47.
Rissanen,J.J. and Langdon,G.G. (1979) Arithmetic coding. IBM J. Res. Dev., 23, 149–162.
Ruiz-Pesini,E. et al. (2007) An enhanced MITOMAP with a global mtDNA mutational phylogeny. Nucleic Acids Res., 35, D823–D828.
Service,R.F. (2006) The race for the $1000 genome. Science, 311, 1544–1546.
The International HapMap Consortium (2003) The International HapMap Project. Nature, 426, 789–796.
The International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature, 449, 851–861.
Thomas,M. et al. (1998) Molecular instability in the COII-tRNA(Lys) intergenic region of the human mitochondrial genome: multiple origins of the 9-bp deletion and heteroplasmy for expanded repeats. Phil. Trans. R. Soc. Lond. B Biol. Sci., 353, 955–965.
Tuzun,E. et al. (2005) Fine-scale structural variation of the human genome. Nat. Genet., 37, 727–732.
Wang,J. et al. (2008) The diploid genome sequence of an Asian individual. Nature, 456, 60–65.
Wheeler,D.A. et al. (2008) The complete genome of an individual by massively parallel DNA sequencing. Nature, 452, 872–876.
Williams,H. and Zobel,J. (1997) Compression of nucleotide databases for fast searching. Bioinformatics, 13, 549–554.
Witten,I.H. et al. (1987) Arithmetic coding for data compression. Commun. ACM, 30, 520–540.
Bioinformatics – Oxford University Press
Published: May 15, 2009