ZTR: a new format for DNA sequence trace data

James K. Bonfield; Rodger Staden

doi:10.1093/bioinformatics/18.1.3

ZTR: a new format for DNA sequence trace data

Bonfield, James K.; Staden, Rodger 2002-01-01 00:00:00 Vol. 18 no. 1 2002 BIOINFORMATICS Pages 3–10 James K. Bonﬁeld and Rodger Staden MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK Received on July 2, 2001; revised on September 11, 2001; accepted on September 12, 2001 ABSTRACT require around 30 million traces, equating to 5700 Gb of Motivation: To produce an open and extensible ﬁle storage. This is for one individual for one species. At the format for DNA trace data which produces compact ﬁles time of writing (June 2001) the NCBI trace archive also suitable for large-scale storage and efﬁcient use of internet contains 23 million traces for other species. bandwidth. In 1991 our group introduced SCF format (Dear and Results: We have created an extensible format named Staden, 1992). The major motivations then were: ZTR. For a set of data taken from an ABI-3700 the (1) sequencing machine independence; ZTR format produces trace ﬁles which require 61.6% of (2) operating system independence; the disk space used by gzipped SCFv3, and which can be written and read at greater speed. The compression (3) an open and public format with sources available to algorithms used for the trace amplitudes are used within all; the National Center for Biotechnology Information (NCBI) (4) small ﬁle size; trace archive. (5) to introduce the idea of base call conﬁdence values Availability: Source code is available from ftp: and encourage their use. //ftp.mrc-lmb.cam.ac.uk/pub/staden/io lib/io lib.tar.gz. A complete format description can be found at http: This format is now the most widely used and the sources //www.mrc-lmb.cam.ac.uk/pubseq/ztr.html. Test data is are available via ftp. During the intervening years we have available from ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/ produced two major revisions. The current one (SCFv3) io lib/test data. includes the use of a ﬁnite differences function plus a Contact: jkb@mrc-lmb.cam.ac.uk reorganization of the order of the ﬁle contents to potentiate the efﬁcient use of standard compression programs. INTRODUCTION The content of trace ﬁles The genome projects performed to date are just a begin- ning, and as DNA sequencing is increasingly being used The minimum information needed in a DNA sequence for new scientiﬁc, medical and forensic purposes, the trace trace ﬁle is shown below with their percentage of the overall uncompressed SCF ﬁle size: data accumulated so far represent only a tiny fraction of the storage requirement of the future. (1) the base calls (1%); Major centres such as the National Center for Biotech- (2) base conﬁdence values (4%); nology Information (NCBI) and the European Bioin- formatics Institute (EBI) are collecting the trace data (3) the trace amplitudes for each of the four base types (88%); ﬁles from genome projects and making them available via the internet (http://www.ncbi.nlm.nih.gov/Traces/, (4) the offsets of the base calls relative to the trace http://trace.ensembl.org/). It is important that the storage coordinates (represented by the element numbers of and transfer of trace data is efﬁcient and that the format the trace values) (4%); used is easily adaptable. (5) various textual comments (sample identiﬁers, run To illustrate the size of the storage problem let us take date, etc.) (0.2%). a single human genome project as an example. Suppose The remaining ∼2.8% is in unused (marked as ‘spare’) we aim at 5-fold coverage, that we use sequencing ﬁelds. Typically ABI ﬁles will also contain additional instruments which generate Applied Biosystems’ ABI textual data plus arrays of current, voltages, temperature format trace ﬁles (typical ﬁle size 190 kb), and that we get and unprocessed trace amplitudes. on average 500 reliable bases per reading. Then we would A list of data items that today’s users may want to To whom correspondence should be addressed. store in the ﬁles is speciﬁed at the NCBI trace repository c Oxford University Press 2002 3 J.K.Bonﬁeld and R.Staden Table 1. Gzip compression ratios on a selection of ﬁle types SYSTEM AND METHODS The design of ZTR builds on this previous work and borrows new ideas taken from the PNG format (Boutell Format Original size Gzipped size Fraction et al., 1997), the public successor to the GIF image ABI 18 158 424 8 427 773 0.464 format. We wanted to reduce data size further, and also to SCFv2 7 887 845 3 881 662 0.492 incorporate additional textual information. The key design SCFv3 7 887 845 2 396 562 0.304 principles are: (1) Extensibility: we cannot easily foresee what future data may need to be stored within a trace ﬁle, so we (Gorrell et al., 2000), but we do not know what may need a mechanism of incorporating new information be required in the future. For example we originally in a way which does not invalidate the ﬁle format. suggested that it would be useful to store a conﬁdence (2) Small: a small size not only saves disk space, but value for each of the four base types at each position will reduce network usage and download times. in the sequence and provided four such slots in the SCF Rather than have a format which requires use of format. At present many use a single value, that for the external compression tools we would like the format called base only. However at least one group have a trace to specify its own compression methods. analysis program (ATQA, 1998) which in addition to an (3) Fast: ZTR ﬁle accessing should not be substantially overall conﬁdence for each base call can also calculate slower than existing SCF implementations. Given the probability of insertions and deletions at each base that gzipped SCF ﬁles are the norm, we considered position. Luckily these useful values can be stored in the this to be our target speed for both reading and spare conﬁdence value slots in the SCF ﬁle, but the format writing. we propose in this paper can readily incorporate new data types such as these. (4) Public: both the speciﬁcation and the source code for an example implementation should be freely Space saving methods available to both academic and commercial users. Lossless ﬁle compression tools save disk space, not by Extensibility deleting information, but by analyzing the data to repack the information using fewer bytes. One of the most The basic structure of a ZTR ﬁle is a header indicating the commonly used tools for this is gzip. We determined that ﬁle format, followed by zero or more data blocks. In ZTR storing the data within an SCF ﬁle in a different order can we call these blocks ‘chunks’. signiﬁcantly improve the performance of gzip. Also the The use of separate chunks for each data type implies trace amplitudes are not ideally suited to compression by new data types can be added without changing the basic gzip, but storing the differences between one value and the ﬁle structure and conversely chunks may be omitted. ZTR next reduces the signal variability. This ﬁnite differences readers should ignore chunks with unknown type and technique can be applied up to three times before com- so new ﬁles are backwards compatible. These features pression ratios start to suffer. These ideas were used to contribute to the extensibility of the ZTR format. form a new revision of SCF—version 3. Table 1 demon- The header structure is, in hex bytes: strates the compression ratios of gzip on a set of ABI-3700 8-byte magic number: AE 5A 54 52 0D 0A 1A 0A ﬁles in ABI, SCF version 2 and SCF version 3 formats. Format version, major: 01 More recently, Jean Thierry-Mieg at the NCBI produced Format version, minor: 02 a new trace format named CTF (unpublished) which compresses better than SCFv3. The CTF format speciﬁes its own compression algorithms. A raw CTF ﬁle has a The magic number includes an 8-bit byte (AE), a control- similar size to gzipped SCFv3, but is substantially faster to Z (1A) character (used to indicate end-of-ﬁle under DOS), read back. Furthermore CTF ﬁles can still be compressed both bare newline (0A) and carriage-return newline (0D further by using external programs, such as gzip, hence 0A) combinations, and the text ‘ZTR’ (5A 54 52). The giving additional space savings. purpose of this is to act as an immediate check for the more Despite the size reduction of SCFv3 and CTF we troublesome aspects of ﬁle reading and data transfer and recently felt that another completely new ﬁle structure so aid the detection of corrupt ﬁles. For example, using ftp would both enable us to make the ﬁles even smaller and to transfer a ZTR ﬁle in ASCII mode may swap newline produce an extensible format which would facilitate its use and newline-carriage return. Such ﬁles will not have a for novel future applications of DNA sequencing. We have valid ZTR header, so rather than return a corrupted ﬁle named this new binary format ZTR. the reading code will return an error. 4 ZTR: New format for DNA sequence trace data Each chunk consists of a type, meta-data and data. The integer; the same as used by Phred (Ewing and Green, chunk data is the main information we wish to store. The 1998), TraceTuner (http://www.paracel.com/tracetuner/), meta-data, which is not needed for many chunk types, is ATQA (1998) and Li-Cor base callers. typically a small amount of information about the data. CNF4. The four conﬁdence values stored in the same For example the chunk to store the digitized trace samples scale as CNF1, but with one value per base type. To aid will have the samples themselves in the data block and the compression, the conﬁdence for all the called bases (which name of the channel (A, C, G, T) in the meta-data block. defaults to T if not A, C or G) is stored ﬁrst followed by All integer values are stored using 4-byte values in big- the remaining conﬁdence values for A, C, G and T. endian format (i.e. most signiﬁcant byte ﬁrst). The chunk structure is: CSID. The conﬁdence values for substitution, insertion and deletion, stored in the −10 log (P ) scale. ATQA 4-byte chunk type: XX XX XX XX error Meta-data length (big endian): XX XX XX XX is one such program to produce these values. Meta-data: (any number of bytes, up to 2 ) Data length (big endian): XX XX XX XX CLIP. Poor quality clip points. Speciﬁed in base coor- Data: (any number of bytes, up to 2 ) dinates, this indicates where data (at both ends) should be considered as poor quality. This is included primarily The format of the meta-data and data elements is chunk for backwards compatibility with SCF—the CNF* chunk type dependent. The complete information may be found types provide more detailed information. in the on-line ZTR format speciﬁcation (http://www. mrc-lmb.cam.ac.uk/pubseq/ztr.html). COMM. User deﬁned text comments, in 8-bit ASCII. TEXT. A series of identiﬁer-value pairs stored as one or Chunk types more sets of identiﬁer, nul, value, nul terminating in an The chunk type may be considered to be a 4-character additional (i.e. double) nul character. The identiﬁers are string. Bit 5 of the ﬁrst character indicates whether this deﬁned as part of the ZTR spec, but have been taken from chunk type is part of the public ZTR speciﬁcation (in the NCBI trace repository RFC version 1.17. which case bit 5 is clear) or whether it is a private extension (bit 5 is set). Bit 5 of the remaining three CR32. A 32-bit cyclic redundancy check (ANSI X3.66) characters is reserved for future use and so currently value of all the data since the last CR32 chunk, including should always be clear. Practically speaking this means the ZTR header if appropriate. that public chunk types consist entirely of uppercase letters and private chunk types start with a lowercase letter. Compression This means that TEXT and tEXT are two completely Each chunk data block is compressed using zero or independent chunk types and the similarity of their names more ﬁltering and compression algorithms. The available does not imply a relationship between the format of their algorithm choices are: data. Also it is clear that private extensions will not clash with future public extensions. DELTA1, DELTA2, DELTA4. These apply the forward ﬁnite At present the publicly deﬁned chunk types are: differences technique to 1, 2 or 4 byte words. This replaces each 1, 2 or 4 byte word with the difference between itself SAMP. A single channel of trace samples, stored in 16- and the previous word. It does not directly decrease the bit format. size of the data. Table 2 contains an example of DELTA1 ﬁltering. SMP4. Four concatenated arrays of trace samples, stor- ing the same information as 4 SAMP chunks for the A, 16TO8, 32TO8. These attempt to store numerically small C, G and T channels. Note that both SMP4 and SAMP 16-bit and 32-bit integer values in a single 8-bit integer. chunks can be combined within the same ﬁle if desired. Values in the range of −127 to +127 are stored directly SMP4 typically gives compression ratios 4% smaller than in 8-bits. For values outside this range we emit −128 4 separate SAMP chunks, at a reduced CPU usage. followed by the actual 16 or 32-bit value. Table 3 contains an example of the 16TO8 ﬁlter type. BASE. Base calls, encoded using the NC-IUB character set (NC-IUB, 1985). FOLLOW1. This analyzes the complete data block to determine for each 8-bit value (‘x ’) which other value BPOS. A mapping of base numbers to trace sample most frequently follows it (follow (x )). Then for each byte numbers, stored as an array of 32-bit integer values. of data we store follow (previous byte)—current byte. To CNF1. The conﬁdence values for the called base type. enable reversal of this function we also prepend the data The scale must be −10 log (P ) expressed as an 8-bit block with the 256-byte follow table. error 5 J.K.Bonﬁeld and R.Staden Table 2. Example of levels 1–3 of the DELTA1 ﬁlter Level Data stream before and after the DELTA1 ﬁlters Entropy 0 +4 +7 +12 +17 +24 +30 +36 +40 +43 +43 +40 +35 +28 +21 +14 +9 7.50 1 +4 +3 +5 +5 +7 +6 +6 +4 +3 +0 −3 −5 −7 −7 −7 −5 6.16 2 +4 −1 +2 +0 +2 −1 +0 −2 −1 −3 −3 −2 −2 +0 +0 +2 4.97 3 +4 −5 +3 −2 +2 −3 +1 −2 +1 −2 +0 +1 +0 +2 +0 +2 5.62 Table 5. Summary of chunk type and the default ﬁlters and compression Table 3. An example of 16TO8 of 5 big-endian 16-bit numbers algorithms Before 00 4B 00 55 FF EB FC 22 00 BB Chunk type Filters/compressors (plus arguments) After 4B 55 EB 80 FC 22 80 00 BB SAMP/SMP4 DELTA2 (×3or ×2, depending on data range) 16TO8 FOLLOW1 Table 4. An example of the RLE compression method, using 8 as the token RLE ZLIB (Z HUFFMAN ONLY) Before567777 8 7 6 BASE/TEXT/COMM ZLIB (Z HUFFMAN ONLY) CNF1/CNF4/CSID DELTA1 (×1) After 5 6 8 4 7 8076 RLE ZLIB (Z HUFFMAN ONLY) BPOS DELTA4 (×1) 32TO8 RLE. Run length encoding. If 4 or more identical 8-bit ZLIB (Z HUFFMAN ONLY) values are detected in a row then RLE replaces this data with a special token followed by the number of repeated bytes and the value. If the token itself is within the raw data then it is output followed by zero. The token may as 16-bit quantities regardless of their actual scale. 8-bit be chosen to be a symbol with a low natural frequency. data (0–255) is compressed best using DELTA2 with 2 Table 4 contains an example of the RLE compression rounds, whereas full 16-bit data is compressed best using method. 3 rounds. ZLIB. Uses the zlib library (Deutsch and Gailly, 1996) to IMPLEMENTATION apply the LZ77 compression algorithm followed by Huff- The source code implementing the ZTR format is con- man encoding (Huffman, 1952). Zlib allows for Huffman tained within a library named ‘io lib’. This library also encoding only (denoted below as Z HUFFMAN ONLY), implements read-only support for the Applied Biosys- which for trace data typically reduces ﬁle size more than tems’ ABI and Pharmacia’s ALF format trace ﬁles and LZ77 and is faster. However all valid zlib streams are al- read–write support for the SCF and CTF formats. Io lib is lowed within a ZTR ﬁle. coded using ANSI C and is known to work on both UNIX The ﬁrst byte of the encoded chunk data indicates the al- and Microsoft Windows based systems. Internally it uses gorithm used, followed by any algorithm speciﬁc param- a common C structure for storing a trace along with a eters required for decoding, followed by the encoded data itself. A value of zero for the ﬁrst byte indicates the raw common programming interface for reading and writing data. Hence ZTR decoders simply need to keep recursively this structure. This means that the application does not applying the uncompression algorithms until the raw data need to know the ﬁle format of the trace data and so as is obtained. new formats are added existing applications will not need Experimentation has determined which sets of algo- to be modiﬁed or even recompiled. rithms are best applied to each type of chunk. Table 5 lists Io lib supports the notion of a trace search path, which the default ﬁlter and compression types used. Note that is independent from the trace format. Traces may be other combinations of ﬁlters and compression methods loaded directly from a ﬁle on disk in the current working may be used as they still produce a valid format ZTR ﬁle. directory, from an alternative directory, or extracted from The trace amplitudes (in the SAMP chunks) are all treated within a tar ﬁle. 6 ZTR: New format for DNA sequence trace data The tar ﬁle support allows for archiving many trace ﬁles Table 6. Total size in bytes and timings in seconds for 100 trace ﬁles into a single ﬁle, which has several beneﬁts. It makes distribution of data much easier, it may reduce disk space Size in Read Write and it reduces the number of ﬁles on the disk. This is Instrument Format bytes time (s) time (s) important as most ﬁlesystems support a limited number of ﬁles, usually speciﬁed at the time of formatting. Although ABI-3700 ABI 18 915 025 3.55 – ABI-3700 Gzipped ABI 8 780 830 6.18 – this number is typically set very high, a large number of ABI-3700 Gzipped SCF 2 494 217 1.54 9.15 very small ﬁles can still cause problems. MegaBACE Gzipped SCF 3 953 805 1.73 13.14 Many ﬁling systems also have a block size. The size Li-Cor Gzipped SCF 1 815 428 1.17 5.01 required to store a ﬁle of length N will be N rounded up to the next multiple of BLOCK SIZE; averaging at N + BLOCK SIZE/2. On Microsoft Windows the block Table 7. File size and I/O times as percentages relative to gzipped SCF size can often be as much as 64 kb, meaning an average wastage of 32 kb per ﬁle. Tar archives typically use an Size relative to SCF.gzip Average timings internal block size of 512 bytes, which greatly reduces Format ABI-3700 Mega- Licor Average Read & Encode & wasted space. This point is not to be underestimated; a BACE decode write 64 kb block size means that there is usually no saving in switching from gzipped SCF to ZTR unless tar archives SCF.raw 317.1 202.0 264.9 261.3 30.9 7.9 are also used. Fortunately UNIX ﬁle systems usually have SCF.gzip 100.0 100.0 100.0 100.0 100.0 100.0 much smaller block sizes. For example in Linux the ext2 SCF.bzip2 72.8 75.9 85.7 78.1 370.3 143.4 SCF.szip 71.1 74.8 80.2 75.4 937.9 164.6 ﬁlesystem has a block size of 1024, 2048 or 4096 bytes. Trace ﬁles within the tar ﬁle may be compressed if CTF.raw 96.2 112.2 114.5 107.6 34.2 117.6 CTF.gzip 70.2 80.3 83.0 77.8 79.2 144.4 desired, although with the ZTR format this is not advisable CTF.bzip2 65.6 72.0 79.2 72.2 324.6 217.3 due to the use of its own compression functions. However CTF.szip 63.2 70.5 75.5 69.8 740.1 227.6 the complete tar ﬁle itself must not be compressed as this ZTR(1).raw 150.0 99.5 220.1 156.5 34.9 8.2 would prevent random access within it. ZTR(1).gzip 69.5 73.1 84.2 75.6 85.4 50.6 In order to reduce time spent searching for ﬁles within ZTR(1).bzip2 62.9 68.7 72.3 68.0 370.9 125.9 a tar archive io lib can use an index ﬁle. The index ZTR(1).szip 60.8 67.2 68.4 65.4 779.2 129.8 consists of a series of lines containing the trace name and ZTR(2).raw 61.6 69.7 79.1 70.1 67.9 34.3 ﬁle offset, allowing for complete random access within the trace archive. The current implementation performs a linear search through the index, so access time is still proportional to number of ﬁles in the archive. However the for reading (when not cached) and 50% slower for time taken to ﬁnd a ﬁle within a directory is also dependent writing. on the number of ﬁles contained within it. At present we A comparison between SCF, CTF and ZTR is presented only support read access to tar ﬁles. in Table 7. Here we have normalized the the sizes and times against the gzipped SCF results from Table 6. RESULTS Several compression tools are also compared, including gzip (implemented using zlib 1.1.3), bzip2 (version 1.0.1) We analyzed the performance of ZTR on multiple sets and szip (version 1.11). Gzip was implemented as a library of data covering several machine manufacturers and call and so avoids the need for running an external process. multiple sequencing chemistries, with each set consisting of 100 traces. The ABI-3700 and MegaBACE data sets This does not affect the size, but reduces the real and (from the Sanger Centre) were re-base-called using cpu time and so there is a small bias against the timings Phred 0.990722.g. The Li-Cor data set (from Genoscope) for bzip2 and szip. Both gzip and bzip2 are widely-used was converted from SCFv2 to SCFv3 format, but was open source programs. Szip is freely available for many operating systems, but is not open-source. It is included not re-base-called as the Li-Cor base-caller produces as an illustration of one of the best general purpose conﬁdence values in the same log scale as Phred. All compression tools. of this data is publicly available on our ftp site. Table 6 Table 7 contains a lot of information so we have made presents the gzipped SCF size for each of these three the rows of formats which are faster at reading or writing sets along with the size for the ABI-3700 data set in the original ABI ﬁle format. The timings here represent than all others for a given ﬁle size bold. The remaining summation of the user and system CPU times, taken rows contain results which are bettered on both speed and from a 433 MHz Compaq Alpha running Digital UNIX size by at least one other format. For example SCF.gzip is V4.0E. Real times averaged at approximately 20% slower always beaten on speed and size by ZTR(2).raw. 7 J.K.Bonﬁeld and R.Staden Table 8. Relative proportions of data within a ZTR ﬁle From this we can see that the Huffman encoding used for TEXT and BASE chunks is not optimal, mostly due to the small size of the information being compressed. The Chunk type File (%) Bits/item TEXT chunks in this data set do not include the NCBI text attributes and so their average size is just 196 bytes. With SMP4 92.07 3.24 bits/sample longer TEXT chunks the compression rates will improve, CNF4 2.72 4.45 bits/value BPOS 2.38 3.90 bits/value but it is unlikely the size will be a signiﬁcant portion of BASE 1.59 2.60 bits/base the total ﬁle, so optimizing this will not provide an overall TEXT 1.23 7.89 bits/character improvement in compression. DISCUSSION The ZTR(1) and ZTR(2) formats are both valid ZTR We have presented ZTR as an extensible and compact ﬁles, but ZTR(1) does not include the ﬁnal FOLLOW1 replacement to gzipped SCF, but have concentrated on the and ZLIB compression methods. This means that a raw issues of ﬁle compression. The Huffman encoding used ZTR(1) ﬁle is substantially larger than ZTR(2), but the in ZTR represents a very basic compression algorithm. more complex external compression tools (bzip2 and szip) Better entropy encoders are known, with arithmetic coding reduce the ZTR(1) ﬁles to less than ZTR(2). ZTR(2)’s (Rissanen and Langdon, 1979) being the most widely internal compression prevents external tools from further used. They may produce smaller ﬁles without too large an reducing the ﬁle size, so these values are not shown in impact on speed, but these algorithms often require larger Table 7. Both ZTR sets have been encoded using a single amounts of data to work efﬁciently. Higher order statistical SMP4 chunk instead of 4 separate SAMP chunks. In encoders (such as the PPM family; Cleary and Teahan, summary, whilst ZTR(2) is not the smallest ﬁle format 1997) may also reduce space, but these are currently (although it is close), to produce smaller ﬁles takes slow algorithms. It can be seen that the higher order substantially longer. The ZTR(2) implementation fulﬁls block sorting methods (http://www.compressconsult.com/ the goals of being faster than gzipped SCF with a much st/), as used in szip, and the Burrows–Wheeler transform smaller output and so is our default implementation of the (Burrows and Wheeler, 1994), as used in bzip2, give ZTR format. substantial improvements, but again the methods are The ZTR(1).gzip ﬁles are larger than the ZTR(2) ﬁles, relatively slow. despite the fact that the Huffman compression used within However the most proﬁtable strategy may lie in trying ZTR(2) is the same code as used in gzip. This can be to curve-ﬁt the data. We have experimented with using explained by noting that different chunks contain byte Chebyshev polynomials (Press et al., 1992) to ﬁt the values with substantially different frequency distributions, previous 4 samples in order to predict a value for the but gzip averages all these together (assuming that the 5th sample. The difference between the predicted and entire ﬁle ﬁts within one gzip block). An additional beneﬁt real sample value can then be stored. This is still work to this approach is that random access to any chunk is in progress, but our current algorithm can compress the still possible, which in turn provides faster extraction of ABI-3700 data set to 56.7% of the gzipped SCF size (2.98 speciﬁc data. For example extracting just the base calls bits/sample). CPU performance is still a big issue with this from ZTR(2) ﬁles is faster than extracting them from method, with read times being approximately 2.7 times ZTR(1).gzip ﬁles. slower than gzipped SCF ﬁles. We can see that the Li-Cor ﬁles compress much less We have also examined the use of lossy compression for than the ABI and MegaBACE ﬁles. The main reason is the trace amplitudes. The simplest way to lose informa- that the Li-Cor data only stores 8-bit samples, compared tion uniformly is down-scaling. The original SCFv1 im- to the 11-bit data from ABI and MegaBACE machines. plementation stored information in 8-bits, but downscaling Scaling down the other data sets to 8-bit samples gives to any range also improves compression. Table 9 shows results comparable to the Li-Cor data (ZTR(2) is 72.9% the results of this form of lossy compression on the 100 for ABI and 80.4% for MegaBACE). ABI-3700 ﬁles using the ZTR(2) format. The other factor resulting in differences in compression We would not recommend using lossy compression for ratios between data sets is the noise in the trace data permanent archive of trace data, but for visual inspection (which depends in part on the preprocessing of the original over the network 7-bit data is generally adequate. data signals). As the noise increases the entropy of the data The nature of the ZTR format is such that, if useful, also increases, resulting in poorer compression. any of these alternative or additional methods can be Table 8 details the breakdown of a ZTR(2) ﬁle expressed implemented in the future without affecting the reading as a percentage of the overall size and in bits per item. This of older ﬁles. table was computed by averaging only the ABI-3700 ﬁles. The original size of the ABI-3700 data set is more than 8 ZTR: New format for DNA sequence trace data Table 9. The effect of down-scaling on ﬁle size duplication of work and reduce any fragmentation of the format. Initially such additions should be implemented as private types, but once stabilized these could be migrated Range Average ZTR(2) ﬁle size to public types in future revisions of the format. In the Staden Package individual traces stored in 0–1600 (lossless) 15 949 formats readable by io lib can be viewed using a pro- 0–1024 14 656 0–512 12 506 gram called Trev (Bonﬁeld et al., 2002), available 0–256 10 846 from ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/trev/. Also 0–128 9 175 via io lib, multiple traces can be viewed using the 0–64 7 924 package’s main sequence assembly and editing pro- 0–32 6 765 gram, Gap4 (Staden et al., 1998), which uses a single binary but machine independent database for each sequencing project. This database stores sequence read- double the size of the uncompressed SCF ﬁles. This is ings, conﬁdence values, contigs, templates, read-pair due to the additional information stored in an ABI ﬁle. data, annotations, edit information and links to the By deﬁning further ZTR chunk types it would be possible trace data. The package also contains the gap4 viewer to store all the data in an ABI ﬁle within ZTR, utilizing (ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/gap4 viewer/). appropriate compression methods for each chunk. The This viewer is a complete but read-only version of Gap4 main proportion (96%) of an ABI ﬁle consists of the and hence enables all of the above information to be 12 DATA channels corresponding to raw and processed displayed using its graphical user interface. Gap4 viewer copies of the trace data and various instrument settings and trev executables for UNIX and Microsoft Windows (voltage, current, power, temperature). Of the remaining are available free to commercial and academic users from ABI information approximately 2/3 is base calls and base our ftp site. offsets. All of this already compresses well using ZTR. The io lib implementation could readily be extended We estimate that a complete ZTR encoded ABI ﬁle will to provide additional search paths to allow for direct be 27% of the original size, compared to 49% for gzip and ﬁle loading over the internet, possibly using CORBA 34% for bzip2. Hence ZTR is a suitable open format for (Parsons et al., 1999). This would enable any program use by manufacturers of sequencing instruments. using io lib, including trev and the gap4 viewer, to access and display traces directly from remote trace archives. In the SCF format the number and type of data items is Calculations based on typical ﬁles obtained from the rigidly deﬁned in the header, with just one single ‘private’ Sanger Centre show that a gzipped Gap4 database from a block for additional data. ZTR overcomes this limitation ﬁnished assembly project occupies only 3% of the storage by having an arbitrary number of chunks, with either pub- required for the project’s gzipped SCF ﬁles. CAF (Dear et lic or private data types. CTF also overcomes many of the al., 1998) ﬁles are of comparable size. SCF limitations, however it does not distinguish between Although consensus conﬁdence values are useful when public and private chunk types and does not separate the it comes to checking the evidence for individual bases in data from the compression and ﬁlter algorithms. These last a consensus sequence from a genome project, we believe two differences directly impact on the extensibility and that where doubts arise most people would prefer to see hence the long term future of the format. ZTR ﬁle read- all the relevant sequences and traces aligned. They are ers can be assured that chunk types listed in the public also likely to be interested only in speciﬁc regions and speciﬁcation will be in a known format, but this does not hence not need to download all the traces from the relevant preclude the addition of new chunk types or the develop- project. ment of new compression algorithms. Bringing these last arguments together, in our view, if ZTR could also be used for related data such as that gen- the sequence assembly databases were made publically erated in Single Stranded Conformational Polymorphism available somewhere, the extra 3% of storage needed (Hayashi, 1991) experiments. Preliminary investigations would greatly increase the value of trace and sequence of SSCP data have shown that ZTR produces size re- data archives, and in addition to the contribution made by ductions similar to those achieved for sequencing traces. ZTR, further reduce the bandwidth required to service the Although the public speciﬁcation does not explicitly expected growth in this information. discuss storage of SSCP data it is envisaged that this will be achieved by using the existing SAMP chunk types with ACKNOWLEDGEMENTS appropriate meta-data ﬁelds. If, as we hope, others do wish to contribute new ZTR The authors would like to thank Jean Thierry-Mieg for the chunk types and compression methods we suggest that adding CTF to io lib which catalyzed us into ﬁnishing our they contact us beforehand so that we can help to avoid own work on ZTR, Mark Jordan for the meta-data and 9 J.K.Bonﬁeld and R.Staden Deutsch,P. and Gailly,J-L. (1996) ZLIB Compressed data format general comments, Andrew McLachlan for the Chebyshev speciﬁcation version 3.3. RFC 1950, http://www.gzip.org/zlib/ prediction idea, Steven Leonard for extending io lib to use Ensembl Trace Server (2000) http://trace.ensembl.org/. zlib instead of gzip and his ideas with tar support, and both Ewing,B. and Green,P. (1998) Base-calling of automated sequencer the Sanger Centre and Genoscope for providing test data. traces using Phred. II. Error probabilities. Genome Res., 8, 186– This work was supported by the UK Medical Research Council. Gorrell,H.G. et al. (2000) NCBI trace archive RFC. http://www. ncbi.nlm.nih.gov/Traces/TraceArchiveRFC.html. Hayashi,K. (1991) PCR-SSCP: a simple and sensitive method for REFERENCES detection of mutations in the genomic DNA. PCR Meth. Appl., ATQA (1998) http://www.wagner.com/technologies/biotech/ 1,34–38. atqaadcopy.html, Wagner Associates. Huffman,D.A. (1952) A method for the construction of minimum- Bonﬁeld,J.K. and Staden,R. (1995) The application of numerical redundancy codes. Proc. IRE, 40, 1098–1101. estimates of base calling accuracy to DNA sequencing projects. NCBI Trace Archive http://www.ncbi.nlm.nih.gov/Traces/. Nucleic Acids Res., 23, 1406–1410. NC-IUB (1985) Nomenclature Committee of the International Bonﬁeld,J.K., Beal,K.F., Betts,M.J. and Staden,R. (2002) Trev: a Union of Biochemistry. Nomenclature for incompletely speciﬁed DNA trace viewer. Bioinformatics, 18, 194–195. bases in nucleic acid sequences. Recommendations 1984. Eur. Boutell,T. et al. (1997) Portable Network Graphics (PNG) speciﬁ- J. Biochem, 150,1–5. http://www.chem.qmw.ac.uk/iubmb/misc/ cation version 1.0. RFC 2083, http://www.libpng.org/pub/png. naseq.html. Burrows,M. and Wheeler,D.J. (1994) A block-sorting lossless data Parsons,J.D., Buehler,E. and Hillier,L. (1999) DNA sequence compression algorithm. Technical Report. Digital Equipment chromatogram browsing using JAVA and CORBA. Genome Res., Corporation, Palo Alto, CA. 9, 277–281. Cleary,J.G. and Teahan,W.J. (1997) Unbounded length contexts for Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. PPM. The Comput. J., 40,67–75. (1992) Numerical Recipies in C: The Art of Scientiﬁc Program- Dear,S., Durbin,R., Hillier,L., Marth,G., Thierry-Mieg,J. and ming, 2nd edn, Cambridge University Press, Cambridge. Mott,R. (1998) Sequence assembly with CAFTOOLS. Genome Rissanen,J.J. and Langdon,G.G. (1979) Arithmetic coding. IBM J. Res., 9, 260–267. Res. Develop., 23, 149–162. Dear,S. and Staden,R. (1992) A standard ﬁle format for data from Staden,R., Beal,K.F. and Bonﬁeld,J.K. (1998) The Staden Package DNA sequencing instruments. DNA Sequence, 3, 107–110. 1998. Comput. Meth. Mol. Biol., 132, 115–130. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/ztr-a-new-format-for-dna-sequence-trace-data-LPs0Cg4kDw

Loading next page...

References (19)

D. Huffman (1952)
A method for the construction of minimum-redundancy codes
Resonance, 11
(1985)
Nomenclature Committee of the International Union of Biochemistry (NC-IUB). Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984.
The Biochemical journal, 229 2
Brent Ewing, Philip Green (1998)
Base-calling of automated sequencer traces using phred. II. Error probabilities.
Genome research, 8 3
Peter Deutsch, J. Gailly (1996)
ZLIB Compressed Data Format Specification version 3.3
RFC, 1950
J. Bonfield, Kathryn Beal, Matthew Betts, R. Staden (2002)
Trev: a DNA trace editor and viewer
Bioinformatics, 18 1
S. Dear, Rodger Staden (1992)
A standard file format for data from DNA sequencing instruments.
DNA sequence : the journal of DNA sequencing and mapping, 3 2
J. Bonfield, Rodger Staden (1995)
The application of numerical estimates of base calling accuracy to DNA sequencing projects.
Nucleic acids research, 23 8
J. Cleary, W. Teahan (1995)
Unbounded length contexts for PPM
Proceedings DCC '95 Data Compression Conference
R. Staden, Kathryn Beal, J. Bonfield (2000)
The Staden package, 1998.
Methods in molecular biology, 132
T. Boutell (1997)
PNG (Portable Network Graphics) Specification Version 1.0
RFC, 2083
(2000)
NCBI trace archive RFC
Jorma Rissanen, Glen Langdon (1979)
Arithmetic Coding
IBM J. Res. Dev., 23
Bland Ewing, L. Hillier, M. Wendl, Philip Green (1998)
Base-calling of automated sequencer traces using phred. I. Accuracy assessment.
Genome research, 8 3
(1992)
Numerical Recipies in C: The Art of Scientific Programming, 2nd edn
J. Parsons, E. Buehler, L. Hillier (1999)
DNA sequence chromatogram browsing using JAVA and CORBA.
Genome research, 9 3
Portland Ltd (1985)
Nomenclature Committee for the International Union of Biochemistry (NC-IUB). Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984.
Molecular biology and evolution, 3 2
Simon Dear, Richard Durbin, L. Hillier, Gabor Marth, Jean Thierry-Mieg, Richard Mott (1998)
Sequence assembly with CAFTOOLS.
Genome research, 8 3
K. Hayashi (1991)
PCR-SSCP: a simple and sensitive method for detection of mutations in the genomic DNA.
PCR methods and applications, 1 1
M. Burrows, D. L, R. Taylor, D. Wheeler, D. Wheeler (1994)
A Block-sorting Lossless Data Compression Algorithm

Publisher: Oxford University Press
Copyright: © Oxford University Press 2002
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/18.1.3
Publisher site: See Article on Publisher Site

Abstract

Vol. 18 no. 1 2002 BIOINFORMATICS Pages 3–10 James K. Bonﬁeld and Rodger Staden MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK Received on July 2, 2001; revised on September 11, 2001; accepted on September 12, 2001 ABSTRACT require around 30 million traces, equating to 5700 Gb of Motivation: To produce an open and extensible ﬁle storage. This is for one individual for one species. At the format for DNA trace data which produces compact ﬁles time of writing (June 2001) the NCBI trace archive also suitable for large-scale storage and efﬁcient use of internet contains 23 million traces for other species. bandwidth. In 1991 our group introduced SCF format (Dear and Results: We have created an extensible format named Staden, 1992). The major motivations then were: ZTR. For a set of data taken from an ABI-3700 the (1) sequencing machine independence; ZTR format produces trace ﬁles which require 61.6% of (2) operating system independence; the disk space used by gzipped SCFv3, and which can be written and read at greater speed. The compression (3) an open and public format with sources available to algorithms used for the trace amplitudes are used within all; the National Center for Biotechnology Information (NCBI) (4) small ﬁle size; trace archive. (5) to introduce the idea of base call conﬁdence values Availability: Source code is available from ftp: and encourage their use. //ftp.mrc-lmb.cam.ac.uk/pub/staden/io lib/io lib.tar.gz. A complete format description can be found at http: This format is now the most widely used and the sources //www.mrc-lmb.cam.ac.uk/pubseq/ztr.html. Test data is are available via ftp. During the intervening years we have available from ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/ produced two major revisions. The current one (SCFv3) io lib/test data. includes the use of a ﬁnite differences function plus a Contact: jkb@mrc-lmb.cam.ac.uk reorganization of the order of the ﬁle contents to potentiate the efﬁcient use of standard compression programs. INTRODUCTION The content of trace ﬁles The genome projects performed to date are just a begin- ning, and as DNA sequencing is increasingly being used The minimum information needed in a DNA sequence for new scientiﬁc, medical and forensic purposes, the trace trace ﬁle is shown below with their percentage of the overall uncompressed SCF ﬁle size: data accumulated so far represent only a tiny fraction of the storage requirement of the future. (1) the base calls (1%); Major centres such as the National Center for Biotech- (2) base conﬁdence values (4%); nology Information (NCBI) and the European Bioin- formatics Institute (EBI) are collecting the trace data (3) the trace amplitudes for each of the four base types (88%); ﬁles from genome projects and making them available via the internet (http://www.ncbi.nlm.nih.gov/Traces/, (4) the offsets of the base calls relative to the trace http://trace.ensembl.org/). It is important that the storage coordinates (represented by the element numbers of and transfer of trace data is efﬁcient and that the format the trace values) (4%); used is easily adaptable. (5) various textual comments (sample identiﬁers, run To illustrate the size of the storage problem let us take date, etc.) (0.2%). a single human genome project as an example. Suppose The remaining ∼2.8% is in unused (marked as ‘spare’) we aim at 5-fold coverage, that we use sequencing ﬁelds. Typically ABI ﬁles will also contain additional instruments which generate Applied Biosystems’ ABI textual data plus arrays of current, voltages, temperature format trace ﬁles (typical ﬁle size 190 kb), and that we get and unprocessed trace amplitudes. on average 500 reliable bases per reading. Then we would A list of data items that today’s users may want to To whom correspondence should be addressed. store in the ﬁles is speciﬁed at the NCBI trace repository c Oxford University Press 2002 3 J.K.Bonﬁeld and R.Staden Table 1. Gzip compression ratios on a selection of ﬁle types SYSTEM AND METHODS The design of ZTR builds on this previous work and borrows new ideas taken from the PNG format (Boutell Format Original size Gzipped size Fraction et al., 1997), the public successor to the GIF image ABI 18 158 424 8 427 773 0.464 format. We wanted to reduce data size further, and also to SCFv2 7 887 845 3 881 662 0.492 incorporate additional textual information. The key design SCFv3 7 887 845 2 396 562 0.304 principles are: (1) Extensibility: we cannot easily foresee what future data may need to be stored within a trace ﬁle, so we (Gorrell et al., 2000), but we do not know what may need a mechanism of incorporating new information be required in the future. For example we originally in a way which does not invalidate the ﬁle format. suggested that it would be useful to store a conﬁdence (2) Small: a small size not only saves disk space, but value for each of the four base types at each position will reduce network usage and download times. in the sequence and provided four such slots in the SCF Rather than have a format which requires use of format. At present many use a single value, that for the external compression tools we would like the format called base only. However at least one group have a trace to specify its own compression methods. analysis program (ATQA, 1998) which in addition to an (3) Fast: ZTR ﬁle accessing should not be substantially overall conﬁdence for each base call can also calculate slower than existing SCF implementations. Given the probability of insertions and deletions at each base that gzipped SCF ﬁles are the norm, we considered position. Luckily these useful values can be stored in the this to be our target speed for both reading and spare conﬁdence value slots in the SCF ﬁle, but the format writing. we propose in this paper can readily incorporate new data types such as these. (4) Public: both the speciﬁcation and the source code for an example implementation should be freely Space saving methods available to both academic and commercial users. Lossless ﬁle compression tools save disk space, not by Extensibility deleting information, but by analyzing the data to repack the information using fewer bytes. One of the most The basic structure of a ZTR ﬁle is a header indicating the commonly used tools for this is gzip. We determined that ﬁle format, followed by zero or more data blocks. In ZTR storing the data within an SCF ﬁle in a different order can we call these blocks ‘chunks’. signiﬁcantly improve the performance of gzip. Also the The use of separate chunks for each data type implies trace amplitudes are not ideally suited to compression by new data types can be added without changing the basic gzip, but storing the differences between one value and the ﬁle structure and conversely chunks may be omitted. ZTR next reduces the signal variability. This ﬁnite differences readers should ignore chunks with unknown type and technique can be applied up to three times before com- so new ﬁles are backwards compatible. These features pression ratios start to suffer. These ideas were used to contribute to the extensibility of the ZTR format. form a new revision of SCF—version 3. Table 1 demon- The header structure is, in hex bytes: strates the compression ratios of gzip on a set of ABI-3700 8-byte magic number: AE 5A 54 52 0D 0A 1A 0A ﬁles in ABI, SCF version 2 and SCF version 3 formats. Format version, major: 01 More recently, Jean Thierry-Mieg at the NCBI produced Format version, minor: 02 a new trace format named CTF (unpublished) which compresses better than SCFv3. The CTF format speciﬁes its own compression algorithms. A raw CTF ﬁle has a The magic number includes an 8-bit byte (AE), a control- similar size to gzipped SCFv3, but is substantially faster to Z (1A) character (used to indicate end-of-ﬁle under DOS), read back. Furthermore CTF ﬁles can still be compressed both bare newline (0A) and carriage-return newline (0D further by using external programs, such as gzip, hence 0A) combinations, and the text ‘ZTR’ (5A 54 52). The giving additional space savings. purpose of this is to act as an immediate check for the more Despite the size reduction of SCFv3 and CTF we troublesome aspects of ﬁle reading and data transfer and recently felt that another completely new ﬁle structure so aid the detection of corrupt ﬁles. For example, using ftp would both enable us to make the ﬁles even smaller and to transfer a ZTR ﬁle in ASCII mode may swap newline produce an extensible format which would facilitate its use and newline-carriage return. Such ﬁles will not have a for novel future applications of DNA sequencing. We have valid ZTR header, so rather than return a corrupted ﬁle named this new binary format ZTR. the reading code will return an error. 4 ZTR: New format for DNA sequence trace data Each chunk consists of a type, meta-data and data. The integer; the same as used by Phred (Ewing and Green, chunk data is the main information we wish to store. The 1998), TraceTuner (http://www.paracel.com/tracetuner/), meta-data, which is not needed for many chunk types, is ATQA (1998) and Li-Cor base callers. typically a small amount of information about the data. CNF4. The four conﬁdence values stored in the same For example the chunk to store the digitized trace samples scale as CNF1, but with one value per base type. To aid will have the samples themselves in the data block and the compression, the conﬁdence for all the called bases (which name of the channel (A, C, G, T) in the meta-data block. defaults to T if not A, C or G) is stored ﬁrst followed by All integer values are stored using 4-byte values in big- the remaining conﬁdence values for A, C, G and T. endian format (i.e. most signiﬁcant byte ﬁrst). The chunk structure is: CSID. The conﬁdence values for substitution, insertion and deletion, stored in the −10 log (P ) scale. ATQA 4-byte chunk type: XX XX XX XX error Meta-data length (big endian): XX XX XX XX is one such program to produce these values. Meta-data: (any number of bytes, up to 2 ) Data length (big endian): XX XX XX XX CLIP. Poor quality clip points. Speciﬁed in base coor- Data: (any number of bytes, up to 2 ) dinates, this indicates where data (at both ends) should be considered as poor quality. This is included primarily The format of the meta-data and data elements is chunk for backwards compatibility with SCF—the CNF* chunk type dependent. The complete information may be found types provide more detailed information. in the on-line ZTR format speciﬁcation (http://www. mrc-lmb.cam.ac.uk/pubseq/ztr.html). COMM. User deﬁned text comments, in 8-bit ASCII. TEXT. A series of identiﬁer-value pairs stored as one or Chunk types more sets of identiﬁer, nul, value, nul terminating in an The chunk type may be considered to be a 4-character additional (i.e. double) nul character. The identiﬁers are string. Bit 5 of the ﬁrst character indicates whether this deﬁned as part of the ZTR spec, but have been taken from chunk type is part of the public ZTR speciﬁcation (in the NCBI trace repository RFC version 1.17. which case bit 5 is clear) or whether it is a private extension (bit 5 is set). Bit 5 of the remaining three CR32. A 32-bit cyclic redundancy check (ANSI X3.66) characters is reserved for future use and so currently value of all the data since the last CR32 chunk, including should always be clear. Practically speaking this means the ZTR header if appropriate. that public chunk types consist entirely of uppercase letters and private chunk types start with a lowercase letter. Compression This means that TEXT and tEXT are two completely Each chunk data block is compressed using zero or independent chunk types and the similarity of their names more ﬁltering and compression algorithms. The available does not imply a relationship between the format of their algorithm choices are: data. Also it is clear that private extensions will not clash with future public extensions. DELTA1, DELTA2, DELTA4. These apply the forward ﬁnite At present the publicly deﬁned chunk types are: differences technique to 1, 2 or 4 byte words. This replaces each 1, 2 or 4 byte word with the difference between itself SAMP. A single channel of trace samples, stored in 16- and the previous word. It does not directly decrease the bit format. size of the data. Table 2 contains an example of DELTA1 ﬁltering. SMP4. Four concatenated arrays of trace samples, stor- ing the same information as 4 SAMP chunks for the A, 16TO8, 32TO8. These attempt to store numerically small C, G and T channels. Note that both SMP4 and SAMP 16-bit and 32-bit integer values in a single 8-bit integer. chunks can be combined within the same ﬁle if desired. Values in the range of −127 to +127 are stored directly SMP4 typically gives compression ratios 4% smaller than in 8-bits. For values outside this range we emit −128 4 separate SAMP chunks, at a reduced CPU usage. followed by the actual 16 or 32-bit value. Table 3 contains an example of the 16TO8 ﬁlter type. BASE. Base calls, encoded using the NC-IUB character set (NC-IUB, 1985). FOLLOW1. This analyzes the complete data block to determine for each 8-bit value (‘x ’) which other value BPOS. A mapping of base numbers to trace sample most frequently follows it (follow (x )). Then for each byte numbers, stored as an array of 32-bit integer values. of data we store follow (previous byte)—current byte. To CNF1. The conﬁdence values for the called base type. enable reversal of this function we also prepend the data The scale must be −10 log (P ) expressed as an 8-bit block with the 256-byte follow table. error 5 J.K.Bonﬁeld and R.Staden Table 2. Example of levels 1–3 of the DELTA1 ﬁlter Level Data stream before and after the DELTA1 ﬁlters Entropy 0 +4 +7 +12 +17 +24 +30 +36 +40 +43 +43 +40 +35 +28 +21 +14 +9 7.50 1 +4 +3 +5 +5 +7 +6 +6 +4 +3 +0 −3 −5 −7 −7 −7 −5 6.16 2 +4 −1 +2 +0 +2 −1 +0 −2 −1 −3 −3 −2 −2 +0 +0 +2 4.97 3 +4 −5 +3 −2 +2 −3 +1 −2 +1 −2 +0 +1 +0 +2 +0 +2 5.62 Table 5. Summary of chunk type and the default ﬁlters and compression Table 3. An example of 16TO8 of 5 big-endian 16-bit numbers algorithms Before 00 4B 00 55 FF EB FC 22 00 BB Chunk type Filters/compressors (plus arguments) After 4B 55 EB 80 FC 22 80 00 BB SAMP/SMP4 DELTA2 (×3or ×2, depending on data range) 16TO8 FOLLOW1 Table 4. An example of the RLE compression method, using 8 as the token RLE ZLIB (Z HUFFMAN ONLY) Before567777 8 7 6 BASE/TEXT/COMM ZLIB (Z HUFFMAN ONLY) CNF1/CNF4/CSID DELTA1 (×1) After 5 6 8 4 7 8076 RLE ZLIB (Z HUFFMAN ONLY) BPOS DELTA4 (×1) 32TO8 RLE. Run length encoding. If 4 or more identical 8-bit ZLIB (Z HUFFMAN ONLY) values are detected in a row then RLE replaces this data with a special token followed by the number of repeated bytes and the value. If the token itself is within the raw data then it is output followed by zero. The token may as 16-bit quantities regardless of their actual scale. 8-bit be chosen to be a symbol with a low natural frequency. data (0–255) is compressed best using DELTA2 with 2 Table 4 contains an example of the RLE compression rounds, whereas full 16-bit data is compressed best using method. 3 rounds. ZLIB. Uses the zlib library (Deutsch and Gailly, 1996) to IMPLEMENTATION apply the LZ77 compression algorithm followed by Huff- The source code implementing the ZTR format is con- man encoding (Huffman, 1952). Zlib allows for Huffman tained within a library named ‘io lib’. This library also encoding only (denoted below as Z HUFFMAN ONLY), implements read-only support for the Applied Biosys- which for trace data typically reduces ﬁle size more than tems’ ABI and Pharmacia’s ALF format trace ﬁles and LZ77 and is faster. However all valid zlib streams are al- read–write support for the SCF and CTF formats. Io lib is lowed within a ZTR ﬁle. coded using ANSI C and is known to work on both UNIX The ﬁrst byte of the encoded chunk data indicates the al- and Microsoft Windows based systems. Internally it uses gorithm used, followed by any algorithm speciﬁc param- a common C structure for storing a trace along with a eters required for decoding, followed by the encoded data itself. A value of zero for the ﬁrst byte indicates the raw common programming interface for reading and writing data. Hence ZTR decoders simply need to keep recursively this structure. This means that the application does not applying the uncompression algorithms until the raw data need to know the ﬁle format of the trace data and so as is obtained. new formats are added existing applications will not need Experimentation has determined which sets of algo- to be modiﬁed or even recompiled. rithms are best applied to each type of chunk. Table 5 lists Io lib supports the notion of a trace search path, which the default ﬁlter and compression types used. Note that is independent from the trace format. Traces may be other combinations of ﬁlters and compression methods loaded directly from a ﬁle on disk in the current working may be used as they still produce a valid format ZTR ﬁle. directory, from an alternative directory, or extracted from The trace amplitudes (in the SAMP chunks) are all treated within a tar ﬁle. 6 ZTR: New format for DNA sequence trace data The tar ﬁle support allows for archiving many trace ﬁles Table 6. Total size in bytes and timings in seconds for 100 trace ﬁles into a single ﬁle, which has several beneﬁts. It makes distribution of data much easier, it may reduce disk space Size in Read Write and it reduces the number of ﬁles on the disk. This is Instrument Format bytes time (s) time (s) important as most ﬁlesystems support a limited number of ﬁles, usually speciﬁed at the time of formatting. Although ABI-3700 ABI 18 915 025 3.55 – ABI-3700 Gzipped ABI 8 780 830 6.18 – this number is typically set very high, a large number of ABI-3700 Gzipped SCF 2 494 217 1.54 9.15 very small ﬁles can still cause problems. MegaBACE Gzipped SCF 3 953 805 1.73 13.14 Many ﬁling systems also have a block size. The size Li-Cor Gzipped SCF 1 815 428 1.17 5.01 required to store a ﬁle of length N will be N rounded up to the next multiple of BLOCK SIZE; averaging at N + BLOCK SIZE/2. On Microsoft Windows the block Table 7. File size and I/O times as percentages relative to gzipped SCF size can often be as much as 64 kb, meaning an average wastage of 32 kb per ﬁle. Tar archives typically use an Size relative to SCF.gzip Average timings internal block size of 512 bytes, which greatly reduces Format ABI-3700 Mega- Licor Average Read & Encode & wasted space. This point is not to be underestimated; a BACE decode write 64 kb block size means that there is usually no saving in switching from gzipped SCF to ZTR unless tar archives SCF.raw 317.1 202.0 264.9 261.3 30.9 7.9 are also used. Fortunately UNIX ﬁle systems usually have SCF.gzip 100.0 100.0 100.0 100.0 100.0 100.0 much smaller block sizes. For example in Linux the ext2 SCF.bzip2 72.8 75.9 85.7 78.1 370.3 143.4 SCF.szip 71.1 74.8 80.2 75.4 937.9 164.6 ﬁlesystem has a block size of 1024, 2048 or 4096 bytes. Trace ﬁles within the tar ﬁle may be compressed if CTF.raw 96.2 112.2 114.5 107.6 34.2 117.6 CTF.gzip 70.2 80.3 83.0 77.8 79.2 144.4 desired, although with the ZTR format this is not advisable CTF.bzip2 65.6 72.0 79.2 72.2 324.6 217.3 due to the use of its own compression functions. However CTF.szip 63.2 70.5 75.5 69.8 740.1 227.6 the complete tar ﬁle itself must not be compressed as this ZTR(1).raw 150.0 99.5 220.1 156.5 34.9 8.2 would prevent random access within it. ZTR(1).gzip 69.5 73.1 84.2 75.6 85.4 50.6 In order to reduce time spent searching for ﬁles within ZTR(1).bzip2 62.9 68.7 72.3 68.0 370.9 125.9 a tar archive io lib can use an index ﬁle. The index ZTR(1).szip 60.8 67.2 68.4 65.4 779.2 129.8 consists of a series of lines containing the trace name and ZTR(2).raw 61.6 69.7 79.1 70.1 67.9 34.3 ﬁle offset, allowing for complete random access within the trace archive. The current implementation performs a linear search through the index, so access time is still proportional to number of ﬁles in the archive. However the for reading (when not cached) and 50% slower for time taken to ﬁnd a ﬁle within a directory is also dependent writing. on the number of ﬁles contained within it. At present we A comparison between SCF, CTF and ZTR is presented only support read access to tar ﬁles. in Table 7. Here we have normalized the the sizes and times against the gzipped SCF results from Table 6. RESULTS Several compression tools are also compared, including gzip (implemented using zlib 1.1.3), bzip2 (version 1.0.1) We analyzed the performance of ZTR on multiple sets and szip (version 1.11). Gzip was implemented as a library of data covering several machine manufacturers and call and so avoids the need for running an external process. multiple sequencing chemistries, with each set consisting of 100 traces. The ABI-3700 and MegaBACE data sets This does not affect the size, but reduces the real and (from the Sanger Centre) were re-base-called using cpu time and so there is a small bias against the timings Phred 0.990722.g. The Li-Cor data set (from Genoscope) for bzip2 and szip. Both gzip and bzip2 are widely-used was converted from SCFv2 to SCFv3 format, but was open source programs. Szip is freely available for many operating systems, but is not open-source. It is included not re-base-called as the Li-Cor base-caller produces as an illustration of one of the best general purpose conﬁdence values in the same log scale as Phred. All compression tools. of this data is publicly available on our ftp site. Table 6 Table 7 contains a lot of information so we have made presents the gzipped SCF size for each of these three the rows of formats which are faster at reading or writing sets along with the size for the ABI-3700 data set in the original ABI ﬁle format. The timings here represent than all others for a given ﬁle size bold. The remaining summation of the user and system CPU times, taken rows contain results which are bettered on both speed and from a 433 MHz Compaq Alpha running Digital UNIX size by at least one other format. For example SCF.gzip is V4.0E. Real times averaged at approximately 20% slower always beaten on speed and size by ZTR(2).raw. 7 J.K.Bonﬁeld and R.Staden Table 8. Relative proportions of data within a ZTR ﬁle From this we can see that the Huffman encoding used for TEXT and BASE chunks is not optimal, mostly due to the small size of the information being compressed. The Chunk type File (%) Bits/item TEXT chunks in this data set do not include the NCBI text attributes and so their average size is just 196 bytes. With SMP4 92.07 3.24 bits/sample longer TEXT chunks the compression rates will improve, CNF4 2.72 4.45 bits/value BPOS 2.38 3.90 bits/value but it is unlikely the size will be a signiﬁcant portion of BASE 1.59 2.60 bits/base the total ﬁle, so optimizing this will not provide an overall TEXT 1.23 7.89 bits/character improvement in compression. DISCUSSION The ZTR(1) and ZTR(2) formats are both valid ZTR We have presented ZTR as an extensible and compact ﬁles, but ZTR(1) does not include the ﬁnal FOLLOW1 replacement to gzipped SCF, but have concentrated on the and ZLIB compression methods. This means that a raw issues of ﬁle compression. The Huffman encoding used ZTR(1) ﬁle is substantially larger than ZTR(2), but the in ZTR represents a very basic compression algorithm. more complex external compression tools (bzip2 and szip) Better entropy encoders are known, with arithmetic coding reduce the ZTR(1) ﬁles to less than ZTR(2). ZTR(2)’s (Rissanen and Langdon, 1979) being the most widely internal compression prevents external tools from further used. They may produce smaller ﬁles without too large an reducing the ﬁle size, so these values are not shown in impact on speed, but these algorithms often require larger Table 7. Both ZTR sets have been encoded using a single amounts of data to work efﬁciently. Higher order statistical SMP4 chunk instead of 4 separate SAMP chunks. In encoders (such as the PPM family; Cleary and Teahan, summary, whilst ZTR(2) is not the smallest ﬁle format 1997) may also reduce space, but these are currently (although it is close), to produce smaller ﬁles takes slow algorithms. It can be seen that the higher order substantially longer. The ZTR(2) implementation fulﬁls block sorting methods (http://www.compressconsult.com/ the goals of being faster than gzipped SCF with a much st/), as used in szip, and the Burrows–Wheeler transform smaller output and so is our default implementation of the (Burrows and Wheeler, 1994), as used in bzip2, give ZTR format. substantial improvements, but again the methods are The ZTR(1).gzip ﬁles are larger than the ZTR(2) ﬁles, relatively slow. despite the fact that the Huffman compression used within However the most proﬁtable strategy may lie in trying ZTR(2) is the same code as used in gzip. This can be to curve-ﬁt the data. We have experimented with using explained by noting that different chunks contain byte Chebyshev polynomials (Press et al., 1992) to ﬁt the values with substantially different frequency distributions, previous 4 samples in order to predict a value for the but gzip averages all these together (assuming that the 5th sample. The difference between the predicted and entire ﬁle ﬁts within one gzip block). An additional beneﬁt real sample value can then be stored. This is still work to this approach is that random access to any chunk is in progress, but our current algorithm can compress the still possible, which in turn provides faster extraction of ABI-3700 data set to 56.7% of the gzipped SCF size (2.98 speciﬁc data. For example extracting just the base calls bits/sample). CPU performance is still a big issue with this from ZTR(2) ﬁles is faster than extracting them from method, with read times being approximately 2.7 times ZTR(1).gzip ﬁles. slower than gzipped SCF ﬁles. We can see that the Li-Cor ﬁles compress much less We have also examined the use of lossy compression for than the ABI and MegaBACE ﬁles. The main reason is the trace amplitudes. The simplest way to lose informa- that the Li-Cor data only stores 8-bit samples, compared tion uniformly is down-scaling. The original SCFv1 im- to the 11-bit data from ABI and MegaBACE machines. plementation stored information in 8-bits, but downscaling Scaling down the other data sets to 8-bit samples gives to any range also improves compression. Table 9 shows results comparable to the Li-Cor data (ZTR(2) is 72.9% the results of this form of lossy compression on the 100 for ABI and 80.4% for MegaBACE). ABI-3700 ﬁles using the ZTR(2) format. The other factor resulting in differences in compression We would not recommend using lossy compression for ratios between data sets is the noise in the trace data permanent archive of trace data, but for visual inspection (which depends in part on the preprocessing of the original over the network 7-bit data is generally adequate. data signals). As the noise increases the entropy of the data The nature of the ZTR format is such that, if useful, also increases, resulting in poorer compression. any of these alternative or additional methods can be Table 8 details the breakdown of a ZTR(2) ﬁle expressed implemented in the future without affecting the reading as a percentage of the overall size and in bits per item. This of older ﬁles. table was computed by averaging only the ABI-3700 ﬁles. The original size of the ABI-3700 data set is more than 8 ZTR: New format for DNA sequence trace data Table 9. The effect of down-scaling on ﬁle size duplication of work and reduce any fragmentation of the format. Initially such additions should be implemented as private types, but once stabilized these could be migrated Range Average ZTR(2) ﬁle size to public types in future revisions of the format. In the Staden Package individual traces stored in 0–1600 (lossless) 15 949 formats readable by io lib can be viewed using a pro- 0–1024 14 656 0–512 12 506 gram called Trev (Bonﬁeld et al., 2002), available 0–256 10 846 from ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/trev/. Also 0–128 9 175 via io lib, multiple traces can be viewed using the 0–64 7 924 package’s main sequence assembly and editing pro- 0–32 6 765 gram, Gap4 (Staden et al., 1998), which uses a single binary but machine independent database for each sequencing project. This database stores sequence read- double the size of the uncompressed SCF ﬁles. This is ings, conﬁdence values, contigs, templates, read-pair due to the additional information stored in an ABI ﬁle. data, annotations, edit information and links to the By deﬁning further ZTR chunk types it would be possible trace data. The package also contains the gap4 viewer to store all the data in an ABI ﬁle within ZTR, utilizing (ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/gap4 viewer/). appropriate compression methods for each chunk. The This viewer is a complete but read-only version of Gap4 main proportion (96%) of an ABI ﬁle consists of the and hence enables all of the above information to be 12 DATA channels corresponding to raw and processed displayed using its graphical user interface. Gap4 viewer copies of the trace data and various instrument settings and trev executables for UNIX and Microsoft Windows (voltage, current, power, temperature). Of the remaining are available free to commercial and academic users from ABI information approximately 2/3 is base calls and base our ftp site. offsets. All of this already compresses well using ZTR. The io lib implementation could readily be extended We estimate that a complete ZTR encoded ABI ﬁle will to provide additional search paths to allow for direct be 27% of the original size, compared to 49% for gzip and ﬁle loading over the internet, possibly using CORBA 34% for bzip2. Hence ZTR is a suitable open format for (Parsons et al., 1999). This would enable any program use by manufacturers of sequencing instruments. using io lib, including trev and the gap4 viewer, to access and display traces directly from remote trace archives. In the SCF format the number and type of data items is Calculations based on typical ﬁles obtained from the rigidly deﬁned in the header, with just one single ‘private’ Sanger Centre show that a gzipped Gap4 database from a block for additional data. ZTR overcomes this limitation ﬁnished assembly project occupies only 3% of the storage by having an arbitrary number of chunks, with either pub- required for the project’s gzipped SCF ﬁles. CAF (Dear et lic or private data types. CTF also overcomes many of the al., 1998) ﬁles are of comparable size. SCF limitations, however it does not distinguish between Although consensus conﬁdence values are useful when public and private chunk types and does not separate the it comes to checking the evidence for individual bases in data from the compression and ﬁlter algorithms. These last a consensus sequence from a genome project, we believe two differences directly impact on the extensibility and that where doubts arise most people would prefer to see hence the long term future of the format. ZTR ﬁle read- all the relevant sequences and traces aligned. They are ers can be assured that chunk types listed in the public also likely to be interested only in speciﬁc regions and speciﬁcation will be in a known format, but this does not hence not need to download all the traces from the relevant preclude the addition of new chunk types or the develop- project. ment of new compression algorithms. Bringing these last arguments together, in our view, if ZTR could also be used for related data such as that gen- the sequence assembly databases were made publically erated in Single Stranded Conformational Polymorphism available somewhere, the extra 3% of storage needed (Hayashi, 1991) experiments. Preliminary investigations would greatly increase the value of trace and sequence of SSCP data have shown that ZTR produces size re- data archives, and in addition to the contribution made by ductions similar to those achieved for sequencing traces. ZTR, further reduce the bandwidth required to service the Although the public speciﬁcation does not explicitly expected growth in this information. discuss storage of SSCP data it is envisaged that this will be achieved by using the existing SAMP chunk types with ACKNOWLEDGEMENTS appropriate meta-data ﬁelds. If, as we hope, others do wish to contribute new ZTR The authors would like to thank Jean Thierry-Mieg for the chunk types and compression methods we suggest that adding CTF to io lib which catalyzed us into ﬁnishing our they contact us beforehand so that we can help to avoid own work on ZTR, Mark Jordan for the meta-data and 9 J.K.Bonﬁeld and R.Staden Deutsch,P. and Gailly,J-L. (1996) ZLIB Compressed data format general comments, Andrew McLachlan for the Chebyshev speciﬁcation version 3.3. RFC 1950, http://www.gzip.org/zlib/ prediction idea, Steven Leonard for extending io lib to use Ensembl Trace Server (2000) http://trace.ensembl.org/. zlib instead of gzip and his ideas with tar support, and both Ewing,B. and Green,P. (1998) Base-calling of automated sequencer the Sanger Centre and Genoscope for providing test data. traces using Phred. II. Error probabilities. Genome Res., 8, 186– This work was supported by the UK Medical Research Council. Gorrell,H.G. et al. (2000) NCBI trace archive RFC. http://www. ncbi.nlm.nih.gov/Traces/TraceArchiveRFC.html. Hayashi,K. (1991) PCR-SSCP: a simple and sensitive method for REFERENCES detection of mutations in the genomic DNA. PCR Meth. Appl., ATQA (1998) http://www.wagner.com/technologies/biotech/ 1,34–38. atqaadcopy.html, Wagner Associates. Huffman,D.A. (1952) A method for the construction of minimum- Bonﬁeld,J.K. and Staden,R. (1995) The application of numerical redundancy codes. Proc. IRE, 40, 1098–1101. estimates of base calling accuracy to DNA sequencing projects. NCBI Trace Archive http://www.ncbi.nlm.nih.gov/Traces/. Nucleic Acids Res., 23, 1406–1410. NC-IUB (1985) Nomenclature Committee of the International Bonﬁeld,J.K., Beal,K.F., Betts,M.J. and Staden,R. (2002) Trev: a Union of Biochemistry. Nomenclature for incompletely speciﬁed DNA trace viewer. Bioinformatics, 18, 194–195. bases in nucleic acid sequences. Recommendations 1984. Eur. Boutell,T. et al. (1997) Portable Network Graphics (PNG) speciﬁ- J. Biochem, 150,1–5. http://www.chem.qmw.ac.uk/iubmb/misc/ cation version 1.0. RFC 2083, http://www.libpng.org/pub/png. naseq.html. Burrows,M. and Wheeler,D.J. (1994) A block-sorting lossless data Parsons,J.D., Buehler,E. and Hillier,L. (1999) DNA sequence compression algorithm. Technical Report. Digital Equipment chromatogram browsing using JAVA and CORBA. Genome Res., Corporation, Palo Alto, CA. 9, 277–281. Cleary,J.G. and Teahan,W.J. (1997) Unbounded length contexts for Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. PPM. The Comput. J., 40,67–75. (1992) Numerical Recipies in C: The Art of Scientiﬁc Program- Dear,S., Durbin,R., Hillier,L., Marth,G., Thierry-Mieg,J. and ming, 2nd edn, Cambridge University Press, Cambridge. Mott,R. (1998) Sequence assembly with CAFTOOLS. Genome Rissanen,J.J. and Langdon,G.G. (1979) Arithmetic coding. IBM J. Res., 9, 260–267. Res. Develop., 23, 149–162. Dear,S. and Staden,R. (1992) A standard ﬁle format for data from Staden,R., Beal,K.F. and Bonﬁeld,J.K. (1998) The Staden Package DNA sequencing instruments. DNA Sequence, 3, 107–110. 1998. Comput. Meth. Mol. Biol., 132, 115–130.

Journal

Bioinformatics – Oxford University Press

Published: Jan 1, 2002

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

ZTR: a new format for DNA sequence trace data

ZTR: a new format for DNA sequence trace data

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

ZTR: a new format for DNA sequence trace data

ZTR: a new format for DNA sequence trace data

References (19)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies