ZTR: a new format for DNA sequence trace data

ZTR: a new format for DNA sequence trace data Vol. 18 no. 1 2002 BIOINFORMATICS Pages 3–10 James K. Bonfield and Rodger Staden MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK Received on July 2, 2001; revised on September 11, 2001; accepted on September 12, 2001 ABSTRACT require around 30 million traces, equating to 5700 Gb of Motivation: To produce an open and extensible file storage. This is for one individual for one species. At the format for DNA trace data which produces compact files time of writing (June 2001) the NCBI trace archive also suitable for large-scale storage and efficient use of internet contains 23 million traces for other species. bandwidth. In 1991 our group introduced SCF format (Dear and Results: We have created an extensible format named Staden, 1992). The major motivations then were: ZTR. For a set of data taken from an ABI-3700 the (1) sequencing machine independence; ZTR format produces trace files which require 61.6% of (2) operating system independence; the disk space used by gzipped SCFv3, and which can be written and read at greater speed. The compression (3) an open and public format with sources available to algorithms used for the trace amplitudes are used within all; the National Center for Biotechnology Information (NCBI) (4) small file size; trace archive. (5) to introduce the idea of base call confidence values Availability: Source code is available from ftp: and encourage their use. //ftp.mrc-lmb.cam.ac.uk/pub/staden/io lib/io lib.tar.gz. A complete format description can be found at http: This format is now the most widely used and the sources //www.mrc-lmb.cam.ac.uk/pubseq/ztr.html. Test data is are available via ftp. During the intervening years we have available from ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/ produced two major revisions. The current one (SCFv3) io lib/test data. 
includes the use of a finite differences function plus a Contact: jkb@mrc-lmb.cam.ac.uk reorganization of the order of the file contents to potentiate the efficient use of standard compression programs. INTRODUCTION The content of trace files The genome projects performed to date are just a begin- ning, and as DNA sequencing is increasingly being used The minimum information needed in a DNA sequence for new scientific, medical and forensic purposes, the trace trace file is shown below with their percentage of the overall uncompressed SCF file size: data accumulated so far represent only a tiny fraction of the storage requirement of the future. (1) the base calls (1%); Major centres such as the National Center for Biotech- (2) base confidence values (4%); nology Information (NCBI) and the European Bioin- formatics Institute (EBI) are collecting the trace data (3) the trace amplitudes for each of the four base types (88%); files from genome projects and making them available via the internet (http://www.ncbi.nlm.nih.gov/Traces/, (4) the offsets of the base calls relative to the trace http://trace.ensembl.org/). It is important that the storage coordinates (represented by the element numbers of and transfer of trace data is efficient and that the format the trace values) (4%); used is easily adaptable. (5) various textual comments (sample identifiers, run To illustrate the size of the storage problem let us take date, etc.) (0.2%). a single human genome project as an example. Suppose The remaining ∼2.8% is in unused (marked as ‘spare’) we aim at 5-fold coverage, that we use sequencing fields. Typically ABI files will also contain additional instruments which generate Applied Biosystems’ ABI textual data plus arrays of current, voltages, temperature format trace files (typical file size 190 kb), and that we get and unprocessed trace amplitudes. on average 500 reliable bases per reading. 
Then we would A list of data items that today’s users may want to To whom correspondence should be addressed. store in the files is specified at the NCBI trace repository c Oxford University Press 2002 3 J.K.Bonfield and R.Staden Table 1. Gzip compression ratios on a selection of file types SYSTEM AND METHODS The design of ZTR builds on this previous work and borrows new ideas taken from the PNG format (Boutell Format Original size Gzipped size Fraction et al., 1997), the public successor to the GIF image ABI 18 158 424 8 427 773 0.464 format. We wanted to reduce data size further, and also to SCFv2 7 887 845 3 881 662 0.492 incorporate additional textual information. The key design SCFv3 7 887 845 2 396 562 0.304 principles are: (1) Extensibility: we cannot easily foresee what future data may need to be stored within a trace file, so we (Gorrell et al., 2000), but we do not know what may need a mechanism of incorporating new information be required in the future. For example we originally in a way which does not invalidate the file format. suggested that it would be useful to store a confidence (2) Small: a small size not only saves disk space, but value for each of the four base types at each position will reduce network usage and download times. in the sequence and provided four such slots in the SCF Rather than have a format which requires use of format. At present many use a single value, that for the external compression tools we would like the format called base only. However at least one group have a trace to specify its own compression methods. analysis program (ATQA, 1998) which in addition to an (3) Fast: ZTR file accessing should not be substantially overall confidence for each base call can also calculate slower than existing SCF implementations. Given the probability of insertions and deletions at each base that gzipped SCF files are the norm, we considered position. 
Luckily these useful values can be stored in the this to be our target speed for both reading and spare confidence value slots in the SCF file, but the format writing. we propose in this paper can readily incorporate new data types such as these. (4) Public: both the specification and the source code for an example implementation should be freely Space saving methods available to both academic and commercial users. Lossless file compression tools save disk space, not by Extensibility deleting information, but by analyzing the data to repack the information using fewer bytes. One of the most The basic structure of a ZTR file is a header indicating the commonly used tools for this is gzip. We determined that file format, followed by zero or more data blocks. In ZTR storing the data within an SCF file in a different order can we call these blocks ‘chunks’. significantly improve the performance of gzip. Also the The use of separate chunks for each data type implies trace amplitudes are not ideally suited to compression by new data types can be added without changing the basic gzip, but storing the differences between one value and the file structure and conversely chunks may be omitted. ZTR next reduces the signal variability. This finite differences readers should ignore chunks with unknown type and technique can be applied up to three times before com- so new files are backwards compatible. These features pression ratios start to suffer. These ideas were used to contribute to the extensibility of the ZTR format. form a new revision of SCF—version 3. Table 1 demon- The header structure is, in hex bytes: strates the compression ratios of gzip on a set of ABI-3700 8-byte magic number: AE 5A 54 52 0D 0A 1A 0A files in ABI, SCF version 2 and SCF version 3 formats. Format version, major: 01 More recently, Jean Thierry-Mieg at the NCBI produced Format version, minor: 02 a new trace format named CTF (unpublished) which compresses better than SCFv3. 
The CTF format specifies its own compression algorithms. A raw CTF file has a The magic number includes an 8-bit byte (AE), a control- similar size to gzipped SCFv3, but is substantially faster to Z (1A) character (used to indicate end-of-file under DOS), read back. Furthermore CTF files can still be compressed both bare newline (0A) and carriage-return newline (0D further by using external programs, such as gzip, hence 0A) combinations, and the text ‘ZTR’ (5A 54 52). The giving additional space savings. purpose of this is to act as an immediate check for the more Despite the size reduction of SCFv3 and CTF we troublesome aspects of file reading and data transfer and recently felt that another completely new file structure so aid the detection of corrupt files. For example, using ftp would both enable us to make the files even smaller and to transfer a ZTR file in ASCII mode may swap newline produce an extensible format which would facilitate its use and newline-carriage return. Such files will not have a for novel future applications of DNA sequencing. We have valid ZTR header, so rather than return a corrupted file named this new binary format ZTR. the reading code will return an error. 4 ZTR: New format for DNA sequence trace data Each chunk consists of a type, meta-data and data. The integer; the same as used by Phred (Ewing and Green, chunk data is the main information we wish to store. The 1998), TraceTuner (http://www.paracel.com/tracetuner/), meta-data, which is not needed for many chunk types, is ATQA (1998) and Li-Cor base callers. typically a small amount of information about the data. CNF4. The four confidence values stored in the same For example the chunk to store the digitized trace samples scale as CNF1, but with one value per base type. To aid will have the samples themselves in the data block and the compression, the confidence for all the called bases (which name of the channel (A, C, G, T) in the meta-data block. 
defaults to T if not A, C or G) is stored first followed by All integer values are stored using 4-byte values in big- the remaining confidence values for A, C, G and T. endian format (i.e. most significant byte first). The chunk structure is: CSID. The confidence values for substitution, insertion and deletion, stored in the −10 log (P ) scale. ATQA 4-byte chunk type: XX XX XX XX error Meta-data length (big endian): XX XX XX XX is one such program to produce these values. Meta-data: (any number of bytes, up to 2 ) Data length (big endian): XX XX XX XX CLIP. Poor quality clip points. Specified in base coor- Data: (any number of bytes, up to 2 ) dinates, this indicates where data (at both ends) should be considered as poor quality. This is included primarily The format of the meta-data and data elements is chunk for backwards compatibility with SCF—the CNF* chunk type dependent. The complete information may be found types provide more detailed information. in the on-line ZTR format specification (http://www. mrc-lmb.cam.ac.uk/pubseq/ztr.html). COMM. User defined text comments, in 8-bit ASCII. TEXT. A series of identifier-value pairs stored as one or Chunk types more sets of identifier, nul, value, nul terminating in an The chunk type may be considered to be a 4-character additional (i.e. double) nul character. The identifiers are string. Bit 5 of the first character indicates whether this defined as part of the ZTR spec, but have been taken from chunk type is part of the public ZTR specification (in the NCBI trace repository RFC version 1.17. which case bit 5 is clear) or whether it is a private extension (bit 5 is set). Bit 5 of the remaining three CR32. A 32-bit cyclic redundancy check (ANSI X3.66) characters is reserved for future use and so currently value of all the data since the last CR32 chunk, including should always be clear. Practically speaking this means the ZTR header if appropriate. 
that public chunk types consist entirely of uppercase letters and private chunk types start with a lowercase letter. Compression This means that TEXT and tEXT are two completely Each chunk data block is compressed using zero or independent chunk types and the similarity of their names more filtering and compression algorithms. The available does not imply a relationship between the format of their algorithm choices are: data. Also it is clear that private extensions will not clash with future public extensions. DELTA1, DELTA2, DELTA4. These apply the forward finite At present the publicly defined chunk types are: differences technique to 1, 2 or 4 byte words. This replaces each 1, 2 or 4 byte word with the difference between itself SAMP. A single channel of trace samples, stored in 16- and the previous word. It does not directly decrease the bit format. size of the data. Table 2 contains an example of DELTA1 filtering. SMP4. Four concatenated arrays of trace samples, stor- ing the same information as 4 SAMP chunks for the A, 16TO8, 32TO8. These attempt to store numerically small C, G and T channels. Note that both SMP4 and SAMP 16-bit and 32-bit integer values in a single 8-bit integer. chunks can be combined within the same file if desired. Values in the range of −127 to +127 are stored directly SMP4 typically gives compression ratios 4% smaller than in 8-bits. For values outside this range we emit −128 4 separate SAMP chunks, at a reduced CPU usage. followed by the actual 16 or 32-bit value. Table 3 contains an example of the 16TO8 filter type. BASE. Base calls, encoded using the NC-IUB character set (NC-IUB, 1985). FOLLOW1. This analyzes the complete data block to determine for each 8-bit value (‘x ’) which other value BPOS. A mapping of base numbers to trace sample most frequently follows it (follow (x )). Then for each byte numbers, stored as an array of 32-bit integer values. of data we store follow (previous byte)—current byte. To CNF1. 
The confidence values for the called base type. enable reversal of this function we also prepend the data The scale must be −10 log (P ) expressed as an 8-bit block with the 256-byte follow table. error 5 J.K.Bonfield and R.Staden Table 2. Example of levels 1–3 of the DELTA1 filter Level Data stream before and after the DELTA1 filters Entropy 0 +4 +7 +12 +17 +24 +30 +36 +40 +43 +43 +40 +35 +28 +21 +14 +9 7.50 1 +4 +3 +5 +5 +7 +6 +6 +4 +3 +0 −3 −5 −7 −7 −7 −5 6.16 2 +4 −1 +2 +0 +2 −1 +0 −2 −1 −3 −3 −2 −2 +0 +0 +2 4.97 3 +4 −5 +3 −2 +2 −3 +1 −2 +1 −2 +0 +1 +0 +2 +0 +2 5.62 Table 5. Summary of chunk type and the default filters and compression Table 3. An example of 16TO8 of 5 big-endian 16-bit numbers algorithms Before 00 4B 00 55 FF EB FC 22 00 BB Chunk type Filters/compressors (plus arguments) After 4B 55 EB 80 FC 22 80 00 BB SAMP/SMP4 DELTA2 (×3or ×2, depending on data range) 16TO8 FOLLOW1 Table 4. An example of the RLE compression method, using 8 as the token RLE ZLIB (Z HUFFMAN ONLY) Before567777 8 7 6 BASE/TEXT/COMM ZLIB (Z HUFFMAN ONLY) CNF1/CNF4/CSID DELTA1 (×1) After 5 6 8 4 7 8076 RLE ZLIB (Z HUFFMAN ONLY) BPOS DELTA4 (×1) 32TO8 RLE. Run length encoding. If 4 or more identical 8-bit ZLIB (Z HUFFMAN ONLY) values are detected in a row then RLE replaces this data with a special token followed by the number of repeated bytes and the value. If the token itself is within the raw data then it is output followed by zero. The token may as 16-bit quantities regardless of their actual scale. 8-bit be chosen to be a symbol with a low natural frequency. data (0–255) is compressed best using DELTA2 with 2 Table 4 contains an example of the RLE compression rounds, whereas full 16-bit data is compressed best using method. 3 rounds. ZLIB. Uses the zlib library (Deutsch and Gailly, 1996) to IMPLEMENTATION apply the LZ77 compression algorithm followed by Huff- The source code implementing the ZTR format is con- man encoding (Huffman, 1952). 
Zlib allows for Huffman tained within a library named ‘io lib’. This library also encoding only (denoted below as Z HUFFMAN ONLY), implements read-only support for the Applied Biosys- which for trace data typically reduces file size more than tems’ ABI and Pharmacia’s ALF format trace files and LZ77 and is faster. However all valid zlib streams are al- read–write support for the SCF and CTF formats. Io lib is lowed within a ZTR file. coded using ANSI C and is known to work on both UNIX The first byte of the encoded chunk data indicates the al- and Microsoft Windows based systems. Internally it uses gorithm used, followed by any algorithm specific param- a common C structure for storing a trace along with a eters required for decoding, followed by the encoded data itself. A value of zero for the first byte indicates the raw common programming interface for reading and writing data. Hence ZTR decoders simply need to keep recursively this structure. This means that the application does not applying the uncompression algorithms until the raw data need to know the file format of the trace data and so as is obtained. new formats are added existing applications will not need Experimentation has determined which sets of algo- to be modified or even recompiled. rithms are best applied to each type of chunk. Table 5 lists Io lib supports the notion of a trace search path, which the default filter and compression types used. Note that is independent from the trace format. Traces may be other combinations of filters and compression methods loaded directly from a file on disk in the current working may be used as they still produce a valid format ZTR file. directory, from an alternative directory, or extracted from The trace amplitudes (in the SAMP chunks) are all treated within a tar file. 6 ZTR: New format for DNA sequence trace data The tar file support allows for archiving many trace files Table 6. 
Total size in bytes and timings in seconds for 100 trace files into a single file, which has several benefits. It makes distribution of data much easier, it may reduce disk space Size in Read Write and it reduces the number of files on the disk. This is Instrument Format bytes time (s) time (s) important as most filesystems support a limited number of files, usually specified at the time of formatting. Although ABI-3700 ABI 18 915 025 3.55 – ABI-3700 Gzipped ABI 8 780 830 6.18 – this number is typically set very high, a large number of ABI-3700 Gzipped SCF 2 494 217 1.54 9.15 very small files can still cause problems. MegaBACE Gzipped SCF 3 953 805 1.73 13.14 Many filing systems also have a block size. The size Li-Cor Gzipped SCF 1 815 428 1.17 5.01 required to store a file of length N will be N rounded up to the next multiple of BLOCK SIZE; averaging at N + BLOCK SIZE/2. On Microsoft Windows the block Table 7. File size and I/O times as percentages relative to gzipped SCF size can often be as much as 64 kb, meaning an average wastage of 32 kb per file. Tar archives typically use an Size relative to SCF.gzip Average timings internal block size of 512 bytes, which greatly reduces Format ABI-3700 Mega- Licor Average Read & Encode & wasted space. This point is not to be underestimated; a BACE decode write 64 kb block size means that there is usually no saving in switching from gzipped SCF to ZTR unless tar archives SCF.raw 317.1 202.0 264.9 261.3 30.9 7.9 are also used. Fortunately UNIX file systems usually have SCF.gzip 100.0 100.0 100.0 100.0 100.0 100.0 much smaller block sizes. For example in Linux the ext2 SCF.bzip2 72.8 75.9 85.7 78.1 370.3 143.4 SCF.szip 71.1 74.8 80.2 75.4 937.9 164.6 filesystem has a block size of 1024, 2048 or 4096 bytes. 
Trace files within the tar file may be compressed if CTF.raw 96.2 112.2 114.5 107.6 34.2 117.6 CTF.gzip 70.2 80.3 83.0 77.8 79.2 144.4 desired, although with the ZTR format this is not advisable CTF.bzip2 65.6 72.0 79.2 72.2 324.6 217.3 due to the use of its own compression functions. However CTF.szip 63.2 70.5 75.5 69.8 740.1 227.6 the complete tar file itself must not be compressed as this ZTR(1).raw 150.0 99.5 220.1 156.5 34.9 8.2 would prevent random access within it. ZTR(1).gzip 69.5 73.1 84.2 75.6 85.4 50.6 In order to reduce time spent searching for files within ZTR(1).bzip2 62.9 68.7 72.3 68.0 370.9 125.9 a tar archive io lib can use an index file. The index ZTR(1).szip 60.8 67.2 68.4 65.4 779.2 129.8 consists of a series of lines containing the trace name and ZTR(2).raw 61.6 69.7 79.1 70.1 67.9 34.3 file offset, allowing for complete random access within the trace archive. The current implementation performs a linear search through the index, so access time is still proportional to number of files in the archive. However the for reading (when not cached) and 50% slower for time taken to find a file within a directory is also dependent writing. on the number of files contained within it. At present we A comparison between SCF, CTF and ZTR is presented only support read access to tar files. in Table 7. Here we have normalized the the sizes and times against the gzipped SCF results from Table 6. RESULTS Several compression tools are also compared, including gzip (implemented using zlib 1.1.3), bzip2 (version 1.0.1) We analyzed the performance of ZTR on multiple sets and szip (version 1.11). Gzip was implemented as a library of data covering several machine manufacturers and call and so avoids the need for running an external process. multiple sequencing chemistries, with each set consisting of 100 traces. 
The ABI-3700 and MegaBACE data sets This does not affect the size, but reduces the real and (from the Sanger Centre) were re-base-called using cpu time and so there is a small bias against the timings Phred 0.990722.g. The Li-Cor data set (from Genoscope) for bzip2 and szip. Both gzip and bzip2 are widely-used was converted from SCFv2 to SCFv3 format, but was open source programs. Szip is freely available for many operating systems, but is not open-source. It is included not re-base-called as the Li-Cor base-caller produces as an illustration of one of the best general purpose confidence values in the same log scale as Phred. All compression tools. of this data is publicly available on our ftp site. Table 6 Table 7 contains a lot of information so we have made presents the gzipped SCF size for each of these three the rows of formats which are faster at reading or writing sets along with the size for the ABI-3700 data set in the original ABI file format. The timings here represent than all others for a given file size bold. The remaining summation of the user and system CPU times, taken rows contain results which are bettered on both speed and from a 433 MHz Compaq Alpha running Digital UNIX size by at least one other format. For example SCF.gzip is V4.0E. Real times averaged at approximately 20% slower always beaten on speed and size by ZTR(2).raw. 7 J.K.Bonfield and R.Staden Table 8. Relative proportions of data within a ZTR file From this we can see that the Huffman encoding used for TEXT and BASE chunks is not optimal, mostly due to the small size of the information being compressed. The Chunk type File (%) Bits/item TEXT chunks in this data set do not include the NCBI text attributes and so their average size is just 196 bytes. 
With SMP4 92.07 3.24 bits/sample longer TEXT chunks the compression rates will improve, CNF4 2.72 4.45 bits/value BPOS 2.38 3.90 bits/value but it is unlikely the size will be a significant portion of BASE 1.59 2.60 bits/base the total file, so optimizing this will not provide an overall TEXT 1.23 7.89 bits/character improvement in compression. DISCUSSION The ZTR(1) and ZTR(2) formats are both valid ZTR We have presented ZTR as an extensible and compact files, but ZTR(1) does not include the final FOLLOW1 replacement to gzipped SCF, but have concentrated on the and ZLIB compression methods. This means that a raw issues of file compression. The Huffman encoding used ZTR(1) file is substantially larger than ZTR(2), but the in ZTR represents a very basic compression algorithm. more complex external compression tools (bzip2 and szip) Better entropy encoders are known, with arithmetic coding reduce the ZTR(1) files to less than ZTR(2). ZTR(2)’s (Rissanen and Langdon, 1979) being the most widely internal compression prevents external tools from further used. They may produce smaller files without too large an reducing the file size, so these values are not shown in impact on speed, but these algorithms often require larger Table 7. Both ZTR sets have been encoded using a single amounts of data to work efficiently. Higher order statistical SMP4 chunk instead of 4 separate SAMP chunks. In encoders (such as the PPM family; Cleary and Teahan, summary, whilst ZTR(2) is not the smallest file format 1997) may also reduce space, but these are currently (although it is close), to produce smaller files takes slow algorithms. It can be seen that the higher order substantially longer. 
The ZTR(2) implementation fulfils block sorting methods (http://www.compressconsult.com/ the goals of being faster than gzipped SCF with a much st/), as used in szip, and the Burrows–Wheeler transform smaller output and so is our default implementation of the (Burrows and Wheeler, 1994), as used in bzip2, give ZTR format. substantial improvements, but again the methods are The ZTR(1).gzip files are larger than the ZTR(2) files, relatively slow. despite the fact that the Huffman compression used within However the most profitable strategy may lie in trying ZTR(2) is the same code as used in gzip. This can be to curve-fit the data. We have experimented with using explained by noting that different chunks contain byte Chebyshev polynomials (Press et al., 1992) to fit the values with substantially different frequency distributions, previous 4 samples in order to predict a value for the but gzip averages all these together (assuming that the 5th sample. The difference between the predicted and entire file fits within one gzip block). An additional benefit real sample value can then be stored. This is still work to this approach is that random access to any chunk is in progress, but our current algorithm can compress the still possible, which in turn provides faster extraction of ABI-3700 data set to 56.7% of the gzipped SCF size (2.98 specific data. For example extracting just the base calls bits/sample). CPU performance is still a big issue with this from ZTR(2) files is faster than extracting them from method, with read times being approximately 2.7 times ZTR(1).gzip files. slower than gzipped SCF files. We can see that the Li-Cor files compress much less We have also examined the use of lossy compression for than the ABI and MegaBACE files. The main reason is the trace amplitudes. The simplest way to lose informa- that the Li-Cor data only stores 8-bit samples, compared tion uniformly is down-scaling. 
The original SCFv1 im- to the 11-bit data from ABI and MegaBACE machines. plementation stored information in 8-bits, but downscaling Scaling down the other data sets to 8-bit samples gives to any range also improves compression. Table 9 shows results comparable to the Li-Cor data (ZTR(2) is 72.9% the results of this form of lossy compression on the 100 for ABI and 80.4% for MegaBACE). ABI-3700 files using the ZTR(2) format. The other factor resulting in differences in compression We would not recommend using lossy compression for ratios between data sets is the noise in the trace data permanent archive of trace data, but for visual inspection (which depends in part on the preprocessing of the original over the network 7-bit data is generally adequate. data signals). As the noise increases the entropy of the data The nature of the ZTR format is such that, if useful, also increases, resulting in poorer compression. any of these alternative or additional methods can be Table 8 details the breakdown of a ZTR(2) file expressed implemented in the future without affecting the reading as a percentage of the overall size and in bits per item. This of older files. table was computed by averaging only the ABI-3700 files. The original size of the ABI-3700 data set is more than 8 ZTR: New format for DNA sequence trace data Table 9. The effect of down-scaling on file size duplication of work and reduce any fragmentation of the format. Initially such additions should be implemented as private types, but once stabilized these could be migrated Range Average ZTR(2) file size to public types in future revisions of the format. In the Staden Package individual traces stored in 0–1600 (lossless) 15 949 formats readable by io lib can be viewed using a pro- 0–1024 14 656 0–512 12 506 gram called Trev (Bonfield et al., 2002), available 0–256 10 846 from ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/trev/. 
Also 0–128 9 175 via io lib, multiple traces can be viewed using the 0–64 7 924 package’s main sequence assembly and editing pro- 0–32 6 765 gram, Gap4 (Staden et al., 1998), which uses a single binary but machine independent database for each sequencing project. This database stores sequence read- double the size of the uncompressed SCF files. This is ings, confidence values, contigs, templates, read-pair due to the additional information stored in an ABI file. data, annotations, edit information and links to the By defining further ZTR chunk types it would be possible trace data. The package also contains the gap4 viewer to store all the data in an ABI file within ZTR, utilizing (ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/gap4 viewer/). appropriate compression methods for each chunk. The This viewer is a complete but read-only version of Gap4 main proportion (96%) of an ABI file consists of the and hence enables all of the above information to be 12 DATA channels corresponding to raw and processed displayed using its graphical user interface. Gap4 viewer copies of the trace data and various instrument settings and trev executables for UNIX and Microsoft Windows (voltage, current, power, temperature). Of the remaining are available free to commercial and academic users from ABI information approximately 2/3 is base calls and base our ftp site. offsets. All of this already compresses well using ZTR. The io lib implementation could readily be extended We estimate that a complete ZTR encoded ABI file will to provide additional search paths to allow for direct be 27% of the original size, compared to 49% for gzip and file loading over the internet, possibly using CORBA 34% for bzip2. Hence ZTR is a suitable open format for (Parsons et al., 1999). This would enable any program use by manufacturers of sequencing instruments. using io lib, including trev and the gap4 viewer, to access and display traces directly from remote trace archives. 
In the SCF format the number and type of data items is rigidly defined in the header, with just one single 'private' block for additional data. ZTR overcomes this limitation by having an arbitrary number of chunks, with either public or private data types. CTF also overcomes many of the SCF limitations; however, it does not distinguish between public and private chunk types and does not separate the data from the compression and filter algorithms. These last two differences directly impact on the extensibility and hence the long term future of the format. ZTR file readers can be assured that chunk types listed in the public specification will be in a known format, but this does not preclude the addition of new chunk types or the development of new compression algorithms.

ZTR could also be used for related data such as that generated in Single Stranded Conformational Polymorphism (SSCP; Hayashi, 1991) experiments. Preliminary investigations of SSCP data have shown that ZTR produces size reductions similar to those achieved for sequencing traces. Although the public specification does not explicitly discuss storage of SSCP data it is envisaged that this will be achieved by using the existing SAMP chunk types with appropriate meta-data fields.

If, as we hope, others do wish to contribute new ZTR chunk types and compression methods we suggest that they contact us beforehand so that we can help to avoid duplication of work and reduce any fragmentation of the format.

Calculations based on typical files obtained from the Sanger Centre show that a gzipped Gap4 database from a finished assembly project occupies only 3% of the storage required for the project's gzipped SCF files. CAF (Dear et al., 1998) files are of comparable size.

Although consensus confidence values are useful when it comes to checking the evidence for individual bases in a consensus sequence from a genome project, we believe that where doubts arise most people would prefer to see all the relevant sequences and traces aligned. They are also likely to be interested only in specific regions and hence not need to download all the traces from the relevant project.

Bringing these last arguments together, in our view, if the sequence assembly databases were made publicly available somewhere, the extra 3% of storage needed would greatly increase the value of trace and sequence data archives, and in addition to the contribution made by ZTR, further reduce the bandwidth required to service the expected growth in this information.

ACKNOWLEDGEMENTS

The authors would like to thank Jean Thierry-Mieg for adding CTF to io_lib, which catalyzed us into finishing our own work on ZTR, Mark Jordan for the meta-data and general comments, Andrew McLachlan for the Chebyshev prediction idea, Steven Leonard for extending io_lib to use zlib instead of gzip and his ideas with tar support, and both the Sanger Centre and Genoscope for providing test data. This work was supported by the UK Medical Research Council.

REFERENCES

ATQA (1998) http://www.wagner.com/technologies/biotech/atqaadcopy.html, Wagner Associates.
Bonfield,J.K. and Staden,R. (1995) The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res., 23, 1406–1410.
Bonfield,J.K., Beal,K.F., Betts,M.J. and Staden,R. (2002) Trev: a DNA trace viewer. Bioinformatics, 18, 194–195.
Boutell,T. et al. (1997) Portable Network Graphics (PNG) specification version 1.0. RFC 2083, http://www.libpng.org/pub/png.
Burrows,M. and Wheeler,D.J. (1994) A block-sorting lossless data compression algorithm. Technical Report. Digital Equipment Corporation, Palo Alto, CA.
Cleary,J.G. and Teahan,W.J. (1997) Unbounded length contexts for PPM. The Comput. J., 40, 67–75.
Dear,S., Durbin,R., Hillier,L., Marth,G., Thierry-Mieg,J. and Mott,R. (1998) Sequence assembly with CAFTOOLS. Genome Res., 9, 260–267.
Dear,S. and Staden,R. (1992) A standard file format for data from DNA sequencing instruments. DNA Sequence, 3, 107–110.
Deutsch,P. and Gailly,J-L. (1996) ZLIB Compressed data format specification version 3.3. RFC 1950, http://www.gzip.org/zlib/
Ensembl Trace Server (2000) http://trace.ensembl.org/.
Ewing,B. and Green,P. (1998) Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res., 8, 186–
Gorrell,H.G. et al. (2000) NCBI trace archive RFC. http://www.ncbi.nlm.nih.gov/Traces/TraceArchiveRFC.html.
Hayashi,K. (1991) PCR-SSCP: a simple and sensitive method for detection of mutations in the genomic DNA. PCR Meth. Appl., 1, 34–38.
Huffman,D.A. (1952) A method for the construction of minimum-redundancy codes. Proc. IRE, 40, 1098–1101.
NCBI Trace Archive http://www.ncbi.nlm.nih.gov/Traces/.
NC-IUB (1985) Nomenclature Committee of the International Union of Biochemistry. Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984. Eur. J. Biochem, 150, 1–5. http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html.
Parsons,J.D., Buehler,E. and Hillier,L. (1999) DNA sequence chromatogram browsing using JAVA and CORBA. Genome Res., 9, 277–281.
Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. (1992) Numerical Recipes in C: The Art of Scientific Programming, 2nd edn, Cambridge University Press, Cambridge.
Rissanen,J.J. and Langdon,G.G. (1979) Arithmetic coding. IBM J. Res. Develop., 23, 149–162.
Staden,R., Beal,K.F. and Bonfield,J.K. (1998) The Staden Package 1998. Comput. Meth. Mol. Biol., 132, 115–130.

ZTR: a new format for DNA sequence trace data

Bioinformatics, Volume 18 (1): 8 – Jan 1, 2002

Publisher: Oxford University Press
Copyright: © Oxford University Press 2002
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/18.1.3

Preliminary investigations would greatly increase the value of trace and sequence of SSCP data have shown that ZTR produces size re- data archives, and in addition to the contribution made by ductions similar to those achieved for sequencing traces. ZTR, further reduce the bandwidth required to service the Although the public specification does not explicitly expected growth in this information. discuss storage of SSCP data it is envisaged that this will be achieved by using the existing SAMP chunk types with ACKNOWLEDGEMENTS appropriate meta-data fields. If, as we hope, others do wish to contribute new ZTR The authors would like to thank Jean Thierry-Mieg for the chunk types and compression methods we suggest that adding CTF to io lib which catalyzed us into finishing our they contact us beforehand so that we can help to avoid own work on ZTR, Mark Jordan for the meta-data and 9 J.K.Bonfield and R.Staden Deutsch,P. and Gailly,J-L. (1996) ZLIB Compressed data format general comments, Andrew McLachlan for the Chebyshev specification version 3.3. RFC 1950, http://www.gzip.org/zlib/ prediction idea, Steven Leonard for extending io lib to use Ensembl Trace Server (2000) http://trace.ensembl.org/. zlib instead of gzip and his ideas with tar support, and both Ewing,B. and Green,P. (1998) Base-calling of automated sequencer the Sanger Centre and Genoscope for providing test data. traces using Phred. II. Error probabilities. Genome Res., 8, 186– This work was supported by the UK Medical Research Council. Gorrell,H.G. et al. (2000) NCBI trace archive RFC. http://www. ncbi.nlm.nih.gov/Traces/TraceArchiveRFC.html. Hayashi,K. (1991) PCR-SSCP: a simple and sensitive method for REFERENCES detection of mutations in the genomic DNA. PCR Meth. Appl., ATQA (1998) http://www.wagner.com/technologies/biotech/ 1,34–38. atqaadcopy.html, Wagner Associates. Huffman,D.A. (1952) A method for the construction of minimum- Bonfield,J.K. and Staden,R. 
(1995) The application of numerical redundancy codes. Proc. IRE, 40, 1098–1101. estimates of base calling accuracy to DNA sequencing projects. NCBI Trace Archive http://www.ncbi.nlm.nih.gov/Traces/. Nucleic Acids Res., 23, 1406–1410. NC-IUB (1985) Nomenclature Committee of the International Bonfield,J.K., Beal,K.F., Betts,M.J. and Staden,R. (2002) Trev: a Union of Biochemistry. Nomenclature for incompletely specified DNA trace viewer. Bioinformatics, 18, 194–195. bases in nucleic acid sequences. Recommendations 1984. Eur. Boutell,T. et al. (1997) Portable Network Graphics (PNG) specifi- J. Biochem, 150,1–5. http://www.chem.qmw.ac.uk/iubmb/misc/ cation version 1.0. RFC 2083, http://www.libpng.org/pub/png. naseq.html. Burrows,M. and Wheeler,D.J. (1994) A block-sorting lossless data Parsons,J.D., Buehler,E. and Hillier,L. (1999) DNA sequence compression algorithm. Technical Report. Digital Equipment chromatogram browsing using JAVA and CORBA. Genome Res., Corporation, Palo Alto, CA. 9, 277–281. Cleary,J.G. and Teahan,W.J. (1997) Unbounded length contexts for Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. PPM. The Comput. J., 40,67–75. (1992) Numerical Recipies in C: The Art of Scientific Program- Dear,S., Durbin,R., Hillier,L., Marth,G., Thierry-Mieg,J. and ming, 2nd edn, Cambridge University Press, Cambridge. Mott,R. (1998) Sequence assembly with CAFTOOLS. Genome Rissanen,J.J. and Langdon,G.G. (1979) Arithmetic coding. IBM J. Res., 9, 260–267. Res. Develop., 23, 149–162. Dear,S. and Staden,R. (1992) A standard file format for data from Staden,R., Beal,K.F. and Bonfield,J.K. (1998) The Staden Package DNA sequencing instruments. DNA Sequence, 3, 107–110. 1998. Comput. Meth. Mol. Biol., 132, 115–130.
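
Two of the compression observations made in the text lend themselves to a small numerical illustration: that coding each chunk with its own statistics beats pooling all values into one model, and that down-scaling the trace amplitudes reduces the information per sample. The following is a minimal Python sketch under stated assumptions, not the io_lib implementation. Empirical order-0 entropy stands in for compressed size, the data are synthetic, and the names `entropy_bits` and `downscale` as well as the linear rescaling rule are hypothetical choices made only for this example.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Empirical order-0 entropy of a sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def downscale(samples, old_max, new_max):
    """Lossy linear rescaling of integer samples from 0..old_max to 0..new_max."""
    return [s * new_max // old_max for s in samples]


# Synthetic stand-ins (assumptions) for two chunks with very different
# value distributions: base calls and trace amplitudes.
base_calls = list(b"ACGT" * 250)  # four equally likely byte values
trace = [int(800 + 700 * math.sin(i / 7.0)) for i in range(1000)]  # within 0..1600

# Cost of coding each chunk with its own statistics versus one pooled model.
per_chunk_bits = (len(base_calls) * entropy_bits(base_calls)
                  + len(trace) * entropy_bits(trace))
pooled = base_calls + trace
pooled_bits = len(pooled) * entropy_bits(pooled)

# Down-scaling merges neighbouring sample values, so entropy cannot increase.
scaled = downscale(trace, 1600, 128)

print(f"per-chunk total: {per_chunk_bits:.0f} bits, pooled: {pooled_bits:.0f} bits")
print(f"trace: {entropy_bits(trace):.2f} bits/sample, "
      f"scaled to 0..128: {entropy_bits(scaled):.2f} bits/sample")
```

Order-0 entropy is only a proxy: gzip combines LZ77 matching with Huffman coding, and ZTR applies filters before compressing, so the absolute numbers printed here are not comparable with the file sizes in Table 9. The sketch shows only the direction of both effects.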

Journal

Bioinformatics, Oxford University Press

Published: Jan 1, 2002
