Accurate whole human genome sequencing using reversible terminator chemistry

David R. Bentley; Shankar Balasubramanian; Harold P. Swerdlow; Geoffrey P. Smith; John Milton; Clive G. Brown; Kevin P. Hall; Dirk J. Evers; Colin L. Barnes; Helen R. Bignell; Jonathan M. Boutell; Jason Bryant; Richard J. Carter; R. Keira Cheetham; Anthony J. Cox; Darren J. Ellis; Michael R. Flatbush; Niall A. Gormley; Sean J. Humphray; Leslie J. Irving; Mirian S. Karbelashvili; Scott M. Kirk; Heng Li; Xiaohai Liu; Klaus S. Maisinger; Lisa J. Murray; Bojan Obradovic; Tobias Ost; Michael L. Parkinson; Mark R. Pratt; Isabelle M. J. Rasolonjatovo; Mark T. Reed; Roberto Rigatti; Chiara Rodighiero; Mark T. Ross; Andrea Sabot; Subramanian V. Sankar; Aylwyn Scally; Gary P. Schroth; Mark E. Smith; Vincent P. Smith; Anastassia Spiridou; Peta E. Torrance; Svilen S. Tzonev; Eric H. Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D. Alam; Carole Anastasi; Ify C. Aniebo; David M. D. Bailey; Iain R. Bancarz; Saibal Banerjee; Selena G. Barbour; Primo A. Baybayan; Vincent A. Benoit; Kevin F. Benson; Claire Bevis; Phillip J. Black; Asha Boodhun; Joe S. Brennan; John A. Bridgham; Rob C. Brown; Andrew A. Brown; Dale H. Buermann; Abass A. Bundu; James C. Burrows; Nigel P. Carter; Nestor Castillo; Maria Chiara E. Catenazzi; Simon Chang; R. Neil Cooley; Natasha R. Crake; Olubunmi O. Dada; Konstantinos D. Diakoumakos; Belen Dominguez-Fernandez; David J. Earnshaw; Ugonna C. Egbujor; David W. Elmore; Sergey S. Etchin; Mark R. Ewan; Milan Fedurco; Louise J. Fraser; Karin V. Fuentes Fajardo; W. Scott Furey; David George; Kimberley J. Gietzen; Colin P. Goddard; George S. Golda; Philip A. Granieri; David E. Green; David L. Gustafson; Nancy F. Hansen; Kevin Harnish; Christian D. Haudenschild; Narinder I. Heyer; Matthew M. Hims; Johnny T. Ho; Adrian M. Horgan; Katya Hoschler; Steve Hurwitz; Denis V. Ivanov; Maria Q. Johnson; Terena James; T. A. Huw Jones; Gyoung-Dong Kang; Tzvetana H. Kerelska; Alan D. Kersey; Irina Khrebtukova; Alex P. Kindwall; Zoya Kingsbury; Paula I. 
Kokko-Gonzales; Anil Kumar; Marc A. Laurent; Cynthia T. Lawley; Sarah E. Lee; Xavier Lee; Arnold K. Liao; Jennifer A. Loch; Mitch Lok; Shujun Luo; Radhika M. Mammen; John W. Martin; Patrick G. McCauley; Paul McNitt; Parul Mehta; Keith W. Moon; Joe W. Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M. Novo; Michael J. O’Neill; Mark A. Osborne; Andrew Osnowski; Omead Ostadan; Lambros L. Paraschos; Lea Pickering; Andrew C. Pike; Alger C. Pike; D. Chris Pinkard; Daniel P. Pliskin; Joe Podhasky; Victor J. Quijano; Come Raczy; Vicki H. Rae; Stephen R. Rawlings; Ana Chiva Rodriguez; Phyllida M. Roe; John Rogers; Maria C. Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K. Roth; Natalie J. Rourke; Silke T. Ruediger; Eli Rusman; Raquel M. Sanches-Kuiper; Martin R. Schenker; Josefina M. Seoane; Richard J. Shaw; Mitch K. Shiver; Steven W. Short; Ning L. Sizto; Johannes P. Sluis; Melanie A. Smith; Jean Ernest Sohna Sohna; Eric J. Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L. Tregidgo; Gerardo Turcatti; Stephanie vandeVondele; Yuli Verhovsky; Selene M. Virk; Suzanne Wakelin; Gregory C. Walcott; Jingwen Wang; Graham J. Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C. Mullikin; Matthew E. Hurles; Nick J. McCooke; John S. West; Frank L. Oaks; Peter L. Lundberg; David Klenerman; Richard Durbin; Anthony J. Smith

doi:10.1038/nature07517

Nature | Vol 456 | 6 November 2008 | ARTICLES

A list of authors and their affiliations appears at the end of the paper.

DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit.
Sequencing projects have traditionally used long (400–800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30× average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.

© 2008 Macmillan Publishers Limited. All rights reserved.

DNA sequencing yields an unrivalled resource of genetic information. We can characterize individual genomes, transcriptional states and genetic variation in populations and disease. Until recently, the scope of sequencing projects was limited by the cost and throughput of Sanger sequencing. The raw data for the three billion base (3 gigabase (Gb)) human genome sequence, completed in 2004 (ref. 1), was generated over several years for ~$300 million using several hundred capillary sequencers. More recently an individual human genome sequence has been determined for ~$10 million by capillary sequencing. Several new approaches at varying stages of development aim to increase sequencing throughput and reduce cost (refs 3–6). They increase parallelization markedly by imaging many DNA molecules simultaneously. One instrument run produces typically thousands or millions of sequences that are shorter than capillary reads. Another human genome sequence was recently determined using one of these approaches. However, much bigger improvements are necessary to enable routine whole human genome sequencing in genetic research.

We describe a massively parallel synthetic sequencing approach that transforms our ability to use DNA and RNA sequence information in biological systems. We demonstrate utility by re-sequencing an individual human genome to high accuracy. Our approach delivers data at very high throughput and low cost, and enables extraction of genetic information of high biological value, including single-nucleotide polymorphisms (SNPs) and structural variants.

DNA sequencing using reversible terminators

We generated high-density single-molecule arrays of genomic DNA fragments attached to the surface of the reaction chamber (the flow cell) and used isothermal ‘bridging’ amplification to form DNA ‘clusters’ from each fragment. We made the DNA in each cluster single-stranded and added a universal primer for sequencing. For paired read sequencing, we then converted the templates to double-stranded DNA and removed the original strands, leaving the complementary strand as template for the second sequencing reaction (Fig. 1a–c). To obtain paired reads separated by larger distances, we circularized DNA fragments of the required length (for example, 2 ± 0.2 kb) and obtained short junction fragments for paired end sequencing (Fig. 1d).

We sequenced DNA templates by repeated cycles of polymerase-directed single base extension. To ensure base-by-base nucleotide incorporation in a stepwise manner, we used a set of four reversible terminators, 3′-O-azidomethyl 2′-deoxynucleoside triphosphates (A, C, G and T), each labelled with a different removable fluorophore (Supplementary Fig. 1a) (ref. 8). The use of 3′-modified nucleotides allowed the incorporation to be driven essentially to completion without risk of over-incorporation. It also enabled addition of all four nucleotides simultaneously rather than sequentially, minimizing risk of misincorporation. We engineered the active site of 9°N DNA polymerase to improve the efficiency of incorporation of these unnatural nucleotides. After each cycle of incorporation, we determined the identity of the inserted base by laser-induced excitation of the fluorophores and imaging. We added tris(2-carboxyethyl)phosphine (TCEP) to remove the fluorescent dye and side arm from a linker attached to the base and simultaneously regenerate a 3′ hydroxyl group ready for the next cycle of nucleotide addition (Supplementary Fig. 1b). The Genome Analyzer (GA1) was designed to perform multiple cycles of sequencing chemistry and imaging to collect the sequence data automatically from each cluster on the surface of each lane of an eight-lane flow cell (Supplementary Fig. 2).

To determine the sequence from each cluster, we quantified the fluorescent signal from each cycle and applied a base-calling algorithm. We defined a quality (Q) value for each base call (scaled as by the phred algorithm) that represents the likelihood of each call being correct (Supplementary Fig. 3). We used the Q-values in subsequent analyses to weight the contribution of each base to sequence alignment and detection of sequence variants (for example, SNP calling). We discarded all reads from mixed clusters and used the remaining ‘purity filtered’ reads for analysis. Typically we generated 1–2 Gb of high-quality purity filtered sequence per flow cell from ~30–60 million single 35-base reads, or 2–4 Gb in a paired read experiment (Supplementary Table 1).

To demonstrate accurate sequencing of human DNA, we sequenced a human bacterial artificial chromosome (BAC) clone (bCX98J21) that contained 162,752 bp of the major histocompatibility complex on human chromosome 6 (accession AL662825.4, previously determined using capillary sequencing by the Wellcome Trust Sanger Institute). We developed a fast global alignment algorithm ELAND that aligns a read to the reference only if the read can be assigned a unique position with 0, 1 or 2 differences. We collected 0.17 Gb of aligned data for the BAC from one lane of a flow cell. Approximately 90% of the 35-base reads matched perfectly to the reference, demonstrating high raw read accuracy (Supplementary Fig. 4). To examine consensus coverage and accuracy, we used 5 Mb of 35-base purity filtered reads (30-fold average input depth of the BAC) and obtained 99.96% coverage of the reference. There was one consensus miscall, at a position of very low coverage (just above our cutoff threshold), yielding an overall consensus accuracy of >99.999%.

Detecting genetic variation of the human X chromosome

For an initial study of genetic variation, we sequenced flow-sorted X chromosomes of a Caucasian female (sample NA07340 originating from the Centre d’Etude du Polymorphisme Humain (CEPH)). We generated 278 million paired 30–35-bp purity filtered reads and aligned them to the human genome reference sequence. We carried out separate analyses of the data using two alignment algorithms: ELAND (see above) or MAQ (Mapping and Assembly with Qualities). Both algorithms place each read pair where it best matches the reference and assign a confidence score to the alignment. In cases where a read has two or more equally likely positions (that is, in an exact repeat), MAQ randomly assigns the read pair to one position and assigns a zero alignment quality score (these reads are excluded from SNP analysis).
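The two ways of handling repeats described above can be illustrated with a toy example. The Python sketch below is not ELAND or MAQ: it is a naive scan written only for illustration, and the function names (`candidate_hits`, `place_unique`, `place_any`) and the short reference string are hypothetical. It assumes ungapped placement of a single read, counts differences as mismatches, and treats a placement as unique when exactly one position achieves the fewest differences.

```python
import random


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_hits(read, reference, max_diff=2):
    """List (position, differences) for every ungapped placement within max_diff."""
    n = len(read)
    hits = []
    for pos in range(len(reference) - n + 1):
        d = hamming(read, reference[pos:pos + n])
        if d <= max_diff:
            hits.append((pos, d))
    return hits


def best_positions(read, reference, max_diff=2):
    """Positions that share the smallest number of differences (may be empty)."""
    hits = candidate_hits(read, reference, max_diff)
    if not hits:
        return []
    best = min(d for _, d in hits)
    return [pos for pos, d in hits if d == best]


def place_unique(read, reference, max_diff=2):
    """Unique-only policy: return a position only if the best placement is unique."""
    best = best_positions(read, reference, max_diff)
    return best[0] if len(best) == 1 else None


def place_any(read, reference, max_diff=2, rng=random):
    """Random-tie policy: choose one best placement at random and flag ties.

    Returns (position, ambiguous). A downstream SNP caller would skip
    placements flagged as ambiguous, mirroring a zero alignment quality.
    """
    best = best_positions(read, reference, max_diff)
    if not best:
        return None
    return rng.choice(best), len(best) > 1


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"  # hypothetical toy reference
    for read in ("TTGAC", "TTGTC", "ACGT"):
        print(read, place_unique(read, reference), place_any(read, reference))
```

Under the unique-only policy the repeated read `ACGT` is left unplaced, whereas the random-tie policy places it but marks it as ambiguous, so coverage is retained while the read can still be excluded from variant calling.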
ELAND rejects all non-unique alignments, which are mostly in recently inserted retrotransposons (see Supplementary Fig. 5). MAQ therefore provides an opportunity to assess the properties of a data set aligned to the entire reference, whereas ELAND effectively excludes ambiguities from the short read alignment before further analysis.

Figure 1 | Preparation of samples. a, DNA fragments are generated, for example, by random shearing and joined to a pair of oligonucleotides in a forked adaptor configuration. The ligated products are amplified using two oligonucleotide primers, resulting in double-stranded blunt-ended material with a different adaptor sequence on either end. b, Formation of clonal single-molecule array. DNA fragments prepared as in a are denatured and single strands are annealed to complementary oligonucleotides on the flow-cell surface (hatched). A new strand (dotted) is copied from the original strand in an extension reaction that is primed from the 3′ end of the surface-bound oligonucleotide; the original strand is then removed by denaturation. The adaptor sequence at the 3′ end of each copied strand is annealed to a new surface-bound complementary oligonucleotide, forming a bridge and generating a new site for synthesis of a second strand (dotted). Multiple cycles of annealing, extension and denaturation in isothermal conditions result in growth of clusters, each ~1 µm in physical diameter. This follows the basic method outlined in ref. 33. c, The DNA in each cluster is linearized by cleavage within one adaptor sequence (gap marked by an asterisk) and denatured, generating single-stranded template for sequencing by synthesis to obtain a sequence read (read 1; the sequencing product is dotted). To perform paired-read sequencing, the products of read 1 are removed by denaturation, the template is used to generate a bridge, the second strand is re-synthesized (shown dotted), and the opposite strand is then cleaved (gap marked by an asterisk) to provide the template for the second read (read 2). d, Long-range paired-end sample preparation. To sequence the ends of a long (for example, >1 kb) DNA fragment, the ends of each fragment are tagged by incorporation of biotinylated (B) nucleotide and then circularized, forming a junction between the two ends. Circularized DNA is randomly fragmented and the biotinylated junction fragments are recovered and used as starting material in the standard sample preparation procedure illustrated in a. The orientation of the sequence reads relative to the DNA fragment is shown (magenta arrows). When aligned to the reference sequence, these reads are oriented with their 5′ ends towards each other (in contrast to the short insert paired reads produced as shown in a–c). See Supplementary Fig. 17a for examples of both. Turquoise and blue lines represent oligonucleotides and red lines represent genomic DNA. All surface-bound oligonucleotides are attached to the flow cell by their 5′ ends. Dotted lines indicate newly synthesized strands during cluster formation or sequencing. (See Supplementary Methods for details.)

We obtained comprehensive coverage of the X chromosome from both analyses. With MAQ, 204 million reads aligned to 99.94% of the X chromosome at an average depth of 43×. With ELAND, 192 million reads covered 91% of the reference sequence, showing what can be covered by unique best alignments. These results were obtained after excluding reads aligning to non-X sequence (impurities of flow sorting) and apparently duplicated read pairs (Supplementary Table 2). We reasoned that these duplicates (~10% of the total) arose during initial sample amplification.

The sampling of sequence fragments from the X chromosome is close to random. This is evident from the distribution of mapped read depth in the MAQ alignment in regions where the reference is unique (Fig. 2a): the variance of this distribution is only 2.26 times that of a Poisson distribution (the theoretical minimum). Half of this excess variance can be accounted for by a dependence on G+C content. However, the average mapped read depth only falls below 10× in regions with G+C content less than 4% or greater than 76%, comprising in total just 1% of unique chromosome sequence and 3% of coding sequence (Fig. 2b).

Figure 2 | X chromosome data. a, Distribution of mapped read depth in the X chromosome data set (NA07340), sampled at every 50th position along the chromosome and displayed as a histogram (‘All’). An equivalent analysis of mapped read depth for the unique subset of these positions is also shown (‘Unique only’). The solid line represents a Poisson distribution with the same mean. b, Distribution of X chromosome uniquely mapped reads as a function of G+C content. Note that the x axis is per cent G+C content and is scaled by percentile of unique sequence. The solid line is average mapped depth of unique sequence; the grey region is the central 80% of the data (10th to 90th centiles); the dashed lines are 10th and 90th centiles of a Poisson distribution with the same mean as the data.

We identified 92,485 candidate SNPs in the X chromosome using ELAND (Supplementary Fig. 6). Most calls (85%) match previous entries in the public database dbSNP. Heterozygosity (π) in this data set is 4.3 × 10⁻⁴ (that is, one substitution per 2.3 kb), close to a previously published X chromosome estimate (4.7 × 10⁻⁴). Using MAQ we obtained 104,567 SNPs, most of which were common to the results of the ELAND analysis. The differences between the two sets of SNP calls are largely the consequence of different properties of the alignments as described earlier. For example, most of the SNPs found only by the MAQ-based analysis were at positions of low or zero sequence depth in the ELAND alignment (Supplementary Fig. 6c).

We assessed accuracy and completeness of SNP calling by comparison to genotypes obtained for this individual using the Illumina HumanHap550 BeadChip (HM550). The sequence data covered >99.8% of the 13,604 genotyped positions and we found excellent agreement between sequence-based SNP calls and genotyping data (99.52% or 99.99% using ELAND or MAQ, respectively; Supplementary Table 3). There was complete concordance of all homozygous calls and a low level of ‘under-calling’ from the sequence data (denoted as ‘GT>Seq’ in Table 1) at a small number of the heterozygous sites, caused by inadequate sampling of one of the two alleles. The depth of input sequence influences the coverage and accuracy of SNP calling. We found that reducing the read depth to 15× still gives 97% coverage of genotype positions and only 1.27% of the heterozygous sites are under-called. We observed no other types of disagreement at any input depth (Supplementary Fig. 7).

We detected structural variants (defined as any variant other than a single base substitution) as follows. We found 9,747 short insertions/deletions (‘short indels’; defined here as less than the length of the read) by performing a gapped alignment of individual reads (Supplementary Fig. 8). We identified larger indels based on read depth and/or anomalous read pair spacing, similar to previous approaches (refs 13–15). We detected 115 indels in total, 77 of which were visible from anomalous read-pair spacing (see Supplementary Tables 4 and 5). We developed Resembl, an extension to the Ensembl browser, to view all variants (Supplementary Fig. 9). Inversions can be detected when the orientation of one read in a pair is reversed (for example, see Supplementary Fig. 10). In general, inversions occur as the result of non-allelic homologous recombination, and are therefore flanked by repetitive sequence that can compromise alignments. We found partial evidence for other inversion events, but characterization of inversions from short read data is complex because of the repeats and requires further development.

Sequencing and analysis of a whole human genome

Our X chromosome study enabled us to develop an integrated set of methods for rapid sequencing and analysis of whole human genomes. We sequenced the genome of a male Yoruba from Ibadan, Nigeria (YRI, sample NA18507). This sample was originally collected for the HapMap project (refs 17, 18) through a process of community engagement and informed consent and has also been studied in other projects (refs 20, 21). We were therefore able to compare our results with publicly available data from the same sample. We constructed two libraries: one of short inserts (~200 bp) with similar properties to the previous X chromosome library and one from long fragments (~2 kb) to provide longer-range read-pair information (see Supplementary Fig. 11 for size distributions).

We generated 135 Gb of sequence (~4 billion paired 35-base reads; see Supplementary Table 6) over a period of 8 weeks (December 2007 to January 2008) on six GA1 instruments averaging 3.3 Gb per production run (see Supplementary Table 1 for example). The approximate consumables cost (based on full list price of reagents) was $250,000. We aligned 97% of the reads using MAQ and found that 99.9% of the human reference (NCBI build 36.1) was covered with one or more reads at an average of 40.6-fold depth. Using ELAND, we aligned 91% of the reads over 93% of the reference sequence at sufficient depth to call a strong consensus (>three Q30 bases).

The distribution of mapped read depth was close to random, with slight over-dispersion as seen for the X chromosome data. We observed comprehensive representation across a wide range of G+C content, dropping only at the very extreme ends, but with a different pattern of distribution compared to the X chromosome (see Supplementary Fig. 12).

We identified ~4 million SNPs, with 74% matching previous entries in dbSNP (Fig. 3). We found excellent agreement of our SNP calls with genotyping results: sequence-based SNP calls covered almost all of the 552,710 loci of HM550, with >99.5% concordance of sequencing versus genotyping calls (Table 1 and Supplementary Table 7a). The few disagreements were mostly under-calls of heterozygous positions (GT>Seq) in areas of low sequence depth, providing us with a false-negative rate of <0.35% from the ELAND analysis (see Table 1). The other disagreements (0.09% of all genotypes) included errors in genotyping plus apparent tri-allelic SNPs (Supplementary Table 7a). The main cause of genotype error (0.05% of all genotypes) is the existence of a second ‘hidden’ SNP close to the assayed locus that disrupts the genotyping assay, leading to loss of one allele and an erroneous homozygous genotype (Supplementary Figs 13 and 14).

Table 1 | Comparison of SNP calls made from sequence versus genotype data for the human genome (NA18507) and X chromosome (NA07340)

Aligner              ELAND      ELAND      ELAND      MAQ        MAQ        MAQ        MAQ        MAQ
Sample               X          Human      Human      X          Human      Human      Human      Human
SNP panel            HM550      HM550      HM-All     HM550      HM550      HM-All     Combined   Combined
SNPs in panel        13,604     552,710    3,699,592  13,604     552,710    3,699,592  530,750    530,750
Unit                 %          %          %          %          %          %          %          n
Covered by sequence  99.77      99.60      99.24      99.91      99.74      99.29      99.78      529,589
Concordant calls     99.52      99.57      98.80      99.99      99.90      99.12      99.94      529,285
All disagreements    0.48       0.43       1.20       0.01       0.10       0.88       0.06       304
GT>Seq               0.48       0.35       0.46       0.01       0.03       0.15       0.02       130
Seq>GT               0          0.05       0.52       0          0.05       0.54       0.02       130
Other discordances   0          0.03       0.22       0          0.02       0.20       0.01       44

SNP panels referred to are HM550 (Illumina Infinium HumanHap550 BeadChip) and HM-All (complete data from phase 1 and phase 2 of the International HapMap Project). ‘Combined’ is a set of concordant genotypes from both sets (HM550 and HM-All; see text). GT>Seq denotes a heterozygous genotyping SNP call where there is a homozygous sequencing SNP call (one of the two alleles); Seq>GT denotes the converse (that is, a heterozygous sequencing SNP call where there is a homozygous genotyping call). Other discordances are differences in the two SNP calls that cannot be accounted for by one allele being missing from one call.

To examine the accuracy of SNP calling in more detail, we compared our sequence-based SNP calls with 3.7 million genotypes (HM-All) generated for this sample during the HapMap project (Table 1 and Supplementary Table 7b) and found excellent concordance between the data sets. Disagreements included sequence-based under-calls of heterozygous positions in regions of low read depth. The slightly higher level of other disagreements (0.76%) seen in this analysis compared to that of the HM550 data (0.09%) is in line with the higher level of underlying genotype error rate of 0.7% for the HapMap data. To refine this analysis further, we generated a set of 530,750 very high confidence reference genotypes comprising concordant calls in both the HM550 and HM-All genotype data sets. Comparing the results of the MAQ analysis to this high confidence set (see Table 1), we found 130 heterozygote under-calls GT>Seq (that is, a false-negative rate of 0.025%). There were also 130 heterozygote over-calls Seq>GT, but most of these are probably genotype errors as 82 have a nearby ‘hidden’ SNP and 3 have a nearby indel. A further 41 are tri-allelic loci, leaving at most 4 potential wrong calls by sequencing (that is, false-positive rate of 4 per 529,589 positions). Finally we selected a subset of novel SNP calls from the sequence data and tested them by genotyping. We found 96.1% agreement between sequence and genotype calls (Supplementary Table 8). However, the 47 disagreements included 10 correct sequencing calls (genotyping under-calls owing to hidden SNPs) and 7 sequencing under-calls. On this basis, therefore, the false-positive discovery rate for the one million novel SNPs is 2.5% (30 out of 1,206). For the entire data set of four million SNPs detected in this analysis, the false-positive and -negative rates both average <1%.

Figure 3 | SNPs identified in the human genome sequence of NA18507. a, Number of SNPs detected by class and percentage in dbSNP (release 128). Results from ELAND and MAQ alignments are reported separately. b, Analysis of SNPs detected in each analysis reveals extensive overlap. The percentage of NA18507 SNP calls that match previous entries in dbSNP is lower than that of our X chromosome study (see Supplementary Fig. 6). We expect this because individual NA07340 (from the X chromosome study) was also previously used for discovery and submission of SNPs to dbSNP during the HapMap project, in contrast to NA18507.

Panel a:
Call          ELAND SNPs (n)  ELAND in dbSNP (%)  MAQ SNPs (n)  MAQ in dbSNP (%)
Homozygote    1,417,320       90.1                1,503,420     90.8
Heterozygote  2,411,022       63.9                2,635,776     63.8
All           3,828,342       73.6                4,139,196     73.6

Panel b: ELAND only, 215,844 (42.4% dbSNP); ELAND and MAQ, 3,612,498 (75.5% dbSNP); MAQ only, 526,698 (60.8% dbSNP).

This genome from a Yoruba individual contains significantly more polymorphism than a genome of European descent. The autosomal heterozygosity (π) of NA18507 is 9.94 × 10⁻⁴ (1 SNP per 1,006 bp), higher than previous values for Caucasians (7.6 × 10⁻⁴, ref. 12). Heterozygosity in the pseudoautosomal region 1 (PAR1) is substantially higher (1.92 × 10⁻³) than the autosomal value. PAR1 (2.7 Mb) at the tip of the short arm of chromosomes X and Y undergoes obligatory recombination in male meiosis, which is equivalent to 20× the autosome average. This illustrates a clear correlation between recombination and nucleotide diversity. By contrast, the 0.33-Mb PAR2 region has a much lower recombination rate than PAR1; we observed that heterozygosity in PAR2 is identical to that of the autosomes in NA18507. Heterozygosity in coding regions is lower (0.54 × 10⁻³) than the total autosome average, consistent with the model that some coding changes are deleterious and are lost as the result of natural selection. Nevertheless, the 26,140 coding SNPs (Supplementary Fig. 15) include 5,361 non-conservative amino acid substitutions plus 153 premature termination codons (Supplementary Table 9), many of which are expected to affect protein function.

We performed a genome-wide survey of structural variation in this individual and found excellent correlation with variants that had been reported in previous studies, as well as detecting many new variants. We found 0.4 million short indels (1–16 bp; Supplementary Fig. 16), most of which are length polymorphisms in homopolymeric tracts of A or T. Half of these events are corroborated by entries in dbSNP, and 95 of 100 examined were present in amplicons sequenced from this individual in ENCODE regions, confirming the high specificity of this method of short indel detection.

For larger structural variants (detected by anomalously spaced paired ends) we found that some were detected by both long and short insert data sets (Supplementary Fig. 17a), but most were unique to one or other data set. We observed two reasons for this: first, small events (<400 bp) are within the normal size variance of the long insert data; second, nearby repetitive structures can prevent unique alignment of read pairs (see Supplementary Fig. 17b, c). In some cases, the high resolution of the short insert data permits detection of additional complexity in a structural rearrangement that is not revealed by the long insert data. For example, where the long insert data indicate a 1.3-kb deletion in NA18507 relative to the reference, the short insert data reveal an inversion accompanied by deletions at both breakpoints (Fig. 4). We carried out de novo assembly of reads in this region and constructed a single contig that defines the exact structure of the rearrangement (data not shown).

We discovered 5,704 structural variants ranging from 50 bp to >35 kb where there is sequence absent from the genome of NA18507 compared to the reference genome. We observed a steadily decreasing number of events of this type with increasing size, except for two peaks (Supplementary Fig. 18). Most of the events represented by the large peak at 300–350 bp contain a sequence of the AluY family. This is consistent with insertion of short interspersed nuclear elements (SINEs) that are present in the reference genome but missing from the genome of NA18507. Similarly, the second, smaller peak at 6–7 kb is the consequence of insertion of the long interspersed nuclear element (LINE) L1 Homo sapiens (L1Hs) in

Figure 4 | Homozygous complex rearrangement detected by anomalous paired reads. The rearrangement involves an inversion of 369 bp (blue–turquoise bar in the schematic diagram) flanked by deletions (red bars) of 1,206 and 164 bp, respectively, at the left- and right-hand breakpoints. a, Summary tracks in the Resembl browser, denoting scale, simulated alignability of reads to reference (blue plot), actual aligned depth of coverage by NA18507 reads (green plot), density of anomalous reads indicating structural variants (red plot; peaks denote ‘hotspots’) and density of singleton reads (pink plot). b, Anomalous long-insert read pairs (orange lines denote DNA fragment; blocks at either end denote each read); the data indicate loss of ~1.3 kb in NA18507 relative to the reference. c, Anomalous short-insert pairs of two types (red and pink) indicate an inverted sequence flanked by two deletions. d, Normal short-insert read-pair alignments (each green line denotes the extent of the reference that is covered by the short fragment, including the two reads). e, The schematic diagram depicts the arrangement of normal and anomalous read pairs relative to the rearrangement. Top line, structure of NA18507; second line, structure of reference sequence. Green bars denote sequence that is collinear in the reference and NA18507 genomes. The turquoise–blue bar illustrates the inverted segment. Red bars indicate the sequences present in the reference but absent in NA18507. Arrows denote orientation of reads when aligned to the reference. The display in a–d is a composite of screen shots of the same window, overlapped for display purposes.

Supplementary Fig. 20. The ‘singleton’ reads on either side of the event, which have partners that do not align to the reference, form part of a de novo assembly that precisely defines the novel sequence and breakpoint (Supplementary Fig. 21).

Effect of sequence depth on coverage and accuracy

We investigated the impact of varying input read depth (and hence cost) on SNP calling using chromosome 2 as a model. SNP discovery increases with increasing depth: essentially all homozygous positions are detected at 15×, whereas heterozygous positions accumulate more gradually to 33× (Fig. 5a). This effect is influenced by the stringency of the SNP caller. To call each allele in this analysis we required the equivalent of two high-quality Q30 bases (as opposed to three used in full depth analyses).
Homozygotes could be detected at read depth of 23 or higher, whereas heterozygote detection required at least double this depth for sampling of both alleles. Missing calls (not covered by sequence) and discordances between sequence-based SNP calls and genotype loci (mostly under-calls of heterozygotes due to low depth) progressively reduced with increasing depth (Fig. 5b). We observed very few other types of discordance at any depth; many of these are genotyping errors as described above. Concluding remarks Reversible terminator chemistry is a defining feature of this sequen- cing approach, enabling each cycle to be driven to completion while minimizing misincorporation. The result is a system that generates 4 kb accurate data at very high throughput and low cost. We determined an accurate whole human genome sequence in 8 weeks to an average depth of ,403. We built a consensus sequence, optimized methods for analysis, assessed accuracy and characterized the genetic variation of this individual in detail. We assessed accuracy relative to genotype data over the entire fraction of the human sequence where SNP calling was possible (.90%). We established very low false-positive and -negative rates for the ,four million SNPs detected (,1% over-calls and under- calls). This compares favourably with previous individual genome analyses which reported a 24% under-calling of heterozygous posi- 2,7 tions . many cases. We found good correspondence between our results and Paired reads were very powerful in all areas of the analysis. They the data of ref. 23, which reported 148 deletions of,100 kb in this provided very accurate read alignment and thus improved the accu- individual on the basis of abnormal fosmid paired-end spacing. We racy and coverage of consensus sequence and SNP calling. They were found supporting evidence for 111 of these events. 
We detected a essential for developing our short indel caller, and for detecting larger further 2,345 indels in the range 60–160 bp which are sequences structural variants. Our short-insert paired-read data set introduced present in the genome of NA18507 and absent from the reference a new level of resolution in structural variation detection, revealing genome (Supplementary Fig. 19). One example is shown in thousands of variants in a size range not characterized previously. In Macmillan Publishers Limited. All rights reserved © 2008 ARTICLES NATURE |Vol 456 |6 November 2008 filtered read data are available for download from the Short Read Archive at a SNP calls versus sequence depth NCBI or from the European Short Read Archive (ERA) at the EBI. 350,000 Analysis software. Image analysis software and the ELAND aligner are provided 300,000 as part of the Genome Analyzer analysis software. SNP and structural variant detectors will be available as future upgrades of the analysis pipeline. The 250,000 Resembl extension to Ensembl is available on request. The MAQ (Mapping and Assembly with Qualities) aligner is freely available for download from 200,000 http://maq.sourceforge.net. Data access. Sequence data for NA18507 are freely available from the NCBI short 150,000 read archive, accession SRA000271 (ftp://ftp.ncbi.nih.gov/pub/TraceDB/ ShortRead/SRA000271). X chromosome data are freely available from ERA, 100,000 accession ERA000035. Links to Resembl displays for chromosome X and human 50,000 data, plus information on other available data, are provided at http://www. illumina.com/HumanGenome. See Supplementary Methods for a detailed Methods section. 0 5 10 15 20 25 30 35 Average input read depth (fold) Received 24 June; accepted 2 October 2008. b Missing or discordant data versus sequence depth 1. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004). 2. Levy, S. et al. 
The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007). 3. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380 (2005). 10 4. Shendure, J. et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728–1732 (2005). 5. Harris, T. D. et al. Single-molecule DNA sequencing of a viral genome. Science 320, 106–109 (2008). 6. Lundquist, P. M. et al. Parallel confocal detection of single molecules in real time. 4 Opt. Lett. 33, 1026–1028 (2008). 7. Wheeler, D. A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008). 8. Milton, J. et al. Modified nucleotides. World Intellectual Property Organization 0 5 10 15 20 25 30 35 WO/2004/018497 (2004). Average input read depth (fold) 9. Smith, G. P. et al. Modified polymerases for improved incorporation of nucleotide analogues. World Intellectual Property Organization WO/2005/024010 Figure 5 | Effect of sequence depth on coverage and accuracy of human (2005). genome sequencing. ELAND alignments were used for this analysis. 10. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. a, Accumulation of sequence-based SNP calls, including all SNPs (squares), Error probabilities. Genome Res. 8, 186–194 (1998). heterozygous SNPs (triangles) and homozygous SNPs (circles) with 11. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res.. doi:10.1101/gr.078212.108 increasing input read depth. b, Decrease in genotype positions not covered (25 September 2008). by sequence (squares), heterozygote under-calls in sequence data relative to 12. The International SNP Map Working Group. A map of human genome sequence genotype data (triangles) and discordant SNP calls compared to genotypes variation containing 1.42 million single nucleotide polymorphisms. 
Nature 409, (circles) with increasing input read depth. Vertical dotted lines indicate 928–933 (2001). various input read depths (103,153,303 haploid genome). 13. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nature Genet. 37, 727–732 (2005). 14. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the some cases we determined the exact sequence of structural variants by human genome. Science 318, 420–426 (2007). de novo assembly from the same paired-read data set. Interpreting 15. Campbell, P. J. et al. Identification of somatically acquired rearrangements in events that are embedded in repetitive sequence tracts will require cancer using genome-wide massively parallel paired-end sequencing. Nature Genet. 40, 722–729 (2008). further work. 16. Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Res. 30, Massively parallel sequencing technology makes it feasible to con- 38–41 (2002). sider whole human genome sequencing as a clinical tool in the near 17. The International HapMap Consortium. A haplotype map of the human genome. future. Characterizing multiple individual genomes will enable us to Nature 437, 1299–1320 (2005). 18. The International HapMap Consortium. A second generation human haplotype unravel the complexities of human variation in cancer and other map of over 3.1 million SNPs. Nature 449, 851–861 (2007). diseases and will pave the way for the use of personal genome 19. The International HapMap Consortium. The International HapMap Project. sequences in medicine and healthcare. Accuracy of personal genetic Nature 426, 789–796 (2003). information from sequence will be critical for life-changing decisions. 20. The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, In addition to the large-scale genomic projects exemplified by the 799–816 (2007). 
15,24–26 present study and others , the system described here is being 21. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, used to explore biological phenomena in unprecedented detail, 444–454 (2006). including transcriptional activity, mechanisms of gene regulation 22. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding 27–32 regions of human genes. Nature Genet. 22, 231–238 (1999). and epigenetic modification of DNA and chromatin . In the 23. Kidd, J. M. et al. Mapping and sequencing of structural variation from eight human future, DNA sequencing will be the central tool for unravelling genomes. Nature 453, 56–64 (2008). how genetic information is used in living processes. 24. Hillier, L. W. et al. Whole-genome sequencing and variant discovery in C. elegans. Nature Methods 5, 183–188 (2008). METHODS SUMMARY 25. Hodges, E. et al. Genome-wide in situ exon capture for selective resequencing. Nature Genet. 39, 1522–1527 (2007). DNA and sequencing. DNA samples (NA07340 and NA18507) and cell line 26. Porreca, G. J. et al. Multiplex amplification of large sets of human exons. Nature (GM07340) were obtained from Coriell Repositories. DNA samples were geno- Methods 4, 931–936 (2007). typed on the HM550 array and the results compared to publicly available data to 27. Barski, A. et al. High-resolution profiling of histone methylations in the human confirm their identity before use. Methods for DNA manipulation, including genome. Cell 129, 823–837 (2007). sample preparation, formation of single-molecule arrays, cluster growth and 28. Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. Genome-wide mapping of in sequencing were all developed during this study and formed the basis for the vivo protein-DNA interactions. Science 316, 1497–1502 (2007). standard protocols now available from Illumina, Inc. All sequencing was per- 29. Mikkelsen, T. S. et al. 
Genome-wide maps of chromatin state in pluripotent and formed on Illumina GA1s equipped with a one-megapixel camera. All purity lineage-committed cells. Nature 448, 553–560 (2007). Macmillan Publishers Limited. All rights reserved © 2008 Number of SNP calls Missing or discordant calls (%) NATURE |Vol 456 |6 November 2008 ARTICLES 1 1 3 1 1 30. Boyle, A. P. et al. High-resolution mapping and characterization of open chromatin Asha Boodhun , Joe S. Brennan , John A. Bridgham , Rob C. Brown , Andrew A. Brown , 3 1 3 4 across the genome. Cell 132, 311–322 (2008). Dale H. Buermann , Abass A. Bundu , James C. Burrows , Nigel P. Carter , Nestor 3 1 3 1 1 31. Lister, R. et al. Highly integrated single-base resolution maps of the epigenome in Castillo , Maria Chiara E. Catenazzi , Simon Chang , R. Neil Cooley , Natasha R. Crake , 1 1 1 Arabidopsis. Cell 133, 523–536 (2008). Olubunmi O. Dada , Konstantinos D. Diakoumakos , Belen Dominguez-Fernandez , 1,2 1 3 3 32. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and David J. Earnshaw , Ugonna C. Egbujor , David W. Elmore , Sergey S. Etchin , Mark R. 3 5 1 1 2 quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5, 585–587 Ewan , Milan Fedurco , Louise J. Fraser , Karin V. Fuentes Fajardo , W. Scott Furey , 3 6 1 3 (2008). David George , Kimberley J. Gietzen , Colin P. Goddard , George S. Golda , Philip A. 3 1 3 7 1 33. Fedurco, M., Romieu, A., Williams, S., Lawrence, I. & Turcatti, G. BTA, a novel Granieri , David E. Green , David L. Gustafson , Nancy F. Hansen , Kevin Harnish , 3 1 1 3 reagent for DNA attachment on glass and efficient generation of solid-phase Christian D. Haudenschild , Narinder I. Heyer , Matthew M. Hims , Johnny T. Ho , 1 1 3 3 amplified DNA colonies. Nucleic Acids Res. 34, e22 (2006). Adrian M. Horgan , Katya Hoschler , Steve Hurwitz , Denis V. Ivanov , Maria Q. 3 1 1 1 Johnson , Terena James , T. A. Huw Jones , Gyoung-Dong Kang , Tzvetana H. 
Supplementary Information is linked to the online version of the paper at 3 1 3 3 1 Kerelska , Alan D. Kersey , Irina Khrebtukova , Alex P. Kindwall , Zoya Kingsbury , www.nature.com/nature. 1 1 6 6 Paula I. Kokko-Gonzales , Anil Kumar , Marc A. Laurent , Cynthia T. Lawley , Sarah E. 1 3 3 1 3 3 Acknowledgements The authors acknowledge the advice of A. Williamson, T. Rink, Lee , Xavier Lee , Arnold K. Liao , Jennifer A. Loch , Mitch Lok , Shujun Luo , Radhika 1 3 1 3 1 S. Benkovic, J. Berriman, J. Todd, R. Waterston, S. Eletr, W. Jack, M. Cooper, M. Mammen , John W. Martin , Patrick G. McCauley , Paul McNitt , Parul Mehta , 3 3 1 4 4 T. Brown, C. Reece and R. Cook during this work; E. Margulies for assistance with Keith W. Moon , Joe W. Mullens , Taksina Newington , Zemin Ning , Bee Ling Ng , 1 3 1,2 1 data analysis; M. Shumway for assistance with data submission; and the Sonia M. Novo , Michael J. O’Neill , Mark A. Osborne , Andrew Osnowski , Omead 3,6 3 1 1 3 contributions of the administrative and support staff at all the institutions. This Ostadan , Lambros L. Paraschos , Lea Pickering , Andrew C. Pike , Alger C. Pike ,D. 3 3 3 3 1 research was supported in part by The Wellcome Trust (to H.L., A.Sc., K.W., N.P.C, Chris Pinkard , Daniel P. Pliskin , Joe Podhasky , Victor J. Quijano , Come Raczy , Vicki 1 1 1 1 1 B.N.L., J.R., M.E.H. and R.D.), the Biotechnology and Biological Sciences Research H. Rae , Stephen R. Rawlings , Ana Chiva Rodriguez , Phyllida M. Roe , John Rogers , 1 1 5 3 Council (BBSRC) (to S.B. and D.K.), the BBSRC Applied Genomics LINK Programme Maria C. Rogert Bacigalupo , Nikolai Romanov , Anthony Romieu , Rithy K. Roth , 1 1 3 1 (to A.Sp. and C.L.B.) and the Intramural Research Program of the National Human Natalie J. Rourke , Silke T. Ruediger , Eli Rusman , Raquel M. Sanches-Kuiper , Martin 1 3 1 3 3 Genome Research Institute, National Institutes of Health (to N.F.H. and J.C.M.). R. Schenker , Josefina M. Seoane , Richard J. Shaw , Mitch K. 
Shiver , Steven W. Short , 3 3 1 1 S. Balasubramanian and D. Klenerman are inventors and founders of Solexa Ltd. Ning L. Sizto , Johannes P. Sluis , Melanie A. Smith , Jean Ernest Sohna Sohna , Eric J. 3 1 1 1 1 Spence , Kim Stevens , Neil Sutton , Lukasz Szajkowski , Carolyn L. Tregidgo , Gerardo Author Information Reprints and permissions information is available at 5 1 3 3 Turcatti , Stephanie vandeVondele , Yuli Verhovsky , Selene M. Virk , Suzanne www.nature.com/reprints. This paper is distributed under the terms of the 3 3 1 1 3 Wakelin , Gregory C. Walcott , Jingwen Wang , Graham J. Worsley , Juying Yan , Ling Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely 3 3 4 7 4 Yau , Mike Zuerlein , Jane Rogers {, James C. Mullikin , Matthew E. Hurles , Nick J. available to all readers at www.nature.com/nature. The authors declare competing 1 3 3 3 2 McCooke {, John S. West , Frank L. Oaks , Peter L. Lundberg , David Klenerman , financial interests: details accompany the full-text HTML version of the paper at 4 1 Richard Durbin & Anthony J. Smith www.nature.com/nature. Correspondence and requests for materials should be addressed to D.R.B. ([email protected]). Illumina Cambridge Ltd. (Formerly Solexa Ltd), Chesterford Research Park, Little Chesterford, Nr Saffron Walden, Essex CB10 1XL, UK. Department of Chemistry, University of Cambridge, The University Chemical Laboratory, Lensfield Road, 1 2 1 Cambridge CB2 1EW, UK. Illumina Hayward (Formerly Solexa Inc.), 23851 Industrial David R. Bentley , Shankar Balasubramanian , Harold P. Swerdlow {, Geoffrey P. 1 1 1 1 1 1,2 Boulevard, Hayward, California 94343, USA. The Wellcome Trust Sanger Institute, Smith , John Milton {, Clive G. Brown {, Kevin P. Hall , Dirk J. Evers , Colin L. Barnes , 1 1 1 1 Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. Manteia Helen R. Bignell , Jonathan M. Boutell , Jason Bryant , Richard J. Carter , R. Keira 1 1 1 3 1 Predictive Medicine S.A. 
Zone Industrielle, Coinsins, CH-1267, Switzerland. Illumina Cheetham , Anthony J. Cox , Darren J. Ellis , Michael R. Flatbush , Niall A. Gormley , 1 1 3 3 4 Inc., Corporate Headquarters, 9883 Towne Centre Drive, San Diego, California 92121, Sean J. Humphray , Leslie J. Irving , Mirian S. Karbelashvili , Scott M. Kirk , Heng Li , 1,2 1 1 1 1 USA. National Human Genome Research Institute, National Institutes of Health, 41 Xiaohai Liu , Klaus S. Maisinger , Lisa J. Murray , Bojan Obradovic , Tobias Ost , 1 3 1 3 Center Drive, MSC 2132, 9000 Rockville Pike, Bethesda, Maryland 20892-2132, USA. Michael L. Parkinson , Mark R. Pratt , Isabelle M. J. Rasolonjatovo , Mark T. Reed , 1 1 1 1 {Present addresses: The Wellcome Trust Sanger Institute, Wellcome Trust Genome Roberto Rigatti , Chiara Rodighiero , Mark T. Ross , Andrea Sabot , Subramanian V. 3 4 3 1 1 Campus, Hinxton, Cambridge CB10 1SA, UK (H.P.S.); Oxford Nanopore Technologies, Sankar , Aylwyn Scally , Gary P. Schroth , Mark E. Smith , Vincent P. Smith , 1 1 3 3 Anastassia Spiridou , Peta E. Torrance , Svilen S. Tzonev , Eric H. Vermaas , Klaudia Begbroke Science Park, Sandy Lane, Kidlington OX5 1PF, UK (J.M., C.G.B.); BBSRC 4 1 3 3 1 Genome Analysis Centre, John Innes Centre, Norwich Research Park, Colney, Norwich Walter , Xiaolin Wu , Lu Zhang , Mohammed D. Alam , Carole Anastasi , Ify C. 1 1 1 3 1 Aniebo , David M. D. Bailey , Iain R. Bancarz , Saibal Banerjee , Selena G. Barbour , NR4 7UH, UK (J.R.); Pronota, NV, VIB Bio-Incubator, Technologiepark 4, B-9052 3 1 1 1 1 Primo A. Baybayan , Vincent A. Benoit , Kevin F. Benson , Claire Bevis , Phillip J. Black , Zwijnaarde/Ghent, Belgium (N.J.M.). Macmillan Publishers Limited. All rights reserved © 2008 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nature Springer Journals http://www.deepdyve.com/lp/springer-journals/accurate-whole-human-genome-sequencing-using-reversible-terminator-FDG3dFuoyo
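The validation figures quoted in the article can be rechecked with a few lines of arithmetic. The sketch below uses only counts stated in the text (1,206, 47, 10, 7 and the autosomal heterozygosity of 9.94 × 10^-4). The variable names are illustrative, and treating 1,206 as the number of novel SNP calls tested by genotyping is an assumption inferred from the phrase "30 out of 1,206".

```python
# Counts quoted in the article; variable names are illustrative only.
tested = 1206              # assumed: novel SNP calls tested by genotyping
disagreements = 47         # sequence call and genotype call differ
correct_seq_calls = 10     # genotyping under-calls caused by hidden SNPs
seq_under_calls = 7        # sequencing under-calls (false negatives)

# Agreement between sequence and genotype calls.
agreement = (tested - disagreements) / tested

# Disagreements that remain after removing correct calls and under-calls
# are counted as sequencing false positives.
false_positives = disagreements - correct_seq_calls - seq_under_calls
false_positive_rate = false_positives / tested

# Autosomal heterozygosity expressed as "1 SNP per N bp".
heterozygosity = 9.94e-4
bp_per_snp = 1 / heterozygosity

print(f"agreement: {100 * agreement:.1f}%")
print(f"false positives: {false_positives} of {tested} "
      f"({100 * false_positive_rate:.1f}%)")
print(f"1 SNP per {bp_per_snp:.0f} bp")
```

The three derived values match the figures reported in the text (96.1%, 2.5% and 1 SNP per 1,006 bp), which suggests the quoted counts are internally consistent.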

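The statement that heterozygote detection needs at least double the depth required for homozygotes can be illustrated with a simple sampling model. This is a minimal sketch under assumptions that are not taken from the article: per-site read depth is Poisson distributed, each read at a heterozygous site samples either allele with probability 0.5, and an allele is callable once it has been observed twice (standing in for the "two high-quality Q30 bases" criterion). Base quality, alignment and the actual SNP caller are ignored, and the function names are hypothetical.

```python
import math


def p_at_least(k, mean):
    """Return P(X >= k) for X ~ Poisson(mean)."""
    cdf_below_k = sum(
        math.exp(-mean) * mean**i / math.factorial(i) for i in range(k)
    )
    return 1.0 - cdf_below_k


def p_homozygote_callable(depth, min_obs=2):
    """Probability that a homozygous site has at least min_obs reads."""
    return p_at_least(min_obs, depth)


def p_heterozygote_callable(depth, min_obs=2):
    """Probability that both alleles are each seen at least min_obs times.

    With Poisson depth and 50:50 allele sampling, the two allele counts
    are independent Poisson variables with mean depth / 2.
    """
    return p_at_least(min_obs, depth / 2.0) ** 2


if __name__ == "__main__":
    for depth in (2, 5, 10, 15, 30):
        hom = p_homozygote_callable(depth)
        het = p_heterozygote_callable(depth)
        print(f"{depth:>2}x  homozygote {hom:.4f}  heterozygote {het:.4f}")
```

Under these assumptions a heterozygote at depth 2d is callable with probability equal to the square of the homozygote probability at depth d, which is one way to read the observation that heterozygous calls accumulate more slowly with depth than homozygous calls.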

Sherry, M. Feolo, Andrew Skol, Houcan Zhang, I. Matsuda, Y. Fukushima, D. Macer, Eiko Suda, C. Rotimi, C. Adebamowo, I. Ajayi, Toyin Aniagwu, P. Marshall, C. Nkwodimmah, C. Royal, M. Leppert, M. Dixon, A. Peiffer, Renzong Qiu, A. Kent, Kazuto Kato, N. Niikawa, I. Adewole, B. Knoppers, Morris Foster, E. Clayton, Jessica Watkin, D. Muzny, L. Nazareth, E. Sodergren, G. Weinstock, I. Yakub, B. Birren, R. Wilson, L. Fulton, J. Rogers, J. Burton, N. Carter, C. Clee, M. Griffiths, Matthew Jones, K. McLay, R. Plumb, M. Ross, S. Sims, D. Willey, Zhu Chen, Hua Han, L. Kang, M. Godbout, J. Wallenburg, P. L'Archevêque, G. Bellemare, Koji Saeki, Hongguang Wang, Daochang An, Hongbo Fu, Qing Li, Zhen Wang, Ren-hao Wang, A. Holden, L. Brooks, J. Mcewen, M. Guyer, V. Wang, Jane Peterson, Michael Shi, J. Spiegel, L. Sung, Lynn Zacharia, F. Collins, Karen Kennedy, Ruth Jamieson, J. Stewart (2007)
A second generation human haplotype map of over 3.1 million SNPs
Nature, 449
Toshihiro Tanaka (2003)
The International HapMap Project
Nature, 426
J. Milton (2004)
Modified nucleotides
PCT International publication number WO 2004/018497
S. Levy, G. Sutton, P. Ng, L. Feuk, A. Halpern, B. Walenz, Nelson Axelrod, Jiaqi Huang, E. Kirkness, Gennady Denisov, Yuan Lin, J. MacDonald, Andy Wing, Chun Pang, M. Shago, Timothy Stockwell, Alexia Tsiamouri, V. Bafna, V. Bansal, S. Kravitz, D. Busam, K. Beeson, T. McIntosh, K. Remington, J. Abril, J. Gill, Jon Borman, Y. Rogers, M. Frazier, S. Scherer, R. Strausberg, J. Venter (2007)
The Diploid Genome Sequence of an Individual Human
PLoS Biology, 5
Jan Korbel, Alexander Urban, J. Affourtit, Brian Godwin, Fabian Grubert, Jan Simons, Philip Kim, D. Palejev, Nicholas Carriero, Lei Du, Bruce Taillon, Zhoutao Chen, Andrea Tanzer, C. A., Eugenia Saunders, Jianxiang Chi, Fengtang Yang, Nigel Carter, M. Hurles, Sherman Weissman, Timothy Harkins, Mark Gerstein, Michael Egholm, Michael Snyder (2007)
Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome
Science, 318
Bland Ewing, L. Hillier, M. Wendl, Philip Green (1998)
Base-calling of automated sequencer traces using phred. I. Accuracy assessment.
Genome research, 8 3
J. Kidd, G. Cooper, W. Donahue, H. Hayden, N. Sampas, T. Graves, Nancy Hansen, Brian Teague, C. Alkan, F. Antonacci, E. Haugen, Troy Zerr, N. Yamada, P. Tsang, Tera Newman, Eray Tüzün, Ze Cheng, H. Ebling, N. Tusneem, R. David, W. Gillett, K. Phelps, M. Weaver, David Saranga, A. Brand, Wei Tao, E. Gustafson, K. McKernan, Lin Chen, M. Malig, Joshua Smith, Joshua Korn, S. Mccarroll, D. Altshuler, D. Peiffer, M. Dorschner, J. Stamatoyannopoulos, D. Schwartz, D. Nickerson, Jim Mullikin, R. Wilson, L. Bruhn, M. Olson, R. Kaul, Douglas Smith, E. Eichler (2008)
Mapping and sequencing of structural variation from eight human genomes
Nature, 453
M. Cargill, D. Altshuler, J. Ireland, P. Sklar, K. Ardlie, N. Patil, C. Lane, Esther Lim, Nilesh Kalyanaraman, J. Nemesh, L. Ziaugra, L. Friedland, A. Rolfe, J. Warrington, R. Lipshutz, G. Daley, E. Lander (1999)
Characterization of single-nucleotide polymorphisms in coding regions of human genes
Nature Genetics, 22
T. Mikkelsen, Manching Ku, D. Jaffe, B. Issac, Erez Lieberman, G. Giannoukos, P. Alvarez, W. Brockman, Tae-Kyung Kim, R. Koche, William Lee, E. Mendenhall, Aisling O'Donovan, Aviva Presser, C. Russ, Xiaohui Xie, A. Meissner, Marius Wernig, R. Jaenisch, C. Nusbaum, E. Lander, B. Bernstein (2007)
Genome-wide maps of chromatin state in pluripotent and lineage-committed cells
Nature, 448

Publisher: Springer Journals
Copyright: Copyright © 2008 by The Author(s)
Subject: Science, Humanities and Social Sciences, multidisciplinary; Science, multidisciplinary
ISSN: 0028-0836
eISSN: 1476-4687
DOI: 10.1038/nature07517

Abstract

Vol 456 | 6 November 2008 | doi:10.1038/nature07517
© 2008 Macmillan Publishers Limited. All rights reserved
Accurate whole human genome sequencing using reversible terminator chemistry
A list of authors and their affiliations appears at the end of the paper.

DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400–800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30× average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.

DNA sequencing yields an unrivalled resource of genetic information. We can characterize individual genomes, transcriptional states and genetic variation in populations and disease. Until recently, the scope of sequencing projects was limited by the cost and throughput of Sanger sequencing. The raw data for the three billion base (3 gigabase (Gb)) human genome sequence, completed in 2004 (ref. 1), was generated over several years for ~$300 million using several hundred capillary sequencers. More recently an individual human genome sequence has been determined for ~$10 million by capillary sequencing. Several new approaches at varying stages of development aim to increase sequencing throughput and reduce cost (refs 3–6). They increase parallelization markedly by imaging many DNA molecules simultaneously. One instrument run produces typically thousands or millions of sequences that are shorter than capillary reads. Another human genome sequence was recently determined using one of these approaches. However, much bigger improvements are necessary to enable routine whole human genome sequencing in genetic research.

We describe a massively parallel synthetic sequencing approach that transforms our ability to use DNA and RNA sequence information in biological systems. We demonstrate utility by re-sequencing an individual human genome to high accuracy. Our approach delivers data at very high throughput and low cost, and enables extraction of genetic information of high biological value, including single-nucleotide polymorphisms (SNPs) and structural variants.

DNA sequencing using reversible terminators

We generated high-density single-molecule arrays of genomic DNA fragments attached to the surface of the reaction chamber (the flow cell) and used isothermal ‘bridging’ amplification to form DNA ‘clusters’ from each fragment. We made the DNA in each cluster single-stranded and added a universal primer for sequencing. For paired read sequencing, we then converted the templates to double-stranded DNA and removed the original strands, leaving the complementary strand as template for the second sequencing reaction (Fig. 1a–c). To obtain paired reads separated by larger distances, we circularized DNA fragments of the required length (for example, 2 ± 0.2 kb) and obtained short junction fragments for paired end sequencing (Fig. 1d).

We sequenced DNA templates by repeated cycles of polymerase-directed single base extension. To ensure base-by-base nucleotide incorporation in a stepwise manner, we used a set of four reversible terminators, 3′-O-azidomethyl 2′-deoxynucleoside triphosphates (A, C, G and T), each labelled with a different removable fluorophore (Supplementary Fig. 1a; ref. 8). The use of 3′-modified nucleotides allowed the incorporation to be driven essentially to completion without risk of over-incorporation. It also enabled addition of all four nucleotides simultaneously rather than sequentially, minimizing risk of misincorporation. We engineered the active site of 9°N DNA polymerase to improve the efficiency of incorporation of these unnatural nucleotides. After each cycle of incorporation, we determined the identity of the inserted base by laser-induced excitation of the fluorophores and imaging. We added tris(2-carboxyethyl)phosphine (TCEP) to remove the fluorescent dye and side arm from a linker attached to the base and simultaneously regenerate a 3′ hydroxyl group ready for the next cycle of nucleotide addition (Supplementary Fig. 1b). The Genome Analyzer (GA1) was designed to perform multiple cycles of sequencing chemistry and imaging to collect the sequence data automatically from each cluster on the surface of each lane of an eight-lane flow cell (Supplementary Fig. 2).

To determine the sequence from each cluster, we quantified the fluorescent signal from each cycle and applied a base-calling algorithm. We defined a quality (Q) value for each base call (scaled as by the phred algorithm) that represents the likelihood of each call being correct (Supplementary Fig. 3). We used the Q-values in subsequent analyses to weight the contribution of each base to sequence alignment and detection of sequence variants (for example, SNP calling). We discarded all reads from mixed clusters and used the remaining ‘purity filtered’ reads for analysis. Typically we generated 1–2 Gb of high-quality purity filtered sequence per flow cell from ~30–60 million single 35-base reads, or 2–4 Gb in a paired read experiment (Supplementary Table 1).

To demonstrate accurate sequencing of human DNA, we sequenced a human bacterial artificial chromosome (BAC) clone (bCX98J21) that contained 162,752 bp of the major histocompatibility complex on human chromosome 6 (accession AL662825.4, previously determined using capillary sequencing by the Wellcome Trust Sanger Institute). We developed a fast global alignment algorithm ELAND that aligns a read to the reference only if the read can be assigned a unique position with 0, 1 or 2 differences. We collected 0.17 Gb of aligned data for the BAC from one lane of a flow cell. Approximately 90% of the 35-base reads matched perfectly to the reference, demonstrating high raw read accuracy (Supplementary Fig. 4). To examine consensus coverage and accuracy, we used 5 Mb of 35-base purity filtered reads (30-fold average input depth of the BAC) and obtained 99.96% coverage of the reference. There was one consensus miscall, at a position of very low coverage (just above our cutoff threshold), yielding an overall consensus accuracy of >99.999%.

Detecting genetic variation of the human X chromosome

For an initial study of genetic variation, we sequenced flow-sorted X chromosomes of a Caucasian female (sample NA07340 originating from the Centre d’Etude du Polymorphisme Humain (CEPH)).
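
As a brief aside on the quality scale and depth figures quoted in the methods above, the following is a minimal sketch rather than the authors' pipeline. It assumes the standard phred definition, Q = −10·log10(P_error), and treats average input depth as total sequenced bases divided by target length. The function names are illustrative and do not come from the paper.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a phred-scaled quality value Q to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a phred-scaled quality."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def fold_depth(total_bases: float, target_length: float) -> float:
    """Average input depth: sequenced bases divided by target length."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return total_bases / target_length


if __name__ == "__main__":
    # A Q30 base call corresponds to roughly 1 error in 1,000 calls.
    print(phred_to_error_prob(30))
    # BAC example from the text: 5 Mb of reads over a 162,752-bp clone.
    print(round(fold_depth(5e6, 162_752), 1))
```

Under these assumptions, the 5 Mb of reads used for the 162,752-bp BAC works out to roughly 30-fold depth, in line with the figure given in the text.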
We generated 278 million paired 30–35-bp purity filtered reads and aligned them to the human genome reference sequence. We carried out separate analyses of the data using two alignment algorithms: ELAND (see above) or MAQ (Mapping and Assembly with Qualities). Both algorithms place each read pair where it best matches the reference and assign a confidence score to the alignment. In cases where a read has two or more equally likely positions (that is, in an exact repeat), MAQ randomly assigns the read pair to one position and assigns a zero alignment quality score (these reads are excluded from SNP analysis). ELAND rejects all non-unique alignments, which are mostly in recently inserted retrotransposons (see Supplementary Fig. 5). MAQ therefore provides an opportunity to assess the properties of a data set aligned to the entire reference, whereas ELAND effectively excludes ambiguities from the short read alignment before further analysis.

We obtained comprehensive coverage of the X chromosome from both analyses. With MAQ, 204 million reads aligned to 99.94% of the X chromosome at an average depth of 43×. With ELAND, 192 million reads covered 91% of the reference sequence, showing what can be covered by unique best alignments. These results were obtained after excluding reads aligning to non-X sequence (impurities of flow sorting) and apparently duplicated read pairs (Supplementary Table 2). We reasoned that these duplicates (~10% of the total) arose during initial sample amplification.

The sampling of sequence fragments from the X chromosome is close to random. This is evident from the distribution of mapped read depth in the MAQ alignment in regions where the reference is unique (Fig. 2a): the variance of this distribution is only 2.26 times that of a Poisson distribution (the theoretical minimum). Half of this excess variance can be accounted for by a dependence on G+C content. However, the average mapped read depth only falls below 10× in regions with G+C content less than 4% or greater than 76%, comprising in total just 1% of unique chromosome sequence and 3% of coding sequence (Fig. 2b).

We identified 92,485 candidate SNPs in the X chromosome using ELAND (Supplementary Fig. 6). Most calls (85%) match previous entries in the public database dbSNP. Heterozygosity (π) in this data set is 4.3 × 10⁻⁴ (that is, one substitution per 2.3 kb), close to a previously published X chromosome estimate (4.7 × 10⁻⁴). Using MAQ we obtained 104,567 SNPs, most of which were common to the results of the ELAND analysis. The differences between the two sets of SNP calls are largely the consequence of different properties of the alignments as described earlier. For example, most of the SNPs found only by the MAQ-based analysis were at positions of low or zero sequence depth in the ELAND alignment (Supplementary Fig. 6c).

We assessed accuracy and completeness of SNP calling by comparison to genotypes obtained for this individual using the Illumina HumanHap550 BeadChip (HM550). The sequence data covered >99.8% of the 13,604 genotyped positions and we found excellent agreement between sequence-based SNP calls and genotyping data (99.52% or 99.99% using ELAND or MAQ, respectively; Supplementary Table 3). There was complete concordance of all homozygous calls and a low level of ‘under-calling’ from the sequence data (denoted as ‘GT>Seq’ in Table 1) at a small number of the heterozygous sites, caused by inadequate sampling of one of the two alleles. The depth of input sequence influences the coverage and accuracy of SNP calling. We found that reducing the read depth to 15× still gives 97% coverage of genotype positions and only 1.27% of the heterozygous sites are under-called. We observed no other types of disagreement at any input depth (Supplementary Fig. 7).

We detected structural variants (defined as any variant other than a single base substitution) as follows. We found 9,747 short insertions/deletions (‘short indels’; defined here as less than the length of the read) by performing a gapped alignment of individual reads (Supplementary Fig. 8). We identified larger indels based on read depth and/or anomalous read pair spacing, similar to previous approaches (refs 13–15). We detected 115 indels in total, 77 of which were visible from anomalous read-pair spacing (see Supplementary Tables 4 and 5). We developed Resembl, an extension to the Ensembl browser, to view all variants (Supplementary Fig. 9). Inversions can be detected when the orientation of one read in a pair is reversed (for example, see Supplementary Fig. 10). In general, inversions occur as the result of non-allelic homologous recombination, and are therefore flanked by repetitive sequence that can compromise alignments. We found partial evidence for other inversion events, but characterization of inversions from short read data is complex because of the repeats and requires further development.

Figure 1 | Preparation of samples. a, DNA fragments are generated, for example, by random shearing and joined to a pair of oligonucleotides in a forked adaptor configuration. The ligated products are amplified using two oligonucleotide primers, resulting in double-stranded blunt-ended material with a different adaptor sequence on either end. b, Formation of clonal single-molecule array. DNA fragments prepared as in a are denatured and single strands are annealed to complementary oligonucleotides on the flow-cell surface (hatched). A new strand (dotted) is copied from the original strand in an extension reaction that is primed from the 3′ end of the surface-bound oligonucleotide; the original strand is then removed by denaturation. The adaptor sequence at the 3′ end of each copied strand is annealed to a new surface-bound complementary oligonucleotide, forming a bridge and generating a new site for synthesis of a second strand (dotted). Multiple cycles of annealing, extension and denaturation in isothermal conditions result in growth of clusters, each ~1 μm in physical diameter. This follows the basic method outlined in ref. 33. c, The DNA in each cluster is linearized by cleavage within one adaptor sequence (gap marked by an asterisk) and denatured, generating single-stranded template for sequencing by synthesis to obtain a sequence read (read 1; the sequencing product is dotted). To perform paired-read sequencing, the products of read 1 are removed by denaturation, the template is used to generate a bridge, the second strand is re-synthesized (shown dotted), and the opposite strand is then cleaved (gap marked by an asterisk) to provide the template for the second read (read 2). d, Long-range paired-end sample preparation. To sequence the ends of a long (for example, >1 kb) DNA fragment, the ends of each fragment are tagged by incorporation of biotinylated (B) nucleotide and then circularized, forming a junction between the two ends. Circularized DNA is randomly fragmented and the biotinylated junction fragments are recovered and used as starting material in the standard sample preparation procedure illustrated in a. The orientation of the sequence reads relative to the DNA fragment is shown (magenta arrows). When aligned to the reference sequence, these reads are oriented with their 5′ ends towards each other (in contrast to the short insert paired reads produced as shown in a–c). See Supplementary Fig. 17a for examples of both. Turquoise and blue lines represent oligonucleotides and red lines represent genomic DNA. All surface-bound oligonucleotides are attached to the flow cell by their 5′ ends. Dotted lines indicate newly synthesized strands during cluster formation or sequencing. (See Supplementary Methods for details.)

Figure 2 | X chromosome data. a, Distribution of mapped read depth in the X chromosome data set (NA07340), sampled at every 50th position along the chromosome and displayed as a histogram (‘All’). An equivalent analysis of mapped read depth for the unique subset of these positions is also shown (‘Unique only’). The solid line represents a Poisson distribution with the same mean. b, Distribution of X chromosome uniquely mapped reads as a function of G+C content. Note that the x axis is per cent G+C content and is scaled by percentile of unique sequence. The solid line is average mapped depth of unique sequence; the grey region is the central 80% of the data (10th to 90th centiles); the dashed lines are 10th and 90th centiles of a Poisson distribution with the same mean as the data. Axes: a, mapped depth (fold) against frequency (Mb); b, percentile of unique sequence ordered by G+C content against mapped depth.

Sequencing and analysis of a whole human genome

Our X chromosome study enabled us to develop an integrated set of methods for rapid sequencing and analysis of whole human genomes. We sequenced the genome of a male Yoruba from Ibadan, Nigeria (YRI, sample NA18507). This sample was originally collected for the HapMap project (refs 17, 18) through a process of community engagement and informed consent and has also been studied in other projects (refs 20, 21). We were therefore able to compare our results with publicly available data from the same sample. We constructed two libraries: one of short inserts (~200 bp) with similar properties to the previous X chromosome library and one from long fragments (~2 kb) to provide longer-range read-pair information (see Supplementary Fig. 11 for size distributions).

We generated 135 Gb of sequence (~4 billion paired 35-base reads; see Supplementary Table 6) over a period of 8 weeks (December 2007 to January 2008) on six GA1 instruments averaging 3.3 Gb per production run (see Supplementary Table 1 for example). The approximate consumables cost (based on full list price of reagents) was $250,000. We aligned 97% of the reads using MAQ and found that 99.9% of the human reference (NCBI build 36.1) was covered with one or more reads at an average of 40.6-fold depth. Using ELAND, we aligned 91% of the reads over 93% of the reference sequence at sufficient depth to call a strong consensus (>three Q30 bases). The distribution of mapped read depth was close to random, with slight over-dispersion as seen for the X chromosome data. We observed comprehensive representation across a wide range of G+C content, dropping only at the very extreme ends, but with a different pattern of distribution compared to the X chromosome (see Supplementary Fig. 12).

We identified ~4 million SNPs, with 74% matching previous entries in dbSNP (Fig. 3). We found excellent agreement of our SNP calls with genotyping results: sequence-based SNP calls covered almost all of the 552,710 loci of HM550, with >99.5% concordance of sequencing versus genotyping calls (Table 1 and Supplementary Table 7a). The few disagreements were mostly under-calls of heterozygous positions (GT>Seq) in areas of low sequence depth, providing us with a false-negative rate of <0.35% from the ELAND analysis (see Table 1). The other disagreements (0.09% of all genotypes) included errors in genotyping plus apparent tri-allelic SNPs (Supplementary Table 7a). The main cause of genotype error (0.05% of all genotypes) is the existence of a second ‘hidden’ SNP close to the assayed locus that disrupts the genotyping assay, leading to loss of one allele and an erroneous homozygous genotype (Supplementary Figs 13 and 14).

To examine the accuracy of SNP calling in more detail, we compared our sequence-based SNP calls with 3.7 million genotypes (HM-All) generated for this sample during the HapMap project (Table 1 and Supplementary Table 7b) and found excellent concordance between the data sets. Disagreements included sequence-based under-calls of heterozygous positions in regions of low read depth. The slightly higher level of other disagreements (0.76%) seen in this analysis compared to that of the HM550 data (0.09%) is in line with the higher level of underlying genotype error rate of 0.7% for the HapMap data. To refine this analysis further, we generated a set of 530,750 very high confidence reference genotypes comprising concordant calls in both the HM550 and HM-All genotype data sets. Comparing the results of the MAQ analysis to this high confidence set (see Table 1), we found 130 heterozygote under-calls GT>Seq (that is, a false-negative rate of 0.025%). There were also 130 heterozygote over-calls Seq>GT, but most of these are probably genotype errors as 82 have a nearby ‘hidden’ SNP and 3 have a nearby indel. A further 41 are tri-allelic loci, leaving at most 4 potential wrong calls by sequencing (that is, false-positive rate of 4 per 529,589 positions).

Finally we selected a subset of novel SNP calls from the sequence data and tested them by genotyping. We found 96.1% agreement between sequence and genotype calls (Supplementary Table 8). However, the 47 disagreements included 10 correct sequencing calls (genotyping under-calls owing to hidden SNPs) and 7 sequencing under-calls. On this basis, therefore, the false-positive discovery rate for the one million novel SNPs is 2.5% (30 out of 1,206). For the entire data set of four million SNPs detected in this analysis, the false-positive and -negative rates both average <1%.

Table 1 | Comparison of SNP calls made from sequence versus genotype data for the human genome (NA18507) and X chromosome (NA07340)

Alignment            ELAND      ELAND      ELAND      MAQ        MAQ        MAQ        MAQ        MAQ
Sample               X          Human      Human      X          Human      Human      Human      Human
SNP panel            HM550      HM550      HM-All     HM550      HM550      HM-All     Combined   Combined
SNPs in panel        13,604     552,710    3,699,592  13,604     552,710    3,699,592  530,750    530,750
Unit                 %          %          %          %          %          %          %          n
Covered by sequence  99.77      99.60      99.24      99.91      99.74      99.29      99.78      529,589
Concordant calls     99.52      99.57      98.80      99.99      99.90      99.12      99.94      529,285
All disagreements    0.48       0.43       1.20       0.01       0.10       0.88       0.06       304
GT>Seq               0.48       0.35       0.46       0.01       0.03       0.15       0.02       130
Seq>GT               0          0.05       0.52       0          0.05       0.54       0.02       130
Other discordances   0          0.03       0.22       0          0.02       0.20       0.01       44

SNP panels referred to are HM550 (Illumina Infinium HumanHap550 BeadChip) and HM-All (complete data from phase 1 and phase 2 of the International HapMap Project). ‘Combined’ is a set of concordant genotypes from both sets (HM550 and HM-All; see text). GT>Seq denotes a heterozygous genotyping SNP call where there is a homozygous sequencing SNP call (one of the two alleles); Seq>GT denotes the converse (that is, a heterozygous sequencing SNP call where there is a homozygous genotyping call). Other discordances are differences in the two SNP calls that cannot be accounted for by one allele being missing from one call.

Figure 3 | SNPs identified in the human genome sequence of NA18507. a, Number of SNPs detected by class and percentage in dbSNP (release 128). Results from ELAND and MAQ alignments are reported separately. b, Analysis of SNPs detected in each analysis reveals extensive overlap. The percentage of NA18507 SNP calls that match previous entries in dbSNP is lower than that of our X chromosome study (see Supplementary Fig. 6). We expect this because individual NA07340 (from the X chromosome study) was also previously used for discovery and submission of SNPs to dbSNP during the HapMap project, in contrast to NA18507.
Panel a values: homozygote calls, ELAND 1,417,320 SNPs (90.1% in dbSNP) and MAQ 1,503,420 SNPs (90.8%); heterozygote calls, ELAND 2,411,022 SNPs (63.9%) and MAQ 2,635,776 SNPs (63.8%); all calls, ELAND 3,828,342 SNPs (73.6%) and MAQ 4,139,196 SNPs (73.6%).
Panel b values: 215,844 SNPs found by ELAND only (42.4% in dbSNP), 3,612,498 found by both analyses (75.5% in dbSNP) and 526,698 found by MAQ only (60.8% in dbSNP).

This genome from a Yoruba individual contains significantly more polymorphism than a genome of European descent. The autosomal heterozygosity (π) of NA18507 is 9.94 × 10⁻⁴ (1 SNP per 1,006 bp), higher than previous values for Caucasians (7.6 × 10⁻⁴, ref. 12). Heterozygosity in the pseudoautosomal region 1 (PAR1) is substantially higher (1.92 × 10⁻³) than the autosomal value. PAR1 (2.7 Mb) at the tip of the short arm of chromosomes X and Y undergoes obligatory recombination in male meiosis, which is equivalent to 20× the autosome average. This illustrates a clear correlation between recombination and nucleotide diversity. By contrast, the 0.33-Mb PAR2 region has a much lower recombination rate than PAR1; we observed that heterozygosity in PAR2 is identical to that of the autosomes in NA18507. Heterozygosity in coding regions is lower (0.54 × 10⁻³) than the total autosome average, consistent with the model that some coding changes are deleterious and are lost as the result of natural selection. Nevertheless, the 26,140 coding SNPs (Supplementary Fig. 15) include 5,361 non-conservative amino acid substitutions plus 153 premature termination codons (Supplementary Table 9), many of which are expected to affect protein function.

We performed a genome-wide survey of structural variation in this individual and found excellent correlation with variants that had been reported in previous studies, as well as detecting many new variants. We found 0.4 million short indels (1–16 bp; Supplementary Fig. 16), most of which are length polymorphisms in homopolymeric tracts of A or T. Half of these events are corroborated by entries in dbSNP, and 95 of 100 examined were present in amplicons sequenced from this individual in ENCODE regions, confirming the high specificity of this method of short indel detection. For larger structural variants (detected by anomalously spaced paired ends) we found that some were detected by both long and short insert data sets (Supplementary Fig. 17a), but most were unique to one or other data set. We observed two reasons for this: first, small events (<400 bp) are within the normal size variance of the long insert data; second, nearby repetitive structures can prevent unique alignment of read pairs (see Supplementary Fig. 17b, c). In some cases, the high resolution of the short insert data permits detection of additional complexity in a structural rearrangement that is not revealed by the long insert data. For example, where the long insert data indicate a 1.3-kb deletion in NA18507 relative to the reference, the short insert data reveal an inversion accompanied by deletions at both breakpoints (Fig. 4). We carried out de novo assembly of reads in this region and constructed a single contig that defines the exact structure of the rearrangement (data not shown).

We discovered 5,704 structural variants ranging from 50 bp to >35 kb where there is sequence absent from the genome of NA18507 compared to the reference genome. We observed a steadily decreasing number of events of this type with increasing size, except for two peaks (Supplementary Fig. 18). Most of the events represented by the large peak at 300–350 bp contain a sequence of the AluY family. This is consistent with insertion of short interspersed nuclear elements (SINEs) that are present in the reference genome but missing from the genome of NA18507. Similarly, the second, smaller peak at 6–7 kb is the consequence of insertion of the long interspersed nuclear element (LINE) L1 Homo sapiens (L1Hs) in many cases. We found good correspondence between our results and the data of ref.

Figure 4 | Homozygous complex rearrangement detected by anomalous paired reads. The rearrangement involves an inversion of 369 bp (blue–turquoise bar in the schematic diagram) flanked by deletions (red bars) of 1,206 and 164 bp, respectively, at the left- and right-hand breakpoints. a, Summary tracks in the Resembl browser, denoting scale, simulated alignability of reads to reference (blue plot), actual aligned depth of coverage by NA18507 reads (green plot), density of anomalous reads indicating structural variants (red plot; peaks denote ‘hotspots’) and density of singleton reads (pink plot). b, Anomalous long-insert read pairs (orange lines denote DNA fragment; blocks at either end denote each read); the data indicate loss of ~1.3 kb in NA18507 relative to the reference. c, Anomalous short-insert pairs of two types (red and pink) indicate an inverted sequence flanked by two deletions. d, Normal short-insert read-pair alignments (each green line denotes the extent of the reference that is covered by the short fragment, including the two reads). e, The schematic diagram depicts the arrangement of normal and anomalous read pairs relative to the rearrangement. Top line, structure of NA18507; second line, structure of reference sequence. Green bars denote sequence that is collinear in the reference and NA18507 genomes. The turquoise–blue bar illustrates the inverted segment. Red bars indicate the sequences present in the reference but absent in NA18507. Arrows denote orientation of reads when aligned to the reference. The display in a–d is a composite of screen shots of the same window, overlapped for display purposes.

Supplementary Fig. 20. The ‘singleton’ reads on either side of the event, which have partners that do not align to the reference, form part of a de novo assembly that precisely defines the novel sequence and breakpoint (Supplementary Fig. 21).

Effect of sequence depth on coverage and accuracy

We investigated the impact of varying input read depth (and hence cost) on SNP calling using chromosome 2 as a model. SNP discovery increases with increasing depth: essentially all homozygous positions are detected at 15×, whereas heterozygous positions accumulate more gradually to 33× (Fig. 5a). This effect is influenced by the stringency of the SNP caller. To call each allele in this analysis we required the equivalent of two high-quality Q30 bases (as opposed to three used in full depth analyses). Homozygotes could be detected at read depth of 2× or higher, whereas heterozygote detection required at least double this depth for sampling of both alleles. Missing calls (not covered by sequence) and discordances between sequence-based SNP calls and genotype loci (mostly under-calls of heterozygotes due to low depth) progressively reduced with increasing depth (Fig. 5b). We observed very few other types of discordance at any depth; many of these are genotyping errors as described above.

Concluding remarks

Reversible terminator chemistry is a defining feature of this sequencing approach, enabling each cycle to be driven to completion while minimizing misincorporation. The result is a system that generates accurate data at very high throughput and low cost. We determined an accurate whole human genome sequence in 8 weeks to an average depth of ~40×. We built a consensus sequence, optimized methods for analysis, assessed accuracy and characterized the genetic variation of this individual in detail. We assessed accuracy relative to genotype data over the entire fraction of the human sequence where SNP calling was possible (>90%). We established very low false-positive and -negative rates for the ~four million SNPs detected (<1% over-calls and under-calls). This compares favourably with previous individual genome analyses which reported a 24% under-calling of heterozygous positions (refs 2, 7).

Paired reads were very powerful in all areas of the analysis. They
23, which reported 148 deletions of,100 kb in this provided very accurate read alignment and thus improved the accu- individual on the basis of abnormal fosmid paired-end spacing. We racy and coverage of consensus sequence and SNP calling. They were found supporting evidence for 111 of these events. We detected a essential for developing our short indel caller, and for detecting larger further 2,345 indels in the range 60–160 bp which are sequences structural variants. Our short-insert paired-read data set introduced present in the genome of NA18507 and absent from the reference a new level of resolution in structural variation detection, revealing genome (Supplementary Fig. 19). One example is shown in thousands of variants in a size range not characterized previously. In Macmillan Publishers Limited. All rights reserved © 2008 ARTICLES NATURE |Vol 456 |6 November 2008 filtered read data are available for download from the Short Read Archive at a SNP calls versus sequence depth NCBI or from the European Short Read Archive (ERA) at the EBI. 350,000 Analysis software. Image analysis software and the ELAND aligner are provided 300,000 as part of the Genome Analyzer analysis software. SNP and structural variant detectors will be available as future upgrades of the analysis pipeline. The 250,000 Resembl extension to Ensembl is available on request. The MAQ (Mapping and Assembly with Qualities) aligner is freely available for download from 200,000 http://maq.sourceforge.net. Data access. Sequence data for NA18507 are freely available from the NCBI short 150,000 read archive, accession SRA000271 (ftp://ftp.ncbi.nih.gov/pub/TraceDB/ ShortRead/SRA000271). X chromosome data are freely available from ERA, 100,000 accession ERA000035. Links to Resembl displays for chromosome X and human 50,000 data, plus information on other available data, are provided at http://www. illumina.com/HumanGenome. See Supplementary Methods for a detailed Methods section. 
0 5 10 15 20 25 30 35 Average input read depth (fold) Received 24 June; accepted 2 October 2008. b Missing or discordant data versus sequence depth 1. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004). 2. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007). 3. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380 (2005). 10 4. Shendure, J. et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728–1732 (2005). 5. Harris, T. D. et al. Single-molecule DNA sequencing of a viral genome. Science 320, 106–109 (2008). 6. Lundquist, P. M. et al. Parallel confocal detection of single molecules in real time. 4 Opt. Lett. 33, 1026–1028 (2008). 7. Wheeler, D. A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008). 8. Milton, J. et al. Modified nucleotides. World Intellectual Property Organization 0 5 10 15 20 25 30 35 WO/2004/018497 (2004). Average input read depth (fold) 9. Smith, G. P. et al. Modified polymerases for improved incorporation of nucleotide analogues. World Intellectual Property Organization WO/2005/024010 Figure 5 | Effect of sequence depth on coverage and accuracy of human (2005). genome sequencing. ELAND alignments were used for this analysis. 10. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. a, Accumulation of sequence-based SNP calls, including all SNPs (squares), Error probabilities. Genome Res. 8, 186–194 (1998). heterozygous SNPs (triangles) and homozygous SNPs (circles) with 11. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res.. doi:10.1101/gr.078212.108 increasing input read depth. b, Decrease in genotype positions not covered (25 September 2008). 
by sequence (squares), heterozygote under-calls in sequence data relative to 12. The International SNP Map Working Group. A map of human genome sequence genotype data (triangles) and discordant SNP calls compared to genotypes variation containing 1.42 million single nucleotide polymorphisms. Nature 409, (circles) with increasing input read depth. Vertical dotted lines indicate 928–933 (2001). various input read depths (103,153,303 haploid genome). 13. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nature Genet. 37, 727–732 (2005). 14. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the some cases we determined the exact sequence of structural variants by human genome. Science 318, 420–426 (2007). de novo assembly from the same paired-read data set. Interpreting 15. Campbell, P. J. et al. Identification of somatically acquired rearrangements in events that are embedded in repetitive sequence tracts will require cancer using genome-wide massively parallel paired-end sequencing. Nature Genet. 40, 722–729 (2008). further work. 16. Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Res. 30, Massively parallel sequencing technology makes it feasible to con- 38–41 (2002). sider whole human genome sequencing as a clinical tool in the near 17. The International HapMap Consortium. A haplotype map of the human genome. future. Characterizing multiple individual genomes will enable us to Nature 437, 1299–1320 (2005). 18. The International HapMap Consortium. A second generation human haplotype unravel the complexities of human variation in cancer and other map of over 3.1 million SNPs. Nature 449, 851–861 (2007). diseases and will pave the way for the use of personal genome 19. The International HapMap Consortium. The International HapMap Project. sequences in medicine and healthcare. Accuracy of personal genetic Nature 426, 789–796 (2003). 
information from sequence will be critical for life-changing decisions. 20. The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, In addition to the large-scale genomic projects exemplified by the 799–816 (2007). 15,24–26 present study and others , the system described here is being 21. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, used to explore biological phenomena in unprecedented detail, 444–454 (2006). including transcriptional activity, mechanisms of gene regulation 22. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding 27–32 regions of human genes. Nature Genet. 22, 231–238 (1999). and epigenetic modification of DNA and chromatin . In the 23. Kidd, J. M. et al. Mapping and sequencing of structural variation from eight human future, DNA sequencing will be the central tool for unravelling genomes. Nature 453, 56–64 (2008). how genetic information is used in living processes. 24. Hillier, L. W. et al. Whole-genome sequencing and variant discovery in C. elegans. Nature Methods 5, 183–188 (2008). METHODS SUMMARY 25. Hodges, E. et al. Genome-wide in situ exon capture for selective resequencing. Nature Genet. 39, 1522–1527 (2007). DNA and sequencing. DNA samples (NA07340 and NA18507) and cell line 26. Porreca, G. J. et al. Multiplex amplification of large sets of human exons. Nature (GM07340) were obtained from Coriell Repositories. DNA samples were geno- Methods 4, 931–936 (2007). typed on the HM550 array and the results compared to publicly available data to 27. Barski, A. et al. High-resolution profiling of histone methylations in the human confirm their identity before use. Methods for DNA manipulation, including genome. Cell 129, 823–837 (2007). sample preparation, formation of single-molecule arrays, cluster growth and 28. Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. 
Genome-wide mapping of in sequencing were all developed during this study and formed the basis for the vivo protein-DNA interactions. Science 316, 1497–1502 (2007). standard protocols now available from Illumina, Inc. All sequencing was per- 29. Mikkelsen, T. S. et al. Genome-wide maps of chromatin state in pluripotent and formed on Illumina GA1s equipped with a one-megapixel camera. All purity lineage-committed cells. Nature 448, 553–560 (2007). Macmillan Publishers Limited. All rights reserved © 2008 Number of SNP calls Missing or discordant calls (%) NATURE |Vol 456 |6 November 2008 ARTICLES 1 1 3 1 1 30. Boyle, A. P. et al. High-resolution mapping and characterization of open chromatin Asha Boodhun , Joe S. Brennan , John A. Bridgham , Rob C. Brown , Andrew A. Brown , 3 1 3 4 across the genome. Cell 132, 311–322 (2008). Dale H. Buermann , Abass A. Bundu , James C. Burrows , Nigel P. Carter , Nestor 3 1 3 1 1 31. Lister, R. et al. Highly integrated single-base resolution maps of the epigenome in Castillo , Maria Chiara E. Catenazzi , Simon Chang , R. Neil Cooley , Natasha R. Crake , 1 1 1 Arabidopsis. Cell 133, 523–536 (2008). Olubunmi O. Dada , Konstantinos D. Diakoumakos , Belen Dominguez-Fernandez , 1,2 1 3 3 32. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and David J. Earnshaw , Ugonna C. Egbujor , David W. Elmore , Sergey S. Etchin , Mark R. 3 5 1 1 2 quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5, 585–587 Ewan , Milan Fedurco , Louise J. Fraser , Karin V. Fuentes Fajardo , W. Scott Furey , 3 6 1 3 (2008). David George , Kimberley J. Gietzen , Colin P. Goddard , George S. Golda , Philip A. 3 1 3 7 1 33. Fedurco, M., Romieu, A., Williams, S., Lawrence, I. & Turcatti, G. BTA, a novel Granieri , David E. Green , David L. Gustafson , Nancy F. Hansen , Kevin Harnish , 3 1 1 3 reagent for DNA attachment on glass and efficient generation of solid-phase Christian D. Haudenschild , Narinder I. Heyer , Matthew M. 
Hims , Johnny T. Ho , 1 1 3 3 amplified DNA colonies. Nucleic Acids Res. 34, e22 (2006). Adrian M. Horgan , Katya Hoschler , Steve Hurwitz , Denis V. Ivanov , Maria Q. 3 1 1 1 Johnson , Terena James , T. A. Huw Jones , Gyoung-Dong Kang , Tzvetana H. Supplementary Information is linked to the online version of the paper at 3 1 3 3 1 Kerelska , Alan D. Kersey , Irina Khrebtukova , Alex P. Kindwall , Zoya Kingsbury , www.nature.com/nature. 1 1 6 6 Paula I. Kokko-Gonzales , Anil Kumar , Marc A. Laurent , Cynthia T. Lawley , Sarah E. 1 3 3 1 3 3 Acknowledgements The authors acknowledge the advice of A. Williamson, T. Rink, Lee , Xavier Lee , Arnold K. Liao , Jennifer A. Loch , Mitch Lok , Shujun Luo , Radhika 1 3 1 3 1 S. Benkovic, J. Berriman, J. Todd, R. Waterston, S. Eletr, W. Jack, M. Cooper, M. Mammen , John W. Martin , Patrick G. McCauley , Paul McNitt , Parul Mehta , 3 3 1 4 4 T. Brown, C. Reece and R. Cook during this work; E. Margulies for assistance with Keith W. Moon , Joe W. Mullens , Taksina Newington , Zemin Ning , Bee Ling Ng , 1 3 1,2 1 data analysis; M. Shumway for assistance with data submission; and the Sonia M. Novo , Michael J. O’Neill , Mark A. Osborne , Andrew Osnowski , Omead 3,6 3 1 1 3 contributions of the administrative and support staff at all the institutions. This Ostadan , Lambros L. Paraschos , Lea Pickering , Andrew C. Pike , Alger C. Pike ,D. 3 3 3 3 1 research was supported in part by The Wellcome Trust (to H.L., A.Sc., K.W., N.P.C, Chris Pinkard , Daniel P. Pliskin , Joe Podhasky , Victor J. Quijano , Come Raczy , Vicki 1 1 1 1 1 B.N.L., J.R., M.E.H. and R.D.), the Biotechnology and Biological Sciences Research H. Rae , Stephen R. Rawlings , Ana Chiva Rodriguez , Phyllida M. Roe , John Rogers , 1 1 5 3 Council (BBSRC) (to S.B. and D.K.), the BBSRC Applied Genomics LINK Programme Maria C. Rogert Bacigalupo , Nikolai Romanov , Anthony Romieu , Rithy K. Roth , 1 1 3 1 (to A.Sp. and C.L.B.) 
and the Intramural Research Program of the National Human Natalie J. Rourke , Silke T. Ruediger , Eli Rusman , Raquel M. Sanches-Kuiper , Martin 1 3 1 3 3 Genome Research Institute, National Institutes of Health (to N.F.H. and J.C.M.). R. Schenker , Josefina M. Seoane , Richard J. Shaw , Mitch K. Shiver , Steven W. Short , 3 3 1 1 S. Balasubramanian and D. Klenerman are inventors and founders of Solexa Ltd. Ning L. Sizto , Johannes P. Sluis , Melanie A. Smith , Jean Ernest Sohna Sohna , Eric J. 3 1 1 1 1 Spence , Kim Stevens , Neil Sutton , Lukasz Szajkowski , Carolyn L. Tregidgo , Gerardo Author Information Reprints and permissions information is available at 5 1 3 3 Turcatti , Stephanie vandeVondele , Yuli Verhovsky , Selene M. Virk , Suzanne www.nature.com/reprints. This paper is distributed under the terms of the 3 3 1 1 3 Wakelin , Gregory C. Walcott , Jingwen Wang , Graham J. Worsley , Juying Yan , Ling Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely 3 3 4 7 4 Yau , Mike Zuerlein , Jane Rogers {, James C. Mullikin , Matthew E. Hurles , Nick J. available to all readers at www.nature.com/nature. The authors declare competing 1 3 3 3 2 McCooke {, John S. West , Frank L. Oaks , Peter L. Lundberg , David Klenerman , financial interests: details accompany the full-text HTML version of the paper at 4 1 Richard Durbin & Anthony J. Smith www.nature.com/nature. Correspondence and requests for materials should be addressed to D.R.B. ([email protected]). Illumina Cambridge Ltd. (Formerly Solexa Ltd), Chesterford Research Park, Little Chesterford, Nr Saffron Walden, Essex CB10 1XL, UK. Department of Chemistry, University of Cambridge, The University Chemical Laboratory, Lensfield Road, 1 2 1 Cambridge CB2 1EW, UK. Illumina Hayward (Formerly Solexa Inc.), 23851 Industrial David R. Bentley , Shankar Balasubramanian , Harold P. Swerdlow {, Geoffrey P. 1 1 1 1 1 1,2 Boulevard, Hayward, California 94343, USA. 
The Wellcome Trust Sanger Institute, Smith , John Milton {, Clive G. Brown {, Kevin P. Hall , Dirk J. Evers , Colin L. Barnes , 1 1 1 1 Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. Manteia Helen R. Bignell , Jonathan M. Boutell , Jason Bryant , Richard J. Carter , R. Keira 1 1 1 3 1 Predictive Medicine S.A. Zone Industrielle, Coinsins, CH-1267, Switzerland. Illumina Cheetham , Anthony J. Cox , Darren J. Ellis , Michael R. Flatbush , Niall A. Gormley , 1 1 3 3 4 Inc., Corporate Headquarters, 9883 Towne Centre Drive, San Diego, California 92121, Sean J. Humphray , Leslie J. Irving , Mirian S. Karbelashvili , Scott M. Kirk , Heng Li , 1,2 1 1 1 1 USA. National Human Genome Research Institute, National Institutes of Health, 41 Xiaohai Liu , Klaus S. Maisinger , Lisa J. Murray , Bojan Obradovic , Tobias Ost , 1 3 1 3 Center Drive, MSC 2132, 9000 Rockville Pike, Bethesda, Maryland 20892-2132, USA. Michael L. Parkinson , Mark R. Pratt , Isabelle M. J. Rasolonjatovo , Mark T. Reed , 1 1 1 1 {Present addresses: The Wellcome Trust Sanger Institute, Wellcome Trust Genome Roberto Rigatti , Chiara Rodighiero , Mark T. Ross , Andrea Sabot , Subramanian V. 3 4 3 1 1 Campus, Hinxton, Cambridge CB10 1SA, UK (H.P.S.); Oxford Nanopore Technologies, Sankar , Aylwyn Scally , Gary P. Schroth , Mark E. Smith , Vincent P. Smith , 1 1 3 3 Anastassia Spiridou , Peta E. Torrance , Svilen S. Tzonev , Eric H. Vermaas , Klaudia Begbroke Science Park, Sandy Lane, Kidlington OX5 1PF, UK (J.M., C.G.B.); BBSRC 4 1 3 3 1 Genome Analysis Centre, John Innes Centre, Norwich Research Park, Colney, Norwich Walter , Xiaolin Wu , Lu Zhang , Mohammed D. Alam , Carole Anastasi , Ify C. 1 1 1 3 1 Aniebo , David M. D. Bailey , Iain R. Bancarz , Saibal Banerjee , Selena G. Barbour , NR4 7UH, UK (J.R.); Pronota, NV, VIB Bio-Incubator, Technologiepark 4, B-9052 3 1 1 1 1 Primo A. Baybayan , Vincent A. Benoit , Kevin F. Benson , Claire Bevis , Phillip J. 
Black , Zwijnaarde/Ghent, Belgium (N.J.M.). Macmillan Publishers Limited. All rights reserved © 2008
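The depth analysis in the article notes that heterozygote detection requires at least double the depth needed for homozygotes, because both alleles must be sampled. As a rough illustration of that reasoning only, the Python sketch below computes the chance that both alleles at a heterozygous site are each seen at least a minimum number of times. It rests on assumptions that are not taken from the article: the site is covered by exactly `depth` error-free reads, and each read samples either allele with probability 0.5. The function name `prob_both_alleles` is hypothetical and is not part of the authors' SNP caller.

```python
from math import comb


def prob_both_alleles(depth: int, min_obs: int = 2) -> float:
    """Probability that each of two alleles is observed at least `min_obs`
    times among `depth` reads, assuming every read independently samples
    either allele with probability 0.5 (a simplifying assumption)."""
    if depth < 2 * min_obs:
        # Too few reads to see both alleles `min_obs` times each.
        return 0.0
    # Count the outcomes in which allele A appears i times and allele B
    # appears depth - i times, with both counts at least `min_obs`.
    favourable = sum(comb(depth, i) for i in range(min_obs, depth - min_obs + 1))
    return favourable / 2 ** depth


if __name__ == "__main__":
    # Illustrative depths only; real coverage varies from site to site.
    for depth in (4, 10, 15, 30):
        print(depth, prob_both_alleles(depth))
```

Under these toy assumptions the probability is zero below twice the per-allele requirement and rises towards one as depth grows, which mirrors the qualitative trend described for heterozygous calls. Real data would also need to account for variable local coverage, base quality and alignment effects.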

Journal

Nature – Springer Journals

Published: Nov 6, 2008
