Abstract The application of advanced sequencing technologies and the rapid growth of various sequence data have led to increasing interest in DNA sequence assembly. However, repeats and polymorphism occur frequently in genomes, and each of these has different impacts on assembly. Further, many new applications for sequencing, such as metagenomics regarding multiple species, have emerged in recent years. These not only give rise to higher complexity but also prevent efficient short-read assembly. This article reviews the theoretical foundations that underlie current mapping-based assembly and de novo-based assembly, and highlights the key issues and feasible solutions that need to be considered. It focuses on how individual processes, such as optimal k-mer determination and error correction in assembly, rely on intelligent strategies or high-performance computation. We also survey primary algorithms/software and offer a discussion on the emerging challenges in assembly. DNA assembly, de Bruijn graph, fragment, k-mer, repeat Introduction Sequence assembly has become an important yet long-standing task for various applications in bioinformatics because of the emergence of next-generation sequencing (NGS) and third-generation sequencing (TGS) technologies and the increasing demand for sequencing services. Various sequencing platforms have been widely applied in the past few years, such as the Solexa Genome Analyzer, the HiSeq family of sequencers and the Genome Analyzer IIx (personal genome analyzer) from Illumina [1], the HeliScope from Helicos BioSciences, the GS FLX from Roche 454 Life Sciences (a platform discontinued in mid-2016) and the SOLiD System from Applied Biosystems. Sanger reads range from 500 to 1000 bp. In contrast, NGS reads can be shorter (the Genome Analyzer I produced 36 bp reads) [2]. These shorter but far more numerous reads contain fewer overlapping sections between reads than longer reads and thus demand high coverage by nature when these sequences are assembled. This gives rise to higher complexity issues with respect to the large volume of NGS data and prevents the assembly of chromosome-size sequences. This survey reviews the recent advances and innovations in sequence assembly, including the principles of the algorithms as well as the application of the tools. Sequence assembly [3] is the process of dividing large pieces of DNA into small pieces, reading the small fragments and reconstituting the original DNA by merging the information from the various short fragments. In other words, it aims to combine the individual reads/fragments into longer contigs (contiguous sequences), eventually recovering the full contiguous sequence. Since the late 1970s when the first attempt was made [4], sequence assembly has experienced rapid and intensive growth. A number of innovations in assembly have been introduced in the past few years, including data structures for efficient memory use [5], de novo assembly strategies for metagenomics [6–8], transcriptome data [9, 10], complementary information from multiple sequencing technologies [11] and paired-end or mate-pair data [12]. In general, they can be divided into three categories in terms of graph construction, namely, overlap-layout-consensus (OLC) approaches [13, 14] using an overlap graph, the de Bruijn graph (DBG) methods [15–17] using a k-mer graph, and greedy algorithms [18, 19] using an overlap or a k-mer graph [2]. In particular, DBG-based methods have been explored comprehensively with the advent of massively parallel NGS data.
A number of biological phenomena can be elucidated or uncovered by the reconstruction of original sequences. At the genome level, genomic structural variants (SVs) and single-nucleotide polymorphisms (SNPs) can be easily identified. It has been widely reported that many SVs and SNPs are associated with genetic disorders [20, 21]. For example, copy number variations, a major category of SV, in the IRGM gene are associated with Crohn’s disease [22], and even one SNP can be deleterious [23]. At the transcriptome level, assembly assists in recovering representatives for a collection of alternative variants that share k-mers because of alternative splicing, gene duplication or allelic variation. Understanding alternative splicing plays a central role in studying complex proteome generation as well as the function of quantitative gene control [24]. At the metagenome level, the assembly of metagenomic sequences has been proven to be promising in unveiling the functions of gene products as well as the interactions between hosts and pathogens [25]. Last but not least, assembly has also shown its potential in studying biological evolution. The aforementioned targets can be achieved if we are able to accurately assemble whole sequences. Nevertheless, the raw sequences face several major confounding factors, including repeats, sequencing errors and uneven coverage. There are three major challenging issues with assembly processes. Existing assembly algorithms mainly depend on the assumption that highly similar sequences come from the same region of a genome. The similarity between DNA sequences is then used as the principle of a variety of graph-based assembly algorithms. However, this assumption no longer holds in the presence of abundant perfect or imperfect repeats. For example, Pierce et al. [26] reported that the human genome contains around 13% short interspersed elements and 21% long interspersed elements. These repeats cause chimeric assemblies. That is, reads from different regions are aligned and merged together, forming new contigs that markedly stray from the ground truth of the sample of interest. This makes transcriptome and metagenome assembly even harder and more complicated, as many identical fragments may be contained in different transcripts and genomes. In addition, repeats and sequencing errors, including insertions, deletions and substitutions, also aggravate the complexity of assembly. Without sequencing errors, imperfect repeats can be distinguished, resulting in better assembly performance. However, sequencing errors blur the line between polymorphisms and errors. Existing algorithms must leverage the trade-off between the tolerance of errors and the accuracy of assembly. The uneven coverage of reads is another confounding factor. Repeat regions, even perfect repeat regions, can be resolved, to a certain extent, if read coverage is evenly distributed. Nevertheless, in reality, the coverage is affected by many factors, such as GC content divergence, instrumental technology preference, enzymatic amplification bias, etc. [27]. This coverage problem becomes worse in transcriptome and metagenome assembly because of the complicated transcripts and genomes they contain. Although sequence assembly is not straightforward, sometimes even intractable, it is evolving rapidly along with the development of new sequencing technologies.
The Sanger sequencing technology, invented in 1977, has been widely applied in the sequencing market over the past almost four decades. It generates reads of a couple of hundred to a thousand base pairs in length, but has low throughput. Based on these kinds of data, a number of greedy or OLC-based assemblers have been developed [28, 29]. ASSEMBLER [22] was the first computer program aiming at the assembly of DNA sequences. SEQAID [30] introduced the OLC paradigm, and Kececioglu and Myers [31] proposed the overlap graph to represent sequence assembly. In addition, DBG-based methods were also available by then [17]. The NGS technology, called the Genome Sequencer 20 (GS 20), was introduced in 2005 by Roche (www.454.com). This gave rise to a generation of massively parallel short reads. However, the short read size is a problem for OLC approaches. Thus, the OLC-based method was no longer suitable for dealing with such data, especially for large genomes, because of high time complexity. Since then, the DBG-based methods have become the mainstream, as they showed good performance in handling large data sets. By following a similar strategy, Euler-SR [32], ALLPATHs [33], Velvet [34], ABySS [35] and SOAPdenovo [36, 37] were introduced, and the DBG-based method was explored in depth in [34]. Later on, longer reads and higher throughput (compared with Sanger sequencing) became practical, such as 400–600 bp reads from 454 [38]. Relying on the DBG-based approach results in a loss of contiguity information, and it also introduces new misassemblies while assembling these kinds of data. Instead, researchers attempt to solve this problem by resorting to OLC-like approaches. Nowadays, even longer reads can also be produced. For example, Pacific Biosciences can produce reads of up to 20 kbp [3], and Oxford Nanopore technology can also generate reads of up to 150 kb. Typically, these data have a high error rate (around 15%). Thus, the DBG-based approach may not be a smart choice. As described above, the previous DBG-based and OLC-based approaches are limited to short and accurate reads from second-generation sequencing (SGS). The recent release of TGS is able to carry out single-molecule sequencing reactions in real time without DNA amplification, and the generated long reads can simplify sequence assembly by spanning repeat regions. However, the obtained long reads have relatively high error rates. Many studies have been devoted to assembling single-molecule real-time (SMRT) sequencing data into accurate and contiguous genomes. This article aims to review the state-of-the-art DNA assembly algorithms from the aspects of theory and application, which can assist researchers by providing sufficient background to enable them to make the right choices when applying assembly techniques. The advantages, limitations and applicability of typical/prevalent algorithms and tools are presented and discussed. Further, some guidelines are provided for assembling DNA fragments, and potential challenging issues of sequence assembly in the future are highlighted. The organization of this article is as follows. In the ‘Challenges of Assembly’ section, the challenges of sequence assembly are presented. The ‘Assembly Algorithms’ section discusses two primary DNA assembly algorithms, namely, the OLC and DBG algorithms, and a comparison of these algorithms is also provided in that section.
Several strategies to deal with common assembly issues are provided in the ‘Future Development of Assembly Methods’ section, and applications of the assembly are presented in the ‘Application of DNA Assembly’ section. Finally, the future development of DNA fragment assembly is discussed in the ‘Conclusion’ section. Motivations and importance DNA sequences have been proved to contain valuable information in controlling genetic traits. Deep analysis of DNA sequences is helpful in understanding the complicated regulatory networks of organisms as well as the functional roles of the molecules involved. Therefore, the study of DNA sequences has become one of the most important research fields in bioinformatics. With the development of high-throughput sequencing technologies, a variety of approaches has been applied for characterizing genomes, epigenomes and transcriptomes [3]. Nowadays, sequencing 1 billion base pairs takes only days, or even hours, in any laboratory equipped with an SGS machine, but only minutes in large-scale sequencing centers [39]. Producing reference genomes is beneficial to many biological data analyses. If a reference genome exists, such as for humans or mice, it is usually viewed as the primary benchmark for data analysis. By mapping the derived DNA sequence fragments to the reference genome, the alignments help to determine whether the biological experiment, DNA preparation and sequencing experiment are successful, and whether the correct sample is sequenced. As for organisms without trusted reference genomes, assembly is always essential to yield new genomes. The already assembled reference genomes of other species are critical, as they provide data sets for measuring assembly accuracy between newly developed algorithms and the existing solutions. Further, this helps to ensure the quality of the assembly results for species without a reference genome. In many cases, however, a reference genome is unavailable, and users have to assemble the sequencing reads to obtain the whole genome. Although sequencing technology has experienced rapid development in the past decade, the fragment length is still not long enough to explore the functions of DNA. These small DNA fragments can be merged into longer DNA sequences, by which to reconstruct the original sequence, even the whole genome. With the rapid growth of sequencing data, there have been considerable efforts to develop various computational algorithms or tools for short-read assembly, such as Velvet [34], ABySS [35] and Arachne [40]. They use a directed graph to merge reads. The nodes of the graph represent DNA fragments, and the edges indicate that two fragments can be integrated into a longer fragment. They are mainly developed from two kinds of algorithms, DBGs and OLC graphs. The OLC algorithm uses an intuitive way to assemble DNA fragments. This method can easily reveal whether two fragments can be integrated into a long fragment. For example, the work of Arachne and Celera [41] is based on the OLC algorithm, whereas the DBG is used to solve the superstring problem [42]. Thus, it is suitable for assembling sequencing reads. Velvet and ABySS are two typical methods based on the DBG. They have been successfully applied in a number of cases of merging DNA fragments. Velvet is able to leverage short reads in combination with read pairs to yield useful assemblies. In particular, each node N is attached to a twin node, which represents the reverse series of reverse-complement k-mers.
This ensures that overlaps between reads from opposite strands are taken into account. For example, some researchers have applied Velvet to assemble the coral genome, which helps researchers better understand environmental change [42]. Arachne has been used to assemble the genome sequence of the rat, supporting comprehensive analyses of the rat genome [44]. It performs overlap detection by producing a sorted table of each k-mer together with its source and position in the read, and finds all read pairs that share at least one overlapping k-mer, by which to align the reads and detect and correct sequencing errors. In addition to single-genome assembly, metagenomic studies have attracted increasing attention in recent years by using either shotgun or polymerase chain reaction (PCR)-directed sequencing to obtain largely unbiased samples of all genes from all the members of the sampled communities. Metagenomic analysis assists in the exploration of the interplay between host and commensal microbial metabolic activity, and appears to be promising in disclosing its role in maintaining human health. Further, the profiling of human microbial and viral flora at different taxonomic levels, in combination with functional profiling, may lead to new therapeutic solutions. The primary challenge in metagenomic assembly stems from the uneven representation of the member species and from organisms that have similar genomes but vary owing to mobile genetic elements and point mutations in most environments. This property makes it impossible to conduct a single assembly of each organism occurring in a sample. Challenges of assembly A number of algorithms have been developed for assembling DNA fragments. They usually obey one rule. That is, if the last d nucleotides of a DNA fragment X match the first d nucleotides of another DNA fragment Y, the two DNA fragments can be merged into a new DNA sequence. However, it is difficult to determine the best value of d. A large d prevents us from merging DNA segments with a small overlap, while a small value may cause misassembly because of repeat regions. A reasonable way to partially address this issue is to apply different values of d to assemble the DNA fragments and then choose the best result. However, the determination of the optimal value is not trivial. In fact, it is complicated and time-consuming, especially when sequencing errors are present. To get most of the sequence of a PCR product, we often need to obtain the complementary strand of a DNA sequence as well as the forward sequence file. These need to be assembled into one sequence using the overlap of the two sequences. In general, assembly software can automatically compare the forward and reverse-complement orientations to construct the best possible contigs. Thus, users can assemble DNA sequences regardless of orientation. Many tools, such as the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), provide free conversion to the reverse complement; for example, the sample sequence GAACTC has the reverse complement GAGTTC. Repeat regions occurring in the genomes of multiple species are prevalent and are another barrier that blocks DNA assembly from short sequences. For example, human chromosomes contain about 50% repeat regions, plant genomes contain particularly high proportions of repeats (e.g. >80% of the maize genome), and the short-lived fish Nothobranchius furzeri has 21% of its genome occupied by repeats [45].
These repeat regions may cause ambiguity and lead to erroneous assembly while merging DNA fragments. Thus, the underlying assumption, namely that highly similar DNA fragments originate from the same local region within a genome, most likely does not hold. Similarly, in transcriptome or metagenomic samples, nearly identical sequences may come from different transcripts or genomes [46]. More details on the repeat regions can be found in the ‘Metagenome Study’ section. Sequencing errors complicate analysis and make assembly even harder. Reads usually need to be aligned to each other for genome assembly, or to a reference genome to detect mutations. As mentioned in [47], the primary errors are substitution errors, at a 0.5–2.5% error rate. The DNA fragments obtained may contain some erroneous nucleotides because of unexpected nucleotide substitutions, deletions and insertions produced by the sequencer. These sequences containing errors may yield incorrect results during genome assembly. Errors can alter DBGs. Additional nodes need to be generated to fully incorporate the erroneous short reads, which form branches going off the original graph. For instance, incorrect overlaps may create ambiguous paths or improperly connect to remote regions of the genome. This problem becomes even more complicated when polymorphism is present. Fortunately, as NGS data have high coverage, the problem can be alleviated to some extent by correcting errors during assembly. The assembly of multiple species (such as metagenome assembly) is more complex than that of a single species. The repeated region may come from either the same species or other species (of the same genus or family). Therefore, it has a higher probability of generating ambiguous results than the assembly of a single species. The nonuniform distribution of read coverage is another challenging issue, as reads coming from a low-coverage region may be deemed erroneous reads. As a result, these correct reads may be incorrectly edited, or no contig can be built from them. Another emerging issue is the increasing amount of large data. For example, the data sets from studies of the gut metagenome [48] and the assembly of the cow rumen metagenome [49] amount to several hundred gigabytes. Traditional algorithms cannot cope with such large volumes of data because of their huge memory consumption. Instead, cloud computing platform-based algorithms may be the most promising tools to deal with the analysis of big data in biology. Cloud computing is a type of Internet-based computing that depends heavily on network transmission speed; if the transmission speed is low, processing resources and data sharing will be affected. Assembly algorithms Preliminaries The algorithms for sequence assembly mainly belong to one of three categories, i.e. the DBG-based algorithm, the OLC-based algorithm and the greedy algorithm. The first two directed graph-based algorithms have been extensively studied because of their ability to handle large data sets. The terms used in the directed graph are as follows: k-mer: a k-mer is a consecutive sequence of k nucleotides. For example, Figure 1A is a 9-mer sequence. Read: a read is a short DNA fragment produced by the sequencing machine, for example, n1, n2, n3 and n4 in Figure 1B. Directed edge: a directed edge connects two nodes ni and nj such that the last k−1 nucleotides of ni overlap with the first k−1 nucleotides of nj, for example the edges between n1 and n2, n2 and n3, and n3 and n4.
Contig: a contig represents a long DNA sequence that is formed by merging multiple reads. For example, the contig in Figure 1D can be formed by merging the four reads n1, n2, n3 and n4 in Figure 1B. Directed links are established from n1, n2, n3 and n4 to construct the graph in Figure 1C, as they share a common subsequence with each other, such as AACTC between n1 and n2, and ACTCC between n2 and n3. Figure 1D presents the construction of the Eulerian path of the graph in Figure 1C to yield a longer sequence. Figure 1. DBG algorithm. (A) A read. (B) The read is divided into four successive overlapping subsequences, and each subsequence is shifted by one nucleotide. These subsequences are also called 6-mers. (C) Constructing the DBG. The 5-mer suffix of n1 is equal to the 5-mer prefix of n2. Therefore, these nodes are connected by a directed link. (D) Constructing the Eulerian path of the DBG. The subsequence n1 is the attribute value of the directed link, the starting node is the 5-mer prefix of n1 and the later node is the 5-mer suffix of n1. The three substructures below are essential for a directed graph. Branch: a branch arises where a node has more than one outgoing path. For example, node n4 leads to two branches in Figure 2, i.e. n4n5n6 and n4n9n10. Tip: a tip is a short path that branches off a node with two or more outgoing edges and terminates in a dead end. For example, in Figure 2, n4n5n6 is a tip. Bubble: a bubble is a structure with two or more branches both starting and ending at the same nodes. For example, if x is an additional node extending Figure 2, the paths n4, n5, n6, x and n4, n9, n10, x construct a bubble structure. Figure 2. An example of a DBG. In this figure, we assume that there are two reads s1 and s2. The read s1 contains six k-mers: n1, n2, n3, n4, n5 and n6. The read s2 contains n7, n8, n3, n4, n9 and n10. These two reads share two k-mers, n3 and n4. When traversing this graph, two sequences, n7→n8→n3→n4→n5→n6 and n1→n2→n3→n4→n9→n10, can be obtained. Algorithm types Mapping-based method Mapping-based assembly maps short sequence reads generated by a sequencing machine, such as Illumina paired-end sequencing reads, to a reference genome [50–53] by either searching for systematic differences with the reference or finding haplotypes that are well supported by the data [54]. Mapping approaches have some strengths, including high sensitivity [55], access to most of the human genome, including repetitive regions [56], and low resource requirements while processing reads [51]. However, this method has several weaknesses.
It often focuses on a single-nucleotide variant (SNV) type [55], resulting in errors around indels and multiple nucleotide variants [57]. Further, this approach has limitations in highly divergent regions, in which misalignments lead to spurious support for SNVs and other variants [58]. On the other hand, it relies on the nucleotide-level accuracy of read alignments [59, 60]. Although realignment around known indels can perhaps improve this process, this is not only expensive but also fails to improve alignments around other variant types, as it uses a dictionary of known polymorphic indels. SOAP is a commonly used short-read alignment tool because of its advantage in speed. It solves the problem of how to align large data like the human genome (nearly 3.2 Gb) using a machine with a small memory (4 GB). SOAP was initially designed for single-end reads; subsequently, many algorithms have been developed to deal with paired-end reads. The Burrows–Wheeler Alignment tool (BWA) [51], developed by Li et al., is a fast method that first indexes the genome and subsequently identifies the best position of each read in terms of this index. This algorithm is widely applied because of its high accuracy, and is viewed as the first choice for SNP analysis. Bowtie [61] and Bowtie2 [62] are two fast, memory-efficient short-read aligners, the former being able to align short DNA sequences to the human genome at a speed of over 25 million 35 bp reads per hour. This requires building an index with bowtie-build in advance. Bowtie has been used in analyzing chromatin immunoprecipitation sequencing and RNA sequencing (RNA-seq) data. NovoAlign is commercial software (http://www.novocraft.com/products/novoalign/) designed to map short reads from the Illumina, Ion Torrent and 454 NGS platforms onto a reference genome. Subread [63] is recently published software by Liao et al. A simple multi-seed strategy, called seed-and-vote, is proposed for mapping reads to a reference genome. It scales up efficiently for longer reads. The mapping-based method focuses only on read mapping and SNP and SV detection. However, there is still more to do to go from mapped reads to an assembly, as demonstrated by the AMOS Comparative Assembler [64]. The availability of sequenced genomes of two or more closely related species offers the possibility of building a comparative genome assembly algorithm. OLC-based algorithm The OLC algorithm comprises three steps to assemble the DNA fragments, i.e. overlap, layout and consensus. The disadvantages of the DBG include its memory requirements and the loss of information. The most appealing aspect of the Simpson–Durbin algorithm is that it does not rely on DBGs, and instead uses a different graph construction approach called the ‘string graph’. Figure 3 shows the primary procedures of the OLC algorithm. Figure 3. OLC algorithm. (A) Three DNA fragments. (B) Construction of an overlap graph: each DNA fragment is viewed as a node of the graph. The suffix of n1 (TGC) is equal to the prefix of n2 (TGC). Thus, the two nodes can be connected by a directed edge, and the weight of this edge is 3. (C) The layout step: n1→n2, n2→n3 and n1→n3. According to the transitive rule, the link between n1 and n3 is deleted, so we can then obtain the simplified graph. (D) Walking through the graph and searching for a Hamiltonian path (n1, n2 and n3), which has the maximum weight, from which we can obtain a contig (GAATGCTTACC).
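As a rough illustration of the overlap and layout steps summarized in Figure 3 and detailed in the step descriptions that follow, the sketch below builds a small overlap graph by exact suffix–prefix matching and then applies the transitive rule. It is a minimal sketch, not any published assembler's implementation: the read sequences, read names and minimum-overlap threshold are assumptions chosen so that the surviving path spells the contig GAATGCTTACC, and real assemblers use alignment (e.g. Smith–Waterman) rather than exact matching.

```python
# Minimal sketch of the overlap (suffix-prefix matching) and layout
# (transitive reduction) steps of an OLC assembler. Reads and the
# minimum-overlap threshold are illustrative assumptions.

def suffix_prefix_overlap(a, b, min_len=2):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def build_overlap_graph(reads):
    """Overlap step: edge (u, v) -> overlap length for every ordered read pair."""
    edges = {}
    for u, ru in reads.items():
        for v, rv in reads.items():
            if u != v:
                olap = suffix_prefix_overlap(ru, rv)
                if olap:
                    edges[(u, v)] = olap
    return edges

def transitive_reduction(edges):
    """Layout step: drop edge u->w whenever some v provides u->v and v->w."""
    nodes = {n for edge in edges for n in edge}
    reduced = dict(edges)
    for u in nodes:
        for v in nodes:
            for w in nodes:
                if (u, v) in edges and (v, w) in edges and (u, w) in reduced:
                    del reduced[(u, w)]
    return reduced

reads = {"n1": "GAATGCTT", "n2": "TGCTTACC", "n3": "TTACC"}  # hypothetical reads
overlaps = build_overlap_graph(reads)   # {('n1','n2'): 5, ('n1','n3'): 2, ('n2','n3'): 5}
print(transitive_reduction(overlaps))   # {('n1','n2'): 5, ('n2','n3'): 5}
```

Merging the reads along the surviving path n1→n2→n3 yields GAATGCTTACC, matching the contig in Figure 3D.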
Overlap. Given two nodes ni and nj, if the last m nucleotides of node ni equal the first m nucleotides of node nj, then these two nodes can be linked by a directed edge. The directed graph generated at this step is called the overlap graph. This step conducts all the pair-wise read alignments. Smith–Waterman [65] is one of the most widely used algorithms for pair-wise read alignment. By using this algorithm, the overlap region of two DNA fragments can be found, and thus the overlaps between all pairs. Layout. This step removes the redundant edges in the graph and simplifies the overlap graph. The pruning is based on a transitive rule. That is, if a directed graph has three nodes n1′, n2′ and n3′, and n1′ → n2′, n2′ → n3′ and n1′ → n3′, then the edge between nodes n1′ and n3′ can be deleted. For example, in Figure 3B, n1→n3 is pruned. Consensus. This step searches for the most likely paths in the graph. For example, in Figure 3C, the Hamiltonian path is n1→n2→n3. A contig can be generated as shown in Figure 3D. Although the graph has been simplified in the layout step, finding the Hamiltonian path with maximum weight is not trivial. Thus, many improved algorithms have been developed to identify the path that may have the maximum weight, such as the greedy and partition algorithms [3, 66]. However, these heuristic methods tend to produce low-quality results. The memory required by the string graph method [67] is lower than that of DBGs. Thus, it can potentially handle large-size genomes like mammals at a lower cost. The string graph assembler (SGA) implements a collection of assembly algorithms based on a compressed data structure, the FM-index of Ferragina and Manzini [68, 69]. SGA is significantly more memory-efficient than traditional DBG assemblers like ABySS, Velvet and SOAPdenovo. In general, SGA needs more computation time because of the time required to build the FM-index. However, this can be alleviated by reusing one FM-index for multiple runs of SGA, such as when testing different error correction or assembly parameters. In contrast, the de Bruijn table has to be recalculated for each selected k. SGA usually has the three primary phases listed below: Error correction. This is the first stage of SGA assembly and aims to correct erroneous bases in the reads. Contig assembly. The second phase constructs an FM-index of the corrected reads and prunes the identified duplicate reads and low-quality reads using SGA filters. Then, FM-merge and SGA overlap are run to merge reads together and compute the structure of the string graph, respectively, in order to assemble contigs. Scaffolding. This starts by realigning reads to the contigs yielded in the previous step. The paired-end and/or mate-pair data are applied to build scaffolds from the contigs. CloudBrush [70] is a novel distributed genome assembler that combines the string graph with MapReduce.
It proposes an edge-adjustment algorithm to find structural defects by checking the neighboring reads of a specific read for sequencing errors and adjusting the edges of the string graph, if necessary. Fast string graph [71] is a novel method to compute the string graph using the FM-index and the Burrows–Wheeler transform. In particular, the FM-index representation of the collection of reads is used to construct the string graph without accessing the input reads. The above advances have facilitated the sequencing of a number of species and individuals, but assemblies still remain incomplete and fragmented to a large extent, mainly because the underlying sequence reads are too short. A novel approach [72] was developed by combining single-molecule, real-time sequencing technology with a string graph de novo assembly algorithm for high-quality assembly of the gorilla genome. It improves contiguity by up to three orders of magnitude compared with previously released assemblies. DBG-based algorithm The main idea of the DBG algorithm is to transform each read into overlapping subsequences by shifting one base pair each time. For example, Figure 1A is a read, and this read is transformed into four successive subsequences, namely, n1, n2, n3 and n4 in Figure 1B. These overlapping subsequences are produced by sliding one nucleotide each time. Each such subsequence is called a k-mer. For example, the subsequence n1 in Figure 1B is called a 6-mer, as its length is 6. Each k-mer is viewed as a node in the graph. If the (k−1)-mer suffix of a k-mer is the same as the (k−1)-mer prefix of another k-mer, then they are connected by an edge. Based on the k-mers of all the reads, the overlapping graph of the k-mers can be built, namely, a DBG. Reconstructing the contigs from the DBG is the same as finding the Hamiltonian path in the graph. It is well known that finding a Hamiltonian cycle is an NP (verifiable in nondeterministic polynomial time)-complete problem [73]. Nevertheless, this can be readily transformed into identifying the Eulerian path of the DBG [34, 42, 81]. For instance, Figure 1B has four k-mers. They are assigned to four directed edges in Figure 1D. The start and end nodes of each of these edges are the first five and the last five nucleotides of that edge's k-mer, respectively. From this observation, the Eulerian path in Figure 1D is the same as the Hamiltonian path in Figure 1C. However, the time complexity of searching for the Eulerian path is only O(n), in which n is the number of edges in the graph. In Figure 1D, if we remove all 6-mers in this graph, the graph becomes a 5-mer DBG. Therefore, finding a Hamiltonian path in a k-mer DBG can be transformed into finding an Eulerian path in a (k−1)-mer DBG. The determination and optimization of the k-mer size have different influences on the sequence assembly. A smaller k-mer size may increase the chance of overlap between k-mers and decrease the number of edges that the graph must store. Nevertheless, a smaller k-mer can also lead to many ambiguous vertices and block the reconstruction of the genome. Repeats longer than k may disorganize the graph and break up the contigs [74]. Thus, a larger value of k is desired, as larger k-mer sizes aid in alleviating the problem of small repeat regions to some extent. On the other hand, the larger the k, the greater the chance that a k-mer will have an error in it. Further, if the overlapping region of two reads is less than k characters, they do not have any common vertex in the graph.
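To make this construction concrete, the following sketch builds a DBG in which each k-mer becomes a directed edge from its (k−1)-mer prefix node to its (k−1)-mer suffix node, as described above. It is a minimal illustration rather than any particular assembler's implementation; the example read (chosen to be consistent with the 5-mers AACTC and ACTCC mentioned for Figure 1) and the choice of k = 6 are assumptions.

```python
# Minimal sketch of DBG construction: each k-mer of a read becomes a directed
# edge from its (k-1)-mer prefix node to its (k-1)-mer suffix node.
# The read and k below are illustrative assumptions.
from collections import defaultdict

def build_dbg(reads, k):
    """Return adjacency map: (k-1)-mer node -> list of successor (k-1)-mer nodes."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):      # a read of length l yields l-k+1 k-mers
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # prefix node -> suffix node
    return graph

read = "GAACTCCAA"                  # a hypothetical 9 bp read
for node, successors in build_dbg([read], k=6).items():
    print(node, "->", successors)   # e.g. GAACT -> ['AACTC'], AACTC -> ['ACTCC'], ...
```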
Thus, it is critical to choose an appropriate size that creates a balance or trade-off between the above effects. There is no single dominant formula or method for calculating k that takes all of these effects into account. Although it is possible to compute some bounds according to the estimated genome size and coverage, the influence of the repetitiveness of the genome, the heterozygosity rate or the read error rate is often ignored. Some approaches attempt to compare assemblies by examining different k values; however, this is time-consuming. The abundance histogram is a more intuitive and informative way to observe the distribution of k-mer abundance and decide on k. However, it can still take up to a day to create such a histogram, even for a single value of k [75]. VelvetOptimiser is a k-mer optimizing tool for the Velvet de novo sequence assembler. It searches a supplied hash value range for the optimum, estimates the expected coverage and then searches for the optimum coverage cutoff (http://bioinformatics.net.au/software.velvetoptimiser.shtml). A novel method to construct the histogram is introduced in [74]. It proposes an accurate sampling method to build an appropriate abundance histogram and a fast heuristic method to choose the optimal value of k. Greedy-based algorithms The greedy algorithms repeat a basic operation: given any read or contig, one more read or contig is added until no more operations are possible. Each operation applies the next highest-scoring overlap to make the next join. They simplify the graph by taking into account only the high-scoring edges. SSAKE [76] is a classical greedy-based assembler for short DNA reads and is well suited for SV assembly/detection. SSAKE is designed to help leverage the information from short sequence reads by assembling them into contigs. It is written in Perl and runs on Linux. It stores short reads and their frequency in a hash table and searches for extension candidates through a prefix tree. It has successfully assembled 25–300 bp reads from viral, bacterial and fungal genomes. There have been many applications derived from SSAKE, ranging from genome assembly to de novo assembly, such as VCAKE [28], which is an improvement on an early version of SSAKE and is capable of assembling short reads even in the presence of sequencing errors, and SSAKE continues to inspire new-generation assemblers like JR-Assembler [77] for de novo assembly of large genomes. JR-Assembler selects good reads as seeds according to the number of identical reads in the data. Seed extension is then conducted by a ‘jumping’ extension spanning a number of whole reads. In particular, back trimming is applied to prune low-quality nucleotides at the 3′-end of a read. After this, repeat detection is undertaken to check whether an improper extension has occurred. The above three steps are repeated until no unexamined seed remains. Finally, the scaffolding program SSPACE is incorporated to build scaffolds. SHARCGS [78], also written in Perl, is designed for the assembly of short 25- to 40-mer input fragments with deep sequence coverage at high accuracy and speed. A prefix tree is applied to find potentially useful reads. The assembly starts by selecting a seed read. This seed is then extended by looking for reads with an exact overlap, and the contig including this sequence is extended if there are no inconsistencies with other reads. Quality-value guided short-read assembler [79] uses the quality values of bases to guide the extension of contigs.
It is an algorithm improved from VCAKE and usually generates the highest genome coverage. It proposes a quality-value score to deal with errors, and is competitive in its longest contig and N50/N80 contig lengths. It is not only faster than VCAKE but also more widely applicable because of its additional error-processing capabilities. A greedy path-merging algorithm [80] is designed to order and orientate the given contigs and make them consistent with as many mate-pairs as possible. This approach was originally developed as a primary component of the compartmentalized assembly strategy developed at Celera Genomics. A contig-mate-pair graph is constructed from nodes and two kinds of edges, called contig-edges, which represent the contigs, and mate-edges, which represent the links between fragments embedded in different contigs. Each contig Bi is represented by two nodes, v and w, in the graph, which describe one end of Bi and its other end, respectively. It is a useful model for the scaffolding stage of current whole-genome assemblers. Comparison Two prevalent de novo sequence assembly algorithms have been described above. Their advantages and disadvantages as well as their applicability are explored and compared below in terms of errors, repeats and k-mers, and complexity. Further, a comparison between different sequencing platforms is also presented. Errors, including insertions, deletions and substitutions, can be handled fairly easily by the OLC algorithm. For example, the well-known Smith–Waterman alignment algorithm uses an alignment score to measure the similarity between two reads. If overlaps are identified between the suffix of a read and the prefixes of other reads in the read set, which has already been trimmed and possibly filtered, then a high score is obtained. Thus, erroneous nucleotides have a limited effect on the OLC algorithm. In contrast, it is hard to deal with erroneous nucleotides using the DBG (Figure 4). OLC graphs also contain tips, bubbles and branches caused by errors; in fact, string graph (OLC-based) and DBG assemblers usually share similar techniques for resolving tips, bubbles and branches. These bubbles may generate incorrect contigs while searching for the Eulerian path. Although many algorithms have been developed to handle these erroneous nucleotides, most of them have limitations because of uneven coverage, sequencing errors and repeats that make the graph non-Eulerian or generate many possible walks. For example, Velvet prunes the lower-coverage path of a bubble [34], on the assumption that erroneous reads are in the minority, so that low-frequency k-mers can be viewed as erroneous k-mers. However, a less frequent path may also represent a rare DNA sequence. Other methods treat the errors before constructing the DBG; they use the quality scores of the nucleotides, which are produced by the sequencer, to identify the erroneous nucleotides. Figure 4. Assembling error-containing reads using the DBG. (A) A read without any error nucleotides. (B) The same read with an error nucleotide, underlined. (C) Using these two reads to construct a DBG containing a bubble. Two contigs can be obtained from the DBG, namely, ‘ATTCGTTAAC’ and ‘ATTCATTAAC’. It is observed that the DBG does not remove the error reads. Further, the graph has four error nodes because the length of the substring is 4, and each subsequent node is shifted by one nucleotide. Therefore, the size of the bubble depends on the length of the substring.
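The bubble in Figure 4 can be reproduced directly from the two reads given in the caption. The sketch below is a minimal illustration that uses the reads and the substring length of 4 from Figure 4, with everything else assumed; it builds the DBG and reports the nodes at which the erroneous read causes the graph to diverge.

```python
# Sketch of how a single erroneous base opens a bubble in a DBG, using the two
# reads and the substring length (k = 4) given in Figure 4.
from collections import defaultdict

def dbg_edges(reads, k):
    """(k-1)-mer node -> set of successor (k-1)-mer nodes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

correct_read = "ATTCGTTAAC"   # error-free read (Figure 4A)
error_read   = "ATTCATTAAC"   # the same read with one erroneous base (Figure 4B)

graph = dbg_edges([correct_read, error_read], k=4)
for node, successors in graph.items():
    if len(successors) > 1:   # more than one outgoing edge: a branch/bubble opens here
        print(f"paths diverge at {node}: {sorted(successors)}")
```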
Repeat and k-mer The repeated sequence may come from the same or different chromosomes. For example, Figure 5B is an overlap graph constructed from the six reads in Figure 5A. The repeat region ‘CCC’ leads to several branches, such as n1→n6, n1→n4 and n1→n2. If the greedy method is applied to walk through the overlap graph, two different results can be obtained, as in Figure 5C. However, neither of these two results is the correct sequence. Thus, the greedy algorithm is not appropriate for handling repeat regions. Figure 5. The repeat region can cause bias. (A) The sequencing machine detects six reads, namely, n1, n2, n3, n4, n5 and n6, derived from the DNA sequence ‘TACCCGTCCCAACCCTT’. The DNA sequence has three repeat regions (CCC), underlined. (B) Constructing the overlap graph from the six reads. (C) We can obtain two results (R1, R2) by using the greedy algorithm, as there are two strategies for traversing the graph. From this observation, the greedy approach cannot obtain the correct sequence and fails to determine the best result. Alternatively, if these reads are assumed to come from two DNA sequences, namely, ‘TACCCAACCTT’ and ‘CCGTCCC’, two results (R1 and R2) are again generated, so it is hard to decide on the correct sequence. Therefore, the repeat regions can generate ambiguous results and lead to erroneous assembly. (D) Constructing the DBG from the six reads. The length of the k-mer is 3, and the repeat region (CCC) is represented by the node dn3. The DBG is capable of treating the repeat regions by virtue of k-mers. For example, Figure 5D is a DBG constructed from the six reads in Figure 5A. The repeat region is represented by dn3. By traversing the DBG in Figure 5D, two different contigs are generated, i.e. ‘TACCCAACCCGTCCCTT’ and ‘TACCCGTCCCAACCCTT’. However, only one of them is the right sequence. Therefore, the DBG may produce erroneous contigs while analyzing repeat regions. These erroneous contigs should be removed.
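The ambiguity introduced by the CCC repeat in Figure 5 can also be seen by counting the branching nodes of the DBG at different k values. The sketch below builds the graph directly from the full sequence rather than from the six reads, purely for simplicity; the sequence and k = 3 come from Figure 5, while the larger k values are assumptions chosen to illustrate that a longer k-mer removes the ambiguous node.

```python
# Sketch: counting ambiguous (branching) nodes in the DBG of the Figure 5
# sequence for several k values. For simplicity the graph is built from the
# whole sequence instead of the six reads; the larger k values are assumed.
from collections import defaultdict

def branching_nodes(sequence, k):
    graph = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])
    return [node for node, succ in graph.items() if len(succ) > 1]

sequence = "TACCCGTCCCAACCCTT"     # contains the repeat CCC three times
for k in (3, 4, 6):
    print(k, branching_nodes(sequence, k))
# k=3 -> ['CC']  (the repeat collapses into one ambiguous node)
# k=6 -> []      (a k longer than the repeat and its context removes the ambiguity)
```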
If the DNA sequence has a high repeat rate, such as the potato, which has about 60% repeated DNA regions, then the DBG uses a small number of nodes to represent these repeat regions. Apart from pruning erroneous contigs, the length of the k-mer is another issue that needs to be considered. Ensuring that the k-mer has an appropriate length is essential to enable the repeat regions to be correctly discovered. If the length of the k-mer is too long, reads that overlap by fewer than k nucleotides cannot be merged, which prevents us from assembling the reads into a longer sequence. The shorter the length of the k-mer, the higher the probability that two similar subsequences will overlap, and the more repeated nodes the graph may have. Some repeated nodes may be false repeats, which are useless for merging reads. Further, these meaningless overlapping regions may sometimes generate false contigs. For example, in Figure 6, two DBGs are built by setting different sizes for the k-mer. In Figure 6C, a short length of the k-mer is used, and two different contigs are yielded. One of the two contigs is an error sequence. However, if a longer k-mer is specified, only the correct contig is returned. The false repeat regions, including nodes a and b in Figure 6C, lead to incorrect assembly. Figure 6. A short k-mer leads to a false candidate sequence. (A) A read ATCGCATACTC. (B) The read represented by the DBG with substrings of length 3. (C) The read represented by the DBG with substrings of length 2. In this graph, the DBG with substrings of length 2 has two false repeat nodes, ‘AT’ and ‘TC’, so these nodes have several branches. These branches lead to two different ways of traversing the graph. The first path is a → b → c → d → e → a → f → g → h → b, which is result 1 in (C), and the second path is a → f → g → h → b → c → d → e → a → b. Only one of these paths yields the correct contig. It is observed that the selection of a suitable length of the substring is critical for assembly. Unfortunately, it is inherently difficult to determine an appropriate length of the k-mer. Many users select the length by relying on values drawn from their experience. Although many strategies have been proposed to estimate the optimal length of the substring, such as Velvet, these programs cannot achieve a performance as good as expected. Therefore, a feasible solution is to determine the best length by trying a collection of values and selecting the best one. Some of the repeat regions cannot be tackled by the DBG. If the repeat regions come from different chromosomes, they generate branches, and these branches may terminate at different nodes. This may lead to many ambiguous assemblies. For example, 10 k-mers construct the DBG in Figure 2. These k-mers are derived from two reads.
As these reads have the same repeated k-mers and originate from different chromosomes, the DBG has a branch. According to this DBG, four different contigs can be created. Only two of these contigs are correct contigs. Hence, these repeat regions lead to erroneous assembly. Complexity comparison between the OLC and DBG approaches Suppose a DNA sequencing data set has m reads and the length of each read is l. To build a connection graph for the OLC algorithm, each read is taken as a node in the graph and all possible overlaps between any pair of reads are viewed as edges. Thus, the number of nucleotides in the OLC graph is m·l. As the computer requires a byte of memory to store a nucleotide, the space consumption of the OLC algorithm's nodes is m·l. Further, a node may connect with many other nodes. In the worst case, a node can link to every node, so the maximum number of edges is m². An edge requires a byte of memory to store the number of overlapping nucleotides and a byte of memory to store the pointer. Thus, the space consumption of the edges is 2·m². In the worst case, therefore, the space consumption of the OLC algorithm is 2·m² + m·l. The OLC algorithm requires all pair-wise read alignments; aligning one pair of reads gives rise to a time complexity of O(l²) [81], and it takes time to compare the similarity of two nodes. The DNA alignment method shows a polynomial increase in time complexity and space complexity [81, 82], and the time consumption of aligning all of the reads becomes O(l²·m·(m−1)/2). The DBG-based assembly algorithm is built by using all of the reads' k-mers. Most software models the DBG and the OLC graph as bidirected graphs, because a node in an OLC graph or a DBG contains the original read (k-mer) and its reverse-complement read (k-mer). In this article, we separate the original read (k-mer) and its reverse-complement read (k-mer) into different nodes. A read is divided into l − k + 1 k-mers, and the number of nucleotides in each k-mer is k. A node also contains some information, such as offset and orientation. It is assumed that the space consumption of this information is C. Each node has at most four edges that connect it to other nodes. As a result, the space consumption of the DBG is O(m·(l − k + 1)·(C + k + 4)). The upper bound on the number of edges in the graph is 4·m·(l − k + 1). Thus, the time complexity of finding the Eulerian path (i.e. assembly) is also O(4·m·(l − k + 1)). If the DNA sequence has a high repeat rate, such as the potato, which has about 60% repeated DNA regions, the DBG uses a small number of nodes to represent these repeat regions, whereas the OLC graph needs many edges to link these repeated reads. Comparing the time consumption of the two algorithms, the OLC algorithm usually has higher time consumption than the de Bruijn algorithm in most cases. Comparison of NGS platforms The above assembly approaches are designed to handle different types of data from various sequencing machines. Table 1 presents a comparison between different NGS platforms, including accuracy, time, read length and cost. Assemblers that show good performance with the short reads from NGS sequencers (e.g. Illumina and SOLiD) might not work as well with reads from emerging technologies such as the TGS from PacBio, which uses SMRT sequencing that harnesses the natural process of DNA replication and relies on two major innovations: zero-mode waveguides and phospholinked nucleotides.
The former allows light to illuminate only the bottom of a well in which a DNA polymerase/template complex is immobilized. The latter allows the immobilized complex to be observed as the DNA polymerase generates a completely natural DNA strand. There are also separate applications for assembling transcriptomes from RNA-seq data and for assembling metagenome data sets. It is necessary to have a consensus on how to choose the optimal combination of sequencing technologies and assembly algorithms that best fits the job at hand. The Assemblathon, organized by UC Davis, is one of three evaluation projects for genome assembly and was created with the goal of defining standard methodologies for DNA assembly. The other two projects are dnGASP, organized by the National Center for Genome Analysis in Spain, and the Genome Assembly Gold-standard Evaluations (GAGE) project, led by Steven Salzberg of the University of Maryland, to investigate different genome assemblers.
Table 1. Comparison of NGS platforms
Sequencer: HiSeq 3000 | MiSeq | 5500xl | Ion Torrent Proton | 454 GS FLX
Data generation: Bridge amplification | Bridge amplification | Emulsion PCR | Emulsion PCR | Emulsion PCR
Read length: 150 bp | Up to 2×300 bp | 75 bp | 200–400 bp | Up to 1000 bp
Accuracy: 98% | 3–5% | 99% | 99% | 99%
Reads per run: 2.5 billion | 15 billion | 0.7 billion | 60–80 million | 70 000
Time per run: 1–3.5 days | 1.6 days | 8 days | 2–4 h | 10 h
Throughput per run: 125–1500 GB | 0.5–15 GB | 120 GB | 10 GB | 0.7 GB
Instrument cost: $740k | $125k | $595k | $150k | $500k
Alignment software: HiSeq Analysis Software v0.9 | MiSeq Reporter | N/A | Torrent Suite | GS de novo assembler, GS reference assembler
Hardware: CentOS, ≥48 GB RAM, ≥1 TB, root permission and 75 GB space in the default directory | 64-bit Windows OS, ≥32 GB RAM, ≥1 TB, 2.8 GHz and .NET 4.5 | N/A | Ubuntu 10.04 OS, dual 8-core 2.9 GHz, 128 GB RAM, 2 NVIDIA GPUs, 27 TB | N/A
Assemblathon 1 (17 teams) started in 2010, and the results were published in late 2011 [83]. There were 43 submitted assemblies from 21 teams involved in the Assemblathon 2 contest (held in 2013, http://arxiv.org/abs/1301.5406), using real data from three vertebrate species, namely, a bird, a fish and a snake. A combination of optical map data, Fosmid sequences and several statistical methods was applied to evaluate these assemblies. It was observed that the participants suffered from having too many species and too much sequence data. Thus, it is necessary to focus on one species for the purpose of fair comparison in the future Assemblathon 3. dnGASP (http://www.bsc.es/) uses a set of artificial chromosomes derived from the human genome, three from the chicken genome and others representing the fruit fly, nematode, brewer’s yeast and two species of mustard plant [84], to compare different teams’ assembly performance. In particular, some challenging chromosomes are applied to test assembler performance on various repetitive structures, divergent alleles and other difficult content. Users are allowed to run the reference data set through individual assemblers and post the results back on the server. GAGE [85] uses four whole-genome shotgun sequence data sets for the competition, including Staphylococcus aureus, Rhodobacter sphaeroides, human (chromosome 14) and Bombus impatiens (a species of bee). Only Illumina reads are considered. Unlike the simulated data in Assemblathon and dnGASP, all GAGE data sets are from recent sequencing projects. Table 2 presents comparative data for selected NGS genome assembly tools to clearly show the differences between the DNA assembly methods, including the evaluation of assembly quality, total time and the software/hardware infrastructure for assembly.
According to this comparison, the short reads from NGS not only produce more fragments but also introduce errors. Repetitive sequences, polymorphisms, missing data and mistakes all affect contig construction. Further, in the absence of a reference genome, assembly quality is often evaluated using simulated data from computer-generated genomes and by the number and size of scaffolds and contigs, using metrics such as N50/N90 (a short sketch of how these metrics are computed is given after Table 2).

Table 2. De novo assembly algorithms of second-generation projects

Assembler | Genome size | Sequencer | Average read (bp) | Read coverage | N50 (scaffolds and contigs) | Time | Software | Hardware
ABySS, 2009 | Human, 3.0 GB | GA | 35–46 | 45× | 2.8 Mb and 1.5 kb | 40 h | C++, Linux, MPI protocol | 168-core 2.66 GHz CPU
SOAPdenovo, 2010 | Panda, 2.4 GB | GA, GA | 45, 10 | 25×, 4× | 1.22 Mb and 3.6 kb | 87 h | Linux | 8-core AMD 2.3 GHz CPUs, 512 GB
Allpaths-LG, 2011 | Human, 3.1 GB | GA | 100 | 45× | 11.5 Mb and 24 kb | 3.5 weeks | Linux/Unix | >16 GB
CABOG, 2008 | Turkey, 1.1 GB | GA, 454 | 195, 74 | 1×, 13× | 1.44 Mb and 2.8 kb | 5 days | Linux | >32 GB
SGA | C. elegans, 33.8 Mb | GA | 100 | 20× | 26.3 kb and 16.8 kb | 41 h | Linux | >4.5 GB
Newbler | Flow cytometry, 877 Mb | 454 | 200 | 20× | 6.85 kb and 1.48 kb | N/A | C++, Linux | >4 GB
Phusion | Mouse, 2.6 GB | Sanger | 2–200 | 7.5× | 7.0 kb and 2.0 kb | 36 | Linux | >90 GB
Price | Snake (Boa constrictor), 1.6 GB | GA | 100 | 125× | 5.8 kb and 5.8 kb | N/A | Linux | 4 AMD, 1.4 GHz, 256 GB RAM
Curtain | Snake (Boa constrictor), 1.6 GB | GA | 100 | 125× | 30 kb and 11.5 kb | N/A | Linux | N/A
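Because N50 values are quoted throughout Table 2, a minimal sketch of how N50 (and N90) can be computed from a list of contig or scaffold lengths may be useful. The function name and the example lengths below are illustrative assumptions, not taken from any of the cited tools.

```python
def nx_metric(lengths, fraction=0.5):
    """Return the Nx length: the largest L such that contigs of length >= L
    together cover at least `fraction` of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0

contig_lengths = [150_000, 90_000, 40_000, 25_000, 10_000, 5_000]
print("N50:", nx_metric(contig_lengths, 0.5))  # length at which 50% of bases are covered
print("N90:", nx_metric(contig_lengths, 0.9))  # same idea with a 90% threshold
```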
Future development of assembly methods

It is observed that current assembly algorithms have limitations in assembling sequences. Therefore, a number of new algorithms have been developed to handle the continuing growth of sequence data. This section offers a brief discussion of the future development of DNA assembly.

Error correction

There have been considerable efforts to recover a long, contiguous genomic sequence from short DNA reads, from single-end and paired-end reads to mate-pairs. However, error correction for disambiguating repeat regions remains challenging. The aforementioned substitution, insertion and deletion errors have become a critical issue for the de novo assembly of NGS data. The error rate and error types vary across sequencing platforms. For example, the primary error with Illumina and Solexa is substitution, varying from 0.5 to 2.5% [47], in contrast to insertion and deletion errors of up to 30% with the traditional Sanger sequencing method [86]. The typical error rates for existing machines range from 0.1 to 10% [87]. Error detection and correction can be achieved by aligning reads to each other (de novo assembly, such as construction and traversal of a DBG) or to a closely related, well-known sequence (reference-based assembly) in the case of high coverage. In contrast, if coverage is low or a complete set of reference sequences is lacking, it will be difficult to distinguish sequencing errors from true biological variation. In general, errors can be reduced by quality filtering, such as removing or truncating reads with low-quality base calls, merging overlapping paired-end reads and, in the case of amplicon reads, by clustering [88]. Quality filtering is widely applied in the data analysis of NGS reads for removing sequence artifacts from the data set but is rarely viewed as an approach to be designed and verified. It usually depends on ad hoc criteria, such as imposing a maximum on the number of bases with less than a given Q score [89]. It aims at filtering instead of estimating error rates. It mainly comprises two recommended steps: (1) check for sequence errors, including raw sequences with low quality scores, short length, ambiguous base calls and mismatches to the primer sites or barcodes; and (2) check the alignment by aligning the sequences against a database. Other articles that present analysis pipelines for amplicon reads [90] report the mean error rate but ignore the tail of the error distribution. This may result in spurious clusters and consequently inflated estimates of diversity if the tail is not adequately controlled [91].
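To make the ad hoc filtering criterion mentioned above concrete, the following is a minimal sketch, not taken from any cited pipeline, that discards a read if it contains more than a fixed number of bases below a chosen Phred quality threshold; the Phred+33 offset and the thresholds are illustrative assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string (Phred+33 encoding by default) to integer Q scores."""
    return [ord(ch) - offset for ch in quality_string]

def passes_filter(quality_string, min_q=20, max_low_quality_bases=3):
    """Keep a read only if it has at most `max_low_quality_bases` bases with Q < min_q."""
    low = sum(1 for q in phred_scores(quality_string) if q < min_q)
    return low <= max_low_quality_bases

# Example: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
reads = [("ACGTACGT", "IIIIIIII"), ("ACGTACGT", "II##II##")]
kept = [(seq, qual) for seq, qual in reads if passes_filter(qual)]
print(len(kept), "of", len(reads), "reads pass the filter")
```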
Overlapping paired-end reads can be used to enhance the prediction of the sequence in the overlapping region by aligning the forward and reverse reads. A number of paired-read mergers have been published, such as PEAR [92], COPE [93], BLESS [94] and PANDAseq [95]; of these, BLESS and PANDAseq are the only two that include quality filtering as well as merging. Only a limited number of methods have been developed for error correction of amplicon pyrosequencing reads, as these reads have special requirements, and such methods cannot be used for other platforms such as Illumina. For example, AmpliconNoise [96] and PyroNoise [97] use a greedy algorithm based on an abundance sort. Further, the pre-clustering method [90] and single-linkage pre-clustering [91] also use abundance differences between closely related sequences to correct errors. The error correction methods can also be divided into four main categories [98], namely: (i) k-mer spectrum-based approaches, which cluster k-mers under a maximum Hamming distance threshold and amend erroneous reads towards the consensus sequence [99]; (ii) suffix tree- or array-based approaches, which allow a flexible size of k but have large memory consumption [100]; (iii) multiple sequence alignment approaches that use a k-mer as a seed to align reads [101]; and (iv) hidden Markov model-based methods [102].

Graph simplification

A weakness of the DBG is its high memory consumption. Graph simplification is applied in many applications; it aims to prune useless nodes of the graph. The reduced graph saves memory by merging nodes whose removal does not affect the paths used to assemble reads into contigs. Given a DBG, each read is represented by a path of k-mer nodes, and adjacent nodes share most of their base pairs, so the graph is highly redundant. Thus, a compression method is applied after constructing a DBG. Once the DBG is generated, nodes that have exactly one incoming and one outgoing edge can be merged into a new node. This new node contains the entire sequence information of the original nodes. For example, in Figure 7A, n4 and n5 each have one incoming and one outgoing edge; therefore, they can be combined into a new node C2, as presented in Figure 7B. The detailed process of compressing the DBG is shown in Figure 7, and a minimal code sketch of this compression is given below. In a similar manner, this compression method can also be applied to simplify the redundant nodes of overlap graphs.

Figure 7. The process of compressing a DBG. (A) A constructed DBG; (B) node n4 and node n5 are integrated into node C2; (C) the compressed graph of (A).

Maintaining the maximum connectivity is a future direction for explaining observed activity in the graph [103]. In other words, given a directed graph and a set of directed acyclic graphs (DAGs) (or trees) with specified roots, the aim is to select a subset of arcs in the graph that maximizes the number of nodes reachable in all DAGs from the corresponding DAG roots. MapReduce may be a good option for graph simplification owing to its lower memory consumption per machine. In this way, we are able to compress paths, prune tips and bubbles, and remove low-coverage nodes, as Contrail does. The graph may still be tangled even after applying simplification methods. Additional steps might be applied before the overlap graph simplification stage by calculating the reliability of each edge and removing edges with low reliability.
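As a concrete illustration of the compression step shown in Figure 7, the sketch below merges chains of nodes that each have exactly one incoming and one outgoing edge into non-branching paths. The adjacency-list representation, node labels and toy graph are assumptions made for illustration, not the data structure of any particular assembler, and cycles made entirely of such pass-through nodes are ignored.

```python
from collections import defaultdict

def compress_unary_paths(edges):
    """Collapse chains of nodes with in-degree 1 and out-degree 1.

    `edges` is a list of (u, v) pairs over k-mer labels; returns the maximal
    non-branching paths, each given as the list of original nodes it merges."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    def is_unary(n):
        # A pass-through node has exactly one incoming and one outgoing edge.
        return len(succ[n]) == 1 and indeg[n] == 1

    paths = []
    for n in nodes:
        if is_unary(n):
            continue                      # paths start only at branching/terminal nodes
        for v in succ[n]:
            path = [n, v]
            while is_unary(path[-1]):     # extend through pass-through nodes
                path.append(succ[path[-1]][0])
            paths.append(path)
    return paths

# Toy graph in the spirit of Figure 7: n3, n4 and n5 are pass-through nodes
# on one chain, so they end up merged into a single path.
edges = [("n1", "n2"), ("n2", "n3"), ("n3", "n4"),
         ("n4", "n5"), ("n5", "n6"), ("n2", "n6")]
for p in compress_unary_paths(edges):
    print(" -> ".join(p))
```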
As an example of such edge-reliability estimation, in [104] the edge reliability, namely the reliability of the statement 'edge e is true', is computed using two estimations. The theoretical estimation is calculated from the overlap length and the distribution of the number of repeats over repeat length. The experimental estimation is calculated from the k-mer coverage of overlapping reads, that is, how the number of occurrences of a k-mer across all reads depends on its position in the overlapping reads. In some studies, the error correction step is delayed until the graph simplification stage because some errors are not visible until the graph has been constructed, e.g. when distinguishing polymorphisms from sequencing errors [3].

Scaffolding

Walking through the graph can produce many contigs. However, these contigs are not independent sequences, and their order is still unknown. A scaffold is composed of ordered contigs separated by gaps of approximately known length; scaffolding is applied to order and orient these contigs using paired-read information [105]. Therefore, scaffolding is one of the most important steps in genome assembly. Paired-end sequencing makes a simple modification to the standard single-end protocol by reading both the forward and reverse template strands of each cluster during one paired-end run. Mate-pair sequencing involves producing long-insert paired-end DNA libraries from 2 to 5 kb in size. Combining data generated from mate-pair sequencing with data from short-insert paired-end reads offers a combination of insert sizes for maximal sequencing coverage across the genome. NGS machines are able to produce mate-paired reads [19]. The mate-paired reads are obtained by sequencing the first and last few hundred bases of large DNA fragments (2–5 kbp) whose lengths are approximately known [106]. Therefore, the mate-pair information between contig pairs can be used to constrain the placement of reads within an assembly and to construct scaffolds by ordering and orienting separate sequence contigs. If one contig includes one end of a mate-pair and a second contig contains the other end, the two contigs are said to be connected by the mate-pair [106]. However, in practice several mate-paired reads are required to arrange two contigs because of experimental errors [19]. Therefore, most scaffolding methods rely on a greedy approach to obtain an approximate solution, such as Autofinish [107] and Bambus [6]. Although scaffolding can join contigs into a longer sequence, it is still far from reconstructing the whole genome, as there are many gaps between contigs. 'N' characters are often used to fill the gaps, and the number of 'N's indicates the gap size. Scaffolding is not an essential step for de novo assembly; some DNA assembly software (such as Velvet) does not apply scaffolding at all.

Transcriptome and metagenomics assembly

The de Bruijn algorithm and the OLC algorithm are both graph-based methods and can be applied to assemble transcriptome reads. However, several aspects should be considered while constructing these graphs. The highly abundant sequences in transcriptome assembly are the highly expressed genes or the common exons [3], rather than the repeat regions of genomes. Although these reads generate bubble and branch structures, they should be viewed as different genes. Further, the transcriptome reads are strand-specific; thus, their complementary sequences are ignored.
Traditional algorithms cannot be directly applied to metagenomic assembly, as mentioned in the 'Challenges of Assembly' section. MAP [108], MetaVelvet [7] and Meta-IDBA [8] are three tools that tackle metagenomic assembly. MAP is based on the OLC algorithm, whereas MetaVelvet and Meta-IDBA are based on the de Bruijn algorithm. MAP divides the branches of a graph into several different paths [108]; this division can reduce the impact of repeats from other species. Further, the bubbles in the graph are merged into a node [108], which can represent polymorphism within the same species. However, a traditional single-genome assembler shows limitations for de novo metagenome assembly, as sequences of highly abundant species may be misidentified as repeats within a single genome, resulting in many small fragmented scaffolds. Unlike MAP, MetaVelvet applies statistical methods to handle multiple species. It computes the frequency of each substring in the DBG, and these frequencies assist in classifying different species. This is because the substring frequencies follow a Poisson distribution within a single species [109], whereas a metagenome can be viewed as a mixture of Poisson distributions [110]. This method decomposes a DBG constructed from mixed short reads of highly abundant species into individual subgraphs based on differences in k-mer frequency (coverage) and subgraph connectivity. In contrast, low-abundance species can be regarded as a partition of the high-abundance species; as a result, low-abundance species may give rise to low assembly quality.

Cloud computing technology

Cloud computing has been introduced to make better use of distributed resources, combining them to solve large-scale computation problems [111]. Although the idea of cloud computing has been around for some time, it is still an emerging and promising field for many disciplines, including bioinformatics. Cloud computing can be implemented with the MapReduce programming model and Hadoop. In general, the master machine divides the job into several parts and sends each part to worker machines; the workers complete their tasks and send the results back to the master. Therefore, the assembly algorithm should assemble reads in parallel. The OLC method may use cloud computing technology to assemble reads because the multiple DNA alignment algorithm can be performed in parallel. Contrail, relying on the theoretical framework of DBGs, uses Hadoop/MapReduce to parallelize the assembly across a number of computers, effectively reducing memory concerns and making assembly feasible for even the largest genomes. Traditional methods cannot be directly applied to cloud computing; they need to be redesigned to fit its architecture. Spark is an emerging architecture for large-scale data processing [112]. It not only runs programs faster than Hadoop MapReduce by keeping data in memory but also provides more than 80 high-level operators that facilitate building parallel applications [113]. In recent years, several algorithms and tools have been developed to transplant the DBG algorithm to cloud computing. Contrail (DBG), proposed by Michael Schatz's team [114], is the first DNA assembly program that can be run on a cloud computing platform. Later, CloudBrush (string graph) [70], another DNA assembly program based on cloud computing, was proposed. Cloud computing technology has proven to be an efficient way to conduct big data analysis in biology (a toy illustration of the map/reduce pattern applied to k-mer counting is given below).
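To make the MapReduce idea concrete, the sketch below counts k-mers with a map phase that emits (k-mer, 1) pairs and a reduce phase that sums them, the word-count pattern on which DBG construction on Hadoop-style platforms builds. It is a single-process illustration written under that assumption, not the actual Contrail or CloudBrush code.

```python
from collections import defaultdict
from itertools import chain

def map_phase(read, k):
    """Emit (k-mer, 1) pairs for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def reduce_phase(pairs):
    """Sum the counts for each k-mer; in a real cluster this runs per key."""
    counts = defaultdict(int)
    for kmer, value in pairs:
        counts[kmer] += value
    return counts

reads = ["ACGTACGT", "CGTACGTT", "TTACGTAC"]
k = 4
# The map calls are independent, so they could run on different machines;
# the shuffle step of a real framework is simulated here by chaining the results.
mapped = chain.from_iterable(map_phase(r, k) for r in reads)
for kmer, count in sorted(reduce_phase(mapped).items()):
    print(kmer, count)
```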
Applying graphics processing units to assemble DNA

Graphics processing units (GPUs), which are widely used in video cards and on motherboards, are single-chip processors. As their highly parallel structure provides high performance in handling computer graphics, researchers apply them to enhance the performance of sequence assembly. Constructing an overlap graph using the Smith–Waterman alignment algorithm is a time-consuming process. Many methods have been developed to align sequences using GPUs, such as CUSHAW2-GPU [115] and SOAP3-dp [36, 52]. According to the description in [116], aligning sequences on a GPU is over 20 times faster than on a single-core central processing unit (CPU). Some researchers have developed GPU-based methods to assemble DNA using the DBG. For example, Mahmood [117] created the GPU-Euler software, and Couto et al. [118] proposed a new assembly method suitable for GPUs. The sequence data need to be loaded into GPU memory while assembling; if the sequence data are large, the graphics card cannot provide enough GPU memory. In particular, de Bruijn-based algorithms require the graphics card to provide more memory to store the k-mers. One possible solution is a GPU cluster, which uses multiple GPUs to complete a common task [119]: the full set of reads can be divided into several small data sets, and each GPU can process one of them. GPUs are a good option for compute-intensive jobs rather than memory-intensive ones. Nevertheless, GPU-based numerical algorithms have the shortcoming of low double-precision performance, and GPU devices suffer from memory limitations for big data sets [120].

New sequencing machine

With the continuous development of sequencing technology, NGS high-throughput sequencing has been widely used in various research fields and has provided a large number of reads in a short time at low cost; however, it also has shortcomings. The length of the derived reads is still too short to reconstruct the whole genome. Further, the technical principle is based on PCR-amplified DNA, and the deviation between DNA abundance before and after amplification has a great impact on gene expression analysis. This, to some extent, limits the application of NGS sequencing technology. A growing trend in sequencing has been the development of instruments capable of working at higher speeds, generating longer reads and achieving higher accuracy. Thus, TGS machines with innovations such as SMRT sequencing have become commercially available to improve on the technical aspects of the NGS platforms [121, 122]. They aim to produce longer reads, a lower sequencing error rate, a faster sequencing speed and a lower cost than the SGS machines [123]. TGS technologies aim to increase throughput and decrease the time to result, and they can be used to identify the positions of individual nucleotides within long DNA fragments (>3000 bp). A representative technology, PacBio SMRT sequencing, is introduced below. In general, TGS has two properties. First, the PCR amplification step is not needed before sequencing. Second, the signal is obtained in real time, whether it is fluorescence, as with PacBio, or electric current, as with Nanopore. This can give rise to a faster data read speed and further reduces the cost of sequencing.
There are mainly three types of TGS platforms: true single-molecule sequencing (tSMS) launched by Helicos BioSciences [124], SMRT developed by Pacific Biosciences (http://www.pacificbiosciences.com) and MinION introduced by Oxford Nanopore Technologies (https://www.nanoporetech.com/). SMRT is a parallelized single-molecule DNA sequencing method. It uses a modified enzyme and enables direct observation of the enzymatic reaction in real time [125]. This may help in predicting structural variation in the sequence, and is especially useful for epigenetic studies. It was reported that the longest single-molecule reads can reach up to 60 kb with the most recent P6 DNA polymerase from Pacific Biosciences. MinION can handle a wide spectrum of read lengths and has yielded reads longer than 200 kb. Data streaming is used to return an answer on species identification quickly, or to allow sequencing experiments to be stopped once enough data have accumulated. It can be used outside the traditional laboratory environment to conduct field-based work. tSMS sequencing allows direct measurement of billions of individual nucleic acid molecules with high accuracy, simplicity and scale, requiring no amplification, ligation, complementary DNA synthesis or other complex manipulations. It is able to perform whole-transcriptome analyses including coding and noncoding RNAs, small RNA sequencing, single-step selection and sequencing of targeted DNA and RNA molecules, ancient DNA/RNA sequencing, targeted resequencing and whole-genome resequencing. PacBio has developed two kinds of sequencing machines, namely the Sequel System for high-quality whole-genome de novo assembly and the PacBio RS II for whole-genome sequencing of smaller organisms and targeted sequencing of DNA and RNA. The average read length can reach from 3000 to 10 000 base pairs [126]. However, these sequencing machines have shortcomings in read quality and quantity. The PacBio RS II produces long reads but with a >10% error rate [11]. This high error rate is a significant problem, but it can be alleviated if the coverage is high (the sketch after Table 3 illustrates this point quantitatively). However, the PacBio RS II is expensive; hence, high coverage is not always practical. Table 3 describes the differences between the TGS technologies, including read length, advantages and disadvantages.

Table 3. Comparison of the TGS technologies

Sequencer | PacBio SMRT | Helicos tSMS | Nanopore MinION
Read length | 4000–24 000 bp | 25–60 bp | 100 kb
Error rate | 15–20% | 3–5% | 38%
Reads/run | 47 000 | 1 billion | 3000
Time/run | 10 h | 8 days | 1 h
Advantage | No amplification, real-time monitoring and least GC bias | No amplification, nonbiased DNA sequence and RNA sequencing | No amplification, fastest sequencer, whole-genome scan in 15 min and portability
Disadvantage | High error rates | High NTP incorporation error rates | Not much data available, high cost
Instrument cost | $695k | $1.35 million | $1000
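The claim that high coverage can compensate for a high per-base error rate can be made quantitative with a simple binomial model: if each base is read independently with error probability p, a majority vote over c reads is wrong only when erroneous reads are at least as numerous as correct ones. The sketch below computes that probability; the independence assumption and the chosen numbers are illustrative, not properties of any particular instrument, and counting ties as errors overstates the risk slightly.

```python
from math import comb

def majority_vote_error(p, coverage):
    """Probability that a majority vote over `coverage` independent observations
    of a base is wrong, given per-read error probability p (for even coverage,
    ties are counted as errors)."""
    return sum(comb(coverage, k) * p**k * (1 - p)**(coverage - k)
               for k in range((coverage + 1) // 2, coverage + 1))

# Assume a 15% raw error rate, roughly the PacBio range quoted in Table 3.
for c in (1, 5, 15, 30):
    print(f"coverage {c:2d}: consensus error ~ {majority_vote_error(0.15, c):.2e}")
```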
DBG2OLC [127] is a recently published hybrid assembly method that uses NGS and TGS data simultaneously to address the two issues of high error rate and excessive cost in the transition from NGS to TGS. Unlike traditional approaches, it first preassembles the highly accurate NGS short reads into contigs using a DBG. The derived contigs are mapped to each long read, and each long read is compressed into a collection of contig identifiers. This compact representation of the long reads enables efficient multiple sequence alignments to prune reads with structural errors rather than base-level errors. The cleaned, compressed long reads are used to construct an optimal overlap graph, eventually yielding the DNA sequence. It was tested on mammalian-sized genomes and proved to be faster than existing methods without high memory consumption, while saving nearly 50% of the sequencing cost. Most existing genome assemblies do not capture the heterozygosity present within a diploid or polyploid species. In [128], a new diploid-aware long-read assembler, FALCON, is proposed to assemble haplotype contigs ('haplotigs') representing the actual genome in its diploid state, with homologous chromosomes independently represented and correctly phased. A new algorithm was developed for long-read sequence assembly of the gorilla genome [72]. Consensus sequences for reads with lengths in the top percentile are generated by comparing and overlapping sequence reads; overlaps between the corrected longer reads are then used to generate a string graph. A comprehensive review of PacBio sequencing and its applications is given in [129]. A systematic assessment of Illumina's synthetic long-read sequencing technology, which can produce reads of unusual length (around 10 kb), has been conducted to evaluate the promise and deficiencies of such long reads using the isogenic Caenorhabditis elegans genome [130].

Hybrid approaches for assembly

Hybrid genome assembly is usually defined as applying different sequencing technologies to reach the goal of assembling a genome from DNA sequences. Traditional de novo assembly has proved to be a computationally difficult process and can lead to NP-hard problems [31] in some cases, such as the Hamiltonian-cycle formulation, owing to the occurrence of tandem repeats of DNA segments thousands of base pairs in length. The large number of reads generated by NGS becomes a bottleneck in reconstructing a genome.
As a result, hybrid approaches are being undertaken to make assembly a more computationally efficient process and to increase the accuracy of the process as a whole [131]. One advantage of assembly using long reads is that the difficulties caused by repeat regions (as discussed in the 'Transcriptome Assembly Application' section) can be resolved, as long reads contain more information on the repeat regions. However, the long TGS data have relatively low accuracy. To apply long reads to sequence assembly, erroneous base pairs should be corrected. The combination of third-generation reads with short, high-accuracy SGS reads can reduce such inherent errors and capture crucial information about the genome. Some researchers apply SGS reads to correct these erroneous base pairs [11]. The NGS machines generate a large number of short reads with low error rates; these short reads can offer a high-quality reference sequence for correcting erroneous base pairs. The recent development of TGS yields much longer reads than SGS and thus offers a chance to solve problems that are hard to study via SGS alone. However, most TGS technologies share an intrinsic drawback, namely higher raw read error rates; consequently, most error correction software for TGS relies on NGS reads. The process of error correction involves aligning the short reads to the long reads; the long reads are then modified by relying on the alignment information of the short reads [11]. After correcting these wrong nucleotides, the aforementioned assembly algorithms can be used to generate long contigs. It is anticipated that, if the read length is long enough and the per-base quality is high, researchers will no longer need dynamic programming to align reads, as seed-based methods can do a much better job. Although long reads can tackle repeat regions, the repeat regions may come from different chromosomes, and in the de Bruijn algorithm a long read will generate a large number of k-mers. A hybrid genome assembly approach was reported for the automated finishing of bacterial genomes [132]. Two different methods are used, including a scaffolding method that supplements currently available sequenced contigs with PacBio reads, as well as an error correction method to enhance the assembly of bacterial genomes. Cerulean is also a hybrid assembly program combining high-throughput short and long reads [133]. Unlike traditional hybrid assembly approaches, it does not use the short reads directly; instead, an assembly graph generated in a similar way to the OLC or de Bruijn method is used. An ensemble strategy that integrates the sequential use of various DBG and OLC assemblers with a novel partitioned subassembly approach is developed in [134], in which a new quality metric for evaluating metagenome de novo assembly is also proposed. Patch [135] is another hybrid method, which exploits corrected long reads and preassembled contigs as inputs to enhance microbial genome assemblies. Short reads of S. cerevisiae W303 were hybrid-assembled into 115 contigs using an additional 20× of PacBio long reads; Patch was subsequently applied to upgrade the assembly of this yeast (a eukaryotic microorganism) to a 35-contig draft genome.

Application of DNA assembly

Assembling DNA fragments is the foundation of genome research. This section discusses four primary bioinformatics applications of DNA fragment assembly.

Whole-genome assembly

Doctors usually obtain disease information from the historical records of patients; however, this is not sufficient for intricate diseases.
Whole-genome sequencing can provide a comprehensive understanding of disease development at the molecular level [136]. For example, complete genome sequencing can generate the DNA sequences of a patient with a genetic disease. However, the raw reads are too short to identify their functions. Therefore, assembling these short reads into a long sequence is the first step in understanding the cause of the disease and determining treatment options for patients. Genome assembly benefits not only disease diagnosis but also the diagnosis and treatment of bacterial infections, especially mixed infections [137]. Genome assembly and deep analysis are still costly [138]. With the rapid development of the technology, genome sequencing will become cheaper. Further, the increasing application of big data analysis from computer science greatly improves the efficiency and accuracy of assembly.

SNP detection

An SNP is a single-nucleotide variation that differs between members of a species. According to the 1000 Genomes Project, which was launched in 2008 with its pilot stage completed in 2012, there are millions of SNPs in human genomes [139]. These SNPs play a vital role in personalized medicine and species evolution [140]. Identifying these SNPs is a hot topic in bioinformatics. To identify SNPs, a number of methods based on sequence alignment have been developed, such as iterative mapping [141], post-alignment filtering [142] and read realignment [57]. Some researchers apply machine learning algorithms to identify SNPs [143]. The prediction accuracy of the alignment-based methods largely depends on the reference genome [144]; a reference genome with long deletions and structural variations might lead to incorrect predictions. The machine learning-based methods cannot always provide a high prediction rate, even when they select the best features [143], because of the high complexity of SNP discovery. The de novo assembly method does not rely on a reference genome and, to some extent, avoids the weakness of the alignment-based methods. Some articles have shown that the de novo assembly method can provide high prediction accuracy in identifying SNPs [144]. Nevertheless, many de novo assembly algorithms, in fact, overlook SNPs.

Transcriptome assembly application

Transcriptome reads can be used to detect gene fusions and to discriminate gene expression levels [37]. Many researchers analyze the transcriptomes of plants to determine their medicinal and commercial value [145]. In addition, transcriptome sequence data assist in understanding biological evolution. Over the past few years, transcriptome information has been derived from gene prediction and limited expressed sequence tag (EST) evidence [3]. To predict genes, we must obtain whole-genome information. However, assembling the whole genome is expensive and time-consuming, and the traditional methods for gene prediction cannot provide a satisfactory level of prediction accuracy; therefore, they are not appropriate for obtaining transcriptome information. NGS technology gives rise to a large number of transcriptome reads. However, these reads are too short to uncover their functions and need to be assembled. Further, transcript assembly can recover some genes that may be wrongly assembled by genome assembly and can detect sequences from unknown exogenous sources. Transcriptome assembly provides an easy, cheap and quick way to conduct gene research.
Metagenome study

The 'Transcriptome and metagenomics assembly' section described the algorithms that can be used for metagenome assembly and the definition of metagenomics; however, it did not discuss the significance of metagenomics. Metagenomics has been widely applied in many fields, such as soil research. Although soil metagenomics is important for agriculture, the deep analysis of the soil's microbial communities is still underdeveloped. Metagenomic analyses can provide extensive information on the structure, composition and predicted gene functions of diverse environmental microbial assemblages [146]. The primary goals of metagenomic analysis are to find metabolic genes [147] and to discover the different organisms present in a sample [110]. However, the reads generated by SGS technology are too short for gene prediction, so assembling these short reads is the first step in studying gene functions. At the early stage of metagenomic research, researchers used DNA assembly to define and describe gut microbes, and its effectiveness has been validated in [148]. Several researchers have applied metagenomic sequencing and assembled metagenomes to investigate the gut microbes of the human body; this research reveals a possible relationship between aging and the gut metagenome [48]. However, unlike the assembly of a single species' DNA fragments, metagenomic assembly has three main properties. First, the metagenome contains numerous DNA sequences from more than one species. Second, the abundance of different species is uneven, and species of low abundance provide only limited information for assembling fragments. Third, individuals of the same species have similar but not identical sequences; thus, the assembly algorithm should take this polymorphism into account [19]. Some research focuses on studying the components of an environment and identifying environmental change. The short reads are long enough to be mapped to a known reference genome database; by doing so, we are able to identify each species and its quantity. However, without de novo metagenome assembly, unknown species and other important DNA sequences cannot be identified [147, 149, 150], because the reference database contains the DNA sequences of only a limited number of species. The sequences of unknown organisms cannot be mapped to their own reference genomes, although they may be matched to other species from the same family. Real metagenomic assembly is complicated. It aims to distinguish changes in the different microorganisms of an environment, which provides valuable information on microorganisms with respect to environmental change. Thus, metagenomic assembly is a critical process in metagenomic research.

Conclusion

This article discussed the challenging issues of DNA assembly, including the three major types of DNA assembly algorithms, the applications of assembly and the future development of DNA assembly. There is no doubt that DNA assembly is still a hot topic in bioinformatics because of its complexity and importance. All the major DNA assembly algorithms have limitations in assembling short sequences. It is impractical to develop one assembly algorithm that achieves perfect results, owing to the error rates of current sequencing technologies. Further, it is not easy to solve the bottleneck problems of big data size and high memory consumption by depending on traditional single-threaded algorithms. Therefore, it is a good idea to use novel technology to address the shortcomings of these two types of assembly methods.
Cloud computing and GPUs are two major technologies for enhancing the performance of assembly. The new sequencing machines generate long reads; however, these reads are still much shorter than whole chromosomes and must themselves be assembled. As long as no sequencer can recover the full contiguous sequence in a single pass, sequence assembly remains a vital step in the study of DNA.

Key Points

NGS and TGS generate valuable biological big data and thus lead to increasing interest in sequence assembly.
Repeats, polymorphisms and sequencing errors occur frequently in genomes.
Mapping-based assembly and de novo-based assembly are the two primary approaches for assembly.
Optimal k-mer determination and error correction are critical for assembly.

Funding

The work reported in this paper was partially supported by two National Natural Science Foundation of China projects (61363025 and 61373048) and two key projects of the Natural Science Foundation of Guangxi (2012GXNSFCB053006 and 2013GXNSFDA019029).

Qingfeng Chen is with the School of Computer, Electronic and Information, Guangxi University, Nanning 530004, China, and the State Key Laboratory for Conservation and Utilization of Subtropical Agrobioresources, Guangxi University. Chaowang Lan is with the School of Computer, Electronic and Information, Guangxi University, Nanning 530004, China. Liang Zhao is with the School of Computer, Electronic and Information, Guangxi University, Nanning 530004, China. Jianxin Wang is with the School of Information Science and Engineering, Central South University, Changsha 410083, China. Baoshan Chen is with the State Key Laboratory for Conservation and Utilization of Subtropical Agrobioresources, Guangxi University, China. Yi-Ping Phoebe Chen is with the Department of Computer Science and Computer Engineering, La Trobe University, Victoria 3086, Australia.

References

1. http://support.illumina.com/sequencing/sequencing_instruments/genome_analyzer_iix.html.
2. Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics 2010;95:315–27.
3. El-Metwally S, Hamza T, Zakaria M, et al. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013;9:e1003345.
4. Gingeras T, Milazzo JP, Sciaky D, et al. Computer programs for the assembly of DNA sequences. Nucleic Acids Res 1979;7:529–45.
5. Conway TC, Bromage AJ. Succinct data structures for assembling large genomes. Bioinformatics 2011;27:479–86.
6. Koren S, Treangen TJ, Pop M. Bambus 2: scaffolding metagenomes. Bioinformatics 2011;27:2964–71.
7. Namiki T, Hachiya T, Tanaka H, et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res 2012;40:e155.
8. Peng Y, Leung HC, Yiu SM, et al. Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics 2011;27:i94–101.
9. Grabherr MG, Haas BJ, Yassour M, et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol 2011;29:644–52.
10. Robertson G, Schein J, Chiu R, et al. De novo assembly and analysis of RNA-seq data. Nat Methods 2010;7:909–12.
11. Koren S, Schatz MC, Walenz BP, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 2012;30:693–700.
12. Pham SK, Antipov D, Sirotkin A, et al. Pathset graphs: a novel approach for comprehensive utilization of paired reads in genome assembly. J Comput Biol 2013;20:359–71.
13. Batzoglou S. Algorithmic challenges in mammalian genome sequence assembly. In: Encyclopedia of Genomics, Proteomics and Bioinformatics. Hoboken, New Jersey: John Wiley and Sons, 2005.
14. Sutton G, Dew I. Shotgun fragment assembly. In: Rigoutsos I, Stephanopoulos G (eds), Systems Biology: Genomics. New York: Oxford University Press, 2007, 79–117.
15. Idury RM, Waterman MS. A new algorithm for DNA sequence assembly. J Comput Biol 1995;2:291–306.
16. Pevzner PA. 1-Tuple DNA sequencing: computer analysis. J Biomol Struct Dyn 1989;7:63–73.
17. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 2001;98:9748–53.
18. Pop M, Salzberg SL. Bioinformatics challenges of new sequencing technology. Trends Genet 2008;24:142–9.
19. Pop M. Genome assembly reborn: recent computational challenges. Brief Bioinform 2009;10:354–66.
20. Chial H. Rare genetic disorders: learning about genetic disease through gene mapping, SNPs, and microarray data. Nat Educ 2008;1:192.
21. Lee HJ, Kweon J, Kim E, et al. Targeted chromosomal duplications and inversions in the human genome using zinc finger nucleases. Genome Res 2012;22:539–48.
22. Craddock N, Hurles ME, Cardin N, et al. Genome-wide association study of copy number variation in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 2010;464:713–20.
23. Altshuler DM, Gibbs RA, de Bakker PI, et al. Integrating common and rare genetic variation in diverse human populations. Nature 2010;467:52–8.
24. Matlin AJ, Clark F, Smith CWJ. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol 2005;6:386–98.
25. Väzquez-Castellanos JF, Garcïa-Löpez R, Përez-Brocal V, et al. Comparison of different assembly and annotation tools on analysis of simulated viral metagenomic communities in the gut. BMC Genomics 2014;15:37.
26. Pierce BA. Genetics: A Conceptual Approach, 5th edn. New York: W.H. Freeman & Company, 2013.
27. Quail MA, Smith M, Coupland P, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 2012;13:341.
28. Jeck WR, Reinhardt JA, Baltrus DA, et al. Extending assembly of short DNA sequences to handle error. Bioinformatics 2007;23:2942–4.
29. Sutton G, White O, Adams MD, et al. TIGR assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci Technol 1995;1:9–19.
30. Peltola H, Söderlund H, Ukkonen E. SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Res 1984;12(1 Pt 1):307–21.
31. Kececioglu JD, Myers EW. Combinatorial algorithms for DNA sequence assembly. Algorithmica 1995;13:7–51.
32. Chaisson MJ, Pevzner PA. Short read fragment assembly of bacterial genomes. Genome Res 2008;18:324–30.
33. Butler J, MacCallum I, Kleber M, et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res 2008;18:810–20.
34. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008;18:821–9.
35. Simpson JT, Wong K, Jackman SD, et al. ABySS: a parallel assembler for short read sequence data. Genome Res 2009;19:1117–23.
36. Luo RB, Wong T, Zhu JQ, et al. SOAP3-dp: fast, accurate and sensitive GPU-based short read aligner. PLoS One 2013;8:e65632.
37. Xie YL, Wu GX, Tang JB, et al. SOAPdenovo-trans: de novo transcriptome assembly with short RNA-seq reads. Bioinformatics 2014;30:1660–6.
38. Loman NJ, Misra RV, Dallman TJ, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol 2012;30:434–9.
39. Flicek P, Birney W. Sense from sequence reads: methods for alignment and assembly. Nat Methods 2009;6:S6–S12.
40. Jaffe DB, Butler J, Gnerre S, et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res 2003;13:91–6.
41. Myers EW, Sutton GG, Delcher AL, et al. A whole-genome assembly of Drosophila. Science 2000;287:2196–204.
42. Compeau PEC, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 2011;29:987–91.
43. Shinzato C, Shoguchi E, Kawashima T, et al. Using the Acropora digitifera genome to understand coral responses to environmental change. Nature 2011;476:320–3.
44. Gibbs RA, Weinstock GM, Metzker ML, et al. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 2004;428:493–521.
45. Todd JT, Steven LS. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 2011;13:36–46.
46. Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet 2013;14:157–67.
47. Kelley DR, Schatz MC, Salzberg SL. Quake: quality-aware detection and correction of sequencing errors. Genome Biol 2010;11:R116.
48. Dinsdale A, Edwards RA, Hall D, et al. Functional metagenomic profiling of nine biomes. Nature 2008;452:629–32.
49. Hess M, Sczyrba A, Egan R. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 2011;331:463–7.
50. Lunter G, Goodson M. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res 2011;21:936–9.
51. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754–60.
52. Li R, Li Y, Kristiansen K, et al. SOAP: short oligonucleotide alignment program. Bioinformatics 2008;24:713–4.
53. Raczy C, Petrovski R, Saunders CT, et al. Isaac: ultra-fast whole genome secondary analysis on Illumina sequencing platforms. Bioinformatics 2013;29:2041–3.
54. Rimmer A, Phan H, Mathieson I, et al. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet 2014;46:912–20.
55. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491–8.
56. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008;18:1851–8.
57. Albers CA, Lunter G, MacArthur DG, et al. Dindel: accurate indel calls from short-read data. Genome Res 2011;21:961–73.
58. Lunter G, Rocco A, Mimouni N, et al. Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res 2008;18:298–309.
59. Li RQ, Li YR, Fang X, et al. SNP detection for massively parallel whole-genome resequencing. Genome Res 2009;19:1124–32.
60. McKenna A, Hanna M, Banks E, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–303.
61. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:R25.
62. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9:357–9.
63. Liao Y, Smyth GK, Shi W. The subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res 2013;41:e108.
64. Pop M, Phillippy A, Delcher AL, Salzberg SL. Comparative genome assembly. Brief Bioinform 2004;5:237–48.
65. Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol 1982;162:705–8.
66. Huson DH, Reinert K, Myers E. The greedy path-merging algorithm for sequence assembly. In: Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, 2001, pp. 157–63. Montreal, Quebec, Canada; ACM, New York, USA.
67. Myers EW. The fragment assembly string graph. Bioinformatics 2005;21(Suppl 2):ii79–85.
68. Simpson JT, Durbin R. Efficient construction of an assembly string graph using the FM-index. Bioinformatics 2010;26:i367–73.
69. Simpson JT, Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res 2012;22:549–56.
70. Chang YJ, Chen CC, Chen CL. A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC Genomics 2012;13(Suppl 7):S28.
71. Bonizzoni P, Vedova GD, Pirola Y, et al. FSG: fast string graph construction for de novo assembly of reads data. In: Proceedings of the 12th International Symposium on Bioinformatics Research and Applications, LNCS 9683. Switzerland: Springer, 2016, pp. 27–39.
72. Gordon D, Huddleston J, Chaisson M, et al. Long-read sequence assembly of the gorilla genome. Science 2016;352:aae0344.
73. Garey MR, Johnson DS. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H. Freeman, 1979, pp. 199–200.
74. Chikhi R, Medvede P. Informed and automated k-mer size selection for genome assembly. Bioinformatics 2014;30:31–7.
75. Rizk G, et al. DSK: k-mer counting with very low memory usage. Bioinformatics 2013;29:652–3.
76. Warren RL, Sutton GG, Jones SJM, et al. Assembling millions of short DNA sequences using SSAKE. Bioinformatics 2007;23:500.
77. Chu TC, Lu CH, Liu T, et al. Assembler for de novo assembly of large genomes. Proc Natl Acad Sci USA 2013;110:E3417–24.
78. Dohm JC, Lottaz C, Borodina T, et al. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res 2007;17:1697–706.
79. Bryant DW, Wong WK, Mockler TC. QSRA: a quality-value guided de novo short read assembler. BMC Bioinformatics 2009;10:69.
80. Huson DH, Reinert K, Myers EW. The greedy path-merging algorithm for contig scaffolding. J ACM 2002;49:603–15.
81. Just W. Computational complexity of multiple sequence alignment with SP-score. J Comput Biol 2001;8:615–23.
82. Akutsu T, Arimura H, Shimozono S. On approximation algorithms for local multiple alignment. In: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, 2000, pp. 1–7. Tokyo, Japan; ACM, New York, USA.
83. Earl D, Bradnam K, John JS, et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011;21:2224–41.
84. Baker M. De novo genome assembly: what every biologist should know. Nat Methods 2012;9:333–7.
85. Salzberg SL, Phillippy AM, Zimin A, et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res 2011;22:1196.
86. Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem 2013;6:287–303.
87. Glenn TC. Field guide to next-generation DNA sequencers. Mol Ecol Resour 2011;11:759–69.
88. Edgar RC, Flyvbjerg H. Error filtering, pair assembly and error correction for next-generation sequencing reads. Bioinformatics 2015;31:3476–82.
89. Bokulich NA, Subramanian S, Faith JJ, et al. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods 2013;10:57–9.
90. Kozich JJ, Westcott SL, Baxter NT, et al. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol 2013;79:5112–20.
91. Huse SM, Welch DM, Morrison HG, et al. Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol 2010;12:1889–98.
92. Zhang J, Kobert K, Flouri T, et al. PEAR: a fast and accurate Illumina paired-end read merger. Bioinformatics 2014;30:614–20.
93. Liu B, Yuan J, Yiu SM, et al. COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics 2012;28:2870–4.
94. Heo Y, Wu XL, Chen DM, et al. BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 2014;30:1354–62.
95. Masella AP, Bartram AK, Truszkowski JM, et al. PANDAseq: paired-end assembler for Illumina sequences. BMC Bioinformatics 2012;13:31.
96. Quince C, Lanzen A, Davenport RJ, et al. Removing noise from pyrosequenced amplicons. BMC Bioinformatics 2011;12:38.
97. Quince C, Lanzén A, Curtis TP, et al. Accurate determination of microbial diversity from 454 pyrosequencing data. Nat Methods 2009;6:639–41.
98. Yang X, Chockalingam SP, Aluru S. A survey of error correction methods for next-generation sequencing. Brief Bioinform 2013;14:56–66.
99 Liu Y, Schröder J, Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 2013;29:308–15.
100 Ilie L, Fazayeli F, Ilie S. HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics 2011;27:295–302.
101 Kao WC, Chan AH, Song YS. ECHO: a reference-free short-read error correction algorithm. Genome Res 2011;21:1181–92.
102 Yin X, Song Z, Dorman K, et al. PREMIER: probabilistic error-correction using Markov inference in errored reads. arXiv preprint arXiv:1302.0212, 2013.
103 Bonchi F, De Francisci Morales G, Gionis A, et al. Activity preserving graph simplification. Data Min Knowl Discov 2013;27:321.
104 Kazakov S, Shalyto A. Overlap graph simplification using edge reliability calculation. In: Proceedings of the 8th International Conference on Intelligent Systems and Agents (ISA 2014), 2014, pp. 222–6.
105 Pop M, Kosack DS, Salzberg SL. Hierarchical scaffolding with Bambus. Genome Res 2004;14:149–59.
106 Kim PG, Cho HG, Park K. A scaffold analysis tool using mate-pair information in genome sequencing. J Biomed Biotechnol 2008;2008:675741.
107 Gordon D, Desmarais C, Green P. Automated finishing with autofinish. Genome Res 2001;11:614–25.
108 Lai BB, Ding RG, Li Y, et al. A de novo metagenomic assembly program for shotgun DNA reads. Bioinformatics 2012;28:1455–62.
109 Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 1988;2:231–9.
110 Wu YW, Ye YZ. A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol 2010;18:523–34.
111 Armbrust M, Fox A, Griffith R, et al. A view of cloud computing. Commun ACM 2010;53:50–8.
112 Wiewiórka MS, Messina A, Pacholewska A, et al. SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 2014;30:2652–3.
113 http://spark.apache.org/.
114 http://cbcb.umd.edu/~mschatz/Posters/BOG.2010-Contrail.pdf.
115 Liu YC, Schmidt B. CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing. IEEE Des Test 2013;31:31–9.
116 Striemer GM, Akoglu A. Sequence alignment with GPU: performance and design challenges. In: 2009 IEEE International Symposium on Parallel and Distributed Processing, 2009, pp. 1–10. IEEE Computer Society, Orlando, Florida, USA.
117 Mahmood SF, Rangwala H. GPU-Euler: sequence assembly using GPGPU. In: 2011 IEEE 13th International Conference on High Performance Computing and Communications (HPCC), 2011, pp. 153–60. IEEE Computer Society, Banff, Alberta, Canada.
118 Couto AD, Cerqueira FR, Guerra RL, et al. Theoretical basis of a new method for DNA fragment assembly in k-mer graphs. In: Proceedings of the 2012 31st International Conference of the Chilean Computer Science Society (SCCC), 2012, pp. 66–77. IEEE Computer Society, Valparaíso, Chile.
119 Rumpf M, Strzodka R. Graphics Processor Units: New Prospects for Parallel Computing. LNCS 51. Springer, 2006, pp. 89–132.
120 Shi H, Schmidt B, Liu W, Müller-Wittig W. Parallel mutual information estimation for inferring gene regulatory networks on GPUs. BMC Res Notes 2011;4:189.
121 Derrington IM, Butler TZ, Collins MD, et al. Nanopore DNA sequencing with MspA. Proc Natl Acad Sci USA 2010;107:16060–5.
122 Eid J, Fehr A, Gray J, et al. Real-time DNA sequencing from single polymerase molecules. Science 2009;323:133–8.
123 Schadt EE, Turner S, Kasarskis A. A window into third-generation sequencing. Hum Mol Genet 2010;19(R2):R227–40.
124 Bowers J, Mitchell J, Beer E, et al. Virtual terminator nucleotides for next-generation DNA sequencing. Nat Methods 2009;6:593–5.
125 Timp W, Mirsaidov UM, Wang D, et al. Nanopore sequencing: electrical measurements of the code of life. IEEE Trans Nanotechnol 2010;9:281–94.
126 Mason CE, Elemento O. Faster sequencers, larger datasets, new challenges. Genome Biol 2012;13:314.
127 Ye C, Hill CM, Wu S, et al. DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Sci Rep 2016;6:31900.
128 Chin CS, Peluso P, Sedlazeck FJ, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 2016;13:1050–4.
129 Rhoads A, Au KF. PacBio sequencing and its applications. Genomics Proteomics Bioinformatics 2015;13:278–89.
130 Li R, Hsieh CL, Young A, et al. Illumina synthetic long read sequencing allows recovery of missing sequences even in the "finished" C. elegans genome. Sci Rep 2015;5:10814.
131 Koren S, Harhay GP, Smith TPL, et al. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol 2013;14:R101.
132 Bashir A, Klammer AA, Robins WP, et al. A hybrid approach for the automated finishing of bacterial genomes. Nat Biotechnol 2012;30:70.
133 Deshpande V, Fung E, Pham S, et al. Cerulean: a hybrid assembly using high throughput short and long reads. Algorithms Bioinform 2013;8126:349–63.
134 Deng XT, Naccache SN, Ng T, et al. An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data. Nucleic Acids Res 2015;43:e46.
135 Lin HH, Liao YC. Evaluation and validation of assembling corrected PacBio long reads for microbial genome completion via hybrid approaches. PLoS One 2015;10:e0144305.
136 Drmanac R. The advent of personal genome sequencing. Genet Med 2011;13:188–90.
137 Eyre DW, Cule ML, Griffiths D, et al. Detection of mixed infection from bacterial whole genome sequence data allows assessment of its role in Clostridium difficile transmission. PLoS Comput Biol 2013;9:e1003059.
138 Carla G, Martina CC, Pascal B, et al. Whole-genome sequencing in health care: recommendations of the European Society of Human Genetics. Eur J Hum Genet 2013;21:580–4.
139 McVean GA. An integrated map of genetic variation from 1,092 human genomes. Nature 2013;491:56–65.
140 Lao O, van Duijn K, Kersbergen P, et al. Proportioning whole-genome single-nucleotide polymorphism diversity for the identification of geographic population structure and genetic ancestry. Am J Hum Genet 2006;78:680–90.
141 Shirasawa K, Isobe S, Hirakawa H, et al. SNP discovery and linkage map construction in cultivated tomato. DNA Res 2010;17:381–91.
142 Ossowski S, Schneeberger K, Clark RM, et al. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res 2008;18:2024–33.
143 Kong WM, Choo KW. Predicting single nucleotide polymorphisms (SNP) from DNA sequence by support vector machine. Front Biosci 2007;12:1610–14.
144 Li H. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics 2012;28:1838–44.
145 Schliesky S, Gowik U, Weber APM. RNA-seq assembly: are we there yet? Front Plant Sci 2012;3:220.
146 Kakirde KS, Parsley LC, Liles MR. Size does matter: application-driven approaches for soil metagenomics. Soil Biol Biochem 2010;42:1911–23.
147 Hoff KJ. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics 2009;10:520.
148 Qin J, Li R, Raes J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 2010;464:59–65.
149 Boisvert S, Raymond F, Godzaridis É, et al. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol 2012;13:R122.
150 Narasingarao P, Podell S, Ugalde JA. De novo metagenomic assembly reveals abundant novel major lineage of Archaea in hypersaline microbial communities. ISME J 2012;6:81–93.
© The Author 2017. Published by Oxford University Press. All rights reserved.
For permissions, please email: journals.permissions@oup.com
TI - Recent advances in sequence assembly: principles and applications
JF - Briefings in Functional Genomics
DO - 10.1093/bfgp/elx006
DA - 2017-11-01
UR - https://www.deepdyve.com/lp/oxford-university-press/recent-advances-in-sequence-assembly-principles-and-applications-3ZSGcnI3cR
SP - 361
VL - 16
IS - 6
DP - DeepDyve
ER -