Error correcting optical mapping data

Error correcting optical mapping data Optical mapping is a unique system that is capable of producing high-resolution, high-throughput genomic map data that gives information about the structure of a genome . Recently it has been used for scaffolding contigs and for assembly validation for large-scale sequencing projects, including the maize, goat, and Amborella genomes. However, a major impediment in the use of this data is the variety and quantity of errors in the raw optical mapping data, which are called Rmaps. The challenges associated with using Rmap data are analogous to dealing with insertions and deletions in the alignment of long reads. Moreover, they are arguably harder to tackle since the data are numerical and susceptible to inaccuracy. We develop cOMet to error correct Rmap data, which to the best of our knowledge is the only optical mapping error correction method. Our experimental results demonstrate that cOMet has high prevision and corrects 82.49% of insertion errors and 77.38% of deletion errors in Rmap data generated from the Escherichia coli K-12 reference genome. Out of the deletion errors corrected, 98.26% are true errors. Similarly, out of the insertion errors corrected, 82.19% are true errors. It also successfully scales to large genomes, improving the quality of 78% and 99% of the Rmaps in the plum and goat genomes, respectively. Last, we show the utility of error correction by demonstrating how it improves the assembly of Rmap data. Error corrected Rmap data results in an assembly that is more contiguous and covers a larger fraction of the genome. Keywords: optical mapping; error correction Introduction DNA molecules cling to the surface of a microscope slide using electrostatic charge and are digested with one or more restric- In 1993 Schwartz et al. developed optical mapping, a system for tion enzymes. The restriction enzymes cut the DNA molecule creating an ordered, genome-wide, high-resolution restriction at occurrences of the enzyme’s recognition sequence, forming a map of a given organism’s genome [1]. Since this initial devel- number of DNA fragments. The fragments formed by digestion opment, genome-wide optical maps have found numerous ap- are painted with a fluorescent dye to allow visibility under laser plications including discovering structural variations and rear- light and a CCD (Charged Coupled device) camera. Computer rangements [5], scaffolding and validating contigs for several vision algorithms then estimate fragment length from consol- large sequencing projects [4,3, 6], and detecting misassembled idated intensity of fluorescent dye and apparent distance be- regions in draft genomes [7]. Thus, optical mapping has assisted tween fragment ends. in the assembly of a variety of species including various prokary- The resulting data from an experiment are in the form of an ote species [8, 9, 10], rice [11], maize [2], mouse [12], goat [3], ordered series of fragment lengths [13]. The data for each sin- parrot [6], and Amborella trichopoda [4]. The raw optical mapping gle molecule produced by the system is called an Rmap. Rmap data are generated by a biological experiment in which large data have a number of errors due to the experimental conditions Received: 22 June 2017; Revised: 4 December 2017; Accepted: 16 May 2018 The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 2 Error correcting optical mapping data and system limitations. In an optical mapping experiment, it is as follows: 3,7,12,15,20. Then, R will be represented as R = 2, 4, i i unlikely to achieve perfectly uniform fluorescent staining. This 5, 3, 5, 2. The size of an Rmap denotes the number of fragments leads to an erroneous estimation of fragment sizes. Also, restric- in that Rmap. Therefore, the size of R is 6. tion enzymes often fail to digest all occurrences of their recog- We note that millions of Rmaps are produced for a single nition sequence across the DNA molecule. This manifests as genome since optical mapping is performed on many cells of missing restriction sites. Additionally, due to the fragile nature the organism and each cell provides thousands of Rmaps. The of DNA, additional breaks can incorrectly appear as restriction Rmaps can be assembled to produce a genome-wide optical sites. Last, the limitations of the imaging component of the opti- map. This is analogous to next-generation shotgun sequencing cal mapping system and the propensity for the DNA to ball up at where Rmaps are analogous to reads and a genome-wide optical the ends introduces more sizing error for smaller fragments. In- map is analogous to the assembled whole genome. terested readers will find more details about the causes of these There are three types of errors that can occur in optical map- errors in Valouev et al. [14] and Li et al. [15]. Because of all these ping: (1) missing cut sites, which are caused by an enzyme not experimental conditions, Rmap data generated through optical cleaving at a specific site; (2) additional cut sites, which can mapping experiment have insertion (added cut sites) and dele- occur due to random DNA breakage; and (3) inaccuracy in the tion (missed cut sites) errors along with fragment sizing errors. fragment size due to the inability of the system to accurately In most applications of optical map data, the Rmaps need estimate the fragment size. Continuing again with the exam- to be assembled into a genome-wide optical map. This is be- ple above, a more representative example Rmap would include cause the single molecule maps need redundant sampling to these errors, such as R = 7, 6, 3, 4. overcome the presence of the aforementioned errors and be- The error rates of optical maps depend on the platform used cause single molecule maps only span on the order of 500 Kbp for generating the maps. Li et al. [15] recently studied the error [14]. The first step of this assembly process involves finding pair- rates of optical maps produced by the Irys system from BioNano wise alignments among the Rmaps. In order to accomplish this, Genomics. According to their study, a missing cut site type of er- the challenge of dealing with missing fragment sizes has to be ror, i.e., error type (1),happens when a restriction site is incom- overcome. This challenge is analogous to dealing with insertions pletely digested by the enzyme and causes two flanking frag- and deletions in the alignment of long reads [16]. In fact, it is ar- ments to merge into one large fragment. The probability of com- guably harder since the data are numerical. At present, the only plete digestion of a restriction site can be modeled as a Bernoulli nonproprietary algorithmic method for pairwise alignment of trial whose probability of success is a function of the size of Rmaps is the dynamic programming-based method of Valouev the two flanking fragments. Additional cut sites, i.e., error type et al. [14], which runs in O(α × β)timewhere α and β are the (2),results from random breaks of the DNA molecule. The num- number of fragments in the two Rmaps being aligned. To align ber of false cuts per unit length of DNA follows a Poisson distri- an optical map dataset containing n Rmaps, the complexity be- bution. The inaccuracy of the fragment sizes, i.e., error type (3), 2 2 comes O(n ×  )where  is theaverage sizeofanRmap. is modeled using a Laplace distribution. If the observed and ac- This method is inherently computationally intensive. How- tual size of a fragment are o and r , respectively, then the sizing k k ever, if the error rate of the data could be improved, then non- error is defined as s = o /r and k k k dynamic programming-based methods that are orders of mag- nitude faster such as Twin [17], OMBlast [18], and Maligner [19] s ∼ Laplace(μ,β) could be used for alignment. This would greatly improve the time required to assemble Rmap data. Thus, we present cOMet where μ and β, the parameters of the laplace distribution, are in order to address this need. To the best of our knowledge, it functions of r . In practice, when aligning a pair of Rmaps, one is the first Rmap error correction method. Our experimental re- should allow for twice the error rate of a single Rmap since each sults demonstrate that cOMet has high precision and corrects Rmap will deviate from the genomic map by the above parame- 82.49% of insertion errors and 77.38% of deletion errors in Rmap ters. data generated from the Escherichia coli K-12 reference genome. Valouev et al. [14] provides a dynamic programming algo- Out of the deletion errors corrected, 98.26% are true errors. Sim- rithm for pairwise alignment, which generates a score for every ilarly, out of the insertion errors corrected, 82.19% are true er- possible alignment between two Rmaps and returns the align- rors. Furthermore, we show that the assembly of Rmaps is more ment that achieves the highest score, which is referred to as contiguous and covers a larger fraction of the genome if the the S-score. It is computed within a standard dynamic program- Rmaps are first error corrected. It also successfully scales to large ming framework, similar to Smith-Waterman alignment [20]. genomes, improving the quality of 78% and 99% of the Rmaps in The scoring function is based on a probabilistic model built on the plum and goat genome, respectively. the following assumptions: the fragment sizes follow an expo- nential distribution, the restriction sites follow an independent Bernoulli process, the number of false cuts in a given genomic Background length is a Poisson process, and the sizing error follows a nor- mal distribution with mean zero and variance following a linear From a computer science perspective, optical mapping can be function of the true size. Last, a different sizing error function seen as a process that takes in two strings, a nucleotide se- is used for fragments less than 4 kbp in length since they do quence S [1, n] and a restriction sequence B[1, b], and produces not converge to the defined normal distribution. The score of an an array (string) of integers R [1, m]. The array R is an Rmap cor- i i alignment is calculated as the sum of two functions: one func- responding to S and contains the string-lengths between cuts tion that estimates and scores the sizing error and a second that produced by B on S . Formally, R is defined as follows: R [j] = y − x i i i predicts and scores the presence of additional and/or missing th where y represents the location (starting index) of j occurrence cut sites between the fragments. The S-score will be used later th of B in S and x represents the location of (j − 1) occurrence of B to evaluate the error correction process. in S and R [1] = y − 1and R [m] = n − x.For example,say we have i i i B = act and S = atacttactggactactaaact. The locations of B in S are i i Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Mukherjee et al. 3 Methods Unfortunately, although this set contains all related Rmaps, it also likely contains Rmaps that are not related to R . Therefore, Given a set of n Rmaps R ={R , .., R }, our method aims to detect 1 n we filter this set of Rmaps using a simple heuristic that tries to and correct all errors in R by considering each R ∈ R and finding match each Rmap in this set with R in order to ascertain if it a set of Rmaps that originate from the same part of the genome is related to R . The heuristic traverses through two Rmaps (R i i as R . This step is performed heuristically in order to avoid align- and one Rmap from the set, say R ) attempting to match sub- ing every pair of Rmaps in R. sets of the fragments from each until it either reaches the end of one Rmap or it fails to match the fragments. We start the traver- sal from the first matching k-mer between R and R . We denote Preprocessing i j the position of the next fragment to be matched in R and R as i j Our first step is to remove the first and last fragments from each x and y, respectively, and assume that each fragment prior to Rmap in R. These fragments have one of their edges sheared by these positions is matched. Next, we consider all combinations artifacts of the DNA prep process (preceding the optical map- of matching the fragments at positions x, x+1,and x +2of R ping process) and not by restriction enzymes. Unless removed, with fragments at positions y, y+1,and y +2of R .Weevaluate they can misguide alignment between two Rmaps during the the cost of each combination based on the difference in the total error correction process. In addition, short Rmaps, i.e., those size of fragments from R and R .Thatis ∀α, β = [0, 2], i j that have fewer than 10 fragments, are removed at this stage since any Rmap that contains fewer than 10 fragments is typ- y+β x+α ically deemed too small for analysis even in consensus maps cost(x + α, y + β) =  R [g] − R [h] i j [21]. Next, the data are quantized so that a given genomic frag- g=x h=y ment is represented by the same value across multiple Rmaps despite the noise. Our quantization method assigns a unique where R [g]and R [h]denotethe g-th and h-th fragments of R i j i value to a range of fragment sizes by dividing each fragment and R , respectively. We select the combination with the least size by a fixed integer, denoted as b, and rounding to the near- cost; if there exists a tie, we select the match that has the least est integer. For example, if an Rmap R = {36, 13, 15, 20, 16, 5, number of added or missing cut sites (i.e., the combination with 21, 17} is quantized using b= 3, then the quantized Rmap will be the least value of α + β). If this selected match leads to a cost quantized R ={12, 4, 5, 7, 5, 2, 7, 6}. Say another Rmap, R = {17, 23, that is greater than a specified threshold (which was set to 25% 34, 12, 14, 21, 14, 5} has overlap with R ; however, due to noise in of the larger-sized fragment in practice), then we conclude that the data, this relation is not apparent. By quantizing R using the there is not a match at these positions and return that R and R i j quantized same b= 3, we get R ={6, 8, 11, 4, 5, 7, 5, 1}. This allows us are unrelated. Otherwise, we increment x and y accordingly and to uncover a region (in this case, {4, 5, 7, 5}) that is common to move onto the next fragments. If this heuristic continues until both the Rmaps. It should be noted that, in some cases, a frag- the last fragment of either R or R is reached, then we return that i j ment may have different values across two Rmaps even after R and R are related. Using this heuristic, we filter out the Rmaps i j quantization (e.g., the fragment values 36 from R and 34 from that were deemed to be related based on the number of k-mers R are quantized to 12 and 11, respectively). The quantized data in common with R but are in fact unrelated to R . i i areusedtofind theset of related Rmaps, as explained in the next The setting of parameters k and m are correlated. If the value section. of k is increased, that makes the k-mers more specific, hence, The setting of parameter b depends on the amount of sizing the value of m is lowered. On the other hand, if the value of k is error in the optical map data. With zero sizing error, b can be set reduced, then we increase the value of m.The valueof k is in- at 1. As sizing error increases, the value of b is increased accord- creased when there are fewer insertion and deletion errors and ingly. If the value of b is too small, we are not be able to uncover decreased otherwise. The default values are k = 4and m = 1. relations between overlapping Rmaps. If the value is too large, then unrelated Rmaps have common regions in their quantized Rmap alignment states, which makes them appear related. Considering the error rate of optical maps from BioNano genomics, the default value Next, for each R in R, we use the alignment method of Valouev of b = 4000. et al. [14]tofind the S-score of all pairwise alignments between R and each Rmap in its set of related Rmaps. The Rmaps that have an alignment score, i.e., S-score less than a defined thresh- Finding related Rmaps old (which we denote as S ), are removed from the set of related We refer to two Rmaps as related if their corresponding error-free Rmaps, and the alignments of the remaining Rmaps are stored Rmaps originate from overlapping regions of the genome. Next, in a multiple alignment grid, denoted as A .Thisgridisatwo- we define a k-mer as a string of k consecutive fragments from a dimensional array of integer pairs, where the number of rows is (quantized) Rmap. For example, if we have the Rmap R = {3, 3, 5, equal to the number of remaining Rmaps in the set of related 2, 6, 5, 5, 1} and k = 4, then the following k-mers can be extracted Rmaps of R and the number of columns is equal to the number from R: (3,3,5,2), (3,5,2,6), (5,2,6,5), (2,6,5,5),and (6,5,5,1). In order of fragments in R . An element of this array, A [ j, k], stores an i i to avoid aligning all pairs of Rmaps to find the related Rmaps, integer pair in the form of (x, y) representing that x fragments we use the number of common k-mers to discriminate between of R (which includes the k-th fragment of R ) matches to y frag- i i pairs of Rmaps that are related and those that are not. To accom- ments of R in the optimal alignment between R and R . Figure1 j i j plish this efficiently, we first extract all unique k-mers in each illustrates an example of A . The first fragment of R does not i i quantized Rmap and construct a hash table storing each unique match with any fragment of R and therefore (0,0) is stored at k-mer as a key and the list of Rmaps containing an occurrence this position. Fragments 2, 5, 6, 8, and 9 of R each matches with of that k-mer as the value. We call this the k-mer index. Next, one fragment of R , e.g., 1, 3, 4, 7, and 8, respectively. To repre- we consider each R in R and use the k-mer index to identify the sent these matches, we store a (1,1) in the 2nd, 5th, 6th, 8th, set of Rmaps that have m or more k-mers in common with R . and 9th columns of row j. Fragments 3 and 4 of R match with i i Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 4 Error correcting optical mapping data Figure 1: An alignment between R and R as givenbyValouevetal. [14] and its corresponding entry in the multiple alignment grid A . Each column of A represents i j i i one fragment from R , and each row represents one Rmap from its set of related Rmaps. The fragment sizes are in Kbp. one fragment of R , i.e., the 2nd fragment. To represent this, we error correction. As it is illustrated, to error correct the second store (2,1) in A [ j, 3] and A [ j, 4]. Fragment 7 of R matches with fragment of R , we compute the average of the matched frag- i i i i two fragments of R , i.e., the 5th and 6th fragments. To represent ments from related Rmaps 2, 3, 4, 5, and 6 and replace the sec- this, we store (1,2) in A [ j, 7]. Fragments 10 and 11 of R match ond fragment of R with that value as shown in Fig. 2. Similarly, i i i with two fragments of R , i.e., the 9th and 10th fragments. To to correct the third fragment in this example, we identify that represent this, we store (2,2) in positions A [ j, 10] and A [ j, 11]. (2,1) is in the consensus, which implies that the majority of the i i Finally, fragments 12 and 13 match with three fragments of R , related Rmaps are such that two fragments of R match with one j i i.e., fragments 11, 12, and 13. In this case, we store (2,3) in posi- fragment from the set of related Rmaps and therefore replace tions A [ j, 12] and A [ j, 13]. the third and fourth fragments with the average from the corre- i i The setting of parameter S controls the number of Rmaps sponding Rmaps and positions. that are included in the multiple-alignment grid of an Rmap. If The threshold d determines the accuracy and precision of er- we increase the value of S , fewer Rmaps will be added to the ror correction. A high value of d improves precision but lowers grid, but the ones included will be of higher quality (i.e., have accuracy as many fragments are left uncorrected. Similarly, a greater overlap with the Rmap under consideration). The default low value of d improves accuracy but lowers precision. The de- value for the parameter S = 8. We show in the Experiment sec- fault setting is d = 3. tion how we select this value. Complexity Error correcting using the consensus We define  to be the length of the longest Rmap in R. Quan- tization of the Rmaps takes O( × n) time. Constructing the k- The multiple alignment grid is used to find the consensus grid, mer index also takes O( × n) time. The k-mer index stores the denoted as C , for Rmap R .The grid C is a one-dimensional ar- i i i occurences of each quantized k-mer across all Rmaps. Let u be ray of integer pairs with size equal to the number of fragments maximum frequency of a k-mer. That is, a k-mer occurs in max in R . The grid is constructed for each R in R by iterating through i i u Rmaps (in practice u <<n). Then, the complexity of finding re- each column of A and finding the most frequent integer pair, lated Rmaps from the k-mer index is O(n ×  × u). For each Rmap, breaking ties arbitrarily. The most frequent integer pair is stored the filtering heuristic runs in time linear to the size of the Rmap. at each position of C if the frequency is above a given threshold Therefore, filtering the set of related Rmaps also takes linear d; otherwise, (0,0) is stored. Figure2 illustrates the construction O( × n) time. The most expensive step is the pairwise alignment of a consensus grid from an alignment grid. The type of error in that uses the Valouev et al. aligner. As mentioned earlier, this each fragment of R can be identified using C [k] = (x, y)asfol- i i aligner is based on DP and therefore has a O( ) time complexity lows: if x and y are equal, then a sizing error occurs at the k-th to perform one pairwise alignment. If the maximum cardinal- fragment of R ,otherwise,if x is greater than y,thenanaddi- ity of the set of related Rmaps for any Rmap is v, then the total tional cut site exists, and, last, if x is less than y, then a missing complexity of this step is bounded by O(n × v ×  ). The value of v cut site exists. Next, we use C and A to correct these errors in R . i i i depends on the coverage of the optical map data. The alignment For each fragment of R , we consider the consensus stored at the generated using the Valouev et al. method is stored in the multi- corresponding position of C , identify the positions in the corre- ple alignment grid in constant time, and it takes O(n × v × )time sponding column of A that are equal to it, and replace the frag- to generate the consensus maps for n Rmaps and error-correct ment of R with the mean total fragment size computed using the them. Thus, the runtime of cOMet is O(n × v ×  ). values at those positions in A .IfC is equal to (0,0) at any posi- i i tion, then the fragment at that position in R remains unchanged since it implies that there is no definitive result about the type of Datasets error in that position. In addition, if consecutive positions in C are discordant, then the fragments in those positions in R also We performed experiments on both simulated and real data. For remains unchanged. For example, if there is a (2,1) consensus at the real data, we used the Rmap data from the plum [22] and do- some position of C , then we expect the preceding or successive mestic goat [3] sequencing projects. These datasets were built on position to also have a (2,1) consensus. However, if this is not the OpGen mapping platform and are more error prone. We also the case, then we do not error correct those fragments since the experimented on a human dataset [23] built on the new BioNano consensus is discordant at those positions. Figure2 shows this platform. This dataset is built using the latest optical mapping Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Mukherjee et al. 5 Figure 2: Example of a multiple alignment grid and a consensus grid. The figure shows the multiple alignment grid A for an Rmap R and its consensus grid C . i i i Each row of the multiple alignment grid represents the alignment of R with one of its related Rmaps, and the columns represent the fragments of R . The figure i i also demonstrates error correction using the consensus grid, with the error-corrected Rmap denoted as R . The fragment sizes are in Kbp. To demonstrate the error correction process for the 3rd, 4th, and 5th fragments, we also include the fragments (in parentheses) to which they align. The error-corrected fragment is the mean of the fragments from the corresponding positions that have the same alignment as the consensus. For example, for the 5th fragment, the consensus is (1,1). Therefore, the mean of the aligned fragments with (1,1) alignment, i.e., 8.488, 8.132, 8.964, and 9.432, is the error-corrected value for the 5th fragment. Table 1: Summary of the real and simulated data In addition, we simulated Rmap data from E. coli K-12 sub- str. MG 1655 as follows. First, the reference genome was copied Genome Size No. of Rmaps 200 times, and uniformly distributed random loci were selected for each of these copies. These loci form the ends of a single E. coli 4.6 Mbp 2,504 molecule that would undergo in silico digestion. Next, molecules Plum 284 Mbp 749,895 smaller than 150 Kbp were discarded, and the cleavage sites for Goat 2.66 Gbp 3,447,997 the RsrII enzyme were then identified within each of these simu- Rmaps with fewer than 10 fragments were omitted from all the experiments. lated molecules. These error-free Rmap data are used for validat- cOMet was run on the remaining 2,504, 548,779, and 3,049,439 Rmaps for the E. ing the output of our method. Last, deletion, insertion, and siz- coli, plum, and goat genomes, respectively. ing errors were incorporated into the error-free Rmaps accord- ing to the error model discussed by Li et al. [15]. The error model Table 2: Summary of the real and simulated BioNano data. was described earlier in the Background section. This simula- tion resulted in 2,505 Rmaps containing 7,485 deletion and 554 Genome Size No. of Rmaps insertion errors. Last, we simulated optical map data from a simulation soft- E. coli 4.6 Mbp 123,251–157,743 ware called OMSim [24] that generates synthetic optical maps Human 3.2 Gbp 793,199 that mimics real Bionano Genomics data. The software takes two parameters as input: the false positive rate (FPR) rate, OMSim was used to simulate eight different BioNano datasets, each of which had varying error rates and, thus, had a different number of Rmaps. which is the number of additional cut sites erroneously in- serted per 100kbp, and the false negative rate (FNR), which is the percentage of times a cut site is missed. Using this Table 3: Results for the data simulated from E. coli K-12 MG 1655. method, we simulated eight datasets of Rmaps from E. coli K- Total no. of insertion errors 12 substr. MG 1655 using the restriction enzyme BspQI. The corrected 556 default FPR and FNR for BspQI are 1% and 15%, respectively. We generated additional datasets with the following error TPR of corrected insertions 82.49 % (457) rates (FPR,FNR) : (0.5%,15%), (1.0%,15%), (2.0%,15%),(5.0%,15%), FPR of corrected insertions 0.21 % (99) (1.0%,5%), (1.0%,25%), (2.0%,5%),and (2.0%,25%). Total no. of deletion errors 5,894 corrected TPR of corrected deletions 77.38 % (5,792) Experiments and Discussion FPR of corrected deletions 0.25 % (102) We performed all experiments on Intel E5-2698v3 processors The data was simulated according the algorithm described in Datasets. This sim- with 192 GB of random access memory (RAM) running 64-bit ulation resulted in 2,505 Rmaps containing 7,485 deletion and 554 insertion er- Linux. The input parameters to cOMet include b (quantization rors. bucket size), k (k-mer value), m (the number of k-mers needed to be conserved between two Rmaps), and d (the minimum number technology and has significantly better quality than the plum of Rmaps required to form consensus at a position). The default and goat genomes. The genome size and number of Rmaps for parameters are b = 4,000, k = 4, m = 1, and d = 3and led to the these species are shown in Tables 1 and 2. best result across all datasets. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 6 Error correcting optical mapping data Determining the value of S while the genome fraction covered by the two assembled maps from the corrected Rmaps was 82%. Moreover, the assembled The setting of the parameter S depends on the sensitivity of the maps from the uncorrected data had 47 insertion and deletion Valouev et al. aligner. If the alignment score between two Rmaps errors when aligned to the reference, while the error-corrected is less than S , then the aligned Rmaps are deemed to be un- data had only 34 such errors. In order to further contextual- related. We say an Rmap, R is overlapping with an Rmap, R if s t ize these results, we assemble the error-free Rmap dataset and at least 50% of R overlaps with R . That is, either the first half s t summarize this assembly in Table 4. or the second half of R is entirely and exactly (exact fragment matches) contained in R . Experiments with OMSim data We carried out the following experiment to determine the op- timum setting for S . From the set of simulated error-free Rmaps, To present the robustness of our method and its applicability we computed the set of overlapping Rmaps for each Rmap. We across datasets, we conducted experiments on synthetic data denote this set as related Rmaps. Then, we used the Valouev et from an optical map–simulating software called OMSim [24]. As al. aligner to score all pairwise wise alignments between the described in the Datasets section, we generated eight datasets simulated Rmaps (with errors added) and plot the scores in the of synthetic optical maps by varying the insertion and deletion form of a histogram, which is shown in Fig.3. The percentage of error rates. related Rmaps with an S-score less than 8 is 6.06%. Hence, we In the first experiment, we fixed the FNR at 15% and varied choose the setting of S = 8. the FPR between 0.5, 1.0, 2.0, and 5.0, respectively. For each of the four datasets, we align each Rmap (using the Valouev et al. aligner) before and after error correction to the reference opti- Experiments with our simulated data cal map obtained using the same restriction enzyme and report The cOMet error correction was run on the simulated E. coli data. the percent of Rmaps whose alignment S-score increased after The corrected Rmaps were then aligned to the error-free Rmaps error correction and the mean increase in the S-score. We note to determine the number of corrected insertions and deletions. that for each Rmap, the aligner returns the highest-scored align- The results of this experiment are shown in Table 3. To deter- ment, and the score represents how closely the Rmap aligns to the reference genome-wide optical map. Table 5 summarizes the mine the quality of error correction, we computed the true pos- itive rate (TPR), which is the ratio between the number of in- results from this experiment. We observe that the efficiency of sertion (or deletion) errors that cOMet correctly identified and error correction improves as the FPR is initially increased. When removed and the number of insertion (deletion) errors, and the the FPR reaches a high value of 5, the efficiency of error correc- FPR, which is the ratio between the number of insertion (or dele- tion drops. The mean S-score improves by more than 9 (∼10%) tion) errors that cOMet incorrectly identified and removed and when the FPR is reasonable. the total number of fragments not containing an insertion (dele- In the second experiment, we first fix the FPR at 1.0 and vary tion) error. The TPR is 82.49% and 77.38% with respect to the the FNR between 5%, 15%,and 25%, respectively. We then fix the number of corrected insertions and deletion errors; whereas the FPR at 2 and vary the FNR between 5%, 15%,and 25%, respec- FPR is 0.21% and 0.25% with respect to the number of corrected tively. We report the same results as in the previous experiment. insertions and deletion errors. This demonstrates the high accu- Table 6 shows the results. Similar to the previous experiment, racy of the correction made by cOMet. Our method also has high we find that the efficiency of error correction improves as the precision. Of the deletion errors corrected, 98.26% are true er- FNR increases from 5% to 25%. The error correction improves the quality of a high percentage of Rmaps (>70%) for all values rors. Similarly, of the insertion errors corrected, 82.19% are true errors. of parameters. Additionally, for each corrected Rmap, we computed the alignment S-score of both the original Rmap and the corrected Experiments with real data Rmap with the error-free Rmap. We found that for 96.5% of the Rmaps, the S-scores improved after error correction. In other Table 7 summarizes the results of running cOMet on the plum words, cOMet brought 96.5% Rmaps closer to their error-free and goat datasets. The plum and goat datasets do not contain state. The mean S-score before error correction was 44.91 and error-free Rmaps. Therefore, we are restricted to reporting the it improved by 14.03% to 51.30 after error correction. For 17.5% number of corrections made and the improvement to the S- of the Rmaps (415 Rmaps), the S-score improved by more than score. In order to compute the S-score before and after error 10. Last, we mention that the error correction was achieved in correction, we generated an in silico digested genome-wide op- 241 central processing unit (CPU) seconds and using 79.54 MB of tical map from the reference genome and aligned both the un- memory. corrected and corrected Rmap to the genome-wide optical map. To demonstrate the importance of error correction, we as- If it aligned to multiple positions, then we considered the align- sembled the Rmaps before and after error correction using the ment position where the corrected Rmap aligned with greatest Valouev et al. assembler [25]. Table 4 summarizes the results of S-score and considered the difference in the S-score when the this experiment. We assembled the uncorrected data into five uncorrected and corrected Rmap aligned to that position. How- assembled optical maps and the error-corrected data into two ever, we note that this process is error prone because of the frag- assembled optical maps. The N50 statistic of the assembly in- mented nature of the draft genomes and possible misassemblies creased from 1,242 Kbp for the uncorrected data to 3,348 Kbp for present in the genomes. We observed that the S-score after error the corrected data. Next, we aligned each assembled map to the correction improved for 78% of the plum Rmaps and 99% of the genome-wide (error-free) optical map using the Valouev et al. goat Rmaps. Figures 4 and 5 show the histograms of the distri- aligner in order to locate their positions on the genome and cal- bution of S-scores before and after error correction. For the plum culate the percentage of the genome that was covered by at least genome, the mean S-score improved from 8.60 before error cor- one of the assembled maps. The genome fraction covered by rection to 14.72 after error correction (a 71% improvement in the the five assembled maps from the uncorrected Rmaps was 80%, score), while for the goat genome, it improved from 9.38 before Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Mukherjee et al. 7 Figure 3: Distribution of S-scores of Rmap alignments between related Rmaps and unrelated Rmaps. Table 4: Assembly results of uncorrected Rmaps, corrected Rmaps, and error-free Rmaps using the Valouev et al. assembler Alignment location in Rmap status Assembled Map id Number of Map length reference fragments (Kbp) (start-loci, end-loci) Uncorrected Rmaps Assembled Map 0 75 921.41 (246,321) Assembled Map 1 88 1,242.40 (11,95) Assembled Map 2 30 531.65 (225,255) Assembled Map 3 44 759.16 (181,228) Assembled Map 4 107 1,699.60 (87,194) Corrected Rmaps Assembled Map 0 102 1,397.60 (225,322) Assembled Map 1 237 3,348 (8,230) Error-free Rmaps Assembled Map 0 60 808.74 (185,239) Assembled Map 1 91 1,100.5 (241,324) Assembled Map 2 104 2,474.4 (19,185) The Rmaps are simulated from the E. coli genome. Each assembled map is aligned to the reference genome-wide (error-free) optical map using the Valouev et al. aligner. The genome-wide optical map contains 383 fragments. Table 5: Efficiency of error correction when the FPR is varied and the FNR is fixed at 15% FPR No. of Percent of Rmaps Mean S-score Mean S-score Mean S-score with improved before error Rmaps S-score correction after error correction improvement 0.5 129,820 93.42 78.66 91.12 12.46 1.0 126,133 94.01 76.25 89.44 13.19 2.0 140,623 92.99 71.93 85.29 13.36 5.0 130,019 81.35 64.65 71.36 6.71 correction to 16.97 after correction (an 80.92% improvement in Based on these alignments, we then computed the fraction of the score). the genome covered by at least one original Rmap and the frac- We also measured the genome coverage, i.e., the fraction of tion of the genome covered by at least one corrected Rmap. On the genome covered by at least one Rmap, for both the origi- the goat genome, the genome coverage was 73.08% before cor- nal Rmaps and the corrected Rmaps as follows. First, we aligned rection and it increased to 84.56% after correction. The increase all Rmaps to the genome-wide optical map and then picked the in genome coverage shows that our method is able to correct best alignment for each original Rmap and each corrected Rmap. Rmaps from across the genome. Furthermore, it shows that even Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 8 Error correcting optical mapping data Table 6: Efficiency of error correction when the FNR is varied FPR FNR (%) No. of Percent of Rmaps Mean S-score Mean S-score Mean S-score with improved before error after error Rmaps S-score correction correction improvement 1.0 5 142,684 87.36 81.32 87.62 6.3 15 126,133 94.01 76.25 89.44 13.19 25 123,252 96.09 70.24 87.44 17.20 2.0 5 148,912 89.23 76.54 84.47 7.93 15 140,623 92.99 71.93 85.29 13.36 25 130,763 93.02 66.95 81.65 14.7 Figure 4: Alignment scores of Rmaps from the plum genome with the reference optical map. Before error correction, the S-score had a mean of 8.6 with a standard deviation of 6.49. After error correction, the mean S-score improved to 14.72 with standard deviation of 6.72. Table 7: Results on the Rmap data of plum and goat genomes and 105.7 CPU days for plum and goat, respectively), these fig- ures are not prohibitive given that this computation can easily Genome name Plum Goat be parallelized since the error correction process for each Rmap is independent. For example, we ran the goat genome on 20 Running time 7.4 days 105.7 days machines, and it required 126.84 hours for all Rmaps to be cor- Memory 12.20 GB 113.56 GB rected. In addition, we note that error correction of a dataset will No. of insertion 433,282 2,530,060 likely only be done once for any dataset, so 5.2 human days for errors corrected a large genome is not unreasonable. Last, the peak memory us- No. of deletion 430,329 3,187,023 age was 12.20 GB and 113.56 GB for plum and goat, respectively; errors corrected thus, cOMet is able to run on any modern server. Peak memory was measured as the maximum resident set size as reported by Next, we ran experiments on the human dataset. Again, the operating system with sufficient RAM to avoid paging. Running time is the since we do not possess the error-free Rmaps corresponding user process time, also reported by the operating system. to the raw Rmaps for this dataset, we follow a similar evalu- ation method as in the previous experiments. We performed our evaluation on an in silico digested human reference genome if Rmaps could not originally be reliably aligned to some regions (GenBank assembly accession: GCA 000001405.15, Genome Ref- of the genome, our method is sensitive enough to recover simi- erence Consortium Human Build 38) using BspQI, which was the lar Rmaps from these regions; thus, after correction, the fraction restriction enzyme that was used for generating the Rmap data. of the genome covered by aligned Rmaps is higher. For the plum, cOMet improved the S-score of 74.78% of the Rmaps. The aver- the genome coverage dropped negligibly from 99.01% before er- age S-score improved from 85.96 before error correction to 88.65 ror correction to 98.85% after error correction (which is less than after error correction. 1% of the genome size). In addition, as shown in Table 7, the running time and peak memory usage were recorded for the plum and goat genomes. Although these experiments have significant running times (7.4 Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Mukherjee et al. 9 Figure 5: Alignment scores of Rmaps from the goat genome with the reference optical map. The mean and standard deviation of the S-scores before error correction were 9.38 and 6.54, respectively. After error correction, the mean S-score improved to 16.97 with a standard deviation of 6.21. Conclusion License: GNU General Public License Research resource identifier: SCR 016276 Error correction of high-throughput sequencing data has be- An archival copy of the code is available via the GigaScience repos- come an imperative preprocessing step in genome assembly itory GigaDB[33]. since 2008 when Chaisson and Pevzner showed the dramatic im- provement it can have on the quality of the assembly [26, 27, 28]. For example, after error correction, the contig N50 size of Availability of supporting data an assembly of Rhodabacter sphaeroides improved from 233 bp to The optical mapping data for plum and goat are publicly avail- 7,793 bp using the same assembler [28]. Due to this inarguable able and can be accessed from their respective manuscripts [3, benefit on genome assembly, many methods have been devel- 22] and via GigaDB [34] The human optical map data can also oped for error correction of sequence reads, including BFC [29], be accessed from its manuscript [23]. The simulated data for E. Coral [30], EULER [26, 31], and Reptile [32]. Unfortunately, even .coli are provided in the github repository along with the python though there has been a massive effort in error correction of se- scripts used to generate it, and snapshots of the data and code quence data, there currently does not exist a publicly released are also included in GigaDB[33]. method for error correction of Rmap data—a method that would likely improve the quality of genome-wide optical map assem- blies and allow such assemblies to be computed with greater ef- Additional material ficiency. Here, we presented cOMet, an error correction method for In Figure S.1 we show the distribution of lengths of Rmaps whose Rmap data and demonstrated that it corrects and improves the S-score increases after error correction. From the distribution, quality of a high percentage of Rmaps in both simulated and we can tell that our method is able to error correct Rmaps of real datasets. As previously discussed, Rmap data are subject to all sizes. We also show the distribution of fragment sizes from Rmaps whose score increases after error correction in Figure S.2. high error rates. In addition to insertion and deletion errors, they contain sizing errors that necessitate the use of a dynamic pro- Figure S.1: Distribution of Rmap lengths whose S-score in- gramming algorithm for pairwise alignment and, subsequently, creased after error correction. The Rmaps are simulated from assembly. By correcting a significant number of errors in Rmap the E. coli K-12 substr. MG 1655 as explained in the text. data, cOMet can make it possible to use faster alignment meth- Figure S.2: Distribution of fragment sizes of Rmaps whose S- ods [18, 19, 17] and explore the development of more efficient score increased after error correction. The Rmaps are simulated Rmap assembly algorithms. from the E. coli K-12 substr. MG 1655 as explained in the text. Availability of source code Abbreviations Project name: cOMet CPU: central processing unit; FNR: false negative rate; FPR: false Project home page : https://github.com/kingufl/cOMet positive rate; RAM: random access memory; TPR: true positive Operating system(s): Linux rate; CCD : charged coupled device. Programming language: C++ Other requirements: GCC version 5.2.0 or higher Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 10 Error correcting optical mapping data (BLASR): application and theory. BMC Bioinformatics 2012;p. Funding K.M., D.W., M.M., and C.B. were funded by the National Science 17. Muggli MD, Puglisi SJ, Boucher C. Efficient indexed align- Foundation (1618814). L.S.was funded by the Academy of Finland ment of contigs to optical maps, Algorithms in Bioinfomatics. (grants 284598 [CoECGR], 308030, and 314170). WABI. 2014;68–81. 18. Leung AKY, Kwok TP, Wan R, et al. OMBlast: alignment tool for optical mapping using a seed-and-extend approach. References Bioinformatics 2016;p. btw620. 1. Schwartz DC, Li X, Hernandez LI, et al. Ordered restriction 19. Mendelowitz LM, Schwartz DC, Pop M. Maligner: a maps of Saccharomyces Cerevisiae chromosomes constructed fast ordered restriction map aligner. Bioinformatics by optical Mmapping. Science 1993;262:110–114. 2016;32(7):1016–1022. 2. Zhou S, Wei F, Nguyen J, et al. A single molecule scaffold for 20. Smith TF, Waterman MS. Identification of common molecu- the maize genome. PLoS Genetics 2009 11;5:e1000711. lar subsequences. J Mol Biol 1981;147(1):195–197. 3. Dong Y, Xie M, Jiang Y, et al. Sequencing and automated 21. Bradnam KR, Fass JN, Alexandrov A, et al. Assemblathon 2: whole-genome optical mapping of the genome of a domestic evaluating de novo methods of genome assembly in three ver- goat (Capra hircus). Nature Biotechnol 2013. http://dx.doi.org tebrate species. GigaScience 2013;2(1):1–31. /10.5524/100082. 22. Cai M, Chen W, Du D, et al. Genomic data of the plum (Prunus 4. Chamala S, Chanderbali AS, Der JP, et al. Assembly and vali- mume). GigaScience Database 2014. http://dx.doi.org/10.5524 dation of the genome of the nonmodel basal angiosperm Am- /100084. borella. Science 2013;342(6165):1516–1517. 23. Shi L, Guo Y, Dong C, et al. Long-read sequencing and de 5. Teague B, Waterman MS, Goldstein S, et al. High-resolution novo assembly of a Chinese genome. Nature Communica- human genome structure by single-molecule analysis. Proc tions 2016;http://dx.doi.org/10.1038/ncomms12065. Natl Acad Sci U S A 2010;107(24):10848–10853. 24. Miclotte G, Plaisance S, Rombauts S, et al. OMSim: a simulator 6. Ganapathy G, Howard JT, Ward JM, et al. De novo high- for optical map data. Bioinformatics 2017;2740–2. coverage sequencing and annotated assemblies of the 25. Valouev A, Schwartz DC, Zhou S, et al. An algorithm for budgerigar genome. GigaScience 2014;3,1–9. assembly of ordered restriction maps from single DNA 7. Muggli MD, Puglisi SJ, Ronen R, et al. Misassembly detection molecules. Proc Natl Acad Sci U S A 2006;103(43):15770–15775. using paired-end sequence reads and optical mapping data. 26. Chaisson MJ, Brinza D, Pevzner PA. De novo fragment assem- Bioinformatics 2015;31(12):i80–i88. bly with short mate-paired reads: does the read length mat- 8. Reslewic S, Zhou S, Place M, et al. Whole-senome Shotgun ter? Genome Res 2009;19(2):336–346. optical mapping of Rhodospirillum rubrum. Appl Environ Mi- 27. Ekblom R, Wolf JBW. A field guide to whole-genome se- crobiol 2005;71(9):5511–5522. quencing, assembly and annotation. Evolutionary Applica- 9. Zhou S, Deng W, Anantharaman TS, et al. A whole-genome tions 2014;7(9):1026–1042. Shotgun optical map of Yersinia pestis strain KIM. Appl Envi- 28. Salzberg SL, Phillippy AM, Zimin A, et al. GAGE: a critical ron Microbiol 2002;68(12):6321–6331. evaluation of genome assemblies and assembly algorithms. 10. Zhou S, Kile A, Kvikstad E, et al. Shotgun optical mapping Genome Res 2012;22(3):557–567. of the entire Leishmania major Friedlin genome. Mol Biochem 29. Li H. BFC: correcting Illumina sequencing errors. Bioinfor- Parasitol 2004;138(1):97–106. matics 2015;31(17):2885. 11. Zhou S, Bechner MC, Place M, et al. Validation of rice genome 30. Salmela L, Schroder ¨ J. Correcting errors in short reads by mul- sequence by optical mapping. BMC Genomics 2007;8(1):278. tiple alignments. Bioinformatics 2011;27(11):1455–1461. 12. Church DM, Goodstadt L, Hillier LW, et al. Lineage-specific bi- 31. Pevzner PA, Tang H, Waterman MS. An Eulerian path ap- ology revealed by a finished genome assembly of the mouse. proach to DNA fragment assembly. Proc Natl Acad Sci PLoS Biology 2009;7(5):e1000112+. 2001;98(17):9748–9753. 13. Zhou S, Herschleb J, Schwartz DC. A single molecule sys- 32. Yang X, Dorman KS, Aluru S. Reptile: representative tiling for tem for whole genome analysis. Perspectives in Bioanalysis short read error correction. Bioinformatics 2010;26(20):2526. 2007;2, 265–300. 33. Mukherjee K, Washimkar D, Muggli MD, et al. Supporting data 14. Valouev A, Li L, Liu YC, et al. Alignment of optical maps. J for “Error Correcting Optical Mapping Data.” GigaScience Comp Biol 2006;13(2):442–462. Database; 2018. http://dx.doi.org/10.5524/100434. 15. Li M, Mak ACY, Lam ET, et al. Towards a more accurate error 34 Bian C, Chen J, Chen W, Genomic data of the goat (Capra hir- model for BioNano optical maps. In: ISBRA 2016;pp. 67–79. cus). GigaScience Database 2014, http://dx.doi.org/10.5524/1 16. Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png GigaScience Oxford University Press

Error correcting optical mapping data

Free
10 pages

Loading next page...
 
/lp/ou_press/error-correcting-optical-mapping-data-08KnM3d8IT
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press.
eISSN
2047-217X
D.O.I.
10.1093/gigascience/giy061
Publisher site
See Article on Publisher Site

Abstract

Optical mapping is a unique system that is capable of producing high-resolution, high-throughput genomic map data that gives information about the structure of a genome . Recently it has been used for scaffolding contigs and for assembly validation for large-scale sequencing projects, including the maize, goat, and Amborella genomes. However, a major impediment in the use of this data is the variety and quantity of errors in the raw optical mapping data, which are called Rmaps. The challenges associated with using Rmap data are analogous to dealing with insertions and deletions in the alignment of long reads. Moreover, they are arguably harder to tackle since the data are numerical and susceptible to inaccuracy. We develop cOMet to error correct Rmap data, which to the best of our knowledge is the only optical mapping error correction method. Our experimental results demonstrate that cOMet has high prevision and corrects 82.49% of insertion errors and 77.38% of deletion errors in Rmap data generated from the Escherichia coli K-12 reference genome. Out of the deletion errors corrected, 98.26% are true errors. Similarly, out of the insertion errors corrected, 82.19% are true errors. It also successfully scales to large genomes, improving the quality of 78% and 99% of the Rmaps in the plum and goat genomes, respectively. Last, we show the utility of error correction by demonstrating how it improves the assembly of Rmap data. Error corrected Rmap data results in an assembly that is more contiguous and covers a larger fraction of the genome. Keywords: optical mapping; error correction Introduction DNA molecules cling to the surface of a microscope slide using electrostatic charge and are digested with one or more restric- In 1993 Schwartz et al. developed optical mapping, a system for tion enzymes. The restriction enzymes cut the DNA molecule creating an ordered, genome-wide, high-resolution restriction at occurrences of the enzyme’s recognition sequence, forming a map of a given organism’s genome [1]. Since this initial devel- number of DNA fragments. The fragments formed by digestion opment, genome-wide optical maps have found numerous ap- are painted with a fluorescent dye to allow visibility under laser plications including discovering structural variations and rear- light and a CCD (Charged Coupled device) camera. Computer rangements [5], scaffolding and validating contigs for several vision algorithms then estimate fragment length from consol- large sequencing projects [4,3, 6], and detecting misassembled idated intensity of fluorescent dye and apparent distance be- regions in draft genomes [7]. Thus, optical mapping has assisted tween fragment ends. in the assembly of a variety of species including various prokary- The resulting data from an experiment are in the form of an ote species [8, 9, 10], rice [11], maize [2], mouse [12], goat [3], ordered series of fragment lengths [13]. The data for each sin- parrot [6], and Amborella trichopoda [4]. The raw optical mapping gle molecule produced by the system is called an Rmap. Rmap data are generated by a biological experiment in which large data have a number of errors due to the experimental conditions Received: 22 June 2017; Revised: 4 December 2017; Accepted: 16 May 2018 The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 2 Error correcting optical mapping data and system limitations. In an optical mapping experiment, it is as follows: 3,7,12,15,20. Then, R will be represented as R = 2, 4, i i unlikely to achieve perfectly uniform fluorescent staining. This 5, 3, 5, 2. The size of an Rmap denotes the number of fragments leads to an erroneous estimation of fragment sizes. Also, restric- in that Rmap. Therefore, the size of R is 6. tion enzymes often fail to digest all occurrences of their recog- We note that millions of Rmaps are produced for a single nition sequence across the DNA molecule. This manifests as genome since optical mapping is performed on many cells of missing restriction sites. Additionally, due to the fragile nature the organism and each cell provides thousands of Rmaps. The of DNA, additional breaks can incorrectly appear as restriction Rmaps can be assembled to produce a genome-wide optical sites. Last, the limitations of the imaging component of the opti- map. This is analogous to next-generation shotgun sequencing cal mapping system and the propensity for the DNA to ball up at where Rmaps are analogous to reads and a genome-wide optical the ends introduces more sizing error for smaller fragments. In- map is analogous to the assembled whole genome. terested readers will find more details about the causes of these There are three types of errors that can occur in optical map- errors in Valouev et al. [14] and Li et al. [15]. Because of all these ping: (1) missing cut sites, which are caused by an enzyme not experimental conditions, Rmap data generated through optical cleaving at a specific site; (2) additional cut sites, which can mapping experiment have insertion (added cut sites) and dele- occur due to random DNA breakage; and (3) inaccuracy in the tion (missed cut sites) errors along with fragment sizing errors. fragment size due to the inability of the system to accurately In most applications of optical map data, the Rmaps need estimate the fragment size. Continuing again with the exam- to be assembled into a genome-wide optical map. This is be- ple above, a more representative example Rmap would include cause the single molecule maps need redundant sampling to these errors, such as R = 7, 6, 3, 4. overcome the presence of the aforementioned errors and be- The error rates of optical maps depend on the platform used cause single molecule maps only span on the order of 500 Kbp for generating the maps. Li et al. [15] recently studied the error [14]. The first step of this assembly process involves finding pair- rates of optical maps produced by the Irys system from BioNano wise alignments among the Rmaps. In order to accomplish this, Genomics. According to their study, a missing cut site type of er- the challenge of dealing with missing fragment sizes has to be ror, i.e., error type (1),happens when a restriction site is incom- overcome. This challenge is analogous to dealing with insertions pletely digested by the enzyme and causes two flanking frag- and deletions in the alignment of long reads [16]. In fact, it is ar- ments to merge into one large fragment. The probability of com- guably harder since the data are numerical. At present, the only plete digestion of a restriction site can be modeled as a Bernoulli nonproprietary algorithmic method for pairwise alignment of trial whose probability of success is a function of the size of Rmaps is the dynamic programming-based method of Valouev the two flanking fragments. Additional cut sites, i.e., error type et al. [14], which runs in O(α × β)timewhere α and β are the (2),results from random breaks of the DNA molecule. The num- number of fragments in the two Rmaps being aligned. To align ber of false cuts per unit length of DNA follows a Poisson distri- an optical map dataset containing n Rmaps, the complexity be- bution. The inaccuracy of the fragment sizes, i.e., error type (3), 2 2 comes O(n ×  )where  is theaverage sizeofanRmap. is modeled using a Laplace distribution. If the observed and ac- This method is inherently computationally intensive. How- tual size of a fragment are o and r , respectively, then the sizing k k ever, if the error rate of the data could be improved, then non- error is defined as s = o /r and k k k dynamic programming-based methods that are orders of mag- nitude faster such as Twin [17], OMBlast [18], and Maligner [19] s ∼ Laplace(μ,β) could be used for alignment. This would greatly improve the time required to assemble Rmap data. Thus, we present cOMet where μ and β, the parameters of the laplace distribution, are in order to address this need. To the best of our knowledge, it functions of r . In practice, when aligning a pair of Rmaps, one is the first Rmap error correction method. Our experimental re- should allow for twice the error rate of a single Rmap since each sults demonstrate that cOMet has high precision and corrects Rmap will deviate from the genomic map by the above parame- 82.49% of insertion errors and 77.38% of deletion errors in Rmap ters. data generated from the Escherichia coli K-12 reference genome. Valouev et al. [14] provides a dynamic programming algo- Out of the deletion errors corrected, 98.26% are true errors. Sim- rithm for pairwise alignment, which generates a score for every ilarly, out of the insertion errors corrected, 82.19% are true er- possible alignment between two Rmaps and returns the align- rors. Furthermore, we show that the assembly of Rmaps is more ment that achieves the highest score, which is referred to as contiguous and covers a larger fraction of the genome if the the S-score. It is computed within a standard dynamic program- Rmaps are first error corrected. It also successfully scales to large ming framework, similar to Smith-Waterman alignment [20]. genomes, improving the quality of 78% and 99% of the Rmaps in The scoring function is based on a probabilistic model built on the plum and goat genome, respectively. the following assumptions: the fragment sizes follow an expo- nential distribution, the restriction sites follow an independent Bernoulli process, the number of false cuts in a given genomic Background length is a Poisson process, and the sizing error follows a nor- mal distribution with mean zero and variance following a linear From a computer science perspective, optical mapping can be function of the true size. Last, a different sizing error function seen as a process that takes in two strings, a nucleotide se- is used for fragments less than 4 kbp in length since they do quence S [1, n] and a restriction sequence B[1, b], and produces not converge to the defined normal distribution. The score of an an array (string) of integers R [1, m]. The array R is an Rmap cor- i i alignment is calculated as the sum of two functions: one func- responding to S and contains the string-lengths between cuts tion that estimates and scores the sizing error and a second that produced by B on S . Formally, R is defined as follows: R [j] = y − x i i i predicts and scores the presence of additional and/or missing th where y represents the location (starting index) of j occurrence cut sites between the fragments. The S-score will be used later th of B in S and x represents the location of (j − 1) occurrence of B to evaluate the error correction process. in S and R [1] = y − 1and R [m] = n − x.For example,say we have i i i B = act and S = atacttactggactactaaact. The locations of B in S are i i Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Mukherjee et al. 3 Methods Unfortunately, although this set contains all related Rmaps, it also likely contains Rmaps that are not related to R . Therefore, Given a set of n Rmaps R ={R , .., R }, our method aims to detect 1 n we filter this set of Rmaps using a simple heuristic that tries to and correct all errors in R by considering each R ∈ R and finding match each Rmap in this set with R in order to ascertain if it a set of Rmaps that originate from the same part of the genome is related to R . The heuristic traverses through two Rmaps (R i i as R . This step is performed heuristically in order to avoid align- and one Rmap from the set, say R ) attempting to match sub- ing every pair of Rmaps in R. sets of the fragments from each until it either reaches the end of one Rmap or it fails to match the fragments. We start the traver- sal from the first matching k-mer between R and R . We denote Preprocessing i j the position of the next fragment to be matched in R and R as i j Our first step is to remove the first and last fragments from each x and y, respectively, and assume that each fragment prior to Rmap in R. These fragments have one of their edges sheared by these positions is matched. Next, we consider all combinations artifacts of the DNA prep process (preceding the optical map- of matching the fragments at positions x, x+1,and x +2of R ping process) and not by restriction enzymes. Unless removed, with fragments at positions y, y+1,and y +2of R .Weevaluate they can misguide alignment between two Rmaps during the the cost of each combination based on the difference in the total error correction process. In addition, short Rmaps, i.e., those size of fragments from R and R .Thatis ∀α, β = [0, 2], i j that have fewer than 10 fragments, are removed at this stage since any Rmap that contains fewer than 10 fragments is typ- y+β x+α ically deemed too small for analysis even in consensus maps cost(x + α, y + β) =  R [g] − R [h] i j [21]. Next, the data are quantized so that a given genomic frag- g=x h=y ment is represented by the same value across multiple Rmaps despite the noise. Our quantization method assigns a unique where R [g]and R [h]denotethe g-th and h-th fragments of R i j i value to a range of fragment sizes by dividing each fragment and R , respectively. We select the combination with the least size by a fixed integer, denoted as b, and rounding to the near- cost; if there exists a tie, we select the match that has the least est integer. For example, if an Rmap R = {36, 13, 15, 20, 16, 5, number of added or missing cut sites (i.e., the combination with 21, 17} is quantized using b= 3, then the quantized Rmap will be the least value of α + β). If this selected match leads to a cost quantized R ={12, 4, 5, 7, 5, 2, 7, 6}. Say another Rmap, R = {17, 23, that is greater than a specified threshold (which was set to 25% 34, 12, 14, 21, 14, 5} has overlap with R ; however, due to noise in of the larger-sized fragment in practice), then we conclude that the data, this relation is not apparent. By quantizing R using the there is not a match at these positions and return that R and R i j quantized same b= 3, we get R ={6, 8, 11, 4, 5, 7, 5, 1}. This allows us are unrelated. Otherwise, we increment x and y accordingly and to uncover a region (in this case, {4, 5, 7, 5}) that is common to move onto the next fragments. If this heuristic continues until both the Rmaps. It should be noted that, in some cases, a frag- the last fragment of either R or R is reached, then we return that i j ment may have different values across two Rmaps even after R and R are related. Using this heuristic, we filter out the Rmaps i j quantization (e.g., the fragment values 36 from R and 34 from that were deemed to be related based on the number of k-mers R are quantized to 12 and 11, respectively). The quantized data in common with R but are in fact unrelated to R . i i areusedtofind theset of related Rmaps, as explained in the next The setting of parameters k and m are correlated. If the value section. of k is increased, that makes the k-mers more specific, hence, The setting of parameter b depends on the amount of sizing the value of m is lowered. On the other hand, if the value of k is error in the optical map data. With zero sizing error, b can be set reduced, then we increase the value of m.The valueof k is in- at 1. As sizing error increases, the value of b is increased accord- creased when there are fewer insertion and deletion errors and ingly. If the value of b is too small, we are not be able to uncover decreased otherwise. The default values are k = 4and m = 1. relations between overlapping Rmaps. If the value is too large, then unrelated Rmaps have common regions in their quantized Rmap alignment states, which makes them appear related. Considering the error rate of optical maps from BioNano genomics, the default value Next, for each R in R, we use the alignment method of Valouev of b = 4000. et al. [14]tofind the S-score of all pairwise alignments between R and each Rmap in its set of related Rmaps. The Rmaps that have an alignment score, i.e., S-score less than a defined thresh- Finding related Rmaps old (which we denote as S ), are removed from the set of related We refer to two Rmaps as related if their corresponding error-free Rmaps, and the alignments of the remaining Rmaps are stored Rmaps originate from overlapping regions of the genome. Next, in a multiple alignment grid, denoted as A .Thisgridisatwo- we define a k-mer as a string of k consecutive fragments from a dimensional array of integer pairs, where the number of rows is (quantized) Rmap. For example, if we have the Rmap R = {3, 3, 5, equal to the number of remaining Rmaps in the set of related 2, 6, 5, 5, 1} and k = 4, then the following k-mers can be extracted Rmaps of R and the number of columns is equal to the number from R: (3,3,5,2), (3,5,2,6), (5,2,6,5), (2,6,5,5),and (6,5,5,1). In order of fragments in R . An element of this array, A [ j, k], stores an i i to avoid aligning all pairs of Rmaps to find the related Rmaps, integer pair in the form of (x, y) representing that x fragments we use the number of common k-mers to discriminate between of R (which includes the k-th fragment of R ) matches to y frag- i i pairs of Rmaps that are related and those that are not. To accom- ments of R in the optimal alignment between R and R . Figure1 j i j plish this efficiently, we first extract all unique k-mers in each illustrates an example of A . The first fragment of R does not i i quantized Rmap and construct a hash table storing each unique match with any fragment of R and therefore (0,0) is stored at k-mer as a key and the list of Rmaps containing an occurrence this position. Fragments 2, 5, 6, 8, and 9 of R each matches with of that k-mer as the value. We call this the k-mer index. Next, one fragment of R , e.g., 1, 3, 4, 7, and 8, respectively. To repre- we consider each R in R and use the k-mer index to identify the sent these matches, we store a (1,1) in the 2nd, 5th, 6th, 8th, set of Rmaps that have m or more k-mers in common with R . and 9th columns of row j. Fragments 3 and 4 of R match with i i Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 4 Error correcting optical mapping data Figure 1: An alignment between R and R as givenbyValouevetal. [14] and its corresponding entry in the multiple alignment grid A . Each column of A represents i j i i one fragment from R , and each row represents one Rmap from its set of related Rmaps. The fragment sizes are in Kbp. one fragment of R , i.e., the 2nd fragment. To represent this, we error correction. As it is illustrated, to error correct the second store (2,1) in A [ j, 3] and A [ j, 4]. Fragment 7 of R matches with fragment of R , we compute the average of the matched frag- i i i i two fragments of R , i.e., the 5th and 6th fragments. To represent ments from related Rmaps 2, 3, 4, 5, and 6 and replace the sec- this, we store (1,2) in A [ j, 7]. Fragments 10 and 11 of R match ond fragment of R with that value as shown in Fig. 2. Similarly, i i i with two fragments of R , i.e., the 9th and 10th fragments. To to correct the third fragment in this example, we identify that represent this, we store (2,2) in positions A [ j, 10] and A [ j, 11]. (2,1) is in the consensus, which implies that the majority of the i i Finally, fragments 12 and 13 match with three fragments of R , related Rmaps are such that two fragments of R match with one j i i.e., fragments 11, 12, and 13. In this case, we store (2,3) in posi- fragment from the set of related Rmaps and therefore replace tions A [ j, 12] and A [ j, 13]. the third and fourth fragments with the average from the corre- i i The setting of parameter S controls the number of Rmaps sponding Rmaps and positions. that are included in the multiple-alignment grid of an Rmap. If The threshold d determines the accuracy and precision of er- we increase the value of S , fewer Rmaps will be added to the ror correction. A high value of d improves precision but lowers grid, but the ones included will be of higher quality (i.e., have accuracy as many fragments are left uncorrected. Similarly, a greater overlap with the Rmap under consideration). The default low value of d improves accuracy but lowers precision. The de- value for the parameter S = 8. We show in the Experiment sec- fault setting is d = 3. tion how we select this value. Complexity Error correcting using the consensus We define  to be the length of the longest Rmap in R. Quan- tization of the Rmaps takes O( × n) time. Constructing the k- The multiple alignment grid is used to find the consensus grid, mer index also takes O( × n) time. The k-mer index stores the denoted as C , for Rmap R .The grid C is a one-dimensional ar- i i i occurences of each quantized k-mer across all Rmaps. Let u be ray of integer pairs with size equal to the number of fragments maximum frequency of a k-mer. That is, a k-mer occurs in max in R . The grid is constructed for each R in R by iterating through i i u Rmaps (in practice u <<n). Then, the complexity of finding re- each column of A and finding the most frequent integer pair, lated Rmaps from the k-mer index is O(n ×  × u). For each Rmap, breaking ties arbitrarily. The most frequent integer pair is stored the filtering heuristic runs in time linear to the size of the Rmap. at each position of C if the frequency is above a given threshold Therefore, filtering the set of related Rmaps also takes linear d; otherwise, (0,0) is stored. Figure2 illustrates the construction O( × n) time. The most expensive step is the pairwise alignment of a consensus grid from an alignment grid. The type of error in that uses the Valouev et al. aligner. As mentioned earlier, this each fragment of R can be identified using C [k] = (x, y)asfol- i i aligner is based on DP and therefore has a O( ) time complexity lows: if x and y are equal, then a sizing error occurs at the k-th to perform one pairwise alignment. If the maximum cardinal- fragment of R ,otherwise,if x is greater than y,thenanaddi- ity of the set of related Rmaps for any Rmap is v, then the total tional cut site exists, and, last, if x is less than y, then a missing complexity of this step is bounded by O(n × v ×  ). The value of v cut site exists. Next, we use C and A to correct these errors in R . i i i depends on the coverage of the optical map data. The alignment For each fragment of R , we consider the consensus stored at the generated using the Valouev et al. method is stored in the multi- corresponding position of C , identify the positions in the corre- ple alignment grid in constant time, and it takes O(n × v × )time sponding column of A that are equal to it, and replace the frag- to generate the consensus maps for n Rmaps and error-correct ment of R with the mean total fragment size computed using the them. Thus, the runtime of cOMet is O(n × v ×  ). values at those positions in A .IfC is equal to (0,0) at any posi- i i tion, then the fragment at that position in R remains unchanged since it implies that there is no definitive result about the type of Datasets error in that position. In addition, if consecutive positions in C are discordant, then the fragments in those positions in R also We performed experiments on both simulated and real data. For remains unchanged. For example, if there is a (2,1) consensus at the real data, we used the Rmap data from the plum [22] and do- some position of C , then we expect the preceding or successive mestic goat [3] sequencing projects. These datasets were built on position to also have a (2,1) consensus. However, if this is not the OpGen mapping platform and are more error prone. We also the case, then we do not error correct those fragments since the experimented on a human dataset [23] built on the new BioNano consensus is discordant at those positions. Figure2 shows this platform. This dataset is built using the latest optical mapping Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Mukherjee et al. 5 Figure 2: Example of a multiple alignment grid and a consensus grid. The figure shows the multiple alignment grid A for an Rmap R and its consensus grid C . i i i Each row of the multiple alignment grid represents the alignment of R with one of its related Rmaps, and the columns represent the fragments of R . The figure i i also demonstrates error correction using the consensus grid, with the error-corrected Rmap denoted as R . The fragment sizes are in Kbp. To demonstrate the error correction process for the 3rd, 4th, and 5th fragments, we also include the fragments (in parentheses) to which they align. The error-corrected fragment is the mean of the fragments from the corresponding positions that have the same alignment as the consensus. For example, for the 5th fragment, the consensus is (1,1). Therefore, the mean of the aligned fragments with (1,1) alignment, i.e., 8.488, 8.132, 8.964, and 9.432, is the error-corrected value for the 5th fragment. Table 1: Summary of the real and simulated data In addition, we simulated Rmap data from E. coli K-12 sub- str. MG 1655 as follows. First, the reference genome was copied Genome Size No. of Rmaps 200 times, and uniformly distributed random loci were selected for each of these copies. These loci form the ends of a single E. coli 4.6 Mbp 2,504 molecule that would undergo in silico digestion. Next, molecules Plum 284 Mbp 749,895 smaller than 150 Kbp were discarded, and the cleavage sites for Goat 2.66 Gbp 3,447,997 the RsrII enzyme were then identified within each of these simu- Rmaps with fewer than 10 fragments were omitted from all the experiments. lated molecules. These error-free Rmap data are used for validat- cOMet was run on the remaining 2,504, 548,779, and 3,049,439 Rmaps for the E. ing the output of our method. Last, deletion, insertion, and siz- coli, plum, and goat genomes, respectively. ing errors were incorporated into the error-free Rmaps accord- ing to the error model discussed by Li et al. [15]. The error model Table 2: Summary of the real and simulated BioNano data. was described earlier in the Background section. This simula- tion resulted in 2,505 Rmaps containing 7,485 deletion and 554 Genome Size No. of Rmaps insertion errors. Last, we simulated optical map data from a simulation soft- E. coli 4.6 Mbp 123,251–157,743 ware called OMSim [24] that generates synthetic optical maps Human 3.2 Gbp 793,199 that mimics real Bionano Genomics data. The software takes two parameters as input: the false positive rate (FPR) rate, OMSim was used to simulate eight different BioNano datasets, each of which had varying error rates and, thus, had a different number of Rmaps. which is the number of additional cut sites erroneously in- serted per 100kbp, and the false negative rate (FNR), which is the percentage of times a cut site is missed. Using this Table 3: Results for the data simulated from E. coli K-12 MG 1655. method, we simulated eight datasets of Rmaps from E. coli K- Total no. of insertion errors 12 substr. MG 1655 using the restriction enzyme BspQI. The corrected 556 default FPR and FNR for BspQI are 1% and 15%, respectively. We generated additional datasets with the following error TPR of corrected insertions 82.49 % (457) rates (FPR,FNR) : (0.5%,15%), (1.0%,15%), (2.0%,15%),(5.0%,15%), FPR of corrected insertions 0.21 % (99) (1.0%,5%), (1.0%,25%), (2.0%,5%),and (2.0%,25%). Total no. of deletion errors 5,894 corrected TPR of corrected deletions 77.38 % (5,792) Experiments and Discussion FPR of corrected deletions 0.25 % (102) We performed all experiments on Intel E5-2698v3 processors The data was simulated according the algorithm described in Datasets. This sim- with 192 GB of random access memory (RAM) running 64-bit ulation resulted in 2,505 Rmaps containing 7,485 deletion and 554 insertion er- Linux. The input parameters to cOMet include b (quantization rors. bucket size), k (k-mer value), m (the number of k-mers needed to be conserved between two Rmaps), and d (the minimum number technology and has significantly better quality than the plum of Rmaps required to form consensus at a position). The default and goat genomes. The genome size and number of Rmaps for parameters are b = 4,000, k = 4, m = 1, and d = 3and led to the these species are shown in Tables 1 and 2. best result across all datasets. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 6 Error correcting optical mapping data Determining the value of S while the genome fraction covered by the two assembled maps from the corrected Rmaps was 82%. Moreover, the assembled The setting of the parameter S depends on the sensitivity of the maps from the uncorrected data had 47 insertion and deletion Valouev et al. aligner. If the alignment score between two Rmaps errors when aligned to the reference, while the error-corrected is less than S , then the aligned Rmaps are deemed to be un- data had only 34 such errors. In order to further contextual- related. We say an Rmap, R is overlapping with an Rmap, R if s t ize these results, we assemble the error-free Rmap dataset and at least 50% of R overlaps with R . That is, either the first half s t summarize this assembly in Table 4. or the second half of R is entirely and exactly (exact fragment matches) contained in R . Experiments with OMSim data We carried out the following experiment to determine the op- timum setting for S . From the set of simulated error-free Rmaps, To present the robustness of our method and its applicability we computed the set of overlapping Rmaps for each Rmap. We across datasets, we conducted experiments on synthetic data denote this set as related Rmaps. Then, we used the Valouev et from an optical map–simulating software called OMSim [24]. As al. aligner to score all pairwise wise alignments between the described in the Datasets section, we generated eight datasets simulated Rmaps (with errors added) and plot the scores in the of synthetic optical maps by varying the insertion and deletion form of a histogram, which is shown in Fig.3. The percentage of error rates. related Rmaps with an S-score less than 8 is 6.06%. Hence, we In the first experiment, we fixed the FNR at 15% and varied choose the setting of S = 8. the FPR between 0.5, 1.0, 2.0, and 5.0, respectively. For each of the four datasets, we align each Rmap (using the Valouev et al. aligner) before and after error correction to the reference opti- Experiments with our simulated data cal map obtained using the same restriction enzyme and report The cOMet error correction was run on the simulated E. coli data. the percent of Rmaps whose alignment S-score increased after The corrected Rmaps were then aligned to the error-free Rmaps error correction and the mean increase in the S-score. We note to determine the number of corrected insertions and deletions. that for each Rmap, the aligner returns the highest-scored align- The results of this experiment are shown in Table 3. To deter- ment, and the score represents how closely the Rmap aligns to the reference genome-wide optical map. Table 5 summarizes the mine the quality of error correction, we computed the true pos- itive rate (TPR), which is the ratio between the number of in- results from this experiment. We observe that the efficiency of sertion (or deletion) errors that cOMet correctly identified and error correction improves as the FPR is initially increased. When removed and the number of insertion (deletion) errors, and the the FPR reaches a high value of 5, the efficiency of error correc- FPR, which is the ratio between the number of insertion (or dele- tion drops. The mean S-score improves by more than 9 (∼10%) tion) errors that cOMet incorrectly identified and removed and when the FPR is reasonable. the total number of fragments not containing an insertion (dele- In the second experiment, we first fix the FPR at 1.0 and vary tion) error. The TPR is 82.49% and 77.38% with respect to the the FNR between 5%, 15%,and 25%, respectively. We then fix the number of corrected insertions and deletion errors; whereas the FPR at 2 and vary the FNR between 5%, 15%,and 25%, respec- FPR is 0.21% and 0.25% with respect to the number of corrected tively. We report the same results as in the previous experiment. insertions and deletion errors. This demonstrates the high accu- Table 6 shows the results. Similar to the previous experiment, racy of the correction made by cOMet. Our method also has high we find that the efficiency of error correction improves as the precision. Of the deletion errors corrected, 98.26% are true er- FNR increases from 5% to 25%. The error correction improves the quality of a high percentage of Rmaps (>70%) for all values rors. Similarly, of the insertion errors corrected, 82.19% are true errors. of parameters. Additionally, for each corrected Rmap, we computed the alignment S-score of both the original Rmap and the corrected Experiments with real data Rmap with the error-free Rmap. We found that for 96.5% of the Rmaps, the S-scores improved after error correction. In other Table 7 summarizes the results of running cOMet on the plum words, cOMet brought 96.5% Rmaps closer to their error-free and goat datasets. The plum and goat datasets do not contain state. The mean S-score before error correction was 44.91 and error-free Rmaps. Therefore, we are restricted to reporting the it improved by 14.03% to 51.30 after error correction. For 17.5% number of corrections made and the improvement to the S- of the Rmaps (415 Rmaps), the S-score improved by more than score. In order to compute the S-score before and after error 10. Last, we mention that the error correction was achieved in correction, we generated an in silico digested genome-wide op- 241 central processing unit (CPU) seconds and using 79.54 MB of tical map from the reference genome and aligned both the un- memory. corrected and corrected Rmap to the genome-wide optical map. To demonstrate the importance of error correction, we as- If it aligned to multiple positions, then we considered the align- sembled the Rmaps before and after error correction using the ment position where the corrected Rmap aligned with greatest Valouev et al. assembler [25]. Table 4 summarizes the results of S-score and considered the difference in the S-score when the this experiment. We assembled the uncorrected data into five uncorrected and corrected Rmap aligned to that position. How- assembled optical maps and the error-corrected data into two ever, we note that this process is error prone because of the frag- assembled optical maps. The N50 statistic of the assembly in- mented nature of the draft genomes and possible misassemblies creased from 1,242 Kbp for the uncorrected data to 3,348 Kbp for present in the genomes. We observed that the S-score after error the corrected data. Next, we aligned each assembled map to the correction improved for 78% of the plum Rmaps and 99% of the genome-wide (error-free) optical map using the Valouev et al. goat Rmaps. Figures 4 and 5 show the histograms of the distri- aligner in order to locate their positions on the genome and cal- bution of S-scores before and after error correction. For the plum culate the percentage of the genome that was covered by at least genome, the mean S-score improved from 8.60 before error cor- one of the assembled maps. The genome fraction covered by rection to 14.72 after error correction (a 71% improvement in the the five assembled maps from the uncorrected Rmaps was 80%, score), while for the goat genome, it improved from 9.38 before Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Mukherjee et al. 7 Figure 3: Distribution of S-scores of Rmap alignments between related Rmaps and unrelated Rmaps. Table 4: Assembly results of uncorrected Rmaps, corrected Rmaps, and error-free Rmaps using the Valouev et al. assembler Alignment location in Rmap status Assembled Map id Number of Map length reference fragments (Kbp) (start-loci, end-loci) Uncorrected Rmaps Assembled Map 0 75 921.41 (246,321) Assembled Map 1 88 1,242.40 (11,95) Assembled Map 2 30 531.65 (225,255) Assembled Map 3 44 759.16 (181,228) Assembled Map 4 107 1,699.60 (87,194) Corrected Rmaps Assembled Map 0 102 1,397.60 (225,322) Assembled Map 1 237 3,348 (8,230) Error-free Rmaps Assembled Map 0 60 808.74 (185,239) Assembled Map 1 91 1,100.5 (241,324) Assembled Map 2 104 2,474.4 (19,185) The Rmaps are simulated from the E. coli genome. Each assembled map is aligned to the reference genome-wide (error-free) optical map using the Valouev et al. aligner. The genome-wide optical map contains 383 fragments. Table 5: Efficiency of error correction when the FPR is varied and the FNR is fixed at 15% FPR No. of Percent of Rmaps Mean S-score Mean S-score Mean S-score with improved before error Rmaps S-score correction after error correction improvement 0.5 129,820 93.42 78.66 91.12 12.46 1.0 126,133 94.01 76.25 89.44 13.19 2.0 140,623 92.99 71.93 85.29 13.36 5.0 130,019 81.35 64.65 71.36 6.71 correction to 16.97 after correction (an 80.92% improvement in Based on these alignments, we then computed the fraction of the score). the genome covered by at least one original Rmap and the frac- We also measured the genome coverage, i.e., the fraction of tion of the genome covered by at least one corrected Rmap. On the genome covered by at least one Rmap, for both the origi- the goat genome, the genome coverage was 73.08% before cor- nal Rmaps and the corrected Rmaps as follows. First, we aligned rection and it increased to 84.56% after correction. The increase all Rmaps to the genome-wide optical map and then picked the in genome coverage shows that our method is able to correct best alignment for each original Rmap and each corrected Rmap. Rmaps from across the genome. Furthermore, it shows that even Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 8 Error correcting optical mapping data Table 6: Efficiency of error correction when the FNR is varied FPR FNR (%) No. of Percent of Rmaps Mean S-score Mean S-score Mean S-score with improved before error after error Rmaps S-score correction correction improvement 1.0 5 142,684 87.36 81.32 87.62 6.3 15 126,133 94.01 76.25 89.44 13.19 25 123,252 96.09 70.24 87.44 17.20 2.0 5 148,912 89.23 76.54 84.47 7.93 15 140,623 92.99 71.93 85.29 13.36 25 130,763 93.02 66.95 81.65 14.7 Figure 4: Alignment scores of Rmaps from the plum genome with the reference optical map. Before error correction, the S-score had a mean of 8.6 with a standard deviation of 6.49. After error correction, the mean S-score improved to 14.72 with standard deviation of 6.72. Table 7: Results on the Rmap data of plum and goat genomes and 105.7 CPU days for plum and goat, respectively), these fig- ures are not prohibitive given that this computation can easily Genome name Plum Goat be parallelized since the error correction process for each Rmap is independent. For example, we ran the goat genome on 20 Running time 7.4 days 105.7 days machines, and it required 126.84 hours for all Rmaps to be cor- Memory 12.20 GB 113.56 GB rected. In addition, we note that error correction of a dataset will No. of insertion 433,282 2,530,060 likely only be done once for any dataset, so 5.2 human days for errors corrected a large genome is not unreasonable. Last, the peak memory us- No. of deletion 430,329 3,187,023 age was 12.20 GB and 113.56 GB for plum and goat, respectively; errors corrected thus, cOMet is able to run on any modern server. Peak memory was measured as the maximum resident set size as reported by Next, we ran experiments on the human dataset. Again, the operating system with sufficient RAM to avoid paging. Running time is the since we do not possess the error-free Rmaps corresponding user process time, also reported by the operating system. to the raw Rmaps for this dataset, we follow a similar evalu- ation method as in the previous experiments. We performed our evaluation on an in silico digested human reference genome if Rmaps could not originally be reliably aligned to some regions (GenBank assembly accession: GCA 000001405.15, Genome Ref- of the genome, our method is sensitive enough to recover simi- erence Consortium Human Build 38) using BspQI, which was the lar Rmaps from these regions; thus, after correction, the fraction restriction enzyme that was used for generating the Rmap data. of the genome covered by aligned Rmaps is higher. For the plum, cOMet improved the S-score of 74.78% of the Rmaps. The aver- the genome coverage dropped negligibly from 99.01% before er- age S-score improved from 85.96 before error correction to 88.65 ror correction to 98.85% after error correction (which is less than after error correction. 1% of the genome size). In addition, as shown in Table 7, the running time and peak memory usage were recorded for the plum and goat genomes. Although these experiments have significant running times (7.4 Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Mukherjee et al. 9 Figure 5: Alignment scores of Rmaps from the goat genome with the reference optical map. The mean and standard deviation of the S-scores before error correction were 9.38 and 6.54, respectively. After error correction, the mean S-score improved to 16.97 with a standard deviation of 6.21. Conclusion License: GNU General Public License Research resource identifier: SCR 016276 Error correction of high-throughput sequencing data has be- An archival copy of the code is available via the GigaScience repos- come an imperative preprocessing step in genome assembly itory GigaDB[33]. since 2008 when Chaisson and Pevzner showed the dramatic im- provement it can have on the quality of the assembly [26, 27, 28]. For example, after error correction, the contig N50 size of Availability of supporting data an assembly of Rhodabacter sphaeroides improved from 233 bp to The optical mapping data for plum and goat are publicly avail- 7,793 bp using the same assembler [28]. Due to this inarguable able and can be accessed from their respective manuscripts [3, benefit on genome assembly, many methods have been devel- 22] and via GigaDB [34] The human optical map data can also oped for error correction of sequence reads, including BFC [29], be accessed from its manuscript [23]. The simulated data for E. Coral [30], EULER [26, 31], and Reptile [32]. Unfortunately, even .coli are provided in the github repository along with the python though there has been a massive effort in error correction of se- scripts used to generate it, and snapshots of the data and code quence data, there currently does not exist a publicly released are also included in GigaDB[33]. method for error correction of Rmap data—a method that would likely improve the quality of genome-wide optical map assem- blies and allow such assemblies to be computed with greater ef- Additional material ficiency. Here, we presented cOMet, an error correction method for In Figure S.1 we show the distribution of lengths of Rmaps whose Rmap data and demonstrated that it corrects and improves the S-score increases after error correction. From the distribution, quality of a high percentage of Rmaps in both simulated and we can tell that our method is able to error correct Rmaps of real datasets. As previously discussed, Rmap data are subject to all sizes. We also show the distribution of fragment sizes from Rmaps whose score increases after error correction in Figure S.2. high error rates. In addition to insertion and deletion errors, they contain sizing errors that necessitate the use of a dynamic pro- Figure S.1: Distribution of Rmap lengths whose S-score in- gramming algorithm for pairwise alignment and, subsequently, creased after error correction. The Rmaps are simulated from assembly. By correcting a significant number of errors in Rmap the E. coli K-12 substr. MG 1655 as explained in the text. data, cOMet can make it possible to use faster alignment meth- Figure S.2: Distribution of fragment sizes of Rmaps whose S- ods [18, 19, 17] and explore the development of more efficient score increased after error correction. The Rmaps are simulated Rmap assembly algorithms. from the E. coli K-12 substr. MG 1655 as explained in the text. Availability of source code Abbreviations Project name: cOMet CPU: central processing unit; FNR: false negative rate; FPR: false Project home page : https://github.com/kingufl/cOMet positive rate; RAM: random access memory; TPR: true positive Operating system(s): Linux rate; CCD : charged coupled device. Programming language: C++ Other requirements: GCC version 5.2.0 or higher Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018 10 Error correcting optical mapping data (BLASR): application and theory. BMC Bioinformatics 2012;p. Funding K.M., D.W., M.M., and C.B. were funded by the National Science 17. Muggli MD, Puglisi SJ, Boucher C. Efficient indexed align- Foundation (1618814). L.S.was funded by the Academy of Finland ment of contigs to optical maps, Algorithms in Bioinfomatics. (grants 284598 [CoECGR], 308030, and 314170). WABI. 2014;68–81. 18. Leung AKY, Kwok TP, Wan R, et al. OMBlast: alignment tool for optical mapping using a seed-and-extend approach. References Bioinformatics 2016;p. btw620. 1. Schwartz DC, Li X, Hernandez LI, et al. Ordered restriction 19. Mendelowitz LM, Schwartz DC, Pop M. Maligner: a maps of Saccharomyces Cerevisiae chromosomes constructed fast ordered restriction map aligner. Bioinformatics by optical Mmapping. Science 1993;262:110–114. 2016;32(7):1016–1022. 2. Zhou S, Wei F, Nguyen J, et al. A single molecule scaffold for 20. Smith TF, Waterman MS. Identification of common molecu- the maize genome. PLoS Genetics 2009 11;5:e1000711. lar subsequences. J Mol Biol 1981;147(1):195–197. 3. Dong Y, Xie M, Jiang Y, et al. Sequencing and automated 21. Bradnam KR, Fass JN, Alexandrov A, et al. Assemblathon 2: whole-genome optical mapping of the genome of a domestic evaluating de novo methods of genome assembly in three ver- goat (Capra hircus). Nature Biotechnol 2013. http://dx.doi.org tebrate species. GigaScience 2013;2(1):1–31. /10.5524/100082. 22. Cai M, Chen W, Du D, et al. Genomic data of the plum (Prunus 4. Chamala S, Chanderbali AS, Der JP, et al. Assembly and vali- mume). GigaScience Database 2014. http://dx.doi.org/10.5524 dation of the genome of the nonmodel basal angiosperm Am- /100084. borella. Science 2013;342(6165):1516–1517. 23. Shi L, Guo Y, Dong C, et al. Long-read sequencing and de 5. Teague B, Waterman MS, Goldstein S, et al. High-resolution novo assembly of a Chinese genome. Nature Communica- human genome structure by single-molecule analysis. Proc tions 2016;http://dx.doi.org/10.1038/ncomms12065. Natl Acad Sci U S A 2010;107(24):10848–10853. 24. Miclotte G, Plaisance S, Rombauts S, et al. OMSim: a simulator 6. Ganapathy G, Howard JT, Ward JM, et al. De novo high- for optical map data. Bioinformatics 2017;2740–2. coverage sequencing and annotated assemblies of the 25. Valouev A, Schwartz DC, Zhou S, et al. An algorithm for budgerigar genome. GigaScience 2014;3,1–9. assembly of ordered restriction maps from single DNA 7. Muggli MD, Puglisi SJ, Ronen R, et al. Misassembly detection molecules. Proc Natl Acad Sci U S A 2006;103(43):15770–15775. using paired-end sequence reads and optical mapping data. 26. Chaisson MJ, Brinza D, Pevzner PA. De novo fragment assem- Bioinformatics 2015;31(12):i80–i88. bly with short mate-paired reads: does the read length mat- 8. Reslewic S, Zhou S, Place M, et al. Whole-senome Shotgun ter? Genome Res 2009;19(2):336–346. optical mapping of Rhodospirillum rubrum. Appl Environ Mi- 27. Ekblom R, Wolf JBW. A field guide to whole-genome se- crobiol 2005;71(9):5511–5522. quencing, assembly and annotation. Evolutionary Applica- 9. Zhou S, Deng W, Anantharaman TS, et al. A whole-genome tions 2014;7(9):1026–1042. Shotgun optical map of Yersinia pestis strain KIM. Appl Envi- 28. Salzberg SL, Phillippy AM, Zimin A, et al. GAGE: a critical ron Microbiol 2002;68(12):6321–6331. evaluation of genome assemblies and assembly algorithms. 10. Zhou S, Kile A, Kvikstad E, et al. Shotgun optical mapping Genome Res 2012;22(3):557–567. of the entire Leishmania major Friedlin genome. Mol Biochem 29. Li H. BFC: correcting Illumina sequencing errors. Bioinfor- Parasitol 2004;138(1):97–106. matics 2015;31(17):2885. 11. Zhou S, Bechner MC, Place M, et al. Validation of rice genome 30. Salmela L, Schroder ¨ J. Correcting errors in short reads by mul- sequence by optical mapping. BMC Genomics 2007;8(1):278. tiple alignments. Bioinformatics 2011;27(11):1455–1461. 12. Church DM, Goodstadt L, Hillier LW, et al. Lineage-specific bi- 31. Pevzner PA, Tang H, Waterman MS. An Eulerian path ap- ology revealed by a finished genome assembly of the mouse. proach to DNA fragment assembly. Proc Natl Acad Sci PLoS Biology 2009;7(5):e1000112+. 2001;98(17):9748–9753. 13. Zhou S, Herschleb J, Schwartz DC. A single molecule sys- 32. Yang X, Dorman KS, Aluru S. Reptile: representative tiling for tem for whole genome analysis. Perspectives in Bioanalysis short read error correction. Bioinformatics 2010;26(20):2526. 2007;2, 265–300. 33. Mukherjee K, Washimkar D, Muggli MD, et al. Supporting data 14. Valouev A, Li L, Liu YC, et al. Alignment of optical maps. J for “Error Correcting Optical Mapping Data.” GigaScience Comp Biol 2006;13(2):442–462. Database; 2018. http://dx.doi.org/10.5524/100434. 15. Li M, Mak ACY, Lam ET, et al. Towards a more accurate error 34 Bian C, Chen J, Chen W, Genomic data of the goat (Capra hir- model for BioNano optical maps. In: ISBRA 2016;pp. 67–79. cus). GigaScience Database 2014, http://dx.doi.org/10.5524/1 16. Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy061/5005021 by Ed 'DeepDyve' Gillespie user on 21 June 2018

Journal

GigaScienceOxford University Press

Published: May 25, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off