Sim3C: simulation of Hi-C and Meta3C proximity ligation sequencing technologies

Sim3C: simulation of Hi-C and Meta3C proximity ligation sequencing technologies Background: Chromosome conformation capture (3C) and Hi-C DNA sequencing methods have rapidly advanced our understanding of the spatial organization of genomes and metagenomes. Many variants of these protocols have been developed, each with their own strengths. Currently there is no systematic means for simulating sequence data from this family of sequencing protocols, potentially hindering the advancement of algorithms to exploit this new datatype. Findings: We describe a computational simulator that, given simple parameters and reference genome sequences, will simulate Hi-C sequencing on those sequences. The simulator models the basic spatial structure in genomes that is commonly observed in Hi-C and 3C datasets, including the distance-decay relationship in proximity ligation, differences in the frequency of interaction within and across chromosomes, and the structure imposed by cells. A means to model the 3D structure of randomly generated topologically associating domains is provided. The simulator considers several sources of error common to 3C and Hi-C library preparation and sequencing methods, including spurious proximity ligation events and sequencing error. Conclusions: We have introduced the first comprehensive simulator for 3C and Hi-C sequencing protocols. We expect the simulator to have use in testing of Hi-C data analysis algorithms, as well as more general value for experimental design, where questions such as the required depth of sequencing, enzyme choice, and other decisions can be made in advance in order to ensure adequate statistical power with respect to experimental hypothesis testing. Keywords: Hi-C; Meta3C; 3C; DNA sequencing; simulation; metagenomics For inferential software within scientific fields, the system- Findings level attributes of precision and accuracy are of primary Software testing interest, and their quantification is best accomplished by com- parison with a known truth (gold standard). Therefore, any test- To the casual observer, formal software testing is often thought to begin and end with the validation of fine-grained behavioural ing methodology capable of providing an a priori gold standard, particularly without estimation, improves this facet of testing (functional) aspects; such as the correct execution of individ- ual methods. In day-to-day use, however, what can matter significantly. Purpose-built bioinformatics software ultimately acts on ex- most to end users are broader system attributes such as speed, scalability, reproducibility, and ease of use. To ensure that a perimentally collected observations. The inherent noise and variation that comes with experimental data means that achiev- project offers maximum value, a thorough testing process would ing testing thoroughness is a great challenge. Ready access to collectively examine all aspects. Received: 1 June 2017; Revised: 18 September 2017; Accepted: 23 October 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 DeMaere and Darling sufficient data sources is a fundamental necessity for adequate of eukaryotic chromosomes [9]. In 2009, Lieberman-Aiden [10] software testing. reported an extension of the protocol to high-throughput se- For established experimental methods, public data archives quencing, enabling the global spatial arrangement of chromo- are a first choice for the necessary testing data. When high- somes to be reconstructed at unprecedented resolution. All 3C quality metadata are available, testing driven by real data protocols depend on an initial formalin fixation step, which becomes possible. However, even when sufficient depth and de- crosslinks proteins bound to DNA in vivo. Subsequently cells are scription of data are available, difficulty can remain in match- lysed and the DNA:protein complexes are sheared enzymati- ing desired test data characteristics to what actually exists in cally and/or physically to create free ends in the bound DNA 1 or several public datasets. Further, fine-grained whole-corpus strands. These free ends are then subjected to a proximity lig- querying of metadata on remote data archives is not always ation reaction, in which ligation of free ends preferentially oc- possible, frequently making the up-front job of data selection curs among DNA strands cobound in a protein complex. The a difficult task. Once selected, obtaining said real data can be DNA:protein crosslinks are then reversed, the DNA is purified, time-consuming or even infeasible in locations with lower net- and an Illumina-compatible sequencing library is constructed. work speeds and/or high bandwidth costs. In advancing fields In Hi-C protocols, the proximity ligation junctions can then be such as DNA sequencing, new experimental datatypes can ap- further purified in the sequencing library. pear for which the public data archives contain only a handful 3C-derived methods have found several applications beyond of examples, and few researchers would have the time and fi- their initial use to reconstruct 3D chromosome structure. For ex- nancial resources to commit to experimental generation of new ample, it has been shown that 3C-derived data provide a valu- data purely for software testing. able signal for genome scaffolding [11,12], as well as a signal that Though performance on real data is the ultimate arbiter of can support genome-wide haplotype phasing [13,14]. 3C-derived analytical value, advantaged by explicit control over its charac- data have also proven valuable for metagenomics, where ini- teristics, a faithful simulation of real data can act as a valuable tial studies on mock communities demonstrated that highly ac- proxy. Simulation-driven development and testing has proven to curate genome reconstruction in mixed microbial communities be a highly cost-effective and time-efficient approach. It offers could be facilitated by proximity ligation sequence data [15–17]. the possibility to explore a near continuum of data character- Subsequent application to naturally occurring microbial com- istics, subjecting software to an otherwise unavailable degree munities has also suggested that bacteriophage can be linked of testing thoroughness. Certainty and control make attaining to their hosts with this data type [18]. the twin objectives of rigorous testing and an a priori gold stan- In the remainder of this manuscript, we describe the Sim3C dard straightforward. This enables us not only to be more certain software and demonstrate how it can be used to simulate data about when we have failed, but also to extrapolate this process for various 3C-derived experiments. to infer the limits of success within the experimental parameter space. Experiment scenarios Tools for simulating DNA sequencing reads have existed from the very early days of genomics, beginning with the many Beyond simple monochromosomal genome sequencing exper- anonymous implementations of simple DNA shearing algo- iments, Sim3C offers support for the more complex scenarios rithms, up to the most recent highly detailed empirical model of multi-chromosomal genomes and metagenomes. A scenario simulators [1–4]. From read simulation in isolation, field ad- is defined by way of a community profile, assigning a copy- vancements such as metagenomics have been accompanied number and containing genome to each chromosome and a rel- soon after by simulators reflecting their specific data character- ative abundance to each genome. The profile and supporting istics and evolving experimental methodology [5–7]. reference sequences form a skeleton definition with which to We introduce Sim3C, a software package designed to simu- initialize the weighted random sampling process within a sim- late data generated by Hi-C and other 3C-based proximity lig- ulation. The user can elect to supply a profile either as an ex- ation (PL) sequencing protocols. The software includes flexi- plicit table (Figs 1 and 2) or allow Sim3C to draw abundances at ble support for a range of sequencing project scenarios and runtime from 1 of 3 distributions (equal abundance, uniformly choice of three 3C methods (Hi-C, Meta3C, DNase Hi-C). The re- random, log-normal distribution) for communities made up of sulting output (paired-end FastQ) is easily assimilated into ex- strictly mono-chromosomal genomes. isting analysis workflows. It is our intention that Sim3C pro- vide the Hi-C/3C research community with a means to further Error Modelling validate existing software projects, support new experimental or analysis development initiatives, and serve as a platform Sim3C models 3 forms of experimental noise: machine-based for exploration, such as the comparative analysis of clustering sequencing error, the formation of spurious ligation products, algorithms [8]. and the contamination of PL libraries with whole genome shot- gun (WGS) read-pairs. To simulate machine-based sequencing error, the paired- 3C sequencing end mode from art illumina [2] has been reimplemented as a 3C-based sequencing protocols, including Hi-C, 4C-seq, and Python module (Art.py). This approach was taken as delegat- Meta3C, have great potential to address questions directed at the ing read-pair generation to native invocations of art illumina spatial organization of DNA in samples ranging from eukaryotic proved cumbersome. More explicitly, a loosely coupled solution tissue to single cells to microbial communities. The growing use (via subprocess calls but without an interprocess communica- of these protocols creates a legitimate need for a simulator ca- tion mechanism) lacked sufficient control to generate PL read- pable of generating data with relevant characteristics. pairs in an efficient and robust manner. On the other hand, Chromosome conformation capture (3C) was originally de- tightly coupling Sim3C to the ART C/C++ source code (i.e., im- signed as a polymerase chain reaction–based assay to mea- plementing hooks) would have left Sim3C vulnerable to changes sure interactions among a small number of defined regions in a non-public external API (i.e., a codebase without formal Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Sim3C: simulation of Hi-C sequencing technology 3 Figure 1: A mock 2 genome community. For demonstration purposes, we assume that the plasmid (plas1) is present in 4 copies and that there is a 0.4/0.6 relative abundance split between the 2 organisms (bac1, bac2) in the community. Figure 2: A mock 4 chromosome genome. Cellular abundance is a constant across the profile, while chr4 exists in 2 copies. Note that relative abundances specified i n a profile are not required to sum to 1, but are normalized internally. definition or guarantee of stability). Reimplementation also braries to be subsequently enriched for fragments containing PL meant that Art’s many empirically derived machine profiles are junctions by streptavidin-mediated affinity purification. With- available for use by Sim3C, allowing equivalent treatment of ma- out enrichment, the simpler Meta3C protocol results in a gross chine error when experiments involve both PL (Sim3C) and pure mixture of both WGS and PL read-pairs, where only a small per- WGS (art illumina) libraries. centage of the total read-pair yield (approx. 1%) will possess PL The production of spurious ligation products is an inherent junctions [23]. The enrichment process within Hi-C, however, is source of noise in PL library construction [19]. Sim3C models not perfectly efficient, and WGS read-pairs are still observed (ap- spurious pairs as the uniformly random ligation of any 2 cut- prox. 10-50% of reads contain a PL product) [23]. DNase Hi-C re- sites across all source genomes. While this process disregards places restriction digest with a non-specific endonuclease (e.g., cellular organization, it respects the relative abundance of chro- DNase I) [24] or mechanical DNA shearing process (e.g., sonica- mosomes. Spurious pairs, and to a lesser extent sequencing er- tion) [20]. In this operational mode, Sim3C treats DNA cleavage ror, represent an important confounding signal to downstream as a completely unbiased (free) process, and as such all genomic analyses that attempt to infer the cellular or chromosomal or- positions have equal probability of participating in proximity lig- ganization of DNA sequences. ation events. Lastly, conventional WGS read-pairs represent a source of Within Sim3C, each of the 3 methodological variations is con- contamination within a PL library that even after Hi-C enrich- ceptualized as a sequencing strategy (figure 3), and each itera- ment steps, are not completely eliminated. The rates at which tion of a strategy produces 1 read-pair (PL or WGS in origin). For spurious and WGS read-pairs are injected into a simulation run all strategies, an iteration begins by drawing a 3-tuple of insert are controllable by the end-user. parameters: length, direction, and junction point (L , dir, x ). ins junc After obtaining insert parameters, the Hi-C strategy (Fig. 3a) first tests if the insert will represent a WGS or PL read-pair Simulation modes (∼Bern(p )), where efficiency p is defined in the sense of en- eff eff richment. When p = 1, there is perfect filtering and all WGS eff Since Hi-C was first introduced [ 10], the development of variants read-pairs are eliminated from the experiment. In the case of and extensions has been continual [17,20–22]. Variants have of- WGS, the iteration reaches an end-point and the simulation ten strived to further enhance the discriminatory power of the emits a conventional read-pair drawn from the community def- original experiment, while seemingly adding yet more complex- inition. In the case of PL, a cut-site 3-tuple is drawn (gen , ity to an already challenging protocol (in situ DNase Hi-C, sciHi- chr , x ), where the categorical distribution over chromosomes 1 1 C) [22]. Others instead have sought compromise, with the aim is weighted by relative abundances (A) and chromosomal copy- of lessening the burden on the laboratory (Meta3C). While not numbers (n ), genomic position is sampled uniformly from cpy considering more recent and complex extensions, Sim3C offers the set of restriction sites (sites(chr )), and the parent genome 3 simulation modes: traditional Hi-C, Meta3C, and DNase Hi- (gen ) is implicit from the chromosome. Next, a spurious liga- C. The first 2 of these modes were chosen as representing the tion test is performed (∼Bern(p )). If a spurious event has oc- spur fundamental basis (traditional Hi-C) and an attractive and prag- curred, the 3-tuple defining the second cut-site ( gen , chr , x ) 2 2 2 matic simplification of the original (Meta3C). The third mode is drawn i.i.d. as the first. If not spurious, next a test for inter- (DNase Hi-C) replaces the restriction endonuclease-driven pro- chromosomal (trans) ligation is performed. Only source chro- duction of the free ends, used to form PL products, with an ide- mosome and position (chr , x )need be drawnasthe second 2 2 ally free process of DNA fragmentation. In the laboratory, this genome is implicitly the same as the first ( gen = gen ). Here, 2 1 ideally free process could be carried out by DNase digestion or chr is selected without replacement from the set of chromo- mechanical shearing via sonication. somes of genome (gen ) where the categorical distribution is ad- The most notable difference between the methods of Hi- justed by removal of chr . Finally, an intra-chromosomal (cis) lig- C and the more recent Meta3C is that after restriction digest, ation must have occurred. As now both genome and chromo- Hi-C employs additional steps leading to the incorporation of bi- some are implicit (gen = gen , chr = chr ), all that is left is to 2 1 2 1 otin tags at each PL junction. This biotinylation permits Hi-C li- Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 DeMaere and Darling AB Figure 3: Logical schema used within Sim3C. (a) Hi-C and (b) Meta3C simulation strategies. Gold diamonds represent simple Bernoulli trials. Blue boxes represent sampling distributions defined by runtime input data (community profile, genomic sequences, enzyme) and the empirically derived distribution for in tra-chromosome (cis) interaction probability (equation 1). Logical end-points to a single iteration of either algorithm are represented as red (producing a WGS read-pair) and green boxes (producing a PL read-pair). Due to the elimination of the biotinylation step, Meta3C does not produce a duplication of the restriction cut-site overhang (grey boxes). draw genomic position x . The pair of positions (x , x ) is con- where β is a mixing parameter, α the geometric distribution 2 1 2 strained by their separation (s =|x − x |), which is represented shape parameter, and l chromosome length. 2 1 by a mixture model of the geometric and uniform distributions For Meta3C (Fig. 3 b) after insert parameters are determined, (equation 1). This relation possesses rapid falloff with increas- in the same fashion as a regular WGS read, an initial free ge- ing separation and non-zero probability for all chromosomal po- nomic position is drawn (chr , x ), uniformly distributed over the sitions, as has been commonly observed in real experimental extent of chr rather than only over its cut-sites. In real datasets, data [10,25]. it has been observed that neither the restriction digestion nor the re-ligation of free ends is perfectly efficient. Taken as in- dependent probabilities, in our model we conceptualize their Pr(X = s|α, β, l) = β(1 − α) α + (1 − β)/l, (1) Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Sim3C: simulation of Hi-C sequencing technology 5 AB Figure 4: Model details. Generation of proximity ligation inserts (a) involves joining 2 randomly drawn parts (red and blue), from which the read-pair (R1, R2) is then simulated. The junction point (x ) varies over the interval [0..L), and reproduction of read-through events is possible. For an unbounded chromosome (b) (circular junc here), besides strictly primary separation (black arrow), spatial proximity can be induced from successive folding (red, green arrows). When the spatial arrangement is consistent across the population of cells, this will be observable as modulations in the contact frequencies. Sim3C models simple structurally related modulation of observed contact frequencies (c). Beyond primary interactions forming the main diagonal, users can reproduce inter-arm-mediated anti-diagonals. Finer-scale modulations attributed to topologically associated domains can optionally be randomly simulated. Primary interactions f (s|θ ) (equation 1) cover the full interval [0, 0 0 L). Each level of recursion (d = 1, 2. . . n) generates a finer set of intervals, to which a distribution f (s|θ ) and probability p are assigned. The final covering of intervals i i i each define a range (green, curly braces) over which a set of probabilities and empirical distribution pairs govern interaction separation s. joint occurrence as an efficiency factor, p , and a Bernoulli trial occur along a chromosome (intra-arm) (Fig. 4b), seen as the pri- eff (Bern(p )) determines whether a sequence read is successful in mary (y  x) diagonal in the contact map. Sim3C can approxi- eff containing an observable proximity ligation event. Failing this mate the less frequent interactions occurring between chromo- coverage test relegates the iteration and end-point and emits somal arms (inter-arm) [26], which are visible as anti-diagonal a WGS read-pair. Successful candidates instead continue akin (y  L − x) in the contact map. to the Hi-C decision tree, beginning with the test for spurious At progressively smaller scales, the hierarchical 3D folding ligation. of DNA into topologically associated domains (TADs) produces For both Hi-C and Meta3C, PL read-pairs are produced by join- overlapping regions of interaction visible in the contact map as ing the free ends drawn above as defined by the fragment param- block-like intensity modulations. Though the agents responsible eters (Fig. 4a). Here the location of the PL junction within the for their formation vary [27,28], the characteristic patterns evi- insert is determined by x . At the junction, Hi-C differs from dent in real data–derived 3C contact maps have been observed junc Meta3C as the process of biotinylation results in the duplication across all 3 domains [25,26,29]. Sim3C can optionally approxi- of the restriction cut-site overhang sequence. The overhang du- mate the sense of TAD-related modulation by means of a recur- plication in Hi-C is included in the simulation. sive stochastic process. DNase Hi-C is handled similarly to traditional Hi-C, with the Our approximation of hierarchical folding begins from the exception that, as in silico digestion trivially leads to all sites, full extent L of a chromosome (Fig. 4c). Folding is portrayed by the simulated digestion is unnecessary to perform and positions the division of the interval [0..L) into a set of non-overlapping can be drawn directly from the uniform distribution over the sub-intervals {[0, x ), [x , x ), ..., [x , x )}, the number and 1 1 2 n − 1 n interval [0..L ). Site duplication, attributable to the likely pro- widths of which are drawn at random (U(l , l ), U(n , n )). chr min max min max duction of random overhangs in this scenario, is not presently The procedure is then recursively applied to each sub-interval simulated. until a depth d, producing a nested set of coverings of the full interval [0..L) at progressively finer scales. Across this hierarchi- cal collection, each interval is assigned a uniformly distributed Structurally related interactions random probability p and empirical distribution f (s|θ ) (equa- i i i tion 1) for separation s, parameterized by shape parameter α TAD Independent of any 3D structure that might exist, the pri- and interval length l = x − x ,where θ = (α , β, l ). TAD inv i + 1 i inv mary and most frequently observed interactions are those that Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 DeMaere and Darling Table 1: Real Hi-C and Meta3C data-sets used within this work Authors Type Method Accession Sequencing Mapped details reads Beitel et al. [15] Synthetic bacterial metagenome Hi-C SRX377733 MiSeq 160bp PE insert 20552775 range: 280-420bp enzyme: HindIII Burton et al. [16] Synthetic yeast metagenome Hi-C SRX527868 HiSeq2500 100bp PE insert 9704944 range: 450-550bp enzyme: HindIII Le et al. [26] Single bacterial genome Hi-C SRX263925 HiSeq2000 40bp PE insert 22324360 range: 200-600bp enzyme: NcoI Marbouty et al. [41] Synthetic bacterial metagenome Meta3C doi:10.5061/ HiSeq2000 100bp PE insert 7975740 dryad.gv595 range: 400-800bp enzyme: HpaII The total off-diagonal weight of the contact map was used to calibrate the amount of simulated sequencing required to approximately match the outcome of the real experiments. The process of drawing samples of separation begins by de- comparison. Due to the smaller extent, a bright and high- termining the set of intervals {l } that contain an initial point resolution contact map (10-kbp bin size) is possible for a prac- inv x . The intervals, as tuples (p , f (s|θ )), then form a categorical dis- tical volume of sequencing data, potentially revealing fine detail 0 i i i tribution (equation 7), from which a governing distribution f (s|θ ) not easily discerned with larger bin sizes (50–100-kbp bin size). i i is drawn, and finally a sample of separation is taken, s ∼ f (s|θ ). The genome of Caulobacter crescentus NA1000, a model or- i i To efficiently sample from the full collection, an interval-tree ganism in the study of cellular differentiation and regulation data structure is employed. When queried, an interval-tree re- of the cell cycle, is comprised of a single 4-Mbp circular chro- turns the set of intervals {l} overlapping a position x in order mosome [30]. Deep Hi-C sequencing of C. crescentus has been O(log n + m), where n is number of intervals and m is number of used to explore the degree to which bacterial chromosomes intervals returned by the query. can be regarded as organized and provided evidence for the ex- istence of so-called chromosomal interaction domains (CIDs) [26]. As a prokaryotic analog of topologically associated domains f = f (s|θ ), f (s|θ ), ··· , f (s|θ ) (2) 0 0 1 1 i i from eukaryotic literature [28,31–32], these regions are believed to promote intra-domain loci interactions and thereby act to N = number of distributions = f (3) functionally compartmentalize the genome. This chromosomal structure was observed to be at once disruptable through rifampicin-mediated inhibition of transcription and malleable p = {p , p , ··· , p } (4) 0 1 i by the movement of highly expressed genes [26]. Forthe rawcontact mapof C. crescentus, prominent rectilin- ear features are apparent for both real and simulated traditional p ∼ U(0, 1) and p =1(5) i i Hi-C sequencing data (Fig. 5a,b), while notably for simulated unrestricted Hi-C the field is much smoother (Fig. 5c). Within the Sim3C model, a single distribution governs both intra- and n ∼ Cat(N, p)(6) inter-arm interactions. Inspection of the real-data contact map (Fig. 5a) suggests that the true relationship governing inter- N−1 arm interactions is more dispersed. This perhaps is not surpris- [i =n] f (s|n) = f (s|θ ) , (7) i i ing, where different arms associating spatially possess a greater i =0 number of potential configurations than can be taken on by the primary chromosome backbone. Additionally, for the real con- where [i = n] is the Iverson bracket. tact map, long-range interactions away from either diagonal can be seen to drop to a lower threshold than that produced from Example scenarios simulation. Within the unrestricted Hi-C map, the fine zero-intensity rec- In the following, 3 use cases are presented to demon- tilinear features are a direct result of poor mappability (non- strate aspects of the resulting simulation output: bacterial unique sequence), where their small size reflects the extent of genome, multi-chromosomal eukaryotic (yeast) genome, and the non-unique regions (example: rRNA genes) and the single metagenome. For each use case, 3C contact maps have been base-pair resolution of the less constrained read generation pro- used to pit simulation output against the corresponding real cess. The process of enzymatic digestion is the only difference experimental data (table 1). between the unrestricted and traditional Hi-C simulation mod- els. The clear contrast in their contact maps is thus a combina- Bacterial tion of factors either directly inherent to digestion (cut-site den- sity) or a byproduct of downstream bioinformatics analysis (e.g., A monochromosomal bacterial genome is perhaps the simplest filtering heuristics). Though the problem of mappability exists scenario to which proximity ligation methods have been ap- for any reference-based representation, for real and simulated plied, making for a sensible entry point from which to make Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Sim3C: simulation of Hi-C sequencing technology 7 Figure 5: Bacterial contact maps. Observed Hi-C interactions for the monochromosomal genome of Caulobacter crescentus NA1000. Comparing (a) real experimental data [26] with the 3 simulation choices (b) traditional Hi-C, (c) DNase Hi-C, and (d) traditional Hi-C with TADs enabled. Sharp rectilinear modulations of the intensity within (a) and (b) indicate a reduction in PL observations within a given bin. Not due to 3D chromosome structure, rather such features can be attributedlargely to mappability and low cut-site density. (c) Without an enzymatic constraint, a significantly smoother field is apparent, yet still susceptible to mappa bility. (d) Enabling topologically associated domains highlights the similarity between features produced merely from biases and what could be truly associated with 3D structure. traditional Hi-C, zero-intensity rectilinear features mark regions yeasts included in a synthetic community to explore the appli- devoid of cut-sites over at least 10 kbp. cation of Hi-C sequencing to deconvolving metagenomic assem- Enabling TAD approximation in simulated traditional Hi-C blies [16], and it is divergent enough from other synthetic com- (Fig. 5d) has the effect of modulating map intensity in a manner munity members to permit unambiguous read mapping, and not particularly distinct from that produced purely from exper- thus act as a proxy for a clonal experiment. imental/workflow bias. Discriminating between these 2 feature From the contact map of real Hi-C data (Fig. 6a), it can be seen sources—one representing experimental signal, the other rep- that the rates of intra-chromosomal and inter-chromosomal in- resenting noise—demands attention when developing solutions teractions are roughly equivalent in magnitude. Across the 8 to problems such as normalization. Contact map normalization chromosomes of S. stipitis, there is significant uniformity in the methods, whether based upon explicit or implicit bias models degree of physical intimacy within and between all chromo- [33], may leave behind remnants of noise-related features from somes. The subtleties of this chromosomal organization reveal either a lack of convergence or model limitations. Downstream a self-similar “fuzzy-x” pattern repeated between all chromo- inferencing should therefore not be made under an assumption somes across the contact map. The convergence point within of bias-free signal. the pattern is attributed to centromere-SPB binding and has been used to predict centromere locations [35]. It has been shown that the physical constraints generated from the in- Eukaryotic teraction of centromeres to the spindle pole body (SPB) and telomeres to the nuclear envelope are sufficient to explain a The 8 chromosomes of the 15.4-Mbp genome of the native number of experimental observations in real data [36,37]. As xylose-fermenting yeast Scheffersomyces stipitis CBS 6054 [34] Sim3C was derived from the study of bacterial datasets, our range in size from 970 kbp to 3.5 Mbp. The organism was 1 of 16 Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 8 DeMaere and Darling Figure 6: Eukaryotic contact maps. Observed Hi-C interactions of (a) real and (b) simulated data from the 8 chromosome genomes of the budding yeast Scheffersomyces stipitis CBS 6054 [16]. Grey dashed lines and alternating light and dark grey axes demarcate the boundaries between chromosomes. (b) Simulated data elicit a flat field, and the clearly evident higher rate of intra- to inter-interactions makes for easily observable chromosomal boundaries within the map. (a) Contrastingly for real data, the similar rates of intra-chr and inter-chr interactions reveal the physical constraints imposed by centromere-SPB tethering on all 8 chromosomes [35]. simulation model does not currently include a notion of these As with single-genome experiments, metagenomic contact higher organism physical constraints. Consequently, the contact maps are locally modulated by factors such as mappability and map derived from simulated traditional Hi-C sequencing elicits cut-site density. Importantly now for metagenomes, the factors a flat field (Fig. 6b), where the intensity variation that does ex- of relative abundance and GC content interact to alter the ob- ist is a byproduct of aforementioned factors such as mappability served intensity of each chromosome within the contact map. and cut-site density. For the runtime parameters employed, the As a first approximation and assuming agreement in nu- rate of intra-chromosomal contact is higher than that of inter- cleotide sampling frequency, we expect n = L/4 recognition chromosomal, making clear the boundaries between the 8 chro- sites for an enzyme of site length λ and DNA sequence length mosomes (Fig. 6b). Though our model is presently incomplete L. The degree to which an enzyme and DNA sequence deviate for higher organisms, there remains a potential utility as an an- from this estimate could be described as how well they match, alytical or simply observational prior. m = n /n . Poorer quality matches (m < 1) occurwhenanen- x 0 zyme’s recognition site is underrepresented, while conversely, better quality matches (m > 1) describe a situation of more recog- nition sites than expected. Metagenomic When multiple chromosomes are taken as a community, the relative proportion of sites from each represents an observa- In the deconvolution of metagenomes, proximity ligation meth- tional bias when conducting 3C-based experiments. For com- ods hold great potential as new sources of information and munity C, the number of sites n from chromosome x deter- have been investigated by the construction and sequencing of mines the number of potential PL pairings N within C that synthetic communities [15–17]. We selected 2 previously con- x (equation 8). The number of intra-chromosomal and inter- structed synthetic bacterial communities, 1 employing tradi- chromosomal potential pairs thus respectively vary quadrat- tional Hi-C and the other Meta3C (table 1). Intended as “proof of ically and linearly with n . Regarding the process of observ- concept” experiments, neither community reflects a real envi- ing a PL event (read-pair) from the community as a random ronment, but rather they were intended to be easily interpreted draw with replacement, and the selection pool as comprised of and include interesting features, such as range of GC, single and all potential events from all chromosomes, variation in match multi- chromosomal genomes, and strain-level divergence. The quality constitutes a per-chromosome bias. In real laboratory Hi-C community involved 5 genotypes from 4 species, 1 genome experiments, the composition of the selection pool is further of 2 chromosomes (B. thailandensis), E. coli strains BL21 and K12 modified by variation in other factors, such as cellular lysis (average nucleotide identity [ANI], 99%), and a wide overall GC efficiency, unintended DNA fragmentation, and relative abun- range of 37–68% (Table 2). Of lower complexity, the Meta3C com- dance. In particular, when relative abundances A are introduced, munity involved 3 genomes from 3 species, included 1 genome the odds of observing a PL event involving chromosome x are of 2 chromosomes (V. cholerae), and had a narrower GC range then proportional to the product p ∝ A N /N . Although the x x x C of 44–51% (table 3). Relative to the single genome experiments processes of intra-chromosomal, inter-chromosomal, and inter- above, a lower depth of sequencing resulted in a lower over- cellular (spurious) ligation are treated independently in our sim- all contact map intensity (Fig. 7). This is particularly the case ulation model, in this manner, per-chromosome intensity (ob- for Meta3C, where, by the nature of the method, a large pro- servation rate of chromosome x) can vary significantly within a portion (approx. 99%) of the sequencing yield is in reality con- metagenome. ventional WGS read-pair data [17]. As a direct result, in binning the Meta3C dataset, there were insufficient counts to fully es- tablish finer detail within the contact maps, leaving a smoother 2 N = n + n n (8) x x y appearance. n ∈C \n y x Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Sim3C: simulation of Hi-C sequencing technology 9 Table 2: Synthetic Hi-C community Name Replicons Accession Chr abbr. An %GC n m cpy x NC 007651 Bt1 67.29 225 0.24 Burkholderia 2 0.054 1 NC 007650 Bt2 68.07 144 0.20 thailandensis E264 Escherichia coli BL21 1 NC 012892 BL21 0.242 1 50.83 508 0.46 Escherichia coli K12 1 NC 010473 K12 0.166 1 50.78 568 0.50 DH10B NC 008497 Lb 46.22 629 1.12 Lactobacillus brevis 3NC 008498 - 0.436 1 38.64 3 0.92 ATCC 367 NC 008499 - 38.51 16 1.84 Pediococcus pentosaceus 1NC 008525 Pp 0.102 1 37.36 863 1.93 ATCC 25745 A synthetic community used to demonstrate the utility of Hi-C sequencing data in resolving a microbial metagenome [15]. It is composed of 5 bacteria, including 2 closely related strains (E. coli K12 and BL21), a genome with 2 plasmids (L. brevis), and a 2-chromosome genome (B. thailandensis). A is relative abundance, n is copy cpy number, n is number of restriction sites, and m = n /n is match quality between chromosome and enzyme choice: m < 1 is worse, m > 1 is better. x x 0 Table 3: Synthetic Meta3C community Name Replicons Accession Chr abbr. An %GC n m cpy x Bacillus subtilis subsp. 1 NC 000964 Bs 0.123 1 43.51 14529 0.88 subtilis str. 168 Escherichia coli str. 1 NC 000913 K12 0.562 1 50.79 24311 1.34 K-12 substr. MG1655 NC 002505 Vc1 47.70 5909 0.51 Vibrio cholerae O1 2 0.332 1 NC 002506 Vc2 46.91 1802 0.43 biovar El Tor str. N16961 A synthetic community used to demonstrate the utility of Meta3C sequencing data in resolving a microbial metagenome [17,41]. It is composed of 3 bacteria with 1 possessing 2 chromosomes. A is relative abundance, n is copy number, n is number of restriction sites, and m = n /n is match quality between chromosome and cpy x x 0 enzyme choice: m < 1 is worse, m > 1 is better. Though the original laboratory experiments reported by Bei- domly generated approximations of self-associating domains tel et al. [15] and Marbouty et al. [17] intended to create synthetic (CIDs/TADs). Sim3C does not model structural features observed communities with uniform relative abundances, in practice in larger, more complex genomes (CTCF/cohesin loops, A/B com- each possesses a non-uniform profile. The variation in GC con- partments, chromosome territories) [10,38]. Such features are tent is largest for the Hi-C experiment and, together with non- becoming increasingly well characterized [39], and a simulator uniform relative abundances, produces a wide range of chromo- capable of modelling these features would surely be valuable. some intensity for both real and simulated data (Fig. 7a,b). For Mammalian genomes are much larger than microbial genomes, both the real and simulated Hi-C maps, the frequent observa- however, and additional work to improve the scalability of tion of PL events involving P. pentosaceus (Pp) and L. brevis (Lb) Sim3C will likely be required. suggests the possibility that inter-cellular interaction is signifi- Some features of microbial eukaryotes, such as the point cen- cant. Within the simulated map at least, inter-cellular pairs are tromeres found in budding yeast genomes [40], are computa- produced exclusively through the process of spurious ligation tionally simpler [35,36] yet remain unmodelled in Sim3C. The (noise) and are observed at a higher rate than in the real data, in- addition of these sorts of model details would be best supported dicating that, as expected, spurious ligation rates across species by introducing model initialization via external data (experi- are correlated with their relative abundances. mental observations, motif detection, cell phase), which subse- Further for the Hi-C data, the 2-chromosome genome of quently would require extension of the community profile defi- B. thailandensis (Bt1, Bt2) (Fig. 7a) has a greater rate of inter- nition. Careful design would be required to ensure these features chromosomal interaction than expected from comparing it with could be added without compromising ease of use. simulation (Fig. 7b). Meanwhile, the clear delineation of E. coli strains BL21 and K12 (ANI > 99%), with little inter-cellular sig- Methods nal, helps to support the notion that the inter-chromosomal in- teractions observed between B. thailandensis chromosomes (ANI Reference Data 83%) are real and not a by-product of inadequate filtering. To compare Sim3C against real experiments, we obtained pre- viously published experimental read-pair datasets (Table 1)and Limitations and future work their accompanying reference genomes (Tables 2, 3) from public Sim3C in its current form has several limitations, some of archives. In the case of the single genome project of Caulobac- which present opportunities for future work. Sim3C’s reper- ter crescentus CB15 [26], sequencing data derived from untreated toire of structural features is currently limited to those swarmer cells was chosen and the laboratory strain C. crescen- found in microbes—circular and linear chromosomes with ran- tus NA1000 (acc: NC 011916) was used as the reference genome. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 10 DeMaere and Darling Figure 7: Metagenomic contact maps. From synthetic microbial communities, raw contact maps from real (a) and simulated (b) traditional HiC, and real (c) and simulated (d) Meta3C. Chromosome boundaries are demarcated by alternating light and dark grey bands (Tables 2, 3), while the small plasmids of L. brevis are omitted for clarity. Although the original works [15,17] intended uniform abundance, the results exhibit significant variation in abundance. Lysis efficiency (not modelled) and enzyme suitability are significant factors contributing to the overall intensity of a given chromosome. For more abundant members of the Hi-C comm unity (P. pentosaceus and L. brevis), signal due only to spurious ligation can appear to suggest inter-cellular interactions when none are present (b). For the yeast genome, the completed 8-chromosome genome dances were estimated by mapping real experimental reads to of Scheffersomyces stipitis CBS 6054 was used as a reference (acc: the respective reference genomes. From each real experiment, PRJNA18881), and the respective reads were extracted from the the off-diagonal weight of the resulting contact map was used MY16 yeast synthetic metagenome [16] by direct mapping with to calibrate the amount of simulated sequencing required to BWA MEM. Extraction by mapping in isolation was employed achieve roughly equivalent intensity (Table 4). Both real and sim- as S. stipitis was the second furthest phylogenetically removed ulated read-pair datasets were mapped to their respective refer- yeast in the synthetic community and was the most contigu- ence genomes using BWA MEM (v0.7.15-r1140, RRID:SCR 010910) ous (N50: 60 kbp) from the whole synthetic community de novo [42]. metagenomic WGS assembly. Contact maps Read generation Contact maps were produced using our own tool Experimental parameters used in read simulation were set to (contact map.py), where heatmap intensity was plotted as agree as closely as reasonably possible to the respective real ex- log-scaled observational frequency. All aligned reads were sub- periments, employing the same read length and restriction en- ject to the same basic filtering criteria: BWA MEM mapq >5and zyme (Table 1). In each experiment, the published fragment size alignment length ≥50% of read length, with the added restric- range was approximated by a normal distribution (Table 4). For tion that read alignments must have begun with a match. For ease of reproducibility, a single random seed (1234) was used methods that employed a restriction enzyme (traditional Hi-C, in all simulations. As our intent was primarily to demonstrate Meta3C), we constrained the maximum allowable distance from functionality, rates of inter-chromosomal and spurious events an aligned read to the nearest upstream cut-site. Calculated were adjusted per-experiment only through a qualitative pro- per chromosome, this distance constraint could not exceed cess. For simulation of metagenomic datasets, relative abun- 2-fold the median cut-site spacing. Rather than simply delete Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Sim3C: simulation of Hi-C sequencing technology 11 Table 4: Runtime simulation Experiment Insert μ (bp) Insert σ (bp) Anti rate Spurious rate Trans rate Reads (×10 ) Beitel et al. 300 50 0.2 0.05 0.1 7 Burton et al. 400 50 0.2 0.5 0.15 1.5 Le et al. 400 100 0.2 0.2 0.1 22 Marbouty et al. 600 100 0.2 0.2 0.2 7.5 Parameters supplied to Sim3C during read generation. the primary diagonal for the sake of reducing the displayed https://www.education.gov.au/education-investment-fund dynamic range in figures, we instead reduced its intensity by https://www.education.gov.au/national-collaborative categorizing properly paired reads with an estimated fragment -research-infrastructure-strategy-ncris size of less than 2 of the reported mean as being conventional WGS (non-PL) reads and ignored them. The resolution of contact Authors contributions maps was adjusted between experiments so as to present a sufficiently bright image without undue loss of resolution. The M.D. designed and implemented Sim3C and wrote the contact map bin sizes employed were: 10 000 bp for the single manuscript and prepared figures. A.D. assisted in the de- bacterial genome, 25 000 bp for the yeast genome, and 40 000 bp sign and contributed to the manuscript. for the Hi-C and Meta3C metagenomes (Tables 2, 3). Acknowledgements Availability of data and materials We thank Steven P. Djordjevic for his support and helpful discus- sions. This work was supported by the AusGEM initiative, a col- Snapshots of the supporting code are available from the Giga- laboration between the NSW Department of Primary Industries Science repository, GigaDB [43]. and the Ithree Institute. We acknowledge the use of computing resources from the NeCTAR Research Cloud, the QCIF and the Availability of supporting source code and UTS eResearch Group. requirements http://www.nectar.org.au Project name: Sim3C http://www.qcif.edu.au Release version: 0.1 https://eresearch.uts.edu.au Project homepage: https://github.com/cerebis/sim3C RRID: SCR 015772 References DOI: 10.5281/zenodo.1030812 Operating system: platform independent 1. Li H. lh3/wgsim. 2011. https://github.com/lh3/wgsim. Ac- Programming languages: Python 2.7 cessed 21 March 2017. License: GNU GPL v3 2. Huang W, Li L, Myers JR et al. ART: a next-generation sequencing read simulator. Bioinformatics 2012; 28(4): 593–4. Abbreviations 3. Ono Y, Asai K, Hamada M. PBSIM: PacBio reads simulator– PL: proximity ligation; toward accurate genome assembly. Bioinformatics 2013; WGS: whole genome shotgun 29(1):119–21. CID: chromosomal interaction domain 4. Hu X, Yuan J, Shi Y et al. pIRS: Profile-based Illumina pair-end TAD: topologically associated domain reads simulator. Bioinformatics 2012; 28(11):1533–5. Bern(x): Bernoulli distribution 5. Jia B, Xuan L, Cai K et al. NeSSM: a next-generation U(x): uniform distribution sequencing simulator for metagenomics. PLoS One N(μ, σ ): normal distribution 2013;8(10):e75448. cis: intra-chromosomal 6. Angly FE, Willner D, Rohwer F et al. Grinder: a versatile am- trans: inter-chromosomal plicon and shotgun sequence simulator. Nucleic Acids Res 2012;40(12):e94–e94. 7. Richter DC, Ott F, Auch AF et al. MetaSim: a sequenc- Competing interests ing simulator for genomics and metagenomics. PLoS One The authors declare that they have no competing interests. 2008;3(10):e3373. 8. DeMaere MZ, Darling AE. Deconvoluting simulated metagenomes: the performance of hard- and soft- clus- Funding tering algorithms applied to metagenomic chromosome This work was supported under the Australian Research conformation capture (3C). PeerJ 2016;4(4):e2676. Council’s Discovery Projects funding scheme (project number: 9. Dekker J, Rippe K, Dekker M et al. Capturing chromosome LP150100912, CI: S.P. Djordjevic). The NeCTAR Research Cloud conformation. Science 2002; 295(5558):1306–11. is an Australian Government project conducted as part of the 10. Lieberman-Aiden E, van Berkum NL, Williams L et al. Super Science Initiative and financed by the Education Invest- Comprehensive mapping of long-range interactions reveals ment Fund (EIF) and National Collaborative Research Infrastruc- folding principles of the human genome. Science 2009; ture Strategy (NCRIS). 326(5950):289–93. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 12 DeMaere and Darling 11. Burton JN, Adey A, Patwardhan RP et al. Chromosome-scale 28. Acemel RD, Maeso I, Gomez-Skarmeta ´ JL. Topologically as- scaffolding of de novo genome assemblies based on chro- sociated domains: a successful scaffold for the evolution of matin interactions. Nat Biotechnol 2013; 31(12):1119–25. gene regulation in animals. Wiley Interdiscip Rev Dev Biol 12. Dudchenko O, Batra SS, Omer AD et al. De novo assembly of 2017, doi: 10.1002/wdev.265. the Aedes aegypti genome using Hi-C yields chromosome- 29. Sexton T, Yaffe E, Kenigsberg E et al. Three-dimensional fold- length scaffolds. Science 2017; 356(6333):92–5. ing and functional organization principles of the Drosophila 13. Selvaraj S, R Dixon J, Bansal V et al. Whole-genome hap- genome. Cell 2012; 148(3):458–72. lotype reconstruction using proximity-ligation and shotgun 30. Marks ME, Castro-Rojas CM, Teiling C et al. The genetic basis sequencing. Nat Biotechnol 2013(12):1111–8. of laboratory adaptation in Caulobacter crescentus. J Bacte- 14. Korbel JO, Lee C. Genome assembly and haplotyping with Hi- riol 2010; 192(14):3678–88. C. Nat Biotechnol 2013; 31(12):1099–101. 31. Nora EP, Lajoie BR, Schulz EG et al. Spatial partitioning of 15. Beitel CW, Lang JM, Korf IF et al. Strain- and plasmid-level de- the regulatory landscape of the X-inactivation centre. Nature convolution of a synthetic metagenome by sequencing prox- 2012; 485(7398):381–5. imity ligation products. PeerJ 2014;2(12):e415. 32. Pope BD, Ryba T, Dileep V et al. Topologically associating do- 16. Burton JN, Liachko I, Dunham MJ et al. Species-level decon- mains are stable units of replication-timing regulation. Na- volution of metagenome assemblies with Hi-C-based con- ture 2014; 515(7527):402–5. tact probability maps. G3 2014; 4(7):1339–46. 33. Schmitt AD, Hu M, Ren B. Genome-wide mapping and anal- 17. Marbouty M, Cournac A, Flot JF et al. Metagenomic chro- ysis of chromosome architecture. Nat Rev Mol Cell Biol 2016; mosome conformation capture (meta3C) unveils the diver- 17(12):743–55. sity of chromosome organization in microorganisms. Elife 34. Jeffries TW, Grigoriev IV, Grimwood J et al. Genome sequence 2014;3(e03318):e03318. of the lignocellulose-bioconverting and xylose-fermenting 18. Marbouty M, Baudry L, Cournac A et al. Scaffolding bacterial yeast Pichia stipitis. Nat Biotechnol 2007; 25(3):319– genomes and probing host-virus interactions in gut micro- 26. biome by proximity ligation (chromosome capture) assay. Sci 35. Varoquaux N, Liachko I, Ay F et al. Accurate identification of Adv 2017;3(2):e1602105. centromere locations in yeast genomes using Hi-C. Nucleic 19. Nagano T, Varnai ´ C, Schoenfelder S et al. Comparison of Hi-C Acids Res 2015; 43(11):5331–39. results using in-solution versus in-nucleus ligation. Genome 36. Gong K, Tjong H, Zhou XJ et al. Comparative 3D genome Biol 2015; 16:175. structure analysis of the fission and the budding yeast. PLoS 20. Huang PYH, Han Y, Handoko L et al. Protocol: sonication- One 2015;10(3):e0119672. based circular chromosome conformation capture with 37. Wong H, Marie-Nelly H, Herbert S et al. A predictive compu- next-generation sequencing analysis for the detection of tational model of the dynamic 3D interphase yeast nucleus. chromatin interactions. Protocol Exchange 2010, doi:10.1038/ Curr Biol 2012; 22(20):1881–90. protex.2010.207. 38. Rao SSP, Huntley MH, Durand NC et al. A 3D map of the 21. Ramani V, Cusanovich DA, Hause RJ et al. Mapping 3D human genome at kilobase resolution reveals principles of genome architecture through in situ DNase Hi-C. Nat Protoc chromatin looping. Cell 2014; 159(7):1665–80. 2016; 11(11):2104–21. 39. Stevens TJ, Lando D, Basu S et al. 3D structures of individ- 22. Ramani V, Deng X, Qiu R et al. Massively multiplex single-cell ual mammalian genomes studied by single-cell Hi-C. Nature Hi-C. Nat Methods 2017; 14(3):263–6. 2017; 544(7648):59–64. 23. Liu M, Darling A. Metagenomic Chromosome Conforma- 40. Cottarel G, Shero JH, Hieter P et al. A 125-base-pair CEN6 DNA tion Capture (3C): techniques, applications, and challenges. fragment is sufficient for complete meiotic and mitotic cen- F1000Res 2015; 4(1377):1–9. tromere functions in Saccharomyces cerevisiae. Mol Cell Biol 24. Ma W, Ay F, Lee C et al. Fine-scale chromatin interaction 1989; 9(8):3342–3349. maps reveal the cis-regulatory landscape of human lincRNA 41. Marbouty M, Cournac A, Flot JF et al. Data from: Metage- genes. Nat Methods 2015; 12(1):71–8. nomic chromosome conformation capture (meta3C) un- 25. Dixon JR, Selvaraj S, Yue F et al. Topological domains in mam- veils the diversity of chromosome organization in mi- malian genomes identified by analysis of chromatin interac- croorganisms. 2014; https://dryad2.lib.ncsu.edu/resource/ tions. Nature 2012; 485(7398):376–80. doi:10.5061/dryad.gv595, Accessed January 8, 2018. 26. Le TBK, Imakaev MV, Mirny LA et al. High-resolution map- 42. Li H. Aligning sequence reads, clone sequences and assem- ping of the spatial organization of a bacterial chromosome. bly contigs with BWA-MEM. arXivorg 2013; q-bio.GN. Science 2013; 342(6159):731–4. 43. DeMaere MZ, Darling AE (2017): Software for “Sim3C: 27. Badrinarayanan A, Le TBK, Laub MT. Bacterial chromosome simulation of Hi-C and Meta3C proximity ligation se- organization and segregation. Annu Rev Cell Dev Biol 2015; quencing technologies.” GigaScience Database. http://dx.doi. 31(1):171–99. org/10.5524/100368, Accessed January 8, 2018. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png GigaScience Oxford University Press

Sim3C: simulation of Hi-C and Meta3C proximity ligation sequencing technologies

Free
12 pages

Loading next page...
 
/lp/ou_press/sim3c-simulation-of-hi-c-and-meta3c-proximity-ligation-sequencing-ksW9Jj0381
Publisher
Oxford University Press
Copyright
© The Authors 2017. Published by Oxford University Press.
eISSN
2047-217X
D.O.I.
10.1093/gigascience/gix103
Publisher site
See Article on Publisher Site

Abstract

Background: Chromosome conformation capture (3C) and Hi-C DNA sequencing methods have rapidly advanced our understanding of the spatial organization of genomes and metagenomes. Many variants of these protocols have been developed, each with their own strengths. Currently there is no systematic means for simulating sequence data from this family of sequencing protocols, potentially hindering the advancement of algorithms to exploit this new datatype. Findings: We describe a computational simulator that, given simple parameters and reference genome sequences, will simulate Hi-C sequencing on those sequences. The simulator models the basic spatial structure in genomes that is commonly observed in Hi-C and 3C datasets, including the distance-decay relationship in proximity ligation, differences in the frequency of interaction within and across chromosomes, and the structure imposed by cells. A means to model the 3D structure of randomly generated topologically associating domains is provided. The simulator considers several sources of error common to 3C and Hi-C library preparation and sequencing methods, including spurious proximity ligation events and sequencing error. Conclusions: We have introduced the first comprehensive simulator for 3C and Hi-C sequencing protocols. We expect the simulator to have use in testing of Hi-C data analysis algorithms, as well as more general value for experimental design, where questions such as the required depth of sequencing, enzyme choice, and other decisions can be made in advance in order to ensure adequate statistical power with respect to experimental hypothesis testing. Keywords: Hi-C; Meta3C; 3C; DNA sequencing; simulation; metagenomics For inferential software within scientific fields, the system- Findings level attributes of precision and accuracy are of primary Software testing interest, and their quantification is best accomplished by com- parison with a known truth (gold standard). Therefore, any test- To the casual observer, formal software testing is often thought to begin and end with the validation of fine-grained behavioural ing methodology capable of providing an a priori gold standard, particularly without estimation, improves this facet of testing (functional) aspects; such as the correct execution of individ- ual methods. In day-to-day use, however, what can matter significantly. Purpose-built bioinformatics software ultimately acts on ex- most to end users are broader system attributes such as speed, scalability, reproducibility, and ease of use. To ensure that a perimentally collected observations. The inherent noise and variation that comes with experimental data means that achiev- project offers maximum value, a thorough testing process would ing testing thoroughness is a great challenge. Ready access to collectively examine all aspects. Received: 1 June 2017; Revised: 18 September 2017; Accepted: 23 October 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 DeMaere and Darling sufficient data sources is a fundamental necessity for adequate of eukaryotic chromosomes [9]. In 2009, Lieberman-Aiden [10] software testing. reported an extension of the protocol to high-throughput se- For established experimental methods, public data archives quencing, enabling the global spatial arrangement of chromo- are a first choice for the necessary testing data. When high- somes to be reconstructed at unprecedented resolution. All 3C quality metadata are available, testing driven by real data protocols depend on an initial formalin fixation step, which becomes possible. However, even when sufficient depth and de- crosslinks proteins bound to DNA in vivo. Subsequently cells are scription of data are available, difficulty can remain in match- lysed and the DNA:protein complexes are sheared enzymati- ing desired test data characteristics to what actually exists in cally and/or physically to create free ends in the bound DNA 1 or several public datasets. Further, fine-grained whole-corpus strands. These free ends are then subjected to a proximity lig- querying of metadata on remote data archives is not always ation reaction, in which ligation of free ends preferentially oc- possible, frequently making the up-front job of data selection curs among DNA strands cobound in a protein complex. The a difficult task. Once selected, obtaining said real data can be DNA:protein crosslinks are then reversed, the DNA is purified, time-consuming or even infeasible in locations with lower net- and an Illumina-compatible sequencing library is constructed. work speeds and/or high bandwidth costs. In advancing fields In Hi-C protocols, the proximity ligation junctions can then be such as DNA sequencing, new experimental datatypes can ap- further purified in the sequencing library. pear for which the public data archives contain only a handful 3C-derived methods have found several applications beyond of examples, and few researchers would have the time and fi- their initial use to reconstruct 3D chromosome structure. For ex- nancial resources to commit to experimental generation of new ample, it has been shown that 3C-derived data provide a valu- data purely for software testing. able signal for genome scaffolding [11,12], as well as a signal that Though performance on real data is the ultimate arbiter of can support genome-wide haplotype phasing [13,14]. 3C-derived analytical value, advantaged by explicit control over its charac- data have also proven valuable for metagenomics, where ini- teristics, a faithful simulation of real data can act as a valuable tial studies on mock communities demonstrated that highly ac- proxy. Simulation-driven development and testing has proven to curate genome reconstruction in mixed microbial communities be a highly cost-effective and time-efficient approach. It offers could be facilitated by proximity ligation sequence data [15–17]. the possibility to explore a near continuum of data character- Subsequent application to naturally occurring microbial com- istics, subjecting software to an otherwise unavailable degree munities has also suggested that bacteriophage can be linked of testing thoroughness. Certainty and control make attaining to their hosts with this data type [18]. the twin objectives of rigorous testing and an a priori gold stan- In the remainder of this manuscript, we describe the Sim3C dard straightforward. This enables us not only to be more certain software and demonstrate how it can be used to simulate data about when we have failed, but also to extrapolate this process for various 3C-derived experiments. to infer the limits of success within the experimental parameter space. Experiment scenarios Tools for simulating DNA sequencing reads have existed from the very early days of genomics, beginning with the many Beyond simple monochromosomal genome sequencing exper- anonymous implementations of simple DNA shearing algo- iments, Sim3C offers support for the more complex scenarios rithms, up to the most recent highly detailed empirical model of multi-chromosomal genomes and metagenomes. A scenario simulators [1–4]. From read simulation in isolation, field ad- is defined by way of a community profile, assigning a copy- vancements such as metagenomics have been accompanied number and containing genome to each chromosome and a rel- soon after by simulators reflecting their specific data character- ative abundance to each genome. The profile and supporting istics and evolving experimental methodology [5–7]. reference sequences form a skeleton definition with which to We introduce Sim3C, a software package designed to simu- initialize the weighted random sampling process within a sim- late data generated by Hi-C and other 3C-based proximity lig- ulation. The user can elect to supply a profile either as an ex- ation (PL) sequencing protocols. The software includes flexi- plicit table (Figs 1 and 2) or allow Sim3C to draw abundances at ble support for a range of sequencing project scenarios and runtime from 1 of 3 distributions (equal abundance, uniformly choice of three 3C methods (Hi-C, Meta3C, DNase Hi-C). The re- random, log-normal distribution) for communities made up of sulting output (paired-end FastQ) is easily assimilated into ex- strictly mono-chromosomal genomes. isting analysis workflows. It is our intention that Sim3C pro- vide the Hi-C/3C research community with a means to further Error Modelling validate existing software projects, support new experimental or analysis development initiatives, and serve as a platform Sim3C models 3 forms of experimental noise: machine-based for exploration, such as the comparative analysis of clustering sequencing error, the formation of spurious ligation products, algorithms [8]. and the contamination of PL libraries with whole genome shot- gun (WGS) read-pairs. To simulate machine-based sequencing error, the paired- 3C sequencing end mode from art illumina [2] has been reimplemented as a 3C-based sequencing protocols, including Hi-C, 4C-seq, and Python module (Art.py). This approach was taken as delegat- Meta3C, have great potential to address questions directed at the ing read-pair generation to native invocations of art illumina spatial organization of DNA in samples ranging from eukaryotic proved cumbersome. More explicitly, a loosely coupled solution tissue to single cells to microbial communities. The growing use (via subprocess calls but without an interprocess communica- of these protocols creates a legitimate need for a simulator ca- tion mechanism) lacked sufficient control to generate PL read- pable of generating data with relevant characteristics. pairs in an efficient and robust manner. On the other hand, Chromosome conformation capture (3C) was originally de- tightly coupling Sim3C to the ART C/C++ source code (i.e., im- signed as a polymerase chain reaction–based assay to mea- plementing hooks) would have left Sim3C vulnerable to changes sure interactions among a small number of defined regions in a non-public external API (i.e., a codebase without formal Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Sim3C: simulation of Hi-C sequencing technology 3 Figure 1: A mock 2 genome community. For demonstration purposes, we assume that the plasmid (plas1) is present in 4 copies and that there is a 0.4/0.6 relative abundance split between the 2 organisms (bac1, bac2) in the community. Figure 2: A mock 4 chromosome genome. Cellular abundance is a constant across the profile, while chr4 exists in 2 copies. Note that relative abundances specified i n a profile are not required to sum to 1, but are normalized internally. definition or guarantee of stability). Reimplementation also braries to be subsequently enriched for fragments containing PL meant that Art’s many empirically derived machine profiles are junctions by streptavidin-mediated affinity purification. With- available for use by Sim3C, allowing equivalent treatment of ma- out enrichment, the simpler Meta3C protocol results in a gross chine error when experiments involve both PL (Sim3C) and pure mixture of both WGS and PL read-pairs, where only a small per- WGS (art illumina) libraries. centage of the total read-pair yield (approx. 1%) will possess PL The production of spurious ligation products is an inherent junctions [23]. The enrichment process within Hi-C, however, is source of noise in PL library construction [19]. Sim3C models not perfectly efficient, and WGS read-pairs are still observed (ap- spurious pairs as the uniformly random ligation of any 2 cut- prox. 10-50% of reads contain a PL product) [23]. DNase Hi-C re- sites across all source genomes. While this process disregards places restriction digest with a non-specific endonuclease (e.g., cellular organization, it respects the relative abundance of chro- DNase I) [24] or mechanical DNA shearing process (e.g., sonica- mosomes. Spurious pairs, and to a lesser extent sequencing er- tion) [20]. In this operational mode, Sim3C treats DNA cleavage ror, represent an important confounding signal to downstream as a completely unbiased (free) process, and as such all genomic analyses that attempt to infer the cellular or chromosomal or- positions have equal probability of participating in proximity lig- ganization of DNA sequences. ation events. Lastly, conventional WGS read-pairs represent a source of Within Sim3C, each of the 3 methodological variations is con- contamination within a PL library that even after Hi-C enrich- ceptualized as a sequencing strategy (figure 3), and each itera- ment steps, are not completely eliminated. The rates at which tion of a strategy produces 1 read-pair (PL or WGS in origin). For spurious and WGS read-pairs are injected into a simulation run all strategies, an iteration begins by drawing a 3-tuple of insert are controllable by the end-user. parameters: length, direction, and junction point (L , dir, x ). ins junc After obtaining insert parameters, the Hi-C strategy (Fig. 3a) first tests if the insert will represent a WGS or PL read-pair Simulation modes (∼Bern(p )), where efficiency p is defined in the sense of en- eff eff richment. When p = 1, there is perfect filtering and all WGS eff Since Hi-C was first introduced [ 10], the development of variants read-pairs are eliminated from the experiment. In the case of and extensions has been continual [17,20–22]. Variants have of- WGS, the iteration reaches an end-point and the simulation ten strived to further enhance the discriminatory power of the emits a conventional read-pair drawn from the community def- original experiment, while seemingly adding yet more complex- inition. In the case of PL, a cut-site 3-tuple is drawn (gen , ity to an already challenging protocol (in situ DNase Hi-C, sciHi- chr , x ), where the categorical distribution over chromosomes 1 1 C) [22]. Others instead have sought compromise, with the aim is weighted by relative abundances (A) and chromosomal copy- of lessening the burden on the laboratory (Meta3C). While not numbers (n ), genomic position is sampled uniformly from cpy considering more recent and complex extensions, Sim3C offers the set of restriction sites (sites(chr )), and the parent genome 3 simulation modes: traditional Hi-C, Meta3C, and DNase Hi- (gen ) is implicit from the chromosome. Next, a spurious liga- C. The first 2 of these modes were chosen as representing the tion test is performed (∼Bern(p )). If a spurious event has oc- spur fundamental basis (traditional Hi-C) and an attractive and prag- curred, the 3-tuple defining the second cut-site ( gen , chr , x ) 2 2 2 matic simplification of the original (Meta3C). The third mode is drawn i.i.d. as the first. If not spurious, next a test for inter- (DNase Hi-C) replaces the restriction endonuclease-driven pro- chromosomal (trans) ligation is performed. Only source chro- duction of the free ends, used to form PL products, with an ide- mosome and position (chr , x )need be drawnasthe second 2 2 ally free process of DNA fragmentation. In the laboratory, this genome is implicitly the same as the first ( gen = gen ). Here, 2 1 ideally free process could be carried out by DNase digestion or chr is selected without replacement from the set of chromo- mechanical shearing via sonication. somes of genome (gen ) where the categorical distribution is ad- The most notable difference between the methods of Hi- justed by removal of chr . Finally, an intra-chromosomal (cis) lig- C and the more recent Meta3C is that after restriction digest, ation must have occurred. As now both genome and chromo- Hi-C employs additional steps leading to the incorporation of bi- some are implicit (gen = gen , chr = chr ), all that is left is to 2 1 2 1 otin tags at each PL junction. This biotinylation permits Hi-C li- Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 DeMaere and Darling AB Figure 3: Logical schema used within Sim3C. (a) Hi-C and (b) Meta3C simulation strategies. Gold diamonds represent simple Bernoulli trials. Blue boxes represent sampling distributions defined by runtime input data (community profile, genomic sequences, enzyme) and the empirically derived distribution for in tra-chromosome (cis) interaction probability (equation 1). Logical end-points to a single iteration of either algorithm are represented as red (producing a WGS read-pair) and green boxes (producing a PL read-pair). Due to the elimination of the biotinylation step, Meta3C does not produce a duplication of the restriction cut-site overhang (grey boxes). draw genomic position x . The pair of positions (x , x ) is con- where β is a mixing parameter, α the geometric distribution 2 1 2 strained by their separation (s =|x − x |), which is represented shape parameter, and l chromosome length. 2 1 by a mixture model of the geometric and uniform distributions For Meta3C (Fig. 3 b) after insert parameters are determined, (equation 1). This relation possesses rapid falloff with increas- in the same fashion as a regular WGS read, an initial free ge- ing separation and non-zero probability for all chromosomal po- nomic position is drawn (chr , x ), uniformly distributed over the sitions, as has been commonly observed in real experimental extent of chr rather than only over its cut-sites. In real datasets, data [10,25]. it has been observed that neither the restriction digestion nor the re-ligation of free ends is perfectly efficient. Taken as in- dependent probabilities, in our model we conceptualize their Pr(X = s|α, β, l) = β(1 − α) α + (1 − β)/l, (1) Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Sim3C: simulation of Hi-C sequencing technology 5 AB Figure 4: Model details. Generation of proximity ligation inserts (a) involves joining 2 randomly drawn parts (red and blue), from which the read-pair (R1, R2) is then simulated. The junction point (x ) varies over the interval [0..L), and reproduction of read-through events is possible. For an unbounded chromosome (b) (circular junc here), besides strictly primary separation (black arrow), spatial proximity can be induced from successive folding (red, green arrows). When the spatial arrangement is consistent across the population of cells, this will be observable as modulations in the contact frequencies. Sim3C models simple structurally related modulation of observed contact frequencies (c). Beyond primary interactions forming the main diagonal, users can reproduce inter-arm-mediated anti-diagonals. Finer-scale modulations attributed to topologically associated domains can optionally be randomly simulated. Primary interactions f (s|θ ) (equation 1) cover the full interval [0, 0 0 L). Each level of recursion (d = 1, 2. . . n) generates a finer set of intervals, to which a distribution f (s|θ ) and probability p are assigned. The final covering of intervals i i i each define a range (green, curly braces) over which a set of probabilities and empirical distribution pairs govern interaction separation s. joint occurrence as an efficiency factor, p , and a Bernoulli trial occur along a chromosome (intra-arm) (Fig. 4b), seen as the pri- eff (Bern(p )) determines whether a sequence read is successful in mary (y  x) diagonal in the contact map. Sim3C can approxi- eff containing an observable proximity ligation event. Failing this mate the less frequent interactions occurring between chromo- coverage test relegates the iteration and end-point and emits somal arms (inter-arm) [26], which are visible as anti-diagonal a WGS read-pair. Successful candidates instead continue akin (y  L − x) in the contact map. to the Hi-C decision tree, beginning with the test for spurious At progressively smaller scales, the hierarchical 3D folding ligation. of DNA into topologically associated domains (TADs) produces For both Hi-C and Meta3C, PL read-pairs are produced by join- overlapping regions of interaction visible in the contact map as ing the free ends drawn above as defined by the fragment param- block-like intensity modulations. Though the agents responsible eters (Fig. 4a). Here the location of the PL junction within the for their formation vary [27,28], the characteristic patterns evi- insert is determined by x . At the junction, Hi-C differs from dent in real data–derived 3C contact maps have been observed junc Meta3C as the process of biotinylation results in the duplication across all 3 domains [25,26,29]. Sim3C can optionally approxi- of the restriction cut-site overhang sequence. The overhang du- mate the sense of TAD-related modulation by means of a recur- plication in Hi-C is included in the simulation. sive stochastic process. DNase Hi-C is handled similarly to traditional Hi-C, with the Our approximation of hierarchical folding begins from the exception that, as in silico digestion trivially leads to all sites, full extent L of a chromosome (Fig. 4c). Folding is portrayed by the simulated digestion is unnecessary to perform and positions the division of the interval [0..L) into a set of non-overlapping can be drawn directly from the uniform distribution over the sub-intervals {[0, x ), [x , x ), ..., [x , x )}, the number and 1 1 2 n − 1 n interval [0..L ). Site duplication, attributable to the likely pro- widths of which are drawn at random (U(l , l ), U(n , n )). chr min max min max duction of random overhangs in this scenario, is not presently The procedure is then recursively applied to each sub-interval simulated. until a depth d, producing a nested set of coverings of the full interval [0..L) at progressively finer scales. Across this hierarchi- cal collection, each interval is assigned a uniformly distributed Structurally related interactions random probability p and empirical distribution f (s|θ ) (equa- i i i tion 1) for separation s, parameterized by shape parameter α TAD Independent of any 3D structure that might exist, the pri- and interval length l = x − x ,where θ = (α , β, l ). TAD inv i + 1 i inv mary and most frequently observed interactions are those that Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 DeMaere and Darling Table 1: Real Hi-C and Meta3C data-sets used within this work Authors Type Method Accession Sequencing Mapped details reads Beitel et al. [15] Synthetic bacterial metagenome Hi-C SRX377733 MiSeq 160bp PE insert 20552775 range: 280-420bp enzyme: HindIII Burton et al. [16] Synthetic yeast metagenome Hi-C SRX527868 HiSeq2500 100bp PE insert 9704944 range: 450-550bp enzyme: HindIII Le et al. [26] Single bacterial genome Hi-C SRX263925 HiSeq2000 40bp PE insert 22324360 range: 200-600bp enzyme: NcoI Marbouty et al. [41] Synthetic bacterial metagenome Meta3C doi:10.5061/ HiSeq2000 100bp PE insert 7975740 dryad.gv595 range: 400-800bp enzyme: HpaII The total off-diagonal weight of the contact map was used to calibrate the amount of simulated sequencing required to approximately match the outcome of the real experiments. The process of drawing samples of separation begins by de- comparison. Due to the smaller extent, a bright and high- termining the set of intervals {l } that contain an initial point resolution contact map (10-kbp bin size) is possible for a prac- inv x . The intervals, as tuples (p , f (s|θ )), then form a categorical dis- tical volume of sequencing data, potentially revealing fine detail 0 i i i tribution (equation 7), from which a governing distribution f (s|θ ) not easily discerned with larger bin sizes (50–100-kbp bin size). i i is drawn, and finally a sample of separation is taken, s ∼ f (s|θ ). The genome of Caulobacter crescentus NA1000, a model or- i i To efficiently sample from the full collection, an interval-tree ganism in the study of cellular differentiation and regulation data structure is employed. When queried, an interval-tree re- of the cell cycle, is comprised of a single 4-Mbp circular chro- turns the set of intervals {l} overlapping a position x in order mosome [30]. Deep Hi-C sequencing of C. crescentus has been O(log n + m), where n is number of intervals and m is number of used to explore the degree to which bacterial chromosomes intervals returned by the query. can be regarded as organized and provided evidence for the ex- istence of so-called chromosomal interaction domains (CIDs) [26]. As a prokaryotic analog of topologically associated domains f = f (s|θ ), f (s|θ ), ··· , f (s|θ ) (2) 0 0 1 1 i i from eukaryotic literature [28,31–32], these regions are believed to promote intra-domain loci interactions and thereby act to N = number of distributions = f (3) functionally compartmentalize the genome. This chromosomal structure was observed to be at once disruptable through rifampicin-mediated inhibition of transcription and malleable p = {p , p , ··· , p } (4) 0 1 i by the movement of highly expressed genes [26]. Forthe rawcontact mapof C. crescentus, prominent rectilin- ear features are apparent for both real and simulated traditional p ∼ U(0, 1) and p =1(5) i i Hi-C sequencing data (Fig. 5a,b), while notably for simulated unrestricted Hi-C the field is much smoother (Fig. 5c). Within the Sim3C model, a single distribution governs both intra- and n ∼ Cat(N, p)(6) inter-arm interactions. Inspection of the real-data contact map (Fig. 5a) suggests that the true relationship governing inter- N−1 arm interactions is more dispersed. This perhaps is not surpris- [i =n] f (s|n) = f (s|θ ) , (7) i i ing, where different arms associating spatially possess a greater i =0 number of potential configurations than can be taken on by the primary chromosome backbone. Additionally, for the real con- where [i = n] is the Iverson bracket. tact map, long-range interactions away from either diagonal can be seen to drop to a lower threshold than that produced from Example scenarios simulation. Within the unrestricted Hi-C map, the fine zero-intensity rec- In the following, 3 use cases are presented to demon- tilinear features are a direct result of poor mappability (non- strate aspects of the resulting simulation output: bacterial unique sequence), where their small size reflects the extent of genome, multi-chromosomal eukaryotic (yeast) genome, and the non-unique regions (example: rRNA genes) and the single metagenome. For each use case, 3C contact maps have been base-pair resolution of the less constrained read generation pro- used to pit simulation output against the corresponding real cess. The process of enzymatic digestion is the only difference experimental data (table 1). between the unrestricted and traditional Hi-C simulation mod- els. The clear contrast in their contact maps is thus a combina- Bacterial tion of factors either directly inherent to digestion (cut-site den- sity) or a byproduct of downstream bioinformatics analysis (e.g., A monochromosomal bacterial genome is perhaps the simplest filtering heuristics). Though the problem of mappability exists scenario to which proximity ligation methods have been ap- for any reference-based representation, for real and simulated plied, making for a sensible entry point from which to make Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Sim3C: simulation of Hi-C sequencing technology 7 Figure 5: Bacterial contact maps. Observed Hi-C interactions for the monochromosomal genome of Caulobacter crescentus NA1000. Comparing (a) real experimental data [26] with the 3 simulation choices (b) traditional Hi-C, (c) DNase Hi-C, and (d) traditional Hi-C with TADs enabled. Sharp rectilinear modulations of the intensity within (a) and (b) indicate a reduction in PL observations within a given bin. Not due to 3D chromosome structure, rather such features can be attributedlargely to mappability and low cut-site density. (c) Without an enzymatic constraint, a significantly smoother field is apparent, yet still susceptible to mappa bility. (d) Enabling topologically associated domains highlights the similarity between features produced merely from biases and what could be truly associated with 3D structure. traditional Hi-C, zero-intensity rectilinear features mark regions yeasts included in a synthetic community to explore the appli- devoid of cut-sites over at least 10 kbp. cation of Hi-C sequencing to deconvolving metagenomic assem- Enabling TAD approximation in simulated traditional Hi-C blies [16], and it is divergent enough from other synthetic com- (Fig. 5d) has the effect of modulating map intensity in a manner munity members to permit unambiguous read mapping, and not particularly distinct from that produced purely from exper- thus act as a proxy for a clonal experiment. imental/workflow bias. Discriminating between these 2 feature From the contact map of real Hi-C data (Fig. 6a), it can be seen sources—one representing experimental signal, the other rep- that the rates of intra-chromosomal and inter-chromosomal in- resenting noise—demands attention when developing solutions teractions are roughly equivalent in magnitude. Across the 8 to problems such as normalization. Contact map normalization chromosomes of S. stipitis, there is significant uniformity in the methods, whether based upon explicit or implicit bias models degree of physical intimacy within and between all chromo- [33], may leave behind remnants of noise-related features from somes. The subtleties of this chromosomal organization reveal either a lack of convergence or model limitations. Downstream a self-similar “fuzzy-x” pattern repeated between all chromo- inferencing should therefore not be made under an assumption somes across the contact map. The convergence point within of bias-free signal. the pattern is attributed to centromere-SPB binding and has been used to predict centromere locations [35]. It has been shown that the physical constraints generated from the in- Eukaryotic teraction of centromeres to the spindle pole body (SPB) and telomeres to the nuclear envelope are sufficient to explain a The 8 chromosomes of the 15.4-Mbp genome of the native number of experimental observations in real data [36,37]. As xylose-fermenting yeast Scheffersomyces stipitis CBS 6054 [34] Sim3C was derived from the study of bacterial datasets, our range in size from 970 kbp to 3.5 Mbp. The organism was 1 of 16 Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 8 DeMaere and Darling Figure 6: Eukaryotic contact maps. Observed Hi-C interactions of (a) real and (b) simulated data from the 8 chromosome genomes of the budding yeast Scheffersomyces stipitis CBS 6054 [16]. Grey dashed lines and alternating light and dark grey axes demarcate the boundaries between chromosomes. (b) Simulated data elicit a flat field, and the clearly evident higher rate of intra- to inter-interactions makes for easily observable chromosomal boundaries within the map. (a) Contrastingly for real data, the similar rates of intra-chr and inter-chr interactions reveal the physical constraints imposed by centromere-SPB tethering on all 8 chromosomes [35]. simulation model does not currently include a notion of these As with single-genome experiments, metagenomic contact higher organism physical constraints. Consequently, the contact maps are locally modulated by factors such as mappability and map derived from simulated traditional Hi-C sequencing elicits cut-site density. Importantly now for metagenomes, the factors a flat field (Fig. 6b), where the intensity variation that does ex- of relative abundance and GC content interact to alter the ob- ist is a byproduct of aforementioned factors such as mappability served intensity of each chromosome within the contact map. and cut-site density. For the runtime parameters employed, the As a first approximation and assuming agreement in nu- rate of intra-chromosomal contact is higher than that of inter- cleotide sampling frequency, we expect n = L/4 recognition chromosomal, making clear the boundaries between the 8 chro- sites for an enzyme of site length λ and DNA sequence length mosomes (Fig. 6b). Though our model is presently incomplete L. The degree to which an enzyme and DNA sequence deviate for higher organisms, there remains a potential utility as an an- from this estimate could be described as how well they match, alytical or simply observational prior. m = n /n . Poorer quality matches (m < 1) occurwhenanen- x 0 zyme’s recognition site is underrepresented, while conversely, better quality matches (m > 1) describe a situation of more recog- nition sites than expected. Metagenomic When multiple chromosomes are taken as a community, the relative proportion of sites from each represents an observa- In the deconvolution of metagenomes, proximity ligation meth- tional bias when conducting 3C-based experiments. For com- ods hold great potential as new sources of information and munity C, the number of sites n from chromosome x deter- have been investigated by the construction and sequencing of mines the number of potential PL pairings N within C that synthetic communities [15–17]. We selected 2 previously con- x (equation 8). The number of intra-chromosomal and inter- structed synthetic bacterial communities, 1 employing tradi- chromosomal potential pairs thus respectively vary quadrat- tional Hi-C and the other Meta3C (table 1). Intended as “proof of ically and linearly with n . Regarding the process of observ- concept” experiments, neither community reflects a real envi- ing a PL event (read-pair) from the community as a random ronment, but rather they were intended to be easily interpreted draw with replacement, and the selection pool as comprised of and include interesting features, such as range of GC, single and all potential events from all chromosomes, variation in match multi- chromosomal genomes, and strain-level divergence. The quality constitutes a per-chromosome bias. In real laboratory Hi-C community involved 5 genotypes from 4 species, 1 genome experiments, the composition of the selection pool is further of 2 chromosomes (B. thailandensis), E. coli strains BL21 and K12 modified by variation in other factors, such as cellular lysis (average nucleotide identity [ANI], 99%), and a wide overall GC efficiency, unintended DNA fragmentation, and relative abun- range of 37–68% (Table 2). Of lower complexity, the Meta3C com- dance. In particular, when relative abundances A are introduced, munity involved 3 genomes from 3 species, included 1 genome the odds of observing a PL event involving chromosome x are of 2 chromosomes (V. cholerae), and had a narrower GC range then proportional to the product p ∝ A N /N . Although the x x x C of 44–51% (table 3). Relative to the single genome experiments processes of intra-chromosomal, inter-chromosomal, and inter- above, a lower depth of sequencing resulted in a lower over- cellular (spurious) ligation are treated independently in our sim- all contact map intensity (Fig. 7). This is particularly the case ulation model, in this manner, per-chromosome intensity (ob- for Meta3C, where, by the nature of the method, a large pro- servation rate of chromosome x) can vary significantly within a portion (approx. 99%) of the sequencing yield is in reality con- metagenome. ventional WGS read-pair data [17]. As a direct result, in binning the Meta3C dataset, there were insufficient counts to fully es- tablish finer detail within the contact maps, leaving a smoother 2 N = n + n n (8) x x y appearance. n ∈C \n y x Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Sim3C: simulation of Hi-C sequencing technology 9 Table 2: Synthetic Hi-C community Name Replicons Accession Chr abbr. An %GC n m cpy x NC 007651 Bt1 67.29 225 0.24 Burkholderia 2 0.054 1 NC 007650 Bt2 68.07 144 0.20 thailandensis E264 Escherichia coli BL21 1 NC 012892 BL21 0.242 1 50.83 508 0.46 Escherichia coli K12 1 NC 010473 K12 0.166 1 50.78 568 0.50 DH10B NC 008497 Lb 46.22 629 1.12 Lactobacillus brevis 3NC 008498 - 0.436 1 38.64 3 0.92 ATCC 367 NC 008499 - 38.51 16 1.84 Pediococcus pentosaceus 1NC 008525 Pp 0.102 1 37.36 863 1.93 ATCC 25745 A synthetic community used to demonstrate the utility of Hi-C sequencing data in resolving a microbial metagenome [15]. It is composed of 5 bacteria, including 2 closely related strains (E. coli K12 and BL21), a genome with 2 plasmids (L. brevis), and a 2-chromosome genome (B. thailandensis). A is relative abundance, n is copy cpy number, n is number of restriction sites, and m = n /n is match quality between chromosome and enzyme choice: m < 1 is worse, m > 1 is better. x x 0 Table 3: Synthetic Meta3C community Name Replicons Accession Chr abbr. An %GC n m cpy x Bacillus subtilis subsp. 1 NC 000964 Bs 0.123 1 43.51 14529 0.88 subtilis str. 168 Escherichia coli str. 1 NC 000913 K12 0.562 1 50.79 24311 1.34 K-12 substr. MG1655 NC 002505 Vc1 47.70 5909 0.51 Vibrio cholerae O1 2 0.332 1 NC 002506 Vc2 46.91 1802 0.43 biovar El Tor str. N16961 A synthetic community used to demonstrate the utility of Meta3C sequencing data in resolving a microbial metagenome [17,41]. It is composed of 3 bacteria with 1 possessing 2 chromosomes. A is relative abundance, n is copy number, n is number of restriction sites, and m = n /n is match quality between chromosome and cpy x x 0 enzyme choice: m < 1 is worse, m > 1 is better. Though the original laboratory experiments reported by Bei- domly generated approximations of self-associating domains tel et al. [15] and Marbouty et al. [17] intended to create synthetic (CIDs/TADs). Sim3C does not model structural features observed communities with uniform relative abundances, in practice in larger, more complex genomes (CTCF/cohesin loops, A/B com- each possesses a non-uniform profile. The variation in GC con- partments, chromosome territories) [10,38]. Such features are tent is largest for the Hi-C experiment and, together with non- becoming increasingly well characterized [39], and a simulator uniform relative abundances, produces a wide range of chromo- capable of modelling these features would surely be valuable. some intensity for both real and simulated data (Fig. 7a,b). For Mammalian genomes are much larger than microbial genomes, both the real and simulated Hi-C maps, the frequent observa- however, and additional work to improve the scalability of tion of PL events involving P. pentosaceus (Pp) and L. brevis (Lb) Sim3C will likely be required. suggests the possibility that inter-cellular interaction is signifi- Some features of microbial eukaryotes, such as the point cen- cant. Within the simulated map at least, inter-cellular pairs are tromeres found in budding yeast genomes [40], are computa- produced exclusively through the process of spurious ligation tionally simpler [35,36] yet remain unmodelled in Sim3C. The (noise) and are observed at a higher rate than in the real data, in- addition of these sorts of model details would be best supported dicating that, as expected, spurious ligation rates across species by introducing model initialization via external data (experi- are correlated with their relative abundances. mental observations, motif detection, cell phase), which subse- Further for the Hi-C data, the 2-chromosome genome of quently would require extension of the community profile defi- B. thailandensis (Bt1, Bt2) (Fig. 7a) has a greater rate of inter- nition. Careful design would be required to ensure these features chromosomal interaction than expected from comparing it with could be added without compromising ease of use. simulation (Fig. 7b). Meanwhile, the clear delineation of E. coli strains BL21 and K12 (ANI > 99%), with little inter-cellular sig- Methods nal, helps to support the notion that the inter-chromosomal in- teractions observed between B. thailandensis chromosomes (ANI Reference Data 83%) are real and not a by-product of inadequate filtering. To compare Sim3C against real experiments, we obtained pre- viously published experimental read-pair datasets (Table 1)and Limitations and future work their accompanying reference genomes (Tables 2, 3) from public Sim3C in its current form has several limitations, some of archives. In the case of the single genome project of Caulobac- which present opportunities for future work. Sim3C’s reper- ter crescentus CB15 [26], sequencing data derived from untreated toire of structural features is currently limited to those swarmer cells was chosen and the laboratory strain C. crescen- found in microbes—circular and linear chromosomes with ran- tus NA1000 (acc: NC 011916) was used as the reference genome. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 10 DeMaere and Darling Figure 7: Metagenomic contact maps. From synthetic microbial communities, raw contact maps from real (a) and simulated (b) traditional HiC, and real (c) and simulated (d) Meta3C. Chromosome boundaries are demarcated by alternating light and dark grey bands (Tables 2, 3), while the small plasmids of L. brevis are omitted for clarity. Although the original works [15,17] intended uniform abundance, the results exhibit significant variation in abundance. Lysis efficiency (not modelled) and enzyme suitability are significant factors contributing to the overall intensity of a given chromosome. For more abundant members of the Hi-C comm unity (P. pentosaceus and L. brevis), signal due only to spurious ligation can appear to suggest inter-cellular interactions when none are present (b). For the yeast genome, the completed 8-chromosome genome dances were estimated by mapping real experimental reads to of Scheffersomyces stipitis CBS 6054 was used as a reference (acc: the respective reference genomes. From each real experiment, PRJNA18881), and the respective reads were extracted from the the off-diagonal weight of the resulting contact map was used MY16 yeast synthetic metagenome [16] by direct mapping with to calibrate the amount of simulated sequencing required to BWA MEM. Extraction by mapping in isolation was employed achieve roughly equivalent intensity (Table 4). Both real and sim- as S. stipitis was the second furthest phylogenetically removed ulated read-pair datasets were mapped to their respective refer- yeast in the synthetic community and was the most contigu- ence genomes using BWA MEM (v0.7.15-r1140, RRID:SCR 010910) ous (N50: 60 kbp) from the whole synthetic community de novo [42]. metagenomic WGS assembly. Contact maps Read generation Contact maps were produced using our own tool Experimental parameters used in read simulation were set to (contact map.py), where heatmap intensity was plotted as agree as closely as reasonably possible to the respective real ex- log-scaled observational frequency. All aligned reads were sub- periments, employing the same read length and restriction en- ject to the same basic filtering criteria: BWA MEM mapq >5and zyme (Table 1). In each experiment, the published fragment size alignment length ≥50% of read length, with the added restric- range was approximated by a normal distribution (Table 4). For tion that read alignments must have begun with a match. For ease of reproducibility, a single random seed (1234) was used methods that employed a restriction enzyme (traditional Hi-C, in all simulations. As our intent was primarily to demonstrate Meta3C), we constrained the maximum allowable distance from functionality, rates of inter-chromosomal and spurious events an aligned read to the nearest upstream cut-site. Calculated were adjusted per-experiment only through a qualitative pro- per chromosome, this distance constraint could not exceed cess. For simulation of metagenomic datasets, relative abun- 2-fold the median cut-site spacing. Rather than simply delete Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Sim3C: simulation of Hi-C sequencing technology 11 Table 4: Runtime simulation Experiment Insert μ (bp) Insert σ (bp) Anti rate Spurious rate Trans rate Reads (×10 ) Beitel et al. 300 50 0.2 0.05 0.1 7 Burton et al. 400 50 0.2 0.5 0.15 1.5 Le et al. 400 100 0.2 0.2 0.1 22 Marbouty et al. 600 100 0.2 0.2 0.2 7.5 Parameters supplied to Sim3C during read generation. the primary diagonal for the sake of reducing the displayed https://www.education.gov.au/education-investment-fund dynamic range in figures, we instead reduced its intensity by https://www.education.gov.au/national-collaborative categorizing properly paired reads with an estimated fragment -research-infrastructure-strategy-ncris size of less than 2 of the reported mean as being conventional WGS (non-PL) reads and ignored them. The resolution of contact Authors contributions maps was adjusted between experiments so as to present a sufficiently bright image without undue loss of resolution. The M.D. designed and implemented Sim3C and wrote the contact map bin sizes employed were: 10 000 bp for the single manuscript and prepared figures. A.D. assisted in the de- bacterial genome, 25 000 bp for the yeast genome, and 40 000 bp sign and contributed to the manuscript. for the Hi-C and Meta3C metagenomes (Tables 2, 3). Acknowledgements Availability of data and materials We thank Steven P. Djordjevic for his support and helpful discus- sions. This work was supported by the AusGEM initiative, a col- Snapshots of the supporting code are available from the Giga- laboration between the NSW Department of Primary Industries Science repository, GigaDB [43]. and the Ithree Institute. We acknowledge the use of computing resources from the NeCTAR Research Cloud, the QCIF and the Availability of supporting source code and UTS eResearch Group. requirements http://www.nectar.org.au Project name: Sim3C http://www.qcif.edu.au Release version: 0.1 https://eresearch.uts.edu.au Project homepage: https://github.com/cerebis/sim3C RRID: SCR 015772 References DOI: 10.5281/zenodo.1030812 Operating system: platform independent 1. Li H. lh3/wgsim. 2011. https://github.com/lh3/wgsim. Ac- Programming languages: Python 2.7 cessed 21 March 2017. License: GNU GPL v3 2. Huang W, Li L, Myers JR et al. ART: a next-generation sequencing read simulator. Bioinformatics 2012; 28(4): 593–4. Abbreviations 3. Ono Y, Asai K, Hamada M. PBSIM: PacBio reads simulator– PL: proximity ligation; toward accurate genome assembly. Bioinformatics 2013; WGS: whole genome shotgun 29(1):119–21. CID: chromosomal interaction domain 4. Hu X, Yuan J, Shi Y et al. pIRS: Profile-based Illumina pair-end TAD: topologically associated domain reads simulator. Bioinformatics 2012; 28(11):1533–5. Bern(x): Bernoulli distribution 5. Jia B, Xuan L, Cai K et al. NeSSM: a next-generation U(x): uniform distribution sequencing simulator for metagenomics. PLoS One N(μ, σ ): normal distribution 2013;8(10):e75448. cis: intra-chromosomal 6. Angly FE, Willner D, Rohwer F et al. Grinder: a versatile am- trans: inter-chromosomal plicon and shotgun sequence simulator. Nucleic Acids Res 2012;40(12):e94–e94. 7. Richter DC, Ott F, Auch AF et al. MetaSim: a sequenc- Competing interests ing simulator for genomics and metagenomics. PLoS One The authors declare that they have no competing interests. 2008;3(10):e3373. 8. DeMaere MZ, Darling AE. Deconvoluting simulated metagenomes: the performance of hard- and soft- clus- Funding tering algorithms applied to metagenomic chromosome This work was supported under the Australian Research conformation capture (3C). PeerJ 2016;4(4):e2676. Council’s Discovery Projects funding scheme (project number: 9. Dekker J, Rippe K, Dekker M et al. Capturing chromosome LP150100912, CI: S.P. Djordjevic). The NeCTAR Research Cloud conformation. Science 2002; 295(5558):1306–11. is an Australian Government project conducted as part of the 10. Lieberman-Aiden E, van Berkum NL, Williams L et al. Super Science Initiative and financed by the Education Invest- Comprehensive mapping of long-range interactions reveals ment Fund (EIF) and National Collaborative Research Infrastruc- folding principles of the human genome. Science 2009; ture Strategy (NCRIS). 326(5950):289–93. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018 12 DeMaere and Darling 11. Burton JN, Adey A, Patwardhan RP et al. Chromosome-scale 28. Acemel RD, Maeso I, Gomez-Skarmeta ´ JL. Topologically as- scaffolding of de novo genome assemblies based on chro- sociated domains: a successful scaffold for the evolution of matin interactions. Nat Biotechnol 2013; 31(12):1119–25. gene regulation in animals. Wiley Interdiscip Rev Dev Biol 12. Dudchenko O, Batra SS, Omer AD et al. De novo assembly of 2017, doi: 10.1002/wdev.265. the Aedes aegypti genome using Hi-C yields chromosome- 29. Sexton T, Yaffe E, Kenigsberg E et al. Three-dimensional fold- length scaffolds. Science 2017; 356(6333):92–5. ing and functional organization principles of the Drosophila 13. Selvaraj S, R Dixon J, Bansal V et al. Whole-genome hap- genome. Cell 2012; 148(3):458–72. lotype reconstruction using proximity-ligation and shotgun 30. Marks ME, Castro-Rojas CM, Teiling C et al. The genetic basis sequencing. Nat Biotechnol 2013(12):1111–8. of laboratory adaptation in Caulobacter crescentus. J Bacte- 14. Korbel JO, Lee C. Genome assembly and haplotyping with Hi- riol 2010; 192(14):3678–88. C. Nat Biotechnol 2013; 31(12):1099–101. 31. Nora EP, Lajoie BR, Schulz EG et al. Spatial partitioning of 15. Beitel CW, Lang JM, Korf IF et al. Strain- and plasmid-level de- the regulatory landscape of the X-inactivation centre. Nature convolution of a synthetic metagenome by sequencing prox- 2012; 485(7398):381–5. imity ligation products. PeerJ 2014;2(12):e415. 32. Pope BD, Ryba T, Dileep V et al. Topologically associating do- 16. Burton JN, Liachko I, Dunham MJ et al. Species-level decon- mains are stable units of replication-timing regulation. Na- volution of metagenome assemblies with Hi-C-based con- ture 2014; 515(7527):402–5. tact probability maps. G3 2014; 4(7):1339–46. 33. Schmitt AD, Hu M, Ren B. Genome-wide mapping and anal- 17. Marbouty M, Cournac A, Flot JF et al. Metagenomic chro- ysis of chromosome architecture. Nat Rev Mol Cell Biol 2016; mosome conformation capture (meta3C) unveils the diver- 17(12):743–55. sity of chromosome organization in microorganisms. Elife 34. Jeffries TW, Grigoriev IV, Grimwood J et al. Genome sequence 2014;3(e03318):e03318. of the lignocellulose-bioconverting and xylose-fermenting 18. Marbouty M, Baudry L, Cournac A et al. Scaffolding bacterial yeast Pichia stipitis. Nat Biotechnol 2007; 25(3):319– genomes and probing host-virus interactions in gut micro- 26. biome by proximity ligation (chromosome capture) assay. Sci 35. Varoquaux N, Liachko I, Ay F et al. Accurate identification of Adv 2017;3(2):e1602105. centromere locations in yeast genomes using Hi-C. Nucleic 19. Nagano T, Varnai ´ C, Schoenfelder S et al. Comparison of Hi-C Acids Res 2015; 43(11):5331–39. results using in-solution versus in-nucleus ligation. Genome 36. Gong K, Tjong H, Zhou XJ et al. Comparative 3D genome Biol 2015; 16:175. structure analysis of the fission and the budding yeast. PLoS 20. Huang PYH, Han Y, Handoko L et al. Protocol: sonication- One 2015;10(3):e0119672. based circular chromosome conformation capture with 37. Wong H, Marie-Nelly H, Herbert S et al. A predictive compu- next-generation sequencing analysis for the detection of tational model of the dynamic 3D interphase yeast nucleus. chromatin interactions. Protocol Exchange 2010, doi:10.1038/ Curr Biol 2012; 22(20):1881–90. protex.2010.207. 38. Rao SSP, Huntley MH, Durand NC et al. A 3D map of the 21. Ramani V, Cusanovich DA, Hause RJ et al. Mapping 3D human genome at kilobase resolution reveals principles of genome architecture through in situ DNase Hi-C. Nat Protoc chromatin looping. Cell 2014; 159(7):1665–80. 2016; 11(11):2104–21. 39. Stevens TJ, Lando D, Basu S et al. 3D structures of individ- 22. Ramani V, Deng X, Qiu R et al. Massively multiplex single-cell ual mammalian genomes studied by single-cell Hi-C. Nature Hi-C. Nat Methods 2017; 14(3):263–6. 2017; 544(7648):59–64. 23. Liu M, Darling A. Metagenomic Chromosome Conforma- 40. Cottarel G, Shero JH, Hieter P et al. A 125-base-pair CEN6 DNA tion Capture (3C): techniques, applications, and challenges. fragment is sufficient for complete meiotic and mitotic cen- F1000Res 2015; 4(1377):1–9. tromere functions in Saccharomyces cerevisiae. Mol Cell Biol 24. Ma W, Ay F, Lee C et al. Fine-scale chromatin interaction 1989; 9(8):3342–3349. maps reveal the cis-regulatory landscape of human lincRNA 41. Marbouty M, Cournac A, Flot JF et al. Data from: Metage- genes. Nat Methods 2015; 12(1):71–8. nomic chromosome conformation capture (meta3C) un- 25. Dixon JR, Selvaraj S, Yue F et al. Topological domains in mam- veils the diversity of chromosome organization in mi- malian genomes identified by analysis of chromatin interac- croorganisms. 2014; https://dryad2.lib.ncsu.edu/resource/ tions. Nature 2012; 485(7398):376–80. doi:10.5061/dryad.gv595, Accessed January 8, 2018. 26. Le TBK, Imakaev MV, Mirny LA et al. High-resolution map- 42. Li H. Aligning sequence reads, clone sequences and assem- ping of the spatial organization of a bacterial chromosome. bly contigs with BWA-MEM. arXivorg 2013; q-bio.GN. Science 2013; 342(6159):731–4. 43. DeMaere MZ, Darling AE (2017): Software for “Sim3C: 27. Badrinarayanan A, Le TBK, Laub MT. Bacterial chromosome simulation of Hi-C and Meta3C proximity ligation se- organization and segregation. Annu Rev Cell Dev Biol 2015; quencing technologies.” GigaScience Database. http://dx.doi. 31(1):171–99. org/10.5524/100368, Accessed January 8, 2018. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4628124 by Ed 'DeepDyve' Gillespie user on 16 March 2018

Journal

GigaScienceOxford University Press

Published: Feb 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off