A. Quinlan, Ira Hall
Bedtools: a Flexible Suite of Utilities for Comparing Genomic Features. Bioinformatics
Shaun Mahony, Matthew Edwards, E. Mazzoni, R. Sherwood, Akshay Kakumanu, Carolyn Morrison, H. Wichterle, D. Gifford (2014)
An Integrated Model of Multiple-Condition ChIP-Seq Data Reveals Predeterminants of Cdx2 Binding. PLoS Computational Biology, 10
ENCODEConsortium, Martin Min (2012)
An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature, 489
Yuanyuan Li, D. Umbach, Leping Li (2014)
T-KDE: a method for genome-wide identification of constitutive protein binding sites from multiple ChIP-seq data sets. BMC Genomics, 15
Charles Grant, Timothy Bailey, William Noble (2011)
FIMO: scanning for occurrences of a given motif. Bioinformatics, 27
C. Bergman, J. Carlson, S. Celniker (2005)
Drosophila DNase I footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila melanogaster. Bioinformatics, 21(8)
Geetu Tuteja, P. White, J. Schug, K. Kaestner (2009)
Extracting transcription factor targets from ChIP-Seq data. Nucleic Acids Research, 37
N. Rashid, P. Giresi, J. Ibrahim, Wei Sun, J. Lieb (2011)
ZINBA integrates local covariates with DNA-seq data to identify broad and narrow regions of enrichment, even within amplified genomic regions. Genome Biology, 12
G. Celeux, G. Govaert (1995)
Gaussian parsimonious clustering models. Pattern Recognit., 28
Anthony Mathelier, Xiaobei Zhao, Allen Zhang, F. Parcy, R. Worsley-Hunt, David Arenillas, Sorana Buchman, Chih-Yu Chen, A. Chou, Hans Ienasescu, Jonathan Lim, C. Shyr, Ge Tan, Michelle Zhou, B. Lenhard, A. Sandelin, W. Wasserman (2013)
JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Research, 42
Haipeng Xing, Yifan Mo, W. Liao, Michael Zhang (2012)
Genome-Wide Localization of Protein-DNA Binding and Histone Modification by a Bayesian Change-Point Method with ChIP-seq Data. PLoS Computational Biology, 8
G. Crawford, I. Holt, J. Whittle, B. Webb, Denise Tai, S. Davis, E. Margulies, Yidong Chen, J. Bernat, D. Ginsburg, Daixing Zhou, Shujun Luo, T. Vasicek, M. Daly, T. Wolfsberg, F. Collins (2005)
Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Research, 16(1)
H. Xu, Lusy Handoko, Xueliang Wei, Chaopeng Ye, Jianpeng Sheng, Chia-Lin Wei, F. Lin, W. Sung (2010)
A signal-noise model for significance analysis of ChIP-seq with negative control. Bioinformatics, 26(9)
M. Guenther, S. Levine, L. Boyer, R. Jaenisch, R. Young (2007)
A Chromatin Landmark and Transcription Initiation at Most Promoters in Human Cells. Cell, 130
Parameswaran Ramachandran, Gareth Palidwor, C. Porter, T. Perkins (2013)
MaSC: mappability-sensitive cross-correlation for estimating mean fragment length of single-end short-read sequencing data. Bioinformatics, 29
Ben Langmead, S. Salzberg (2012)
Fast gapped-read alignment with Bowtie 2. Nature Methods, 9
Anirudh Natarajan, G. Yardımcı, Nathan Sheffield, G. Crawford, U. Ohler (2012)
Predicting cell-type–specific gene expression from regions of open chromatin. Genome Research, 22
Mei Zhong, W. Niu, Z. Lu, M. Sarov, J. Murray, J. Janette, D. Raha, Karyn Sheaffer, Hugo Lam, E. Preston, C. Slightham, L. Hillier, Trisha Brock, A. Agarwal, Raymond Auerbach, A. Hyman, M. Gerstein, S. Mango, Stuart Kim, R. Waterston, V. Reinke, M. Snyder (2010)
Genome-Wide Identification of Binding Sites Defines Distinct Functions for Caenorhabditis elegans PHA-4/FOXA in Development and Environmental Response. PLoS Genetics, 6
Tatsunori Hashimoto, Matthew Edwards, D. Gifford (2014)
Universal Count Correction for High-Throughput Sequencing. PLoS Computational Biology, 10
S. Landt, G. Marinov, A. Kundaje, P. Kheradpour, Florencia Pauli, S. Batzoglou, B. Bernstein, P. Bickel, James Brown, Philip Cayting, Yiwen Chen, Gilberto Desalvo, C. Epstein, Katherine Fisher-Aylor, G. Euskirchen, M. Gerstein, Jason Gertz, A. Hartemink, M. Hoffman, V. Iyer, Y. Jung, S. Karmakar, Manolis Kellis, P. Kharchenko, Qunhua Li, Tao Liu, X. Liu, Lijia Ma, A. Milosavljevic, R. Myers, P. Park, M. Pazin, M. Perry, D. Raha, Timothy Reddy, J. Rozowsky, N. Shoresh, A. Sidow, Matthew Slattery, J. Stamatoyannopoulos, M. Tolstorukov, K. White, S. Xi, P. Farnham, J. Lieb, B. Wold, M. Snyder (2012)
ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research, 22
C. Fraley, A. Raftery (2012)
mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation
J. Banfield, A. Raftery (1993)
Model-based Gaussian and non-Gaussian clustering. Biometrics, 49
Fidel Ramírez, F. Dündar, S. Diehl, B. Grüning, T. Manke (2014)
deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Research, 42
Maya Kasowski, Fabian Grubert, Christopher Heffelfinger, M. Hariharan, A. Asabere, S. Waszak, L. Habegger, J. Rozowsky, Minyi Shi, A. Urban, Mi-Young Hong, K. Karczewski, W. Huber, S. Weissman, M. Gerstein, J. Korbel, Michael Snyder (2010)
Variation in Transcription Factor Binding Among Humans. Science, 328
Qiye He, A. Bardet, Brianne Patton, Jennifer Purvis, Jeff Johnston, Ariel Paulson, Madelaine Gogol, A. Stark, J. Zeitlinger (2011)
High conservation of transcription factor binding and evidence for combinatorial regulation across six Drosophila species. Nature Genetics, 43
David Johnson, A. Mortazavi, R. Myers, B. Wold (2007)
Genome-Wide Mapping of in Vivo Protein-DNA Interactions. Science, 316
B. Deplancke (2010)
Faculty Opinions recommendation of Genetic analysis of variation in transcription factor binding in yeast.
Yajie Yang, J. Fear, Jianhong Hu, Irina Haecker, Lei Zhou, R. Renne, D. Bloom, L. McIntyre (2014)
Leveraging biological replicates to improve analysis in ChIP-seq experiments. Computational and Structural Biotechnology Journal, 9
A. Bardet, J. Steinmann, S. Bafna, J. Knoblich, J. Zeitlinger, A. Stark (2013)
Identification of transcription factor binding sites from ChIP-seq data at high resolution. Bioinformatics, 29(21)
Qiang Song, Andrew Smith (2011)
Identifying Dispersed Epigenomic Domains from ChIP-Seq Data. Bioinformatics
Heng Li, R. Handsaker, Alec Wysoker, T. Fennell, Jue Ruan, Nils Homer, Gabor Marth, G. Abecasis, R. Durbin (2009)
The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25
Qunhua Li, James Brown, Haiyan Huang, P. Bickel (2011)
Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics, 5
Haitham Ashoor, A. Hérault, A. Kamoun, F. Radvanyi, V. Bajic, E. Barillot, V. Boeva (2013)
HMCan: a method for detecting chromatin modifications in cancer samples using ChIP-seq data. Bioinformatics, 29
Xin Zeng, Rajendran Sanalkumar, E. Bresnick, Hongda Li, Q. Chang, S. Keleş (2013)
jMOSAiCS: joint analysis of multiple ChIP-seq datasets. Genome Biology, 14
S. Pepke, B. Wold, A. Mortazavi (2009)
Computation for ChIP-seq and RNA-seq studies. Nature Methods, 6
Bin Liu, Jimmy Yi, Aishwarya Sv, Xun Lan, Yilin Ma, T. Huang, G. Leone, V. Jin (2013)
QChIPat: a quantitative method to identify distinct binding patterns for two biological ChIP-seq samples in different experimental conditions. BMC Genomics, 14
Xin Feng, R. Grossman, L. Stein (2011)
PeakRanger: A cloud-enabled peak caller for ChIP-seq data. BMC Bioinformatics, 12
Hideaki Shimazaki, S. Shinomoto (2007)
A Method for Selecting the Bin Size of a Time Histogram. Neural Computation, 19
J. Blanchet, A. Davison (2011)
Spatial modeling of extreme snow depth. arXiv: Applications
Joseph Pickrell, Daniel Gaffney, Y. Gilad, J. Pritchard (2011)
False positive peaks in ChIP-seq and other sequencing-based functional assays caused by unannotated high copy number regions. Bioinformatics, 27
Li Shen, Ning-Yi Shao, Xiaochuan Liu, E. Nestler (2014)
ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases. BMC Genomics, 15
Vibhor Kumar, M. Muratani, N. Rayan, P. Kraus, T. Lufkin, H. Ng, Shyam Prabhakar (2013)
Uniform, optimal signal processing of mapped deep-sequencing data. Nature Biotechnology, 31
Y. Benjamini, T. Speed (2012)
Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research, 40
Dominic Schmidt, Petra Schwalie, M. Wilson, B. Ballester, Ângela Gonçalves, C. Kutter, Gordon Brown, Aileen Marshall, Paul Flicek, D. Odom (2012)
Waves of Retrotransposon Expansion Remodel Genome Organization and CTCF Binding in Multiple Mammalian Lineages. Cell, 148
M. Rye, P. Sætrom, F. Drabløs (2010)
A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs. Nucleic Acids Research, 39
Y. Benjamini, Y. Hochberg (1995)
Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57
Li Shen, Ning-Yi Shao, Xiaochuan Liu, I. Maze, Jian Feng, E. Nestler (2013)
diffReps: Detecting Differential Chromatin Modification Sites from ChIP-seq Data with Biological Replicates. PLoS ONE, 8
A. Szalkowski, C. Schmid (2011)
Rapid innovation in ChIP-seq peak-calling algorithms is outdistancing benchmarking efforts. Briefings in Bioinformatics, 12(6)
M. Megraw, Fernando Pereira, S. Jensen, U. Ohler, A. Hatzigeorgiou (2009)
A transcription factor affinity-based code for mammalian transcription initiation. Genome Research, 19(4)
Yong Zhang, Tao Liu, Clifford Meyer, J. Eeckhoute, David Johnson, B. Bernstein, C. Nusbaum, R. Myers, Myles Brown, Wei Li, X. Liu (2008)
Model-based Analysis of ChIP-Seq (MACS). Genome Biology, 9
Christina Schweikert, Stuart Brown, Zuojian Tang, Phillip Smith, D. Hsu (2012)
Combining multiple ChIP-seq peak detection systems using combinatorial fusion. BMC Genomics, 13
H. Thorvaldsdóttir, James Robinson, J. Mesirov (2012)
Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14
Dominic Schmidt, M. Wilson, B. Ballester, Petra Schwalie, Gordon Brown, Aileen Marshall, C. Kutter, S. Watt, C. Martinez-Jimenez, Sarah Mackay, I. Talianidis, Paul Flicek, D. Odom (2010)
Five-Vertebrate ChIP-seq Reveals the Evolutionary Dynamics of Transcription Factor Binding. Science, 328
BIOINFORMATICS ORIGINAL PAPER
Vol. 31 no. 1 2015, pages 48–55
doi:10.1093/bioinformatics/btu568
Sequence analysis
Advance Access publication September 15, 2014

Mahmoud M. Ibrahim (1,2,*), Scott A. Lacadie (2) and Uwe Ohler (1,2,*)
(1) Department of Biology, Humboldt University, Invalidenstrasse 43, D-10115 Berlin, Germany and (2) The Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine Berlin-Buch, Robert Rössle Str. 10, Berlin 13125, Germany
*To whom correspondence should be addressed.
Associate Editor: Inanc Birol

© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

ABSTRACT

Motivation: Although peak finding in next-generation sequencing (NGS) datasets has been addressed extensively, there is no consensus on how to analyze and process biological replicates. Furthermore, most peak finders do not focus on accurate determination of enrichment site widths and are not widely applicable to different types of datasets.

Results: We developed JAMM (Joint Analysis of NGS replicates via Mixture Model clustering): a peak finder that can integrate information from biological replicates, determine enrichment site widths accurately and resolve neighboring narrow peaks. JAMM is a universal peak finder that is applicable to different types of datasets. We show that JAMM is among the best performing peak finders in terms of site detection accuracy and in terms of accurate determination of enrichment site widths. In addition, JAMM’s replicate integration improves peak spatial resolution, sorting and peak finding accuracy.

Availability and implementation: JAMM is available for free and can run on Linux machines through the command line: http://code.google.com/p/jamm-peak-finder

Contact: mahmoud.ibrahim@mdc-berlin.de or uwe.ohler@mdc-berlin.de.

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on June 24, 2014; revised on August 8, 2014; accepted on August 18, 2014

1 INTRODUCTION

A common task in Genomics research is detecting enriched sites after alignment of next-generation sequencing (NGS) reads, which involves separating the genome into regions of high enrichment (i.e. peaks or clusters or binding sites) and regions of low enrichment (Pepke et al., 2009). Most peak and cluster finding programs are developed with a specific experimental protocol or dataset type in mind (Kumar et al., 2013). Therefore, it is usually difficult to apply the same analysis pipeline uniformly across all datasets in a given project.

Recently, there were attempts to develop universal peak finders by defining the problem as that of classical signal detection (Kumar et al., 2013). The main advantage of this approach is that it allows for uniform data analysis via theoretically proven optimal signal detection. However, it does not take into account that enrichment sites are often not expected to have the same shape or signal properties, even within the same dataset. For example, DNase-I hypersensitive regions are expected to have different widths and signal-to-noise ratios (SNR; Natarajan et al., 2012). Therefore, there is a need for an approach that would not only focus on optimal detection of enrichment sites but would also be able to adapt to enrichment sites with different signal properties and to define their boundaries accurately (Xing et al., 2012).

Furthermore, while others focus on integration of multiple datasets to define co-occurrence or differential enrichment [see Li et al. (2014); Liu et al. (2013); Shen et al. (2013) and Zeng et al. (2013) for examples], there is no consensus on how to integrate biological replicates to find accurate consensus peaks. One common approach is to determine enriched sites on each replicate separately, and then combine the results via union or intersection (Schweikert et al., 2012; Yang et al., 2014). Another common approach is to pool aligned reads from all replicates available and then detect enriched sites on the pooled alignments [see Tuteja et al. (2009) for an example]. Taking the intersection or union of separately detected sites mandates rescoring the peaks and leads to inaccurate enriched site widths. Pooling alignments before site detection obscures the differential spatial and intensity information in the replicates. As biological replicate experiments are not expected to be exactly reproducible, there is a need to develop a method for replicate integration that takes advantage of the differential information in the individual replicates to find accurate consensus peaks.

In this article, we introduce JAMM (Joint Analysis of NGS replicates via Mixture Model clustering): a universal peak finder that can integrate information from multiple replicates to find consensus peaks, determine accurate peak widths and resolve neighboring narrow peaks. We demonstrate JAMM using ChIP-Seq (Johnson et al., 2007), including transcription factor ChIP-Seq, punctate histone modification ChIP-Seq and broad histone modification ChIP-Seq as well as DNase-Seq (Crawford et al., 2006). We compare several programs that focus on different aspects of the peak finding problem (Table 1). MACS (Zhang et al., 2008) models read counts using a local Poisson distribution, PeakRanger (Feng et al., 2011) focuses on detecting neighboring narrow peaks at high resolution, PeakZilla (Bardet et al., 2013) is designed for uniform punctate transcription factor binding sites, BCP (Xing et al., 2012) develops explicit formulas to model read counts, CCAT (Xu et al., 2010) detects broad enrichment patterns with low SNR and DFilter (Kumar et al., 2013) is a universal peak finder based on optimal signal detection. We demonstrate that JAMM is widely applicable to different types of datasets, can define accurate peak boundaries and that JAMM’s replicate integration improves peak finding resolution and accuracy.

Table 1. Peak finders compared in this article

Peak finder | ChIP-Seq (TF) | ChIP-Seq (HM-Punctate) | ChIP-Seq (HM-Broad) | DNase-Seq | Datasets integration | Suitable for IDR
MACS (Zhang et al., 2008) | Default | Default | – | – | – | No (score ties, peak widths)
CCAT (PeakRanger) (Xu et al., 2010) | – | – | Default | – | – | Caution (strict)
PeakRanger (Feng et al., 2011) | Default | Default | – | – | – | Caution (strict)
BCP (Xing et al., 2012) | BCP-TF | BCP-HM | BCP-HM | – | – | No (score ties, peak widths)
PeakZilla (Bardet et al., 2013) | Default | – | – | – | – | No (score ties)
JAMM | Default | -m narrow | -r region | -f 1 | Biological replicates | Yes
DFilter (Kumar et al., 2013) | bs = 50, ks = 30, refine, nonzero | bs = 100, ks = 100 | bs = 100, ks = 30, nonzero | bs = 100, ks = 50, refine | Different experiments | Caution (peak widths)

Notes: TF, transcription factors; HM, histone modifications; Strict, cannot always adapt to calling a large number of peaks; Peak widths, inaccurate peak widths; Score ties, peak scores have ties. Parameters mentioned here are those used in the article, unless otherwise stated. We varied the MACS -g parameter appropriately for different genomes.

2 JAMM PEAK FINDING METHODS

2.1 Overview

Figure 1 provides an overview of JAMM’s analysis steps. Core peak finding steps involve selecting local windows that are enriched over background, followed by clustering the normalized extended-read counts in those windows into a peak cluster and noise cluster(s).
Local clustering allows JAMM to adapt to peaks with different widths and signal properties and to accurately determine their boundaries. Furthermore, using clustering as an approach for peak finding extends naturally to multivariate clustering, which is useful for integrating datasets that are correlated but not expected to be exactly the same, such as biological replicates. We chose clustering via multivariate Gaussian mixture models (Banfield and Raftery, 1993), which allows for including information about the covariance of the replicates. Finally, JAMM scores peaks via the peak signal, represented by the geometric mean of the replicates’ peak signals, and how it compares to background.

Fig. 1. An overview of JAMM’s peak finding steps

2.2 Extended read counts

For ChIP-Seq datasets, JAMM uses cross-correlation analysis to estimate the average fragment length (Ramachandran et al., 2013; see Supplementary Text; Step 1, Fig. 1). Fragment length is calculated for each replicate, including biological control, separately.

Reads are extended/truncated to the average fragment length in the 5′-to-3′ direction. Extended-read counts at each base pair are divided by the mean extended-read count to produce normalized extended-read counts.

2.3 Enriched windows

JAMM selects enriched windows and then assigns peaks locally in those windows (Steps 2, 3 and 4, Fig. 1). To find enriched windows, JAMM divides the genome into small non-overlapping bins and makes a decision whether each bin is enriched over background. All book-ended enriched bins are merged into larger, non-overlapping, variable-width enriched windows. This approach ensures that enriched windows include entire binding sites and that JAMM can seamlessly adapt to broader enrichment domains. In addition, determining enrichment on the bin-level ensures maximized sensitivity, so that JAMM can easily adapt to reporting a large number of peaks.

Similar to Song and Smith (2011), JAMM selects the bin size $\Delta$ that minimizes the cost function $C(\Delta)$ (Shimazaki and Shinomoto, 2007):

$$C(\Delta) = \frac{2k - v}{(n\Delta)^2},$$

where $n$ is the total number of reads, $k$ is the average number of reads per bin for bins with width $\Delta$ and $v$ is the variance (Shimazaki and Shinomoto, 2007). The user can also specify an arbitrary bin size. For multiple replicates, the optimum bin size is calculated separately for each replicate and the smallest bin size is selected.

A bin is enriched over background if

$$\mu_s > \mu_b$$

and

$$SNR_b > SNR_{chr},$$

where $\mu_s$ and $\mu_b$ are the average normalized extended-read counts in the sample bin and the corresponding background bin, respectively, and $SNR_b$ and $SNR_{chr}$ are the SNR in the bin and the corresponding chromosome, respectively. Any $SNR = \mu/\sigma$, where $\mu$ is the average sample normalized extended-read count and $\sigma$ is the standard deviation of the control normalized extended-read count. For multiple replicates, all replicates have to pass this enrichment test for a bin to be considered enriched.

2.4 Peak finding

JAMM assumes that the signal (smoothed extended-read counts, see Supplementary Text) in enriched windows originated from a univariate Gaussian mixture model for single sample analysis or a multivariate Gaussian mixture model when integrating multiple replicates (Step 5, Fig. 1; Banfield and Raftery, 1993; Celeux and Govaert, 1995; Fraley et al., 2012):

$$\prod_{t=1}^{T} \sum_{k=1}^{K} w_k \, \mathcal{N}_k(bp_t \mid \mu_k, \Sigma_k),$$

where $T$ is the window size, $K$ is the number of components (clusters), $bp_t$ is the read signal value for base pair $t$, $w_k$ is the weight of component $k$ in the mixture and $\mu_k$ and $\Sigma_k$ are the vector of means and the covariance matrix for component $k$, respectively (Fraley et al., 2012). The Gaussian mixture model is defined by the set of parameters $\theta = w_1 \ldots w_k; \mu_1 \ldots \mu_k; \Sigma_1 \ldots \Sigma_k$.

To find accurate peak boundaries in enriched windows, JAMM fits a Gaussian mixture model to cluster the smoothed extended-read count in each enriched window separately, assuming either two mixture components (corresponding to peaks and noise, parameters: -m normal) or three mixture components (corresponding to peaks, peak tails and noise, parameters: -m narrow). In the univariate case, variance is assumed to be different between different components. In the multivariate case, the covariance matrix is assumed to be different among different components and is parameterized according to its eigenvalue decomposition (Banfield and Raftery, 1993):

$$\Sigma_k = \Phi_k \Lambda_k \Phi_k^{T}$$

and

$$\Lambda_k = \lambda_k A_k,$$

where $\Phi_k$ is the orthogonal matrix of the eigenvectors and $\Lambda_k$ is a diagonal matrix with the eigenvalues at the diagonals, with $\lambda_k$ being the first eigenvalue in $\Lambda_k$ and $A_k$ being a diagonal matrix with a vector at the diagonal that is proportional to the vector of eigenvalues. Therefore, $\Phi_k$ determines the orientation of the eigenvectors of component $k$, while $\lambda_k$ defines the volume that component $k$ occupies in the n-dimensional space and $A_k$ defines the shape of the contour lines (Banfield and Raftery, 1993).

Gaussian mixture clustering starts with chromosome-wide parameter initialization based on an imaginary large window $W$ formed by concatenating the top-scoring windows in the chromosome. First, data points are assigned to clusters via k-means. Cluster assignments are then used to initialize an Expectation-Maximization (EM) algorithm to fit a Gaussian mixture model (Celeux and Govaert, 1995) starting with the Maximization step. The Expectation step calculates the conditional probability that the read count signal at a given base pair $bp_t$ originated from component $k$ given $\theta$: $p_k(bp_t) \mid \theta$. The Maximization step calculates the maximum likelihood estimates of the model parameters given all $p_1 \ldots p_k$ for all $bp_1 \ldots bp_t$. Assigning $bp_t$ to a cluster $k$ is derived directly from $p_{1 \ldots k}(bp_t)$, where $bp_t$ is assigned to the mixture component $k$ that maximizes $p_k(bp_t)$. Different structures of the covariance matrix (two parameterizations built from $\lambda$, $\Phi$ and $A$) are tested and the one maximizing the Bayesian Information Criterion is chosen (Fraley et al., 2012, see Supplementary Text). Formulas to update $\theta$ (Maximization step) and $p_k(bp_t)$ (Expectation step) for both models are described by Celeux and Govaert (1995).

The model learned for $W$ is used to initialize the EM Gaussian mixture clustering algorithm for every enriched window separately starting with the Expectation step. The mixture component with the highest mean is taken to be the enriched cluster and contiguous base pairs assigned to this cluster are taken to be the peaks. In the multivariate case, all replicates are required to agree on the mixture component mean ordering, otherwise the window is rejected (see Supplementary Text).

2.5 Peak scoring

The background signal in every peak is subtracted from the corresponding sample signal (Step 6, Fig. 1). When analyzing replicates, sample signal is taken to be the per-position geometric mean of the replicates’ signals. The resulting background-normalized signal values are averaged to produce the mean peak background-normalized signal ($\mu_{ns}$). In addition, JAMM executes the Mann–Whitney U non-parametric test to compare the sample signal (not background normalized) with the corresponding background signal. A Benjamini–Hochberg correction (Benjamini and Hochberg, 1995) is applied to the full list of P-values after peak finding is complete. JAMM defines the peak score to be

$$S_p = \mu_{ns} \times -\log_{10}(p_{corrected}).$$

By default, each peak is scored and reported as a separate peak (parameters: -r peak). All peaks detected in one window can be merged to be scored and reported as one peak (parameters: -r region).

2.6 Implementation and output

JAMM is implemented as a bash script with peak finding and scoring implemented by R and Perl scripts. Other post- or preprocessing steps can be added to the pipeline easily if needed. JAMM outputs a sorted peak list in standard narrowPeak format with peak scores, P-values, corrected P-values and peak summits.

3 RESULTS

3.1 Accuracy and spatial resolution

First, we sought to establish that JAMM achieves a similar or better site detection accuracy compared with other peak finders. Accuracy refers to the extent to which peak finders can determine the correct locations of enriched sites. Because there is no gold standard for benchmarking peak finders (Szalkowski and Schmid, 2011), we analyzed five different ENCODE transcription factor ChIP-Seq datasets, including CTCF-HeLa, CTCF-K562, NRSF-K562, MAX-K562 and SRF-GM12878 (see Supplementary Tables S1 and S5; we refer to those datasets as ‘accuracy-benchmark’ datasets) using three different benchmarking methods: (i) motif finding precision (fraction of called peaks with motif matches) using FIMO (Grant et al., 2011), which uses a uniform zero-order background model, (ii) maximum cumulative motif likelihood using SpeakerScan (Megraw et al., 2009), which uses a first-order local background model and (iii) accuracy of recovery of manually curated positive peaks as reported by Rye et al. (2011) (see Section 5 and Supplementary Text). Regarding motif precision, we found that all peak finders perform comparably although DFilter and JAMM rank better when results are averaged across multiple datasets (Supplementary Table S1). Regarding motif likelihood, we also found that all peak finders, except BCP, perform rather comparably. However, JAMM ranks better than other peak finders when results are averaged across multiple datasets (Supplementary Table S1, Supplementary Figs S1–3). Finally, we found that PeakZilla is the best performing peak finder, followed by JAMM and PeakRanger, in terms of recovering manually curated positive peaks (Supplementary Table S1). When we average the results over all datasets and all benchmarks (a total of nine comparisons, Supplementary Methods), we found that JAMM and PeakRanger are the top ranking programs (Fig. 2a). JAMM ranked first for two benchmarks and third for one benchmark. PeakRanger ranked second for all benchmarks (Supplementary Table 1).

Fig. 2. Transcription Factor Peak Finding. (a) Average normalized accuracy score over three benchmarks (see Section 3, Supplementary Text, Supplementary Tables S1 and S5). (b) An example of JAMM-I’s improved spatial resolution because of replicate integration (CTCF, K562). (c) Peak width determination: Average Signal Ratio indicates the ratio of extended-read counts in the 20 bp upstream to that in the 20 bp downstream of the indicated location, averaged over all peaks. Negative numbers on the x-axis indicate locations outside the peak and positive numbers indicate locations inside the peak (CTCF, HeLa-S3). Increased signal ratio outside the peaks indicates peak width underestimation. Increased signal ratio inside the peak indicates peak width overestimation. (d) The corresponding heatmaps to (c) for JAMM-P (top) and MACS (bottom). Heatmaps are centered on peak center, ranked by peak width and show extended-read count intensity and the corresponding peak edges (gray squares). See Supplementary Figures S4 and S5 for other peak finders and datasets

When comparing JAMM-I (JAMM with replicate integration) with JAMM running on pooled replicates (JAMM-P), we found that JAMM-I consistently outperforms JAMM-P (JAMM-P ranked better than JAMM-I in only one out of five datasets where there was a difference), indicating that JAMM’s replicate integration improves peak finding accuracy over replicate read pooling. A main contributing factor is JAMM-I’s better spatial resolution owing to replicate integration via multivariate mixture model clustering. Figure 2b provides a demonstration of JAMM-I’s improvement over replicate pooling. Only JAMM-I can resolve two neighboring CTCF binding sites: the pooled replicate profile obscures the better spatial resolution of Replicate 1 owing to the poorer resolution of Replicate 2.

To further confirm peak finding accuracy, we analyzed datasets used by Bardet et al. (2013) via the peak finding precision benchmark (FIMO-based), including Twist, PHA-4, NFKB, CEBPA and Ste12 (we refer to those datasets as ‘bardet-benchmark’ datasets, see Supplementary Table S6). We found PeakZilla, PeakRanger and JAMM-I to be, on average, the top performing programs (see Supplementary Table S2).

Next, we asked whether peak finders can define accurate enrichment site widths. We found that BCP underestimates peak widths, while DFilter and MACS overestimate peak widths (Fig. 2c and d). JAMM, PeakZilla and PeakRanger have accurate peak width determination to a large extent. PeakRanger slightly underestimated the peak widths of some sites with both CTCF and NRSF, while PeakZilla fixes peak widths at twice the estimated fragment length (Bardet et al., 2013) (Supplementary Figs S4 and S5). Additionally, we observed a similar result with DNase-Seq: while JAMM can assign peak boundaries corresponding accurately to variable-width DNase-I-hypersensitive regions, DFilter cannot (Supplementary Fig. S6).

Spatial resolution is especially relevant for histone modifications with punctate enrichment patterns. We analyzed peak coverage (see Section 5) of ENCODE HeLa-S3 H3K4me3. H3K4me3 is expected to be maximally enriched immediately
Although ChIP-Seq datasets typ- over thousands of base pairs. H3K27me3 and other histone ically have enough resolution to separate the signal upstream of modifications display broad enrichment patterns when assayed TSSs from the signal downstream (Fig. 3a), many peak finders with ChIP-Seq. We tested CCAT (Xu et al., 2010) (via can not recover this resolution. Out of the peak finders we tested, PeakRanger’s implementation of the CCAT algorithm) and only JAMM and PeakRanger can, on average, resolve neighbor- BCP (Xing et al., 2012) (both designed for broad enrichment ing H3K4me3 peaks, while other peak finders detect, on average, domains) as well as DFilter and JAMM (both universal peak one large peak encompassing multiple enriched sites (Fig. 3b and finders) in terms of their ability to capture broad enrichment c, see also Fig. 5b). patterns. We found that BCP assigns the most broad peaks. JAMM also assigns broad peaks but generally smaller than those called by BCP. DFilter and PeakRanger’s CCAT are less 3.2 Broad enrichment patterns suited for defining broader enrichment domains (Fig. 4). Peak finders designed to process punctate enrichment sites are typically not able to capture broad enrichment domains, often 3.3 Peak scoring and sorting JAMM can typically report a large number of peaks and relies on its peak scoring to robustly rank the reported peaks (see Section 5). This facilitates downstream analysis and gives users more flexibility in choosing a method to filter the peaks. Irreproducible Discovery Rate (IDR) (Li et al., 2011) is an ENCODE-recommended method for filtering peak calls based on replicate reproducibility (Landt et al., 2012). Briefly, the IDR pipeline involves calling peaks on the replicates separately, fol- lowed by applying the IDR statistical model to determine the number of reproducible peaks n given a certain IDR threshold. Peak reproducibility involves whether the peaks overlap and how their ranks compare in the replicates peak lists. 
Finally, peaks are called on the combined replicates and the top n peaks are taken to be the high-confidence reproducible peaks (see ChIP-Seq IDR Web page for more information: https://sites.google.com/site/ anshulkundaje/projects/idr). We applied the IDR analysis pipeline to HeLa-S3 CTCF ChIP-Seq ENCODE dataset. We found that sorting the peaks using JAMM’s peak scores produces a clear phase shift between reproducible peaks and irreproducible peaks (Fig. 5a). To call peaks jointly on biological replicates (the final step in the IDR pipeline), aligned reads are usually pooled before peak finding, but pooling alignments obscures the differential signal intensities of the replicates, and, therefore, may lead to invalid peak sorting. Figure 5b shows an example of cases where JAMM-I’s integrated Fig. 3. Resolving Punctate Histone Modification Peaks (HeLa-S3 sorting of peaks provides a more valid peak sort than sorting H3K4me3). Only JAMM and PeakRanger can recover the resolution peaks called on pooled alignments. Pooling the replicates of the dataset (a) in the peaks called (b) and (c). (b) Shows the average obscured the spatial information regarding peak ‘b’ (JAMM) number of peaks per cluster at different cluster ranges. Cluster range is as the two replicates do not agree on a specific peak location. the maximum distance separating peaks in the same cluster (for example, JAMM relies on geometric averaging of replicates peak signal to if two peaks are 50 bp apart, they will be grouped together in one cluster if cluster range is 50 bp or more). See Supplementary Figures S8 and S9 for score the peaks, which leads to more valid peak scores than those TSS peak coverage of other HeLa-S3 histone modification datasets based on read pooling. Fig. 4. ENCODE HeLa-S3 H3K27me3 (Pooled Replicates). Overview of peak calls, region shown is from chromosome 19. See also Supplementary Figure S7 52 JAMM Fig. 5. JAMM’s Peak Scoring. 
(a) Results for IDR analysis on biological replicates for ENCODE HeLa-S3 CTCF using JAMM. Dashed vertical line corresponds to the number of peaks selected with an IDR threshold of 0.02 (38 853 peaks). Recommendations for setting IDR thresholds are available on the IDR webpage. Input to the IDR pipeline included the top 150 000 peaks called by JAMM. The number of matched peaks increases as one descends through the sorted peak list up to the point where peaks become irreproducible between replicates. (b) ENCODE H3K4me3 HeLa-S3 ChIP-Seq. Black (JAMM) and gray (PeakRanger) letter labels indicate the peaks called. Numeric labels indicate peaks’ relative rankings as defined by each peak finder Taken together, JAMM provides a plausible approach to rep- more attention could be directed toward developing universal licate integration that is widely applicable to different types of peak finding solutions, refining preprocessing of read counts to datasets and protocols. The analysis pipeline would start with correct for different biases (Hashimoto et al., 2014) and toward peak calling on the replicates separately, followed by IDR ana- developing solutions for biological replicates integration (Yang lysis to select n (the number of reproducible peaks). Finally, et al., 2014). peaks are called on the replicates jointly via JAMM’s replicate Pooling reads from biological replicates before peak finding is integration and the top-scoring n peaks are taken as a highly part of the ENCODE consortium recommended guidelines confident set. (Landt et al., 2012). However, when peaks are called on pooled replicates, the differential intensities and differential spa- tial coverage of the replicates are obscured. 
JAMM addresses 4 DISCUSSION replicate integration by looking at biological replicates as not being exactly reproducible and attempts to model their variabil- A desirable property in universal peak finders is detecting, and ity using information about their covariance in local enriched correctly determining the widths of, enrichment sites with differ- windows. Using various accuracy benchmarks, we demonstrated ent signal properties. JAMM fits a Gaussian mixture model for that this approach results in better peak finding accuracy over every local enriched window separately and only fixes the struc- read pooling. ture of the covariance matrix (see Section 2 and Supplementary For peak scoring on replicates, JAMM uses the geometric Text). Therefore, JAMM can accurately determine widths of en- average of the replicates peak signal. We demonstrated that richment sites that have different signal properties, even if in the this approach improves peak sorting. Additionally, we also same dataset. Some peak finders start with learning an expected show that peak finding on the geometric mean of separately peak shape (Bardet et al., 2013; Kumar et al., 2013), making it normalized replicate signal profiles can improve peak finding more difficult to detect enrichment sites with varying widths or to accuracy over read pooling similarly to JAMM-I (see JAMM- assign their boundaries accurately. Other peak finders adapt spe- G in Supplementary Text Section 1.2 and Supplementary Table cialized subroutines for refining peak widths after peak finding is S3). Geometric averaging of normalized signal profiles can po- complete (Rashid et al., 2011). In some cases, this approach may tentially be implemented as a preprocessing step irrespective of be able to assign accurate peak boundaries. But when the ori- the specific peak finding method. 
Therefore, although it may not ginal peak represents several closely spaced sites, this approach be an optimal solution with increasing replicates variability, it is may result in choosing one site and missing the others (see a plausible approach that other peak finders could easily imple- Supplementary Fig. S5 in Rashid et al. (2011) for an example). ment for biological replicates analysis, without requiring a multi- We showed that JAMM’s local clustering also avoids this caveat variate clustering framework. and can correctly resolve neighboring punctate sites, similar to Accuracy benchmarks are independent of read count densities, programs specifically designed with this task in mind like as opposed to peaks per cluster (Fig. 3) and peak width accuracy PeakRanger (Feng et al., 2011). (Fig. 2c and d). However, motif content benchmarks do not JAMM is a universal peak finder that can analyze different types of datasets with little change, if any, to the underlying represent a definite gold standard because of our incomplete method. This demonstrates that finding enriched sites in read- understanding of protein–DNA interactions and potential density–based NGS datasets is essentially the same task regard- biases in the benchmarking methods (Szalkowski and Schmid, less of the sites’ signal properties. Therefore, we propose that 2011). We attempted to remedy this by using two different motif 53 M.M.Ibrahim et al. PHA-4: MA0047.1, Ste12: MA0393.1), FlyReg (Bergman et al., 2005) scanning algorithms and by including a manually curated set of (Twist) and Schmidt et al. (2012) (CTCF). Motif precision analysis was peak calls as an additional benchmark (Rye et al., 2011). 
But done using FIMO (P-value, ‘accuracy-benchmark’: 0.0001/ manual curation may also be biased because the manually P-value,’bardet-benchmark’: 0.001) (Grant et al., 2011), cumulative log curated set represents only a small fraction of the peaks present likelihood analysis was done using SpeakerScan (background window: in a dataset (345 peaks for MAX, 235 for NRSF and 198 for 150 bp) (Megraw et al., 2009) and the curated peak set provided by Rye SRF), and because some peak finders (like PeakZilla) use peak et al. (2011) was used for manual curation analysis. Manual curation detection methods similar to curation criteria (Bardet et al., 2013; results were defined as the number of peaks that intersect at least one Rye et al., 2011). manually curated positive peak after subtracting the number of peaks Many peak finders ignore being able to report a larger number that intersect manually curated negative peaks exclusively. Results for of peaks and/or ignore providing appropriate peak scores ‘accuracy-benchmark’ transcription factor datasets were 0–1 scaled and averaged over all three benchmarks to produce Fig. 2a. See (Table 1), both required criteria for assessing replicate reprodu- Supplementary Text, section 1.3. cibility via IDR analysis (Li et al., 2011). Appropriate peak scores would have few or no ties and represent the confidence 5.2 Visualization in the peak accurately based on its read density and how it com- pares with background or biological control. JAMM can typic- Extended Reads per Kilobase per Million reads mapped (RPKM)- normalized read counts were produced using deepTools (Ramırez et al., ally determine a large number of peaks, and it also provides 2014) at 10 bp resolution and visualized in IGV browser (Thorvaldsdottir robust peak scores with few score ties if any. This, in addition et al., 2013). 
to its accurate peak width determination, makes JAMM poten- Read coverage heatmaps were produced for peak regions (1000 bp in tially more applicable for different types of downstream analyses each direction centered at the peak center) using smoothed extended-read that rely on ranked peak lists. counts at 10 bp resolution. Although a multivariate clustering framework can potentially For peak coverage plots of histone modification peak calls, we inter- be used for differential peak finding, JAMM can not find differ- sected each set of peak calls with annotated promoter regions from ential peaks across multiple conditions in its current implemen- UCSC hg19 known genes (4000 bp in each direction centered at the tation. Also, JAMM does not take into account mappability, GC annotated TSS), using BEDTools (Quinlan and Hall, 2010). Each base content and Copy Number Variations (CNVs). CNVs are espe- pair was assigned a score of 1 for each intersecting peak. Per-base pair scores were summed then divided by the mean per-base pair score to cially relevant for cancer cell lines (Pickrell et al.,2011),while GC produce normalized peak coverage scores. Raw extended-read coverage content bias is a known problem in high-throughput sequencing was produced using ngs.plot (Shen et al., 2014) on the same TSS regions. libraries, probably due to PCR amplification (Benjamini and See Supplementary Text, section 1.3 for more details. Speed, 2012). We could not detect CNV bias in JAMM’s peak calls in regions of loss when compared with a peak finder that Funding: MMI was supported by the Max-Delbruck € -Center/New corrects for CNVs (Ashoor et al., 2013), but we noticed a pos- York University Exchange Program. sible increase in the proportion of peaks called by JAMM in regions of gain (see Supplementary Table S4). Explicit implemen- Conflict of interest: none declared. 
tation of GC content bias and CNV correction could improve peak finding accuracy (Ashoor et al., 2013; Rashid et al., 2011), REFERENCES and we plan to incorporate appropriate correction subroutines in Ashoor,H. et al. (2013) HMCan: a method for detecting chromatin modifications in the near future. Finally, JAMM is typically slower than other cancer samples using ChIP-seq data. Bioinformatics, 29, 2979–2986. peak finders with less complicated models, taking 6–7 h on Banfield,J.D. and Raftery,A.E. (1993) Model-based gaussian and non-gaussian average to analyze a typical human ENCODE ChIP-Seq dataset clustering. Bio-metrics, 49, 803–21. when run using a single processor. Bardet,A.F. et al. (2013) Identification of transcription factor binding sites from ChIP-seq data at high resolution. Bioinformatics, 29, 2705–2713. Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: A prac- tical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B, 57, 5 METHODS 283–300. 5.1 Datasets, preprocessing and accuracy analysis Benjamini,Y. and Speed,T.P. (2012) Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acid Res., 40,e72. All “accuracy-benchmark” datasets are ENCODE datasets. (Bernstein Bergman,C.M. et al. (2005) Drosophila DNase I footprint database: a systematic et al., 2012). ‘bardet-benchmark’ datasets were produced by Bardet genome annotation of transcription factor binding sites in the fruitfly, et al. (2013) and He et al. (2011) for Twist, Schmidt et al.(2010)for Drosophila melanogaster. Bioinformatics, 21, 1747–1749. CEBPA, Zhong et al. (2010) for PHA-4 and Kasowski et al.(2010) for Bernstein,B.E. et al. (2012) An integrated encyclopedia of DNA elements in the NFKB and Zheng et al. (2010) for Ste12. CTCF-NHEK (used in JAMM- human genome. Nature, 489, 57–74. G analysis), H3K4me3, H3K27ac, H3K27me3 and DNase-Seq are Celeux,G. and Govaert,G. (1995) Gaussian parsimonious clustering models. 
Bio- ENCODE datasets (Bernstein et al., 2012). Fastq files were aligned to metrics, 28, 781–793. Crawford,G.E. et al. (2006) Genome-wide mapping of DNase hypersensitive the respective genomes using Bowtie2 (Langmead and Salzberg, 2012) sites using massively parallel signature sequencing (MPSS). Genome Res., 1, (hg19: CTCF-HeLa, CTCF-K562, CTCF-NHEK, NFKB, H3K4me3, 123–131. H3K27ac, H3K27me3 - mm9: CEBPA - dm3: Twist - ce6: PHA-4). Feng,X. et al. (2011) PeakRanger: a cloud-enabled peak caller for ChIP-seq data. Alternatively, we started with the alignments provided by Rye et al. BMC Bioinformatics, 12,139. (2011) (MAX, NRSF, SRF), Zheng et al. (2010) (Ste12) and ENCODE Fraley,C. et al. (2012) MCLUST Version 4 for R: Normal Mixture Modeling for (DNase-Seq). PCR duplicates were removed using SAMTools (Li et al., Model-Based Clustering, Classification, and Density Estimation, Technical 2009). See Supplementary Text, section 1.1. Report no. 597, Department of Statistics, University of Washington, June 2012. Transcription factor motifs were obtained from JASPAR (Mathelier Grant,C.E. et al. (2011) FIMO: scanning for occurrences of a given motif. et al., 2014) (NRSF: MA0138.2, NFKB: MA0105.1, CEBPA: MA0102.2, Bioinformatics, 27, 1017–1018. 54 JAMM Guenther,M.G. et al. (2007) A chromatin landmark and transcription initiation at Rashid,N.U. et al. (2011) ZINBA integrates local covariates with DNA-seq data to most promoters in human cells. Cell, 130,77–88. identify broad and narrow regions of enrichment, even within amplified genomic Hashimoto,T.B. et al. (2014) Universal count correction for high-throughput regions. Genome Biol., 12,R67. sequencing. PLoS Comput. Biol., 10, e1003494. Rye,M.B. et al. (2011) A manually curated ChIP-seq benchmark demonstrates He,Q. et al. (2011) High conservation of transcription factor binding and evidence room for improvement in current peak-finder programs. Nucleic Acids Res., for combinatorial regulation across six Drosophila species. Nat. 
Genet., 43, 39,e25. 414–420. Schmidt,D. et al. (2010) Five-vertebrate ChIP-seq reveals the evolutionary dynamics Johnson,D.S. et al. (2007) Genome-wide mapping of in vivo protein-DNA inter- of transcription factor binding. Science, 328, 1036–1040. actions. Science, 316, 1497–1502. Schmidt,D. et al. (2012) Waves of retrotransposon expansion remodel genome or- Kasowski,M. et al. (2010) Variation in transcription factor binding among humans. ganization and CTCF binding in multiple mammalian lineages. Cell, 148, Science, 328, 232–235. 335–348. Kumar,V. et al. (2013) Uniform, optimal signal processing of mapped deep-sequen- Schweikert,C. et al. (2012) Combining multiple ChIP-seq peak detection systems cing data. Nat. Biotechnol., 31, 615–622. using combinatorial fusion. BMC genomics, 13 (Suppl. 8), S12. Landt,S.G. et al. (2012) ChIP-seq guidelines and practices of the ENCODE and Shen,L. et al. (2013) diffReps: detecting differential chromatin modification sites modENCODE consortia. Genome Res., 22, 1813–1831. from ChIP-seq data with biological replicates. PloS One, 8, e65598. Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Shen,L. et al. (2014) ngs.plot: Quick mining and visualization of next-generation Nat. Methods, 9, 357–359. sequencing data by integrating genomic databases. BMC Genomics, 15,284. Li,H. et al. (2009) The sequence alignment/map format and SAMtools. Shimazaki,H. and Shinomoto,S. (2007) A method for selecting the bin size of a time Bioinformatics, 25, 2078–2079. histogram. Neural Comput., 19, 1503–1527. Li,Q. et al. (2011) Measuring reproducibility of high-throughput experiments. Ann. Song,Q. and Smith,A.D. (2011) Identifying dispersed epigenomic domains from Appl. Stat., 5, 1699–2264. ChIP-Seq data. Bioinformatics, 27, 870–871. Li,Y. et al. (2014) T-KDE: a method for genome-wide identification of con- Szalkowski,A.M. and Schmid,C.D. 
(2011) Rapid innovation in ChIP-seq peak-call- stitutive protein binding sites from multiple ChIP-seq data sets. BMC ing algorithms is outdistancing benchmarking efforts. Brief. Bioinform., 12, Genomics, 15,27. 626–633. Liu,B. et al. (2013) QChIPat: a quantitative method to identify distinct binding Thorvaldsdott ir,H. et al. (2013) Integrative Genomics Viewer (IGV): high-perform- patterns for two biological ChIP-seq samples in different experimental condi- ance genomics data visualization and exploration. Brief. Bioinform., 14, tions. BMC Genomics, 14 (Suppl. 8), S3. 178–192. Mathelier,A. et al. (2014) JASPAR 2014: an extensively expanded and updated Tuteja,G. et al. (2009) Extracting transcription factor targets from ChIP-Seq data. open-access database of transcription factor binding profiles. Nucleic Acids Nucleic Acids Res., 37,e113. Res., 42, D142–D147. Xing,H. et al. (2012) Genome-wide localization of protein-DNA binding and his- Megraw,M. et al. (2009) A transcription factor affinity-based code for mammalian tone modification by a Bayesian change-point method with ChIP-seq data. transcription initiation. Genome Res., 19, 644–656. PLoS Comput. Biol., 8, e1002613. Natarajan,A. et al. (2012) Predicting cell-type-specific gene expression from regions Xu,H. et al. (2010) A signal-noise model for significance analysis of ChIP-seq with of open chromatin. Genome Res., 22, 1711–1722. negative control. Bioinformatics, 26, 1199–1204. Pepke,S. et al. (2009) Computation for ChIP-seq and RNA-seq studies. Nat. Yang,Y. et al. (2014) Leveraging biological replicates to improve analysis in ChIP- Methods, 6 (Suppl. 11), S22–S32. seq experiments. Comput. Struct. Biotechnol. J., 9, e201401002. Pickrell,J.K. et al. (2011) False positive peaks in ChIP-seq and other sequencing- Zeng,X. et al. (2013) jMOSAiCS: joint analysis of multiple ChIP-seq datasets. based functional assays caused by unannotated high copy number regions. Genome Biol., 14,R38. Bioinformatics, 27, 2144–2146. 
Zhang,Y. et al. (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol., 9, Quinlan,A.R. and Hall,I.M. (2010) BEDTools: a flexible suite of utilities for com- R137. paring genomic features. Bioinformatics, 26, 841–842. Zheng,W. et al. (2010) Genetic analysis of variation in transcription factor binding Ramachandran,P. et al. (2013) MaSC: mappability-sensitive cross-correlation for in yeast. Nature, 464, 1187–1191. estimating mean fragment length of single-end short-read sequencing data. Zhong,M. et al. (2010) Genome-wide identification of binding sites defines distinct Bioinformatics, 29, 444–450. functions for Caenorhabditis elegans PHA-4/FOXA in development and envir- Ramırez,F. et al. (2014) deepTools: a flexible platform for exploring deep-sequen- onmental response. PLoS Genet., 6, e1000848. cing data. Nucleic Acids Res., 42, W187–W191.
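Illustrative note on the manual curation score of Section 5.1. The definition given there (peaks intersecting at least one curated positive, minus peaks intersecting curated negatives exclusively) and the 0–1 scaling step can be sketched as follows. This is a minimal sketch and not the authors' code: the names `overlaps`, `curation_score` and `scale_01` are hypothetical, intervals are assumed to be half-open `(chrom, start, end)` tuples, and min-max scaling is an assumption about what "0–1 scaled" means.

```python
from typing import List, Tuple

# (chromosome, start, end), half-open coordinates (assumed convention)
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base pair on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def curation_score(peaks: List[Interval],
                   positives: List[Interval],
                   negatives: List[Interval]) -> int:
    """Peaks hitting >= 1 curated positive, minus peaks hitting only curated negatives."""
    hit_positive = 0
    negative_only = 0
    for peak in peaks:
        hits_pos = any(overlaps(peak, p) for p in positives)
        hits_neg = any(overlaps(peak, n) for n in negatives)
        if hits_pos:
            hit_positive += 1
        elif hits_neg:
            # Intersects curated negatives and no curated positive.
            negative_only += 1
    return hit_positive - negative_only


def scale_01(scores: List[float]) -> List[float]:
    """Min-max scale scores to [0, 1]; an assumed reading of '0-1 scaled'."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


if __name__ == "__main__":
    peaks = [("chr1", 100, 200), ("chr1", 500, 600),
             ("chr1", 900, 950), ("chr2", 100, 200)]
    positives = [("chr1", 150, 250), ("chr1", 550, 560)]
    negatives = [("chr1", 590, 610), ("chr1", 920, 930)]
    print(curation_score(peaks, positives, negatives))
    print(scale_01([2, 4, 6]))
```

In practice the interval intersection would more likely be done with a tool such as BEDTools, which the paper already uses for peak coverage; the quadratic loop here only keeps the sketch self-contained.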
Bioinformatics – Oxford University Press
Published: Sep 15, 2014