Wellington: a novel method for the accurate identification of digital genomic footprints from DNase-seq dataPiper, Jason; Elze, Markus C.; Cauchy, Pierre; Cockerill, Peter N.; Bonifer, Constanze; Ott, Sascha
doi: 10.1093/nar/gkt850pmid: 24071585
The expression of eukaryotic genes is regulated by cis-regulatory elements such as promoters and enhancers, which bind sequence-specific DNA-binding proteins. One of the great challenges in the gene regulation field is to characterise these elements. This involves the identification of transcription factor (TF) binding sites within regulatory elements that are occupied in a defined regulatory context. Digestion with DNase and the subsequent analysis of regions protected from cleavage (DNase footprinting) has for many years been used to identify specific binding sites occupied by TFs at individual cis-elements with high resolution. This methodology has recently been adapted for high-throughput sequencing (DNase-seq). In this study, we describe an imbalance in the DNA strand-specific alignment information of DNase-seq data surrounding proteinDNA interactions that allows accurate prediction of occupied TF binding sites. Our study introduces a novel algorithm, Wellington, which considers the imbalance in this strand-specific information to efficiently identify DNA footprints. This algorithm significantly enhances specificity by reducing the proportion of false positives and requires significantly fewer predictions than previously reported methods to recapitulate an equal amount of ChIP-seq data. We also provide an open-source software package, pyDNase, which implements the Wellington algorithm to interface with DNase-seq data and expedite analyses.
Identifying subgroup markers in heterogeneous populationsde Ronde, Jorma J.; Rigaill, Guillem; Rottenberg, Sven; Rodenhuis, Sjoerd; Wessels, Lodewyk F. A.
doi: 10.1093/nar/gkt845pmid: 24062158
Traditional methods that aim to identify biomarkers that distinguish between two groups, like Significance Analysis of Microarrays or the t-test, perform optimally when such biomarkers show homogeneous behavior within each group and differential behavior between the groups. However, in many applications, this is not the case. Instead, a subgroup of samples in one group shows differential behavior with respect to all other samples. To successfully detect markers showing such imbalanced patterns of differential signal, a different approach is required. We propose a novel method, specifically designed for the Detection of Imbalanced Differential Signal (DIDS). We use an artificial dataset and a human breast cancer dataset to measure its performance and compare it with three traditional methods and four approaches that take imbalanced signal into account. Supported by extensive experimental results, we show that DIDS outperforms all other approaches in terms of power and positive predictive value. In a mouse breast cancer dataset, DIDS is the only approach that detects a functionally validated marker of chemotherapy resistance. DIDS can be applied to any continuous value data, including gene expression data, and in any context where imbalanced differential signal is manifested.
DEXUS: identifying differential expression in RNA-Seq studies with unknown conditionsKlambauer, Gnter; Unterthiner, Thomas; Hochreiter, Sepp
doi: 10.1093/nar/gkt834pmid: 24049071
Detection of differential expression in RNA-Seq data is currently limited to studies in which two or more sample conditions are known a priori. However, these biological conditions are typically unknown in cohort, cross-sectional and nonrandomized controlled studies such as the HapMap, the ENCODE or the 1000 Genomes project. We present DEXUS for detecting differential expression in RNA-Seq data for which the sample conditions are unknown. DEXUS models read counts as a finite mixture of negative binomial distributions in which each mixture component corresponds to a condition. A transcript is considered differentially expressed if modeling of its read counts requires more than one condition. DEXUS decomposes read count variation into variation due to noise and variation due to differential expression. Evidence of differential expression is measured by the informative/noninformative (I/NI) value, which allows differentially expressed transcripts to be extracted at a desired specificity (significance level) or sensitivity (power). DEXUS performed excellently in identifying differentially expressed transcripts in data with unknown conditions. On 2400 simulated data sets, I/NI value thresholds of 0.025, 0.05 and 0.1 yielded average specificities of 92, 97 and 99 at sensitivities of 76, 61 and 38, respectively. On real-world data sets, DEXUS was able to detect differentially expressed transcripts related to sex, species, tissue, structural variants or quantitative trait loci. The DEXUS R package is publicly available from Bioconductor and the scripts for all experiments are available at http://www.bioinf.jku.at/software/dexus/.
A quality control system for profiles obtained by ChIP sequencingMendoza-Parra, Marco-Antonio; Van Gool, Wouter; Mohamed Saleem, Mohamed Ashick; Ceschin, Danilo Guillermo; Gronemeyer, Hinrich
doi: 10.1093/nar/gkt829pmid: 24038469
The absence of a quality control (QC) system is a major weakness for the comparative analysis of genome-wide profiles generated by next-generation sequencing (NGS). This concerns particularly genome binding/occupancy profiling assays like chromatin immunoprecipitation (ChIP-seq) but also related enrichment-based studies like methylated DNA immunoprecipitation/methylated DNA binding domain sequencing, global run on sequencing or RNA-seq. Importantly, QC assessment may significantly improve multidimensional comparisons that have great promise for extracting information from combinatorial analyses of the global profiles established for chromatin modifications, the bindings of epigenetic and chromatin-modifying enzymes/machineries, RNA polymerases and transcription factors and total, nascent or ribosome-bound RNAs. Here we present an approach that associates global and local QC indicators to ChIP-seq data sets as well as to a variety of enrichment-based studies by NGS. This QC system was used to certify >5600 publicly available data sets, hosted in a database for data mining and comparative QC analyses.
A general approach for discriminative de novo motif discovery from high-throughput dataGrau, Jan; Posch, Stefan; Grosse, Ivo; Keilwagen, Jens
doi: 10.1093/nar/gkt831pmid: 24057214
De novo motif discovery has been an important challenge of bioinformatics for the past two decades. Since the emergence of high-throughput techniques like ChIP-seq, ChIP-exo and protein-binding microarrays (PBMs), the focus of de novo motif discovery has shifted to runtime and accuracy on large data sets. For this purpose, specialized algorithms have been designed for discovering motifs in ChIP-seq or PBM data. However, none of the existing approaches work perfectly for all three high-throughput techniques. In this article, we propose Dimont, a general approach for fast and accurate de novo motif discovery from high-throughput data. We demonstrate that Dimont yields a higher number of correct motifs from ChIP-seq data than any of the specialized approaches and achieves a higher accuracy for predicting PBM intensities from probe sequence than any of the approaches specifically designed for that purpose. Dimont also reports the expected motifs for several ChIP-exo data sets. Investigating differences between in vitro and in vivo binding, we find that for most transcription factors, the motifs discovered by Dimont are in good accordance between techniques, but we also find notable exceptions. We also observe that modeling intra-motif dependencies may increase accuracy, which indicates that more complex motif models are a worthwhile field of research.
H1 histones: current perspectives and challengesHarshman, Sean W.; Young, Nicolas L.; Parthun, Mark R.; Freitas, Michael A.
doi: 10.1093/nar/gkt700pmid: 23945933
H1 and related linker histones are important both for maintenance of higher-order chromatin structure and for the regulation of gene expression. The biology of the linker histones is complex, as they are evolutionarily variable, exist in multiple isoforms and undergo a large variety of posttranslational modifications in their long, unstructured, NH2- and COOH-terminal tails. We review recent progress in understanding the structure, genetics and posttranslational modifications of linker histones, with an emphasis on the dynamic interactions of these proteins with DNA and transcriptional regulators. We also discuss various experimental challenges to the study of H1 and related proteins, including limitations of immunological reagents and practical difficulties in the analysis of posttranslational modifications by mass spectrometry.