Determination of genomic copy number alteration emphasizing a restriction site-based strategy of genome re-sequencing
Zheng, Caihong; Miao, Xuexia; Li, Yanen; Huang, Ying; Ruan, Jue; Ma, Xi; Wang, Li; Wu, Chung-I; Cai, Jun
doi: 10.1093/bioinformatics/btt481 pmid: 23962614
Motivation: Copy number alteration (CNA) is one type of genomic aberration that is often induced by genome instability and is associated with diseases such as cancer. Determination of the genome-wide CNA profile is an important step in identifying the underlying mutation mechanisms. Genomic data based on next-generation sequencing technology are particularly suitable for the determination of high-quality CNA profiles. Now is an important time to reevaluate the use of sequencing techniques for CNA analysis, especially with the rapid growth of the different targeted genome and whole-genome sequencing strategies.
Results: In this study, we compare the utility of resequencing strategies applied to the same hepatocellular carcinoma sample for copy number determination. These strategies include whole-genome, exome and restriction site-associated DNA (RAD) sequencing. The last of these strategies is a targeted sequencing technique that involves cutting the genome with a restriction enzyme and isolating the targeted sequences. Our data demonstrate that RAD sequencing is an efficient and comprehensive strategy that allows the cost-effective determination of CNAs. Further investigation of RAD sequencing data led to the finding that a precise measurement of the allele frequency would be a helpful complement to the read depth for CNA analysis for two reasons. First, knowledge of the allele frequency allows refined calculation of allele-specific copy numbers, which, in turn, identify the functionally important CNAs that are under natural selection on the parental alleles. Second, this knowledge enables deconvolution of CNA patterns in complex genomic regions.
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
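The way allele frequency complements read depth can be illustrated with a minimal sketch. It is not the authors' method: it assumes a pure tumor sample with no normal-cell contamination, a diploid matched normal, and a single heterozygous SNP, and the function and parameter names are hypothetical.

```python
def allele_specific_copy_numbers(tumor_depth, normal_depth,
                                 b_allele_reads, total_allele_reads,
                                 normal_ploidy=2):
    """Split a total copy number into per-allele copy numbers.

    Assumptions (illustrative only): 100% tumor purity, a diploid
    matched normal, and read counts taken at one heterozygous SNP.
    """
    # Read depth alone gives the total copy number of the segment.
    depth_ratio = tumor_depth / normal_depth
    total_cn = round(normal_ploidy * depth_ratio)

    # The B-allele frequency says how that total is divided
    # between the two parental alleles.
    baf = b_allele_reads / total_allele_reads
    b_copies = round(total_cn * baf)
    a_copies = total_cn - b_copies
    return total_cn, a_copies, b_copies


# A single-copy gain of the A allele: depth ratio 1.5, BAF 1/3.
gain = allele_specific_copy_numbers(60, 40, 20, 60)

# Copy-neutral loss of heterozygosity: depth looks normal,
# but only the allele frequency reveals that one allele is lost.
cn_loh = allele_specific_copy_numbers(40, 40, 0, 40)
```

The second call shows why read depth alone is insufficient: the total copy number is 2 in both a normal region and a copy-neutral loss of heterozygosity, and only the allele frequency separates them.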
The code structure of the p53 DNA-binding domain and the prognosis of breast cancer patients
Sato, Keiko; Hara, Toshihide; Ohya, Masanori
doi: 10.1093/bioinformatics/btt497 pmid: 23986567
Motivation: Mutations of the tumor-suppressor gene TP53 are diverse in the central region encoding the DNA-binding domain. It has not been clear whether the prognostic significance for survival in breast cancer patients is the same for all types of mutations. Are there specific types of mutations carrying a worse prognosis? To understand the correlation between the mutations in the gene encoding the DNA-binding domain and the prognosis of breast cancer, we studied the code structure of the DNA-binding domain of breast cancer patients by using various artificial codes in information transmission.
Results: We show that the prognostic significance of mutations in the DNA-binding domain is not the same for all mutation types, and that a DNA-binding domain having a certain code structure is important for estimating the prognosis of breast cancer patients.
Contact: [email protected] or [email protected]
Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold
Nijkamp, Jurgen F.; Pop, Mihai; Reinders, Marcel J. T.; de Ridder, Dick
doi: 10.1093/bioinformatics/btt502 pmid: 24058058
Motivation: Although many tools are available to study variation and its impact in single genomes, there is a lack of algorithms for finding such variation in metagenomes. This hampers the interpretation of metagenomics sequencing datasets, which are increasingly acquired in research on the (human) microbiome, in environmental studies and in the study of processes in the production of foods and beverages. Existing algorithms often depend on the use of reference genomes, which poses a problem when a metagenome of a priori unknown strain composition is studied. In this article, we develop a method to perform reference-free detection and visual exploration of genomic variation, both within a single metagenome and between metagenomes.
Results: We present the MaryGold algorithm and its implementation, which efficiently detects bubble structures in contig graphs using graph decomposition. These bubbles represent variable genomic regions in closely related strains in metagenomic samples. The variation found is presented in a condensed Circos-based visualization, which allows for easy exploration and interpretation. We validated the algorithm on two simulated datasets containing three and seven Escherichia coli genomes, respectively, and showed that finding allelic variation in these genomes improves assemblies. Additionally, we applied MaryGold to publicly available real metagenomic datasets, enabling us to find within-sample genomic variation in the metagenomes of a kimchi fermentation process, the microbiome of a premature infant and in microbial communities living on acid mine drainage. Moreover, we used MaryGold for between-sample variation detection and exploration by comparing sequencing data sampled at different time points for both of these datasets.
Availability: MaryGold has been written in C++ and Python and can be downloaded from http://bioinformatics.tudelft.nl/software
Contact: [email protected]
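To make the notion of a bubble concrete, the sketch below finds only the simplest kind: a source contig whose outgoing branches are single unbranched contigs that reconverge on a common sink. It is an illustrative stand-in, not MaryGold's graph-decomposition algorithm, and the function name and edge-list representation are assumptions.

```python
def find_simple_bubbles(edges):
    """Return (source, branches, sink) triples for simple bubbles.

    `edges` is a list of directed (from_contig, to_contig) pairs.
    Only single-contig branches are considered; nested or longer
    bubbles would need a real graph decomposition.
    """
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)

    bubbles = []
    for source in sorted(succ):
        by_sink = {}
        for branch in succ[source]:
            # A branch must be unbranched: one way in, one way out.
            if len(pred.get(branch, ())) == 1 and len(succ.get(branch, ())) == 1:
                sink = next(iter(succ[branch]))
                by_sink.setdefault(sink, []).append(branch)
        for sink, branches in sorted(by_sink.items()):
            if len(branches) >= 2:  # at least two alternative paths
                bubbles.append((source, sorted(branches), sink))
    return bubbles


# Two strain variants (c2a, c2b) between shared contigs c1 and c3.
graph = [("c1", "c2a"), ("c1", "c2b"),
         ("c2a", "c3"), ("c2b", "c3"), ("c3", "c4")]
bubbles = find_simple_bubbles(graph)
```

Each reported triple corresponds to one variable region: the branches are the alternative sequences carried by different strains between the same flanking contigs.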
A statistical variant calling approach from pedigree information and local haplotyping with phase informative reads
Kojima, Kaname; Nariai, Naoki; Mimori, Takahiro; Takahashi, Mamoru; Yamaguchi-Kabata, Yumi; Sato, Yukuto; Nagasaki, Masao
doi: 10.1093/bioinformatics/btt503 pmid: 24002111
Motivation: Variant calling from genome-wide sequencing data is essential for the analysis of disease-causing mutations and elucidation of disease mechanisms. However, variant calling in low coverage regions is difficult due to sequence read errors and mapping errors. Hence, variant calling approaches that are robust to low coverage data are needed.
Results: We propose a new variant calling approach that considers pedigree information and haplotyping based on sequence reads spanning two or more heterozygous positions, termed phase informative reads. In our approach, genotyping and haplotyping, by the assignment of each read to a haplotype based on phase informative reads, are performed simultaneously. Therefore, positions with low evidence for heterozygosity are rescued by phase informative reads, and such rescued positions contribute to haplotyping in a synergistic way. In addition, pedigree information supports more accurate haplotyping as well as genotyping, especially in low coverage regions. Although heterozygous positions are useful for haplotyping, homozygous positions are not informative and weaken the information from heterozygous positions, as the majority of positions are homozygous. Thus, we introduce latent variables that determine zygosity at each position to filter out homozygous positions for haplotyping. In a performance evaluation with parent–offspring trio sequencing data, our approach outperforms existing approaches in accuracy, measured as agreement with single nucleotide polymorphism array genotyping results. Also, performance analysis considering the distance between variants showed that the use of phase informative reads is effective for accurate variant calling, and further performance improvement is expected with longer sequencing reads.
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
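The definition of a phase informative read (a read spanning two or more heterozygous positions) can be sketched as follows. This only illustrates the definition, not the authors' statistical model: reads are simplified to gapless inclusive intervals, and the function name and tuple layout are hypothetical.

```python
from bisect import bisect_left, bisect_right


def phase_informative_reads(reads, het_positions):
    """Select reads that cover at least two heterozygous positions.

    `reads` is a list of (name, start, end) tuples with inclusive
    coordinates; alignment gaps and clipping are ignored here.
    Returns (name, covered_positions) for each informative read.
    """
    hets = sorted(het_positions)
    informative = []
    for name, start, end in reads:
        # Binary search for the heterozygous sites inside [start, end].
        lo = bisect_left(hets, start)
        hi = bisect_right(hets, end)
        covered = hets[lo:hi]
        if len(covered) >= 2:
            informative.append((name, covered))
    return informative


reads = [("r1", 100, 199), ("r2", 150, 249), ("r3", 300, 399)]
hets = [120, 180, 240, 350]
informative = phase_informative_reads(reads, hets)
```

In this toy input, r1 and r2 share position 180, so together they link 120, 180 and 240 into one local haplotype block, while r3 covers a single site and carries no phase information. Longer reads cover more sites, which is consistent with the expectation stated in the abstract.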
Assessing the validity and reproducibility of genome-scale predictions
Sugden, Lauren A.; Tackett, Michael R.; Savva, Yiannis A.; Thompson, William A.; Lawrence, Charles E.
doi: 10.1093/bioinformatics/btt508 pmid: 24048353
Motivation: Validation and reproducibility of results is a central and pressing issue in genomics. Several recent embarrassing incidents involving the irreproducibility of high-profile studies have illustrated the importance of this issue and the need for rigorous methods for the assessment of reproducibility.
Results: Here, we describe an existing statistical model that is very well suited to this problem. We explain its utility for assessing the reproducibility of validation experiments, and apply it to a genome-scale study of adenosine deaminase acting on RNA (ADAR)-mediated RNA editing in Drosophila. We also introduce a statistical method for planning validation experiments that will obtain the tightest reproducibility confidence limits, which, for a fixed total number of experiments, returns the optimal number of replicates for the study.
Availability: Downloadable software and a web service for both the analysis of data from a reproducibility study and the optimal design of these studies are provided at http://ccmbweb.ccv.brown.edu/reproducibility.html
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
INSECT: IN-silico SEarch for Co-occurring Transcription factors
Rohr, Cristian O.; Parra, R. Gonzalo; Yankilevich, Patricio; Perez-Castro, Carolina
doi: 10.1093/bioinformatics/btt506 pmid: 24008418
Motivation: Transcriptional regulation occurs through the concerted actions of multiple transcription factors (TFs) that bind cooperatively to cis-regulatory modules (CRMs) of genes. These CRMs usually contain a variable number of transcription factor-binding sites (TFBSs) involved in related cellular and physiological processes. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) has been effective in detecting TFBSs and nucleosome location to identify potential CRMs in genome-wide studies. Although several attempts were previously reported to predict the potential binding of TFs at TFBSs within CRMs by comparing different ChIP-seq data, these have been hampered by excessive background, usually emerging as a consequence of experimental conditions. To understand these complex regulatory circuits, it would be helpful to have reliable and updated user-friendly tools to assist in the identification of TFBSs and CRMs for gene(s) of interest.
Results: Here we present INSECT (IN-silico SEarch for Co-occurring Transcription factors), a novel web server for identifying potential TFBSs and CRMs in gene sequences. By combining several strategies, INSECT provides flexible analysis of multiple co-occurring TFBSs, by applying differing search schemes and restriction parameters.
Availability and implementation: INSECT is freely available as a web server at http://bioinformatics.ibioba-mpsp-conicet.gov.ar/INSECT
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
PyroHMMvar: a sensitive and accurate method to call short indels and SNPs for Ion Torrent and 454 data
Zeng, Feng; Jiang, Rui; Chen, Ting
doi: 10.1093/bioinformatics/btt512 pmid: 23995392
Motivation: The identification of short insertions and deletions (indels) and single nucleotide polymorphisms (SNPs) from Ion Torrent and 454 reads is a challenging problem, essentially because these techniques are prone to sequencing errors at homopolymers and can, therefore, introduce spurious indels in reads. Most of the existing mapping programs do not model homopolymer errors when aligning reads against the reference. The resulting alignments will then contain various kinds of mismatches and indels that confound the accurate determination of variant loci and alleles.
Results: To address these challenges, we realign reads against the reference using our previously proposed hidden Markov model that models homopolymer errors and then merge these pairwise alignments into a weighted alignment graph. Based on our weighted alignment graph and hidden Markov model, we develop a method called PyroHMMvar, which can simultaneously detect short indels and SNPs, as demonstrated in human resequencing data. Specifically, by applying our methods to simulated diploid datasets, we demonstrate that PyroHMMvar produces more accurate results than state-of-the-art methods, such as Samtools and GATK, and is less sensitive to mapping parameter settings than the other methods. We also apply PyroHMMvar to analyze one human whole genome resequencing dataset, and the results confirm that PyroHMMvar predicts SNPs and indels accurately.
Availability and implementation: Source code freely available at the following URL: https://code.google.com/p/pyrohmmvar/, implemented in C++ and supported on Linux.
Contact: [email protected] or [email protected]
A general species delimitation method with applications to phylogenetic placements
Zhang, Jiajie; Kapli, Paschalia; Pavlidis, Pavlos; Stamatakis, Alexandros
doi: 10.1093/bioinformatics/btt499 pmid: 23990417
Motivation: Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches either rely on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well on large datasets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets.
Results: We introduce the Poisson tree processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our evolutionary placement algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches with popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GMYC as well as OTU-picking methods when evolutionary distances between species are small. PTP requires neither an ultrametric input tree nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales on large datasets because it relies on the parallel implementations of the EPA and RAxML, thereby allowing species delimitation in high-throughput sequencing data.
Availability and implementation: The code is freely available at www.exelixis-lab.org/software.html.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis
Reese, Sarah E.; Archer, Kellie J.; Therneau, Terry M.; Atkinson, Elizabeth J.; Vachon, Celine M.; de Andrade, Mariza; Kocher, Jean-Pierre A.; Eckel-Passow, Jeanette E.
doi: 10.1093/bioinformatics/btt480 pmid: 23958724
Motivation: Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data.
Results: We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies.
Conclusion: We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.
Availability and implementation: The gPCA R package (available via CRAN) provides functionality and data to perform the methods in this article.
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
A-clustering: a novel method for the detection of co-regulated methylation regions, and regions associated with exposure
Sofer, Tamar; Schifano, Elizabeth D.; Hoppin, Jane A.; Hou, Lifang; Baccarelli, Andrea A.
doi: 10.1093/bioinformatics/btt498 pmid: 23990415
Motivation: DNA methylation is a heritable, modifiable chemical process that affects gene transcription and is associated with other molecular markers (e.g. gene expression) and biomarkers (e.g. cancer or other diseases). Current technology measures methylation in hundreds of thousands, or millions, of CpG sites throughout the genome. It is evident that neighboring CpG sites are often highly correlated with each other, and current literature suggests that clusters of adjacent CpG sites are co-regulated.
Results: We develop the Adjacent Site Clustering (A-clustering) algorithm to detect sets of neighboring CpG sites that are correlated with each other. To detect methylation regions associated with exposure, we propose an analysis pipeline for high-dimensional methylation data in which CpG sites within regions identified by A-clustering are modeled as multivariate responses to environmental exposure using a generalized estimating equation approach that assumes exposure equally affects all sites in the cluster. We develop a correlation-preserving simulation scheme, and study the proposed methodology via simulations. We study the clusters detected by the algorithm on a high-dimensional dataset of peripheral blood methylation of pesticide applicators.
Availability: We provide the R package Aclust that efficiently implements the A-clustering and the analysis pipeline, and produces analysis reports. The package is found at http://www.hsph.harvard.edu/tamar-sofer/packages/
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
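The idea of grouping neighboring, mutually correlated CpG sites can be sketched with a greedy pass over sites ordered by position. This is a simplified stand-in, not the Aclust implementation: the chaining rule, the Pearson correlation measure, and the `min_corr` and `max_gap` parameters are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def cluster_adjacent_sites(sites, min_corr=0.8, max_gap=1000):
    """Greedily chain neighboring CpG sites into clusters.

    `sites` is a list of (position, values) pairs, where `values`
    holds one methylation level per sample. A site joins the current
    cluster if it lies within `max_gap` bases of the previous site and
    is correlated with it at `min_corr` or above. Singletons are dropped.
    """
    ordered = sorted(sites, key=lambda s: s[0])
    clusters = []
    for pos, values in ordered:
        if clusters:
            prev_pos, prev_values = clusters[-1][-1]
            if pos - prev_pos <= max_gap and pearson(prev_values, values) >= min_corr:
                clusters[-1].append((pos, values))
                continue
        clusters.append([(pos, values)])
    return [[pos for pos, _ in c] for c in clusters if len(c) >= 2]


sites = [
    (100, [0.10, 0.50, 0.90, 0.30]),
    (150, [0.20, 0.60, 0.80, 0.40]),
    (220, [0.15, 0.55, 0.95, 0.35]),
    (5000, [0.10, 0.50, 0.90, 0.30]),   # too far from the first region
    (5100, [0.90, 0.50, 0.10, 0.70]),   # close, but anti-correlated
]
clusters = cluster_adjacent_sites(sites)
```

Sites within a returned cluster could then be treated as a joint response to an exposure variable, which is the role clusters play in the pipeline described in the abstract.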