Bioinformatics | DeepDyve

journal article

Open Access Collection

PEATH: single-individual haplotyping by a probabilistic evolutionary algorithm with toggling

Na, Joong Chae; Lee, Jong-Chan; Rhee, Je-Keun; Shin, Soo-Yong

2018 Bioinformatics

doi: 10.1093/bioinformatics/bty012; pmid: 29342247

Motivation: Single-individual haplotyping (SIH) is critical in genomic association studies and genetic disease analysis. However, most genomic analysis studies do not perform haplotype-phasing analysis due to its complexity. Several computational methods have been developed to solve the SIH problem, but these approaches have not generated sufficiently reliable haplotypes.

Results: Here, we propose a novel SIH algorithm, called PEATH (Probabilistic Evolutionary Algorithm with Toggling for Haplotyping), to achieve more accurate and reliable haplotyping. The proposed PEATH method was compared to the most recent algorithms in terms of the phased length, N50 length, switch error rate and minimum error correction. The PEATH algorithm consistently provides the best phased and N50 lengths, as long as possible for the given datasets. In addition, verification on simulation data demonstrated that the PEATH method outperforms other methods on highly noisy data. Additionally, the experimental results on a real dataset confirmed that the PEATH method achieved comparable or better accuracy.

Availability and implementation: Source code of PEATH is available at https://github.com/jcna99/PEATH.

Supplementary information: Supplementary data are available at Bioinformatics online.
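The minimum error correction (MEC) criterion named in the abstract can be illustrated with a short sketch. This is not PEATH itself, only the standard MEC score for a candidate haplotype pair. The fragment encoding ('0', '1', and '-' for an uncovered site) and the function name `mec_score` are assumptions made for illustration.

```python
def mec_score(fragments, haplotype):
    """Minimum error correction score of a candidate haplotype.

    fragments: list of strings over '0', '1', '-' ('-' means the read
               does not cover that site); an assumed encoding.
    haplotype: string over '0', '1'; the second haplotype is taken to be
               its bitwise complement (heterozygous sites only).
    """
    total = 0
    for frag in fragments:
        mismatches_h = 0  # disagreements with the haplotype
        mismatches_c = 0  # disagreements with its complement
        for allele, h in zip(frag, haplotype):
            if allele == '-':
                continue  # uncovered site contributes nothing
            if allele != h:
                mismatches_h += 1
            else:
                # agreeing with h means disagreeing with the complement
                mismatches_c += 1
        # each fragment is charged to whichever haplotype it fits better
        total += min(mismatches_h, mismatches_c)
    return total


fragments = ["01-", "-10", "100"]
print(mec_score(fragments, "010"))
```

A lower score means fewer allele calls would have to be corrected for every fragment to agree with one of the two haplotypes, which is why MEC is used as a comparison metric alongside phased length, N50 length and switch error rate.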

journal article

Open Access Collection

QuantumClone: clonal assessment of functional mutations in cancer based on a genotype-aware method for clonal reconstruction

Deveau, Paul; Colmet Daage, Leo; Oldridge, Derek; Bernard, Virginie; Bellini, Angela; Chicard, Mathieu; Clement, Nathalie; Lapouble, Eve; Combaret, Valerie; Boland, Anne; Meyer, Vincent; Deleuze, Jean-Francois; Janoueix-Lerosey, Isabelle; Barillot, Emmanuel; Delattre, Olivier; Maris, John M; Schleiermacher, Gudrun; Boeva, Valentina

journal article

Open Access Collection

A rapid epistatic mixed-model association analysis by linear retransformations of genomic estimated values

Ning, Chao; Wang, Dan; Kang, Huimin; Mrode, Raphael; Zhou, Lei; Xu, Shizhong; Liu, Jian-Feng

2018 Bioinformatics

doi: 10.1093/bioinformatics/bty017; pmid: 29342229

Motivation: Epistasis provides a feasible way of probing the potential genetic mechanisms of complex traits. However, time-consuming computation challenges successful detection of interactions in practice, especially when a linear mixed model (LMM) is used to control type I error in the presence of population structure and cryptic relatedness.

Results: A rapid epistatic mixed-model association analysis (REMMA) method was developed to overcome this computational limitation. This method first estimates individuals’ epistatic effects by an extended genomic best linear unbiased prediction (EG-BLUP) model with additive and epistatic kinship matrices; pairwise interaction effects are then obtained by linear retransformations of individuals’ epistatic effects. Simulation studies showed that REMMA could control type I error and increase statistical power in detecting epistatic QTNs in comparison with the existing LMM-based FaST-LMM. We applied REMMA to two real datasets, a mouse dataset and the Wellcome Trust Case Control Consortium (WTCCC) data. Application to the mouse data further confirmed the performance of REMMA in controlling type I error. For the WTCCC data, we found most epistatic QTNs for type 1 diabetes (T1D) located in a major histocompatibility complex (MHC) region, from which a large interacting network with 12 hub genes (interacting with ten or more genes) was established.

Availability and implementation: Our REMMA method can be freely accessed at https://github.com/chaoning/REMMA.

Supplementary information: Supplementary data are available at Bioinformatics online.

journal article

Open Access Collection

Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms

Ferraro Petrillo, Umberto; Roscigno, Gianluca; Cattaneo, Giuseppe; Giancarlo, Raffaele

2018 Bioinformatics

doi: 10.1093/bioinformatics/bty018; pmid: 29342232

Motivation: Information theoretic and compositional/linguistic analysis of genomes have a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}^k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data now available in applications demands resorting to parallel and distributed computing. Indeed, those types of algorithms have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices, concurrently on sets of genomes.

Results: Following the well-established approach in many disciplines, and with a growing success also in bioinformatics, to resort to MapReduce and Hadoop to deal with ‘Big Data’ problems, we present KCH, the first set of MapReduce algorithms able to perform concurrently informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications.

Availability and implementation: The software, including instructions for running it over Amazon AWS, as well as the datasets are available at http://www.di-srv.unisa.it/KCH.

Supplementary information: Supplementary data are available at Bioinformatics online.
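The k-mer statistics described in the abstract (how many times each k-mer over {A,C,G,T} occurs in a sequence) can be sketched on a single machine as a map step followed by a reduce step. This is not the KCH implementation and does not use Hadoop; the function names and the choice to skip windows containing other symbols (such as N) are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def kmer_counts(sequence, k):
    """Map step: count every k-mer over {A,C,G,T} in one sequence."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # assumption: windows containing other symbols (e.g. N) are skipped
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def merge_counts(a, b):
    """Reduce step: add two partial count tables."""
    return a + b


def collection_kmer_counts(sequences, k):
    """Count k-mers over a whole collection by mapping, then reducing."""
    partial = (kmer_counts(s, k) for s in sequences)
    return reduce(merge_counts, partial, Counter())


print(kmer_counts("ACGTAC", 2))
print(collection_kmer_counts(["ACGT", "ACGN"], 2))
```

Because the per-sequence counts are independent and merging is associative, the same two functions are the natural units to distribute across workers, which is the property a MapReduce formulation relies on.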

journal article

Open Access Collection

GTC: how to maintain huge genotype collections in a compressed form

Danek, Agnieszka; Deorowicz, Sebastian

2018 Bioinformatics

doi: 10.1093/bioinformatics/bty023; pmid: 29351600

Motivation: Nowadays, genome sequencing is frequently used in many research centers. In projects such as the Haplotype Reference Consortium or the Exome Aggregation Consortium, huge databases of genotypes in large populations are determined. Together with the increasing size of these collections, the need for fast and memory-frugal ways of representing and searching them becomes crucial.

Results: We present GTC (GenoType Compressor), a novel compressed data structure for representation of huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest publicly available database of about 60 000 haplotypes at about 40 million SNPs can be stored in <4 GB, while the queries related to variants are answered in a fraction of a second.

Availability and implementation: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc.

Supplementary information: Supplementary data are available at Bioinformatics online.

journal article

Open Access Collection

APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data

Ye, Congting; Long, Yuqi; Ji, Guoli; Li, Qingshun Quinn; Wu, Xiaohui

2018 Bioinformatics

doi: 10.1093/bioinformatics/bty029; pmid: 29360928

Motivation: Alternative polyadenylation (APA) has been increasingly recognized as a crucial mechanism that contributes to transcriptome diversity and gene expression regulation. As RNA-seq has become a routine protocol for transcriptome analysis, it is of great interest to leverage such an unprecedented collection of RNA-seq data by new computational methods to extract and quantify APA dynamics in these transcriptomes. However, research progress in this area has been relatively limited. Conventional methods rely on either transcript assembly to determine transcript 3′ ends or annotated poly(A) sites. Moreover, they can neither identify more than two poly(A) sites in a gene nor detect dynamic APA site usage considering more than two poly(A) sites.

Results: We developed an approach called APAtrap based on the mean squared error model to identify and quantify APA sites from RNA-seq data. APAtrap is capable of identifying novel 3′ UTRs and 3′ UTR extensions, which contributes to locating potential poly(A) sites in previously overlooked regions and improving genome annotations. APAtrap also aims to tally all potential poly(A) sites and detect genes with differential APA site usage between conditions. Extensive comparisons of APAtrap with two other recent methods, ChangePoint and DaPars, using various RNA-seq datasets from simulation studies, human and Arabidopsis demonstrate the efficacy and flexibility of APAtrap for any organism with an annotated genome.

Availability and implementation: Freely available for download at https://apatrap.sourceforge.io.

Supplementary information: Supplementary data are available at Bioinformatics online.

journal article

Open Access Collection

OPAL: prediction of MoRF regions in intrinsically disordered protein sequences

Sharma, Ronesh; Raicar, Gaurav; Tsunoda, Tatsuhiko; Patil, Ashwini; Sharma, Alok

2018 Bioinformatics

doi: 10.1093/bioinformatics/bty032; pmid: 29360926

Motivation: Intrinsically disordered proteins lack stable 3-dimensional structure and play a crucial role in performing various biological functions. Key to their biological function are the molecular recognition features (MoRFs) located within long disordered regions. Computationally identifying these MoRFs from disordered protein sequences is a challenging task. In this study, we present a new MoRF predictor, OPAL, to identify MoRFs in disordered protein sequences. OPAL utilizes two independent sources of information computed using different component predictors. The scores are processed and combined using a common averaging method. The first score is computed using a component MoRF predictor which utilizes composition and sequence similarity of MoRF and non-MoRF regions to detect MoRFs. The second score is calculated using half-sphere exposure (HSE), solvent accessible surface area (ASA) and backbone angle information of the disordered protein sequence, using information from the amino acid properties of flanks surrounding the MoRFs to distinguish MoRF and non-MoRF residues.

Results: OPAL is evaluated using test sets that were previously used to evaluate MoRF predictors, MoRFpred, MoRFchibi and MoRFchibi-web. The results demonstrate that OPAL outperforms all the available MoRF predictors and is the most accurate predictor available for MoRF prediction. It is available at http://www.alok-ai-lab.com/tools/opal/.

Supplementary information: Supplementary data are available at Bioinformatics online.

journal article

Open Access Collection

Splice Expression Variation Analysis (SEVA) for inter-tumor heterogeneity of gene isoform usage in cancer

Afsari, Bahman; Guo, Theresa; Considine, Michael; Florea, Liliana; Kagohara, Luciane T; Stein-O’Brien, Genevieve L; Kelley, Dylan; Flam, Emily; Zambo, Kristina D; Ha, Patrick K; Geman, Donald; Ochs, Michael F; Califano, Joseph A; Gaykalova, Daria A; Favorov, Alexander V; Fertig, Elana J

2018 Bioinformatics

doi:

journal article

Open Access Collection

Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data

Franks, Jennifer M; Cai, Guoshuai; Whitfield, Michael L

2018 Bioinformatics

doi: 10.1093/bioinformatics/bty026; pmid: 29360996

Motivation: Molecular subtypes of cancers and autoimmune disease, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene expression profiling platforms present a unique problem when analyzing data generated from different studies. Currently, there is a lack of effective methods designed to eliminate platform-based bias. We present a method to normalize and classify RNA-seq data using machine learning classifiers trained on DNA microarray data and molecular subtypes in two datasets: breast invasive carcinoma (BRCA) and colorectal cancer (CRC).

Results: Multiple analyses show that feature specific quantile normalization (FSQN) successfully removes platform-based bias from RNA-seq data, regardless of feature scaling or machine learning algorithm. We achieve up to 98% accuracy for BRCA data and 97% accuracy for CRC data in assigning molecular subtypes to RNA-seq data normalized using FSQN and a support vector machine trained exclusively on DNA microarray data. We find that maximum accuracy was achieved when normalizing RNA-seq datasets that contain at least 25 samples. FSQN allows comparison of RNA-seq data to existing DNA microarray datasets. Using these techniques, we can successfully leverage information from existing gene expression data in new analyses despite different platforms used for gene expression profiling.

Availability and implementation: FSQN has been submitted as an R package to CRAN. All code used for this study is available on GitHub (https://github.com/jenniferfranks/FSQN).

Supplementary information: Supplementary data are available at Bioinformatics online.
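The abstract does not spell out the FSQN procedure, so the following is only a sketch of one plausible reading: each feature (gene) in the data to be normalized is quantile-mapped onto the distribution of the same feature in a target dataset. The function names, the rows-as-samples layout, the linear interpolation between target quantiles and the handling of ties are all assumptions, not the published R package's behavior.

```python
def quantile_map(values, target):
    """Map values onto the empirical distribution of target,
    preserving the rank order of values (ties broken by position)."""
    n = len(values)
    target_sorted = sorted(target)
    m = len(target_sorted)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # relative position of this rank in [0, 1]
        q = rank / (n - 1) if n > 1 else 0.5
        # matching (possibly fractional) position in the sorted target
        pos = q * (m - 1)
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        out[idx] = target_sorted[lo] * (1 - frac) + target_sorted[hi] * frac
    return out


def fsqn_sketch(to_normalize, target):
    """Normalize each feature column separately against the same
    column of the target matrix (rows = samples, columns = features)."""
    n_features = len(to_normalize[0])
    columns = []
    for j in range(n_features):
        col = [row[j] for row in to_normalize]
        target_col = [row[j] for row in target]
        columns.append(quantile_map(col, target_col))
    # transpose back to rows = samples
    return [list(row) for row in zip(*columns)]


rnaseq = [[10, 5], [30, 1], [20, 3]]                # hypothetical values
microarray = [[1, 100], [2, 200], [3, 300]]         # hypothetical values
print(fsqn_sketch(rnaseq, microarray))
```

Treating every feature separately is what distinguishes this from ordinary quantile normalization, which forces whole samples onto a shared distribution. With very few samples each per-feature distribution is poorly estimated, which is consistent with the abstract's observation about needing at least 25 samples.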

journal article

Open Access Collection

A distance-based approach for testing the mediation effect of the human microbiome

Zhang, Jie; Wei, Zhi; Chen, Jun

2018 Bioinformatics

doi: 10.1093/bioinformatics/bty014; pmid: 29346509

Motivation: Recent studies have revealed a complex interplay between environment, the human microbiome and health and disease. Mediation analysis of the human microbiome in these complex relationships could potentially provide insights into the role of the microbiome in the etiology of disease and, more importantly, lead to novel clinical interventions by modulating the microbiome. However, due to the high dimensionality, sparsity, non-normality and phylogenetic structure of microbiome data, none of the existing methods are suitable for testing such clinically important mediation effect.

Results: We propose a distance-based approach for testing the mediation effect of the human microbiome. In the framework, the nonlinear relationship between the human microbiome and independent/dependent variables is captured implicitly through the use of sample-wise ecological distances, and the phylogenetic tree information is conveniently incorporated by using phylogeny-based distance metrics. Multiple distance metrics are utilized to maximize the power to detect various types of mediation effect. Simulation studies demonstrate that our method has correct Type I error control, and is robust and powerful under various mediation models. Application to a real gut microbiome dataset revealed that the association between the dietary fiber intake and body mass index was mediated by the gut microbiome.

Availability and implementation: An R package ‘MedTest’ is freely available at https://github.com/jchen1981/MedTest.

Supplementary information: Supplementary data are available at Bioinformatics online.

Motivation: In cancer, clonal evolution is assessed based on information coming from single nucleotide variants and copy number alterations. Nonetheless, existing methods often fail to accurately combine information from both sources to truthfully reconstruct clonal populations in a given tumor sample or in a set of tumor samples coming from the same patient. Moreover, previously published methods detect clones from a single set of variants. As a result, compromises have to be made between stringent variant filtering [reducing dispersion in variant allele frequency estimates (VAFs)] and using all biologically relevant variants.

Results: We present a framework for defining cancer clones using the most reliable variants of high depth of coverage and assigning functional mutations to the detected clones. The key element of our framework is QuantumClone, a method for variant clustering into clones based on VAFs, genotypes of corresponding regions and information about tumor purity. We validated QuantumClone and our framework on simulated data. We then applied our framework to whole genome sequencing data for 19 neuroblastoma trios each including constitutional, diagnosis and relapse samples. We confirmed an enrichment of damaging variants within such pathways as MAPK (mitogen-activated protein kinases), neuritogenesis, epithelial-mesenchymal transition, cell survival and DNA repair. Most pathways had more damaging variants in the expanding clones compared to shrinking ones, which can be explained by the increased total number of variants between these two populations. Functional mutational rate varied for ancestral clones and clones shrinking or expanding upon treatment, suggesting changes in clone selection mechanisms at different time points of tumor evolution.

Availability and implementation: Source code and binaries of the QuantumClone R package are freely available for download at https://CRAN.R-project.org/package=QuantumClone.

Supplementary information: Supplementary data are available at Bioinformatics online.