PEATH: single-individual haplotyping by a probabilistic evolutionary algorithm with toggling
Na, Joong Chae; Lee, Jong-Chan; Rhee, Je-Keun; Shin, Soo-Yong
doi: 10.1093/bioinformatics/bty012
pmid: 29342247
Motivation: Single-individual haplotyping (SIH) is critical in genomic association studies and genetic disease analysis. However, most genomic analysis studies do not perform haplotype-phasing analysis due to its complexity. Several computational methods have been developed to solve the SIH problem, but these approaches have not generated sufficiently reliable haplotypes.
Results: Here, we propose a novel SIH algorithm, called PEATH (Probabilistic Evolutionary Algorithm with Toggling for Haplotyping), to achieve more accurate and reliable haplotyping. PEATH was compared to the most recent algorithms in terms of phased length, N50 length, switch error rate and minimum error correction. PEATH consistently provides the best phased and N50 lengths, as long as possible for the given datasets. In addition, verification on simulation data demonstrated that PEATH outperforms other methods on highly noisy data. Additionally, experimental results on a real dataset confirmed that PEATH achieved comparable or better accuracy.
Availability and implementation: Source code of PEATH is available at https://github.com/jcna99/PEATH.
Supplementary information: Supplementary data are available at Bioinformatics online.
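Two of the comparison metrics named in this abstract, N50 length and minimum error correction (MEC), can be sketched as follows. This is a minimal sketch under the commonly used definitions, not code from PEATH. The function names, the string encoding of fragments and the `-` symbol for uncovered sites are assumptions made for illustration. Note that `mec` only scores one given candidate haplotype; it does not search for the haplotype that minimizes the score.

```python
def n50(block_lengths):
    """Return the N50: the largest length L such that blocks of
    length >= L together cover at least half of the total length."""
    total = sum(block_lengths)
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0  # empty input


def mec(fragments, haplotype):
    """Score a candidate haplotype by minimum error correction.

    fragments: strings over '0', '1' and '-' ('-' = site not covered),
               all aligned to the same sites as `haplotype`.
    haplotype: string over '0' and '1'; the second haplotype is assumed
               to be its bitwise complement (heterozygous sites only).
    Returns the number of allele calls that must be flipped so that every
    fragment agrees with one of the two haplotypes.
    """
    complement = "".join("1" if a == "0" else "0" for a in haplotype)
    total = 0
    for frag in fragments:
        # Mismatches against each haplotype, ignoring uncovered sites.
        d_hap = sum(1 for f, a in zip(frag, haplotype) if f != "-" and f != a)
        d_comp = sum(1 for f, a in zip(frag, complement) if f != "-" and f != a)
        total += min(d_hap, d_comp)
    return total
```

For example, `n50([10, 20, 30, 40])` accumulates 40 and then 30 before reaching half of the total length of 100, so the N50 is 30.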
A rapid epistatic mixed-model association analysis by linear retransformations of genomic estimated values
Ning, Chao; Wang, Dan; Kang, Huimin; Mrode, Raphael; Zhou, Lei; Xu, Shizhong; Liu, Jian-Feng
doi: 10.1093/bioinformatics/bty017
pmid: 29342229
Motivation: Epistasis provides a feasible way of probing the potential genetic mechanisms of complex traits. However, time-consuming computation hampers the successful detection of interactions in practice, especially when a linear mixed model (LMM) is used to control type I error in the presence of population structure and cryptic relatedness.
Results: A rapid epistatic mixed-model association analysis (REMMA) method was developed to overcome this computational limitation. The method first estimates individuals’ epistatic effects with an extended genomic best linear unbiased prediction (EG-BLUP) model using additive and epistatic kinship matrices; pairwise interaction effects are then obtained by linear retransformations of the individuals’ epistatic effects. Simulation studies showed that REMMA could control type I error and increase statistical power in detecting epistatic QTNs in comparison with the existing LMM-based method FaST-LMM. We applied REMMA to two real datasets, a mouse dataset and the Wellcome Trust Case Control Consortium (WTCCC) data. Application to the mouse data further confirmed the performance of REMMA in controlling type I error. For the WTCCC data, we found that most epistatic QTNs for type 1 diabetes (T1D) were located in a major histocompatibility complex (MHC) region, from which a large interacting network with 12 hub genes (each interacting with ten or more genes) was established.
Availability and implementation: Our REMMA method can be freely accessed at https://github.com/chaoning/REMMA.
Supplementary information: Supplementary data are available at Bioinformatics online.
Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms
Ferraro Petrillo, Umberto; Roscigno, Gianluca; Cattaneo, Giuseppe; Giancarlo, Raffaele
doi: 10.1093/bioinformatics/bty018
pmid: 29342232
Motivation: Information theoretic and compositional/linguistic analyses of genomes have a central role in bioinformatics, even more so since the associated methodologies are also becoming very valuable for epigenomic and meta-genomic studies. The kernel of those methods is the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}^k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data now available in applications demands resorting to parallel and distributed computing. Indeed, algorithms of this type have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices concurrently on sets of genomes.
Results: Following the approach, well established in many disciplines and increasingly successful in bioinformatics, of resorting to MapReduce and Hadoop to deal with ‘Big Data’ problems, we present KCH, the first set of MapReduce algorithms able to concurrently perform informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with the parallel and distributed algorithms highly specialized for k-mer statistics collection in genome assembly problems. In conclusion, KCH is a much-needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications.
Availability and implementation: The software, including instructions for running it over Amazon AWS, as well as the datasets are available at http://www.di-srv.unisa.it/KCH.
Supplementary information: Supplementary data are available at Bioinformatics online.
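The k-mer statistics described in this abstract can be illustrated on a single machine with a short Python sketch. It is not KCH and involves no MapReduce or Hadoop. The function names are hypothetical, and Shannon entropy is used here only as one example of an informational index computed from the counts, which is an assumption rather than a statement of which indices KCH supports.

```python
import math
from collections import Counter

DNA = set("ACGT")


def kmer_counts(sequence, k):
    """Count every k-mer over {A,C,G,T} with a sliding window.

    Windows containing any other symbol (e.g. 'N') are skipped.
    """
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= DNA:
            counts[kmer] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical k-mer distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

In a MapReduce setting, the sliding-window loop would roughly correspond to a map step emitting (k-mer, 1) pairs and the `Counter` update to a reduce step summing them; that mapping is a general observation and not a description of KCH's design.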
GTC: how to maintain huge genotype collections in a compressed form
Danek, Agnieszka; Deorowicz, Sebastian
doi: 10.1093/bioinformatics/bty023
pmid: 29351600
Motivation: Genome sequencing is now used routinely in many research centers. Projects such as the Haplotype Reference Consortium or the Exome Aggregation Consortium produce huge databases of genotypes for large populations. As these collections grow, fast and memory-frugal ways of representing and searching them become crucial.
Results: We present GTC (GenoType Compressor), a novel compressed data structure for representing huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and the time needed to answer various types of queries. We show that the largest publicly available database, of about 60 000 haplotypes at about 40 million SNPs, can be stored in <4 GB, while queries related to variants are answered in a fraction of a second.
Availability and implementation: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc.
Supplementary information: Supplementary data are available at Bioinformatics online.
APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data
Ye, Congting; Long, Yuqi; Ji, Guoli; Li, Qingshun Quinn; Wu, Xiaohui
doi: 10.1093/bioinformatics/bty029
pmid: 29360928
Motivation: Alternative polyadenylation (APA) has been increasingly recognized as a crucial mechanism that contributes to transcriptome diversity and gene expression regulation. As RNA-seq has become a routine protocol for transcriptome analysis, it is of great interest to leverage such an unprecedented collection of RNA-seq data with new computational methods that extract and quantify APA dynamics in these transcriptomes. However, research progress in this area has been relatively limited. Conventional methods rely on either transcript assembly to determine transcript 3′ ends or annotated poly(A) sites. Moreover, they can neither identify more than two poly(A) sites in a gene nor detect dynamic APA site usage involving more than two poly(A) sites.
Results: We developed an approach called APAtrap, based on the mean squared error model, to identify and quantify APA sites from RNA-seq data. APAtrap is capable of identifying novel 3′ UTRs and 3′ UTR extensions, which contributes to locating potential poly(A) sites in previously overlooked regions and improving genome annotations. APAtrap also aims to tally all potential poly(A) sites and detect genes with differential APA site usage between conditions. Extensive comparisons of APAtrap with two other recent methods, ChangePoint and DaPars, using various RNA-seq datasets from simulation studies, human and Arabidopsis demonstrate the efficacy and flexibility of APAtrap for any organism with an annotated genome.
Availability and implementation: Freely available for download at https://apatrap.sourceforge.io.
Supplementary information: Supplementary data are available at Bioinformatics online.
OPAL: prediction of MoRF regions in intrinsically disordered protein sequences
Sharma, Ronesh; Raicar, Gaurav; Tsunoda, Tatsuhiko; Patil, Ashwini; Sharma, Alok
doi: 10.1093/bioinformatics/bty032
pmid: 29360926
Motivation: Intrinsically disordered proteins lack stable 3-dimensional structure and play a crucial role in performing various biological functions. Key to their biological function are the molecular recognition features (MoRFs) located within long disordered regions. Computationally identifying these MoRFs from disordered protein sequences is a challenging task. In this study, we present a new MoRF predictor, OPAL, to identify MoRFs in disordered protein sequences. OPAL utilizes two independent sources of information computed using different component predictors. The scores are processed and combined using a common averaging method. The first score is computed using a component MoRF predictor that utilizes the composition and sequence similarity of MoRF and non-MoRF regions to detect MoRFs. The second score is calculated using half-sphere exposure (HSE), solvent accessible surface area (ASA) and backbone angle information of the disordered protein sequence, using information from the amino acid properties of flanks surrounding the MoRFs to distinguish MoRF and non-MoRF residues.
Results: OPAL is evaluated using test sets that were previously used to evaluate the MoRF predictors MoRFpred, MoRFchibi and MoRFchibi-web. The results demonstrate that OPAL outperforms all the available MoRF predictors and is the most accurate predictor available for MoRF prediction. It is available at http://www.alok-ai-lab.com/tools/opal/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data
Franks, Jennifer M; Cai, Guoshuai; Whitfield, Michael L
doi: 10.1093/bioinformatics/bty026
pmid: 29360996
Motivation: Molecular subtypes of cancers and autoimmune disease, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene expression profiling platforms present a unique problem when analyzing data generated from different studies. Currently, there is a lack of effective methods designed to eliminate platform-based bias. We present a method to normalize and classify RNA-seq data using machine learning classifiers trained on DNA microarray data and molecular subtypes in two datasets: breast invasive carcinoma (BRCA) and colorectal cancer (CRC).
Results: Multiple analyses show that feature specific quantile normalization (FSQN) successfully removes platform-based bias from RNA-seq data, regardless of feature scaling or machine learning algorithm. We achieve up to 98% accuracy for BRCA data and 97% accuracy for CRC data in assigning molecular subtypes to RNA-seq data normalized using FSQN and a support vector machine trained exclusively on DNA microarray data. We find that maximum accuracy was achieved when normalizing RNA-seq datasets that contain at least 25 samples. FSQN allows comparison of RNA-seq data to existing DNA microarray datasets. Using these techniques, we can successfully leverage information from existing gene expression data in new analyses despite the different platforms used for gene expression profiling.
Availability and implementation: FSQN has been submitted as an R package to CRAN. All code used for this study is available on GitHub (https://github.com/jenniferfranks/FSQN).
Supplementary information: Supplementary data are available at Bioinformatics online.
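The general idea of normalizing each feature separately against a target distribution can be sketched in plain Python. This is an illustrative reading of "feature specific quantile normalization", not the FSQN R package's implementation: it assumes that, for each gene, the values across the RNA-seq samples are mapped by rank onto the quantiles of the same gene across the microarray samples, with linear interpolation when the sample counts differ. The function names are hypothetical, and ties are broken by sample order rather than averaged.

```python
import math


def _quantile(sorted_values, q):
    """Linearly interpolated quantile q in [0, 1] of a sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def normalize_feature(test_values, target_values):
    """Map one feature's test values onto the target feature's distribution."""
    target_sorted = sorted(target_values)
    n = len(test_values)
    # Indices of the test samples ordered by increasing value.
    order = sorted(range(n), key=lambda i: test_values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        q = rank / (n - 1) if n > 1 else 0.5
        out[i] = _quantile(target_sorted, q)
    return out


def fsqn_sketch(test, target):
    """Normalize a samples-by-features matrix feature by feature.

    test:   list of rows (samples), e.g. RNA-seq expression values
    target: list of rows (samples) with the same features, e.g. microarray
    """
    n_features = len(test[0])
    columns = [
        normalize_feature([row[j] for row in test], [row[j] for row in target])
        for j in range(n_features)
    ]
    # Transpose the normalized columns back into sample rows.
    return [list(row) for row in zip(*columns)]
```

After this step each gene in the test matrix has the same rank order across samples as before, but its values follow the corresponding gene's distribution in the target matrix, which is what would let a classifier trained on the target platform be applied to the normalized data.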
A distance-based approach for testing the mediation effect of the human microbiome
Zhang, Jie; Wei, Zhi; Chen, Jun
doi: 10.1093/bioinformatics/bty014
pmid: 29346509
Motivation: Recent studies have revealed a complex interplay between environment, the human microbiome and health and disease. Mediation analysis of the human microbiome in these complex relationships could potentially provide insights into the role of the microbiome in the etiology of disease and, more importantly, lead to novel clinical interventions by modulating the microbiome. However, due to the high dimensionality, sparsity, non-normality and phylogenetic structure of microbiome data, none of the existing methods is suitable for testing such a clinically important mediation effect.
Results: We propose a distance-based approach for testing the mediation effect of the human microbiome. In this framework, the nonlinear relationship between the human microbiome and independent/dependent variables is captured implicitly through the use of sample-wise ecological distances, and the phylogenetic tree information is conveniently incorporated by using phylogeny-based distance metrics. Multiple distance metrics are utilized to maximize the power to detect various types of mediation effects. Simulation studies demonstrate that our method has correct Type I error control, and is robust and powerful under various mediation models. Application to a real gut microbiome dataset revealed that the association between dietary fiber intake and body mass index was mediated by the gut microbiome.
Availability and implementation: An R package ‘MedTest’ is freely available at https://github.com/jchen1981/MedTest.
Supplementary information: Supplementary data are available at Bioinformatics online.