IDRMutPred: predicting disease-associated germline nonsynonymous single nucleotide variants (nsSNVs) in intrinsically disordered regions
Zhou, Jing-Bo; Xiong, Yao; An, Ke; Ye, Zhi-Qiang; Wu, Yun-Dong
doi: 10.1093/bioinformatics/btaa618
pmid: 32756939
Motivation: Despite lacking folded structure, intrinsically disordered regions (IDRs) of proteins play versatile roles in various biological processes, and many nonsynonymous single nucleotide variants (nsSNVs) in IDRs are associated with human diseases. The continuous accumulation of nsSNVs resulting from the wide application of NGS has driven the development of disease-association prediction methods for decades. However, their performance on nsSNVs in IDRs remains inferior, possibly due to the domination of nsSNVs from structured regions in training data. There is therefore a strong need for a better-performing disease-association predictor built specifically for nsSNVs in IDRs.
Results: We present IDRMutPred, a machine learning-based tool specifically for predicting disease-associated germline nsSNVs in IDRs. Based on 17 selected optimal features that are extracted from sequence alignments, protein annotations, hydrophobicity indices and disorder scores, IDRMutPred was trained using three ensemble learning algorithms on a training dataset containing only IDR nsSNVs. Evaluation on the two testing datasets shows that all three prediction models significantly outperform 17 other popular general predictors, achieving ACC between 0.856 and 0.868 and MCC between 0.713 and 0.737. IDRMutPred will prioritize disease-associated IDR germline nsSNVs more reliably than general predictors.
Availability and implementation: The software is freely available at http://www.wdspdb.com/IDRMutPred.
Supplementary information: Supplementary data are available at Bioinformatics online.
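The ACC and MCC figures quoted above are presumably accuracy and the Matthews correlation coefficient, both of which can be derived from a binary confusion matrix. The following is a minimal sketch of those two formulas; the function name `acc_mcc` and the counts in the example are hypothetical and are not taken from the IDRMutPred evaluation.

```python
import math

def acc_mcc(tp, tn, fp, fn):
    """Return (accuracy, Matthews correlation coefficient) from confusion counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    acc = (tp + tn) / total
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc

# Hypothetical counts for illustration only (not data from the paper).
acc, mcc = acc_mcc(tp=90, tn=85, fp=15, fn=10)
print(f"ACC={acc:.3f} MCC={mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside ACC when the disease and neutral classes are unbalanced.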
Reanalysis of genome sequences of tomato accessions and its wild relatives: development of Tomato Genomic Variation (TGV) database integrating SNPs and INDELs polymorphisms
Gupta, Prateek; Dholaniya, Pankaj Singh; Devulapalli, Sameera; Tawari, Nilesh Ramesh; Sreelakshmi, Yellamaraju; Sharma, Rameshwar
doi: 10.1093/bioinformatics/btaa617
pmid: 32829394
Motivation: Facilitated by technological advances and a rapid decrease in sequencing costs, whole-genome sequencing is increasingly used to uncover variations in cultivars/accessions of many crop plants. In tomato (Solanum lycopersicum), the availability of the genome sequence, followed by the resequencing of tomato cultivars and their wild relatives, has provided a rich resource for the improvement of traits. A high-quality genome resequencing of 84 tomato accessions and wild relatives generated a dataset that can be used as a resource to identify agronomically important alleles across the genome. Converting this dataset into a searchable database, including information about the influence of single-nucleotide polymorphisms (SNPs) on protein function, provides valuable information about the genetic variations. The database will assist in searching for functional variants of a gene for introgression into tomato cultivars.
Results: A recent release of the better-quality tomato genome reference assembly SL3.0, and the new annotation ITAG3.2 of SL3.0, dropped 3857 genes, added 4900 novel genes and updated 20 766 genes. Using this version, we remapped the data from the tomato lines resequenced under the ‘100 tomato genome resequencing project’ onto the new tomato genome assembly SL3.0 and built an online searchable Tomato Genomic Variation (TGV) database. The TGV database contains information about SNPs and insertion/deletion events and extends it with functional annotation of variants against the new ITAG3.2 using SIFT4G software. The search function assists in inferring the influence of SNPs on the function of a target gene, and the database can be used to select SNPs that could potentially be deployed for improving tomato traits.
Availability and implementation: TGV is freely available at http://psd.uohyd.ac.in/tgv.
TE-greedy-nester: structure-based detection of LTR retrotransposons and their nesting
Lexa, Matej; Jedlicka, Pavel; Vanat, Ivan; Cervenansky, Michal; Kejnovsky, Eduard
doi: 10.1093/bioinformatics/btaa632
pmid: 32663247
Motivation: Transposable elements (TEs) in eukaryotes often get inserted into one another, forming sequences that become a complex mixture of full-length elements and their fragments. The reconstruction of full-length elements and the order in which they have been inserted is important for genome and transposon evolution studies. However, the accumulation of mutations and genome rearrangements over evolutionary time makes this process error-prone and decreases the efficiency of software aiming to recover all nested full-length TEs.
Results: We created software that uses a greedy recursive algorithm to mine increasingly fragmented copies of full-length LTR retrotransposons in assembled genomes and other sequence data. The software, called TE-greedy-nester, considers not only sequence similarity but also the structure of elements. This new tool was tested on a set of natural and synthetic sequences and its accuracy was compared to similar software. We found TE-greedy-nester to be superior in a number of parameters, namely computation time and full-length TE recovery in highly nested regions.
Availability and implementation: http://gitlab.fi.muni.cz/lexa/nested.
Supplementary information: Supplementary data are available at Bioinformatics online.
TALC: Transcript-level Aware Long-read Correction
Broseus, Lucile; Thomas, Aubin; Oldfield, Andrew J; Severac, Dany; Dubois, Emeric; Ritchie, William
doi: 10.1093/bioinformatics/btaa634
pmid: 32910174
Motivation: Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous ‘hybrid correction’ algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data.
Results: We have created a novel reference-free algorithm called Transcript-level Aware Long-Read Correction (TALC), which models changes in RNA expression and isoform representation in a weighted De Bruijn graph to correct long reads from transcriptome studies. We show that transcript-level aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long-read technology.
Availability and implementation: TALC is implemented in C++ and available at https://github.com/lbroseus/TALC.
Supplementary information: Supplementary data are available at Bioinformatics online.
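To make the term "weighted De Bruijn graph" concrete, here is a minimal sketch that builds one from short reads, with each edge weighted by how often its k-mer was observed. This is a generic textbook construction, not TALC's C++ implementation; the function name `weighted_de_bruijn`, the toy reads and the choice of k are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def weighted_de_bruijn(reads, k):
    """Build a de Bruijn graph whose edges are k-mers weighted by their count.

    Nodes are (k-1)-mers; each observed k-mer adds an edge from its prefix
    to its suffix, and the edge weight is the number of times it was seen.
    """
    kmer_counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    graph = defaultdict(dict)
    for kmer, count in kmer_counts.items():
        graph[kmer[:-1]][kmer[1:]] = count
    return graph

# Toy short reads (hypothetical). The last one diverges after "ACGT".
short_reads = ["ACGTAC", "CGTACG", "ACGTTC"]
graph = weighted_de_bruijn(short_reads, k=4)
for node, edges in sorted(graph.items()):
    print(node, dict(edges))
```

In a transcriptome setting the edge weights reflect expression level as well as sequencing error, so a low-weight branch such as `CGT -> GTT` may be a genuine minor isoform and not necessarily an error.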
Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees
Smith, Martin R
doi: 10.1093/bioinformatics/btaa614
pmid: 32619004
Motivation: The Robinson–Foulds (RF) metric is widely used by biologists, linguists and chemists to quantify similarity between pairs of phylogenetic trees. The measure tallies the number of bipartition splits that occur in both trees, but this conservative approach ignores potential similarities between almost-identical splits, with undesirable consequences. ‘Generalized’ RF metrics address this shortcoming by pairing splits in one tree with similar splits in the other. Each pair is assigned a similarity score, the sum of which enumerates the similarity between two trees. The challenge lies in quantifying split similarity: existing definitions lack a principled statistical underpinning, resulting in misleading tree distances that are difficult to interpret. Here, I propose probabilistic measures of split similarity, which allow tree similarity to be measured in natural units (bits).
Results: My new information-theoretic metrics outperform alternative measures of tree similarity when evaluated against a broad suite of criteria, even though they do not account for the non-independence of splits within a single tree. Mutual clustering information exhibits none of the undesirable properties that characterize other tree comparison metrics, and should be preferred to the RF metric.
Availability and implementation: The methods discussed in this article are implemented in the R package ‘TreeDist’, archived at https://dx.doi.org/10.5281/zenodo.3528123.
Supplementary information: Supplementary data are available at Bioinformatics online.
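The classical RF measure described above can be sketched in a few lines once each tree is represented by its set of non-trivial splits. The code below is an illustration of that plain tally only, not of the generalized or information-theoretic metrics in ‘TreeDist’; the function names and the two six-leaf example trees are hypothetical.

```python
def normalize_split(side, leaves):
    """Return a canonical side of a bipartition: the one holding the first leaf."""
    side = frozenset(side)
    other = frozenset(leaves) - side
    return side if min(leaves) in side else other

def robinson_foulds(splits_a, splits_b, leaves):
    """Return (shared splits, RF distance) for two trees on the same leaves.

    Each tree is given as an iterable of splits, and each split as the set of
    leaves on one side of an internal edge. The RF distance counts the splits
    found in exactly one of the two trees.
    """
    a = {normalize_split(s, leaves) for s in splits_a}
    b = {normalize_split(s, leaves) for s in splits_b}
    return len(a & b), len(a ^ b)

leaves = {"A", "B", "C", "D", "E", "F"}
# Tree 1: ((A,B),C,(D,(E,F)))    Tree 2: ((A,B),D,(C,(E,F)))
tree1 = [{"A", "B"}, {"D", "E", "F"}, {"E", "F"}]
tree2 = [{"A", "B"}, {"C", "E", "F"}, {"E", "F"}]
shared, rf = robinson_foulds(tree1, tree2, leaves)
print(shared, rf)
```

The two trees differ only in whether C or D groups with (E,F), yet the splits `ABC|DEF` and `ABD|CEF` each count as a full mismatch. This is the all-or-nothing behaviour that generalized RF metrics aim to soften by scoring how similar such near-identical splits are.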
Automatic classification of single-molecule force spectroscopy traces from heterogeneous samples
Ilieva, Nina I; Galvanetto, Nicola; Allegra, Michele; Brucale, Marco; Laio, Alessandro
doi: 10.1093/bioinformatics/btaa626
pmid: 32653898
Motivation: Single-molecule force spectroscopy (SMFS) experiments pose the challenge of analysing protein unfolding data (traces) coming from preparations with heterogeneous composition (e.g. where different proteins are present in the sample). An automatic procedure able to distinguish the unfolding patterns of the proteins is needed. Here, we introduce a data analysis pipeline able to recognize in such datasets traces with recurrent patterns (clusters).
Results: We illustrate the performance of our method on two prototypical datasets: ∼50 000 traces from a sample containing tandem GB1 and ∼400 000 traces from a native rod membrane. Despite a daunting signal-to-noise ratio in the data, we are able to identify several unfolding clusters. This work demonstrates how an automatic pattern classification can extract relevant information from SMFS traces from heterogeneous samples without prior knowledge of the sample composition.
Availability and implementation: https://github.com/ninailieva/SMFS_clustering.
Supplementary information: Supplementary data are available at Bioinformatics online.
OPUS-TASS: a protein backbone torsion angles and secondary structure predictor based on ensemble neural networks
Xu, Gang; Wang, Qinghua; Ma, Jianpeng
doi: 10.1093/bioinformatics/btaa629
pmid: 32678893
Motivation: Predictions of protein backbone torsion angles (ϕ and ψ) and secondary structure from sequence are crucial subproblems in protein structure prediction. With the development of deep learning approaches, their accuracies have been significantly improved. To capture long-range interactions, most studies integrate bidirectional recurrent neural networks into their models. In this study, we introduce and modify a recently proposed architecture named Transformer to capture interactions between any two residues, in principle at arbitrary distance. Moreover, we take advantage of multitask learning to improve the generalization of the neural network by introducing related tasks into the training process. Like many previous studies, OPUS-TASS uses an ensemble of models to achieve better results.
Results: OPUS-TASS uses the same training and validation sets as SPOT-1D. We compare the performance of OPUS-TASS and SPOT-1D on TEST2016 (1213 proteins) and TEST2018 (250 proteins) proposed in the SPOT-1D paper, CASP12 (55 proteins), CASP13 (32 proteins) and CASP-FM (56 proteins) proposed in the SAINT paper, and a recently released PDB structure collection from CAMEO (93 proteins) named CAMEO93. On these six test sets, OPUS-TASS achieves consistent improvements in both backbone torsion angle prediction and secondary structure prediction. On CAMEO93, SPOT-1D achieves mean absolute errors of 16.89 and 23.02 for ϕ and ψ predictions, respectively, and accuracies of 87.72 and 77.15% for 3- and 8-state secondary structure predictions, respectively. In comparison, OPUS-TASS achieves 16.56 and 22.56 for ϕ and ψ predictions, and 89.06 and 78.87% for 3- and 8-state secondary structure predictions, respectively. In particular, after using our torsion angle refinement method OPUS-Refine as the post-processing procedure for OPUS-TASS, the mean absolute errors for the final ϕ and ψ predictions are further decreased to 16.28 and 21.98, respectively.
Availability and implementation: The training and inference codes of OPUS-TASS and its data are available at https://github.com/thuxugang/opus_tass.
Supplementary information: Supplementary data are available at Bioinformatics online.
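Because torsion angles are periodic, a mean absolute error over ϕ or ψ is normally computed on the shorter arc between prediction and observation, so that −170° and 175° count as 15° apart and not 345°. The sketch below assumes that convention and assumes angles are given in degrees; the abstract does not state how OPUS-TASS computes its errors, and the function name `angular_mae` and the sample angles are hypothetical.

```python
def angular_mae(predicted, observed):
    """Mean absolute error between two angle lists (degrees), with wrap-around."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, o in zip(predicted, observed):
        diff = abs(p - o) % 360.0
        # Take the shorter way around the circle.
        total += min(diff, 360.0 - diff)
    return total / len(predicted)

# Hypothetical angles for three residues (not data from the paper).
predicted_phi = [-170.0, 60.0, 10.0]
observed_phi = [175.0, 50.0, -10.0]
mae = angular_mae(predicted_phi, observed_phi)
print(mae)
```

Without the wrap-around step the first residue alone would contribute 345° to the sum, which would make errors near the ±180° boundary dominate the average.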
Efficient weighted univariate clustering maps outstanding dysregulated genomic zones in human cancers
Song, Mingzhou; Zhong, Hua
doi: 10.1093/bioinformatics/btaa613
pmid: 32619008
Motivation: Chromosomal patterning of gene expression in cancer can arise from aneuploidy, genome disorganization or abnormal DNA methylation. To map such patterns, we introduce a weighted univariate clustering algorithm to guarantee linear runtime, optimality and reproducibility.
Results: We present the chromosome clustering method, establish its optimality and runtime and evaluate its performance. It uses dynamic programming enhanced with an algorithm that reduces the search space in place to decrease runtime overhead. Using the method, we delineated outstanding genomic zones in 17 human cancer types. We identified strong continuity in dysregulation polarity (dominance by either up- or downregulated genes in a zone) along chromosomes in all cancer types. Significantly polarized dysregulation zones specific to cancer types are found, offering potential diagnostic biomarkers. Previously unreported, a total of 109 loci with conserved dysregulation polarity across cancer types give insights into pan-cancer mechanisms. Efficient chromosomal clustering opens a window to characterize molecular patterns in cancer genomes and beyond.
Availability and implementation: Weighted univariate clustering algorithms are implemented within the R package ‘Ckmeans.1d.dp’ (4.0.0 or above), freely available at https://cran.r-project.org/package=Ckmeans.1d.dp.
Supplementary information: Supplementary data are available at Bioinformatics online.
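The idea of optimal univariate clustering by dynamic programming can be illustrated with a deliberately simple version: sort the points, then fill a table of the best weighted within-cluster sum of squares for the first j points split into c contiguous clusters. The sketch below runs in O(kn²) time and omits the search-space reduction that the abstract credits for the faster runtime, so it is not the algorithm in ‘Ckmeans.1d.dp’; the function name and the sample values are hypothetical.

```python
def weighted_kmeans_1d(xs, ws, k):
    """Optimal weighted k-means in one dimension by plain dynamic programming.

    Returns (total weighted SSE, list of clusters as sorted value lists).
    Requires 1 <= k <= len(xs) and positive weights.
    """
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of points")
    order = sorted(range(n), key=lambda i: xs[i])
    x = [xs[i] for i in order]
    w = [ws[i] for i in order]

    # Prefix sums of w, w*x and w*x^2 give any segment's cost in O(1).
    sw, swx, swxx = [0.0], [0.0], [0.0]
    for xi, wi in zip(x, w):
        sw.append(sw[-1] + wi)
        swx.append(swx[-1] + wi * xi)
        swxx.append(swxx[-1] + wi * xi * xi)

    def cost(i, j):
        """Weighted sum of squared deviations from the mean for x[i:j]."""
        tw = sw[j] - sw[i]
        tx = swx[j] - swx[i]
        return max(0.0, (swxx[j] - swxx[i]) - tx * tx / tw)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):  # last cluster is x[i:j]
                cand = best[c - 1][i] + cost(i, j)
                if cand < best[c][j]:
                    best[c][j], back[c][j] = cand, i

    # Walk the back-pointers to recover cluster boundaries.
    clusters, j = [], n
    for c in range(k, 0, -1):
        i = back[c][j]
        clusters.append(x[i:j])
        j = i
    clusters.reverse()
    return best[k][n], clusters

# Hypothetical values with unit weights.
values = [1.0, 1.2, 0.9, 5.0, 5.1, 9.0]
weights = [1.0] * len(values)
sse, clusters = weighted_kmeans_1d(values, weights, k=3)
print(sse, clusters)
```

Because one-dimensional clusters are always contiguous after sorting, the table covers every candidate partition, which is what makes the result optimal and reproducible, in contrast to heuristic k-means with random starts.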
Galgo: a bi-objective evolutionary meta-heuristic identifies robust transcriptomic classifiers associated with patient outcome across multiple cancer types
Guerrero-Gimenez, M E; Fernandez-Muñoz, J M; Lang, B J; Holton, K M; Ciocca, D R; Catania, C A; Zoppino, F C M
doi: 10.1093/bioinformatics/btaa619
pmid: 32638009
Motivation: Statistical and machine-learning analyses of tumor transcriptomic profiles offer a powerful resource to gain deeper understanding of tumor subtypes and disease prognosis. Currently, prognostic gene-expression signatures do not exist for all cancer types, and most developed to date have been optimized for individual tumor types. In Galgo, we implement a bi-objective optimization approach that prioritizes gene signature cohesiveness and patient survival in parallel, which provides greater power to identify tumor transcriptomic phenotypes strongly associated with patient survival.
Results: To compare the predictive power of the signatures obtained by Galgo with previously studied subtyping methods, we used a meta-analytic approach testing a total of 35 large population-based transcriptomic biobanks of four different cancer types. Galgo-generated colorectal and lung adenocarcinoma signatures were stronger predictors of patient survival compared to published molecular classification schemes. One Galgo-generated breast cancer signature outperformed PAM50, AIMS, SCMGENE and IntClust subtyping predictors. In high-grade serous ovarian cancer, Galgo signatures obtained similar predictive power to a consensus classification method. In all cases, Galgo subtypes reflected enrichment of gene sets related to the hallmarks of the disease, which highlights the biological relevance of the partitions found.
Availability and implementation: The open-source R package is available on www.github.com/harpomaxx/galgo.
Supplementary information: Supplementary data are available at Bioinformatics online.
Exploring generative deep learning for omics data using log-linear models
Hess, Moritz; Hackenberg, Maren; Binder, Harald
doi: 10.1093/bioinformatics/btaa623
pmid: 32647888
Motivation: Following many successful applications to image data, deep learning is now also increasingly considered for omics data. In particular, generative deep learning not only provides competitive prediction performance, but also allows for uncovering structure by generating synthetic samples. However, exploration and visualization are not as straightforward as with image applications.
Results: We demonstrate how log-linear models, fitted to the generated synthetic data, can be used to extract patterns that deep generative techniques have learned from omics data. Specifically, interactions between latent representations learned by the approaches and generated synthetic data are used to determine sets of joint patterns. Distances of patterns with respect to the distribution of latent representations are then visualized in low-dimensional coordinate systems, e.g. for monitoring training progress. This is illustrated with simulated data and subsequently with cortical single-cell gene expression data. Using different kinds of deep generative techniques, specifically variational autoencoders and deep Boltzmann machines, the proposed approach highlights how the techniques uncover underlying structure. It facilitates the real-world use of such generative deep learning techniques to gain biological insights from omics data.
Availability and implementation: The code for the approach, as well as an accompanying Jupyter notebook that illustrates its application, is available via the GitHub repository: https://github.com/ssehztirom/Exploring-generative-deep-learning-for-omics-data-by-using-log-linear-models.
Supplementary information: Supplementary data are available at Bioinformatics online.