Using AnABlast for intergenic sORF prediction in the Caenorhabditis elegans genome
Casimiro-Soriguer, C S; Rigual, M M; Brokate-Llanos, A M; Muñoz, M J; Garzón, A; Pérez-Pulido, A J; Jimenez, J
doi: 10.1093/bioinformatics/btaa608
pmid: 32614398
Motivation: Short bioactive peptides encoded by small open reading frames (sORFs) play important roles in eukaryotes. Bioinformatic prediction of ORFs is an early step in genome sequence analysis, but sORFs encoding short peptides, often using non-AUG initiation codons, are not easily discriminated from false ORFs occurring by chance.
Results: AnABlast is a computational tool designed to highlight putative protein-coding regions in genomic DNA sequences. This protein-coding finder is independent of ORF length and reading frame shifts, which makes AnABlast a potentially useful tool for predicting sORFs. Using this algorithm, we report the identification of 82 putative new intergenic sORFs in the Caenorhabditis elegans genome. Sequence similarity, motif presence, expression data and RNA interference experiments support the view that the identified sORFs likely encode functional peptides, encouraging the use of AnABlast as a new approach for the accurate prediction of intergenic sORFs in annotated eukaryotic genomes.
Availability and implementation: AnABlast is freely available at http://www.bioinfocabd.upo.es/ab/. The C. elegans genome browser with AnABlast results, annotated genes and all data used in this study is available at http://www.bioinfocabd.upo.es/celegans.
Supplementary information: Supplementary data are available at Bioinformatics online.
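To make the difficulty described in the Motivation concrete, the following is a minimal sketch of a naive ORF scan, not AnABlast's algorithm (which, per the abstract, is independent of ORF length and reading frame shifts). The function name `find_orfs` and its parameters are hypothetical, and only the three forward-strand frames are scanned. Lowering `min_codons` or allowing non-AUG start codons quickly increases the number of candidate ORFs, many of which arise by chance.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=10, starts=("ATG",)):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF runs from the first start codon seen in a frame to the next
    in-frame stop codon. min_codons counts codons including the start
    codon and excluding the stop codon. `end` is exclusive and includes
    the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs
```

Calling `find_orfs(seq, min_codons=10, starts=("ATG", "CTG"))` illustrates how admitting a non-AUG initiation codon widens the candidate set.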
Rapid epistatic mixed-model association studies by controlling multiple polygenic effects
Wang, Dan; Tang, Hui; Liu, Jian-Feng; Xu, Shizhong; Zhang, Qin; Ning, Chao
doi: 10.1093/bioinformatics/btaa610
pmid: 32614415
Summary: We have developed a rapid mixed-model algorithm for exhaustive genome-wide epistatic association analysis that controls for multiple polygenic effects. Our model can simultaneously handle additive-by-additive, dominance-by-dominance and additive-by-dominance epistasis, and can account for intrasubject fluctuations due to individuals with repeated records. Furthermore, we suggest a simple but efficient approximate algorithm that allows all pairwise interactions to be examined in time that scales linearly with population size. Simulation studies are performed to investigate the properties of REMMAX. Application to publicly available yeast and human data showed that our mixed-model-based method has computational efficiency similar to that of a simple linear model. The pairwise analysis of 5000 individuals genotyped with roughly 350 000 SNPs took less than 40 h with five threads on an Intel Xeon E5 2.6 GHz CPU.
Availability and implementation: Source code is freely available at https://github.com/chaoning/GMAT.
Supplementary information: Supplementary data are available at Bioinformatics online.
Overlap detection on long, error-prone sequencing reads via smooth q-gram
Song, Yan; Tang, Haixu; Zhang, Haoyu; Zhang, Qin
doi: 10.1093/bioinformatics/btaa252
pmid: 32311007
Motivation: Third-generation sequencing techniques, such as the Single Molecule Real Time technique from PacBio and the MinION technique from Oxford Nanopore, can generate long, error-prone sequencing reads, which pose new challenges for fragment assembly algorithms. In this paper, we study the overlap detection problem for error-prone reads, which is the first and most critical step in de novo fragment assembly. We observe that none of the state-of-the-art methods achieves ideal accuracy for overlap detection (their precision and recall are relatively low) because of the high sequencing error rates, especially when the overlaps between reads are relatively short (e.g. <2000 bases). This limitation appears inherent to these algorithms because they use q-gram-based seeds under the seed-extension framework.
Results: We propose smooth q-gram, a variant of q-gram that captures q-gram pairs within small edit distances, and design a novel algorithm for detecting overlapping reads using smooth q-gram-based seeds. We implemented the algorithm and tested it on both PacBio and Nanopore sequencing datasets. Our benchmarking results demonstrated that our algorithm outperforms the existing q-gram-based overlap detection algorithms, especially for reads with relatively short overlaps.
Availability and implementation: The source code of our implementation in C++ is available at https://github.com/FIGOGO/smoothq.
Supplementary information: Supplementary data are available at Bioinformatics online.
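The difference between exact q-gram seeds and seeds that tolerate a small edit distance can be illustrated with a brute-force sketch. This is not the authors' smooth q-gram algorithm, whose construction is not described in the abstract; it is a quadratic-time illustration under the assumption that a seed is any pair of q-grams within `max_dist` edits. The names `qgrams`, `edit_distance` and `seed_pairs` are hypothetical.

```python
def qgrams(read, q):
    """All length-q substrings of read, paired with their start positions."""
    return [(i, read[i:i + q]) for i in range(len(read) - q + 1)]

def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]

def seed_pairs(read_a, read_b, q, max_dist=0):
    """Position pairs (i, j) whose q-grams differ by at most max_dist edits.

    max_dist=0 reproduces exact q-gram matching; max_dist>0 also keeps
    pairs that a sequencing error would otherwise break.
    """
    return [(i, j)
            for i, ga in qgrams(read_a, q)
            for j, gb in qgrams(read_b, q)
            if edit_distance(ga, gb) <= max_dist]
```

With a single substitution inside a q-gram, `seed_pairs(a, b, q)` loses the seed while `seed_pairs(a, b, q, max_dist=1)` retains it, which is the effect the abstract attributes to error-prone reads with short overlaps.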
Coevolution-based prediction of protein–protein interactions in polyketide biosynthetic assembly lines
Wang, Yan; Correa Marrero, Miguel; Medema, Marnix H; van Dijk, Aalt D J
doi: 10.1093/bioinformatics/btaa595
pmid: 32592463
Motivation: Polyketide synthases (PKSs) are enzymes that generate diverse molecules of great pharmaceutical importance, including a range of clinically used antimicrobials and antitumor agents. Many polyketides are synthesized by cis-AT modular PKSs, which are organized in assembly lines in which multiple enzymes line up in a specific order. This order is defined by specific protein–protein interactions (PPIs). The unique modular structure and catalytic mechanism of these assembly lines make their products predictable and have also spurred combinatorial biosynthesis studies to produce novel polyketides using synthetic biology. However, predicting the interactions of PKSs, and thereby inferring the order of their assembly line, is still challenging, especially in cases where this order is not reflected by the ordering of the PKS-encoding genes in the genome.
Results: Here, we introduce PKSpop, which uses a coevolution-based PPI algorithm to infer protein order in PKS assembly lines. Our method accurately predicts protein orders (93% accuracy). Additionally, we identify new residue pairs that are key in determining interaction specificity, and show that coevolution of N- and C-terminal docking domains of PKSs is significantly more predictive for PPIs than coevolution between ketosynthase and acyl carrier protein domains.
Availability and implementation: The code is available at http://www.bif.wur.nl/ (under ‘Software’).
Supplementary information: Supplementary data are available at Bioinformatics online.
BnpC: Bayesian non-parametric clustering of single-cell mutation profiles
Borgsmüller, Nico; Bonet, Jose; Marass, Francesco; Gonzalez-Perez, Abel; Lopez-Bigas, Nuria; Beerenwinkel, Niko
doi: 10.1093/bioinformatics/btaa599
pmid: 32592465
Motivation: The high resolution of single-cell DNA sequencing (scDNA-seq) offers great potential to resolve intratumor heterogeneity (ITH) by distinguishing clonal populations based on their mutation profiles. However, the increasing size of scDNA-seq datasets and technical limitations, such as high error rates and a large proportion of missing values, complicate this task and limit the applicability of existing methods.
Results: Here, we introduce BnpC, a novel non-parametric method to cluster individual cells into clones and infer their genotypes based on their noisy mutation profiles. We benchmarked our method comprehensively against state-of-the-art methods on simulated data using various data sizes, and applied it to three cancer scDNA-seq datasets. On simulated data, BnpC compared favorably against current methods in terms of accuracy, runtime and scalability. Its inferred genotypes were the most accurate, especially on highly heterogeneous data, and it was the only method able to run and produce results on datasets with 5000 cells. On tumor scDNA-seq data, BnpC was able to identify clonal populations missed by the original cluster analysis but supported by supplementary experimental data. With ever-growing scDNA-seq datasets, scalable and accurate methods such as BnpC will become increasingly relevant, not only to resolve ITH but also as a preprocessing step to reduce data size.
Availability and implementation: BnpC is freely available under the MIT license at https://github.com/cbg-ethz/BnpC.
Supplementary information: Supplementary data are available at Bioinformatics online.
Simulation, power evaluation and sample size recommendation for single-cell RNA-seq
Su, Kenong; Wu, Zhijin; Wu, Hao
doi: 10.1093/bioinformatics/btaa607
pmid: 32614380
Motivation: Determining the sample size for adequate power to detect statistical significance is a crucial step at the design stage of high-throughput experiments. Even though a number of methods and tools are available for sample size calculation for microarray and RNA-seq in the context of differential expression (DE), this topic is understudied in the field of single-cell RNA sequencing (scRNA-seq). Moreover, the unique data characteristics of scRNA-seq, such as sparsity and heterogeneity, increase the challenge.
Results: We propose POWSC, a simulation-based method, to provide power evaluation and sample size recommendation for scRNA-seq DE analysis. POWSC consists of a data simulator that creates realistic expression data, and a power assessor that provides a comprehensive evaluation and visualization of the relationship between power and sample size. The data simulator in POWSC outperforms two other state-of-the-art simulators in capturing key characteristics of real datasets. The power assessor in POWSC provides a variety of power evaluations, including stratified and marginal power analyses for DEs characterized by two forms (phase transition or magnitude tuning), under different comparison scenarios. In addition, POWSC offers information for optimizing the tradeoff between sample size and sequencing depth for the same total number of reads.
Availability and implementation: POWSC is an open-source R package available online at https://github.com/suke18/POWSC.
Supplementary information: Supplementary data are available at Bioinformatics online.
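The general idea of simulation-based power evaluation can be sketched in a few lines. This is not POWSC's simulator or power assessor (POWSC is an R package whose models are not given in the abstract); it is a toy illustration assuming zero-inflated Gaussian expression values and a simple z-type test, and every name and default below (`simulate_cells`, `is_detected`, `estimate_power`, the baseline mean of 5.0) is hypothetical.

```python
import random
import statistics

def simulate_cells(n, mean, dropout, rng):
    """Draw n expression values: zero with probability dropout, else Gaussian."""
    return [0.0 if rng.random() < dropout else rng.gauss(mean, 1.0)
            for _ in range(n)]

def is_detected(a, b, z_crit=1.96):
    """Crude z-type test on the difference in group means (needs n >= 2)."""
    se = (statistics.variance(a) / len(a)
          + statistics.variance(b) / len(b)) ** 0.5
    if se == 0:
        return False  # no variation in either group, nothing to test
    return abs(statistics.mean(a) - statistics.mean(b)) / se > z_crit

def estimate_power(n_cells, effect, dropout=0.5, n_sim=200, seed=0):
    """Fraction of simulated experiments in which a true DE gene is detected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        group1 = simulate_cells(n_cells, 5.0, dropout, rng)
        group2 = simulate_cells(n_cells, 5.0 + effect, dropout, rng)
        hits += is_detected(group1, group2)
    return hits / n_sim
```

Evaluating `estimate_power` over a grid of `n_cells` values gives a power curve from which a sample size can be read off; varying `dropout` shows how sparsity affects the required number of cells.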
iPromoter-BnCNN: a novel branched CNN-based predictor for identifying and classifying sigma promoters
Amin, Ruhul; Rahman, Chowdhury Rafeed; Ahmed, Sajid; Sifat, Md Habibur Rahman; Liton, Md Nazmul Khan; Rahman, Md Moshiur; Khan, Md Zahid Hossain; Shatabda, Swakkhar
doi: 10.1093/bioinformatics/btaa609
pmid: 32614400
Motivation: A promoter is a short region of DNA that is responsible for initiating transcription of specific genes. Computational tools for the automatic identification of promoters are in high demand. Promoters can be of different types according to their function, and they may show both intra- and interclass variation and similarity in terms of consensus sequences. Accurate classification of the various types of sigma promoters remains a challenge.
Results: We present iPromoter-BnCNN for the identification and accurate classification of six types of promoters: σ24, σ28, σ32, σ38, σ54 and σ70. It is a CNN-based classifier that combines local features related to monomer nucleotide sequence, trimer nucleotide sequence, dimer structural properties and trimer structural properties through the use of parallel branching. We conducted experiments on a benchmark dataset and compared our classifier with six state-of-the-art tools, showing its superior performance under 5-fold cross-validation. Moreover, we tested our classifier on an independent test dataset.
Availability and implementation: The iPromoter-BnCNN web server is freely available at http://103.109.52.8/iPromoter-BnCNN. The runnable source code can be found at https://colab.research.google.com/drive/1yWWh7BXhsm8U4PODgPqlQRy23QGjF2DZ.
Supplementary information: Supplementary data are available at Bioinformatics online.
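As an illustration of the first two feature types named in the Results (monomer and trimer nucleotide sequence), the sketch below builds one-hot matrices that could feed two parallel branches of a network. The exact encoding used by iPromoter-BnCNN is not stated in the abstract, so this is an assumption; the function names are hypothetical, and the structural-property branches are not covered.

```python
from itertools import product

BASES = "ACGT"
# All 64 trimers in lexicographic order: AAA, AAC, ..., TTT.
TRIMERS = ["".join(p) for p in product(BASES, repeat=3)]
TRIMER_INDEX = {t: i for i, t in enumerate(TRIMERS)}

def one_hot_monomers(seq):
    """L x 4 matrix; a non-ACGT base yields an all-zero row."""
    return [[int(base == b) for b in BASES] for base in seq.upper()]

def one_hot_trimers(seq):
    """(L - 2) x 64 matrix over overlapping trimers.

    Trimers containing a non-ACGT base yield an all-zero row.
    """
    seq = seq.upper()
    rows = []
    for i in range(len(seq) - 2):
        row = [0] * len(TRIMERS)
        idx = TRIMER_INDEX.get(seq[i:i + 3])
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows
```

Each matrix would be the input to its own convolutional branch, with the branch outputs concatenated before classification; that wiring is described only at a high level in the abstract and is not reproduced here.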
Network analysis of synonymous codon usage
Newaz, Khalique; Wright, Gabriel; Piland, Jacob; Li, Jun; Clark, Patricia L; Emrich, Scott J; Milenković, Tijana
doi: 10.1093/bioinformatics/btaa603
pmid: 32609328
Motivation: Most amino acids are encoded by multiple synonymous codons, some of which are used more rarely than others. Analyses of the positions of such rare codons in protein sequences revealed that rare codons can impact co-translational protein folding and that the positions of some rare codons are evolutionarily conserved. Analyses of their positions in protein 3-dimensional structures, which are richer in biochemical information than sequences alone, might further explain the role of rare codons in protein folding.
Results: We model protein structures as networks and use network centrality to measure the structural position of an amino acid. We first validate that amino acids buried within the structural core are network-central, and those on the surface are not. Then, we study potential differences between the network centralities, and thus the structural positions, of amino acids encoded by conserved rare, non-conserved rare and commonly used codons. We find that in 84% of proteins, the three codon categories occupy significantly different structural positions. We examine protein groups showing different codon centrality trends, i.e. different relationships between the structural positions of the three codon categories. In several cases, all proteins in our data that share some structural or functional property fall in the same group. Conversely, in one case all proteins in a group share the same property. Our work shows that codon usage is linked to the final protein structure and thus possibly to co-translational protein folding.
Availability and implementation: https://nd.edu/~cone/CodonUsage/.
Supplementary information: Supplementary data are available at Bioinformatics online.
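The modelling step in the Results, a protein structure treated as a network with centrality as a proxy for structural position, can be sketched as follows. The abstract does not say how edges are defined or which centrality measures are used, so this sketch assumes one coordinate per residue, a hypothetical 8.0 distance cutoff (in the units of the coordinates) and plain degree centrality; the function names are likewise hypothetical.

```python
import math

def contact_network(coords, cutoff=8.0):
    """Adjacency sets: residues i and j are linked if closer than cutoff."""
    n = len(coords)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors

def degree_centrality(neighbors):
    """Fraction of the other residues that each residue is in contact with."""
    n = len(neighbors)
    if n < 2:
        return {i: 0.0 for i in neighbors}
    return {i: len(nb) / (n - 1) for i, nb in neighbors.items()}
```

A residue at the centre of a compact arrangement touches more neighbours and so scores higher than residues at the periphery, which is the buried-versus-surface contrast the abstract validates before comparing codon categories.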
Inference of gene regulatory networks based on nonlinear ordinary differential equations
Ma, Baoshan; Fang, Mingkun; Jiao, Xiangtian
doi: 10.1093/bioinformatics/btaa032
pmid: 31950997
Motivation: Gene regulatory networks (GRNs) capture the regulatory interactions between genes, resulting from the fundamental biological processes of transcription and translation. In some cases, the topology of a GRN is not known and has to be inferred from gene expression data. Most existing GRN reconstruction algorithms are applied to either time-series data or steady-state data. Although time-series data include more information about the system dynamics, steady-state data imply stability of the underlying regulatory networks.
Results: In this article, we propose a method for inferring GRNs from time-series and steady-state data jointly. We make use of a non-linear ordinary differential equations framework to model dynamic gene regulation and an importance measurement strategy to infer all putative regulatory links efficiently. The proposed method is evaluated extensively on the artificial DREAM4 dataset and two real gene expression datasets of yeast and Escherichia coli. On public benchmark datasets, the proposed method outperforms other popular inference algorithms in terms of overall score. A comparison of performance on datasets of different scales shows that our method remains robust and accurate at low computational complexity.
Availability and implementation: The proposed method is written in Python and is available at https://github.com/lab319/GRNs_nonlinear_ODEs.
Supplementary information: Supplementary data are available at Bioinformatics online.