Determining conserved metabolic biomarkers from a million database queriesKurczy, Michael E.; Ivanisevic, Julijana; Johnson, Caroline H.; Uritboonthai, Winnie; Hoang, Linh; Fang, Mingliang; Hicks, Matthew; Aldebot, Anthony; Rinehart, Duane; Mellander, Lisa J.; Tautenhahn, Ralf; Patti, Gary J.; Spilker, Mary E.; Benton, H. Paul; Siuzdak, Gary
doi: 10.1093/bioinformatics/btv475pmid: 26275895
Motivation: Metabolite databases provide a unique window into metabolome research allowing the most commonly searched biomarkers to be catalogued. Omic scale metabolite profiling, or metabolomics, is finding increased utility in biomarker discovery largely driven by improvements in analytical technologies and the concurrent developments in bioinformatics. However, the successful translation of biomarkers into clinical or biologically relevant indicators is limited.Results: With the aim of improving the discovery of translatable metabolite biomarkers, we present search analytics for over one million METLIN metabolite database queries. The most common metabolites found in METLIN were cross-correlated against XCMS Online, the widely used cloud-based data processing and pathway analysis platform. Analysis of the METLIN and XCMS common metabolite data has two primary implications: these metabolites, might indicate a conserved metabolic response to stressors and, this data may be used to gauge the relative uniqueness of potential biomarkers.Availability and implementation. METLIN can be accessed by logging on to: https://metlin.scripps.eduContact: [email protected] information: Supplementary data are available at Bioinformatics online.
TIPR: transcription initiation pattern recognition on a genome scaleMorton, Taj; Wong, Weng-Keen; Megraw, Molly
doi: 10.1093/bioinformatics/btv464pmid: 26254489
Motivation: The computational identification of gene transcription start sites (TSSs) can provide insights into the regulation and function of genes without performing expensive experiments, particularly in organisms with incomplete annotations. High-resolution general-purpose TSS prediction remains a challenging problem, with little recent progress on the identification and differentiation of TSSs which are arranged in different spatial patterns along the chromosome.Results: In this work, we present the Transcription Initiation Pattern Recognizer (TIPR), a sequence-based machine learning model that identifies TSSs with high accuracy and resolution for multiple spatial distribution patterns along the genome, including broadly distributed TSS patterns that have previously been difficult to characterize. TIPR predicts not only the locations of TSSs but also the expected spatial initiation pattern each TSS will form along the chromosome—a novel capability for TSS prediction algorithms. As spatial initiation patterns are associated with spatiotemporal expression patterns and gene function, this capability has the potential to improve gene annotations and our understanding of the regulation of transcription initiation. The high nucleotide resolution of this model locates TSSs within 10 nucleotides or less on average.Availability and implementation: Model source code is made available online at http://megraw.cgrb.oregonstate.edu/software/TIPR/.Contact: [email protected] information: Supplementary data are available at Bioinformatics online.
GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignmentsKosugi, Shunichi; Hirakawa, Hideki; Tabata, Satoshi
doi: 10.1093/bioinformatics/btv465pmid: 26261222
Motivation: Genome assemblies generated with next-generation sequencing (NGS) reads usually contain a number of gaps. Several tools have recently been developed to close the gaps in these assemblies with NGS reads. Although these gap-closing tools efficiently close the gaps, they entail a high rate of misassembly at gap-closing sites.Results: We have found that the assembly error rates caused by these tools are 20–500-fold higher than the rate of errors introduced into contigs by de novo assemblers. We here describe GMcloser, a tool that accurately closes these gaps with a preassembled contig set or a long read set (i.e. error-corrected PacBio reads). GMcloser uses likelihood-based classifiers calculated from the alignment statistics between scaffolds, contigs and paired-end reads to correctly assign contigs or long reads to gap regions of scaffolds, thereby achieving accurate and efficient gap closure. We demonstrate with sequencing data from various organisms that the gap-closing accuracy of GMcloser is 3–100-fold higher than those of other available tools, with similar efficiency.Availability and implementation: GMcloser and an accompanying tool (GMvalue) for evaluating the assembly and correcting misassemblies except SNPs and short indels in the assembly are available at https://sourceforge.net/projects/gmcloser/.Contact: [email protected] information: Supplementary data are available at Bioinformatics online.
Mutadelic: mutation analysis using description logic inferencing capabilitiesHolford, Matthew E.; Krauthammer, Michael
doi: 10.1093/bioinformatics/btv467pmid: 26272983
Motivation: As next generation sequencing gains a foothold in clinical genetics, there is a need for annotation tools to characterize increasing amounts of patient variant data for identifying clinically relevant mutations. While existing informatics tools provide efficient bulk variant annotations, they often generate excess information that may limit their scalability.Results: We propose an alternative solution based on description logic inferencing to generate workflows that produce only those annotations that will contribute to the interpretation of each variant. Workflows are dynamically generated using a novel abductive reasoning framework called a basic framework for abductive workflow generation (AbFab). Criteria for identifying disease-causing variants in Mendelian blood disorders were identified and implemented as AbFab services. A web application was built allowing users to run workflows generated from the criteria to analyze genomic variants. Significant variants are flagged and explanations provided for why they match or fail to match the criteria.Availability and implementation: The Mutadelic web application is available for use at http://krauthammerlab.med.yale.edu/mutadelic.Contact: [email protected] information: Supplementary data are available at Bioinformatics online.
An efficient algorithm for the extraction of HGVS variant descriptions from sequencesVis, Jonathan K.; Vermaat, Martijn; Taschner, Peter E. M.; Kok, Joost N.; Laros, Jeroen F. J.
doi: 10.1093/bioinformatics/btv443pmid: 26231427
Motivation: Unambiguous sequence variant descriptions are important in reporting the outcome of clinical diagnostic DNA tests. The standard nomenclature of the Human Genome Variation Society (HGVS) describes the observed variant sequence relative to a given reference sequence. We propose an efficient algorithm for the extraction of HGVS descriptions from two sequences with three main requirements in mind: minimizing the length of the resulting descriptions, minimizing the computation time and keeping the unambiguous descriptions biologically meaningful.Results: Our algorithm is able to compute the HGVS descriptions of complete chromosomes or other large DNA strings in a reasonable amount of computation time and its resulting descriptions are relatively small. Additional applications include updating of gene variant database contents and reference sequence liftovers.Availability: The algorithm is accessible as an experimental service in the Mutalyzer program suite (https://mutalyzer.nl). The C++ source code and Python interface are accessible at: https://github.com/mutalyzer/description-extractor.Contact: [email protected]
BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elementsDe Witte, Dieter; Van de Velde, Jan; Decap, Dries; Van Bel, Michiel; Audenaert, Pieter; Demeester, Piet; Dhoedt, Bart; Vandepoele, Klaas; Fostier, Jan
doi: 10.1093/bioinformatics/btv466pmid: 26254488
Motivation: The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected.Results: We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O.sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z.mays.Availability and implementation: BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspellerContact: [email protected] or [email protected] information: Supplementary data are available at Bioinformatics online.
LoopIng: a template-based tool for predicting the structure of protein loopsMessih, Mario Abdel; Lepore, Rosalba; Tramontano, Anna
doi: 10.1093/bioinformatics/btv438pmid: 26249814
Motivation: Predicting the structure of protein loops is very challenging, mainly because they are not necessarily subject to strong evolutionary pressure. This implies that, unlike the rest of the protein, standard homology modeling techniques are not very effective in modeling their structure. However, loops are often involved in protein function, hence inferring their structure is important for predicting protein structure as well as function.Results: We describe a method, LoopIng, based on the Random Forest automated learning technique, which, given a target loop, selects a structural template for it from a database of loop candidates. Compared to the most recently available methods, LoopIng is able to achieve similar accuracy for short loops (4–10 residues) and significant enhancements for long loops (11–20 residues). The quality of the predictions is robust to errors that unavoidably affect the stem regions when these are modeled. The method returns a confidence score for the predicted template loops and has the advantage of being very fast (on average: 1 min/loop).Availability and implementation: www.biocomputing.it/loopingContact: [email protected] information: Supplementary data are available at Bioinformatics online.
Accurate disulfide-bonding network predictions improve ab initio structure prediction of cysteine-rich proteinsYang, Jing; He, Bao-Ji; Jang, Richard; Zhang, Yang; Shen, Hong-Bin
doi: 10.1093/bioinformatics/btv459pmid: 26254435
Motivation: Cysteine-rich proteins cover many important families in nature but there are currently no methods specifically designed for modeling the structure of these proteins. The accuracy of disulfide connectivity pattern prediction, particularly for the proteins of higher-order connections, e.g. >3 bonds, is too low to effectively assist structure assembly simulations.Results: We propose a new hierarchical order reduction protocol called Cyscon for disulfide-bonding prediction. The most confident disulfide bonds are first identified and bonding prediction is then focused on the remaining cysteine residues based on SVR training. Compared with purely machine learning-based approaches, Cyscon improved the average accuracy of connectivity pattern prediction by 21.9%. For proteins with more than 5 disulfide bonds, Cyscon improved the accuracy by 585% on the benchmark set of PDBCYS. When applied to 158 non-redundant cysteine-rich proteins, Cyscon predictions helped increase (or decrease) the TM-score (or RMSD) of the ab initio QUARK modeling by 12.1% (or 14.4%). This result demonstrates a new avenue to improve the ab initio structure modeling for cysteine-rich proteins.Availability and implementation: http://www.csbio.sjtu.edu.cn/bioinf/Cyscon/Contact: [email protected] or [email protected] information: Supplementary data are available at Bioinformatics online.
Improving protein fold recognition with hybrid profiles combining sequence and structure evolutionGhouzam, Yassine; Postic, Guillaume; de Brevern, Alexandre G.; Gelly, Jean-Christophe
doi: 10.1093/bioinformatics/btv462pmid: 26254434
Motivation: Template-based modeling, the most successful approach for predicting protein 3D structure, often requires detecting distant evolutionary relationships between the target sequence and proteins of known structure. Developed for this purpose, fold recognition methods use elaborate strategies to exploit evolutionary information, mainly by encoding amino acid sequence into profiles. Since protein structure is more conserved than sequence, the inclusion of structural information can improve the detection of remote homology.Results: Here, we present ORION, a new fold recognition method based on the pairwise comparison of hybrid profiles that contain evolutionary information from both protein sequence and structure. Our method uses the 16-state structural alphabet Protein Blocks, which provides an accurate 1D description of protein structure local conformations. ORION systematically outperforms PSI-BLAST and HHsearch on several benchmarks, including target sequences from the modeling competitions CASP8, 9 and 10, and detects ∼10% more templates at fold and superfamily SCOP levels.Availability: Software freely available for download at http://www.dsimb.inserm.fr/orion/.Contact: [email protected] information: Supplementary data are available at Bioinformatics online.
PBAP: a pipeline for file processing and quality control of pedigree data with dense genetic markersNato, Alejandro Q.; Chapman, Nicola H.; Sohi, Harkirat K.; Nguyen, Hiep D.; Brkanac, Zoran; Wijsman, Ellen M.
doi: 10.1093/bioinformatics/btv444pmid: 26231429
Motivation: Huge genetic datasets with dense marker panels are now common. With the availability of sequence data and recognition of importance of rare variants, smaller studies based on pedigrees are again also common. Pedigree-based samples often start with a dense marker panel, a subset of which may be used for linkage analysis to reduce computational burden and to limit linkage disequilibrium between single-nucleotide polymorphisms (SNPs). Programs attempting to select markers for linkage panels exist but lack flexibility.Results: We developed a pedigree-based analysis pipeline (PBAP) suite of programs geared towards SNPs and sequence data. PBAP performs quality control, marker selection and file preparation. PBAP sets up files for MORGAN, which can handle analyses for small and large pedigrees, typically human, and results can be used with other programs and for downstream analyses. We evaluate and illustrate its features with two real datasets.Availability and implementation: PBAP scripts may be downloaded from http://faculty.washington.edu/wijsman/software.shtml.Contact: [email protected] information: Supplementary data are available at Bioinformatics online.