Knime4Bio: a set of custom nodes for the interpretation of next-generation sequencing data with KNIME†

BIOINFORMATICS APPLICATIONS NOTE, Vol. 27 no. 22 2011, pages 3200–3201
doi:10.1093/bioinformatics/btr554
Genome analysis
Advance Access publication October 7, 2011

Pierre Lindenbaum (1), Solena Le Scouarnec (2), Vincent Portero (1) and Richard Redon (1,*)
(1) Institut du thorax, Inserm UMR 915, Centre Hospitalier Universitaire de Nantes, 44000 Nantes, France and (2) The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK
Associate Editor: John Quackenbush

(*) To whom correspondence should be addressed.
(†) During the reviewing process of this article another solution based on KNIME but focusing on FASTQ data files was published by Jagla et al. (Jagla et al., 2011).

ABSTRACT

Summary: Analysing large amounts of data generated by next-generation sequencing (NGS) technologies is difficult for researchers or clinicians without computational skills. They are often compelled to delegate this task to computer biologists working with command line utilities. The availability of easy-to-use tools will become essential with the generalization of NGS in research and diagnosis. It will enable investigators to handle much more of the analysis. Here, we describe Knime4Bio, a set of custom nodes for the KNIME (The Konstanz Information Miner) interactive graphical workbench, for the interpretation of large biological datasets. We demonstrate that this tool can be utilized to quickly retrieve previously published scientific findings.
Availability: http://code.google.com/p/knime4bio/.
Contact: [email protected]

Received on August 11, 2011; revised on September 13, 2011; accepted on September 29, 2011

1 INTRODUCTION

Next-generation sequencing (NGS) technologies have led to an explosion of the amount of data to be analysed. As an example, a VCF (Danecek et al., 2011) file (Variant Call Format, a standard specification for storing genomic variations in a text file) produced by the 1000 Genomes Project contains about 25 million Single Nucleotide Variants (SNV) [http://tinyurl.com/ALL2of4intersection (retrieved September 2011)], making it difficult to extract relevant information using spreadsheet programs. While computer biologists are used to invoking common command line tools, such as Perl and R, when analysing those data through Unix pipelines, scientific investigators generally lack the technical skills necessary to handle these tools and need to delegate data manipulation to a third party.

Scientific workflow and data integration platforms aim to make those tasks more accessible to those research scientists. These tools are modular environments enabling an easy visual assembly and an interactive execution of an analysis pipeline (typically a directed graph) where a node defines a task to be executed on input data and an edge between two nodes represents a data flow. These applications provide an intuitive framework that can be used by the scientists themselves for building complex analyses. They allow data reproducibility and workflow sharing.

Galaxy (Blankenberg et al., 2011), Cyrille2 (Fiers et al., 2008) and Mobyle (Néron et al., 2009) are three web-based workflow engines that users have to install locally if computational needs on datasets are very large, or if absolute security is required. Alternatively, software such as the KNIME (Berthold et al., 2007) workbench or Taverna (Hull et al., 2006) runs on the users' desktop and can interact with local resources. Taverna focuses on web services and may require a large number of nodes even for a simple task. In contrast, KNIME provides the ability to modify the nodes without having to re-run the whole analysis. We chose KNIME to develop Knime4Bio, a set of new nodes mostly dedicated to the filtering and manipulation of VCF files. Although many standard nodes provided by KNIME can be used to perform such analysis, our nodes add new functionalities, some of which are described below.

2 IMPLEMENTATION

The Java API for KNIME was used to write the new nodes, which were deployed and documented using dedicated XML descriptors. A typical workflow for analysing exome sequencing data starts by loading VCF files into the working environment. The data contained in the INFO or the SAMPLE columns are extracted and the next task consists in annotating SNVs and/or indels. One node predicts the consequence of variations at the transcript/protein level. For each variant, genomic sequences of overlapping transcripts are retrieved from the UCSC knownGene database (Hsu et al., 2006) to identify variants leading to premature stop codons, non-synonymous variants and variants likely to affect splicing.

Some nodes have been designed to find the intersection between the variants in the VCF file and various sources of annotated genomic regions, which can be: a local BED file, a remote URL, a MySQL table, a file indexed with tabix (Li, 2011), a BigBed or a BigWig file (Kent et al., 2010). Other nodes are able to incorporate data from other databases: dbNSFP (Liu et al., 2011), dbSNP, Entrez Gene, PubMed, the EMBL STRING database, Uniprot, Reactome and GeneOntology (von Mering et al., 2007), MediaWiki, or to export the data to SIFT (Ng and Henikoff, 2001), Polyphen2 (Adzhubei et al., 2010), BED or MediaWiki formats. After being annotated, some SNVs (e.g. intronic) can be excluded from the dataset and the remaining data are rearranged by grouping the variants per sample or per gene as a pivot table. Some visualization tools have also been implemented: the Picard API (Li et al., 2009) or the IGV browser (Robinson et al., 2011) can be used to visualize the short reads overlapping a variation.

Fig. 1. Screenshot of a Knime4Bio workflow for the NOTCH2 analysis.

As a proof of concept, we tested our nodes to analyse the exomes of six patients from a previously published study (Isidor et al., 2011) related to the Hajdu Cheney syndrome (Fig. 1). For this purpose, short reads were mapped to the human genome reference sequence using BWA (Li and Durbin, 2010) and variants were called using SAMtools mpileup (Li et al., 2009). Homozygous variants, known SNPs (from dbSNP) and poor-quality variants were discarded, and only non-synonymous variants and variants introducing premature stop codons were considered. On a RedHat server (64 bits, 4 processors, 2 GB of RAM), our KNIME pipeline generated a list of six genes in 45 min: CELSR1, COL4A2, MAGEF1, MYO15A, ZNF341 and, more importantly, NOTCH2, the expected candidate gene.

3 DISCUSSION

In practical terms, a computer biologist was close to our users to help them with the construction of a workflow. After this short tutorial, they were able to quickly play with the interface, add some nodes and modify the parameters without any further assistance, except for the suggestion or the configuration of some specific nodes (for example, those that require a snippet of Java code). At the time of writing, Knime4Bio contains 55 new nodes. We believe Knime4Bio is an efficient interactive tool for NGS analysis.

ACKNOWLEDGEMENTS

We want to thank the Biostar community for its help, Jim Robinson and his team for the BigWig Java API, and Dr Cedric Le Caignec for the NOTCH2 data. The workflow was posted on myexperiment.org at: www.myexperiment.org/workflows/2320.

Funding: Inserm, the 'Centre Hospitalier Universitaire' of Nantes; the 'Fédération Française de Cardiologie' (FFC); 'Fondation pour la Recherche Médicale' (FRM). Solena Le Scouarnec is supported by the Wellcome Trust (grant no. WT077008).

Conflict of Interest: none declared.

REFERENCES

Adzhubei,I.A. et al. (2010) A method and server for predicting damaging missense mutations. Nat. Methods, 7, 248–249.
Berthold,M.R. et al. (2007) Knime: the konstanz information miner. In Preisach,C. et al. (eds) GfKl, Studies in Classification, Data Analysis, and Knowledge Organization, Springer, pp. 319–326.
Blankenberg,D. et al. (2011) Integrating diverse databases into an unified analysis framework: a Galaxy approach. Database, 2011, bar011.
Danecek,P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
Fiers,M.W.E.J. et al. (2008) High-throughput bioinformatics with the Cyrille2 pipeline system. BMC Bioinformatics, 9, 96.
Hsu,F. et al. (2006) The UCSC Known Genes. Bioinformatics, 22, 1036–1046.
Hull,D. et al. (2006) Taverna: a tool for building and running workflows of services. Nucleic Acids Res., 34, W729–W732.
Isidor,B. et al. (2011) Truncating mutations in the last exon of NOTCH2 cause a rare skeletal disorder with osteoporosis. Nat. Genet., 43, 306–308.
Jagla,B. et al. (2011) Extending KNIME for next generation sequencing data analysis. Bioinformatics, 27, 2907–2909.
Kent,W.J. et al. (2010) BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics, 26, 2204–2207.
Liu,X. et al. (2011) dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum. Mutat., 32, 894–899.
Li,H. and Durbin,R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589–595.
Li,H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079.
Li,H. (2011) Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics, 27, 718–719.
Néron,B. et al. (2009) Mobyle: a new full web bioinformatics framework. Bioinformatics, 25, 3005–3011.
Ng,P.C. and Henikoff,S. (2001) Predicting deleterious amino acid substitutions. Genome Res., 11, 863–874.
Robinson,J.T. et al. (2011) Integrative genomics viewer. Nat. Biotechnol., 29, 24–26.
von Mering,C. et al. (2007) STRING 7–recent developments in the integration and prediction of protein interactions. Nucleic Acids Res., 35, D358–D362.

© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
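The proof-of-concept filtering described in the Implementation section (discard known SNPs, poor-quality calls and homozygous genotypes, keep only non-synonymous and premature-stop variants, then group the result per gene) can be illustrated outside KNIME with a minimal Python sketch. This is not the Knime4Bio code, which is written against the KNIME Java API. The `GENE` and `EFFECT` INFO keys, the quality threshold of 30, the per-sample reading of "homozygous" and the sample data are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical annotated VCF excerpt. The GENE and EFFECT INFO keys are
# illustrative assumptions, not fields defined by Knime4Bio or the VCF spec.
VCF_TEXT = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP1\tP2
1\t1000\t.\tC\tT\t90\tPASS\tGENE=GENE_A;EFFECT=stop_gained\tGT\t0/1\t0/0
1\t2000\trs123\tG\tA\t95\tPASS\tGENE=GENE_A;EFFECT=non_synonymous\tGT\t0/1\t0/1
2\t3000\t.\tA\tG\t10\tPASS\tGENE=GENE_B;EFFECT=non_synonymous\tGT\t0/1\t0/0
2\t4000\t.\tT\tC\t80\tPASS\tGENE=GENE_B;EFFECT=intronic\tGT\t0/1\t0/1
3\t5000\t.\tG\tC\t85\tPASS\tGENE=GENE_C;EFFECT=non_synonymous\tGT\t1/1\t0/1
"""

KEPT_EFFECTS = {"non_synonymous", "stop_gained"}
MIN_QUAL = 30.0  # assumed threshold; the article does not state one


def parse_info(info):
    """Turn 'K1=V1;K2=V2;FLAG' into a dict (flags map to True)."""
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def is_heterozygous(genotype):
    """True for calls such as 0/1 or 1|0; False for 0/0, 1/1 or missing."""
    # GT is the first colon-separated sub-field of a sample column.
    alleles = genotype.split(":")[0].replace("|", "/").split("/")
    return "." not in alleles and len(set(alleles)) > 1


def candidate_genes(vcf_text):
    """Return {gene: set of samples carrying a retained variant}."""
    samples = []
    per_gene = defaultdict(set)
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        var_id, qual, info = fields[2], fields[5], parse_info(fields[7])
        if var_id != ".":  # already has an identifier, e.g. a dbSNP rs number
            continue
        if qual == "." or float(qual) < MIN_QUAL:  # poor-quality call
            continue
        if info.get("EFFECT") not in KEPT_EFFECTS:
            continue
        for sample, call in zip(samples, fields[9:]):
            if is_heterozygous(call):
                per_gene[info.get("GENE", "unknown")].add(sample)
    return dict(per_gene)


if __name__ == "__main__":
    for gene, carriers in sorted(candidate_genes(VCF_TEXT).items()):
        print(gene, sorted(carriers))
```

On the made-up input above, only the first and last records survive the filters, so the script prints `GENE_A ['P1']` and `GENE_C ['P2']`. The per-gene dictionary plays the role of the pivot table mentioned in the article, with one row per gene and the carrying samples as values.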

Bioinformatics , Volume 27 (22) – Oct 7, 2011


Publisher
Pubmed Central
Copyright
© The Author(s) 2011. Published by Oxford University Press.
ISSN
1367-4803
eISSN
1367-4811
DOI
10.1093/bioinformatics/btr554


Journal

Bioinformatics, Pubmed Central

Published: Oct 7, 2011
