Extending KNIME for next-generation sequencing data analysis

BIOINFORMATICS APPLICATIONS NOTE, Vol. 27 no. 20 2011, pages 2907–2909
doi:10.1093/bioinformatics/btr478
Sequence analysis
Advance Access publication August 27, 2011

Bernd Jagla [1,∗], Bernd Wiswedel [2] and Jean-Yves Coppée [1]

[1] Departement Génomes et Génétique, Institut Pasteur, Plate-forme Transcriptome et Epigénome, 25 Rue du Docteur Roux, F-75015 Paris, France and [2] KNIME.com AG, Technoparkstr. 1, 8005 Zürich, Switzerland

Associate Editor: Martin Bishop

∗ To whom correspondence should be addressed.

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

ABSTRACT

Summary: KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis and exploration platform. We present here new functionality and workflows that open the door to performing next-generation sequencing analysis using the KNIME framework.

Availability: All sources and compiled code are available via the KNIME update mechanism. Example workflows and descriptions are available through http://tech.knime.org/community/next-generation-sequencing.

Contact: [email protected]

Supplementary Information: Supplementary data are available at Bioinformatics online.

Received on May 16, 2011; revised on July 29, 2011; accepted on August 2, 2011

1 INTRODUCTION

KNIME (Konstanz Information Miner; Berthold et al., 2008) distinguishes itself from other workflow management systems like Mobyle (Néron et al., 2010), Galaxy (Goecks et al., 2010), Taverna (Hull et al., 2006), Kepler (Ludäscher et al., 2006), geWorkbench (Floratos et al., 2010), Conveyor (Linke et al., 2011) and many others by not being a domain-specific solution, but an integration backbone with strong data preprocessing and data analytics capabilities. It is mainly used in the customer relationship management and financial sector and, through a list of commercial and non-commercial vendors, in the cheminformatics area. (See participation of recent KNIME conferences.) The focus has been on providing functionality for building professional reports, statistical analysis, cheminformatics and, very recently, high-throughput/high-content (HCS/HCA) image analysis and scripting using languages such as Perl, Python, Matlab, R and Java.

Here, we introduce new nodes that allow next-generation sequencing (NGS) data analysis to be performed using KNIME. These nodes take advantage of some of KNIME's general features including memory management, allowing the handling of billions of rows on a standard desktop computer with only about 4 GB of RAM. The workflows can be executed from the command line, where all variables can be manipulated if desired. This enables the administrator to easily incorporate workflows in web-based tools such as Mobyle or Galaxy. KNIME is based on Eclipse and JAVA 1.6; the workflows are stored in plain-text XML files, can be executed on basically any modern operating system, and can easily be exchanged with or without data.

2 DESIGN AND IMPLEMENTATION

The current version (2.3.4) of KNIME is based on JAVA 1.6 and Eclipse 3.6.2. The functionality presented here follows the general guidelines for implementing nodes within the KNIME framework and augments the KNIME workflow management system with specific nodes for the correct handling of NGS data. Detailed information and examples are available through the KNIME web site (http://tech.knime.org/community/next-generation-sequencing). There, we have also posted a collection of workflows with extensive descriptions and use cases. The purpose of these workflows is to provide new users with some examples of data handling and provide a good starting point for fast data generation. For more complicated workflows, we encourage the community to use the myExperiment web site (Goble et al., 2010) (http://www.myexperiment.org/search?query=KNIME).

In the following description, we use italic font to indicate names of nodes. The first set of nodes that we have released contains: FastQReader, FastQWriter, SAMReader, AdapterRemovalAdv, CountSorted, OneString, GetRegions, PositionStr2Position, RegionOverlapp, Seq2PosIncidents, Bash, CmdWInput, BEDGraphWriter and JoinSorted. Other nodes mentioned in the text below have been developed by KNIME developers and other community contributors.

There are nodes specific to NGS-related file types, such as for reading and writing (compressed and uncompressed) FastQ files (FastQReader, FastQWriter), reading SAM/BAM files (SAMReader) and writing BED files (BEDGraphWriter). (Reading BED files and writing SAM files can already be accomplished with standard nodes.) NGS-specific tasks that can be executed through the KNIME environment include adapter removal and working with regions of interest (ROIs). In this context, we define an ROI as successive nucleotides that have a common property within a reference sequence, such as annotations and sequencing reads mapping to short regions.

One node that is related to ROIs is GetRegions, which identifies successive nucleotides with counts greater than zero in a sorted table where each row represents a position defined by a string identifier for the chromosome and an integer describing the position on the chromosome. This can be used to analyze pile-up files that hold information about how many reads align to a given sequence position. Other nodes such as CountSorted, PositionStr2Position, Seq2PosIncidents and OneString represent functionality with improved performance or tasks that otherwise would have taken more than one node to realize with out-of-the-box functionality. Detailed descriptions of these and all other nodes can be found in the help section within the KNIME environment or on the KNIME community pages.

NGS reads that have already been mapped to a reference genome can be of varying lengths, and counts per sequence position cannot be obtained in a straightforward manner using general purpose tools such as KNIME. A naive way to deal with this problem is to convert individual sequences of length n into n individual entries, one per sequence position (Seq2PosIncidents). This list can then be sorted by chromosomal position (Sorter) and counts per position can be calculated (CountSorted). This is equivalent to the pileup file generated by samtools (Li et al., 2009). From these counts we can now identify ROIs, where we combine successive positions with counts greater than zero into a single entry (ROI) in a table (GetRegions). Those ROIs can be compared with annotation that is read in from, e.g. a GFF-formatted file (File-Reader). This functionality resembles, e.g. intersectBED from BEDTools (Quinlan and Hall, 2010).

When developing workflows it is usually impractical to work with a full dataset that might comprise many millions of reads. Thus, sample files have to be created, sometimes for each project. The burden of maintaining these files can be avoided when using KNIME. Almost all nodes that function as entry points in KNIME, i.e. read in data, can limit the number of rows they are reading. Thus, a workflow can be developed on a small subset of the data and, once the workflow is validated, it can be launched on a different, more powerful machine (through the export/import mechanism). This is also true for the FastQReader and SAMReader nodes.

Thus, KNIME now enables the user to generate workflows for a wide range of tasks in the field of NGS analysis. For example, we can now create workflows that read in a FASTQ file from a sequencing machine (FastQReader), remove adapters (AdapterRemovalAdv), select sequences by length constraints (JavaSnippet, Row Filter), write out a FASTQ file (FastQWriter), execute an alignment program (Bash), read in the resulting data from a SAM/BAM file (SAMReader), select sequences that map to a unique position on the reference, create a pile-up, select and analyze mutations, identify successive ROIs, create counts per gene (see above), sort counts (Sorter), communicate the results to R and then perform additional analysis and graphs in R or Matlab (see Fig. 1 and Supplementary Materials for a more in-depth description of this workflow). This workflow can be exported with or without data and shared with the community (http://www.myexperiment.org/workflows/2183.html). It can be integrated into Galaxy or Mobyle to allow others to use it without installing KNIME (see Supplementary Material for details). This gives just a small glimpse of the powerful features of this working environment.

Fig. 1. KNIME sample workflow for NGS-related data analysis. Different work areas are distinguished by background color: data cleansing and alignment to the reference genome (dark blue); data preparation of aligned reads (white); creating BED files (light blue); mutation analysis (pink); ROI analysis (green); and identification of uniquely aligning sequences and integration with R (orange).

It is not in the scope of this application note to compare KNIME with other tools such as Galaxy. Here, we are merely opening the door for KNIME to be used in the field of NGS. Thus, we would like to briefly discuss some of the points that helped us with this rather subjective decision. KNIME is more visually intuitive and enables us to better understand the flow of data. Some of the complexity can be hidden in 'meta-nodes' or sub-workflows; it is possible, for example, to construct loops for iterating through lists of files; together with if/else statements, the basic elements of programming are available; missing basic functionality can usually be prototyped using the Java/Python/Perl/R snippets that allow the user to employ their favorite programming language. Those things are not possible at the moment in web-based tools such as Galaxy. A comparison with Taverna and other desktop-oriented workflow management tools is much more complex, since it is mostly not the list of functionalities that makes the difference between being accepted by the user or not, but rather an 'intuitive feeling' that is gained when installing and first launching the program. Missing functionality can most of the time be easily developed, and in none of the tools that we have seen so far are all the things needed (or thought to be needed) readily available. Thus, we have decided on using KNIME because we believe in its potential and user-friendliness, which are very specific to our own scenario. The availability of cheminformatics, statistical and image processing algorithms and of features for visualizing and interacting with data was also relevant. We hope that opening the door to KNIME will create further interest within the NGS community.

To further close the remaining gaps, we are actively working on enhancing KNIME by developing, among others, a Distributed Annotation System (DAS) client; handling of protein and nucleotide sequence data; interaction with GBrowse (http://gmod.org/wiki/GBrowse), the Integrative Genomics Viewer (IGV) (Robinson et al., 2011) and the University of California Santa Cruz (UCSC) genome browser (Kent et al., 2002); as well as parallelization for multi-core computers and cluster environments.

ACKNOWLEDGEMENTS

We thank Odile Sismeiro and Caroline Proux for helpful discussions and testing of the workflows. We thank Dr Kenneth Smith for critical reading of the manuscript. We also thank the KNIME team and community for timely and helpful support.

Conflict of Interest: none declared.

REFERENCES

Berthold,M.R. et al. (2008) KNIME: The Konstanz Information Miner. In Preisach,C. et al. (eds) Data Analysis, Machine Learning and Applications: Studies in Classification, Data Analysis, and Knowledge Organization, Vol. V, pp. 319–326.
Floratos,A. et al. (2010) geWorkbench: an open source platform for integrative genomics. Bioinformatics, 26, 1779–1780.
Goble,C.A. et al. (2010) myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Res., 38 (Suppl. 2), W677–W682.
Goecks,J. et al. (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol., 11, R86.
Hull,D. et al. (2006) Taverna: a tool for building and running workflows of services. Nucleic Acids Res., 24, 729–732.
Kent,W.J. et al. (2002) The human genome browser at UCSC. Genome Res., 12, 996–1006.
Li,H. et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079.
Linke,B. et al. (2011) Conveyor: a workflow engine for bioinformatic analyses. Bioinformatics, 27, 903–911.
Ludäscher,B. et al. (2006) Scientific workflow management and the Kepler system. Grid systems. Concurr. Comput. Pract. Exp., 18, 1039–1065.
Néron,B. et al. (2010) Mobyle: a new full web bioinformatics framework. Bioinformatics, 25, 3005–3011.
Quinlan,A.R. and Hall,I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842.
Robinson,J.T. et al. (2011) Integrative genomics viewer. Nat. Biotechnol., 29, 24–26.
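
Illustrative sketch of the pile-up and ROI steps. The chain described in Section 2 (expand each aligned read into one entry per position, sort, count per position, then merge successive covered positions into regions) can be sketched outside KNIME as follows. This is a minimal illustration in Python and not the implementation of the KNIME nodes: the function names only mirror the node names Seq2PosIncidents, CountSorted and GetRegions, and the read representation (chromosome, start, length) as well as the inclusive region ends are assumptions made for this example.

```python
from itertools import groupby


def seq_to_pos_incidents(reads):
    """Expand each read (chrom, start, length) into one (chrom, pos) entry per covered base.

    Hypothetical stand-in for the Seq2PosIncidents step.
    """
    for chrom, start, length in reads:
        for pos in range(start, start + length):
            yield chrom, pos


def count_sorted(sorted_positions):
    """Count identical neighbouring entries of an already sorted sequence.

    Hypothetical stand-in for the CountSorted step; because the input is
    sorted, a single pass suffices and no hash table is needed.
    """
    return [(key, sum(1 for _ in group)) for key, group in groupby(sorted_positions)]


def get_regions(counts):
    """Merge successive positions with count > 0 into (chrom, start, end) regions.

    Hypothetical stand-in for the GetRegions step; 'end' is inclusive.
    """
    regions = []
    for (chrom, pos), n in counts:
        if n <= 0:
            continue
        last = regions[-1] if regions else None
        if last and last[0] == chrom and last[2] == pos - 1:
            last[2] = pos  # extend the current region by one base
        else:
            regions.append([chrom, pos, pos])  # start a new region
    return [tuple(region) for region in regions]


# Toy input (assumed data): two overlapping reads, one isolated read, one read on chr2.
reads = [("chr1", 100, 5), ("chr1", 103, 4), ("chr1", 200, 3), ("chr2", 10, 2)]

positions = sorted(seq_to_pos_incidents(reads))  # corresponds to the Sorter step
counts = count_sorted(positions)                 # per-position coverage (pile-up)
regions = get_regions(counts)                    # regions of interest

for region in regions:
    print(region)
```

Note that the plain `sorted` call orders chromosome names lexicographically; any ordering works for this sketch as long as the counting and merging steps see the same sorted order.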

Bioinformatics, Volume 27 (20): 3 – Aug 27, 2011

Publisher
Oxford University Press
Copyright
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btr478
pmid
21873641


Journal

Bioinformatics, Oxford University Press

Published: Aug 27, 2011

There are no references for this article.