Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research

Vol. 21 no. 18 2005, pages 3674–3676
BIOINFORMATICS APPLICATIONS NOTE
doi:10.1093/bioinformatics/bti610

Sequence analysis

Ana Conesa (1,∗,†), Stefan Götz (2,†), Juan Miguel García-Gómez (2), Javier Terol (1), Manuel Talón (1) and Montserrat Robles (2)

(1) Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias, Moncada, Valencia, Spain, and (2) BET-ITACA, Universidad Politécnica de Valencia, Valencia, Spain

∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Received on June 27, 2005; revised on July 28, 2005; accepted on July 29, 2005
Advance Access publication August 4, 2005

ABSTRACT
Summary: We present here Blast2GO (B2G), a research tool designed with the main purpose of enabling Gene Ontology (GO) based data mining on sequence data for which no GO annotation is yet available. B2G joins in one application GO annotation based on similarity searches with statistical analysis and highlighted visualization on directed acyclic graphs. This tool offers a suitable platform for functional genomics research in non-model species. B2G is an intuitive and interactive desktop application that allows monitoring and comprehension of the whole annotation and analysis process.
Availability: Blast2GO is freely available via Java Web Start at http://www.blast2go.de
Supplementary material: http://www.blast2go.de -> Evaluation
Contact: aconesa@ivia.es; stefang@fis.upv.es

INTRODUCTION
One of the most important aspects in mining genomics data is to associate individual sequences and related expression information with biological function. Automatic functional annotation is an effective approach to solve this problem. Functional annotation allows categorization of genes in functional classes, which can be very useful to understand the physiological meaning of large amounts of genes and to assess functional differences between subgroups of sequences. The Gene Ontology (GO) developed at the GO Consortium (Ashburner et al., 2000) provides a suitable framework for this kind of analysis, due to the wide scope of biology covered and its directed acyclic graph (DAG) structure that enables visualization in the context of biological dependences.

Different development teams have released software to analyze sequences by the use of GO. A variety of desktop and web applications are available to electronically assign GO terms to unknown sequences based on similarity (Martin et al., 2004; Groth et al., 2004; Khan et al., 2003; Zehetner, 2003) or to analyze genomic data in the context of gene annotation (Al-Shahrour et al., 2004; Doniger et al., 2003). However, when trying to perform GO-based analysis in poorly characterized organisms we encountered a number of drawbacks. In general, these tools are either not designed for high-throughput sequence annotation, are limited in their mining and visualization capabilities, or accept only gene or probe identifiers as input data, restricting them to annotated sequences already deposited in public databases.

In order to provide a suitable solution to these limitations we have developed Blast2GO (B2G), a universal GO annotation, visualization and statistics framework that brings advanced functional analysis to the genomics research of non-model species. B2G has been designed to (1) allow automatic and high-throughput sequence annotation and (2) integrate functionality for annotation-based data mining. Briefly, B2G uses BLAST (Altschul et al., 1990) to find homologs to FASTA-formatted input sequences. The program extracts GO terms for each obtained hit by mapping to existing annotation associations. An annotation rule finally assigns GO terms to the query sequence. Annotation and functional analysis can be visualized in a graph form reconstructing the GO relationships and color-highlighting the most relevant areas (Fig. 1).

B2G was conceived to be an attractive tool for research environments where genetic and/or computational resources are limited and where much work is still done in an explorative fashion. B2G is a user-friendly, easy to distribute and low-maintenance tool. It allows monitoring and interaction at different steps of the analysis, and emphasizes visualization as an important component of knowledge acquisition. B2G is a Java application made available by Java Web Start. It is platform independent and has no further requirements than an Internet connection.

Fig. 1. Application overview. The figure shows schematically a typical run of B2G. Used symbols are described in the embedded legend. Numbered circles denote the major application steps. From left to right these are (1) Blasting: a group of selected sequences is blasted against either the NCBI or custom databases, (2) Mapping: GO terms are mapped on the blast results using annotation files provided by the GO Consortium that are downloaded on a monthly basis at the Blast2GO server, (3) Annotation: sequences are annotated using an annotation rule that takes parameters provided by the user, (4) Statistical analysis: optionally, analysis of GO term distribution differences between groups of sequences can be performed and (5) Visualization: annotation and statistics results can be visualized on the GO DAG. At each of these steps, different charts are available to evaluate the progress of the analysis and data can be saved and exported in different formats.

OBTAINING GO TERMS
The first step in B2G is to find sequences similar to a query set by Blast searching. Homology search can either be done at public databases (e.g. NCBI nr and est using QBlast) or custom databases (e.g. GO annotated sequence sets and single species DBs) when a local www-Blast installation is available. Blast expectation value (E-value) and hit number thresholds are provided to retrieve significant results. To avoid the danger of annotation by short matches with low E-values, an additional filter can be set to the minimal alignment length (hsp-length). Annotation, however, will ultimately be based on sequence similarity levels since similarity percentages are independent of database size and more intuitive than E-values.

In order to retrieve GO terms associated with the obtained hits, a straightforward mapping is made. By using Blast hit gene identifiers (gi) and gene accessions B2G retrieves all GO annotations for the hit sequences, together with their evidence codes (EC). ECs can be interpreted as an index of the trustworthiness of the GO annotation. At the end of the mapping process, for each query sequence, a set of candidate annotations from different hits of diverse similarity levels and various annotation sources is gathered.

ANNOTATION ASSIGNMENT
Annotation is performed by applying an annotation rule (AR) to the obtained ontologies. The rule seeks to find the most specific annotations with a certain level of reliability. This process is adjustable in specificity and stringency.

For each candidate GO an annotation score (AS) is computed. The AS is composed of two additive terms. The first, direct term (DT), represents the highest hit similarity of this GO weighted by a factor corresponding to its EC. By employing ECs, B2G promotes the assignment of annotations with experimental evidence and penalizes electronic annotations or low traceability. The EC weights have been taken following recommendations of the GO Consortium and can be modified if desired. The second term (AT) of the AS provides the possibility of abstraction. This is defined as annotation to a parent node when several child nodes are present in the GO candidate collection. This term multiplies the number of total GOs unified at the node by a user-defined GO weight factor that controls the possibility and strength of abstraction. Finally, the AR selects the lowest term per branch that lies over a user-defined threshold. In an analytical form, DT, AT and the AR terms are defined as follows:

$$\mathrm{DT} = \max(\mathrm{similarity} \times \mathrm{EC}_{\mathrm{weight}})$$
$$\mathrm{AT} = (\#\mathrm{GO} - 1) \times \mathrm{GO}_{\mathrm{weight}}$$
$$\mathrm{AS} = \mathrm{DT} + \mathrm{AT}$$
$$\mathrm{AR}: \mathrm{lowest.node}(\mathrm{AS} \geq \mathrm{threshold})$$

To comprehend the results of annotation, graph visualization for single sequences, showing all involved values, is available.

STATISTICS
Once GO annotation is available through B2G (uploading an existing annotation file is also supported), the application offers the possibility of direct statistical analysis on gene function information. A common analysis is the statistical assessment of GO term enrichments in a group of interesting genes when compared with a reference group. This functionality was introduced in B2G by integrating Gossip (Blüthgen et al., 2004). Gossip computes Fisher's Exact Test applying robust FDR (false discovery rate) correction for multiple testing and returns a list of significant GO terms ranked by their corrected or one-test P-values. Furthermore B2G offers various statistical charts summarizing the results obtained at blasting, mapping or annotation. Bar or pie charts of similarity/E-value distributions, EC distributions and annotation statistics (GOs/Seqs) can be generated, saved and printed.

VISUALIZATION
Visualization is an important aspect in B2G. For each sequence, the progress in the annotation process and the final annotation step are visualized on the main application table by successive color changes. This allows the researcher to readily spot sequences that failed the initial annotation process and, if desired, modify annotation parameters for those. Furthermore, the joint biological meaning of a set of sequences can be visualized on the GO DAG by color-intensity highlighting of the most relevant nodes in a combined sequence graph. Those nodes are identified by computing a node score that takes into account the number of sequences converging at one node and penalizes by the distance to the node where each sequence was annotated. Alternatively, when an enrichment analysis is available, graph color highlighting by statistical results will show the GO-term specificity of the query subset.

VALIDATION
The performance of Blast2GO has been tested using a dataset for which annotation and functional information was available. The methodology and results of this evaluation are given as supplementary material and are available at the B2G site. Our results show that Blast2GO reaches an annotation accuracy of 65–70%, which is in the range commonly reported for automatic GO annotation methods (Martin et al., 2004; Khan et al., 2003). More interestingly, this evaluation shows that the tool is successful in extracting relevant functional features of these sequences based on the use of the predicted annotation.

CONCLUSIONS
By joining annotation to function analysis B2G provides a powerful data mining tool ideally suited to support genomic research in non-model species. Its species-independent character and different data input fronts make it a valuable mining resource for potentially any organism. B2G combines high-throughput analysis, statistical evaluation and biology-framed visualization with a high degree of user interaction. Further developments of Blast2GO will include extension to multiple annotation types and novel statistical analysis tools.

ACKNOWLEDGEMENTS
The authors thank Dr Timothy Williams for fruitful discussions and comments on the software and Nils Bluethgen for kindly providing the Gossip software and supporting integration in B2G. This work has been funded by MCyT (GEN 2001 - 4885-C05-03) and eTumour Project (FP6-2002-LIFESCIHEALTH 503094). The authors thank the INBIOMED G03/160 research thematic network financed by FIS of the Instituto de Salud Carlos III.

Conflict of Interest: none declared.

REFERENCES
Al-Shahrour,F. et al. (2004) FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20, 578–580.
Altschul,S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Ashburner,M. et al. (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.
Blüthgen,N., Brand,K., Cajavec,B., Swat,M., Herzel,H. and Beule,D. (2004) Biological Profiling of Gene Groups utilizing Gene Ontology – A Statistical Framework. arXiv:q-bio.GN/0407034, 1,1.
Doniger,S. et al. (2003) MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol., 4, R7.
Groth,D. et al. (2004) GOblet: a platform for Gene Ontology annotation of anonymous sequence data. Nucleic Acids Res., 32, 313–317.
Khan,S. et al. (2003) GoFigure: automated Gene Ontology™ annotation. Bioinformatics, 19, 2484–2485.
Martin,D. et al. (2004) GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics, 5, 178.
Zehetner,G. (2003) OntoBlast function: from sequence similarities directly to potential functional annotations by ontology terms. Nucleic Acids Res., 31, 3799–3803.
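
As an illustration of the annotation rule given under ANNOTATION ASSIGNMENT (DT, AT, AS and AR), the following is a minimal Python sketch, not the B2G implementation. Several points are assumptions made only for this example: the evidence-code weights in `EC_WEIGHT` are invented values rather than the B2G defaults, the GO identifiers are placeholders, "#GO" is read as the number of candidate GO terms at or below a node, a parent's DT is taken over all candidates it covers, and "lowest term per branch" is read as "a node that passes the threshold and has no passing descendant".

```python
# Hypothetical evidence-code weights (illustrative only, not B2G's defaults).
EC_WEIGHT = {"IDA": 1.0, "ISS": 0.75, "IEA": 0.5}


def descendants(node, children):
    """Return every node reachable below `node` in the GO DAG."""
    seen, stack = set(), list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def annotation_scores(hits, children, go_weight):
    """Compute AS = DT + AT for every node that covers a candidate GO term.

    hits:     {go_term: [(similarity_percent, evidence_code), ...]}
    children: {parent_term: [child_term, ...]} describing the DAG
    """
    nodes = set(children) | {c for cs in children.values() for c in cs} | set(hits)
    scores = {}
    for node in nodes:
        # Candidate terms unified at this node (assumed reading of "#GO").
        covered = ({node} | descendants(node, children)) & set(hits)
        if not covered:
            continue
        dt = max(sim * EC_WEIGHT[ec] for term in covered for sim, ec in hits[term])
        at = (len(covered) - 1) * go_weight
        scores[node] = dt + at
    return scores


def apply_rule(scores, children, threshold):
    """Keep passing nodes that have no passing descendant (assumed reading of AR)."""
    passing = {node for node, score in scores.items() if score >= threshold}
    return sorted(n for n in passing if not (descendants(n, children) & passing))


# Toy DAG with placeholder identifiers: root -> parent -> {child1, child2}.
children = {"GO:root": ["GO:parent"], "GO:parent": ["GO:child1", "GO:child2"]}
hits = {"GO:child1": [(80, "IEA")], "GO:child2": [(60, "ISS")]}

scores = annotation_scores(hits, children, go_weight=5)
print(apply_rule(scores, children, threshold=48))  # ['GO:parent']
print(apply_rule(scores, children, threshold=40))  # ['GO:child1', 'GO:child2']
```

In this toy case neither child reaches a threshold of 48 on its own (scores 40 and 45), but the parent unifies two candidates and gains AT = (2 − 1) × 5, so the annotation is abstracted to the parent. Lowering the threshold to 40 lets both specific child terms through instead.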

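The enrichment analysis described under STATISTICS can be sketched in the same hedged spirit. The code below computes a one-sided Fisher's Exact Test per GO term and then corrects for multiple testing. Note the substitution: Gossip applies its own robust FDR procedure, which is not reproduced here; the standard Benjamini-Hochberg adjustment is used instead as a stand-in. The function names, GO identifiers and counts are hypothetical.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]].

    a: test-set genes with the GO term,   b: test-set genes without it
    c: reference genes with the GO term,  d: reference genes without it
    """
    n_test, n_term, total = a + b, a + c, a + b + c + d
    upper = min(n_test, n_term)
    # Hypergeometric tail: probability of seeing at least `a` annotated genes.
    tail = sum(comb(n_term, k) * comb(total - n_term, n_test - k)
               for k in range(a, upper + 1))
    return tail / comb(total, n_test)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts (a, b, c, d) per GO term.
counts = {"GO:term1": (3, 1, 1, 3), "GO:term2": (2, 2, 2, 2)}
terms = list(counts)
pvals = [fisher_right_tail(*counts[t]) for t in terms]
ranked = sorted(zip(terms, pvals, benjamini_hochberg(pvals)), key=lambda r: r[2])
for term, p, q in ranked:
    print(term, round(p, 4), round(q, 4))
```

Ranking by the adjusted value mirrors the paper's description of a list of GO terms ordered by corrected P-values; ranking by the raw value would correspond to the one-test P-values it also mentions.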

Publisher
Oxford University Press
Copyright
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org
eISSN
1367-4811
DOI
10.1093/bioinformatics/bti610
pmid
16081474

Journal

Bioinformatics, Oxford University Press

Published: Aug 4, 2005
