GOstat: find statistically overrepresented Gene Ontologies within a group of genes

Tim Beißbarth; Terence P. Speed

doi:10.1093/bioinformatics/bth088


Bioinformatics, Vol. 20 no. 9 2004, pages 1464–1465. Applications Note.

Tim Beißbarth and Terence P. Speed
Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3050, Australia

Received on July 14, 2003; revised on December 1, 2003; accepted on December 4, 2003
Advance Access publication February 12, 2004

ABSTRACT

Summary: Modern experimental techniques such as DNA microarrays usually produce a long list of genes that are potentially interesting in the process under study. To gain biological understanding from this type of data, it is necessary to analyze the functional annotations of all genes in the list. The Gene Ontology (GO) database provides a useful tool to annotate and analyze the functions of a large number of genes. Here, we introduce a tool that uses this information to determine which annotations are typical for the analyzed list of genes. The program automatically obtains the GO annotations from a database and generates statistics of which annotations are overrepresented in the analyzed list of genes. This results in a list of GO terms sorted by their specificity.

Availability: Our program GOstat is accessible via the Internet at http://gostat.wehi.edu.au

Contact: [email protected]

Ontologies are a widely used concept to create a controlled vocabulary to communicate and annotate knowledge. The Gene Ontology Consortium defines GO as an international standard to annotate genes (Ashburner et al., 2000). GO has a hierarchical structure starting with top-level ontologies for molecular functions, biological processes and cellular components. The GO database consists of two essential parts: the current ontologies, which define the vocabulary and structure, and the current annotations, which link the known genes to the associated GO terms that define their function. Currently, many groups are working on the development of the ontologies and annotations for different organisms. All the information can be downloaded from the web-site http://www.geneontology.org

Here, we would like to make use of the annotations and structure of the GOs in order to understand the biological processes present in a large dataset of genes. The usefulness of keyword hierarchies in interpreting large datasets has been demonstrated previously (Masys et al., 2001). Recently, a large number of tools have emerged that make use of GOs (Doniger et al., 2003; Draghici et al., 2003; Al-Shahrour et al., 2004; Dennis et al., 2003). We consider GOstat an easy-to-use tool with a solid statistical foundation.

Each gene can have several associated GO terms. Further, due to the hierarchical structure of the GOs, each GO term can be connected to several other GO terms higher in the GO hierarchy, which are therefore associated with the gene as well (Fig. 1). We call the list of GO terms that lie between a top level and the annotated GO term its path. In fact, several such paths might lead to an individual GO term. Each GO term in the path we call a split. So in the end a list of 100 genes will usually have many hundreds of associated GO terms and several thousand associated splits.

Fig. 1. Schema of GO annotation terms.

GOstat requires a list of gene identifiers that specify the group of genes of interest. The program uses several synonyms, each of which is sufficient to identify a gene. These synonyms are derived from the release of the GO database as well as from Unigene (Boguski and Schuler, 1995). GO databases for several organisms (human, mouse, Drosophila, yeast, Arabidopsis thaliana, etc.) are provided. In order to find GO terms that are statistically significant within the group, a control set of genes is needed to obtain a total count of occurrences for each GO term. This can be the complete database of annotated genes, one of several subsets that are commonly used on widely available microarrays, or a second list of gene identifiers that is passed to the program. In the last case, the second list is used as a reference to search for GO terms that are significantly more represented in the first list than in the second.

For all of the genes analyzed, GOstat determines the annotated GO terms and all splits. The program then counts the number of appearances of each GO term for the genes in the group as well as in the reference group. For each GO term, a p-value is calculated representing the probability that the observed counts could have resulted from randomly distributing this GO term between the tested group and the reference group. A χ² test is used to approximate this p-value. If the expected value for any count is below 5, the χ² approximation is inaccurate; in these cases we use Fisher's Exact Test instead. The resulting list of p-values is sorted. The GO terms that are most specific for the analyzed list of genes will have the lowest p-values.

As the number of GO terms for which we test significance is large, the computed p-values have to be corrected in order to control the rate of errors we expect with multiple testing (Shaffer, 1995; Dudoit et al., 2002). Two methods for correcting the p-value are offered in GOstat. The Holm correction controls the familywise error rate: when selecting GO terms with a corrected p-value below 0.1, we expect at most a 10% chance that any of the selected GO terms is not specific. The Benjamini and Hochberg correction controls the false discovery rate: when selecting GO terms with a corrected p-value below 0.1, we expect that up to 10% of the selected GO terms are not specific.

However, there are dependences between various GO terms in the resulting list. Frequently, genes share more or less the same set of annotations, as several GO terms are indicative of the same process. Also, GO terms that are within one path have strongly correlated results. In order to make the resulting list of GO terms more interpretable, GOstat has the option to cluster the GO terms. In this process, GO terms that are annotated in the same set of genes, or where one set of genes is a subset of the other, are grouped.

GOstat produces a list of p-values that state how specific certain GO terms are for a given list of genes (Fig. 2). The output is sorted by p-value and can be limited by various cutoff values. It is possible to display only the overrepresented or only the underrepresented terms. p-values of GO terms that are overrepresented in the dataset are typeset in green; p-values of underrepresented GO terms are colored red. GO terms that are annotated in more or less the same subsets of genes can be grouped together. GOstat will also output the complete list of associations between the supplied genes and the annotated GO terms. The GO IDs in the output are linked to AmiGO, a visualization tool for the hierarchy in the GO database (http://www.godatabase.org). The output can be formatted as HTML or as tabular text.

Fig. 2. GOstat Output.

GOstat provides a useful tool to find biological processes or annotations characteristic of a group of genes. This is of great help in analyzing lists of genes resulting from high-throughput screening experiments, such as microarrays, for their biological meaning.

ACKNOWLEDGEMENTS

Thanks to Joelle Michaud, Lavinia Hyde, Gordon Smyth and Hamish Scott for helpful suggestions and testing of the program. This work was funded by the Deutsche Forschungsgemeinschaft.

REFERENCES

Al-Shahrour, F., Diaz-Uriarte, R. and Dopazo, J. (2004) FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20, 578–580.
Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.
Boguski, M. and Schuler, G. (1995) ESTablishing a human transcript map. Nat. Genet., 10, 369–371.
Dennis, G., Jr, Sherman, B., Hosack, D., Yang, J., Gao, W., Lane, H. and Lempicki, R. (2003) DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol., 4, P3.
Doniger, S., Salomonis, N., Dahlquist, K., Vranizan, K., Lawlor, S. and Conklin, B. (2003) MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol., 4, R7.
Draghici, S., Khatri, P., Bhavsar, P., Shah, A., Krawetz, S. and Tainsky, M. (2003) Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res., 31, 3775–3781.
Dudoit, S., Shaffer, J. and Boldrick, J. (2002) Multiple hypothesis testing in microarray experiments. Technical Report 110, Division of Biostatistics, UC Berkeley.
Masys, D., Welsh, J., Fink, J.L., Gribskov, M., Klacansky, I. and Corbeil, J. (2001) Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics, 17, 319–326.
Shaffer, J. (1995) Multiple hypothesis testing. Annu. Rev. Psychol., 46, 561–584.
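ILLUSTRATIVE SKETCH

The testing procedure described in the main text (a χ² test per GO term, Fisher's Exact Test when any expected count is below 5, then Holm or Benjamini and Hochberg correction) can be sketched in a few lines of Python. This is not the GOstat implementation. The function names are hypothetical, and several details that the note leaves open are assumptions here: the group and the reference are treated as disjoint gene sets, both tests are two-sided, and the χ² statistic is computed without a continuity correction.

```python
from math import comb, erfc, sqrt


def fisher_exact_pvalue(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x annotated genes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum over all tables that are no more likely than the observed one.
    p = sum(prob(x) for x in support if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def chi2_pvalue(a, b, c, d):
    """Pearson chi-square test (1 d.f., no continuity correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 d.f.
    return erfc(sqrt(stat / 2.0))


def term_pvalue(k, n, K, N):
    """k of n group genes and K of N reference genes carry the GO term.

    Assumption: group and reference are disjoint and non-empty.
    Returns (p-value, name of the test that was used).
    """
    a, b, c, d = k, n - k, K, N - K
    total = n + N
    expected = [row * col / total for row in (n, N) for col in (a + c, b + d)]
    if min(expected) < 5:  # chi-square approximation is unreliable here
        return fisher_exact_pvalue(a, b, c, d), "fisher"
    return chi2_pvalue(a, b, c, d), "chi2"


def holm(pvalues):
    """Holm step-down adjusted p-values (familywise error rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjusted p-values (false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted, running = [0.0] * m, 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running = min(running, pvalues[i] * m / (rank + 1))
        adjusted[i] = running
    return adjusted


if __name__ == "__main__":
    # Made-up counts for three GO terms: (k, n, K, N) as in term_pvalue.
    counts = [(20, 100, 100, 900), (3, 100, 10, 900), (1, 100, 12, 900)]
    raw = [term_pvalue(*c)[0] for c in counts]
    for c, p, ph, pb in zip(counts, raw, holm(raw), benjamini_hochberg(raw)):
        print(c, p, ph, pb)
```

Sorting the terms by adjusted p-value and applying a cutoff such as 0.1 then corresponds to the selection step described in the text. The sketch uses only the standard library so that it runs without extra dependencies; in practice a statistics package would normally supply these tests and corrections.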



Publisher: Oxford University Press
Copyright: Bioinformatics 20(9) © Oxford University Press 2004; all rights reserved.
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/bth088
pmid: 14962934


Journal

Bioinformatics – Oxford University Press

Published: Feb 12, 2004
