D. Masys, J. Welsh, J. Fink, M. Gribskov, Igor Klacansky, J. Corbeil (2001)
Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics, 17(4)
Scott Doniger, N. Salomonis, K. Dahlquist, K. Vranizan, Steven Lawlor, B. Conklin (2003)
MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biology, 4
S. Dudoit, J. Shaffer, Jennifer Boldrick (2003)
Multiple Hypothesis Testing in Microarray Experiments. Statistical Science, 18
S. Drăghici, P. Khatri, P. Bhavsar, Abhik Shah, S. Krawetz, M. Tainsky (2003)
Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Research, 31(13)
Glynn Dennis, Brad Sherman, Douglas Hosack, Jun Yang, Wei Gao, H. Lane, R. Lempicki (2003)
DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology, 4
J. Shaffer (1995)
Multiple Hypothesis Testing. Annual Review of Psychology, 46
M. Ashburner, C. Ball, J. Blake, D. Botstein, Heather Butler, J. Cherry, A. Davis, K. Dolinski, S. Dwight, J. Eppig, M. Harris, D. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. Matese, J. Richardson, M. Ringwald, G. Rubin, G. Sherlock (2000)
Gene Ontology: tool for the unification of biology. Nature Genetics, 25
M. Boguski, G. Schuler (1995)
ESTablishing a human transcript map. Nature Genetics, 10
F. Al-Shahrour, R. Díaz-Uriarte, J. Dopazo (2004)
FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20(4)
BIOINFORMATICS APPLICATIONS NOTE
Vol. 20 no. 9 2004, pages 1464–1465
DOI: 10.1093/bioinformatics/bth088
Bioinformatics 20(9) © Oxford University Press 2004; all rights reserved.

GOstat: find statistically overrepresented Gene Ontologies within a group of genes

Tim Beißbarth and Terence P. Speed

Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3050, Australia

To whom correspondence should be addressed.

Received on July 14, 2003; revised on December 1, 2003; accepted on December 4, 2003
Advance Access publication February 12, 2004

ABSTRACT

Summary: Modern experimental techniques, such as DNA microarrays, usually produce a long list of genes that are potentially interesting in the analyzed process. In order to gain biological understanding from this type of data, it is necessary to analyze the functional annotations of all genes in this list. The Gene-Ontology (GO) database provides a useful tool to annotate and analyze the functions of a large number of genes. Here, we introduce a tool that utilizes this information to obtain an understanding of which annotations are typical for the analyzed list of genes. This program automatically obtains the GO annotations from a database and generates statistics of which annotations are overrepresented in the analyzed list of genes. This results in a list of GO terms sorted by their specificity.

Availability: Our program GOstat is accessible via the Internet at http://gostat.wehi.edu.au

Contact: [email protected]

Ontologies are a widely used concept to create a controlled vocabulary to communicate and annotate knowledge. The Gene Ontology Consortium defines GO as an international standard to annotate genes (Ashburner et al., 2000). GO has a hierarchical structure starting with top-level ontologies for molecular functions, biological processes and cellular components. The GO database consists of two essential parts: the current ontologies, which define the vocabulary and structure, and the current annotations, which create a link between the known genes and the associated GOs that define their function. Currently, many groups are working on the development of the ontologies and annotations for different organisms. All the information can be downloaded from the web-site http://www.geneontology.org

Here, we would like to make use of the annotations and structure of the GOs in order to understand the biological processes present in a large dataset of genes. The usefulness of keyword hierarchies in interpreting large datasets has been demonstrated previously (Masys et al., 2001). Recently, a large number of tools that make use of GOs have emerged (Doniger et al., 2003; Draghici et al., 2003; Al-Shahrour et al., 2004; Dennis et al., 2003). We consider GOstat an easy to use tool with a solid statistical foundation.

Fig. 1. Schema of GO annotation terms.

Each gene can have several associated GO terms. Further, due to the hierarchical structure of the GOs, each GO term can be connected to several other GO terms higher in the GO hierarchy, which are therefore associated with the gene as well (Fig. 1). We call the list of GO terms that lie between a top level and the annotated GO term its path. In fact, several such paths might lead to an individual GO term. Each GO term in the path we call a split. So in the end a list of 100 genes will usually have many hundreds of associated GO terms and several thousand associated splits.

GOstat requires a list of gene identifiers that specify the group of genes of interest. The program uses several synonyms, each of which is sufficient to identify a gene. These synonyms are derived from the release of the GO database as well as from Unigene (Boguski and Schuler, 1995). GO databases for several organisms (human, mouse, Drosophila, yeast, Arabidopsis thaliana, etc.) are provided. In order to find GO terms that are statistically significant within the group, a control set of genes needs to be used to obtain a total count of occurrences for each GO term. This can be the complete database of annotated genes, one of several subsets that are commonly used on widely available microarrays, or a second list of gene identifiers that is passed to the program. In the last case, the second list is used as a reference to search for GO terms that are significantly more represented in the first list compared with the second.

For all of the genes analyzed, GOstat will determine the annotated GO terms and all splits. The program will then count the number of appearances of each GO term for the genes in the group as well as in the reference group. For each GO term, a p-value is calculated representing the probability that the observed counts could have resulted from randomly distributing this GO term between the tested group and the reference group. A χ² test is used in order to approximate this p-value. If the expected value for any count is below 5, the χ² approximation is inaccurate. Therefore, we use Fisher's Exact Test in these cases. The resulting list of p-values is sorted. The GO terms that are most specific for the analyzed list of genes will have the lowest p-values.

As the number of GO terms for which we test significance is large, the computed p-values have to be corrected in order to control the rate of errors we expect with multiple testing (Shaffer, 1995; Dudoit et al., 2002). Two methods for correcting the p-value are offered in GOstat. The Holm correction controls the familywise error rate, e.g. selecting GO terms with a p-value below 0.1, we expect a 10% chance that any of the selected GO terms are not specific. The Benjamini and Hochberg correction controls the false discovery rate, e.g. selecting GO terms with a p-value below 0.1, we expect that 10% of the selected GO terms are not specific.

However, there are dependences between various GO terms in the resulting list. Frequently, genes share more or less the same set of annotations, as several GO terms are indicative of the same process. Also, GO terms that are within one path have strongly correlated results. In order to make the resulting list of GO terms more interpretable, GOstat has the option to cluster the GO terms. In this process, GO terms that are annotated in the same set of genes, or where one set of genes is a subset of the other, are grouped.

Fig. 2. GOstat Output.

GOstat produces a list of p-values that state how specific certain GO terms are for a given list of genes (Fig. 2). The output is sorted by the p-value and can be limited by various cutoff values. It is possible to display the over- or underrepresented terms only. p-values of GO terms that are overrepresented in the dataset are typeset in green; p-values of underrepresented GO terms are colored red. GO terms that are annotated in more or less the same subsets of genes can be grouped together. GOstat will also output the complete list of the associations for the supplied genes to the annotated GO terms. The GO IDs in the output are linked to AmiGO, a visualization tool for the hierarchy in the GO database (http://www.godatabase.org). It is possible to format the output in HTML or as tabular text.

GOstat provides a useful tool to find biological processes or annotations characteristic of a group of genes. This is of great help in analyzing lists of genes resulting from high-throughput screening experiments, such as microarrays, for their biological meaning.

ACKNOWLEDGEMENTS

Thanks to Joelle Michaud, Lavinia Hyde, Gordon Smyth and Hamish Scott for helpful suggestions and testing of the program. This work was funded by the Deutsche Forschungsgemeinschaft.

REFERENCES

Al-Shahrour,F., Diaz-Uriarte,R. and Dopazo,J. (2004) FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20, 578–580.
Ashburner,M., Ball,C., Blake,J., Botstein,D., Butler,H., Cherry,J., Davis,A., Dolinski,K., Dwight,S., Eppig,J. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.
Boguski,M. and Schuler,G. (1995) Establishing a human transcript map. Nat. Genet., 10, 369–371.
Dennis,G.,Jr, Sherman,B., Hosack,D., Yang,J., Gao,W., Lane,H. and Lempicki,R. (2003) DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol., 4, P3.
Doniger,S., Salomonis,N., Dahlquist,K., Vranizan,K., Lawlor,S. and Conklin,B. (2003) MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol., 4, R7.
Draghici,S., Khatri,P., Bhavsar,P., Shah,A., Krawetz,S. and Tainsky,M. (2003) Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res., 31, 3775–3781.
Dudoit,S., Shaffer,J. and Boldrick,J. (2002) Multiple hypothesis testing in microarray experiments. Technical Report 110, Division of Biostatistics, UC Berkeley.
Masys,D., Welsh,J., Fink,J.L., Gribskov,M., Klacansky,I. and Corbeil,J. (2001) Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics, 17, 319–326.
Shaffer,J. (1995) Multiple hypothesis testing. Annu. Rev. Psychol., 46, 561–584.
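
The statistical procedure described in the text (a χ² test per GO term, Fisher's Exact Test when any expected count falls below 5, then Holm or Benjamini and Hochberg correction) can be illustrated with a minimal Python sketch. This is not GOstat's source code. The function names, the 2x2 table layout (a, b = genes in the tested group with and without the GO term; c, d = the same for a reference group assumed to be disjoint from the tested group), the two-sided form of Fisher's test and the absence of a continuity correction are all assumptions made for illustration.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x annotated genes in the tested group.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def term_p_value(a, b, c, d):
    """Chi-square p-value (1 degree of freedom), or Fisher if an expected count < 5."""
    n = a + b + c + d  # assumed to be greater than zero
    expected = [r * k / n for r in (a + b, c + d) for k in (a + c, b + d)]
    if min(expected) < 5:
        return fisher_exact_two_sided(a, b, c, d)
    stat = sum((o - e) ** 2 / e for o, e in zip((a, b, c, d), expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(stat / 2))


def holm(pvals):
    """Holm step-down adjusted p-values (familywise error rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini and Hochberg adjusted p-values (false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical counts (a, b, c, d) for three GO terms, not taken from the paper.
    tables = [(3, 1, 1, 3), (30, 70, 20, 80), (12, 88, 15, 985)]
    raw = [term_p_value(*t) for t in tables]
    for t, p, ph, pb in zip(tables, raw, holm(raw), benjamini_hochberg(raw)):
        print(t, p, ph, pb)
```

Selecting GO terms whose Holm-adjusted value is below 0.1 corresponds to the familywise example in the text, while the same cutoff on the Benjamini and Hochberg values corresponds to the false discovery rate example. Only the Python standard library is used so that the sketch runs without extra dependencies.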
Bioinformatics – Oxford University Press
Published: Feb 12, 2004