GoMiner: a resource for biological interpretation of genomic and proteomic data

Barry R Zeeberg; Weimin Feng; Geoffrey Wang; May D Wang; Anthony T Fojo; Margot Sunshine; Sudarshan Narasimhan; David W Kane; William C Reinhold; Samir Lababidi; Kimberly J Bussey; Joseph Riss; J Carl Barrett; John N Weinstein

doi:10.1186/gb-2003-4-4-r28


Genome Biology 2003, Volume 4, Issue 4, Article R28. 2003-04-01. http://genomebiology.com/2003/4/4/R28

We have developed GoMiner, a program package that organizes lists of ‘interesting’ genes (for example, under- and overexpressed genes from a microarray experiment) for biological interpretation in the context of the Gene Ontology. GoMiner provides quantitative and statistical output files and two useful visualizations. The first is a tree-like structure analogous to that in the AmiGO browser and the second is a compact, dynamically interactive ‘directed acyclic graph’. Genes displayed in GoMiner are linked to major public bioinformatics resources.

Rationale

Gene-expression profiling and other forms of high-throughput genomic and proteomic studies are revolutionizing biology. That much is universally agreed. But the new technologies pose new challenges. The first is the experiment itself, the second is statistical analysis of results, the third is biological interpretation. That third challenge is often the most vexing and time-consuming. In gene-expression microarray studies, for example, one generally obtains a list of dozens or hundreds of genes that differ in expression between samples and then asks: ‘What does all of this mean biologically?’ The work of the Gene Ontology (GO) Consortium [1] provides a way to address that question. GO organizes genes into hierarchical categories based on biological process, molecular function and subcellular localization. In the past, this GO information was queried one gene at a time. Recently, batch processing has been introduced [2], but with a flat-format output that does not communicate the richness of GO’s hierarchical structure.

We have developed, and present here, the program package GoMiner as a freely available computer resource that fully incorporates the hierarchical structure of the Gene Ontology to automate the functional categorization of gene lists of any length. GoMiner is downloadable free of charge from [3] or [4]. GoMiner was developed particularly for biological interpretation of microarray data; one can input a list of under- and overexpressed genes and a list of all genes on the array, and then calculate enrichment or depletion of categories with genes that have changed expression. GoMiner thus facilitates analysis and organization of the results for rapid interpretation of ‘omic’ [5,6] data. For concreteness, the descriptions in this article will focus on applications to microarray data, but the range of uses is obviously much broader.

Overview of GoMiner

GoMiner takes as input two lists of genes: the total set on the array and the subset that the user flags as interesting (for example, altered in expression level). GoMiner displays the genes within the framework of the Gene Ontology hierarchy, both as a directed acyclic graph (DAG) and as the equivalent tree structure. The latter is similar in format to the visualization in the AmiGO browser display [1]. However, each category is annotated to reflect the number of genes from the user’s experiment assigned to that category plus the number assigned to its progeny categories (Figure 1a). This computation does not double-count genes that appear more than once along the traversal. The user has the option of designating each gene within the ‘interesting gene’ list as exhibiting under- or overexpression. If that is done, genes displayed in the tree-like view are tagged with green down-arrows or red up-arrows, respectively.

The most important parameter for purposes of interpretation is the enrichment (or depletion) of a category with respect to flagged genes (relative to what would have been expected by chance alone). This parameter will be discussed more extensively and more mathematically in the section on ‘Statistical considerations’. In Figure 1a, the relative enrichment is indicated by blue numbers for total flagged genes and by red and green numbers for over- and underexpressed genes, respectively. The last number (blue) for each category is a two-sided p-value from Fisher’s exact test.

In GoMiner, clicking on a gene of interest in the tree structure opens a menu that can be used to submit that gene as a query to an external data resource. The number of such links is being expanded rapidly, but currently included are LocusLink [7], PubMed [8], MedMiner [9,10], GeneCards [11], the NCBI’s Structure Database [12], and BioCarta and KEGG pathway maps as implemented by the NCI Cancer Genome Anatomy Project (CGAP) [13]. These external databases provide GoMiner with a rich set of resources for bioinformatic integration. For example, the links with CGAP and LocusLink provide interaction with pathway maps, chromosome visualizations, a database of single nucleotide polymorphisms (SNPs), and the Mammalian Gene Collection (MGC).

In GoMiner, clicking on a category instead of a gene brings up a second visualization (Figure 1b), a DAG programmed as a scalable vector graphic (SVG) that can be navigated fluently. Any of its nodes can be moused-over to list the flagged genes or clicked to highlight multiple pathways connecting it to the root. Detailed quantitative and statistical results are downloadable in several tab-delimited formats that can be read directly into a text file or a spreadsheet program for further analysis. For example, the spreadsheet data can be sorted by enrichment factor or p-value to focus attention on potentially interesting categories.

Figure 1. GoMiner displays for microarray gene-expression data on prostate cancer cell line DU145 and a subline (RC0.1) selected for resistance to a topoisomerase 1 inhibitor. (a) Tree-like display showing underexpressed genes (green down-arrows), overexpressed genes (red up-arrows), and unchanged genes (gray circles) in the GO ‘Apoptosis Regulator’ category and its subcategories. The blue number indicates a 2.4-fold enrichment of changed genes in this category. The p-value (Fisher’s exact) indicates that, despite this degree of enrichment, the small total number of genes (14) in this category prevents statistical significance. (b) Dynamically generated SVG graphic of the ‘Biological Process’ DAG with genes in the GO ‘Apoptosis Regulator’ category opened in a pull-down list by mousing-over. Categories enriched more than 1.5-fold with flagged genes are color-coded red; those depleted more than 1.5-fold are blue. The rest of the categories are gray.

Development of GoMiner

GoMiner is based on a variety of open-source Java classes and developer tools, plus substantial in-house custom software engineering (Figure 2). We chose Java to achieve independence of operating system so that more researchers could use the tool. A custom graphical user interface (GUI) provides the user with flexibility and an intuitive view of biological relationships (Figure 1a). A complementary command-line version of GoMiner allows high-throughput applications and fluent integration with other programs.

The heart of GoMiner is its processing engine (Figure 2), which parses input gene lists and retrieves database entries for association with GO categories (also called ‘terms’). The GO categories and gene associations are stored in a relational database. To enhance the speed of data manipulation, we model the information in memory using a DAG data structure. The root is the topmost node: ‘Gene Ontology’. The other nodes represent gene categories, and the connections represent relationships between categories. Each category-node object contains its associated genes, functionality for counting genes, a flag for dereplication during counting, and results of statistical analyses. The gene-category associations are displayed in the form of a tree (Figure 1a) or, alternatively, in the form of a DAG (Figure 1b).

We have developed GoMiner as a client-server application. The client, a Java application, communicates with a server-side database through JDBC. The client can run on platforms with Java run-time environment version 1.3 or higher. The primary client-user GUI, written using the Java Swing API, takes the form of a three-panel window in which the user can inspect GO categories and genes. The left-hand panel lists the genes, the databases from which their identities were derived, and optional up- and down-arrows to indicate over- or underexpression; the middle panel shows a tree visualization of categories in the style of the AmiGO browser [1] and, in addition, provides a visualization of the flagged genes in the particular microarray experiment. The right-hand panel shows all appearances within the GO hierarchy of any gene selected from the left or middle panel. The gene and category names are implemented as links to facilitate navigation of the data structures and access to public resources.

A second type of visualization, the DAG (programmed as an SVG), shows in compact form the spanning hierarchy for all flagged genes. Optionally, it can include only nodes below a specified level if the entire DAG would be too large for easy visualization. The client application uses several open-source components: the Berkeley Drosophila Genome Project (BDGP) Java Toolkit [14] for utility classes; BrowserLauncher [15] for cross-platform web browser integration; Jakarta-ORO [16] for text processing; the Jena Semantic Web Toolkit [17] for manipulating RDF models; MySQL Connector/J [18] for database connectivity; and Xerces [19] for parsing XML. The back-end is a relational database server, which stores all gene ontology data. It includes an implementation in MySQL [20] of the GO Consortium database.

In addition to the deployed components, we have introduced a number of open-source tools to enhance the development environment. In particular, the Concurrent Versions System (CVS) tool [21] coordinates program development at the Georgia Institute of Technology with that at the NCI, and also coordinates development within each of the groups. jUnit [22] automates unit- and system-level testing of the application.

Figure 2. Schematic of GoMiner architecture and data flow.

Statistical considerations

The two-sided Fisher’s exact test p-value for a category reflects a test of the null hypothesis that the category is neither enriched in, nor depleted of, flagged genes with respect to what would have been expected by chance alone. That is, it reflects the null hypothesis that, for each category, there is no difference between the proportion of genes in the category that are flagged and the proportion of genes outside the category that are flagged. The two groups of genes are mutually exclusive, as required for Fisher’s exact test. Note that the predicate of the null hypothesis does not include ‘the flagged genes that fall into the union of the rest of the categories’. That predicate would not ensure mutual exclusivity. The statistical question can be framed in terms of a classical 2 x 2 contingency table (Table 1).

The null hypothesis can be formulated as:

$$H_0: p_1 - p_2 = 0,$$

where $p_1 = n_f/n$ and $p_2 = (N_f - n_f)/(N - n)$. The two-sided p-value for Fisher’s exact test is the sum of the probabilities of observing tables at least as extreme as the one actually observed, given that the null hypothesis is true [23-25]. The use of Fisher’s exact test implies that we are conditioning on fixed marginal totals ($n$, $N - n$, $N_f$, $N - N_f$) under the null hypothesis. For a discussion of the implications of fixed marginal values, see for example [23-25].

Note that the 2 x 2 table does not require any information about the topology of the hierarchy or about how many genes are included in any category other than the one to which the test is being applied. We used the two-sided version of the test, which detects a significant difference in the proportions in either direction (that is, when the proportion of flagged genes in the category is either higher or lower than would be expected by random chance).
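As a hedged illustration of the test described above, the following Python sketch computes a two-sided Fisher's exact p-value from the four counts $n_f$, $n$, $N_f$ and $N$. With the margins held fixed, it sums the hypergeometric probabilities of every table that is no more probable than the observed one. This is not GoMiner's Java implementation: the function name `fisher_two_sided` and the small relative tolerance used to treat ties as equal are assumptions made for this example.

```python
from math import comb


def fisher_two_sided(n_f, n, N_f, N):
    """Two-sided Fisher's exact p-value for one category.

    n_f: flagged genes in the category    n: all genes in the category
    N_f: flagged genes on the array       N: all genes on the array
    (hypothetical helper for illustration, not GoMiner code)
    """

    def prob(k):
        # Hypergeometric probability of k flagged genes in the category,
        # conditioning on the fixed marginal totals n, N - n, N_f, N - N_f.
        return comb(N_f, k) * comb(N - N_f, n - k) / comb(N, n)

    lowest = max(0, n - (N - N_f))   # fewest flagged genes the margins allow
    highest = min(n, N_f)            # most flagged genes the margins allow
    p_observed = prob(n_f)

    # Sum over all tables at least as extreme (no more probable) than the
    # observed one; the tolerance is an assumption to absorb rounding error.
    total = sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)
```

For a toy table with $n_f = 3$, $n = 4$, $N_f = 4$ and $N = 8$, the sketch returns 34/70, or about 0.486. Because only the four counts enter the calculation, the same function can be applied separately to under- and overexpressed genes.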
Clearly, calculations analogous to the ones used here for all flagged genes can also be applied to test separately the equivalent null hypotheses for under- and overexpressed genes. Unlike the Z-statistic with the hypergeometric distribution, and tests based on it, Fisher’s exact test is appropriate even for categories containing a small number of genes. Our Java implementation of Fisher’s exact test is based on JavaScript by Øyvind Langsrud [26].

Table 1. Two-by-two contingency table for flagged and unflagged genes in a GO category

                   Flagged genes    Non-flagged genes          Total
In category        $n_f$            $n - n_f$                  $n$
Not in category    $N_f - n_f$      $(N - n) - (N_f - n_f)$    $N - n$
Total              $N_f$            $N - N_f$                  $N$

$n_f$ is the number of flagged genes in the category, $n$ is the total number of genes in the category, $N_f$ is the number of flagged genes on the microarray, and $N$ is the total number of genes on the microarray. All numbers are those obtained after dereplicating multiple instances of the same gene.

The following limitations of this statistical formulation should be borne in mind, and the p-values should be interpreted judiciously.

Random experimental and categorization error

Experimental error and any uncertainties in the classification of genes in GO are not included in the statistical model. Perhaps, given enough information (which we essentially never have) about those sources of error, they could be included in the statistical model, for example through a resampling technique.

Gene representation bias

The microarray gene set (or set from some other type of genomic or proteomic experiment) will generally be a biased representation of all genes. Therefore, enrichments and depletions, of necessity defined in terms of the genes studied, may be biased with respect to biological significance as well. An alternative is to replace the list of the total set of genes on the microarray with a list of the total set of genes in the genome (or a representative sample), but that approach introduces another source of bias: genes not on the microarray are counted in determining $N$ and $n$ but have no chance to be flagged.

GO consortium database bias for human gene associations

The GO Consortium [1] provides a set of flat files that indicate the association between gene names and GO categories for several species [27]. Although the flat files for human are quite comprehensive, we found a low hit rate for GO annotation of human genes using the database created by the GO Consortium’s downloaded MySQL script files [28]. The hit rates were low both when the gene names were used in the format of HUGO names and when the gene names were used in the format of ‘HUGO_HUMAN.’ We tried the latter format because the flat files often contained ‘_HUMAN’ appended to the human gene names. In contrast, when we used a combination of mouse (MGI) and rat (RGD) association files, there were reasonable numbers of hits. Therefore, we now routinely use mouse and rat annotations for human data. We are currently augmenting the human associations in the GO Consortium database to provide a richer annotation of human gene names. This goal will be achieved by using the MatchMiner database to integrate the information in the GO Consortium database [27] and the Swiss-Prot, TrEMBL and TrEMBLnew databases [29], and GoMiner will implement this database for human data in the near term. The MySQL script files will be freely available and should represent an improvement over what is currently available to program developers and end-users.

Non-independence of gene data

Gene-expression values within a category may be correlated for any of several reasons. They may represent the same gene, close family members with similar functions, genes in the same pathway or genes in alternative pathways for performing a biological function. Gene classifications in GO may be correlated for analogous reasons. How do such relationships affect the statistics? The answer is most easily seen by imagining a category containing nothing but five instances of the same gene (perhaps because five different identifiers were used and not recognized as representing the same gene). That category might appear either to be strikingly enriched (with five out of five genes flagged) or strikingly depleted (with none out of five genes flagged). But the appropriate value of $n$ for determining statistical significance in those cases would be 1, not 5. GoMiner’s companion program MatchMiner [30,31] handles this problem by identifying replicates of the same gene, even if they are represented by different identifiers.

What about possible sources of correlation other than ‘same-gene’? Do we want to dereplicate them as well? Generally, the answer is ‘no’. Correlation of genes in the same pathway is precisely the phenomenon we are often trying to identify. We would not want a statistical test to adjust for (and, in effect, null out) the effect of such relationships. Close family members might be considered an intermediate case. The statistical model implemented in GoMiner assumes, as our state of prior knowledge, that we know when two ‘genes’ are identical but nothing about their relationship if they are not identical. That seems the only available course. However, for each category, GoMiner provides the gene identities and the numbers given in Table 1, sufficient information for the knowledgeable user to decide to eliminate close family members or pathway partners if desired.
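The dereplicated counting discussed above can be sketched in a few lines of Python. The example counts each distinct gene once for a category together with all of its progeny categories, after mapping alternative identifiers to one canonical name, so that a gene reached along several paths of the DAG or listed under several identifiers contributes only once to $n$ and $n_f$. It is an illustration only, not GoMiner's or MatchMiner's code: the names `children`, `direct_genes`, `canonical`, `descendants` and `category_counts`, and all gene identifiers, are hypothetical.

```python
def descendants(children, term):
    """Return `term` plus every category reachable below it in the DAG."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:          # a node with two parents is visited once
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def category_counts(children, direct_genes, canonical, flagged, term):
    """Return (n_f, n) for `term`, counting each distinct gene once.

    children:     category -> list of child categories (hypothetical layout)
    direct_genes: category -> identifiers annotated directly to it
    canonical:    identifier -> canonical gene name (assumed to come from a
                  MatchMiner-like lookup; unknown identifiers map to themselves)
    flagged:      identifiers the user marked as interesting
    """
    genes = set()
    for category in descendants(children, term):
        for identifier in direct_genes.get(category, ()):
            genes.add(canonical.get(identifier, identifier))
    flagged_genes = {canonical.get(i, i) for i in flagged}
    return len(genes & flagged_genes), len(genes)


if __name__ == "__main__":
    # Toy DAG: category C has two parents, A and B.
    children = {"root": ["A", "B"], "A": ["C"], "B": ["C"]}
    direct_genes = {"A": ["GENE1"], "B": ["GENE2"], "C": ["GENE1_ALIAS", "GENE3"]}
    canonical = {"GENE1_ALIAS": "GENE1"}
    flagged = ["GENE1_ALIAS", "GENE3"]
    print(category_counts(children, direct_genes, canonical, flagged, "root"))
```

Using sets keeps the design simple: membership is decided by canonical gene name, so the five-identifiers-for-one-gene case in the text yields $n = 1$ rather than 5.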
The multiple comparisons problem

If one has not decided before analysis which particular gene category is to be examined, a correction should be made for the multiple opportunities to obtain a p-value indicating statistically significant enrichment or depletion. For example, with 1,000 categories, we would expect approximately 1,000 x 0.05 = 50 false positives simply by chance if we set the critical value at p = 0.05. The most common way to correct for this problem is that of Bonferroni (see, for example [32]), in which the critical value is divided by the number of trials (in this case, 1,000). However, that approach takes no account of the correlation among categories and is so conservative that it becomes extremely hard to detect true positives. A number of less conservative statistical methods have also been developed, but it is beyond the scope of this paper to review them here. An approach based on resampling will be incorporated into GoMiner in the coming months.

Overall, the p-values quoted should be considered as heuristic measures, useful as indicators of possible statistical significance, rather than as the results of formal inference. The p-values can be used, for example, to sort categories to identify those of the most potential interest.

As another useful measure, we have calculated the relative enrichment factor, $R_e$, defined as

$$R_e = \frac{n_f/n}{N_f/N}$$

and shown as blue numbers in Figure 1a. The analogous quantities for overexpressed (red numbers) and underexpressed (green numbers) genes are also shown. Depletion is, of course, represented by an enrichment factor less than unity.

Benchmarking GoMiner on a biological problem

As a test, GoMiner was applied to the results of our cDNA microarray study of the molecular mechanisms by which drug resistance develops [33]. The DAG shown in Figure 1b was generated from that study, which used quadruplicate ‘Oncochip’ microarrays (Microarray Facility, Advanced Technology Center, NCI [34]) to compare gene expression profiles in a prostate cancer cell line (DU145) and a subline (RC0.1) selected from it for resistance to the topoisomerase 1 inhibitor 9-nitro-camptothecin. The microarray included 1,399 cancer-interesting genes. Of those, 181 differed in expression according to a threshold criterion (>1.5-fold difference). MatchMiner was used to translate IMAGE clone ids for the 1,399 genes into HUGO names for input to GoMiner. Figure 1a shows that the category ‘apoptosis regulator’ was enriched 2.4-fold in genes with altered expression levels. More specifically, it was enriched 3.2-fold with underexpressed genes and 2.0-fold with overexpressed genes. Flow cytometric annexin V and TUNEL assays verified important differences in apoptotic potential between the cell lines, and analysis generated a novel hypothesis (the ‘permissive apoptosis-resistance’ hypothesis) for the relationship between apoptotic and cell-proliferation pathways in the development of drug resistance. Figure 1a provides more detailed information, indicating that these differences were focused in particular subcategories of apoptosis. Thus, GoMiner can help the user in at least two ways: it identifies categories enriched in, or depleted of, genes of interest; and it generates hypotheses to guide further research.

Unfortunately for us, interpretive analysis of the DU145/RC0.1 study was initially done one gene at a time before development of GoMiner (and, in fact, motivated that development). Performing the GO analysis one gene at a time would have taken more than two solid hours at the computer for the 181 genes before getting to the much harder parts of the task: doing the same for the entire array (nominally > 15 hours), then collating and organizing the information for each GO category. In contrast, operating on a 266 MHz PC with 250 MB RAM, it took 90 seconds to browse for and load the files, then 30 seconds for GoMiner to process the entire array of 1,399 genes and display the flagged and unflagged genes in their hierarchical context. In another test, running 900 flagged genes and all of HUGO (15,000 genes) took 4 minutes and 40 seconds on the same computer. Overall, the processing time was essentially linear with respect to the total number of genes (time in minutes = 0.0003 x genes + 0.0656; R = 0.998).

Comparison of GoMiner with related programs

Several other programs related to GoMiner have recently appeared. These include MAPPFinder [35,36], FatiGO [37], Onto-Express [2,38], and GoSurfer [39]. The following represents our best attempt at comparison, based on review of the available implementations and associated documentation as of January 2003.

FatiGO is a web application. The current implementation is very restrictive in that the user must specify ahead of time one particular level of the GO hierarchy that is to be used for analysis of the data. The other available applications, including GoMiner, process data for the entire GO hierarchy and allow the user to select views of the results dynamically. In a trial using FatiGO’s recommended search criteria with our standard test gene files, FatiGO did not find any GO categories with clusters of differentially expressed genes.

Onto-Express is also implemented as a web application. Although more flexible than FatiGO, it is largely limited to a flat view of the biological world. Whereas GoMiner provides both tree and DAG views of the genes embedded within the GO hierarchy, Onto-Express does not provide any hierarchical structure (the fundamental defining feature of GO). Onto-Express lists enriched and depleted categories, but it does not provide a statistical analysis of the results to aid understanding.
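As a numerical aside, the closed-form quantities quoted in the statistical and benchmarking sections above can be checked with a short Python sketch. It evaluates the relative enrichment factor $R_e = (n_f/n)/(N_f/N)$, the expected number of false positives and the Bonferroni critical value for a given number of categories, and the linear timing fit reported in the text. The function names are hypothetical, and the counts passed to `relative_enrichment` in the usage note are made-up values, not data from the DU145/RC0.1 study.

```python
def relative_enrichment(n_f, n, N_f, N):
    """R_e = (n_f / n) / (N_f / N); values below 1 indicate depletion."""
    return (n_f / n) / (N_f / N)


def expected_false_positives(num_categories, alpha=0.05):
    """Categories expected to pass the critical value by chance alone."""
    return num_categories * alpha


def bonferroni_critical_value(num_categories, alpha=0.05):
    """Critical value divided by the number of trials (Bonferroni)."""
    return alpha / num_categories


def predicted_minutes(total_genes):
    """Linear timing fit quoted in the text for the authors' 266 MHz PC."""
    return 0.0003 * total_genes + 0.0656


if __name__ == "__main__":
    # Hypothetical counts: 6 of 20 category genes flagged, 150 of 1,000 overall.
    print(relative_enrichment(6, 20, 150, 1000))
    print(expected_false_positives(1000), bonferroni_critical_value(1000))
    print(predicted_minutes(1399) * 60)  # seconds for the 1,399-gene array
```

With the hypothetical counts the enrichment factor is 2.0. For 1,000 categories the sketch reproduces the 50 expected false positives from the text and a Bonferroni critical value of 0.00005, and the timing fit predicts about 29 seconds for 1,399 genes, close to the 30 seconds reported.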
‘Version 2,’ recently announced (at a price of Genome Biology 2003, 4:R28 comment reviews reports deposited research refereed research interactions information http://genomebiology.com/2003/4/4/R28 Genome Biology 2003, Volume 4, Issue 4, Article R28 Zeeberg et al. R28.7 $1,500 - $5,000), provides a p-value (computed by a method MAPPFinder is written in Microsoft’s Visual Basic and is not specified in the announcement). therefore restricted to running on PCs under Windows. In contrast, GoMiner is written in Java and runs on multiple GoSurfer is implemented as a Windows application. As such, operating systems. We have tested it on Windows XP, 2000, it lacks the flexibility of platform-independence that Java NT, and 98, as well as on Mac OS X, Solaris, Linux (Red Hat confers upon GoMiner. GoSurfer is also rather inflexible in distribution), IRIX (SGI), and FreeBSD. See the GoMiner that the input identifiers are required to be specific Affymetrix website for specific operating-system issues. probe sets. It is not clear whether other identifier types sug- gested in a figure on the web site have been implemented. In We recently implemented an alternative command-line contrast, GoMiner uses HUGO gene names as input. These interface for GoMiner (S.N., M.S., D.W.K. and B.R.Z., gene names are more convenient for human interpretation, unpublished work) to complement the GUI version. The and GoMiner’s companion program MatchMiner [30,31] command-line interface allows GoMiner to be integrated allows many other types of identifiers (listed at the end of this with other tools via scripts or pipes. Our website will post section) to be converted easily into HUGO gene names. The updated versions of the documentation and program as soon visual output of GoSurfer is in the form of a DAG. GoMiner as comprehensive testing of this interface has been com- uses a text-based tree as its primary visual output because the pleted. 
In preliminary trials with the new interface we have nodes of the DAG are inherently more difficult to label without routinely processed more than 2,000 datasets at a time creating unacceptable screen clutter. The DAG gives an intu- through GoMiner. This high-throughput capability has made itive feel for the overall complexity of the categorizations, but two further developments possible: first, randomization it is not particularly useful for detailed dynamic navigation or studies are being done to address the multiple-comparisons for examination of categorized genes. The tabular output of problem (that is, to estimate the fraction of false positives GoSurfer does not include the HUGO names, which we con- among the selected categories); second, the output data sider to be the most useful key to gene identity. In contrast to stream is being coupled with integrated downstream analy- GoMiner, it appears that GoSurfer does not provide complete sis for automated recognition of interesting results buried quantitative and statistical summary data. within a large number of exploratory experiments. The user can explore and visualize these interesting results with MAPPFinder is a pioneering project that integrates GO GoMiner’s graphical user interface. analysis and biological pathway maps. GoMiner also pro- vides the potential for this type of integration, since each The command-line interface also allows GoMiner to interact gene in the GoMiner tree classification is dynamically linked flexibly with its companion program MatchMiner. With to the corresponding set of BioCarta and KEGG biological MatchMiner as a ‘preprocessor’, GoMiner can take input data pathway maps. In addition to providing integration with organized on the basis of ‘omic’ identifiers other than the biological pathway maps, GoMiner provides integration HUGO names central to GO. 
with chromosomal information via dynamic linking to LocusLink’s chromosome viewer. GoMiner also provides dynamic linking to SNPs and MGC databases via LocusLink. MAPPFinder provides the fundamental tree representation of the GO hierarchy, with summary and statistical data in line with each category. However, unlike the tree implementation in GoMiner, it shows only the categories; the genes themselves are shown in an auxiliary table. In GoMiner, both the categories and the genes are seamlessly shown as integral components of the tree.

MAPPFinder does not appear to include a DAG representation. In GoMiner, the DAG view provides a qualitative and quantitative picture of the often-complex, multiple parenthood of some categories. In our opinion, this type of visualization is complementary to the tree form and important to an appreciation of the complex, highly nonlinear relationships within biological systems and gene networks. This complexity is not easy for a human to infer from the tree representation. The GO consortium selected the DAG as its fundamental data structure (though not its visualization), in part because it includes the characteristics of a network that are not included in a tree.

MatchMiner currently resolves IMAGE clone ids, UniGene clusters, GenBank accession numbers, Affymetrix ids, chromosome locations, gene common names, and FISH clone ids, and greatly facilitates the preparation of microarray data for analysis in GoMiner.

In conclusion, GoMiner will continue in development with a view to integration with other bioinformatic resources being generated by the NCI and NIH for use by the biomedical research community. GoMiner is flexible both because it is coded in Java to be platform-independent and because it can accommodate either the default GO hierarchy and gene associations or customized versions. The default is the GO Consortium’s database of categories and gene associations as implemented on our server. However, the user can, if desired, edit categories and gene memberships using DAG-Edit, the BDGP Gene Ontology Editor Tool [40]. The edited database can then be accessed by GoMiner from a local server to accommodate domain- and expertise-specific applications. Another important type of flexibility is the wide range of uses. In this report, we have presented GoMiner in the context of microarray data, but the variety of applications is clearly much broader; it embraces the full range of genomic and proteomic studies.

Acknowledgements

GoMiner is being developed jointly by groups from the National Cancer Institute (NCI), the Georgia Institute of Technology, and Emory University. This project has been supported by a contract funded by the NCI’s Center for Cancer Research and by The Wallace H. Coulter Biomedical Engineering Department of Georgia Tech and Emory University academic funds for Professor May D. Wang. Its user features, statistical repertoire, and links to external resources will continue to be expanded through the contract funded by the NCI’s Center for Cancer Research and through Professor Wang’s academic funds.

References

1. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J, et al.: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25:25-29.
2. Khatri P, Draghici S, Ostermeier G, Krawetz S: Profiling gene expression using Onto-Express. Genomics 2002, 79:266-270.
3. GoMiner [http://discover.nci.nih.gov/gominer]
4. GoMiner [http://www.miblab.gatech.edu/gominer]
5. Weinstein JN: Fishing expeditions. Science 1998, 282:628-629.
6. Weinstein JN: ‘Omic’ and hypothesis-driven research in the molecular pharmacology of cancer. Curr Opin Pharmacol 2002, 2:361-365.
7. LocusLink [http://www.ncbi.nlm.nih.gov/LocusLink]
8. PubMed [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed]
9. MedMiner [http://discover.nci.nih.gov]
10. Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN: MedMiner: an internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques 1999, 27:1210-1217.
11. GeneCards [http://thr.cit.nih.gov:8081/cards/index.html]
12. NCBI Entrez Structure [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure]
13. The Cancer Genome Anatomy Project [http://cgap.nci.nih.gov/Pathways]
14. Berkeley Drosophila Genome Project: developers’ resources [http://www.fruitfly.org/developers]
15. BrowserLauncher [http://browserlauncher.sourceforge.net]
16. The Apache Jakarta Project [http://jakarta.apache.org/oro/index.html]
17. HP Labs Semantic Web Research [http://www.hpl.hp.com/semweb/index.html]
18. MySQL Connector/J downloads [http://www.MySQL.com/downloads/api-jdbc.html]
19. Xerces2 Java Parser Readme [http://xml.apache.org/xerces2-j/index.html]
20. MySQL [http://www.MySQL.com]
21. Concurrent Versions System [http://www.cvshome.org]
22. jUnit [http://www.junit.org]
23. Agresti A: Categorical Data Analysis. New York: John Wiley; 1990.
24. Agresti A: A survey of exact inference for contingency tables. Stat Sci 1992, 7:131-177.
25. StatXact 5 for Windows. User Manual. Cambridge, MA: Cytel Software Corporation; 2002.
26. Fisher’s Exact Test [http://www.matforsk.no/ola/fisher.htm]
27. GO Database [http://www.geneontology.org/#godatabase]
28. GO downloads [http://www.godatabase.org/dev/database/archive/]
29. Swiss-Prot, TrEMBL and TrEMBLnew database [ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/]
30. Bussey JK, Kane D, Sunshine M, Narasimhan S, Nishizuka S, Reinhold WC, Zeeberg B, Ajay, Weinstein JN: MatchMiner: a tool for batch navigation among gene and gene product identifiers. Genome Biol 2003, 4:R27.
31. MatchMiner [http://discover.nci.nih.gov/matchminer/]
32. Bonferroni [http://home.clara.net/sisa/bonhlp.htm]
33. Reinhold WC, Kouros-Mehr H, Kohn KW, Maunakea AK, Lababidi S, Roschke A, Stover K, Alexander J, Pantazis P, Miller L, et al.: Apoptotic susceptibility of cancer cells selected for camptothecin resistance: gene expression profiling, functional analysis, and molecular interaction mapping. Cancer Res 2003, 63:1000-1011.
34. NCI human Oncochip genes [http://nciarray.nci.nih.gov/gi_acc_ug_title.shtml]
35. Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol 2002, 4:R7.
36. MAPPFinder [http://www.genmapp.org/MAPPFinder.html]
37. FatiGO [http://fatigo.bioinfo.cnio.es/]
38. Onto-Express [http://vortex.cs.wayne.edu/Projects.html]
39. GoSurfer [http://biosun1.harvard.edu/complab/gosurfer/]
40. BDGP Gene Ontology Editor Tool [http://www.godatabase.org/dev/editor.html]
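As a supplementary illustration of the multiple parenthood discussed in the comparison with MAPPFinder, the following is a minimal Python sketch, not GoMiner's Java implementation. It stores a toy hierarchy as a child-to-parents map and enumerates every path from a category up to the root. The category names, the `PARENTS` map, and both function names are assumptions made for this example and are not taken from the GO database or from GoMiner.

```python
# Toy child -> parents map. Names are illustrative only (hypothetical),
# chosen so that one category has two parents, as can happen in a DAG.
PARENTS = {
    "Gene Ontology": [],
    "biological_process": ["Gene Ontology"],
    "cell death": ["biological_process"],
    "regulation of process": ["biological_process"],
    "regulation of cell death": ["cell death", "regulation of process"],
}


def paths_to_root(term, parents):
    """Return every path from the root down to `term`, each listed root-first."""
    if not parents[term]:
        # A node without parents is the root; its only path is itself.
        return [[term]]
    paths = []
    for parent in parents[term]:
        for path in paths_to_root(parent, parents):
            paths.append(path + [term])
    return paths


def tree_occurrences(parents):
    """Count how often each category appears when the DAG is unfolded into a tree."""
    return {term: len(paths_to_root(term, parents)) for term in parents}


if __name__ == "__main__":
    for path in paths_to_root("regulation of cell death", PARENTS):
        print(" > ".join(path))
    print(tree_occurrences(PARENTS))
```

In this toy hierarchy the two-parent category is reached by two distinct paths, so a tree view has to list it twice in separate branches, whereas a DAG view can draw it once with two incoming edges. That is one way to read the statement that multiple parenthood is hard to infer from the tree alone.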


Publisher: Springer Journals
Copyright: 2003 Zeeberg et al.; licensee BioMed Central Ltd.
eISSN: 1474-760X
DOI: 10.1186/gb-2003-4-4-r28

Abstract

We have developed GoMiner, a program package that organizes lists of ‘interesting’ genes (for example, under- and overexpressed genes from a microarray experiment) for biological interpretation in the context of the Gene Ontology. GoMiner provides quantitative and statistical output files and two useful visualizations. The first is a tree-like structure analogous to that in the AmiGO browser and the second is a compact, dynamically interactive ‘directed acyclic graph’. Genes displayed in GoMiner are linked to major public bioinformatics resources.
coded in Java to be platform-independent and because it can accommodate either the default GO hierarchy and gene MAPPFinder does not appear to include a DAG representa- associations or customized versions. The default is the GO tion. In GoMiner, the DAG view provides a qualitative and Consortium’s database of categories and gene associations as quantitative picture of the often-complex, multiple parent- implemented on our server. However, the user can, if hood of some categories. In our opinion, this type of visual- desired, edit categories and gene memberships using DAG- ization is complementary to the tree form and important to Edit, the BDGP Gene Ontology Editor Tool [40]. The edited an appreciation of the complex, highly nonlinear relation- database can then be accessed by GoMiner from a local ships within biological systems and gene networks. This server to accommodate domain- and expertise-specific complexity is not easy for a human to infer from the tree rep- applications. Another important type of flexibility is the wide resentation. The GO consortium selected the DAG as its fun- range of uses. In this report, we have presented GoMiner in damental data structure (though not its visualization), in the context of microarray data, but the variety of applica- part because it includes the characteristics of a network that tions is clearly much broader; it embraces the full range of are not included in a tree. genomic and proteomic studies. Genome Biology 2003, 4:R28 R28.8 Genome Biology 2003, Volume 4, Issue 4, Article R28 Zeeberg et al. http://genomebiology.com/2003/4/4/R28 35. Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Acknowledgements Conklin BR: MAPPFinder: using Gene Ontology and GoMiner is being developed jointly by groups from the National Cancer GenMAPP to create a global gene-expression profile from Institute (NCI), the Georgia Institute of Technology, and Emory Univer- microarray data. Genome Biol 2002, 4:R7. sity. 
This project has been supported by a contract funded by the NCI’s 36. MAPPFinder [http://www.genmapp.org/MAPPFinder.html] Center for Cancer Research and by The Wallace H. Coulter Biomedical 37. FatiGO [http://fatigo.bioinfo.cnio.es/] Engineering Department of Georgia Tech and Emory University academic 38. Onto-Express [http://vortex.cs.wayne.edu/Projects.html] funds for Professor May D. Wang. Its user features, statistical repertoire, 39. GoSurfer [http://biosun1.harvard.edu/complab/gosurfer/] and links to external resources will continue to be expanded through the 40. BDGP Gene Ontology Editor Tool contract funded by the NCI’s Center for Cancer Research and through [http://www.godatabase.org/dev/editor.html] Professor Wang’s academic funds. References 1. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J, et al.: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25:25-29. 2. Khatri P, Draghici S, Ostermeier G, Krawetz S: Profiling gene expression using Onto-Express. Genomics 2002, 79:266-270. 3. GoMiner [http://discover.nci.nih.gov/gominer] 4. GoMiner [http://www.miblab.gatech.edu/gominer] 5. Weinstein JN: Fishing expeditions. Science 1998, 282:628-629. 6. Weinstein JN: ‘Omic’ and hypothesis-driven research in the molecular pharmacology of cancer. Curr Opin Pharmacol 2002, 2:361-365. 7. LocusLink [http://www.ncbi.nlm.nih.gov/LocusLink] 8. PubMed [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed] 9. MedMiner [http://discover.nci.nih.gov] 10. Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN: Med- Miner: an internet text-mining tool for biomedical informa- tion, with application to gene expression profiling. BioTechniques 1999, 27:1210-1217. 11. GeneCards [http://thr.cit.nih.gov:8081/cards/index.html] 12. NCBI Entrez Structure [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure] 13. The Cancer Genome Anatomy Project [http://cgap.nci.nih.gov/Pathways] 14. 
Berkeley Drosophila Genome Project: developers’ resources [http://www.fruitfly.org/developers] 15. BrowserLauncher [http://browserlauncher.sourceforge.net] 16. The Apache Jakarta Project [http://jakarta.apache.org/oro/index.html] 17. HP Labs Semantic Web Research [http://www.hpl.hp.com/semweb/index.html] 18. MySQL Connector/J downloads [http://www.MySQL.com/downloads/api-jdbc.html] 19. Xerces2 Java Parser Readme [http://xml.apache.org/xerces2-j/index.html] 20. MySQL [http://www.MySQL.com] 21. Concurrent Versions System [http://www.cvshome.org] 22. jUnit [http://www.junit.org] 23. Agresti A: Categorical Data Analysis. New York: John Wiley; 1990. 24. Agresti A: A survey of Exact inference for contingency tables. Stat Sci 1992, 7:131-177. 25. StatXact 5 for Windows. User Manual. Cambridge, MA: Cytel Software Corporation; 2002. 26. Fisher’s Exact Test [http://www.matforsk.no/ola/fisher.htm] 27. GO Database [http://www.geneontology.org/#godatabase] 28. GO downloads [http://www.godatabase.org/dev/database/archive/] 29. Swiss-Prot, TrEMBL and TrEMBLnew database [ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/] 30. Bussey JK, Kane D, Sunshine M, Narasimhan S, Nishizuka S, Reinhold WC, Zeeberg B, Ajay, Weinstein JN: MatchMiner: a tool for batch navigation among gene and gene product identifiers. Genome Biol 2003, 4:R27. 31. MatchMiner [http://discover.nci.nih.gov/matchminer/] 32. Bonferroni [http://home.clara.net/sisa/bonhlp.htm] 33. Reinhold WC, Kouros-Mehr H, Kohn KW, Maunakea AK, Lababidi S, Roschke A, Stover K, Alexander J, Pantazis P, Miller L, et al.: Apoptotic susceptibility of cancer cells selected for camp- tothecin resistance: gene expression profiling, functional analysis, and molecular interaction mapping. Cancer Res 2003, 63:1000-1011. 34. NCI human Oncochip genes [http://nciarray.nci.nih.gov/gi_acc_ug_title.shtml] Genome Biology 2003, 4:R28
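
As a worked illustration of the enrichment factors quoted in the benchmarking section, the following minimal Python sketch computes an enrichment factor and a one-sided exact p-value for a single category. It assumes the enrichment factor is the fraction of changed genes within a category divided by the fraction of changed genes on the whole array. The array totals (1,399 genes, 181 changed) are taken from the text; the category counts (20 genes, 6 changed) are hypothetical and are not the counts for 'apoptosis regulator'. The p-value is a plain hypergeometric upper tail in the spirit of Fisher's exact test [23-26], not necessarily the computation GoMiner itself performs, and the function names are illustrative.

```python
from math import comb


def enrichment_factor(n_cat, n_cat_changed, n_total, n_total_changed):
    """Changed fraction inside the category divided by the changed
    fraction on the whole array (assumed definition)."""
    return (n_cat_changed / n_cat) / (n_total_changed / n_total)


def hypergeom_upper_tail(n_cat, n_cat_changed, n_total, n_total_changed):
    """P(X >= n_cat_changed) when n_cat genes are drawn without
    replacement from n_total genes, n_total_changed of which changed."""
    denom = comb(n_total, n_cat)
    upper = min(n_cat, n_total_changed)
    numer = sum(
        comb(n_total_changed, k) * comb(n_total - n_total_changed, n_cat - k)
        for k in range(n_cat_changed, upper + 1)
    )
    return numer / denom


# Array totals from the text; category counts are hypothetical.
N_TOTAL, N_CHANGED = 1399, 181
CAT_SIZE, CAT_CHANGED = 20, 6

ef = enrichment_factor(CAT_SIZE, CAT_CHANGED, N_TOTAL, N_CHANGED)
p = hypergeom_upper_tail(CAT_SIZE, CAT_CHANGED, N_TOTAL, N_CHANGED)
print(f"enrichment factor = {ef:.2f}, one-sided p = {p:.4f}")
```

A value above 1 indicates enrichment and a value below 1 indicates depletion, matching the convention stated in the text. When many categories are tested at once, the individual p-values still need a multiple-comparisons treatment of the kind the randomization studies are intended to provide.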

Journal

Genome Biology – Springer Journals

Published: Apr 1, 2003

Keywords: Animal Genetics and Genomics; Human Genetics; Plant Genetics and Genomics; Microbial Genetics and Genomics; Bioinformatics; Evolutionary Biology
