ErmineJ: Tool for functional analysis of gene expression data sets

Homin Lee; William Braynen; Kiran Keshav; Paul Pavlidis

doi:10.1186/1471-2105-6-269

ErmineJ: Tool for functional analysis of gene expression data sets

Lee, Homin; Braynen, William; Keshav, Kiran; Pavlidis, Paul 2005-11-09 00:00:00 Background: It is common for the results of a microarray study to be analyzed in the context of biologically-motivated groups of genes such as pathways or Gene Ontology categories. The most common method for such analysis uses the hypergeometric distribution (or a related technique) to look for "over-representation" of groups among genes selected as being differentially expressed or otherwise of interest based on a gene-by-gene analysis. However, this method suffers from some limitations, and biologist-friendly tools that implement alternatives have not been reported. Results: We introduce ErmineJ, a multiplatform user-friendly stand-alone software tool for the analysis of functionally-relevant sets of genes in the context of microarray gene expression data. ErmineJ implements multiple algorithms for gene set analysis, including over-representation and resampling-based methods that focus on gene scores or correlation of gene expression profiles. In addition to a graphical user interface, ErmineJ has a command line interface and an application programming interface that can be used to automate analyses. The graphical user interface includes tools for creating and modifying gene sets, visualizing the Gene Ontology as a table or tree, and visualizing gene expression data. ErmineJ comes with a complete user manual, and is open-source software licensed under the Gnu Public License. Conclusion: The availability of multiple analysis algorithms, together with a rich feature set and simple graphical interface, should make ErmineJ a useful addition to the biologist's informatics toolbox. ErmineJ is available from http://microarray.cu.genome.org. there are numerous advantages to using a computational Background A difficulty experienced by many (if not all) users of gene and statistical approach to analyze groups of genes. expression microarrays is making sense of the complex results. After analyzing each gene in a data set, an experi- The most common means of performing this analysis is to menter is often left to the task of summarizing the results ask whether certain Gene Ontology (GO) [1] terms are with little assistance. It is common for experimenters to "over-represented" in a set of genes selected by fold- ask questions at the level of molecular pathways or other change or statistically-motivated approaches such as a t- functionally relevant groupings of genes. While "ad hoc" test. This is easily implemented by using the properties of manual annotation of data sets is a common approach, the hypergeometric distribution (often referred to as Page 1 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 Fisher's exact test for two categories) or its binomial The first version of ermineJ was made available in 2003. approximation. In our work, these methods are more Recently we have completely revamped the user interface generically referred to as "over-representation analysis" or and updated the feature set, releasing ermineJ 2.0 in Octo- ORA. In addition, as the GO is just one way of organizing ber 2004 and 2.1 in June 2005. genes, we refer to the general goal of these methods as "gene set analysis", where a gene set is any grouping of Implementation genes not derived from the data itself, typically based on ErmineJ is implemented entirely in the Java programming biologically-motivated criteria. language [7]. It uses the Java Swing libraries to create a graphical user interface that can run on many different The need to perform ORA has led to the emergence of a platforms. Architecturally, an effort has been made to sep- variety of tools. A list of many such tools is available from arate analytical and algorithmic concerns from user pres- the Gene Ontology Consortium [2], and a large number entation concerns. Besides being a design best practice, of them were recently reviewed [3]. However, to our the architecture was also driven by the need to support knowledge these tools all implement ORA methods; other command-line interfaces as well as application program- methods or algorithms are not available, with the excep- ming interfaces to the methods. The structure of ermineJ tion of the Perl script Catmap [4]. Thus these tools prima- also lends itself to fairly easy extensibility, so new algo- rily differentiate themselves through user interface rithms can be added to the software as requirements features, ease of use, supported data types, and speed [3]. change. The analysis algorithms in ermineJ were previ- Most tools surveyed by [3] were reported to have one or ously described [4,5]. more significant limitations, including slow performance, an inability to analyze gene annotations other than those In addition to using the Java SDK, ermineJ depends on a directly annotated (that is, other levels of the GO hierar- number of free third-party libraries, most notably the Colt chy are not considered), requiring web access to use, are library [8]. Colt is a high-performance numerical comput- difficult to install (limiting their usefulness to biologists), ing library that includes implementations of many linear or lack the ability to visualize the GO hierarchy [3]. algebra and statistical methods, as well as useful data structures which we rely on heavily in our software. Other In this paper we describe ermineJ, a stand-alone tool that libraries ermineJ uses include various Jakarta Commons implements methods described by [5] and [4] in addition libraries [9], and the Xerces XML parsing engine [10], to ORA, has a rich feature set, and does not have the lim- which we use to parse the Gene Ontology XML descrip- itations cited above. One of the offered analysis methods tion. Many of the low-level numerical and utility routines in particular is complementary to ORA analysis, which we (e.g., for file parsing and string manipulation) are tested now call Gene Set Resampling or GSR (the "experiment" in an extensive unit test suite. score in Pavlidis et al. (2002)). In GSR, the gene-by-gene scores (e.g., t-test p-values) are not thresholded. Instead, Results and discussion Inputs for each gene set an aggregate score is computed, such as the geometric mean of the p-values for genes in the cate- All interfaces to ermineJ use the same basic inputs. The gory, and the significance of that score determined by ran- first is a description of the Gene Ontology in XML format, dom sampling of the data. We have recently presented obtained from the GO consortium web site [11]. The sec- some evidence that GSR can provide better results than ond is a description of the microarray platform (the "array ORA in some situations [6]. annotation file", which contains tab-delimited text), which associates probe identifiers with Gene Ontology ErmineJ also has methods for analysis of genes based on terms and additionally associates each probe with a gene rankings (the receiver operator characteristic, or ROC) [4]. (used in the statistical analysis to account for repeated ROC can be thought of as a version of ORA where all pos- genes, as described below) and descriptions that are useful sible thresholds are considered simultaneously. Like GSR, for viewing in the context of the results. The third required the ROC method utilizes non-thresholded gene scores, input is the user's own data. For ORA, GSR and ROC but considers only their ranking, which might be consid- applications, this takes the form of a list of gene scores, ered more robust than using the raw gene scores. Finally, one for every probe set on the array design. Alternatively ErmineJ offers an analysis based on the correlation of gene (for expression profile correlation analysis), the input can expression profiles, gene group correlation analysis be the expression profile matrix, as might be used as an (GCA) [5]. GCA can be used as an alternative to the use of input to a clustering tool. The gene scores can be p-values ORA for the determination of whether genes in particular or another score such as fold-change. ErmineJ is purpose- functional categories are "clustering together". fully largely agnostic about the meaning of the gene scores, and focused on the distributional properties of the scores. Page 2 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 We maintain on the order of 30 different mouse, human last option is not available from the GUI, though it can be and rat array annotation files for different platforms, as accessed from the other interfaces. Another important set- well as generic files for RefSeq [12] genes that can be used ting is the range of gene set sizes to analyze. Gene sets that to construct annotation files for other platforms (availa- are very small are unlikely to be very informative, because ble from our web site [13]). The native annotation file for- the goal of the analysis is to study genes in groups, while mat is very simple and new files can easily be constructed large gene sets may be too non-specific to provide useful with a modicum of bioinformatics skill. ErmineJ can also information. In addition, analyzing too many gene sets read Affymetrix "CSV" (comma-separated-value) annota- reduces the power of the analysis due to multiple testing tion files available from the manufacturer's web site. We costs. In practice we often use a range of 5–100 or 5–200. gladly entertain requests to add support for other arrays. When an annotation file is read in, the software automat- In addition to the pre-defined gene sets as defined by the ically associates each probe with all parent terms of each Gene Ontology, users are free to input their own gene sets. directly annotated terms. For example, all genes anno- These are defined in simple text files that are placed in a tated with the term "regulation of cell size" are also asso- directory that ermineJ checks at startup. These text files ciated with the higher-level terms "cellular can be created "off-line" or within the ermineJ GUI. In morphogenesis" and "morphogenesis". This feature is addition, users can modify gene sets from within ermineJ. only supported by some of the tools reviewed by [3]. This functionality can be used to correct errors or omis- sions in the Gene Ontology annotations, though care There are a number of parameters to set and decisions the must be exercised to avoid introducing biases into the user must make in order to run the software. The choice of results. analysis method is the most obvious, and each method has a few other settings that the user can choose to change. Types of analysis Gene-score based methods For example, for ORA analysis a threshold score must be defined. This is in contrast to most ORA software packages The ORA, GSR and ROC methods are closely related in which take as input a list of "genes of interest"; instead, that they are based on the gene-by-gene scores, with the ermineJ takes as input all the gene scores for the experi- goal of finding gene sets that are some sense "enriched" in ment. This lets ermineJ avoid the problem of selecting the high-scoring genes (which typically might be "differen- correct "null" gene set [3]: it is defined strictly by the genes tially expressed genes"). ORA is sometimes used to ana- analyzed in the experiment but not meeting the user- lyze genes which are selected by clustering, rather than a defined score threshold. continuous score. In this situation, GSR and ROC are not appropriate. However, the correlation method is specifi- For GSR, the method used to compute the score for a gene cally designed to address this situation. GSR and ROC set is a key parameter. The two options currently sup- have the benefit of not requiring a threshold to divide ported are the mean and the median. During the analysis, genes into "selected" and "non-selected" genes. The GSR uses the selected method to compute a summary of choice of the threshold for ORA can have a substantial the gene scores for each resampled or real gene set, and effect on the results obtained, because the "selected genes" this aggregate score is used to represent the gene set. change [4]. Choosing the median will tend to yield slightly more con- Correlation analysis servative results, as individual genes with very high scores are not given as much weight as in the mean computation. Gene group correlation analysis (GCA) is based on the similarity of the expression profiles of genes in a gene set: Some settings are used for multiple methods. For exam- loosely speaking, how well they "cluster together". Thus ple, when a gene is represented more than once in the data we propose that GCA can be used as an alternative to set, a decision has to be made as to how to treat these "rep- using ORA to analyze clusters. There are some differences licates" (which might not be replicates per se but represent to be noted between the typical application of ORA to different transcripts). The options supported are to use the clusters and the ermineJ correlation analysis. GCA is "best" score among the replicates to represent them as a group-centric, not cluster-centric. Thus we ask whether the group; to use the mean; or to treat them as separate enti- correlation among the members is higher than expected ties. Use of the "best" option is somewhat anti-conserva- by chance, not whether a given set of correlated genes is tive, but is reasonable when most "replicates" are in fact enriched for the genes in the group; GCA does not involve assaying different biological entities. In contrast, treating clustering. This is not a trivial distinction, because while replicates completely separately is not generally advised as the highest scores will be obtained for gene groups that it can lead to spurious positive findings in cases of true have uniform and high correlations among all the mem- replicates, as the gene set gets "adulterated" with multiple bers, groups that have two or more "sub-clusters" can also copies of the same high-scoring gene. For this reason the obtain high scores. In the current implementation of Page 3 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 Figure 1 A: The main panel of ErmineJ after several analyses have been performed A: The main panel of ErmineJ after several analyses have been performed. Gene sets selected at low FDR levels are indicated in color. B: The tree-view panel of ErmineJ, illustrating the ability to browse gene sets in the GO hierarchy. The icons at each node have specific meanings. For example, the yellow "bull's-eye" icon indicates a gene sets selected at an FDR of 0.05 or less. Purple diamonds indicate nodes that have "significant" sub-nodes. GCA, the absolute value of the correlation is always used, using the fact that the ROC is equivalent to the Wilcoxon which allows. In future versions we may expose this as a rank-sum test [4]. The raw gene set score is simply the area user-settable option, as well as implementing other possi- under the receiver operator characteristic curve [14], ble correlation metrics other than the current Pearson cor- which ranges from values of 0.5 (random ranking) to 1.0 relation. (all genes in the gene set at the top of the ranking). Finally, for correlation analysis, the null hypothesis is that the In all methods, for each gene set analyzed, ermineJ com- mean pairwise correlation of profiles in the gene set is putes a score and, based on that score and the gene sets drawn from the global distribution of gene set correlation size, a p-value representing the "significance" of that gene scores, as determined by resampling [5]. The raw score is set with respect to the null hypothesis. The definition of the mean absolute value of the pair-wise correlation of the the raw score and the null hypothesis depends on the genes in the set (comparisons of a probe to itself, or to method being used. Note that the raw scores are of limited other probes for the same gene, are always ignored). use because it cannot be evaluated in the absence of infor- mation about the gene set size. However, they can provide ErmineJ includes implementations of three multiple test the user with a helpful indication the strength of the correction methods (though currently only one of these, result, not just its statistical significance. Benjamini-Hochberg false discovery rate (FDR) [15], is made available through the GUI). The additional options, For ORA, the null hypothesis is that the genes in the gene available from the command line, are Bonferroni correc- set are distributed randomly between the "selected" genes tion and a resampling-based family-wise error rate correc- and the "non-selected" genes. The raw score reported by tion [16]. The FDR is used in the GUI as a rapid and ErmineJ is the number of genes in the set which pass the reasonable guide to which gene sets are likely to be of threshold for gene selection. For GSR, the null hypothesis highest interest. is that the mean (or median) gene score (which forms the gene set score; for p-values negative-log-transformed val- The ermineJ GUI ues are used) is drawn from the global (data-wide) distri- Most users of ermineJ will access it through its graphical bution of possible gene set mean (or median) gene scores, interface. The GUI of ermineJ was designed to be simple as determined by resampling [5]. For ROC analysis, the to use and provides "wizards" to guide users through com- null hypothesis is that the genes in the gene set are distrib- mon tasks such as running an analysis. Many settings uted randomly in the ranking; p-values are computed made by the user during operation of the software are Page 4 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 A Figure 2 gene set details view A gene set details view. The controls at the top allow adjustment of the size and contrast of the heat map. The gene scores (in this case p-values) are shown in the second text column. The grey and blue graph, shown only for experiments using p-values, shows the expected (grey) and actual (blue) distribution of p-values in the gene set. This display is provided as an additional aid to evaluation of the results. The last two columns provide information about each gene. The targets of the hyperlinks are con- figurable by the user. remembered between sessions, facilitating repeated anal- names of genes they contain. User-defined gene sets are ysis of the same data files and maintaining the user's pre- displayed in contrasting colors. Not shown in the figures ferred window sizes, for example. A complete manual is is the initial startup screen in which the user chooses the provided and is accessible via an on-line help function, as gene annotation file to use for the session. web pages on our web site, or in portable document for- mat (PDF). Double-clicking on a gene set in the main panel opens a new window that displays the genes in the gene set, along Some aspects of the ermineJ graphical user interface is with the expression profiles in a "heat-map" view (if the illustrated in Figures 1, 2, 3. The main panel of the soft- user has provided the profile data; Figure 2). The appear- ware can be viewed either as a table of gene sets (Figure ance of the heat map is configurable through menus and 1A) or in a hierarchical (tree) view (Figure 1B). These toolbar controls. The data displayed in the table, as well as views are linked so changes in one are reflected in the the image of the matrix, can be saved to disk using addi- other. To facilitate navigation of these displayed, gene sets tional menu options. The hyperlinks to external web sites can be searched by the name of the gene set or by the can be configured by the user to point to a web site of their Page 5 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 E Figure 3 xamples of screens from ErmineJ Wizards Examples of screens from ErmineJ Wizards. A: Analysis wizard. This illustrates options to set the range of gene set sizes to analyze, and the method of treating "replicates" of genes. See text for details of the latter. B: Gene set modification wizard. In this screen the user is selecting genes to delete from a gene set. The list of all probe available on the platform is available in the left panel. A "find" function simplifies the location of genes and probes. choosing, again through a menu option. All of these capa- Once an analysis is initiated, the user is informed of its bilities are available even if the user has not performed progress via a status bar. An analysis can be cancelled any any analysis, so ErmineJ can be used as a "gene set time. On completion, the results are added to the tabular browser" as well as for analysis. and tree views (Figure 1). Multiple results can be dis- played simultaneously in the tabular view, allowing easy An important feature of the GUI is the capability to rap- comparison of different runs. The tree view can display idly define and edit gene sets, which is accomplished in a only a single analysis result set at a time, but offers a pull- "wizard" that takes the user through the process set-by- down menu to selected among the results sets to display. step. Alternatively, the user can simply populate the gene In the tree and tabular views, high-scoring (i.e., signifi- set directory with files they have obtained from other cant) gene sets are highlighted in color. The tree view uses sources, for example created in bulk with a Python script a simple system of icons for each node to indicate whether or obtained from another user. As far as we know, no tool a significant node is contained within a given higher level surveyed by [3] affords the user the ability to define or node. Finally, the results of an analysis can be saved to a modify the categories. ErmineJ also allows the user to tab-delimited file for use in other software or to be choose which of the GO aspects (Biological Process, etc.) reloaded by ermineJ at a later time. to use in the analysis. Other interfaces In addition to the GUI, ermineJ offers a command line The GUI version of ermineJ can be installed on the user's computer or run via Java WebStart. The latter option sim- interface (CLI) and a simple application programming ply involves clicking on a link in the user's web browser, interface (API). The CLI exposes some features of ermineJ and ensures that the users have the most up-to-date ver- that are not available in the GUI, such as different meth- sion of the software. The drawback of using WebStart is ods for multiple test correction. The CLI is suitable for that the user must be connected to the internet to use the scripting runs of ermineJ. For example, a simple Perl script software. With a local installation, no internet connection can be used to automate runs of ermineJ with different set- is needed. tings or on different data sets. In contrast, the API was introduced to allow programmers to include the analyses Running an analysis available in ermineJ in their own software. The API cur- Running an analysis using the ErmineJ GUI involves using rently provides more limited access to the functionality of a "wizard" to set the parameters (Figure 3). The user is the software than the command line version, but will be asked to choose an analysis method, select the data file to expanded in future versions. analyze, choose any user-defined gene sets to include in the analysis, and set the various parameters required for Performance the particular analysis. All settings are documented via We tested the performance of ermineJ using the HG- "tool tips" and in the manual. U133_Plus_2 Affymetrix array design. This is a particu- Page 6 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 larly large array design with over 54,000 probe sets, and expression profile set and the results. Therefore we recom- represents a something of a worst-case scenario with mend running ermineJ on machines that have at least 256 respect to performance. With our current annotation set, Mb of RAM. 4844 different GO categories (gene sets) are available for analysis in this array design. We limited our analysis to Future plans gene sets with between 5 and 100 genes, leaving about At this writing, the current version of ermineJ is 2.1.6. 2700 gene sets. The times reported below are for analyzing New features planned for the software include expanding the complete set of over 54,000 probe sets with respect to the API and allowing more flexible creation of user- these 2700 gene sets on a on a 1.7 GHz Pentium laptop. defined gene sets, including allowing support of alterna- tive nomenclatures such as the Plant Ontology [17]. We With this array, ermineJ has an initial startup phase that also plan to provide annotation files for more platforms lasts 15–20 seconds, most of which is consumed by time and organisms. it takes for the gene annotation file to be read in and proc- essed for analysis. The time for analysis once startup is We have been interested in the possibility of including completed depends on the method used. For ORA, a com- other resampling-based methods such as GSEA [18] or the plete analysis is completed in 8 seconds (average of 3 similar resampling method implemented in Catmap [4] runs; times are wall clock seconds timed from within the in ermineJ. The primary reason to consider these methods software). While it is difficult to directly compare our is that they examine the distribution of gene scores by benchmarks with previously published benchmarks resampling over the samples, which is more correct than because the number of gene sets analyzed and the size of merely resampling over the genes. This is because the null the "null" gene set was not reported, and the times hypotheses in the gene score analysis are some variation reported might in some cases include initial startup times on a random distribution of genes within the ranking of [3], the fastest reported methods on the largest data sets genes. This assumption can be badly violated for gene sets tested completed ORA analyses in under 10 seconds. This containing highly correlated genes (such as the ribosomal indicates that ErmineJ is at least competitive with and pos- protein genes); such genes will tend to have correlated sibly faster than the fastest previously reported tools. rankings, and in some situations (particularly when the gene p-value distribution is close to uniform), spurious GSR analysis took about 370 seconds if a full resampling false positives can occur [4]. The ORA, GSR and ROC is performed (100,000 resampling trials per gene set size methods are all susceptible to this problem, though we in our tests). However, ermineJ implements an approxi- stress that this is only an serious issue for gene sets that mation, where limited resampling is used to estimate the show high correlations not related to the experimental parameters of a normal distribution. This normal is used design. to compute the p-values for each gene set. It also takes advantage that, especially for larger class sizes, the shape It would be challenging to provide a general-purpose of the resampled distribution is very similar for similar implementation of GSEA or Catmap that is easily accessi- class sizes, so not all of them need to be computed. In this ble to biologists with limited computational skills. These mode the analysis takes approximately 80 seconds. ROC methods require either that users can provide the gene analysis, which does not involve resampling, took about scores for hundreds (if not thousands) of resampled data 100 seconds. Correlation analysis is the most computa- sets [4], a task that is difficult to accomplish for the tar- tionally intensive resampling method; even with the geted user base of ermineJ, or computation of gene scores approximations enabled it currently takes about 400 sec- by the software. Because each experimental design might onds to run on the test data set (which contained 12 have a different mechanism for computing gene scores microarrays). This is because computing correlations is (fold-change, t-test, ANVOA, Cox regression, etc), it computationally intensive, compared to the methods would be difficult to provide a fully flexible tool without which use pre-computed gene scores such as p-values. including a full-fledged statistical analysis package as well. A feasible solution we are considering is to cover the most ErmineJ is fairly memory-intensive, because it holds in frequently-encountered situations (e.g., t-test and one- memory a complex data structure describing the annota- way ANOVA). tions, as well as the microarray data and information about the results for thousands of gene sets and tens of Conclusion thousands of genes. For the large HG-U133_Plus_2 ErmineJ is a fast, full-featured, user-friendly, multi-plat- design, after startup ermineJ occupies approximately 85 form open source application for analysis of gene sets. It Mb of RAM (determined using a Java heap profiler under implements multiple algorithms for performing the anal- Windows). After running the correlation analysis, this ysis, and permits easy modification and creation of new grew to 105 Mb, reflecting the loading of the complete gene sets. These features afford users considerable flexibil- Page 7 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 ity in testing different methods and parameters. Perhaps References 1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, the greatest current limitation to its usability at this date is Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel- the availability of gene annotation files for non-Affyme- Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, trix array designs we have not encountered frequently. Rubin GM, Sherlock G: Gene ontology: tool for the unification of biol- ogy. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29. Users who wish to develop annotation files for their plat- 2. Gene Ontology Tools [http://www.geneontology.org/ form should contact us for assistance. GO.tools.shtml] 3. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinfor- Availability and requirements matics 2005, 21(18):3587-3595. Project name: ErmineJ 4. Breslin T, Eden P, Krogh M: Comparing functional annotation analyses with Catmap. BMC Bioinformatics 2004, 5(1):193. 5. Pavlidis P, Lewis DP, Noble WS: Exploring gene expression data Project home page: http://microarray.cu-genome.org/ with class scores. Pac Symp Biocomput 2002:474-485. ermineJ/ 6. Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the gene ontol- ogy for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Operating system(s): Platform independent Neurochem Res 2004, 29(6):1213-1222. 7. Java [http://java.sun.com/] 8. Colt [http://dsd.lbl.gov/~hoschek/colt/] Programming language: Java 9. Jakarta Commons [http://jakarta.apache.org/commons/] 10. Xerces [http://xml.apache.org/] 11. Gene Ontology Consoritum [http://www.geneontology.org/] Other requirements: Java 1.4 or higher; 256 Mb RAM 12. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene- recommended. centered information at NCBI. Nucleic Acids Res 2005, 33(Data- base issue):D54-8. 13. Microarray annotation files [http://microarray.cu-genome.org/ License: GNU GPL and LPGL for helper library. annots/] 14. Duda RO, Hart PE, Stork DG: Pattern classification. 2nd edition. New York , Wiley; 2001:xx, 654. Any restrictions to use by non-academics: None 15. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Jour- List of Abbreviations nal of the Royal Statistical Society B 1995, 57:289-300. 16. Westfall PH, Young SS: Resampling-based multiple testing. ORA: Over-representation analysis New York , John Wiley & Sons, Inc.; 1993:340. 17. The Plant Ontology [http://www.plantontology.org/] GSR: Gene score resampling 18. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gil- lette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach ROC: Receiver operator characteristic for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005, 102(43):15545-15550. GCA: Gene group correlation analysis GSEA: Gene Set Enrichment Analysis FDR: False discovery rate GO: Gene Ontology GUI: Graphical User Interface API: Application Programming Interface CLI: Command Line Interface Publish with Bio Med Central and every scientist can read your work free of charge Authors' contributions "BioMed Central will be the most significant development for PP was the project lead and chief architect of the software, disseminating the results of biomedical researc h in our lifetime." and contributed to the source code. HKL, WB and KK all Sir Paul Nurse, Cancer Research UK contributed to the source code. Your research papers will be: available free of charge to the entire biomedical community Acknowledgements peer reviewed and published immediately upon acceptance We thank Shahmil Merchant and Edward Chen for contributions to an early version of ErmineJ, and William Noble for supporting the development of cited in PubMed and archived on PubMed Central the methods, and Neil Segal for providing the microarray data used in the yours — you keep the copyright screen shots. We also thank testers and users who provided bug reports BioMedcentral Submit your manuscript here: and suggestions for improvements. http://www.biomedcentral.com/info/publishing_adv.asp Page 8 of 8 (page number not for citation purposes) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals http://www.deepdyve.com/lp/springer-journals/erminej-tool-for-functional-analysis-of-gene-expression-data-sets-HLCtpWrHLy

Loading next page...

References (18)

P. Khatri, S. Drăghici (2005)
Ontological analysis of gene expression data: current tools, limitations, and open problems
Bioinformatics, 21 18
P. Pavlidis, J. Qin, V. Arango, J. Mann, E. Sibille (2004)
Using the Gene Ontology for Microarray Data Mining: A Comparison of Methods and Application to Age Effects in Human Prefrontal Cortex
Neurochemical Research, 29
(2004)
BMC Bioinformatics Methodology article Comparing functional annotation analyses with Catmap
P Pavlidis, DP Lewis, WS Noble (2002)
Pac Symp Biocomput
A. Subramanian, P. Tamayo, V. Mootha, Sayan Mukherjee, B. Ebert, Michael Gillette, A. Paulovich, S. Pomeroy, T. Golub, E. Lander, J. Mesirov (2005)
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles
Proceedings of the National Academy of Sciences of the United States of America, 102
M Ashburner, CA Ball, JA Blake, D Botstein, H Butler, JM Cherry, AP Davis, K Dolinski, SS Dwight, JT Eppig, MA Harris, DP Hill, L Issel-Tarver, A Kasarskis, S Lewis, JC Matese, JE Richardson, M Ringwald, GM Rubin, G Sherlock (2000)
Gene ontology: tool for the unification of biology. The Gene Ontology Consortium
Nat Genet, 25
P. Pavlidis, Darrin Lewis, William Noble (2001)
Exploring Gene Expression Data with Class Scores
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
Colt [http://dsd.lbl.gov/~hoschek/colt/] 9
R. Duda, P. Hart, D. Stork (2000)
Pattern classification, 2nd Edition
P. Hart, R. Duda, D. Stork (1973)
Pattern Classification
Microarray annotation files [http://microarray.cu-genome.org/ annots
J. Booth, P. Westfall, S. Young (1994)
Resampling-Based Multiple Testing.
Journal of the American Statistical Association, 89
(1993)
The Plant Ontology
Y. Benjamini, Y. Hochberg (1995)
Controlling the false discovery rate: a practical and powerful approach to multiple testing
Journal of the royal statistical society series b-methodological, 57
M. Ashburner, C. Ball, J. Blake, D. Botstein, Heather Butler, J. Cherry, A. Davis, K. Dolinski, S. Dwight, J. Eppig, M. Harris, D. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. Matese, J. Richardson, M. Ringwald, G. Rubin, G. Sherlock (2000)
Gene Ontology: tool for the unification of biology
Nature Genetics, 25
T Breslin, P Eden, M Krogh (2004)
Comparing functional annotation analyses with Catmap
BMC Bioinformatics, 5
Gene Ontology Tools
Donna Maglott, J. Ostell, Kim Pruitt, T. Tatusova (2006)
Entrez Gene: gene-centered information at NCBI
Nucleic Acids Research, 35

Publisher: Springer Journals
Copyright: Copyright © 2005 by Lee et al; licensee BioMed Central Ltd.
Subject: Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN: 1471-2105
DOI: 10.1186/1471-2105-6-269
pmid: 16280084
Publisher site: See Article on Publisher Site

Abstract

Background: It is common for the results of a microarray study to be analyzed in the context of biologically-motivated groups of genes such as pathways or Gene Ontology categories. The most common method for such analysis uses the hypergeometric distribution (or a related technique) to look for "over-representation" of groups among genes selected as being differentially expressed or otherwise of interest based on a gene-by-gene analysis. However, this method suffers from some limitations, and biologist-friendly tools that implement alternatives have not been reported. Results: We introduce ErmineJ, a multiplatform user-friendly stand-alone software tool for the analysis of functionally-relevant sets of genes in the context of microarray gene expression data. ErmineJ implements multiple algorithms for gene set analysis, including over-representation and resampling-based methods that focus on gene scores or correlation of gene expression profiles. In addition to a graphical user interface, ErmineJ has a command line interface and an application programming interface that can be used to automate analyses. The graphical user interface includes tools for creating and modifying gene sets, visualizing the Gene Ontology as a table or tree, and visualizing gene expression data. ErmineJ comes with a complete user manual, and is open-source software licensed under the Gnu Public License. Conclusion: The availability of multiple analysis algorithms, together with a rich feature set and simple graphical interface, should make ErmineJ a useful addition to the biologist's informatics toolbox. ErmineJ is available from http://microarray.cu.genome.org. there are numerous advantages to using a computational Background A difficulty experienced by many (if not all) users of gene and statistical approach to analyze groups of genes. expression microarrays is making sense of the complex results. After analyzing each gene in a data set, an experi- The most common means of performing this analysis is to menter is often left to the task of summarizing the results ask whether certain Gene Ontology (GO) [1] terms are with little assistance. It is common for experimenters to "over-represented" in a set of genes selected by fold- ask questions at the level of molecular pathways or other change or statistically-motivated approaches such as a t- functionally relevant groupings of genes. While "ad hoc" test. This is easily implemented by using the properties of manual annotation of data sets is a common approach, the hypergeometric distribution (often referred to as Page 1 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 Fisher's exact test for two categories) or its binomial The first version of ermineJ was made available in 2003. approximation. In our work, these methods are more Recently we have completely revamped the user interface generically referred to as "over-representation analysis" or and updated the feature set, releasing ermineJ 2.0 in Octo- ORA. In addition, as the GO is just one way of organizing ber 2004 and 2.1 in June 2005. genes, we refer to the general goal of these methods as "gene set analysis", where a gene set is any grouping of Implementation genes not derived from the data itself, typically based on ErmineJ is implemented entirely in the Java programming biologically-motivated criteria. language [7]. It uses the Java Swing libraries to create a graphical user interface that can run on many different The need to perform ORA has led to the emergence of a platforms. Architecturally, an effort has been made to sep- variety of tools. A list of many such tools is available from arate analytical and algorithmic concerns from user pres- the Gene Ontology Consortium [2], and a large number entation concerns. Besides being a design best practice, of them were recently reviewed [3]. However, to our the architecture was also driven by the need to support knowledge these tools all implement ORA methods; other command-line interfaces as well as application program- methods or algorithms are not available, with the excep- ming interfaces to the methods. The structure of ermineJ tion of the Perl script Catmap [4]. Thus these tools prima- also lends itself to fairly easy extensibility, so new algo- rily differentiate themselves through user interface rithms can be added to the software as requirements features, ease of use, supported data types, and speed [3]. change. The analysis algorithms in ermineJ were previ- Most tools surveyed by [3] were reported to have one or ously described [4,5]. more significant limitations, including slow performance, an inability to analyze gene annotations other than those In addition to using the Java SDK, ermineJ depends on a directly annotated (that is, other levels of the GO hierar- number of free third-party libraries, most notably the Colt chy are not considered), requiring web access to use, are library [8]. Colt is a high-performance numerical comput- difficult to install (limiting their usefulness to biologists), ing library that includes implementations of many linear or lack the ability to visualize the GO hierarchy [3]. algebra and statistical methods, as well as useful data structures which we rely on heavily in our software. Other In this paper we describe ermineJ, a stand-alone tool that libraries ermineJ uses include various Jakarta Commons implements methods described by [5] and [4] in addition libraries [9], and the Xerces XML parsing engine [10], to ORA, has a rich feature set, and does not have the lim- which we use to parse the Gene Ontology XML descrip- itations cited above. One of the offered analysis methods tion. Many of the low-level numerical and utility routines in particular is complementary to ORA analysis, which we (e.g., for file parsing and string manipulation) are tested now call Gene Set Resampling or GSR (the "experiment" in an extensive unit test suite. score in Pavlidis et al. (2002)). In GSR, the gene-by-gene scores (e.g., t-test p-values) are not thresholded. Instead, Results and discussion Inputs for each gene set an aggregate score is computed, such as the geometric mean of the p-values for genes in the cate- All interfaces to ermineJ use the same basic inputs. The gory, and the significance of that score determined by ran- first is a description of the Gene Ontology in XML format, dom sampling of the data. We have recently presented obtained from the GO consortium web site [11]. The sec- some evidence that GSR can provide better results than ond is a description of the microarray platform (the "array ORA in some situations [6]. annotation file", which contains tab-delimited text), which associates probe identifiers with Gene Ontology ErmineJ also has methods for analysis of genes based on terms and additionally associates each probe with a gene rankings (the receiver operator characteristic, or ROC) [4]. (used in the statistical analysis to account for repeated ROC can be thought of as a version of ORA where all pos- genes, as described below) and descriptions that are useful sible thresholds are considered simultaneously. Like GSR, for viewing in the context of the results. The third required the ROC method utilizes non-thresholded gene scores, input is the user's own data. For ORA, GSR and ROC but considers only their ranking, which might be consid- applications, this takes the form of a list of gene scores, ered more robust than using the raw gene scores. Finally, one for every probe set on the array design. Alternatively ErmineJ offers an analysis based on the correlation of gene (for expression profile correlation analysis), the input can expression profiles, gene group correlation analysis be the expression profile matrix, as might be used as an (GCA) [5]. GCA can be used as an alternative to the use of input to a clustering tool. The gene scores can be p-values ORA for the determination of whether genes in particular or another score such as fold-change. ErmineJ is purpose- functional categories are "clustering together". fully largely agnostic about the meaning of the gene scores, and focused on the distributional properties of the scores. Page 2 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 We maintain on the order of 30 different mouse, human last option is not available from the GUI, though it can be and rat array annotation files for different platforms, as accessed from the other interfaces. Another important set- well as generic files for RefSeq [12] genes that can be used ting is the range of gene set sizes to analyze. Gene sets that to construct annotation files for other platforms (availa- are very small are unlikely to be very informative, because ble from our web site [13]). The native annotation file for- the goal of the analysis is to study genes in groups, while mat is very simple and new files can easily be constructed large gene sets may be too non-specific to provide useful with a modicum of bioinformatics skill. ErmineJ can also information. In addition, analyzing too many gene sets read Affymetrix "CSV" (comma-separated-value) annota- reduces the power of the analysis due to multiple testing tion files available from the manufacturer's web site. We costs. In practice we often use a range of 5–100 or 5–200. gladly entertain requests to add support for other arrays. When an annotation file is read in, the software automat- In addition to the pre-defined gene sets as defined by the ically associates each probe with all parent terms of each Gene Ontology, users are free to input their own gene sets. directly annotated terms. For example, all genes anno- These are defined in simple text files that are placed in a tated with the term "regulation of cell size" are also asso- directory that ermineJ checks at startup. These text files ciated with the higher-level terms "cellular can be created "off-line" or within the ermineJ GUI. In morphogenesis" and "morphogenesis". This feature is addition, users can modify gene sets from within ermineJ. only supported by some of the tools reviewed by [3]. This functionality can be used to correct errors or omis- sions in the Gene Ontology annotations, though care There are a number of parameters to set and decisions the must be exercised to avoid introducing biases into the user must make in order to run the software. The choice of results. analysis method is the most obvious, and each method has a few other settings that the user can choose to change. Types of analysis Gene-score based methods For example, for ORA analysis a threshold score must be defined. This is in contrast to most ORA software packages The ORA, GSR and ROC methods are closely related in which take as input a list of "genes of interest"; instead, that they are based on the gene-by-gene scores, with the ermineJ takes as input all the gene scores for the experi- goal of finding gene sets that are some sense "enriched" in ment. This lets ermineJ avoid the problem of selecting the high-scoring genes (which typically might be "differen- correct "null" gene set [3]: it is defined strictly by the genes tially expressed genes"). ORA is sometimes used to ana- analyzed in the experiment but not meeting the user- lyze genes which are selected by clustering, rather than a defined score threshold. continuous score. In this situation, GSR and ROC are not appropriate. However, the correlation method is specifi- For GSR, the method used to compute the score for a gene cally designed to address this situation. GSR and ROC set is a key parameter. The two options currently sup- have the benefit of not requiring a threshold to divide ported are the mean and the median. During the analysis, genes into "selected" and "non-selected" genes. The GSR uses the selected method to compute a summary of choice of the threshold for ORA can have a substantial the gene scores for each resampled or real gene set, and effect on the results obtained, because the "selected genes" this aggregate score is used to represent the gene set. change [4]. Choosing the median will tend to yield slightly more con- Correlation analysis servative results, as individual genes with very high scores are not given as much weight as in the mean computation. Gene group correlation analysis (GCA) is based on the similarity of the expression profiles of genes in a gene set: Some settings are used for multiple methods. For exam- loosely speaking, how well they "cluster together". Thus ple, when a gene is represented more than once in the data we propose that GCA can be used as an alternative to set, a decision has to be made as to how to treat these "rep- using ORA to analyze clusters. There are some differences licates" (which might not be replicates per se but represent to be noted between the typical application of ORA to different transcripts). The options supported are to use the clusters and the ermineJ correlation analysis. GCA is "best" score among the replicates to represent them as a group-centric, not cluster-centric. Thus we ask whether the group; to use the mean; or to treat them as separate enti- correlation among the members is higher than expected ties. Use of the "best" option is somewhat anti-conserva- by chance, not whether a given set of correlated genes is tive, but is reasonable when most "replicates" are in fact enriched for the genes in the group; GCA does not involve assaying different biological entities. In contrast, treating clustering. This is not a trivial distinction, because while replicates completely separately is not generally advised as the highest scores will be obtained for gene groups that it can lead to spurious positive findings in cases of true have uniform and high correlations among all the mem- replicates, as the gene set gets "adulterated" with multiple bers, groups that have two or more "sub-clusters" can also copies of the same high-scoring gene. For this reason the obtain high scores. In the current implementation of Page 3 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 Figure 1 A: The main panel of ErmineJ after several analyses have been performed A: The main panel of ErmineJ after several analyses have been performed. Gene sets selected at low FDR levels are indicated in color. B: The tree-view panel of ErmineJ, illustrating the ability to browse gene sets in the GO hierarchy. The icons at each node have specific meanings. For example, the yellow "bull's-eye" icon indicates a gene sets selected at an FDR of 0.05 or less. Purple diamonds indicate nodes that have "significant" sub-nodes. GCA, the absolute value of the correlation is always used, using the fact that the ROC is equivalent to the Wilcoxon which allows. In future versions we may expose this as a rank-sum test [4]. The raw gene set score is simply the area user-settable option, as well as implementing other possi- under the receiver operator characteristic curve [14], ble correlation metrics other than the current Pearson cor- which ranges from values of 0.5 (random ranking) to 1.0 relation. (all genes in the gene set at the top of the ranking). Finally, for correlation analysis, the null hypothesis is that the In all methods, for each gene set analyzed, ermineJ com- mean pairwise correlation of profiles in the gene set is putes a score and, based on that score and the gene sets drawn from the global distribution of gene set correlation size, a p-value representing the "significance" of that gene scores, as determined by resampling [5]. The raw score is set with respect to the null hypothesis. The definition of the mean absolute value of the pair-wise correlation of the the raw score and the null hypothesis depends on the genes in the set (comparisons of a probe to itself, or to method being used. Note that the raw scores are of limited other probes for the same gene, are always ignored). use because it cannot be evaluated in the absence of infor- mation about the gene set size. However, they can provide ErmineJ includes implementations of three multiple test the user with a helpful indication the strength of the correction methods (though currently only one of these, result, not just its statistical significance. Benjamini-Hochberg false discovery rate (FDR) [15], is made available through the GUI). The additional options, For ORA, the null hypothesis is that the genes in the gene available from the command line, are Bonferroni correc- set are distributed randomly between the "selected" genes tion and a resampling-based family-wise error rate correc- and the "non-selected" genes. The raw score reported by tion [16]. The FDR is used in the GUI as a rapid and ErmineJ is the number of genes in the set which pass the reasonable guide to which gene sets are likely to be of threshold for gene selection. For GSR, the null hypothesis highest interest. is that the mean (or median) gene score (which forms the gene set score; for p-values negative-log-transformed val- The ermineJ GUI ues are used) is drawn from the global (data-wide) distri- Most users of ermineJ will access it through its graphical bution of possible gene set mean (or median) gene scores, interface. The GUI of ermineJ was designed to be simple as determined by resampling [5]. For ROC analysis, the to use and provides "wizards" to guide users through com- null hypothesis is that the genes in the gene set are distrib- mon tasks such as running an analysis. Many settings uted randomly in the ranking; p-values are computed made by the user during operation of the software are Page 4 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 A Figure 2 gene set details view A gene set details view. The controls at the top allow adjustment of the size and contrast of the heat map. The gene scores (in this case p-values) are shown in the second text column. The grey and blue graph, shown only for experiments using p-values, shows the expected (grey) and actual (blue) distribution of p-values in the gene set. This display is provided as an additional aid to evaluation of the results. The last two columns provide information about each gene. The targets of the hyperlinks are con- figurable by the user. remembered between sessions, facilitating repeated anal- names of genes they contain. User-defined gene sets are ysis of the same data files and maintaining the user's pre- displayed in contrasting colors. Not shown in the figures ferred window sizes, for example. A complete manual is is the initial startup screen in which the user chooses the provided and is accessible via an on-line help function, as gene annotation file to use for the session. web pages on our web site, or in portable document for- mat (PDF). Double-clicking on a gene set in the main panel opens a new window that displays the genes in the gene set, along Some aspects of the ermineJ graphical user interface is with the expression profiles in a "heat-map" view (if the illustrated in Figures 1, 2, 3. The main panel of the soft- user has provided the profile data; Figure 2). The appear- ware can be viewed either as a table of gene sets (Figure ance of the heat map is configurable through menus and 1A) or in a hierarchical (tree) view (Figure 1B). These toolbar controls. The data displayed in the table, as well as views are linked so changes in one are reflected in the the image of the matrix, can be saved to disk using addi- other. To facilitate navigation of these displayed, gene sets tional menu options. The hyperlinks to external web sites can be searched by the name of the gene set or by the can be configured by the user to point to a web site of their Page 5 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 E Figure 3 xamples of screens from ErmineJ Wizards Examples of screens from ErmineJ Wizards. A: Analysis wizard. This illustrates options to set the range of gene set sizes to analyze, and the method of treating "replicates" of genes. See text for details of the latter. B: Gene set modification wizard. In this screen the user is selecting genes to delete from a gene set. The list of all probe available on the platform is available in the left panel. A "find" function simplifies the location of genes and probes. choosing, again through a menu option. All of these capa- Once an analysis is initiated, the user is informed of its bilities are available even if the user has not performed progress via a status bar. An analysis can be cancelled any any analysis, so ErmineJ can be used as a "gene set time. On completion, the results are added to the tabular browser" as well as for analysis. and tree views (Figure 1). Multiple results can be dis- played simultaneously in the tabular view, allowing easy An important feature of the GUI is the capability to rap- comparison of different runs. The tree view can display idly define and edit gene sets, which is accomplished in a only a single analysis result set at a time, but offers a pull- "wizard" that takes the user through the process set-by- down menu to selected among the results sets to display. step. Alternatively, the user can simply populate the gene In the tree and tabular views, high-scoring (i.e., signifi- set directory with files they have obtained from other cant) gene sets are highlighted in color. The tree view uses sources, for example created in bulk with a Python script a simple system of icons for each node to indicate whether or obtained from another user. As far as we know, no tool a significant node is contained within a given higher level surveyed by [3] affords the user the ability to define or node. Finally, the results of an analysis can be saved to a modify the categories. ErmineJ also allows the user to tab-delimited file for use in other software or to be choose which of the GO aspects (Biological Process, etc.) reloaded by ermineJ at a later time. to use in the analysis. Other interfaces In addition to the GUI, ermineJ offers a command line The GUI version of ermineJ can be installed on the user's computer or run via Java WebStart. The latter option sim- interface (CLI) and a simple application programming ply involves clicking on a link in the user's web browser, interface (API). The CLI exposes some features of ermineJ and ensures that the users have the most up-to-date ver- that are not available in the GUI, such as different meth- sion of the software. The drawback of using WebStart is ods for multiple test correction. The CLI is suitable for that the user must be connected to the internet to use the scripting runs of ermineJ. For example, a simple Perl script software. With a local installation, no internet connection can be used to automate runs of ermineJ with different set- is needed. tings or on different data sets. In contrast, the API was introduced to allow programmers to include the analyses Running an analysis available in ermineJ in their own software. The API cur- Running an analysis using the ErmineJ GUI involves using rently provides more limited access to the functionality of a "wizard" to set the parameters (Figure 3). The user is the software than the command line version, but will be asked to choose an analysis method, select the data file to expanded in future versions. analyze, choose any user-defined gene sets to include in the analysis, and set the various parameters required for Performance the particular analysis. All settings are documented via We tested the performance of ermineJ using the HG- "tool tips" and in the manual. U133_Plus_2 Affymetrix array design. This is a particu- Page 6 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 larly large array design with over 54,000 probe sets, and expression profile set and the results. Therefore we recom- represents a something of a worst-case scenario with mend running ermineJ on machines that have at least 256 respect to performance. With our current annotation set, Mb of RAM. 4844 different GO categories (gene sets) are available for analysis in this array design. We limited our analysis to Future plans gene sets with between 5 and 100 genes, leaving about At this writing, the current version of ermineJ is 2.1.6. 2700 gene sets. The times reported below are for analyzing New features planned for the software include expanding the complete set of over 54,000 probe sets with respect to the API and allowing more flexible creation of user- these 2700 gene sets on a on a 1.7 GHz Pentium laptop. defined gene sets, including allowing support of alterna- tive nomenclatures such as the Plant Ontology [17]. We With this array, ermineJ has an initial startup phase that also plan to provide annotation files for more platforms lasts 15–20 seconds, most of which is consumed by time and organisms. it takes for the gene annotation file to be read in and proc- essed for analysis. The time for analysis once startup is We have been interested in the possibility of including completed depends on the method used. For ORA, a com- other resampling-based methods such as GSEA [18] or the plete analysis is completed in 8 seconds (average of 3 similar resampling method implemented in Catmap [4] runs; times are wall clock seconds timed from within the in ermineJ. The primary reason to consider these methods software). While it is difficult to directly compare our is that they examine the distribution of gene scores by benchmarks with previously published benchmarks resampling over the samples, which is more correct than because the number of gene sets analyzed and the size of merely resampling over the genes. This is because the null the "null" gene set was not reported, and the times hypotheses in the gene score analysis are some variation reported might in some cases include initial startup times on a random distribution of genes within the ranking of [3], the fastest reported methods on the largest data sets genes. This assumption can be badly violated for gene sets tested completed ORA analyses in under 10 seconds. This containing highly correlated genes (such as the ribosomal indicates that ErmineJ is at least competitive with and pos- protein genes); such genes will tend to have correlated sibly faster than the fastest previously reported tools. rankings, and in some situations (particularly when the gene p-value distribution is close to uniform), spurious GSR analysis took about 370 seconds if a full resampling false positives can occur [4]. The ORA, GSR and ROC is performed (100,000 resampling trials per gene set size methods are all susceptible to this problem, though we in our tests). However, ermineJ implements an approxi- stress that this is only an serious issue for gene sets that mation, where limited resampling is used to estimate the show high correlations not related to the experimental parameters of a normal distribution. This normal is used design. to compute the p-values for each gene set. It also takes advantage that, especially for larger class sizes, the shape It would be challenging to provide a general-purpose of the resampled distribution is very similar for similar implementation of GSEA or Catmap that is easily accessi- class sizes, so not all of them need to be computed. In this ble to biologists with limited computational skills. These mode the analysis takes approximately 80 seconds. ROC methods require either that users can provide the gene analysis, which does not involve resampling, took about scores for hundreds (if not thousands) of resampled data 100 seconds. Correlation analysis is the most computa- sets [4], a task that is difficult to accomplish for the tar- tionally intensive resampling method; even with the geted user base of ermineJ, or computation of gene scores approximations enabled it currently takes about 400 sec- by the software. Because each experimental design might onds to run on the test data set (which contained 12 have a different mechanism for computing gene scores microarrays). This is because computing correlations is (fold-change, t-test, ANVOA, Cox regression, etc), it computationally intensive, compared to the methods would be difficult to provide a fully flexible tool without which use pre-computed gene scores such as p-values. including a full-fledged statistical analysis package as well. A feasible solution we are considering is to cover the most ErmineJ is fairly memory-intensive, because it holds in frequently-encountered situations (e.g., t-test and one- memory a complex data structure describing the annota- way ANOVA). tions, as well as the microarray data and information about the results for thousands of gene sets and tens of Conclusion thousands of genes. For the large HG-U133_Plus_2 ErmineJ is a fast, full-featured, user-friendly, multi-plat- design, after startup ermineJ occupies approximately 85 form open source application for analysis of gene sets. It Mb of RAM (determined using a Java heap profiler under implements multiple algorithms for performing the anal- Windows). After running the correlation analysis, this ysis, and permits easy modification and creation of new grew to 105 Mb, reflecting the loading of the complete gene sets. These features afford users considerable flexibil- Page 7 of 8 (page number not for citation purposes) BMC Bioinformatics 2005, 6:269 http://www.biomedcentral.com/1471-2105/6/269 ity in testing different methods and parameters. Perhaps References 1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, the greatest current limitation to its usability at this date is Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel- the availability of gene annotation files for non-Affyme- Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, trix array designs we have not encountered frequently. Rubin GM, Sherlock G: Gene ontology: tool for the unification of biol- ogy. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29. Users who wish to develop annotation files for their plat- 2. Gene Ontology Tools [http://www.geneontology.org/ form should contact us for assistance. GO.tools.shtml] 3. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinfor- Availability and requirements matics 2005, 21(18):3587-3595. Project name: ErmineJ 4. Breslin T, Eden P, Krogh M: Comparing functional annotation analyses with Catmap. BMC Bioinformatics 2004, 5(1):193. 5. Pavlidis P, Lewis DP, Noble WS: Exploring gene expression data Project home page: http://microarray.cu-genome.org/ with class scores. Pac Symp Biocomput 2002:474-485. ermineJ/ 6. Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the gene ontol- ogy for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Operating system(s): Platform independent Neurochem Res 2004, 29(6):1213-1222. 7. Java [http://java.sun.com/] 8. Colt [http://dsd.lbl.gov/~hoschek/colt/] Programming language: Java 9. Jakarta Commons [http://jakarta.apache.org/commons/] 10. Xerces [http://xml.apache.org/] 11. Gene Ontology Consoritum [http://www.geneontology.org/] Other requirements: Java 1.4 or higher; 256 Mb RAM 12. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene- recommended. centered information at NCBI. Nucleic Acids Res 2005, 33(Data- base issue):D54-8. 13. Microarray annotation files [http://microarray.cu-genome.org/ License: GNU GPL and LPGL for helper library. annots/] 14. Duda RO, Hart PE, Stork DG: Pattern classification. 2nd edition. New York , Wiley; 2001:xx, 654. Any restrictions to use by non-academics: None 15. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Jour- List of Abbreviations nal of the Royal Statistical Society B 1995, 57:289-300. 16. Westfall PH, Young SS: Resampling-based multiple testing. ORA: Over-representation analysis New York , John Wiley & Sons, Inc.; 1993:340. 17. The Plant Ontology [http://www.plantontology.org/] GSR: Gene score resampling 18. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gil- lette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach ROC: Receiver operator characteristic for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005, 102(43):15545-15550. GCA: Gene group correlation analysis GSEA: Gene Set Enrichment Analysis FDR: False discovery rate GO: Gene Ontology GUI: Graphical User Interface API: Application Programming Interface CLI: Command Line Interface Publish with Bio Med Central and every scientist can read your work free of charge Authors' contributions "BioMed Central will be the most significant development for PP was the project lead and chief architect of the software, disseminating the results of biomedical researc h in our lifetime." and contributed to the source code. HKL, WB and KK all Sir Paul Nurse, Cancer Research UK contributed to the source code. Your research papers will be: available free of charge to the entire biomedical community Acknowledgements peer reviewed and published immediately upon acceptance We thank Shahmil Merchant and Edward Chen for contributions to an early version of ErmineJ, and William Noble for supporting the development of cited in PubMed and archived on PubMed Central the methods, and Neil Segal for providing the microarray data used in the yours — you keep the copyright screen shots. We also thank testers and users who provided bug reports BioMedcentral Submit your manuscript here: and suggestions for improvements. http://www.biomedcentral.com/info/publishing_adv.asp Page 8 of 8 (page number not for citation purposes)

Journal

BMC Bioinformatics – Springer Journals

Published: Nov 9, 2005

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

ErmineJ: Tool for functional analysis of gene expression data sets

ErmineJ: Tool for functional analysis of gene expression data sets

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

ErmineJ: Tool for functional analysis of gene expression data sets

ErmineJ: Tool for functional analysis of gene expression data sets

References (18)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies