Comparative evaluation of gene-set analysis methods

Qi Liu; Irina Dinu; Adeniyi J Adewale; John D Potter; Yutaka Yasui

doi:10.1186/1471-2105-8-431

2007-11-07

Background: Multiple data-analytic methods have been proposed for evaluating gene-expression levels in specific biological pathways, assessing differential expression associated with a binary phenotype. Following Goeman and Bühlmann's recent review, we compared the statistical performance of three methods, namely Global Test, ANCOVA Global Test, and SAM-GS, that test "self-contained null hypotheses" via subject sampling. The three methods were compared based on a simulation experiment and analyses of three real-world microarray datasets.
Results: In the simulation experiment, we found that the use of the asymptotic distribution in the two Global Tests leads to a statistical test with an incorrect size. Specifically, p-values calculated by the scaled F distribution of Global Test and the asymptotic distribution of ANCOVA Global Test are too liberal, while the asymptotic distribution with a quadratic form of the Global Test results in p-values that are too conservative. The two Global Tests with permutation-based inference, however, gave a correct size. While the three methods showed similar power using permutation inference after a proper standardization of gene expression data, SAM-GS showed slightly higher power than the Global Tests. In the analysis of a real-world microarray dataset, the two Global Tests gave markedly different results, compared to SAM-GS, in identifying pathways whose gene expressions are associated with p53 mutation in cancer cell lines. A proper standardization of gene expression variances is necessary for the two Global Tests in order to produce biologically sensible results. After the standardization, the three methods gave very similar biologically sensible results, with slightly higher statistical significance given by SAM-GS. The three methods gave similar patterns of results in the analysis of the other two microarray datasets.
Conclusion: An appropriate standardization makes the performance of all three methods similar, given the use of permutation-based inference. SAM-GS tends to have slightly higher power in the lower α-level region (i.e. gene sets that are of the greatest interest). Global Test and ANCOVA Global Test have the important advantage of being able to analyze continuous and survival phenotypes and to adjust for covariates. A free Microsoft Excel Add-In to perform SAM-GS is available from http://www.ualberta.ca/~yyasui/homepage.html.

BMC Bioinformatics 2007, 8:431 http://www.biomedcentral.com/1471-2105/8/431

Background

Some microarray-based gene expression analyses such as Significance Analysis of Microarray (SAM) [1] aim to discover individual genes whose expression levels are associated with a phenotype of interest. Such individual-gene analyses can be enhanced by utilizing existing knowledge of biological pathways, or sets of individual genes (hereafter referred to as "gene sets"), that are linked via related biological functions. Gene-set analyses aim to discover gene sets the expression of which is associated with a phenotype of interest.

Many gene-set analysis methods have been proposed previously. For example, Mootha et al. [2] proposed Gene Set Enrichment Analysis (GSEA), which uses the Kolmogorov-Smirnov statistic to measure the degree of differential gene expression in a gene set by a binary phenotype (see also [3]). Goeman et al. [4] presented Global Test, modeling differential gene expression by means of random-effects logistic regression models. Goeman et al. [5] also extended their methods to continuous and survival outcomes. Mansmann and Meister [6] proposed ANCOVA Global Test, which is similar to Global Test but has the roles of phenotype and genes exchanged in the regression models. Mansmann and Meister [6] pointed out that their ANCOVA Global Test outperformed Global Test, especially in cases where the asymptotic distribution of Global Test cannot be used. Dinu et al. [7] discussed some critical problems of GSEA as a method for gene-set analysis and proposed an alternative method called SAM-GS, an extension of SAM to gene-set analysis. Goeman and Bühlmann [8] provided an excellent review of the methods, discussing important methodological questions of gene-set analysis, and summarized the methodological principles behind the existing methods. An important contribution of their review was the distinction between testing "self-contained null hypotheses" via subject sampling and testing "competitive null hypotheses" via gene sampling. They argue, and we agree, that the framework of competitive hypothesis testing via gene sampling is subject to serious errors in calculating and interpreting statistical significance of gene sets, because of its implicit or explicit untenable assumption of probabilistic independence across genes.

Although Global Test, ANCOVA Global Test, and SAM-GS each test a self-contained hypothesis on the association of expression patterns across a gene set with a phenotype of interest in a statistically appropriate manner, it is unclear how the three methods compare on performance in detecting underlying associations. In this paper, we compare the performance of the three methods via simulation and real-world microarray data analyses, both statistically and biologically.

Results

Simulation experiment

Our first evaluation of the three methods used a simulation study, similar to that of Mansmann and Meister [6] with some modifications that make the simulated data more realistic, and evaluated the size and power of the three hypothesis tests. Gene-set analysis was performed for both the original and "z-score standardized" simulated datasets so that the effects of standardization on the three tests' performance can be assessed. The z-score standardization was motivated by the fact that gene-expression variances can vary greatly across genes, even after a normalization, which could influence gene-set analysis. In the z-score standardized datasets, gene expression was standardized using the following equation:

$$x'_{jk} = \frac{x_{jk} - \bar{x}_j}{s_j} \qquad (1)$$

where $x_{jk}$ is the gene expression for gene j in sample k, and $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of gene j expression using all samples, respectively. All simulation analyses compared the mean expression of a gene set of interest between two groups, each with a sample of 10 observations.

First, we checked the size of the three tests, before and after the standardization, according to the following three scenarios of no differential expression between two groups: (1) randomly generate expression of 100 genes for the two groups from a multivariate normal distribution (MVN) with a mean vector μ and a diagonal variance-covariance matrix Σ, where the 100 elements of μ and the 100 diagonal elements of Σ were randomly generated as 100 independently-and-identically-distributed (i.i.d.) uniform random variables in (0, 10) and 100 i.i.d. uniform random variables in (0.1, 10), respectively (i.e., no gene was differentially expressed between the two groups and expression was uncorrelated among the 100 genes); (2) exactly the same as (1) except that the variance-covariance matrix Σ of the MVN was changed to have a correlation of 0.5 between all pairs of the first 20 genes and also between all pairs of the second 20 genes; (3) exactly the same as (2) with the correlation value changed from 0.5 to 0.9.

Second, we estimated the power of the three tests, before and after the standardization, by randomly generating a gene set of size 100, using exactly the same simulation set-up as the size-evaluation (2) above, but allowing the first 40 genes to be differentially expressed. The mean expression of the 40 differentially expressed genes was randomly
generated from Uniform(0, 10) as in the size-evaluation (2), but was subsequently modified by an addition and a subtraction of a constant γ, as in Mansmann and Meister [6], such that the mean vectors $\mu_i$ for the two groups (i = 1, 2) differ by 2γ, $\mu_{1j} - \mu_{2j} = (-1)^{j>20}\, 2\gamma$, for j = 1, ..., 40 (reading the exponent as an indicator, the difference is +2γ for j ≤ 20 and −2γ for j > 20). We considered a range of γ from 0 to 2 with an increment of 0.1. The 40 differentially expressed genes were set to have a correlation of 0.5, as in the size-evaluation (2), but no correlation and a correlation of 0.9 were also considered.

In the comparison of size across the three tests, the size was estimated by the observed proportion of replications with a p-value smaller than the correct size α. By definition, under the null hypothesis, a proportion α of the replications of an experiment is expected to yield a p-value smaller than α. In order to assess the size, we ran 5000 replications and used α = 0.05. For each permutation-based p-value, 1000 random permutations were carried out.

In the comparison of power across the three tests, the power was estimated by the observed proportion of the replications of an experiment in which the null hypothesis was correctly rejected. Given the fixed numbers of samples and genes with the fixed correlation structure in the simulation experiment, a larger effect size γ leads to higher power for a given α-level. In estimating the power, we ran 1000 replications of an experiment for each γ value. We considered α at 0.05, 0.01, 0.005, 0.0025, and 0.001. For obtaining a permutation-based p-value, 1000 random permutations were carried out.

The empirical Type I error rates of SAM-GS and the two Global Tests with permutations were almost right on the target of the nominal value of 0.05, before and after the standardization, for all three scenarios considered for the evaluation of size (Table 1). Type I error rates of Global Test with the scaled F null distribution and Global Test with the asymptotic null distribution with a quadratic form deviated noticeably from the nominal size, being too liberal with the scaled F and too conservative with the asymptotic distribution (non-F-distributed quadratic form), as shown in Table 1. As the correlation among the 40 genes increased, the Type I error rates of Global Test with the scaled F null distribution and Global Test with the asymptotic null distribution with a quadratic form generally moved towards the nominal size of 0.05. Type I error rates of ANCOVA Global Test with the asymptotic distribution also deviated noticeably from the nominal size: 0.0692, 0.1034 and 0.0898 before the standardization and 0.0372, 0.0848 and 0.0792 after the standardization, for r = 0, 0.5, and 0.9, respectively. Hereafter, therefore, the p-values for Global Test and ANCOVA Global Test are calculated based on permutations. We also estimated the size of the three tests using 25 samples, instead of 10 samples, in each group, and observed similar patterns. As the sample size increased, the Type I error rates of the two Global Tests using the asymptotic distributions moved towards the nominal level of 0.05.

Table 1: Assessment of type I error probabilities

                                                     10 vs. 10 samples             25 vs. 25 samples
Standardization  Method              Inference     r = 0     r = 0.5   r = 0.9   r = 0     r = 0.5   r = 0.9
Before           Global Test         Scaled F      0.0982    0.0778    0.0722    0.0696    0.0700    0.0686
Before           Global Test         Asymptotic    0.0006    0.0128    0.0298    0.0090    0.0328    0.0442
Before           Global Test         Permutation   0.0496    0.0434    0.0464    0.0534    0.0554    0.0556
Before           ANCOVA Global Test  Asymptotic    0.0692    0.1034    0.0898    0.0576    0.0840    0.0736
Before           ANCOVA Global Test  Permutation   0.0482    0.0462    0.0458    0.0526    0.0552    0.0562
Before           SAM-GS              Permutation   0.0498    0.0462    0.0478    0.0514    0.0518    0.0556
After            Global Test         Scaled F      0.1090    0.0844    0.0736    0.0734    0.0702    0.0698
After            Global Test         Asymptotic    <0.0001   0.0094    0.0276    0.0036    0.0320    0.0424
After            Global Test         Permutation   0.0524    0.0464    0.0458    0.0524    0.0528    0.0530
After            ANCOVA Global Test  Asymptotic    0.0372    0.0848    0.0792    0.0474    0.0838    0.0730
After            ANCOVA Global Test  Permutation   0.0532    0.0462    0.0466    0.0544    0.0542    0.0544
After            SAM-GS              Permutation   0.0522    0.0468    0.0470    0.0526    0.0540    0.0542

The second step of the simulation was to assess power, the results of which are shown in Figures 1 to 6. Before the standardization, SAM-GS showed higher power than the Global Tests at α = 0.05, with increasing power differences with decreasing α levels. This pattern was observed regardless of the correlation level in the 40 differentially expressed genes (correlation of 0, 0.5, or 0.9). After the standardization, the performances of the three methods became closer: SAM-GS showed slightly higher power than the Global Tests, with increasing power difference with decreasing α levels.

Figure 1: The results of the simulation experiment, evaluating power of the three tests before the standardization, for correlation of 0 among 40 genes.
Figure 2: The results of the simulation experiment, evaluating power of the three tests after the standardization, for correlation of 0 among 40 genes.
Figure 3: The results of the simulation experiment, evaluating power of the three tests before the standardization, for correlation of 0.5 among 40 genes.
Figure 4: The results of the simulation experiment, evaluating power of the three tests after the standardization, for correlation of 0.5 among 40 genes.
Figure 5: The results of the simulation experiment, evaluating power of the three tests before the standardization, for correlation of 0.9 among 40 genes.
Figure 6: The results of the simulation experiment, evaluating power of the three tests after the standardization, for correlation of 0.9 among 40 genes.
(Figures 1 to 6 each have five panels, for α = 0.05, 0.01, 0.005, 0.0025 and 0.001, plotting Power against Gamma for Global Test, Global Ancova and SAM-GS.)

Real-world data analyses

Our next evaluation of the performance of the three methods used biologically, a priori defined gene sets and three microarray datasets considered in Subramanian et al. [3], downloaded from the GSEA web page [9]: 17 p53 wild-type vs. 33 p53 mutant cancer cell lines; 15 male vs. 17 female lymphoblastoid cells; 24 acute lymphoid leukemia (ALL) vs. 24 acute myeloid leukemia (AML) cells. For pathways/gene sets, we used Subramanian et al.'s gene-set subcatalogs C1 and C2 from the same web address, under "Molecular Signature Database." The C1 catalog includes gene sets corresponding to human chromosomes and cytogenetic bands, while the C2 catalog includes gene sets that are involved in specific metabolic signaling pathways [3]. In Subramanian et al., Catalog C1 included 24 sets, one for each of the 24 human chromosomes, and 295 sets corresponding to the cytogenetic bands; Catalog C2 consisted of 472 sets containing gene sets reported in manually curated databases and 50 sets containing genes reported in various experimental papers. Following Subramanian et al. [3], we restricted the set size to be between 15 and 500, resulting in 308 pathways to be examined.

We compared the performance of the three methods before and after the standardization by listing the gene sets which had a p-value ≤ 0.001 by any of the three methods.

Table 2 shows the associations of biologically defined gene sets with the phenotype, assessed by Global Test, ANCOVA Global Test, and SAM-GS, in the analysis of gene expression differences between p53 wild-type vs. mutant cancer cell lines. Gene sets with a p-value ≤ 0.001 by any of the three methods are listed in Table 2. Before the standardization, SAM-GS identified 16 gene sets with a p-value ≤ 0.001, while Global Test and ANCOVA Global Test identified three and one gene sets, respectively, with a p-value ≤ 0.001 (Table 2). Two of these three sets were among the 16 sets identified by SAM-GS. The third set was CR_DEATH, which had a p-value = 0.008 from SAM-GS. Among the 17 gene sets listed in Table 2, seven involve p53 directly as a gene-set member. Furthermore, five gene sets directly involve the extrinsic and intrinsic apoptosis pathways, and three involve the cell-cycle machinery; each of these eight gene sets, then, is in a direct, well-established relationship with aspects of p53 signaling. The remaining two gene sets have plausible, if less well established, links with p53 [7]. The disagreement between the results of SAM-GS and the two Global Tests was considerable before standardization. Although 16 of the 17 gene sets in Table 2 had a SAM-GS p-value ≤ 0.001, 7 had p-values larger than 0.1 by the two Global Tests. For example, SAM-GS identified the gene set p53hypoxia pathway as a significant set with a p-value < 0.001, which seems biologically appropriate, yet the Global Test and the ANCOVA Global Test gave p-values greater than 0.6.

Table 2: Gene sets in the p53 dataset with P-value ≤ 0.001 by any of the three methods

                              Before standardization     After standardization      VSN
Gene set                      Global   Ancova   SAM-GS   Global   Ancova   SAM-GS   Global   Ancova   SAM-GS
ATM Pathway*                  <0.001   <0.001   <0.001   <0.001   0.002    <0.001   0.001    0.001    <0.001
BAD Pathway**                 <0.001   0.007    <0.001   <0.001   <0.001   <0.001   0.004    0.004    <0.001
Calcineurin Pathway$          0.068    0.084    <0.001   0.007    0.002    <0.001   0.004    0.005    0.011
Cell cycle regulator†         0.021    0.017    <0.001   0.002    0.001    <0.001   0.002    <0.001   0.003
Hsp27Pathway**                0.047    0.044    <0.001   <0.001   0.001    <0.001   0.011    0.005    <0.001
Mitochondria pathway**        0.002    0.002    <0.001   0.007    0.007    <0.001   0.013    0.006    <0.001
p53 signaling pathway*        0.112    0.101    <0.001   0.003    0.003    0.001    0.006    0.005    0.006
p53_UP*                       0.003    0.004    <0.001   <0.001   <0.001   <0.001   0.019    0.018    <0.001
p53hypoxiaPathway*            0.626    0.622    <0.001   <0.001   <0.001   <0.001   0.044    0.041    <0.001
p53Pathway*                   0.142    0.150    <0.001   <0.001   <0.001   <0.001   <0.001   0.001    <0.001
Raccycd Pathway†              0.177    0.181    <0.001   0.001    <0.001   <0.001   0.004    0.009    0.006
Radiation_sensitivity*        0.119    0.135    <0.001   <0.001   <0.001   <0.001   0.014    0.020    <0.001
SA_TRKA_RECEPTOR‡             0.254    0.252    <0.001   0.001    <0.001   <0.001   0.004    0.001    0.006
bcl2family & reg. network**   0.102    0.100    0.001    0.001    0.005    <0.001   0.010    0.014    0.001
Cell cycle arrest†            0.099    0.099    0.001    0.027    0.018    0.005    0.003    0.005    0.007
Ceramide Pathway**            0.002    0.006    0.001    0.004    0.004    <0.001   0.001    0.001    <0.001
CR_DEATH*                     0.001    0.004    0.008    0.029    0.017    0.004    0.143    0.108    0.005

* pathway member
** apoptosis
$ p53-induced proline oxidase mediates apoptosis via a calcineurin-dependent pathway
† cell cycle
‡ integrated negative feedback loop between Akt and p53

We then compared the three methods incorporating the z-score standardization. For SAM-GS, the p-values before and after the standardization were highly consistent, and, therefore, we used the results of SAM-GS before the standardization for the comparisons with the other two methods. For Global Test and ANCOVA Global Test, p-values changed appreciably. Notably, p-values of Global Test and ANCOVA Global Test after the standardization agreed closely with those of SAM-GS (Table 2, Figure 7). For example, the p-values of p53hypoxia pathway changed from 0.626 to <0.001 for Global Test and from 0.622 to <0.001 for ANCOVA Global Test. Although the p-values of the three methods agreed with each other after the standardization, the p-values from SAM-GS tended to be smaller than those from Global Test and ANCOVA Global Test in the lower range of p-values (gene sets that are of the greatest interest) (Table 2, Figures 7 and 8): this is consistent with the power-comparison simulation in which SAM-GS showed slightly higher power than the Global Tests at small α levels, even after the standardization.

Figure 7: P-values of 308 gene sets in the p53 data analysis: p-values of Global Test and ANCOVA Global Test after standardization vs. SAM-GS p-values before the standardization. The line indicates equal p-values between SAM-GS and Global Tests.
Figure 8: Lowest P-values in the p53 data analysis: p-values of Global Test and ANCOVA Global Test after standardization vs. SAM-GS p-values before the standardization. The line indicates equal p-values between SAM-GS and Global Tests.
(Figures 7 and 8 plot Global P-value against SAM-GS P-value for Global Test and Global Ancova.)

The same pattern was found in the analyses of the male-vs.-female lymphoblastoid dataset and the ALL-vs.-AML dataset (see Figures S1, S2, S3 and S4 in Additional file 1, comparing the results from the three methods). Before the standardization, p-values from Global Test and ANCOVA Global Test differed greatly from p-values from SAM-GS. The p-values of Global Test and ANCOVA Global Test changed markedly after the standardization and were very close to those of SAM-GS. After the standardization, in the male-vs.-female analysis, 21 gene sets had a p-value < 0.15 by one or more of the methods; 17 of these had a SAM-GS p-value smaller than, or equal to, those of Global Test and ANCOVA Global Test. In the ALL-vs.-AML analysis, all sets were statistically significant with p-values < 0.001 by all three tests, which is unlikely to be of any biological significance.

In the User Guides for Global Test and ANCOVA Global Test, Variance Stabilization (VSN) was used to normalize the data [10,11]. We also assessed the performance of the three methods on the p53 dataset, the male-vs.-female dataset, and the ALL/AML dataset using VSN. The results for the p53 dataset are shown in Table 2 and Figure 9. When VSN was used for the normalization of the data, we observed: (1) p-values of Global Test and ANCOVA Global Test became similar to those of SAM-GS, but not as close as the p-values after the z-score standardization; and (2) in the lower range of p-values, the p-values for SAM-GS tended to be smaller than those of Global Test and ANCOVA Global Test (Table 2, Figure 9).

Figure 9: Lowest P-values in the p53 data analysis: p-values of Global Test and ANCOVA Global Test after the VSN normalization vs. SAM-GS p-values after the VSN normalization. The line indicates equal p-values between SAM-GS and Global Tests.

Discussion

From the simulation results, we suggest that, when Global Test and ANCOVA Global Test are used for the analysis of microarray data, permutations should always be used for the calculation of statistical significance. In the documentation included with the Global Test R package, Goeman et al. noted that the asymptotic distribution with a quadratic form is the recommended method for large sample sizes and that it can be slightly conservative for small samples. In our simulation study, we used 10 and 25 samples for each of the two groups. In each situation, the asymptotic method with a quadratic form gave conservative p-values, although the difference between asymptotic and permutation-based methods did decrease when the sample size increased. Goeman et al. also noted that the scaled F method can be slightly anti-conservative, especially for large gene sets. Our simulation study showed that the scaled F method can be markedly anti-conservative. This is in accord with the manual of Global Test, which recommends against using the scaled F approximation.

We found that the performance of the two Global Tests changed greatly before and after standardization, but SAM-GS performance remained unchanged. This can be explained by: (1) the invariance of t-test statistics under shifting and rescaling of data, which is relevant to SAM-GS; (2) ANCOVA's explicit assumption that all genes in the set have an equal variance, a violation of which would clearly affect the performance of ANCOVA Global Test; and (3) Global Test's assumption that the regression coefficients come from the same normal distribution, an assumption that is met by the standardization of gene expression. Therefore, some sort of standardization that makes the variances of gene expression similar across genes is needed before using Global Test and ANCOVA Global Test. SAM-GS employs a constant in the denominator of its t-like test statistic to address the small variability in some of the gene expression measurements and, thus, effectively standardizes expression across genes; neither Global Test nor ANCOVA Global Test addresses this characteristic of microarray data. Both Goeman et al. [4] and Mansmann and Meister [6] have stated that an appropriate normalization is important. Note that not many normalization methods would standardize the expression across genes. It is only after applying the z-score standardization (1) or the VSN normalization that the results of the three methods became congruent. The similarity between Global Test and ANCOVA Global Test has already been commented upon in [6]. The similarity between SAM-GS and Global Test may be inferred from the construction of the latter as a weighted sum of squared transformed t-statistics [12], which is similar to the SAM-GS test statistic.

It should be noted that Global Test allows four different types of phenotype variables: binary; multi-class; continuous; and survival. ANCOVA Global Test allows binary, multi-class, and continuous phenotypes. The ability to handle different classes of phenotypes is a very important advantage of Global Test and ANCOVA Global Test over SAM-GS. It is also possible to use Global Test and ANCOVA Global Test while adjusting for covariates (e.g., potential confounders). If covariates are incorporated, the two tests assess whether the gene-expression profile has an independent association with the phenotype that is above and beyond what is explained by the covariates. The ability to adjust for covariates is another important advantage of Global Test and ANCOVA Global Test over SAM-GS.

We focused on p-values in this paper because we were comparing the three methods that test "self-contained null hypotheses" via subject sampling. To account for multiple comparisons when multiple gene sets are tested, one might consider the False Discovery Rate (FDR) instead of the Type I error probability. For example, SAM uses a q-value, an upper limit of the FDR, for each gene, which could be extended here to each gene set using the method of Storey [13]. The q-values of the 17 gene sets listed in Table 2 are displayed in Additional file 2.

We have considered, but did not report detailed comparison results for, two other methods, Tian et al. [14] and Tomfohr et al. [15], that test self-contained hypotheses via subject sampling, in addition to the three methods we highlighted above. Tian et al. [14] test the significance of a gene set by taking the mean of t-values of genes in the gene set as a test statistic and evaluating its significance by a permutation test. Tomfohr et al. [15] reduce the gene set's expression into a single summary value by taking the first principal component of expressions of genes in the gene set and perform a permutation-based t-test of the single summary. The two methods gave appreciably different results when compared to Global Test, ANCOVA Global Test, and SAM-GS. Of the 17 gene sets in Table 2 for the p53 analysis, for instance, Tian et al. and Tomfohr et al. identified only eight and one gene sets, respectively, with p-value < 0.10: the ATM pathway, for example, was identified by Global Test, ANCOVA Global Test, and SAM-GS with p-value ≤ 0.001, while the methods of Tian et al. and Tomfohr et al. gave p-value = 0.61 and 0.99, respectively. The main reasons for their large discrepancies from the results of the three highlighted methods are as follows. Tian et al. sum up the t-values for all the genes in a gene set, which will result in cancellation of large positive t-values and large negative t-values. Among the 11 up-regulated and 8 down-regulated genes in the ATM pathway, for example, two up-regulated genes had large positive t-values (about 2 or greater) and three down-regulated genes had large negative t-values (about −2 or smaller): these large positive and negative t-values cancel each other when summing up all t-values in the Tian et al. test statistic, leading to reduced power for detecting gene sets that contain both significantly up-regulated genes and significantly down-regulated genes. The method of Tomfohr et al. summarizes the |S|-dimension gene-expression vector of genes in the gene set S by the first principal component without considering the phenotype: if the direction of the first principal component does not correspond to the direction that separates the two phenotypes, their method does not capture the differential expressions even when they exist, leading to markedly reduced power.

Although we focused on the comparison of the "self-contained null hypothesis" approaches, it is also of methodological interest to see how "competitive null hypothesis" approaches compare. We, therefore, applied three "competitive null hypothesis" approaches to the analysis of the p53 dataset: Gene Set Enrichment Analysis (GSEA) [2]; the Significance Analysis of Function and Expression (SAFE) [16]; and Fisher's exact test [17]. The results are shown in Additional file 3. The results from the three "competitive null hypothesis" approaches were greatly different from those of SAM-GS and the Global Tests. Most of the gene sets identified as being significantly associated with the p53 mutation by SAM-GS and the Global Tests were not identified as such by the three "competitive null hypothesis" approaches. The only gene set additionally identified as being significantly associated with the p53 mutation (with p < 0.001) was HUMAN_CD34_ENRICHED_TF_JP: for this gene set, the Fisher's exact test p-value was < 0.001, but all the other five methods gave p-values > 0.37. Known biological functions of p53 are clearly more consistent with the results of the "self-contained null hypothesis" approaches. The differences observed between "self-contained null hypothesis" and "competitive null hypothesis" approaches can be attributed, at least partly, to the fact that the significance of a gene set depends only on the genes in the set under the "self-contained null hypothesis" testing, while, under the "competitive null hypothesis" testing, the significance of a gene set depends not only on the genes in the set but also on all the other genes in the array.

In summary, the primary advantage of SAM-GS may be the slightly higher power in the low α-level region that is of highest scientific interest, whereas, despite the need for appropriate standardization, Global Test and ANCOVA Global Test can be used for a variety of phenotypes and can incorporate covariates in the analysis.

from a common distribution with mean zero and variance $\tau^2$, the null hypothesis of no differential gene-expression is reduced to $\tau^2 = 0$. Using the notation $r_i = \sum_j x_{ij}\beta_j$, the model simplifies to a random-effects model: $E(Y_i \mid r_i) = h^{-1}(\alpha + r_i)$. The null hypothesis can then be tested, based on a score test statistic discussed in Le Cessie and Van Houwelingen [18] and Houwing-Duistermaat et al. [19]:

$$Q = \frac{(Y - \mu)'\, R\, (Y - \mu)}{\mu_2},$$

where $R = (1/m)XX'$, $\mu = h^{-1}(\alpha)$, and $\mu_2$ is the second central moment of Y under the null hypothesis. It can be shown that Q is asymptotically normally distributed (a quadratic form which is non-negative). However, when the sample size is small, a better approximation to the distribution of Q is a scaled F distribution. The p-value can, therefore, be calculated based on an approximate distribution of the test statistic, i.e., the asymptotic distribution with a non-chi-squared distributed quadratic form or the scaled F distribution, or permutations of samples.
2) ANCOVA Global Test Conclusion The null hypothesis of the Global Test is in the form of In conclusion, Global Test and ANCOVA Global Test P(Y|X) = P(Y). The ANCOVA Global Test changes the roles require appropriate standardization of gene expression of gene expression pattern X and phenotype Y, and the measurements across genes for proper performance. null hypothesis becomes P(X|Y = 1) = P(X|Y = 2), or, for Standardization of these two methods and the use of per- each gene j in a gene set of interest, P = P , where P is mutation inference make the performance of all three 1j 2j ij methods similar, with a slight power advantage in SAM- the mean expression of gene j in phenotype group i, i = GS. Global Test and the ANCOVA Global Test can be used 1,2. A linear model of the form, P = P + D + E + J , with ij i j ij for a variety of phenotypes and incorporate covariates in group effects D , gene effects E , and the gene-group interac- the analysis. tion J , is then used to test the null hypothesis. The condi- Methods tions 6 D = 6 E = 6 J = 6 J = 0 ensure identifiability of the i j i ij j ij In this section, we describe the three gene-set analysis parameters. The null hypothesis under the parameteriza- methods. The phenotype of interest is assumed to be tion of the linear model is H : D = J = 0. The test statistic 0 i ij binary. is the F-test statistic for linear models: 1) Global Test F=− {(SSR SSR ) /(df − df )}/{SSR / df } HH H H H H 01 1 0 1 1 The Global Test is based on a regression model that pre- where SSR and df denote the sum of squares and dicts response from the gene expression measurements of H H a gene set [4]. Generalized linear models are used to degrees of freedom, respectively, under the hypothesis H. 
model the dependency of response Y (an n × 1 vector) on The p-value can be calculated by a permutation distribu- gene expression measurements X (an n × m matrix) of a set tion of the F statistic or an asymptotic distribution of the of m genes on n samples: test statistic. 3) SAM-GS hE(Y |βα ) =+ xβ , i= 12 , ,..., n, () iij j ∑ SAM-GS extends SAM to gene-set analysis. SAM-GS tests a j =1 null hypothesis that the mean vectors of expression of genes in a gene set does not differ by the phenotype of where h denotes the link function and D and E 's are interest. The SAM-GS method is based on individual t-like parameters. If the genes are not differentially expressed, statistics from SAM, addressing the small variability prob- the regression coefficients (E 's) should be zero. Under an lem encountered in microarray data, i.e., reducing the sta- assumption that all regression coefficients are sampled tistical significance associated with genes with very little Page 13 of 15 SDJHQXPEHUQRWIRUFLWDWLRQSXUSRVHV %0&%LRLQIRUPDWLFV 2007, :431 http://www.biomedcentral.com/1471-2105/8/431 variation in their expression. For each gene j, the d statistic Operating system(s): Microsoft Windows XP is calculated as in SAM: Programming language: R 2.4.x and Microsoft Excel 2003 or 2007 xj ()−x (j) dj () = , sj ()+s Abbreviations Significance Analysis of Microarray for Gene Sets (SAM- where the 'gene-specific scatter' s(j) is a pooled standard GS) deviation over the two groups of the phenotype, and s is a small positive constant that adjusts for the small varia- Authors' contributions bility [1]. SAM-GS then summarizes these standardized JDP provided biological interpretations of the analysis differences in all genes in the gene set S by: results of the real-world dataset. QL and ID contributed significantly to data analysis, refinement of SAM-GS, and || S programming. 
The manuscript was written primarily by SAMGS = d QL, ID, and YY, and critically reviewed and revised by all i =1 authors. All authors read and approved the final manu- script. A permutation distribution of the SAMGS statistic is used to calculate the p-value. We note that even though the Additional material recalculation of s is needed for each permutation, practi- cally the implication is small, and both SAM and SAM-GS excel add-ins do not recalculate s . Additional file 1 The analysis results of the two real-world microarray datasets (gender and Each of the three methods provides a statistically valid test leukemia) by the three methods. These three methods were applied and of the null hypothesis of no differential gene expression compared on two real-world microarray datasets: the male vs. female lym- phoblastoid cell microarray dataset and the ALL- and AML-cell microar- across a binary phenotype. ray dataset. Click here for file For the purpose of methodological comparisons, we also [http://www.biomedcentral.com/content/supplementary/1471- applied three "competitive null hypothesis" approaches 2105-8-431-S1.pdf] to the analysis of the p53 dataset: Gene Set Enrichment Analysis (GSEA) [2]; the Significance Analysis of Function Additional file 2 and Expression (SAFE) [16]; and Fisher's exact test [17]. FDR values for the 17 gene sets listed in Table 2. FDR values of the 17 gene sets listed in Table 2 are presented. Both GSEA and SAFE employ a two-stage approach to Click here for file access the significance of a gene set. First, gene-specific [http://www.biomedcentral.com/content/supplementary/1471- measures are calculated that capture the association 2105-8-431-S2.pdf] between expression and the phenotype of interest. Then a test statistic is constructed as a function of the gene-spe- Additional file 3 cific measures used in the first step. 
The significance of the P-values and FDR values for the three "self-contained null hypothesis" test statistics is assessed by permutation of the response and three "competitive null hypothesis" approaches. The three "self-con- tained null hypothesis" and three "competitive null hypothesis" values. For GSEA, the Pearson correlation is used in the approaches were applied to the p53 dataset. The p-values and FDR values first step, according to Mootha et al. [2] and the Enriched for the 17 gene sets listed in Table 2 are presented. Score is used in the second step. For SAFE, the student t- Click here for file statistic is used in the first step and the Wilcoxon rank- [http://www.biomedcentral.com/content/supplementary/1471- sum test is used in the second step, both of these being the 2105-8-431-S3.pdf] default options. For the Fisher's exact test, the list of signif- icant genes is obtained from SAM [1]. An FDR cutoff of 0.3 assigned significance to 5% of the genes in the entire gene list. Acknowledgements ID, AJA, and YY are supported by the Alberta Heritage Foundation for Medical Research and YY is supported by the Canada Research Chair Pro- Availability and requirements gram and the Canadian Institutes of Health Research. Project name: Comparison of statistical methods for gene set analysis based on testing self-contained hypotheses References via. subject sampling. 1. Tusher VG, Tibshirani R, Chu G: Significance analysis of micro- arrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98:5116-5121. Project home page: http://www.ualberta.ca/~yyasui/ 2. 
Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar homepage.html J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly Page 14 of 15 SDJHQXPEHUQRWIRUFLWDWLRQSXUSRVHV %0&%LRLQIRUPDWLFV 2007, :431 http://www.biomedcentral.com/1471-2105/8/431 MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC-1alpha- responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 2003, 34:267-273. 3. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gil- lette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102:15545-15550. 4. Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A glo- bal test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004, 20:93-99. 5. Goeman JJ, Oosting J, Cleton-Jansen AM, Anninga JK, van Houwelin- gen HC: Testing association of a pathway with survival using gene expression data. Bioinformatics 2005, 21:1950-1957. 6. Mansmann U, Meister R: Testing differential gene expression in functional groups. Goeman's global test versus an ANCOVA approach. Methods Inf Med 2005, 44:449-453. 7. Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving GSEA for Analysis of Biologic Pathways for Differential Gene Expression across a Binary Phenotype. BMC Bioinformatics 2007, 8:242. 8. Goeman JJ, Bühlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007, 23:980-987. 9. Gene Set Enrichment Analysis [http://www.broad.mit.edu/gsea] 10. Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. 
Bioinfor- matics 2002, 18(Suppl 1):S96-104. 11. Huber W, von Heydebreck A, Sueltmann H, Poustka A, Vingron M: Parameter estimation for the calibration and variance stabi- lization of microarray data. Stat Appl Genet Mol Biol 2003, 2:Article3. 12. Goeman JJ, Van de Geer SA, van Houwelingen HC: Testing against a high dimensional alternative. J R Statist Soc B 2006, 68:477-493. 13. Storey JD: A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002, 64:479-498. 14. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Dis- covering statistically significant pathways in expression pro- filing studies. Proc Natl Acad Sci USA 2005, 102:13544-13549. 15. Tomfohr J, Lu J, Kepler TB: Pathway level analysis of gene expression using singular value decomposition. BMC Bioinfor- matics 2005, 6:225. 16. Barry WT, Nobel AB, Wright FA: Significance analysis of func- tional categories in gene expression studies: a structured permutation approach. Bioinformatics 2005, 21:1943-1949. 17. Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Glo- bal functional profiling of gene expression. Genomics 2003, 81:98-104. 18. le Cessie S, van Houwelingen HC: Testing the fit of a regression model via score tests in random effects models. Biometrics 1995, 51:600-614. 19. Houwing-Duistermaat JJ, Derkx BH, Rosendaal FR, van Houwelingen HC: Testing familial aggregation. Biometrics 1995, 51:1292-1301. Publish with Publish with Bio Bio Med Med Central Central and and e ev ver ery y scientist can scientist can r read ead y your our w work ork fr free of ee of charge charge "BioMed Centr "BioMed Central al will will be be the the most most signif significant icant de development velopment f for or disseminating the disseminating the r results esults of of biomedical biomedical r researc esearc h h in in our our lif lifetime etime." ." 
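The SAM-GS procedure described in the Methods (z-score standardization, the per-gene d statistic, the sum of squared d statistics, and a permutation p-value) can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' Excel add-in: the function names are hypothetical; the constant s0 is supplied by the caller and held fixed across permutations rather than estimated from the data as SAM does; the gene-specific scatter is assumed to be the pooled standard error of the mean difference; and the p-value is taken as the plain proportion of permuted statistics at least as large as the observed one.

```python
import numpy as np


def zscore_genes(X):
    """Standardize each gene (column) with its mean and SD over all samples."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def sam_d(X, y, s0):
    """SAM-style d statistic per gene.

    X: (samples x genes) expression matrix; y: array of 0/1 group labels.
    s0: small positive constant (assumed fixed here, not estimated).
    """
    g1, g2 = X[y == 0], X[y == 1]
    n1, n2 = len(g1), len(g2)
    # Pooled within-group sum of squares for every gene.
    ss = ((g1 - g1.mean(axis=0)) ** 2).sum(axis=0) \
        + ((g2 - g2.mean(axis=0)) ** 2).sum(axis=0)
    # Assumed form of the gene-specific scatter s(j).
    s = np.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (g1.mean(axis=0) - g2.mean(axis=0)) / (s + s0)


def samgs_stat(X, y, s0):
    """Sum of squared d statistics over all genes in the set."""
    return float(np.sum(sam_d(X, y, s0) ** 2))


def samgs_pvalue(X, y, s0=0.1, n_perm=1000, seed=0):
    """Permutation p-value: shuffle phenotype labels, recompute the statistic."""
    rng = np.random.default_rng(seed)
    observed = samgs_stat(X, y, s0)
    permuted = np.array(
        [samgs_stat(X, rng.permutation(y), s0) for _ in range(n_perm)]
    )
    return observed, float(np.mean(permuted >= observed))


if __name__ == "__main__":
    # Toy data: 10 vs. 10 samples, 30 genes, first 10 genes shifted in group 1.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 30))
    y = np.array([0] * 10 + [1] * 10)
    X[y == 1, :10] += 3.0
    stat, p = samgs_pvalue(zscore_genes(X), y, n_perm=200)
    print(f"SAMGS = {stat:.2f}, permutation p-value = {p:.4f}")
```

Because the labels are permuted rather than the genes, the correlation among genes within each sample is preserved under the null, which is the sense in which the test is "self-contained" via subject sampling.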


Publisher: Unpaywall
ISSN: 1471-2105
DOI: 10.1186/1471-2105-8-431

Abstract

Background: Multiple data-analytic methods have been proposed for evaluating gene-expression levels in specific biological pathways, assessing differential expression associated with a binary phenotype. Following Goeman and Bühlmann's recent review, we compared statistical performance of three methods, namely Global Test, ANCOVA Global Test, and SAM-GS, that test "self-contained null hypotheses" via subject sampling. The three methods were compared based on a simulation experiment and analyses of three real-world microarray datasets. Results: In the simulation experiment, we found that the use of the asymptotic distribution in the two Global Tests leads to a statistical test with an incorrect size. Specifically, p-values calculated by the scaled χ² distribution of Global Test and the asymptotic distribution of ANCOVA Global Test are too liberal, while the asymptotic distribution with a quadratic form of the Global Test results in p-values that are too conservative. The two Global Tests with permutation-based inference, however, gave a correct size. While the three methods showed similar power using permutation inference after a proper standardization of gene expression data, SAM-GS showed slightly higher power than the Global Tests. In the analysis of a real-world microarray dataset, the two Global Tests gave markedly different results, compared to SAM-GS, in identifying pathways whose gene expressions are associated with p53 mutation in cancer cell lines. A proper standardization of gene expression variances is necessary for the two Global Tests in order to produce biologically sensible results. After the standardization, the three methods gave very similar biologically-sensible results, with slightly higher statistical significance given by SAM-GS. The three methods gave similar patterns of results in the analysis of the other two microarray datasets.
Conclusion: An appropriate standardization makes the performance of all three methods similar, given the use of permutation-based inference. SAM-GS tends to have slightly higher power in the lower $\alpha$-level region (i.e. gene sets that are of the greatest interest). Global Test and ANCOVA Global Test have the important advantage of being able to analyze continuous and survival phenotypes and to adjust for covariates. A free Microsoft Excel Add-In to perform SAM-GS is available from http://www.ualberta.ca/~yyasui/homepage.html.

BMC Bioinformatics 2007, 8:431 http://www.biomedcentral.com/1471-2105/8/431

Background

Some microarray-based gene expression analyses such as Significance Analysis of Microarray (SAM) [1] aim to discover individual genes whose expression levels are associated with a phenotype of interest. Such individual-gene analyses can be enhanced by utilizing existing knowledge of biological pathways, or sets of individual genes (hereafter referred to as "gene sets"), that are linked via related biological functions. Gene-set analyses aim to discover gene sets the expression of which is associated with a phenotype of interest.

Many gene-set analysis methods have been proposed previously. For example, Mootha et al. [2] proposed Gene Set Enrichment Analysis (GSEA), which uses the Kolmogorov-Smirnov statistic to measure the degree of differential gene expression in a gene set by a binary phenotype (see also [3]). Goeman et al. [4] presented Global Test, modeling differential gene expression by means of random-effects logistic regression models. Goeman et al. [5] also extended their methods to continuous and survival outcomes. Mansmann and Meister [6] proposed ANCOVA Global Test, which is similar to Global Test but has the roles of phenotype and genes exchanged in regression models. Mansmann and Meister [6] pointed out that their ANCOVA Global Test outperformed Global Test, especially in cases where the asymptotic distribution of Global Test cannot be used. Dinu et al. [7] discussed some critical problems of GSEA as a method for gene-set analysis and proposed an alternative method called SAM-GS, an extension of SAM to gene-set analysis. Goeman and Bühlmann [8] provided an excellent review of the methods, discussing important methodological questions of gene-set analysis, and summarized the methodological principles behind the existing methods. An important contribution of their review was the distinction between testing "self-contained null hypotheses" via subject sampling and testing "competitive null hypotheses" via gene sampling. They argue, and we agree, that the framework of the competitive hypothesis testing via gene sampling is subject to serious errors in calculating and interpreting statistical significance of gene sets, because of its implicit or explicit untenable assumption of probabilistic independence across genes.

Although Global Test, ANCOVA Global Test, and SAM-GS each test a self-contained hypothesis on the association of expression patterns across a gene set with a phenotype of interest in a statistically appropriate manner, it is unclear how the three methods compare on performance in detecting underlying associations. In this paper, we compare the performance of the three methods via simulation and real-world microarray data analyses, both statistically and biologically.

Results

Simulation experiment

Our first evaluation of the three methods used a simulation study, similar to that of Mansmann and Meister [6] with some modifications that make the simulated data more realistic, and evaluated the size and power of the three hypothesis tests. Gene-set analysis was performed for both the original and "z-score standardized" simulated datasets so that the effects of standardization on the three tests' performance can be assessed. The z-score standardization was motivated by the fact that gene-expression variances can vary greatly across genes, even after a normalization, which could influence gene-set analysis. In the z-score standardized datasets, gene expression was standardized using the following equation:

$$x'_{jk} = \frac{x_{jk} - \bar{x}_j}{s_j} \qquad (1)$$

where $x'_{jk}$ denotes the standardized value, $x_{jk}$ is the gene expression for gene j in sample k, and $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of gene j expression using all samples, respectively. All simulation analyses compared the mean expression of a gene set of interest between two groups, each with a sample of 10 observations.

First, we checked the size of the three tests, before and after the standardization, according to the following three scenarios of no differential expression between two groups: (1) randomly generate expression of 100 genes for the two groups from a multivariate normal distribution (MVN) with a mean vector $\mu$ and a diagonal variance-covariance matrix $\Sigma$, where the 100 elements of $\mu$ and the 100 diagonal elements of $\Sigma$ were randomly generated as 100 independently-and-identically-distributed (i.i.d.) uniform random variables in (0, 10) and 100 i.i.d. uniform random variables in (0.1, 10), respectively (i.e., no gene was differentially expressed between the two groups and expression was uncorrelated among the 100 genes); (2) exactly the same as (1) except the variance-covariance matrix $\Sigma$ of the MVN being changed to have a correlation of 0.5 between all pairs of the first 20 genes and also between all pairs of the second 20 genes; (3) exactly the same as (2) with the correlation value changed from 0.5 to 0.9.

Second, we estimated the power of the three tests, before and after the standardization, by randomly generating a gene set of size 100, using exactly the same simulation set-up as the size-evaluation (2) above, but allowing the first 40 genes to be differentially expressed. The mean expression of the 40 differentially expressed genes was randomly generated from Uniform(0, 10) as in the size-evaluation (2), but was subsequently modified by an addition and a subtraction of a constant $\gamma$, as in Mansmann and Meister [6], such that the mean vectors $\mu_i$ for the two groups (i = 1, 2) differ by $2\gamma$, $\mu_{1j} - \mu_{2j} = (-1)^{I(j > 20)}\, 2\gamma$, for j = 1, ..., 40, where $I(\cdot)$ is the indicator function. We considered a range of $\gamma$ from 0 to 2 with an increment of 0.1. The 40 differentially expressed genes were set to have a correlation of 0.5, as in the size-evaluation (2), but no correlation and a correlation of 0.9 were also considered.

In the comparison of size across the three tests, the size was estimated by the observed proportion of replications with a p-value smaller than the correct size $\alpha$. By definition, under the null hypothesis, a proportion $\alpha$ of the replications of an experiment is expected to yield a p-value smaller than $\alpha$. In order to assess the size, we ran 5000 replications and used $\alpha$ = 0.05. For each permutation-based p-value, 1000 random permutations were carried out.

In the comparison of power across the three tests, the power was estimated by the observed proportion of the replications of an experiment in which the null hypothesis was correctly rejected. Given the fixed numbers of samples and genes with the fixed correlation structure in the simulation experiment, a larger effect size $\gamma$ leads to higher power for a given $\alpha$-level. In estimating the power, we ran 1000 replications of an experiment for each $\gamma$ value. We considered $\alpha$ at 0.05, 0.01, 0.005, 0.0025, and 0.001. For obtaining a permutation-based p-value, 1000 random permutations were carried out.

The empirical Type I error rates of SAM-GS and the two Global Tests with permutations were almost right on the target of the nominal value of 0.05, before and after the standardization, for all three scenarios considered for the evaluation of size (Table 1). Type I error rates of Global Test with the scaled $\chi^2$ null distribution and Global Test with the asymptotic null distribution with a quadratic form deviated noticeably from the nominal size, being too liberal with the scaled $\chi^2$ and too conservative with the asymptotic distribution (non-$\chi^2$-distributed quadratic form), as shown in Table 1. As the correlation among the 40 genes increased, the Type I error rates of Global Test with the scaled $\chi^2$ null distribution and the Global Test with the asymptotic null distribution with a quadratic form generally moved towards the nominal size of 0.05. Type I error rates of ANCOVA Global Test with the asymptotic distribution also deviated noticeably from the nominal size: 0.0692, 0.1034 and 0.0898 before the standardization and 0.037, 0.0848 and 0.0792 after the standardization, for r = 0, 0.5, and 0.9, respectively. Hereafter, therefore, the p-values for Global Test and ANCOVA Global Test are calculated based on permutations. We also estimated the size of the three tests using 25 samples, instead of 10 samples, in each group, and observed similar patterns. As the sample size increased, the Type I error rates of the two Global Tests using the asymptotic distributions moved towards the nominal level of 0.05.

Table 1: Assessment of type I error probabilities

                                               10 vs. 10 samples             25 vs. 25 samples
Methods              Type of inference    r = 0     r = 0.5   r = 0.9    r = 0     r = 0.5   r = 0.9

Before standardization
Global Test          Scaled χ²            0.0982    0.0778    0.0722     0.0696    0.0700    0.0686
Global Test          Asymptotic           0.0006    0.0128    0.0298     0.0090    0.0328    0.0442
Global Test          Permutation          0.0496    0.0434    0.0464     0.0534    0.0554    0.0556
ANCOVA Global Test   Asymptotic           0.0692    0.1034    0.0898     0.0576    0.0840    0.0736
ANCOVA Global Test   Permutation          0.0482    0.0462    0.0458     0.0526    0.0552    0.0562
SAM-GS               Permutation          0.0498    0.0462    0.0478     0.0514    0.0518    0.0556

After standardization
Global Test          Scaled χ²            0.1090    0.0844    0.0736     0.0734    0.0702    0.0698
Global Test          Asymptotic           <0.0001   0.0094    0.0276     0.0036    0.0320    0.0424
Global Test          Permutation          0.0524    0.0464    0.0458     0.0524    0.0528    0.0530
ANCOVA Global Test   Asymptotic           0.0372    0.0848    0.0792     0.0474    0.0838    0.0730
ANCOVA Global Test   Permutation          0.0532    0.0462    0.0466     0.0544    0.0542    0.0544
SAM-GS               Permutation          0.0522    0.0468    0.0470     0.0526    0.0540    0.0542

The second step of the simulation was to assess power, the results of which are shown in Figures 1, 2, 3, 4, 5, and 6. Before the standardization, SAM-GS showed higher power than the Global Tests at $\alpha$ = 0.05, with increasing power differences with decreasing $\alpha$ levels. This pattern was observed regardless of the correlation level in the 40 differentially-expressed genes (correlation of 0, 0.5, or 0.9). After the standardization, the performances of these three methods became closer: SAM-GS showed slightly higher power than the Global Tests, with increasing power difference with decreasing $\alpha$ levels.

Figure 1: The results of the simulation experiment, evaluating power of the three tests before the standardization, for correlation of 0 among 40 genes. [Five panels of power against gamma (0.0 to 2.0) at $\alpha$ = 0.05, 0.01, 0.005, 0.0025, and 0.001, r = 0; curves for Global Test, Global Ancova, and SAM-GS.]

Figure 2: The results of the simulation experiment, evaluating power of the three tests after the standardization, for correlation of 0 among 40 genes. [Five panels of power against gamma (0.0 to 2.0) at $\alpha$ = 0.05, 0.01, 0.005, 0.0025, and 0.001, r = 0; curves for Global Test, Global Ancova, and SAM-GS.]

Figure 3: The results of the simulation experiment, evaluating power of the three tests before the standardization, for correlation of 0.5 among 40 genes. [Five panels of power against gamma (0.0 to 2.0) at $\alpha$ = 0.05, 0.01, 0.005, 0.0025, and 0.001, r = 0.5; curves for Global Test, Global Ancova, and SAM-GS.]

Real-world data analyses

Our next evaluation of the performance of the three methods used biologically, a priori defined gene sets and three microarray datasets considered in Subramanian et al. [3], downloaded from the GSEA web page [9]: 17 p53 wild-type vs. 33 p53 mutant cancer cell lines; 15 male vs. 17 female lymphoblastoid cells; 24 acute lymphoid leukemia (ALL) vs. 24 acute myeloid leukemia (AML) cells. For pathways/gene sets, we used Subramanian et al.'s gene-set subcatalogs C1 and C2 from the same web address above on "Molecular Signature Database." The C1 catalog includes gene sets corresponding to human chromosomes and cytogenetic bands, while the C2 catalog includes gene sets that are involved in specific metabolic signaling pathways [3]. In Subramanian et al., Catalog C1 included 24 sets, one for each of the 24 human chromosomes, and 295 sets corresponding to the cytogenetic bands; Catalog C2 consisted of 472 sets containing gene sets reported in manually curated databases and 50 sets containing genes reported in various experimental papers. Following Subramanian et al. [3], we restricted the set size to be between 15 and 500, resulting in 308 pathways to be examined.

We compared the performance of the three methods before and after the standardization by listing the gene sets which had a p-value ≤ 0.001 by any of the three methods.

Table 2 shows the associations of biologically-defined gene sets with the phenotype, assessed by Global Test, ANCOVA Global Test, and SAM-GS, in the analysis of gene expression differences between p53 wild-type vs. mutant cancer cell lines. Gene sets with a p-value ≤ 0.001 by any of the three methods are listed in Table 2. Before the standardization, SAM-GS identified 16 gene sets with
a p-value ≤ 0.001, while Global Test and ANCOVA Global Test identified three and one gene sets, respectively, with a p-value ≤ 0.001 (Table 2). Two of these three sets were among the 16 sets identified by SAM-GS. The third set was CR_DEATH, which had a p-value = 0.008 from SAM-GS. Among the 17 gene sets listed in Table 2, seven involve p53 directly as a gene-set member. Furthermore, five gene sets directly involve the extrinsic and intrinsic apoptosis pathways, and three involve the cell-cycle machinery; each of these eight gene sets, then, is in a direct, well-established relationship with aspects of p53 signaling. The remaining two gene sets have plausible, if less well established, links with p53 [7].

Figure 4: The results of the simulation experiment, evaluating power of the three tests after the standardization, for correlation of 0.5 among 40 genes. (Panels: power at α = 0.05, 0.01, 0.005, 0.0025, and 0.001 for r = 0.5; x-axis: Gamma; y-axis: Power; curves: Global Test, Global Ancova, SAM-GS.)

The disagreement between results of SAM-GS and the two Global Tests was considerable before standardization.
Although 16 of the 17 gene sets in Table 2 had a SAM-GS p-value ≤ 0.001, 7 had p-values larger than 0.1 by the two Global Tests. For example, SAM-GS identified the gene set p53hypoxia pathway as a significant set with a p-value < 0.001, which seems biologically appropriate, yet the Global Test and the ANCOVA Global Test gave p-values greater than 0.6.

Figure 5: The results of the simulation experiment, evaluating power of the three tests before the standardization, for correlation of 0.9 among 40 genes. (Panels: power at α = 0.05, 0.01, 0.005, 0.0025, and 0.001 for r = 0.9; x-axis: Gamma; y-axis: Power; curves: Global Test, Global Ancova, SAM-GS.)

Figure 6: The results of the simulation experiment, evaluating power of the three tests after the standardization, for correlation of 0.9 among 40 genes. (Same panels, axes, and curves as Figure 5.)

We then compared the three methods incorporating the z-score standardization. For SAM-GS, the p-values before and after the standardization were highly consistent, and, therefore, we used the results of SAM-GS before the standardization for the comparisons with the other two methods. For Global Test and ANCOVA Global Test, p-values changed appreciably. Notably, p-values of Global Test and ANCOVA Global Test after the standardization agreed closely with those of SAM-GS (Table 2, Figure 7). For example, the p-values of p53hypoxia pathway changed from 0.626 to <0.001 for Global Test and from 0.622 to <0.001 for ANCOVA Global Test. Although the p-values of the three methods agreed with each other after the standardization, the p-values from SAM-GS tended to be smaller than those from Global Test and ANCOVA Global Test in the lower range of p-values (gene sets that are of the greatest interest) (Table 2, Figures 7 and 8): this is consistent with the power-comparison simulation in which SAM-GS showed slightly higher power than the Global Tests at small α levels, even after the standardization.

Table 2: Gene sets in the p53 dataset with p-value ≤ 0.001 by any of the three methods

Gene set                       Before standardization      After standardization       VSN
                               Global  Ancova  SAM-GS      Global  Ancova  SAM-GS      Global  Ancova  SAM-GS
ATM Pathway*                   <0.001  <0.001  <0.001      <0.001  0.002   <0.001      0.001   0.001   <0.001
BAD Pathway**                  <0.001  0.007   <0.001      <0.001  <0.001  <0.001      0.004   0.004   <0.001
Calcineurin Pathway$           0.068   0.084   <0.001      0.007   0.002   <0.001      0.004   0.005   0.011
Cell cycle regulator†          0.021   0.017   <0.001      0.002   0.001   <0.001      0.002   <0.001  0.003
Hsp27Pathway**                 0.047   0.044   <0.001      <0.001  0.001   <0.001      0.011   0.005   <0.001
Mitochondria pathway**         0.002   0.002   <0.001      0.007   0.007   <0.001      0.013   0.006   <0.001
p53 signaling pathway*         0.112   0.101   <0.001      0.003   0.003   0.001       0.006   0.005   0.006
p53_UP*                        0.003   0.004   <0.001      <0.001  <0.001  <0.001      0.019   0.018   <0.001
p53hypoxiaPathway*             0.626   0.622   <0.001      <0.001  <0.001  <0.001      0.044   0.041   <0.001
p53Pathway*                    0.142   0.150   <0.001      <0.001  <0.001  <0.001      <0.001  0.001   <0.001
Raccycd Pathway†               0.177   0.181   <0.001      0.001   <0.001  <0.001      0.004   0.009   0.006
Radiation_sensitivity*         0.119   0.135   <0.001      <0.001  <0.001  <0.001      0.014   0.020   <0.001
SA_TRKA_RECEPTOR‡              0.254   0.252   <0.001      0.001   <0.001  <0.001      0.004   0.001   0.006
bcl2family & reg. network**    0.102   0.100   0.001       0.001   0.005   <0.001      0.010   0.014   0.001
Cell cycle arrest†             0.099   0.099   0.001       0.027   0.018   0.005       0.003   0.005   0.007
Ceramide Pathway**             0.002   0.006   0.001       0.004   0.004   <0.001      0.001   0.001   <0.001
CR_DEATH*                      0.001   0.004   0.008       0.029   0.017   0.004       0.143   0.108   0.005

* pathway member
** apoptosis
$ p53-induced proline oxidase mediates apoptosis via a calcineurin-dependent pathway
† cell cycle
‡ integrated negative feedback loop between Akt and p53

The same pattern was found in the analyses of the male-vs.-female lymphoblastoid dataset and the ALL-vs.-AML dataset (see Figures S1, S2, S3 and S4 in Additional file 1, comparing the results from the three methods). Before the standardization, p-values from Global Test and ANCOVA Global Test differed greatly from p-values from SAM-GS. The p-values of Global Test and ANCOVA Global Test changed markedly after the standardization and were very close to those of SAM-GS. After the standardization, in the male-vs.-female analysis, 21 gene sets had a p-value < 0.15 by one or more of the methods; 17 of these had a SAM-GS p-value smaller than, or equal to, those of Global Test and ANCOVA Global Test. In the ALL-vs.-AML analysis, all sets were statistically significant with p-values < 0.001 by all three tests, which is unlikely to be of any biological significance.

In the User Guides for Global Test and ANCOVA Global Test, Variance Stabilization (VSN) was used to normalize the data [10,11]. We also assessed the performance of the three methods on the p53 dataset, male vs. female dataset, and the ALL/AML dataset using VSN. The results for the p53 dataset are shown in Table 2 and Figure 9. When VSN was used for the normalization of the data, we observed: (1) p-values of Global Test and ANCOVA Global Test became similar to those of SAM-GS, but not as close as the
p-values after the z-score standardization; and (2) in the lower range of p-values, the p-values for SAM-GS tended to be smaller than those of Global Test and ANCOVA Global Test (Table 2, Figure 9).

Figure 7: P-values of 308 gene sets in the p53 data analysis: p-values of Global Test and ANCOVA Global Test after standardization vs. SAM-GS p-values before the standardization. The line indicates equal p-values between SAM-GS and Global Tests. (x-axis: SAM-GS P-value; y-axis: Global P-value; series: Global Test, Global Ancova.)

Figure 8: Lowest P-values in the p53 data analysis: p-values of Global Test and ANCOVA Global Test after standardization vs. SAM-GS p-values before the standardization. The line indicates equal p-values between SAM-GS and Global Tests. (Same axes and series as Figure 7.)

Discussion

From the simulation results, we suggest that, when Global Test and ANCOVA Global Test are used for the analysis of microarray data, permutations should always be used for the calculation of statistical significance. In the documentation included with the Global Test R package, Goeman et al. noted that the asymptotic distribution with a quadratic form is the recommended method for large sample sizes and it can be slightly conservative for small samples. In our simulation study, we used 10 and 25 samples for each of the two groups. In each situation, the asymptotic method with a quadratic form gave conservative p-values, although the difference between asymptotic and permutation-based methods did decrease when the sample size increased. Goeman et al. also noted that the scaled F
method can be slightly anti-conservative, especially for large gene sets. Our simulation study showed that the scaled F method can be markedly anti-conservative. This is in accord with the manual of Global Test, which recommends against using the scaled F approximation.

Figure 9: Lowest P-values in the p53 data analysis: p-values of Global Test and ANCOVA Global Test after the VSN normalization vs. SAM-GS p-values after the VSN normalization. The line indicates equal p-values between SAM-GS and Global Tests. (x-axis: SAM-GS P-value; y-axis: Global P-value; series: Global Test, Global Ancova.)

We found that performance of the two Global Tests changed greatly before and after standardization, but SAM-GS performance remained unchanged. This can be explained by: (1) the invariance of t-test statistics under shifting and rescaling of data, which is relevant to SAM-GS; (2) ANCOVA's explicit assumption that all genes in the set have an equal variance, a violation of which would clearly affect the performance of ANCOVA Global Test; and (3) Global Test's assumption that the regression coefficients come from the same normal distribution, an assumption that is met by the standardization of gene expression. Therefore, some sort of standardization that makes the variances of gene expression similar across genes is needed before using Global Test and ANCOVA Global Test. SAM-GS employs a constant in the denominator of its t-like test statistic to address the small variability in some of the gene expression measurements and, thus, effectively standardizes expression across genes; neither Global Test nor the ANCOVA Global Test addresses this characteristic of microarray data. Both Goeman et al. [4] and Mansmann and Meister [6] have stated that an appropriate normalization is important. Note that not many normalization methods would standardize the expression across genes. It is only after applying z-score standardization (1) or the VSN normalization that the results of the three methods became congruent. The similarity between Global Test and Global ANCOVA Test has already been commented upon in [6]. The similarity between SAM-GS and Global Test may be inferred from the construction of the latter as a weighted sum of squared transformed t-statistics [12], which is similar to the SAM-GS test statistic.

It should be noted that Global Test allows four different types of phenotype variables: binary; multi-class; continuous; and survival. ANCOVA Global Test allows binary, multi-class, and continuous phenotypes. The ability to handle different classes of phenotypes is a very important advantage of Global Test and ANCOVA Global Test over SAM-GS. It is also possible to use Global Test and ANCOVA Global Test while adjusting for covariates (e.g., potential confounders). If covariates are incorporated, the two tests assess whether the gene-expression profile has an independent association with the phenotype that is above and beyond what is explained by the covariates. The ability to adjust for covariates is another important advantage of Global Test and ANCOVA Global Test over SAM-GS.

We focused on p-values in this paper because we were comparing the three methods that test "self-contained null hypotheses" via subject sampling. To account for multiple comparisons when multiple gene sets are tested, one might consider False Discovery Rate (FDR) instead of Type I error probability. For example, SAM uses a q-value, an upper limit of the FDR, for each gene, which could be extended here to each gene set using the method of Storey [13]. The q-values of the 17 gene sets listed in Table 2 are displayed in Additional file 2.

We have considered, but did not report detailed comparison results of, two other methods, Tian et al. [14] and Tomfohr et al. [15], that test self-contained hypotheses via subject sampling, in addition to the three methods we highlighted above. Tian et al. [14] tests the significance of a gene set by taking the mean of t-values of genes in the gene set as a test statistic and evaluating its significance by a permutation test. Tomfohr et al. [15] reduces the gene set's expression into a single summary value by taking the first principal component of expressions of genes in the gene set and performs a permutation-based t-test of the single summary. The two methods gave appreciably different results when compared to Global Test, ANCOVA Global Test, and SAM-GS. Of the 17 gene sets in Table 2 for the p53 analysis, for instance, Tian et al. and Tomfohr et al. identified only eight and one gene sets, respectively, with p-value < 0.10: the ATM pathway, for example, was identified by Global Test, ANCOVA Global Test, and SAM-GS with p-value ≤ 0.001, while the methods of Tian et al. and Tomfohr et al. gave p-value = 0.61 and 0.99, respectively. The main reasons for their large discrepancies from the results of the three highlighted methods are as follows. Tian et al. sums up the t-values for all the genes in a gene set, which will result in cancellation of large positive t-values and large negative t-values. Among the 11 up-regulated and 8 down-regulated genes in the ATM pathway, for example, two up-regulated genes had large positive t-values (about 2 or greater) and three down-regulated genes had large negative t-values (about −2 or smaller): these large positive and negative t-values cancel each other when summing up all t-values in the Tian et al. test statistic, leading to reduced power for detecting gene sets that contain both significantly up-regulated genes and significantly down-regulated genes. The method of Tomfohr et al. summarizes the |S|-dimension gene-expression vector of genes in the gene set S by the first principal component without considering the phenotype: if the direction of the first principal component does not correspond to the direction that separates the two phenotypes, their method does not capture the differential expressions even when they exist, leading to markedly reduced power.

Although we focused on the comparison of the "self-contained null hypothesis" approaches, it is also of methodological interest to see how "competitive null hypothesis" approaches compare. We, therefore, applied three "competitive null hypothesis" approaches to the analysis of the p53 dataset: Gene Set Enrichment Analysis (GSEA) [2]; the Significance Analysis of Function and Expression (SAFE) [16]; and Fisher's exact test [17]. The results are shown in Additional file 3. The results from the three "competitive null hypothesis" approaches were greatly different from those of SAM-GS and the Global Tests. Most of the gene sets identified as being significantly associated with the p53 mutation by SAM-GS and Global Tests were not identified as such by the three "competitive null hypothesis" approaches. The only gene set additionally identified as being significantly associated with the p53 mutation (with p < 0.001) was HUMAN_CD34_ENRICHED_TF_JP: for this gene set, the Fisher's exact test p-value was < 0.001, but all the other five methods gave p-values > 0.37. Known biological functions of p53 are clearly more consistent with the results of the "self-contained null hypothesis" approaches. The differences observed between "self-contained null hypothesis" and "competitive null hypothesis" approaches can be attributable, at least partly, to the fact that the significance of a gene set depends only on the genes in the set under the "self-contained null hypothesis" testing, while, under the "competitive null hypothesis" testing, the significance of a gene set depends not only on the genes in the set but also on all the other genes in the array.

In summary, the primary advantage of SAM-GS may be the slightly higher power in the low α-level region that is of highest scientific interest, whereas, despite the need for appropriate standardization, Global Test and the ANCOVA Global Test can be used for a variety of phenotypes and incorporate covariates in the analysis.

Conclusion

In conclusion, Global Test and ANCOVA Global Test require appropriate standardization of gene expression measurements across genes for proper performance. Standardization of these two methods and the use of permutation inference make the performance of all three methods similar, with a slight power advantage in SAM-GS. Global Test and the ANCOVA Global Test can be used for a variety of phenotypes and incorporate covariates in the analysis.

Methods

In this section, we describe the three gene-set analysis methods. The phenotype of interest is assumed to be binary.

1) Global Test

The Global Test is based on a regression model that predicts response from the gene expression measurements of a gene set [4]. Generalized linear models are used to model the dependency of response Y (an n × 1 vector) on gene expression measurements X (an n × m matrix) of a set of m genes on n samples:

$$h\big(E(Y_i \mid \beta)\big) = \alpha + \sum_{j=1}^{m} x_{ij}\beta_j, \qquad i = 1, 2, \ldots, n,$$

where h denotes the link function and α and the β_j's are parameters. If the genes are not differentially expressed, the regression coefficients (β_j's) should be zero. Under an assumption that all regression coefficients are sampled from a common distribution with mean zero and variance τ², the null hypothesis of no differential gene-expression is reduced to τ² = 0. Using the notation r_i = Σ_j x_ij β_j, the model simplifies to a random-effects model: E(Y_i | r_i) = h^{-1}(α + r_i). The null hypothesis can then be tested, based on a score test statistic discussed in Le Cessie and Van Houwelingen [18] and Houwing-Duistermaat et al. [19]:

$$Q = \frac{(Y - \mu)' R \,(Y - \mu)}{\mu_2},$$

where R = (1/m)XX', μ = h^{-1}(α), and μ_2 is the second central moment of Y under the null hypothesis. It can be shown that Q is asymptotically normally distributed (a quadratic form which is non-negative). However, when the sample size is small, a better approximation to the distribution of Q is a scaled F distribution. The p-value can, therefore, be calculated based on an approximate distribution of the test statistic, i.e., the asymptotic distribution with a non-chi-squared distributed quadratic form or the scaled F distribution, or permutations of samples.

2) ANCOVA Global Test

The null hypothesis of the Global Test is in the form of P(Y|X) = P(Y). The ANCOVA Global Test changes the roles of gene expression pattern X and phenotype Y, and the null hypothesis becomes P(X|Y = 1) = P(X|Y = 2), or, for each gene j in a gene set of interest, μ_1j = μ_2j, where μ_ij is the mean expression of gene j in phenotype group i, i = 1, 2. A linear model of the form μ_ij = μ + α_i + β_j + γ_ij, with group effects α_i, gene effects β_j, and the gene-group interaction γ_ij, is then used to test the null hypothesis. The conditions Σ_i α_i = Σ_j β_j = Σ_i γ_ij = Σ_j γ_ij = 0 ensure identifiability of the parameters. The null hypothesis under the parameterization of the linear model is H_0: α_i = γ_ij = 0. The test statistic is the F-test statistic for linear models:

$$F = \frac{(SSR_{H_0} - SSR_{H_1}) / (df_{H_1} - df_{H_0})}{SSR_{H_1} / df_{H_1}},$$

where SSR_H and df_H denote the sum of squares and degrees of freedom, respectively, under the hypothesis H. The p-value can be calculated by a permutation distribution of the F statistic or an asymptotic distribution of the test statistic.

3) SAM-GS

SAM-GS extends SAM to gene-set analysis. SAM-GS tests a null hypothesis that the mean vectors of expression of genes in a gene set do not differ by the phenotype of interest. The SAM-GS method is based on individual t-like statistics from SAM, addressing the small variability problem encountered in microarray data, i.e., reducing the statistical significance associated with genes with very little variation in their expression. For each gene j, the d statistic is calculated as in SAM:

$$d(j) = \frac{\bar{x}_1(j) - \bar{x}_2(j)}{s(j) + s_0},$$

where the 'gene-specific scatter' s(j) is a pooled standard deviation over the two groups of the phenotype, and s_0 is a small positive constant that adjusts for the small variability [1]. SAM-GS then summarizes these standardized differences in all genes in the gene set S by:

$$SAMGS = \sum_{i=1}^{|S|} d_i^2 .$$

A permutation distribution of the SAMGS statistic is used to calculate the p-value. We note that even though the recalculation of s_0 is needed for each permutation, practically the implication is small, and both SAM and SAM-GS Excel add-ins do not recalculate s_0.

Each of the three methods provides a statistically valid test of the null hypothesis of no differential gene expression across a binary phenotype.

For the purpose of methodological comparisons, we also applied three "competitive null hypothesis" approaches to the analysis of the p53 dataset: Gene Set Enrichment Analysis (GSEA) [2]; the Significance Analysis of Function and Expression (SAFE) [16]; and Fisher's exact test [17]. Both GSEA and SAFE employ a two-stage approach to assess the significance of a gene set. First, gene-specific measures are calculated that capture the association between expression and the phenotype of interest. Then a test statistic is constructed as a function of the gene-specific measures used in the first step. The significance of the test statistics is assessed by permutation of the response values. For GSEA, the Pearson correlation is used in the first step, according to Mootha et al. [2], and the Enrichment Score is used in the second step. For SAFE, the Student t-statistic is used in the first step and the Wilcoxon rank-sum test is used in the second step, both of these being the default options. For the Fisher's exact test, the list of significant genes is obtained from SAM [1]. An FDR cutoff of 0.3 assigned significance to 5% of the genes in the entire gene list.

Availability and requirements

Project name: Comparison of statistical methods for gene set analysis based on testing self-contained hypotheses via subject sampling.
Project home page: http://www.ualberta.ca/~yyasui/homepage.html
Operating system(s): Microsoft Windows XP
Programming language: R 2.4.x and Microsoft Excel 2003 or 2007

Abbreviations

Significance Analysis of Microarray for Gene Sets (SAM-GS)

Authors' contributions

JDP provided biological interpretations of the analysis results of the real-world dataset. QL and ID contributed significantly to data analysis, refinement of SAM-GS, and programming. The manuscript was written primarily by QL, ID, and YY, and critically reviewed and revised by all authors. All authors read and approved the final manuscript.

Additional material

Additional file 1: The analysis results of the two real-world microarray datasets (gender and leukemia) by the three methods. These three methods were applied and compared on two real-world microarray datasets: the male vs. female lymphoblastoid cell microarray dataset and the ALL- and AML-cell microarray dataset. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-431-S1.pdf]

Additional file 2: FDR values for the 17 gene sets listed in Table 2. FDR values of the 17 gene sets listed in Table 2 are presented. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-431-S2.pdf]

Additional file 3: P-values and FDR values for the three "self-contained null hypothesis" and three "competitive null hypothesis" approaches. The three "self-contained null hypothesis" and three "competitive null hypothesis" approaches were applied to the p53 dataset. The p-values and FDR values for the 17 gene sets listed in Table 2 are presented. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-431-S3.pdf]

Acknowledgements

ID, AJA, and YY are supported by the Alberta Heritage Foundation for Medical Research and YY is supported by the Canada Research Chair Program and the Canadian Institutes of Health Research.

References

1. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98:5116-5121.
2. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 2003, 34:267-273.
3. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102:15545-15550.
4. Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004, 20:93-99.
5. Goeman JJ, Oosting J, Cleton-Jansen AM, Anninga JK, van Houwelingen HC: Testing association of a pathway with survival using gene expression data. Bioinformatics 2005, 21:1950-1957.
6. Mansmann U, Meister R: Testing differential gene expression in functional groups. Goeman's global test versus an ANCOVA approach. Methods Inf Med 2005, 44:449-453.
7. Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving GSEA for Analysis of Biologic Pathways for Differential Gene Expression across a Binary Phenotype. BMC Bioinformatics 2007, 8:242.
8. Goeman JJ, Bühlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007, 23:980-987.
9. Gene Set Enrichment Analysis [http://www.broad.mit.edu/gsea]
10. Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18(Suppl 1):S96-104.
11. Huber W, von Heydebreck A, Sueltmann H, Poustka A, Vingron M: Parameter estimation for the calibration and variance stabilization of microarray data. Stat Appl Genet Mol Biol 2003, 2:Article3.
12. Goeman JJ, Van de Geer SA, van Houwelingen HC: Testing against a high dimensional alternative. J R Statist Soc B 2006, 68:477-493.
13. Storey JD: A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002, 64:479-498.
14. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA 2005, 102:13544-13549.
15. Tomfohr J, Lu J, Kepler TB: Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics 2005, 6:225.
16. Barry WT, Nobel AB, Wright FA: Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics 2005, 21:1943-1949.
17. Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression. Genomics 2003, 81:98-104.
18. le Cessie S, van Houwelingen HC: Testing the fit of a regression model via score tests in random effects models. Biometrics 1995, 51:600-614.
19. Houwing-Duistermaat JJ, Derkx BH, Rosendaal FR, van Houwelingen HC: Testing familial aggregation. Biometrics 1995, 51:1292-1301.
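As a supplementary illustration of the SAM-GS procedure described in the Methods section, the statistic and its permutation p-value can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' R or Excel implementation: the function names, the fixed value of s0, the number of permutations, and the toy data are all hypothetical, and the gene-specific scatter s(j) is assumed to take the pooled form used in SAM [1]. As in the SAM-GS add-ins, s0 is not recalculated for each permutation.

```python
import math
import random


def d_statistic(group1, group2, s0):
    """SAM-style d statistic for one gene: mean difference over (s + s0)."""
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    # Pooled sum of squared deviations over the two phenotype groups.
    ss = sum((v - mean1) ** 2 for v in group1) + sum((v - mean2) ** 2 for v in group2)
    # Assumed form of the gene-specific scatter s(j), following SAM.
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (mean1 - mean2) / (s + s0)


def samgs(expr, labels, s0=0.1):
    """Sum of squared d statistics over the genes of one gene set.

    expr   : list of genes, each a list of expression values (one per sample)
    labels : list of 0/1 phenotype labels, one per sample
    s0     : small positive constant (hypothetical fixed value)
    """
    total = 0.0
    for gene in expr:
        group1 = [v for v, lab in zip(gene, labels) if lab == 0]
        group2 = [v for v, lab in zip(gene, labels) if lab == 1]
        total += d_statistic(group1, group2, s0) ** 2
    return total


def permutation_p_value(expr, labels, n_perm=1000, s0=0.1, seed=0):
    """Proportion of label permutations whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = samgs(expr, labels, s0)
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # permute phenotype labels across samples
        if samgs(expr, permuted, s0) >= observed:
            hits += 1
    return observed, hits / n_perm


# Hypothetical toy gene set: one gene up and one gene down in group 1.
toy_expr = [
    [0.0, 0.1, -0.1, 0.05, -0.05, 5.0, 5.1, 4.9, 5.05, 4.95],
    [1.0, 1.2, 0.8, 1.1, 0.9, -3.0, -3.2, -2.8, -3.1, -2.9],
]
toy_labels = [0] * 5 + [1] * 5
statistic, p_value = permutation_p_value(toy_expr, toy_labels, n_perm=200)
print("SAMGS =", statistic, "permutation p-value =", p_value)
```

Because the d statistics are squared before summing, genes that move in opposite directions both contribute to the statistic, which is the property the Discussion contrasts with the summed t-values of Tian et al.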

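The Discussion attributes the stability of SAM-GS under standardization to the invariance of t-statistics under shifting and rescaling of the data. The following Python sketch illustrates that point for a single gene. It is an illustration only: the expression values are hypothetical, and the per-gene z-score shown here (subtract the gene's mean and divide by its standard deviation across all samples) is an assumed form of the standardization. Note also that the sketch uses a plain pooled-variance t-statistic; with a positive constant s0 in the denominator, as in SAM-GS, the invariance would hold only approximately.

```python
import math
import statistics


def t_statistic(group1, group2):
    """Pooled-variance two-sample t-statistic for one gene."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    ss = sum((v - mean1) ** 2 for v in group1) + sum((v - mean2) ** 2 for v in group2)
    standard_error = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (mean1 - mean2) / standard_error


def z_score(values):
    """Standardize one gene across all samples (assumed form of the z-score)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical expression values for one gene in two phenotype groups.
wild_type = [8.1, 7.9, 8.4, 8.0]
mutant = [9.2, 9.0, 9.5, 8.8]

standardized = z_score(wild_type + mutant)
z_wild_type = standardized[: len(wild_type)]
z_mutant = standardized[len(wild_type):]

t_raw = t_statistic(wild_type, mutant)
t_std = t_statistic(z_wild_type, z_mutant)
diff_raw = statistics.mean(wild_type) - statistics.mean(mutant)
diff_std = statistics.mean(z_wild_type) - statistics.mean(z_mutant)

print("t before and after standardization:", t_raw, t_std)
print("mean difference before and after standardization:", diff_raw, diff_std)
```

The t-statistic is the same before and after standardization, whereas the raw mean difference changes with the gene's scale. This is consistent with the paper's observation that methods sensitive to unequal gene variances needed standardization while SAM-GS did not.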
Journal

BMC Bioinformatics – Unpaywall

Published: Nov 7, 2007
