Differential expression in SAGE: accounting for normal between-library variation

Keith A. Baggerly; Li Deng; Jeffrey S. Morris; C. Marcelo Aldaz

doi:10.1093/bioinformatics/btg173

Differential expression in SAGE: accounting for normal between-library variation

Baggerly, Keith A.; Deng, Li; Morris, Jeffrey S.; Aldaz, C. Marcelo 2003-08-12 00:00:00 Vol. 19 no. 12 2003, pages 1477–1483 BIOINFORMATICS DOI: 10.1093/bioinformatics/btg173 Differential expression in SAGE: accounting for normal between-library variation 1,∗ 2 1 Keith A. Baggerly ,Li Deng , Jeffrey S. Morris and C. Marcelo Aldaz Department of Biostatistics, UT M.D. Anderson Cancer Center, 1515 Holcombe Blvd, Box 447, Houston, TX 77030-4009, USA, Department of Statistics, Rice University, Houston TX 77005, USA and Department of Carcinogenesis, UT M.D. Anderson Cancer Center, 1515 Holcombe Blvd, Box 447, Houston, TX 77030-4009, USA Received on October 24, 2002; revised on January 30, 2003; accepted on February 16, 2003 ABSTRACT with different pathologies, the question of most interest is Motivation: In contrasting levels of gene expression whether a given tag is differentially expressed. between groups of SAGE libraries, the libraries within each Most methods currently advanced for assessing dif- group are often combined and the counts for the tag of ferential expression in SAGE address the case where interest summed, and inference is made on the basis one library is contrasted with another, assuming a null of these larger ‘pseudolibraries’. While this captures the hypothesis that there is no difference between the libraries sampling variability inherent in the procedure, it fails to being compared. Under this assumption, the chance of a allow for normal variation in levels of the gene between single tag falling in one library or the other is roughly individuals within the same group, and can consequently proportional to the library size. Differing approximations overstate the signiﬁcance of the results. The effect is not lead to modelling this behavior with binomial (Zhang et slight: between-library variation can be hundreds of times al., 1997) or Poisson distributions (Madden et al., 1997), the within-library variation. normal approximations (Madden et al., 1997; Kal et al., Results: We introduce a beta-binomial sampling model 1999; Michiels et al., 1999; Man et al., 2000) or simu- that correctly incorporates both sources of variation. We lations involving permutation tests (Zhang et al., 1997). show how to ﬁt the parameters of this model, and introduce Bayesian approaches have been suggested by Audic and a test statistic for differential expression similar to a two- Claverie (1997), and by Chen et al. (1998) (the latter sample t -test. method was adapted by Lal et al. (1999) to accommodate Contact: kabagg@mdanderson.org unequal library sizes). Of these, the simulation approach Supplementary information: http://bioinformatics. of Zhang et al. (1997) and the Bayesian approach of Lal mdanderson.org/ Includes Matlab and R code for ﬁtting et al. (1999) (see also Lash et al., 2000) are probably the model. the most widely used, due to their implementation in easily accessible software (the SAGE 2000 software INTRODUCTION available from the Kinzler Laboratory at Johns Hopkins, The Serial Analysis of Gene Expression (SAGE) method- and the routine implemented in SAGEmap at the NCBI, ology introduced by Velculescu et al. (1995) supplies data respectively). As noted in the comparison conducted on gene expression in the form of a table of counts. Brieﬂy, by Man et al. (2000) on several of the above methods, however, very similar results are obtained when the num- mRNA transcripts are converted to cDNA and then pro- bers of tags are large (>20); the authors contend that a cessed so as to isolate short (typically 10 or 14 bp) repre- normal approximation (Kal et al., 1999) or equivalently a sentative ‘tags’. Ideally, these tags should provide enough chi-squared test has more power for detecting differences information to uniquely identify the source mRNA, and to when the numbers of tags are small (<15). The validity a ﬁrst approximation this is correct. Tags are sampled and of small tag comparisons is, however, questionable due to sequenced, and a ‘library’ consisting of the tags seen and the presence of sequencing errors (Stollberg et al., 2000) their respective frequencies is constructed for each cellu- though this may be somewhat ameliorated if there is some lar sample. Given a set of libraries derived from samples external measure of quality for the read, such as a phred To whom correspondence should be addressed. score (Margulies and Innis, 2000; Margulies et al., 2001; Bioinformatics 19(12) c Oxford University Press 2003; all rights reserved. 1477 K.A.Baggerly et al. Table 1. Counts and proportions of tag AGGTCAGGAG in eight breast tumor libraries, ﬁve lymph node positive and three lymph node negative Library 1T+ 3T+ 4T+ 6T+ 8T+ 7T− 9T− 10T− Tag count 129 167 71 61 6 43 247 509 Library size 100 474 96 631 92 510 95 785 18 705 95 155 91 593 98 220 Proportions (%) 0.13 0.17 0.08 0.06 0.03 0.05 0.27 0.52 Ewing et al., 1998). We note that the statistic suggested the χ value (Michiels et al., 1999; Man et al., 2000) is by Kal et al. (1999) is 279.98; the 95% cutoff for this distribution is 3.84, so this is obviously a ‘signiﬁcant’ result. The equivalence p − p A B Z = , with of the above tests noted by Man et al. (2000) for high p (1− p ) p (1− p ) 0 0 0 0 + count tags means that the other tests will catch the same N N A B genes. Checking the sign of the test statistic proposed by X X X + X A B A B Kal et al. (1999) suggests that this tag is more strongly p = , p = , p = , A B 0 N N N + N A B A B expressed in LN− tumors. However, if we follow Ryu et al. (2002) and compare the eight proportions using where the two library sizes are N and N and the two √ A B atwo-sample t -test, t = ( p − p )/ V + V , with A B A B counts are X and X , respectively. The statistic we A B p being the average of the ﬁve proportions in group A, propose below will have a similar form. V being the sample variance of these ﬁve proportions, When the number of libraries involved is more than two, and p and V likewise deﬁned, we get a test statistic B B an omnibus test for differential expression such as that value of −1.3174. The 95% cutoffs for a t distribution suggested by Stekel et al. (2000) can be performed, but are ±2.4469, so this is a decidedly ‘insigniﬁcant’ result. interest is more often centered on comparing groups of While the mean proportion is higher for the LN− tumors, libraries where the group membership is known a priori. this is mostly being driven by results from a single library In this paper we focus on the comparison of two groups (10T) so that the variability within the LN- group is high. of libraries. When there are two groups of libraries being The ﬁrst approach fails to take into account the variability compared, the most common approach is to reduce the between like libraries which the t -test correctly captures. number of effective libraries to two by pooling the libraries Shifting between test types (χ and two-sample t)gives of like type, and reverting to the two-library comparison a different view of which tags are important, as can be seen form. This is not universal, and cautionary statements have in Figure 1. Most of the tags being ﬂagged as signiﬁcant been made. Lash et al. (2000) recommend checking for a by the χ test (values >5) are not signiﬁcant according to low coefﬁcient of variation before applying the SAGEmap the t -test (absolute values <2). procedure to grouped libraries. Ryu et al. (2002) use a Between-library variability within a group can often series of ﬁlters to deal with groups of pancreatic libraries, be as large in magnitude as the within-library variability the ﬁrst of which is a two-sample t -test applied to the due to sampling. Indeed, for the higher count data, the proportions. between-library variability is the dominant part of the These pooling approaches do catch differences, but variation. We can see this by surveying all of the tags they can overstate the signiﬁcance of the results by in the LN+ group, plotting the total (between + within) ignoring the role of normal variation in expression levels library sample variability as a multiple of the within- between like samples. As an example, we consider the library variability, with both quantities on the log scale case of a single prevalent tag in eight breast libraries that to make the structure more apparent. This is shown in have been assembled in the Aldaz Laboratory. This tag, Figure 2. For the high count tags the between-library AGGTCAGGAG, has multiple matches and hence is not variance clearly wins. Similar qualitative results hold for immediately biologically informative, but it will serve to illustrate the point. All of these libraries are derived from the LN- group (not shown). breast tumors; the ﬁrst ﬁve are from patients found to be While the two-sample t -statistic applied to the different lymph node positive (LN+), and the remaining three from normalized library proportions illustrates the problem, patients found to be lymph node negative (LN−). The tag it is too crude a tool to provide a solution in and of counts and proportions in the various libraries are given in itself: it weights the proportions from all libraries equally, Table 1. If we combine the ﬁve LN+ libraries and three even though the estimates from larger libraries are less LN− libraries and compare the resulting tag proportions, variable, and it is possible for the sample variance of the we are comparing 434/404105 to 799/284968, for which normalized proportions to be less than the known within- 1478 Differential expression in SAGE Fig. 1. (a)Two-sample t -test and χ values for all high count tags (40 or more total counts across all eight libraries). If the two tests were in agreement, we would see a ‘U-shape’ corresponding to genes being found equally extreme by both. Here, some of the most extreme values by one test (large χ values, or large absolute t -test values) are associated with at best weakly signiﬁcant values of the other. (b) Zoom on points with χ values below 50, indicated with a dashed line in (a). While the U-shape is clearly evident, it is more compressed than agreement would indicate. Most of the tags being ﬂagged as signiﬁcant by the χ test (values > 5) are not signiﬁcant according to the t -test (absolute values < 2). In effect, the t -test is going to the opposite extreme and focusing on the between-library variability at the expense of the within-library variability. In order to properly capture both types of variation, a compromise is needed. We introduce a beta-binomial model that includes both types of variation in a hierarchi- cal fashion: The proportion of a gene within a library is selected from an underlying beta distribution representing the normal between-library variation, and the count within that library is binomial with the chosen proportion as a parameter. Depending on the parameters of the beta model, this leads to estimates for the group proportions and associated variances that weight the different library proportions using values intermediate between equal weighting (all variation is between libraries) and weight- ing proportional to the library size (all variation is within libraries). METHODS Fig. 2. The log ratio of total (within plus between) variation to within variation as a function of log of the total tag count for the 2 For the sake of notational simplicity, we will focus on the LN+ group. The smooth line was ﬁt by binning along the x -axis case of modelling the counts of a speciﬁc tag within the one unit at a time, taking the mean (x , y) point within that bin, and ﬁrst group. Let n denote the total tag count in library i ﬁtting a loess smooth with span 5 to the resultant 12 points. Note of this group, and let p denote the true proportion of the the line crosses 1 (between is equal to within) at about 4 on the tag of interest within library i . Finally, let X denote the x -axis, corresponding to a raw count of 16. For larger count tags, corresponding count for the tag of interest. For the ﬁrst the between-library variation is clearly dominant, with the biggest 9.8 part of our model, we assume that the true proportions may multiple being about 2 ,or roughly 900-fold. vary from library to library. A standard distribution for proportions is the beta distribution, and we shall assume this here. The particular distribution used does not matter library variation. This can result in inappropriately large t a great deal. The main point is that the distribution is not values as the denominator of the test statistic goes to zero. necessarily degenerate: it can have a positive variance. (Most such cases occur when the total tag count is low, We will be focusing on the ﬁrst two moments of the so these are not apparent in Fig. 1 due to our ﬁltering.) various distributions throughout, both for computational 1479 K.A.Baggerly et al. simplicity and out of an intent to invoke the central limit At this point, we note that the optimal choice of weights theorem to get an approximately normal test statistic. is determined by a single relationship—the size of α + β Here, relative to n .Ifwe consider the extremes of this type of arrangement, letting α + β go to ∞ implies both that p ∼ Beta(α, β), E ( p ) = , i i the distribution of the p ’s is degenerate, so that there α + β is no change in the true proportion going from sample αβ to sample, and that in this case the optimal weighting is V ( p ) = . (α + β) (α + β + 1) proportional to the library size. If, on the other hand, the sum α + β is very small relative to the n values, then The second part of our model says that given the true i the imprecision in our knowledge of the proportion in a proportion in a sample, the corresponding count will given library is dwarfed by the imprecision due to library have a binomial distribution with the true proportion as to library variability, and the optimal weights are roughly a parameter: the same for all libraries. Thus, weighting by library size X | p ∼ Binomial(n , p ). i i i i and weighting equally represent the two extremes, and the true optimum lies somewhere in between. Note that Some straightforward algebra (Supplementary informa- the optimum weighting may be different for different tion) shows that the unconditional mean and variance of tags even if the same libraries are used! In theory, the the estimated proportion p ˆ = X /n are i i i weights are functions of ﬁxed α and β.In practice, as the parameters are unknown, estimation of the parameters and E ( p ˆ ) = , α + β the weights proceeds jointly. αβ 1 1 Now, the form of the weighting vector gives us the V ( p ˆ ) = + . estimated proportion for the group as (α + β)(α + β + 1) α + β n There are two components to the variance of the propor- p ˆ = w p ˆ . i i tion p ˆ (in square brackets above), and only one of them (the within-library variation) decreases as the library size It can be shown (Supplementary information) that an is increased. Now, given that we know the variance of a unbiased estimator of the variance of this proportion is single proportion, we turn to the mean and variance of a 2 2 2 2 weighted linear combination of proportions to see how to w p ˆ − w p ˆ i i i V = . combine the results from different libraries. unb 1 − w E w p ˆ i i When all of the w ’s are equal, this reduces to the standard α α unbiased estimator. This variance estimate is mostly right, = w E p ˆ = w = (1) i i i α + β α + β butitcan be too small—we know that the variance can never be less than the sampling variability. This lower V w p ˆ i i bound follows in turn from the assumptions that the libraries are assembled independently, and that sampling w αβ 1 1 = + . (2) within a library is also independent. These assumptions (α + β)(α + β + 1) α + β n strike us as reasonable, and we make them here. Allowing As long as the weights sum to 1, the combination has for this lower bound suggests the modiﬁed estimator the correct mean, so the focus shifts to choosing the   X X weights so as to minimize the associated variance. The i i 1 − n n i i constraint on the sum of the weights can be introduced into ˆ ˆ   V = max V , . unb the variance minimization problem through the method of Lagrange multipliers (Mathews and Walker, 1965), yielding There are slightly different lower bounds that could be constructed, but they all have the same leading term, V w p ˆ + λ 1 − w i i i X /( n ) .Werevisit this point below. i i ∂w In order to come up with a concrete number for a test αβ 1 1 statistic, we need to estimate the beta parameters. This can = 2w + − λ = 0 (α + β)(α + β + 1) α + β n i be done quickly using the method of moments, applied −1 to the unweighted sample proportions; this procedure can 1 1 → w ∝ + . i then be iterated as the weights provide revised estimates α + β n of the parameters. (Expressions for β and α ˆ can be found 1480 Differential expression in SAGE Table 2. Convergence of moment-based estimates of the beta parameters for (the variance doesn’t drop below our working ﬂoor) as we the LN+ values given in Table 1 attempt to account for additional variability. The test statistic that we propose for comparing groups A and B , then, is i 1234 5 p ˆ −ˆ p (i ) A B α 3.42 2.90 2.90 2.90 2.90 t = (i ) β 3184.1 3007.2 3015.9 3015.5 3015.5 ˆ ˆ V + V A B with p ˆ and V as deﬁned above. For testing signiﬁcance, it is useful to be able to specify the null distribution by manipulating Equations (1) and (2) to isolate the of a test statistic. This is somewhat difﬁcult here in parameters as functions of the moments.) Consider the that the distribution of the estimated proportion within case of the LN+ proportions in the example given earlier. a group depends on the relative sizes of the between (0) and within variation. If the within-library variation is w = = (0.249, 0.239, 0.229, 0.237, 0.046). predominant, then the shape of the distribution is largely driven by the total counts within a group, and if these (0) (0) p ˆ = w p ˆ = 0.00107 counts are reasonable (say 15 or more in each group) (0) (0) 2 2 2 (0) 2 then the binomial distribution will be roughly normal (w ) p ˆ − (w ) ( p ˆ ) i i i (0) and a Z distribution can be used. This follows from an V = (0) appeal to the central limit theorem, the rough bell-shape 1 − (w ) of the binomial distribution, and the fact that the degrees = 7.995e − 06 of freedom used for estimating this component of the (0) (0) (0) 2 (0) p ˆ (1 −ˆ p ) (w ) − V variation are very close to the total number of counts. (1) i β = −1 (0) If, however, the between-library variation is dominant, (0) (0) (0) 2 V 1 −ˆ p −ˆ p (w ) /n then the dominant effect in many cases will be the small = 3184.06 number of different libraries used to estimate this variance; (0) p ˆ in this case we are driven to a t distribution with, say, n − (1) (1) α ˆ = β = 3.42 (0) 1degrees of freedom for group A. The above discussion 1 −ˆ p treats the distribution of the estimate for a single group, (1) (1) α ˆ + β n i and we need to combine the results for two groups. We (1) w ∝ still treat the overall distribution as roughly t in nature, (1) (1) α ˆ + β + n butaswe are using separate variance estimates for the two ∝ (0.205, 0.205, 0.204, 0.205, 0.181). groups, so that the overall variance estimate is not pooled, we fall back on a Satterthwaite (1946) approximation to Empirically, convergence is quite rapid, as is shown for compute the degrees of freedom: this example in Table 2. Here, the size of the sum of the beta parameters (about ˆ ˆ (V + V ) A B df = . 3K) relative to the library sizes (about 100 K) suggests 2 2 ˆ ˆ V V A B that the between-library variability is roughly 30 times n −1 n −1 A B the within-library variability. Special note needs to be Small counts in each group should force us to account made of the case where the method of moments fails— for the asymmetric nature of the underlying distribution when the variability of the sample proportions is less more directly, and in this region a test such as that than that known to be present due to sampling variability. proposed by Audic and Claverie Audic and Claverie In this case, it is instructive to look at the likelihood (1997) seems reasonable. function. The likelihood function shows that the ratio α/(α + β) corresponding to the mean proportion is well DISCUSSION characterized, but the sum α + β diverges to ∞ if we attempt to ﬁnd a maximum. In this case, the underlying Between-library variability is non-negligible for SAGE maximum likelihood beta distribution is a degenerate data. Indeed, for the higher count data, the between- point mass, suggesting that the proper course of action library variability is the dominant part of the variation. is to ignore the between library variability and work just Pooling libraries in a group deals with between-library with the within-library variability. This is precisely when variation improperly, and can result in a bias towards we shift between different estimates of variance above; calling high-count tags ‘signiﬁcantly different’. We have consequently the estimates do not become more precise proposed a statistic, t , that incorporates between-library 1481 K.A.Baggerly et al. Our approach works one tag at a time. It may be possible to improve inference further by working with an ensemble approach that attempts to estimate parameters for all of the genes at once. This is the potential of ‘borrowing strength’ across genes to improve estimates throughout. Such borrowing has been used to good effect in estimating variances associated with microarray readings (see Baggerly et al., 2001; Newton et al., 2001, and Hughes et al., 2000, among others). In the microarray context, this borrowing was achieved by grouping genes according to intensity. Likewise, SAGE tags could be grouped according to relative abundance and estimation of the between-library variation assessed for the group. Our approach dodges the small-variance problem by reverting to using just the sampling variance when the other estimate gives a value that is too small on its face. It is possible to treat this in a more rigorous and coherent fashion using a full-blown Bayesian approach that addresses the uncertainty in our estimates of α and Fig. 3. Our test statistic as a function of log tag count for the LN+ β by simulating draws from the posterior distribution (see vs LN− comparison. Unlike the χ test, this test is not overly biased Gelman et al. 1995, p. 130 for a discussion of how this towards giving larger values to larger count tags. The distribution might be done here). This approach, however, requires appears roughly uniform, with granularity creeping in at the low the additional speciﬁcation of a prior and signiﬁcant counts as the normal approximation fails. computational overhead (several orders of magnitude beyond that required here). An additional complexity is that a completely non-informative prior can lead to an variability in addition to the within-library variabil- improper posterior in this context. For genes of particular ity already well-treated by previous methods; indeed, interest, the freely available BUGS software may be able our method reduces to that of Kal et al. (1999) in the to provide this type of approach without the need for much special case when between-library variability can be coding on the user’s part. We are exploring this. neglected. The ﬁnal distribution of the t values com- The special case where each group contains just one paring LN+ and LN− is shown in Figure 3, with the library can be treated by setting the weight w to 1 counts log -transformed to make the structure more in each group. As the within-group variance is smaller apparent. The distribution appears largely stable as a than the sampling variability, this would default to using function of tag count, so larger counts are not getting the sampling variability only. This weighting approach ‘more signiﬁcant’ just by default. The most extreme suggests a difﬁculty with one versus one comparisons if count tags (including our example tag) no longer appear the between-library variability is suspected to be large. as signiﬁcantly different. There is some granularity When we have just one library in each group, the degrees at the low counts where the Normal approximation of freedom in our t -statistic formulation drop to zero, is breaking down. By contrast, the two-sample t -test reﬂecting the fact that any differences we see could be deals with between-library variability but ignores the due to either the biological change of interest or to within-library variability. As the former is typically the normal variability, but we can’t tell which without an larger part of the variance, our t statistic is typically w assessment of this variation. Useful inferences in this case much closer to the t -test than to the pooled tests. Small- rely on prior assumptions about the scale of change to be scale simulation results (Supplementary information) expected or on implicitly borrowing strength across genes conﬁrm that the pooled tests have poor speciﬁcity in the by looking at which ones are ‘the most different’. This in presence of overdispersion, and show that the sensitivity turn treats the experimental results as supplying a ranking and speciﬁcity of t are overall very similar to the of interest rather than a straight signiﬁcance value. This two-sample t -test, with the differences showing mostly ranking viewpoint is reasonable in light of the multiple in isolated cases. Power comparisons of various tests testing problems inherent in checking thousands of genes. presuppose that the type I error rate has been ﬁxed; in Finally, our approach treats all of the libraries within a the presence of overdispersion we have seen pooled group as similar enough that the observed variation can tests with nominal 0.5% type I error rates have actual be described as ‘normal variation within the population rates of about 70%. of interest’. There may be additional known covariates 1482 Differential expression in SAGE Lash,A.E., Tolstoshev,C.M., Wagner,L., Schuler,G.D., that could account for much of this if they were included Strausberg,R.L., Riggins,G.J. and Altschul,S.F. (2000) in modelling the data, but this leads to a more involved SAGEmap: a public gene expression resource. Genome assessment of the variance structure, with pieces going Res., 10, 1051–1060. to each of these included factors. Our approach provides Madden,S.L., Galella,E.A., Zhu,J., Bertelsen,A.H. and perhaps the simplest way of incorporating between-library Beaudry,G.A. (1997) SAGE transcript proﬁles for p53- variability from a host of sources, and the variance bound dependent growth regulation. Oncogene, 15, 1079–1085. we have imposed is inherently at least as conservative as Man,M.Z., Wang,X. and Wang,Y. (2000) POWER SAGE: compar- other tests. That said, there are other ways of arriving ing statistical tests for SAGE experiments. Bioinformatics, 16, at similar test statistics (e.g. via weighted least squares; 953–959. Neter et al., 1996, p. 400) that may provide greater Margulies,E.H. and Innis,J.W. (2000) eSAGE: managing and an- ﬂexibility in modelling SAGE data, and we are working alyzing data generated with serial analysis of gene expression on this now. (SAGE). Bioinformatics, 16, 650–651. Margulies,E.H., Kardia,S.L.R. and Innis,J.W. (2001) A comparative molecular analysis of developing mouse forelimbs and hindlimbs ACKNOWLEDGEMENTS using serial analysis of gene expression (SAGE). Genome Res., The authors gratefully acknowledge support from NIH- 11, 1686–1698. NCI Grant 1U19 CA84978-1A1. Mathews,J. and Walker,R.L. (1965) Mathematical Methods of Physics.Benjamin, New York. REFERENCES Michiels,E.M.C., Oussoren,E., van Groenigen,M., Pauws,E., Bossuyt,P.M.M., Voute,P ˆ .A. and Baas,F. (1999) Genes differ- Audic,S. and Claverie,J.-M. (1997) The signiﬁcance of digital gene entially expressed in medulloblastoma and fetal brain. Physiol. expression proﬁles. Genome Res., 7, 986–995. Genomics, 1, 83–91. Baggerly,K.A., Coombes,K.R., Hess,K.R., Stivers,D.N., Neter,J., Kutner,M.H., Nachtsheim,C.J. and Wasserman,W. (1996) Abruzzo,L.V. and Zhang,W. (2001) Identifying differen- Applied Linear Statistical Models,Fourth edition, Irwin, New tially expressed genes in cDNA microarray experiments. J. York. Comput. Biol., 8, 639–659. Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and Chen,H., Centola,M., Altschul,S.F. and Metzger,H. (1998) Charac- Tsui,K.W. (2001) On differential variability of expression ra- terization of gene expression in resting and activated mast cells. tios: improving statistical inference about gene expression J. Exp. Med., 188, 1657–1668. changes from microarray data. J. Comput. Biol., 8, 37–52. Ewing,B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling Ryu,B., Jones,J., Blades,N.J., Parmigiani,G., Hollingsworth,M.A., of automated sequencer traces using Phred. I. Accuracy assess- Hruban,R.H. and Kern,S.E. (2002) Relationships and differen- ment. Genome Res., 8, 175–185. tially expressed genes among pancreatic cancers examined by Gelman,A., Carlin,J.B., Stern,H.S. and Rubin,D.B. (1995) Bayesian large-scale serial analysis of gene expression. Cancer Res., 62, Data Analysis. Chapman and Hall, New York. 819–826. Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., Satterthwaite,F.E. (1946) An approximate distribution of estimates Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D. et al. of variance components. Biometrics Bulletin, 2, 110–114. (2000) Functional discovery via a compendium of expression Stekel,D.J., Git,Y. and Falciani,F. (2000) The comparison of gene proﬁles. Cell, 102, 109–126. expression from multiple cDNA libraries. Genome Res., 10, Kal,A.J., van Zonneveld,A.J., Benes,V., van den Berg,M., 2055–2061. Koerkamp,M.G., Albermann,K., Strack,N., Ruijter,J.M., Stollberg,J., Urschitz,J., Urban,Z. and Boyd,C.D. (2000) A quanti- Richter,A., Dujon,B. et al. (1999) Dynamics of gene expression tative evaluation of SAGE. Genome Res., 10, 1241–1248. revealed by comparison of serial analysis of gene expression Velculescu,V.E., Zhang,L., Vogelstein,B. and Kinzler,K.W. (1995) transcript proﬁles from yeast grown on two different carbon Serial analysis of gene expression. Science, 270, 484–487. sources. Mol. Biol. Cell, 10, 1859–1872. Zhang,L., Zhou,W., Velculescu,V.E., Kern,S.E., Hruban,R.H., Lal,A., Lash,A.E., Altschul,S.F., Velculescu,V., Zhang,L., Hamilton,S.R., Vogelstein,B. and Kinzler,K.W. (1997) Gene ex- McLendon,R.E., Marra,M.A., Prange,C., Morin,P.J., Polyak,K. pression proﬁles in normal and cancer cells. Science, 276, 1268– et al. (1999) A public database for gene expression in human cancers. Cancer Res., 59, 5403–5407. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/differential-expression-in-sage-accounting-for-normal-between-library-Tznn0G0m3j

Loading next page...

References (21)

J. Powell (1965)
Mathematical Methods in Physics
Physics Bulletin, 16
K. Baggerly, K. Coombes, K. Hess, D. Stivers, L. Abruzzo, Wei Zhang (2001)
Identifying Differentially Expressed Genes in cDNA Microarray Experiments
Journal of computational biology : a journal of computational molecular cell biology, 8 6
F. Satterthwaite (1946)
An approximate distribution of estimates of variance components.
Biometrics, 2 6
S. Audic, J. Claverie (1997)
The significance of digital gene expression profiles.
Genome research, 7 10
A. Kal, A. Zonneveld, V. Beneš, M. Berg, M. Koerkamp, K. Albermann, N. Strack, J. Ruijter, A. Richter, B. Dujon, W. Ansorge, H. Tabak (1999)
Dynamics of gene expression revealed by comparison of serial analysis of gene expression transcript profiles from yeast grown on two different carbon sources.
Molecular biology of the cell, 10 6
Michael Man, Xuning Wang, Yixin Wang (2000)
POWER_SAGE: comparing statistical tests for SAGE experiments
Bioinformatics, 16 11
Ji Huang, Hongsheng Zhang, Dong Wang, Ya-Jun Cao, Jin-shui Yang (2002)
[Serial analysis of gene expression].
Yi chuan = Hereditas, 24 2
Michael Newton, C. Kendziorski, Craig Richmond, F. Blattner, Kam-Wah Tsui (2001)
On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data
Journal of computational biology : a journal of computational molecular cell biology, 8 1
(1996)
Applied Linear Statistical Models, Fourth edition
E. Margulies, S. Kardia, J. Innis (2001)
A comparative molecular analysis of developing mouse forelimbs and hindlimbs using serial analysis of gene expression (SAGE).
Genome research, 11 10
J. Stollberg, J. Urschitz, Z. Urban, C. Boyd (2000)
A quantitative evaluation of SAGE.
Genome research, 10 8
S. Madden, E. Galella, Jingshi Zhu, A. Bertelsen, G. Beaudry (1997)
SAGE transcript profiles for p53-dependent growth regulation
Oncogene, 15
D. Stekel, Y. Git, F. Falciani (2000)
The comparison of gene expression from multiple cDNA libraries.
Genome research, 10 12
E. Margulies, J. Innis (2000)
eSAGE: managing and analysing data generated with Serial Analysis of Gene Expression (SAGE)
Bioinformatics, 16 7
A. Lash, C. Tolstoshev, L. Wagner, G. Schuler, R. Strausberg, G. Riggins, S. Altschul (2000)
SAGEmap: a public gene expression resource.
Genome research, 10 7
Bland Ewing, L. Hillier, M. Wendl, Philip Green (1998)
Base-calling of automated sequencer traces using phred. I. Accuracy assessment.
Genome research, 8 3
A. Lal, A. Lash, S. Altschul, V. Velculescu, Lin Zhang, R. McLendon, Marco Marra, C. Prange, P. Morin, P. Morin, K. Polyak, N. Papadopoulos, B. Vogelstein, K. Kinzler, R. Strausberg, G. Riggins (1999)
A public database for gene expression in human cancers.
Cancer research, 59 21
B. Ryu, Jessa Jones, Natalie Blades, G. Parmigiani, M. Hollingsworth, R. Hruban, S. Kern (2002)
Relationships and differentially expressed genes among pancreatic cancers examined by large-scale serial analysis of gene expression.
Cancer research, 62 3
T. Hughes, M. Marton, Allan Jones, C. Roberts, R. Stoughton, C. Armour, Holly Bennett, E. Coffey, H. Dai, Yudong He, M. Kidd, Amy King, Michael Meyer, D. Slade, P. Lum, S. Stepaniants, D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, S. Friend (2000)
Functional Discovery via a Compendium of Expression Profiles
Cell, 102
E. Michiels, Esmeralda Oussoren, M. Groenigen, Erwin Pauws, P. Bossuyt, P. Voute, F. Baas (1999)
Genes differentially expressed in medulloblastoma and fetal brain.
Physiological genomics, 1 2
Lin Zhang, Wei Zhou, V. Velculescu, S. Kern, R. Hruban, S. Hamilton, B. Vogelstein, K. Kinzler (1997)
Gene expression profiles in normal and cancer cells.
Science, 276 5316

Publisher: Oxford University Press
Copyright: © Oxford University Press 2003
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/btg173
Publisher site: See Article on Publisher Site

Abstract

Vol. 19 no. 12 2003, pages 1477–1483 BIOINFORMATICS DOI: 10.1093/bioinformatics/btg173 Differential expression in SAGE: accounting for normal between-library variation 1,∗ 2 1 Keith A. Baggerly ,Li Deng , Jeffrey S. Morris and C. Marcelo Aldaz Department of Biostatistics, UT M.D. Anderson Cancer Center, 1515 Holcombe Blvd, Box 447, Houston, TX 77030-4009, USA, Department of Statistics, Rice University, Houston TX 77005, USA and Department of Carcinogenesis, UT M.D. Anderson Cancer Center, 1515 Holcombe Blvd, Box 447, Houston, TX 77030-4009, USA Received on October 24, 2002; revised on January 30, 2003; accepted on February 16, 2003 ABSTRACT with different pathologies, the question of most interest is Motivation: In contrasting levels of gene expression whether a given tag is differentially expressed. between groups of SAGE libraries, the libraries within each Most methods currently advanced for assessing dif- group are often combined and the counts for the tag of ferential expression in SAGE address the case where interest summed, and inference is made on the basis one library is contrasted with another, assuming a null of these larger ‘pseudolibraries’. While this captures the hypothesis that there is no difference between the libraries sampling variability inherent in the procedure, it fails to being compared. Under this assumption, the chance of a allow for normal variation in levels of the gene between single tag falling in one library or the other is roughly individuals within the same group, and can consequently proportional to the library size. Differing approximations overstate the signiﬁcance of the results. The effect is not lead to modelling this behavior with binomial (Zhang et slight: between-library variation can be hundreds of times al., 1997) or Poisson distributions (Madden et al., 1997), the within-library variation. normal approximations (Madden et al., 1997; Kal et al., Results: We introduce a beta-binomial sampling model 1999; Michiels et al., 1999; Man et al., 2000) or simu- that correctly incorporates both sources of variation. We lations involving permutation tests (Zhang et al., 1997). show how to ﬁt the parameters of this model, and introduce Bayesian approaches have been suggested by Audic and a test statistic for differential expression similar to a two- Claverie (1997), and by Chen et al. (1998) (the latter sample t -test. method was adapted by Lal et al. (1999) to accommodate Contact: kabagg@mdanderson.org unequal library sizes). Of these, the simulation approach Supplementary information: http://bioinformatics. of Zhang et al. (1997) and the Bayesian approach of Lal mdanderson.org/ Includes Matlab and R code for ﬁtting et al. (1999) (see also Lash et al., 2000) are probably the model. the most widely used, due to their implementation in easily accessible software (the SAGE 2000 software INTRODUCTION available from the Kinzler Laboratory at Johns Hopkins, The Serial Analysis of Gene Expression (SAGE) method- and the routine implemented in SAGEmap at the NCBI, ology introduced by Velculescu et al. (1995) supplies data respectively). As noted in the comparison conducted on gene expression in the form of a table of counts. Brieﬂy, by Man et al. (2000) on several of the above methods, however, very similar results are obtained when the num- mRNA transcripts are converted to cDNA and then pro- bers of tags are large (>20); the authors contend that a cessed so as to isolate short (typically 10 or 14 bp) repre- normal approximation (Kal et al., 1999) or equivalently a sentative ‘tags’. Ideally, these tags should provide enough chi-squared test has more power for detecting differences information to uniquely identify the source mRNA, and to when the numbers of tags are small (<15). The validity a ﬁrst approximation this is correct. Tags are sampled and of small tag comparisons is, however, questionable due to sequenced, and a ‘library’ consisting of the tags seen and the presence of sequencing errors (Stollberg et al., 2000) their respective frequencies is constructed for each cellu- though this may be somewhat ameliorated if there is some lar sample. Given a set of libraries derived from samples external measure of quality for the read, such as a phred To whom correspondence should be addressed. score (Margulies and Innis, 2000; Margulies et al., 2001; Bioinformatics 19(12) c Oxford University Press 2003; all rights reserved. 1477 K.A.Baggerly et al. Table 1. Counts and proportions of tag AGGTCAGGAG in eight breast tumor libraries, ﬁve lymph node positive and three lymph node negative Library 1T+ 3T+ 4T+ 6T+ 8T+ 7T− 9T− 10T− Tag count 129 167 71 61 6 43 247 509 Library size 100 474 96 631 92 510 95 785 18 705 95 155 91 593 98 220 Proportions (%) 0.13 0.17 0.08 0.06 0.03 0.05 0.27 0.52 Ewing et al., 1998). We note that the statistic suggested the χ value (Michiels et al., 1999; Man et al., 2000) is by Kal et al. (1999) is 279.98; the 95% cutoff for this distribution is 3.84, so this is obviously a ‘signiﬁcant’ result. The equivalence p − p A B Z = , with of the above tests noted by Man et al. (2000) for high p (1− p ) p (1− p ) 0 0 0 0 + count tags means that the other tests will catch the same N N A B genes. Checking the sign of the test statistic proposed by X X X + X A B A B Kal et al. (1999) suggests that this tag is more strongly p = , p = , p = , A B 0 N N N + N A B A B expressed in LN− tumors. However, if we follow Ryu et al. (2002) and compare the eight proportions using where the two library sizes are N and N and the two √ A B atwo-sample t -test, t = ( p − p )/ V + V , with A B A B counts are X and X , respectively. The statistic we A B p being the average of the ﬁve proportions in group A, propose below will have a similar form. V being the sample variance of these ﬁve proportions, When the number of libraries involved is more than two, and p and V likewise deﬁned, we get a test statistic B B an omnibus test for differential expression such as that value of −1.3174. The 95% cutoffs for a t distribution suggested by Stekel et al. (2000) can be performed, but are ±2.4469, so this is a decidedly ‘insigniﬁcant’ result. interest is more often centered on comparing groups of While the mean proportion is higher for the LN− tumors, libraries where the group membership is known a priori. this is mostly being driven by results from a single library In this paper we focus on the comparison of two groups (10T) so that the variability within the LN- group is high. of libraries. When there are two groups of libraries being The ﬁrst approach fails to take into account the variability compared, the most common approach is to reduce the between like libraries which the t -test correctly captures. number of effective libraries to two by pooling the libraries Shifting between test types (χ and two-sample t)gives of like type, and reverting to the two-library comparison a different view of which tags are important, as can be seen form. This is not universal, and cautionary statements have in Figure 1. Most of the tags being ﬂagged as signiﬁcant been made. Lash et al. (2000) recommend checking for a by the χ test (values >5) are not signiﬁcant according to low coefﬁcient of variation before applying the SAGEmap the t -test (absolute values <2). procedure to grouped libraries. Ryu et al. (2002) use a Between-library variability within a group can often series of ﬁlters to deal with groups of pancreatic libraries, be as large in magnitude as the within-library variability the ﬁrst of which is a two-sample t -test applied to the due to sampling. Indeed, for the higher count data, the proportions. between-library variability is the dominant part of the These pooling approaches do catch differences, but variation. We can see this by surveying all of the tags they can overstate the signiﬁcance of the results by in the LN+ group, plotting the total (between + within) ignoring the role of normal variation in expression levels library sample variability as a multiple of the within- between like samples. As an example, we consider the library variability, with both quantities on the log scale case of a single prevalent tag in eight breast libraries that to make the structure more apparent. This is shown in have been assembled in the Aldaz Laboratory. This tag, Figure 2. For the high count tags the between-library AGGTCAGGAG, has multiple matches and hence is not variance clearly wins. Similar qualitative results hold for immediately biologically informative, but it will serve to illustrate the point. All of these libraries are derived from the LN- group (not shown). breast tumors; the ﬁrst ﬁve are from patients found to be While the two-sample t -statistic applied to the different lymph node positive (LN+), and the remaining three from normalized library proportions illustrates the problem, patients found to be lymph node negative (LN−). The tag it is too crude a tool to provide a solution in and of counts and proportions in the various libraries are given in itself: it weights the proportions from all libraries equally, Table 1. If we combine the ﬁve LN+ libraries and three even though the estimates from larger libraries are less LN− libraries and compare the resulting tag proportions, variable, and it is possible for the sample variance of the we are comparing 434/404105 to 799/284968, for which normalized proportions to be less than the known within- 1478 Differential expression in SAGE Fig. 1. (a)Two-sample t -test and χ values for all high count tags (40 or more total counts across all eight libraries). If the two tests were in agreement, we would see a ‘U-shape’ corresponding to genes being found equally extreme by both. Here, some of the most extreme values by one test (large χ values, or large absolute t -test values) are associated with at best weakly signiﬁcant values of the other. (b) Zoom on points with χ values below 50, indicated with a dashed line in (a). While the U-shape is clearly evident, it is more compressed than agreement would indicate. Most of the tags being ﬂagged as signiﬁcant by the χ test (values > 5) are not signiﬁcant according to the t -test (absolute values < 2). In effect, the t -test is going to the opposite extreme and focusing on the between-library variability at the expense of the within-library variability. In order to properly capture both types of variation, a compromise is needed. We introduce a beta-binomial model that includes both types of variation in a hierarchi- cal fashion: The proportion of a gene within a library is selected from an underlying beta distribution representing the normal between-library variation, and the count within that library is binomial with the chosen proportion as a parameter. Depending on the parameters of the beta model, this leads to estimates for the group proportions and associated variances that weight the different library proportions using values intermediate between equal weighting (all variation is between libraries) and weight- ing proportional to the library size (all variation is within libraries). METHODS Fig. 2. The log ratio of total (within plus between) variation to within variation as a function of log of the total tag count for the 2 For the sake of notational simplicity, we will focus on the LN+ group. The smooth line was ﬁt by binning along the x -axis case of modelling the counts of a speciﬁc tag within the one unit at a time, taking the mean (x , y) point within that bin, and ﬁrst group. Let n denote the total tag count in library i ﬁtting a loess smooth with span 5 to the resultant 12 points. Note of this group, and let p denote the true proportion of the the line crosses 1 (between is equal to within) at about 4 on the tag of interest within library i . Finally, let X denote the x -axis, corresponding to a raw count of 16. For larger count tags, corresponding count for the tag of interest. For the ﬁrst the between-library variation is clearly dominant, with the biggest 9.8 part of our model, we assume that the true proportions may multiple being about 2 ,or roughly 900-fold. vary from library to library. A standard distribution for proportions is the beta distribution, and we shall assume this here. The particular distribution used does not matter library variation. This can result in inappropriately large t a great deal. The main point is that the distribution is not values as the denominator of the test statistic goes to zero. necessarily degenerate: it can have a positive variance. (Most such cases occur when the total tag count is low, We will be focusing on the ﬁrst two moments of the so these are not apparent in Fig. 1 due to our ﬁltering.) various distributions throughout, both for computational 1479 K.A.Baggerly et al. simplicity and out of an intent to invoke the central limit At this point, we note that the optimal choice of weights theorem to get an approximately normal test statistic. is determined by a single relationship—the size of α + β Here, relative to n .Ifwe consider the extremes of this type of arrangement, letting α + β go to ∞ implies both that p ∼ Beta(α, β), E ( p ) = , i i the distribution of the p ’s is degenerate, so that there α + β is no change in the true proportion going from sample αβ to sample, and that in this case the optimal weighting is V ( p ) = . (α + β) (α + β + 1) proportional to the library size. If, on the other hand, the sum α + β is very small relative to the n values, then The second part of our model says that given the true i the imprecision in our knowledge of the proportion in a proportion in a sample, the corresponding count will given library is dwarfed by the imprecision due to library have a binomial distribution with the true proportion as to library variability, and the optimal weights are roughly a parameter: the same for all libraries. Thus, weighting by library size X | p ∼ Binomial(n , p ). i i i i and weighting equally represent the two extremes, and the true optimum lies somewhere in between. Note that Some straightforward algebra (Supplementary informa- the optimum weighting may be different for different tion) shows that the unconditional mean and variance of tags even if the same libraries are used! In theory, the the estimated proportion p ˆ = X /n are i i i weights are functions of ﬁxed α and β.In practice, as the parameters are unknown, estimation of the parameters and E ( p ˆ ) = , α + β the weights proceeds jointly. αβ 1 1 Now, the form of the weighting vector gives us the V ( p ˆ ) = + . estimated proportion for the group as (α + β)(α + β + 1) α + β n There are two components to the variance of the propor- p ˆ = w p ˆ . i i tion p ˆ (in square brackets above), and only one of them (the within-library variation) decreases as the library size It can be shown (Supplementary information) that an is increased. Now, given that we know the variance of a unbiased estimator of the variance of this proportion is single proportion, we turn to the mean and variance of a 2 2 2 2 weighted linear combination of proportions to see how to w p ˆ − w p ˆ i i i V = . combine the results from different libraries. unb 1 − w E w p ˆ i i When all of the w ’s are equal, this reduces to the standard α α unbiased estimator. This variance estimate is mostly right, = w E p ˆ = w = (1) i i i α + β α + β butitcan be too small—we know that the variance can never be less than the sampling variability. This lower V w p ˆ i i bound follows in turn from the assumptions that the libraries are assembled independently, and that sampling w αβ 1 1 = + . (2) within a library is also independent. These assumptions (α + β)(α + β + 1) α + β n strike us as reasonable, and we make them here. Allowing As long as the weights sum to 1, the combination has for this lower bound suggests the modiﬁed estimator the correct mean, so the focus shifts to choosing the   X X weights so as to minimize the associated variance. The i i 1 − n n i i constraint on the sum of the weights can be introduced into ˆ ˆ   V = max V , . unb the variance minimization problem through the method of Lagrange multipliers (Mathews and Walker, 1965), yielding There are slightly different lower bounds that could be constructed, but they all have the same leading term, V w p ˆ + λ 1 − w i i i X /( n ) .Werevisit this point below. i i ∂w In order to come up with a concrete number for a test αβ 1 1 statistic, we need to estimate the beta parameters. This can = 2w + − λ = 0 (α + β)(α + β + 1) α + β n i be done quickly using the method of moments, applied −1 to the unweighted sample proportions; this procedure can 1 1 → w ∝ + . i then be iterated as the weights provide revised estimates α + β n of the parameters. (Expressions for β and α ˆ can be found 1480 Differential expression in SAGE Table 2. Convergence of moment-based estimates of the beta parameters for (the variance doesn’t drop below our working ﬂoor) as we the LN+ values given in Table 1 attempt to account for additional variability. The test statistic that we propose for comparing groups A and B , then, is i 1234 5 p ˆ −ˆ p (i ) A B α 3.42 2.90 2.90 2.90 2.90 t = (i ) β 3184.1 3007.2 3015.9 3015.5 3015.5 ˆ ˆ V + V A B with p ˆ and V as deﬁned above. For testing signiﬁcance, it is useful to be able to specify the null distribution by manipulating Equations (1) and (2) to isolate the of a test statistic. This is somewhat difﬁcult here in parameters as functions of the moments.) Consider the that the distribution of the estimated proportion within case of the LN+ proportions in the example given earlier. a group depends on the relative sizes of the between (0) and within variation. If the within-library variation is w = = (0.249, 0.239, 0.229, 0.237, 0.046). predominant, then the shape of the distribution is largely driven by the total counts within a group, and if these (0) (0) p ˆ = w p ˆ = 0.00107 counts are reasonable (say 15 or more in each group) (0) (0) 2 2 2 (0) 2 then the binomial distribution will be roughly normal (w ) p ˆ − (w ) ( p ˆ ) i i i (0) and a Z distribution can be used. This follows from an V = (0) appeal to the central limit theorem, the rough bell-shape 1 − (w ) of the binomial distribution, and the fact that the degrees = 7.995e − 06 of freedom used for estimating this component of the (0) (0) (0) 2 (0) p ˆ (1 −ˆ p ) (w ) − V variation are very close to the total number of counts. (1) i β = −1 (0) If, however, the between-library variation is dominant, (0) (0) (0) 2 V 1 −ˆ p −ˆ p (w ) /n then the dominant effect in many cases will be the small = 3184.06 number of different libraries used to estimate this variance; (0) p ˆ in this case we are driven to a t distribution with, say, n − (1) (1) α ˆ = β = 3.42 (0) 1degrees of freedom for group A. The above discussion 1 −ˆ p treats the distribution of the estimate for a single group, (1) (1) α ˆ + β n i and we need to combine the results for two groups. We (1) w ∝ still treat the overall distribution as roughly t in nature, (1) (1) α ˆ + β + n butaswe are using separate variance estimates for the two ∝ (0.205, 0.205, 0.204, 0.205, 0.181). groups, so that the overall variance estimate is not pooled, we fall back on a Satterthwaite (1946) approximation to Empirically, convergence is quite rapid, as is shown for compute the degrees of freedom: this example in Table 2. Here, the size of the sum of the beta parameters (about ˆ ˆ (V + V ) A B df = . 3K) relative to the library sizes (about 100 K) suggests 2 2 ˆ ˆ V V A B that the between-library variability is roughly 30 times n −1 n −1 A B the within-library variability. Special note needs to be Small counts in each group should force us to account made of the case where the method of moments fails— for the asymmetric nature of the underlying distribution when the variability of the sample proportions is less more directly, and in this region a test such as that than that known to be present due to sampling variability. proposed by Audic and Claverie Audic and Claverie In this case, it is instructive to look at the likelihood (1997) seems reasonable. function. The likelihood function shows that the ratio α/(α + β) corresponding to the mean proportion is well DISCUSSION characterized, but the sum α + β diverges to ∞ if we attempt to ﬁnd a maximum. In this case, the underlying Between-library variability is non-negligible for SAGE maximum likelihood beta distribution is a degenerate data. Indeed, for the higher count data, the between- point mass, suggesting that the proper course of action library variability is the dominant part of the variation. is to ignore the between library variability and work just Pooling libraries in a group deals with between-library with the within-library variability. This is precisely when variation improperly, and can result in a bias towards we shift between different estimates of variance above; calling high-count tags ‘signiﬁcantly different’. We have consequently the estimates do not become more precise proposed a statistic, t , that incorporates between-library 1481 K.A.Baggerly et al. Our approach works one tag at a time. It may be possible to improve inference further by working with an ensemble approach that attempts to estimate parameters for all of the genes at once. This is the potential of ‘borrowing strength’ across genes to improve estimates throughout. Such borrowing has been used to good effect in estimating variances associated with microarray readings (see Baggerly et al., 2001; Newton et al., 2001, and Hughes et al., 2000, among others). In the microarray context, this borrowing was achieved by grouping genes according to intensity. Likewise, SAGE tags could be grouped according to relative abundance and estimation of the between-library variation assessed for the group. Our approach dodges the small-variance problem by reverting to using just the sampling variance when the other estimate gives a value that is too small on its face. It is possible to treat this in a more rigorous and coherent fashion using a full-blown Bayesian approach that addresses the uncertainty in our estimates of α and Fig. 3. Our test statistic as a function of log tag count for the LN+ β by simulating draws from the posterior distribution (see vs LN− comparison. Unlike the χ test, this test is not overly biased Gelman et al. 1995, p. 130 for a discussion of how this towards giving larger values to larger count tags. The distribution might be done here). This approach, however, requires appears roughly uniform, with granularity creeping in at the low the additional speciﬁcation of a prior and signiﬁcant counts as the normal approximation fails. computational overhead (several orders of magnitude beyond that required here). An additional complexity is that a completely non-informative prior can lead to an variability in addition to the within-library variabil- improper posterior in this context. For genes of particular ity already well-treated by previous methods; indeed, interest, the freely available BUGS software may be able our method reduces to that of Kal et al. (1999) in the to provide this type of approach without the need for much special case when between-library variability can be coding on the user’s part. We are exploring this. neglected. The ﬁnal distribution of the t values com- The special case where each group contains just one paring LN+ and LN− is shown in Figure 3, with the library can be treated by setting the weight w to 1 counts log -transformed to make the structure more in each group. As the within-group variance is smaller apparent. The distribution appears largely stable as a than the sampling variability, this would default to using function of tag count, so larger counts are not getting the sampling variability only. This weighting approach ‘more signiﬁcant’ just by default. The most extreme suggests a difﬁculty with one versus one comparisons if count tags (including our example tag) no longer appear the between-library variability is suspected to be large. as signiﬁcantly different. There is some granularity When we have just one library in each group, the degrees at the low counts where the Normal approximation of freedom in our t -statistic formulation drop to zero, is breaking down. By contrast, the two-sample t -test reﬂecting the fact that any differences we see could be deals with between-library variability but ignores the due to either the biological change of interest or to within-library variability. As the former is typically the normal variability, but we can’t tell which without an larger part of the variance, our t statistic is typically w assessment of this variation. Useful inferences in this case much closer to the t -test than to the pooled tests. Small- rely on prior assumptions about the scale of change to be scale simulation results (Supplementary information) expected or on implicitly borrowing strength across genes conﬁrm that the pooled tests have poor speciﬁcity in the by looking at which ones are ‘the most different’. This in presence of overdispersion, and show that the sensitivity turn treats the experimental results as supplying a ranking and speciﬁcity of t are overall very similar to the of interest rather than a straight signiﬁcance value. This two-sample t -test, with the differences showing mostly ranking viewpoint is reasonable in light of the multiple in isolated cases. Power comparisons of various tests testing problems inherent in checking thousands of genes. presuppose that the type I error rate has been ﬁxed; in Finally, our approach treats all of the libraries within a the presence of overdispersion we have seen pooled group as similar enough that the observed variation can tests with nominal 0.5% type I error rates have actual be described as ‘normal variation within the population rates of about 70%. of interest’. There may be additional known covariates 1482 Differential expression in SAGE Lash,A.E., Tolstoshev,C.M., Wagner,L., Schuler,G.D., that could account for much of this if they were included Strausberg,R.L., Riggins,G.J. and Altschul,S.F. (2000) in modelling the data, but this leads to a more involved SAGEmap: a public gene expression resource. Genome assessment of the variance structure, with pieces going Res., 10, 1051–1060. to each of these included factors. Our approach provides Madden,S.L., Galella,E.A., Zhu,J., Bertelsen,A.H. and perhaps the simplest way of incorporating between-library Beaudry,G.A. (1997) SAGE transcript proﬁles for p53- variability from a host of sources, and the variance bound dependent growth regulation. Oncogene, 15, 1079–1085. we have imposed is inherently at least as conservative as Man,M.Z., Wang,X. and Wang,Y. (2000) POWER SAGE: compar- other tests. That said, there are other ways of arriving ing statistical tests for SAGE experiments. Bioinformatics, 16, at similar test statistics (e.g. via weighted least squares; 953–959. Neter et al., 1996, p. 400) that may provide greater Margulies,E.H. and Innis,J.W. (2000) eSAGE: managing and an- ﬂexibility in modelling SAGE data, and we are working alyzing data generated with serial analysis of gene expression on this now. (SAGE). Bioinformatics, 16, 650–651. Margulies,E.H., Kardia,S.L.R. and Innis,J.W. (2001) A comparative molecular analysis of developing mouse forelimbs and hindlimbs ACKNOWLEDGEMENTS using serial analysis of gene expression (SAGE). Genome Res., The authors gratefully acknowledge support from NIH- 11, 1686–1698. NCI Grant 1U19 CA84978-1A1. Mathews,J. and Walker,R.L. (1965) Mathematical Methods of Physics.Benjamin, New York. REFERENCES Michiels,E.M.C., Oussoren,E., van Groenigen,M., Pauws,E., Bossuyt,P.M.M., Voute,P ˆ .A. and Baas,F. (1999) Genes differ- Audic,S. and Claverie,J.-M. (1997) The signiﬁcance of digital gene entially expressed in medulloblastoma and fetal brain. Physiol. expression proﬁles. Genome Res., 7, 986–995. Genomics, 1, 83–91. Baggerly,K.A., Coombes,K.R., Hess,K.R., Stivers,D.N., Neter,J., Kutner,M.H., Nachtsheim,C.J. and Wasserman,W. (1996) Abruzzo,L.V. and Zhang,W. (2001) Identifying differen- Applied Linear Statistical Models,Fourth edition, Irwin, New tially expressed genes in cDNA microarray experiments. J. York. Comput. Biol., 8, 639–659. Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and Chen,H., Centola,M., Altschul,S.F. and Metzger,H. (1998) Charac- Tsui,K.W. (2001) On differential variability of expression ra- terization of gene expression in resting and activated mast cells. tios: improving statistical inference about gene expression J. Exp. Med., 188, 1657–1668. changes from microarray data. J. Comput. Biol., 8, 37–52. Ewing,B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling Ryu,B., Jones,J., Blades,N.J., Parmigiani,G., Hollingsworth,M.A., of automated sequencer traces using Phred. I. Accuracy assess- Hruban,R.H. and Kern,S.E. (2002) Relationships and differen- ment. Genome Res., 8, 175–185. tially expressed genes among pancreatic cancers examined by Gelman,A., Carlin,J.B., Stern,H.S. and Rubin,D.B. (1995) Bayesian large-scale serial analysis of gene expression. Cancer Res., 62, Data Analysis. Chapman and Hall, New York. 819–826. Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., Satterthwaite,F.E. (1946) An approximate distribution of estimates Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D. et al. of variance components. Biometrics Bulletin, 2, 110–114. (2000) Functional discovery via a compendium of expression Stekel,D.J., Git,Y. and Falciani,F. (2000) The comparison of gene proﬁles. Cell, 102, 109–126. expression from multiple cDNA libraries. Genome Res., 10, Kal,A.J., van Zonneveld,A.J., Benes,V., van den Berg,M., 2055–2061. Koerkamp,M.G., Albermann,K., Strack,N., Ruijter,J.M., Stollberg,J., Urschitz,J., Urban,Z. and Boyd,C.D. (2000) A quanti- Richter,A., Dujon,B. et al. (1999) Dynamics of gene expression tative evaluation of SAGE. Genome Res., 10, 1241–1248. revealed by comparison of serial analysis of gene expression Velculescu,V.E., Zhang,L., Vogelstein,B. and Kinzler,K.W. (1995) transcript proﬁles from yeast grown on two different carbon Serial analysis of gene expression. Science, 270, 484–487. sources. Mol. Biol. Cell, 10, 1859–1872. Zhang,L., Zhou,W., Velculescu,V.E., Kern,S.E., Hruban,R.H., Lal,A., Lash,A.E., Altschul,S.F., Velculescu,V., Zhang,L., Hamilton,S.R., Vogelstein,B. and Kinzler,K.W. (1997) Gene ex- McLendon,R.E., Marra,M.A., Prange,C., Morin,P.J., Polyak,K. pression proﬁles in normal and cancer cells. Science, 276, 1268– et al. (1999) A public database for gene expression in human cancers. Cancer Res., 59, 5403–5407.

Journal

Bioinformatics – Oxford University Press

Published: Aug 12, 2003

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Differential expression in SAGE: accounting for normal between-library variation

Differential expression in SAGE: accounting for normal between-library variation

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Differential expression in SAGE: accounting for normal between-library variation

Differential expression in SAGE: accounting for normal between-library variation

References (21)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies