Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Differential expression in SAGE: accounting for normal between-library variation

Differential expression in SAGE: accounting for normal between-library variation Vol. 19 no. 12 2003, pages 1477–1483 BIOINFORMATICS DOI: 10.1093/bioinformatics/btg173 Differential expression in SAGE: accounting for normal between-library variation 1,∗ 2 1 Keith A. Baggerly ,Li Deng , Jeffrey S. Morris and C. Marcelo Aldaz Department of Biostatistics, UT M.D. Anderson Cancer Center, 1515 Holcombe Blvd, Box 447, Houston, TX 77030-4009, USA, Department of Statistics, Rice University, Houston TX 77005, USA and Department of Carcinogenesis, UT M.D. Anderson Cancer Center, 1515 Holcombe Blvd, Box 447, Houston, TX 77030-4009, USA Received on October 24, 2002; revised on January 30, 2003; accepted on February 16, 2003 ABSTRACT with different pathologies, the question of most interest is Motivation: In contrasting levels of gene expression whether a given tag is differentially expressed. between groups of SAGE libraries, the libraries within each Most methods currently advanced for assessing dif- group are often combined and the counts for the tag of ferential expression in SAGE address the case where interest summed, and inference is made on the basis one library is contrasted with another, assuming a null of these larger ‘pseudolibraries’. While this captures the hypothesis that there is no difference between the libraries sampling variability inherent in the procedure, it fails to being compared. Under this assumption, the chance of a allow for normal variation in levels of the gene between single tag falling in one library or the other is roughly individuals within the same group, and can consequently proportional to the library size. Differing approximations overstate the significance of the results. The effect is not lead to modelling this behavior with binomial (Zhang et slight: between-library variation can be hundreds of times al., 1997) or Poisson distributions (Madden et al., 1997), the within-library variation. normal approximations (Madden et al., 1997; Kal et al., Results: We introduce a beta-binomial sampling model 1999; Michiels et al., 1999; Man et al., 2000) or simu- that correctly incorporates both sources of variation. We lations involving permutation tests (Zhang et al., 1997). show how to fit the parameters of this model, and introduce Bayesian approaches have been suggested by Audic and a test statistic for differential expression similar to a two- Claverie (1997), and by Chen et al. (1998) (the latter sample t -test. method was adapted by Lal et al. (1999) to accommodate Contact: kabagg@mdanderson.org unequal library sizes). Of these, the simulation approach Supplementary information: http://bioinformatics. of Zhang et al. (1997) and the Bayesian approach of Lal mdanderson.org/ Includes Matlab and R code for fitting et al. (1999) (see also Lash et al., 2000) are probably the model. the most widely used, due to their implementation in easily accessible software (the SAGE 2000 software INTRODUCTION available from the Kinzler Laboratory at Johns Hopkins, The Serial Analysis of Gene Expression (SAGE) method- and the routine implemented in SAGEmap at the NCBI, ology introduced by Velculescu et al. (1995) supplies data respectively). As noted in the comparison conducted on gene expression in the form of a table of counts. Briefly, by Man et al. (2000) on several of the above methods, however, very similar results are obtained when the num- mRNA transcripts are converted to cDNA and then pro- bers of tags are large (>20); the authors contend that a cessed so as to isolate short (typically 10 or 14 bp) repre- normal approximation (Kal et al., 1999) or equivalently a sentative ‘tags’. Ideally, these tags should provide enough chi-squared test has more power for detecting differences information to uniquely identify the source mRNA, and to when the numbers of tags are small (<15). The validity a first approximation this is correct. Tags are sampled and of small tag comparisons is, however, questionable due to sequenced, and a ‘library’ consisting of the tags seen and the presence of sequencing errors (Stollberg et al., 2000) their respective frequencies is constructed for each cellu- though this may be somewhat ameliorated if there is some lar sample. Given a set of libraries derived from samples external measure of quality for the read, such as a phred To whom correspondence should be addressed. score (Margulies and Innis, 2000; Margulies et al., 2001; Bioinformatics 19(12)  c Oxford University Press 2003; all rights reserved. 1477 K.A.Baggerly et al. Table 1. Counts and proportions of tag AGGTCAGGAG in eight breast tumor libraries, five lymph node positive and three lymph node negative Library 1T+ 3T+ 4T+ 6T+ 8T+ 7T− 9T− 10T− Tag count 129 167 71 61 6 43 247 509 Library size 100 474 96 631 92 510 95 785 18 705 95 155 91 593 98 220 Proportions (%) 0.13 0.17 0.08 0.06 0.03 0.05 0.27 0.52 Ewing et al., 1998). We note that the statistic suggested the χ value (Michiels et al., 1999; Man et al., 2000) is by Kal et al. (1999) is 279.98; the 95% cutoff for this distribution is 3.84, so this is obviously a ‘significant’ result. The equivalence p − p A B Z =  , with of the above tests noted by Man et al. (2000) for high p (1− p ) p (1− p ) 0 0 0 0 + count tags means that the other tests will catch the same N N A B genes. Checking the sign of the test statistic proposed by X X X + X A B A B Kal et al. (1999) suggests that this tag is more strongly p = , p = , p = , A B 0 N N N + N A B A B expressed in LN− tumors. However, if we follow Ryu et al. (2002) and compare the eight proportions using where the two library sizes are N and N and the two √ A B atwo-sample t -test, t = ( p − p )/ V + V , with A B A B counts are X and X , respectively. The statistic we A B p being the average of the five proportions in group A, propose below will have a similar form. V being the sample variance of these five proportions, When the number of libraries involved is more than two, and p and V likewise defined, we get a test statistic B B an omnibus test for differential expression such as that value of −1.3174. The 95% cutoffs for a t distribution suggested by Stekel et al. (2000) can be performed, but are ±2.4469, so this is a decidedly ‘insignificant’ result. interest is more often centered on comparing groups of While the mean proportion is higher for the LN− tumors, libraries where the group membership is known a priori. this is mostly being driven by results from a single library In this paper we focus on the comparison of two groups (10T) so that the variability within the LN- group is high. of libraries. When there are two groups of libraries being The first approach fails to take into account the variability compared, the most common approach is to reduce the between like libraries which the t -test correctly captures. number of effective libraries to two by pooling the libraries Shifting between test types (χ and two-sample t)gives of like type, and reverting to the two-library comparison a different view of which tags are important, as can be seen form. This is not universal, and cautionary statements have in Figure 1. Most of the tags being flagged as significant been made. Lash et al. (2000) recommend checking for a by the χ test (values >5) are not significant according to low coefficient of variation before applying the SAGEmap the t -test (absolute values <2). procedure to grouped libraries. Ryu et al. (2002) use a Between-library variability within a group can often series of filters to deal with groups of pancreatic libraries, be as large in magnitude as the within-library variability the first of which is a two-sample t -test applied to the due to sampling. Indeed, for the higher count data, the proportions. between-library variability is the dominant part of the These pooling approaches do catch differences, but variation. We can see this by surveying all of the tags they can overstate the significance of the results by in the LN+ group, plotting the total (between + within) ignoring the role of normal variation in expression levels library sample variability as a multiple of the within- between like samples. As an example, we consider the library variability, with both quantities on the log scale case of a single prevalent tag in eight breast libraries that to make the structure more apparent. This is shown in have been assembled in the Aldaz Laboratory. This tag, Figure 2. For the high count tags the between-library AGGTCAGGAG, has multiple matches and hence is not variance clearly wins. Similar qualitative results hold for immediately biologically informative, but it will serve to illustrate the point. All of these libraries are derived from the LN- group (not shown). breast tumors; the first five are from patients found to be While the two-sample t -statistic applied to the different lymph node positive (LN+), and the remaining three from normalized library proportions illustrates the problem, patients found to be lymph node negative (LN−). The tag it is too crude a tool to provide a solution in and of counts and proportions in the various libraries are given in itself: it weights the proportions from all libraries equally, Table 1. If we combine the five LN+ libraries and three even though the estimates from larger libraries are less LN− libraries and compare the resulting tag proportions, variable, and it is possible for the sample variance of the we are comparing 434/404105 to 799/284968, for which normalized proportions to be less than the known within- 1478 Differential expression in SAGE Fig. 1. (a)Two-sample t -test and χ values for all high count tags (40 or more total counts across all eight libraries). If the two tests were in agreement, we would see a ‘U-shape’ corresponding to genes being found equally extreme by both. Here, some of the most extreme values by one test (large χ values, or large absolute t -test values) are associated with at best weakly significant values of the other. (b) Zoom on points with χ values below 50, indicated with a dashed line in (a). While the U-shape is clearly evident, it is more compressed than agreement would indicate. Most of the tags being flagged as significant by the χ test (values > 5) are not significant according to the t -test (absolute values < 2). In effect, the t -test is going to the opposite extreme and focusing on the between-library variability at the expense of the within-library variability. In order to properly capture both types of variation, a compromise is needed. We introduce a beta-binomial model that includes both types of variation in a hierarchi- cal fashion: The proportion of a gene within a library is selected from an underlying beta distribution representing the normal between-library variation, and the count within that library is binomial with the chosen proportion as a parameter. Depending on the parameters of the beta model, this leads to estimates for the group proportions and associated variances that weight the different library proportions using values intermediate between equal weighting (all variation is between libraries) and weight- ing proportional to the library size (all variation is within libraries). METHODS Fig. 2. The log ratio of total (within plus between) variation to within variation as a function of log of the total tag count for the 2 For the sake of notational simplicity, we will focus on the LN+ group. The smooth line was fit by binning along the x -axis case of modelling the counts of a specific tag within the one unit at a time, taking the mean (x , y) point within that bin, and first group. Let n denote the total tag count in library i fitting a loess smooth with span 5 to the resultant 12 points. Note of this group, and let p denote the true proportion of the the line crosses 1 (between is equal to within) at about 4 on the tag of interest within library i . Finally, let X denote the x -axis, corresponding to a raw count of 16. For larger count tags, corresponding count for the tag of interest. For the first the between-library variation is clearly dominant, with the biggest 9.8 part of our model, we assume that the true proportions may multiple being about 2 ,or roughly 900-fold. vary from library to library. A standard distribution for proportions is the beta distribution, and we shall assume this here. The particular distribution used does not matter library variation. This can result in inappropriately large t a great deal. The main point is that the distribution is not values as the denominator of the test statistic goes to zero. necessarily degenerate: it can have a positive variance. (Most such cases occur when the total tag count is low, We will be focusing on the first two moments of the so these are not apparent in Fig. 1 due to our filtering.) various distributions throughout, both for computational 1479 K.A.Baggerly et al. simplicity and out of an intent to invoke the central limit At this point, we note that the optimal choice of weights theorem to get an approximately normal test statistic. is determined by a single relationship—the size of α + β Here, relative to n .Ifwe consider the extremes of this type of arrangement, letting α + β go to ∞ implies both that p ∼ Beta(α, β), E ( p ) = , i i the distribution of the p ’s is degenerate, so that there α + β is no change in the true proportion going from sample αβ to sample, and that in this case the optimal weighting is V ( p ) = . (α + β) (α + β + 1) proportional to the library size. If, on the other hand, the sum α + β is very small relative to the n values, then The second part of our model says that given the true i the imprecision in our knowledge of the proportion in a proportion in a sample, the corresponding count will given library is dwarfed by the imprecision due to library have a binomial distribution with the true proportion as to library variability, and the optimal weights are roughly a parameter: the same for all libraries. Thus, weighting by library size X | p ∼ Binomial(n , p ). i i i i and weighting equally represent the two extremes, and the true optimum lies somewhere in between. Note that Some straightforward algebra (Supplementary informa- the optimum weighting may be different for different tion) shows that the unconditional mean and variance of tags even if the same libraries are used! In theory, the the estimated proportion p ˆ = X /n are i i i weights are functions of fixed α and β.In practice, as the parameters are unknown, estimation of the parameters and E ( p ˆ ) = , α + β the weights proceeds jointly. αβ 1 1 Now, the form of the weighting vector gives us the V ( p ˆ ) = + . estimated proportion for the group as (α + β)(α + β + 1) α + β n There are two components to the variance of the propor- p ˆ = w p ˆ . i i tion p ˆ (in square brackets above), and only one of them (the within-library variation) decreases as the library size It can be shown (Supplementary information) that an is increased. Now, given that we know the variance of a unbiased estimator of the variance of this proportion is single proportion, we turn to the mean and variance of a 2 2 2 2 weighted linear combination of proportions to see how to w p ˆ − w p ˆ i i i V =   . combine the results from different libraries. unb 1 − w E w p ˆ i i When all of the w ’s are equal, this reduces to the standard α α unbiased estimator. This variance estimate is mostly right, = w E p ˆ = w = (1) i i i α + β α + β butitcan be too small—we know that the variance can never be less than the sampling variability. This lower V w p ˆ i i bound follows in turn from the assumptions that the libraries are assembled independently, and that sampling w αβ 1 1 = + . (2) within a library is also independent. These assumptions (α + β)(α + β + 1) α + β n strike us as reasonable, and we make them here. Allowing As long as the weights sum to 1, the combination has for this lower bound suggests the modified estimator the correct mean, so the focus shifts to choosing the   X X weights so as to minimize the associated variance. The i i 1 − n n i i constraint on the sum of the weights can be introduced into ˆ ˆ   V = max V , . unb the variance minimization problem through the method of Lagrange multipliers (Mathews and Walker, 1965), yielding There are slightly different lower bounds that could be constructed, but they all have the same leading term, V w p ˆ + λ 1 − w i i i X /( n ) .Werevisit this point below. i i ∂w In order to come up with a concrete number for a test αβ 1 1 statistic, we need to estimate the beta parameters. This can = 2w + − λ = 0 (α + β)(α + β + 1) α + β n i be done quickly using the method of moments, applied −1 to the unweighted sample proportions; this procedure can 1 1 → w ∝ + . i then be iterated as the weights provide revised estimates α + β n of the parameters. (Expressions for β and α ˆ can be found 1480 Differential expression in SAGE Table 2. Convergence of moment-based estimates of the beta parameters for (the variance doesn’t drop below our working floor) as we the LN+ values given in Table 1 attempt to account for additional variability. The test statistic that we propose for comparing groups A and B , then, is i 1234 5 p ˆ −ˆ p (i ) A B α 3.42 2.90 2.90 2.90 2.90 t = (i ) β 3184.1 3007.2 3015.9 3015.5 3015.5 ˆ ˆ V + V A B with p ˆ and V as defined above. For testing significance, it is useful to be able to specify the null distribution by manipulating Equations (1) and (2) to isolate the of a test statistic. This is somewhat difficult here in parameters as functions of the moments.) Consider the that the distribution of the estimated proportion within case of the LN+ proportions in the example given earlier. a group depends on the relative sizes of the between (0) and within variation. If the within-library variation is w = = (0.249, 0.239, 0.229, 0.237, 0.046). predominant, then the shape of the distribution is largely driven by the total counts within a group, and if these (0) (0) p ˆ = w p ˆ = 0.00107 counts are reasonable (say 15 or more in each group) (0) (0) 2 2 2 (0) 2 then the binomial distribution will be roughly normal (w ) p ˆ − (w ) ( p ˆ ) i i i (0) and a Z distribution can be used. This follows from an V = (0) appeal to the central limit theorem, the rough bell-shape 1 − (w ) of the binomial distribution, and the fact that the degrees = 7.995e − 06 of freedom used for estimating this component of the (0) (0) (0) 2 (0) p ˆ (1 −ˆ p ) (w ) − V variation are very close to the total number of counts. (1) i β = −1 (0) If, however, the between-library variation is dominant, (0) (0) (0) 2 V 1 −ˆ p −ˆ p (w ) /n then the dominant effect in many cases will be the small = 3184.06 number of different libraries used to estimate this variance; (0) p ˆ in this case we are driven to a t distribution with, say, n − (1) (1) α ˆ = β = 3.42 (0) 1degrees of freedom for group A. The above discussion 1 −ˆ p treats the distribution of the estimate for a single group, (1) (1) α ˆ + β n i and we need to combine the results for two groups. We (1) w ∝ still treat the overall distribution as roughly t in nature, (1) (1) α ˆ + β + n butaswe are using separate variance estimates for the two ∝ (0.205, 0.205, 0.204, 0.205, 0.181). groups, so that the overall variance estimate is not pooled, we fall back on a Satterthwaite (1946) approximation to Empirically, convergence is quite rapid, as is shown for compute the degrees of freedom: this example in Table 2. Here, the size of the sum of the beta parameters (about ˆ ˆ (V + V ) A B df = . 3K) relative to the library sizes (about 100 K) suggests 2 2 ˆ ˆ V V A B that the between-library variability is roughly 30 times n −1 n −1 A B the within-library variability. Special note needs to be Small counts in each group should force us to account made of the case where the method of moments fails— for the asymmetric nature of the underlying distribution when the variability of the sample proportions is less more directly, and in this region a test such as that than that known to be present due to sampling variability. proposed by Audic and Claverie Audic and Claverie In this case, it is instructive to look at the likelihood (1997) seems reasonable. function. The likelihood function shows that the ratio α/(α + β) corresponding to the mean proportion is well DISCUSSION characterized, but the sum α + β diverges to ∞ if we attempt to find a maximum. In this case, the underlying Between-library variability is non-negligible for SAGE maximum likelihood beta distribution is a degenerate data. Indeed, for the higher count data, the between- point mass, suggesting that the proper course of action library variability is the dominant part of the variation. is to ignore the between library variability and work just Pooling libraries in a group deals with between-library with the within-library variability. This is precisely when variation improperly, and can result in a bias towards we shift between different estimates of variance above; calling high-count tags ‘significantly different’. We have consequently the estimates do not become more precise proposed a statistic, t , that incorporates between-library 1481 K.A.Baggerly et al. Our approach works one tag at a time. It may be possible to improve inference further by working with an ensemble approach that attempts to estimate parameters for all of the genes at once. This is the potential of ‘borrowing strength’ across genes to improve estimates throughout. Such borrowing has been used to good effect in estimating variances associated with microarray readings (see Baggerly et al., 2001; Newton et al., 2001, and Hughes et al., 2000, among others). In the microarray context, this borrowing was achieved by grouping genes according to intensity. Likewise, SAGE tags could be grouped according to relative abundance and estimation of the between-library variation assessed for the group. Our approach dodges the small-variance problem by reverting to using just the sampling variance when the other estimate gives a value that is too small on its face. It is possible to treat this in a more rigorous and coherent fashion using a full-blown Bayesian approach that addresses the uncertainty in our estimates of α and Fig. 3. Our test statistic as a function of log tag count for the LN+ β by simulating draws from the posterior distribution (see vs LN− comparison. Unlike the χ test, this test is not overly biased Gelman et al. 1995, p. 130 for a discussion of how this towards giving larger values to larger count tags. The distribution might be done here). This approach, however, requires appears roughly uniform, with granularity creeping in at the low the additional specification of a prior and significant counts as the normal approximation fails. computational overhead (several orders of magnitude beyond that required here). An additional complexity is that a completely non-informative prior can lead to an variability in addition to the within-library variabil- improper posterior in this context. For genes of particular ity already well-treated by previous methods; indeed, interest, the freely available BUGS software may be able our method reduces to that of Kal et al. (1999) in the to provide this type of approach without the need for much special case when between-library variability can be coding on the user’s part. We are exploring this. neglected. The final distribution of the t values com- The special case where each group contains just one paring LN+ and LN− is shown in Figure 3, with the library can be treated by setting the weight w to 1 counts log -transformed to make the structure more in each group. As the within-group variance is smaller apparent. The distribution appears largely stable as a than the sampling variability, this would default to using function of tag count, so larger counts are not getting the sampling variability only. This weighting approach ‘more significant’ just by default. The most extreme suggests a difficulty with one versus one comparisons if count tags (including our example tag) no longer appear the between-library variability is suspected to be large. as significantly different. There is some granularity When we have just one library in each group, the degrees at the low counts where the Normal approximation of freedom in our t -statistic formulation drop to zero, is breaking down. By contrast, the two-sample t -test reflecting the fact that any differences we see could be deals with between-library variability but ignores the due to either the biological change of interest or to within-library variability. As the former is typically the normal variability, but we can’t tell which without an larger part of the variance, our t statistic is typically w assessment of this variation. Useful inferences in this case much closer to the t -test than to the pooled tests. Small- rely on prior assumptions about the scale of change to be scale simulation results (Supplementary information) expected or on implicitly borrowing strength across genes confirm that the pooled tests have poor specificity in the by looking at which ones are ‘the most different’. This in presence of overdispersion, and show that the sensitivity turn treats the experimental results as supplying a ranking and specificity of t are overall very similar to the of interest rather than a straight significance value. This two-sample t -test, with the differences showing mostly ranking viewpoint is reasonable in light of the multiple in isolated cases. Power comparisons of various tests testing problems inherent in checking thousands of genes. presuppose that the type I error rate has been fixed; in Finally, our approach treats all of the libraries within a the presence of overdispersion we have seen pooled group as similar enough that the observed variation can tests with nominal 0.5% type I error rates have actual be described as ‘normal variation within the population rates of about 70%. of interest’. There may be additional known covariates 1482 Differential expression in SAGE Lash,A.E., Tolstoshev,C.M., Wagner,L., Schuler,G.D., that could account for much of this if they were included Strausberg,R.L., Riggins,G.J. and Altschul,S.F. (2000) in modelling the data, but this leads to a more involved SAGEmap: a public gene expression resource. Genome assessment of the variance structure, with pieces going Res., 10, 1051–1060. to each of these included factors. Our approach provides Madden,S.L., Galella,E.A., Zhu,J., Bertelsen,A.H. and perhaps the simplest way of incorporating between-library Beaudry,G.A. (1997) SAGE transcript profiles for p53- variability from a host of sources, and the variance bound dependent growth regulation. Oncogene, 15, 1079–1085. we have imposed is inherently at least as conservative as Man,M.Z., Wang,X. and Wang,Y. (2000) POWER SAGE: compar- other tests. That said, there are other ways of arriving ing statistical tests for SAGE experiments. Bioinformatics, 16, at similar test statistics (e.g. via weighted least squares; 953–959. Neter et al., 1996, p. 400) that may provide greater Margulies,E.H. and Innis,J.W. (2000) eSAGE: managing and an- flexibility in modelling SAGE data, and we are working alyzing data generated with serial analysis of gene expression on this now. (SAGE). Bioinformatics, 16, 650–651. Margulies,E.H., Kardia,S.L.R. and Innis,J.W. (2001) A comparative molecular analysis of developing mouse forelimbs and hindlimbs ACKNOWLEDGEMENTS using serial analysis of gene expression (SAGE). Genome Res., The authors gratefully acknowledge support from NIH- 11, 1686–1698. NCI Grant 1U19 CA84978-1A1. Mathews,J. and Walker,R.L. (1965) Mathematical Methods of Physics.Benjamin, New York. REFERENCES Michiels,E.M.C., Oussoren,E., van Groenigen,M., Pauws,E., Bossuyt,P.M.M., Voute,P ˆ .A. and Baas,F. (1999) Genes differ- Audic,S. and Claverie,J.-M. (1997) The significance of digital gene entially expressed in medulloblastoma and fetal brain. Physiol. expression profiles. Genome Res., 7, 986–995. Genomics, 1, 83–91. Baggerly,K.A., Coombes,K.R., Hess,K.R., Stivers,D.N., Neter,J., Kutner,M.H., Nachtsheim,C.J. and Wasserman,W. (1996) Abruzzo,L.V. and Zhang,W. (2001) Identifying differen- Applied Linear Statistical Models,Fourth edition, Irwin, New tially expressed genes in cDNA microarray experiments. J. York. Comput. Biol., 8, 639–659. Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and Chen,H., Centola,M., Altschul,S.F. and Metzger,H. (1998) Charac- Tsui,K.W. (2001) On differential variability of expression ra- terization of gene expression in resting and activated mast cells. tios: improving statistical inference about gene expression J. Exp. Med., 188, 1657–1668. changes from microarray data. J. Comput. Biol., 8, 37–52. Ewing,B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling Ryu,B., Jones,J., Blades,N.J., Parmigiani,G., Hollingsworth,M.A., of automated sequencer traces using Phred. I. Accuracy assess- Hruban,R.H. and Kern,S.E. (2002) Relationships and differen- ment. Genome Res., 8, 175–185. tially expressed genes among pancreatic cancers examined by Gelman,A., Carlin,J.B., Stern,H.S. and Rubin,D.B. (1995) Bayesian large-scale serial analysis of gene expression. Cancer Res., 62, Data Analysis. Chapman and Hall, New York. 819–826. Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., Satterthwaite,F.E. (1946) An approximate distribution of estimates Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D. et al. of variance components. Biometrics Bulletin, 2, 110–114. (2000) Functional discovery via a compendium of expression Stekel,D.J., Git,Y. and Falciani,F. (2000) The comparison of gene profiles. Cell, 102, 109–126. expression from multiple cDNA libraries. Genome Res., 10, Kal,A.J., van Zonneveld,A.J., Benes,V., van den Berg,M., 2055–2061. Koerkamp,M.G., Albermann,K., Strack,N., Ruijter,J.M., Stollberg,J., Urschitz,J., Urban,Z. and Boyd,C.D. (2000) A quanti- Richter,A., Dujon,B. et al. (1999) Dynamics of gene expression tative evaluation of SAGE. Genome Res., 10, 1241–1248. revealed by comparison of serial analysis of gene expression Velculescu,V.E., Zhang,L., Vogelstein,B. and Kinzler,K.W. (1995) transcript profiles from yeast grown on two different carbon Serial analysis of gene expression. Science, 270, 484–487. sources. Mol. Biol. Cell, 10, 1859–1872. Zhang,L., Zhou,W., Velculescu,V.E., Kern,S.E., Hruban,R.H., Lal,A., Lash,A.E., Altschul,S.F., Velculescu,V., Zhang,L., Hamilton,S.R., Vogelstein,B. and Kinzler,K.W. (1997) Gene ex- McLendon,R.E., Marra,M.A., Prange,C., Morin,P.J., Polyak,K. pression profiles in normal and cancer cells. Science, 276, 1268– et al. (1999) A public database for gene expression in human cancers. Cancer Res., 59, 5403–5407. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Differential expression in SAGE: accounting for normal between-library variation

Loading next page...
 
/lp/oxford-university-press/differential-expression-in-sage-accounting-for-normal-between-library-Tznn0G0m3j

References (21)

Publisher
Oxford University Press
Copyright
© Oxford University Press 2003
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btg173
Publisher site
See Article on Publisher Site

Abstract

Vol. 19 no. 12 2003, pages 1477–1483 BIOINFORMATICS DOI: 10.1093/bioinformatics/btg173 Differential expression in SAGE: accounting for normal between-library variation 1,∗ 2 1 Keith A. Baggerly ,Li Deng , Jeffrey S. Morris and C. Marcelo Aldaz Department of Biostatistics, UT M.D. Anderson Cancer Center, 1515 Holcombe Blvd, Box 447, Houston, TX 77030-4009, USA, Department of Statistics, Rice University, Houston TX 77005, USA and Department of Carcinogenesis, UT M.D. Anderson Cancer Center, 1515 Holcombe Blvd, Box 447, Houston, TX 77030-4009, USA Received on October 24, 2002; revised on January 30, 2003; accepted on February 16, 2003 ABSTRACT with different pathologies, the question of most interest is Motivation: In contrasting levels of gene expression whether a given tag is differentially expressed. between groups of SAGE libraries, the libraries within each Most methods currently advanced for assessing dif- group are often combined and the counts for the tag of ferential expression in SAGE address the case where interest summed, and inference is made on the basis one library is contrasted with another, assuming a null of these larger ‘pseudolibraries’. While this captures the hypothesis that there is no difference between the libraries sampling variability inherent in the procedure, it fails to being compared. Under this assumption, the chance of a allow for normal variation in levels of the gene between single tag falling in one library or the other is roughly individuals within the same group, and can consequently proportional to the library size. Differing approximations overstate the significance of the results. The effect is not lead to modelling this behavior with binomial (Zhang et slight: between-library variation can be hundreds of times al., 1997) or Poisson distributions (Madden et al., 1997), the within-library variation. normal approximations (Madden et al., 1997; Kal et al., Results: We introduce a beta-binomial sampling model 1999; Michiels et al., 1999; Man et al., 2000) or simu- that correctly incorporates both sources of variation. We lations involving permutation tests (Zhang et al., 1997). show how to fit the parameters of this model, and introduce Bayesian approaches have been suggested by Audic and a test statistic for differential expression similar to a two- Claverie (1997), and by Chen et al. (1998) (the latter sample t -test. method was adapted by Lal et al. (1999) to accommodate Contact: kabagg@mdanderson.org unequal library sizes). Of these, the simulation approach Supplementary information: http://bioinformatics. of Zhang et al. (1997) and the Bayesian approach of Lal mdanderson.org/ Includes Matlab and R code for fitting et al. (1999) (see also Lash et al., 2000) are probably the model. the most widely used, due to their implementation in easily accessible software (the SAGE 2000 software INTRODUCTION available from the Kinzler Laboratory at Johns Hopkins, The Serial Analysis of Gene Expression (SAGE) method- and the routine implemented in SAGEmap at the NCBI, ology introduced by Velculescu et al. (1995) supplies data respectively). As noted in the comparison conducted on gene expression in the form of a table of counts. Briefly, by Man et al. (2000) on several of the above methods, however, very similar results are obtained when the num- mRNA transcripts are converted to cDNA and then pro- bers of tags are large (>20); the authors contend that a cessed so as to isolate short (typically 10 or 14 bp) repre- normal approximation (Kal et al., 1999) or equivalently a sentative ‘tags’. Ideally, these tags should provide enough chi-squared test has more power for detecting differences information to uniquely identify the source mRNA, and to when the numbers of tags are small (<15). The validity a first approximation this is correct. Tags are sampled and of small tag comparisons is, however, questionable due to sequenced, and a ‘library’ consisting of the tags seen and the presence of sequencing errors (Stollberg et al., 2000) their respective frequencies is constructed for each cellu- though this may be somewhat ameliorated if there is some lar sample. Given a set of libraries derived from samples external measure of quality for the read, such as a phred To whom correspondence should be addressed. score (Margulies and Innis, 2000; Margulies et al., 2001; Bioinformatics 19(12)  c Oxford University Press 2003; all rights reserved. 1477 K.A.Baggerly et al. Table 1. Counts and proportions of tag AGGTCAGGAG in eight breast tumor libraries, five lymph node positive and three lymph node negative Library 1T+ 3T+ 4T+ 6T+ 8T+ 7T− 9T− 10T− Tag count 129 167 71 61 6 43 247 509 Library size 100 474 96 631 92 510 95 785 18 705 95 155 91 593 98 220 Proportions (%) 0.13 0.17 0.08 0.06 0.03 0.05 0.27 0.52 Ewing et al., 1998). We note that the statistic suggested the χ value (Michiels et al., 1999; Man et al., 2000) is by Kal et al. (1999) is 279.98; the 95% cutoff for this distribution is 3.84, so this is obviously a ‘significant’ result. The equivalence p − p A B Z =  , with of the above tests noted by Man et al. (2000) for high p (1− p ) p (1− p ) 0 0 0 0 + count tags means that the other tests will catch the same N N A B genes. Checking the sign of the test statistic proposed by X X X + X A B A B Kal et al. (1999) suggests that this tag is more strongly p = , p = , p = , A B 0 N N N + N A B A B expressed in LN− tumors. However, if we follow Ryu et al. (2002) and compare the eight proportions using where the two library sizes are N and N and the two √ A B atwo-sample t -test, t = ( p − p )/ V + V , with A B A B counts are X and X , respectively. The statistic we A B p being the average of the five proportions in group A, propose below will have a similar form. V being the sample variance of these five proportions, When the number of libraries involved is more than two, and p and V likewise defined, we get a test statistic B B an omnibus test for differential expression such as that value of −1.3174. The 95% cutoffs for a t distribution suggested by Stekel et al. (2000) can be performed, but are ±2.4469, so this is a decidedly ‘insignificant’ result. interest is more often centered on comparing groups of While the mean proportion is higher for the LN− tumors, libraries where the group membership is known a priori. this is mostly being driven by results from a single library In this paper we focus on the comparison of two groups (10T) so that the variability within the LN- group is high. of libraries. When there are two groups of libraries being The first approach fails to take into account the variability compared, the most common approach is to reduce the between like libraries which the t -test correctly captures. number of effective libraries to two by pooling the libraries Shifting between test types (χ and two-sample t)gives of like type, and reverting to the two-library comparison a different view of which tags are important, as can be seen form. This is not universal, and cautionary statements have in Figure 1. Most of the tags being flagged as significant been made. Lash et al. (2000) recommend checking for a by the χ test (values >5) are not significant according to low coefficient of variation before applying the SAGEmap the t -test (absolute values <2). procedure to grouped libraries. Ryu et al. (2002) use a Between-library variability within a group can often series of filters to deal with groups of pancreatic libraries, be as large in magnitude as the within-library variability the first of which is a two-sample t -test applied to the due to sampling. Indeed, for the higher count data, the proportions. between-library variability is the dominant part of the These pooling approaches do catch differences, but variation. We can see this by surveying all of the tags they can overstate the significance of the results by in the LN+ group, plotting the total (between + within) ignoring the role of normal variation in expression levels library sample variability as a multiple of the within- between like samples. As an example, we consider the library variability, with both quantities on the log scale case of a single prevalent tag in eight breast libraries that to make the structure more apparent. This is shown in have been assembled in the Aldaz Laboratory. This tag, Figure 2. For the high count tags the between-library AGGTCAGGAG, has multiple matches and hence is not variance clearly wins. Similar qualitative results hold for immediately biologically informative, but it will serve to illustrate the point. All of these libraries are derived from the LN- group (not shown). breast tumors; the first five are from patients found to be While the two-sample t -statistic applied to the different lymph node positive (LN+), and the remaining three from normalized library proportions illustrates the problem, patients found to be lymph node negative (LN−). The tag it is too crude a tool to provide a solution in and of counts and proportions in the various libraries are given in itself: it weights the proportions from all libraries equally, Table 1. If we combine the five LN+ libraries and three even though the estimates from larger libraries are less LN− libraries and compare the resulting tag proportions, variable, and it is possible for the sample variance of the we are comparing 434/404105 to 799/284968, for which normalized proportions to be less than the known within- 1478 Differential expression in SAGE Fig. 1. (a)Two-sample t -test and χ values for all high count tags (40 or more total counts across all eight libraries). If the two tests were in agreement, we would see a ‘U-shape’ corresponding to genes being found equally extreme by both. Here, some of the most extreme values by one test (large χ values, or large absolute t -test values) are associated with at best weakly significant values of the other. (b) Zoom on points with χ values below 50, indicated with a dashed line in (a). While the U-shape is clearly evident, it is more compressed than agreement would indicate. Most of the tags being flagged as significant by the χ test (values > 5) are not significant according to the t -test (absolute values < 2). In effect, the t -test is going to the opposite extreme and focusing on the between-library variability at the expense of the within-library variability. In order to properly capture both types of variation, a compromise is needed. We introduce a beta-binomial model that includes both types of variation in a hierarchi- cal fashion: The proportion of a gene within a library is selected from an underlying beta distribution representing the normal between-library variation, and the count within that library is binomial with the chosen proportion as a parameter. Depending on the parameters of the beta model, this leads to estimates for the group proportions and associated variances that weight the different library proportions using values intermediate between equal weighting (all variation is between libraries) and weight- ing proportional to the library size (all variation is within libraries). METHODS Fig. 2. The log ratio of total (within plus between) variation to within variation as a function of log of the total tag count for the 2 For the sake of notational simplicity, we will focus on the LN+ group. The smooth line was fit by binning along the x -axis case of modelling the counts of a specific tag within the one unit at a time, taking the mean (x , y) point within that bin, and first group. Let n denote the total tag count in library i fitting a loess smooth with span 5 to the resultant 12 points. Note of this group, and let p denote the true proportion of the the line crosses 1 (between is equal to within) at about 4 on the tag of interest within library i . Finally, let X denote the x -axis, corresponding to a raw count of 16. For larger count tags, corresponding count for the tag of interest. For the first the between-library variation is clearly dominant, with the biggest 9.8 part of our model, we assume that the true proportions may multiple being about 2 ,or roughly 900-fold. vary from library to library. A standard distribution for proportions is the beta distribution, and we shall assume this here. The particular distribution used does not matter library variation. This can result in inappropriately large t a great deal. The main point is that the distribution is not values as the denominator of the test statistic goes to zero. necessarily degenerate: it can have a positive variance. (Most such cases occur when the total tag count is low, We will be focusing on the first two moments of the so these are not apparent in Fig. 1 due to our filtering.) various distributions throughout, both for computational 1479 K.A.Baggerly et al. simplicity and out of an intent to invoke the central limit At this point, we note that the optimal choice of weights theorem to get an approximately normal test statistic. is determined by a single relationship—the size of α + β Here, relative to n .Ifwe consider the extremes of this type of arrangement, letting α + β go to ∞ implies both that p ∼ Beta(α, β), E ( p ) = , i i the distribution of the p ’s is degenerate, so that there α + β is no change in the true proportion going from sample αβ to sample, and that in this case the optimal weighting is V ( p ) = . (α + β) (α + β + 1) proportional to the library size. If, on the other hand, the sum α + β is very small relative to the n values, then The second part of our model says that given the true i the imprecision in our knowledge of the proportion in a proportion in a sample, the corresponding count will given library is dwarfed by the imprecision due to library have a binomial distribution with the true proportion as to library variability, and the optimal weights are roughly a parameter: the same for all libraries. Thus, weighting by library size X | p ∼ Binomial(n , p ). i i i i and weighting equally represent the two extremes, and the true optimum lies somewhere in between. Note that Some straightforward algebra (Supplementary informa- the optimum weighting may be different for different tion) shows that the unconditional mean and variance of tags even if the same libraries are used! In theory, the the estimated proportion p ˆ = X /n are i i i weights are functions of fixed α and β.In practice, as the parameters are unknown, estimation of the parameters and E ( p ˆ ) = , α + β the weights proceeds jointly. αβ 1 1 Now, the form of the weighting vector gives us the V ( p ˆ ) = + . estimated proportion for the group as (α + β)(α + β + 1) α + β n There are two components to the variance of the propor- p ˆ = w p ˆ . i i tion p ˆ (in square brackets above), and only one of them (the within-library variation) decreases as the library size It can be shown (Supplementary information) that an is increased. Now, given that we know the variance of a unbiased estimator of the variance of this proportion is single proportion, we turn to the mean and variance of a 2 2 2 2 weighted linear combination of proportions to see how to w p ˆ − w p ˆ i i i V =   . combine the results from different libraries. unb 1 − w E w p ˆ i i When all of the w ’s are equal, this reduces to the standard α α unbiased estimator. This variance estimate is mostly right, = w E p ˆ = w = (1) i i i α + β α + β butitcan be too small—we know that the variance can never be less than the sampling variability. This lower V w p ˆ i i bound follows in turn from the assumptions that the libraries are assembled independently, and that sampling w αβ 1 1 = + . (2) within a library is also independent. These assumptions (α + β)(α + β + 1) α + β n strike us as reasonable, and we make them here. Allowing As long as the weights sum to 1, the combination has for this lower bound suggests the modified estimator the correct mean, so the focus shifts to choosing the   X X weights so as to minimize the associated variance. The i i 1 − n n i i constraint on the sum of the weights can be introduced into ˆ ˆ   V = max V , . unb the variance minimization problem through the method of Lagrange multipliers (Mathews and Walker, 1965), yielding There are slightly different lower bounds that could be constructed, but they all have the same leading term, V w p ˆ + λ 1 − w i i i X /( n ) .Werevisit this point below. i i ∂w In order to come up with a concrete number for a test αβ 1 1 statistic, we need to estimate the beta parameters. This can = 2w + − λ = 0 (α + β)(α + β + 1) α + β n i be done quickly using the method of moments, applied −1 to the unweighted sample proportions; this procedure can 1 1 → w ∝ + . i then be iterated as the weights provide revised estimates α + β n of the parameters. (Expressions for β and α ˆ can be found 1480 Differential expression in SAGE Table 2. Convergence of moment-based estimates of the beta parameters for (the variance doesn’t drop below our working floor) as we the LN+ values given in Table 1 attempt to account for additional variability. The test statistic that we propose for comparing groups A and B , then, is i 1234 5 p ˆ −ˆ p (i ) A B α 3.42 2.90 2.90 2.90 2.90 t = (i ) β 3184.1 3007.2 3015.9 3015.5 3015.5 ˆ ˆ V + V A B with p ˆ and V as defined above. For testing significance, it is useful to be able to specify the null distribution by manipulating Equations (1) and (2) to isolate the of a test statistic. This is somewhat difficult here in parameters as functions of the moments.) Consider the that the distribution of the estimated proportion within case of the LN+ proportions in the example given earlier. a group depends on the relative sizes of the between (0) and within variation. If the within-library variation is w = = (0.249, 0.239, 0.229, 0.237, 0.046). predominant, then the shape of the distribution is largely driven by the total counts within a group, and if these (0) (0) p ˆ = w p ˆ = 0.00107 counts are reasonable (say 15 or more in each group) (0) (0) 2 2 2 (0) 2 then the binomial distribution will be roughly normal (w ) p ˆ − (w ) ( p ˆ ) i i i (0) and a Z distribution can be used. This follows from an V = (0) appeal to the central limit theorem, the rough bell-shape 1 − (w ) of the binomial distribution, and the fact that the degrees = 7.995e − 06 of freedom used for estimating this component of the (0) (0) (0) 2 (0) p ˆ (1 −ˆ p ) (w ) − V variation are very close to the total number of counts. (1) i β = −1 (0) If, however, the between-library variation is dominant, (0) (0) (0) 2 V 1 −ˆ p −ˆ p (w ) /n then the dominant effect in many cases will be the small = 3184.06 number of different libraries used to estimate this variance; (0) p ˆ in this case we are driven to a t distribution with, say, n − (1) (1) α ˆ = β = 3.42 (0) 1degrees of freedom for group A. The above discussion 1 −ˆ p treats the distribution of the estimate for a single group, (1) (1) α ˆ + β n i and we need to combine the results for two groups. We (1) w ∝ still treat the overall distribution as roughly t in nature, (1) (1) α ˆ + β + n butaswe are using separate variance estimates for the two ∝ (0.205, 0.205, 0.204, 0.205, 0.181). groups, so that the overall variance estimate is not pooled, we fall back on a Satterthwaite (1946) approximation to Empirically, convergence is quite rapid, as is shown for compute the degrees of freedom: this example in Table 2. Here, the size of the sum of the beta parameters (about ˆ ˆ (V + V ) A B df = . 3K) relative to the library sizes (about 100 K) suggests 2 2 ˆ ˆ V V A B that the between-library variability is roughly 30 times n −1 n −1 A B the within-library variability. Special note needs to be Small counts in each group should force us to account made of the case where the method of moments fails— for the asymmetric nature of the underlying distribution when the variability of the sample proportions is less more directly, and in this region a test such as that than that known to be present due to sampling variability. proposed by Audic and Claverie Audic and Claverie In this case, it is instructive to look at the likelihood (1997) seems reasonable. function. The likelihood function shows that the ratio α/(α + β) corresponding to the mean proportion is well DISCUSSION characterized, but the sum α + β diverges to ∞ if we attempt to find a maximum. In this case, the underlying Between-library variability is non-negligible for SAGE maximum likelihood beta distribution is a degenerate data. Indeed, for the higher count data, the between- point mass, suggesting that the proper course of action library variability is the dominant part of the variation. is to ignore the between library variability and work just Pooling libraries in a group deals with between-library with the within-library variability. This is precisely when variation improperly, and can result in a bias towards we shift between different estimates of variance above; calling high-count tags ‘significantly different’. We have consequently the estimates do not become more precise proposed a statistic, t , that incorporates between-library 1481 K.A.Baggerly et al. Our approach works one tag at a time. It may be possible to improve inference further by working with an ensemble approach that attempts to estimate parameters for all of the genes at once. This is the potential of ‘borrowing strength’ across genes to improve estimates throughout. Such borrowing has been used to good effect in estimating variances associated with microarray readings (see Baggerly et al., 2001; Newton et al., 2001, and Hughes et al., 2000, among others). In the microarray context, this borrowing was achieved by grouping genes according to intensity. Likewise, SAGE tags could be grouped according to relative abundance and estimation of the between-library variation assessed for the group. Our approach dodges the small-variance problem by reverting to using just the sampling variance when the other estimate gives a value that is too small on its face. It is possible to treat this in a more rigorous and coherent fashion using a full-blown Bayesian approach that addresses the uncertainty in our estimates of α and Fig. 3. Our test statistic as a function of log tag count for the LN+ β by simulating draws from the posterior distribution (see vs LN− comparison. Unlike the χ test, this test is not overly biased Gelman et al. 1995, p. 130 for a discussion of how this towards giving larger values to larger count tags. The distribution might be done here). This approach, however, requires appears roughly uniform, with granularity creeping in at the low the additional specification of a prior and significant counts as the normal approximation fails. computational overhead (several orders of magnitude beyond that required here). An additional complexity is that a completely non-informative prior can lead to an variability in addition to the within-library variabil- improper posterior in this context. For genes of particular ity already well-treated by previous methods; indeed, interest, the freely available BUGS software may be able our method reduces to that of Kal et al. (1999) in the to provide this type of approach without the need for much special case when between-library variability can be coding on the user’s part. We are exploring this. neglected. The final distribution of the t values com- The special case where each group contains just one paring LN+ and LN− is shown in Figure 3, with the library can be treated by setting the weight w to 1 counts log -transformed to make the structure more in each group. As the within-group variance is smaller apparent. The distribution appears largely stable as a than the sampling variability, this would default to using function of tag count, so larger counts are not getting the sampling variability only. This weighting approach ‘more significant’ just by default. The most extreme suggests a difficulty with one versus one comparisons if count tags (including our example tag) no longer appear the between-library variability is suspected to be large. as significantly different. There is some granularity When we have just one library in each group, the degrees at the low counts where the Normal approximation of freedom in our t -statistic formulation drop to zero, is breaking down. By contrast, the two-sample t -test reflecting the fact that any differences we see could be deals with between-library variability but ignores the due to either the biological change of interest or to within-library variability. As the former is typically the normal variability, but we can’t tell which without an larger part of the variance, our t statistic is typically w assessment of this variation. Useful inferences in this case much closer to the t -test than to the pooled tests. Small- rely on prior assumptions about the scale of change to be scale simulation results (Supplementary information) expected or on implicitly borrowing strength across genes confirm that the pooled tests have poor specificity in the by looking at which ones are ‘the most different’. This in presence of overdispersion, and show that the sensitivity turn treats the experimental results as supplying a ranking and specificity of t are overall very similar to the of interest rather than a straight significance value. This two-sample t -test, with the differences showing mostly ranking viewpoint is reasonable in light of the multiple in isolated cases. Power comparisons of various tests testing problems inherent in checking thousands of genes. presuppose that the type I error rate has been fixed; in Finally, our approach treats all of the libraries within a the presence of overdispersion we have seen pooled group as similar enough that the observed variation can tests with nominal 0.5% type I error rates have actual be described as ‘normal variation within the population rates of about 70%. of interest’. There may be additional known covariates 1482 Differential expression in SAGE Lash,A.E., Tolstoshev,C.M., Wagner,L., Schuler,G.D., that could account for much of this if they were included Strausberg,R.L., Riggins,G.J. and Altschul,S.F. (2000) in modelling the data, but this leads to a more involved SAGEmap: a public gene expression resource. Genome assessment of the variance structure, with pieces going Res., 10, 1051–1060. to each of these included factors. Our approach provides Madden,S.L., Galella,E.A., Zhu,J., Bertelsen,A.H. and perhaps the simplest way of incorporating between-library Beaudry,G.A. (1997) SAGE transcript profiles for p53- variability from a host of sources, and the variance bound dependent growth regulation. Oncogene, 15, 1079–1085. we have imposed is inherently at least as conservative as Man,M.Z., Wang,X. and Wang,Y. (2000) POWER SAGE: compar- other tests. That said, there are other ways of arriving ing statistical tests for SAGE experiments. Bioinformatics, 16, at similar test statistics (e.g. via weighted least squares; 953–959. Neter et al., 1996, p. 400) that may provide greater Margulies,E.H. and Innis,J.W. (2000) eSAGE: managing and an- flexibility in modelling SAGE data, and we are working alyzing data generated with serial analysis of gene expression on this now. (SAGE). Bioinformatics, 16, 650–651. Margulies,E.H., Kardia,S.L.R. and Innis,J.W. (2001) A comparative molecular analysis of developing mouse forelimbs and hindlimbs ACKNOWLEDGEMENTS using serial analysis of gene expression (SAGE). Genome Res., The authors gratefully acknowledge support from NIH- 11, 1686–1698. NCI Grant 1U19 CA84978-1A1. Mathews,J. and Walker,R.L. (1965) Mathematical Methods of Physics.Benjamin, New York. REFERENCES Michiels,E.M.C., Oussoren,E., van Groenigen,M., Pauws,E., Bossuyt,P.M.M., Voute,P ˆ .A. and Baas,F. (1999) Genes differ- Audic,S. and Claverie,J.-M. (1997) The significance of digital gene entially expressed in medulloblastoma and fetal brain. Physiol. expression profiles. Genome Res., 7, 986–995. Genomics, 1, 83–91. Baggerly,K.A., Coombes,K.R., Hess,K.R., Stivers,D.N., Neter,J., Kutner,M.H., Nachtsheim,C.J. and Wasserman,W. (1996) Abruzzo,L.V. and Zhang,W. (2001) Identifying differen- Applied Linear Statistical Models,Fourth edition, Irwin, New tially expressed genes in cDNA microarray experiments. J. York. Comput. Biol., 8, 639–659. Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and Chen,H., Centola,M., Altschul,S.F. and Metzger,H. (1998) Charac- Tsui,K.W. (2001) On differential variability of expression ra- terization of gene expression in resting and activated mast cells. tios: improving statistical inference about gene expression J. Exp. Med., 188, 1657–1668. changes from microarray data. J. Comput. Biol., 8, 37–52. Ewing,B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling Ryu,B., Jones,J., Blades,N.J., Parmigiani,G., Hollingsworth,M.A., of automated sequencer traces using Phred. I. Accuracy assess- Hruban,R.H. and Kern,S.E. (2002) Relationships and differen- ment. Genome Res., 8, 175–185. tially expressed genes among pancreatic cancers examined by Gelman,A., Carlin,J.B., Stern,H.S. and Rubin,D.B. (1995) Bayesian large-scale serial analysis of gene expression. Cancer Res., 62, Data Analysis. Chapman and Hall, New York. 819–826. Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., Satterthwaite,F.E. (1946) An approximate distribution of estimates Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D. et al. of variance components. Biometrics Bulletin, 2, 110–114. (2000) Functional discovery via a compendium of expression Stekel,D.J., Git,Y. and Falciani,F. (2000) The comparison of gene profiles. Cell, 102, 109–126. expression from multiple cDNA libraries. Genome Res., 10, Kal,A.J., van Zonneveld,A.J., Benes,V., van den Berg,M., 2055–2061. Koerkamp,M.G., Albermann,K., Strack,N., Ruijter,J.M., Stollberg,J., Urschitz,J., Urban,Z. and Boyd,C.D. (2000) A quanti- Richter,A., Dujon,B. et al. (1999) Dynamics of gene expression tative evaluation of SAGE. Genome Res., 10, 1241–1248. revealed by comparison of serial analysis of gene expression Velculescu,V.E., Zhang,L., Vogelstein,B. and Kinzler,K.W. (1995) transcript profiles from yeast grown on two different carbon Serial analysis of gene expression. Science, 270, 484–487. sources. Mol. Biol. Cell, 10, 1859–1872. Zhang,L., Zhou,W., Velculescu,V.E., Kern,S.E., Hruban,R.H., Lal,A., Lash,A.E., Altschul,S.F., Velculescu,V., Zhang,L., Hamilton,S.R., Vogelstein,B. and Kinzler,K.W. (1997) Gene ex- McLendon,R.E., Marra,M.A., Prange,C., Morin,P.J., Polyak,K. pression profiles in normal and cancer cells. Science, 276, 1268– et al. (1999) A public database for gene expression in human cancers. Cancer Res., 59, 5403–5407.

Journal

BioinformaticsOxford University Press

Published: Aug 12, 2003

There are no references for this article.