M. Schummer, W. Ng, Roger Bumgarner, P. Nelson, B. Schummer, D. Bednarski, L. Hassell, R. Baldwin, B. Karlan, L. Hood (1999). Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene, 238(2).
Gregory Lennon, Hans Lehrach (1991). Hybridization analyses of arrayed cDNA libraries. Trends in Genetics, 7(10).
R. Irizarry, Bridget Hobbs, F. Collin, Y. Beazer-Barclay, Kristen Antonellis, U. Scherf, T. Speed (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2).
Arindam Bhattacharjee, William Richards, Jane Staunton, Cheng Li, Stefano Monti, Priya Vasa, C. Ladd, J. Beheshti, Raphael Bueno, Michael Gillette, Massimo Loda, G. Weber, Eugene Mark, Eric Lander, W. Wong, Bruce Johnson, Todd Golub, D. Sugarbaker, M. Meyerson (2001). Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA, 98(24).
M. Lee, F. Kuo, G. Whitmore, J. Sklar (2000). Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci USA, 97(18).
M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares, D. Haussler (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA, 97(1).
D. Lockhart, Helin Dong, M. Byrne, M. Follettie, Michael Gallo, M. Chee, Mike Mittmann, Chunwei Wang, M. Kobayashi, Heidi Norton, E. Brown (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology, 14.
D. Ross, U. Scherf, M. Eisen, C. Perou, Christian Rees, P. Spellman, V. Iyer, S. Jeffrey, M. Rijn, M. Waltham, Alexander Pergamenschikov, Jeffrey Lee, D. Lashkari, D. Shalon, T. Myers, J. Weinstein, D. Botstein, P. Brown (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24.
Margaret A. Shipp, Ken N. Ross, P. Tamayo, Andrew P. Weng, Jeffery L. Kutok, Ricardo C.T. Aguiar, M. Gaasenbeek, Michael Angelo, Michael Reich, Geraldine S. Pinkus, Tane Ray, M. Koval, Kim W. Last, Andrew Norton, T. Andrew Lister, J. Mesirov, Donna Neuberg, Eric Lander, Jon C. Aster, Todd R. Golub (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8.
D. Nguyen, David Rocke (2002). Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 18(9).
Anil Jain, R. Dubes (1988). Algorithms for Clustering Data.
M. Schena, D. Shalon, Ronald Davis, P. Brown (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270.
U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mach, A. Levine (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA, 96(12).
Geneviève Piétu, O. Alibert, V. Guichard, B. Lamy, F. Bois, Elisabeth Leroy, R. Mariage-Sampson, R. Houlgatte, P. Soularue, Charles Auffray (1996). Novel gene transcripts preferentially expressed in human muscles revealed by quantitative hybridization of a high density cDNA array. Genome Research, 6(6).
D. Nguyen, David Rocke (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18(1).
Chen-Hsiang Yeang, Sridhar Ramaswamy, P. Tamayo, Sayan Mukherjee, R. Rifkin, Michael Angelo, Michael Reich, E. Lander, J. Mesirov, T. Golub (2001). Molecular classification of multiple tumor types. Bioinformatics, 17(Suppl 1).
C. Nguyen, D. Rocha, S. Granjeaud, Mylène Baldit, K. Bernard, P. Naquet, B. Jordan (1995). Differential gene expression in the murine thymus assayed by quantitative hybridization of arrayed cDNA clones. Genomics, 29(1).
Ash Alizadeh, M. Eisen, R. Davis, Chi Ma, I. Lossos, A. Rosenwald, Jennifer Boldrick, H. Sabet, T. Tran, Xin Yu, J. Powell, Liming Yang, G. Marti, T. Moore, J. Hudson, Li-Sheng Lu, D. Lewis, R. Tibshirani, G. Sherlock, W. Chan, T. Greiner, D. Weisenburger, J. Armitage, R. Warnke, R. Levy, W. Wilson, M. Grever, J. Byrd, D. Botstein, Pat Brown, L. Staudt (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403.
Bioconductor: open source software for bioinformatics [http://www.bioconductor.org].
Todd Golub, D. Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, J. Mesirov, Hilary Coller, M. Loh, James Downing, Michael Caligiuri, C. Bloomfield, Eric Lander (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439).
R. Tibshirani, T. Hastie, B. Narasimhan, Gilbert Chu (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA, 99.
T. Ideker, V. Thorsson, J. Ranish, R. Christmas, J. Buhler, J. Eng, Roger Bumgarner, D. Goodlett, R. Aebersold, L. Hood (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292(5518).
Sridhar Ramaswamy, P. Tamayo, R. Rifkin, Sayan Mukherjee, Chen-Hsiang Yeang, Michael Angelo, C. Ladd, Michael Reich, É. Latulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda, E. Lander, T. Golub (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA, 98.
K. Yeung, M. Medvedovic, Roger Bumgarner (2003). Clustering gene-expression data with repeated measurements. Genome Biology, 4.
Supplementary web site (2003).
A. Pease, D. Solas, E. Sullivan, M. Cronin, C. Holmes, S. Fodor (1994). Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc Natl Acad Sci USA, 91(11).
S. Dudoit, J. Fridlyand, T. Speed (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97.
L. van 't Veer, H. Dai, M. van de Vijver, Yudong He, A. Hart, M. Mao, H. Peterse, K. van der Kooy, M. Marton, A. Witteveen, G. Schreiber, R. Kerkhoven, C. Roberts, P. Linsley, R. Bernards, S. Friend (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415.
C. Nutt, D. Mani, R. Betensky, P. Tamayo, J. Cairncross, C. Ladd, U. Pohl, C. Hartmann, M. Mclaughlin, T. Batchelor, P. Black, A. von Deimling, S. Pomeroy, T. Golub, D. Louis (2003). Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research, 63(7).
A. van 't Wout, Ginger Lehrman, S. Mikheeva, Gemma O'Keeffe, M. Katze, Roger Bumgarner, G. Geiss, J. Mullins (2003). Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+ T-cell lines. Journal of Virology, 77.
M. Dettling, P. Bühlmann (2002). Supervised clustering of genes. Genome Biology, 3.
V. Vapnik (1998). Statistical Learning Theory.
R. Irizarry, B. Bolstad, F. Collin, L. Cope, Bridget Hobbs, T. Speed (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4).
Maria Smith, Zhaoxia Yue, G. Geiss, Natalya Sadovnikova, V. Carter, L. Boix, C. Lázaro, G. Rosenberg, Roger Bumgarner, N. Fausto, J. Bruix, M. Katze (2003). Identification of novel tumor markers in hepatitis C virus-associated hepatocellular carcinoma. Cancer Research, 63(4).
T. Hughes, M. Marton, Allan Jones, C. Roberts, R. Stoughton, C. Armour, Holly Bennett, E. Coffey, H. Dai, Yudong He, M. Kidd, Amy King, Michael Meyer, D. Slade, P. Lum, S. Stepaniants, D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, S. Friend (2000). Functional discovery via a compendium of expression profiles. Cell, 102.
Prediction of the diagnostic category of a tissue sample from its gene-expression profile and selection of relevant genes for class prediction have important applications in cancer research. We have developed the uncorrelated shrunken centroid (USC) and error-weighted, uncorrelated shrunken centroid (EWUSC) algorithms that are applicable to microarray data with any number of classes. We show that removing highly correlated genes typically improves classification results using a small set of genes.

Rationale
The problem of predicting the diagnostic category of a given tissue sample is of fundamental clinical importance. Conventional diagnostic methods are based on subjective evaluation of the morphological appearance of the tissue sample, which requires a visible phenotype and a trained pathologist to interpret the view. In some cases the class is easily identified by cell morphology or cell-type distribution, but in many cases apparently similar pathologies can lead to very different clinical outcomes. Since the advent of DNA array technology [1-6], researchers have begun to use expression array analysis as a quantitative phenotyping tool. The potential advantage of using arrays for phenotyping is that they provide a simultaneous quantitative measure of thousands of parameters (for example, gene-expression levels), some of which are likely to have disease relevance. When array analysis is used predominantly for phenotyping, we refer to the expression pattern as an 'expression array phenotype'. Owing to the ability to quantify a large number of parameters, the use of expression arrays in phenotyping promises both more accurate class prediction and the identification of subclasses that could not be defined by traditional methods.

There has been a recent explosion in the use of expression array phenotyping for identification and/or classification in a variety of diagnostic areas. Examples of diagnostic categories (or classes) include cancer versus non-cancer [7,8], different subtypes of tumor [9-13], and prediction of responses to various drugs or cancer prognosis [14-16]. The prediction of the diagnostic category of a tissue sample from its expression array phenotype, given the availability of similar data from tissues in identified categories, is known as classification (or supervised learning). A challenge in predicting diagnostic categories using microarray data is that the number of genes is usually significantly greater than the number of tissue samples available, and only a subset of the genes is relevant in distinguishing different classes. Selection of relevant genes for classification is known as feature selection. This has three main applications: first, the classification accuracy is often improved using a subset instead of the entire set of genes; second, a small set of relevant genes is convenient for developing diagnostic tests; and third, these genes may lead to biologically interesting insights that are characteristic of the classes of interest.

There have been many reports that address the classification and feature-selection problems, for example [9,10,14,17]. However, many of these methods are tailored towards binary classification, in which there are only two classes [9,14]. Moreover, there has been very limited effort to develop classification and feature-selection algorithms for microarray data with repeated measurements or error estimates. Array data is well known to be noisy; for example, Lee et al. [18] showed that any single microarray output is subject to substantial variability. This is particularly true for genes with low expression levels, which are more difficult to measure than genes with high expression levels. As the cost of microarray experiments is declining, more research laboratories are generating microarray data with repeated measurements [9,14,19,20]. Repeated measurements not only provide improved estimates of gene-expression levels but can also be used to estimate the uncertainty or variability in the measurement. In some cases the repeated measurements are biological replicates (for example, independent samples), whereas in other cases only technical replicates are available. Regardless of the source, however, variability estimates should be taken into account in both clustering and classification algorithms, as variability estimates can potentially be exploited to improve the results.

We have developed two algorithms, called the uncorrelated shrunken centroid (USC) algorithm and the error-weighted, uncorrelated shrunken centroid (EWUSC) algorithm. Both USC and EWUSC are integrated feature-selection and classification algorithms that are applicable to data with any number of classes. Our primary contribution is that both USC and EWUSC exploit interdependence between genes to reduce the number of selected features. In addition, EWUSC takes advantage of variability estimates over repeated measurements to down-weight noisy genes and noisy experiments, so that no ad hoc filtering step is necessary. On the other hand, USC is applicable to microarray datasets with or without repeated measurements.

Introduction to classification and feature selection
Classification is a supervised learning approach, in which the classes (or labels) of a subset of samples are inputs to the algorithm. This is in contrast to clustering, which is an unsupervised approach, in which no knowledge of the samples is assumed. A training set is a set of samples for which the classes are known. A test set is a set of samples for which the classes are assumed to be unknown to the algorithm, and the goal is to predict which classes these samples belong to. The first step in classification is to build a 'classifier' using the given training set, and the second step is to use the classifier to predict the classes of the test set.

In the context of gene-expression data, the samples are usually the experiments, and the classes (or labels) are usually different types of tissue samples (for example, cancer versus non-cancer, different tumor types, rate of disease progression, and response to therapy). A typical microarray dataset consists of thousands to tens of thousands of genes, and dozens to hundreds of experiments. One challenge of classification using microarray data is that the number of genes is significantly greater than the number of samples. In this situation, it is possible to find both random and biologically relevant correlations of gene behavior with sample type. To protect against spurious results, the goal is to identify the smallest possible subset of genes that correlate most strongly with the known class labels. In addition, a small subset of genes is desirable for the development of expression-based diagnostics. The problem of selecting relevant genes (or features) for classification is known as feature selection.

Cross-validation is a well-established technique used to optimize the parameters or features chosen in a classifier. In m-fold cross-validation, the training set is randomly divided into m disjoint subsets of roughly equal size. Each of these m subsets is left out in turn for evaluation, and the other (m - 1) subsets are used as inputs to the classification algorithm. In this work, we randomly divide each class into m disjoint subsets (where m is less than the size of the smallest class in the training set), so that each class is represented in the subsets fed to the classification algorithm. The left-out subset of the training set is used to evaluate classification accuracy because the classes of this subset are known. The most popular form of cross-validation is leave-one-out cross-validation (LOOCV), in which m is equal to the number of samples in the training set, and each sample in the training set is left out in turn to evaluate the prediction results.
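As a concrete illustration of the class-stratified m-fold split described above, the following sketch (our own, not code from this paper; numpy and the round-robin assignment are our choices) divides sample indices so that every class appears in each fold:

```python
import numpy as np

def stratified_folds(labels, m, rng=np.random.default_rng(0)):
    """Split sample indices into m disjoint folds such that each class
    is represented in every fold (assumes m < smallest class size)."""
    folds = [[] for _ in range(m)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for i, sample in enumerate(idx):      # deal samples round-robin
            folds[i % m].append(sample)
    return [np.sort(np.array(f)) for f in folds]

# each fold is held out once; the remaining m - 1 folds train the classifier
labels = np.array(["A"] * 20 + ["B"] * 20 + ["C"] * 20 + ["D"] * 20)
for held_out in stratified_folds(labels, m=4):
    train = np.setdiff1d(np.arange(labels.size), held_out)
    # fit on `train`, count classification errors on `held_out` ...
```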
Related work
van 't Veer et al. [14] recently applied a binary classification algorithm to cDNA array data with repeated measurements, and classified breast cancer patients into good and poor prognosis groups. Their classification algorithm consists of the following steps. The first step is filtering, in which only genes with both small error estimates and significant regulation relative to a reference pool of samples from all patients are chosen. The second step consists of identifying a set of genes whose behaviour is highly correlated with the two sample types (for example, upregulated in one sample type but downregulated in the other). These genes are rank-ordered so that genes with the highest magnitudes of correlation with the sample types have top ranks. In the third step, the set of relevant genes is optimized by sequentially adding genes with top-ranked correlation from the second step. Leave-one-out cross-validation is used to evaluate and choose an optimal set of features. van 't Veer et al.'s approach takes variability estimates of repeated measurements into consideration by using error-weighted correlation in their method. However, this method involves an ad hoc filtering step and does not generalize to more than two classes.

Ramaswamy et al. [10] combined support vector machines (SVMs), which are binary classifiers, to solve the multiclass classification problem. They showed that the one-versus-all approach of combining SVMs yields the minimum number of classification errors on their Affymetrix data with 14 tumor types. The one-versus-all combination approach builds k (the number of classes) binary classifiers, each of which distinguishes one class from all the other classes. Suppose binary classifier i predicts a discriminant value f_i(x) for a given sample x in the test set. The combined multiclass classifier assigns sample x to the class for which the corresponding binary classifier produces the highest discriminant value. In addition to not taking variability estimates of repeated measurements into account, this approach selects different relevant features (genes) for each binary classifier.
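The one-versus-all combination rule reduces to an argmax over the k discriminant values. A minimal sketch (ours, for illustration only; the array shape is an assumption, and the SVMs that produce the discriminants are omitted):

```python
import numpy as np

def one_versus_all_predict(f):
    """f: (k, n) array where f[i, j] is binary classifier i's discriminant
    value f_i(x_j). Each sample is assigned to the class whose classifier
    is most confident."""
    return np.argmax(f, axis=0)

f = np.array([[ 0.4, -1.2],     # classifier for class 0 vs. rest
              [-0.3,  0.9],     # class 1 vs. rest
              [ 0.1,  0.2]])    # class 2 vs. rest
print(one_versus_all_predict(f))  # -> [0 1]
```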
Nguyen and Rocke [21,22] used partial least squares (PLS) for feature selection, together with traditional classification algorithms such as logistic discrimination and quadratic discrimination, to classify multiple tumor types from microarray data. These traditional classification algorithms require the number of samples (experiments) to be greater than the number of variables (genes), and it is therefore essential to reduce the dimensionality before applying these traditional classification techniques. PLS is a dimension-reduction technique that maximizes the covariance between the classes and a linear combination of the genes. This approach can be generalized to multiple classes, but it does not make use of variability estimates of the data. In addition, it is a multistep process that involves a filtering step (to select genes with significant mean differences) and then application of PLS to further reduce the dimensionality so that the number of samples is greater than the number of dimensions.

Dudoit et al. [23] compared the performance of different discrimination methods (including nearest neighbor classifiers, linear discriminant analysis and classification trees) for classifying multiple tumor types using gene-expression data. None of the discrimination methods they evaluated takes measurement variability into consideration, and their emphasis is on discrimination methods and not feature selection.

Yeung et al. [24] showed that clustering algorithms that take advantage of repeated measurements (including the error-weighted approach that down-weights noisy measurements) yield more accurate and more stable clusters. Here, we will focus on the supervised learning approach, instead of the unsupervised clustering technique.

Tibshirani et al. [17] developed a 'shrunken centroid' (SC) algorithm for classifying multiple cancer types. It is an integrated approach for feature selection and classification. Features are selected by considering one gene at a time: the difference between the class centroid (average expression level or ratio within a class) of a gene and the overall centroid (average expression level or ratio over all classes) of a gene is compared to the within-class standard deviation plus a 'shrinkage threshold' which is fixed for all genes. The intuition is that genes with at least one class centroid that is significantly different from the overall centroid are selected as relevant genes. The size of the shrinkage threshold is determined by cross-validation on the training set to minimize classification errors.
Our contributions
Our algorithms have the following desirable characteristics. Both EWUSC and USC exploit the interdependence of genes to reduce the number of selected features. EWUSC takes advantage of the variability of gene-expression data over repeated measurements, so no ad hoc filtering step is necessary. Both EWUSC and USC can be applied to data with any number of classes. Both EWUSC and USC adopt an integrated approach for both feature selection and classification. Both algorithms make no assumption on data distributions.

We illustrate the advantage of removing correlated genes (for example, the USC algorithm) on the NCI 60 data [12], for which there is no variability information. This dataset has been extensively used in other publications for classification algorithm development [22,23,25]. We illustrated and compared our USC and EWUSC algorithms with two real datasets: a multiple tumor dataset from Ramaswamy et al. [10] and a breast cancer dataset from van 't Veer et al. [14]. These two datasets were chosen as they are publicly available in a form from which we can calculate or obtain error estimates for each gene-expression level or ratio. We used a subset of the multiple tumor data [10] that consists of 7,129 genes and 11 tumor types on Affymetrix chips. There are 96 samples in the training set, and 27 samples in the test set. For the Affymetrix dataset we estimated the variability in the gene-expression levels using the robust multi-array analysis (RMA) tool [26,27] from the BioConductor project [28]. A subset of the published data was used as we could only obtain raw data (.cel files) for a subset. The breast cancer dataset [14] consists of 25,000 genes with four repeated measurements on cDNA arrays. There are 78 samples in the training set, 19 samples in the test set, and two classes of patients: one class with good prognosis (with more than 5 years of survival time), and another class with poor prognosis (with less than 5 years of survival time). For the breast cancer cDNA array data, published p-values as calculated by Rosetta's Resolver software were used to calculate the error estimates. In addition, we created synthetic datasets with repeated measurements and compared the performance of EWUSC, USC and SC at different noise levels.

We adopted three criteria for assessing feature selection and classification algorithms: prediction accuracy, number of relevant genes, and feature stability. Prediction accuracy is defined as the percentage of correct classifications on the test set. The number of relevant genes is the total number of genes used to achieve optimal prediction accuracy. Feature stability is the level of agreement of selected genes chosen over different cross-validation runs of the algorithm.

Using these algorithms we obtained the following general results. Exploiting gene interdependence by removal of correlated genes typically results in comparable or higher prediction accuracy using fewer relevant genes. This is highly desirable if one wishes to develop diagnostic tools from the selected set of genes. Using error or variability estimates as weighting factors generally yields higher feature stability and reduces the number of relevant genes on real datasets. On the multiple tumor data, our EWUSC algorithm achieves a 16% increase in prediction accuracy using only 10% of the genes as features (compared with using all the available genes in the published result). On the breast cancer data, our EWUSC algorithm produces the same number of classification errors as the published result using a larger feature set. Unlike the published algorithm for this dataset, however, the EWUSC algorithm is applicable to datasets with more than two classes.

Our integrated classification and feature-selection algorithm
As our USC and EWUSC algorithms are motivated by the shrunken centroid (SC) algorithm [17], we will briefly review the SC algorithm, and then discuss our USC and EWUSC algorithms. Details of these algorithms can be found later in the paper.

The SC approach
The SC approach [17] is essentially a robust version of the 'nearest centroid' approach, in which a sample is assigned to the class with the nearest average pattern. Features are selected by considering each gene individually. The overall centroid of a gene i is defined as the average expression level/ratio of gene i over all the experiments. The class centroid of a gene i in class k is defined to be the average expression level/ratio of gene i over all the samples in class k. A gene is predictive of the class if at least one of its class centroids significantly differs from its overall centroid. One obvious definition of 'significantly' in the previous sentence is 'differs by more than the variation (or standard deviation) within the class', which is essentially a modified form of a t-test. The shrunken centroid method adds an additional term (s_0, described in [17] and in the section Details of algorithms below) to the within-class standard deviation - for example, the difference between the in-class average and the overall average must exceed the in-class variation by s_0. A t-test-like statistic, the relative difference (d_ik), is defined to represent the difference between the class centroid and the overall centroid divided by the variance (in-class variation + s_0), and the absolute value of d_ik is reduced by the 'shrinkage threshold' ∆. ∆ is determined by cross-validation such that the number of classification errors is minimized on the training set.
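In symbols, following [17] (a restatement in our notation rather than a quotation; the class-size factor m_k and the exact choice of s_0 are specified in that paper), the relative difference for gene i in class k and its shrunken version are:

```latex
d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_{i}}{m_k\,(s_i + s_0)},
\qquad
d'_{ik} = \operatorname{sign}(d_{ik})\,\bigl(\lvert d_{ik}\rvert - \Delta\bigr)_{+},
```

where \bar{x}_{ik} is the class centroid of gene i in class k, \bar{x}_i its overall centroid, s_i the pooled within-class standard deviation, and (·)_+ the positive part. Gene i is selected as relevant if d'_{ik} ≠ 0 for at least one class k.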
The USC approach
Our USC algorithm adds a step to the SC algorithm to remove redundant, correlated genes. The benefit of removing highly correlated genes is twofold. First, it reduces the number of relevant features (genes) needed for classification. A small feature set is highly desirable if one wishes to use the results of feature selection and classification to develop diagnostic tools such as reverse transcription PCR (RT-PCR)-based tests on a small number of the most relevant genes. Second, the removal of redundant genes reduces the impact of over-fitting, and hence potentially improves classification accuracy.

The SC algorithm produces a set of relevant genes, S_∆, for any given shrinkage threshold ∆. As ∆ increases, the number of relevant genes in S_∆ decreases; for example, the gene list is reduced to selected genes for which the within-class centroids are farther away from the overall centroid and for which the within-class variation is small. Each gene is considered independently in the SC algorithm. Our modification exploits the correlation between genes by removing genes that are highly correlated within the set of relevant genes S_∆. Specifically, we compute the pairwise correlation for each pair of genes (g_i, g_j) in S_∆ for each ∆. If the pairwise correlation is greater than a correlation threshold ρ_0, the gene g_j with the smaller relative difference is removed from the set of relevant genes. This results in a set of relevant genes S(∆, ρ_0) for each shrinkage threshold ∆ and each correlation threshold ρ_0. These relevant genes are used to classify new samples. The USC algorithm is equivalent to the SC algorithm when no correlated genes are removed (that is, ρ_0 = 1). We apply this USC algorithm to the training set using cross-validation to determine the number of classification errors for each ∆ and each ρ_0. The optimal parameters ∆ and ρ_0 are chosen such that the number of cross-validation classification errors is minimized on the training set. These optimal parameters are then used to classify samples from unknown classes on the test set. Our results show that the removal of correlated genes provides a significant improvement over the SC algorithm in classification results, and hence our USC algorithm is useful for datasets in which error estimates are not available.
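The gene-removal step can be sketched as follows. This is our simplified illustration, not the published implementation: it uses a single per-gene standard deviation in place of the pooled within-class estimate with the m_k factor, and plain (unweighted) Pearson correlation:

```python
import numpy as np

def usc_select(X, labels, delta, rho0, s0=0.01):
    """Return indices of relevant genes for X (genes x samples):
    shrink relative differences by delta, then drop genes correlated
    above rho0 with an already-kept, more discriminative gene."""
    classes = np.unique(labels)
    overall = X.mean(axis=1)                      # overall centroid per gene
    s = X.std(axis=1, ddof=1) + s0                # simplified spread estimate
    d = np.stack([(X[:, labels == k].mean(axis=1) - overall) / s
                  for k in classes], axis=1)      # genes x classes
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    score = np.abs(d_shrunk).max(axis=1)
    selected = np.flatnonzero(score > 0)
    # visit genes from largest to smallest score; keep a gene only if it is
    # not correlated above rho0 with any gene kept so far (rho0 = 1 gives SC)
    kept = []
    for g in selected[np.argsort(-score[selected])]:
        if all(np.corrcoef(X[g], X[h])[0, 1] <= rho0 for h in kept):
            kept.append(g)
    return np.array(kept, dtype=int)
```

The kept genes form S(∆, ρ_0); a new sample is then assigned to the class whose shrunken centroid over these genes is nearest.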
The EWUSC approach
Our EWUSC algorithm is based on the USC algorithm with a key modification: we take advantage of error estimates or variability over repeated measurements. We define an error-weighted overall centroid, error-weighted class centroid, error-weighted relative difference, error-weighted shrunken class centroid, and error-weighted discriminant score in order to down-weight both noisy genes and noisy experiments. In addition, we adopt the error-weighted correlation in the removal of highly correlated genes to select relevant genes. Thus the EWUSC algorithm is identical to the USC algorithm except for error-weighted definitions to down-weight noisy genes and noisy experiments in our calculations. When all genes and all experiments have the same variability estimates, the EWUSC algorithm is equivalent to the USC algorithm. As our results show, this error-weighted approach typically reduces the number of relevant genes and improves feature stability, and thus EWUSC is usually the method of choice when error or variability estimates are available. A detailed description of the EWUSC algorithm is given later in the paper.
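The error-weighting used throughout EWUSC is inverse-variance weighting, so a measurement with a large error estimate contributes little. A toy illustration (ours; the paper applies the same weighting inside its centroids, relative differences and correlations):

```python
import numpy as np

def error_weighted_mean(x, err):
    """Inverse-variance weighted mean: down-weights noisy measurements."""
    w = 1.0 / err**2
    return np.sum(w * x) / np.sum(w)

# the noisy third replicate (err = 2.0) barely moves the centroid
print(error_weighted_mean(np.array([1.0, 1.1, 5.0]),
                          np.array([0.1, 0.1, 2.0])))   # ~1.05
```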
Datasets used
National Cancer Institute NCI 60 data
In the NCI 60 data [12], cDNA microarrays were used to study the expression of approximately 60 cell lines derived from tumors with different sites of origin (see Table 1). We used the same pre-processed dataset as in Dudoit et al. [23], which consists of log expression ratios of 5,244 genes over 61 experiments. Two prostate and one unknown cell lines from the original data [12] were excluded in their analysis because of their small class sizes. Only one leukemia and one breast cancer cell line were repeated three times, and hence there are no repeated measurements or variability estimates available for all 61 samples. These repeated experiments of the leukemia and breast cancer cell lines are treated as individual samples. In addition, no additional test set is available for this data. To compare our results with those of Dudoit et al. [23], we adopted their 2:1 scheme in which one third of the samples are reserved as a test set.

Specifically, we randomly divided each class in the original data (61 experiments) into roughly three parts such that the training set consists of a total of 43 experiments and the test set consists of a total of 18 experiments. Table 2 gives the class sizes of the training and test sets. The optimal parameters are determined using cross-validation on the training set with 43 samples, and these optimal parameters are used to classify the 18 samples in the test set. We repeated this random partition of the original data into three parts multiple times.

Table 1: Tumor types and class sizes of the NCI 60 dataset

Origin of cell lines               Class size (total 61 samples)
Breast                             9
Central nervous system             5
Colon                              7
Leukaemia                          8
Melanoma                           8
Non-small-cell lung carcinoma      9
Ovarian                            6
Renal                              9

Tumor types and class sizes of the original full data with a total of 61 experiments.

Table 2: Tumor types and class sizes of the randomly partitioned training and test sets of the NCI 60 dataset

Origin of cell lines               Training set (total 43)    Test set (total 18)
Breast                             6                          3
Central nervous system             4                          1
Colon                              5                          2
Leukaemia                          6                          2
Melanoma                           6                          2
Non-small-cell lung carcinoma      6                          3
Ovarian                            4                          2
Renal                              6                          3

As no additional test set is available for the NCI 60 data, we randomly divided each class of these 61 samples into roughly three parts and reserved one third of the samples as a test set.

Multiple tumor data
The multiple tumor dataset [10] consists of a large number of tumor samples spanning 14 different tumor types hybridized to Affymetrix chips. On the Affymetrix platform, each target gene is represented by 11-20 short oligo probes of approximately 25 base-pairs (bp). Our goal is to take advantage of the variability over different oligos for the same genes using our EWUSC algorithm. We pre-processed the raw multiple tumor data with the log-scale robust multi-array analysis (RMA) measure [27] implemented in the BioConductor project. The RMA measure is a summary statistic for the expression levels over all the different oligos for the same gene. The standard error of the RMA measure is a variability estimate of the expression level over the different oligos representing the same target gene. In order to obtain the RMA measures and their associated standard errors on the multiple tumor data, the raw data (.cel files) are necessary. Because we have access to only a subset of the raw multiple tumor data, we used a subset of the original data in our study. The subset of multiple tumor data we used consists of 7,129 genes, 96 samples in the training set, and 27 samples in the test set. These samples span 11 different tumor types (Table 3). The smallest class size is four on the training set, and hence four-fold cross-validation (m = 4) is used on this data.

Table 3: Tumor types and class sizes for the training set and test set of the subset of multiple tumor data used in this study

Tumor type       Training set (total 96)    Test set (total 27)
Breast           7                          0
Lung             4                          2
Colorectal       7                          3
Lymphoma         14                         5
Melanoma         5                          0
Uterus           7                          2
Leukemia         23                         6
Renal            5                          3
Pancreas         7                          0
Mesothelioma     8                          3
CNS              9                          3

Breast cancer data
The breast cancer data [14] consists of primary breast tumor samples hybridized to cDNA arrays containing approximately 25,000 genes. Two hybridizations were carried out for each sample using a dye-reversal technique. Hence, there are four repeated measurements for each gene and each sample. The p-values of log expression ratios are also available. These p-values are the results of the four repeated measurements and an error model based on extensive control experiments [29]. A p-value close to 1 represents low confidence that an expression ratio is significantly different from 1, while a p-value close to 0 represents high confidence that an expression ratio is significantly different from 1. We converted these p-values into error estimates of log ratios, which are used in our EWUSC algorithm.

The breast cancer dataset consists of approximately 25,000 genes, 78 samples in the training set, and 19 samples in the test set. van 't Veer et al. [14] divided these samples into the good and poor prognosis groups, which have greater than 5 and less than 5 years of survival time respectively. Hence, there are two classes in this dataset (see Table 4). We performed 10-fold cross-validation (m = 10) on the breast cancer data.

Table 4: Prognosis groups and class sizes of the training set and test set of the breast cancer data

Prognosis group                     Training set (total 78)    Test set (total 19)
Good (> 5 years of survival time)   44                         7
Poor (≤ 5 years of survival time)   34                         12

Synthetic data
We also created synthetic datasets to compare the performance of our algorithms. Our approach is to start with 'patterned genes', which have a different expression pattern in each class and are therefore relevant in classifying unknown samples. The next step is to introduce noise (variation in both the class and non-class values) to these patterned genes in order to reflect 'real-life' data. Finally, 'non-patterned genes', which are irrelevant in classifying samples, are added to these synthetic datasets. Even with this simple synthetic data-generation approach, generating sensible synthetic data turned out to be a nontrivial task. There are two parameters that control the noise levels in the synthetic datasets: the biological noise level (α) and the technical noise level (λ). The biological noise level (α) controls the level of biological noise within each class (and hence the signal-to-noise ratio), such that the classes are less separable with a higher α. The technical noise level (λ) controls the noise level over repeated measurements, such that a high λ indicates relatively noisy repeated measurements. The primary difficulty in generating synthetic data is setting the parameters α and λ, and the proportion of patterned genes. As it is not obvious how to set these parameters to reflect 'real-life' data, we experimented with different parameter settings, such as different biological noise levels: low (α = 0.1 with signal-to-noise ratio approximately 20), medium (α = 1 with signal-to-noise ratio approximately 2), or high (α = 2 with signal-to-noise ratio approximately 1); and low (λ = 1) or high (λ = 5 or 10) technical noise. We also experimented with different proportions of patterned genes, and concluded that this parameter does not have any significant impact on the results.

Another issue in generating 'realistic' synthetic data involves the generation of non-patterned genes that are irrelevant in distinguishing the classes. We addressed this issue by random sampling with replacement from a real dataset (that is, the breast cancer dataset [14]). Specifically, for each non-patterned gene, we randomly sample a gene g from the breast cancer data, and then randomly sample from the experiments of gene g in the breast cancer data, such that these non-patterned genes would not show any class-specific expression patterns but would show realistic variations in expression levels over all classes.

In particular, our synthetic training sets consist of 1,000 genes, 80 samples, and 4 classes, such that there are 20 samples in each class. Our synthetic test sets consist of 1,000 genes and 40 samples, with 10 samples in each class. We generated 64 patterned genes which have a different expression pattern in each class, for example, genes that are upregulated (or downregulated) in only m of the four classes, where m = 1, 2, 3. In addition, there are five duplicates of each of these 64 patterned genes, such that there are a total of 320 patterned genes and (1,000 - 320 = 680) non-patterned genes. Ideally, the perfect classification algorithm would select only one of these five copies of the patterned genes. We also investigated the effect of the number of repeated measurements by generating synthetic datasets with 1, 4 or 20 repeated measurements. These synthetic datasets are available from our supplementary website [30].
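The resampling scheme for non-patterned genes can be sketched as follows (our illustration of the scheme described above; the function and argument names are ours):

```python
import numpy as np

def non_patterned_genes(real_data, n_genes, n_samples,
                        rng=np.random.default_rng(0)):
    """real_data: genes x samples matrix from a real study. For each
    synthetic gene, pick a random real gene, then resample its measured
    values with replacement: realistic variation, no class signal."""
    out = np.empty((n_genes, n_samples))
    for i in range(n_genes):
        g = rng.integers(real_data.shape[0])
        out[i] = rng.choice(real_data[g], size=n_samples, replace=True)
    return out
```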
Assessment criteria
Prediction accuracy
As the class information for the test sets is available, we define prediction accuracy as the percentage of correct classifications on the test set. The class information on a test set is used only to evaluate the performance of classification and feature-selection algorithms, and is unknown to the algorithms.

Number of relevant features
One of the goals of classification is to select a minimal set of relevant genes (or features) that can be used in future diagnosis or classification of tissue samples. We judge each method by the total number of relevant features required for optimal classification accuracy. A small set of relevant genes is desirable because it is more cost-effective in the development of diagnostic tools based on the results of expression analysis. For example, the cost of an RT-PCR test to classify patient samples is directly proportional to the number of genes which must be tested to make the diagnosis. As shown below, both the USC and EWUSC methods usually result in a significant reduction in the numbers of selected genes for classification. We feel this represents a major advance in classification algorithms.

Feature stability
Because relevant genes are derived from the training set and the choice of the training set is often arbitrary, a set of relevant genes that is insensitive to the training sets used would be desirable. Hence, we define feature stability as the level of agreement between the set of relevant genes chosen in each fold of the cross-validation data and the set of relevant genes chosen using the full training set. Specifically, for each fold of the cross-validation data and for each set of parameters (∆ and ρ_0), we compute the Jaccard index [31], which measures the level of agreement between the set of relevant genes chosen in this fold and the set chosen using the full training set. The Jaccard index lies between 0 and 1. A high Jaccard index (close to 1) implies a high level of agreement, and hence high feature stability (a mathematical definition of the Jaccard index can be found in the section Details of algorithms, below). We define the feature stability of one cross-validation run for a given set of parameters (∆ and ρ_0) as the average Jaccard index over all m folds of cross-validation. In our experiments, we usually have five random runs of cross-validation; hence we adopt the average Jaccard index over these five random runs of cross-validation as our measure of overall feature stability for given parameters (∆ and ρ_0).
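The Jaccard index is simply the size of the intersection of two feature sets divided by the size of their union; a minimal sketch (ours):

```python
def jaccard(a, b):
    """Jaccard index between two gene sets: 1 = identical (stable
    feature selection), 0 = disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# genes chosen in one cross-validation fold vs. on the full training set
print(jaccard({"g1", "g2", "g3"}, {"g2", "g3", "g4"}))  # 0.5
```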
USC the cross-validation data and for each set of parameters (∆ produces comparable error rates to the results reported in and ρ ), we compute the Jaccard index [31] which measures Dudoit et al. [23] using roughly the same number of relevant the level of agreement between the set of relevant genes cho- genes. However, our USC algorithm allows the optimal sen in this fold and the set chosen using the full training set. parameters (which indirectly control the number of selected The Jaccard index lies between 0 and 1. A high Jaccard index genes) to be determined. In this case, the optimal parameters (close to 1) implies high level of agreement, and hence, high produce an error rate of approximately 28% on the cross-val- feature stability (a mathematical definition of the Jaccard idation data. We repeated the random partition of the full index can be found in the section Details of algorithms, dataset with 61 samples into a training set with 43 samples below). We define feature stability of one cross-validation run and a test set with 18 samples multiple times, and obtained for a given set of parameters (∆ and ρ ) as the average Jaccard similar results on different random partitions of the original index over all m folds of cross-validation. In our experiments, dataset. we usually have five random runs of cross-validation; hence we adopt the average Jaccard index over these five random runs of cross-validation as our measure of overall feature sta- Results on the multiple tumor data bility for given parameters (∆ and ρ ). Figure 2 shows the results of applying EWUSC to the training set, four-fold cross-validation data, and test set of the multi- ple tumor data over a range of shrinkage thresholds (∆) and correlation thresholds (ρ ). In Figure 2a,c the percentage of Results on the NCI 60 data As variability estimates are not available on the NCI 60 data, classification errors is plotted against ∆ on the training and we compared the prediction accuracy from USC and SC (Fig- test sets respectively. In Figure 2b, the average percentage of ure 1; and Figure S14 of [30]). We showed that USC generally errors is plotted against ∆ over five random runs of cross-val- produces higher prediction accuracy than SC using the same idation. The optimal parameters (∆ and ρ ) are determined number of relevant genes (Figure 1). In particular, USC from the cross-validation results. Figure 2a-c shows that pre- requires 44% of the available genes (2,315 out of 5,244 genes) diction accuracy is increased (lower percentage of errors) to achieve a prediction accuracy of 72%, whereas SC requires when ρ < 1 over most values of ∆ (especially 2 ≤ ∆ ≤ 7) on the 77% of genes (3,998 out of 5,244 genes) to achieve the same training set, cross-validation data and test set. This shows prediction accuracy. 
Our results show that the removal of that removing highly correlated genes increases prediction Genome Biology 2003, 4:R83 Prediction accuracy (%) R83.8 Genome Biology 2003, Volume 4, Issue 12, Article R83 Yeung and Bumgarner http://genomebiology.com/2003/4/12/R83 (a) Training data ρ = 1 ρ = 0.9 ρ = 0.8 ρ = 0.7 ρ = 0.6 0 2 4 6 8 10 12 14 16 18 20 (b) Random cross-validation data ρ = 1 ρ = 0.9 ρ = 0.8 ρ = 0.7 ρ = 0.6 0 2 4 6 8 10 12 14 16 18 20 (c) Test data ρ = 1 ρ = 0.9 ρ = 0.8 ρ = 0.7 ρ = 0.6 0 2 4 6 8 10 12 14 16 18 20 (d) Number of genes 8,000 6,000 ρ = 1 ρ = 0.9 4,000 ρ = 0.8 ρ = 0.7 2,000 ρ = 0.6 0 2 4 6 8 10 12 14 16 18 20 Fig Predic ure tio 2n accuracy on the multiple tumor data using the EWUSC algorithm over the range of ∆ from 0 to 20 Prediction accuracy on the multiple tumor data using the EWUSC algorithm over the range of ∆ from 0 to 20. The percentage of classification errors is plotted against ∆ on (a) the full training set (96 samples) and (c) the test set (27 samples). In (b) the average percentage of errors is plotted against ∆ on the cross-validation data over five random runs of fourfold cross-validation. In (d), the number of relevant genes is plotted against ∆. Different colors are used to specify different correlation thresholds (ρ = 0.6, 0.7, 0.8, 0.9 or 1). Results of ρ < 0.6 are shown in Figure S1 on [30]. Optimal parameters are 0 0 inferred from the cross-validation data in (b). Genome Biology 2003, 4:R83 Number of genes Classification error (%) Average classification Classification error (%) error (%) comment reviews reports deposited research refereed research interactions information http://genomebiology.com/2003/4/12/R83 Genome Biology 2003, Volume 4, Issue 12, Article R83 Yeung and Bumgarner R83.9 accuracy. In addition, Figure 2d shows that the number of rel- Stability across different random cross-validation evant genes is drastically reduced when genes with correla- 0.5 tion threshold (ρ ) above 0.9 are removed. From Figure 2b, 0.45 the average cross-validation error rate gradually reduces 0.4 when the correlation threshold ρ is decreased from 1 to 0.9 0.35 to 0.8, but the average error rate increases when ρ < 0.8. 0.3 (This observation also holds for ρ < 0.6, which are not shown 0.25 in Figure 2 for clarity.) Therefore, the optimal ρ is estimated 0.2 to be 0.8. 0.15 0.1 EWUSC produces the minimum average number of cross-val- 0.05 idation errors at ∆ = 0 and ρ = 0.9 using 1,626 relevant genes, 1 2 3 10 10 10 which achieves 78% prediction accuracy. However, ∆ = 0 is an Number of genes (log scale) unsatisfactory shrinkage threshold because we would prefer relevant genes to have class centroids significantly different EWUSC (ρ = 0.8) USC (ρ = 0.8) from their overall centroids. Moreover, the average error rate 0 SC starts to increase almost linearly when ∆ is greater than 6 on the cross-validation data. This 'bend' is more obvious Figure Com tu Fig mu or p r data e arison 3 of feature stability of EWUSC, USC and SC on the multiple S1(e) on [30], which shows the error rate for each of the five Comparison of feature stability of EWUSC, USC and SC on the multiple random runs of fourfold cross-validation for ∆ = 0 to 14. The tumor data. The average Jaccard index is plotted against the number of optimal ∆ is estimated to be 5.6. When ∆ = 5.6 and ρ = 0.8, relevant genes over five random runs of fourfold cross-validation using the prediction accuracy is 93% and the number of relevant EWUSC and USC at ρ = 0.8 and SC. 
A high average Jaccard index genes is 680 (out of a total of 7,129 genes). indicates high feature stability. The EWUSC algorithm selects the most stable features. Note that the horizontal axis is shown on a log scale. We also applied the USC and SC algorithms to the multiple tumor data and obtained similar results, except that the error rates are generally higher. Similarly, USC produces the mini- mum average number of cross-validation errors at ∆ = 0 and one-versus-all approach. In contrast, our EWUSC algorithm ρ = 0.9 using 1634 relevant genes, which achieves 74% pre- achieves a classification accuracy of 93% on the test set of the diction accuracy. SC produces the minimum average number multiple tumor data. As we used a subset of the original of cross-validation errors at ∆ = 0.4 using all 7,129 genes. On multiple tumor data and pre-processed the raw data using the the other hand, the optimal parameters (∆, ρ ) can be esti- RMA measures [27], we evaluated the performance of SVM mated by visual observation of 'bends' in the cross-validation combined with the one-versus-all method on the identical curves. In particular, when ∆ = 5.6 and ρ = 0.8, the predic- pre-processed subset of multiple tumor data used in our tion accuracy is 85% and the number of relevant genes is 735 experiments with the EWUSC and the USC algorithms. In our using the USC algorithm (see Figure S2 on [30] for detailed comparison study, we used the signal to noise (S2N) results). measures [9] to select relevant features for each binary SVM classifier. To produce directly comparable results, we used We also compared feature stability of the EWUSC and USC the exact same five splits of the training set into cross-valida- algorithms at correlation threshold (ρ ) = 0.8 with the SC tion data. algorithm [17] (which is equivalent to USC at ρ = 1) over dif- ferent numbers of relevant genes (Figure 3), and showed that Figure 4 compares the prediction accuracy on the test set of EWUSC produces higher feature stability (higher average the multiple tumor data using the EWUSC and USC algo- Jaccard index) than the USC and SC algorithms. The rela- rithms at the estimated optimal correlation threshold (ρ = tively high feature stability is due to relatively high numbers 0.8), the SC algorithm [17] and SVM (with S2N for feature of common features selected in different runs of cross-valida- selection). There are a few observations from Figure 4. First, tion (see Figure S5 on [30]). We also showed that EWUSC USC produces higher prediction accuracy than SC using the almost always selects relatively more stable sets of relevant same number of relevant genes. As SC is equivalent to USC at genes than USC (even over other correlation thresholds that ρ = 1, our results show that removing highly correlated genes are not shown). Hence, our results demonstrate that incorpo- reduces the number of relevant genes and improves predic- rating variability estimates over repeated measurements tion accuracy. Second, EWUSC generally produces higher yields higher feature stability. prediction accuracy than USC using the same number of rel- evant genes, except when both the number of relevant genes and prediction accuracy is low. This shows that we can poten- Comparison with published results Ramaswamy et al. [10] reported 78% classification accuracy tially improve prediction accuracy by taking advantage of on the multiple tumor data using SVMs combined using the error estimates in the data. 
Genome Biology 2003, 4:R83 Feature stability (average Jaccard index) R83.10 Genome Biology 2003, Volume 4, Issue 12, Article R83 Yeung and Bumgarner http://genomebiology.com/2003/4/12/R83 Test data 40 SVM EWUSC EWUSC (ρ = 0.7) USC USC (ρ = 0.6) SC 0 SC 10 100 1,000 10,000 Total number of genes (log scale) 1 2 3 4 10 10 10 10 Total number of genes (log scale) Co algo Figmpariso u rith re 4 ms on n o ft predicti he multip on acc le tumor uracy data of EWUSC, USC, SVM and SC Comparison of prediction accuracy of EWUSC, USC, SVM and SC algorithms on the multiple tumor data. The horizontal axis shows the total Co can Figm c uer data r pe ar 5 ison of prediction accuracy of EWUSC, USC and SC on the breast number of distinct genes selected over all binary SVM classifiers on a log Comparison of prediction accuracy of EWUSC, USC and SC on the breast scale. Some results are not available on the full range of the total number cancer data. The percentage of prediction accuracy is plotted against the of genes. For example, the maximum numbers of selected genes for number of relevant genes using the EWUSC algorithm at ρ = 0.7, the EWUSC and USC are roughly 1,000. The reported prediction accuracy is USC algorithm at ρ = 0.6 and the SC algorithm (USC at ρ = 1.0). Note 0 0 78% [10] using all 16,000 available genes on the full data. The EWUSC that the horizontal axis is shown on a log scale. algorithm achieves 89% prediction accuracy with only 89 genes. With 680 genes, EWUSC produces 93% prediction accuracy. any of the 78 samples of the training set. It is undesirable to Third, our SVM results (on a subset of the multiple tumor classify new samples using these genes that do not show any data pre-processed with RMA measures) are generally much expression patterns. With EWUSC (which takes error esti- better than the published result of 78% [10] (on the full mates into consideration), these two genes are eliminated for dataset pre-processed with MAS 4). Fourth, SVM with S2N as all ∆ > 0. On the contrary, one of these two genes is selected our feature-selection method produces high prediction accu- as a relevant gene by USC for ∆ = 0, 0.05, ..., 0.7 at ρ = 1. racy at the expense of using a lot of relevant genes. For exam- ple, SVM requires a total of 1,699 genes over all the binary The detailed results of applying the EWUSC and USC algo- classifiers to achieve 93% prediction accuracy, whereas our rithms to the breast cancer data are shown in Figures S8 and EWUSC algorithm requires only 610 relevant genes to S9 on [30]. Surprisingly, removing highly correlated genes achieve the same prediction accuracy. If we are willing to does not produce any considerable improvement in predic- trade off prediction accuracy with the number of relevant tion accuracy and does not drastically reduce the number of genes, EWUSC achieves 89% prediction accuracy with only relevant genes. This is probably due to the fact that the 89 relevant genes. numbers of classification errors on the cross-validation data are not well correlated with those on the test set (see [30]). Because the test set is an additional independent dataset, there might be some heterogeneity between the training and Results on the breast cancer data We applied the EWUSC, USC and SC algorithms to the breast test sets. 
Nevertheless, USC achieves comparable prediction cancer data, and compared the prediction accuracy of the accuracy to SC using relatively fewer selected genes (under three algorithms at their optimal correlation thresholds (ρ = 100 genes) over different correlation thresholds ρ . 0 0 0.7 or 0.6), and the SC algorithm (USC at ρ = 1). The results are shown in Figure 5. In general, EWUSC produces higher We compared the feature stability of EWUSC, USC and SC at prediction accuracy than USC and SC when the number of their optimal correlation thresholds ρ in Figure 6. We relevant genes is less than 1,000 (which is the range of showed that EWUSC and SC produce relatively stable rele- interest). In particular, EWUSC produces fewer classification vant features than USC. The detailed comparison of feature errors on the test set at its optimal parameters (two errors at stability in terms of the average numbers of true/false posi- ∆ = 0.8 and ρ = 0.7) than USC at its optimal parameters (four tives/negatives are shown in Figures S12 and S13 on [30]. The errors at ∆ = 1.15 and ρ = 0.6). relatively high feature stability of SC is due to its relatively high true-positive rate (common genes chosen in both ran- Moreover, EWUSC generally selects relevant genes with rela- dom cross-validation and using the entire training set), and tively small error bars (or low p-values). For example, there its relatively low false-negative rate (genes chosen using the are two genes with p-values equal to 1 across all 78 samples in entire training set but not in the cross-validation data). How- the training set. In other words, we have very low confidence ever Figure S12 in [30] shows that this effect is drastic at high that the expression ratios of these two genes are changed in numbers of relevant genes and is relatively less significant at Genome Biology 2003, 4:R83 Prediction accuracy (%) Prediction accuracy (%) comment reviews reports deposited research refereed research interactions information http://genomebiology.com/2003/4/12/R83 Genome Biology 2003, Volume 4, Issue 12, Article R83 Yeung and Bumgarner R83.11 increased, the performance of EWUSC compared to USC Stability across different random cross-validation deteriorates. For example, EWUSC selects more relevant 0.7 genes than USC at low technical noise level but it selects fewer EWUSC (ρ = 0.7) 0.65 USC (ρ = 0.6) relevant genes than USC at α = 1 (with signal-to-noise ratio SC approximately 2). The relative performance of EWUSC is 0.6 even less favorable at high biological noise level (α = 2 with 0.55 signal-to-noise ratio roughly 1). The results in Table 5a sug- 0.5 gest that EWUSC is the method of choice when the classes are 0.45 relatively separable (at low biological noise and high signal- to-noise ratio), but USC would be the method of choice at 0.4 high biological noise. 0.35 0.3 1 2 3 In general, the performance of EWUSC increases as the 10 10 10 Number of genes (log scale) number of repeated measurements increases. In particular, we studied the effect of the number of repeated measurements on the relative performance of EWUSC, USC can Fi Compar gc uer re data 6 ison of feature stability of EWUSC, USC and SC on the breast Comparison of feature stability of EWUSC, USC and SC on the breast and SC at high biological noise (α = 2). The prediction cancer data. 
Table 5a,b shows that EWUSC produces lower prediction accuracy than USC at high biological noise when there are few repeated measurements. Since the levels of biological noise in real microarray datasets are not known, in practice we recommend that users of our algorithms compare the average numbers of errors on the cross-validation data and the numbers of relevant genes from the EWUSC and USC algorithms, and then select the algorithm that produces lower average cross-validation error using fewer relevant genes. In most cases, the prediction accuracy on the test set shows the same trend as the average number of cross-validation errors.

Interestingly, prediction accuracy is not necessarily reduced, nor is the number of relevant genes necessarily increased, at higher technical noise levels. However, prediction accuracy is generally reduced and the number of relevant genes is typically increased at higher biological noise levels (see Additional data files Tables S1, S2 and S3 at [30]). All three algorithms (EWUSC, USC and SC) produce comparable feature stability at different noise levels when the number of relevant genes is below 300 (see Figures S20, S21 at [30]).

Table 5
Comparison of classification accuracy results from EWUSC, USC and SC on synthetic datasets at optimal parameters

(a) Different biological and technical noise levels, four repeated measurements

| α   | λ    | Metric                   | EWUSC       | USC         | SC        |
|-----|------|--------------------------|-------------|-------------|-----------|
| 0.1 | Low  | Average % CV accuracy    | 100%        | 100%        | 100%      |
|     |      | % prediction accuracy    | 100%        | 100%        | 100%      |
|     |      | Number of genes          | 10          | 24          | 72        |
|     |      | (∆, ρ0)                  | (18, 0.8)   | (17, 0.7)   | (17.5, 1) |
| 0.1 | High | Average % CV accuracy    | 100%        | 100%        | 100%      |
|     |      | % prediction accuracy    | 100%        | 100%        | 100%      |
|     |      | Number of genes          | 8           | 16          | 22        |
|     |      | (∆, ρ0)                  | (12.5, 0.9) | (12.5, 0.9) | (12.5, 1) |
| 1   | Low  | Average % CV accuracy    | 100%        | 100%        | 100%      |
|     |      | % prediction accuracy    | 100%        | 100%        | 100%      |
|     |      | Number of genes          | 144         | 119         | 124       |
|     |      | (∆, ρ0)                  | (2.8, 0.5)  | (3.1, 0.6)  | (3.1, 1)  |
| 1   | High | Average % CV accuracy    | 100%        | 100%        | 100%      |
|     |      | % prediction accuracy    | 100%        | 100%        | 100%      |
|     |      | Number of genes          | 89          | 120         | 122       |
|     |      | (∆, ρ0)                  | (1.9, 0.5)  | (2.6, 0.6)  | (2.6, 1)  |
| 2   | Low  | Average % CV accuracy    | 96.8%       | 99.0%       | 98.8%     |
|     |      | % prediction accuracy    | 97.5%       | 100.0%      | 100.0%    |
|     |      | Number of genes          | 270         | 326         | 326       |
|     |      | (∆, ρ0)                  | (1.1, 0.5)  | (1, 0.4)    | (1.2, 1)  |
| 2   | High | Average % CV accuracy    | 93.3%       | 98.8%       | 99.0%     |
|     |      | % prediction accuracy    | 92.5%       | 97.5%       | 97.5%     |
|     |      | Number of genes          | 186         | 159         | 159       |
|     |      | (∆, ρ0)                  | (1, 0.7)    | (1.5, 0.5)  | (1.5, 1)  |

(b) High biological noise (α = 2), different numbers of repeated measurements R

| R  | λ    | Metric                   | EWUSC      | USC        | SC       |
|----|------|--------------------------|------------|------------|----------|
| 1  | Low  | Average % CV accuracy    | NA         | 99.5%      | 99.5%    |
|    |      | % prediction accuracy    | NA         | 100.0%     | 100.0%   |
|    |      | Number of genes          | NA         | 285        | 304      |
|    |      | (∆, ρ0)                  | NA         | (1.2, 0.5) | (1.2, 1) |
| 1  | High | Average % CV accuracy    | NA         | 96.5%      | 95.5%    |
|    |      | % prediction accuracy    | NA         | 92.5%      | 92.5%    |
|    |      | Number of genes          | NA         | 258        | 282      |
|    |      | (∆, ρ0)                  | NA         | (1.2, 0.5) | (1.2, 1) |
| 8  | Low  | Average % CV accuracy    | 99.8%      | 100.0%     | 100.0%   |
|    |      | % prediction accuracy    | 100.0%     | 100.0%     | 100.0%   |
|    |      | Number of genes          | 246        | 220        | 221      |
|    |      | (∆, ρ0)                  | (1.3, 0.5) | (1.4, 0.5) | (1.4, 1) |
| 8  | High | Average % CV accuracy    | 98.3%      | 99.0%      | 99.0%    |
|    |      | % prediction accuracy    | 97.5%      | 100.0%     | 100.0%   |
|    |      | Number of genes          | 171        | 242        | 245      |
|    |      | (∆, ρ0)                  | (1, 0.4)   | (1.3, 0.5) | (1.3, 1) |
| 20 | Low  | Average % CV accuracy    | 99.8%      | 100.0%     | 100.0%   |
|    |      | % prediction accuracy    | 100.0%     | 100.0%     | 100.0%   |
|    |      | Number of genes          | 226        | 296        | 325      |
|    |      | (∆, ρ0)                  | (1.3, 0.5) | (1.2, 0.6) | (1.2, 1) |
| 20 | High | Average % CV accuracy    | 99.8%      | 100.0%     | 100.0%   |
|    |      | % prediction accuracy    | 100.0%     | 100.0%     | 100.0%   |
|    |      | Number of genes          | 221        | 252        | 252      |
|    |      | (∆, ρ0)                  | (0.9, 0.6) | (1.3, 0.5) | (1.3, 1) |

Synthetic datasets were generated at different levels of biological noise (α) and technical noise (λ). Shown are the average percentage of cross-validation (% CV) accuracy, the percentage of prediction accuracy on the test set, and the number of relevant genes at the optimal parameters (∆, ρ0). For each synthetic dataset, the algorithm with the maximum average cross-validation accuracy, the maximum prediction accuracy, or the minimum number of relevant genes was highlighted in boldface in the original table. (a) Typical classification accuracy results using synthetic datasets with four repeated measurements at different biological noise levels (α = 0.1, 1 or 2) and different technical noise levels (λ = 1, 5 or 10). When the biological noise level is low (α = 0.1), EWUSC consistently achieves the same prediction accuracy using fewer relevant genes at various technical noise levels. At medium biological noise level (α = 1), EWUSC typically outperforms USC and SC at high technical noise level but not at low technical noise level. When the biological noise level is high (α = 2), EWUSC is often not the method of choice. (b) Typical classification accuracy results using synthetic datasets at high biological noise level (α = 2) with 1, 8 or 20 repeated measurements at different technical noise levels. When there are no repeated measurements (R = 1), there are no variability estimates over repeated measurements and hence EWUSC reduces to USC. The results with four repeated measurements at α = 2 are shown in (a). Our results over multiple synthetic datasets showed that EWUSC only outperforms USC with a large number of repeated measurements (20) at high biological noise (α = 2). We also showed that USC typically outperforms SC by choosing a smaller number of relevant genes in most scenarios (over different biological and technical noise levels and different numbers of repeated measurements).

Summary of results on real data
Table 6 summarizes our prediction accuracy results using the EWUSC, USC and SC algorithms on the NCI 60 data, multiple tumor data and breast cancer data at optimal parameters. In general, we showed that using variability over repeated measurements to down-weight noisy genes/experiments, together with the removal of highly correlated genes, usually reduces the number of relevant genes necessary for accurate class prediction. In addition, using variability of repeated measurements to down-weight noisy genes/experiments generally increases feature stability. Hence, our EWUSC and USC algorithms represent advances over the published SC algorithm [17].

On the NCI 60 data, USC generally produces higher prediction accuracy than SC using the same number of relevant genes. This result shows that the removal of highly correlated genes reduces the number of genes necessary for class prediction while achieving comparable or higher prediction accuracy.

On the multiple tumor data, EWUSC produces higher prediction accuracy and selects fewer relevant genes than all other approaches. In particular, EWUSC achieves 93% prediction accuracy using less than 10% of the genes, compared to 78% prediction accuracy using all the available genes in the published results [10]. Each of the binary SVM classifiers chooses a different subset of relevant genes, whereas our EWUSC algorithm uses a single set of relevant genes for all classes.

van't Veer et al. [14] reported two classification errors using 70 relevant genes on the test set of the breast cancer data (out of a total of 19 samples). Our EWUSC produces the same number of errors on the test set with 271 relevant genes. However, our EWUSC algorithm has the following advantages over the prognostic classifier used in [14]: no ad hoc filtering step is necessary, because the EWUSC algorithm automatically avoids choosing noisy genes; and the EWUSC algorithm can be applied to data with any number of classes, in contrast to the prognostic classifier, which is not applicable to the multiple tumor data (11 classes) or the NCI 60 data (8 classes).

Comparison of USC, EWUSC and SC algorithms
The key characteristics of EWUSC, USC and SC are summarized in Table 7. We illustrated the EWUSC and USC algorithms on both real and synthetic datasets; our results on real data are summarized in Table 6. We compared the performance of USC with SC, and showed that USC typically achieves comparable prediction accuracy using a smaller set of relevant genes on both real and synthetic datasets. We showed that the step of removing highly correlated genes in USC is effective in reducing the number of relevant genes without sacrificing prediction accuracy; hence, USC is an improvement over SC.
We showed that the step of removing highly correlated genes in On the multiple tumor data, EWUSC has the following advan- USC is effective in reducing the number of relevant genes tages over other methods: EWUSC produces higher predic- without sacrificing prediction accuracy, and hence, USC is an tion accuracy and selects fewer relevant genes than all other improvement over SC. approaches. In particular, EWUSC achieves 93% of predic- tion accuracy using less than 10% of the genes compared to We also compared the performance of EWUSC (which down- 78% of prediction accuracy using all the available genes in the weights noisy genes and noisy experiments) with USC on both published results [10]. Each of the binary SVM classifiers real and synthetic datasets. On real microarray datasets (mul- chooses a different subset of relevant genes while our EWUSC tiple tumor data and breast cancer data), we showed that algorithm uses only one set of relevant genes for all classes. EWUSC usually achieves higher or comparable feature stability using a smaller set of relevant genes, and EWUSC van't Veer et al. [14] reported two classification errors using avoids choosing noisy relevant genes for classification of sam- 70 relevant genes on the test set of the breast cancer data (out ples. Hence, we showed that using variability over repeated of a total of 19 samples). Our EWUSC produces the same measurements improves classification and feature-selection number of errors on the test set with 271 relevant genes. How- results. Moreover, we compared EWUSC with other existing ever, our EWUSC algorithm has the following advantages classification and feature-selection algorithms, and showed over the prognostic classifier used in [14]. No ad hoc filtering that EWUSC produces better or at least comparable results step is necessary. The EWUSC algorithm automatically than previously reported results on real datasets (see Table Genome Biology 2003, 4:R83 R83.14 Genome Biology 2003, Volume 4, Issue 12, Article R83 Yeung and Bumgarner http://genomebiology.com/2003/4/12/R83 Table 6 Summary of prediction accuracy results Data Parameters EWUSC USC SC Published results NCI 60 data* ρ NA 0.6 1.0 NA ∆ NA 1.0 1.0 NA Number of relevant genes NA 2,315 3998 200 Prediction accuracy NA 72% 72% ~40-60% [23] Multiple tumor data (estimated optimal parameters) ρ 0.8 0.8 1.0 NA ∆ 5.6 5.6 8.8 NA Number of relevant genes 680 735 3902 All genes Prediction accuracy 93% 85% 78% 78% [10] Multiple tumor data (global optimal parameters) ρ 0.9 0.9 1.0 NA ∆ 00 0.4 NA Number of relevant genes 1626 1634 7129 All genes Prediction accuracy 78% 74% 74% 78% [10] Breast cancer data ρ 0.7 0.6 1.0 NA ∆ 0.80 1.15 1.1 NA Number of relevant genes 271 82 187 70 Prediction accuracy 89% 79% 84% 89% [14] The optimal parameters (ρ and ∆), number of relevant genes chosen, and prediction accuracy for the NCI 60 data, multiple tumor data and breast cancer data are summarized here. Both EWUSC (error-weighted, uncorrelated shrunken centroid) and USC (uncorrelated shrunken centroid) were motivated by SC (shrunken centroid) [17]. Both EWUSC and USC take advantage of interdependence between genes by removing highly correlated relevant genes. EWUSC makes use of error estimates or variability over repeated measurements. SC [17] is equivalent to USC at ρ = 1. The optimal parameters (∆, ρ ) for EWUSC are estimated from the cross-validation results of EWUSC, while the optimal parameters (∆, ρ ) for USC are 0 0 independently estimated from the cross-validation results of USC. 
Entries with the minimum number of selected genes or the highest prediction accuracy across all methods were highlighted in boldface in the original table. *Since no repeated measurements or error estimates are available, EWUSC is not applicable to the NCI 60 data. In addition, there is no separate test set available for the NCI 60 data, so typical results of random partitions of the original 61 samples into training and test sets are shown. For the multiple tumor data, the 'estimated optimal parameters' rows use (∆, ρ0) estimated by visual observation of 'bends' in the random cross-validation curves, whereas the 'global optimal parameters' rows use the (∆, ρ0) that produce the minimum average numbers of cross-validation errors over all ∆ and all ρ0.

Our main contribution is that we use cross-validation to select a correlation threshold (ρ0) for the removal of highly correlated genes. This idea is adopted in both USC and EWUSC, which thereby take advantage of the interdependence of genes without sacrificing prediction accuracy. Our second major contribution is that we adopted the error-weighted method in our integrated feature-selection and classification algorithm, EWUSC. To the best of our knowledge, EWUSC is the only classification algorithm applicable to microarray data with any number of classes that takes advantage of variability in repeated measurements.

There are many directions for future work. The error-weighted idea can be applied to other distance-based classification algorithms, for example the k-nearest neighbour, which was shown to achieve high prediction accuracy [23]. Our next step is to compare the performance of the EWUSC and USC algorithms with a wide range of other classification and feature-selection algorithms. One problem in the literature is that researchers often use different pre-processed subsets of published array data, which makes direct comparison of published results difficult. Therefore, there is a need for a large-scale evaluation study of various classification and feature-selection algorithms on microarray data.

Details of algorithms

The SC algorithm of Tibshirani et al. [17]
Let $x_{ij}$ be the expression level for gene $i = 1, 2, \ldots, p$ and sample $j = 1, 2, \ldots, n$. Suppose there are a total of $K$ classes, and let $C_k$ be the set of the $n_k$ samples in class $k$. The overall centroid of gene $i$ is

$$\bar{x}_i = \sum_{j=1}^{n} x_{ij} / n,$$

and the class centroid of class $k$ and gene $i$ is

$$\bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k.$$

The relative difference $d_{ik}$ is the difference between the class centroid ($\bar{x}_{ik}$) and the overall centroid ($\bar{x}_i$), standardized by the within-class standard deviation of gene $i$ ($s_i$); that is,

$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k (s_i + s_0)},$$

where

$$m_k = \sqrt{1/n_k + 1/n}, \qquad s_i^2 = \frac{1}{n - K} \sum_{k} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2,$$

and $s_0$ is the median value of the $s_i$'s over all genes $i$. The relative difference $d_{ik}$ is similar to a t-statistic, comparing the class centroid to the overall centroid. The shrunken relative difference $d'_{ik}$ reduces $d_{ik}$ by an amount $\Delta$ if $|d_{ik}| > \Delta$, and otherwise sets $d'_{ik}$ to zero:

$$d'_{ik} = \begin{cases} \operatorname{sign}(d_{ik})\,(|d_{ik}| - \Delta) & \text{if } |d_{ik}| > \Delta, \\ 0 & \text{otherwise.} \end{cases}$$

Hence $d'_{ik}$ discards genes whose class centroids are not significantly different from the overall centroids. The amount of shrinkage $\Delta$ is determined by m-fold cross-validation such that the number of cross-validation classification errors is minimized. Genes with at least one positive shrunken relative difference $d'_{ik}$ (over all classes $k$) are selected as relevant features. The shrunken class centroid is defined as $\bar{x}'_{ik} = \bar{x}_i + m_k (s_i + s_0)\, d'_{ik}$. The discriminant score for a new sample $x^*$ and class $k$ is defined as

$$\delta_k(x^*) = \sum_{i=1}^{p} \frac{(x^*_i - \bar{x}'_{ik})^2}{(s_i + s_0)^2} - 2 \log \pi_k,$$

where $\pi_k = n_k / n$. The first term in the discriminant score represents the standardized squared distance of $x^*$ to the shrunken class centroid, and the second term represents a correction for the class prior probability. Sample $x^*$ is assigned to the class $k$ with the minimum discriminant score.
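To make the preceding definitions concrete, here is a minimal NumPy sketch of the SC computations: the centroids, the soft-thresholded relative difference d'_ik, and the discriminant score. This is our own illustration rather than the authors' code; all function and variable names are ours, and degenerate cases (such as empty classes) are not handled.

```python
import numpy as np

def sc_fit(X, y, delta):
    """Nearest shrunken centroid (SC) fit, following Tibshirani et al. [17].

    X: (p genes, n samples) expression matrix; y: class label per sample.
    Returns shrunken class centroids, the per-gene scale (s_i + s_0),
    class priors, and the class labels.
    """
    p, n = X.shape
    classes = np.unique(y)
    K = len(classes)
    overall = X.mean(axis=1)                                    # overall centroid
    cents = np.stack([X[:, y == k].mean(axis=1) for k in classes], axis=1)
    # pooled within-class standard deviation s_i over all classes
    ss = sum(((X[:, y == k] - cents[:, [i]]) ** 2).sum(axis=1)
             for i, k in enumerate(classes))
    s = np.sqrt(ss / (n - K))
    s0 = np.median(s)                                           # median of the s_i's
    n_k = np.array([(y == k).sum() for k in classes])
    m = np.sqrt(1.0 / n_k + 1.0 / n)                            # scaling factor m_k
    # relative difference d_ik and its soft-thresholded version d'_ik
    d = (cents - overall[:, None]) / (m[None, :] * (s + s0)[:, None])
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    shrunken = overall[:, None] + m[None, :] * (s + s0)[:, None] * d_shrunk
    prior = n_k / n
    return shrunken, s + s0, prior, classes

def sc_predict(x_new, shrunken, scale, prior, classes):
    """Assign x_new to the class with the minimum discriminant score."""
    dist = (((x_new[:, None] - shrunken) / scale[:, None]) ** 2).sum(axis=0)
    return classes[np.argmin(dist - 2.0 * np.log(prior))]
```

Note that a gene whose d'_ik is zero for every class keeps the overall centroid in all K shrunken centroids, so it contributes equally to every class score and has no influence on the prediction; this is how feature selection falls out of the shrinkage step.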
Table 7
Summary of EWUSC, USC and SC

| Desirable features                                                 | EWUSC | USC | SC |
|--------------------------------------------------------------------|-------|-----|----|
| Makes use of variability over repeated measurements                | +     |     |    |
| Applicable to data with any number of classes                      | +     | +   | +  |
| Exploits dependence relationships between genes                    | +     | +   |    |
| Integrated approach for both feature selection and classification  | +     | +   | +  |
| No assumption on data distributions                                | +     | +   | +  |

Our EWUSC algorithm

Mathematical definitions
The EWUSC algorithm is a modification of the SC algorithm with two key differences: noisy measurements are down-weighted and redundant genes (features) are removed. Let $\sigma_{ij}$ be the variability estimate of gene $i$ and sample $j$ over repeated measurements, where $i = 1, 2, \ldots, p$ and $j = 1, 2, \ldots, n$. The weighted overall centroid for gene $i$ is defined as

$$\tilde{x}_i = \sum_{j=1}^{n} \frac{x_{ij}}{\sigma_{ij}} \bigg/ \sum_{j=1}^{n} \frac{1}{\sigma_{ij}},$$

and the weighted class centroid for gene $i$ and class $k$ is

$$\tilde{x}_{ik} = \sum_{j \in C_k} \frac{x_{ij}}{\sigma_{ij}} \bigg/ \sum_{j \in C_k} \frac{1}{\sigma_{ij}}.$$

Noisy measurements with a large variability estimate $\sigma_{ij}$ are down-weighted in the weighted overall and class centroids. The weighted relative difference is defined as

$$\tilde{d}_{ik} = \frac{\tilde{x}_{ik} - \tilde{x}_i}{m_k (s_i + s_0 + \omega_i)},$$

where the weighted within-class standard deviation is

$$s_i^2 = \frac{\sum_{k=1}^{K} \sum_{j \in C_k} (x_{ij} - \tilde{x}_{ik})^2 / (\sigma_{ij}\, \tilde{\sigma}_{ik})}{\sum_{k=1}^{K} \sum_{j \in C_k} 1 / (\sigma_{ij}\, \tilde{\sigma}_{ik})},$$

the average variability estimate for class $k$ is

$$\tilde{\sigma}_{ik} = \sum_{j \in C_k} \sigma_{ij} / n_k,$$

and the scaling factor is

$$m_k = \sqrt{ \frac{\sum_{j \in C_k} 1/\sigma_{ij}^2}{\left( \sum_{j \in C_k} 1/\sigma_{ij} \right)^2} + \frac{\sum_{j=1}^{n} 1/\sigma_{ij}^2}{\left( \sum_{j=1}^{n} 1/\sigma_{ij} \right)^2} }.$$

Here $s_0$ is the median of the $s_i$'s over all genes $i$, and $\omega_i$ is the median variability estimate for gene $i$ across all $n$ experiments. When the variability estimates are equal for all samples, that is, $\sigma_{ij} = \sigma_i$ for $j = 1, 2, \ldots, n$, the above definitions of $\tilde{x}_i$, $\tilde{x}_{ik}$, $s_i$ and $m_k$ simplify to the corresponding formulae of the SC algorithm. The intuition behind these error-weighted definitions is that noisy samples with large variability estimates $\sigma_{ij}$ are down-weighted. The median variability for gene $i$ ($\omega_i$) in the denominator of the weighted relative difference $\tilde{d}_{ik}$ down-weights noisy genes, so that genes with large variability over all samples are less likely to be selected as relevant genes. The weighted shrunken relative difference $\tilde{d}'_{ik}$ is defined very similarly to $d'_{ik}$:

$$\tilde{d}'_{ik} = \begin{cases} \operatorname{sign}(\tilde{d}_{ik})\,(|\tilde{d}_{ik}| - \Delta) & \text{if } |\tilde{d}_{ik}| > \Delta, \\ 0 & \text{otherwise,} \end{cases}$$

where the amount of shrinkage $\Delta$ is determined by cross-validation. Similarly, the weighted shrunken centroid is defined as $\tilde{x}'_{ik} = \tilde{x}_i + m_k (s_i + s_0 + \omega_i)\, \tilde{d}'_{ik}$, and the weighted discriminant score for a new sample $x^*$ with variability estimates $\sigma^*$ and class $k$ is

$$\tilde{\delta}_k(x^*) = \sum_{i=1}^{p} \frac{(x^*_i - \tilde{x}'_{ik})^2}{(\sigma^*_i)^2 + (s_i + s_0 + \omega_i)^2} - 2 \log \pi_k.$$
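As a sketch of how the error weighting enters in practice, the snippet below computes the weighted overall and class centroids and the per-gene median variability ω_i, using inverse-variability weights 1/σ_ij as in the reconstructed definitions above. The names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def weighted_centroids(X, Sigma, y):
    """Error-weighted centroids for EWUSC.

    X, Sigma: (p, n) arrays of expression values and their variability
    estimates sigma_ij over repeated measurements. Each measurement is
    weighted by 1/sigma_ij, so noisy measurements are down-weighted.
    """
    W = 1.0 / Sigma                                       # per-measurement weight
    overall = (W * X).sum(axis=1) / W.sum(axis=1)         # weighted overall centroid
    classes = np.unique(y)
    class_cents = np.stack(
        [(W[:, y == k] * X[:, y == k]).sum(axis=1) / W[:, y == k].sum(axis=1)
         for k in classes], axis=1)                       # weighted class centroids
    # omega_i: median variability of gene i across all samples; it appears in
    # the denominator of the weighted relative difference and penalizes genes
    # that are noisy everywhere.
    omega = np.median(Sigma, axis=1)
    return overall, class_cents, omega
```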
Error-weighted correlation
Hughes et al. [29] defined an error-weighted correlation that weighs expression values by their error estimates, so that expression values with relatively high errors are down-weighted. Let $\sigma_{ge}$ be the error estimate of the expression level $D(g,e)$ of gene $g$ under experiment $e$, where $g = 1, \ldots, p$ and $e = 1, \ldots, E$. The error-weighted correlation between a pair of genes $i$ and $j$ is defined as

$$\rho_{ij} = \frac{\displaystyle\sum_{e=1}^{E} \left( \frac{D(i,e) - \mu_i}{\sigma_{ie}} \right) \left( \frac{D(j,e) - \mu_j}{\sigma_{je}} \right)}{\sqrt{\displaystyle\sum_{e=1}^{E} \left( \frac{D(i,e) - \mu_i}{\sigma_{ie}} \right)^2 \sum_{e=1}^{E} \left( \frac{D(j,e) - \mu_j}{\sigma_{je}} \right)^2}},$$

where

$$\mu_i = \sum_{e=1}^{E} \frac{D(i,e)}{\sigma_{ie}} \bigg/ \sum_{e=1}^{E} \frac{1}{\sigma_{ie}}$$

is the weighted average expression level of gene $i$.

Algorithm outline for EWUSC
Inputs to the algorithm: a training set (with known classes) and a test set.

1. For each gene $i$ and each class $k$, compute $\tilde{x}_i$, $\tilde{x}_{ik}$, $s_i$ and $\tilde{d}_{ik}$ using the training set.
2. For each $\Delta$:
   a. Compute $\tilde{d}'_{ik}$ for each gene $i$ and class $k$.
   b. For each gene $i$, denote the maximum shrunken relative difference over all $K$ classes by $\beta_i = \max_k |\tilde{d}'_{ik}|$.
   c. Let $S_\Delta$ be the set of genes with at least one positive shrunken relative difference over all $K$ classes; that is, $S_\Delta = \{g : \beta_g > 0\}$.
   d. Sort the genes $g$ in $S_\Delta$ in descending order of $\beta_g$, giving the sorted set $G = \{g_1, g_2, \ldots, g_t\}$.
   e. For each $\rho_0 = 1, 0.9, 0.8, \ldots, 0.1, 0$:
      - Consider all pairs of genes $(g_i, g_j)$ in $G$ with $i < j$ (that is, $\beta_{g_i} > \beta_{g_j}$). Compute the error-weighted correlation $\rho$ between $g_i$ and $g_j$; if $\rho \geq \rho_0$, remove gene $g_j$ from $S_\Delta$.
      - Let $S(\Delta, \rho_0)$ be the set of genes left in $S_\Delta$ after removing the highly correlated genes.
      - Apply the discriminant score to predict the classes of samples in the test set using the relevant genes in $S(\Delta, \rho_0)$.

Output of the algorithm: a predicted class for each sample in the test set, for each $\Delta$ and each $\rho_0$.

The above algorithm is applied to the m-fold cross-validation data to determine the optimal parameters $\Delta$ and $\rho_0$ that minimize the number of classification errors on the training set. The optimal parameters are then used to predict the classes of the unknown samples in the test set.
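Below is a compact sketch of the correlation-based removal inside step 2e of the outline, together with the error-weighted correlation of Hughes et al. [29] that it relies on. This is an illustrative reading of the outline under the reconstruction above, assuming the β values have already been computed; the helper names are hypothetical.

```python
import numpy as np

def ew_correlation(x, y, sx, sy):
    """Error-weighted correlation of two genes (Hughes et al. [29]):
    deviations from the weighted mean are standardized by the error
    estimates before computing a Pearson-style correlation."""
    wx, wy = 1.0 / sx, 1.0 / sy
    mx = (wx * x).sum() / wx.sum()            # weighted mean of gene x
    my = (wy * y).sum() / wy.sum()            # weighted mean of gene y
    dx, dy = (x - mx) / sx, (y - my) / sy
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

def remove_correlated(genes, beta, X, Sigma, rho0):
    """Walk the relevant genes in descending beta order and drop any later
    gene whose error-weighted correlation with a kept gene reaches rho0.
    `genes` is the relevant set S_Delta; the result is S(Delta, rho0)."""
    order = sorted(genes, key=lambda g: -beta[g])
    kept = []
    for g in order:
        if all(ew_correlation(X[g], X[h], Sigma[g], Sigma[h]) < rho0
               for h in kept):
            kept.append(g)
    return kept
```

Because genes are visited in descending order of β, a highly correlated pair always retains the member with the larger shrunken relative difference, which matches the pairwise rule (i < j implies β_gi > β_gj) stated in the outline.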
The Jaccard index as a measure of feature stability
We define feature stability as the average level of agreement between the set of relevant genes chosen in a fold of the cross-validation data and the set of relevant genes chosen using the full training set, over all m folds of the cross-validation data. Let $S(\Delta, \rho_0)$ be the set of relevant genes chosen using the entire training set, and let $S(m, \Delta, \rho_0)$ be the set of relevant genes chosen in the $m$th fold of the cross-validation data with parameters $\Delta$ and $\rho_0$. We define the number of true positives (TP) as the number of relevant genes chosen in both $S(\Delta, \rho_0)$ and $S(m, \Delta, \rho_0)$; the number of false positives (FP) as the number of relevant genes chosen in $S(m, \Delta, \rho_0)$ but not in $S(\Delta, \rho_0)$; and the number of false negatives (FN) as the number of relevant genes chosen in $S(\Delta, \rho_0)$ but not in $S(m, \Delta, \rho_0)$. The Jaccard index $J(m, \Delta, \rho_0)$ is defined as TP/(TP + FP + FN). Intuitively, the level of agreement is high when there are many true positives and relatively few false positives and false negatives; hence, a high Jaccard index indicates a high level of agreement. Feature stability is the average Jaccard index over all m folds; that is, $J(\Delta, \rho_0)$ is the average of $J(m, \Delta, \rho_0)$ over all $m$ folds.
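The feature-stability computation is a direct set calculation; a short sketch with illustrative names follows. Note that TP + FP + FN is simply the size of the union of the two gene sets, which is why this quantity is the standard Jaccard index.

```python
def jaccard_stability(full_set, cv_sets):
    """Average Jaccard index between the relevant genes chosen on the full
    training set and those chosen in each cross-validation fold."""
    full = set(full_set)
    scores = []
    for fold_genes in cv_sets:
        fold = set(fold_genes)
        tp = len(full & fold)                 # chosen in both
        fp = len(fold - full)                 # chosen in the fold only
        fn = len(full - fold)                 # chosen on the full set only
        scores.append(tp / (tp + fp + fn))    # Jaccard index J(m, Delta, rho0)
    return sum(scores) / len(scores)          # feature stability J(Delta, rho0)
```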
Support vector machines (SVMs)
The basic idea behind an SVM [33] is to map data points to a high-dimensional space in which they become linearly separable. The SVM avoids explicit computation in the high-dimensional space through kernel functions, which allow the computations to be carried out in the input space. There are many different types of kernel functions, with different effects; Brown et al. [34] showed that radial kernel functions work very well in classifying genes on array data.

We augmented the SVM implementation of Noble et al. [35] to incorporate the signal-to-noise (S2N) measure for feature selection. The S2N measure is defined as the difference of the means of the two classes divided by the sum of the standard deviations of the two classes. Because we adopt the one-versus-all approach [10,36] to combine the binary SVM classifiers, each binary classifier distinguishes samples of a given class from samples of all the other classes. The multiple tumor dataset consists of 11 classes (see Table 3 for details), so there is a total of 11 binary SVM classifiers for these data. We applied the S2N measure to select a given number of relevant genes on the four-fold cross-validation data using a binary SVM classifier (with a radial kernel function). We then combined the results from each of the 11 SVMs by assigning each sample to the class of the classifier with the maximum discriminant value. This process is repeated for each of the five random four-fold splits of the training set. The results on the cross-validation data are shown in Figure S7(a) on [30], in which the average number of classification errors is plotted against the number of relevant genes chosen. The next step is to apply this process to the entire training set and use the selected genes to classify the samples in the test set. The results on the test set are shown in Figure S7(b) on [30], in which the number of classification errors on the test set is plotted against the number of relevant genes chosen.

Error model in the breast cancer data
The log ratios and their associated p-values are available for the breast cancer data. The p-values are confidence measures that expression ratios are significantly different from 1. Using the error model documented in the 'Error Model' supplement of Hughes et al. [29], we converted the p-values into error estimates. Assuming the distribution of error magnitudes can be approximated by the normal distribution, significance values (p-values) can be derived from the Gaussian error function of the ratio of an observed log expression ratio to its error estimate [29]. The p-value $p$ for an observed log ratio $r$ is related to the error estimate $\sigma$ of the observed log ratio by $p = 2\,(1 - \operatorname{Erf}(|X|))$, where $X = r/\sigma$ and Erf is the Gaussian error function. Hence, the error estimates of the log expression ratios can be derived from the p-values. However, when a p-value equals 1, the error estimate is arbitrarily large, so we ignored the corresponding expression ratio in our EWUSC algorithm whenever its p-value equals 1.

Synthetic data
The synthetic training sets consist of 1,000 genes, 80 samples and four classes, with 20 samples in each class; the synthetic test sets consist of 1,000 genes and 40 samples, with 10 samples in each class. Two parameters control the noise levels in the synthetic datasets: the biological noise level (α) and the technical noise level (λ). Let P be the matrix of patterns with 64 rows and 4 columns, such that entry P[i,j] is the ith pattern of class j (i = 1, 2, ..., 64; j = 1, 2, 3, 4). Table 8 shows the pattern matrix P used to generate the synthetic datasets in our study. Let X(i,j) be the true expression value of gene i under experiment j before technical noise is added, and let Y(i,j,r) be the rth measured expression value of gene i under experiment j, where i = 1, 2, ..., p; j = 1, 2, ..., n; and r = 1, 2, ..., R. Suppose gene i is generated from the mth patterned gene, which belongs to class k. Then X(i,j) is generated from the random normal distribution with mean P[m,k] and standard deviation α. Technical noise is randomly sampled from a real dataset: four hybridizations were repeated on the yeast galactose data [32], and the standard deviation of each gene under each experiment is adopted as our estimated technical noise. Let ε be the randomly sampled technical noise (standard deviation over four repeated measurements) from the yeast galactose data [32]. Y(i,j,r) is generated from the random normal distribution with mean X(i,j) and standard deviation ελ; hence, a high technical noise level λ indicates noisy repeated measurements. Moreover, there are five duplicates of each of the 64 patterned genes, giving a total of 320 patterned genes; each of the five duplicates is generated using the same row of the pattern matrix P.

For non-patterned genes, we randomly sample from the breast cancer data such that these genes do not exhibit any class-specific expression pattern. Specifically, let q be a non-patterned gene, and suppose we randomly sample a gene g and experiment e from the breast cancer data, where E[g,e] is the expression ratio and s[g,e] is the error estimate of gene g and experiment e in the breast cancer data. Y(q,j) is generated from a random normal distribution with mean E[g,e] and standard deviation s[g,e] for sample j in the synthetic training or test set. Note that all expression values of the non-patterned gene q are sampled from the same (randomly chosen) gene g in the breast cancer data. Because experiment e is sampled independently for each sample j, any class-specific expression pattern in the original breast cancer data is destroyed.

Both the synthetic training and test sets are generated using the model described above. In our experiments, we set p = 1,000; α = 0.1, 1 or 2; and λ = 1 (low technical noise) or 10 (high technical noise), with R = 1, 4 or 20 repeated measurements. We also experimented with synthetic datasets with a higher fraction of non-patterned genes and showed that these larger datasets produce similar results (data not shown).
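The generative model for the patterned genes can be summarized in a few lines of NumPy. This sketch follows the description above (biological noise α around the pattern value, technical noise λ·ε around the true value); the sampled technical standard deviations are passed in as an array standing in for the yeast galactose estimates [32], and all names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_patterned(P, alpha, lam, tech_sd, n_per_class, R):
    """Generate repeated measurements for the patterned genes.

    P: (genes, classes) pattern matrix of 1/-1/0 entries (rows of Table 8,
    each duplicated five times in the paper). alpha is the biological noise
    level, lam the technical noise level, and tech_sd an array of technical
    standard deviations sampled from real repeated hybridizations.
    """
    G, K = P.shape
    n = K * n_per_class
    labels = np.repeat(np.arange(K), n_per_class)
    # true expression X(i, j): biological noise around the class pattern
    X = rng.normal(loc=P[:, labels], scale=alpha)               # (G, n)
    # measured values Y(i, j, r): technical noise epsilon*lam around X(i, j)
    eps = rng.choice(tech_sd, size=(G, n))                      # sampled noise sd
    Y = rng.normal(loc=X[..., None], scale=(lam * eps)[..., None],
                   size=(G, n, R))                              # (G, n, R)
    return Y, labels
```

Calling this with alpha = 2, lam = 10 and R = 20, for instance, would correspond to the hardest setting explored in Table 5b (high biological and technical noise, 20 repeated measurements).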
We also experimented with synthetic datasets with a Pattern matrix for synthetic data higher fraction of non-patterned genes and showed that these larger datasets produce similar results (data not shown). Class 1 Class 2 Class 3 Class 4 1 000 -1 000 Acknowledgements We would like to thank Sridhar Ramaswamy for providing us with the raw 0 100 (.cel) files for the multiple tumor data. We also thank Jane Fridlyand for the 0-1 0 0 processed NCI 60 dataset. We would also like to acknowledge the publicly available BioConductor project [28] and GIST [35]. We would like to thank 0 010 William Noble for general discussions and Mette Peters for her suggestions 00 -1 0 on this writeup. This work was supported by NIH-NIDDK grant 0 001 5U24DK058813-02. 00 0 -1 1 100 References -1 -1 0 0 1. Lennon GG, Lehrach H: Hybridization analyses of arrayed 1-1 0 0 cDNA libraries. Trends Genet 1991, 7:314-317. 2. Nguyen C, Rocha D, Granjeaud S, Baldit M, Bernard K, Naquet P, Jor- -1 100 dan BR: Differential gene expression in the murine thymus 1 010 assayed by quantitative hybridization of arrayed cDNA clones. Genomics 1995, 29:207-216. -1 0 -1 0 3. Pietu G, Alibert O, Guichard V, Lamy B, Bois F, Leroy E, Mariage- 10 -1 0 Sampson R, Houlgatte R, Soularue P, Auffray C: Novel gene transcripts preferentially expressed in human muscles -1 010 revealed by quantitative hybridization of a high density 1 001 cDNA array. Genome Res 1996, 6:492-503. -1 0 0 -1 4. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitor- ing of gene expression patterns with a complementary DNA 10 0 -1 microarray. Science 1995, 270:467-470. -1 001 5. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H et al.: Expression 0 110 monitoring by hybridization to high-density oligonucleotide 0-1 -1 0 arrays. Nat Biotechnol 1996, 14:1675-1680. 6. Pease AC, Solas D, Sullivan EJ, Cronin MT, Holmes CP, Fodor SP: 01 -1 0 Light-generated oligonucleotide arrays for rapid DNA 0-1 1 0 sequence analysis. Proc Natl Acad Sci USA 1994, 91:5022-5026. 7. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine 0 101 AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligo- Each row represents a pattern, and each column represents a class nucleotide arrays. Proc Natl Acad Sci USA 1999, 96:6745-6750. such that entry P(i, j) is the ith pattern of class j. An entry of 1 means 8. Schummer M, Ng WV, Bumgarner RE, Nelson PS, Schummer B, Bed- upregulated while an entry of -1 means downregulated. For example, narski DW, Hassell L, Baldwin RL, Karlan BY, Hood L: Comparative the first row indicates that a patterned gene is upregulated in class 1 hybridization of an array of 21500 ovarian cDNAs for the dis- compared to all the other three classes. covery of genes overexpressed in ovarian carcinomas. Gene 1999, 238:375-385. 9. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531-537. Specifically, let q be a non-patterned gene. Suppose we ran- 10. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo domly sample a gene g and experiment e from the breast can- M, Ladd C, Reich M, Latulippe E, Mesirov JP et al.: Multiclass cancer cer data such that E[g,e] is the expression ratio of gene g and diagnosis using tumor gene expression signatures. 
Proc Natl Acad Sci USA 2001, 98:15149-15154. experiment e and s[g,e] is the error estimate of gene g and 11. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, experiment e in the breast cancer data. Y(q,j) is generated Boldrick JC, Sabet H, Tran T, Yu X et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression from a random normal distribution with mean E[g,e] and profiling. Nature 2000, 403:503-511. standard deviation s[g,e] for sample j in the synthetic training 12. Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, or test set. Note that all expression values of the non-pat- Jeffrey SS, Van de Rijn M, Waltham M et al.: Systematic variation in gene expression patterns in human cancer cell lines. Nat terned gene q are sampled from the same gene g (which is Genet 2000, 24:227-235. chosen randomly) from the breast cancer data. As experiment 13. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd e is independently sampled for each sample j, any class spe- C, Beheshti J, Bueno R, Gillette M et al.: Classification of human lung carcinomas by mRNA expression profiling reveals dis- cific expression pattern in the original breast cancer data tinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 2001, would be destroyed. 98:13790-13795. 14. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT et al.: Gene Both the synthetic training and test sets are generated using expression profiling predicts clinical outcome of breast the same model described above. In our experiments, we set cancer. Nature 2002, 415:530-536. 15. Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, Ladd C, p = 1000, α = 0.1, 1 or 2, and λ = 1 (low technical noise) or 10 Pohl U, Hartmann C, McLaughlin ME, Batchelor TT et al.: Gene Genome Biology 2003, 4:R83 comment reviews reports deposited research refereed research interactions information http://genomebiology.com/2003/4/12/R83 Genome Biology 2003, Volume 4, Issue 12, Article R83 Yeung and Bumgarner R83.19 expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res 2003, 63:1602-1607. 16. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS et al.: Diffuse large B- cell lymphoma outcome prediction by gene-expression pro- filing and supervised machine learning. Nat Med 2002, 8:68-74. 17. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002, 99:6567-6572. 18. Lee MLT, Kuo FC, Whitmore GA, Sklar J: Importance of replica- tion in microarray gene expression studies: statistical meth- ods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci USA 2000, 97:9834-9839. 19. Smith MW, Yue ZN, Geiss GK, Sadovnikova NY, Carter VS, Boix L, Lazaro CA, Rosenberg GB, Bumgarner RE, Fausto N et al.: Identifi- cation of novel tumor markers in hepatitis C virus-associ- ated hepatocellular carcinoma. Cancer Res 2003, 63:859-864. 20. Van't Wout AB, Lehrman GK, Mikheeva SA, O'Keeffe GC, Katze MG, Bumgarner RE, Geiss GK, Mullins JI: Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4(+)-T-cell lines. J Virol 2003, 77:1392-1402. 21. Nguyen DV, Rocke DM: Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 2002, 18:39-50. 22. 
Nguyen DV, Rocke DM: Multi-class cancer classification via par- tial least squares with gene expression profiles. Bioinformatics 2002, 18:1216-1226. 23. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expres- sion data. J Am Stat Assoc 2002, 97:77-87. 24. Yeung KY, Medvedovic M, Bumgarner RE: Clustering gene- expression data with repeated measurements. Genome Biol 2003, 4:R34. 25. Dettling M, Buhlmann P: Supervised clustering of genes. Genome Biol 2002, 3:research0069.1-0069.15. 26. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summa- ries of high density oligonucleotide array probe level data. Biostatistics 2003, 4:249-264. 27. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31:e15. 28. BioConductor open source software for bioinformatics [http://www.bioconductor.org] 29. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD et al.: Functional discov- ery via a compendium of expression profiles. Cell 2000, 102:109-126. 30. Yeung KY and Bumgarner RE: Supplementary web site. [http:// expression.washington.edu/public]. 31. Jain AK and Dubes RC: Algorithms for Clustering Data Englewood Cliffs, NJ: Prentice Hall; 1988. 32. Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bum- garner RE, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systemically perturbed metabolic network. Science 2001, 292:929-934. 33. Vapnik VN: Statistical Learning Theory New York: Wiley; 1998. 34. Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M Jr, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 2000, 97:262-267. 35. GIST [http://microarray.cpmc.columbia.edu/gist] 36. Yeang CH, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin RM, Angelo M, Reich M, Lander E, Mesirov J, Golub T: Molecular classification of multiple tumor types. Bioinformatics 2001, 17(Suppl 1):S316-S322. Genome Biology 2003, 4:R83