Multi-step dimensionality reduction and semi-supervised graph-based tumor
classification using gene expression data
, Shu-Lin Wang
, Ying-Ke Lei
Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China
Department of Automation, University of Science and Technology of China, Hefei, Anhui 230026, China
School of Computer and Communication, Hunan University, Changsha, Hunan 410082, China
Electronic Engineering Institute, Hefei, Anhui 230037, China
With the rapid development of large-scale, high-throughput
gene expression technology, it has become possible to diagnose and classify
different kinds of genetic diseases, especially cancers, with the help
of DNA microarray technologies [1,2]. This technique has been
termed “class prediction” in the microarray literature.
Microarray experiments can monitor the expression
levels of thousands of genes in cells simultaneously, which may
help us to understand the molecular variation among different
tumor subtypes and so obtain better and more reliable classification
results.
Many new prediction, classification and clustering techniques
have been applied to the analysis of gene microarray data [3–10].
In particular, network-based approaches have been extensively
applied to analyze microarray data. Moreover, many
supervised and unsupervised classifiers have been developed to
tackle the tumor classification problem. Unsupervised methods
include simultaneous gene clustering and gene subset
selection for tumor classification using a model selection criterion,
clustering algorithms based on optimization techniques,
and self-organizing maps. Generally speaking, unsupervised
methods are not sensitive to the number of labeled samples
because they work on the whole dataset; nevertheless, the relationship
between clusters and classes is not ensured. On the other hand,
supervised methods such as artificial neural networks,
support vector machines (SVM) [16,17], multi-layer perceptrons,
rough set theory and K-nearest neighbor (K-NN)
have been successfully applied to tumor classification problems.
The main difficulty with all supervised methods is that the learning
process depends heavily on the training dataset. However, the
Artificial Intelligence in Medicine 50 (2010) 181–191
Received 12 May 2009
Received in revised form 28 April 2010
Accepted 18 May 2010
Multi-step dimensionality reduction
Discrete cosine transform
Principal component analysis
Microarray data analysis
Both supervised and unsupervised methods have been widely used to solve the
tumor classification problem based on gene expression profiles. This paper introduces a semi-supervised
graph-based method for tumor classification. Feature extraction plays a key role in tumor classification
based on gene expression profiles, and can greatly improve the performance of a classifier. In this paper
we propose a novel multi-step dimensionality reduction method for extracting tumor-related features.
Methods and materials: First, the Wilcoxon rank-sum test is used for gene selection. Then gene ranking
and the discrete cosine transform are combined with principal component analysis for feature extraction.
Finally, the performance is evaluated with semi-supervised learning algorithms.
Results: To show the validity of the proposed method, we apply it to classify four tumor datasets
involving various human normal and tumor tissue samples. The experimental results show that the
proposed method is efficient and feasible. Compared with other methods, our method achieves
relatively higher prediction accuracy. In particular, the semi-supervised method is found to be superior to
support vector machines in classification performance.
Conclusions: The proposed approach can effectively improve the performance of tumor classification
based on gene expression profiles. This work is a meaningful attempt to explore and apply multi-step
dimensionality reduction and semi-supervised learning methods in the field of tumor classification.
Given the high classification accuracy, there is considerable room for applying multi-step
dimensionality reduction and semi-supervised learning methods to tumor classification.
© 2010 Elsevier B.V. All rights reserved.
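The first two steps of the feature-extraction pipeline summarized in the abstract (rank-sum gene selection, then a discrete cosine transform over the ranked genes) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy data, the use of the normal-approximation z statistic without tie correction, and the choice to keep the leading DCT coefficients are all assumptions made for illustration. The principal component analysis step and the semi-supervised classifier are omitted. Both classes are assumed to be non-empty.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_z(group_a, group_b):
    """Wilcoxon rank-sum z statistic (normal approximation, no tie correction)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2.0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (w - mean) / std

def select_genes(samples, labels, top_k):
    """Indices of the top_k genes by |z|, most discriminative first."""
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        scores.append(abs(rank_sum_z(a, b)))
    return sorted(range(n_genes), key=lambda g: -scores[g])[:top_k]

def dct2(x):
    """Unnormalized DCT-II of a sequence."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]

def reduce_sample(sample, gene_order, n_coeffs):
    """Reorder a sample by gene rank, then keep the leading DCT coefficients."""
    ranked = [sample[g] for g in gene_order]
    return dct2(ranked)[:n_coeffs]

# Hypothetical toy data: 6 samples x 4 genes, two classes (0 and 1).
toy_samples = [
    [1.0, 5.0, 9.0, 4.0],
    [2.0, 1.0, 8.0, 4.0],
    [3.0, 4.0, 7.0, 4.0],
    [7.0, 2.0, 3.0, 4.0],
    [8.0, 6.0, 2.0, 4.0],
    [9.0, 3.0, 1.0, 4.0],
]
toy_labels = [0, 0, 0, 1, 1, 1]

gene_order = select_genes(toy_samples, toy_labels, top_k=2)
features = [reduce_sample(s, gene_order, n_coeffs=2) for s in toy_samples]
print(gene_order)  # [0, 2]: the two genes that separate the classes perfectly
```

In a fuller version, the truncated DCT coefficients would be passed to principal component analysis, and the resulting features to a semi-supervised classifier. Those stages, and the parameter values used here, are not specified in this section.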
* Corresponding author at: Intelligent Computing Laboratory, Hefei Institute of
Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China.
Tel.: +86 551 5592751; fax: +86 551 5592751.
E-mail address: email@example.com (J. Gui).
The first two authors are joint first authors.