Multi-step dimensionality reduction and semi-supervised graph-based tumor
classification using gene expression data
Jie Gui a,b,1,*, Shu-Lin Wang a,c,1, Ying-Ke Lei a,b,d
a Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China
b Department of Automation, University of Science and Technology of China, Hefei, Anhui 230026, China
c School of Computer and Communication, Hunan University, Changsha, Hunan 410082, China
d Electronic Engineering Institute, Hefei, Anhui 230037, China
1. Introduction
With the rapid development of large-scale, high-throughput
gene expression technology, it has become possible to diagnose and classify
different kinds of genetic diseases, especially cancers, with the help
of DNA microarray technologies [1,2]. This task has been
termed "class prediction" in the microarray literature [3].
Microarray experiments can monitor the expression
levels of thousands of genes in cells at the same time, which may
help us to understand the molecular variation among different
tumor subtypes and thereby to obtain better and more reliable
classification results.
Many new prediction, classification and clustering techniques
have been applied to the analysis of gene microarray data [3–10].
In particular, network-based approaches have been extensively
applied to analyze microarray data [11]. Moreover, many
supervised and unsupervised classifiers have been developed to
tackle the tumor classification problem. Unsupervised methods
include simultaneous gene clustering and gene subset
selection for tumor classification using a model selection criterion
[12], clustering algorithms based on optimization techniques [13]
and self-organizing maps [14]. Generally speaking, unsupervised
methods are not sensitive to the number of labeled samples
since they work on the whole dataset; nevertheless, the relationship
between clusters and classes is not guaranteed. On the other hand,
supervised methods such as artificial neural networks [15],
support vector machines (SVM) [16,17], multi-layer perceptrons
[18], rough set theory [19] and K-nearest neighbor (K-NN) [20]
have been successfully applied to tumor classification problems.
The main difficulty with all supervised methods is that the learning
process depends heavily on the training dataset. However, the
Artificial Intelligence in Medicine 50 (2010) 181–191
ARTICLE INFO
Article history:
Received 12 May 2009
Received in revised form 28 April 2010
Accepted 18 May 2010
Keywords:
Multi-step dimensionality reduction
Gene ranking
Discrete cosine transform
Principal component analysis
Semi-supervised learning
Microarray data analysis
Tumor diagnosis
ABSTRACT
Objective: Both supervised and unsupervised methods have been widely used to solve the
tumor classification problem based on gene expression profiles. This paper introduces a semi-supervised
graph-based method for tumor classification. Feature extraction plays a key role in tumor classification
based on gene expression profiles, and can greatly improve the performance of a classifier. In this paper
we propose a novel multi-step dimensionality reduction method for extracting tumor-related features.
Methods and materials: First, the Wilcoxon rank-sum test is used for gene selection. Then, gene ranking
and the discrete cosine transform are combined with principal component analysis for feature extraction.
Finally, the performance is evaluated by semi-supervised learning algorithms.
Results: To show the validity of the proposed method, we apply it to classify four tumor datasets
involving various human normal and tumor tissue samples. The experimental results show that the
proposed method is efficient and feasible. Compared with other methods, our method achieves
relatively higher prediction accuracy. In particular, the semi-supervised method is found to be superior to
support vector machines in classification performance.
Conclusions: The proposed approach can effectively improve the performance of tumor classification
based on gene expression profiles. This work is a meaningful attempt to explore and apply multi-step
dimensionality reduction and semi-supervised learning methods in the field of tumor classification.
Given the high classification accuracy achieved, there is considerable scope for applying multi-step
dimensionality reduction and semi-supervised learning methods to tumor classification.
© 2010 Elsevier B.V. All rights reserved.
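The three stages named in the abstract (rank-sum gene selection, DCT plus PCA feature extraction, graph-based semi-supervised classification) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how gene ranking, DCT and PCA are combined, nor which graph-based learner is used. The sketch therefore assumes that the DCT is applied to each sample's selected genes in rank order, that only the low-order coefficients are kept before PCA, and that labels are spread over a normalized RBF graph with a median-distance bandwidth. All function names, parameter values and the synthetic two-class data are hypothetical.

```python
import numpy as np

def rank_sum_z(x, y):
    """Wilcoxon rank-sum statistic of x vs y as a z-score (normal approximation)."""
    pooled = np.concatenate([x, y])
    ranks = np.empty(len(pooled))
    ranks[pooled.argsort()] = np.arange(1, len(pooled) + 1)
    for v in np.unique(pooled):            # tied values share their average rank
        tie = pooled == v
        ranks[tie] = ranks[tie].mean()
    n1, n2 = len(x), len(y)
    w = ranks[:n1].sum()                   # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)   # no tie correction
    return (w - mean) / sd

def select_genes(X, y, labeled, k):
    """Rank genes by |z| on the labeled samples only; return the top k, best first."""
    a = X[labeled][y[labeled] == 0]
    b = X[labeled][y[labeled] == 1]
    z = np.array([rank_sum_z(a[:, j], b[:, j]) for j in range(X.shape[1])])
    return np.argsort(-np.abs(z))[:k]

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0] /= np.sqrt(2.0)
    return D

def extract_features(X, genes, n_dct, n_pc):
    """Assumed combination: DCT over ranked genes, keep low-order terms, then PCA."""
    D = dct_matrix(len(genes))[:n_dct]
    Z = X[:, genes] @ D.T                  # samples x n_dct DCT coefficients
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:n_pc].T                # scores on the first n_pc components

def spread_labels(F, y, labeled, n_classes=2, alpha=0.9):
    """Label spreading on a symmetrically normalized RBF graph (an assumed learner)."""
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / np.median(d2[d2 > 0]))    # median-distance bandwidth heuristic
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    S = W / np.sqrt(np.outer(deg, deg))
    Y = np.zeros((len(F), n_classes))
    Y[labeled, y[labeled]] = 1.0               # only labeled samples seed the graph
    scores = np.linalg.solve(np.eye(len(F)) - alpha * S, Y)
    return scores.argmax(axis=1)

def demo(seed=0):
    """Synthetic data: 40 samples, 200 genes, the first 20 genes shifted in class 1."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], 20)
    X = rng.normal(size=(40, 200))
    X[y == 1, :20] += 4.0
    labeled = np.array([0, 1, 2, 3, 4, 20, 21, 22, 23, 24])
    genes = select_genes(X, y, labeled, k=20)
    F = extract_features(X, genes, n_dct=8, n_pc=3)
    pred = spread_labels(F, y, labeled)
    unlabeled = np.setdiff1d(np.arange(40), labeled)
    return (pred[unlabeled] == y[unlabeled]).mean()

if __name__ == "__main__":
    print("accuracy on unlabeled samples:", demo())
```

In this sketch, gene selection uses only the labeled samples, whereas the DCT, PCA and graph construction use all samples. That is one natural reading of a semi-supervised (transductive) setting, not something the abstract states.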
* Corresponding author at: Intelligent Computing Laboratory, Hefei Institute of
Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China.
Tel.: +86 551 5592751; fax: +86 551 5592751.
E-mail address: guijie@ustc.edu (J. Gui).
1 The first two authors are joint first authors.
0933-3657/$ – see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.artmed.2010.05.004