Molecular subtyping of cancer: current status and moving toward clinical applications

Abstract

Cancer is a collection of genetic diseases, with large phenotypic differences and genetic heterogeneity between different types of cancers and even within the same cancer type. Recent advances in genome-wide profiling provide an opportunity to investigate global molecular changes during the development and progression of cancer. Meanwhile, numerous statistical and machine learning algorithms have been designed for the processing and interpretation of high-throughput molecular data. Molecular subtyping studies have allowed the allocation of cancer into homogeneous groups that are considered to harbor similar molecular and clinical characteristics. Furthermore, this has helped researchers to identify both actionable targets for drug design as well as biomarkers for response prediction. In this review, we introduce five frequently applied techniques for generating molecular data, which are microarray, RNA sequencing, quantitative polymerase chain reaction, NanoString and tissue microarray. Commonly used molecular data for cancer subtyping and clinical applications are discussed. Next, we summarize a workflow for molecular subtyping of cancer, including data preprocessing, cluster analysis, supervised classification and subtype characterizations. Finally, we identify and describe four major challenges in the molecular subtyping of cancer that may preclude clinical implementation. We suggest that standardized methods should be established to help identify intrinsic subgroup signatures and build robust classifiers that pave the way toward stratified treatment of cancer patients.

Keywords: cancer, heterogeneity, subtyping, subtypes, challenges

Introduction

Cancer is a large group of genetic diseases that are currently classified by their primary site of origin, such as brain cancer, breast cancer and lung cancer.
However, not all cancers of an organ are the same, and genetic heterogeneity exists between and within cancers [1–6]. A major cause of this heterogeneity is genomic instability [7], which can act at the single-nucleotide level or at much larger scales [8]. This poses significant challenges to the efficacy of currently available targeted therapies and complicates the development of future treatment strategies [7]. Because of this, there is a great need to classify cancer into homogeneous groups that are associated with distinct molecular features and clinical outcomes and that allow the development of subgroup-specific therapies.

The traditional classification of cancer has been carried out by pathologists based on histological appearance and site of growth. This only partially reflects the true heterogeneous character of cancer. Recent advances in genome-wide profiling techniques [9, 10] have allowed researchers to generate large-scale genomic data and classify cancer into more homogeneous groups [11, 12]. Genomic data have been used in many cancer subtyping studies, including leukemia [13], lymphoma [14], nasopharyngeal carcinoma (NPC) [15], breast [16], lung [17], liver [18], pancreas [19], colon [20] and soft tissue sarcomas [21]. Various machine learning algorithms [22–26] have also been developed for better prediction of cancer subtypes. Molecular subtyping studies have allowed the classification of cancer into uniform groups that correlate better with clinical outcomes than the traditional classifications of cancer [27]. In summary, molecular classification can provide diagnostic, prognostic and therapeutic options for the treatment of cancers.

This review is organized as follows: in the ‘Molecular subtyping of cancer’ section, we first introduce two techniques that are used for cancer subtyping: microarray and RNA sequencing (RNA-Seq).
Next, we introduce three other frequently applied techniques for generating low- and medium-throughput molecular data: quantitative polymerase chain reaction (qPCR), NanoString and tissue microarray (TMA), and we discuss their applications in clinical tests. Subtype identifications and characterizations, the two important aspects of the subtyping process, are also discussed. In the ‘Moving toward clinical applications’ section, we illustrate potential clinical applications of cancer subtyping studies for diagnosis, prognosis, predicting therapy response and drug design. In the ‘Challenges’ section, we identify and describe four major challenges in the molecular subtyping of cancer that may preclude clinical implementation, and finally, in the ‘Conclusions and outlook’ section, we provide concluding remarks and recommendations.

Molecular subtyping of cancer

Recent advances in genome-wide profiling techniques have allowed the generation of large-scale genomic data, and various statistical and machine learning algorithms have been developed for processing and interpretation of such data [23–25, 28–31]. Molecular subtyping of cancer, as its name suggests, is a way to classify cancers into different groups based on molecular data and classification models. In contrast to the traditional histological classification of cancer, molecular classifications rely on biomarkers and classifiers. Biomarkers can be informative genes, microRNAs (miRNAs), DNA methylation markers and others [32]. Classifiers can be built by machine learning algorithms, such as Prediction Analysis for Microarrays (PAM) and Support Vector Machines (SVMs), among others [33]. In the following, we introduce different molecular data types and their applications, followed by a workflow for unsupervised classification of cancer.
High-throughput molecular data for cancer subtyping

Gene expression profiling data for cancer subtyping

Microarray and RNA-Seq are two common profiling techniques for generating high-throughput gene expression data. Microarrays are capable of profiling expression patterns for tens of thousands of selected genes in a single assay [9]. RNA-Seq is a sequencing-based method for quantifying gene expression across the entire genome. RNA-Seq has numerous advantages over microarray [34]. First, unlike hybridization-based microarrays, RNA-Seq provides more accurate detection of gene expression. Second, RNA-Seq can detect novel transcripts, single-nucleotide variants and other (as yet) unknown changes that microarray cannot detect. Finally, RNA-Seq has low background signal, and consequently a large dynamic range. Microarray has been the most commonly used technique to generate large-scale molecular data for several decades [35]. As sequencing and analysis techniques develop, sequencing costs are expected to fall and more statistical tools will become available for RNA-Seq, so RNA-Seq will likely replace microarray [36]. Compared with other molecular profiling techniques, microarray and RNA-Seq provide genome-wide coverage, but are also expensive, time-consuming and sample quality-dependent (Table 1). They are commonly used in the initial biomarker identification process; once biomarkers have been identified, other techniques are preferred.

Table 1. A comparison of different techniques for molecular profiling of cancer

Characteristic | Microarray | RNA sequencing | qPCR | NanoString | Tissue microarray
Accuracy [37, 166, 167] | Medium | Medium | High | High | Low
Sensitivity [36, 167, 39] | Medium | High | High | High | Low
Specificity [38, 167, 168] | Medium | Medium | High | High | Low
Speed [36, 167] | Slow | Slow | Fast | Medium | Slow
Cost (per sample) | $300 [163] | $1000 [163] | $280 [164] | $800 [39] | $100 [165]
Sample requirement [167] | FFPE/fresh-frozen | FFPE/fresh-frozen | Fresh-frozen | FFPE/fresh-frozen | FFPE
Genome-wide coverage | Yes | Yes | No | No | No
Quantitative | Yes | Yes | Yes | Yes | No
Single-base resolution | No | Yes | No | No | No
Low sample input | No | No | Yes | No | Yes
Reproducibility [168] | Medium | Medium | High | High | Low

Gene expression-based subtyping of cancer was first proposed by Golub et al. [13] in leukemia. The expression pattern of the 50 most informative genes was measured, and a two-cluster self-organizing map (SOM) clustering method was applied [40] to group 38 samples into two classes, acute myeloid leukemia and acute lymphoblastic leukemia, with an accuracy of 100%. This demonstrated the fidelity of cancer subtyping based solely on gene expression patterns [13]. Gene expression-based subtyping has now been extended to many cancer types [11, 14, 16, 17, 19, 21, 41].
Multi-platform profiling data for cancer subtyping

In addition to gene expression profiling, there are many other molecular profiling data types, such as mutation, miRNA expression, copy number variation (CNV) and DNA methylation, which can be used to identify and characterize cancer subtypes (Table 2) [43, 44, 50, 52–55]. As all cancers arise as a result of DNA sequence changes [56], gene mutation patterns are informative and provide a plausible basis for stratifying cancer patients into homogeneous groups [57, 58]. MiRNAs are small noncoding RNAs about 20–22 nucleotides in length that play key roles in the regulation of gene expression. Alterations of miRNA expression are involved in the initiation and progression of human cancer [59–61]. MiRNA expression profiling has now been used as a tool for studying cancer onset and subtyping [15, 62]. Compared with mRNAs, miRNAs are more stable, and only a small number of miRNAs (∼200 in total) are sufficient to classify human cancers [63]. CNVs are structural genomic alterations that affect DNA segments ranging from approximately 1 kb to 3 Mb in length [64]. CNVs are associated with many complex diseases such as neuropsychiatric disorders [65], HIV [66], familial pancreatitis [67] and cancers [68, 69]. Comparative genomic hybridization (CGH) can be used to detect CNVs at the genome-wide level, and array-based CGH increases the resolution. Epigenetic changes such as DNA methylation also play a significant role in the development and progression of cancer [70]. Bisulfite sequencing [71] and differential methylation hybridization [72] can be used to scan gene methylation status at the genome-wide level.

Table 2. Molecular subtyping studies mentioned in the review

Cancer type | Discovery sample size | Molecular data type | Clustering method | Determinative score | Number of subtypes | Classification method | Reference
Breast cancer | 65 | mRNA | Hierarchical clustering | NA | 4 | NA | Perou et al. [16]
Breast cancer | 85 | mRNA | Hierarchical clustering | NA | 5 | NA | Sorlie et al. [42]
Breast cancer | 825 | Five platforms | Cluster of clusters | NA | 4 | NA | TCGA [43]
Breast cancer | 2,000 | mRNA + CNV | iCluster | ARI | 10 | PAM | Curtis et al. [44]
CRC | 62 | mRNA | Iterative NMF | Cophenetic coefficient | 5 | NA | Schlicker et al. [45]
CRC | 443 | mRNA | Orig. cons. clustering | CDF area | 6 | Centroid-based | Marisa et al. [46]
CRC | 90 | mRNA | Orig. cons. clustering | Gap statistic | 3 | PAM | De Sousa E Melo et al. [20]
CRC | 445 | mRNA | NMF cons. clustering | Cophenetic coefficient | 5 | PAM | Sadanandam et al. [47]
CRC | 1,113 | mRNA | Orig. cons. clustering | Dynamic cut tree | 5 | Multiclass LDA | Budinska et al. [48]
CRC | 188 | mRNA | k-means | NA | 3 | Single-sample centroid based | Roepman et al. [49]
CRC | 4,151 | mRNA | Markov Cluster Algorithm | Inflation factor | 4 | Random Forest | Guinney et al. [11]
PDAC | 185 | miRNA | Hierarchical clustering | CDF area | 2 | SVM | Bauer et al. [50]
PDAC | 66 | mRNA | NMF cons. clustering | Cophenetic coefficient | 3 | NTP | Collisson et al. [19]
PDAC | 223 | mRNA | NMF cons. clustering | Cophenetic coefficient | 2 | Rank-based classifier | Moffitt et al. [51]
Pancreatic cancer | 96 | mRNA | NMF cons. clustering | Cophenetic coefficient | 4 | NA | Bailey et al. [12]
Leukemia | 38 | mRNA | SOM | NA | 2 | NA | Golub et al. [13]
Leukemia | 200 | Methylation | PCA | NA | 16 | NA | Figueroa et al. [169]
Lymphoma | 42 | mRNA | Hierarchical clustering | NA | 2 | NA | Alizadeh et al. [14]
GBM | 35 | miRNA | PCA | Ratio of intracluster to intercluster correlation | 2 | LDA | Marziali et al. [170]
Lung | 67 | mRNA | Hierarchical clustering | NA | 4 | NA | Garber et al. [17]
12 cancer types | 3,527 | Five platforms | COCA | NA | 11 | NA | Hoadley et al. [32]

Note: ARI, adjusted Rand index; COCA, Cluster-Of-Cluster-Assignments; iCluster, integrative clustering framework; LDA, linear discriminant analysis; NTP, nearest template prediction; Orig. cons., original consensus; PCA, principal component analysis.

Integrating the analysis of multiple genomic data types, such as gene expression with CNV [44], miRNA with gene expression [73] and five-platform combined subtyping [32], can provide better insights into tumor biology, and more accurate predictions, than analysis at a single molecular level [74]. With advances in high-throughput profiling technologies, the cost per sample is decreasing; thus, multi-platform identification and characterization of cancer is likely to become the norm.

Low- and medium-throughput molecular data for clinical test

Biomarkers identified from subtyping studies can be used in clinical practice. In typical clinical settings, only up to several dozen of these predefined biomarkers are measured, to minimize the time and expense of the tests [75]. In addition, most cancer specimens are formalin-fixed paraffin-embedded (FFPE), and only a few are freshly prepared or snap frozen [76]. In contrast to the high-throughput approaches mentioned above, some low- and medium-throughput profiling techniques (such as qPCR, NanoString and TMA) that allow meaningful analysis of clinical specimens are well suited for clinical use of biomarker assays. These techniques are frequently used when fast detection is required and sample volume and cost must be kept low. Sensitivity and specificity are the two terms used to evaluate a clinical test.
Sensitivity refers to the ability of a test to correctly identify an individual with the disease; specificity refers to the ability of a test to correctly identify an individual without the disease [77]. Another important measure in the evaluation of a clinical test is its accuracy, which describes the errors that a test produces when differentiating between individuals with and without the disease [78]. In the following, we compare these three techniques (qPCR, NanoString and TMA) in terms of accuracy, sensitivity, specificity and other aspects relevant to a clinical test. Researchers can choose appropriate techniques for their clinical assays based on the comparisons provided in Table 1.

qPCR is commonly used to determine biomarker expression levels, or to assess CNVs. Because the PCR amplification step greatly increases the amount of nucleic acid, only a limited sample quantity is needed. Other advantages of qPCR include speed and high sensitivity, specificity and accuracy, which make it the routine method for validating results initially obtained from high-throughput methods such as microarray and RNA-Seq [79]. Compared with other techniques, which can assay hundreds to thousands of biomarkers, qPCR-based assays can only handle a limited number of biomarkers in a single test. qPCR-based tests also require high-quality nucleic acids in the sampled material, so fresh-frozen tissues are typically required.

The NanoString nCounter analysis system can be used to measure expression levels of up to 800 genes [80]. Developed by Geiss et al. [39], the nCounter system is more sensitive than microarrays, and similar in sensitivity to qPCR [39]. This technology uses digital molecular barcoding and microscopic imaging to detect and quantify the expression levels of genes in a single assay without enzymatic reactions [39, 81]. Other advantages of this technique include high accuracy and specificity [38].
Disadvantages include the high cost of the required reagents and instruments [80].

TMA is a histology-based test, developed by Kononen et al. [82], which allows the analysis of up to 1000 tumor specimens simultaneously in a single paraffin block [37]. Analysis of molecular targets at the DNA, mRNA and protein levels is possible. Once constructed, a TMA block can be sectioned hundreds of times (provided the depth of all cores is sufficient), with each section amenable to biomarker analysis. The most significant advantage of TMA is that all samples on the array are treated in an identical fashion [83]. Another advantage is that TMA is cost-effective (Table 1): only a small amount of reagent is required to analyze all the samples on one slide [83]. Unlike qPCR, which requires fresh-frozen tissues, TMA uses FFPE tissues, which are the major source of material in the clinic. TMA also has limitations. For instance, low sensitivity, specificity and accuracy are typical features of a TMA test [84]. Other disadvantages are that it usually takes several days to obtain the analysis results [85], that only a limited number of analytes can be tested and that the analyzed specimen volume is too small to represent the entire tumor [83]. In addition, the amount of available tissue decreases progressively during the TMA staining process [86].

Subtype identifications and characterizations

Molecular subtyping (or molecular classification) is a process of assigning data objects to clusters, so that objects in the same cluster are more similar to each other than to those in other clusters. There are two kinds of classification strategies: supervised (with class labels, such as tumor or normal tissue, known beforehand) and unsupervised (with unlabeled data). Subtyping is a more general term for classification and can be either supervised or unsupervised.
Unsupervised classification is increasingly popular in biomedical research [87], and has been successfully used in many cancer subtyping studies [11, 13, 15, 17, 41, 51, 88, 89]. From these studies, we summarize a workflow for molecular subtyping of cancer. It includes data preprocessing, cluster analysis, supervised classification and subtype characterizations (Figure 1). In the following, we focus on subtype identifications and characterizations, which are the two important aspects of the workflow.

Figure 1. Molecular subtyping of cancer workflow. The workflow consists of four major steps: (A) Data preprocessing. Array data preprocessing includes image analysis, data normalization and transformation. Next-generation sequencing data preprocessing comprises quality control, read alignment, expression quantification, data normalization and transformation. (B) Cluster analysis. A first feature selection is performed with a cutoff on the standard deviation (SD) (e.g. SD > 0.8) or median absolute deviation (MAD) (e.g. MAD > 0.5). Clustering is usually applied to either the feature dimension or the sample dimension, biclustering to both dimensions and triclustering to three dimensions (feature, sample and time). After (bi/tri)clustering, the optimal number of (bi/tri)clusters is determined by measures such as the gap statistic, cophenetic coefficient and CDF. Ensemble and consensus clustering have also been proposed to enhance the robustness of (bi/tri)clustering. (C) Supervised classification. To build the best possible classifier, a sample selection (silhouette width > 0) and a second feature selection (SAM/Limma) are applied. Various algorithms such as PAM, SVM, Random Forests (RF) and K-nearest neighbors can be used to build classifiers. (D) Subtype characterizations. A heatmap is used to represent the molecular characterizations, in which rows are features (genes, miRNAs, pathways, etc.) and columns are samples. Here, features are subtype-specific features; samples are sorted according to their subtype numbers. A Kaplan–Meier survival plot is used to represent the clinical characterizations, in which the x-axis is the survival time and the y-axis is the probability of an event (i.e. death).

Subtype identifications

High-throughput molecular data are usually arranged in matrix form, in which rows are features (genes, miRNAs or DNA methylation markers) and columns are samples. Molecular data matrices have largely been analyzed in two dimensions (2D): the feature dimension and the sample dimension [90]. Clustering is usually applied to either the feature dimension or the sample dimension. However, subsets of features may be active or suppressed only under certain experimental conditions and behave almost independently under other conditions. To identify such local patterns in the data matrix, biclustering (or subspace clustering), which discovers biclusters, was proposed by Cheng and Church [91]. Various biclustering methods have since been developed to efficiently identify ‘homogeneous’ submatrices in data, such as singular value decomposition [22], nonnegative matrix factorization (NMF) [23] and geometric-based biclustering [92, 93].

With the fast development of data profiling technologies, it is now possible to measure numerous features in a number of samples across multiple time points or experimental conditions. Such data can be arranged into three-dimensional (3D) matrices, with the first two dimensions representing the samples and features, respectively, and the third dimension representing time or experimental conditions [94]. To find feature groups along the feature–sample–time (or –condition) dimensions, triclustering has been proposed to mine triclusters in the data [95]. Because a tensor can be thought of as an organized multidimensional array of numerical values, tensor-based triclustering [96, 97] has become a promising solution for analyzing these longitudinal and spatial data.
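
To make the notion of a ‘homogeneous’ submatrix concrete, the following minimal Python sketch computes the mean squared residue, the coherence score used in the biclustering approach of Cheng and Church [91]. A submatrix scores close to zero when each value is well explained by its row mean, column mean and overall mean. The toy matrix, the function name and the selected row and column indices are illustrative assumptions, not data or code from the studies cited here.

```python
from statistics import mean


def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix given by row and column indices.

    matrix: list of rows (features), each a list of values (samples).
    A perfectly additive submatrix (value = row effect + column effect)
    scores 0; larger scores indicate a less coherent submatrix.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    row_means = [mean(row) for row in sub]
    col_means = [mean(col) for col in zip(*sub)]
    overall = mean(value for row in sub for value in row)

    total = 0.0
    for row, row_mean in zip(sub, row_means):
        for value, col_mean in zip(row, col_means):
            residue = value - row_mean - col_mean + overall
            total += residue * residue
    return total / (len(rows) * len(cols))


# Hypothetical expression matrix: rows are features (genes), columns are samples.
# The first three rows and columns form an additive pattern by construction.
expr = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [5.0, 6.0, 7.0, 4.0],
    [8.0, 1.0, 6.0, 2.0],
]

bicluster_score = mean_squared_residue(expr, [0, 1, 2], [0, 1, 2])
whole_score = mean_squared_residue(expr, [0, 1, 2, 3], [0, 1, 2, 3])
print(f"bicluster: {bicluster_score:.3f}, whole matrix: {whole_score:.3f}")
```

In a full biclustering procedure, a score of this kind would be minimized over candidate subsets of rows and columns; that search strategy is omitted here.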
The optimal number of clusters is determined by measurements such as gap statistics [98], cophenetic coefficients [99] and cumulative distribution function (CDF). Given that cluster analysis methods are based on different algorithms, they yield different results in terms of cluster numbers and assignments [100]. To enhance the robustness of clustering, a method called cluster ensemble has been proposed, which combines results from different runs of clustering methods into a single consensus result [100]. Another similar methodology is consensus clustering, which in conjunction with resampling techniques provides a method to reach consensus from multiple runs of the same clustering method [101]. The major difference between ensemble and consensus clustering is that ensemble clustering integrates results from multiple clustering methods, while consensus clustering provides resampling and performs a single type of clustering method multiple times. Ensemble and consensus clustering methods are also applicable to biclustering and triclustering, and have been widely used in cancer subtyping studies [19, 20, 46, 102]. Subtype characterizations Subtype characterizations rely heavily on genomic and clinical data, and one purpose of subtype characterizations is to investigate the associations between the identified subtypes and their molecular/clinical relevance [103]. Subtype characterizations can also help to identify consensus subtypes within and between cancers, which we will cover in detail in ‘Cancer consensus molecular subtypes’ section. Pathways, mutations, structural variations and methylation patterns can be used as the molecular characteristics. Characterizations of cancer subtypes have implications for patient outcome and targeted therapies. Lex et al. 
[104] developed an integrative visualization tool called StratomeX, which can help researchers to explore the relationships between subtypes and multiple genomic data types such as gene expression, DNA methylation or copy number data. These genomic data, discussed in the ‘High-throughput molecular data for cancer subtyping’ section, can be used not only to identify robust cancer subtypes but also to better understand and interpret the molecular characteristics of the subtypes. In addition, gene set enrichment analysis (GSEA) is usually performed to characterize the biology underlying the identified subtypes. GSEA interprets expression data at the level of gene sets, i.e. groups of genes that share the same biological function, chromosomal location or regulation [105]. Annotated gene sets with specific biological meanings can be obtained, for example, from the Gene Ontology (GO) [106] and KEGG [107] databases. Clinical data include patient information such as age, gender, race, tumor grade, tumor size, time of diagnosis, smoking history, treatment strategies, relapse information and follow-up time, which should be well preserved and managed for clinical characterization of the identified subtypes. Moreover, survival analysis is widely used to compare survival times between subtypes. The Kaplan–Meier estimator [108] can be used to generate the survival curves, and the log-rank test provides a statistical comparison of two subtypes [109]. Subtype characterizations are necessary and important: they not only help us understand more about the subtype characteristics but also provide a subtype validation process. Ideally, the identified subtypes have distinct molecular and clinical characteristics. Often, however, subtypes differ only statistically and not biologically. In such cases, reclustering and reclassification should be repeated until more interpretable results are obtained.
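The survival comparison described above can be sketched with the Python standard library alone. The code below computes a Kaplan–Meier curve and a two-group log-rank statistic; the follow-up times (in months), the event indicators and the function names are hypothetical, and the p-value assumes the usual chi-square approximation with one degree of freedom.

```python
import math

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each observed event time.
    events[i] is 1 if the event (e.g. death) occurred, 0 if censored."""
    curve, surv = [], 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
    return curve

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    all_times, all_events = times_a + times_b, events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    obs_minus_exp, variance = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)          # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)          # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n               # observed - expected in A
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))           # chi-square, 1 df
    return chi2, p_value

# Hypothetical follow-up data for two subtypes (months; 1 = death, 0 = censored).
a_times, a_events = [5, 8, 12, 14, 20, 24], [1, 1, 1, 0, 1, 0]
b_times, b_events = [15, 22, 30, 34, 40, 45], [1, 0, 1, 0, 0, 0]

km_a = kaplan_meier(a_times, a_events)
chi2, p_value = logrank_test(a_times, a_events, b_times, b_events)
print(km_a, chi2, p_value)
```

With cohorts this small the chi-square approximation is rough, so the sketch is meant to show the mechanics of the comparison rather than to support any inference.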
Moving toward clinical applications With the progression from high-throughput molecular data and molecular subtyping of cancer to the development of marker panels using low- and medium-throughput methods, clinicians are beginning to make treatment decisions for cancer patients based on cancer subtyping studies [110, 111]. In the following, we provide a few examples of subtyping studies that have been applied to cancer diagnosis, prognosis, response prediction and drug design. Specifically, we focus on biomarkers for diagnostic and prognostic purposes in the ‘Biomarkers identified from subtyping studies for cancer diagnosis and prognosis’ section, and on cancer subtypes for therapy response prediction and drug development in the ‘Cancer subtypes for predicting therapy response and drug design’ section. Biomarkers identified from subtyping studies for cancer diagnosis and prognosis Biomarkers identified from subtyping studies with specific indications for cancer diagnosis and prognosis are now widely applied in clinical research, and are increasingly combined with conventional histology to improve diagnostic accuracy [112]. For example, TLE1 serves as a diagnostic marker for synovial sarcoma [113], and CD10, BCL6 and MUM1 serve as diagnostic markers for the germinal center B-cell-like (GCB) subtype of lymphoma [114]. Furthermore, biomarkers can be used directly to detect cancer. For instance, Bauer et al. [50] analyzed the complete miRNA repertoire of 136 pancreatic ductal adenocarcinoma (PDAC) samples, 27 pancreatitis samples and 22 normal controls. They used a hierarchical clustering method and an SVM classifier, and found that the analysis of only five miRNAs in blood and tissues can distinguish PDAC from pancreatitis and normal controls, possibly aiding PDAC diagnosis. Several multigene predictors have been developed for breast cancer patients [115]. These include MammaPrint, Oncotype DX and simplified MapQuant Dx.
These predictors are now widely used in the clinic to classify breast cancer patients and treat them accordingly. MammaPrint was the first successfully applied microarray-based prognostic test for breast cancer. MammaPrint uses a 70-gene signature. To identify these genes, hierarchical clustering was used to classify 98 breast cancer patients into good and poor prognosis groups. This was followed by a three-step supervised classification method to reliably stratify good and poor prognostic categories, which yielded 70 prognostic genes for breast cancer [116]. MammaPrint is a US Food and Drug Administration-approved molecular test to predict the risk of breast cancer metastasis. The result of the test can help physicians to determine the appropriate treatment strategy. Most early-stage breast cancer patients receive adjuvant chemotherapy, but only a subset of them benefit from the treatment. Paik et al. [117] thus developed a 21-gene qPCR assay called Oncotype DX. This is a diagnostic test that predicts the likelihood of chemotherapy benefit and calculates a recurrence score for early-stage breast cancer. Simplified MapQuant Dx is also a qPCR-based prognostic test for breast cancer. It was developed by Toussaint et al. [118], and is based on the expression patterns of four representative genes from the genomic grade index [119] and four reference genes. The prognostic information provided by the test is only applicable to estrogen receptor-positive breast cancer patients [120]. Cancer subtypes for predicting therapy response and drug design Subtyping studies are potentially well suited to selecting a subset of patients that may benefit from certain drugs or therapies. For instance, Rouzier et al. [121] examined whether the four subtypes of breast cancer [16] respond differently to chemotherapy.
Results showed that the basal-like and ERBB2-overexpressing subtypes are more sensitive to paclitaxel- and doxorubicin-containing preoperative chemotherapy than the luminal and normal-like subtypes. Tumor specimens for laboratory research are often limited in quantity, infiltrated with nontumor cells and sometimes subject to ethical restrictions. Models for cancer, such as cell lines and patient-derived xenografts (PDXs), have been established as in vitro and in vivo platforms that can overcome these shortcomings of tumor specimens, and are now widely used by researchers. For instance, Ross et al. [122] provided molecular characterization of the NCI (National Cancer Institute)-60 cancer cell line panel, and demonstrated that these cell lines correspond to their tumors of origin. Gao et al. [123] established about 1000 PDXs, which provided excellent in vivo platforms to screen novel therapies for cancer patients. Cancer cell lines and PDXs can also be classified into different subtypes. For example, Kao et al. [124] classified 52 commonly used breast cancer cell lines into five subtypes [42], and defined the cell line subtypes that most faithfully capture the known heterogeneity of breast cancer. Moffitt et al. [51] sequenced 37 PDXs from PDAC and demonstrated that these models can recapitulate tumor-specific subtypes. Therefore, cell line and PDX models also provide a great opportunity to investigate subtype-specific therapies. Recent developments in high-throughput technologies have allowed large-scale screening of chemicals and drugs on cell line panels [125]. For example, the abovementioned NCI-60 cancer cell line panel [126] has been used as a standard platform on which >40 000 chemicals were screened over the past few decades [125]. In addition, Garnett et al.
[127] screened a panel of several hundred cancer cell lines with 130 drugs in clinical use and under preclinical investigation, which also provides a powerful strategy to identify subtype-specific cancer therapies and biomarkers to guide such strategies. Drug development is shifting away from cytotoxic agents toward drugs that are designed to target specific molecules that drive malignant progression [128]. This is still a challenging task, but subtype-specific biomarkers can become potential targets for drug design, and should be investigated and validated further [129, 130]. Challenges We see four major challenges in cancer subtyping studies that preclude clinical implementation (Figure 2). The first is data acquisition, curation and management. The second challenge is tumor microenvironment (TME) heterogeneity. The remaining two challenges are the lack of consensus molecular subtypes and problems with single-sample classification.
Figure 2. Four major challenges in the molecular subtyping of cancer and associated solutions/problems. The first challenge is data acquisition, curation and management. Data from publicly available data sets, such as ICGC, TCGA and GEO, can increase sample size or be used as validation data sets. Low tumor cellularity can be addressed by physical and virtual microdissection. The second challenge is TME heterogeneity. The TME includes immune cells, blood vessels, fibroblasts and ECM, which all exhibit heterogeneity at some level. The third challenge is the lack of consensus molecular subtypes. Currently, we only have three examples of consensus subtyping studies: colorectal cancer, breast cancer and the TCGA’s pan-cancer study. The last challenge is the problem with single-sample classification. Currently applied SSPs may yield inconsistent classification results.
Data acquisition, curation and management Many cancer subtyping studies use a multiple random training-validation strategy [131], in which a training data set is used to identify molecular signatures, and validation data sets are used to validate the classification performance. Normally, researchers use their own data set as the training data set, and publicly available data sets as validation data sets. Publicly available resources, such as the International Cancer Genome Consortium (ICGC, www.icgc.org) and The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/), contain coordinated large-scale cancer genomic data that can be accessed online. ICGC holds genomic, transcriptomic, epigenomic and clinical data from 50 different cancer types and subtypes. Currently, data from >25 000 tumor genomes are available on the ICGC website [132]. TCGA also contains a collection of cancer genomic data, and so far, >30 human tumor types have been analyzed through large-scale genome sequencing of 11 000 patient samples [133].
In addition, Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is a public repository that archives and freely distributes gene expression data from numerous studies [134]. Researchers can upload their own data to GEO or download data from GEO as validation data sets. Subtyping study cohorts typically range from dozens to more than a few hundred tumors (Table 2). Identification of cancer subtypes has been frustrated by a lack of tumor samples available for study [19]. For instance, because <20% of PDAC patients have resectable tumors at the time of diagnosis, material for profiling is typically limited [135]. Some studies have overcome this problem by integrating data from different sources to increase sample size [19, 136]. The batch effects (nonbiological differences) that this introduces can be removed by methods such as empirical Bayes [137], surrogate variable analysis [138] or Distance Weighted Discrimination [139]. Another common problem is the low tumor cellularity of patient samples, which makes the molecular data noisy. Capturing tumor-specific patterns in such data is difficult. Because of the tight connection and interaction between cancer cells and surrounding cells, conventional separation techniques, such as laser capture microdissection [140], cannot perfectly separate tumor cells from nontumor cells. Thus, statistical enrichment techniques such as virtual microdissection [51], and mathematical algorithms such as ESTIMATE [141] or qpure [142], can be used to assess tumor cellularity and deconvolve tumor-specific contributions. In summary, to dissect the genetic heterogeneity of tumor cells, molecular and clinical data should be well processed and managed. As there are abundant publicly available data sets and various data processing tools that may be useful for answering such questions, researchers should take full advantage of them.
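To illustrate what removing a batch effect means when data sets are merged, the sketch below applies a deliberately simplified, location-only adjustment: each feature is mean-centered within each batch and the overall feature mean is then restored. This is not the empirical Bayes [137], surrogate variable analysis [138] or Distance Weighted Discrimination [139] method cited above; the function name `center_by_batch` and the toy data are assumptions for illustration.

```python
import numpy as np

def center_by_batch(X, batches):
    """Remove per-batch mean shifts from X (features x samples).
    batches holds one batch label per sample (column)."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    adjusted = X.copy()
    grand_mean = X.mean(axis=1, keepdims=True)
    for b in np.unique(batches):
        cols = batches == b
        # subtract this batch's own mean for every feature
        adjusted[:, cols] -= X[:, cols].mean(axis=1, keepdims=True)
    return adjusted + grand_mean   # put features back on the original scale

# Hypothetical merged data: 3 features x 6 samples, batch B shifted by +2.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))
batches = np.array(["A", "A", "A", "B", "B", "B"])
X[:, batches == "B"] += 2.0

X_adj = center_by_batch(X, batches)
print(X_adj.mean(axis=1))
```

This kind of centering is only reasonable when the batches have a comparable subtype composition; if one batch is enriched for a particular subtype, per-batch centering would also remove genuine biological differences.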
TME heterogeneity Heterogeneity exists not only in the tumor cell compartment but also in the TME. The TME is the sum of interactions between tumor cells and the surrounding environment, and plays an important role in tumor development, progression and therapy responses. The TME includes immune cells, blood vessels, fibroblasts and extracellular matrix (ECM). Stroma is part of the TME, and is a histological unit consisting of connective tissue, fat tissue, fibroblasts, ECM and immune cells within an extracellular scaffold [143]. Stroma, as a whole, can be classified into different subtypes with clinical implications. For instance, Moffitt et al. [51] applied NMF-based consensus clustering to hundreds of PDAC tumors and cell lines, and identified two stroma subtypes, named normal and activated. The activated stroma subtype contributes to poor clinical outcome. Heterogeneity has also been observed in other components of the TME, such as tumor-infiltrating immune cells, fibroblasts and ECM [144–148]. Solid tumors are infiltrated by various immune cells, for example, T and B lymphocytes, mast cells and so on [149]. These immune cells either play a positive role by inhibiting cancer cell growth or are responsible for tumor-associated chronic inflammation. The presence of a T-cell-infiltrated TME can serve as a predictive biomarker for response to immunotherapies [144]. However, in many tumor types, only a subset of patients can generate a tumor antigen-specific T-cell response. The remaining patients lack an appropriate T-cell phenotype and resist immunotherapeutic interventions [144]. Selecting patients who can potentially benefit from immunotherapies is therefore a challenge. One way to address this problem is to identify T-cell response genes and build a binary gene expression classifier that distinguishes responders from nonresponders. The ECM is a collection of extracellular proteins present in all tissues that provides support to the cells of each tissue [150].
Recent studies have found that considerable heterogeneity exists in the ECM, and clinical outcome is often related to ECM characteristics. For instance, Bergamaschi et al. [147] identified 278 ECM-related genes to classify primary breast tumors into four groups (ECM1–4) with distinct clinical outcomes. Although tumor and stromal cells interact closely with each other, stromal cells differ from tumor cells in terms of genetic architecture. Stromal cells are mostly genetically intact [143, 151], which suggests that the stroma could be a target of therapy. Heterogeneity in the characteristics of both tumor cells and the TME raises questions regarding future cancer treatment. Which of them is easier to target? How do we interpret such 2D heterogeneity, and how are the two related? Can we incorporate them into a single system? These questions remain to be answered in the future. Cancer consensus molecular subtypes Currently, there are six subtyping systems for colorectal cancer (CRC) [20, 45–49], which classify CRC into three to six subtypes (Table 2). To identify robust consensus subtypes of CRC, a consensus subtyping effort was initiated. The Colorectal Cancer Subtyping Consortium (CRCSC) developed a network-based approach to investigate the associations between the six independent classification systems. A multi-class classifier was built that could classify CRC into four consensus molecular subtypes (CMS1–4) [152]. CMS1 tumors are highly mutated, microsatellite unstable and show strong immune activation. CMS2 tumors are characterized by marked Wnt and Myc signaling activation. CMS3 cancers are metabolically dysregulated. CMS4 cases feature transforming growth factor-β activation, stromal invasion and angiogenesis signatures.
These consensus results will aid future clinical stratification and subtype-based targeted interventions for CRC, and such collaborations should serve as a role model for other cancer subtyping studies to accelerate our understanding of cancer biology [152] and to develop more effective cancer treatments. The use of different patient cohorts, platforms and clustering methods for a specific tumor type typically yields divergent subtyping results. Breast cancer (Table 2) was first classified by Perou et al. [16] into four subtypes: luminal, basal-like, normal-like and ERBB2-overexpressing. Then, Sørlie et al. [42] profiled 85 breast cancer patients and normal controls using complementary DNA microarrays, and used hierarchical clustering to classify the patients into five subtypes, i.e. luminal A, luminal B, HER2-overexpressing, basal and normal-like. The most recent breast cancer subtyping study by TCGA also suggested four subtypes: luminal A, luminal B, HER2-positive and triple-negative [43]. We can conclude that, despite the inconsistent naming and number of clusters across studies [16, 42, 43], breast tumors fall primarily into three major subtypes: luminal, HER2-overexpressing and triple-negative breast cancer (TNBC) [89]. The luminal subtype is the most common and carries a good prognosis. Tumors of this subtype express hormone receptors, which makes them responsive to hormone therapies. The HER2-overexpressing breast cancer subtype is more sensitive to herceptin (trastuzumab) and chemotherapy than the luminal subtype. The TNBC subtype is resistant to standard targeted therapies and carries the worst prognosis. The next important consideration is consensus subtyping between cancers. Although there are many cancer types based on their tissue of origin, we can observe similarities between them. The TCGA’s pan-cancer classification study [32] is a good example of this.
Six different ‘omic’ platforms were integratively analyzed across 3527 tumor specimens from 12 cancer types. A unified cancer classification system was constructed, which identified 11 major subtypes. Among them, five subtypes were strongly associated with their tissue of origin, but the remaining subtypes were not strictly associated with their tissue of origin. For instance, bladder cancers split into three pan-cancer subtypes. Lung squamous, head and neck and a subset of bladder cancers coalesced into a single subtype. This study not only provided a new classification system for multiple cancers but also demonstrated that general characteristics exist between cancers that were traditionally considered to be different entities. Cancer is a complex disease. Without a systematic understanding of the characteristics of the disease, we cannot develop effective therapies against it. The general characteristics within and between cancers provide great opportunities to identify consensus molecular subtypes. For example, basal subtypes are defined in breast cancer [42], bladder cancer [88] and pancreatic cancer [51]. Mesenchymal subtypes are defined in glioblastoma (GBM) [41], NPC [15], breast [153], pancreatic [19] and colon cancers [20]. Basal subtypes usually express genes such as laminins and keratins, and have the worst prognosis compared with other subtypes. The characteristics of mesenchymal subtypes include a mesenchymal phenotype, high expression of proliferation genes, poor prognosis, high malignant potential and resistance to current therapies. Thus, devising treatments that are effective against multiple cancer types with shared characteristics may become a promising solution for future cancer treatment. Single-sample classification The abovementioned classifiers (or predictors) are mainly built based on a large number of training samples, and for this reason, we call them population-based predictors.
In contrast, single-sample predictors (SSPs) are classification models that can classify a single sample into one of the molecular subtypes of a specific type of cancer [154, 155]. Traditionally, to classify a new sample into a specific subtype with a population-based predictor, reanalysis of a large data set is needed. In contrast to population-based predictors, SSPs can assign a single sample to a specific subtype regardless of other samples, and are therefore more useful and practical for individual patients. SSPs have been built for several types of cancer. For instance, Sørlie et al. [154] constructed the first SSP for breast cancer, Stratford et al. [136] developed an SSP for PDAC and Ringnér et al. [156] derived an SSP for lung adenocarcinoma. SSPs are constructed based on tumor-intrinsic signatures and similarities between a given sample and molecular subtype centroids [154, 155]. Methods applied in population-based predictors, such as hierarchical clustering and the nearest centroid classification method [157], can be used in SSPs. One of the most important requirements for an SSP is that it cannot be built based on row-centered (mean-centered or median-centered) data [158]. Normally, molecular data matrices contain features in rows and samples in columns. Row-centering is a feature-centering process that can help to remove side effects caused by outlier features. The construction of SSPs includes no row-centering step, and studies have found that SSPs yield inconsistent classification results [158–160]. Sørlie et al. [161] accepted the conclusions and comments of Weigelt et al. [158], and explained the reasons for the inconsistent classification results as follows: for the three one-channel-based data sets, most of the variation was caused by differences between genes, and not so much by differences between samples. As a result, the correlation values vary over a much smaller range in the uncentered data.
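The effect of row-centering on centroid correlations described above can be sketched on simulated data. In the code below, gene-specific baseline levels dominate the variation, so an uncentered sample correlates almost equally well with every subtype centroid, whereas centering against a reference cohort spreads the correlations apart. The simulated cohort, the use of Pearson correlation and all names are assumptions made for illustration; this is not the SSP of any study cited here.

```python
import numpy as np

def correlate_to_centroids(sample, centroids):
    """Pearson correlation of one sample (genes,) with each centroid
    (columns of a genes x subtypes matrix)."""
    return np.array([np.corrcoef(sample, centroids[:, k])[0, 1]
                     for k in range(centroids.shape[1])])

rng = np.random.default_rng(0)
n_genes, n_per_subtype = 200, 30

baseline = rng.normal(8.0, 2.0, n_genes)   # large between-gene differences
shift = np.zeros((n_genes, 2))             # small subtype-specific signals
shift[:20, 0] = 1.0
shift[20:40, 1] = 1.0

# Reference cohort with features in rows and samples in columns.
labels = np.repeat([0, 1], n_per_subtype)
cohort = np.column_stack([
    baseline + shift[:, k] + rng.normal(0.0, 0.3, n_genes) for k in labels
])
gene_means = cohort.mean(axis=1)           # row (gene) means of the cohort

raw_centroids = np.column_stack(
    [cohort[:, labels == k].mean(axis=1) for k in (0, 1)])
centered_centroids = raw_centroids - gene_means[:, None]

# One new tumor that truly belongs to subtype 1.
new_sample = baseline + shift[:, 1] + rng.normal(0.0, 0.3, n_genes)

raw_corr = correlate_to_centroids(new_sample, raw_centroids)
centered_corr = correlate_to_centroids(new_sample - gene_means, centered_centroids)
predicted = int(centered_corr.argmax())
print(raw_corr, centered_corr, predicted)
```

Note that the centered call still relies on gene means estimated from a reference cohort, so the sample is not classified in complete isolation.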
Therefore, for a sample to be correctly assigned to a subtype, it must be centered against an appropriately large and heterogeneous sample set. Sørlie et al. [161] highlighted the importance of performing row-centering in molecular data-processing steps. In summary, building SSPs is a challenging but important task, and up to now, there is no effective way to deal with the centering problem. Although current results are not encouraging, we hope that in the near future, applicable SSPs can be developed and applied in the clinic. Conclusions and outlook Heterogeneity renders cancer more than a single disease. This poses a significant challenge to the traditional management of cancer. With the advent of genome-wide molecular profiling of cancer, especially the advancements in high-throughput profiling technologies, researchers can now investigate the collection of genomic and epigenomic changes that exist in cancer. In contrast with traditional classification methods, molecular classification can be used to assign cancers to subgroups with distinct molecular characteristics, tumor biology and clinical presentation. The most important step in molecular subtyping of cancer is cluster analysis. Different clustering methods can produce different results, many cluster analyses are unstable, and cluster analysis is a purely exploratory method [162]. It is hard to tell which algorithm is better, as this largely depends on the question asked. Thus, it is important to ensure proper preprocessing and normalization of the data; ensemble and consensus clustering methods should also be considered when performing cluster analysis. Another important step is subtype characterization. The identified subtypes should be both statistically significant and biologically relevant. This means that the collection of molecular as well as clinical data is mandatory to truly characterize the identified subtypes.
Also, publicly available data sets can be used to evaluate the classification performance of the classifiers. Although numerous molecular subtyping studies have been conducted and have identified subtypes for various cancer types, current cancer patient stratification still largely relies on traditional histopathological observation and assessment. We are facing several challenges (Figure 2). The gap between research findings (identified subtypes) and clinical applications can be bridged by improved statistical methods and better interpretation of the results. Once cancers are correctly separated into different subtypes, the next important step is to properly interpret these identified subtypes from a biological point of view, followed by a move toward clinical applications. We hope that the clinical tests successfully applied in breast cancer will be followed by similar tests in other cancer types. In summary, cancer should not be treated as a single disease. Molecular subtyping can identify distinct cancer subtypes, which may shed new light on treatment strategies for cancer patients. Several challenges should be addressed before molecular subtyping can be successfully applied in the clinic. Key Points Heterogeneity renders cancer more than a single disease. Molecular subtyping can be used to assign cancers to subgroups with distinct molecular characteristics, tumor biology and clinical presentation. Unsupervised classification schemes have been successfully applied to identify subtypes in a large number of malignancies. From these studies, we summarize a workflow for molecular subtyping of cancer, which includes data preprocessing, cluster analysis, supervised classification and subtype characterizations. We identified and described four major challenges in cancer subtyping studies that preclude clinical implementation. The first is data acquisition, curation and management. The second challenge is TME heterogeneity.
The remaining two challenges are the lack of consensus molecular subtypes, and problems with single-sample classification, respectively. We suggest that standardized methods should be established to help identify intrinsic subgroup signatures and to build robust classifiers that pave the way toward stratified treatment of cancer patients. Lan Zhao is a PhD candidate at the Department of Electronic Engineering, City University of Hong Kong. Her research interests are in the areas of machine learning, cancer genomics and computational biology. Victor H. F. Lee is currently a Clinical Associate Professor of the Department of Clinical Oncology, the University of Hong Kong. His current interests include clinical and genetic studies on nasopharyngeal cancer, head and neck cancers, lung cancers, liver cancers and gastrointestinal cancers. Michael K. Ng is the Head and Chair Professor of the Department of Mathematics, and Chair Professor (Affiliate) of Department of Computer Science at the Hong Kong Baptist University. As an applied mathematician, his main research areas include bioinformatics, data mining, operations research and scientific computing. Hong Yan received his PhD degree from Yale University. He was a Professor of imaging science at the University of Sydney and currently is the chair professor of computer engineering at City University of Hong Kong. His research interests include bioinformatics, image processing and pattern recognition. Maarten F. Bijlsma is an Associate Professor at the Academic Medical Center with the University of Amsterdam. His research focuses on pancreatic and esophageal cancer, from the most fundamental mechanisms that underlie aberrant signaling in these diseases, to the development of serum-borne markers in patient cohorts to predict treatment response and disease outcome. Furthermore, he is a Biomarker/Imaging Program leader for the AMC/VUmc Cancer Center Amsterdam. 
Acknowledgement The authors thank Xin Wang from the Department of Biomedical Sciences of the City University of Hong Kong for comments on an earlier version of the manuscript.
Funding This work was supported by the Hong Kong Research Grants Council (RGC) (Project C1007-15G) and City University of Hong Kong (Project 7004862).
References
1 Campbell PJ, Pleasance ED, Stephens PJ, et al. Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing. Proc Natl Acad Sci USA 2008;105(35):13081–6.
2 Shipitsin M, Campbell LL, Argani P, et al. Molecular definition of breast tumor heterogeneity. Cancer Cell 2007;11(3):259–73.
3 Macintosh CA, Stower M, Reid N, et al. Precise microdissection of human prostate cancers reveals genotypic heterogeneity. Cancer Res 1998;58:23–8.
4 González-García I, Solé RV, Costa J. Metapopulation dynamics and spatial heterogeneity in cancer. Proc Natl Acad Sci USA 2002;99(20):13085–9.
5 Iacobuzio-Donahue CA. Genetic evolution of pancreatic cancer: lessons learnt from the pancreatic cancer genome sequencing project. Gut 2012;61(7):1085–94.
6 Penchev VR, Rasheed ZA, Maitra A, et al. Heterogeneity and targeting of pancreatic cancer stem cells. Clin Cancer Res 2012;18(16):4277–84.
7 Burrell RA, McGranahan N, Bartek J, et al. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 2013;501(7467):338–45.
8 McGranahan N, Swanton C. Biological and therapeutic impact of intratumor heterogeneity in cancer evolution. Cancer Cell 2015;27(1):15–26.
9 Duggan DJ, Bittner M, Chen Y, et al. Expression profiling using cDNA microarrays. Nat Genet 1999;21(Suppl 1):10–14.
10 Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet 2010;11(1):31–46.
11 Guinney J, Dienstmann R, Wang X, et al. The consensus molecular subtypes of colorectal cancer. Nat Med 2015;21(11):1350–6.
12 Bailey P, Chang DK, Nones K, et al. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature 2016;531(7592):47–52.
13 Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531–7.
14 Alizadeh AA, Eisen MB, Davis RE, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000;403(6769):503–11.
15 Zhao L, Fong AHW, Liu N, et al. Molecular subtyping of nasopharyngeal carcinoma (NPC) and a microRNA-based prognostic model for distant metastasis. J Biomed Sci 2018;25:16.
16 Perou CM, Sørlie T, Eisen MB, et al. Molecular portraits of human breast tumours. Nature 2000;406(6797):747–52.
17 Garber ME, Troyanskaya OG, Schluens K, et al. Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci USA 2001;98(24):13784–9.
18 Chen X, Cheung ST, So S, et al. Gene expression patterns in human liver cancers. Mol Biol Cell 2002;13(6):1929–39.
19 Collisson EA, Sadanandam A, Olson P, et al. Subtypes of pancreatic ductal adenocarcinoma and their differing responses to therapy. Nat Med 2011;17:500–3.
20 De Sousa E Melo F, Wang X, Jansen M, et al. Poor-prognosis colon cancer is defined by a molecularly distinct subtype and develops from serrated precursor lesions. Nat Med 2013;19:614–18.
21 Nielsen TO, West RB, Linn SC, et al. Molecular characterisation of soft tissue tumours: a gene expression study. Lancet 2002;359(9314):1301–7.
22 Kluger Y, Basri R, Chang JT, et al. Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 2003;13(4):703–16.
23 Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401(6755):788–91.
24 Tibshirani R, Hastie T, Narasimhan B, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002;99(10):6567–72.
25 Hearst MA, Dumais ST, Osuna E, et al. Support vector machines. IEEE Intell Syst Their Appl 1998;13(4):18–28.
26 Khan J, Wei JS, Ringner M, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001;7:673–9.
27 Nutt CL, Mani DR, Betensky RA, et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res 2003;63:1602–7.
28 Eisen MB, Spellman PT, Brown PO, et al. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998;95(25):14863–8.
29 Pena JM, Lozano JA, Larranaga P. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognit Lett 1999;20:1027–40.
30 Breiman L. Random forests. Mach Learn 2001;45:5–32.
31 Fukunaga K, Narendra PM. A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans Comput 1975;100:750–3.
32 Hoadley KA, Yau C, Wolf DM, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158(4):929–44.
33 Siang TC, Soon TW, Kasim S, et al. A review of cancer classification software for gene expression data. Int J Biosci Biotechnol 2015;7(4):89–108.
34 Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10:57–63.
35 Guo Y, Sheng Q, Li J, et al. Large scale comparison of gene expression levels by microarrays and RNAseq using TCGA data. PLoS One 2013;8(8):e71462.
36 Zhao S, Fung-Leung WP, Bittner A, et al. Comparison of RNA-seq and microarray in transcriptome profiling of activated T cells. PLoS One 2014;9(1):e78644.
37 Shergill IS, Shergill NK, Arya M, et al. Tissue microarrays: a current medical research tool. Curr Med Res Opin 2004;20:707–12.
38 Veldman-Jones MH, Brant R, Rooney C, et al. Evaluating robustness and sensitivity of the nanostring technologies ncounter platform to enable multiplexed gene expression analysis of clinical samples. Cancer Res 2015;75(13):2587–93.
39 Geiss GK, Bumgarner RE, Birditt B, et al.
Direct multiplexed measurement of gene expression with color-coded probe pairs . Nat Biotechnol 2008 ; 26 : 317 – 25 . Google Scholar CrossRef Search ADS PubMed 40 Tamayo P , Slonim D , Mesirov J , et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation . Proc Natl Acad Sci USA 1999 ; 96 ( 6 ): 2907 – 12 . Google Scholar CrossRef Search ADS PubMed 41 Verhaak RGW , Hoadley KA , Purdom E , et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1 . Cancer Cell 2010 ; 17 ( 1 ): 98 – 110 . Google Scholar CrossRef Search ADS PubMed 42 Sørlie T , Perou CM , Tibshirani R , et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications . Proc Natl Acad Sci USA 2001 ; 98 ( 19 ): 10869 – 74 . Google Scholar CrossRef Search ADS PubMed 43 Cancer Genome Atlas Network Comprehensive molecular portraits of human breast tumours . Nature 2012 ; 490 : 61 – 70 . CrossRef Search ADS PubMed 44 Curtis C , Shah SP , Chin SF , et al. The genomic and transcriptomic architecture of 2, 000 breast tumours reveals novel subgroups . Nature 2012 ; 486 ( 7403 ): 346 – 52 . Google Scholar PubMed 45 Schlicker A , Beran G , Chresta CM , et al. Subtypes of primary colorectal tumors correlate with response to targeted treatment in colorectal cell lines . BMC Med Genomics 2012 ; 5 : 66 . Google Scholar CrossRef Search ADS PubMed 46 Marisa L , de Reyniès A , Duval A , et al. Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value . PLoS Med 2013 ; 10 ( 5 ): e1001453 . Google Scholar CrossRef Search ADS PubMed 47 Sadanandam A , Lyssiotis CA , Homicsko K , et al. A colorectal cancer classification system that associates cellular phenotype and responses to therapy . Nat Med 2013 ; 19 : 619 – 25 . 
48 Budinska E, Popovici V, Tejpar S, et al. Gene expression patterns unveil a new level of molecular heterogeneity in colorectal cancer. J Pathol 2013; 231(1): 63–76.
49 Roepman P, Schlicker A, Tabernero J, et al. Colorectal cancer intrinsic subtypes predict chemotherapy benefit, deficient mismatch repair and epithelial-to-mesenchymal transition. Int J Cancer 2014; 134(3): 552–62.
50 Bauer AS, Keller A, Costello E, et al. Diagnosis of pancreatic ductal adenocarcinoma and chronic pancreatitis by measurement of microRNA abundance in blood and tissue. PLoS One 2012; 7(4): e34151.
51 Moffitt RA, Marayati R, Flate EL, et al. Virtual microdissection identifies distinct tumor- and stroma-specific subtypes of pancreatic ductal adenocarcinoma. Nat Genet 2015; 47: 1168–78.
52 Marcucci G, Mrózek K, Bloomfield CD. Molecular heterogeneity and prognostic biomarkers in adults with acute myeloid leukemia and normal cytogenetics. Curr Opin Hematol 2005; 12: 68–75.
53 Nones K, Waddell N, Song S, et al. Genome-wide DNA methylation patterns in pancreatic ductal adenocarcinoma reveal epigenetic deregulation of SLIT-ROBO, ITGA2 and MET signaling. Int J Cancer 2014; 135(5): 1110–18.
54 Waddell N, Pajic M, Patch AM, et al. Whole genomes redefine the mutational landscape of pancreatic cancer. Nature 2015; 518(7540): 495–501.
55 Daemen A, Peterson D, Sahu N, et al. Metabolite profiling stratifies pancreatic ductal adenocarcinomas into subtypes with distinct sensitivities to metabolic inhibitors. Proc Natl Acad Sci USA 2015; 112(32): E4410–17.
56 Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature 2009; 458(7239): 719–24.
57 Finkelstein SD, Sayegh R, Christensen S, et al. Genotypic classification of colorectal adenocarcinoma. Biologic behavior correlates with K-ras-2 mutation type. Cancer 1993; 71(12): 3827–38.
58 Vural S, Wang X, Guda C. Classification of breast cancer patients using somatic mutation profiles and machine learning approaches. BMC Syst Biol 2016; 10(Suppl 3): 62.
59 Calin GA, Liu CG, Sevignani C, et al. MicroRNA profiling reveals distinct signatures in B cell chronic lymphocytic leukemias. Proc Natl Acad Sci USA 2004; 101: 11755–60.
60 Calin GA, Croce CM. MicroRNA signatures in human cancers. Nat Rev Cancer 2006; 6(11): 857–66.
61 Calin GA, Garzon R, Cimmino A, et al. MicroRNAs and leukemias: how strong is the connection? Leuk Res 2006; 30(6): 653–5.
62 Cantini L, Caselle M, Forget A, et al. A review of computational approaches detecting microRNAs involved in cancer. Front Biosci 2017; 22: 1774–91.
63 Lu J, Getz G, Miska EA, et al. MicroRNA expression profiles classify human cancers. Nature 2005; 435(7043): 834–8.
64 Feuk L, Carson AR, Scherer SW. Structural variation in the human genome. Nat Rev Genet 2006; 7(2): 85–97.
65 Cook EH Jr, Scherer SW. Copy-number variations associated with neuropsychiatric conditions. Nature 2008; 455(7215): 919–23.
66 Gonzalez E, Kulkarni H, Bolivar H, et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science 2005; 307(5714): 1434–40.
67 Le Maréchal C, Masson E, Chen JM, et al. Hereditary pancreatitis caused by triplication of the trypsinogen locus. Nat Genet 2006; 38(12): 1372.
68 Kallioniemi OP, Kallioniemi A, Piper J, et al. Optimizing comparative genomic hybridization for analysis of DNA sequence copy number changes in solid tumors. Genes Chromosomes Cancer 1994; 10(4): 231–43.
69 Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004; 305(5683): 525–8.
70 Baylin SB. DNA methylation and gene silencing in cancer. Nat Clin Pract Oncol 2005; 2: S4–S11.
71 Frommer M, McDonald LE, Millar DS, et al. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci USA 1992; 89(5): 1827–31.
72 Huang THM, Perry MR, Laux DE. Methylation profiling of CpG islands in human breast cancer cells. Hum Mol Genet 1999; 8: 459–70.
73 Kwon MS, Kim Y, Lee S, et al. Integrative analysis of multi-omics data for identifying multi-markers for diagnosing pancreatic cancer. BMC Genomics 2015; 16: S4.
74 Zhao Q, Shi X, Xie Y, et al. Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA. Brief Bioinform 2015; 16: 291–303.
75 Wang Y. Development of cancer diagnostics—from biomarkers to clinical tests. Transl Cancer Res 2015; 4: 270–9.
76 Corless CL, Spellman PT. Tackling formalin-fixed, paraffin-embedded tumor tissue with next-generation sequencing. Cancer Discov 2012; 2(1): 23–4.
77 Lalkhen AG, McCluskey A. Clinical tests: sensitivity and specificity. Contin Educ Anaesth Crit Care Pain 2008; 8(6): 221–3.
78 Linnet K, Bossuyt PMM, Moons KGM, et al. Quantifying the accuracy of a diagnostic test or marker. Clin Chem 2012; 58(9): 1292–301.
79 Prokopec SD, Watson JD, Waggott DM, et al. Systematic evaluation of medium-throughput mRNA abundance platforms. RNA 2013; 19(1): 51–62.
80 Kulkarni MM. Digital multiplexed gene expression analysis using the NanoString nCounter system. Curr Protoc Mol Biol 2011; Chapter 25: Unit25B.10.
81 Payton JE, Grieselhuber NR, Chang LW, et al. High throughput digital quantification of mRNA abundance in primary human acute myeloid leukemia samples. J Clin Invest 2009; 119(6): 1714–26.
82 Kononen J, Bubendorf L, Kallioniemi A, et al. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med 1998; 4: 844–7.
83 Rimm DL, Camp RL, Charette LA, et al. Amplification of tissue by construction of tissue microarrays. Exp Mol Pathol 2001; 70: 255–64.
84 Schmidt LH, Biesterfeld S, Kümmel A, et al. Tissue microarrays are reliable tools for the clinicopathological characterization of lung cancer tissue. Anticancer Res 2009; 29: 201–9.
85 Camp RL, Neumeister V, Rimm DL. A decade of tissue microarrays: progress in the discovery and validation of cancer biomarkers. J Clin Oncol 2008; 26(34): 5630–7.
86 Hoos A, Cordon-Cardo C. Tissue microarray profiling of cancer specimens and cell lines: opportunities and limitations. Lab Invest 2001; 81: 1331–8.
87 Xu R, Wunsch DC. Clustering algorithms in biomedical research: a review. IEEE Rev Biomed Eng 2010; 3: 120–54.
88 Cancer Genome Atlas Research Network. Comprehensive molecular characterization of urothelial bladder carcinoma. Nature 2014; 507: 315–22.
89 Dai X, Li T, Bai Z, et al. Breast cancer intrinsic subtype classification, clinical use and future trends. Am J Cancer Res 2015; 5: 2929–43.
90 Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 2004; 1(1): 24–45.
91 Cheng Y, Church GM. Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 2000; 8: 93–103.
92 Gan X, Liew AW-C, Yan H. Discovering biclusters in gene expression data based on high-dimensional linear geometries. BMC Bioinformatics 2008; 9(1): 209.
93 Zhao H, Liew AW-C, Xie X, et al. A new geometric biclustering algorithm based on the Hough transform for analysis of large-scale microarray data. J Theor Biol 2008; 251: 264–74.
94 Mankad S, Michailidis G. Biclustering three-dimensional data arrays with plaid models. J Comput Graph Stat 2014; 23: 943–65.
95 Narmadha N, Rathipriya R. Triclustering: an evolution of clustering. In: 2016 Online International Conference on Green Engineering and Technologies (IC-GET). IEEE, Coimbatore, India. 2016, 1–4.
96 Li Y, Ngom A. Classification of clinical gene-sample-time microarray expression data via tensor decomposition methods. In: Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer-Verlag Berlin, Heidelberg, Palermo, Italy, 2011, 275–86.
97 Luo Y, Wang F, Szolovits P. Tensor factorization toward precision medicine. Brief Bioinform 2017; 18: 511–4.
98 Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Series B Stat Methodol 2001; 63: 411–23.
99 Brunet J-P, Tamayo P, Golub TR, et al. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA 2004; 101(12): 4164–9.
100 Vega-Pons S, Ruiz-Shulcloper J. A survey of clustering ensemble algorithms. Int J Pattern Recognit Artif Intell 2011; 25(03): 337–72.
101 Monti S, Tamayo P, Mesirov J, et al. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 2003; 52: 91–118.
102 Mukhopadhyay A, Bandyopadhyay S, Maulik U. Multi-class clustering of cancer subtypes through SVM based ensemble of pareto-optimal solutions for gene marker identification. PLoS One 2010; 5(11): e13803.
103 Wang X, Markowetz F, De Sousa E, Melo F, et al. Dissecting cancer heterogeneity–an unsupervised classification approach. Int J Biochem Cell Biol 2013; 45: 2574–9.
104 Lex A, Streit M, Schulz H-J, et al. StratomeX: visual Analysis of Large-Scale Heterogeneous Genomics Data for Cancer Subtype Characterization. Comput Graph Forum 2012; 31: 1175–84.
105 Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005; 102: 15545–50.
106 Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000; 25: 25–9.
107 Kanehisa M, Goto S, Hattori M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006; 34(90001): D354–7.
108 Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958; 53: 457–81.
109 Mantel N. Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep 1966; 50: 163–70.
110 Shen T, Pajaro-Van de Stadt SH, Yeat NC, et al. Clinical applications of next generation sequencing in cancer: from panels, to exomes, to genomes. Front Genet 2015; 6: 215.
111 Peyser ND, Grandis JR. Cancer genomics: spot the difference. Nature 2017; 541(7636): 162–3.
112 Voduc D, Kenney C, Nielsen TO. Tissue microarrays in clinical oncology. Semin Radiat Oncol 2008; 18(2): 89–97.
113 Terry J, Saito T, Subramanian S, et al. TLE1 as a diagnostic immunohistochemical marker for synovial sarcoma emerging from gene expression profiling studies. Am J Surg Pathol 2007; 31: 240–6.
114 Hans CP, Weisenburger DD, Greiner TC, et al. Confirmation of the molecular classification of diffuse large B-cell lymphoma by immunohistochemistry using a tissue microarray. Blood 2004; 103(1): 275–82.
115 Yersal O, Barutca S. Biological subtypes of breast cancer: prognostic and therapeutic implications. World J Clin Oncol 2014; 5: 412–24.
116 van 't Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002; 415(6871): 530–6.
117 Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 2004; 351(27): 2817–26.
118 Toussaint J, Sieuwerts AM, Haibe-Kains B, et al. Improvement of the clinical applicability of the Genomic Grade Index through a qRT-PCR test performed on frozen and formalin-fixed paraffin-embedded tissues. BMC Genomics 2009; 10: 424.
119 Sotiriou C, Wirapati P, Loi S, et al. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006; 98(4): 262–72.
120 Wirapati P, Sotiriou C, Kunkel S, et al. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Res 2008; 10: R65.
121 Rouzier R, Perou CM, Symmans WF, et al. Breast cancer molecular subtypes respond differently to preoperative chemotherapy. Clin Cancer Res 2005; 11: 5678–85.
122 Ross DT, Scherf U, Eisen MB, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 2000; 24(3): 227–35.
123 Gao H, Korn JM, Ferretti S, et al. High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response. Nat Med 2015; 21: 1318–25.
124 Kao J, Salari K, Bocanegra M, et al. Molecular profiling of breast cancer cell lines defines relevant tumor models and provides a resource for cancer gene discovery. PLoS One 2009; 4(7): e6146.
125 Kim N, He N, Yoon S. Cell line modeling for systems medicine in cancers (Review). Int J Oncol 2014; 44: 371–6.
126 Shoemaker RH, Monks A, Alley MC, et al. Development of human tumor cell line panels for use in disease-oriented drug screening. Prog Clin Biol Res 1987; 276: 265–86.
127 Garnett MJ, Edelman EJ, Heidorn SJ, et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 2012; 483(7391): 570–5.
128 Workman P, Kaye SB. Translating basic cancer research into new cancer therapeutics. Trends Mol Med 2002; 8(4): S1–9.
129 Clarke PA, te Poele R, Workman P. Gene expression microarray technologies in the development of new therapeutic agents. Eur J Cancer 2004; 40: 2560–91.
130 Hijazi H, Wu M, Nath A, et al. Ensemble classification of cancer types and biomarker identification. Drug Dev Res 2012; 73: 414–19.
131 Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005; 365(9458): 488–92.
132 Hudson TJ, Anderson W, Aretz A, et al. International network of cancer genome projects. Nature 2010; 464(7291): 993–8.
133 Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol 2015; 19(1A): A68.
134 Barrett T. Gene Expression Omnibus (GEO). 2013. https://www.ncbi.nlm.nih.gov/books/NBK159736/.
135 Neoptolemos JP, Stocken DD, Friess H, et al. A randomized trial of chemoradiotherapy and chemotherapy after resection of pancreatic cancer. N Engl J Med 2004; 350: 1200–10.
136 Stratford JK, Bentrem DJ, Anderson JM, et al. A six-gene signature predicts survival of patients with localized pancreatic ductal adenocarcinoma. PLoS Med 2010; 7(7): e1000307.
137 Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007; 8(1): 118–27.
138 Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 2007; 3(9): e161.
139 Benito M, Parker J, Du Q, et al. Adjustment of systematic microarray data biases. Bioinformatics 2004; 20(1): 105–14.
140 Emmert-Buck MR, Bonner RF, Smith PD, et al. Laser capture microdissection. Science 1996; 274(5289): 998–1001.
141 Yoshihara K, Shahmoradgoli M, Martínez E, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun 2013; 4: 2612.
142 Song S, Nones K, Miller D, et al. qpure: a tool to estimate tumor cellularity from genome-wide single-nucleotide polymorphism profiles. PLoS One 2012; 7(9): e45835.
143 Bhome R, Bullock MD, Al Saihati HA, et al. A top-down view of the tumor microenvironment: structure, cells and signaling. Front Cell Dev Biol 2015; 3: 33.
144 Gajewski TF, Schreiber H, Fu Y-X. Innate and adaptive immune cells in the tumor microenvironment. Nat Immunol 2013; 14: 1014–22.
145 Jiménez-Sánchez A, Memon D, Pourpe S, et al. Heterogeneous tumor-immune microenvironments among differentially growing metastases in an ovarian cancer patient. Cell 2017; 170: 927–38.e20.
146 Orimo A, Weinberg RA. Heterogeneity of stromal fibroblasts in tumor. Cancer Biol Ther 2007; 6(4): 618–9.
147 Bergamaschi A, Tagliabue E, Sørlie T, et al. Extracellular matrix signature identifies breast cancer subgroups with different clinical outcome. J Pathol 2008; 214(3): 357–67.
148 Pickup MW, Mouw JK, Weaver VM. The extracellular matrix modulates the hallmarks of cancer. EMBO Rep 2014; 15(12): 1243–53.
149 Pages F, Galon J, Dieu-Nosjean MC, et al. Immune infiltration in human tumors: a prognostic factor that should not be ignored. Oncogene 2010; 29(8): 1093–102.
150 Frantz C, Stewart KM, Weaver VM. The extracellular matrix at a glance. J Cell Sci 2010; 123(Pt 24): 4195–200.
151 Allinen M, Beroukhim R, Cai L, et al. Molecular characterization of the tumor microenvironment in breast cancer. Cancer Cell 2004; 6(1): 17–32.
152 Guinney J, Dienstmann R, Wang X, et al. The consensus molecular subtypes of colorectal cancer. Nat Med 2015; 21: 1350–6.
153 Lehmann BD, Bauer JA, Chen X, et al. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Invest 2011; 121(7): 2750.
154 Sørlie T, Tibshirani R, Parker J, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 2003; 100(14): 8418–23.
155 Hu Z, Fan C, Oh DS, et al. The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genomics 2006; 7: 96.
156 Ringnér M, Jönsson G, Staaf J. Prognostic and chemotherapy predictive value of gene-expression phenotypes in primary lung adenocarcinoma. Clin Cancer Res 2016; 22: 218–29.
157 Haibe-Kains B, Desmedt C, Loi S, et al. A three-gene model to robustly identify breast cancer molecular subtypes. J Natl Cancer Inst 2012; 104(4): 311–25.
158 Weigelt B, Mackay A, A'hern R, et al. Breast cancer molecular profiling with single sample predictors: a retrospective analysis. Lancet Oncol 2010; 11(4): 339–49.
159 Lusa L, McShane LM, Reid JF, et al. Challenges in projecting clustering results across gene expression–profiling datasets. J Natl Cancer Inst 2007; 99(22): 1715–23.
160 Guiu S, Michiels S, Andre F, et al. Molecular subclasses of breast cancer: how do we define them? The IMPAKT 2012 Working Group Statement. Ann Oncol 2012; 23: 2997–3006.
161 Sørlie T, Borgan E, Myhre S, et al. The importance of gene-centring microarray data. Lancet Oncol 2010; 11: 719–20.
162 Allison DB, Page GP, Beasley TM, et al. DNA Microarrays and Related Genomics Techniques: Design, Analysis, and Interpretation of Experiments. 2005. https://books.google.com.hk/books?hl=en&lr=&id=TUrMBQAAQBAJ&oi=fnd&pg=PP1&dq=DNA+Microarrays+and+Related+Genomics+Techniques:+Design,+Analysis,+and+Interpretation+of+Experiments&ots=eY-ZofXdvd&sig=17rgrkJzuOYz-TydzaTfLthxwyM&redir_esc=y#v=onepage&q=DNA%20Microarrays%20and%20Related%20Genomics%20Techniques%3A%20Design%2C%20Analysis%2C%20and%20Interpretation%20of%20Experiments&f=false.
163 Mantione KJ, Kream RM, Kuzelova H, et al. Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq. Med Sci Monit Basic Res 2014; 20: 138–42.
164 Khansarinejad B, Soleimanjahi H, Mirab Samiee S, et al. Monitoring human cytomegalovirus infection in pediatric hematopoietic stem cell transplant recipients: using an affordable in-house qPCR assay for management of HCMV infection under limited resources. Transpl Int 2015; 28: 594–603.
165 Pires ARC, Andreiuolo F da M, de Souza SR. TMA for all: a new method for the construction of tissue microarrays without recipient paraffin block using custom-built needles. Diagn Pathol 2006; 1: 14.
166 SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol 2014; 32: 903–14.
167 Singh A, Sau AK. Tissue microarray: a powerful and rapidly evolving tool for high-throughput analysis of clinical specimens. IJCRI 2010; 1: 1–11.
168 Łabaj PP, Kreil DP. Sensitivity, specificity, and reproducibility of RNA-Seq differential expression calls. Biol Direct 2016; 11: 66.
169 Figueroa ME, Lugthart S, Li Y, et al. DNA methylation signatures identify biologically distinct subtypes in acute myeloid leukemia. Cancer Cell 2010; 17: 13–27.
170 Marziali G, Buccarelli M, Giuliani A, et al. A three-microRNA signature identifies two subtypes of glioblastoma patients with different clinical outcomes. Mol Oncol 2017; 11: 1115–1129.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Briefings in Bioinformatics, Oxford University Press

Molecular subtyping of cancer: current status and moving toward clinical applications

Publisher
Oxford University Press
ISSN
1467-5463
eISSN
1477-4054
D.O.I.
10.1093/bib/bby026

Abstract

Cancer is a collection of genetic diseases, with large phenotypic differences and genetic heterogeneity between different types of cancers and even within the same cancer type. Recent advances in genome-wide profiling provide an opportunity to investigate global molecular changes during the development and progression of cancer. Meanwhile, numerous statistical and machine learning algorithms have been designed for the processing and interpretation of high-throughput molecular data. Molecular subtyping studies have allowed the allocation of cancer into homogeneous groups that are considered to harbor similar molecular and clinical characteristics. Furthermore, this has helped researchers to identify both actionable targets for drug design as well as biomarkers for response prediction. In this review, we introduce five frequently applied techniques for generating molecular data, which are microarray, RNA sequencing, quantitative polymerase chain reaction, NanoString and tissue microarray. Commonly used molecular data for cancer subtyping and clinical applications are discussed. Next, we summarize a workflow for molecular subtyping of cancer, including data preprocessing, cluster analysis, supervised classification and subtype characterizations. Finally, we identify and describe four major challenges in the molecular subtyping of cancer that may preclude clinical implementation. We suggest that standardized methods should be established to help identify intrinsic subgroup signatures and build robust classifiers that pave the way toward stratified treatment of cancer patients.

Keywords: cancer, heterogeneity, subtyping, subtypes, challenges

Introduction

Cancer is a large group of genetic diseases that are currently classified by their primary site of origin, such as brain cancer, breast cancer and lung cancer. However, not all cancers of an organ are the same, and genetic heterogeneity exists between and within cancers [1–6].
A major cause of this heterogeneity is genomic instability [7], which can act at the single-nucleotide level or at much larger scales [8]. This poses significant challenges to the efficacy of currently applicable targeted therapies and complicates the development of future treatment strategies [7]. Because of this, there is a great need to classify cancer into homogeneous groups that are associated with distinct molecular features and clinical outcomes and that allow the development of subgroup-specific therapies.

The traditional classification of cancer has been carried out by pathologists based on histological appearance and site of growth. This only partially reflects the true heterogeneous character of cancer. Recent advances in genome-wide profiling techniques [9, 10] have allowed researchers to generate large-scale genomic data and classify cancer into more homogeneous groups [11, 12]. Genomic data have been used in subtyping studies of many cancers, including leukemia [13], lymphoma [14], nasopharyngeal carcinoma (NPC) [15], breast [16], lung [17], liver [18], pancreas [19], colon [20] and soft tissue sarcomas [21]. Various machine learning algorithms [22–26] have also been developed for better prediction of cancer subtypes. Molecular subtyping studies have allowed the classification of cancer into uniform groups that correlate better with clinical outcomes than the traditional classifications of cancer [27]. In summary, molecular classification can provide diagnostic, prognostic and therapeutic options for the treatment of cancers.

This review is organized as follows. In the ‘Molecular subtyping of cancer’ section, we first introduce two techniques that are used for cancer subtyping: microarray and RNA sequencing (RNA-Seq).
Next, we introduce three other frequently applied techniques for generating low- and medium-throughput molecular data, namely quantitative polymerase chain reaction (qPCR), NanoString and tissue microarray (TMA), and we discuss their applications in clinical tests. Subtype identifications and characterizations, which are the two important aspects involved in the subtyping process, are also discussed. In the ‘Moving toward clinical applications’ section, we illustrate potential clinical applications of cancer subtyping studies for diagnosis, prognosis, predicting therapy response and drug design. In the ‘Challenges’ section, we identify and describe four major challenges in the molecular subtyping of cancer that may preclude clinical implementation, and finally, in the ‘Conclusions and outlook’ section, we provide concluding remarks and recommendations.

Molecular subtyping of cancer

Recent advances in genome-wide profiling techniques have allowed the generation of large-scale genomic data, and various statistical and machine learning algorithms have been developed for processing and interpretation of such data [23–25, 28–31]. Molecular subtyping of cancer, as its name suggests, is a new way to classify cancers into different groups based on molecular data and classification models. In contrast to the traditional histological classification of cancer, molecular classification relies on biomarkers and classifiers. Biomarkers can be informative genes, microRNAs (miRNAs), DNA methylation markers and others [32]. Classifiers can be built by machine learning algorithms, such as Prediction Analysis for Microarrays (PAM), Support Vector Machines (SVMs) and more [33]. In the following, we introduce the different molecular data types and their applications, followed by a workflow for unsupervised classification of cancer.
High-throughput molecular data for cancer subtyping

Gene expression profiling data for cancer subtyping

Microarray and RNA-Seq are two common profiling techniques for generating high-throughput gene expression data. Microarrays are capable of profiling expression patterns for tens of thousands of selected genes in a single assay [9]. RNA-Seq is a sequencing-based method that measures gene abundance across the entire genome. RNA-Seq has numerous advantages over microarray [34]. First, unlike hybridization-based microarrays, RNA-Seq provides more accurate detection of gene expression. Second, RNA-Seq can detect novel transcripts, single-nucleotide variants and other (as yet) unknown changes that microarray cannot detect. Finally, RNA-Seq has a low background signal and consequently a large dynamic range. Microarray has been the most commonly used technique to generate large-scale molecular data for several decades [35]. With the rapid development of sequencing and analysis techniques, sequencing costs are expected to decrease dramatically and more statistical tools will be developed for RNA-Seq, so RNA-Seq will likely replace microarray [36]. Compared with other molecular profiling techniques, microarray and RNA-Seq are the most accurate, reliable and robust, but they are also expensive, time-consuming and dependent on sample quality (Table 1). They are commonly used in the initial biomarker identification process; once biomarkers have been identified, other techniques are preferred.

Table 1.
A comparison of different techniques for molecular profiling of cancer

| Characteristic / Platform | Microarray | RNA sequencing | qPCR | NanoString | Tissue microarray |
| --- | --- | --- | --- | --- | --- |
| Accuracy [37, 166, 167] | Median | Median | High | High | Low |
| Sensitivity [36, 167, 39] | Median | High | High | High | Low |
| Specificity [38, 167, 168] | Median | Median | High | High | Low |
| Speed [36, 167] | Slow | Slow | Fast | Median | Slow |
| Cost (per sample) | $300 [163] | $1000 [163] | $280 [164] | $800 [39] | $100 [165] |
| Sample requirement [167] | FFPE/fresh-frozen | FFPE/fresh-frozen | Fresh-frozen | FFPE/fresh-frozen | FFPE |
| Genome-wide coverage | Yes | Yes | No | No | No |
| Quantitative | Yes | Yes | Yes | Yes | No |
| Single-base resolution | No | Yes | No | No | No |
| Low sample input | No | No | Yes | No | Yes |
| Reproducibility [168] | Median | Median | High | High | Low |

Gene expression-based subtyping of cancer was first proposed by Golub et al. [13] in leukemia. The expression pattern of the 50 most informative genes was measured and a two-cluster self-organizing map (SOM) clustering method was applied [40] to group 38 samples into two classes: acute myeloid leukemia and acute lymphoblastic leukemia with accuracy of 100%. This demonstrated the fidelity of cancer subtyping based solely on gene expression patterns [13]. Gene expression-based subtyping now has been extended to include many cancer types [11, 14, 16, 17, 19, 21, 41].
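To make the idea of a two-cluster SOM more concrete, the following is a minimal sketch in Python of a one-dimensional self-organizing map with two nodes. It is not the implementation or the data used by Golub et al.: the toy expression values, the function names `train_som_1d` and `best_matching_unit`, the initialization from the first and last samples and the linear decay schedules are all assumptions made for illustration.

```python
import math

def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to sample x."""
    dists = [sum((wk - xk) ** 2 for wk, xk in zip(w, x)) for w in weights]
    return dists.index(min(dists))

def train_som_1d(samples, n_nodes=2, epochs=30, lr0=0.5, sigma0=0.5):
    """Train a tiny 1-D SOM (n_nodes >= 2) and return the node weights."""
    # Assumed initialization: nodes start at samples spread over the list.
    step = (len(samples) - 1) / (n_nodes - 1)
    weights = [list(samples[round(i * step)]) for i in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.01   # neighborhood width shrinks
        for x in samples:
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighborhood on the node grid: the winning node
                # moves most, its neighbor moves a little.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
    return weights

# Hypothetical expression values: rows are samples, columns are three genes.
samples = [
    [0.1, 0.2, 5.0], [0.3, 0.1, 5.2], [0.2, 0.3, 4.9],  # made-up class 1
    [5.1, 4.8, 0.2], [4.9, 5.2, 0.1], [5.0, 5.0, 0.3],  # made-up class 2
]
weights = train_som_1d(samples)
labels = [best_matching_unit(weights, s) for s in samples]
print(labels)
```

Each sample is assigned to the node it is closest to after training, so the two nodes play the role of the two clusters. On real data the informative genes would first have to be selected, which this sketch does not attempt.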
Multi-platform profiling data for cancer subtyping

In addition to gene expression profiling, there are many other molecular profiling data types, such as mutation, miRNA expression, copy number variation (CNV) and DNA methylation, which can be used to identify and characterize cancer subtypes (Table 2) [43, 44, 50, 52–55]. As all cancers arise as a result of DNA sequence changes [56], gene mutation patterns are informative and provide a likely basis on which to stratify cancer patients into homogeneous groups [57, 58]. MiRNAs are small noncoding RNAs about 20–22 nucleotides in length that play key roles in the regulation of gene expression. Alterations of miRNA expression are involved in the initiation and progression of human cancer [59–61]. MiRNA expression profiling has now been used as a tool for studying cancer onset and for subtyping [15, 62]. Compared with mRNAs, miRNAs are more stable, and only a small number of miRNAs (∼200 in total) is sufficient to classify human cancers [63]. CNVs are structural variations and genomic alterations that affect DNA sequence lengths ranging from approximately 1 Kb to 3 Mb [64]. CNVs are associated with many complex diseases such as neuropsychiatric disorders [65], HIV [66], familial pancreatitis [67] and cancers [68, 69]. Comparative genomic hybridization (CGH) can be used to detect CNVs at the genome-wide level, and array-based CGH can increase the resolution for better genomic studies. Epigenetic changes such as DNA methylation also play a significant role in the development and progression of cancer [70]. Bisulfite sequencing [71] and differential methylation hybridization [72] can be used to scan gene methylation status at the genome-wide level.

Table 2.
Molecular subtyping studies mentioned in the review

| Cancer type | Discovery sample size | Molecular data type | Clustering method | Determinative score | Number of subtypes | Classification method | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Breast cancer | 65 | mRNA | Hierarchical clustering | NA | 4 | NA | Perou et al. [16] |
| Breast cancer | 85 | mRNA | Hierarchical clustering | NA | 5 | NA | Sorlie et al. [42] |
| Breast cancer | 825 | Five platforms | Cluster of clusters | NA | 4 | NA | TCGA [43] |
| Breast cancer | 2,000 | mRNA + CNV | iCluster | ARI | 10 | PAM | Curtis et al. [44] |
| CRC | 62 | mRNA | Iterative NMF | Cophenetic coefficient | 5 | NA | Schlicker et al. [45] |
| CRC | 443 | mRNA | Orig. cons. clustering | CDF area | 6 | Centroid-based | Marisa et al. [46] |
| CRC | 90 | mRNA | Orig. cons. clustering | Gap statistic | 3 | PAM | De Sousa E Melo et al. [20] |
| CRC | 445 | mRNA | NMF cons. clustering | Cophenetic coefficient | 5 | PAM | Sadanandam et al. [47] |
| CRC | 1,113 | mRNA | Orig. cons. clustering | Dynamic cut tree | 5 | Multiclass LDA | Budinska et al. [48] |
| CRC | 188 | mRNA | k-means | NA | 3 | Single-sample centroid based | Roepman et al. [49] |
| CRC | 4,151 | mRNA | Markov Cluster Algorithm | Inflation factor | 4 | Random Forest | Guinney et al. [11] |
| PDAC | 185 | miRNA | Hierarchical clustering | CDF area | 2 | SVM | Bauer et al. [50] |
| PDAC | 66 | mRNA | NMF cons. clustering | Cophenetic coefficient | 3 | NTP | Collisson et al. [19] |
| PDAC | 223 | mRNA | NMF cons. clustering | Cophenetic coefficient | 2 | Rank-based classifier | Moffitt et al. [51] |
| Pancreatic cancer | 96 | mRNA | NMF cons. clustering | Cophenetic coefficient | 4 | NA | Bailey et al. [12] |
| Leukemia | 38 | mRNA | SOM | NA | 2 | NA | Golub et al. [13] |
| Leukemia | 200 | Methylation | PCA | NA | 16 | NA | Figueroa et al. [169] |
| Lymphoma | 42 | mRNA | Hierarchical clustering | NA | 2 | NA | Alizadeh et al. [14] |
| GBM | 35 | miRNA | PCA | Ratio of intracluster to intercluster correlation | 2 | LDA | Marziali et al. [170] |
| Lung | 67 | mRNA | Hierarchical clustering | NA | 4 | NA | Garber et al. [17] |
| 12 cancer types | 3,527 | Five platforms | COCA | NA | 11 | NA | Hoadley et al. [32] |

Note: ARI, adjusted Rand index; No., number; COCA, Cluster-Of-Cluster-Assignments; iCluster, integrative clustering framework; LDA, linear discriminant analysis; NTP, nearest template prediction; Orig. cons., original consensus; PCA, principal component analysis.

Integrated analysis of multiple genomic data types, as in studies combining gene expression with CNV [44], miRNA with gene expression [73] or five platforms [32], can provide even better insights into tumor biology, and more accurate predictions, than analysis at a single molecular level [74]. With the advances in high-throughput profiling technologies, the cost per sample is decreasing; thus, multi-platform identification and characterization of cancer is likely to become the norm.

Low- and medium-throughput molecular data for clinical test

Biomarkers identified from subtyping studies can be used in clinical practice. In typical clinical settings, only up to several dozen of these predefined biomarkers are measured to minimize the time and expense of the tests [75]. In addition, most cancer specimens are formalin-fixed paraffin-embedded (FFPE), and only a few are freshly prepared or snap frozen [76]. In contrast to the above-mentioned high-throughput approaches, some low- and medium-throughput profiling techniques (such as qPCR, NanoString and TMA) allow meaningful analysis of clinical specimens and are well suited for clinical use of biomarker assays. These techniques are frequently used when fast detection is required and when sample volume and cost must be kept low.

Sensitivity and specificity are the two terms used to evaluate a clinical test.
Sensitivity refers to the ability of a test to correctly identify an individual with the disease; specificity refers to the ability of a test to correctly identify an individual without the disease [77]. Another important measure in the evaluation of a clinical test is its accuracy, which describes the errors that a test produces when differentiating between individuals with and without the disease [78]. In the following, we compare these three techniques (qPCR, NanoString and TMA) in terms of accuracy, sensitivity, specificity and other aspects of concern in a clinical test. Researchers can choose appropriate techniques for their clinical assays based on the comparisons provided in Table 1.

qPCR is commonly used to determine biomarker expression levels or to assess CNVs. Because the PCR amplification step greatly increases the amount of nucleic acid, only a limited sample quantity is needed. Other advantages of qPCR include speed and high sensitivity, specificity and accuracy, which make it the routine method for validating results initially obtained from high-throughput methods such as microarray and RNA-Seq [79]. Compared with other techniques, which can assay hundreds to thousands of biomarkers, qPCR-based assays can only handle a limited number of biomarkers in a single test. qPCR-based tests also require high-quality nucleic acids in the sampled material, so fresh-frozen tissues are typically required.

The NanoString nCounter analysis system can be used to measure expression levels of up to 800 genes [80]. Developed by Geiss et al. [39], the nCounter system is more sensitive than microarrays and similar in sensitivity to qPCR [39]. This technology uses digital molecular barcoding and microscopic imaging to detect and quantify the expression levels of genes in a single assay without enzymatic reactions [39, 81]. Other advantages of this technique include high accuracy and specificity [38].
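The definitions of sensitivity and specificity given earlier in this section can be written out as simple ratios of confusion-matrix counts. The sketch below does so in Python. The text does not give a formula for accuracy, so the code assumes the common convention that accuracy is the fraction of all individuals classified correctly; the function name `diagnostic_metrics` and the counts are hypothetical.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion counts.

    tp/fn: diseased individuals called positive/negative by the test.
    tn/fp: disease-free individuals called negative/positive by the test.
    """
    sensitivity = tp / (tp + fn)                 # diseased correctly identified
    specificity = tn / (tn + fp)                 # disease-free correctly identified
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # assumed definition, see above
    return sensitivity, specificity, accuracy

# Hypothetical counts for a biomarker assay evaluated on 200 individuals.
sens, spec, acc = diagnostic_metrics(tp=90, fp=5, tn=95, fn=10)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.3f}")
# prints: sensitivity=0.90 specificity=0.95 accuracy=0.925
```

The qualitative ratings in Table 1 (High, Median, Low) summarize comparisons of this kind across platforms rather than a single set of counts.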
Disadvantages of the nCounter system include the high cost of the required reagents and instruments [80].

TMA is a histology-based test, developed by Kononen et al. [82], which allows the analysis of up to 1000 tumor specimens simultaneously in a single paraffin block [37]. Analysis of molecular targets at the DNA, mRNA and protein levels is possible. Once constructed, a TMA block can be sectioned hundreds of times (provided the depth of all cores is sufficient), with each section amenable to biomarker analysis. The most significant advantage of TMA is that all samples on the array are treated in an identical fashion [83]. Another advantage is that TMA is cost-effective (Table 1): only a small amount of reagent is required to analyze all the samples on one slide [83]. Unlike qPCR, which requires fresh-frozen tissues, TMA requires FFPE tissues, which are the major source of material in the clinic. TMA also has limitations. For instance, low sensitivity, specificity and accuracy are typical features of a TMA test [84]. Other disadvantages are that it usually takes several days to obtain the results [85], that only a limited number of analytes can be tested and that the analyzed specimen volume is too small to represent the entire tumor [83]. In addition, the amount of tissue remaining decreases over the course of the TMA staining process [86].

Subtype identifications and characterizations

Molecular subtyping (or molecular classification) is the process of assigning data objects to clusters, so that objects in the same cluster are more similar to each other than to those in other clusters. There are two kinds of classification strategies: supervised (with class labels, such as tumor or normal tissue, known beforehand) and unsupervised (with unlabeled data). Subtyping is used here as a general term for classification, which can be either supervised or unsupervised.
Unsupervised classification is increasingly popular in biomedical research [87], and has been successfully used in many cancer subtyping studies [11, 13, 15, 17, 41, 51, 88, 89]. From these studies, we summarize a workflow for molecular subtyping of cancer. It comprises data preprocessing, cluster analysis, supervised classification and subtype characterizations (Figure 1). In the following, we focus on subtype identifications and characterizations, which are the two important aspects in the workflow.

Figure 1. Molecular subtyping of cancer workflow. The workflow consists of four major steps: (A) Data preprocessing. Array data preprocessing includes image analysis, data normalization and transformation. Next-generation sequencing data preprocessing comprises quality control, read alignment, expression quantification, data normalization and transformation. (B) Cluster analysis. A first feature selection is performed with a cutoff on SD (e.g. SD > 0.8) or median absolute deviation (MAD) (e.g. MAD > 0.5). Clustering is usually applied to either the feature dimension or the sample dimension, biclustering to both dimensions and triclustering to three dimensions (feature, sample and time). After (bi/tri)clustering, the optimal number of (bi/tri)clusters is determined by measures such as gap statistics, cophenetic coefficients and CDF. Also, ensemble and consensus clustering have been proposed to enhance the robustness of (bi/tri)clustering. (C) Supervised classification. To build the best possible classifier, a sample selection (Silhouette width > 0) and a second feature selection (SAM/Limma) are applied. Various algorithms such as PAM, SVM, Random Forests (RF) and K-nearest neighbors can be used to build classifiers. (D) Subtype characterizations. A heatmap is used to represent the molecular characterizations, in which rows are features (genes, miRNAs, pathways, etc.) and columns are samples. Here, features are subtype-specific features; samples are sorted according to their subtype numbers. A Kaplan–Meier survival plot is used to represent the clinical characterizations, in which the x-axis is the survival time and the y-axis is the probability of an event (i.e. death).

Subtype identifications

High-throughput molecular data are usually arranged in matrix form, in which rows are features (genes, miRNAs or DNA methylation markers) and columns are samples. Molecular data matrices have largely been analyzed in two dimensions (2D): the feature dimension and the sample dimension [90]. Clustering is usually applied to either the feature dimension or the sample dimension. However, subsets of features may be active or suppressed only under certain experimental conditions and behave almost independently under other conditions. To identify such local patterns in the data matrix, biclustering (or subspace clustering), which discovers biclusters, was first proposed by Cheng and Church [91]. Various biclustering methods have since been developed to efficiently identify ‘homogeneous’ submatrices in data, such as singular value decomposition [22], nonnegative matrix factorization (NMF) [23] and geometric-based biclustering [92, 93].

With the rapid development of data profiling technologies, it is now possible to profile a number of samples for numerous features across multiple time points or experimental conditions. Such data can be arranged into three-dimensional (3D) matrices, with the first two dimensions representing the samples and features, respectively, and the third dimension representing time or experimental conditions [94]. To find feature groups along the feature–sample–time (or –condition) dimensions, triclustering has been proposed to mine triclusters in the data [95]. As a tensor is a mathematical concept that can be thought of as an organized multidimensional array of numerical values, tensor-based triclustering [96, 97] has become a promising solution for analyzing such longitudinal and spatial data.
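The first feature selection in the workflow (Figure 1, step B) filters the feature-by-sample matrix by variability before clustering. A minimal Python sketch of such a filter follows. The cutoffs SD > 0.8 and MAD > 0.5 are the examples quoted in the figure legend; the toy matrix, the gene names and the function names `mad` and `select_variable_features` are assumptions. The MAD here is unscaled, whereas some statistical packages multiply it by a consistency constant, which would change which features pass a given cutoff.

```python
import statistics

def mad(values):
    """Unscaled median absolute deviation of a list of numbers."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def select_variable_features(matrix, feature_names, cutoff, measure):
    """Keep the features (rows) whose variability measure exceeds the cutoff."""
    return [name for name, row in zip(feature_names, matrix)
            if measure(row) > cutoff]

# Hypothetical log-scale expression matrix: rows are features, columns are samples.
feature_names = ["gene_a", "gene_b", "gene_c"]
matrix = [
    [2.0, 2.1, 1.9, 2.0, 2.1, 2.0],  # nearly constant across samples
    [1.0, 4.0, 1.2, 4.2, 0.9, 3.8],  # clearly bimodal across samples
    [5.0, 5.5, 6.5, 5.2, 6.8, 6.0],  # moderately variable
]

kept_by_mad = select_variable_features(matrix, feature_names, 0.5, mad)
kept_by_sd = select_variable_features(matrix, feature_names, 0.8, statistics.stdev)
print(kept_by_mad, kept_by_sd)
```

In this toy matrix the two criteria disagree on the moderately variable feature, which illustrates why the choice of measure and cutoff should be reported alongside the clustering result.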
The optimal number of clusters is determined by measures such as gap statistics [98], cophenetic coefficients [99] and the cumulative distribution function (CDF). Because cluster analysis methods are based on different algorithms, they yield different results in terms of cluster numbers and assignments [100]. To enhance the robustness of clustering, a method called cluster ensemble has been proposed, which combines results from different runs of clustering methods into a single consensus result [100]. A similar methodology is consensus clustering, which, in conjunction with resampling techniques, provides a way to reach consensus from multiple runs of the same clustering method [101]. The major difference between the two is that ensemble clustering integrates results from multiple clustering methods, whereas consensus clustering resamples the data and runs a single clustering method multiple times. Ensemble and consensus clustering methods are also applicable to biclustering and triclustering, and have been widely used in cancer subtyping studies [19, 20, 46, 102].

Subtype characterizations

Subtype characterizations rely heavily on genomic and clinical data, and one purpose of subtype characterizations is to investigate the associations between the identified subtypes and their molecular and clinical relevance [103]. Subtype characterizations can also help to identify consensus subtypes within and between cancers, which we will cover in detail in the ‘Cancer consensus molecular subtypes’ section. Pathways, mutations, structural variations and methylation patterns can be used as molecular characteristics. Characterizations of cancer subtypes have implications for patient outcome and targeted therapies. Lex et al.
[104] developed an integrative visualization tool called StratomeX, which can help researchers to explore the relationships between subtypes and multiple genomic data types such as gene expression, DNA methylation or copy number data. These genomic data have been discussed in the ‘High-throughput molecular data for cancer subtyping’ section, which can not only be used to identify robust cancer subtypes, but can also help us better understand and interpret the molecular characteristics of the subtypes. In addition, gene set enrichment analysis (GSEA) is usually performed to characterize the biology underlying the identified subtypes. GSEA interprets the expression data at the level of gene sets, groups of genes that share the same biological function, chromosomal location, or regulation [105]. Annotated gene sets with specific biological meanings can be obtained, for example, from Gene Ontology (GO) [106] and KEGG [107] databases. Clinical data include patient’s information such as age, gender, race, tumor grade, tumor size, time of diagnosis, smoking history, treatment strategies, relapse information, follow-up time and so on, which should be well preserved and managed for clinical characterization of the identified subtypes. Moreover, the survival analysis is a widely used method to compare the survival time differences between subtypes. The Kaplan–Meier estimator [108] can be used to generate the survival curve, and the log rank test provides a statistical comparison of two subtypes [109]. Subtype characterizations are necessary and important. Not only do they help us understand more about the subtype characteristics but also provide a subtype validation process. Ideally, there are distinct molecular and clinical characteristics between identified subtypes. Often, subtypes are only statistically different, but not biologically different. In such cases, reclustering and reclassification should be done until more interpretable results are obtained. 
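The survival comparison described above can be sketched in a few lines. The code below is a minimal, standard-library illustration of the Kaplan–Meier estimator and a two-group log-rank test; the function names and the toy follow-up times are assumptions made for illustration, and a real study would normally rely on an established survival analysis package.

```python
from math import erfc, sqrt


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    events[i] is 1 if the death was observed and 0 if the patient was censored.
    """
    curve, surv = [], 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value), 1 df."""
    pooled = zip(times_a + times_b, events_a + events_b)
    death_times = sorted({t for t, e in pooled if e == 1})
    obs_minus_exp, variance = 0.0, 0.0
    for t in death_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    return chi2, erfc(sqrt(chi2 / 2.0))  # chi-square survival function, 1 df


# Hypothetical follow-up times (months) for two subtypes; all deaths observed.
subtype_a = ([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
subtype_b = ([10, 11, 12, 13, 14], [1, 1, 1, 1, 1])

km_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])  # second patient censored
chi2, p_value = logrank_test(*subtype_a, *subtype_b)
print(km_curve, chi2, p_value)
```

The censored patient in the small `kaplan_meier` call leaves the risk set without lowering the curve, which is the behavior that distinguishes this estimator from a naive fraction of survivors.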
Moving toward clinical applications
From high-throughput molecular data and molecular subtyping of cancer to the development of marker panels using low- and medium-throughput methods, clinicians are beginning to embrace cancer subtyping studies and to base treatment decisions for cancer patients on them [110, 111]. In the following, we provide a few examples of subtyping studies that have been applied to cancer diagnosis, prognosis, response prediction and drug design. Specifically, we focus on biomarkers for diagnostic and prognostic purposes in the ‘Biomarkers identified from subtyping studies for cancer diagnosis and prognosis’ section, and on cancer subtypes for therapy response prediction and drug development in the ‘Cancer subtypes for predicting therapy response and drug design’ section.
Biomarkers identified from subtyping studies for cancer diagnosis and prognosis
Biomarkers identified from subtyping studies with specific indications for cancer diagnosis and prognosis are now widely applied in clinical research, and are increasingly combined with conventional histology to improve diagnostic accuracy [112]. For example, TLE1 serves as a diagnostic marker for synovial sarcoma [113], and CD10, BCL6 and MUM1 serve as diagnostic markers for the germinal center B-cell-like (GCB) subtype of lymphoma [114]. Furthermore, biomarkers can be used directly to detect cancer. For instance, Bauer et al. [50] analyzed the complete miRNA repertoire of 136 pancreatic ductal adenocarcinoma (PDAC) samples, 27 pancreatitis samples and 22 normal controls. They used a hierarchical clustering method and an SVM classifier, and found that the analysis of only five miRNAs in blood and tissues can distinguish PDAC from pancreatitis and normal controls, possibly aiding PDAC diagnosis. Several multigene predictors have been developed for breast cancer patients [115]. These include MammaPrint, Oncotype DX and simplified MapQuant Dx.
These predictors are now widely used in the clinic to classify breast cancer patients and treat them accordingly. MammaPrint was the first successfully applied microarray-based prognostic test for breast cancer. MammaPrint uses a 70-gene signature. To identify these genes, hierarchical clustering was used to classify 98 breast cancer patients into good and poor prognosis groups. This was followed by a three-step supervised classification method to reliably stratify good and poor prognostic categories, which yielded 70 prognostic genes for breast cancer [116]. MammaPrint is a US Food and Drug Administration-approved molecular test to predict the risk of breast cancer metastasis. The result of the test can help physicians to determine the appropriate treatment strategy. Most early-stage breast cancer patients receive adjuvant chemotherapy, but only a subset of them benefit from the treatment. Paik et al. [117] thus developed a 21-gene qPCR assay called Oncotype DX. This is a diagnostic test that predicts the likelihood of chemotherapy benefit, and calculates a recurrence score for early-stage breast cancer. Simplified MapQuant Dx is also a qPCR-based prognostic test for breast cancer. It was developed by Toussaint et al. [118], and is based on the expression patterns of four representative genes from the genomic grade index [119] and four reference genes. The prognostic information provided by the test is only applicable to estrogen receptor-positive breast cancer patients [120].
Cancer subtypes for predicting therapy response and drug design
Subtyping studies are potentially well suited to selecting a subset of patients that may benefit from certain drugs or therapies. For instance, Rouzier et al. [121] examined whether the four subtypes of breast cancer [16] respond differently to chemotherapy.
Results showed that the basal-like and ERBB2-overexpressing subtypes are more sensitive to paclitaxel- and doxorubicin-containing preoperative chemotherapy than the luminal and normal-like subtypes. Tumor specimens for laboratory research are often limited in quantity, infiltrated with nontumor cells and sometimes ethical issues apply. Models for cancer, such as cell lines and patient-derived xenografts (PDXs), have been established as in vitro and in vivo platforms that can overcome these shortcomings of tumor specimens, and are now widely used by researchers. For instance, Ross et al. [122] provided molecular characterization of the NCI (National Cancer Institute)-60 cancer cell line panel, and demonstrated that these cell lines correspond to their tumors of origin. Gao et al. [123] established about 1000 PDXs, which provided excellent in vivo platforms to screen novel therapies for cancer patients. Cancer cell lines and PDXs can also be classified into different subtypes, for example, Kao et al. [124] classified 52 commonly used breast cancer cell lines into five subtypes [42], and defined the cell line subtypes that most faithfully capture the known heterogeneity of breast cancer. Moffitt et al. [51] sequenced 37 PDXs from PDAC and demonstrated that these models can recapitulate tumor-specific subtypes. Therefore, cell line and PDX models can provide a great opportunity to investigate subtype-specific therapies as well. Recent developments in high-throughput technologies have allowed large-scale screening of chemicals and drugs on cell line panels [125]. For example, the abovementioned NCI-60 cancer cell line panel [126] has been used as a standard platform on which >40 000 chemicals were screened over the past few decades [125]. Besides, Garnett et al. 
[127] screened a panel of several hundred cancer cell lines with 130 drugs in clinical use and under preclinical investigation, which also provides a powerful strategy to identify subtype-specific cancer therapies and biomarkers to guide such strategies. Drug development is shifting away from cytotoxic agents toward drugs that are designed to target specific molecules that drive malignant progression [128]. This is still a challenging task, but subtype-specific biomarkers can become potential targets for drug design, and should be investigated and validated further [129, 130].
Challenges
We see four major challenges in cancer subtyping studies that preclude clinical implementation (Figure 2). The first is data acquisition, curation and management. The second challenge is tumor microenvironment (TME) heterogeneity. The remaining two challenges are the lack of consensus molecular subtypes and problems with single-sample classification.
Figure 2. Four major challenges in the molecular subtyping of cancer and associated solutions/problems. The first challenge is data acquisition, curation and management. Data from publicly available data sets, such as ICGC, TCGA and GEO, can increase sample size or be used as validation data sets. Low tumor cellularity can be addressed by physical and virtual microdissection. The second challenge is TME heterogeneity. The TME includes immune cells, blood vessels, fibroblasts and ECM, all of which exhibit heterogeneity at some level. The third challenge is the lack of consensus molecular subtypes. Currently, we only have three examples of consensus subtyping studies: colorectal cancer, breast cancer and the TCGA’s pan-cancer study. The last challenge is the problem with single-sample classification. Currently applied SSPs may yield inconsistent classification results.
Data acquisition, curation and management
Many cancer subtyping studies use a multiple random training-validation strategy [131], in which a training data set is used to identify molecular signatures, and validation data sets are used to validate the classification performance. Normally, researchers use their own data set as the training data set, and use publicly available data sets as their validation data sets. Publicly available resources, such as the International Cancer Genome Consortium (ICGC, www.icgc.org) and The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/), contain coordinated large-scale cancer genomic data that can be accessed online. ICGC holds genomic, transcriptomic, epigenomic and clinical data from 50 different cancer types and subtypes. Currently, data for >25 000 tumor genomes are available on the ICGC website [132]. TCGA also contains a collection of cancer genomic data, and so far, >30 human tumor types have been analyzed through large-scale genome sequencing of 11 000 patient samples [133].
In addition, Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is a public repository that archives and freely distributes gene expression data from numerous studies [134]. Researchers can upload their own data to GEO or download data from GEO as validation data sets. Subtyping studies typically use study cohorts ranging from dozens to more than a few hundred tumors (Table 2). Identification of cancer subtypes has been frustrated by a lack of tumor samples available for study [19]. For instance, because <20% of PDAC patients have resectable tumors at the time of diagnosis, material for profiling is typically limited [135]. Some studies have overcome this problem by integrating different sources of data to increase sample size [19, 136]. The batch effects (or nonbiological differences) that such integration introduces can be removed by methods like empirical Bayes [137], surrogate variable analysis [138] or Distance Weighted Discrimination [139]. Another common problem is the low tumor cellularity of patient samples, which makes the molecular data noisy and makes tumor-specific patterns hard to capture. Because of the tight connection and interaction between cancer cells and surrounding cells, conventional separation techniques, such as laser capture microdissection [140], cannot perfectly separate tumor cells from nontumor cells. Thus, statistical enrichment techniques such as virtual microdissection [51], or mathematical algorithms like ESTIMATE [141] or qpure [142], can be used to assess tumor cellularity and deconvolve tumor-specific contributions. In summary, to dissect the genetic heterogeneity of the tumor cell, molecular and clinical data should be well processed and managed. As there are abundant publicly available data sets and various data processing tools that may be useful for answering such questions, researchers should take full advantage of them.
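As a rough illustration of what batch-effect removal does when data sets are merged, the sketch below subtracts the per-batch mean of every feature. This is deliberately a much simpler stand-in for the empirical Bayes, surrogate variable analysis and Distance Weighted Discrimination methods cited above, not an implementation of any of them; the function name, batch labels and values are hypothetical.

```python
from collections import defaultdict


def center_by_batch(matrix, batches):
    """Subtract, for every feature, the mean of each batch from its samples.

    matrix: list of feature rows, one value per sample.
    batches: batch label of each sample, in the same order as the columns.
    """
    columns_by_batch = defaultdict(list)
    for j, batch in enumerate(batches):
        columns_by_batch[batch].append(j)

    corrected = []
    for row in matrix:
        new_row = list(row)
        for cols in columns_by_batch.values():
            batch_mean = sum(row[j] for j in cols) / len(cols)
            for j in cols:
                new_row[j] = row[j] - batch_mean
        corrected.append(new_row)
    return corrected


# Toy example: two genes, four samples; the "public" batch is shifted by +3.
expression = [
    [1.0, 2.0, 4.0, 5.0],
    [0.5, 1.5, 3.5, 4.5],
]
batch_labels = ["own", "own", "public", "public"]

corrected = center_by_batch(expression, batch_labels)
print(corrected)
```

A caveat of this simple approach is that it also removes genuine biological differences whenever the batches differ in subtype composition, which is one reason the model-based methods cited in the text are generally preferred.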
TME heterogeneity
Heterogeneity exists not only in the tumor cell compartment but also in the TME. The TME is the sum of interactions between tumor cells and the surrounding environment, and plays an important role in tumor development, progression and therapy responses. The TME includes immune cells, blood vessels, fibroblasts and extracellular matrix (ECM). Stroma is part of the TME, and is a histological unit consisting of connective tissue, fat tissue, fibroblasts, ECM and immune cells within an extracellular scaffold [143]. Stroma, as a whole, can be classified into different subtypes with clinical implications. For instance, Moffitt et al. [51] used NMF-based consensus clustering of hundreds of PDAC tumors and cell lines, and identified two stroma subtypes named normal and activated. The activated stroma subtype contributes to poor clinical outcome. Heterogeneity has also been observed in other components of the TME, such as tumor-infiltrating immune cells, fibroblasts and ECM [144–148]. Solid tumors are infiltrated by various immune cells, for example, T and B lymphocytes, mast cells and so on [149]. These immune cells either play a positive role in inhibiting cancer cell growth or are responsible for tumor-associated chronic inflammation. The presence of a T-cell-infiltrated TME can serve as a predictive biomarker for response to immunotherapies [144]. However, in many tumor types, only a subset of patients can generate a tumor antigen-specific T-cell response. The remaining patients lack an appropriate T-cell phenotype and resist immunotherapeutic interventions [144]. Selecting the patients who can potentially benefit from immunotherapies is therefore a challenge. One way to address this problem is to identify T-cell response genes and build a binary gene expression classifier that can distinguish the response group from the nonresponse group. ECM is a collection of extracellular proteins present in all tissues to provide support to that tissue’s cells [150].
Recent studies have found that considerable heterogeneity exists in the ECM, and clinical outcome is often related to ECM characteristics. For instance, Bergamaschi et al. [147] identified 278 ECM-related genes to classify primary breast tumors into four groups (ECM1–4) with distinct clinical outcomes. Although tumor and stromal cells interact closely with each other, stromal cells differ from tumor cells in terms of genetic architecture. Stromal cells are mostly genetically intact [143, 151], which suggests that the stroma could be a target of therapy. Heterogeneity in the characteristics of both tumor cells and the TME raises questions regarding future cancer treatment. Which one of them is easier to target? How do we interpret such 2D heterogeneity, and how are the two related? Can we incorporate them into a single system? These questions remain to be answered in the future.
Cancer consensus molecular subtypes
Currently, there are six subtyping systems for colorectal cancer (CRC) [20, 45–49], which classify CRC into three to six subtypes (Table 2). To identify robust consensus subtypes of CRC, a consensus subtyping effort was initiated. The Colorectal Cancer Subtyping Consortium (CRCSC) developed a network-based approach to investigate the associations between the six independent classification systems. A multi-class classifier was built that could classify CRC into four consensus molecular subtypes (CMS1–4) [152]. CMS1 tumors are highly mutated, microsatellite unstable and show strong immune activation. CMS2 tumors are characterized by marked Wnt and Myc signaling activation. CMS3 cancers are metabolically dysregulated. CMS4 cases feature transforming growth factor-β activation, stromal invasion and angiogenesis signatures.
These consensus results will aid future clinical stratification and subtype-based targeted interventions for CRC, and such collaborations should serve as a role model for other cancer subtyping studies to accelerate our understanding of cancer biology [152] and develop more efficient ways to cure cancers. The use of different patient cohorts, platforms and clustering methods for a specific tumor type, typically yields divergent subtyping results. For breast cancer (Table 2), it was first classified by Perou et al. [16] into four subtypes: luminal, basal-like, normal-like and ERBB2-overexpressing subtypes. Then, Sørlie et al. [42] performed complementary DNA microarrays of 85 breast cancer patients and normal controls, and used hierarchical clustering to classify the patients into one of the five subtypes, i.e. luminal A, luminal B, HER2 over-expression, basal and normal-like. The most recent breast cancer subtyping study by TCGA also suggested four subtypes, which are luminal A, luminal B, HER2-positive and triple-negative subtypes [43]. We can conclude that despite inconsistent naming and number of clusters grouped by different studies [16, 42, 43], breast tumors fall primarily into three major subtypes: luminal, HER2 overexpression and triple-negative breast cancer (TNBC) [89]. The luminal subtype cancer is the most common one and carries a good prognosis. This subset of patients expresses hormone receptors, and this makes them responsive to hormone therapies. The HER2-overexpressing breast cancer subtype is more sensitive to herceptin (trastuzumab) and chemotherapy than the luminal subtype. The TNBC subtype is resistant to standard targeted therapies, and carries the worst prognosis. The next important consideration is the consensus subtyping between cancers. Although there are many cancer types based on their tissue of origin, we can observe similarities between them. The TCGA’s pan-cancer classification study [32] is a good example of this. 
Data from six different ‘omic’ platforms, covering 3527 tumor specimens across 12 cancer types, were analyzed integratively. A unified cancer classification system was constructed, and it identified 11 major subtypes. Among them, five subtypes were strongly associated with their tissue of origin, but the remaining subtypes were not strictly associated with their tissue of origin. For instance, bladder cancers split into three pan-cancer subtypes. Lung squamous, head and neck and a subset of bladder cancers coalesced into a single subtype. This study not only provided a new classification system for multiple cancers but also demonstrated that general characteristics exist between cancers that were traditionally considered to be different entities. Cancer is a complex disease. Without a systematic understanding of the characteristics of the disease, we cannot develop effective therapies against it. The general characteristics within and between cancers provide great opportunities to identify consensus molecular subtypes. For example, basal subtypes are defined in breast cancer [42], bladder cancer [88] and pancreatic cancer [51]. Mesenchymal subtypes are defined in glioblastoma (GBM) [41], NPC [15], breast [153], pancreatic [19] and colon cancers [20]. Basal subtypes usually express genes like laminins and keratins, and have the worst prognosis compared with other subtypes. The characteristics of mesenchymal subtypes include a mesenchymal phenotype, high expression of proliferation genes, poor prognosis, high malignant potential and resistance to current therapies. Thus, devising treatments that are effective against multiple cancer types with shared characteristics may become a promising solution for future cancer treatment.
Single-sample classification
The abovementioned classifiers (or predictors) are mainly built based on a large number of training samples, and for this reason, we call them population-based predictors.
In contrast, single-sample predictors (SSPs) are classification models that can classify a single sample into one of the molecular subtypes of a specific type of cancer [154, 155]. Traditionally, classifying a new sample into a specific subtype with a population-based predictor requires reanalysis of a large data set. By comparison, SSPs can assign a single sample to a specific subtype regardless of other samples, and are therefore more useful and practical for individual patients than population-based predictors. SSPs have been built for several types of cancer. For instance, Sørlie et al. [154] constructed the first SSP for breast cancer, Stratford et al. [136] developed an SSP for PDAC and Ringnér et al. [156] derived an SSP for lung adenocarcinoma. SSPs are constructed based on tumor-intrinsic signatures and similarities between a given sample and molecular subtype centroids [154, 155]. Methods applied in population-based predictors, such as hierarchical clustering and the nearest centroid classification method [157], can be used in an SSP. One of the most important requirements for an SSP is that it cannot be built based on row-centered (mean-centered or median-centered) data [158]. Normally, molecular data matrices contain features in rows and samples in columns. Row-centering is a feature centering process that can help to remove side effects caused by outlier features. The construction of SSPs includes no row-centering step, and studies have found inconsistent classification results caused by SSPs [158–160]. Sørlie et al. [161] accepted Weigelt et al.’s [158] conclusions and comments, and explained why there were inconsistent classification results: for the three one-channel-based data sets, most of the variation was caused by differences between genes, and not so much by differences between samples. As a result, the correlation values vary over a smaller range in the uncentered data.
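The effect of centering on a correlation-based nearest-centroid SSP can be seen in a small sketch. All numbers below are hypothetical: two made-up subtype centroids over five genes, a made-up new sample and per-gene means that are assumed to come from a large reference cohort. The function names are likewise illustrative and do not correspond to any published SSP.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def classify(sample, centroids):
    """Return (best subtype, {subtype: correlation}) for one sample."""
    scores = {name: pearson(sample, c) for name, c in centroids.items()}
    return max(scores, key=scores.get), scores


def center(vector, gene_means):
    """Subtract per-gene reference means (row-centering for one sample)."""
    return [v - m for v, m in zip(vector, gene_means)]


# Hypothetical values; gene_means is assumed to come from a reference cohort.
gene_means = [10.0, 2.0, 8.0, 2.0, 6.0]
centroids_raw = {
    "A": [11.0, 3.0, 7.0, 1.0, 6.0],
    "B": [9.0, 1.0, 9.0, 3.0, 6.0],
}
new_sample = [10.8, 3.2, 7.1, 0.9, 6.1]

# Uncentered: differences between genes dominate every correlation.
label_raw, scores_raw = classify(new_sample, centroids_raw)

# Centered: only deviations from the per-gene reference means remain.
centroids_centered = {k: center(c, gene_means) for k, c in centroids_raw.items()}
label_centered, scores_centered = classify(
    center(new_sample, gene_means), centroids_centered
)

spread_raw = max(scores_raw.values()) - min(scores_raw.values())
spread_centered = max(scores_centered.values()) - min(scores_centered.values())
print(label_raw, spread_raw, label_centered, spread_centered)
```

In this toy case both versions pick the same subtype, but the uncentered correlations to the two centroids are much closer together, so small technical differences between data sets could more easily flip the assignment.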
Therefore, for a sample to be correctly assigned to a subtype, it must be centered against an appropriately large and heterogeneous sample set. Sørlie et al. [161] highlighted the importance of performing row-centering in molecular data-processing steps. In summary, building SSPs is a challenging but important task, and up to now, there are no effective ways to deal with the centering problem. Although current results are not encouraging, we hope that in the near future, applicable SSPs can be developed and applied in the clinic.
Conclusions and outlook
Heterogeneity renders cancer more than a single disease. This poses a significant challenge to the traditional management of cancer. With the advent of genome-wide molecular profiling of cancer, especially the advancements in high-throughput profiling technologies, researchers can now investigate the collection of genomic and epigenomic changes that exist in cancer. In contrast with traditional classification methods, molecular classification can be used to assign cancers to subgroups with distinct molecular characteristics, tumor biology and clinical presentation. The most important step in molecular subtyping of cancer is cluster analysis. Different clustering methods can produce different results, many cluster analyses are unstable, and cluster analysis is a purely exploratory method [162]. It is hard to tell which algorithm is better, as this largely depends on the question asked. Thus, it is important to ensure proper preprocessing and normalization of the data; also, ensemble and consensus clustering methods should be considered when doing the cluster analysis. Another important step is subtype characterization. The identified subtypes should be both statistically significant and biologically relevant. This means that molecular as well as clinical data collection is mandatory to truly characterize the identified subtypes.
Also, publicly available data sets can be used to evaluate the classification performance of the classifiers. Although numerous molecular subtyping studies have been conducted and have identified subtypes for various cancer types, current cancer patient stratification still largely relies on traditional histopathological observation and assessment. We are facing several challenges (Figure 2). The gap between research findings (identified subtypes) and clinical applications can be bridged by the improvement of statistical methods and better interpretation of the results. Once cancers are correctly separated into different subtypes, the next important step is to properly interpret these identified subtypes from a biological point of view, followed by a move toward clinical applications. Given the clinical tests successfully applied in breast cancer, we hope that other cancer types will follow. In summary, cancer should not be treated as a single disease. Molecular subtyping can identify distinct cancer subtypes, which may shed new light on the treatment strategies for cancer patients. Several challenges should be addressed before subtyping can be successfully applied in the clinic.
Key Points
Heterogeneity renders cancer more than a single disease. Molecular subtyping can be used to assign cancers to subgroups with distinct molecular characteristics, tumor biology and clinical presentation. Unsupervised classification schemes have been successfully applied to identify subtypes in a large number of malignancies. From these studies, we summarize a workflow for molecular subtyping of cancer. This includes data preprocessing, cluster analysis, supervised classification and subtype characterizations. We identified and described four major challenges in cancer subtyping studies that preclude clinical implementation. The first is data acquisition, curation and management. The second challenge is TME heterogeneity.
The remaining two challenges are the lack of consensus molecular subtypes, and problems with single-sample classification, respectively. We suggest that standardized methods should be established to help identify intrinsic subgroup signatures and to build robust classifiers that pave the way toward stratified treatment of cancer patients. Lan Zhao is a PhD candidate at the Department of Electronic Engineering, City University of Hong Kong. Her research interests are in the areas of machine learning, cancer genomics and computational biology. Victor H. F. Lee is currently a Clinical Associate Professor of the Department of Clinical Oncology, the University of Hong Kong. His current interests include clinical and genetic studies on nasopharyngeal cancer, head and neck cancers, lung cancers, liver cancers and gastrointestinal cancers. Michael K. Ng is the Head and Chair Professor of the Department of Mathematics, and Chair Professor (Affiliate) of Department of Computer Science at the Hong Kong Baptist University. As an applied mathematician, his main research areas include bioinformatics, data mining, operations research and scientific computing. Hong Yan received his PhD degree from Yale University. He was a Professor of imaging science at the University of Sydney and currently is the chair professor of computer engineering at City University of Hong Kong. His research interests include bioinformatics, image processing and pattern recognition. Maarten F. Bijlsma is an Associate Professor at the Academic Medical Center with the University of Amsterdam. His research focuses on pancreatic and esophageal cancer, from the most fundamental mechanisms that underlie aberrant signaling in these diseases, to the development of serum-borne markers in patient cohorts to predict treatment response and disease outcome. Furthermore, he is a Biomarker/Imaging Program leader for the AMC/VUmc Cancer Center Amsterdam. 
Acknowledgement
The authors thank Xin Wang from Department of Biomedical Sciences of the City University of Hong Kong for comments on an earlier version of the manuscript.
Funding
This work was supported by Hong Kong Research Grants Council (RGC) (Project C1007-15G) and City University of Hong Kong (Project 7004862).
References
1 Campbell PJ, Pleasance ED, Stephens PJ, et al. Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing. Proc Natl Acad Sci USA 2008;105(35):13081–6.
2 Shipitsin M, Campbell LL, Argani P, et al. Molecular definition of breast tumor heterogeneity. Cancer Cell 2007;11(3):259–73.
3 Macintosh CA, Stower M, Reid N, et al. Precise microdissection of human prostate cancers reveals genotypic heterogeneity. Cancer Res 1998;58:23–8.
4 González-García I, Solé RV, Costa J. Metapopulation dynamics and spatial heterogeneity in cancer. Proc Natl Acad Sci USA 2002;99(20):13085–9.
5 Iacobuzio-Donahue CA. Genetic evolution of pancreatic cancer: lessons learnt from the pancreatic cancer genome sequencing project. Gut 2012;61(7):1085–94.
6 Penchev VR, Rasheed ZA, Maitra A, et al. Heterogeneity and targeting of pancreatic cancer stem cells. Clin Cancer Res 2012;18(16):4277–84.
7 Burrell RA, McGranahan N, Bartek J, et al. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 2013;501(7467):338–45.
8 McGranahan N, Swanton C. Biological and therapeutic impact of intratumor heterogeneity in cancer evolution. Cancer Cell 2015;27(1):15–26.
9 Duggan DJ, Bittner M, Chen Y, et al. Expression profiling using cDNA microarrays. Nat Genet 1999;21(Suppl 1):10–14.
10 Metzker ML. Sequencing technologies—the next generation. Nat Rev Genet 2010;11(1):31–46.
11 Guinney J, Dienstmann R, Wang X, et al. The consensus molecular subtypes of colorectal cancer. Nat Med 2015;21(11):1350–6.
12 Bailey P, Chang DK, Nones K, et al. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature 2016;531(7592):47–52.
13 Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531–7.
14 Alizadeh AA, Eisen MB, Davis RE, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000;403(6769):503–11.
15 Zhao L, Fong AHW, Liu N, et al. Molecular subtyping of nasopharyngeal carcinoma (NPC) and a microRNA-based prognostic model for distant metastasis. J Biomed Sci 2018;25:16.
16 Perou CM, Sørlie T, Eisen MB, et al. Molecular portraits of human breast tumours. Nature 2000;406(6797):747–52.
17 Garber ME, Troyanskaya OG, Schluens K, et al. Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci USA 2001;98(24):13784–9.
18 Chen X, Cheung ST, So S, et al. Gene expression patterns in human liver cancers. Mol Biol Cell 2002;13(6):1929–39.
19 Collisson EA, Sadanandam A, Olson P, et al. Subtypes of pancreatic ductal adenocarcinoma and their differing responses to therapy. Nat Med 2011;17:500–3.
20 Felipe De Sousa EM, Wang X, Jansen M, et al. Poor-prognosis colon cancer is defined by a molecularly distinct subtype and develops from serrated precursor lesions. Nat Med 2013;19:614–18.
21 Nielsen TO, West RB, Linn SC, et al. Molecular characterisation of soft tissue tumours: a gene expression study. Lancet 2002;359(9314):1301–7.
22 Kluger Y, Basri R, Chang JT, et al. Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 2003;13(4):703–16.
23 Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401(6755):788–91.
24 Tibshirani R, Hastie T, Narasimhan B, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002;99(10):6567–72.
25 Hearst MA, Dumais ST, Osuna E, et al. Support vector machines. IEEE Intell Syst Their Appl 1998;13(4):18–28.
26 Khan J, Wei JS, Ringner M, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001;7:673–9.
27 Nutt CL, Mani DR, Betensky RA, et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res 2003;63:1602–7.
28 Eisen MB, Spellman PT, Brown PO, et al. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998;95(25):14863–8.
29 Pena JM, Lozano JA, Larranaga P. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognit Lett 1999;20:1027–40.
30 Breiman L. Random forests. Mach Learn 2001;45:5–32.
31 Fukunaga K, Narendra PM. A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans Comput 1975;100:750–3.
32 Hoadley KA, Yau C, Wolf DM, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158(4):929–44.
33 Siang TC, Soon TW, Kasim S, et al. A review of cancer classification software for gene expression data. Int J Biosci Biotechnol 2015;7(4):89–108.
34 Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10:57–63.
35 Guo Y, Sheng Q, Li J, et al. Large scale comparison of gene expression levels by microarrays and RNAseq using TCGA data. PLoS One 2013;8(8):e71462.
36 Zhao S, Fung-Leung WP, Bittner A, et al. Comparison of RNA-seq and microarray in transcriptome profiling of activated T cells. PLoS One 2014;9(1):e78644.
37 Shergill IS, Shergill NK, Arya M, et al. Tissue microarrays: a current medical research tool. Curr Med Res Opin 2004;20:707–12.
38 Veldman-Jones MH, Brant R, Rooney C, et al. Evaluating robustness and sensitivity of the nanostring technologies ncounter platform to enable multiplexed gene expression analysis of clinical samples. Cancer Res 2015;75(13):2587–93.
39 Geiss GK, Bumgarner RE, Birditt B, et al.
Direct multiplexed measurement of gene expression with color-coded probe pairs . Nat Biotechnol 2008 ; 26 : 317 – 25 . Google Scholar CrossRef Search ADS PubMed 40 Tamayo P , Slonim D , Mesirov J , et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation . Proc Natl Acad Sci USA 1999 ; 96 ( 6 ): 2907 – 12 . Google Scholar CrossRef Search ADS PubMed 41 Verhaak RGW , Hoadley KA , Purdom E , et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1 . Cancer Cell 2010 ; 17 ( 1 ): 98 – 110 . Google Scholar CrossRef Search ADS PubMed 42 Sørlie T , Perou CM , Tibshirani R , et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications . Proc Natl Acad Sci USA 2001 ; 98 ( 19 ): 10869 – 74 . Google Scholar CrossRef Search ADS PubMed 43 Cancer Genome Atlas Network Comprehensive molecular portraits of human breast tumours . Nature 2012 ; 490 : 61 – 70 . CrossRef Search ADS PubMed 44 Curtis C , Shah SP , Chin SF , et al. The genomic and transcriptomic architecture of 2, 000 breast tumours reveals novel subgroups . Nature 2012 ; 486 ( 7403 ): 346 – 52 . Google Scholar PubMed 45 Schlicker A , Beran G , Chresta CM , et al. Subtypes of primary colorectal tumors correlate with response to targeted treatment in colorectal cell lines . BMC Med Genomics 2012 ; 5 : 66 . Google Scholar CrossRef Search ADS PubMed 46 Marisa L , de Reyniès A , Duval A , et al. Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value . PLoS Med 2013 ; 10 ( 5 ): e1001453 . Google Scholar CrossRef Search ADS PubMed 47 Sadanandam A , Lyssiotis CA , Homicsko K , et al. A colorectal cancer classification system that associates cellular phenotype and responses to therapy . Nat Med 2013 ; 19 : 619 – 25 . 
Google Scholar CrossRef Search ADS PubMed 48 Budinska E , Popovici V , Tejpar S , et al. Gene expression patterns unveil a new level of molecular heterogeneity in colorectal cancer . J Pathol 2013 ; 231 ( 1 ): 63 – 76 . Google Scholar CrossRef Search ADS PubMed 49 Roepman P , Schlicker A , Tabernero J , et al. Colorectal cancer intrinsic subtypes predict chemotherapy benefit, deficient mismatch repair and epithelial-to-mesenchymal transition . Int J Cancer 2014 ; 134 ( 3 ): 552 – 62 . Google Scholar CrossRef Search ADS PubMed 50 Bauer AS , Keller A , Costello E , et al. Diagnosis of pancreatic ductal adenocarcinoma and chronic pancreatitis by measurement of microRNA abundance in blood and tissue . PLoS One 2012 ; 7 ( 4 ): e34151 . Google Scholar CrossRef Search ADS PubMed 51 Moffitt RA , Marayati R , Flate EL , et al. Virtual microdissection identifies distinct tumor-and stroma-specific subtypes of pancreatic ductal adenocarcinoma . Nat Genet 2015 ; 47 : 1168 – 78 . Google Scholar CrossRef Search ADS PubMed 52 Marcucci G , Mrózek K , Bloomfield CD. Molecular heterogeneity and prognostic biomarkers in adults with acute myeloid leukemia and normal cytogenetics . Curr Opin Hematol 2005 ; 12 : 68 – 75 . Google Scholar CrossRef Search ADS PubMed 53 Nones K , Waddell N , Song S , et al. Genome-wide DNA methylation patterns in pancreatic ductal adenocarcinoma reveal epigenetic deregulation of SLIT-ROBO, ITGA2 and MET signaling . Int J Cancer 2014 ; 135 ( 5 ): 1110 – 18 . Google Scholar CrossRef Search ADS PubMed 54 Waddell N , Pajic M , Patch AM , et al. Whole genomes redefine the mutational landscape of pancreatic cancer . Nature 2015 ; 518 ( 7540 ): 495 – 501 . Google Scholar CrossRef Search ADS PubMed 55 Daemen A , Peterson D , Sahu N , et al. Metabolite profiling stratifies pancreatic ductal adenocarcinomas into subtypes with distinct sensitivities to metabolic inhibitors . Proc Natl Acad Sci USA 2015 ; 112 ( 32 ): E4410 – 17 . 
Google Scholar CrossRef Search ADS PubMed 56 Stratton MR , Campbell PJ , Futreal PA. The cancer genome . Nature 2009 ; 458 ( 7239 ): 719 – 24 . Google Scholar CrossRef Search ADS PubMed 57 Finkelstein SD , Sayegh R , Christensen S , et al. Genotypic classification of colorectal adenocarcinoma. Biologic behavior correlates with K-ras-2 mutation type . Cancer 1993 ; 71 ( 12 ): 3827 – 38 . Google Scholar CrossRef Search ADS PubMed 58 Vural S , Wang X , Guda C. Classification of breast cancer patients using somatic mutation profiles and machine learning approaches . BMC Syst Biol 2016 ; 10(Suppl 3) : 62 . Google Scholar CrossRef Search ADS PubMed 59 Calin GA , Liu CG , Sevignani C , et al. MicroRNA profiling reveals distinct signatures in B cell chronic lymphocytic leukemias . Proc Natl Acad Sci USA 2004 ; 101 : 11755 – 60 . Google Scholar CrossRef Search ADS PubMed 60 Calin GA , Croce CM. MicroRNA signatures in human cancers . Nat Rev Cancer 2006 ; 6 ( 11 ): 857 – 66 . Google Scholar CrossRef Search ADS PubMed 61 Calin GA , Garzon R , Cimmino A , et al. MicroRNAs and leukemias: how strong is the connection? Leuk Res 2006 ; 30 ( 6 ): 653 – 5 . Google Scholar CrossRef Search ADS PubMed 62 Cantini L , Caselle M , Forget A , et al. A review of computational approaches detecting microRNAs involved in cancer . Front Biosci 2017 ; 22 : 1774 – 91 . Google Scholar CrossRef Search ADS 63 Lu J , Getz G , Miska EA , et al. MicroRNA expression profiles classify human cancers . Nature 2005 ; 435 ( 7043 ): 834 – 8 . Google Scholar CrossRef Search ADS PubMed 64 Feuk L , Carson AR , Scherer SW. Structural variation in the human genome . Nat Rev Genet 2006 ; 7 ( 2 ): 85 – 97 . Google Scholar CrossRef Search ADS PubMed 65 Cook EH Jr , Scherer SW. Copy-number variations associated with neuropsychiatric conditions . Nature 2008 ; 455 ( 7215 ): 919 – 23 . Google Scholar CrossRef Search ADS PubMed 66 Gonzalez E , Kulkarni H , Bolivar H , et al. 
The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility . Science 2005 ; 307 ( 5714 ): 1434 – 40 . Google Scholar CrossRef Search ADS PubMed 67 Le Maréchal C , Masson E , Chen JM , et al. Hereditary pancreatitis caused by triplication of the trypsinogen locus . Nat Genet 2006 ; 38 ( 12 ): 1372 . Google Scholar CrossRef Search ADS PubMed 68 Kallioniemi OP , Kallioniemi A , Piper J , et al. Optimizing comparative genomic hybridization for analysis of DNA sequence copy number changes in solid tumors . Genes Chromosomes Cancer 1994 ; 10 ( 4 ): 231 – 43 . Google Scholar CrossRef Search ADS PubMed 69 Sebat J , Lakshmi B , Troge J , et al. Large-scale copy number polymorphism in the human genome . Science 2004 ; 305 ( 5683 ): 525 – 8 . Google Scholar CrossRef Search ADS PubMed 70 Baylin SB. DNA methylation and gene silencing in cancer . Nat Clin Pract Oncol 2005 ; 2 : S4 – S11 . Google Scholar CrossRef Search ADS PubMed 71 Frommer M , McDonald LE , Millar DS , et al. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands . Proc Natl Acad Sci USA 1992 ; 89 ( 5 ): 1827 – 31 . Google Scholar CrossRef Search ADS PubMed 72 Huang THM , Perry MR , Laux DE. Methylation profiling of CpG islands in human breast cancer cells . Hum Mol Genet 1999 ; 8 : 459 – 70 . Google Scholar CrossRef Search ADS PubMed 73 Kwon MS , Kim Y , Lee S , et al. Integrative analysis of multi-omics data for identifying multi-markers for diagnosing pancreatic cancer . BMC Genomics 2015 ; 16 : S4 . Google Scholar CrossRef Search ADS PubMed 74 Zhao Q , Shi X , Xie Y , et al. Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA . Brief Bioinform 2015 ; 16 : 291 – 303 . Google Scholar CrossRef Search ADS PubMed 75 Wang Y. Development of cancer diagnostics—from biomarkers to clinical tests . Transl Cancer Res 2015 ; 4 : 270 – 9 . 76 Corless CL , Spellman PT. 
Tackling formalin-fixed, paraffin-embedded tumor tissue with next-generation sequencing . Cancer Discov 2012 ; 2 ( 1 ): 23 – 4 . Google Scholar CrossRef Search ADS PubMed 77 Lalkhen AG , McCluskey A. Clinical tests: sensitivity and specificity . Contin Educ Anaesth Crit Care Pain 2008 ; 8 ( 6 ): 221 – 3 . Google Scholar CrossRef Search ADS 78 Linnet K , Bossuyt PMM , Moons KGM , et al. Quantifying the accuracy of a diagnostic test or marker . Clin Chem 2012 ; 58 ( 9 ): 1292 – 301 . Google Scholar CrossRef Search ADS PubMed 79 Prokopec SD , Watson JD , Waggott DM , et al. Systematic evaluation of medium-throughput mRNA abundance platforms . RNA 2013 ; 19 ( 1 ): 51 – 62 . Google Scholar CrossRef Search ADS PubMed 80 Kulkarni MM. Digital multiplexed gene expression analysis using the NanoString nCounter system . Curr Protoc Mol Biol 2011 ; Chapter 25 : Unit25B.10 . Google Scholar PubMed 81 Payton JE , Grieselhuber NR , Chang LW , et al. High throughput digital quantification of mRNA abundance in primary human acute myeloid leukemia samples . J Clin Invest 2009 ; 119 ( 6 ): 1714 – 26 . Google Scholar CrossRef Search ADS PubMed 82 Kononen J , Bubendorf L , Kallionimeni A , et al. Tissue microarrays for high-throughput molecular profiling of tumor specimens . Nat Med 1998 ; 4 : 844 – 7 . Google Scholar CrossRef Search ADS PubMed 83 Rimm DL , Camp RL , Charette LA , et al. Amplification of tissue by construction of tissue microarrays . Exp Mol Pathol 2001 ; 70 : 255 – 64 . Google Scholar CrossRef Search ADS PubMed 84 Schmidt LH , Biesterfeld S , Kümmel A , et al. Tissue microarrays are reliable tools for the clinicopathological characterization of lung cancer tissue . Anticancer Res 2009 ; 29 : 201 – 9 . Google Scholar PubMed 85 Camp RL , Neumeister V , Rimm DL. A decade of tissue microarrays: progress in the discovery and validation of cancer biomarkers . J Clin Oncol 2008 ; 26 ( 34 ): 5630 – 7 . Google Scholar CrossRef Search ADS PubMed 86 Hoos A , Cordon-Cardo C. 
Tissue microarray profiling of cancer specimens and cell lines: opportunities and limitations . Lab Invest 2001 ; 81 : 1331 – 8 . Google Scholar CrossRef Search ADS PubMed 87 Xu R , Wunsch DC. Clustering algorithms in biomedical research: a review . IEEE Rev Biomed Eng 2010 ; 3 : 120 – 54 . Google Scholar CrossRef Search ADS PubMed 88 Cancer Genome Atlas Research Network . Comprehensive molecular characterization of urothelial bladder carcinoma . Nature 2014 ; 507 : 315 – 22 . CrossRef Search ADS PubMed 89 Dai X , Li T , Bai Z , et al. Breast cancer intrinsic subtype classification, clinical use and future trends . Am J Cancer Res 2015 ; 5 : 2929 – 43 . Google Scholar PubMed 90 Madeira SC , Oliveira AL. Biclustering algorithms for biological data analysis: a survey . IEEE/ACM Trans Comput Biol Bioinform 2004 ; 1 ( 1 ): 24 – 45 . Google Scholar CrossRef Search ADS PubMed 91 Cheng Y , Church GM. Biclustering of expression data . Proc Int Conf Intell Syst Mol Biol 2000 ; 8 : 93 – 103 . Google Scholar PubMed 92 Gan X , Liew AW-C , Yan H. Discovering biclusters in gene expression data based on high-dimensional linear geometries . BMC Bioinformatics 2008 ; 9 ( 1 ): 209. Google Scholar CrossRef Search ADS PubMed 93 Zhao H , Liew AW-C , Xie X , et al. A new geometric biclustering algorithm based on the Hough transform for analysis of large-scale microarray data . J Theor Biol 2008 ; 251 : 264 – 74 . Google Scholar CrossRef Search ADS PubMed 94 Mankad S , Michailidis G. Biclustering three-dimensional data arrays with plaid models . J Comput Graph Stat 2014 ; 23 : 943 – 65 . Google Scholar CrossRef Search ADS 95 Narmadha N , Rathipriya R. Triclustering: an evolution of clustering. In: 2016 Online International Conference on Green Engineering and Technologies (IC-GET) . IEEE, Coimbatore, India. 2016 , 1–4. 96 Li Y , Ngom A. Classification of clinical gene-sample-time microarray expression data via tensor decomposition methods. 
In: Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer-Verlag Berlin, Heidelberg, Palermo, Italy, 2011, 275–86. 97 Luo Y , Wang F , Szolovits P. Tensor factorization toward precision medicine . Brief Bioinform 2017 ; 18 : 511 – 4 . Google Scholar CrossRef Search ADS PubMed 98 Tibshirani R , Walther G , Hastie T. Estimating the number of clusters in a data set via the gap statistic . J R Stat Soc Series B Stat Methodol 2001 ; 63 : 411 – 23 . Google Scholar CrossRef Search ADS 99 Brunet J-P , Tamayo P , Golub TR , et al. Metagenes and molecular pattern discovery using matrix factorization . Proc Natl Acad Sci USA 2004 ; 101 ( 12 ): 4164 – 9 . Google Scholar CrossRef Search ADS PubMed 100 Vega-Pons S , Ruiz-Shulcloper J. A survey of clustering ensemble algorithms . Int J Pattern Recognit Artif Intell 2011 ; 25 ( 03 ): 337 – 72 . Google Scholar CrossRef Search ADS 101 Monti S , Tamayo P , Mesirov J , et al. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data . Mach Learn 2003 ; 52 : 91 – 118 . Google Scholar CrossRef Search ADS 102 Mukhopadhyay A , Bandyopadhyay S , Maulik U. Multi-class clustering of cancer subtypes through SVM based ensemble of pareto-optimal solutions for gene marker identification . PLoS One 2010 ; 5 ( 11 ): e13803. Google Scholar CrossRef Search ADS PubMed 103 Wang X , Markowetz F , De Sousa E , Melo F , et al. Dissecting cancer heterogeneity–an unsupervised classification approach . Int J Biochem Cell Biol 2013 ; 45 : 2574 – 9 . Google Scholar CrossRef Search ADS PubMed 104 Lex A , Streit M , Schulz H-J , et al. StratomeX: visual Analysis of Large-Scale Heterogeneous Genomics Data for Cancer Subtype Characterization . Comput Graph Forum 2012 ; 31 : 1175 – 84 . Google Scholar CrossRef Search ADS PubMed 105 Subramanian A , Tamayo P , Mootha VK , et al. 
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles . Proc Natl Acad Sci USA 2005 ; 102 : 15545 – 50 . Google Scholar CrossRef Search ADS PubMed 106 Ashburner M , Ball CA , Blake JA , et al. Gene Ontology: tool for the unification of biology . Nat Genet 2000 ; 25 : 25 – 9 . Google Scholar CrossRef Search ADS PubMed 107 Kanehisa M , Goto S , Hattori M , et al. From genomics to chemical genomics: new developments in KEGG . Nucleic Acids Res 2006 ; 34 ( 90001 ): D354 – 7 . Google Scholar CrossRef Search ADS PubMed 108 Kaplan EL , Meier P. Nonparametric estimation from incomplete observations . J Am Stat Assoc 1958 ; 53 : 457 – 81 . Google Scholar CrossRef Search ADS 109 Mantel N. Evaluation of survival data and two new rank order statistics arising in its consideration . Cancer Chemother Rep 1966 ; 50 : 163 – 70 . Google Scholar PubMed 110 Shen T , Pajaro-Van de Stadt SH , Yeat NC , et al. Clinical applications of next generation sequencing in cancer: from panels, to exomes, to genomes . Front Genet 2015 ; 6 : 215 . Google Scholar CrossRef Search ADS PubMed 111 Peyser ND , Grandis JR. Cancer genomics: spot the difference . Nature 2017 ; 541 ( 7636 ): 162 – 3 . Google Scholar CrossRef Search ADS PubMed 112 Voduc D , Kenney C , Nielsen TO. Tissue microarrays in clinical oncology . Semin Radiat Oncol 2008 ; 18 ( 2 ): 89 – 97 . Google Scholar CrossRef Search ADS PubMed 113 Terry J , Saito T , Subramanian S , et al. TLE1 as a diagnostic immunohistochemical marker for synovial sarcoma emerging from gene expression profiling studies . Am J Surg Pathol 2007 ; 31 : 240 – 6 . Google Scholar CrossRef Search ADS PubMed 114 Hans CP , Weisenburger DD , Greiner TC , et al. Confirmation of the molecular classification of diffuse large B-cell lymphoma by immunohistochemistry using a tissue microarray . Blood 2004 ; 103 ( 1 ): 275 – 82 . Google Scholar CrossRef Search ADS PubMed 115 Yersal O , Barutca S. 
Biological subtypes of breast cancer: prognostic and therapeutic implications . World J Clin Oncol 2014 ; 5 : 412 – 24 . Google Scholar CrossRef Search ADS PubMed 116 van 't Veer LJ , Dai H , van de Vijver MJ , et al. Gene expression profiling predicts clinical outcome of breast cancer . Nature 2002 ; 415 ( 6871 ): 530 – 6 . Google Scholar CrossRef Search ADS PubMed 117 Paik S , Shak S , Tang G , et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer . N Engl J Med 2004 ; 351 ( 27 ): 2817 – 26 . Google Scholar CrossRef Search ADS PubMed 118 Toussaint J , Sieuwerts AM , Haibe-Kains B , et al. Improvement of the clinical applicability of the Genomic Grade Index through a qRT-PCR test performed on frozen and formalin-fixed paraffin-embedded tissues . BMC Genomics 2009 ; 10 : 424 . Google Scholar CrossRef Search ADS PubMed 119 Sotiriou C , Wirapati P , Loi S , et al. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis . J Natl Cancer Inst 2006 ; 98 ( 4 ): 262 – 72 . Google Scholar CrossRef Search ADS PubMed 120 Wirapati P , Sotiriou C , Kunkel S , et al. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures . Breast Cancer Res 2008 ; 10 : R65 . Google Scholar CrossRef Search ADS PubMed 121 Rouzier R , Perou CM , Symmans WF , et al. Breast cancer molecular subtypes respond differently to preoperative chemotherapy . Clin Cancer Res 2005 ; 11 : 5678 – 85 . Google Scholar CrossRef Search ADS PubMed 122 Ross DT , Scherf U , Eisen MB , et al. Systematic variation in gene expression patterns in human cancer cell lines . Nat Genet 2000 ; 24 ( 3 ): 227 – 35 . Google Scholar CrossRef Search ADS PubMed 123 Gao H , Korn JM , Ferretti S , et al. High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response . Nat Med 2015 ; 21 : 1318 – 25 . 
Google Scholar CrossRef Search ADS PubMed 124 Kao J , Salari K , Bocanegra M , et al. Molecular profiling of breast cancer cell lines defines relevant tumor models and provides a resource for cancer gene discovery . PLoS One 2009 ; 4 ( 7 ): e6146 . Google Scholar CrossRef Search ADS PubMed 125 Kim N , He N , Yoon S. Cell line modeling for systems medicine in cancers (Review) . Int J Oncol 2014 ; 44 : 371 – 6 . Google Scholar CrossRef Search ADS PubMed 126 Shoemaker RH , Monks A , Alley MC , et al. Development of human tumor cell line panels for use in disease-oriented drug screening . Prog Clin Biol Res 1987 ; 276 : 265 – 86 . 127 Garnett MJ , Edelman EJ , Heidorn SJ , et al. Systematic identification of genomic markers of drug sensitivity in cancer cells . Nature 2012 ; 483 ( 7391 ): 570 – 5 . Google Scholar CrossRef Search ADS PubMed 128 Workman P , Kaye SB. Translating basic cancer research into new cancer therapeutics . Trends Mol Med 2002 ; 8 ( 4 ): S1 – 9 . Google Scholar CrossRef Search ADS PubMed 129 Clarke PA , te Poele R , Workman P. Gene expression microarray technologies in the development of new therapeutic agents . Eur J Cancer 2004 ; 40 : 2560 – 91 . Google Scholar CrossRef Search ADS PubMed 130 Hijazi H , Wu M , Nath A , et al. Ensemble classification of cancer types and biomarker identification . Drug Dev Res 2012 ; 73 : 414 – 19 . Google Scholar CrossRef Search ADS PubMed 131 Michiels S , Koscielny S , Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy . Lancet 2005 ; 365 ( 9458 ): 488 – 92 . Google Scholar CrossRef Search ADS PubMed 132 Hudson TJ , Anderson W , Aretz A , et al. International network of cancer genome projects . Nature 2010 ; 464 ( 7291 ): 993 – 8 . Google Scholar CrossRef Search ADS PubMed 133 Tomczak K , Czerwińska P , Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge . Contemp Oncol 2015 ; 19 ( 1A ): A68. 134 Barrett T. Gene Expression Omnibus (GEO). 
2013 . https://www.ncbi.nlm.nih.gov/books/NBK159736/. 135 Neoptolemos JP , Stocken DD , Friess H , et al. A randomized trial of chemoradiotherapy and chemotherapy after resection of pancreatic cancer . N Engl J Med 2004 ; 350 : 1200 – 10 . Google Scholar CrossRef Search ADS PubMed 136 Stratford JK , Bentrem DJ , Anderson JM , et al. A six-gene signature predicts survival of patients with localized pancreatic ductal adenocarcinoma . PLoS Med 2010 ; 7 ( 7 ): e1000307 . Google Scholar CrossRef Search ADS PubMed 137 Johnson WE , Li C , Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods . Biostatistics 2007 ; 8 ( 1 ): 118 – 27 . Google Scholar CrossRef Search ADS PubMed 138 Leek JT , Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis . PLoS Genet 2007 ; 3 ( 9 ): e161. Google Scholar CrossRef Search ADS 139 Benito M , Parker J , Du Q , et al. Adjustment of systematic microarray data biases . Bioinformatics 2004 ; 20 ( 1 ): 105 – 14 . Google Scholar CrossRef Search ADS PubMed 140 Emmert-Buck MR , Bonner RF , Smith PD , et al. Laser capture microdissection . Science 1996 ; 274 ( 5289 ): 998 – 1001 . Google Scholar CrossRef Search ADS PubMed 141 Yoshihara K , Shahmoradgoli M , Martínez E , et al. Inferring tumour purity and stromal and immune cell admixture from expression data . Nat Commun 2013 ; 4 : 2612 . Google Scholar CrossRef Search ADS PubMed 142 Song S , Nones K , Miller D , et al. qpure: a tool to estimate tumor cellularity from genome-wide single-nucleotide polymorphism profiles . PLoS One 2012 ; 7 ( 9 ): e45835 . Google Scholar CrossRef Search ADS PubMed 143 Bhome R , Bullock MD , Al Saihati HA , et al. A top-down view of the tumor microenvironment: structure, cells and signaling . Front Cell Dev Biol 2015 ; 3 : 33 . Google Scholar CrossRef Search ADS PubMed 144 Gajewski TF , Schreiber H , Fu Y-X. Innate and adaptive immune cells in the tumor microenvironment . 
Nat Immunol 2013 ; 14 : 1014 – 22 . Google Scholar CrossRef Search ADS PubMed 145 Jiménez-Sánchez A , Memon D , Pourpe S , et al. Heterogeneous tumor-immune microenvironments among differentially growing metastases in an ovarian cancer patient . Cell 2017 ; 170 : 927 – 38.e20 . Google Scholar CrossRef Search ADS PubMed 146 Orimo A , Weinberg RA. Heterogeneity of stromal fibroblasts in tumor . Cancer Biol Ther 2007 ; 6 ( 4 ): 618 – 9 . Google Scholar CrossRef Search ADS PubMed 147 Bergamaschi A , Tagliabue E , Sørlie T , et al. Extracellular matrix signature identifies breast cancer subgroups with different clinical outcome . J Pathol 2008 ; 214 ( 3 ): 357 – 67 . Google Scholar CrossRef Search ADS PubMed 148 Pickup MW , Mouw JK , Weaver VM. The extracellular matrix modulates the hallmarks of cancer . EMBO Rep 2014 ; 15 ( 12 ): 1243 – 53 . Google Scholar CrossRef Search ADS PubMed 149 Pages F , Galon J , Dieu-Nosjean MC , et al. Immune infiltration in human tumors: a prognostic factor that should not be ignored . Oncogene 2010 ; 29 ( 8 ): 1093 – 102 . Google Scholar CrossRef Search ADS PubMed 150 Frantz C , Stewart KM , Weaver VM. The extracellular matrix at a glance . J Cell Sci 2010 ; 123 ( Pt 24 ): 4195 – 200 . Google Scholar CrossRef Search ADS PubMed 151 Allinen M , Beroukhim R , Cai L , et al. Molecular characterization of the tumor microenvironment in breast cancer . Cancer Cell 2004 ; 6 ( 1 ): 17 – 32 . Google Scholar CrossRef Search ADS PubMed 152 Guinney J , Dienstmann R , Wang X , et al. The consensus molecular subtypes of colorectal cancer . Nat Med 2015 ; 21 : 1350 – 6 . Google Scholar CrossRef Search ADS PubMed 153 Lehmann BD , Bauer JA , Chen X , et al. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies . J Clin Invest 2011 ; 121 ( 7 ): 2750 . Google Scholar CrossRef Search ADS PubMed 154 Sørlie T , Tibshirani R , Parker J , et al. 
Repeated observation of breast tumor subtypes in independent gene expression data sets . Proc Natl Acad Sci USA 2003 ; 100 ( 14 ): 8418 – 23 . Google Scholar CrossRef Search ADS PubMed 155 Hu Z , Fan C , Oh DS , et al. The molecular portraits of breast tumors are conserved across microarray platforms . BMC Genomics 2006 ; 7 : 96. Google Scholar CrossRef Search ADS PubMed 156 Ringnér M , Jönsson G , Staaf J. Prognostic and chemotherapy predictive value of gene-expression phenotypes in primary lung adenocarcinoma . Clin Cancer Res 2016 ; 22 : 218 – 29 . Google Scholar CrossRef Search ADS PubMed 157 Haibe-Kains B , Desmedt C , Loi S , et al. A three-gene model to robustly identify breast cancer molecular subtypes . J Natl Cancer Inst 2012 ; 104 ( 4 ): 311 – 25 . Google Scholar CrossRef Search ADS PubMed 158 Weigelt B , Mackay A , A'hern R , et al. Breast cancer molecular profiling with single sample predictors: a retrospective analysis . Lancet Oncol 2010 ; 11 ( 4 ): 339 – 49 . Google Scholar CrossRef Search ADS PubMed 159 Lusa L , McShane LM , Reid JF , et al. Challenges in projecting clustering results across gene expression–profiling datasets . J Natl Cancer Inst 2007 ; 99 ( 22 ): 1715 – 23 . Google Scholar CrossRef Search ADS PubMed 160 Guiu S , Michiels S , Andre F , et al. Molecular subclasses of breast cancer: how do we define them? The IMPAKT 2012 Working Group Statement . Ann Oncol 2012 ; 23 : 2997 – 3006 . Google Scholar CrossRef Search ADS PubMed 161 Sørlie T , Borgan E , Myhre S , et al. The importance of gene-centring microarray data . Lancet Oncol 2010 ; 11 : 719 – 20 . Google Scholar CrossRef Search ADS PubMed 162 Allison DB , Page GP , Beasley TM , et al. DNA Microarrays and Related Genomics Techniques: Design, Analysis, and Interpretation of Experiments . 2005 . 
https://books.google.com.hk/books?hl=en&lr=&id=TUrMBQAAQBAJ&oi=fnd&pg=PP1&dq=DNA+Microarrays+and+Related+Genomics+Techniques:+Design,+Analysis,+and+Interpretation+of+Experiments&ots=eY-ZofXdvd&sig=17rgrkJzuOYz-TydzaTfLthxwyM&redir_esc=y#v=onepage&q=DNA%20Microarrays%20and%20Related%20Genomics%20Techniques%3A%20Design%2C%20Analysis%2C%20and%20Interpretation%20of%20Experiments&f=false. Google Scholar CrossRef Search ADS 163 Mantione KJ , Kream RM , Kuzelova H , et al. Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq . Med Sci Monit Basic Res 2014 ; 20 : 138 – 42 . Google Scholar CrossRef Search ADS PubMed 164 Khansarinejad B , Soleimanjahi H , Mirab Samiee S , et al. Monitoring human cytomegalovirus infection in pediatric hematopoietic stem cell transplant recipients: using an affordable in-house qPCR assay for management of HCMV infection under limited resources . Transpl Int 2015 ; 28 : 594 – 603 . Google Scholar CrossRef Search ADS PubMed 165 Pires ARC , Andreiuolo F da M , de Souza SR. TMA for all: a new method for the construction of tissue microarrays without recipient paraffin block using custom-built needles . Diagn Pathol 2006 ; 1 : 14 . Google Scholar CrossRef Search ADS PubMed 166 SEQC/MAQC-III Consortium . A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium . Nat Biotechnol 2014 ; 32 : 903 – 14 . CrossRef Search ADS PubMed 167 Singh A , Sau AK. Tissue microarray: a powerful and rapidly evolving tool for high-throughput analysis of clinical specimens . IJCRI 2010 ; 1:1 – 11 . 168 Łabaj PP , Kreil DP. Sensitivity, specificity, and reproducibility of RNA-Seq differential expression calls . Biol Direct 2016 ; 11 : 66 . Google Scholar CrossRef Search ADS PubMed 169 Figueroa ME , Lugthart S , Li Y , et al. DNA methylation signatures identify biologically distinct subtypes in acute myeloid leukemia . Cancer Cell 2010 ; 17 : 13 – 27 . 
Google Scholar CrossRef Search ADS PubMed 170 Marziali G , Buccarelli M , Giuliani A , et al. A three-microRNA signature identifies two subtypes of glioblastoma patients with different clinical outcomes . Mol Oncol 2017 ; 11 : 1115 – 1129 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Briefings in Bioinformatics, Oxford University Press

Published: Apr 12, 2018
