Effect of normalization methods on the performance of supervised learning algorithms applied to HTSeq-FPKM-UQ data sets: 7SK RNA expression as a predictor of survival in patients with colon adenocarcinoma

Abstract

Motivation: One of the main challenges in machine learning (ML) is choosing an appropriate normalization method. Here, we examine the effect of various normalization methods on analyzing FPKM upper quartile (FPKM-UQ) RNA sequencing data sets. We collect the HTSeq-FPKM-UQ files of patients with colon adenocarcinoma from the TCGA-COAD project. We compare the three most common normalization methods, scaling, standardizing using the z-score and vector normalization, by visualizing the normalized data set and evaluating the performance of 12 supervised learning algorithms on it. Additionally, for each of these normalization methods, we use two different normalization strategies: normalizing samples (files) or normalizing features (genes).

Results: Regardless of the normalization method, a support vector machine (SVM) model with the radial basis function kernel had the maximum accuracy (78%) in predicting the vital status of the patients. However, the fitting time of the SVM depended on the normalization method, and it reached its minimum fitting time when files were normalized to unit length. Furthermore, among all 12 learning algorithms and 6 normalization techniques, the Bernoulli naive Bayes model after standardizing files had the best performance in terms of maximizing the accuracy as well as minimizing the fitting time. We also investigated the effect of dimensionality reduction methods on the performance of the supervised ML algorithms. Reducing the dimension of the data set did not increase the maximum accuracy of 78%. However, it led to the discovery of 7SK RNA gene expression as a predictor of survival in patients with colon adenocarcinoma, with an accuracy of 78%.

Keywords: 7SK RNA, gene expression, colon adenocarcinoma, normalization methods, supervised machine learning algorithms, TCGA HTSeq-FPKM-UQ data sets

Introduction

Normalizing data is usually the first step before using any machine learning (ML) technique because of its crucial role in the performance of the algorithms [1]. The importance of normalization before applying ML algorithms has been shown in many studies for various types of data sets [2–4]. However, it has not been well studied for supervised or unsupervised learning algorithms applied to gene expression data sets. As accurately estimating the expression level of genes is a challenging task, several methods have been developed to increase the accuracy of the estimations [5–9]. To reach this goal, these methods commonly use statistical techniques to normalize the data [9, 10]. Many of these methods normalize data between samples by scaling the number of reads in a given library to a common value across all sequenced libraries in the experiment [11]. One of the most common normalization methods is quantifying transcript levels in reads per kilobase of exon model per million mapped reads (FPKM), which 'facilitates transparent comparison of transcript levels both within and between samples' [12]. The FPKM upper quartile (FPKM-UQ) is based on a modified version of the FPKM normalization method. FPKM-UQ values tend to be much higher than FPKM values because of the large difference between the total number of mapped reads in an alignment and the number of reads mapped to a single gene:

$$\text{FPKM} = 10^9 \times \frac{\text{number of reads mapped to the gene}}{\text{number of reads mapped to all protein-coding genes} \times \text{length of the gene in base pairs}}$$

$$\text{FPKM-UQ} = 10^9 \times \frac{\text{number of reads mapped to the gene}}{\text{75th percentile read count value for genes in the sample} \times \text{length of the gene in base pairs}}$$
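To make the two formulas concrete, the following is a minimal Python sketch that computes them for a single sample. The array names (counts, gene_lengths_bp, coding_mask) are hypothetical, and we assume counts holds the per-gene mapped read counts; this is an illustration of the formulas above, not the GDC pipeline itself.

```python
import numpy as np

def fpkm(counts, gene_lengths_bp, coding_mask):
    # Scale by the reads mapped to all protein-coding genes and by gene length.
    total_coding_reads = counts[coding_mask].sum()
    return 1e9 * counts / (total_coding_reads * gene_lengths_bp)

def fpkm_uq(counts, gene_lengths_bp):
    # Replace the library-size factor with the sample's 75th percentile read count.
    upper_quartile = np.percentile(counts, 75)
    return 1e9 * counts / (upper_quartile * gene_lengths_bp)
```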
Although an HTSeq-FPKM-UQ value is a gene expression level that has already been normalized within a sample, FPKM-UQ is not a normalization method for comparing RNA sequencing (RNA-seq) data sets across individuals. We need another normalization step to be able to compare the RNA-seq data sets of different patients. In this article, we explore the effect of normalizing FPKM-UQ data sets on the performance of ML algorithms trained to predict the vital status of cancer patients. More precisely, we examine the effect of the 6 most common normalization methods on the accuracy and fitting time of the 12 most well-known supervised algorithms.

Approach

We collected the gene expression (HTSeq-FPKM-UQ) files of patients with colon adenocarcinoma from the TCGA-COAD project, and we normalized the data set in six different ways. We used the three most common normalization methods: scaling, vector normalization and z-score. We applied each of these three methods in two ways: (1) independently normalizing each gene and (2) independently normalizing each file (or patient). For more details, please see the 'Methods' section. RNA-seq data sets are mainly used to analyze patients (e.g. classifying patients) or genes (e.g. obtaining gene regulatory networks). For this reason, to compare the normalization methods, we visualized the normalized values of the APC, KRAS and EGFR genes. Additionally, we compared the averages and SDs of genes' values for patients who survived versus those who died. We chose the APC gene because the most common mutation in colon cancer is the inactivation of APC. We chose EGFR and KRAS because cetuximab (Erbitux), which inhibits EGFR, is used for the treatment of colon cancer with wild-type KRAS. As this drug has little or no effect in colorectal tumors with a KRAS mutation [13, 14], we also looked at the correlation coefficient of the normalized values of KRAS and EGFR.

Results

Raw data

The data set includes 487 and 41 HTSeq-FPKM-UQ files for primary tumors (PTs) and solid tissue normals (N), respectively (Table 1). Each file includes the expression values of 60 483 genes. After excluding the genes whose values are 0 across the entire data set, the number of genes is reduced to 57 813.

Table 1. Data set

Sample type               Categories   Number of cases   Number of files
Solid tissue normal (N)   Female       45                21
Solid tissue normal (N)   Male         48                20
PT                        Female       214               224
PT                        Male         240               252

Subgroups      Cases in N   Files in N   Cases in PT   Files in PT
Female-alive   32           16           168           177
Female-dead    13           5            46            47
Male-alive     37           13           184           196
Male-dead      11           7            56            56
Figure 1 shows the data before normalization. Figure 1A indicates that some genes have very high values (>10^8), while others have small values (<10). The genes' expression values vary widely: the SD of some genes' values is around one order of magnitude higher than the mean (coefficient of variation >10), and the average coefficient of variation of each gene's values is around 5. Furthermore, Figure 1B reveals that the average of the genes' expressions for each patient is around 10^5 with SD >10^6; the coefficient of variation of the values in each file is around 19. Additionally, the Pearson correlation coefficient between the raw values of EGFR and KRAS is 0.226, and between APC and EGFR it is 0.31.

Figure 1. Raw data. In (A), the blue line represents the average value of each gene, and the red line shows the average plus SD. (B) represents the averages and SDs of gene expressions in each file; the first 41 cases (stars) show the results for solid tissue normals, and the rest are the averages and SDs of gene expressions in HTSeq-FPKM-UQ files for PTs. (C) shows the values of the APC, EGFR and KRAS genes in 487 HTSeq-FPKM-UQ files for PTs (circles) and in 41 files for solid tissue normals (stars).

The high variance in the genes' values and the high Euclidean distance between genes signify the need for normalizing the data, because most ML and statistical methods are based on the L2 norm. Furthermore, from clinical observations, we know there is a high correlation between KRAS and EGFR [13, 14], but the correlation coefficient of these genes in the unnormalized data is small.
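The dispersion statistics quoted in this section can be reproduced directly from the expression matrix. A minimal numpy sketch, assuming a hypothetical matrix D with files as rows and genes as columns:

```python
import numpy as np

def dispersion_summary(D):
    """D: (n_files, n_genes) matrix of FPKM-UQ values, all-zero genes removed."""
    cv_genes = D.std(axis=0) / D.mean(axis=0)  # CV of each gene across files
    cv_files = D.std(axis=1) / D.mean(axis=1)  # CV of the values within each file
    return cv_genes.mean(), cv_files.mean()    # ~5 and ~19 for the raw data

def gene_correlation(D, j, k):
    """Pearson correlation between genes j and k across all files."""
    return np.corrcoef(D[:, j], D[:, k])[0, 1]  # ~0.226 for EGFR vs KRAS (raw)
```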
Scaling

One normalization method is scaling the data into the range between 0 and 1. We have two options: scaling each gene's values across the entire data set, or scaling the gene expression data set of each patient (scaling each file separately).

Scaling patients

To scale patients, we divide the gene expression values in each file by the maximum gene expression value in that file, so the range of values in each file is between 0 and 1. Figure 2 represents the data after scaling each file (patient). As the genes' values after scaling are between 0 and 1, the average of the genes' values in the entire data set is between 0 and 1. However, the average coefficient of variation of each gene's values is 5.05 (Figure 2A). Although after scaling the L2 distance between two values becomes <1, the coefficient of variation of the values in each file is around 16 (Figure 2B). Importantly, using this normalization, the Pearson correlation coefficient between EGFR and KRAS becomes 0.51, and the Pearson correlation coefficient between EGFR and APC changes to 0.54.

Figure 2. Scaled data. Top panels (A–C) represent the results of scaling each file, and bottom panels (D–F) show the results of scaling each gene. In (A) and (D), the blue line represents the average value of each gene, and the red line shows the average plus SD. (B) and (E) represent the average and average plus SD of gene expressions in each file; the first 41 cases (stars) show the results for solid tissue normals, and the rest are the averages and SDs of gene expressions for PTs. (C) and (F) show the values of the APC, EGFR and KRAS genes in 487 HTSeq-FPKM-UQ files for PTs (circles) and in 41 files for solid tissue normals (stars).

Scaling genes

To scale each gene's values, we divide each gene expression value by that gene's maximum expression value in the entire data set, so the range of each gene's values is between 0 and 1. In this case, the average of each gene's values in the entire data set is between 0 and 1. However, similar to the raw data, the SD of some genes is one order of magnitude higher than the mean value, while the average coefficient of variation of each gene's values is 4.97 (Figure 2D). Additionally, the coefficient of variation of the values in each file is around 1.7. As the values of each gene are divided by a constant, the Pearson correlation coefficients between genes do not change compared with the raw data; the Pearson correlation coefficient between the scaled values of EGFR and KRAS stays 0.226, and between APC and EGFR it remains 0.31.
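The reason per-gene scaling preserves the correlations while per-file scaling changes them is that dividing a whole gene's values by one constant is a linear rescaling, under which the Pearson coefficient is invariant, whereas per-file scaling divides each sample by a different constant. A toy demonstration on synthetic data (not the TCGA data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(sigma=2, size=200)            # gene A across 200 files
y = 0.5 * x + rng.lognormal(sigma=2, size=200)  # gene B, correlated with A
r = lambda a, b: np.corrcoef(a, b)[0, 1]

# Per-gene scaling divides each gene by one constant: correlation is unchanged.
print(r(x, y), r(x / x.max(), y / y.max()))

# Per-file scaling divides each file by a different constant:
# the correlation generally changes.
m = np.maximum(x, y) * rng.uniform(1, 10, size=200)  # stand-in per-file maxima
print(r(x / m, y / m))
```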
Vector normalization (scaling to unit length)

Another common normalization method is vector normalization. This method is similar to scaling; the only difference is that we divide each value in a vector by the Euclidean norm of the vector instead of by the maximum value in the vector. Again, we have two options: normalizing genes or files (patients).

Scaling files to the unit length

To scale files to unit length, for each file we divide each gene expression value by the L2 norm of the genes' values in that file. After independently rescaling each file, the gene values lie between 0 and 1, and the mean coefficient of variation of the normalized genes' values is 5.1 (Figure 3A). Moreover, the average coefficient of variation of the normalized values in each file is 16 (Figure 3B). That is, when we normalize each file independently, the coefficient of variation of the genes' values within each file is higher than the coefficient of variation of a gene's values across all files. In this case, the Pearson correlation coefficient between EGFR and KRAS becomes 0.43, and the Pearson correlation coefficient between EGFR and APC becomes 0.47.

Figure 3. Data scaled to unit length. Top panels (A–C) represent the results of independently scaling each file, and bottom panels (D–F) show the results of independently scaling each gene. In (A) and (D), the blue line represents the average value of each gene, and the red line shows the average plus SD. (B) and (E) represent the average and average plus SD of gene expressions in each file; the first 41 cases (stars) show the results for solid tissue normals, and the rest are the averages and SDs of gene expressions in HTSeq-FPKM-UQ files for PTs. (C) and (F) show the values of the APC, EGFR and KRAS genes in 487 HTSeq-FPKM-UQ files for PTs (circles) and in 41 files for solid tissue normals (stars).

Scaling genes to the unit length

To independently rescale each gene's values to unit length, we divide each gene expression value by the L2 norm of that gene's values in the entire data set. In this case, similar to the raw data, the average coefficient of variation of each gene's normalized values is 4.97 (Figure 3D), while the average coefficient of variation of the values in each file is 1.7 (Figure 3E). This implies that scaling genes to unit length reduces both the coefficient of variation of the values within each file and the coefficient of variation of each gene's values across all files. Here, too, because each gene's values are divided by a constant, the Pearson correlation coefficients between genes do not change compared with the raw data.
Standardization (z-score)

One of the most widely used normalization methods is standardization, which rescales the data to have mean 0 and unit variance. Here also, we have two options: normalizing genes or files (patients).

Standardizing files

We normalize each file independently by obtaining the z-scores of the values in each file. After standardizing each file (patient), the mean coefficient of variation of the normalized genes' values becomes 7.07 (Figure 4A). Note that standardizing files makes the average of the genes' values in each file equal to 0 with SD 1 (Figure 4B). In this case, the Pearson correlation coefficient between EGFR and KRAS becomes 0.23, while the Pearson correlation coefficient between EGFR and APC becomes 0.17.

Figure 4. Standardized data. Top panels (A–C) represent the results of independently standardizing each file, and bottom panels (D–F) show the results of independently standardizing each gene. In (A) and (D), the blue line represents the average value of each gene, and the red line shows the average plus SD. (B) and (E) represent the average and average plus SD of gene expressions in each file; the first 41 cases (stars) show the results for solid tissue normals, and the rest are the averages and SDs of gene expressions in HTSeq-FPKM-UQ files for PTs. (C) and (F) show the normalized values of the APC, EGFR and KRAS genes for PTs (circles) and in 41 files for solid tissue normals (stars).

Standardizing genes

To independently standardize each gene's values, we obtain the z-scores of each gene's expression values across the entire data set. Standardizing genes makes the average of each gene's values 0 with SD 1 in the whole data set (Figure 4D), while the mean of the values in each file is around −0.03 with SD around 0.7 (Figure 4E). Furthermore, the Pearson correlation coefficients between KRAS and EGFR and between EGFR and APC become 0.23 and 0.31, respectively.

Figure 5 shows how the abovementioned normalization methods change the values of EGFR and KRAS, and ultimately the correlation coefficient of these two genes. Furthermore, this figure shows that the normalization methods change the Euclidean distance between two points, which can alter the performance of some supervised learning algorithms.

Figure 5. Normalized values of EGFR and KRAS. This figure shows the normalized values of the EGFR and KRAS genes in 487 HTSeq-FPKM-UQ files for PTs (circles) and in 41 files for solid tissue normals (stars).

Comparing performance of supervised learning algorithms

We examine the effect of the normalization methods on the performance of the 12 most common ML algorithms, which we use to predict the vital status of patients with colon cancer. The accuracy and fitting time of these algorithms are provided in Table 2; for more details about the implemented ML algorithms, please see the 'Methods' section.
Table 2. Accuracy and fitting time of several methods for predicting vital status (each cell: accuracy / fitting time)

Normalization method               SVM                        SGD                        Linear discriminant        Quadratic discriminant
Raw data                           0.78±0.01 / 16.4±1.51      0.76±0.05 / 1.9±0.07       0.76±0.09 / 7.1±1.29       0.43±0.17 / 11.0±0.66
Scaling files                      0.78±0.01 / 8.6±0.58       0.74±0.23 / 2.1±0.11       0.72±0.10 / 7.9±0.56       0.39±0.20 / 11.5±2.14
Scaling genes                      0.78±0.01 / 13.3±1.63      0.63±0.33 / 2.0±0.15       0.76±0.09 / 8.1±1.25       0.49±0.20 / 10.5±2.12
Scaling files to the unit length   0.78±0.01 / 6.9±0.46       0.75±0.10 / 0.4±0.01       0.74±0.11 / 6.3±0.83       0.36±0.19 / 9.6±1.10
Scaling genes to the unit length   0.78±0.01 / 10.9±1.54      0.72±0.18 / 0.4±0.02       0.76±0.09 / 5.7±0.37       0.52±0.12 / 9.1±0.55
Standardizing files                0.78±0.01 / 9.4±0.56       0.57±0.44 / 0.4±0.01       0.70±0.15 / 5.8±0.21       0.22±0.02 / 9.0±0.47
Standardizing genes                0.78±0.01 / 15.2±1.16      0.26±0.09 / 1.9±0.03       0.61±0.13 / 7.3±0.25       0.44±0.15 / 9.7±2.44

Normalization method               Logistic regression        Decision tree              Nearest centroid           Gaussian process
Raw data                           0.62±0.19 / 6.0±0.51       0.65±0.15 / 18.9±8.36      0.39±0.22 / 1.8±0.07       0.22±0.01 / 16.5±1.45
Scaling files                      0.78±0.02 / 2.9±0.25       0.67±0.13 / 19.5±13.22     0.47±0.16 / 2.0±0.12       0.78±0.01 / 17.5±1.68
Scaling genes                      0.78±0.09 / 5.0±0.52       0.66±0.10 / 20.1±9.10      0.59±0.16 / 1.8±0.06       0.68±0.09 / 17.0±1.31
Scaling files to the unit length   0.78±0.01 / 1.0±0.05       0.65±0.10 / 15.7±8.40      0.46±0.20 / 0.3±0.17       0.78±0.01 / 15.2±0.93
Scaling genes to the unit length   0.78±0.07 / 1.5±0.06       0.66±0.12 / 17.0±9.09      0.45±0.28 / 0.3±0.02       0.76±0.04 / 15.3±0.66
Standardizing files                0.70±0.22 / 7.3±0.04       0.64±0.18 / 28.2±15.65     0.46±0.20 / 0.3±0.01       0.24±0.04 / 15.3±0.74
Standardizing genes                0.32±0.10 / 6.3±0.85       0.65±0.15 / 18.6±8.33      0.53±0.24 / 1.9±0.21       0.22±0.01 / 16.3±1.57

Normalization method               Neural network size (5, 2)   Gradient boosting          Gaussian naive Bayes       Bernoulli naive Bayes
Raw data                           0.78±0.02 / 10.5±7.76         0.76±0.05 / 237.3±17.75    0.32±0.12 / 2.5±1.08       0.71±0.15 / 2.6±0.58
Scaling files                      0.70±0.08 / 21.8±3.86         0.77±0.04 / 357.0±757.56   0.34±0.12 / 2.3±0.18       0.71±0.15 / 2.3±0.35
Scaling genes                      0.78±0.01 / 2.5±0.53          0.75±0.07 / 242.8±29.38    0.76±0.02 / 2.2±0.11       0.71±0.15 / 2.3±0.13
Scaling files to the unit length   0.72±0.11 / 19.1±3.98         0.75±0.03 / 226.7±17.24    0.33±0.12 / 0.7±0.04       0.71±0.15 / 0.9±0.07
Scaling genes to the unit length   0.75±0.08 / 4.7±2.28          0.76±0.05 / 240.4±17.45    0.76±0.05 / 0.7±0.05       0.71±0.15 / 0.8±0.05
Standardizing files                0.78±0.01 / 1.3±0.38          0.77±0.06 / 286.3±7.08     0.43±0.12 / 0.7±0.14       0.78±0.03 / 0.6±0.02
Standardizing genes                0.67±0.16 / 19.8±7.71         0.75±0.07 / 249.8±11.43    0.76±0.05 / 2.3±0.37       0.63±0.17 / 2.2±0.04

Note: This table shows the accuracy and fitting time of each method after normalizing the data for predicting the vital status. For this part, we only analyze the gene expression data sets of samples from PTs.
Dimensionality reduction

Table 2 shows the performance of the supervised algorithms immediately after normalizing the data set. However, for a high-dimensional data set, one common approach is to reduce the dimensionality of the problem to improve the performance of ML algorithms. The idea is to replace a large number of variables with a smaller number that still represents the data set well. Here, we investigate the effect of dimensionality reduction methods, namely principal components analysis (PCA) and removing features with low variance, on the performance of the two best-performing learning algorithms, the support vector machine (SVM) and Bernoulli naive Bayes, and two algorithms that performed poorly, nearest centroid and Gaussian naive Bayes. We first normalize the data set, then reduce its dimension and finally apply the supervised learning algorithms. Figure 6 shows that applying PCA does not change the accuracy of the SVM; however, the Bernoulli naive Bayes model reaches the accuracy of the SVM, 78%, for all normalization methods. Importantly, Gaussian naive Bayes reaches the maximum accuracy of 78% when the dimension of the data set is reduced to 1, while the nearest centroid algorithm never reaches an accuracy of 78%. Moreover, after using PCA, the Gaussian naive Bayes model performs better when files are normalized rather than genes. Figure 7 shows the performance of the four ML algorithms after removing genes with a variance less than a given threshold. We choose thresholds based on the variance of the genes in the normalized data set. Again, the Gaussian naive Bayes model reaches an accuracy of 78% if the data set is normalized, while the nearest centroid model never reaches that accuracy. Importantly, scaling genes to the unit length reduces the accuracy of the Bernoulli naive Bayes model.
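A minimal sketch of how such a PCA-then-classify pipeline can be assembled with scikit-learn, assuming X_norm and y hold the normalized expression matrix and the vital status labels; this is an illustration, not the authors' exact script:

```python
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

def pca_then_classify(X_norm, y, n_components):
    # Reduce the normalized data to n_components principal components,
    # then classify; cross-validate the whole pipeline.
    model = make_pipeline(PCA(n_components=n_components), GaussianNB())
    cv = cross_validate(model, X_norm, y, scoring='accuracy', cv=10)
    # fit_time covers both the PCA step and classifier training in each fold.
    return cv['test_score'].mean(), cv['fit_time'].mean()
```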
Figure 6. Performance of supervised learning algorithms after applying PCA. This figure shows the accuracy (A) and fitting time (B) of four supervised learning models. In each of these models, we first reduce the dimension of the data set using PCA, and then we apply the supervised learning algorithm. Fitting time includes both the time spent on dimensionality reduction and the training time of the chosen supervised algorithm.

Figure 7. Performance of supervised learning algorithms after removing low-variance genes. This figure shows the accuracy (A) and fitting time (B) of four supervised learning models. In each of these models, we first reduce the dimension of the data set by removing the genes with a variance lower than the given threshold, and then we apply the supervised learning algorithm. We choose thresholds according to the variance of the genes in the normalized data set, $\mathrm{var}(g_j)$, where $g_j$ is the normalized value of gene $j$; the thresholds are $0$, $\mathrm{avg}(\mathrm{var}(g_j))$, $\frac{\mathrm{avg}(\mathrm{var}(g_j)) + \max(\mathrm{var}(g_j))}{2}$ and $0.9 \times \max(\mathrm{var}(g_j))$. Fitting time includes both the time spent on dimensionality reduction and the training time of the chosen supervised algorithm.

Discussion

One of the main questions in analyzing data sets is 'do we need to normalize the data, and if so, what should be the strategy and method?'. Here, we try to answer this question for analyzing RNA-seq data sets. We compared six different normalization strategies for analyzing HTSeq-FPKM-UQ data sets. Table 3 shows the statistical changes in the data set after applying each of the abovementioned normalization methods.
On average, the SD of the genes' values in each file is around 10 times higher than the SD of each gene's values in the unnormalized data set (see Columns 3 and 4 in Table 3). Standardizing genes and scaling genes to the unit length might not be a correct normalization approach, because they change the statistical structure of the data set: standardizing genes makes the average SD of the genes higher than the average SD of the values in each file, and scaling genes to the unit length makes these two values approximately equal.

Table 3. Mean and variance of data

Normalization method               avg(avg(g_j)) = avg(avg(P_i))   avg(std(g_j))   avg(std(P_i))   cor(EGFR, KRAS)
Raw data                           142 069.91                      136 066.44      2 784 854.65    0.226
Scaling files                      0.000599                        0.000522        0.0092737       0.51
Scaling genes                      0.075386                        0.084623        0.131425        0.226
Scaling files to the unit length   0.000258                        0.000205        0.004150        0.43
Scaling genes to the unit length   0.017967                        0.036806        0.03099         0.226
Standardizing files                0                               0.060801        1.0             0.23
Standardizing genes                0                               1.0             0.769104        0.23

Note: In this table, $P_i = [g_{i,j},\ 1 \le j \le m]$ is the gene expression data in file $i$, and $g_j = [g_{i,j},\ 1 \le i \le n]$ is the expression of gene $j$ across the entire data set. Thus, $\mathrm{avg}(\mathrm{avg}(P_i)) = \mathrm{avg}([\bar{P}_i,\ 1 \le i \le n])$ and $\mathrm{avg}(\mathrm{std}(P_i)) = \mathrm{avg}([\tilde{P}_i,\ 1 \le i \le n])$, where $\bar{P}_i = \mathrm{avg}([g_{i,j},\ 1 \le j \le m])$ and $\tilde{P}_i = \mathrm{std}([g_{i,j},\ 1 \le j \le m])$. Similarly, $\mathrm{avg}(\mathrm{avg}(g_j)) = \mathrm{avg}([\bar{g}_j,\ 1 \le j \le m])$ and $\mathrm{avg}(\mathrm{std}(g_j)) = \mathrm{avg}([\tilde{g}_j,\ 1 \le j \le m])$, where $\bar{g}_j = \mathrm{avg}([g_{i,j},\ 1 \le i \le n])$ and $\tilde{g}_j = \mathrm{std}([g_{i,j},\ 1 \le i \le n])$.

Additionally, we examined the performance of 12 ML models in predicting the vital status of patients with colon adenocarcinoma using HTSeq-FPKM-UQ data sets (Table 2).
Based on the accuracies of the 12 implemented supervised classification algorithms, the worst normalization strategy is standardizing the values of each gene. Moreover, quadratic discriminant and nearest centroid have the worst performance in terms of accuracy, independent of the normalization method. The Bernoulli naive Bayes model after standardizing files has the best overall performance [maximum accuracy (78%) with minimum fitting time], and the SVM with the radial basis function (RBF) kernel has the maximum accuracy (78%) independent of the normalization strategy. It is not surprising that the SVM works better than the other chosen algorithms, because the number of features (genes) is much greater than the number of samples (files) [15, 16].

To investigate further, we reduced the dimension of the normalized data set and then applied the supervised learning algorithms. Dimensionality reduction does not increase the maximum accuracy of 78%, which some ML algorithms already achieve on the entire data set. However, it helps the Gaussian naive Bayes model, which performs poorly on the entire data set, to reach the maximum accuracy of 78%. As PCA increases the fitting time while removing features with low variance decreases the training time, the optimal approach could be to reduce the dimension of the data set by removing low-variance genes.

Importantly, the dimensionality reduction shows that the expression level of one single gene, 7SK RNA, can be used to predict the vital status of patients with colon adenocarcinoma with an accuracy of 78%. If we standardize files, the average normalized level of 7SK RNA is 8.16 for surviving patients, while, after excluding the single outlier (a deceased female with a high level of 7SK RNA), this average is 0.07 for deceased patients (Figure 8). The intervals of the normalized values of 7SK RNA for surviving females and males are (−0.08, 225.79) and (−0.09, 204.20), respectively, while the ranges for deceased female and male patients, excluding the single outlier, are (−0.10, 0.71) and (−0.06, 1.12), respectively.

Figure 8. Normalized value of 7SK RNA. This figure shows the value of 7SK RNA after standardizing each file separately. The squares are the means of each group, and the lines show the mean plus SD. There are two squares (means) for the female-dead group; the top one includes the single outlier, which has a high level of 7SK RNA, while the bottom one is the mean after excluding that single outlier.

In summary, we should normalize HTSeq-FPKM-UQ data sets before applying ML algorithms to reach the best performance, and the best normalization strategy depends entirely on the ML model. As the Bernoulli naive Bayes model after standardizing files has the best performance, one might conclude that the best way to normalize HTSeq-FPKM-UQ data sets is to independently standardize each file. We reach this conclusion mainly because the first step in the Bernoulli naive Bayes model is binarizing the data set to 0 and 1, and we set the threshold to 0. Therefore, if we use any rescaling normalization method or the raw data, this model assigns 0 to genes that have not been expressed and 1 to genes that have a non-zero value in the file. However, if we standardize the data set, this model assigns 0 to genes with a negative z-score and 1 to genes with a positive z-score. As the best performance is achieved when each file is standardized, the best way to analyze gene expression values (or to characterize genes as expressed versus non-expressed) might be to independently standardize each file.
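This binarization behaviour is easy to see on a toy example; the following sketch uses scikit-learn's binarize helper on a hypothetical five-gene file:

```python
import numpy as np
from scipy.stats import zscore
from sklearn.preprocessing import binarize

p = np.array([[0.0, 3.1, 120.5, 0.7, 55.0]])    # one hypothetical file

# On raw or rescaled data, threshold 0 marks expressed vs non-expressed genes.
print(binarize(p, threshold=0))                 # [[0. 1. 1. 1. 1.]]

# After standardizing the file, threshold 0 marks genes above vs below
# the file's own mean expression.
print(binarize(zscore(p, axis=1), threshold=0)) # [[0. 0. 1. 0. 1.]]
```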
Methods

We collected the transcriptome profiling (RNA-seq) data of patients with colon adenocarcinoma from the Genomic Data Commons (GDC) data portal. The data set includes 487 HTSeq-FPKM-UQ files for PTs and 41 for solid tissue normals (N). As there are 60 483 genes, the dimensions of the PT and N data sets are 60 483 × 487 and 60 483 × 41, respectively. First, we exclude the genes whose values are 0 across the entire data set, which reduces the number of genes from 60 483 to 57 813. We represent the data set as a matrix $D = [g_{i,j}]$, where each row $p_i = [g_{i,j},\ 1 \le j \le 57813]$ represents the gene expression values of patient $i$. Figure 1A shows the sorted averages of the genes' values $\bar{g}_j = \mathrm{mean}([g_{i,j},\ 1 \le i \le n])$, where $n$ is the number of files. To better visualize the data, we divide each data set (PT and N) into four categories based on the patients' vital status and gender (Table 1). We apply three different normalization methods, and for each method, we use two different strategies: normalizing patients or genes. In summary, we normalize the data using the following techniques.

Normalization methods

Scaling files

The first method rescales the data by dividing the gene expression values in each file by the maximum gene expression value in that file. More precisely, we scale the range of the genes' values into [0, 1] for each file separately, obtaining the normalized matrix
$$\hat{D} = [\hat{g}_{i,j}] = \left[\frac{g_{i,j}}{\max([g_{i,j},\ 1 \le j \le 57813])}\right].$$

Scaling genes

Another scaling method rescales the data by dividing the values of each gene by the maximum value of that gene in the whole data set. In other words, we scale the range of each gene's values into [0, 1] separately, obtaining
$$\hat{D} = [\hat{g}_{i,j}] = \left[\frac{g_{i,j}}{\max([g_{i,j},\ 1 \le i \le n])}\right].$$

Scaling files to the unit length

One of the most common normalization methods is rescaling the data to unit length. To do that, we divide the gene expression values in each file by the L2 norm of the values in that file, obtaining
$$\hat{D} = [\hat{g}_{i,j}] = \left[\frac{g_{i,j}}{\lVert [g_{i,j},\ 1 \le j \le 57813] \rVert_2}\right].$$

Scaling genes to the unit length

Instead of scaling each file, we can separately rescale the values of each gene to unit length by calculating
$$\hat{D} = [\hat{g}_{i,j}] = \left[\frac{g_{i,j}}{\lVert [g_{i,j},\ 1 \le i \le n] \rVert_2}\right].$$

Standardizing files

Another common approach assumes the data have a normal distribution. We standardize each file separately by obtaining z-scores for each file:
$$\hat{D} = [\hat{g}_{i,j}] = [\text{z-score}([g_{i,j},\ 1 \le j \le 57813])].$$

Standardizing genes

Another standardization method normalizes the values of each gene separately by obtaining z-scores for the values of each gene:
$$\hat{D} = [\hat{g}_{i,j}] = [\text{z-score}([g_{i,j},\ 1 \le i \le n])].$$
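For reference, the six strategies can be written compactly in numpy; this is a minimal sketch, not the authors' code, and it assumes D is the files-by-genes matrix defined above:

```python
import numpy as np

def normalize(D, method, per):
    """D: (n_files x n_genes) matrix; per='file' works row-wise and
    per='gene' column-wise, matching the six strategies described above."""
    axis = 1 if per == 'file' else 0
    if method == 'scaling':        # divide by the maximum
        return D / D.max(axis=axis, keepdims=True)
    if method == 'unit_length':    # divide by the L2 norm
        return D / np.linalg.norm(D, axis=axis, keepdims=True)
    if method == 'zscore':         # subtract the mean, divide by the SD
        return (D - D.mean(axis=axis, keepdims=True)) / D.std(axis=axis, keepdims=True)
    raise ValueError(f'unknown method: {method}')

# e.g. 'standardizing files' is normalize(D, 'zscore', per='file')
```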
Supervised learning methods for classification

We used the scikit-learn package in Python to implement the ML algorithms that predict the vital status of patients with colon adenocarcinoma. We compared the performance of the following 12 supervised learning algorithms after applying the six abovementioned normalization methods. To obtain the accuracy and fitting time of each algorithm, we used the cross_validate function in sklearn.model_selection with scoring='accuracy' and cv=10. In the cross_validate function, the training set is split into k (here k = 10) smaller sets. For each of the k 'folds', the model is trained using k−1 of the folds as training data, and the resulting model is validated on the remaining part of the data as a test set to compute the accuracy of the model. For more information about the implemented algorithms, please visit http://scikit-learn.org/stable/supervised_learning.html#supervised-learning.
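The evaluation protocol just described amounts to a single cross_validate call per model; a sketch, with X and y assumed to hold the normalized matrix and the vital status labels:

```python
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# X: normalized (n_files x n_genes) matrix; y: vital status labels.
cv = cross_validate(SVC(kernel='rbf'), X, y, scoring='accuracy', cv=10)
print(cv['test_score'].mean(), '+/-', cv['test_score'].std())  # accuracy
print(cv['fit_time'].mean(), '+/-', cv['fit_time'].std())      # fitting time
```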
Support vector machine

To classify the data set using an SVM algorithm, we imported support vector classification (SVC) from sklearn.svm, and we used the default values for the parameters of the SVC function: SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None).

Stochastic gradient descent

We imported SGDClassifier from sklearn.linear_model, and we used the default setting: SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, class_weight=None, warm_start=False, average=False, n_iter=None). Note that SGDClassifier with loss='hinge' fits a linear SVM with stochastic gradient descent (SGD) learning.

Linear discriminant

Linear discriminant analysis, a classifier with a linear decision boundary, uses Bayes' rule. To apply this method, we imported LinearDiscriminantAnalysis from sklearn.discriminant_analysis, and we used the default values for the parameters of this function: LinearDiscriminantAnalysis(solver='svd', shrinkage=None, priors=None, n_components=None, store_covariance=False, tol=0.0001).

Quadratic discriminant

To implement quadratic discriminant analysis, which uses Bayes' rule and classifies with a quadratic decision boundary, we imported QuadraticDiscriminantAnalysis from sklearn.discriminant_analysis. We used the default values for the parameters of this function: QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0, store_covariance=False, tol=0.0001, store_covariances=None).

Logistic regression

We implemented logistic regression, a linear classifier that models class probabilities with the logistic function, using the function sklearn.linear_model.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1).

Decision tree

To implement the decision tree classification algorithm, we imported DecisionTreeClassifier from sklearn.tree, and we used the default function DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False).

Nearest centroid

To implement the NearestCentroid classifier, which represents each class by the centroid of its members, we used the function NearestCentroid(metric='euclidean', shrink_threshold=None) in sklearn.neighbors.

Gaussian processes

We used the function GaussianProcessClassifier(kernel=None, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=1) in sklearn.gaussian_process to implement Gaussian processes for probabilistic classification.

Neural networks

To implement a neural network with two hidden layers of sizes 5 and 2 that trains using back-propagation, we used the function sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(5, 2), activation='logistic', solver='lbfgs', alpha=1e-5, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=1, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08).

Gradient boosting

To classify the data via gradient-boosted regression trees, we applied the function sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto').

Gaussian naive Bayes

To implement the Gaussian naive Bayes algorithm, which assumes that the likelihood of the features is Gaussian, we used the function sklearn.naive_bayes.GaussianNB(priors=None).

Bernoulli naive Bayes

We implemented the Bernoulli naive Bayes algorithm, which assumes the features are binary-valued (Bernoulli, boolean), using the function sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None). As our data set is not binary, the function first binarizes the input data set and then implements the Bernoulli naive Bayes algorithm.

Dimensionality reduction

We investigated the effect of dimensionality reduction methods on the performance of the two best-performing learning algorithms, SVM and Bernoulli naive Bayes, and two models that performed poorly, nearest centroid and Gaussian naive Bayes. For each of these four ML algorithms, we first normalized the data set, then used one of the following dimensionality reduction techniques and finally applied the supervised learning algorithm.

Principal components analysis

One of the most common methods for reducing the dimension of a data set is PCA. To implement PCA, we imported the PCA(n_components=m) function from sklearn.decomposition. To investigate the effect of the number of components on the performance of the learning algorithms, we varied m from 1 to 40.

Removing all low-variance features

Another method for reducing the number of features is removing all low-variance features. To implement this method, we imported the function VarianceThreshold(threshold=θ) from sklearn.feature_selection to remove genes with a variance lower than the threshold θ. We chose the value of θ based on the variance of the genes in the normalized data set, $\mathrm{var}(g_j)$, where $g_j$ is the normalized value of gene $j$. We examined the results for $\theta = 0$, $\mathrm{avg}(\mathrm{var}(g_j))$, $\frac{\mathrm{avg}(\mathrm{var}(g_j)) + \max(\mathrm{var}(g_j))}{2}$ and $0.9 \times \max(\mathrm{var}(g_j))$. Note that we cannot apply this approach after standardizing genes, because standardizing genes makes the variances of all genes equal to 1.
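A sketch of how the four thresholds can be computed and applied, under the same assumptions as before (X_norm is the normalized files-by-genes matrix):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

def variance_filter(X_norm):
    g_var = X_norm.var(axis=0)           # variance of each normalized gene
    thresholds = [0,
                  g_var.mean(),
                  (g_var.mean() + g_var.max()) / 2,
                  0.9 * g_var.max()]
    for t in thresholds:
        X_red = VarianceThreshold(threshold=t).fit_transform(X_norm)
        print(t, X_red.shape[1])         # number of genes kept at this threshold
```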
Key Points

- This work shows the importance of normalizing HTSeq-FPKM-UQ data sets before applying ML algorithms, and it shows that the best normalization strategy depends on the ML model.
- The worst normalization method is standardizing the values of each gene separately, because it changes the statistical structure of the data set and leads to poor performance for most of the supervised learning algorithms.
- SVM has the maximum accuracy (78%) among all implemented ML algorithms in predicting the vital status of patients, regardless of the normalization method.
- The Bernoulli naive Bayes model on standardized files has the best performance (maximum accuracy as well as minimum fitting time) among all implemented models in predicting the vital status of patients with colon cancer.
- The expression level of one single gene, 7SK RNA, can be used to predict the vital status of patients with colon adenocarcinoma with an accuracy of 78%.

Funding

This work was supported in part by the Mathematical Biosciences Institute and the National Science Foundation (grant number DMS 1440386).

Leili Shahriyari, PhD, is a postdoctoral fellow at the Mathematical Biosciences Institute (MBI) at The Ohio State University. Her research interests include bioinformatics, computational biology, mathematical oncology, stochastic processes and cancer.

References

1 Astorino A, Gorgone E, Gaudioso M, et al. Data preprocessing in semi-supervised SVM classification. Optimization 2011;60(1–2):143–51.
2 Toth MJ, Goran MI, Ades PA, et al. Examination of data normalization procedures for expressing peak VO2 data. J Appl Physiol 1993;75(5):2288–92.
3 Sola J, Sevilla J. Importance of input data normalization for the application of neural networks to complex industrial problems. IEEE Trans Nucl Sci 1997;44(3):1464–8.
4 Vemuri P, Gunter JL, Senjem ML, et al. Alzheimer's disease diagnosis in individual subjects using structural MR images: validation studies. Neuroimage 2008;39(3):1186–97.
5 Quackenbush J. Microarray data normalization and transformation. Nat Genet 2002;32:496–501.
6 Sultan M, Schulz MH, Richard H, et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 2008;321(5891):956–60.
7 Marioni JC, Mason CE, Mane SM, et al. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008;18(9):1509–17.
8 Cloonan N, Forrest ARR, Kolle G, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods 2008;5(7):613–19.
9 Lin Y, Golovnina K, Chen Z, et al. Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC Genomics 2016;17(1):28.
10 Dillies MA, Rau A, Aubert J, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 2013;14(6):671–83. doi: 10.1093/bib/bbs046.
11 Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010;11(3):R25.
12 Mortazavi A, Williams BA, McCue K, et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008;5(7):621–8. doi: 10.1038/nmeth.1226.
13 Lièvre A, Bachet JB, Le Corre D, et al. KRAS mutation status is predictive of response to cetuximab therapy in colorectal cancer. Cancer Res 2006;66(8):3992–5.
14 Misale S, Yaeger R, Hobor S, et al. Emergence of KRAS mutations and acquired resistance to anti-EGFR therapy in colorectal cancer. Nature 2012;486:532–6.
15 Guyon I, Boser B, Vapnik V. Automatic capacity tuning of very large VC-dimension classifiers. Adv Neural Inform Process Syst 1993;5:147–55.
16 Joachims T. Text categorization with support vector machines: learning with many relevant features. In: Nédellec C, Rouveirol C (eds) Machine Learning: ECML-98. Lecture Notes in Computer Science, vol. 1398. Berlin, Heidelberg: Springer, 1998, 137–42. doi: 10.1007/BFb0026683.

© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com