Access the full text.
Sign up today, get DeepDyve free for 14 days.
Motivation: Gene expression data represents a unique challenge in predictive model building, be- cause of the small number of samples (n) compared with the huge amount of features (p). This ‘n p’ property has hampered application of deep learning techniques for disease outcome classi- ﬁcation. Sparse learning by incorporating external gene network information could be a potential solution to this issue. Still, the problem is very challenging because (i) there are tens of thousands of features and only hundreds of training samples, (ii) the scale-free structure of the gene network is unfriendly to the setup of convolutional neural networks. Results: To address these issues and build a robust classiﬁcation model, we propose the Graph- Embedded Deep Feedforward Networks (GEDFN), to integrate external relational information of features into the deep neural network architecture. The method is able to achieve sparse connection between network layers to prevent overﬁtting. To validate the method’s capability, we conducted both simulation experiments and real data analysis using a breast invasive carcinoma RNA-seq dataset and a kidney renal clear cell carcinoma RNA-seq dataset from The Cancer Genome Atlas. The resulting high classiﬁcation accuracy and easily interpretable feature selection results suggest the method is a useful addition to the current graph-guided classiﬁcation models and feature selection procedures. Availability and implementation: The method is available at https://github.com/yunchuankong/ GEDFN. Contact: email@example.com Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction predictors, solved by modern machine learning algorithms such as regression based methods (Algamal and Lee, 2015; Liang et al., In recent years, more and more studies attempt to link clinical out- 2013), support vector machines (Vanitha et al.,2015), random forests comes, such as cancer and other diseases, with gene expression or (Cai et al.,2015; Kursa, 2014) and neural networks (Chen et al., other types of profiling data. It is of great interest to develop new 2014). While these methods are aimed at achieving accurate classifica- computational methods to predict disease outcomes based on profil- tion performance, major efforts have also been put on selecting sig- ing datasets that contain tens of thousands of variables. The major nificant genes that effectively contribute to the prediction (Cai et al., challenges in these data lie in the heterogeneity of the samples, and the sample size being much smaller than the number of predictors 2015; Kursa, 2014). However, feature selection is based on fitted pre- (genes), i.e. the n p issue, as well as the complex correlation struc- dictive models and is conducted after parameter estimation, which ture between the predictors. Thus the prediction task has been for- causes the selection to rely on the classification methods rather than mulated as a classification problem combined with selection of the structure of the feature space itself. Beside building robust V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: firstname.lastname@example.org 3727 3728 Y.Kong and T.Yu predictive models, the feature selection also serves another important connected network architectures with newly defined convolution purpose—the functionality of the selected features (genes) can help operations on graph-structured data. The methods have novel math- unravel the underlying biological mechanisms of the disease outcome. ematical formulation; however, the applications are yet to be gener- Given the nature of the data, i.e. functionally associated genes alized. In both of the two papers, by using the two benchmark tend to be statistically dependent and contribute to the biological datasets MINST (LeCun et al., 1998) and ImageNet (Russakovsky outcome in a synergistic manner, a branch of gene expression classi- et al., 2015), respectively, the authors have treated 2D grid images fication research has been focused on integrating the relations be- as a special form of graph-structured data in their experiments. This tween genes with classification methods, which helps in terms of is based on the fact that an image can be regarded as a graph in both classification performance as well as learning the structure of which each pixel is a vertex connected with four neighbors in the feature space. A critical data source to achieve this goal has been four directions. However, graph-structured data can be much more gene networks. A gene network is a graph-structured dataset with complex in general, as the degree of each vertex can vary widely, genes as the graph vertices and their functional relations as graph and the edges do not have orientations as in image data. For a gene edges. The functional relations are largely curated from existing bio- network, the degree of vertices is power-law distributed as the net- logical knowledge (Chowdhury and Sarkar, 2015; Szklarczyk and work is scale-free (Kolaczyk, 2009). In this case, convolution opera- Jensen, 2015). Each vertex in the network corresponds to a predict- tions are not easy to define. In addition, with tens of thousands of or in the classification model. Thus, it is expected that the gene net- vertices in the graph, applying multiple convolution operations work can provide useful information for a learning process where results in huge number of parameters, which easily leads to over- genes serve as predictors. Motivated by this fact, certain procedures fitting given the small number of training samples. By taking an al- have been developed where gene networks are employed to conduct ternative approach of modifying a usual DFN, our newly proposed feature selection prior to classification (Chuang et al., 2007; Li and graph-embedded DFN can serve as a convenient tool to fill the gap. Li, 2008; Wang et al., 2007; Wei and Pan, 2008). Moreover, meth- It avoids over-fitting in the n p scenario, as well as achieves good ods that integrate gene network information directly into classifiers feature selection results using the structure of the feature space. have also been developed. For example, Dutkowski and Ideker The article is organized as follows: Section 2 reviews usual deep (2011) propose the random forest-based method, where the feature feedforward networks (DFNs) and illustrates our graph-embedded sub-sampling is guided by graph search on gene networks when con- architecture. Section 3.1 compares the performance of our method structing decision trees. Lavi et al. (2012) and Zhu et al. (2009) with other approaches using synthetic datasets, followed by the real modify the objective function of the support vector machine with applications of two RNA-seq datasets in Section 3.2. Finally, con- penalty terms defined according to pairwise distances between genes clusions are presented in Section 4. in the network. Similarly, Kim et al. (2013) develops logistic regres- sion based classifier using regularization, where again a relational penalty term is introduced in the loss function. The authors of these methods have demonstrated that embedding expression data into gene network results in both better classification performance and 2 Materials and methods more interpretable selected feature sets. 2.1 Deep feedforward networks With the clear evidence that gene networks can lead to novel var- A DFN ( or DNN, multilayer perceptron) with l hidden layers has a iants of traditional classifiers, we are motivated to incorporate gene standard architecture networks with deep feedforward networks (DFN), which is closely related to the state-of-the-art technique deep learning (LeCun et al., PrðÞ yjX; h¼ fðÞ Z W þ b out out out 2015). Although nowadays deep learning has been constantly shown Z ¼ rðÞ Z W þ b out l l l to be one of the most powerful tools in classification, its application .. . in bioinformatics is limited (Min et al., 2017). This is due to many reasons including the n p issue, the large heterogeneity in cell Z ¼ rðÞ Z W þ b kþ1 k k k populations and clinical subject populations, as well as inconsistent .. . data characteristics across different laboratories, resulting in diffi- Z ¼ rðÞ XW þ b ; 1 in in culties merging datasets. Consequently, the relatively small number np of samples compared with the large number of features in a gene ex- where X 2R is the input data matrix with n samples and p pression dataset obstructs the use of deep learning techniques, where n features, y 2R is the outcome vector containing classification labels, the training process usually requires a large amount of samples such h denotes all the parameters in the model, Z and Z ; k ¼ 1; .. . ; l out k as in image classification (Russakovsky et al., 2015). Therefore, 1 are hidden neurons with corresponding weight matrices there is a need to modify deep learning models for disease outcome W ; W bias vectors b ; b . The dimensions of Z and W depend out k out k classification using gene expression data, which naturally leads us to on the number of hidden neurons h and h ; k ¼ 1; .. . ; l,as well as in the development a variant of deep learning models specifically fit- the input dimension p and the number of classes h for out ting the practical situation with the help of gene networks. classification problems. In this paper, we mainly focus on binary clas- Incorporating gene networks as relational information in the fea- sification problems hence the elements of y simply take binary values ture space into DFN classifiers is a natural option to achieve sparse and h 2. rðÞ is the activation function such as sigmoid, hyper- out learning with less parameters compared with the usual DFN. bolic tangent (tanh) or rectifiers. fðÞ is the softmax function convert- However, to the best of our knowledge, few existing work has been ing values of the output layer into probability prediction i.e. done on this track. Bruna et al. (2014) and Henaff et al. (2015) i1 started the direction of sparse deep neural networks (DNNs) for p ¼ fðÞ l ¼ i1 l l i0 i1 e þ e graph-structured data. The authors developed hierarchical locally Graph-embedded deep feedforward networks 3729 where have their corresponding hidden neurons in the first hidden layer. A feature can only feed information to hidden neurons that correspond p : ¼ PrðÞ y ¼ 1jx i i i to features connecting to it in the feature graph. hi ðoutÞ ðoutÞ ðoutÞ Specifically, let x ¼ x ; .. . ; x ; i ¼ 1; .. . ; n be any instance l :¼ z w þ b i i1 ip i0 i 0 0 (one row) of the input matrix X, in the usual DFN, the first hidden hi ðoutÞ ðoutÞ ðoutÞ l :¼ z w þ b ; layer of this instance is calculated as i1 i 1 1 0 1 "# p p X X for binary classification where i ¼ 1; .. . ; n. ðÞ 1 ðÞ in ðÞ in ðÞ in ðÞ in @ A z ¼ r x w þ b ; .. . ; x w þ b ; ij ij i 1j 1 h j h in in The parameters to be estimated in this model are all the weights j¼1 j¼1 and biases. For a training dataset given true labels, the model is ðÞ 1 ðÞ in ðÞ in trained using a stochastic gradient decent (SGD) based algorithm where z is the i-th row of Z , and w ; b ; k ¼ 1; .. . ; h are the 1 in kj k (Goodfellow et al., 2016) by minimizing the cross-entropy loss weight and bias for this layer. Now in our model, h ¼ p and each in ðÞ in function w is multiplied by an indicator function i.e. kj n n o p p T X X ð1Þ ðinÞ ðinÞ ðinÞ ðinÞ b b LðÞ h ¼ y log p þðÞ 1 y log 1 p ; i i i i z ¼r x w IðA ¼1Þþb ;...; x w IðA ¼1Þþb : ij 1j ij pj p n i 1j 1 pj i¼1 j¼1 j¼1 where again h denotes all the model parameters, and p is the fitted Therefore, the feature graph helps achieve sparsity for the connec- value of p . More details about DFN can be found in Goodfellow tion between the input layer and the first hidden layer. et al. (2016). 2.3 Evaluation of feature importance 2.2 Graph-embedded DFNs Beside improving classification, it is also of great interest to find fea- Our newly proposed DNN model is based on two main assump- tures that significantly contribute to the classification, as they can tions. The first assumption is that neighboring features on a known reveal the underlying biological mechanisms. Therefore, for scale-free feature network or feature graph (Since in this paper we interchangeably discuss feature networks and neural networks, to GEDFN, we also develop a feature ranking method according to a relative importance score. The idea is analogous to the connection avoid confusion, the equivalent term ‘graph’ is used to refer to the feature network from now on, while ‘networks’ naturally refer to weights (CWs) method introduced by Olden and Jackson (2002). neural networks.) tend to be statistically dependent. The second as- Extended from CW, we propose the graph connection weights sumption is that only a small number of features are true predictors (GCWs) method, which emphasizes the significance of the feature for the outcome, and the true predictors tend to form cliques in the graph in our newly proposed neural network architecture. feature graph. These assumptions have been commonly used and The main idea of GCW is that, the contribution of a specific vari- justified in previous works reviewed in Section 1. able is directly reflected by the magnitude of all the weights that dir- To incorporate the known feature graph information to DNN, ectly associated with the corresponding hidden neuron in the graph- we propose the graph-embedded deep feedforward network embedded layer (the first hidden layer). Summing over the absolute (GEDFN) model. The key idea is that, instead of letting the input values of the directly associated weights gives the relative import- layer and the first hidden layers to be fully connected, we embed the ance of the specific feature, i.e. feature graph in the first hidden layer so that a fixed informative X X 1 sparse connection can be achieved. ðÞ in ðÞ 1 s ¼ c jw I A ¼ 1 jþ jw j; (1) j kj j jm kj Let G ¼ðÞ V; E be a known graph of p features, with V the col- k¼1 m¼1 lection of p vertices and E the collection of all edges connecting ver- tices. A common representation of a graph is the corresponding c ¼ min c= I A ¼ 1 ; 1 ; j ¼ 1; .. . ; p; (2) adjacency matrix A. Given a graph G with p vertices, the adjacency j kj k¼1 A is a p p matrix with ðÞ ( in where s is the importance score for feature j, w denotes weights 1; if V and V are connected ; 8i; j ¼ 1; .. . ; p i j ðÞ 1 A ¼ between the input and first hidden layers, and w denotes weights ij 0; otherwise: between the first hidden layer and the second hidden layer. A con- stant c is imposed to penalize feature vertices with too many connec- In our case A is symmetric since the graph is undirected. Also, we re- tions, so that they will not be overly influential. In subsequent quire A ¼ 1 meaning each vertex is regarded to connecting itself. ii experiments, we take c ¼ 50. Now to mathematically formulate our idea, we construct the Note that the importance score consists of two parts according DNN such that the dimension of the first hidden layer (h )isthe in to Equation (1). The left term summarizes the importance of a fea- same as the original input i.e. h ¼ p, hence W has a dimension of in in ture according to the connection on the feature graph, coherent with p p. Between the input layer X and the first hidden layer Z , in- the property of the graph-embedded layer. The right term then sum- stead of fully connecting the two layers with Z ¼ rðÞ XW þ b , 1 in in we have marizes the contribution of the feature according to the connection to the hidden neurons in the next fully-connected layer. Input data Z ¼ rðÞ XWðÞ Aþ b 1 in in are required to be Z-score transformed (the original value minus the mean across all samples and then divided by the standard deviation) where the operation is the Hadamard (element-wise) product. before entered into the model, and this will guarantee all variables Thus, the connections between the first two layers of the feed- are of the same scale so that the magnitude of weights are compar- forward network are ‘filtered’ by the feature graph adjacency ma- trix. Through the one-to-one R : p ! p transformation, all features able. After training GEDFN, the importance scores for all the 3730 Y.Kong and T.Yu variables can be calculated using trained weights, which leads to a ranked feature list. 2.4 Detailed model settings For the choice of activation functions in GEDFN, the rectified linear unit (ReLU) (Nair and Hinton, 2010) with the form (in scalar case) r ðÞ x ¼ maxðÞ x; 0 ReLU is employed. This activation has an advantage over sigmoid and tanh as it can avoid the vanishing gradient problem (Hochreiter et al., 2001) during training using SGD. To train the model, we choose the Adam optimizer (Kingma and Ba, 2014), which is the most widely used variant of traditional gradient descent algorithms in deep learn- ing. Also, we use the mini-batch training strategy by which the opti- mizer randomly trains a small proportion of the samples in each iteration. Details about the Adam optimizer and mini-batch training can be found in Goodfellow et al. (2016) and Kingma and Ba (2014). The classification performance of a DNN model is associated with many hyper-parameters, including architecture-related parame- ters such as the number of layers and the number of hidden neurons in each layer, regularization-related parameters such as the dropout proportion, and model training-related parameters such as the learn- ing rate and the batch size. These hyper-parameters can be fine- tuned using advanced hyper-parameter training algorithm such as Bayesian Optimization (Mockus, 2012), however, as the hyper- Fig. 1. Network architecture of the GEDFN model for experiments in Sections parameters are not of primary interest in our work, in later sections, 3.1 and 3.2 we simply tune them using grid search in a feasible hyper-parameter space. A visualization of our tuned GEDFN model for simulation negatively correlated when we sample the expression values from and real data experiments is shown in Figure 1. More details of multivariate normal distribution using R as the variance–covariance hyper-parameter tuning can be found in Supplementary Section S1. matrix. Supplementary Figure S1 shows sample plots of the pairwise feature correlation distributions for the simulated data. 3 Results and discussion To generate outcome variables, we first select a subset of features to be the ‘true’ predictors. Following our assumptions mentioned in 3.1 Simulation experiments Section 2.2, we intend to select cliques in the feature graph. Among We conducted extensive simulation experiments to mimic disease vertices with relatively high degrees, part of them is randomly outcome classification using gene expression and network data, and selected as “cores”, and part of the neighboring vertices of cores are explored the performance of our new method in comparison with also selected. Denoting the number of true predictors as p , we sam- the usual DFN and other proven methods. Robustness was also ple a set of parameters b ¼ b ; .. . ; b and an intercept b with- 1 p tested by simulating datasets that did not fully satisfy the main 0 in a certain range. In our experiments, we first uniformly sample b’s assumptions. The method was applied to examine whether it could from (0.1, 0.2), and randomly turn some of the parameters into still achieve a reasonable performance. negative, so that we accommodate both positive and negative coeffi- cients. Finally, the outcome variable y is generated through a gener- 3.1.1 Synthetic data generation alized linear model framework For a given number of features p, we employed the preferential at- tachment algorithm proposed by Baraba ´ si and Albert (1999) to gen- 1 T Prðy ¼ 1jx Þ¼ g ðx b þ b Þ i i i erate a scale-free feature graph. The p p distance matrix D y ¼IðPrðy ¼ 1jx Þ > tÞ; i ¼ 1; .. . n; i i i recording pairwise distances among all vertices were then calculated. Next, we derived the covariance matrix by transforming the distan- where t is a threshold and gðÞ is the link function. We consider two ces between verticies R by letting cases of g ðÞ in our experiments, one is the sigmoid function, which is equivalent to the binary softmax and monotone ij R ¼ 0:7 ; i; j ¼ 1; ... ; p: ij g ðÞ x ¼ Here by convention the diagonal elements of D are all zeros mean- 1 þ e ing the distance between a vertex to itself is zero. and the other is a weighted tanh plus quadratic function, which is After simulating the feature graph and obtaining the covariance non-monotone matrix of features, we generate n multivariate Gaussian samples as the input matrix X ¼ðÞ x ; .. . ; x i.e. 1 n 1 2 g ðÞ x ¼ 0:7/ðÞ tanhxðÞþ 0:3/ x ; x NðÞ 0; R ; i ¼ 1; .. . n; where /ðÞ is the min_max function scaling the input to [0, 1]. where n p for imitating gene expression data. Using this setup, Following the above procedure, corresponding to the two cases vertices that are several steps away could naturally become of inverse link functions, we simulate two sets of synthetic datasets Graph-embedded deep feedforward networks 3731 with 5000 features and 400 samples. We compare our method with different methods by computing and comparing the AUC of the the usual DFN, the feature graph-embedded classification method precision-recall curves, which were constructed using the feature network-guided forest (NGF) (Dutkowski and Ideker, 2011) men- ranking of the models and the 0/1 vector indicating the true predict- tioned in Section 1, as well as the traditional logistic regression with or status of each feature. Figure 2d–f show the average precision- lasso (LRL) (Tibshirani, 1996). In gene expression data, the number recall AUC (error bars: the intervals of mean AUC plus/minus one of true predictors account for only a small proportion of the fea- standard error) for each simulation setting of the sigmoid case. We tures. Taking this aspect into consideration, we examine different found that DFN was not able to effectively rank features, resulting numbers, i.e. 40, 80, 120, 160 and 200, of true predictors, corre- in precision-recall AUC <0.05 for all the datasets, and thus they sponding to 2, 4, 6, 8 and 10 cores among all the high-degree verti- were not included in the plots. From Figure 2d–f, one can conclude ces in the feature graph. However, in reality, the true predictors may that GEDFN ranked features more effectively than NGF. not be perfectly distributed in the feature graph as cliques. Instead, LRL did not rank features but directly gave the selected feature some of the true predictors, which we call ‘singletons’, can be quite subset based on cross-validation. To compare feature selection be- scattered. To create this possible circumstance, we simulate three tween GEDFN and LRL, for each dataset, we fixed the precision of series of datasets with singleton proportions 0, 50 and 100% among GEDFN to be the same as LRL, and then compared their recall val- all the true predictors. Therefore, we investigate three situations ues. The recall plots (error bars: the intervals of mean recall plus/ where all true predictors are in cliques, half of the true predictors minus one standard error) for different simulation settings are are singletons, and all of the true predictors are scattered in the shown in Figure 2g–i. Again, it is evident that GEDFN achieved bet- graph, respectively. ter feature selection results. Simulation results for the case with the weighted tanh plus quad- ratic inverse link function are shown in Figure 3. From the first row 3.1.2 Simulation results and discussion of Figure 3, all the methods’ AUC decreased compared with their In our simulation studies, as shown in Figure 1, the GEDFN had counterparts in the case of sigmoid inverse link, as the non- three hidden layers, where the first hidden layer was the graph adja- monotone function brought more difficulty to classification. cency embedded layer. Thus the dimension of its output is the same However, GEDFN again outperformed the other methods in gen- as the input, namely 5 000. The second and third hidden layers had eral, and the difference between GEDFN and LRL was enlarged 64 and 16 hidden neurons, respectively, which are the same for the compared with the sigmoid function case since the non-monotone usual DFN. The number of the first layer hidden neurons in the inverse link was more challenging, and LRL was no longer the true usual DFN, 1024 neurons, was selected using grid search. model in this case. The second row and third row of Figure 3 indi- For each of the data generation settings, ten independent datasets cate GEDFN’s better feature selection than NGF and LRL across all were generated, and the GEDFN, DFN, NGF and LRL methods simulation settings in this case. DFN was again proved not to have were applied. For each simulated dataset, we randomly split the good feature selection capability through the experiment, with dataset into training and testing sets at a 4:1 ratio. The models were precision-recall AUC no more than 0.04. trained using the training dataset, and used to predict the class prob- The above simulation experiment results showed nice perform- abilities of the testing dataset. To evaluate the classification results, ance of GEDFN in both classification accuracy and feature selection receiver operating characteristic (ROC) curve was generated using in both the sigmoid case and the tanh plus quadratic case. The the predicted class probabilities and the true class membership of method was robust across different number of true predictors and the testing dataset, and the area under the curve (AUC) was calcu- different proportions of singletons in feature graphs. To further test lated. The final testing results were then averaged across the 10 the robustness of GEDFN, we considered cases that the known fea- datasets. Figure 2 shows the results of the case with the sigmoid inverse ture graph was completely misspecified, i.e. the graph structure link function. The error bars denote intervals of estimated mean bears misleading information with regard to feature correlation and AUC values plus/minus their standard errors. Corresponding to the true predictor location. This extreme situation is unlikely in applica- case that singleton proportion is 0%, Figure 2a shows GEDFN and tions. We employed the synthetic datasets used above with singleton LRL outperformed the other two methods. As the number of true proportion 50%, destroyed the true feature graphs, and re- predictors increased, all of the methods performed better as there constructed random feature graphs using the preferential attachment were more signals in the feature set. As the singleton proportion algorithm. The comparison of classification and feature selection be- increased to 0.5 (Fig. 2b), GEDFN was the best among the tween the GEDFN with correct feature graph and the GEDFN with four though the difference between GEDFN and LRL was not misspecified graphs is shown in Supplementary Figure S2. From the big. In Figure 2c, when the singleton proportion was increased to 1, results, misspecified feature graphs negatively affected GEDFN all of the methods performed worse, but GEDFN performed regarding both classification and feature selection. For classification, better than the others overall. The close results of GEDFN and LRL GEDFN was robust enough to obtain acceptable accuracies. In con- were expected, as in the sigmoid case LRL was in fact the true trast, feature selection was more influenced, which was expected as model. the feature ranking mechanism of GEDFN relied on the feature As for feature selection, GEDFN uses Equation (1) to rank fea- graph connections. tures. The feature ranking method for the usual DFN was similar to Another concern about the robustness of GEDFN is the reprodu- the one for GEDFN, except that for DFN each variable’s importance cibility of feature selection. For a fixed dataset, we were interested in whether a relatively stable set of important features would be was given only by the second term in Equation (1) that was to con- selected across different times of model fitting. To explore this, we sider only the weights connecting the input layer and the first hidden randomly chose a synthetic dataset with 40 true predictors, 50% layer. For NGF, the variable importance calculation based on cumu- singleton and sigmoid inverse link, and experimented GEDFN fea- lative reduction of Gini impurity in random forests (Breiman, 2001) could be directly applied. Therefore, knowing the true predictors for ture selection repeatedly for 10 times. Ten ranked feature lists were simulated data, we were able to compare feature selection results for obtained, and the top 40 ranked variables were selected for each 3732 Y.Kong and T.Yu (a)(b)(c) (d)(e)(f) (g)(h)(i) Fig. 2. Plots of the classiﬁcation and feature selection comparison for the case with the sigmoid inverse link function. Singleton proportions: left column 0%, mid- dle column 50%, right column 100%. First row: AUC of ROC for classiﬁcation; second row: AUC of precision-recall for feature selection; third row: recall plots given ﬁxed precision from LRL. Error bars represent the estimated mean quantity plus/minus the standard error experiment. Among the ten sets of 40 selected features, 19 features 3.2 Real data applications were repeatedly selected as top 40 over seven times, and they cov- 3.2.1 Breast invasive carcinoma data ered 40% of the 40 true predictors. Also, 70% of the union of the We applied our GEDFN method to the Cancer Genome Atlas 10 sets of top 40 features turned out to be relevant for prediction. (TCGA) breast cancer (BRCA) RNA-seq dataset (Koboldt et al., Here ‘relevant’ means a feature was either a true predictor, or a 2012). The dataset consisted of a gene expression matrix with neighbor of a true predictor in the feature graph, since in our simula- 20 532 genes of 707 cancer patients, as well as the clinical data con- tion settings, neighbors of true predictors can be useful in classifica- taining various disease status measurements. The gene network tion even if they were not chosen as true predictors themselves. This came from the HINT database (Das and Yu, 2012). We were inter- small specific experiment indicated the relative stable performance ested in the relation between gene expression and a molecular sub- of GEDFN feature selection. type of breast cancer—the tumor’s estrogen receptor (ER) status. Graph-embedded deep feedforward networks 3733 (a)(b)(c) (d)(e)(f) (g)(h)(i) Fig. 3. Plots of the classiﬁcation and feature selection comparison for the case with the weighted tanh plus quadratic inverse link function. Singleton proportions: left column 0%, middle column 50%, right column 100%. First row: AUC of ROC for classiﬁcation; second row: AUC of precision-recall for feature selection; third row: recall plots given ﬁxed precision from LRL. Error bars represent the estimated mean quantity plus/minus the standard error ER is expressed in more than 2/3 of breast tumors, and plays a critic- repeated experiments respectively. The computation time of al role in tumor growth (Sorlie et al., 2003). Elucidating the relation GEDFN was around 3 min each time on a workstation with dual between gene expression pattern and ER status can shed light on Xeon E5-2660 processors, 256 Gb RAM, and a single GTX Titan Xp GPU. The summary of test-set classification accuracies is seen in the subtypes of breast cancer and their specific regulations. After screening genes that were not involved in the gene network, Table 1. From the classification results, all the methods achieved ex- a total of 9 211 genes were used as the final feature set in our classi- cellent AUC scores, and we concluded that the dataset contained fication. For each gene, the expression value was Z-score strong signals for ER status. Thus, for this dataset, the improvement of incorporating feature graph regarding classification was limited, transformed. Using the HINT network architecture, we tested the four meth- as traditional methods already pushed the performance to the upper ods GEDFN, DFN, NGF and LRL on the BRCA data with 10 bound. 3734 Y.Kong and T.Yu Table 1. Classiﬁcation results for BRCA data Methods GEDFN DFN NGF LRL Mean AUC 0.945 0.938 0.922 0.940 SD 0.005 0.013 0.012 0.008 Table 2. Selected feature sub-graph analysis for BRCA data Methods GEDFN NGF LRL No. of connected components 3 4 80 Within-component average distance 3.181 3.169 1.700 Average distance 2.263 2.393 3.822 However, GEDFN exhibited advantages over other methods in terms of feature selection. To analyze the feature selection results for this dataset, we first averaged the importance scores across the ten repeated model trainings from GEDFN and NGF. DFN was proved not able to achieve good feature selection results in Section 3.1.2 and thus was excluded from this analysis. For LRL, the features selected over the ten times were quite stable with only one or two Fig. 4. Feature sub-graph selected by GEDFN for BRCA data different variables, hence we took the union of the 10 selected fea- ture sets as the feature selection result for LRL. In the end, selected Table 3. Top GO biological processes for the sub-graph selected by features from LRL and the top 1% ranked features from GEDFN GEDFN (BRCA data) and NGF were compared. They contained 89, 92 and 92 features, respectively. GOBPID P-value Term We invested the functional consistency of the selected features, GO: 0006914 5.02E-07 Autophagy as reflected by how close the selected features were in the original GO: 0045786 1.16E-05 Negative regulation of cell cycle feature graph. On the feature graph, which was based on protein– GO: 0030509 1.27E-05 BMP signaling pathway protein interaction (Das and Yu, 2012), functionally related genes GO: 2001233 1.74E-05 Regulation of apoptotic signaling pathway tend to be closer in distance. For each method, we extracted the sub- GO: 0006511 1.78E-05 Ubiquitin-dependent protein graph of the selected features from the entire feature graph, and catabolic process examined the connection of the sub-graph. A better feature selection GO: 0071363 3.01E-05 Cellular response to growth factor stimulus method was expected to choose features that fall into cliques of the GO: 0038061 5.56E-05 NIK/NF-kappaB signaling overall graph, resulting in fewer connected components in the GO: 0097576 5.97E-05 Vacuole fusion GO: 0071456 6.68E-05 Cellular response to hypoxia selected sub-graph. Table 2 shows the results of sub-graph analysis. GO: 2001020 1.69E-04 Regulation of response to DNA damage The first row is the number of connected components for each sub- stimulus graph. The second row is the within-component average distances in the sub-graph. The third row is the average distances in the entire Note: Manual pruning of partially overlapping GO terms was conducted. feature graph. From Table 2, one can see that features selected by GEDFN formed more closely connected sub-graphs (seen in Fig. 4), while NGF resulted in more scattered sub-graphs with 4 connected response and autophagy play a role in the development of anti- components. Features selected by LRL had no graph structure at all, estrogen therapy resistance in ERþ breast cancer (Cook and Clarke, with 89 features forming 80 connected components, meaning most 2014). of which were unconnected. The average distance in the entire fea- The second most significant term was ‘negative regulation of ture graph for GEDFN was smaller than that for NGF, indicating cell cycle’. ERa regulates the cell cycle by regulating the S and G2/ the closer relationship among genes selected by GEDFN. Although M phases in a ligand-dependent fashion (JavanMoghadam et al., 2016). Several of the top terms were signal transduction process. It the within-component average distance for LRL is the smallest, the large amount of connected components made this statistic meaning- has been long established that there are cross-talks between BMP less for LRL. and estrogen signaling, as well as between growth factor receptor Functional analysis of the genes selected by GEDFN was con- pathways and estrogen signaling (Osborne et al.,2005). BMPs are ducted by testing for enrichment of the gene ongoloty (GO) biologic- repressed by estrogen through ER signaling (Yamamoto et al., al processes using GOstats (Falcon and Gentleman, 2007). The 2002). NF-jBisacrucialplayerincancerinitiationand progres- results can be found in Table 3. Fifteen of the 92 selected genes be- sion. Direct binding to NF-jB is documented for p53 and ER long to the autophagy process, which was the most significant GO (Hoesel and Schmid, 2013). It exhibits differential function in ER- term. In addition, ‘regulation of apoptotic signaling pathway’ and and ERþ hormone-independent breast cancer cells (Gionet et al., ‘ubiquitin-dependent protein catabolic process’ were also among the 2009). top terms. Breast cancer cells that express ERa have a higher auto- The remaining top GO terms were related to stress response. phagic activity than cells that express ER-b and ER-cells (Felzen Breast cancer cells adapt to reduced oxygen concentrations by et al., 2015). It has been documented that the unfolded protein increasing levels of hypoxia-inducible factors. The increase of such Graph-embedded deep feedforward networks 3735 Table 4. GO enrichment analysis for features selected by GEDFN only (BRCA data) GOBPID P-value Term GO: 2001233 4.81E-06 Regulation of apoptotic signaling pathway GO: 0006511 1.12E-05 Ubiquitin-dependent protein catabolic process GO: 0030509 2.39E-05 BMP signaling pathway GO: 0071363 1.24E-04 Cellular response to growth factor stimulus GO: 0045786 1.89E-04 Negative regulation of cell cycle Note: Manual pruning of partially overlapping GO terms was conducted. Table 5. Classiﬁcation results for KIRC data Methods GEDFN DFN NGF LRL Mean AUC 0.743 0.643 0.521 0.698 SD 0.047 0.038 0.012 0.003 factors causes higher risk of metastasis (Gilkes and Semenza, 2013). Hypoxia inducible factors can influence the expression of ER in breast cancer cells (Wolff et al., 2017). Estrogen changes the DNA Fig. 5. Feature sub-graph selected by GEDFN for KIRC data damage response by regulating proteins including ATM, ATR, CHK1, BRCA1 and p53 (Caldon, 2014). Thus it is expected that DNA damage response is closely related to ER status. The full table Table 6. Top GO biological processes for the sub-graph selected by containing all GO terms for the functional analysis can be found in GEDFN (KIRC data) Supplementary Table S1. GOBPID P-value Term Finally, we analyzed the 69 genes that were only selected by GEDFN but not the other methods. The top five GO terms of this GO: 2001233 7.10E-10 Regulation of apoptotic signaling pathway feature set are listed in Table 4. Clearly these functions agree very GO: 0051098 1.14E-09 Regulation of binding GO: 0071363 1.27E-09 Cellular response to growth factor stimulus well with the biological processes based on all the selected genes GO: 0007178 1.48E-07 Transmembrane receptor protein listed in Table 3. Serine/threonine kinase signaling pathway GO: 1903827 2.27E-07 Regulation of cellular protein localization 3.2.2 Kidney renal clear cell carcinoma data GO: 0042176 5.72E-07 Regulation of protein catabolic process We also tested GEDFN on the kidney renal clear cell carcinoma GO: 0007507 1.66E-06 Heart development (KIRC) RNA-seq dataset from TCGA (Network et al., 2013). The GO: 0008285 1.72E-06 Negative regulation of cell proliferation dataset contained the gene expression matrix with 20 502 genes GO: 0048589 3.07E-06 Developmental growth GO: 0007183 3.52E-06 SMAD protein complex assembly from 537 subjects, as well as the clinical data including survival in- formation. The gene network again came from the HINT database. Note: Manual pruning of partially overlapping GO terms was conducted. For KIRC, We tried to study the relation between gene expression and the five-year survival outcome, which was a much more difficult obtained 86 top 1% important features that fall into 3 connected task compared with cancer subtypes. After screening genes that components, with an average within-component distance of 3.111, were not involved in the gene network, a total of 8630 genes were and an average distance in the entire feature graph of 2.257. In total used as the final feature set in our classification. For each gene, the 30 of the 86 genes overlap with the top genes in the breast cancer expression value was again Z-score transformed. study, which was not a surprise given both datasets are based on As in Section 3.2.1, we again tested the four methods GEDFN, tumor tissues. DFN, NGF and LRL on the KIRC data with ten repeated experi- The sub-graph of top 1% of genes selected by GEDFN is shown ments, respectively. The computation time of GEDFN was around in Figure 5. GO enrichment analysis was conducted for the 86 genes, 2.5min each time on the same workstation as for BRCA data. and the top 10 GO terms are shown in Table 6. The top GO terms Classification results are summarized in Table 5. Given the 5-year were predominantly regulatory and signal transduction processes, survival outcome variable was much more challenging to predict, several of which were well-known for their association with tumor the AUC scores were much lower for all the methods. NGF was not development. However their role in survival was previously not able to classify instances at all with AUC of ROC near 0.5. At the clear. A key regulator in the oncogenesis of renal cell carcinoma same time, GEDFN performed substantially better than the other inhibits apoptosis through apoptosis signaling pathway, which was three methods. Therefore, the KIRC data demonstrate that incorpo- the top GO term (Banumathy and Cairns, 2010). The second GO rating feature graph would improve classification accuracy for term, regulation of binding is a relatively broad term. The selected DNN models. Due to the poor classification of NGF, it was unnecessary to genes associated with this term fell mostly into protein and DNA examine its feature selection for KIRC. Similar to the BRCA results binding processes. The 17 selected genes that were in this process in- in Section 3.2.1, LRL selected scattered variables on the feature clude known oncogenes JUN (Jones et al., 2016) and TFIP11 (Tang graph with few connections between them. For GEDFN, we et al., 2015), tumor suppressors CRMP1 (Cai et al., 2017) 3736 Y.Kong and T.Yu Banumathy,G. and Cairns,P. (2010) Signaling pathways in renal cell carcin- and LDOC1 (Ambrosio et al., 2017), target of tumorcide oma. Cancer Biol. Ther., 10, 658–664. Manumycin-A PPP1CA (Carey et al., 2015), three SMAD family Baraba ´ si,A.-L. and Albert,R. (1999) Emergence of scaling in random net- proteins SMAD2/SMAD3/SMAD4 that are involved in multiple can- works. Science, 286, 509–512. cers (Samanta and Datta, 2012), as well as several genes involved in Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32. various other cancers, e.g. PIN1 (Cheng et al., 2016), MDF1 (Li Bruna,J. et al. (2014) Spectral networks and locally connected networks on et al., 2017), AES (Sarma and Yaseen, 2011), MAPK8 (Recio-Boiles graphs. International Conference on Learning Representations et al., 2016), CTNNB1 (Na et al., 2017), KDM1A (Ambrosio et al., (ICLR2014), CBLS, April 2014. 2017) and SUMO1 (Jin et al., 2017). Cai,G. et al. (2017) Collapsin response mediator protein-1 (CRMP1) acts as The term ‘cellular response to growth factor stimulus’ includes an invasion and metastasis suppressor of prostate cancer via its suppression the epidermal growth factor receptor (EGFR) pathway, and BMP of epithelial-mesenchymal transition and remodeling of actin cytoskeleton organization. Oncogene, 36, 546–558. signaling pathway. Both are related to the development of renal cell Cai,Z. et al. (2015) Classiﬁcation of lung cancer using ensemble-based feature cancer (Edeline et al., 2010; Zhang et al., 2016). Increased EGFR selection and machine learning methods. Mol. BioSyst., 11, 791–800. expression occurs in some renal cell carcinoma patients with an un- Caldon,C.E. (2014) Estrogen signaling and the DNA damage response in hor- favorable histologic phenotype (Minner et al., 2012). Many genes in mone dependent breast cancers. Front. Oncol., 4, 106. the ‘heart development’ and ‘developmental growth’ processes are Carey,G.B. et al. (2015) The natural tumorcide Manumycin-A targets protein also part of the response to growth factor stimulus, causing those phosphatase 1a and reduces hydrogen peroxide to induce lymphoma apop- terms to be significant. tosis. Exp. Cell Res., 332, 136–145. The serine/threonine kinase signaling pathway includes SMAD Chen,Y.-C. et al. (2014) Risk classiﬁcation of cancer survival using ann with and mTOR signal transduction, both of which are involved in renal gene expression data from multiple laboratories. Comput. Biol. Med., 48, 1–7. cell cancer development (Edeline et al., 2010). Both cell proliferation Cheng,C.W. et al. (2016) Understanding the role of PIN1 in hepatocellular regulation and ubiquitin-dependent protein catabolism are com- carcinoma. World J. Gastroenterol., 22, 9921–9932. monly affected pathways in multiple cancers. Specifically, the Chowdhury,S. and Sarkar,R.R. (2015) Comparison of human cell signaling ubiquitin-dependent protein catabolic process is impacted by a key pathway databases–evolution, drawbacks and challenges. Database genetic defect of clear cell kidney cancer in the VHL tumor suppres- (Oxford), bau126. sor gene, which is part of a multiprotein E3 ubiquitin ligase (Corn, Chuang,H.-Y. et al. (2007) Network-based classiﬁcation of breast cancer me- 2007). The full table containing all GO terms for the functional ana- tastasis. Mol. Syst. Biol., 3, 140. lysis can be found in Supplementary Table S2. Cook,K.L. and Clarke,R. (2014) Estrogen receptor- a signaling and localiza- Overall, with the KIRC data, GEDFN was able to achieve better tion regulates autophagy and unfolded protein response activation in ERþ prediction, and select genes that were easily interpretable. The breast cancer. Recept. Clin. Investig., 1. Corn,P.G. (2007) Role of the ubiquitin proteasome system in renal cell carcin- results pointed to several important pathways, the behavior of oma. BMC Biochem., 8(Suppl 1), S4. which may predispose patients to certain survival outcomes. Das,J. and Yu,H. (2012) HINT: high-quality protein interactomes and their applications in understanding human disease. BMC Syst. Biol., 6, 92. Dutkowski,J. and Ideker,T. (2011) Protein networks as logic functions in de- 4 Conclusion velopment and cancer. PLoS Comput. Biol., 7, e1002180. We presented a new DFN classifier embedding feature graph infor- Edeline,J. et al. (2010) Signalling pathways in renal-cell carcinoma: from the mation. It achieves sparse connected neural networks by constrain- molecular biology to the future therapy. Bull Cancer, 97, 5–15. ing connections between the input layer and the first hidden layer Falcon,S. and Gentleman,R. (2007) Using GOstats to test gene lists for GO according to the feature graph. Simulation experiments have shown term association. Bioinformatics, 23, 257–258. Felzen,V. et al. (2015) Estrogen receptor regulates non-canonical autophagy its relatively higher classification accuracy and better feature that provides stress resistance to neuroblastoma and breast cancer cells and selection ability compared with existing methods, and the real data involves BAG3 function. Cell Death Dis., 6, e1812. applications demonstrated the utility of the new model in both clas- Gilkes,D.M. and Semenza,G.L. (2013) Role of hypoxia-inducible factors in sification and the selection of biologically relevant features. breast cancer metastasis. Future Oncol, 9, 1623–1636. Gionet,N. et al. (2009) NF-kappaB and estrogen receptor alpha interactions: differential function in estrogen receptor-negative and -positive Acknowledgement hormone-independent breast cancer cells. J. Cell. Biochem., 107, 448–459. The authors thank Dr Hao Wu and Dr Jian Kang for helpful discussions. Goodfellow,I. et al. (2016) Deep Learning. MIT Press, Cambridge, MA. Henaff,M. et al. (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv: 1506.05163. Funding Hochreiter,S. et al. (2001) Gradient ﬂow in recurrent nets: the difﬁculty of This study was partially funded by National Institutes of Health [grant num- learning long-term dependencies. arXiv. Hoesel,B. and Schmid,J.A. (2013) The complexity of NF-B signaling in inﬂam- ber R01GM124061]. mation and cancer. Mol. Cancer, 12, 86. Conﬂict of Interest: none declared. JavanMoghadam,S. et al. (2016) Estrogen receptor alpha is cell cycle-regulated and regulates the cell cycle in a ligand-dependent fashion. Cell Cycle, 15, 1579–1590. References Jin,L. et al. (2017) SUMO-1 Gene Silencing Inhibits Proliferation and Algamal,Z.Y. and Lee,M.H. (2015) Penalized logistic regression with the Promotes Apoptosis of Human Gastric Cancer SGC-7901 Cells. Cell. adaptive lasso for gene selection in high-dimensional cancer classiﬁcation. Physiol. Biochem., 41, 987–998. Expert Syst. Appl., 42, 9326–9332. Jones,M.R. et al. (2016) Response to angiotensin blockade with irbesartan in Ambrosio,S. et al. (2017) Lysine-speciﬁc demethylase LSD1 regulates autoph- a patient with metastatic colorectal cancer. Ann. Oncol., 27, 801–806. agy in neuroblastoma through SESN2-dependent pathway. Oncogene, 36, Kim,S. et al. (2013) Network-based penalized regression with application to 6701–6711. genomic data. Biometrics, 69, 582–593. Graph-embedded deep feedforward networks 3737 Kingma,D.P. and Ba,J. (2014) Adam: A method for stochastic optimization. Osborne,C.K. et al. (2005) Crosstalk between estrogen receptor and growth arXiv preprint arXiv:1412.6980. factor receptor pathways as a cause for endocrine therapy resistance in Koboldt,D.C. et al. (2012) Comprehensive molecular portraits of human breast cancer. Clin. Cancer Res., 11, 865s–870s. breast tumours. Nature, 490, 61–70. Recio-Boiles,A. et al. (2016) JNK pathway inhibition selectively primes pan- Kolaczyk,E.D. (2009) Statistical Analysis of Network Data: Methods and creatic cancer stem cells to TRAIL-induced apoptosis without affecting the Models, 1st edn. Springer Publishing Company, Incorporated, New York. physiology of normal tissue resident stem cells. Oncotarget, 7, 9890–9906. Kursa,M.B. (2014) Robustness of random forest-based gene selection meth- Russakovsky,O. et al. (2015) ImageNet large scale visual recognition chal- ods. BMC Bioinformatics, 15,8. lenge. Int. J. Comput. Vis., 115, 211–252. Lavi,O. et al. (2012) Network-induced classiﬁcation kernels for gene expres- Samanta,D. and Datta,P.K. (2012) Alterations in the Smad pathway in human sion proﬁle analysis. J. Comput. Biol., 19, 694–709. cancers. Front. Biosci. (Landmark Ed), 17, 1281–1293. LeCun,Y. et al. (1998) Gradient-based learning applied to document recogni- Sarma,N.J. and Yaseen,N.R. (2011) Amino-terminal enhancer of split (AES) tion. In Proceedings of the IEEE, pp. 2278–2324. interacts with the oncoprotein NUP98-HOXA9 and enhances its transform- LeCun,Y. et al. (2015) Deep learning. Nature, 521, 436–444. ing ability. J. Biol. Chem., 286, 38989–39001. Li,C. and Li,H. (2008) Network-constrained regularization and variable selec- Sorlie,T. et al. (2003) Repeated observation of breast tumor subtypes in inde- tion for analysis of genomic data. Bioinformatics, 24, 1175–1182. pendent gene expression data sets. Proc. Natl. Acad. Sci. USA, 100, Li,J. et al. (2017) DNA methylation of CMTM3, SSTR2, and MDFI genes in 8418–8423. colorectal cancer. Gene, 630, 1–7. Szklarczyk,D. and Jensen,L.J. (2015) Protein-protein interaction databases. Liang,Y. et al. (2013) Sparse logistic regression with a l 1/2 penalty for gene se- Methods Mol. Biol., 1278, 39–56. lection in cancer classiﬁcation. BMC Bioinformatics, 14, 198. Tang,Y. et al. (2015) STIP overexpression confers oncogenic potential to Min,S. et al. (2017) Deep learning in bioinformatics. Brief. Bioinformatics, human non-small cell lung cancer cells by regulating cell cycle and apop- 18, 851–869. tosis. J. Cell. Mol. Med., 19, 2806–2817. Minner,S. et al. (2012) Epidermal growth factor receptor protein Tibshirani,R. (1996) Regression shrinkage and selection via the lasso. J. Roy. expression and genomic alterations in renal cell carcinoma. Cancer, 118, Stat. Soc. Ser. B (Methodological), 58, 267–288. 1268–1275. Vanitha,C.D.A. et al. (2015) Gene expression data classiﬁcation using support Mockus,J. (2012) Bayesian Approach to Global Optimization: Theory and vector machine and mutual information-based gene selection. Proc. Applications, Vol. 37. Springer Science & Business Media, Berlin, Comput. Sci., 47, 13–21. Germany. Wang,L. et al. (2007) Group scad regression analysis for microarray time Na,K. et al. (2017) CTNNB1 Mutations in ovarian microcystic stromal tumors: course gene expression data. Bioinformatics, 23, 1486–1494. identiﬁcation of a novel deletion mutation and the use of pyrosequencing to Wei,P. and Pan,W. (2008) Incorporating gene networks into statistical tests identify reported point mutation. Anticancer Res., 37, 3249–3258. for genomic data via a spatially correlated mixture model. Bioinformatics, Nair,V. and Hinton,G.E. (2010) Rectiﬁed linear units improve restricted 24, 404–411. boltzmann machines. In: Fu ¨ rnkranz,J. and Joachims,T. eds. Proceedings of Wolff,M. et al. (2017) Impact of hypoxia inducible factors on estrogen recep- the 27th international conference on machine learning (ICML-10), Haifa, tor expression in breast cancer cells. Arch. Biochem. Biophys., 613, 23–30. Israel, pp. 807–814. Yamamoto,T. et al. (2002) Cross-talk between bone morphogenic proteins Network,C.G.A.R. et al. (2013) Comprehensive molecular characterization of and estrogen receptor signaling. Endocrinology, 143, 2635–2642. clear cell renal cell carcinoma. Nature, 499, 43. Zhang,L. et al. (2016) BMP signaling and its paradoxical effects in tumorigen- Olden,J.D. and Jackson,D.A. (2002) Illuminating the black box: a randomiza- esis and dissemination. Oncotarget, 7, 78206–78218. tion approach for understanding variable contributions in artiﬁcial neural Zhu,Y. et al. (2009) Network-based support vector machine for classiﬁcation networks. Ecol. Model., 154, 135–150. of microarray samples. BMC Bioinformatics, 10, S21.
Bioinformatics – Oxford University Press
Published: May 29, 2018
Access the full text.
Sign up today, get DeepDyve free for 14 days.