Discovering personalized driver mutation profiles of single samples in cancer by network control strategy

Abstract

Motivation
It is a challenging task to discover personalized driver genes that provide crucial information on disease risk and drug sensitivity for individual patients. However, few methods have been proposed to identify personalized-sample driver genes from cancer omics data, owing to the lack of samples for each individual. To circumvent this problem, we present a novel single-sample controller strategy (SCS) to identify personalized driver mutation profiles from a network controllability perspective.

Results
SCS integrates mutation data and expression data into a reference molecular network for each patient to obtain the driver mutation profiles in a personalized-sample manner. This is the first computational framework to bridge the personalized driver mutation discovery problem and the structural network controllability problem. The key idea of SCS is to detect those mutated genes that can achieve the transition from the normal state to the disease state, based on each individual's omics data, from a network controllability perspective. We extensively validate the driver mutation profiles of SCS from three aspects: (i) the improved precision for the predicted driver genes in the population compared with other driver-focused methods; (ii) the effectiveness for discovering the personalized driver genes; and (iii) the application to risk assessment through the integration of the driver mutation signature and expression data, across five distinct benchmarks from The Cancer Genome Atlas. In conclusion, SCS makes efficient and robust predictions of personalized driver mutation profiles, opening new avenues in personalized medicine and targeted cancer therapy.

Availability and implementation
The MATLAB package for SCS is freely available from http://sysbio.sibcb.ac.cn/cb/chenlab/software.htm.
Contact
zhangsw@nwpu.edu.cn or zengtao@sibs.ac.cn or lnchen@sibs.ac.cn

Supplementary information
Supplementary data are available at Bioinformatics online.

1 Introduction

With rapid advances in genomic techniques, there is a pressing need for integrative analysis of cancer omics data in terms of somatic mutations, transcriptomic changes and epigenetic alterations. A critical challenge facing cancer genomics today is to integrate these information-rich datasets to provide clinically relevant insights into tumor biology and cancer diagnostics and therapeutics. A fundamental question in the analysis of cancer genomic data is how to identify driver genes that contribute to cancer initiation and progression, and to distinguish them from the numerous passenger mutation genes that emerge simply as a result of genomic instability during cancer progression (Haber and Settleman, 2007). It is well known that identifying cancer drivers is crucial from a clinical perspective, where personalized-sample driver genes would hold significant value for defining personalized therapeutic targets (Chin, 2011; Schilsky, 2010). This situation has prompted a number of computational methods that assist in differentiating driver mutation genes from passenger mutation genes. Recent studies can be mainly categorized into two classes: the machine learning based methods (Carter et al., 2009; Kumar et al., 2016; Mao et al., 2013) and the network based methods (Bashashati et al., 2012; Bertrand et al., 2015; Greenman et al., 2008; Hou and Ma, 2014; Kang et al., 2015; Suo et al., 2015; Zhang et al., 2016). The machine learning based methods are usually trained on mutations designated as pathogenic or neutral; their advantage is that such models can be developed for any specific task, depending on the choice of training data. 
For example, in CHASM and CanDrA, driver genes are classified by models trained on known cancer-causing somatic missense mutations (Carter et al., 2009; Mao et al., 2013), although these models are limited in some applications due to the probable incompleteness of the databases they rely on. On the other hand, network-based approaches have become one of the most promising ways to understand cancer drivers due to their power to elucidate molecular mechanisms of disease development at the network level (Liu et al., 2016; Yu et al., 2017), and such network-based methods have been successfully applied to many biomedical fields such as cancer driver discovery and drug target identification (Cowen et al., 2017; Hofree et al., 2013; Li and Patra, 2010; Wang et al., 2014; Zhang et al., 2016). However, most of the existing methods, such as MEMo (Ciriello et al., 2012), Dendix (Vandin et al., 2012) and DriverNet (Bashashati et al., 2012), require a large number of patient samples to generate reliable results and are not well suited for distinguishing rare driver genes or personalized-sample driver genes. Some newer methods, such as DawnRank (Hou and Ma, 2014) and OncoIMPACT (Bertrand et al., 2015), have begun to focus on how to find the personalized-sample driver genes. However, they ignore the patient-specific network (topology or edges) information when determining the model parameters, which can lead to many false positives (Hou and Ma, 2014). To simultaneously identify the personalized-sample driver genes and their driving networks/pathways, we introduce network control theory to model the driving role and influence of individual mutation genes on tumors. Network control theory (Gao et al., 2014; Liu et al., 2011; Wu et al., 2014), which considers how to choose the proper subset of network nodes to control the whole network from one state (e.g. disease state) to another (e.g. 
normal state), has become a powerful conceptual paradigm in biology for understanding biological systems at a system level. How to control large-scale biological networks is a central issue for biological systems, which sense and process both external and internal cues using a network of interacting molecules (Vinayagam et al., 2016; Yan et al., 2017). Structural controllability analysis of networks has been applied to some biological systems, where interesting properties of those systems have been discovered (Lin, 1974). However, the existing network-control methods (Gao et al., 2014; Liu et al., 2011; Wu et al., 2014) cannot be applied directly or efficiently to identify driver genes for each patient, since they only consider the consistent information of a population of samples and ignore the information particular to each individual. To address these methodological limitations, we propose an approach based on network control theory, called Single-sample Controller Strategy (SCS), to assess the impact potential of gene mutations on the changes in gene expression patterns. Intuitively, we consider mutations as controllers and gene expression profiles as states in a network, and thus SCS aims to detect a small number of mutation genes (i.e. driver genes) that can achieve the transition from the normal state to the disease state from the network controllability viewpoint, based on each individual's gene expression data. SCS integrates the mutation data and expression data into a gene-gene regulation network for each patient. In particular, we aim to identify the minimal number of Individual mutations that control the Individual differentially expressed genes (DEGs) in the Individual gene network, i.e. the 3I framework. 
The main steps of the SCS method are: (1) to obtain personalized DEGs for each patient by comparing the expression profile of the tumor sample with that of the corresponding normal sample, and then to extract the individual gene regulations from the expert-curated databases; (2) to identify the minimal number of individual mutations with network controllability on the maximal coverage of individual DEGs in the individual gene network; and (3) based on dynamic network control theory, to rank and select the driver genes from individual mutations according to the uncovered consensus modules, which consist of confidence-weighted paths from the driver genes to the target genes on the gene network. This single-sample framework differs significantly from DawnRank (Hou and Ma, 2014) and OncoIMPACT (Bertrand et al., 2015), which predict patient-specific driver genes in the global gene network of population samples, whereas SCS considers the cancer-specific or patient-specific mutated network (topology) for predicting individual driver genes. By using only data from an individual patient sample rather than a population of samples, SCS is the first framework to evaluate the impact potential of both CNVs and SNPs on gene expression patterns in a personalized fashion based on network controllability. Unlike DawnRank (Hou and Ma, 2014) and OncoIMPACT (Bertrand et al., 2015), SCS ranks potential driver genes based on their influence on the overall differential expression of their downstream genes in the individual molecular network instead of the collective molecular network. The personalized gene rankings produced by SCS also allow us to apply the Condorcet method (Pihur et al., 2008) to determine the summary ranking of genes in a patient population. We select the Top-50 ranked candidates as the driver genes for a patient population. We have extensively validated the driver mutation profiles of SCS from three aspects. 
Firstly, the benchmarking analysis on five benchmark datasets obtained from The Cancer Genome Atlas (TCGA), namely Glioblastoma (GBM), Ovarian (OVARIAN), Melanoma (MELANOMA), Bladder (BLCA) and Prostate cancer (PRAD), reveals notable improvements over other existing representative methods in terms of precision for discovering driver genes. Secondly, we discuss the personalized scope of SCS by demonstrating its ability to determine personalized novel and rare driver genes. Since SCS classifies driver genes regardless of mutation frequency, it makes it possible to discover rare (infrequent) driver genes. Finally, we demonstrate that the identified driver mutation profiles can be further used as a mutational-status-based signature to be integrated with the expression and network data for tumor stratification and prognostication, which performs better than the integration of a single mutation signature and expression data because it promotes biological significance rather than statistical significance. Based on the detected Top-50 driver genes for a large number of patient samples and the identified subtypes for tumor stratification and prognostication, we also demonstrate the underlying differences in molecular mechanism in terms of the distributions of mutation frequency of the predicted Top-50 driver genes, and provide the enriched biological pathways for the corresponding identified subtypes. These results all indicate that somatic mutations contribute to heterogeneous responses that drive different expression alterations of the downstream DEGs in different patient cohorts. Taken together, we have demonstrated that SCS can identify personalized driver genes and provide new mutation profiles for quantitatively measuring the genotype of each patient in association with the phenotype. 
Our analysis results on various cancer datasets from TCGA support the practical applications of SCS, which is effective in integrating cancer omics data for tumor pathology, clinical stratification and personalized therapy. This is the first computational framework to bridge the personalized driver mutation discovery problem and the structural network controllability problem. Therefore, the SCS method opens a new paradigm in personalized medicine and targeted cancer therapy.

2 Materials and methods

2.1 Datasets and reference network

In total, 1435 tumor samples from the TCGA data portal (328 samples of GBM, 316 samples of ovarian cancer, 379 samples of bladder cancer, 252 samples of PRAD and 160 samples of melanoma; Bertrand et al., 2015) are studied in this paper. The datasets consist of gene expression data and coding region mutation data for five cancer types. SCS analysis was restricted to samples for which information on point mutations, copy-number alterations and gene expression was available (Bertrand et al., 2015). SCS uses the reference gene network of Hou and Ma (2014), which integrates a variety of sources, including the network used in MEMo (Ciriello et al., 2012; Wu et al., 2010) as well as the up-to-date curated information from Reactome (Croft et al., 2010), the NCI-Nature Curated PID (Schaefer et al., 2009) and KEGG (Kanehisa et al., 2011). To aggregate all of the networks together, we collapsed all redundant edges to single edges. The resulting reference network consisted of 11 648 genes and 211 794 edges, including self-loops within the network to account for auto-regulation events (Becskei and Serrano, 2000). Furthermore, SCS also uses the directed human PPI network constructed by Vinayagam et al. (2011) for evaluating the effect of the reference network on SCS. 
The cancer datasets used as benchmarks for SCS in this paper are freely available from http://sysbio.sibcb.ac.cn/cb/chenlab/software.htm.

2.2 Single-sample controller strategy

A natural framework to assess the impact of mutations is to associate the mutations with the gene expression changes from normal to tumor in their gene network, taking the mutations as control actions for those changes; this view is adopted in the design of SCS. Specifically, we consider mutation genes as controllers in a network and the whole gene expression profile in normal/tumor as the respective state, and thus SCS aims to detect a small number of mutation genes (i.e. driver genes) that can achieve the transition from the normal state to the tumor state (or vice versa) from the network controllability viewpoint, based on each individual's gene expression data. To apply SCS, one should prepare the expression profiles of paired tumor-normal samples, the mutation profiles for each sample and the reference network information. With the driver mutation profiles, the mutation state of each gene is no longer discrete but continuous, reflecting its impact potential on phenotypes through personalized DEGs. The key idea of SCS is to apply structural control theory for identifying the driver genes, under the assumption that the gene mutations control the DEGs through their gene network. SCS views the gene network as a directed graph. Figure 1 shows the overview of our methodology. The SCS algorithm consists of two main steps: (i) identification of individual DEGs (target genes), individual mutation genes (candidate driver genes) and the individual gene network (topology of interactive genes); (ii) identification of the driver genes and their corresponding consensus modules on control paths, which assess the mutational impact of personalized-sample driver genes.

Fig. 1. Overview of SCS’s workflow for identifying driver mutation profiles. (a) Input information. For the gene expression data and gene mutation profiles (SNV and CNV) for sample k or patient k, we identify the DEGs and extract the mutation genes and their close interactors by using the RWR algorithm and a randomization-based test in the directed gene network. (b) The main part of SCS. We apply a new concept called constrained target control (CTC), which focuses on identifying the minimal number of nodes (mutations) to control the targets (DEGs), to identify the driver genes and the corresponding control paths in the personal gene network. Since multiple solutions exist for the CTC, we apply random Markov sampling to obtain different driver genes and control paths. By repeating the random Markov sampling, we can obtain the consensus module containing one predicted driver gene and the downregulated genes for each module. In the module, the weight of each edge denotes the confidence of the edge as a control path to the DEGs. The sum of the edge weights forms the impact potential of the predicted driver gene on the expression patterns. In this way, we obtain the driver genes whose weights denote the impact potential of the mutations on the expression patterns for sample k.

(1) Identifying target genes, candidate driver genes and network for each individual

In this study, we consider the transcriptomic changes due to mutation. For each patient, we calculate the log2 fold-change of gene expression between the paired tumor and normal samples. A threshold of +/–1 on the log2 fold-change is used to call the DEGs for each patient, which serve as the targets possibly controlled by the mutations according to the SCS assumption. Both the mutation genes and their interactors are then extracted for each patient by using the Random Walk with Restart algorithm (RWR, see the details in Supplementary Note 1). That is, for each sample we calculate the probability of each gene being reached from the individual mutations (e.g. SNV: single nucleotide variations; CNV: copy number variations) by using the RWR algorithm. We also introduce a randomization-based test to evaluate the statistical significance of the control relation between the individual mutations and individual genes by utilizing 100 topologically matched random networks. The candidate genes that remain significant (P < 0.05) are retained and denoted as significant genes for the individual mutation genes. The interactions among the significant genes, including the individual mutation genes themselves, then form the individual mutated network. 
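Step (1) can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the function names, the pseudocount, the restart probability of 0.5, the use of a "greater than or equal to" comparison at the threshold and the handling of genes without outgoing edges are all assumptions made for illustration, and the randomization-based test over 100 matched random networks is omitted.

```python
import math


def personalized_degs(tumor, normal, threshold=1.0, pseudocount=1.0):
    """Return {gene: log2 fold-change} for genes with |log2FC| >= threshold.

    `tumor` and `normal` map gene names to non-negative expression values of
    one paired sample. The pseudocount (an assumption) avoids division by zero.
    """
    degs = {}
    for gene, t_value in tumor.items():
        if gene not in normal:
            continue
        lfc = math.log2((t_value + pseudocount) / (normal[gene] + pseudocount))
        if abs(lfc) >= threshold:
            degs[gene] = lfc
    return degs


def random_walk_with_restart(edges, seeds, restart=0.5, tol=1e-8, max_iter=1000):
    """Probability of reaching each gene from the seed (mutated) genes.

    `edges` is a list of directed (source, target) pairs. At every step the
    walker restarts at a seed with probability `restart`; otherwise it follows
    a uniformly chosen outgoing edge. Walkers on genes without outgoing edges
    are sent back to the seeds (an assumption of this sketch).
    """
    seeds = set(seeds)
    nodes = sorted({n for edge in edges for n in edge} | seeds)
    out = {n: [] for n in nodes}
    for source, target in edges:
        out[source].append(target)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            moving = (1.0 - restart) * p[u]
            if out[u]:
                for v in out[u]:
                    nxt[v] += moving / len(out[u])
            else:
                for s in seeds:
                    nxt[s] += moving / len(seeds)
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p
```

In such a sketch, genes with a high RWR probability would be the candidates passed to the significance test, and the DEG dictionary would provide the target set for the control step.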
Finally, the individual mutation genes, the individual DEGs and the individual mutated network are linked together and used in the following individual driver gene identification and driver module discovery.

(2) Identifying driver mutation profiles

Unlike existing network control approaches, we apply a new concept, Constrained Target Controllability (CTC; Guo et al., 2017), to identify the driver nodes from a constrained subset of the network, instead of the whole network, so as to control the maximal number of target nodes (model and algorithm details can be seen in Supplementary Note 1). In this study, we focus on how to identify the driver genes supported by the individual DEGs. To apply CTC to this problem, we first define the DEGs related to the phenotype change (e.g. state change) as the target nodes to be controlled and the mutation genes as the constrained control nodes. Then, to control the DEGs, a greedy algorithm is developed to identify the target controllable subsystem of each mutation gene and the control path from each mutation gene to its target genes (see the details in Supplementary Note 2). With the target controllable subset defined as an iterated bipartite graph, the DEGs are represented within the target controllable subspace of each mutation gene. We apply a parsimony principle to identify a minimal set of driver genes (i.e. mutation genes) associated with the phenotype genes (e.g. DEGs) by solving a minimum set cover problem (Supplementary Fig. S1). Although the minimum set cover problem is NP-hard, with a greedy O(log n) approximation algorithm, the optimal solution can still be obtained efficiently for graphs of moderate size, with up to a few tens of thousands of variables, by utilizing a classic LP-based branch and bound method (Nemhauser and Wolsey, 1988; Wolsey, 1998). 
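To make the set cover formulation concrete, the sketch below implements the greedy O(log n) approximation mentioned above, not the LP-based branch and bound that SCS uses to obtain the optimal solution. The input format (a mapping from each mutation gene to the set of DEGs in its target controllable subspace) and the function name are assumptions for illustration.

```python
def greedy_set_cover(controllable, targets):
    """Greedy approximation of the minimum set cover.

    `controllable` maps each candidate mutation gene to the set of DEGs it
    can control; `targets` is the set of DEGs to be covered. Returns the
    chosen genes (in selection order) and the targets left uncovered.
    """
    uncovered = set(targets)
    chosen = []
    while uncovered and controllable:
        # Pick the gene covering the most still-uncovered targets;
        # ties are broken alphabetically to keep the result deterministic.
        best = max(sorted(controllable),
                   key=lambda gene: len(controllable[gene] & uncovered))
        gain = controllable[best] & uncovered
        if not gain:
            break  # the remaining targets cannot be covered by any gene
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Toy example with hypothetical DEG labels d1..d5.
toy = {
    "TP53": {"d1", "d2", "d3"},
    "EGFR": {"d3", "d4"},
    "PTEN": {"d4"},
    "NF1": {"d1"},
}
drivers, missed = greedy_set_cover(toy, {"d1", "d2", "d3", "d4", "d5"})
```

The greedy choice can return more genes than the true minimum, which is one reason an exact solver may be preferred when the instance is small enough.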
Furthermore, to coalesce mutated genes and DEGs into a consensus network module, we employ random Markov chain (MC) sampling to generate candidate driver genes and the corresponding control paths from the driver genes to the targeted DEGs (see the details in Supplementary Notes 2 and 3). For each personalized-sample mutation gene, different control paths from the mutation gene to the target genes are obtained in the process of MC sampling, so the frequency with which an edge of the gene network appears in 1000 runs of sampled control paths can be assigned as the weight of that edge in the module. Finally, the consensus module includes mutation genes with multiple control paths to the target genes, where the edge weights denote the confidence of the control path. Overall, the construction of personalized-sample consensus modules in SCS provides a comprehensive measurement of the impact potential of a putative driver gene, which results in the driver mutation profiles.

2.3 The time complexity analysis of SCS

The computational complexity of the SCS method mainly stems from two parts. (i) For extracting the individual gene regulations from the expert-curated databases on the network G(V, E), where V and E are the node set and edge set respectively, we integrate the RWR algorithm and a randomization-based test to evaluate the statistical significance of control relations between the individual mutations and individual genes using n = 100 topologically matched random networks, each of which maintains the topological characteristics of the original network (e.g. degree of each node). The RWR algorithm runs in the order of O(k * n * ‖E‖), where k denotes the number of iterations, n denotes the number of randomly generated networks and ‖E‖ denotes the number of edges in a network. 
In fact, the RWR algorithm is efficient and widely used in many research works (Jia and Zhao, 2014; Li and Patra, 2010; Luo et al., 2017). (ii) In the phase of uncovering consensus modules with CTC in the individual gene network G(Vi, Ei) for patient i, we use our CTCA (CTC Algorithm) to identify modules consisting of confidence-weighted paths from the driver genes to the target genes (Guo et al., 2017). Our CTCA runs in the order of O(m * r * ‖Vi‖ * ‖Ei‖), where m is the sampling number, r is the number of iterations for obtaining the controllable differentially expressed genes of mutations, and ‖Vi‖ and ‖Ei‖ denote the number of nodes and edges in the patient-specific mutated network, respectively (‖Vi‖ ≪ ‖V‖, ‖Ei‖ ≪ ‖E‖). Therefore, the overall computational time complexity of SCS is O(k * n * ‖E‖) + O(m * r * ‖Vi‖ * ‖Ei‖).

3 Results

3.1 SCS accurately and robustly detects driver genes of cancer

Most of the existing methods for identifying common driver genes are based on aggregate analysis over a large number of patients. To identify driver genes at the population level, SCS applies a novel scheme. First, we identify the patient-specific mutated network, which consists of the frequently interrupted interactions among the mutation genes in the human interactome, by using a network propagation amplifier of genetic associations (RWR algorithm, Supplementary Note 1; Cowen et al., 2017). Here, the aim of introducing RWR in mutation data analysis is to filter out genes whose mutations likely occur by chance, based on the patient-specific mutational profile. Then, to identify the consensus modules or mutations, SCS applies the CTC on the patient-specific mutated network, in which mutations (as controllers) achieve the transition from the normal state to the disease state. Finally, we apply the Condorcet method to summarize the personalized results at the population level. 
In particular, SCS tries to capture individual-specific mutation genes and their impacts in each patient. By integrating the personalized driver genes with the Condorcet method, SCS can also obtain the common driver genes in the population, which allows us to validate its advantage over other driver-focused methods. The Condorcet method is used as a ‘voting’ scheme over the personalized gene rankings to determine the most impactful driver genes in a population (see the details in Supplementary Note 1), and the Top-50 ranked mutations in the population are selected as the candidate driver genes. To perform a systematic comparison across a number of computational methods, the genes annotated in the Cancer Gene Census (CGC; Futreal et al., 2004) are used as a proxy for potential drivers to assess the precision of the top driver genes reported for different cancer sites. The CGC (Futreal et al., 2004) is a well-studied cancer gene database consisting of a list of known driver genes with mutations that have been causally implicated in cancer. CGC genes have been widely used in many cancer studies for benchmark evaluation (Bertrand et al., 2015; Hou and Ma, 2014; Jia and Zhao, 2014). As representatives, SCS is compared against another personalized-sample method (OncoIMPACT; Bertrand et al., 2015), an aggregate network approach (DriverNet; Bashashati et al., 2012) and a commonly used mutation frequency-based approach (Frequency; Wei et al., 2011). As shown in Figure 2a and Supplementary Table S1, a stronger enrichment for true positive driver genes is achieved in SCS’s predictions. In contrast, the naive frequency-based approach finds fewer known cancer driver genes. For example, the top gene on the PRAD list is HLA-DRB instead of TP53, and the lists of all other driver-focused methods miss EGFR from the Top 10. 
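The ‘voting’ idea behind the population-level summary can be sketched as follows. The exact Condorcet procedure used by SCS is described in Supplementary Note 1 and is not reproduced here; this sketch assumes a simple pairwise-majority (Copeland-style) count, treats genes absent from a patient's list as ranked below all listed genes, and uses hypothetical names throughout.

```python
from itertools import combinations


def condorcet_rank(rankings):
    """Aggregate per-patient rankings by counting pairwise majority wins.

    `rankings` is a list of ranked gene lists, best first, one per patient.
    A gene beats another if more patients rank it higher. Genes missing from
    a patient's list are treated as ranked below every listed gene.
    """
    genes = sorted({gene for ranking in rankings for gene in ranking})
    positions = [{gene: i for i, gene in enumerate(r)} for r in rankings]
    wins = {gene: 0 for gene in genes}
    for a, b in combinations(genes, 2):
        prefer_a = prefer_b = 0
        for pos in positions:
            rank_a = pos.get(a, float("inf"))
            rank_b = pos.get(b, float("inf"))
            if rank_a < rank_b:
                prefer_a += 1
            elif rank_b < rank_a:
                prefer_b += 1
        if prefer_a > prefer_b:
            wins[a] += 1
        elif prefer_b > prefer_a:
            wins[b] += 1
    # More pairwise wins first; alphabetical order breaks ties.
    return sorted(genes, key=lambda gene: (-wins[gene], gene))


# Three hypothetical patients with partially overlapping rankings.
summary = condorcet_rank([
    ["TP53", "EGFR", "PTEN"],
    ["TP53", "PTEN"],
    ["EGFR", "TP53", "NF1"],
])
```

Taking the first 50 entries of such a summary ranking would correspond to the Top-50 selection described above.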
Furthermore, among the Top 20 drivers in OVARIAN, MELANOMA and BLCA, SCS’s concordance is above 40%, while Frequency, DriverNet and OncoIMPACT are all around 30%, suggesting that SCS is generally more accurate and less likely to be influenced by high-frequency mutated passengers. To test the robustness of SCS, a sub-sampling approach is used to estimate the precision of the Top-50 driver genes on the ovarian cancer, melanoma, GBM, BLCA and PRAD datasets from TCGA. As shown in Figure 2b, SCS’s predictions are stable even with small sample sizes for the GBM, BLCA and PRAD datasets, although the precision is sensitive to sample size in the ovarian and melanoma datasets.

Fig. 2. Accuracy comparison of driver gene predictions according to the cancer census genes set. (a) Precision measured by the fraction of top ranked driver genes from SCS, OncoIMPACT, DriverNet and a frequency-based approach that are included in the CGC list. (b) Precision of SCS, evaluated against the CGC, as a function of the size of the dataset.

For identifying the driver genes at the population level, the main technical contributions of SCS are: (1) choosing personalized DEGs and personalized candidate mutations, and extracting individual gene regulations from the expert-curated databases as the patient-specific mutated network (topology or edges) by the RWR algorithm; (2) identifying the confidence-weighted paths from the driver genes to the target genes on the patient-specific gene network by the concept ‘CTC’; and (3) statistically summarizing the personalized driver mutation prediction results at the population level by the Condorcet method. To assess the precision of SCS in more detail, we evaluate the effect of each technical contribution on the performance of SCS for each type of cancer dataset, as shown in Table 1. We define a measurement of the performance for predicting the driver genes, P = mean(pk), where pk denotes the fraction of the top k predicted driver genes within the cancer census genes list. The mean fraction of the top k (k = 1, 2,…, 50) ranked predicted driver genes within the cancer census genes list is given in Table 1. It shows that the technical contributions of SCS together lead to better performance compared with OncoIMPACT (Bertrand et al., 2015) and other comparable methods. Briefly, SCS obtains the patient-specific mutated network at the patient level by the RWR algorithm and then summarizes the driver mutation prediction results at the population level by the Condorcet method. The results in Figure 2 and Table 1 both show that SCS recapitulates the known driver genes better than other methods, namely DriverNet and HotNet2, which only work at the population level; OncoIMPACT and DawnRank, which can work at the single-patient level; and a naive frequentist approach.

Table 1.
The performance of our SCS for each technical contribution in terms of the average precision at the population level in GBM, OVARIAN, MELANOMA, PRAD and BLCA

Method            GBM     OVARIAN  MELANOMA  PRAD    BLCA
SCS               0.4141  0.3234   0.4004    0.4106  0.4736
RWR + CTC         0.3333  0.1932   0.1434    0.3295  0.3676
CTC + Condorcet   0.3778  0.2594   0.1544    0.1662  0.1905
CTC               0.3621  0.2300   0.1056    0.1364  0.1821
OncoIMPACT        0.3613  0.2370   0.2047    0.3613  0.3429
DriverNet         0.2496  0.1688   0.3115    0.2496  0.2872
Frequency         0.1480  0.1896   0.0654    0.1480  0.1268
HotNet2           0.1866  0.1916   0.0769    0.1350  0.1960
DawnRank          0.3251  0.2161   0.1147    0.2439  0.2445

Furthermore, to demonstrate the effect of the reference network on SCS, we also use the directed PPI network derived by Vinayagam et al. (2011) as the reference network for analysis. The directed human PPI network represents a global snapshot of the information flow in cell signaling. It consists of 6339 proteins and 34 813 directed edges, where the edge direction corresponds to the hierarchy of signal flow between the interacting proteins and the edge weight corresponds to the confidence of the predicted direction. The results of SCS with this directed PPI network on the five cancer datasets are shown in Supplementary Figure S3 in Supplementary Note 4, from which we conclude that the directed PPI network is too incomplete to analyze the controllability of the biological system. Therefore, a proper reference network is an important factor for SCS. 
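The measurement P = mean(pk) reported in Table 1 can be computed as in the following sketch, where pk is the fraction of the top k predicted genes found in a reference list such as the CGC. The function name and the toy gene lists are illustrative assumptions; the real evaluation uses the Top-50 list of each method and the CGC genes.

```python
def mean_precision(ranked_genes, reference_genes, max_k=50):
    """Average of p_k for k = 1..max_k, where p_k is the fraction of the
    top-k ranked genes that belong to the reference list."""
    reference = set(reference_genes)
    hits = 0
    fractions = []
    for k, gene in enumerate(ranked_genes[:max_k], start=1):
        if gene in reference:
            hits += 1
        fractions.append(hits / k)
    return sum(fractions) / len(fractions) if fractions else 0.0


# Toy example: p_1 = 1, p_2 = 1/2, p_3 = 2/3, p_4 = 1/2.
score = mean_precision(["TP53", "FOO", "EGFR", "BAR"],
                       {"TP53", "EGFR", "PTEN"}, max_k=4)
```

Because every pk includes the top-ranked genes, this measure rewards methods that place known drivers near the top of the list rather than merely somewhere within the Top-50.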
Indeed, the incompleteness of molecular networks would increase false negatives in SCS’s predictions, and network inference approaches would be a helpful preprocessing step in this situation. In addition to its ability to identify known driver genes at the entire patient population level, a key strength of SCS is its ability to recognize personalized-sample drivers in the CGC, compared with a random selection scheme, as shown in Figure 3. The threshold of ±1 log2 fold change used in this identification is widely adopted to identify DEGs (Aytug et al., 2003; Bakken et al., 2016; Koren et al., 1989; Tothova et al., 2007). We have also considered other values to evaluate the effect of the threshold on the precision of the predicted driver genes. The mean fraction of the top k ranked predicted driver genes within the CGC list is used to measure the performance of driver gene prediction. The results are shown in Figure 3a, from which we can see that the precision is robust to the choice of threshold around ±1 for the five cancer datasets. To assess the significance of the ±1 log2 fold-change threshold in each patient, we have tried two evaluation strategies. In the first, we randomly choose the same number of genes as the personalized DEGs to serve as targets, and then use our SCS with these randomized targets to obtain the top ranked driver gene list at the population level; we compute the mean fraction of the top k ranked genes within the CGC list and obtain an enrichment P-value (the details are given in Supplementary Note 6), with results shown in Figure 3b. In the second, we randomly choose not only the same number of target genes but also the same number of personalized mutations; for each patient we then use the randomized mutations and the randomized targets as the input of our SCS, and we compute the P-value of the enrichment in the CGC list, with results also shown in Figure 3b.
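The thresholding and the top-k evaluation described above can be sketched as follows. This is a minimal illustration and not the authors' MATLAB implementation: the function names, the inclusive `>=` comparison at the threshold, and the reading of "mean fraction" as the average of precision-at-i over i = 1..k are all assumptions made for this example.

```python
def personalized_degs(log2_fold_change, threshold=1.0):
    """Select one patient's DEGs: genes with |log2 fold change| >= threshold.

    `log2_fold_change` maps gene symbol -> log2(tumor / normal) for one patient.
    The inclusive comparison is an assumption; the text only states "+/-1".
    """
    return {gene for gene, lfc in log2_fold_change.items() if abs(lfc) >= threshold}


def precision_at_k(ranked_genes, reference_genes, k):
    """Fraction of the top-k ranked genes that appear in the reference (e.g. CGC) list."""
    top_k = ranked_genes[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for gene in top_k if gene in reference_genes)
    return hits / len(top_k)


def mean_precision(ranked_genes, reference_genes, max_k):
    """Average of precision_at_k over k = 1..max_k (one reading of 'mean fraction')."""
    ks = range(1, min(max_k, len(ranked_genes)) + 1)
    values = [precision_at_k(ranked_genes, reference_genes, k) for k in ks]
    return sum(values) / len(values) if values else 0.0
```

Rerunning `personalized_degs` with thresholds such as 0.6, 0.8, 1.2 and 1.4 and comparing the resulting `mean_precision` values would mirror the robustness check in Figure 3a, given a driver ranking produced for each threshold.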
From these results together in Figure 3, we can see that our threshold is indeed significant for prediction enrichment in the CGC list.

Fig. 3. The robustness and the significance of the threshold of fold change. (a) Choosing the absolute values of the threshold of log2-fold change at (0.6, 0.8, 1, 1.2, 1.4), we find that the precision is robust for our threshold ±1; (b) The significance of the threshold ±1 fold change for selecting target genes randomly, and for selecting mutation genes and target genes randomly. The enrichment score ESg is defined as ESg = -log10(P-value)

We now investigate in depth the commonalities and differences among the driver genes identified by our SCS and by other methods. Across the cancer datasets, approximately two-fifths (19/50 in ovarian), one-quarter (6/25 in melanoma), one-third (8/25 in GBM), one-third (8/25 in PRAD) and two-fifths (10/25 in BLCA) of SCS’s candidate drivers are also identified by other methods, as shown in Figure 4, Supplementary Figure S4 and Supplementary Table S1. These results indicate that many well-established cancer-related mutated genes are also predicted for the five cancers, including EGFR in ovarian cancer, NRAS in melanoma, MYC in GBM, CTNNBCC in PRAD, and EP300, CREBBP and RB1 in BLCA.
It is also observed that TP53, which has a high mutated frequency, is detected as a candidate driver gene in most cancer datasets except the melanoma dataset, which again supports the view that high mutated frequency alone is not a sufficient criterion for detecting cancer drivers. In addition, the gene lists of our SCS, OncoIMPACT, DawnRank, HotNet2, DriverNet and a frequency-based method are given in Supplementary Table S1, which shows whether each gene is identified by all six methods or by only some of them.

Fig. 4. Comparison of candidate driver genes in ovarian cancer, MELANOMA cancer datasets and GBM datasets by various methods. SCS, OncoIMPACT, DriverNet and the frequency-based method are applied to the GBM and ovarian cancer datasets from TCGA. For each tool, the Top-50 predicted driver genes are extracted. The candidate driver genes themselves are listed according to the tools by which they are identified (7: SCS, OncoIMPACT, DriverNet and Frequency; 6: SCS, OncoIMPACT and DriverNet; 4: SCS and DriverNet; 3: SCS, DriverNet and Frequency; 2: SCS and DriverNet; 1: SCS and Frequency; 0: SCS alone)

SCS also identified some driver genes that have not been classified as drivers by other computational methods.
About one-half (12/25) and four-fifths (19/25) of SCS’s candidate drivers are not identified by other computational tools; among these, five mutated genes in ovarian cancer, six in melanoma, four in PRAD and three in BLCA are indeed included in the CGC. In addition, we compare the mutated frequency of the driver genes determined by SCS with the overall mean mutated frequency of all genes among all samples. From Figure 5a, we can see that the mean mutated frequency of our predicted Top-50 driver genes is higher than the overall mean mutated frequency of all genes in the five benchmark cancer datasets. In Figure 5b, we show the fraction of predicted genes whose mutated frequency is higher than the overall mean and the fraction whose mutated frequency is lower than the overall mean. Interestingly, most of the predicted driver genes are mutated more often than the overall mean in the Ovarian and MELANOMA datasets, while fewer of the predicted driver genes are mutated more often than the overall mean in the GBM, PRAD and BLCA datasets. These results demonstrate again that high mutated frequency alone is not a sufficient criterion for detecting cancer drivers, and relying on it would introduce more false negatives.

Fig. 5. Comparison of the mutated frequency of candidate driver genes with the whole mean mutated frequency in ovarian cancer, MELANOMA cancer, GBM, PRAD and BLCA cancer datasets. (a) The mutated frequency of the predicted driver genes of our SCS (red) for the five cancer datasets is higher than the whole mean mutated frequency (green); (b) The fraction of genes whose mutated frequency is greater than the whole mean mutated frequency in the five cancer datasets. The yellow color denotes the fraction of genes whose mutated frequency is higher than the whole mean mutated frequency, while the blue color denotes the fraction of genes whose mutated frequency is lower than the whole mean mutated frequency (Color version of this figure is available at Bioinformatics online.)

3.2 SCS efficiently discovers personalized driver genes

Here, we demonstrate SCS’s ability to determine personalized and rare driver genes. The main aspect that distinguishes SCS from existing methods is its ability to discover rare or even personalized-sample driver genes. Even if a gene is altered in only a single patient, SCS is capable of evaluating the impact potential of that gene alteration. In our case, a gene is considered to be a rare driver if it is labeled as significant among the impactful population drivers by the Condorcet method in the above section, and is also mutated in only a small number of patients (<=5%). We selected the genes that fit these criteria to discover potential personalized driver genes.
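The rare-driver criterion above can be expressed as a simple filter. The sketch below is illustrative only: it assumes the set of significant population-level drivers (from the Condorcet ranking) is already available, and the function name, the data layout and the requirement that a gene be mutated in at least one patient are assumptions introduced here.

```python
def rare_drivers(mutations_by_patient, significant_drivers, max_fraction=0.05):
    """Return significant drivers mutated in at most `max_fraction` of patients.

    `mutations_by_patient` maps patient ID -> iterable of mutated gene symbols.
    `significant_drivers` is the set of genes labeled significant at the
    population level (assumed to come from the Condorcet step).
    """
    n_patients = len(mutations_by_patient)
    if n_patients == 0:
        return []
    # Count, for each gene, the number of patients carrying a mutation in it.
    counts = {}
    for genes in mutations_by_patient.values():
        for gene in set(genes):
            counts[gene] = counts.get(gene, 0) + 1
    return sorted(
        gene
        for gene in significant_drivers
        if 0 < counts.get(gene, 0) / n_patients <= max_fraction
    )
```

With this layout, a gene such as EGFR that is mutated in roughly 2% of a cohort would pass the filter as long as it is in the significant set, while a frequently mutated gene such as TP53 would not.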
The selection criteria yielded 22 potential personalized driver genes in OVARIAN, 15 in MELANOMA, 35 in GBM, 49 in PRAD and 44 in BLCA (see the genes highlighted in yellow in Supplementary Table S2). Taking the potential personalized driver genes in OVARIAN as an example, we found that several of them are involved in important known cancer pathways. Using KEGG (Kanehisa et al., 2014) to map the 22 potential personalized driver genes to biological pathways (see details in Supplementary Table S3), we found that EGFR belongs to multiple pathways that have a significant impact on cancer, including the ErbB signaling pathway, the TNF signaling pathway and the T cell receptor signaling pathway, all of which are common drug targets in OVARIAN (Charles et al., 2009; De et al., 2008; Wang et al., 2004), implying that EGFR could be targeted. Although EGFR is mutated in only 2.2152% of OVARIAN cancer samples and is ranked 6517th in terms of mutated frequency, EGFR is ranked significantly higher than its average ranking in patients TCGA-09-0366-01 and TCGA-13-0717-01. We look further into the consensus module of EGFR obtained by our SCS in patients TCGA-09-0366-01 and TCGA-13-0717-01, respectively (see details in Supplementary Fig. S5 and Supplementary Table S4). Mapping the corresponding module of each patient to the KEGG datasets (Kanehisa et al., 2014), we found that although both modules are enriched in the ErbB signaling pathway, they differ in their other enriched biological pathways: the insulin, Jak-STAT, T cell receptor, chemokine and mTOR signaling pathways for TCGA-09-0366-01, and the Wnt and MAPK signaling pathways for TCGA-13-0717-01.
Among the consensus modules of EGFR in the above two patients, we also found 35 and 15 drug targets, respectively. By mapping these drug targets to the KEGG datasets (Kanehisa et al., 2014) again, we found that EGFR and TP53, as common drug targets, can be affected by Cisplatin and Carboplatin, respectively; meanwhile CTNNB1, ERBB2, INSR and KRAS can be targeted as personalized drug targets for TCGA-09-0366-01, and NOS3 and TNFRSF1B can be targeted as personalized drug targets for TCGA-13-0717-01. More details can be found in Supplementary Table S4.

3.3 SCS improves tumor stratification

Taking the personalized mutation genes, personalized DEGs and personalized network structure as input, our SCS outputs patient-specific driver mutation profiles in which the state of each gene is no longer binary but reflects its phenotypic impact on the DEGs. Patient-specific driver mutational profiles are promising biomarkers for tumor stratification (Bertrand et al., 2015; Chen et al., 2012; van’t Veer and Bernards, 2008; Zeng et al., 2015; Zeng et al., 2014) since, by definition, they are likely causative events for carcinogenesis and metastasis. Integrating the driver mutation signatures, the expression patterns and the gene network provides a comprehensive way to understand complex diseases in a multi-view manner (Shi et al., 2017). As a first pilot exploration of this concept, we also investigate the use of SCS’s predictions for stratifying patients, especially with respect to their survival outcomes. The unsupervised clustering method SNF (Wang et al., 2014) is applied to integrate the predicted driver profiles and the expression data. Evaluations of survival outcomes for patients in these clusters by Kaplan–Meier statistics suggest that the patient clusters have significant prognostic value for survival analysis [Fig. 6(a–c)].
With this stratification strategy, as shown in Figure 6(a–c), three subtypes are identified for ovarian cancer, five subtypes for melanoma and three subtypes for GBM (the results for the other cancer datasets can be found in Supplementary Material). The P-values (0.00257 for ovarian, 0.000257 for melanoma and 0.00442 for GBM) show significant survival differences among the identified subtypes.

Fig. 6. Tumor stratification using predicted driver gene profiles by SCS. (a–c) Survival profiles of ovarian, melanoma and glioblastoma cancer patients stratified by integrating the driver mutation profiles and expression data. (d) Bar plot showing the P-values (log rank test) for survival profiles of ovarian, melanoma and glioblastoma cancer patients using different gene signatures (e.g. Integration of driver mutation profiles and expression data, and Integration of mutation profiles and expression data)

In addition, we found that directly combining the mutation data and the expression data is not effective for patient survival prediction (Fig. 6d). In contrast, the integration of the driver mutation profiles from SCS with the expression profiles performs better, as shown in Figure 6d. Overall, these results highlight the promise of driver gene profiles for stratifying patients in an unsupervised fashion.
3.4 Subtype-specific driver genes and consensus modules from tumor stratification

To analyze in depth how the distribution of the predicted driver genes differs between particular cancer subtypes, the subtype-specific driver genes with high mutated frequency are defined as the genes satisfying the following requirements: (1) the gene is ranked in the Top 10 among the subtype samples in terms of mutated frequency; (2) its alteration frequency in the subtype is greater than that in all samples. In Supplementary Note 5 and Supplementary Table S2, the summary of the subtype-specific genes and the mutated frequency of the Top-50 drivers for different subtypes are, respectively, listed for the OVARIAN, MELANOMA and GBM cancer datasets. Most sample clusters are characterized by a few key driver genes, which are predominantly mutated in tumors belonging to that sample cluster and serve to distinguish them from tumors in other clusters. As shown in Figure 7a, the mutated frequency distribution of the top drivers differs significantly between most pairs of subtypes. For example, TP53 is highly mutated in all subtypes in OVARIAN cancer because TP53 is always mutated in OVARIAN samples, as seen in Figure 7b; in contrast, in MELANOMA or GBM cancer, TP53 is predominantly mutated in tumors belonging to one cluster and serves to distinguish them from tumors in other clusters, as seen in Figure 7b.

Fig. 7. Different properties of the predicted Top-50 driver genes in different cancer subtypes. (a) Box plots showing the difference of the distribution of the mutation frequency of the ranked Top-50 drivers between the different subtypes in the ovarian (p12 = 4.2348e-4, p23 = 0.1546, p13 = 3.6276e-6), melanoma (p12 = 3.6276e-6, p13 = 3.6276e-6, p14 = 0.0951, p15 = 7.8398e-10; p23 = 0.6779, p24 = 3.6276e-6, p25 = 0.0171; p34 = 3.6276e-6, p35 = 0.0560; p45 = 7.8398e-10) and glioblastoma cancer (p12 = 2.7638e-5, p23 = 0.5077, p13 = 3.6276e-6). (b) The statistical analysis of the mutation frequency associated with the different cancer subtypes of TP53. (c) The statistical analysis of the drug sensitivity associated with the different cancer subtypes of TP53

The consensus modules of the predicted driver genes also allow us to identify the drug sensitivity of these cancer drivers. For example, for the TP53 gene, we first find the samples in each subtype that have TP53 as a driver, and obtain the corresponding downstream control module for each. Then we compute the fraction of FDA drug targets within the module for each of these samples in the subtype. Finally, we take the mean fraction within the subtype as the subtype drug sensitivity of the driver of interest. Based on the drug sensitivity obtained in each subtype, the subtype-specific driver genes with high drug sensitivity are defined as the genes satisfying the following requirements: (1) the gene is ranked in the Top 10 among the subtype samples in terms of drug sensitivity; (2) its drug sensitivity in the subtype is greater than that in all samples.
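The drug-sensitivity score and the two-part subtype-specific criterion described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names and data layout are hypothetical, samples in which the gene is not a driver are simply skipped, and the same `subtype_specific_genes` helper is reused for either score (mutated frequency or drug sensitivity).

```python
def subtype_drug_sensitivity(modules_by_sample, subtype_samples, drug_targets):
    """Mean fraction of drug targets within a driver's module over a subtype.

    `modules_by_sample` maps sample ID -> set of genes in the driver's
    downstream module for that sample (absent or empty if the gene is not a
    driver there). `drug_targets` is a set of known drug-target genes.
    """
    fractions = []
    for sample in subtype_samples:
        module = set(modules_by_sample.get(sample, ()))
        if not module:
            continue  # the gene is not a driver in this sample
        fractions.append(len(module & drug_targets) / len(module))
    return sum(fractions) / len(fractions) if fractions else 0.0


def subtype_specific_genes(score_in_subtype, score_in_all, top_n=10):
    """Genes ranked in the top `top_n` by subtype score whose subtype score
    also exceeds their score over all samples."""
    ranked = sorted(score_in_subtype, key=lambda g: (-score_in_subtype[g], g))
    return [
        gene
        for gene in ranked[:top_n]
        if score_in_subtype[gene] > score_in_all.get(gene, 0.0)
    ]
```

Ties in the ranking are broken alphabetically here, which is an arbitrary choice made only to keep the output deterministic.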
In Supplementary Note 5 and Supplementary Table S5, the summary of the subtype-specific genes with high drug sensitivity and the drug sensitivity of the Top-50 drivers for different subtypes are, respectively, listed for the OVARIAN, MELANOMA and GBM cancer datasets. From Supplementary Note 5, we can see that most subtype-specific driver genes with high mutated frequency differ from those with high drug sensitivity. This indicates that driver genes with high mutated frequency are not necessarily driver genes with high drug sensitivity, so a cancer driver gene is not always a suitable drug target, and its suitability depends on the individual. Even so, we found that in subtype 3 of GBM, TP53, a subtype driver with high mutated frequency, also has significant drug sensitivity. It has been reported that the alteration leads to the activation of the PI3K/Akt and Ras/MAPK pathways (Mischel and Cloughesy, 2003), which provide targets for therapy and support our computational results. Furthermore, the ability to generate personalized-sample driver consensus modules allows us to analyze the differences in enriched pathways between cancer subtypes. Again, the enriched pathways of TP53, a known driver gene, are used to characterize the identified subtypes in the glioblastoma and ovarian cancer datasets. With the module of driver gene TP53 for each sample, we first calculate the P-value of the enrichment for a pathway by the hypergeometric test (Rivals et al., 2007), and we regard a pathway as significantly enriched within the sample when the P-value is less than 0.005. Subsequently, according to the frequency with which the enriched pathway appears across the samples of a given subtype, a pathway is selected as subtype-specific when it has frequency f ≥ 0.5 in the samples of that subtype.
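The two-step selection above (per-sample hypergeometric enrichment at P < 0.005, then a frequency cut-off of f ≥ 0.5 within a subtype) can be sketched as follows. This is an illustrative sketch under stated assumptions: the upper-tail form of the test, the choice of background gene universe, and all function names are assumptions made here, and module and pathway genes are assumed to lie within that background.

```python
from math import comb


def hypergeom_pvalue(population, successes, draws, observed):
    """Upper-tail hypergeometric P-value: P(X >= observed).

    population: number of background genes; successes: genes in the pathway;
    draws: genes in the module; observed: genes shared by module and pathway.
    """
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(population, draws)


def enriched_pathways(module, pathways, background_size, alpha=0.005):
    """Pathways whose enrichment P-value in one sample's module is below alpha."""
    module = set(module)
    enriched = set()
    for name, members in pathways.items():
        members = set(members)
        overlap = len(module & members)
        p = hypergeom_pvalue(background_size, len(members), len(module), overlap)
        if p < alpha:
            enriched.add(name)
    return enriched


def subtype_specific_pathways(enriched_by_sample, min_frequency=0.5):
    """Pathways enriched in at least `min_frequency` of a subtype's samples."""
    n_samples = len(enriched_by_sample)
    counts = {}
    for pathways in enriched_by_sample.values():
        for name in set(pathways):
            counts[name] = counts.get(name, 0) + 1
    return sorted(n for n, c in counts.items() if c / n_samples >= min_frequency)
```

Exact integer binomials from `math.comb` keep the tail probability accurate for small modules; for genome-scale inputs a statistics library would usually be a more practical choice.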
From the results shown in Supplementary Figure S6 and Supplementary Table S6, TP53 is indeed mutated at high frequency, but it can have different regulatory effects on the biological pathways in different subtypes, indicating tumor heterogeneity at the biological system level rather than at the level of single sequence mutations. Therefore, the identified driver mutation profiles can improve tumor stratification with more biological significance and interpretability.

4 Discussion and conclusions

Cancer genomics has now rightly shifted toward integrative analysis, with driver gene identification being a key focus (Vogelstein et al., 2013). As personalized medicine becomes a focus of research, identifying personalized-sample driver genes that have predictive power for applications ranging from personalized disease diagnosis and drug sensitivity analysis to patient care is attracting wide attention (Bertrand et al., 2015; Hou and Ma, 2014; Sheng et al., 2015; Wang et al., 2015). Accurate analysis of personalized genomic instability, e.g. somatic mutations, is necessary for translating the full benefit of cancer genome sequencing into the clinic. Computational models and methods are required to prioritize biologically active driver genes over inactive passengers based on cancer high-throughput sequencing data. However, few methods can efficiently distinguish the unique complement of genes that drive tumorigenesis in each patient. It is already recognized that patients with a given cancer are not all the same and that there may exist unique driver genes for each patient. Although existing computational methods have identified many common cancer drivers, it remains challenging to predict personalized driver genes and to assess rare and even personalized-sample mutations. Here, we proposed a new and efficient framework, called SCS, based on the structural control theory of complex networks.
SCS seeks the minimal set of individual mutations that can control the maximal set of individual DEGs in the transition from the normal state to the disease state, and also identifies the consensus module containing the control paths from the candidate individual driver genes to the target DEGs. More importantly, the quantified confidence of the control paths in the consensus driver module can be used to evaluate the impact potential of candidate individual driver genes on the altered expression patterns. We apply SCS to multiple cancer datasets from TCGA. The validation results suggest that SCS can efficiently identify known cancer driver genes for a large number of samples, and that SCS outperforms other competing approaches (OncoIMPACT, DawnRank, DriverNet, HotNet2 and the frequency-based approach) across distinct benchmarks. SCS is robust to noise and works well with small datasets, making it applicable to a wide array of sample collections. It is widely accepted that most driver genes have low mutation frequency, a phenomenon called the ‘long tail of rarely mutated genes’ (Hofree et al., 2013; Leiserson et al., 2015; Wang et al., 2014; Zhang et al., 2016), and such genes are typically disregarded by traditional frequency-based methods. In contrast, our SCS delves deeper into the long tail of rarely mutated genes regardless of mutation frequency by combining the patient-specific network structure (patient-specific edges) and the personal biological information (personalized mutations and personalized DEGs). The results in Figure 2 and Table 1 show that our SCS exhibits a significantly higher enrichment for mutations in the CGC compared with the traditional frequency-based approaches. Therefore, our SCS approach provides an efficient tool to discover rare (infrequent) driver genes.
DriverNet and HotNet2 address the problem of finding significantly mutated subnetworks in large and broad datasets of mutational frequency spectra; they ignore individual information and therefore perform worse. SCS, OncoIMPACT and DawnRank all assume that gene mutations can lead to transcriptomic changes. However, our SCS handles the patient-specific network (topology) information differently from OncoIMPACT and DawnRank. Given the mutations in a patient, OncoIMPACT relates a gene in the patient to those mutations by using a statistical permutation-based model, whose parameters are determined with a grid search on the whole gene network common to all patients, rather than on each patient-specific network. DawnRank, on the other hand, adopts the PageRank algorithm to assess the impact scores of patient-specific mutations, also on the whole gene network. Therefore, these two models identify the modules of the personalized mutations by determining the model parameters directly on the whole gene network, which is not appropriate because it neglects the patient-specific network (topology) information. In other words, our SCS has two advantages over OncoIMPACT and DawnRank. Firstly, OncoIMPACT and DawnRank apply their search techniques to the whole gene network common to all patients to obtain the subnetwork of mutations for one patient. The overall network (topology) information used in OncoIMPACT and DawnRank does not take individual information into account and is not patient-specific. In contrast, our SCS runs the RWR algorithm in each sample to search for the patient-specific mutated network, which consists of the frequently interrupted interactions among the mutation genes in the human interactome.
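For readers unfamiliar with RWR (random walk with restart), the generic iteration can be sketched as below. This is the textbook algorithm, not the SCS implementation: the restart probability of 0.5, the unweighted neighbour lists, the handling of nodes without outgoing edges, and the function name are all assumptions made for illustration, and how SCS turns the resulting scores into a patient-specific mutated network is not shown.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until convergence.

    `adjacency` maps every node to a list of its neighbours (each neighbour
    must also be a key). `seeds` are the start nodes, e.g. one patient's
    mutated genes. Returns a dict node -> stationary visiting probability.
    """
    nodes = list(adjacency)
    seed_set = set(seeds) & set(nodes)
    if not seed_set:
        raise ValueError("at least one seed must be a node of the network")
    p0 = {v: (1.0 / len(seed_set) if v in seed_set else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            neighbours = adjacency[u]
            if neighbours:
                # Spread the walking mass of u evenly over its neighbours.
                share = (1.0 - restart) * p[u] / len(neighbours)
                for v in neighbours:
                    nxt[v] += share
            else:
                # Assumption: a node without outgoing edges sends its mass
                # back to the seeds so that probabilities still sum to one.
                for v in seed_set:
                    nxt[v] += (1.0 - restart) * p[u] * p0[v]
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p
```

On a three-node path A-B-C with A as the only seed, the scores decay with distance from the seed, which is the behaviour that makes RWR useful for ranking genes near a patient's mutations.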
RWR has been shown to be sensitive in identifying disease candidate genes and has been successfully applied in disease-phenotype analyses (Jia and Zhao, 2014; Li and Patra, 2010; Luo et al., 2017). SCS then applies our CTC, whose parameters are determined on the patient-specific mutated network instead of the whole network, to identify the personalized driver genes. Secondly, the impact score for mutations in OncoIMPACT and DawnRank is the number of related DEGs, while the impact score in our SCS is the sum of the confidence control weights within the modules, which offers a more refined approach with more biological significance by considering all possible paths between mutations and DEGs. Compared with OncoIMPACT and DawnRank, the main difference in handling the patient-specific network is that SCS uses both patient-specific edge and node information, whereas OncoIMPACT and DawnRank use only patient-specific node information. As shown in Table 1, our SCS with the RWR algorithm outperformed OncoIMPACT and DawnRank in predicting driver genes at the population level by considering the patient-specific mutated network. The main contribution of our SCS is that we apply network control theory to the field of driver gene discovery by combining the personalized mutations, personalized DEGs and personalized network (topology or edges). This is the first integrative model to combine network control theory and individual multi-layer biological data for identifying personalized driver genes. In addition, SCS’s personalized-sample driver gene predictions revealed new biological insights into tumor stratification and prognosis analysis. Tumors can be efficiently stratified into molecularly determined subgroups through the integration of the predicted driver mutation profiles and gene expression profiles.
Such patient subgroups exhibit significantly different survival outcomes, establishing the clinical relevance of stratification based on SCS’s predictions. More importantly, for the same driver genes, the subtype-specific pathways enriched by the driver genes and their corresponding consensus modules can be screened and provide more biological evidence for the molecular subtypes. Notably, it has been widely recognized that genes form a network to interact with each other, and that a complex disease or phenotypic change in a person usually results not from changes in individual genes or molecules but from changes in their biological system or network. With rapid advances in high-throughput techniques, biological networks based on omics data have become powerful resources and have been successfully applied to many fields of biology and medicine, such as driver gene discovery and drug target identification (Hofree et al., 2013; Leiserson et al., 2015; Wang et al., 2014; Zhang et al., 2016). Generally, the progression of a complex disease such as cancer in each patient can be viewed as a state transition of the corresponding biological system or network from a normal state to a disease state, resulting from the gradual accumulation of multiple driver mutations. From the perspective of systems control, such driver mutations can be viewed as controllers, which drive the biological system or network of a patient from a normal state to a disease state. Thus, a natural framework is to assess the impact of candidate driver genes (genomic and epigenomic data) on their gene interaction network by associating mutations (as controllers) with DEGs (Bertrand et al., 2015; Hou and Ma, 2014; Leiserson et al., 2015; Zhang et al., 2016). We assume that there is a specific gene network in each patient, i.e. a patient-specific network, due to different personalized features and further heterogeneous disease features at both the genomic and epigenomic levels.
Therefore, as an important step of SCS, we must identify the patient-specific mutated network by applying the random walk with restart (RWR) technique to the observed data together with prior network information. In summary, SCS uses an innovative method to prioritize cancer driver mutation profiles from a network controllability perspective, and provides dramatic improvements over existing methods. SCS not only helps to discover personalized causal mutations among those obscured by tumor heterogeneity, but also methodologically bridges the traditional structural control methods of complex networks and genomics research. However, some open questions remain for SCS and similar studies: (1) SCS depends on the reference network, whose structural incompleteness would increase false negatives in its predictions; and (2) the assumption that drivers act on gene expression patterns could be expanded to their impact on other omics patterns.

Acknowledgement

The authors thank Professor Fang-Xiang Wu from University of Saskatchewan for giving us valuable comments.

Funding

This paper was supported by National Key R&D Program (2017YFA0505500), Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB13040700), the National Natural Science Foundation of China (61473232, 91430111, 91439103, 91529303, 31771476, 81471047, 31200987 and 61170134), National Key R&D Program (Special Project on Precision Medicine) (2016YFC0903400) and Natural Science Foundation of Shanghai (17ZR1446100).

Conflict of Interest: none declared.

References

Aytug S. et al. (2003) Impaired IRS-1/PI3-kinase signaling in patients with HCV: a mechanism for increased prevalence of type 2 diabetes. Hepatology, 38, 1384–1392.
Bakken T.E. et al. (2016) Comprehensive transcriptional map of primate brain development. Nature, 535, 367.
Bashashati A. et al. (2012) DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome Biol., 13, R124.
Becskei A., Serrano L. (2000) Engineering stability in gene networks by autoregulation. Nature, 405, 590–593.
Bertrand D. et al. (2015) Patient-specific driver gene prediction and risk assessment through integrated network analysis of cancer omics profiles. Nucleic Acids Res., 43, e44–e44.
Carter H. et al. (2009) Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res., 69, 6660.
Charles K.A. et al. (2009) The tumor-promoting actions of TNF-α involve TNFR1 and IL-17 in ovarian cancer in mice and humans. J. Clin. Investig., 119, 3011.
Chen L. et al. (2012) Detecting early-warning signals for sudden deterioration of complex diseases by dynamical network biomarkers. Sci. Rep., 2, 7391–7342.
Chin L. et al. (2011) Cancer genomics: from discovery science to personalized medicine. Nat. Med., 17, 297–303.
Ciriello G. et al. (2012) Mutual exclusivity analysis identifies oncogenic network modules. Genome Res., 22, 398–406.
Cowen L. et al. (2017) Network propagation: a universal amplifier of genetic associations. Nat. Rev. Genetics, 18, 551–562.
Croft D. et al. (2010) Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res., gkq1018.
De G.P. et al. (2008) The ErbB signalling pathway: protein expression and prognostic value in epithelial ovarian cancer. British J. Cancer, 99, 341–349.
Futreal P.A. et al. (2004) A census of human cancer genes. Nat. Rev. Cancer, 4, 177–183.
Gao J. et al. (2014) Target control of complex networks. Nat. Commun., 5, 5415.
Greenman C. et al. (2008) Patterns of somatic mutation in human cancer genomes. Nature, 6, 153–158.
Guo W.-F. et al. (2017) Constrained target controllability of complex networks. J. Stat. Mech., 2017, 063402.
Haber D.A., Settleman J. (2007) Cancer: drivers and passengers. Nature, 446, 145–146.
Hofree M. et al. (2013) Network-based stratification of tumor mutations. Nat. Methods, 10, 1108–1115.
Hou J.P., Ma J. (2014) DawnRank: discovering personalized driver genes in cancer. Genome Med., 6, 56.
Jia P., Zhao Z. (2014) VarWalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data. PLoS Comput. Biol., 10, e1003460.
Kanehisa M. et al. (2011) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res., 40 (D1), D109–D114.
Kanehisa M. et al. (2014) Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res., 42, 199–205.
Kang H. et al. (2015) Inferring sequential order of somatic mutations during tumorgenesis based on Markov chain model. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 12, 1094.
Koren H.S. et al. (1989) Ozone-induced inflammation in the lower airways of human subjects. Am. Rev. Respiratory Dis., 139, 407–415.
Kumar R.D. et al. (2016) Unsupervised detection of cancer driver mutations with parsimony-guided learning. Nat. Genetics, 48, 1288.
Leiserson M.D. et al. (2015) Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genetics, 47, 106–114.
Li Y., Patra J.C. (2010) Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network. Bioinformatics, 26, 1219–1224.
Lin C.T. (1974) Structural controllability. IEEE Trans. Automatic Control, 19, 201–208.
Liu Y.-Y. et al. (2011) Controllability of complex networks. Nature, 473, 167–173.
Liu X. et al. (2016) Personalized characterization of diseases using sample-specific networks. Nucleic Acids Res., 44, e164.
Luo Y. et al. (2017) A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat. Commun., 8, 573.
Mao Y. et al. (2013) CanDrA: cancer-specific driver missense mutation annotation with optimized features. PLoS One, 8, e77945.
Mischel P.S., Cloughesy T.F. (2003) Targeted molecular therapy of GBM. Brain Pathol., 13, 52.
Nemhauser G.L., Wolsey L.A. (1988) Integer and combinatorial optimization. Wiley, New York.
Pihur V. et al. (2008) Finding common genes in multiple cancer types through meta-analysis of microarray experiments: a rank aggregation approach. Genomics, 92, 400–403.
Rivals I. et al. (2007) Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics, 23, 401–407.
Schaefer C.F. et al. (2009) PID: the pathway interaction database. Nucleic Acids Res., 37, D674–D679.
Schilsky R.L. (2010) Personalized medicine in oncology: the future is now. Nat. Rev. Drug Discov., 9, 363–366.
Sheng J. et al. (2015) Optimal drug prediction from personal genomics profiles. IEEE J. Biomed. Health Inform., 19, 1264–1270.
Shi Q. et al. (2017) Pattern fusion analysis by adaptive alignment of multiple heterogeneous omics data. Bioinformatics, 33, 2706–2714.
Suo C. et al. (2015) Integration of somatic mutation, expression and functional data reveals potential driver genes predictive of breast cancer survival. Bioinformatics, 31, 2607–2613.
Tothova Z. et al. (2007) FoxOs are critical mediators of hematopoietic stem cell resistance to physiologic oxidative stress. Cell, 128, 325–339.
van ‘t Veer L.J., Bernards R. (2008) Enabling personalized cancer medicine through analysis of gene-expression patterns. Nature, 452, 564–570.
Vandin F. et al. (2012) De novo discovery of mutated driver pathways in cancer. Genome Res., 22, 375–385.
Vinayagam A. et al. (2011) A directed protein interaction network for investigating intracellular signal transduction. Sci. Signal, 4, rs8–rs8.
Vinayagam A. et al. (2016) Controllability analysis of the directed human protein interaction network identifies disease genes and drug targets. Proc. Natl. Acad. Sci. USA, 113, 4976–4981.
Vogelstein B. et al. (2013) Cancer genome landscapes. Science, 339, 1546–1558.
Wang H. et al. (2004) Ovarian carcinoma cells inhibit T cell proliferation: suppression of IL-2 receptor β and γ expression and their JAK-STAT signaling pathway. Life Sci., 74, 1739–1749.
Wang B. et al. (2014) Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods, 11, 333–337.
Wang L. et al. (2015) A computational method for clinically relevant cancer stratification and driver mutation module discovery using personal genomics profiles. BMC Genomics, 16, S6.
Wei X. et al. (2011) Exome sequencing identifies GRIN2A as frequently mutated in melanoma. Nat. Genetics, 43, 442–446.
Wolsey L.A. (1998) Integer Programming. Wiley, New York.
Wu G. et al. (2010) A human functional protein interaction network and its application to cancer data analysis. Genome Biol., 11, R53.
Wu F.X. et al. (2014) Transittability of complex networks and its applications to regulatory biomolecular networks. Sci. Rep., 4, 4819.
Yan G. et al. (2017) Network control principles predict neuron function in the Caenorhabditis elegans connectome. Nature, 550, 519.
Yu X. et al. (2017) Individual-specific edge-network analysis for disease prediction. Nucleic Acids Res., 45, e170–e170.
Zeng T. et al. (2014) Edge biomarkers for classification and prediction of phenotypes. Sci. China Life Sci., 57, 1103–1114.
Zeng T. et al. (2015) Big-data-based edge biomarkers: study on dynamical drug sensitivity and resistance in individuals. Brief. Bioinformatics, 17, 863–874.
Zhang S.-Y. et al. (2016) m6A-Driver: identifying context-specific mRNA m6A methylation-driven gene interaction networks. PLoS Comput. Biol., 12, e1005287.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Bioinformatics, Oxford University Press

Discovering personalized driver mutation profiles of single samples in cancer by network control strategy

Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty006

Abstract

Motivation: It is a challenging task to discover personalized driver genes that provide crucial information on disease risk and drug sensitivity for individual patients. However, few methods have been proposed to identify personalized-sample driver genes from cancer omics data, owing to the lack of samples for each individual. To circumvent this problem, here we present a novel single-sample controller strategy (SCS) to identify personalized driver mutation profiles from a network controllability perspective.

Results: SCS integrates mutation data and expression data into a reference molecular network for each patient to obtain the driver mutation profiles in a personalized-sample manner. This is the first computational framework to bridge the personalized driver mutation discovery problem and the structural network controllability problem. The key idea of SCS is to detect those mutated genes which can achieve the transition from the normal state to the disease state, based on each individual's omics data, from a network controllability perspective. We widely validate the driver mutation profiles of our SCS from three aspects: (i) the improved precision for the predicted driver genes in the population compared with other driver-focused methods; (ii) the effectiveness for discovering the personalized driver genes and (iii) the application to risk assessment through the integration of the driver mutation signature and expression data, across five distinct benchmarks from The Cancer Genome Atlas. In conclusion, our SCS makes efficient and robust predictions of personalized driver mutation profiles, opening new avenues in personalized medicine and targeted cancer therapy.

Availability and implementation: The MATLAB-package for our SCS is freely available from http://sysbio.sibcb.ac.cn/cb/chenlab/software.htm.
Contact: zhangsw@nwpu.edu.cn or zengtao@sibs.ac.cn or lnchen@sibs.ac.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

With rapid advances in genomic techniques, there is a pressing need for integrative analysis of cancer omics data in terms of somatic mutations, transcriptomic changes and epigenetic alterations. A critical challenge facing cancer genomics today is to integrate these information-rich datasets to provide clinically relevant insights into tumor biology and cancer diagnostics and therapeutics. A fundamental question in the analysis of cancer genomic data is how to identify driver genes that contribute to cancer initiation and progression, and to distinguish them from the numerous passenger mutation genes that emerge simply as a result of genomic instability during cancer progression (Haber and Settleman, 2007). As is well known, identifying cancer drivers is crucial from a clinical perspective, where personalized-sample driver genes would hold significant value for defining personalized therapeutic targets (Chin et al., 2011; Schilsky, 2010). This situation has prompted a number of computational methods that assist in differentiating driver mutation genes from passenger mutation genes. Recent studies can be mainly categorized into two classes, the machine learning based methods (Carter et al., 2009; Kumar et al., 2016; Mao et al., 2013) and the network based methods (Bashashati et al., 2012; Bertrand et al., 2015; Greenman et al., 2008; Hou and Ma, 2014; Kang et al., 2015; Suo et al., 2015; Zhang et al., 2016). The machine learning based methods are usually trained on mutations designated as pathogenic or neutral; their advantage is that such models can be developed for any specific task, depending on the choice of training data.
For example, in CHASM and CanDrA, driver genes are classified by models trained on known cancer-causing somatic missense mutations (Carter et al., 2009; Mao et al., 2013), although these models are limited in their applications by the probable incompleteness of the databases they rely on. On the other hand, network-based approaches have become one of the most promising ways to understand cancer drivers, owing to their power to elucidate the molecular mechanisms of disease development at the network level (Liu et al., 2016; Yu et al., 2017), and such network-based methods have been successfully applied to many biomedical fields, such as cancer driver discovery and drug target identification (Cowen et al., 2017; Hofree et al., 2013; Li and Patra, 2010; Wang et al., 2014; Zhang et al., 2016). However, most of the existing methods, such as MEMo (Ciriello et al., 2012), Dendix (Vandin et al., 2012) and DriverNet (Bashashati et al., 2012), require a large number of patient samples to generate reliable results and are not well suited for distinguishing rare driver genes or personalized-sample driver genes. Some newer methods, such as DawnRank (Hou and Ma, 2014) and OncoIMPACT (Bertrand et al., 2015), have begun to focus on finding personalized-sample driver genes. However, they ignore the patient-specific network (topology or edges) information when determining the model parameters, which potentially leads to many false positives (Hou and Ma, 2014). To simultaneously identify the personalized-sample driver genes and their driving networks/pathways, we introduce network control theory to model the driving role and influence of individual mutation genes on tumors. Network control theory (Gao et al., 2014; Liu et al., 2011; Wu et al., 2014), which considers how to choose the proper subset of network nodes to control the whole network from one state (e.g. disease state) to another (e.g. normal state), has become a powerful conceptual paradigm in biology for understanding biological systems at a system level. How to control a large-scale biological network, which senses and processes both external and internal cues through interacting molecules, is a central issue for biological systems (Vinayagam et al., 2016; Yan et al., 2017). Structural controllability analysis of networks has been applied to some biological systems, where interesting properties of those systems have been discovered (Lin, 1974). However, the existing network-control methods (Gao et al., 2014; Liu et al., 2011; Wu et al., 2014) cannot be applied directly or efficiently to identify driver genes for each patient, since they only consider the information shared across a population of samples and ignore the information particular to an individual. To address these methodological limitations, we here propose an innovative and effective approach based on network control theory, called Single-sample Controller Strategy (SCS), to assess the impact potential of gene mutations on changes in gene expression patterns. Intuitively, we consider mutations as controllers and gene expression profiles as states in a network, and thus SCS aims to detect a small number of mutation genes (i.e. driver genes) which can achieve the transition from the normal state to the disease state from the network controllability viewpoint, based on each individual's gene expression data. Our SCS method integrates the mutation data and expression data into a gene-gene regulation network for each patient. In particular, we aim to identify the minimal number of Individual mutations needed to control the Individual differentially expressed genes (DEGs) in the Individual gene network, i.e. the 3I framework.
The main steps of the SCS method are: (1) to obtain personalized DEGs for each patient by comparing the expression profile of the tumor sample with that of the corresponding normal sample, and then extract the individual gene regulations from the expert-curated databases; (2) to identify the minimal number of individual mutations with network controllability over the maximal coverage of individual DEGs in the individual gene network and (3) based on dynamic network control theory, to rank and select the driver genes from the individual mutations according to the uncovered consensus modules, which consist of confidence-weighted paths from the driver genes to the target genes on the gene network. This single-sample framework differs significantly from DawnRank (Hou and Ma, 2014) and OncoIMPACT (Bertrand et al., 2015), which predict patient-specific driver genes in the global gene network of population samples, whereas SCS considers the cancer-specific or patient-specific mutated network (topology) when predicting individual driver genes. By using only data from an individual patient sample rather than a population of samples, SCS is the first framework that evaluates the impact potential of both CNVs and SNPs on gene expression patterns in a personalized fashion based on network controllability. Unlike DawnRank (Hou and Ma, 2014) and OncoIMPACT (Bertrand et al., 2015), SCS ranks potential driver genes by their influence on the overall differential expression of their downstream genes in the individual molecular network instead of the collective molecular network. The personalized ranking of mutation genes produced by SCS also allows us to apply the Condorcet method (Pihur et al., 2008) to determine a summary ranking of genes in a patient population. We select the Top-50 ranked candidates as the driver genes for a patient population. We have widely validated the driver mutation profiles of our SCS from three aspects.
Firstly, the benchmarking analysis on five different benchmark datasets obtained from The Cancer Genome Atlas (TCGA), namely Glioblastoma (GBM), Ovarian (OVARIAN), Melanoma (MELANOMA), Bladder (BLCA) and Prostate cancer (PRAD), reveals notable improvements over other existing representative methods in terms of precision for discovering driver genes. Secondly, we discuss the personalized scope of SCS by demonstrating its ability to determine personalized novel and rare driver genes. Since SCS classifies driver genes regardless of mutation frequency, it makes it possible to discover rare (infrequent) driver genes. Finally, we demonstrate that the identified driver mutation profiles can be further used as a mutational-status-based signature to be integrated with the expression and network data for tumor stratification and prognostication, which performs better than the integration of a single mutation signature and expression data because it promotes biological significance rather than statistical significance. Based on the detected Top-50 driver genes for a large number of patient samples and the identified subtypes for tumor stratification and prognostication, we also demonstrate the differences in underlying molecular mechanisms in terms of the distributions of mutation frequency of the predicted Top-50 driver genes, and provide the enriched biological pathways for the corresponding identified subtypes. These results all indicate that somatic mutations contribute to heterogeneous responses that drive different expression alterations of the downstream DEGs in different patient cohorts. Taken together, we have demonstrated that SCS can identify personalized driver genes and provide new mutation profiles for quantitatively relating the genotype of each patient to the phenotype.
Our analysis results on various cancer datasets from TCGA support the practical application of SCS, which is effective in integrating cancer omics data for tumor pathology, clinical stratification and personalized therapy. This is the first computational framework to bridge the personalized driver mutation discovery problem and the structural network controllability problem. Therefore, the SCS method opens a new paradigm for personalized medicine and targeted cancer therapy.

2 Materials and methods

2.1 Datasets and reference network

In total, 1435 tumor samples from the TCGA data portal (328 samples of GBM, 316 samples of Ovarian cancer, 379 samples of bladder cancer, 252 samples of PRAD cancer and 160 samples of Melanoma cancer; Bertrand et al., 2015) are studied in our paper. The datasets we used consist of gene expression data and coding region mutation data for five cancer types. SCS analysis was restricted to samples for which information on point mutations, copy-number alterations and gene expression was available (Bertrand et al., 2015). Our SCS uses the reference gene network (Hou and Ma, 2014), which integrates a variety of sources, including the network used in MEMo (Ciriello et al., 2012; Wu et al., 2010) as well as the up-to-date curated information from Reactome (Croft et al., 2010), the NCI-Nature Curated PID (Schaefer et al., 2009) and KEGG (Kanehisa et al., 2011). To aggregate all of the networks, we collapsed all redundant edges to single edges. The resulting reference network consisted of 11 648 genes and 211 794 edges, including self-loops within the network to account for auto-regulation events (Becskei and Serrano, 2000). Furthermore, SCS also uses the directed human PPI network constructed by Vinayagam et al. (2011) for evaluating the effect of the reference network on our SCS.
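The aggregation step just described (taking the union of several directed networks, collapsing redundant edges and keeping self-loops) can be sketched as follows. This is only a minimal illustration in Python, not the authors' MATLAB code; the function name `merge_networks` and the toy gene labels are hypothetical.

```python
def merge_networks(*edge_lists):
    """Union several directed edge lists, collapsing duplicate edges.

    Each edge is a (source, target) pair. Self-loops (source == target)
    are kept, since the text uses them to model auto-regulation.
    """
    merged = set()
    for edges in edge_lists:
        for src, dst in edges:
            merged.add((src, dst))
    return sorted(merged)


# Toy data only: the labels A, B, C stand in for gene symbols.
source_one = [("A", "B"), ("B", "C")]
source_two = [("B", "C"), ("A", "A")]  # repeats B -> C, adds a self-loop
reference = merge_networks(source_one, source_two)
genes = {g for edge in reference for g in edge}
```

With real inputs, the sizes of `genes` and `reference` would be the gene and edge counts that the text reports for the aggregated network.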
The cancer datasets used in our paper as benchmarks for SCS are freely available from http://sysbio.sibcb.ac.cn/cb/chenlab/software.htm.

2.2 Single-sample controller strategy

A natural framework for assessing the impact of mutations is to associate the mutations with the gene expression changes from normal to tumor in their gene network, treating the mutations as the control actions behind those changes; this is also adopted in the design of SCS. Specifically, we consider mutation genes as controllers in a network and the whole gene expression profile in the normal or tumor sample as the respective state, and thus SCS aims to detect a small number of mutation genes (i.e. driver genes) which can achieve the transition from the normal state to the tumor state (or vice versa) from the network controllability viewpoint, based on each individual's gene expression data. To apply SCS, one should prepare the expression profiles of paired tumor-normal samples, the mutation profiles for each sample and the reference network information. With the driver mutation profiles, the mutation state of each gene is no longer discrete but continuous, reflecting its impact potential on phenotypes through personalized DEGs. The key idea of SCS is to apply structural control theory to identify the driver genes, under the assumption that gene mutations control the DEGs through their gene network. SCS views the gene network as a directed graph. Figure 1 shows the overview of our methodology. The SCS algorithm consists of two main steps: (i) identification of individual DEGs (target genes), individual mutation genes (candidate driver genes) and the individual gene network (topology of interacting genes); (ii) identification of the driver genes and their corresponding consensus modules on control paths, which assess the mutational impact of personalized-sample driver genes.

Fig. 1. Overview of SCS’s workflow for identifying driver mutation profiles. (a) Input information. For the gene expression data and gene mutation profiles (SNV and CNV) for sample k or patient k, we identify the DEGs and extract the mutation genes and their close interactors by using the RWR algorithm and randomization-based test in the directed gene network. (b) The main part of our SCS. We apply a new concept called constrained target control (CTC), which focuses on identifying the minimal number of nodes (mutations) to control the targets (DEGs), to identify the driver genes and the corresponding control paths in the personal gene network. Since multiple solutions exist for the CTC, we apply the random Markov sampling to obtain different driver genes and the control paths. By repeating the random Markov sampling, we can obtain the consensus module containing one predicted driver gene and the downregulated genes for each module. In the module, the weight of each edge denotes the confidence of the edge as the control path to the DEGs. The sum of the edge weights forms the impact potential of the predicted driver gene on the expression patterns. In such a way, we can obtain the driver genes whose weights denote the impact potential of the mutations on the expression patterns for sample k.

(1) Identifying target genes, candidate driver genes and network for each individual

In this study, we consider the transcriptomic changes due to mutation. For each patient, we calculate the log2 fold-change of gene expression between the paired tumor and normal samples. A threshold of ±1 on the log2 fold-change is used to identify the DEGs for each patient, which serve as the targets possibly controlled by the mutations according to the SCS assumption. Both the mutation genes and their interactors are then extracted for each patient by using the Random Walk with Restart algorithm (RWR, see the details in Supplementary Note 1). That is, for each sample we calculate the probability of each gene being reached from the individual mutations (e.g. SNV: single nucleotide variations; CNV: copy number variations) by using the RWR algorithm. We also introduce a randomization-based test to evaluate the statistical significance of the control relation between the individual mutations and individual genes by utilizing 100 topologically matched random networks. The candidate genes that remain significant (P < 0.05) are retained and denoted as significant genes for the individual mutation genes. The interactions among the significant genes, including the individual mutation genes themselves, then form the individual mutated network.
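The two operations described above, calling DEGs from a paired log2 fold-change and propagating from the mutated genes with RWR, can be sketched as follows. This is a hedged illustration rather than the SCS implementation: the function names, the restart probability of 0.5, the cutoff |log2 FC| ≥ 1, and the rule that a node without outgoing edges holds the walker in place are all assumptions made for this example, and the randomization-based significance test is not shown.

```python
import math


def call_degs(tumor, normal, threshold=1.0):
    """Return genes whose |log2(tumor / normal)| reaches the threshold.

    Assumes strictly positive expression values for every gene.
    """
    return {g for g in tumor
            if abs(math.log2(tumor[g] / normal[g])) >= threshold}


def random_walk_with_restart(edges, nodes, seeds, restart=0.5,
                             tol=1e-12, max_iter=10000):
    """Steady-state visiting probabilities of a walker restarting at seeds."""
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for v in nodes:
            # Assumption: a dead-end node keeps the walker where it is.
            targets = out[v] or [v]
            share = (1.0 - restart) * p[v] / len(targets)
            for t in targets:
                nxt[t] += share
        if sum(abs(nxt[v] - p[v]) for v in nodes) < tol:
            return nxt
        p = nxt
    return p


# Toy example: a chain m -> x -> y plus an unconnected gene z.
degs = call_degs({"x": 8.0, "y": 5.0, "z": 1.0},
                 {"x": 2.0, "y": 4.0, "z": 4.0})
scores = random_walk_with_restart([("m", "x"), ("x", "y")],
                                  ["m", "x", "y", "z"], seeds={"m"})
```

In this toy graph the unconnected gene `z` receives essentially zero probability, which is the behaviour the text relies on when it keeps only genes that are significantly reachable from a patient's mutations.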
Finally, the individual mutation genes, the individual DEGs and the individual mutated network are linked together and used in the subsequent individual driver gene identification and driver module discovery.

(2) Identifying driver mutation profiles

Unlike existing network control theory, we apply a new concept, Constrained Target Controllability (CTC; Guo et al., 2017), to identify the driver nodes from a constrained subset of the network, instead of the whole network, so as to control the maximal number of target nodes (model and algorithm details can be found in Supplementary Note 1). In this study, we focus on how to identify the driver genes supported by the individual DEGs. To apply CTC to this problem, we first define the DEGs related to the phenotype change (e.g. state change) as the target nodes to be controlled and the mutation genes as the constrained control nodes. Then, to control the DEGs, a greedy algorithm is developed to identify the target controllable subsystem of each mutation gene and the control path from that mutation gene to its target genes (see the details in Supplementary Note 2). With the target controllable subset defined as an iterated bipartite graph, the DEGs are represented within the target controllable subspace of each mutation gene. We apply a parsimony principle to identify a minimal set of driver genes (i.e. mutation genes) associated with the phenotype genes (e.g. DEGs) by solving a minimum set cover problem (Supplementary Fig. S1). Although the minimum set cover problem is NP-hard, with a greedy O(log n) approximation algorithm, the optimal solution can still be obtained efficiently for moderately sized graphs with up to a few tens of thousands of variables by using a classic LP-based branch and bound method (Nemhauser and Wolsey, 1988; Wolsey, 1998).
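To make the set cover formulation concrete, the following minimal sketch uses the greedy approximation mentioned above, not the exact LP-based branch and bound solver that the text says SCS uses. The names `greedy_set_cover` and `coverage` are hypothetical, and the mapping from each mutated gene to the DEGs it can control is assumed to have been computed already.

```python
def greedy_set_cover(coverage, targets):
    """Pick mutated genes until no further target DEG can be covered.

    coverage: dict mapping each candidate gene to the set of DEGs it controls.
    targets:  iterable of DEGs that should be covered.
    Returns (chosen genes in pick order, DEGs left uncovered).
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        # Ties are broken alphabetically because max() keeps the first maximum.
        best = max(sorted(coverage),
                   key=lambda g: len(coverage[g] & uncovered))
        gain = coverage[best] & uncovered
        if not gain:
            break  # the remaining targets are not reachable from any candidate
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Toy example with hypothetical mutated genes m1..m3 and DEGs d1..d5.
coverage = {"m1": {"d1", "d2", "d3"}, "m2": {"d3", "d4"}, "m3": {"d4"}}
drivers, missed = greedy_set_cover(coverage, ["d1", "d2", "d3", "d4", "d5"])
```

The greedy rule only guarantees a cover within a logarithmic factor of the minimum, which is why an exact solver is preferable when the instance is small enough, as the text notes.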
Furthermore, to coalesce mutated genes and DEGs into a consensus network module, we employ random Markov chain (MC) sampling to generate candidate driver genes and the corresponding control paths from the driver genes to the targeted DEGs (see the details in Supplementary Notes 2 and 3). For each personalized-sample mutation gene, different control paths from the mutation gene to the target genes are obtained during MC sampling, so the frequency with which an edge of the gene network appears across 1000 runs of control-path sampling is assigned as the weight of that edge in the module. Finally, the consensus module includes mutation genes with multiple control paths to the target genes, where the edge weights denote the confidence of the control paths. Overall, the construction of personalized-sample consensus modules in SCS provides a comprehensive measurement of the impact potential of a putative driver gene, which results in the driver mutation profiles.

2.3 The time complexity analysis of SCS

The computational complexity of our SCS method mainly stems from two parts. (i) To extract the individual gene regulations from the expert-curated databases on the network G(V, E), where V and E are the node set and edge set, respectively, we integrate the RWR algorithm with a randomization-based test that evaluates the statistical significance of control relations between the individual mutations and individual genes using n = 100 topologically matched random networks, each of which maintains the topological characteristics of the original network (e.g. the degree of each node). The RWR algorithm runs in O(k * n * ‖E‖), where k denotes the number of iterations, n denotes the number of randomly generated networks and ‖E‖ denotes the number of edges in a network.
In fact, the RWR algorithm is efficient and widely used in many research works (Jia and Zhao, 2014; Li and Patra, 2010; Luo et al., 2017). (ii) In the phase of uncovering consensus modules with CTC in the individual gene network G(Vi, Ei) for patient i, we use our CTCA (CTC Algorithm) to identify modules consisting of confidence-weighted paths from the driver genes to the target genes (Guo et al., 2017). Our CTCA runs in O(m * r * ‖Vi‖ * ‖Ei‖), where m is the sampling number, r is the number of iterations needed to obtain the controllable differentially expressed genes of the mutations, and ‖Vi‖ and ‖Ei‖ denote the numbers of nodes and edges in the patient-specific mutated network, respectively (‖Vi‖ ≪ ‖V‖, ‖Ei‖ ≪ ‖E‖). Therefore, the overall computational time complexity of our SCS approach is O(k * n * ‖E‖) + O(m * r * ‖Vi‖ * ‖Ei‖).

3 Results

3.1 SCS accurately and robustly detects driver genes of cancer

Most existing methods for identifying common driver genes are based on aggregate analysis over a large number of patients. To identify driver genes at the population level, our SCS applies a novel scheme. First, we identify the patient-specific mutated network, which consists of the frequently interrupted interactions among the mutation genes in the human interactome, by using a network propagation amplifier of genetic associations (RWR algorithm, Supplementary Note 1; Cowen et al., 2017). Here, the aim of introducing RWR into mutation data analysis is to filter out genes whose mutations likely occur by chance, based on the patient-specific mutational profile. Then, to identify the consensus modules or mutations, our SCS applies the CTC to the patient-specific mutated network, in which mutations (as controllers) can achieve the transition from the normal state to the disease state. Finally, we apply the Condorcet method to summarize the personalized results at the population level.
In particular, our SCS tries to capture individual-specific mutation genes and their impacts in each patient. By integrating the personalized driver genes with the Condorcet method, SCS can also obtain the common driver genes in the population, which allows us to validate its advantage over other driver-focus methods. The Condorcet method is used as a ‘voting’ scheme over the personalized gene rankings to determine the most impactful driver genes in a population (see the details in Supplementary Note 1), and the Top-50 ranked mutations in the population are selected as the candidate driver genes. To perform a systematic comparison across a number of computational methods, the genes annotated in the Cancer Gene Census (CGC; Futreal et al., 2004) are used as a proxy for potential drivers to assess the precision of the top driver genes reported for different cancer sites. The CGC is a well-studied cancer gene database consisting of a list of known driver genes with mutations that have been causally implicated in cancer. CGC genes have been widely used in many cancer studies for benchmark evaluation (Bertrand et al., 2015; Hou and Ma, 2014; Jia and Zhao, 2014). As representative comparisons, SCS is compared against another personalized-sample method (OncoIMPACT; Bertrand et al., 2015), an aggregate network approach (DriverNet; Bashashati et al., 2012) and a commonly used mutation frequency-based approach (Frequency; Wei et al., 2011). As shown in Figure 2a and Supplementary Table S1, SCS’s predictions achieve a stronger enrichment for true positive driver genes. In contrast, the naive frequency-based approach finds fewer known cancer driver genes. For example, the top gene on its PRAD list is HLA-DRB instead of TP53, and the lists of all other driver-focus methods miss EGFR in the Top 10.
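As an illustration of the ‘voting’ idea, the sketch below aggregates per-patient rankings with a Copeland-style pairwise-majority tally, one common way of realizing a Condorcet scheme. The exact procedure used by SCS is given in Supplementary Note 1 and may differ. The treatment of genes missing from a patient's list (a ranked gene is preferred over an unranked one) and the gene labels are assumptions.

```python
from itertools import combinations


def condorcet_rank(rankings):
    """Aggregate per-patient rankings by pairwise majority contests.

    rankings: list of gene lists, one per patient, most impactful gene first.
    A gene earns one point for every other gene it beats in a head-to-head
    majority vote across patients (Copeland-style score, an assumption).
    """
    positions = [{gene: i for i, gene in enumerate(r)} for r in rankings]
    genes = sorted({gene for r in rankings for gene in r})
    wins = {gene: 0 for gene in genes}
    for a, b in combinations(genes, 2):
        prefer_a = prefer_b = 0
        for pos in positions:
            ia, ib = pos.get(a), pos.get(b)
            if ia is None and ib is None:
                continue  # this patient ranks neither gene
            if ib is None or (ia is not None and ia < ib):
                prefer_a += 1
            else:
                prefer_b += 1
        if prefer_a > prefer_b:
            wins[a] += 1
        elif prefer_b > prefer_a:
            wins[b] += 1
    # Highest number of pairwise wins first; ties broken alphabetically.
    return sorted(genes, key=lambda gene: (-wins[gene], gene))


patient_rankings = [["g1", "g2", "g3"], ["g2", "g1"], ["g1", "g3"]]
population_ranking = condorcet_rank(patient_rankings)
```

In this toy input g1 beats both other genes in head-to-head votes and g2 beats g3, so the population-level order is g1, g2, g3.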
Furthermore, among the Top 20 drivers in OVARIAN, MELANOMA and BLCA, SCS’s concordance is above 40%, while those of Frequency, DriverNet and OncoIMPACT are all around 30%, suggesting that SCS is generally more accurate and less likely to be influenced by high-frequency mutated passengers. On the other hand, to test the robustness of SCS, a sub-sampling approach is used to estimate the precision of the Top-50 driver genes on the ovarian cancer, melanoma, GBM, BLCA and PRAD datasets from TCGA. As shown in Figure 2b, SCS’s predictions are stable even with small sample sizes for the GBM, BLCA and PRAD datasets, although the precision becomes sensitive to sample size in the ovarian and melanoma datasets.

Fig. 2. Accuracy comparison of driver gene predictions according to the cancer census genes set. (a) Precision measured by the fraction of top ranked driver genes from SCS, OncoIMPACT, DriverNet and a frequency-based approach that are included in the CGC list. (b) Precision of SCS, evaluated against the CGC, as a function of the size of the dataset
To identify the driver genes at the population level, the main technical contributions of our SCS include: (1) choosing personalized DEGs and personalized candidate mutations, and extracting individual gene regulations from the expert-curated databases as the patient-specific mutated network (topology or edges) by the RWR algorithm; (2) identifying the confidence-weighted paths from the driver genes to the target genes on the patient-specific gene network by the concept ‘CTC’ and (3) statistically summarizing the personalized driver mutation prediction results at the population level by the Condorcet method. To assess the precision of SCS in more detail, as shown in Table 1, we evaluate the effect of each technical contribution on the performance of our SCS for each type of cancer dataset. We define a measurement of the performance for predicting the driver genes, i.e. P = mean (pk), where pk denotes the fraction of the top k predicted driver genes within the cancer census genes list. The mean fraction of the top k (k = 1, 2,…, 50) ranked predicted driver genes within the cancer census genes list is given in Table 1. It shows that the technical contributions of our SCS together lead to better performance compared with OncoIMPACT (Bertrand et al., 2015) and other comparable methods. In brief, SCS obtains the patient-specific mutated network at the patient level by the RWR algorithm and then summarizes the driver mutation prediction results at the population level by the Condorcet method. The results in Figure 2 and Table 1 both show that SCS recapitulates the known driver genes better than other methods, namely DriverNet and HotNet2, which work only at the population level; OncoIMPACT and DawnRank, which can work at the single-patient level; and a naive frequentist approach.

Table 1.
The performance of our SCS for each technical contribution in terms of the average precision at the population level in GBM, OVARIAN, MELANOMA, PRAD and BLCA

Method            GBM      OVARIAN   MELANOMA  PRAD     BLCA
SCS               0.4141   0.3234    0.4004    0.4106   0.4736
RWR + CTC         0.3333   0.1932    0.1434    0.3295   0.3676
CTC + Condorcet   0.3778   0.2594    0.1544    0.1662   0.1905
CTC               0.3621   0.2300    0.1056    0.1364   0.1821
OncoImpact        0.3613   0.2370    0.2047    0.3613   0.3429
DriverNet         0.2496   0.1688    0.3115    0.2496   0.2872
Frequency         0.1480   0.1896    0.0654    0.1480   0.1268
HotNet2           0.1866   0.1916    0.0769    0.1350   0.1960
DawnRank          0.3251   0.2161    0.1147    0.2439   0.2445
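The measurement P = mean (pk) reported in Table 1 can be sketched as follows. This is a minimal illustration under the definition given in the text; the function name and the toy gene labels are hypothetical, and the reference set stands in for the CGC list.

```python
def mean_precision_at_k(ranked_genes, reference_genes, max_k=50):
    """Mean over k = 1..max_k of p_k, the fraction of the top-k ranked
    genes that appear in the reference list (e.g. the CGC)."""
    reference = set(reference_genes)
    hits = 0
    precisions = []
    for k, gene in enumerate(ranked_genes[:max_k], start=1):
        if gene in reference:
            hits += 1
        precisions.append(hits / k)  # p_k for the current cutoff
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy ranking of four genes, two of which are in the reference list:
# p_1 = 1, p_2 = 1/2, p_3 = 2/3, p_4 = 1/2.
toy_score = mean_precision_at_k(["a", "b", "c", "d"], {"a", "c"}, max_k=4)
```

Because early cutoffs contribute as much as late ones, this measure rewards methods that place known drivers near the top of the list.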
Furthermore, to demonstrate the effect of the reference network on SCS, we also use the directed PPI network derived by Vinayagam et al. (2011) as the reference network for analysis. The directed human PPI network represents a global snapshot of the information flow in cell signaling. It consists of 6339 proteins and 34 813 directed edges, where the edge direction corresponds to the hierarchy of signal flow between the interacting proteins and the edge weight corresponds to the confidence of the predicted direction. The results of our SCS with this directed PPI network on the five cancer datasets are shown in Supplementary Figure S3 in Supplementary Note 4, from which we conclude that the directed PPI network is too incomplete for analyzing the controllability of the biological system. Therefore, a proper reference network is an important factor for our SCS.
Indeed, the incompleteness of molecular networks would increase false negatives in SCS’s predictions, and network inference approaches would be a helpful preprocessing step in this situation. In addition to its ability to identify known driver genes at the level of the entire patient population, a key feature of SCS is its ability to recognize the personalized-sample drivers in CGC, compared with a random selection scheme, as shown in Figure 3. The threshold of ±1 log2 fold-change used in this identification has been adopted in many research works to identify DEGs (Aytug et al., 2003; Bakken et al., 2016; Koren et al., 1989; Tothova et al., 2007). We have also considered other values to evaluate the effect of the threshold on the precision of the predicted driver genes. The mean fraction of the top k ranked predicted driver genes within the cancer census genes list is used to denote the performance for predicting the driver genes. The results are shown in Figure 3, from which we can see that the precision is robust to thresholds around ±1 for the five cancer datasets. To assess the significance of the ±1 log2 fold-change threshold in each patient, we have tried two evaluation strategies. In the first, we randomly choose the same number of personalized DEGs as the targets and then use our SCS with the randomized personalized DEGs to obtain the top ranked driver gene list at the population level; we then compute the mean fraction of the top k ranked genes within the cancer genes census list and obtain an enrichment P-value (the details are shown in Supplementary Note 6), whose results are shown in Figure 3b. In the second, we not only randomly choose the same number of target genes but also randomly choose the same number of personalized mutations; for each patient we then use the randomized mutations and the randomized targets as the input of our SCS and compute the P-value of the enrichment in the CGC list, whose results are also shown in Figure 3b.
From these results together in Figure 3, we can see that our threshold is indeed significant for prediction enrichment in the CGC list.

Fig. 3. The robustness and the significance of the threshold of fold change. (a) Choosing the absolute values of the threshold of log2 fold-change at (0.6, 0.8, 1, 1.2, 1.4), we find that the precision is robust for our threshold ±1; (b) The significance of the threshold ±1 fold change for selecting target genes randomly, and selecting mutation genes and target genes randomly. The enrichment score ESg is then defined as ESg = –log10 (P-value)

Now, we investigate in depth the commonalities and differences among the driver genes identified by our SCS and by other methods. For the different cancer datasets, 38% (19/50) of SCS’s candidate drivers in ovarian cancer, 24% (6/25) in melanoma, 32% (8/25) in GBM, 32% (8/25) in PRAD and 40% (10/25) in BLCA are also identified by other methods, as shown in Figure 4, Supplementary Figure S4 and Supplementary Table S1. These results indicate that SCS also predicts many well-established cancer-related mutation genes for the five cancers, including EGFR in ovarian cancer, NRAS in melanoma, MYC in GBM, CTNNB1 in PRAD, and EP300, CREBBP and RB1 in BLCA.
It is also observed that TP53, which has a high mutation frequency, is detected as a candidate driver gene in most cancer datasets except the melanoma dataset, which again suggests that high mutation frequency alone is not a sufficient criterion for detecting cancer drivers. In addition, the gene lists of our SCS, OncoIMPACT, DawnRank, HotNet2, DriverNet and a frequency-based method are shown in Supplementary Table S1, which indicates, for each gene, which of the six methods identify it.

Fig. 4. Comparison of candidate driver genes in ovarian cancer, MELANOMA cancer datasets and GBM datasets by various methods. SCS, OncoIMPACT, DriverNet and frequency-based method are applied to the GBM and ovarian cancer datasets from TCGA. For each tool, the Top-50 predicted driver genes are extracted. The candidate driver genes themselves are listed according to the tools by which they are identified (7: SCS, OncoIMPACT, DriverNet and Frequency; 6: SCS, OncoIMPACT and DriverNet; 4: SCS and DriverNet; 3: SCS, DriverNet and Frequency; 2: SCS and DriverNet; 1: SCS and Frequency; 0: SCS alone)

SCS also identified some driver genes that may not have been classified as drivers by other computational methods.
About one-half (12/25) and four-fifths (19/25) of SCS’s candidate drivers are not identified by other computational tools; among these, five mutation genes in ovarian cancer, six in melanoma, four in PRAD and three in BLCA are indeed included in CGC. In addition, we compare the mutation frequency of the driver genes determined by SCS with the overall mean mutation frequency of all genes among all samples. From Figure 5a, we can see that the mean mutation frequency of our predicted Top-50 driver genes is higher than the overall mean mutation frequency of all genes in the five benchmark cancer datasets. In Figure 5b, we list the fraction of genes whose mutation frequency is higher than the overall mean mutation frequency of all genes and the fraction of genes whose mutation frequency is lower than it. Interestingly, most of the predicted driver genes are mutated more frequently than the overall mean in the ovarian cancer and MELANOMA datasets, while fewer of the predicted driver genes are mutated more frequently than the overall mean in the GBM, PRAD and BLCA datasets. These results demonstrate again that high mutation frequency alone is not a sufficient criterion for detecting cancer drivers, and that relying on it would introduce more false negatives in identification.

Fig. 5. Comparison of the mutated frequency of candidate driver genes with the whole mean mutated frequency in ovarian cancer, MELANOMA cancer, GBM, PRAD and BLCA cancer datasets. (a) The figure shows that the mutated frequency of the predicted driver genes of our SCS (red) for the five cancer datasets is higher than the whole mean mutated frequency (green); (b) The figure shows the fraction of genes whose mutated frequency is greater than the whole mean mutated frequency in the five cancer datasets.
The yellow color denotes the fraction of genes whose mutated frequency is higher than whole mean mutated frequency while the blue color denotes the fraction of genes whose mutated frequency is lower than whole mean mutated frequency (Color version of this figure is available at Bioinformatics online.)

3.2 SCS efficiently discovers personalized driver genes

Here, we demonstrate SCS’s ability to determine personalized and rare driver genes. The main aspect that distinguishes SCS from existing methods is its ability to discover rare or even personalized-sample driver genes. Even if a gene is altered in only a single patient, SCS is capable of evaluating the impact potential of that gene alteration. In our case, a gene is considered to be a rare driver if it is labeled as significant among the impactful population drivers by the Condorcet method in the above section and is also mutated in only a small number of patients (≤5%). We selected the genes that fit these criteria to discover potential personalized driver genes.
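The rare-driver criterion above amounts to a simple filter, sketched here under assumed inputs: a list of population-level drivers and, for each gene, the set of patients carrying a mutation in it. The function name, the data layout and the labels are hypothetical.

```python
def select_rare_drivers(population_drivers, mutated_in, n_patients, max_fraction=0.05):
    """Keep population-level drivers mutated in at most `max_fraction`
    of patients (and in at least one patient).

    population_drivers: genes labeled significant at the population level.
    mutated_in:         dict gene -> set of patient IDs with a mutation.
    n_patients:         total number of patients in the cohort.
    """
    rare = []
    for gene in population_drivers:
        fraction = len(mutated_in.get(gene, ())) / n_patients
        if 0 < fraction <= max_fraction:
            rare.append(gene)
    return rare


# Toy cohort of 40 patients: geneX is mutated in 1 patient (2.5%),
# geneY in 10 patients (25%).
toy_mutations = {
    "geneX": {"p01"},
    "geneY": {"p%02d" % i for i in range(1, 11)},
}
rare = select_rare_drivers(["geneX", "geneY"], toy_mutations, n_patients=40)
```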
The selection criteria yielded 22 potential personalized driver genes in OVARIAN, 15 in MELANOMA, 35 in GBM, 49 in PRAD and 44 in BLCA (see the genes highlighted in yellow in Supplementary Table S2). Taking the potential personalized driver genes in OVARIAN as an example, we found that several of them are involved in important known cancer pathways. We used KEGG (Kanehisa et al., 2014) to map the 22 potential personalized driver genes to biological pathways (see details in Supplementary Table S3). The mutated gene EGFR belongs to multiple pathways that have a significant impact on cancer, including the ErbB signaling pathway, the TNF signaling pathway and the T cell receptor signaling pathway, all of which are common drug targets in OVARIAN (Charles et al., 2009; De et al., 2008; Wang et al., 2004), implying that EGFR could be targeted. Although EGFR is mutated in only 2.2152% of OVARIAN cancer samples and is ranked 6517th in terms of mutation frequency, it is ranked significantly higher than its average ranking in patients TCGA-09-0366-01 and TCGA-13-0717-01. We look further into the consensus modules of EGFR obtained by our SCS in patients TCGA-09-0366-01 and TCGA-13-0717-01 (see details in Supplementary Fig. S5 and Supplementary Table S4). Mapping the corresponding module of each patient to the KEGG datasets (Kanehisa et al., 2014), we found that although both modules are enriched in the ErbB signaling pathway, they differ in their other enriched biological pathways: the insulin signaling pathway, Jak-STAT signaling pathway, T cell receptor signaling pathway, Chemokine signaling pathway and mTOR signaling pathway for TCGA-09-0366-01, and the Wnt signaling pathway and MAPK signaling pathway for TCGA-13-0717-01.
Among the consensus modules of EGFR in the above two patients, we also found 35 and 15 drug targets, respectively. By mapping these drug targets to the KEGG datasets (Kanehisa et al., 2014) again, we found that EGFR and TP53, as the common drug targets, can be affected by Cisplatin and Carboplatin, respectively; meanwhile, CTNNB1, ERBB2, INSR and KRAS can be targeted as the personalized drug targets for TCGA-09-0366-01, and NOS3 and TNFRSF1B can be targeted as the personalized drug targets for TCGA-13-0717-01. More details can be seen in Supplementary Table S4.

3.3 SCS improves tumor stratification

Given the personalized mutation genes, personalized DEGs and personalized network structure as input, the output of our SCS is the patient-specific driver mutation profile, in which the state of each gene is no longer binary on expression abundance but reflects its phenotypic impact on the DEGs. Patient-specific driver mutation profiles are promising biomarkers for tumor stratification (Bertrand et al., 2015; Chen et al., 2012; van’t Veer and Bernards, 2008; Zeng et al., 2015; Zeng et al., 2014) since, by definition, they are likely causative events for carcinogenesis and metastasis. Integrating the driver mutation signatures, the expression patterns and the gene network provides a comprehensive way to understand complex diseases in a multi-view manner (Shi et al., 2017). As a first pilot exploration of this concept, we also investigate SCS’s predictions for stratifying patients, especially with respect to their survival outcomes. The unsupervised clustering method SNF (Wang et al., 2014) is applied because it can integrate the predicted driver profiles and the expression data. Evaluations of survival outcomes for patients in these clusters by Kaplan–Meier statistics suggest that the patient clusters have significant prognostic value for survival analysis [Fig. 6(a–c)].
With this stratification strategy, as shown in Figure 6(a–c), three subtypes are identified for ovarian cancer, five subtypes for melanoma and three subtypes for GBM (the results of other cancer datasets can be found in Supplementary Material). The P-values of 0.00257 for ovarian cancer, 0.000257 for melanoma and 0.00442 for GBM show significant survival differences among the identified subtypes.

Fig. 6. Tumor stratification using predicted driver gene profiles by SCS. (a–c) Survival profiles of ovarian, melanoma and glioblastoma cancer patients stratified by integrating the driver mutation profiles and expression data. (d) Bar plot showing the P-values (log rank test) for survival profiles of ovarian, melanoma and glioblastoma cancer patients using different gene signatures (e.g. Integration of driver mutation profiles and expression data, and Integration of mutation profiles and expression data)

In addition, we found that the integrative strategy of directly combining mutation data and expression data is not effective for patient survival prediction (Fig. 6d). In contrast, the integration of driver mutation profiles and expression profiles by SCS performs better, as shown in Figure 6d. Overall, these results highlight the promise of driver gene profiles for stratifying patients in an unsupervised fashion.
3.4 Subtype-specific driver genes and consensus modules from tumor stratification

To analyze in depth how the distribution of the predicted driver genes differs across cancer subtypes, the subtype-specific driver genes with high mutated frequency are defined as the genes satisfying the following requirements: (1) the gene is ranked in the Top 10 among the subtype samples in terms of the mutated frequency; (2) its alteration frequency in the subtype is greater than that in all samples. In Supplementary Note 5 and Supplementary Table S2, the summary of the subtype-specific genes and the mutated frequency of the Top-50 drivers for different subtypes are, respectively, listed for the OVARIAN, MELANOMA and GBM cancer datasets. Most sample clusters are characterized by a few key driver genes, which are predominantly mutated in tumors belonging to that sample cluster and serve to distinguish them from tumors in other clusters. As shown in Figure 7a, the mutated frequency distribution of the top drivers changes significantly between most pairs of subtypes. For example, TP53 is highly mutated in all subtypes of OVARIAN cancer because TP53 is always mutated in OVARIAN samples, as seen in Figure 7b; in contrast, in MELANOMA or GBM cancer, TP53 is predominantly mutated in tumors belonging to one cluster and serves to distinguish them from tumors in other clusters, as seen in Figure 7b.

Fig. 7. Different properties of the predicted Top-50 driver genes in different cancer subtypes. (a) Box plots showing the difference of the distribution of the mutation frequency of the ranked Top-50 drivers between the different subtypes in the ovarian (p12 = 4.2348e-4, p23 = 0.1546, p13 = 3.6276e-6), melanoma (p12 = 3.6276e-6, p13 = 3.6276e-6, p14 = 0.0951, p15 = 7.8398e-10; p23 = 0.6779, p24 = 3.6276e-6, p25 = 0.0171; p34 = 3.6276e-6, p35 = 0.0560; p45 = 7.8398e-10) and glioblastoma cancer (p12 = 2.7638e-5, p23 = 0.5077, p13 = 3.6276e-6).
(b) The statistical analysis of the mutation frequency associated with the different cancer subtypes of TP53. (c) The statistical analysis of the drug sensitivity associated with the different cancer subtypes of TP53

The consensus modules of the predicted driver genes also allow us to identify the drug sensitivity of these cancer drivers. For example, for the TP53 gene, we first identify the samples in each subtype that have TP53 as a driver and obtain the corresponding downstream control module. Then, for each of these samples, we compute the fraction of FDA drug targets within the module. Finally, we take the mean fraction within the subtype as the subtype drug sensitivity of the driver of interest. Based on the obtained drug sensitivity in the subtype, the subtype-specific driver genes with high drug sensitivity are defined as the genes satisfying the following requirements: (1) the gene is ranked in the Top 10 among the subtype samples in terms of the drug sensitivity; (2) its drug sensitivity in the subtype is greater than that in all samples.
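The subtype drug sensitivity defined above (the mean, over the samples of a subtype, of the fraction of drug targets within a driver's module) can be sketched as follows. The function name, the input layout and the toy labels are hypothetical, and the drug-target set stands in for the FDA drug target list.

```python
def subtype_drug_sensitivity(modules, subtype_of, drug_targets):
    """Mean fraction of drug targets within a driver's consensus module,
    computed separately for each subtype.

    modules:      dict sample -> set of genes in the driver's module
                  (only samples in which the gene acts as a driver).
    subtype_of:   dict sample -> subtype label.
    drug_targets: set of genes that are known drug targets.
    """
    fractions = {}
    for sample, module in modules.items():
        if not module:
            continue  # an empty module carries no information
        fraction = len(module & drug_targets) / len(module)
        fractions.setdefault(subtype_of[sample], []).append(fraction)
    return {subtype: sum(values) / len(values)
            for subtype, values in fractions.items()}


toy_modules = {
    "s1": {"a", "b", "c", "d"},  # 2 of 4 genes are drug targets
    "s2": {"a", "b"},            # 1 of 2
    "s3": {"c", "d", "e", "f"},  # 1 of 4
}
toy_subtypes = {"s1": 1, "s2": 1, "s3": 2}
sensitivity = subtype_drug_sensitivity(toy_modules, toy_subtypes, {"a", "c"})
```

Ranking drivers by this value within each subtype, and comparing it with the value over all samples, gives the two requirements stated in the text.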
In Supplementary Note 5 and Supplementary Table S5, the summary of the subtype-specific genes with high drug sensitivity and the drug sensitivity of the Top-50 drivers for different subtypes are, respectively, listed for the OVARIAN, MELANOMA and GBM cancer datasets. From Supplementary Note 5, we find that most subtype-specific driver genes with high mutated frequency differ from those with high drug sensitivity. This indicates that driver genes with high mutated frequency may not be driver genes with high drug sensitivity; thus a cancer driver gene is not always a suitable drug target, and its suitability depends on the individual. Even so, we found that in subtype 3 of GBM, TP53, as a subtype driver with high mutated frequency, also has significant drug sensitivity. It has been reported that the alteration leads to the activation of the PI3K/Akt and Ras/MAPK pathways (Mischel and Cloughesy, 2003), which provide targets for therapy and also support our computational results. Furthermore, the ability to generate personalized-sample driver consensus modules allows us to analyze the differences in enriched pathways between cancer subtypes. Again, the enriched pathways of TP53, a known driver gene, are used to characterize the identified subtypes in the glioblastoma and ovarian cancer datasets. With the module of driver gene TP53 for each sample, we first calculate the P-value of the enrichment for a pathway by the hypergeometric test (Rivals et al., 2007), and we regard a pathway as significantly enriched within the sample when the P-value is less than 0.005. Subsequently, according to the frequency with which the enriched pathway appears across the samples of a given subtype, the pathway is selected as subtype-specific when its frequency satisfies f ≥ 0.5 in the samples of that subtype.
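The two-stage selection above can be sketched with a one-sided hypergeometric tail probability followed by a frequency filter. This is a minimal illustration using only the Python standard library; the function names, the choice of gene universe and the toy inputs are assumptions, while the cutoffs 0.005 and 0.5 are taken from the text.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, pathway_size, module_size, overlap):
    """P(X >= overlap) when `module_size` genes are drawn without replacement
    from a universe containing `pathway_size` pathway genes."""
    upper = min(pathway_size, module_size)
    total = comb(universe_size, module_size)
    tail = sum(
        comb(pathway_size, x) * comb(universe_size - pathway_size, module_size - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


def subtype_specific_pathways(p_values_by_sample, alpha=0.005, min_frequency=0.5):
    """p_values_by_sample: list of dicts (one per sample of a subtype),
    each mapping pathway name -> enrichment P-value."""
    counts = {}
    for p_values in p_values_by_sample:
        for pathway, p in p_values.items():
            if p < alpha:
                counts[pathway] = counts.get(pathway, 0) + 1
    n_samples = len(p_values_by_sample)
    return sorted(pw for pw, c in counts.items() if c / n_samples >= min_frequency)


# Worked example: 10 genes, 4 in the pathway, module of 3 genes, 2 overlap.
# P = (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40/120 = 1/3.
p_toy = hypergeom_enrichment_p(10, 4, 3, 2)

toy_subtype = [{"pwA": 0.001, "pwB": 0.2}, {"pwA": 0.004, "pwB": 0.001}, {"pwA": 0.3}]
selected = subtype_specific_pathways(toy_subtype)
```

In the toy subtype, pwA is significant in two of three samples (f ≈ 0.67) and is selected, while pwB is significant in only one (f ≈ 0.33) and is not.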
From the results shown in Supplementary Figure S6 and Supplementary Table S6, TP53 is indeed mutated at high frequency, but it can have different regulatory effects on biological pathways in different subtypes, indicating tumor heterogeneity at the level of the biological system rather than at the level of single sequence mutations. Therefore, the identified driver mutation profiles can contribute to the improvement of tumor stratification with more biological significance and interpretability.

4 Discussion and conclusions

Cancer genomics is an area that has now rightly shifted toward integrative analysis, with driver gene identification being a key focus (Vogelstein et al., 2013). Especially as personalized medicine becomes a hot spot, identifying personalized-sample driver genes with predictive power, from personalized disease diagnosis and drug sensitivity analysis through to patient care, is attracting wide attention (Bertrand et al., 2015; Hou and Ma, 2014; Sheng et al., 2015; Wang et al., 2015). Accurate analysis of personalized genomic instability, e.g. somatic mutations, is necessary for translating the full benefit of cancer genome sequencing into the clinic. Computational models and methods are required to prioritize biologically active driver genes over inactive passengers based on cancer high-throughput sequencing data. However, few methods can efficiently distinguish the unique complement of genes that drive tumorigenesis in each patient. It is already recognized that patients with a given cancer are not all the same and that there may exist unique driver genes for each patient. Although existing computational methods have identified many common cancer drivers, it remains challenging to predict personalized driver genes and to assess rare and even personalized-sample mutations. Here, we proposed a new and efficient framework based on the structural control theory of complex networks, called SCS.
SCS seeks the minimal set of individual mutations that can control the maximal number of individual DEGs in the transition from the normal state to the disease state, and also identifies the consensus module with control paths from the candidate individual driver genes to the target DEGs. More importantly, the quantified confidence of the control paths in the consensus driver module can be used to evaluate the potential impact of candidate individual driver genes on the altered expression patterns. We applied SCS to multiple cancer datasets from TCGA. The validation results suggest that SCS can efficiently identify the known cancer driver genes for a large number of samples, and that SCS outperforms the competing approaches (OncoIMPACT, DawnRank, DriverNet, HotNet2 and the frequency-based approach) across distinct benchmarks. SCS is robust to noise and works well with small datasets, making it applicable to a wide array of sample collections.

It is widely accepted that most driver genes have low mutation frequency, which is called the ‘long tail of rarely mutated genes’ (Hofree et al., 2013; Leiserson et al., 2015; Wang et al., 2014; Zhang et al., 2016) and is disregarded by traditional frequency-based methods. In contrast, SCS delves deeper into the long tail of rarely mutated genes regardless of mutation frequency by combining the patient-specific network structure (patient-specific edges) and the personal biological information (personalized mutations and personalized DEGs). Results in Figure 2 and Table 1 show that SCS exhibited a more significant enrichment for mutations in the Cancer Census sets than the traditional frequency-based approaches. Therefore, our SCS approach provides an efficient tool to discover rare (infrequent) driver genes.
DriverNet and HotNet2 are methods for finding significantly mutated subnetworks in large and broad datasets of mutational frequency spectra; they ignore individual information and therefore perform worse. SCS, OncoIMPACT and DawnRank all assume that gene mutations can lead to transcriptomic changes. However, SCS handles patient-specific network (topology) information differently from OncoIMPACT and DawnRank. Given the mutations in a patient, OncoIMPACT relates a gene in the patient to mutations by using a statistical permutation-based model, whose parameters are determined with a grid search on the whole gene network common to all patients, rather than on each patient-specific network. DawnRank, on the other hand, adopts the PageRank algorithm to assess the impact scores of patient-specific mutations, also on the whole gene network. Therefore, these two models identify the modules of personalized mutations by determining the model parameters directly on the whole gene network, which is inappropriate because it ignores patient-specific network (topology) information.

In other words, SCS has two advantages over OncoIMPACT and DawnRank. Firstly, OncoIMPACT and DawnRank apply their search techniques to the whole gene network common to all patients to obtain the subnetwork of mutations for one patient. The overall network (topology) information used in OncoIMPACT and DawnRank does not consider individual information and is not patient-specific. In contrast, SCS runs the RWR algorithm in each sample to search for the patient-specific mutated network, which consists of the frequently interrupted interactions among the mutated genes in the human interactome.
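For readers unfamiliar with RWR, the following Python sketch shows a generic random walk with restart on a small undirected graph, seeded at one node. It is a textbook formulation and not the SCS implementation: the restart probability of 0.5, the adjacency-list representation, the convergence tolerance and the toy graph are all assumptions chosen for illustration, and the way SCS turns RWR scores into a patient-specific mutated network is not reproduced here.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    adj maps every node to a list of its neighbours (each neighbour must also
    be a key of adj); seeds is a non-empty set of start nodes, e.g. the genes
    mutated in one patient.
    """
    nodes = sorted(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            neighbours = adj[u]
            if not neighbours:
                # An isolated node keeps its own walking mass.
                nxt[u] += (1 - restart) * p[u]
                continue
            share = (1 - restart) * p[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        diff = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if diff < tol:
            break
    return p

# Toy path graph A - B - C with a single seed at A (hypothetical node names).
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
scores = random_walk_with_restart(graph, {"A"})
print(scores)
```

Nodes close to the seed receive higher stationary scores than distant ones, which is the property that makes RWR useful for prioritizing the network neighbourhood of a patient's mutated genes.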
RWR has been proven to be sensitive in identifying disease candidate genes and has been successfully applied in disease-phenotype analyses (Jia and Zhao, 2014; Li and Patra, 2010; Luo et al., 2017). SCS then applies our CTC, whose parameters are determined in the patient-specific mutated network instead of the whole network, to identify the personalized driver genes. Secondly, the impact score for mutations in OncoIMPACT and DawnRank is the number of related DEGs, whereas the impact score in SCS is the sum of the confidence control weights within the modules, which offers a refined approach with more biological significance by considering all possible paths between mutations and DEGs. Compared with OncoIMPACT and DawnRank, the main difference in handling the patient-specific network is that SCS uses both patient-specific edge and node information, whereas OncoIMPACT and DawnRank use only patient-specific node information. As shown in Table 1, SCS with the RWR algorithm outperformed OncoIMPACT and DawnRank in predicting driver genes at the population level by considering the patient-specific mutated network.

The important contribution of SCS is that we apply network control theory to the field of driver gene discovery by combining personalized mutations, personalized DEGs and a personalized network (topology or edges). This is the first integrative model to combine network control theory and individual multi-layer biological data for identifying personalized driver genes. In addition, the personalized-sample driver gene predictions of SCS revealed new biological insights into tumor stratification and prognosis analysis. Tumors can be efficiently stratified into molecularly determined subgroups through the integration of the predicted driver mutation profiles and gene expression profiles.
Such patient subgroups exhibit significantly different survival outcomes, establishing the clinical relevance of the stratification derived from the predictions of SCS. More importantly, for the same driver genes, the subtype-specific pathways enriched by driver genes and the corresponding consensus modules can be screened and provide more biological evidence for the molecular subtypes.

Notably, it has been widely recognized that genes interact with each other in a network, and that a complex disease or phenotypic change in a person usually results not from changes in individual genes or molecules but from changes in their biological system or network. With rapid advances in high-throughput techniques, biological networks based on omics data have become powerful resources and have been successfully applied to many fields of biology and medicine, such as driver gene discovery and drug target identification (Hofree et al., 2013; Leiserson et al., 2015; Wang et al., 2014; Zhang et al., 2016). Generally, the progression of a complex disease such as cancer in each patient can be viewed as a state transition of the corresponding biological system or network from a normal state to a disease state, resulting from the gradual accumulation of multiple driver mutations. From the perspective of systems control, such driver mutations can be viewed as controllers, which drive the biological system or network of a patient from a normal state to a disease state. Thus, a natural framework is to assess the impact of candidate driver genes (genomic and epigenomic data) on their gene interaction network by associating mutations (as controllers) with DEGs (Bertrand et al., 2015; Hou and Ma, 2014; Leiserson et al., 2015; Zhang et al., 2016). We assume that there is a specific gene network in each patient, i.e. a patient-specific network, due to different personalized features and further heterogeneous disease features at both the genomic and epigenomic levels.
Therefore, as an important step of SCS, we must identify the patient-specific mutated network by using the RWR technique based on the observed data with prior network information.

In summary, SCS uses an innovative method to prioritize cancer driver mutation profiles from a network controllability perspective, and provides dramatic improvements over existing methods. SCS not only can help us to discover personalized causal mutations among those obscured by tumor heterogeneity, but also methodologically bridges the traditional structural control methods of complex networks and genomics research. However, some open questions remain for SCS and similar studies: (1) SCS depends on the reference network, whose structural incompleteness would increase false negatives in its predictions, and (2) the assumption that drivers impact gene expression patterns could be extended to impacts on other omics patterns.

Acknowledgement

The authors thank Professor Fang-Xiang Wu from University of Saskatchewan for valuable comments.

Funding

This paper was supported by National Key R&D Program (2017YFA0505500), Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB13040700), the National Natural Science Foundation of China (61473232, 91430111, 91439103, 91529303, 31771476, 81471047, 31200987 and 61170134), National Key R&D Program (Special Project on Precision Medicine) (2016YFC0903400) and Natural Science Foundation of Shanghai (17ZR1446100).

Conflict of Interest: none declared.

References

Aytug S. et al. (2003) Impaired IRS-1/PI3-kinase signaling in patients with HCV: a mechanism for increased prevalence of type 2 diabetes. Hepatology, 38, 1384–1392.
Bakken T.E. et al. (2016) Comprehensive transcriptional map of primate brain development. Nature, 535, 367.
Bashashati A. et al. (2012) DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome Biol., 13, R124.
Becskei A., Serrano L. (2000) Engineering stability in gene networks by autoregulation. Nature, 405, 590–593.
Bertrand D. et al. (2015) Patient-specific driver gene prediction and risk assessment through integrated network analysis of cancer omics profiles. Nucleic Acids Res., 43, e44.
Carter H. et al. (2009) Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res., 69, 6660.
Charles K.A. et al. (2009) The tumor-promoting actions of TNF-α involve TNFR1 and IL-17 in ovarian cancer in mice and humans. J. Clin. Investig., 119, 3011.
Chen L. et al. (2012) Detecting early-warning signals for sudden deterioration of complex diseases by dynamical network biomarkers. Sci. Rep., 2, 7391–7342.
Chin L. et al. (2011) Cancer genomics: from discovery science to personalized medicine. Nat. Med., 17, 297–303.
Ciriello G. et al. (2012) Mutual exclusivity analysis identifies oncogenic network modules. Genome Res., 22, 398–406.
Cowen L. et al. (2017) Network propagation: a universal amplifier of genetic associations. Nat. Rev. Genetics, 18, 551–562.
Croft D. et al. (2010) Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res., gkq1018.
De G.P. et al. (2008) The ErbB signalling pathway: protein expression and prognostic value in epithelial ovarian cancer. British J. Cancer, 99, 341–349.
Futreal P.A. et al. (2004) A census of human cancer genes. Nat. Rev. Cancer, 4, 177–183.
Gao J. et al. (2014) Target control of complex networks. Nat. Commun., 5, 5415.
Greenman C. et al. (2008) Patterns of somatic mutation in human cancer genomes. Nature, 6, 153–158.
Guo W.-F. et al. (2017) Constrained target controllability of complex networks. J. Stat. Mech., 2017, 063402.
Haber D.A., Settleman J. (2007) Cancer: drivers and passengers. Nature, 446, 145–146.
Hofree M. et al. (2013) Network-based stratification of tumor mutations. Nat. Methods, 10, 1108–1115.
Hou J.P., Ma J. (2014) DawnRank: discovering personalized driver genes in cancer. Genome Med., 6, 56.
Jia P., Zhao Z. (2014) VarWalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data. PLoS Comput. Biol., 10, e1003460.
Kanehisa M. et al. (2011) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res., 40 (D1), D109–D114.
Kanehisa M. et al. (2014) Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res., 42, 199–205.
Kang H. et al. (2015) Inferring sequential order of somatic mutations during tumorgenesis based on Markov chain model. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 12, 1094.
Koren H.S. et al. (1989) Ozone-induced inflammation in the lower airways of human subjects. Am. Rev. Respiratory Dis., 139, 407–415.
Kumar R.D. et al. (2016) Unsupervised detection of cancer driver mutations with parsimony-guided learning. Nat. Genetics, 48, 1288.
Leiserson M.D. et al. (2015) Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genetics, 47, 106–114.
Li Y., Patra J.C. (2010) Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network. Bioinformatics, 26, 1219–1224.
Lin C.T. (1974) Structural controllability. IEEE Trans. Automatic Control, 19, 201–208.
Liu Y.-Y. et al. (2011) Controllability of complex networks. Nature, 473, 167–173.
Liu X. et al. (2016) Personalized characterization of diseases using sample-specific networks. Nucleic Acids Res., 44, e164.
Luo Y. et al. (2017) A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat. Commun., 8, 573.
Mao Y. et al. (2013) CanDrA: cancer-specific driver missense mutation annotation with optimized features. PLoS One, 8, e77945.
Mischel P.S., Cloughesy T.F. (2003) Targeted molecular therapy of GBM. Brain Pathol., 13, 52.
Nemhauser G.L., Wolsey L.A. (1988) Integer and Combinatorial Optimization. Wiley, New York.
Pihur V. et al. (2008) Finding common genes in multiple cancer types through meta-analysis of microarray experiments: a rank aggregation approach. Genomics, 92, 400–403.
Rivals I. et al. (2007) Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics, 23, 401–407.
Schaefer C.F. et al. (2009) PID: the pathway interaction database. Nucleic Acids Res., 37, D674–D679.
Schilsky R.L. (2010) Personalized medicine in oncology: the future is now. Nat. Rev. Drug Discov., 9, 363–366.
Sheng J. et al. (2015) Optimal drug prediction from personal genomics profiles. IEEE J. Biomed. Health Inform., 19, 1264–1270.
Shi Q. et al. (2017) Pattern fusion analysis by adaptive alignment of multiple heterogeneous omics data. Bioinformatics, 33, 2706–2714.
Suo C. et al. (2015) Integration of somatic mutation, expression and functional data reveals potential driver genes predictive of breast cancer survival. Bioinformatics, 31, 2607–2613.
Tothova Z. et al. (2007) FoxOs are critical mediators of hematopoietic stem cell resistance to physiologic oxidative stress. Cell, 128, 325–339.
van ‘t Veer L.J., Bernards R. (2008) Enabling personalized cancer medicine through analysis of gene-expression patterns. Nature, 452, 564–570.
Vandin F. et al. (2012) De novo discovery of mutated driver pathways in cancer. Genome Res., 22, 375–385.
Vinayagam A. et al. (2011) A directed protein interaction network for investigating intracellular signal transduction. Sci. Signal, 4, rs8.
Vinayagam A. et al. (2016) Controllability analysis of the directed human protein interaction network identifies disease genes and drug targets. Proc. Natl. Acad. Sci. USA, 113, 4976–4981.
Vogelstein B. et al. (2013) Cancer genome landscapes. Science, 339, 1546–1558.
Wang H. et al. (2004) Ovarian carcinoma cells inhibit T cell proliferation: suppression of IL-2 receptor β and γ expression and their JAK-STAT signaling pathway. Life Sci., 74, 1739–1749.
Wang B. et al. (2014) Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods, 11, 333–337.
Wang L. et al. (2015) A computational method for clinically relevant cancer stratification and driver mutation module discovery using personal genomics profiles. BMC Genomics, 16, S6.
Wei X. et al. (2011) Exome sequencing identifies GRIN2A as frequently mutated in melanoma. Nat. Genetics, 43, 442–446.
Wolsey L.A. (1998) Integer Programming. Wiley, New York.
Wu G. et al. (2010) A human functional protein interaction network and its application to cancer data analysis. Genome Biol., 11, R53.
Wu F.X. et al. (2014) Transittability of complex networks and its applications to regulatory biomolecular networks. Sci. Rep., 4, 4819.
Yan G. et al. (2017) Network control principles predict neuron function in the Caenorhabditis elegans connectome. Nature, 550, 519.
Yu X. et al. (2017) Individual-specific edge-network analysis for disease prediction. Nucleic Acids Res., 45, e170.
Zeng T. et al. (2014) Edge biomarkers for classification and prediction of phenotypes. Sci. China Life Sci., 57, 1103–1114.
Zeng T. et al. (2015) Big-data-based edge biomarkers: study on dynamical drug sensitivity and resistance in individuals. Brief. Bioinformatics, 17, 863–874.
Zhang S.-Y. et al. (2016) m6A-Driver: identifying context-specific mRNA m6A methylation-driven gene interaction networks. PLoS Comput. Biol., 12, e1005287.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Bioinformatics, Oxford University Press

Published: Jan 10, 2018
