A novel network regularized matrix decomposition method to detect mutated cancer genes in tumour samples with inter-patient heterogeneity

A novel network regularized matrix decomposition method to detect mutated cancer genes in tumour... www.nature.com/scientificreports OPEN A novel network regularized matrix decomposition method to detect mutated cancer genes in Received: 27 January 2017 tumour samples with inter-patient Accepted: 20 April 2017 Published: xx xx xxxx heterogeneity 1 1,2 1,2 Jianing Xi , Ao Li & Minghui Wang Inter-patient heterogeneity is a major challenge for mutated cancer genes detection which is crucial to advance cancer diagnostics and therapeutics. To detect mutated cancer genes in heterogeneous tumour samples, a prominent strategy is to determine whether the genes are recurrently mutated in their interaction network context. However, recent studies show that some cancer genes in different perturbed pathways are mutated in different subsets of samples. Subsequently, these genes may not display significant mutational recurrence and thus remain undiscovered even in consideration of network information. We develop a novel method called mCGfinder to efficiently detect mutated cancer genes in tumour samples with inter-patient heterogeneity. Based on matrix decomposition framework incorporated with gene interaction network information, mCGfinder can successfully measure the significance of mutational recurrence of genes in a subset of samples. When applying mCGfinder on TCGA somatic mutation datasets of five types of cancers, we find that the genes detected by mCGfinder are significantly enriched for known cancer genes, and yield substantially smaller p-values than other existing methods. All the results demonstrate that mCGfinder is an efficient method in detecting mutated cancer genes. Next generation sequencing (NGS) technology has revolutionized the detection of somatic mutations in cancer 1–5 genomics in recent years . With NGS technique, large numbers of tumour samples have been sequenced in pro- 6 7–10 jects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) . es Th e projects provide excellent opportunities to find mutated cancer genes from a large cohort of tumour sam- 11–13 ples, which can help differentiate functionally related driver mutations from passenger mutations . A common 14, 15 strategy to detect mutated cancer genes is to detect genes with significant mutational recurrence . Although some cancer genes show high mutation frequencies (such as TP53 or KRAS, both well-known cancer genes), previous studies demonstrate that extensive inter-patient heterogeneity is present in various types of cancers 16–18 and some cancer genes are mutated in a small number of samples . es Th e mutated cancer genes are not likely to display significant mutational recurrence due to inter-patient heterogeneity, and consequently they may be 13, 17–19 underestimated by the frequency-based methods . A prominent explanation of inter-patient heterogeneity is that the behavior of key pathways of tumour sam- ples is perturbed by mutated cancer genes, and only a subset of genes in these pathways are mutated in a given 17–19 sample . Subsequently, many recent approaches exploit large scale gene interaction network as an additional 17–21 source to identify cancer genes mutated in perturbed pathways . Considering both mutation frequencies of 22 23 24 25–27 genes and information from interaction network such as iRefIndex , HPRD , STRING and others , these approaches detect mutated cancer genes by determining whether the investigated genes are recurrently mutated 17, 18 19 in their network context. For example, HotNet and HotNet2 propagate the “heat” of mutation frequencies School of Information Science and Technology, University of Science and Technology of China, Hefei, AH230027, China. Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, AH230027, China. Jianing Xi and Ao Li contributed equally to this work. Correspondence and requests for materials should be addressed to A.L. (email: aoli@ustc.edu.cn) Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 1 www.nature.com/scientificreports/ of genes through the network and select genes with significantly high “heat” scores as mutated cancer genes. ReMIC detects genes with mutational recurrence in their network context through a diffusion graph kernel strategy. In these network-based approaches, mutated cancer genes are determined according to both their muta- tional recurrence and the mutational influence from their network context. Despite the success achieved by the aforementioned approaches, another important aspect contributing to inter-patient heterogeneity is that some cancer genes in different perturbed pathways are mutated in different 28–31 subsets of samples, which has been observed in recent studies . For example, transcriptional abnormalities of 28–30 some genes in different pathways are found in different subsets of samples . Moreover, another study shows that in multiple types of cancers, somatic mutations of some cancer genes in various perturbed subnetworks are observed in distinct subgroups, suggesting that cancer genes in die ff rent pathways may be mutated in die ff rent subsets of samples . If some mutated cancer genes are associated with only a subset of samples, these genes may not exhibit significant mutational recurrence in all samples even in consideration of the mutational influence from their network context. Accordingly, these cancer genes are likely to be underestimated by the existing meth- ods, as these methods are not specially designed for cancer gene detection under this scenario. Identifying abnormal genes in a subset of samples from cancer data with inter-patient heterogeneity is a crit- 32–35 ical problem in bioinformatics, and therefore has been studied in many previous researches . To tackle this 36–40 problem, methods based on matrix decomposition framework have been introduced . es Th e methods decom- pose the cancer data matrix into different components, which indicate different subsets of samples and the related abnormal genes. Nevertheless, to the best of our knowledge, these matrix decomposition based methods cannot efficiently incorporate information from network context. Therefore, to capture the mutated cancer genes in per - turbed pathways associated with only a subset of samples, it is an urgent need to establish an integrated method that can both incorporate gene interaction network information and measure the significance of mutational recurrence of genes in a subset of samples. In this study, we propose a novel method called mCGfinder, to detect mutated cancer genes in tumour samples with inter-patient heterogeneity. Based on matrix decomposition framework, mCGfinder can successfully meas- ure the significance of mutational recurrence of genes in a subset of samples instead of in all samples. Meanwhile, we introduce graph Laplacian regularization into mCGfinder, which can efficiently measure the mutational influence from the network neighbors of the investigated genes. When applying mCGfinder on TCGA somatic mutation datasets of five types of cancers, we find that the genes detected by mCGfinder are significantly enriched for known cancer genes. Notably, mCGfinder yields substantially smaller p-values (e.g., p-value = 1.24e–17 for breast cancer) than other existing network-based approaches across all investigated cancers. Moreover, we observe that high percentages of known cancer genes are included in the top ranked genes detected by mCG- finder. All the results indicate the efficiency of mCGfinder in detecting mutated cancer genes in heterogeneous tumour samples. Results Overview of mCGfinder. To detect mutated cancer genes from somatic mutation data of inter-patient het- erogeneous cancers, mCGfinder involves mainly three steps (Fig.  1). In the first step, mCGfinder decomposes the mutation data matrix of heterogeneous tumour samples into several components, and use the summation of these components to approximate the mutation matrix. Each component obtained by mCGfinder is the outer product of sample indicator vector and gene score vector, indicating a subset of samples and the mutational recurrence of genes related to these samples respectively. At the same time, we also use graph Laplacian regularization to incorporate information of gene interaction network into mCGfinder. In the second step, we apply permutation test and false discovery rate (FDR) control on gene score vectors of every components, and obtain the FDR q-values of all investigated genes. In the third step, mutated cancer genes are selected with FDR q-values less than 29, 42 the default significance threshold 0.05 . The code of mCGfinder can be freely accessed at https://github.com/ USTC-HIlab/mCGfinder . Comparison analysis. For the analysis of mCGfinder in mutated cancer gene detection, we employ TCGA somatic mutation data of five types of cancers in this study, including 776 breast invasive carcinoma (BRCA) sam- 29 30 43 ples , 238 bladder urothelial carcinoma (BLCA) samples , 291 glioblastoma multiforme (GBM) samples , 509 44 45 head and neck squamous cell carcinoma (HNSC) samples and 197 acute myeloid leukemia (LAML) samples . 19 20 The performance of mCGfinder is compared against two existing methods, HotNet2 and ReMIC . In mCG- finder, HotNet2 and ReMIC, we use a highly curated gene interaction network iRefIndex as the network infor- 19, 20 mation. In the comparison study, mCGfinder, HotNet2 and ReMIC are configured by their default settings (details in Supplementary materials). An overview of the mutated cancer genes detected by mCGfinder, HotNet2 and ReMIC is illustrated as Venn diagrams (Fig. 2 and Supplementary Fig. S5A). For all the five types of cancers, there is a high concordance between the results of mCGfinder and the results of the other two methods. Among the genes detected by mCGfinder, the percentages of genes that are also detected by at least one of the other meth- ods range from 36.6% (BRCA) to 84.3% (LAML) across the five types of cancers. In this study, we apply Fisher’s exact test on the detection results to evaluate whether the detected genes of the three methods are significantly enriched for known mutated cancer genes in Cancer Gene Census (CGC) , which is a highly curated database of cancer genes. For all the five types of cancers, the results of mCGfinder are highly enriched for CGC cancer genes (Fig. 2), and yield the most significant p-values among the three investi- gated methods. Taking BRCA as an example, HotNet2 and ReMIC obtain p-values of 1.08e-05 (8 CGC genes) and 1.04e-03 (5 CGC genes) respectively, which suggest that these results are significantly different than random selection. In comparison, mCGfinder achieves a p-value of 1.24e-17 and captures 18 CGC breast cancer genes. Notably, there are 8 CGC genes (AKT1, BRCA2, CASP8, CTCF, MAP3K1, MAP3K13, NCOR1 and TBX3) cap- tured by mCGfinder but not by HotNet2 or ReMIC. Literature survey shows that AKT1 gene is implicated as Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 2 www.nature.com/scientificreports/ Figure 1. Schematic diagram of mCGfinder, which is a matrix decomposition method integrated with network information. It decomposes the mutation matrix as the matrix multiplication of the sample indicator vectors and the transpose of gene score vectors. The different components in the results of mCGfinder are regarded as the outer products of different sample indicator vectors and their related gene score vectors, where the summation of the components is an approximation of the mutation matrix. Graph Laplacian regularization is used to incorporate information of gene interaction network into mCGfinder. Aer t ft he matrix decomposition procedure, the mutational recurrence of genes in different subsets of samples can be measured from the gene score vectors of the related component, and the related subsets samples of the component are indicated by the sample indicator vectors. Through permutation test and false discovery rate (FDR) control, mutated cancer gene candidates can be identified by thresholding FDR q-values of the genes. significantly mutated gene in breast cancer in a previous study , and mutations of BRCA2 gene are reported to be involved in the primary events of breast carcinogenesis . In the three other types of cancers, mCGfinder also provides high enrichment for CGC genes, with associated p-values of 1.91e-08 in BLCA (8 CGC genes), 4.91e-14 in GBM (12 CGC genes), 1.46e-08 in HNSC (9 CGC genes) and 5.57e-16 in LAML (10 CGC genes). Interestingly, for all the investigated cancers, the genes detected by both HotNet2 and ReMIC but not by mCGfinder include no known CGC gene (Fig. 2). Taking BLCA as an example, there is no CGC gene among the 242 genes detected by both HotNet2 and ReMIC but not by mCGfinder. The full lists of CGC genes detected by mCGfinder on the five types of cancers are provided in Supplementary Table S1. To give a more comprehensive assessment on the detection results, we also use another independent curated cancer gene database, Integrative Onco Genomics (IntOGen) . In the enrichment analysis for known cancer genes reported in IntOGen, mCGfinder demonstrates comparable or better performance than the two competing methods (Supplementary Figs S1 and S5A). In BRCA, the detection results of HotNet2 and ReMIC show good performance and contain 22 and 31 IntOGen breast cancer genes respectively. In comparison, mCGfinder suc- cessfully recovers 60 IntOGen breast cancer genes. The enrichment p-values of the results of mCGfinder, HotNet2 and ReMIC for IntOGen genes in BRCA are 1.97e-36, 1.74e-06 and 1.15e-16 respectively. For BLCA data, there are 31 and 44 IntOGen genes captured by HotNet2 (p-value = 4.16e-03) and ReMIC (p-value = 1.30e-10) Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 3 www.nature.com/scientificreports/ Figure 2. Venn diagrams of intersections between the genes detected by mCGfinder (red circle), HotNet2 (green circle) and ReMIC (blue circle) on TCGA somatic mutation datasets of BRCA (north-west panel), BLCA (north-east panel), GBM (south-west panel) and HNSC (south-east panel). The gray and black numbers in each region of the Venn diagrams indicate the number of detected genes and the number of genes also reported in Cancer Gene Census (CGC) respectively. The p-values next to the circles of the methods are calculated by Fisher’s exact test, representing the enrichment significance of the detection results for CGC annotated cancer genes. respectively. In comparison, mCGfinder predicts 56 IntOGen genes, yielding a p-value of 4.78e-34. e Th IntOGen genes detected by mCGfinder on the five types of cancers are listed in Supplementary Table  S2. Finally, we per- form cancer gene enrichment analysis by using the combined cancer gene lists of both CGC and IntOGen data- bases, and similar conclusion can be drawn from the results across all the five types of cancers (Supplementary Figs S1 and S5A). e Th CGC and IntOGen genes identified by mCGfinder but not by the other investigated meth- ods along with their functions are demonstrated in Supplementary Table S3. Ranking analysis. In addition to the statistical enrichment analysis, in order to comprehensively evaluate the performances of mCGfinder, we further use the results obtained by not only the default threshold but also 49–52 various thresholds by following previous studies . The gene ranking scores of different approaches are detailed in Supplementary materials. By raising the threshold and obtaining the percentages of known cancer genes falling under the category, we can evaluate the detection results of different methods comprehensively through rank cut- 49, 50 off curves . Here we use the rank cutoff curves as the evaluation metric for the top ranked genes detected by the investigated methods, which are drawn by listing the percentages of known cancer genes that are also included in the top ranked genes. As shown in Fig. 3A–D, the top ranked genes detected by mCGfinder contain consistently higher percentages of known CGC genes than the results of the other methods at various rank thresholds. Taking BRCA as an example, 3.5% of CGC breast cancer genes are included in the top 50 genes detected by HotNet2. In comparison, the top 50 genes identified by mCGfinder contain 31.0% of CGC breast cancer genes. When the rank threshold of genes raises to 100, the percentages of known cancer genes detected by HotNet2 and ReMIC also increase to 10.3% and 6.9% respectively. In comparison, there are 41.4% of known CGC genes included in the results of mCGfinder. Similarly, in the other types of cancers, the top ranked genes detected by mCGfinder also Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 4 www.nature.com/scientificreports/ Figure 3. Rank cutoff curves of top 100 candidates in mCGfinder (red line with circle markers), HotNet2 (green line with square markers) and ReMIC (blue line with triangle markers) results, describing the relation between various cutoffs and the fraction of known CGC cancer genes ranked above this cutoff in BRCA (A), BLCA (B), GBM (C) and HNSC (D). Cumulative fractions of known CGC cancer genes annotated by CGC within the top 100, 300, 500, 700 and 1000 genes in BRCA (E), BLCA (F), GBM (G) and HNSC (H). Results from all the assessments indicate the generally improved performance of mCGfinder over the competing methods. contain high fractions of known cancer genes. For example, there are 50.0%, 50.0% and 33.3% of known CGC genes included in the top 100 genes detected by mCGfinder on BLCA, GBM and HNSC data respectively. Next, we assess the performance of different methods with larger rank thresholds, and mCGfinder still com- pares favorably to the competing methods across all the five types of cancers (Fig.  3E–H and Supplementary Fig. S5C). When the rank threshold raises to 500, both HotNet2 and ReMIC demonstrate reasonable performance in BRCA and yield percentages 27.6% and 20.7% of CGC cancer genes respectively. In comparison, mCGfinder achieves a percentage of 62.7%. In BLCA, GBM and HNSC, more than half of the known CGC cancer genes are also included in the top 500 genes detected by mCGfinder respectively. We further assess the top ranked genes of the investigated methods by IntOGen gene lists and the combined gene lists of both the two databases. The results also show that mCGfinder achieves the highest percentages among the three investigated methods throughout the rank cutoff analysis (Supplementary Figs  S2, S3 and S5B,C). Moreover, by varying the rank thresholds and calculating the precisions and recalls, we draw the precision-recall curve (PR curve) of the results detected by the investigated methods as the assessment met- 51, 52 ric used in previous studies . For all the five types of cancers, when the known cancer genes in CGC are used as gold-standard, the PR curves of mCGfinder are clearly located over the curves of the other meth- ods (Supplementary Figs S4 and S5D). As the limited number of known cancer genes from CGC may lead to inaccurate performance, we further use known cancer genes annotated by IntOGen for evaluation (Fig. 4 and Supplementary Fig. S5D), in which the number of known breast cancer genes is largely increased. Taking BRCA as an example, when the recalls are fixed at 10.0%, the precisions of mCGfinder, HotNet2 and ReMIC are 33.9%, 4.3% and 11.9% respectively, which are consistently better than random selection. The area under the precision-recall curve of mCGfinder is also greater than the other methods (Supplementary Table  S4). In consist- ent with BRCA results, mCGfinder also gives the best performance among the detection results of the investigated methods on BLCA, GBM, HNSC and LAML data when evaluated by IntOGen. Similar conclusions can also be obtained from analysis of the detection results from the combined gene lists of the two databases (Supplementary Figs S4, S5D and Supplementary Table S4). Computational cost. In addition to the analysis of detection performance, we further examine the com- putational time of the three investigated methods. The experiments in this study are performed on a computer with Intel Xeon(R) CPU E5-2630 0 @ 2.30 GHz × 18 Processors and 64 GB of memory. For BRCA, BLCA, GBM, HNSC and LAML somatic mutation datasets with 12129 genes and 776, 238, 291, 509 and 197 samples, the running time of mCGfinder is 3–5 minutes in average, which is smaller than HotNet2 and ReMIC (Fig. 5). For example, in BRCA, HotNet2 takes around 24 minutes, and ReMIC takes around 21 minutes. In comparison, mCGfinder takes only around 5 minutes. In HNSC, mCGfinder, HotNet2 and ReMIC take around 4, 24 and 26 minutes respectively. Discussion Developing efficient methods to detect cancer genes from inter-patient heterogeneous tumour samples is an chal- lenging task, and a major obstacle is the fact that some cancer genes are mutated in perturbed pathways associated 28, 31 with only a subset of samples . Thus, these mutated cancer genes may not be significantly recurrent in all samples and remain undiscovered even when the mutations in their interaction network context are considered. In this Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 5 www.nature.com/scientificreports/ Figure 4. Precision-recall curves for the three methods on BRCA (north-west panel), BLCA (north-east panel), GBM (south-west panel) and HNSC (south-east panel) data, where red, green, blue and gray lines represent the curves of mCGfinder, HotNet2, ReMIC and random selection respectively. For each curve, the points indicate the precisions and recalls at different ranks in the prediction results. The precision is computed as the fraction of the top ranked genes that are known cancer genes, and the recall is computed as the fraction of known cancer genes in the top ranked genes. Figure 5. Running time comparison of mCGfinder (red bar), HotNet2 (green bar) and ReMIC (blue bar) on datasets of BRCA, BLCA, GBM, HNSC and LAML respectively. Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 6 www.nature.com/scientificreports/ paper, based on the combination of matrix decomposition framework and information from gene interaction net- work, we propose a novel method which is capable of detecting mutated cancer genes in a subset of samples. When applied on TCGA somatic mutation datasets of five types of cancers, mCGn fi der precisely recovers many known cancer genes. Our results also show that the performance of mCGfinder is not sensitive to the selection of the tuning parameter (Supplementary Fig. S6). Notably, mCGfinder achieves the highest enrichment for known genes among the investigated methods, suggesting that it is a powerful bioinformatics tool for mutated cancer gene detection. A significant distinction between mCGfinder and the existing network-based approaches for cancer gene detection is that mCGfinder decomposes the mutation data matrix of heterogeneous tumour samples into differ - ent components, and measures the mutational recurrence of genes in subsets of samples indicated by the com- ponents. Based on this design, mCGfinder greatly complements the detection results of the existing approaches. Nevertheless, it should be pointed out that the evaluation results are not sufficient to mean as a criticism of the other investigated methods. Instead, they show the difference between whether considering the mutated cancer genes in different subsets of samples or not. In the detection results of the investigated cancers, some CGC genes missed by mCGfinder are detected either by HotNet2 only or by ReMIC only (Fig.  2). For example, in BLCA results there is 1 CGC gene among the 1040 genes detected by HotNet2 only, and there is 1 CGC gene among the 653 genes detected by ReMIC only. These results suggest that it may be worth using both mCGfinder and the existing methods to maximize the detection rate of mutated cancer genes. Despite the promising results achieved by the purposed method, there are also some avenues for further inves- tigation. For example, our method is not designed to address the issue of intra-tumour heterogeneity , which cannot be represented by the input binary matrix. In consistent with HotNet2 and ReMIC which highly depend on gene interaction network, our method utilizes gene interaction network as an important information source for detecting mutated cancer genes. Therefore, mCGfinder is not yet applicable for genes that are not included in gene interaction network. Also, it is noteworthy that unlike previous approach , our method is not designed to stratify cancer samples and cannot incorporate biological knowledge of cancer subtypes . Meanwhile, the objective func- tion in mCGfinder is not guaranteed to be convex albeit a local optimum can be reached. Furthermore, a promis- ing expansion to the mCGfinder in future work would be to integrate information from not only gene interactions, but also different types of information such as copy number alternation, gene expression and DNA methylation, 51, 53, 54 which would oer a ff n opportunity to comprehensively understand cancer events from a multi-omics view . In summary, mCGfinder is a novel method to efficiently detect mutated cancer genes in tumour samples with inter-patient heterogeneity, which provides a more sophisticated view of cancer genomics from both the influence of interaction network context and mutational recurrence of genes in different subsets of samples. Altogether, mutational profile analysis from mCGfinder and further experimental follow-up may help take a step forward to a more comprehensive knowledge of the cancer genome. Materials and Methods TCGA somatic mutation data of cancers. We apply mCGfinder on TCGA somatic mutation data of v fi e types of cancers, BRCA, BLCA, GBM, HNSC and LAML (detailed information in Supplementary materials and Supplementary Table S5). For each type of cancer, the mutation data is a binary matrix X = (X ) where ij n×p the rows and columns of the mutation matrix denote the tumour samples (totally n samples) and the investigated genes (totally p genes) respectively. Each entry X of the matrix indicates the binary state of the gene, in which 1 ij 31, 55 represents the i-th sample contains a somatic mutation of the j-th gene, and 0 otherwise . Network regularized matrix decomposition. Based on matrix decomposition framework, mCGfinder decomposes the matrix X of somatic mutation data in heterogeneous tumour samples into different components, and the summation of these components can be regarded as an approximation of the mutation data matrix, i.e. Xs =+ g ε . ∑ r r (1) r =1 where s = () s and g = () g are the sample indicator vector and the gene score vector for the r-th com- rirn×1 p×1 rjr ponent. The ε is the residual matrix for the r-th component, and R is the total number of the components obtained by mCGfinder. The sample indicator vector s indicates the assignment of tumour samples to the r-th component, in which the coefficient s = 1 represents that the i-th samples are included in the component, and ir s = 0 otherwise. As for the gene score vector g of the r-th component, a higher value of the coefficient g of the ir r jr vector presents a larger potential of the j-th gene to be a mutated cancer gene. Note that the first component, which is the outer product of the two vectors , is the best rank-one approximation of the data sg = () sg 1 i1 np × 1 j1 matrix X. Thus, we can use the approximation to decompose the first component (S and g ) from the data matrix, 1 1 36, 37, 40 and obtain the remaining components through a component-by-component strategy . Also, to efficiently incorporate information from gene interaction network, we use graph Laplacian regularization on the gene score vector g . Subsequently, we construct an optimization problem for vector s and g to obtain the first component, 1 1 1 and the objective function is, T T min Xs −+ gg λ Lg 1 L 1 1 1 sg , 1 1 st ..s ∈. {0,1} (2) 2 n where ⋅ denotes the squared Frobenius norm of a matrix, and s ∈ {0,1} indicates that the coefficients in F 1 vector s can be either 1 or 0. The matrix L = () L is the Laplacian matrix of the gene interaction network, 1 ij pp × Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 7 www.nature.com/scientificreports/ which is calculated through the matrix subtraction . The matrix is the sym- () LD =− () () A () A ij pp ×× ij pp ij pp × ij pp × metric normalized adjacency matrix of the gene interaction network (see Supplementary materials for details of the normalization procedure), and the matrix is a diagonal matrix whose entries are the column sums of () D ij pp × matrix () A . ij pp × In the objective function (2), the first term is the summation of the residuals between the first component and the data matrix. When the first term is minimized, we can obtain a component that best fit the data matrix. The second term is the graph Laplacian term, which can be rewritten as p p p p T 2 gLg == ggL () gg −. A ∑∑ ∑∑ ij ij 1 1 ij 11 ij 11 i== 11 j i== 11 ji+ (3) Through the graph Laplacian term, we can successfully adopt the assumption that if the i-th gene and the j-th gene are connected in the gene interaction network (A > 0), the scores g and g of the two genes are also close ij i1 j1 to each other. The tuning parameter λ is used to balance the fitness of the model (first term) and the smoothness of the scores of connected genes (second term), which is set to 0.1 in this study. Accordingly, mCGfinder can efficiently measure the significance of mutational recurrence of genes in a subset of samples and incorporate information from network context at the same time. Iterative estimation procedure. To solve the optimization problem in (2), we employ an efficient iterative 36, 37, 40 procedure to estimate the two vector s and g alternatively . When the gene score vector g is fixed, the opti- 1 1 1 mization function to solve the coefficient s in the sample indicator vector s is formulated as below: i1 1 2 2 min( ss gX − 2) g ii 1 1 i 1 1 i1 st .. ss (1 −= )0,1 ∀= in ,, … , ii 11 (4) where the ⋅ denotes the squared L2-norm of a vector, and (·) indicates the i-th coefficients of a vector. Since the values of the coefficients in sample indicator vector are constrained to be binary, we introduce Boolean con - straint on coefficients in vector s . For the assignment of the i-th sample of the first component, the estimation of s in vector s can be calculated through Karush-Kuhn-Tucker (KKT) conditions, i1 1 1if(2) Xg ≥ g  i s =  i1 0otherwise. (5) Likewise, when the sample indicator vector s is x fi ed, the optimization function to solve the gene score vector g in optimization problem (2) is formulated as below: 2 T T T min2 sg gX −+ () sg λ gLg . () 1 1 L 2 1 1 11 1 (6) Similar to the derivation for the sample indicator vector above, the gene score vector can also be obtained through the KKT conditions: −1 2 T gs =+ IL λ () Xs , () 1 pL 1 2 (7) where I is a p × p identity matrix, and the symmetric matrix () sI + λ L (r = 1 in this case) is an invertible rp L matrix (see Supplementary materials for detailed proof ). Subsequently, the gene score vector and sample indica- tor vector in the first component can be iteratively estimated through alternating the two update rules (5) and (7) 36, 37, 40 until convergence . Algorithm 1 mCGfinder: iterative estimation procedure Input: mutation matrix X ; graph Laplacian matrix L; np × Output: sample indicator vector s ; gene score vector g . 1 1   1: set λ ← 0.1 −1 (0) (0) T   2: s ← 1 and g ←+ nIL λ X 1 1 n () pL () n   3: repeat −1 (1 kk + ) () T () k   4: g ←+ sI λ LX s 1 pL () 1 1 () 2   (1 kk + ) (1 ++ ) (1 k )   6:  s ← IX 2 gg − 1 [0,) ∞  1 1      6: k ← k + 1   7: until Convergence () ∞ () ∞   8: return s ← s and g ← g Note: 1 is an n × 1 vector with all coefficients being 1; Indicator function I (x) returns a logical vector if x ∈ A. n A i Algorithm 1.. e i Th terative estimation procedure of sample indicator vector and gene score vector in mCGfinder. Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 8 www.nature.com/scientificreports/ The algorithm of the estimation of the two vectors in the first component are summarized in Algorithm 1. Ae ft r convergence, the first component is obtained by matrix multiplication . To obtain the next component sg 36, 37, 40 (s and g ), we repeat the procedures in Algorithm 1 on the remaining samples . Subsequently, we can esti- 2 2 mate the r-th component ( s and g ) (r = 2, … , R) by decomposing the data matrix through the r r component-by-component strategy until all samples are assigned (details in Supplementary Fig. S7), and the number R is obtained by counting the components decomposed by mCGfinder. Significance test. To assess which of these mutated genes are statistically significant in a subset of samples, we implement significance test on the coefficients of the gene score vectors g (r = 1, …, R) in every 21 − components decomposed by mCGfinder. In brief, we define in (7) as the network influ- Xs () IL + λ rp L enced matrix. The coefficients of gene score vector g can be calculated by the summation of the entries of a subset of rows of the network influenced matrix X , where the rows are indicated by the sample indicator net 40, 57 vector s of the investigated component. We follow the procedure in previous studies and identify the genes of which the scores can disprove the null hypothesis that their values of the gene score vector coeffi- cients are only contributed by background mutations alone. Since the random background mutations could occur anywhere in the genome, the null distribution is modeled by recalculating the gene score vectors across all combinations of permutations of the network influenced matrix within samples. Detailed proce- dure for the significance test is provided in Supplementary materials. Since large numbers of permutations is usually time consuming, we instead use a semi-exact estimation approach proposed in previous 40, 57 approaches to estimate the distribution of scores and the corresponding p-values. To control the false discovery rates of the investigated genes, we apply the Benjamini-Hochberg FDR procedure on the p-values obtained from the significance test, and calculate the q-values of the investigated genes for each component. For a specific gene, we choose the most significant (smallest) q-values of the investigated gene among all components as the significance score of the gene. References 1. Schuster, S. C. Next-generation sequencing transforms today’s biology. Nature 200, 16–18 (2007). 2. Chiang, D. Y. et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nature methods 6, 99–103 (2009). 3. Xiong, M., Zhao, Z., Arnold, J. & Yu, F. Next-generation sequencing. BioMed Research International 2010 (2011). 4. Nijkamp, J. F. et al. De novo detection of copy number variation by co-assembly. Bioinformatics 28, 3195–3202 (2012). 5. Zhao, M., Wang, Q., Wang, Q., Jia, P. & Zhao, Z. Computational tools for copy number variation (cnv) detection using next- generation sequencing data: features and perspectives. BMC bioinformatics 14, 1 (2013). 6. Weinstein, J. N. et al. The cancer genome atlas pan-cancer analysis project. Nature genetics 45, 1113–1120 (2013). 7. Mardis, E. R. Genome sequencing and cancer. Current opinion in genetics & development 22, 245–250 (2012). 8. Watson, I. R., Takahashi, K., Futreal, P. A. & Chin, L. Emerging patterns of somatic mutations in cancer. Nature reviews Genetics 14, 703–718 (2013). 9. Vogelstein, B. et al. Cancer genome landscapes. science 339, 1546–1558 (2013). 10. Ding, L., Wendl, M. C., McMichael, J. F. & Raphael, B. J. Expanding the computational toolbox for mining cancer genomes. Nature Reviews Genetics 15, 556–570 (2014). 11. Stephens, P. J. et al. The landscape of cancer genes and mutational processes in breast cancer. Nature 486, 400–404 (2012). 12. Wendl, M. C. et al. Pathscan: a tool for discerning mutational significance in groups of putative cancer genes. Bioinformatics 27, 1595–1602 (2011). 13. Raphael, B. J., Dobson, J. R., Oesper, L. & Vandin, F. Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine. Genome medicine 6, 1 (2014). 14. Yuan, X., Zhang, J., Zhang, S., Yu, G. & Wang, Y. Comparative analysis of methods for identifying recurrent copy number alterations in cancer. PloS one 7, e52516 (2012). 15. Dees, N. D. et al. Music: identifying mutational significance in cancer genomes. Genome research 22, 1589–1598 (2012). 16. Sontrop, H. M., Verhaegh, W. F., Reinders, M. J. & Moerland, P. D. An evaluation protocol for subtype-specific breast cancer event prediction. PloS one 6, e21681 (2011). 17. Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for detecting significantly mutated pathways in cancer. Journal of Computational Biology 18, 507–522 (2011). 18. Vandin, F., Clay, P., Upfal, E. & Raphael, B. J. Discovery of mutated subnetworks associated with clinical data in cancer. In Pac Symp Biocomput 2012, 55–66 (2012). 19. Leiserson, M. D., Vandin, F., Wu, H.-T., Dobson, J. R. & Raphael, B. R. Pan-cancer identification of mutated pathways and protein complexes. Cancer Research 74, 5324–5324 (2014). 20. Babaei, S., Hulsman, M., Reinders, M. & de Ridder, J. Detecting recurrent gene mutation in interaction network context using multi- scale graph diffusion. BMC bioinformatics 14, 1 (2013). 21. Jia, P. & Zhao, Z. Varwalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data. PLoS Comput Biol 10, e1003460 (2014). 22. Razick, S., Magklaras, G. & Donaldson, I. M. irefindex: a consolidated protein interaction database with provenance. BMC bioinformatics 9, 1 (2008). 23. Prasad, T. K. et al. Human protein reference database-2009 update. Nucleic acids research 37, D767–D772 (2009). 24. Szklarczyk, D. et al. The string database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic acids research 39, D561–D568 (2011). 25. Lee, I., Blom, U. M., Wang, P. I., Shim, J. E. & Marcotte, E. M. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome research 21, 1109–1121 (2011). 26. Das, J. & Yu, H. Hint: High-quality protein interactomes and their applications in understanding human disease. BMC systems biology 6, 92 (2012). 27. Khurana, E., Fu, Y., Chen, J. & Gerstein, M. Interpretation of genomic variants using a unified biological network approach. PLoS Comput Biol 9, e1002886 (2013). 28. Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using paradigm. Bioinformatics 26, i237–i245 (2010). 29. Cancer Genome Atlas Network. et al. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012). Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 9 www.nature.com/scientificreports/ 30. Cancer Genome Atlas Research Network. et al. Comprehensive molecular characterization of urothelial bladder carcinoma. Nature 507, 315–322 (2014). 31. Hofree, M., Shen, J. P., Carter, H., Gross, A. & Ideker, T. Network-based stratification of tumor mutations. Nature methods 10, 1108–1115 (2013). 32. Cheng, Y. & Church, G. M. Biclustering of expression data. Ismb 8, 93–103 (2000). 33. Yang, J., Wang, H., Wang, W. & Yu, P. S. An improved biclustering method for analyzing gene expression profiles. International Journal on Artici fi al Intelligence Tools 14, 771–789 (2005). 34. Shabalin, A. A., Weigman, V. J., Perou, C. M. & Nobel, A. B. Finding large average submatrices in high dimensional data. The Annals of Applied Statistics 985–1012 (2009). 35. Oghabian, A., Kilpinen, S., Hautaniemi, S. & Czeizler, E. Biclustering methods: biological relevance and application in gene expression analysis. PloS one 9, e90801 (2014). 36. Lee, M., Shen, H., Huang, J. Z. & Marron, J. S. Biclustering via sparse singular value decomposition. Biometrics 66, 1087–1095 (2010). 37. Sill, M., Kaiser, S., Benner, A. & Kopp-Schneider, A. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics 27, 2089–2097 (2011). 38. Zhou, X., Yang, C., Wan, X., Zhao, H. & Yu, W. Multisample acgh data analysis via total variation and spectral regularization. IEEE/ ACM Transactions on Computational Biology and Bioinformatics 10, 230–235 (2013). 39. Zhou, X., Liu, J., Wan, X. & Yu, W. Piecewise-constant and low-rank approximation for identification of recurrent copy number variations. Bioinformatics 30, 1943–1949 (2014). 40. Xi, J. & Li, A. Discovering recurrent copy number aberrations in complex patterns via non-negative sparse singular value decomposition. IEEE/ACM Transactions on Computational Biology and Bioinformatics 13, 656–668 (2016). 41. Xie, B., Wang, M. & Tao, D. Toward the optimization of normalized graph laplacian. IEEE Transactions on Neural Networks 22, 660–666 (2011). 42. Cancer Genome Atlas Research Network. et al. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550 (2014). 43. McLendon, R. et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008). 44. Cancer Genome Atlas Network. et al. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517, 576–582 (2015). 45. Network, C. G. A. R. et al. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 2013, 2059–2074 (2013). 46. Futreal, P. A. et al. A census of human cancer genes. Nature Reviews Cancer 4, 177–183 (2004). 47. Weber, B., Brohm, M., Stec, I., Backe, J. & Caffier, H. A somatic truncating mutation in brca2 in a sporadic breast tumor. American journal of human genetics 59, 962 (1996). 48. Gonzalez-Perez, A. et al. Intogen-mutations identifies cancer drivers across tumor types. Nature methods 10, 1081–1082 (2013). 49. Linghu, B., Snitkin, E. S., Hu, Z., Xia, Y. & DeLisi, C. Genome-wide prioritization of disease genes and identification of disease- disease associations from an integrated human functional linkage network. Genome biology 10, R91 (2009). 50. Chen, X., Liu, M.-X. & Yan, G.-Y. Drug–target interaction prediction by random walk on the heterogeneous network. Molecular BioSystems 8, 1970–1978 (2012). 51. Yang, H., Wei, Q., Zhong, X., Yang, H. & Li, B. Cancer driver gene discovery through an integrative genomics approach in a non- parametric bayesian framework. Bioinformatics 33, 483–490 (2017). 52. Wu, H.-T., Hajirasouliha, I. & Raphael, B. J. Detecting independent and recurrent copy number aberrations using interval graphs. Bioinformatics 30, i195–i203 (2014). 53. Gevaert, O., Villalobos, V., Sikic, B. I. & Plevritis, S. K. Identification of ovarian cancer driver genes by using module network integration of multi-omics data. Interface focus 3, 20130013 (2013). 54. Taskesen, E., Staal, F. J. & Reinders, M. J. An integrated approach of gene expression and dna-methylation profiles of wnt signaling genes uncovers novel prognostic markers in acute myeloid leukemia. BMC bioinformatics 16, 1 (2015). 55. Kim, S., Sael, L. & Yu, H. A mutation profile for top-k patient search exploiting gene-ontology and orthogonal non-negative matrix factorization. Bioinformatics 31, 3653–3659 (2015). 56. Malioutov, D. & Malyutov, M. Boolean compressed sensing: Lp relaxation for group testing. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3305–3308 (IEEE, 2012). 57. Beroukhim, R. et al. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proceedings of the National Academy of Sciences 104, 20007–20012 (2007). 58. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological) 289–300 (1995). Acknowledgements e a Th uthors would like to thank the editor and the reviewers for their valuable comments and helpful suggestions for improving the manuscript. They also thank Dr. Hengyu Li at Department of Breast and Thyroid Surgery, Changhai Hospital, Shanghai for providing helpful suggestions. This work is supported by National Natural Science Foundation of China (Grant Nos 61571414, 61471331 and 31100955). The authors would like to thank Chen Peng and Changran Zhang for helpful discussions and suggestions. Author Contributions J.X. and A.L. wrote the main manuscript text and prepared all Tables and Figures. M.W. provided valuable suggestions and guidance. All authors reviewed the manuscript. Additional Information Supplementary information accompanies this paper at doi:10.1038/s41598-017-03141-w Competing Interests: The authors declare that they have no competing interests. Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 10 www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Cre- ative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not per- mitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. © The Author(s) 2017 Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 11 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Scientific Reports Springer Journals

A novel network regularized matrix decomposition method to detect mutated cancer genes in tumour samples with inter-patient heterogeneity

Free
11 pages

Loading next page...
 
/lp/springer_journal/a-novel-network-regularized-matrix-decomposition-method-to-detect-3QAR0zelpm
Publisher
Springer Journals
Copyright
Copyright © 2017 by The Author(s)
Subject
Science, Humanities and Social Sciences, multidisciplinary; Science, Humanities and Social Sciences, multidisciplinary; Science, multidisciplinary
eISSN
2045-2322
D.O.I.
10.1038/s41598-017-03141-w
Publisher site
See Article on Publisher Site

Abstract

www.nature.com/scientificreports OPEN A novel network regularized matrix decomposition method to detect mutated cancer genes in Received: 27 January 2017 tumour samples with inter-patient Accepted: 20 April 2017 Published: xx xx xxxx heterogeneity 1 1,2 1,2 Jianing Xi , Ao Li & Minghui Wang Inter-patient heterogeneity is a major challenge for mutated cancer genes detection which is crucial to advance cancer diagnostics and therapeutics. To detect mutated cancer genes in heterogeneous tumour samples, a prominent strategy is to determine whether the genes are recurrently mutated in their interaction network context. However, recent studies show that some cancer genes in different perturbed pathways are mutated in different subsets of samples. Subsequently, these genes may not display significant mutational recurrence and thus remain undiscovered even in consideration of network information. We develop a novel method called mCGfinder to efficiently detect mutated cancer genes in tumour samples with inter-patient heterogeneity. Based on matrix decomposition framework incorporated with gene interaction network information, mCGfinder can successfully measure the significance of mutational recurrence of genes in a subset of samples. When applying mCGfinder on TCGA somatic mutation datasets of five types of cancers, we find that the genes detected by mCGfinder are significantly enriched for known cancer genes, and yield substantially smaller p-values than other existing methods. All the results demonstrate that mCGfinder is an efficient method in detecting mutated cancer genes. Next generation sequencing (NGS) technology has revolutionized the detection of somatic mutations in cancer 1–5 genomics in recent years . With NGS technique, large numbers of tumour samples have been sequenced in pro- 6 7–10 jects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) . es Th e projects provide excellent opportunities to find mutated cancer genes from a large cohort of tumour sam- 11–13 ples, which can help differentiate functionally related driver mutations from passenger mutations . A common 14, 15 strategy to detect mutated cancer genes is to detect genes with significant mutational recurrence . Although some cancer genes show high mutation frequencies (such as TP53 or KRAS, both well-known cancer genes), previous studies demonstrate that extensive inter-patient heterogeneity is present in various types of cancers 16–18 and some cancer genes are mutated in a small number of samples . es Th e mutated cancer genes are not likely to display significant mutational recurrence due to inter-patient heterogeneity, and consequently they may be 13, 17–19 underestimated by the frequency-based methods . A prominent explanation of inter-patient heterogeneity is that the behavior of key pathways of tumour sam- ples is perturbed by mutated cancer genes, and only a subset of genes in these pathways are mutated in a given 17–19 sample . Subsequently, many recent approaches exploit large scale gene interaction network as an additional 17–21 source to identify cancer genes mutated in perturbed pathways . Considering both mutation frequencies of 22 23 24 25–27 genes and information from interaction network such as iRefIndex , HPRD , STRING and others , these approaches detect mutated cancer genes by determining whether the investigated genes are recurrently mutated 17, 18 19 in their network context. For example, HotNet and HotNet2 propagate the “heat” of mutation frequencies School of Information Science and Technology, University of Science and Technology of China, Hefei, AH230027, China. Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, AH230027, China. Jianing Xi and Ao Li contributed equally to this work. Correspondence and requests for materials should be addressed to A.L. (email: aoli@ustc.edu.cn) Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 1 www.nature.com/scientificreports/ of genes through the network and select genes with significantly high “heat” scores as mutated cancer genes. ReMIC detects genes with mutational recurrence in their network context through a diffusion graph kernel strategy. In these network-based approaches, mutated cancer genes are determined according to both their muta- tional recurrence and the mutational influence from their network context. Despite the success achieved by the aforementioned approaches, another important aspect contributing to inter-patient heterogeneity is that some cancer genes in different perturbed pathways are mutated in different 28–31 subsets of samples, which has been observed in recent studies . For example, transcriptional abnormalities of 28–30 some genes in different pathways are found in different subsets of samples . Moreover, another study shows that in multiple types of cancers, somatic mutations of some cancer genes in various perturbed subnetworks are observed in distinct subgroups, suggesting that cancer genes in die ff rent pathways may be mutated in die ff rent subsets of samples . If some mutated cancer genes are associated with only a subset of samples, these genes may not exhibit significant mutational recurrence in all samples even in consideration of the mutational influence from their network context. Accordingly, these cancer genes are likely to be underestimated by the existing meth- ods, as these methods are not specially designed for cancer gene detection under this scenario. Identifying abnormal genes in a subset of samples from cancer data with inter-patient heterogeneity is a crit- 32–35 ical problem in bioinformatics, and therefore has been studied in many previous researches . To tackle this 36–40 problem, methods based on matrix decomposition framework have been introduced . es Th e methods decom- pose the cancer data matrix into different components, which indicate different subsets of samples and the related abnormal genes. Nevertheless, to the best of our knowledge, these matrix decomposition based methods cannot efficiently incorporate information from network context. Therefore, to capture the mutated cancer genes in per - turbed pathways associated with only a subset of samples, it is an urgent need to establish an integrated method that can both incorporate gene interaction network information and measure the significance of mutational recurrence of genes in a subset of samples. In this study, we propose a novel method called mCGfinder, to detect mutated cancer genes in tumour samples with inter-patient heterogeneity. Based on matrix decomposition framework, mCGfinder can successfully meas- ure the significance of mutational recurrence of genes in a subset of samples instead of in all samples. Meanwhile, we introduce graph Laplacian regularization into mCGfinder, which can efficiently measure the mutational influence from the network neighbors of the investigated genes. When applying mCGfinder on TCGA somatic mutation datasets of five types of cancers, we find that the genes detected by mCGfinder are significantly enriched for known cancer genes. Notably, mCGfinder yields substantially smaller p-values (e.g., p-value = 1.24e–17 for breast cancer) than other existing network-based approaches across all investigated cancers. Moreover, we observe that high percentages of known cancer genes are included in the top ranked genes detected by mCG- finder. All the results indicate the efficiency of mCGfinder in detecting mutated cancer genes in heterogeneous tumour samples. Results Overview of mCGfinder. To detect mutated cancer genes from somatic mutation data of inter-patient het- erogeneous cancers, mCGfinder involves mainly three steps (Fig.  1). In the first step, mCGfinder decomposes the mutation data matrix of heterogeneous tumour samples into several components, and use the summation of these components to approximate the mutation matrix. Each component obtained by mCGfinder is the outer product of sample indicator vector and gene score vector, indicating a subset of samples and the mutational recurrence of genes related to these samples respectively. At the same time, we also use graph Laplacian regularization to incorporate information of gene interaction network into mCGfinder. In the second step, we apply permutation test and false discovery rate (FDR) control on gene score vectors of every components, and obtain the FDR q-values of all investigated genes. In the third step, mutated cancer genes are selected with FDR q-values less than 29, 42 the default significance threshold 0.05 . The code of mCGfinder can be freely accessed at https://github.com/ USTC-HIlab/mCGfinder . Comparison analysis. For the analysis of mCGfinder in mutated cancer gene detection, we employ TCGA somatic mutation data of five types of cancers in this study, including 776 breast invasive carcinoma (BRCA) sam- 29 30 43 ples , 238 bladder urothelial carcinoma (BLCA) samples , 291 glioblastoma multiforme (GBM) samples , 509 44 45 head and neck squamous cell carcinoma (HNSC) samples and 197 acute myeloid leukemia (LAML) samples . 19 20 The performance of mCGfinder is compared against two existing methods, HotNet2 and ReMIC . In mCG- finder, HotNet2 and ReMIC, we use a highly curated gene interaction network iRefIndex as the network infor- 19, 20 mation. In the comparison study, mCGfinder, HotNet2 and ReMIC are configured by their default settings (details in Supplementary materials). An overview of the mutated cancer genes detected by mCGfinder, HotNet2 and ReMIC is illustrated as Venn diagrams (Fig. 2 and Supplementary Fig. S5A). For all the five types of cancers, there is a high concordance between the results of mCGfinder and the results of the other two methods. Among the genes detected by mCGfinder, the percentages of genes that are also detected by at least one of the other meth- ods range from 36.6% (BRCA) to 84.3% (LAML) across the five types of cancers. In this study, we apply Fisher’s exact test on the detection results to evaluate whether the detected genes of the three methods are significantly enriched for known mutated cancer genes in Cancer Gene Census (CGC) , which is a highly curated database of cancer genes. For all the five types of cancers, the results of mCGfinder are highly enriched for CGC cancer genes (Fig. 2), and yield the most significant p-values among the three investi- gated methods. Taking BRCA as an example, HotNet2 and ReMIC obtain p-values of 1.08e-05 (8 CGC genes) and 1.04e-03 (5 CGC genes) respectively, which suggest that these results are significantly different than random selection. In comparison, mCGfinder achieves a p-value of 1.24e-17 and captures 18 CGC breast cancer genes. Notably, there are 8 CGC genes (AKT1, BRCA2, CASP8, CTCF, MAP3K1, MAP3K13, NCOR1 and TBX3) cap- tured by mCGfinder but not by HotNet2 or ReMIC. Literature survey shows that AKT1 gene is implicated as Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 2 www.nature.com/scientificreports/ Figure 1. Schematic diagram of mCGfinder, which is a matrix decomposition method integrated with network information. It decomposes the mutation matrix as the matrix multiplication of the sample indicator vectors and the transpose of gene score vectors. The different components in the results of mCGfinder are regarded as the outer products of different sample indicator vectors and their related gene score vectors, where the summation of the components is an approximation of the mutation matrix. Graph Laplacian regularization is used to incorporate information of gene interaction network into mCGfinder. Aer t ft he matrix decomposition procedure, the mutational recurrence of genes in different subsets of samples can be measured from the gene score vectors of the related component, and the related subsets samples of the component are indicated by the sample indicator vectors. Through permutation test and false discovery rate (FDR) control, mutated cancer gene candidates can be identified by thresholding FDR q-values of the genes. significantly mutated gene in breast cancer in a previous study , and mutations of BRCA2 gene are reported to be involved in the primary events of breast carcinogenesis . In the three other types of cancers, mCGfinder also provides high enrichment for CGC genes, with associated p-values of 1.91e-08 in BLCA (8 CGC genes), 4.91e-14 in GBM (12 CGC genes), 1.46e-08 in HNSC (9 CGC genes) and 5.57e-16 in LAML (10 CGC genes). Interestingly, for all the investigated cancers, the genes detected by both HotNet2 and ReMIC but not by mCGfinder include no known CGC gene (Fig. 2). Taking BLCA as an example, there is no CGC gene among the 242 genes detected by both HotNet2 and ReMIC but not by mCGfinder. The full lists of CGC genes detected by mCGfinder on the five types of cancers are provided in Supplementary Table S1. To give a more comprehensive assessment on the detection results, we also use another independent curated cancer gene database, Integrative Onco Genomics (IntOGen) . In the enrichment analysis for known cancer genes reported in IntOGen, mCGfinder demonstrates comparable or better performance than the two competing methods (Supplementary Figs S1 and S5A). In BRCA, the detection results of HotNet2 and ReMIC show good performance and contain 22 and 31 IntOGen breast cancer genes respectively. In comparison, mCGfinder suc- cessfully recovers 60 IntOGen breast cancer genes. The enrichment p-values of the results of mCGfinder, HotNet2 and ReMIC for IntOGen genes in BRCA are 1.97e-36, 1.74e-06 and 1.15e-16 respectively. For BLCA data, there are 31 and 44 IntOGen genes captured by HotNet2 (p-value = 4.16e-03) and ReMIC (p-value = 1.30e-10) Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 3 www.nature.com/scientificreports/ Figure 2. Venn diagrams of intersections between the genes detected by mCGfinder (red circle), HotNet2 (green circle) and ReMIC (blue circle) on TCGA somatic mutation datasets of BRCA (north-west panel), BLCA (north-east panel), GBM (south-west panel) and HNSC (south-east panel). The gray and black numbers in each region of the Venn diagrams indicate the number of detected genes and the number of genes also reported in Cancer Gene Census (CGC) respectively. The p-values next to the circles of the methods are calculated by Fisher’s exact test, representing the enrichment significance of the detection results for CGC annotated cancer genes. respectively. In comparison, mCGfinder predicts 56 IntOGen genes, yielding a p-value of 4.78e-34. e Th IntOGen genes detected by mCGfinder on the five types of cancers are listed in Supplementary Table  S2. Finally, we per- form cancer gene enrichment analysis by using the combined cancer gene lists of both CGC and IntOGen data- bases, and similar conclusion can be drawn from the results across all the five types of cancers (Supplementary Figs S1 and S5A). e Th CGC and IntOGen genes identified by mCGfinder but not by the other investigated meth- ods along with their functions are demonstrated in Supplementary Table S3. Ranking analysis. In addition to the statistical enrichment analysis, in order to comprehensively evaluate the performances of mCGfinder, we further use the results obtained by not only the default threshold but also 49–52 various thresholds by following previous studies . The gene ranking scores of different approaches are detailed in Supplementary materials. By raising the threshold and obtaining the percentages of known cancer genes falling under the category, we can evaluate the detection results of different methods comprehensively through rank cut- 49, 50 off curves . Here we use the rank cutoff curves as the evaluation metric for the top ranked genes detected by the investigated methods, which are drawn by listing the percentages of known cancer genes that are also included in the top ranked genes. As shown in Fig. 3A–D, the top ranked genes detected by mCGfinder contain consistently higher percentages of known CGC genes than the results of the other methods at various rank thresholds. Taking BRCA as an example, 3.5% of CGC breast cancer genes are included in the top 50 genes detected by HotNet2. In comparison, the top 50 genes identified by mCGfinder contain 31.0% of CGC breast cancer genes. When the rank threshold of genes raises to 100, the percentages of known cancer genes detected by HotNet2 and ReMIC also increase to 10.3% and 6.9% respectively. In comparison, there are 41.4% of known CGC genes included in the results of mCGfinder. Similarly, in the other types of cancers, the top ranked genes detected by mCGfinder also Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 4 www.nature.com/scientificreports/ Figure 3. Rank cutoff curves of top 100 candidates in mCGfinder (red line with circle markers), HotNet2 (green line with square markers) and ReMIC (blue line with triangle markers) results, describing the relation between various cutoffs and the fraction of known CGC cancer genes ranked above this cutoff in BRCA (A), BLCA (B), GBM (C) and HNSC (D). Cumulative fractions of known CGC cancer genes annotated by CGC within the top 100, 300, 500, 700 and 1000 genes in BRCA (E), BLCA (F), GBM (G) and HNSC (H). Results from all the assessments indicate the generally improved performance of mCGfinder over the competing methods. contain high fractions of known cancer genes. For example, there are 50.0%, 50.0% and 33.3% of known CGC genes included in the top 100 genes detected by mCGfinder on BLCA, GBM and HNSC data respectively. Next, we assess the performance of different methods with larger rank thresholds, and mCGfinder still com- pares favorably to the competing methods across all the five types of cancers (Fig.  3E–H and Supplementary Fig. S5C). When the rank threshold raises to 500, both HotNet2 and ReMIC demonstrate reasonable performance in BRCA and yield percentages 27.6% and 20.7% of CGC cancer genes respectively. In comparison, mCGfinder achieves a percentage of 62.7%. In BLCA, GBM and HNSC, more than half of the known CGC cancer genes are also included in the top 500 genes detected by mCGfinder respectively. We further assess the top ranked genes of the investigated methods by IntOGen gene lists and the combined gene lists of both the two databases. The results also show that mCGfinder achieves the highest percentages among the three investigated methods throughout the rank cutoff analysis (Supplementary Figs  S2, S3 and S5B,C). Moreover, by varying the rank thresholds and calculating the precisions and recalls, we draw the precision-recall curve (PR curve) of the results detected by the investigated methods as the assessment met- 51, 52 ric used in previous studies . For all the five types of cancers, when the known cancer genes in CGC are used as gold-standard, the PR curves of mCGfinder are clearly located over the curves of the other meth- ods (Supplementary Figs S4 and S5D). As the limited number of known cancer genes from CGC may lead to inaccurate performance, we further use known cancer genes annotated by IntOGen for evaluation (Fig. 4 and Supplementary Fig. S5D), in which the number of known breast cancer genes is largely increased. Taking BRCA as an example, when the recalls are fixed at 10.0%, the precisions of mCGfinder, HotNet2 and ReMIC are 33.9%, 4.3% and 11.9% respectively, which are consistently better than random selection. The area under the precision-recall curve of mCGfinder is also greater than the other methods (Supplementary Table  S4). In consist- ent with BRCA results, mCGfinder also gives the best performance among the detection results of the investigated methods on BLCA, GBM, HNSC and LAML data when evaluated by IntOGen. Similar conclusions can also be obtained from analysis of the detection results from the combined gene lists of the two databases (Supplementary Figs S4, S5D and Supplementary Table S4). Computational cost. In addition to the analysis of detection performance, we further examine the com- putational time of the three investigated methods. The experiments in this study are performed on a computer with Intel Xeon(R) CPU E5-2630 0 @ 2.30 GHz × 18 Processors and 64 GB of memory. For BRCA, BLCA, GBM, HNSC and LAML somatic mutation datasets with 12129 genes and 776, 238, 291, 509 and 197 samples, the running time of mCGfinder is 3–5 minutes in average, which is smaller than HotNet2 and ReMIC (Fig. 5). For example, in BRCA, HotNet2 takes around 24 minutes, and ReMIC takes around 21 minutes. In comparison, mCGfinder takes only around 5 minutes. In HNSC, mCGfinder, HotNet2 and ReMIC take around 4, 24 and 26 minutes respectively. Discussion Developing efficient methods to detect cancer genes from inter-patient heterogeneous tumour samples is an chal- lenging task, and a major obstacle is the fact that some cancer genes are mutated in perturbed pathways associated 28, 31 with only a subset of samples . Thus, these mutated cancer genes may not be significantly recurrent in all samples and remain undiscovered even when the mutations in their interaction network context are considered. In this Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 5 www.nature.com/scientificreports/ Figure 4. Precision-recall curves for the three methods on BRCA (north-west panel), BLCA (north-east panel), GBM (south-west panel) and HNSC (south-east panel) data, where red, green, blue and gray lines represent the curves of mCGfinder, HotNet2, ReMIC and random selection respectively. For each curve, the points indicate the precisions and recalls at different ranks in the prediction results. The precision is computed as the fraction of the top ranked genes that are known cancer genes, and the recall is computed as the fraction of known cancer genes in the top ranked genes. Figure 5. Running time comparison of mCGfinder (red bar), HotNet2 (green bar) and ReMIC (blue bar) on datasets of BRCA, BLCA, GBM, HNSC and LAML respectively. Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 6 www.nature.com/scientificreports/ paper, based on the combination of matrix decomposition framework and information from gene interaction net- work, we propose a novel method which is capable of detecting mutated cancer genes in a subset of samples. When applied on TCGA somatic mutation datasets of five types of cancers, mCGn fi der precisely recovers many known cancer genes. Our results also show that the performance of mCGfinder is not sensitive to the selection of the tuning parameter (Supplementary Fig. S6). Notably, mCGfinder achieves the highest enrichment for known genes among the investigated methods, suggesting that it is a powerful bioinformatics tool for mutated cancer gene detection. A significant distinction between mCGfinder and the existing network-based approaches for cancer gene detection is that mCGfinder decomposes the mutation data matrix of heterogeneous tumour samples into differ - ent components, and measures the mutational recurrence of genes in subsets of samples indicated by the com- ponents. Based on this design, mCGfinder greatly complements the detection results of the existing approaches. Nevertheless, it should be pointed out that the evaluation results are not sufficient to mean as a criticism of the other investigated methods. Instead, they show the difference between whether considering the mutated cancer genes in different subsets of samples or not. In the detection results of the investigated cancers, some CGC genes missed by mCGfinder are detected either by HotNet2 only or by ReMIC only (Fig.  2). For example, in BLCA results there is 1 CGC gene among the 1040 genes detected by HotNet2 only, and there is 1 CGC gene among the 653 genes detected by ReMIC only. These results suggest that it may be worth using both mCGfinder and the existing methods to maximize the detection rate of mutated cancer genes. Despite the promising results achieved by the purposed method, there are also some avenues for further inves- tigation. For example, our method is not designed to address the issue of intra-tumour heterogeneity , which cannot be represented by the input binary matrix. In consistent with HotNet2 and ReMIC which highly depend on gene interaction network, our method utilizes gene interaction network as an important information source for detecting mutated cancer genes. Therefore, mCGfinder is not yet applicable for genes that are not included in gene interaction network. Also, it is noteworthy that unlike previous approach , our method is not designed to stratify cancer samples and cannot incorporate biological knowledge of cancer subtypes . Meanwhile, the objective func- tion in mCGfinder is not guaranteed to be convex albeit a local optimum can be reached. Furthermore, a promis- ing expansion to the mCGfinder in future work would be to integrate information from not only gene interactions, but also different types of information such as copy number alternation, gene expression and DNA methylation, 51, 53, 54 which would oer a ff n opportunity to comprehensively understand cancer events from a multi-omics view . In summary, mCGfinder is a novel method to efficiently detect mutated cancer genes in tumour samples with inter-patient heterogeneity, which provides a more sophisticated view of cancer genomics from both the influence of interaction network context and mutational recurrence of genes in different subsets of samples. Altogether, mutational profile analysis from mCGfinder and further experimental follow-up may help take a step forward to a more comprehensive knowledge of the cancer genome. Materials and Methods TCGA somatic mutation data of cancers. We apply mCGfinder on TCGA somatic mutation data of v fi e types of cancers, BRCA, BLCA, GBM, HNSC and LAML (detailed information in Supplementary materials and Supplementary Table S5). For each type of cancer, the mutation data is a binary matrix X = (X ) where ij n×p the rows and columns of the mutation matrix denote the tumour samples (totally n samples) and the investigated genes (totally p genes) respectively. Each entry X of the matrix indicates the binary state of the gene, in which 1 ij 31, 55 represents the i-th sample contains a somatic mutation of the j-th gene, and 0 otherwise . Network regularized matrix decomposition. Based on matrix decomposition framework, mCGfinder decomposes the matrix X of somatic mutation data in heterogeneous tumour samples into different components, and the summation of these components can be regarded as an approximation of the mutation data matrix, i.e. Xs =+ g ε . ∑ r r (1) r =1 where s = () s and g = () g are the sample indicator vector and the gene score vector for the r-th com- rirn×1 p×1 rjr ponent. The ε is the residual matrix for the r-th component, and R is the total number of the components obtained by mCGfinder. The sample indicator vector s indicates the assignment of tumour samples to the r-th component, in which the coefficient s = 1 represents that the i-th samples are included in the component, and ir s = 0 otherwise. As for the gene score vector g of the r-th component, a higher value of the coefficient g of the ir r jr vector presents a larger potential of the j-th gene to be a mutated cancer gene. Note that the first component, which is the outer product of the two vectors , is the best rank-one approximation of the data sg = () sg 1 i1 np × 1 j1 matrix X. Thus, we can use the approximation to decompose the first component (S and g ) from the data matrix, 1 1 36, 37, 40 and obtain the remaining components through a component-by-component strategy . Also, to efficiently incorporate information from gene interaction network, we use graph Laplacian regularization on the gene score vector g . Subsequently, we construct an optimization problem for vector s and g to obtain the first component, 1 1 1 and the objective function is, T T min Xs −+ gg λ Lg 1 L 1 1 1 sg , 1 1 st ..s ∈. {0,1} (2) 2 n where ⋅ denotes the squared Frobenius norm of a matrix, and s ∈ {0,1} indicates that the coefficients in F 1 vector s can be either 1 or 0. The matrix L = () L is the Laplacian matrix of the gene interaction network, 1 ij pp × Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 7 www.nature.com/scientificreports/ which is calculated through the matrix subtraction . The matrix is the sym- () LD =− () () A () A ij pp ×× ij pp ij pp × ij pp × metric normalized adjacency matrix of the gene interaction network (see Supplementary materials for details of the normalization procedure), and the matrix is a diagonal matrix whose entries are the column sums of () D ij pp × matrix () A . ij pp × In the objective function (2), the first term is the summation of the residuals between the first component and the data matrix. When the first term is minimized, we can obtain a component that best fit the data matrix. The second term is the graph Laplacian term, which can be rewritten as p p p p T 2 gLg == ggL () gg −. A ∑∑ ∑∑ ij ij 1 1 ij 11 ij 11 i== 11 j i== 11 ji+ (3) Through the graph Laplacian term, we can successfully adopt the assumption that if the i-th gene and the j-th gene are connected in the gene interaction network (A > 0), the scores g and g of the two genes are also close ij i1 j1 to each other. The tuning parameter λ is used to balance the fitness of the model (first term) and the smoothness of the scores of connected genes (second term), which is set to 0.1 in this study. Accordingly, mCGfinder can efficiently measure the significance of mutational recurrence of genes in a subset of samples and incorporate information from network context at the same time. Iterative estimation procedure. To solve the optimization problem in (2), we employ an efficient iterative 36, 37, 40 procedure to estimate the two vector s and g alternatively . When the gene score vector g is fixed, the opti- 1 1 1 mization function to solve the coefficient s in the sample indicator vector s is formulated as below: i1 1 2 2 min( ss gX − 2) g ii 1 1 i 1 1 i1 st .. ss (1 −= )0,1 ∀= in ,, … , ii 11 (4) where the ⋅ denotes the squared L2-norm of a vector, and (·) indicates the i-th coefficients of a vector. Since the values of the coefficients in sample indicator vector are constrained to be binary, we introduce Boolean con - straint on coefficients in vector s . For the assignment of the i-th sample of the first component, the estimation of s in vector s can be calculated through Karush-Kuhn-Tucker (KKT) conditions, i1 1 1if(2) Xg ≥ g  i s =  i1 0otherwise. (5) Likewise, when the sample indicator vector s is x fi ed, the optimization function to solve the gene score vector g in optimization problem (2) is formulated as below: 2 T T T min2 sg gX −+ () sg λ gLg . () 1 1 L 2 1 1 11 1 (6) Similar to the derivation for the sample indicator vector above, the gene score vector can also be obtained through the KKT conditions: −1 2 T gs =+ IL λ () Xs , () 1 pL 1 2 (7) where I is a p × p identity matrix, and the symmetric matrix () sI + λ L (r = 1 in this case) is an invertible rp L matrix (see Supplementary materials for detailed proof ). Subsequently, the gene score vector and sample indica- tor vector in the first component can be iteratively estimated through alternating the two update rules (5) and (7) 36, 37, 40 until convergence . Algorithm 1 mCGfinder: iterative estimation procedure Input: mutation matrix X ; graph Laplacian matrix L; np × Output: sample indicator vector s ; gene score vector g . 1 1   1: set λ ← 0.1 −1 (0) (0) T   2: s ← 1 and g ←+ nIL λ X 1 1 n () pL () n   3: repeat −1 (1 kk + ) () T () k   4: g ←+ sI λ LX s 1 pL () 1 1 () 2   (1 kk + ) (1 ++ ) (1 k )   6:  s ← IX 2 gg − 1 [0,) ∞  1 1      6: k ← k + 1   7: until Convergence () ∞ () ∞   8: return s ← s and g ← g Note: 1 is an n × 1 vector with all coefficients being 1; Indicator function I (x) returns a logical vector if x ∈ A. n A i Algorithm 1.. e i Th terative estimation procedure of sample indicator vector and gene score vector in mCGfinder. Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 8 www.nature.com/scientificreports/ The algorithm of the estimation of the two vectors in the first component are summarized in Algorithm 1. Ae ft r convergence, the first component is obtained by matrix multiplication . To obtain the next component sg 36, 37, 40 (s and g ), we repeat the procedures in Algorithm 1 on the remaining samples . Subsequently, we can esti- 2 2 mate the r-th component ( s and g ) (r = 2, … , R) by decomposing the data matrix through the r r component-by-component strategy until all samples are assigned (details in Supplementary Fig. S7), and the number R is obtained by counting the components decomposed by mCGfinder. Significance test. To assess which of these mutated genes are statistically significant in a subset of samples, we implement significance test on the coefficients of the gene score vectors g (r = 1, …, R) in every 21 − components decomposed by mCGfinder. In brief, we define in (7) as the network influ- Xs () IL + λ rp L enced matrix. The coefficients of gene score vector g can be calculated by the summation of the entries of a subset of rows of the network influenced matrix X , where the rows are indicated by the sample indicator net 40, 57 vector s of the investigated component. We follow the procedure in previous studies and identify the genes of which the scores can disprove the null hypothesis that their values of the gene score vector coeffi- cients are only contributed by background mutations alone. Since the random background mutations could occur anywhere in the genome, the null distribution is modeled by recalculating the gene score vectors across all combinations of permutations of the network influenced matrix within samples. Detailed proce- dure for the significance test is provided in Supplementary materials. Since large numbers of permutations is usually time consuming, we instead use a semi-exact estimation approach proposed in previous 40, 57 approaches to estimate the distribution of scores and the corresponding p-values. To control the false discovery rates of the investigated genes, we apply the Benjamini-Hochberg FDR procedure on the p-values obtained from the significance test, and calculate the q-values of the investigated genes for each component. For a specific gene, we choose the most significant (smallest) q-values of the investigated gene among all components as the significance score of the gene. References 1. Schuster, S. C. Next-generation sequencing transforms today’s biology. Nature 200, 16–18 (2007). 2. Chiang, D. Y. et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nature methods 6, 99–103 (2009). 3. Xiong, M., Zhao, Z., Arnold, J. & Yu, F. Next-generation sequencing. BioMed Research International 2010 (2011). 4. Nijkamp, J. F. et al. De novo detection of copy number variation by co-assembly. Bioinformatics 28, 3195–3202 (2012). 5. Zhao, M., Wang, Q., Wang, Q., Jia, P. & Zhao, Z. Computational tools for copy number variation (cnv) detection using next- generation sequencing data: features and perspectives. BMC bioinformatics 14, 1 (2013). 6. Weinstein, J. N. et al. The cancer genome atlas pan-cancer analysis project. Nature genetics 45, 1113–1120 (2013). 7. Mardis, E. R. Genome sequencing and cancer. Current opinion in genetics & development 22, 245–250 (2012). 8. Watson, I. R., Takahashi, K., Futreal, P. A. & Chin, L. Emerging patterns of somatic mutations in cancer. Nature reviews Genetics 14, 703–718 (2013). 9. Vogelstein, B. et al. Cancer genome landscapes. science 339, 1546–1558 (2013). 10. Ding, L., Wendl, M. C., McMichael, J. F. & Raphael, B. J. Expanding the computational toolbox for mining cancer genomes. Nature Reviews Genetics 15, 556–570 (2014). 11. Stephens, P. J. et al. The landscape of cancer genes and mutational processes in breast cancer. Nature 486, 400–404 (2012). 12. Wendl, M. C. et al. Pathscan: a tool for discerning mutational significance in groups of putative cancer genes. Bioinformatics 27, 1595–1602 (2011). 13. Raphael, B. J., Dobson, J. R., Oesper, L. & Vandin, F. Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine. Genome medicine 6, 1 (2014). 14. Yuan, X., Zhang, J., Zhang, S., Yu, G. & Wang, Y. Comparative analysis of methods for identifying recurrent copy number alterations in cancer. PloS one 7, e52516 (2012). 15. Dees, N. D. et al. Music: identifying mutational significance in cancer genomes. Genome research 22, 1589–1598 (2012). 16. Sontrop, H. M., Verhaegh, W. F., Reinders, M. J. & Moerland, P. D. An evaluation protocol for subtype-specific breast cancer event prediction. PloS one 6, e21681 (2011). 17. Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for detecting significantly mutated pathways in cancer. Journal of Computational Biology 18, 507–522 (2011). 18. Vandin, F., Clay, P., Upfal, E. & Raphael, B. J. Discovery of mutated subnetworks associated with clinical data in cancer. In Pac Symp Biocomput 2012, 55–66 (2012). 19. Leiserson, M. D., Vandin, F., Wu, H.-T., Dobson, J. R. & Raphael, B. R. Pan-cancer identification of mutated pathways and protein complexes. Cancer Research 74, 5324–5324 (2014). 20. Babaei, S., Hulsman, M., Reinders, M. & de Ridder, J. Detecting recurrent gene mutation in interaction network context using multi- scale graph diffusion. BMC bioinformatics 14, 1 (2013). 21. Jia, P. & Zhao, Z. Varwalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data. PLoS Comput Biol 10, e1003460 (2014). 22. Razick, S., Magklaras, G. & Donaldson, I. M. irefindex: a consolidated protein interaction database with provenance. BMC bioinformatics 9, 1 (2008). 23. Prasad, T. K. et al. Human protein reference database-2009 update. Nucleic acids research 37, D767–D772 (2009). 24. Szklarczyk, D. et al. The string database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic acids research 39, D561–D568 (2011). 25. Lee, I., Blom, U. M., Wang, P. I., Shim, J. E. & Marcotte, E. M. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome research 21, 1109–1121 (2011). 26. Das, J. & Yu, H. Hint: High-quality protein interactomes and their applications in understanding human disease. BMC systems biology 6, 92 (2012). 27. Khurana, E., Fu, Y., Chen, J. & Gerstein, M. Interpretation of genomic variants using a unified biological network approach. PLoS Comput Biol 9, e1002886 (2013). 28. Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using paradigm. Bioinformatics 26, i237–i245 (2010). 29. Cancer Genome Atlas Network. et al. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012). Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 9 www.nature.com/scientificreports/ 30. Cancer Genome Atlas Research Network. et al. Comprehensive molecular characterization of urothelial bladder carcinoma. Nature 507, 315–322 (2014). 31. Hofree, M., Shen, J. P., Carter, H., Gross, A. & Ideker, T. Network-based stratification of tumor mutations. Nature methods 10, 1108–1115 (2013). 32. Cheng, Y. & Church, G. M. Biclustering of expression data. Ismb 8, 93–103 (2000). 33. Yang, J., Wang, H., Wang, W. & Yu, P. S. An improved biclustering method for analyzing gene expression profiles. International Journal on Artici fi al Intelligence Tools 14, 771–789 (2005). 34. Shabalin, A. A., Weigman, V. J., Perou, C. M. & Nobel, A. B. Finding large average submatrices in high dimensional data. The Annals of Applied Statistics 985–1012 (2009). 35. Oghabian, A., Kilpinen, S., Hautaniemi, S. & Czeizler, E. Biclustering methods: biological relevance and application in gene expression analysis. PloS one 9, e90801 (2014). 36. Lee, M., Shen, H., Huang, J. Z. & Marron, J. S. Biclustering via sparse singular value decomposition. Biometrics 66, 1087–1095 (2010). 37. Sill, M., Kaiser, S., Benner, A. & Kopp-Schneider, A. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics 27, 2089–2097 (2011). 38. Zhou, X., Yang, C., Wan, X., Zhao, H. & Yu, W. Multisample acgh data analysis via total variation and spectral regularization. IEEE/ ACM Transactions on Computational Biology and Bioinformatics 10, 230–235 (2013). 39. Zhou, X., Liu, J., Wan, X. & Yu, W. Piecewise-constant and low-rank approximation for identification of recurrent copy number variations. Bioinformatics 30, 1943–1949 (2014). 40. Xi, J. & Li, A. Discovering recurrent copy number aberrations in complex patterns via non-negative sparse singular value decomposition. IEEE/ACM Transactions on Computational Biology and Bioinformatics 13, 656–668 (2016). 41. Xie, B., Wang, M. & Tao, D. Toward the optimization of normalized graph laplacian. IEEE Transactions on Neural Networks 22, 660–666 (2011). 42. Cancer Genome Atlas Research Network. et al. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550 (2014). 43. McLendon, R. et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008). 44. Cancer Genome Atlas Network. et al. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517, 576–582 (2015). 45. Network, C. G. A. R. et al. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 2013, 2059–2074 (2013). 46. Futreal, P. A. et al. A census of human cancer genes. Nature Reviews Cancer 4, 177–183 (2004). 47. Weber, B., Brohm, M., Stec, I., Backe, J. & Caffier, H. A somatic truncating mutation in brca2 in a sporadic breast tumor. American journal of human genetics 59, 962 (1996). 48. Gonzalez-Perez, A. et al. Intogen-mutations identifies cancer drivers across tumor types. Nature methods 10, 1081–1082 (2013). 49. Linghu, B., Snitkin, E. S., Hu, Z., Xia, Y. & DeLisi, C. Genome-wide prioritization of disease genes and identification of disease- disease associations from an integrated human functional linkage network. Genome biology 10, R91 (2009). 50. Chen, X., Liu, M.-X. & Yan, G.-Y. Drug–target interaction prediction by random walk on the heterogeneous network. Molecular BioSystems 8, 1970–1978 (2012). 51. Yang, H., Wei, Q., Zhong, X., Yang, H. & Li, B. Cancer driver gene discovery through an integrative genomics approach in a non- parametric bayesian framework. Bioinformatics 33, 483–490 (2017). 52. Wu, H.-T., Hajirasouliha, I. & Raphael, B. J. Detecting independent and recurrent copy number aberrations using interval graphs. Bioinformatics 30, i195–i203 (2014). 53. Gevaert, O., Villalobos, V., Sikic, B. I. & Plevritis, S. K. Identification of ovarian cancer driver genes by using module network integration of multi-omics data. Interface focus 3, 20130013 (2013). 54. Taskesen, E., Staal, F. J. & Reinders, M. J. An integrated approach of gene expression and dna-methylation profiles of wnt signaling genes uncovers novel prognostic markers in acute myeloid leukemia. BMC bioinformatics 16, 1 (2015). 55. Kim, S., Sael, L. & Yu, H. A mutation profile for top-k patient search exploiting gene-ontology and orthogonal non-negative matrix factorization. Bioinformatics 31, 3653–3659 (2015). 56. Malioutov, D. & Malyutov, M. Boolean compressed sensing: Lp relaxation for group testing. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3305–3308 (IEEE, 2012). 57. Beroukhim, R. et al. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proceedings of the National Academy of Sciences 104, 20007–20012 (2007). 58. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological) 289–300 (1995). Acknowledgements e a Th uthors would like to thank the editor and the reviewers for their valuable comments and helpful suggestions for improving the manuscript. They also thank Dr. Hengyu Li at Department of Breast and Thyroid Surgery, Changhai Hospital, Shanghai for providing helpful suggestions. This work is supported by National Natural Science Foundation of China (Grant Nos 61571414, 61471331 and 31100955). The authors would like to thank Chen Peng and Changran Zhang for helpful discussions and suggestions. Author Contributions J.X. and A.L. wrote the main manuscript text and prepared all Tables and Figures. M.W. provided valuable suggestions and guidance. All authors reviewed the manuscript. Additional Information Supplementary information accompanies this paper at doi:10.1038/s41598-017-03141-w Competing Interests: The authors declare that they have no competing interests. Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 10 www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Cre- ative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not per- mitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. © The Author(s) 2017 Scientific Repo R ts | 7: 2855 | DOI:10.1038/s41598-017-03141-w 11

Journal

Scientific ReportsSpringer Journals

Published: Jun 6, 2017

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off