Background: Survival analysis on tumor expression profiles has always been a key issue for subsequent biological experimental validation. It is crucial how to select features which closely correspond to survival time. Furthermore, it is important how to select features which best discriminate between low-risk and high-risk group of patients. Common features derived from the two aspects may provide variable candidates for prognosis of cancer. Results: Based on the provided two-step feature selection strategy, we develop a joint covariate detection tool for survival analysis on tumor expression profiles. Significant features, which are not only consistent with survival time but also associated with the categories of patients with different survival risks, are chosen. Using the miRNA expression data (Level 3) of 548 patients with glioblastoma multiforme (GBM) as an example, miRNA candidates for prognosis of cancer are selected. The reliability of selected miRNAs using this tool is demonstrated by 100 simulations. Furthermore, It is discovered that significant covariates are not directly composed of individually significant variables. Conclusions: Joint covariate detection provides a viewpoint for selecting variables which are not individually but jointly significant. Besides, it helps to select features which are not only consistent with survival time but also associated with prognosis risk. The software is available at http://bio-nefu.com/resource/jcdsa. Keywords: Feature selection, Expression profiles, Survival analysis, Prognosis, Cancer Background outcomes. Anyway, these top-down strategies provide so Due to the limited effectiveness of current clinical diag- many variable candidates that the real features which may noses, expression profiles are utilized for informing vari- reveal the possible molecular cause of different survival ables, which are not only associated with the categories risks are inevitably submerged. of patients with different survival risks but also consistent In contrast, univariable hazards regression analyses with survival time . Commonly, Cox proportional haz- have been placed firmly in the mainstream. Bottom-up ards regression analysis is used to seek relevant variables strategies with different constraints such as least-angle considering the continuity of the patients’ survival out- regression and sparse kernel  are utilized for pro- comes with right censoring . As to small sample data viding variables associated with survival time. To the with high dimension, Cox proportional hazards regression best of our knowledge, we are the first to present joint has to be combined with methods using dimension reduc- covariate detection  that combines significant variables tion or shrinkage such as partial least squares and prin- consistent with survival time and associated with the cat- cipal component analysis . However, these approaches egories of patients. Other than individually significant only provide a combination of variables. Besides, tree- variables, we concentrate on bottom-up enumeration of structured survival analysis , random survival forests feature tuples, each component of which is either indi-  and that associated with hazards regression are vidually significant or not. This thought is inspired by proposed for selection of features associated with survival Integrative Hypothesis Testing , which is used for selecting features differentially expressed between dif- *Correspondence: email@example.com ferent groups of patients. Unlike Integrative Hypothesis College of Information and Computer Engineering, Northeast Forestry University, No.26 Hexing Road, 150001 Harbin, China Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Wu et al. BMC Bioinformatics (2018) 19:187 Page 2 of 8 Fig. 1 A schematic diagram to elucidate joint covariate detection Testing, joint covariate detection is faced with continu- for discrimination between patients with different sur- ous survival time other than labels representing different vival risks. In addition, we develop a joint covariate detec- categories of patients. tion tool for survival analysis on tumor expression profiles In this paper, we further divide the provided feature (i.e. JCDSA), which helps to conveniently select signifi- selection into two steps, i.e., selection of variables associ- cant features either on a cluster or a workstation, even ated with survival outcomes and further feature selection on a personal computer. Matlab R2012b and Python 3 Fig. 2 Selection of features associated with survival time Wu et al. BMC Bioinformatics (2018) 19:187 Page 3 of 8 Fig. 3 Selection of features for discriminating between two risk groups are utilized as the development platform. miRNA expres- regression coefficients of the detected variables, respec- sion data (Level 3) of 548 patients with GBM down- tively. The summation in the denominator is over all sub- loaded from TCGA (http://cancergenome.nih.gov)and jects in the risk set at ordered survival time t , denoted (i) the simulated data are considered to be the examples. by R(t ). z denotes a null statistics by a random rear- (i) Compared with the prevailing method named as random rangement of survival outcomes. The estimator of the survival forests (i.e. RSF), JCDSA shows better experimen- expected number of deaths in high-risk group is denoted n d 1i i tal results, which demonstrates the effectiveness of our by e ˆ ,expressed as e ˆ = ,where n and d repre- 1i 1i i i method. sent the number at risk and of deaths at the observation of ordered survival time t , n denotes the number at (i) 1i Implementation risk in high-risk group. The estimator of the variance of In order to elucidate joint covariate detection in brief, a d on the hypergeometric distribution is defined as v ˆ = 1i 1i n n d (n −d ) 1i 0i i i i schematic diagram is illustrated in Fig. 1 (Notations: x (i) ,where n denotes the number at risk in low- 0i n (n −1) and β denote the expression levels of sample i and the risk group. Q denotes a null statistics by a random rear- rangement of survival outcomes). Input data is considered as expression profiles with survival time and censoring Table 1 Individually significant miRNAs using joint covariate states of patients. Output data refers to selected features. detection (p<=0.001) Joint covariate detection corresponds to two-step feature miRNA probe β(Cox) Z(Cox) P(Cox) selection, i.e., selection of features associated with sur- hsa-miR-148a 0.192 4.607 <0.001 vival outcomes and selection of features for discriminating hsa-miR-17-3p -0.308 -3.321 <0.001 between two risk groups. hsa-miR-200a 0.465 3.563 <0.001 Features associated with survival outcomes hsa-miR-20a -0.177 -3.163 <0.001 We first consider to select features associated with sur- hsa-miR-221 0.284 5.396 <0.001 vival time. A bottom-up enumeration on k-tuple with k hsa-miR-222 0.246 6.332 <0.001 variables is made. As to each k-tuple, Cox proportional hsa-miR-340 -0.468 -3.498 <0.001 hazards regression analysis  is introduced. By making hsa-miR-34a 0.182 4.287 <0.001 the maximum partial likelihood estimation on the partial Wu et al. BMC Bioinformatics (2018) 19:187 Page 4 of 8 Table 2 Significant miRNAs in pairs using joint covariate detection (p<0.001) miRNA probe miRNA probe β(Cox) β(Cox) Z(Cox) Z(Cox) P(Cox) P(Cox) hsa-miR-10b hsa-miR-222 0.1412 0.3061 3.6472 7.1789 0.0004 <0.0001 hsa-miR-140 hsa-miR-148a -0.2450 0.1956 -3.3193 4.7179 0.0004 <0.0001 hsa-miR-143 hsa-miR-34a -0.2452 0.2326 -3.5230 5.2069 0.0004 <0.0001 hsa-miR-182 hsa-miR-204 -0.1186 0.1482 -3.4971 4.2846 0.0004 <0.0001 hsa-miR-340 hsa-miR-801 -0.7523 -0.2290 -4.7672 -4.0426 <0.0001 0.0002 hsa-miR-198 hsa-miR-671 0.6433 -0.6435 3.7746 -3.9295 0.0002 0.0002 hsa-miR-196a hsa-miR-20a 0.2191 -0.2120 3.4284 -3.6662 0.0007 0.0002 hsa-miR-340 hsa-miR-452 -0.7811 -0.2872 -4.8128 -3.6202 <0.0001 0.0003 hsa-miR-196a hsa-miR-20b 0.2159 -0.2582 3.3972 -3.6163 0.0008 0.0003 hsa-miR-196a hsa-miR-340 0.2115 -0.5325 3.2889 -3.8183 0.0010 0.0003 hsa-miR-374 hsa-miR-671 -0.3845 -0.2770 -4.1883 -3.5837 0.0002 0.0004 hsa-miR-140 hsa-miR-801 -0.3620 -0.2002 -4.2702 -3.6236 <0.0001 0.0005 hsa-miR-340 hsa-miR-671 -0.7553 -0.2512 -4.6673 -3.4952 0.0002 0.0005 hsa-miR-340 hsa-miR-765 -0.7652 -0.2524 -4.6791 -3.4679 <0.0001 0.0006 hsa-miR-17-5p hsa-miR-196a -0.2635 0.2226 -3.8666 3.4765 <0.0001 0.0006 hsa-miR-222 hsa-miR-422b 0.2911 -0.3619 7.0607 -3.5045 <0.0001 0.0007 hsa-miR-140 hsa-miR-671 -0.3948 -0.2333 -4.2886 -3.3077 <0.0001 0.0007 hsa-miR-340 hsa-miR-370 -0.7885 -0.1201 -4.6899 -3.4386 <0.0001 0.0007 hsa-miR-374 hsa-miR-663 -0.3226 -0.2551 -3.9265 -3.4033 0.0002 0.0007 hsa-miR-190 hsa-miR-374 0.9479 -0.2649 3.4665 -3.5370 0.0004 0.0007 hsa-miR-148a hsa-miR-30e-3p 0.2287 -0.3551 5.1831 -3.1949 <0.0001 0.0008 hsa-miR-374 hsa-miR-801 -0.2932 -0.1921 -3.7141 -3.4390 0.0005 0.0008 hsa-miR-374 hsa-miR-765 -0.3481 -0.2457 -3.9480 -3.2346 0.0002 0.0009 hsa-miR-30e-3p hsa-miR-663 -0.4564 -0.2517 -3.4388 -3.2166 0.0005 0.0009 hsa-miR-181c hsa-miR-675 -0.2618 -2.9279 -3.6755 -3.3646 0.0003 0.0010 hsa-miR-200b hsa-miR-487b 0.4543 0.2424 4.0048 3.2972 0.0007 0.0010 Fig. 4 Kaplan-Meier analysis Wu et al. BMC Bioinformatics (2018) 19:187 Page 5 of 8 Fig. 5 Risk score analysis likelihood function, we obtain k estimated regression as a candidate feature for discriminating between two risk coefficients on which Wald statistics are made. Further- groups. More details can be also seen in . more, a permutation test is made on each Wald statistic. The k-tuple with each component corresponding to a Brief overview of the software significant p value is regarded as a candidate feature asso- Our software, which is implemented in Matlab R2012b ciated with survival outcomes. More details can be seen or other later versions, can work on different compu- in . tational platforms (e.g., a cluster, a workstation, even a personal computer). Therefore, it contains two parts, Features for discriminating between two risk groups i.e., client and server. Selection of features associated We then intend to select features for discriminating with survival outcomes is accomplished by two Mat- between low-risk and high-risk group of patients, which lab m-files (i.e., ’/Client/S1_feature_selection.m’ and conforms to doctors’ daily decision making process. As ’/Server/S1_feature_selection_on_server.m’). A further to each patient, a risk score which is the linear portion selection of features for stratification of patients is of the expression values using the Cox regression coef- fulfilled by a Matlab m-file ’Client/S2_plot_draw.m’. ficients is calculated. A preassigned risk score is utilized If this program is implemented on a workstation or as a cut-off value for stratification between high-risk and a personal computer, only the client part is needed. low-risk group of patients. Log-rank test is made. Fur- That is to say, users only need to concentrate on thermore, a permutation test is presented on each tuple, two GUIs (i.e., ’/Client/S1_feature_selection.m’ and which has been selected to be associated with survival out- ’Client/S2_plot_draw.m’) on the client part. Otherwise, comes. The k-tuple with a significant p value is regarded the server part is also in demand. Data communications and environment configurations are actualized using Python 3. More details can be seen in the user’s guide on Table 3 Significant miRNAs using random survival forests (VIMP the website: http://bio-nefu.com/resource/jcdsa. score>=0.001) miRNA probe VIMP score hsa-miR-222 0.0103 Table 4 Individually significant miRNAs using joint covariate detection on the simulated data (p<=0.05) hsa-miR-148a 0.0027 hsa-miR-30d 0.0012 miRNA probe β(Cox) Z(Cox) P(Cox) hsa-miR-27a 0.0011 miRNA-alternative 1 4.739 5.929 <0.001 hsa-miR-422b 0.0011 miRNA-null 33 -0.3583 -1.9486 0.023 Wu et al. BMC Bioinformatics (2018) 19:187 Page 6 of 8 Table 5 Significant miRNAs in pairs using joint covariate detection on the simulated data (p<=0.001) miRNA probe miRNA probe β(Cox) β(Cox) Z(Cox) Z(Cox) P(Cox) P(Cox) miRNA-alternative 1 miRNA-alternative 2 7.6975 0.8455 5.1236 3.6895 <0.001 <0.001 Results variable importance (VIMP) for each variable. The result According to the presented two-step feature selection is listed in Table 3. strategy, we first consider selecting features associated After making careful comparisons between Tables 2 with survival outcomes. Figure 2 illustrates this step. Can- and 3, we find that miR-10b is still unimportant, as it cer type can be selected or input by clicking the right is not listed in Table 3. This phenomenon reveals the side arrow if it is not supported in the type list. Other advantage of using joint covariate detection other than selections in the setting frame can be also made, details RSF. In fact, the individually significant miR-222 keeps of which are listed in user’s guide. Before running at a p=0.0012 corresponding to log-rank test with 10000 full speed, JCDSA estimates the finishing seconds which rounds of permutation. As to significant pair (i.e., miR- helps to make a further decision. After its completion, the 222 and miR-10b), it keeps a p=0.0002 which corresponds result which records p value(s) of each k-tuple is stored to log-rank test with 10000 rounds of permutation. As in ’/Client/Data/S1’. Figure 3 further illustrates the step of to miR-10b, it keeps a p=0.285, which is individually selecting features associated with survival outcomes (i.e., insignificant. Step 2.1). By setting the threshold of the p value corre- We simulated data under 40 independent dimensions, sponding to permutation test on Wald statistic, features from which we assigned two to be significant. That is, associated with survival outcomes are selected. the survival time S is defined as S = exp(−Xβ + ε), Using the miRNA expression data (Level 3) of 548 where X is the simulated gene expression matrix and β = patients with GBM as an example, individually signifi- [ 0.9, 0.1, 0.001, ..., 0.001] denotes the coefficient param- cant miRNAs and significant miRNAs in pairs are listed in eter. ε ∼ N (0, 2). The sample size n is 50. The censoring Tables 1 and 2, respectively. After making careful compar- states are generated, and yield 10 percent censoring for the isons between Tables 1 and 2,weconcludethatsignificant simulated data. features in high dimension may not be composed of indi- The experimental results on simulated data are listed in vidually significant miRNAs. Taking the significant pair Tables 4, 5 and 6, respectively. The significant pair closely miR-10b and miR-222 as an example, miR-10b is not listed associated with simulated survival outcomes are selected in Table 1, which shows that it is not individually signif- out, as shown in Table 5. In contrast, miRNA-alternative icant. This phenomenon reveals the advantage of using 2 which is in absence in Table 4 shows insignificant joint covariate detection. (p=0.939), and illustrates less important in Table 6.These Together, Figs. 3, 4 and 5 illustrate the feature selection results demonstrate the effectiveness of our method. The step for discriminating between two risk groups. In Fig. 3, simulated data and full tables corresponding to Tables 4, 5 after choosing the files that represent the original data and and 6 can be downloaded on the website: http://bio-nefu. the result corresponding to significant features associated com/resource/jcdsa. with survival time at Step 2.2, the software runs to Step 2.3 In order to show that selected variables are improba- and Step 2.4. ble false positive or false negative ones, we repeated the As shown in Fig. 4, Kaplan-Meier analysis with param- simulations above for 100 times with an enlarged sam- eters derived from log-rank test and Harrell’s con- plesize(n=500). The experimental results are illustrated cordance index is made for further selection of fea- in Fig. 6.Figure 6a denotes the p values (p < 1e − 3) tures, which helps to discriminate between high-risk of the significant pair through 100 times of simulation. and low-risk group of patients. Meanwhile, the result However, miRNA-alternative 2 individually shows less of risk score analysis is illustrated in Fig. 5.Corre- important, as illustrated in Fig. 6b. Comparisons between spondingly, results which refer to significant features are stored in ’Client/Data/S2/S2_3’ and ’Client/Data/S2/S2_4’, Table 6 Significant miRNAs using random survival forests on the respectively. simulated data (VIMP score>=0.001) In order to show the effectiveness of our method, we miRNA probe VIMP score implemented the prevailing method named as random survival forests (i.e. RSF) on the miRNA expression data miRNA-alternative 1 0.1887 (Level 3) of 548 patients with GBM for comparison. 1000 miRNA-null 32 0.0016 binary survival trees were made, with each terminal node miRNA-alternative 2 0.0013 containing a minimum of d =10 unique deaths. We made miRNA-null 10 0.0013 1000 permutations on each variable, and obtained the Wu et al. BMC Bioinformatics (2018) 19:187 Page 7 of 8 Fig. 6 Simulation results. a p values of the significant pair through 100 times of simulation. b p values of the significant individual through 100 times of simulation. c The number of positive pairs through 100 times of simulation. d The number of positive individuals through 100 times of simulation Fig. 6a and b indicate that the significant features are survival outcomes but also associated with stratification probably not composed of individually significant uni- of patients under different survival risks. This fact has variables. Figure 6c and d report the number of positive been demonstrated by our experimental results in this pairs and individuals through 100 times of simulation, paper. Second, components of each significant multi- respectively. No false negative results are discovered. In variable may keep a low correlation. This phenomenon Fig. 6c, the maximum number of false positive pair is has been discovered when experiments on the simulated three, which indicates a small probability of false pos- data were made. Further evidence is still needed. Third, itive pair 0.0038 (i.e., 3/C ). As to Fig. 6d,the maxi- the correction for multiple hypothesis testing is absent, mum number of false positive individual is also three; considering the computational cost of calculating FDR, yet, the probability of false positive individual is 0.075 q value, the adjusted p values, etc. on each pair or each (i.e., 3/40). high-dimension tuple of variables. However, simulations are made, which demonstrate the effectiveness of our Discussion method. There are several states needed to be discussed. First, it is the significant multi-variable other than combina- Conclusion tions of individually significant uni-variables that con- Our joint covariate detection for survival analysis pro- tributes to selection of features not only consistent with vides a new viewpoint for selecting variable candidates Wu et al. BMC Bioinformatics (2018) 19:187 Page 8 of 8 which are not individually but jointly significant. Follow- Received: 20 August 2017 Accepted: 21 May 2018 ing a two-step variable selection strategy, we propose a software (i.e., JCDSA) in order to help users to select fea- References tures which are not only consistent with survival time but 1. Sun CQ, Zhao XD. Joint covariate detection on expression profiles for also associated with prognosis risk. JCDSA can be adapted selecting prognostic mirnas in glioblastoma. Biomed Res Int. 2017;2:1–10. 2. Cox DR. Regression models and life tables (with discussion). J R Stat Soc for many categories of cancer. Users can easily operate Series B. 1972;34:187–220. it and conveniently obtain the experimental results for 3. Li H, Gui J. Partial cox regression analysis for high-dimensional microarray subsequent biological experimental validation. gene expression data. Bioinformatics. 2004;20:208–15. 4. Li L, Li H. Dimension reduction methods for microarrays with application to censored survival data. Bioinformatics. 2004;20:3406–12. Availability and requirements 5. Wallace ML. Time-dependent tree-structured survival analysis with Project name:JCDSA unbiased variable selection through permutation tests. Stat Med. 2013;33:4790–804. Project home page: http://bio-nefu.com/resource/jcdsa 6. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival Operating system(s): Linux, Windows forests. Ann Appl Stat. 2008;2:841–60. Programming language:Matlab(≥R2012b), Python 7. Kawaguchi A, Yajima N, Tsuchiya N, Homma J, Sano M, Natsumeda M, Takahashi H, Fujii Y, Kakuma T, Yamanaka R. Gene expression (≥ 3.0) signature-based prognostic risk score in patients with glioblastoma. License:GPL (≥2) Cancer Sci. 2013;104:1205–10. Any restrictions to use by non-academics:none 8. Gui J, Li HZ. Penalized cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene Abbreviations expression data. Bioinformatics. 2005;21:3001–8. GUI: Graphical user interface; GBM: Glioblastoma multiforme; JCDSA: Joint 9. Evers L, Messow CM. Sparse kernel methods for high-dimensional covariate detection for survival analysis; TCGA: The Cancer Genome Atlas survival data. Bioinformatics. 2008;24:1632–8. 10. Xu L. Bi-linear matrix-variate analyses, integrative hypothesis tests, and Acknowledgements case-control studies. Appl Inform. 2015;2:1–39. Not applicable. Funding This work has been supported by the financial support of Fundamental Research Funds for the Central Universities (No. 2572018BH01), National Undergraduate Innovation Project (No. 201610225050) and Specialized Personnel Start-up Grant (Also National Construction Plan of World-class Universities and First-class Disciplines, No. 41113237). The funding body of Fundamental Research Funds for the Central Universities played an important role in the design of the study, collection, analysis and interpretation of data and in writing the manuscript. Availability of data and materials The dataset analysed during the current study is available in the TCGA repository, http://cancergenome.nih.gov. The simulated data can be downloaded on http://bio-nefu.com/resource/jsdca. Authors’ contributions 1 3 XDZ conceived the general project and supervised it. YMW ,YNL,YMW and XDZ were the principal developers. YMW has rewritten almost all the front-end codes and has majorly made the revision on the manuscript. YNL has made the supplementary experiments on new simulated data, which helps to illustrate the effectiveness of JCDSA on avoiding false positives. YS tested the software and made the improvement. XDZ wrote the underlying source code and the original manuscript. All authors read and approved the final manuscript. Ethics approval and consent to participate Not applicable. Competing interests The authors declare that they have no competing interests. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Author details College of Information and Computer Engineering, Northeast Forestry University, No.26 Hexing Road, 150001 Harbin, China . College of Foreign Languages, Northeast Forestry University, No.26 Hexing Road, 150001 Harbin, China .
– Springer Journals
Published: May 29, 2018