REFERENCES
E. Bair, T. Hastie, D. Paul, R. Tibshirani (2006)
Prediction by supervised principal components. Journal of the American Statistical Association, 101
A. Rosenwald, George Wright, W. Chan, J. Connors, E. Campo, R. Fisher, R. Gascoyne, H. Muller-Hermelink, E. Smeland, J. Giltnane, E. Hurt, Hong Zhao, Lauren Averett, Liming Yang, W. Wilson, E. Jaffe, R. Simon, R. Klausner, J. Powell, P. Duffey, D. Longo, T. Greiner, D. Weisenburger, W. Sanger, B. Dave, J. Lynch, J. Vose, J. Armitage, E. Montserrat, A. López-Guillermo, T. Grogan, T. Miller, M. LeBlanc, G. Ott, S. Kvaløy, J. Delabie, H. Holte, P. Krajči, T. Stokke, L. Staudt (2002)
The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. The New England Journal of Medicine, 346(25)
E. Lander (1999)
Array of hope. Nature Genetics, 21(Suppl 1)
M. Segal (2006)
Microarray gene expression data with linked survival phenotypes: diffuse large-B-cell lymphoma revisited. Biostatistics, 7(2)
L. Kaderali, T. Zander, U. Faigle, J. Wolf, J. Schultze, R. Schrader (2006)
CASPAR: a hierarchical Bayesian approach to predict survival times in cancer from gene expression data. Bioinformatics, 22(12)
Jianqing Fan, Runze Li (2002)
Variable selection for Cox's proportional hazards model and frailty model. Annals of Statistics, 30
Song Liu (2008)
Variable selection in semi-parametric additive models with extensions to high dimensional data and additive Cox models
T. Sørlie, R. Tibshirani, Joel Parker, T. Hastie, J. Marron, A. Nobel, Shibing Deng, H. Johnsen, Robert Pesich, S. Geisler, J. Demeter, C. Perou, P. Lønning, P. Brown, A. Børresen-Dale, D. Botstein (2003)
Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences of the United States of America, 100
P. Heagerty, T. Lumley, M. Pepe (2000)
Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics, 56
J. Gui, Hongzhe Li (2005)
Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics, 21(13)
R. Tibshirani (1997)
The lasso method for variable selection in the Cox model. Statistics in Medicine, 16(4)
B. Efron, T. Hastie, I. Johnstone, R. Tibshirani (2004)
Least angle regression. Annals of Statistics, 32
Mee Park, T. Hastie (2007)
L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69
M. Yuan, Yi Lin (2006)
Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68
Hongzhe Li (2006)
Censored data regression in high-dimension and low-sample size settings for genomic applications
K. Owzar, Sin-Ho Jung, P. Sen (2007)
A copula approach for detecting prognostic genes associated with survival outcome in microarray studies. Biometrics, 63
Hao Zhang, Wenbin Lu (2007)
Adaptive lasso for Cox's proportional hazards model. Biometrika, 94
Shuangge Ma, Jian Huang (2005)
LASSO method for additive risk models with high dimensional covariates
D. R. Cox (1972)
Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34
Jinseog Kim, Yuwon Kim, Yongdai Kim (2008)
A gradient-based optimization algorithm for LASSO
H. Zou (2008)
A note on path-based variable selection in the penalized proportional hazards model
M. Akritas (1994)
Nearest neighbor estimation of a bivariate distribution under random censoring. Annals of Statistics, 22
E. Bair, R. Tibshirani (2004)
Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology, 2
Yongdai Kim, Jinseog Kim (2004)
Gradient LASSO for feature selection. Proceedings of the Twenty-First International Conference on Machine Learning
J. Gui, Hongzhe Li (2004)
Threshold gradient descent method for censored data regression with applications in pharmacogenomics. Pacific Symposium on Biocomputing
J. Goeman (2008)
An efficient algorithm for L1-penalized estimation
BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 14 2009, pages 1775–1781 doi:10.1093/bioinformatics/btp322

Gene expression

Gradient lasso for Cox regression

Insuk Sohn 1, Jinseog Kim 2, Sin-Ho Jung 1 and Changyi Park 3,∗
1 Department of Biostatistics & Bioinformatics, Duke University, NC 27705, USA, 2 Department of Statistics and Information Science, Dongguk University, Gyeongju 780-714 and 3 Department of Statistics, University of Seoul, Seoul 130-743, Korea

Received on February 9, 2009; revised on May 7, 2009; accepted on May 12, 2009
Advance Access publication May 15, 2009
Associate Editor: John Quackenbush

ABSTRACT
Motivation: There has been an increasing interest in expressing a survival phenotype (e.g. time to cancer recurrence or death) or its distribution in terms of the expression data of a subset of genes. Due to the high dimensionality of gene expression data, however, there is a serious problem of collinearity in fitting a prediction model, e.g. Cox's proportional hazards model. To avoid the collinearity problem, several methods based on penalized Cox proportional hazards models have been proposed. However, those methods suffer from severe computational problems, such as slow or even failed convergence, because of the high-dimensional matrix inversions required for model fitting. We propose to implement the penalized Cox regression with a lasso penalty via the gradient lasso algorithm, which yields faster convergence to the global optimum than other algorithms. Moreover, the gradient lasso algorithm is guaranteed to converge to the optimum under mild regularity conditions. Hence, our gradient lasso algorithm can be a useful tool in developing a prediction model based on high-dimensional covariates including gene expression data.
Results: Results from simulation studies showed that the prediction model by gradient lasso recovers the prognostic genes. Also, results from diffuse large B-cell lymphoma datasets and the Norway/Stanford breast cancer dataset indicate that our method is very competitive with the popular existing methods of Park and Hastie and of Goeman in its computational time, prediction and selectivity.
Availability: R package glcoxph is available at http://datamining.dongguk.ac.kr/R/glcoxph.
Contact: park463@uos.ac.kr

1 INTRODUCTION
DNA microarray is a biotechnology allowing simultaneous monitoring of tens of thousands of gene expressions in cells (Brown and Botstein, 1999). It has important applications in pharmaceutical and clinical research, including tumor classification, molecular pathway modeling and functional genomics. The identification of genes related to survival may provide new information on pathogenesis and etiology, and may further aid in the search for new targets for drug design.
There has been an increasing interest in relating gene expression profiles to survival phenotypes such as time to cancer recurrence or death. Recent works include semi-supervised methods (Bair and Tibshirani, 2004), supervised principal components (Bair et al., 2006), a Cox regression based on threshold gradient descent (Gui and Li, 2005a), the LARS-Cox (Gui and Li, 2005b), the residual finesse (Segal, 2006) and cancer survival prediction using automatic relevance determination (Kaderali et al., 2006).
In particular, we are interested in the Cox regression (Cox, 1972) because of its popularity in the analysis of survival data with censoring. However, due to the high dimensionality of gene expression data, there is a serious problem of collinearity in applying the Cox proportional hazards model directly. To avoid the collinearity problem, several methods based on penalized Cox proportional hazards models have been proposed. Fan and Li (2002) adopted the smoothly clipped absolute deviation (SCAD) penalty in the Cox model, and Zou (2008) proposed a path-based variable selection method to construct an adaptive sparse shrinkage path of the coefficients. Because the SCAD penalty is non-convex, optimization with the SCAD penalty is a non-convex problem. Zhang and Lu (2007) suggested an adaptive lasso method with an adaptively weighted penalty on the coefficients. Note that the adaptive lasso has an extra tuning parameter, so its computation is much heavier than usual one-step estimations. Park and Hastie (2007) generalized the LARS algorithm to generalized linear models and the Cox proportional hazards model.
The LARS-Cox procedure of Gui and Li (2005b) suffers from severe computational problems in its optimization because it requires the inversion of high-dimensional matrices, as discussed in Segal (2006) and Goeman (2008). To remedy the computational problem of the LARS-Cox, Segal (2006) proposed a method, called the residual finesse, that speeds up the computation by replacing the survival times with the deviance residuals, and Goeman (2008) proposed to combine gradient descent with the Newton–Raphson algorithm.
In this article, we apply the gradient lasso algorithm proposed by Kim et al. (2008) to the Cox proportional hazards model with a lasso penalty. Unlike the algorithm in Gui and Li (2005b), the gradient lasso algorithm does not require a matrix inversion, so it is scalable to high-dimensional data and yields faster convergence to the global optimum. Theorem 1 in Kim et al. (2008) guarantees the convergence of the gradient lasso estimator to the optimum under mild regularity conditions. Unlike other methods such as Gui and Li (2005b), Segal (2006) and Zou (2008), the gradient lasso yields a solution to the exact penalized partial likelihood, not to an approximate penalized partial likelihood. Hence, our method can be a useful alternative to those methods in building a statistical model
∗To whom correspondence should be addressed.
© The Author 2009. Published by Oxford University Press. All rights reserved.
For Permissions, please email: journals.permissions@oxfordjournals.org
to predict the patient's survival time based on high-dimensional gene expression data.
A recent method proposed by Park and Hastie (2007) provides the entire penalization path for the Cox model in a forward stage-wise manner. Both their method and our gradient lasso act only on active sets. Because Park and Hastie (2007) require matrix inversions only on active sets, their computation is faster and more stable than other methods in the literature. Ma and Huang (2005) adopted the earlier version of the gradient lasso by Kim and Kim (2004). In this article, we adopt the refined version of Kim and Kim (2004), which adds a deletion step to improve the computational speed and to provide more stable solutions than the earlier version. Another recent method, by Goeman (2008), is a gradient algorithm with an option of switching to the Newton–Raphson algorithm to avoid slow convergence. Both our method and the algorithm by Goeman (2008) are gradient algorithms. However, our method directly solves the optimization problem of the lasso penalized Cox model in the constrained form, whereas the methods of Park and Hastie (2007) and Goeman (2008) solve the equivalent unconstrained form of the optimization problem using a Lagrange multiplier.
The rest of the article is organized as follows. In Section 2, we describe the Cox proportional hazards model with the lasso penalty and present the gradient lasso algorithm for the model. We also briefly review Park and Hastie (2007). In Section 3, we compare the methods of Park and Hastie (2007) and Goeman (2008) with the gradient lasso on simulated and real microarray datasets. Finally, we give brief discussions of our method and results.
The tuning parameters of the penalized methods below can be determined, for example, by cross-validated partial likelihood (CVPL), as in Gui and Li (2005b) and Segal (2006). Tibshirani (1997) proposed an iterative procedure to solve the L1-constrained problem (2) of Section 2. Let X = (x_1, …, x_N) denote the p×N gene expression matrix, and let η = β^T X, µ = −∂l/∂η, A = −∂²l/∂η∂η^T and z = η + A⁻µ, where A⁻ is a generalized inverse of A. By a Taylor expansion of l(β), the partial log-likelihood is approximated by

(z − η)^T A (z − η).   (4)

Tibshirani (1997) suggested replacing A with a diagonal matrix D having the same diagonal elements as A and solving (4) iteratively using quadratic programming; see Tibshirani (1997) for more details on the iterative procedure. Since quadratic programming cannot be applied directly to cases with p ≫ N, Gui and Li (2005b) applied the Choleski decomposition T = A^{1/2} with T^T T = A. Note that, with the restriction rank(A) = n−1 in place, the quadratic form (4) is invariant to the choice of the generalized inverse. The LARS-Cox procedure can then be solved using the LARS procedure of Efron et al. (2004); see Gui and Li (2005b) for more details.
As noted earlier, these algorithms can be very computationally intensive and may sometimes fail to converge to the optimum because they involve matrix inversions. Other methods in the literature, such as Segal (2006) and Zou (2008), speed up the computation by solving approximate objective functions. The gradient lasso algorithm presented below solves the optimization problem (2) directly and does not resort to a matrix inversion; thus, the gradient lasso can be a useful alternative. Another recent method, that of Goeman (2008), is a gradient algorithm solving the Lagrange form (3). While the gradient lasso updates a single coordinate at a time, the method in Goeman (2008) utilizes the full gradient at each step and can switch to a Newton–Raphson algorithm to speed up its convergence around the optimum. Like the gradient lasso, the algorithms in Park and Hastie (2007) and Goeman (2008) are scalable to high-dimensional data. Interested readers may see Park and Hastie (2007) and Goeman (2008) for more details on the algorithms.
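As a small numerical illustration of the quadratic approximation above (z = η + A⁻µ with A replaced by its diagonal), the following sketch computes µ, the diagonal of A and z for the partial log-likelihood written as a function of η. All function names are ours, and central finite differences merely stand in for the analytic derivatives; this is an illustrative sketch, not the authors' implementation:

```python
import math

def partial_loglik_eta(eta, delta, risk_sets):
    # l(eta) = sum_i delta_i * (eta_i - log(sum_{j in R_i} exp(eta_j)))
    return sum(d * (e - math.log(sum(math.exp(eta[j]) for j in R)))
               for e, d, R in zip(eta, delta, risk_sets))

def quadratic_pieces(eta, delta, risk_sets, h=1e-4):
    """mu = -dl/deta, D = diag(-d^2 l/deta^2) and z = eta + D^{-1} mu,
    all by central finite differences (a stand-in for the analytic forms)."""
    n = len(eta)
    l0 = partial_loglik_eta(eta, delta, risk_sets)
    mu, D, z = [0.0] * n, [0.0] * n, [0.0] * n
    for k in range(n):
        up, dn = eta[:], eta[:]
        up[k] += h
        dn[k] -= h
        lp = partial_loglik_eta(up, delta, risk_sets)
        lm = partial_loglik_eta(dn, delta, risk_sets)
        mu[k] = -(lp - lm) / (2 * h)
        D[k] = -(lp - 2 * l0 + lm) / h ** 2
        z[k] = eta[k] + (mu[k] / D[k] if abs(D[k]) > 1e-8 else 0.0)
    return mu, D, z

def quadratic_form(eta, z, D):
    # (z - eta)^T D (z - eta) for diagonal D, i.e. the approximation (4)
    return sum(d * (zi - e) ** 2 for zi, e, d in zip(z, eta, D))
```

With risk sets R_i = {j : t_j ≥ t_i}, iteratively minimizing this quadratic form in β (through η = β^T X) amounts to a weighted least-squares fit of z on the covariates with weights D, which is where the quadratic programming of Tibshirani (1997) enters.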
2 METHODS

2.1 Cox model with lasso penalty
We investigate the relationship between the survival time and covariates such as gene expression levels and clinical variables. For patient i (= 1,…,N), we observe (t_i, δ_i, x_i1, …, x_ip), where δ_i is the censoring indicator taking 1 if an event is observed and 0 otherwise, t_i denotes the survival time if δ_i = 1 or the censoring time otherwise, and x_i = (x_i1, …, x_ip)^T is the vector of covariates. By Cox's proportional hazards model, the hazard function for patient i is given as

λ_i(t) = λ_0(t) exp(x_i^T β),   (1)

where λ_0(t) is an unknown baseline hazard function and β = (β_1, …, β_p)^T is the regression coefficient vector. Then the partial log-likelihood (Cox, 1972) is expressed as

l(β) = Σ_{i=1}^{N} δ_i [ x_i^T β − log( Σ_{j∈R_i} exp(x_j^T β) ) ],

where R_i is the set of indices of the patients at risk at time t_i−.
Typically for microarray data we have p ≫ N, so that there exists a serious collinearity problem when applying the partial likelihood estimation to the Cox model directly. Tibshirani (1997) proposed to estimate the parameters in (1) under the L1 constraint:

β̂ = argmax l(β), subject to ||β||_1 ≤ s,   (2)

where ||·||_1 denotes the L1 norm and s is a positive number. The optimization problem (2) is good for dimension reduction of covariates, but is computationally difficult because the L1 objective function is not differentiable (Fan and Li, 2002).
The equivalent Lagrange multiplier version of (2) is

β̂ = argmin { −l(β) + λ||β||_1 }   (3)

for λ > 0. Equation (3) can be interpreted as the posterior mode of β under a double exponential prior with parameter λ. The tuning parameters s and λ can be determined by cross-validated partial likelihood, as noted in Section 1.

2.2 Gradient lasso algorithm
Kim et al. (2008) proposed the gradient lasso algorithm, which is a refined version of the earlier algorithm of Kim and Kim (2004). The algorithm can be applied to general convex loss functions. Moreover, the algorithm is scalable to high-dimensional data because it requires only a univariate optimization at each iteration, and its convergence speed is independent of the number of input variables.
For a given convex function C(β) of β ∈ R^p, consider the minimization of C(β) on S = {β : ||β||_1 ≤ s}. Note that C(β) = −l(β) in the lasso penalized Cox model. The gradient lasso solves the minimization problem by iteratively updating its solution through addition and deletion steps.
In the addition step, a coordinate vector v at which C(β) decreases most rapidly is sought. Let ∇C(β) = (∂C(β)/∂β_1, …, ∂C(β)/∂β_p)^T be the gradient of C(β) with respect to β. By Taylor expansion, we have

C((1−α)β + αv) ≈ C(β) + α ∇C(β)^T (v − β)

for v ∈ S and α ∈ [0,1], and

min_{v∈S} ∇C(β)^T v = s × min_{j=1,…,p} min( ∂C(β)/∂β_j, −∂C(β)/∂β_j ).

Let

ĵ = argmax_{j=1,…,p} |∂C(β)/∂β_j|  and  γ̂ = −s × sign( ∂C(β)/∂β_ĵ ).

Let v̂ be the vector whose ĵ-th coordinate is γ̂ and whose other coordinates are zero. The current solution β is updated to (1−α̂)β + α̂v̂, where the weight α̂ is determined by

α̂ = argmin_{α∈[0,1]} C((1−α)β + αv̂).   (5)

Since we take a convex combination, the updated solution always satisfies the constraint S.
The earlier version of the algorithm (Kim and Kim, 2004), consisting of only the addition step, may converge slowly near the optimum. To resolve this problem, Kim et al. (2008) introduced the deletion step, which speeds up the algorithm by updating all the non-zero coordinates of β simultaneously. The active set σ is defined as σ = {j : β_j ≠ 0}. Denote the current solution and the corresponding gradient restricted to the active set by β_σ and ∇C(β_σ), respectively. Then we move β_σ in the direction of −∇C(β_σ). If the current solution is placed on the surface of S with redundant non-zero coefficients, such a move may not satisfy the constraint S. In order to make the move satisfy the constraint, we replace −∇C(β_σ) by

h_σ = −∇C(β_σ) + θ_σ ( ∇C(β_σ)^T θ_σ ) / |σ|,   (6)

where θ_σ is the sign vector of β_σ and |σ| is the cardinality of σ. Let U = min_{k∈σ} { −β_k/h_k : β_k h_k < 0 } and let P denote the permutation matrix that collects the non-zero elements of β in the first |σ| elements. Then we move β_σ along h_σ until one of the non-zero coordinates becomes zero. That is, β is updated as β + δh, where

δ = argmin_{δ∈[0,U]} C(β + δh)   (7)

and h = P^T (h_σ^T, 0^T)^T. When δ = U, one of the non-zero coordinates is deleted. Figure 1 summarizes the gradient lasso algorithm.

1. Initialize: β = 0 and m = 0.
2. Do until convergence:
   (a) Addition:
      i. Compute the gradient ∇C(β).
      ii. Find the ĵ maximizing |∂C(β)/∂β_j| for j = 1,…,p and set γ̂ = −s × sign(∂C(β)/∂β_ĵ).
      iii. Let v̂ be the p-dimensional vector whose ĵ-th element is γ̂ and whose other elements are zero.
      iv. Find α̂ = argmin_{α∈[0,1]} C((1−α)β + αv̂).
      v. Update β = (1−α̂)β + α̂v̂.
   (b) Deletion:
      i. Calculate h as in (6).
      ii. Find δ as in (7).
      iii. Update β = β + δh.
   (c) Set m = m + 1.
3. Return β.

Fig. 1. The gradient lasso algorithm.

2.3 Performance measures
In the microarray survival literature, the predictive performance of a model is typically assessed as follows: (i) risk scores based on the fitted model are computed for patients in a (withheld) test dataset; (ii) strata (usually two) are created by thresholding these scores; and (iii) log-rank testing of between-strata survival differences is performed. The greater the achieved significance, the more predictive the model is deemed to be. In our data analysis, the threshold value of the predictive score is set to zero for dividing patients into low- and high-risk groups. Limitations of this approach include not just the arbitrariness of the imposed stratification but, more importantly, the familiar shortcoming of P-values not necessarily capturing effect size/explained variation. A more refined approach is afforded by the use of time-dependent ROC curves, proposed by Heagerty et al. (2000) and used in the present context by Gui and Li (2005b). Let f(X) be the predictive model, where X = (X_1, …, X_p) denotes a p-dimensional random vector measuring the gene expression levels of an individual. Define time-dependent sensitivity and specificity functions at a cutoff point c as

sensitivity(c, t | f(X)) = P( f(X) > c | D(t) = 1 ),   (8)
specificity(c, t | f(X)) = P( f(X) ≤ c | D(t) = 0 ),   (9)

where D(t) is the event indicator at time t. Throughout the article, the cutoff point c is set to zero as in Gui and Li (2005b) and Segal (2006). The corresponding time-dependent ROC curve at time t, ROC(t | f(X)), is then just the plot of sensitivity(c, t | f(X)) versus 1 − specificity(c, t | f(X)) over different values of the cutoff point c. We adopt the analogous area under the time-dependent ROC(t | f(X)) curve, denoted AUC(t | f(X)), in Section 3. In order to estimate the conditional probabilities in (8) and (9) while accounting for possible censoring, we employ a nearest neighbor estimator of the bivariate distribution function (Akritas, 1994), as in Heagerty et al. (2000).

3 RESULTS
We compared the performance of a recent algorithm by Park and Hastie (2007), called the coxpath; the gradient algorithm by Goeman (2008), called the penalized; and our gradient lasso algorithm, called the glcoxph. Note that the coxpath and the penalized are available in the R packages glmpath and penalized, respectively. The glcoxph package in R is available upon request from the authors. The glcoxph directly solves the constrained optimization problem (2), while the coxpath and the penalized solve the equivalent unconstrained optimization (3).
Although there exists a one-to-one correspondence between the tuning parameters s in (2) and λ in (3), there is no closed conversion formula between them. Hence, we used different grids for tuning the coxpath and the glcoxph. For the coxpath and the penalized, we conducted a grid search. More specifically, we extracted the maximum value λ_max of λ after fitting the coxpath on the training dataset, obtained 100 grid points by dividing the interval 1 ≤ λ ≤ λ_max equally on the log scale, and used those points in tuning. Note that the coefficients for λ ≥ λ_max are almost the same, and thus the penalization path is taken to be flat over that range of λ in the coxpath. For the glcoxph, the grid for the tuning parameter s was 0.05, 0.1, …, 5, as used in previous studies such as Gui and Li (2005b). We adopted 5-fold CVPL as our criterion for the selection of the tuning parameter.

3.1 Real microarray data
The diffuse large B-cell lymphoma (DLBCL) dataset consists of 240 samples from patients having DLBCL, with expression measurements on 7399 genes. The clinical outcome was survival time, either observed or censored. We analyzed two different sets of the DLBCL data: one from Rosenwald et al. (2002) and the other from Bair et al. (2006). Both datasets consist of 160 training samples and 80 test samples.
Table 1 lists the genes selected by the glcoxph on the datasets from Rosenwald et al. (2002) and Bair et al. (2006). For the dataset defined in Rosenwald et al. (2002), eight genes were selected by the glcoxph. Four of those genes belong to three of the gene expression signature groups defined in Rosenwald et al. (2002): the Germinal-center B-cell signature, the MHC class II signature and the Lymph-node signature. Note that these four genes belong to the 10 genes selected by the LARS-Cox in Gui and Li (2005b) and are in fact the top four genes selected by the residual finesse in Segal (2006). The other four genes do not belong to the signature groups defined in Rosenwald et al. (2002). Among those four genes, AA729003, a TCL1A (T-cell leukemia/lymphoma 1A) protein-coding gene, and AA760674, a COX15 homolog, were also selected by the LARS-Cox in Gui and Li (2005b). The glcoxph selected five genes from the dataset of Bair et al. (2006). Two of those genes belong to two of the gene expression signature groups defined in Rosenwald et al. (2002): the Germinal-center B-cell signature and the MHC class II signature.

Table 1. Genes selected by the glcoxph on DLBCL datasets

Rosenwald et al. (2002)
  GenBank ID   β        Signature   Description
  AA805575     −0.190   Germ        Thyroxine-binding globulin precursor
  X00452       −0.163   MHC         Major histocompatibility complex, class II, DQ alpha 1
  LC_29222     −0.148   Lymph
  X59812       −0.076   Lymph       Cytochrome P450, subfamily XXVIIA polypeptide 1
  AA729003     −0.062               T-cell leukemia/lymphoma 1A
  M15800       −0.052               Mal, T-cell differentiation protein
  AA291844      0.016               Immunoglobulin kappa constant
  AA760674      0.111               COX15 homolog, cytochrome c oxidase assembly protein (yeast)

Bair et al. (2006)
  AA805575     −0.609   Germ        Thyroxine-binding globulin precursor
  M81750       −0.117               Myeloid cell nuclear differentiation antigen
  AA729055     −0.086   MHC         Major histocompatibility complex, class II, DR alpha
  AA809474      0.042               Immunoglobulin heavy constant mu
  AA485725      0.140               Immunoglobulin kappa constant

Table 2 summarizes the number of selected genes and the partial log-likelihood achieved by the coxpath, the glcoxph and the penalized on the DLBCL training datasets. For the training data from Rosenwald et al. (2002), the coxpath, the glcoxph and the penalized selected 14, 8 and 16 genes, respectively, and seven genes were selected by all the methods. The numbers of genes selected only by the coxpath, the glcoxph and the penalized were 1, 1 and 3, respectively. Interestingly, all the methods selected the same five genes on the training data from Bair et al. (2006). Our method yielded a sparser model than the other methods on both DLBCL datasets. Comparing the values of l(β) on the training data, we observe that the glcoxph yielded a slightly larger value of l(β) than the other methods. The optimal values of the tuning parameters on the training dataset from Rosenwald et al. (2002) were 0.8, 25.2068 and 23.1014 for the glcoxph, the coxpath and the penalized, respectively. For the second training dataset, from Bair et al. (2006), the optimal values were 0.2 for the glcoxph and 31.2733 for the coxpath and the penalized.

Table 2. The number of genes selected and the partial log-likelihood on DLBCL training datasets

               Rosenwald et al. (2002)            Bair et al. (2006)
               coxpath    glcoxph    penalized    coxpath    glcoxph    penalized
No. of genes   14         8          16           5          5          5
l(β)           −391.1635  −383.8169  −387.0456    −432.3464  −429.5923  −432.3153

We compared the predictive performances of the methods. We stratified the test data into low- and high-risk groups. Table 3 shows the results of the log-rank test. The P-value of the glcoxph was a bit smaller than those of the others on both DLBCL datasets.

Table 3. Log-rank test results on DLBCL test datasets

                   Rosenwald et al. (2002)         Bair et al. (2006)
                   coxpath   glcoxph   penalized   coxpath   glcoxph   penalized
No. of high risk   38        31        32          42        33        35
Log-rank P-value   0.0005    0.0000    0.0009      0.0040    0.0000    0.0027

Time-dependent AUC curves in Figure 2 indicate that the glcoxph outperforms the other methods over the time span and that the predictive performances of the penalized and the coxpath are almost the same. Note that the curves for the coxpath and the penalized in Figure 2b are identical. In summary, for the DLBCL datasets, the glcoxph predicts the survival distribution better, with a smaller number of genes, than the penalized or the coxpath.
Kaplan–Meier curves on the DLBCL datasets are shown in Figure 3. For the first dataset, the Kaplan–Meier curves have similar patterns, except that the survival probability of the coxpath for the high-risk patients has a longer tail than that of the glcoxph or the penalized. For the second dataset, the three methods show almost the same patterns for the low-risk patients. For the high-risk patients, the patterns of the methods are very similar up to 10 years. After 10 years, the survival curve for the coxpath decays very slowly, while that of the penalized drops quickly.
The experiments were implemented in a Windows R environment on a computer with a 2 GHz Intel Pentium processor and 1 GB RAM. We compared the computational times on the DLBCL dataset from Rosenwald et al. (2002). It took 4449.72 s (3094.24 s in tuning and 1355.38 s in training at the optimal value of the tuning parameter), 2904.53 s (2903.84 s in tuning and 0.69 s in training) and 59432.29 s (59361.04 s in tuning and 71.25 s in training) for the coxpath, the glcoxph and the penalized, respectively. While the glcoxph fits the model at each fixed value of the tuning parameter, the coxpath computes the entire penalization path over the range of s values with a specific step size. Consequently, a direct comparison of their computing times may not be so meaningful. However, at least we can say that the glcoxph is very competitive in terms of computing time.
In order to compare the performance of these methods further, we randomly partitioned the DLBCL data and the Norway/Stanford breast cancer data from Sørlie et al. (2003). The latter consist of 115 samples from women with breast cancer, with expression measurements on 549 genes.
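For reference, partial log-likelihood values l(β) of the kind reported in Table 2 follow directly from the definition in Section 2.1. The sketch below uses our own function name and the simple convention R_i = {j : t_j ≥ t_i} for risk sets (Breslow-style handling of ties is assumed); it is an illustration, not the glcoxph implementation:

```python
import math

def cox_partial_loglik(times, delta, X, beta):
    """l(beta) = sum_{i: delta_i=1} [x_i'beta - log(sum_{j in R_i} exp(x_j'beta))].

    times: observed times; delta: event indicators (1 = event, 0 = censored);
    X: list of covariate vectors x_i; beta: coefficient vector.
    """
    eta = [sum(b * x for b, x in zip(beta, xi)) for xi in X]
    ll = 0.0
    for i, (ti, di) in enumerate(zip(times, delta)):
        if di == 1:
            # risk set R_i: patients still under observation just before t_i
            ll += eta[i] - math.log(sum(math.exp(eta[j])
                                        for j, tj in enumerate(times) if tj >= ti))
    return ll
```

Comparing l(β̂) across fitted models, as in Table 2, then only requires the covariates, times and event indicators of the dataset at hand.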
The DLBCL dataset was randomly partitioned into 160 training samples and 80 test samples, and the Norway/Stanford breast cancer dataset was partitioned into 76 training samples and 39 test samples. We compared the predictive performance of the methods over 50 random partitions. Table 4 summarizes the average performance measures (and their standard errors in parentheses) over the chosen random partitions. In terms of the P-value and the AUC, the differences between the methods were negligible because the numbers are within the error range. However, the penalized was worse than the other methods in terms of the P-value on both datasets. The numbers of genes selected by the glcoxph and the coxpath were not much different, but were fewer than that by the penalized. In summary, the glcoxph showed performance similar to the coxpath, and the glcoxph outperformed the penalized in prediction and in the sparsity of the resulting predictive model.

Fig. 2. AUC on DLBCL test datasets (a) Rosenwald et al. (2002) and (b) Bair et al. (2006). (Each panel plots the area under the curve against survival in years for the glcoxph, the coxpath and the penalized.)

Fig. 3. Kaplan–Meier curves on DLBCL test datasets (a) Rosenwald et al. (2002) and (b) Bair et al. (2006). (Each panel plots survival probability against time for the low- and high-risk patients; log-rank P-values: (a) glcoxph 0.000036, coxpath 0.000591, penalized 0.000916; (b) glcoxph 0.000098, coxpath 0.004061, penalized 0.002766.)
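The between-strata comparisons behind the P-values in Table 3 and Figure 3 use the two-sample log-rank test described in Section 2.3. A self-contained sketch (our own function name; only the standard hypergeometric variance, no further tie correction):

```python
def logrank_statistic(times, delta, group):
    """Two-sample log-rank chi-square statistic (1 degree of freedom).

    times: observation times; delta: event indicators (1 = event);
    group: 0/1 stratum labels, e.g. low-/high-risk obtained by
    thresholding the risk score at zero as in the text.
    """
    event_times = sorted({t for t, d in zip(times, delta) if d == 1})
    o_minus_e, var = 0.0, 0.0
    for t in event_times:
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n1 = sum(1 for i in at_risk if group[i] == 1)
        d_all = sum(1 for ti, di in zip(times, delta) if di == 1 and ti == t)
        d1 = sum(1 for i in at_risk
                 if group[i] == 1 and delta[i] == 1 and times[i] == t)
        o_minus_e += d1 - d_all * n1 / n   # observed minus expected in group 1
        if n > 1:
            var += d_all * (n1 / n) * (1 - n1 / n) * (n - d_all) / (n - 1)
    return o_minus_e ** 2 / var if var > 0 else 0.0
```

The statistic is referred to a chi-square distribution with 1 degree of freedom; larger values correspond to smaller P-values.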
Table 4. Results on the DLBCL and Norway/Stanford breast cancer datasets

                      Average no. of genes selected       Average P-value                     Average AUC
Dataset               coxpath   glcoxph   penalized       coxpath   glcoxph   penalized      coxpath   glcoxph   penalized

DLBCL                 13.30     13.54     23.12           0.0590    0.0672    0.1020         0.5917    0.6015    0.5678
                      (1.6311)  (1.2438)  (1.7976)        (0.0123)  (0.0143)  (0.0275)       (0.0185)  (0.0188)  (0.0216)
Norway/Stanford       9.36      9.12      13.56           0.0653    0.0474    0.1081         0.6131    0.6119    0.6120
breast cancer         (0.8411)  (0.7332)  (0.9652)        (0.0202)  (0.0147)  (0.0248)       (0.0098)  (0.0101)  (0.0091)

Table 5. Average number of genes and prognostic genes on the simulated data

                        coxpath                          glcoxph                          penalized
Censoring (%)   ρ       No. of   No. of      Recovery    No. of   No. of      Recovery    No. of   No. of      Recovery
                        genes    prognostic  rate        genes    prognostic  rate        genes    prognostic  rate
30              0       21.04    2.86        0.13        16.32    2.86        0.18        30.16    2.98        0.09
                0.3     23.22    2.96        0.12        16.80    2.96        0.18        30.22    2.98        0.09
                0.6     22.54    2.94        0.13        15.70    2.94        0.19        28.68    2.94        0.10
15              0       23.29    2.96        0.12        17.28    2.96        0.17        29.90    2.94        0.09
                0.3     24.24    2.92        0.12        15.82    2.92        0.18        28.26    2.92        0.10
                0.6     22.22    2.92        0.13        15.26    2.92        0.19        30.64    2.98        0.09

The recovery rate is defined as the ratio of the average number of prognostic genes to the average number of genes selected.

3.2 Simulated data

To evaluate the performance of the glcoxph empirically, we carried out a simulation study, modifying the simulation scheme of Owzar et al. (2007) slightly. The data generation scheme is as follows. For i = 1,...,N, we generate ε_{i0}, ε_{i1},..., ε_{ip} independently from the standard normal distribution and set

    x_{ij} = √(1 − ρ) ε_{ij} + √ρ ε_{i0},   j = 1,...,p.

The gene expression data have a block exchangeable correlation structure (i.e. genes in the same block are correlated and genes in different blocks are uncorrelated) with correlation coefficient ρ and block size 10. Survival times are generated from the Cox regression model (1) with the following configuration of coefficients:

    β_j = −0.7 for j = 1,...,D,
    β_j = 0 for j = D + 1,...,p,

where D denotes the number of prognostic genes.

For our experiment, we set p = 1000, N = 200 and D = 3. We considered cases with 15% and 30% censoring and ρ = 0, 0.3 and 0.6. A censoring time was generated from U(0,a), where a was chosen to yield about 30% censoring. With a fixed at this value, a censoring variable for the 15% setting was generated from U(b, a+b), where b was chosen to yield about 15% censoring. We generated 200 random samples; 100 were used for training and the rest for testing. To assess the variability of the experiment, we replicated this process 100 times.

We compared our method with the coxpath and the penalized, choosing the tuning parameters by 5-fold CVPL as stated at the beginning of Section 3. Table 5 shows the average number of genes and prognostic genes selected by each method. Overall, the penalized selected the largest number of genes and the glcoxph the fewest. All the methods selected almost all of the three prognostic genes. The recovery rate of the glcoxph was the highest, that of the coxpath second and that of the penalized the lowest. Table 6 shows the average P-values of the log-rank test and the AUC on the simulated data, with standard errors in parentheses. All three methods perform better at 15% censoring than at 30% censoring, as expected. In terms of the P-value, the glcoxph performed slightly better than the other methods, while the penalized yielded slightly larger AUC values than the others; however, these differences seem marginal. Combined with the results reported in Table 5, we conclude that the glcoxph showed predictive performance similar to the other methods with a more succinct model and a shorter computing time.
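The data-generating scheme above can be sketched in Python. The text does not specify the baseline hazard of model (1), the exact value of a, or how the common factors are shared across blocks, so this sketch makes explicit assumptions: a unit exponential baseline hazard, a placeholder censoring bound a, and one independent common factor per block of 10 genes to realize the stated block exchangeable structure.

```python
import numpy as np

def simulate(N=200, p=1000, D=3, rho=0.3, block=10, a=5.0, seed=0):
    """Generate (x, time, event) following the simulation scheme above.

    Genes within each block of `block` genes share one common standard
    normal factor, giving within-block correlation rho and zero
    correlation across blocks.  The exponential baseline hazard and the
    censoring bound `a` are illustrative assumptions: the paper only
    says `a` was tuned to reach the target censoring rate.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((N, p))              # epsilon_ij
    common = rng.standard_normal((N, p // block))  # one factor per block
    x = np.sqrt(1 - rho) * eps + np.sqrt(rho) * np.repeat(common, block, axis=1)

    beta = np.zeros(p)
    beta[:D] = -0.7                                # D prognostic genes
    # Cox model with an (assumed) unit exponential baseline hazard:
    # T_i ~ Exponential(rate = exp(x_i' beta))
    t = rng.exponential(scale=1.0 / np.exp(x @ beta))
    c = rng.uniform(0.0, a, size=N)                # censoring time ~ U(0, a)
    return x, np.minimum(t, c), (t <= c).astype(int)
```

Calibrating a (and b for the U(b, a+b) variant) is then a matter of simulating and checking 1 − event.mean() against the target censoring rate.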
Table 6. Average P-values of the log-rank test and AUC on the simulated data with standard errors in parentheses

                        coxpath                            glcoxph                            penalized
Censoring (%)   ρ       P-value          AUC               P-value          AUC               P-value          AUC
30              0       0.0183 (0.0090)  0.6452 (0.0132)   0.0113 (0.0071)  0.6444 (0.0052)   0.0232 (0.0096)  0.6546 (0.0018)
                0.3     0.0038 (0.0017)  0.6524 (0.0110)   0.0039 (0.0032)  0.6582 (0.0046)   0.0028 (0.0071)  0.6677 (0.0136)
                0.6     0.0013 (0.0004)  0.6588 (0.0102)   0.0010 (0.0004)  0.6603 (0.0051)   0.0026 (0.0013)  0.6659 (0.0172)
15              0       0.0028 (0.0008)  0.6525 (0.0091)   0.0015 (0.0007)  0.6768 (0.0046)   0.0092 (0.0064)  0.6685 (0.0086)
                0.3     0.0020 (0.0013)  0.6798 (0.0041)   0.0011 (0.0005)  0.6851 (0.0051)   0.0008 (0.0004)  0.6691 (0.0065)
                0.6     0.0002 (0.0001)  0.6679 (0.0092)   0.0005 (0.0003)  0.6779 (0.0046)   0.0005 (0.0004)  0.6827 (0.0061)

4 DISCUSSION

We applied the gradient lasso algorithm of Kim et al. (2008) to the Cox proportional hazards model. Most algorithms for the Cox model with the lasso penalty suffer from severe computational problems: owing to matrix inversions, they converge very slowly or even fail to converge to the optimum. The gradient lasso algorithm, in contrast, is scalable to high-dimensional data because it requires no matrix inversions. Moreover, the gradient lasso tackles the exact penalized partial likelihood directly, whereas many other methods solve approximated penalized partial likelihoods. We compared the gradient lasso algorithm with the recent algorithms of Park and Hastie (2007) and Goeman (2008) on the DLBCL datasets from Rosenwald et al. (2002) and Bair et al. (2006), on the Norway/Stanford breast cancer dataset from Sørlie et al. (2003), and on simulated data. The results indicate that our gradient lasso algorithm is very competitive in analyzing high-dimensional survival data in terms of the sparsity of the final prediction model, predictability and computing time. Since our algorithm yields a more stable solution to the penalized Cox model, and does so faster than other methods in the literature, it provides an efficient tool for developing prediction models for survival time based on high-dimensional microarray data.

There are several issues yet to be investigated. We related only gene expression profiles to survival phenotypes in this study; in the Cox proportional hazards model, the incorporation of categorical clinical variables is also anticipated. One may consider borrowing ideas from blockwise sparse regression, an extension of the group lasso of Yuan and Lin (2006) to general loss functions, proposed by Kim et al. (2006). A non-parametric extension of penalized Cox regression is another direction of interest. Liu (2008) proposed a garrote method of feature selection in the penalized partial likelihood of additive Cox models. An advantage of the garrote method is that it can handle continuous as well as categorical variables, although there seems to be room for improvement in its computational aspects.

ACKNOWLEDGEMENTS

The authors are grateful to Dr M. Y. Park for personal communications on coxpath.

Funding: Korea Research Foundation Grant (KRF-2008-331-C00055) funded by the Korean Government.

Conflict of Interest: none declared.

REFERENCES

Akritas,M.G. (1994) Nearest neighbor estimation of a bivariate distribution under random censoring. Ann. Stat., 22, 1299–1327.
Bair,E. and Tibshirani,R. (2004) Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol., 2, 511–522.
Bair,E. et al. (2006) Prediction by supervised principal components. J. Am. Stat. Assoc., 101, 119–137.
Brown,P.O. and Botstein,D. (1999) Exploring the new world of the genome with DNA microarrays. Nat. Genet., 21 (Suppl. 1), 33–37.
Cox,D.R. (1972) Regression models and life-tables. J. R. Stat. Soc. B, 34, 187–220.
Efron,B. et al. (2004) Least angle regression. Ann. Stat., 32, 407–499.
Fan,J. and Li,R. (2002) Variable selection for Cox's proportional hazards model and frailty model. Ann. Stat., 30, 74–99.
Goeman,J.J. (2008) An efficient algorithm for L1-penalized estimation. Preprint, Department of Medical Statistics and Bioinformatics, Leiden University, Netherlands.
Gui,J. and Li,H. (2005a) Threshold gradient descent method for censored data regression, with applications in pharmacogenomics. In Pacific Symposium on Biocomputing. World Scientific, Singapore, pp. 272–283.
Gui,J. and Li,H. (2005b) Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics, 21, 3001–3008.
Heagerty,P.J. et al. (2000) Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics, 56, 337–344.
Kaderali,L. et al. (2006) CASPAR: a hierarchical Bayesian approach to predict survival times in cancer from gene expression data. Bioinformatics, 22, 1495–1502.
Kim,J. and Kim,Y. (2004) Gradient lasso for feature selection. In Proceedings of the 21st International Conference on Machine Learning. ACM, New York, pp. 473–480.
Kim,Y. et al. (2006) Blockwise sparse regression. Stat. Sin., 16, 375–390.
Kim,J. et al. (2008) A gradient-based optimization algorithm for lasso. J. Comput. Graph. Stat., 17, 994–1009.
Liu,S. (2008) Variable selection in semi-parametric additive models with extensions to high-dimensional data and additive Cox models. PhD Thesis, Department of Statistics, North Carolina State University.
Ma,S. and Huang,J. (2005) Lasso method for additive risk models with high-dimensional covariates. Technical Report No. 347, Department of Statistics and Actuarial Science, The University of Iowa.
Owzar,K. et al. (2007) A copula approach for detecting prognostic genes associated with survival outcome in microarray studies. Biometrics, 63, 1089–1098.
Park,M.Y. and Hastie,T. (2007) L1-regularization path algorithm for generalized linear models. J. R. Stat. Soc. B, 69, 659–677.
Rosenwald,A. et al. (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New Engl. J. Med., 346, 1937–1946.
Segal,M.R. (2006) Microarray gene expression data with linked survival phenotypes: diffuse large-B-cell lymphoma revisited. Biostatistics, 7, 268–285.
Sørlie,T. et al. (2003) Repeated observation of breast tumor subtypes in independent gene expression datasets. Proc. Natl Acad. Sci. USA, 100, 8418–8423.
Tibshirani,R. (1997) The lasso method for variable selection in the Cox model. Stat. Med., 16, 385–395.
Yuan,M. and Lin,Y. (2006) Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. B, 68, 49–67.
Zhang,H.H. and Lu,W. (2007) Adaptive lasso for Cox's proportional hazards model. Biometrika, 94, 691–703.
Zou,H. (2008) A note on path-based variable selection in the penalized proportional hazards model. Biometrika, 95, 241–247.
Bioinformatics – Oxford University Press
Published: May 15, 2009