# A framework for estimating and testing qualitative interactions with applications to predictive biomarkers

## Summary

An effective treatment may only benefit a subset of patients enrolled in a clinical trial. We translate the search for patient characteristics that predict treatment benefit into a search for qualitative interactions, which occur when the estimated response curve under treatment crosses the estimated response curve under control. We propose a regression-based framework that tests for qualitative interactions without assuming linearity or requiring pre-specified risk strata; this flexibility is useful in settings where there is limited a priori scientific knowledge about the relationship between features and the response. Simulations suggest that our method controls Type I error while offering an improvement in power over a procedure based on linear regression or a procedure that pre-specifies evenly spaced risk strata. We apply our method to a publicly available dataset to search for a subset of HER2+ breast cancer patients who benefit from adjuvant chemotherapy. We implement our method in Python and share the code/data used to produce our results on GitHub (https://github.com/jhroth/data-example).

## 1. Introduction

If the results of a clinical trial indicate that a proposed treatment is not beneficial overall, the treatment may still benefit a smaller subset of patients. Existing methods to predict which patients benefit from treatment generally use one of two strategies. The first approach, sometimes called classification-based (in Zhang and others, 2015, for example), assumes the structure of the treatment rule (a function that maps a patient's characteristics to a treatment option) and then searches for the optimal rule over that class of possible treatment rules.
For example, Zhang and others (2015) assume the optimal treatment rule is a sequence of "if-then" statements with fewer than three variables involved in each condition, and then search for the rule of this form that maximizes an estimate of the treatment rule's value. Outcome-weighted learning (Zhao and others, 2012, 2015) and marginal structural mean models (Robins and others, 2008; Orellana and others, 2010) are other classification-based approaches to predicting treatment benefit. The second approach, sometimes called regression-based, usually assumes a conditional mean model of the response variable given a treatment assignment and patient characteristics. For an observation with a particular set of characteristics, these methods generally predict the conditional mean response under different treatment assignments to infer the best treatment option. Examples include methods leveraging the Q-learning framework (Zhao and others, 2009, 2011; Moodie and others, 2014), other statistical learning tools (e.g. boosting in Kang and others (2014)), or a combination of parametric, semiparametric, and nonparametric estimators (such as the two-stage approach of Cai and others, 2011). In the econometric literature, Chang and others (2015) and Crump and others (2008) develop nonparametric theory to test the null hypothesis that the direction of the treatment effect is the same across all values of pre-specified covariates; this is sometimes called testing for treatment effect heterogeneity. In the regression setting, the search for predictive biomarkers can be thought of as a test for a clinically meaningful interaction between treatment and patient characteristics. A qualitative interaction occurs when treatment is neither superior nor inferior over all subsets of patients (Gail and Simon, 1985; Pan and Wolfe, 1997; Peto, 1982).
Qualitative interactions stand in contrast to non-qualitative interactions, where the magnitude of the treatment effect varies across patient subsets but its direction is the same in every subset (i.e. either all patients benefit or no patients benefit). Qualitative interactions are clinically meaningful because they identify a treatment strategy with the ability to improve patient outcomes. Figure 1 shows a hypothetical example of a qualitative interaction that can inform a treatment decision (left panel) and a non-qualitative interaction that cannot (right panel). If the response variable $$Y$$ is desirable (e.g. survival time), the left panel suggests that treatment is only beneficial for patients with values of candidate biomarker $$X$$ above about 1.7.

Fig. 1. Mean of response variable Y, conditional on candidate biomarker X, in the treatment group (solid line) and control group (dashed line).

We propose a regression-based framework that tests for qualitative interaction without assuming a linear mean model or requiring a candidate biomarker to be divided into risk strata beforehand. To do this, we translate the general settings of qualitative and non-qualitative interaction into convex optimization problems that data-adaptively estimate the control and treatment groups' underlying trends in treatment response over the range of a candidate biomarker. We form a simple likelihood-based test statistic and evaluate its significance using a permutation test.
We emphasize that our framework does not directly identify a subpopulation of patients who should receive treatment; rather, our method provides a global test of the null hypothesis that treatment is either uniformly beneficial or uniformly not beneficial across the ordered values of a candidate biomarker. In the convex problem that allows for qualitative interaction, the values of the candidate biomarker at which the estimated response curves for the treatment groups cross each other suggest a subset of patients who benefit from treatment; however, no formal test of the average treatment effect in that subset is run. Section 2 outlines our general framework for estimation and testing. Section 3 describes our testing procedure in more detail. Section 4 briefly discusses the Python implementation of our method. We present results from a data example in Section 5 and simulation results in Section 6. Section 7 gives a brief conclusion.

## 2. Estimation and testing framework

We begin with the formal definition of a qualitative interaction in the multivariate setting. Suppose we observe patients randomized to either a new treatment ($$T=1$$) or standard of care ($$T=0$$). On each patient we observe $$p$$ covariate values $${\bf x} \equiv \left(x_1,\ldots,x_p\right)$$ and a numeric response $$y$$. Suppose further that $$\left(y |x_1,\ldots,x_p, T\right) \sim L\left(\cdot, \theta\left({\bf x}, T\right)\right)$$, where $$L$$ is a known distribution parametrized by a single parameter, $$\theta\left({\bf x}, T\right)$$, that is an unknown function of the features and treatment assignment. Further suppose that, for any $$y$$, $$L(y,\theta)$$ is log-concave in $$\theta$$. In what follows we will write $$\theta\left({\bf x}, T\right)$$ as $$\theta_T\left({\bf x}\right)$$. In addition, let $$G$$ denote the joint distribution of $${\bf x}$$ and let $$\operatorname{supp}(G) = \left\{ {\bf x} \,|\, G({\bf x}) \neq 0\right\}$$ denote the support of $$G$$.
We are interested in testing the null hypothesis: \begin{align*} H_0 &\text{ (no qualitative interaction)}:\\ &\text{ Either } \forall {\bf x}\in\operatorname{supp}(G),\, \theta_1\left({\bf x}\right) \leq \theta_0\left({\bf x}\right) \text{ OR } \forall {\bf x}\in\operatorname{supp}(G),\, \theta_0\left({\bf x}\right) \leq \theta_1\left({\bf x}\right)\\ & \qquad \text{versus}\\ H_A &\text{ (qualitative interaction)}:\\ &\exists\, {\bf x}_1, {\bf x}_0\in\operatorname{supp}(G) \text{ with } \theta_0\left({\bf x}_0\right) > \theta_1\left({\bf x}_0\right) \text{ and } \theta_1\left({\bf x}_1\right) > \theta_0\left({\bf x}_1\right). \end{align*} Often, rather than this bidirectional null hypothesis, we will instead be interested in testing whether the new treatment is more effective than control for any subset of patients. If higher values of $$y$$ are more beneficial, then in this case $$H_0$$ reduces to $$\forall {\bf x}\,\, \theta_1\left({\bf x}\right) \leq \theta_0\left({\bf x}\right)$$, and the alternative hypothesis is changed correspondingly. We note that this is a global null hypothesis (that one response curve lies everywhere above the other); rejecting it does not indicate where the curves cross. The testing procedure we will propose does give an estimate of the subset of patients for whom treatment is more effective than control (and vice versa), but we do not run a formal statistical test of the average treatment effect in those subsets. For testing this hypothesis suppose we have data on $$n_0 + n_1 = n$$ independently drawn individuals, where $$n_0$$ individuals are randomized to the control group and $$n_1$$ to the treatment group. Let $${\mathcal{X}} \in {\mathbb{R}}^{n \times p}$$ be a matrix of $$p$$ features observed for the $$n$$ individuals and $$Y \in {\mathbb{R}}^n$$ be our vector of responses.
We will let $${\bf x}_{i, \cdot}$$ denote the $$i$$-th row of $${\mathcal{X}}$$, $${\bf x}_{\cdot, j}$$ denote the $$j$$-th column of $${\mathcal{X}}$$, and $$x_{ij}$$ the $$(i,j)$$ entry of $${\mathcal{X}}$$. In what follows we propose a general likelihood-ratio-based test statistic; the null distribution of this statistic will be calculated via permutation.

### 2.1. Forming the test statistic

To form the standardized likelihood ratio statistic we must estimate $$\theta_T\left({\bf x}\right)$$ for $$T=0,1$$ under both the null and alternative hypotheses. Let $$\mathcal{L}\left(\cdot,\theta\right)$$ denote the negative log-likelihood of $$L(\cdot,\theta)$$ and recall that the deviance is defined as \begin{equation*} \mathcal{D}\left(y,\hat{\theta}\right) = 2\left[\mathcal{L}\left(y,\hat{\theta} \right) - \operatorname{min}_{\theta}\mathcal{L}\left(y,\theta\right)\right]. \end{equation*} This is just a standardized version of the negative log-likelihood (forced to be positive). Our test statistic is $$\label{eq:LRstat} Z = \frac{\sum_{i = 1}^n\mathcal{D}\left(y_i,\hat{\theta}^{\rm{null}}_{T_i}({\bf x}_{i, \cdot})\right) - \sum_{i=1}^n\mathcal{D}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)}{\sum_{i=1}^n\mathcal{D}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)},$$ (2.1) where, for $$T=0,1$$, $$\hat{\theta}^{{\rm{null}}}_T$$ and $$\hat{\theta}^{{\rm{alt}}}_T$$ are estimates under the null and alternative hypotheses respectively. In the case of Gaussian data with fixed variance, we have $$Z = \frac{\sum_i \left(y_i - \hat{\theta}^{{\rm{null}}}_{T_i}({\bf x}_{i, \cdot})\right)^2 - \sum_i \left(y_i - \hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)^2}{\sum_i \left(y_i - \hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)^2}.$$ This is just a scaled $$F$$-statistic.
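In the Gaussian fixed-variance case, the statistic can be computed directly from the two sets of fitted values, since the deviance reduces to a residual sum of squares. A minimal sketch (the function name is ours, not part of the released implementation):

```python
import numpy as np

def z_statistic(y, fitted_null, fitted_alt):
    """Scaled F-type statistic (2.1) for Gaussian data with fixed variance,
    where the deviance reduces to a residual sum of squares."""
    rss_null = np.sum((y - fitted_null) ** 2)
    rss_alt = np.sum((y - fitted_alt) ** 2)
    return (rss_null - rss_alt) / rss_alt

# Toy check: the alternative fit tracks y more closely, so Z > 0.
y = np.array([1.0, 2.0, 3.0, 4.0])
fit_null = np.full(4, 2.5)              # fits under the constrained null
fit_alt = np.array([1.2, 1.8, 3.1, 3.9])
print(z_statistic(y, fit_null, fit_alt))  # ≈ 49, i.e. (5.0 - 0.1) / 0.1
```

A large positive $$Z$$ indicates that relaxing the no-crossing constraint substantially improves the fit; $$Z$$ near zero indicates it does not.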
For logistic loss, $$\operatorname{min}_{\theta}\mathcal{L}\left(y,\theta\right) = 0$$, so we have $$Z= \frac{\sum_{i = 1}^n\mathcal{L}\left(y_i,\hat{\theta}^{{\rm{null}}}_{T_i}({\bf x}_{i, \cdot})\right) - \sum_{i=1}^n\mathcal{L}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)}{\sum_{i=1}^n\mathcal{L}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)}.$$ In order to control Type I error in our permutation-based testing procedure, it is important for the denominator of $$Z$$ to remain positive; the deviance-based test statistic in (2.1) achieves this. In estimating $$\hat{\theta}^{{\rm{null}}}_T$$ and $$\hat{\theta}^{{\rm{alt}}}_T$$, we often do not want to make parametric assumptions. In what follows we discuss non-parametric estimation of $$\hat{\theta}^{{\rm{null}}}_T$$ and $$\hat{\theta}^{{\rm{alt}}}_T$$ under general shape constraints.

### 2.2. A convex formulation for estimating $$\hat{\theta}^{{\rm{null}}}$$ and $$\hat{\theta}^{{\rm{alt}}}$$

Our approach leverages penalized likelihood methods (Hastie and others, 2008) to estimate our parameters. In the more traditional, single-class setting where $$y_i \sim L\left(\cdot,\theta\left({\bf x}\right)\right)$$, the penalized likelihood approach solves the problem $$\label{eq:pen} \hat{\theta} \gets \operatorname{argmin}_{\theta} \sum_i \mathcal{L}\left(y_i, \theta\left({\bf x}_{i, \cdot}\right)\right) + \lambda P\left(\theta\right),$$ (2.2) where $$P$$ is a structure-inducing penalty and $$\lambda > 0$$ is a tuning parameter that trades off between enforcing goodness of fit and structure. $$P$$ is chosen based on the type of smoothness or structure one expects: Sobolev-type penalties, $$P(\theta) = \left \|\theta^{(k)}\right\|^2$$, can be used to induce general smoothness; total-variation/trend-filtering penalties can be used, in 1D, to obtain piecewise polynomial fits.
Other penalties, based on the convex indicator (defined as $$I(z\in S) \equiv \infty\cdot\delta\left\{z\not\in S\right\}$$, where $$\delta(\cdot)$$ is the usual $$0,1$$ indicator function), can enforce other structure, such as monotonicity $$I(\theta \text{ non-decreasing})$$, a parametric form $$I\left(\theta \in \operatorname{span}\left\{\psi_1,\ldots, \psi_K\right\}\right)$$ for prespecified functions $$\psi_1,\ldots,\psi_K$$, or additivity $$I\left(\theta({\bf x}) = \sum_{j=1}^p\phi_j\left(x_j\right) \text{ for some }\phi_1,\ldots, \phi_p \right)$$. In the additive case, we can combine this convex indicator with smoothness-type penalties on each individual component. In all of these cases, because $$\mathcal{L}$$ is convex, the problem in (2.2) is convex (Boyd and Vandenberghe, 2004). We now extend this to our two-class framework.

### 2.3. Estimation of $$\hat{\theta}^{{\rm{alt}}}$$

For the unrestricted model under $$H_A$$, we can use the penalized likelihood framework to jointly solve for $$\hat{\theta}^{{\rm{alt}}}_0$$ and $$\hat{\theta}^{{\rm{alt}}}_1$$. We optimize $$\label{eq:optimization_alternative} \hat{\theta}^{{\rm{alt}}}_1,\hat{\theta}^{{\rm{alt}}}_0 \gets \underset{\theta_1,\theta_0}{\operatorname{argmin}} \sum_{T_i = 1}\mathcal{L}\left(y_i,\theta_1({\bf x}_{i, \cdot})\right) + \lambda_1 P\left(\theta_1\right) + \sum_{T_i = 0}\mathcal{L}\left(y_i,\theta_0({\bf x}_{i, \cdot})\right) + \lambda_0 P\left(\theta_0\right).$$ (2.3) We note that this can be minimized separately in $$\theta_1$$ and $$\theta_0$$, requiring us merely to solve two problems of the form (2.2). The tuning parameters $$\lambda_0$$ and $$\lambda_1$$ determine the amount of regularization; we use split-sample/cross-validation (as described in Section 3) to choose these tuning parameters.

### 2.4. Estimation of $$\hat{\theta}^{{\rm{null}}}$$

Estimating the null-restricted parameters is more difficult; here we aim to constrain our estimates not to cross on the support of $$G$$.
Because $$G$$ is generally unknown, we instead use the empirical distribution of $${\bf x}$$, imposing the constraint that either $$\hat{\theta}_0\left({\bf x}_{i, \cdot}\right) \leq \hat{\theta}_1\left({\bf x}_{i, \cdot}\right)$$ for all $$i=1,\ldots,n$$ or the mirror-image constraint $$\hat{\theta}_0\left({\bf x}_{i, \cdot}\right) \geq \hat{\theta}_1\left({\bf x}_{i, \cdot}\right)$$ for all $$i=1,\ldots,n$$ holds. Thus we solve the following penalized, constrained, maximum likelihood problem: \begin{align} \hat{\theta}^{{\rm{null}}}_1,\hat{\theta}^{{\rm{null}}}_0 \gets \underset{\theta_1,\theta_0}{\operatorname{argmin}} &\sum_{T_i =1}\mathcal{L}\left(y_i,\theta_1({\bf x}_{i, \cdot})\right) + \lambda_1 P\left(\theta_1\right) + \sum_{T_i = 0}\mathcal{L}\left(y_i,\theta_0({\bf x}_{i, \cdot})\right) + \lambda_0 P\left(\theta_0\right) \end{align} (2.4) \begin{align} \text{s.t.} \quad&\,\, \textbf{Either }\, \theta_0\left({\bf x}_{i, \cdot}\right) \leq \theta_1\left({\bf x}_{i, \cdot}\right) \textrm{ for all } i=1,\ldots,n, \end{align} (C1) \begin{align} &\,\, \textbf{Or }\,\theta_0\left({\bf x}_{i, \cdot}\right) \geq \theta_1\left({\bf x}_{i, \cdot}\right) \textrm{ for all } i=1,\ldots,n. \end{align} (C2) This is not a convex problem because the combined constraint is not convex. However, (C1) or (C2) alone is a convex constraint. Because the Either/Or condition amounts to optimizing over the union of the sets defined by (C1) and (C2), we can still find the global optimum quite simply: we solve (2.4) once with only constraint (C1) and once with only constraint (C2), and then choose, from those two candidate solutions, the one with the lower objective value. Under each single constraint, (2.4) is a convex problem and can be solved efficiently (Boyd and Vandenberghe, 2004). It should be noted that the null fit is only constrained to have no qualitative interaction at the observed data points, rather than over the entire design space.
In cases where minimal smoothness is assumed (particularly with a single feature), fitted values at unobserved data points will be close to a simple average of the fitted values at nearby observed data points. Here, without qualitative interactions at the observed data points, we would not expect qualitative interactions to occur for unobserved points within the design space. When more structure is assumed, such as in the additive model framework discussed in Section 2.6, fits under the null may actually allow qualitative interactions for unobserved combinations of covariates in the design space. That is, the null model permits more flexibility than it should under the assumption of no qualitative interaction anywhere in the design space, and consequently our testing procedure will be conservative. If the covariate values from our sample are representative of the population, then any qualitative interaction permitted under the null (at unobserved covariate combinations) would necessarily apply to a small subpopulation: otherwise that subpopulation would have been represented in our sample. If the observed features are not representative of the larger population, however, then the conservative nature of the method could be impactful: it may be that a covariate combination which is unobserved, and at which we allow a qualitative interaction under the null, actually represents a significant segment of the population. Our test would be unable to detect that subpopulation. An alternative way to form constraints under the null would be to discretize the entire design space and then apply the (C1) and (C2) constraints to each element of the grid; this is an interesting approach, but one we choose not to pursue, primarily for computational reasons.
### 2.5. Structure induced on fitted values

For some choices of penalty, (2.2) does not immediately estimate an entire function $$\theta$$; instead the function is only estimated at a finite list of specified $${\bf x}$$-values (often taken to be the observed values $${\bf x}_1,\ldots, {\bf x}_n$$). This is the case for the fused lasso (Tibshirani and others, 2005) and higher-order $$\ell_1$$-trend filtering (Tibshirani, 2013), among others. There, the penalty generally only involves those specified $${\bf x}$$-values; for example, $$0$$th order trend filtering with a single covariate $$x$$, ordered such that $$x_1 < x_2 < \cdots < x_n$$, yields $$P(\theta) = \sum_{i=2}^n \left|\theta\left(x_i\right) - \theta\left(x_{i-1}\right)\right|$$. However, to estimate our fits under the null hypothesis we must solve (2.4); this problem has constraints that require estimation at all $$x$$-values observed under either treatment for each of $$\theta_1$$ and $$\theta_0$$. In this case we adjust the usual fused lasso estimator to include all $$x$$-values in the penalty. More specifically, if we reorder our data such that $$x_1 < x_2 < \cdots < x_n$$, then our optimization problem is: \begin{align*} \hat{\theta}^{\rm{null}}_1,\hat{\theta}^{\rm{null}}_0 \gets \underset{\theta_1,\theta_0}{\operatorname{argmin}} &\sum_{T_i =1}\mathcal{L}\left(y_i,\theta_1(x_i)\right) + \lambda_1 \sum_{i=2}^n \left|\theta_1\left(x_i\right) - \theta_1\left(x_{i-1}\right)\right| + \\ & \sum_{T_i = 0}\mathcal{L}\left(y_i,\theta_0(x_i)\right) + \lambda_0 \sum_{i=2}^n \left|\theta_0\left(x_i\right) - \theta_0\left(x_{i-1}\right)\right|\\ \text{s.t.} \quad&\,\, \textbf{Either }\,\theta_0\left(x_i\right) \leq \theta_1\left(x_i\right) \textrm{ for all } i=1,\ldots,n \qquad \text{(C1)}\\ &\,\, \textbf{Or }\,\theta_0\left(x_i\right) \geq \theta_1\left(x_i\right) \textrm{ for all } i=1,\ldots,n. \qquad \text{(C2)} \end{align*} Note that in fitting the alternative (2.3), for $$T=0$$ we need only include those $$x_i$$ with $$T_i = 0$$ in the penalty (and similarly for $$T=1$$).

### 2.6. Estimation with multiple features

When $${\bf x}_{i, \cdot} \in {\mathbb{R}}^p$$ for $$p >1$$, we can implement our framework using an additive model (Hastie and Tibshirani, 1990; Hastie and others, 2008). Suppose $${\bf x}_{i, \cdot}$$ is partitioned into continuous and categorical variables as $${\bf x}_{i, \cdot} = \{{\bf x}_{i, \cdot}^{\text{cont.}}, {\bf x}_{i, \cdot}^{\text{cat.}}\}$$ with $${\bf x}_{i, \cdot}^{\text{cont.}} \in {\mathbb{R}}^q$$ and $${\bf x}_{i, \cdot}^{\text{cat.}} \in {\mathbb{R}}^s$$. Also write $$\lambda_j = \{\lambda_j^{\text{cont.}}, \lambda_j^{\text{cat.}}\}$$ for $$j=0, 1$$. Define \begin{align*} \Theta_1 &= \begin{bmatrix} \theta_{1, 1}^{\text{cont.}} & \cdots & \theta_{1, q}^{\text{cont.}} & \theta_{1, 1}^{\text{cat.}} & \cdots & \theta_{1, s}^{\text{cat.}} \end{bmatrix}, \end{align*} and similarly for $$\Theta_0$$.
Under the alternative hypothesis, we would then solve \begin{align} \left(\widehat{\Theta}_1,\hat{\gamma}_1 \right) \gets \underset{\Theta_1,\gamma_1}{\operatorname{argmin}} \sum_{T_i = 1} & \mathcal{L}\left(y_i, \left[\gamma_1 + \sum_{j=1}^q \theta_{1, j}^{\text{cont.}}(x_{ij}^{\text{cont.}}) + \sum_{k=1}^s \theta_{1, k}^{\text{cat.}}(x_{ik}^{\text{cat.}}) \right] \right) + \nonumber \\ & \lambda_1^{\text{cont.}} \sum_{j=1}^q P_j^{\text{cont.}}\left(\theta_{1, j}^{\text{cont.}}\right) + \lambda_1^{\text{cat.}} \sum_{k=1}^s P_k^{\text{cat.}}\left(\theta_{1, k}^{\text{cat.}}\right) \nonumber, \\ \left(\widehat{\Theta}_0,\hat{\gamma}_0 \right) \gets \underset{\Theta_0,\gamma_0}{\operatorname{argmin}} \sum_{T_i = 0} & \mathcal{L}\left(y_i, \left[\gamma_0 + \sum_{j=1}^q \theta_{0, j}^{\text{cont.}}(x_{ij}^{\text{cont.}}) + \sum_{k=1}^s \theta_{0, k}^{\text{cat.}}(x_{ik}^{\text{cat.}}) \right] \right) + \nonumber \\ & \lambda_0^{\text{cont.}} \sum_{j=1}^q P_j^{\text{cont.}}\left(\theta_{0, j}^{\text{cont.}}\right) + \lambda_0^{\text{cat.}} \sum_{k=1}^s P_k^{\text{cat.}}\left(\theta_{0, k}^{\text{cat.}}\right) \nonumber, \end{align} where $$\gamma_0, \gamma_1 \in {\mathbb{R}}$$. To solve under the null hypothesis, we add the appropriate Either/Or constraints (C1) and (C2). The fitted value under the alternative for the $$i$$th individual is then $$\sum_{t=0}^1 I(T_i = t) \left[\gamma_t + \sum_{j=1}^q \theta_{t, j}^{\text{cont.}}(x_{ij}^{\text{cont.}}) + \sum_{k=1}^s \theta_{t, k}^{\text{cat.}}(x_{ik}^{\text{cat.}}) \right]. \label{eq:framework_fitted_mixture}$$ (2.5) There are many potential choices for the continuous and categorical penalties.
One choice that seems to perform well, which we use in our Python implementation, is $$P_j^{\text{cont.}}$$ as the 1D fused lasso penalty and $$P_k^{\text{cat.}}$$ as a ridge penalty: \begin{align*} P_k^{\text{cat.}} \left(\theta_{t, k}^{\text{cat.}}\right) &= \sum_{T_i = t} \left[\theta_{t, k}^{\text{cat.}}(x_{i k})\right]^2, \\ P_j^{\text{cont.}} \left(\theta_{t, j}^{\text{cont.}}\right) &= {\left\lVert{D^{(0)}_{j, t} \left[\theta_{t, j}^{\text{cont.}}\left(x^{\text{cont.}}_{\cdot, j}\right) \right]}\right\rVert}_1, \end{align*} where $$D^{(0)}_{j, t}({\bf x}_{\cdot, j})$$ returns a vector of first differences between the fitted values at consecutive ordered values of the feature $${\bf x}_{\cdot, j}$$, among observations in the $$T=t$$ group.

### 2.7. Minimum clinically relevant effect

Rather than testing a null hypothesis of no crossing, we may instead be interested in testing a null of the form no clinically relevant crossing, where clinical relevance will be application-specific. For ease of exposition we will work with only the one-sided null/alternative wherein, under the null, we expect that outcomes under the new treatment ($$T=1$$) will fall largely or entirely below outcomes under the control ($$T=0$$). However, the opposite direction or the bidirectional null/alternative are also simple to test. In this case we can formalize our null hypothesis as $$H_0: \int \left[\theta_1\left({\bf x}\right) - \theta_0\left({\bf x}\right) - \delta\right]_+ \,{\rm{d}}G\left({\bf x}\right) \leq C$$ for some prespecified $$\delta$$ and $$C \geq 0$$, where $$[z]_+ = z{\,\cdot\,}I(z \geq 0)$$. For $$\delta = 0$$ this null hypothesis states that, if the population is treated under the optimal rule, the population average benefit from new treatment over control is at most $$C$$. For $$C=0$$ and $$\delta \neq 0$$, this null hypothesis states that no patients in the population have expected benefit of greater than $$\delta$$ from new treatment over control.
For $$\delta = C = 0$$ we return to the original null hypothesis. To test this null we use the same formulation as above; now, however, we need estimates $$\hat{\theta}^{{\rm{null}}}_1$$, $$\hat{\theta}^{{\rm{null}}}_0$$ for a slightly different null. As before, we use the empirical distribution of the covariates as a surrogate for $$G$$. We solve \begin{align} \hat{\theta}^{{\rm{null}}}_1,\hat{\theta}^{{\rm{null}}}_0 \gets \underset{\theta_1,\theta_0}{\operatorname{argmin}} &\sum_{T_i =1}\mathcal{L}\left(y_i,\theta_1({\bf x}_{i, \cdot})\right) + \lambda_1 P\left(\theta_1\right) + \sum_{T_i = 0}\mathcal{L}\left(y_i,\theta_0({\bf x}_{i, \cdot})\right) + \lambda_0 P\left(\theta_0\right) \end{align} (2.6) \begin{align} \text{s.t.} \quad&\,\, \frac{1}{n}\sum_{i=1}^n \left[\theta_1\left({\bf x}_{i, \cdot}\right) - \theta_0\left({\bf x}_{i, \cdot}\right) - \delta \right]_+ \leq C. \end{align} (C1*) This is a convex problem that can be solved efficiently. As before, one could replace (C1*) with an equivalent (C2*), or conduct a bidirectional test with an Either/Or constraint.

### 2.8. Selecting a smoothing method

The choice of penalty function $$P$$ determines the type of smoothness our solutions will have.
We illustrate various candidates in Table 1.

Table 1. Examples of smoothing methods

| Method | $$P(\theta)$$ |
| --- | --- |
| 1D fused lasso (Tibshirani and others, 2005) | $$\sum_i \lvert \theta_{i+1} - \theta_i \rvert$$ |
| $$1$$st order trend filtering (Tibshirani, 2013) | $$\lVert D^{(x, 2)} \theta \rVert_1$$ |
| $$2$$nd order trend filtering (Tibshirani, 2013) | $$\lVert D^{(x, 3)} \theta \rVert_1$$ |
| Cubic smoothing splines (Hastie and others, 2008) | $$\theta^T \Omega_N \theta$$ |

Our Python implementation uses the 1D fused lasso penalty. We make this choice because we have found that the 1D fused lasso penalty performs well empirically, and solutions to (2.3) and (2.4) using this penalty can be found efficiently and accurately in our implementation.
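For intuition, the penalty operators in Table 1 are built from discrete difference matrices. A small sketch for evenly spaced inputs (the function name is ours; the general method also accommodates unevenly spaced biomarker values):

```python
import numpy as np

def difference_operator(n, k):
    """k-th order discrete difference matrix: k=1 gives the 1D fused lasso
    penalty ||D theta||_1; k=2 and k=3 give the D^{(x,2)} and D^{(x,3)}
    operators used by 1st and 2nd order trend filtering in Table 1."""
    D = np.eye(n)
    for _ in range(k):
        D = np.diff(D, axis=0)  # each pass takes first differences of the rows
    return D

# A piecewise-constant theta pays only for its jumps: one jump of size 1 here.
theta = np.array([1.0, 1.0, 2.0, 2.0, 2.0])
D1 = difference_operator(5, 1)
print(np.abs(D1 @ theta).sum())  # → 1.0
```

Higher-order operators penalize changes in slope (or curvature) instead of changes in level, yielding piecewise-linear or piecewise-quadratic fits.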
Figure 2 shows example solutions to (2.3) and (2.4) for the fused lasso with a continuous response, where we have simulated noisy data from two crossing sinusoidal functions. We see that, in this simple example, the fused lasso fits effectively detect and localize the crossings.

Fig. 2. Example piecewise-constant solutions to (2.3) and (2.4) using the 1D fused lasso. The crossing sinusoidal curves show the true relationship between the response and candidate biomarker.

## 3. Testing procedure

We are interested in testing our null hypothesis of no qualitative interaction based on the ingredients (test statistic and estimated functions) outlined in Section 2. To develop a formal test we combine those ingredients with permutations to evaluate the null distribution of our statistic. The exact algorithm is the following: 1. Calculate $$\hat{\theta}^{{\rm{alt}}}_T$$, for $$T=0,1$$, by solving (2.3) across sequences of $$\lambda_0$$ and $$\lambda_1$$ values. Using $$10$$-fold cross-validation, choose the tuning-parameter values $$\lambda_0^*$$ and $$\lambda_1^*$$ that maximize the cross-validated log-likelihood. 2. Calculate $$\hat{\theta}^{{\rm{null}}}_T$$ for $$T=0,1$$ by solving (2.4) with $$\lambda_0 = \lambda_0^*, \lambda_1 = \lambda_1^*$$. 3. Calculate the test statistic, $$Z$$, by plugging $$\hat{\theta}^{{\rm{alt}}}_T$$ and $$\hat{\theta}^{{\rm{null}}}_T$$ for $$T=0,1$$ into (2.1): \begin{equation*} Z = \frac{\sum_{i = 1}^n\mathcal{D}\left(y_i,\hat{\theta}^{{\rm{null}}}_{T_i}({\bf x}_{i, \cdot})\right) - \sum_{i=1}^n\mathcal{D}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)}{\sum_{i=1}^n\mathcal{D}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)}. \end{equation*} 4.
Calculate the permutation null distribution of $$Z$$ by running the following for $$b=1,\ldots, B$$: (a) Permute the treatment labels $$T_i$$ for $$i=1,\ldots,n$$. (b) With the new treatment labels, rerun steps $$1$$–$$3$$. This produces a permuted statistic $$Z^{*b}$$. 5. Compare $$Z$$ to $$\left\{Z^{*b}\right\}_{b=1}^B$$ using either a P-value $$p^* = \frac{\#\left\{Z^{*b} \geq Z\right\}}{B}$$ or an estimate of the false discovery rate (Benjamini and Hochberg, 1995; Tusher and others, 2001), which is further discussed in Appendix A.2 of the supplementary material available at Biostatistics online. We use the optimal tuning parameters under the alternative hypothesis to solve under the null hypothesis. This decision was made primarily for computational reasons: the extra constraints (C1) and (C2) required under the null make the problem more computationally costly to solve. In particular, the specialized algorithm given in Section 4 for solving under the alternative no longer works under the null, and a general-purpose convex solver is used instead. In our experience, penalty parameters selected via cross-validation under the alternative perform well under the null in practice. Nevertheless, there may be problems where this is not appropriate, and performing cross-validation on the null is required for good performance.

### 3.1. Conservativeness of permutation tests

In the particular case that $$\theta_0(x) = \theta_1(x)$$ for all $$x$$ under the null, our permutation test evaluates an exact $$p$$-value based on the conditional distribution of our statistic. However, our (one-sided) null only requires that $$\theta_0(x) \leq \theta_1(x)$$ for all $$x$$. In this case, permutations appear to result in a conservative $$p$$-value. To see this, consider the situation wherein the null is true, and the entire $$\theta_1$$ curve falls well above (or well below) the $$\theta_0$$ curve. In this case the true null distribution of $$Z$$ will have significant mass at exactly $$0$$.
However, by permuting we treat both $$\theta_0$$ and $$\theta_1$$ as the average of the two curves, making it much more likely for our empirical estimates to cross, and for our permuted statistic to be positive (i.e. for the fit under the alternative to be a superior fit to the data). We provide some theoretical evidence of conservativeness in Appendix A.3 of the supplementary material available at Biostatistics online for the simplified setting wherein we estimate our conditional means using piecewise-constant functions with prespecified knots and assume our outcomes are Gaussian distributed with equal variance in each treatment arm. 4. Computing We implement our method in Python using a 1D fused lasso penalty. The GitHub repository https://github.com/jhroth/data-example contains a working example of applying our method to data from a randomized controlled trial investigating a breast cancer treatment (Prat and others, 2014). The data are publicly available on the Gene Expression Omnibus with accession number GSE50948 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50948). We describe the GSE50948 dataset in more detail in Section 5. 4.1. Solving under $$H_0$$ We solve the optimization problem under $$H_0$$ in (2.4) for either a binary response (using a logistic loss function) or a continuous response (using a squared-error loss function) with CVXPY, a general solver for convex optimization problems implemented in Python (Diamond and others, 2014). Our method calls the splitting conic solver (SCS) (O’Donoghue and others, 2016a,b) when using the 1D fused lasso penalty. 4.2. Solving under $$H_A$$ If the response variable is continuous, we solve under $$H_A$$ using a C implementation of an efficient fused lasso solver that uses a dynamic programming approach (Johnson, 2013) when we have one feature, or using CVXPY with the SCS solver (in an additive model) when we have more than one feature. 
For a binary response variable, we use logistic loss and apply a generalized gradient descent algorithm (Parikh and Boyd, 2014) that leans on the solver from Johnson (2013). We give details on the derivation of this algorithm in Appendix A.1 of the supplementary material available at Biostatistics online. For a binary response variable and more than one feature, we again fit an additive model and call the SCS (O’Donoghue and others, 2016a,b). For multivariate problems, our use of a general-purpose solver can result in long runtimes: in the worst case, for the examples in this manuscript, we found that a single model fit can take several minutes. This is not prohibitive, as fits for various features/permutations can be run in parallel. In particular, the simulation and data examples in this manuscript were run on Amazon Web Services (aws.amazon.com). In addition, we plan to develop a custom first-order solver (leveraging sparsity) to further alleviate this issue. 5. Data example: GSE50948 The code used to produce and plot the results for the data example is available in the GitHub repository https://github.com/jhroth/data-example. We use our methodology to evaluate expression markers from the Prediction Analysis of Microarray 50 (PAM50) diagnostic test (Parker and others, 2009) for use in characterizing those clinically defined HER2+ breast cancer patients (defined by pathology/copy number) who are likely to benefit from trastuzumab. Between 20 and 30% of breast cancer cases are characterized by over-expression of human epidermal growth factor receptor type 2 (HER2); these are known as HER2+ breast cancers. Trastuzumab, an intravenous antibody therapy that targets HER2, has been shown to lengthen disease-free survival for women with HER2+ breast cancer (Joensuu and others, 2006) and is now part of the standard-of-care chemotherapy regimen for HER2+ breast cancer patients (Hudis, 2007). 
PAM50 is a diagnostic test that uses the expression levels of 50 genes to classify cancers into 5 molecular subtypes. One of these subtypes is “HER2-enriched” (HER2-E). While the clinical and PAM50-based molecular classifications of HER2+/$$-$$ often agree, they do not always. Prat and others (2014) find evidence that clinically HER2+ breast cancer patients classified to the HER2-E subtype by PAM50 have a higher rate of 3-year pathologic complete response (pCR) after receiving trastuzumab and chemotherapy, compared to chemotherapy without trastuzumab. No statistically significant differences in pCR between treatment arms were found for those clinically HER2+ patients classified by PAM50 (in contradiction with their clinical status) as having any of the other four cancer subtypes. Here, pCR is defined as the absence of residual invasive breast cancer at the end of the 3-year period. We reanalyze the publicly available subset of data from the NeoAdjuvant Herceptin (NOAH) randomized clinical trial (Gianni and others, 2010), which can be accessed in the Gene Expression Omnibus under the accession number GSE50948 (Edgar and others, 2002). We explore which, if any, of the PAM50 genes can be used to characterize benefit from trastuzumab in clinically HER2+ patients. The NOAH study population consisted of 334 women, 235 of whom had clinically HER2+ breast cancer. Clinically HER2+ patients in the NOAH trial were randomized either to receive trastuzumab in addition to neoadjuvant chemotherapy ($$T=1$$) or to receive neoadjuvant chemotherapy alone ($$T=0$$). Gene expression levels that passed quality control are available for the baseline tumor biopsies of 111 HER2+ patients. Our response variable of interest is a binary indicator of 3-year pCR. In the $$T=0$$ group, 25.5% of participants experienced pCR after 3 years; in the treatment group, 46.7% experienced pCR after 3 years. 
This is consistent with the finding in Gianni and others (2010) that trastuzumab was beneficial overall for women with clinically HER2+ breast cancer. There is no mention of censoring in Prat and others (2014), and we use logistic loss for this analysis. The supplementary materials in Prat and others (2014) list 9 unique probes used to represent the PAM50 genes classifying patients to the HER2-E subtype; we use combinations of these 9 expression levels as the candidate biomarkers for our data example. First, we use our method to fit a univariate model with each of the 9 probes as the single feature. We also fit our method with 36 additive models, as described in Section 2.6, where each possible combination of two probes is used as the feature set. We compute one observed and 100 permuted test statistics, as described in Section 3, for each of the 9 univariate models and 36 2D additive models. We estimate a false discovery rate (FDR) as described in Appendix A.2 of the supplementary material available at Biostatistics online. Figure 3 shows the fitted values for the single probe (ACTR3B) in the univariate model with the lowest FDR estimate (0.30). Figure 3 suggests that trastuzumab only improves pCR for HER2+ patients with low levels of ACTR3B, but the associated FDR estimate (0.30) is relatively high. For comparison, Figure 4 shows the predicted values for both treatment groups from a logistic regression where the response variable is pCR and the features are ACTR3B, $$T$$, and $$(\text{ACTR3B} \times T)$$. The estimated response-curves from linear-logistic regression do not cross each other for any value of ACTR3B in the sample range. Fig. 3. Predicted values from our method with ACTR3B as the feature (estimated FDR$$\,=\,$$0.30). 
The flatter line corresponds to the treatment group. Fig. 4. Predicted values from logistic regression with ACTR3B, T, and $$(\text{ACTR3B} \times {\rm T})$$ as the features. The curve for the treatment group lies above the curve for the control group. Figure 5 shows a surface plot for the 2-probe combination from the additive model (ACTR3B and BAG1), which had an estimated FDR of 0.15. Figure 5 suggests that HER2+ breast cancer patients with relatively low expression levels of ACTR3B and low levels of BAG1 may not benefit from trastuzumab, while all other patients do benefit. If this finding is validated in independent datasets, an implication is that clinically HER2+ patients who have low levels of ACTR3B and low levels of BAG1 can avoid treatment with trastuzumab. Fig. 5. Fitted surfaces for the additive model with ACTR3B and BAG1 as features (estimated FDR$$\,=\,$$0.15). The flatter surface corresponds to the treatment group. 6. Simulations 6.1. Simulation scenarios We simulate data from 12 different scenarios across a range of signal-to-noise ratios (SNRs), using $$n=100$$ observations and $$100$$ permutations for testing. Figure 6 shows the simulated response variable as a function of the candidate biomarker for an SNR of 1.5. Fig. 6. The 12 simulation scenarios with $${\text{SNR}} = 1.5$$. 
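One of the mean-shift scenarios (first row of Figure 6) can be generated with the short sketch below, which scales the noise variance to a target SNR as in the recipe of this section. The sinusoidal $$f$$, the shift $$\delta = 0.6$$, and all names here are illustrative rather than taken from our released code.

```python
# Sketch of a mean-shift simulation scenario: y_i = f(x_i) - delta*t_i + eps_i,
# with eps_i ~ N(0, nu^2) and nu^2 = Var-hat(f(X) - delta*T) / SNR.
import numpy as np

def simulate_mean_shift(n=100, snr=1.5, delta=0.6, seed=0,
                        f=lambda x: np.sin(2 * np.pi * x)):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n)        # candidate biomarker
    t = rng.binomial(1, 0.5, n)         # randomized treatment indicator
    signal = f(x) - delta * t           # treatment curve is a mean shift
    nu2 = np.var(signal) / snr          # noise variance for the target SNR
    y = signal + rng.normal(0.0, np.sqrt(nu2), n)
    return x, t, y
```

Swapping $$f$$ for a linear, piecewise-linear, or piecewise-constant function gives the other three mean-shift scenarios.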
In every scenario we simulate candidate biomarker $$x_i \sim \text{Uniform}(0, 1)$$ and treatment indicator $$t_i \sim \text{Bernoulli}(0.5)$$ for $$i=1, \ldots, n$$. In the four mean-shift simulation scenarios (first row of Figure 6), the response-curves for the treatment groups are mean shifts of each other and we simulate the response variable as \begin{equation*} y_i \mid t_i = f(x_i) - \delta t_i + \epsilon_i, \end{equation*} for $$i=1, \ldots, n$$, where we chose $$\delta=0.6$$ and used noise parameter $$\nu^2 = \frac{\widehat{{\text{Var}}}(f(X) - \delta T)}{{\text{SNR}}}$$ to generate $$\epsilon_i \sim N(0, \nu^2)$$. The function $$f$$ is either linear, piecewise linear, piecewise constant, or sinusoidal. In the second row of Figure 6 (non-qualitative interaction), we simulate the response variable as \begin{equation*} y_i \mid t_i = f_0(x_i) t_i + f_1(x_i) (1 - t_i) + \epsilon_i t_i + \gamma_i (1-t_i) \end{equation*} for $$i=1, \ldots, n$$, where we specify noise parameters $$\nu_0^2 = \frac{\widehat{{\text{Var}}}(f_0(x))} {{\text{SNR}}}$$ and $$\nu_1^2 = \frac{\widehat{{\text{Var}}}(f_1(x))} {{\text{SNR}}}$$ to generate $$\gamma_i \sim N(0, \nu_0^2)$$ and $$\epsilon_i \sim N(0, \nu_1^2)$$. The functions $$f_0$$ and $$f_1$$ yield linear, piecewise linear, piecewise constant, or sinusoidal response-curves for the treatment groups that do not cross each other. In the last row of Figure 6 (qualitative interaction), the population response-curves for the treatment groups cross each other and we generate \begin{equation*} y_i \mid t_i = f_1(x_i) t_i + f_0(x_i) (1 - t_i) + \epsilon_i \end{equation*} for $$i=1, \ldots, n$$, where we specify noise parameter $$\nu^2 = \frac{\sum_{i=1}^n \left(\left(f_0(x_i) - f_1(x_i)\right)^+\right)^2} {{\text{SNR}}}$$ to generate $$\epsilon_i \sim N(0, \nu^2)$$. 
The functions $$f_0$$ and $$f_1$$ yield linear, piecewise linear, piecewise constant, or sinusoidal response-curves for the treatment groups that cross each other. 6.2. Simulation results 6.2.1. Comparison to linear regression Let $$\mathcal{X} \subset {\mathbb{R}}$$ be the set of values of candidate biomarker $$X$$ that have clinical relevance. Then a simple alternative to our testing procedure is fitting the linear regression $$E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i + \beta_2 t_i + \beta_3 (x_i \times t_i)$$. Under this model, a qualitative interaction occurs if there exist $$x, x', x'' \in \mathcal{X}$$ such that \begin{equation*} \beta_2 + \beta_3 x = 0 , \hspace{2ex} \beta_2 + \beta_3 x' < 0, \hspace{2ex} \beta_2 + \beta_3 x'' > 0, \end{equation*} so we are testing \begin{equation*} H_0: \text{there do \textit{not} exist } x, x', x'' \in \mathcal{X} \text{ such that } x = -\frac{\beta_2}{\beta_3},\; x' < -\frac{\beta_2}{\beta_3},\; x'' > -\frac{\beta_2}{\beta_3}. \end{equation*} We derive our procedure for testing this hypothesis in Appendix A.4 of the supplementary material available at Biostatistics online. Figure 7 plots the share of 1000 replications where the null hypothesis was rejected, as a function of SNR. Our method is shown in squares and the procedure based on linear regression is shown in crosses. Rejecting $$H_0$$ in either of the first two rows is a Type I error, while it is the correct decision in the third row. For these simulation scenarios, Figure 7 suggests that both methods have Type I error rates below the p-value threshold of 0.05. The procedure based on linear regression has higher power than our method only when the truth is linear; for the other three truth types, our method offers a substantial improvement in power. Fig. 7. 
Share of 1000 simulations with a p-value below 0.05, as a function of 30 SNR values across each of the 12 scenarios (the permutation-based p-value $$p^*$$ from our method is shown in squares; p-values from the procedure based on linear regression are shown in crosses). The observed conservativeness of our test under the null is in line with the discussion in Section 3.1. When the null is true and the response-curves for the treatment groups are well separated, the observed test statistic has a large point mass at $$0$$; a permuted test statistic, on the other hand, is much more likely to be positive in this setting because it is based on response-curves that are averaged across the two treatment groups and thus are more likely to cross. As the signal-to-noise ratio increases, we would expect the observed test statistic to have more mass at $$0$$ (while the permuted test statistic is still likely to be positive) and thus for the conservativeness to increase. One potential saving grace is that at the boundary of our null hypothesis space we have $$\theta_0(x) = \theta_1(x)$$ for all $$x$$, in which case our test is no longer conservative. We also attempted to address this conservatism: we considered generating null samples using a parametric bootstrap from our null-constrained estimates as an alternative to permutations. However, we found that this approach was anti-conservative. Since we used the fused lasso as the basis for our estimation, the fitted values for each treatment group are piecewise constant across the range of the feature. The number and location of the knots are chosen data-adaptively instead of being specified beforehand. 
A simple alternative to this approach would be to pre-specify a fixed number of evenly spaced knots and use the group means in each region as the fitted values. Appendix A.5 of the supplementary material available at Biostatistics online compares the power and Type I error of our method to this alternative. 7. Conclusion To test whether treatment is either uniformly superior or uniformly inferior for all patients, we propose a convex, regression-based test for qualitative interactions. We implement our method in Python and share an example of using it to search for a subset of HER2+ breast cancer patients who benefit from adjuvant chemotherapy. Our work has several notable limitations. One is the long runtime of our Python implementation due to its reliance on a general convex solver. In future work that focuses on computation, we would like to devise a custom solver to work more efficiently with additive models. In addition, our current work only tests a global null hypothesis. The estimation stage of our framework suggests a subset of patients who benefit from treatment (i.e. the values of the candidate biomarker(s) for which the estimated response in the treatment group is superior to the estimated response in the control group), but no formal significance test of the average treatment effect is run in that subgroup. Furthermore, the testing stage of our framework assumes pre-specified biomarker candidates and does not perform variable selection among a set of potential candidates. Although our current implementation estimates the response variable as a piecewise-constant function of each continuous feature (either alone in a univariate model or as part of an additive model) and tests for qualitative interaction, we presented a general underlying framework that accommodates alternative penalty functions that induce different structures on the fitted values. 
The constraints in the optimization framework can also be modified to test whether the population average benefit from treatment is at least a pre-specified amount. Supplementary material Supplementary material is available at http://biostatistics.oxfordjournals.org. Acknowledgments Conflict of Interest: None declared. Funding Noah Simon was supported by NIH Grant DP5OD019820. Both authors were supported by an AWS in Education Research Grant award. References Benjamini Y. and Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 289–300. Boyd S. and Vandenberghe L. (2004). Convex Optimization. Cambridge University Press, Cambridge, United Kingdom. Cai T., Tian L., Wong P. H. and Wei L. J. (2011). Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 12, 270–282. Chang M., Lee S. and Whang Y.-J. (2015). Nonparametric tests of conditional treatment effects with an application to single-sex schooling on academic achievements. The Econometrics Journal 18, 307–346. Crump R. K., Hotz V. J., Imbens G. W. and Mitnik O. A. (2008). Nonparametric tests for treatment effect heterogeneity. The Review of Economics and Statistics 90, 389–405. Diamond S., Chu E. and Boyd S. (2014). CVXPY: A Python-Embedded Modeling Language for Convex Optimization, Version 0.2. http://cvxpy.org/. Edgar R., Domrachev M. and Lash A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30, 207–210. Gail M. and Simon R. (1985). Testing for qualitative interactions between treatment effects and patient subsets. 
Biometrics 41, 361–372. Gianni L., Eiermann W., Semiglazov V., Manikhas A., Lluch A., Tjulandin S., Zambetti M., Vazquez F., Byakhow M., Lichinitser M. and others (2010). Neoadjuvant chemotherapy with trastuzumab followed by adjuvant trastuzumab versus neoadjuvant chemotherapy alone, in patients with HER2-positive locally advanced breast cancer (the NOAH trial): a randomised controlled superiority trial with a parallel HER2-negative cohort. The Lancet 375, 377–384. Hastie T. and Tibshirani R. (1990). Generalized Additive Models, Volume 43. CRC Press. Hastie T., Tibshirani R. and Friedman J. (2008). The Elements of Statistical Learning, 2nd edition. New York: Springer. Hudis C. A. (2007). Trastuzumab mechanism of action and use in clinical practice. New England Journal of Medicine 357, 39–51. Joensuu H., Kellokumpu-Lehtinen P.-L., Bono P., Alanko T., Kataja V., Asola R., Utriainen T., Kokko R., Hemminki A., Tarkkanen M. and others (2006). Adjuvant docetaxel or vinorelbine with or without trastuzumab for breast cancer. New England Journal of Medicine 354, 809–820. Johnson N. (2013). A dynamic programming algorithm for the fused lasso and $$l_0$$-segmentation. Journal of Computational and Graphical Statistics 22, 246–260. Kang C., Janes H. and Huang Y. (2014). Combining biomarkers to optimize patient treatment recommendations. Biometrics 70, 695–707. Moodie E. E. M., Dean N. and Sun Yue R. (2014). Q-learning: flexible learning about useful utilities. Statistics in Biosciences 6, 223–243. O’Donoghue B., Chu E., Parikh N. and Boyd S. (2016a). Conic optimization via operator splitting and homogeneous self-dual embedding. 
Journal of Optimization Theory and Applications 169, 1042–1068. O’Donoghue B., Chu E., Parikh N. and Boyd S. (2016b). SCS: Splitting Conic Solver, Version 1.2.6. https://github.com/cvxgrp/scs. Orellana L., Rotnitzky A. and Robins J. M. (2010). Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. The International Journal of Biostatistics 6(2), 1–47. Pan G. and Wolfe D. A. (1997). Test for qualitative interaction of clinical significance. Statistics in Medicine 16, 1645–1652. Parikh N. and Boyd S. (2014). Proximal algorithms. Foundations and Trends in Optimization 1, 127–239. Parker J. S., Mullins M., Cheang M., Leung S., Voduc D., Vickery T., Davies S., Fauron C., He X., Hu Z. and others (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology 27, 1160–1167. Peto R. (1982). Statistical aspects of cancer trials. In: Treatment of Cancer. Chapman and Hall, London, United Kingdom. Prat A., Bianchini G., Thomas M., Belousov A., Cheang M. C., Koehler A., Gómez P., Semiglazov V., Eiermann W., Tjulandin S. and others (2014). Research-based PAM50 subtype predictor identifies higher responses and improved survival outcomes in HER2-positive breast cancer in the NOAH study. Clinical Cancer Research 20, 511–521. Robins J., Orellana L. and Rotnitzky A. (2008). Estimation and extrapolation of optimal treatment and testing strategies. Statistics in Medicine 27, 4678–4721. Tibshirani R., Saunders M., Rosset S., Zhu J. and Knight K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B 67, 91–108. 
Tibshirani R. J. (2013). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics 42, 285–323. Tusher V. G., Tibshirani R. and Chu G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 98, 5116–5121. Zhang Y., Laber E. B., Tsiatis A. and Davidian M. (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics 71, 895–904. Zhao Y., Kosorok M. R. and Zeng D. (2009). Reinforcement learning design for cancer clinical trials. Statistics in Medicine 28, 3294–3315. Zhao Y., Zeng D., Rush A. J. and Kosorok M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118. Zhao Y.-Q., Zeng D., Laber E. B., Song R., Yuan M. and Kosorok M. R. (2015). Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102, 151–168. Zhao Y., Zeng D., Socinski M. A. and Kosorok M. R. (2011). Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics 67, 1422–1433. © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.


Biostatistics, Advance Article, Aug 24, 2017. 18 pages. Publisher: Oxford University Press. DOI: 10.1093/biostatistics/kxx038. ISSN 1465-4644; eISSN 1468-4357.


Qualitative interactions stand in contrast to non-qualitative interactions, where the magnitude of the treatment effect varies across patient subsets but the direction of the treatment effect is the same in every subset (i.e. either all patients benefit or all patients do not benefit). 
Qualitative interactions are clinically meaningful because they identify a treatment strategy with the ability to improve patient outcomes. Figure 1 shows a hypothetical example of a qualitative interaction that can inform a treatment decision (left panel) and a non-qualitative interaction that cannot (right panel). If the response variable $$Y$$ is desirable (e.g. survival time), the left panel suggests that treatment is only beneficial for patients with values of candidate biomarker $$X$$ above about 1.7. Fig. 1. Mean of response variable Y, conditional on candidate biomarker X, in the treatment group (solid line) and control group (dashed line). We propose a regression-based framework that tests for qualitative interaction without assuming a linear mean model or requiring a candidate biomarker to be divided into risk strata beforehand. To do this, we translate the general settings of qualitative and non-qualitative interaction into convex optimization problems that data-adaptively estimate the control and treatment groups’ underlying trends in treatment response over the range of a candidate biomarker. We form a simple likelihood-based test statistic and evaluate its significance using a permutation test. We emphasize that our framework does not directly identify a subpopulation of patients who should receive treatment; rather, our method provides a global test of the null hypothesis that treatment is either uniformly beneficial or uniformly not beneficial across the ordered values of a candidate biomarker. In the convex problem that allows for qualitative interaction, the values of the candidate biomarker at which the estimated response-curves for the treatment groups cross each other suggest a subset of patients who benefit from treatment. 
However, no formal test of the average treatment effect is run in that subset. Section 2 outlines our general framework for estimation and testing. Section 3 describes our testing procedure in more detail. Section 4 briefly discusses the Python implementation of our method. We present results from a data example in Section 5 and simulation results in Section 6. Section 7 gives a brief conclusion. 2. Estimation and testing framework We begin with the formal definition of a qualitative interaction in the multivariate setting. Suppose we observe patients randomized to either a new treatment ($$T=1$$) or standard of care ($$T=0$$). On each patient we observe $$p$$ covariate values $${\bf x} \equiv \left(x_1,\ldots,x_p\right)$$ and a numeric response $$y$$. Suppose further that $$\left(y \mid x_1,\ldots,x_p, T\right) \sim L\left(\cdot, \theta\left({\bf x}, T\right)\right)$$, where $$L$$ is a known distribution parametrized by a single parameter, $$\theta\left({\bf x}, T\right)$$, that is an unknown function of the features and treatment assignment. Further suppose that, for any $$y$$, $$L(y,\theta)$$ is log-concave in $$\theta$$. In what follows we will rewrite $$\theta\left({\bf x}, T\right)$$ as $$\theta_T\left({\bf x}\right)$$. In addition, let $$G$$ denote the joint distribution of $${\bf x}$$ and let $$\operatorname{supp}(G) = \left\{ {\bf x} : G({\bf x}) \neq 0\right\}$$ denote the support of $$G$$. 
We are interested in testing the null hypothesis: \begin{align*} H_0 & \text{(no qualitative interaction)}:\\ &\text{ Either } \forall {\bf x}\in\operatorname{supp}(G),\, \theta_1\left({\bf x}\right) \leq \theta_0\left({\bf x}\right) \text{ OR } \forall {\bf x}\in\operatorname{supp}(G),\, \theta_0\left({\bf x}\right) \leq \theta_1\left({\bf x}\right)\\ \hspace{2ex} & \qquad \text{versus}\\ \hspace{2ex} H_A &\text{(qualitative interaction)}:\\ &\exists {\bf x}_1, {\bf x}_0\in\operatorname{supp}(G) \text{ with } \theta_0\left({\bf x}_0\right) > \theta_1\left({\bf x}_0\right)\, \text{and } \theta_1\left({\bf x}_1\right) > \theta_0\left({\bf x}_1\right). \end{align*} Often, rather than a bidirectional null hypothesis, we will instead be interested in testing whether new treatment is more effective than control for any subset of patients. If higher values of $$y$$ are more beneficial, then in this case $$H_0$$ reduces to: $$\forall {\bf x}\,\, \theta_1\left({\bf x}\right) \leq \theta_0\left({\bf x}\right)$$; and the alternative hypothesis is correspondingly changed. We note that this tests a global null hypothesis (that one response curve lies everywhere above the other). Rejecting this null does not indicate where the curves cross. The testing procedure we will propose does give an estimate of the subset of patients for whom treatment is more effective than control (and vice versa). However, we do not run a formal statistical test of average treatment effect in those subsets. For testing this hypothesis, suppose we have data on $$n_0 + n_1 = n$$ independently drawn individuals, where $$n_0$$ individuals are randomized to the control group and $$n_1$$ to the treatment group. Let $${\mathcal{X}} \in {\mathbb{R}}^{n \times p}$$ be a matrix of $$p$$ features observed for the $$n$$ individuals and $$Y \in {\mathbb{R}}^n$$ be our vector of responses.
We will let $${\bf x}_{i, \cdot}$$ denote the $$i$$-th row of $${\mathcal{X}}$$, $${\bf x}_{\cdot, j}$$ denote the $$j$$-th column of $${\mathcal{X}}$$, and $$x_{ij}$$ the $$(i,j)$$ entry of $${\mathcal{X}}$$. In what follows we will give a proposal for a general likelihood-ratio-based test statistic; the null distribution of this statistic will be calculated via permutation. 2.1. Forming the test statistic To form the standardized likelihood ratio statistic we must estimate $$\theta_T\left({\bf x}\right)$$ for $$T=0,1$$ under both the null and alternative hypotheses. Let $$\mathcal{L}\left(\cdot,\theta\right)$$ denote the negative log-likelihood of $$L(\cdot,\theta)$$ and recall that the deviance is defined as \begin{equation*} \mathcal{D}\left(y,\hat{\theta}\right) = 2\left[\mathcal{L}\left(y,\hat{\theta} \right) - \operatorname{min}_{\theta}\mathcal{L}\left(y,\theta\right)\right]. \end{equation*} This is just a standardized version of the negative log-likelihood (shifted to be non-negative). Our test statistic is $$\label{eq:LRstat} Z = \frac{\sum_{i = 1}^n\mathcal{D}\left(y_i,\hat{\theta}^{\rm{null}}_{T_i}({\bf x}_{i, \cdot})\right) - \sum_{i=1}^n\mathcal{D}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)}{\sum_{i=1}^n\mathcal{D}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)},$$ (2.1) where for $$T=0,1$$, $$\hat{\theta}^{{\rm{null}}}_T$$ and $$\hat{\theta}^{{\rm{alt}}}_T$$ are estimates under the null and alternative hypotheses respectively. In the case of Gaussian data with fixed variance, we have $$Z = \frac{\sum_i \left(y_i - \hat{\theta}^{{\rm{null}}}_{T_i}({\bf x}_{i, \cdot})\right)^2 - \sum_i \left(y_i - \hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)^2}{\sum_i \left(y_i - \hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)^2}$$. This is just a scaled $$F$$-statistic.
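In the Gaussian case, $$Z$$ can be computed directly from residual sums of squares. A minimal sketch (the function name is ours, for illustration, and is not part of the paper's implementation):

```python
import numpy as np

def gaussian_z_statistic(y, fitted_null, fitted_alt):
    """Deviance-ratio statistic (2.1) under squared-error loss.

    For Gaussian data with fixed variance the deviance reduces to the
    residual sum of squares, so Z is a scaled F-statistic."""
    y = np.asarray(y, dtype=float)
    rss_null = np.sum((y - np.asarray(fitted_null, dtype=float)) ** 2)
    rss_alt = np.sum((y - np.asarray(fitted_alt, dtype=float)) ** 2)
    return (rss_null - rss_alt) / rss_alt
```

Because the null fit is a constrained version of the alternative fit, the numerator (and hence $$Z$$) is non-negative whenever both problems are solved to optimality with the same tuning parameters.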
For logistic loss, $$\operatorname{min}_{\theta}\mathcal{L}\left(y,\theta\right) = 0$$, so we have $$Z= \frac{\sum_{i = 1}^n\mathcal{L}\left(y_i,\hat{\theta}^{{\rm{null}}}_{T_i}({\bf x}_{i, \cdot})\right) - \sum_{i=1}^n\mathcal{L}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)}{\sum_{i=1}^n\mathcal{L}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)}$$. In order to control Type I error in our permutation-based testing procedure, it is important for the denominator of $$Z$$ to remain positive; the deviance-based test statistic in (2.1) achieves this. In estimating $$\hat{\theta}^{{\rm{null}}}_T$$ and $$\hat{\theta}^{{\rm{alt}}}_T$$, we often do not want to make parametric assumptions. In what follows we will discuss nonparametric estimation of $$\hat{\theta}^{{\rm{null}}}_T$$ and $$\hat{\theta}^{{\rm{alt}}}_T$$ under general shape constraints. 2.2. A convex formulation for estimating $$\hat{\theta}^{{\rm{null}}}$$ and $$\hat{\theta}^{{\rm{alt}}}$$ Our approach leverages penalized likelihood methods (Hastie and others, 2008) to estimate our parameters. In the more traditional, single-class setting where $$y_i \sim L\left(\cdot,\theta\left({\bf x}_{i, \cdot}\right)\right)$$, the penalized likelihood approach solves the problem: $$\label{eq:pen} \hat{\theta} \gets \operatorname{argmin}_{\theta} \sum_i \mathcal{L}\left(y_i, \theta\left({\bf x}_{i, \cdot}\right)\right) + \lambda P\left(\theta\right),$$ (2.2) where $$P$$ is a structure-inducing penalty and $$\lambda > 0$$ is a tuning parameter that trades off between enforcing goodness of fit and structure. $$P$$ is chosen based on the type of smoothness or structure one expects: Sobolev-type penalties, $$P(\theta) = \left \|\theta^{(k)}\right\|^2$$, can be used to induce general smoothness; total-variation/trend-filtering penalties, in 1D, can be used to obtain piecewise polynomial fits.
Other penalties, based on the convex indicator (defined as $$I(z\in S) \equiv \infty\cdot\delta\left\{z\not\in S\right\}$$ where $$\delta(\cdot)$$ is the usual $$0,1$$ indicator function) can enforce other structure such as monotonicity $$I(\theta \text{ non-decreasing})$$, a parametric form $$I\left(\theta \in \operatorname{span}\left\{\psi_1,\ldots, \psi_K\right\}\right)$$ for prespecified functions $$\psi_1,\ldots,\psi_K$$, or additivity $$I\left(\theta({\bf x}) = \sum_{j=1}^p\phi_j\left(x_j\right) \text{ for some }\phi_1,\ldots,\phi_p \right)$$. In the additive case, we can combine this convex indicator with smoothness-type penalties on each individual component. In all of these cases, because $$\mathcal{L}$$ is convex, the problem in (2.2) is convex (Boyd and Vandenberghe, 2004). We now extend this to our two-class framework. 2.3. Estimation of $$\hat{\theta}^{{\rm{alt}}}$$ For the unrestricted model under $$H_A$$, we can use the penalized likelihood framework to jointly solve for $$\hat{\theta}^{{\rm{alt}}}_0$$ and $$\hat{\theta}^{{\rm{alt}}}_1$$. We optimize $$\label{eq:optimization_alternative} \hat{\theta}^{{\rm{alt}}}_1,\hat{\theta}^{{\rm{alt}}}_0 \gets \underset{\theta_1,\theta_0}{\operatorname{argmin}} \sum_{T_i = 1}\mathcal{L}\left(y_i,\theta_1({\bf x}_{i, \cdot})\right) + \lambda_1 P\left(\theta_1\right) + \sum_{T_i = 0}\mathcal{L}\left(y_i,\theta_0({\bf x}_{i, \cdot})\right) + \lambda_0 P\left(\theta_0\right).$$ (2.3) We note that this can be minimized separately in $$\theta_1$$ and $$\theta_0$$, requiring us merely to solve two problems of the form (2.2). The tuning parameters $$\lambda_0$$ and $$\lambda_1$$ determine the amount of regularization; we use split-sample/cross-validation (as described in Section 3) to choose these tuning parameters. 2.4. Estimation of $$\hat{\theta}^{{\rm{null}}}$$ Estimating the null-restricted parameters is more difficult; here we aim to constrain our estimates not to cross on the support of $$G$$.
Because $$G$$ is generally unknown, we use instead the empirical distribution of $${\bf x}$$, imposing the constraint that either $$\hat{\theta}_0\left({\bf x}_{i, \cdot}\right) \leq \hat{\theta}_1\left({\bf x}_{i, \cdot}\right)$$ for all $$i=1,\ldots,n$$ or the mirror $$\hat{\theta}_0\left({\bf x}_{i, \cdot}\right) \geq \hat{\theta}_1\left({\bf x}_{i, \cdot}\right)$$ for all $$i=1,\ldots,n$$ holds. Thus we solve the following penalized, constrained, maximum likelihood problem: \begin{align} \hat{\theta}^{{\rm{null}}}_1,\hat{\theta}^{{\rm{null}}}_0 \gets \underset{\theta_1,\theta_0}{\operatorname{argmin}} &\sum_{T_i =1}\mathcal{L}\left(y_i,\theta_1({\bf x}_{i, \cdot})\right) + \lambda_1 P\left(\theta_1\right) + \sum_{T_i = 0}\mathcal{L}\left(y_i,\theta_0({\bf x}_{i, \cdot})\right) + \lambda_0 P\left(\theta_0\right)\\ \end{align} (2.4) \begin{align} s.t. \quad&\,\, {{\bf Either }}\, \theta_0\left({\bf x}_{i, \cdot}\right) \leq \theta_1\left({\bf x}_{i, \cdot}\right) \textrm{ for all } i=1,\ldots,n,\\ \end{align} (C1) \begin{align} &\,\, {{\bf Or }}\,\theta_0\left({\bf x}_{i, \cdot}\right) \geq \theta_1\left({\bf x}_{i, \cdot}\right) \textrm{ for all } i=1,\ldots,n. \qquad\qquad \end{align} (C2) This is not a convex problem because the constraint is not convex. However, (C1) or (C2) alone is a convex constraint. Because the Either/Or condition amounts to optimizing over the union of the sets defined by (C1) and (C2), we can still find the global optimum quite simply. We merely solve (2.4) with only constraint (C1) and again with only constraint (C2), and then choose, from those two candidate solutions, the one with the lower objective value. Under either single constraint, (2.4) is a convex problem and can be solved efficiently (Boyd and Vandenberghe, 2004). It should be noted that the null hypothesis is only constrained to have no qualitative interaction at the observed data points, rather than to have no qualitative interaction over the entire design space.
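To make the Either/Or strategy concrete, consider the simplest special case: squared-error loss, no smoothing penalty ($$\lambda_0 = \lambda_1 = 0$$), and one observation per group at each shared biomarker value. Each single-constraint problem then has a closed form (a pair that violates the constraint pools to its average), and we keep whichever direction attains the lower objective. This toy sketch, with function names of our own choosing, illustrates the strategy only and is not the solver used in the paper:

```python
def fit_one_direction(y0, y1, control_below):
    """Pointwise squared-error fit of (theta0, theta1), subject to
    theta0 <= theta1 at every point (C1) or theta0 >= theta1 (C2).
    Where the raw responses violate the constraint, the optimum sets
    both fitted values to the pair's average."""
    t0, t1 = list(y0), list(y1)
    for i in range(len(y0)):
        violated = t0[i] > t1[i] if control_below else t0[i] < t1[i]
        if violated:
            t0[i] = t1[i] = 0.5 * (y0[i] + y1[i])
    obj = sum((a - b) ** 2 for a, b in zip(y0, t0))
    obj += sum((a - b) ** 2 for a, b in zip(y1, t1))
    return (t0, t1), obj


def fit_null(y0, y1):
    """Solve under (C1) and under (C2); keep the lower objective."""
    fit_c1, obj_c1 = fit_one_direction(y0, y1, control_below=True)
    fit_c2, obj_c2 = fit_one_direction(y0, y1, control_below=False)
    return fit_c1 if obj_c1 <= obj_c2 else fit_c2
```

With a smoothing penalty or unshared biomarker values the closed form disappears, and a convex solver (as described in Section 4) is needed for each direction; the select-the-lower-objective step is unchanged.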
In cases where minimal smoothness is assumed (particularly with a single feature), fitted values at unobserved data points will be close to a simple average of the fitted values at nearby observed data points. Here, without qualitative interactions at the observed data points, we would not expect qualitative interactions to occur for unobserved points within the design space. When more structure is assumed, such as in the additive model framework discussed in Section 2.6, fits under the null may actually allow qualitative interactions for unobserved combinations of covariates in the design space. That is, the null model permits more flexibility than it should under the assumption of no qualitative interaction anywhere in the design space, and consequently our testing procedure will be conservative. If the covariate values from our sample are representative of the population, then any qualitative interaction permitted under the null (at unobserved covariate combinations) would necessarily apply to a small subpopulation: Otherwise that subpopulation would have been represented in our sample. If observed features are not representative of the larger population, however, then the conservative nature of the method could be impactful—it may be that a covariate combination which is unobserved, and at which we allow a qualitative interaction under the null, is actually a significant segment of the population. Our test would be unable to detect that sub-population. An alternative way to form constraints under the null would be to discretize the entire design space, then apply the (C1) and (C2) constraints to each element of the grid; this is an interesting approach, but one we choose not to pursue primarily for computational reasons. 2.5. 
Structure induced on fitted values For some choices of penalty, (2.2) does not immediately estimate an entire function $$\theta$$; instead the function is only estimated at a finite list of specified $${\bf x}$$-values (often taken to be the observed values $${\bf x}_1,\ldots, {\bf x}_n$$). This is the case for the fused lasso (Tibshirani and others, 2005) and higher order $$\ell_1$$-trend filtering (Tibshirani, 2013), among others. There, the penalty generally only involves those specified $${\bf x}$$-values; for example, $$0$$th order trend filtering with a single covariate $$x$$, ordered such that $$x_1 < x_2 < \cdots < x_n$$, yields $$P(\theta) = \sum_{i=2}^n \left|\theta\left(x_i\right) - \theta\left(x_{i-1}\right)\right|$$. However, to estimate our fits under the null hypothesis we must solve (2.4); this problem has constraints that require estimation at all $$x$$-values observed under either treatment for each of $$\theta_1$$ and $$\theta_0$$. In this case we adjust the usual fused lasso estimator to include all $$x$$-values in the penalty. More specifically, if we reorder our data such that $$x_1 < x_2 < \cdots < x_n$$ then our optimization problem is: \begin{align*} \hat{\theta}^{\rm{null}}_1,\hat{\theta}^{\rm{null}}_0 \gets \underset{\theta_1,\theta_0}{\operatorname{argmin}} &\sum_{T_i =1}\mathcal{L}\left(y_i,\theta_1({\bf x}_{i, \cdot})\right) + \lambda_1 \sum_{i=2}^n \left|\theta_1\left(x_i\right) - \theta_1\left(x_{i-1}\right)\right| + \\ & \sum_{T_i = 0}\mathcal{L}\left(y_i,\theta_0({\bf x}_{i, \cdot})\right) + \lambda_0 \sum_{i=2}^n \left|\theta_0\left(x_i\right) - \theta_0\left(x_{i-1}\right)\right|\\ s.t.
\quad&\,\, {{\bf Either }}\,\,\theta_0\left({\bf x}_{i, \cdot}\right) \leq \theta_1\left({\bf x}_{i, \cdot}\right) \textrm{ for all } i=1,\ldots,n \qquad \text{(C1)}\\ &\,\, {{\bf Or }}\,\,\theta_0\left({\bf x}_{i, \cdot}\right) \geq \theta_1\left({\bf x}_{i, \cdot}\right) \textrm{ for all } i=1,\ldots,n.\quad\quad\quad\,\, \text{(C2)} \end{align*} Note that in fitting the alternative (2.3), for $$T=0$$ we need only include those $$x_i$$ with $$T_i = 0$$ in the penalty (and similarly for $$T=1$$). 2.6. Estimation with multiple features When $${\bf x}_{i, \cdot} \in {\mathbb{R}}^p$$ for $$p >1$$, we can implement our framework using an additive model (Hastie and Tibshirani, 1990; Hastie and others, 2008). Suppose $${\bf x}_{i, \cdot}$$ is partitioned into continuous and categorical variables as $${\bf x}_{i, \cdot} = \{{\bf x}_{i, \cdot}^{\text{cont.}}, {\bf x}_{i, \cdot}^{\text{cat.}}\}$$ with $${\bf x}_{i, \cdot}^{\text{cont.}} \in {\mathbb{R}}^q$$ and $${\bf x}_{i, \cdot}^{\text{cat.}} \in {\mathbb{R}}^s$$. Also write $$\lambda_j = \{\lambda_j^{\text{cont.}}, \lambda_j^{\text{cat.}}\}$$ for $$j=0, 1$$. Define \begin{align*} \Theta_1 &= \begin{bmatrix} \theta_{1, 1}^{\text{cont.}} & \cdots & \theta_{1, q}^{\text{cont.}} & \theta_{1, 1}^{\text{cat.}} & \cdots & \theta_{1, s}^{\text{cat.}} \end{bmatrix}, \end{align*} and similarly for $$\Theta_0$$. 
Under the alternative hypothesis, we would then solve \begin{align} \left(\widehat{\Theta}_1,\hat{\gamma}_1 \right) \gets \underset{\Theta_1,\gamma_1}{\operatorname{argmin}} \sum_{T_i = 1} & \mathcal{L}\left(y_i, \left[\gamma_1 + \sum_{j=1}^q \theta_{1, j}^{\text{cont.}}(x_{ij}^{\text{cont.}}) + \sum_{k=1}^s \theta_{1, k}^{\text{cat.}}(x_{ik}^{\text{cat.}}) \right] \right) + \nonumber \\ & \lambda_1^{\text{cont.}} \sum_{j=1}^q P_j^{\text{cont.}}\left(\theta_{1, j}^{\text{cont.}}\right) + \lambda_1^{\text{cat.}} \sum_{k=1}^s P_k^{\text{cat.}}\left(\theta_{1, k}^{\text{cat.}}\right) \nonumber, \\ \left(\widehat{\Theta}_0,\hat{\gamma}_0 \right) \gets \underset{\Theta_0,\gamma_0}{\operatorname{argmin}} \sum_{T_i = 0} & \mathcal{L}\left(y_i, \left[\gamma_0 + \sum_{j=1}^q \theta_{0, j}^{\text{cont.}}(x_{ij}^{\text{cont.}}) + \sum_{k=1}^s \theta_{0, k}^{\text{cat.}}(x_{ik}^{\text{cat.}}) \right] \right) + \nonumber \\ & \lambda_0^{\text{cont.}} \sum_{j=1}^q P_j^{\text{cont.}}\left(\theta_{0, j}^{\text{cont.}}\right) + \lambda_0^{\text{cat.}} \sum_{k=1}^s P_k^{\text{cat.}}\left(\theta_{0, k}^{\text{cat.}}\right) \nonumber, \end{align} where $$\gamma_0, \gamma_1 \in {\mathbb{R}}$$. To solve under the null hypothesis, we add the appropriate Either/Or conditions (C1) and (C2) as additional constraints. The fitted value under the alternative for the $$i$$th individual would then be $$\sum_{t=0}^1 I(T_i = t) \left[\gamma_t + \sum_{j=1}^q \theta_{t, j}^{\text{cont.}}(x_{ij}^{\text{cont.}}) + \sum_{k=1}^s \theta_{t, k}^{\text{cat.}}(x_{ik}^{\text{cat.}}) \right]. \label{eq:framework_fitted_mixture}$$ (2.5) There are many potential choices for the continuous and categorical penalties.
One choice that seems to perform well, which we use in our Python implementation, takes $$P_k^{\text{cat.}}$$ to be a ridge penalty and $$P_j^{\text{cont.}}$$ to be the 1D fused lasso penalty: \begin{align*} P_k^{\text{cat.}} \left(\theta_{t, k}^{\text{cat.}}\right) &= \sum_{T_i = t} \left[\theta_{t, k}^{\text{cat.}}(x_{ik}^{\text{cat.}})\right]^2, \\ P_j^{\text{cont.}} \left(\theta_{t, j}^{\text{cont.}}\right) &= {\left\lVert{D^{(0)}_{j, t} \left[\theta_{t, j}^{\text{cont.}}\left(x^{\text{cont.}}_{\cdot, j}\right) \right]}\right\rVert}_1, \end{align*} where $$D^{(0)}_{j, t}$$ returns the vector of first differences between the fitted values at consecutive ordered values of the feature $${\bf x}_{\cdot, j}$$, among observations in the $$T=t$$ group. 2.7. Minimum clinically relevant effect Rather than testing a null hypothesis of no crossing, we may instead be interested in testing a null of the form no clinically relevant crossing, where clinical relevance will be application-specific. For ease of exposition we will work with only the one-sided null/alternative wherein under the null we expect that outcomes under the new treatment ($$T=1$$) will fall largely or entirely below outcomes under the control ($$T=0$$). However, the opposite direction or the bidirectional null/alternative are also simple to test. In this case we can formalize our null hypothesis as: $$H_0: \int \left[\theta_1\left({\bf x}\right) - \theta_0\left({\bf x}\right) - \delta\right]_+ {\rm{d}}G\left({\bf x}\right) \leq C$$ for some prespecified $$\delta$$ and $$C \geq 0$$, where $$[z]_+ = z{\,\cdot\,}I(z \geq 0)$$. For $$\delta = 0$$ this null hypothesis states that, if the population is treated under the optimal rule, the population average benefit from new treatment over control is at most $$C$$. For $$C=0$$ and $$\delta \neq 0$$, this null hypothesis states that no patients in the population have expected benefit of greater than $$\delta$$ from new treatment over control.
For $$\delta = C = 0$$ we return to the original null hypothesis. To test this null we use the same formulation as above; now, however, we need estimates $$\hat{\theta}^{{\rm{null}}}_1$$, $$\hat{\theta}^{{\rm{null}}}_0$$ for a slightly different null. As before we use the empirical distribution of the covariates as a surrogate for $$G$$. We solve \begin{align} \hat{\theta}^{{\rm{null}}}_1,\hat{\theta}^{{\rm{null}}}_0 \gets \underset{\theta_1,\theta_0}{\operatorname{argmin}} &\sum_{T_i =1}\mathcal{L}\left(y_i,\theta_1({\bf x}_{i, \cdot})\right) + \lambda_1 P\left(\theta_1\right) + \sum_{T_i = 0}\mathcal{L}\left(y_i,\theta_0({\bf x}_{i, \cdot})\right) + \lambda_0 P\left(\theta_0\right)\\ \end{align} (2.6) \begin{align} s.t. \quad&\,\, \frac{1}{n}\sum_{i=1}^n \left[\theta_1\left({\bf x}_{i, \cdot}\right) - \theta_0\left({\bf x}_{i, \cdot}\right) - \delta \right]_+ \leq C. \end{align} (C1*) This is a convex problem that can be solved efficiently. As before, one could replace (C1*) with the analogous (C2*), or conduct a bidirectional test using an Either/Or constraint. 2.8. Selecting a smoothing method The choice of penalty function $$P$$ determines the type of smoothness our solutions will have.
We illustrate various candidates in Table 1. Table 1. Examples of smoothing methods (Method; $$P(\theta)$$): 1D fused lasso (Tibshirani and others, 2005), $$\sum_i \left| \theta_{i+1} - \theta_i \right|$$; $$1$$st order trend filtering (Tibshirani, 2013), $$|| D^{(x, 2)} \theta ||_1$$; $$2$$nd order trend filtering (Tibshirani, 2013), $$|| D^{(x, 3)} \theta ||_1$$; cubic smoothing splines (Hastie and others, 2008), $$\theta^T \Omega_N \theta$$. Our Python implementation uses the 1D fused lasso penalty. We make this choice because we have found that the 1D fused lasso penalty performs well empirically, and solutions to (2.4) and (2.3) using this penalty can be found efficiently and accurately in our implementation.
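The $$\ell_1$$ penalties in Table 1 are norms of discrete difference operators applied to the fitted values. A small sketch of how such an operator and penalty can be evaluated, assuming equally spaced $$x$$-values (the helper names are ours):

```python
import numpy as np

def difference_matrix(n, order):
    """Discrete difference operator on n fitted values.

    order=1 gives the 1D fused lasso penalty matrix; higher orders
    give the matrices used by trend filtering."""
    D = np.eye(n)
    for _ in range(order):
        D = D[1:, :] - D[:-1, :]  # repeated first differences
    return D

def l1_smoothness_penalty(theta, order=1):
    """P(theta) = || D theta ||_1 for equally spaced x values."""
    return float(np.abs(difference_matrix(len(theta), order) @ theta).sum())
```

For unequally spaced $$x$$-values, trend filtering rescales the difference operator by the gaps between consecutive points; we omit that detail here.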
Figure 2 shows example solutions to (2.4) and (2.3) for the fused lasso with a continuous response where we have simulated noisy data from two crossing sinusoidal functions. We see that, in this simple example, the fused lasso fits effectively detect and localize the crossings. Fig. 2. Example piecewise-constant solutions to (2.3) and (2.4) using the 1D fused lasso. The crossing sinusoidal curves show the true relationship between the response and candidate biomarker. 3. Testing procedure We are interested in testing our null hypothesis of no qualitative interaction based on the ingredients (test statistic and estimated functions) outlined in Section 2. To develop a formal test we combine those ingredients with permutations to evaluate the null distribution of our statistic. The exact algorithm is the following: 1. Calculate $$\hat{\theta}^{{\rm{alt}}}_T$$, for $$T=0,1$$ by solving (2.3) across sequences of $$\lambda_0$$ and $$\lambda_1$$ values. Using $$10$$-fold cross-validation, choose the tuning-parameter values $$\lambda_0^*$$ and $$\lambda_1^*$$ that maximize the cross-validated log-likelihood. 2. Calculate $$\hat{\theta}^{{\rm{null}}}_T$$ for $$T=0,1$$ by solving (2.4) with $$\lambda_0 = \lambda_0^*, \lambda_1 = \lambda_1^*$$. 3. Calculate the test-statistic, $$Z$$, by plugging in $$\hat{\theta}^{{\rm{alt}}}_T$$ and $$\hat{\theta}^{{\rm{null}}}_T$$ for $$T=0,1$$ to (2.1): \begin{equation*} Z = \frac{\sum_{i = 1}^n\mathcal{D}\left(y_i,\hat{\theta}^{{\rm{null}}}_{T_i}({\bf x}_{i, \cdot})\right) - \sum_{i=1}^n\mathcal{D}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)}{\sum_{i=1}^n\mathcal{D}\left(y_i,\hat{\theta}^{{\rm{alt}}}_{T_i}({\bf x}_{i, \cdot})\right)}. \end{equation*} 4.
Calculate the permutation null of $$Z$$ by running the following for $$b=1,\ldots, B$$: (a) Permute the treatment labels $$T_i$$ for $$i=1,\ldots,n$$; (b) With the new treatment labels rerun steps 1–3. This will produce a permuted statistic $$Z^{*b}$$. 5. Compare $$Z$$ to $$\left\{Z^{*b}\right\}_{b=1}^B$$ using either a $$p$$-value $$p^* = \frac{\#\left\{Z^{*b} \geq Z\right\}}{B}$$ or an estimate of the false discovery rate (Benjamini and Hochberg, 1995; Tusher and others, 2001), which is further discussed in Appendix A.2 of supplementary material available at Biostatistics online. We use the optimal tuning parameters under the alternative hypothesis to solve under the null hypothesis. This decision was primarily made for computational reasons. The extra constraints (C1) and (C2) required under the null make the problem more computationally costly to solve. In particular, the specialized algorithm given in Section 4 for solving under the alternative no longer works under the null, and a general-purpose convex solver is used instead. In our experience, penalty parameters selected via CV under the alternative perform well under the null in practice. Nevertheless, there may be problems where this is not appropriate, and performing CV on the null is required for good performance. 3.1. Conservativeness of permutation tests In the particular case that $$\theta_0(x) = \theta_1(x)$$ for all $$x$$ under the null, our permutation test evaluates an exact $$p$$-value based on the conditional distribution of our statistic. However, our (one-sided) null only requires that $$\theta_0(x) \leq \theta_1(x)$$ for all $$x$$. In this case, permutations appear to result in a conservative $$p$$-value. To see this, consider the situation wherein the null is true, and the entire $$\theta_1$$ curve falls well above (or well below) the $$\theta_0$$ curve. In this case the true null distribution of $$Z$$ will have significant mass at exactly $$0$$.
However, by permuting we treat both $$\theta_0$$ and $$\theta_1$$ as the average of the two curves, making it much more likely for our empirical estimates to cross, and for our permuted statistic to be positive (i.e. for the fit under the alternative to be a superior fit to the data). We provide some theoretical evidence of conservativeness in Appendix A.3 of supplementary material available at Biostatistics online for the simplified setting wherein we estimate our conditional means using piecewise constant functions with prespecified knots and assume our outcomes are Gaussian distributed with equal variance in each treatment arm. 4. Computing We implement our method in Python using a 1D fused lasso penalty. The GitHub repository https://github.com/jhroth/data-example contains a working example of applying our method to data from a randomized controlled trial investigating a breast cancer treatment (Prat and others, 2014). The data are publicly available on the Gene Expression Omnibus with accession number GSE50948 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50948). We describe the GSE50948 dataset in more detail in Section 5. 4.1. Solving under $$H_0$$ We solve the optimization problem under $$H_0$$ in (2.4) for either a binary response (using a logistic loss function) or a continuous response (using a squared-error loss function) with CVXPY, a Python-embedded modeling language for convex optimization problems (Diamond and others, 2014). Our method calls the splitting conic solver (SCS) (O’Donoghue and others, 2016a,b) when using the 1D fused lasso penalty. 4.2. Solving under $$H_A$$ If the response variable is continuous, we solve under $$H_A$$ using a C implementation of a very efficient fused lasso solver that uses a dynamic programming approach (Johnson, 2013) when we have one feature, or using CVXPY with the SCS solver (in an additive model when we have more than one feature).
For a binary response variable, we use logistic loss and apply a generalized gradient descent algorithm (Parikh and others, 2014) that leans on the solver from Johnson (2013). We give detail on the derivation of this algorithm in Appendix A.1 of supplementary material available at Biostatistics online. For a binary response variable and more than one feature, we again fit an additive model and call the SCS (O’Donoghue and others, 2016a,b). For multivariate problems, our use of a general-purpose solver can result in long runtimes. In the worst case, for the examples in this manuscript, we found that a single model fit can take several minutes. This is not prohibitive, as fits for various features/permutations can be run in parallel. In particular, the simulation and data examples in this manuscript were run on Amazon Web Services (aws.amazon.com). In addition, we plan to develop a custom first-order solver (leveraging sparsity) to further alleviate this issue. 5. Data example: GSE50948 The code used to produce and plot the results for the data example is available in the GitHub repository https://github.com/jhroth/data-example. We use our methodology to evaluate expression markers from the Prediction Analysis of Microarray 50 (PAM50) diagnostic test (Parker and others, 2009) for use in characterizing those clinically defined HER2+ breast cancer patients (defined by pathology/copy number) who are likely to benefit from trastuzumab. Between 20 and 30% of breast cancer cases are characterized by over-expression of human epidermal growth factor receptor type 2 (HER2); this is known as HER2+ breast cancer. Trastuzumab, an intravenous antibody therapy that targets HER2, has been shown to lengthen disease-free survival for women with HER2+ breast cancer (Joensuu and others, 2006) and is now part of the standard-of-care chemotherapy regimen for HER2+ breast cancer patients (Hudis, 2007).
PAM50 is a diagnostic test that uses the expression levels of 50 genes to classify cancer into 5 molecular subtypes. One of these subtypes is “HER2-enriched” (HER2-E). While clinical and PAM50-based molecular classification of HER2+/$$-$$ often agree, they do not always. Prat and others (2014) find evidence that clinically HER2+ breast cancer patients classified to the HER2-E subtype by PAM50 have a higher rate of 3-year pathologic complete response (pCR) after receiving trastuzumab and chemotherapy, compared to chemotherapy without trastuzumab. No statistically significant differences in pCR between treatment arms were found for those clinically HER2+ patients classified (in contradiction) by PAM50 as having any of the other four cancer subtypes. Here, pCR is defined as the absence of residual invasive breast cancer at the end of the 3-year period. We reanalyze the publicly available subset of data from the NeoAdjuvant Herceptin (NOAH) randomized clinical trial (Gianni and others, 2010), which can be accessed in the Gene Expression Omnibus under the accession number GSE50948 (Edgar and others, 2002). We explore which, if any, of the PAM50 genes can be used to characterize benefit from trastuzumab in clinically HER2+ patients. The NOAH study population consisted of 334 women, 235 of whom had clinically HER2+ breast cancer. Clinically HER2+ patients in the NOAH trial were randomized to either receive treatment with trastuzumab in addition to neoadjuvant chemotherapy ($$T=1$$) or to receive neoadjuvant chemotherapy alone ($$T=0$$). Gene expression levels that passed quality control are available for the baseline tumor biopsies of 111 HER2+ patients. Our response variable of interest is a binary indicator of 3-year pCR. In the $$T=0$$ group, 25.5% of participants experienced pCR after 3 years; in the treatment group, 46.7% experienced pCR after 3 years.
This is consistent with the finding in Gianni and others (2010) that trastuzumab was beneficial overall for women with clinically HER2+ breast cancer. There is no mention of censoring in Prat and others (2014) and we use logistic loss for this analysis. The supplemental materials in Prat and others (2014) list 9 unique probes used to represent the PAM50 genes classifying patients to the HER2+ subtype; we use combinations of these 9 expression levels as the candidate biomarkers for our data example. For one example, we use our method to fit a univariate model with each of the 9 probes as a single feature. We also fit our method with 36 additive models, as described in Section 2.6, where each possible combination of two probes is used as the set of features. We compute one observed and 100 permuted test statistics as described in Section 3 for each of 9 univariate models and 36 2D additive models. We estimate a false discovery rate (FDR) as described in Appendix A.2 of supplementary material available at Biostatistics online. Figure 3 shows the fitted values for the single probe (ACTR3B) in the univariate model with the lowest FDR estimate (0.30). Figure 3 suggests that trastuzumab only improves pCR for HER2+ patients with low levels of ACTR3B, but the associated FDR estimate (0.30) is relatively high. For comparison, Figure 4 shows the predicted values for both treatment groups from a logistic regression where the response variable is pCR and the features are ACTR3B, $$T$$, and $$(\text{ACTR3B} \times T)$$. The estimated response-curves from linear-logistic regression do not cross each other for any value of ACTR3B in the sample range. Fig. 3. View largeDownload slide Predicted values from our method with ACTR3B as the feature (estimated FDR$$\,=\,$$0.30). The flatter line corresponds to the treatment group. Fig. 3. View largeDownload slide Predicted values from our method with ACTR3B as the feature (estimated FDR$$\,=\,$$0.30). 
Fig. 4. Predicted values from logistic regression with ACTR3B, $$T$$, and $$(\text{ACTR3B} \times T)$$ as the features. The curve for the treatment group lies above the curve for the control group.

Figure 5 shows a surface plot for the 2-probe combination from the additive model (ACTR3B and BAG1), which had an estimated FDR of 0.15. Figure 5 suggests that HER2+ breast cancer patients with relatively low expression levels of both ACTR3B and BAG1 may not benefit from trastuzumab, while all other patients do benefit. If this finding is validated in independent datasets, one implication is that clinically HER2+ patients with low levels of ACTR3B and low levels of BAG1 could avoid treatment with trastuzumab.

Fig. 5. Fitted surfaces for the additive model with ACTR3B and BAG1 as features (estimated FDR$$\,=\,$$0.15). The flatter surface corresponds to the treatment group.

6. Simulations

6.1. Simulation scenarios

We simulate data from 12 different scenarios across a range of signal-to-noise ratios (SNRs), using $$n=100$$ observations and $$100$$ permutations for testing. Figure 6 shows the simulated response variable as a function of the candidate biomarker for an SNR of 1.5.

Fig. 6. The 12 simulation scenarios with $${\text{SNR}} = 1.5$$.
In every scenario we simulate candidate biomarker $$x_i \sim \text{Uniform}(0, 1)$$ and treatment indicator $$t_i \sim \text{Bernoulli}(0.5)$$ for $$i=1, \ldots, n$$. In the four mean-shift simulation scenarios (first row of Figure 6), the response-curves for the treatment groups are mean shifts of each other and we simulate the response variable as \begin{equation*} y_i \mid t_i = f(x_i) - \delta t_i + \epsilon_i, \end{equation*} for $$i=1, \ldots, n$$, where we chose $$\delta=0.6$$ and used noise parameter $$\nu^2 = \frac{\widehat{{\text{Var}}}(f(X) - \delta T)}{{\text{SNR}}}$$ to generate $$\epsilon_i \sim N(0, \nu^2)$$. The function $$f$$ is either linear, piecewise linear, piecewise constant, or sinusoidal. In the second row of Figure 6 (non-qualitative interaction), we simulate the response variable as \begin{equation*} y_i \mid t_i = f_0(x_i) t_i + f_1(x_i) (1 - t_i) + \epsilon_i t_i + \gamma_i (1-t_i), \end{equation*} for $$i=1, \ldots, n$$, where we specify noise parameters $$\nu_0^2 = \frac{\widehat{{\text{Var}}}(f_0(x))}{{\text{SNR}}}$$ and $$\nu_1^2 = \frac{\widehat{{\text{Var}}}(f_1(x))}{{\text{SNR}}}$$ to generate $$\epsilon_i \sim N(0, \nu_0^2)$$ and $$\gamma_i \sim N(0, \nu_1^2)$$. The functions $$f_0$$ and $$f_1$$ yield linear, piecewise linear, piecewise constant, or sinusoidal response-curves for the treatment groups that do not cross each other. In the last row of Figure 6 (qualitative interaction), the population response-curves for the treatment groups cross each other and we generate \begin{equation*} y_i \mid t_i = f_0(x_i) t_i + f_1(x_i) (1 - t_i) + \epsilon_i, \end{equation*} for $$i=1, \ldots, n$$, where we specify noise parameter $$\nu^2 = \frac{\sum_{i=1}^n \left(\left(f_0(x_i) - f_1(x_i)\right)^+\right)^2}{{\text{SNR}}}$$ to generate $$\epsilon_i \sim N(0, \nu^2)$$.
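The qualitative-interaction generator above can be sketched as follows. The linear pair `f0`/`f1` below is only one illustrative choice of crossing curves, and the function name `simulate_qualitative` is ours:

```python
import numpy as np

def simulate_qualitative(n=100, snr=1.5, seed=0):
    """Draw one dataset from a qualitative-interaction scenario."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)   # candidate biomarker
    t = rng.binomial(1, 0.5, size=n)    # treatment indicator
    f0 = lambda v: v                    # treatment-group curve
    f1 = lambda v: 1.0 - v              # control-group curve; crosses f0 at x = 0.5
    # noise variance scaled by the positive part of the curve separation
    nu2 = np.sum(np.maximum(f0(x) - f1(x), 0.0) ** 2) / snr
    eps = rng.normal(0.0, np.sqrt(nu2), size=n)
    y = f0(x) * t + f1(x) * (1 - t) + eps
    return x, t, y
```

Swapping in piecewise-linear, piecewise-constant, or sinusoidal `f0`/`f1` reproduces the other columns of the third row of Figure 6.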
The functions $$f_0$$ and $$f_1$$ yield linear, piecewise linear, piecewise constant, or sinusoidal response-curves for the treatment groups that cross each other.

6.2. Simulation results

6.2.1. Comparison to Linear Regression

Let $$\mathcal{X} \subset {\mathbb{R}}$$ be the set of values of candidate biomarker $$X$$ that have clinical relevance. A simple alternative to our testing procedure is to fit the linear regression \begin{equation*} E(Y_i \mid x_i, t_i) = \beta_0 + \beta_1 x_i + \beta_2 t_i + \beta_3 (x_i \times t_i). \end{equation*} Under this model, a qualitative interaction occurs if there exist $$x, x', x'' \in \mathcal{X}$$ such that \begin{equation*} \beta_2 + \beta_3 x = 0, \hspace{2ex} \beta_2 + \beta_3 x' < 0, \hspace{2ex} \beta_2 + \beta_3 x'' > 0, \end{equation*} so we are testing \begin{equation*} H_0: \text{there do not exist } x, x', x'' \in \mathcal{X} \text{ such that } x = -\frac{\beta_2}{\beta_3}, \; x' < -\frac{\beta_2}{\beta_3}, \; x'' > -\frac{\beta_2}{\beta_3}. \end{equation*} We derive our procedure for testing this hypothesis in Appendix A.4 of supplementary material available at Biostatistics online. Figure 7 plots the share of 1000 replications in which the null hypothesis was rejected, as a function of SNR. Our method is shown in squares and the procedure based on linear regression is shown in crosses. Rejecting $$H_0$$ in either of the first two rows is a Type I error, while rejecting it in the third row is the correct decision. For these simulation scenarios, Figure 7 suggests that both methods have Type I error rates below the p-value threshold of 0.05. The procedure based on linear regression has higher power than our method only when the truth is linear; for the other three truth types, our method offers a substantial improvement in power.
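To illustrate the quantity being tested, a least-squares sketch (ours; not the formal Appendix A.4 procedure, which accounts for sampling variability) fits the interaction model and checks whether the estimated crossing point $$-\hat\beta_2/\hat\beta_3$$ lies strictly inside the observed biomarker range:

```python
import numpy as np

def linear_crossing_point(x, t, y):
    """Fit E(Y | x, t) = b0 + b1*x + b2*t + b3*(x*t) by least squares.

    Returns the estimated crossing point -b2/b3 and whether it lies
    strictly inside the observed range of x (a necessary condition for
    a qualitative interaction under the linear model).
    """
    X = np.column_stack([np.ones_like(x), x, t, x * t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    b2, b3 = beta[2], beta[3]
    if np.isclose(b3, 0.0):
        # no interaction: the curves are parallel and never cross
        return np.inf, False
    cross = -b2 / b3
    return cross, bool(x.min() < cross < x.max())
```

When the crossing point falls outside the sample range, the fitted lines do not cross over $$\mathcal{X}$$, as in Figure 4 of the data example.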
Fig. 7. Share of 1000 simulations with a p-value below 0.05, as a function of 30 SNR values across each of the 12 scenarios (permutation-based p-values $$p^*$$ from our method shown in squares; p-values from the procedure based on linear regression shown in crosses).

The observed conservativeness of our test under the null is in line with the discussion in Section 3.1. When the null is true and the response-curves for the treatment groups are well-separated, the observed test statistic has a large point mass at $$0$$; a permuted test statistic, on the other hand, is much more likely to be positive in this setting because it is based on response-curves that are averaged across the two treatment groups and thus are more likely to cross. As the signal-to-noise ratio increases, we would expect the observed test statistic to have more mass at $$0$$ (while the permuted test statistic is still likely to be positive) and thus the conservativeness to increase. One saving grace is that at the boundary of our null hypothesis space we have $$\theta_0(x) = \theta_1(x)$$ for all $$x$$, in which case our test is no longer conservative. We also attempted to address this conservatism: as an alternative to permutations, we considered generating null samples using a parametric bootstrap from our null-constrained estimates. However, we found that this approach was anti-conservative. Since we used the fused lasso as the basis for our estimation, the fitted values for each treatment group are piecewise constant across the range of the feature, with the number and location of the knots chosen data-adaptively instead of being specified beforehand.
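The permuted statistics enter the test through a p-value of the usual permutation form. The test statistic itself comes from the constrained fused-lasso fits (not reproduced here), but the conversion of the $$B=100$$ permuted statistics into $$p^*$$ can be sketched as follows; the +1 correction is a common convention we assume, not a detail taken from the paper:

```python
import numpy as np

def permutation_p_value(observed_stat, permuted_stats):
    """Permutation p-value: the share of permuted statistics at least as
    large as the observed one, with a +1 correction so p* is never 0."""
    permuted_stats = np.asarray(permuted_stats, dtype=float)
    b = permuted_stats.size
    return (1.0 + float(np.sum(permuted_stats >= observed_stat))) / (1.0 + b)
```

The conservativeness discussed above arises because, under a well-separated null, `observed_stat` is often exactly 0 while many entries of `permuted_stats` are positive, pushing $$p^*$$ toward 1.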
A simple alternative to this approach would be to pre-specify a fixed number of evenly spaced knots and use the group means in each region as the fitted values. Appendix A.5 of supplementary material available at Biostatistics online compares the power and Type I error of our method to this alternative.

7. Conclusion

To test whether treatment is either uniformly superior or uniformly inferior for all patients, we propose a convex, regression-based test for qualitative interactions. We implement our method in Python and share an example of using it to search for a subset of HER2+ breast cancer patients who benefit from adjuvant chemotherapy. Our work has several notable limitations. One is the long runtime of our Python implementation, due to its reliance on a general convex solver. In future work that focuses on computation, we would like to devise a custom solver that handles additive models more efficiently. In addition, our current work only tests a strong null hypothesis: the estimation stage of our framework suggests a subset of patients who benefit from treatment (i.e., the values of the candidate biomarker(s) for which the estimated response in the treatment group is superior to the estimated response in the control group), but no formal significance test of the average treatment effect is run in that subgroup. Moreover, the testing stage of our framework is stated for pre-specified biomarker candidates and does not perform variable selection among a set of potential candidates. Although our current implementation estimates the response variable as a piecewise-constant function of each continuous feature (either alone in a univariate model or as part of an additive model) and tests for qualitative interaction, we presented a general underlying framework that accommodates alternative penalty functions that induce different structures on the fitted values.
The constraints in the optimization framework can also be modified to test whether the population average benefit from treatment is at least a pre-specified amount.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Acknowledgments

Conflict of Interest: None declared.

Funding

Noah Simon was supported by NIH Grant DP5OD019820. Both authors were supported by an AWS in Education Research Grant award.

References

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 289–300.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge, United Kingdom: Cambridge University Press.

Cai, T., Tian, L., Wong, P. H. and Wei, L. J. (2011). Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 12, 270–282.

Chang, M., Lee, S. and Whang, Y.-J. (2015). Nonparametric tests of conditional treatment effects with an application to single-sex schooling on academic achievements. The Econometrics Journal 18, 307–346.

Crump, R. K., Hotz, V. J., Imbens, G. W. and Mitnik, O. A. (2008). Nonparametric tests for treatment effect heterogeneity. The Review of Economics and Statistics 90, 389–405.

Diamond, S., Chu, E. and Boyd, S. (2014). CVXPY: A Python-Embedded Modeling Language for Convex Optimization, Version 0.2. http://cvxpy.org/.

Edgar, R., Domrachev, M. and Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30, 207–210.

Gail, M. and Simon, R. (1985). Testing for qualitative interactions between treatment effects and patient subsets.
Biometrics 41, 361–372.

Gianni, L., Eiermann, W., Semiglazov, V., Manikhas, A., Lluch, A., Tjulandin, S., Zambetti, M., Vazquez, F., Byakhow, M., Lichinitser, M. and others (2010). Neoadjuvant chemotherapy with trastuzumab followed by adjuvant trastuzumab versus neoadjuvant chemotherapy alone, in patients with HER2-positive locally advanced breast cancer (the NOAH trial): a randomised controlled superiority trial with a parallel HER2-negative cohort. The Lancet 375, 377–384.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models, Volume 43. CRC Press.

Hastie, T., Tibshirani, R. and Friedman, J. (2008). The Elements of Statistical Learning, 2nd edition. New York: Springer.

Hudis, C. A. (2007). Trastuzumab: mechanism of action and use in clinical practice. New England Journal of Medicine 357, 39–51.

Joensuu, H., Kellokumpu-Lehtinen, P.-L., Bono, P., Alanko, T., Kataja, V., Asola, R., Utriainen, T., Kokko, R., Hemminki, A., Tarkkanen, M. and others (2006). Adjuvant docetaxel or vinorelbine with or without trastuzumab for breast cancer. New England Journal of Medicine 354, 809–820.

Johnson, N. (2013). A dynamic programming algorithm for the fused lasso and $$l_0$$-segmentation. Journal of Computational and Graphical Statistics 22, 246–260.

Kang, C., Janes, H. and Huang, Y. (2014). Combining biomarkers to optimize patient treatment recommendations. Biometrics 70, 695–707.

Moodie, E. E. M., Dean, N. and Sun, Y. R. (2014). Q-learning: flexible learning about useful utilities. Statistics in Biosciences 6, 223–243.

O'Donoghue, B., Chu, E., Parikh, N. and Boyd, S. (2016a). Conic optimization via operator splitting and homogeneous self-dual embedding.
Journal of Optimization Theory and Applications 169, 1042–1068.

O'Donoghue, B., Chu, E., Parikh, N. and Boyd, S. (2016b). SCS: Splitting Conic Solver, Version 1.2.6. https://github.com/cvxgrp/scs.

Orellana, L., Rotnitzky, A. and Robins, J. M. (2010). Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. The International Journal of Biostatistics 6(2), 1–47.

Pan, G. and Wolfe, D. A. (1997). Test for qualitative interaction of clinical significance. Statistics in Medicine 16, 1645–1652.

Parikh, N. and Boyd, S. (2014). Proximal algorithms. Foundations and Trends in Optimization 1, 127–239.

Parker, J. S., Mullins, M., Cheang, M., Leung, S., Voduc, D., Vickery, T., Davies, S., Fauron, C., He, X., Hu, Z. and others (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology 27, 1160–1167.

Peto, R. (1982). Statistical aspects of cancer trials. In: Treatment of Cancer. London, United Kingdom: Chapman and Hall.

Prat, A., Bianchini, G., Thomas, M., Belousov, A., Cheang, M. C., Koehler, A., Gómez, P., Semiglazov, V., Eiermann, W., Tjulandin, S. and others (2014). Research-based PAM50 subtype predictor identifies higher responses and improved survival outcomes in HER2-positive breast cancer in the NOAH study. Clinical Cancer Research 20, 511–521.

Robins, J., Orellana, L. and Rotnitzky, A. (2008). Estimation and extrapolation of optimal treatment and testing strategies. Statistics in Medicine 27, 4678–4721.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B 67, 91–108.
Tibshirani, R. J. (2013). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics 42, 285–323.

Tusher, V. G., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 98, 5116–5121.

Zhang, Y., Laber, E. B., Tsiatis, A. and Davidian, M. (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics 71, 895–904.

Zhao, Y., Kosorok, M. R. and Zeng, D. (2009). Reinforcement learning design for cancer clinical trials. Statistics in Medicine 28, 3294–3315.

Zhao, Y., Zeng, D., Rush, A. J. and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118.

Zhao, Y.-Q., Zeng, D., Laber, E. B., Song, R., Yuan, M. and Kosorok, M. R. (2015). Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102, 151–168.

Zhao, Y., Zeng, D., Socinski, M. A. and Kosorok, M. R. (2011). Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics 67, 1422–1433.

© The Author 2017. Published by Oxford University Press. All rights reserved.

### Journal

Biostatistics, Oxford University Press

Published: Aug 24, 2017
