Biometrika | DeepDyve

journal article

LitStream Collection

Robust estimation of causal effects via a high-dimensional covariate balancing propensity score

2020 Biometrika

Summary We propose a robust method to estimate the average treatment effects in observational studies when the number of potential confounders is possibly much greater than the sample size. Our method consists of three steps. We first use a class of penalized |$M$|-estimators for the propensity score and outcome models. We then calibrate the initial estimate of the propensity score by balancing a carefully selected subset of covariates that are predictive of the outcome. Finally, the estimated propensity score is used to construct the inverse probability weighting estimator. We prove that the proposed estimator, which we call the high-dimensional covariate balancing propensity score, has the sample boundedness property, is root-|$n$| consistent, asymptotically normal, and semiparametrically efficient when the propensity score model is correctly specified and the outcome model is linear in covariates. More importantly, we show that our estimator remains root-|$n$| consistent and asymptotically normal so long as either the propensity score model or the outcome model is correctly specified. We provide valid confidence intervals in both cases and further extend these results to the case where the outcome model is a generalized linear model. In simulation studies, we find that the proposed methodology often estimates the average treatment effect more accurately than existing methods. We also present an empirical application, in which we estimate the average causal effect of college attendance on adulthood political participation. An open-source software package is available for implementing the proposed methodology. 1. Introduction The propensity score of Rosenbaum & Rubin (1983) plays a central role in the estimation of causal effects in observational studies. In particular, matching and weighting methods based on propensity score have become part of the applied researcher’s toolkit across many scientific disciplines. One important challenge, which is becoming increasingly common as the amount of available data grows, is the question of how to incorporate a large number of potential confounders. For example, Schneeweiss et al. (2009) consider a total of several thousand candidate confounders obtained from the health care claims data. In this paper we propose a robust method to estimate the average treatment effect and the average treatment effect for the treated in observational studies when the number of potential confounders is possibly much greater than the sample size. Under the standard assumption of strong ignorability, we propose to estimate the propensity score by balancing covariates in high-dimensional settings. The proposed method consists of several steps. We first obtain an initial estimator of the propensity score by maximizing a penalized generalized quasilikelihood, which depends on a user-specified weight function. Next, we apply the weighted least squares method to fit the outcome model. We show that the two weight functions critically determine the performance of the proposed estimator under model misspecification. Third, we refine the initial estimate of the propensity score by balancing a carefully selected set of observed covariates that are predictive of the outcome. Finally, the estimated propensity score is used to construct the inverse probability weighting estimator of the average treatment effect. The extension to the estimation of the average treatment effect for the treated is given in the Supplementary Material. We prove that under mild conditions the proposed estimator, which we call the high-dimensional covariate balancing propensity score, has the sample boundedness property, is root-|$n$| consistent, asymptotically normal and semiparametrically efficient when the propensity score model is correctly specified and the outcome model is linear in covariates. Although this result holds for a broad class of weight functions used in the initial estimation of propensity score, the choice of weight functions determines the rate of convergence under model misspecification. We carefully choose the weight functions such that the estimator remains root-|$n$| consistent and asymptotically normal so long as either the propensity score model or the outcome model is correctly specified. In addition, the proposed estimator comes with honest confidence intervals. Finally, we extend these theoretical results to the case where the outcome model is a generalized linear model in order to allow for nonlinearity. An open-source software R package, CBPS (R Development Core Team 2020), is available for implementing our proposed methodology at https://CRAN.R-project.org/package=CBPS. The proposed methodology does not require the variable selection consistency of either the propensity score model or the outcome model. This property is shared by many recent works on high-dimensional inference on regression models; see Zhang & Zhang (2014), Javanmard & Montanari (2014), Van de Geer et al. (2014), Belloni et al. (2016), Ning & Liu (2017), Ning et al. (2017), Cai & Guo (2017), Neykov et al. (2018), Zhu & Bradic (2018), and Dukes et al. (2019), among many others. The main idea of these works is to correct the bias of the lasso-type estimators by inverting the optimality condition or the score function through the projection onto the nuisance tangent space, yielding honest confidence intervals for prespecified coefficients. Our proposed methodology builds on three strands of research that have recently emerged in the causal inference literature; see § 3.5 for a detailed discussion. First, a number of researchers have proposed estimating the average treatment effect by optimizing covariate balance between the treatment and control groups (e.g., Hainmueller, 2012; Graham et al., 2012; Imai & Ratkovic, 2014; Zubizarreta, 2015; Chan et al., 2016; Fan et al., 2016; Zhao, 2019). The proposed methodology extends the covariate balancing propensity score of Imai & Ratkovic (2014) and Fan et al. (2016) to the high-dimensional setting, in which the number of potential confounders is possibly greater than the sample size. Second, we contribute to the growing literature on the estimation of the average treatment effect in high-dimensional settings (e.g., Belloni et al., 2014, 2017; Farrell, 2015; Chernozhukov et al., 2018). These methods first estimate the nuisance parameters, e.g., propensity score, typically by the penalized maximum likelihood, and then estimate the average treatment effect by solving the efficient score function. Different from this line of work, we rely on the covariate balancing strategy for estimating the propensity score model and use the Horvitz–Thompson estimator. As elaborated in § 3.5, the robustness of the asymptotic distributions of our estimator to model misspecification is the main advantage over these existing methods. Most recently, (Tan, 2017; Tan, 2018) proposed a penalized calibrated propensity score method. Unlike this method, our approach is based on covariate balancing. Finally, our method is related to the recently proposed approximate residual balancing method (Athey et al., 2018), which, unlike our methodology, does not require a propensity score model. While the approximate residual balancing method requires the outcome model to be linear in covariates, our method yields a consistent and asymptotically normal estimator even under the misspecification of the outcome model so long as the propensity score is correctly specified. In addition, we extend the proposed method and its asymptotic theory to the case in which the outcome variable follows a generalized linear model. This also overcomes the same limitation of the original covariate balancing estimator (Imai & Ratkovic, 2014; Fan et al., 2016). Throughout the paper, we use the following notation. For |$v=(v_1, \ldots ,v_d)^\top \in \mathbb{R}^d$|⁠, and |$1 \leqslant q \leqslant \infty$|⁠, we define |$\|{v}\|_q=(\sum_{i=1}^d |v_i|^q)^{1/q}$|⁠, |$\|{v}\|_0=|\textrm{supp}(v)|$|⁠, where |$\textrm{supp}(v)=\{j: v_j |{\hspace{-3.5mm}} = 0\}$| and |$|A|$| is the cardinality of a set |$A$|⁠. Denote |$v^{\otimes 2}=vv^\top$|⁠. If the matrix |$M$| is symmetric, then |$\lambda_{\min}(M)$| and |$\lambda_{\max}(M)$| are the minimal and maximal eigenvalues of |$M$|⁠. For |$S\subseteq \{1, \ldots ,d\}$|⁠, let |$v_S=\{v_j: j\in S\}$| and |$S^c$| be the complement of |$S$|⁠. For two positive sequences |$a_n$| and |$b_n$|⁠, we write |$a_n\asymp b_n$| if |$C\leqslant a_n/b_n\leqslant C'$| for some |$C,C'>0$|⁠. Similarly, we use |$a_n\lesssim b_n$| to denote |$a_n\leqslant Cb_n$| for some constant |$C>0$|⁠. A random variable |$X$| is subexponential if there exists some constant |$K_1>0$| such that |${\rm pr}(|X|>t)\leqslant \exp(1-t/K_1)$| for all |$t\geqslant 0$|⁠. The subexponential norm of |$X$| is defined as |$\|X\|_{\psi_1}=\sup_{p\geqslant 1}p^{-1}({E}|X|^p)^{1/p}$|⁠. A random variable |$X$| is sub-Gaussian if there exists some constant |$K_2>0$| such that |${\rm pr}(|X|>t)\leqslant \exp(1-t^2/K^2_2)$| for all |$t\geqslant 0$|⁠. The sub-Gaussian norm of |$X$| is defined as |$\|X\|_{\psi_2}=\sup_{p\geqslant 1}p^{-1/2}({E}|X|^p)^{1/p}$|⁠. Denote |$a\vee b=\max(a,b)$|⁠. 2. Proposed methodology 2.1. Set-up Suppose that we observe a simple random sample of size |$n$| from a population of interest. For each unit |$i$|⁠, we observe a 3-tuple |$(T_i,Y_i,X_i )$| where |$X_i $| is a |$d$|-dimensional vector of pretreatment covariates, |$Y_i$| is an outcome variable and |$T_i$| is a binary treatment variable denoting whether the observation receives the treatment |$(T_i=1)$| or not |$(T_i=0)$|⁠. Let |$Y_i(1)$| and |$Y_i(0)$| denote the potential outcomes under the treatment and control conditions, respectively. This notation implies the stable unit treatment value assumption (Rubin, 1990). Then, the observed outcome can be written as |$Y_i = Y_i(T_i)$|⁠. Our goal is to infer the average treatment effect, |$\mu^* \ = \ {E}\{Y_i(1) - Y_i(0)\}\text{.}$| We focus on the estimation of |$\mu_1^\ast={E}\{Y_i(1)\}$|⁠, since |$\mu_0^\ast ={E}\{Y_i(0)\}$| can be estimated in a similar manner. We impose a working parametric model |$\pi(X_i ^\top \beta)$| for the treatment assignment mechanism, which is known as the propensity score |${\rm pr}(T_i=1\mid X_i)$|⁠, where |$\pi(\cdot)$| is a known function and |$\beta$| is an unknown |$d$|-dimensional vector. In this work, we consider settings where the number of covariates is possibly much greater than the sample size, i.e., |$d\gg n$|⁠. When the propensity score model is correctly specified, we have $$\begin{equation}\label{eqps} {\rm pr}(T_i = 1 \mid X_i ) \ = \ \pi(X_i ^\top \beta^*+u_i), \end{equation}$$(1) for some |$\beta^*\in\mathbb{R}^d$|⁠, where |$u_i$| is a small approximation error which accounts for the nonsparse effect due to weak signals in the model. Similarly, for the outcome variable we impose a linear working model. When the working model is correctly specified, we have $$\begin{equation}\label{eqoutcome} {E}\{Y_i(1)\mid X_i \} \ = \ K_1(X_i )+r_i, \end{equation}$$(2) where |$K_1(X_i )=\alpha^{*\top}X_i$| for some |$\alpha^*\in\mathbb{R}^d$| and similarly |$r_i$| is the approximation error. An extension to generalized linear models will be studied in § 4. In general, the propensity score model (1) or the outcome model (2) can be misspecified. We begin by assuming both models (1) and (2) hold. When studying the theoretical properties of our proposed methodology in § 3, however, we will consider the situations in which either model (1) or (2) does not hold. 2.2. High-dimensional covariate balancing propensity score In many applications, it is often reasonable to assume that the propensity score model is sparse or approximately sparse. Under the sparsity assumption, Tibshirani (1996) and Fan & Li (2001) proposed the penalized maximum likelihood estimators for parameter estimation and variable selection. Unfortunately, the penalized maximum likelihood estimators cannot be directly used with the Horvitz–Thompson estimator to infer |$\mu^*_1={E}\{Y_i(1)\}$| because the estimator may incur a large bias due to shrinkage. Thus, the resulting estimator may have a slower rate of convergence. To address this problem, we estimate the propensity score by optimizing covariate balance between the treatment and control groups. To this end, we distinguish the following two types of covariate balancing properties. Definition 1 (Covariate balancing properties). Let |$\hat\pi=\pi(X^\top \hat\beta)$| denote an estimator of the propensity score |${\rm pr}(T= 1 \mid X)$| with |$\hat\beta$| being an estimator of |$\beta^\ast$|⁠, which is the true value of |$\beta$|⁠. (a) We say |$\hat\pi$| satisfies the strong covariate balancing property if the following equality holds: $$\begin{equation}\label{eqstrong} \sum_{i=1}^n\bigg(\frac{T_i}{\hat\pi_i}-1\bigg)X_{i} \ = \ 0\text{.} \end{equation}$$(3) (b) We say |$\hat\pi$| satisfies the weak covariate balancing property if the following equality holds: $$\begin{equation}\label{eqweak} \sum_{i=1}^n\bigg(\frac{T_i}{\hat\pi_i}-1\bigg)\alpha^{*\top}X_{i} \ = \ 0, \end{equation}$$(4) where |$\alpha^\ast$| is defined by |$K_1(X_i )=\alpha^{*\top}X_i $| in equation (2). Although the strong covariate balancing property implies the weak one, the converse does not necessarily hold. Existing covariate balancing propensity score methods aim to achieve the strong covariate balancing property, which balances the mean of every component of |$X_i$| (e.g., Imai & Ratkovic, 2014; Fan et al., 2016). However, constructing an estimator |$\hat\pi$| with the strong covariate balancing property is difficult in high-dimensional settings. When |$d > n$|⁠, the estimator |$\hat\beta$| that satisfies equation (3) is not unique and therefore not even well defined. To overcome this difficulty, we propose to estimate the propensity score such that it approximately satisfies the weak covariate balancing property, i.e., |$n^{-1}\sum_{i=1}^n({T_i}/{\hat\pi_i}-1)\alpha^{*\top}X_{i} \approx 0$|⁠. We show that the weak covariate balancing property is sufficient to remove the bias from the estimation of the propensity score model. Here, we first introduce the proposed methodology, which we call the high-dimensional covariate balancing propensity score. Step 1. Define a generalized quasilikelihood function as $$\begin{equation*} Q_n(\beta)=\frac{1}{n}\sum_{i=1}^n \int_0^{\beta^{\top}X_i }\left\{\frac{T_i}{\pi(u)}-1\right\}w_1(u) \, {\rm d} u, \end{equation*}$$ where |$w_1(\cdot)$| is a positive weight function. Compute the regularized estimator \[\mathop a\limits_b \] $$\begin{equation}\label{eqhatbeta} \hat\beta = \mathop {\rm argmin}\limits_{\beta\in\mathbb{R}^d}\{-Q_n(\beta)+\lambda\|\beta\|_1\}, \end{equation}$$(5) where |$\lambda>0$| is a tuning parameter. Step 2. Define a weighted least square loss function using the treatment group as $$\begin{equation*} L_n(\alpha)=\frac{1}{n}\sum_{i=1}^n T_i w_2(\hat\beta^\top X_i)(Y_i-\alpha^{\top}X_i )^2, \end{equation*}$$ where |$w_2(\cdot)$| is another positive weight function. Compute the regularized estimator $$\begin{equation}\label{eqhatalpha} \tilde\alpha \ = \mathop {\rm argmin}\limits_{\alpha\in\mathbb{R}^d}\{L_n(\alpha)+\lambda'\|\alpha\|_1\}, \end{equation}$$(6) where |$\lambda'>0$| is a tuning parameter. Step 3. Let |$\tilde S=\{j: |\tilde\alpha_j|>0\}$| denote the support of |$\tilde\alpha$| and |$X_{\tilde S}$| denote the corresponding subset of |$X$|⁠. We calibrate the initial estimator |$\hat\beta_{\tilde{S}}$| to balance |$X_{\tilde{S}}$|⁠. Specifically, we solve $$\begin{equation}\label{eqcovariatebalance} \tilde\gamma \ = \ \mathop {\rm argmin}\limits_{\gamma\in\mathbb{R}^{|\tilde S|}}\|g_n(\gamma)\|^2_2 \quad \textrm{where}~ g_n(\gamma) \ = \ \frac{1}{n}\sum_{i=1}^n \left\{\frac{T_i}{\pi(\gamma^\top X_{i\tilde S}+\hat\beta^\top_{\tilde S^c}X_{i\tilde S^c})}-1\right\}X_{i \tilde S}\text{.} \end{equation}$$(7) We then set |$\tilde\beta=(\tilde\gamma, \hat\beta_{\tilde S^c})$| and |$\tilde\pi_i=\pi(\tilde\beta^\top X_i)$|⁠. Step 4. Estimate |$\mu_1^\ast={E}\{Y_i(1)\}$| by the Horvitz–Thompson estimator |$\hat \mu_1={n}^{-1} \sum_{i=1}^{n}T_iY_i/\tilde\pi_i$|⁠. In Step 1, we obtain an initial estimate of the propensity score via the penalized |$M$|-estimation approach. We refer to the function |$Q_n(\beta)$| as the generalized quasilikelihood function, as its construction is similar to the quasilikelihood function for generalized linear models (Wedderburn, 1974). To understand how the generalized quasilikelihood function is motivated, we compute the corresponding quasi-score function, $$\begin{equation}\label{eq_quasi2} \frac{\partial Q_n(\beta)}{\partial \beta} \ = \ \frac{1}{n}\sum_{i=1}^n \left\{\frac{T_i}{\pi(\beta^{\top}X_i)}-1\right\}w_1(\beta^{\top}X_i)X_i\text{.} \end{equation}$$(8) Since (8) is an unbiased estimating function for |$\beta$|⁠, |$Q_n(\beta)$| serves as a legitimate quasilikelihood function that integrates the quasi-score function (8). The quasilikelihood function |$Q_n(\beta)$| depends on the choice of the weight |$w_1(u)$|⁠. In particular, we consider the following two examples. (a) If $w_1(u)=\pi(u)$⁠, (8) is identical to the score function for the logistic regression and thus $Q_n(\beta)$ reduces to the standard quasi-likelihood function for the treatment variable. (b) If $w_1(u)=1$⁠, the quasi-score function (8) leads to the strong covariate balancing equation (3). Consequently, we call $Q_n(\beta)$ with $w_1(u)=1$ as the covariate balancing loss function. Thus, in Step 1 we allow a broad class of initial estimators |$\hat\beta$|⁠, including the penalized (quasi-)maximum likelihood estimator and many other penalized |$M$|-estimators corresponding to different |$w_1(u)$|⁠. By computing the Hessian matrix of |$Q_n(\beta)$|⁠, we find that (5) can be a nonconvex optimization problem depending on the choice of |$w_1(u)$|⁠. The nonconvexity may pose computational challenges. For instance, the gradient descent algorithm can be trapped at a local solution which is far from the global maximizer. To avoid the computational issue, we mainly focus on the concave quasilikelihood function |$Q_n(\beta)$|⁠. It is easy to verify that |$Q_n(\beta)$| with |$w_1(u)=\pi(u)$| in case (a) and |$w_1(u)=1$| in case (b) are both concave. In Step 2 we fit the outcome model using a class of penalized weighted least square estimators. We allow the weight |$w_2(\hat\beta^\top X_i)$| to depend on |$X_i$| and also the initial estimator |$\hat\beta$| from Step 1. For instance, we have the following examples. (a) If $w_2(u)=1$⁠, $L_n(\alpha)$ is the classical least square loss function in the treatment group. (b) If $w_2(u)=1/\pi(u)$⁠, $L_n(\alpha)$ is known as the inverse propensity score weighted least square loss. (c) If $w_2(u)=\pi'(u)/\pi^2(u)$⁠, $L_n(\alpha)$ remains a valid loss function for estimating $\alpha$⁠. It is shown in § 3.4 that this loss function plays an important role when studying the robustness of the proposed estimator to misspecified outcome models. In the following, we call this loss function the propensity score adjusted least square loss. Step 3 removes the bias induced by the penalized estimators used in Steps 1 and 2. We calibrate the estimated propensity score by balancing a subset of covariates |$X_{\tilde S}$|⁠, which represent the variables selected for the outcome model. The optimization problem (7) implies that the estimated propensity score |$\tilde \pi_i$| achieves the strong covariate balancing property only for these covariates |$X_{\tilde S}$| but not for the other covariates |$X_{\tilde S^c}$|⁠. Thus, unlike the original covariate balancing methodology, the proposed method does not achieve the strong covariate balancing property. Interestingly, however, the method does approximately satisfy the weak covariate balancing property if |$\alpha^*$| can be approximated well by |$\tilde\alpha$|⁠. Specifically, we have $$\begin{equation} \sum_{i=1}^n\left(\frac{T_i}{\tilde\pi_i}-1\right)\alpha^{*\top}X_{i}\ \approx \ \sum_{i=1}^n\left(\frac{T_i}{\tilde\pi_i}-1\right)\tilde\alpha^{\top}X_{i}\ = \ \sum_{i=1}^n\left(\frac{T_i}{\tilde\pi_i}-1\right)\tilde\alpha^{\top}_{\tilde S}X_{i \tilde S}\ = \ 0,\label{eqbalance} \end{equation}$$(9) where the first equality follows from |$\tilde\alpha_{\tilde S^c}=0$| and the second equality holds due to (7). In Step 4, we estimate |$\mu^*_1$| using the Horvitz–Thompson estimator. The proposed estimator can be rewritten as the Horvitz–Thompson estimator with the normalized weight, and also the augmented inverse probability weighting estimator. A detailed discussion is deferred to the Supplementary Material. 3. Theoretical properties of the proposed estimator 3.1. Assumptions We now study the theoretical properties of the proposed estimator. We begin by presenting the required assumptions. Assumption 1 (Unconfoundedness). The treatment assignment is unconfounded, i.e., |$\{Y_i(1), Y_i(0)\} \ \perp\!\!\!\perp \ T_i \mid X_i $|⁠. Assumption 2 (Overlap). There exists a constant |$c_0>0$| such that |$\pi_i^*\geqslant c_0$| for |$1\leqslant i\leqslant n$|⁠, where |$\pi_i^*=\pi(X_i^\top\beta^*)$| Assumption 1 implies that there are no unmeasured confounders, while Assumption 2 requires that all samples have a positive probability of receiving the treatment. Together, these represent the standard strong ignorability condition common to propensity score methods (Rosenbaum & Rubin, 1983). To estimate the treatment effect, one also needs to identify |${E}\{Y(0)\}$|⁠, which requires a similar overlap assumption |$\pi_i^*\leqslant 1-c_1$| for some constant |$c_1>0$|⁠. Assumption 3 (Sub-Gaussian condition). Assume that |$\varepsilon_1=Y(1)-\alpha^{*\top}X$| and |$X_j$| satisfy |$\|\varepsilon_1\|_{\psi_2}\leqslant C_\varepsilon$| and |$\|X_j\|_{\psi_2}\leqslant C_X$| for any |$1\leqslant j\leqslant d$|⁠, where |$C_X$| and |$C_{\varepsilon}$| are positive constants. Assumption 3 controls the tail behaviour of the error |$\varepsilon_1$| and the covariate |$X_j$|⁠, which facilitates the use of many existing concentration inequalities in high-dimensional statistics. Similar sub-Gaussian conditions are imposed by Athey et al. (2018) in their Theorem 5. Belloni et al. (2017) and Farrell (2015) relaxed the sub-Gaussian condition on |$\varepsilon_1$| to the bounded |$q$|th moment for some |$q>4$| under a slightly stronger sparsity assumption than our sparsity assumption below. Assumption 4 (Sparsity). Assume that |$(s_1\vee s_2)\log (d\vee n)/n^{1/2}=o(1)$| as |$s_1, s_2, d, n\rightarrow\infty$|⁠, where |$s_1=\|\beta^*\|_0$| and |$s_2=\|\alpha^*\|_0$|⁠. Recall that |$a\vee b=\max(a,b)$|⁠. The approximation errors in the propensity score model (1) and the outcome model (2) satisfy |$\sum_{i=1}^n r^2_i=O(s_2)$|⁠, |$\sum_{i=1}^n u^2_i=O(s_1)$|⁠, and |$\sum_{i=1}^n r_iu_i=o(n^{1/2}).$| Assumption 4 requires that both the propensity score and outcome models are sparse and the approximation errors are sufficiently small. Since we consider the high-dimensional case with |$d\gg n$|⁠, the sparsity assumption plays an important role in the regularized |$M$|-estimation of the propensity score and outcome models. In particular, if |$s_1\asymp s_2\asymp n^\kappa$| for some |$\kappa<1/2$|⁠, then the condition reduces to |$d=o\{\exp(n^{1/2-\kappa})\}$|⁠. This condition is similar to that in Belloni et al. (2014,2017) and Farrell (2015), where they imposed a slightly stronger condition with |$\log (d\vee n)$| replaced by |$\{\log (d\vee n)\}^q$| for some |$q>1$|⁠. Assumption 5 (Eigenvalue condition)> Denote |$\Sigma={E}(X^{\otimes 2})$|⁠. There exists a constant |$C>0$| such that |$C\leqslant \lambda_{\min}(\Sigma_{SS})\leqslant\lambda_{\max}(\Sigma_{SS})\leqslant 1/C$| for any |$S\subset\{1,\ldots,d\}$| with |$|S|\lesssim (s_1 \vee s_2)\log n$|⁠. When the dimension |$d$| is fixed, this assumption simply requires that the design matrix has full column rank, which is a standard regularity condition for regression problems. For high dimensions, Assumption 5 implies the well-known sparse eigenvalue condition introduced by Bickel et al. (2009); see Lemma 1 in Belloni & Chernozhukov (2013). The same sparse eigenvalue condition is imposed by Belloni et al. (2014). We also refer to Assumption 5 of Athey et al. (2018) and § 6.2 of Farrell (2015) for a similar restricted eigenvalue condition. Since our assumption only applies to any |$S \times S$| submatrix of |$\Sigma$|⁠, it is weaker than the |$C\leqslant \lambda_{\min}(\Sigma)\leqslant\lambda_{\max}(\Sigma)\leqslant 1/C$| imposed by Van de Geer et al. (2014), Ning & Liu (2017) and Cai & Guo (2017) for high-dimensional inference on lasso estimators. Assumption 6 (Propensity score and weight functions). Assume that |$Q_n(\beta)$| is a concave function. Let |$C,C'$| denote positive constants, which may take different values for each of the conditions below. (i) The propensity score model |$\pi(u)$| satisfies |$C\leqslant (\pi_i^{*})'\leqslant 1/C$|⁠, and there exist constants |$r>0$| and |$C'>0$| such that the Lipschitz condition holds locally, i.e., |$|\pi'(u)-\pi'(v)|\leqslant C'|u-v|$| for any |$u,v\in [X_i^\top\beta^*-r, X_i^\top\beta^*+r]$| and |$1\leqslant i\leqslant n$|⁠. (ii) The weight |$w_1(u)$| satisfies |$C\leqslant w_{1i}^{*}\leqslant 1/C$|⁠, |$(w_{1i}^{*})'\leqslant 1/C$| and the local Lipschitz condition |$|w_1'(u)-w_1'(v)|\leqslant C'|u-v|$| for any |$u,v\in [X_i^\top\beta^*-r, X_i^\top\beta^*+r]$| and |$1\leqslant i\leqslant n$|⁠, where |$w_{1i}^*=w_1(X_i^\top\beta^*)$|⁠. (iii) The weight |$w_2(u)$| satisfies |$C\leqslant w_{2i}^{*}\leqslant 1/C$| and |$(w_{2i}^{*})'\leqslant 1/C$| for |$1\leqslant i\leqslant n$|⁠, where |$w_{2i}^*=w_2(X_i^\top\beta^*)$|⁠. Assume |$w'_2(u)$| is continuous. Assumption 6 imposes mild regularity conditions on the propensity score function and weight functions. In part (i), we assume |$\pi(u)$| is differentiable and its derivative is bounded and Lipschitz around |$X_i^\top\beta^*$|⁠. Under the overlap assumption |$c_1\leqslant \pi_i^*\leqslant 1-c_1$|⁠, part (i) holds for the logistic regression without any further conditions. In parts (ii) and (iii), we assume mild conditions on the magnitude and smoothness of |$w_1(u)$| and |$w_2(u)$|⁠. Again, if |$\pi(u)$| is the logistic function and the overlap assumption holds, all examples of |$w_1(u)$| and |$w_2(u)$| discussed in § 2.2 satisfy the regularity conditions in parts (ii) and (iii). Thus, Assumption 6 automatically holds for the logistic propensity score model under the overlap assumption. 3.2. Asymptotic distribution under correct model specification We now derive the theoretical results for the proposed estimator |$\hat\mu_1$| when both the propensity score model (1) and the outcome model (2) are correctly specified. Recall that our estimator |$\hat\mu_1$| depends on the choice of the two weight functions, i.e., |$w_1(u)$| in Step 1 and |$w_2(u)$| in Step 2. In the following, we establish the asymptotic normality and semiparametric efficiency of |$\hat\mu_1$| for any weight functions |$w_1(u)$| and |$w_2(u)$|⁠. Theorem 1 (Asymptotic normality and semiparametric efficiency). Suppose that both the propensity score model (1) and the outcome model (2) are correctly specified and Assumptions 1–6 hold. If we take |$\lambda\asymp\lambda'\asymp \{{\log (d\vee n)/n}\}^{1/2}$|⁠, then the estimator |$\hat{\mu}_1$| with any weight functions |$w_1(u)$| and |$w_2(u)$| satisfies $$\hat{\mu}_1-\mu_1^* \ = \ \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i}{\pi^*_i}\{Y_i(1)-\alpha^{*\top}X_i \}+\alpha^{*\top}X_i -\mu_1^*\right]+O_p\left\{\frac{(s_1\vee s_2)\log (d\vee n)}{n}\right\} $$as |$d, n\rightarrow\infty$|⁠. Let |$V$| be the semiparametric asymptotic variance bound, i.e., $$V \ = \ {E}\left\{\frac{1}{\pi^*}{E}(\varepsilon_1^2\mid X)+(\alpha^{*\top}X-\mu_1^*)^2\right\}\text{.}$$ Assume that |${E}(\varepsilon_1^2\mid X)\geqslant c$| for some constant |$c>0$| and |${E}(\alpha^{*\top} X)^4=O(s_2^2)$|⁠. Then, |$n^{1/2}(\hat{\mu}_1-\mu_1^*)/V^{1/2}\rightarrow_d N(0,1)$|⁠. The theorem shows that |$\hat \mu_1-\mu_1^*$| is asymptotically equivalent to the average of the efficient score functions, and hence |$\hat \mu_1$| is locally efficient under the correct model specification. In addition, the asymptotic distribution of |$\hat \mu_1$| does not depend on the choice of the weight functions |$w_1(u)$| and |$w_2(u)$|⁠, provided that they satisfy Assumption 6. The intuition is that, as long as the weak covariate balancing property is approximately attained, the choice of the weight functions in the first two steps is less important. To prove the asymptotic normality of |$\hat\mu_1$|⁠, we further require that the variance of the noise cannot tend to 0, i.e., |${E}(\varepsilon_1^2\mid X)\geqslant c>0$|⁠. This guarantees the nondegeneracy of the asymptotic variance |$V$|⁠. We also assume |${E}(\alpha^{*\top} X)^4=O(s_2^2)$| in order to verify the Lyaponov condition for the central limit theorem. This is a mild technical condition. For instance, if |$X$| is a sub-Gaussian vector and |$\|\alpha^*\|_2=O(s_2^{1/2})$|⁠, then |$\|\alpha^{*\top} X\|_{\psi_2}\leqslant \|\alpha^*\|_2\|X\|_{\psi_2}=O(s_2^{1/2})$|⁠. This further implies the desired condition |${E}(\alpha^{*\top} X)^4=O(s_2^2)$| by the definition of the sub-Gaussian norm. The asymptotic variance |$V$| depends on the true data-generating process, which is allowed to change with |$d$| and |$n$|⁠. For this reason, we consider the limiting distribution of the standardized statistic |$n^{1/2}(\hat{\mu}_1-\mu_1^*)/V^{1/2}$| as |$n,d\rightarrow\infty$|⁠. Hahn (1998) proved that |$V$| is the semiparametric asymptotic variance bound, when both the propensity score and outcome models are treated as nuisance. He proposed a nonparametric inverse probability weighting estimator for fixed |$d$| that attains this semiparametric efficiency bound. We show that, when the high-dimensional models (1) and (2) are both correct, the estimator |$\hat \mu_1$| attains the same bound and is locally efficient. Our variance bound |$V$| is different from the oracle efficiency bound, which is the semiparametric variance bound with the known support of the propensity score and outcome models (Hahn, 2004). Since the support of both models is unknown and the variable selection consistency does not hold under our assumptions, the estimation of the support set leads to additional uncertainty. This explains why our method cannot attain the oracle efficiency bound. We refer to § 5.3 of Farrell (2015) for further discussion on this point. Remark 1 (Sample splitting). Chernozhukov et al. (2018) proposed a double machine learning method based on the sample splitting technique to reduce the bias and allow for more flexible estimators of nuisance functions; see also Zheng & van der Laan (2011). Newey & Robins (2018) showed that the sample splitting technique can improve the convergence rate of the remainder terms for semiparametric models. In the Supplementary Material we propose a modified algorithm based on sample splitting by replacing Assumption 4 with a weaker assumption: |$(s_1 s_2)^{1/2}\log (d\vee n)/n^{1/2}=o(1)$|⁠. Ignoring the logarithmic factors of |$d$| and |$n$|⁠, while Assumption 4 requires |$s_1=o(n^{1/2})$| and |$s_2=o(n^{1/2})$|⁠, the sample splitting method only requires the weaker condition |$s_1s_2=o(n)$|⁠, which may still hold if one model is dense, e.g., |$n^{1/2}\ll s_1\ll n$|⁠, and the other model is sufficiently sparse, e.g., |$s_2\ll n^{1/2}$|⁠. However, the sample splitting method incurs further computational cost and may not be stable when the sample size is small. Remark 2 (Sample boundedness). The proposed method guarantees that |$\hat \mu_1$| lies in the range of |$\{Y_i: T_i=1, i=1, \ldots, n\}$|⁠. This sample boundedness property (Robins et al., 2007) holds because, by construction, the covariate balancing equation satisfies $$\begin{equation}\label{eqsample} \frac{1}{n}\sum_{i=1}^n \left(\frac{T_i}{\tilde\pi_i}-1\right) \ = \ 0 \end{equation}$$(10) so long as an intercept is included in |$X_{i\tilde S}$|⁠. Equation (10) implies that the estimated propensity score |$\tilde\pi_i$| must be greater than or equal to |$1/n$| for any treated observation. In contrast, the estimated propensity score |$\pi_i^*$| for the |$i$|th observation via the penalized maximum likelihood estimation can become close to |$0$|⁠, leading to extremely large weights for some observations and unstable causal effect estimates. To see why the sample boundedness property holds, we have $$\begin{equation*} \frac{1}{n}\sum_{i=1}^n \frac{T_iY_i}{\tilde\pi_i}\ \geqslant \ \frac{\min_{i: T_i=1} Y_i}{n}\sum_{i=1}^n \frac{T_i}{\tilde\pi_i}\ = \ \min_{i: T_i=1} Y_i, \end{equation*}$$ where the last equality holds by (10). Similarly, we can show that |$\hat\mu_1\leqslant \max_{i: T_i=1} Y_i$|⁠. Finally, to construct a confidence interval for |$\mu_1^*$|⁠, we estimate |$V$| by $$\begin{equation}\label{eqestvar} \hat V \ = \ \frac{1}{n}\sum_{i=1}^n \left\{\frac{T_i}{\tilde\pi_i^2}(Y_i-\tilde\alpha^\top X_i)^2+(\tilde\alpha^\top X_i -\hat\mu_1)^2\right\}\text{.} \end{equation}$$(11) The following corollary shows that |$\hat V$| is a consistent estimator of |$V$| and therefore we obtain valid confidence intervals for |$\mu_1^*$|⁠. Corollary 1 (Honest confidence intervals). Suppose that the assumptions in Theorem 1 hold. Then $$|\hat V-V| \ = \ O_p\bigg[(s_1\vee s_2)\bigg\{\frac{{\log (d\vee n)}}{{n}}\bigg\}^{1/2}\bigg]\text{.}$$ Given |$0<\eta\leqslant 1$|⁠, define the |$(1-\eta)$|-confidence interval as |$\mathcal{I} \ = \ (\hat\mu_1-z_{1-\eta/2}({\hat V/n})^{1/2},\ \hat\mu_1+z_{1-\eta/2}({\hat V/n})^{1/2})$|⁠, where |$z_{1-\eta/2}$| is the |$(1-\eta/2)$| quantile of the standard normal distribution. Then $$\begin{equation}\label{thm1unif2} \big|{pr}(\mu_1^*\in\mathcal{I})-(1-\eta)\big| =o(1)\text{.} \end{equation}$$(12) Indeed, this confidence interval |$\mathcal{I}$| is honest in the sense that (12) holds uniformly over all probability distributions that satisfy Assumptions 1–6. In addition, the proof of Corollary 1 holds even if the error |$\varepsilon_1$| is heteroskedastic, i.e., |${E}(\varepsilon_1^2\mid X)$| depends on the value of |$X$|⁠. In the Supplementary Material we further extend these theoretical results to the estimation of the average treatment effect for the treated. 3.3. Asymptotic distribution under misspecified propensity score models We next investigate the robustness of the proposed estimator to the misspecification of the propensity score model. In this subsection we assume that the true propensity score |$\pi^*={\rm pr}(T=1\mid X)$| does not belong to the assumed parametric class |$\{\pi( X ^\top \beta): \beta\in\mathbb{R}^d\}$|⁠. To study the limiting behaviour of our estimator |$\hat\mu_1$| in this setting, we first define the estimand of |$\hat\beta$| in Step 1. Given the generalized quasilikelihood function |$Q_n(\beta)$|⁠, the estimand of |$\hat\beta$| in (5) is defined as $$\beta^o \ = \ {argmax}_{\beta \in \mathbb{R}^d} {E}\left[\int_0^{\beta^{\top}X_i }\left\{\frac{T_i}{\pi(u)}-1\right\}w_1(u) \, {\rm d} u\right]\text{.}$$ The estimand |$\beta^o$| implicitly depends on the choice of the weight function |$w_1(u)$|⁠, and when the model is correctly specified, |$\beta^o$| reduces to |$\beta^*$|⁠. We assume that the estimand |$\beta^o$| is sparse, which is a technical assumption required to study misspecified models in high-dimensional settings (Bühlmann & van de Geer, 2015). This assumption ensures that the |$d$|-dimensional vector |$\beta^o$| can be well estimated with a fast rate. In practice, this assumption implies that it is sufficient to only adjust for a small number of covariates in order to attain the covariate balance. By choosing the weight |$w_1(u)=1$|⁠, |$\beta^o$| satisfies $$\begin{equation}\label{eq_mis_ps} {E}\bigg\{\frac{T_i}{\pi(X_i^\top\beta^o)}-1\bigg\}X_i=0\text{.} \end{equation}$$(13) Thus, |$\beta^o$| can be viewed as a natural estimand by solving the strong covariate balancing equation (3). This property plays an important role under the misspecified propensity score model. To see this, we rewrite |$\hat{\mu}_1-\mu_1^*$| as $$\begin{equation}\label{eqexpan} \hat{\mu}_1-\mu_1^* \ = \ \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i}{\pi_i^o}\{Y_i(1)-\alpha^{*\top} X _i\}+\alpha^{*\top} X _i-\mu_1^*\right]+I_1+I_2, \end{equation}$$(14) where |$\pi_i^o=\pi(X_i^\top\beta^o)$|⁠, and the remainder terms |$I_1$| and |$I_2$| are defined as $$I_1\ = \ \frac{1}{n}\sum_{i=1}^n \left(\frac{T_i}{\tilde\pi_i}-\frac{T_i}{\pi_i^o}\right)\{Y_i(1)-\alpha^{*\top} X _i\}, \qquad I_2 \ = \ \frac{1}{n}\sum_{i=1}^n\left(\frac{T_i}{\tilde\pi_i}-1\right)\alpha^{*\top} X _i\text{.}$$ The remainder term |$I_1$| is sufficiently small by exploiting the rate of convergence of the estimated propensity score and the correct specification of the outcome model (2). The remainder term |$I_2$| is identical to the left-hand side of (9). However, in order to rigorously justify (9), we have to require the unbiasedness of the covariate balancing equation (13) when the propensity score model is misspecified. As a side note, if the estimated propensity score satisfies the covariate balancing equation (3) or (4), we immediately have |$I_2=0$|⁠. However, if other estimators of the propensity score model, such as the maximum likelihood estimator, are used, the corresponding estimand is typically different from |$\beta^o$| in (13). By taking the expectation of |$I_2$|⁠, we can show that the remainder term |$I_2$| is asymptotically biased. This shows the importance of (13) under misspecified propensity score models. As a comparison, for the augmented inverse probability weighting estimator, the expansion (14) holds with the same |$I_1$| term, and the second remainder term reduces to |$I_2' = n^{-1}\sum_{i=1}^n({T_i}/{\tilde\pi_i}-1)(\alpha^{*}-\hat\alpha)^\top X _i$|⁠. The method in Belloni et al. (2017) and Farrell (2015) is to estimate |$\beta^o$| by penalized maximum likelihood estimators. However, when the propensity score model is misspecified, the convergence rate of |$I_2'$| is slower than root-|$n$|⁠; see the Supplementary Material for a detailed analysis. We now establish the asymptotic properties of |$\hat\mu_1$| under misspecified propensity score models. Proposition 1 (Consistency and asymptotic normality under misspecified propensity score models). Suppose that the outcome model (2) is correctly specified, but the propensity score model (1) is misspecified. Assumptions 1–6 hold with |$\beta^*$| replaced by |$\beta^o$|⁠. If we take |$\lambda\asymp\lambda'\asymp \{{\log (d\vee n)/n}\}^{1/2}$|⁠, then the estimator |$\hat{\mu}_1$| with |$w_1(u)=1$| and any weight |$w_2(u)$| satisfies $$\hat{\mu}_1-\mu_1^* \ = \ \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i}{\pi^o_i}\{Y_i(1)-\alpha^{*\top} X _i\}+\alpha^{*\top} X _i-\mu_1^*\right]+O_p\left\{\frac{(s_1\vee s_2)\log (d\vee n)}{n}\right\},$$where |$\pi_i^o=\pi(X_i^\top\beta^o)$|⁠. Assume that |${E}(\varepsilon_1^2\mid X)\geqslant c$| for some constant |$c>0$| and |${E}(\alpha^{*\top} X)^4=O(s_2^2)$|⁠. This implies |$n^{1/2}(\hat{\mu}_1-\mu_1^*)/V_{\sf mis\mbox{-}ps}^{1/2}\rightarrow_d N(0,1)$| where $$V_{\sf mis\mbox{-}ps} \ = \ {E}\left\{\frac{\pi^*}{(\pi^{o})^2}{E}(\varepsilon_1^2\mid X )+(\alpha^{*\top} X -\mu_1^*)^2\right\}\text{.}$$ If the propensity score model is correctly specified, i.e., |$\pi^o_i=\pi^*_i$|⁠, then the asymptotic variance |$V_{\sf mis\mbox{-}ps}$| in Proposition 1 reduces to the asymptotic variance |$V$| in Theorem 1. To construct the confidence interval for |$\mu_1^*$|⁠, we need to estimate |$V_{\sf mis\mbox{-}ps}$|⁠. By inspecting the proof of Corollary 1, we can show that the estimator |$\hat V$| defined in (11) is still consistent for |$V_{\sf mis\mbox{-}ps}$|⁠, even if the propensity score model is misspecified. Thus, the confidence interval shown in Corollary 1 is valid whether or not the propensity score model is misspecified. 3.4. Asymptotic distribution under misspecified outcome models In this subsection we study the robustness of the proposed estimator to the misspecification of the outcome model. Assume that the propensity score model is correct, but the true conditional mean function |${E}\{Y_i(1)\mid X_i\}$| is nonlinear in |$X_i$|⁠, i.e., there does not exist |$\alpha^*$| such that |${E}\{Y_i(1) \mid X_i\}=\alpha^{*\top}X_i$|⁠. Similar to § 3.3, we first define the estimand of |$\tilde\alpha$| in Step 2 as $$\alpha^o \ = \ \mathop {\rm argmin}\limits_{\alpha \in \mathbb{R}^d} {E}\left\{T_i w_2(\beta^{*\top}X_i)(Y_i-\alpha^{\top}X_i )^2\right\},$$ which in turn depends on the weight function |$w_2(\cdot)$|⁠. Similarly, we assume |$\alpha^o$| is sparse with |$s_2=\|\alpha^o\|_0$|⁠. We establish the asymptotic properties of |$\hat\mu_1$| under misspecified outcome models. Proposition 2 (Consistency and asymptotic normality under misspecified outcome models). Suppose that the propensity score model (1) is correctly specified, but the outcome model (2) is misspecified. Assumptions 1–6 hold with |$\alpha^*$| replaced by |$\alpha^o$|⁠. If we take |$\lambda\asymp\lambda'\asymp \{{\log (d\vee n)/n}\}^{1/2}$|⁠, then the estimator |$\hat{\mu}_1$| with any weight functions |$w_1(u)$| and |$w_2(u)=\pi'(u)/\pi^2(u)$| satisfies $$\hat{\mu}_1-\mu_1^* \ = \ \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i}{\pi^*_i}\{Y_i(1)-\alpha^{o\top} X _i\}+\alpha^{o\top} X _i-\mu_1^*\right]+O_p\left\{\frac{(s_1\vee s_2)\log (d\vee n)}{n}\right\}\text{.}$$ Assume that |${E}(\varepsilon_1^{o2}\mid X)\geqslant c$| for some constant |$c>0$| and |${E}(\alpha^{o\top} X)^4=O(s_2^2)$|⁠, where |$\varepsilon_1^o=Y(1)-\alpha^{o\top}X$|⁠. This implies that |$n^{1/2}(\hat{\mu}_1-\mu_1^*)/V_{\sf mis\mbox{-}o}^{1/2} \rightarrow_d N(0,1)$|⁠, where $$V_{\sf mis\mbox{-}o} \ = \ {E}\left\{\frac{1}{\pi^{*}}{E}(\varepsilon^{o2}\mid X )+(\alpha^{o\top} X -\mu_1^*)^2\right\}\text{.}$$ The results in this proposition parallel those in Proposition 1. Specifically, when the outcome model is misspecified, if we choose |$w_2(u)=\pi'(u)/\pi^2(u)$|⁠, or equivalently the propensity score adjusted least square loss |$L_n(\alpha)$| in example (c|$'$|⁠) of § 2.2, the desired properties such as root-|$n$| consistency and asymptotic normality are attained. Similarly, the form of |$w_2(u)$| is motivated by the remainder term in the asymptotic expansion of |$\hat\mu_1$| under the misspecified outcome model. Finally, the estimator |$\hat V$| in (11) is consistent for the asymptotic variance |$V_{\sf mis\mbox{-}o}$|⁠. Remark 3 (Double robustness and honest confidence intervals). Propositions 1 and 2 together imply that our estimator |$\hat\mu_1$| is root-|$n$| consistent and asymptotically normal provided either the propensity score model or outcome model is correctly specified. This estimator does not require knowing which of the two models is correct. Since |$\hat V$| is always consistent, the same confidence interval defined in Corollary 1 is valid as long as one of the two models is correctly specified. Thus, we recommend the use of this estimator and the associated confidence interval in practice. Finally, we comment that when |$w_1(u)=1$| and |$w_2(u)=\pi'(u)/\pi^2(u)$|⁠, the gradient of |$Q_n(\beta)$| and |$L_n(\alpha)$| is related to the estimating equations proposed by Robins et al. (2007); see the Supplementary Material for details. 3.5. Comparison with related work In this subsection we compare our method with several related papers. Belloni et al. (2017) and Farrell (2015) showed that the augmented inverse probability weighting estimator is asymptotically normal and efficient in high dimensions when both the propensity score and outcome models are correctly specified. Their assumptions and main results are parallel to those of our Theorem 1. However, the sample boundedness property, see Remark 2, does not hold for the augmented inverse probability weighting estimator in general. When either the propensity score model or the outcome model is misspecified, Propositions 1 and 2 provide a complete characterization of the asymptotic behaviour of our estimator. In the same context, Farrell (2015) proved that the augmented inverse probability weighting estimator is consistent, but his Theorem 2 does not derive an explicit convergence rate. We show in the Supplementary Material that the augmented inverse probability weighting estimator |$\bar\mu_1$| satisfies |$\bar\mu_1 - \mu_1^*\ = \ O_p[\{{(s_1\vee s_2) \log (d\vee n)}/{n}\}^{1/2}]$|⁠, which is slower than |$n^{-1/2}$|⁠, and thus the confidence intervals for the treatment effect are not available under model misspecification. In contrast, our estimator is root-|$n$| consistent, which leads to honest confidence intervals as shown in § 3.3 and § 3.4. This robustness of the asymptotic distributions to model specification is the main advantage over the augmented inverse probability weighting estimators (Farrell, 2015; Belloni et al., 2017) and the double selection estimator (Belloni et al., 2014). The approximate residual balancing method proposed by Athey et al. (2018) does not require the propensity score model to be sparse or even well formulated. Thus, their method is robust to the misspecification of the propensity score model so long as the outcome model is correct. In contrast, our method requires both models to be sparse. The advantage of our framework is that it tolerates the misspecified outcome model, so long as the propensity score model is correctly specified. Furthermore, the linearity assumption of the outcome model plays an important role in Athey et al. (2018) and their method is not readily applicable if the outcome model is nonlinear. In contrast, our method is robust to the misspecification of the outcome model and also can be generalized to nonlinear outcome models. As an illustration of its generalizability, we consider the extension of the proposed methodology to generalized linear models in § 4. In addition, when the propensity score model is correctly specified and is indeed sparse, the estimation of the propensity score can help scientists better understand the treatment assignment mechanism (e.g., Rubin, 2008). Recall that |$\tilde\pi_i=\pi(\tilde\beta^\top X_i)$|⁠. As a byproduct of Theorem 1, our estimated propensity score is uniformly consistent, $$\begin{equation*} \max_{1\leqslant i\leqslant n}|\tilde \pi_i-\pi^*_i|\ = \ O_p\bigg\{\frac{(s_1\vee s_2)\log (d\vee n)({\log n})^{1/2}}{{n}^{1/2}}\bigg\}\text{.} \end{equation*}$$ Thus, the estimated propensity score |$\tilde\pi_i$| is an accurate approximation to the unknown treatment assignment mechanism. Hirshberg & Wager (2019) recently showed that under suitable regularity conditions the approximate residual balancing method is also semiparametrically efficient and the balancing weights converge to the inverse propensity score but with a slower rate. Most recently and independently of this paper, Tan (2017,2018) proposed a penalized calibrated propensity score method and studied its robustness to model misspecification. Our work is closely related to Tan (2017), which can be seen as equivalent to directly plugging the initial estimator |$\hat\beta$| into the Horvitz–Thompson estimator with |$w_1(u)=1$|⁠. However, this method does not balance the covariates as we do in Step 3. Corollary 3 of Tan (2017) implies that the estimator has the rate of convergence |$O_p\{(s_1\log d/n)^{1/2}\}$|⁠, which is slower than that of our estimator. In our proof, one can treat |$\sum_{i=1}^n\left({T_i}/{\hat\pi_i}-1\right)\alpha^{*\top}X_{i}$| as the bias of the Horvitz–Thompson estimator, which is eliminated by the covariate balancing step, whereas this term remains in Tan (2017). In the follow-up paper, Tan (2018) removed this bias by constructing an augmented inverse probability weighting estimator. However, our result is more general. First, we show that there exists a large class of estimators that are asymptotically normal under possible model misspecification. Second, our theory holds for generalized linear models, whereas Tan’s (2018) method is not applicable if the propensity score model is misspecified. 4. Covariate balancing for generalized linear models 4.1. Method We now extend our method to the setting in which the outcome follows a generalized linear model. The validity of many existing methods critically relies on the assumption that the outcome model is linear in the covariates |$ X _i$|⁠, or in some transformations, e.g., spline basis, of |$ X _i$|⁠. Thus, generalizing the covariate balancing approach to nonlinear models is an important extension. Assume that the working model for |$Y_i(1)$| given |$ X _i$| belongs to the exponential family, $$\begin{equation}\label{eqglm} p(y\mid X ) \ = \ h(y,\phi)\exp\left[\frac{1}{a(\phi)}\{y\alpha^{*\top} X -b(\alpha^{*\top} X )\}\right], \end{equation}$$(15) where |$h(\cdot,\cdot)$|⁠, |$a(\cdot)$| and |$b(\cdot)$| are known functions, |$\phi$| is the dispersion parameter and |$\alpha^*$| is a |$d$|-dimensional vector of unknown regression parameters. For simplicity, we assume that the dispersion parameter |$\phi$| is known. Given this set-up, we propose the following modification of our methodology described in § 2.2. Step 1. Fit the outcome model via the penalized maximum likelihood method within the treatment group, $$ \hat\alpha \ = \ \mathop {\rm argmin}\limits_{\alpha\in\mathbb{R}^{d}}\left[-\frac{1}{n}\sum_{i=1}^n \frac{T_i}{a(\phi)}\{Y_i\alpha^{\top} X _i-b(\alpha^{\top} X _i)\}+\lambda_0\|\alpha\|_1\right] ,$$ where |$\lambda_0>0$| is a tuning parameter. Step 2. This step is identical to Step 1 in § 2.2, where the weight function |$w_1(u)$| is replaced by |$w_1(\hat\alpha^{\top}X_i, u)$|⁠. This defines the initial estimator |$\hat\beta$|⁠. Step 3. Re-estimate the outcome model via the penalized weighted maximum likelihood method within the treatment group, $$ \tilde\alpha \ = \ \mathop {\rm argmin}\limits_{\alpha\in\mathbb{R}^{d}}\left[-\frac{1}{n}\sum_{i=1}^n \frac{T_i w_2(\hat\beta^\top X_i)}{a(\phi)}\{Y_i\alpha^{\top} X _i-b(\alpha^{\top} X _i)\}+\lambda'\|\alpha\|_1\right] ,$$ where |$\lambda'>0$| is a tuning parameter and |$w_2(\cdot)$| is the weight function similar to Step 2 in § 2.2. Step 4. Define |$\tilde S=\{j: |\tilde\alpha_j|>0\}$| and |$f(X)=b''(\tilde\alpha^{\top} X ) X_{\tilde S}$|⁠. Compute $$\begin{equation*} \tilde\gamma \ = \ \mathop {\rm argmin}\limits_{\gamma\in\mathbb{R}^{|\tilde S|}}\|g_n(\gamma)\|^2_2 , \quad {\rm where} \ g_n(\gamma)\ = \ \frac{1}{n}\sum_{i=1}^n \left\{\frac{T_i}{\pi(\gamma^\top\bar X _{i\tilde S}+\hat\beta^\top_{\tilde S^c} X _{i\tilde S^c})}-1\right\} f ( X _i)\text{.} \end{equation*}$$ Set |$\tilde\beta=(\tilde\gamma, \hat\beta_{\tilde S^c})$| and |$\tilde\pi_i=\pi(\tilde\beta^\top X _i)$|⁠. Step 5. Estimate |$\mu_1^*$| by |$\hat \mu_1={n}^{-1} \sum_{i=1}^{n}{T_iY_i}/{\tilde\pi_i}-{n}^{-1} \sum_{i=1}^{n}({T_i}/{\tilde\pi_i}-1)b'(\tilde\alpha^{\top} X_i)$|⁠. The current algorithm differs from that in § 2.2 in the following three aspects. First, Step 1 yields an initial estimator of |$\alpha$|⁠, which is then incorporated into the weight function |$w_1(\hat\alpha^{\top}X_i, u)$|⁠. As shown later in Theorem 2, the choice of the weight function |$w_1(\hat\alpha^{\top}X_i, u)=b''(\hat\alpha^{\top}X_i)$| leads to a valid asymptotic distribution of |$\hat\mu_1$| under misspecified propensity score models. Thus, we need this extra step to obtain an initial estimator of |$\alpha$|⁠. Second, Step 4 balances the weighted covariates |$f(X)=b''(\tilde\alpha^{\top} X ) X_{\tilde S}$| instead of |$ X _{i \tilde S}$| as done in (7). The reason is that to achieve a similar weak covariate balancing property, one must balance a vector of functions |$f( X )$| such that |$b'(\alpha^{*\top} X )\in \textrm{span}\{ f ( X )\}$|⁠, where |$\textrm{span}\{ f ( X )\}$| represents the linear space generated by the basis functions |$ f ( X )$|⁠. Let |$S$| denote the support set for |$\alpha$|⁠, i.e., |$S=\{j: |\alpha^*_j|>0\}$|⁠. Since |$b'(\alpha^{*\top} X )$| is unknown in practice, we approximate |$b'(\alpha^{*\top} X )=b'(\alpha^{*\top}_S X _S)$| by a local linear estimator |$b'(\tilde\alpha^\top X )+b''(\tilde\alpha^\top X )(\alpha^*-\tilde\alpha)_S X_S $|⁠. Furthermore, if we replace |$S$| by an estimator |$\tilde S$|⁠, this leads to the weighted covariates |$f(X)=b''(\tilde\alpha^{\top} X ) X_{\tilde S}$| in Step 4. Unfortunately, balancing |$f(X)$| alone does not attain the approximate weak covariate balancing property, because the leading term |$b'(\tilde\alpha^\top X )$| in the local linear approximation has not been considered. It is possible to add |$b'(\tilde\alpha^\top X )$| into the covariate balancing equation, which leads to |$f(X)=\{b'(\tilde\alpha^\top X ), b''(\tilde\alpha^{\top} X ) X_{\tilde S}\}$|⁠. However, doing so leads to additional technical assumptions on the eigenvalues of |$f(X)$|⁠. To avoid such assumptions, we only attain partial covariate balancing in Step 4 by choosing |$f(X)=b''(\tilde\alpha^{\top} X ) X_{\tilde S}$|⁠. Third, Step 5 applies the augmented inverse probability weighting estimator rather than the Horvitz–Thompson estimator used in § 2.2. The additional term, i.e., |${n}^{-1} \sum_{i=1}^{n}({T_i}/{\tilde\pi_i}-1)b'(\tilde\alpha^{\top} X_i)$|⁠, comes from the bias due to the imbalance of |$b'(\tilde\alpha^\top X )$|⁠. The augmented inverse probability weighting estimator agrees with the Horvitz–Thompson estimator when the outcome model is linear. In this case we have |$b''(u)=1$| and |$b'(u)=u$|⁠, and balancing |$b''(\tilde\alpha^{\top} X )X_{\tilde S}=X_{\tilde S}$| is sufficient to remove the imbalance effect of |$b'(\tilde\alpha^\top X )=\tilde\alpha^\top X$|⁠. Thus, as expected, the current algorithm reduces to the one in § 2.2 under the linear outcome model. In addition, if the outcome model is the Poisson regression, we can also apply the Horvitz–Thompson estimator because under this model |$b''(u)=b'(u)=\exp(u)$| and therefore balancing |$b''(\tilde\alpha^{\top} X )X_{\tilde S}$| is sufficient, provided the intercept term is included. 4.2. Theoretical results Assumption 7 (Subexponential condition). Assume that |$\varepsilon_1=Y(1)-b'(\alpha^{*\top} X )$| satisfies |$\|\varepsilon_1\|_{\psi_1}\leqslant C$| and |$\max_{1\leqslant i\leqslant n,\, 1\leqslant j\leqslant d}|X_{ij}|\leqslant C_n$|⁠, where |$C$| is a positive constant and we allow |$C_n$| to increase with |$n$|⁠. Assumption 8 (Sparsity). Let us denote |$s_1=\|\beta^*\|_0$| and |$s_2=\|\alpha^*\|_0$|⁠. Assume that |$C_n^2(s_1\vee s_2)\log (d\vee n)/n^{1/2}=o(1)$|⁠, where |$C_n$| is defined in Assumption 7. Assumption 9 (Propensity score, outcome model and weight functions). Assume that |$Q_n(\beta)$| is a concave function. Let |$C,C'$| denote positive constants, which may take a different value for each of the conditions below. (i) The same condition (i) in Assumption 6 holds for the propensity score model |$\pi(u)$|⁠. (ii) In the outcome model, |$b(u)$| is third-order continuously differentiable and |$|X_i^\top\alpha^*|\leqslant C'$|⁠. (iii) The weight function |$w_1(u,v)$| satisfies the following conditions in a small neighbourhood of |$u^*=X_i^\top\alpha^*$| and |$v^*=X_i^\top\beta^*$|⁠: |$C \leqslant w_{1}(u,v)\leqslant 1/C$|⁠, |$0\leqslant w'_1(u,v)\leqslant 1/C$|⁠, and the Lipschitz condition in |$u$|⁠, |$|w_1(u,v^*)-w_1(u',v^*)|\leqslant C'|u-u'|$|⁠, where |$u\in [u^*-r, u^*+r]$|⁠, |$v\in [v^*-r, v^*+r]$| for some small constant |$r>0$| and |$w_1'(u,v)=\partial w_1(u,v)/\partial v$|⁠. (iv) The same condition (iii) in Assumption 6 holds for the weight |$w_2(u)$|⁠. Unlike the sub-Gaussian condition in Assumption 3, we allow the error |$\varepsilon_1$| to be subexponential in Assumption 7. This extension is necessary because many examples of generalized linear models, e.g., exponential regression and Poisson regression, satisfy the subexponential condition but not the sub-Gaussian condition. We also allow |$C_n$| to possibly grow with |$n$|⁠. Specifically, when |$X_{ij}$| is uniformly bounded, |$C_n$| is a positive constant. When |$X_{ij}$| is sub-Gaussian, then |$\max_{1\leqslant i\leqslant n,\, 1\leqslant j\leqslant d}|X_{ij}|=O_p(\{\log (nd)\}^{1/2})$|⁠. Assumption 8 requires a similar sparsity condition, and allows |$C_n$| to increase with |$n$|⁠. For simplicity, we focus on the exact sparse case with no approximation errors in the propensity score and outcome models. Parts (i) and (iv) of Assumption 9 are identical to Assumption 6. Part (ii) is a mild condition, stating that the regression effect in the outcome model is bounded. The third-order differentiability of |$b(u)$| holds for most generalized linear models. Part (iii) is a technical condition. To analyse the estimator |$\hat\beta$| in Step 2, we need to control |$w_1(u,v)$| and |$w'_1(u,v)$| in a small neighbourhood of the true values. This condition holds for two important examples, |$w_1(u,v)=\pi(v)$| and |$w_1(u,v)=b''(u)$|⁠. The former corresponds to example (a) in § 2.2, and the latter represents the generalization of example (b) to generalized linear models. To study the performance of our estimator under misspecified models, as in § 3.3 and § 3.4 we define the least false parameters as $$\beta^o \ = \ {argmax}_{\beta \in \mathbb{R}^d} {E}\left[\int_0^{\beta^{\top}X_i }\left\{\frac{T_i}{\pi(u)}-1\right\}w_1(X_i^\top\alpha^*,u) \, {\rm d} u\right],$$ $$\alpha^o \ = \ \mathop {\rm argmin}\limits_{\alpha \in \mathbb{R}^d} {E}\left[\frac{T_i w_2(\beta^{*\top} X_i)}{a(\phi)}\{Y_i\alpha^{\top} X _i-b(\alpha^{\top} X _i)\}\right]\text{.}$$ The following theorem establishes the asymptotic normality of |$\hat\mu_1$| when the outcome variable follows a generalized linear model. When analysing the theoretical properties under model misspecification, we replace |$\alpha^*$| and |$\beta^*$| in all assumptions with |$\alpha^o$| and |$\beta^o$|⁠. Theorem 2 (Asymptotic properties under generalized linear models). Suppose that Assumptions 1, 2, 5, 7, 8 and 9 hold, and that the tuning parameters satisfy |$\lambda_0\asymp\lambda\asymp\lambda'\asymp \{{\log (d\vee n)/n}\}^{1/2}$|⁠. (i) Assume that both the propensity score model (1) and the outcome model (15) are correctly specified. Then the estimator |$\hat{\mu}_1$| with any weight functions |$w_1(u,v)$| and |$w_2(u)$| satisfies $$\hat{\mu}_1-\mu_1^* \ = \ \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i}{\pi^*_i}\{Y_i(1)-b'(\alpha^{*\top}X_i) \}+b'(\alpha^{*\top}X_i) -\mu_1^*\right]+o_p(n^{-1/2}),$$and |$\hat{\mu}_1$| achieves the same semiparametric efficiency bound. (ii) Assume that the outcome model (15) is correctly specified, but the propensity score model (1) is misspecified. If we choose |$w_1(u,v)=b''(u)$|⁠, then for any |$w_2(u)$| we have $$\hat{\mu}_1-\mu_1^* \ = \ \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i}{\pi(X_i^\top\beta^o)}\{Y_i(1)-b'(\alpha^{*\top}X_i) \}+b'(\alpha^{*\top}X_i) -\mu_1^*\right]+o_p(n^{-1/2})\text{.}$$ (iii) Suppose that the propensity score model (1) is correctly specified, but the outcome model (15) is misspecified. If we set |$w_2(u)=\pi'(u)/\pi^2(u)$|⁠, then for any |$w_1(u,v)$| we have $$\hat{\mu}_1-\mu_1^* \ = \ \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i}{\pi^*_i}\{Y_i(1)-b'(\alpha^{o\top}X_i) \}+b'(\alpha^{o\top}X_i) -\mu_1^*\right]+o_p(n^{-1/2})\text{.}$$ Part (i) of Theorem 2 is the extension of Theorem 1 to generalized linear models. Under the correct model specification, the asymptotic normality of |$\hat\mu_1$| holds for any weight functions |$w_1(u,v)$| and |$w_2(u)$| that satisfy Assumption 9. This result agrees with the theory of augmented inverse probability weighting estimators in Belloni et al. (2017) and Farrell (2015). Unlike the existing work, parts (ii) and (iii) provide novel results on the asymptotic normality of |$\hat\mu_1$| when either the propensity score model or the outcome model is misspecified. Similar to Propositions 1 and 2, these results hold only if particular forms of |$w_1(u,v)$| and |$w_2(u)$| are chosen to remove the bias from model misspecification. In particular, we use the weight |$w_1(u,v)=b''(u)$| in part (ii), which requires knowledge of |$\alpha^*$| in the outcome model. This explains why Step 1 is needed. Since part (ii) holds for any weight function |$w_2(u)$|⁠, the estimator |$\hat{\mu}_1$| remains asymptotically normal even if we skip Step 3 and replace |$\tilde\alpha$| in Step 4 with |$\hat\alpha$| in Step 1. Similarly, part (iii) holds for any weight function |$w_1(u,v)$|⁠. Thus, if we set |$w_2(u)=\pi'(u)/\pi^2(u)$| and |$w_1(u,v)=\pi(v)$|⁠, we may skip Step 1 of our algorithm and the same result in part (iii) still applies. Similar to Remark 3, when |$w_1(u,v)=b''(u)$| and |$w_2(u)=\pi'(u)/\pi^2(u)$| the proposed estimator |$\hat\mu_1$| is asymptotically normal provided that either the propensity score model or the outcome model is correctly specified. This estimator does not require knowing which of the two models is correct, and therefore is recommended for practical use. 5. Simulation studies In this section we conduct simulation studies to evaluate the finite-sample performance of the proposed methodology. We consider the following data-generating processes. First, we generate the |$d$|-dimensional covariate |$X_i\sim N(0,\Sigma)$|⁠, where |$\Sigma_{jk}=\rho^{|j-k|}$| with |$\rho=1/2$|⁠. We generate the binary treatment |$T_i$| using the logistic regression model of the form |$\pi(X_i) = 1-1 / \{1 + \exp(-X_{i1} +X_{i2}/2-X_{i3}/4-X_{i4}/10-X_{i5}/10+X_{i6}/10)\}$|⁠. For the potential outcomes, we consider both linear and logistic regression models as specified later. The observed outcome is |$Y_i=Y_i(1)T_i+Y_i(0)(1-T_i)$|⁠. The simulation is repeated 200 times under each setting. Throughout the simulation studies, whenever possible we compare our method to the approximate residual balancing method (Athey et al., 2018), the regularized augmented inverse probability weighting method (Farrell, 2015; Belloni et al., 2017) and the double selection estimator (Belloni et al., 2014). For the sake of comparison, we use the lasso penalty in both our method and the regularized augmented inverse probability weighting method, and all tuning parameters are determined by five-fold cross-validation. The weight functions in our method are chosen according to Remark 3. For the approximate residual balancing method we use the default values of the tuning parameters in the R package balanceHD. The double selection method is implemented using the R package hdm. We first consider the linear regression models for the potential outcomes, $$\begin{equation*} \begin{split} Y_i(1) \ = \ 2 &+ 0 . 137(X_{i5} + X_{i6}+ X_{i7}+X_{i8})+ \varepsilon_{1i},\\ Y_i(0) \ = \ 1 &+ 0 . 291(X_{i5} + X_{i6} + X_{i7} +X_{i8}+ X_{i9}+X_{i10})+ \varepsilon_{0i},\\ \end{split} \end{equation*}$$ where |$\varepsilon_{1i}$| and |$\varepsilon_{0i}$| are independent standard normal random variables. Under this setting, we consider the following four scenarios. In the first scenario, (A), we assume that the propensity score and outcome models are correctly specified. In the second scenario, (B) the outcome models are correctly specified but the propensity score model is misspecified. We use the transformed variables |$X_{\rm mis} = \{\exp(X_1/2), X_2/\{1+\exp(X_1)\}+10, (X_1X_3/25+0 . 6)^3, (X_2+X_4+20)^2, X_6, \exp(X_6+X_7), X_9^2, X_7^3-20, X_9, \ldots , X_d\}$| to generate the treatment but the original variables |$X$| to generate the outcome variables. In the third scenario, (C) the propensity score model is correctly specified but the outcome models are misspecified. We use the same transformed variables |$X_{\rm mis}$| to generate the outcomes, but the original variables |$X$| to generate the treatment. Finally, in scenario (D) both the outcome and propensity score models are misspecified using the transformed covariates. This model misspecification follows the work of Kang & Schafer (2007), who evaluated the empirical performance of the augmented inverse probability weighting estimator in low-dimensional settings. Table 1 shows the bias, standard error, standardized root-mean-squared error |$\{{{E}(\hat{\mu} - \mu)^2}\}^{1/2}/\mu$|⁠, coverage probability of 95% confidence intervals and their length for the estimation of the average treatment effect under the four scenarios. We focus on the high-dimensional setting with |$d=1000, 2000$| and sample size |$n=500$|⁠. Additional simulation studies under a variety of different settings appear in the Supplementary Material. Table 1 shows that the proposed method tends to have a smaller standardized root-mean-squared error in most scenarios. More importantly, as seen in scenarios (B) and (C), the fact that the proposed method has an accurate coverage probability under model misspecification provides empirical support for the robustness property established in Propositions 1 and 2. In contrast, the augmented inverse probability weighting estimator has a significant bias under scenarios (B) and (C). As a result, its coverage probability is too small in most cases. The approximate residual balancing and double selection methods perform reasonably well under model misspecification. But their confidence intervals tend to be wider than those of the proposed method. Table 1. Bias, standard error, standardized root-mean-squared error, coverage probability of 95% confidence intervals and length of 95% confidence intervals for the estimation of the average treatment effect. Four methods are compared: the proposed method, approximate residual balancing, regularized augmented inverse probability weighting estimator and double selection |$n=500$| . |$d=1000$| . |$d=2000$| . . HD-CBPS . RB . AIPW . D-SELECT . HD-CBPS . RB . AIPW . D-SELECT . (A) Both models are correct Bias |$-$|0.0026 |$-$|0.0017 |$-$|0.0498 |$-$|0.0910 |$-$|0.0595 |$-$|0.0580 |$-$|0.1200 |$-$|0.0397 Std err 0.0936 0.1074 0.0926 0.0979 0.1061 0.1155 0.1011 0.1279 RMSE 0.0936 0.1074 0.1052 0.1337 0.1216 0.1292 0.1569 0.1334 Coverage 0.965 0.930 0.915 0.890 0.910 0.910 0.855 0.945 CI length 0.3867 0.4231 0.3775 0.4294 0.3862 0.4359 0.3731 0.5034 (B) Propensity score model is misspecified Bias |$-$|0.0120 |$-$|0.0303 |$-$|0.1078 |$-$|0.0782 |$-$|0.0446 |$-$|0.0685 |$-$|0.1234 |$-$|0.0357 Std err 0.0984 0.1153 0.0963 0.1034 0.0924 0.1041 0.0921 0.1214 RMSE 0.0991 0.1193 0.1446 0.1296 0.1025 0.1246 0.1540 0.1265 Coverage 0.965 0.945 0.815 0.905 0.930 0.910 0.740 0.940 CI length 0.3864 0.4431 0.3732 0.4227 0.3839 0.4382 0.3702 0.5023 (C) Outcome model is misspecified Bias |$-$|0.0034 |$-$|0.0321 |$-$|0.0562 |$-$|0.0991 |$-$|0.0317 |$-$|0.0572 |$-$|0.1215 |$-$|0.0443 Std err 0.0917 0.0982 0.0914 0.1023 0.0944 0.0992 0.0921 0.1026 RMSE 0.0917 0.1033 0.1072 0.1424 0.0995 0.1145 0.1525 0.1118 Coverage 0.960 0.960 0.905 0.845 0.950 0.955 0.770 0.945 CI length 0.3874 0.4292 0.3815 0.4327 0.3890 0.4403 0.3728 0.4261 (D) Both models are misspecified Bias |$-$|0.0547 |$-$|0.1201 |$-$|0.1873 |$-$|0.1005 |$-$|0.0243 |$-$|0.0599 |$-$|0.1393 |$-$|0.0518 Std err 0.1106 0.1038 0.0903 0.0950 0.0969 0.1060 0.0921 0.0965 RMSE 0.1234 0.1588 0.2079 0.1383 0.0999 0.1218 0.1670 0.1095 Coverage 0.890 0.815 0.775 0.875 0.940 0.940 0.720 0.950 CI length 0.3994 0.4586 0.3790 0.4333 0.3948 0.4545 0.3781 0.4334 |$n=500$| . |$d=1000$| . |$d=2000$| . . HD-CBPS . RB . AIPW . D-SELECT . HD-CBPS . RB . AIPW . D-SELECT . (A) Both models are correct Bias |$-$|0.0026 |$-$|0.0017 |$-$|0.0498 |$-$|0.0910 |$-$|0.0595 |$-$|0.0580 |$-$|0.1200 |$-$|0.0397 Std err 0.0936 0.1074 0.0926 0.0979 0.1061 0.1155 0.1011 0.1279 RMSE 0.0936 0.1074 0.1052 0.1337 0.1216 0.1292 0.1569 0.1334 Coverage 0.965 0.930 0.915 0.890 0.910 0.910 0.855 0.945 CI length 0.3867 0.4231 0.3775 0.4294 0.3862 0.4359 0.3731 0.5034 (B) Propensity score model is misspecified Bias |$-$|0.0120 |$-$|0.0303 |$-$|0.1078 |$-$|0.0782 |$-$|0.0446 |$-$|0.0685 |$-$|0.1234 |$-$|0.0357 Std err 0.0984 0.1153 0.0963 0.1034 0.0924 0.1041 0.0921 0.1214 RMSE 0.0991 0.1193 0.1446 0.1296 0.1025 0.1246 0.1540 0.1265 Coverage 0.965 0.945 0.815 0.905 0.930 0.910 0.740 0.940 CI length 0.3864 0.4431 0.3732 0.4227 0.3839 0.4382 0.3702 0.5023 (C) Outcome model is misspecified Bias |$-$|0.0034 |$-$|0.0321 |$-$|0.0562 |$-$|0.0991 |$-$|0.0317 |$-$|0.0572 |$-$|0.1215 |$-$|0.0443 Std err 0.0917 0.0982 0.0914 0.1023 0.0944 0.0992 0.0921 0.1026 RMSE 0.0917 0.1033 0.1072 0.1424 0.0995 0.1145 0.1525 0.1118 Coverage 0.960 0.960 0.905 0.845 0.950 0.955 0.770 0.945 CI length 0.3874 0.4292 0.3815 0.4327 0.3890 0.4403 0.3728 0.4261 (D) Both models are misspecified Bias |$-$|0.0547 |$-$|0.1201 |$-$|0.1873 |$-$|0.1005 |$-$|0.0243 |$-$|0.0599 |$-$|0.1393 |$-$|0.0518 Std err 0.1106 0.1038 0.0903 0.0950 0.0969 0.1060 0.0921 0.0965 RMSE 0.1234 0.1588 0.2079 0.1383 0.0999 0.1218 0.1670 0.1095 Coverage 0.890 0.815 0.775 0.875 0.940 0.940 0.720 0.950 CI length 0.3994 0.4586 0.3790 0.4333 0.3948 0.4545 0.3781 0.4334 HD-CBPS, the proposed method; RB, approximate residual balancing; AIPW, regularized augmented inverse probability weighting estimator; D-SELECT, double selection; Std err, standard error; RMSE, standardized root-mean-squared error; Coverage, coverage probability of 95% confidence intervals; CI length, length of 95% confidence intervals. Open in new tab Table 1. Bias, standard error, standardized root-mean-squared error, coverage probability of 95% confidence intervals and length of 95% confidence intervals for the estimation of the average treatment effect. Four methods are compared: the proposed method, approximate residual balancing, regularized augmented inverse probability weighting estimator and double selection |$n=500$| . |$d=1000$| . |$d=2000$| . . HD-CBPS . RB . AIPW . D-SELECT . HD-CBPS . RB . AIPW . D-SELECT . (A) Both models are correct Bias |$-$|0.0026 |$-$|0.0017 |$-$|0.0498 |$-$|0.0910 |$-$|0.0595 |$-$|0.0580 |$-$|0.1200 |$-$|0.0397 Std err 0.0936 0.1074 0.0926 0.0979 0.1061 0.1155 0.1011 0.1279 RMSE 0.0936 0.1074 0.1052 0.1337 0.1216 0.1292 0.1569 0.1334 Coverage 0.965 0.930 0.915 0.890 0.910 0.910 0.855 0.945 CI length 0.3867 0.4231 0.3775 0.4294 0.3862 0.4359 0.3731 0.5034 (B) Propensity score model is misspecified Bias |$-$|0.0120 |$-$|0.0303 |$-$|0.1078 |$-$|0.0782 |$-$|0.0446 |$-$|0.0685 |$-$|0.1234 |$-$|0.0357 Std err 0.0984 0.1153 0.0963 0.1034 0.0924 0.1041 0.0921 0.1214 RMSE 0.0991 0.1193 0.1446 0.1296 0.1025 0.1246 0.1540 0.1265 Coverage 0.965 0.945 0.815 0.905 0.930 0.910 0.740 0.940 CI length 0.3864 0.4431 0.3732 0.4227 0.3839 0.4382 0.3702 0.5023 (C) Outcome model is misspecified Bias |$-$|0.0034 |$-$|0.0321 |$-$|0.0562 |$-$|0.0991 |$-$|0.0317 |$-$|0.0572 |$-$|0.1215 |$-$|0.0443 Std err 0.0917 0.0982 0.0914 0.1023 0.0944 0.0992 0.0921 0.1026 RMSE 0.0917 0.1033 0.1072 0.1424 0.0995 0.1145 0.1525 0.1118 Coverage 0.960 0.960 0.905 0.845 0.950 0.955 0.770 0.945 CI length 0.3874 0.4292 0.3815 0.4327 0.3890 0.4403 0.3728 0.4261 (D) Both models are misspecified Bias |$-$|0.0547 |$-$|0.1201 |$-$|0.1873 |$-$|0.1005 |$-$|0.0243 |$-$|0.0599 |$-$|0.1393 |$-$|0.0518 Std err 0.1106 0.1038 0.0903 0.0950 0.0969 0.1060 0.0921 0.0965 RMSE 0.1234 0.1588 0.2079 0.1383 0.0999 0.1218 0.1670 0.1095 Coverage 0.890 0.815 0.775 0.875 0.940 0.940 0.720 0.950 CI length 0.3994 0.4586 0.3790 0.4333 0.3948 0.4545 0.3781 0.4334 |$n=500$| . |$d=1000$| . |$d=2000$| . . HD-CBPS . RB . AIPW . D-SELECT . HD-CBPS . RB . AIPW . D-SELECT . (A) Both models are correct Bias |$-$|0.0026 |$-$|0.0017 |$-$|0.0498 |$-$|0.0910 |$-$|0.0595 |$-$|0.0580 |$-$|0.1200 |$-$|0.0397 Std err 0.0936 0.1074 0.0926 0.0979 0.1061 0.1155 0.1011 0.1279 RMSE 0.0936 0.1074 0.1052 0.1337 0.1216 0.1292 0.1569 0.1334 Coverage 0.965 0.930 0.915 0.890 0.910 0.910 0.855 0.945 CI length 0.3867 0.4231 0.3775 0.4294 0.3862 0.4359 0.3731 0.5034 (B) Propensity score model is misspecified Bias |$-$|0.0120 |$-$|0.0303 |$-$|0.1078 |$-$|0.0782 |$-$|0.0446 |$-$|0.0685 |$-$|0.1234 |$-$|0.0357 Std err 0.0984 0.1153 0.0963 0.1034 0.0924 0.1041 0.0921 0.1214 RMSE 0.0991 0.1193 0.1446 0.1296 0.1025 0.1246 0.1540 0.1265 Coverage 0.965 0.945 0.815 0.905 0.930 0.910 0.740 0.940 CI length 0.3864 0.4431 0.3732 0.4227 0.3839 0.4382 0.3702 0.5023 (C) Outcome model is misspecified Bias |$-$|0.0034 |$-$|0.0321 |$-$|0.0562 |$-$|0.0991 |$-$|0.0317 |$-$|0.0572 |$-$|0.1215 |$-$|0.0443 Std err 0.0917 0.0982 0.0914 0.1023 0.0944 0.0992 0.0921 0.1026 RMSE 0.0917 0.1033 0.1072 0.1424 0.0995 0.1145 0.1525 0.1118 Coverage 0.960 0.960 0.905 0.845 0.950 0.955 0.770 0.945 CI length 0.3874 0.4292 0.3815 0.4327 0.3890 0.4403 0.3728 0.4261 (D) Both models are misspecified Bias |$-$|0.0547 |$-$|0.1201 |$-$|0.1873 |$-$|0.1005 |$-$|0.0243 |$-$|0.0599 |$-$|0.1393 |$-$|0.0518 Std err 0.1106 0.1038 0.0903 0.0950 0.0969 0.1060 0.0921 0.0965 RMSE 0.1234 0.1588 0.2079 0.1383 0.0999 0.1218 0.1670 0.1095 Coverage 0.890 0.815 0.775 0.875 0.940 0.940 0.720 0.950 CI length 0.3994 0.4586 0.3790 0.4333 0.3948 0.4545 0.3781 0.4334 HD-CBPS, the proposed method; RB, approximate residual balancing; AIPW, regularized augmented inverse probability weighting estimator; D-SELECT, double selection; Std err, standard error; RMSE, standardized root-mean-squared error; Coverage, coverage probability of 95% confidence intervals; CI length, length of 95% confidence intervals. Open in new tab We also consider the simulation with logistic outcome models. When the outcome variable is binary, the approximate residual balancing method is not directly applicable. Thus, we only compare our method with the regularized augmented inverse probability weighting and the double selection method. The simulation results illustrate the same conclusion. Due to space limitations, we defer the details to the Supplementary Material. In summary, the proposed estimator tends to have a smaller mean-squared error, is more robust to model misspecification, and exhibits accurate coverage probability in finite samples. Our results are consistent with the empirical findings of Imai & Ratkovic (2014) and Fan et al. (2016) that covariate balancing tends to outperform the augmented inverse probability weighting estimator in low-dimensional settings. Our simulation studies imply that the same conclusion appears to hold in high-dimensional settings. 6. Empirical illustration For an empirical illustration we consider a dataset based on the Political Socialization Panel Study, which was originally analysed by Kam & Palmer (2008). One purpose of this study was to understand the effect of higher education on political participation. The dataset consists of 1051 randomly selected high school seniors in the class of 1965. The information about each sample was collected via in-person interviews in the first wave of the study, which we treat as pretreatment covariates. The second wave of the study, conducted in 1973, collected the outcome variable, political participation and the dichotomous treatment variable, college attendance. For the purpose of comparison, we follow the original study and use 81 pretreatment covariates, including gender, race, club participation and academic performance. Since many of the covariates are categorical variables with more than two levels, we create an indicator variable that represents each level. Therefore, a total of 204 pretreatment variables are used in the propensity score and outcome models. The outcome variable represents an index of adult political participation, which is equal to the sum of eight acts including the turnout in the 1972 presidential election, attending campaign rallies, making a donation to a campaign, and displaying a campaign button and bumper sticker. Since this variable takes an integer value ranging from zero to eight, we use the binomial logistic regression for the outcome model. The propensity score model is assumed to be the logistic regression. We then estimate both the average treatment effect and average treatment effect for the treated college attendance on political participation. The number of treated observations is 675. We apply seven methods: the proposed methodology, the regularized augmented inverse probability weighting method (Farrell, 2015), the covariate balancing propensity score method (Imai & Ratkovic, 2014), the augmented inverse probability weighting method without regularization (Robins et al., 1994), the inverse probability weighting estimator with regularized logistic regression, the approximate residual balancing method and the double selection method. The estimation procedures for the first two methods are identical to those described in the simulation studies. For the covariate balancing propensity score methodology, it is designed for the linear outcome model, which does not provide an ideal balance of pre-treatment variables. In addition, we use the bootstrap method to approximate the standard error of the estimator based on the covariate balancing propensity score and inverse probability weighting methods. The results are shown in Table 2. All methods except the augmented inverse probability weighting estimator without regularization imply that the overall average treatment effect of college education on political participation is positive and statistically significant. The average treatment effect estimates and their associated standard errors based on the regularized methods are similar to each other. These estimates are, however, smaller than that of the covariate balancing propensity score method. Importantly, our estimator has much smaller standard errors than the covariate balancing propensity score method. Table 2. The estimated average effects of college attendance on political participation. The estimates based on the proposed methodology are compared with those of the covariate balancing propensity score estimator, the regularized augmented inverse probability weighting estimator, the augmented inverse probability weighting estimator without regularization, the inverse probability weighting estimator with the regularized logistic regression, approximate residual balancing and double selection. Standard errors appear in parentheses . HD-CBPS . CBPS . AIPW . AIPW-NR . IPW . RB . D-SELECT . Overall (ATE) 0.8293 1.0163 0.8796 0.4904 1.0666 0.6706 0.7859 (0.1247) (0.2380) (0.1043) (0.6009) (0.1588) (0.1643) (0.2682) Overall (ATT) 0.8439 1.1232 0.9790 0.4761 1.0023 0.7348 1.780 (0.1420) (0.3094) (0.1617) (1.2614) (0.1272) (0.2237) (0.2529) Whites (ATE) 0.8445 NA 0.8977 0.1205 1.1371 0.6971 0.8004 (0.1279) (0.1089) (9.4522) (0.1548) (0.1786) (0.2298) . HD-CBPS . CBPS . AIPW . AIPW-NR . IPW . RB . D-SELECT . Overall (ATE) 0.8293 1.0163 0.8796 0.4904 1.0666 0.6706 0.7859 (0.1247) (0.2380) (0.1043) (0.6009) (0.1588) (0.1643) (0.2682) Overall (ATT) 0.8439 1.1232 0.9790 0.4761 1.0023 0.7348 1.780 (0.1420) (0.3094) (0.1617) (1.2614) (0.1272) (0.2237) (0.2529) Whites (ATE) 0.8445 NA 0.8977 0.1205 1.1371 0.6971 0.8004 (0.1279) (0.1089) (9.4522) (0.1548) (0.1786) (0.2298) ATE and ATT, the estimated average effects; HD-CBPS, the estimates based on the proposed methodology; CBPS, the covariate balancing propensity score estimator; AIPW, the regularized augmented inverse probability weighting estimator; AIPW-NR, the augmented inverse probability weighting estimator without regularization; IPW, the inverse probability weighting estimator with the regularized logistic regression; RB, approximate residual balancing; D-SELECT, double selection. Open in new tab Table 2. The estimated average effects of college attendance on political participation. The estimates based on the proposed methodology are compared with those of the covariate balancing propensity score estimator, the regularized augmented inverse probability weighting estimator, the augmented inverse probability weighting estimator without regularization, the inverse probability weighting estimator with the regularized logistic regression, approximate residual balancing and double selection. Standard errors appear in parentheses . HD-CBPS . CBPS . AIPW . AIPW-NR . IPW . RB . D-SELECT . Overall (ATE) 0.8293 1.0163 0.8796 0.4904 1.0666 0.6706 0.7859 (0.1247) (0.2380) (0.1043) (0.6009) (0.1588) (0.1643) (0.2682) Overall (ATT) 0.8439 1.1232 0.9790 0.4761 1.0023 0.7348 1.780 (0.1420) (0.3094) (0.1617) (1.2614) (0.1272) (0.2237) (0.2529) Whites (ATE) 0.8445 NA 0.8977 0.1205 1.1371 0.6971 0.8004 (0.1279) (0.1089) (9.4522) (0.1548) (0.1786) (0.2298) . HD-CBPS . CBPS . AIPW . AIPW-NR . IPW . RB . D-SELECT . Overall (ATE) 0.8293 1.0163 0.8796 0.4904 1.0666 0.6706 0.7859 (0.1247) (0.2380) (0.1043) (0.6009) (0.1588) (0.1643) (0.2682) Overall (ATT) 0.8439 1.1232 0.9790 0.4761 1.0023 0.7348 1.780 (0.1420) (0.3094) (0.1617) (1.2614) (0.1272) (0.2237) (0.2529) Whites (ATE) 0.8445 NA 0.8977 0.1205 1.1371 0.6971 0.8004 (0.1279) (0.1089) (9.4522) (0.1548) (0.1786) (0.2298) ATE and ATT, the estimated average effects; HD-CBPS, the estimates based on the proposed methodology; CBPS, the covariate balancing propensity score estimator; AIPW, the regularized augmented inverse probability weighting estimator; AIPW-NR, the augmented inverse probability weighting estimator without regularization; IPW, the inverse probability weighting estimator with the regularized logistic regression; RB, approximate residual balancing; D-SELECT, double selection. Open in new tab There are at least two reasons for this difference in standard errors. First, as shown in § 4, the proposed methodology uses a different covariate balancing equation than the original covariate balancing propensity score method when the outcome model is nonlinear, achieving the semiparametric efficiency bound. Second, the original covariate balancing propensity score methodology tends to be unstable when balancing a large number of covariates, 204 in this case. Thus, the proposed method improves the existing covariate balancing methods when the outcome model belongs to the class of generalized linear models and the number of covariates is large. We also apply the same methods to the subsample of whites separately. Among a total of 1051 respondents there are 966 white respondents. In this case, the covariate balancing propensity score method does not converge, and therefore the estimate is unavailable. The results appear in the last row of Table 2. Again, the estimates of the regularized methods are similar, as are the standard errors. However, the augmented inverse probability weighting method without regularization yields a large variance mainly because the maximum likelihood estimate of propensity score tends to be unstable when the number of covariates is large. 7. Discussion There are several future directions that are worthy of further investigation. First, it is important to extend these high-dimensional causal inference methods to nonbinary treatment regimes, including continuous treatment, individualized treatment rules and dynamic treatment regimes (Imai & Ratkovic, 2015; Fong et al., 2018; Zhao et al., 2019). Second, we plan to further study the effect of tuning parameters on statistical inference. In numerical experiments, the tuning parameters are chosen by cross-validation, which leads to reasonable finite sample results. Based on the sensitivity analysis, the results appear to be stable with respect to a small perturbation of the tuning parameters. However, one challenge is to formally justify the validity of the inference based on the cross-validated estimators. The recent work by Chetverikov et al. (2020) provides some promising results along this line. Further theoretical development is needed to address this important problem. Finally, it is of importance to consider how to relax the sparsity assumptions. One of the most recent works in this area by Bradic et al. (2019) proposes an estimator that is rate/sparsity doubly robust: the estimator remains root-|$n$| consistent even if either the propensity score model or the outcome model is nonsparse so long as the other model is sufficiently sparse; see also Smucler et al. (2019). This contrasts with the double robustness of our estimator, which addresses possible model misspecification under the sparsity assumption. Acknowledgement This work was partially supported by the U.S. National Science Foundation. The authors are grateful to the editor, an associate editor and three reviewers for helpful comments, and to Jamie Robins for insightful discussions. Imai is also affiliated with the Department of Government at Harvard University. Supplementary material Supplementary material available at Biometrika online includes proofs of the theoretical results, additional simulation results and further technical details. References Athey, S. , Imbens, G. W. & Wager, S. ( 2018 ). Approximate residual balancing: Debiased inference of average treatment effects in high dimensions . J. R. Statist. Soc. B 80 , 597 – 623 . Google Scholar Crossref Search ADS WorldCat Belloni, A. & Chernozhukov, V. ( 2013 ). Least squares after model selection in high-dimensional sparse models . Bernoulli 19 , 521 – 47 . Google Scholar Crossref Search ADS WorldCat Belloni, A. , Chernozhukov, V., Fernández-Val, I. & Hansen, C. ( 2017 ). Program evaluation and causal inference with high-dimensional data . Econometrica 85 , 233 – 98 . Google Scholar Crossref Search ADS WorldCat Belloni, A. , Chernozhukov, V. & Hansen, C. ( 2014 ). Inference on treatment effects after selection among high-dimensional controls . Rev. Econ. Studies 81 , 608 – 50 . Google Scholar Crossref Search ADS WorldCat Belloni, A. , Chernozhukov, V. & Wei, Y. ( 2016 ). Post-selection inference for generalized linear models with many controls . J. Bus. Econ. Statist. 34 , 606 – 19 . Google Scholar Crossref Search ADS WorldCat Bickel, P. J. , Ritov, Y. & Tsybakov, A. B. ( 2009 ). Simultaneous analysis of lasso and Dantzig selector . Ann. Statist. 37 , 1705 – 32 . Google Scholar Crossref Search ADS WorldCat Bradic, J. , Wager, S. & Zhu, Y. ( 2019 ). Sparsity double robust inference of average treatment effects . arXiv:1905.00744. Bühlmann, P. & van de Geer, S. ( 2015 ). High-dimensional inference in misspecified linear models . Electron. J. Statist. 9 , 1449 – 73 . Google Scholar Crossref Search ADS WorldCat Cai, T. T. & Guo, Z. ( 2017 ). Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity . Ann. Statist. 45 , 615 – 46 . Google Scholar Crossref Search ADS WorldCat Chan, K. C. G. , Yam, S. C. P. & Zhang, Z. ( 2016 ). Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting . J. R. Statist. Soc. B 78 , 673 – 700 . Google Scholar Crossref Search ADS WorldCat Chernozhukov, V. , Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. & Robins, J. ( 2018 ). Double/debiased machine learning for treatment and structural parameters . Economet. J. 21 , 733 – 50 . Google Scholar Crossref Search ADS WorldCat Chetverikov, D. , Liao, Z. & Chernozhukov, V. ( 2020 ). On cross-validated lasso . arXiv:1605.02214v6. OpenURL Placeholder Text WorldCat Dukes, O. , Avagyan, V. & Vansteelandt, S. ( 2019 ). High-dimensional doubly robust tests for regression parameters . arXiv:1805.06714v3. Fan, J. , Imai, K., Liu, H., Ning, Y. & Yang, X. ( 2016 ). Improving covariate balancing propensity score: A doubly robust and efficient approach . Technical report , Princeton University . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Fan, J. & Li, R. ( 2001 ). Variable selection via nonconcave penalized likelihood and its oracle properties . J. Am. Statist. Assoc. 96 , 1348 – 60 . Google Scholar Crossref Search ADS WorldCat Farrell, M. H. ( 2015 ). Robust inference on average treatment effects with possibly more covariates than observations . J. Economet. 189 , 1 – 23 . Google Scholar Crossref Search ADS WorldCat Fong, C. , Hazlett, C. & Imai, K. ( 2018 ). Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements . Ann. Appl. Statist. 12 , 156 – 77 . Google Scholar Crossref Search ADS WorldCat Graham, B. S. , de Xavier Pinto, C. C. & Egel, D. ( 2012 ). Inverse probability tilting for moment condition models with missing data . Rev. Econ. Studies 79 , 1053 – 79 . Google Scholar Crossref Search ADS WorldCat Hahn, J. ( 1998 ). On the role of the propensity score in efficient semiparametric estimation of average treatment effects . Econometrica 66 , 315 – 31 . Google Scholar Crossref Search ADS WorldCat Hahn, J. ( 2004 ). Functional restriction and efficiency in causal inference . Rev. Econ. Statist. 86 , 73 – 6 . Google Scholar Crossref Search ADS WorldCat Hainmueller, J. ( 2012 ). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies . Polit. Anal. 20 , 25 – 46 . Google Scholar Crossref Search ADS WorldCat Hirshberg, D. A. & Wager, S. ( 2019 ). Augmented minimax linear estimation . arXiv:1712.00038v5. Imai, K. & Ratkovic, M. ( 2014 ). Covariate balancing propensity score . J. R. Statist. Soc. B 76 , 243 – 63 . Google Scholar Crossref Search ADS WorldCat Imai, K. & Ratkovic, M. ( 2015 ). Robust estimation of inverse probability weights for marginal structural models . J. Am. Statist. Assoc. 110 , 1013 – 23 . Google Scholar Crossref Search ADS WorldCat Javanmard, A. & Montanari, A. ( 2014 ). Confidence intervals and hypothesis testing for high-dimensional regression . J. Mach. Learn. Res. 15 , 2869 – 2909 . OpenURL Placeholder Text WorldCat Kam, C. D. & Palmer, C. L. ( 2008 ). Reconsidering the effects of education on political participation . J. Polit. 70 , 612 – 31 . Google Scholar Crossref Search ADS WorldCat Kang, J. D. & Schafer, J. L. ( 2007 ). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data . Statist. Sci. 22 , 523 – 39 . Google Scholar Crossref Search ADS WorldCat Newey, W. K. & Robins, J. R. ( 2018 ). Cross-fitting and fast remainder rates for semiparametric estimation . arXiv:1801.09138. Neykov, M. , Ning, Y., Liu, J. S. & Liu, H. ( 2018 ). A unified theory of confidence regions and testing for high-dimensional estimating equations . Statist. Sci. 33 , 427 – 43 . Google Scholar Crossref Search ADS WorldCat Ning, Y. & Liu, H. ( 2017 ). A general theory of hypothesis tests and confidence regions for sparse high-dimensional models . Ann. Statist. 45 , 158 – 95 . Google Scholar Crossref Search ADS WorldCat Ning, Y. , Zhao, T. & Liu, H. ( 2017 ). A likelihood ratio framework for high-dimensional semiparametric regression . Ann. Statist. 45 , 2299 – 327 . Google Scholar Crossref Search ADS WorldCat R Development Core Team ( 2020 ). R:A Language and Environment for Statistical Computing . Vienna, Austria: R Foundation for Statistical Computing . ISBN 3-900051-07-0, http://www.R-project.org. Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Robins, J. , Sued, M., Lei-Gomez, Q. & Rotnitzky, A. ( 2007 ). Comment: Performance of double-robust estimators when inverse probability weights are highly variable . Statist. Sci. 22 , 544 – 59 . Google Scholar Crossref Search ADS WorldCat Robins, J. M. , Rotnitzky, A. & Zhao, L. P. ( 1994 ). Estimation of regression coefficients when some regressors are not always observed . J. Am. Statist. Assoc. 89 , 846 – 66 . Google Scholar Crossref Search ADS WorldCat Rosenbaum, P. R. & Rubin, D. B. ( 1983 ). The central role of the propensity score in observational studies for causal effects . Biometrika 70 , 41 – 55 . Google Scholar Crossref Search ADS WorldCat Rubin, D. B. ( 1990 ). Comments on “On the application of probability theory to agricultural experiments. Essay on principles. Section 9” by J. Splawa-Neyman, translated from the Polish and edited by D. M. Dabrowska and T. P. Speed . Statist. Sci. 5 , 472 – 80 . Google Scholar Crossref Search ADS WorldCat Rubin, D. B. ( 2008 ). For objective causal inference, design trumps analysis . Ann. Appl. Statist. 2 , 808 – 40 . Google Scholar Crossref Search ADS WorldCat Schneeweiss, S. , Rassen, J. A., Glynn, R. J., Avorn, J., Mogun, H. & Brookhart, M. A. ( 2009 ). High-dimensional propensity score adjustment in studies of treatment effects using health care claims data . Epidemiology 20 , 512 . Google Scholar Crossref Search ADS PubMed WorldCat Smucler, E. , Rotnitzky, A. & Robins, J. M. ( 2019 ). A unifying approach for doubly-robust |$\ell_1$| regularized estimation of causal contrasts . arXiv:1904.03737v3. OpenURL Placeholder Text WorldCat Tan, Z. ( 2017 ). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data . arXiv:1710.08074. Tan, Z. ( 2018 ). Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data . arXiv:1801.09817. Tibshirani, R. ( 1996 ). Regression shrinkage and selection via the lasso . J. R. Statist. Soc. B 58 , 267 – 88 . OpenURL Placeholder Text WorldCat Van de Geer, S. , Bühlmann, P., Ritov, Y. & Dezeure, R. ( 2014 ). On asymptotically optimal confidence regions and tests for high-dimensional models . Ann. Statist. 42 , 1166 – 202 . Google Scholar Crossref Search ADS WorldCat Wedderburn, R. W. ( 1974 ). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method . Biometrika 61 , 439 – 47 . OpenURL Placeholder Text WorldCat Zhang, C.-H. & Zhang, S. S. ( 2014 ). Confidence intervals for low-dimensional parameters in high-dimensional linear models . J. R. Statist. Soc. B 76 , 217 – 42 . Google Scholar Crossref Search ADS WorldCat Zhao, Q. ( 2019 ). Covariate balancing propensity score by tailored loss functions . Ann. Statist. 47 , 965 – 93 . Google Scholar Crossref Search ADS WorldCat Zhao, Y.-Q. , Laber, E. B., Ning, Y., Saha, S. & Sands, B. E. ( 2019 ). Efficient augmentation and relaxation learning for individualized treatment rules using observational data . J. Mach. Learn. Res. 20 , 1 – 23 . OpenURL Placeholder Text WorldCat Zheng, W. & van der Laan, M. J. ( 2011 ). Cross-validated targeted minimum-loss-based estimation . In Targeted Learning , pp. 459 – 74 . New York : Springer . Google Scholar Crossref Search ADS Google Scholar Google Preview WorldCat COPAC Zhu, Y. & Bradic, J. ( 2018 ). Linear hypothesis testing in dense high-dimensional linear models . J. Am. Statist. Assoc. 113 , 1583 – 600 . Google Scholar Crossref Search ADS WorldCat Zubizarreta, J. R. ( 2015 ). Stable weights that balance covariates for estimation with incomplete outcome data . J. Am. Statist. Assoc. 110 , 910 – 22 . Google Scholar Crossref Search ADS WorldCat © 2020 Biometrika Trust This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

journal article

LitStream Collection

Sparse semiparametric canonical correlation analysis for data of mixed types

Yoon,, Grace;Carroll, Raymond, J;Gaynanova,, Irina

2020 Biometrika

doi: 10.1093/biomet/asaa007pmid: 34621080

Summary Canonical correlation analysis investigates linear relationships between two sets of variables, but it often works poorly on modern datasets because of high dimensionality and mixed data types such as continuous, binary and zero-inflated. To overcome these challenges, we propose a semiparametric approach to sparse canonical correlation analysis based on the Gaussian copula. The main result of this paper is a truncated latent Gaussian copula model for data with excess zeros, which allows us to derive a rank-based estimator of the latent correlation matrix for mixed variable types without estimation of marginal transformation functions. The resulting canonical correlation analysis method works well in high-dimensional settings, as demonstrated via numerical studies, and when applied to the analysis of association between gene expression and microRNA data from breast cancer patients. 1. Introduction Canonical correlation analysis investigates linear associations between two sets of variables and is widely used in various fields, including the biomedical sciences, imaging and genomics (Hardoon et al., 2004; Chi et al., 2013; Safo et al., 2018). However, sample canonical correlation analysis often performs poorly because of two main challenges: high dimensionality and nonnormality of the data. In high-dimensional settings, sample canonical correlation analysis is known to overfit the data because of the singularity of the sample covariance matrices (Hardoon et al., 2004; Guo et al., 2016). Additional regularization is often used to deal with this problem. González et al. (2008) used ridge regularization of sample covariance matrices to avoid singularity, while more recent methods focus on sparsity regularization of canonical vectors (Parkhomenko et al., 2009; Witten et al., 2009; Chen & Liu, 2011; Chi et al., 2013; Cruz-Cano & Lee, 2014; Gao et al., 2015; Wilms & Croux, 2015; Safo et al., 2018). At the same time, with the advancement in technology, it is common for data of different types to be collected. For example, the Cancer Genome Atlas Project contains matched data of mixed types, such as gene expression (continuous), mutation (binary) and microRNA (count) data. While regularized canonical correlation methods work well for Gaussian data, they are still based on the sample covariance matrix and therefore not appropriate for analysis in the presence of binary data or data with excess zero values. Several approaches have been proposed to address the nonnormality of the data. Some are completely nonparametric approaches, such as kernel canonical correlation analysis (Hardoon et al., 2004). There are also parametric approaches that build upon a probabilistic interpretation of Bach & Jordan (2005). For example, Zoh et al. (2016) developed probabilistic canonical correlation analysis for count data based on the Poisson distribution. More recently, Agniel & Cai (2017) used a normal semiparametric transformation model to analyse mixed types of variables; however, their method requires estimation of marginal transformation functions via nonparametric maximum likelihood. In summary, significant progress has been made in developing regularized variants of sample canonical correlation analysis that work well in high-dimensional settings; however, these approaches are not suitable for mixed data types. At the same time, several methods have been proposed to account for nonnormality of the data, but they are not designed for high-dimensional settings. More importantly, to the best of our knowledge none of the existing methods explicitly addresses the case of zero-inflated measurements, which, for example, are common in microRNA and microbiome abundance data. To bridge this major gap, we propose a semiparametric approach to sparse canonical correlation analysis which allows us to handle high-dimensional data of mixed types via a common latent Gaussian copula framework. Our work has three main contributions. First, we model the zeros in the data as being observed due to truncation of an underlying latent continuous variable, and we define a corresponding truncated Gaussian copula model. We derive explicit formulas for the bridge functions that connect the Kendall’s |$\tau$| of the observed data to the latent correlation matrix for different combinations of continuous, binary and truncated data types, and use these formulas to construct a rank-based estimator of the latent correlation matrix for the mixed data. Fan et al. (2017) used a similar bridge function approach in the context of graphical models, but they did not consider the truncated variable type. The latter requires derivation of new bridge functions, and those derivations are considerably more involved than the corresponding derivations for the continuous and binary cases. A significant advantage of the bridge function technique is that it allows one to estimate the latent correlation structure of a Gaussian copula without estimating marginal transformation functions, in contrast to the method of Agniel & Cai (2017). Second, we use the derived rank-based estimator instead of the sample correlation matrix within the sparse canonical correlation analysis framework motivated by Chi et al. (2013) and Wilms & Croux (2015). This allows us to take into account the dataset-specific correlation structure in addition to the cross-correlation structure. In contrast, Parkhomenko et al. (2009) and Witten et al. (2009) treated the variables within each dataset as uncorrelated. We develop an efficient optimization algorithm to solve the corresponding problem. Finally, we propose two types of Bayesian information criterion, bic, for tuning parameter selection, which yield significant computational savings compared to commonly used cross-validation and permutation techniques (Witten & Tibshirani, 2009). Wilms & Croux (2015) also used bic in the canonical correlation analysis context, but they proposed only one criterion. Our two criteria correspond to the cases of the error variance being either known or unknown. We have found in our numerical studies that both criteria are competitive, but one works better for variable selection while the other works better for prediction. 2. Background 2.1. Canonical correlation analysis In this section we review classical canonical correlation analysis and its sparse alternatives. Given two random vectors |${\rm X}_1\in \mathbb{R}^{p_1}$| and |${\rm X}_2\in \mathbb{R}^{p_2}$|⁠, let |$\Sigma_1 = {\rm cov} ({\rm X}_1)$|⁠, |$\Sigma_2 = {\rm cov}({\rm X}_2)$| and |$\Sigma_{12} = {\rm cov} ({\rm X}_1, {\rm X}_2)$|⁠. Population canonical correlation analysis (Hotelling, 1936) seeks linear combinations |$w_1^{{ {\rm \scriptscriptstyle T} }}{\rm X}_1$| and |$w_2^{{ {\rm \scriptscriptstyle T} }}{\rm X}_2$| with maximal correlation; that is, $$\begin{equation}\label{eq:pCCA} \mathop{\rm maximize}\limits_{w_1,w_2} \:\bigl\{ w_1^{{ {\rm \scriptscriptstyle T} }}\Sigma_{12}w_2 \bigr\} \quad \mbox{subject to}\quad w_1^{{ {\rm \scriptscriptstyle T} }}\Sigma_1 w_1=1, \quad w_2^{{ {\rm \scriptscriptstyle T} }}\Sigma_2 w_2=1. \end{equation}$$(1) Problem (1) has a closed-form solution via the singular value decomposition of |$\Sigma_{1}^{-1/2} \Sigma_{12} \Sigma_{2}^{-1/2}$|⁠. Given the first pair of singular vectors |$(u,v)$|⁠, the solutions to (1) can be expressed as |$w_1 = \Sigma_1^{-1/2}u$| and |$w_2 = \Sigma_2^{-1/2}v$|⁠. Sample canonical correlation analysis replaces |$\Sigma_1$|⁠, |$\Sigma_2$| and |$\Sigma_{12}$| in (1) with corresponding sample covariance matrices |$S_1$|⁠, |$S_2$| and |$S_{12}$|⁠. In high-dimensional settings, where the sample size is small relative to the number of variables, |$S_1$| and |$S_2$| are singular, leading to nonuniqueness of solutions and poor performance due to overfitting. A common way of circumventing this difficulty is to consider sparse regularization of |$w_1$| and |$w_2$| via the addition of an |$\ell_1$| penalty in the objective function of (1) (Parkhomenko et al., 2009; Witten et al., 2009; Chi et al., 2013; Wilms & Croux, 2015). Sparse canonical correlation analysis is thus formulated as $$\begin{equation}\label{eq:sparseCCA} \mathop{\rm maximize}\limits_{w_1,w_2} \:\bigl\{ w_1^{{ {\rm \scriptscriptstyle T} }}S_{12}w_2-\lambda_1\|w_1\|_1-\lambda_2\|w_2\|_1 \bigr\} \quad \mbox{subject to}\quad w_1^{{ {\rm \scriptscriptstyle T} }}S_1 w_1\le1, \quad w_2^{{ {\rm \scriptscriptstyle T} }}S_2 w_2\le1. \end{equation}$$(2) Besides adding |$\ell_1$| penalties, the equality constraints in (1) are replaced with inequality constraints which define convex sets. This generalization is possible because nonzero solutions to (2) satisfy the constraints with equality; see Proposition 1 below. While problem (2) works well in high-dimensional settings, it still relies on sample covariance matrices and is therefore not well suited to skewed or noncontinuous data, such as binary or zero-inflated data. We next review the Gaussian copula models that we will use to deal with these challenges. 2.2. Latent Gaussian copula model for mixed data In this section we review the Gaussian copula model of Liu et al. (2009) and its extension to mixed continuous and binary data due to Fan et al. (2017). Definition 1 (Gaussian copula model). A random vector |${\rm X}=(X_1,\ldots,X_p)^{{ {\rm \scriptscriptstyle T} }}$| satisfies a Gaussian copula model if there exists a set of monotonically increasing transformations |$f=(f_j)^p_{j=1}$| such that |$f({\rm X}) = \{f_1(X_1),\ldots,f_p(X_p)\}^{{ {\rm \scriptscriptstyle T} }}\sim {{N}}_p(0,\Sigma)$| with |$\Sigma_{jj}=1$| for all |$j$|⁠. We denote this by |${\rm X} \sim {\rm NPN}(0, \Sigma, f)$|. Definition 2 (Latent Gaussian copula model for mixed data). Let |${\rm X}_1\in \mathbb{R}^{p_1}$| be continuous, |${\rm X}_2\in \mathbb{R}^{p_2}$| be binary random vectors and |${\rm X}=({\rm X}_1, {\rm X}_2)$|⁠. Then |${\rm X}$| satisfies the latent Gaussian copula model if there exists a |$p_2$|-dimensional random vector |${\rm U}_2=(U_{p_1+1},\ldots,U_{p_1+p_2})^{{ {\rm \scriptscriptstyle T} }}$| such that |${\rm U}:=({\rm X}_1,{\rm U}_2)\sim {\rm NPN}(0,\Sigma,f)$| and |$X_j = I(U_j >C_j)$| for all |$j=p_1+1,\ldots,p_1+p_2$|⁠, where |$I(\cdot)$| is the indicator function and |${\rm C}=(C_1,\ldots,C_{p_2})$| is a vector of constants. We denote this by |${\rm X} \sim {\rm LNPN}(0,\Sigma, f, {\rm C})$|⁠, where |$\Sigma$| is the latent correlation matrix. Fan et al. (2017) considered the problem of estimating |$\Sigma$| for the latent Gaussian copula model based on Kendall’s |$\tau$|⁠. Given the observed data |$(X_{1j}, X_{1k}), \ldots, (X_{nj}, X_{nk}) $| for variables |$X_j$| and |$X_k$|⁠, Kendall’s |$\tau$| is defined as $$\hat\tau_{jk} = \dfrac{2}{n(n-1)}\sum_{1\le i<i'\le n}\text{sign}(X_{ij}-X_{i'j})\,\text{sign}(X_{ik}-X_{i'k}).$$ Since |$\hat \tau_{jk}$| is invariant under monotone transformations of the data, it is well suited to capturing associations in copula models. Let |$\tau_{jk} = {E}(\hat \tau_{jk})$| be the population Kendall’s |$\tau$|⁠. The latent correlation matrix |$\Sigma$| is connected to Kendall’s |$\tau$| via the so-called bridge function |$F$|⁠, such that |$\Sigma_{jk} = F^{-1}(\tau_{jk})$| for all variables |$j$| and |$k$|⁠. Fan et al. (2017) derived an explicit form of the bridge function for continuous, binary and mixed variable pairs, which enables estimation of the latent correlation matrix via the method of moments. We summarize these results below. Theorem 1 (Fan et al., 2017). Let |${\rm X}=({\rm X}_1, {\rm X}_2)\sim{\rm LNPN}(0,\Sigma, f, {\rm C})$| with |$p_1$|-dimensional continuous |${\rm X}_1$| and |$p_2$|-dimensional binary |${\rm X}_2$|⁠. The rank-based estimator of |$\Sigma$| is the symmetric matrix |$\hat R$| with |$\hat R_{jj}=1$| and |$\hat R_{jk} = \hat R_{kj}=F_{jk}^{-1}(\hat \tau_{jk})$| where, for |$r\in (0,1)$|⁠, $$F_{jk}(r) = \begin{cases} 2\sin^{-1}(r)/\pi, & \quad 1\leq j<k\leq p_1;\\ 2\bigl\{\Phi_2( \Delta_j, \Delta_{k}; r)-\Phi( \Delta_j)\Phi( \Delta_{k})\bigr\}, & \quad p_1+1 \leq j<k\leq p_1+p_2;\\ 4\Phi_2( \Delta_k, 0; r/\surd{2})-2\Phi( \Delta_k), &\quad 1\leq j\leq p_1,\; p_1+1\leq k\leq p_1+ p_2. \end{cases}$$fin Here |$\Delta_j = f_j(C_j)$|⁠, |$\Phi(\cdot)$| is the cumulative distribution function of the standard normal distribution, and |$\Phi_2(\cdot, \cdot\,; r)$| is the cumulative distribution function of the standard bivariate normal distribution with correlation |$r$|⁠. Remark 1. Since |$\Delta_j = f_j(C_j)$| is unknown in practice, Fan et al. (2017) proposed using a plug-in estimator from the moment equation |${E}(X_{ij})= 1- \Phi(\Delta_j)$|⁠, leading to |$\hat \Delta_j = \Phi^{-1}(1-\bar X_{j})$| where |$\bar X_{j} = \sum_{i=1}^{n}X_{ij}/n$|⁠. Fan et al. (2017) used these results in the context of Gaussian graphical models and replaced the sample covariance matrix with a rank-based estimator |$\hat R$|⁠, which allows one to use Gaussian models with skewed continuous and binary data. However, they did not consider the case of zero-inflated data, which requires formulation of a new model and derivation of new bridge functions. 3. Methodology 3.1. Truncated latent Gaussian copula model Our goal is to model zero-inflated data using latent Gaussian copula models. Two motivating examples are microRNA and microbiome data, where it is common to have a large number of zero counts. In both cases it is reasonable to assume that the observation of zeros is due to truncation of underlying latent continuous variables. More generally, one can think of zeros as representing the measurement error arising from truncation of values below a certain positive threshold. This intuition leads us to consider the following model. Definition 3 (Truncated latent Gaussian copula model). A random vector |${\rm X}=(X_1, \ldots, X_p)^{{ {\rm \scriptscriptstyle T} }}$| satisfies the truncated Gaussian copula model if there exists a |$p$|-dimensional random vector |${\rm U} = (U_1,\ldots, U_p)^{{ {\rm \scriptscriptstyle T} }} \sim {\rm NPN}(0, \Sigma, f)$| such that $$X_j = I(U_j>C_j)U_j \quad (j=1,\ldots,p),$$where |$I(\cdot)$| denotes the indicator function and |${\rm C}=(C_1,\ldots,C_p)$| is a vector of positive constants. We denote this by |$X \sim {\rm TLNPN}(0,\Sigma, f, {\rm C})$|⁠, where |$\Sigma$| is the latent correlation matrix. The method in Fan et al. (2017) allows estimation of the latent correlation matrix in the presence of mixed continuous and binary data. Our Definition 3 adds a third type of variable, which we refer to as truncated for short. To construct a rank-based estimator for |$\Sigma$| as in Theorem 1 in the presence of truncated variables, we derive an explicit form of the bridge function for all possible combinations of the data types. Throughout, we write |$\Phi(\cdot)$| for the cumulative distribution function of a standard normal distribution and |$\Phi_d(\cdots; \Sigma_d)$| for the cumulative distribution function of a standard |$d$|-variate normal distribution with correlation matrix |$\Sigma_d$|⁠. All proofs are deferred to the Supplementary Material. Theorem 2. Let |$X_j$| be truncated and |$X_k$| binary. Then |${E} (\hat \tau_{jk}) = F_{\rm TB}(\Sigma_{jk};\Delta_j, \Delta_k)$|⁠, where $$ F_{\rm TB}(\Sigma_{jk};\Delta_j, \Delta_k) = 2 \{1-\Phi(\Delta_j)\}\Phi(\Delta_k) -2 \Phi_3(-\Delta_j, \Delta_k, 0; \Sigma_{3a}) -2 \Phi_3(-\Delta_j, \Delta_k, 0; \Sigma_{3b})$$with |$\Delta_j = f_j(C_j)$|⁠, |$\Delta_k = f_k(C_k)$|⁠, $$\Sigma_{3a}=\begin{pmatrix}\! 1 & \;\;-\Sigma_{jk} & 1/\surd{2} \\ -\Sigma_{jk} & \;\;1 & \;-\Sigma_{jk}/\surd{2}\\ 1/\surd{2}\;\; & \;\;-\Sigma_{jk}/\surd{2} & 1 \!\end{pmatrix}\!, \quad \Sigma_{3b}=\begin{pmatrix}\! 1 & 0 & \;-1/\surd{2} \\ 0 & 1 &\; -\Sigma_{jk}/\surd{2}\\ -1/\surd{2}\;\; &\;\; -\Sigma_{jk}/\surd{2} & 1 \!\end{pmatrix}\!.$$ Theorem 3. Let |$X_j$| be truncated and |$X_k$| continuous. Then |${E} (\hat \tau_{jk}) = F_{\rm TC}(\Sigma_{jk};\Delta_j)$|⁠, where $$F_{\rm TC}(\Sigma_{jk};\Delta_j) = -2 \Phi_2 (-\Delta_j,0; 1/\surd{2} ) +4\Phi_3 (-\Delta_j,0,0; \Sigma_3)$$with |$\Delta_j = f_j(C_j)$| and $$\Sigma_3 = \begin{pmatrix} 1 \; & \; 1/\surd{2} \; & \;\; \Sigma_{jk}/\surd{2}\\ 1/\surd{2}\; & \; 1\; & \; \Sigma_{jk}\\ \Sigma_{jk}/\surd{2} \; & \; \Sigma_{jk} &\;1 \end{pmatrix}\!.$$ Theorem 4. Let both |$X_j$| and |$X_k$| be truncated. Then |${E} (\hat \tau_{jk}) = F_{\rm TT}(\Sigma_{jk};\Delta_j, \Delta_k)$|⁠, where $$F_{\rm TT}(\Sigma_{jk};\Delta_j, \Delta_k) = -2 \Phi_4 (-\Delta_j, -\Delta_k, 0,0; \Sigma_{4a}) + 2 \Phi_4 (-\Delta_j, -\Delta_k, 0,0; \Sigma_{4b})$$with |$\Delta_j = f_j(C_j)$|⁠, |$\Delta_k = f_k(C_k)$|⁠, $$\begin{align*} \Sigma_{4a} & = \begin{pmatrix} 1 \; & \; 0 \; &\; 1/\surd{2} \; & \; -\Sigma_{jk}/\surd{2}\\ 0\; &\; 1 \; &\; -\Sigma_{jk}/\surd{2}\; & \; 1/\surd{2}\\ 1/\surd{2}\; &\; -\Sigma_{jk}/\surd{2} \; &\; 1\; & \; -\Sigma_{jk}\\ -\Sigma_{jk}/\surd{2}\; & \; 1/\surd{2}\; &\; -\Sigma_{jk}\; &\; 1 \end{pmatrix}\!, \\ \Sigma_{4b} & = \begin{pmatrix} 1 \; & \; \Sigma_{jk}\; &\; 1/\surd{2}\; & \;\Sigma_{jk}/\surd{2}\\ \Sigma_{jk}\; &\; 1\; & \;\Sigma_{jk}/\surd{2}\; &\; 1/\surd{2}\\ 1/\surd{2}\; & \;\Sigma_{jk}/\surd{2}\; & \;1\; &\; \Sigma_{jk}\\ \Sigma_{jk}/\surd{2}\; & \;1/\surd{2}\; & \; \Sigma_{jk}\; & \;1 \end{pmatrix}\!. \end{align*}$$ We also show that the inverse bridge function exists for all of the cases. Theorem 5. For any constants |$\Delta_j$| and |$\Delta_k$|⁠, the bridge functions |$F(\Sigma_{jk})$| in Theorems 2–4 are strictly increasing in |$\Sigma_{jk} \in(-1,1)$|⁠, and so the corresponding inverse functions |$F^{-1}(\tau_{jk})$| exist. Remark 2. While the inverse functions exist, they do not have closed forms. In practice we estimate |$\hat R$| elementwise by solving |$\hat R_{jk}={\rm arg\,min}_{r}\{F(r)-\hat \tau_{jk}\}^2$|⁠. This leads to |$O(p^2)$| computations, which can be done in parallel to reduce the computational burden. Theorems 2–5 complement the results of Fan et al. (2017) summarized in Theorem 1 by adding three more cases: continuous/truncated, binary/truncated and truncated/truncated. This allows us to construct a rank-based estimator |$\hat R$| for |$\Sigma$| in the presence of mixed variables. Remark 3. Since |$\hat R$| is not guaranteed to be positive semidefinite, Fan et al. (2017) regularized |$\hat R$| by projecting it onto the cone of positive-semidefinite matrices. We follow their approach and use the nearPD function in the R (R Development Core Team, 2020) package Matrix to obtain the estimator |$\hat R_p$|⁠. Furthermore, we consider $$\begin{equation}\label{d:Rtilde} \tilde R = (1-\nu)\hat R_p + \nu I \end{equation}$$(3) with a small value of |$\nu>0$| so that |$\tilde R$| is strictly positive definite. Throughout, we fix |$\nu = 0.01$|⁠. Remark 4. As in the binary case, |$\Delta_j = f_j(C_j)$| is unknown for truncated variables. Similar to Fan et al. (2017), we use a plug-in estimator |$\hat \Delta_j$| based on the moment equation |${E} \{I(X_{ij}>0) \} = {\rm pr} (X_{j}>0) = {\rm pr} \{f_j(U_{j})>\Delta_j\} = 1-\Phi(\Delta_j)$|⁠. Let |$n_{\text{zero}}=\sum_{i=1}^{n} I(X_{ij}=0)$|⁠; then we use |$\hat{\Delta}_j = \Phi^{-1} ({n_{\text{zero}}}/{n})$|⁠. For clarity, we summarize below all the steps in the construction of our rank-based estimator |$\tilde R$| based on the observed data matrix |${\rm X}\in \mathbb{R}^{n \times p}$|⁠. Step 1. Calculate |$\hat \tau_{jk}$| for all pairs of variables |$1\le j < k \le p$|⁠. Step 2. Estimate |$\hat{\Delta}_j = \Phi^{-1} \{\sum_{i=1}^{n} I( X_{ij} \neq 0)/n \}$| for all |$j$| of truncated or binary type. Step 3. Compute |$\hat R_{jk} = F^{-1} (\hat \tau_{jk})$|⁠, where |$F$| is the bridge function chosen according to the types of the variables |$j$| and |$k$|⁠, with possible dependence on |$\hat \Delta_j$| and |$\hat \Delta_k$|⁠. Step 4. Project |$\hat R$| onto the cone of positive-semidefinite matrices to form |$\hat R_p$|⁠. Step 5. Set |$\tilde R = (1-\nu)\hat R_p + \nu I$| for small |$\nu > 0$|⁠. 3.2. Consistency of the rank-based estimator for the latent correlation matrix We next show that our proposed estimator |$\tilde R$| is consistent for |$\Sigma$|⁠. As in Fan et al. (2017), we make the following two assumptions. Assumption 1. All the elements of |$\Sigma$| satisfy |$|\Sigma_{jk}|\leq 1-\delta$| for some |$\delta > 0$|⁠. Assumption 2. All the thresholds |$\Delta_j$| satisfy |$|\Delta_j|\leq M$| for some constant |$M>0$|⁠. We first prove Lipschitz continuity of the inverse of the bridge function, |$F^{-1}(\tau_{jk})$|⁠. Theorem 6. Under Assumptions 1 and 2, for any constants |$\Delta_j$| and |$\Delta_k$|⁠, the inverses |$F^{-1}(\cdot)$| of the bridge functions in Theorems 2–4 are such that for any |$\tau_1$| and |$\tau_2$|⁠, $$\bigl|F^{-1}(\tau_1) - F^{-1}(\tau_2) \bigr| \le L |\tau_1 - \tau_2|,$$where |$L>0$| is a constant independent of |$\tau_1$|⁠, |$\tau_2$|⁠, |$\Delta_j$| and |$\Delta_k$|⁠. Fan et al. (2017) also proved Lipschitz continuity in the continuous/binary case, but their proof technique cannot be applied directly to the truncated case considered here because the bridge functions have a more complex form. Instead, we develop a new method of proof based on the multivariate chain rule, which also yields simplified proofs in the continuous/binary case. The full proof is given in the Supplementary Material. The Lipschitz continuity of the inverse bridge functions is then used to prove consistency of |$\hat R$|⁠. Theorem 7. Suppose that a random |${\rm X}=({\rm X}_1,{\rm X}_2, {\rm X}_3) \in \mathbb{R}^p$| satisfies the latent Gaussian copula model with correlation matrix |$\Sigma$|⁠, where |${\rm X}_1\in \mathbb{R}^{p_1}$| is continuous, |${\rm X}_2\in \mathbb{R}^{p_2}$| is binary and |${\rm X}_3\in \mathbb{R}^{p_3}$| is truncated, with |$p=p_1 + p_2 + p_3$|⁠. Let |$\hat R$| be the rank-based estimator for the correlation matrix |$\Sigma$| from §3.1constructed by inverting the corresponding bridge functions elementwise. Under Assumptions 1 and 2, with probability at least |$1-p^{-1}$|⁠, for some |$C>0$| that is independent of |$n$| and |$p$| we have $$\bigl\|\hat R - \Sigma\bigr\|_{\max} = \max_{j,k}\,\bigl|\hat R_{jk} - \Sigma_{jk}\bigr| \leq C(\log p/n)^{1/2}.$$ Theorem 7 states that |$\hat R$| is consistent in estimating |$\Sigma$| with respect to the supremum norm and that the consistency rate coincides, up to constants, with the rate obtained by the sample covariance matrix in the Gaussian case. In practice, we further regularize |$\hat R$| by forming |$\tilde R = (1-\nu)\hat R_p + \nu I$|⁠. By Corollary 2 in Fan et al. (2017), |$\hat R_p$| has the same consistency rate as |$\hat R$|⁠, so Theorem 7 implies the consistency of |$\tilde R$| with the same rate as long as |$\nu \le (\log p / n)^{1/2}$|⁠. 3.3. Semiparametric sparse canonical correlation analysis Our proposal is based on formulating sparse canonical correlation analysis using a latent correlation matrix from the Gaussian copula model for mixed data. At a population level, let |$\Sigma$| be the latent correlation matrix for |$({\rm X}_1,{\rm X}_2) \sim {\rm LNPN}(0,\Sigma, f, {\rm C})$|⁠, where each of |${\rm X}_1$| and |${\rm X}_2$| follows one of the three data types: continuous, binary or truncated. In § 3.1 we derived a rank-based estimator for |$\Sigma$|⁠, which we propose to use within the sparse canonical correlation analysis framework (2). Given the semiparametric estimator |$\tilde R$| in (3), we propose to find canonical vectors by solving $$\begin{equation}\label{eq:sparseCCAwithR} \mathop{\rm minimize}\limits_{w_1,w_2} \:\bigl\{-w_1^{{ {\rm \scriptscriptstyle T} }}\tilde{R}_{12}w_2+\lambda_1\|w_1\|_1+\lambda_2\|w_2\|_1\bigr\}\quad\mbox{subject to}\quad w_1^{{ {\rm \scriptscriptstyle T} }}\tilde{R}_1 w_1\le1,\quad w_2^{{ {\rm \scriptscriptstyle T} }}\tilde{R}_2 w_2\le1. \end{equation}$$(4) Remark 5. Mai & Zhang (2019) established the consistency of estimated canonical vectors from the sparse canonical correlation analysis problem (2) in the Gaussian case. Their proof relies on the supremum norm bound for the sample covariance matrix. Since Theorem 7 establishes such a bound for our rank-based estimator, these results can be directly extended to (4). Although here we focus only on estimation of the first canonical pair, the subsequent canonical pairs can be found sequentially by using a deflation scheme as follows. Let |$\tilde R_{12}^{(1)}= \tilde{R}_{12}$| and let |$(\hat{w}_{1}, \hat{w}_2)$| be the |$(k-1)$|th estimated canonical pair. To estimate the |$k$|th pair for |$k>1$|⁠, form $$\tilde{R}_{12}^{(k)}=\tilde R_{12}^{(k-1)} - (\hat w_1^{{ {\rm \scriptscriptstyle T} }}\tilde R_{12}^{(k-1)}\hat w_2)\tilde R_1\hat w_{1}\hat w_{2}^{{ {\rm \scriptscriptstyle T} }}\tilde R_2$$ and solve (4) using |$\tilde R_{12}^{(k)}$| instead of |$\tilde R_{12}$|⁠. While problem (4) is not jointly convex in |$w_1$| and |$w_2$|⁠, it is biconvex. Therefore, we propose to iteratively optimize over |$w_1$| and |$w_2$|⁠. First, consider optimizing over |$w_1$| with |$w_2$| fixed. Proposition 1. For a fixed |$w_2\in \mathbb{R}^{p_2}$|⁠, let $$\begin{equation}\label{eq:prop1_1} \begin{split} \hat w_1 =\mathop{\rm arg\,min}\limits_{w_1} &\: \bigl\{ - w_1^{{ {\rm \scriptscriptstyle T} }} \tilde R_{12} w_2 + \lambda_1\|w_1\|_1 \bigr\}\quad {subject\ to}\quad w_1^{{ {\rm \scriptscriptstyle T} }}\tilde R_1 w_1\le 1. \end{split} \end{equation}$$(5) This problem is equivalent to finding $$\begin{equation}\label{eq:prop1_2} \begin{split} \tilde{w}_1 = \mathop{\rm arg\,min}\limits_{w_1} & \:\bigl\{(1/2) w_1^{{ {\rm \scriptscriptstyle T} }} \tilde R_1 w_1 - w_1^{{ {\rm \scriptscriptstyle T} }} \tilde R_{12} w_2 + \lambda_1\|w_1\|_1 \bigr\} \end{split} \end{equation}$$(6)and then setting |$\hat{w}_1 =0$| if |$\tilde{w}_1=0$| and |$\hat{w}_1 =\tilde{w}_1/(\tilde{w}_1^{{ {\rm \scriptscriptstyle T} }}\tilde R_1\tilde{w}_1)^{1/2}$| if |$\tilde{w}_1\neq 0$|⁠. The problems (5) and (6) are both convex, but unlike (5), problem (6) is unconstrained. Furthermore, problem (6) is of the same form as the well-studied penalized lasso problem (Tibshirani, 1996), which can be solved efficiently using, for example, the coordinate-descent algorithm. Hence, the proposed optimization algorithm for (4) can be viewed as a sequence of lasso problems with rescaling. Given the value of |$w_2$| at iteration |$t$|⁠, the updates at iteration |$t+1$| have the form $$\begin{align*} &\tilde w_{1}=\mathop{\rm arg\,min}\limits_{w_{1}}\:\bigl\{(1/2){w_1^{{ {\rm \scriptscriptstyle T} }} \tilde R_1w_1}-w_1^{{ {\rm \scriptscriptstyle T} }} \tilde R_{12}w_2^{(t)}+\lambda_{1}\|w_{1}\|_{1}\bigr\},\\ &\hat w_{1}^{(t+1)}=\tilde w_{1}/(\tilde w_{1}^{{ {\rm \scriptscriptstyle T} }} \tilde R_1\tilde w_{1})^{1/2},\\ &\tilde w_{2}=\mathop{\rm arg\,min}\limits_{w_{2}}\:\bigl\{(1/2){w_2^{{ {\rm \scriptscriptstyle T} }} \tilde R_2w_2}-w_2^{{ {\rm \scriptscriptstyle T} }} \tilde R_{12}^{{ {\rm \scriptscriptstyle T} }}w_1^{(t+1)}+\lambda_{2}\|w_{2}\|_{1}\bigr\},\\ &\hat w_{2}^{(t+1)}=\tilde w_{2}/(\tilde w_{2}^{{ {\rm \scriptscriptstyle T} }} \tilde R_2\tilde w_{2})^{1/2}. \end{align*}$$ If a zero solution is obtained at any of the steps, the optimization algorithm stops, and both |$w_1$| and |$w_2$| are returned as zeros. Otherwise, the algorithm proceeds until convergence, which is guaranteed by the biconvexity of (4) (Gorski et al., 2007). We further describe a coordinate-descent algorithm for (6). Consider the Karush–Kuhn–Tucker conditions (Boyd & Vandenberghe, 2004) $$\tilde R_1 w_1 - \tilde R_{12}w_2 + \lambda_1 s_1 = 0,$$ where |$s_1$| is the subgradient of |$\|w_1\|_1$|⁠. If |$\lambda_1 \geq \|\tilde R_{12}w_2\|_{\infty}$|⁠, it follows that |$\tilde w_1 = 0$|⁠. Otherwise, the |$i$|th element of |$w_1$| can be expressed in terms of the other coordinates as $$w_{1i} = S_{\lambda_1}\bigl\{(\tilde R_{12})_i w_2^{(t)} - (\tilde R_{1})_{i, -i} (w_{1})_{-i}\bigr\},$$ where |$S_{\lambda} (t) = \text{sign}(t) \left(| t | - \lambda \right)_{+}$| is the soft-thresholding operator, |$(R_{12})_i$| denotes the |$i$|th row of matrix |$R_{12}$| and |$(R_1)_{i, -i}$| denotes the |$i$|th row of matrix |$R_1$| without the |$i$|th component, i.e., |$(R)_{i, -i} = (R_{i1}, \ldots, R_{i,i-1}, R_{i,i+1}, \ldots, R_{ip})$|⁠. The coordinate-descent algorithm proceeds by using the above formula to update one coordinate at a time until convergence to a global optimum is achieved. This convergence is guaranteed by the convexity of the objective function and the separability of the penalty with respect to the coordinates (Tseng, 1988). 3.4. Selection of tuning parameters Cross-validation is a popular approach to selecting the tuning parameter in the lasso. In our context, however, it amounts to performing a grid search over both |$\lambda_1$| and |$\lambda_2$|⁠. Moreover, splitting the data as in cross-validation may lead to too small a number of testing samples to construct the rank-based estimator of the latent correlation matrix. Instead, motivated by Wilms & Croux (2015), we adapt the Bayesian information criterion to the canonical correlation analysis to avoid splitting the data and thus reduce the computational cost. For the Gaussian linear regression model, the Bayesian information criterion has the form $${{\small{\text{BIC}}}} = -2\ell + {\rm df} \log n,$$ where |${\rm df}$| indicates the number of parameters in the model and |$\ell$| is the loglikelihood, $$\ell = \log L = -(n/2)\log{\sigma^2} -\sum_{i=1}^{n} (y_i - X_i {\beta} )^2/({2\sigma^2}).$$ Two cases can be considered, depending on whether the variance |$\sigma^2$| is known or unknown. (i) If $\sigma^2$ is known and the data are scaled so that $\sigma^2 = 1$⁠, then $$ {{\small{\text{BIC}}}} = n^{-1}\sum_{i=1}^{n} (y_i - X_i {\hat\beta})^2 + {\rm df}\,\dfrac{\log n}{n}.$$ (ii) If $\sigma^2$ is unknown, using $\hat{\sigma}^2_{\text{MLE}} = n^{-1} \sum_{i=1}^{n} (y_i - X_i {\hat \beta} )^2 $ gives $$ {{\small{\text{BIC}}}} = n\log \!\left\{n^{-1}\sum_{i=1}^{n} (y_i - X_i {\hat \beta} )^2 \right\} + {\rm df} \log n.$$ Wilms & Croux (2015) used criterion (ii) for canonical correlation analysis by substituting |$\| X_1 \tilde w_1 - X_2 w_2\|_2^2 /n$| for |$\sum_{i=1}^{n} (y_i - X_i {\hat \beta})^2 /n$| for centred |$X_1$| and |$X_2$|⁠. Since |$\| X_1 \tilde w_1 - X_2 w_2\|_2^2 /n = \tilde w_1^{{ {\rm \scriptscriptstyle T} }}S_1 \tilde w_1 - 2 \tilde w_1^{{ {\rm \scriptscriptstyle T} }}S_{12}w_2 + w_2^{{ {\rm \scriptscriptstyle T} }}S_2w_2$| and we use |$\tilde R$| instead of the sample covariance matrix |$S$|⁠, we substitute $$f(\tilde{w}_1) = \tilde{w}_1^{{ {\rm \scriptscriptstyle T} }} \tilde R_{1} \tilde{w}_1 - 2 \tilde{w}_1^{{ {\rm \scriptscriptstyle T} }} \tilde R_{12} w_2 + w_2 \tilde R_{2} w_2$$ for the residual sum of squares. Furthermore, motivated by the performance of the adjusted degrees of freedom variance estimator in Reid et al. (2016), we also adjust |$f(\tilde w_1)$| for criterion (ii), leading to the two criteria $$\begin{equation} \begin{split}\nonumber {{\small{\text{BIC}}}}_1 = f(\tilde{w}_1) + {\rm df}_{\tilde{w}_1} \dfrac{\log n}{n},\quad {{\small{\text{BIC}}}}_2 = \log \!\left\{ \dfrac{n}{n-{\rm df}_{\tilde{w}_1}} f(\tilde{w}_1) \right\} + {\rm df}_{\tilde{w}_1} \dfrac{\log n}{n}. \end{split} \end{equation}$$ Here |${\rm df}_{\tilde w_1}$| coincides with the size of the support of |$\tilde w_1$| (Tibshirani & Taylor, 2012). The criteria for |$w_2$| are defined analogously to those for |$w_1$|⁠. We use both criteria in evaluating our approach. Upon selecting a criterion, either |${{\small{\text{BIC}}}}_1$| or |${{\small{\text{BIC}}}}_2$|⁠), we apply it sequentially at each step of the biconvex optimization algorithm of § 3.3 and each time select the tuning parameter corresponding to the smallest value of the criterion. Owing to the alternating minimization, the solution will in general depend on the choice of the initial starting point. By default, we initialize the algorithm with the unpenalized solution to (1) obtained using |$\tilde R + 0.25\,{\rm I}$|⁠, which corresponds to the canonical ridge solution with a fixed amount of regularization (González et al., 2008). We find that this initialization works well compared to a random initialization; more details are provided in the Supplementary Material. Remark 6. Sequences of |$\lambda$| values for |$w_1$| and |$w_2$| are separately generated for the algorithm if there is no specification. For example, a sequence for |$\lambda_1$| is generated as follows. We first calculate |$\lambda_{\rm max} = \tilde R_{12} \hat w^{(0)}_{2}$| and |$\lambda_{\rm min} = \epsilon \lambda_{\rm max}$|⁠, where |$\hat w^{(0)}_{2}$| is the starting point for |$w_2$|⁠. Then, from |$\lambda_{\rm min}$| to |$\lambda_{\rm max}$|⁠, the sequence is generated to be equally spaced on a logarithmic scale. As a default, we use 20 values of |$\lambda$| for each side with |$\epsilon = 0.01$|⁠. The sequence for |$\lambda_2$| is defined analogously. 4. Simulation studies In this section we evaluate the performance of the following methods: (i) classical canonical correlation analysis based on the sample covariance matrix; (ii) canonical ridge analysis, available in the R package CCA (González et al., 2008); (iii) the sparse canonical correlation analysis of Witten et al. (2009), available in the R package PMA; (iv) the sparse canonical correlation analysis of Gao et al. (2017), available in the Matlab package SCCALab; and (v) sparse canonical correlation analysis via Kendall’s |$\tau$| as proposed in this paper. For our method, we evaluate both types of bic described in § 3.4. We also consider using Pearson’s sample correlation instead of |$\tilde R$| within our optimization framework with the same bic for parameter selection. For fair comparison with |$\tilde R$|⁠, we also apply shrinkage to the Pearson correlation matrix as in (3). Direct comparison of the estimation performance of our rank-based estimator and Pearson’s sample correlation as a function of the sample size and level of truncation can be found in the Supplementary Material. We generate |$n=100$| independent pairs |$({\rm Z}_1, {\rm Z}_2) \in \mathbb{R}^{p_1+p_2}$| according to $$\begin{pmatrix} {\rm Z}_1\\{\rm Z}_2 \end{pmatrix} \sim {{N}} \!\left\{ \begin{pmatrix} 0\\0 \end{pmatrix}, \: \begin{pmatrix} \Sigma_1 & \rho \Sigma_1 w_1 w_2^{{ {\rm \scriptscriptstyle T} }} \Sigma_2\\ \rho \Sigma_2 w_2 w_1^{{ {\rm \scriptscriptstyle T} }} \Sigma_1 & \Sigma_2 \\ \end{pmatrix} \right\}.$$ We consider two settings for the number of variables: low-dimensional (⁠|$p_1=p_2=25$|⁠) and high-dimensional (⁠|$p_1=p_2=100$|⁠). Each canonical vector |$w_g$| (⁠|$g=1, 2$|⁠) is defined by taking a vector with ones at the coordinates |$(1, 6, 11, 16, 21)$| and zeros elsewhere, and normalizing it such that |$w_g^{{ {\rm \scriptscriptstyle T} }} \Sigma_g w_g =1$|⁠; a similar model is used in Chen et al. (2013). We use an autoregressive structure for |$\Sigma_{1}$| and a block-diagonal structure for |$\Sigma_2$|⁠: |$\Sigma_{1} = \{\gamma^{|j-k|}\}_{j,k=1}^{p_1}$| and |$\Sigma_2= {\rm block-diag}(\Sigma_\gamma, \ldots, \Sigma_\gamma)$| where |$\Sigma_\gamma\in \mathbb{R}^{d\times d}$| is an equicorrelated matrix with value |$1$| on the diagonal and value |$\gamma$| off the diagonal. We use five blocks of size |$d\in\{6, 6, 3, 7, 3\}$| for the low-dimensional case, and take |$d\in\{14, 21, 12, 25, 28\}$| for the high-dimensional setting. We set |$\gamma = {0.7}$| for both |$\Sigma_1$| and |$\Sigma_2$|⁠. We further randomly permute the order of variables in each |${\rm Z}_g$| to remove the covariance-induced ordering. The value of the canonical correlation is set at |$\rho = \text{0.9}$|⁠. We consider transformations |${\rm U}_g = f_g({\rm Z}_g +{\rm B}_g)$| where the elements of vector |${\rm B}_g$| are 0 or 1 with equal probability. The variation in the shift of |${\rm Z}_g$| across |$p_g$| variables due to |${\rm B}_g$| leads to the variation in the proportion of zeros across the variables in the 5–80% range for the same choice of the truncation constant |$C$|⁠. We consider three choices for |$f_g$|⁠: no transformation, |$f_g(z) = z$| for |$g=1, 2$| (copula 0); exponential transformation for |${\rm U}_1$|⁠, |$f_1(z) = \exp(z)$|⁠, and no transformation for |${\rm U}_2$|⁠, |$f_2(z) = z$| (copula 1); exponential transformation for |${\rm U}_1$|⁠, |$f_1(z) = \exp(z)$|⁠, and cubic transformation for |${\rm U}_2$|⁠, |$f_2(z) = z^3$| (copula 2). Finally, we set |${\rm X}_g$| equal to |${\rm U}_g$| for variables of continuous type and dichotomize or truncate |${\rm U}_g$| at the same value |$C$| for all variables to form, respectively, binary or truncated |${\rm X}_g$|⁠. We set |$C=\text{1.5}$| for exponentially transformed variables and |$C=0$| for the others. In each case, we consider three combinations of variable types for |${\rm X}_1$|/|${\rm X}_2$|⁠: truncated/truncated, truncated/continuous and truncated/binary. To compare the performance of the methods, we evaluate the expected out-of-sample correlation $$\begin{equation}\label{def:rhohat} \hat{\rho} = \left| \dfrac{ \hat{w}_1^{{ {\rm \scriptscriptstyle T} }} \Sigma_{12} \hat{w}_2 }{ (\hat{w}_1^{{ {\rm \scriptscriptstyle T} }} \Sigma_{1} \hat{w}_1)^{1/2} (\hat{w}_2^{{ {\rm \scriptscriptstyle T} }} \Sigma_{2} \hat{w}_2)^{1/2} } \right| \end{equation}$$(7) and the predictive loss $$\begin{equation}\label{def:predloss} L(w_g, \hat w_g) = 1-\dfrac{|\hat{w}_g^{{ {\rm \scriptscriptstyle T} }} \Sigma_{g} w_g|}{(\hat{w}_g^{{ {\rm \scriptscriptstyle T} }} \Sigma_{g} \hat{w}_g)^{1/2} } \quad (g=1,2); \end{equation}$$(8) a similar loss is used in Gao et al. (2017). By the definition of the true canonical correlation |$\rho$|⁠, for any |$\hat w_1$| and |$\hat w_2$| we have that |$\hat \rho \leq \rho$|⁠, with equality when |$\hat w_1 = w_1$| and |$\hat w_2 = w_2$|⁠. Since |$w_g^{{ {\rm \scriptscriptstyle T} }}\Sigma_g w_g=1$|⁠, |$\:L(w_g, \hat w_g)\in[0,1]$| with |$L(w_g, \hat w_g)=0$| if |$\hat w_g = w_g$|⁠. We also evaluate the variable selection performance using the selected model size and the true positive and true negative rates, defined respectively as $${\small{\text{TPR}}}_{g} = \dfrac{\#\{j: \hat{w}_{gj} \neq 0,\: w_{gj} \neq 0\}}{\#\{j: w_{gj} \neq 0\}}, \quad {\small{\text{TNR}}}_{g} = \dfrac{\#\{j: \hat{w}_{gj} = 0,\: w_{gj} = 0\}}{\#\{j: w_{gj} = 0\}} \quad (g=1,2).$$ The results for the truncated/truncated case over 500 replications are presented in Figs. 1 and 2. From Fig. 1 it can be seen that the majority of methods achieve higher values of |$\hat \rho$| in the absence of data transformation (copula 0) than when a transformation is applied (copulas 1 and 2). The only exception is our approach based on Kendall’s |$\tau$|⁠, which as expected has comparable performance across the copula types. The performance of all methods deteriorates with increasing dimension, leading to smaller values of |$\hat\rho$| and larger predictive losses. The classical canonical correlation analysis performs especially poorly in high-dimensional settings, with |$\hat \rho$| being almost 0 and the predictive loss close to 1 for both |$w_1$| and |$w_2$|⁠. The canonical ridge method works well in the copula 0 setting, but its performance is strongly affected in the presence of transformations (copulas 1 and 2). Surprisingly to us, Gao’s method, as implemented in SCCALab, performs poorly relative to the other approaches. Since Gao’s method was designed for Gaussian data, the poor performance is likely due to its sensitivity to the presence of copulas and zero truncation, in the copula 0 case, proportions of zero values for the variables range from 5% to 70%. We also used the default values in SCCALab for all of the parameters, so better performance could possibly be achieved by adjusting those values. Sparse canonical correlation analysis based on Pearson’s correlation outperforms all the other methods in the low-dimensional setting when no data transformation is applied (copula 0); however, its performance deteriorates when the monotone transformations are applied to the data (copulas 1 and 2). It also performs worse than our rank-based approach in the high-dimensional setting. This is likely to be due to the increase in variables with zero inflation because of truncation, which Pearson’s correlation does not take into account. In the low-dimensional setting, bic|$_1$| and bic|$_2$| lead to similar values of |$\hat \rho$|⁠, with a larger variance in bic|$_1$|’s performance. In the high-dimensional setting, bic|$_2$| is clearly better than bic|$_1$| in terms of prediction, and this better performance is irrespective of the choice of the estimator for the latent correlation matrix, Pearson’s correlation matrix or the proposed rank-based correlation matrix. Overall, our method based on Kendall’s |$\tau$| with bic|$_2$| yields the highest values of |$\hat \rho$| and the smallest values of predictive loss across dimensions and different copula types. Fig. 1. Open in new tabDownload slide Results over 500 replications for the truncated/truncated case: (a) the value of |$\hat{\rho}$| from (7), with the red horizontal lines indicating the true canonical correlation value of |$\rho = \text{0.9}$|⁠; (b) the value of the predictive loss (8). The methods under comparison are: CCA, sample canonical correlation analysis; RidgeCCA, the canonical ridge analysis of González et al. (2008); WittenCCA, the method of Witten et al. (2009); GaoCCA, the method of Gao et al. (2017); PearsonBIC1 and PearsonBIC2, the proposed algorithm with the Pearson sample correlation matrix and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠; KendallBIC1 and KendallBIC2, the proposed algorithm with the rank-based estimator |$\tilde R$| and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠. LD, the low-dimensional setting, |$p_1\,=\,p_2 \,=\, 25$|⁠; HD, the high-dimensional setting, |$p_1 = p_2 = 100$|⁠. Fig. 1. Open in new tabDownload slide Results over 500 replications for the truncated/truncated case: (a) the value of |$\hat{\rho}$| from (7), with the red horizontal lines indicating the true canonical correlation value of |$\rho = \text{0.9}$|⁠; (b) the value of the predictive loss (8). The methods under comparison are: CCA, sample canonical correlation analysis; RidgeCCA, the canonical ridge analysis of González et al. (2008); WittenCCA, the method of Witten et al. (2009); GaoCCA, the method of Gao et al. (2017); PearsonBIC1 and PearsonBIC2, the proposed algorithm with the Pearson sample correlation matrix and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠; KendallBIC1 and KendallBIC2, the proposed algorithm with the rank-based estimator |$\tilde R$| and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠. LD, the low-dimensional setting, |$p_1\,=\,p_2 \,=\, 25$|⁠; HD, the high-dimensional setting, |$p_1 = p_2 = 100$|⁠. Fig. 2. Open in new tabDownload slide Results over 500 replications for the truncated/truncated case: (a) true positive rate, TPR; (b) true negative rate, TNR; (c) selected model size. The methods under comparison are: WittenCCA, the method of Witten et al. (2009); GaoCCA, the method of Gao et al. (2017); PearsonBIC1 and PearsonBIC2, the proposed algorithm with the Pearson sample correlation matrix and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠; KendallBIC1 and KendallBIC2, the proposed algorithm with the rank-based estimator |$\tilde R$| and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠. LD, the low-dimensional setting, |$p_1 \,=\, p_2 = 25$|⁠; and HD, the high-dimensional setting, |$p_1 = p_2 = 100$|⁠. Fig. 2. Open in new tabDownload slide Results over 500 replications for the truncated/truncated case: (a) true positive rate, TPR; (b) true negative rate, TNR; (c) selected model size. The methods under comparison are: WittenCCA, the method of Witten et al. (2009); GaoCCA, the method of Gao et al. (2017); PearsonBIC1 and PearsonBIC2, the proposed algorithm with the Pearson sample correlation matrix and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠; KendallBIC1 and KendallBIC2, the proposed algorithm with the rank-based estimator |$\tilde R$| and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠. LD, the low-dimensional setting, |$p_1 \,=\, p_2 = 25$|⁠; and HD, the high-dimensional setting, |$p_1 = p_2 = 100$|⁠. Figure 2 shows the variable selection performance of each method. The classical canonical correlation analysis and canonical ridge method are excluded as they do not perform variable selection. To ensure that the results are consistent with the numerical precision of the optimization algorithm, we treat a variable as being nonzero if its loading is above the |$10^{-6}$| threshold in absolute value. An unexpected finding is that the number of selected variables varies significantly across replications for Witten’s method, as can be seen in Fig. 2(c), leading to significant variations in the true positive and true negative rates. We suspect that this is due to the use of a permutation method for selecting tuning parameters. Our approach based on Kendall’s |$\tau$| leads to a more favourable combination of true positive and true negative rates than competing methods, especially when data transformations are applied. Furthermore, this advantage is maintained independently of the tuning parameter selection scheme. In the Supplementary Material we compare the true positive versus false positive curves obtained from each method over the range of tuning parameters, and find that our rank-based estimator gives the highest area under the curve in the copula settings. Comparing the performance of bic|$_1$| and bic|$_2$| in Fig. 2, we see that bic|$_1$| yields the sparsest model and the highest true negative rate for both Pearson’s correlation and our rank-based correlation, at the expense of missing some true variables in the high-dimensional setting. From the comparison of predictive performance between the two selection criteria, we conclude that bic|$_1$| is better suited to variable selection, especially when a high true negative rate is desired, whereas bic|$_2$| works better for prediction. In addition to the truncated/truncated case, we also consider the truncated/continuous and truncated/binary cases; the results are reported in the Supplementary Material. The conclusions drawn from the comparison of methods are similar to those for the truncated/truncated case. Overall, all the methods perform best in the truncated/continuous case and worst in the truncated/binary case, which is not surprising, since dichotomization of a continuous variable leads to loss of information, thus reducing the effective sample size. 5. Application to the Cancer Genome Atlas data The Cancer Genome Atlas project collects data from multiple platforms using high-throughput sequencing technologies. We consider gene expression data (⁠|$p_1=891$|⁠) and microRNA data (⁠|$p_2=431$|⁠) for |$n=500$| matched subjects from the Cancer Genome Atlas breast cancer database. We treat gene expression data as continuous and microRNA data as truncated continuous. The range of proportions of zero values contained in each variable in the microRNA data is 0–49.8%. The subjects belong to one of five breast cancer subtypes: Normal, Basal, Her2, LumA and LumB, with 37 subjects having missing subtype information, denoted by NA. The goal of the analysis is to characterize the association between gene expression and microRNA data, and to investigate whether this association is related to breast cancer subtypes. To investigate the performance of our method relative to other approaches, we randomly split the data 500 times. Each time |$400$| samples are used for training, and the remaining |$100$| test samples are used to assess the association via $$\hat{\rho}_{\text{test}} = \left| \dfrac{\hat{w}_{1, \text{train}}^{{ {\rm \scriptscriptstyle T} }} \Sigma_{12, \text{test}} \hat{w}_{2, \text{train}}}{(\hat{w}_{1, \text{train}}^{{ {\rm \scriptscriptstyle T} }} \Sigma_{1, \text{test}} \hat{w}_{1, \text{train}})^{1/2} (\hat{w}_{2, \text{train}}^{{ {\rm \scriptscriptstyle T} }} \Sigma_{2, \text{test}} \hat{w}_{2, \text{train}})^{1/2}} \right|.$$ Here |$\Sigma_{\text{test}}$| is evaluated based on the test samples, and is either the rank-based estimator |$\tilde R$| for our method or the sample covariance matrix for the other methods. We also compare the numbers of selected genes and selected microRNAs, and the results are presented in Table 1. We do not consider the method of Gao et al. (2017) in this section because of its poor performance in § 4 and high computational cost, it takes around 40 minutes per replication for these data on a Windows 3.60 GHz Intel Core i7 CPU machine. Table 1. Mean support size and value of |$\hat{\rho}_{{\rm test}}$| over |$500$| random splits of breast cancer data, with standard deviation given in parentheses Method . Selected genes . Selected microRNAs . |$\hat{\rho}_{\text{test}}$| . CCA 891.00 (0.00) 431.00 (0.00) 0.004 (0.109) RidgeCCA 891.00 (0.00) 431.00 (0.00) 0.712 (0.126) WittenCCA 338.36 (194.58) 165.53 (100.30) 0.789 (0.041) PearsonBIC1 9.92 (2.62) 16.08 (3.18) 0.813 (0.044) PearsonBIC2 27.68 (6.20) 40.50 (12.54) 0.857 (0.034) KendallBIC1 18.24 (3.51) 9.68 (3.15) 0.880 (0.030) KendallBIC2 38.18 (8.47) 31.01 (6.74) 0.913 (0.029) Method . Selected genes . Selected microRNAs . |$\hat{\rho}_{\text{test}}$| . CCA 891.00 (0.00) 431.00 (0.00) 0.004 (0.109) RidgeCCA 891.00 (0.00) 431.00 (0.00) 0.712 (0.126) WittenCCA 338.36 (194.58) 165.53 (100.30) 0.789 (0.041) PearsonBIC1 9.92 (2.62) 16.08 (3.18) 0.813 (0.044) PearsonBIC2 27.68 (6.20) 40.50 (12.54) 0.857 (0.034) KendallBIC1 18.24 (3.51) 9.68 (3.15) 0.880 (0.030) KendallBIC2 38.18 (8.47) 31.01 (6.74) 0.913 (0.029) CCA, sample canonical correlation analysis; RidgeCCA, canonical ridge of González et al. (2008); WittenCCA, method of Witten et al. (2009); PearsonBIC1 and PearsonBIC2, the proposed algorithm with the Pearson sample correlation matrix and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠; KendallBIC1 and KendallBIC2, the proposed algorithm with the rank-based estimator |$\tilde R$| and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠. Open in new tab Table 1. Mean support size and value of |$\hat{\rho}_{{\rm test}}$| over |$500$| random splits of breast cancer data, with standard deviation given in parentheses Method . Selected genes . Selected microRNAs . |$\hat{\rho}_{\text{test}}$| . CCA 891.00 (0.00) 431.00 (0.00) 0.004 (0.109) RidgeCCA 891.00 (0.00) 431.00 (0.00) 0.712 (0.126) WittenCCA 338.36 (194.58) 165.53 (100.30) 0.789 (0.041) PearsonBIC1 9.92 (2.62) 16.08 (3.18) 0.813 (0.044) PearsonBIC2 27.68 (6.20) 40.50 (12.54) 0.857 (0.034) KendallBIC1 18.24 (3.51) 9.68 (3.15) 0.880 (0.030) KendallBIC2 38.18 (8.47) 31.01 (6.74) 0.913 (0.029) Method . Selected genes . Selected microRNAs . |$\hat{\rho}_{\text{test}}$| . CCA 891.00 (0.00) 431.00 (0.00) 0.004 (0.109) RidgeCCA 891.00 (0.00) 431.00 (0.00) 0.712 (0.126) WittenCCA 338.36 (194.58) 165.53 (100.30) 0.789 (0.041) PearsonBIC1 9.92 (2.62) 16.08 (3.18) 0.813 (0.044) PearsonBIC2 27.68 (6.20) 40.50 (12.54) 0.857 (0.034) KendallBIC1 18.24 (3.51) 9.68 (3.15) 0.880 (0.030) KendallBIC2 38.18 (8.47) 31.01 (6.74) 0.913 (0.029) CCA, sample canonical correlation analysis; RidgeCCA, canonical ridge of González et al. (2008); WittenCCA, method of Witten et al. (2009); PearsonBIC1 and PearsonBIC2, the proposed algorithm with the Pearson sample correlation matrix and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠; KendallBIC1 and KendallBIC2, the proposed algorithm with the rank-based estimator |$\tilde R$| and tuning parameter selection criterion bic|$_1$| or bic|$_2$|⁠. Open in new tab Of course, neither the sample canonical correlation analysis nor the canonical ridge method performs variable selection. Moreover, |$\hat{\rho}_{\text{test}}$| is very close to zero for the sample canonical correlation, confirming the poor performance of the method. The canonical ridge leads to significantly higher values of |$\hat \rho_{\text{test}}$|⁠, demonstrating the advantage of added regularization; however, it still has smaller correlation values than the other approaches. The method of Witten et al. (2009) gives higher correlation values than both sample canonical correlation analysis and the canonical ridge, but it still selects a significant number of variables, with highly variable model sizes across replications. We suspect that this is due to the use of a permutation-based algorithm for tuning parameter selection; similar behaviour was observed in § 4. Sparse canonical correlation analysis based on Pearson’s correlation selects a much smaller number of genes and microRNAs, but achieves higher values of |$\hat \rho_{\text{test}}$| than the method of Witten et al. (2009); this is consistent with the results in § 4. The highest values of |$\hat \rho_{\text{test}}$| are achieved by our approach based on Kendall’s |$\tau$|⁠, with a smaller number of selected variables, confirming that the association found is not due to overfitting as it generalizes well to out-of-sample data. bic|$_1$| leads to a sparser model than bic|$_2$| consistently for both Pearson- and Kendall-based correlation estimates, with bic|$_2$| yielding the larger out-of-sample correlation value. In light of these findings and the results of § 4, we conclude that bic|$_1$| is advantageous for variable selection owing to its selection of a sparser model and a higher true negative rate observed in simulations, whereas bic|$_2$| is advantageous for prediction. We next investigate possible relationships between selected variables and breast cancer subtypes. Since the selected variables may change across the random data splits, we consider the selection frequency of each gene and microRNA across all 500 replications of our method with |${\small{\text{BIC}}}_2$|⁠, and choose the variables that are selected at least 80% of the time. Figure 3 shows heatmaps of the expression levels of the resulting 19 genes and 16 microRNAs, with samples ordered by their respective cancer subtype. The heatmaps show clear separation between the Basal and other subtypes, suggesting that the association found is relevant to cancer biology. Fig. 3. Open in new tabDownload slide Genes and microRNAs selected more often than 80% among 500 repetitions by our method with bic|$_2$|⁠: (a) heatmap of 19 genes, where blue, red and white indicate positive, negative and zero expression levels, respectively; (b) heatmap of 16 microRNAs, where the saturation levels of colours are assigned based on variable-specific quantiles. In both panels, the dissimilarity measure is taken to be |$1-\tilde{R}$| with our rank-based correlation |$\tilde{R}$|⁠, and Ward linkage is used. Normal, Basal, Her2, LumA and LumB refer to the five breast cancer subtypes, and NA denotes missing subtype information. Fig. 3. Open in new tabDownload slide Genes and microRNAs selected more often than 80% among 500 repetitions by our method with bic|$_2$|⁠: (a) heatmap of 19 genes, where blue, red and white indicate positive, negative and zero expression levels, respectively; (b) heatmap of 16 microRNAs, where the saturation levels of colours are assigned based on variable-specific quantiles. In both panels, the dissimilarity measure is taken to be |$1-\tilde{R}$| with our rank-based correlation |$\tilde{R}$|⁠, and Ward linkage is used. Normal, Basal, Her2, LumA and LumB refer to the five breast cancer subtypes, and NA denotes missing subtype information. Many of the selected genes and microRNAs have appeared in recent literature, which supports their association with breast cancer. Kim et al. (2016) have indicated that ERBB4 is a prognostic marker for triple negative breast cancer, which is often used interchangeably with Basal-like breast cancer. In agreement with our results, Castilla et al. (2014) found that VGLL1 and miR-934 are highly correlated with each other, and that both are overexpressed in the Basal-like subtype. They also found that the selected FOXA1 and GATA3 genes, as well as the ESR1 gene, which is not selected at the 80% frequency threshold, but still has a 73.4% frequency, have strong negative correlations with both VGLL1 and miR-934. The expression level of the selected ELF5 has been shown to play a key role in determining breast cancer molecular subtype by Kalyuga et al. (2012) and Piggin et al. (2016). Furthermore, Jonsdottir et al. (2012) have validated that the selected hsa-miR-18a and hsa-miR-505 microRNAs are significantly correlated with prognostic breast cancer biomarkers, and that high expression of hsa-miR-18a is strongly associated with Basal-like breast cancer features. Finally, the selected hsa-miR-135b is reported to be related to breast cancer cell growth in Aakula et al. (2015) and Hua et al. (2016). 6. Discussion One of the main contributions of this work is a truncated Gaussian copula model for zero-inflated data and the development of a rank-based estimator for the latent correlation matrix. While our focus is on canonical correlation analysis, our estimator can be used in conjunction with other covariance-based approaches. For example, it can be used for constructing graphical models as in Fan et al. (2017) in cases where some or all of the variables have an excess of zeros. Besides the microRNA data explored in this work, another prominent application example consists of microbiome abundance data. It would be of interest to further explore the potential uses of our modelling approach in different application areas. The R package mixedCCA with our method’s implementation is available from the GitHub page https://github.com/irinagain/mixedCCA. Acknowledgement Yoon and Carroll were supported by the U.S. National Cancer Institute. Carroll is also a Distinguished Professor at the University of Technology Sydney, Australia. Gaynanova was supported by the U.S. National Science Foundation. We thank the associate editor and two referees for valuable comments and suggestions that helped to significantly improve this manuscript. Supplementary material Supplementary Material available at Biometrika online includes proofs of all the technical results and additional simulation studies. References Aakula, A. , Leivonen, S.-K., Hintsanen, P., Aittokallio, T., Ceder, Y., Børresen-Dale, A.-L., Perälä, M., Östling, P. & Kallioniemi, O. ( 2015 ). MicroRNA-135b regulates ER|$\alpha$|⁠, AR and HIF1AN and affects breast and prostate cancer cell growth . Molec. Oncol. 9 , 1287 – 300 . Google Scholar Crossref Search ADS WorldCat Agniel, D. & Cai, T. ( 2017 ). Analysis of multiple diverse phenotypes via semiparametric canonical correlation analysis . Biometrics 73 , 1254 – 65 . Google Scholar Crossref Search ADS PubMed WorldCat Bach, F. R. & Jordan, M. I. ( 2005 ). A probabilistic interpretation of canonical correlation analysis . Tech. Rep. 688 , Department of Statistics, University of California, Berkeley . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Boyd, S. P. & Vandenberghe, L. ( 2004 ). Convex Optimization . Cambridge : Cambridge University Press . Google Scholar Crossref Search ADS Google Scholar Google Preview WorldCat COPAC Castilla, M. Á. , López-García, M. Á., Atienza, M. R., Rosa-Rosa, J. M., Diaz-Martin, J., Pecero, M. L., Vieites, B., Romero-Pérez, L., Benítez, J., Calcabrini, A. et al. ( 2014 ). VGLL1 expression is associated with a triple-negative basal-like phenotype in breast cancer . Endocrine-Rel. Cancer 21 , 587 – 99 . Google Scholar Crossref Search ADS WorldCat Chen, M. , Gao, C., Ren, Z. & Zhou, H. H. ( 2013 ). Sparse CCA via precision adjusted iterative thresholding . arXiv: 1311.6186v1. Chen, X. & Liu, H. ( 2011 ). An efficient optimization algorithm for structured sparse CCA, with applications to eQTL mapping . Statist. Biosci. 4 , 3 – 26 . Google Scholar Crossref Search ADS WorldCat Chi, E. C. , Allen, G. I., Zhou, H., Kohannim, O., Lange, K. & Thompson, P. M. ( 2013 ). Imaging genetics via sparse canonical correlation analysis . In 2013 IEEE 10th International Symposium on Biomedical Imaging . New York : Institute of Electrical and Electronics Engineers , pp. 740 – 3 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Cruz-Cano, R. & Lee, M.-L. T. ( 2014 ). Fast regularized canonical correlation analysis . Comp. Statist. Data Anal. 70 , 88 – 100 . Google Scholar Crossref Search ADS WorldCat Fan, J. , Liu, H., Ning, Y. & Zou, H. ( 2017 ). High dimensional semiparametric latent graphical model for mixed data . J. R. Statist. Soc. B 79 , 405 – 21 . Google Scholar Crossref Search ADS WorldCat Gao, C. , Ma, Z., Ren, Z. & Zhou, H. H. ( 2015 ). Minimax estimation in sparse canonical correlation analysis . Ann. Statist. 43 , 2168 – 97 . Google Scholar Crossref Search ADS WorldCat Gao, C. , Ma, Z. & Zhou, H. H. ( 2017 ). Sparse CCA: Adaptive estimation and computational barriers . Ann. Statist. 45 , 2074 – 101 . Google Scholar Crossref Search ADS WorldCat González, I. , Déjean, S., Martin, P. G. & Baccini, A. ( 2008 ). CCA: An R package to extend canonical correlation analysis . J. Statist. Software 23 , 1 – 14 . Google Scholar Crossref Search ADS WorldCat Gorski, J. , Pfeuffer, F. & Klamroth, K. ( 2007 ). Biconvex sets and optimization with biconvex functions: A survey and extensions . Math. Meth. Oper. Res. 66 , 373 – 407 . Google Scholar Crossref Search ADS WorldCat Guo, Y. , Ding, X., Liu, C. & Xue, J.-H. ( 2016 ). Sufficient canonical correlation analysis . IEEE Trans. Image Proces. 25 , 2610 – 9 . Google Scholar Crossref Search ADS WorldCat Hardoon, D. R. , Szedmak, S. & Shawe-Taylor, J. ( 2004 ). Canonical correlation analysis: An overview with application to learning methods . Neural Comp. 16 , 2639 – 64 . Google Scholar Crossref Search ADS WorldCat Hotelling, H. ( 1936 ). Relations between two sets of variates . Biometrika 28 , 321 – 77 . Google Scholar Crossref Search ADS WorldCat Hua, K. , Jin, J., Zhao, J., Song, J., Song, H., Li, D., Maskey, N., Zhao, B., Wu, C., Xu, H. et al. ( 2016 ). miR-135b, upregulated in breast cancer, promotes cell growth and disrupts the cell cycle by regulating LATS2 . Int. J. Oncol. 48 , 1997 – 2006 . Google Scholar Crossref Search ADS PubMed WorldCat Jonsdottir, K. , Janssen, S. R., Da Rosa, F. C., Gudlaugsson, E., Skaland, I., Baak, J. P. A. & Janssen, E. A. M. ( 2012 ). Validation of expression patterns for nine miRNAs in 204 lymph-node negative breast cancers . PLoS One 7 , 1 – 9 . DOI: 10.1371/journal.pone.0048692. Google Scholar Crossref Search ADS WorldCat Kalyuga, M. , Gallego-Ortega, D., Lee, H. J., Roden, D. L., Cowley, M. J., Caldon, C. E., Stone, A., Allerdice, S. L., Valdes-Mora, F., Launchbury, R. et al. ( 2012 ). ELF5 suppresses estrogen sensitivity and underpins the acquisition of antiestrogen resistance in luminal breast cancer . PLoS Biol. 10 , 1 – 17 . DOI: 10.1371/journal.pbio.1001461. Google Scholar Crossref Search ADS WorldCat Kim, J.-Y. , Jung, H. H., Do, I.-G., Bae, S., Lee, S. K., Kim, S. W., Lee, J. E., Nam, S. J., Ahn, J. S., Park, Y. H. et al. ( 2016 ). Prognostic value of ERBB4 expression in patients with triple negative breast cancer . BMC Cancer 16 , 138 . DOI: 10.1186/s12885-016-2195-3. Google Scholar Crossref Search ADS PubMed WorldCat Liu, H. , Lafferty, J. & Wasserman, L. ( 2009 ). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs . J. Mach. Learn. Res. 10 , 2295 – 328 . OpenURL Placeholder Text WorldCat Mai, Q. & Zhang, X. ( 2019 ). An iterative penalized least squares approach to sparse canonical correlation analysis . Biometrics 75 , 734 – 44 . Google Scholar Crossref Search ADS PubMed WorldCat Parkhomenko, E. , Tritchler, D. & Beyene, J. ( 2009 ). Sparse canonical correlation analysis with application to genomic data integration . Statist. Applic. Genet. Molec. Biol. 8 , 1 – 34 . Google Scholar Crossref Search ADS WorldCat Piggin, C. L. , Roden, D. L., Gallego-Ortega, D., Lee, H. J., Oakes, S. R. & Ormandy, C. J. ( 2016 ). ELF5 isoform expression is tissue-specific and significantly altered in cancer . Breast Cancer Res. 18 , 4 . DOI: 10.1186/s13058-015-0666-0. Google Scholar Crossref Search ADS PubMed WorldCat R Development Core Team ( 2020 ). R: A Language and Environment for Statistical Computing . R Foundation for Statistical Computing , Vienna, Austria . ISBN 3-900051-07-0. http://www.R-project.org. Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Reid, S. , Tibshirani, R. & Friedman, J. ( 2016 ). A study of error variance estimation in lasso regression . Statist. Sinica 26 , 35 – 67 . OpenURL Placeholder Text WorldCat Safo, S. E. , Li, S. & Long, Q. ( 2018 ). Integrative analysis of transcriptomic and metabolomic data via sparse canonical correlation analysis with incorporation of biological information . Biometrics 74 , 300 – 12 . Google Scholar Crossref Search ADS PubMed WorldCat Tibshirani, R. ( 1996 ). Regression shrinkage and selection via the lasso . J. R. Statist. Soc. B 58 , 267 – 88 . OpenURL Placeholder Text WorldCat Tibshirani, R. J. & Taylor, J. ( 2012 ). Degrees of freedom in lasso problems . Ann. Statist. 40 , 1198 – 232 . Google Scholar Crossref Search ADS WorldCat Tseng, P. ( 1988 ). Coordinate ascent for maximizing nondifferentiable concave functions . Tech. Rep. LIDS-P 1840, Laboratory for Information and Decision Systems , Massachusetts Institute of Technology . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Wilms, I. & Croux, C. ( 2015 ). Sparse canonical correlation analysis from a predictive point of view . Biomet. J. 57 , 834 – 51 . Google Scholar Crossref Search ADS WorldCat Witten, D. M. & Tibshirani, R. J. ( 2009 ). Extensions of sparse canonical correlation analysis with applications to genomic data . Statist. Applic. Genet. Molec. Biol. 8 , 1 – 27 . Google Scholar Crossref Search ADS WorldCat Witten, D. M. , Tibshirani, R. J. & Hastie, T. ( 2009 ). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis . Biostatistics 10 , 515 – 34 . Google Scholar Crossref Search ADS PubMed WorldCat Zoh, R. S. , Mallick, B., Ivanov, I., Baladandayuthapani, V., Manyam, G., Chapkin, R. S., Lampe, J. W. & Carroll, R. J. ( 2016 ). PCAN: Probabilistic correlation analysis of two non-normal data sets . Biometrics 72 , 1358 – 68 . Google Scholar Crossref Search ADS PubMed WorldCat © 2020 Biometrika Trust This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

1996

1995

1994

1993

1992

1991

1990

1989

1988

1987

1986

1985

1984

1983

1982

1981

1980

1979

1978

1977

1976

1975

1974

1973

1972

1971

1970

1969

1968

1967

1966

1965

1964

1963

1962

1961

1960

1959

1958

1957

1956

1955

1954

1953

1952

1951

1950

1949

1948

1947

1946

1945

1942