# On causal estimation using $$U$$ -statistics

On causal estimation using $$U$$ -statistics Summary We introduce a general class of causal estimands which extends the familiar notion of average treatment effect. The class is defined by a contrast function, prespecified to quantify the relative favourability of one outcome over another, averaged over the marginal distributions of two potential outcomes. Natural estimators arise in the form of $$U$$-statistics. We derive both a naive inverse propensity score weighted estimator and a class of locally efficient and doubly robust estimators. The usefulness of our theory is illustrated by two examples, one for causal estimation with ordinal outcomes, and the other for causal tests that are robust with respect to outliers. 1. Motivation Under the counterfactual paradigm of causal inference (Rubin, 1974), each subject is associated with two outcome variables, denoted by $$Y(1)$$ and $$Y(0)$$, representing the potential outcomes had the subject been assigned to the active treatment and control, respectively. The causal effect of the active treatment as compared to the control is typically measured by $$E\{Y(1)-Y(0)\}$$, which is commonly referred to as the average treatment effect. Most of the literature on causal inference under the counterfactual framework has focused on this quantity. Consider a more general set-up. Suppose that we compare the two potential outcomes by some arbitrary contrast function $$h: \mathcal Y\times\mathcal Y\to\mathbb R$$, where $$\mathcal Y$$ is the space in which the potential outcomes reside. Typically, $$h(t,s)$$ quantifies the relative favourability of outcome $$t$$ over outcome $$s$$. For the average treatment effect, the contrast function is $$h_0(t,s)=t-s$$; however, this function may be inappropriate when the outcomes are ordinal but non-numeric or are susceptible to outliers. In such cases, a common alternative is to use $$h(t,s)=I(t\geqslant s)$$, where $$I(\cdot)$$ is the indicator function. When the contrast function is nonlinear, such as $$I(t\geqslant s)$$, we must exercise caution when defining the population-averaged causal effect. It might be tempting to define the population causal estimand as $$\tau^*_h= E[h\{Y(1),Y(0)\}]$$, which represents the individual causal effects averaged over the population. However, $$\tau^*_h$$ is in general not estimable because the joint distribution of $$\{Y(1),Y(0)\}$$ is not identifiable from the observed data. In the Supplementary Material we illustrate the nonidentifiability of $$\tau^*_h$$ using a simple example of bivariate normal outcomes. To circumvent the nonidentifiability problem, we instead consider an average over the marginal distributions, which are identifiable under standard assumptions. Specifically, let $$\nu_k$$ denote the marginal distribution of $$Y(k)$$$$(k=1, 0)$$. Then the population-level causal effect is defined as   $$\label{eq:estimand_def} \tau_h=\int\!\!\int h(t,s)\nu_1({\rm d} t)\nu_0({\rm d} s),$$ (1) which can be interpreted as the average superiority, as measured by $$h$$, of a randomly chosen outcome from the hypothetical population where all subjects are assigned to treatment over another randomly chosen outcome from the hypothetical population where all subjects are assigned to control. If $$\{Y_1(1),Y_1(0)\}$$ and $$\{Y_2(1),Y_2(0)\}$$ are two independent replicates of $$\{Y(1),Y(0)\}$$, then (1) can be represented as $$\tau_h= E[h\{Y_1(1),Y_2(0)\}]$$. In the special case with $$h_0(t,s)= t-s$$, we have $$\tau^*_{h_0}=\tau_{h_0}$$, but this equality does not hold for nonlinear $$h$$. In this article, we establish a framework for estimation of (1) using semiparametric efficiency theory (Bickel et al., 1993; Robins et al., 1994; van der Laan & Robins, 2003; Tsiatis, 2006). We propose estimators taking the form of $$U$$-statistics and use $$U$$-statistic theory (Hoeffding, 1948; Arcones & Giné, 1993) to derive their asymptotic properties. Our work extends semiparametric estimation theory for the average treatment effect, treated in detail in Tsiatis (2006, Ch. 13), to the general class of estimands defined in (1). 2. Estimation theory 2.1. Set-up Let $$Z$$ denote a vector of pre-treatment variables which may confound the relationship between treatment assignment and the potential outcomes. Let $$W$$ denote the treatment assignment indicator; that is, $$W=1$$ or $$0$$ if the subject is assigned to active treatment or control, respectively. Let $$Y$$ denote the observed outcome and assume that $$Y=WY(1)+(1-W)Y(0)$$ (Rubin, 1978). Throughout, we impose the strong ignorability assumption (Rosenbaum & Rubin, 1983)   $$\label{eq:ignore} \{Y(1), Y(0)\}\perp\!\!\!\perp W\mid Z,$$ (2) where $$\perp\!\!\!\perp$$ denotes independence, and the positivity condition that $$\delta\leqslant {\rm pr}(W=1\mid Z)\leqslant 1-\delta$$ almost surely for some $$\delta>0$$. The observable data then consist of   $$\label{eq:obs} (Y_i, W_i, Z_i)\quad (i=1,\ldots, n),$$ (3) which are $$n$$ replicates of independent and identically distributed observations of $$(Y, W, Z)$$. 2.2. Influence functions with observed data With the observed data (3), nonparametric estimation relying on the ignorability condition (2) alone is in general infeasible due to the possible curse of dimensionality (see, e.g., van der Laan & Robins, 2003, § § 1.2.3 and 1.2.4), and so one needs lower-dimensional models. Assume a parametric model for the propensity score (Rosenbaum & Rubin, 1983),   $$\label{eq:pi_model} \pi(Z;\psi)={\rm pr}(W=1\mid Z;\psi),$$ (4) where $$\psi$$ is a finite-dimensional parameter with true value $$\psi_0$$. Denote model (4) by $$\mathcal M_\pi$$. Typical choices of $$\mathcal M_\pi$$ include logistic and probit regressions. Thus, the model for (3) is semiparametric as its tangent space is constrained by $$\mathcal M_\pi$$ and the assumptions made in § 2.1. First, we characterize the space of influence functions for all regular and asymptotically linear semiparametric estimators, which will be used to construct optimal estimators in § 2.4. Theorem 1. The space of all influence functions for $$\tau_h$$ with the observed data is  \begin{equation*}\label{eq:if_class} \big\{\chi_h^{\rm IP}(\mathcal O)+q(W,Z):q\in\mathcal Q\big\}, \end{equation*} where $$\mathcal O=(Y,W,Z)$$,  \begin{equation*} \chi_h^{\rm IP}(\mathcal O)=\frac{W}{\pi(Z;\psi_0)} \,h_1(Y)+\frac{1-W}{1-\pi(Z;\psi_0)}\,h_0(Y)-2\tau_h, \end{equation*} $$h_1(t)=\int h(t,s)\nu_0({\rm d} s)$$, $$h_0(s)=\int h(t,s)\nu_1({\rm d} t)$$ and $$\mathcal Q=\{\{W-\pi(Z;\psi_0)\}L(Z):E\{L(Z)^2\}<\infty\}$$.  The superscript IP indicates inverse propensity score. The space $$\mathcal Q$$ is called the augmentation space in Tsiatis (2006). The proof of Theorem 1 is relegated to the Supplementary Material. 2.3. Inverse propensity score weighted estimator Unlike with the average treatment effect, the form of the influence functions is not immediately instructive in constructing estimators, as the representation in Theorem 1 contains the unknown functions $$h_1$$ and $$h_0$$. Notice, however, that the function of observed data   $g_h^{\rm IP}(\mathcal O_i,\mathcal O_j;\psi_0)=2^{-1}\left[\frac{W_i(1-W_j)h(Y_i,Y_j)}{\pi(Z_i;\psi_0)\{1-\pi(Z_j;\psi_0)\}} +\frac{W_j(1-W_i)h(Y_j,Y_i)}{\pi(Z_j;\psi_0)\{1-\pi(Z_i;\psi_0)\}}\right]$ has expectation $$\tau_h$$ for any $$i=\hspace{-8pt}|\,\,\, j$$ thanks to the strong ignorability assumption. This suggests an inverse propensity score weighted estimator   $\hat\tau_{hn}^{\rm IP}={n \choose 2}^{-1}\sum_{i<j}\sum g_h^{\rm IP}(\mathcal O_i,\mathcal O_j;\hat\psi_n),$ where $$\hat\psi_n$$ is the maximum likelihood estimator for $$\psi_0$$ under $$\mathcal M_\pi$$ based on $$(W_i,Z_i)$$$$(i=1,\ldots, n)$$. The next proposition is useful in conducting inference on $$\tau_h$$ based on $$\hat\tau_{hn}^{\rm IP}$$. Proposition 1. The estimator $$\hat\tau_{hn}^{\rm IP}$$ is consistent and asymptotically linear. The asymptotic variance of $$n^{1/2}(\hat\tau_{hn}^{\rm IP}-\tau_h)$$ can be consistently estimated by  $n^{-1}\sum_{i=1}^n\left[2\big\{\hat g_1^{\rm IP}(\mathcal O_i;\hat\psi_n)-\hat\tau_{hn}^{\rm IP}\big\}+A_n^{ \mathrm{\scriptscriptstyle T} }\,\tilde l_\psi(W_i, Z_i;\hat\psi_n)\right]^{\otimes 2},$ where $$\hat g_1^{\rm IP}(\mathcal O_i;\psi)=(n-1)^{-1}\sum_{j=\hspace{-8pt}|\, i}g_h^{\rm IP}(\mathcal O_i,\mathcal O_j;\psi)$$,  $A_n={n \choose 2}^{-1}\sum_{i<j}\sum \frac{\partial}{\partial\psi}\,g_h^{\rm IP}(\mathcal O_i,\mathcal O_j;\psi)\bigg|_{\psi=\hat\psi_n},$ $$\tilde l_\psi(W, Z;\psi)$$ is the efficient influence function for $$\psi$$, and $$a^{\otimes 2}=aa^{ \mathrm{\scriptscriptstyle T} }$$ for any vector $$a$$.  Because $$\hat\tau_{hn}^{\rm IP}$$ contains an estimated quantity $$\hat\psi_n$$, standard techniques of variance estimation for $$U$$-statistics do not apply. The key step in the proof of Proposition 1 is to derive the influence function of $$\hat\tau_{hn}^{\rm IP}$$ while taking into account the randomness of $$\hat\psi_n$$. This amounts to using a uniform central limit theorem for parameter-indexed $$U$$-processes (Arcones & Giné, 1993) and then separating out the influence of the estimated parameter by the delta method. Details are given in the Supplementary Material. 2.4. Locally efficient and doubly robust estimator The inverse propensity score estimators use the pre-treatment variables $$Z$$ only for correcting the confounding bias. More efficient estimators can be constructed by exploiting associations between $$Z$$ and the potential outcomes. It follows from our Theorem 1 and Theorem 13.1 of Tsiatis (2006) that the efficient influence function for $$\tau_h$$ is   \begin{align*} \chi_h(\mathcal O)&=\frac{W}{\pi(Z;\psi_0)}\, h_1(Y)+\frac{1-W}{1-\pi(Z;\psi_0)}\,h_0(Y)-2\tau_h\notag\\ &\quad -\frac{W-\pi(Z;\psi_0)}{\pi(Z;\psi_0)}\,E[h_1\{Y(1)\}\mid Z]+\frac{W-\pi(Z;\psi_0)}{1-\pi(Z;\psi_0)}\,E[h_0\{Y(0)\}\mid Z]\text{.} \end{align*} Construction of estimators with the efficient influence function $$\chi_h(\mathcal O)$$ is generally feasible only if we postulate parametric models for the conditional distributions of $$Y(k)$$ given $$Z$$ ($$k=1,0$$). In that case, we aim to construct locally efficient estimators when these models are true. Let $$\nu_k(\cdot\mid Z;\beta)$$ denote the conditional laws of $$Y(k)$$ ($$k=1, 0$$) given $$Z$$, indexed by a Euclidean parameter $$\beta$$. Denote the intersection of these two marginal models by $$\mathcal M_\mu$$. We propose the following procedure for constructing a locally efficient estimator. Given $$Z_i$$ and $$Z_j$$ with $$i=\hspace{-8pt}|\, j$$, let   $\mu(Z_i,Z_j;\beta)=E\bigl[h\{Y_i(1),Y_j(0)\}\mid Z_i,Z_j;\beta\bigr]\text{.}$ The function $$\mu$$ equals   $\mu(Z_i,Z_j;\beta)=\int\!\!\int h(t,s)\nu_1({\rm d} t\mid Z_i;\beta)\nu_0({\rm d} s\mid Z_j;\beta)\text{.}$ Then, find a regular asymptotically linear estimator $$\hat\beta_n$$ for $$\beta$$. Finally, construct the estimator   $\hat\tau_{hn}^{\rm DR}={n \choose 2}^{-1}\sum_{i<j}\sum g_h^{\rm DR}(\mathcal O_i,\mathcal O_j;\hat\psi_n,\hat\beta_n),$ where   \begin{align*} g_h^{\rm DR}(\mathcal O_i,\mathcal O_j;\psi,\beta)& =2^{-1}\left( \frac{W_i(1-W_j)h(Y_i,Y_j)}{\pi(Z_i;\psi)\{1-\pi(Z_j;\psi)\}} -\left[\frac{W_i(1-W_j)}{\pi(Z_i;\psi)\{1-\pi(Z_j;\psi)\}}-1\right]\mu(Z_i,Z_j;\beta)\right.\\ & \qquad\qquad \left.+\frac{W_j(1-W_i)h(Y_j,Y_i)}{\pi(Z_j;\psi)\{1-\pi(Z_i;\psi)\}}-\left[\frac{W_j(1-W_i)}{\pi(Z_j;\psi)\{1-\pi(Z_i;\psi)\}}-1\right]\mu(Z_j,Z_i;\beta)\right)\!\text{.} \end{align*} The superscript DR refers to double robustness. The next theorem establishes local efficiency and double robustness of $$\hat\tau_{hn}^{\rm DR}$$. Theorem 2. The estimator $$\hat\tau_{hn}^{\rm DR}$$ is a semiparametric estimator under $$\mathcal M_\pi$$, is locally efficient if $$\mathcal M_\mu$$ is true, and is doubly robust in the sense of being consistent and asymptotically linear under $$\mathcal M_\pi\cup\mathcal M_\mu$$.  Proof 1. By Hoeffding’s decomposition theorem (Hoeffding, 1948), if we can show that   \begin{equation*} 2\big[E\{g_h^{\rm DR}(\mathcal O_i,\mathcal O_j;\psi_0,\beta_0)\mid\mathcal O_i\}-\tau_h\big]=\chi_h(\mathcal O_i), \end{equation*} where $$\beta_0$$ is the true value of $$\beta$$, then the rest of the proof is essentially the same as in the case of the average treatment effect discussed in Tsiatis (2006, Ch. 13); see the Supplementary Material for details. □ Finally, robust variance estimators for $$\hat\tau_{hn}^{\rm DR}$$ can be constructed in similar ways to the construction for $$\hat\tau_{hn}^{\rm IP}$$ in Proposition 1. We give a detailed discussion in the Supplementary Material about different classes of variance estimators for $$\hat\tau_{hn}^{\rm DR}$$. 3. Examples 3.1. Win ratio for ordinal outcomes There is limited work on causal inference with ordinal outcomes, largely due to the difficulty of defining a meaningful causal estimand. In randomized clinical trials with ordinal outcomes, Wang & Pocock (2016) introduced a win ratio approach to comparing two treatment arms, using the idea of the win ratio from multivariate censored outcomes (Pocock et al., 2012). All possible pairs are formed between patients from the two treatment arms, and the win ratio is calculated as the fraction of pairs for which the active treatment is more favourable. To formalize the idea of the win ratio in the causal inference setting, denote the outcome space by $$\mathcal Y=\{1,\ldots, L\}$$, where $$L$$ is a positive integer, and assume that larger values are more favourable. Let $$\{Y_i(1), Y_i(0)\}$$$$(i=1, 2)$$ denote two independent replicates of $$\{Y(1), Y(0)\}$$. Then the causal win ratio for the treatment over control can be defined as $$\tau={\rm pr}\{Y_1(1)>Y_2(0)\}/{\rm pr}\{Y_1(1)<Y_2(0)\}$$, i.e., the odds of a randomly chosen subject from the all-treatment population having a more favourable outcome than one from the all-control population. The estimand $$\tau$$ is a smooth function of the two-dimensional $$\tau_h$$ defined in (1) with $$h(t,s)=\{I(t>s), I(t<s)\}$$. Application of our theory to this particular $$h$$ function is straightforward. We focus only on the construction of $$\mathcal M_\mu$$ and calculation of the augmentation terms. Since the outcomes are ordinal, it is convenient to posit proportional odds regression models for $$Y(k)$$ against $$Z$$$$(k=1, 0)$$. Let $$p_{kl}(Z;\beta)$$ denote the model-based conditional probability of $$Y(k)=l$$ given $$Z$$ and regression parameter $$\beta$$$$(l=1,\ldots, L)$$. Then it is easy to show that   $\mu(Z_i, Z_j;\beta)=\left\{\sum_{l=1}^L\sum_{l'=1}^{l-1}p_{1l}(Z_i;\beta)p_{0l'}(Z_j;\beta), \: \sum_{l=1}^L\sum_{l'=l+1}^Lp_{1l}(Z_i;\beta)p_{0l'}(Z_j;\beta)\right\}\!\text{.}$ 3.2. Mann–Whitney-type statistics and tests Tests for comparing the locations of two populations based on sample averages, such as the $$t$$ test, are sensitive to outliers. The Mann–Whitney test bases inference on a two-sample $$U$$-statistic with a uniformly bounded kernel function, thereby limiting the influence of large observations. Wu et al. (2014) adapted the Mann–Whitney statistic for causal inference using a functional response model. However, no semiparametric efficiency consideration was discussed there. Our results provide a natural platform for studying Mann–Whitney-type statistics in causal inference. Assume that the potential outcomes are continuous and that $$\mathcal Y=\mathbb R$$. Then the kernel function for comparing the two potential outcome populations can be set to $$h(t,s)=I(t\geqslant s)$$, and the estimand is thus $$\tau_h={\rm pr}\{Y_1(1)\geqslant Y_2(0)\}$$, where $$Y_1(1)$$ and $$Y_2(0)$$ are defined similarly to their counterparts in § 3.1. The estimand $$\tau_h$$ is sometimes referred to as the Mann–Whitney parameter and is commonly used to discriminate between two populations in medical studies (see, e.g., Koziol & Jia, 2009). We again focus on constructing $$\mathcal M_\mu$$ and calculating the augmentation terms. Suppose that we posit a working model of normal linear regression, i.e., $$Y(k)\mid Z\sim N(\beta_k^{ \mathrm{\scriptscriptstyle T} }Z,\sigma^2)$$. The maximum likelihood estimators for $$(\beta_1,\beta_0)$$ and $$\sigma^2$$ can easily be computed, and it can be shown that   $\mu(Z_i,Z_j;\beta_1,\beta_0,\sigma^2)=\Phi\!\left(\frac{\beta_1^{ \mathrm{\scriptscriptstyle T} }Z_i-\beta_0^{ \mathrm{\scriptscriptstyle T} }Z_j}{2^{1/2}\sigma}\right)\!,$ where $$\Phi(\cdot)$$ is the standard normal cumulative distribution function. To test the null hypothesis of no causal effect, i.e., $$H_0:\nu_1=\nu_0$$, one can use the test statistic $$n^{1/2}(\hat\tau_{hn}^{\rm DR}-2^{-1})$$. Simulation studies, described in the Supplementary Material, show that our methods perform well in finite samples and are superior to tests based on average treatment effects in the presence of outliers. In particular, the variance estimators are computationally fast and accurate. 4. Discussion Just as the ordinary $$U$$-statistics can be viewed as generalized sample averages, the causal $$U$$-statistics introduced in this article are natural extensions to estimators for the average treatment effect. Although we have focused on complete $$U$$-statistics based on the exhaustive set of $$n(n-1)/2$$ pairs, analogous results for incomplete $$U$$-statistics (Blom, 1976) can easily be obtained. The causal estimand defined in (1) is based on contrasts between the potential outcome populations. Some estimands of interest, such as those measuring causal effects on dispersion and inequality, may be based on contrasts within each population. An example is the treatment effect on the Gini coefficient (Imbens & Rubin, 2015, Ch. 20). The theory and methods presented in this paper can be readily extended to such estimands. Acknowledgement I am grateful to two referees, the associate editor and the editor, whose comments and suggestions have led to significant improvement of the manuscript. Supplementary material Supplementary material available at Biometrika online includes proofs of the theoretical results, technical notes, simulation studies, and code that allows the reader to replicate the numerical results. References Arcones, M. A. & Giné, E. ( 1993). Limit theorems for $$U$$-processes. Ann. Prob.  21, 1494– 542. Google Scholar CrossRef Search ADS   Bickel, P. J., Klaassen, C. A., Ritov, Y. A. & Wellner, J. A. ( 1993). Efficient and Adaptive Estimation for Semiparametric Models . Baltimore: Johns Hopkins University Press. Blom, G. ( 1976). Some properties of incomplete $$U$$-statistics. Biometrika  63, 573– 80. Google Scholar CrossRef Search ADS   Hoeffding, W. ( 1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist.  19, 293– 325. Google Scholar CrossRef Search ADS   Imbens, G. W. & Rubin, D. B. ( 2015). Causal Inference in Statistics, Social, and Biomedical Sciences . Cambridge: Cambridge University Press. Google Scholar CrossRef Search ADS   Koziol, J. A. & Jia, Z. ( 2009). The concordance index $$C$$ and the Mann–Whitney parameter Pr$$(X> Y)$$ with randomly censored data. Biomet. J.  51, 467– 74. Google Scholar CrossRef Search ADS   Pocock, S. J., Ariti, C. A., Collier, T. J. & Wang, D. ( 2012). The win ratio: A new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Euro. Heart J.  33, 176– 82. Google Scholar CrossRef Search ADS   Robins, J. M., Rotnitzky, A. & Zhao, L. P. ( 1994). Estimation of regression coefficients when some regressors are not always observed. J. Am. Statist. Assoc.  89, 846– 66. Google Scholar CrossRef Search ADS   Rosenbaum, P. R. & Rubin, D. B. ( 1983). The central role of the propensity score in observational studies for causal effects. Biometrika  70, 41– 55. Google Scholar CrossRef Search ADS   Rubin, D. B. ( 1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol.  66, 688– 701. Google Scholar CrossRef Search ADS   Rubin, D. B. ( 1978). Bayesian inference for causal effects: The role of randomization. Ann. Statist.  6, 34– 58. Google Scholar CrossRef Search ADS   Tsiatis, A. ( 2006). Semiparametric Theory and Missing Data . New York: Springer. van der Laan, M. J. & Robins, J. M. ( 2003). Unified Methods for Censored Longitudinal Data and Causality.  New York: Springer. Google Scholar CrossRef Search ADS   Wang, D. & Pocock, S. J. ( 2016). A win ratio approach to comparing continuous nonnormal outcomes in clinical trials. Pharm. Statist.  15, 238– 45. Google Scholar CrossRef Search ADS   Wu, P., Han, Y., Chen, T. & Tu, X. M. ( 2014). Causal inference for Mann–Whitney–Wilcoxon rank sum and other nonparametric statistics. Statist. Med.  33, 1261– 71. Google Scholar CrossRef Search ADS   © 2017 Biometrika Trust http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Biometrika Oxford University Press

# On causal estimation using $$U$$ -statistics

, Volume 105 (1) – Mar 1, 2018
6 pages

/lp/ou_press/on-causal-estimation-using-u-statistics-qU8i2brjhI
Publisher
Oxford University Press
ISSN
0006-3444
eISSN
1464-3510
D.O.I.
10.1093/biomet/asx071
Publisher site
See Article on Publisher Site

### Abstract

Summary We introduce a general class of causal estimands which extends the familiar notion of average treatment effect. The class is defined by a contrast function, prespecified to quantify the relative favourability of one outcome over another, averaged over the marginal distributions of two potential outcomes. Natural estimators arise in the form of $$U$$-statistics. We derive both a naive inverse propensity score weighted estimator and a class of locally efficient and doubly robust estimators. The usefulness of our theory is illustrated by two examples, one for causal estimation with ordinal outcomes, and the other for causal tests that are robust with respect to outliers. 1. Motivation Under the counterfactual paradigm of causal inference (Rubin, 1974), each subject is associated with two outcome variables, denoted by $$Y(1)$$ and $$Y(0)$$, representing the potential outcomes had the subject been assigned to the active treatment and control, respectively. The causal effect of the active treatment as compared to the control is typically measured by $$E\{Y(1)-Y(0)\}$$, which is commonly referred to as the average treatment effect. Most of the literature on causal inference under the counterfactual framework has focused on this quantity. Consider a more general set-up. Suppose that we compare the two potential outcomes by some arbitrary contrast function $$h: \mathcal Y\times\mathcal Y\to\mathbb R$$, where $$\mathcal Y$$ is the space in which the potential outcomes reside. Typically, $$h(t,s)$$ quantifies the relative favourability of outcome $$t$$ over outcome $$s$$. For the average treatment effect, the contrast function is $$h_0(t,s)=t-s$$; however, this function may be inappropriate when the outcomes are ordinal but non-numeric or are susceptible to outliers. In such cases, a common alternative is to use $$h(t,s)=I(t\geqslant s)$$, where $$I(\cdot)$$ is the indicator function. When the contrast function is nonlinear, such as $$I(t\geqslant s)$$, we must exercise caution when defining the population-averaged causal effect. It might be tempting to define the population causal estimand as $$\tau^*_h= E[h\{Y(1),Y(0)\}]$$, which represents the individual causal effects averaged over the population. However, $$\tau^*_h$$ is in general not estimable because the joint distribution of $$\{Y(1),Y(0)\}$$ is not identifiable from the observed data. In the Supplementary Material we illustrate the nonidentifiability of $$\tau^*_h$$ using a simple example of bivariate normal outcomes. To circumvent the nonidentifiability problem, we instead consider an average over the marginal distributions, which are identifiable under standard assumptions. Specifically, let $$\nu_k$$ denote the marginal distribution of $$Y(k)$$$$(k=1, 0)$$. Then the population-level causal effect is defined as   $$\label{eq:estimand_def} \tau_h=\int\!\!\int h(t,s)\nu_1({\rm d} t)\nu_0({\rm d} s),$$ (1) which can be interpreted as the average superiority, as measured by $$h$$, of a randomly chosen outcome from the hypothetical population where all subjects are assigned to treatment over another randomly chosen outcome from the hypothetical population where all subjects are assigned to control. If $$\{Y_1(1),Y_1(0)\}$$ and $$\{Y_2(1),Y_2(0)\}$$ are two independent replicates of $$\{Y(1),Y(0)\}$$, then (1) can be represented as $$\tau_h= E[h\{Y_1(1),Y_2(0)\}]$$. In the special case with $$h_0(t,s)= t-s$$, we have $$\tau^*_{h_0}=\tau_{h_0}$$, but this equality does not hold for nonlinear $$h$$. In this article, we establish a framework for estimation of (1) using semiparametric efficiency theory (Bickel et al., 1993; Robins et al., 1994; van der Laan & Robins, 2003; Tsiatis, 2006). We propose estimators taking the form of $$U$$-statistics and use $$U$$-statistic theory (Hoeffding, 1948; Arcones & Giné, 1993) to derive their asymptotic properties. Our work extends semiparametric estimation theory for the average treatment effect, treated in detail in Tsiatis (2006, Ch. 13), to the general class of estimands defined in (1). 2. Estimation theory 2.1. Set-up Let $$Z$$ denote a vector of pre-treatment variables which may confound the relationship between treatment assignment and the potential outcomes. Let $$W$$ denote the treatment assignment indicator; that is, $$W=1$$ or $$0$$ if the subject is assigned to active treatment or control, respectively. Let $$Y$$ denote the observed outcome and assume that $$Y=WY(1)+(1-W)Y(0)$$ (Rubin, 1978). Throughout, we impose the strong ignorability assumption (Rosenbaum & Rubin, 1983)   $$\label{eq:ignore} \{Y(1), Y(0)\}\perp\!\!\!\perp W\mid Z,$$ (2) where $$\perp\!\!\!\perp$$ denotes independence, and the positivity condition that $$\delta\leqslant {\rm pr}(W=1\mid Z)\leqslant 1-\delta$$ almost surely for some $$\delta>0$$. The observable data then consist of   $$\label{eq:obs} (Y_i, W_i, Z_i)\quad (i=1,\ldots, n),$$ (3) which are $$n$$ replicates of independent and identically distributed observations of $$(Y, W, Z)$$. 2.2. Influence functions with observed data With the observed data (3), nonparametric estimation relying on the ignorability condition (2) alone is in general infeasible due to the possible curse of dimensionality (see, e.g., van der Laan & Robins, 2003, § § 1.2.3 and 1.2.4), and so one needs lower-dimensional models. Assume a parametric model for the propensity score (Rosenbaum & Rubin, 1983),   $$\label{eq:pi_model} \pi(Z;\psi)={\rm pr}(W=1\mid Z;\psi),$$ (4) where $$\psi$$ is a finite-dimensional parameter with true value $$\psi_0$$. Denote model (4) by $$\mathcal M_\pi$$. Typical choices of $$\mathcal M_\pi$$ include logistic and probit regressions. Thus, the model for (3) is semiparametric as its tangent space is constrained by $$\mathcal M_\pi$$ and the assumptions made in § 2.1. First, we characterize the space of influence functions for all regular and asymptotically linear semiparametric estimators, which will be used to construct optimal estimators in § 2.4. Theorem 1. The space of all influence functions for $$\tau_h$$ with the observed data is  \begin{equation*}\label{eq:if_class} \big\{\chi_h^{\rm IP}(\mathcal O)+q(W,Z):q\in\mathcal Q\big\}, \end{equation*} where $$\mathcal O=(Y,W,Z)$$,  \begin{equation*} \chi_h^{\rm IP}(\mathcal O)=\frac{W}{\pi(Z;\psi_0)} \,h_1(Y)+\frac{1-W}{1-\pi(Z;\psi_0)}\,h_0(Y)-2\tau_h, \end{equation*} $$h_1(t)=\int h(t,s)\nu_0({\rm d} s)$$, $$h_0(s)=\int h(t,s)\nu_1({\rm d} t)$$ and $$\mathcal Q=\{\{W-\pi(Z;\psi_0)\}L(Z):E\{L(Z)^2\}<\infty\}$$.  The superscript IP indicates inverse propensity score. The space $$\mathcal Q$$ is called the augmentation space in Tsiatis (2006). The proof of Theorem 1 is relegated to the Supplementary Material. 2.3. Inverse propensity score weighted estimator Unlike with the average treatment effect, the form of the influence functions is not immediately instructive in constructing estimators, as the representation in Theorem 1 contains the unknown functions $$h_1$$ and $$h_0$$. Notice, however, that the function of observed data   $g_h^{\rm IP}(\mathcal O_i,\mathcal O_j;\psi_0)=2^{-1}\left[\frac{W_i(1-W_j)h(Y_i,Y_j)}{\pi(Z_i;\psi_0)\{1-\pi(Z_j;\psi_0)\}} +\frac{W_j(1-W_i)h(Y_j,Y_i)}{\pi(Z_j;\psi_0)\{1-\pi(Z_i;\psi_0)\}}\right]$ has expectation $$\tau_h$$ for any $$i=\hspace{-8pt}|\,\,\, j$$ thanks to the strong ignorability assumption. This suggests an inverse propensity score weighted estimator   $\hat\tau_{hn}^{\rm IP}={n \choose 2}^{-1}\sum_{i<j}\sum g_h^{\rm IP}(\mathcal O_i,\mathcal O_j;\hat\psi_n),$ where $$\hat\psi_n$$ is the maximum likelihood estimator for $$\psi_0$$ under $$\mathcal M_\pi$$ based on $$(W_i,Z_i)$$$$(i=1,\ldots, n)$$. The next proposition is useful in conducting inference on $$\tau_h$$ based on $$\hat\tau_{hn}^{\rm IP}$$. Proposition 1. The estimator $$\hat\tau_{hn}^{\rm IP}$$ is consistent and asymptotically linear. The asymptotic variance of $$n^{1/2}(\hat\tau_{hn}^{\rm IP}-\tau_h)$$ can be consistently estimated by  $n^{-1}\sum_{i=1}^n\left[2\big\{\hat g_1^{\rm IP}(\mathcal O_i;\hat\psi_n)-\hat\tau_{hn}^{\rm IP}\big\}+A_n^{ \mathrm{\scriptscriptstyle T} }\,\tilde l_\psi(W_i, Z_i;\hat\psi_n)\right]^{\otimes 2},$ where $$\hat g_1^{\rm IP}(\mathcal O_i;\psi)=(n-1)^{-1}\sum_{j=\hspace{-8pt}|\, i}g_h^{\rm IP}(\mathcal O_i,\mathcal O_j;\psi)$$,  $A_n={n \choose 2}^{-1}\sum_{i<j}\sum \frac{\partial}{\partial\psi}\,g_h^{\rm IP}(\mathcal O_i,\mathcal O_j;\psi)\bigg|_{\psi=\hat\psi_n},$ $$\tilde l_\psi(W, Z;\psi)$$ is the efficient influence function for $$\psi$$, and $$a^{\otimes 2}=aa^{ \mathrm{\scriptscriptstyle T} }$$ for any vector $$a$$.  Because $$\hat\tau_{hn}^{\rm IP}$$ contains an estimated quantity $$\hat\psi_n$$, standard techniques of variance estimation for $$U$$-statistics do not apply. The key step in the proof of Proposition 1 is to derive the influence function of $$\hat\tau_{hn}^{\rm IP}$$ while taking into account the randomness of $$\hat\psi_n$$. This amounts to using a uniform central limit theorem for parameter-indexed $$U$$-processes (Arcones & Giné, 1993) and then separating out the influence of the estimated parameter by the delta method. Details are given in the Supplementary Material. 2.4. Locally efficient and doubly robust estimator The inverse propensity score estimators use the pre-treatment variables $$Z$$ only for correcting the confounding bias. More efficient estimators can be constructed by exploiting associations between $$Z$$ and the potential outcomes. It follows from our Theorem 1 and Theorem 13.1 of Tsiatis (2006) that the efficient influence function for $$\tau_h$$ is   \begin{align*} \chi_h(\mathcal O)&=\frac{W}{\pi(Z;\psi_0)}\, h_1(Y)+\frac{1-W}{1-\pi(Z;\psi_0)}\,h_0(Y)-2\tau_h\notag\\ &\quad -\frac{W-\pi(Z;\psi_0)}{\pi(Z;\psi_0)}\,E[h_1\{Y(1)\}\mid Z]+\frac{W-\pi(Z;\psi_0)}{1-\pi(Z;\psi_0)}\,E[h_0\{Y(0)\}\mid Z]\text{.} \end{align*} Construction of estimators with the efficient influence function $$\chi_h(\mathcal O)$$ is generally feasible only if we postulate parametric models for the conditional distributions of $$Y(k)$$ given $$Z$$ ($$k=1,0$$). In that case, we aim to construct locally efficient estimators when these models are true. Let $$\nu_k(\cdot\mid Z;\beta)$$ denote the conditional laws of $$Y(k)$$ ($$k=1, 0$$) given $$Z$$, indexed by a Euclidean parameter $$\beta$$. Denote the intersection of these two marginal models by $$\mathcal M_\mu$$. We propose the following procedure for constructing a locally efficient estimator. Given $$Z_i$$ and $$Z_j$$ with $$i=\hspace{-8pt}|\, j$$, let   $\mu(Z_i,Z_j;\beta)=E\bigl[h\{Y_i(1),Y_j(0)\}\mid Z_i,Z_j;\beta\bigr]\text{.}$ The function $$\mu$$ equals   $\mu(Z_i,Z_j;\beta)=\int\!\!\int h(t,s)\nu_1({\rm d} t\mid Z_i;\beta)\nu_0({\rm d} s\mid Z_j;\beta)\text{.}$ Then, find a regular asymptotically linear estimator $$\hat\beta_n$$ for $$\beta$$. Finally, construct the estimator   $\hat\tau_{hn}^{\rm DR}={n \choose 2}^{-1}\sum_{i<j}\sum g_h^{\rm DR}(\mathcal O_i,\mathcal O_j;\hat\psi_n,\hat\beta_n),$ where   \begin{align*} g_h^{\rm DR}(\mathcal O_i,\mathcal O_j;\psi,\beta)& =2^{-1}\left( \frac{W_i(1-W_j)h(Y_i,Y_j)}{\pi(Z_i;\psi)\{1-\pi(Z_j;\psi)\}} -\left[\frac{W_i(1-W_j)}{\pi(Z_i;\psi)\{1-\pi(Z_j;\psi)\}}-1\right]\mu(Z_i,Z_j;\beta)\right.\\ & \qquad\qquad \left.+\frac{W_j(1-W_i)h(Y_j,Y_i)}{\pi(Z_j;\psi)\{1-\pi(Z_i;\psi)\}}-\left[\frac{W_j(1-W_i)}{\pi(Z_j;\psi)\{1-\pi(Z_i;\psi)\}}-1\right]\mu(Z_j,Z_i;\beta)\right)\!\text{.} \end{align*} The superscript DR refers to double robustness. The next theorem establishes local efficiency and double robustness of $$\hat\tau_{hn}^{\rm DR}$$. Theorem 2. The estimator $$\hat\tau_{hn}^{\rm DR}$$ is a semiparametric estimator under $$\mathcal M_\pi$$, is locally efficient if $$\mathcal M_\mu$$ is true, and is doubly robust in the sense of being consistent and asymptotically linear under $$\mathcal M_\pi\cup\mathcal M_\mu$$.  Proof 1. By Hoeffding’s decomposition theorem (Hoeffding, 1948), if we can show that   \begin{equation*} 2\big[E\{g_h^{\rm DR}(\mathcal O_i,\mathcal O_j;\psi_0,\beta_0)\mid\mathcal O_i\}-\tau_h\big]=\chi_h(\mathcal O_i), \end{equation*} where $$\beta_0$$ is the true value of $$\beta$$, then the rest of the proof is essentially the same as in the case of the average treatment effect discussed in Tsiatis (2006, Ch. 13); see the Supplementary Material for details. □ Finally, robust variance estimators for $$\hat\tau_{hn}^{\rm DR}$$ can be constructed in similar ways to the construction for $$\hat\tau_{hn}^{\rm IP}$$ in Proposition 1. We give a detailed discussion in the Supplementary Material about different classes of variance estimators for $$\hat\tau_{hn}^{\rm DR}$$. 3. Examples 3.1. Win ratio for ordinal outcomes There is limited work on causal inference with ordinal outcomes, largely due to the difficulty of defining a meaningful causal estimand. In randomized clinical trials with ordinal outcomes, Wang & Pocock (2016) introduced a win ratio approach to comparing two treatment arms, using the idea of the win ratio from multivariate censored outcomes (Pocock et al., 2012). All possible pairs are formed between patients from the two treatment arms, and the win ratio is calculated as the fraction of pairs for which the active treatment is more favourable. To formalize the idea of the win ratio in the causal inference setting, denote the outcome space by $$\mathcal Y=\{1,\ldots, L\}$$, where $$L$$ is a positive integer, and assume that larger values are more favourable. Let $$\{Y_i(1), Y_i(0)\}$$$$(i=1, 2)$$ denote two independent replicates of $$\{Y(1), Y(0)\}$$. Then the causal win ratio for the treatment over control can be defined as $$\tau={\rm pr}\{Y_1(1)>Y_2(0)\}/{\rm pr}\{Y_1(1)<Y_2(0)\}$$, i.e., the odds of a randomly chosen subject from the all-treatment population having a more favourable outcome than one from the all-control population. The estimand $$\tau$$ is a smooth function of the two-dimensional $$\tau_h$$ defined in (1) with $$h(t,s)=\{I(t>s), I(t<s)\}$$. Application of our theory to this particular $$h$$ function is straightforward. We focus only on the construction of $$\mathcal M_\mu$$ and calculation of the augmentation terms. Since the outcomes are ordinal, it is convenient to posit proportional odds regression models for $$Y(k)$$ against $$Z$$$$(k=1, 0)$$. Let $$p_{kl}(Z;\beta)$$ denote the model-based conditional probability of $$Y(k)=l$$ given $$Z$$ and regression parameter $$\beta$$$$(l=1,\ldots, L)$$. Then it is easy to show that   $\mu(Z_i, Z_j;\beta)=\left\{\sum_{l=1}^L\sum_{l'=1}^{l-1}p_{1l}(Z_i;\beta)p_{0l'}(Z_j;\beta), \: \sum_{l=1}^L\sum_{l'=l+1}^Lp_{1l}(Z_i;\beta)p_{0l'}(Z_j;\beta)\right\}\!\text{.}$ 3.2. Mann–Whitney-type statistics and tests Tests for comparing the locations of two populations based on sample averages, such as the $$t$$ test, are sensitive to outliers. The Mann–Whitney test bases inference on a two-sample $$U$$-statistic with a uniformly bounded kernel function, thereby limiting the influence of large observations. Wu et al. (2014) adapted the Mann–Whitney statistic for causal inference using a functional response model. However, no semiparametric efficiency consideration was discussed there. Our results provide a natural platform for studying Mann–Whitney-type statistics in causal inference. Assume that the potential outcomes are continuous and that $$\mathcal Y=\mathbb R$$. Then the kernel function for comparing the two potential outcome populations can be set to $$h(t,s)=I(t\geqslant s)$$, and the estimand is thus $$\tau_h={\rm pr}\{Y_1(1)\geqslant Y_2(0)\}$$, where $$Y_1(1)$$ and $$Y_2(0)$$ are defined similarly to their counterparts in § 3.1. The estimand $$\tau_h$$ is sometimes referred to as the Mann–Whitney parameter and is commonly used to discriminate between two populations in medical studies (see, e.g., Koziol & Jia, 2009). We again focus on constructing $$\mathcal M_\mu$$ and calculating the augmentation terms. Suppose that we posit a working model of normal linear regression, i.e., $$Y(k)\mid Z\sim N(\beta_k^{ \mathrm{\scriptscriptstyle T} }Z,\sigma^2)$$. The maximum likelihood estimators for $$(\beta_1,\beta_0)$$ and $$\sigma^2$$ can easily be computed, and it can be shown that   $\mu(Z_i,Z_j;\beta_1,\beta_0,\sigma^2)=\Phi\!\left(\frac{\beta_1^{ \mathrm{\scriptscriptstyle T} }Z_i-\beta_0^{ \mathrm{\scriptscriptstyle T} }Z_j}{2^{1/2}\sigma}\right)\!,$ where $$\Phi(\cdot)$$ is the standard normal cumulative distribution function. To test the null hypothesis of no causal effect, i.e., $$H_0:\nu_1=\nu_0$$, one can use the test statistic $$n^{1/2}(\hat\tau_{hn}^{\rm DR}-2^{-1})$$. Simulation studies, described in the Supplementary Material, show that our methods perform well in finite samples and are superior to tests based on average treatment effects in the presence of outliers. In particular, the variance estimators are computationally fast and accurate. 4. Discussion Just as the ordinary $$U$$-statistics can be viewed as generalized sample averages, the causal $$U$$-statistics introduced in this article are natural extensions to estimators for the average treatment effect. Although we have focused on complete $$U$$-statistics based on the exhaustive set of $$n(n-1)/2$$ pairs, analogous results for incomplete $$U$$-statistics (Blom, 1976) can easily be obtained. The causal estimand defined in (1) is based on contrasts between the potential outcome populations. Some estimands of interest, such as those measuring causal effects on dispersion and inequality, may be based on contrasts within each population. An example is the treatment effect on the Gini coefficient (Imbens & Rubin, 2015, Ch. 20). The theory and methods presented in this paper can be readily extended to such estimands. Acknowledgement I am grateful to two referees, the associate editor and the editor, whose comments and suggestions have led to significant improvement of the manuscript. Supplementary material Supplementary material available at Biometrika online includes proofs of the theoretical results, technical notes, simulation studies, and code that allows the reader to replicate the numerical results. References Arcones, M. A. & Giné, E. ( 1993). Limit theorems for $$U$$-processes. Ann. Prob.  21, 1494– 542. Google Scholar CrossRef Search ADS   Bickel, P. J., Klaassen, C. A., Ritov, Y. A. & Wellner, J. A. ( 1993). Efficient and Adaptive Estimation for Semiparametric Models . Baltimore: Johns Hopkins University Press. Blom, G. ( 1976). Some properties of incomplete $$U$$-statistics. Biometrika  63, 573– 80. Google Scholar CrossRef Search ADS   Hoeffding, W. ( 1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist.  19, 293– 325. Google Scholar CrossRef Search ADS   Imbens, G. W. & Rubin, D. B. ( 2015). Causal Inference in Statistics, Social, and Biomedical Sciences . Cambridge: Cambridge University Press. Google Scholar CrossRef Search ADS   Koziol, J. A. & Jia, Z. ( 2009). The concordance index $$C$$ and the Mann–Whitney parameter Pr$$(X> Y)$$ with randomly censored data. Biomet. J.  51, 467– 74. Google Scholar CrossRef Search ADS   Pocock, S. J., Ariti, C. A., Collier, T. J. & Wang, D. ( 2012). The win ratio: A new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Euro. Heart J.  33, 176– 82. Google Scholar CrossRef Search ADS   Robins, J. M., Rotnitzky, A. & Zhao, L. P. ( 1994). Estimation of regression coefficients when some regressors are not always observed. J. Am. Statist. Assoc.  89, 846– 66. Google Scholar CrossRef Search ADS   Rosenbaum, P. R. & Rubin, D. B. ( 1983). The central role of the propensity score in observational studies for causal effects. Biometrika  70, 41– 55. Google Scholar CrossRef Search ADS   Rubin, D. B. ( 1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol.  66, 688– 701. Google Scholar CrossRef Search ADS   Rubin, D. B. ( 1978). Bayesian inference for causal effects: The role of randomization. Ann. Statist.  6, 34– 58. Google Scholar CrossRef Search ADS   Tsiatis, A. ( 2006). Semiparametric Theory and Missing Data . New York: Springer. van der Laan, M. J. & Robins, J. M. ( 2003). Unified Methods for Censored Longitudinal Data and Causality.  New York: Springer. Google Scholar CrossRef Search ADS   Wang, D. & Pocock, S. J. ( 2016). A win ratio approach to comparing continuous nonnormal outcomes in clinical trials. Pharm. Statist.  15, 238– 45. Google Scholar CrossRef Search ADS   Wu, P., Han, Y., Chen, T. & Tu, X. M. ( 2014). Causal inference for Mann–Whitney–Wilcoxon rank sum and other nonparametric statistics. Statist. Med.  33, 1261– 71. Google Scholar CrossRef Search ADS   © 2017 Biometrika Trust

### Journal

BiometrikaOxford University Press

Published: Mar 1, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 12 million articles from more than
10,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Unlimited reading Read as many articles as you need. Full articles with original layout, charts and figures. Read online, from anywhere. ### Stay up to date Keep up with your field with Personalized Recommendations and Follow Journals to get automatic updates. ### Organize your research It’s easy to organize your research with our built-in tools. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. ### Monthly Plan • Read unlimited articles • Personalized recommendations • No expiration • Print 20 pages per month • 20% off on PDF purchases • Organize your research • Get updates on your journals and topic searches$49/month

14-day Free Trial

Best Deal — 39% off

### Annual Plan

• All the features of the Professional Plan, but for 39% off!
• Billed annually
• No expiration
• For the normal price of 10 articles elsewhere, you get one full year of unlimited access to articles.

$588$360/year

billed annually

14-day Free Trial