# Shape-constrained partial identification of a population mean under unknown probabilities of sample selection

Shape-constrained partial identification of a population mean under unknown probabilities of... Summary Estimating a population mean from a sample obtained with unknown selection probabilities is important in the biomedical and social sciences. Using a ratio estimator, Aronow & Lee (2013) proposed a method for partial identification of the mean by allowing the unknown selection probabilities to vary arbitrarily between two fixed values. In this paper, we show how to use auxiliary shape constraints on the population outcome distribution, such as symmetry or log-concavity, to obtain tighter bounds on the population mean. We use this method to estimate the performance of Aymara students, an ethnic minority in the north of Chile, in a national educational standardized test. We implement this method in the R package scbounds. 1. Introduction A common challenge in the biomedical and social sciences is to estimate a population mean from a sample obtained with unknown probabilities of sample selection. This is often the case when drawing inferences about mobile populations, such as homeless people and hospital outpatients, or hard-to-reach populations, such as injection drug users and some ethnic minorities. In general, this problem arises when a sampling frame is unavailable or unreliable, and when there is no or limited information about the sampling design. The estimation problem can be formalized as follows. Let $${\mathcal{P}}$$ denote a potentially infinite population, and let $$F$$ denote the cumulative distribution function of our outcome of interest, $$Y$$, over $${\mathcal{P}}$$. Our goal is to estimate the population mean $$\mu = E_F(Y)$$. To do so, we have access to a random sample $${\mathcal{S}} = \left\{ {{Y_i}} \right\}$$ of size $$n$$ obtained via biased sampling. Concretely, we can imagine that $${\mathcal{S}}$$ was generated using an accept-reject scheme as follows: until we have $$n$$ observations, repeatedly draw independent and identically distributed pairs $$(Y, \pi) \in {\mathbb{R}} \times (0, 1]$$ where $$Y \sim F$$, and then add $$Y$$ to our sample $${\mathcal{S}}$$ with probability $$\pi$$. Whenever the inverse sampling probabilities $$\pi_i^{-1}$$ are correlated with $$Y_i$$, the sample mean will be an inconsistent estimator for the population mean. If these sampling probabilities $$\pi_i$$ for our $$n$$ sampled observations were known, then we could use the following ratio estimator, which is consistent for $$\mu$$ under weak conditions (Hájek, 1958; Cochran, 1977):   $$\hat \mu ^* = \sum_{i = 1}^n \pi_i^{-1} Y_i \, \bigg/ \sum_{i = 1}^n \pi_i^{-1}\text{.}$$ (1) Here, however, we suppose that the sampling weights $$\pi_i$$ are unknown. Aronow & Lee (2013) showed that it is possible to obtain meaningful identification intervals for $$\mu$$ in the sense of Manski (2003), even if we only have bounds on the sampling weights $$\pi_i$$. Suppose we know that $$\max(\pi_i) / \min(\pi_i) \leqslant \gamma$$ for some constant $$\gamma < \infty$$. This gives an asymptotically consistent identification interval $${\mathcal{I}}_{\textrm{AL}} = [\hat \mu _{\textrm{AL}}^-,\, \hat \mu _{\textrm{AL}}^+]$$ for $$\mu$$, where   $$\hat \mu _{\textrm{AL}}^{+} = \sup \left\{ {{\sum_{i = 1}^n w_i Y_i : \sum_{i = 1}^n w_i = 1, \; \frac{\max(w_i)}{\min(w_i)} \leqslant \gamma}} \right\}$$ (2) and $$\hat \mu _{\textrm{AL}}^{-}$$ is the infimum over the same set. Aronow & Lee (2013) also developed an efficient algorithm for computing these bounds. While this approach can yield identification intervals for $$\mu$$ under weak assumptions, the Aronow–Lee bounds can be unnecessarily pessimistic. To understand why, it is helpful to consider their method as first estimating the true population $$F$$ as   $\hat F_{w}(y) = \sum_{i=1}^n w_i \, 1(Y_i \leqslant y),$ where $$w_i$$ are the maximizing or minimizing weights in (2), and then setting the limits of the interval to $$E_{\hat F_{w}}(Y)$$ for the two resulting extreme sets of weights. The problem is that the population distributions implied by these extreme weights are often rather implausible in practice. Specifically, the weights $$w_i$$ induced by (2) correspond to a step function depending on whether or not $$Y_i$$ falls below some threshold, and so the weighted empirical distribution functions $$\hat F_{\textrm{AL}}^{+}$$ and $$\hat F_{\textrm{AL}}^{-}$$ have a sharp change in slope at that threshold, as illustrated in § 3. This threshold can also be interpreted as a substantial discontinuity in an associated density, provided we are willing to posit the existence of such a density; see the figure in § 3. Such sharp elbows in the estimated cumulative distribution function often contradict expert knowledge about the true population distribution $$F$$. For example, physical measurements in the biological and medical sciences often exhibit a bell-shaped distribution, as do stock returns and other indicators in finance, as well as mechanical error measurements in industry. In this paper we study how to use such auxiliary information about the shape of the population outcome distribution $$F$$ to obtain shorter identification intervals for $$\mu$$ by ruling out implausible weightings in order to tighten the resulting identification bounds. We allow for various types of specifications for $$F$$, such as parametric assumptions, and shape constraints based on symmetry or log-concavity. In general, the more we are willing to assume about $$F$$, the shorter the resulting identification intervals. At one extreme, if we know that $$F$$ is Gaussian, then we can substantially shorten the identification intervals, whereas if we make weaker assumptions, for instance that $$F$$ has a log-concave density, then we get smaller but still noticeable improvements over $${\mathcal{I}}_{\textrm{AL}}$$. We focus on the situation where $$F$$ has real-valued support; when $$F$$ has categorical support, it is less common to have access to plausible shape constraints. This paper relates to the literature on biased sampling, empirical likelihood, and exponential tilting (Owen, 1988; Efron & Tibshirani, 1996; de Carvalho & Davison, 2014; Fithian & Wager, 2015; Qin, 2017). We implement the proposed methods in the R (R Development Core Team, 2018) package scbounds. 2. Tighter identification bounds via shape constraints 2.1. Characterization of the oracle estimator Our goal is to use existing information about the population distribution $$F$$, for example that $$F$$ is symmetric or log-concave, to obtain tighter identification bounds for $$\mu = E_F(Y)$$. Operationally, we seek to encode such information about $$F$$ into constraints that can be added to the optimization problem (2). Throughout the paper, we assume that $$\mu$$ is well-defined and finite. Our analysis focuses on the weighted empirical distribution function $${\hat{F}}^*$$ induced by the oracle ratio estimator $${\hat{\mu}}^*$$ in (1) which has access to the true sampling probabilities $$\pi_i$$ that give corresponding oracle weights $$w_i^*$$:   \begin{equation*} {\hat{F}}^*(y) = \sum_{i = 1}^n \pi^{-1}_i \, 1(Y_i \leqslant y)\bigg/ \sum_{i = 1}^n \pi^{-1}_i = \sum_{i = 1}^n w^*_i \, 1(Y_i \leqslant y)\text{.} \end{equation*} Any shape constraint on $$F$$ that lets us control the behaviour of $${\hat{F}}^*$$ induces an asymptotically consistent identification interval for the population mean $$\mu$$. The intuition is that if we can construct sets of distribution functions that are as constrained as possible while still containing our oracle $$\hat{F}^{*}$$ with high probability, then the optimization problem will contain the oracle estimate $$\hat{\mu}^*$$, giving short but still consistent bounds. Theorem 1. Suppose that we have auxiliary information on $$F$$ that lets us construct sets of distribution functions $${\mathcal{C}}_{\gamma, n}$$ with the property that $${\rm{lim}_{n\rightarrow\infty}} \mathrm{pr}({\hat{F}}^* \in {\mathcal{C}}_{\gamma, n}) = 1$$. Then, if we define  $$\label{eq:main} {\hat{\mu}}^+ = \sup\left\{ {\sum_{i = 1}^n w_i Y_i : \sum_{i = 1}^n w_i = 1, \ \frac{\max(w_i)}{\min(w_i)} \leqslant \gamma, \ {\hat{F}}_w \in {\mathcal{C}}_{\gamma,n}} \right\}$$ (3)and $${\hat{\mu}}^-$$ as the infimum in the analogous optimization problem, the resulting identification interval $${\mathcal{I}} = [{\hat{\mu}}^-,{\hat{\mu}}^+]$$ is asymptotically valid in the sense that $$\Delta(\mu, {\mathcal{I}})$$ converges in probability to $$0$$, where $$\Delta(\mu, {\mathcal{I}})$$ is the distance between $$\mu$$ and the nearest point in the interval $${\mathcal{I}}$$.  The intervals $${\mathcal{I}}$$ can never be wider than those of Aronow & Lee (2013), because the optimization problem (3) has strictly more constraints than the original optimization problem (2). Theorem 1 shows that if we have any auxiliary information about $$F$$, then the identification bounds of Aronow & Lee (2013) are needlessly long. However, it cannot directly guide practical data analysis. First, it leaves open the problem of how to turn shape constraints on $$F$$ into plausibility sets $${\mathcal{C}}_{\gamma, n}$$ that contain $${\hat{F}}^*$$ with high probability. Second, Theorem 1 is not useful if we cannot solve the optimization problem (3) in practice. Our next concern is to address these issues given specific side information about $$F$$. 2.2. Identification bounds in parametric families Although our main goal is to provide inference under shape constraints on $$F$$, we begin by considering the parametric case $$F = F_\theta$$ for some $$\theta \in \Theta$$, which allows us to construct particularly simple plausibility sets $${\mathcal{C}}_{\gamma, n}$$. Our approach is built around a Kolmogorov–Smirnov-type concentration bound for ratio estimators, and relies on finding the worst-case weighted distribution with $$\pi_{\max}/\pi_{\min} \leqslant \gamma$$ in terms of the characterization of Marcus & Shepp (1972) for tails of Gaussian processes; see the proof in the Appendix. Lemma 1. Suppose that we have a population and a sampling scheme for which $$\pi_{\min} \leqslant \pi_i \leqslant \pi_{\max}$$ with $$\pi_{\max}/\pi_{\min} \leqslant \gamma$$; and, for $$\alpha > 0$$, define  $$\label{eq:ks} \rho_{\alpha, \, n} = \mathrm{pr} \!\left[ n^{1/2} \, \sup_{y \in {\mathbb{R}}} \big| {{\hat{F}}^*(y) - F(y)}\big| \geqslant \left\{ { \frac{\sigma_\gamma^2\left( {1 + \gamma} \right)\left( {1 + \gamma^{-1}} \right) \log \alpha^{-1}}{2}} \right\}^{1/2} \right]\!,$$ (4)for a constant $$\sigma_\gamma^2 \leqslant 1$$ defined as the maximum of a concave function specified in the proof. Then the limiting probability $$\rho_{\alpha, n}$$ of large tail exceedances is bounded as follows:  $\limsup_{\alpha \rightarrow 0} \,\Bigl( \limsup_{n \rightarrow \infty} \: \log \rho^{-1}_{\alpha, n} \Big/ \log \alpha^{-1} \Bigr) \geqslant 1\text{.}$ Substituting $$\alpha = n^{-1/2}$$ into (4) and considering the union of all possible population distributions $$F(\cdot) = F_\theta(\cdot)$$ with $$\theta \in \Theta$$ suggests using the plausibility set   $$\label{eq:param_C} \begin{split} {\mathcal{C}}_{\gamma, n}\left( {\Theta} \right) & = \bigcup_{\theta \in \Theta} \Big\{ H : \sup_{y \in {\mathbb{R}}} \big| H(y) - F_\theta(y)\big| \leqslant \delta_{\gamma, n}\Big\}, \\ \delta_{\gamma, n} & = \left\{ { \frac{\sigma_\gamma^2 \left( {1 + \gamma} \right)\left( {1 + \gamma^{-1}} \right) \log n}{4n}} \right\}^{1/2}\text{.} \end{split}$$ (5) Regardless of the true parameter value $$\theta \in \Theta$$, we have $$\hat{F}^* \in {\mathcal{C}}_{\gamma, n}(\Theta)$$ with probability tending to unity, so we immediately obtain the following result. Corollary 1. Suppose that, under the conditions of Lemma 1, we know that $$F = F_\theta$$ for some $$\theta \in \Theta$$, and let $${\mathcal{C}}_{\gamma, n}\left( {\Theta} \right)$$ be as in (5). Then (3) provides an asymptotically valid identification interval for $$\mu$$ as $$n \to \infty$$.  Furthermore, we can check that the resulting intervals are asymptotically sharp. We implement this procedure by first solving the optimization problem   $$\label{eq:theta_opt} {\hat{\mu}}^+_\theta = \sup\left\{ {\sum_{i = 1}^n w_i Y_i : \sum_{i = 1}^n w_i = 1, \: \frac{\max(w_i)}{\min(w_i)} \leqslant \gamma, \: \sup_{y \in {\mathbb{R}}} \big| {\hat{F}}_w(y) - F_\theta(y)\big| \leqslant \delta_{\gamma, n}} \right\}$$ (6) over a grid of candidate values $$\theta \in \Theta$$ and then setting $${\hat{\mu}}^+ = \sup_\theta {\hat{\mu}}^+_\theta$$. The fractional programming problem (6) can be solved as a linear program using standard optimization methods; see, for example, Boyd & Vandenberghe (2004, § 4.3). 2.3. Relaxation of sharp shape constraints Consider the parametric family case above. Lemma 1 implies that $${\hat{F}}^*(\cdot)$$ will grow arbitrarily close to some $$F_\theta$$ with probability unity. This can be a very strong assumption: under a Gaussian family, for example, this implies that, with increasing $$n$$, not only are our bounds sharp but they will collapse to a single point as $${\mathcal{C}}_{\gamma, n}(\Theta)$$ shrinks, because of the tightening of $$\delta_{\gamma, n}$$. If we instead impose a shape constraint of $$\max_y | F(y) - F_\delta(y) | \leqslant \delta^*$$ for some $$\delta^*$$ and unknown $$\theta \in \Theta$$, we can expand our set $${\mathcal{C}}_{\gamma, n}(\Theta)$$ to   ${\mathcal{C}}_{\gamma, n}(\Theta) = \bigcup_{\theta \in \Theta} \left\{ {H : \sup_{y \in {\mathbb{R}}} \big| H(y) - F_\theta(y)\big| \leqslant \delta_{\gamma, n} + \delta^*} \right\}\!\text{.}$ By the triangle inequality for Kolmogorov–Smirnov distances we still have, in the limit, $${\hat{F}}^*$$ in our set with probability 1 and, therefore, valid bounds on our mean. This relaxation allows for restricting the shape of our unknown distribution to be near a given parametric family without imposing a strong parametric assumption. The $$\delta^*$$ is a sensitivity parameter, and practitioners could examine how the bounds respond to different choices of $$\delta^*$$ values. Here we instead investigate methods that make general shape constraint assumptions rather than these parametric ones. 2.4. Identification bounds with symmetry We now return to our main topic of interest, i.e., how to leverage shape constraints on $$F$$ to obtain improved identification bounds for $$\mu$$. The difference between parametric and shape-constrained side information about $$F$$ is that grid search is not usually practical over all distributions in some shape-constrained class, so the algorithm based on (6) does not generalize. Rather, for each candidate shape-constrained class, we may need to find an ad hoc way to avoid a full grid search. First, we consider the case where $$F$$ is symmetric, i.e., there is some value $$m \in {\mathbb{R}}$$ such that $$F(m + y) = 1 - F(m - y)$$ for all $$y \in {\mathbb{R}}$$. Such symmetry constraints mesh particularly well with our approach. In order to make use of them, we establish the following analogue of Lemma 1. Unlike Lemma 1, however, this result holds for any value of $$\alpha > 0$$ and not only for small $$\alpha \rightarrow 0$$; from a technical perspective, it follows directly from Donsker arguments used to prove the classical Kolmogorov–Smirnov theorem, and does not require the additional machinery of Marcus & Shepp (1972). Lemma 2. Suppose that we have a population and a sampling scheme for which $$\pi_{\min} \leqslant \pi_i \leqslant \pi_{\max}$$ for some $$\pi_{\max}/\pi_{\min} \leqslant \gamma$$. Then, for any $$\alpha > 0$$,  $$\label{eq:symmerty_KS} \lim_{n \rightarrow \infty} \mathrm{pr} \!\left[ n^{1/2} \sup_{q \in [0, \, 1]} \left| {{\hat{F}}^* \{F^{-1}(q)\} + {\hat{F}}^*\{F^{-1}(1 - q)\} - 1 } \right| \geqslant \zeta_{\gamma, \alpha} \right] \leqslant \alpha,$$ (7)with  \begin{equation*} \zeta_{\gamma, \alpha} = \Phi^{-1}\!\left( {1 - \frac{\alpha}{4}} \right) \left\{ {\frac{\left( {1 + \gamma} \right)\left( {1 + \gamma^{-1}} \right)}{4}} \right\}^{1/2}\!\text{.} \end{equation*} To draw a connection to Lemma 1, we note that $$\Phi^{-1}\left( {1 - \alpha/4} \right) \asymp (2\log \alpha^{-1})^{1/2}$$ for small values of $$\alpha$$. Lemma 2 holds regardless of symmetry. If the distribution $$F(\cdot)$$ is symmetric about $$m$$, then any pair $$\{F^{-1}(q), \, F^{-1}(1 - q)\}$$ with $$q \in [0, 1]$$ can be written as $$(m - y, \, m + y)$$ for some $$y \in {\mathbb{R}}$$. Thus, in the case of symmetric distributions, (7) provides a tail bound on the supremum of $${\hat{F}}^*(m + y) + {\hat{F}}^*(m - y) - 1$$ over $$y \in {\mathbb{R}}$$, and suggests using the estimator   $$\label{eq:symmetry_bound} \begin{split} {\hat{\mu}}^+_m & = \sup\left\{ {\sum_{i = 1}^n w_i Y_i : \sum_{i = 1}^n w_i = 1, \ \frac{\max(w_i)}{\min(w_i)} \leqslant \gamma, \ {\hat{F}}_w \in {\mathcal{C}}_{\gamma, n}^{\mathrm{SYM}}} \right\}\!, \\ {\mathcal{C}}_{\gamma, n}^{\mathrm{SYM}} & = \bigcup_{m \in {\mathbb{R}}} \Big\{H : \sup_y \big| H(m + y) + H(m - y) - 1\big|\leqslant \zeta_{\gamma, \, n^{-1/2}}\Big\}\text{.} \end{split}$$ (8) The lower bound $${\hat{\mu}}^-$$ is computed analogously. This algorithm thus enables us to use symmetry constraints while performing a grid search over only a single parameter $$m$$, i.e., the centre of symmetry. The identification intervals defined by (8) are again asymptotically valid and sharp. 2.5. Identification bounds with log-concavity Finally, we consider the case where $$F$$ is known to have a log-concave density. Imposing log-concavity constraints appears to be a promising way of encoding side information about $$F$$: the class of log-concave distributions is quite flexible, including most of the widely-used parametric distributions with continuous support, while enforcing regularity properties such as unimodality and exponentially decaying tails (Walther, 2009). Unlike in the case of symmetry, there does not appear to be a simple way to turn log-concavity constraints into asymptotically sharp identification bounds for $$\mu$$ using only linear programming and a low-dimensional grid search. However, we can still use log-concavity to obtain computationally tractable improvements over the bounds of Aronow & Lee (2013). Below, we detail our procedure for obtaining $${\hat{\mu}}^{+}$$; to obtain $${\hat{\mu}}^{-}$$ we can apply the same procedure to $$-Y_i$$. We posit the existence of a known upper bound $$Y_i \leqslant y_{\max}$$ for the outcomes $$Y_i$$ to ensure integrability. Step 1. Let $${\hat{S}}(y) = n^{-1} \sum_i 1(Y_i \leqslant y)$$ and $${\hat{S}}_{\mathrm{KS}}(y) = \max\{{\hat{S}}(y) - n^{-1/2}D^{(-1)}_{\mathrm{KS}}(1 - n^{-1/2}) , \, 0\}$$, where $$D^{(-1)}_{\mathrm{KS}}(\cdot)$$ denotes the inverse Kolmogorov–Smirnov cumulative distribution function. By the Kolmogorov–Smirnov theorem, $$S(y) \geqslant {\hat{S}}_{\mathrm{KS}}(y)$$ for all $$y \in {\mathbb{R}}$$ with probability tending to 1, where $$S(y)$$ is the distribution function of the observed, biased, sample. Step 2. Next, in the proof, we show that for some $$m \in {\mathbb{R}}$$,   \begin{equation*} \label{eq:U_definition} {\hat{U}}_m(y) = \left[ {\hat{S}}_{\mathrm{KS}}(y) + (\gamma - 1)\big\{{\hat{S}}_{\mathrm{KS}}(y) - {\hat{S}}_{\mathrm{KS}}(m)\bigr\}_+\right] \Big/ \left[ {\hat{S}}_{\mathrm{KS}}(m) + \gamma\bigl\{1 - {\hat{S}}_{\mathrm{KS}}(m)\bigr\} \right] \end{equation*} is a lower bound for the population distribution of interest, $$F(y)$$, with probability tending to 1. We also set $${\hat{U}}_m(y_{\max}) = 1$$ to ensure that $${\hat{U}}_m(\cdot)$$ is a cumulative distribution function. Step 3. We now use the fact that our distribution is log-concave. It is well known that if $$F$$ has a log-concave density, then $$\log F(y)$$ must be concave (Prékopa, 1973). Thanks to this fact, we know that if $$F(y) \geqslant {\hat{U}}_m(y)$$, then also $$F(y) \geqslant {\hat{L}}_m(y)$$, where   $${\hat{L}}_m(\cdot) = \mathop{\arg\min}_{L(\cdot)} \left\{ {\int_{-\infty}^{y_{\max}} \!\!L(y) \,{\rm{d}} y : \log L(y) \text{ is concave and } L(y) \geqslant {\hat{U}}_m(y) \text{ for all } y \in {\mathbb{R}}} \right\}\!;$$ $${\hat{L}}_m(\cdot)$$ is, for $$m$$, the lowest function for which $$\log L(y)$$ is concave that still lies above our overall lower bound $${\hat{U}}_m(y)$$. Step 4. Finally, we define $$C_{\gamma, n}$$ as the set of distributions satisfying at least one of the lower bounds   $$C_{\gamma, n}^{\mathrm{LC}+} = \bigcup_{m \in {\mathbb{R}}} \left\{ {H : H(y) \geqslant {\hat{L}}_m(y), \: y \in {\mathbb{R}}} \right\}\!\text{.}$$ Given this construction, we can obtain an upper endpoint for our identification interval as usual:   $${\hat{\mu}}^+ = \sup_{w} \left\{ {\sum_{i = 1}^n w_i Y_i : \sum_{i = 1}^n w_i = 1, \ \frac{\max(w_i)}{\min(w_i)} \leqslant \gamma, \ {\hat{F}}_w(y) \in C_{\gamma, n}^{\mathrm{LC}+}} \right\}\!\text{.}$$ Lemma 3 shows that $$C_{\gamma, n}^{\mathrm{LC}+}$$ contains the population sampling distribution with probability tending to 1, and so Theorem 1 establishes the validity of our identification intervals. Unlike in the parametric or symmetric cases, our log-concave identification bounds are not asymptotically sharp, i.e., they may not converge to the shortest possible identification interval given our assumptions about log-concavity and bounded sampling ratios; however, they still provide tighter intervals than do Aronow & Lee (2013). Lemma 3. Suppose that we have a population for which $$\pi_{\min} \leqslant \pi_i \leqslant \pi_{\max}$$ for some $$\pi_{\max}/\pi_{\min} \leqslant \gamma$$, and that $$F$$ has a log-concave density. Then $$\lim_{n \rightarrow \infty} \mathrm{pr} ({\hat{F}}^* \in {\mathcal{C}}_{\gamma, n}^{\mathrm{LC}+}) = 1$$.  3. Application: sampling ethnic minorities The Aymara are an indigenous population of the Andean plateau of South America, who live predominantly in Bolivia and Peru, with a small proportion living in the north of Argentina and Chile. In Chile, they constitute a minority of nearly 50 000 in a country of approximately 18 million. Across the world, it is of great importance to understand how ethnic minorities are faring in order to design effective affirmative action policies. Here we use the proposed method to bound the average performance of Aymara students in the national standardized test held in Chile for admission to higher education. This test is called Prueba de Selección Universitaria, and almost 90% of enrolled high school students take it every year; however, participation is known to be lower among vulnerable populations such as the Aymara in northern Chile. Using the sample of 847 Aymara students who took the test in mathematics in 2008, we seek identification intervals for the population mean counterfactual test score had everyone taken the test. We assume that the sampling ratio is bounded by $$\max(\pi_i) /\min(\pi_i) \leqslant \gamma = 9$$ and consider inference under the assumption that the population test score distribution is symmetric or is log-concave. We also consider the approach of Aronow & Lee (2013) without shape constraints. The observed test scores have a mean of 502 with a sample standard deviation of 104. Given $$\gamma = 9$$, we obtain population identification intervals of $$(426, \, 578)$$ assuming symmetry, $$(414, \, 589)$$ assuming log-concavity, and $$(410, \, 591)$$ without any constraints. Thus, in this example, assuming symmetry buys us shorter identification intervals than assuming log-concavity. Figure 1 depicts the weighted distribution functions $${\hat{F}}_w(\cdot)$$ underlying the upper endpoints of all three identification intervals. The sharp threshold of the weights $$w_i$$ resulting from the unconstrained method of Aronow & Lee (2013) is readily apparent. Assuming either symmetry or log-concavity of the population sampling distribution yields more regular-looking distributions. Fig. 1. View largeDownload slide Illustration of the weighted distribution functions $$\hat{F}_w(\cdot)$$ used to obtain upper endpoints $$\hat{\mu}^+$$ for our identification intervals: (a) the raw weighted cumulative distribution functions used to compute $$\hat{\mu}^+$$; (b) weighted density functions obtained by using smoothing splines to help visualize the underlying estimated densities. In each panel the dotted curve shows the observed empirical cumulative distribution, while the dashed, dot-dashed and solid curves represent $$\hat{F}_w(\cdot)$$ obtained with, respectively, log-concavity, symmetry and no constraints on the population distribution. Fig. 1. View largeDownload slide Illustration of the weighted distribution functions $$\hat{F}_w(\cdot)$$ used to obtain upper endpoints $$\hat{\mu}^+$$ for our identification intervals: (a) the raw weighted cumulative distribution functions used to compute $$\hat{\mu}^+$$; (b) weighted density functions obtained by using smoothing splines to help visualize the underlying estimated densities. In each panel the dotted curve shows the observed empirical cumulative distribution, while the dashed, dot-dashed and solid curves represent $$\hat{F}_w(\cdot)$$ obtained with, respectively, log-concavity, symmetry and no constraints on the population distribution. Acknowledgement The authors thank the referees, associate editor and editor for their helpful comments. Wager thanks Columbia University for support during his time as a postdoctoral research scholar. Zubizarreta was supported by the Alfred P. Sloan Foundation. Miratrix and Zubizarreta are also affiliated with the Department of Statistics at Harvard University. Wager is also affiliated with the Department of Statistics at Stanford University. Appendix Proof of Theorem 1 First, mirroring the argument of Aronow & Lee (2013), we note that $$\hat{F}^*$$ is itself an estimator of the form $$\hat{F}^*(y) = \sum_i w_i 1\left({Y_i \leq y}\right)$$ with $$\sum_i w_i = 1$$ and $$\max{w_i}/\min{w_i} \leq \gamma$$, for some vector of weights $$w$$. Further, if $$\hat{F}^* \in \mathcal{C}_{\gamma, n}$$ then $$\mu^* \in \mathcal{I}$$. Now, if $$\hat{F}^* \in \mathcal{C}_{\gamma, n}$$ then $$\Delta(\mu, \mathcal{I}) \leq |\hat{\mu}^* - \mu|$$, so $$\Delta(\mu, \mathcal{I}) \leq |\hat{\mu}^* - \mu|$$ with probability tending to 1. Finally, as $$|\hat{\mu}^* - \mu|$$ converges to zero by the weak law of large numbers, we have that $$\Delta(\mu, \mathcal{I})$$ also converges to zero in probability. Proof of Lemma 1 To derive this type of uniform tail bound, we proceed in the following manner. First, we verify that $$n^{1/2} \{\hat{F}^*(\cdot) - F(\cdot)\}$$ converges in distribution to a tight Gaussian process $$G(\cdot)$$; then we bound the tail probabilities of $$\sup_{y \in \mathbb{R}} |G(y)|$$ directly. The first step involves a routine application of Donsker’s theorem, as presented, for example, in van der Vaart (1998, Ch. 19); here we take the existence of a limiting process $$G(\cdot)$$ as given. Because the supremum of $$n^{1/2} \{ \hat{F}^*(\cdot) - F(\cdot) \}$$ is invariant under monotone transformations of $$Y$$ and is stochastically maximized when $$Y$$ has a continuous density without atoms, we can without loss of generality assume that $$Y \in [0, 1]$$. Our sampling framework can be modelled as first drawing triplets $$W_i = (Y_i, \pi_i, u_i) \sim F$$, with $$u_i$$ marginally distributed as Un$$[0,1]$$, and then thinning our sample if $$u_i > \pi_i$$. We make the assumption that the prethinned sample size is Poisson, which implies that the actual sample size we observe is also Poisson, $$n \sim \text{Po}(N)$$, with $$N \rightarrow \infty$$; this does not affect our conclusions but makes the derivation more direct. Given the Poisson assumption, we can further assume, again without loss of generality, that $$y$$ is scaled on the $$[0,1]$$ interval such that there exists a $$\omega > 0$$ for which   $$\label{eq:Avarh1} N \,\mathrm{var} \{\hat{H}(y)\} = \omega^2 y, \quad \hat{H}(y) = \frac{1}{N} \sum_{ (i : Y_i \leq y) } \frac{1}{\pi_i} \Big/ E_{\mathrm{obs}}\! \left( \frac{1}{\pi_i} \right)\!,$$ (A1) where $$E_{\mathrm{obs}}\{f(W_i)\} = E\{ f(W_i) \mid u_i \leq \pi_i \}$$ denotes expectations with respect to the observed, i.e., thinned and therefore biased, sampling distribution. Observe that (A1) simply rescales $$y$$ so that the variance of $$\hat{H}(y)$$, the total mass sampled below $$y$$, increases linearly with $$Y$$, allowing conception of this object as effectively a random walk. We remark that $$\hat{H}(y)$$ is an oracle unnormalized estimator of $$F(\cdot)$$, akin to a Horvitz–Thompson estimator where we normalize by an expected weight rather than actual weights. It is unbiased for $$F(y)$$ in that, for any $$y$$, $$\:E_{\textrm{obs}}\{ \hat{H}(y) \} = F(y)$$. Now, $$\hat{H}(y)$$ is a compound Poisson process with variable size increments proportional to $$1/\pi_i$$ and a rate controlled by $$F(y)$$ and $$N$$. Using standard results on such compound Poisson processes, we have   $$\label{eq:Avarh2} N \,\mathrm{var} \{\hat{H}(y)\} = E_{\mathrm{obs}} \!\bigg\{{\frac{ 1(Y_i \leq y) }{ \pi_i^2 }}\bigg\} \bigg/ E_{\mathrm{obs}}\! \left( \frac{1}{\pi_i} \right)^2\text{.}$$ (A2) In connecting (A1) and (A2), note that the additional variability due to the random sample size $$n$$ plays a critical role; the variable sample size $$n$$ allows $$\mathrm{var} \{\hat{H}(y)\}$$ to increase as there is no normalization constraint that sets $$\hat{H}(1)$$ to 1. Because we are not yet normalizing $$\hat{H}(y)$$, we can easily obtain the pairwise covariances of $$\hat{H}(y_1)$$ and $$\hat{H}(y_2)$$ for any $$y_1 < y_2$$:   \begin{equation*} \mathrm{cov}\big\{ \hat{H}(y_1), \,\hat{H}(y_2) \big\} = \mathrm{var}\{ \hat{H}(y_1) \}\text{.} \end{equation*} These pairwise covariances, coupled with the linearly increasing variance, imply that if Gaussian limits exist, which we know to be the case by Donsker’s theorem, we must have that $$N^{1/2} \{ \hat{H}(y) - F(y) \}$$ converges in distribution to $$\omega W(y),$$ where $$W(y)$$ is a standard Wiener process on $$[0, 1]$$. Then, noting that $$\hat{F}^*(y) = \hat{H}(y) / \hat{H}(1)$$, we obtain, using the delta method, the convergence in distribution   $$\label{eq:Amain} n^{1/2} \, \big\{\hat{F}^*(\cdot) - F(\cdot)\big\} \to G(y) = \omega \{W(y) - F(y) W(1)\}$$ (A3) as $$n \to \infty$$. We can use $$n^{1/2}$$ instead of $$N^{1/2}$$ thanks to Slutsky’s theorem, since $$n^{1/2} = N^{1/2} \{1 + o_{\rm p}(1)\}$$. The next step is to bound $$\omega$$. We do this by expressing $$\omega$$ in terms of population moments on the $$\pi_i$$ and then optimizing over all possible distributions of $$\pi_i$$ to maximize this expression. First, plug $$y = 1$$ into (A2) and (A1) and set them equal to each other to obtain   $$\label{eq:Aomega} \omega^2 = E_{\mathrm{obs}} \bigl( \pi_i^{-2} \bigr) \big/ E_{\mathrm{obs}} \bigl( \pi_i^{-1} \bigr)^2\text{.}$$ (A4) Next, defining $$E_{\mathrm{pop}}$$ to be the expectation without the thinning step, i.e., with respect to the underlying unbiased population, we derive the equalities   $$E_{\mathrm{obs}}\!\left({{\pi_i}^{-1}}\right) = E_{\mathrm{pop}}\!\left({\pi_i}\right)^{-1}, \quad E_{\mathrm{obs}}\!\left({{\pi_i^{-2}}}\right) = E_{\mathrm{pop}}\!\left({{\pi_i}^{-1}}\right) \big/ E_{\mathrm{pop}}\!\left({\pi_i}\right)\text{.}$$ These are derived by simply writing out the expectation integrals of the thinned sample. For example, letting $$\Pi$$ be the population distribution of the $$\pi_i$$,   $E_{\mathrm{obs}}\!\left({ \pi_i^{-1}}\right) = \frac{1}{\int \pi_i \,\mathrm{d}\,\Pi} \int \frac{1}{\pi_i} \,\pi_i \,\mathrm{d}\, \Pi = \frac{1}{E_{\mathrm{pop}}(\pi_i)} \text{.}$ These relations, substituted into (A4), imply that   $$\label{eq:bound} \omega^2 = E_{\mathrm{pop}}\!\left({{\pi_i}^{-1}}\right) E_{\mathrm{pop}}\!\left({\pi_i}\right) \leq \sup_{a \in [0, 1]} \{1 + a(\gamma - 1)\}\bigg\{{1 + a(\gamma^{-1} - 1)}\bigg\}\!,$$ (A5) where the last inequality follows from the fact that, because $$1/x$$ is a convex function of $$x$$, our expression of interest is maximized for some distribution of $$\pi_i$$ with all its weight on two discrete values representing the endpoints $$\gamma$$ and $$1$$ of the allowed range, up to a scaling constant. The supremum over $$a$$ is then the supremum over all distributions of this form, with $$a$$ being the probability of $$\pi_i = \gamma$$. We can check by calculus that the bound (A5) is maximized at $$a^* = 1/2$$, resulting in   $$\label{eq:Abounds} \omega^2 \leq (1 + \gamma)(1 + \gamma^{-1})/4\text{.}$$ (A6) It remains to bound the suprema of $$X(y) = W(y) - F(y) W(1)$$. If we had $$F(y) = y$$, then $$X(y)$$ would be a standard Brownian bridge and the next steps would be straightforward. In our case, however, $$F(y)$$ may belong to a larger class of functions, so we instead make use of the following technical lemma, proved at the end of this appendix. Lemma A1. Suppose that $$W(y)$$ is a standard Wiener process and that $$F(y)$$ is a cumulative distribution function on $$[0, 1]$$ with $$F(0) = 0$$, $$F(1) = 1$$ and, for some constant $$\gamma > 0$$, $$F(A)/F(B) \leq \gamma$$ for any intervals $$A$$ and $$B$$ with $$\lambda(A) = \lambda(B)$$, where $$\lambda(\cdot)$$ is Lebesgue measure. Then there exists a constant $$\sigma_\gamma$$, bounded by unity, for which the stochastic process $$X(y) = W(y) - F(y)W(1)$$ satisfies the bound  $$\label{eq:A3} \lim_{u \rightarrow \infty} \frac{1}{u^2} \, \log \mathrm{pr}\!\bigg\{{\sup_{y \in [0, \, 1]} |{X(y)}| > u}\bigg\} \leq \frac{-1}{2\sigma^2_\gamma}\text{.}$$ (A7) The $$\gamma$$ condition in the lemma will correspond to our control on the weights $$\pi_i$$ due to the rescaled $$y$$. To continue, letting $$X(y) = G(y) / \omega$$ and plugging this and $$u = \sigma_\gamma (2 \log \alpha^{-1})^{1/2}$$ into (A7) gives   $$\limsup_{\alpha \rightarrow 0} \: \log \mathrm{pr}\!\bigg\{{\sup_{y \in [0, 1]} |{G(y)}| > \omega \sigma_\gamma (2\log{\alpha^{-1}})^{1/2}}\bigg\}^{-1} \bigg/ \log \alpha^{-1} \geq 1\text{.}$$ The final form of the lemma follows from substituting for $$\omega$$ in (A6). Proof of Corollary 1 By Lemma 1 we know that $$\hat{F}^* \in \mathcal{C}_{\gamma, n}$$ with probability tending to 1, so the result follows immediately from Theorem 1. Proof of Lemma 2 The proofs of Lemmas 2 and 1 begin in the same way to obtain the tight Gaussian process, but then diverge in how we bound the tail behaviour of this process. To prove Lemma 2 we start with the representation in (A3). Given a $$q \in [0,0{\cdot}5]$$, use (A3) twice with $$y = F^{-1}(q)$$ and $$y = F^{-1}(1-q)$$, take the sum, and then simplify to obtain   \begin{align*} \omega^{-1} n^{1/2} \left[\hat{F}^*\{F^{-1}(q)\} + \hat{F}^*\{F^{-1}(1 - q)\} - 1 \right] & \to W\{F^{-1}(q)\} + W\{F^{-1}(1 - q)\} - W(1) \\ & = W \{ F^{-1}(q) \} - \left[ W(1) - W\{F^{-1}(1 - q)\} \right] \\ & = W \{ F^{-1}(q) + 1 - F^{-1}(1 - q)\}, \end{align*} where the last equality is in distribution. For the last step note that $$W \{F^{-1}(q) \}$$ and $$W(1) - W \{ F^{-1}(1 - q) \}$$ are two independent Gaussian processes over $$q \in [0, 0{\cdot}5]$$; $$W(1) - W(y_2)$$ is independent of $$W(y_1)$$ if $$y_1 < y_2$$ as we are effectively looking at two non-overlapping, and therefore independent, sums of our sample. Then the sum of two independent random walks is a random walk of the total distance. Finally, for $$q \in [0, 0{\cdot}5]$$, $$F^{-1}(q) + 1 - F^{-1}(1 - q)$$ takes all values in $$[0, 1]$$, so for any threshold $$t$$,   \begin{align*} \lim_{n \rightarrow \infty} \mathrm{pr} \!\left( n^{1/2} \sup_{q \in [0, 0{\cdot}5]} \left[ \hat{F}^* \{ F^{-1}(q) \} + \hat{F}^* \{ F^{-1}(1 - q) \} - 1 \right] \geq \omega t \right) & = \mathrm{pr}\left\{\sup_{y \in [0, 1]} W(y) \geq t\right\} \\ & = 2\Phi(-t), \end{align*} where the last equality is a consequence of the reflection principle for Brownian motion (see, e.g., Mörters & Peres, 2010, p. 44). The desired conclusion then follows upon applying the bound (A6) on $$\omega$$ to both tails. This lemma does not require any symmetry in $$F(\cdot)$$; the subsequent use of the lemma is what takes advantage of that structure. Proof of Lemma 3 It suffices to verify all claims made in the four steps leading to our construction of $$\mathcal{C}_{\gamma, n}^{\mathrm{LC}+}$$. Step 1 is a direct consequence of the Kolmogorov–Smirnov theorem. Steps 3 and 4 are also immediate given the result of Prékopa (1973). The optimization problem used to define $$\hat{L}_m(y)$$ is computationally tractable: finding a concave upper bound for a function $$l(y)$$ is equivalent to taking the convex hull of the curve $$\{y, \, l(y)\}$$. It remains to check Step 2, which is a direct analogue of the argument of Aronow & Lee (2013) applied to the sampling distribution $$S(\cdot)$$. Consider $$r = \min(\pi_i)/E(\pi_i)$$. If $$r$$ were known, and recalling that $$\max{\pi_i}/\min{\pi_i} \leq \gamma$$, we can follow the argument of Aronow & Lee (2013) to verify that $$F$$ is stochastically dominated by $$U_m$$, i.e., $$F(y) \geq U_m(y)$$ for all $$y \in \mathbb{R}$$, where $$U_m(\cdot)$$ is a reweighting of the observed cumulative sampling distribution $$S(\cdot)$$ given by   $$U_m(y) = \left[ S(y) + (\gamma - 1)\left\{ S(y) - S(m) \right\}_+\right] \big/ \bigl[ S(m) + \gamma\left\{ 1 - S(m) \right\} \bigr],$$ for a threshold $$m$$ characterized by   $$S(m) \big/\bigl[ S(m) + \gamma \left\{ 1 - S(m) \right\} \bigr] = r\text{.}$$ In other words, $$U_m(y)$$ corresponds to a reweighting of $$S(y)$$ with sampling weights $$\pi_i$$ jumping from a low value $$a$$ to a high value $$\gamma a$$ at threshold $$m$$. In practice, of course, we do not know $$r$$, but we still know that $$F$$ is stochastically dominated by $$U_m$$ for some $$m \in \mathbb{R}$$. The claim made in Step 2 then follows upon observing that $$S(y) \geq \hat{S}_{\mathrm{KS}}(y)$$, which implies $$U_m(y) \geq \hat{U}_m(y)$$ for all $$m, y \in \mathbb{R}$$ with probability tending to 1, thanks to Step 1. Proof of Lemma A1 We first recall the classic result of Marcus & Shepp (1972), which in our situation implies that the Gaussian process $$X(y)$$ satisfies   $$\lim_{u \rightarrow \infty} \frac{1}{u^2} \, \log \mathrm{pr} \!\left\{\, \sup_{y \in [0, 1]} \left| X(y) \right| > u \right\} = \frac{-1}{2 \sup_{y \in [0, 1]} \textrm{var}\{X(y)\} }\text{.}$$ It therefore remains to use the shape restrictions on $$F(y)$$ to bound the right-hand expression from below. To do so, it is helpful to decompose the stochastic process $$X(y)$$ into two independent parts:   \begin{equation*} X(y) = B(y) + W(1) \{ y - F(y) \}, \end{equation*} where $$B(y) = W(y) - y W(1)$$ is a standard Brownian bridge. Moreover, given our assumptions on $$F(y)$$,   \begin{equation*} \label{eq:Fbound} \big|F(y) - y\big| \leq A_\gamma(y), \quad A_\gamma(y) = \frac{\gamma y}{1 - y + \gamma y} - y \end{equation*} for $$0 \leq y \leq 0{\cdot}5$$, $$A_\gamma(y)$$ corresponds to how much more weight below $$y$$, as compared to a uniform distribution, is possible given the constraints of $$\gamma$$. A similar derivation shows that $$A_\gamma(y) = A_\gamma(1 - y)$$ for all $$y \in [0{\cdot}5, 1]$$. Thus, making use of the facts that $$\textrm{var}\{B(y)\} = y(1 - y)$$ and $$\textrm{var}\{W(1)\} = 1$$, and using $$A_\gamma(y)$$ as a bound on the $$y - F(y)$$ scaling of the $$W(1)$$ term, we deduce that   $$\textrm{var}\{X(y)\} \leq y(1 - y) + A_\gamma(y) \quad (0 \leq y \leq 1)\text{.}$$ The above function is concave on $$[0, 0{\cdot}5]$$; in particular, $$A''_\gamma(y) = -2\gamma(\gamma - 1) /(1 - y + \gamma y)^3$$, and so the above expression has a unique maximizer $$y^*$$ that can be found numerically. The desired conclusion then holds for $$\sigma^2_\gamma := y_\gamma^*(1 - y_\gamma^*) + A_\gamma(y^*_\gamma)$$; the fact that $$\sigma_\gamma^2 \leq 1$$ is immediate by inspection. References Aronow P. M. & Lee D. K. ( 2013). Interval estimation of population means under unknown but bounded probabilities of sample selection. Biometrika  100, 235– 40. Google Scholar CrossRef Search ADS   Boyd S. & Vandenberghe L. ( 2004). Convex Optimization.  Cambridge: Cambridge University Press. Google Scholar CrossRef Search ADS   Cochran W. G. ( 1977). Sampling Techniques.  New York: Wiley. de Carvalho M. & Davison A. C. ( 2014). Spectral density ratio models for multivariate extremes. J. Am. Statist. Assoc.  109, 764– 76. Google Scholar CrossRef Search ADS   Efron B. & Tibshirani R. ( 1996). Using specially designed exponential families for density estimation. Ann. Statist.  24, 2431– 61. Google Scholar CrossRef Search ADS   Fithian W. & Wager S. ( 2015). Semiparametric exponential families for heavy-tailed data. Biometrika  102, 486– 93. Google Scholar CrossRef Search ADS   Hájek J. ( 1958). On the theory of ratio estimates. Aplikace Matematiky  3, 384– 98. Manski C. F. ( 2003). Partial Identification of Probability Distributions.  New York: Springer. Marcus M. B. & Shepp L. A. ( 1972). Sample behavior of Gaussian processes. In Proc. 6th Berkeley Symp. Math. Statist. Prob., Vol. 2: Probability Theory.  Berkeley, California: University of California Press, pp. 423– 41. Mörters P. & Peres Y. ( 2010). Brownian Motion.  Cambridge: Cambridge University Press. Google Scholar CrossRef Search ADS   Owen A. B. ( 1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika  75, 237– 49. Google Scholar CrossRef Search ADS   Prékopa A. ( 1973). Logarithmic concave measures and functions. Acta Sci. Math. (Szeged)  34, 334– 43. Qin J. ( 2017). Biased Sampling, Over-identified Parameter Problems and Beyond.  Singapore: Springer. Google Scholar CrossRef Search ADS   R Development Core Team ( 2018). R: A Language and Environment for Statistical Computing.  R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org. van der Vaart A. W. ( 1998). Asymptotic Statistics.  Cambridge: Cambridge University Press. Google Scholar CrossRef Search ADS   Walther G. ( 2009). Inference and modeling with log-concave distributions. Statist. Sci.  24, 319– 27. Google Scholar CrossRef Search ADS   © 2017 Biometrika Trust http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Biometrika Oxford University Press

# Shape-constrained partial identification of a population mean under unknown probabilities of sample selection

, Volume 105 (1) – Mar 1, 2018
12 pages

/lp/ou_press/shape-constrained-partial-identification-of-a-population-mean-under-I4ewnsrIqd
Publisher
Oxford University Press
ISSN
0006-3444
eISSN
1464-3510
D.O.I.
10.1093/biomet/asx077
Publisher site
See Article on Publisher Site

### Abstract

Summary Estimating a population mean from a sample obtained with unknown selection probabilities is important in the biomedical and social sciences. Using a ratio estimator, Aronow & Lee (2013) proposed a method for partial identification of the mean by allowing the unknown selection probabilities to vary arbitrarily between two fixed values. In this paper, we show how to use auxiliary shape constraints on the population outcome distribution, such as symmetry or log-concavity, to obtain tighter bounds on the population mean. We use this method to estimate the performance of Aymara students, an ethnic minority in the north of Chile, in a national educational standardized test. We implement this method in the R package scbounds. 1. Introduction A common challenge in the biomedical and social sciences is to estimate a population mean from a sample obtained with unknown probabilities of sample selection. This is often the case when drawing inferences about mobile populations, such as homeless people and hospital outpatients, or hard-to-reach populations, such as injection drug users and some ethnic minorities. In general, this problem arises when a sampling frame is unavailable or unreliable, and when there is no or limited information about the sampling design. The estimation problem can be formalized as follows. Let $${\mathcal{P}}$$ denote a potentially infinite population, and let $$F$$ denote the cumulative distribution function of our outcome of interest, $$Y$$, over $${\mathcal{P}}$$. Our goal is to estimate the population mean $$\mu = E_F(Y)$$. To do so, we have access to a random sample $${\mathcal{S}} = \left\{ {{Y_i}} \right\}$$ of size $$n$$ obtained via biased sampling. Concretely, we can imagine that $${\mathcal{S}}$$ was generated using an accept-reject scheme as follows: until we have $$n$$ observations, repeatedly draw independent and identically distributed pairs $$(Y, \pi) \in {\mathbb{R}} \times (0, 1]$$ where $$Y \sim F$$, and then add $$Y$$ to our sample $${\mathcal{S}}$$ with probability $$\pi$$. Whenever the inverse sampling probabilities $$\pi_i^{-1}$$ are correlated with $$Y_i$$, the sample mean will be an inconsistent estimator for the population mean. If these sampling probabilities $$\pi_i$$ for our $$n$$ sampled observations were known, then we could use the following ratio estimator, which is consistent for $$\mu$$ under weak conditions (Hájek, 1958; Cochran, 1977):   $$\hat \mu ^* = \sum_{i = 1}^n \pi_i^{-1} Y_i \, \bigg/ \sum_{i = 1}^n \pi_i^{-1}\text{.}$$ (1) Here, however, we suppose that the sampling weights $$\pi_i$$ are unknown. Aronow & Lee (2013) showed that it is possible to obtain meaningful identification intervals for $$\mu$$ in the sense of Manski (2003), even if we only have bounds on the sampling weights $$\pi_i$$. Suppose we know that $$\max(\pi_i) / \min(\pi_i) \leqslant \gamma$$ for some constant $$\gamma < \infty$$. This gives an asymptotically consistent identification interval $${\mathcal{I}}_{\textrm{AL}} = [\hat \mu _{\textrm{AL}}^-,\, \hat \mu _{\textrm{AL}}^+]$$ for $$\mu$$, where   $$\hat \mu _{\textrm{AL}}^{+} = \sup \left\{ {{\sum_{i = 1}^n w_i Y_i : \sum_{i = 1}^n w_i = 1, \; \frac{\max(w_i)}{\min(w_i)} \leqslant \gamma}} \right\}$$ (2) and $$\hat \mu _{\textrm{AL}}^{-}$$ is the infimum over the same set. Aronow & Lee (2013) also developed an efficient algorithm for computing these bounds. While this approach can yield identification intervals for $$\mu$$ under weak assumptions, the Aronow–Lee bounds can be unnecessarily pessimistic. To understand why, it is helpful to consider their method as first estimating the true population $$F$$ as   $\hat F_{w}(y) = \sum_{i=1}^n w_i \, 1(Y_i \leqslant y),$ where $$w_i$$ are the maximizing or minimizing weights in (2), and then setting the limits of the interval to $$E_{\hat F_{w}}(Y)$$ for the two resulting extreme sets of weights. The problem is that the population distributions implied by these extreme weights are often rather implausible in practice. Specifically, the weights $$w_i$$ induced by (2) correspond to a step function depending on whether or not $$Y_i$$ falls below some threshold, and so the weighted empirical distribution functions $$\hat F_{\textrm{AL}}^{+}$$ and $$\hat F_{\textrm{AL}}^{-}$$ have a sharp change in slope at that threshold, as illustrated in § 3. This threshold can also be interpreted as a substantial discontinuity in an associated density, provided we are willing to posit the existence of such a density; see the figure in § 3. Such sharp elbows in the estimated cumulative distribution function often contradict expert knowledge about the true population distribution $$F$$. For example, physical measurements in the biological and medical sciences often exhibit a bell-shaped distribution, as do stock returns and other indicators in finance, as well as mechanical error measurements in industry. In this paper we study how to use such auxiliary information about the shape of the population outcome distribution $$F$$ to obtain shorter identification intervals for $$\mu$$ by ruling out implausible weightings in order to tighten the resulting identification bounds. We allow for various types of specifications for $$F$$, such as parametric assumptions, and shape constraints based on symmetry or log-concavity. In general, the more we are willing to assume about $$F$$, the shorter the resulting identification intervals. At one extreme, if we know that $$F$$ is Gaussian, then we can substantially shorten the identification intervals, whereas if we make weaker assumptions, for instance that $$F$$ has a log-concave density, then we get smaller but still noticeable improvements over $${\mathcal{I}}_{\textrm{AL}}$$. We focus on the situation where $$F$$ has real-valued support; when $$F$$ has categorical support, it is less common to have access to plausible shape constraints. This paper relates to the literature on biased sampling, empirical likelihood, and exponential tilting (Owen, 1988; Efron & Tibshirani, 1996; de Carvalho & Davison, 2014; Fithian & Wager, 2015; Qin, 2017). We implement the proposed methods in the R (R Development Core Team, 2018) package scbounds. 2. Tighter identification bounds via shape constraints 2.1. Characterization of the oracle estimator Our goal is to use existing information about the population distribution $$F$$, for example that $$F$$ is symmetric or log-concave, to obtain tighter identification bounds for $$\mu = E_F(Y)$$. Operationally, we seek to encode such information about $$F$$ into constraints that can be added to the optimization problem (2). Throughout the paper, we assume that $$\mu$$ is well-defined and finite. Our analysis focuses on the weighted empirical distribution function $${\hat{F}}^*$$ induced by the oracle ratio estimator $${\hat{\mu}}^*$$ in (1) which has access to the true sampling probabilities $$\pi_i$$ that give corresponding oracle weights $$w_i^*$$:   \begin{equation*} {\hat{F}}^*(y) = \sum_{i = 1}^n \pi^{-1}_i \, 1(Y_i \leqslant y)\bigg/ \sum_{i = 1}^n \pi^{-1}_i = \sum_{i = 1}^n w^*_i \, 1(Y_i \leqslant y)\text{.} \end{equation*} Any shape constraint on $$F$$ that lets us control the behaviour of $${\hat{F}}^*$$ induces an asymptotically consistent identification interval for the population mean $$\mu$$. The intuition is that if we can construct sets of distribution functions that are as constrained as possible while still containing our oracle $$\hat{F}^{*}$$ with high probability, then the optimization problem will contain the oracle estimate $$\hat{\mu}^*$$, giving short but still consistent bounds. Theorem 1. Suppose that we have auxiliary information on $$F$$ that lets us construct sets of distribution functions $${\mathcal{C}}_{\gamma, n}$$ with the property that $${\rm{lim}_{n\rightarrow\infty}} \mathrm{pr}({\hat{F}}^* \in {\mathcal{C}}_{\gamma, n}) = 1$$. Then, if we define  $$\label{eq:main} {\hat{\mu}}^+ = \sup\left\{ {\sum_{i = 1}^n w_i Y_i : \sum_{i = 1}^n w_i = 1, \ \frac{\max(w_i)}{\min(w_i)} \leqslant \gamma, \ {\hat{F}}_w \in {\mathcal{C}}_{\gamma,n}} \right\}$$ (3)and $${\hat{\mu}}^-$$ as the infimum in the analogous optimization problem, the resulting identification interval $${\mathcal{I}} = [{\hat{\mu}}^-,{\hat{\mu}}^+]$$ is asymptotically valid in the sense that $$\Delta(\mu, {\mathcal{I}})$$ converges in probability to $$0$$, where $$\Delta(\mu, {\mathcal{I}})$$ is the distance between $$\mu$$ and the nearest point in the interval $${\mathcal{I}}$$.  The intervals $${\mathcal{I}}$$ can never be wider than those of Aronow & Lee (2013), because the optimization problem (3) has strictly more constraints than the original optimization problem (2). Theorem 1 shows that if we have any auxiliary information about $$F$$, then the identification bounds of Aronow & Lee (2013) are needlessly long. However, it cannot directly guide practical data analysis. First, it leaves open the problem of how to turn shape constraints on $$F$$ into plausibility sets $${\mathcal{C}}_{\gamma, n}$$ that contain $${\hat{F}}^*$$ with high probability. Second, Theorem 1 is not useful if we cannot solve the optimization problem (3) in practice. Our next concern is to address these issues given specific side information about $$F$$. 2.2. Identification bounds in parametric families Although our main goal is to provide inference under shape constraints on $$F$$, we begin by considering the parametric case $$F = F_\theta$$ for some $$\theta \in \Theta$$, which allows us to construct particularly simple plausibility sets $${\mathcal{C}}_{\gamma, n}$$. Our approach is built around a Kolmogorov–Smirnov-type concentration bound for ratio estimators, and relies on finding the worst-case weighted distribution with $$\pi_{\max}/\pi_{\min} \leqslant \gamma$$ in terms of the characterization of Marcus & Shepp (1972) for tails of Gaussian processes; see the proof in the Appendix. Lemma 1. Suppose that we have a population and a sampling scheme for which $$\pi_{\min} \leqslant \pi_i \leqslant \pi_{\max}$$ with $$\pi_{\max}/\pi_{\min} \leqslant \gamma$$; and, for $$\alpha > 0$$, define  $$\label{eq:ks} \rho_{\alpha, \, n} = \mathrm{pr} \!\left[ n^{1/2} \, \sup_{y \in {\mathbb{R}}} \big| {{\hat{F}}^*(y) - F(y)}\big| \geqslant \left\{ { \frac{\sigma_\gamma^2\left( {1 + \gamma} \right)\left( {1 + \gamma^{-1}} \right) \log \alpha^{-1}}{2}} \right\}^{1/2} \right]\!,$$ (4)for a constant $$\sigma_\gamma^2 \leqslant 1$$ defined as the maximum of a concave function specified in the proof. Then the limiting probability $$\rho_{\alpha, n}$$ of large tail exceedances is bounded as follows:  $\limsup_{\alpha \rightarrow 0} \,\Bigl( \limsup_{n \rightarrow \infty} \: \log \rho^{-1}_{\alpha, n} \Big/ \log \alpha^{-1} \Bigr) \geqslant 1\text{.}$ Substituting $$\alpha = n^{-1/2}$$ into (4) and considering the union of all possible population distributions $$F(\cdot) = F_\theta(\cdot)$$ with $$\theta \in \Theta$$ suggests using the plausibility set   $$\label{eq:param_C} \begin{split} {\mathcal{C}}_{\gamma, n}\left( {\Theta} \right) & = \bigcup_{\theta \in \Theta} \Big\{ H : \sup_{y \in {\mathbb{R}}} \big| H(y) - F_\theta(y)\big| \leqslant \delta_{\gamma, n}\Big\}, \\ \delta_{\gamma, n} & = \left\{ { \frac{\sigma_\gamma^2 \left( {1 + \gamma} \right)\left( {1 + \gamma^{-1}} \right) \log n}{4n}} \right\}^{1/2}\text{.} \end{split}$$ (5) Regardless of the true parameter value $$\theta \in \Theta$$, we have $$\hat{F}^* \in {\mathcal{C}}_{\gamma, n}(\Theta)$$ with probability tending to unity, so we immediately obtain the following result. Corollary 1. Suppose that, under the conditions of Lemma 1, we know that $$F = F_\theta$$ for some $$\theta \in \Theta$$, and let $${\mathcal{C}}_{\gamma, n}\left( {\Theta} \right)$$ be as in (5). Then (3) provides an asymptotically valid identification interval for $$\mu$$ as $$n \to \infty$$.  Furthermore, we can check that the resulting intervals are asymptotically sharp. We implement this procedure by first solving the optimization problem   $$\label{eq:theta_opt} {\hat{\mu}}^+_\theta = \sup\left\{ {\sum_{i = 1}^n w_i Y_i : \sum_{i = 1}^n w_i = 1, \: \frac{\max(w_i)}{\min(w_i)} \leqslant \gamma, \: \sup_{y \in {\mathbb{R}}} \big| {\hat{F}}_w(y) - F_\theta(y)\big| \leqslant \delta_{\gamma, n}} \right\}$$ (6) over a grid of candidate values $$\theta \in \Theta$$ and then setting $${\hat{\mu}}^+ = \sup_\theta {\hat{\mu}}^+_\theta$$. The fractional programming problem (6) can be solved as a linear program using standard optimization methods; see, for example, Boyd & Vandenberghe (2004, § 4.3). 2.3. Relaxation of sharp shape constraints Consider the parametric family case above. Lemma 1 implies that $${\hat{F}}^*(\cdot)$$ will grow arbitrarily close to some $$F_\theta$$ with probability unity. This can be a very strong assumption: under a Gaussian family, for example, this implies that, with increasing $$n$$, not only are our bounds sharp but they will collapse to a single point as $${\mathcal{C}}_{\gamma, n}(\Theta)$$ shrinks, because of the tightening of $$\delta_{\gamma, n}$$. If we instead impose a shape constraint of $$\max_y | F(y) - F_\delta(y) | \leqslant \delta^*$$ for some $$\delta^*$$ and unknown $$\theta \in \Theta$$, we can expand our set $${\mathcal{C}}_{\gamma, n}(\Theta)$$ to   ${\mathcal{C}}_{\gamma, n}(\Theta) = \bigcup_{\theta \in \Theta} \left\{ {H : \sup_{y \in {\mathbb{R}}} \big| H(y) - F_\theta(y)\big| \leqslant \delta_{\gamma, n} + \delta^*} \right\}\!\text{.}$ By the triangle inequality for Kolmogorov–Smirnov distances we still have, in the limit, $${\hat{F}}^*$$ in our set with probability 1 and, therefore, valid bounds on our mean. This relaxation allows for restricting the shape of our unknown distribution to be near a given parametric family without imposing a strong parametric assumption. The $$\delta^*$$ is a sensitivity parameter, and practitioners could examine how the bounds respond to different choices of $$\delta^*$$ values. Here we instead investigate methods that make general shape constraint assumptions rather than these parametric ones. 2.4. Identification bounds with symmetry We now return to our main topic of interest, i.e., how to leverage shape constraints on $$F$$ to obtain improved identification bounds for $$\mu$$. The difference between parametric and shape-constrained side information about $$F$$ is that grid search is not usually practical over all distributions in some shape-constrained class, so the algorithm based on (6) does not generalize. Rather, for each candidate shape-constrained class, we may need to find an ad hoc way to avoid a full grid search. First, we consider the case where $$F$$ is symmetric, i.e., there is some value $$m \in {\mathbb{R}}$$ such that $$F(m + y) = 1 - F(m - y)$$ for all $$y \in {\mathbb{R}}$$. Such symmetry constraints mesh particularly well with our approach. In order to make use of them, we establish the following analogue of Lemma 1. Unlike Lemma 1, however, this result holds for any value of $$\alpha > 0$$ and not only for small $$\alpha \rightarrow 0$$; from a technical perspective, it follows directly from Donsker arguments used to prove the classical Kolmogorov–Smirnov theorem, and does not require the additional machinery of Marcus & Shepp (1972). Lemma 2. Suppose that we have a population and a sampling scheme for which $$\pi_{\min} \leqslant \pi_i \leqslant \pi_{\max}$$ for some $$\pi_{\max}/\pi_{\min} \leqslant \gamma$$. Then, for any $$\alpha > 0$$,  $$\label{eq:symmerty_KS} \lim_{n \rightarrow \infty} \mathrm{pr} \!\left[ n^{1/2} \sup_{q \in [0, \, 1]} \left| {{\hat{F}}^* \{F^{-1}(q)\} + {\hat{F}}^*\{F^{-1}(1 - q)\} - 1 } \right| \geqslant \zeta_{\gamma, \alpha} \right] \leqslant \alpha,$$ (7)with  \begin{equation*} \zeta_{\gamma, \alpha} = \Phi^{-1}\!\left( {1 - \frac{\alpha}{4}} \right) \left\{ {\frac{\left( {1 + \gamma} \right)\left( {1 + \gamma^{-1}} \right)}{4}} \right\}^{1/2}\!\text{.} \end{equation*} To draw a connection to Lemma 1, we note that $$\Phi^{-1}\left( {1 - \alpha/4} \right) \asymp (2\log \alpha^{-1})^{1/2}$$ for small values of $$\alpha$$. Lemma 2 holds regardless of symmetry. If the distribution $$F(\cdot)$$ is symmetric about $$m$$, then any pair $$\{F^{-1}(q), \, F^{-1}(1 - q)\}$$ with $$q \in [0, 1]$$ can be written as $$(m - y, \, m + y)$$ for some $$y \in {\mathbb{R}}$$. Thus, in the case of symmetric distributions, (7) provides a tail bound on the supremum of $${\hat{F}}^*(m + y) + {\hat{F}}^*(m - y) - 1$$ over $$y \in {\mathbb{R}}$$, and suggests using the estimator   $$\label{eq:symmetry_bound} \begin{split} {\hat{\mu}}^+_m & = \sup\left\{ {\sum_{i = 1}^n w_i Y_i : \sum_{i = 1}^n w_i = 1, \ \frac{\max(w_i)}{\min(w_i)} \leqslant \gamma, \ {\hat{F}}_w \in {\mathcal{C}}_{\gamma, n}^{\mathrm{SYM}}} \right\}\!, \\ {\mathcal{C}}_{\gamma, n}^{\mathrm{SYM}} & = \bigcup_{m \in {\mathbb{R}}} \Big\{H : \sup_y \big| H(m + y) + H(m - y) - 1\big|\leqslant \zeta_{\gamma, \, n^{-1/2}}\Big\}\text{.} \end{split}$$ (8) The lower bound $${\hat{\mu}}^-$$ is computed analogously. This algorithm thus enables us to use symmetry constraints while performing a grid search over only a single parameter $$m$$, i.e., the centre of symmetry. The identification intervals defined by (8) are again asymptotically valid and sharp. 2.5. Identification bounds with log-concavity Finally, we consider the case where $$F$$ is known to have a log-concave density. Imposing log-concavity constraints appears to be a promising way of encoding side information about $$F$$: the class of log-concave distributions is quite flexible, including most of the widely-used parametric distributions with continuous support, while enforcing regularity properties such as unimodality and exponentially decaying tails (Walther, 2009). Unlike in the case of symmetry, there does not appear to be a simple way to turn log-concavity constraints into asymptotically sharp identification bounds for $$\mu$$ using only linear programming and a low-dimensional grid search. However, we can still use log-concavity to obtain computationally tractable improvements over the bounds of Aronow & Lee (2013). Below, we detail our procedure for obtaining $${\hat{\mu}}^{+}$$; to obtain $${\hat{\mu}}^{-}$$ we can apply the same procedure to $$-Y_i$$. We posit the existence of a known upper bound $$Y_i \leqslant y_{\max}$$ for the outcomes $$Y_i$$ to ensure integrability. Step 1. Let $${\hat{S}}(y) = n^{-1} \sum_i 1(Y_i \leqslant y)$$ and $${\hat{S}}_{\mathrm{KS}}(y) = \max\{{\hat{S}}(y) - n^{-1/2}D^{(-1)}_{\mathrm{KS}}(1 - n^{-1/2}) , \, 0\}$$, where $$D^{(-1)}_{\mathrm{KS}}(\cdot)$$ denotes the inverse Kolmogorov–Smirnov cumulative distribution function. By the Kolmogorov–Smirnov theorem, $$S(y) \geqslant {\hat{S}}_{\mathrm{KS}}(y)$$ for all $$y \in {\mathbb{R}}$$ with probability tending to 1, where $$S(y)$$ is the distribution function of the observed, biased, sample. Step 2. Next, in the proof, we show that for some $$m \in {\mathbb{R}}$$,   \begin{equation*} \label{eq:U_definition} {\hat{U}}_m(y) = \left[ {\hat{S}}_{\mathrm{KS}}(y) + (\gamma - 1)\big\{{\hat{S}}_{\mathrm{KS}}(y) - {\hat{S}}_{\mathrm{KS}}(m)\bigr\}_+\right] \Big/ \left[ {\hat{S}}_{\mathrm{KS}}(m) + \gamma\bigl\{1 - {\hat{S}}_{\mathrm{KS}}(m)\bigr\} \right] \end{equation*} is a lower bound for the population distribution of interest, $$F(y)$$, with probability tending to 1. We also set $${\hat{U}}_m(y_{\max}) = 1$$ to ensure that $${\hat{U}}_m(\cdot)$$ is a cumulative distribution function. Step 3. We now use the fact that our distribution is log-concave. It is well known that if $$F$$ has a log-concave density, then $$\log F(y)$$ must be concave (Prékopa, 1973). Thanks to this fact, we know that if $$F(y) \geqslant {\hat{U}}_m(y)$$, then also $$F(y) \geqslant {\hat{L}}_m(y)$$, where   $${\hat{L}}_m(\cdot) = \mathop{\arg\min}_{L(\cdot)} \left\{ {\int_{-\infty}^{y_{\max}} \!\!L(y) \,{\rm{d}} y : \log L(y) \text{ is concave and } L(y) \geqslant {\hat{U}}_m(y) \text{ for all } y \in {\mathbb{R}}} \right\}\!;$$ $${\hat{L}}_m(\cdot)$$ is, for $$m$$, the lowest function for which $$\log L(y)$$ is concave that still lies above our overall lower bound $${\hat{U}}_m(y)$$. Step 4. Finally, we define $$C_{\gamma, n}$$ as the set of distributions satisfying at least one of the lower bounds   $$C_{\gamma, n}^{\mathrm{LC}+} = \bigcup_{m \in {\mathbb{R}}} \left\{ {H : H(y) \geqslant {\hat{L}}_m(y), \: y \in {\mathbb{R}}} \right\}\!\text{.}$$ Given this construction, we can obtain an upper endpoint for our identification interval as usual:   $${\hat{\mu}}^+ = \sup_{w} \left\{ {\sum_{i = 1}^n w_i Y_i : \sum_{i = 1}^n w_i = 1, \ \frac{\max(w_i)}{\min(w_i)} \leqslant \gamma, \ {\hat{F}}_w(y) \in C_{\gamma, n}^{\mathrm{LC}+}} \right\}\!\text{.}$$ Lemma 3 shows that $$C_{\gamma, n}^{\mathrm{LC}+}$$ contains the population sampling distribution with probability tending to 1, and so Theorem 1 establishes the validity of our identification intervals. Unlike in the parametric or symmetric cases, our log-concave identification bounds are not asymptotically sharp, i.e., they may not converge to the shortest possible identification interval given our assumptions about log-concavity and bounded sampling ratios; however, they still provide tighter intervals than do Aronow & Lee (2013). Lemma 3. Suppose that we have a population for which $$\pi_{\min} \leqslant \pi_i \leqslant \pi_{\max}$$ for some $$\pi_{\max}/\pi_{\min} \leqslant \gamma$$, and that $$F$$ has a log-concave density. Then $$\lim_{n \rightarrow \infty} \mathrm{pr} ({\hat{F}}^* \in {\mathcal{C}}_{\gamma, n}^{\mathrm{LC}+}) = 1$$.  3. Application: sampling ethnic minorities The Aymara are an indigenous population of the Andean plateau of South America, who live predominantly in Bolivia and Peru, with a small proportion living in the north of Argentina and Chile. In Chile, they constitute a minority of nearly 50 000 in a country of approximately 18 million. Across the world, it is of great importance to understand how ethnic minorities are faring in order to design effective affirmative action policies. Here we use the proposed method to bound the average performance of Aymara students in the national standardized test held in Chile for admission to higher education. This test is called Prueba de Selección Universitaria, and almost 90% of enrolled high school students take it every year; however, participation is known to be lower among vulnerable populations such as the Aymara in northern Chile. Using the sample of 847 Aymara students who took the test in mathematics in 2008, we seek identification intervals for the population mean counterfactual test score had everyone taken the test. We assume that the sampling ratio is bounded by $$\max(\pi_i) /\min(\pi_i) \leqslant \gamma = 9$$ and consider inference under the assumption that the population test score distribution is symmetric or is log-concave. We also consider the approach of Aronow & Lee (2013) without shape constraints. The observed test scores have a mean of 502 with a sample standard deviation of 104. Given $$\gamma = 9$$, we obtain population identification intervals of $$(426, \, 578)$$ assuming symmetry, $$(414, \, 589)$$ assuming log-concavity, and $$(410, \, 591)$$ without any constraints. Thus, in this example, assuming symmetry buys us shorter identification intervals than assuming log-concavity. Figure 1 depicts the weighted distribution functions $${\hat{F}}_w(\cdot)$$ underlying the upper endpoints of all three identification intervals. The sharp threshold of the weights $$w_i$$ resulting from the unconstrained method of Aronow & Lee (2013) is readily apparent. Assuming either symmetry or log-concavity of the population sampling distribution yields more regular-looking distributions. Fig. 1. View largeDownload slide Illustration of the weighted distribution functions $$\hat{F}_w(\cdot)$$ used to obtain upper endpoints $$\hat{\mu}^+$$ for our identification intervals: (a) the raw weighted cumulative distribution functions used to compute $$\hat{\mu}^+$$; (b) weighted density functions obtained by using smoothing splines to help visualize the underlying estimated densities. In each panel the dotted curve shows the observed empirical cumulative distribution, while the dashed, dot-dashed and solid curves represent $$\hat{F}_w(\cdot)$$ obtained with, respectively, log-concavity, symmetry and no constraints on the population distribution. Fig. 1. View largeDownload slide Illustration of the weighted distribution functions $$\hat{F}_w(\cdot)$$ used to obtain upper endpoints $$\hat{\mu}^+$$ for our identification intervals: (a) the raw weighted cumulative distribution functions used to compute $$\hat{\mu}^+$$; (b) weighted density functions obtained by using smoothing splines to help visualize the underlying estimated densities. In each panel the dotted curve shows the observed empirical cumulative distribution, while the dashed, dot-dashed and solid curves represent $$\hat{F}_w(\cdot)$$ obtained with, respectively, log-concavity, symmetry and no constraints on the population distribution. Acknowledgement The authors thank the referees, associate editor and editor for their helpful comments. Wager thanks Columbia University for support during his time as a postdoctoral research scholar. Zubizarreta was supported by the Alfred P. Sloan Foundation. Miratrix and Zubizarreta are also affiliated with the Department of Statistics at Harvard University. Wager is also affiliated with the Department of Statistics at Stanford University. Appendix Proof of Theorem 1 First, mirroring the argument of Aronow & Lee (2013), we note that $$\hat{F}^*$$ is itself an estimator of the form $$\hat{F}^*(y) = \sum_i w_i 1\left({Y_i \leq y}\right)$$ with $$\sum_i w_i = 1$$ and $$\max{w_i}/\min{w_i} \leq \gamma$$, for some vector of weights $$w$$. Further, if $$\hat{F}^* \in \mathcal{C}_{\gamma, n}$$ then $$\mu^* \in \mathcal{I}$$. Now, if $$\hat{F}^* \in \mathcal{C}_{\gamma, n}$$ then $$\Delta(\mu, \mathcal{I}) \leq |\hat{\mu}^* - \mu|$$, so $$\Delta(\mu, \mathcal{I}) \leq |\hat{\mu}^* - \mu|$$ with probability tending to 1. Finally, as $$|\hat{\mu}^* - \mu|$$ converges to zero by the weak law of large numbers, we have that $$\Delta(\mu, \mathcal{I})$$ also converges to zero in probability. Proof of Lemma 1 To derive this type of uniform tail bound, we proceed in the following manner. First, we verify that $$n^{1/2} \{\hat{F}^*(\cdot) - F(\cdot)\}$$ converges in distribution to a tight Gaussian process $$G(\cdot)$$; then we bound the tail probabilities of $$\sup_{y \in \mathbb{R}} |G(y)|$$ directly. The first step involves a routine application of Donsker’s theorem, as presented, for example, in van der Vaart (1998, Ch. 19); here we take the existence of a limiting process $$G(\cdot)$$ as given. Because the supremum of $$n^{1/2} \{ \hat{F}^*(\cdot) - F(\cdot) \}$$ is invariant under monotone transformations of $$Y$$ and is stochastically maximized when $$Y$$ has a continuous density without atoms, we can without loss of generality assume that $$Y \in [0, 1]$$. Our sampling framework can be modelled as first drawing triplets $$W_i = (Y_i, \pi_i, u_i) \sim F$$, with $$u_i$$ marginally distributed as Un$$[0,1]$$, and then thinning our sample if $$u_i > \pi_i$$. We make the assumption that the prethinned sample size is Poisson, which implies that the actual sample size we observe is also Poisson, $$n \sim \text{Po}(N)$$, with $$N \rightarrow \infty$$; this does not affect our conclusions but makes the derivation more direct. Given the Poisson assumption, we can further assume, again without loss of generality, that $$y$$ is scaled on the $$[0,1]$$ interval such that there exists a $$\omega > 0$$ for which   $$\label{eq:Avarh1} N \,\mathrm{var} \{\hat{H}(y)\} = \omega^2 y, \quad \hat{H}(y) = \frac{1}{N} \sum_{ (i : Y_i \leq y) } \frac{1}{\pi_i} \Big/ E_{\mathrm{obs}}\! \left( \frac{1}{\pi_i} \right)\!,$$ (A1) where $$E_{\mathrm{obs}}\{f(W_i)\} = E\{ f(W_i) \mid u_i \leq \pi_i \}$$ denotes expectations with respect to the observed, i.e., thinned and therefore biased, sampling distribution. Observe that (A1) simply rescales $$y$$ so that the variance of $$\hat{H}(y)$$, the total mass sampled below $$y$$, increases linearly with $$Y$$, allowing conception of this object as effectively a random walk. We remark that $$\hat{H}(y)$$ is an oracle unnormalized estimator of $$F(\cdot)$$, akin to a Horvitz–Thompson estimator where we normalize by an expected weight rather than actual weights. It is unbiased for $$F(y)$$ in that, for any $$y$$, $$\:E_{\textrm{obs}}\{ \hat{H}(y) \} = F(y)$$. Now, $$\hat{H}(y)$$ is a compound Poisson process with variable size increments proportional to $$1/\pi_i$$ and a rate controlled by $$F(y)$$ and $$N$$. Using standard results on such compound Poisson processes, we have   $$\label{eq:Avarh2} N \,\mathrm{var} \{\hat{H}(y)\} = E_{\mathrm{obs}} \!\bigg\{{\frac{ 1(Y_i \leq y) }{ \pi_i^2 }}\bigg\} \bigg/ E_{\mathrm{obs}}\! \left( \frac{1}{\pi_i} \right)^2\text{.}$$ (A2) In connecting (A1) and (A2), note that the additional variability due to the random sample size $$n$$ plays a critical role; the variable sample size $$n$$ allows $$\mathrm{var} \{\hat{H}(y)\}$$ to increase as there is no normalization constraint that sets $$\hat{H}(1)$$ to 1. Because we are not yet normalizing $$\hat{H}(y)$$, we can easily obtain the pairwise covariances of $$\hat{H}(y_1)$$ and $$\hat{H}(y_2)$$ for any $$y_1 < y_2$$:   \begin{equation*} \mathrm{cov}\big\{ \hat{H}(y_1), \,\hat{H}(y_2) \big\} = \mathrm{var}\{ \hat{H}(y_1) \}\text{.} \end{equation*} These pairwise covariances, coupled with the linearly increasing variance, imply that if Gaussian limits exist, which we know to be the case by Donsker’s theorem, we must have that $$N^{1/2} \{ \hat{H}(y) - F(y) \}$$ converges in distribution to $$\omega W(y),$$ where $$W(y)$$ is a standard Wiener process on $$[0, 1]$$. Then, noting that $$\hat{F}^*(y) = \hat{H}(y) / \hat{H}(1)$$, we obtain, using the delta method, the convergence in distribution   $$\label{eq:Amain} n^{1/2} \, \big\{\hat{F}^*(\cdot) - F(\cdot)\big\} \to G(y) = \omega \{W(y) - F(y) W(1)\}$$ (A3) as $$n \to \infty$$. We can use $$n^{1/2}$$ instead of $$N^{1/2}$$ thanks to Slutsky’s theorem, since $$n^{1/2} = N^{1/2} \{1 + o_{\rm p}(1)\}$$. The next step is to bound $$\omega$$. We do this by expressing $$\omega$$ in terms of population moments on the $$\pi_i$$ and then optimizing over all possible distributions of $$\pi_i$$ to maximize this expression. First, plug $$y = 1$$ into (A2) and (A1) and set them equal to each other to obtain   $$\label{eq:Aomega} \omega^2 = E_{\mathrm{obs}} \bigl( \pi_i^{-2} \bigr) \big/ E_{\mathrm{obs}} \bigl( \pi_i^{-1} \bigr)^2\text{.}$$ (A4) Next, defining $$E_{\mathrm{pop}}$$ to be the expectation without the thinning step, i.e., with respect to the underlying unbiased population, we derive the equalities   $$E_{\mathrm{obs}}\!\left({{\pi_i}^{-1}}\right) = E_{\mathrm{pop}}\!\left({\pi_i}\right)^{-1}, \quad E_{\mathrm{obs}}\!\left({{\pi_i^{-2}}}\right) = E_{\mathrm{pop}}\!\left({{\pi_i}^{-1}}\right) \big/ E_{\mathrm{pop}}\!\left({\pi_i}\right)\text{.}$$ These are derived by simply writing out the expectation integrals of the thinned sample. For example, letting $$\Pi$$ be the population distribution of the $$\pi_i$$,   $E_{\mathrm{obs}}\!\left({ \pi_i^{-1}}\right) = \frac{1}{\int \pi_i \,\mathrm{d}\,\Pi} \int \frac{1}{\pi_i} \,\pi_i \,\mathrm{d}\, \Pi = \frac{1}{E_{\mathrm{pop}}(\pi_i)} \text{.}$ These relations, substituted into (A4), imply that   $$\label{eq:bound} \omega^2 = E_{\mathrm{pop}}\!\left({{\pi_i}^{-1}}\right) E_{\mathrm{pop}}\!\left({\pi_i}\right) \leq \sup_{a \in [0, 1]} \{1 + a(\gamma - 1)\}\bigg\{{1 + a(\gamma^{-1} - 1)}\bigg\}\!,$$ (A5) where the last inequality follows from the fact that, because $$1/x$$ is a convex function of $$x$$, our expression of interest is maximized for some distribution of $$\pi_i$$ with all its weight on two discrete values representing the endpoints $$\gamma$$ and $$1$$ of the allowed range, up to a scaling constant. The supremum over $$a$$ is then the supremum over all distributions of this form, with $$a$$ being the probability of $$\pi_i = \gamma$$. We can check by calculus that the bound (A5) is maximized at $$a^* = 1/2$$, resulting in   $$\label{eq:Abounds} \omega^2 \leq (1 + \gamma)(1 + \gamma^{-1})/4\text{.}$$ (A6) It remains to bound the suprema of $$X(y) = W(y) - F(y) W(1)$$. If we had $$F(y) = y$$, then $$X(y)$$ would be a standard Brownian bridge and the next steps would be straightforward. In our case, however, $$F(y)$$ may belong to a larger class of functions, so we instead make use of the following technical lemma, proved at the end of this appendix. Lemma A1. Suppose that $$W(y)$$ is a standard Wiener process and that $$F(y)$$ is a cumulative distribution function on $$[0, 1]$$ with $$F(0) = 0$$, $$F(1) = 1$$ and, for some constant $$\gamma > 0$$, $$F(A)/F(B) \leq \gamma$$ for any intervals $$A$$ and $$B$$ with $$\lambda(A) = \lambda(B)$$, where $$\lambda(\cdot)$$ is Lebesgue measure. Then there exists a constant $$\sigma_\gamma$$, bounded by unity, for which the stochastic process $$X(y) = W(y) - F(y)W(1)$$ satisfies the bound  $$\label{eq:A3} \lim_{u \rightarrow \infty} \frac{1}{u^2} \, \log \mathrm{pr}\!\bigg\{{\sup_{y \in [0, \, 1]} |{X(y)}| > u}\bigg\} \leq \frac{-1}{2\sigma^2_\gamma}\text{.}$$ (A7) The $$\gamma$$ condition in the lemma will correspond to our control on the weights $$\pi_i$$ due to the rescaled $$y$$. To continue, letting $$X(y) = G(y) / \omega$$ and plugging this and $$u = \sigma_\gamma (2 \log \alpha^{-1})^{1/2}$$ into (A7) gives   $$\limsup_{\alpha \rightarrow 0} \: \log \mathrm{pr}\!\bigg\{{\sup_{y \in [0, 1]} |{G(y)}| > \omega \sigma_\gamma (2\log{\alpha^{-1}})^{1/2}}\bigg\}^{-1} \bigg/ \log \alpha^{-1} \geq 1\text{.}$$ The final form of the lemma follows from substituting for $$\omega$$ in (A6). Proof of Corollary 1 By Lemma 1 we know that $$\hat{F}^* \in \mathcal{C}_{\gamma, n}$$ with probability tending to 1, so the result follows immediately from Theorem 1. Proof of Lemma 2 The proofs of Lemmas 2 and 1 begin in the same way to obtain the tight Gaussian process, but then diverge in how we bound the tail behaviour of this process. To prove Lemma 2 we start with the representation in (A3). Given a $$q \in [0,0{\cdot}5]$$, use (A3) twice with $$y = F^{-1}(q)$$ and $$y = F^{-1}(1-q)$$, take the sum, and then simplify to obtain   \begin{align*} \omega^{-1} n^{1/2} \left[\hat{F}^*\{F^{-1}(q)\} + \hat{F}^*\{F^{-1}(1 - q)\} - 1 \right] & \to W\{F^{-1}(q)\} + W\{F^{-1}(1 - q)\} - W(1) \\ & = W \{ F^{-1}(q) \} - \left[ W(1) - W\{F^{-1}(1 - q)\} \right] \\ & = W \{ F^{-1}(q) + 1 - F^{-1}(1 - q)\}, \end{align*} where the last equality is in distribution. For the last step note that $$W \{F^{-1}(q) \}$$ and $$W(1) - W \{ F^{-1}(1 - q) \}$$ are two independent Gaussian processes over $$q \in [0, 0{\cdot}5]$$; $$W(1) - W(y_2)$$ is independent of $$W(y_1)$$ if $$y_1 < y_2$$ as we are effectively looking at two non-overlapping, and therefore independent, sums of our sample. Then the sum of two independent random walks is a random walk of the total distance. Finally, for $$q \in [0, 0{\cdot}5]$$, $$F^{-1}(q) + 1 - F^{-1}(1 - q)$$ takes all values in $$[0, 1]$$, so for any threshold $$t$$,   \begin{align*} \lim_{n \rightarrow \infty} \mathrm{pr} \!\left( n^{1/2} \sup_{q \in [0, 0{\cdot}5]} \left[ \hat{F}^* \{ F^{-1}(q) \} + \hat{F}^* \{ F^{-1}(1 - q) \} - 1 \right] \geq \omega t \right) & = \mathrm{pr}\left\{\sup_{y \in [0, 1]} W(y) \geq t\right\} \\ & = 2\Phi(-t), \end{align*} where the last equality is a consequence of the reflection principle for Brownian motion (see, e.g., Mörters & Peres, 2010, p. 44). The desired conclusion then follows upon applying the bound (A6) on $$\omega$$ to both tails. This lemma does not require any symmetry in $$F(\cdot)$$; the subsequent use of the lemma is what takes advantage of that structure. Proof of Lemma 3 It suffices to verify all claims made in the four steps leading to our construction of $$\mathcal{C}_{\gamma, n}^{\mathrm{LC}+}$$. Step 1 is a direct consequence of the Kolmogorov–Smirnov theorem. Steps 3 and 4 are also immediate given the result of Prékopa (1973). The optimization problem used to define $$\hat{L}_m(y)$$ is computationally tractable: finding a concave upper bound for a function $$l(y)$$ is equivalent to taking the convex hull of the curve $$\{y, \, l(y)\}$$. It remains to check Step 2, which is a direct analogue of the argument of Aronow & Lee (2013) applied to the sampling distribution $$S(\cdot)$$. Consider $$r = \min(\pi_i)/E(\pi_i)$$. If $$r$$ were known, and recalling that $$\max{\pi_i}/\min{\pi_i} \leq \gamma$$, we can follow the argument of Aronow & Lee (2013) to verify that $$F$$ is stochastically dominated by $$U_m$$, i.e., $$F(y) \geq U_m(y)$$ for all $$y \in \mathbb{R}$$, where $$U_m(\cdot)$$ is a reweighting of the observed cumulative sampling distribution $$S(\cdot)$$ given by   $$U_m(y) = \left[ S(y) + (\gamma - 1)\left\{ S(y) - S(m) \right\}_+\right] \big/ \bigl[ S(m) + \gamma\left\{ 1 - S(m) \right\} \bigr],$$ for a threshold $$m$$ characterized by   $$S(m) \big/\bigl[ S(m) + \gamma \left\{ 1 - S(m) \right\} \bigr] = r\text{.}$$ In other words, $$U_m(y)$$ corresponds to a reweighting of $$S(y)$$ with sampling weights $$\pi_i$$ jumping from a low value $$a$$ to a high value $$\gamma a$$ at threshold $$m$$. In practice, of course, we do not know $$r$$, but we still know that $$F$$ is stochastically dominated by $$U_m$$ for some $$m \in \mathbb{R}$$. The claim made in Step 2 then follows upon observing that $$S(y) \geq \hat{S}_{\mathrm{KS}}(y)$$, which implies $$U_m(y) \geq \hat{U}_m(y)$$ for all $$m, y \in \mathbb{R}$$ with probability tending to 1, thanks to Step 1. Proof of Lemma A1 We first recall the classic result of Marcus & Shepp (1972), which in our situation implies that the Gaussian process $$X(y)$$ satisfies   $$\lim_{u \rightarrow \infty} \frac{1}{u^2} \, \log \mathrm{pr} \!\left\{\, \sup_{y \in [0, 1]} \left| X(y) \right| > u \right\} = \frac{-1}{2 \sup_{y \in [0, 1]} \textrm{var}\{X(y)\} }\text{.}$$ It therefore remains to use the shape restrictions on $$F(y)$$ to bound the right-hand expression from below. To do so, it is helpful to decompose the stochastic process $$X(y)$$ into two independent parts:   \begin{equation*} X(y) = B(y) + W(1) \{ y - F(y) \}, \end{equation*} where $$B(y) = W(y) - y W(1)$$ is a standard Brownian bridge. Moreover, given our assumptions on $$F(y)$$,   \begin{equation*} \label{eq:Fbound} \big|F(y) - y\big| \leq A_\gamma(y), \quad A_\gamma(y) = \frac{\gamma y}{1 - y + \gamma y} - y \end{equation*} for $$0 \leq y \leq 0{\cdot}5$$, $$A_\gamma(y)$$ corresponds to how much more weight below $$y$$, as compared to a uniform distribution, is possible given the constraints of $$\gamma$$. A similar derivation shows that $$A_\gamma(y) = A_\gamma(1 - y)$$ for all $$y \in [0{\cdot}5, 1]$$. Thus, making use of the facts that $$\textrm{var}\{B(y)\} = y(1 - y)$$ and $$\textrm{var}\{W(1)\} = 1$$, and using $$A_\gamma(y)$$ as a bound on the $$y - F(y)$$ scaling of the $$W(1)$$ term, we deduce that   $$\textrm{var}\{X(y)\} \leq y(1 - y) + A_\gamma(y) \quad (0 \leq y \leq 1)\text{.}$$ The above function is concave on $$[0, 0{\cdot}5]$$; in particular, $$A''_\gamma(y) = -2\gamma(\gamma - 1) /(1 - y + \gamma y)^3$$, and so the above expression has a unique maximizer $$y^*$$ that can be found numerically. The desired conclusion then holds for $$\sigma^2_\gamma := y_\gamma^*(1 - y_\gamma^*) + A_\gamma(y^*_\gamma)$$; the fact that $$\sigma_\gamma^2 \leq 1$$ is immediate by inspection. References Aronow P. M. & Lee D. K. ( 2013). Interval estimation of population means under unknown but bounded probabilities of sample selection. Biometrika  100, 235– 40. Google Scholar CrossRef Search ADS   Boyd S. & Vandenberghe L. ( 2004). Convex Optimization.  Cambridge: Cambridge University Press. Google Scholar CrossRef Search ADS   Cochran W. G. ( 1977). Sampling Techniques.  New York: Wiley. de Carvalho M. & Davison A. C. ( 2014). Spectral density ratio models for multivariate extremes. J. Am. Statist. Assoc.  109, 764– 76. Google Scholar CrossRef Search ADS   Efron B. & Tibshirani R. ( 1996). Using specially designed exponential families for density estimation. Ann. Statist.  24, 2431– 61. Google Scholar CrossRef Search ADS   Fithian W. & Wager S. ( 2015). Semiparametric exponential families for heavy-tailed data. Biometrika  102, 486– 93. Google Scholar CrossRef Search ADS   Hájek J. ( 1958). On the theory of ratio estimates. Aplikace Matematiky  3, 384– 98. Manski C. F. ( 2003). Partial Identification of Probability Distributions.  New York: Springer. Marcus M. B. & Shepp L. A. ( 1972). Sample behavior of Gaussian processes. In Proc. 6th Berkeley Symp. Math. Statist. Prob., Vol. 2: Probability Theory.  Berkeley, California: University of California Press, pp. 423– 41. Mörters P. & Peres Y. ( 2010). Brownian Motion.  Cambridge: Cambridge University Press. Google Scholar CrossRef Search ADS   Owen A. B. ( 1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika  75, 237– 49. Google Scholar CrossRef Search ADS   Prékopa A. ( 1973). Logarithmic concave measures and functions. Acta Sci. Math. (Szeged)  34, 334– 43. Qin J. ( 2017). Biased Sampling, Over-identified Parameter Problems and Beyond.  Singapore: Springer. Google Scholar CrossRef Search ADS   R Development Core Team ( 2018). R: A Language and Environment for Statistical Computing.  R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org. van der Vaart A. W. ( 1998). Asymptotic Statistics.  Cambridge: Cambridge University Press. Google Scholar CrossRef Search ADS   Walther G. ( 2009). Inference and modeling with log-concave distributions. Statist. Sci.  24, 319– 27. Google Scholar CrossRef Search ADS   © 2017 Biometrika Trust

### Journal

BiometrikaOxford University Press

Published: Mar 1, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly ### Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. ### Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve ### Freelancer DeepDyve ### Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations