Optimal pseudolikelihood estimation in the analysis of multivariate missing data with nonignorable nonresponse

SUMMARY

Tang et al. (2003) considered a regression model with missing response, where the missingness mechanism depends on the value of the response variable and hence is nonignorable. They proposed three pseudolikelihood estimators, based on different treatments of the probability distribution of the completely observed covariates. The first assumes the distribution of the covariate to be known, the second estimates this distribution parametrically, and the third estimates the distribution nonparametrically. While it is not hard to show that the second estimator is more efficient than the first, Tang et al. (2003) only conjectured that the third estimator is more efficient than the first two. In this paper, we investigate the asymptotic behaviour of the third estimator by deriving a closed-form representation of its asymptotic variance. We then prove that the third estimator is more efficient than the other two. Our result can be straightforwardly applied to missingness mechanisms that are more general than that in Tang et al. (2003).

1. Introduction

Tang et al. (2003) considered multivariate regression analysis of a $$q$$-dimensional response $$Y$$ on a $$p$$-dimensional covariate $$X$$, with the joint density function of $$(Y,X)$$ factorized as $$p(y,x) = f(\,y\mid x; \beta) g(x)$$, where $$g(x)$$ represents the marginal density function of $$X$$ and the estimation of $$\beta$$ is of main interest. They considered the situation where $$X$$ is fully observed but $$Y$$ has missing values. Let $$R=1$$ if $$Y$$ is completely observed and $$R=0$$ otherwise. Tang et al. 
(2003) assumed that the missing data mechanism depends only on the underlying value of the response $$Y$$ and hence is nonignorable,   \begin{equation} \hbox{pr}(R=1 \mid Y, X) = \hbox{pr}(R=1 \mid Y)\text{.} \label{eq:assume} \end{equation} (1) They proposed estimators built on the fact that $$R$$ and $$X$$ are conditionally independent given $$Y$$, so the completely observed subjects form a random sample from the distribution of $$X$$ given $$Y$$. The missing data mechanism (1) has been widely adopted in response-biased sampling (Brown, 1990; Liang & Qin, 2000; Chen, 2001) and is relevant in many applications. For example, Chen (2001) studied a case of a univariate response $$Y$$ and a multivariate covariate $$X$$, where the observed data form a nonrandom sample from $$(X, Y)$$, with the sampling probability depending only on $$Y$$. Therefore, the observed data can be viewed as a random sample from the distribution of $$X$$ given $$Y$$, instead of from the original regression model $$f(\,y\mid x; \beta)$$ (Chen, 2001). Assumption (1) is also sensible in other situations. For example, when evaluating a new biomarker in analytical chemistry, scientists usually encounter a laboratory quality control limit, the so-called detection limit (Navas-Acien et al., 2008; Caldwell et al., 2009; Carter et al., 2016), defined as the lowest concentration of analyte distinguishable from the background noise. Although theoretically available, concentration values below the detection limit are usually not released by laboratories. When the concentration value is the outcome of interest and needs to be regressed against covariates $$X$$, assumption (1) is satisfied, with $$\hbox{pr}(R=1 \mid Y, X) = I(Y>c)=\hbox{pr}(R=1 \mid Y)$$ where $$c$$ is the detection limit. Eliminating observations with values below the detection limit can lead to severe bias; see Hopke et al. (2001), Moulton et al. (2002), Richardson & Ciampi (2003), Schisterman et al. (2006) and the references therein. 
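The detection-limit mechanism can be mimicked in a short simulation. The linear model, sample size and limit $$c=0$$ below are illustrative choices of ours, not taken from Tang et al. (2003); the naive complete-case fit exhibits the selection bias discussed above.

```python
# Hypothetical illustration of the detection-limit case of assumption (1):
# Y is released (R = 1) exactly when it exceeds a known detection limit c,
# so pr(R = 1 | Y, X) = I(Y > c) depends on Y alone. The linear model and
# all constants below are our own choices, used only for illustration.
import numpy as np

rng = np.random.default_rng(0)
N, c = 10_000, 0.0                        # sample size and detection limit
x = rng.normal(size=N)                    # fully observed covariate
y = 1.0 + 2.0 * x + rng.normal(size=N)    # response, unobserved below c
r = (y > c).astype(int)                   # missingness indicator I(Y > c)

# Naively regressing Y on X in the complete cases ignores the selection
# R = I(Y > c): large values of Y are over-represented, which attenuates
# the fitted slope relative to the true slope 2.0.
naive_slope, naive_intercept = np.polyfit(x[r == 1], y[r == 1], 1)
print(naive_slope)
```

The attenuation here is exactly the "severe bias" from discarding below-limit observations that the references above document.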
Other examples where (1) is valid include survey sampling (Deville & Särndal, 1992; Deville, 2000; Kott, 2014), case-control studies (Chen, 2007), and some survival analysis contexts. Tang et al. (2003) also discussed some extensions of the assumption (1). The novelty of the idea in Tang et al. (2003) has led to recent developments such as Kim & Yu (2011), Zhao & Shao (2015), Shao & Wang (2016) and Miao & Tchetgen Tchetgen (2016). In this paper, we first present our results under assumption (1) and then show that they can be straightforwardly applied to more general missingness mechanisms. The estimation of $$\beta$$ is based on independent and identically distributed observations $$(Y_i, X_i, R_i=1)$$ for $$i=1, \dots, n$$ and $$(X_i, R_i=0)$$ for $$i=n+1, \ldots, N$$. Based on (1), we estimate $$\beta$$ through maximizing the likelihood of the parameters of $$p(X\mid Y)$$ based on the complete observations,   \[ \prod_{i=1}^n p(x_i\mid y_i) = \prod_{i=1}^n \frac{f(\,y_i\mid x_i; \beta)g(x_i)}{\int f(\,y_i\mid x; \beta)g(x)\,{\rm d}x} = \prod_{i=1}^N \left\{ \frac{f(\,y_i\mid x_i; \beta)g(x_i)}{\int f(\,y_i\mid x; \beta)g(x)\,{\rm d}x} \right\}^{r_i}\!\text{.} \] Equivalently, we estimate $$\beta$$ by maximizing the complete-case conditional loglikelihood   \begin{equation} l(\beta, g) = \frac1N \sum_{i=1}^N r_i \left\{\log\,f(\,y_i\mid x_i;\beta) - \log \int f(\,y_i\mid x;\beta)g(x)\,{\rm d}x\right\}\!, \label{eq:loglike} \end{equation} (2) which contains $$g(x)$$, the unspecified probability density function of $$X$$. If the true $$g$$, $$\,g_0$$, is known, the corresponding estimator of $$\beta$$, denoted by $$\hat\beta_{\rm {PL0}}$$, is the maximizer of $$l(\beta, g_0)$$. If $$g$$ is unknown, with $$X$$ fully observed, any appropriate complete-data technique can be applied to estimate $$g$$. Tang et al. (2003) considered two situations and obtained two pseudolikelihood estimators of $$\beta$$ (Gong & Samaniego, 1981; Parke, 1986). 
The first is when a parametric model $$g(x;\alpha)$$ is adopted and the full-data maximum likelihood estimator is used to obtain $$\hat\alpha$$. Then $$g(x;\hat \alpha)$$ is used to replace $$g(x)$$ in (2), which leads to the estimator $$\hat\beta_{\rm {PL1}}$$, the maximizer of $$l\{\beta,g(x;\hat \alpha)\}$$. The second is when $$g(x)$$ is unspecified and its cumulative distribution is estimated by its empirical version $$\hat G_{\rm N}(x)$$. This gives the estimator $$\hat\beta_{\rm {PL2}}$$, the maximizer of   \begin{equation} l(\beta, \hat G_{\rm N}) = \frac1N \sum_{i=1}^N r_i \left\{\log\,f(\,y_i\mid x_i;\beta) - \log \int f(\,y_i\mid x;\beta) \,{\rm d}\hat G_{\rm N}(x)\right\}\!\text{.} \label{eq:loglikenon} \end{equation} (3) An interesting issue is the efficiency of $$\hat\beta_{\rm {PL0}}$$, $$\hat\beta_{\rm {PL1}}$$ and $$\hat\beta_{\rm {PL2}}$$. Theorem 2 of Tang et al. (2003) established the asymptotic normality of $$\hat\beta_{\rm {PL1}}$$ and showed that $$\hat\beta_{\rm {PL1}}$$ is more efficient than $$\hat\beta_{\rm {PL0}}$$. However, the authors did not give an explicit expression for the asymptotic variance of $$\hat\beta_{\rm {PL2}}$$, so could not provide a theoretical efficiency comparison with $$\hat\beta_{\rm {PL2}}$$. Based on simulation studies, they conjectured that $$\hat\beta_{\rm {PL2}}$$ is more efficient than both $$\hat\beta_{\rm {PL0}}$$ and $$\hat\beta_{\rm {PL1}}$$. We derive the asymptotic variance of $$\hat\beta_{\rm {PL2}}$$ in closed form and prove that $$\hat\beta_{\rm {PL2}}$$ is more efficient than $$\hat\beta_{\rm {PL1}}$$ and $$\hat\beta_{\rm {PL0}}$$; thus we establish the correctness of the conjecture in Tang et al. (2003) and provide a clear explanation of their numerical observations. We also show that, in general, no other method of estimating $$g(x)$$ can lead to a more efficient estimator of $$\beta$$ than $$\hat\beta_{\rm {PL2}}$$, which is recommended for use in practice. 2. 
Asymptotic distribution and optimality

We use uppercase letters to denote random variables and lowercase letters to denote their realizations. We let $$S(X, Y; \beta)={\partial\log\,f(\,Y \mid X;\beta)}/{\partial\beta}$$ and $$s(x, y; \beta)={\partial\log\,f(\,y \mid x;\beta)}/{\partial\beta}$$. Sometimes we also write $$S_i(\beta)={\partial\log\,f(\,Y_i\mid X_i;\beta)}/{\partial\beta}$$ and $$s_i(\beta)={\partial\log\,f(\,y_i\mid x_i;\beta)}/{\partial\beta}$$. We define $$T_i(\alpha)={\partial\log g(X_i;\alpha)}/{\partial\alpha}$$, $$t_i(\alpha)={\partial\log g(x_i;\alpha)}/{\partial\alpha}$$ and $$G=-E\{{\partial T_i(\alpha)}/{\partial{\alpha}^{\mathrm{\scriptscriptstyle T}}}\} =E\{T_i(\alpha) T_i(\alpha)^{\mathrm{\scriptscriptstyle T}}\}$$. We let $$A = E[R_i \hbox{var}\{S_i(\beta)\mid Y_i\}]$$ and $$B = E[R_i \,\mbox{cov}\{S_i(\beta), T_i(\alpha) \mid Y_i\}]$$. We also write $$d_i (\beta) = r_i[s_i(\beta) - E\{S_i(\beta)\mid y_i\}]$$ and $$D_i (\beta) = R_i[S_i(\beta) - E\{S_i(\beta)\mid Y_i\}]$$. In this section, we establish the asymptotic distribution of $$\hat\beta_{\rm {PL2}}$$ and briefly describe the asymptotic distributions of $$\hat\beta_{\rm {PL0}}$$ and $$\hat\beta_{\rm {PL1}}$$. The results for $$\hat\beta_{\rm {PL0}}$$ and $$\hat\beta_{\rm {PL1}}$$ can be found in Theorem 2 of Tang et al. (2003). Recall that $$\hat\beta_{\rm {PL0}}$$ is the maximizer of $$l(\beta, g_0)$$. It is straightforward to show that   \[ N^{1/2} (\hat\beta_{\rm {PL0}} - \beta) = A^{-1} N^{-1/2}\sum_{i=1}^N d_i(\beta) + o_{\rm p}(1), \] so $$N^{1/2} (\hat\beta_{\rm {PL0}} - \beta) \to N(0, A^{-1})$$ in distribution as $$N\to\infty$$. The estimator $$\hat\beta_{\rm {PL1}}$$ is the maximizer of $$l\{\beta, g(x;\hat\alpha)\}$$. 
Because the maximum likelihood estimate of $$\alpha$$ satisfies $$N^{1/2}(\hat\alpha-\alpha)=G^{-1}N^{-1/2}\sum_{i=1}^N t_i(\alpha) +o_{\rm p}(1)$$,   \[ N^{1/2} (\hat\beta_{\rm {PL1}} - \beta) =A^{-1}N^{-1/2}\sum_{i=1}^N \left\{d_i(\beta) - BG^{-1} t_i(\alpha) \right\} +o_{\rm p}(1)\text{.} \] Hence $$N^{1/2} (\hat\beta_{\rm {PL1}} - \beta) \to N(0, V)$$ in distribution as $$N\to\infty$$, where $$V=A^{-1}(A-B G^{-1} B^{\mathrm{\scriptscriptstyle T}})A^{-1}$$. It is obvious that $$V \leq A^{-1}$$.

Theorem 1. Under the conditions of Theorem 3 in Tang et al. (2003), the estimator $$\hat\beta_{\rm {PL2}}$$ has the asymptotic representation  \[ N^{1/2} (\hat\beta_{\rm {PL2}} - \beta) =A^{-1}N^{-1/2} \sum_{i=1}^N \left[ d_i(\beta) - E\{D_i(\beta)\mid x_i\} \right] +o_{\rm p}(1), \] so $$N^{1/2} (\hat\beta_{\rm {PL2}} - \beta) \to N(0, U)$$ in distribution as $$N\to\infty$$, where  $$ U = A^{-1} E\bigl( \bigl[D_i(\beta) - E\{D_i(\beta)\mid X_i\}\bigr]^{\otimes2}\bigr) A^{-1}\text{.} $$

Theorem 1 implies the following result.

Corollary 1. The estimator $$\hat\beta_{\rm {PL2}}$$ is more efficient than $$\hat\beta_{\rm {PL1}}$$ and hence more efficient than $$\hat\beta_{\rm {PL0}}$$.

Although other nonparametric methods can be used to estimate $$g(x)$$ and thus obtain alternative estimators of $$\beta$$, doing so cannot yield an estimator more efficient than $$\hat\beta_{\rm {PL2}}$$. To see this, let $$\hat g(x)$$ denote the empirical estimator of $$g(x)$$, which results in $$\hat\beta_{\rm {PL2}}$$, and let $$\tilde g(x)$$ be an alternative consistent estimator of $$g(x)$$ using data $$X_1, \dots, X_N$$, which gives rise to an alternative estimator of $$\beta$$, denoted by $$\tilde\beta_{\rm {alt}}$$. The derivation in Theorem 1 yields   \begin{align*} A N^{1/2}(\tilde\beta_{\rm {alt}}-\beta)&=N^{-1/2}\sum_{i=1}^N \left[ d_i(\beta) - E\{D_i(\beta) \mid x_i\} \right]\\ & \quad +N^{-1/2}\sum_{i=1}^N r_i\!\left\{\log\!\int\! f(\,y_i\mid x;\beta)\hat g(x)\,{\rm d}x -\log\!\int\!
f(\,y_i\mid x;\beta)\tilde g(x)\,{\rm d}x\!\right\}\! +o_{\rm p}(1)\text{.} \end{align*} We write $$r_i\log\int f(\,y_i\mid x;\beta) g(x)\,{\rm d}x$$ as $$\alpha(g,y_i,r_i)$$. As regular asymptotically linear estimators of $$\alpha(g,y_i,r_i)$$ based on $$X_1, \dots, X_N$$, $$\:\alpha(\hat g,y_i,r_i)$$ and $$\alpha(\tilde g,y_i,r_i)$$ satisfy   \begin{eqnarray*} N^{1/2}\{\alpha(\hat g,y_i,r_i)-\alpha(g,y_i,r_i)\}&=& N^{-1/2}\sum_{j=1}^N \phi_1(x_j,y_i,r_i)+o_{\rm p}(1),\\ N^{1/2}\{\alpha(\tilde g,y_i,r_i)-\alpha(g,y_i,r_i)\}&=& N^{-1/2}\sum_{j=1}^N \phi_2(x_j,y_i,r_i)+o_{\rm p}(1) \end{eqnarray*} (Huber, 1981) for some influence functions $$\phi_1(X,y,r)$$ and $$\phi_2(X,y,r)$$, which we inspect in order to compare estimation efficiency. Here $$E\{\phi_1(X,y,r)\} =E\{\phi_2(X,y,r)\}=0$$. Therefore   \begin{align*} &N^{-1/2}\sum_{i=1}^N r_i\left\{\log\int f(\,y_i\mid x;\beta)\hat g(x)\,{\rm d}x -\log\int f(\,y_i\mid x;\beta)\tilde g(x)\,{\rm d}x\right\}\\ &\quad =N^{-3/2} \sum_{i=1}^N\sum_{j=1}^N\{\phi_1(x_j,y_i,r_i) -\phi_2(x_j,y_i,r_i)\}+o_{\rm p}(1)\\ &\quad =N^{-1/2}\sum_{i=1}^N E\{\phi_1(x_i,Y,R) -\phi_2(x_i,Y,R)\}+o_{\rm p}(1), \end{align*} where the expectation in the last line is taken over an independent copy $$(Y,R)$$ with $$x_i$$ held fixed, and in the last step we have used the zero-mean properties of $$\phi_1$$ and $$\phi_2$$. This leads to   $$ A N^{1/2}(\tilde\beta_{\rm {alt}}-\beta) =N^{-1/2}\sum_{i=1}^N \big[d_i(\beta) - E\{D_i(\beta)\mid x_i\} + E\{\phi_1(x_i,Y,R) -\phi_2(x_i,Y,R)\}\big]+o_{\rm p}(1)\text{.} $$ Using the technique in the proof of Corollary 1, we obtain   \begin{align*} &E\!\left( \big[ D_i(\beta) - E\{D_i(\beta)\mid X_i\} + E\{\phi_1(X_i,Y,R) -\phi_2(X_i,Y,R)\mid X_i\}\big]^{\otimes2}\right)\\ &\quad =E\!\left( \big[D_i(\beta) - E\{D_i(\beta)\mid X_i\}\big]^{\otimes2}\right)+E\!\left(\big[E\{\phi_1(X_i,Y,R) -\phi_2(X_i,Y,R)\mid X_i\}\big]^{\otimes2}\right)\\ &\quad \ge E\!\left( \big[D_i(\beta) - E\{D_i(\beta)\mid X_i\}\big]^{\otimes2}\right)\!, \end{align*} so $$\tilde\beta_{\rm {alt}}$$ is also less efficient than $$\hat\beta_{\rm {PL2}}$$. 
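To fix ideas, the following sketch computes $$\hat\beta_{\rm {PL2}}$$ by directly maximizing (3) in a toy one-parameter normal model. The model, the missingness rule $$R=I(Y>0)$$, and the coarse grid search over a neighbourhood of the truth are our own illustrative simplifications; in practice one would solve the score equation displayed in the Appendix with a Newton-type algorithm.

```python
# Minimal sketch of computing beta_hat_PL2 in a toy model of our own:
# Y = beta*X + e with e ~ N(0, 1), X ~ N(0, 1), and nonignorable missingness
# R = I(Y > 0), which satisfies assumption (1). We maximize the complete-case
# conditional loglikelihood (3), with g estimated by the empirical
# distribution of all N covariate values, by a crude grid search.
import numpy as np

rng = np.random.default_rng(1)
N, beta_true = 500, 2.0
x = rng.normal(size=N)
y = beta_true * x + rng.normal(size=N)
r = y > 0.0                              # R depends on Y only

def phi(z):                              # standard normal density
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def loglik_pl2(beta):
    yo, xo = y[r], x[r]                  # complete cases
    # log f(y_i | x_i; beta) - log{ N^{-1} sum_j f(y_i | x_j; beta) }
    dens = phi(yo[:, None] - beta * x[None, :])
    return np.sum(np.log(phi(yo - beta * xo)) - np.log(dens.mean(axis=1)))

grid = np.linspace(1.0, 3.0, 201)        # search near the truth, for brevity
beta_pl2 = grid[np.argmax([loglik_pl2(b) for b in grid])]
print(beta_pl2)                          # should lie near beta_true = 2.0
```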
Therefore the pseudolikelihood estimator $$\hat\beta_{\rm {PL2}}$$ is superior to any other estimator based on a parametric or nonparametric estimate of $$g(x)$$.

3. Extension to more general missingness mechanisms

The missing data mechanism in (1) assumes that, given $$Y$$, $$R$$ and $$X$$ are conditionally independent. Although reasonable in many situations, this is not always true. For instance, in a randomized clinical trial comparing a treatment with a placebo, the dichotomous treatment indicator may influence the missingness. Consider a very simple scenario where $$Y$$ denotes the outcome, $$T$$ the binary treatment indicator and $$Z$$ the covariate. We are interested in the unknown parameters in $$f(y\mid t, z; \beta)$$. Compared with (1), it is safer to assume that   \begin{equation} \hbox{pr}(R=1\mid Y,T,Z) = \hbox{pr}(R=1\mid Y,T)\text{.} \label{eq:assume2} \end{equation} (4) Under (4), the methods in § 2 still apply. To see this, similar to the idea in § 1, the unknown parameter $$\beta$$ can be estimated based on the conditional likelihood   \begin{eqnarray*} & &\prod_{i=1}^n \frac{f(y_i\mid t_i, z_i; \beta)g(z_i\mid t_i)h(t_i)}{\int f(y_i\mid t_i, z; \beta)g(z\mid t_i)h(t_i)\,{\rm d}z}\\ &&\quad = \prod_{\substack{i=1\\t_i=0}}^n \frac{f(y_i\mid t_i=0, z_i; \beta)g(z_i\mid t_i=0)}{\int f(y_i\mid t_i=0, z; \beta)g(z\mid t_i=0)\,{\rm d}z} \prod_{\substack{i=1\\t_i=1}}^n \frac{f(y_i\mid t_i=1, z_i; \beta)g(z_i\mid t_i=1)}{\int f(y_i\mid t_i=1, z; \beta)g(z\mid t_i=1)\,{\rm d}z}\text{.} \end{eqnarray*} Thus the same reasoning and derivation can be applied, and we can show that the estimator for $$\beta$$ with $$g(Z\mid t)$$ estimated by its empirical version under $$T=0$$ and $$T=1$$ separately, i.e., the analogue of $$\hat\beta_{\rm {PL2}}$$ in § 2, is again optimal among the three estimators. 
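The stratified estimator under (4) can be sketched as follows, with $$g(z\mid t)$$ replaced by the empirical distribution of $$Z$$ within each treatment arm. The outcome model, the arm-specific detection thresholds, and the device of treating the treatment effect as known are our own illustrative simplifications, not part of the development above.

```python
# Sketch of the stratified estimator under mechanism (4): the missingness
# probability depends on (Y, T) through arm-specific thresholds, and the
# conditional likelihood is formed within each arm with g(z | t) estimated
# empirically from that arm. For brevity the treatment effect gamma is
# treated as known; only the slope beta is estimated.
import numpy as np

rng = np.random.default_rng(2)
N, beta_true, gamma = 1000, 2.0, 0.5
t = rng.integers(0, 2, size=N)             # binary treatment indicator
z = rng.normal(size=N)                     # fully observed covariate
y = gamma * t + beta_true * z + rng.normal(size=N)
c = np.where(t == 1, 1.0, 0.0)             # arm-specific detection limits
r = y > c                                  # pr(R = 1 | Y, T) = I(Y > c_T)

phi = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def cond_loglik(beta):
    total = 0.0
    for arm in (0, 1):
        za = z[t == arm]                   # empirical g(z | t) within the arm
        obs = r & (t == arm)
        yo, zo = y[obs], z[obs]
        mu = gamma * arm
        dens = phi((yo - mu)[:, None] - beta * za[None, :])
        total += np.sum(np.log(phi(yo - mu - beta * zo))
                        - np.log(dens.mean(axis=1)))
    return total

grid = np.linspace(1.0, 3.0, 201)
beta_strat = grid[np.argmax([cond_loglik(b) for b in grid])]
print(beta_strat)
```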
We now generalize the assumption (1) to the case where the missing data indicator $$R$$ and some components in the covariates $$X$$, say $$Z$$, are conditionally independent given $$Y$$ and the remaining components of $$X$$, say $$T$$; that is,   \begin{equation} \hbox{pr}(R=1\mid Y,T,Z) = \hbox{pr}(R=1\mid Y,T)\text{.} \label{eq:assume3} \end{equation} (5) Here, the covariates are represented by $$X=(T^{\mathrm{\scriptscriptstyle T}},Z^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}$$, and $$Z$$ is called a nonresponse instrument (Zhao & Shao, 2015) or a shadow variable (Miao & Tchetgen Tchetgen, 2016). The objective function becomes   \begin{equation} \prod_{i=1}^N \left\{\frac{f(\,y_i\mid t_i, z_i; \beta)g(z_i\mid t_i)h(t_i)}{\int f(\,y_i\mid t_i, z; \beta)g(z\mid t_i) h(t_i) \,{\rm d}z}\right\}^{r_i} =\prod_{i=1}^N \left\{\frac{f(\,y_i\mid t_i, z_i; \beta)g(z_i\mid t_i)}{\int f(\,y_i\mid t_i, z; \beta)g(z\mid t_i)\,{\rm d} z}\right\}^{r_i}\!\text{.}\label{eq:zt} \end{equation} (6) The distribution of $$Z$$ conditional on $$T$$ in (6) poses extra challenges. The truth, a parametric or nonparametric estimator of $$g(Z\mid T)$$, can be incorporated into the estimation, resulting in $$\hat\beta_{\rm {PL0}}$$, $$\hat\beta_{\rm {PL1}}$$ or $$\hat\beta_{\rm {PL2}}$$. Theory similar to that in § 2 can be developed and leads to the same optimality of $$\hat\beta_{\rm {PL2}}$$. The nonignorable missing data mechanism assumption is usually difficult to specify or verify (d’Haultfoeuille, 2010), but a nonresponse instrument $$Z$$ is often available and assumption (5) is often reasonable. For example, in a study of children’s mental health (Zahner et al., 1992), investigators were interested in evaluating the prevalence of children with abnormal psychopathological status based on their teacher’s assessment, $$Y$$, which was subject to missing values. 
A missing teacher report may be related to the teacher’s assessment of the student even after adjusting for fully observed covariates $$T$$ such as physical health of the child and parental status of the household (Ibrahim et al., 2001). A separate parental report on the psychopathology of the child was also available for all children in the study. Such a report is likely to be highly correlated with that of the teacher, but is unlikely to be correlated with the teacher’s response status conditional on the teacher’s assessment of that student. Therefore, the parental assessment constitutes a valid nonresponse instrument and assumption (5) is reasonable.

Acknowledgement

We thank the editor, associate editor and three referees for their constructive comments, which have led to a significantly improved paper. This work was partially supported by the National Center for Advancing Translational Sciences of the U.S. National Institutes of Health and the U.S. National Science Foundation.

Supplementary material

Supplementary material available at Biometrika online contains some simulation results.

Appendix

Proof of Theorem 1. If we estimate $$g(x)$$ with its empirical distribution, we approximate $$\int f(\,y\mid x; \beta)g(x)\,{\rm d}x$$ by its sample average, i.e., $$\int f(\,y\mid x; \beta)g(x)\,{\rm d}x\approx N^{-1}\sum_{i=1}^N f(\,y\mid x_i; \beta)$$. Also, $$\int s(x,y,\beta) f(\,y\mid x;\beta)g(x)\,{\rm d}x\approx N^{-1}\sum_{i=1}^N s(x_i,y,\beta) f(\,y\mid x_i;\beta)=N^{-1}\sum_{i=1}^N\partial f(\,y\mid x_i;\beta)/\partial\beta $$. 
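The sample-average approximation above is easy to check numerically. In the illustrative special case, chosen by us, where $$f(\,y\mid x;\beta)$$ is the normal density with mean $$\beta x$$ and unit variance and $$g$$ is standard normal, the integral equals the $$N(0,1+\beta^2)$$ density evaluated at $$y$$.

```python
# Numerical check of the approximation
#   int f(y | x; beta) g(x) dx  ~  N^{-1} sum_i f(y | x_i; beta).
# Illustrative special case: f(y|x; beta) = N(beta*x, 1) density, g = N(0, 1),
# so the integral is the N(0, 1 + beta^2) density at y.
import numpy as np

rng = np.random.default_rng(3)
beta, y0 = 2.0, 0.7
x = rng.normal(size=200_000)               # draws from g

phi = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
approx = phi(y0 - beta * x).mean()         # N^{-1} sum_i f(y0 | x_i; beta)

s2 = 1.0 + beta**2                         # marginal variance of Y
exact = np.exp(-0.5 * y0**2 / s2) / np.sqrt(2.0 * np.pi * s2)
print(approx, exact)                       # the two agree closely
```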
To obtain $$\hat\beta_{\rm {PL2}}$$, we maximize (3), which is equivalent to maximizing   \begin{eqnarray*} \frac1N \sum_{j=1}^N r_j \left[ \log\,f(\,y_j\mid x_j;\beta) -\log\left\{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i,\beta)\right\}\right]\!\text{.} \end{eqnarray*} Thus, by the mean value theorem, there must exist some $$\beta^*$$ lying between $$\hat\beta_{\rm {PL2}}$$ and $$\beta$$ such that   \begin{eqnarray*} 0 &=&N^{-1/2}\sum_{j=1}^N r_j \left\{s_j(\hat\beta_{\rm {PL2}}) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\hat\beta_{\rm {PL2}}) f(\,y_j\mid x_i;\hat\beta_{\rm {PL2}})}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\hat\beta_{\rm {PL2}}) }\right\}\\ &=&N^{-1/2}\sum_{j=1}^N r_j \left\{s_j(\beta) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\beta) f(\,y_j\mid x_i;\beta)}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta) }\right\}\\ &&+\,N^{-1}\sum_{j=1}^N r_j \frac{\partial}{\partial{\beta^*}^{\mathrm{\scriptscriptstyle T}}}\left\{s_j({\beta^*}) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,{\beta^*}) f(\,y_j\mid x_i;{\beta^*})}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;{\beta^*}) }\right\}N^{1/2}(\hat\beta_{\rm {PL2}}-\beta)\\ &=&N^{-1/2}\sum_{j=1}^N r_j \left\{s_j(\beta) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\beta) f(\,y_j\mid x_i;\beta)}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta) }\right\} -\{A+o_{\rm p}(1)\}N^{1/2}(\hat\beta_{\rm {PL2}}-\beta)\text{.} \end{eqnarray*} Now   \begin{eqnarray} &&N^{-1/2}\sum_{j=1}^N r_j \left\{s_j(\beta) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\beta) f(\,y_j\mid x_i;\beta)}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta) }\right\}\\ &&= N^{-1/2} \sum_{j=1}^N r_j \left\{ s_j(\beta) - \frac{\int s(x,y_j,\beta) f(\,y_j\mid x;\beta)g(x)\,{\rm d}x}{ \int f(\,y_j\mid x;\beta) g(x)\,{\rm d}x} \right\} \nonumber\\ && \quad - \,N^{-1/2} \sum_{j=1}^N r_j \frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\beta) f(\,y_j\mid x_i;\beta) }{ \int f(\,y_j\mid x;\beta) g(x)\,{\rm d}x }\nonumber\\ &&\quad+\,N^{-1/2}\sum_{j=1}^N r_j \frac{\int s(x,y_j,\beta) f(\,y_j\mid x;\beta)g(x)\,{\rm d}x}{ \{\int 
f(\,y_j\mid x;\beta) g(x)\,{\rm d}x\}^2}\, N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta) +o_{\rm p}(1)\\ &&=N^{-1/2}\sum_{i=1}^N d_i(\beta) -N^{-3/2}\sum_{j=1}^N \sum_{i=1}^N r_j \left[\frac{ s(x_i,y_j,\beta) -E\{S_j(\beta)\mid y_j\}}{\int f(\,y_j\mid x;\beta) g(x)\,{\rm d}x}\right] f(\,y_j\mid x_i;\beta)+o_{\rm p}(1)\\ &&=N^{-1/2}\sum_{i=1}^N d_i(\beta) - N^{-1/2}\sum_{i=1}^N u(x_i,\beta) + o_{\rm p}(1)\label{eq:expansion} \end{eqnarray} (A1) where, using the decomposition and representation techniques related to V-statistics (Serfling, 1980; Shao, 2003), we have   \begin{eqnarray*} u(x_i,\beta) &=& \int r[s(x_i,y,\beta)-E\{S(\beta)\mid Y=y\}]\frac{f(\,y\mid x_i;\beta)}{\int f(\,y\mid x;\beta)g(x)\,{\rm d}x}\, p(r,y)\,{\rm d}\mu(r)\,{\rm d}\mu(y)\\ &=& \int r[s(x_i,y,\beta)-E\{S(\beta)\mid Y=y\}]\frac{p(x_i\mid y)}{p(x_i)}\,p(r,y)\,{\rm d}\mu(r)\,{\rm d}\mu(y)\\ &=& \int r[s(x_i,y,\beta)-E\{S(\beta)\mid Y=y\}]\frac{p(x_i\mid y,r)}{p(x_i)}\,p(r,y)\,{\rm d}\mu(r)\,{\rm d}\mu(y)\\ &=& \int r[s(x_i,y,\beta)-E\{S(\beta)\mid Y=y\}]\,p(r,y\mid x_i)\,{\rm d}\mu(r)\,{\rm d}\mu(y)\\ &=& E\bigl(R_i[ S_i(\beta) -E\{S_i(\beta)\mid Y_i\}]\mid X_i=x_i\bigr)\\ &=& E\{D_i(\beta)\mid x_i\}\text{.} \end{eqnarray*} Here the third equality follows from assumption (1), under which $$R$$ and $$X$$ are conditionally independent given $$Y$$. Substituting this into (A1), we get   \begin{eqnarray*} &&N^{-1/2}\sum_{j=1}^N r_j \left\{s_j(\beta) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\beta) f(\,y_j\mid x_i;\beta)}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta) }\right\}\\ &&\quad =N^{-1/2}\sum_{i=1}^N \left[d_i(\beta) - E\left\{D_i(\beta)\mid x_i\right\}\right] +o_{\rm p}(1)\text{.} \end{eqnarray*} Hence $$N^{1/2}(\hat\beta_{\rm {PL2}}-\beta)\to N(0, U)$$ in distribution as $$N\rightarrow \infty$$, where   \begin{equation*} U = A^{-1} E\bigl([D_i(\beta) - E\{D_i(\beta)\mid X_i\}]^{\otimes2}\bigr) A^{-1}\text{.} \end{equation*} □

Proof of Corollary 1. 
To prove that $$\hat\beta_{\rm {PL2}}$$ is more efficient than $$\hat\beta_{\rm {PL1}}$$, note that   \begin{eqnarray*} &&E\big[\{D_i(\beta) - BG^{-1}T_i(\alpha)\}^{\otimes2}\big] =E\bigl(\big[D_i(\beta) - E\{D_i(\beta)\mid X_i\} + E\{D_i(\beta)\mid X_i\} - BG^{-1}T_i(\alpha) \big]^{\otimes2}\bigr) \\ &&\quad = E\bigl(\big[D_i(\beta) - E\{D_i(\beta)\mid X_i\}\big]^{\otimes2}\bigr) + E\bigl(\big[E\{D_i(\beta)\mid X_i\} - BG^{-1}T_i(\alpha) \big]^{\otimes2}\bigr) \\ && \quad \ge E\bigl(\big[ D_i(\beta) - E\{D_i(\beta)\mid X_i\}\big]^{\otimes2}\bigr), \end{eqnarray*} where the second equality comes from the fact that   \begin{eqnarray*} E\bigl( \big[D_i(\beta) - E\{D_i(\beta)\mid X_i\}\big] \big[E\{D_i(\beta)\mid X_i\} - BG^{-1}T_i(\alpha) \big]^{\mathrm{\scriptscriptstyle T}}\bigr) = 0\text{.} \end{eqnarray*} Since $$\hat\beta_{\rm {PL1}}$$ is more efficient than $$\hat\beta_{\rm {PL0}}$$ by Theorem 2 of Tang et al. (2003), it follows that $$\hat\beta_{\rm {PL2}}$$ is also more efficient than $$\hat\beta_{\rm {PL0}}$$. □

References

Brown, C. H. (1990). Protecting against nonrandomly missing data in longitudinal studies. Biometrics 46, 143–55.

Caldwell, K. L., Jones, R. L., Verdon, C. P., Jarrett, J. M., Caudill, S. P. & Osterloh, J. D. (2009). Levels of urinary total and speciated arsenic in the US population: National Health and Nutrition Examination Survey 2003–2004. J. Expos. Sci. Envir. Epidemiol. 19, 59–68.

Carter, R. L., Wrabetz, L., Jalal, K., Orsini, J. J., Barczykowski, A. L., Matern, D. & Langan, T. J. (2016). Can psychosine and galactocerebrosidase activity predict early-infantile Krabbe’s disease presymptomatically? J. Neurosci. Res. 94, 1084–93.

Chen, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics 63, 413–21.

Chen, K. (2001). Parametric models for response-biased sampling. J. R. Statist. Soc. B 63, 775–89.

Deville, J.-C. (2000). 
Generalized calibration and application to weighting for non-response. In Proceedings in Computational Statistics: 14th Symposium held in Utrecht, The Netherlands, 2000. Heidelberg: Springer, pp. 65–76.

Deville, J.-C. & Särndal, C. E. (1992). Calibration estimators in survey sampling. J. Am. Statist. Assoc. 87, 376–82.

d’Haultfoeuille, X. (2010). A new instrumental method for dealing with endogenous selection. J. Economet. 154, 1–15.

Gong, G. & Samaniego, F. J. (1981). Pseudo maximum likelihood estimation: Theory and applications. Ann. Statist. 9, 861–9.

Hopke, P. K., Liu, C. & Rubin, D. B. (2001). Multiple imputation for multivariate data with missing and below-threshold measurements: Time-series concentrations of pollutants in the Arctic. Biometrics 57, 22–33.

Huber, P. J. (1981). Robust Statistics. New York: Wiley.

Ibrahim, J. G., Lipsitz, S. R. & Horton, N. (2001). Using auxiliary data for parameter estimation with non-ignorably missing outcomes. Appl. Statist. 50, 361–73.

Kim, J. K. & Yu, C. L. (2011). A semiparametric estimation of mean functionals with nonignorable missing data. J. Am. Statist. Assoc. 106, 157–65.

Kott, P. (2014). Calibration weighting when model and calibration variables can differ. In Contributions to Sampling Statistics. Cham: Springer International Publishing, pp. 1–18.

Liang, K.-Y. & Qin, J. (2000). Regression analysis under non-standard situations: A pairwise pseudolikelihood approach. J. R. Statist. Soc. B 62, 773–86.

Miao, W. & Tchetgen Tchetgen, E. J. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika 103, 475–82. 
Moulton, L. H., Curriero, F. C. & Barroso, P. F. (2002). Mixture models for quantitative HIV RNA data. Statist. Meth. Med. Res. 11, 317–25.

Navas-Acien, A., Silbergeld, E. K., Pastor-Barriuso, R. & Guallar, E. (2008). Arsenic exposure and prevalence of type 2 diabetes in US adults. J. Am. Med. Assoc. 300, 814–22.

Parke, W. R. (1986). Pseudo maximum likelihood estimation: The asymptotic distribution. Ann. Statist. 14, 355–7.

Richardson, D. B. & Ciampi, A. (2003). Effects of exposure measurement error when an exposure variable is constrained by a lower limit. Am. J. Epidemiol. 157, 355–63.

Schisterman, E. F., Vexler, A., Whitcomb, B. W. & Liu, A. (2006). The limitations due to exposure detection limits for regression models. Am. J. Epidemiol. 163, 374–83.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: Wiley.

Shao, J. (2003). Mathematical Statistics, 2nd ed. New York: Springer.

Shao, J. & Wang, L. (2016). Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika 103, 175–87.

Tang, G., Little, R. J. & Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika 90, 747–64.

Zahner, G. E., Pawelkiewicz, W., DeFrancesco, J. J. & Adnopoz, J. (1992). Children’s mental health service needs and utilization patterns in an urban community: An epidemiological assessment. J. Am. Acad. Child Adolesc. Psychiat. 31, 951–60.

Zhao, J. & Shao, J. (2015). Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. J. Am. 
Statist. Assoc. 110, 1577–90.

© 2018 Biometrika Trust


Biometrika, Advance Article, 28 February 2018

Publisher: Oxford University Press
ISSN: 0006-3444
eISSN: 1464-3510
DOI: 10.1093/biomet/asy007

Abstract

SUMMARY Tang et al. (2003) considered a regression model with missing response, where the missingness mechanism depends on the value of the response variable and hence is nonignorable. They proposed three pseudolikelihood estimators, based on different treatments of the probability distribution of the completely observed covariates. The first assumes the distribution of the covariate to be known, the second estimates this distribution parametrically, and the third estimates the distribution nonparametrically. While it is not hard to show that the second estimator is more efficient than the first, Tang et al. (2003) only conjectured that the third estimator is more efficient than the first two. In this paper, we investigate the asymptotic behaviour of the third estimator by deriving a closed-form representation of its asymptotic variance. We then prove that the third estimator is more efficient than the other two. Our result can be straightforwardly applied to missingness mechanisms that are more general than that in Tang et al. (2003). 1. Introduction Tang et al. (2003) considered multivariate regression analysis of a $$q$$-dimensional response $$Y$$ on a $$p$$-dimensional covariate $$X$$, with the joint density function of $$(Y,X)$$ factorized as $$p(y,x) = f(\,y\mid x; \beta) g(x)$$, where $$g(x)$$ represents the marginal density function of $$X$$ and the estimation of $$\beta$$ is of main interest. They considered the situation where $$X$$ is fully observed but $$Y$$ has missing values. Let $$R=1$$ if $$Y$$ is completely observed and $$R=0$$ otherwise. Tang et al. 
(2003) assumed that the missing data mechanism depends only on the underlying value of the response $$Y$$ and hence is nonignorable,   \begin{equation} \hbox{pr}(R=1 \mid Y, X) = \hbox{pr}(R=1 \mid Y)\text{.} \label{eq:assume} \end{equation} (1) They proposed estimators built on the fact that $$R$$ and $$X$$ are conditionally independent given $$Y$$, so the completely observed subjects form a random sample from the distribution of $$X$$ given $$Y$$. The missing data mechanism (1) has been widely adopted in response-biased sampling (Brown, 1990; Liang & Qin, 2000; Chen, 2001) and is relevant in many applications. For example, Chen (2001) studied a case of a univariate response $$Y$$ and a multivariate covariate $$X$$, where the observed data form a nonrandom sample from $$(X, Y)$$, with the sampling probability depending only on $$Y$$. Therefore, the observed data can be viewed as a random sample from the distribution of $$X$$ given $$Y$$, instead of from the original regression model $$f(\,y\mid x; \beta)$$ (Chen, 2001). Assumption (1) is also sensible in other situations. For example, when evaluating a new biomarker in analytical chemistry, scientists usually encounter a laboratory quality control limit, the so-called detection limit (Navas-Acien et al., 2008; Caldwell et al., 2009; Carter et al., 2016), defined as the lowest concentration of analyte distinguishable from the background noise. Although theoretically available, concentration values below the detection limit are usually not released by laboratories. When the concentration value is the outcome of interest and needs to be regressed against covariates $$X$$, assumption (1) is satisfied, with $$\hbox{pr}(R=1 \mid Y, X) = I(Y>c)=\hbox{pr}(R=1 \mid Y)$$ where $$c$$ is the detection limit. Eliminating observations with values below the detection limit can lead to severe bias; see Hopke et al. (2001), Moulton et al. (2002), Richardson & Ciampi (2003), Schisterman et al. (2006) and the references therein. 
Other examples where (1) is valid include survey sampling (Deville & Särndal, 1992; Deville, 2000; Kott, 2014), case-control studies (Chen, 2007), and some survival analysis contexts. Tang et al. (2003) also discussed some extensions of assumption (1). The novelty of the idea in Tang et al. (2003) has led to recent developments such as Kim & Yu (2011), Zhao & Shao (2015), Shao & Wang (2016) and Miao & Tchetgen Tchetgen (2016). In this paper, we first present our results under assumption (1) and then show that they can be straightforwardly applied to more general missingness mechanisms.

The estimation of $$\beta$$ is based on independent and identically distributed observations $$(Y_i, X_i, R_i=1)$$ for $$i=1, \dots, n$$ and $$(X_i, R_i=0)$$ for $$i=n+1, \ldots, N$$. Based on (1), we estimate $$\beta$$ by maximizing the conditional likelihood of $$X$$ given $$Y$$ based on the complete observations,   \[ \prod_{i=1}^n p(x_i\mid y_i) = \prod_{i=1}^n \frac{f(\,y_i\mid x_i; \beta)g(x_i)}{\int f(\,y_i\mid x; \beta)g(x)\,{\rm d}x} = \prod_{i=1}^N \left\{ \frac{f(\,y_i\mid x_i; \beta)g(x_i)}{\int f(\,y_i\mid x; \beta)g(x)\,{\rm d}x} \right\}^{r_i}\!\text{.} \] Equivalently, we estimate $$\beta$$ by maximizing the complete-case conditional loglikelihood   \begin{equation} l(\beta, g) = \frac1N \sum_{i=1}^N r_i \left\{\log\,f(\,y_i\mid x_i;\beta) - \log \int f(\,y_i\mid x;\beta)g(x)\,{\rm d}x\right\}\!, \label{eq:loglike} \end{equation} (2) which contains $$g(x)$$, the unspecified probability density function of $$X$$. If the true $$g$$, $$\,g_0$$, is known, the corresponding estimator of $$\beta$$, denoted by $$\hat\beta_{\rm {PL0}}$$, is the maximizer of $$l(\beta, g_0)$$. If $$g$$ is unknown, with $$X$$ fully observed, any appropriate complete-data technique can be applied to estimate $$g$$. Tang et al. (2003) considered two situations and obtained two pseudolikelihood estimators of $$\beta$$ (Gong & Samaniego, 1981; Parke, 1986).
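As a toy numerical sketch of maximizing (2) when $$g_0$$ is known, consider the hypothetical model $$Y\mid X\sim N(\beta X,1)$$ with $$X\sim N(0,1)$$ and $$\hbox{pr}(R=1\mid Y)=I(Y>0)$$; none of these choices come from the paper, and in this special case $$\int f(\,y\mid x;\beta)g_0(x)\,{\rm d}x$$ is the $$N(0,1+\beta^2)$$ density in closed form.

```python
# Toy sketch of beta_PL0: maximize l(beta, g0) of (2) with g0 known.
# Hypothetical model: Y | X ~ N(beta*X, 1), X ~ N(0, 1), pr(R=1 | Y) = I(Y > 0),
# so int f(y|x; beta) g0(x) dx is the N(0, 1 + beta^2) density.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
N, beta_true = 50_000, 1.5
x = rng.normal(size=N)
y = beta_true * x + rng.normal(size=N)
r = y > 0                         # nonignorable: missingness depends on Y only

def neg_l(beta):
    yr, xr = y[r], x[r]
    logf = -0.5 * (yr - beta * xr) ** 2                      # log f(y|x; beta), up to constants
    logint = -0.5 * np.log(1 + beta ** 2) - 0.5 * yr ** 2 / (1 + beta ** 2)
    return -np.mean(logf - logint)                           # minus l(beta, g0)

beta_pl0 = minimize_scalar(neg_l, bounds=(0.1, 5.0), method="bounded").x
print(round(beta_pl0, 1))         # close to beta_true
```

Only the complete cases contribute to the criterion, yet the estimator recovers the true slope because the conditional likelihood of $$X$$ given $$Y$$ remains valid under (1).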
The first is when a parametric model $$g(x;\alpha)$$ is adopted and the full-data maximum likelihood estimator is used to obtain $$\hat\alpha$$. Then $$g(x;\hat \alpha)$$ is used to replace $$g(x)$$ in (2), which leads to the estimator $$\hat\beta_{\rm {PL1}}$$, the maximizer of $$l\{\beta,g(x;\hat \alpha)\}$$. The second is when $$g(x)$$ is unspecified and its cumulative distribution function is estimated by the empirical distribution function $$\hat G_{\rm N}(x)$$. This gives the estimator $$\hat\beta_{\rm {PL2}}$$, the maximizer of   \begin{equation} l(\beta, \hat G_{\rm N}) = \frac1N \sum_{i=1}^N r_i \left\{\log\,f(\,y_i\mid x_i;\beta) - \log \int f(\,y_i\mid x;\beta) \,{\rm d}\hat G_{\rm N}(x)\right\}\!\text{.} \label{eq:loglikenon} \end{equation} (3) An interesting issue is the relative efficiency of $$\hat\beta_{\rm {PL0}}$$, $$\hat\beta_{\rm {PL1}}$$ and $$\hat\beta_{\rm {PL2}}$$. Theorem 2 of Tang et al. (2003) established the asymptotic normality of $$\hat\beta_{\rm {PL1}}$$ and showed that $$\hat\beta_{\rm {PL1}}$$ is more efficient than $$\hat\beta_{\rm {PL0}}$$. However, the authors did not give an explicit expression for the asymptotic variance of $$\hat\beta_{\rm {PL2}}$$, so they could not provide a theoretical efficiency comparison involving $$\hat\beta_{\rm {PL2}}$$. Based on simulation studies, they conjectured that $$\hat\beta_{\rm {PL2}}$$ is more efficient than both $$\hat\beta_{\rm {PL0}}$$ and $$\hat\beta_{\rm {PL1}}$$. We derive the asymptotic variance of $$\hat\beta_{\rm {PL2}}$$ in closed form and prove that $$\hat\beta_{\rm {PL2}}$$ is more efficient than $$\hat\beta_{\rm {PL1}}$$ and $$\hat\beta_{\rm {PL0}}$$; thus we establish the correctness of the conjecture in Tang et al. (2003) and provide a clear explanation of their numerical observations. We also show that, in general, no other method of estimating $$g(x)$$ can lead to a more efficient estimator of $$\beta$$ than $$\hat\beta_{\rm {PL2}}$$, which is recommended for use in practice.

2.
Asymptotic distribution and optimality

We use uppercase letters to denote random variables and lowercase letters to denote their realizations. We let $$S(X, Y; \beta)={\partial\log\,f(\,Y \mid X;\beta)}/{\partial\beta}$$ and $$s(x, y; \beta)={\partial\log\,f(\,y \mid x;\beta)}/{\partial\beta}$$. Sometimes we also write $$S_i(\beta)={\partial\log\,f(\,Y_i\mid X_i;\beta)}/{\partial\beta}$$ and $$s_i(\beta)={\partial\log\,f(\,y_i\mid x_i;\beta)}/{\partial\beta}$$. We define $$T_i(\alpha)={\partial\log g(X_i;\alpha)}/{\partial\alpha}$$, $$t_i(\alpha)={\partial\log g(x_i;\alpha)}/{\partial\alpha}$$ and $$G=-E\{{\partial T_i(\alpha)}/{\partial{\alpha}^{\mathrm{\scriptscriptstyle T}}}\} =E\{T_i(\alpha) T_i(\alpha)^{\mathrm{\scriptscriptstyle T}}\}$$. We let $$A = E[R_i \hbox{var}\{S_i(\beta)\mid Y_i\}]$$ and $$B = E[R_i \,\mbox{cov}\{S_i(\beta), T_i(\alpha) \mid Y_i\}]$$. We also write $$d_i (\beta) = r_i[s_i(\beta) - E\{S_i(\beta)\mid y_i\}]$$ and $$D_i (\beta) = R_i[S_i(\beta) - E\{S_i(\beta)\mid Y_i\}]$$.

In this section, we establish the asymptotic distribution of $$\hat\beta_{\rm {PL2}}$$ and briefly describe the asymptotic distributions of $$\hat\beta_{\rm {PL0}}$$ and $$\hat\beta_{\rm {PL1}}$$. The results for $$\hat\beta_{\rm {PL0}}$$ and $$\hat\beta_{\rm {PL1}}$$ can be found in Theorem 2 of Tang et al. (2003). Recall that $$\hat\beta_{\rm {PL0}}$$ is the maximizer of $$l(\beta, g_0)$$. It is straightforward to show that   \[ N^{1/2} (\hat\beta_{\rm {PL0}} - \beta) = A^{-1} N^{-1/2}\sum_{i=1}^N d_i(\beta) + o_{\rm p}(1), \] so $$N^{1/2} (\hat\beta_{\rm {PL0}} - \beta) \to N(0, A^{-1})$$ in distribution as $$N\to\infty$$. The estimator $$\hat\beta_{\rm {PL1}}$$ is the maximizer of $$l\{\beta, g(x;\hat\alpha)\}$$.
Because the maximum likelihood estimate of $$\alpha$$ satisfies $$N^{1/2}(\hat\alpha-\alpha)=G^{-1}N^{-1/2}\sum_{i=1}^N t_i(\alpha) +o_{\rm p}(1)$$,   \[ N^{1/2} (\hat\beta_{\rm {PL1}} - \beta) =A^{-1}N^{-1/2}\sum_{i=1}^N \left\{d_i(\beta) - BG^{-1} t_i(\alpha) \right\} +o_{\rm p}(1)\text{.} \] Hence $$N^{1/2} (\hat\beta_{\rm {PL1}} - \beta) \to N(0, V)$$ in distribution as $$N\to\infty$$, where $$V=A^{-1}(A-B G^{-1} B^{\mathrm{\scriptscriptstyle T}})A^{-1}$$. It is obvious that $$V \leq A^{-1}$$.

Theorem 1. Under the conditions of Theorem 3 in Tang et al. (2003), the estimator $$\hat\beta_{\rm {PL2}}$$ has the asymptotic representation  \[ N^{1/2} (\hat\beta_{\rm {PL2}} - \beta) =A^{-1}N^{-1/2} \sum_{i=1}^N \left[ d_i(\beta) - E\{D_i(\beta)\mid x_i\} \right] +o_{\rm p}(1), \]so $$N^{1/2} (\hat\beta_{\rm {PL2}} - \beta) \to N(0, U)$$ in distribution as $$N\to\infty$$, where  $$ U = A^{-1} E\bigl( \bigl[D_i(\beta) - E\{D_i(\beta)\mid X_i\}\bigr]^{\otimes2}\bigr) A^{-1}\text{.} $$

Theorem 1 implies the following.

Corollary 1. $$\hat\beta_{\rm {PL2}}$$ is more efficient than $$\hat\beta_{\rm {PL1}}$$ and hence more efficient than $$\hat\beta_{\rm {PL0}}$$.

Although other nonparametric methods can be used to estimate $$g(x)$$ and thus obtain alternative estimators of $$\beta$$, doing so cannot yield an estimator more efficient than $$\hat\beta_{\rm {PL2}}$$. To see this, let $$\hat g(x)$$ denote the empirical estimator of $$g(x)$$, which results in $$\hat\beta_{\rm {PL2}}$$, and let $$\tilde g(x)$$ be an alternative consistent estimator of $$g(x)$$ using data $$X_1, \dots, X_N$$, which gives rise to an alternative estimator of $$\beta$$, denoted by $$\tilde\beta_{\rm {alt}}$$. The derivation in Theorem 1 yields   \begin{align*} A N^{1/2}(\tilde\beta_{\rm {alt}}-\beta)&=N^{-1/2}\sum_{i=1}^N \left[ d_i(\beta) - E\{D_i(\beta) \mid x_i\} \right]\\ & \quad +N^{-1/2}\sum_{i=1}^N r_i\!\left\{\log\!\int\! f(\,y_i\mid x;\beta)\hat g(x)\,{\rm d}x -\log\!\int\!
f(\,y_i\mid x;\beta)\tilde g(x)\,{\rm d}x\!\right\}\! +o_{\rm p}(1)\text{.} \end{align*} We write $$r_i\log\int f(\,y_i\mid x;\beta) g(x)\,{\rm d}x$$ as $$\alpha(g,y_i,r_i)$$. As regular asymptotically linear estimators of $$\alpha(g,y_i,r_i)$$ based on $$X_1, \dots, X_N$$, $$\:\alpha(\hat g,y_i,r_i)$$ and $$\alpha(\tilde g,y_i,r_i)$$ satisfy   \begin{eqnarray*} N^{1/2}\{\alpha(\hat g,y_i,r_i)-\alpha(g,y_i,r_i)\}&=& N^{-1/2}\sum_{j=1}^N \phi_1(x_j,y_i,r_i)+o_{\rm p}(1),\\ N^{1/2}\{\alpha(\tilde g,y_i,r_i)-\alpha(g,y_i,r_i)\}&=& N^{-1/2}\sum_{j=1}^N \phi_2(x_j,y_i,r_i)+o_{\rm p}(1) \end{eqnarray*} (Huber, 1981) for some influence functions $$\phi_1(X,y,r)$$ and $$\phi_2(X,y,r)$$, which we inspect in order to compare estimation efficiency. Here $$E\{\phi_1(X,y,r)\} =E\{\phi_2(X,y,r)\}=0$$. Therefore   \begin{align*} &N^{-1/2}\sum_{i=1}^N r_i\left\{\log\int f(\,y_i\mid x;\beta)\hat g(x)\,{\rm d}x -\log\int f(\,y_i\mid x;\beta)\tilde g(x)\,{\rm d}x\right\}\\ &\quad =N^{-3/2} \sum_{i=1}^N\sum_{j=1}^N\{\phi_1(x_j,y_i,r_i) -\phi_2(x_j,y_i,r_i)\}+o_{\rm p}(1)\\ &\quad =N^{-1/2}\sum_{i=1}^N E\{\phi_1(x_i,Y,R) -\phi_2(x_i,Y,R)\}+o_{\rm p}(1), \end{align*} where in the last step we have used the zero-mean properties of $$\phi_1$$ and $$\phi_2$$. This leads to   $$ A N^{1/2}(\tilde\beta_{\rm {alt}}-\beta) =N^{-1/2}\sum_{i=1}^N \big[d_i(\beta) - E\{D_i(\beta)\mid x_i\} + E\{\phi_1(x_i,Y,R) -\phi_2(x_i,Y,R)\}\big]+o_{\rm p}(1)\text{.} $$ Using the same technique as in the proof of Corollary 1, we obtain   \begin{align*} &E\!\left( \big[ d_i(\beta) - E\{D_i(\beta)\mid x_i\} + E\{\phi_1(x_i,Y,R) -\phi_2(x_i,Y,R)\}\big]^{\otimes2}\right)\\ &\quad =E\!\left( \big[d_i(\beta) - E\{D_i(\beta)\mid x_i\}\big]^{\otimes2}\right)+E\!\left(\big[E\{\phi_1(x_i,Y,R) -\phi_2(x_i,Y,R)\}\big]^{\otimes2}\right)\\ &\quad \ge E\!\left( \big[d_i(\beta) - E\{D_i(\beta)\mid x_i\}\big]^{\otimes2}\right)\!, \end{align*} so $$\tilde\beta_{\rm {alt}}$$ is no more efficient than $$\hat\beta_{\rm {PL2}}$$.
Therefore the pseudolikelihood estimator $$\hat\beta_{\rm {PL2}}$$ is superior to any other parametrically or nonparametrically based estimator.

3. Extension to more general missingness mechanisms

The missing data mechanism in (1) assumes that given $$Y$$, $$\,R$$ and $$X$$ are conditionally independent. Although reasonable in many situations, this is not always true. For instance, in a randomized clinical trial comparing a treatment with a placebo, the dichotomous treatment indicator may influence the missingness. Consider a very simple scenario where $$Y$$ denotes the outcome, $$T$$ the binary treatment indicator and $$Z$$ the covariate. We are interested in the unknown parameters in $$f(y\mid t, z; \beta)$$. Compared with (1), a more cautious assumption is   \begin{equation} \hbox{pr}(R=1\mid Y,T,Z) = \hbox{pr}(R=1\mid Y,T)\text{.} \label{eq:assume2} \end{equation} (4) Under (4), the methods in § 2 still apply. To see this, similar to the idea in § 1, the unknown parameter $$\beta$$ can be estimated based on the conditional likelihood   \begin{eqnarray*} & &\prod_{i=1}^n \frac{f(y_i\mid t_i, z_i; \beta)g(z_i\mid t_i)h(t_i)}{\int f(y_i\mid t_i, z; \beta)g(z\mid t_i)h(t_i)\,{\rm d}z}\\ &&\quad = \prod_{\substack{i=1\\t_i=0}}^n \frac{f(y_i\mid t_i=0, z_i; \beta)g(z_i\mid t_i=0)}{\int f(y_i\mid t_i=0, z; \beta)g(z\mid t_i=0)\,{\rm d}z} \prod_{\substack{i=1\\t_i=1}}^n \frac{f(y_i\mid t_i=1, z_i; \beta)g(z_i\mid t_i=1)}{\int f(y_i\mid t_i=1, z; \beta)g(z\mid t_i=1)\,{\rm d}z}\text{.} \end{eqnarray*} Thus the same reasoning and derivation can be applied, and we can show that the estimator for $$\beta$$ with $$g(Z\mid t)$$ estimated by its empirical version under $$T=0$$ and $$T=1$$ separately, i.e., the $$\hat\beta_{\rm {PL2}}$$ version in § 2, is also optimal among the three estimators.
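The stratified empirical-distribution idea can be sketched as follows. The toy specification below is hypothetical, not from the paper: $$Y\mid T,Z\sim N\{\beta(T+Z),1\}$$ with missingness $$I(Y>0)$$, and within each treatment arm the integral over $$z$$ is replaced by a sample average over the $$z$$-values observed in that arm.

```python
# Sketch of the stratified beta_PL2 objective under (4): g(z | t) is estimated
# by its empirical version within each arm of the binary T.
# Toy model (illustrative): Y | T, Z ~ N(beta*(T + Z), 1), pr(R=1 | Y, T) = I(Y > 0).
import numpy as np

rng = np.random.default_rng(3)
N, beta_true = 4_000, 1.5
t = rng.integers(0, 2, size=N)            # binary treatment indicator
z = rng.normal(size=N)                    # fully observed covariate
y = beta_true * (t + z) + rng.normal(size=N)
r = y > 0                                 # response observed only when Y > 0

def logf(yj, tj, zv, beta):               # log f(y | t, z; beta), up to constants
    return -0.5 * (yj - beta * (tj + zv)) ** 2

def neg_l(beta):
    total, n = 0.0, 0
    for arm in (0, 1):
        in_arm = t == arm
        z_arm = z[in_arm]                 # empirical g(z | T = arm): all z in this arm
        for yj, zj in zip(y[in_arm & r], z[in_arm & r]):
            log_int = np.log(np.mean(np.exp(logf(yj, arm, z_arm, beta))))
            total += logf(yj, arm, zj, beta) - log_int
            n += 1
    return -total / n

print(neg_l(beta_true) < neg_l(0.5))      # the true beta gives the better fit
```

The same construction extends directly to the nonresponse-instrument setting of the next paragraph, with $$T$$ playing the role of the covariates on which the missingness may depend.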
We now generalize assumption (1) to the case where the missing data indicator $$R$$ and some components in the covariates $$X$$, say $$Z$$, are conditionally independent given $$Y$$ and the remaining components of $$X$$, say $$T$$; that is,   \begin{equation} \hbox{pr}(R=1\mid Y,T,Z) = \hbox{pr}(R=1\mid Y,T)\text{.} \label{eq:assume3} \end{equation} (5) Here, the covariates are represented by $$X=(T^{\mathrm{\scriptscriptstyle T}},Z^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}$$, and $$Z$$ is called a nonresponse instrument (Zhao & Shao, 2015) or a shadow variable (Miao & Tchetgen Tchetgen, 2016). The objective function becomes   \begin{equation} \prod_{i=1}^N \left\{\frac{f(\,y_i\mid t_i, z_i; \beta)g(z_i\mid t_i)h(t_i)}{\int f(\,y_i\mid t_i, z; \beta)g(z\mid t_i) h(t_i) \,{\rm d}z}\right\}^{r_i} =\prod_{i=1}^N \left\{\frac{f(\,y_i\mid t_i, z_i; \beta)g(z_i\mid t_i)}{\int f(\,y_i\mid t_i, z; \beta)g(z\mid t_i)\,{\rm d} z}\right\}^{r_i}\!\text{.}\label{eq:zt} \end{equation} (6) The distribution of $$Z$$ conditional on $$T$$ in (6) poses extra challenges. The true $$g(Z\mid T)$$, or a parametric or nonparametric estimator of it, can be incorporated into the estimation, resulting in $$\hat\beta_{\rm {PL0}}$$, $$\hat\beta_{\rm {PL1}}$$ or $$\hat\beta_{\rm {PL2}}$$, respectively. Theory similar to that in § 2 can be developed and leads to the same optimality of $$\hat\beta_{\rm {PL2}}$$. The nonignorable missing data mechanism assumption is usually difficult to specify or verify (d’Haultfoeuille, 2010), but a nonresponse instrument $$Z$$ is often available and assumption (5) is often reasonable. For example, in a study of children’s mental health (Zahner et al., 1992), investigators were interested in evaluating the prevalence of children with abnormal psychopathological status based on their teacher’s assessment, $$Y$$, which was subject to missing values.
A missing teacher report may be related to the teacher’s assessment of the student even after adjusting for fully observed covariates $$T$$ such as physical health of the child and parental status of the household (Ibrahim et al., 2001). A separate parental report on the psychopathology of the child was also available for all children in the study. Such a report is likely to be highly correlated with that of the teacher, but is unlikely to be correlated with the teacher’s response status conditional on the teacher’s assessment of that student. Therefore, the parental assessment constitutes a valid nonresponse instrument and assumption (5) is reasonable.

Acknowledgement

We thank the editor, associate editor and three referees for their constructive comments, which have led to a significantly improved paper. This work was partially supported by the National Center for Advancing Translational Sciences of the U.S. National Institutes of Health and the U.S. National Science Foundation.

Supplementary material

Supplementary material available at Biometrika online contains some simulation results.

Appendix

Proof of Theorem 1. If we estimate $$g(x)$$ with its empirical distribution, we approximate $$\int f(\,y\mid x; \beta)g(x)\,{\rm d}x$$ by its sample average, i.e., $$\int f(\,y\mid x; \beta)g(x)\,{\rm d}x\approx N^{-1}\sum_{i=1}^N f(\,y\mid x_i; \beta)$$. Also, $$\int s(x,y,\beta) f(\,y\mid x;\beta)g(x)\,{\rm d}x\approx N^{-1}\sum_{i=1}^N s(x_i,y,\beta) f(\,y\mid x_i;\beta)=N^{-1}\sum_{i=1}^N\partial f(\,y\mid x_i;\beta)/\partial\beta $$.
To obtain $$\hat\beta_{\rm {PL2}}$$, we maximize (3), which is equivalent to maximizing   \begin{eqnarray*} \frac1N \sum_{j=1}^N r_j \left[ \log\,f(\,y_j\mid x_j;\beta) -\log\left\{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta)\right\}\right]\!\text{.} \end{eqnarray*} Thus, by the mean value theorem, there must exist some $$\beta^*$$ lying between $$\hat\beta_{\rm {PL2}}$$ and $$\beta$$ such that   \begin{eqnarray*} 0 &=&N^{-1/2}\sum_{j=1}^N r_j \left\{s_j(\hat\beta_{\rm {PL2}}) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\hat\beta_{\rm {PL2}}) f(\,y_j\mid x_i;\hat\beta_{\rm {PL2}})}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\hat\beta_{\rm {PL2}}) }\right\}\\ &=&N^{-1/2}\sum_{j=1}^N r_j \left\{s_j(\beta) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\beta) f(\,y_j\mid x_i;\beta)}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta) }\right\}\\ &&+\,N^{-1}\sum_{j=1}^N r_j \frac{\partial}{\partial{\beta^*}^{\mathrm{\scriptscriptstyle T}}}\left\{s_j({\beta^*}) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,{\beta^*}) f(\,y_j\mid x_i;{\beta^*})}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;{\beta^*}) }\right\}N^{1/2}(\hat\beta_{\rm {PL2}}-\beta)\\ &=&N^{-1/2}\sum_{j=1}^N r_j \left\{s_j(\beta) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\beta) f(\,y_j\mid x_i;\beta)}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta) }\right\} -\{A+o_{\rm p}(1)\}N^{1/2}(\hat\beta_{\rm {PL2}}-\beta)\text{.} \end{eqnarray*} Now   \begin{eqnarray} &&N^{-1/2}\sum_{j=1}^N r_j \left\{s_j(\beta) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\beta) f(\,y_j\mid x_i;\beta)}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta) }\right\}\\ &&= N^{-1/2} \sum_{j=1}^N r_j \left\{ s_j(\beta) - \frac{\int s(x,y_j,\beta) f(\,y_j\mid x;\beta)g(x)\,{\rm d}x}{ \int f(\,y_j\mid x;\beta) g(x)\,{\rm d}x} \right\} \nonumber\\ && \quad - \,N^{-1/2} \sum_{j=1}^N r_j \frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\beta) f(\,y_j\mid x_i;\beta) }{ \int f(\,y_j\mid x;\beta) g(x)\,{\rm d}x }\nonumber\\ &&\quad+\,N^{-1/2}\sum_{j=1}^N r_j \frac{\int s(x,y_j,\beta) f(\,y_j\mid x;\beta)g(x)\,{\rm d}x}{ \{\int 
f(\,y_j\mid x;\beta) g(x)\,{\rm d}x\}^2}\, N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta) +o_{\rm p}(1)\\ &&=N^{-1/2}\sum_{i=1}^N d_i(\beta) -N^{-3/2}\sum_{j=1}^N \sum_{i=1}^N r_j \left[\frac{ s(x_i,y_j,\beta) -E\{S_j(\beta)\mid y_j\}}{\int f(\,y_j\mid x;\beta) g(x)\,{\rm d}x}\right] f(\,y_j\mid x_i;\beta)+o_{\rm p}(1)\\ &&=N^{-1/2}\sum_{i=1}^N d_i(\beta) - N^{-1/2}\sum_{i=1}^N u(x_i,\beta) + o_{\rm p}(1)\label{eq:expansion} \end{eqnarray} (A1) where, using the decomposition and representation techniques related to V-statistics (Serfling, 1980; Shao, 2003), we have   \begin{eqnarray*} u(x_i,\beta) &=& \int r[s(x_i,y,\beta)-E\{S(\beta)\mid Y=y\}]\frac{f(\,y\mid x_i;\beta)}{\int f(\,y\mid x;\beta)g(x)\,{\rm d}x}\, p(r,y)\,{\rm d}\mu(r)\,{\rm d}\mu(y)\\ &=& \int r[s(x_i,y,\beta)-E\{S(\beta)\mid Y=y\}]\frac{p(x_i\mid y)}{p(x_i)}\,p(r,y)\,{\rm d}\mu(r)\,{\rm d}\mu(y)\\ &=& \int r[s(x_i,y,\beta)-E\{S(\beta)\mid Y=y\}]\frac{p(x_i\mid y,r)}{p(x_i)}\,p(r,y)\,{\rm d}\mu(r)\,{\rm d}\mu(y)\\ &=& \int r[s(x_i,y,\beta)-E\{S(\beta)\mid Y=y\}]\,p(r,y\mid x_i)\,{\rm d}\mu(r)\,{\rm d}\mu(y)\\ &=& E\bigl(R_i[ S_i(\beta) -E\{S_i(\beta)\mid Y_i\}]\mid X_i=x_i\bigr)\\ &=& E\{D_i(\beta)\mid x_i\}\text{.} \end{eqnarray*} Substituting this into (A1), we get   \begin{eqnarray*} &&N^{-1/2}\sum_{j=1}^N r_j \left\{s_j(\beta) -\frac{ N^{-1}\sum_{i=1}^N s(x_i,y_j,\beta) f(\,y_j\mid x_i;\beta)}{ N^{-1}\sum_{i=1}^N f(\,y_j\mid x_i;\beta) }\right\}\\ &&\quad =N^{-1/2}\sum_{i=1}^N \left[d_i(\beta) - E\left\{D_i(\beta)\mid x_i\right\}\right] +o_{\rm p}(1)\text{.} \end{eqnarray*} Hence $$N^{1/2}(\hat\beta_{\rm {PL2}}-\beta)\to N(0, U)$$ in distribution as $$N\rightarrow \infty$$, where   \begin{equation*} U = A^{-1} E\bigl([D_i(\beta) - E\{D_i(\beta)\mid X_i\}]^{\otimes2}\bigr) A^{-1}\text{.} \end{equation*} □

Proof of Corollary 1.
To prove that $$\hat\beta_{\rm {PL2}}$$ is more efficient than $$\hat\beta_{\rm {PL1}}$$, note that   \begin{eqnarray*} &&E\big[\{d_i(\beta) - BG^{-1}t_i(\alpha)\}^{\otimes2}\big] =E\bigl(\big[d_i(\beta) - E\{D_i(\beta)\mid X_i\} + E\{D_i(\beta)\mid X_i\} - BG^{-1}t_i(\alpha) \big]^{\otimes2}\bigr) \\ &&\quad = E\bigl(\big[d_i(\beta) - E\{D_i(\beta)\mid X_i\}\big]^{\otimes2}\bigr) + E\bigl(\big[E\{D_i(\beta)\mid X_i\} - BG^{-1}t_i(\alpha) \big]^{\otimes2}\bigr) \\ && \quad \ge E\bigl(\big[ d_i(\beta) - E\{D_i(\beta)\mid X_i\}\big]^{\otimes2}\bigr), \end{eqnarray*} where the second equality comes from the fact that   \begin{eqnarray*} E\bigl( \big[d_i(\beta) - E\{D_i(\beta)\mid X_i\}\big] \big[E\{D_i(\beta)\mid X_i\} - BG^{-1}t_i(\alpha) \big]^{\mathrm{\scriptscriptstyle T}}\bigr) = 0\text{.} \end{eqnarray*} Therefore $$\hat\beta_{\rm {PL2}}$$ is more efficient than $$\hat\beta_{\rm {PL1}}$$ and hence, by Theorem 2 of Tang et al. (2003), also more efficient than $$\hat\beta_{\rm {PL0}}$$. □

References

Brown C. H. (1990). Protecting against nonrandomly missing data in longitudinal studies. Biometrics 46, 143–55.

Caldwell K. L., Jones R. L., Verdon C. P., Jarrett J. M., Caudill S. P. & Osterloh J. D. (2009). Levels of urinary total and speciated arsenic in the US population: National Health and Nutrition Examination Survey 2003–2004. J. Expos. Sci. Envir. Epidemiol. 19, 59–68.

Carter R. L., Wrabetz L., Jalal K., Orsini J. J., Barczykowski A. L., Matern D. & Langan T. J. (2016). Can psychosine and galactocerebrosidase activity predict early-infantile Krabbe’s disease presymptomatically? J. Neurosci. Res. 94, 1084–93.

Chen H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics 63, 413–21.

Chen K. (2001). Parametric models for response-biased sampling. J. R. Statist. Soc. B 63, 775–89.

Deville J.-C. (2000).
Generalized calibration and application to weighting for non-response. In Proceedings in Computational Statistics: 14th Symposium held in Utrecht, The Netherlands, 2000. Heidelberg: Springer, pp. 65–76.

Deville J.-C. & Särndal C. E. (1992). Calibration estimators in survey sampling. J. Am. Statist. Assoc. 87, 376–82.

d’Haultfoeuille X. (2010). A new instrumental method for dealing with endogeneous selection. J. Economet. 154, 1–15.

Gong G. & Samaniego F. J. (1981). Pseudo maximum likelihood estimation: Theory and applications. Ann. Statist. 9, 861–9.

Hopke P. K., Liu C. & Rubin D. B. (2001). Multiple imputation for multivariate data with missing and below-threshold measurements: Time-series concentrations of pollutants in the Arctic. Biometrics 57, 22–33.

Huber P. J. (1981). Robust Statistics. New York: Wiley.

Ibrahim J. G., Lipsitz S. R. & Horton N. (2001). Using auxiliary data for parameter estimation with non-ignorably missing outcomes. Appl. Statist. 50, 361–73.

Kim J. K. & Yu C. L. (2011). A semiparametric estimation of mean functionals with nonignorable missing data. J. Am. Statist. Assoc. 106, 157–65.

Kott P. (2014). Calibration weighting when model and calibration variables can differ. In Contributions to Sampling Statistics. Cham: Springer International Publishing, pp. 1–18.

Liang K.-Y. & Qin J. (2000). Regression analysis under non-standard situations: A pairwise pseudolikelihood approach. J. R. Statist. Soc. B 62, 773–86.

Miao W. & Tchetgen Tchetgen E. J. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika 103, 475–82.
Moulton L. H., Curriero F. C. & Barroso P. F. (2002). Mixture models for quantitative HIV RNA data. Statist. Meth. Med. Res. 11, 317–25.

Navas-Acien A., Silbergeld E. K., Pastor-Barriuso R. & Guallar E. (2008). Arsenic exposure and prevalence of type 2 diabetes in US adults. J. Am. Med. Assoc. 300, 814–22.

Parke W. R. (1986). Pseudo maximum likelihood estimation: The asymptotic distribution. Ann. Statist. 14, 355–7.

Richardson D. B. & Ciampi A. (2003). Effects of exposure measurement error when an exposure variable is constrained by a lower limit. Am. J. Epidemiol. 157, 355–63.

Schisterman E. F., Vexler A., Whitcomb B. W. & Liu A. (2006). The limitations due to exposure detection limits for regression models. Am. J. Epidemiol. 163, 374–83.

Serfling R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: Wiley.

Shao J. (2003). Mathematical Statistics, 2nd ed. New York: Springer.

Shao J. & Wang L. (2016). Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika 103, 175–87.

Tang G., Little R. J. & Raghunathan T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika 90, 747–64.

Zahner G. E., Pawelkiewicz W., DeFrancesco J. J. & Adnopoz J. (1992). Children’s mental health service needs and utilization patterns in an urban community: An epidemiological assessment. J. Am. Acad. Child Adolesc. Psychiat. 31, 951–60.

Zhao J. & Shao J. (2015). Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. J. Am.
Statist. Assoc. 110, 1577–90.

© 2018 Biometrika Trust. This article is published and distributed under the terms of the Oxford University Press Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).

Biometrika, Oxford University Press. Published: Feb 28, 2018.