# Local polynomial regression with correlated errors in random design and unknown correlation structure

## Summary

Automated or data-driven bandwidth selection methods tend to break down in the presence of correlated errors. While this problem has previously been studied in the fixed design setting for kernel regression, the results were applicable only when there is knowledge about the correlation structure. This article generalizes these results to the random design setting and addresses the problem in situations where no prior knowledge about the correlation structure is available. We establish the asymptotic optimality of our proposed bandwidth selection criterion based on kernels $$K$$ satisfying $$K(0)=0$$.

## 1. Introduction

Assume that we observe random variables $$(X_i,Y_i)_{i=1}^{n}$$, where $$Y_i$$ takes values in $$\mathbb{R}$$ and $$X_i$$ takes values in $$\Omega\subseteq \mathbb{R}$$ with a common density $$f$$. The density $$f$$ is compactly supported, bounded and continuous with $$f(x)>0$$ for all $$x$$. Consider the model \begin{equation*} Y_i = m(X_i)+ e_i \quad (i=1,\ldots,n), \end{equation*} where $$m$$ is an unknown smooth regression function and the $$e_i$$ are unobserved random variables such that
$${E}(e_i\mid X_i) = 0, \quad {\mathrm{cov}}(e_i,e_j\mid X_i,X_j)=\sigma^2\rho_n(X_i-X_j) \quad (i,j=1,\ldots,n), \tag{1}$$
with $$\sigma^2 <\infty$$ and $$\rho_n$$ a stationary correlation function satisfying $$\rho_n(0)=1$$, $$\rho_n(x)=\rho_n(-x)$$ and $$|\rho_n(x)| \leq 1$$ for all $$x$$. The subscript $$n$$ allows the correlation function $$\rho_n$$ to shrink as $$n\rightarrow\infty$$.

Many kernel-based regression methods require the tuning or selection of a smoothing parameter, often referred to as the bandwidth. This introduces arbitrariness into the estimation procedure, so many bandwidth selection methods have been developed (Fan & Gijbels, 1996; Konishi & Kitagawa, 2008).
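As an illustrative sketch (not code from the paper), data from model (1) can be simulated with the shrinking exponential correlation function $$\rho_n(x)=\exp(-\alpha n|x|)$$ introduced later in § 2; the regression function `m` is the one used in the simulations of § 5, and the values of `n`, `sigma` and `alpha` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, alpha = 200, 0.5, 5.0            # illustrative values, not from the paper
m = lambda x: 2 * np.sin(6 * x) + (x + 1) ** 2   # regression function from Section 5

X = np.sort(rng.uniform(0.0, 1.0, n))      # random design on [0, 1]
# Stationary correlation rho_n(x) = exp(-alpha * n * |x|), shrinking with n
R = np.exp(-alpha * n * np.abs(X[:, None] - X[None, :]))
e = rng.multivariate_normal(np.zeros(n), sigma ** 2 * R)  # correlated errors
Y = m(X) + e
```

Because the covariance of the errors is $$\sigma^2\rho_n(X_i-X_j)$$ conditional on the design, the errors are drawn jointly from a multivariate normal with that covariance matrix.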
Diggle & Hutchinson (1989) and Hart (1991) have shown that these methods perform rather poorly under model (1), since they typically require the errors to be independent and identically distributed. Several modifications have been proposed to account for short-range correlation (Opsomer et al., 2001; Francisco-Fernández et al., 2004). The plug-in method of Francisco-Fernández et al. (2004) assumes that the errors are generated from an autoregressive model of order 1, which can be hard to verify in practice; however, the authors performed a sensitivity analysis suggesting that their method is reasonably robust against misspecification of the correlation model. Lee et al. (2010) proposed two criteria that approximate the averaged squared error when the errors are correlated and the design is fixed. Although their method does not require a parametric form for the correlation structure, it relies on estimation of the autocorrelation at different lags via the method proposed by Park et al. (2006). Gijbels et al. (1999) considered exponential smoothing and established its relation with nonparametric regression in fixed designs; they assumed that the errors are realizations of a zero-mean causal autoregressive moving average process. Hall & Van Keilegom (2003) showed that difference-based methods can be used to estimate autoregressive parameters in nonparametric regression with time series errors. Although the problem of correlated errors is well understood and many remedies exist, a pressing challenge is the implementation of these results (Opsomer et al., 2001). Kim et al. (2004) developed a nonparametric method for the related problem of detecting correlated errors: they showed that the first-order sample autocovariance of the residuals from kernel regression indicates whether or not the errors are correlated. Without prior knowledge of the correlation structure, many bandwidth selection procedures are cumbersome to implement.
One of the first steps taken to simplify implementation was proposed by Park et al. (2006), whose estimator of the error correlation is based neither on ordinary crossvalidation, nor on modified crossvalidation (Chu & Marron, 1991), nor on the block bootstrap (Hall et al., 1995). To simplify regression with correlated errors, Kim et al. (2009) showed that bimodal kernels are quite effective even when the errors are heavily correlated. However, three shortcomings remain: prior information about the correlation structure is required, bandwidth selection methods need to be modified to work in this context, and most theoretical results have been established for fixed equispaced designs. In this article we focus on estimating the regression function, not the correlation structure, by local polynomial regression in random designs. As we will show, under mild conditions, neither special crossvalidation nor prior information on the correlation structure is required, as a kernel $$K$$ satisfying $$K(0)=0$$ can effectively remove the effects of correlation. Since all the results are derived for a random design, a natural extension is to consider spatially correlated errors, which would have more applications.

## 2. Kernel smoother and assumptions

If $$m$$ is an unknown $$p+1$$ times continuously differentiable function, then the local polynomial regression estimator of order $$p$$ at an arbitrary point $$x$$ can be found by solving the following weighted least squares problem: minimize $$\sum_{i=1}^n\{Y_i-\sum_{j=0}^p\beta_j(X_i-x)^{\,j}\}^2K_h(X_i-x)$$ with respect to $$\beta=(\beta_0,\ldots,\beta_p)^{{\rm T}}$$.
This yields $$\hat{\beta} = ({X}_x^{{\rm T}}{\rm W}_x{X}_x)^{-1}{X}_x^{{\rm T}}{\rm W}_x{Y} = S_n^{-1}{X}_x^{{\rm T}}{\rm W}_x{Y}$$ as the solution vector, where $${Y}=(Y_1,\ldots,Y_n)^{{\rm T}}$$, $${\rm W}_x$$ is the $$n\times n$$ diagonal matrix of weights $${\mathrm{diag}}\{K_h(X_i-x)\}$$ with kernel $$K$$ and bandwidth $$h$$ such that $$K_h(\cdot)=K(\cdot/h)/h$$, $$S_n={X}_x^{{\rm T}}{\rm W}_x{X}_x$$, and \begin{equation*} {X}_x = \begin{pmatrix} 1 & (X_1-x) & \cdots & (X_1-x)^p \\[-3pt] \vdots & \vdots & & \vdots\\ 1 & (X_n-x) & \cdots & (X_n-x)^p \end{pmatrix}\!\text{.} \end{equation*} The local polynomial estimator for the regression function at an arbitrary point $$x$$ is $$\hat{m}(x) = \varepsilon_{1}^{{\rm T}}\hat{\beta}$$, where $$\varepsilon_{1}=(1,0,\ldots,0)^{{\rm T}}$$ is a unit vector with 1 in the first position. A useful representation of the local polynomial estimator at an arbitrary point $$x$$ is \begin{equation*} \hat{m}(x) = \varepsilon_{1}^{{\rm T}} S_n^{-1}{X}_x^{{\rm T}}{\rm W}_x{Y} = \sum_{i=1}^n W_i^n(x)Y_i, \end{equation*} where $$W_i^n(x)=\varepsilon_{1}^{{\rm T}} S_n^{-1}\{1,X_i-x,\ldots,(X_i-x)^p\}^{{\rm T}}K\{(X_i-x)/h\}/h$$.

Before stating the main theoretical results of the paper, we list the assumptions on which they rest.

Assumption 1. $$m$$ is a $$p+1$$ times continuously differentiable function;

Assumption 2. $$h \rightarrow 0$$ and $$nh \rightarrow \infty$$ as $$n\to \infty$$;

Assumption 3. there exists an $$f_{\textrm{max}}$$ such that $$f(x) < f_{\textrm{max}}$$ for all $$x\in \Omega \subseteq\mathbb{R}$$;

Assumption 4. $$f$$ has bounded support, is continuous, and satisfies $$f(x)>0$$ for all $$x \in \textrm{supp}(f)$$;

Assumption 5. there exists a constant $$K_{\textrm{max}}$$ such that $$|K(x)| < K_{\textrm{max}}$$, and $$K(x)\ge 0$$ for all $$x$$;

Assumption 6. $$K$$ is symmetric and Lipschitz continuous at $$0$$;

Assumption 7. $$\lim_{|u| \to \infty} |u|^l K(u) < \infty$$ for $$l=0,\ldots,p$$;

Assumption 8.
the correlation function $$\rho_n$$ is an element of a sequence $$\{\rho_n\}$$ with the following properties for all $$n$$: there exist constants $$\rho_{\textrm{max}}$$ and $$\rho_{c}$$ such that $$n\int |\rho_n(x)|\,{\rm d} x < \rho_{\textrm{max}}$$ and $$\lim_{n\to\infty}n\int \rho_n(x)\,{\rm d} x = \rho_{c}$$; and for any sequence $$\epsilon_n>0$$ satisfying $$n\epsilon_n \to \infty$$, \begin{equation*} n\int_{|x|\ge \epsilon_n}|\rho_n(x)|\,{\rm d} x \rightarrow 0, \quad n\rightarrow \infty; \end{equation*}

Assumption 9. $$\lim_{n\to\infty}n\int \rho^2_n(x)\,{\rm d} x = \rho_{cc}$$;

Assumption 10. for all $$i,j,k$$ and $$l$$, \begin{align*} {\mathrm{cov}}(e_ie_j,e_ke_l \mid X_i,X_j,X_k,X_l) &= {\mathrm{cov}}(e_i,e_k \mid X_i,X_k){\mathrm{cov}}(e_j,e_l \mid X_j,X_l)\\ &\quad+{\mathrm{cov}}(e_i,e_l \mid X_i,X_l){\mathrm{cov}}(e_j,e_k \mid X_j,X_k)\text{.} \end{align*}

Assumption 4 requires a density $$f$$ with bounded support. Strictly speaking, this excludes distributions defined on the whole of $$\mathbb{R}$$, but our simulations suggest that the proposed method also works when the design points are sampled from densities with infinite support. Assumption 8 implies that the correlation function depends on $$n$$ and that $$\int|\rho_n(x)|\,{\rm d} x$$ must vanish at a rate no slower than $$O(1/n)$$. The second part of Assumption 8 requires the correlation to be short-range. Two correlation functions satisfying Assumption 8 are (Cressie, 1993)
$$\rho_n(x) = \exp(-\alpha\,n|x|), \quad \rho_n(x) = \frac{1}{1+\alpha\,n^2x^2}, \quad \alpha >0\text{.} \tag{2}$$
These common assumptions generalize the equispaced design assumptions in Altman (1990). Assumptions 8 and 10 are satisfied, for example, when the errors have an autoregressive order-one dependence structure and follow a normal distribution.
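As a sketch of the estimator defined in § 2 (a hypothetical helper, not the authors' code), the weighted least squares problem can be solved directly at each point. A useful sanity check: for $$p\ge 1$$ the local polynomial fit reproduces polynomials of degree at most $$p$$ exactly, which the usage below verifies for a linear regression function.

```python
import numpy as np

def local_poly(x0, X, Y, h, p=1,
               K=lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)):
    """Local polynomial estimate m-hat(x0) of order p (hypothetical helper)."""
    # Design matrix X_x with columns 1, (X_i - x0), ..., (X_i - x0)^p
    Xx = np.vander(X - x0, p + 1, increasing=True)
    w = K((X - x0) / h) / h                     # diagonal of W_x
    WXx = Xx * w[:, None]
    beta = np.linalg.solve(Xx.T @ WXx, WXx.T @ Y)   # S_n^{-1} X_x' W_x Y
    return beta[0]                              # first element: m-hat(x0)

# A local linear fit (p = 1) recovers a linear regression function exactly
X = np.linspace(0.0, 1.0, 50)
Y = 2.0 + 3.0 * X
print(local_poly(0.5, X, Y, h=0.2, p=1))        # close to 2 + 3 * 0.5 = 3.5
```

The default Gaussian kernel here is an arbitrary choice; any kernel satisfying the assumptions above can be passed in, including the $$K(0)=0$$ kernels discussed in § 4.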
## 3. Main theoretical results

We first establish the bias and variance of the local polynomial regression estimator under our design assumptions. The bias is not related to the correlation structure and therefore is the same as in the independent error case. In what follows, we only give the bias for $$p$$ odd; for the asymptotic bias when $$p$$ is even, see Fan & Gijbels (1996, p. 62). The proofs can be found in the Supplementary Material.

Theorem 1. Assume $$f(x)>0$$, and let $$f(\cdot)$$ and $$m^{(p+1)}(\cdot)$$ be continuous in a neighbourhood of $$x$$. Then under Assumptions 2, 5, 6, 7 and 8, the asymptotic conditional bias and variance for the local polynomial regression estimator of odd order $$p$$ are \begin{eqnarray*} {\mathrm{bias}}\{\hat{m}(x) \mid X_1,\ldots,X_n\} &=& \varepsilon_1^{{\rm T}}S^{-1}\frac{c_p}{(p+1)!}\,m^{(p+1)}(x)h^{\,p+1}+o_{\rm p}(h^{\,p+1})\\ &=& \biggl\{\int t^{\,p+1}K_p^{\star}(t)\,{\rm d} t\biggr\} \frac{1}{(p+1)!}\,m^{(p+1)}(x)h^{\,p+1}+o_{\rm p}(h^{\,p+1}),\\ {\mathrm{var}}\{\hat{m}(x) \mid X_1,\ldots,X_n\} &=& \frac{\sigma^2 \{1+f(x)\rho_c\}}{nhf(x)}\,\varepsilon_1^{{\rm T}}S^{-1}S^{\star}S^{-1}\varepsilon_1 + o_{\rm p}\biggl(\frac{1}{nh}\biggr)\\ &=& \int K_p^{\star \, 2}(t)\,{\rm d} t \, \frac{\sigma^2 \{1+f(x)\rho_c\}}{nhf(x)} + o_{\rm p}\biggl(\frac{1}{nh}\biggr), \end{eqnarray*} where $$S=(\mu_{i+j})_{0\,\leq\,i,\,j\,\leq\, p}$$ with $$\mu_j = \int u^jK(u)\,{\rm d} u$$, $$S^{\star}=(\nu_{i+j})_{0\,\leq\, i,j\,\leq\, p}$$ with $$\nu_j = \int u^{\,j}K^2(u)\,{\rm d} u$$, $$c_p = (\mu_{p+1},\ldots,\mu_{2p+1})^{{\rm T}}$$ and $$K_p^{\star}(t) = \varepsilon_1^{{\rm T}}S^{-1}(1,t,\ldots, t^p)^{{\rm T}} K(t)$$.

The bandwidth $$h_{\textrm{opt}}$$ minimizing the asymptotic mean integrated squared error, i.e., $$\int [{\mathrm{bias}}^2\{\hat{m}(x) \mid X_1,\ldots,X_n\}+{\mathrm{var}}\{\hat{m}(x) \mid X_1,\ldots,X_n\}]\,f(x)\,{\rm d} x$$, readily follows and is given in Corollary 1.
There is a slight but significant difference between the correlated and independent error cases (Fan & Gijbels, 1996, p. 68).

Corollary 1. Under the assumptions of Theorem 1, the asymptotically optimal constant bandwidth for $$p$$ odd is \begin{equation*} h_{\rm opt} = C_p(K)\biggl[\frac{\sigma^2\{\mu(\Omega)+\rho_c\}}{\int\{m^{(p+1)}(x)\}^2f(x)\,{\rm d} x}\biggr]^{1/(2p+3)}n^{-1/(2p+3)}, \end{equation*} where \begin{equation*} C_p(K) = \biggl[\frac{(p+1)!^2\int K^{\star2}_p(t)\,{\rm d} t}{2(p+1)\{\int t^{\,p+1}K^{\star}_p(t)\,{\rm d} t\}^2}\biggr]^{1/(2p+3)} \end{equation*} and $$\mu(\Omega)$$ denotes the Lebesgue measure of the design region $$\Omega \subseteq \mathbb{R}$$.

In what follows, define the residual sum of squares $${\small{\text{RSS}}}(h)=n^{-1} \sum_{i=1}^n\{Y_i-\hat{m}(X_i)\}^2$$ and the asymptotic squared error $${\small{\text{ASE}}}(h)=n^{-1} \sum_{i=1}^{n}\{m(X_i) - \hat{m}(X_i)\}^2$$. Next, we show that $${\small{\text{RSS}}}(h)$$ approximates $${\small{\text{ASE}}}(h)$$ sufficiently well, uniformly over a set of bandwidths. To prove this result we require a slightly stronger condition on the correlation function than that in Assumption 8; the correlation functions in (2), however, also satisfy this stronger condition.

Theorem 2. Under Assumptions 1–10, if $$n^{\delta}\int |\rho_n(t)|\,{\rm d} t < \rho_\delta$$ for $$\delta>1$$, $$p$$ is odd and $$h\in\mathcal{H}_n$$, where $$\mathcal{H}_n=[an^{-1/(2p+3)},bn^{-1/(2p+3)}]$$ with $$0<a<b<\infty$$, then \begin{equation*} {\small{\text{RSS}}}(h) = {\small{\text{ASE}}}(h) + \frac1n \sum_{i=1}^ne_i^2 - \frac{2\sigma^2K(0)S^{00}\{\mu(\Omega)+\rho_c\}}{nh}+o_{\rm p}\{n^{-(2p+2)/(2p+3)}\}, \end{equation*} where $$S^{00}$$ denotes the first element of the first row of the inverse of the matrix $$S$$.

Theorem 3 establishes our proposed bandwidth selection criterion based on the residual sum of squares for the case of correlated data and unknown correlation structure.
Theorem 3 shows that the proposed criterion results in a bandwidth that asymptotically minimizes the asymptotic squared error.

Theorem 3. Suppose that the assumptions of Theorem 2 hold. Let $$\hat{h}_{{\small{\text{RSS}}}}$$ be the minimizer of the residual sum of squares criterion based on a kernel satisfying $$K(0)=0$$ over $$h \in \mathcal{H}_n$$. Further, let $$\hat{h}_{{\small{\text{ASE}}}}$$ be the minimizer of the asymptotic squared error over $$h \in \mathcal{H}_n$$. Then uniformly for $$h\in \mathcal{H}_n$$, \begin{equation*} {\small{\text{RSS}}}(h) = \frac1n \sum_{i=1}^ne_i^2 + {\small{\text{ASE}}}(h)+o_{\rm p}\{n^{-(2p+2)/(2p+3)}\} \end{equation*} and $${\small{\text{ASE}}}(\hat{h}_{{\small{\text{RSS}}}})/{\small{\text{ASE}}}(\hat{h}_{{\small{\text{ASE}}}}) = 1 +o_{\rm p}(1)$$ or, equivalently, \begin{equation*} \frac{{\small{\text{ASE}}}(\hat{h}_{{\small{\text{RSS}}}})}{\inf_{h \in \mathcal{H}_n}{\small{\text{ASE}}}(h)} \to 1 \end{equation*} in probability as $$n\to \infty$$.

Proof. From Theorem 2 and choosing a kernel that satisfies $$K(0)=0$$, it follows that \begin{equation*} {\small{\text{RSS}}}(h) = \frac1n \sum_{i=1}^ne_i^2 + {\small{\text{ASE}}}(h)+o_{\rm p}\!\left\{n^{-(2p+2)/(2p+3)}\right\}\!\text{.} \end{equation*} Next,
\begin{align} {\small{\text{ASE}}}\big(\hat{h}_{{\small{\text{RSS}}}}\big) &= {\small{\text{RSS}}}\big(\hat{h}_{{\small{\text{RSS}}}}\big) - \frac1n \sum_{i=1}^ne_i^2 + o_{\rm p}\big\{n^{-(2p+2)/(2p+3)}\big\} \nonumber \\ &\leq {\small{\text{RSS}}}\big(\hat{h}_{{\small{\text{ASE}}}}\big) - \frac1n \sum_{i=1}^ne_i^2 + o_{\rm p}\big\{n^{-(2p+2)/(2p+3)}\big\}\nonumber \\ &= {\small{\text{ASE}}}\big(\hat{h}_{{\small{\text{ASE}}}}\big) + o_{\rm p}\big\{n^{-(2p+2)/(2p+3)}\big\}\text{.} \tag{3} \end{align}
By the definition of $$\hat{h}_{{\small{\text{ASE}}}}$$ it follows that $${\small{\text{ASE}}}(\hat{h}_{{\small{\text{ASE}}}}) \leq {\small{\text{ASE}}}(\hat{h}_{{\small{\text{RSS}}}})$$.
Asymptotic optimality (Li, 1987) of our bandwidth selection criterion is a consequence of (3) and Theorem 2. □

Theorem 3 states that minimizing $${\small{\text{ASE}}}(h)$$ is asymptotically equivalent to minimizing $${\small{\text{RSS}}}(h)$$ with a kernel satisfying $$K(0)=0$$; nothing about the correlation structure needs to be estimated. This type of kernel successfully removes the effects of correlation and eliminates the need for crossvalidation-based techniques, since it puts zero weight on the point under consideration, automatically creating a leave-one-out crossvalidation scenario.

## 4. Kernel and bandwidth selection in practice

Although kernels satisfying $$K(0)=0$$ are effective in removing the effects of correlation from bandwidth selection, they also introduce extra bias and variability (Kim et al., 2009). An optimal kernel $$\tilde{K}$$ in this framework, ignoring Assumption 6, would be \begin{equation*} \tilde{K}(u) = \begin{cases} \frac{3}{4}(1-u^2), & \quad |u| \leq 1,\ u \neq 0, \\ 0, & \quad u = 0, \end{cases} \end{equation*} the Epanechnikov kernel with a discontinuity at zero; but this discontinuity violates Assumption 6. It is easy to see that an optimal kernel does not exist in the class of kernels satisfying $$K(0)=0$$ and Assumption 6 (Kim et al., 2009). The Epanechnikov kernel is obtained as the kernel $$K$$ that minimizes \begin{equation*} Q_K = \left\{\int u^2K(u)\,{\rm d} u\right\} \left\{\int K^2(u)\,{\rm d} u\right\}^2 \end{equation*} subject to certain constraints. The value of $$Q_K$$ for the Epanechnikov kernel is 0$$\cdot$$072. For $$K(u)= 630(4u^2-1)^2u^4 I_{(|u|\leq 1/2)}$$, the kernel used by Kim et al. (2009), $$Q_K=$$ 0$$\cdot$$374; this kernel is therefore roughly five times less efficient than the Epanechnikov kernel. Next, we introduce two kernels with $$Q_K$$ values closer to the Epanechnikov kernel’s.
The kernels $$K(u) = 2\pi^{-1/2}u^2\exp(-u^2)$$ and $$K(u)=2^{-1} |u|\,\exp(-|u|)$$ have $$Q_K$$ values of 0$$\cdot$$134 and 0$$\cdot$$093, respectively. The three kernels discussed here, shown in Fig. 1, exhibit very different behaviour around zero; all three satisfy Assumption 6. Among these three kernels satisfying $$K(0)=0$$, we recommend $$K(u) = 2\pi^{-1/2}u^2\exp(-u^2)$$, whose smooth trough around zero makes finding the minimum of the residual sum of squares numerically more stable. Although finding an optimal kernel is an interesting question, we do not pursue it here.

Fig. 1. Three kernels satisfying $$K(0)=0$$: (a) $$K(u)= 630(4u^2-1)^2u^4 I_{(|u|\leq 1/2)}$$; (b) $$K(u) = 2\pi^{-1/2}u^2\exp(-u^2)$$; (c) $$K(u)=\frac12 |u|\,\exp(-|u|)$$. Note the different behaviour around zero for these kernels.

Kernels with $$K(0)=0$$ are very effective at removing the unknown correlation, but because they are not optimal they introduce extra mean squared error, producing a more wiggly estimate of the regression function. To counteract this effect, as our final smoothing procedure we propose that the kernel with $$K(0)=0$$ be used to provide a pilot bandwidth $$\hat{h}_{\textrm{bimd}}$$ that can then be related to the bandwidth of a unimodal kernel via a so-called factor rule. To distinguish between a kernel satisfying $$K(0)=0$$ and a unimodal kernel, we denote kernels satisfying $$K(0)=0$$ by $$\bar{K}$$. Both $$K$$ and $$\bar{K}$$ should satisfy Assumption 6.
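The $$Q_K$$ values quoted in this section can be checked by numerical integration; the sketch below (not code from the paper) evaluates $$Q_K = \{\int u^2K\}\{\int K^2\}^2$$ with a midpoint rule on a fine grid. Note that the exact value for $$K(u)=\tfrac12|u|e^{-|u|}$$ is $$6/64 = 0{\cdot}09375$$, which the text truncates to 0$$\cdot$$093.

```python
import numpy as np

def Q(K, lo, hi, n=200_000):
    """Q_K = (int u^2 K du) * (int K^2 du)^2, midpoint-rule integration."""
    du = (hi - lo) / n
    u = lo + (np.arange(n) + 0.5) * du
    k = K(u)
    return (np.sum(u**2 * k) * du) * (np.sum(k**2) * du) ** 2

epan = lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)          # Epanechnikov
kim  = lambda u: 630 * (4*u**2 - 1)**2 * u**4 * (np.abs(u) <= 0.5)
k_b  = lambda u: 2 / np.sqrt(np.pi) * u**2 * np.exp(-u**2)
k_c  = lambda u: 0.5 * np.abs(u) * np.exp(-np.abs(u))

print(round(Q(epan, -1, 1), 3))     # 0.072
print(round(Q(kim, -0.5, 0.5), 3))  # 0.374
print(round(Q(k_b, -10, 10), 3))    # 0.134
print(round(Q(k_c, -40, 40), 4))    # 0.0938 (0.093 in the text)
```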
Corollary 1 suggests that the relation between the bandwidth $$\hat{h}_{\textrm{bimd}}$$ of a kernel $$\bar{K}$$ and the bandwidth $$\hat{h}$$ of a Gaussian kernel $$K$$ is
$$\hat{h} = \frac{C_p(K)}{C_p(\bar{K})}\,\hat{h}_{\textrm{bimd}} =C_p(K,\bar{K})\,\hat{h}_{\textrm{bimd}}, \tag{4}$$
where \begin{equation*} C_p(K,\bar{K}) = \left[\frac{\int K_{p}^{\star2}(t)\,{\rm d} t \, \{\int t^{\,p+1}\bar{K}_{p}^{\star}(t)\,{\rm d} t\}^2}{\int \bar{K}_{p}^{\star2}(t)\,{\rm d} t \, \{\int t^{\,p+1}K_{p}^{\star}(t)\,{\rm d} t\}^2}\right]^{1/(2p+3)}\!\text{.} \end{equation*} The equivalent kernel $$\bar{K}_{p}^{\star}$$ is obtained by substituting $$\bar{K}$$ for $$K$$. Values for $$C_p(K,\bar{K})$$ are given in the next section.

## 5. Simulations

In this section we compare the effects on bandwidth selection of using the Gaussian kernel and using a kernel satisfying $$K(0)=0$$. Since the use of such kernels introduces extra variance, we perform a two-step procedure to reduce the variance. First, we use the kernel $$\bar{K}(u) = 2\pi^{-1/2}u^2\exp(-u^2)$$ to obtain $$\hat{h}_{\textrm{bimd}}$$, which is in fact $$\hat{h}_{{\small{\text{RSS}}}}$$. Second, we exploit the asymptotic relationship between the optimal bandwidths to obtain $$\hat{h}$$, as indicated in expression (4). This requires no extra smoothing, and the relation between the kernel satisfying $$K(0)=0$$ and the Gaussian kernel $$K$$ is given by (4) with $$C_p(K,\bar{K})=1{\cdot}16$$ for local linear regression, $$p=1$$, or $$C_p(K,\bar{K})=1{\cdot}01$$ for local cubic regression, $$p=3$$.
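For $$p=1$$ the equivalent kernel reduces to the kernel itself, since $$\mu_1=0$$ for symmetric kernels, so $$C_1(K,\bar{K}) = [\int K^2\,\{\int t^2\bar{K}\}^2 / (\int \bar{K}^2\,\{\int t^2K\}^2)]^{1/5}$$. The following numerical check (a sketch, not from the paper) recovers the factor 1$$\cdot$$16 quoted above for the Gaussian $$K$$ and $$\bar{K}(u)=2\pi^{-1/2}u^2e^{-u^2}$$.

```python
import numpy as np

def moments(K, lo=-12.0, hi=12.0, n=200_000):
    """Return (int K^2 du, int t^2 K(t) dt) by midpoint integration."""
    du = (hi - lo) / n
    t = lo + (np.arange(n) + 0.5) * du
    k = K(t)
    return np.sum(k**2) * du, np.sum(t**2 * k) * du

gauss = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
kbar  = lambda u: 2 / np.sqrt(np.pi) * u**2 * np.exp(-u**2)

RK, mu2K = moments(gauss)   # int K^2 = 1/(2 sqrt(pi)), int t^2 K = 1
RB, mu2B = moments(kbar)    # int Kbar^2 ~ 0.299,       int t^2 Kbar = 3/2
C1 = (RK * mu2B**2 / (RB * mu2K**2)) ** 0.2   # exponent 1/(2p+3) = 1/5
print(round(C1, 2))          # 1.16
```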
The Epanechnikov kernel, optimal in the $$L_2$$ sense, could have been taken as the unimodal kernel in the second step; but since there is not much difference in efficiency between the Gaussian kernel, with $$Q_K=0{\cdot}08$$, and the Epanechnikov kernel, with $$Q_K =0{\cdot}072$$, and because of the link between the shapes of the kernel satisfying $$K(0)=0$$ and the Gaussian kernel, we chose the Gaussian kernel for the second step. In the simulation study we used local linear regression. We also compare with standard crossvalidation using the Gaussian kernel, ignoring correlation among the errors, and with the plug-in method proposed by Francisco-Fernández et al. (2004), which involves estimating the correlation structure of the errors; it assumes that the design is regular and generated by a density. We use equally spaced designs in the simulations.

Consider the model $$Y_i = m(x_i) + e_i$$, where $$x_i = i/n$$, $$m(x_i) = 2\sin(6x_i) + (x_i+1)^2$$ and $$e_i \sim N(0, 1)$$ with $${\mathrm{cov}}(e_i,e_j) = \sigma^2 \rho(|x_i - x_j|)$$. Consider also two correlation functions: $$\rho_{1}(t) = \exp(-\alpha |t|)$$ with $$\alpha= 50,100,200,1000$$, corresponding to an autoregressive model of order one, and \begin{equation*} \rho_{2}(t) = \exp(-\alpha_1|t|/2)\left\{\cos(p|t|) + \frac{\alpha_1}{2p}\sin(p|t|)\right\}\!, \end{equation*} where $$p^2 = \alpha_2 - \alpha_1^2/4 > 0$$, corresponding to a quasi-periodic autoregressive order-two correlation structure. We take $$(\alpha_1, \alpha_2)\in\{(50, 3000), (50, 4000), (60, 3000), (60, 4000)\}$$ for $$\rho_{2}$$. The correlation functions are shown in Figs. 2 and 3. For each of the correlation structures we generate 5000 sample data replicates and compute the estimated bandwidth $$\hat{h}$$ using the three estimation methods mentioned earlier for sample sizes $$n=50, 100$$ and $$200$$. We approximate the value of the mean of $$\hat{h}_{{\small{\text{ASE}}}}$$ over the 5000 data replicates, denoted by $$h_{{\small{\text{MASE}}}}$$.
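The two-step procedure on this simulation model can be sketched as follows (illustrative code, not the authors' implementation; the bandwidth grid and seed are arbitrary choices): select $$\hat{h}_{\textrm{bimd}}$$ by minimizing the residual sum of squares with $$\bar{K}(u)=2\pi^{-1/2}u^2e^{-u^2}$$, then rescale by $$C_1(K,\bar{K})=1{\cdot}16$$ to obtain the Gaussian-kernel bandwidth $$\hat{h}$$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma, alpha = 100, 1.0, 100.0
x = np.arange(1, n + 1) / n                       # equally spaced design
m = 2 * np.sin(6 * x) + (x + 1) ** 2
R = np.exp(-alpha * np.abs(x[:, None] - x[None, :]))   # rho_1, AR(1)-type
y = m + rng.multivariate_normal(np.zeros(n), sigma**2 * R)

Kbar = lambda u: 2 / np.sqrt(np.pi) * u**2 * np.exp(-u**2)  # Kbar(0) = 0

def fit(x0, h):
    # local linear fit at x0 with kernel Kbar; self-weight is zero
    d = x - x0
    w = Kbar(d / h) / h
    Xx = np.column_stack([np.ones(n), d])
    WXx = Xx * w[:, None]
    return np.linalg.solve(Xx.T @ WXx, WXx.T @ y)[0]

def rss(h):
    return np.mean((y - np.array([fit(x0, h) for x0 in x])) ** 2)

grid = np.linspace(0.02, 0.3, 29)
h_bimd = min(grid, key=rss)     # step 1: RSS criterion with Kbar
h_hat = 1.16 * h_bimd           # step 2: factor rule (4), C_1(K, Kbar) = 1.16
print(h_bimd, h_hat)
```

Because $$\bar{K}(0)=0$$, each fitted value ignores its own observation, so minimizing the residual sum of squares does not collapse to $$h\to 0$$ even under correlated errors.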
In general, the bandwidth density of our method is centred at zero with variance comparable to that of the plug-in and crossvalidation methods.

Fig. 2. Correlation function $$\rho_{1}(\cdot)$$ for three different parameter settings: (a) $$\alpha=1000$$, (b) $$\alpha=100$$, and (c) $$\alpha=50$$.

Fig. 3. Correlation function $$\rho_{2}(\cdot)$$ for three different parameter settings: (a) $$\alpha_1=50$$ and $$\alpha_2=3000$$; (b) $$\alpha_1=50$$ and $$\alpha_2=4000$$; (c) $$\alpha_1=60$$ and $$\alpha_2=4000$$.

With positive autoregressive correlated errors of order one, the plug-in method is resistant to correlation and usually performs better than crossvalidation with a Gaussian kernel. The bandwidth density of the proposed method is centred at zero, so our method gives better estimates of the optimal bandwidth. When $$\alpha=1000$$, which is almost the independent error case, all the methods give good results, but our method shows a slight rise in variance. As $$\alpha$$ decreases, the correlation increases, and the other two methods tend to give smaller than optimal bandwidths. For the plug-in method to work, however, a pilot smoother is needed; since Francisco-Fernández et al. (2004) did not give practical guidelines on how to choose this, we used our method based on a kernel satisfying $$K(0)=0$$ as the pilot smoother to obtain residuals. The results for $$n=100$$ are shown in Fig. 4; see also the Supplementary Material. The results for the autoregressive model, shown in Fig.
5, indicate that our method will generally outperform the plug-in method and standard crossvalidation; it performs well not only with positive correlation but also with negative correlation.

Fig. 4. Densities of the bandwidth estimators $$\hat{h}$$ (solid), plug-in bandwidth $$\hat{h}_{\textrm{plgn}}$$ (dashed) and crossvalidation ignoring correlation $$\hat{h}_{\textrm{norm}}$$ (dash-dotted) for an autoregressive model of order-one correlated errors and $$n=100$$ with (a) $$\alpha=1000$$, (b) $$\alpha=100$$ and (c) $$\alpha=50$$. The result for $$\alpha = 200$$ is omitted.

Fig. 5. Densities of the bandwidth estimators $$\hat{h}$$ (solid), plug-in bandwidth $$\hat{h}_{\textrm{plgn}}$$ (dashed) and crossvalidation ignoring correlation $$\hat{h}_{\textrm{norm}}$$ (dash-dotted) for an autoregressive model of order-two correlated errors and $$n=100$$ with (a) $$\alpha_1=50$$ and $$\alpha_2=3000$$, (b) $$\alpha_1=50$$ and $$\alpha_2=4000$$, or (c) $$\alpha_1=60$$ and $$\alpha_2=4000$$. Panel (b) has a different scale on the vertical axis. The result for $$\alpha_1=60$$ and $$\alpha_2=3000$$ is omitted.
## 6. Discussion

Let $$(X_i ,Y_i)_{i=1}^n$$ be a set of $$\mathbb{R}^{d+1}$$-valued random vectors, where the $$Y_i$$ are scalar responses and the $$X_i$$ are $$\mathbb{R}^d$$-valued covariates having common density $$f$$ with compact support $$\Omega \subseteq \mathbb{R}^d$$; further, $$f$$ is bounded, continuous, and such that $$f(x) > 0$$ for all $$x$$. Consider the same model as in § 1 with $$\rho_n$$ a $$d$$-variate stationary correlation function. The local linear estimator of $$m(\cdot)$$ at $$x$$ solves the least squares minimization \begin{equation*} \min_{\alpha,\beta}\,\sum_{i=1}^n\bigl\{Y_i-\alpha-\beta^{{\rm T}}(X_i-x)\bigr\}^2K_H(X_i-x), \end{equation*} where $$H$$ is a $$d\times d$$ symmetric positive-definite bandwidth matrix, $$K$$ is a $$d$$-variate kernel and $$K_H(u)=|H|^{-1}K(H^{-1}u)$$, with $$|H|$$ denoting the determinant of the matrix $$H$$. A key to extending our method to the multivariate case is the results of Ruppert & Wand (1994) on the asymptotic expression of $$S_n^{-1}=({X}_x^{{\rm T}}{\rm W}_x{X}_x)^{-1}$$. In what follows we write $$\lambda_{\textrm{max}}(H)$$ and $$\lambda_{\textrm{min}}(H)$$ for the maximal and minimal eigenvalues of the matrix $$H$$, and $$\|x\|$$ for the $$L_2$$-norm; for a matrix $$H$$, $$H\to 0$$ means that every element of $$H$$ goes to zero.
Besides the assumptions on the density $$f$$, regression function $$m$$ and kernel $$K$$, we also require that the bandwidth matrix $$H$$ be symmetric and positive definite, that $$H\to 0$$ and $$n|H|\to \infty$$ as $$n\to \infty$$, and that the ratio $$\lambda_{\textrm{max}}(H)/\lambda_{\textrm{min}}(H)$$ be bounded above. Next, $$\lim_{\|u\| \to \infty} \|u\|^l K(u) < \infty$$ for $$l=0,1$$. Finally, the first part of Assumption 8 and Assumptions 9 and 10 for $$\rho_n$$ still hold, with the second part of Assumption 8 replaced by the following: for any sequence $$\epsilon_n>0$$ satisfying $$n^{1/d}\epsilon_n \to \infty$$, \begin{equation*} n\int_{\|x\|\ge \epsilon_n}|\rho_n(x)|\,{\rm d} x \rightarrow 0 \end{equation*} as $$n\rightarrow \infty$$. Then, for $$n^{\delta}\int |\rho_n(t)|\,{\rm d} t < \rho_\delta$$ with $$\delta>1$$ and each element of $$H$$ of order $$O(n^{-1/(d+4)})$$, Theorem 2 extends to the multivariate case for a kernel $$K$$ such that $$K(0)=0$$. A possible kernel is $$K(u) = (2d^{-1}\pi^{-d/2})\|u\|^2\exp(-\|u\|^2)$$. Two commonly used correlation functions satisfying the modified second part of Assumption 8 are the exponential and rational quadratic models (Cressie, 1993) \begin{equation*} \rho_n(x) = \exp(-\alpha n^{1/d} \|x\|), \quad \rho_n(x) = \frac{1}{1+\alpha n^{2/d} \|x\|^2}, \quad \alpha>0\text{.} \end{equation*}

## Acknowledgement

We are very grateful to the associate editor and referees for their helpful comments. The third author acknowledges support from the Research Fund of the Katholieke Universiteit Leuven and from the Interuniversity Attraction Poles Research Network P7/06 of the Belgian Science Policy.

## Supplementary material

Supplementary material available at Biometrika online includes the proofs of Theorems 1 and 2 and additional simulation results.

## References

Altman, N. S. (1990). Kernel smoothing of data with correlated errors. J. Am. Statist. Assoc. 85, 749–59.

Chu, C. K.
& Marron, J. S. (1991). Comparison of two bandwidth selectors with dependent errors. Ann. Statist. 19, 1906–18.

Cressie, N. A. C. (1993). Statistics for Spatial Data, 2nd ed. New York: Wiley.

Diggle, P. J. & Hutchinson, M. F. (1989). On spline smoothing with correlated errors. Aust. J. Statist. 31, 166–82.

Fan, J. & Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. New York: Chapman & Hall/CRC.

Francisco-Fernández, M., Opsomer, J. D. & Vilar-Fernández, J. M. (2004). Plug-in bandwidth selector for local polynomial regression estimator with correlated errors. J. Nonparam. Statist. 16, 127–51.

Gijbels, I., Pope, A. & Wand, M. P. (1999). Understanding exponential smoothing via kernel regression. J. R. Statist. Soc. B 61, 39–50.

Hall, P., Lahiri, S. N. & Polzehl, J. (1995). On bandwidth choice in nonparametric regression with both short- and long-range dependent errors. Ann. Statist. 23, 1921–36.

Hall, P. & Van Keilegom, I. (2003). Using difference-based methods for inference in nonparametric regression with time series errors. J. R. Statist. Soc. B 65, 443–56.

Hart, J. D. (1991). Kernel regression estimation with time series errors. J. R. Statist. Soc. B 53, 173–87.

Kim, T. Y., Kim, D., Park, B. U. & Simpson, D. G. (2004). Nonparametric detection of correlated errors. Biometrika 91, 491–6.

Kim, T. Y., Park, B. U., Moon, M. S. & Kim, C. (2009). Using bimodal kernel for inference in nonparametric regression with correlated errors. J. Mult. Anal. 100, 1487–97.

Konishi, S. & Kitagawa, G. (2008). Information Criteria and Statistical Modeling. New York: Springer.

Lee, K. L., Mammen, E. & Park, B. U. (2010).
Bandwidth selection for kernel regression with correlated errors. Statistics 44 , 327 – 40 . Google Scholar CrossRef Search ADS Li C. K. ( 1987 ). Asymptotic optimality for $$C_p$$, $$C_L$$, cross-validation and generalized cross-validation: Discrete index set. Ann. Statist. 15 , 958 – 75 . Google Scholar CrossRef Search ADS Opsomer J. , Wang Y. & Yang Y. ( 2001 ). Nonparametric regression with correlated errors. Statist. Sci. 16 , 134 – 53 . Google Scholar CrossRef Search ADS Park B. U. , Lee Y. K. , Kim T. Y. & Park C. ( 2006 ). A simple estimator of error correlation in non-parametric regression models. Scand. J. Statist. 33 , 451 – 62 . Google Scholar CrossRef Search ADS Ruppert D. & Wand M. P. ( 1994 ). Multivariate locally weighted least squares regression. Ann. Statist. 22 , 1346 – 70 . Google Scholar CrossRef Search ADS © 2018 Biometrika Trust This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Biometrika Oxford University Press

# Local polynomial regression with correlated errors in random design and unknown correlation structure

Biometrika, Volume 105 (3), Sep 1, 2018, 10 pages
Publisher: Oxford University Press
ISSN: 0006-3444; eISSN: 1464-3510
DOI: 10.1093/biomet/asy025

### Abstract

Summary: Automated or data-driven bandwidth selection methods tend to break down in the presence of correlated errors. While this problem has previously been studied in the fixed design setting for kernel regression, the results were applicable only when the correlation structure is known. This article generalizes these results to the random design setting and addresses the problem in situations where no prior knowledge about the correlation structure is available. We establish the asymptotic optimality of our proposed bandwidth selection criterion based on kernels $$K$$ satisfying $$K(0)=0$$.

## 1. Introduction

Assume that we observe random variables $$(X_i,Y_i)_{i=1}^{n}$$, where $$Y_i$$ takes values in $$\mathbb{R}$$ and $$X_i$$ takes values in $$\Omega\subseteq \mathbb{R}$$ with a common density $$f$$. The density $$f$$ is compactly supported, bounded and continuous, with $$f(x)>0$$ for all $$x$$. Consider the model \begin{equation*} Y_i = m(X_i)+ e_i \quad (i=1,\ldots,n), \end{equation*} where $$m$$ is an unknown smooth regression function and the $$e_i$$ are unobserved random variables such that $${E}(e_i\mid X_i) = 0, \quad {\mathrm{cov}}(e_i,e_j\mid X_i,X_j)=\sigma^2\rho_n(X_i-X_j) \quad (i,j=1,\ldots,n), \label{assumption1}$$ (1) with $$\sigma^2 <\infty$$ and $$\rho_n$$ a stationary correlation function satisfying $$\rho_n(0)=1$$, $$\rho_n(x)=\rho_n(-x)$$ and $$|\rho_n(x)| \leq 1$$ for all $$x$$. The subscript $$n$$ allows the correlation function $$\rho_n$$ to shrink as $$n\rightarrow\infty$$.

Many kernel-based regression methods require the tuning or selection of a smoothing parameter, often referred to as the bandwidth. This introduces arbitrariness into the estimation procedure, so many bandwidth selection methods have been developed (Fan & Gijbels, 1996; Konishi & Kitagawa, 2008).
Diggle & Hutchinson (1989) and Hart (1991) showed that these methods perform rather poorly under model (1), since they typically require the errors to be independent and identically distributed. Several modifications have been proposed to account for short-range correlation (Opsomer et al., 2001; Francisco-Fernández et al., 2004). The plug-in method of Francisco-Fernández et al. (2004) assumes that the errors are generated from an autoregressive model of order 1, which can be hard to verify in practice; however, the authors performed a sensitivity analysis suggesting that their method is reasonably robust against misspecification of the correlation model. Lee et al. (2010) proposed two criteria that approximate the averaged squared error when the errors are correlated and the design is fixed. Although their method does not require a parametric form for the correlation structure, it relies on estimating the autocorrelation at different lags via the method of Park et al. (2006). Gijbels et al. (1999) considered exponential smoothing and established its relation with nonparametric regression in fixed designs, assuming that the errors are realizations of a zero-mean causal autoregressive moving average process. Hall & Van Keilegom (2003) showed that difference-based methods can be used to estimate autoregressive parameters in nonparametric regression with time series errors.

Although the problem of correlated errors is well understood and many remedies exist, a pressing challenge is the implementation of these results (Opsomer et al., 2001). Kim et al. (2004) developed a nonparametric method for the related problem of detecting correlated errors; they showed that the first-order sample autocovariance of the residuals from kernel regression provides information on whether or not the errors are correlated. Without prior knowledge of the correlation structure, many bandwidth selection procedures are cumbersome to implement.
One of the first steps towards simpler implementation was taken by Park et al. (2006), whose estimator of the error correlation is based neither on ordinary crossvalidation, nor on modified crossvalidation (Chu & Marron, 1991), nor on the block bootstrap (Hall et al., 1995). To simplify regression with correlated errors, Kim et al. (2009) showed that bimodal kernels are quite effective even when the errors are heavily correlated. However, three shortcomings remain: prior information about the correlation structure is required, bandwidth selection methods need to be modified to work in this context, and most theoretical results have been established only for fixed, equispaced designs.

In this article we focus on estimating the regression function, not the correlation structure, by local polynomial regression in random designs. As we will show, under mild conditions neither special crossvalidation nor prior information on the correlation structure is required, as a kernel $$K$$ satisfying $$K(0)=0$$ can effectively remove the effects of correlation. Since all the results are derived for a random design, a natural extension is to consider spatially correlated errors, which would have more applications.

## 2. Kernel smoother and assumptions

If $$m$$ is an unknown $$p+1$$ times continuously differentiable function, then the local polynomial regression estimator of order $$p$$ at an arbitrary point $$x$$ is found by solving the following weighted least squares problem: minimize $$\sum_{i=1}^n\{Y_i-\sum_{j=0}^p\beta_j(X_i-x)^{\,j}\}^2K_h(X_i-x)$$ with respect to $$\beta$$.
This yields the solution $$\hat{\beta} = ({X}_x^{{\rm T}}{\rm W}_x{X}_x)^{-1}{X}_x^{{\rm T}}{\rm W}_x{Y} = S_n^{-1}{X}_x^{{\rm T}}{\rm W}_x{Y}$$, where $${Y}=(Y_1,\ldots,Y_n)^{{\rm T}}$$, $${\rm W}_x={\mathrm{diag}}\{K_h(X_i-x)\}$$ is the $$n\times n$$ diagonal matrix of weights with kernel $$K$$ and bandwidth $$h$$ such that $$K_h(\cdot)=K(\cdot/h)/h$$, $$S_n={X}_x^{{\rm T}}{\rm W}_x{X}_x$$, and \begin{equation*} {X}_x = \begin{pmatrix} 1 & (X_1-x) & \cdots & (X_1-x)^p \\[-3pt] \vdots & \vdots & & \vdots\\ 1 & (X_n-x) & \cdots & (X_n-x)^p \end{pmatrix}\!\text{.} \end{equation*} The local polynomial estimator of the regression function at an arbitrary point $$x$$ is $$\hat{m}(x) = \varepsilon_{1}^{{\rm T}}\hat{\beta}$$, where $$\varepsilon_{1}=(1,0,\ldots,0)^{{\rm T}}$$ is the unit vector with 1 in the first position. A useful representation of the estimator at $$x$$ is \begin{equation*} \hat{m}(x) = \varepsilon_{1}^{{\rm T}} S_n^{-1}{X}_x^{{\rm T}}{\rm W}_x{Y} = \sum_{i=1}^n W_i^n(x)Y_i, \end{equation*} where $$W_i^n(x)=\varepsilon_{1}^{{\rm T}} S_n^{-1}\{1,X_i-x,\ldots,(X_i-x)^p\}^{{\rm T}}K\{(X_i-x)/h\}/h$$.

Before stating the main theoretical results of the paper, we list the assumptions on which they rest.

Assumption 1. $$m$$ is a $$p+1$$ times continuously differentiable function;

Assumption 2. $$h \rightarrow 0$$ and $$nh \rightarrow \infty$$ as $$n\to \infty$$;

Assumption 3. there exists an $$f_{\textrm{max}}$$ such that $$f(x) < f_{\textrm{max}}$$ for all $$x\in \Omega \subseteq\mathbb{R}$$;

Assumption 4. $$f$$ has bounded support, is continuous, and satisfies $$f(x)>0$$ for all $$x \in \textrm{supp}(f)$$;

Assumption 5. there exists a constant $$K_{\textrm{max}}$$ such that $$|K(x)| < K_{\textrm{max}}$$, and $$K(x)\ge 0$$ for all $$x$$;

Assumption 6. $$K$$ is symmetric and Lipschitz continuous at $$0$$;

Assumption 7. $$\lim_{|u| \to \infty} |u|^l K(u) < \infty$$ for $$l=0,\ldots,p$$;

Assumption 8.
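The closed-form weighted least squares solution above is straightforward to implement. Below is a minimal sketch (assuming NumPy; the function name, defaults and the Gaussian kernel choice are illustrative, not prescribed by the paper):

```python
import numpy as np

def local_poly_fit(x0, X, Y, h, p=1, kernel=None):
    """Local polynomial estimate m-hat(x0) of order p: the first
    component of beta-hat = (X_x' W_x X_x)^{-1} X_x' W_x Y."""
    if kernel is None:
        kernel = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)  # Gaussian
    d = X - x0
    Xx = np.vander(d, N=p + 1, increasing=True)    # columns 1, (X_i-x0), ...
    w = kernel(d / h) / h                          # K_h(X_i - x0)
    S = Xx.T @ (w[:, None] * Xx)                   # S_n = X_x' W_x X_x
    beta = np.linalg.solve(S, Xx.T @ (w * Y))
    return beta[0]                                 # eps_1' beta-hat
```

A quick sanity check is that local linear regression ($$p=1$$) reproduces exactly linear responses regardless of the weights, so fitting noiseless data $$Y_i = 2 + 3X_i$$ at $$x_0=0.5$$ returns $$3.5$$ up to rounding.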
the correlation function $$\rho_n$$ is an element of a sequence $$\{\rho_n\}$$ with the following properties for all $$n$$: there exist constants $$\rho_{\textrm{max}}$$ and $$\rho_{c}$$ such that $$n\int |\rho_n(x)|\,{\rm d} x < \rho_{\textrm{max}}$$ and $$\lim_{n\to\infty}n\int \rho_n(x)\,{\rm d} x = \rho_{c}$$; and for any sequence $$\epsilon_n>0$$ satisfying $$n\epsilon_n \to \infty$$, \begin{equation*} n\int_{|x|\ge \epsilon_n}|\rho_n(x)|\,{\rm d} x \rightarrow 0, \quad n\rightarrow \infty; \end{equation*}

Assumption 9. $$\lim_{n\to\infty}n\int \rho^2_n(x)\,{\rm d} x = \rho_{cc}$$;

Assumption 10. for all $$i,j,k$$ and $$l$$, \begin{align*} {\mathrm{cov}}(e_ie_j,e_ke_l \mid X_i,X_j,X_k,X_l) &= {\mathrm{cov}}(e_i,e_k \mid X_i,X_k){\mathrm{cov}}(e_j,e_l \mid X_j,X_l)\\ &\quad+{\mathrm{cov}}(e_i,e_l \mid X_i,X_l){\mathrm{cov}}(e_j,e_k \mid X_j,X_k)\text{.} \end{align*}

Assumption 4 requires a density $$f$$ with bounded support. Strictly speaking, this excludes distributions defined on the whole of $$\mathbb{R}$$, but our simulations suggest that the proposed method also works when the design points are sampled from densities with infinite support. Assumption 8 implies that the correlation function depends on $$n$$ and that $$\int|\rho_n(x)|\,{\rm d} x$$ must vanish at a rate no slower than $$O(1/n)$$. The second part of Assumption 8 further requires the correlation to be short-range. Two correlation functions satisfying Assumption 8 are (Cressie, 1993) $$\rho_n(x) = \exp(-\alpha\,n|x|), \quad \rho_n(x) = \frac{1}{1+\alpha\,n^2x^2}, \quad \alpha >0\text{.} \label{corrfun}$$ (2) These common assumptions are generalizations of the equispaced design assumptions in Altman (1990). Assumptions 8 and 10 are satisfied, for example, when the errors have an autoregressive order-one dependence structure and follow a normal distribution.

## 3. Main theoretical results

We first establish the bias and variance of the local polynomial regression estimator under our design assumptions. The bias does not depend on the correlation structure and is therefore the same as in the independent error case. In what follows we give the bias only for $$p$$ odd; for the asymptotic bias when $$p$$ is even, see Fan & Gijbels (1996, p. 62). The proofs can be found in the Supplementary Material.

Theorem 1. Assume $$f(x)>0$$, and let $$f(\cdot)$$ and $$m^{(p+1)}(\cdot)$$ be continuous in a neighbourhood of $$x$$. Then under Assumptions 2, 5, 6, 7 and 8, the asymptotic conditional bias and variance for the local polynomial regression estimator of odd order $$p$$ are \begin{eqnarray*} {\mathrm{bias}}\{\hat{m}(x) \mid X_1,\ldots,X_n\} &=& \varepsilon_1^{{\rm T}}S^{-1}\frac{c_p}{(p+1)!}\,m^{(p+1)}(x)h^{\,p+1}+o_{\rm p}(h^{\,p+1})\\ &=& \biggl\{\int t^{\,p+1}K_p^{\star}(t)\,{\rm d} t\biggr\} \frac{1}{(p+1)!}\,m^{(p+1)}(x)h^{\,p+1}+o_{\rm p}(h^{\,p+1}),\\ {\mathrm{var}}\{\hat{m}(x) \mid X_1,\ldots,X_n\} &=& \frac{\sigma^2 \{1+f(x)\rho_c\}}{nhf(x)}\,\varepsilon_1^{{\rm T}}S^{-1}S^{\star}S^{-1}\varepsilon_1 + o_{\rm p}\biggl(\frac{1}{nh}\biggr)\\ &=& \int K_p^{\star \, 2}(t)\,{\rm d} t \, \frac{\sigma^2 \{1+f(x)\rho_c\}}{nhf(x)} + o_{\rm p}\biggl(\frac{1}{nh}\biggr), \end{eqnarray*} where $$S=(\mu_{i+j})_{0\,\leq\,i,\,j\,\leq\, p}$$ with $$\mu_j = \int u^jK(u)\,{\rm d} u$$, $$S^{\star}=(\nu_{i+j})_{0\,\leq\, i,j\,\leq\, p}$$ with $$\nu_j = \int u^{\,j}K^2(u)\,{\rm d} u$$, $$c_p = (\mu_{p+1},\ldots,\mu_{2p+1})^{{\rm T}}$$ and $$K_p^{\star}(t) = \varepsilon_1^{{\rm T}}S^{-1}(1,t,\ldots, t^p)^{{\rm T}} K(t)$$.

The bandwidth $$h_{\textrm{opt}}$$ minimizing the asymptotic mean integrated squared error, i.e., $$\int [{\mathrm{bias}}^2\{\hat{m}(x) \mid X_1,\ldots,X_n\}+{\mathrm{var}}\{\hat{m}(x) \mid X_1,\ldots,X_n\}]\,f(x)\,{\rm d} x$$, readily follows and is given in Corollary 1.
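The constants appearing in Theorem 1 are easy to check numerically. The sketch below (assuming SciPy; the helper name is illustrative) builds $$S$$ from the kernel moments and integrates the equivalent kernel $$K_p^{\star}$$; for $$p=1$$ and a symmetric kernel, $$K_1^{\star}=K$$, so the two constants reduce to $$\mu_2(K)$$ and $$\int K^2$$:

```python
import numpy as np
from scipy.integrate import quad

def equiv_kernel_constants(K, p):
    """Return (int t^{p+1} K_p^*(t) dt, int K_p^*(t)^2 dt) for the
    equivalent kernel K_p^*(t) = eps_1' S^{-1} (1, t, ..., t^p)' K(t)."""
    mu = [quad(lambda u, j=j: u**j * K(u), -np.inf, np.inf)[0]
          for j in range(2 * p + 2)]
    S = np.array([[mu[i + j] for j in range(p + 1)] for i in range(p + 1)])
    row0 = np.linalg.solve(S, np.eye(p + 1))[0]        # first row of S^{-1}
    Kstar = lambda t: row0 @ np.array([t**j for j in range(p + 1)]) * K(t)
    moment = quad(lambda t: t**(p + 1) * Kstar(t), -np.inf, np.inf)[0]
    roughness = quad(lambda t: Kstar(t)**2, -np.inf, np.inf)[0]
    return moment, roughness

gauss = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
m1, r1 = equiv_kernel_constants(gauss, p=1)
# For the Gaussian with p = 1: moment = mu_2 = 1, roughness = 1/(2*sqrt(pi))
```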
There is a slight but significant difference between the correlated and independent error cases (Fan & Gijbels, 1996, p. 68).

Corollary 1. Under the assumptions of Theorem 1, the asymptotically optimal constant bandwidth for $$p$$ odd is \begin{equation*} h_{\rm opt} = C_p(K)\biggl[\frac{\sigma^2\{\mu(\Omega)+\rho_c\}}{\int\{m^{(p+1)}(x)\}^2f(x)\,{\rm d} x}\biggr]^{1/(2p+3)}n^{-1/(2p+3)}, \end{equation*} where \begin{equation*} C_p(K) = \biggl[\frac{(p+1)!^2\int K^{\star2}_p(t)\,{\rm d} t}{2(p+1)\{\int t^{\,p+1}K^{\star}_p(t)\,{\rm d} t\}^2}\biggr]^{1/(2p+3)} \end{equation*} and $$\mu(\Omega)$$ denotes the Lebesgue measure of the design region $$\Omega \subseteq \mathbb{R}$$.

In what follows, define the residual sum of squares $${\small{\text{RSS}}}(h)=n^{-1} \sum_{i=1}^n\{Y_i-\hat{m}(X_i)\}^2$$ and the averaged squared error $${\small{\text{ASE}}}(h)=n^{-1} \sum_{i=1}^{n}\{m(X_i) - \hat{m}(X_i)\}^2$$. We will show that $${\small{\text{RSS}}}(h)$$ approximates $${\small{\text{ASE}}}(h)$$ sufficiently well, uniformly over a set of bandwidths. To prove this result we require a slightly stronger condition on the correlation function than that in Assumption 8; the correlation functions in (2), however, also satisfy this stronger condition.

Theorem 2. Under Assumptions 1–10, if $$n^{\delta}\int |\rho_n(t)|\,{\rm d} t < \rho_\delta$$ for $$\delta>1$$, $$p$$ is odd and $$h\in\mathcal{H}_n$$ where $$\mathcal{H}_n=[an^{-1/(2p+3)},bn^{-1/(2p+3)}]$$ with $$0<a<b<\infty$$, then \begin{equation*} {\small{\text{RSS}}}(h) = {\small{\text{ASE}}}(h) + \frac1n \sum_{i=1}^ne_i^2 - \frac{2\sigma^2K(0)S^{00}\{\mu(\Omega)+\rho_c\}}{nh}+o_{\rm p}\{n^{-(2p+2)/(2p+3)}\}, \end{equation*} where $$S^{00}$$ denotes the first element of the first row of the inverse of the matrix $$S$$.

Theorem 3 establishes our proposed bandwidth selection criterion, based on the residual sum of squares, for correlated data with unknown correlation structure.
Theorem 3 shows that the proposed criterion selects a bandwidth that asymptotically minimizes the averaged squared error.

Theorem 3. Suppose that the assumptions of Theorem 2 hold. Let $$\hat{h}_{{\small{\text{RSS}}}}$$ be the minimizer over $$h \in \mathcal{H}_n$$ of the residual sum of squares criterion based on a kernel satisfying $$K(0)=0$$, and let $$\hat{h}_{{\small{\text{ASE}}}}$$ be the minimizer of the averaged squared error over $$h \in \mathcal{H}_n$$. Then, uniformly for $$h\in \mathcal{H}_n$$, \begin{equation*} {\small{\text{RSS}}}(h) = \frac1n \sum_{i=1}^ne_i^2 + {\small{\text{ASE}}}(h)+o_{\rm p}\{n^{-(2p+2)/(2p+3)}\} \end{equation*} and $${\small{\text{ASE}}}(\hat{h}_{{\small{\text{RSS}}}})/{\small{\text{ASE}}}(\hat{h}_{{\small{\text{ASE}}}}) = 1 +o_{\rm p}(1)$$ or, equivalently, \begin{equation*} \frac{{\small{\text{ASE}}}(\hat{h}_{{\small{\text{RSS}}}})}{\inf_{h \in \mathcal{H}_n}{\small{\text{ASE}}}(h)} \to 1 \end{equation*} in probability as $$n\to \infty$$.

Proof. From Theorem 2 and choosing a kernel that satisfies $$K(0)=0$$, it follows that \begin{equation*} {\small{\text{RSS}}}(h) = \frac1n \sum_{i=1}^ne_i^2 + {\small{\text{ASE}}}(h)+o_{\rm p}\!\left\{n^{-(2p+2)/(2p+3)}\right\}\!\text{.} \end{equation*} Next, \begin{align} {\small{\text{ASE}}}\big(\hat{h}_{{\small{\text{RSS}}}}\big) &= {\small{\text{RSS}}}\big(\hat{h}_{{\small{\text{RSS}}}}\big) - \frac1n \sum_{i=1}^ne_i^2 + o_{\rm p}\big\{n^{-(2p+2)/(2p+3)}\big\} \nonumber \\ &\leq {\small{\text{RSS}}}\big(\hat{h}_{{\small{\text{ASE}}}}\big) - \frac1n \sum_{i=1}^ne_i^2 + o_{\rm p}\big\{n^{-(2p+2)/(2p+3)}\big\}\nonumber \\ &= {\small{\text{ASE}}}\big(\hat{h}_{{\small{\text{ASE}}}}\big) + o_{\rm p}\big\{n^{-(2p+2)/(2p+3)}\big\} \!\label{asopt}\text{.} \end{align} (3) By the definition of $$\hat{h}_{{\small{\text{ASE}}}}$$ it follows that $${\small{\text{ASE}}}(\hat{h}_{{\small{\text{ASE}}}}) \leq {\small{\text{ASE}}}(\hat{h}_{{\small{\text{RSS}}}})$$.
Asymptotic optimality (Li, 1987) of our bandwidth selection criterion then follows from (3) and Theorem 2. □

Theorem 3 states that minimizing $${\small{\text{ASE}}}(h)$$ is asymptotically equivalent to minimizing $${\small{\text{RSS}}}(h)$$ with a kernel satisfying $$K(0)=0$$; nothing about the correlation structure needs to be estimated. Such a kernel removes the effects of correlation and eliminates the need for crossvalidation-based techniques, since it puts zero weight on the point under consideration, automatically creating a leave-one-out crossvalidation scenario.

## 4. Kernel and bandwidth selection in practice

Although kernels satisfying $$K(0)=0$$ are effective in removing the effects of correlation from bandwidth selection, they also introduce extra bias and variability (Kim et al., 2009). An optimal kernel $$\tilde{K}$$ in this framework, ignoring Assumption 6, would be \begin{equation*} \tilde{K}(u) = \begin{cases} \frac{3}{4}(1-u^2), & \quad 0<|u| \leq 1, \\ 0, & \quad u = 0, \end{cases} \end{equation*} the Epanechnikov kernel with a discontinuity at zero; but the discontinuity violates the Lipschitz continuity at zero required by Assumption 6. Indeed, no optimal kernel exists in the class of kernels satisfying both $$K(0)=0$$ and Assumption 6 (Kim et al., 2009). The Epanechnikov kernel is obtained as the kernel $$K$$ that minimizes \begin{equation*} Q_K = \left\{\int u^2K(u)\,{\rm d} u\right\} \left\{\int K^2(u)\,{\rm d} u\right\}^2 \end{equation*} subject to certain constraints; its $$Q_K$$ value is 0$$\cdot$$072. For $$K(u)= 630(4u^2-1)^2u^4 I_{(|u|\leq 1/2)}$$, the kernel used by Kim et al. (2009), $$Q_K=0{\cdot}374$$, so this kernel is roughly five times less efficient than the Epanechnikov kernel. Next, we introduce two kernels with $$Q_K$$ values closer to the Epanechnikov kernel's.
The kernels $$K(u) = 2\pi^{-1/2}u^2\exp(-u^2)$$ and $$K(u)=2^{-1} |u|\,\exp(-|u|)$$ have $$Q_K$$ values of 0$$\cdot$$134 and 0$$\cdot$$093, respectively. The three kernels discussed here, shown in Fig. 1, exhibit very different behaviour around zero, and all three satisfy Assumption 6. Among them we recommend $$K(u) = 2\pi^{-1/2}u^2\exp(-u^2)$$, whose smooth trough around zero makes finding the minimum of the residual sum of squares numerically more stable. Although finding an optimal kernel is an interesting question, we do not pursue it here.

Fig. 1. Three kernels satisfying $$K(0)=0$$: (a) $$K(u)= 630(4u^2-1)^2u^4 I_{(|u|\leq 1/2)}$$; (b) $$K(u) = 2\pi^{-1/2}u^2\exp(-u^2)$$; (c) $$K(u)=\frac12 |u|\,\exp(-|u|)$$. Note the different behaviour around zero for these kernels.

Kernels with $$K(0)=0$$ are very effective at removing the unknown correlation but, being non-optimal, they introduce extra mean squared error, producing a more wiggly estimate of the regression function. To counteract this effect, our final smoothing procedure uses the kernel with $$K(0)=0$$ to provide a pilot bandwidth $$\hat{h}_{\textrm{bimd}}$$, which is then related to the bandwidth of a unimodal kernel via a so-called factor rule. To distinguish a kernel satisfying $$K(0)=0$$ from a unimodal kernel, we denote the former by $$\bar{K}$$. Both $$K$$ and $$\bar{K}$$ should satisfy Assumption 6.
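The quoted $$Q_K$$ values can be reproduced by direct quadrature. A short sketch (assuming SciPy; the helper name is illustrative, and finite integration limits are passed for the compactly supported kernels):

```python
import numpy as np
from scipy.integrate import quad

def Q(K, lo=-np.inf, hi=np.inf):
    """Efficiency functional Q_K = (int u^2 K)(int K^2)^2."""
    mu2 = quad(lambda u: u**2 * K(u), lo, hi)[0]
    rough = quad(lambda u: K(u)**2, lo, hi)[0]
    return mu2 * rough**2

epan = lambda u: 0.75 * (1 - u**2)                      # Epanechnikov, |u| <= 1
kim = lambda u: 630 * (4*u**2 - 1)**2 * u**4            # Kim et al., |u| <= 1/2
kbar = lambda u: 2 / np.sqrt(np.pi) * u**2 * np.exp(-u**2)
klap = lambda u: 0.5 * abs(u) * np.exp(-abs(u))
# Q(epan, -1, 1) ~ 0.072,  Q(kim, -0.5, 0.5) ~ 0.374,
# Q(kbar) ~ 0.134,         Q(klap) ~ 0.093
```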
Corollary 1 suggests that the relation between the bandwidth $$\hat{h}_{\textrm{bimd}}$$ of a kernel $$\bar{K}$$ and the bandwidth $$\hat{h}$$ of a Gaussian kernel $$K$$ is $$\hat{h} = \frac{C_p(K)}{C_p(\bar{K})}\,\hat{h}_{\textrm{bimd}} =C_p(K,\bar{K})\,\hat{h}_{\textrm{bimd}}, \label{relation}$$ (4) where \begin{equation*} C_p(K,\bar{K}) = \left[\frac{\int K_{p}^{\star2}(t)\,{\rm d} t \, \{\int t^{\,p+1}\bar{K}_{p}^{\star}(t)\,{\rm d} t\}^2}{\int \bar{K}_{p}^{\star2}(t)\,{\rm d} t \, \{\int t^{\,p+1}K_{p}^{\star}(t)\,{\rm d} t\}^2}\right]^{1/(2p+3)}\!\text{.} \end{equation*} The equivalent kernel $$\bar{K}_{p}^{\star}$$ is obtained by substituting $$\bar{K}$$ for $$K$$ in the definition of $$K_p^{\star}$$. Values of $$C_p(K,\bar{K})$$ are given in the next section.

## 5. Simulations

In this section we compare the effects on bandwidth selection of using the Gaussian kernel and a kernel satisfying $$K(0)=0$$. Since such kernels introduce extra variance, we use a two-step procedure to reduce it. First, we use the kernel $$\bar{K}(u) = 2\pi^{-1/2}u^2\exp(-u^2)$$ to obtain $$\hat{h}_{\textrm{bimd}}$$, which is in fact $$\hat{h}_{{\small{\text{RSS}}}}$$. Second, we exploit the asymptotic relation (4) between the optimal bandwidths to obtain $$\hat{h}$$. This requires no extra smoothing; the relation between the kernel satisfying $$K(0)=0$$ and the Gaussian kernel $$K$$ is given by (4) with $$C_p(K,\bar{K})=1{\cdot}16$$ for local linear regression, $$p=1$$, and $$C_p(K,\bar{K})=1{\cdot}01$$ for local cubic regression, $$p=3$$.
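For $$p=1$$ and symmetric kernels, the equivalent kernel is the kernel itself, so the factor in (4) simplifies to a ratio of kernel moments; the value $$1{\cdot}16$$ quoted above can then be checked numerically. A sketch (assuming SciPy; the function name is illustrative):

```python
import numpy as np
from scipy.integrate import quad

def factor_rule_p1(K, Kbar):
    """C_1(K, Kbar) from (4) for local linear regression (p = 1), where
    K_1^* = K for a symmetric kernel, so 2p + 3 = 5."""
    RK = quad(lambda t: K(t)**2, -np.inf, np.inf)[0]
    RKb = quad(lambda t: Kbar(t)**2, -np.inf, np.inf)[0]
    m2K = quad(lambda t: t**2 * K(t), -np.inf, np.inf)[0]
    m2Kb = quad(lambda t: t**2 * Kbar(t), -np.inf, np.inf)[0]
    return (RK * m2Kb**2 / (RKb * m2K**2)) ** 0.2

gauss = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
kbar = lambda u: 2 / np.sqrt(np.pi) * u**2 * np.exp(-u**2)
# factor_rule_p1(gauss, kbar) ~ 1.16, matching the value quoted for p = 1
```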
The Epanechnikov kernel, optimal in the $$L_2$$ sense, could have been taken as the unimodal kernel in the second step; but there is little difference in efficiency between the Gaussian kernel, $$Q_K=0{\cdot}08$$, and the Epanechnikov kernel, $$Q_K =0{\cdot}072$$, and given the link between the shapes of the kernel satisfying $$K(0)=0$$ and the Gaussian kernel, we chose the Gaussian kernel for the second step. In the simulation study we used local linear regression. We also compare with standard crossvalidation using the Gaussian kernel, ignoring the correlation among the errors, and with the plug-in method of Francisco-Fernández et al. (2004), which involves estimating the correlation structure of the errors and assumes a regular design generated by a density. We use equally spaced designs in the simulations.

Consider the model $$Y_i = m(x_i) + e_i$$, where $$x_i = i/n$$, $$m(x_i) = 2\sin(6x_i) + (x_i+1)^2$$ and $$e_i \sim N(0, 1)$$ with $${\mathrm{cov}}(e_i,e_j) = \sigma^2 \rho(|x_i - x_j|)$$. We consider two correlation functions: $$\rho_{1}(t) = \exp(-\alpha |t|)$$ with $$\alpha= 50,100,200,1000$$, corresponding to an autoregressive model of order one, and \begin{equation*} \rho_{2}(t) = \exp(-\alpha_1|t|/2)\left\{\cos(p|t|) + \frac{\alpha_1}{2p}\sin(p|t|)\right\}\!, \end{equation*} where $$p^2 = \alpha_2 - \alpha_1^2/4 > 0$$, corresponding to a quasi-periodic autoregressive correlation structure of order two. For $$\rho_{2}$$ we take $$(\alpha_1, \alpha_2)\in\{(50, 3000), (50, 4000), (60, 3000), (60, 4000)\}$$. The correlation functions are shown in Figs. 2 and 3. For each correlation structure we generate 5000 sample data replicates and compute the estimated bandwidth $$\hat{h}$$ using the three estimation methods mentioned above for sample sizes $$n=50, 100$$ and $$200$$. We approximate the mean of $$\hat{h}_{{\small{\text{ASE}}}}$$ over the 5000 replicates, denoted by $$h_{{\small{\text{MASE}}}}$$.
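One replicate of the simulation just described can be sketched as follows (assuming NumPy; the bandwidth grid and the grid resolution are illustrative choices, not taken from the paper). Correlated errors are drawn via a Cholesky factor of the exponential covariance matrix, $$\hat{h}_{\textrm{bimd}}$$ minimizes the residual sum of squares under the kernel with $$K(0)=0$$, and the factor rule then gives $$\hat{h}$$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, alpha = 100, 1.0, 100.0
x = np.arange(1, n + 1) / n
m = 2 * np.sin(6 * x) + (x + 1) ** 2
Sigma = sigma**2 * np.exp(-alpha * np.abs(x[:, None] - x[None, :]))
e = np.linalg.cholesky(Sigma) @ rng.standard_normal(n)   # correlated errors
y = m + e

kbar = lambda u: 2 / np.sqrt(np.pi) * u**2 * np.exp(-u**2)   # K(0) = 0

def mhat(h, kernel):
    """Local linear fit at every design point (looped over points)."""
    out = np.empty(n)
    for i, x0 in enumerate(x):
        d = x - x0
        w = kernel(d / h) / h
        Z = np.column_stack([np.ones(n), d])
        beta = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * y))
        out[i] = beta[0]
    return out

H = np.linspace(0.01, 0.3, 30)                 # illustrative bandwidth grid
rss = np.array([np.mean((y - mhat(h, kbar)) ** 2) for h in H])
h_bimd = H[int(np.argmin(rss))]                # h-hat_RSS from the K(0)=0 kernel
h_hat = 1.16 * h_bimd                          # factor rule (4) for p = 1
```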
In general, the bandwidth density of our method is centred at $$h_{{\small{\text{MASE}}}}$$, with variance comparable to that of the plug-in and crossvalidation methods.

Fig. 2. Correlation function $$\rho_{1}(\cdot)$$ for three different parameter settings: (a) $$\alpha=1000$$, (b) $$\alpha=100$$ and (c) $$\alpha=50$$.

Fig. 3. Correlation function $$\rho_{2}(\cdot)$$ for three different parameter settings: (a) $$\alpha_1=50$$ and $$\alpha_2=3000$$; (b) $$\alpha_1=50$$ and $$\alpha_2=4000$$; (c) $$\alpha_1=60$$ and $$\alpha_2=4000$$.

With positive autoregressive correlated errors of order one, the plug-in method is resistant to correlation and usually performs better than crossvalidation with a normal kernel. The bandwidth density of the proposed method is centred at $$h_{{\small{\text{MASE}}}}$$, so our method gives better estimates of the optimal bandwidth. When $$\alpha=1000$$, which is close to the independent error case, all the methods give good results, although our method shows a slight rise in variance. As $$\alpha$$ decreases, the correlation increases and the other two methods tend to give smaller-than-optimal bandwidths. For the plug-in method to work, however, a pilot smoother is needed; since Francisco-Fernández et al. (2004) did not give practical guidelines on how to choose it, we used our method based on a kernel satisfying $$K(0)=0$$ as the pilot smoother to obtain residuals. The results for $$n=100$$ are shown in Fig. 4; see also the Supplementary Material. The results for the order-two autoregressive model, shown in Fig. 5, indicate that our method generally outperforms the plug-in method and standard crossvalidation, performing well with negative as well as positive correlation.

Fig. 4. Densities of the bandwidth estimators $$\hat{h}$$ (solid), plug-in bandwidth $$\hat{h}_{\textrm{plgn}}$$ (dashed) and crossvalidation ignoring correlation $$\hat{h}_{\textrm{norm}}$$ (dash-dotted) for an autoregressive model of order-one correlated errors and $$n=100$$ with (a) $$\alpha=1000$$, (b) $$\alpha=100$$ and (c) $$\alpha=50$$. The result for $$\alpha = 200$$ is omitted.

Fig. 5. Densities of the bandwidth estimators $$\hat{h}$$ (solid), plug-in bandwidth $$\hat{h}_{\textrm{plgn}}$$ (dashed) and crossvalidation ignoring correlation $$\hat{h}_{\textrm{norm}}$$ (dash-dotted) for an autoregressive model of order-two correlated errors and $$n=100$$ with (a) $$\alpha_1=50$$ and $$\alpha_2=3000$$, (b) $$\alpha_1=50$$ and $$\alpha_2=4000$$, and (c) $$\alpha_1=60$$ and $$\alpha_2=4000$$. Panel (b) has a different scale on the vertical axis. The result for $$\alpha_1=60$$ and $$\alpha_2=3000$$ is omitted.

## 6. Discussion

Let $$(X_i ,Y_i)_{i=1}^n$$ be a set of $$\mathbb{R}^{d+1}$$-valued random vectors, where the $$Y_i$$ are scalar responses and the $$X_i$$ are $$\mathbb{R}^d$$-valued covariates having common density $$f$$ with compact support $$\Omega \subseteq \mathbb{R}^d$$. Further, $$f$$ is bounded, continuous and such that $$f(x) > 0$$ for all $$x$$. Consider the same model as in § 1 with $$\rho_n$$ a $$d$$-variate stationary correlation function. The local linear estimator of $$m(\cdot)$$ at $$x$$ is the solution to the least squares minimization \begin{equation*} \min_{\alpha,\beta}\,\sum_{i=1}^n\bigl\{Y_i-\alpha-\beta^{{\rm T}}(X_i-x)\bigr\}^2K_H(X_i-x), \end{equation*} where $$H$$ is a $$d\times d$$ symmetric positive-definite bandwidth matrix, $$K$$ is a $$d$$-variate kernel and $$K_H(u)=|H|^{-1}K(H^{-1}u)$$, with $$|H|$$ the determinant of $$H$$. A key to extending our method to the multivariate case is the results of Ruppert & Wand (1994) on the asymptotic expression for $$S_n^{-1}=({X}_x^{{\rm T}}{\rm W}_x{X}_x)^{-1}$$. In what follows we write $$\lambda_{\textrm{max}}(H)$$ and $$\lambda_{\textrm{min}}(H)$$ for the maximal and minimal eigenvalues of $$H$$ and $$\|x\|$$ for the $$L_2$$-norm; for a matrix $$H$$, $$H\to 0$$ means that every element of $$H$$ goes to zero.
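The multivariate minimization has the same closed-form solution as in the univariate case. A minimal sketch (assuming NumPy; argument names are illustrative), using the $$d$$-variate kernel with $$K(0)=0$$ suggested below for $$d=2$$:

```python
import numpy as np

def local_linear_mv(x0, X, Y, H, kernel):
    """Multivariate local linear estimate at x0: alpha-hat from the
    weighted least squares fit of Y on [1, (X_i - x0)^T] with
    weights K_H(X_i - x0) = K(H^{-1}(X_i - x0)) / |H|."""
    D = X - x0                                   # rows X_i - x0, shape (n, d)
    U = np.linalg.solve(H, D.T).T                # H^{-1}(X_i - x0)
    w = kernel(U) / np.abs(np.linalg.det(H))     # K_H weights
    Z = np.hstack([np.ones((len(X), 1)), D])
    coef = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * Y))
    return coef[0]                               # alpha-hat = m-hat(x0)

d = 2
# d-variate kernel with K(0) = 0, evaluated row-wise on an (n, d) array:
K = lambda U: (2 / d) * np.pi**(-d / 2) * (U**2).sum(1) * np.exp(-(U**2).sum(1))
```

As in one dimension, the local linear fit reproduces exactly linear surfaces whatever the weights, which gives a simple correctness check.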
Besides the assumptions on the density $$f$$, the regression function $$m$$ and the kernel $$K$$, we also require that the bandwidth matrix $$H$$ be symmetric and positive definite, that $$H\to 0$$ and $$n|H|\to \infty$$ as $$n\to \infty$$, and that the ratio $$\lambda_{\textrm{max}}(H)/\lambda_{\textrm{min}}(H)$$ be bounded above. Next, we require $$\lim_{\|u\| \to \infty} \|u\|^l K(u) < \infty$$ for $$l=0,1$$. Finally, the first part of Assumption 8 and Assumptions 9 and 10 for $$\rho_n$$ still hold, with the second part of Assumption 8 replaced by the following: for any sequence $$\epsilon_n>0$$ satisfying $$n^{1/d}\epsilon_n \to \infty$$,
\begin{equation*} n\int_{\|x\|\ge \epsilon_n}|\rho_n(x)|\,{\rm d} x \rightarrow 0 \end{equation*}
as $$n\rightarrow \infty$$. Then, provided that $$n^{\delta}\int |\rho_n(t)|\,{\rm d} t < \rho_\delta$$ for some $$\delta>1$$, each element of $$H$$ is $$O(n^{-1/(d+4)})$$, and Theorem 2 extends to the multivariate case for any kernel $$K$$ such that $$K(0)=0$$. A possible choice is $$K(u) = (2d^{-1}\pi^{-d/2})\|u\|^2\exp(-\|u\|^2)$$. Two commonly used correlation functions satisfying the modified second part of Assumption 8 are the exponential and rational quadratic models (Cressie, 1993),
\begin{equation*} \rho_n(x) = \exp(-\alpha n^{1/d} \|x\|), \quad \rho_n(x) = \frac{1}{1+\alpha n^{2/d} \|x\|^2}, \quad \alpha>0\text{.} \end{equation*}

Acknowledgement

We are very grateful to the associate editor and referees for their helpful comments. The third author acknowledges support from the Research Fund of the Katholieke Universiteit Leuven and from the Interuniversity Attraction Poles Research Network P7/06 of the Belgian Science Policy.

Supplementary material

Supplementary material available at Biometrika online includes the proofs of Theorems 1 and 2 and additional simulation results.

References

Altman, N. S. (1990). Kernel smoothing of data with correlated errors. J. Am. Statist. Assoc. 85, 749–59.
Chu, C. K. & Marron, J. S. (1991). Comparison of two bandwidth selectors with dependent errors. Ann. Statist. 19, 1906–18.
Cressie, N. A. C. (1993). Statistics for Spatial Data. New York: Wiley, 2nd ed.
Diggle, P. J. & Hutchinson, M. F. (1989). On spline smoothing with correlated errors. Aust. J. Statist. 31, 166–82.
Fan, J. & Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. New York: Chapman & Hall/CRC.
Francisco-Fernández, M., Opsomer, J. D. & Vilar-Fernández, J. M. (2004). Plug-in bandwidth selector for local polynomial regression estimator with correlated errors. J. Nonparam. Statist. 16, 127–51.
Gijbels, I., Pope, A. & Wand, M. P. (1999). Understanding exponential smoothing via kernel regression. J. R. Statist. Soc. B 61, 39–50.
Hall, P., Lahiri, S. N. & Polzehl, J. (1995). On bandwidth choice in nonparametric regression with both short- and long-range dependent errors. Ann. Statist. 23, 1921–36.
Hall, P. & Van Keilegom, I. (2003). Using difference-based methods for inference in nonparametric regression with time series errors. J. R. Statist. Soc. B 65, 443–56.
Hart, J. D. (1991). Kernel regression estimation with time series errors. J. R. Statist. Soc. B 53, 173–87.
Kim, T. Y., Kim, D., Park, B. U. & Simpson, D. G. (2004). Nonparametric detection of correlated errors. Biometrika 91, 491–6.
Kim, T. Y., Park, B. U., Moon, M. S. & Kim, C. (2009). Using bimodal kernel for inference in nonparametric regression with correlated errors. J. Mult. Anal. 100, 1487–97.
Konishi, S. & Kitagawa, G. (2008). Information Criteria and Statistical Modeling. New York: Springer.
Lee, K. L., Mammen, E. & Park, B. U. (2010). Bandwidth selection for kernel regression with correlated errors. Statistics 44, 327–40.
Li, K.-C. (1987). Asymptotic optimality for $$C_p$$, $$C_L$$, cross-validation and generalized cross-validation: Discrete index set. Ann. Statist. 15, 958–75.
Opsomer, J., Wang, Y. & Yang, Y. (2001). Nonparametric regression with correlated errors. Statist. Sci. 16, 134–53.
Park, B. U., Lee, Y. K., Kim, T. Y. & Park, C. (2006). A simple estimator of error correlation in non-parametric regression models. Scand. J. Statist. 33, 451–62.
Ruppert, D. & Wand, M. P. (1994). Multivariate locally weighted least squares regression. Ann. Statist. 22, 1346–70.

© 2018 Biometrika Trust. This article is published and distributed under the terms of the Oxford University Press Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).
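As a numerical aside (not part of the paper), the suggested multivariate kernel $$K(u) = (2d^{-1}\pi^{-d/2})\|u\|^2\exp(-\|u\|^2)$$ can be checked to vanish at the origin while still integrating to one. A minimal Python sketch for $$d=2$$; the function names and grid bounds are illustrative choices:

```python
import numpy as np

def kernel(u, d):
    # Kernel from the text: K(u) = (2 / (d * pi^{d/2})) * ||u||^2 * exp(-||u||^2).
    # K(0) = 0, the property the bandwidth selection criterion relies on.
    r2 = np.sum(np.square(u), axis=-1)
    return 2.0 / (d * np.pi ** (d / 2)) * r2 * np.exp(-r2)

d = 2
grid = np.linspace(-5.0, 5.0, 401)      # the integrand is negligible beyond ||u|| = 5
xx, yy = np.meshgrid(grid, grid)
points = np.stack([xx, yy], axis=-1)    # shape (401, 401, 2): all grid points in R^2
step = grid[1] - grid[0]
integral = kernel(points, d).sum() * step ** 2   # Riemann-sum approximation

print(kernel(np.zeros(d), d))           # 0.0: the kernel vanishes at the origin
print(abs(integral - 1.0) < 1e-6)       # True: K integrates to one
```

The normalizing constant follows from the polar-coordinate identity $$\int_{\mathbb{R}^d}\|u\|^2\exp(-\|u\|^2)\,{\rm d}u = (d/2)\pi^{d/2}$$, which the Riemann sum above confirms numerically.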

Biometrika, Oxford University Press. Published 1 September 2018.
