Schafgans, Marcia M. A. and Zinde‐Walsh, Victoria

Summary. Many important models utilize estimation of average derivatives of the conditional mean function. Asymptotic results in the literature on density weighted average derivative estimators (ADE) focus on convergence at parametric rates; this requires stringent assumptions on the smoothness of the underlying density. Here we derive asymptotic properties under relaxed smoothness assumptions. We adapt to the unknown smoothness in the model by consistently estimating the optimal bandwidth rate and using linear combinations of ADE estimators for different kernels and bandwidths. Linear combinations of estimators (i) can have smaller asymptotic mean squared error (AMSE) than an estimator with an optimal bandwidth and (ii) when based on an estimated optimal-rate bandwidth, can adapt to unknown smoothness and achieve rate optimality. Our combined estimator minimizes the trace of the estimated MSE of linear combinations. Monte Carlo results for ADE confirm the good performance of the combined estimator.

1. Introduction

Many important models, such as index models widely used in limited dependent variable settings, partially linear models and non‐parametric demand studies, utilize estimation of (sometimes weighted) average derivatives of a conditional mean function. Härdle et al. (1991) and Blundell et al. (1998), amongst others, advocated the derivative‐based approach in the analysis of consumer demand, where non‐parametric estimation of Engel curves has become commonplace (e.g. Yatchew, 2003). Powell et al. (1989, hereafter referred to as PSS) popularized the use of density weighted average derivatives of the conditional mean in the semi‐parametric estimation of index models by pointing out that the average derivatives in single index models identify the parameters ‘up to scale’.
A large literature is devoted to the asymptotic properties of non‐parametric estimators of average derivatives and to their use in the estimation of index models and testing of coefficients. Asymptotic properties of average density weighted derivatives, hereafter referred to as ADEs, are discussed in PSS and Robinson (1989); Härdle and Stoker (1989) investigated the properties of the average derivatives themselves; Newey and Stoker (1993) addressed the choice of weighting function; Horowitz and Härdle (1996) extended the ADE approach for estimating the coefficients in the single index model to the presence of discrete covariates; Donkers and Schafgans (2008) extended the ADE approach to multiple index models; Chaudhuri et al. (1997) investigated average derivatives in quantile regression; Li et al. (2003) investigated local polynomial fitting for average derivatives; Banerjee (2007) provided a recent discussion on estimating average derivatives using a fast algorithm; and Cattaneo et al. (2008) investigated, for ADE, a weakening of the lower bound on the bandwidth while avoiding the use of higher order kernels. Higher order expansions and the properties of bootstrap tests of ADE are investigated in Nishiyama and Robinson (2000, 2005).

To formulate the ADE under consideration in our paper, let g(x) = E(y | x) with y ∈ R and x ∈ Rk, and define

δ0 = E[f(x) g′(x)],    (1.1)

with g′(x) the derivative of the unknown conditional mean function and f(x) the density of x. With x ∈ Rk, g′(x) stands for the vector (∂g(x)/∂x1, …, ∂g(x)/∂xk)T. Recognizing that δ0 = −2E(f′(x)y) under certain regularity conditions, PSS introduced the estimator

δ̂ = −(2/N) Σ_{i=1}^N f̂′(x_i) y_i,    (1.2)

with

f̂′(x_i) = (1/(N − 1)) Σ_{j≠i} h^{−(k+1)} K′((x_i − x_j)/h).

Here K denotes a kernel smoothing function, K′ its derivative and h denotes the smoothing parameter that depends on the sample size N, with h → 0 as N → ∞. In all of the literature on ADE, asymptotic theory was provided for parametric rates of convergence.
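The PSS estimator (1.2) can be sketched directly as a double sum over the sample. The following minimal Python sketch uses a Gaussian product kernel purely for illustration (the paper's Monte Carlo uses quartic and fourth-order kernels instead); the leave-one-out density derivative f̂′ and the scaling by h^{−(k+1)} follow the formula above.

```python
import numpy as np

def pss_ade(y, x, h):
    """Density weighted ADE of PSS: delta_hat = -(2/N) sum_i fhat'(x_i) y_i,
    with the leave-one-out kernel density derivative fhat'(x_i).
    A Gaussian product kernel is an illustrative assumption here."""
    n, k = x.shape
    u = (x[:, None, :] - x[None, :, :]) / h          # pairwise (x_i - x_j)/h
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    kprod = phi.prod(axis=2)                         # product kernel K(u)
    grad = -u * kprod[:, :, None] / h ** (k + 1)     # K'((x_i-x_j)/h) / h^(k+1)
    idx = np.arange(n)
    grad[idx, idx, :] = 0.0                          # leave-one-out: drop j = i
    fprime = grad.sum(axis=1) / (n - 1)              # fhat'(x_i), an N x k array
    return -2.0 * (fprime * y[:, None]).mean(axis=0)

# For a linear single index model y = x'beta, delta_0 = beta * E[f(x)], so the
# estimate is proportional to beta ('up to scale').
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 2))
y = x @ np.array([1.0, 1.0])
d = pss_ade(y, x, h=0.5)
```

With β = (1, 1)T the two estimated components should be positive and roughly equal, reflecting identification up to scale.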
Even though the estimators are based on non‐parametric kernel estimators, which depend on the kernel and bandwidth and converge at a non‐parametric rate, averaging can produce a parametric convergence rate, reducing dependence on the selection of the kernel and bandwidth, which then need not appear in the leading term of the mean squared error (MSE) expansion. This parametric rate of convergence (and thus the results in this literature), however, relies on the assumption of a sufficiently high degree of smoothness of the underlying density of the regressors, f(x). This assumption is not based on any a priori theoretical considerations. Various multimodal distributions are encountered in biomedical and statistical studies (see e.g. Izenman and Sommer, 1988); multimodal distributions, even if they are sufficiently smooth, possess derivatives that are large enough to cause problems; see the discussion in Marron and Wand (1992) for examples of normal mixtures that exhibit features usually thought of as characteristic of non‐smooth densities. Even when there is sufficient smoothness for parametric rates, the choice of bandwidth and kernel affects second‐order terms in the MSE, which are often not much smaller than the first‐order terms (see e.g. Dalalyan et al., 2006). Our concern with the possible violation of the assumed high degree of density smoothness led us to extend the existing asymptotic results for ADE by relaxing the smoothness assumptions on the density. We examine an expansion of the variance up to the first term that depends on the bandwidth. The leading term in the bias expansion is called the ‘asymptotic bias’, and the terms in the expansion of the MSE that combine these leading terms of bias and variance we call the ‘asymptotic MSE’ (AMSE). Insufficient smoothness will result in possible asymptotic bias and may easily lead to non‐parametric rates (exact results are in Theorem 3.1).
Since selection of the optimal kernel (order) and bandwidth (Powell and Stoker, 1996, and Theorem 3.1) presumes knowledge of the degree of density smoothness, uncertainty about that degree poses an additional concern. In principle, the smoothness properties of the density f(x) could differ for the different components of the vector x = (x1, …, xk)T, which could lead to possibly different rates for the component bandwidths, h[ℓ], ℓ = 1, …, k (e.g. Li and Racine, 2007). Even when all the rates are the same it may be advantageous to use different bandwidths in finite samples. Denote by h the diagonal matrix

h = diag(h[1], …, h[k]),    (1.3)

with inverse h−1, and denote the product of bandwidth components by

h̄ = h[1] ··· h[k].    (1.4)

With all bandwidths equal, h can be read as the scalar h and h̄ as h^k. With this notation the vector of density derivative estimates is

f̂′(x_i) = (1/(N − 1)) Σ_{j≠i} h̄^{−1} h^{−1} K′(h^{−1}(x_i − x_j)),

and the ADE is given by (1.2) with this f̂′. If the degree of smoothness is known, an optimal asymptotic rate of the bandwidth that balances the asymptotic variance and squared bias can be derived. Under some more restrictions the optimal bandwidth vector, hopt, is obtained in Theorem 3.1. Given sufficient smoothness, the optimal bandwidth rate balances the second‐order terms of the variance that depend on the kernel and bandwidth with the leading term in squared bias. The estimator with this bandwidth rate is referred to as second‐order rate efficient. With insufficient smoothness there is first‐order dependence of the variance on the kernel and bandwidth, and the rate‐optimal bandwidth ADE estimator is referred to as first‐order rate efficient. With an unknown degree of smoothness the optimal rate of bandwidth cannot be derived; however, it is consistently estimated here.
For a given kernel this involves an estimator of the rate using a technique that can be traced back to Woodroofe (1970).1 In Theorem 3.2, we show that there exists a (non‐convex) linear combination of ADE estimators, for a set of bandwidth vectors hs (with components hs[ℓ], ℓ = 1, …, k), s = 1, …, S, such that the trace of the AMSE of this linear combination is strictly smaller than that of the ADE with the optimal bandwidth. This somewhat surprising result is the consequence of the elimination of the leading terms of the biases in the linear combination, similar to the generalized jackknifing proposed in PSS (see appendix 2 in PSS) for the √N‐consistent ADE. We consider different kernels as well as different bandwidths in the linear combinations, since the selection among kernels (higher and lower order) is also hampered by an unknown degree of smoothness. This is an important generalization, in particular given that the order of the kernel has been shown to have a large impact on finite sample performance for density estimation; similarly, for kernels of the same order, different shapes (including asymmetric) affect performance; see Hansen (2005) and Kotlyarova and Zinde‐Walsh (2007). Combining estimators was recently investigated in the statistical literature, where for the most part convex combinations are used as a means to achieve adaptiveness (Juditsky and Nemirovski, 2000; Yang, 2000). Kotlyarova and Zinde‐Walsh (2006, hereafter KZW) propose non‐convex combinations for estimators with possibly non‐parametric rates. They develop the so‐called combined estimator with weights that minimize the trace of its estimated AMSE. Our proposed estimation strategy is as follows. If the smoothness, and thus the optimal rate for the bandwidth, were known, we would select several suitable kernels and specify a set of bandwidths for each that would ensure that the optimal linear combination outperforms any individual optimal estimator.
When the smoothness is not assumed known, we use a consistent estimator of the rate of the optimal bandwidth and consider a corresponding set of bandwidths. Next, we obtain the optimal linear combination of the individual ADEs under consideration by minimizing the trace of the AMSE. For this minimization the (leading terms of) variances of the estimators, the covariances between the different ADEs and the biases would have to be known. To obtain a feasible combined estimator (as in KZW) we use consistent estimators for the biases and covariances. We thus consider an estimator that optimally combines ADEs for different kernel/bandwidth pairs (Ks, hs), s = 1, …, S. The combined estimator is given by a weighted sum of the individual ADEs, where the weights are chosen so as to minimize the trace of the estimated MSE subject to the weights summing to one. We use a Monte Carlo experiment for the Tobit model, with a variety of distributions for the explanatory variables (Gaussian, a trimodal Gaussian mixture and the ‘double claw’ and ‘discrete comb’ mixtures from Marron and Wand, 1992). There, we demonstrate that there is no clear guidance on the choice of a suitable kernel/bandwidth pair. Even though in these cases the smoothness assumptions hold, the highly multimodal nature of these mixture distributions leads to large partial derivatives that undermine the performance of the ADE. At the same time, the combined estimator provides reliable results in all cases. The paper is organized as follows. In Section 2, we provide the assumptions, where we relax the usual high smoothness assumptions common in the literature. In Section 3, we derive the asymptotic properties of the ADE under various assumptions about density smoothness and the joint asymptotics for ADE estimators based on different bandwidth/kernel pairs, examine the advantages of linear combinations and develop the combined estimator. Section 4 provides the Monte Carlo study results and Section 5 concludes.

2.
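The weight choice just described (minimize the trace of the estimated MSE of the weighted sum subject to the weights summing to one) has a standard Lagrange closed form once the estimated biases and covariance blocks are in hand. The sketch below is our illustration of that minimization, not the paper's code; the matrix D with entries tr(Cov_st) + b_s'b_t mirrors the quadratic form a′Da appearing later in (3.3).

```python
import numpy as np

def combined_weights(biases, covs):
    """Minimize tr MSE of sum_s a_s * delta_s subject to sum_s a_s = 1.
    tr MSE(a) = a' D a with D[s, t] = tr(Cov(delta_s, delta_t)) + b_s'b_t;
    the constrained minimizer is D^{-1} 1 / (1' D^{-1} 1) (Lagrange).
    `biases[s]` is the estimated bias vector of estimator s; `covs[s][t]`
    the estimated k x k covariance block. Illustrative sketch only."""
    S = len(biases)
    D = np.empty((S, S))
    for s in range(S):
        for t in range(S):
            D[s, t] = np.trace(np.atleast_2d(covs[s][t])) + biases[s] @ biases[t]
    ones = np.ones(S)
    w = np.linalg.solve(D, ones)        # proportional to D^{-1} 1
    return w / (ones @ w)               # normalize so the weights sum to one

# Two hypothetical estimators; note the weights need not lie in [0, 1]
# (the combination is non-convex in general).
b = [np.array([0.2, 0.1]), np.array([-0.3, 0.0])]
c = [[np.eye(2) * 0.01, np.zeros((2, 2))], [np.zeros((2, 2)), np.eye(2) * 0.02]]
a = combined_weights(b, c)
```

Because the leading bias terms can cancel across estimators, the minimized quadratic form is no larger than the trace MSE of any individual estimator in the set.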
Assumptions

The assumptions here keep some conditions common in the literature on ADE but relax the usual higher smoothness assumptions. The first two assumptions are similar to PSS; they restrict x to be random variables that are continuously distributed, with no component of x functionally determined by other components of x (y could be discrete, e.g. a binary variable), and impose the minimal smoothness assumption of continuous differentiability on f and g.

Assumption 2.1. Let {(y_i, x_i^T)}_{i=1}^N be a random sample drawn from a distribution that is absolutely continuous in x. The support Ω of the density of x, f(x), is a convex (possibly unbounded) subset of R^k with non‐empty interior Ω0.

Assumption 2.2. The density function f(x) is continuous over R^k, so that f(x) = 0 for all x ∈ ∂Ω, where ∂Ω denotes the boundary of Ω; f is continuously differentiable in the components of x for all x ∈ Ω0, and the conditional mean function g(x) is continuously differentiable in the components of x on a set that differs from Ω0 by a set of measure 0.

Additional requirements involving the conditional distribution of y given x, as well as more detailed differentiability conditions, subsequently need to be added. The conditions are slightly amended from how they appear in the literature; in particular we use the weaker Hölder conditions instead of Lipschitz conditions; all the proofs can accommodate this weakened assumption.

Assumption 2.3. (a) E(y² | x) is continuous in x. (b) The components of the random vector g′(x) and matrix f′(x)[y, xT] have finite second moments; (fg)′ satisfies a Hölder condition with 0 < α0 ≤ 1.

The kernel K satisfies a standard assumption.2

Assumption 2.4. (a) The function K(u) is a symmetric continuously differentiable function in R^k with convex support. (b) The kernel function K(u) has order v(K) > 1: with (i1, …, ik) an index set,

∫ K(u) du = 1,
∫ u1^{i1} ··· uk^{ik} K(u) du = 0 for 0 < i1 + ··· + ik < v(K),
∫ u1^{i1} ··· uk^{ik} K(u) du ≠ 0 for some (i1, …, ik) with i1 + ··· + ik = v(K).

(c) The kernel smoothing function K(u) is differentiable up to the order v(K).
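The kernel-order condition of Assumption 2.4(b) can be checked numerically: moments below the order vanish while some moment at the order does not. The sketch below verifies this for the quartic (biweight) kernel, the second-order kernel used later in the Monte Carlo section, and for one standard fourth-order Gaussian-based kernel; the paper does not specify its fourth-order kernel in this excerpt, so that choice is our assumption.

```python
import numpy as np

def moment(kern, j, lo, hi, n=200001):
    """j-th moment of a univariate kernel by a simple Riemann sum."""
    u = np.linspace(lo, hi, n)
    du = u[1] - u[0]
    return float(np.sum(u ** j * kern(u)) * du)

def quartic(u):
    # quartic (biweight) kernel on [-1, 1]: (15/16)(1 - u^2)^2; order 2
    return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1 - u ** 2) ** 2, 0.0)

def gauss4(u):
    # a standard fourth-order Gaussian-based kernel: (3 - u^2) * phi(u) / 2
    # (an illustrative fourth-order choice, not necessarily the paper's)
    return 0.5 * (3 - u ** 2) * np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
```

For the quartic kernel the zeroth moment is 1 and the second moment is 1/7 ≠ 0 (order 2); for the fourth-order kernel the second moment vanishes while the fourth equals −3.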
Density smoothness plays a role in controlling the rate for the bias of the PSS estimator; the bias expansion is given in (2.1). We formalize the degree of density smoothness in terms of the Hölder space of functions. More precisely, with the ADE involving the derivative (vector) of the density, we specify the smoothness for each component of the derivative vector, ℓ = 1, …, k, separately, thereby enabling some components to be smoother than others. The Hölder space of functions consists of mℓ − 1 times continuously differentiable functions on Ω with all (mℓ − 1)th partial derivatives satisfying Hölder's condition of order αℓ.

Assumption 2.5. Each component of the derivative of the density belongs to such a Hölder space with mℓ ≥ 1 and 0 < αℓ ≤ 1.

Note that in the case mℓ = 1 there may be no more than Hölder continuity of the partial derivative without further differentiability, significantly relaxing the usual assumptions in the literature. We denote mℓ − 1 + αℓ by vℓ. Provided the density is smooth enough relative to the kernel order, the bias of the density derivative estimator is, as usual, O(∥h∥^{v(K)}) (by applying the v(K)th order Taylor expansion). If the differentiability conditions typically assumed do not hold, then the bias does not vanish sufficiently fast even for bandwidth vectors that shrink to zero; all we can state is an upper bound on the bias (component‐wise). We make a somewhat stronger assumption on the bias that is similar to Woodroofe (1970) for density estimation; to this end we introduce a diagonal matrix of bias rates, with its inverse, used in the expansion (2.2).

Assumption 2.6. (a) As N → ∞ the bias admits the expansion (2.2) with a non‐zero leading bias vector; (b) the smoothness indices vℓ are the same across components.

This assumption significantly relaxes the usual smoothness assumptions. Part (b) additionally assumes the same smoothness for the different derivatives. When all the bandwidths are the same h and vℓ is constant for all components, the matrix in Assumption 2.6 can be read as a scalar.

3.
Main Results

We extend the existing asymptotic results for ADE by relaxing the smoothness assumptions on the density and obtain optimal bandwidth rates. We show that linear combinations of ADEs can have better asymptotic properties than the optimal ADE and propose a feasible combination (the combined estimator) that minimizes the trace of the estimated MSE.

3.1. Asymptotic results for ADEs based on a specific kernel and bandwidth vector

We consider the asymptotic results for the ADE, given in (1.2), under Assumptions 2.1–2.6(a) of the previous section for all possible degrees of smoothness and kernel orders. Under minimal smoothness assumptions, Lemma 3.1 presents an expression for its variance.

Lemma 3.1. Under Assumptions 2.1–2.5, if the bandwidths shrink to zero at appropriate rates, the variance of the ADE has a kernel‐free leading term of order O(N−1) and a bandwidth‐dependent term involving Σ1(K).

We see that unless the O(N−1) term dominates the variance, there is first‐order dependence on the kernel. With the bias of our ADE given by Assumption 2.6(a), it then follows that the MSE satisfies (3.1). The following Theorem 3.1 summarizes all the possible convergence rates and limit features of the ADE for different choices of bandwidth and kernel, and presents the optimal bandwidth rate based on the standard bias–variance trade‐off.

Theorem 3.1. Under Assumptions 2.1–2.6(a):

(a) If the density is sufficiently smooth and the order of the kernel is sufficiently high, the rate O(N−1) for the MSE and the parametric rate √N for the ADE can be achieved for a range of bandwidth vectors. Outside this range, either the asymptotic variance depends on the kernel or the asymptotic bias dominates.

(b) If the density is not smooth enough or the order of the kernel is too low, the parametric rate cannot be obtained. The asymptotic variance depends on the kernel. Depending on the smoothness and the bandwidth/kernel pair (K, h), a diagonal matrix of rates rN obtains, such that the normalized ADE has finite first and second moments. Under further conditions, the ADE has no asymptotic bias at the rate of convergence rN.
(c) The optimal bandwidth vector can be obtained by minimizing the trace of the AMSE; under Assumption 2.6(b) this provides the optimal rate (3.2). The optimal constants for each ℓ can be obtained from this minimization (see the Appendix for details).

The theorem provides a full description of the asymptotic behaviour of the moments of the estimator, allowing for different bandwidth rates for the different components. For equal (rate) bandwidths under Assumption 2.6(b), the PSS results with the parametric rate hold for sufficiently smooth f(x). In the absence of a high degree of differentiability, the first‐order asymptotic variance (like the asymptotic bias) does depend on the weighting used in the local averaging (it involves Σ1(K)), yielding a non‐parametric rate. Selection of the optimal bandwidth and kernel (order) that minimize the mean squared error depends on our knowledge of the degree of smoothness of the density (see also Powell and Stoker, 1996).

3.2. Asymptotic results for linear combinations of ADEs

To reduce the dependence of the ADE on the optimal bandwidth and kernel (order) selection, we consider a linear combination of different ADE estimators, with weights summing to one, for a range of different kernel/bandwidth pairs. In order to obtain the AMSE of the combination, we need to find the leading terms of the first and second moments for the stacked vector of normalized estimators, with rNs the diagonal matrix of rates associated with the kernel/bandwidth pair (Ks, hs). First moments are given by Assumption 2.6; the limit covariances between the estimators are derived in the following Lemma 3.2. The fact that some estimators have zero covariances in the limit indicates that they provide complementary information.

Lemma 3.2. Under Assumptions 2.1–2.5, the limit covariance matrix for the stacked vector of normalized estimators has k × k blocks expressed in terms of the quantities defined in Lemma A.1 in the Appendix. Covariance matrices between estimators converging at different rates go to zero.
With the part of the asymptotic covariance between two ADEs that depends on the bandwidth in hand, the trace of the AMSE of the linear combination can be written as in (3.3): a quadratic form a′Da in the weights plus an O(N−1) part. We note that the O(N−1) part does not depend on the weights in the linear combination. In Theorem 3.2, we consider, for a given kernel, linear combinations of ADEs with different bandwidths. It shows that with appropriately chosen bandwidths it is possible to obtain an estimator that is superior to the individually optimal estimator.

Theorem 3.2. Under Assumptions 2.1–2.6(a), for any kernel K and given an optimal bandwidth hopt there exists a linear combination, for a set of bandwidth vectors h1, …, hS with hs[ℓ] = csℓ hopt[ℓ] and constants csℓ > 1, ℓ = 1, …, k, whose trace of AMSE is strictly smaller, as stated in (3.4).

If the degree of smoothness (and thus the optimal rate) and an upper bound on the constant in the optimal bandwidth were known, it would be straightforward to get weights that satisfy (3.4) even without knowing the optimal bandwidth itself.3 Under the conditions of Theorem 3.2, the trace of the AMSE could be minimized to obtain an optimal combination. Weights could be restricted to a compact set (e.g. ∥a∥ ≤ A < ∞) that would include weights that result in (3.4). Including different kernels in the linear combination would ensure that the combination performs better than the optimal estimator regardless of which of the chosen kernels dominates. When smoothness is low or not known, low‐order kernels should be used. With moderate sample sizes we use two low‐order kernels. For large samples a variety of kernels, including asymmetric kernels, could be beneficial, as both the order and the shape affect performance; see Hansen (2005) and Kotlyarova and Zinde‐Walsh (2007). Since minimizing the trace of the AMSE means in effect minimizing a′Da of (3.3), which has exactly the same structure as in KZW, their theorem 3.2 applies to show that the optimal weights provide the best convergence rate for a′Da available for any included bandwidth.

3.3.
The combined estimator

Next we consider replacing unknown quantities by estimates and define (as in KZW) the combined estimator as in (3.5), where the weights are chosen so as to minimize the trace of the estimated AMSE. Suppose that the covariances are consistently estimated, which will result from consistent estimation of Σ2 and of the terms depending on the bandwidth (e.g. using the plug‐in approach), and suppose the estimated biases are consistent as well. Then an argument as in theorem 3 of KZW implies that the weights that minimize the estimated trMSE will similarly lead to the best available rate for the bandwidth‐dependent part of trAMSE, a′Da. Here we further improve the combined estimator by appropriately choosing the bandwidths for the estimators in the combination. As noted in the previous section, had we known the smoothness properties we could use optimal rate bandwidths. Not knowing them requires a strategy that adapts to the unknown smoothness in the model. Suppose that for a kernel K we obtain a consistent estimator of the smoothness index; then a bandwidth vector based on the estimated rate is such that the bandwidth‐dependent part of the AMSE achieves the same best rate as for the optimal bandwidth. Following Theorem 3.2, we include for each kernel the estimated rate optimal bandwidth and k larger bandwidths; we also include a marginally smaller bandwidth and hgcv, an automatic bandwidth selector for non‐parametric regression. The remainder of this section is devoted to deriving the estimator of the smoothness index and proposing consistent estimators for the components of the AMSE. The leading terms of the variances and covariances, Σ1 and Σ2, can be consistently estimated with the usual plug‐in approach (i.e. by replacing the densities and derivatives by consistent non‐parametric estimators). One can also use a bootstrap (see Härdle and Bowman, 1988), as we do in our simulation.4 The bootstrap covariance estimator is obtained as

Ĉov(δ̂s, δ̂t) = (1/B) Σ_{b=1}^B (δ̂s^(b) − δ̄s)(δ̂t^(b) − δ̄t)′,    (3.6)

where for each of the B bootstrapped samples the estimates δ̂s^(b) are obtained for s = 1, …, S, and δ̄s denotes the average over the bootstrap samples.
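A bootstrap covariance computation in the spirit of (3.6) can be sketched as follows: resample (y_i, x_i) pairs, recompute each estimator on the bootstrap sample, and form cross-products of deviations from the bootstrap means. The toy "estimators" in the usage example stand in for ADEs with different kernel/bandwidth pairs and are our own illustrative choices.

```python
import numpy as np

def bootstrap_cov(y, x, estimators, B=200, seed=0):
    """Bootstrap estimate of the covariance blocks between several
    estimators, cf. (3.6). `estimators` is a list of callables
    (y, x) -> k-vector, one per kernel/bandwidth pair (Ks, hs).
    Illustrative sketch, not the paper's implementation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(B):
        idx = rng.integers(0, n, n)                  # resample with replacement
        draws.append([est(y[idx], x[idx]) for est in estimators])
    draws = np.asarray(draws)                        # B x S x k array
    dev = draws - draws.mean(axis=0)                 # deviations from means
    S = len(estimators)
    return [[dev[:, s].T @ dev[:, t] / B for t in range(S)] for s in range(S)]

rng = np.random.default_rng(1)
x = rng.standard_normal((300, 2))
y = x @ np.array([1.0, 1.0]) + rng.standard_normal(300)
# two toy estimators standing in for ADEs with different bandwidths
ests = [lambda y, x: x.mean(axis=0), lambda y, x: np.array([y.mean(), y.var()])]
cov = bootstrap_cov(y, x, ests, B=100)
```

The diagonal blocks are symmetric k × k covariance estimates; the off-diagonal block between estimators s and t is the transpose of the block between t and s.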
Theorem 3.3 below details our consistent estimators for the smoothness index and the bias. To obtain a consistent estimator of the smoothness index, we construct for a given kernel K a set of bandwidth vectors for which the corresponding estimators are asymptotically biased (oversmoothed). One such bandwidth vector, hgcv, is given by the usual cross‐validation for non‐parametric regression; it is oversmoothed for our purposes (following Stone, 1982). The consistent estimator is obtained by an approach reminiscent of Woodroofe's (1970); it relies on Assumption 2.6, from which it follows that, for each ℓth component of the ADE and two distinct oversmoothed bandwidth vectors ht and ht′ from the set detailed in Theorem 3.3 below, the difference of the corresponding estimates is governed by the leading bias terms, where ht[ℓ] is the ℓth component of the vector ht. Part (a) of Theorem 3.3 below provides the regression estimator of the smoothness index, together with a consistent estimator of the optimal rate for the bandwidth. To obtain a consistent estimator of the bias of the ADE for a given kernel K, we make use of the properties of oversmoothed and undersmoothed estimators. Specifically, using a pair of estimators, one of which is based on an oversmoothed bandwidth and the other on a somewhat undersmoothed one, we consistently estimate the bias for the oversmoothed estimator. Subsequently, for any bandwidth vector h a consistent estimator of the bias relies on the fact that the leading terms of the bias differ by the ratio of bandwidths raised to the power of the smoothness index, which we consistently estimate. The details are given in part (b) of Theorem 3.3.

Theorem 3.3. Under Assumptions 2.1–2.6:

(a) Consider a sequence of bandwidth vectors proportional to positive constants ct, and let Q be the cardinality of a subset of all pairs of such bandwidths. The estimator given in (3.7), for any ℓ = 1, …, k, is consistent for the smoothness index; given this estimate, a bandwidth vector with the optimal rate is consistently estimated.

(b) Given a pair of oversmoothed and undersmoothed bandwidths, a consistent estimate of the asymptotic bias is obtained, and consistent bias estimates for any bandwidth vector h follow from it.
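The idea behind the Woodroofe-style rate estimation can be made concrete in a simplified scalar form. Since the leading bias behaves like B·h^ν, differences of estimates on a geometric bandwidth grid h, ch, c²h satisfy (δ̂(c²h) − δ̂(ch))/(δ̂(ch) − δ̂(h)) → c^ν, so ν is recovered from the log of the ratio. This is our sketch of the idea, not the exact regression estimator (3.7) of Theorem 3.3(a).

```python
import numpy as np

def estimate_rate(delta_fn, h0, c=1.5):
    """Estimate the smoothness exponent nu from three oversmoothed
    bandwidths h0, c*h0, c^2*h0: with bias ~ B * h^nu,
    (delta(c^2 h) - delta(c h)) / (delta(c h) - delta(h)) -> c^nu.
    Simplified illustrative variant of the estimator in (3.7)."""
    d0, d1, d2 = delta_fn(h0), delta_fn(c * h0), delta_fn(c * c * h0)
    ratio = (d2 - d1) / (d1 - d0)
    return np.log(np.abs(ratio)) / np.log(c)

# Example: an estimator whose bias behaves exactly like 0.3 * h^2 (nu = 2).
nu_hat = estimate_rate(lambda h: 1.0 + 0.3 * h ** 2, h0=0.1)
```

For a bias that is exactly a power of h the recovery is exact, since the ratio of successive differences equals c^ν identically.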
We note that, by construction, the estimator in (3.7) when applied to the different components, ℓ = 1, …, k, will lead to k consistent estimators of the smoothness index, which will differ in finite samples.5 Summarizing, our proposed procedure consists of the following steps:

Step 1: Compute an estimator of the smoothness index for each included kernel. Two low‐order kernels should be included (we use one second‐ and one fourth‐order kernel).

Step 2: Compute the ADE for a set of kernel/bandwidth combinations, s = 1, …, S. For each kernel we use k + 3 bandwidths. Based on the estimated smoothness index for that kernel we include the estimated rate optimal bandwidth and k larger bandwidths in accordance with Theorem 3.2; we also include a marginally smaller bandwidth and hgcv, an automatic bandwidth selector for non‐parametric regression. Note that the set of bandwidths need not increase with N.

Step 3: Estimate all the covariances and biases for the individual estimators.

Step 4: Compute the combined estimator, with weights that minimize the estimated trace of the MSE subject to the weights summing to one over some compact set, e.g. {a* : ∥a*∥ ≤ A < ∞}.

4. Simulation

In order to illustrate the effectiveness of the combined estimator, we provide a Monte Carlo study of the Tobit model. The Tobit model under consideration is given by

y_i = max(x_i^T β + ɛ_i, 0),

where our dependent variable y_i is censored to zero for all observations for which the latent variable lies below a threshold, which without loss of generality is set equal to zero. We randomly draw the regressors and assume that the errors, drawn independently of the regressors, are standard Gaussian. Consequently, the conditional mean representation of y given x can be written as

E(y | x) = Φ(x^T β) x^T β + φ(x^T β),

where Φ(·) and φ(·) denote the standard normal cdf and pdf, respectively. We consider the density weighted average derivative estimate (ADE) of this single‐index model defined in (1.2), which identifies the parameters β ‘up to scale’ without relying on the Gaussianity assumption on ɛ_i.
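The Tobit data-generating process and its conditional mean can be sketched directly from the formulas above; here x is standard normal, corresponding to the (s,s) base model described below (the mixture designs are not reproduced in this sketch).

```python
import numpy as np
from math import erf, sqrt, pi

def simulate_tobit(n, beta, rng):
    """Monte Carlo DGP: y* = x'beta + eps with eps ~ N(0, 1) independent
    of x, and y = max(y*, 0); x standard normal (the (s,s) base model)."""
    x = rng.standard_normal((n, len(beta)))
    ystar = x @ np.asarray(beta) + rng.standard_normal(n)
    return np.maximum(ystar, 0.0), x

def tobit_cond_mean(x, beta):
    """E(y | x) = Phi(x'beta) * x'beta + phi(x'beta) for the Tobit model
    censored at zero with unit error variance."""
    xb = np.atleast_1d(np.asarray(x) @ np.asarray(beta))
    Phi = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in xb])
    phi = np.exp(-0.5 * xb ** 2) / sqrt(2.0 * pi)
    return Phi * xb + phi

rng = np.random.default_rng(0)
y, x = simulate_tobit(2000, (1.0, 1.0), rng)
```

For large x'β the censoring probability vanishes and the conditional mean approaches x'β itself, which provides a quick sanity check on the formula.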
Under the usual smoothness assumptions, the finite sample properties of the ADE for this Tobit model have been considered in the literature (Nishiyama and Robinson, 2005). We use two explanatory variables and select β = (1, 1)T. We make various assumptions about the distribution of our independent explanatory variables. The base model uses two standard normal explanatory variables. In the other models various multimodal normal mixtures are considered, which, while still being infinitely differentiable, allow behaviour resembling that of non‐smooth densities. In particular, we consider the trimodal normal mixture used in KZW and the ‘double claw’ and ‘discrete comb’ mixtures (Marron and Wand, 1992). The models are labelled using two indices (i1, i2) representing the distributions used for the two explanatory variables, with each index s (standard normal), m (trimodal normal mixture), c (double claw) or d (discrete comb). The sample size is set at N = 2000 with 100 replications. The multivariate kernel function K(·) (on R2) is chosen as the product of two univariate kernel functions. We use the quartic second‐order kernel (see e.g. Yatchew, 2003) and a fourth‐order kernel in our Monte Carlo experiment, where, given that we use two explanatory variables, the highest order satisfies the theoretical requirement for ascertaining a parametric rate subject to the necessary smoothness assumptions.6 First, we apply the usual cross‐validation for non‐parametric regression, yielding the bandwidth hgcv. Even though the rates are the same, for computation of bandwidth vectors in our finite sample experiment we allow for differing bandwidths. We obtain them using a grid search.7 Next, we estimate the smoothness index using bandwidths that satisfy the conditions of Theorem 3.3(a).
The actual bandwidth sequences are {hgcv, 1.01hgcvN1/36, 0.98hgcvN2/36, 0.93hgcvN3/36, 0.86hgcvN4/36, 0.78hgcvN5/36} for the second‐order kernel and {hgcv, 1.10hgcvN1/60, 1.16hgcvN2/60, 1.20hgcvN3/60, 1.21hgcvN4/60, 1.19hgcvN5/60} for the fourth‐order kernel. The ct's were selected to ensure a reasonable spread of bandwidths for N = 2000 (they correspond to the bandwidth sequences {hgcv, 1.25hgcv, 1.5hgcv, 1.75hgcv, 2.0hgcv, 2.25hgcv}). To estimate the smoothness index we select a subset of Q bandwidths in the following way: we select a range of consecutive bandwidths for which the differences of the corresponding estimates all have the same sign; if that is not possible, we use all pairs. The estimated smoothness indices for our models, on average, were: in the (s,s) model (1.99, 1.98) for the second‐order kernel, K2, and (3.68, 3.71) for the fourth‐order kernel, K4; in the (s,m) model K2 provided (1.70, 1.51) and K4 (3.16, 2.67); in the (m,m) model K2 (1.40, 1.37) and K4 (1.98, 1.96); in the (s,c) model K2 (1.94, 1.87) and K4 (3.56, 3.21); in the (s,d) model K2 (1.50, 1.04) and K4 (3.36, 1.67); and in the (c,d) model K2 (1.49, 0.89) and K4 (3.14, 1.68), which are reasonable. We relate each estimate to its respective component in the ADE vector.

Table 1. Relative RMSE of the density weighted ADE estimators.

                   Model (s,s)     Model (s,m)     Model (m,m)
Bandwidth/Kernel   K2      K4      K2      K4      K2      K4
h0 (hu)            0.234   0.141   0.457   0.427   0.672   0.686
h1                 0.156   0.113   0.471   0.445   0.694   0.743
h2 (hopt)          0.106   0.085   0.500   0.495   0.755   0.811
h3                 0.093   0.078   0.533   0.516   0.811   0.839
h4 (ho)            0.096   0.078   0.564   0.519   0.856   0.864
h5 (hgcv)          0.113   0.083   0.607   0.543   0.910   0.934
Combined               0.096           0.561           0.869

                   Model (s,c)     Model (s,d)     Model (c,d)
Bandwidth/Kernel   K2      K4      K2      K4      K2      K4
h0 (hu)            0.499   0.455   1.487   1.538   1.054   1.015
h1                 0.465   0.447   1.319   1.271   0.900   0.785
h2 (hopt)          0.470   0.444   1.168   1.038   0.728   0.632
h3                 0.480   0.447   1.033   0.995   0.613   0.671
h4 (ho)            0.486   0.451   0.895   0.925   0.517   0.671
h5 (hgcv)          0.497   0.461   0.766   0.847   0.479   0.575
Combined               0.465           0.872           0.690
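The claimed equivalence between the two parametrizations of the bandwidth sequences can be verified numerically: at N = 2000 the sequences c_t·hgcv·N^{t/36} (second-order kernel) and c_t·hgcv·N^{t/60} (fourth-order kernel) reproduce the spread {hgcv, 1.25hgcv, …, 2.25hgcv}. The check below normalizes hgcv to 1, since only the ratios matter.

```python
import numpy as np

# Bandwidth sequences of the Monte Carlo design, with hgcv normalized to 1.
N = 2000
target = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25])
c2 = np.array([1.00, 1.01, 0.98, 0.93, 0.86, 0.78])   # second-order kernel
c4 = np.array([1.00, 1.10, 1.16, 1.20, 1.21, 1.19])   # fourth-order kernel
t = np.arange(6)
seq2 = c2 * N ** (t / 36.0)   # c_t * N^{t/36}
seq4 = c4 * N ** (t / 60.0)   # c_t * N^{t/60}
```

Both sequences agree with the target spread to within 0.01·hgcv at N = 2000, confirming the stated correspondence.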
h0(hu) 0.234 0.141 0.457 0.427 0.672 0.686 h1 0.156 0.113 0.471 0.445 0.694 0.743 h2(hopt) 0.106 0.085 0.500 0.495 0.755 0.811 h3 0.093 0.078 0.533 0.516 0.811 0.839 h4(ho) 0.096 0.078 0.564 0.519 0.856 0.864 h5(hgcv) 0.113 0.083 0.607 0.543 0.910 0.934 Combined 0.096 0.561 0.869 Model (s,c) Model (s,d) Model (c,d) Bandwidth/Kernel K2 K4 K2 K4 K2 K4 h0(hu) 0.499 0.455 1.487 1.538 1.054 1.015 h1 0.465 0.447 1.319 1.271 0.900 0.785 h2(hopt) 0.470 0.444 1.168 1.038 0.728 0.632 h3 0.480 0.447 1.033 0.995 0.613 0.671 h4(ho) 0.486 0.451 0.895 0.925 0.517 0.671 h5(hgcv) 0.497 0.461 0.766 0.847 0.479 0.575 Combined 0.465 0.872 0.690 Open in new tab In accordance with Theorem 3.3(b), we choose and and obtain . For the combined estimator we consider, as indicated in our Step 2, a range of bandwidths that include for each kernel and two larger bandwidths, one marginally smaller bandwidth and the generalized cross‐validation bandwidth, providing (or ). With two kernels, this implies that the combined estimator under consideration has S = 10. Covariances are computed by bootstrap using (3.6); biases according to Theorem 3.3(b). The weights are then obtained by minimizing the trAMSE constructed according to (3.3) with estimated biases and covariances subject to the sum of the weights being equal to 1.8 Larger weights including those of opposite signs, are typically given to the higher bandwidths for the second‐ and fourth‐order kernel. In Table 1, we report relative error: the ratio of the true finite sample root mean squared errors (RMSE) to δ0 for ADE in the different models for the sample size N = 2000. Note that the relative errors for model (s,s) are in the range 7.8–23.4% and are relatively small. For (s,c) the errors are much larger but are close for all bandwidths and kernels: range is 44.4–49.9%, so there is not much sensitivity to the choice of bandwidth/kernel order. 
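The trAMSE-minimizing weights described above admit a closed form for a single component. A minimal sketch, assuming a generic S-vector of estimated biases and an S × S bootstrap covariance matrix for one component of the ADE vector (the function name and interface are ours, not the paper's; the paper combines estimated biases from Theorem 3.3(b) with covariances from (3.6)):

```python
import numpy as np

def combined_weights(bias, cov):
    """Weights minimizing the estimated MSE of a linear combination
    sum_s w_s * d_s of S estimators of a scalar parameter, subject to
    sum(w) == 1.  The MSE is w' (cov + bias bias') w, whose constrained
    minimizer is M^{-1} 1 / (1' M^{-1} 1) with M the estimated MSE matrix.
    Weights are unrestricted in sign, so the combination may be non-convex,
    as in the paper."""
    b = np.asarray(bias, dtype=float)
    M = np.asarray(cov, dtype=float) + np.outer(b, b)  # estimated MSE matrix
    ones = np.ones(len(b))
    w = np.linalg.solve(M, ones)
    return w / (ones @ w)

# Toy illustration: two unbiased estimators with variances 1 and 4
# receive weights inversely proportional to their variances.
w = combined_weights([0.0, 0.0], np.diag([1.0, 4.0]))
print(w)  # -> [0.8 0.2]
```

With biased estimators the bias outer product penalizes the corresponding bandwidths, which is why large offsetting weights of opposite sign can appear, as in footnote 8.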
There is somewhat more dispersion in the (s,m) case (range 42.7–60.7%), but even here the price of an incorrect choice (associated with too large a bandwidth) is not dramatic. The consequences of the choice are more striking in the (m,m) case, where relative errors range over 68.6–93.4%; as in (s,m), an incorrect choice involves oversmoothing, but unlike (s,m) the higher-order kernel gives consistently worse results. The most dramatic cases are (s,d), with range 76.6–153.8%, and (c,d), with range 47.9–105.4%, where an incorrect choice is now associated with undersmoothing. In these cases the combined estimator gives results much closer to the lower bound of the error range than to the upper bound, and often performs better than the estimated optimal bandwidth. We conclude that there is no rule regarding either kernel order or bandwidth that works uniformly (Hansen, 2005, reports similar results): some individual estimators that are best for one model are worst for another. The estimated optimal bandwidth compares favourably with many bandwidths (including cross-validation), but there is no indication of which order of kernel to use. The combined estimator offers reliably good performance and is often better than the optimal bandwidth, especially in cases of large relative errors.

5. Conclusions

In this paper we provide asymptotic properties of the ADE in the case of insufficient smoothness (or kernel order) and demonstrate the availability of estimators that improve on the ADE with optimal bandwidth by using linear non-convex combinations of ADEs. We adapt to unknown and/or insufficient density smoothness by using a combined estimator constructed with specially selected bandwidths based on the optimal rate. When the degree of smoothness is unknown, the optimal bandwidth rate is consistently estimated.
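The rate tradeoff underlying the optimal bandwidth can be summarized schematically; the notation here is ours, with ν standing for the effective smoothness (equal to the kernel order P when smoothness is sufficient, recovering the Powell and Stoker, 1996, rate):

```latex
% Schematic MSE decomposition for the density-weighted ADE:
% parametric variance + nonparametric variance + squared bias.
\mathrm{MSE}(h)\;\asymp\;\frac{\Sigma_2}{N}
  \;+\;\frac{\Sigma_1(K)}{N^{2}h^{k+2}}
  \;+\;h^{2\nu}\,b\,b^{\mathsf{T}},
\qquad
% equating the last two terms gives the optimal bandwidth rate
h_{\mathrm{opt}}\;\asymp\;N^{-2/(2\nu+k+2)}.
```

When ν is unknown, this rate cannot be plugged in directly, which is what motivates the consistent estimation of the optimal rate in Theorem 3.3.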
Monte Carlo simulations demonstrate that, even when the smoothness assumptions formally hold, large values of the derivatives mean there is no general guidance for selecting a kernel and bandwidth that avoids large errors for some distributions. Using the estimated optimal rate bandwidth leads to less erratic performance but can be adversely affected by an incorrect kernel choice. By not relying on a single kernel/bandwidth choice, the combined estimator reduces sensitivity and provides good, reliable performance.

Acknowledgments

The authors would like to thank two anonymous referees and Richard Smith for their comments and suggestions. The work was supported by the Social Sciences and Humanities Research Council of Canada (SSHRC) and by the Fonds québécois de la recherche sur la société et la culture (FQRSC).

Footnotes

1. We are grateful to an anonymous referee who pointed out that our approach to rate estimation is reminiscent of Woodroofe (1970). Unfortunately, Lemma 2.3 of that paper does not hold, and the proofs about convergence of MSE that use it cannot be applied.

2. In Schafgans and Zinde-Walsh (2007) we discuss the possibility of non-symmetric kernels and derive results for that case.

3. Implementing a sequence of slowly diverging constants could get around having strict bounds on the optimal constant.

4. The bootstrapped variance provides the same expansion as in Lemma 3.2 for our simulation example with k = 2; in general, validity of this bootstrap expansion holds under somewhat stronger moment assumptions, such as E(y^4 | x) < ∞. Details can be obtained from the authors.

5. As a referee pointed out, if we use all H(H + 1)/2 pairs we can simplify some of the sums above, e.g. … .

6. The fourth-order kernel we use is given by … .

7. The cross-validated bandwidths for the second- and fourth-order kernels in the (s,s) model with N = 2000 were … and … respectively.
The bandwidths for the (s,m) model were … and … respectively; for the (m,m) model, … and …; for the (s,c) model, … and …; for the (s,d) model, … and …; and for the (c,d) model, … and … .

8. Ordering the kernel/bandwidth pairs s = 1, …, 10 as (K2, h1), …, (K2, h5), (K4, h1), …, (K4, h5), the average weights are (−0.00, −0.03, 0.65, −0.45, −0.07, −0.09, −0.04, −0.23, 2.30, −1.05) for the (s,s) model; (0.03, −0.01, 0.89, 1.00, −0.77, −0.38, −0.10, −0.22, 1.24, −0.70) for (s,m); (0.02, 0.07, −0.92, 4.01, −2.40, −0.32, 0.14, −0.74, 1.43, −0.28) for (m,m); (0.02, −0.11, 0.74, −0.14, −0.09, −0.14, −0.11, −0.08, 1.96, −1.05) for (s,c); (0.03, 0.11, 0.30, 2.35, −0.60, −0.50, −0.36, −0.52, 1.10, −0.90) for (s,d); and (0.05, 0.09, −0.29, 2.77, −0.85, −0.52, 0.65, −0.87, 1.62, −1.06) for (c,d).

References

Banerjee, A. N. (2007). A method of estimating the average derivative. Journal of Econometrics 136, 65–88.

Blundell, R., A. Duncan and K. Pendakur (1998). Semiparametric estimation and consumer demand. Journal of Applied Econometrics 13, 435–61.

Cattaneo, M. D., R. K. Crump and M. Jansson (2008). Small bandwidth asymptotics for density-weighted average derivatives. Research Paper 2008-24, CREATES, Aarhus University. Available at SSRN: http://ssrn.com/abstract=1148173.

Chaudhuri, P., K. Doksum and A. Samarov (1997). On average derivative quantile regression. Annals of Statistics 25, 715–44.

Dalalyan, A. S., G. K. Golubev and A. B. Tsybakov (2006). Penalized maximum likelihood and semiparametric second order efficiency. Annals of Statistics 34, 169–201.

Donkers, B. and M. Schafgans (2008). Specification and estimation of semiparametric multiple-index models. Econometric Theory 24, 1584–606.

Hansen, B. E. (2005).
Exact mean integrated squared error of higher order kernel estimators. Econometric Theory 21, 1031–57.

Härdle, W. and A. W. Bowman (1988). Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. Journal of the American Statistical Association 83, 101–10.

Härdle, W., W. Hildenbrand and M. Jerison (1991). Empirical evidence on the law of demand. Econometrica 59, 1525–49.

Härdle, W. and T. M. Stoker (1989). Investigating smooth multiple regression by the method of average derivatives. Journal of the American Statistical Association 84, 986–95.

Horowitz, J. L. and W. Härdle (1996). Direct semiparametric estimation of single-index models with discrete covariates. Journal of the American Statistical Association 91, 1632–40.

Izenman, A. J. and C. J. Sommer (1988). Philatelic mixtures and multimodal densities. Journal of the American Statistical Association 83, 941–53.

Juditsky, A. and A. Nemirovski (2000). Functional aggregation for nonparametric regression. Annals of Statistics 28, 681–712.

Kotlyarova, Y. and V. Zinde-Walsh (2006). Non- and semi-parametric estimation in models with unknown smoothness. Economics Letters 93, 379–86.

Kotlyarova, Y. and V. Zinde-Walsh (2007). Robust kernel estimator for densities of unknown smoothness. Journal of Nonparametric Statistics 19, 89–101.

Li, Q., X. Lu and A. Ullah (2003). Multivariate local polynomial regression for estimating average derivatives. Journal of Nonparametric Statistics 15, 607–24.

Li, Q. and J. S.
Racine (2007). Nonparametric Econometrics: Theory and Practice. Princeton: Princeton University Press.

Marron, J. S. and M. P. Wand (1992). Exact mean integrated squared error. Annals of Statistics 20, 712–36.

Newey, W. K. and T. M. Stoker (1993). Efficiency of weighted average derivative estimators and index models. Econometrica 61, 1199–223.

Nichiyama, Y. and P. M. Robinson (2000). Edgeworth expansions for semiparametric averaged derivatives. Econometrica 68, 931–80.

Nichiyama, Y. and P. M. Robinson (2005). The bootstrap and the Edgeworth correction for semiparametric averaged derivatives. Econometrica 73, 903–48.

Powell, J. L., J. H. Stock and T. M. Stoker (1989). Semiparametric estimation of weighted average derivatives. Econometrica 57, 1403–30.

Powell, J. L. and T. M. Stoker (1996). Optimal bandwidth choice for density-weighted averages. Journal of Econometrics 75, 291–316.

Robinson, P. M. (1989). Hypothesis testing in semiparametric and nonparametric models for econometric time series. Review of Economic Studies 56, 511–34.

Schafgans, M. M. A. and V. Zinde-Walsh (2007). Robust average derivative estimation. Working Paper 2007-12, Department of Economics, McGill University.

Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics 10, 1040–53.

Woodroofe, M. (1970). On choosing a delta sequence. Annals of Mathematical Statistics 41, 1665–71.

Yang, Y. (2000).
Combining different procedures for adaptive regression. Journal of Multivariate Analysis 74, 135–61.

Yatchew, A. (2003). Semiparametric Regression for the Applied Econometrician. Cambridge: Cambridge University Press.

Appendix: Proofs

The proofs of our results rely on the following lemma. For notational purposes we recall (1.3) and (1.4): h is a diagonal matrix and … is the scalar product of bandwidth components. Both h and … can be read as a scalar h in the equal-bandwidth setting (and in Assumption 2.6(b), where it relates to …).

Lemma A.1. Under Assumptions 2.1–2.5, if … and … for s = 1, …, S, the covariance of … and …, for s1, s2 = 1, …, S, is … with …

Proof: To derive an expression for …, for s1, s2 = 1, …, S, we note (A.1) … Let I(τ) = 1 if the expression τ is true, and zero otherwise. We decompose the first term as follows: (A.2) … For the first term of (A.2) we obtain … where the final equality applies a change of variables to the first term and uses the independence of … and … for the second term. Note … A similar change of variables applied to the second term of the resulting expression yields … Recognizing that the kernel vanishes at the boundary, using integration by parts and applying Assumption 2.5, we obtain, e.g. for s = s1, s2, … with … Note that the o(1) term here and below is by Assumptions 2.4 and 2.5. Similarly for … and … we obtain (A.3) … For the second term in the last line of (A.2), where for brevity we omit terms such as I(i1 ≠ i2) in the expression, we obtain … Applying integration by parts to the various terms, with the kernel vanishing at the boundary and with the minimal smoothness requirements on g(x) and f(x) represented by Assumptions 2.3 and 2.5 with mℓ ≥ 1, we note for s = s1, s2 that … with …; … with …; and … with …; this gives (A.4) … Substituting (A.3) and (A.4) via (A.2) into (A.1) gives the desired result.
□

Proof of Lemma 3.1: The result is a special case of Lemma A.1 with s1 = s2 = s, where the subscripts indicating a particular kernel/bandwidth combination are removed. □

Proof of Theorem 3.1: The proof relies on the expression for the MSE in (3.1), which combines the squared bias from Assumption 2.6 (based on Assumption 2.5) and the variance given in Lemma A.1. The variance has two leading parts: one converges to Σ2 at the parametric rate O(N^{-1}); the other converges to Σ1(K) at rate …; the squared bias converges at rate … . In case (a), for …, the term N^{-1}Σ2 dominates the MSE; correspondingly, a parametric rate holds for the estimator; the asymptotic normality result in PSS easily adapts to accommodate different bandwidths and holds in this case. For … the parametric rate still holds, but the variance may have a part that depends on the kernel. For … the rate is parametric, but asymptotic bias is present. When … (undersmoothing), the MSE is dominated by … The estimator has no asymptotic bias, but its variance depends on the kernel, and the convergence rate is … . If … (oversmoothing), the squared asymptotic bias dominates the MSE and, by standard arguments (Chebyshev's inequality), the estimator converges in probability to … at rate … . In case (b) the range of bandwidths corresponding to parametric rates cannot be obtained. When … the MSE is dominated by the term … . The estimator has no asymptotic bias, and the convergence rate is … . If … the squared asymptotic bias dominates the MSE and the estimator converges in probability to … at rate … . For (c), assume without loss of generality that h[1] has the slowest rate among the bandwidth components; then in terms of rates every other component is … with σℓ ≥ 1. The part of trMSE that depends on the bandwidth, trMSE(h), then takes the form … with positive coefficients sℓ, bℓ.
As h[1] increases, the first term declines and the second term increases; in either case (a) or (b), over the relevant range of h[1] the first term dominates the sum at low bandwidths and the second at higher ones. As a continuous function of h[1], trMSE(h) attains a minimum over that range. If all the bandwidths are the same and equal to h[1], so that …, we obtain the optimal rate in (3.2) by equating the rates of the two components. If the bandwidth rates are the same and …, with … and h[ℓ] = cℓ h[1], ℓ = 1, …, k (c1 = 1), the optimal constants can be obtained by solving … and … with respect to (c0, c2, …, ck). □

Proof of Lemma 3.2: Lemma A.1 provides the limit covariance matrix for the vector with components … with k × k blocks … . We note here that the expression for the covariance can also be written by interchanging s1 and s2; thus, for different bandwidth rates, without any loss of generality we can assume that … . For … the expression under the integral converges to zero, since by symmetry K′(0) = 0 and we can interchange integration and passage to the limit by continuity, giving μ2 = o(1). We note now that only two cases of different rates are possible here: (a) a parametric rate for s2 and a non-parametric one for s1, and (b) non-parametric (different) rates for both. Denote the square root of the product of bandwidth components, …, by … . Consider case (a): … . Then … For case (b), …, we get … □

Proof of Theorem 3.2: Consider each i separately and suppress the subscript i. First we find weights on the ith component that eliminate the ith leading component of the bias of the combination and show that the norm of this weight vector can be less than one; then we show that, as a result, the term coming from the variance is smaller than the corresponding term for the optimal bandwidth. Solve first (A.5) … Denoting … by bs, the Lagrangian is … From the first-order conditions we obtain … Denoting … by α, we obtain … .
By squaring and summing … for s = 1, …, S, we get … This quadratic equation for α has a root … as a solution to the first-order conditions. We denote … and … (defined in Lemma A.1) and recall the definitions of Σ1(K) and μ2(K) from Lemma 3.1. Next we show that …, with {·}ii denoting the ith diagonal element. This follows since … where the last line follows from … for any ϕ > 0 and G(·). Now, for the combination with the weights that solve (A.5) with …, we can evaluate the part of the ith diagonal element of trAMSE coming from the variance that depends on the bandwidth (it involves the leading term of …; see Lemma A.1). With … denoting the product of the components of hopt, we observe that its comparable component in … is given by … (see Lemma 3.1). Now, … With {hs[ℓ]} = csℓ{hopt[ℓ]}, csℓ > 1, ℓ = 1, …, k, the second inequality reflects the fact that … and uses Cauchy's inequality … . The last inequality uses α = 1/S. Recall that the part of trAMSE that involves the matrix Σ2 does not depend on the weights. Thus the sum of the k diagonal elements of the trAMSE of the linear combination is no greater than that for the optimal bandwidth if S > k, enabling this linear combination to be strictly better than the individual ADE based on the optimal bandwidth, … . □

Proof of Theorem 3.3: (a) We utilize the expression for the bias given in Assumption 2.6(a) component-wise: (A.6) … Following Assumption 2.6(b), we consider constant … . Using (A.6) and Lemma A.1 we can write, for bandwidth vector ht, (A.7) … where … by the Chebyshev inequality. For each kernel consider the sequence of bandwidths …, with γt > 0 (to ensure oversmoothing) and … to ensure that the bandwidths converge to zero; the condition … relies on the unknown …; the more smoothness, the tighter the bound on γt; thus we can replace this condition by … . For H ≥ 2 we obtain a sequence of bandwidth vectors for which the bias term dominates the MSE for this estimator, so that ψt = op(1).
For this sequence of bandwidths we then have … Difference these equations component-wise to eliminate δ0; then for the ℓth component, based on two distinct bandwidth vectors …, … When … we get (A.8) … For each ℓ we define a subset of all pairs … with cardinality …; we consider for each ℓ the following Q equations: (A.9) … with eℓ = op(1). We obtain these equations by squaring both sides of (A.8) and applying the natural logarithm transformation. The right-hand side of (A.9) uses … and follows from an expansion of the ln function: … Define … and consider the least squares estimator … With the bandwidth vectors …, with some constants ct, we note that the (non-stochastic) regressors are trending as N increases: wtℓ = O(ln N) as … . So … and for any ℓ we get … .

(b) Without loss of generality we assume that all bandwidth components are the same, so that both … and h^{-1} can be read as scalars. Using (A.7) we write … With … we note … . Recognizing that … and …, with ζ satisfying the lower bound requirement, we obtain … Now we similarly investigate for hu the rates in … relative to the rate of … . Clearly …; we have shown …; finally, … . Then, substituting and evaluating the rates, we obtain … revealing that the difference … provides a consistent estimator for the asymptotic … . Consider now the asymptotic … with … as … . The estimator … thus provides a consistent estimator of the asymptotic bias. □

© The Author(s). Journal compilation © Royal Economic Society 2010.

TI - Smoothness adaptive average derivative estimation JF - Econometrics Journal DO - 10.1111/j.1368-423X.2009.00300.x DA - 2010-02-01 SP - 40 VL - 13 IS - 1 ER -