Quantized minimax estimation over Sobolev ellipsoids

Abstract
We formulate the notion of minimax estimation under storage or communication constraints, and prove an extension to Pinsker's theorem for non-parametric estimation over Sobolev ellipsoids. Placing limits on the number of bits used to encode any estimator, we give tight lower and upper bounds on the excess risk due to quantization in terms of the number of bits, the signal size and the noise level. This establishes the Pareto optimal tradeoff between storage and risk under quantization constraints for Sobolev spaces. Our results and proof techniques combine elements of rate distortion theory and minimax analysis. The proposed quantized estimation scheme, which shows achievability of the lower bounds, is adaptive in the usual statistical sense, achieving the optimal quantized minimax rate without knowledge of the smoothness parameter of the Sobolev space. It is also adaptive in a computational sense, as it constructs the code only after observing the data, to dynamically allocate more codewords to blocks where the estimated signal size is large. Simulations are included that illustrate the effect of quantization on statistical risk.

1. Introduction
In this article, we introduce a minimax framework for non-parametric estimation under storage constraints. In the classical statistical setting, the minimax risk for estimating a function f from a function class F using a sample of size n places no constraints on the estimator f^n, other than requiring it to be a measurable function of the data. However, if the estimator is to be constructed with restrictions on the computational resources used, it is of interest to understand how the error can degrade. Letting C(f^n)≤Bn indicate that the computational resources C(f^n) used to construct f^n are required to fall within a budget Bn, the constrained minimax risk is   Rn(F,Bn)=inff^n:C(f^n)≤Bnsupf∈FR(f^n,f). Minimax lower bounds on the risk as a function of the computational budget thus determine a feasible region for computation constrained estimation and a Pareto optimal tradeoff for risk versus computation as Bn varies. Several recent papers have presented results on tradeoffs between statistical risk and computational resources, measured in terms of either running time of the algorithm, number of floating point operations or number of bits used to store or construct the estimators [6,5,16]. However, the existing work quantifies the tradeoff by analyzing the statistical and computational performance of specific procedures, rather than by establishing lower bounds and a Pareto optimal tradeoff. In this article, we treat the case where the complexity C(f^n) is measured by the storage or space used by the procedure and sharply characterize the optimal tradeoff. Specifically, we limit the number of bits used to represent the estimator f^n. We focus on the setting of non-parametric regression under standard smoothness assumptions and study how the excess risk depends on the storage budget Bn. We view the study of quantized estimation as a theoretical problem of fundamental interest. But quantization may arise naturally in future applications of large-scale statistical estimation. For instance, when data are collected and analyzed on board a remote satellite, the estimated values may need to be sent back to Earth for further analysis.
To limit communication costs, the estimates can be quantized, and it becomes important to understand what, in principle, is lost in terms of statistical risk through quantization. A related scenario is a cloud computing environment, where data are processed for many different statistical estimation problems, with the estimates then stored for future analysis. To limit the storage costs, which could dominate the compute costs in many scenarios, it is of interest to quantize the estimates, and the quantization-risk tradeoff again becomes an important concern. Estimates are always quantized to some degree in practice. But to impose energy constraints on computation, future processors may limit precision in arithmetic computations more significantly [11]; the cost of limited precision in terms of statistical risk must then be quantified. A related problem is to distribute the estimation over many parallel processors, and to then limit the communication costs of the submodels to the central host. We focus on the centralized setting in this article, but an extension to the distributed case may be possible with the techniques that we introduce here. We study risk-storage tradeoffs in the normal means model of non-parametric estimation assuming the target function lies in a Sobolev space. The problem is intimately related to classical rate distortion theory [12], and our results rely on a marriage of minimax theory and rate distortion ideas. We thus build on and refine the connection between function estimation and lossy source coding that was elucidated in David Donoho's 1997 Wald Lectures [9]. We work in the Gaussian white noise model   dX(t)=f(t)dt+εdW(t), 0≤t≤1, (1.1) where W is a standard Wiener process on [0,1], ε is the standard deviation of the noise, and f lies in the periodic Sobolev space W˜(m,c) of order m and radius c. (We discuss the non-periodic Sobolev space W(m,c) in Section 4.) The white noise model is a centerpiece of non-parametric estimation. It is asymptotically equivalent to non-parametric regression [4] and density estimation [18], and simplifies some of the mathematical analysis in our framework. In this classical setting, the minimax risk of estimation   Rε(m,c)=inff^εsupf∈W˜(m,c)E||f−f^ε||22 is well known to satisfy   limε→0ε−4m2m+1Rε(m,c)=(c2(2m+1)π2m)12m+1(mm+1)2m2m+1≜Pm,c, (1.2) where Pm,c is Pinsker's constant [19]. The constrained minimax risk for quantized estimation becomes   Rε(m,c,Bε)=inff^ε,C( f^ε)≤Bεsupf∈W˜(m,c)E||f−f^ε||22, where f^ε is a quantized estimator that is required to use storage C(f^ε) no greater than Bε bits in total. Our main result identifies three separate quantization regimes. In the over-sufficient regime, the number of bits is very large, satisfying Bε≫ε−22m+1 and the classical minimax rate of convergence Rε≍ε4m2m+1 is obtained. Moreover, the optimal constant is the Pinsker constant Pm,c. In the sufficient regime, the number of bits scales as Bε≍ε−22m+1. This level of quantization is just sufficient to preserve the classical minimax rate of convergence, and thus in this regime Rε(m,c,Bε)≍ε4m2m+1. However, the optimal constant degrades to a new constant Pm,c+Qm,c,d, where Qm,c,d is characterized in terms of the solution of a certain variational problem, depending on d=limε→0Bεε22m+1. In the insufficient regime, the number of bits scales as Bε≪ε−22m+1, with however Bε→∞. Under this scaling, the number of bits is insufficient to preserve the unquantized minimax rate of convergence and the quantization error dominates the estimation error. 
We show that the quantized minimax risk in this case satisfies   limε→0Bε2mRε(m,c,Bε)=c2m2mπ2m. Thus, in the insufficient regime the quantized minimax rate of convergence is Bε−2m, with optimal constant as shown above. By using an upper bound for the family of constants Qm,c,d, the three regimes can be combined together to view the risk in terms of a decomposition into estimation error and quantization error. Specifically, we can write   Rε(m,c,Bε)  ≈Pm,cε4m2m+1︸estimation error+c2m2mπ2mBε−2m︸quantization error. When Bε≫ε−22m+1, the estimation error dominates the quantization error and the usual minimax rate and constant are obtained. In the insufficient case Bε≪ε−22m+1, only a slower rate of convergence is achievable. When Bε and ε−22m+1 are comparable, the estimation error and quantization error are on the same order. The threshold ε−22m+1 should not be surprising, given that in classical unquantized estimation the minimax rate of convergence is achieved by estimating the first ε−22m+1 Fourier coefficients, and simply setting the remaining coefficients to zero. This corresponds to selecting a smoothing bandwidth that scales as h≍n−12m+1 with the sample size n. At a high level, our proof strategy integrates elements of minimax theory and source coding theory. In minimax analysis, one computes lower bounds by thinking in Bayesian terms to look for least-favorable priors. In source coding analysis, one constructs worst-case distributions by setting up an optimization problem based on mutual information. Our quantized minimax analysis requires that these approaches be carefully combined to balance the estimation and quantization errors. To show achievability of the lower bounds we establish, we likewise need to construct an estimator and coding scheme together. Our approach is to quantize the block-wise James–Stein estimator, which achieves the classical Pinsker bound. However, our quantization scheme differs from the approach taken in classical rate distortion theory, where the generation of the codebook is determined once the source distribution is known. In our setting, we require the allocation of bits to be adaptive to the data, using more bits for blocks that have larger signal size. We therefore design a quantized estimation procedure that adaptively distributes the communication budget across the blocks. Assuming only a lower bound m0 on the smoothness m and an upper bound c0 on the radius c of the Sobolev space, our quantization–estimation procedure is adaptive to m and c in the usual statistical sense, and is also adaptive to the coding regime. In other words, given a storage budget Bε, the coding procedure achieves the optimal rate and constant for the unknown m and c, operating in the corresponding regime for those parameters. In the following section, we establish some notation, outline our proof strategy and present some simple examples. In Section 3, we state and prove our main result on quantized minimax lower bounds, relegating some of the technical details to an Appendix. In Section 4, we show asymptotic achievability of these lower bounds, using a quantized estimation procedure based on adaptive James–Stein estimation and quantization in blocks, again deferring proofs of technical lemmas to the Supplementary Material. This is followed by a presentation of some results from experiments in Section 5, illustrating the performance and properties of the proposed quantized estimation procedure. 2. 
Quantized estimation and minimax risk Suppose that (X1,…,Xn)∈Xn is a random vector drawn from a distribution Pn. Consider the problem of estimating a functional θn=θ(Pn) of the distribution, assuming θn is restricted to lie in a parameter space Θn. To unclutter some of the notation, we will suppress the subscript n and write θ and Θ in the following, keeping in mind that non-parametric settings are allowed. The subscript n will be maintained for random variables. The minimax ℓ2 risk of estimating θ is then defined as   Rn(Θ)=infθ^nsupθ∈ΘEθ||θ−θ^n||2, where the infimum is taken over all possible estimators θ^n:Xn→Θ that are measurable with respect to the data X1,…,Xn. We will abuse notation by using θ^n to denote both the estimator and the estimate calculated based on an observed set of data. Among numerous approaches to obtaining the minimax risk, the Bayesian method is best aligned with quantized estimation. Consider a prior distribution π(θ) whose support is a subset of Θ. Let δ(X1:n) be the posterior mean of θ given the data X1,…,Xn, which minimizes the integrated risk. Then for any estimator θ^n,   supθ∈ΘEθ||θ−θ^n||2≥∫ΘEθ||θ−θ^n||2dπ(θ)≥∫ΘEθ||θ−δ(X1:n)||2dπ(θ). Taking the infimum over θ^n yields   infθ^nsupθ∈ΘEθ||θ−θ^n||2≥∫ΘEθ||θ−δ(X1:n)||||2dπ(θ)≜Rn(Θ;π). Thus, any prior distribution supported on Θ gives a lower bound on the minimax risk, and selecting the least-favorable prior leads to the largest lower bound provable by this approach. Now consider constraints on the storage or communication cost of our estimate. We restrict to the set of estimators that use no more than a total of Bn bits; that is, the estimator takes at most 2Bn different values. Such quantized estimators can be formulated by the following two-step procedure. First, an encoder maps the data X1:n to an index ϕn(X1:n), where   ϕn:Xn→{1,2,…,2Bn} is the encoding function. The decoder, after receiving or retrieving the index, represents the estimates based on a decoding function  ψn:{1,2,…,2Bn}→Θ mapping the index to a codebook of estimates. All that needs to be transmitted or stored is the Bn-bit-long index, and the quantized estimator θ^n is simply ψn°ϕn, the composition of the encoder and the decoder functions. Denoting by C(θ^n) the storage, in terms of the number of bits, required by an estimator θ^n, the minimax risk of quantized estimation is then defined as   Rn(Θ,Bn)=infθ^n,C(θ^n)≤Bnsupθ∈ΘEθ||θ−θ^n||2 and we are interested in the effect of the constraint on the minimax risk. Once again, we consider a prior distribution π(θ) supported on Θ, and let δ(X1:n) be the posterior mean of θ given the data. The integrated risk can then be decomposed as   ∫ΘEθ||θ−θ^n||2dπ(θ)=E||θ−δ(X1:n)+δ(X1:n)−θ^n||2=E||θ−δ(X1:n)||2+E||δ(X1:n)−θ^n||2, (2.1) where the expectation is with respect to the joint distribution of θ~π(θ) and X1:n|θ~Pθ, and the second equality is due to   E⟨θ−δ(X1:n),δ(X1:n)−θ^n⟩=E(E(⟨θ−δ(X1:n),δ(X1:n)−θ^n⟩|X1:n))=E(⟨E(θ−δ(X1:n)|X1:n),δ(X1:n)−θ^n⟩)=E(⟨0,δ(X1:n)− ^n⟩)=0 using the fact that θ→X1:n→θ^n forms a Markov chain. The first term in the decomposition (2.1) is the Bayes risk Rn(Θ;π). The second term can be viewed as the excess risk due to quantization. Let Tn=T(X1,…,Xn) be a sufficient statistic for θ. The posterior mean can be expressed in terms of Tn and we will abuse notation and write it as δ(Tn). Since the quantized estimator θ^n uses at most Bn bits, we have   Bn≥H(θ^n)≥H(θ^n)−H(θ^n|δ(Tn))=I(θ^n;δ(Tn)), where H and I denote the Shannon entropy and mutual information, respectively. 
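To fix ideas before developing the rate–distortion lower bound, the following toy sketch (ours, purely illustrative; the uniform codebook and the choice of the sample mean as the unquantized estimate are assumptions, not part of the formal development) spells out the encoder/decoder formulation for a one-dimensional normal mean: the encoder ϕn maps the data to one of 2^Bn indices, the decoder ψn maps the index back to a point in the parameter space, and only the index needs to be stored.

```python
import numpy as np

def encode(x, B, lo=-1.0, hi=1.0):
    """Encoder phi_n: map the data to one of 2^B codeword indices.

    The unquantized estimate here is the sample mean and the codebook is a
    uniform grid on [lo, hi]; both choices are illustrative only.
    """
    theta_hat = np.mean(x)                      # unquantized estimate
    grid = np.linspace(lo, hi, 2 ** B)          # 2^B codewords
    return int(np.argmin(np.abs(grid - theta_hat)))

def decode(index, B, lo=-1.0, hi=1.0):
    """Decoder psi_n: map the stored B-bit index back to an estimate."""
    grid = np.linspace(lo, hi, 2 ** B)
    return grid[index]

rng = np.random.default_rng(0)
theta, sigma, n, B = 0.3, 1.0, 200, 4
x = rng.normal(theta, sigma, size=n)
idx = encode(x, B)                              # only this index is stored or transmitted
theta_check = decode(idx, B)                    # quantized estimator psi(phi(x))
print(idx, theta_check, float(np.mean(x)))
```

The composition decode(encode(·)) takes at most 2^Bn distinct values, so storing or transmitting the estimate costs Bn bits; the gap between the decoded value and the unquantized sample mean is exactly the excess risk due to quantization discussed above.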
Now consider the optimization   infP(⋅ | δ(Tn))  E||δ(Tn)−θ˜n||2such that I(θ˜n;δ(Tn))≤Bn, where the infimum is over all conditional distributions P(θ˜n|δ(Tn)). This parallels the definition of the distortion rate function, minimizing the distortion under a constraint on mutual information [12]. Denoting the value of this optimization by Qn(Θ,Bn;π), we can lower bound the quantized minimax risk by   Rn(Θ,Bn)≥Rn(Θ;π)+Qn(Θ,Bn;π). Since each prior distribution π(θ) supported on Θ gives a lower bound, we have   Rn(Θ,Bn)≥supπ{Rn(Θ;π)+Qn(Θ,Bn;π)} and the goal becomes to obtain a least favorable prior for the quantized risk. Before turning to the case of quantized estimation over Sobolev spaces, we illustrate this technique on some simpler, more concrete examples. Example 2.1 [Normal means in a hypercube] Let Xi~N(θ,σ2Id) for i=1,2,…,n. Suppose that σ2 is known and θ∈[−τ,τ]d is to be estimated. We choose the prior π(θ) on θ to be a product distribution with density   π(θ)=∏j=1d32τ3(τ−|θj|)+2. It is shown in [15] that   Rn(Θ;π)≥σ2dnτ2τ2+12σ2/n≥c1σ2dn, where c1=τ2τ2+12σ2. Turning to Qn(Θ,Bn;π), let T(n)=(T1(n),…,Td(n))=E(θ|X1:n) be the posterior mean of θ. In fact, by the independence and symmetry among the dimensions, we know T1,…,Td are independently and identically distributed. Denoting by T0(n) this common distribution, we have   Qn(Θ,Bn;π)≥d⋅q(Bn/d), where q(B) is the distortion rate function for T0(n), i.e., the value of the following problem   infP(T^ | T0(n))  E(T0(n)−T^)2such that I(T^;T0(n))≤B. Now using the Shannon lower bound [8], we get   Qn(Θ,Bn;π)≥d2πe⋅2h(T0(n))⋅2−2Bnd. Note that as n→∞, T0(n) converges to θ in distribution, so there exists a constant c2 independent of n and d such that   Rn(Θ,Bn)≥c1σ2dn+c2d2−2Bnd. This lower bound intuitively shows the risk is regulated by two factors, the estimation error and the quantization error; whichever is larger dominates the risk. The scaling behavior of this lower bound (ignoring constants) can be achieved by first quantizing each of the d intervals [−τ,τ] using Bn/d bits each, and then mapping the Maximum likelihood estimator (MLE) to its closest codeword. Example 2.2 [Gaussian sequences in Euclidean balls] In the example shown above, the lower bound is tight only in terms of the scaling of the key parameters. In some instances, we are able to find an asymptotically tight lower bound for which we can show achievability of both the rate and the constants. Estimating the mean vector of a Gaussian sequence with an ℓ2 norm constraint on the mean is one of such case, as we showed in previous work [27]. Specifically, let Xi~N(θi,σn2) for i=1,2,…,n, where σn2=σ2/n. Suppose that the parameter θ=(θ1,…,θn) lies in the Euclidean ball Θn(c)={θ:∑i=1nθi2≤c2}. Furthermore, suppose that Bn=nB. Then using the prior θi~N(0,c2), it can be shown that   liminfn→∞Rn(Θn(c),Bn)≥σ2c2σ2+c2+c42−2Bσ2+c2. The asymptotic estimation error σ2c2/(σ2+c2) is the well-known Pinsker bound for the Euclidean ball case. As shown in [27], an explicit quantization scheme can be constructed that asymptotically achieves this lower bound, realizing the smallest possible quantization error c42−2B/(σ2+c2) for a budget of Bn=nB bits. The Euclidean ball case is clearly relevant to the Sobolev ellipsoid case, but new coding strategies and proof techniques are required. In particular, as will be made clear in the sequel, we will use an adaptive allocation of bits across blocks of coefficients, using more bits for blocks that have larger estimated signal size. 
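To see the two competing terms of Example 2.2 numerically, the short sketch below evaluates the asymptotic lower bound σ2c2/(σ2+c2)+c42−2B/(σ2+c2) for a few per-coordinate budgets B; the numerical values of σ2, c2 and B are chosen only for illustration.

```python
import numpy as np

def euclidean_ball_bound(sigma2, c2, B):
    """Asymptotic lower bound of Example 2.2: estimation term plus quantization term."""
    est = sigma2 * c2 / (sigma2 + c2)                 # Pinsker bound for the Euclidean ball
    quant = (c2 ** 2) * 2.0 ** (-2 * B) / (sigma2 + c2)
    return est, quant

sigma2, c2 = 1.0, 1.0
for B in [0.5, 1, 2, 4, 8]:
    est, quant = euclidean_ball_bound(sigma2, c2, B)
    print(f"B = {B:>3}: estimation {est:.3f}, quantization {quant:.4f}, total {est + quant:.3f}")
```

For small B the quantization term dominates, while for moderate B the bound is already essentially the Pinsker bound σ2c2/(σ2+c2), mirroring the insufficient and (over-)sufficient regimes described for the Sobolev case below.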
Moreover, determination of the optimal constants requires a detailed analysis of the worst-case prior distributions and the solution of a series of variational problems.

3. Quantized estimation over Sobolev spaces
Recall that the Sobolev space of order m and radius c is defined by   W(m,c)={f:[0,1]→ℝ:f(m−1) is absolutely continuous and ∫01(f(m)(x))2dx≤c2}. The periodic Sobolev space is defined by   W˜(m,c)={f∈W(m,c):f(j)(0)=f(j)(1), j=0,1,…,m−1}. (3.1) The white noise model (1.1) is asymptotically equivalent to making n equally spaced observations along the sample path, Yi=f(i/n)+σϵi, where ϵi~N(0,1) [4]. In this formulation, the noise level in (1.1) scales as ϵ2=σ2/n, and the rate of convergence takes the familiar form n−2m2m+1, where n is the number of observations. To carry out quantized estimation, we now require an encoder   ϕε:ℝ[0,1]→{1,2,…,2Bε}, which is a function applied to the sample path X(t). The decoding function then takes the form   ψε:{1,2,…,2Bε}→ℝ[0,1] and maps the index to a function estimate. As in the previous section, we write the composition of the encoder and the decoder as f^ε=ψε°ϕε, which we call the quantized estimator. The communication or storage C(f^ε) required by this quantized estimator is no more than Bε bits. To recast quantized estimation in terms of an infinite sequence model, let (φj)j=1∞ be the trigonometric basis, and let   θj=∫01φj(t)f(t)dt, j=1,2,…, be the Fourier coefficients. It is well known [22] that f=∑j=1∞θjφj belongs to W˜(m,c) if and only if the Fourier coefficients θ belong to the Sobolev ellipsoid defined as   Θ(m,c)={θ∈ℓ2:∑j=1∞aj2θj2≤c2π2m}, (3.2) where   aj={jm,for even j,(j−1)m,for odd j. Although this is the standard definition of a Sobolev ellipsoid, for the rest of the paper we will set aj=jm, j=1,2,… for convenience of analysis. All of the results hold for both definitions of aj. Also note that (3.2) actually gives a more general definition, since m is no longer assumed to be an integer, as it is in (3.1). Expanding with respect to the same orthonormal basis, the observed path X(t) is converted into an infinite Gaussian sequence   Yj=∫01φj(t)dX(t), j=1,2,… with Yj~N(θj,ε2). For an estimator (θ^j)j=1∞ based on (Yj)j=1∞, an estimate of f is obtained by   f^(x)=∑j=1∞θ^jφj(x) with squared error ||f^−f||22=||θ^−θ||22. In terms of this standard reduction, the quantized minimax risk is thus reformulated as   Rε(m,c,Bε)=infθ^ε,C(θ^ε)≤Bεsupθ∈Θ(m,c)Eθ||θ−θ^ε||22. (3.3) To state our result, we need to define the value of the following variational problem:   Vm,c,d≜max(σ2,x0)∈F(m,c,d)∫0x0σ2(x)σ2(x)+1dx+x0exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0), (3.4) where the feasible set F(m,c,d) is the collection of increasing functions σ2(x) and values x0 satisfying   ∫0x0x2mσ2(x)dx≤c2 and σ4(x)σ2(x)+1≥exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0) for all x≤x0. The significance and interpretation of the variational problem will become apparent as we outline the proof of this result. Theorem 3.1 Let Rε(m,c,Bε) be defined as in (3.3) for m>0 and c>0. (i) If Bεε22m+1→∞ as ε→0, then   liminfε→0ε−4m2m+1Rε(m,c,Bε)≥Pm,c, where Pm,c is Pinsker's constant defined in (1.2). (ii) If Bεε22m+1→d for some constant d as ε→0, then   liminfε→0ε−4m2m+1Rε(m,c,Bε)≥Pm,c+Qm,c,d=Vm,c,d, where Vm,c,d is the value of the variational problem (3.4). (iii) If Bεε22m+1→0 and Bε→∞ as ε→0, then   liminfε→0Bε2mRε(m,c,Bε)≥c2m2mπ2m.
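Before discussing the three regimes in detail, a rough numerical illustration may be helpful. The sketch below evaluates the estimation and quantization terms of the decomposition from Section 1, Pm,cε4m2m+1+c2m2mπ2mBε−2m, for budgets Bε that are small, comparable and large relative to ε−22m+1. The constants are transcribed from (1.2) and Lemma 3.2, and the particular values of m, c and ε are illustrative only.

```python
import numpy as np

def pinsker_constant(m, c):
    # P_{m,c} as we read (1.2): (c^2 (2m+1) / pi^{2m})^{1/(2m+1)} * (m/(m+1))^{2m/(2m+1)}
    return (c**2 * (2*m + 1) / np.pi**(2*m))**(1.0/(2*m + 1)) * (m/(m + 1.0))**(2.0*m/(2*m + 1))

def risk_terms(eps, B, m, c):
    """Estimation and quantization terms of the displayed risk decomposition."""
    estimation = pinsker_constant(m, c) * eps**(4.0*m/(2*m + 1))
    quantization = (c**2 * m**(2*m) / np.pi**(2*m)) * B**(-2.0*m)
    return estimation, quantization

m, c, eps = 2, 1.0, 1e-3
n_eff = eps**(-2.0/(2*m + 1))   # ~ number of coefficients used at the classical minimax rate
for B in [0.1*n_eff, n_eff, 3*n_eff, 100*n_eff]:
    est, qua = risk_terms(eps, B, m, c)
    print(f"B / eps^(-2/(2m+1)) = {B/n_eff:6.1f}: estimation {est:.2e}, quantization {qua:.2e}")
```

With these particular values, the two terms are of the same order precisely when Bε is a small multiple of ε−22m+1, which is the sufficient regime of part (ii).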
In the first regime where the number of bits Bε is much greater than ε−22m+1, we recover the same convergence result as in Pinsker's theorem, in terms of both convergence rate and leading constant. The proof of the lower bound for this regime can directly follow the proof of Pinsker's theorem, since the set of estimators considered in our minimax framework is a subset of all possible estimators. In the second regime where we have 'just enough' bits to preserve the rate, we suffer a loss in terms of the leading constant. In this 'Goldilocks regime', the optimal rate ε4m2m+1 is achieved, but the constant in front of the rate is Pinsker's constant Pm,c plus a positive quantity Qm,c,d determined by the variational problem. While the solution to this variational problem does not appear to have an explicit form, it can be computed numerically. We discuss this term at length in the sequel, where we explain the origin of the variational problem, compute the constant numerically and approximate it from above and below. The constants Pm,c and Qm,c,d are shown graphically in Fig. 1. Note that the parameter d can be thought of as the average number of bits per coefficient used by an optimal quantized estimator, since ε−22m+1 is asymptotically the number of coefficients needed to estimate at the classical minimax rate. As shown in Fig. 1, the constant for quantized estimation quickly approaches the Pinsker constant as d increases; when d=3, the two are already very close.
Fig. 1. The constants Pm,c+Qm,c,d as a function of quantization level d in the sufficient regime, where Bεε22m+1→d. The parameter d can be thought of as the average number of bits per coefficient used by an optimal quantized estimator, because ε−22m+1 is asymptotically the number of coefficients needed to estimate at the classical minimax rate. Here, we take m=2 and c2/π2m=1. The curve indicates that with only two bits per coefficient, optimal quantized minimax estimation degrades by less than a factor of 2 in the constant. With three bits per coefficient, the constant is very close to the classical Pinsker constant.
In the third regime, where the communication budget is insufficient for the estimator to achieve the optimal rate, we obtain a suboptimal rate which no longer depends explicitly on the noise level ε of the model. In this regime, quantization error dominates, and the risk decays at a rate of Bε−2m no matter how fast ε approaches zero, as long as Bε≪ε−22m+1. Here the analog of Pinsker's constant takes a very simple form. Proof of Theorem 3.1. Consider a Gaussian prior distribution on θ=(θj)j=1∞ with θj~N(0,σj2) for j=1,2,…, in terms of parameters σ2=(σj2)j=1∞ to be specified later. One requirement for the variances is   ∑j=1∞aj2σj2≤c2π2m.
We denote this prior distribution by π(θ;σ2) and shown in Section A that it is asymptotically concentrated on the ellipsoid Θ(m,c). Under this prior the model is   θj~N(0,σj2)Yj|θj~N(θj,ε2), j=1,2,… and the marginal distribution of Yj is thus N(0,σj2+ε2). Following the strategy outlined in Section 2, let δ denote the posterior mean of θ given Y under this prior, and consider the optimization   inf  E||δ−θ˜||2such that I(δ;θ˜)≤Bϵ, where the infimum is over all distributions on θ˜ such that θ→Y→θ˜ forms a Markov chain. Now, the posterior mean satisfies δj=γjYj, where γj=σj2/(σj2+ϵ2). Note that the Bayes risk under this prior is   E||θ−δ||22=∑j=1∞σj2ε2σj2+ε2. Define   μj2≜E(δj−θ˜j)2. Then the classical rate distortion argument [8] gives that   I(δ;θ˜)≥∑j=1∞I(γjYj;θ˜j) ≥∑j=1∞12log+(γj2(σj2+ε2)μj2) =∑j=1∞12log+(σj4μj2(σj2+ε2)), where log+(x)=max(logx,0). Therefore, the quantized minimax risk is lower bounded by   Rε(m,c,Bε)=infθ^ε,C(θ^ε)≤Bεsupθ∈Θ(m,c)E||θ−θ^ε||2≥Vε(Bε,m,c)(1+o(1)), where Vε(Bε,m,c) is the value of the optimization   maxσ2minμ2  ∑j=1∞μj2+∑j=1∞σj2ε2σj2+ε2such that ∑j=1∞12log+(σj4μj2(σj2+ε2))≤Bε∑j=1∞aj2σj2≤c2π2m (P1) and the (1+o(1)) deviation term is analyzed in the Supplementary Material. Observe that the quantity Vε(Bε,m,c) can be upper and lower bounded by   max{Rε(m,c),Qε(m,c,Bε)}≤Vε(m,c,Bε)≤Rε(m,c)+Qε(m,c,Bε), (3.5) where the estimation error term Rε(m,c) is the value of the optimization   maxσ2 ∑j=1∞σj2ε2σj2+ε2such that ∑j=1∞aj2σj2≤c2π2m (R1) and the quantization error term Qε(m,c,Bε) is the value of the optimization   maxσ2minμ2∑j=1∞μj2such that∑j=1∞12log+(σj4μj2(σj2+ε2))≤Bε∑j=1∞aj2σj2≤c2π2m. (Q1) The following results specify the leading order asymptotics of these quantities. Lemma 3.1 As ε→0,   Rε(m,c)=Pm,cε4m2m+1(1+o(1)). Lemma 3.2 As ε→0,   Qε(m,c,Bε)≤c2m2mπ2mBε−2m(1+o(1)). (3.6) Moreover, if Bεε22m+1→0 and Bε→∞,   Qε(m,c,Bε)=c2m2mπ2mBε−2m(1+o(1)). This yields the following closed form upper bound. Corollary 3.1 Suppose that Bε→∞ and ε→0. Then   Vε(m,c,Bε)≤(Pm,cε4m2m+1+c2m2mπ2mBε−2m)(1+o(1)). (3.7) In the insufficient regime Bεε22m+1→0 and Bε→∞ as ε→0, equation (3.5) and Lemma 3.2 show that   Vε(m,c,Bε)=c2m2mπ2mBε−2m(1+o(1)). Similarly, in the over sufficient regime Bεε22m+1→∞ as ε→0, we conclude that   Vε(m,c,Bε)=Pm,cε4m2m+1(1+o(1)). We now turn to the sufficient regime Bεε22m+1→d. We begin by making three observations about the solution to the optimization (P1). First, we note that the series (σj2)j=1∞ that solves (P1) can be assumed to be decreasing. If (σj2) were not in decreasing order, we could rearrange it to be decreasing, and correspondingly rearrange (μj2), without violating the constraints or changing the value of the optimization. Secondly, we note that given (σj2), the optimal (μj2) is obtained by the ‘reverse water-filling’ scheme [8]. Specifically, there exists η>0 such that   μj2={η if σj4σj2+ε2≥ησj4σj2+ε2 otherwise, where η is chosen so that   12∑j=1∞log+(σj4μj2(σj2+ε2))≤Bε. Thirdly, there exists an integer J>0 such that the optimal series (σj2) satisfies   σj4σj2+ε2≥η, for j=1,…,J and σj2=0, for j>J, where η is the ‘water-filling level’ for (μj2) (see [8]). Using these three observations, the optimization (P1) can be reformulated as   maxσ2,JJη+∑j=1Jσj2ε2σj2+ε2such that 12∑j=1Jlog+(σj4η(σj2+ε2))=Bε∑j=1Jaj2σj2≤c2π2m(σj2) is decreasing and σJ4σJ2+ε2≥η.  (P2) To derive the solution to (P2), we use a continuous approximation of σ2, writing   σj2=σ2(jh)h2m+1, where h is the bandwidth to be specified and σ2(⋅) is a function defined on (0,∞). 
The constraint that ∑j=1∞aj2σj2≤c2π2m becomes the integral constraint [19]   ∫0∞x2mσ2(x)dx≤c2π2m. We now set the bandwidth so that h2m+1=ε2. This choice of bandwidth will balance the two terms in the objective function and thus gives the hardest prior distribution. Applying the above three observations under this continuous approximation, we transform problem (P2) to the following optimization:   maxσ2,x0x0η+∫0x0σ2(x)σ2(x)+1dxsuch that ∫0x012log+(σ4(x)η(σ2(x)+1))=d∫0x0x2mσ2(x)dx≤c2π2mσ2(x) is decreasing and σ4(x)σ2(x)+1≥η for all x≤x0. (P3) Note that here we omit the convergence rate h2m=ε4m2m+1 in the objective function. The asymptotic equivalence between )P2) and (P3) can be established by a similar argument to Theorem 3.1 in [9]. Solving the first constraint for η yields    maxσ2,x0∫0x0σ2(x)σ2(x)+1dx+x0exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0)such that ∫0x0x2mσ2(x)dx≤c2π2mσ2(x) is decreasing σ4(x)σ2(x)+1≥exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0) (P4) for all x≤x0. The following is proved using a variational argument in the Supplementary Material. Lemma 3.3 The solution to (P4) satisfies   1(σ2(x)+1)2+exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0)σ2(x)+2σ2(x)(σ2(x)+1)=λx2m for some λ>0. Fixing x0, the lemma shows that by setting   α=exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0) we can express σ2(x) implicitly as the unique positive root of a third-order polynomial in y,   λx2my3+(2λx2m−α)y2+(λx2m−3α−1)y−2α. This leads us to an explicit form of σ2(x) for a given value α. However, note that α still depends on σ2(x) and x0, so the solution σ2(x) might not be compatible with α and x0. We can either search through a grid of values of α and x0, or, more efficiently, use an iterative method to find the pair of values that gives us the solution. We omit the details on how to calculate the values of the optimization, as it is not main purpose of this article. To summarize, in the regime Bεε22m+1→d as ε→0, we obtain   Vε(m,c,Bε)=(Pm,c+Qm,c,d)ε4m2m+1(1+o(1)), where we denote by Pm,c+Qm,c,d the values of the optimization (P4). 4. Achievability In this section, we show that the lower bounds in Theorem 3.1 are achievable by a quantized estimator using a random coding scheme. The basic idea of our quantized estimation procedure is to conduct block-wise estimation and quantization together, using a quantized form of James–Stein estimator. Before we present a quantized form of the James–Stein estimator, let us first consider a class of simple procedures. Suppose that θ^=θ^(X) is an estimator of θ∈Θ(m,c) without quantization. We assume that θ^∈Θ(m,c), as projection always reduces mean squared error. To design a B-bit quantized estimator, let Θˇ be the optimal δ-covering of the parameter space Θ(m,c) such that |Θˇ|≤2B, that is,   δ=δ(B)=infΘ⌣⊂Θ:|Θ⌣|≤2Bsupθ∈Θinfθ′∈Θ⌣||θ−θ′||. The quantized estimator is then defined to be   θˇ=θˇ(X)=argminθ′∈Θˇ||θ^(X)−θ′||. Now the mean squared error satisfies   Eθ||θˇ−θ||2=Eθ||θˇ−θ^+θ^−θ||2≤2Eθ||θ^−θ||2+2Eθ||θˇ−θ^||2≤2supθ′Eθ′||θ^−θ′||2+2δ(B)2. If we pick θ^ to be a minimax estimator for Θ, the first term above gives the minimax risk for estimating θ in the parameter space Θ. The second term is closely related to the metric entropy of the parameter space Θ(m,c). In fact, for the Sobolev ellipsoid Θ(m,c), it is shown in [9] that δ(B)2=c2m2mπ2mB−2m(1+o(1)) as B→∞. Thus, with an extra constant factor of 2, the mean squared error of this quantized estimator is decomposed into the minimax risk for Θ and an error term due to quantization. 
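A minimal sketch of this two-step recipe is given below; note that the random codebook used here is only a crude stand-in for the optimal δ-covering (its covering radius is much larger than δ(B)), and all names and sizes are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest_codeword(theta_hat, codebook):
    """Quantization step: map the unquantized estimate to the closest codeword."""
    dists = np.linalg.norm(codebook - theta_hat, axis=1)
    return codebook[np.argmin(dists)]

# Toy setup: p coefficients observed with noise level eps, storage budget of B bits.
p, eps, B = 20, 0.1, 12
theta = 1.0 / np.arange(1, p + 1)**2                  # a smooth, rapidly decaying coefficient vector
y = theta + eps * rng.standard_normal(p)              # Y_j ~ N(theta_j, eps^2)

theta_hat = y                                         # unquantized estimate (kept simple here)
codebook = rng.uniform(0, 1, size=(2**B, p))          # 2^B codewords; a real scheme would use a delta-covering
theta_check = nearest_codeword(theta_hat, codebook)   # the B-bit quantized estimator

print("estimation error  :", float(np.sum((theta_hat - theta)**2)))
print("quantization error:", float(np.sum((theta_check - theta_hat)**2)))
print("total error       :", float(np.sum((theta_check - theta)**2)))
```

The printed totals illustrate the bound above, with risk at most twice the estimation error plus twice the squared covering radius; with a poor codebook the quantization term dominates, which is why the construction of good, adaptive codebooks occupies the remainder of this section.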
In addition to the fact that this procedure does not achieve the exact lower bound of the minimax risk for the constrained estimation problem, it is not clear how such an ε-net can be generated. In what follows, we will describe a quantized estimation procedure that we will show achieves the lower bound with the exact constants, and that also adapts to the unknown parameters of the Sobolev space. We begin by defining the block system to be used, which is usually referred to as the weakly geometric system of blocks [22]. Let Nε=⌊1/ε2⌋ and ρε=(log(1/ε))−1. Let J1,…,JK be a partition of the set {1,…,Nε} such that   ∪k=1KJk={1,…,Nε}, Jk1∩Jk2=Ø for k1≠k2and  min{j:j∈Jk}>max{j:j∈Jk−1}. Let Tk be the cardinality of the kth block and suppose that T1,…,Tk satisfy   T1=⌈ρε−1⌉=⌈log(1/ε)⌉,T2=⌊T1(1+ρε)⌋,⋮TK−1=⌊T1(1+ρε)K−2⌋,TK=Nε−∑k=1K−1Tk. (4.1) Then K≤Clog2(1/ε) (see Lemma A.4). For an infinite sequence x∈ℓ2, denote by x(k) the vector (xj)j∈Jk∈ℝTk. We also write jk=∑l=1k−1Tl+1, which is the smallest index in block Jk. The weakly geometric system of blocks is defined such that the size of the blocks does not grow too quickly (the ratio between the sizes of the neighboring two blocks goes to 1 asymptotically), and that the number of the blocks is on the logarithmic scale with respect to 1/ε ( K≲log2(1/ε)). See Lemma A.4. We are now ready to describe the quantized estimation scheme. We first give a high-level description of the scheme and then the precise specification. In contrast to rate distortion theory, where the codebook and allocation of the bits are determined once the source distribution is known, here the codebook and allocation of bits are adaptive to the data—more bits are used for blocks having larger signal size. The first step in our quantization scheme is to construct a ‘base code’ of 2Bε randomly generated vectors of maximum block length TK, with N(0,1) entries. The base code is thought of as a 2Bε×TK random matrix Z; it is generated before observing any data, and is shared between the sender and receiver. After observing data (Yj), the rows of Z are apportioned to different blocks k=1,…,K, with more rows being used for blocks having larger estimated signal size. To do so, the norm ||Y(k)|| of each block k is first quantized as a discrete value Sˇk. A subcodebook Zk is then constructed by normalizing the appropriate rows and the first Tk columns of the base code, yielding a collection of random points on the unit sphere STk−1. To form a quantized estimate of the coefficients in the block, the codeword Zˇ(k)∈Zk having the smallest angle to Y(k) is then found. The appropriate indices are then transmitted to the receiver. To decode and reconstruct the quantized estimate, the receiver first recovers the quantized norms (Sˇk), which enables reconstruction of the subdivision of the base code that was used by the encoder. After extracting for each block k the appropriate row of the base code, the codeword Zˇ(k) is reconstructed and a James–Stein type estimator is then calculated. The quantized estimation scheme is detailed below. Step 1. Base code generation. 1.1. Generate codebook Sk={Tkε2+iε2: i=0,1,…,sk}, where sk=⌈ε−2c(jkπ)−m⌉ for k=1,…,K. 1.2. Generate base code Z, a 2B×TK matrix with i.i.d. N(0,1) entries. (Sk) and Z are shared between the encoder and the decoder, before seeing any data. Step 2. Encoding. 2.1. Encoding block radius. For k=1,…,K, encode Sˇk=argmin{|s−Sk|:s∈Sk}, where   Sk={Tkε2if ||Y(k)||<Tkε2Tkε2+c(jkπ)−mif ||Y(k)||>Tkε2+c(jkπ)−m||Y(k)||otherwise. 2.2. Allocation of bits. 
Let (b˜k)k=1K be the solution to the optimization   minb¯∑k=1K(Sˇk2−Tkε2)2Sˇk2⋅2−2b¯ksuch that ∑k=1KTkb¯k≤B, b¯k≥0. (4.2) 2.3. Encoding block direction. Form the data-dependent codebook as follows. Divide the rows of Z into blocks of sizes 2⌈T1b˜1⌉,…,2⌈TKb˜K⌉. Based on the kth block of rows, construct the data-dependent codebook Z˜k by keeping only the first Tk entries and normalizing each truncated row; specifically, the jth row of Z˜k is given by   Z˜k,j=Zi,1:Tk/||Zi,1:Tk||∈STk−1, where i is the appropriate row of the base code Z and Zi,1:t denotes the first t entries of the row vector. A graphical illustration is shown below in Fig. 2. With this data-dependent codebook, encode   Zˇ(k)=argmax{⟨z,Y(k)⟩:z∈Z˜k} for k=1,…,K. Step 3. Transmission. Transmit or store (Sˇk)k=1K and (Zˇ(k))k=1K by their corresponding indices. Step 4. Decoding and Estimation. 4.1. Recover (Sˇk) based on the transmitted or stored indices and the common codebook (Sk). 4.2. Solve (4.2) and get (b˜k). Reconstruct (Z˜k) using Z and (b˜k). 4.3. Recover (Zˇ(k)) based on the transmitted or stored indices and the reconstructed codebook (Z˜k). 4.4. Estimate θ(k) by   θˇ(k)=Sˇk2−Tkε2Sˇk1−2−2b˜k⋅Zˇ(k). 4.5. Estimate the entire vector θ by concatenating the θˇ(k) vectors and padding with zeros; thus,   θˇ=(θˇ(1),…,θˇ(K),0,0,…). The following theorem establishes the asymptotic optimality of this quantized estimator.
Fig. 2. An illustration of the data-dependent codebook. The big matrix represents the base code Z, and the shaded areas are (Z˜k), sub-matrices of size Tk×2⌈Tkb˜k⌉ with rows normalized.
Theorem 4.1 Let θˇ be the quantized estimator defined above. (i) If Bε22m+1→∞, then   limε→0ε−4m2m+1supθ∈Θ(m,c)E||θ−θˇ||2=Pm,c. (ii) If Bε22m+1→d for some constant d as ε→0, then   limε→0ε−4m2m+1supθ∈Θ(m,c)E||θ−θˇ||2=Pm,c+Qm,c,d. (iii) If Bε22m+1→0 and B(log(1/ε))−3→∞, then   limε→0B2msupθ∈Θ(m,c)E||θ−θˇ||2=c2m2mπ2m. The expectations are with respect to the random quantized estimation scheme Q and the distribution of the data. We pause to make several remarks on this result before outlining the proof. Remark The total number of bits used by this quantized estimation scheme is   ∑k=1K⌈Tkb˜k⌉+∑k=1Klog⌈ε−2c(jkπ)−m⌉≤∑k=1K⌈Tkb˜k⌉+∑k=1Klog⌈ε−2c⌉≤B+K+2Kρε−1+Klog⌈c⌉=B+O((log(1/ε))3), where we use the fact that K≲log2(1/ε2) (see Lemma A.4). Therefore, as long as B(log(1/ε))−3→∞, the total number of bits used is asymptotically no more than B, the given communication budget. Remark 4.2 The quantized estimation scheme does not make essential use of the parameters of the Sobolev space, namely the smoothness m and the radius c. The only exception is that in Step 1.1 the size of the codebook Sk depends on m and c. However, suppose that we know a lower bound on the smoothness m, say m≥m0, and an upper bound on the radius c, say c≤c0. By replacing m and c by m0 and c0, respectively, we make the codebook independent of the parameters. We shall assume m0>1/2, which leads to continuous functions. This modification does not, however, significantly increase the number of bits; in fact, the total number of bits is still B+O(ρε−3). Thus, we can easily make this quantized estimator minimax adaptive to the class of Sobolev ellipsoids {Θ(m,c):m≥m0, c≤c0}, as long as B grows faster than (log(1/ε))3.
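Before stating this adaptivity property formally, the sketch below illustrates two ingredients of the scheme in code: the weakly geometric blocks of (4.1) and the bit allocation (4.2), solved here by a standard Lagrangian water-filling with bisection on the multiplier. This is our own illustrative implementation (with made-up block norms Sˇk), not the authors' code.

```python
import numpy as np

def weakly_geometric_blocks(eps):
    """Block sizes T_1,...,T_K of the weakly geometric system (4.1)."""
    N = int(np.floor(eps**-2))
    rho = 1.0 / np.log(1.0 / eps)
    T1 = int(np.ceil(np.log(1.0 / eps)))
    sizes = [T1]
    while sum(sizes) + int(np.floor(T1 * (1 + rho)**len(sizes))) < N:
        sizes.append(int(np.floor(T1 * (1 + rho)**len(sizes))))
    sizes.append(N - sum(sizes))                      # the last block absorbs the remainder
    return sizes

def allocate_bits(w, T, B):
    """Solve (4.2): minimize sum_k w_k 2^(-2 b_k) subject to sum_k T_k b_k <= B, b_k >= 0."""
    w, T = np.asarray(w, float), np.asarray(T, float)

    def b_of(lam):                                     # KKT solution for a given multiplier lam
        return np.maximum(0.0, 0.5 * np.log2(2.0 * np.log(2.0) * w / (lam * T)))

    lo, hi = 1e-12, 1e12                               # bisect on lam until the budget is met
    while hi / lo > 1.0 + 1e-9:
        lam = np.sqrt(lo * hi)
        lo, hi = (lam, hi) if np.sum(T * b_of(lam)) > B else (lo, lam)
    return b_of(hi)

eps, B = 0.05, 200
T = np.array(weakly_geometric_blocks(eps)[:6])                 # first few blocks, for display
S_check = np.sqrt(T * eps**2 + 0.5**np.arange(len(T)))         # toy quantized block norms
w = (S_check**2 - T * eps**2)**2 / S_check**2                  # weights appearing in (4.2)
b = allocate_bits(w, T, B)
print("block sizes     :", T.tolist())
print("bits/coefficient:", np.round(b, 2).tolist(), " total bits:", round(float(np.sum(T * b)), 1))
```

In the actual procedure the weights are computed from the encoded block norms Sˇk of Step 2.1, and the resulting b˜k determine how many rows of the base code Z are devoted to each block.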
More formally, we have Corollary 4.1 Suppose that Bε satisfies Bε(log(1/ε))−3→∞. Let θˇ′ be the quantized estimator with the modification described above, which does not assume knowledge of m and c. Then for m≥m0 and c≤c0,   limε→0 [supθ∈Θ(m,c)E||θ−θˇ′||2] / [infθ^,C(θ^)≤Bsupθ∈Θ(m,c)E||θ−θ^||2] = 1, where the expectation in the numerator is with respect to the data and the randomized coding scheme, while the expectation in the denominator is only with respect to the data. Remark 4.3 When B grows at a rate comparable to or slower than (log(1/ε))3, the lower bound is still achievable, just no longer by the quantized estimator we described above. The main reason is that when B does not grow faster than (log(1/ε))3, the block size T1=⌈log(1/ε)⌉ is too large. The blocking needs to be modified to get achievability in this case. Remark 4.4 In classical rate distortion [8,12], the probabilistic method applied to a randomized coding scheme shows the existence of a code achieving the rate distortion bounds. Comparing to Theorem 3.1, we see that the expected risk, averaged over the randomness in the codebook, similarly achieves the quantized minimax lower bound. However, note that the average over the codebook is inside the supremum over the Sobolev space, implying that the code achieving the bound may vary over the ellipsoid. In other words, while the coding scheme generates a codebook that is used for different θ, it is not known whether there is one code generated by this randomized scheme that is 'universal', and achieves the risk lower bound with high probability over the ellipsoid. The existence or non-existence of such 'universal codes' is an interesting direction for further study. Remark We have so far dealt with the periodic case, i.e., functions in the periodic Sobolev space W˜(m,c) defined in (3.1). For the Sobolev space W(m,c), where the functions are not necessarily periodic, the lower bound given in Theorem 3.1 still holds, since W˜(m,c) is a subset of the larger class W(m,c). To extend the achievability result to W(m,c), we again need to relate W(m,c) to an ellipsoid. Nussbaum [17] shows using spline theory that the non-periodic space can actually be expressed as an ellipsoid, where the length of the jth principal axis scales as (π2j)m asymptotically. Based on this link between W(m,c) and the ellipsoid, the techniques used here to show achievability apply, and since the principal axes scale as in the periodic case, the convergence rates remain the same. Proof of Theorem 4.1 We now sketch the proof of Theorem 4.1, deferring the full details to Section A. To provide only an informal outline of the proof, we shall write A1≈A2 as a shorthand for A1=A2(1+o(1)), and A1≲A2 for A1≤A2(1+o(1)), without specifying here what these o(1) terms are. To upper bound the risk E||θˇ−θ||2, we adopt the following sequence of approximations and inequalities. First, we discard the components whose index is greater than N and show that   E||θˇ−θ||2≈E∑k=1K||θˇ(k)−θ(k)||2. Since Sˇk is close enough to Sk, we can then safely replace θˇ(k) by θ^(k)=Sk2−Tkε2Sk1−2−2b˜k⋅Zˇ(k) and obtain   ≈E∑k=1K||θ^(k)−θ(k)||2. Writing λk=Sk2−Tkε2Sk2, we further decompose the risk into   =E∑k=1K(||θ^(k)−λkY(k)||2+||λkY(k)−θ(k)||2+2⟨θ^(k)−λkY(k),λkY(k)−θ(k)⟩). Conditioning on the data Y and taking the expectation with respect to the random codebook yields   ≲E∑k=1K((Sk2−Tkε2)2Sk22−2b˜k+||λkY(k)−θ(k)||2).
By two oracle inequalities upper bounding the expectations with respect to the data, and the fact that b˜ is the solution to (4.2),   ≲minb∈Πblk(B)∑k=1K(||θ(k)||4||θ(k)||2+Tkε22−2b¯k+||θ(k)||2Tkε2||θ(k)||2+Tkε2). Showing that the block-wise constant oracles are almost as good as the monotone oracle, we get for some B′≈B  ≲minb∈Πmon(B′), ω∈Ωmon∑j=1N(θj4θj2+ε22−2bj+(1−ωj)2θj2+ωj2ε2), where Πblk(B), Πmon(B) are the classes of block-wise constant and monotone allocations of the bits defined in (A.8), (A.9) and Ωmon is the class of monotone weights defined in (A.11). The proof is then completed by Lemma A.9, showing that the last quantity is equal to Vε(m,c,B).

5. Simulations
Here we illustrate the performance of the proposed quantized estimation scheme. We use the function   f(x)=x(1−x)sin(2.1πx+0.3), 0≤x≤1, which we shall refer to as the 'damped Doppler function', shown in Fig. 3 (the gray lines). Note that the value 0.3 differs from the value 0.05 in the usual Doppler function used to illustrate spatial adaptation of methods such as wavelets. Since we do not address spatial adaptivity in this article, we 'slow' the oscillations of the Doppler function near zero in our illustrations.
Fig. 3. The damped Doppler function (solid) and typical realizations of the estimators under different noise levels (n=500, 5000 and 50000). Three estimators are used: the block-wise James–Stein estimator (dashed black) and two quantized estimators with budgets of 5 bits (dashed red) and 30 bits (dashed blue).
We use this f as the underlying true mean function and generate our data according to the corresponding white noise model (1.1),   dX(t)=f(t)dt+εdW(t), 0≤t≤1. We apply the block-wise James–Stein estimator, as well as the proposed quantized estimator with different communication budgets. We also vary the noise level ε and, equivalently, the effective sample size n=1/ε2. We first show in Fig. 3 some typical realizations of these estimators on data generated under different noise levels (n=500, 5000 and 50000, respectively). To keep the plots succinct, we show only the true function, the block-wise James–Stein estimates and quantized estimates using total bit budgets of 5 and 30 bits. We observe, in the first plot, that both quantized estimates deviate from the true function, and so does the block-wise James–Stein estimate. This is the case where the noise is relatively large and any estimate, quantized or not, performs poorly, no matter how large a budget is given. Both 5 bits and 30 bits appear to be 'sufficient/over sufficient' here. In the second plot, the block-wise James–Stein estimate is close to the quantized estimate with a budget of 30 bits, whereas with a budget of 5 bits it fails to capture the fluctuations of the true function. Thus, a budget of 30 bits is still 'sufficient', but 5 bits apparently becomes 'insufficient'. In the third plot, the block-wise James–Stein estimate gives a better fit than the two quantized estimates, as both budgets become 'insufficient' to achieve the optimal risk. Next, in Fig. 4 we plot the risk as a function of sample size n, averaging over 2000 simulations. Note that the bottom plot is just the first plot on a log–log scale.
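Readers who wish to reproduce the qualitative behaviour can start from the following sketch, which computes Fourier coefficients of the damped Doppler function (the formula is transcribed literally from the display above and may be an imperfect rendering of the original), generates observations from the equivalent sequence model with ε2=1/n, and applies a simple blockwise James–Stein rule. It is a simplified stand-in for the estimators used in our experiments; in particular, the quantization step and the weakly geometric blocks are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

def damped_doppler(x):
    # Transcribed literally from the display above; treat the exact form as illustrative.
    return x * (1 - x) * np.sin(2.1 * np.pi * x + 0.3)

def trig_basis(j, t):
    """phi_1 = 1, phi_2k = sqrt(2) cos(2 pi k t), phi_2k+1 = sqrt(2) sin(2 pi k t)."""
    if j == 1:
        return np.ones_like(t)
    k = j // 2
    return np.sqrt(2) * (np.cos(2*np.pi*k*t) if j % 2 == 0 else np.sin(2*np.pi*k*t))

def sequence_model(n, J=200, grid=4000):
    """Coefficients theta_j of f and noisy observations Y_j = theta_j + eps z_j, eps^2 = 1/n."""
    t = (np.arange(grid) + 0.5) / grid
    f = damped_doppler(t)
    theta = np.array([np.mean(f * trig_basis(j, t)) for j in range(1, J + 1)])   # quadrature
    eps = 1.0 / np.sqrt(n)
    return theta, theta + eps * rng.standard_normal(J), eps

def block_james_stein(y, eps, block_size=10):
    """Blockwise James-Stein shrinkage applied to consecutive blocks of coefficients."""
    theta_hat = np.zeros_like(y)
    for start in range(0, len(y), block_size):
        blk = y[start:start + block_size]
        shrink = max(0.0, 1.0 - len(blk) * eps**2 / float(np.sum(blk**2)))
        theta_hat[start:start + block_size] = shrink * blk
    return theta_hat

for n in (500, 5000, 50000):
    theta, y, eps = sequence_model(n)
    theta_hat = block_james_stein(y, eps)
    print(f"n = {n:6d}: squared error of blockwise James-Stein = {np.sum((theta_hat - theta)**2):.5f}")
```

Replacing the James–Stein step by the quantized estimator of Section 4, for a grid of bit budgets, is the natural way to reproduce the risk curves in Fig. 4, to which we now return.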
In this set of plots, we are able to observe the phase transition for the quantized estimators. For relatively small values of n, all quantized estimators yield a similar error rate, with risks that are close to (or even smaller than) that of the block-wise James–Stein estimator. This is the over sufficient regime: even the smallest budget suffices to achieve the optimal risk. As n increases, the curves start to separate, with estimators having smaller bit budgets leading to worse risks compared to the block-wise James–Stein estimator and compared to estimators with larger budgets. This can be seen as the sufficient regime for the small-budget estimators: the risks are still going down, but at a slower rate than optimal. The six quantized estimators all end up in the insufficient regime: as n increases, their risks begin to flatten out, while the risk of the block-wise James–Stein estimator continues to decrease.
Fig. 4. Risk versus effective sample size n=1/ε2 for estimating the damped Doppler function with different estimators. The dashed line represents the risk of the block-wise James–Stein estimator and the solid ones are for the quantized estimators with different budgets. The budgets are 5, 10, 15, 20, 25 and 30 bits, corresponding to the lines from top to bottom. The two plots are the same curves on the original scale and the log–log scale.
6. Related work and future directions
Concepts related to quantized non-parametric estimation appear in multiple communities. As mentioned in the introduction, Donoho's 1997 Wald Lectures [9] (on the eve of the 50th anniversary of Shannon's 1948 paper) drew sharp parallels between rate distortion, metric entropy and minimax rates, focusing on the same Sobolev function spaces we treat here. One view of the present work is that we take this correspondence further by studying how the risk continuously degrades with the level of quantization. We have analyzed the precise leading order asymptotics for quantized regression over the Sobolev spaces, showing that these rates and constants are realized with coding schemes that are adaptive to the smoothness m and radius c of the ellipsoid, achieving automatically the optimal rate for the regime corresponding to those parameters given the specified communication budget. Our detailed analysis is possible due to what Nussbaum [19] calls the 'Pinsker phenomenon', referring to the fact that linear filters attain the minimax rate in the over sufficient regime. It will be interesting to study quantized non-parametric estimation in cases where the Pinsker phenomenon does not hold, for example over Besov bodies and different Lp spaces. Many problems of rate distortion type are similar to quantized regression. The standard 'reverse water-filling' construction to quantize a Gaussian source with varying noise levels plays a key role in our analysis, as shown in Section 3.
In our case, the Sobolev ellipsoid is an infinite Gaussian sequence model, requiring truncation of the sequence at the appropriate level, depending on the targeted quantization and estimation error. In the case of Euclidean balls, Draper and Wornell [10] study rate distortion problems motivated by communication in sensor networks; this is closely related to the problem of quantized minimax estimation over Euclidean balls that we analyzed in [27]. The essential difference between rate distortion and our quantized minimax framework is that in rate distortion the quantization is carried out for a random source, while in quantized estimation we quantize our estimate of the deterministic and unknown basis coefficients. Since linear estimators are asymptotically minimax for Sobolev spaces under squared error (the 'Pinsker phenomenon'), this naturally leads to an alternative view of quantizing the observations, or said differently, of compressing the data before estimation. Statistical estimation from compressed data has appeared previously in different communities. In [26], a procedure is analyzed that compresses data by random linear transformations in the setting of sparse linear regression. Zhang & Berger [24] study estimation problems when the data are communicated from multiple sources; Ahlswede & Csiszár [2] consider testing problems under communication constraints; the use of side information is studied by Ahlswede & Burnashev [1]; other formulations in terms of multiterminal information theory are given by Han & Amari [14]; non-parametric problems are considered by Raginsky in [20]. In a distributed setting, the data may be divided across different compute nodes, with distributed estimates then aggregated or pooled by communicating with a central node. The general 'CEO problem' of distributed estimation was introduced by Berger et al. [3] and has been recently studied in parametric settings in [13,25]. These papers take the view that the data are communicated to the statistician at a certain rate, which may introduce distortion, and the goal is to study the degradation of the estimation error. In contrast, in our setting we can view the unquantized data as being fully available to the statistician at the time of estimation, with communication constraints being imposed when communicating the estimated model to a remote location. Finally, our quantized minimax analysis shows achievability using random coding schemes that are not computationally efficient. A natural problem is to develop practical coding schemes that come close to the quantized minimax lower bounds. In our view, the most promising approach currently is to exploit source coding schemes based on greedy sparse regression [23], applying such techniques blockwise according to the procedure we developed in Section 4.
Supplementary data
Supplementary data are available at IMAIAI online.
Acknowledgements
The authors thank Andrew Barron, John Duchi, Maxim Raginsky, Philippe Rigollet, Harrison Zhou and the anonymous referees for valuable comments on this work.
Funding
Office of Naval Research (N00014-15-1-2379, in part) and National Science Foundation (DMS-1513594, DMS-1547396, in part).
References
[1] Ahlswede R., Burnashev M. (1990) On minimax estimation in the presence of side information about remote data. Ann. Statist., 18, 141–171.
[2] Ahlswede R., Csiszár I. (1986) Hypothesis testing with communication constraints. IEEE Trans. Inform. Theory, 32, 533–542.
[3] Berger T., Zhang Z., Viswanathan H. (1996) The CEO problem. IEEE Trans. Inform. Theory, 42, 887–902.
[4] Brown L. D., Low M. G. (1996) Asymptotic equivalence of non-parametric regression and white noise. Ann. Statist., 24, 2384–2398.
[5] Bruer J. J., Tropp J. A., Cevher V., Becker S. (2014) Time–data tradeoffs by aggressive smoothing. Advances in Neural Information Processing Systems (Ghahramani Z., Welling M., Cortes C., Lawrence N. D., Weinberger K. Q. eds). Montreal, Canada: Neural Information Processing Systems Foundation, Inc., pp. 1664–1672.
[6] Chandrasekaran V., Jordan M. I. (2013) Computational and statistical tradeoffs via convex relaxation. Proc. Natl. Acad. Sci. USA, 110, E1181–E1190.
[7] Chattamvelli R., Jones M. (1995) Recurrence relations for non-central density, distribution functions and inverse moments. J. Stat. Comput. Simul., 52, 289–299.
[8] Cover T. M., Thomas J. A. (2006) Elements of Information Theory. New York: Wiley-Interscience.
[9] Donoho D. L. (1997) Wald lecture I: counting bits with Kolmogorov and Shannon. Note for the Wald Lectures.
[10] Draper S. C., Wornell G. W. (2004) Side information aware coding strategies for sensor networks. IEEE J. Sel. Areas Commun., 22, 966–976.
[11] Galal S., Horowitz M. (2011) Energy-efficient floating-point unit design. IEEE Trans. Comput., 60, 913–922.
[12] Gallager R. G. (1968) Information Theory and Reliable Communication. New York: John Wiley & Sons.
[13] Garg A., Ma T., Nguyen H. (2014) On communication cost of distributed statistical estimation and dimensionality. Advances in Neural Information Processing Systems, pp. 2726–2734.
[14] Han T. S., Amari S.-I. (1998) Statistical inference under multiterminal data compression. IEEE Trans. Inform. Theory, 44, 2300–2324.
[15] Johnstone I. M. (2015) Gaussian estimation: sequence and wavelet models. Unpublished manuscript.
[16] Lucic M., Ohannessian M. I., Karbasi A., Krause A. (2015) Tradeoffs for space, time, data and risk in unsupervised learning. International Conference on Artificial Intelligence and Statistics (Lebanon G., Vishwanathan S. V. N. eds). San Diego, CA: Proceedings of Machine Learning Research, Vol. 38, pp. 663–671.
[17] Nussbaum M. (1985) Spline smoothing in regression models and asymptotic efficiency in L2. Ann. Statist., 13, 984–997.
[18] Nussbaum M. (1996) Asymptotic equivalence of density estimation and Gaussian white noise. Ann. Statist., 24, 2399–2430.
[19] Nussbaum M. (1999) Minimax risk: Pinsker bound. Encycl. Stat. Sci., 3, 451–460.
[20] Raginsky M. (2007) Learning from compressed observations. IEEE Information Theory Workshop, Lake Tahoe, CA, pp. 420–425.
[21] Sakrison D. (1968) A geometric treatment of the source encoding of a Gaussian random variable. IEEE Trans. Inform. Theory, 14, 481–486.
[22] Tsybakov A. B. (2008) Introduction to Non-Parametric Estimation. Springer Series in Statistics, 1st edn. New York: Springer.
[23] Venkataramanan R., Sarkar T., Tatikonda S. (2014) Lossy compression via sparse linear regression: computationally efficient encoding and decoding. IEEE Trans. Inform. Theory, 60, 3265–3278.
[24] Zhang Z., Berger T. (1988) Estimation via compressed information. IEEE Trans. Inform. Theory, 34, 198–211.
[25] Zhang Y., Duchi J., Jordan M. I., Wainwright M. J. (2013) Information-theoretic lower bounds for distributed statistical estimation with communication constraints. Advances in Neural Information Processing Systems (Burges C. J. C., Bottou L., Welling M., Ghahramani Z., Weinberger K. Q. eds). Lake Tahoe, NV: Neural Information Processing Systems Foundation, Inc., pp. 2328–2336.
[26] Zhou S., Lafferty J., Wasserman L. (2009) Compressed and privacy-sensitive sparse regression. IEEE Trans. Inform. Theory, 55, 846–866.
[27] Zhu Y., Lafferty J. (2014) Quantized estimation of Gaussian sequence models in Euclidean balls. Advances in Neural Information Processing Systems (Ghahramani Z., Welling M., Cortes C., Lawrence N. D., Weinberger K. Q. eds). Montreal, Canada: Neural Information Processing Systems Foundation, Inc., pp. 3662–3670.
Appendix. Proofs of Technical Results
In this section, we provide proofs for Theorems 3.1 and 4.1. A.1. Proof of Theorem 3.1 We first show the following lemma. Lemma A.1 The quantized minimax risk is lower bounded by Vε(m,c,Bε), the value of the optimization (P1). Proof. As will be clear to the reader, Vε(m,c,Bε) is achieved by some σ2 that is non-increasing and finitely supported. Let σ2 be such that   σ12≥…≥σn2>0=σn+1=…, ∑j=1naj2σj2=c2π2m and let   Θn(m,c)={θ∈ℓ2:∑j=1naj2θj2≤c2π2m, θj=0 for j≥n+1}⊂Θ(m,c). Building on this sequence σ2, we construct a prior distribution on θ. In particular, for τ∈(0,1), write sj2=(1−τ)σj2 and let πτ(θ;σ2) be a prior distribution on θ such that   θj~N(0,sj2), j=1,…,n,ℙ(θj=0)=1, j≥n+1. We observe that   Rε(m,c,Bε) ≥infθ^,C(θ^)≤Bεsupθ∈Θn(m,c)E||θ−θ^||2≥infθ^,C(θ^)≤Bε∫Θn(m,c)E||θ−θ^||2dπτ(θ;σ2)≥Iτ−rτ, where Iτ is the integrated risk of the optimal quantized estimator   Iτ=infθ^,C(θ^)≤Bε∫ℝn⊗{0}∞E||θ−θ^||2dπτ(θ;σ2) and rτ is the residual   rτ=supθ^∈Θ(m,c)∫Θn(m,c)¯E||θ−θ^||2dπτ(θ;σ2), where Θn(m,c)¯=(ℝn⊗{0}∞)\Θn(m,c). As shown in Section 3, limτ→0Iτ is lower bounded by the value of the optimization   minμ2∑j=1∞μj2+∑j=1∞σj2ε2σj2+ε2such that ∑j=1∞12log+(σj4μj2(σj2+ε2))≤Bε.  It then suffices to show that rτ=o(Iτ) as ε→0 for τ∈(0,1). Let dn=supθ∈Θn(m,c)||θ||, which is bounded since for any θ∈Θn(m,c)  ||θ||=∑jθj2=1a12∑ja12θj2≤1a12∑jaj2θj2≤1a12c2π2m=ca1πm. We have   rτ=supθ^∈Θ(m,c)∫Θn(m,c)¯E||θ−θ^||2dπτ(θ;σ2)≤2∫Θn(m,c)¯(dn2+E||θ||2)dπτ(θ;σ2)≤2(dn2ℙ(θ∉Θn(m,c))+(ℙ(θ∉Θn(m,c))E||θ||4)1/2), where we use the Cauchy–Schwarz inequality. Noticing that   E||θ||4=E((∑j=1nθj2)2)=∑j1≠j2E(θj12)E(θj22)+∑j=1nE(θj4)≤∑j1≠j2sj12sj22+3∑j=1nsj4≤3(∑j=1nsj2)2≤3dn4, we obtain   rτ≤2dn2(ℙ(θ∉Θn(m,c))+3ℙ(θ∉Θn(m,c)))≤6dn2ℙ(θ∉Θn(m,c)). Thus, we only need to show that ℙ(θ∉Θn(m,c))=o(Iτ). In fact,   ℙ(θ∉Θn(m,c)) =ℙ(∑j=1naj2θj2>c2π2m)=ℙ(∑j=1naj2(θj2−E(θj2))>c2π2m−(1−τ)∑j=1naj2σj2)=ℙ(∑j=1naj2(θj2−E(θj2))>τc2π2m)=ℙ(∑j=1naj2sj2(Zj2−1)>τ1−τ∑j=1naj2sj2), where Zj~N(0,1). By Lemma A.2, we get   ℙ(θ∉Θn(m,c))≤exp(−τ28(1−τ)2∑j=1naj2sj2max1≤j≤naj2sj2)=exp(−τ28(1−τ)2∑j=1naj2σj2max1≤j≤naj2σj2). Next we will show that for the σ2 that achieves Vε(m,c,Bε), we have ℙ(θ∉Θn(m,c))=o(Iτ). For the over-sufficient regime where Bεε22m+1→∞ as ε→0, it is shown in [22] that max1≤j≤naj2σj2=O(ε22m+1) and Iτ=O(ε4m2m+1), and hence that ℙ(θ∉Θn(m,c))=o(Iτ). For the insufficient regime where Bεε22m+1→0, but still Bε→∞ as ε→0, an achieving sequence σ2 is given later by (A.4) and (A.3). We obtain that max1≤j≤naj2σj2=O(Bε−1) and Iτ=O(Bε−2m), and therefore ℙ(θ∉Θn(m,c))=o(Iτ).
The sufficient regime where Bεε22m+1→d for some constant d is a bit more complicated, as we don't have an explicit formula for the optimal sequence σ2. However, by Lemma 3.3, for the continuous approximation σ2(x) such that σj2=σ2(jh)h2m+1, we have   λx2mσ2(x) =σ2(x)(σ2(x)+1)2+α⋅σ2(x)+2σ2(x)+1≤14+2α, where α=exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0) and λ are both constants. Therefore,   max1≤j≤naj2σj2≈j2mσ2(jh)h2m+1≤1λ(14+2α)⋅h. Note that ∑j=1naj2σj2=O(h2m) and that h=ε22m+1. We obtain that for this case Iτ=O(ε4m2m+1) and ℙ(θ∉Θn(m,c))=o(Iτ). Thus, for each of the three regimes, we have rτ=o(Iτ). Lemma A.2 (Lemma 3.5 in [22]) Suppose that X1,…,Xn are i.i.d. N(0,1). For t∈(0,1) and ωj>0, j=1,…,n, we have   ℙ(∑j=1nωj(Xj2−1)>t∑j=1nXj)≤exp(−t2∑j=1nωj8max1≤j≤nωj). Proof of Lemma 3.1. This is in fact Pinsker's theorem, which gives the exact asymptotic minimax risk of estimation of normal means in the Sobolev ellipsoid. The proof can be found in [19] and [22]. Proof of Lemma 3.2. As argued in Section 3 for the lower bound in the sufficient regime, optimization problem (Q1) can be reformulated as   maxσ2,JJηsuch that 12∑j=1Jlog+(σj4η(σj2+ε2))≤Bε∑j=1Jaj2σj2≤c2π2m(σj2) is decreasing and σJ4σJ2+ε2≥η. (Q2) Now suppose that we have a series ( σj2) which satisfies the last constraint and is supported on {1,…,J}. By the first constraint, we have that   Jη=Jexp(−2BεJ)(∏j=1Jσj4σj2+ε2)1J≤Jexp(−2BεJ)(∏j=1Jσj2)1J=Jexp(−2BεJ)(∏j=1Jaj2σj2)1J(∏j=1Jaj−2)1J≤exp(−2BεJ)(∑j=1Jaj2σj2)(∏j=1Jaj−2)1J≤c2π2mexp(−2BεJ)(∏j=1Jaj−2)1J=c2π2m(exp(Bεm)J!)−2mJ. (A.1) This provides a series of upper bounds for Qε(m,c,Bε) parameterized by J. To minimize (A.1) over J, we look at the ratio of the neighboring terms with J and J+1, and compare it to 1. We obtain that the optimal J satisfies   JJJ!<exp(Bεm)≤(J+1)J+1(J+1)!. (A.2) Denote this optimal J by Jε. By Stirling's approximation, we have   limε→0Bε/mJε=1 (A.3) and plugging this asymptote into (A.1), we get as ε→0  c2π2m(exp(Bεm)Jε!)−2mJε~c2π2mJε−2m~c2m2mπ2mBε−2m. This gives the desired upper bound (3.6). Next we show that the upper bound (3.6) is asymptotically achievable when Bεε22m+1→0 and Bε→∞. It suffices to find a feasible solution that attains (3.6). Let   σ˜j2=c2/π2mJεaj2, j=1,…,Jε. (A.4) Note that the entire sequence of (σ˜j2)j=1Jε does not qualify for a feasible solution, since the first constraint in (Q2) won't be satisfied for any η≤σ˜Jε4σ˜Jε2+ε2. We keep only the first Jε' terms of (σ˜j2), where Jε' is the largest j such that   σ˜j4σ˜j2+ε2≥σ˜Jε2. (A.5) Thus,   ∑j=1Jε′12log+(σ˜j4σ˜j2+ε2σ˜Jε2)≤∑j=1Jε′12log+(σ˜j2σ˜Jε2)≤∑j=1Jε12log+(σ˜j2σ˜Jε2)≤Bε, where the last inequality is due to (A.2). This tells us that setting η=σ˜Jε2 leads to a feasible solution to (Q2). As a result,   Qε(m,c,Bε)≥J′εσ˜Jε2. (A.6) If we can show that Jε′~Jε, then   J′εσ˜Jε2~Jεσ˜Jε2~c2m2mπ2mBε−2m. (A.7) To show that Jε'~Jε, it suffices to show that aJε'~aJε. Plugging the formula of σ˜j2 into (A.5) and solving for aJε'2, we get   aJε'2~−c2π2mJε+(c2π2mJε)2+4c2π2mJεε2aJε22ε2~−c2π2mJε+c2π2mJε+12π2mJεc24c2π2mJεε2aJε22ε2=aJε2, where the equivalence is due to the assumption Bεε22m+1→0 and a Taylor's expansion of the function x. Proof of Lemma 3.3 Suppose that σ2(x) with x0 solves (P4). Consider function σ2(x)+ξv(x) such that it is still feasible for (P4), and thus we have   ∫0x0x2mv(x)dx≤0. 
Now plugging σ2(x)+ξv(x) for σ2(x) in the objective function of (P4), taking derivative with respect to ξ, and letting ξ→0, we must have   ∫0x0v(x)(σ2(x)+1)2dx+x0exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0)1x0∫0x02v(x)σ2(x)−v(x)σ2(x)+1dx≤0, which, after some calculation and rearrangement of terms, yields   ∫0x0v(x)(1(σ2(x)+1)2+exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0)σ2(x)+2σ2(x)(σ2(x)+1))dx≤0. Thus, by the lemma that follows, we obtain that for some λ  1(σ2(x)+1)2+exp(1x0∫0x0logσ4(y)σ2(y)+1dy−2dx0)σ2(x)+2σ2(x)(σ2(x)+1)=λx2m. Lemma A.3 Suppose that f(x) and g(x) are two non-zero functions on (0,x0) such that for any v(x), satisfying ∫0x0f(x)v(x)dx≤0, it holds that ∫0x0g(x)v(x)dx≤0. Then there exists a constant λ such that f(x)=λg(x). Proof. First we show that for any v(x) such that ∫0x0f(x)v(x)dx=0, we must have ∫0x0g(x)v(x)dx=0. Otherwise, suppose that v0(x) is such that ∫0x0f(x)v0(x)dx=0 and ∫0x0g(x)v0(x)dx<0. Then take another v(x) with ∫0x0f(x)v(x)dx≤0 and consider vγ(x)=v(x)−γv0(x). We have ∫0x0f(x)vγ(x)dx≤0 and ∫0x0g(x)vγ(x)=∫0x0v(x)g(x)dx−γ∫0x0g(x)v0(x)dx>0 for large enough γ, which results in contradiction. Let λ=∫0x0f(x)2dx/∫0x0f(x)g(x)dx as the denominator cannot be zero. In fact, if ∫0x0f(x)g(x)dx=0, it would imply that ∫0x0g(x)2dx=0, and hence g(x)≡0. Now consider the function f(x)−λg(x). Notice that we have ∫0x0f(x)(f(x)−λg(x))dx=0 by the definition of λ. It follows that ∫0x0g(x)(f(x)−λg(x))dx=0, and therefore, ∫0x0(f(x)−λg(x))2dx=0, which concludes the proof. A.2. Proof of Theorem 4.1 Now we give the details of the proof of Theorem 4.1. For the purpose of our analysis, we define two allocations of bits, the monotone allocation and the block-wise constant allocation,   Πblk(B)={(bj)j=1∞: ∑j=1∞bj≤B, bj=b¯k for j∈Jk, 0≤bj≤bmax}, (A.8)  Πmon(B)={(bj)j=1∞: ∑j=1∞bj≤B, bj−1≥bj, 0≤bj≤bmax}, (A.9) where bmax=2log(1/ε). We also define two classes of weights, the monotonic weights and the block-wise constant weights,   Ωblk={(ωj)j=1∞: ωj=ω¯k for j∈Jk, 0≤ωj≤1}​, (A.10)  Ωmon={(ωj)j=1∞: ωj−1≥ωj, 0≤ωj≤1}​. (A.11) We will also need the following results from [22] regarding the weakly geometric system of blocks. Lemma A.4 Let {Jk} be a weakly geometric block system defined by (4.1). Then there exists 0<ε0<1 and C>0 such that for any ε∈(0,ε0),   K≤Clog2(1/ε),max1≤k≤K−1Tk+1Tk≤1+3ρε. We divide the proof into four steps. Step 1. Truncation and replacement The loss of the quantized estimator θˇ can be decomposed into   ||||θˇ−θ||2=∑k=1K||θˇ(k)−θ(k)||2+∑j=N+1∞θj2, where the remainder term satisfies   ∑j=N+1∞θj2≤N−2m∑j=N+1∞aj2θj2=O(N−2m). If we assume that m>1/2, which corresponds to classes of continuous functions, the remainder term is then o(ε2). If m≤1/2, the remainder term is on the order of O(ε4m), which is still negligible compared to the order of the lower bound ε4m2m+1. To ease the notation, we will assume that m>1/2, and write the remainder term as o(ε2), but need to bear in mind that the proof works for all m>0. We can thus discard the remainder term in our analysis. Recall that the quantized estimate for each block is given by   θˇ(k)=Sˇk2−Tkε2Sˇk1−2−2b˜kZˇ(k) and consider the following estimate with Sˇk replaced by Sk  θ^(k)=Sk2−Tkε2Sk1−2−2b˜kZˇ(k). Notice that   ||θ^(k)−θˇ(k)||=|Sˇk2−Tkε2Sˇk−Sk2−Tkε2Sk|1−2−2b˜k||Zˇ(k)||≤|SˇkSk+Tkε2SˇkSk||Sˇk−Sk|≤2ε2, where the last inequality is because SˇkSk≥Tkε2 and |Sˇk−Sk|≤ε2. Thus we can safely replace θˇ(k) by θ^(k) because   ||θˇ(k)−θ(k)||2=||θˇ(k)−θ^(k)+θ^(k)−θ(k)||2≤||θˇ(k)−θ^(k)||2+||θ^(k)−θ(k)||2+2||θˇ(k)−θ^(k)||||||θ^(k)−θ(k)||||=||θ^(k)−θ(k)||2+O(ε2). 
Therefore, we have   E||θˇ−θ||2=E∑k=1K||θ^(k)−θ(k)||2+O(Kε2). Step 2. Expectation over codebooks Now conditioning on the data Y, we work under the probability measure introduced by the random codebook. Write   λk=Sk2−Tkε2Sk2 and Z(k)=Y(k)||Y(k)||. We decompose and examine the following term   Ak=||θ^(k)−θ(k)||2=||θ^(k)−λkSkZ(k)+λkSkZ(k)−θ(k)||2=||θ^(k)−λkSkZ(k)||2︸Ak,1+||λkSkZ(k)−θ(k)||2︸Ak,2+2⟨θ^(k)−λkSkZ(k),λkSkZ(k)−θ(k)⟩︸Ak,3. To bound the expectation of the first term Ak,1, we need the following lemma, which bounds the probability of the distortion of a codeword exceeding the desired value. Lemma A.5 Suppose that Z1,…,Zn are independent and each follows the uniform distribution on the t-dimensional unit sphere St−1. Let y∈St−1 be a fixed vector, and   Z*=argminz∈Z1:n||1−2−2qz−y||2. If n=2qt, then   E||1−2−2qZ*−y||2≤2−2q(1+ν(t))+2e−2t, where   ν(t)=6logt+7t−6logt−7. Observe that   Ak,1=||θ^(k)−λkSkZ(k)||2=||λkSk1−2−2b˜kZˇ(k)−λkSkZ(k)||2=λk2Sk2||1−2−2b˜kZˇ(k)−Z(k)||2. Then, it follows as a result of Lemma A.5 that   E(Ak,1|Y(k))≤(Sk2−Tkε2)2Sk2(2−2b˜k(1+νε)+2e−2Tk)≤(Sk2−Tkε2)2Sk2(2−2b˜k(1+νε)+2e−2T1)≤(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+2c2(jkπ)2mε2, where νε=6logT1+7T1−6logT1−7. Since Ak,2 only depends on Y(k), E(Ak,2|Y(k))=Ak,2. Next we consider the cross term Ak,3. Write γk=⟨θ(k),Y(k)⟩||Y(k)||2 and   Ak,3=2⟨θ^(k)−λkSkZ(k),λkSkZ(k)−θ(k)⟩=2⟨θ^(k)−λkSkZ(k),γkY(k)−θ(k)⟩︸Ak,3a+2⟨θ^(k)−λkSkZ(k),λkSkZ(k)−γkY(k)⟩︸Ak,3b. The quantity γk is chosen such that ⟨Y(k),γkY(k)−θ(k)⟩=0 and therefore   Ak,3a=2⟨θ^(k)−λkSkZ(k),γkY(k)−θ(k)⟩=2⟨ΠY(k)⊥(θ^(k)−λkSkZ(k)),γkY(k)−θ(k)⟩, where ΠY(k)⊥ denotes the projection onto the orthogonal complement of Y(k). Due to the choice of Zˇ(k), the projection ΠY(k)⊥(θ^(k)−λkSkZ(k)) is rotation symmetric, and hence E(Ak,3a|Y(k))=0. Finally, for Ak,3b we have   E(Ak,3b|Y(k))≤2||λkSkZ(k)−γkY(k)||E(||θ^(k)−λkSkZ(k)|||Y(k))≤2||λkSkZ(k)−γkY(k)||E(||θ^(k)−λkSkZ(k)||2|Y(k))≤2||λkSkZ(k)−γkY(k)||(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+2c2(jkπ)2mε2. Combining all the analyses above, we have   E(Ak|Y(k))≤(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+2c2(jkπ)2mε2+||λkSkZ(k)−θ(k)||2+2λkSkZ(k)−γkY(k)||(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+2c2(jkπ)2mε2 and summing over k we get   E(||θˇ−θ||2|Y)≤∑k=1K(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+∑k=1K||λkSkZ(k)−θ(k)||2+2∑k=1K||λkSkZ(k)−γkY(k)||(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(ε2)+O(Kε2). (A.12) Step 3. Expectation over data First we will state three lemmas, which bound the deviation of the expectation of some particular functions of the norm of a Gaussian vector to the desired quantities. The proofs are given in Section A.3. LemmaA.6 Suppose that Xi~N(θi,σ2) independently for i=1,…,n, where ||θ||2≤c2. Let S be given by   S={nσ2if ||X||<nσ2nσ2+cif ||X||>nσ2+c||X||otherwise. Then there exists some absolute constant C0 such that   E(S2−nσ2S−⟨θ,X⟩||X||)2≤C0σ2. Lemma A.7 Let X and S be the same as defined in Lemma A.6. Then for n>4  E(S2−nσ2)2S2≤||θ||4||θ||2+nσ2+4nn−4σ2. Lemma A.8 Let X and S be the same as defined in Lemma A.6. Define   θ^+=(||X||2−nσ2||X||2)+X, θ^†=S2−nσ2S||X||X. Then   E||θ^†−θ||2≤E||θ^+−θ||2≤nσ2||θ||2||θ||2+nσ2+4σ2. We now take the expectation with respect to the data on both sides of (A.12). First, by the Cauchy–Schwarz inequality   E(||λkSkZ(k)−γkY(k)||(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(ε2))≤E||λkSkZ(k)−γkY(k)||2E((Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(ε2)). (A.13) We then calculate   E||λkSkZ(k)−γkY(k)||2=E||Sk2−Tkε2SkY(k)||Y(k)||−⟨θ(k),Y(k)⟩||Y(k)||Y(k)||Y(k)||||2=E(Sk2−Tkε2Sk−⟨θ(k),Y(k)⟩||Y(k)||)2≤C0ε, where the last inequality is due to Lemma A.6, and C0 is the constant therein. 
Plugging this in (A.13) and summing over k, we get   ∑k=1KE(||λkSkZ(k)−γkY(k)||(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(ε2))≤C0ε∑k=1KE((Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(ε2))≤C0KεE∑k=1K(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(Kε2). Therefore,   E||θˇ−θ||2≤E∑k=1K(Sk2−Tkε2)2Sk22−2b˜k︸B1(1+νε)+E∑k=1K||λkSkZ(k)−θ(k)||2︸B2+C0KεE∑k=1K(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(Kε2)+O(Kε2). Now we deal with the term B1. Recall that the sequence b˜ solves problem (4.2), so for any sequence b∈Πblk  ∑k=1K(Sˇk2−Tkε2)2Sˇk22−2b˜k≤∑k=1K(Sˇk2−Tkε2)2Sˇk22−2b¯k. Notice that   |(Sˇk2−Tkε2)2Sˇk2−(Sk2−Tkε2)2Sk2|=|Sˇk2−Sk2||Sˇk2Sk2−Tkε2Sˇk2Sk2|=O(ε2) and thus,   ∑k=1K(Sk2−Tkε2)2Sk22−2b˜k≤∑k=1K(Sk2−Tkε2)2Sk22−2b¯k+O(Kε2). Taking the expectation, we get   E∑k=1K(Sk2−Tkε2)2Sk22−2b˜k≤∑k=1KE(Sk2−Tkε2)2Sk22−2b¯k+O(Kε2). Applying Lemma A.7, we get for Tk>4  E(Sk2−Tkε2)2Sk2≤||θ(k)||4||θ(k)||2+Tkε2+4TkTk−4ε2 and it follows that   E∑k=1K(Sk2−Tkε2)2Sk22−2b˜k≤∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+O(Kε2). Since b∈Πblk is arbitrary,   E∑k=1K(Sk2−Tkε2)2Sk22−2b˜k≤minb∈Πblk∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+O(Kε2). Turning to the term B2, as a result of Lemma A.8 we have   ||λkSkZ(k)−θ(k)||2≤||θ(k)||2Tkε2||θ(k)||2+Tkε2+4ε2. Combining the above results, we have shown that   E||θˇ−θ||2≤M+O(Kε2)+C0KεM+O(Kε2), (A.14) where   M=(1+νε)minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+∑k=1K||θ(k)||2Tkε2||θ(k)||2+Tkε2=(1+νε)minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+minω∈Ωblk∑k=1K((1−ω¯k)2||θ(k)||2+ω¯k2Tkε2)​. Step 4. Blockwise constant is almost optimal We now show that in terms of both bit allocation and weight assignment, block-wise constant is almost optimal. Let's first consider bit allocation. Let B′=11+3ρε(B−T1bmax). We are going to show that   minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k≤minb∈Πmon(B′)∑j=1Nθj4θj2+ε22−2bj. (A.15) In fact, suppose that b*∈Πmon(B′) achieves the minimum on the right-hand side, and define b⋆ by   bj⋆={maxi∈Bkbi*j∈Bk0j≥N. The sum of the elements in b⋆ then satisfies   ∑j=1∞bj⋆=∑k=0K−1Tk+1maxj∈Bk+1bj*=T1b1⋆+∑k=1K−1Tk+1maxj∈Bk+1bj*≤T1bmax+∑k=1K−1Tk+1Tk∑j∈Bkbj*≤T1bmax+(1+3ρε)∑k=1K−1∑j∈Bkbj*≤T1bmax+(1+3ρε)B′=B, which means that b⋆∈Πblk(B). It then follows that   minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k≤∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k⋆≤∑k=1K∑j∈Bkθj4θj2+ε22−2bj⋆=∑j=1Nθj4θj2+ε22−2bj*=minb∈Πmon(B′)∑j=1Nθj4θj2+ε22−2bj, (A.16) where (A.16) is due to Jensen's inequality on the convex function x2x+ε2  (1Tk||θ(k)||2)21Tk||θ(k)||2+ε2≤1Tk∑j∈Bkθj4θj2+ε2. Next, for the weights assignment, by Lemma 3.11 in [22], we have   minω∈Ωblk∑k=1K((1−ω¯k)2||θ(k)||2+ω¯k2Tkε2)≤(1+3ρε)(minω∈Ωmon∑k=1K((1−ωj)2θj2+ωj2ε2))+T1ε2. (A.17) Combining (A.15) and (A.17), we get   M=(1+νε)minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+minω∈Ωblk∑k=1K((1−ω¯k)2||θ(k)||2+ω¯k2Tkε2)≤(1+νε)minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+(1+3ρε)minω∈Ωmon∑k=1K((1−ω¯k)2||θ(k)||2+ω¯k2Tkε2)+T1ε2≤(1+νε)(minb∈Πmon(B′)∑j=1Nθj4θj2+ε22−2bj+minω∈Ωmon∑j=1N((1−ωj)2θj2+ωj2ε2))+T1ε2. Then by Lemma A.9,   M≤(1+νε)Vε(m,c,B′)+T1ε2 which, plugged into (A.14), gives us   E||θˇ−θ||2≤(1+νε)Vε(m,c,B′)+O(Kε2)+C0Kε(1+νε)Vε(m,c,B′)+O(Kε2). Recall that   νε=O(loglog(1/ε)log(1/ε)), K=O(log2(1/ε)) and that   limε→0B′B=limε→011+3ρε(1−T1bmaxB)=1. Thus,   limε→0Vε(m,c,B′)Vε(m,c,B)=1. Also notice that no matter how B grows as ε→0, Vε(m,c,B)=O(ε4m2m+1). Therefore,   limε→0E||θˇ−θ||2Vε(B,m,c)≤limε→0((1+νε)Vε(B′,m,c)Vε(B,m,c)+O(Kε2)V(B,m,c)+C0(1+νε)Kε2Vε(B,m,c)Vε(B′,m,c)Vε(B,m,c)+(O(Kε2)Vε(B,m,c))2)=1, which concludes the proof. 
Lemma A.9 Let V1 be the value of the optimization   maxθminb∑j=1N(θj4θj2+ε22−2bj+θj2ε2θj2+ε2)such that∑j=1Nbj≤B, bj≥0, ∑j=1Jaj2θj2≤c2π2m (A1) and let V2 be the value of the optimization   maxθminb,ω∑j=1N(θj4θj2+ε22−2bj+(1−ωj)2θj2+ωj2ε2)such that∑j=1Nbj≤B, bj−1≥bj, 0≤bj≤bmax, ωj−1≥ωj,∑j=1Jaj2θj2≤c2π2m. (A2) Then V1=V2. A.3. Proofs of Lemmas Proof of Lemma A.5. Let ζ(t) be a positive function of t to be specified later. Let   p0=ℙ(||1−2−2qZ1−y||≤2−q1+ζ(t)). By Lemma A.10, when ζ(t)≤2(1−2−2q), p0 can be lower bounded by   p0≥Γ(t2+1)πtΓ(t+12)(2−q1+ζ(t)/2)t−1. We obtain that   E||1−2−2qZ*−y||2≤2−2q(1+ζ(t))+2ℙ(||1−2−2qZ*−y||>2−q1+ζ(t))=2−2q(1+ζ(t))+2(1−p0)n. To upper bound (1−p0)n, we consider   log((1−p0)n)=nlog(1−p0)≤−np0≤−2qtΓ(t2+1)πtΓ(t+12)(2−q1+ζ(t)/2)t−1≤−2qΓ(t2+1)πtΓ(t+12)(1+ζ(t)/2)(2/ζ(t)+1)t−12(2/ζ(t)+1)≤−2π(t2)t2+12e−t2πte(t2−12)t2e−(t2−12)et−12(2/ζ(t)+1)=−e−32t−12(tt−1)t2et−12(2/ζ(t)+1)≤−e−1t−12et−12(2/ζ(t)+1), where we have used Stirling's approximation in the form   2πzz+1/2e−z≤Γ(z+1)≤ezz+1/2e−z. In order for (1−p0)n≤e−2t to hold, we need   −2t=−e−1t−12et−12(2/ζ(t)+1), which leads to the choice of ζ(t)  ζ(t)=2t−12log(2et32)−1=6logt+4log(2e)t−3logt−2log(2e)−1. Thus, we have shown that when q is not too close to 0, satisfying 1−2−2q≥ζ(t)/2, we have   E||1−2−2qZ*−y||2≤2−2q(1+ζ(t))+e−2t. When 1−2−2q<ζ(t)/2, we observe that   E||1−2−2qZ*−y||2=1−2−2q+1−21−2−2qE⟨Z*,y⟩≤2−2−2q=2−2q(1+2(22q−1)) and that   2(22q−1)<21−ζ(t)/2−2=2ζ(t)2−ζ(t)=6logt+4log(2e)t−6logt−4log(2e)−1. Now take ν(t)=6logt+7t−6logt−7. Notice that ν(t)>6logt+4log(2e)t−6logt−4log(2e)−1≥ζ(t), we have for any q≥0  E||1−2−2qZ*−y||2≤2−2q(1+ν(t))+e−2t. Lemma A.10 Suppose Z is a t-dimensional random vector uniformly distributed on the unit sphere St−1. Let y be a fixed vector on the unit sphere. For δ<1 and ζ>0, satisfying ζ≤2(1−δ2), define   p0=ℙ(||Z−y||≤δ1+ζ). We have   p0≥Γ(t2+1)πtΓ(t+12)(δ1+ζ/2)t−1. Proof. The proof is based on an idea from [21]. Denote by Vt and At, the volume and the surface area of a t-dimensional unit sphere, respectively. We have   Vt=∫01Atrt−1dr=1tAt. From the geometry of the situation as illustrated in Fig. A1, p0 is equal to the ratio of two areas S1 and S2. The first area S1 is the portion of the surface area of the sphere of radius 1−δ2 and center O contained within the sphere of radius δ1+ζ and center y. It is the surface area of a (t−1)-dimensional polar cap of radius 1−δ2 and polar angle θ0, and can be lower bounded by the area of a (t−1)-dimensional disk of radius 1−δ2sinθ0, that is,   S1≥Vt−1(1−δ2sinθ0)t−1=1t−1At−1(1−δ2sinθ0)t−1. The second area S2 is simply the surface area of a (t−1)-dimensional sphere of radius 1−δ2  S2=At(1−δ2)t−1. Therefore, we obtain   p0 =S1S2≥1t−1At−1(1−δ2sinθ0)t−1At(1−δ2)t−1=At−1(t−1)At(sinθ0)t−1=Γ(t+12+12)πtΓ(t+12)(sinθ0)t−1, where we have used the well-known relationship between At−1 and At  At−1At=1π(t−1)Γ(t2+1)tΓ(t−12+1). Now we need to calculate sinθ0. By the law of cosines, we have   cosθ0=1+1−δ2−δ2(1+ζ)21−δ2=1−δ2(1+ζ/2)1−δ2 and it follows that   sin2θ0=1−cos2θ0=1−1+δ4(1+ζ/2)2−2δ2(1+ζ/2)1−δ2=δ2(1+ζ)−δ4ζ24(1−δ2). Now since ζ≤2(1−δ2), we get   sinθ0≥δ1+ζ/2, which completes the proof. Fig. A1. View largeDownload slide Illustration of the geometry for calculating p0. Fig. A1. View largeDownload slide Illustration of the geometry for calculating p0. Proof of Lemma A.6. We first claim that   E(S2−nσ2S−⟨θ,X⟩||X||)2≤E(||X||2−nσ2||X||−⟨θ,X⟩||X||)2. 
In fact, writing Er(⋅) for the conditional expectation E(⋅|||X||=r), it suffices to show that for r<nσ2 and r>nσ2+c  Er(S2−nσ2S−⟨θ,X⟩||X||)2≤Er(||X||2−nσ2||X||−⟨θ,X⟩||X||)2. When r<nσ2, it is equivalent to   Er(⟨θ,X⟩||X||||)2≤Er(⟨θ,X⟩||X||−||X||2−nσ2||X||)2. It is then sufficient to show that Er⟨θ,X⟩≥0. This can be obtained by following a similar argument as in Lemma A.6 in [22]. When r>nσ2+c, we need to show that   Er((nσ2+c)2−nσ2nσ2+c−⟨θ,X⟩||X||)2≤Er(||X||2−nσ2||X||−⟨θ,X⟩||X||)2, which, after some algebra, boils down to   (nσ2+c)2−nσ2nσ2+c+r2−nσ2r≥2rEr⟨θ,X⟩. This holds because   r((nσ2+c)2−nσ2nσ2+c+r2−nσ2r−2rEr⟨θ,X⟩)≥||θ||2+r2−nσ2−2Er⟨θ,X⟩≥Er||X−θ||2−nσ2≥0, where we have used the assumption that r>nσ2+c, ||θ||≤c and that   Er||X−θ||||≥Er||X||−||θ||≥nσ2. Now that we have shown (A.3) and noting that   E(||X||2−nσ2||X||−⟨θ,X⟩||X||)2=σ2E(||X/σ||2−n||X/σ||−⟨θ/σ,X/σ⟩||X/σ||)2, we can assume that X~N(θ,In) and equivalently show that there exists a universal constant C0 such that   E(||X||2−n||X||−⟨θ,X⟩||X||)2≤C0 holds for any n and θ. Letting Z=X−θ and writing ||θ||2=ξ, we have   E(||X||2−n||X||−⟨θ,X⟩||X||)2=E(||Z+θ||2−n−ξ||Z+θ||−⟨θ,Z⟩||Z+θ||)2≤2E(||Z+θ||2−n−ξ||Z+θ||)2+2E(⟨θ,Z⟩||Z+θ||||)2≤2E||Z+θ||2−4(n+ξ)+2E(n+ξ)2||Z+θ||2+2E(⟨θ,Z⟩||Z+θ||)2≤2(n+ξ)−4(n+ξ)+2(n+ξ)2n+ξ−4+2E(⟨θ,Z⟩||Z+θ||)2=8(n+ξ)n+ξ−4+2E(⟨θ,Z⟩||Z+θ||)2, where the last inequality is due to Lemma A.11. To bound the last term, we apply the Cauchy–Schwarz inequality and get   E(⟨θ,Z⟩||Z+θ||)2≤E1||Z+θ||4E⟨θ,Z⟩4≤3(n−4)ξ2(n−6)(n+ξ−4)(n+ξ−6), where the last inequality is again due to Lemma A.11. Thus we just need to take C0 to be   supn≥7,ξ≥08(n+ξ)n+ξ−4+23(n−4)ξ2(n−6)(n+ξ−4)(n+ξ−6), which is apparently a finite quantity. Proof of Lemma A.7. Since the function (x2−nσ2)2/x2 is decreasing on (0,nσ2) and increasing on (nσ2,∞), we have   (S2−nσ2)2S2≤(||X||2−nσ2)2||X||2 and it follows that if n>4  E(S2−nσ2)2S2≤E(||X||2−nσ2)2||X||2 (A.18)  =E||X||2−2nσ2+n2σ4E(1||X||2) (A.19)  ≤||θ||2−nσ2+n2σ4||θ||2+nσ2−4σ2 (A.20)  ≤||θ||4||θ||2+nσ2+4nn−4σ2, (A.21) where (A.20) is due to Lemma A.11, and (A.21) is obtained by   ||θ||2−nσ2+n2σ4||θ||2+nσ2−4σ2−||θ||4||θ||2+nσ2=||θ||4+4σ2(nσ2−||θ||2)||θ||2+nσ2−4σ2−||θ||4||θ||2+nσ2=4n2σ6(||θ||2+nσ2−4σ2)(||θ||2+nσ2)≤4nn−4σ2. Proof of Lemma A.8. First, the second inequality   E||θ^+−θ||2≤nσ2||θ||2||θ||2+nσ2+4σ2 is given by Lemma 3.10 from [22]. We thus focus on the first inequality. For convenience we write   g+(x)=(||x||2−nσ2||x||2)+, g†(x)=s(x)2−nσ2s(x)||x|||| with   s(x)={nσ2if ||||x||||<nσ2nσ2+cif ||||x||||>nσ2+c||||x||||otherwise. Notice that g+(x)=g†(x) when ||x||≤nσ2+c and g+(x)>g†(x) when ||x||>nσ2+c. Since g† and g+ both only depend on ||x||, we sometimes will also write g†(||x||) for g†(x) and g+(||x||) for g+(x). Setting Er(⋅) to denote the conditional expectation E(⋅|||X||=r) for brevity, it suffices to show that for r≥nσ2+c  Er(||g†(X)X−θ||2)≤Er(||g+(X)X−θ||2)⇔g†(r)2r2−2g†(r)Er⟨X,θ⟩≤g+(r)2r2−2g+(r)Er⟨X,θ⟩⇔(g†(r)2−g+(r)2)r2≥2(g†(r)−g+(r))Er⟨X,θ⟩⇔(g†(r)+g+(r))r2≥2Er⟨X,θ⟩. (A.22) On the other hand, we have   (g†(r)+g+(r))r2≥(||θ||2r2+r2−nσ2r2)r2=||θ||2+r2−nσ2=||θ||2+r2−2Er⟨X,θ⟩−nσ2+2Er⟨X,θ⟩=Er||X−θ||2−nσ2+2Er⟨X,θ⟩≥2Er⟨X,θ⟩, where the last inequality is because   ||X−θ||2≥(||X||−||θ||)2≥nσ2. Thus, (A.22) holds, and hence E||θ^†−θ||2≤E||θ^+−θ||2. Proof of Lemma A.9. It is easy to see that V1≤V2, because for any θ the inside minimum is smaller for (A1) than for (A2). Next, we will show V1≥V2. Suppose that θ* achieves the value V2, with corresponding b* and ω*. We claim that θ* is non-increasing. 
In fact, if θ* is not non-increasing then there must exist an index j such that θj*<θj+1*, and for simplicity, let's assume that θ1*<θ2*. We are going to show that this leads to b1*=b2* and ω1*=ω2*. Write   s1=θ1*4θ1*2+ε2, s2=θ2*4θ2*2+ε2. We have s1<s2. Let b¯*=b1*+b2*2 and observe that b1*≥b¯*≥b2*. Notice that   (s12−2b1*+s22−2b2*)−(s12−2b¯*+s22−2b¯*)=s1(2−2b1*−2−2b¯*)+s2(2−2b2*−2−2b¯*)≥s2(2−2b1*−2−2b¯*)+s2(2−2b2*−2−2b¯*)≥s2(2−2b1*+2−2b2*−2⋅2−2b¯*)≥0, where equality holds if and only if b1*=b2*, since s2>s1≥0. Hence, b1* and b2* have to be equal, or otherwise it would contradict with the assumption that b* achieves the inside minimum of (A2). Now turn to ω*. Write ω¯*=ω1*+ω2*2 and note that ω1*≥ω¯*≥ω2*. Consider   ((1−ω1*)2θ1*2+ω1*2ε2)+((1−ω2*)2θ2*2+ω2*2ε2)−((1−ω¯*)2(θ1*2+θ2*2)+2ω¯*2ε2)=((1−ω1*)2−(1−ω¯*)2)θ1*2+((1−ω2*)2−(1−ω¯*)2)θ2*2+(ω1*2+ω2*2−2ω¯*2)ε2≥((1−ω1*)2−(1−ω¯*)2)θ2*2+((1−ω2*)2−(1−ω¯*)2)θ2*2+(ω1*2+ω2*2−2ω¯*2)ε2=((1−ω1*)2+(1−ω2*)2−2(1−ω¯*)2)θ2*2+(ω1*2+ω2*2−2ω¯*2)ε2≥0, where the equality holds if and only if ω1*=ω2*. Therefore, ω1* and ω2* must be equal. Now, with b1*=b2* and ω1*=ω2*, we can switch θ1* and θ2* without increasing the objective function and violating the constraints. Thus, our claim that θ* is non-increasing is justified. Now that we have shown that the solution triplet (θ*,b*,ω*) to (A2) satisfy that θ* is non-increasing, in order to prove V1≥V2, it suffices to show that if we take θ=θ* in (A1), the minimizer b⋆ is non-increasing and b1⋆≤bmax. In fact, if so, we will have b⋆=b* as well as ω*=θj*2θj*2+ε2, and then   V1≥minb:∑j=1Nbj≤B∑j=1N(θj*4θj*2+ε22−2bj+θj*2ε2θj*2+ε2)≥V2. Let's take θ=θ* in (A1). The optimal b⋆ is non-increasing because the solution is given by the ‘reverse water-filling’ scheme and θ* is non-increasing. Next, we will show that b1⋆≤bmax. If b1⋆>bmax, then we would have for j=1,…,N  θj*4θj*2+ε22−2bj⋆≤θ1*4θ1*2+ε22−2b1⋆≤θ1*22−2bmax≤c22−4log(1/ε)=c2ε4, where the first inequality follows from the ‘reverse water-filling’ solution, and therefore   ∑j=1Nθj*4θj*2+ε22−2bj⋆≤Nc2ε4=o(ε4m2m+1), which would not give the optimal solution. Hence, b1⋆≤bmax, and this completes the proof. Lemma A.11 Suppose that Wn,ξ follows a non-central chi-square distribution with n degrees of freedom and non-centrality parameter ξ. We have for n≥5  E(Wn,ξ−1)≤1n+ξ−4 and for n≥7  E(Wn,ξ−2)≤n−4(n−6)(n+ξ−4)(n+ξ−6). Proof. It is well known that the non-central chi-square random variable Wn,ξ can be written as a Poisson-weighted mixture of central chi-square distributions, i.e., Wn,ξ~χn+2K2 with K~Poisson(ξ/2). Then   E(Wn,ξ−1)=E(E(Wn,ξ−1|K))=E(1n+2K−2)≥1n+2EK−2=1n+ξ−2, where we have used the fact that E(1/χn2)=n−2 and Jensen's inequality. Similarly, we have   E(Wn,ξ−2)=E(E(Wn,ξ−2|K))=E(1(n+2K−2)(n+2K−4))≥1(n+2EK−2)(n+2EK−4)=1(n+ξ−2)(n+ξ−4). Using the Poisson-weighted mixture representation, the following recurrence relation can be derived [7]   1=ξE(Wn+4,ξ−1)+nE(Wn+2,ξ−1)​, (A.23)  E(Wn,ξ−1)=ξE(Wn+4,ξ−2)+nE(Wn+2,ξ−2) (A.24) for n≥3. Thus,   E(Wn+4,ξ−1)=1ξ−nξE(Wn+2,ξ−1)≤1ξ−nξ1n+ξ=1n+ξ. Replacing n by n−4 proves (A.11). On the other hand, rearranging (A.23), we get   E(Wn+2,ξ−1)=1n−ξnE(Wn+4,ξ−1)≤1n−ξn1n+ξ+2=n+2n(n+ξ+2). Now using (A.24), we have   E(Wn+4,ξ−2)=1ξE(Wn,ξ−1)−nξE(Wn+2,ξ−2)≤nξ(n−2)(n+ξ)−nξ(n+ξ)(n+ξ−2)=n(n−2)(n+ξ)(n+ξ−2). Replacing n by n−4 proves (A.11). © The authors 2017. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. 
For the insufficient regime, we show that the quantized minimax risk satisfies lim_{ε→0} B_ε^{2m} R_ε(m,c,B_ε) = c^2 m^{2m}/π^{2m}.
Thus, in the insufficient regime the quantized minimax rate of convergence is Bε−2m, with optimal constant as shown above. By using an upper bound for the family of constants Qm,c,d, the three regimes can be combined together to view the risk in terms of a decomposition into estimation error and quantization error. Specifically, we can write   Rε(m,c,Bε)  ≈Pm,cε4m2m+1︸estimation error+c2m2mπ2mBε−2m︸quantization error. When Bε≫ε−22m+1, the estimation error dominates the quantization error and the usual minimax rate and constant are obtained. In the insufficient case Bε≪ε−22m+1, only a slower rate of convergence is achievable. When Bε and ε−22m+1 are comparable, the estimation error and quantization error are on the same order. The threshold ε−22m+1 should not be surprising, given that in classical unquantized estimation the minimax rate of convergence is achieved by estimating the first ε−22m+1 Fourier coefficients, and simply setting the remaining coefficients to zero. This corresponds to selecting a smoothing bandwidth that scales as h≍n−12m+1 with the sample size n. At a high level, our proof strategy integrates elements of minimax theory and source coding theory. In minimax analysis, one computes lower bounds by thinking in Bayesian terms to look for least-favorable priors. In source coding analysis, one constructs worst-case distributions by setting up an optimization problem based on mutual information. Our quantized minimax analysis requires that these approaches be carefully combined to balance the estimation and quantization errors. To show achievability of the lower bounds we establish, we likewise need to construct an estimator and coding scheme together. Our approach is to quantize the block-wise James–Stein estimator, which achieves the classical Pinsker bound. However, our quantization scheme differs from the approach taken in classical rate distortion theory, where the generation of the codebook is determined once the source distribution is known. In our setting, we require the allocation of bits to be adaptive to the data, using more bits for blocks that have larger signal size. We therefore design a quantized estimation procedure that adaptively distributes the communication budget across the blocks. Assuming only a lower bound m0 on the smoothness m and an upper bound c0 on the radius c of the Sobolev space, our quantization–estimation procedure is adaptive to m and c in the usual statistical sense, and is also adaptive to the coding regime. In other words, given a storage budget Bε, the coding procedure achieves the optimal rate and constant for the unknown m and c, operating in the corresponding regime for those parameters. In the following section, we establish some notation, outline our proof strategy and present some simple examples. In Section 3, we state and prove our main result on quantized minimax lower bounds, relegating some of the technical details to an Appendix. In Section 4, we show asymptotic achievability of these lower bounds, using a quantized estimation procedure based on adaptive James–Stein estimation and quantization in blocks, again deferring proofs of technical lemmas to the Supplementary Material. This is followed by a presentation of some results from experiments in Section 5, illustrating the performance and properties of the proposed quantized estimation procedure. 2. Quantized estimation and minimax risk Suppose that (X1,…,Xn)∈Xn is a random vector drawn from a distribution Pn. 
Consider the problem of estimating a functional θn=θ(Pn) of the distribution, assuming θn is restricted to lie in a parameter space Θn. To unclutter some of the notation, we will suppress the subscript n and write θ and Θ in the following, keeping in mind that non-parametric settings are allowed. The subscript n will be maintained for random variables. The minimax ℓ2 risk of estimating θ is then defined as   Rn(Θ)=infθ^nsupθ∈ΘEθ||θ−θ^n||2, where the infimum is taken over all possible estimators θ^n:Xn→Θ that are measurable with respect to the data X1,…,Xn. We will abuse notation by using θ^n to denote both the estimator and the estimate calculated based on an observed set of data. Among numerous approaches to obtaining the minimax risk, the Bayesian method is best aligned with quantized estimation. Consider a prior distribution π(θ) whose support is a subset of Θ. Let δ(X1:n) be the posterior mean of θ given the data X1,…,Xn, which minimizes the integrated risk. Then for any estimator θ^n,   supθ∈ΘEθ||θ−θ^n||2≥∫ΘEθ||θ−θ^n||2dπ(θ)≥∫ΘEθ||θ−δ(X1:n)||2dπ(θ). Taking the infimum over θ^n yields   infθ^nsupθ∈ΘEθ||θ−θ^n||2≥∫ΘEθ||θ−δ(X1:n)||||2dπ(θ)≜Rn(Θ;π). Thus, any prior distribution supported on Θ gives a lower bound on the minimax risk, and selecting the least-favorable prior leads to the largest lower bound provable by this approach. Now consider constraints on the storage or communication cost of our estimate. We restrict to the set of estimators that use no more than a total of Bn bits; that is, the estimator takes at most 2Bn different values. Such quantized estimators can be formulated by the following two-step procedure. First, an encoder maps the data X1:n to an index ϕn(X1:n), where   ϕn:Xn→{1,2,…,2Bn} is the encoding function. The decoder, after receiving or retrieving the index, represents the estimates based on a decoding function  ψn:{1,2,…,2Bn}→Θ mapping the index to a codebook of estimates. All that needs to be transmitted or stored is the Bn-bit-long index, and the quantized estimator θ^n is simply ψn°ϕn, the composition of the encoder and the decoder functions. Denoting by C(θ^n) the storage, in terms of the number of bits, required by an estimator θ^n, the minimax risk of quantized estimation is then defined as   Rn(Θ,Bn)=infθ^n,C(θ^n)≤Bnsupθ∈ΘEθ||θ−θ^n||2 and we are interested in the effect of the constraint on the minimax risk. Once again, we consider a prior distribution π(θ) supported on Θ, and let δ(X1:n) be the posterior mean of θ given the data. The integrated risk can then be decomposed as   ∫ΘEθ||θ−θ^n||2dπ(θ)=E||θ−δ(X1:n)+δ(X1:n)−θ^n||2=E||θ−δ(X1:n)||2+E||δ(X1:n)−θ^n||2, (2.1) where the expectation is with respect to the joint distribution of θ~π(θ) and X1:n|θ~Pθ, and the second equality is due to   E⟨θ−δ(X1:n),δ(X1:n)−θ^n⟩=E(E(⟨θ−δ(X1:n),δ(X1:n)−θ^n⟩|X1:n))=E(⟨E(θ−δ(X1:n)|X1:n),δ(X1:n)−θ^n⟩)=E(⟨0,δ(X1:n)− ^n⟩)=0 using the fact that θ→X1:n→θ^n forms a Markov chain. The first term in the decomposition (2.1) is the Bayes risk Rn(Θ;π). The second term can be viewed as the excess risk due to quantization. Let Tn=T(X1,…,Xn) be a sufficient statistic for θ. The posterior mean can be expressed in terms of Tn and we will abuse notation and write it as δ(Tn). Since the quantized estimator θ^n uses at most Bn bits, we have   Bn≥H(θ^n)≥H(θ^n)−H(θ^n|δ(Tn))=I(θ^n;δ(Tn)), where H and I denote the Shannon entropy and mutual information, respectively. 
Now consider the optimization   infP(⋅ | δ(Tn))  E||δ(Tn)−θ˜n||2such that I(θ˜n;δ(Tn))≤Bn, where the infimum is over all conditional distributions P(θ˜n|δ(Tn)). This parallels the definition of the distortion rate function, minimizing the distortion under a constraint on mutual information [12]. Denoting the value of this optimization by Qn(Θ,Bn;π), we can lower bound the quantized minimax risk by   Rn(Θ,Bn)≥Rn(Θ;π)+Qn(Θ,Bn;π). Since each prior distribution π(θ) supported on Θ gives a lower bound, we have   Rn(Θ,Bn)≥supπ{Rn(Θ;π)+Qn(Θ,Bn;π)} and the goal becomes to obtain a least favorable prior for the quantized risk. Before turning to the case of quantized estimation over Sobolev spaces, we illustrate this technique on some simpler, more concrete examples. Example 2.1 [Normal means in a hypercube] Let Xi~N(θ,σ2Id) for i=1,2,…,n. Suppose that σ2 is known and θ∈[−τ,τ]d is to be estimated. We choose the prior π(θ) on θ to be a product distribution with density   π(θ)=∏j=1d32τ3(τ−|θj|)+2. It is shown in [15] that   Rn(Θ;π)≥σ2dnτ2τ2+12σ2/n≥c1σ2dn, where c1=τ2τ2+12σ2. Turning to Qn(Θ,Bn;π), let T(n)=(T1(n),…,Td(n))=E(θ|X1:n) be the posterior mean of θ. In fact, by the independence and symmetry among the dimensions, we know T1,…,Td are independently and identically distributed. Denoting by T0(n) this common distribution, we have   Qn(Θ,Bn;π)≥d⋅q(Bn/d), where q(B) is the distortion rate function for T0(n), i.e., the value of the following problem   infP(T^ | T0(n))  E(T0(n)−T^)2such that I(T^;T0(n))≤B. Now using the Shannon lower bound [8], we get   Qn(Θ,Bn;π)≥d2πe⋅2h(T0(n))⋅2−2Bnd. Note that as n→∞, T0(n) converges to θ in distribution, so there exists a constant c2 independent of n and d such that   Rn(Θ,Bn)≥c1σ2dn+c2d2−2Bnd. This lower bound intuitively shows the risk is regulated by two factors, the estimation error and the quantization error; whichever is larger dominates the risk. The scaling behavior of this lower bound (ignoring constants) can be achieved by first quantizing each of the d intervals [−τ,τ] using Bn/d bits each, and then mapping the Maximum likelihood estimator (MLE) to its closest codeword. Example 2.2 [Gaussian sequences in Euclidean balls] In the example shown above, the lower bound is tight only in terms of the scaling of the key parameters. In some instances, we are able to find an asymptotically tight lower bound for which we can show achievability of both the rate and the constants. Estimating the mean vector of a Gaussian sequence with an ℓ2 norm constraint on the mean is one of such case, as we showed in previous work [27]. Specifically, let Xi~N(θi,σn2) for i=1,2,…,n, where σn2=σ2/n. Suppose that the parameter θ=(θ1,…,θn) lies in the Euclidean ball Θn(c)={θ:∑i=1nθi2≤c2}. Furthermore, suppose that Bn=nB. Then using the prior θi~N(0,c2), it can be shown that   liminfn→∞Rn(Θn(c),Bn)≥σ2c2σ2+c2+c42−2Bσ2+c2. The asymptotic estimation error σ2c2/(σ2+c2) is the well-known Pinsker bound for the Euclidean ball case. As shown in [27], an explicit quantization scheme can be constructed that asymptotically achieves this lower bound, realizing the smallest possible quantization error c42−2B/(σ2+c2) for a budget of Bn=nB bits. The Euclidean ball case is clearly relevant to the Sobolev ellipsoid case, but new coding strategies and proof techniques are required. In particular, as will be made clear in the sequel, we will use an adaptive allocation of bits across blocks of coefficients, using more bits for blocks that have larger estimated signal size. 
Moreover, determination of the optimal constants requires a detailed analysis of the worst-case prior distributions and the solution of a series of variational problems. 3. Quantized estimation over Sobolev spaces Recall that the Sobolev space of order mand radius c is defined by   W(m,c)={f∈[0,1]→ℝ:f(m−1) is absolutely continuous and∫01(f(m)(x))2dx≤c2}. The periodic Sobolev space is defined by   W˜(m,c)={f∈W(m,c):f(j)(0)=f(j)(1), j=0,1,…,m−1}​. (3.1) The white noise model (1.1) is asymptotically equivalent to making n equally spaced observations along the sample path, Yi=f(i/n)+σϵi, where ϵi~N(0,1) [4]. In this formulation, the noise level in the formulation (1.1) scales as ϵ2=σ2/n, and the rate of convergence takes the familiar form n−2m2m+1, where n is the number of observations. To carry out quantized estimation, we now require an encoder   ϕε:ℝ[0,1]→{1,2,…,2Bε}, which is a function applied to the sample path X(t). The decoding function then takes the form   ψε:{1,2,…,2Bε}→ℝ[0,1] and maps the index to a function estimate. As in the previous section, we write the composition of the encoder and the decoder as f^ε=ψε°ϕε, which we call the quantized estimator. The communication or storage C(f^ε) required by this quantized estimator is no more than Bε bits. To recast quantized estimation in terms of an infinite sequence model, let (φj)j=1∞ be the trigonometric basis, and let   θj=∫01φj(t)f(t)dt, j=1,2,…, be the Fourier coefficients. It is well known [22] that f=∑j=1∞θjφj belongs to W˜(m,c) if and only if the Fourier coefficients θ belong to the Sobolev ellipsoid defined as   Θ(m,c)={θ∈ℓ2:∑j=1∞aj2θj2≤c2π2m}, (3.2) where   aj={jm,for even j,(j−1)m,for odd j. Although this is the standard definition of a Sobolev ellipsoid, for the rest of the paper we will set aj=jm, j=1,2,… for convenience of analysis. All of the results hold for both definitions of aj. Also note that (3.2) actually gives a more general definition, since m is no longer assumed to be an integer, as it is in (3.1). Expanding with respect to the same orthonormal basis, the observed path X(t) is converted into an infinite Gaussian sequence   Yj=∫01φj(t)dX(t), j=1,2,… with Yj~N(θj,ε2). For an estimator (θ^j)j=1∞ of (Yj)j=1∞, an estimate of f is obtained by   f^(x)=∑j=1∞θ^jφj(x) with squared error ||f^−f||22=||θ^−θ||22. In terms of this standard reduction, the quantized minimax risk is thus reformulated as   Rε(m,c,Bε)=infθ^ε,C(θ^ε)≤Bεsupθ∈Θ(m,c)Eθ||θ−θ^ε||22. (3.3) To state our result, we need to define the value of the following variational problem:   Vm,c,d≜max(σ2,x0)∈F(m,c,d)∫0x0σ2(x)σ2(x)+1dx+x0exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0), (3.4) where the feasible set F(m,c,d) is the collection of increasing functions σ2(x) and values x0, satisfying   ∫0x0x2mσ2(x)dx≤c2σ4(x)σ2(x)+1≥exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0) for all x≤x0. The significance and interpretation of the variational problem will become apparent as we outline the proof of this result. Theorem 3.1 Let Rε(m,c,Bε) be defined as in (3.3) for m>0 and c>0. (i) If Bεε22m+1→∞ as ε→0, then   liminfε→0ε−4m2m+1Rε(m,c,Bε)≥Pm,c, where Pm,c is Pinker's constant defined in (1.2). (ii) If Bεε22m+1→d for some constant d as ε→0, then   liminfε→0ε−4m2m+1Rε(m,c,Bε)≥Pm,c+Qm,c,d=Vm,c,d, where Vm,c,d is the value of the variational problem (3.4). (iii) If Bεε22m+1→0 and Bε→∞ as ε→0, then   liminfε→0Bε2mRε(m,c,Bε)≥c2m2mπ2m. 
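Before turning to the discussion of the three regimes, the following small numerical sketch (ours, relying only on the closed-form expressions quoted above) evaluates the two explicit constants in Theorem 3.1, namely Pinsker's constant Pm,c from (1.2) and the insufficient-regime constant c²m^{2m}/π^{2m}, and compares the corresponding error terms for finite ε and Bε. The constant for the sufficient regime in part (ii) requires solving the variational problem (3.4) and is not computed here.

```python
import numpy as np

def pinsker_constant(m, c):
    # Pinsker's constant P_{m,c} as displayed in (1.2).
    return ((c ** 2 * (2 * m + 1)) / np.pi ** (2 * m)) ** (1.0 / (2 * m + 1)) \
           * (m / (m + 1)) ** (2.0 * m / (2 * m + 1))

def theorem31_terms(eps, B, m, c):
    """Estimation-error term P_{m,c} eps^{4m/(2m+1)} (the rate in parts (i)-(ii)) and
    quantization-error term (c^2 m^{2m} / pi^{2m}) B^{-2m} (the constant in part (iii))."""
    est = pinsker_constant(m, c) * eps ** (4.0 * m / (2 * m + 1))
    quant = (c ** 2 * m ** (2 * m) / np.pi ** (2 * m)) * B ** (-2.0 * m)
    return est, quant

m, c, eps = 2.0, 1.0, 1e-3
n_coef = eps ** (-2.0 / (2 * m + 1))   # ~ number of coefficients kept at the classical minimax rate
for B in [0.1 * n_coef, n_coef, 10 * n_coef]:
    est, quant = theorem31_terms(eps, B, m, c)
    print(f"B = {B:8.1f}  estimation term = {est:.3e}  quantization term = {quant:.3e}")
```

With these toy values, the two terms are of the same order when Bε is comparable to ε^{-2/(2m+1)}, consistent with the regime boundaries above.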
In the first regime where the number of bits Bε is much greater than ε−22m+1, we recover the same convergence result as in Pinsker's theorem, in terms of both convergence rate and leading constant. The proof of the lower bound for this regime can directly follow the proof of Pinsker's theorem, since the set of estimators considered in our minimax framework is a subset of all possible estimators. In the second regime where we have ‘just enough’ bits to preserve the rate, we suffer a loss in terms of the leading constant. In this ‘Goldilocks regime’, the optimal rate ε−4m2m+1 is achieved, but the constant in front of the rate is Pinsker's constant Pm,c plus a positive quantity Qm,c,d determined by the variational problem. While the solution to this variational problem does not appear to have an explicit form, it can be computed numerically. We discuss this term at length in the sequel, where we explain the origin of the variational problem, compute the constant numerically and approximate it from above and below. The constants Pm,c and Qm,c,d are shown graphically in Fig. 1. Note that the parameter d can be thought of as the average number of bits per coefficient used by an optimal quantized estimator, since ε−22m+1 is asymptotically the number of coefficients needed to estimate at the classical minimax rate. As shown in Fig. 1, the constant for quantized estimation quickly approaches the Pinsker constant as d increases—when d=3, the two are already very close. Fig. 1. View largeDownload slide The constants Pm,c+Qm,c,d as a function of quantization level d in the sufficient regime, where Bεε22m+1→d. The parameter d can be thought of as the average number of bits per coefficient used by an optimal quantized estimator, because ε−22m+1 is asymptotically the number of coefficients needed to estimate at the classical minimax rate. Here, we take m=2 and c2/π2m=1. The curve indicates that with only two bits per coefficient, optimal quantized minimax estimation degrades by less than a factor of 2 in the constant. With three bits per coefficient, the constant is very close to the classical Pinsker constant. Fig. 1. View largeDownload slide The constants Pm,c+Qm,c,d as a function of quantization level d in the sufficient regime, where Bεε22m+1→d. The parameter d can be thought of as the average number of bits per coefficient used by an optimal quantized estimator, because ε−22m+1 is asymptotically the number of coefficients needed to estimate at the classical minimax rate. Here, we take m=2 and c2/π2m=1. The curve indicates that with only two bits per coefficient, optimal quantized minimax estimation degrades by less than a factor of 2 in the constant. With three bits per coefficient, the constant is very close to the classical Pinsker constant. In the third regime, where the communication budget is insufficient for the estimator to achieve the optimal rate, we obtain a suboptimal rate which no longer depends explicitly on the noise level ε of the model. In this regime, quantization error dominates, and the risk decays at a rate of B−12m no matter how fast ε approaches zero, as long as B≪ε−22m+1. Here the analog of Pinsker's constant takes a very simple form. Proof of Theorem 3.1. Consider a Gaussian prior distribution on θ=(θj)j=1∞ with θj~N(0,σj2) for j=1,2,…, in terms of parameters σ2=(σj2)j=1∞ to be specified later. One requirement for the variances is   ∑j=1∞aj2σj2≤c2π2m. 
We denote this prior distribution by π(θ;σ2) and shown in Section A that it is asymptotically concentrated on the ellipsoid Θ(m,c). Under this prior the model is   θj~N(0,σj2)Yj|θj~N(θj,ε2), j=1,2,… and the marginal distribution of Yj is thus N(0,σj2+ε2). Following the strategy outlined in Section 2, let δ denote the posterior mean of θ given Y under this prior, and consider the optimization   inf  E||δ−θ˜||2such that I(δ;θ˜)≤Bϵ, where the infimum is over all distributions on θ˜ such that θ→Y→θ˜ forms a Markov chain. Now, the posterior mean satisfies δj=γjYj, where γj=σj2/(σj2+ϵ2). Note that the Bayes risk under this prior is   E||θ−δ||22=∑j=1∞σj2ε2σj2+ε2. Define   μj2≜E(δj−θ˜j)2. Then the classical rate distortion argument [8] gives that   I(δ;θ˜)≥∑j=1∞I(γjYj;θ˜j) ≥∑j=1∞12log+(γj2(σj2+ε2)μj2) =∑j=1∞12log+(σj4μj2(σj2+ε2)), where log+(x)=max(logx,0). Therefore, the quantized minimax risk is lower bounded by   Rε(m,c,Bε)=infθ^ε,C(θ^ε)≤Bεsupθ∈Θ(m,c)E||θ−θ^ε||2≥Vε(Bε,m,c)(1+o(1)), where Vε(Bε,m,c) is the value of the optimization   maxσ2minμ2  ∑j=1∞μj2+∑j=1∞σj2ε2σj2+ε2such that ∑j=1∞12log+(σj4μj2(σj2+ε2))≤Bε∑j=1∞aj2σj2≤c2π2m (P1) and the (1+o(1)) deviation term is analyzed in the Supplementary Material. Observe that the quantity Vε(Bε,m,c) can be upper and lower bounded by   max{Rε(m,c),Qε(m,c,Bε)}≤Vε(m,c,Bε)≤Rε(m,c)+Qε(m,c,Bε), (3.5) where the estimation error term Rε(m,c) is the value of the optimization   maxσ2 ∑j=1∞σj2ε2σj2+ε2such that ∑j=1∞aj2σj2≤c2π2m (R1) and the quantization error term Qε(m,c,Bε) is the value of the optimization   maxσ2minμ2∑j=1∞μj2such that∑j=1∞12log+(σj4μj2(σj2+ε2))≤Bε∑j=1∞aj2σj2≤c2π2m. (Q1) The following results specify the leading order asymptotics of these quantities. Lemma 3.1 As ε→0,   Rε(m,c)=Pm,cε4m2m+1(1+o(1)). Lemma 3.2 As ε→0,   Qε(m,c,Bε)≤c2m2mπ2mBε−2m(1+o(1)). (3.6) Moreover, if Bεε22m+1→0 and Bε→∞,   Qε(m,c,Bε)=c2m2mπ2mBε−2m(1+o(1)). This yields the following closed form upper bound. Corollary 3.1 Suppose that Bε→∞ and ε→0. Then   Vε(m,c,Bε)≤(Pm,cε4m2m+1+c2m2mπ2mBε−2m)(1+o(1)). (3.7) In the insufficient regime Bεε22m+1→0 and Bε→∞ as ε→0, equation (3.5) and Lemma 3.2 show that   Vε(m,c,Bε)=c2m2mπ2mBε−2m(1+o(1)). Similarly, in the over sufficient regime Bεε22m+1→∞ as ε→0, we conclude that   Vε(m,c,Bε)=Pm,cε4m2m+1(1+o(1)). We now turn to the sufficient regime Bεε22m+1→d. We begin by making three observations about the solution to the optimization (P1). First, we note that the series (σj2)j=1∞ that solves (P1) can be assumed to be decreasing. If (σj2) were not in decreasing order, we could rearrange it to be decreasing, and correspondingly rearrange (μj2), without violating the constraints or changing the value of the optimization. Secondly, we note that given (σj2), the optimal (μj2) is obtained by the ‘reverse water-filling’ scheme [8]. Specifically, there exists η>0 such that   μj2={η if σj4σj2+ε2≥ησj4σj2+ε2 otherwise, where η is chosen so that   12∑j=1∞log+(σj4μj2(σj2+ε2))≤Bε. Thirdly, there exists an integer J>0 such that the optimal series (σj2) satisfies   σj4σj2+ε2≥η, for j=1,…,J and σj2=0, for j>J, where η is the ‘water-filling level’ for (μj2) (see [8]). Using these three observations, the optimization (P1) can be reformulated as   maxσ2,JJη+∑j=1Jσj2ε2σj2+ε2such that 12∑j=1Jlog+(σj4η(σj2+ε2))=Bε∑j=1Jaj2σj2≤c2π2m(σj2) is decreasing and σJ4σJ2+ε2≥η.  (P2) To derive the solution to (P2), we use a continuous approximation of σ2, writing   σj2=σ2(jh)h2m+1, where h is the bandwidth to be specified and σ2(⋅) is a function defined on (0,∞). 
The constraint that ∑j=1∞aj2σj2≤c2π2m becomes the integral constraint [19]   ∫0∞x2mσ2(x)dx≤c2π2m. We now set the bandwidth so that h2m+1=ε2. This choice of bandwidth will balance the two terms in the objective function and thus gives the hardest prior distribution. Applying the above three observations under this continuous approximation, we transform problem (P2) to the following optimization:   maxσ2,x0x0η+∫0x0σ2(x)σ2(x)+1dxsuch that ∫0x012log+(σ4(x)η(σ2(x)+1))=d∫0x0x2mσ2(x)dx≤c2π2mσ2(x) is decreasing and σ4(x)σ2(x)+1≥η for all x≤x0. (P3) Note that here we omit the convergence rate h2m=ε4m2m+1 in the objective function. The asymptotic equivalence between )P2) and (P3) can be established by a similar argument to Theorem 3.1 in [9]. Solving the first constraint for η yields    maxσ2,x0∫0x0σ2(x)σ2(x)+1dx+x0exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0)such that ∫0x0x2mσ2(x)dx≤c2π2mσ2(x) is decreasing σ4(x)σ2(x)+1≥exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0) (P4) for all x≤x0. The following is proved using a variational argument in the Supplementary Material. Lemma 3.3 The solution to (P4) satisfies   1(σ2(x)+1)2+exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0)σ2(x)+2σ2(x)(σ2(x)+1)=λx2m for some λ>0. Fixing x0, the lemma shows that by setting   α=exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0) we can express σ2(x) implicitly as the unique positive root of a third-order polynomial in y,   λx2my3+(2λx2m−α)y2+(λx2m−3α−1)y−2α. This leads us to an explicit form of σ2(x) for a given value α. However, note that α still depends on σ2(x) and x0, so the solution σ2(x) might not be compatible with α and x0. We can either search through a grid of values of α and x0, or, more efficiently, use an iterative method to find the pair of values that gives us the solution. We omit the details on how to calculate the values of the optimization, as it is not main purpose of this article. To summarize, in the regime Bεε22m+1→d as ε→0, we obtain   Vε(m,c,Bε)=(Pm,c+Qm,c,d)ε4m2m+1(1+o(1)), where we denote by Pm,c+Qm,c,d the values of the optimization (P4). 4. Achievability In this section, we show that the lower bounds in Theorem 3.1 are achievable by a quantized estimator using a random coding scheme. The basic idea of our quantized estimation procedure is to conduct block-wise estimation and quantization together, using a quantized form of James–Stein estimator. Before we present a quantized form of the James–Stein estimator, let us first consider a class of simple procedures. Suppose that θ^=θ^(X) is an estimator of θ∈Θ(m,c) without quantization. We assume that θ^∈Θ(m,c), as projection always reduces mean squared error. To design a B-bit quantized estimator, let Θˇ be the optimal δ-covering of the parameter space Θ(m,c) such that |Θˇ|≤2B, that is,   δ=δ(B)=infΘ⌣⊂Θ:|Θ⌣|≤2Bsupθ∈Θinfθ′∈Θ⌣||θ−θ′||. The quantized estimator is then defined to be   θˇ=θˇ(X)=argminθ′∈Θˇ||θ^(X)−θ′||. Now the mean squared error satisfies   Eθ||θˇ−θ||2=Eθ||θˇ−θ^+θ^−θ||2≤2Eθ||θ^−θ||2+2Eθ||θˇ−θ^||2≤2supθ′Eθ′||θ^−θ′||2+2δ(B)2. If we pick θ^ to be a minimax estimator for Θ, the first term above gives the minimax risk for estimating θ in the parameter space Θ. The second term is closely related to the metric entropy of the parameter space Θ(m,c). In fact, for the Sobolev ellipsoid Θ(m,c), it is shown in [9] that δ(B)2=c2m2mπ2mB−2m(1+o(1)) as B→∞. Thus, with an extra constant factor of 2, the mean squared error of this quantized estimator is decomposed into the minimax risk for Θ and an error term due to quantization. 
In addition to the fact that this procedure does not achieve the exact lower bound of the minimax risk for the constrained estimation problem, it is not clear how such an ε-net can be generated. In what follows, we will describe a quantized estimation procedure that we will show achieves the lower bound with the exact constants, and that also adapts to the unknown parameters of the Sobolev space. We begin by defining the block system to be used, which is usually referred to as the weakly geometric system of blocks [22]. Let Nε=⌊1/ε2⌋ and ρε=(log(1/ε))−1. Let J1,…,JK be a partition of the set {1,…,Nε} such that   ∪k=1KJk={1,…,Nε}, Jk1∩Jk2=Ø for k1≠k2and  min{j:j∈Jk}>max{j:j∈Jk−1}. Let Tk be the cardinality of the kth block and suppose that T1,…,Tk satisfy   T1=⌈ρε−1⌉=⌈log(1/ε)⌉,T2=⌊T1(1+ρε)⌋,⋮TK−1=⌊T1(1+ρε)K−2⌋,TK=Nε−∑k=1K−1Tk. (4.1) Then K≤Clog2(1/ε) (see Lemma A.4). For an infinite sequence x∈ℓ2, denote by x(k) the vector (xj)j∈Jk∈ℝTk. We also write jk=∑l=1k−1Tl+1, which is the smallest index in block Jk. The weakly geometric system of blocks is defined such that the size of the blocks does not grow too quickly (the ratio between the sizes of the neighboring two blocks goes to 1 asymptotically), and that the number of the blocks is on the logarithmic scale with respect to 1/ε ( K≲log2(1/ε)). See Lemma A.4. We are now ready to describe the quantized estimation scheme. We first give a high-level description of the scheme and then the precise specification. In contrast to rate distortion theory, where the codebook and allocation of the bits are determined once the source distribution is known, here the codebook and allocation of bits are adaptive to the data—more bits are used for blocks having larger signal size. The first step in our quantization scheme is to construct a ‘base code’ of 2Bε randomly generated vectors of maximum block length TK, with N(0,1) entries. The base code is thought of as a 2Bε×TK random matrix Z; it is generated before observing any data, and is shared between the sender and receiver. After observing data (Yj), the rows of Z are apportioned to different blocks k=1,…,K, with more rows being used for blocks having larger estimated signal size. To do so, the norm ||Y(k)|| of each block k is first quantized as a discrete value Sˇk. A subcodebook Zk is then constructed by normalizing the appropriate rows and the first Tk columns of the base code, yielding a collection of random points on the unit sphere STk−1. To form a quantized estimate of the coefficients in the block, the codeword Zˇ(k)∈Zk having the smallest angle to Y(k) is then found. The appropriate indices are then transmitted to the receiver. To decode and reconstruct the quantized estimate, the receiver first recovers the quantized norms (Sˇk), which enables reconstruction of the subdivision of the base code that was used by the encoder. After extracting for each block k the appropriate row of the base code, the codeword Zˇ(k) is reconstructed and a James–Stein type estimator is then calculated. The quantized estimation scheme is detailed below. Step 1. Base code generation. 1.1. Generate codebook Sk={Tkε2+iε2: i=0,1,…,sk}, where sk=⌈ε−2c(jkπ)−m⌉ for k=1,…,K. 1.2. Generate base code Z, a 2B×TK matrix with i.i.d. N(0,1) entries. (Sk) and Z are shared between the encoder and the decoder, before seeing any data. Step 2. Encoding. 2.1. Encoding block radius. For k=1,…,K, encode Sˇk=argmin{|s−Sk|:s∈Sk}, where   Sk={Tkε2if ||Y(k)||<Tkε2Tkε2+c(jkπ)−mif ||Y(k)||>Tkε2+c(jkπ)−m||Y(k)||otherwise. 2.2. Allocation of bits. 
Let (b˜k)k=1K be the solution to the optimization   minb¯∑k=1K(Sˇk2−Tkε2)2Sˇk2⋅2−2b¯ksuch that ∑k=1KTkb¯k≤B, b¯k≥0. (4.2) 2.3. Encoding block direction. Form the data-dependent codebook as follows. Divide the rows of Z into blocks of sizes 2⌈T1b˜1⌉,…,2⌈TKb˜K⌉. Based on the kth block of rows, construct the data-dependent codebook Z˜k by keeping only the first Tk entries and normalizing each truncated row; specifically, the jth row of Z˜k is given by   Z˜k,j=Zi,1:Tk||Zi,1:Tk||∈STk−1, where i is the appropriate row of the base code Z and Zi,1:t denotes the first t entries of the row vector. A graphical illustration is shown below in Fig. 2. With this data-dependent codebook, encode   Zˇ(k)=argmax{⟨z,Y(k)⟩:z∈Z˜k} for k=1,…,K. Step 3. Transmission. Transmit or store (Sˇk)k=1K and (Zˇ(k))k=1K by their corresponding indices. Step 4. Decoding and Estimation. 4.1. Recover (Sˇk) based on the transmitted or stored indices and the common codebook (Sk). 4.2. Solve (4.2) and get (b˜k). Reconstruct (Z˜k) using Z and (b˜k). 4.3. Recover (Zˇ(k)) based on the transmitted or stored indices and the reconstructed codebook (Z˜k). 4.4. Estimate θ(k) by   θˇ(k)=Sˇk2−Tkε2Sˇk1−2−2b˜k⋅Zˇ(k). 4.5. Estimate the entire vector θ by concatenating the θˇ(k) vectors and padding with zeros; thus,   θˇ=(θˇ(1),…,θˇ(K),0,0,…). The following theorem establishes the asymptotic optimality of this quantized estimator. Fig. 2. View largeDownload slide An illustration of the data-dependent codebook. The big matrix represents the base code Z, and the shaded areas are (Z˜k), sub-matrices of size Tk×2⌈Tkb˜k⌉ with rows normalized. Fig. 2. View largeDownload slide An illustration of the data-dependent codebook. The big matrix represents the base code Z, and the shaded areas are (Z˜k), sub-matrices of size Tk×2⌈Tkb˜k⌉ with rows normalized. Theorem 4.1 Let θˇ be the quantized estimator defined above. (i) If Bε22m+1→∞, then   limε→0ε−4m2m+1supθ∈Θ(m,c)E||θ−θˇ||2=Pm,c. (ii) If Bε22m+1→d for some constant d as ε→0, then   limε→0ε−4m2m+1supθ∈Θ(m,c)E||θ−θˇ||2=Pm,c+Qd,m,c. (iii) If Bε22m+1→0 and B(log(1/ε))−3→∞, then   limε→0B2msupθ∈Θ(m,c)E||θ−θˇ||2=c2m2mπ2m. The expectations are with respect to the random quantized estimation scheme Q and the distribution of the data. We pause to make several remarks on this result before outlining the proof. Remark The total number of bits used by this quantized estimation scheme is   ∑k=1K⌈Tkb˜k⌉+∑k=1Klog⌈ε−2c(jkπ)−m⌉≤∑k=1K⌈Tkb˜k⌉+∑k=1Klog⌈ε−2c⌉≤B+K+2Kρε−1+Klog⌈c⌉=B+O((log(1/ε))3), where we use the fact that K≲log2(1/ε2) (see Lemma A.4). Therefore, as long as B(log(1/ε))−3→∞, the total number of bits used is asymptotically no more than B, the given communication budget. Remark 4.2 The quantized estimation scheme does not make essential use of the parameters of the Sobolev space, namely the smoothness m and the radius c. The only exception is that in Step 1.1 the size of the codebook Sk depends on m and c. However, suppose that we know a lower bound on the smoothness m, say m≥m0, and an upper bound on the radius c, say c≤c0. By replacing m and c by m0 and c0, respectively, we make the codebook independent of the parameters. We shall assume m0>1/2, which leads to continuous functions. This modification does not, however, significantly increase the number of bits; in fact, the total number of bits is still B+O(ρε−3). Thus, we can easily make this quantized estimator minimax adaptive to the class of Sobolev ellipsoids {Θ(m,c):m≥m0, c≤c0}, as long as B grows faster than (log(1/ε))3. 
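Before stating the adaptivity claim formally (Corollary 4.1 below), it may help to see the scheme end to end in code. The following is a minimal numerical sketch, not the authors' implementation: the bit allocation (4.2) is solved by a crude water-filling bisection, the radius grid of Step 1.1 is replaced by clipping and rounding to multiples of ε², the per-block codebook size is capped so the demo runs in reasonable time, the arguments c and m play the role of the bounds c0 and m0 of Remark 4.2, and all function names (weakly_geometric_blocks, allocate_bits, encode_decode) are ours.

```python
import numpy as np

def weakly_geometric_blocks(eps):
    """Block sizes T_1,...,T_K of (4.1), partitioning {1,...,N_eps}."""
    N = int(np.floor(1.0 / eps ** 2))
    rho = 1.0 / np.log(1.0 / eps)
    T1 = int(np.ceil(np.log(1.0 / eps)))
    sizes, nxt = [], T1
    while sum(sizes) + nxt <= N:
        sizes.append(nxt)
        nxt = int(np.floor(T1 * (1.0 + rho) ** len(sizes)))
    if N - sum(sizes) > 0:
        sizes.append(N - sum(sizes))       # the last block takes the remainder
    return np.array(sizes)

def allocate_bits(v, T, B):
    """Water-filling solution of (4.2): minimize sum_k v_k 2^(-2 b_k)
    subject to sum_k T_k b_k <= B and b_k >= 0, found by bisection on
    the Lagrange multiplier."""
    def bits(mu):
        return np.maximum(0.0, 0.5 * np.log2(np.maximum(v, 1e-300) / (mu * T)))
    lo, hi = 1e-12, max(1.0, v.max() / T.min())
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        if np.sum(T * bits(mid)) > B:
            lo = mid                       # over budget: raise the water level
        else:
            hi = mid
    return bits(hi)

def encode_decode(Y, eps, B, c=1.0, m=1.0, seed=0):
    """Quantize the first N_eps = floor(1/eps^2) coefficients of Y with a
    total budget of about B bits (Y must have length at least N_eps)."""
    rng = np.random.default_rng(seed)
    T = weakly_geometric_blocks(eps)
    edges = np.concatenate(([0], np.cumsum(T)))
    jk = edges[:-1] + 1                    # smallest (1-based) index in each block
    # Step 2.1: quantized block radii (here: clip, then round to the eps^2 grid).
    S = np.empty(len(T))
    for k in range(len(T)):
        r = np.linalg.norm(Y[edges[k]:edges[k + 1]])
        r = np.clip(r, np.sqrt(T[k]) * eps,
                    np.sqrt(T[k]) * eps + c * (jk[k] * np.pi) ** (-m))
        S[k] = max(np.round(r / eps ** 2) * eps ** 2, np.sqrt(T[k]) * eps)
    # Step 2.2: data-dependent allocation of the bit budget across blocks.
    v = (S ** 2 - T * eps ** 2) ** 2 / S ** 2
    b = allocate_bits(v, T.astype(float), float(B))
    # Steps 2.3 and 4: random spherical codebooks, nearest-angle encoding,
    # and the James-Stein-type reconstruction of Step 4.4.
    theta_check = np.zeros_like(Y)
    for k in range(len(T)):
        y_blk = Y[edges[k]:edges[k + 1]]
        n_cw = int(2.0 ** min(T[k] * b[k], 18.0))        # cap the codebook for the demo
        Z = rng.standard_normal((max(n_cw, 1), T[k]))
        Z /= np.linalg.norm(Z, axis=1, keepdims=True)    # rows uniform on the unit sphere
        z_best = Z[np.argmax(Z @ y_blk)]                 # codeword with the smallest angle
        scale = (S[k] ** 2 - T[k] * eps ** 2) / S[k] * np.sqrt(1.0 - 2.0 ** (-2.0 * b[k]))
        theta_check[edges[k]:edges[k + 1]] = scale * z_best
    return theta_check                     # indices beyond N_eps stay at zero (Step 4.5)
```

For instance, with eps = 0.1 the call encode_decode(Y, eps=0.1, B=30) quantizes the first 100 coefficients of Y with a total budget of roughly 30 bits; in the notation of Step 4.5, the returned vector is the concatenation of the blockwise estimates padded with zeros.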
More formally, we have Corollary 4.1 Suppose that Bε satisfies Bε(log(1/ε))−3→∞. Let θˇ' be the quantized estimator with the modification described above, which does not assume knowledge of m and c. Then for m≥m0 and c≤c0,   limε→0supθ∈Θ(m,c)E||θ−θ'ˇ||2infθ^,C(θ^)≤Bsupθ∈Θ(m,c)E||θ−θ^||2=1, where the expectation in the numerator is with respect to the data and the randomized coding scheme, while the expectation in the denominator is only with respect to the data. Remark 4.4 When B grows at a rate comparable to or slower than (log(1/ε))3, the lower bound is still achievable, just no longer by the quantized estimator we described above. The main reason is that when B does not grow faster than log(1/ε)3, the block size T1=⌈log(1/ε)⌉ is too large. The blocking needs to be modified to get achievability in this case. Remark 4.4 In classical rate distortion [8,12], the probabilistic method applied to a randomized coding scheme shows the existence of a code achieving the rate distortion bounds. Comparing to Theorem 3.1, we see that the expected risk, averaged over the randomness in the codebook, similarly achieves the quantized minimax lower bound. However, note that the average over the codebook is inside the supremum over the Sobolev space, implying that the code achieving the bound may vary over the ellipsoid. In other words, while the coding scheme generates a codebook that is used for different θ, it is not known whether there is one code generated by this randomized scheme that is ‘universal’, and achieves the risk lower bound with high probability over the ellipsoid. The existence or non-existence of such ‘universal codes’ is an interesting direction for further study. Remark We have so far dealt with the periodic case, i.e., functions in the periodic Sobolev space W˜(m,c) defined in (3.1). For the Sobolev space W(m,c), where the functions are not necessarily periodic, the lower bound given in Theorem 3.1 still holds, since W˜(m,c) is a subset of the larger class W(m,c). To extend the achievability result to W(m,c), we again need to relate W(m,c) to an ellipsoid. Nussbaum [17] shows using spline theory that the non-periodic space can actually be expressed as an ellipsoid, where the length of the jth principal axis scales as (π2j)m asymptotically. Based on this link between W(m,c) and the ellipsoid, the techniques used here to show achievability apply, and since the principal axes scale as in the periodic case, the convergence rates remain the same. Proof of Theorem 4.1 We now sketch the proof of Theorem 4.1, deferring the full details to Section A. To provide only an informal outline of the proof, we shall write A1≈A2 as a shorthand for A1=A2(1+o(1)), and A1≲A2 for A1≤A2(1+o(1)), without specifying here what these o(1) terms are. To upper bound the risk E||θˇ−θ||2, we adopt the following sequence of approximations and inequalities. First, we discard the components whose index is greater than N and show that   E||θˇ−θ||2≈E∑k=1K||θˇ(k)−θ(k)||2. Since Sˇk is close enough to Sk, we can then safely replace θˇ(k) by θ^(k)=Sk2−Tkε2Sk1−2−2b˜(k)⋅Zˇ(k) and obtain   ≈E∑k=1K||θ^(k)−θ(k)||2. Writing λk=Sk2−Tkε2Sk2, we further decompose the risk into   =E∑k=1K(||θ^(k)−λkY(k)||2+||λkY(k)−θ(k)||2+2⟨θ^(k)−λkY(k),λkY(k)−θ(k)⟩). Conditioning on the data Y and taking the expectation with respect to the random codebook yields   ≲E∑k=1K((Sk2−Tkε2)2Sk22−2b˜k+||λkY(k)−θ(k)||2)​. 
By two oracle inequalities upper bounding the expectations with respect to the data, and the fact that b˜ is the solution to (4.2),   ≲minb∈Πblk(B)∑k=1K(||θ(k)||4||θ(k)||2+Tkε22−2b¯k+||θ(k)||2Tkε2||θ(k)||2+Tkε2). Showing that the block-wise constant oracles are almost as good as the monotone oracle, we get for some B′≈B  ≲minb∈Πmon(B′), ω∈Ωmon∑j=1N(θj4θj2+ε22−2bj+(1−ωj)2θj2+ωj2ε2), where Πblk(B), Πmon(B) are the classes of block-wise constant and monotone allocations of the bits defined in (A.8), (A.9) and Ωmon is the class of monotone weights defined in (A.11). The proof is then completed by Lemma A.9, showing that the last quantity is equal to Vε(m,c,B).

5. Simulations

Here we illustrate the performance of the proposed quantized estimation scheme. We use the function   f(x)=x(1−x)sin(2.1π/(x+0.3)), 0≤x≤1, which we shall refer to as the ‘damped Doppler function’, shown in Fig. 3 (the gray lines). Note that the value 0.3 differs from the value 0.05 in the usual Doppler function used to illustrate spatial adaptation of methods such as wavelets. Since we do not address spatial adaptivity in this article, we ‘slow’ the oscillations of the Doppler function near zero in our illustrations.

Fig. 3. The damped Doppler function (solid) and typical realizations of the estimators under different noise levels (n=500, 5000 and 50000). Three estimators are used: the block-wise James–Stein estimator (dashed black) and two quantized estimators with budgets of 5 bits (dashed red) and 30 bits (dashed blue).

We use this f as the underlying true mean function and generate our data according to the corresponding white noise model (1.1),   dX(t)=f(t)dt+εdW(t), 0≤t≤1. We apply the block-wise James–Stein estimator, as well as the proposed quantized estimator with different communication budgets. We also vary the noise level ε and, equivalently, the effective sample size n=1/ε2. We first show in Fig. 3 some typical realizations of these estimators on data generated under different noise levels (n=500, 5000 and 50000, respectively). To keep the plots succinct, we show only the true function, the block-wise James–Stein estimates and quantized estimates using total bit budgets of 5 and 30 bits. We observe, in the first plot, that both quantized estimates deviate from the true function, and so does the block-wise James–Stein estimate. This is the case where the noise is relatively large and any quantized estimate performs poorly, no matter how large a budget is given. Both 5 bits and 30 bits appear to be ‘sufficient/over-sufficient’ here. In the second plot, the block-wise James–Stein estimate is close to the quantized estimate with a budget of 30 bits, whereas with a budget of 5 bits the quantized estimate fails to capture the fluctuations of the true function. Thus, a budget of 30 bits is still ‘sufficient’, but 5 bits has apparently become ‘insufficient’. In the third plot, the block-wise James–Stein estimate gives a better fit than the two quantized estimates, as both budgets become ‘insufficient’ to achieve the optimal risk. Next, in Fig. 4 we plot the risk as a function of sample size n, averaging over 2000 simulations. Note that the bottom plot is just the first plot on a log–log scale.
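To make this setup concrete, the following rough sketch (ours, under several simplifications) generates one realization at each noise level: the coefficients of f are computed in the trigonometric basis by a Riemann sum and truncated after the first 2000 terms, the observations are taken directly in the sequence form Yj=θj+εzj with ε=1/√n, simple dyadic blocks replace the weakly geometric system (4.1), and a plain positive-part blockwise James–Stein rule serves as the unquantized reference. The names damped_doppler, trig_coefficients and blockwise_james_stein are ours; the quantized estimates would be obtained by running the encode_decode sketch of Section 4 on the same Y.

```python
import numpy as np

def damped_doppler(x):
    return x * (1.0 - x) * np.sin(2.1 * np.pi / (x + 0.3))

def trig_coefficients(f, N, n_exact=2000, grid=20_000):
    """First N coefficients of f in the trigonometric basis phi_1 = 1,
    phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x).
    Only the first n_exact are computed; for this smooth f the remaining
    coefficients are negligible and are left at zero."""
    x = (np.arange(grid) + 0.5) / grid
    fx = f(x)
    theta = np.zeros(N)
    theta[0] = fx.mean()
    for j in range(1, min(N, n_exact)):
        k = (j + 1) // 2
        basis = np.cos if j % 2 == 1 else np.sin
        theta[j] = np.sqrt(2.0) * (fx * basis(2.0 * np.pi * k * x)).mean()
    return theta

def blockwise_james_stein(Y, eps, block_sizes):
    """Positive-part blockwise shrinkage (1 - T_k eps^2 / ||Y(k)||^2)_+ Y(k)."""
    out, start = np.zeros_like(Y), 0
    for T in block_sizes:
        y = Y[start:start + T]
        shrink = max(0.0, 1.0 - T * eps ** 2 / max(float(np.sum(y ** 2)), 1e-12))
        out[start:start + T] = shrink * y
        start += T
    return out

rng = np.random.default_rng(1)
for n in (500, 5000, 50000):
    eps = 1.0 / np.sqrt(n)
    N = int(np.floor(1.0 / eps ** 2))      # effective number of coefficients
    # dyadic blocks 1, 2, 4, ... as a stand-in for the weakly geometric system (4.1)
    block_sizes, total = [], 0
    while total < N:
        block_sizes.append(min(2 ** len(block_sizes), N - total))
        total += block_sizes[-1]
    theta = trig_coefficients(damped_doppler, N)
    Y = theta + eps * rng.standard_normal(N)
    theta_js = blockwise_james_stein(Y, eps, block_sizes)
    # theta_q = encode_decode(Y, eps, B=30)   # quantized estimate, reusing the Section 4 sketch
    print(f"n = {n:6d}:  squared-error loss of blockwise James-Stein = {np.sum((theta_js - theta) ** 2):.5f}")
```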
In the plots of Fig. 4, we are able to observe the phase transition for the quantized estimators. For relatively small values of n, all quantized estimators yield a similar error rate, with risks that are close to (or even smaller than) that of the block-wise James–Stein estimator. This is the over-sufficient regime: even the smallest budget suffices to achieve the optimal risk. As n increases, the curves start to separate, with estimators having smaller bit budgets leading to worse risks compared to the block-wise James–Stein estimator and compared to estimators with larger budgets. This can be seen as the sufficient regime for the small-budget estimators: the risks are still going down, but at a slower rate than optimal. Eventually all six quantized estimators end up in the insufficient regime: as n increases, their risks begin to flatten out, while the risk of the block-wise James–Stein estimator continues to decrease.

Fig. 4. Risk versus effective sample size n=1/ε2 for estimating the damped Doppler function with different estimators. The dashed line represents the risk of the block-wise James–Stein estimator and the solid ones are for the quantized estimators with different budgets. The budgets are 5, 10, 15, 20, 25 and 30 bits, corresponding to the lines from top to bottom. The two plots show the same curves on the original scale and on the log–log scale.

6. Related work and future directions

Concepts related to quantized non-parametric estimation appear in multiple communities. As mentioned in the introduction, Donoho's 1997 Wald Lectures [9] (on the eve of the 50th anniversary of Shannon's 1948 paper) drew sharp parallels between rate distortion, metric entropy and minimax rates, focusing on the same Sobolev function spaces we treat here. One view of the present work is that we take this correspondence further by studying how the risk continuously degrades with the level of quantization. We have analyzed the precise leading-order asymptotics for quantized regression over Sobolev spaces, showing that these rates and constants are realized with coding schemes that are adaptive to the smoothness m and radius c of the ellipsoid, automatically achieving the optimal rate for the regime corresponding to those parameters and the specified communication budget. Our detailed analysis is possible due to what Nussbaum [19] calls the ‘Pinsker phenomenon’, referring to the fact that linear filters attain the minimax rate in the over-sufficient regime. It will be interesting to study quantized non-parametric estimation in cases where the Pinsker phenomenon does not hold, for example over Besov bodies and different Lp spaces. Many problems of rate distortion type are similar to quantized regression. The standard ‘reverse water-filling’ construction for quantizing a Gaussian source with varying noise levels plays a key role in our analysis, as shown in Section 3.
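For readers less familiar with this construction, here is a small numerical illustration (ours; the bisection on the water level and the function name reverse_water_filling are purely for convenience) of classical reverse water-filling for independent Gaussian sources N(0,σi2) under a total rate budget R [8]: the optimal per-component distortion is Di=min(γ,σi2), with the water level γ chosen so that ∑i(1/2)log(σi2/Di)=R.

```python
import numpy as np

def reverse_water_filling(sigma2, R, iters=200):
    """Distortions D_i = min(gamma, sigma_i^2), with gamma chosen so that
    sum_i 0.5*log(sigma_i^2 / D_i) equals the rate budget R (in nats).
    Assumes all variances are strictly positive."""
    sigma2 = np.asarray(sigma2, dtype=float)
    def total_rate(gamma):
        return 0.5 * np.sum(np.log(sigma2 / np.minimum(gamma, sigma2)))
    lo, hi = 1e-12 * sigma2.max(), sigma2.max()
    for _ in range(iters):                  # total_rate is decreasing in gamma
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total_rate(mid) > R else (lo, mid)
    D = np.minimum(hi, sigma2)
    return D, 0.5 * np.log(sigma2 / D)      # per-component distortions and rates

# Example: variances decaying like a Sobolev-type profile (pi*j)^(-2m) with m = 1.
sigma2 = (np.pi * np.arange(1, 21)) ** (-2.0)
D, rates = reverse_water_filling(sigma2, R=5.0)
print("components given positive rate:", int(np.sum(rates > 1e-9)))
print("total distortion:", float(np.sum(D)))
```

In the quantized minimax problem an analogous allocation appears in the variational problem of Section 3, where the quantization term is combined with the filtering error σj2ε2/(σj2+ε2) of the unquantized estimate.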
In our case, the Sobolev ellipsoid is an infinite Gaussian sequence model, requiring truncation of the sequence at the appropriate level, depending on the targeted quantization and estimation error. In the case of Euclidean balls, Draper and Wornell [10] study rate distortion problems motivated by communication in sensor networks; this is closely related to the problem of quantized minimax estimation over Euclidean balls that we analyzed in [27]. The essential difference between rate distortion and our quantized minimax framework is that in rate distortion the quantization is carried out for a random source, while in quantized estimation we quantize our estimate of the deterministic and unknown basis coefficients. Since linear estimators are asymptotically minimax for Sobolev spaces under squared error (the ‘Pinsker phenomenon’), this naturally leads to an alternative view of quantizing the observations, or said differently, of compressing the data before estimation. Statistical estimation from compressed data has appeared previously in different communities. In [26], a procedure is analyzed that compresses data by random linear transformations in the setting of sparse linear regression. Zhang & Berger [24] study estimation problems when the data are communicated from multiple sources; Ahlswede & Csiszár [2] consider testing problems under communication constraints; the use of side information is studied by Ahlswede & Burnashev [1]; other formulations in terms of multiterminal information theory are given by Han & Amari [14]; non-parametric problems are considered by Raginsky in [20]. In a distributed setting, the data may be divided across different compute nodes, with distributed estimates then aggregated or pooled by communicating with a central node. The general ‘CEO problem’ of distributed estimation was introduced by Berger et al. [3] and has been recently studied in parametric settings in [13,25]. These papers take the view that the data are communicated to the statistician at a certain rate, which may introduce distortion, and the goal is to study the degradation of the estimation error. In contrast, in our setting we can view the unquantized data as being fully available to the statistician at the time of estimation, with communication constraints being imposed when communicating the estimated model to a remote location. Finally, our quantized minimax analysis shows achievability using random coding schemes that are not computationally efficient. A natural problem is to develop practical coding schemes that come close to the quantized minimax lower bounds. In our view, the most promising approach currently is to exploit source coding schemes based on greedy sparse regression [23], applying such techniques blockwise according to the procedure we developed in Section 4. Supplementary data Supplementary date are available at IMAIAI online. Acknowledgements The authors thank Andrew Barron, John Duchi, Maxim Raginsky, Philippe Rigollet, Harrison Zhou and the anonymous referees for valuable comments on this work. Funding Office of Naval Research (N00014-15-1-2379, in part) and National Science Foundation (DMS-1513594, DMS-1547396, in part). References Ahlswede R., Burnashev M. ( 1990) On minimax estimation in the presence of side information about remote data. Ann. Statist. , 18, 141– 171. Google Scholar CrossRef Search ADS   Ahlswede R., Csiszár I. ( 1986) Hypothesis testing with communication constraints. IEEE Trans. Inform. Theory , 32, 533– 542. 
Google Scholar CrossRef Search ADS   Berger T., Zhang Z., Viswanathan H. ( 1996) The CEO problem. IEEE Trans. Inform. Theory , 42, 887– 902. Google Scholar CrossRef Search ADS   Brown L. D., Low M. G. ( 1996) Asymptotic equivalence of non-parametric regression and white noise. Ann. Statist. , 24, 2384– 2398. Google Scholar CrossRef Search ADS   Bruer J. J., Tropp J. A., Cevher V., Becker S. Ghahramani Z., Welling M., Cortes C., Lawrence N. D., Weinberger K. Q. ( 2014) Time–data tradeoffs by aggressive smoothing. Advances in Neural Information Processing Systems , Montreal, Canada: Neural Information Processing Systems Foundation, Inc. pp. 1664– 1672. Chandrasekaran V., Jordan M. I. ( 2013) Computational and statistical tradeoffs via convex relaxation. Proc. Natl. Acad. Sci. USA , 110, E1181– E1190. Google Scholar CrossRef Search ADS   Chattamvelli R., Jones M. ( 1995) Recurrence relations for non-central density, distribution functions and inverse moments. J. Stat. Comput. Simul. , 52, 289– 299. Google Scholar CrossRef Search ADS   Cover T. M., Thomas J. A. ( 2006) Elements of Information Theory . New York: Wiley-Interscience. Donoho D. L. ( 1997) Wald lecture I: counting bits with Kolmogorov and Shannon. Note for the Wald Lectures . Draper S. C., Wornell G. W. ( 2004) Side information aware coding strategies for sensor networks. IEEE J. Sel. Areas Commun. , 22, 966– 976. Google Scholar CrossRef Search ADS   Galal S., Horowitz M. ( 2011) Energy-efficient floating-point unit design. IEEE Trans. Comput. , 60, 913– 922. Google Scholar CrossRef Search ADS   Gallager R. G. ( 1968) Information Theory and Reliable Communication . New York: John Wiley & Sons. Garg A., Ma T., Nguyen H. ( 2014) On communication cost of distributed statistical estimation and dimensionality. Advances in Neural Information Processing Systems . 2726– 2734. Han T. S., Amari S.-I. ( 1998) Statistical inference under multiterminal data compression. IEEE Trans. Inform. Theory , 44, 2300– 2324. Google Scholar CrossRef Search ADS   Johnstone, I. M. (2015) Gaussian estimation: sequence and wavelet models. Unpublished manuscript. Lucic M., Ohannessian M. I., Karbasi A., Krause A. Lebanon G., Vishwanathanis S. V. N. ( 2015) Tradeoffs for space, time, data and risk in unsupervised learning. International Conference on Artificial Intelligence and Statistics , San Diego, CA: Proceedings of Machine Learning Research, Vol. 38, pp. 663– 671. Nussbaum M. ( 1985) Spline smoothing in regression models and asymptotic efficiency in L2. Ann. Statist. , 13, 984– 997. Google Scholar CrossRef Search ADS   Nussbaum M. ( 1996) Asymptotic equivalence of density estimation and Gaussian white noise. Ann. Statist. , 24, 2399– 2430. Google Scholar CrossRef Search ADS   Nussbaum M. ( 1999) Minimax risk: Pinsker bound. Encycl. Stat. Sci. , 3, 451– 460. Raginsky M. ( 2007) Learning from compressed observations. IEEE Information Theory Workshop . Lake Tahoe, CA. pp. 420– 425. Sakrison D. ( 1968) A geometric treatment of the source encoding of a {G}aussian random variable. IEEE Trans. Inform. Theory , 14, 481– 486. Google Scholar CrossRef Search ADS   Tsybakov A. B. ( 2008) Introduction to Non-Parametric Estimation . Springer Series in Statistics, 1st edn. New York: Springer. Venkataramanan R., Sarkar T., Tatikonda S. ( 2014) Lossy compression via sparse linear regression: Computationally efficient encoding and decoding. IEEE Trans. Inform. Theory , 60, 3265– 3278. Google Scholar CrossRef Search ADS   Zhang Z., Berger T. 
( 1988) Estimation via compressed information. IEEE Trans. Inform. Theory , 34, 198– 211. Google Scholar CrossRef Search ADS   Zhang Y., Duchi J., Jordan M. I., Wainwright M. J. Burges C. J. C., Bottou L., Welling M., Ghahramani Z., Weinberger K. Q. ( 2013) Information-theoretic lower bounds for distributed statistical estimation with communication constraints. Advances in Neural Information Processing Systems , Lake Tahoe, NV: Neural Information Processing Systems Foundation, Inc. pp. 2328– 2336. Zhou S., Lafferty J., Wasserman L. ( 2009) Compressed and privacy-sensitive sparse regression. IEEE Trans. Inform. Theory , 55, 846– 866. Google Scholar CrossRef Search ADS   Zhu Y., Lafferty J. Ghahramani Z., Welling M., Cortes C., Lawrence N. D., Weinberger K. Q. ( 2014) Quantized estimation of Gaussian sequence models in Euclidean balls. Advances in Neural Information Processing Systems , Montreal, Canada: Neural Information Processing Systems Foundation, Inc. pp. 3662– 3670. Appendix. Proofs of Technical Results In this section, we provide proofs for Theorems 3.1 and 4.1. A.1. Proof of Theorem 3.1 We first show Lemma A.1 The quantized minimax risk is lower bounded by Vε(m,c,Bε), the value of the optimization (P1). Proof. As will be clear to the reader, Vε(m,c,Bε) is achieved by some σ2 that is non-increasing and finitely supported. Let σ2 be such that   σ12≥…≥σn2>0=σn+1=…, ∑j=1naj2σj2=c2π2m and let   Θn(m,c)={θ∈ℓ2:∑j=1naj2θj2≤c2π2m, θj=0 for j≥n+1}⊂Θ(m,c). We build on this sequence of σ2, a prior distribution of θ. In particular, for τ∈(0,1), write sj2=(1−τ)σj2 and let πτ(θ;σ2) be a prior distribution on θ such that   θj~N(0,sj2), j=1,…,n,ℙ(θj=0)=1, j≥n+1. We observe that   Rε(m,c,Bε) ≥infθ^,C(θ^)≤Bεsupθ∈Θn(m,c)E||θ−θ^||2≥infθ^,C(θ^)≤Bε∫Θn(m,c)E||θ−θ^||2dπτ(θ;σ2)≥Iτ−rτ, where Iτ is the integrated risk of the optimal quantized estimator   Iτ=infθ^,C(θ^)≤Bε∫ℝn⊗{0}∞E||θ−θ^||2dπτ(θ;σ2) and rτ is the residual   rτ=supθ^∈Θ(m,c)∫Θ(m,c)¯E||θ−θ^||2dπτ(θ;σ2), where Θ(m,c)¯=(ℝn⊗{0}∞)\Θn(m,c). As shown in Section 3, limτ→0Iτ is lower bounded by the value of the optimization   minμ2∑j=1∞μj2+∑j=1∞σj2ε2σj2+ε2such that ∑j=1∞12log+(σj4μj2(σj2+ε2))≤Bε.  It then suffices to show that rτ=o(Iτ) as ε→0 for τ∈(0,1). Let dn=supθ∈Θn(m,c)||θ||, which is bounded since for any θ∈Θn(m,c)  ||θ||=∑jθj2=1a12∑ja12θj2≤1a12∑jaj2θj2≤1a12c2π2m=ca1πm. We have   rτ=supθ^∈Θ(m,c)∫Θn(m,c)¯E||θ−θ^||2dπτ(θ;σ2)≤2∫Θn(m,c)¯(dn2+E||θ||2)dπτ(θ;σ2)≤2(dn2ℙ(θ∉Θn(m,c))+(ℙ(θ∉Θn(m,c))E||θ||4)1/2), where we use the Cauchy–Schwarz inequality. Noticing that   E||θ||4=E((∑j=1nθj2)2)=∑j1≠j2E(θj12)E(θj22)+∑j=1nE(θj4)≤∑j1≠j2sj12sj22+3∑j=1nsj4≤3(∑j=1nsj2)2≤3dn4, we obtain   rτ≤2dn2(ℙ(θ∉Θn(m,c))+3ℙ(θ∉Θn(m,c)))≤6dn2ℙ(θ∉Θn(m,c)). Thus, we only need to show that ℙ(θ∉Θn(m,c))=o(Iτ). In fact,   ℙ(θ∉Θn(m,c)) =ℙ(∑j=1naj2θj2>c2π2m)=ℙ(∑j=1naj2(θj2−E(θj2))>c2π2m−(1−τ)∑j=1naj2σj2)=ℙ(∑j=1naj2(θj2−E(θj2))>τc2π2m)=ℙ(∑j=1naj2sj2(Zj2−1)>τ1−τ∑j=1naj2sj2), where Zj~N(0,1). By Lemma A.2, we get   ℙ(θ∉Θn(m,c))≤exp(−τ28(1−τ)2∑j=1naj2sj2max1≤j≤naj2sj2)=exp(−τ28(1−τ)2∑j=1naj2σj2max1≤j≤naj2σj2). Next we will show that for the σ2 that achieves Vε(m,c,Bε), we have ℙ(θ∉Θn(m,c))=o(Iτ). For the sufficient regime where Bεε22m+1→∞ as ε→0, it is shown in [22] that max1≤j≤naj2σj2=O(ε22m+1) and Iτ=O(ε4m2m+1), and hence that ℙ(θ∉Θn(m,c))=o(Iτ). For the insufficient regime where Bεε22m+1→0, but still Bε→∞ as ε→0, an achieving sequence σ2 is given later by (A.4) and (A.3). We obtain that max1≤j≤naj2σj2=O(Bε−1) and Iτ=O(Bε−2m), and therefore ℙ(θ∉Θn(m,c))=o(Iτ). 
The sufficient regime where Bεε22m+1→d for some constant d is a bit more complicated, as we don't have an explicit formula for the optimal sequence σ2. However, by Lemma 3.3, for the continuous approximation σ2(x) such that σj2=σ2(jh)h2m+1, we have   λx2mσ2(x) =σ2(x)(σ2(x)+1)2+α⋅σ2(x)+2σ2(x)+1≤14+2α, where α=exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0) and λ are both constants. Therefore,   max1≤j≤naj2σj2≈j2mσ2(jh)h2m+1≤1λ(14+2α)⋅h. Note that ∑j=1naj2σj2=O(h2m) and that h=ε22m+1. We obtain that for this case Iτ=O(ε4m2m+1) and ℙ(θ∉Θn(m,c))=o(Iτ). Thus, for each of the three regimes, we have rτ=o(Iτ). Lemma A.2 (Lemma 3.5 in [22]) Suppose that X1,…,Xn are i.i.d. N(0,1). For t∈(0,1) and ωj>0, j=1,…,n, we have   ℙ(∑j=1nωj(Xj2−1)>t∑j=1nXj)≤exp(−t2∑j=1nωj8max1≤j≤nωj). Proof of Lemma 3.1. This is in fact Pinsker's theorem, which gives the exact asymptotic minimax risk of estimation of normal means in the Sobolev ellipsoid. The proof can be found in [19] and [22]. Proof of Lemma 3.2. As argued in Section 3 for the lower bound in the sufficient regime, optimization problem (Q1) can be reformulated as   maxσ2,JJηsuch that 12∑j=1Jlog+(σj4η(σj2+ε2))≤Bε∑j=1Jaj2σj2≤c2π2m(σj2) is decreasing and σJ4σJ2+ε2≥η. (Q2) Now suppose that we have a series ( σj2) which satisfies the last constraint and is supported on {1,…,J}. By the first constraint, we have that   Jη=Jexp(−2BεJ)(∏j=1Jσj4σj2+ε2)1J≤Jexp(−2BεJ)(∏j=1Jσj2)1J=Jexp(−2BεJ)(∏j=1Jaj2σj2)1J(∏j=1Jaj−2)1J≤exp(−2BεJ)(∑j=1Jaj2σj2)(∏j=1Jaj−2)1J≤c2π2mexp(−2BεJ)(∏j=1Jaj−2)1J=c2π2m(exp(Bεm)J!)−2mJ. (A.1) This provides a series of upper bounds for Qε(m,c,Bε) parameterized by J. To minimize (A.1) over J, we look at the ratio of the neighboring terms with J and J+1, and compare it to 1. We obtain that the optimal J satisfies   JJJ!<exp(Bεm)≤(J+1)J+1(J+1)!. (A.2) Denote this optimal J by Jε. By Stirling's approximation, we have   limε→0Bε/mJε=1 (A.3) and plugging this asymptote into (A.1), we get as ε→0  c2π2m(exp(Bεm)Jε!)−2mJε~c2π2mJε−2m~c2m2mπ2mBε−2m. This gives the desired upper bound (3.6). Next we show that the upper bound (3.6) is asymptotically achievable when Bεε22m+1→0 and Bε→∞. It suffices to find a feasible solution that attains (3.6). Let   σ˜j2=c2/π2mJεaj2, j=1,…,Jε. (A.4) Note that the entire sequence of (σ˜j2)j=1Jε does not qualify for a feasible solution, since the first constraint in (Q2) won't be satisfied for any η≤σ˜Jε4σ˜Jε2+ε2. We keep only the first Jε' terms of (σ˜j2), where Jε' is the largest j such that   σ˜j4σ˜j2+ε2≥σ˜Jε2. (A.5) Thus,   ∑j=1Jε′12log+(σ˜j4σ˜j2+ε2σ˜Jε2)≤∑j=1Jε′12log+(σ˜j2σ˜Jε2)≤∑j=1Jε12log+(σ˜j2σ˜Jε2)≤Bε, where the last inequality is due to (A.2). This tells us that setting η=σ˜Jε2 leads to a feasible solution to (Q2). As a result,   Qε(m,c,Bε)≥J′εσ˜Jε2. (A.6) If we can show that Jε′~Jε, then   J′εσ˜Jε2~Jεσ˜Jε2~c2m2mπ2mBε−2m. (A.7) To show that Jε'~Jε, it suffices to show that aJε'~aJε. Plugging the formula of σ˜j2 into (A.5) and solving for aJε'2, we get   aJε'2~−c2π2mJε+(c2π2mJε)2+4c2π2mJεε2aJε22ε2~−c2π2mJε+c2π2mJε+12π2mJεc24c2π2mJεε2aJε22ε2=aJε2, where the equivalence is due to the assumption Bεε22m+1→0 and a Taylor's expansion of the function x. Proof of Lemma 3.3 Suppose that σ2(x) with x0 solves (P4). Consider function σ2(x)+ξv(x) such that it is still feasible for (P4), and thus we have   ∫0x0x2mv(x)dx≤0. 
Now plugging σ2(x)+ξv(x) for σ2(x) in the objective function of (P4), taking derivative with respect to ξ, and letting ξ→0, we must have   ∫0x0v(x)(σ2(x)+1)2dx+x0exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0)1x0∫0x02v(x)σ2(x)−v(x)σ2(x)+1dx≤0, which, after some calculation and rearrangement of terms, yields   ∫0x0v(x)(1(σ2(x)+1)2+exp(1x0∫0x0logσ4(x)σ2(x)+1dx−2dx0)σ2(x)+2σ2(x)(σ2(x)+1))dx≤0. Thus, by the lemma that follows, we obtain that for some λ  1(σ2(x)+1)2+exp(1x0∫0x0logσ4(y)σ2(y)+1dy−2dx0)σ2(x)+2σ2(x)(σ2(x)+1)=λx2m. Lemma A.3 Suppose that f(x) and g(x) are two non-zero functions on (0,x0) such that for any v(x), satisfying ∫0x0f(x)v(x)dx≤0, it holds that ∫0x0g(x)v(x)dx≤0. Then there exists a constant λ such that f(x)=λg(x). Proof. First we show that for any v(x) such that ∫0x0f(x)v(x)dx=0, we must have ∫0x0g(x)v(x)dx=0. Otherwise, suppose that v0(x) is such that ∫0x0f(x)v0(x)dx=0 and ∫0x0g(x)v0(x)dx<0. Then take another v(x) with ∫0x0f(x)v(x)dx≤0 and consider vγ(x)=v(x)−γv0(x). We have ∫0x0f(x)vγ(x)dx≤0 and ∫0x0g(x)vγ(x)=∫0x0v(x)g(x)dx−γ∫0x0g(x)v0(x)dx>0 for large enough γ, which results in contradiction. Let λ=∫0x0f(x)2dx/∫0x0f(x)g(x)dx as the denominator cannot be zero. In fact, if ∫0x0f(x)g(x)dx=0, it would imply that ∫0x0g(x)2dx=0, and hence g(x)≡0. Now consider the function f(x)−λg(x). Notice that we have ∫0x0f(x)(f(x)−λg(x))dx=0 by the definition of λ. It follows that ∫0x0g(x)(f(x)−λg(x))dx=0, and therefore, ∫0x0(f(x)−λg(x))2dx=0, which concludes the proof. A.2. Proof of Theorem 4.1 Now we give the details of the proof of Theorem 4.1. For the purpose of our analysis, we define two allocations of bits, the monotone allocation and the block-wise constant allocation,   Πblk(B)={(bj)j=1∞: ∑j=1∞bj≤B, bj=b¯k for j∈Jk, 0≤bj≤bmax}, (A.8)  Πmon(B)={(bj)j=1∞: ∑j=1∞bj≤B, bj−1≥bj, 0≤bj≤bmax}, (A.9) where bmax=2log(1/ε). We also define two classes of weights, the monotonic weights and the block-wise constant weights,   Ωblk={(ωj)j=1∞: ωj=ω¯k for j∈Jk, 0≤ωj≤1}​, (A.10)  Ωmon={(ωj)j=1∞: ωj−1≥ωj, 0≤ωj≤1}​. (A.11) We will also need the following results from [22] regarding the weakly geometric system of blocks. Lemma A.4 Let {Jk} be a weakly geometric block system defined by (4.1). Then there exists 0<ε0<1 and C>0 such that for any ε∈(0,ε0),   K≤Clog2(1/ε),max1≤k≤K−1Tk+1Tk≤1+3ρε. We divide the proof into four steps. Step 1. Truncation and replacement The loss of the quantized estimator θˇ can be decomposed into   ||||θˇ−θ||2=∑k=1K||θˇ(k)−θ(k)||2+∑j=N+1∞θj2, where the remainder term satisfies   ∑j=N+1∞θj2≤N−2m∑j=N+1∞aj2θj2=O(N−2m). If we assume that m>1/2, which corresponds to classes of continuous functions, the remainder term is then o(ε2). If m≤1/2, the remainder term is on the order of O(ε4m), which is still negligible compared to the order of the lower bound ε4m2m+1. To ease the notation, we will assume that m>1/2, and write the remainder term as o(ε2), but need to bear in mind that the proof works for all m>0. We can thus discard the remainder term in our analysis. Recall that the quantized estimate for each block is given by   θˇ(k)=Sˇk2−Tkε2Sˇk1−2−2b˜kZˇ(k) and consider the following estimate with Sˇk replaced by Sk  θ^(k)=Sk2−Tkε2Sk1−2−2b˜kZˇ(k). Notice that   ||θ^(k)−θˇ(k)||=|Sˇk2−Tkε2Sˇk−Sk2−Tkε2Sk|1−2−2b˜k||Zˇ(k)||≤|SˇkSk+Tkε2SˇkSk||Sˇk−Sk|≤2ε2, where the last inequality is because SˇkSk≥Tkε2 and |Sˇk−Sk|≤ε2. Thus we can safely replace θˇ(k) by θ^(k) because   ||θˇ(k)−θ(k)||2=||θˇ(k)−θ^(k)+θ^(k)−θ(k)||2≤||θˇ(k)−θ^(k)||2+||θ^(k)−θ(k)||2+2||θˇ(k)−θ^(k)||||||θ^(k)−θ(k)||||=||θ^(k)−θ(k)||2+O(ε2). 
Therefore, we have   E||θˇ−θ||2=E∑k=1K||θ^(k)−θ(k)||2+O(Kε2). Step 2. Expectation over codebooks Now conditioning on the data Y, we work under the probability measure introduced by the random codebook. Write   λk=Sk2−Tkε2Sk2 and Z(k)=Y(k)||Y(k)||. We decompose and examine the following term   Ak=||θ^(k)−θ(k)||2=||θ^(k)−λkSkZ(k)+λkSkZ(k)−θ(k)||2=||θ^(k)−λkSkZ(k)||2︸Ak,1+||λkSkZ(k)−θ(k)||2︸Ak,2+2⟨θ^(k)−λkSkZ(k),λkSkZ(k)−θ(k)⟩︸Ak,3. To bound the expectation of the first term Ak,1, we need the following lemma, which bounds the probability of the distortion of a codeword exceeding the desired value. Lemma A.5 Suppose that Z1,…,Zn are independent and each follows the uniform distribution on the t-dimensional unit sphere St−1. Let y∈St−1 be a fixed vector, and   Z*=argminz∈Z1:n||1−2−2qz−y||2. If n=2qt, then   E||1−2−2qZ*−y||2≤2−2q(1+ν(t))+2e−2t, where   ν(t)=6logt+7t−6logt−7. Observe that   Ak,1=||θ^(k)−λkSkZ(k)||2=||λkSk1−2−2b˜kZˇ(k)−λkSkZ(k)||2=λk2Sk2||1−2−2b˜kZˇ(k)−Z(k)||2. Then, it follows as a result of Lemma A.5 that   E(Ak,1|Y(k))≤(Sk2−Tkε2)2Sk2(2−2b˜k(1+νε)+2e−2Tk)≤(Sk2−Tkε2)2Sk2(2−2b˜k(1+νε)+2e−2T1)≤(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+2c2(jkπ)2mε2, where νε=6logT1+7T1−6logT1−7. Since Ak,2 only depends on Y(k), E(Ak,2|Y(k))=Ak,2. Next we consider the cross term Ak,3. Write γk=⟨θ(k),Y(k)⟩||Y(k)||2 and   Ak,3=2⟨θ^(k)−λkSkZ(k),λkSkZ(k)−θ(k)⟩=2⟨θ^(k)−λkSkZ(k),γkY(k)−θ(k)⟩︸Ak,3a+2⟨θ^(k)−λkSkZ(k),λkSkZ(k)−γkY(k)⟩︸Ak,3b. The quantity γk is chosen such that ⟨Y(k),γkY(k)−θ(k)⟩=0 and therefore   Ak,3a=2⟨θ^(k)−λkSkZ(k),γkY(k)−θ(k)⟩=2⟨ΠY(k)⊥(θ^(k)−λkSkZ(k)),γkY(k)−θ(k)⟩, where ΠY(k)⊥ denotes the projection onto the orthogonal complement of Y(k). Due to the choice of Zˇ(k), the projection ΠY(k)⊥(θ^(k)−λkSkZ(k)) is rotation symmetric, and hence E(Ak,3a|Y(k))=0. Finally, for Ak,3b we have   E(Ak,3b|Y(k))≤2||λkSkZ(k)−γkY(k)||E(||θ^(k)−λkSkZ(k)|||Y(k))≤2||λkSkZ(k)−γkY(k)||E(||θ^(k)−λkSkZ(k)||2|Y(k))≤2||λkSkZ(k)−γkY(k)||(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+2c2(jkπ)2mε2. Combining all the analyses above, we have   E(Ak|Y(k))≤(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+2c2(jkπ)2mε2+||λkSkZ(k)−θ(k)||2+2λkSkZ(k)−γkY(k)||(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+2c2(jkπ)2mε2 and summing over k we get   E(||θˇ−θ||2|Y)≤∑k=1K(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+∑k=1K||λkSkZ(k)−θ(k)||2+2∑k=1K||λkSkZ(k)−γkY(k)||(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(ε2)+O(Kε2). (A.12) Step 3. Expectation over data First we will state three lemmas, which bound the deviation of the expectation of some particular functions of the norm of a Gaussian vector to the desired quantities. The proofs are given in Section A.3. LemmaA.6 Suppose that Xi~N(θi,σ2) independently for i=1,…,n, where ||θ||2≤c2. Let S be given by   S={nσ2if ||X||<nσ2nσ2+cif ||X||>nσ2+c||X||otherwise. Then there exists some absolute constant C0 such that   E(S2−nσ2S−⟨θ,X⟩||X||)2≤C0σ2. Lemma A.7 Let X and S be the same as defined in Lemma A.6. Then for n>4  E(S2−nσ2)2S2≤||θ||4||θ||2+nσ2+4nn−4σ2. Lemma A.8 Let X and S be the same as defined in Lemma A.6. Define   θ^+=(||X||2−nσ2||X||2)+X, θ^†=S2−nσ2S||X||X. Then   E||θ^†−θ||2≤E||θ^+−θ||2≤nσ2||θ||2||θ||2+nσ2+4σ2. We now take the expectation with respect to the data on both sides of (A.12). First, by the Cauchy–Schwarz inequality   E(||λkSkZ(k)−γkY(k)||(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(ε2))≤E||λkSkZ(k)−γkY(k)||2E((Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(ε2)). (A.13) We then calculate   E||λkSkZ(k)−γkY(k)||2=E||Sk2−Tkε2SkY(k)||Y(k)||−⟨θ(k),Y(k)⟩||Y(k)||Y(k)||Y(k)||||2=E(Sk2−Tkε2Sk−⟨θ(k),Y(k)⟩||Y(k)||)2≤C0ε, where the last inequality is due to Lemma A.6, and C0 is the constant therein. 
Plugging this in (A.13) and summing over k, we get   ∑k=1KE(||λkSkZ(k)−γkY(k)||(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(ε2))≤C0ε∑k=1KE((Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(ε2))≤C0KεE∑k=1K(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(Kε2). Therefore,   E||θˇ−θ||2≤E∑k=1K(Sk2−Tkε2)2Sk22−2b˜k︸B1(1+νε)+E∑k=1K||λkSkZ(k)−θ(k)||2︸B2+C0KεE∑k=1K(Sk2−Tkε2)2Sk22−2b˜k(1+νε)+O(Kε2)+O(Kε2). Now we deal with the term B1. Recall that the sequence b˜ solves problem (4.2), so for any sequence b∈Πblk  ∑k=1K(Sˇk2−Tkε2)2Sˇk22−2b˜k≤∑k=1K(Sˇk2−Tkε2)2Sˇk22−2b¯k. Notice that   |(Sˇk2−Tkε2)2Sˇk2−(Sk2−Tkε2)2Sk2|=|Sˇk2−Sk2||Sˇk2Sk2−Tkε2Sˇk2Sk2|=O(ε2) and thus,   ∑k=1K(Sk2−Tkε2)2Sk22−2b˜k≤∑k=1K(Sk2−Tkε2)2Sk22−2b¯k+O(Kε2). Taking the expectation, we get   E∑k=1K(Sk2−Tkε2)2Sk22−2b˜k≤∑k=1KE(Sk2−Tkε2)2Sk22−2b¯k+O(Kε2). Applying Lemma A.7, we get for Tk>4  E(Sk2−Tkε2)2Sk2≤||θ(k)||4||θ(k)||2+Tkε2+4TkTk−4ε2 and it follows that   E∑k=1K(Sk2−Tkε2)2Sk22−2b˜k≤∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+O(Kε2). Since b∈Πblk is arbitrary,   E∑k=1K(Sk2−Tkε2)2Sk22−2b˜k≤minb∈Πblk∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+O(Kε2). Turning to the term B2, as a result of Lemma A.8 we have   ||λkSkZ(k)−θ(k)||2≤||θ(k)||2Tkε2||θ(k)||2+Tkε2+4ε2. Combining the above results, we have shown that   E||θˇ−θ||2≤M+O(Kε2)+C0KεM+O(Kε2), (A.14) where   M=(1+νε)minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+∑k=1K||θ(k)||2Tkε2||θ(k)||2+Tkε2=(1+νε)minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+minω∈Ωblk∑k=1K((1−ω¯k)2||θ(k)||2+ω¯k2Tkε2)​. Step 4. Blockwise constant is almost optimal We now show that in terms of both bit allocation and weight assignment, block-wise constant is almost optimal. Let's first consider bit allocation. Let B′=11+3ρε(B−T1bmax). We are going to show that   minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k≤minb∈Πmon(B′)∑j=1Nθj4θj2+ε22−2bj. (A.15) In fact, suppose that b*∈Πmon(B′) achieves the minimum on the right-hand side, and define b⋆ by   bj⋆={maxi∈Bkbi*j∈Bk0j≥N. The sum of the elements in b⋆ then satisfies   ∑j=1∞bj⋆=∑k=0K−1Tk+1maxj∈Bk+1bj*=T1b1⋆+∑k=1K−1Tk+1maxj∈Bk+1bj*≤T1bmax+∑k=1K−1Tk+1Tk∑j∈Bkbj*≤T1bmax+(1+3ρε)∑k=1K−1∑j∈Bkbj*≤T1bmax+(1+3ρε)B′=B, which means that b⋆∈Πblk(B). It then follows that   minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k≤∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k⋆≤∑k=1K∑j∈Bkθj4θj2+ε22−2bj⋆=∑j=1Nθj4θj2+ε22−2bj*=minb∈Πmon(B′)∑j=1Nθj4θj2+ε22−2bj, (A.16) where (A.16) is due to Jensen's inequality on the convex function x2x+ε2  (1Tk||θ(k)||2)21Tk||θ(k)||2+ε2≤1Tk∑j∈Bkθj4θj2+ε2. Next, for the weights assignment, by Lemma 3.11 in [22], we have   minω∈Ωblk∑k=1K((1−ω¯k)2||θ(k)||2+ω¯k2Tkε2)≤(1+3ρε)(minω∈Ωmon∑k=1K((1−ωj)2θj2+ωj2ε2))+T1ε2. (A.17) Combining (A.15) and (A.17), we get   M=(1+νε)minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+minω∈Ωblk∑k=1K((1−ω¯k)2||θ(k)||2+ω¯k2Tkε2)≤(1+νε)minb∈Πblk(B)∑k=1K||θ(k)||4||θ(k)||2+Tkε22−2b¯k+(1+3ρε)minω∈Ωmon∑k=1K((1−ω¯k)2||θ(k)||2+ω¯k2Tkε2)+T1ε2≤(1+νε)(minb∈Πmon(B′)∑j=1Nθj4θj2+ε22−2bj+minω∈Ωmon∑j=1N((1−ωj)2θj2+ωj2ε2))+T1ε2. Then by Lemma A.9,   M≤(1+νε)Vε(m,c,B′)+T1ε2 which, plugged into (A.14), gives us   E||θˇ−θ||2≤(1+νε)Vε(m,c,B′)+O(Kε2)+C0Kε(1+νε)Vε(m,c,B′)+O(Kε2). Recall that   νε=O(loglog(1/ε)log(1/ε)), K=O(log2(1/ε)) and that   limε→0B′B=limε→011+3ρε(1−T1bmaxB)=1. Thus,   limε→0Vε(m,c,B′)Vε(m,c,B)=1. Also notice that no matter how B grows as ε→0, Vε(m,c,B)=O(ε4m2m+1). Therefore,   limε→0E||θˇ−θ||2Vε(B,m,c)≤limε→0((1+νε)Vε(B′,m,c)Vε(B,m,c)+O(Kε2)V(B,m,c)+C0(1+νε)Kε2Vε(B,m,c)Vε(B′,m,c)Vε(B,m,c)+(O(Kε2)Vε(B,m,c))2)=1, which concludes the proof. 
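Before turning to Lemma A.9 and the remaining technical lemmas, a quick Monte Carlo sketch (ours; the small dimension t and rate q are chosen only so that the 2qt codewords fit in memory) illustrates the behaviour that Lemma A.5 quantifies and that Step 2 above relies on: with n=2qt independent codewords uniform on St−1, the distortion of the best codeword, E||√(1−2−2q)Z*−y||2, is of order 2−2q.

```python
import numpy as np

rng = np.random.default_rng(0)
t, q, reps = 12, 1.0, 200                  # dimension, rate per coordinate, repetitions
n = int(round(2.0 ** (q * t)))             # 4096 codewords
y = np.zeros(t)
y[0] = 1.0                                 # an arbitrary fixed unit vector
losses = []
for _ in range(reps):
    Z = rng.standard_normal((n, t))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # rows uniform on the unit sphere
    z_best = Z[np.argmax(Z @ y)]                    # codeword with the smallest angle to y
    losses.append(float(np.sum((np.sqrt(1.0 - 2.0 ** (-2.0 * q)) * z_best - y) ** 2)))
print(f"mean distortion over {reps} codebooks: {np.mean(losses):.3f}  "
      f"(2^(-2q) = {2.0 ** (-2.0 * q):.3f})")
```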
Lemma A.9 Let V1 be the value of the optimization   maxθminb∑j=1N(θj4θj2+ε22−2bj+θj2ε2θj2+ε2)such that∑j=1Nbj≤B, bj≥0, ∑j=1Jaj2θj2≤c2π2m (A1) and let V2 be the value of the optimization   maxθminb,ω∑j=1N(θj4θj2+ε22−2bj+(1−ωj)2θj2+ωj2ε2)such that∑j=1Nbj≤B, bj−1≥bj, 0≤bj≤bmax, ωj−1≥ωj,∑j=1Jaj2θj2≤c2π2m. (A2) Then V1=V2. A.3. Proofs of Lemmas Proof of Lemma A.5. Let ζ(t) be a positive function of t to be specified later. Let   p0=ℙ(||1−2−2qZ1−y||≤2−q1+ζ(t)). By Lemma A.10, when ζ(t)≤2(1−2−2q), p0 can be lower bounded by   p0≥Γ(t2+1)πtΓ(t+12)(2−q1+ζ(t)/2)t−1. We obtain that   E||1−2−2qZ*−y||2≤2−2q(1+ζ(t))+2ℙ(||1−2−2qZ*−y||>2−q1+ζ(t))=2−2q(1+ζ(t))+2(1−p0)n. To upper bound (1−p0)n, we consider   log((1−p0)n)=nlog(1−p0)≤−np0≤−2qtΓ(t2+1)πtΓ(t+12)(2−q1+ζ(t)/2)t−1≤−2qΓ(t2+1)πtΓ(t+12)(1+ζ(t)/2)(2/ζ(t)+1)t−12(2/ζ(t)+1)≤−2π(t2)t2+12e−t2πte(t2−12)t2e−(t2−12)et−12(2/ζ(t)+1)=−e−32t−12(tt−1)t2et−12(2/ζ(t)+1)≤−e−1t−12et−12(2/ζ(t)+1), where we have used Stirling's approximation in the form   2πzz+1/2e−z≤Γ(z+1)≤ezz+1/2e−z. In order for (1−p0)n≤e−2t to hold, we need   −2t=−e−1t−12et−12(2/ζ(t)+1), which leads to the choice of ζ(t)  ζ(t)=2t−12log(2et32)−1=6logt+4log(2e)t−3logt−2log(2e)−1. Thus, we have shown that when q is not too close to 0, satisfying 1−2−2q≥ζ(t)/2, we have   E||1−2−2qZ*−y||2≤2−2q(1+ζ(t))+e−2t. When 1−2−2q<ζ(t)/2, we observe that   E||1−2−2qZ*−y||2=1−2−2q+1−21−2−2qE⟨Z*,y⟩≤2−2−2q=2−2q(1+2(22q−1)) and that   2(22q−1)<21−ζ(t)/2−2=2ζ(t)2−ζ(t)=6logt+4log(2e)t−6logt−4log(2e)−1. Now take ν(t)=6logt+7t−6logt−7. Notice that ν(t)>6logt+4log(2e)t−6logt−4log(2e)−1≥ζ(t), we have for any q≥0  E||1−2−2qZ*−y||2≤2−2q(1+ν(t))+e−2t. Lemma A.10 Suppose Z is a t-dimensional random vector uniformly distributed on the unit sphere St−1. Let y be a fixed vector on the unit sphere. For δ<1 and ζ>0, satisfying ζ≤2(1−δ2), define   p0=ℙ(||Z−y||≤δ1+ζ). We have   p0≥Γ(t2+1)πtΓ(t+12)(δ1+ζ/2)t−1. Proof. The proof is based on an idea from [21]. Denote by Vt and At, the volume and the surface area of a t-dimensional unit sphere, respectively. We have   Vt=∫01Atrt−1dr=1tAt. From the geometry of the situation as illustrated in Fig. A1, p0 is equal to the ratio of two areas S1 and S2. The first area S1 is the portion of the surface area of the sphere of radius 1−δ2 and center O contained within the sphere of radius δ1+ζ and center y. It is the surface area of a (t−1)-dimensional polar cap of radius 1−δ2 and polar angle θ0, and can be lower bounded by the area of a (t−1)-dimensional disk of radius 1−δ2sinθ0, that is,   S1≥Vt−1(1−δ2sinθ0)t−1=1t−1At−1(1−δ2sinθ0)t−1. The second area S2 is simply the surface area of a (t−1)-dimensional sphere of radius 1−δ2  S2=At(1−δ2)t−1. Therefore, we obtain   p0 =S1S2≥1t−1At−1(1−δ2sinθ0)t−1At(1−δ2)t−1=At−1(t−1)At(sinθ0)t−1=Γ(t+12+12)πtΓ(t+12)(sinθ0)t−1, where we have used the well-known relationship between At−1 and At  At−1At=1π(t−1)Γ(t2+1)tΓ(t−12+1). Now we need to calculate sinθ0. By the law of cosines, we have   cosθ0=1+1−δ2−δ2(1+ζ)21−δ2=1−δ2(1+ζ/2)1−δ2 and it follows that   sin2θ0=1−cos2θ0=1−1+δ4(1+ζ/2)2−2δ2(1+ζ/2)1−δ2=δ2(1+ζ)−δ4ζ24(1−δ2). Now since ζ≤2(1−δ2), we get   sinθ0≥δ1+ζ/2, which completes the proof. Fig. A1. View largeDownload slide Illustration of the geometry for calculating p0. Fig. A1. View largeDownload slide Illustration of the geometry for calculating p0. Proof of Lemma A.6. We first claim that   E(S2−nσ2S−⟨θ,X⟩||X||)2≤E(||X||2−nσ2||X||−⟨θ,X⟩||X||)2. 
In fact, writing Er(⋅) for the conditional expectation E(⋅|||X||=r), it suffices to show that for r<nσ2 and r>nσ2+c  Er(S2−nσ2S−⟨θ,X⟩||X||)2≤Er(||X||2−nσ2||X||−⟨θ,X⟩||X||)2. When r<nσ2, it is equivalent to   Er(⟨θ,X⟩||X||||)2≤Er(⟨θ,X⟩||X||−||X||2−nσ2||X||)2. It is then sufficient to show that Er⟨θ,X⟩≥0. This can be obtained by following a similar argument as in Lemma A.6 in [22]. When r>nσ2+c, we need to show that   Er((nσ2+c)2−nσ2nσ2+c−⟨θ,X⟩||X||)2≤Er(||X||2−nσ2||X||−⟨θ,X⟩||X||)2, which, after some algebra, boils down to   (nσ2+c)2−nσ2nσ2+c+r2−nσ2r≥2rEr⟨θ,X⟩. This holds because   r((nσ2+c)2−nσ2nσ2+c+r2−nσ2r−2rEr⟨θ,X⟩)≥||θ||2+r2−nσ2−2Er⟨θ,X⟩≥Er||X−θ||2−nσ2≥0, where we have used the assumption that r>nσ2+c, ||θ||≤c and that   Er||X−θ||||≥Er||X||−||θ||≥nσ2. Now that we have shown (A.3) and noting that   E(||X||2−nσ2||X||−⟨θ,X⟩||X||)2=σ2E(||X/σ||2−n||X/σ||−⟨θ/σ,X/σ⟩||X/σ||)2, we can assume that X~N(θ,In) and equivalently show that there exists a universal constant C0 such that   E(||X||2−n||X||−⟨θ,X⟩||X||)2≤C0 holds for any n and θ. Letting Z=X−θ and writing ||θ||2=ξ, we have   E(||X||2−n||X||−⟨θ,X⟩||X||)2=E(||Z+θ||2−n−ξ||Z+θ||−⟨θ,Z⟩||Z+θ||)2≤2E(||Z+θ||2−n−ξ||Z+θ||)2+2E(⟨θ,Z⟩||Z+θ||||)2≤2E||Z+θ||2−4(n+ξ)+2E(n+ξ)2||Z+θ||2+2E(⟨θ,Z⟩||Z+θ||)2≤2(n+ξ)−4(n+ξ)+2(n+ξ)2n+ξ−4+2E(⟨θ,Z⟩||Z+θ||)2=8(n+ξ)n+ξ−4+2E(⟨θ,Z⟩||Z+θ||)2, where the last inequality is due to Lemma A.11. To bound the last term, we apply the Cauchy–Schwarz inequality and get   E(⟨θ,Z⟩||Z+θ||)2≤E1||Z+θ||4E⟨θ,Z⟩4≤3(n−4)ξ2(n−6)(n+ξ−4)(n+ξ−6), where the last inequality is again due to Lemma A.11. Thus we just need to take C0 to be   supn≥7,ξ≥08(n+ξ)n+ξ−4+23(n−4)ξ2(n−6)(n+ξ−4)(n+ξ−6), which is apparently a finite quantity. Proof of Lemma A.7. Since the function (x2−nσ2)2/x2 is decreasing on (0,nσ2) and increasing on (nσ2,∞), we have   (S2−nσ2)2S2≤(||X||2−nσ2)2||X||2 and it follows that if n>4  E(S2−nσ2)2S2≤E(||X||2−nσ2)2||X||2 (A.18)  =E||X||2−2nσ2+n2σ4E(1||X||2) (A.19)  ≤||θ||2−nσ2+n2σ4||θ||2+nσ2−4σ2 (A.20)  ≤||θ||4||θ||2+nσ2+4nn−4σ2, (A.21) where (A.20) is due to Lemma A.11, and (A.21) is obtained by   ||θ||2−nσ2+n2σ4||θ||2+nσ2−4σ2−||θ||4||θ||2+nσ2=||θ||4+4σ2(nσ2−||θ||2)||θ||2+nσ2−4σ2−||θ||4||θ||2+nσ2=4n2σ6(||θ||2+nσ2−4σ2)(||θ||2+nσ2)≤4nn−4σ2. Proof of Lemma A.8. First, the second inequality   E||θ^+−θ||2≤nσ2||θ||2||θ||2+nσ2+4σ2 is given by Lemma 3.10 from [22]. We thus focus on the first inequality. For convenience we write   g+(x)=(||x||2−nσ2||x||2)+, g†(x)=s(x)2−nσ2s(x)||x|||| with   s(x)={nσ2if ||||x||||<nσ2nσ2+cif ||||x||||>nσ2+c||||x||||otherwise. Notice that g+(x)=g†(x) when ||x||≤nσ2+c and g+(x)>g†(x) when ||x||>nσ2+c. Since g† and g+ both only depend on ||x||, we sometimes will also write g†(||x||) for g†(x) and g+(||x||) for g+(x). Setting Er(⋅) to denote the conditional expectation E(⋅|||X||=r) for brevity, it suffices to show that for r≥nσ2+c  Er(||g†(X)X−θ||2)≤Er(||g+(X)X−θ||2)⇔g†(r)2r2−2g†(r)Er⟨X,θ⟩≤g+(r)2r2−2g+(r)Er⟨X,θ⟩⇔(g†(r)2−g+(r)2)r2≥2(g†(r)−g+(r))Er⟨X,θ⟩⇔(g†(r)+g+(r))r2≥2Er⟨X,θ⟩. (A.22) On the other hand, we have   (g†(r)+g+(r))r2≥(||θ||2r2+r2−nσ2r2)r2=||θ||2+r2−nσ2=||θ||2+r2−2Er⟨X,θ⟩−nσ2+2Er⟨X,θ⟩=Er||X−θ||2−nσ2+2Er⟨X,θ⟩≥2Er⟨X,θ⟩, where the last inequality is because   ||X−θ||2≥(||X||−||θ||)2≥nσ2. Thus, (A.22) holds, and hence E||θ^†−θ||2≤E||θ^+−θ||2. Proof of Lemma A.9. It is easy to see that V1≤V2, because for any θ the inside minimum is smaller for (A1) than for (A2). Next, we will show V1≥V2. Suppose that θ* achieves the value V2, with corresponding b* and ω*. We claim that θ* is non-increasing. 
In fact, if θ* is not non-increasing then there must exist an index j such that θj*<θj+1*, and for simplicity, let's assume that θ1*<θ2*. We are going to show that this leads to b1*=b2* and ω1*=ω2*. Write   s1=θ1*4θ1*2+ε2, s2=θ2*4θ2*2+ε2. We have s1<s2. Let b¯*=b1*+b2*2 and observe that b1*≥b¯*≥b2*. Notice that   (s12−2b1*+s22−2b2*)−(s12−2b¯*+s22−2b¯*)=s1(2−2b1*−2−2b¯*)+s2(2−2b2*−2−2b¯*)≥s2(2−2b1*−2−2b¯*)+s2(2−2b2*−2−2b¯*)≥s2(2−2b1*+2−2b2*−2⋅2−2b¯*)≥0, where equality holds if and only if b1*=b2*, since s2>s1≥0. Hence, b1* and b2* have to be equal, or otherwise it would contradict with the assumption that b* achieves the inside minimum of (A2). Now turn to ω*. Write ω¯*=ω1*+ω2*2 and note that ω1*≥ω¯*≥ω2*. Consider   ((1−ω1*)2θ1*2+ω1*2ε2)+((1−ω2*)2θ2*2+ω2*2ε2)−((1−ω¯*)2(θ1*2+θ2*2)+2ω¯*2ε2)=((1−ω1*)2−(1−ω¯*)2)θ1*2+((1−ω2*)2−(1−ω¯*)2)θ2*2+(ω1*2+ω2*2−2ω¯*2)ε2≥((1−ω1*)2−(1−ω¯*)2)θ2*2+((1−ω2*)2−(1−ω¯*)2)θ2*2+(ω1*2+ω2*2−2ω¯*2)ε2=((1−ω1*)2+(1−ω2*)2−2(1−ω¯*)2)θ2*2+(ω1*2+ω2*2−2ω¯*2)ε2≥0, where the equality holds if and only if ω1*=ω2*. Therefore, ω1* and ω2* must be equal. Now, with b1*=b2* and ω1*=ω2*, we can switch θ1* and θ2* without increasing the objective function and violating the constraints. Thus, our claim that θ* is non-increasing is justified. Now that we have shown that the solution triplet (θ*,b*,ω*) to (A2) satisfy that θ* is non-increasing, in order to prove V1≥V2, it suffices to show that if we take θ=θ* in (A1), the minimizer b⋆ is non-increasing and b1⋆≤bmax. In fact, if so, we will have b⋆=b* as well as ω*=θj*2θj*2+ε2, and then   V1≥minb:∑j=1Nbj≤B∑j=1N(θj*4θj*2+ε22−2bj+θj*2ε2θj*2+ε2)≥V2. Let's take θ=θ* in (A1). The optimal b⋆ is non-increasing because the solution is given by the ‘reverse water-filling’ scheme and θ* is non-increasing. Next, we will show that b1⋆≤bmax. If b1⋆>bmax, then we would have for j=1,…,N  θj*4θj*2+ε22−2bj⋆≤θ1*4θ1*2+ε22−2b1⋆≤θ1*22−2bmax≤c22−4log(1/ε)=c2ε4, where the first inequality follows from the ‘reverse water-filling’ solution, and therefore   ∑j=1Nθj*4θj*2+ε22−2bj⋆≤Nc2ε4=o(ε4m2m+1), which would not give the optimal solution. Hence, b1⋆≤bmax, and this completes the proof. Lemma A.11 Suppose that Wn,ξ follows a non-central chi-square distribution with n degrees of freedom and non-centrality parameter ξ. We have for n≥5  E(Wn,ξ−1)≤1n+ξ−4 and for n≥7  E(Wn,ξ−2)≤n−4(n−6)(n+ξ−4)(n+ξ−6). Proof. It is well known that the non-central chi-square random variable Wn,ξ can be written as a Poisson-weighted mixture of central chi-square distributions, i.e., Wn,ξ~χn+2K2 with K~Poisson(ξ/2). Then   E(Wn,ξ−1)=E(E(Wn,ξ−1|K))=E(1n+2K−2)≥1n+2EK−2=1n+ξ−2, where we have used the fact that E(1/χn2)=n−2 and Jensen's inequality. Similarly, we have   E(Wn,ξ−2)=E(E(Wn,ξ−2|K))=E(1(n+2K−2)(n+2K−4))≥1(n+2EK−2)(n+2EK−4)=1(n+ξ−2)(n+ξ−4). Using the Poisson-weighted mixture representation, the following recurrence relation can be derived [7]   1=ξE(Wn+4,ξ−1)+nE(Wn+2,ξ−1)​, (A.23)  E(Wn,ξ−1)=ξE(Wn+4,ξ−2)+nE(Wn+2,ξ−2) (A.24) for n≥3. Thus,   E(Wn+4,ξ−1)=1ξ−nξE(Wn+2,ξ−1)≤1ξ−nξ1n+ξ=1n+ξ. Replacing n by n−4 proves (A.11). On the other hand, rearranging (A.23), we get   E(Wn+2,ξ−1)=1n−ξnE(Wn+4,ξ−1)≤1n−ξn1n+ξ+2=n+2n(n+ξ+2). Now using (A.24), we have   E(Wn+4,ξ−2)=1ξE(Wn,ξ−1)−nξE(Wn+2,ξ−2)≤nξ(n−2)(n+ξ)−nξ(n+ξ)(n+ξ−2)=n(n−2)(n+ξ)(n+ξ−2). Replacing n by n−4 proves (A.11). © The authors 2017. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. 