Non-Gaussian observations in nonlinear compressed sensing via Stein discrepancies

Abstract Performance guarantees for estimates of unknowns in nonlinear compressed sensing models under non-Gaussian measurements can be achieved through the use of distributional characteristics that are sensitive to the distance to normality, and which in particular return the value of zero under Gaussian or linear sensing. The use of these characteristics, or discrepancies, improves some previous results in this area by relaxing conditions and tightening performance bounds. In addition, these characteristics are tractable to compute when Gaussian sensing is corrupted by either additive errors or mixing. 1. Introduction Consider the nonlinear sensing model where $$(y_1,\boldsymbol{a}_1),\ldots ,(y_m,\boldsymbol{a}_m)$$ in $$\mathbb{R} \times \mathbb{R}^d$$ are i.i.d. copies of an observation and sensing vector pair (y, a), satisfying   $$E{\left[y|\boldsymbol{a}\right]}=\theta(\langle\boldsymbol{a},\boldsymbol{x}\rangle),$$ (1.1)where a is composed of entry-wise independent random variables distributed as a, a mean zero, variance one random variable. Throughout the paper we assume that the function $$\theta :\mathbb{R} \rightarrow \mathbb{R}$$ is measurable, and $$\boldsymbol{x} \in \mathbb{R}^d$$ is an unknown, non-zero vector lying in a closed set $$K \subseteq{\mathbb{R}}^d$$. The goal is to recover x given the measurement pairs $$\{(y_i,\mathbf{a}_i)\}_{i=1}^m$$. We note that the magnitude of x is unidentifiable under the model (1.1) as $$\theta (\cdot )$$ is unknown. Hence, in the following, by absorbing a factor of ∥x∥ into $$\theta$$, we may assume $$\|\boldsymbol{x}\|_2=1$$ without loss of generality. In [1], the authors consider model (1.1) under the one-bit sensing scenario where $$y_1,\ldots ,y_m$$ lie in the two-point set {−1, 1} and $$\theta :\mathbb{R} \rightarrow [-1,1]$$. 
They demonstrate that despite $$\theta$$ being unknown and potentially highly nonlinear, performance guarantees can be provided for estimators $$\widehat{\boldsymbol{x}}$$ of x without additional knowledge of the structure of $$\theta$$ and in a way that allows for non-Gaussian sensing. Nonlinear compressed sensing beyond the one-bit model has also been considered in previous works under certain distribution assumptions. For example, [15] and [12] consider the nonlinear model (1.1), with measurement vectors a being Gaussian and an elliptical symmetric distribution, respectively. More recently, [22] considers measurement vectors of general distribution via a score function method, under the assumption that the full knowledge of the distribution function is known. We also mention that the work [8] handles non-Gaussian designs using the zero bias transform in order to study equivalences between Generalized and Ordinary least squares. In [1], consideration of the non-Gaussian case introduces some challenges, reflected in potentially poor performance of the bounds, additional smoothness assumptions and difficulties that may arise when the unknown is extremely sparse. We show many of these difficulties can be overcome through the introduction of various measures of the discrepancy between the sensing distribution of a and the standard normal g. Though our main goal is to develop bounds that are sensitive to certain deviations from normality, and which in particular recover the previous results for Gaussian sensing and linear sensing as special cases, we also improve previous results by supplying explicit small constants in our recovery bounds. 
Regarding notation, we generally adhere to the principle that random variables appear in upper case, but to be consistent with the existing literature, and in particular with [1], we make an exception for the components of the sensing vector, generically denoted by a and the Gaussian by g, and also for the observed values, denoted by y. Vectors are in bold face. 1.1 Estimator and main result Given the pairs $$\{(y_i,\boldsymbol{a}_i)\}_{i=1}^m$$ generated by the model (1.1), let   $$L_m(\mathbf{t}):=\|\mathbf{t}\|_2^2-\frac{2}{m}\sum_{i=1}^my_i\left\langle\mathbf{a}_i,\mathbf{t}\right\rangle \quad \mbox{for $$\boldsymbol{t} \in K$$,} \quad$$ (1.2)which is an unbiased estimator of   $$L(\mathbf{t}):=\|\mathbf{t}\|_2^2-2E{\left[\,y\left\langle\mathbf{a},\mathbf{t}\right\rangle\right]}.$$ (1.3)As the components of a have mean zero, variance one and are independent, $$E{ [\mathbf{a}\mathbf{a}^T ]}=\mathbf{I}_{d\times d}$$, and therefore minimizing L(t) is equivalent to minimizing the quadratic loss $$E [\left (\,y-\left \langle \mathbf{a},\mathbf{t}\right \rangle \right )^2 ]$$. Thus, we define the estimator   $$\widehat{\mathbf{x}}_m:=\mathop{\mbox{argmin}}_{\mathbf{t}\in K}L_m(\mathbf{t}).$$ (1.4)For simplicity of notation, we will write   $$f_{\mathbf{x}}(\mathbf{t}):=\frac{1}{m}\sum_{i=1}^m y_i\left\langle\mathbf{a}_i,\mathbf{t}\right\rangle\!\!.$$ (1.5)To state the main result, we need the following three definitions: Definition 1. 1 (Gaussian mean width) For $$\mathbf{g}\!\!\sim\!\! 
\mathscr{N}(0,\mathbf{I}_{d\times d})$$, the Gaussian mean width of a set $$\mathscr{T}\subseteq \mathbb{R}^d$$ is   $$\omega(\mathscr{T})=E{\left[\sup_{\boldsymbol{t}\in \mathscr{T}}\left\langle\mathbf{g},\boldsymbol{t}\right\rangle\right]}.$$ Remark 1.2 In [1], the definition of Gaussian mean width of a set $$\mathscr{T}$$ is taken to be   $$\omega(\mathscr{T})=E{\left[\sup_{\boldsymbol{t}\in\mathscr{T} - \mathscr{T}}\left\langle\mathbf{g},\boldsymbol{t}\right\rangle\right]},$$where the supremum is over the Minkowski difference. Here, for ease of presentation, we adopt the somewhat more ‘classical’ Definition 1.1 that appears in earlier works in the literature, such as [17]. These two definitions are equivalent up to a constant as $$E \,[\sup _{\boldsymbol{t}\in \mathscr{T} - \mathscr{T}}\left \langle \mathbf{g},\boldsymbol{t}\right \rangle ]= 2E\, [\sup _{\boldsymbol{t}\in \mathscr{T}}\left \langle \mathbf{g},\boldsymbol{t}\right \rangle ]$$, which can be seen using the symmetry of the distribution of g. Remark 1.3 (Measurability issue) The precise meaning of $$E\, [\sup _{t\in \mathscr{T}}X(t) ]$$ for an arbitrary process $$\{X(t)\}_{t\in \mathscr{T}}$$ is not clear if $$\mathscr{T}$$ is uncountable. In fact, for an uncountable index set $$\mathscr{T}$$, the function $$\sup _{t\in{\mathscr{T}}}X(t)$$ might not be measurable. Letting $$(\Omega ,\mathcal{E},P)$$ be the underlying probability space, well-known counter examples exist even in the case where X(⋅) is jointly measurable on the product space $$(\Omega \times \mathscr{T},~\mathscr{E}\otimes \Psi )$$ (first constructed by Luzin and Suslin), where $$\varPsi$$ is a Borel $$\sigma$$-algebra on $$\mathscr{T}$$. However, when $$\mathscr{T}$$ is a Borel measurable subset of $$\mathbb{R}^d$$ (which is the case we are interested in) and X(⋅) is jointly measurable on $$(\Omega \times \mathscr{T},~\mathscr{E}\otimes \varPsi )$$, one can show that the $$\sup _{t\in \mathscr{T}}X(t)$$ is always measurable. 
Indeed, $$\sup _{t\in \mathscr{T}}X(t)$$ is measurable if and only if the set $$\{\sup _{t\in \mathscr{T}}X(t)> c\}\in \mathscr{E}$$ for every $$c\in \mathbb{R}$$. On the other hand, $$\{\sup _{t\in \mathscr{T}}X(t)> c\}=P_{\Omega }\{X(\cdot )> c\}$$, where for any set $$A\subseteq \Omega \times \mathscr{T}$$, $$P_{\Omega }A:=\{\omega \in \Omega :(\omega ,t)\in A \mbox{ for some } t\in \mathscr{T}\}$$ is the projection of the set A onto $$\Omega$$. Measurability then follows from the following theorem in [6]: if $$(\Omega , \mathscr{E})$$ is a measurable space and $$\mathscr{T}$$ is a Polish space, then the projection onto $$\Omega$$ of any product measurable subset of $$\Omega \times \mathscr{T}$$ is also measurable. Definition 1.4 ($$\psi _q$$-norm) The $$\psi _q$$-norm of a real-valued random variable X is given by   $$\|X\|_{\psi_q}=\sup_{p\geqslant1}p^{-\frac{1}{q}}\left(E\left[|X|^p\right]\right)^{\frac1p}\!\!.$$ In particular, for q = 1 and q = 2 the $$\psi _q$$-norm is called the subexponential and subgaussian norm, respectively, and we say X is subexponential or subgaussian when $$\|X\|_{\psi _1}<\infty$$ or $$\|X\|_{\psi _2}<\infty$$. The subgaussian q = 2 case of Definition 1.4 is the most important. Though the $$\psi _2$$-norm we have chosen to use here is based on comparing the growth of a distribution’s absolute moments to that of a normal, definitions equivalent up to universal constants can also be stated in terms of comparisons of tail decay or of the Laplace transform of X, among others. Remark 1.5 It is easily justified that $$\|\cdot \|_{\psi _q}$$ for q ⩾ 1 defines a norm with identification of almost everywhere equal random variables. Here we only check the triangle inequality as it is immediate that $$\|\cdot \|_{\psi _q}$$ is homogeneous and separates points. 
Indeed, for any two random variables X and Y, the Minkowski inequality yields that   \begin{equation*} \|X+Y\|_{\psi_q}=\sup_{p\geqslant1}p^{-\frac{1}{q}}\left(E{\left[|X+Y|^p\right]}\right)^{\frac1p} \leqslant\sup_{p\geqslant1}p^{-\frac{1}{q}}\left(\left(E{\left[|X|^p\right]}\right)^{\frac1p}+\left(E{\left[|Y|^p\right]}\right)^{\frac1p}\right) \leqslant\|X\|_{\psi_q}+\|Y\|_{\psi_q}. \end{equation*} Definition 1.6 (Descent cone) The descent cone of a set $$\mathscr{T}\subseteq \mathbb{R}^d$$ at any point $$\boldsymbol{t}_0\in \mathscr{T}$$ is defined as   $$D(\mathscr{T},\boldsymbol{t}_0)=\big\{\tau\mathbf{h}:~\tau\geqslant0, \mathbf{h}\in \mathscr{T}-\boldsymbol{t}_0\big\}.$$ Theorem 1.7 Let $$\mathbf{a}=(a_1,\ldots ,a_d)$$ where $$a_1,\ldots ,a_d$$ are i.i.d. copies of a random variable a with a centered subgaussian distribution having unit variance, and let $$\{(y_i,\mathbf{a}_i)\}_{i=1}^m$$ be i.i.d. copies of the pair (y, a) where y, given by the sensing model (1.1), is assumed to be subgaussian. If K is a closed, measurable subset of $$\mathbb{R}^d$$ and $$\lambda \boldsymbol{x} \in K$$ where   $$\lambda=E{\left[\,y\left\langle\mathbf{a},\mathbf{x}\right\rangle\right]},$$ (1.6)then for all u ⩾ 2, with probability at least $$1-4\mathrm{e}^{-u}$$, the estimator $$\widehat{\mathbf{x}}_m$$ given by (1.4) satisfies   $$\left\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\right\|_2 \leqslant2\alpha+C_0\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\omega\left(D(K,\lambda\mathbf{x})\cap\mathbb{S}^{d-1}\right)+u}{\sqrt{m}},$$for all $$m\geqslant \omega (D(K,\mathbf{x})\cap \mathbb{S}^{d-1})^2$$ and some constant $$C_0>0$$, where   $$\alpha=\sup\left\{ \big|E{\left[ y \left\langle\mathbf{a},\mathbf{t}\right\rangle \right]}-\lambda\left\langle\mathbf{x},\mathbf{t}\right\rangle\! \big|, \mathbf{t} \in B_2^d \right\}\!,$$ (1.7)and $$\mathbb{S}^{d-1}$$ and $$B_2^d$$ are the unit Euclidean sphere and ball in $$\mathbb{R}^d$$, respectively. 
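To make the estimator (1.4) concrete, the following is a minimal numerical sketch, not part of the original analysis. It relies on completing the square, $$L_m(\mathbf{t})=\|\mathbf{t}-\mathbf{v}\|_2^2-\|\mathbf{v}\|_2^2$$ with $$\mathbf{v}=\frac{1}{m}\sum_{i=1}^m y_i\mathbf{a}_i$$, so that a minimizer over a closed set K is a Euclidean projection of v onto K. The choice of K as the set of s-sparse vectors, the link function tanh, the noise level, the dimensions and the helper names `least_squares_estimate` and `project_sparse` are all assumptions made for illustration.

```python
import numpy as np


def least_squares_estimate(y, A, project):
    """Minimize L_m(t) = ||t||^2 - (2/m) * sum_i y_i <a_i, t> over K.

    Since L_m(t) = ||t - v||^2 - ||v||^2 with v = (1/m) * sum_i y_i a_i,
    a minimizer over a closed set K is a Euclidean projection of v onto K.
    """
    v = A.T @ y / len(y)
    return project(v)


def project_sparse(v, s):
    """Projection onto K = {t : at most s non-zero entries} (illustrative K)."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]          # indices of the s largest |v_j|
    out[keep] = v[keep]
    return out


rng = np.random.default_rng(0)
d, m, s = 50, 20000, 3                         # illustrative sizes
x = np.zeros(d)
x[:s] = 1.0 / np.sqrt(s)                       # unit-norm, s-sparse unknown
A = rng.standard_normal((m, d))                # Gaussian sensing vectors a_i
y = np.tanh(A @ x) + 0.1 * rng.standard_normal(m)   # E[y | a] = theta(<a, x>)

x_hat = least_squares_estimate(y, A, lambda v: project_sparse(v, s))
lam = np.mean(y * (A @ x))                     # empirical counterpart of (1.6)
err = np.linalg.norm(x_hat - lam * x)          # distance to the scaled target
```

In this Gaussian setting the estimate concentrates around $$\lambda \mathbf{x}$$ rather than x itself, which is the scaling that Theorem 1.7 describes.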
We note that $$\alpha =0$$ under the conditions of Theorems 2.1 and 2.4, and also when $$\theta$$ is linear. Hence, Theorem 1.7 recovers results for the normal and linear compressed sensing models as special cases. Remark 1.8 At first glance it may seem surprising that the least squares type estimator (1.4), which is well known to work when $$\theta$$ is linear, succeeds in such greater generality. The appearance of the factors $$\lambda$$ and $$\alpha$$ in (1.6) and (1.7), respectively, may also be non-intuitive. The following explanations may shed some light. First, regarding the scaling factor $$\lambda$$, one can easily verify that if $$\theta (w)=\mu w$$, a linear function, then $$\lambda =\mu$$. Hence, in this case $$\theta (\langle \boldsymbol{a},{x}\rangle )=\lambda \langle \boldsymbol{a},\boldsymbol{x}\rangle = \langle \boldsymbol{a}, \lambda \boldsymbol{x}\rangle$$, which behaves as though the unknown vector to be estimated has length $$\lambda$$, possibly different from one, the assumed length of x. Next, we present Lemma 1.9, used later in the proof of Theorem 1.7, to give some intuition as to why the proposed estimator succeeds when $$\theta$$ is nonlinear. Let L be as in (1.3), the expectation of the function $$L_m$$ whose argument at the minimum defines the estimator $$\widehat{\mathbf{x}}_m$$. Lemma 1.9 For any t ∈ K, we have   $$L(\mathbf{t})-L(\lambda\mathbf{x})\geqslant\|\mathbf{t}-\lambda\mathbf{x}\|_2^2-2\alpha\|\mathbf{t}-\lambda\mathbf{x}\|_2,$$where $$\lambda$$ and $$\alpha$$ are defined in (1.6) and (1.7), respectively. Proof. 
For any t ∈ K,   \begin{align*} L(\mathbf{t})-L(\lambda\mathbf{x})=&\, \|\mathbf{t}\|_2^2-\|\lambda\mathbf{x}\|_2^2 -2E{\left[\,y\left\langle\mathbf{a},\mathbf{t}-\lambda\mathbf{x}\right\rangle\right]} \\ \geqslant&\,\|\mathbf{t}\|_2^2-\|\lambda\mathbf{x}\|_2^2-2\lambda\left\langle\mathbf{t}-\lambda\mathbf{x},\mathbf{x}\right\rangle-2\alpha\|\mathbf{t}-\lambda\mathbf{x}\|_2\\ =&\,\|\mathbf{t}-\lambda\mathbf{x}\|_2^2-2\alpha\|\mathbf{t}-\lambda\mathbf{x}\|_2, \end{align*}where the inequality follows from (1.7). Hence, if one could minimize L instead of $$L_m$$ (the difference in practice being controlled by a generic chaining argument), when $$\lambda \boldsymbol{x} \in K$$, the set over which L is minimized, one obtains   $$\|\,\widehat{\boldsymbol{x}}_m-\lambda\boldsymbol{x}\|_2^2 \leqslant \left[L(\,\widehat{\boldsymbol{x}}_m) - L(\lambda\boldsymbol{x})\right] + 2 \alpha \|\,\widehat{\boldsymbol{x}}_m-\lambda\boldsymbol{x}\|_2 \leqslant 2 \alpha \|\,\widehat{\boldsymbol{x}}_m-\lambda\boldsymbol{x}\|_2,$$ (1.8)and therefore that   \begin{equation*} \|\,\widehat{\boldsymbol{x}}_m-\lambda\boldsymbol{x}\|_2 \leqslant 2 \alpha. \end{equation*} From the inequality in the proof of Lemma 1.9, one can see that $$\alpha$$ is the ‘price’ for replacing the nonlinearity inherent in y with a simpler inner product, as supported by the fact that $$\alpha =0$$ when $$\theta$$ is linear. In addition, parts (a) and (b) of Theorem 2.1 to follow show that $$\alpha$$ is again zero when $$\theta$$ is Lipschitz, or has bounded second derivative, and the sensing vector is composed of independent Gaussian variables. Theorem 2.4 provides this same conclusion when $$\theta$$ is the sign function. Hence, in these cases, minimizing L would lead to exact recovery. As mentioned earlier, the length of the unknown vector x in (1.1) is not identifiable due to the generality in $$\theta$$ that the model allows. 
However, if one has prior knowledge that $$\|\mathbf{x}\|_2 = 1$$, the following corollary to Theorem 1.7 shows that rescaling $$\widehat{\mathbf{x}}_m$$ to have norm 1 gives an estimator of the true vector x. The idea underlying the corollary was originally developed in [12]. Corollary 1.10 Let the conditions of Theorem 1.7 be in force, and suppose that $$\|\mathbf{x}\|_2 = 1$$ and $$\lambda> 0$$. Define the normalized estimator $$\overline{\mathbf{x}}_m$$ as   $$\overline{\mathbf{x}}_m := \begin{cases} \widehat{\mathbf{x}}_m/\|\,\widehat{\mathbf{x}}_m\|_2,~~&\textrm{if }~\widehat{\mathbf{x}}_m\neq0, \\ 0,~~&\textrm{if }~\widehat{\mathbf{x}}_m=0. \end{cases}$$Then there exists a constant $$C_0>0$$ such that for all u ⩾ 2, with probability at least $$1-4\mathrm{e}^{-u}$$,   $$\left\|\, \overline{\mathbf{x}}_m - \mathbf{x}\right\|_2 \leqslant\frac{4\alpha}{\lambda}+2C_0\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\omega\left(D(K,\lambda\mathbf{x})\cap\mathbb{S}^{d-1}\right)+u}{\lambda\sqrt{m}},$$whenever $$m\geqslant \omega (D(K,\mathbf{x})\cap \mathbb{S}^{d-1})^2$$. Proof. By Theorem 1.7, we know that with probability at least $$1-4\mathrm{e}^{-u}$$  $$\left\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\right\|_2\leqslant B \quad \mbox{where} \quad B = 2\alpha+C_0\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\omega\left(D(K,\lambda\mathbf{x})\cap\mathbb{S}^{d-1}\right)+u}{\sqrt{m}}.$$Since $$\lambda>0$$, it follows that on this event   \begin{equation*} \left\|\frac{\widehat{\mathbf{x}}_m}{\lambda} - \mathbf{x}\right\|_2\leqslant \frac{B}{\lambda}. \end{equation*} Let $$\omega \in [0,\pi )$$ be the angle between $$\widehat{\mathbf{x}}_m$$ and x (see Fig. 1). First consider the case where either $$\omega \geqslant \frac{\pi }{2}$$ or $$\widehat{\mathbf{x}}_m =0$$. 
Then $$\langle\, \widehat{\mathbf{x}}_m,\mathbf{x}\rangle \leqslant 0$$, and we have from the above inequality,   $$\frac{B}{\lambda}\geqslant\left\|\frac{\widehat{\mathbf{x}}_m}{\lambda} - \mathbf{x}\right\|_2 = \sqrt{\|\,\widehat{\mathbf{x}}_m/\lambda\|_2^2 - 2 \langle\,\widehat{\mathbf{x}}_m,\mathbf{x}\rangle/\lambda + \| \mathbf{x}\|_2^2}\geqslant\|\mathbf{x}\|_2=1.$$ Hence, applying the triangle inequality, we have   \begin{equation*} \left\|\overline{\mathbf{x}}_m - \mathbf{x}\right\|_2\leqslant 2 \leqslant \frac{2B}{\lambda}. \end{equation*} In the remaining case where $$\omega <\frac{\pi }{2}$$ and $$\widehat{\mathbf{x}}_m \neq 0$$, as can be seen with the help of Fig. 1,   \begin{multline*} \left\|\frac{\widehat{\mathbf{x}}_m}{\|\,\widehat{\mathbf{x}}_m\|_2} - \mathbf{x}\right\|_2 = \frac{\textrm{dist}\left(\mathbf{x},\textrm{span}\left(\,\widehat{\mathbf{x}}_m\right)\right)}{\cos(\omega/2)} \leqslant \frac{\textrm{dist}\left(\mathbf{x},\textrm{span}\left(\,\widehat{\mathbf{x}}_m\right)\right)}{\cos(\pi/4)} \leqslant \frac{\left\|\,\widehat{\mathbf{x}}_m/\lambda - \mathbf{x}\right\|_2}{\cos(\pi/4)} \leqslant \frac{\sqrt{2}B}{\lambda}, \end{multline*}where $$\textrm{dist}\left (\mathbf{x},\textrm{span}\left (\,\widehat{\mathbf{x}}_m\right )\right )$$ denotes the distance of the vector x to the linear span of $$\widehat{\mathbf{x}}_m$$, the first inequality follows from $$\omega <\frac{\pi }{2}$$ and the second inequality follows from the fact that $$\frac{\widehat{\mathbf{x}}_m}{\lambda }$$ is in the linear span of $$\widehat{\mathbf{x}}_m$$. Combining the above two cases completes the proof. Fig. 1. Illustration of the geometric relation between the estimator $$\widehat{\mathbf{x}}_m$$ and the true vector x. 
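The two cases in the proof of Corollary 1.10 both give $$\|\overline{\mathbf{x}}_m-\mathbf{x}\|_2\leqslant 2\|\,\widehat{\mathbf{x}}_m/\lambda-\mathbf{x}\|_2$$ whenever $$\|\mathbf{x}\|_2=1$$ and $$\lambda>0$$. The sketch below checks this deterministic inequality on randomly perturbed vectors. It is an illustration only: the dimension, the value of $$\lambda$$, the perturbation scheme and the helper name `normalize_estimate` are assumptions and do not come from the paper.

```python
import numpy as np


def normalize_estimate(x_hat):
    """Return x_hat / ||x_hat||_2, or the zero vector when x_hat = 0."""
    norm = np.linalg.norm(x_hat)
    return x_hat / norm if norm > 0 else np.zeros_like(x_hat)


rng = np.random.default_rng(1)
d, lam = 20, 0.6                               # hypothetical dimension and scaling
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                         # unit-norm target vector

worst_ratio = 0.0
for _ in range(1000):
    # A stand-in for an estimate: lambda * x plus noise of random size.
    x_hat = lam * x + rng.uniform(0.01, 2.0) * rng.standard_normal(d)
    lhs = np.linalg.norm(normalize_estimate(x_hat) - x)
    rhs = np.linalg.norm(x_hat / lam - x)
    worst_ratio = max(worst_ratio, lhs / rhs)  # should never exceed 2
```

The largest observed ratio stays below 2, consistent with the factor by which the bound of Theorem 1.7 is inflated, after division by $$\lambda$$, in Corollary 1.10.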
Remark 1.11 We compare the result in Corollary 1.10 with Lemma 2.2 of [1], where a nearly identical bound is presented under the additional assumptions that $$\{y_i\}_{i=1}^m$$ take values in {−1, 1}, that $$\theta :\mathbb{R}\rightarrow [-1,1]$$, that K lies in the unit Euclidean ball $$B_2^d$$ and that $$g \sim \mathscr{N}(0,1)$$. Specifically, under the preceding assumptions it is shown that   $$\left\|\,\widehat{\mathbf{x}}_m-\mathbf{x}\right\|_2\leqslant\frac{4\alpha}{\lambda}+C\|a\|_{\psi_2}\frac{\omega(K)+u}{\lambda\sqrt{m}},$$ with probability at least $$1-4\mathrm{e}^{-u^2}$$. Under the normality assumption, $$\left \langle \mathbf{g},\mathbf{x}\right \rangle \sim \mathscr{N}(0,1)$$, and so $$\lambda$$ of (1.6) specializes to $$E[g\theta (g)]$$. Here, we are able to obtain a more general result that allows y to be subgaussian rather than restricting it to lie in {−1, 1}. This generality comes at the cost of an extra term that is of the same order as the terms already present in the bound, and which in particular vanishes as $$m\rightarrow \infty$$. Lastly, when y is allowed to be subgaussian, the variable $$y\left \langle \mathbf{a},\mathbf{t}\right \rangle$$ is subexponential for all $$\mathbf{t}\in \mathbb{R}^d$$, as opposed to being subgaussian as in [1]. This additional generality necessitates a generic chaining argument to obtain the subexponential concentration bound. This paper is organized as follows. In Section 2 we introduce two measures of a distribution’s discrepancy from the normal that have their roots in Stein’s method, see [5,19]. The zero bias distribution is introduced first, being relevant for both Sections 2.1 and 2.2, which consider the cases where $$\theta$$ is a smooth function and the sign function, respectively. Section 2.1 further introduces a discrepancy measure based on Stein coefficients, and Theorem 2.1 provides bounds on $$\alpha$$ of (1.7) in terms of these two measures, when $$\theta$$ is Lipschitz and when it has a bounded second derivative. 
Section 2.1 also defines two specific error models on the Gaussian, an additive one in (2.15) and the other via mixtures in (2.16). Theorem 2.3 shows the behavior of the bound on $$\alpha$$ in these two models as a function of the amount $$\varepsilon \in [0,1]$$ by which the Gaussian is corrupted, the bound tending to zero as $$\varepsilon$$ becomes small. Section 2.2 provides corresponding results when $$\theta$$ is the sign function, specifically in Theorems 2.4 and 2.5. Section 2.3 studies some relationships between the two discrepancy measures applied, and between these measures and the total variation distance. Theorem 1.7 is proved in Section 3. Proofs of some results used earlier are postponed to the Appendix, Sections A and B. 2. Discrepancy bounds via Stein’s method Here we introduce two measures of the sensing distribution’s proximity to normality that can be used to bound $$\alpha$$ in (1.7). In Sections 2.1 and 2.2 we consider the cases where $$\theta$$ is a Lipschitz function and the sign function, respectively; the difference in the degree of smoothness in these two cases necessitates the use of different ways of measuring the discrepancy to normality. An observation that will be useful in both settings is that by definition (1.5), for any $$\mathbf{t}\in \mathbb{R}^d$$, we have   \begin{align} &E\left[\,f_{\mathbf{x}}(\mathbf{t})\right]=E\left[\,y\langle\mathbf{a},\mathbf{t}\rangle\right]=E\left[E\left[\,y\langle\mathbf{a},\mathbf{t}\rangle\,|\,\mathbf{a}\right]\right] =E\left[\langle\mathbf{a},\mathbf{t}\rangle \theta(\langle\mathbf{a},\mathbf{x}\rangle)\right]=\langle\mathbf{v}_{\mathbf{x}},\mathbf{t}\rangle \nonumber\\ &\qquad \qquad\mbox{where} \quad\mathbf{v}_{\mathbf{x}}=E\left[\mathbf{a}\theta(\langle\mathbf{a},\mathbf{x}\rangle)\right]\!. 
\end{align} (2.1) Specializing to the case where t = x, we may therefore express $$\lambda$$ in (1.6) as   $$\lambda=\langle\mathbf{v}_{\mathbf{x}},\mathbf{x}\rangle = E \left[\langle\mathbf{a},\mathbf{x}\rangle \theta(\langle\mathbf{a},\mathbf{x}\rangle)\right]\!.$$ (2.2) In the settings of both Sections 2.1 and 2.2, we require facts regarding the zero bias distribution and depend on [13] or [5] for properties stated below. With $${\mathcal L}(\cdot )$$ denoting distribution or law, given a mean zero distribution $${\mathcal L}(a)$$ with finite, non-zero variance $$\sigma ^2$$, there exists a unique law $${\mathcal L}(a^{\ast })$$, termed the ‘a-zero bias’ distribution, characterized by the satisfaction of   $$E\, [af(a)]=\sigma^2 E\,[f^{\prime}(a^{\ast})] \quad \mbox{for all Lipschitz functions $$f$$.} \quad$$ (2.3) The existence of the variance of a, and hence also its second moment, guarantees that the expectation on the left, and hence also on the right, exists. Letting   \begin{equation*} {\textrm{Lip}}_1=\big\{g:\mathbb{R} \rightarrow \mathbb{R} \quad \mbox{satisfying} \quad \left|g(y)-g(x)\right| \leqslant |y-x|\big\}, \end{equation*}we recall that the Wasserstein or $$L^1$$ distance between the laws $$\mathscr{L}(X)$$ and $$\mathscr{L}(Y)$$ of two random variables X and Y can be defined as   \begin{equation*} d_1\left(\mathscr{L}(X),\mathscr{L}(Y)\right) = \sup_{f \in{\textrm{Lip}}_1} \big|Ef(X)-Ef(Y)\big|, \end{equation*}or alternatively as   $$d_1\left(\mathscr{L}(X),\mathscr{L}(Y)\right) = \inf_{(X,Y)}E|X-Y|,$$ (2.4) where the infimum is over all couplings (X, Y) of random variables having the given marginals. The infimum is achievable for real-valued random variables, see [16]. Now we define our first discrepancy measure by   $$\gamma_{\mathscr{L}(a)} = d_1(a,a^{\ast}).$$ (2.5) Stein’s characterization [19] of the normal yields that $$\mathscr{L}(a^{\ast })=\mathscr{L}(a)$$ if and only if a is a mean zero normal variable. 
Further, with some abuse of notation, writing $$\gamma _a$$ for (2.5) for simplicity, Lemma 1.1 of [11] yields that if a has mean zero, variance 1 and finite third moment, then   $$\gamma_a \leqslant \frac{1}{2}E|a|^3,$$ (2.6)so in particular $$\gamma _a < \infty$$ whenever a has a finite third moment. In the case where $$Y_1,\ldots ,Y_n$$ are independent mean zero random variables with finite, non-zero variances $$\sigma _1^2,\ldots ,\sigma _n^2$$, having sum $$Y=\sum _{i=1}^n Y_i$$ with variance $$\sigma ^2$$, we may construct $$Y^{\ast }$$ with the Y-zero biased distribution by letting   $$Y^{\ast}=Y-Y_I+Y_I^{\ast} \quad \mbox{where} \quad P[I=i]=\frac{\sigma_i^2}{\sigma^2},$$ (2.7)where $$Y_i^{\ast }$$ has the $$Y_i$$-zero biased distribution and is independent of $$Y_j, j \not =i$$, and where the random index I is independent of $$\{Y_i,Y_i^{\ast }, i=1,\ldots ,n\}$$. We will also make use of the fact that for any c≠0   $$\mathscr{L}((ca)^{\ast})=\mathscr{L}(ca^{\ast}).$$ (2.8) 2.1 Lipschitz functions When $$\theta$$ is a Lipschitz function inequality (2.12) of Theorem 2.1 below gives a bound on $$\alpha$$ in (1.7) in terms of Stein coefficients. We say T is a Stein coefficient, or Stein kernel, for a random variable X with finite, non-zero variance when   $$E\left[X\,f(X)\right]=E\left[T\,f^{\prime}(X)\right]$$ (2.9)for all Lipschitz functions f. Specializing (2.9) to the cases where f(x) = 1 and f(x) = x, we find   $$E[X]=0 \quad \mbox{and} \quad{\mathrm Var}(X)= E[T].$$ (2.10) By Stein’s characterization [19], the distribution of X is normal with mean zero and variance $$\sigma ^2$$ if and only if $$T=\sigma ^2$$. Correspondingly, for unit variance random variables we will define our second discrepancy measure as E|1 − T|. If c is a non-zero constant and $$T_X$$ is a Stein coefficient for X, then $$c^2T_X$$ is a Stein coefficient for Y = cX. 
Indeed, with h(x) = f(cx) below we obtain   $$E\left[Yf(Y)\right]=cE\left[Xf(cX)\right]=cE\left[Xh(X)\right]=cE\left[T_X h^{\prime}(X)\right]=cE\left[cT_Xf^{\prime}(cX)\right] =E\left[c^2T_Xf^{\prime}(Y)\right].$$ (2.11) Stein coefficients first appeared in the work of [3], and were further developed in [4] for random variables that are functions of Gaussians; we revisit this latter point in Section 2.3. The following result considers two separate sets of hypotheses on the unknown function $$\theta$$ and the sensing distribution a. The assumptions leading to the bound (2.12) require fewer conditions on $$\theta$$ and more on a as compared to those leading to (2.13). That is, though Stein coefficients may fail to exist for certain mean zero, variance one distributions, discrete ones in particular, the zero bias distribution here exists for all. We note that by Stein’s characterization, when a is standard normal we may take T = 1 in (2.12) and $$\gamma _a=0$$ in (2.13), and hence $$\alpha =0$$ in both the cases considered in the theorem that follows. The bound (2.13) also returns zero discrepancy in the special case where $$\theta$$ is linear and thus recovers the results on linear compressed sensing [17] when combined with Theorem 1.7. For a real-valued function f with domain D let   \begin{equation*} \|\,f\|=\sup_{x \in D}\left|\,f(x)\right|\!. \end{equation*} Theorem 2.1 Let a be a mean zero, variance one random variable and set $$\boldsymbol{a}=(a_1,\ldots ,a_d)$$ with $$a_1,\ldots ,a_d$$ independent random variables distributed as a, and let $$\alpha$$ be as in (1.7). 
(a) If $$\theta \in \textrm{Lip}_1$$ and a has Stein coefficient T, then   $$\alpha \leqslant E|1-T|.$$ (2.12) (b) If $$\theta$$ possesses a bounded second derivative, then   $$\alpha \leqslant \|\theta^{\prime\prime}\| \gamma_a.$$ (2.13) Remark 2.2 In [1] the quantity $$\alpha$$ is bounded in terms of the total variation distance $$d_{\mathrm TV}(a,g)$$ between a and the standard Gaussian distribution g. In particular, for $$\theta \in C^2$$, Proposition 5.5 of [1] yields   $$\alpha \leqslant 8(Ea^6+Eg^6)^{1/2}\left(\|\theta^{\prime}\|+\|\theta^{\prime\prime}\|\right)\sqrt{d_{\mathrm TV}(a,g)}.$$ (2.14) In contrast, the upper bound (2.12) does not depend on any moments of a, requires $$\theta$$ to be only once differentiable, and in typical cases where $$d_{\mathrm TV}(a,g)$$ and E|1 − T| are of the same order, that is, when the upper bound in Lemma 2.10 is of the correct order, $$\alpha$$ in (2.12) is bounded by a first power rather than the larger square root in (2.14). When $$\theta$$ possesses a bounded second derivative, the upper bound (2.13) improves on (2.14) in terms of constant factors, requirements on the existence of moments and dependence on a first power rather than a square root. In this case Lemma 2.11 shows $$d_{\mathrm TV}(a,g)$$ and $$\gamma _a$$ are of the same order when a has bounded support. Measuring discrepancy from normality in terms of E|1 − T| and $$\gamma _a$$ also has the advantage of being tractable when each component of the Gaussian sensing vector g has been independently corrupted at the level of some $$\varepsilon \in [0,1]$$ by a non-Gaussian, mean zero, variance one distribution a. In the two models we consider we let the sensing vector have i.i.d. entries, and hence only specify the distribution of its components. 
The first model is the case of additive error, where each component of the sensing vector is of the form   $$g_\varepsilon = \sqrt{1-\varepsilon}g+\sqrt{\varepsilon} a$$ (2.15)with a independent of g, with the second one being the mixture model where each component has been corrupted due to some ‘bad event’ A that substitutes g with a so that   $$g_\varepsilon=g\boldsymbol{1}_{A^c} + a\boldsymbol{1}_A,$$ (2.16)where A occurs with probability $$\varepsilon$$, independently of g, a and a given Stein coefficient T for a. Since   $$E\left[Tf^{\prime}(a)\right]=E \left[E[T|a]\ f^{\prime}(a)\right]\!,$$ (2.17)we see that E[T|a] is a Stein coefficient for a. Hence, upon replacing T by E[T|a] only the independence of A from {g, a} is required. Theorem 2.3 shows that under both scenarios (a) and (b) considered in Theorem 2.1, and further, under both the additive and mixture models, the value $$\alpha$$ can be bounded explicitly in terms of a quantity that vanishes in $$\varepsilon$$. Further, we note that both error models agree with each other, and with the model of Theorem 2.1, when $$\varepsilon =1$$, so that Theorem 2.3 recovers Theorem 2.1 when so specializing. We now present Theorem 2.3 followed by its proof, then the proof of Theorem 2.1. Theorem 2.3 Under condition (a) of Theorem 2.1, under both the additive (2.15) and mixture (2.16) error models, we have   \begin{equation*} \alpha \leqslant \varepsilon E|1-T|. 
\end{equation*} As regards the measure $$\gamma _a$$ in (b) of Theorem 2.1, under the additive error model (2.15),   $$\gamma_{g_\varepsilon} \leqslant \varepsilon^{3/2} \gamma_a, \quad \mbox{and when $\theta$ has a bounded second derivative,} \quad \alpha \leqslant \varepsilon^{3/2}\|\theta^{\prime\prime}\| \gamma_a,$$ (2.18) and under the mixture error model (2.16),   $$\gamma_{g_\varepsilon} \leqslant \varepsilon \gamma_a, \quad \mbox{and when $\theta$ has a bounded second derivative,} \quad \alpha \leqslant \varepsilon \|\theta^{\prime\prime}\| \gamma_a.$$ (2.19) Proof. By the assumptions of independence and on the mean and variance of a and g, in both error models $$g_\varepsilon$$ has mean zero and variance 1. As the components of the sensing vector are i.i.d. by construction, the hypotheses on a in Theorem 2.1 hold. First consider scenario (a) under the additive error model. If a random variable W is the sum of two independent mean zero variables X and Y with finite variances, and Stein coefficients $$T_X$$ and $$T_Y$$, respectively, then for any Lipschitz function f one has   \begin{align*} E\left[Wf(W)\right]&=E\left[(X+Y)f(X+Y)\right]= E\left[Xf(X+Y)\right]+E\left[Yf(X+Y)\right]\\ &= E\left[T_Xf^{\prime}(X+Y)\right]+E\left[T_Yf^{\prime}(X+Y)\right] =E\left[(T_X+T_Y)f^{\prime}(X+Y)\right]\\ &= E\left[T_Wf^{\prime}(W)\right] \quad \mbox{where {$T_W=T_X+T_Y$},} \quad \end{align*}showing that Stein coefficients are additive for independent summands. In particular, now also using (2.11), we see that the Stein coefficient $$T_\varepsilon$$ for $$g_\varepsilon$$ in (2.15) is given by $$T_\varepsilon = 1-\varepsilon + \varepsilon T$$, where T is the given Stein coefficient for a. As $$1-T_\varepsilon =\varepsilon (1-T)$$, the first claim of the theorem follows by applying Theorem 2.1. 
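The additivity argument can be checked by simulation. The sketch below is an illustration under assumptions not made in the paper: the corrupting variable a is taken to be uniform on $$[-\sqrt{3},\sqrt{3}]$$, for which integration by parts gives the Stein coefficient $$T=(3-a^2)/2$$ and $$E|1-T|=2/(3\sqrt{3})$$; the test function f = sin, the level $$\varepsilon =0.3$$ and the sample size are likewise arbitrary choices. It compares Monte Carlo estimates of the two sides of (2.9) for $$g_\varepsilon$$ with $$T_\varepsilon =1-\varepsilon +\varepsilon T$$, and of $$E|1-T_\varepsilon |$$ with $$\varepsilon E|1-T|$$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 400_000, 0.3                          # sample size, corruption level (illustrative)

c = np.sqrt(3.0)
a = rng.uniform(-c, c, n)                      # mean zero, variance one corruption
g = rng.standard_normal(n)                     # standard Gaussian, independent of a
T = (3.0 - a**2) / 2.0                         # Stein coefficient of this uniform law

g_eps = np.sqrt(1.0 - eps) * g + np.sqrt(eps) * a   # additive model (2.15)
T_eps = 1.0 - eps + eps * T                          # claimed Stein coefficient of g_eps

f, f_prime = np.sin, np.cos                    # a Lipschitz test function and its derivative
lhs = np.mean(g_eps * f(g_eps))                # estimates E[g_eps f(g_eps)]
rhs = np.mean(T_eps * f_prime(g_eps))          # estimates E[T_eps f'(g_eps)]

discrepancy = np.mean(np.abs(1.0 - T_eps))     # estimates E|1 - T_eps|
exact = eps * 2.0 / (3.0 * np.sqrt(3.0))       # eps * E|1 - T| for this uniform law
```

Up to Monte Carlo error the two sides of the Stein identity agree, and the estimated discrepancy matches $$\varepsilon E|1-T|$$.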
For the mixture model, by the independence between A and {a, g, T},   \begin{align*} E\left[g_\varepsilon f(g_\varepsilon)\right] &= (1-\varepsilon)E\left[gf(g)\right]+\varepsilon E\left[af(a)\right] = (1-\varepsilon)E\left[f^{\prime}(g)\right]+\varepsilon E\left[Tf^{\prime}(a)\right] \\ &=E\left[\boldsymbol{1}_{A^c}f^{\prime}(g)+T\boldsymbol{1}_Af^{\prime}(a)\right]= E\left[\boldsymbol{1}_{A^c}f^{\prime}(g_\varepsilon)+T\boldsymbol{1}_Af^{\prime}(g_\varepsilon)\right]\\ &= E\left[T_\varepsilon f^{\prime}(g_\varepsilon)\right] \quad \mbox{where} \quad T_\varepsilon =\boldsymbol{1}_{A^c} + T\boldsymbol{1}_A. \end{align*}Hence, the bound just shown for the additive model is seen to hold also for the mixture model by applying Theorem 2.1 and observing that $$1-T_\varepsilon =\boldsymbol{1}_A(1-T)$$, and recalling the independence between T and A. Now consider scenario (b) under the additive error model. Identity (2.7) says one may construct the zero bias distribution of a sum of independent terms by choosing a summand with probability proportional to its variance and replacing it by a variable independent of the remaining summands, and having the chosen summand’s zero bias distribution, where the replacement is done independently of all else. As the two summands in (2.15) have variance $$1-\varepsilon$$ and $$\varepsilon$$, we choose them for replacement with these probabilities, respectively. 
Hence, letting B be the event that a is chosen, we see   \begin{equation*} g_\varepsilon^{\ast} = \left(\sqrt{1-\varepsilon}g^{\ast}+\sqrt{\varepsilon} a\right)\boldsymbol{1}_{B^c} + \left(\sqrt{1-\varepsilon}g+\sqrt{\varepsilon} a^{\ast}\right)\boldsymbol{1}_B = \sqrt{1-\varepsilon} g + \sqrt{\varepsilon} \left(a\boldsymbol{1}_{B^c} + a^{\ast}\boldsymbol{1}_B\right) \end{equation*}has the $$g_\varepsilon$$-zero bias distribution, where for the first equality we have applied (2.8), yielding $$(\sqrt{1-\varepsilon } g)^{\ast }=_d\sqrt{1-\varepsilon } g^{\ast }$$ and likewise $$(\sqrt{\varepsilon } a)^{\ast }=_d\sqrt{\varepsilon } a^{\ast }$$, and used that the standard normal is a fixed point of the zero bias transformation for the second. In addition, we construct $$a^{\ast }$$ to have the a-zero bias distribution, be independent of g and B and achieve the infimum $$d_1(\mathscr{L}(a),\mathscr{L}(a^{\ast }))$$ in (2.4), that is, giving the coupling that minimizes $$E|a^{\ast }-a|$$. We now obtain   \begin{equation*} g_\varepsilon^{\ast}-g_\varepsilon = \sqrt{1-\varepsilon} g + \sqrt{\varepsilon} \left( a\boldsymbol{1}_{B^c} + a^{\ast}\boldsymbol{1}_B\right) -\left(\sqrt{1-\varepsilon} g + \sqrt{\varepsilon} a\right)=\sqrt{\varepsilon}(a^{\ast}-a)\boldsymbol{1}_B. \end{equation*}As the Wasserstein distance is the infimum (2.4) over all couplings between $$g_\varepsilon$$ and $$g_\varepsilon ^{\ast }$$, using that B is independent of a and $$a^{\ast }$$ and that $$P(B)=\varepsilon$$, we have   \begin{equation*} \gamma_{g_\varepsilon} = d_1\left(g_\varepsilon,g_\varepsilon^{\ast}\right) \leqslant E\left|g_\varepsilon^{\ast}-g_\varepsilon\right| = \sqrt\varepsilon E\left|a^{\ast}-a\right|P(B) = \varepsilon^{3/2}\gamma_a. \end{equation*}The proof of (2.18), the first claim under (b), can now be completed by applying (2.13). Continuing under scenario (b), again consider the mixture model (2.16). 
By Theorem 2.1 of [11], as Var(a) = Var(g), the variable   \begin{equation*} g_\varepsilon^{\ast}= g^{\ast}\boldsymbol{1}_{A^c} + a^{\ast}\boldsymbol{1}_A= g\boldsymbol{1}_{A^c} + a^{\ast}\boldsymbol{1}_A \end{equation*}has the $$g_\varepsilon$$ zero bias distribution, where we again take $$g^{\ast }$$ and $$a^{\ast }$$ as in the previous construction. Hence, arguing as for the additive error model, we obtain the bound   \begin{equation*} \gamma_{g_\varepsilon} \leqslant E\left|g_\varepsilon^{\ast}-g_\varepsilon\right| =E{\left[|a^{\ast}-a|\boldsymbol{1}_A\right]} = \varepsilon \gamma_a. \end{equation*}The second claim under (b) now follows as the first. Proof of Theorem 2.1. Recalling that x is a unit vector, for any $$\boldsymbol{t} \in B_2^d$$ the vectors x and v = t −⟨x, t⟩x are perpendicular. If v≠0 set $$\boldsymbol{x}^\perp$$ to be the unit vector in direction v, and let $$\boldsymbol{x}^\perp$$ be zero otherwise. These vectors produce an orthogonal decomposition of any $$\boldsymbol{t} \in B_2^d$$ as   $$\boldsymbol{t}= \left\langle\boldsymbol{x},\boldsymbol{t}\right\rangle\boldsymbol{x} + \left\langle\boldsymbol{x}^\perp,\boldsymbol{t}\right\rangle\boldsymbol{x}^\perp.$$ (2.20)Defining   \begin{equation*} Y=\langle\boldsymbol{a},\boldsymbol{x} \rangle \quad \mbox{and} \quad Y^\perp=\left\langle\boldsymbol{a},\boldsymbol{x}^\perp\right\rangle, \end{equation*}using the decomposition (2.20) in (2.1) and the expression for $$\lambda$$ in (2.2) yields   \begin{align*} E\left[\,f_{\boldsymbol{x}}(\boldsymbol{t})\right] =&\, E\left[\left\langle\boldsymbol{a},\boldsymbol{t}\right\rangle\theta(\left\langle\boldsymbol{a},\boldsymbol{x}\right\rangle)\right] = \left\langle\boldsymbol{x},\boldsymbol{t}\right\rangle E\left[\left\langle\boldsymbol{a},\boldsymbol{x}\right\rangle \theta\left(\left\langle\boldsymbol{a},\boldsymbol{x}\right\rangle\right)\right]+\left\langle\boldsymbol{x}^\perp,\boldsymbol{t} \right\rangle 
E\left[\left\langle\boldsymbol{a},\boldsymbol{x}^\perp \right\rangle \theta(\left\langle\boldsymbol{a},\boldsymbol{x}\right\rangle)\right] \\ =&\, \lambda \left\langle\boldsymbol{x},\boldsymbol{t}\right\rangle +\left\langle\boldsymbol{x}^\perp,\boldsymbol{t} \right\rangle E\left[Y^\perp \theta(Y)\right]. \end{align*}As $$\|\mathbf{x}^\perp \|_2$$ and $$\|\boldsymbol{t}\|_2$$ are at most one, applying the Cauchy–Schwarz inequality we obtain the following bound, which is used in both cases:   $$\big|E\left[\,f_{\boldsymbol{x}}(\boldsymbol{t})\right]-\lambda \left\langle\boldsymbol{x},\boldsymbol{t} \right\rangle\!\big| \leqslant \big| E\left[Y^\perp \theta(Y)\right] \big|.$$ (2.21) We determine a Stein coefficient for $$Y^\perp$$ as follows. For $$T_i$$ Stein coefficients for $$a_i$$, independent and identically distributed as the given T for all i = 1, …, d, by conditioning on $$Y-x_ia_i$$, a function of $$\{a_j, j \not = i\}$$ and therefore independent of $$a_i$$, using the scaling property (2.11), we have   \begin{align} E\left[x_i^\perp a_i\theta(Y)\right]&=E\left[ x_i^\perp a_i \theta\left(x_i a_i + (Y-x_ia_i)\right)\right]=E\left[ x_i^\perp x_i T_i \theta^{\prime}\left(x_i a_i + (Y-x_ia_i)\right)\right]\nonumber\\ &=E\left[ x_i^\perp x_i T_i \theta^{\prime}(Y)\right]. \end{align} (2.22)Hence,   $$E\left[Y^\perp \theta(Y)\right]=\sum_{i=1}^d E\left[x_i^\perp a_i \theta(Y)\right]=E\left[T_{Y^\perp} \theta^{\prime}(Y)\right]\ \mbox{where} \ \ T_{Y^\perp}= \sum_{i=1}^d x_i^\perp x_iT_i= \sum_{i=1}^d x_i^\perp x_i(T_i-1),$$ (2.23)where the last equality follows from $$\left \langle \mathbf{x},\mathbf{x}^\perp \right \rangle =0$$. 
Now from (2.21) and (2.23), we have   \begin{align*} \big|E\left[\,f_{\boldsymbol{x}}(\boldsymbol{t})\right]-\lambda \left\langle\boldsymbol{x},\boldsymbol{t} \right\rangle\!\big| &\leqslant \left|E\left[T_{Y^\perp} \theta^{\prime}(Y)\right]\right|\\ &\leqslant E|T_{Y^\perp}| \leqslant \sum_{i=1}^d \left|x_i^\perp x_i \right| E|T-1| \leqslant\|\mathbf{x}^\perp\|_2\|\mathbf{x}\|_2 E|T-1|\leqslant E|T-1|, \end{align*}using $$\theta \in \textrm{Lip}_1$$ in the second inequality, followed by (2.23) again and the Cauchy–Schwarz inequality, noting that $$\|\mathbf{x}^\perp \|_2 \le 1$$ and $$\|\mathbf{x}\|_2=1$$. Hence, we obtain   \begin{equation*} \big|E\left[\,f_{\boldsymbol{x}}(\boldsymbol{t})\right]-\lambda \left\langle\boldsymbol{x},\boldsymbol{t} \right\rangle\!\big| \leqslant E|T-1| \quad \mbox{for all $\boldsymbol{t} \in B_2^d$,} \quad \end{equation*}which completes the proof of (2.12) in light of the definition (1.7) of $$\alpha$$. In a similar fashion, if $$\theta$$ is twice differentiable with bounded second derivative, then in place of (2.22) for every i = 1, …, d we may write   \begin{equation*} E\left[x_i^\perp a_i\theta(Y)\right]=E\left[ x_i^\perp a_i \theta(x_i a_i + (Y-x_ia_i))\right]=E\left[ x_i^\perp x_i \theta^{\prime}\left(x_i a_i^{\ast} + (Y-x_ia_i)\right)\right], \end{equation*}where $$a_i,a_i^{\ast }$$ are constructed on the same space to be an optimal coupling, in the sense of achieving the infimum of $$E|a^{\ast }-a|$$. 
Hence,   \begin{align} E\left[Y^\perp \theta(Y)\right]=\sum_{i=1}^d E\left[x_i^\perp a_i \theta(Y)\right]&=\sum_{i=1}^d E\left[ x_i^\perp x_i \theta^{\prime}\left(x_i a_i^{\ast} + (Y-x_ia_i)\right)\right] \nonumber\\ &= \sum_{i=1}^d E\left[ x_i^\perp x_i \left(\theta^{\prime}\left(x_i a_i^{\ast} + (Y-x_ia_i)\right) - \theta^{\prime}(Y)\right) \right]\nonumber \\&= \sum_{i=1}^d E\left[ x_i^\perp x_i \left(\theta^{\prime}\left(x_i a_i^{\ast} + (Y-x_ia_i)\right) - \theta^{\prime}\left(x_i a_i + (Y-x_ia_i)\right)\right) \right],\end{align} (2.24)where in the third equality we have used $$\langle \boldsymbol{x}^\perp ,\boldsymbol{x} \rangle =0$$, as in (2.23). The proof of (2.13) is completed by applying (2.21) and (2.24) to obtain   \begin{align*} \big|E\left[\,f_{\boldsymbol{x}}(\boldsymbol{t})\right]-\lambda \left\langle\boldsymbol{x},\boldsymbol{t} \right\rangle\!\big| &\leqslant \left| E\left[Y^\perp \theta(Y)\right] \right| \leqslant \|\theta^{\prime\prime}\| \sum_{i=1}^d E \left| x_i^\perp x_i^2 \left(a_i^{\ast}-a_i\right) \right| \leqslant \|\theta^{\prime\prime}\| \gamma_a \sum_{i=1}^d \left| x_i^\perp x_i^2 \right| \\ &\leqslant \|\theta^{\prime\prime}\| \gamma_a \sum_{i=1}^d \left| x_i^\perp x_i \right| \le \|\theta^{\prime\prime}\| \gamma_a, \end{align*}where we have applied the mean value theorem for the second inequality, the fact that the infimum in (2.4) is achieved for the third, that $$\|\boldsymbol{x}\|_2=1$$ for the fourth and the Cauchy–Schwarz inequality for the last. 2.2 Sign function In this section we consider the case where $$\theta$$ is the sign function given by   \begin{equation*} \theta(x)=\begin{cases}-1 & x <0\\ \hfill1 & x \geqslant 0. \end{cases} \end{equation*}The motivation comes from the one-bit compressed sensing model, see [1] for a more detailed discussion. The following result shows how $$\alpha$$ of (1.7) can be bounded in terms of the discrepancy measure $$\gamma _a$$ introduced in Section 2.1. 
Throughout this section set   \begin{equation*} c_1=\sqrt{2/\pi}-1/2. \end{equation*}We continue to assume that the unknown vector x has unit Euclidean length. In the following, we say a random variable a is symmetric if the distributions of a and −a are equal. Theorem 2.4 Let $$\theta$$ be the sign function, a have a symmetric distribution and $$\gamma _a$$ as defined in (2.5). If $$\|\boldsymbol{x}\|_3^3 \leqslant c_1/\gamma _a$$ and $$\|\boldsymbol{x}\|_\infty \leqslant 1/2$$, then $$\alpha$$ defined in (1.7) satisfies   $$\alpha \leqslant \left(10 \gamma_a E|a|^3 \|\boldsymbol{x}\|_\infty \right)^{1/2}.$$ (2.25) Under the condition that $$\|\boldsymbol{x}\|_\infty \leqslant c/E|a|^3$$ for some c > 0, Proposition 4.1 of [1] yields the existence of a constant C such that   $$\alpha \leqslant CE|a|^3 \|\boldsymbol{x}\|_\infty^{1/2}.$$ (2.26)Theorem 2.4 improves (2.26) by introducing the factor of $$\gamma _a$$ in the bound, thus providing a right-hand side that takes the value 0 when a is normal. Applying the inequality $$\gamma _a \leqslant E|a|^3/2$$ in (2.6) to (2.25) in the case where a has finite third moment recovers (2.26) with C assigned the specific value of $$\sqrt{5}$$. In terms of the total variation distance between a and the Gaussian g, Proposition 5.2 in [1] provides the bound   \begin{equation*} \alpha \leqslant C (Ea^4)^{1/8} d_{\mathrm{TV}}(a,g)^{1/8} \end{equation*}depending on an unspecified constant and an eighth root. For distributions where $$\gamma _a$$ is comparable to the total variation distance, see Section 2.3, the bound of Theorem 2.4 is preferable both in its dependence on the distance between a and g and in being explicit. Now we derive bounds on $$\alpha$$ defined in (1.7) for the two error models introduced in Section 2.1. As in Theorem 2.3, the bounds vanish as $$\varepsilon$$ tends to zero. We note that Theorem 2.4 is recovered as the special case $$\varepsilon =1$$ for both models considered. 
For comparison, in view of the relation between (2.25) of Theorem 2.4 and (2.26), for these error models the bounds one obtains from the latter are the same as the ones below, but with the factor $$\gamma _a$$ replaced by $$C=\sqrt{5}$$ by virtue of (2.6), and with the cubic term, which gives a bound on the third absolute moment of the $$\varepsilon$$-contaminated distribution, appearing outside the square root. Theorem 2.5 In the additive and mixture error models (2.15) and (2.16), the bound of Theorem 2.4 becomes, respectively,   \begin{equation*} \alpha \leqslant \left(10\varepsilon^{3/2}\gamma_a\left(\sqrt{1-\varepsilon}\left(\sqrt{\frac8\pi}\right)^{1/3}+\sqrt{\varepsilon}E{\left[|a|^3\right]}^{1/3}\right)^3\|\mathbf{x}\|_{\infty}\right)^{1/2} \end{equation*}and   \begin{equation*} \alpha \leqslant \left(10\varepsilon\gamma_a\left(\left((1-\varepsilon) \sqrt{\frac8\pi}\right)^{1/3}+E{\left[\varepsilon|a|^3\right]}^{1/3}\right)^3\|\mathbf{x}\|_{\infty}\right)^{1/2}. \end{equation*} We first demonstrate the proof of Theorem 2.4, starting with a series of lemmas. Lemma 2.6 For any mean zero, variance 1 random variable a and any $$\mathbf{x} \in B_2^d$$,   $$\left| \left\langle\mathbf{v}_{\mathbf{x}},\mathbf{x}\right\rangle - \sqrt{\frac2\pi} \right|\leqslant \gamma_a \|\mathbf{x}\|_3^3,$$ (2.27)where $$\mathbf{v}_{\mathbf{x}}=E[\mathbf{a}\theta (\langle \mathbf{a},\mathbf{x} \rangle )]$$ as in (2.1). The inequality in Lemma 2.6 should be compared to Lemma 5.3 of [1], where the bound on the quantity in (2.27) is in terms of the fourth root of the total variation distance between a and g and their fourth moments. Proof. It is direct to verify that $$E|g|=\sqrt{2/\pi }$$ for $$g \sim \mathscr{N}(0,1)$$. 
In Lemma B.1 in Appendix B, we show that when taking f to be the unique bounded solution to the Stein equation   $$f^{\prime}(x)-xf(x)=|x|-\sqrt{\frac{2}{\pi}},$$ (2.28) we have $$\|\,f^{\prime\prime}\|_\infty =1$$, where $$\|\cdot \|_\infty$$ is the essential supremum. Hence, for a mean zero, variance one random variable Y, using that sets of measure zero do not affect the integral below, we have   \begin{align*} |E|Y|-E|g||&=\big|E\left[\,f^{\prime}(Y)-Yf(Y)\right]\!\big|=\big|E\left[\,f^{\prime}(Y)-f^{\prime}(Y^{\ast})\right]\!\big| =\left| E{\left[\int_Y^{Y^{\ast}} f^{\prime\prime}(u)\,\mathrm{d}u\right]}\right| \\ &\leqslant \|\,f^{\prime\prime}\|_{\infty} E|Y^{\ast}-Y|=E|Y^{\ast}-Y|, \end{align*}where $$Y^{\ast }$$ is any random variable on the same space as Y, having the Y-zero biased distribution. As $$\theta$$ is the sign function   \begin{equation*} \left\langle\mathbf{v}_{\mathbf{x}},\mathbf{x}\right\rangle = E\left[\langle\mathbf{a},\mathbf{x} \rangle \theta( \langle\mathbf{a},\mathbf{x} \rangle ) \right]=E\big|\!\langle\mathbf{a},\mathbf{x} \rangle\!\big| \quad \mbox{and hence} \quad \left|\left\langle\mathbf{v}_{\mathbf{x}},\mathbf{x}\right\rangle - \sqrt{\frac2\pi}\right|=\big|E|\langle\mathbf{a},\mathbf{x} \rangle|- E|g|\big|. \end{equation*} For the case at hand, let $$Y=\langle \boldsymbol{a},\boldsymbol{x} \rangle = \sum _{i=1}^n x_i a_i$$, where $$a_1,\ldots ,a_n$$ are independent and identically distributed as a, having mean zero and variance 1, and recall $$\|\boldsymbol{x}\|_2=1$$. Then with $$P[I=i]=x_i^2$$, taking $$(a_i,a_i^{\ast })$$ to achieve the infimum in (2.4), that is, so that $$E|a_i^{\ast }-a_i|=d_1(a_i,a_i^{\ast })$$, by (2.7), we obtain   \begin{equation*} E|Y^{\ast}-Y|=E\big|x_I\left(a_I^{\ast}-a_I\right)\!\big| = \sum_{i=1}^n |x_i|^3 \gamma_{a_i} = \gamma_a \|\boldsymbol{x}\|_3^3, \end{equation*}as desired. We now provide a version of Lemma 4.4 of [1] in terms of $$\gamma _a$$ and specific constants. 
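As a sanity check of (2.27), the bound of Lemma 2.6 can be evaluated exactly in a simple case. The sketch below is illustrative and rests on assumptions not made in the text: a is taken to be a Rademacher variable, whose zero bias distribution is uniform on [−1, 1], so that $$\gamma _a=\int |F_a-F_{a^{\ast }}|=1/2$$, and x is the unit vector with all entries equal to $$d^{-1/2}$$, for which $$\|\boldsymbol{x}\|_3^3=d^{-1/2}$$. The function and constant names are hypothetical.

```python
import math

def mean_abs_rademacher_sum(d):
    """Exact E|<a, x>| for x = (1, ..., 1)/sqrt(d) and i.i.d. Rademacher entries."""
    total = 0  # integer accumulation keeps the binomial sum exact
    for k in range(d + 1):  # k = number of entries equal to +1
        total += math.comb(d, k) * abs(2 * k - d)
    return total / (2 ** d * math.sqrt(d))

# Assumed value: Wasserstein distance between Rademacher and Uniform[-1, 1].
GAMMA_RADEMACHER = 0.5

for d in (1, 4, 16, 64, 256):
    gap = abs(mean_abs_rademacher_sum(d) - math.sqrt(2.0 / math.pi))
    bound = GAMMA_RADEMACHER / math.sqrt(d)  # gamma_a * ||x||_3^3
    print(f"d={d:4d}  gap={gap:.5f}  bound={bound:.5f}")
```

In each row the printed gap should not exceed the printed bound, in line with Lemma 2.6.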
Lemma 2.7 The vector $$\mathbf{v}_{\mathbf{x}}$$ in (2.1) satisfies $$\|\mathbf{v}_{\mathbf{x}}\|_2 \leqslant 1$$, and if $$\|\mathbf{x}\|_3^3 \leqslant c_1/\gamma _a$$ where $$c_1=\sqrt{2/\pi }-1/2$$, then   $$\frac{1}{2} \leqslant \|\mathbf{v}_{\mathbf{x}}\|_2.$$ Proof. The upper bound follows as in the proof of Lemma 4.4 in [1]. Slightly modifying the lower bound argument there through the use of Lemma 2.6 for the second inequality below, we obtain   \begin{equation*} \|\mathbf{v}_{\mathbf{x}}\|_2 = \|\mathbf{v}_{\mathbf{x}}\|_2 \|\mathbf{x}\|_2 \geqslant \big|\!\langle\mathbf{v}_{\mathbf{x}},\mathbf{x} \rangle \!\big| \geqslant \sqrt{\frac2\pi} - \gamma_a \|\mathbf{x}\|_3^3 \geqslant \sqrt{\frac2\pi} -c_1= 1/2. \end{equation*} Next we provide a version of Lemma 4.5 of [1] with the explicit constant 2, following the proof there, and impose a symmetry assumption on a that was used implicitly. Lemma 2.8 If $$\|\mathbf{x}\|_\infty \leqslant 1/2$$ and a has a symmetric distribution, then the vector $$\mathbf{v}_{\mathbf{x}}$$ in (2.1) satisfies   \begin{equation*} \|\mathbf{v}_{\mathbf{x}}\|_\infty \leqslant 2 E|a|^3 \|\mathbf{x}\|_\infty. \end{equation*} Proof. By the symmetry of a we assume without loss of generality that $$x_j \geqslant 0$$ for all j = 1, …, d when considering the inner product S = ⟨a, x⟩. For a given coordinate index i let $$S^{(i)}=\langle \boldsymbol{a},\boldsymbol{x} \rangle - a_i x_i$$. 
Using symmetry again in the second equality below and setting $$\tau _i^2 = \sum _{k \not = i}x_k^2$$, for fixed r ⩾ 0, we obtain   \begin{align*} \left|E\theta\left(S^{(i)}+rx_i\right)\right| =&\, \left|P\left[S^{(i)} \geqslant -rx_i\right]-P\left[S^{(i)} < -rx_i\right]\right| \\ =&\, \left|P\left[S^{(i)} \geqslant -rx_i\right]-P\left[S^{(i)}> rx_i\right]\right| = P\left[|S^{(i)}| \leqslant rx_i\right] = P\left[|S^{(i)}|/\tau_i \leqslant rx_i/\tau_i\right] \\ \leqslant&\, P\left[|g| \leqslant rx_i/\tau_i\right] + \left|P\left[|S^{(i)}|/\tau_i \leqslant rx_i/\tau_i\right]-P\left[|g| \leqslant rx_i/\tau_i\right]\right|. \end{align*}The hypothesis $$\|\boldsymbol{x}\|_\infty \leqslant 1/2$$ implies $$\tau _i^2 \geqslant 3/4$$. Hence, using the supremum bound on the standard normal density for the first term and that $$\sqrt{8/3\pi } \leqslant 1$$, the Berry–Esseen bound of [18] with constant 0.56 on the second term, noting $$0.56 (4/3)^{3/2} \leqslant 1$$ and that $$\|\boldsymbol{x}\|_3^3 \leqslant \|\boldsymbol{x}\|_\infty$$ since $$\|\boldsymbol{x}\|_2=1$$, we obtain   \begin{equation*} \left|E\left[r\theta\left(S^{(i)}+rx_i\right)\right]\right| \leqslant r^2 x_i + |r| \|\boldsymbol{x}\|_\infty E|a|^3 . \end{equation*} Considering now the $$i$$th coordinate of $$\mathbf{v}_{\mathbf{x}}\!=\!E[\mathbf{a} \theta (\langle \mathbf{a},\mathbf{x} \rangle )]$$, using $$E|a| \!\leqslant\! (Ea^2)^{1/2}\!=\!1 \!\leqslant\! (E|a|^3)^{1/3} \!\leqslant E|a|^3$$, we have   \begin{equation*} \big|E\left[ a_i \theta(\langle\boldsymbol{a},\boldsymbol{x} \rangle)\right]\!\big| = \left|E\left[a_i \theta\left(S^{(i)}+a_i x_i\right)\right]\right| \leqslant x_i + \|\boldsymbol{x}\|_\infty E|a|^3 \leqslant 2E|a|^3 \|\boldsymbol{x}\|_\infty. \end{equation*}A similar computation yields this same result when r < 0. Proof of Theorem 2.4. We follow the proof of Proposition 4.1 of [1]. 
By Lemma 2.7 we see $$\mathbf{v}_{\mathbf{x}} \not =0$$, and defining $$\mathbf{z}=\mathbf{v}_{\mathbf{x}}/\|\mathbf{v}_{\mathbf{x}}\|_2$$ from Lemmas 2.7 and 2.8  \begin{equation*} \|\boldsymbol{z}\|_\infty= \frac{\|\mathbf{v}_{\mathbf{x}}\|_\infty}{\|\mathbf{v}_{\mathbf{x}}\|_2} \leqslant 2\|\mathbf{v}_{\mathbf{x}}\|_\infty \leqslant 4 E|a|^3 \|\mathbf{x}\|_\infty. \end{equation*} Hence, using the triangle inequality together with the fact that $$|\theta (\cdot )| = 1$$ for the first inequality, that $$\theta$$ is the sign function for the equality that follows it, and Lemma 2.6 together with $$\|\mathbf{z}\|_3^3 \leqslant \|\mathbf{z}\|_\infty$$ for the unit vector z for the second inequality, we obtain   \begin{align} \|\mathbf{v}_{\mathbf{x}}\|_2 = \langle\mathbf{v}_{\mathbf{x}},\mathbf{z}\rangle = E\left[\theta(\langle\mathbf{a},\mathbf{x}\rangle)\langle\mathbf{a},\mathbf{z} \rangle \right] \leqslant&\ E\big[|\langle\mathbf{a},\mathbf{z} \rangle |\big] = E\big[\theta(\langle\mathbf{a},\mathbf{z}\rangle)\langle\mathbf{a},\mathbf{z} \rangle \big] \leqslant \sqrt{\frac2\pi} +\gamma_a \|\mathbf{z}\|_\infty \nonumber\\ \leqslant&\, \sqrt{\frac2\pi} + 4\gamma_a E|a|^3 \|\mathbf{x}\|_\infty. \end{align} (2.29)Next, using (2.1), we bound $$|E[\,f_{\mathbf{x}}(\mathbf{t})]-\lambda \left \langle \mathbf{x},\mathbf{t}\right \rangle \!|=|\!\left \langle \mathbf{v}_{\mathbf{x}},\mathbf{t}\right \rangle - \lambda \left \langle \mathbf{x},\mathbf{t} \right \rangle\! |$$. By the Cauchy–Schwarz inequality, now taking $$\mathbf{t} \in B_2^d$$,   \begin{equation*} \big|\!\left\langle\mathbf{v}_{\mathbf{x}},\mathbf{t}\right\rangle - \lambda \left\langle\mathbf{x},\mathbf{t} \right\rangle\! \big|{}^2 = \big|\!\left\langle\mathbf{v}_{\mathbf{x}}-\lambda\mathbf{x},\mathbf{t} \right\rangle \!\big|{}^2\leqslant \|\mathbf{v}_{\mathbf{x}}-\lambda\mathbf{x}\|^2. 
\end{equation*}Furthermore, by (2.2), we have $$\left \langle \boldsymbol{v}_{\mathbf{x}},\mathbf{x}\right \rangle =\lambda$$, thus   \begin{align*} \|\mathbf{v}_{\mathbf{x}}-\lambda\mathbf{x}\|^2=&\, \|\mathbf{v}_{\mathbf{x}}\|_2^2 -\lambda^2 + 2\lambda (\lambda- \langle\mathbf{v}_{\mathbf{x}},\mathbf{x}\rangle )= \big(\|\mathbf{v}_{\mathbf{x}}\|_2 -\lambda\big)\big(\|\mathbf{v}_{\mathbf{x}}\|_2 +\lambda\big) \leqslant2\big(\|\boldsymbol{v}_{\mathbf{x}}\|_2 -\lambda\big)\\ =&\,2\left(\|\boldsymbol{v}_{\mathbf{x}}\|_2 -\sqrt{\frac2\pi}+\sqrt{\frac2\pi}-\lambda\right) \leqslant 10 \gamma_a E|a|^3 \|\mathbf{x}\|_\infty, \end{align*}where we have applied Lemma 2.7 in the first inequality and the last inequality follows from (2.29), Lemma 2.6 and that $$E|a|^3 \geqslant 1$$. Now taking a square root finishes the proof. Proof of Theorem 2.5. Under the additive error model (2.15), by Minkowski’s inequality   \begin{align*} E{\left[|g_\varepsilon|^3\right]}^{1/3}=&\,E{\left[\big|\sqrt{1-\varepsilon}g+\sqrt{\varepsilon} a\big|^3\right]}^{1/3} \leqslant\sqrt{1-\varepsilon}E{\left[|g|^3\right]}^{1/3}+\sqrt{\varepsilon}E{\left[|a|^3\right]}^{1/3}\\ =&\,\sqrt{1-\varepsilon}\left(\sqrt{\frac8\pi}\,\right)^{1/3}+\sqrt{\varepsilon}E{\left[|a|^3\right]}^{1/3}. \end{align*}Using this inequality and (2.18) in Theorem 2.4 gives the discrepancy bound in the additive error case. For the mixture model (2.16), again by Minkowski’s inequality,   \begin{align*} E{\left[|g_\varepsilon|^3\right]}^{1/3}=&\,E{\left[|g\boldsymbol{1}_{A^c} + a\boldsymbol{1}_A|^3\right]}^{1/3} \leqslant E{\left[(1-\varepsilon)|g|^3\right]}^{1/3}+E{\left[|\varepsilon a|^3\right]}^{1/3}\\ =&\,\left((1-\varepsilon)\sqrt{\frac8\pi}\,\right)^{1/3}+E{\left[\varepsilon |a|^3\right]}^{1/3}. \end{align*}Using this inequality and (2.19) in Theorem 2.4 gives the discrepancy bound in the mixture error case. 
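To see how the two bounds of Theorem 2.5 behave as the contamination level varies, they can be evaluated directly. The sketch below is only an illustration; the function names are hypothetical, and the inputs used in the example ($$\gamma _a=1/2$$ and $$E|a|^3=1$$, the values for a Rademacher variable, and $$\|\mathbf{x}\|_\infty =0.01$$) are assumptions chosen for concreteness. The hypotheses of Theorem 2.4 on x are not checked by the code.

```python
import math

# E|g|^3 for a standard normal g, as used in the proof of Theorem 2.5.
THIRD_ABS_MOMENT_NORMAL = math.sqrt(8.0 / math.pi)

def additive_bound(eps, gamma_a, m3_a, x_inf):
    """Right-hand side of the additive-model bound; m3_a stands for E|a|^3."""
    third = (math.sqrt(1.0 - eps) * THIRD_ABS_MOMENT_NORMAL ** (1.0 / 3.0)
             + math.sqrt(eps) * m3_a ** (1.0 / 3.0)) ** 3
    return math.sqrt(10.0 * eps ** 1.5 * gamma_a * third * x_inf)

def mixture_bound(eps, gamma_a, m3_a, x_inf):
    """Right-hand side of the mixture-model bound; m3_a stands for E|a|^3."""
    third = (((1.0 - eps) * THIRD_ABS_MOMENT_NORMAL) ** (1.0 / 3.0)
             + (eps * m3_a) ** (1.0 / 3.0)) ** 3
    return math.sqrt(10.0 * eps * gamma_a * third * x_inf)

for eps in (0.0, 0.01, 0.1, 1.0):
    print(eps,
          round(additive_bound(eps, 0.5, 1.0, 0.01), 4),
          round(mixture_bound(eps, 0.5, 1.0, 0.01), 4))
```

Both expressions vanish at $$\varepsilon =0$$ and coincide with the right-hand side of (2.25) at $$\varepsilon =1$$.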
2.3 Relations between measures of discrepancy We have considered two methods for handling non-Gaussian sensing, the first using Stein coefficients and the second using the zero bias distribution. In this section we discuss some relations between these two and also their connections to the total variation distance $$d_{\mathrm{TV}}(\cdot ,\cdot )$$ appearing in the bound of [1] and discussed in Remark 2.2. The following result appears in Section 7 of [10]. Lemma 2.9 If a is a mean zero, variance 1 random variable and $$a^{\ast }$$ has the a-zero biased distribution, then   $$d_{\mathrm{TV}}(a,g) \leqslant 2 d_{\mathrm{TV}}(a,a^{\ast}).$$ (2.30) The following related result is from [4]. Lemma 2.10 If the mean zero, variance 1 random variable a has Stein coefficient T, then   \begin{equation*} d_{\mathrm{TV}}(a,g) \leqslant 2 E|1-T|, \end{equation*}where $$g \sim \mathscr{N}(0,1)$$. Since E[Tf′(a)] = E[E[T|a] f′(a)], if T is a Stein coefficient for a then so is h(a) = E[T|a]. Introducing this Stein coefficient in the identity that characterizes the zero bias distribution $$a^{\ast }$$, we obtain   \begin{equation*} E\left[\,f^{\prime}(a^{\ast})\right]=E\left[af(a)\right]=E\left[h(a)f^{\prime}(a)\right]\!. \end{equation*}Hence, when such a T exists, h(a) is the Radon–Nikodym derivative of the distribution of $$a^{\ast }$$ with respect to that of a, and in particular $$\mathscr{L}(a^{\ast })$$ is absolutely continuous with respect to $$\mathscr{L}(a)$$. 
When a is a mean zero, variance one random variable with density p(a), whose support is a possibly infinite interval, then using the form of the density $$p^{\ast }(a)$$ of $$a^{\ast }$$ as given in [13], we have   $$p^{\ast}(y)=E\left[a\boldsymbol{1}(a>y)\right] \quad \mbox{and} \quad h(y)=\frac{p^{\ast}(y)}{p(y)}\boldsymbol{1}\left(\,p(y)>0\right)=\frac{E\left[a\boldsymbol{1}(a>y)\right]}{p(y)}\boldsymbol{1}\left(\,p(y)>0\right)\!,$$ (2.31)and hence,   \begin{equation*} E\left|1-h(a)\right| = \int_{y:p(y)>0} \left\vert\vphantom{\frac{1}{1}}\right. 1-\frac{p^{\ast}(y)}{p(y)}\left\vert\vphantom{\frac{1}{1}}\right. p(y)\,\mathrm{d}y = \int_{\mathbb{R}}\left|\,p(y)-p^{\ast}(y)\right|\,\mathrm{d}y=d_{\mathrm TV}(a,a^{\ast}), \end{equation*}and the upper bounds in Lemmas 2.10 and 2.9 are equal. Overall then, in the case where the Stein coefficient of a random variable is given as a function of the random variable itself, the discrepancy measure considered in Theorem 2.3 under part (a) of Theorem 2.1 is simply the total variation distance between a and $$a^{\ast }$$, while that under part (b), and in Section 2.2 when $$\theta (\cdot )$$ is specialized to be the sign function, is the Wasserstein distance. Due to a result of [4], Stein coefficients can be constructed in some generality when a = F(g) for some differentiable function $$F:\mathbb{R}^n \rightarrow \mathbb{R}$$ of a standard normal vector g in $$\mathbb{R}^n$$. In this case   \begin{equation*} T=\int_0^\infty \mathrm{e}^{-t} \Big\langle \nabla F(\boldsymbol{g}),\widehat{E}\left(\nabla F(\boldsymbol{g}_t)\right)\Big\rangle\, \mathrm{d}t \end{equation*}is a Stein coefficient for a where $$\boldsymbol{g}_t=\mathrm{e}^{-t}\boldsymbol{g}+\sqrt{1-\mathrm{e}^{-2t}}\ \widehat{\boldsymbol{g}}$$, with $$\widehat{\boldsymbol{g}}$$ an independent copy of g and $$\widehat{E}$$ integrating over $$\widehat{\boldsymbol{g}}$$, that is, taking conditional expectation with respect to g. 
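The quantity $$E|1-h(a)|$$ can be approximated numerically from (2.31) for any given density. The following is a minimal sketch, assuming a density supported on a bounded interval (or truncated to one) and using a midpoint rule; the function name and grid size are hypothetical. It is illustrated on the uniform density on $$[-\sqrt{3},\sqrt{3}]$$, an example not treated in the text, for which (2.31) gives $$h(y)=(3-y^2)/2$$ and hence $$E|1-h(a)|=E|a^2-1|/2=2/(3\sqrt{3})\approx 0.385$$.

```python
import math

def expected_abs_one_minus_h(pdf, lo, hi, n=100_000):
    """Midpoint-rule approximation of E|1 - h(a)| = int |p(y) - p*(y)| dy,
    where p*(y) = E[a 1(a > y)] is the zero bias density from (2.31)."""
    dy = (hi - lo) / n
    total = 0.0
    tail = 0.0  # running approximation of int_{y}^{hi} u p(u) du
    for k in range(n - 1, -1, -1):  # sweep from the right end of the support
        y = lo + (k + 0.5) * dy
        cell = y * pdf(y) * dy
        p_star = tail + 0.5 * cell  # tail integral evaluated at the cell midpoint
        total += abs(pdf(y) - p_star) * dy
        tail += cell
    return total

ROOT3 = math.sqrt(3.0)

def uniform_pdf(y):
    # Mean zero, variance one uniform density on [-sqrt(3), sqrt(3)].
    return 1.0 / (2.0 * ROOT3)

print(expected_abs_one_minus_h(uniform_pdf, -ROOT3, ROOT3))
print(2.0 / (3.0 * ROOT3))  # closed form for comparison
```

The two printed numbers should agree to several decimal places.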
To provide a concrete example of a Stein coefficient, a simple computation using the final equality of (2.31) shows that if a has the double exponential distribution with variance 1, that is, with density   \begin{equation*} p(y)=\frac{1}{\sqrt{2}}\mathrm{e}^{-\sqrt{2}|y|} \quad \mbox{then} \quad h(y)=\frac{1}{2}\left(1+\sqrt{2}|y|\right). \end{equation*}In this case   \begin{equation*} E\left|1-h(a)\right|=E\left|1-\sqrt{2} a\right|\boldsymbol{1}(a>0)= \frac{1}{e}. \end{equation*} The following result provides a bound complementary to (2.30) of Lemma 2.9, which when taken together shows that $$d_{\mathrm TV}(a,a^{\ast })$$ and $$d_{\mathrm TV}(a,g)$$ are of the same order in general for distributions of bounded support. Lemma 2.11 If a is a mean zero, variance one random variable with density p(y) supported in [−b, b], then   \begin{equation*} d_{\mathrm TV}(a,a^{\ast}) \leqslant (1+b^2)d_{\mathrm TV}(a,g). \end{equation*} Proof. With $$p^{\ast }(y)$$ the density of $$a^{\ast }$$ given by (2.31), we have   \begin{equation*} d_{\mathrm TV}(a,a^{\ast})=\int_{[-b,b]}\left|\,p(y)-p^{\ast}(y)\right|\,\mathrm{d}y = \int_{[-b,b]}\left(\,p(y)-p^{\ast}(y)\right)\phi(y)\ \mathrm{d}y = E\phi(a)-E\phi(a^{\ast}), \end{equation*}where   \begin{equation*} \phi(y)= \begin{cases} \hfill1 & p(y) \geqslant p^{\ast}(y)\\ -1 & p(y) < p^{\ast}(y). \end{cases} \end{equation*}Setting   \begin{equation*} f(y) = \int_0^y \phi(u) \,\mathrm{d}u \quad \mbox{and} \quad q(y)=\phi(y)-y \int_0^y \phi(u) \,\mathrm{d}u, \end{equation*}we have $$f^{\prime}(y)=\phi (y)$$, and using (2.3) to yield E[q(g)] = 0, we obtain   \begin{equation*} d_{\mathrm TV}(a,a^{\ast})=E\left[\,f^{\prime}(a)-f^{\prime}(a^{\ast})\right] =E\left[\,f^{\prime}(a)-af(a)\right] = Eq(a)-Eq(g). 
\end{equation*}For y ∈ [−b, b] we have $$|q(y)| \leqslant |\phi (y)| + |y| \int _0^y |\phi (u)|\,\mathrm{d}u \leqslant 1+b^2$$, hence   \begin{equation*} d_{\mathrm{TV}}(a,a^{\ast}) \leqslant (1+b^2) d_{\mathrm{TV}}(a,g), \end{equation*}as claimed. 3. Proof of Theorem 1.7 So far, we have shown that the penalty $$\alpha$$ for non-normality in (1.7) of Theorem 1.7 can be bounded explicitly using discrepancy measures that arise in Stein’s method. In this section, we focus on proving Theorem 1.7 via a generic chaining argument that is the crux of the concentration inequality applied. Recall that by (1.2), (1.4) and (1.5),   $$\widehat{\mathbf{x}}_m=\mathop{\mbox{argmin}}_{\mathbf{t}\in K}~\left(\|\mathbf{t}\|_2^2-2\,f_{\mathbf{x}}(\mathbf{t})\right).$$ In order to demonstrate that $$\widehat{\mathbf{x}}_m$$ is a good estimate of $$\lambda \mathbf{x}$$, we need to control the mean of $$f_{\mathbf{x}}(\cdot )$$ in (1.5) and the deviation of $$f_{\mathbf{x}}(\cdot )$$ from its mean. As shown in the previous section, the mean of $$f_{\mathbf{x}}(\cdot )$$ can be effectively characterized through the introduced discrepancy measures. The deviation is controlled by the following lemma. Lemma 3.1 (Concentration) Let $$\mathscr{T} := D(K,\lambda \mathbf{x})\cap \mathbb{S}^{d-1}$$. Under the assumptions of Theorem 1.7, for all u ⩾ 2 and $$m\geqslant \omega (\mathscr{T})^2$$,   \begin{equation*} P\left[\sup_{\mathbf{t}\in \mathscr{T}}\big|\,f_{\mathbf{x}}(\mathbf{t})-E{\left[\,f_{\mathbf{x}}(\mathbf{t})\right]}\big| \geqslant C_0\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\omega(\mathscr{T})+u}{\sqrt{m}}\right] \leqslant 4\mathrm{e}^{-u}, \end{equation*}where $$C_0>0$$ is a fixed constant. The proof of this lemma, provided in the next subsection, is based on the improved chaining technique introduced in [7]. We now show that once Lemma 3.1 is proved, Theorem 1.7 follows without much overhead. 
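For intuition about the estimator being analysed, note that when $$f_{\mathbf{x}}(\mathbf{t})$$ is linear in t, say $$f_{\mathbf{x}}(\mathbf{t})=\langle \widehat{\mathbf{v}},\mathbf{t}\rangle$$, the objective equals $$\|\mathbf{t}-\widehat{\mathbf{v}}\|_2^2-\|\widehat{\mathbf{v}}\|_2^2$$, so the minimizer is the projection of $$\widehat{\mathbf{v}}$$ onto K. The sketch below is a small simulation under assumptions that go beyond the text: it takes $$f_{\mathbf{x}}(\mathbf{t})=\frac{1}{m}\sum _{i=1}^m y_i\langle \mathbf{a}_i,\mathbf{t}\rangle$$, K to be the Euclidean unit ball, Gaussian sensing vectors and the sign function for $$\theta$$, so that $$\lambda =\sqrt{2/\pi }$$. The function name and the parameter values are hypothetical.

```python
import math
import random

def estimate_in_ball(pairs, radius=1.0):
    """Minimize ||t||^2 - 2 <v, t> over {||t||_2 <= radius}, where
    v = (1/m) sum_i y_i a_i, by projecting v onto the ball."""
    m = len(pairs)
    d = len(pairs[0][1])
    v = [0.0] * d
    for y, a in pairs:
        for j in range(d):
            v[j] += y * a[j] / m
    norm = math.sqrt(sum(c * c for c in v))
    scale = min(1.0, radius / norm) if norm > 0 else 1.0
    return [scale * c for c in v]

rng = random.Random(0)  # fixed seed so the simulation is reproducible
d, m = 10, 5000
x = [1.0 / math.sqrt(d)] * d  # unit-norm unknown, chosen for illustration
pairs = []
for _ in range(m):
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    y = 1.0 if sum(ai * xi for ai, xi in zip(a, x)) >= 0 else -1.0
    pairs.append((y, a))

x_hat = estimate_in_ball(pairs)
lam = math.sqrt(2.0 / math.pi)  # lambda = E|g| under Gaussian sensing
err = math.sqrt(sum((h - lam * xi) ** 2 for h, xi in zip(x_hat, x)))
print(f"||x_hat - lambda x||_2 = {err:.4f}")
```

Under these assumptions $$\alpha =0$$, so the printed error reflects only the deviation term that Lemma 3.1 controls.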
Using Lemma 1.9 for the first inequality, we have   \begin{align*} \|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2^2\leqslant&\, L(\,\widehat{\mathbf{x}}_m)-L(\lambda\mathbf{x}) +2\alpha\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2\\ =&\,L(\,\widehat{\mathbf{x}}_m)-L_m(\,\widehat{\mathbf{x}}_m)+L_m(\,\widehat{\mathbf{x}}_m) -L_m(\lambda\mathbf{x})+L_m(\lambda\mathbf{x})-L(\lambda\mathbf{x}) + 2\alpha\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2\\ =& -2\big( E_m\left[y\left\langle\mathbf{a},\widehat{\mathbf{x}}_m\right\rangle\right]-f_{\mathbf{x}}(\,\widehat{\mathbf{x}}_m)\big) +L_m(\,\widehat{\mathbf{x}}_m)-L_m(\lambda\mathbf{x}) +2\big( E_m\left[\,y\left\langle\mathbf{a},\lambda\mathbf{x}\right\rangle\right] -f_{\mathbf{x}}(\lambda\mathbf{x})\big)\\ &+2\alpha\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2\\ \leqslant&\,2\big|\,f_{\mathbf{x}}(\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x})-E_m\left[\,y\left\langle\mathbf{a},\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\right\rangle\right] \!\big| +L_m(\,\widehat{\mathbf{x}}_m)-L_m(\lambda\mathbf{x}) +2\alpha\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2, \end{align*}where $$E_m[\cdot ]$$ is the conditional expectation given $$\{(\mathbf{a}_i,y_i)\}_{i=1}^m$$. Since $$\widehat{\mathbf{x}}_m$$ solves (1.4) and $$\lambda \mathbf{x}\in K$$, it follows that $$L_m(\widehat{\mathbf{x}}_m)-L_m(\lambda \mathbf{x})\leqslant 0$$. 
Thus,   $$\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2^2 \leqslant 2\big|\,f_{\mathbf{x}}(\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x})-E_m[\,y\left\langle\mathbf{a},\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\right\rangle]\big| +2\alpha\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2.$$Since $$\widehat{\mathbf{x}}_m-\lambda \mathbf{x}\in D(K,\lambda \mathbf{x})$$, dividing both sides by $$\|\,\widehat{\mathbf{x}}_m-\lambda \mathbf{x}\|_2$$, the conclusion holding trivially should this norm be zero, using the fact that for any fixed $$\boldsymbol{t}\in \mathbb{R}^d$$, $$E{\left [\,y\left \langle \mathbf{a},\boldsymbol{t}\right \rangle \right ]}=E{\left [\,f_{\mathbf{x}}(\boldsymbol{t})\right ]}$$ gives   \begin{equation*} \|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2 \leqslant2\sup_{\mathbf t\in \mathscr{T}}\big|\,f_{\mathbf{x}}(\boldsymbol{t})-E\left[\,f_{\mathbf{x}}(\boldsymbol{t})\right]\!\big|+2\alpha. \end{equation*}Now applying Lemma 3.1 finishes the proof of Theorem 1.7. 3.1 Preliminaries In addition to chaining, we need the following notions and propositions; we recall the $$\psi _q$$ norms from Definition 1.4. Definition 3.2 (Subgaussian random vector) A random vector $$\mathbf{X}\in \mathbb{R}^d$$ is subgaussian if the random variables $$\langle \mathbf{X},\mathbf{z}\rangle ,\mathbf{z}\in \mathbb{S}^{d-1}$$ are subgaussian with uniformly bounded subgaussian norm. The corresponding subgaussian norm of the vector X is then given by   $$\|\mathbf{X}\|_{\psi_2}=\sup_{\mathbf{z}\in\mathbb{S}^{d-1}}\big\|\langle\mathbf{X},\mathbf{z}\rangle\big\|_{\psi_2}.$$ The proof of the following two propositions is shown in the Appendix. 
Proposition 3.3 If both X and Y are subgaussian random variables, then XY is a subexponential random variable, satisfying   $$\|XY\|_{\psi_1}\leqslant 2\|X\|_{\psi_2}\|Y\|_{\psi_2}.$$ Proposition 3.4 If a is a subgaussian random vector with covariance matrix $$\mathbf{\Sigma }$$, then   $$\sigma_{\max}(\mathbf{\Sigma})\leqslant 2\|\mathbf{a}\|_{\psi_2}^2,$$where $$\sigma _{\max }(\cdot )$$ denotes the maximal singular value of a matrix. In addition, we need the following fact that a vector of d independent subgaussian random variables is subgaussian. Proposition 3.5 (Lemma 5.24 of [21]) Consider a random vector $$\mathbf{a}\in \mathbb{R}^d$$, where each entry $$a_i$$ is an i.i.d. copy of a centered subgaussian random variable a. Then, a is a subgaussian random vector with norm $$\|\mathbf{a}\|_{\psi _2}\leqslant C\|a\|_{\psi _2}$$ where C is an absolute positive constant. 3.2 Proving Lemma 3.1 via generic chaining Throughout this section, C denotes an absolute constant whose value may change at each occurrence. The following notions are necessary ingredients in the generic chaining argument. Let $$(\mathscr{T},d)$$ be a metric space. If $$\mathscr{A}_{l}\subseteq \mathscr{A}_{l+1} \subseteq \mathscr{T}$$ for every l ⩾ 0 we say $$\{\mathscr{A}_l\}_{l=0}^{\infty }$$ is an increasing sequence of subsets of $$\mathscr{T}$$. Let $$N_0=1$$ and $$N_l=2^{2^l},~\forall\, l\geqslant 1$$. Definition 3.6 (Admissible sequence) An increasing sequence of subsets $$\{\mathscr{A}_l\}_{l=0}^{\infty }$$ of $$\mathscr{T}$$ is admissible if $$|\mathscr{A}_l|\leqslant N_l$$ for all l ⩾ 0. Essentially following the framework of Section 2.2 of [20], for each subset $$\mathscr{A}_l$$, we define $$\pi _l\!:\!\mathscr{T}\!\!\rightarrow\! \mathscr{A}_l$$ as the closest point map $$\pi _l(\boldsymbol{t})=\textrm{arg}\min _{\mathbf s\in \mathscr{A}_l}d(\mathbf s,\mathbf t),~\forall\, \mathbf t\in\! \mathscr{T}$$. Since each $$\mathscr{A}_l$$ is a finite set, the minimum is always achievable. 
If the argmin is not unique a representative is chosen arbitrarily. The Talagrand $$\gamma _2$$-functional is defined as   $$\gamma_2(\mathscr{T},d):=\inf\sup_{\mathbf t\in \mathscr{T}}\sum_{l=0}^{\infty}2^{l/2}d\left(\mathbf t,\pi_l(\mathbf t)\right)\!,$$ (3.1)where the infimum is taken with respect to all admissible sequences. Though there is no guarantee that $$\gamma _2(\mathscr{T},d)$$ is finite, the following majorizing measure theorem tells us that its value is comparable to the supremum of a certain Gaussian process. Lemma 3.7 (Theorem 2.4.1 of [20]) Consider a family of centered Gaussian random variables $$\{G(\mathbf t)\}_{\mathbf t\in \mathscr{T}}$$ indexed by $$\mathscr{T}$$, with the canonical distance   $$d(\mathbf s,\mathbf t)=E\left[\left(G(\mathbf s)-G(\mathbf t)\right)^2\right]^{1/2},\quad \forall\, \mathbf s,\mathbf t\in \mathscr{T}.$$ Then for a universal constant L that does not depend on the covariance of the Gaussian family, we have   $$\frac1L\gamma_2(\mathscr{T},d)\leqslant E\left[\sup_{\mathbf t\in \mathscr{T}}G(\mathbf t)\right]\leqslant L\gamma_2(\mathscr{T},d).$$ For $$\mathscr{T} \subseteq \mathbb{R}^d$$ and $$d(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_2$$ we write $$\gamma _2(\mathscr{T})$$ to denote $$\gamma _2(\mathscr{T},\|\cdot \|_2)$$ defined in (3.1). Defining the Gaussian process $$G(\boldsymbol{t})=\left \langle \mathbf{g},\mathbf t\right \rangle ,~\mathbf t\in \mathscr{T}$$, with $$\mathbf{g}\sim \mathscr{N}(0,\mathbf{I}_{d\times d})$$, we have   $$E\left[\left(G(t)-G(s)\right)^2\right]^{1/2}=\|t-s\|_2,\quad\forall\, t,s\in \mathscr{T}.$$When $$\mathscr{T}$$ is bounded we may conclude that $$\omega (\mathscr{T})<\infty$$ directly from Definition 1.1, and Lemma 3.7 then implies that Gaussian mean width $$\omega (\mathscr{T})$$ and $$\gamma _2(\mathscr{T})$$ are of the same order, i.e. 
there exists a universal constant L ⩾ 1 independent of $$\mathscr{T}$$ such that   $$\frac1L\gamma_2(\mathscr{T})\leqslant\omega(\mathscr{T})\leqslant L \gamma_2(\mathscr{T}).$$ (3.2) Define   $$\overline{Z}(\boldsymbol{t})=f_{\mathbf{x}}(\boldsymbol{t})-E{\left[\,f_{\mathbf{x}}(\boldsymbol{t})\right]},$$where $$f_{\mathbf{x}}(\boldsymbol{t})$$ is as defined in (1.5) and   $$Z(\boldsymbol{t})= \frac{1}{m}\sum_{i=1}^m\varepsilon_iy_i\langle\mathbf{a}_i,\boldsymbol{t}\rangle,$$where $$\varepsilon _i, i=1,\ldots ,m$$ are Rademacher variables taking values uniformly in {1, −1}, independent of each other and of $$\{y_i,\boldsymbol{a}_i,i=1,2,\ldots ,m\}$$. The majority of the proof of Lemma 3.1 is devoted to showing that   $$P\left[\sup_{\boldsymbol{t}\in \mathscr{T}}\left|Z(\boldsymbol{t})\right| \geqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right) \frac{\omega(\mathscr{T})+u}{\sqrt{m}}\right] \leqslant \mathrm{e}^{-u} \quad \mbox{for $u \geqslant 2,\ m\geqslant \omega(\mathscr{T})^2$,} \quad$$ (3.3)where C > 0 is a constant. 
Once (3.3) is justified, by the fact u ⩾ 2, we have   \begin{equation*} P\left[\sup_{\boldsymbol{t}\in \mathscr{T}}\left|Z(\boldsymbol{t})\right| \geqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right) \frac{\omega(\mathscr{T})+1}{\sqrt{m}}u\right] \leqslant \mathrm{e}^{-u} \quad \mbox{for {$u \geqslant 2, m\geqslant \omega(\mathscr{T})^2$}.} \quad \end{equation*}By Lemma A.5, with p = 1 and k = 1, we have   $$E\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\right] \leqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\omega(\mathscr{T})+1}{\sqrt{m}}.$$ Thus, invoking the first bound in the symmetrization lemma, Lemma A.3,   \begin{equation*} E\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\overline{Z}(\boldsymbol{t})\right|\right]\leqslant2E\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\right] \leqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right) \frac{\omega(\mathscr{T})+1}{\sqrt{m}}. \end{equation*}We may then finish the proof of Lemma 3.1 using the fact that u ⩾ 2, the second bound in the symmetrization lemma with $$\beta =(2C(\|a\|_{\psi _2}^2+\|y\|_{\psi _2}^2) \omega (\mathscr{T})+u)/\sqrt{m}$$ and (3.3), which together imply   \begin{multline*} P\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\overline{Z}(\boldsymbol{t})\right| \geqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right) \frac{\omega(\mathscr{T})+u}{\sqrt{m}}\right]\\ \leqslant 4P\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\geqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right) \frac{\omega(\mathscr{T})+u}{\sqrt{m}}\right]\leqslant 4\mathrm{e}^{-u}. \end{multline*} The rest of the section is devoted to the proof of (3.3). 
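As an informal illustration of the first symmetrization bound invoked above, the following sketch estimates $$E[\sup_{\boldsymbol{t}}|\overline{Z}(\boldsymbol{t})|]$$ and $$E[\sup_{\boldsymbol{t}}|Z(\boldsymbol{t})|]$$ by Monte Carlo. Everything specific in it is an assumption made for illustration and is not taken from the paper: d = 3, Rademacher sensing entries, the noiseless one-bit response y = sign⟨a, x⟩ with x proportional to (3, 2, 2), the finite index set of standard basis vectors, and all function names.

```python
import itertools
import random

# Illustrative assumptions (not from the paper): d = 3, Rademacher sensing
# entries, one-bit response y = sign(<a, x>) with x proportional to (3, 2, 2),
# and index set T = {e_1, e_2, e_3}.
X_DIRECTION = (3.0, 2.0, 2.0)
D = len(X_DIRECTION)


def response(a):
    """One-bit response; <a, x> is an odd integer here, so it is never zero."""
    return 1.0 if sum(ai * xi for ai, xi in zip(a, X_DIRECTION)) > 0 else -1.0


def exact_means():
    """E[y * a_j] for each j, by enumerating all 2^d sign patterns of a."""
    patterns = list(itertools.product((-1.0, 1.0), repeat=D))
    return [sum(response(a) * a[j] for a in patterns) / len(patterns)
            for j in range(D)]


def sup_processes(m, rng, means):
    """Return (sup_t |Zbar(t)|, sup_t |Z(t)|) over t in {e_1, ..., e_d}."""
    zbar = [0.0] * D
    z = [0.0] * D
    for _ in range(m):
        a = [rng.choice((-1.0, 1.0)) for _ in range(D)]
        y = response(a)
        eps = rng.choice((-1.0, 1.0))  # Rademacher sign, independent of (y, a)
        for j in range(D):
            zbar[j] += (y * a[j] - means[j]) / m
            z[j] += eps * y * a[j] / m
    return max(abs(v) for v in zbar), max(abs(v) for v in z)


def estimate(m=100, trials=300, seed=0):
    """Monte Carlo estimates of E sup|Zbar| and E sup|Z|."""
    rng = random.Random(seed)
    means = exact_means()
    total_zbar = total_z = 0.0
    for _ in range(trials):
        s_zbar, s_z = sup_processes(m, rng, means)
        total_zbar += s_zbar
        total_z += s_z
    return total_zbar / trials, total_z / trials


mean_sup_zbar, mean_sup_z = estimate()
print(mean_sup_zbar, mean_sup_z, mean_sup_zbar <= 2 * mean_sup_z)
```

In this toy setting the centered process has smaller per-coordinate variance than the symmetrized one, so the factor 2 in the bound is far from tight; the sketch only illustrates which two quantities are being compared.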
Pick $$\mathbf{t}_0\in \mathscr{T}$$ so that $$\{\mathbf{t}_0\}=\mathscr{A}_0\subseteq \mathscr{A}_1\subseteq \mathscr{A}_2\subseteq \mathscr{A}_3\subseteq \cdots$$ is an admissible sequence, satisfying   $$\sup_{\boldsymbol{t} \in \mathscr{T}}\sum_{l=0}^\infty2^{l/2}d\left(\boldsymbol{t},\pi_l(\boldsymbol{t})\right)\leqslant 2\gamma_2(\mathscr{T}),$$ (3.4)where we recall $$\pi _l$$ is the closest point map from $$\mathscr{T}$$ to $$\mathscr{A}_l$$, and the constant 2 on the right-hand side of the inequality is introduced to handle the case where the infimum in the definition of $$\gamma _2(\mathscr{T})$$ is not achieved. Then, for any $$\boldsymbol{t}\in \mathscr{T}$$, we write $$Z(\boldsymbol{t})-Z(\boldsymbol{t}_0)$$ as a telescoping sum, i.e.   $$Z(\boldsymbol{t})-Z(\boldsymbol{t}_0)=\sum_{l=1}^{\infty}Z\left(\pi_l(\boldsymbol{t})\right)-Z\left(\pi_{l-1}(\boldsymbol{t})\right)=\sum_{l=1}^{\infty}\frac1m\sum_{i=1}^m\varepsilon_iy_i\big\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\rangle.$$ (3.5)Note that this telescoping sum converges with probability 1 because the right-hand side of (3.4) is finite. Then, following ideas in [7], we fix an arbitrary positive integer p and let $$l_p:=\lfloor \log _2p\rfloor$$. Specializing (3.5) to the case $$\boldsymbol{t}_0= \pi _{l_p}(\boldsymbol{t})$$ we obtain, with probability one, that   $$Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)=\sum_{l=l_p+1}^{\infty}Z\left(\pi_l(\boldsymbol{t})\right)-Z\left(\pi_{l-1}(\boldsymbol{t})\right)=\sum_{l=l_p+1}^{\infty}\frac1m\sum_{i=1}^m\varepsilon_iy_i\big\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\rangle.$$ (3.6) We split the outer index of summation in (3.6) into the following two sets:   \begin{equation*} I_{1,p}:=\big\{l>l_p:2^{l/2}\leqslant \sqrt{m}\big\} \quad \mbox{and} \quad I_{2,p}:=\big\{l>l_p:2^{l/2}>\sqrt{m}\big\}. 
\end{equation*} On the coarse scale $$I_{1,p}$$, we have the following lemma: Lemma 3.8 (Coarse scale chaining) For all p ⩾ 1 and u ⩾ 2, there exists a constant c > 0 such that the inequality   \begin{equation*} \sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{1,p}}Z\left(\pi_l(\boldsymbol{t})\right)-Z\left(\pi_{l-1}(\boldsymbol{t})\right)\right|\leqslant 4(\sqrt{2}+1)\|\mathbf{a}\|_{\psi_2}\|y\|_{\psi_2} \frac{u}{\sqrt{m}}\gamma_2(\mathscr{T}) \end{equation*}holds with probability at least $$1-c\mathrm{e}^{-pu/4}$$. Proof. We assume $$I_{1,p}$$ is non-empty, else the claim is trivial. By Proposition 3.3 and Definition 3.2, for any i ∈ {1, 2, ⋯ , m}, we have   \begin{equation*} \left\|\varepsilon_iy_i\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\rangle\right\|_{\psi_1} \leqslant 2\|\mathbf{a}\|_{\psi_2}\|y\|_{\psi_2}\left\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\right\|_2\!. \end{equation*}Thus, for each $$l\in I_{1,p}$$, applying Bernstein’s inequality (Lemma A.7) to   \begin{equation*} Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))=\frac{1}{m}\sum_{i=1}^m\varepsilon_iy_i\big\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\rangle, \end{equation*}an average of independent subexponential random variables, we have that for all v ⩾ 1,   \begin{equation*} P\left[\big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|\geqslant 2\|\mathbf{a}\|_{\psi_2}\|y\|_{\psi_2}\left(\frac{\sqrt{2v}}{\sqrt{m}}+\frac{v}{m}\right)\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\right]\leqslant 2\mathrm{e}^{-v}. \end{equation*}Let $$v=2^lu$$ for some u ⩾ 2. 
Using that $$2^{l/2}\leqslant \sqrt{m}$$ since $$l \in I_{1,p}$$ and that $$u \geqslant \sqrt{u}$$, we have   $$P\left[\big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|\geqslant 2\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2}(\sqrt{2}+1)\frac{u}{\sqrt{m}}2^{l/2}\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\right]\leqslant 2\exp(-2^lu).$$ (3.7) Now for every $$l \in I_{1,p}$$ and $$\boldsymbol{t} \in{\mathscr{T}}$$, define the event   \begin{equation*} \Omega_{l,\boldsymbol{t}}=\left\{\omega: \big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|\geqslant 2(\sqrt{2}+1)\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2}\frac{u}{\sqrt{m}}2^{l/2}\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\right\}, \end{equation*}and let $$\Omega :=\bigcup _{l\in I_{1,p}}\bigcup _{\boldsymbol{t} \in \mathscr{T}}\Omega _{l,\boldsymbol{t}}$$. As $$\mathscr{A}_{l}=\{\pi _{l}(\boldsymbol{t})\}_{\boldsymbol{t} \in \mathscr{T}}$$ contains at most $$2^{2^l}$$ points, it follows that the union over $$\boldsymbol{t} \in \mathscr{T}$$ in the definition of $$\Omega$$ can be written as a union over at most $$2^{2^{l+1}}$$ indices. Hence, with u ⩾ 2, Lemma A.4 with k = 1 may now be invoked to yield   \begin{equation*} P\left[\bigcup_{l\in I_{1,p},\boldsymbol{t}\in \mathscr{T}}\Omega_{l,\boldsymbol{t}}\right]\leqslant c\mathrm{e}^{-pu/4}, \end{equation*} for some c > 0. 
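The counting behind this union bound can be checked numerically. The sketch below sums $$2^{2^{l+1}}\cdot 2\exp(-2^lu)$$ over $$l>l_p$$ in the log domain and compares the total with $$16\exp(-pu/4)$$, using the value c ⩽ 16 quoted in Lemma A.4. The function name, the truncation of the infinite sum and the tested values of p and u are illustrative assumptions, not part of the paper. Since $$I_{1,p}$$ is a subset of $$\{l>l_p\}$$, a bound for the full range also covers the coarse scales.

```python
import math


def union_bound_total(p, u, num_terms=60):
    """Sum over l > floor(log2 p) of 2^(2^(l+1)) * 2 * exp(-2^l * u).

    Each term is evaluated as exp(log 2 + 2^(l+1) * log 2 - 2^l * u) so that
    the doubly exponential cardinality never overflows. Truncating after
    num_terms scales is an illustrative choice; the omitted terms are
    negligible when u >= 2.
    """
    l_p = math.floor(math.log2(p))
    total = 0.0
    for l in range(l_p + 1, l_p + 1 + num_terms):
        scale = 2.0 ** l
        log_term = math.log(2.0) + 2.0 * scale * math.log(2.0) - scale * u
        total += math.exp(log_term)
    return total


for p in (1, 2, 3, 8, 50):
    for u in (2.0, 3.0, 10.0):
        lhs = union_bound_total(p, u)
        rhs = 16.0 * math.exp(-p * u / 4.0)
        print(p, u, lhs <= rhs)
```

The requirement u ⩾ 2 matters here: each term equals $$2\exp(-2^l(u-2\log 2))$$, which decays in l only when u exceeds 2 log 2 ≈ 1.39.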
Thus, on the event $$\Omega ^c$$, we have   \begin{align*} \sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{1,p}}Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\right| \leqslant&\,\sup_{\boldsymbol{t} \in \mathscr{T}}\sum_{l\in I_{1,p}}\big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|\\ \leqslant&\, \sup_{\boldsymbol{t} \in \mathscr{T}}2(\sqrt{2}+1)\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2} \frac{u}{\sqrt{m}} \sum_{l\in I_{1,p}}2^{l/2}\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\\ \leqslant&\,\sup_{\boldsymbol{t} \in \mathscr{T}}2(\sqrt{2}+1)\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2} \frac{u}{\sqrt{m}} \sum_{l=1}^{\infty}2^{l/2}\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\\ \leqslant&\,4(\sqrt{2}+1)\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2}\frac{u}{\sqrt{m}}\gamma_2(\mathscr{T}), \end{align*}where the last inequality follows from (3.4), finishing the proof. For the finer scale chaining, we will apply the following lemma, whose proof is in the Appendix. Lemma 3.9 For any $$\mathbf{t}\in \mathbb{R}^d$$, u ⩾ 1 and $$2^{l/2}>\sqrt{m}$$, we have   \begin{equation*} P\left[\left(\frac1m\sum_{i=1}^m\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right)^{1/2} \geqslant \sqrt{5+3\sqrt2} \|\mathbf{a}\|_{\psi_2} \sqrt{\frac{u}{m}} 2^{l/2} \|\mathbf{t}\|_2\right] \leqslant 2\exp(-2^lu). \end{equation*} Lemma 3.10 (Finer scale chaining) Let   $$Y_m=\left|\frac1m\sum_{i=1}^my_i^2-E{[y^2]}\right|.$$Then for all p ⩾ 1 and u ⩾ 2, there exists a constant c > 0 such that, with probability at least $$1-c\mathrm{e}^{-pu/4}$$,  \begin{equation*} \sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{2,p}}Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\right| \leqslant2\sqrt{5+3\sqrt2}\left(Y_m^{1/2}+\sqrt{2}\|y\|_{\psi_2}\right)\|\mathbf{a}\|_{\psi_2} \sqrt{\frac{u}{m}} \gamma_2(\mathscr{T}). \end{equation*} Proof. 
For any $$p \geqslant 1, l\in I_{2,p}$$ and $$t\in \mathscr{T}$$, by the Cauchy–Schwarz inequality,   \begin{align*} \big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|=&\,\left|\frac1m\sum_{i=1}^m\varepsilon_iy_i\big\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\rangle\right|\\ \leqslant&\,\left(\frac1m\sum_{i=1}^my_i^2\right)^{1/2} \cdot\left(\frac1m\sum_{i=1}^m\big|\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\rangle\big|{}^2\right)^{1/2}. \end{align*}Since y is subgaussian, $$E\left [y^2\right ]\leqslant 2\|y\|_{\psi _2}^2$$. Thus,   $$\left(\frac1m\sum_{i=1}^my_i^2\right)^{1/2}=\left(\frac1m\sum_{i=1}^my_i^2-E\,{[y^2]}+E\,{[y^2]}\right)^{1/2} \leqslant Y_m^{1/2}+\sqrt{2}\|y\|_{\psi_2}.$$Furthermore, by Lemma 3.9, for any $$l\in I_{2,p}$$, we have   $$P\left[\left(\frac1m\sum_{i=1}^m\big|\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\rangle\big|{}^2\right)^{1/2} \!\!\geqslant \sqrt{5+3\sqrt2}\|\mathbf{a}\|_{\psi_2} \sqrt{\frac{u}{m}} 2^{l/2} \big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\right]\! \leqslant\! 2\exp(-2^lu).$$Thus, combining the above two inequalities,   \begin{align*} &P\left[\big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|\geqslant\sqrt{5+3\sqrt2}\left(Y_m^{1/2}+\sqrt{2}\|y\|_{\psi_2}\right)\|\mathbf{a}\|_{\psi_2} \sqrt{\frac{u}{m}} 2^{l/2}\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\right]\\[10pt] &\qquad\leqslant 2\exp(-2^lu). \end{align*}The rest of the proof follows a standard chaining argument similar to the proof of Lemma 3.8 after (3.7) and is not repeated here for brevity. Now we are ready to prove Lemma 3.1, for which we have already demonstrated the sufficiency of (3.3). 
Proof of (3.3). First, for all p ⩾ 1 and u ⩾ 2, by Lemma 3.10, with probability at least $$1-c \mathrm{e}^{-pu/4}$$,   \begin{align*} &\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{2,p}}Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\right|\\[10pt] &\quad\leqslant2\sqrt{5+3\sqrt2}Y_m^{1/2}\|\mathbf{a}\|_{\psi_2} \sqrt{\frac{u}{m}} \gamma_2(\mathscr{T}) +2\sqrt{8+6\sqrt2}\ \|\mathbf{a}\|_{\psi_2}\|y\|_{\psi_2} \sqrt{\frac{u}{m}} \gamma_2(\mathscr{T})\\[10pt] &\quad\leqslant Y_m+\left(5+3\sqrt2\right)\|\mathbf a\|_{\psi_2}^2\frac{u}{m}\gamma_2(\mathscr{T})^2 +2\sqrt{8+6\sqrt2}\ \|\mathbf{a}\|_{\psi_2}\|y\|_{\psi_2} \sqrt{\frac{u}{m}} \gamma_2(\mathscr{T}), \end{align*}where we applied the inequality $$2ab\leqslant a^2+b^2$$ to the first term. Then, combining with Lemma 3.8, we have with probability at least $$1-c\mathrm{e}^{-pu/4}$$,   \begin{align*} &\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\\ &\quad\leqslant \sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{1,p}}Z\left(\pi_l(\boldsymbol{t})\right)-Z\left(\pi_{l-1}(\boldsymbol{t})\right)\right| +\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{2,p}}Z\left(\pi_l(\boldsymbol{t})\right)-Z\left(\pi_{l-1}(\boldsymbol{t})\right)\right|\\ &\quad\leqslant Y_m+\left(5+3\sqrt2\right)\|\mathbf a\|_{\psi_2}^2\frac{u}{m}\gamma_2(\mathscr{T})^2 +2\sqrt{8+6\sqrt{2}}\ \|\mathbf a\|_{\psi_2}\|y\|_{\psi_2} \sqrt{\frac{u}{m}} \gamma_2(\mathscr{T})\\ &\qquad+4\left(\sqrt{2}+1\right)\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2} \frac{u}{\sqrt{m}}\gamma_2(\mathscr{T})\\ &\quad\leqslant Y_m+\left(5+3\sqrt2\right)\|\mathbf a\|_{\psi_2}^2\frac{u}{m}\gamma_2(\mathscr{T})^2 +\left( \sqrt{8+6\sqrt{2}}+ 2\left(\sqrt{2}+1\right) \right) 2\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2}\frac{u}{\sqrt{m}} \gamma_2(\mathscr{T}). \end{align*} By the conditions in (3.3) we have $$m\geqslant \omega (\mathscr{T})^2$$. 
Using inequality (3.2) on the relation between $$\omega (\mathscr{T})$$ and $$\gamma _2(\mathscr{T})$$ gives $$m\geqslant \gamma _2(\mathscr{T})^2/L^2$$, that is, $$\gamma _2(\mathscr{T})/\sqrt{m}\leqslant L$$. Thus, $$\gamma _2(\mathscr{T})^2/m\leqslant L\gamma _2(\mathscr{T})/\sqrt{m}$$, and the second term is bounded by   \begin{equation*} \left(5+3\sqrt2\right)L \|\mathbf a\|_{\psi_2}^2\frac{u}{\sqrt{m}}\gamma_2(\mathscr{T}). \end{equation*} For the last term we apply the bound $$2\|\mathbf a\|_{\psi _2}\|y\|_{\psi _2}\leqslant \|\mathbf a\|_{\psi _2}^2+\|y\|_{\psi _2}^2$$. Thus, with probability at least $$1-c\mathrm{e}^{-pu/4}$$,   $$\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\leqslant Y_m + C\left(\|\mathbf a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{u\gamma_2(\mathscr{T})}{\sqrt{m}},$$for the constant   $$C=5L+2+(3L+2)\sqrt{2}+\sqrt{8+6\sqrt{2}}.$$By Proposition 3.5, $$\|\mathbf a\|_{\psi _2}\leqslant C\|a\|_{\psi _2}$$ for some constant C. Thus, with probability at least $$1-c\mathrm{e}^{-pu/4}$$, for some constant C large enough,   \begin{equation*} \sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\leqslant Y_m + C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{u\gamma_2(\mathscr{T})}{\sqrt{m}}, \end{equation*}or equivalently   \begin{equation*} \xi \leqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{u\gamma_2(\mathscr{T})}{\sqrt{m}} \quad \mbox{where} \quad \xi = \max\left\{\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|-Y_m,0\right\}. 
\end{equation*}Invoking Lemma A.5 with k = 1, for all $$1 \leqslant p < \infty$$  $$E\left[\xi^p\right]^{1/p}\leqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\gamma_2(\mathscr{T})}{\sqrt{m}}.$$ Since   \begin{align*} \xi\geqslant&\,\max\left\{\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|,0\right\}-Y_m=\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|-Y_m\\ \geqslant&\,\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|-\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|-Y_m, \end{align*}and $$\xi$$ and $$Y_m$$ are both non-negative, by Minkowski’s inequality it follows that   \begin{align} E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\right)^p\right]^{1/p}\leqslant&\, E\left[\left(\xi+\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|+Y_m\right)^p\right]^{1/p} \nonumber\\ \leqslant&\, E\left[\xi^p\right]^{1/p}+E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\right)^p\right]^{1/p} +E\left[Y_m^p\right]^{1/p}\nonumber\\ \leqslant&\, C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\gamma_2(\mathscr{T})}{\sqrt{m}} +E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\right)^p\right]^{1/p} +E\left[Y_m^p\right]^{1/p}.\end{align} (3.8) For the second term, we have   \begin{equation*} E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\right)^p\right] \leqslant\sum_{\mathbf{t}\in\mathscr{A}_{l_p}}E\left[|Z(\boldsymbol{t})|^p\right] \leqslant|\mathscr{A}_{l_p}|\sup_{\boldsymbol{t} \in \mathscr{T}}E\left[|Z(\boldsymbol{t})|^p\right] \leqslant 2^p\sup_{\boldsymbol{t} \in \mathscr{T}}E\left[|Z(\boldsymbol{t})|^p\right], \end{equation*}where the first inequality 
follows from the fact that $$\pi _{l_p}(\cdot )$$ can only take values in $$\mathscr{A}_{l_p}$$, and the last inequality follows from the fact that $$l_p=\lfloor \log _2p\rfloor$$, so that $$|\mathscr{A}_{l_p}|\leqslant 2^{2^{l_p}}\leqslant 2^p$$. On the other hand, Proposition 3.5 yields $$\|\boldsymbol{a}\|_{\psi _2} \leqslant C\|a\|_{\psi _2}$$; combining this with Proposition 3.3 and a direct application of Bernstein’s inequality (Lemma A.7), we have, for any fixed $$\mathbf{t}\in \mathscr{T}$$,   $$P\left[\left|Z(\boldsymbol{t})\right|\geqslant 2C\|y\|_{\psi_2}\|a\|_{\psi_2}\left(1+\sqrt2\right)\frac{pu}{\sqrt{m}}\right]\leqslant2\mathrm{e}^{-pu}, \quad \mbox{whenever $pu\geqslant0$.} \quad$$Hence, applying Lemma A.5 with k = 1, for all $$1 \leqslant p < \infty$$,   $$E{\left[\left|Z(\boldsymbol{t})\right|^p\right]}^{1/p}\leqslant \frac{C\|y\|_{\psi_2}\|a\|_{\psi_2}p}{\sqrt{m}},$$for all $$\boldsymbol{t}\in \mathscr{T}$$ and some constant C > 0. Thus,   $$E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\pi_{l_p}(\boldsymbol{t}))\right|\right)^p\right]^{1/p}\leqslant\frac{2C\|y\|_{\psi_2}\|a\|_{\psi_2}p}{\sqrt{m}}\leqslant\frac{C\left(\|y\|_{\psi_2}^2+\|a\|_{\psi_2}^2\right)p}{\sqrt{m}}.$$ (3.9) Now consider $$E [Y_m^p ]^{1/p}$$, the final term in (3.8), recalling that $$Y_m=\left|\frac 1m\sum _{i=1}^m\left(y_i^2-E [y_i^2 ]\right)\right|$$. 
Applying Proposition 3.3, we have   $$\left\|y_i^2-E\left[y_i^2\right]\right\|_{\psi_1}\leqslant \left\|y_i^2\right\|_{\psi_{1}}+E\left[y_i^2\right] \leqslant2\big\|y_i\big\|_{\psi_2}^2+2\big\|y_i\big\|_{\psi_2}^2=4\big\|y\big\|_{\psi_2}^2.$$Thus, using Bernstein’s inequality and Lemma A.5 as before, we obtain   $$P\left[Y_m\geqslant4\left(1+\sqrt{2}\right)\|y\|_{\psi_2}^2\frac{pu}{\sqrt{m}}\right]\leqslant2\mathrm{e}^{-pu},\quad~\forall\, pu\geqslant0,$$and   $$E\left[Y_m^p\right]^{1/p}\leqslant\frac{C\|y\|_{\psi_2}^2p}{\sqrt{m}}\leqslant\frac{C\left(\|y\|_{\psi_2}^2+\|a\|_{\psi_2}^2\right)p}{\sqrt{m}}.$$ (3.10)Combining (3.8), (3.9) and (3.10) gives   $$E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\right)^p\right]^{1/p}\leqslant\frac{C\left(\|y\|_{\psi_2}^2+\|a\|_{\psi_2}^2\right)\left(\gamma_2(\mathscr{T})+p\right)}{\sqrt{m}},$$for some constant C > 0. Since this inequality holds for any p ⩾ 1, applying Lemma A.6 with k = 1 yields   $$P\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right| \geqslant C\left(\|y\|_{\psi_2}^2+\|a\|_{\psi_2}^2\right)\frac{\gamma_2(\mathscr{T})+u}{\sqrt{m}}\right] \leqslant \mathrm{e}^{-u}.$$The proof of (3.3) is now completed by invoking Lemma 3.7, which gives $$\gamma _2(\mathscr{T})\leqslant L\omega (\mathscr{T})$$ for some constant L ⩾ 1. Funding NSA grant (H98230-15-1-0250) to L.G. Footnotes 1  Since the set K is closed, the set $$D(K,\lambda \mathbf{x})\cap \mathbb{S}^{d-1}\subseteq \mathbb{R}^d$$ is also closed and thus Borel measurable. By taking $$\mathscr{T}=D(K,\lambda \mathbf{x})\cap \mathbb{S}^{d-1}\subseteq \mathbb{R}^d$$ in Remark 1.3, we have that the supremum is indeed measurable in the probability space $$(\Omega ,~\mathscr{E},~P)$$. References 1. Ai, A., Lapanowski, A., Plan, Y. & Vershynin, R. (2014) One-bit compressed sensing with non-Gaussian measurements. Linear Algeb. Appl., 441, 222–239. 2. Boucheron, S., Lugosi, G. 
& Massart, P. (2013) Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press. 3. Cacoullos, T. & Papathanasiou, V. (1992) Lower variance bounds and a new proof of the central limit theorem. J. Multivar. Anal., 43, 173–184. 4. Chatterjee, S. (2009) Fluctuations of eigenvalues and second order Poincaré inequalities. Probab. Theory Relat. Fields, 143, 1–40. 5. Chen, L., Goldstein, L. & Shao, Q. (2010) Normal Approximation by Stein’s Method. New York: Springer. 6. Cohn, D. L. (1980) Measure Theory. Boston: Birkhauser. 7. Dirksen, S. (2015) Tail bounds via generic chaining. Electron. J. Probab., 20, 1–29. 8. Erdogdu, M. A., Dicker, L. H. & Bayati, M. (2016) Scaled least squares estimator for GLM’s in large-scale problems. Adv. Neural Inf. Process. Syst., 3324–3332. 9. Foucart, S. & Rauhut, H. (2013) A Mathematical Introduction to Compressive Sensing. Boston: Birkhauser. 10. Goldstein, L. (2007) $$L^1$$ bounds in normal approximation. Ann. Probab., 35, 1888–1930. 11. Goldstein, L. (2010) Bounds on the constant in the mean central limit theorem. Ann. Probab., 38, 1672–1689. 12. Goldstein, L., Minsker, S. & Wei, X. (2016) Structured signal recovery from non-linear and heavy-tailed measurements. preprint arXiv:1609.01025. 13. Goldstein, L. & Reinert, G. (1997) Stein’s method and the zero bias transformation with application to simple random sampling. Ann. Appl. Probab., 7, 935–952. 14. Ledoux, M. & Talagrand, M. (1991) Probability in Banach Spaces: Isoperimetry and Processes. Berlin: Springer. 15. Plan, Y. & Vershynin, R. 
(2016) The generalized lasso with non-linear observations. IEEE Trans. Inf. Theory, 62, 1528–1537. 16. Rachev, S. T. (1991) Probability Metrics and the Stability of Stochastic Models, vol. 269. The University of Michigan, John Wiley. 17. Rudelson, M. & Vershynin, R. (2008) On sparse reconstruction from Fourier and Gaussian measurements. Commun. Pure Appl. Math., 61, 1025–1045. 18. Shevtsova, I. G. (2010) An improvement of convergence rate estimates in the Lyapunov theorem. Doklady Math., 82, 862–864. 19. Stein, C. (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 2, 583–602. 20. Talagrand, M. (2014) Upper and lower bounds for stochastic processes: modern methods and classical problems. Ergebnisse der Mathematik und ihrer Grenzgebiete. Berlin Heidelberg: Springer. 21. Vershynin, R. (2010) Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications. New York: Cambridge University Press. 22. Yang, Z., Balasubramanian, K. & Liu, H. (2017) High-dimensional Non-Gaussian Single Index Models via Thresholded Score Function Estimation. Proceedings of the 34th International Conference on Machine Learning, pp. 3851–3860. APPENDIX A. Additional lemmas The following lemma is one version of the contraction principle; for a proof see [14]: Lemma A.1 Let $$F:[0,\infty ) \rightarrow [0,\infty )$$ be convex and non-decreasing. 
Let $$\{\eta _i\}$$ and $$\{\xi _i\}$$ be two symmetric sequences of real-valued random variables such that for some constant C ⩾ 1 for every i and t > 0, we have   $$P\left[|\eta_i|>t\right]\leqslant C \cdot P\left[|\xi_i|>t\right]\!.$$Then, for any finite sequence $$\{\mathbf{x}_i\}$$ in a vector space with semi-norm ∥⋅∥,   $$E{\left[F\left(\left\|\sum_i\eta_i\mathbf{x}_i\right\|\right)\right]} \leqslant E{\left[F\left(C \cdot\left\|\sum_i\xi_i\mathbf{x}_i\right\|\right)\right]}.$$ Remark A.2 Though Lemma 4.6 of [14] states the contraction principle in a Banach space, the proofs of Theorem 4.4 and Lemma 4.6 of [14] hold for vector spaces under any semi-norm. The following symmetrization lemma is the same as Lemma 4.6 of [1]. Lemma A.3 (Symmetrization) Let   \begin{equation*} \overline{Z}(\boldsymbol{t})=f_{\mathbf{x}}(\boldsymbol{t})-E\left[\,f_{\mathbf{x}}(\boldsymbol{t})\right] \quad \mbox{where} \quad f_{\boldsymbol{x}}(\boldsymbol{t})=\frac{1}{m}\sum_{i=1}^m y_i \langle\boldsymbol{a}_i,\mathbf t \rangle, \end{equation*}and   $$Z(\boldsymbol{t})= \frac{1}{m}\sum_{i=1}^m\varepsilon_iy_i\langle\mathbf{a}_i,\mathbf t\rangle,$$where $$\{ \varepsilon _i: 1 \leqslant i \leqslant m\}$$ is a collection of Rademacher random variables, each uniformly distributed over {−1, 1} and independent of each other and of $$\{y_i,\boldsymbol{a}_i: 1 \leqslant i \leqslant m\}$$. 
Then for any measurable set $$\mathscr{T}\subset \mathbb{R}^d$$,   $$E{\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\overline{Z}(\boldsymbol{t})\right|\right]} \leqslant 2E{\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\right]},$$and for any $$\beta>0$$  $$P\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\overline{Z}(\boldsymbol{t})\right|\geqslant 2E{\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\overline{Z}(\boldsymbol{t})\right|\right]}+\beta\right]\leqslant 4P\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\geqslant \beta/2\right].$$ Lemma A.4 (Lemma A.4 of [7]) Fix $$1\leqslant p <\infty$$, $$0<k<\infty$$, u ⩾ 2 and $$l_p:=\lfloor \log _2p\rfloor$$. For every $$l>l_p$$, let $$J_l$$ be an index set such that $$|J_l|\leqslant 2^{2^{l+1}}$$, and $$\{\Omega _{l,i} \}_{i\in J_l}$$ a collection of events, satisfying   $$P\left[\Omega_{l,i}\right]\leqslant 2\exp(-2^lu^k),\quad\forall\, i\in J_l.$$Then there exists an absolute constant c ⩽ 16 such that   $$P\left[\cup_{l>l_p}\cup_{i\in J_l}\Omega_{l,i}\right]\leqslant c \exp(-pu^k/4).$$ Lemma A.5 (Lemma A.5 of [7]) Fix $$1\leqslant p <\infty$$ and $$0<k<\infty$$. Let $$\beta \geqslant 0$$ and suppose that $$\xi$$ is a non-negative random variable such that for some $$c,u_*>0$$,   $$P\left[\xi>\beta u\right]\leqslant c\exp(-pu^k/4),\quad \forall\, u\geqslant u_*.$$Then for a constant $$\tilde{c}_k>0$$ depending only on k,   $$E{\left[\xi^p\right]}^{1/p}\leqslant\beta(\tilde{c}_kc+u_*).$$ Lemma A.6 (Proposition 7.11 of [9]) If X is a non-negative random variable, satisfying   $$E{\left[X^p\right]}^{1/p}\leqslant b+ ap^{1/k} \quad \forall\, p\geqslant1,$$for positive real numbers a and k, and b ⩾ 0, then for any u ⩾ 1,   $$P\left[X\geqslant \mathrm{e}^{1/k}(b+au)\right]\leqslant\exp(-u^k/k).$$ Finally, for the following result see Theorem 2.10 of [2]. 
Lemma A.7 (Bernstein’s inequality) Let $$X_1,\cdots ,X_m$$ be a sequence of independent, mean zero random variables. If there exist positive constants $$\sigma$$ and D such that   $$\frac1m\sum_{i=1}^mE{\left[|X_i|^p\right]}\leqslant\frac{p!}{2}\sigma^2D^{p-2},~p=2,3,\cdots$$then for any u ⩾ 0,   $$P\left[\left|\frac1m\sum_{i=1}^mX_i\right|\geqslant\frac{\sigma}{\sqrt{m}}\sqrt{2u}+\frac{D}{m}u\right] \leqslant2\exp(-u).$$If $$X_1,\cdots ,X_m$$ are all subexponential random variables, then $$\sigma$$ and D can be chosen as $$\sigma=\frac{1}{m}\sum _{i=1}^m\|X_i\|_{\psi _1}$$ and $$D=\max _i\|X_i\|_{\psi _1}$$. B. Additional proofs With g a standard normal variable, we begin by considering the solution f to (2.28), the special case of the Stein equation   $$f^{\prime}(x)-xf(x)=h(x)-Eh(g),$$ (B.1)with the specific choice of test function h(x) = |x|. Lemma B.1 The solution f of (2.28) satisfies ∥f″∥ = 1. Proof. In general, when f solves (B.1) for a given test function h(⋅) then − f(−x) solves (B.1) for h(−⋅). In the case at hand h(x) = |x|, so that h(−x) = h(x), and it therefore suffices to show that 0 ⩽ f″(x) ⩽ 1 for all x > 0, over which range (2.28) specializes to   $$f^{\prime}(x)-xf(x)=x-\sqrt{\frac2\pi}.$$ (B.2)Taking derivatives on both sides yields   $$f^{\prime\prime}(x)-f(x)-xf^{\prime}(x)=1,$$and combining the above two equalities gives   $$f^{\prime\prime}(x)=(1+x^2)\,f(x)+x\left(x-\sqrt{\frac2\pi}\right)+1.$$ (B.3)On the other hand, solving (B.2) via integrating factors gives, for all x > 0,   \begin{align} f(x)=&\,-\mathrm{e}^{x^2/2}\int_x^\infty\left(z-\sqrt{\frac2\pi}\right)\mathrm{e}^{-z^2/2}\,\mathrm{d}z =-1+2\mathrm{e}^{x^2/2}\int_x^\infty\frac{\mathrm{e}^{-z^2/2}}{\sqrt{2\pi}}\,\mathrm{d}z\nonumber\\ =&\,-1+2\mathrm{e}^{x^2/2}(1-\Phi(x)), \end{align} (B.4)where $$\Phi (\cdot )$$ is the cumulative distribution function of the standard normal. 
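The closed form (B.4) lends itself to a quick numerical sanity check. The sketch below evaluates f through the complementary error function, using $$1-\Phi(x)=\mathrm{erfc}(x/\sqrt{2})/2$$, tests the residual of (B.2) with a central difference, and evaluates f″ through (B.3) on a grid in (0, 5]. The function names, the grid, the step size and the tolerances are illustrative assumptions, and a finite grid is no substitute for the proof.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)  # E|g| for a standard normal g


def stein_solution(x):
    """f(x) = -1 + 2 exp(x^2/2) (1 - Phi(x)) for x >= 0, as in (B.4)."""
    # 1 - Phi(x) = erfc(x / sqrt(2)) / 2, so the factor 2 cancels.
    return -1.0 + math.exp(x * x / 2.0) * math.erfc(x / math.sqrt(2.0))


def ode_residual(x, h=1e-5):
    """f'(x) - x f(x) - (x - sqrt(2/pi)), with f' from a central difference."""
    derivative = (stein_solution(x + h) - stein_solution(x - h)) / (2.0 * h)
    return derivative - x * stein_solution(x) - (x - SQRT_2_OVER_PI)


def second_derivative(x):
    """f''(x) obtained from identity (B.3)."""
    return (1.0 + x * x) * stein_solution(x) + x * (x - SQRT_2_OVER_PI) + 1.0


grid = [0.01 * k for k in range(1, 501)]  # points in (0, 5]
max_residual = max(abs(ode_residual(x)) for x in grid)
min_f2 = min(second_derivative(x) for x in grid)
max_f2 = max(second_derivative(x) for x in grid)
print(max_residual, min_f2, max_f2)
```

Since f(0) = 0, identity (B.3) gives f″(x) → 1 as x ↓ 0, which is why the norm in Lemma B.1 equals 1 rather than merely being bounded by it.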
For any x > 0, by classical upper and lower tail bounds for $$\Phi (\cdot )$$, we have   $$\frac{x}{\sqrt{2 \pi}(1+x^2)} \leqslant \mathrm{e}^{x^2/2}(1-\Phi(x))\leqslant\min\left\{\frac12,\frac{1}{x\sqrt{2\pi}}\right\},$$which in turn implies, using (B.3) and (B.4), that for all x > 0   \begin{equation*} 0 \leqslant f^{\prime\prime}(x) \leqslant \min\left\{x\left(x-\sqrt{\frac{2}{\pi}}\right)+1,\frac{1}{x}\sqrt{\frac{2}{\pi}}\right\}. \end{equation*}Handling the cases $$0<x\leqslant \sqrt{2/\pi }$$ and $$x>\sqrt{2/\pi }$$ separately, we see 0 ⩽ f″(x) ⩽ 1 for all x > 0, as desired. Proof of Proposition 3.3 We may assume $$\|Y\|_{\psi _1}\not =0$$ as the inequality is trivial otherwise. By definition $$\|XY\|_{\psi _1}=\sup _{p\geqslant 1}p^{-1}E{\left [|XY|^p\right ]}^{1/p}$$. Applying $$2ab\leqslant a^2+b^2$$ and Minkowski’s inequality, for any $$\varepsilon>0$$,   \begin{equation*} E{\left[|XY|^p\right]}^{1/p}\leqslant E{\left[\left|\frac{X^2}{2\varepsilon}+\frac{\varepsilon Y^2}{2}\right|{}^p\right]}^{1/p} \leqslant\frac{1}{2\varepsilon}E{\left[X^{2p}\right]}^{1/p}+\frac{\varepsilon}{2}E{\left[Y^{2p}\right]}^{1/p}. \end{equation*}Applying the definition of the $$\psi _1$$ norm, this inequality implies   $$\|XY\|_{\psi_1}\leqslant\frac{1}{2\varepsilon}\|X^2\|_{\psi_1}+\frac{\varepsilon}{2}\|Y^2\|_{\psi_1}.$$The term $$\|X^2\|_{\psi _1}$$ can be bounded as follows,   \begin{equation*} \|X^2\|_{\psi_1}=\sup_{p\geqslant1}\left(p^{-1/2}E{[X^{2p}]}^{1/2p}\right)^2 =2 \sup_{p\geqslant1}\left((2p)^{-1/2}E{[X^{2p}]}^{1/2p}\right)^2 \leqslant2\|X\|_{\psi_2}^2. \end{equation*}Arguing similarly for Y,   $$\|XY\|_{\psi_1}\leqslant\frac{1}{\varepsilon}\|X\|^2_{\psi_2}+\varepsilon\|Y\|^2_{\psi_2},$$and choosing $$\varepsilon =\|X\|_{\psi _2}/\|Y\|_{\psi _2}$$ finishes the proof. 
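Returning to Lemma B.1, the bound 0 ⩽ f″(x) ⩽ 1 can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument: it evaluates f through the closed form (B.4) and f″ through (B.3) on a grid of positive x. The grid range, the helper names `mills`, `f` and `f_second`, and the use of the standard-library `math.erfc` to evaluate $$1-\Phi(x)$$ are choices made here for illustration only.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def mills(x):
    """e^{x^2/2} * (1 - Phi(x)) for x >= 0, using 1 - Phi(x) = erfc(x / sqrt(2)) / 2."""
    return math.exp(0.5 * x * x) * 0.5 * math.erfc(x / math.sqrt(2.0))

def f(x):
    """Closed form (B.4) of the Stein solution for h(x) = |x|, valid for x > 0."""
    return -1.0 + 2.0 * mills(x)

def f_second(x):
    """Second derivative of f for x > 0, computed from identity (B.3)."""
    return (1.0 + x * x) * f(x) + x * (x - SQRT_2_OVER_PI) + 1.0

# Illustrative grid on (0, 8]; larger x would need a more careful evaluation of mills().
grid = [k / 1000.0 for k in range(1, 8001)]
values = [f_second(x) for x in grid]

print("min f'' on grid:", min(values))
print("max f'' on grid:", max(values))
print("f'' near zero  :", f_second(1e-6))
```

On this grid the computed values stay within [0, 1] up to rounding, and the value near x = 0 approaches 1, in line with ∥f″∥ = 1. A grid check of this kind illustrates the lemma but does not replace the proof.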
Proof of Proposition 3.4 By definition, we have   \begin{align*} \|\mathbf{a}\|_{\psi_2}=&\sup_{\mathbf{z}\in\mathbb{S}^{d-1}}\|\langle\mathbf{a},\mathbf{z}\rangle\|_{\psi_2}\\ =&\sup_{\mathbf{z}\in\mathbb{S}^{d-1}} \sup_{p\geqslant1}\frac{1}{p^{1/2}}E{\left[|\langle\mathbf{a},\mathbf{z}\rangle|^p\right]}^{1/p}\\ \geqslant&\sup_{\mathbf{z}\in\mathbb{S}^{d-1}}\frac{1}{\sqrt{2}}E{\left[\langle\mathbf{a},\mathbf{z}\rangle^2\right]}^{1/2}\\ =&\frac{1}{\sqrt{2}}\sup_{\mathbf{z}\in\mathbb{S}^{d-1}}\langle\mathbf{\Sigma}\mathbf{z},\mathbf{z}\rangle^{1/2}\\ =&\frac{1}{\sqrt{2}}\sigma_{\max}(\mathbf{\Sigma})^{1/2}, \end{align*}where the inequality follows by taking p = 2 in the inner supremum, and squaring both sides finishes the proof.

Proof of Lemma 3.9 Since $$\langle \mathbf{a}_i,\mathbf{t}\rangle$$ is subgaussian, $$\langle \mathbf{a}_i,\mathbf{t}\rangle ^2$$ is subexponential by Proposition 3.3. Note that $$E [\langle \mathbf{a}_i,\mathbf{t}\rangle ^2 ]\leqslant \sigma _{\max }(\mathbf \Sigma )\|\mathbf{t}\|_2^2\leqslant 2\|\mathbf{a}\|_{\psi _2}^2\|\mathbf{t}\|_2^2$$ by Proposition 3.4. 
Then, by Remark 1.5 and Proposition 3.3,   $$\left\|\langle\mathbf{a}_i,\mathbf{t}\rangle^2-E{\left[\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right]}\right\|_{\psi_1} \leqslant\left\|\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right\|_{\psi_1}+2\|\mathbf{a}\|_{\psi_2}^2\|\mathbf{t}\|_2^2 \leqslant3\|\mathbf a\|_{\psi_2}^2\|\mathbf{t}\|_2^2.$$Now an application of Bernstein’s inequality (Lemma A.7) gives   $$P\left[\left(\frac1m\sum_{i=1}^m\langle\mathbf{a}_i,\mathbf{t}\rangle^2-E{\left[\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right]}\right)\geqslant 3\|\mathbf a\|_{\psi_2}^2\left(\frac{\sqrt{2v}}{\sqrt{m}}+\frac{v}{m}\right)\|\mathbf{t}\|_2^2\right] \leqslant 2\mathrm{e}^{-v}.$$We let $$v=2^{l}u$$ and apply the hypothesis $$2^{l/2}>\sqrt{m}$$ and u ⩾ 1 to obtain   $$P\left[\left(\frac1m\sum_{i=1}^m\langle\mathbf{a}_i,\mathbf{t}\rangle^2-E{\left[\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right]}\right) \geqslant 3\left(1+\sqrt{2}\right) \|\mathbf a\|_{\psi_2}^2 \frac{2^lu}{m}\|\mathbf{t}\|_2^2\right] \leqslant 2\exp{(-2^lu)}.$$Thus, by $$2^{l/2}>\sqrt{m}$$ and u ⩾ 1 again,   $$P\left[\left(\frac1m\sum_{i=1}^m\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right) \geqslant \left(3 \left(1+\sqrt{2}\right)+2\right)\frac{2^lu}{m} \|\mathbf a\|_{\psi_2}^2\|\mathbf{t}\|_2^2\right] \leqslant 2\exp{(-2^lu)},$$which yields the claim upon taking square roots on both sides of the first inequality.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices). For permissions, please e-mail: journals.permissions@oup.com

Information and Inference: A Journal of the IMA, Oxford University Press

Volume: Advance Article, May 21, 2018
35 pages
Publisher: Oxford University Press
ISSN: 2049-8764
eISSN: 2049-8772
DOI: 10.1093/imaiai/iay006

Regarding notation, we generally adhere to the principle that random variables appear in upper case, but to be consistent with the existing literature, and in particular with [1], we make an exception for the components of the sensing vector, generically denoted by a, or by g when Gaussian, and also for the observed values, denoted by y. Vectors are in bold face.

1.1 Estimator and main result

Given the pairs $$\{(y_i,\boldsymbol{a}_i)\}_{i=1}^m$$ generated by the model (1.1), let   $$L_m(\mathbf{t}):=\|\mathbf{t}\|_2^2-\frac{2}{m}\sum_{i=1}^my_i\left\langle\mathbf{a}_i,\mathbf{t}\right\rangle \quad \mbox{for } \boldsymbol{t} \in K,$$ (1.2)which is an unbiased estimator of   $$L(\mathbf{t}):=\|\mathbf{t}\|_2^2-2E{\left[\,y\left\langle\mathbf{a},\mathbf{t}\right\rangle\right]}.$$ (1.3)As the components of a have mean zero, variance one and are independent, $$E{ [\mathbf{a}\mathbf{a}^T ]}=\mathbf{I}_{d\times d}$$, and therefore minimizing L(t) is equivalent to minimizing the quadratic loss $$E [\left (\,y-\left \langle \mathbf{a},\mathbf{t}\right \rangle \right )^2 ]$$. Thus, we define the estimator   $$\widehat{\mathbf{x}}_m:=\mathop{\mbox{argmin}}_{\mathbf{t}\in K}L_m(\mathbf{t}).$$ (1.4)For simplicity of notation, we will write   $$f_{\mathbf{x}}(\mathbf{t}):=\frac{1}{m}\sum_{i=1}^m y_i\left\langle\mathbf{a}_i,\mathbf{t}\right\rangle.$$ (1.5)To state the main result, we need the following three definitions:

Definition 1.1 (Gaussian mean width) For $$\mathbf{g}\sim \mathscr{N}(0,\mathbf{I}_{d\times d})$$, the Gaussian mean width of a set $$\mathscr{T}\subseteq \mathbb{R}^d$$ is   $$\omega(\mathscr{T})=E{\left[\sup_{\boldsymbol{t}\in \mathscr{T}}\left\langle\mathbf{g},\boldsymbol{t}\right\rangle\right]}.$$

Remark 1.2 In [1], the definition of Gaussian mean width of a set $$\mathscr{T}$$ is taken to be   $$\omega(\mathscr{T})=E{\left[\sup_{\boldsymbol{t}\in\mathscr{T} - \mathscr{T}}\left\langle\mathbf{g},\boldsymbol{t}\right\rangle\right]},$$where the supremum is over the Minkowski difference. Here, for ease of presentation, we adopt the somewhat more ‘classical’ Definition 1.1 that appears in earlier works in the literature, such as [17]. These two definitions are equivalent up to a constant as $$E \,[\sup _{\boldsymbol{t}\in \mathscr{T} - \mathscr{T}}\left \langle \mathbf{g},\boldsymbol{t}\right \rangle ]= 2E\, [\sup _{\boldsymbol{t}\in \mathscr{T}}\left \langle \mathbf{g},\boldsymbol{t}\right \rangle ]$$, which can be seen using the symmetry of the distribution of g.

Remark 1.3 (Measurability issue) The precise meaning of $$E\, [\sup _{t\in \mathscr{T}}X(t) ]$$ for an arbitrary process $$\{X(t)\}_{t\in \mathscr{T}}$$ is not clear if $$\mathscr{T}$$ is uncountable. In fact, for an uncountable index set $$\mathscr{T}$$, the function $$\sup _{t\in{\mathscr{T}}}X(t)$$ might not be measurable. Letting $$(\Omega ,\mathscr{E},P)$$ be the underlying probability space, well-known counterexamples (first constructed by Luzin and Suslin) exist even in the case where X(⋅) is jointly measurable on the product space $$(\Omega \times \mathscr{T},~\mathscr{E}\otimes \varPsi )$$, where $$\varPsi$$ is a Borel $$\sigma$$-algebra on $$\mathscr{T}$$. However, when $$\mathscr{T}$$ is a Borel measurable subset of $$\mathbb{R}^d$$ (which is the case we are interested in) and X(⋅) is jointly measurable on $$(\Omega \times \mathscr{T},~\mathscr{E}\otimes \varPsi )$$, one can show that $$\sup _{t\in \mathscr{T}}X(t)$$ is always measurable. 
Indeed, $$\sup _{t\in \mathscr{T}}X(t)$$ is measurable if and only if the set $$\{\sup _{t\in \mathscr{T}}X(t)> c\}\in \mathscr{E}$$ for any $$c\in \mathbb{R}$$. On the other hand, $$\{\sup _{t\in \mathscr{T}}X(t)> c\}=P_{\Omega }\{X(\cdot )> c\}$$, where for any set $$A\subseteq \Omega \times \mathscr{T}$$, $$P_{\Omega }A:=\{\omega \in \Omega :(\omega ,t)\in A \mbox{ for some } t\in \mathscr{T}\}$$ is the projection of the set A onto $$\Omega$$. Measurability then follows from the following theorem in [6]: if $$(\Omega , \mathscr{E})$$ is a measurable space and $$\mathscr{T}$$ is a Polish space, then the projection onto $$\Omega$$ of any product measurable subset of $$\Omega \times \mathscr{T}$$ is also measurable.

Definition 1.4 ($$\psi _q$$-norm) The $$\psi _q$$-norm of a real-valued random variable X is given by   $$\|X\|_{\psi_q}=\sup_{p\geqslant1}p^{-\frac{1}{q}}\left(E\left[|X|^p\right]\right)^{\frac1p}.$$In particular, for q = 1 and q = 2 the $$\psi _q$$-norm is called the subexponential and the subgaussian norm, respectively, and we say X is subexponential or subgaussian when $$\|X\|_{\psi _1}<\infty$$ or $$\|X\|_{\psi _2}<\infty$$. The subgaussian q = 2 case of Definition 1.4 is the most important. Though the $$\psi _2$$-norm we have chosen to use here is based on comparing the growth of a distribution’s absolute moments to that of a normal, definitions equivalent up to universal constants can also be stated in terms of comparisons of tail decay or of the Laplace transform of X, among others.

Remark 1.5 It is easily verified that $$\|\cdot \|_{\psi _q}$$ for q ⩾ 1 defines a norm, with identification of almost everywhere equal random variables. Here we only check the triangle inequality, as it is immediate that $$\|\cdot \|_{\psi _q}$$ is homogeneous and separates points. 
Indeed, for any two random variables X and Y, the Minkowski inequality yields that   \begin{equation*} \|X+Y\|_{\psi_q}=\sup_{p\geqslant1}p^{-\frac{1}{q}}\left(E{\left[|X+Y|^p\right]}\right)^{\frac1p} \leqslant\sup_{p\geqslant1}p^{-\frac{1}{q}}\left(\left(E{\left[|X|^p\right]}\right)^{\frac1p}+\left(E{\left[|Y|^p\right]}\right)^{\frac1p}\right) \leqslant\|X\|_{\psi_q}+\|Y\|_{\psi_q}. \end{equation*} Definition 1.6 (Descent cone) The descent cone of a set $$\mathscr{T}\subseteq \mathbb{R}^d$$ at any point $$\boldsymbol{t}_0\in \mathscr{T}$$ is defined as   $$D(\mathscr{T},\boldsymbol{t}_0)=\big\{\tau\mathbf{h}:~\tau\geqslant0, \mathbf{h}\in \mathscr{T}-\boldsymbol{t}_0\big\}.$$ Theorem 1.7 Let $$\mathbf{a}=(a_1,\ldots ,a_d)$$ where $$a_1,\ldots ,a_d$$ are i.i.d. copies of a random variable a with a centered subgaussian distribution having unit variance, and let $$\{(y_i,\mathbf{a}_i)\}_{i=1}^m$$ be i.i.d. copies of the pair (y, a) where y, given by the sensing model (1.1), is assumed to be subgaussian. If K is a closed, measurable subset of $$\mathbb{R}^d$$ and $$\lambda \boldsymbol{x} \in K$$ where   $$\lambda=E{\left[\,y\left\langle\mathbf{a},\mathbf{x}\right\rangle\right]},$$ (1.6)then for all u ⩾ 2, with probability at least $$1-4\mathrm{e}^{-u}$$, the estimator $$\widehat{\mathbf{x}}_m$$ given by (1.4) satisfies   $$\left\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\right\|_2 \leqslant2\alpha+C_0\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\omega\left(D(K,\lambda\mathbf{x})\cap\mathbb{S}^{d-1}\right)+u}{\sqrt{m}},$$for all $$m\geqslant \omega (D(K,\mathbf{x})\cap \mathbb{S}^{d-1})^2$$ and some constant $$C_0>0$$, where   $$\alpha=\sup\left\{ \big|E{\left[ y \left\langle\mathbf{a},\mathbf{t}\right\rangle \right]}-\lambda\left\langle\mathbf{x},\mathbf{t}\right\rangle\! \big|, \mathbf{t} \in B_2^d \right\}\!,$$ (1.7)and $$\mathbb{S}^{d-1}$$ and $$B_2^d$$ are the unit Euclidean sphere and ball in $$\mathbb{R}^d$$, respectively. 
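To make the estimator (1.4) concrete, note that completing the square in (1.2) gives $$L_m(\mathbf{t})=\|\mathbf{t}-\widehat{\mathbf{v}}\|_2^2-\|\widehat{\mathbf{v}}\|_2^2$$ with $$\widehat{\mathbf{v}}=\frac{1}{m}\sum_{i=1}^m y_i\mathbf{a}_i$$, so $$\widehat{\mathbf{x}}_m$$ is a closest point to $$\widehat{\mathbf{v}}$$ in K. The sketch below illustrates this under assumptions that are not taken from the paper: K is the set of s-sparse vectors (so the projection is hard thresholding), the sensing entries are Rademacher, $$\theta=\tanh$$, and the observations carry small additive Gaussian noise. The names `hard_threshold` and `estimate` and all parameter values are hypothetical.

```python
import math
import random
from itertools import product

def hard_threshold(v, s):
    """Euclidean projection onto s-sparse vectors: keep the s largest |entries| of v."""
    keep = set(sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:s])
    return [v[j] if j in keep else 0.0 for j in range(len(v))]

def estimate(ys, As, s):
    """Minimize L_m(t) = ||t||^2 - (2/m) sum_i y_i <a_i, t> over s-sparse t.

    Since L_m(t) = ||t - v||^2 - ||v||^2 with v = (1/m) sum_i y_i a_i,
    the minimizer is the projection of v onto the constraint set.
    """
    m, d = len(ys), len(As[0])
    v = [sum(ys[i] * As[i][j] for i in range(m)) / m for j in range(d)]
    return hard_threshold(v, s)

rng = random.Random(0)
d, s, m = 20, 3, 4000                              # assumed problem sizes
x = [1.0 / math.sqrt(s)] * s + [0.0] * (d - s)     # unit-norm, s-sparse unknown
theta = math.tanh                                  # assumed nonlinearity

# Rademacher (non-Gaussian, subgaussian) sensing vectors and noisy observations.
As = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(m)]
ys = [theta(sum(a[j] * x[j] for j in range(d))) + rng.gauss(0.0, 0.1) for a in As]

x_hat = estimate(ys, As, s)

# lambda = E[y <a, x>] = E[<a, x> theta(<a, x>)], computed exactly by
# enumerating the 2^s sign patterns of a on the support of x.
lam = 0.0
for signs in product((-1.0, 1.0), repeat=s):
    w = sum(signs) / math.sqrt(s)
    lam += w * theta(w) / 2 ** s

err = math.sqrt(sum((x_hat[j] - lam * x[j]) ** 2 for j in range(d)))
print(f"lambda = {lam:.4f}, ||x_hat - lambda x||_2 = {err:.4f}")
```

In this symmetric toy setting the estimate concentrates near $$\lambda\mathbf{x}$$ rather than near x itself, which is the scaling that Theorem 1.7 describes. The pure-Python loops are chosen for self-containment, not speed.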
We note that $$\alpha =0$$ under the conditions of Theorems 2.1 and 2.4, and also when $$\theta$$ is linear. Hence, Theorem 1.7 recovers results for the normal and linear compressed sensing models as special cases.

Remark 1.8 At first glance it may seem surprising that the least squares type estimator (1.4), which is well known to work when $$\theta$$ is linear, succeeds in such greater generality. The appearance of the factors $$\lambda$$ and $$\alpha$$ in (1.6) and (1.7), respectively, may also be non-intuitive. The following explanations may shed some light. First, regarding the scaling factor $$\lambda$$, one can easily verify that if $$\theta (w)=\mu w$$, a linear function, then $$\lambda =\mu$$. Hence, in this case $$\theta (\langle \boldsymbol{a},\boldsymbol{x}\rangle )=\lambda \langle \boldsymbol{a},\boldsymbol{x}\rangle = \langle \boldsymbol{a}, \lambda \boldsymbol{x}\rangle$$, so the model behaves as though the unknown vector to be estimated has length $$\lambda$$, possibly different from one, the assumed length of x. Next, we present Lemma 1.9, used later in the proof of Theorem 1.7, to give some intuition as to why the proposed estimator succeeds when $$\theta$$ is nonlinear. Let L be as in (1.3), the expectation of the function $$L_m$$ whose minimizer over K defines the estimator $$\widehat{\mathbf{x}}_m$$.

Lemma 1.9 For any t ∈ K, we have   $$L(\mathbf{t})-L(\lambda\mathbf{x})\geqslant\|\mathbf{t}-\lambda\mathbf{x}\|_2^2-2\alpha\|\mathbf{t}-\lambda\mathbf{x}\|_2,$$where $$\lambda$$ and $$\alpha$$ are defined in (1.6) and (1.7), respectively. Proof. 
For any t ∈ K,   \begin{align*} L(\mathbf{t})-L(\lambda\mathbf{x})=&\, \|\mathbf{t}\|_2^2-\|\lambda\mathbf{x}\|_2^2 -2E{\left[\,y\left\langle\mathbf{a},\mathbf{t}-\lambda\mathbf{x}\right\rangle\right]} \\ \geqslant&\,\|\mathbf{t}\|_2^2-\|\lambda\mathbf{x}\|_2^2-2\lambda\left\langle\mathbf{t}-\lambda\mathbf{x},\mathbf{x}\right\rangle-2\alpha\|\mathbf{t}-\lambda\mathbf{x}\|_2\\ =&\,\|\mathbf{t}-\lambda\mathbf{x}\|_2^2-2\alpha\|\mathbf{t}-\lambda\mathbf{x}\|_2, \end{align*}where the inequality follows from (1.7), applied after rescaling $$\mathbf{t}-\lambda\mathbf{x}$$ to lie in $$B_2^d$$, and the final equality uses $$\|\mathbf{x}\|_2=1$$.

Hence, if one could minimize L instead of $$L_m$$ over K (the difference in practice being controlled by a generic chaining argument), then, as $$\lambda \boldsymbol{x} \in K$$ implies that the minimizer satisfies $$L(\,\widehat{\boldsymbol{x}}_m) \leqslant L(\lambda\boldsymbol{x})$$, one obtains   $$\|\,\widehat{\boldsymbol{x}}_m-\lambda\boldsymbol{x}\|_2^2 \leqslant \left[L(\,\widehat{\boldsymbol{x}}_m) - L(\lambda\boldsymbol{x})\right] + 2 \alpha \|\,\widehat{\boldsymbol{x}}_m-\lambda\boldsymbol{x}\|_2 \leqslant 2 \alpha \|\,\widehat{\boldsymbol{x}}_m-\lambda\boldsymbol{x}\|_2,$$ (1.8)and therefore that   \begin{equation*} \|\,\widehat{\boldsymbol{x}}_m-\lambda\boldsymbol{x}\|_2 \leqslant 2 \alpha. \end{equation*} From the inequality in the proof of Lemma 1.9, one can see that $$\alpha$$ is the ‘price’ for replacing the nonlinearity inherent in y with a simpler inner product, as supported by the fact that $$\alpha =0$$ when $$\theta$$ is linear. In addition, parts (a) and (b) of Theorem 2.1 to follow show that $$\alpha$$ is again zero when $$\theta$$ is Lipschitz, or has bounded second derivative, and the sensing vector is composed of independent Gaussian variables. Theorem 2.4 provides this same conclusion when $$\theta$$ is the sign function. Hence, in these cases, minimizing L would lead to exact recovery.

As mentioned earlier, the length of the unknown vector x in (1.1) is not identifiable due to the generality in $$\theta$$ that the model allows. 
However, if one has prior knowledge that $$\|\mathbf{x}\|_2 = 1$$, the following corollary to Theorem 1.7 shows that rescaling $$\widehat{\mathbf{x}}_m$$ to have norm 1 gives an estimator of the true vector x. The idea underlying the corollary was originally developed in [12]. Corollary 1.10 Let the conditions of Theorem 1.7 be in force, and suppose that $$\|\mathbf{x}\|_2 = 1$$ and $$\lambda> 0$$. Define the normalized estimator $$\overline{\mathbf{x}}_m$$ as   $$\overline{\mathbf{x}}_m := \begin{cases} \widehat{\mathbf{x}}_m/\|\,\widehat{\mathbf{x}}_m\|_2,~~&\textrm{if }~\widehat{\mathbf{x}}_m\neq0, \\ 0,~~&\textrm{if }~\widehat{\mathbf{x}}_m=0. \end{cases}$$Then there exists a constant $$C_0>0$$ such that for all u ⩾ 2, with probability at least $$1-4\mathrm{e}^{-u}$$,   $$\left\|\, \overline{\mathbf{x}}_m - \mathbf{x}\right\|_2 \leqslant\frac{4\alpha}{\lambda}+2C_0\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\omega\left(D(K,\lambda\mathbf{x})\cap\mathbb{S}^{d-1}\right)+u}{\lambda\sqrt{m}},$$whenever $$m\geqslant \omega (D(K,\mathbf{x})\cap \mathbb{S}^{d-1})^2$$. Proof. By Theorem 1.7, we know that with probability at least $$1-4\mathrm{e}^{-u}$$  $$\left\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\right\|_2\leqslant B \quad \mbox{where} \quad B = 2\alpha+C_0\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\omega\left(D(K,\lambda\mathbf{x})\cap\mathbb{S}^{d-1}\right)+u}{\sqrt{m}}.$$Since $$\lambda>0$$, it follows that on this event   \begin{equation*} \left\|\frac{\widehat{\mathbf{x}}_m}{\lambda} - \mathbf{x}\right\|_2\leqslant \frac{B}{\lambda}. \end{equation*} Let $$\omega \in [0,\pi )$$ be the angle between $$\widehat{\mathbf{x}}_m$$ and x (see Fig. 1). First consider the case where either $$\omega \geqslant \frac{\pi }{2}$$ or $$\widehat{\mathbf{x}}_m =0$$. 
Then $$\langle\, \widehat{\mathbf{x}}_m,\mathbf{x}\rangle \leqslant 0$$, and we have from the above inequality,   $$\frac{B}{\lambda}\geqslant\left\|\frac{\widehat{\mathbf{x}}_m}{\lambda} - \mathbf{x}\right\|_2 = \sqrt{\|\,\widehat{\mathbf{x}}_m/\lambda\|_2^2 - 2 \langle\,\widehat{\mathbf{x}}_m,\mathbf{x}\rangle/\lambda + \| \mathbf{x}\|_2^2}\geqslant\|\mathbf{x}\|_2=1.$$ Hence, applying the triangle inequality, we have   \begin{equation*} \left\|\overline{\mathbf{x}}_m - \mathbf{x}\right\|_2\leqslant 2 \leqslant \frac{2B}{\lambda}. \end{equation*} In the remaining case where $$\omega <\frac{\pi }{2}$$ and $$\widehat{\mathbf{x}}_m \neq 0$$, as can be seen with the help of Fig. 1,   \begin{multline*} \left\|\frac{\widehat{\mathbf{x}}_m}{\|\,\widehat{\mathbf{x}}_m\|_2} - \mathbf{x}\right\|_2 = \frac{\textrm{dist}\left(\mathbf{x},\textrm{span}\left(\,\widehat{\mathbf{x}}_m\right)\right)}{\cos(\omega/2)} \leqslant \frac{\textrm{dist}\left(\mathbf{x},\textrm{span}\left(\,\widehat{\mathbf{x}}_m\right)\right)}{\cos(\pi/4)} \leqslant \frac{\left\|\,\widehat{\mathbf{x}}_m/\lambda - \mathbf{x}\right\|_2}{\cos(\pi/4)} \leqslant \frac{\sqrt{2}B}{\lambda}, \end{multline*}where $$\textrm{dist}\left (\mathbf{x},\textrm{span}\left (\,\widehat{\mathbf{x}}_m\right )\right )$$ denotes the distance of the vector x to the linear span of $$\widehat{\mathbf{x}}_m$$, the first inequality follows from $$\omega <\frac{\pi }{2}$$ and the second inequality follows from the fact that $$\frac{\widehat{\mathbf{x}}_m}{\lambda }$$ is in the linear span of $$\widehat{\mathbf{x}}_m$$. Combining the above two cases completes the proof.

Fig. 1. Illustration of the geometric relation between the estimator $$\widehat{\mathbf{x}}_m$$ and the true vector x.
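The geometric step in the second case of the proof of Corollary 1.10 can be checked on a small numerical instance. The sketch below is illustrative only: the vectors `x` and `x_hat` and the list of values for $$\lambda$$ are hypothetical, chosen so that x has unit norm and the angle between `x_hat` and `x` is below $$\pi/2$$. It verifies the identity $$\|\,\widehat{\mathbf{x}}_m/\|\,\widehat{\mathbf{x}}_m\|_2-\mathbf{x}\|_2=\textrm{dist}(\mathbf{x},\textrm{span}(\,\widehat{\mathbf{x}}_m))/\cos(\omega/2)$$ and the resulting bound by $$\sqrt{2}\,\|\,\widehat{\mathbf{x}}_m/\lambda-\mathbf{x}\|_2$$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def normalize(v):
    """Rescale v to unit norm, returning the zero vector when v = 0 (as in Corollary 1.10)."""
    n = norm(v)
    return [c / n for c in v] if n > 0 else [0.0] * len(v)

def dist_to_span(x, v):
    """Distance from x to the line spanned by a nonzero vector v."""
    u = normalize(v)
    p = dot(x, u)
    return norm([xi - p * ui for xi, ui in zip(x, u)])

x = [1.0, 0.0, 0.0]            # unit-norm "true" vector (hypothetical)
x_hat = [0.5, 0.3, -0.2]       # hypothetical estimate with <x_hat, x> > 0

x_bar = normalize(x_hat)
omega = math.acos(dot(x_bar, x))                    # angle between x_hat and x
lhs = norm([a - b for a, b in zip(x_bar, x)])       # ||x_bar - x||_2
rhs = dist_to_span(x, x_hat) / math.cos(omega / 2)  # identity used in the proof

print("identity gap:", abs(lhs - rhs))

# For any lambda > 0, x_hat / lambda lies in span(x_hat), which gives the sqrt(2) bound.
lambdas = [0.1, 0.5, 1.0, 3.0]
bounds = [math.sqrt(2) * norm([c / lam - xi for c, xi in zip(x_hat, x)]) for lam in lambdas]
print("bound holds for all lambdas:", all(lhs <= b + 1e-12 for b in bounds))
```

The identity reflects that two unit vectors at angle $$\omega$$ are at distance $$2\sin(\omega/2)$$, while the distance from x to the span of $$\widehat{\mathbf{x}}_m$$ is $$\sin\omega=2\sin(\omega/2)\cos(\omega/2)$$.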
Remark 1.11 We compare the result in Corollary 1.10 with Lemma 2.2 of [1], where a nearly identical bound is presented under the additional assumptions that $$\{y_i\}_{i=1}^m$$ take values in {−1, 1}, that $$\theta :\mathbb{R}\rightarrow [-1,1]$$, that K lies in the unit Euclidean ball $$B_2^d$$ and that $$g \sim \mathscr{N}(0,1)$$. Specifically, under the preceding assumptions it is shown that   $$\left\|\,\widehat{\mathbf{x}}_m-\mathbf{x}\right\|_2\leqslant\frac{4\alpha}{\lambda}+C\|a\|_{\psi_2}\frac{\omega(K)+u}{\lambda\sqrt{m}},$$with probability at least $$1-4\mathrm{e}^{-u^2}$$. Under the normality assumption, $$\left \langle \mathbf{g},\mathbf{x}\right \rangle \sim \mathscr{N}(0,1)$$ and $$\lambda$$ of (1.6) specializes to $$E[g\theta (g)]$$. Here, we are able to obtain a more general result that allows y to be subgaussian rather than restricting it to lie in {−1, 1}, which comes at the extra cost of a term that is of the same order as the previously existing ones in the bound, and which in particular vanishes as $$m\rightarrow \infty$$. Lastly, when y is allowed to be subgaussian, the variable $$y\left \langle \mathbf{a},\mathbf{t}\right \rangle$$ is subexponential for all $$\mathbf{t}\in \mathbb{R}^d$$, as opposed to being subgaussian as in [1]. This additional generality necessitates a generic chaining argument to obtain the subexponential concentration bound.

This paper is organized as follows. In Section 2 we introduce two measures of a distribution’s discrepancy from the normal that have their roots in Stein’s method, see [5,19]. The zero bias distribution is introduced first, being relevant for both Sections 2.1 and 2.2, which consider the cases where $$\theta$$ is a smooth function and the sign function, respectively. Section 2.1 further introduces a discrepancy measure based on Stein coefficients, and Theorem 2.1 provides bounds on $$\alpha$$ of (1.7) in terms of these two measures, when $$\theta$$ is Lipschitz and when it has a bounded second derivative. 
Section 2.1 also defines two specific error models on the Gaussian, an additive one in (2.15) and the other via mixtures in (2.16). Theorem 2.3 shows the behavior of the bound on $$\alpha$$ in these two models as a function of the amount $$\varepsilon \in [0,1]$$ by which the Gaussian is corrupted, with the bound tending to zero as $$\varepsilon$$ becomes small. Section 2.2 provides corresponding results when $$\theta$$ is the sign function, specifically in Theorems 2.4 and 2.5. Section 2.3 studies some relationships between the two discrepancy measures, and between them and the total variation distance. Theorem 1.7 is proved in Section 3. The postponed proofs of some results used earlier appear in the Appendix, in Sections A and B.

2. Discrepancy bounds via Stein’s method

Here we introduce two measures of the sensing distribution’s proximity to normality that can be used to bound $$\alpha$$ in (1.7). In Sections 2.1 and 2.2 we consider the cases where $$\theta$$ is a Lipschitz function and the sign function, respectively; the difference in the degree of smoothness in these two cases necessitates the use of different ways of measuring the discrepancy to normality. An observation that will be useful in both settings is that by definition (1.5), for any $$\mathbf{t}\in \mathbb{R}^d$$, we have   \begin{align} &E\left[\,f_{\mathbf{x}}(\mathbf{t})\right]=E\left[\,y\langle\mathbf{a},\mathbf{t}\rangle\right]=E\left[E\left[\,y\langle\mathbf{a},\mathbf{t}\rangle \,|\, \mathbf{a}\right]\right] =E\left[\langle\mathbf{a},\mathbf{t}\rangle \theta(\langle\mathbf{a},\mathbf{x}\rangle)\right]=\langle\mathbf{v}_{\mathbf{x}},\mathbf{t}\rangle \nonumber\\ &\qquad \qquad\mbox{where} \quad\mathbf{v}_{\mathbf{x}}=E\left[\mathbf{a}\theta(\langle\mathbf{a},\mathbf{x}\rangle)\right]. \end{align} (2.1)Specializing to the case where t = x, we may therefore express $$\lambda$$ in (1.6) as   $$\lambda=\langle\mathbf{v}_{\mathbf{x}},\mathbf{x}\rangle = E \left[\langle\mathbf{a},\mathbf{x}\rangle \theta(\langle\mathbf{a},\mathbf{x}\rangle)\right].$$ (2.2)

In the settings of both Sections 2.1 and 2.2, we require facts regarding the zero bias distribution and depend on [13] or [5] for properties stated below. With $$\mathscr{L}(\cdot )$$ denoting distribution or law, given a mean zero distribution $$\mathscr{L}(a)$$ with finite, non-zero variance $$\sigma ^2$$, there exists a unique law $$\mathscr{L}(a^{\ast })$$, termed the ‘a-zero bias’ distribution, characterized by   $$E\, [af(a)]=\sigma^2 E\,[f^{\prime}(a^{\ast})] \quad \mbox{for all Lipschitz functions } f.$$ (2.3)The existence of the variance of a, and hence also its second moment, guarantees that the expectation on the left, and hence also on the right, exists. Letting   \begin{equation*} {\textrm{Lip}}_1=\big\{g:\mathbb{R} \rightarrow \mathbb{R} \quad \mbox{satisfying} \quad \left|g(y)-g(x)\right| \leqslant |y-x| \quad \mbox{for all } x,y \in \mathbb{R}\big\}, \end{equation*}we recall that the Wasserstein or $$L^1$$ distance between the laws $$\mathscr{L}(X)$$ and $$\mathscr{L}(Y)$$ of two random variables X and Y can be defined as   \begin{equation*} d_1\left(\mathscr{L}(X),\mathscr{L}(Y)\right) = \sup_{f \in{\textrm{Lip}}_1} \big|Ef(X)-Ef(Y)\big|, \end{equation*}or alternatively as   $$d_1\left(\mathscr{L}(X),\mathscr{L}(Y)\right) = \inf_{(X,Y)}E|X-Y|,$$ (2.4)where the infimum is over all couplings (X, Y) of random variables having the given marginals. The infimum is achievable for real-valued random variables, see [16]. Now we define our first discrepancy measure by   $$\gamma_{\mathscr{L}(a)} = d_1(a,a^{\ast}).$$ (2.5)Stein’s characterization [19] of the normal yields that $$\mathscr{L}(a^{\ast })=\mathscr{L}(a)$$ if and only if a is a mean zero normal variable. 
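As a small worked illustration of (2.3) and (2.5), take a to be a Rademacher variable, equal to ±1 with probability 1/2 each. This example is not taken from the text above; it relies on the standard fact that the zero bias distribution of a Rademacher variable is uniform on [−1, 1], which follows from $$E[af(a)]=\frac12(f(1)-f(-1))=\frac12\int_{-1}^1 f^{\prime}(u)\,\mathrm{d}u$$ together with the uniqueness stated after (2.3). The sketch checks the identity numerically for two Lipschitz test functions and computes $$\gamma_a$$ using the representation of the $$L^1$$ distance on the real line as the integral of the absolute difference of the two distribution functions. All helper names are hypothetical.

```python
import math

def midpoint_integral(fn, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of fn over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(fn(lo + (k + 0.5) * h) for k in range(n))

def lhs(f):
    """E[a f(a)] for Rademacher a: 0.5 * 1 * f(1) + 0.5 * (-1) * f(-1)."""
    return 0.5 * (f(1.0) - f(-1.0))

def rhs(fprime):
    """E[f'(a*)] with a* uniform on [-1, 1] (density 1/2); sigma^2 = 1 here."""
    return midpoint_integral(lambda u: 0.5 * fprime(u), -1.0, 1.0)

# Two Lipschitz test functions together with their derivatives.
test_functions = [
    (math.sin, math.cos),
    (math.atan, lambda u: 1.0 / (1.0 + u * u)),
]
gaps = [abs(lhs(f) - rhs(fp)) for f, fp in test_functions]
print("zero bias identity gaps:", gaps)

def cdf_a(u):
    """Distribution function of the Rademacher variable a."""
    return 0.0 if u < -1.0 else (0.5 if u < 1.0 else 1.0)

def cdf_a_star(u):
    """Distribution function of the uniform law on [-1, 1]."""
    return min(1.0, max(0.0, 0.5 * (u + 1.0)))

# gamma_a = d_1(a, a*) as the L^1 distance between the two distribution functions.
gamma = midpoint_integral(lambda u: abs(cdf_a(u) - cdf_a_star(u)), -1.0, 1.0)
print("gamma_a =", gamma)
```

For this choice of a the computation gives $$\gamma_a=\frac12$$, which coincides with the value $$\frac12E|a|^3$$ appearing on the right-hand side of the bound (2.6).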
Further, with some abuse of notation, writing $$\gamma _a$$ for (2.5) for simplicity, Lemma 1.1 of [11] yields that if a has mean zero, variance 1 and finite third moment, then   $$\gamma_a \leqslant \frac{1}{2}E|a|^3,$$ (2.6)so in particular $$\gamma _a < \infty$$ whenever a has a finite third moment. In the case where $$Y_1,\ldots ,Y_n$$ are independent mean zero random variables with finite, non-zero variances $$\sigma _1^2,\ldots ,\sigma _n^2$$, having sum $$Y=\sum _{i=1}^n Y_i$$ with variance $$\sigma ^2$$, we may construct $$Y^{\ast }$$ with the Y-zero biased distribution by letting   $$Y^{\ast}=Y-Y_I+Y_I^{\ast} \quad \mbox{where} \quad P[I=i]=\frac{\sigma_i^2}{\sigma^2},$$ (2.7)where $$Y_i^{\ast }$$ has the $$Y_i$$-zero biased distribution and is independent of $$Y_j, j \not =i$$, and where the random index I is independent of $$\{Y_i,Y_i^{\ast }, i=1,\ldots ,n\}$$. We will also make use of the fact that for any c ≠ 0,   $$\mathscr{L}((ca)^{\ast})=\mathscr{L}(ca^{\ast}).$$ (2.8)

2.1 Lipschitz functions

When $$\theta$$ is a Lipschitz function, inequality (2.12) of Theorem 2.1 below gives a bound on $$\alpha$$ in (1.7) in terms of Stein coefficients. We say T is a Stein coefficient, or Stein kernel, for a random variable X with finite, non-zero variance when   $$E\left[X\,f(X)\right]=E\left[T\,f^{\prime}(X)\right]$$ (2.9)for all Lipschitz functions f. Specializing (2.9) to the cases where f(x) = 1 and f(x) = x, we find   $$E[X]=0 \quad \mbox{and} \quad \mathrm{Var}(X)= E[T].$$ (2.10) By Stein’s characterization [19], the distribution of X is normal with mean zero and variance $$\sigma ^2$$ if and only if $$T=\sigma ^2$$. Correspondingly, for unit variance random variables we will define our second discrepancy measure as E|1 − T|. If c is a non-zero constant and $$T_X$$ is a Stein coefficient for X, then $$c^2T_X$$ is a Stein coefficient for Y = cX. 
Indeed, with h(x) = f(cx), so that h′(x) = cf′(cx), we obtain   $$E\left[Yf(Y)\right]=cE\left[Xf(cX)\right]=cE\left[Xh(X)\right]=cE\left[T_X h^{\prime}(X)\right]=cE\left[cT_Xf^{\prime}(cX)\right] =E\left[c^2T_Xf^{\prime}(Y)\right].$$ (2.11)Stein coefficients first appeared in the work of [3], and were further developed in [4] for random variables that are functions of Gaussians; we revisit this latter point in Section 2.3.

The following result considers two separate sets of hypotheses on the unknown function $$\theta$$ and the sensing distribution a. The assumptions leading to the bound (2.12) require fewer conditions on $$\theta$$ and more on a as compared to those leading to (2.13). That is, though Stein coefficients may fail to exist for certain mean zero, variance one distributions, discrete ones in particular, the zero bias distribution exists for all of them. We note that by Stein’s characterization, when a is standard normal we may take T = 1 in (2.12) and $$\gamma _a=0$$ in (2.13), and hence $$\alpha =0$$ in both the cases considered in the theorem that follows. The bound (2.13) also returns zero discrepancy in the special case where $$\theta$$ is linear and thus recovers the results on linear compressed sensing [17] when combined with Theorem 1.7. For a real-valued function f with domain D let   \begin{equation*} \|\,f\|=\sup_{x \in D}\left|\,f(x)\right|. \end{equation*}

Theorem 2.1 Let a be a mean zero, variance one random variable and set $$\boldsymbol{a}=(a_1,\ldots ,a_d)$$ with $$a_1,\ldots ,a_d$$ independent random variables distributed as a, and let $$\alpha$$ be as in (1.7). 
(a) If $$\theta \in \textrm{Lip}_1$$ and a has Stein coefficient T, then   $$\alpha \leqslant E|1-T|.$$ (2.12) (b) If $$\theta$$ possesses a bounded second derivative, then   $$\alpha \leqslant \|\theta^{\prime\prime}\| \gamma_a.$$ (2.13) Remark 2.2 In [1] the quantity $$\alpha$$ is bounded in terms of the total variation distance $$d_{\mathrm TV}(a,g)$$ between a and the standard Gaussian distribution g. In particular, for $$\theta \in C^2$$, Proposition 5.5 of [1] yields   $$\alpha \leqslant 8(Ea^6+Eg^6)^{1/2}\left(\|\theta^{\prime}\|+\|\theta^{\prime\prime}\|\right)\sqrt{d_{\mathrm TV}(a,g)}.$$ (2.14) In contrast, the upper bound (2.12) does not depend on any moments of a, requires $$\theta$$ to be only once differentiable, and in typical cases where $$d_{\mathrm TV}(a,g)$$ and E|1 − T| are of the same order, that is, when the upper bound in Lemma 2.10 is of the correct order, $$\alpha$$ in (2.12) is bounded by a first power rather than the larger square root in (2.14). When $$\theta$$ possesses a bounded second derivative, the upper bound (2.13) improves on (2.14) in terms of constant factors, requirements on the existence of moments and dependence on a first power rather than a square root. In this case Lemma 2.11 shows $$d_{\mathrm TV}(a,g)$$ and $$\gamma _a$$ are of the same order when a has bounded support. Measuring discrepancy from normality in terms of E|1 − T| and $$\gamma _a$$ also has the advantage of being tractable when each component of the Gaussian sensing vector g has been independently corrupted at the level of some $$\varepsilon \in [0,1]$$ by a non-Gaussian, mean zero, variance one distribution a. In the two models we consider we let the sensing vector have i.i.d. entries, and hence only specify the distribution of its components. 
The first model is the case of additive error, where each component of the sensing vector is of the form   $$g_\varepsilon = \sqrt{1-\varepsilon}g+\sqrt{\varepsilon} a$$ (2.15)with a independent of g, with the second one being the mixture model where each component has been corrupted due to some ‘bad event’ A that substitutes g with a so that   $$g_\varepsilon=g\boldsymbol{1}_{A^c} + a\boldsymbol{1}_A,$$ (2.16)where A occurs with probability $$\varepsilon$$, independently of g, a and a given Stein coefficient T for a. Since   $$E\left[Tf^{\prime}(a)\right]=E \left[E[T|a]\ f^{\prime}(a)\right]\!,$$ (2.17)we see that E[T|a] is a Stein coefficient for a. Hence, upon replacing T by E[T|a] only the independence of A from {g, a} is required. Theorem 2.3 shows that under both scenarios (a) and (b) considered in Theorem 2.1, and further, under both the additive and mixture models, the value $$\alpha$$ can be bounded explicitly in terms of a quantity that vanishes in $$\varepsilon$$. Further, we note that both error models agree with each other, and with the model of Theorem 2.1, when $$\varepsilon =1$$, so that Theorem 2.3 recovers Theorem 2.1 when so specializing. We now present Theorem 2.3 followed by its proof, then the proof of Theorem 2.1. Theorem 2.3 Under condition (a) of Theorem 2.1, under both the additive (2.15) and mixture (2.16) error models, we have   \begin{equation*} \alpha \leqslant \varepsilon E|1-T|. 
\end{equation*}As regards the measure $$\gamma _a$$ in (b) of Theorem 2.1, under the additive error model (2.15),   $$\gamma_{g_\varepsilon} \leqslant \varepsilon^{3/2} \gamma_a, \quad \mbox{and when } \theta \mbox{ has a bounded second derivative,} \quad \alpha \leqslant \varepsilon^{3/2}\|\theta^{\prime\prime}\| \gamma_a,$$ (2.18)and under the mixture error model (2.16),   $$\gamma_{g_\varepsilon} \leqslant \varepsilon \gamma_a, \quad \mbox{and when } \theta \mbox{ has a bounded second derivative,} \quad \alpha \leqslant \varepsilon \|\theta^{\prime\prime}\| \gamma_a.$$ (2.19) Proof. By the assumptions of independence and on the mean and variance of a and g, in both error models $$g_\varepsilon$$ has mean zero and variance 1. As the components of the sensing vector are i.i.d. by construction, the hypotheses on a in Theorem 2.1 hold. First consider scenario (a) under the additive error model. If a random variable W is the sum of two independent mean zero variables X and Y with finite variances, and Stein coefficients $$T_X$$ and $$T_Y$$, respectively, then for any Lipschitz function f one has   \begin{align*} E\left[Wf(W)\right]&=E\left[(X+Y)f(X+Y)\right]= E\left[Xf(X+Y)\right]+E\left[Yf(X+Y)\right]\\ &= E\left[T_Xf^{\prime}(X+Y)\right]+E\left[T_Yf^{\prime}(X+Y)\right] =E\left[(T_X+T_Y)f^{\prime}(X+Y)\right]\\ &= E\left[T_Wf^{\prime}(W)\right] \quad \mbox{where {$T_W=T_X+T_Y$},} \quad \end{align*}showing that Stein coefficients are additive for independent summands. In particular, now also using (2.11), we see that the Stein coefficient $$T_\varepsilon$$ for $$g_\varepsilon$$ in (2.15) is given by $$T_\varepsilon = 1-\varepsilon + \varepsilon T$$, where T is the given Stein coefficient for a. As $$1-T_\varepsilon =\varepsilon (1-T)$$, the first claim of the theorem follows by applying Theorem 2.1. 
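The identity $$T_\varepsilon = 1-\varepsilon + \varepsilon T$$ for the additive model can be checked numerically. The following Python sketch is only an illustration under stated assumptions: it takes a to have the variance one double exponential density, whose Stein coefficient $$h(y)=\frac{1}{2}(1+\sqrt{2}|y|)$$ is the one computed in Section 2.3, chooses the test function f = sin, and approximates both sides of (2.9) for $$g_\varepsilon$$ by a midpoint rule on a truncated grid. The grid sizes, truncation limits and function names are choices made here, not part of the original argument.

```python
import math

SQRT2 = math.sqrt(2.0)


def normal_pdf(z):
    """Density of the standard normal g."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def laplace_pdf(y):
    """Variance-one double exponential density (assumed distribution of a)."""
    return math.exp(-SQRT2 * abs(y)) / SQRT2


def laplace_stein_coefficient(y):
    """Stein coefficient h(y) = (1 + sqrt(2)|y|)/2 of the density above."""
    return 0.5 * (1.0 + SQRT2 * abs(y))


def midpoint_grid(lo, hi, n):
    """Midpoints and cell width of n equal cells on [lo, hi]."""
    step = (hi - lo) / n
    return [lo + (k + 0.5) * step for k in range(n)], step


def stein_identity_sides(eps, f=math.sin, f_prime=math.cos):
    """Approximate E[g_eps f(g_eps)] and E[T_eps f'(g_eps)] for model (2.15)."""
    zs, dz = midpoint_grid(-8.0, 8.0, 320)
    ys, dy = midpoint_grid(-16.0, 16.0, 800)  # 0 is a cell boundary (kink of the density)
    # Per-y quantities: scaled value, T_eps = 1 - eps + eps*h(y), quadrature weight.
    y_terms = [
        (math.sqrt(eps) * y,
         1.0 - eps + eps * laplace_stein_coefficient(y),
         laplace_pdf(y) * dy)
        for y in ys
    ]
    lhs = rhs = 0.0
    for z in zs:
        wz = normal_pdf(z) * dz
        sz = math.sqrt(1.0 - eps) * z
        for sy, t_eps, wy in y_terms:
            x = sz + sy  # value of g_eps on this grid cell
            lhs += wz * wy * x * f(x)
            rhs += wz * wy * t_eps * f_prime(x)
    return lhs, rhs


lhs, rhs = stein_identity_sides(0.3)
print(f"E[g_eps f(g_eps)] = {lhs:.6f},  E[T_eps f'(g_eps)] = {rhs:.6f}")
```

The two printed values should agree up to quadrature error; at $$\varepsilon = 0$$ both sides reduce to the Gaussian identity $$E[g\sin g]=E[\cos g]$$.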
For the mixture model, by the independence between A and {a, g, T},   \begin{align*} E\left[g_\varepsilon f(g_\varepsilon)\right] &= (1-\varepsilon)E\left[gf(g)\right]+\varepsilon E\left[af(a)\right] = (1-\varepsilon)E\left[f^{\prime}(g)\right]+\varepsilon E\left[Tf^{\prime}(a)\right] \\ &=E\left[\boldsymbol{1}_{A^c}f^{\prime}(g)+T\boldsymbol{1}_Af^{\prime}(a)\right]= E\left[\boldsymbol{1}_{A^c}f^{\prime}(g_\varepsilon)+T\boldsymbol{1}_Af^{\prime}(g_\varepsilon)\right]\\ &= E\left[T_\varepsilon f^{\prime}(g_\varepsilon)\right] \quad \mbox{where} \quad T_\varepsilon =\boldsymbol{1}_{A^c} + T\boldsymbol{1}_A. \end{align*}Hence, the bound just shown for the additive model is seen to hold also for the mixture model by applying Theorem 2.1 and observing that $$1-T_\varepsilon =\boldsymbol{1}_A(1-T)$$, and recalling the independence between T and A. Now consider scenario (b) under the additive error model. Identity (2.7) says one may construct the zero bias distribution of a sum of independent terms by choosing a summand with probability proportional to its variance and replacing it by a variable that is independent of the remaining summands and has the chosen summand’s zero bias distribution, where the replacement is done independently of all else. As the two summands in (2.15) have variance $$1-\varepsilon$$ and $$\varepsilon$$, we choose them for replacement with these probabilities, respectively. 
Hence, letting B be the event that a is chosen, we see   \begin{equation*} g_\varepsilon^{\ast} = \left(\sqrt{1-\varepsilon}g^{\ast}+\sqrt{\varepsilon} a\right)\boldsymbol{1}_{B^c} + \left(\sqrt{1-\varepsilon}g+\sqrt{\varepsilon} a^{\ast}\right)\boldsymbol{1}_B = \sqrt{1-\varepsilon} g + \sqrt{\varepsilon} \left(a\boldsymbol{1}_{B^c} + a^{\ast}\boldsymbol{1}_B\right) \end{equation*}has the $$g_\varepsilon$$-zero bias distribution, where for the first equality we have applied (2.8), yielding $$(\sqrt{1-\varepsilon } g)^{\ast }=_d\sqrt{1-\varepsilon } g^{\ast }$$ and likewise $$(\sqrt{\varepsilon } a)^{\ast }=_d\sqrt{\varepsilon } a^{\ast }$$, and used that the standard normal is a fixed point of the zero bias transformation for the second. In addition, we construct $$a^{\ast }$$ to have the a-zero bias distribution, be independent of g and B and achieve the infimum $$d_1(\mathscr{L}(a),\mathscr{L}(a^{\ast }))$$ in (2.4), that is, giving the coupling that minimizes $$E|a^{\ast }-a|$$. We now obtain   \begin{equation*} g_\varepsilon^{\ast}-g_\varepsilon = \sqrt{1-\varepsilon} g + \sqrt{\varepsilon} \left( a\boldsymbol{1}_{B^c} + a^{\ast}\boldsymbol{1}_B\right) -\left(\sqrt{1-\varepsilon} g + \sqrt{\varepsilon} a\right)=\sqrt{\varepsilon}(a^{\ast}-a)\boldsymbol{1}_B. \end{equation*}As the Wasserstein distance is the infimum (2.4) over all couplings between $$g_\varepsilon$$ and $$g_\varepsilon ^{\ast }$$, using that B is independent of a and $$a^{\ast }$$, and that $$P(B)=\varepsilon$$, we have   \begin{equation*} \gamma_{g_\varepsilon} = d_1\left(g_\varepsilon,g_\varepsilon^{\ast}\right) \leqslant E\left|g_\varepsilon^{\ast}-g_\varepsilon\right| = \sqrt\varepsilon E\left|a^{\ast}-a\right|P(B) = \varepsilon^{3/2}\gamma_a. \end{equation*}The proof of (2.18), the first claim under (b), can now be completed by applying (2.13). Continuing under scenario (b), again consider the mixture model (2.16). 
By Theorem 2.1 of [11], as Var(a) = Var(g), the variable   \begin{equation*} g_\varepsilon^{\ast}= g^{\ast}\boldsymbol{1}_{A^c} + a^{\ast}\boldsymbol{1}_A= g\boldsymbol{1}_{A^c} + a^{\ast}\boldsymbol{1}_A \end{equation*}has the $$g_\varepsilon$$ zero bias distribution, where we again take $$g^{\ast }$$ and $$a^{\ast }$$ as in the previous construction. Hence, arguing as for the additive error model, we obtain the bound   \begin{equation*} \gamma_{g_\varepsilon} \leqslant E\left|g_\varepsilon^{\ast}-g_\varepsilon\right| =E{\left[|a^{\ast}-a|\boldsymbol{1}_A\right]} = \varepsilon \gamma_a. \end{equation*}The second claim under (b) now follows as the first. Proof of Theorem 2.1. Recalling that x is a unit vector, for any $$\boldsymbol{t} \in B_2^d$$ the vectors x and v = t −⟨x, t⟩x are perpendicular. If v≠0 set $$\boldsymbol{x}^\perp$$ to be the unit vector in direction v, and let $$\boldsymbol{x}^\perp$$ be zero otherwise. These vectors produce an orthogonal decomposition of any $$\boldsymbol{t} \in B_2^d$$ as   $$\boldsymbol{t}= \left\langle\boldsymbol{x},\boldsymbol{t}\right\rangle\boldsymbol{x} + \left\langle\boldsymbol{x}^\perp,\boldsymbol{t}\right\rangle\boldsymbol{x}^\perp.$$ (2.20)Defining   \begin{equation*} Y=\langle\boldsymbol{a},\boldsymbol{x} \rangle \quad \mbox{and} \quad Y^\perp=\left\langle\boldsymbol{a},\boldsymbol{x}^\perp\right\rangle, \end{equation*}using the decomposition (2.20) in (2.1) and the expression for $$\lambda$$ in (2.2) yields   \begin{align*} E\left[\,f_{\boldsymbol{x}}(\boldsymbol{t})\right] =&\, E\left[\left\langle\boldsymbol{a},\boldsymbol{t}\right\rangle\theta(\left\langle\boldsymbol{a},\boldsymbol{x}\right\rangle)\right] = \left\langle\boldsymbol{x},\boldsymbol{t}\right\rangle E\left[\left\langle\boldsymbol{a},\boldsymbol{x}\right\rangle \theta\left(\left\langle\boldsymbol{a},\boldsymbol{x}\right\rangle\right)\right]+\left\langle\boldsymbol{x}^\perp,\boldsymbol{t} \right\rangle 
E\left[\left\langle\boldsymbol{a},\boldsymbol{x}^\perp \right\rangle \theta(\left\langle\boldsymbol{a},\boldsymbol{x}\right\rangle)\right] \\ =&\, \lambda \left\langle\boldsymbol{x},\boldsymbol{t}\right\rangle +\left\langle\boldsymbol{x}^\perp,\boldsymbol{t} \right\rangle E\left[Y^\perp \theta(Y)\right]. \end{align*}As $$\|\mathbf{x}^\perp \|_2$$ and $$\|\boldsymbol{t}\|_2$$ are at most one, applying the Cauchy–Schwarz inequality we obtain the following bound, to be used in both cases:   $$\big|E\left[\,f_{\boldsymbol{x}}(\boldsymbol{t})\right]-\lambda \left\langle\boldsymbol{x},\boldsymbol{t} \right\rangle\!\big| \leqslant \big| E\left[Y^\perp \theta(Y)\right] \big|.$$ (2.21) We determine a Stein coefficient for $$Y^\perp$$ as follows. For $$T_i$$ Stein coefficients for $$a_i$$, independent and identically distributed as the given T for all i = 1, …, d, by conditioning on $$Y-x_ia_i$$, a function of $$\{a_j, j \not = i\}$$ and therefore independent of $$a_i$$, using the scaling property (2.11), we have   \begin{align} E\left[x_i^\perp a_i\theta(Y)\right]&=E\left[ x_i^\perp a_i \theta\left(x_i a_i + (Y-x_ia_i)\right)\right]=E\left[ x_i^\perp x_i T_i \theta^{\prime}\left(x_i a_i + (Y-x_ia_i)\right)\right]\nonumber\\ &=E\left[ x_i^\perp x_i T_i \theta^{\prime}(Y)\right]. \end{align} (2.22)Hence,   $$E\left[Y^\perp \theta(Y)\right]=\sum_{i=1}^d E\left[x_i^\perp a_i \theta(Y)\right]=E\left[T_{Y^\perp} \theta^{\prime}(Y)\right]\ \mbox{where} \ \ T_{Y^\perp}= \sum_{i=1}^d x_i^\perp x_iT_i= \sum_{i=1}^d x_i^\perp x_i(T_i-1),$$ (2.23)where the last equality follows from $$\left \langle \mathbf{x},\mathbf{x}^\perp \right \rangle =0$$. 
Now from (2.21) and (2.23), we have   \begin{align*} \big|E\left[\,f_{\boldsymbol{x}}(\boldsymbol{t})\right]-\lambda \left\langle\boldsymbol{x},\boldsymbol{t} \right\rangle\!\big| &\leqslant \left|E\left[T_{Y^\perp} \theta^{\prime}(Y)\right]\right|\\ &\leqslant E|T_{Y^\perp}| \leqslant \sum_{i=1}^d \left|x_i^\perp x_i \right| E|T-1| \leqslant\|\mathbf{x}^\perp\|_2\|\mathbf{x}\|_2 E|T-1|\leqslant E|T-1|, \end{align*}using $$\theta \in \textrm{Lip}_1$$ in the second inequality, followed by (2.23) again and the Cauchy–Schwarz inequality, noting that $$\|\mathbf{x}^\perp \|_2 \le 1$$ and $$\|\mathbf{x}\|_2=1$$. Hence, we obtain   \begin{equation*} \big|E\left[\,f_{\boldsymbol{x}}(\boldsymbol{t})\right]-\lambda \left\langle\boldsymbol{x},\boldsymbol{t} \right\rangle\!\big| \leqslant E|T-1| \quad \mbox{for all $\boldsymbol{t} \in B_2^d$,} \quad \end{equation*}which completes the proof of (2.12) in light of the definition (1.7) of $$\alpha$$. In a similar fashion, if $$\theta$$ is twice differentiable with bounded second derivative, then in place of (2.22) for every i = 1, …, d we may write   \begin{equation*} E\left[x_i^\perp a_i\theta(Y)\right]=E\left[ x_i^\perp a_i \theta(x_i a_i + (Y-x_ia_i))\right]=E\left[ x_i^\perp x_i \theta^{\prime}\left(x_i a_i^{\ast} + (Y-x_ia_i)\right)\right], \end{equation*}where $$a_i,a_i^{\ast }$$ are constructed on the same space to be an optimal coupling, in the sense of achieving the infimum of $$E|a^{\ast }-a|$$. 
Hence,   \begin{align} E\left[Y^\perp \theta(Y)\right]=\sum_{i=1}^d E\left[x_i^\perp a_i \theta(Y)\right]&=\sum_{i=1}^d E\left[ x_i^\perp x_i \theta^{\prime}\left(x_i a_i^{\ast} + (Y-x_ia_i)\right)\right] \nonumber\\ &= \sum_{i=1}^d E\left[ x_i^\perp x_i \left(\theta^{\prime}\left(x_i a_i^{\ast} + (Y-x_ia_i)\right) - \theta^{\prime}(Y)\right) \right]\nonumber \\&= \sum_{i=1}^d E\left[ x_i^\perp x_i \left(\theta^{\prime}\left(x_i a_i^{\ast} + (Y-x_ia_i)\right) - \theta^{\prime}\left(x_i a_i + (Y-x_ia_i)\right)\right) \right],\end{align} (2.24)where in the third equality we have used $$\langle \boldsymbol{x}^\perp ,\boldsymbol{x} \rangle =0$$, as in (2.23). The proof of (2.13) is completed by applying (2.21) and (2.24) to obtain   \begin{align*} \big|E\left[\,f_{\boldsymbol{x}}(\boldsymbol{t})\right]-\lambda \left\langle\boldsymbol{x},\boldsymbol{t} \right\rangle\!\big| &\leqslant \left| E\left[Y^\perp \theta(Y)\right] \right| \leqslant \|\theta^{\prime\prime}\| \sum_{i=1}^d E \left| x_i^\perp x_i^2 \left(a_i^{\ast}-a_i\right) \right| \leqslant \|\theta^{\prime\prime}\| \gamma_a \sum_{i=1}^d \left| x_i^\perp x_i^2 \right| \\ &\leqslant \|\theta^{\prime\prime}\| \gamma_a \sum_{i=1}^d \left| x_i^\perp x_i \right| \le \|\theta^{\prime\prime}\| \gamma_a, \end{align*}where we have applied the mean value theorem for the second inequality, the fact that the infimum in (2.4) is achieved for the third, that $$\|\boldsymbol{x}\|_2=1$$ for the fourth and the Cauchy–Schwarz inequality for the last. 2.2 Sign function In this section we consider the case where $$\theta$$ is the sign function given by   \begin{equation*} \theta(x)=\begin{cases}-1 & x <0\\ \hfill1 & x \geqslant 0. \end{cases} \end{equation*}The motivation comes from the one-bit compressed sensing model; see [1] for a more detailed discussion. The following result shows how $$\alpha$$ of (1.7) can be bounded in terms of the discrepancy measure $$\gamma _a$$ introduced in Section 2.1. 
Throughout this section set   \begin{equation*} c_1=\sqrt{2/\pi}-1/2. \end{equation*}We continue to assume that the unknown vector x has unit Euclidean length. In the following, we say a random variable a is symmetric if the distributions of a and −a are equal. Theorem 2.4 Let $$\theta$$ be the sign function, a have a symmetric distribution and $$\gamma _a$$ as defined in (2.5). If $$\|\boldsymbol{x}\|_3^3 \leqslant c_1/\gamma _a$$ and $$\|\boldsymbol{x}\|_\infty \leqslant 1/2$$, then $$\alpha$$ defined in (1.7) satisfies   $$\alpha \leqslant \left(10 \gamma_a E|a|^3 \|\boldsymbol{x}\|_\infty \right)^{1/2}.$$ (2.25) Under the condition that $$\|\boldsymbol{x}\|_\infty \leqslant c/E|a|^3$$ for some c > 0, Proposition 4.1 of [1] yields the existence of a constant C such that   $$\alpha \leqslant CE|a|^3 \|\boldsymbol{x}\|_\infty^{1/2}.$$ (2.26)Theorem 2.4 improves (2.26) by introducing the factor of $$\gamma _a$$ in the bound, thus providing a right-hand side that takes the value 0 when a is normal. Applying the inequality $$\gamma _a \leqslant E|a|^3/2$$ in (2.6) to (2.25) in the case where a has finite third moment recovers (2.26) with C assigned the specific value of $$\sqrt{5}$$. In terms of the total variation distance between a and the Gaussian g, Proposition 5.2 in [1] provides the bound   \begin{equation*} \alpha \leqslant C (Ea^4)^{1/8} d_{\mathrm TV}(a,g)^{1/8} \end{equation*}depending on an unspecified constant and an eighth root. For distributions where $$\gamma _a$$ is comparable to the total variation distance (see Section 2.3), the bound of Theorem 2.4 is preferable in its dependence on the distance between a and g, and it is also explicit. Now we derive bounds on $$\alpha$$ defined in (1.7) for the two error models introduced in Section 2.1. As in Theorem 2.3, the bounds vanish as $$\varepsilon$$ tends to zero. We note that Theorem 2.4 is recovered as the special case $$\varepsilon =1$$ for both models considered. 
For comparison, in view of the relation between (2.25) of Theorem 2.4 and (2.26), for these error models the bounds one obtains from the latter are the same as the ones below, but with the factor $$\gamma _a$$ replaced by $$C=\sqrt{5}$$ by virtue of (2.6), and with the cubic term, which gives a bound on the third absolute moment of the $$\varepsilon$$-contaminated distribution, appearing outside the square root. Theorem 2.5 In the additive and mixture error models (2.15) and (2.16), the bound of Theorem 2.4 becomes, respectively,   \begin{equation*} \alpha \leqslant \left(10\varepsilon^{3/2}\gamma_a\left(\sqrt{1-\varepsilon}\left(\sqrt{\frac8\pi}\right)^{1/3}+\sqrt{\varepsilon}E{\left[|a|^3\right]}^{1/3}\right)^3\|\mathbf{x}\|_{\infty}\right)^{1/2} \end{equation*}and   \begin{equation*} \alpha \leqslant \left(10\varepsilon\gamma_a\left(\left((1-\varepsilon) \sqrt{\frac8\pi}\right)^{1/3}+E{\left[\varepsilon|a|^3\right]}^{1/3}\right)^3\|\mathbf{x}\|_{\infty}\right)^{1/2}. \end{equation*} We first demonstrate the proof of Theorem 2.4, starting with a series of lemmas. Lemma 2.6 For any mean zero, variance 1 random variable a and any $$\mathbf{x} \in B_2^d$$,   $$\left| \left\langle\mathbf{v}_{\mathbf{x}},\mathbf{x}\right\rangle - \sqrt{\frac2\pi} \right|\leqslant \gamma_a \|\mathbf{x}\|_3^3,$$ (2.27)where $$\mathbf{v}_{\mathbf{x}}=E[\mathbf{a}\theta (\langle \mathbf{a},\mathbf{x} \rangle )]$$ as in (2.1). The inequality in Lemma 2.6 should be compared to Lemma 5.3 of [1], where the bound on the quantity in (2.27) is in terms of the fourth root of the total variation distance between a and g and their fourth moments. Proof. It is direct to verify that $$E|g|=\sqrt{2/\pi }$$ for $$g \sim \mathscr{N}(0,1)$$. 
In Lemma B.1 in Appendix B, we show that when taking f to be the unique bounded solution to the Stein equation   $$f^{\prime}(x)-xf(x)=|x|-\sqrt{\frac{2}{\pi}},$$ (2.28) we have $$\|\,f^{\prime\prime}\|_\infty =1$$, where $$\|\cdot \|_\infty$$ is the essential supremum. Hence, for a mean zero, variance one random variable Y, using that sets of measure zero do not affect the integral below, we have   \begin{align*} |E|Y|-E|g||&=\big|E\left[\,f^{\prime}(Y)-Yf(Y)\right]\!\big|=\big|E\left[\,f^{\prime}(Y)-f^{\prime}(Y^{\ast})\right]\!\big| =\left| E{\left[\int_Y^{Y^{\ast}} f^{\prime\prime}(u)\,\mathrm{d}u\right]}\right| \\ &\leqslant \|\,f^{\prime\prime}\|_{\infty} E|Y^{\ast}-Y|=E|Y^{\ast}-Y|, \end{align*}where $$Y^{\ast }$$ is any random variable on the same space as Y, having the Y-zero biased distribution. As $$\theta$$ is the sign function   \begin{equation*} \left\langle\mathbf{v}_{\mathbf{x}},\mathbf{x}\right\rangle = E\left[\langle\mathbf{a},\mathbf{x} \rangle \theta( \langle\mathbf{a},\mathbf{x} \rangle ) \right]=E\big|\!\langle\mathbf{a},\mathbf{x} \rangle\!\big| \quad \mbox{and hence} \quad \left|\left\langle\mathbf{v}_{\mathbf{x}},\mathbf{x}\right\rangle - \sqrt{\frac2\pi}\right|=\big|E|\langle\mathbf{a},\mathbf{x} \rangle|- E|g|\big|. \end{equation*} For the case at hand, let $$Y=\langle \boldsymbol{a},\boldsymbol{x} \rangle = \sum _{i=1}^d x_i a_i$$, where $$a_1,\ldots ,a_d$$ are independent and identically distributed as a, having mean zero and variance 1, and recall $$\|\boldsymbol{x}\|_2=1$$. Then with $$P[I=i]=x_i^2$$, taking $$(a_i,a_i^{\ast })$$ to achieve the infimum in (2.4), that is, so that $$E|a_i^{\ast }-a_i|=d_1(a_i,a_i^{\ast })$$, by (2.7), we obtain   \begin{equation*} E|Y^{\ast}-Y|=E\big|x_I\left(a_I^{\ast}-a_I\right)\!\big| = \sum_{i=1}^d |x_i|^3 \gamma_{a_i} = \gamma_a \|\boldsymbol{x}\|_3^3, \end{equation*}as desired. We now provide a version of Lemma 4.4 of [1] in terms of $$\gamma _a$$ and specific constants. 
Lemma 2.7 The vector $$\mathbf{v}_{\mathbf{x}}$$ in (2.1) satisfies $$\|\mathbf{v}_{\mathbf{x}}\|_2 \leqslant 1$$, and if $$\|\mathbf{x}\|_3^3 \leqslant c_1/\gamma _a$$ where $$c_1=\sqrt{2/\pi }-1/2$$, then   $$\frac{1}{2} \leqslant \|\mathbf{v}_{\mathbf{x}}\|_2.$$ Proof. The upper bound follows as in the proof of Lemma 4.4 in [1]. Slightly modifying the lower bound argument there through the use of Lemma 2.6 for the second inequality below, we obtain   \begin{equation*} \|\mathbf{v}_{\mathbf{x}}\|_2 = \|\mathbf{v}_{\mathbf{x}}\|_2 \|\mathbf{x}\|_2 \geqslant \big|\!\langle\mathbf{v}_{\mathbf{x}},\mathbf{x} \rangle \!\big| \geqslant \sqrt{\frac2\pi} - \gamma_a \|\mathbf{x}\|_3^3 \geqslant \sqrt{\frac2\pi} -c_1= 1/2. \end{equation*} Next we provide a version of Lemma 4.5 of [1] with the explicit constant 2, following the proof there, and impose a symmetry assumption on a that was used implicitly. Lemma 2.8 If $$\|\mathbf{x}\|_\infty \leqslant 1/2$$ and a has a symmetric distribution, then the vector $$\mathbf{v}_{\mathbf{x}}$$ in (2.1) satisfies   \begin{equation*} \|\mathbf{v}_{\mathbf{x}}\|_\infty \leqslant 2 E|a|^3 \|\mathbf{x}\|_\infty. \end{equation*} Proof. By the symmetry of a we assume without loss of generality that $$x_j \geqslant 0$$ for all j = 1, …, d when considering the inner product S = ⟨a, x⟩. For a given coordinate index i let $$S^{(i)}=\langle \boldsymbol{a},\boldsymbol{x} \rangle - a_i x_i$$. 
Using symmetry again in the second equality below and setting $$\tau _i^2 = \sum _{k \not = i}x_k^2$$, for fixed r ⩾ 0, we obtain   \begin{align*} \left|E\theta\left(S^{(i)}+rx_i\right)\right| =&\, \left|P\left[S^{(i)} \geqslant -rx_i\right]-P\left[S^{(i)} < -rx_i\right]\right| \\ =&\, \left|P\left[S^{(i)} \geqslant -rx_i\right]-P\left[S^{(i)}> rx_i\right]\right| = P\left[|S^{(i)}| \leqslant rx_i\right] = P\left[|S^{(i)}|/\tau_i \leqslant rx_i/\tau_i\right] \\ \leqslant&\, P\left[|g| \leqslant rx_i/\tau_i\right] + \left|P\left[|S^{(i)}|/\tau_i \leqslant rx_i/\tau_i\right]-P\left[|g| \leqslant rx_i/\tau_i\right]\right|. \end{align*}The hypothesis $$\|\boldsymbol{x}\|_\infty \leqslant 1/2$$ implies $$\tau _i^2 \geqslant 3/4$$. Hence, using the supremum bound on the standard normal density for the first term and that $$\sqrt{8/3\pi } \leqslant 1$$, the Berry–Esseen bound of [18] with constant 0.56 on the second term, noting $$0.56 (4/3)^{3/2} \leqslant 1$$ and that $$\|\boldsymbol{x}\|_3^3 \leqslant \|\boldsymbol{x}\|_\infty$$ since $$\|\boldsymbol{x}\|_2=1$$, we obtain   \begin{equation*} \left|E\left[r\theta\left(S^{(i)}+rx_i\right)\right]\right| \leqslant r^2 x_i + |r| \|\boldsymbol{x}\|_\infty E|a|^3 . \end{equation*} Considering now the $$i$$th coordinate of $$\mathbf{v}_{\mathbf{x}}\!=\!E[\mathbf{a} \theta (\langle \mathbf{a},\mathbf{x} \rangle )]$$, using $$E|a| \!\leqslant\! (Ea^2)^{1/2}\!=\!1 \!\leqslant\! (E|a|^3)^{1/3} \!\leqslant E|a|^3$$, we have   \begin{equation*} \big|E\left[ a_i \theta(\langle\boldsymbol{a},\boldsymbol{x} \rangle)\right]\!\big| = \left|E\left[a_i \theta\left(S^{(i)}+a_i x_i\right)\right]\right| \leqslant x_i + \|\boldsymbol{x}\|_\infty E|a|^3 \leqslant 2E|a|^3 \|\boldsymbol{x}\|_\infty. \end{equation*}A similar computation yields this same result when r < 0. Proof of Theorem 2.4. We follow the proof of Proposition 4.1 of [1]. 
By Lemma 2.7 we see $$\mathbf{v}_{\mathbf{x}} \not =0$$, and defining $$\mathbf{z}=\mathbf{v}_{\mathbf{x}}/\|\mathbf{v}_{\mathbf{x}}\|_2$$, from Lemmas 2.7 and 2.8  \begin{equation*} \|\boldsymbol{z}\|_\infty= \frac{\|\mathbf{v}_{\mathbf{x}}\|_\infty}{\|\mathbf{v}_{\mathbf{x}}\|_2} \leqslant 2\|\mathbf{v}_{\mathbf{x}}\|_\infty \leqslant 4 E|a|^3 \|\mathbf{x}\|_\infty. \end{equation*} Hence, first using the triangle inequality together with the fact that $$|\theta (\cdot )| = 1$$, with the subsequent equality holding because $$\theta$$ is the sign function, and the second inequality following from Lemma 2.6, we obtain   \begin{align} \|\mathbf{v}_{\mathbf{x}}\|_2 = \langle\mathbf{v}_{\mathbf{x}},\mathbf{z}\rangle = E\left[\theta(\langle\mathbf{a},\mathbf{x}\rangle)\langle\mathbf{a},\mathbf{z} \rangle \right] \leqslant&\ E\big[|\langle\mathbf{a},\mathbf{z} \rangle |\big] = E\big[\theta(\langle\mathbf{a},\mathbf{z}\rangle)\langle\mathbf{a},\mathbf{z} \rangle \big] \leqslant \sqrt{\frac2\pi} +\gamma_a \|\mathbf{z}\|_\infty \nonumber\\ \leqslant&\, \sqrt{\frac2\pi} + 4\gamma_a E|a|^3 \|\mathbf{x}\|_\infty. \end{align} (2.29)Next, using (2.1), we bound $$|E[\,f_{\mathbf{x}}(\mathbf{t})]-\lambda \left \langle \mathbf{x},\mathbf{t}\right \rangle \!|=|\!\left \langle \mathbf{v}_{\mathbf{x}},\mathbf{t}\right \rangle - \lambda \left \langle \mathbf{x},\mathbf{t} \right \rangle\! |$$. By the Cauchy–Schwarz inequality, now taking $$\mathbf{t} \in B_2^d$$,   \begin{equation*} \big|\!\left\langle\mathbf{v}_{\mathbf{x}},\mathbf{t}\right\rangle - \lambda \left\langle\mathbf{x},\mathbf{t} \right\rangle\! \big|{}^2 = \big|\!\left\langle\mathbf{v}_{\mathbf{x}}-\lambda\mathbf{x},\mathbf{t} \right\rangle \!\big|{}^2\leqslant \|\mathbf{v}_{\mathbf{x}}-\lambda\mathbf{x}\|^2. 
\end{equation*}Furthermore, by (2.2), we have $$\left \langle \boldsymbol{v}_{\mathbf{x}},\mathbf{x}\right \rangle =\lambda$$, thus   \begin{align*} \|\mathbf{v}_{\mathbf{x}}-\lambda\mathbf{x}\|^2=&\, \|\mathbf{v}_{\mathbf{x}}\|_2^2 -\lambda^2 + 2\lambda (\lambda- \langle\mathbf{v}_{\mathbf{x}},\mathbf{x}\rangle )= \big(\|\mathbf{v}_{\mathbf{x}}\|_2 -\lambda\big)\big(\|\mathbf{v}_{\mathbf{x}}\|_2 +\lambda\big) \leqslant2\big(\|\boldsymbol{v}_{\mathbf{x}}\|_2 -\lambda\big)\\ =&\,2\left(\|\boldsymbol{v}_{\mathbf{x}}\|_2 -\sqrt{\frac2\pi}+\sqrt{\frac2\pi}-\lambda\right) \leqslant 10 \gamma_a E|a|^3 \|\mathbf{x}\|_\infty, \end{align*}where we have applied Lemma 2.7 in the first inequality and the last inequality follows from (2.29), Lemma 2.6 and that $$E|a|^3 \geqslant 1$$. Now taking a square root finishes the proof. Proof of Theorem 2.5. Under the additive error model (2.15), by Minkowski’s inequality   \begin{align*} E{\left[|g_\varepsilon|^3\right]}^{1/3}=&\,E{\left[\big|\sqrt{1-\varepsilon}g+\sqrt{\varepsilon} a\big|^3\right]}^{1/3} \leqslant\sqrt{1-\varepsilon}E{\left[|g|^3\right]}^{1/3}+\sqrt{\varepsilon}E{\left[|a|^3\right]}^{1/3}\\ =&\,\sqrt{1-\varepsilon}\left(\sqrt{\frac8\pi}\,\right)^{1/3}+\sqrt{\varepsilon}E{\left[|a|^3\right]}^{1/3}. \end{align*}Using this inequality and (2.18) in Theorem 2.4 gives the discrepancy bound in the additive error case. For the mixture model (2.16), again by Minkowski’s inequality, and using that A is independent of g and a with $$P(A)=\varepsilon$$,   \begin{align*} E{\left[|g_\varepsilon|^3\right]}^{1/3}=&\,E{\left[|g\boldsymbol{1}_{A^c} + a\boldsymbol{1}_A|^3\right]}^{1/3} \leqslant E{\left[(1-\varepsilon)|g|^3\right]}^{1/3}+E{\left[\varepsilon |a|^3\right]}^{1/3}\\ =&\,\left((1-\varepsilon)\sqrt{\frac8\pi}\,\right)^{1/3}+E{\left[\varepsilon |a|^3\right]}^{1/3}. \end{align*}Using this inequality and (2.19) in Theorem 2.4 gives the discrepancy bound in the mixture error case. 
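To see how the bounds of Theorems 2.4 and 2.5 behave as the contamination level $$\varepsilon$$ varies, the right-hand sides can be evaluated directly. The Python sketch below is a plain transcription of those formulas; the function names and the input values (gamma_a = 0.5, third_moment = 1.0, x_inf = 0.05) are illustrative assumptions, and the code does not check the hypotheses of Theorem 2.4 (symmetry of a, $$\|\boldsymbol{x}\|_3^3 \leqslant c_1/\gamma _a$$ and $$\|\boldsymbol{x}\|_\infty \leqslant 1/2$$).

```python
import math

# E|g|^3 = sqrt(8/pi) for a standard normal g, as used in the proof of Theorem 2.5.
GAUSS_ABS_THIRD_MOMENT = math.sqrt(8.0 / math.pi)


def bound_theorem_2_4(gamma_a, third_moment, x_inf):
    """Right-hand side of (2.25): (10 * gamma_a * E|a|^3 * ||x||_inf)^(1/2)."""
    return math.sqrt(10.0 * gamma_a * third_moment * x_inf)


def bound_additive(eps, gamma_a, third_moment, x_inf):
    """Theorem 2.5 bound for the additive error model (2.15)."""
    cube_root_sum = (math.sqrt(1.0 - eps) * GAUSS_ABS_THIRD_MOMENT ** (1.0 / 3.0)
                     + math.sqrt(eps) * third_moment ** (1.0 / 3.0))
    return math.sqrt(10.0 * eps ** 1.5 * gamma_a * cube_root_sum ** 3 * x_inf)


def bound_mixture(eps, gamma_a, third_moment, x_inf):
    """Theorem 2.5 bound for the mixture error model (2.16)."""
    cube_root_sum = (((1.0 - eps) * GAUSS_ABS_THIRD_MOMENT) ** (1.0 / 3.0)
                     + (eps * third_moment) ** (1.0 / 3.0))
    return math.sqrt(10.0 * eps * gamma_a * cube_root_sum ** 3 * x_inf)


# Illustrative (hypothetical) inputs, not values taken from the text.
gamma_a, third_moment, x_inf = 0.5, 1.0, 0.05
for eps in (0.0, 0.01, 0.1, 0.5, 1.0):
    print(eps,
          round(bound_additive(eps, gamma_a, third_moment, x_inf), 4),
          round(bound_mixture(eps, gamma_a, third_moment, x_inf), 4))
```

At $$\varepsilon = 1$$ both functions return the value of `bound_theorem_2_4`, and at $$\varepsilon = 0$$ both return zero, matching the remarks preceding Theorem 2.5.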
2.3 Relations between measures of discrepancy We have considered two methods for handling non-Gaussian sensing, the first using Stein coefficients and the second by the zero bias distribution. In this section we discuss some relations between these two and also their connections to the total variation distance $$d_{{\mathrm TV}}(\cdot ,\cdot )$$ appearing in the bound of [1] and discussed in Remark 2.2. The following result appears in Section 7 of [10]. Lemma 2.9 If a is a mean zero, variance 1 random variable and $$a^{\ast }$$ has the a-zero biased distribution, then   $$d_{\mathrm TV}(a,g) \leqslant 2 d_{\mathrm TV}(a,a^{\ast}).$$ (2.30) The following related result is from [4]. Lemma 2.10 If the mean zero, variance 1 random variable a has Stein coefficient T, then   \begin{equation*} d_{\mathrm TV}(a,g) \leqslant 2 E|1-T|, \end{equation*}where $$g \sim \mathscr{N}(0,1)$$. Since E[Tf′(a)] = E[E[T|a] f′(a)], if T is a Stein coefficient for a then so is h(a) = E[T|a]. Introducing this Stein coefficient in the identity that characterizes the zero bias distribution $$a^{\ast }$$, we obtain   \begin{equation*} E\left[\,f^{\prime}(a^{\ast})\right]=E\left[af(a)\right]=E\left[h(a)f^{\prime}(a)\right]\!. \end{equation*}Hence, when such a T exists h(a) is the Radon Nikodym derivative of the distribution of $$a^{\ast }$$ with respect to that of a, and in particular $$\mathscr{L}(a^{\ast })$$ is absolutely continuous with respect to $$\mathscr{L}(a)$$. 
When a is a mean zero, variance one random variable with density p(a), whose support is a possibly infinite interval, then using the form of the density $$p^{\ast }(a)$$ of $$a^{\ast }$$ as given in [13], we have   $$p^{\ast}(y)=E\left[a\boldsymbol{1}(a>y)\right] \quad \mbox{and} \quad h(y)=\frac{p^{\ast}(y)}{p(y)}\boldsymbol{1}\left(\,p(y)>0\right)=\frac{E\left[a\boldsymbol{1}(a>y)\right]}{p(y)}\boldsymbol{1}\left(\,p(y)>0\right)\!,$$ (2.31)and hence,   \begin{equation*} E\left|1-h(a)\right| = \int_{y:p(y)>0} \left\vert\vphantom{\frac{1}{1}}\right. 1-\frac{p^{\ast}(y)}{p(y)}\left\vert\vphantom{\frac{1}{1}}\right. p(y)\,\mathrm{d}y = \int_{\mathbb{R}}\left|\,p(y)-p^{\ast}(y)\right|\,\mathrm{d}y=d_{\mathrm TV}(a,a^{\ast}), \end{equation*}and the upper bounds in Lemmas 2.10 and 2.9 are equal. Overall then, in the case where the Stein coefficient of a random variable is given as a function of the random variable itself, the discrepancy measure considered in Theorem 2.3 under part (a) of Theorem 2.1 is simply the total variation distance between a and $$a^{\ast }$$, while that under part (b), and in Section 2.2 when $$\theta (\cdot )$$ is specialized to be the sign function, is the Wasserstein distance. Due to a result of [4], Stein coefficients can be constructed in some generality when a = F(g) for some differentiable function $$F:\mathbb{R}^n \rightarrow \mathbb{R}$$ of a standard normal vector g in $$\mathbb{R}^n$$. In this case   \begin{equation*} T=\int_0^\infty \mathrm{e}^{-t} \Big\langle \nabla F(\boldsymbol{g}),\widehat{E}\left(\nabla F(\boldsymbol{g}_t)\right)\Big\rangle\, \mathrm{d}t \end{equation*}is a Stein coefficient for a where $$\boldsymbol{g}_t=\mathrm{e}^{-t}\boldsymbol{g}+\sqrt{1-\mathrm{e}^{-2t}}\ \widehat{\boldsymbol{g}}$$, with $$\widehat{\boldsymbol{g}}$$ an independent copy of g and $$\widehat{E}$$ integrating over $$\widehat{\boldsymbol{g}}$$, that is, taking conditional expectation with respect to g. 
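As a small illustration of the preceding formula for T, the following Python sketch evaluates it numerically in dimension n = 1. The choice $$F(g)=(g^2-1)/\sqrt{2}$$ is an assumption made here for illustration and does not come from the text; for this F a direct computation gives $$T=g^2$$. The sketch uses the substitution $$s=\mathrm{e}^{-t}$$, which turns the integral over t into $$\int _0^1 F^{\prime}(g)\widehat{E}[F^{\prime}(sg+\sqrt{1-s^2}\,\widehat{g})]\,\mathrm{d}s$$, and approximates all integrals by midpoint rules on truncated grids. It then compares the two sides of (2.9) for the test function f = sin.

```python
import math

SQRT2 = math.sqrt(2.0)


def normal_pdf(z):
    """Density of the standard normal."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def midpoint_grid(lo, hi, n):
    """Midpoints and cell width of n equal cells on [lo, hi]."""
    step = (hi - lo) / n
    return [lo + (k + 0.5) * step for k in range(n)], step


def F(g):
    """Hypothetical example a = F(g); mean zero and variance one."""
    return (g * g - 1.0) / SQRT2


def grad_F(g):
    """Derivative of F."""
    return SQRT2 * g


def stein_coefficient(g, grad, n_s=16, n_hat=64):
    """T(g) = int_0^1 grad(g) * E_hat[grad(s*g + sqrt(1-s^2)*g_hat)] ds, with s = exp(-t)."""
    hats, dh = midpoint_grid(-8.0, 8.0, n_hat)
    weights = [normal_pdf(h) * dh for h in hats]
    ss, ds = midpoint_grid(0.0, 1.0, n_s)
    total = 0.0
    for s in ss:
        c = math.sqrt(1.0 - s * s)
        # Inner expectation over the independent copy g_hat.
        inner = sum(w * grad(s * g + c * h) for h, w in zip(hats, weights))
        total += grad(g) * inner * ds
    return total


def stein_identity_sides(n_g=400):
    """Approximate E[a f(a)] and E[T f'(a)] for f = sin and a = F(g)."""
    gs, dg = midpoint_grid(-8.0, 8.0, n_g)
    lhs = rhs = 0.0
    for g in gs:
        w = normal_pdf(g) * dg
        a = F(g)
        lhs += w * a * math.sin(a)
        rhs += w * stein_coefficient(g, grad_F) * math.cos(a)
    return lhs, rhs


print(stein_coefficient(1.3, grad_F))  # should be close to 1.3**2
print(stein_identity_sides())          # the two entries should nearly coincide
```

For this example the discrepancy E|1 − T| equals $$E|1-g^2|$$, which does not vanish, as expected since a is not Gaussian.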
To provide a concrete example of a Stein coefficient, a simple computation using the final equality of (2.31) shows that if a has the double exponential distribution with variance 1, that is, with density   \begin{equation*} p(y)=\frac{1}{\sqrt{2}}\mathrm{e}^{-\sqrt{2}|y|} \quad \mbox{then} \quad h(y)=\frac{1}{2}\left(1+\sqrt{2}|y|\right). \end{equation*}In this case   \begin{equation*} E\left|1-h(a)\right|=E\left|1-\sqrt{2} a\right|\boldsymbol{1}(a>0)= \frac{1}{e}. \end{equation*} The following result provides a bound complementary to (2.30) of Lemma 2.9, which when taken together shows that $$d_{\mathrm TV}(a,a^{\ast })$$ and $$d_{\mathrm TV}(a,g)$$ are of the same order in general for distributions of bounded support. Lemma 2.11 If a is a mean zero, variance one random variable with density p(y) supported in [−b, b], then   \begin{equation*} d_{\mathrm TV}(a,a^{\ast}) \leqslant (1+b^2)d_{\mathrm TV}(a,g). \end{equation*} Proof. With $$p^{\ast }(y)$$ the density of $$a^{\ast }$$ given by (2.31), we have   \begin{equation*} d_{\mathrm TV}(a,a^{\ast})=\int_{[-b,b]}\left|\,p(y)-p^{\ast}(y)\right|\,\mathrm{d}y = \int_{[-b,b]}\left(\,p(y)-p^{\ast}(y)\right)\phi(y)\ \mathrm{d}y = E\phi(a)-E\phi(a^{\ast}), \end{equation*}where   \begin{equation*} \phi(y)= \begin{cases} \hfill1 & p(y) \geqslant p^{\ast}(y)\\ -1 & p(y) < p^{\ast}(y). \end{cases} \end{equation*}Setting   \begin{equation*} f(y) = \int_0^y \phi(u) \,\mathrm{d}u \quad \mbox{and} \quad q(y)=\phi(y)-y \int_0^y \phi(u) \,\mathrm{d}u, \end{equation*}we have $$f^{\prime}(y)=\phi (y)$$, and using (2.3) to yield E[q(g)] = 0, we obtain   \begin{equation*} d_{\mathrm TV}(a,a^{\ast})=E\left[\,f^{\prime}(a)-f^{\prime}(a^{\ast})\right] =E\left[\,f^{\prime}(a)-af(a)\right] = Eq(a)-Eq(g). 
\end{equation*}For y ∈ [−b, b] we have $$|q(y)| \leqslant |\phi (y)| + |y| \left|\int _0^y \phi (u)\,\mathrm{d}u\right| \leqslant 1+b^2$$, hence   \begin{equation*} d_{\mathrm TV}(a,a^{\ast}) \leqslant (1+b^2) d_{\mathrm TV}(a,g), \end{equation*}as claimed. 3. Proof of Theorem 1.7 So far, we have shown that the penalty $$\alpha$$ for non-normality in (1.7) of Theorem 1.7 can be bounded explicitly using discrepancy measures that arise in Stein’s method. In this section, we prove Theorem 1.7 via a generic chaining argument, which is the crux of the concentration inequality we apply. Recall that by (1.2), (1.4) and (1.5),   $$\widehat{\mathbf{x}}_m=\mathop{\mbox{argmin}}_{\mathbf{t}\in K}~\left(\|\mathbf{t}\|_2^2-2\,f_{\mathbf{x}}(\mathbf{t})\right).$$ In order to demonstrate that $$\widehat{\mathbf{x}}_m$$ is a good estimate of $$\lambda \mathbf{x}$$, we need to control the mean of $$f_{\mathbf{x}}(\cdot )$$ in (1.5) and the deviation of $$f_{\mathbf{x}}(\cdot )$$ from its mean. As shown in the previous section, the mean of $$f_{\mathbf{x}}(\cdot )$$ can be effectively characterized through the introduced discrepancy measures. The deviation is controlled by the following lemma. Lemma 3.1 (Concentration) Let $$\mathscr{T} := D(K,\lambda \mathbf{x})\cap \mathbb{S}^{d-1}$$. Under the assumptions of Theorem 1.7, for all u ⩾ 2 and $$m\geqslant \omega (\mathscr{T})^2$$,   \begin{equation*} P\left[\sup_{\mathbf{t}\in \mathscr{T}}\big|\,f_{\mathbf{x}}(\mathbf{t})-E{\left[\,f_{\mathbf{x}}(\mathbf{t})\right]}\big| \geqslant C_0\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\omega(\mathscr{T})+u}{\sqrt{m}}\right] \leqslant 4\mathrm{e}^{-u}, \end{equation*}where $$C_0>0$$ is a fixed constant (see Footnote 1). The proof of this lemma, provided in the next subsection, is based on the improved chaining technique introduced in [7]. We now show that once Lemma 3.1 is proved, Theorem 1.7 follows without much overhead. 
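Before the derivation, the estimator recalled above can be made concrete. Since $$f_{\mathbf{x}}(\mathbf{t})=\frac{1}{m}\sum_{i=1}^m y_i\langle \mathbf{a}_i,\mathbf{t}\rangle =\langle \mathbf{v},\mathbf{t}\rangle$$ with $$\mathbf{v}=\frac{1}{m}\sum_{i=1}^m y_i\mathbf{a}_i$$, completing the square gives $$\|\mathbf{t}\|_2^2-2f_{\mathbf{x}}(\mathbf{t})=\|\mathbf{t}-\mathbf{v}\|_2^2-\|\mathbf{v}\|_2^2$$, so $$\widehat{\mathbf{x}}_m$$ is a Euclidean projection of $$\mathbf{v}$$ onto K. The Python sketch below illustrates this under assumptions that are not taken from the text: K is the set of s-sparse vectors (so the projection is hard thresholding), the sensing entries are uniform on $$[-\sqrt{3},\sqrt{3}]$$, and the observations are noiseless one-bit measurements $$y=\mathrm{sign}(\langle \mathbf{a},\mathbf{x}\rangle )$$. All function names are hypothetical.

```python
import math
import random


def sample_pair(x, rng):
    """Draw (y, a): a has i.i.d. uniform[-sqrt(3), sqrt(3)] entries, y = sign(<a, x>)."""
    r = math.sqrt(3.0)
    a = [rng.uniform(-r, r) for _ in x]
    inner = sum(ai * xi for ai, xi in zip(a, x))
    return (1.0 if inner >= 0 else -1.0), a


def empirical_mean(pairs, d):
    """v = (1/m) sum_i y_i a_i, so that f_x(t) = <v, t>."""
    m = len(pairs)
    v = [0.0] * d
    for y, a in pairs:
        for j in range(d):
            v[j] += y * a[j] / m
    return v


def project_sparse(v, s):
    """Euclidean projection onto K = {t : at most s non-zero entries} (assumed K)."""
    keep = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:s]
    t = [0.0] * len(v)
    for j in keep:
        t[j] = v[j]
    return t


def cosine(u, w):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(ui * wi for ui, wi in zip(u, w))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_u * norm_w)


def demo(d=40, s=3, m=2000, seed=0):
    """Return the cosine between the estimate and the unit-norm unknown x."""
    rng = random.Random(seed)
    x = [1.0 / math.sqrt(s)] * s + [0.0] * (d - s)  # unit-norm, s-sparse unknown
    pairs = [sample_pair(x, rng) for _ in range(m)]
    x_hat = project_sparse(empirical_mean(pairs, d), s)
    return cosine(x_hat, x)


if __name__ == "__main__":
    print("cosine between x_hat and x:", demo())
```

The cosine is reported rather than the Euclidean error because the estimator targets $$\lambda \mathbf{x}$$, not $$\mathbf{x}$$ itself; for other closed sets K the function `project_sparse` would be replaced by the corresponding projection.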
Using Lemma 1.9 for the first inequality, we have   \begin{align*} \|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2^2\leqslant&\, L(\,\widehat{\mathbf{x}}_m)-L(\lambda\mathbf{x}) +2\alpha\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2\\ =&\,L(\,\widehat{\mathbf{x}}_m)-L_m(\,\widehat{\mathbf{x}}_m)+L_m(\,\widehat{\mathbf{x}}_m) -L_m(\lambda\mathbf{x})+L_m(\lambda\mathbf{x})-L(\lambda\mathbf{x}) + 2\alpha\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2\\ =& -2\big( E_m\left[y\left\langle\mathbf{a},\widehat{\mathbf{x}}_m\right\rangle\right]-f_{\mathbf{x}}(\,\widehat{\mathbf{x}}_m)\big) +L_m(\,\widehat{\mathbf{x}}_m)-L_m(\lambda\mathbf{x}) +2\big( E_m\left[\,y\left\langle\mathbf{a},\lambda\mathbf{x}\right\rangle\right] -f_{\mathbf{x}}(\lambda\mathbf{x})\big)\\ &+2\alpha\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2\\ \leqslant&\,2\big|\,f_{\mathbf{x}}(\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x})-E_m\left[\,y\left\langle\mathbf{a},\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\right\rangle\right] \!\big| +L_m(\,\widehat{\mathbf{x}}_m)-L_m(\lambda\mathbf{x}) +2\alpha\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2, \end{align*}where $$E_m[\cdot ]$$ is the conditional expectation given $$\{(\mathbf{a}_i,y_i)\}_{i=1}^m$$. Since $$\widehat{\mathbf{x}}_m$$ solves (1.4) and $$\lambda \mathbf{x}\in K$$, it follows that $$L_m(\widehat{\mathbf{x}}_m)-L_m(\lambda \mathbf{x})\leqslant 0$$. 
Thus,   $$\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2^2 \leqslant 2\big|\,f_{\mathbf{x}}(\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x})-E_m[\,y\left\langle\mathbf{a},\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\right\rangle]\big| +2\alpha\|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2.$$Since $$\widehat{\mathbf{x}}_m-\lambda \mathbf{x}\in D(K,\lambda \mathbf{x})$$, we may divide both sides by $$\|\,\widehat{\mathbf{x}}_m-\lambda \mathbf{x}\|_2$$ (the conclusion holds trivially should this norm be zero). Using the fact that $$E{\left [\,y\left \langle \mathbf{a},\boldsymbol{t}\right \rangle \right ]}=E{\left [\,f_{\mathbf{x}}(\boldsymbol{t})\right ]}$$ for any fixed $$\boldsymbol{t}\in \mathbb{R}^d$$, this gives   \begin{equation*} \|\,\widehat{\mathbf{x}}_m-\lambda\mathbf{x}\|_2 \leqslant2\sup_{\mathbf t\in \mathscr{T}}\big|\,f_{\mathbf{x}}(\boldsymbol{t})-E\left[\,f_{\mathbf{x}}(\boldsymbol{t})\right]\!\big|+2\alpha. \end{equation*}Now applying Lemma 3.1 finishes the proof of Theorem 1.7. 3.1 Preliminaries In addition to chaining, we need the following notions and propositions; we recall the $$\psi _q$$ norms from Definition 1.4. Definition 3.2 (Subgaussian random vector) A random vector $$\mathbf{X}\in \mathbb{R}^d$$ is subgaussian if the random variables $$\langle \mathbf{X},\mathbf{z}\rangle ,\mathbf{z}\in \mathbb{S}^{d-1}$$ are subgaussian with uniformly bounded subgaussian norm. The corresponding subgaussian norm of the vector X is then given by   $$\|\mathbf{X}\|_{\psi_2}=\sup_{\mathbf{z}\in\mathbb{S}^{d-1}}\big\|\langle\mathbf{X},\mathbf{z}\rangle\big\|_{\psi_2}.$$ The proofs of the following two propositions are given in the Appendix. 
Proposition 3.3 If both X and Y are subgaussian random variables, then XY is a subexponential random variable, satisfying   $$\|XY\|_{\psi_1}\leqslant 2\|X\|_{\psi_2}\|Y\|_{\psi_2}.$$ Proposition 3.4 If a is a subgaussian random vector with covariance matrix $$\mathbf{\Sigma }$$, then   $$\sigma_{\max}(\mathbf{\Sigma})\leqslant 2\|\mathbf{a}\|_{\psi_2}^2,$$where $$\sigma _{\max }(\cdot )$$ denotes the maximal singular value of a matrix. In addition, we need the following fact that a vector of d independent subgaussian random variables is subgaussian. Proposition 3.5 (Lemma 5.24 of [21]) Consider a random vector $$\mathbf{a}\in \mathbb{R}^d$$, where each entry $$a_i$$ is an i.i.d. copy of a centered subgaussian random variable a. Then, a is a subgaussian random vector with norm $$\|\mathbf{a}\|_{\psi _2}\leqslant C\|a\|_{\psi _2}$$ where C is an absolute positive constant. 3.2 Proving Lemma 3.1 via generic chaining Throughout this section, C denotes an absolute constant whose value may change at each occurrence. The following notions are necessary ingredients in the generic chaining argument. Let $$(\mathscr{T},d)$$ be a metric space. If $$\mathscr{A}_{l}\subseteq \mathscr{A}_{l+1} \subseteq \mathscr{T}$$ for every l ⩾ 0 we say $$\{\mathscr{A}_l\}_{l=0}^{\infty }$$ is an increasing sequence of subsets of $$\mathscr{T}$$. Let $$N_0=1$$ and $$N_l=2^{2^l},~\forall\, l\geqslant 1$$. Definition 3.6 (Admissible sequence) An increasing sequence of subsets $$\{\mathscr{A}_l\}_{l=0}^{\infty }$$ of $$\mathscr{T}$$ is admissible if $$|\mathscr{A}_l|\leqslant N_l$$ for all l ⩾ 0. Essentially following the framework of Section 2.2 of [20], for each subset $$\mathscr{A}_l$$, we define $$\pi _l\!:\!\mathscr{T}\!\!\rightarrow\! \mathscr{A}_l$$ as the closest point map $$\pi _l(\boldsymbol{t})=\textrm{arg}\min _{\mathbf s\in \mathscr{A}_l}d(\mathbf s,\mathbf t),~\forall\, \mathbf t\in\! \mathscr{T}$$. Since each $$\mathscr{A}_l$$ is a finite set, the minimum is always achievable. 
If the argmin is not unique a representative is chosen arbitrarily. The Talagrand $$\gamma _2$$-functional is defined as   $$\gamma_2(\mathscr{T},d):=\inf\sup_{\mathbf t\in \mathscr{T}}\sum_{l=0}^{\infty}2^{l/2}d\left(\mathbf t,\pi_l(\mathbf t)\right)\!,$$ (3.1)where the infimum is taken with respect to all admissible sequences. Though there is no guarantee that $$\gamma _2(\mathscr{T},d)$$ is finite, the following majorizing measure theorem tells us that its value is comparable to the supremum of a certain Gaussian process. Lemma 3.7 (Theorem 2.4.1 of [20]) Consider a family of centered Gaussian random variables $$\{G(\mathbf t)\}_{\mathbf t\in \mathscr{T}}$$ indexed by $$\mathscr{T}$$, with the canonical distance   $$d(\mathbf s,\mathbf t)=E\left[\left(G(\mathbf s)-G(\mathbf t)\right)^2\right]^{1/2},\quad \forall\, \mathbf s,\mathbf t\in \mathscr{T}.$$ Then for a universal constant L that does not depend on the covariance of the Gaussian family, we have   $$\frac1L\gamma_2(\mathscr{T},d)\leqslant E\left[\sup_{\mathbf t\in \mathscr{T}}G(\mathbf t)\right]\leqslant L\gamma_2(\mathscr{T},d).$$ For $$\mathscr{T} \subseteq \mathbb{R}^d$$ and $$d(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_2$$ we write $$\gamma _2(\mathscr{T})$$ to denote $$\gamma _2(\mathscr{T},\|\cdot \|_2)$$ defined in (3.1). Defining the Gaussian process $$G(\boldsymbol{t})=\left \langle \mathbf{g},\mathbf t\right \rangle ,~\mathbf t\in \mathscr{T}$$, with $$\mathbf{g}\sim \mathscr{N}(0,\mathbf{I}_{d\times d})$$, we have   $$E\left[\left(G(t)-G(s)\right)^2\right]^{1/2}=\|t-s\|_2,\quad\forall\, t,s\in \mathscr{T}.$$When $$\mathscr{T}$$ is bounded we may conclude that $$\omega (\mathscr{T})<\infty$$ directly from Definition 1.1, and Lemma 3.7 then implies that Gaussian mean width $$\omega (\mathscr{T})$$ and $$\gamma _2(\mathscr{T})$$ are of the same order, i.e. 
there exists a universal constant L ⩾ 1 independent of $$\mathscr{T}$$ such that   $$\frac1L\gamma_2(\mathscr{T})\leqslant\omega(\mathscr{T})\leqslant L \gamma_2(\mathscr{T}).$$ (3.2) Define   $$\overline{Z}(\boldsymbol{t})=f_{\mathbf{x}}(\boldsymbol{t})-E{\left[\,f_{\mathbf{x}}(\boldsymbol{t})\right]},$$where $$f_{\mathbf{x}}(\boldsymbol{t})$$ is as defined in (1.5), and   $$Z(\boldsymbol{t})= \frac{1}{m}\sum_{i=1}^m\varepsilon_iy_i\langle\mathbf{a}_i,\boldsymbol{t}\rangle,$$where $$\varepsilon _i, i=1,\ldots ,m$$ are Rademacher variables taking values uniformly in {1, −1}, independent of each other and of $$\{y_i,\boldsymbol{a}_i,i=1,2,\ldots ,m\}$$. The majority of the proof of Lemma 3.1 is devoted to showing that   $$P\left[\sup_{\boldsymbol{t}\in \mathscr{T}}\left|Z(\boldsymbol{t})\right| \geqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right) \frac{\omega(\mathscr{T})+u}{\sqrt{m}}\right] \leqslant \mathrm{e}^{-u} \quad \mbox{for {$u \geqslant 2, m\geqslant \omega(\mathscr{T})^2$},} \quad$$ (3.3)where C > 0 is a constant. 
Once (3.3) is justified, by the fact u ⩾ 2, we have   \begin{equation*} P\left[\sup_{\boldsymbol{t}\in \mathscr{T}}\left|Z(\boldsymbol{t})\right| \geqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right) \frac{\omega(\mathscr{T})+1}{\sqrt{m}}u\right] \leqslant \mathrm{e}^{-u} \quad \mbox{for {$u \geqslant 2, m\geqslant \omega(\mathscr{T})^2$}.} \quad \end{equation*}By Lemma A.5, with p = 1 and k = 1, we have   $$E\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\right] \leqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\omega(\mathscr{T})+1}{\sqrt{m}}.$$ Thus, invoking the first bound in the symmetrization lemma, Lemma A.3,   \begin{equation*} E\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\overline{Z}(\boldsymbol{t})\right|\right]\leqslant2E\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\right] \leqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right) \frac{\omega(\mathscr{T})+1}{\sqrt{m}}. \end{equation*}We may then finish the proof of Lemma 3.1 using the fact that u ⩾ 2, the second bound in the symmetrization lemma with $$\beta =(2C(\|a\|_{\psi _2}^2+\|y\|_{\psi _2}^2) \omega (\mathscr{T})+u)/\sqrt{m}$$ and (3.3), which together imply   \begin{multline*} P\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\overline{Z}(\boldsymbol{t})\right| \geqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right) \frac{\omega(\mathscr{T})+u}{\sqrt{m}}\right]\\ \leqslant 4P\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\geqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right) \frac{\omega(\mathscr{T})+u}{\sqrt{m}}\right]\leqslant 4\mathrm{e}^{-u}. \end{multline*} The rest of the section is devoted to the proof of (3.3). 
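The first bound of the symmetrization lemma used above, $$E[\sup_{\boldsymbol{t}}|\overline{Z}(\boldsymbol{t})|]\leqslant 2E[\sup_{\boldsymbol{t}}|Z(\boldsymbol{t})|]$$, can be illustrated by a small Monte Carlo experiment. The Python sketch below is only a sanity check under assumptions that are not part of the text: the index set is a finite collection of random unit vectors rather than $$\mathscr{T}$$, sensing is standard Gaussian, and $$y=\mathrm{sign}(\langle \mathbf{a},\mathbf{x}\rangle )$$, a case in which $$E[y\langle \mathbf{a},\boldsymbol{t}\rangle ]=\sqrt{2/\pi }\,\langle \mathbf{x},\boldsymbol{t}\rangle$$ for unit-norm $$\mathbf{x}$$ is available in closed form. All names are hypothetical.

```python
import math
import random


def unit_vector(d, rng):
    """A random direction on the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / norm for gi in g]


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def one_trial(x, index_set, m, rng):
    """Return (sup_t |Zbar(t)|, sup_t |Z(t)|) over a finite index set."""
    d = len(x)
    v = [0.0] * d  # (1/m) sum_i y_i a_i, so f_x(t) = <v, t>
    w = [0.0] * d  # (1/m) sum_i eps_i y_i a_i, so Z(t) = <w, t>
    for _ in range(m):
        a = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = 1.0 if dot(a, x) >= 0 else -1.0
        eps = rng.choice((-1.0, 1.0))  # Rademacher sign
        for j in range(d):
            v[j] += y * a[j] / m
            w[j] += eps * y * a[j] / m
    lam = math.sqrt(2.0 / math.pi)  # E[y <a, t>] = lam * <x, t> for Gaussian a
    sup_zbar = max(abs(dot(v, t) - lam * dot(x, t)) for t in index_set)
    sup_z = max(abs(dot(w, t)) for t in index_set)
    return sup_zbar, sup_z


def compare(d=8, n_points=16, m=100, trials=200, seed=1):
    """Monte Carlo estimates of E sup|Zbar| and E sup|Z|."""
    rng = random.Random(seed)
    x = unit_vector(d, rng)
    index_set = [unit_vector(d, rng) for _ in range(n_points)]
    results = [one_trial(x, index_set, m, rng) for _ in range(trials)]
    mean_zbar = sum(r[0] for r in results) / trials
    mean_z = sum(r[1] for r in results) / trials
    return mean_zbar, mean_z


if __name__ == "__main__":
    mean_zbar, mean_z = compare()
    print("E sup|Zbar| ~", mean_zbar, "  2 E sup|Z| ~", 2.0 * mean_z)
```

Both processes are linear in $$\boldsymbol{t}$$, so each trial only needs the two averaged vectors `v` and `w`; the supremum over the finite index set is then a maximum of inner products.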
Pick $$\mathbf{t}_0\in \mathscr{T}$$ so that $$\{\mathbf{t}_0\}=\mathscr{A}_0\subseteq \mathscr{A}_1\subseteq \mathscr{A}_2\subseteq \mathscr{A}_3\subseteq \cdots$$ is an admissible sequence, satisfying   $$\sup_{\boldsymbol{t} \in \mathscr{T}}\sum_{l=0}^\infty2^{l/2}d\left(\boldsymbol{t},\pi_l(\boldsymbol{t})\right)\leqslant 2\gamma_2(\mathscr{T}),$$ (3.4)where we recall $$\pi _l$$ is the closest point map from $$\mathscr{T}$$ to $$\mathscr{A}_l$$, and the constant 2 on the right-hand side of the inequality is introduced to handle the case where the infimum in the definition of $$\gamma _2(\mathscr{T})$$ is not achieved. Then, for any $$\boldsymbol{t}\in \mathscr{T}$$, we write $$Z(\boldsymbol{t})-Z(\boldsymbol{t}_0)$$ as a telescoping sum, i.e.   $$Z(\boldsymbol{t})-Z(\boldsymbol{t}_0)=\sum_{l=1}^{\infty}\left[Z\left(\pi_l(\boldsymbol{t})\right)-Z\left(\pi_{l-1}(\boldsymbol{t})\right)\right]=\sum_{l=1}^{\infty}\frac1m\sum_{i=1}^m\varepsilon_iy_i\big\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\rangle.$$ (3.5)Note that this telescoping sum converges with probability 1 because the right-hand side of (3.4) is finite. Then, following ideas in [7], we fix an arbitrary positive integer p and let $$l_p:=\lfloor \log _2p\rfloor$$. Specializing (3.5) to the case $$\boldsymbol{t}_0= \pi _{l_p}(\boldsymbol{t})$$ we obtain, with probability one, that   $$Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)=\sum_{l=l_p+1}^{\infty}\left[Z\left(\pi_l(\boldsymbol{t})\right)-Z\left(\pi_{l-1}(\boldsymbol{t})\right)\right]=\sum_{l=l_p+1}^{\infty}\frac1m\sum_{i=1}^m\varepsilon_iy_i\big\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\rangle.$$ (3.6) We split the outer index of summation in (3.6) into the following two sets:   \begin{equation*} I_{1,p}:=\big\{l>l_p:2^{l/2}\leqslant \sqrt{m}\big\} \quad \mbox{and} \quad I_{2,p}:=\big\{l>l_p:2^{l/2}>\sqrt{m}\big\}. 
\end{equation*} On the coarse scale $$I_{1,p}$$, we have the following lemma: Lemma 3.8 (Coarse scale chaining) For all p ⩾ 1 and u ⩾ 2, there exists a constant c > 0 such that the inequality   \begin{equation*} \sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{1,p}}Z\left(\pi_l(\boldsymbol{t})\right)-Z\left(\pi_{l-1}(\boldsymbol{t})\right)\right|\leqslant 4(\sqrt{2}+1)\|\mathbf{a}\|_{\psi_2}\|y\|_{\psi_2} \frac{u}{\sqrt{m}}\gamma_2(\mathscr{T}) \end{equation*}holds with probability at least $$1-c\mathrm{e}^{-pu/4}$$. Proof. We assume $$I_{1,p}$$ is non-empty, else the claim is trivial. By Proposition 3.3 and Definition 3.2, for any i ∈ {1, 2, ⋯ , m}, we have   \begin{equation*} \left\|\varepsilon_iy_i\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\rangle\right\|_{\psi_1} \leqslant 2\|\mathbf{a}\|_{\psi_2}\|y\|_{\psi_2}\left\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\right\|_2\!. \end{equation*}Thus, for each $$l\in I_{1,p}$$, applying Bernstein’s inequality (Lemma A.7) to   \begin{equation*} Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))=\frac{1}{m}\sum_{i=1}^m\varepsilon_iy_i\big\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\rangle, \end{equation*}an average of independent subexponential random variables, we have that for all v ⩾ 1,   \begin{equation*} P\left[\big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|\geqslant 2\|\mathbf{a}\|_{\psi_2}\|y\|_{\psi_2}\left(\frac{\sqrt{2v}}{\sqrt{m}}+\frac{v}{m}\right)\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\right]\leqslant 2\mathrm{e}^{-v}. \end{equation*}Let $$v=2^lu$$ for some u ⩾ 2. 
Using that $$2^{l/2}\leqslant \sqrt{m}$$ since $$l \in I_{1,p}$$ and that $$u \geqslant \sqrt{u}$$, we have   $$P\left[\big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|\geqslant 2\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2}(\sqrt{2}+1)\frac{u}{\sqrt{m}}2^{l/2}\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\right]\leqslant 2\exp(-2^lu).$$ (3.7) Now for every $$l \in I_{1,p}$$ and $$\boldsymbol{t} \in{\mathscr{T}}$$, define the event   \begin{equation*} \Omega_{l,\boldsymbol{t}}=\left\{\omega: \big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|\geqslant 2(\sqrt{2}+1)\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2}\frac{u}{\sqrt{m}}2^{l/2}\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\right\}, \end{equation*}and let $$\Omega :=\bigcup _{l\in I_{1,p}}\bigcup _{\boldsymbol{t} \in \mathscr{T}}\Omega _{l,\boldsymbol{t}}$$. As $$\mathscr{A}_{l}=\{\pi _{l}(\boldsymbol{t})\}_{\boldsymbol{t} \in \mathscr{T}}$$ contains at most $$2^{2^l}$$ points, it follows that the union over $$\boldsymbol{t} \in \mathscr{T}$$ in the definition of $$\Omega$$ can be written as a union over at most $$2^{2^{l+1}}$$ indices. Hence, with u ⩾ 2, Lemma A.4 with k = 1 may now be invoked to yield   \begin{equation*} P\left[\bigcup_{l\in I_{1,p},\boldsymbol{t}\in \mathscr{T}}\Omega_{l,\boldsymbol{t}}\right]\leqslant c\mathrm{e}^{-pu/4}, \end{equation*} for some c > 0. 
Thus, on the event $$\Omega ^c$$, we have   \begin{align*} \sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{1,p}}Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\right| \leqslant&\,\sup_{\boldsymbol{t} \in \mathscr{T}}\sum_{l\in I_{1,p}}\big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|\\ \leqslant&\, \sup_{\boldsymbol{t} \in \mathscr{T}}2(\sqrt{2}+1)\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2} \frac{u}{\sqrt{m}} \sum_{l\in I_{1,p}}2^{l/2}\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\\ \leqslant&\,\sup_{\boldsymbol{t} \in \mathscr{T}}2(\sqrt{2}+1)\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2} \frac{u}{\sqrt{m}} \sum_{l=1}^{\infty}2^{l/2}\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\\ \leqslant&\,4(\sqrt{2}+1)\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2}\frac{u}{\sqrt{m}}\gamma_2(\mathscr{T}), \end{align*}where the last inequality follows from (3.4), finishing the proof. For the finer scale chaining, we will apply the following lemma whose proof is in the Appendix. Lemma 3.9 For any $$\mathbf{t}\in \mathbb{R}^d$$, u ⩾ 1 and $$2^{l/2}>\sqrt{m}$$, we have   \begin{equation*} P\left[\left(\frac1m\sum_{i=1}^m\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right)^{1/2} \geqslant \sqrt{5+3\sqrt2} \|\mathbf{a}\|_{\psi_2} \sqrt{\frac{u}{m}} 2^{l/2} \|\mathbf{t}\|_2\right] \leqslant 2\exp(-2^lu). \end{equation*} Lemma 3.10 (Finer scale chaining) Let   $$Y_m=\left|\frac1m\sum_{i=1}^my_i^2-E{[y^2]}\right|.$$Then for all p ⩾ 1 and u ⩾ 2, with probability at least $$1-c\mathrm{e}^{-pu/4}$$,   \begin{equation*} \sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{2,p}}Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\right| \leqslant2\sqrt{5+3\sqrt2}\left(Y_m^{1/2}+\sqrt{2}\|y\|_{\psi_2}\right)\|\mathbf{a}\|_{\psi_2} \sqrt{\frac{u}{m}} \gamma_2(\mathscr{T}), \end{equation*}for some constant c > 0. Proof. 
For any $$p \geqslant 1, l\in I_{2,p}$$ and $$t\in \mathscr{T}$$, by the Cauchy–Schwarz inequality,   \begin{align*} \big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|=&\,\left|\frac1m\sum_{i=1}^m\varepsilon_iy_i\big\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\rangle\right|\\ \leqslant&\,\left(\frac1m\sum_{i=1}^my_i^2\right)^{1/2} \cdot\left(\frac1m\sum_{i=1}^m\big|\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\rangle\big|{}^2\right)^{1/2}. \end{align*}Since y is subgaussian, $$E\left [y^2\right ]\leqslant 2\|y\|_{\psi _2}^2$$. Thus,   $$\left(\frac1m\sum_{i=1}^my_i^2\right)^{1/2}=\left(\frac1m\sum_{i=1}^my_i^2-E\,{[y^2]}+E\,{[y^2]}\right)^{1/2} \leqslant Y_m^{1/2}+\sqrt{2}\|y\|_{\psi_2}.$$Furthermore, by Lemma 3.9, for any $$l\in I_{2,p}$$, we have   $$P\left[\left(\frac1m\sum_{i=1}^m\big|\langle\mathbf{a}_i,\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\rangle\big|{}^2\right)^{1/2} \!\!\geqslant \sqrt{5+3\sqrt2}\|\mathbf{a}\|_{\psi_2} \sqrt{\frac{u}{m}} 2^{l/2} \big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\right]\! \leqslant\! 2\exp(-2^lu).$$Thus, combining the above two inequalities,   \begin{align*} &P\left[\big|Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\big|\geqslant\sqrt{5+3\sqrt2}\left(Y_m^{1/2}+\sqrt{2}\|y\|_{\psi_2}\right)\|\mathbf{a}\|_{\psi_2} \sqrt{\frac{u}{m}} 2^{l/2}\big\|\pi_l(\boldsymbol{t})-\pi_{l-1}(\boldsymbol{t})\big\|_2\right]\\[10pt] &\qquad\leqslant 2\exp(-2^lu). \end{align*}The rest of the proof follows a standard chaining argument similar to the proof of Lemma 3.8 after (3.7) and is not repeated here for brevity. Now we are ready to prove Lemma 3.1, for which we have already demonstrated the sufficiency of (3.3). 
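The union bound step shared by Lemmas 3.8 and 3.10 can be checked numerically. At level l there are at most $$2^{2^{l+1}}$$ events, each of probability at most $$2\exp (-2^lu)$$, and Lemma A.4 asserts that the total over $$l>l_p$$ is at most $$c\,\mathrm{e}^{-pu/4}$$ with c ⩽ 16. The Python sketch below sums these per-level bounds in log space, to avoid overflow in $$2^{2^{l+1}}$$, and compares the result with $$16\,\mathrm{e}^{-pu/4}$$. The function names and the truncation of the series after a fixed number of levels are assumptions of this sketch, and the check covers only the worst-case cardinalities, not the events themselves.

```python
import math


def union_bound(p, u, extra_levels=40):
    """Sum over l > l_p of 2^(2^(l+1)) * 2 exp(-2^l u), computed in log space."""
    l_p = int(math.floor(math.log2(p)))
    total = 0.0
    for l in range(l_p + 1, l_p + 1 + extra_levels):
        # log of (number of events at level l) times (per-event probability bound)
        log_term = math.log(2.0) + (2.0 ** (l + 1)) * math.log(2.0) - (2.0 ** l) * u
        total += math.exp(log_term)  # underflows harmlessly to 0.0 for large l
    return total


def lemma_a4_rhs(p, u, c=16.0):
    """Right-hand side c * exp(-p u / 4) with the largest constant allowed in Lemma A.4."""
    return c * math.exp(-p * u / 4.0)


if __name__ == "__main__":
    for p in (1, 2, 3, 5, 8, 17):
        for u in (2.0, 2.5, 4.0, 10.0):
            print(p, u, union_bound(p, u), lemma_a4_rhs(p, u))
```

The condition u ⩾ 2 matters here: each summand equals $$2\exp (-2^l(u-2\ln 2))$$, and u ⩾ 2 keeps the exponent negative with enough room to be compared with pu/4.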
Proof of (3.3). First, for all p ⩾ 1 and u ⩾ 2, by Lemma 3.10, with probability at least $$1-c \mathrm{e}^{-pu/4}$$,   \begin{align*} &\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{2,p}}Z(\pi_l(\boldsymbol{t}))-Z(\pi_{l-1}(\boldsymbol{t}))\right|\\[10pt] &\quad\leqslant2\sqrt{5+3\sqrt2}Y_m^{1/2}\|\mathbf{a}\|_{\psi_2} \sqrt{\frac{u}{m}} \gamma_2(\mathscr{T}) +2\sqrt{10+6\sqrt2}\ \|\mathbf{a}\|_{\psi_2}\|y\|_{\psi_2} \sqrt{\frac{u}{m}} \gamma_2(\mathscr{T})\\[10pt] &\quad\leqslant Y_m+\left(5+3\sqrt2\right)\|\mathbf a\|_{\psi_2}^2\frac{u}{m}\gamma_2(\mathscr{T})^2 +2\sqrt{10+6\sqrt2}\ \|\mathbf{a}\|_{\psi_2}\|y\|_{\psi_2} \sqrt{\frac{u}{m}} \gamma_2(\mathscr{T}), \end{align*}where we applied the inequality $$2ab\leqslant a^2+b^2$$ on the first term. Then, combining with Lemma 3.8, we have with probability at least $$1-c\mathrm{e}^{-pu/4}$$,   \begin{align*} &\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\\ &\quad\leqslant \sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{1,p}}Z\left(\pi_l(\boldsymbol{t})\right)-Z\left(\pi_{l-1}(\boldsymbol{t})\right)\right| +\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\sum_{l\in I_{2,p}}Z\left(\pi_l(\boldsymbol{t})\right)-Z\left(\pi_{l-1}(\boldsymbol{t})\right)\right|\\ &\quad\leqslant Y_m+\left(5+3\sqrt2\right)\|\mathbf a\|_{\psi_2}^2\frac{u}{m}\gamma_2(\mathscr{T})^2 +2\sqrt{10+6\sqrt{2}}\ \|\mathbf a\|_{\psi_2}\|y\|_{\psi_2} \sqrt{\frac{u}{m}} \gamma_2(\mathscr{T})\\ &\qquad+4\left(\sqrt{2}+1\right)\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2} \frac{u}{\sqrt{m}}\gamma_2(\mathscr{T})\\ &\quad\leqslant Y_m+\left(5+3\sqrt2\right)\|\mathbf a\|_{\psi_2}^2\frac{u}{m}\gamma_2(\mathscr{T})^2 +\left( \sqrt{10+6\sqrt{2}}+ 2\left(\sqrt{2}+1\right) \right) 2\|\mathbf a\|_{\psi_2}\|y\|_{\psi_2}\frac{u}{\sqrt{m}} \gamma_2(\mathscr{T}), \end{align*}where the last step uses $$\sqrt{u}\leqslant u$$ for u ⩾ 2. By the conditions in (3.3) we have $$m\geqslant \omega (\mathscr{T})^2$$. Using inequality (3.2) on the relation between $$\omega (\mathscr{T})$$ and $$\gamma _2(\mathscr{T})$$ gives $$m\geqslant \gamma _2(\mathscr{T})^2/L^2$$. Thus, $$\gamma _2(\mathscr{T})^2/m\leqslant L\gamma _2(\mathscr{T})/\sqrt{m}$$, and the second term is bounded by   \begin{equation*} \left(5+3\sqrt2\right)L \|\mathbf a\|_{\psi_2}^2\frac{u}{\sqrt{m}}\gamma_2(\mathscr{T}). \end{equation*} For the last term we apply the bound $$2\|\mathbf a\|_{\psi _2}\|y\|_{\psi _2}\leqslant \|\mathbf a\|_{\psi _2}^2+\|y\|_{\psi _2}^2$$. Thus, with probability at least $$1-c\mathrm{e}^{-pu/4}$$,   $$\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\leqslant Y_m + C\left(\|\mathbf a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{u\gamma_2(\mathscr{T})}{\sqrt{m}},$$for the constant   $$C=5L+2+(3L+2)\sqrt{2}+\sqrt{10+6\sqrt{2}}.$$By Proposition 3.5, $$\|\mathbf a\|_{\psi _2}\leqslant C\|a\|_{\psi _2}$$ for some constant C. Thus, with probability at least $$1-c\mathrm{e}^{-pu/4}$$, for some constant C large enough,   \begin{equation*} \sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\leqslant Y_m + C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{u\gamma_2(\mathscr{T})}{\sqrt{m}}, \end{equation*}or equivalently   \begin{equation*} \xi \leqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{u\gamma_2(\mathscr{T})}{\sqrt{m}} \quad \mbox{where} \quad \xi = \max\left\{\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|-Y_m,0\right\}. 
\end{equation*}Invoking Lemma A.5 with k = 1, for all $$1 \leqslant p < \infty$$  $$E\left[\xi^p\right]^{1/p}\leqslant C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\gamma_2(\mathscr{T})}{\sqrt{m}}.$$ Since   \begin{align*} \xi\geqslant&\,\max\left\{\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|,0\right\}-Y_m=\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})-Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|-Y_m\\ \geqslant&\,\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|-\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|-Y_m, \end{align*}and $$\xi$$ and $$Y_m$$ are both non-negative, by Minkowski’s inequality it follows that   \begin{align} E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\right)^p\right]^{1/p}\leqslant&\, E\left[\left(\xi+\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|+Y_m\right)^p\right]^{1/p} \nonumber\\ \leqslant&\, E\left[\xi^p\right]^{1/p}+E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\right)^p\right]^{1/p} +E\left[Y_m^p\right]^{1/p}\nonumber\\ \leqslant&\, C\left(\|a\|_{\psi_2}^2+\|y\|_{\psi_2}^2\right)\frac{\gamma_2(\mathscr{T})}{\sqrt{m}} +E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\right)^p\right]^{1/p} +E\left[Y_m^p\right]^{1/p}.\end{align} (3.8) For the second term, we have   \begin{equation*} E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z\left(\pi_{l_p}(\boldsymbol{t})\right)\right|\right)^p\right] \leqslant\sum_{\mathbf{t}\in\mathscr{A}_{l_p}}E\left[|Z(\boldsymbol{t})|^p\right] \leqslant|\mathscr{A}_{l_p}|\sup_{\boldsymbol{t} \in \mathscr{T}}E\left[|Z(\boldsymbol{t})|^p\right] \leqslant 2^p\sup_{\boldsymbol{t} \in \mathscr{T}}E\left[|Z(\boldsymbol{t})|^p\right], \end{equation*}where the first inequality 
follows from the fact that $$\pi _{l_p}(\cdot )$$ can only take values in $$\mathscr{A}_{l_p}$$, and the last inequality follows from the fact that $$l_p=\lfloor \log _2p\rfloor$$. On the other hand, by Proposition 3.5, which yields $$\|\boldsymbol{a}\|_{\psi _2} \leqslant C\|a\|_{\psi _2}$$, together with Proposition 3.3 and a direct application of Bernstein’s inequality (Lemma A.7), we have, for any fixed $$\mathbf{t}\in \mathscr{T}$$,   $$P\left[\left|Z(\boldsymbol{t})\right|\geqslant 2C\|y\|_{\psi_2}\|a\|_{\psi_2}\left(1+\sqrt2\right)\frac{pu}{\sqrt{m}}\right]\leqslant2\mathrm{e}^{-pu}, \quad \mbox{whenever $pu\geqslant0$.} \quad$$Hence, applying Lemma A.5 with k = 1, for all $$1 \leqslant p < \infty$$,   $$E{\left[\left|Z(\boldsymbol{t})\right|^p\right]}^{1/p}\leqslant \frac{C\|y\|_{\psi_2}\|a\|_{\psi_2}p}{\sqrt{m}},$$for all $$\boldsymbol{t}\in \mathscr{T}$$ and some constant C > 0. Thus,   $$E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\pi_{l_p}(\boldsymbol{t}))\right|\right)^p\right]^{1/p}\leqslant\frac{2C\|y\|_{\psi_2}\|a\|_{\psi_2}p}{\sqrt{m}}\leqslant\frac{C\left(\|y\|_{\psi_2}^2+\|a\|_{\psi_2}^2\right)p}{\sqrt{m}}.$$ (3.9) Now consider $$E [Y_m^p ]^{1/p}$$, the final term in (3.8), recalling that $$Y_m=\left|\frac 1m\sum _{i=1}^m\left(y_i^2-E [y_i^2 ]\right)\right|$$. 
Applying Proposition 3.3, we have   $$\left\|y_i^2-E\left[y_i^2\right]\right\|_{\psi_1}\leqslant \left\|y_i^2\right\|_{\psi_{1}}+E\left[y_i^2\right] \leqslant2\big\|y_i\big\|_{\psi_2}^2+2\big\|y_i\big\|_{\psi_2}^2=4\big\|y\big\|_{\psi_2}^2.$$Thus, using Bernstein’s inequality and Lemma A.5 as before, we obtain   $$P\left[Y_m\geqslant4\left(1+\sqrt{2}\right)\|y\|_{\psi_2}^2\frac{pu}{\sqrt{m}}\right]\leqslant2\mathrm{e}^{-pu},\quad~\forall\, pu\geqslant0,$$and   $$E\left[Y_m^p\right]^{1/p}\leqslant\frac{C\|y\|_{\psi_2}^2p}{\sqrt{m}}\leqslant\frac{C\left(\|y\|_{\psi_2}^2+\|a\|_{\psi_2}^2\right)p}{\sqrt{m}}.$$ (3.10)Combining (3.8), (3.9) and (3.10) gives   $$E\left[\left(\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\right)^p\right]^{1/p}\leqslant\frac{C\left(\|y\|_{\psi_2}^2+\|a\|_{\psi_2}^2\right)\left(\gamma_2(\mathscr{T})+p\right)}{\sqrt{m}},$$for some constant C > 0. Since this inequality holds for any p ⩾ 1, applying Lemma A.6 with k = 1 yields   $$P\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right| \geqslant C\left(\|y\|_{\psi_2}^2+\|a\|_{\psi_2}^2\right)\frac{\gamma_2(\mathscr{T})+u}{\sqrt{m}}\right] \leqslant \mathrm{e}^{-u}.$$The proof of (3.3) is now completed by invoking Lemma 3.7, which gives $$\gamma _2(\mathscr{T})\leqslant L\omega (\mathscr{T})$$ for some constant L ⩾ 1.
Funding NSA grant (H98230-15-1-0250) to L.G.
Footnotes 1 Since the set K is closed, the set $$D(K,\lambda \mathbf{x})\cap \mathbb{S}^{d-1}\subseteq \mathbb{R}^d$$ is also closed and thus Borel measurable. By taking $$\mathscr{T}=D(K,\lambda \mathbf{x})\cap \mathbb{S}^{d-1}\subseteq \mathbb{R}^d$$ in Remark 1.3, we have that the supremum is indeed measurable in the probability space $$(\Omega ,~\mathscr{E},~P)$$.
References
1. Ai, A., Lapanowski, A., Plan, Y. & Vershynin, R. (2014) One-bit compressed sensing with non-Gaussian measurements. Linear Algeb. Appl., 441, 222–239.
2. Boucheron, S., Lugosi, G. 
& Massart, P. (2013) Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press.
3. Cacoullos, T. & Papathanasiou, V. (1992) Lower variance bounds and a new proof of the central limit theorem. J. Multivar. Anal., 43, 173–184.
4. Chatterjee, S. (2009) Fluctuations of eigenvalues and second order Poincaré inequalities. Probab. Theory Relat. Fields, 143, 1–40.
5. Chen, L., Goldstein, L. & Shao, Q. (2010) Normal Approximation by Stein’s Method. New York: Springer.
6. Cohn, D. L. (1980) Measure Theory. Boston: Birkhauser.
7. Dirksen, S. (2015) Tail bounds via generic chaining. Electron. J. Probab., 20, 1–29.
8. Erdogdu, M. A., Dicker, L. H. & Bayati, M. (2016) Scaled least squares estimator for GLM’s in large-scale problems. Adv. Neural Inf. Process. Syst., 3324–3332.
9. Foucart, S. & Rauhut, H. (2013) A Mathematical Introduction to Compressive Sensing. Boston: Birkhauser.
10. Goldstein, L. (2007) $$L^1$$ bounds in normal approximation. Ann. Probab., 35, 1888–1930.
11. Goldstein, L. (2010) Bounds on the constant in the mean central limit theorem. Ann. Probab., 38, 1672–1689.
12. Goldstein, L., Minsker, S. & Wei, X. (2016) Structured signal recovery from non-linear and heavy-tailed measurements. preprint arXiv:1609.01025.
13. Goldstein, L. & Reinert, G. (1997) Stein’s method and the zero bias transformation with application to simple random sampling. Ann. Appl. Probab., 7, 935–952.
14. Ledoux, M. & Talagrand, M. (1991) Probability in Banach Spaces: Isoperimetry and Processes. Berlin: Springer.
15. Plan, Y. & Vershynin, R. 
(2016) The generalized lasso with non-linear observations. IEEE Trans. Inf. Theory, 62, 1528–1537.
16. Rachev, S. T. (1991) Probability Metrics and the Stability of Stochastic Models, vol. 269. The University of Michigan, John Wiley.
17. Rudelson, M. & Vershynin, R. (2008) On sparse reconstruction from Fourier and Gaussian measurements. Commun. Pure Appl. Math., 61, 1025–1045.
18. Shevtsova, I. G. (2010) An improvement of convergence rate estimates in the Lyapunov theorem. Doklady Math., 82, 862–864.
19. Stein, C. (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 2, 583–602.
20. Talagrand, M. (2014) Upper and lower bounds for stochastic processes: modern methods and classical problems. Ergebnisse der Mathematik und ihrer Grenzgebiete. Berlin Heidelberg: Springer.
21. Vershynin, R. (2010) Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications. New York: Cambridge University Press.
22. Yang, Z., Balasubramanian, K. & Liu, H. (2017) High-dimensional Non-Gaussian Single Index Models via Thresholded Score Function Estimation. Proceedings of the 34th International Conference on Machine Learning, pp. 3851–3860.
APPENDIX A. Additional lemmas
The following lemma is one version of the contraction principle; for a proof see [14]: Lemma A.1 Let $$F:[0,\infty ) \rightarrow [0,\infty )$$ be convex and non-decreasing. 
Let $$\{\eta _i\}$$ and $$\{\xi _i\}$$ be two symmetric sequences of real-valued random variables such that for some constant C ⩾ 1 for every i and t > 0, we have   $$P\left[|\eta_i|>t\right]\leqslant C \cdot P\left[|\xi_i|>t\right]\!.$$Then, for any finite sequence $$\{\mathbf{x}_i\}$$ in a vector space with semi-norm ∥⋅∥,   $$E{\left[F\left(\left\|\sum_i\eta_i\mathbf{x}_i\right\|\right)\right]} \leqslant E{\left[F\left(C \cdot\left\|\sum_i\xi_i\mathbf{x}_i\right\|\right)\right]}.$$ Remark A.2 Though Lemma 4.6 of [14] states the contraction principle in a Banach space, the proofs of Theorem 4.4 and Lemma 4.6 of [14] hold for vector spaces under any semi-norm. The following symmetrization lemma is the same as Lemma 4.6 of [1]. Lemma A.3 (Symmetrization) Let   \begin{equation*} \overline{Z}(\boldsymbol{t})=f_{\mathbf{x}}(\boldsymbol{t})-E\left[\,f_{\mathbf{x}}(\boldsymbol{t})\right] \quad \mbox{where} \quad f_{\boldsymbol{x}}(\boldsymbol{t})=\frac{1}{m}\sum_{i=1}^m y_i \langle\boldsymbol{a}_i,\mathbf t \rangle, \end{equation*}and   $$Z(\boldsymbol{t})= \frac{1}{m}\sum_{i=1}^m\varepsilon_iy_i\langle\mathbf{a}_i,\mathbf t\rangle,$$where $$\{ \varepsilon _i: 1 \leqslant i \leqslant m\}$$ is a collection of Rademacher random variables, each uniformly distributed over {−1, 1} and independent of each other and of $$\{y_i,\boldsymbol{a}_i: 1 \leqslant i \leqslant m\}$$. 
Then for any measurable set $$\mathscr{T}\subset \mathbb{R}^d$$,   $$E{\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\overline{Z}(\boldsymbol{t})\right|\right]} \leqslant 2E{\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\right]},$$and for any $$\beta>0$$  $$P\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\overline{Z}(\boldsymbol{t})\right|\geqslant 2E{\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|\overline{Z}(\boldsymbol{t})\right|\right]}+\beta\right]\leqslant 4P\left[\sup_{\boldsymbol{t} \in \mathscr{T}}\left|Z(\boldsymbol{t})\right|\geqslant \beta/2\right].$$ Lemma A.4 (Lemma A.4 of [7]) Fix $$1\leqslant p <\infty$$, $$0<k<\infty$$, u ⩾ 2 and $$l_p:=\lfloor \log _2p\rfloor$$. For every $$l>l_p$$, let $$J_l$$ be an index set such that $$|J_l|\leqslant 2^{2^{l+1}}$$, and $$\{\Omega _{l,i} \}_{i\in J_l}$$ a collection of events, satisfying   $$P\left[\Omega_{l,i}\right]\leqslant 2\exp(-2^lu^k),\quad\forall\, i\in J_l.$$Then there exists an absolute constant c ⩽ 16 such that   $$P\left[\cup_{l>l_p}\cup_{i\in J_l}\Omega_{l,i}\right]\leqslant c \exp(-pu^k/4).$$ Lemma A.5 (Lemma A.5 of [7]) Fix $$1\leqslant p <\infty$$ and $$0<k<\infty$$. Let $$\beta \geqslant 0$$ and suppose that $$\xi$$ is a non-negative random variable such that for some $$c,u_*>0$$,   $$P\left[\xi>\beta u\right]\leqslant c\exp(-pu^k/4),\quad \forall\, u\geqslant u_*.$$Then for a constant $$\tilde{c}_k>0$$ depending only on k,   $$E{\left[\xi^p\right]}^{1/p}\leqslant\beta(\tilde{c}_kc+u_*).$$ Lemma A.6 (Proposition 7.11 of [9]) If X is a non-negative random variable, satisfying   $$E{\left[X^p\right]}^{1/p}\leqslant b+ ap^{1/k} \quad \forall\, p\geqslant1,$$for positive real numbers a and k, and b ⩾ 0, then for any u ⩾ 1,   $$P\left[X\geqslant \mathrm{e}^{1/k}(b+au)\right]\leqslant\exp(-u^k/k).$$ Finally, for the following result see Theorem 2.10 of [2]. 
Lemma A.7 (Bernstein’s inequality) Let $$X_1,\cdots ,X_m$$ be a sequence of independent, mean zero random variables. If there exist positive constants $$\sigma$$ and D such that   $$\frac1m\sum_{i=1}^mE{\left[|X_i|^p\right]}\leqslant\frac{p!}{2}\sigma^2D^{p-2},\quad p=2,3,\cdots$$ then for any u ⩾ 0,   $$P\left[\left|\frac1m\sum_{i=1}^mX_i\right|\geqslant\frac{\sigma}{\sqrt{m}}\sqrt{2u}+\frac{D}{m}u\right] \leqslant2\exp(-u).$$ If $$X_1,\cdots ,X_m$$ are all subexponential random variables, then $$\sigma$$ and D can be chosen as $$\sigma=\frac{1}{m}\sum _{i=1}^m\|X_i\|_{\psi _1}$$ and $$D=\max _i\|X_i\|_{\psi _1}$$.

B. Additional proofs

With g a standard normal variable, we begin by considering the solution f to (2.28), the special case of the Stein equation   $$f^{\prime}(x)-xf(x)=h(x)-Eh(g),$$ (B.1) with the specific choice of test function h(x) = |x|.

Lemma B.1 The solution f of (2.28) satisfies ∥f″∥ = 1.

Proof. In general, when f solves (B.1) for a given test function h(⋅), then − f(−x) solves (B.1) for h(−⋅). Since in the case at hand h(x) = |x| satisfies h(−x) = h(x), it suffices to show that 0 ⩽ f″(x) ⩽ 1 for all x > 0, over which range (2.28) specializes to   $$f^{\prime}(x)-xf(x)=x-\sqrt{\frac2\pi}.$$ (B.2) Differentiating both sides yields   $$f^{\prime\prime}(x)-f(x)-xf^{\prime}(x)=1,$$ and combining the above two equalities gives   $$f^{\prime\prime}(x)=(1+x^2)\,f(x)+x\left(x-\sqrt{\frac2\pi}\right)+1.$$ (B.3) On the other hand, solving (B.2) via integrating factors gives, for all x > 0,   \begin{align} f(x)=&\,-\mathrm{e}^{x^2/2}\int_x^\infty\left(z-\sqrt{\frac2\pi}\right)\mathrm{e}^{-z^2/2}\,\mathrm{d}z =-1+2\mathrm{e}^{x^2/2}\int_x^\infty\frac{\mathrm{e}^{-z^2/2}}{\sqrt{2\pi}}\,\mathrm{d}z\nonumber\\ =&\,-1+2\mathrm{e}^{x^2/2}(1-\Phi(x)), \end{align} (B.4) where $$\Phi (\cdot )$$ is the cumulative distribution function of the standard normal.
For any x > 0, by classical upper and lower tail bounds for $$\Phi (\cdot )$$, we have   $$\frac{x}{\sqrt{2 \pi}(1+x^2)} \leqslant \mathrm{e}^{x^2/2}(1-\Phi(x))\leqslant\min\left\{\frac12,\frac{1}{x\sqrt{2\pi}}\right\},$$which in turn implies, using (B.3) and (B.4), that for all x > 0   \begin{equation*} 0 \leqslant f^{\prime\prime}(x) \leqslant \min\left\{x\left(x-\sqrt{\frac{2}{\pi}}\right)+1,\frac{1}{x}\sqrt{\frac{2}{\pi}}\right\}. \end{equation*}Handling the cases $$0<x\leqslant \sqrt{2/\pi }$$ and $$x>\sqrt{2/\pi }$$ separately, we see 0 ⩽ f″(x) ⩽ 1 for all x > 0, as desired. Proof of Proposition 3.3 We may assume $$\|Y\|_{\psi _1}\not =0$$ as the inequality is trivial otherwise. By definition $$\|XY\|_{\psi _1}=\sup _{p\geqslant 1}p^{-1}E{\left [|XY|^p\right ]}^{1/p}$$. Applying $$2ab\leqslant a^2+b^2$$ and Minkowski’s inequality, for any $$\varepsilon>0$$,   \begin{equation*} E{\left[|XY|^p\right]}^{1/p}\leqslant E{\left[\left|\frac{X^2}{2\varepsilon}+\frac{\varepsilon Y^2}{2}\right|{}^p\right]}^{1/p} \leqslant\frac{1}{2\varepsilon}E{\left[X^{2p}\right]}^{1/p}+\frac{\varepsilon}{2}E{\left[Y^{2p}\right]}^{1/p}. \end{equation*}Applying the definition of the $$\psi _1$$ norm, this inequality implies   $$\|XY\|_{\psi_1}\leqslant\frac{1}{2\varepsilon}\|X^2\|_{\psi_1}+\frac{\varepsilon}{2}\|Y^2\|_{\psi_1}.$$The term $$\|X^2\|_{\psi _1}$$ can be bounded as follows,   \begin{equation*} \|X^2\|_{\psi_1}=\sup_{p\geqslant1}\left(p^{-1/2}E{[X^{2p}]}^{1/2p}\right)^2 =2 \sup_{p\geqslant1}\left((2p)^{-1/2}E{[X^{2p}]}^{1/2p}\right)^2 \leqslant2\|X\|_{\psi_2}^2. \end{equation*}Arguing similarly for Y,   $$\|XY\|_{\psi_1}\leqslant\frac{1}{\varepsilon}\|X\|^2_{\psi_2}+\varepsilon\|Y\|^2_{\psi_2},$$and choosing $$\varepsilon =\|X\|_{\psi _2}/\|Y\|_{\psi _2}$$ finishes the proof. 
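As a numerical illustration of Proposition 3.3 (a sketch only, not part of the proof), the snippet below takes X = Y = g with g standard normal, for which $$E|g|^p=2^{p/2}\Gamma((p+1)/2)/\sqrt{\pi}$$, and approximates the suprema over p ⩾ 1 defining the $$\psi_1$$ and $$\psi_2$$ norms by maxima over a finite grid. The choice of g, the grid of exponents and all function names are assumptions made for this illustration. The last function evaluates the bound $$\frac{1}{\varepsilon}\|X\|^2_{\psi_2}+\varepsilon\|Y\|^2_{\psi_2}$$ from the proof, which equals $$2\|X\|_{\psi_2}\|Y\|_{\psi_2}$$ at $$\varepsilon=\|X\|_{\psi_2}/\|Y\|_{\psi_2}$$.

```python
import math


def log_abs_moment_gaussian(p):
    """log E|g|^p for standard normal g, from E|g|^p = 2^(p/2) Gamma((p+1)/2) / sqrt(pi)."""
    return 0.5 * p * math.log(2.0) + math.lgamma(0.5 * (p + 1.0)) - 0.5 * math.log(math.pi)


def psi_norm_on_grid(log_moment, alpha, grid):
    """Grid approximation of sup_{p >= 1} p^(-alpha) * (E|X|^p)^(1/p).

    alpha = 1 corresponds to the psi_1 norm and alpha = 1/2 to the psi_2 norm.
    A maximum over a finite grid is only a lower bound for the true supremum.
    """
    return max(math.exp(log_moment(p) / p - alpha * math.log(p)) for p in grid)


def product_bound(eps, norm_x, norm_y):
    """Upper bound (1/eps) * ||X||_{psi_2}^2 + eps * ||Y||_{psi_2}^2 from the proof."""
    return norm_x ** 2 / eps + eps * norm_y ** 2


# Hypothetical grid of exponents p in [1, 100]; an assumption of this sketch.
grid = [1.0 + 0.01 * k for k in range(9901)]

# psi_2 norm of g, and psi_1 norm of g^2 (using E|g^2|^p = E|g|^(2p)).
psi2_g = psi_norm_on_grid(log_abs_moment_gaussian, 0.5, grid)
psi1_g_squared = psi_norm_on_grid(lambda p: log_abs_moment_gaussian(2.0 * p), 1.0, grid)

print("grid estimate of ||g||_psi2     :", psi2_g)
print("grid estimate of ||g^2||_psi1   :", psi1_g_squared)
print("right-hand side 2 ||g||_psi2^2  :", 2.0 * psi2_g ** 2)

# With illustrative norm values 3 and 0.5, eps = 3 / 0.5 attains 2 * 3 * 0.5.
print("bound at eps = ||X||/||Y||      :", product_bound(3.0 / 0.5, 3.0, 0.5))
```

Working with logarithms of moments avoids overflow for large p. Because the grid maximum can only underestimate a supremum, the printed values illustrate the inequality rather than certify it.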
Proof of Proposition 3.4 By definition, we have   \begin{align*} \|\mathbf{a}\|_{\psi_2}=&\sup_{\mathbf{z}\in\mathbb{S}^{d-1}}\|\langle\mathbf{a},\mathbf{z}\rangle\|_{\psi_2}\\ =&\sup_{\mathbf{z}\in\mathbb{S}^{d-1}} \sup_{p\geqslant1}\frac{1}{p^{1/2}}E{\left[|\langle\mathbf{a},\mathbf{z}\rangle|^p\right]}^{1/p}\\ \geqslant&\sup_{\mathbf{z}\in\mathbb{S}^{d-1}}\frac{1}{\sqrt{2}}E{\left[\langle\mathbf{a},\mathbf{z}\rangle^2\right]}^{1/2}\\ =&\frac{1}{\sqrt{2}}\sup_{\mathbf{z}\in\mathbb{S}^{d-1}}\langle\mathbf{\Sigma}\mathbf{z},\mathbf{z}\rangle^{1/2}\\ =&\frac{1}{\sqrt{2}}\sigma_{\max}(\mathbf{\Sigma})^{1/2}, \end{align*} where the inequality follows by taking p = 2 in the inner supremum. Squaring both sides finishes the proof.

Proof of Lemma 3.9 Since $$\langle \mathbf{a}_i,\mathbf{t}\rangle$$ is subgaussian, $$\langle \mathbf{a}_i,\mathbf{t}\rangle ^2$$ is subexponential by Proposition 3.3. Note that $$E [\langle \mathbf{a}_i,\mathbf{t}\rangle ^2 ]\leqslant \sigma _{\max }(\mathbf \Sigma )\|\mathbf{t}\|_2^2\leqslant 2\|\mathbf{a}\|_{\psi _2}^2\|\mathbf{t}\|_2^2$$ by Proposition 3.4.
Then, by Remark 1.5 and Proposition 3.3,   $$\left\|\langle\mathbf{a}_i,\mathbf{t}\rangle^2-E{\left[\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right]}\right\|_{\psi_1} \leqslant\left\|\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right\|_{\psi_1}+2\|\mathbf{a}\|_{\psi_2}^2\|\mathbf{t}\|_2^2 \leqslant3\|\mathbf a\|_{\psi_2}^2\|\mathbf{t}\|_2^2.$$ Now an application of Bernstein’s inequality (Lemma A.7) gives   $$P\left[\left(\frac1m\sum_{i=1}^m\langle\mathbf{a}_i,\mathbf{t}\rangle^2-E{\left[\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right]}\right)\geqslant 3\|\mathbf a\|_{\psi_2}^2\left(\frac{\sqrt{2v}}{\sqrt{m}}+\frac{v}{m}\right)\|\mathbf{t}\|_2^2\right] \leqslant 2\mathrm{e}^{-v}.$$ We let $$v=2^{l}u$$ and apply the hypotheses $$2^{l/2}>\sqrt{m}$$ and u ⩾ 1 to obtain   $$P\left[\left(\frac1m\sum_{i=1}^m\langle\mathbf{a}_i,\mathbf{t}\rangle^2-E{\left[\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right]}\right) \geqslant 3\left(1+\sqrt{2}\right) \|\mathbf a\|_{\psi_2}^2 \frac{2^lu}{m}\|\mathbf{t}\|_2^2\right] \leqslant 2\exp{(-2^lu)}.$$ Thus, by $$2^{l/2}>\sqrt{m}$$ and u ⩾ 1 again,   $$P\left[\left(\frac1m\sum_{i=1}^m\langle\mathbf{a}_i,\mathbf{t}\rangle^2\right) \geqslant \left(3 \left(1+\sqrt{2}\right)+2\right)\frac{2^lu}{m} \|\mathbf a\|_{\psi_2}^2\|\mathbf{t}\|_2^2\right] \leqslant 2\exp{(-2^lu)},$$ which yields the claim upon taking square roots on both sides of the first inequality.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices). For permissions, please e-mail: journals.permissions@oup.com

Journal

Information and Inference: A Journal of the IMA, Oxford University Press

Published: May 21, 2018
