# Isometric sketching of any set via the Restricted Isometry Property

## Abstract

In this paper we show that, for the purposes of dimensionality reduction, a certain class of structured random matrices behaves similarly to random Gaussian matrices. This class includes several matrices for which the matrix-vector multiply can be computed in log-linear time, providing efficient dimensionality reduction of general sets. In particular, we show that using such matrices any set can be embedded from high dimensions into lower dimensions with near-optimal distortion. We obtain our results by connecting dimensionality reduction of any set to dimensionality reduction of sparse vectors via a chaining argument.

## 1. Introduction

Dimensionality reduction, or sketching, is the problem of mapping a set into a low-dimensional space while preserving certain properties of the original high-dimensional set. Such low-dimensional embeddings have found numerous applications in a wide variety of applied and theoretical disciplines across science and engineering.

Perhaps the most fundamental and popular result for dimensionality reduction is the Johnson–Lindenstrauss (JL) Lemma. This lemma states that any set of p points can be embedded into $$\mathcal{O}\left(\frac{\log p}{\delta ^{2}}\right)$$ dimensions, while preserving the Euclidean norm of all points within a multiplicative factor between 1 − δ and 1 + δ. The JL Lemma in its modern form can be stated as follows.

**Lemma 1.1** (JL Lemma [18]) Let δ ∈ (0, 1) and let $$\boldsymbol{x}_{1},\boldsymbol{ x}_{2}, \ldots ,\boldsymbol{ x}_{p}\in \mathbb{R}^{n}$$ be arbitrary points.
Then, as long as $$m=\mathcal{O}\left(\frac{\log p}{\delta ^{2}}\right)$$, there exists a matrix $$\boldsymbol{A}\in \mathbb{R}^{m\times n}$$ such that   \begin{align} (1-\delta)\left\|\boldsymbol{x}_{i}\right\|_{\ell_{2}}\le\left\|\boldsymbol{A} \boldsymbol{x}_{i}\right\|_{\ell_{2}}\le(1+\delta)\left\|\boldsymbol{x}_{i}\right\|_{\ell_{2}}\!\!, \end{align} (1.1) for all i = 1, 2, …, p.

This lemma was originally proven to hold with high probability for a matrix A that projects all data points onto a random subspace of dimension m and then scales them by $$\sqrt{\frac{n}{m}}$$. The result was later generalized so that A could have i.i.d. normal random entries, as well as other random ensembles [10,15]. More recently, many authors have focused on constructions where the mapping by A can be computed in $$\mathcal{O}(n \log n)$$ time. See [1,2,13,19,22,23] for examples of such constructions, as well as the more recent papers [3,26] for further details on related and improved constructions.

In many applications of dimensionality reduction arising in statistical learning, optimization and numerical linear algebra, one aims to embed a set containing an infinite continuum of points into lower dimensions while preserving the Euclidean norm of all points up to a multiplicative distortion. A classical result due to Gordon [16] characterizes the precise tradeoff between distortion, ‘size’ of the set and the amount of reduction in dimension for a subset of the unit sphere. Before stating this result we need the definition of the mean (Gaussian) width of a set, which provides a measure of the ‘complexity’ or ‘size’ of a set $$\mathcal{T}$$.

**Definition 1.1** For a set $$\mathcal{T}\subset \mathbb{R}^{n}$$, the mean width $$\omega (\mathcal{T})$$ is defined as   \begin{align*} \omega(\mathcal{T})=\operatorname{\mathbb{E}}[\sup_{\boldsymbol{v}\in\mathcal{T}}\boldsymbol{g}^{T}\boldsymbol{v}].
\end{align*} Here, $$\boldsymbol{g}\in \mathbb{R}^{n}$$ is a Gaussian random vector distributed as $$\mathcal{N}(\mathbf{0},\boldsymbol{I}_{n})$$.

**Theorem 1.2** (Gordon’s escape through the mesh) Let δ ∈ (0, 1), let $$\mathcal{T}\subset \mathbb{R}^{n}$$ be a subset of the unit sphere ($$\mathcal{T}\subset \mathbb{S}^{n-1}$$) and let $$\boldsymbol{A}\in \mathbb{R}^{m\times n}$$ be a matrix with i.i.d. $$\mathcal{N}(0,1/{b_{m}^{2}})$$ entries, where $$b_{m}=\sqrt{2}\cdot \varGamma \left (\frac{m+1}{2}\right )/\varGamma \left (\frac{m}{2}\right )\approx \sqrt{m}$$ and Γ denotes the Gamma function. Then,   \begin{align} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}\right|\le \delta\left\|\boldsymbol{x}\right\|_{\ell_{2}}\!, \end{align} (1.2) holds for all $$\boldsymbol{x}\in \mathcal{T}$$ with probability at least $$1-2\mathrm{e}^{-\frac{\eta ^{2}}{2}}$$ as long as   \begin{align} m\ge\frac{\left(\omega(\mathcal{T})+\eta\right)^{2}}{\delta^{2}}. \end{align} (1.3)

We note that the JL Lemma for Gaussian matrices follows as a special case. Indeed, for a set $$\mathcal{T}$$ containing a finite number of points $$\left |\mathcal{T}\right |\le p$$, one can show that $$\omega (\mathcal{T})\le \sqrt{2\log p}$$, so that the minimal amount of dimension reduction m allowed by (1.3) is of the same order as in Lemma 1.1.

Recently, a line of research by Mendelson and collaborators [20,21,24,25] showed that inequality (1.2) continues to hold for matrices with independent sub-Gaussian rows (albeit at a loss in terms of the constants). More recently, in [28], Oymak & Tropp obtained the precise constants when the entries are i.i.d. sub-Gaussian. See also [12,34] for more recent results and applications. Related to this, Bourgain et al. [5] have shown that a result similar to Gordon’s theorem holds for certain ensembles of matrices with sparse entries.
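The tradeoff in (1.3) is easy to probe numerically. The sketch below (our illustration, not part of the original text) estimates the mean width of a finite point set by Monte Carlo and checks the distortion bound (1.2) for a Gaussian sketching matrix; for simplicity it uses the common $$1/\sqrt{m}$$ entry scaling in place of the exact $$1/b_{m}$$ normalization, which only changes constants.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_width(points, trials=2000):
    """Monte Carlo estimate of omega(T) = E[sup_{v in T} g^T v]
    for a finite set T given as the rows of `points`."""
    g = rng.standard_normal((trials, points.shape[1]))
    return (g @ points.T).max(axis=1).mean()

# T: p points on the unit sphere in R^n, target distortion delta.
n, p, delta, eta = 256, 50, 0.5, 2.0
T = rng.standard_normal((p, n))
T /= np.linalg.norm(T, axis=1, keepdims=True)

omega = mean_width(T)                              # ~ sqrt(2 log p) for p near-orthogonal points
m = int(np.ceil((omega + eta) ** 2 / delta ** 2))  # reduced dimension from (1.3)

A = rng.standard_normal((m, n)) / np.sqrt(m)       # i.i.d. N(0, 1/m) entries
distortion = np.abs(np.linalg.norm(T @ A.T, axis=1) - 1.0).max()
print(omega, m, distortion)                        # distortion <= delta with high probability
```

In this regime $$\omega (\mathcal{T})\approx \sqrt{2\log 50}\approx 2.8$$, so m lands near 90, far below n = 256.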
This paper develops an analogue of Gordon’s result for more structured matrices, particularly those that admit efficient multiplication. At the heart of our analysis is a theorem showing that matrices that preserve the Euclidean norm of sparse vectors (also known as Restricted Isometry Property (RIP) matrices), when multiplied by a random sign pattern, preserve the Euclidean norm of any set. Roughly stated, linear transforms that provide low-distortion embedding of sparse vectors also allow low-distortion embedding of any set! We believe that our result provides a rigorous justification for replacing ‘slow’ Gaussian matrices with ‘fast’ and computationally friendly matrices in many scientific and engineering applications. Indeed, in a companion paper [27] we utilize the results of this paper to develop sharp rates of convergence for various optimization problems involving such structured matrices. Our results imply faster algorithms for a variety of other problems including sparse and low-rank approximation from underdetermined samples [6,7], subspace embeddings [32] and sketched least-squares problems [30].

## 2. Isometric sketching of sparse vectors

To connect isometric sketching of sparse vectors to isometric sketching of general sets, we begin by defining the RIP. Roughly stated, the RIP ensures that a matrix preserves the Euclidean norm of sparse vectors up to a multiplicative distortion δ. This definition immediately implies that RIP matrices can be utilized for isometric sketching of sparse vectors.

**Definition 2.1** (RIP) A matrix $$\boldsymbol{A}\in \mathbb{R}^{m\times n}$$ satisfies the RIP with distortion δ > 0 at sparsity level s if, for all vectors x of sparsity at most s, we have   \begin{align} \left| \left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\leq \max(\delta,\delta^{2})\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!.
\end{align} (2.1) We shall use the short-hand RIP(δ, s) to denote this property.

This definition is essentially identical to the classical definition of the RIP [7]. The only difference is that we do not restrict δ to lie in the interval [0, 1]. Consequently, the correct dependence on δ on the right-hand side of (2.1) is $$\max (\delta ,\delta ^{2})$$. For the purposes of this paper we need a more refined notion of the RIP. More specifically, we need the RIP to hold simultaneously at different sparsity and distortion levels.

**Definition 2.2** (Multiresolution RIP (MRIP)) Let $$L=\lceil \log _{2} n\rceil$$. Given δ > 0 and a number s ≥ 1, for ℓ = 0, 1, 2, …, L, let $$(\delta _{\ell }, s_{\ell })=(2^{\ell /2}\delta ,\, 2^{\ell } s)$$ be a sequence of distortion and sparsity levels. We say a matrix $${\boldsymbol{A}}\in \mathbb{R}^{m\times n}$$ satisfies the MRIP with distortion δ > 0 at sparsity s if RIP(δℓ, sℓ) holds for all ℓ ∈ {0, 1, …, L}. More precisely, for vectors of sparsity at most sℓ ($$\left \|\boldsymbol{x} \right \|_{\ell _{0}}\le s_{\ell }$$) the sequence of inequalities   \begin{align} \left| \left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\leq \max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!, \end{align} (2.2) simultaneously holds for all ℓ ∈ {0, 1, …, L}. We shall use the short-hand MRIP(δ, s) to denote this property.

At the lowest scale ℓ = 0, this definition reduces to the standard RIP(δ, s) definition. Noting that $$s_{L}=2^{L}s\ge n$$, at the highest scale this condition requires   \begin{align*} \left| \left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\leq \max\left(\delta_{L},{\delta_{L}^{2}}\right)\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!, \end{align*} to hold for all vectors $$\boldsymbol{x}\in \mathbb{R}^{n}$$.
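To make the schedule of Definition 2.2 concrete, the following short sketch (ours, with illustrative values of n, s and δ) lists the $$(\delta _{\ell }, s_{\ell })$$ pairs and confirms that the top scale already constrains all of $$\mathbb{R}^{n}$$:

```python
import math

def mrip_levels(n, s, delta):
    """The (delta_l, s_l) = (2^{l/2} delta, 2^l s) schedule of
    Definition 2.2, for l = 0, 1, ..., L with L = ceil(log2 n)."""
    L = math.ceil(math.log2(n))
    return [(2 ** (l / 2) * delta, 2 ** l * s) for l in range(L + 1)]

levels = mrip_levels(n=1024, s=8, delta=0.1)
# Lowest scale is plain RIP(delta, s); at the top, s_L = 2^L s >= n,
# so RIP(delta_L, s_L) constrains every vector in R^n.
print(levels[0], levels[-1])
```

Note how the distortion budget loosens by $$\sqrt{2}$$ each time the sparsity level doubles; this geometric trade is what the chaining argument of Section 4 exploits.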
While this condition looks considerably more restrictive than the standard definition of the RIP, with proper scaling it can be easily satisfied by popular random matrix ensembles used for dimensionality reduction. These include dense matrices with i.i.d. sub-Gaussian entries, as well as structured matrices such as randomly subsampled Hadamard or Discrete Cosine Transform matrices. The latter are special cases of the Subsampled Orthogonal with Random Sign (SORS) matrices described in detail in Definition 3.2, for which the matrix-vector multiply can be computed in log-linear time.

## 3. From isometric sketching of sparse vectors to general sets

Our main result states that a matrix obeying the MRIP with the right distortion level $$\tilde{\delta }$$ can be used for embedding any subset $$\mathcal{T}$$ of $$\mathbb{R}^{n}$$.

**Theorem 3.1** For a set $$\mathcal{T}\subset \mathbb{R}^{n}$$, let $$\textit{rad}(\mathcal{T})=\sup _{\boldsymbol{v}\in \mathcal{T}} \left \|\boldsymbol{v}\right \|_{\ell _{2}}$$ be the maximum Euclidean norm of a point inside $$\mathcal{T}$$. Suppose the matrix $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ obeys the MRIP with sparsity and distortion levels   \begin{align} s=200(1+\eta)\quad\textrm{and}\quad \tilde{\delta}= \frac{\delta\cdot \textit{rad}(\mathcal{T})}{C\max\left( \textit{rad}(\mathcal{T}),\omega(\mathcal{T})\right)}, \end{align} (3.1) with C > 0 an absolute constant. Then, for a diagonal matrix D with an i.i.d. random sign pattern on the diagonal, the matrix A = HD obeys   \begin{align} \sup_{\boldsymbol{x}\in\mathcal{T}}|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}|\leq \max\left(\delta,\delta^{2}\right)\cdot\left( \textit{rad}(\mathcal{T})\right)^{2}\!, \end{align} (3.2) with probability at least $$1-\exp (-\eta )$$.
This theorem shows that if a matrix isometrically embeds sparse vectors at all scales, then it becomes suitable for isometric embedding of any set once multiplied by a random sign pattern. For random matrix ensembles that are commonly used for dimensionality reduction, the minimum dimension m for MRIP$$(\tilde{\delta },s)$$ to hold grows as $$m\sim \frac{s}{\tilde{\delta }^{2}}$$. In Theorem 3.1 we have s ∼ 1 and $$\tilde{\delta }\sim \frac{\delta }{\omega (\mathcal{T})}$$, so that the minimum dimension m for (3.2) to hold is of the order of $$m\sim \frac{\omega ^{2}(\mathcal{T})}{\delta ^{2}}$$. This is exactly the same scaling one would obtain by using Gaussian random matrices via Gordon’s lemma in (1.3). To see this more clearly, we now focus on applying Theorem 3.1 to random matrices obtained by subsampling a unitary matrix.

**Definition 3.2** (SORS matrices) Let $$\boldsymbol{F}\in \mathbb{R}^{n\times n}$$ denote an orthonormal matrix obeying   \begin{align} \boldsymbol{F}^{\ast}\ \boldsymbol{F} =\boldsymbol{ I }\quad\textrm{and}\quad\max_{i, \ j}\left|\boldsymbol{F}_{i j}\right|\le \frac{\varDelta}{\sqrt{n}}. \end{align} (3.3) Define the random subsampled matrix $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ with i.i.d. rows chosen uniformly at random from the rows of F. We then define the SORS measurement ensemble as $$\boldsymbol{A}=\sqrt{n/m}\,\boldsymbol{H}\boldsymbol{D}$$, where $$\boldsymbol{D}\in \mathbb{R}^{n\times n}$$ is a random diagonal matrix with the diagonal entries i.i.d. ± 1 with equal probability.

To simplify exposition, in the definition above we have focused on SORS matrices based on subsampled orthonormal matrices H with i.i.d. rows chosen uniformly at random from the rows of an orthonormal matrix F obeying (3.3). However, our results continue to hold for SORS matrices defined via a much broader class of random matrices H with i.i.d. rows chosen according to a probability measure on Bounded Orthonormal Systems. Please see [14, Section 12.1] for further details on such ensembles.
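As a concrete instance (our sketch, not from the paper), the orthonormal DCT-II can play the role of F in (3.3), with $$\varDelta =\sqrt{2}$$; the resulting SORS map $$\boldsymbol{x}\mapsto \sqrt{n/m}\,\boldsymbol{H}\boldsymbol{D}\boldsymbol{x}$$ costs $$\mathcal{O}(n\log n)$$ per application. We assume NumPy/SciPy here.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(1)

def sors(n, m, rng):
    """Sample x -> sqrt(n/m) * H D x as in Definition 3.2: H picks m i.i.d.
    uniform rows of the orthonormal DCT matrix F, and D has i.i.d. +/-1
    entries on its diagonal. Returns a callable computing A @ x in
    O(n log n) time (the m x n matrix A is never formed)."""
    rows = rng.integers(0, n, size=m)          # i.i.d. uniform row indices
    signs = rng.choice([-1.0, 1.0], size=n)    # diagonal of D
    return lambda x: np.sqrt(n / m) * dct(signs * x, norm="ortho")[rows]

n, m = 1024, 128
A = sors(n, m, rng)
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
err = abs(np.linalg.norm(A(x)) ** 2 - 1.0)     # | ||Ax||^2 - ||x||^2 |
print(err)                                     # small with high probability
```

The random sign flip is essential: without D, a vector aligned with a few DCT rows could be badly distorted by the subsampling.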
By utilizing results on the RIP of subsampled orthogonal random matrices obeying (3.3), we can show that the MRIP holds at the sparsity and distortion levels required by (3.1). Therefore, Theorem 3.1 immediately implies a result similar to Gordon’s lemma for SORS matrices.

**Theorem 3.3** Let $$\mathcal{T}\subset \mathbb{R}^{n}$$ and suppose $${\boldsymbol{A}}\in \mathbb{R}^{m\times n}$$ is selected from the SORS distribution of Definition 3.2. Then,   \begin{align} \sup_{\boldsymbol{x}\in\mathcal{T}}|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}|\leq \max\{\delta,\delta^{2}\}\cdot \left( \textit{rad}(\mathcal{T})\right)^{2}\!, \end{align} (3.4) holds with probability at least $$1-2\mathrm{e}^{-\eta }$$ as long as   \begin{align} m\geq C\varDelta^{2}(1+\eta)^{2}(\log n)^{4}\ \frac{\max\left(1,\frac{\omega^{2}(\mathcal{T})}{\left( \textit{rad}(\mathcal{T})\right)^{2}}\right)}{\delta^{2}}. \end{align} (3.5)

We would like to point out that one can improve the dependence on η, and potentially replace a few $$\log n$$ factors with $$\log \left (\omega (\mathcal{T})\right )$$, by utilizing improved RIP bounds such as [9,11,31]. We also note that any future result that reduces log factors in the sample complexity of the RIP will automatically improve the bound on m in our results. In fact, after the first version of this manuscript became available there has been a very interesting reduction of log factors in the required sample complexity of the RIP by Haviv & Regev [17] (see also a related earlier improved RIP result of Bourgain [4], brought to our attention by Jelani Nelson). We believe that utilizing this new RIP result it may be possible to improve the bound in (3.5) to   \begin{align} m\geq C\varDelta^{2}(1+\eta)^{2}(\log \omega(\mathcal{T}))^{2}\log n\ \frac{\max\left(1,\frac{\omega^{2}(\mathcal{T})}{\left( \textit{rad}(\mathcal{T})\right)^{2}}\right)}{\delta^{2}}.
\end{align} (3.6) Unfortunately, (3.6) does not trivially follow from the results in [17]. The reason is twofold: (1) the results of [17] are based on the more classical definition of the RIP (without the $$\max (\delta ,\delta ^{2})$$ of (2.1)); and (2) the dependence on the distortion level δ in terms of sample complexity is not of the form $$1/\delta ^{2}$$, but rather of the slightly weaker form $$\frac{\log ^{4}(1/\delta )}{\delta ^{2}}$$, which holds for sufficiently small δ. Closing this gap is an interesting future research direction.

Ignoring constant and logarithmic factors, Theorem 3.3 is an exact analogue of Gordon’s lemma for Gaussian matrices in terms of the tradeoff between the reduced dimension m and the distortion level δ. Gordon’s result for Gaussian matrices has been utilized in numerous applications; Theorem 3.3 allows one to replace Gaussian matrices with SORS matrices in such problems. For example, Chandrasekaran et al. [8] use Gordon’s lemma to obtain near-optimal sample complexity bounds for linear inverse problems involving Gaussian matrices. An immediate application of Theorem 3.3 yields near-optimal sample complexity results using SORS matrices. To the best of our knowledge, this is the first sample-optimal result using a matrix with a fast multiply. We refer the reader to our companion paper for further detail [27].

Theorem 3.3 holds for all sets $$\mathcal{T}$$, while using matrices that have fast multiplication. We would like to pause to mention a few interesting results that hold under additional assumptions on the set $$\mathcal{T}$$. Perhaps the first results of this kind were established for the RIP in [7,31], where the set $$\mathcal{T}$$ is the set of vectors of a certain sparsity level. Krahmer & Ward established a JL-type embedding for RIP matrices with columns multiplied by a random sign pattern [22]. That is, the authors show that Theorem 3.3 holds when $$\mathcal{T}$$ is a finite point cloud.
More recently, in [37] the authors show that a Gordon-type embedding result holds for manifold signals using RIP matrices whose columns are multiplied by a random sign pattern. All of these interesting results on the embedding of finite point sets serve as precursors to our results. Indeed, [22] plays a crucial role in our proof. A practical contribution of our work is that SORS matrices can be used to embed any set, which not only unifies the existing set-specific results, but also implies new optimal embedding bounds for several tasks including low-rank approximation, least-squares sketching, and group-sparse and dictionary-sparse signal modeling.

Earlier, we mentioned the very interesting paper of Bourgain et al. [5], which establishes a result in the spirit of Theorem 3.3 for sparse matrices. Indeed, [5] shows that for certain random matrices with sparse columns the dependence of the minimum dimension m on the mean width $$\omega (\mathcal{T})$$ and distortion δ is of the form $$m\gtrsim \frac{\omega ^{2}(\mathcal{T})}{\delta ^{2}}\,\textrm{polylog}\big(\frac{n}{\delta }\big)$$. In this result, the sparsity level of the columns of the matrix (and in turn the computational complexity of the dimension reduction scheme) is controlled by a parameter which characterizes the spikiness of the set $$\mathcal{T}$$. In addition, the authors of [5] also establish results for particular sets $$\mathcal{T}$$ using Fast JL matrices, e.g. see [5, Section 6.2]. Recently, Pilanci & Wainwright [29] have established a result of a similar flavor to Theorem 3.3, but with a suboptimal tradeoff between the allowed dimension reduction and the complexity of the set $$\mathcal{T}$$. Roughly stated, this result requires $$m\gtrsim \left (\log n\right )^{4}\frac{\omega ^{4}(\mathcal{T})}{\delta ^{2}}$$ using a subsampled Hadamard matrix combined with a diagonal matrix of i.i.d. Rademacher random variables.
We would like to point out that our proofs also hint at an alternative proof strategy to that of [29] if one is interested in establishing $$m\gtrsim (\log n)^{4}\frac{\omega ^{4}(\mathcal{T})}{\delta ^{2}}$$. In particular, one can cover the set $$\mathcal{T}$$ with Euclidean balls of size δ. By Sudakov’s inequality, the logarithm of the size of this cover is at most $$\frac{\omega ^{2}(\mathcal{T})}{\delta ^{2}}$$. One can then relate this cover to a cover obtained by using a random pseudo-metric such as the one defined in [31]. As a result one incurs an additional factor $$(\log n)^{4}\omega ^{2}(\mathcal{T})$$. Multiplying these two factors leads to the requirement $$m\gtrsim (\log n)^{4}\frac{\omega ^{4}(\mathcal{T})}{\delta ^{2}}$$.

## 4. Proofs

Before we move to the proof of the main theorem, we begin by stating known results on the RIP for bounded orthogonal systems and show how Theorem 3.3 follows from our main theorem (Theorem 3.1).

### 4.1. Proof of Theorem 3.3 for SORS matrices

We first state a classical result on the RIP originally due to Rudelson & Vershynin [31,35]. We state the version in [14], which holds generally for bounded orthogonal systems. We remark that the results in [31,35], as well as those of [14], are stated for the regime δ < 1. However, by going through the analysis contained in these papers carefully, one can confirm that our definition of the RIP (with $$\max (\delta ,\delta ^{2})$$ on the right-hand side in lieu of δ) continues to hold for δ ≥ 1.

**Lemma 4.1** (RIP for sparse signals, [14,31,35]) Let $$\boldsymbol{F}\in \mathbb{R}^{n\times n}$$ denote an orthonormal matrix obeying   \begin{align} \boldsymbol{F}^{\ast}\boldsymbol{F}=\boldsymbol{I}\quad\textrm{and}\quad\max_{i, j}\left|\boldsymbol{F}_{ij}\right|\le \frac{\varDelta}{\sqrt{n}}. \end{align} (4.1) Define the random subsampled matrix $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ with i.i.d. rows chosen uniformly at random from the rows of F.
Then RIP(δ, s) holds with probability at least $$1-\mathrm{e}^{-\eta }$$ for all δ > 0 as long as   \begin{align*} m\ge C\varDelta^{2}\frac{s\left(\log^{3} n\log m+\eta\right)}{\delta^{2}}. \end{align*} Here C > 0 is a fixed numerical constant.

Applying the union bound over $$L=\lceil \log _{2} n\rceil$$ sparsity levels and using the change of variable $$\eta \rightarrow \eta +\log L$$, together with the fact that $$(\log n)^{4}+\eta \le (1+\eta )(\log n)^{4}$$, Lemma 4.1 immediately leads to the following lemma.

**Lemma 4.2** Consider $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ distributed as in Lemma 4.1. Then H obeys the MRIP with sparsity s and distortion $$\tilde{\delta }>0$$ with probability at least $$1-\mathrm{e}^{-\eta }$$ as long as   \begin{align*} m\geq C(1+\eta)\varDelta^{2}\frac{s(\log n)^{4}}{\tilde{\delta}^{2}}\nonumber. \end{align*} Theorem 3.3 now follows by using s = C(1 + η) and $$\tilde{\delta }=\frac{\delta }{C\max \left (1,\frac{\omega (\mathcal{T})}{\textrm{rad}(\mathcal{T})}\right )}$$ in Theorem 3.1.

### 4.2. Connection between JL embedding and RIP

A critical tool in our proof is a powerful result of Krahmer & Ward [22], which shows that RIP matrices with columns multiplied by a random sign pattern obey the JL Lemma.

**Theorem 4.3** (Discrete JL embedding via RIP, [22]) Assume $$\mathcal{T}\subset \mathbb{R}^{n}$$ is a finite set of points. Suppose $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ is a matrix satisfying RIP(δ, s) with sparsity s and distortion δ > 0 obeying   \begin{align*} {s\ge\min\left(40(\log\left(4|\mathcal{T}|\right)+\eta),n\right)\quad\textrm{and}\quad 0<\delta\leq \frac{\varepsilon}{4},} \end{align*} and let $$\boldsymbol{D}\in \mathbb{R}^{n\times n}$$ be a random diagonal matrix with the diagonal entries i.i.d. ± 1 with equal probability.
Then the matrix A = HD obeys   \begin{align} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\leq \max(\varepsilon,\varepsilon^{2})\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!, \end{align} (4.2) simultaneously for all $$\boldsymbol{x}\in \mathcal{T}$$ with probability at least $$1-\mathrm{e}^{-\eta }$$.

As above, the result stated in [22] restricts ε to [0, 1] and has ε rather than $$\max (\varepsilon ,\varepsilon ^{2})$$ on the right-hand side of inequality (4.2). However, it is easy to verify that their proof (with essentially no modifications) accommodates the result stated above.

### 4.3. Connecting JL to Gordon (overview of the proof of Theorem 3.1)

Before we provide a complete proof of Theorem 3.1, in this section we wish to give a high-level description of our proof. The full proof can be found in Section 4.5, with some details deferred to the Appendix. For simplicity, let us focus on the case where rad$$(\mathcal{T})=1$$. The main aim of Theorem 3.1 is to prove the bound   \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le\max(\delta,\delta^{2}), \end{align} (4.3) for all $$\boldsymbol{x}\in \mathcal{T}$$. A natural way to establish this bound is a covering number argument. Let us explain why this approach fails and then how we can fix it. In the covering number approach we cover the set $$\mathcal{T}$$ with balls of size ε, as depicted in Fig. 1(a). We denote the centers of these balls by $$\mathcal{N}$$.
We then try to prove the bound in (4.3) by first controlling $$\big |\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}\big |$$ on the cover via the Discrete JL embedding result of Theorem 4.3, and then controlling how much $$\big |\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}\big |$$ deviates from $$\big |\left \|{\boldsymbol{A}}\boldsymbol{z}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{z}\right \|_{\ell _{2}}^{2}\big |$$ for the point $$\boldsymbol{z}\in \mathcal{N}$$ inside the cover closest to x. This proof strategy fails because, for the deviation term to be smaller than $$\frac{1}{2}\max (\delta ,\delta ^{2})$$, ε has to be very small. In particular, for most matrices that obey the RIP the spectral norm of A roughly scales with $$\sqrt{\frac{n}{m}}$$, so that ε must be on the order of $$\frac{\delta }{\|\boldsymbol{A}\|}\sim \sqrt{\frac{m}{n}}\delta$$, where ∼/$$\gtrsim$$ denote equality/inequality up to a fixed numerical constant. This in turn means that the size of the cover needs to be very large. More specifically, applying the Sudakov inequalities (e.g. see [36, Theorem 2.2]) we have $$\left |\mathcal{N}\right |\lesssim 2^{\frac{n}{m}\frac{\omega ^{2}(\mathcal{T})}{\delta ^{2}}}$$. Now combining Lemma 4.1 with Theorem 4.3 implies that, to achieve a distortion of the order of $$\frac{1}{2}\max (\delta ,\delta ^{2})$$ on the cover, the reduced dimension m must obey   \begin{align*} m\gtrsim \frac{\log |\mathcal{N}|}{\delta^{2}}\log^{4} n \sim \frac{n}{m}\frac{\omega^{2}(\mathcal{T})}{\delta^{4}}\log^{4} n\quad\Leftrightarrow\quad m\gtrsim \sqrt{n}\frac{\omega(\mathcal{T})}{\delta^{2}}\log^{2} n, \end{align*} which is far from optimal.
To overcome this deficiency we use successively larger discrete sets $$\mathcal{T}_{0},\mathcal{T}_{1},\mathcal{T}_{2},\ldots ,\mathcal{T}_{L}$$ to approximate the set $$\mathcal{T}$$. For a point $$\boldsymbol{x}\in \mathcal{T}$$, let $$\boldsymbol{z}_{\ell }$$ denote the closest point of $$\mathcal{T}_{\ell }$$ to x. We obtain the desired bound by utilizing a telescoping sum of the form   \begin{align*} \big|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\big|\le &\left(\big|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\big|-\big|\left\|{\boldsymbol{A}}\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\big|\right)\\ &+\sum_{\ell=1}^{L} \left(\big|\left\|{\boldsymbol{A}}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\big|-\big|\left\|{\boldsymbol{A}}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\big|\right) +\big|\left\|{\boldsymbol{A}}\boldsymbol{z}_{0}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{0}\right\|_{\ell_{2}}^{2}\big|. \end{align*}

Fig. 1. The figure on the left shows a standard covering of the set $$\mathcal{T}$$ with balls of radius ε, with the points in black depicting the centers of the cover. The figure on the right depicts the centers of finer and finer covers of the set $$\mathcal{T}$$, shown as black ($$\boldsymbol{z}_{0}$$), green ($$\boldsymbol{z}_{1}$$) and orange ($$\boldsymbol{z}_{2}$$) points. In order to bound $$\big |\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}\big |$$, we bound deviations along the path connecting the closest points to x from these successive covers; in the figure on the right these points are denoted by $$\boldsymbol{z}_{0}$$, $$\boldsymbol{z}_{1}$$ and $$\boldsymbol{z}_{2}$$.

We then apply the discrete JL result of Theorem 4.3 to bound each of these terms separately. We pick the size of the successive approximations to be $$|\mathcal{T}_{\ell }|=2^{2^{\ell }}$$, and we bound the different deviation terms at different distortion levels $$\big(\textrm{roughly of the order of} \ \delta _{\ell }=2^{\ell /2}\frac{\delta }{\omega (\mathcal{T})}\big)$$. The key point in our proofs is that the sizes of the successive approximations and the distortion levels are chosen carefully to balance each other out. That is, the reduction in dimension m will take the form   \begin{align*} m\gtrsim\underset{\ell=1,2,\ldots,L}{\max}\ \frac{\log |\mathcal{T}_{\ell}|}{\delta_{\ell}^{2}}\log^{4} n=\underset{\ell=1,2,\ldots,L}{\max}\ \frac{2^{\ell}\log 2}{2^{\ell} \frac{\delta^{2}}{\omega^{2}(\mathcal{T})}}\log^{4} n\sim \log^{4} n\,\frac{\omega^{2}(\mathcal{T})}{\delta^{2}}, \end{align*} which is essentially the result we are interested in. All of this will be made completely rigorous in the coming sections by using a generic chaining style argument.

### 4.4. Generic chaining related notations and definitions

Our proof makes use of the machinery of generic chaining, e.g. see [33]. We gather some of the required definitions and notations in this section.
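As a sanity check of the balancing computation at the end of the proof overview, the following sketch (ours, with arbitrary illustrative values of $$\omega (\mathcal{T})$$, δ and n) confirms that the per-level cost $$\log |\mathcal{T}_{\ell }|/\delta _{\ell }^{2}$$ is the same at every scale:

```python
import math

omega, delta, n = 10.0, 0.25, 4096
L = math.ceil(math.log2(n))

# log|T_l| / delta_l^2 with |T_l| = 2^{2^l} and delta_l = 2^{l/2} * delta / omega:
costs = [(2 ** l * math.log(2)) / (2 ** l * delta ** 2 / omega ** 2)
         for l in range(1, L + 1)]

# The 2^l factors cancel, so every level costs ~ omega^2 / delta^2.
print(costs[0], math.log(2) * omega ** 2 / delta ** 2)
```

This cancellation is exactly why the doubly exponential cover sizes $$2^{2^{\ell }}$$ are paired with the geometrically growing distortions $$2^{\ell /2}\delta$$.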
Define $$N_{0}=1$$ and $$N_{\ell }=2^{2^{\ell }}$$ for ℓ ≥ 1.

**Definition 4.4** (Admissible sequence, [33]) Given a set $$\mathcal{T}$$, an admissible sequence is an increasing sequence ($$\mathcal{A}_{\ell }$$) of partitions of $$\mathcal{T}$$ such that $$\left |\mathcal{A}_{\ell }\right |\le N_{\ell }$$.

Following [33], an increasing sequence of partitions here means that every set of $$\mathcal{A}_{\ell +1}$$ is contained in a set of $$\mathcal{A}_{\ell }$$, and $$\mathcal{A}_{\ell }(t)$$ denotes the unique element of $$\mathcal{A}_{\ell }$$ that contains t. The γ2 functional is then defined as   \begin{align*} \gamma_{2}(\mathcal{T})=\inf\underset{t}{\sup}\sum_{\ell=0}^{\infty} 2^{\ell/2}\textrm{rad}(\mathcal{A}_{\ell}(t)),\nonumber \end{align*} where the infimum is taken over all admissible sequences. Let $$\bar{\mathcal{A}}_{\ell }$$ be one such optimal admissible sequence. Based on this sequence we define successive covers.

**Definition 4.5** (Successive covers) Define the center point of a set to be the center of the smallest ball containing that set. Using $$\bar{\mathcal{A}}_{\ell }$$ we construct successive covers $$\mathcal{T}_{\ell }$$ of $$\mathcal{T}$$ by taking the center point of each set of $$\bar{\mathcal{A}}_{\ell }$$.

Let $$e_{\ell }(\boldsymbol{v})$$ be the associated distortion of the cover with respect to a point v, i.e. $$e_{\ell }(\boldsymbol{v})=\textrm{dist}(\boldsymbol{v},\mathcal{T}_{\ell })$$. Then for all $$\boldsymbol{v}\in \mathcal{T}$$, the γ2 functional obeys   \begin{align*} \sum_{\ell=0}^{\infty} 2^{\ell/2}e_{\ell}(\boldsymbol{v}) \leq \gamma_{2}(\mathcal{T}).\nonumber \end{align*} It is well known that $$\gamma _{2}(\mathcal{T})$$ and the mean width $$\omega (\mathcal{T})$$ are of the same order.
More precisely, for a fixed numerical constant C,  \begin{align*} C^{-1}\omega(\mathcal{T})\leq \gamma_{2}(\mathcal{T})\leq C\omega(\mathcal{T}).\nonumber \end{align*} Given the distortion δ in the statement of Theorem 3.1, we also define different scales of distortion   \begin{align*} \delta_{0}=\delta,\,\delta_{1}=2^{1/2}\delta,\,\dots,\,\delta_{L}=2^{L/2}\delta\nonumber, \end{align*} with $$L=\lceil \log _{2} n\rceil$$.

### 4.5. Proof of Theorem 3.1

Without loss of generality we assume that rad$$(\mathcal{T})=1$$. We begin by noting that the MRIP, combined with the powerful JL-embedding result stated in Theorem 4.3, allows for JL embedding at different distortion levels. We apply such an argument to successively more refined covers of the set $$\mathcal{T}$$, at different distortion scales, inside a generic chaining type argument to arrive at the proof for an arbitrary (and potentially continuous) set $$\mathcal{T}$$.

We should point out that one can also follow an alternative approach which leads to the same conclusion. Instead of using the multiresolution RIP, we could have defined a ‘multiresolution embedding property’ for the mapping A that isometrically maps a finite set of points $$\mathcal{T}$$ with a near-optimal cardinality-distortion tradeoff at varying levels. One can show that this property also implies isometric embedding of a continuous set $$\mathcal{T}$$.

We begin by stating a lemma which shows isometric embedding, as well as a few other properties, for points belonging to the refined covers $$\mathcal{T}_{\ell }$$ at the different distortion levels $$\delta _{\ell }$$. The proof of this lemma is deferred to Appendix A.

**Lemma 4.6** Suppose $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ obeys MRIP$$\big(\frac{\delta }{4},s\big)$$ with distortion level δ and sparsity s = 200(1 + η). Furthermore, let $$\boldsymbol{D}\in \mathbb{R}^{n\times n}$$ be a diagonal matrix with a random i.i.d. sign pattern on the diagonal and set A = HD.
Also let $$\mathcal{T}_{\ell }$$ be successive refinements of the set $$\mathcal{T}$$ from Definition 4.5. Then, with probability at least $$1-\exp (-\eta )$$, the following inequalities hold simultaneously for all ℓ = 1, 2, …, L. (i) For all $$\boldsymbol{v}\in \mathcal{T}_{\ell -1}\cup \mathcal{T}_{\ell }\cup (\mathcal{T}_{\ell }-\mathcal{T}_{\ell -1})$$,   \begin{align} \left\|{\boldsymbol{A}}\boldsymbol{v}\right\|_{\ell_{2}}\leq \left(1+2^{\ell/2}\delta\right)\left\|\boldsymbol{v}\right\|_{\ell_{2}}\!. \end{align} (4.4) (ii) For all $$\boldsymbol{v}\in \mathcal{T}_{\ell -1}\cup \mathcal{T}_{\ell }\cup (\mathcal{T}_{\ell }-\mathcal{T}_{\ell -1})$$,   \begin{align} |\left\|{\boldsymbol{A}}\boldsymbol{v}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{v}\right\|_{\ell_{2}}^{2}|\leq \max\big(2^{\ell/2}\delta,2^{\ell}\delta^{2}\big)\cdot\left\|\boldsymbol{v}\right\|_{\ell_{2}}^{2}\!.\end{align} (4.5) (iii) For all $$\boldsymbol{u}\in \mathcal{T}_{\ell -1}$$ and $$\boldsymbol{v}\in \mathcal{T}_{\ell }-\{\boldsymbol{u}\}:=\{\boldsymbol{y}-\boldsymbol{u}: \boldsymbol{y}\in \mathcal{T}_{\ell }\}$$,   \begin{align} \left|\boldsymbol{u}^{\ast}{\boldsymbol{A}}^{\ast}{\boldsymbol{A}}\boldsymbol{v}-\boldsymbol{u}^{\ast}\boldsymbol{v}\right|\leq \max\big(2^{\ell/2}\delta,2^{\ell}\delta^{2}\big)\cdot\left\|\boldsymbol{u}\right\|_{\ell_{2}}\left\|\boldsymbol{v}\right\|_{\ell_{2}}\!. \end{align} (4.6) With this lemma in place we are ready to prove our main theorem. To this aim, given a point $$\boldsymbol{x}\in \mathcal{T}$$, for ℓ = 0, 1, …, L let zℓ be the closest neighbor of x in $$\mathcal{T}_{\ell }$$. We also define zL+1 = x. We note that zℓ depends on x. For ease of presentation we do not make this dependence explicit. We also drop x from the distortion term eℓ(x) and simply use eℓ.
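As an illustration of the construction in Lemma 4.6, the sketch below forms A = HD with a random sign matrix D. Since a structured RIP matrix H is not at hand here, a scaled Gaussian matrix serves as a hedged stand-in, for which the near-isometry on a fixed vector can be observed empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 256, 128
# hedged stand-in: a scaled Gaussian matrix in place of a structured RIP matrix H
H = rng.standard_normal((m, n)) / np.sqrt(m)
D = np.diag(rng.choice([-1.0, 1.0], size=n))   # random i.i.d. sign pattern
A = H @ D                                       # the map A = HD of Lemma 4.6

v = rng.standard_normal(n)
ratio = np.linalg.norm(A @ v) / np.linalg.norm(v)   # concentrates around 1
```

Since D is orthogonal, ‖A‖ = ‖H‖, and the squared-norm ratio concentrates around 1 at rate roughly $$1/\sqrt{m}$$ for this Gaussian stand-in.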
Now observe that for all ℓ = 1, 2, …, L, we have   \begin{align} \left\|{\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1}\right\|_{\ell_{2}}\leq \left\|{\boldsymbol{z}}_{\ell}-\boldsymbol{x}\right\|_{\ell_{2}}+\left\|{\boldsymbol{z}}_{\ell-1}-\boldsymbol{x}\right\|_{\ell_{2}}\leq e_{\ell}+e_{\ell-1}\leq 2e_{\ell-1}. \end{align} (4.7) We are interested in bounding $$|\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}|$$ for all $$\boldsymbol{x}\in \mathcal{T}$$. Define $$\tilde{L}=\max \left (0,\lfloor 2\log _{2}\left (\frac{1}{\delta }\right )\rfloor \right )$$, and note that applying the triangle inequality, we have   \begin{align} |\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}|\le&\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{A}}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|\nonumber\\ \le&\sum_{\ell=1}^{\tilde{L}}\left(\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\right)\nonumber\\ &+\left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{A}}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|\boldsymbol{A}\boldsymbol{z}_{0}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{0}\right\|_{\ell_{2}}^{2}\right|\!.
\end{align} (4.8) First note that by Lemma 4.6  \begin{align*}\left |\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{0}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{z}}_{0}\right\|_{\ell_{2}}^{2}\right|\leq \max\left(\delta,\delta^{2}\right)\left\|{\boldsymbol{z}}_{0}\right\|_{\ell_{2}}^{2}\leq \max\left(\delta,\delta^{2}\right)\!. \end{align*} Using the above inequality in (4.8), we arrive at   \begin{align}\left |\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le&\sum_{\ell=1}^{\tilde{L}}\left(\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\right)\nonumber\\ &+\left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{A}}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\max\left(\delta,\delta^{2}\right)\!. \end{align} (4.9) Lemma 4.7, whose proof is deferred to Appendix B, utilizes Lemma 4.6 to bound each of the first three terms in (4.9). Before getting into the details of these bounds, we would like to point out that (4.9), as well as the results presented in Lemma 4.7 below, are derived under the assumption that $$\tilde{L}\le L$$. A proper modification allows us to bound $$|\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}|$$ even when $$\tilde{L}> L$$. We explain this argument in complete detail in Appendix C.
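The threshold $$\tilde{L}$$ separates the two regimes used throughout the remaining bounds: for ℓ ≤ $$\tilde{L}$$ the scale $$2^{\ell /2}\delta$$ is at most 1, while for ℓ ≥ $$\tilde{L}$$ it is at least $$1/\sqrt{2}$$. Since the scale is increasing in ℓ, checking both claims at ℓ = $$\tilde{L}$$ suffices; a quick numerical check:

```python
import math

def L_tilde(delta):
    """The threshold max(0, floor(2*log2(1/delta))) from the proof."""
    return max(0, math.floor(2 * math.log2(1 / delta)))

# at ell = L_tilde the scale 2^{ell/2}*delta lies in [1/sqrt(2), 1]; since the
# scale increases with ell, this pins down both regimes at once
in_range = []
for delta in (0.9, 0.3, 0.1, 0.01):
    scale = 2 ** (L_tilde(delta) / 2) * delta
    in_range.append(1 / math.sqrt(2) <= scale <= 1)
```

These are exactly the facts invoked in Sections B.1 and B.3 below.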
Lemma 4.7 Under the assumptions of Theorem 3.1 the following three inequalities hold   \begin{align} \sum_{\ell=1}^{\tilde{L}}\left(\left|\|\boldsymbol{A}\boldsymbol{z}_{\ell}\|_{\ell_{2}}^{2}-\|\boldsymbol{z}_{\ell}\|_{\ell_{2}}^{2}\right|-\left|\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\|_{\ell_{2}}^{2}-\|\boldsymbol{z}_{\ell-1}\|_{\ell_{2}}^{2}\right|\right)\le 10\sqrt{2}\delta\gamma_{2}(\mathcal{T}), \end{align} (4.10)  \begin{align} \left|\|{\boldsymbol{A}}\boldsymbol{x}\|_{\ell_{2}}^{2}-\|{\boldsymbol{A}}\boldsymbol{z}_{\tilde{L}}\|_{\ell_{2}}^{2}\right|+\left|\|\boldsymbol{x}\|_{\ell_{2}}^{2}-\|\boldsymbol{z}_{\tilde{L}}\|_{\ell_{2}}^{2}\right|\le 32\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})+16\sqrt{2}\delta\gamma_{2}(\mathcal{T}) \end{align} (4.11) and   \begin{align} \left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|\le4\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})+4\sqrt{2}\delta\gamma_{2}(\mathcal{T}). \end{align} (4.12) We now utilize this lemma to complete the proof of the theorem. 
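Before combining the bounds, the constant bookkeeping of the next step can be verified numerically: writing x for $$\delta \gamma _{2}(\mathcal{T})$$, the coefficients from (4.10), (4.11) and (4.12) sum to $$36x^{2}+30\sqrt{2}\,x$$, which is dominated by $$79\cdot \max (x,x^{2})$$ for all x > 0.

```python
import math

# coefficients collected from (4.10)-(4.12): linear terms 10*sqrt(2) +
# 16*sqrt(2) + 4*sqrt(2) and quadratic terms 32 + 4, with x = delta*gamma_2(T)
lin = 30 * math.sqrt(2)     # ~42.43
quad = 36.0
dominated = all(quad * x ** 2 + lin * x <= 79 * max(x, x ** 2)
                for x in (1e-3, 0.1, 0.5, 1.0, 2.0, 100.0))
```

The domination holds because for x ≤ 1 both terms are at most (36 + 30√2)x ≈ 78.43x, and for x > 1 at most the same multiple of x².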
We plug in the bounds from (4.10), (4.11) and (4.12) into (4.9) and use the fact that $$\gamma _{2}(\mathcal{T})\le C\omega (\mathcal{T})$$ for a fixed numerical constant C, to conclude that for $$\tilde{L}\le L$$, we have   \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|&\le 10\sqrt{2}\delta\gamma_{2}(\mathcal{T})+32\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})+16\sqrt{2}\delta\gamma_{2}(\mathcal{T})+4\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})\nonumber\\ &\quad +4\sqrt{2}\delta\gamma_{2}(\mathcal{T})+\max(\delta,\delta^{2})\nonumber\\ &\le 36\delta^{2}C^{2}\omega^{2}(\mathcal{T})+30\sqrt{2}C\delta\omega(\mathcal{T})+\max(\delta,\delta^{2})\nonumber\\ &\le 79\cdot\max\left(C\delta\omega(\mathcal{T}),C^{2}\delta^{2}\omega^{2}(\mathcal{T})\right)+\max(\delta,\delta^{2})\nonumber\\ &\le 80\cdot\max\left(C\delta\left(\max(1,\omega(\mathcal{T}))\right),C^{2}\delta^{2 }\left(\max(1,\omega(\mathcal{T}))\right)^{2}\right)\!. \end{align} (4.13) We thus conclude that for all $$\boldsymbol{x}\in \mathcal{T}$$  \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le 80\cdot\max\left(C\delta\left(\max(1,\omega(\mathcal{T}))\right),C^{2}\delta^{2}\left(\max(1,\omega(\mathcal{T}))\right)^{2}\right)\! . \end{align} (4.14) Note that, assuming MRIP$$\big(s,\frac{\delta }{4}\big)$$ with s = 200(1 + η), we have arrived at (4.14). Applying the change of variable   \begin{align*} \delta\rightarrow\frac{\delta}{320C\max\left(1,\omega(\mathcal{T})\right)}, \end{align*} we can conclude that under the stated assumptions of the theorem, for all $$\boldsymbol{x}\in \mathcal{T}$$  \begin{align*} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le\max(\delta,\delta^{2}), \end{align*} completing the proof. Funding B.R.
is generously supported by ONR awards N00014-11-1-0723 and N00014-13-1-0129, NSF awards CCF-1148243 and CCF-1217058, AFOSR award FA9550-13-1-0138 and a Sloan Research Fellowship. S.O. was generously supported by the Simons Institute for the Theory of Computing and NSF award CCF-1217058. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018 and DARPA XData Award FA8750-12-2-0331. Acknowledgements The authors are thankful for the gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc., Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware, and also want to thank Ahmed El Alaoui for a careful reading of the manuscript. We also thank Sjoerd Dirksen for helpful comments and for pointing us to some useful references on generalizing Gordon’s result to matrices with sub-Gaussian entries. We also thank Jelani Nelson for a careful reading of this paper, very helpful comments/insights and for pointing us to the improved RIP result of Bourgain [4], and Mien Wang for noticing that the telescoping sum is not necessary at the beginning of Section 4.4.3. We would also like to thank Christopher J. Rozell for bringing the paper [37] on stable and efficient embedding of manifold signals to our attention. References 1. Achlioptas, D. (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66, 671--687. 2. Ailon, N. & Liberty, E. (2013) An almost optimal unrestricted fast Johnson-Lindenstrauss transform. ACM Trans. Algorithms (TALG), 9, 21. 3. Ailon, N. & Rauhut, H. (2014) Fast and RIP-optimal transforms. Discrete Comput. Geometry, 52, 780--798. 4. Bourgain, J.
(2014) An improved estimate in the Restricted Isometry problem. In Geometric Aspects of Functional Analysis. Cham: Springer, pp. 65--70. 5. Bourgain, J., Dirksen, S. & Nelson, J. (2013) Toward a unified theory of sparse dimensionality reduction in Euclidean space. ArXiv preprint, arXiv:1311.2542. 6. Candes, E. & Recht, B. (2012) Exact matrix completion via convex optimization. Commun. ACM, 55, 111--119. 7. Candes, E. J. & Tao, T. (2005) Decoding by linear programming. IEEE Trans. Info. Theory, 51, 4203--4215. 8. Chandrasekaran, V., Recht, B., Parrilo, P. A. & Willsky, A. S. (2012) The convex geometry of linear inverse problems. Foundations of Comput. Mathematics, 12, 805--849. 9. Cheraghchi, M., Guruswami, V. & Velingker, A. (2013) Restricted isometry of Fourier matrices and list decodability of random linear codes. SIAM J. Comput., 42, 1888--1914. 10. Dasgupta, S. & Gupta, A. (2003) An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22, 60--65. 11. Dirksen, S. (2013) Tail bounds via generic chaining. ArXiv preprint, arXiv:1309.3522. 12. Dirksen, S. (2014) Dimensionality reduction with subgaussian matrices: a unified theory. ArXiv preprint, arXiv:1402.3973. 13. Do, T. T., Gan, L., Chen, Y., Nguyen, N. & Tran, T. D. (2009) Fast and efficient dimensionality reduction using structurally random matrices. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 1821--1824. 14. Foucart, S. & Rauhut, H. (2013) Random sampling in bounded orthonormal systems. A Mathematical Introduction to Compressive Sensing. Basel: Birkhäuser Springer, pp. 367--433. 15. Frankl, P. & Maehara, H. (1988) The Johnson-Lindenstrauss lemma and the sphericity of some graphs. J. Combinatorial Theory, Series B, 44, 355--362.
16. Gordon, Y. (1988) On Milman’s inequality and random subspaces which escape through a mesh in $$\mathbb{R}^n$$. In Geometric Aspects of Functional Analysis. Berlin, Heidelberg: Springer, pp. 84--106. 17. Haviv, I. & Regev, O. (2015) The restricted isometry property of subsampled Fourier matrices. ArXiv preprint, arXiv:1507.01768. 18. Johnson, W. B. & Lindenstrauss, J. (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemp. Mathematics, 26, 189--206. 19. Kane, D. M. & Nelson, J. (2010) A derandomized sparse Johnson-Lindenstrauss transform. ArXiv preprint, arXiv:1006.3585. 20. Klartag, B. & Mendelson, S. (2005) Empirical processes and random projections. J. Funct. Anal., 225, 229--245. 21. Koltchinskii, V. & Mendelson, S. (2013) Bounding the smallest singular value of a random matrix without concentration. ArXiv preprint, arXiv:1312.3580. 22. Krahmer, F. & Ward, R. (2011) New and improved Johnson-Lindenstrauss embeddings via the Restricted Isometry Property. SIAM J. Math. Anal., 43, 1269--1281. 23. Liberty, E., Ailon, N. & Singer, A. (2011) Dense fast random projections and lean Walsh Transforms. Discrete Comput. Geometry, 45, 34--44. 24. Mendelson, S. (2014) Learning without concentration. ArXiv preprint, arXiv:1401.0304. 25. Mendelson, S., Pajor, A. & Tomczak-Jaegermann, N. (2007) Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric Funct. Anal., 17, 1248--1282. 26. Nelson, J., Price, E. & Wootters, M. (2014) New constructions of RIP matrices with fast multiplication and fewer rows. Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 1515--1528. 27. Oymak, S., Recht, B. & Soltanolkotabi, M.
(2017) Sharp Time–Data Tradeoffs for Linear Inverse Problems. IEEE Transactions on Information Theory. ArXiv preprint, arXiv:1507.04793. 28. Oymak, S. & Tropp, J. A. (2015) Universality laws for randomized dimension reduction, with applications. Information and Inference: A Journal of the IMA. ArXiv preprint, arXiv:1511.09433. 29. Pilanci, M. & Wainwright, M. J. (2014) Randomized sketches of convex programs with sharp guarantees. IEEE International Symposium on Information Theory (ISIT), 921--925. 30. Pilanci, M. & Wainwright, M. J. (2015) Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61, 5096--5115. 31. Rudelson, M. & Vershynin, R. (2006) Sparse reconstruction by convex relaxation: Fourier and gaussian measurements. 40th Annual Conference on Information Sciences and Systems, pp. 207--212. 32. Sarlos, T. (2006) Improved approximation algorithms for large matrices via random projections. Foundations of Computer Science FOCS’06. 47th Annual IEEE Symposium on, IEEE, pp. 143--151. 33. Talagrand, M. (2006) The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer Science & Business Media. 34. Tropp, J. A. (2014) Convex recovery of a structured signal from independent random linear measurements. ArXiv preprint, arXiv:1405.1102. 35. Vershynin, R. (2010) Introduction to the non-asymptotic analysis of random matrices. ArXiv preprint, arXiv:1011.3027. 36. Vershynin, R. (2011) Lectures in geometric functional analysis. Unpublished manuscript. Available at http://www-personal.umich.edu/romanv/papers/GFA-book/GFA-book.pdf. 37. Yap, H. L., Wakin, M. B. & Rozell, C. J. (2013) Stable manifold embeddings with structured random matrices. IEEE J. Selected Top. Signal Processing, 7, 720--730. Appendix A.
Proof of Lemma 4.6 For a set $$\mathcal{M}$$ we define the normalized set $$\widetilde{\mathcal{M}}=\left \{\frac{\boldsymbol{v}}{\left \|\boldsymbol{v}\right \|_{\ell _{2}}}:\ \boldsymbol{v}\in \mathcal{M}\right \}$$. We shall also define   \begin{align*} \mathcal{Q}_{\ell}=\mathcal{T}_{\ell-1}\cup \mathcal{T}_{\ell}\cup (\mathcal{T}_{\ell}-\mathcal{T}_{\ell-1})\cup \left(\widetilde{(\mathcal{T}_{\ell}-\mathcal{T}_{\ell-1})}-\widetilde{\mathcal{T}}_{\ell-1}\right)\cup \left(\widetilde{(\mathcal{T}_{\ell}-\mathcal{T}_{\ell-1})}+\widetilde{\mathcal{T}}_{\ell-1}\right)\!. \end{align*} We will first prove that for ℓ = 1, 2, …, L and every $$\boldsymbol{v}\in \mathcal{Q}_{\ell }$$  \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{v}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{v}\right\|_{\ell_{2}}^{2}\right|\leq \max\left(2^{\ell/2}\delta,2^{\ell}\delta^{2}\right)\cdot\left\|\boldsymbol{v}\right\|_{\ell_{2}}^{2}, \end{align} (A.1) holds with probability at least $$1-\textrm{e}^{-\eta }$$. We then explain how the other claims follow from this result. To this aim, note that by the assumptions of the lemma MRIP$$\big(s,\frac{\delta }{4}\big)$$ holds for the matrix H with s = 200(1 + η). By definition this is equivalent to RIP$$\big(s_{\ell },\frac{\delta _{\ell }}{4}\big)$$ holding for ℓ = 1, 2, …, L with $$\big(s_{\ell },\frac{\delta _{\ell }}{4}\big)=\big(2^{\ell } s,\frac{2^{\ell /2}\delta }{4}\big)$$.
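The arithmetic behind the sparsity requirement can be checked numerically: with s = 200(1 + η), the scaled sparsity $$s_{\ell }=2^{\ell }s$$ exceeds $$40\log (4\left |\mathcal{Q}_{\ell }\right |)+40\ell (\eta +1)$$ whenever $$\left |\mathcal{Q}_{\ell }\right |\le 5N_{\ell }^{2}$$. The value η = 1 below is an arbitrary illustrative choice.

```python
import math

eta = 1.0                      # arbitrary illustrative choice
s = 200 * (1 + eta)
margins = []
for ell in range(1, 8):
    s_ell = 2 ** ell * s
    card_bound = 5 * (2 ** (2 ** ell)) ** 2      # |Q_ell| <= 5 * N_ell^2
    need = 40 * math.log(4 * card_bound) + 40 * ell * (eta + 1)
    margins.append(s_ell - need)
```

The margin stays positive for every scale ℓ because the log of the cardinality bound grows like $$2^{\ell +1}\log 2$$, which is dominated by the factor $$2^{\ell }s$$.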
Now observe that the cardinality of $$\mathcal{Q}_{\ell }$$ obeys $$\left |\mathcal{Q}_{\ell }\right |\le 5N_{\ell }^{2}$$ with $$N_{\ell }=2^{2^{\ell }}$$ which implies   \begin{align} s_{\ell}&=2^{\ell} s\nonumber\\ &=2^{\ell}\left(200+200\eta\right)\nonumber\\ &\ge2^{\ell}\left(40(\log 2)(\log_{2} (20)+2)+\frac{40}{2}(\eta+1)\right)\nonumber\\ &\ge 2^{\ell}\left(40(\log 2)\left(\frac{\log_{2} (20)}{2^{\ell}}+2\right)+\frac{40\ell}{2^{\ell}}(\eta+1)\right)\nonumber\\ &\ge 40(\log 2)\left(\log_{2}(20)+2^{\ell+1}\right)+40\ell(\eta+1)\nonumber\\[3pt] &\ge 40\log\left(4\left|\mathcal{Q}_{\ell}\right|\right)+40\ell(\eta+1)\nonumber\\[3pt] &\ge\min\left(40\log\left(4\left|\mathcal{Q}_{\ell}\right|\right)+40\ell(\eta+1),n\right)\!. \end{align} (A.2) By the MRIP assumption, RIP$$\big(s_{\ell },\frac{\delta _{\ell }}{4}\big)$$ holds for H. This together with (A.2) allows us to apply Theorem 4.3 to conclude that for each ℓ = 1, 2, …, L and every $$\boldsymbol{x}\in \mathcal{Q}_{\ell }$$  \begin{align*} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le\max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!, \end{align*} holds with probability at least $$1-\textrm{e}^{-\ell (\eta +1)}$$. Noting that   \begin{align*} \sum_{\ell=1}^{L} \textrm{e}^{-\ell(\eta+1)}\le\sum_{\ell=1}^{\infty} \textrm{e}^{-\ell(\eta+1)}=\frac{\textrm{e}^{-(\eta+1)}}{1-\textrm{e}^{-(\eta+1)}}\le \textrm{e}^{-\eta}, \end{align*} completes the proof of (A.1) by the union bound. We note that since $$\mathcal{T}_{\ell -1}\cup \mathcal{T}_{\ell }\cup (\mathcal{T}_{\ell }-\mathcal{T}_{\ell -1})\subset \mathcal{Q}_{\ell }$$, (A.1) immediately implies (4.5). The proof of (4.4) follows from the proof of (4.5) by noting that   \begin{align*} (1+\delta_{\ell})^{2}\geq 1+\max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\!.
\end{align*} To prove (4.6), first note that $$\frac{\boldsymbol{v}}{\left \|\boldsymbol{v}\right \|_{\ell _{2}}}-\frac{\boldsymbol{u}}{\left \|\boldsymbol{u}\right \|_{\ell _{2}}}\in \widetilde{(\mathcal{T}_{\ell }-\mathcal{T}_{\ell -1})}-\widetilde{\mathcal{T}}_{\ell -1}$$ and $$\frac{\boldsymbol{u}}{\left \|\boldsymbol{u}\right \|_{\ell _{2}}}+\frac{\boldsymbol{v}}{\left \|\boldsymbol{v}\right \|_{\ell _{2}}}\in \widetilde{(\mathcal{T}_{\ell }-\mathcal{T}_{\ell -1})}+\widetilde{\mathcal{T}}_{\ell -1}$$. Hence, applying (A.1)   \begin{align*} \left|\left\|{\boldsymbol{A}}\left(\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}+\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}\right)\right\|_{\ell_{2}}^{2}-\left\|\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}+\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}\right|\leq& \max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\left\|\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}+\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}\nonumber\\[3pt] \left|\left\|{\boldsymbol{A}}\left(\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}-\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}\right)\right\|_{\ell_{2}}^{2}-\left\|\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}-\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}\right|\leq& \max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\left\|\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}-\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}. 
\end{align*} Taking the difference of these two inequalities and applying the triangle inequality, we conclude that   \begin{align*} \frac{1}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}\left\|\boldsymbol{v}\right\|_{\ell_{2}}}\left|\boldsymbol{u}^{\ast}{\boldsymbol{A}}^{\ast}{\boldsymbol{A}}\boldsymbol{v}-\boldsymbol{u}^{\ast}\boldsymbol{v}\right|&\leq \frac{1}{4}\max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\left(\left\|\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}+\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}+\left\|\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}-\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}\right)\\ &=\max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\!, \end{align*} completing the proof of (4.6). Appendix B. Proof of Lemma 4.7 We prove each of the three inequalities in the next three sections. B.1 Proof of inequality (4.10) For $$1\le \ell \le \tilde{L}$$, we have $$\delta _{\ell }=2^{\ell /2}\delta \le 1$$ so that $$\max (\delta _{\ell },\delta _{\ell }^{2})=\delta _{\ell }$$. Thus, applying Lemma 4.6 together with (4.7), we arrive at   \begin{align} \left|\left\|{\boldsymbol{A}}({\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1})\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\leq 2^{\ell/2}\delta\left\|\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\le 2^{\ell/2+2}\delta e_{\ell-1}^{2}, \end{align} (B.1) and   \begin{align} \left|\langle{\boldsymbol{A}}({\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1}),{\boldsymbol{A}}{\boldsymbol{z}}_{\ell-1}\rangle- \langle{\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1},{\boldsymbol{z}}_{\ell-1}\rangle\right|\leq 2^{\ell/2+1}\delta e_{\ell-1}\!.
\end{align} (B.2) The triangle inequality yields   \begin{align*} \left|\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|=& \left|\left\|{\boldsymbol{A}}({\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1})+{\boldsymbol{A}}{\boldsymbol{z}}_{\ell-1} \right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|\\ \le&\left|\left\|{\boldsymbol{A}}({\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1})\right\|_{\ell_{2}}^{2}- \left\|\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|{\boldsymbol{A}} {\boldsymbol{z}}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\\ &+2\left|\langle{\boldsymbol{A}}({\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1}),{\boldsymbol{A}} {\boldsymbol{z}}_{\ell-1}\rangle-\langle\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1},\boldsymbol{z}_{\ell-1}\rangle\right|\!. \end{align*} Combining the latter with (B.1) and (B.2), we arrive at the following recursion   \begin{align} \left|\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\le\delta\left(2e_{\ell-1}+4e_{\ell-1}^{2}\right)2^{\ell/2}.
\end{align} (B.3) Summing both sides of the above inequality over $$1\leq \ell \leq \tilde{L}$$ and using $$e_{\ell }^{2}\leq 2e_{\ell }\leq 4$$, we arrive at   \begin{align*} \sum_{\ell=1}^{\tilde{L}}\left(\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\right)\le&10\delta\left(\sum_{\ell=1}^{\tilde{L}}2^{\ell/2}e_{\ell-1}\right)\nonumber\\ =&10\sqrt{2}\delta\left(\sum_{\ell=0}^{\tilde{L}-1}2^{\ell/2}e_{\ell}\right)\nonumber\\ \le&10\sqrt{2}\delta\gamma_{2}(\mathcal{T}). \end{align*} B.2 Proof of inequality (4.11) To bound the second term we begin by bounding $$\left |\left \|\boldsymbol{A}\boldsymbol{x}\right \|_{\ell _{2}}-\left \|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right \|_{\ell _{2}}\right |$$. To this aim, first note that since MRIP$$\big(s,\frac{\delta }{4}\big)$$ holds for H with s = 200(1 + η), we have $$s_{L}=200\times 2^{L}(1+\eta )\ge n$$. As a result for all $$\boldsymbol{x}\in \mathbb{R}^{n}$$, we have   \begin{align*} \left|\left\|\boldsymbol{H}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le\max\left(\frac{1}{4}\delta_{L},\frac{1}{16}{\delta_{L}^{2}}\right)\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!. \end{align*} Using the simple inequality $$1+\max (\delta ,\delta ^{2})\le (1+\delta )^{2}$$, this immediately implies   \begin{align} \left\|\boldsymbol{A}\right\|=\left\|\boldsymbol{H}\right\|\le\frac{1}{4}2^{\frac{L}{2}}\delta+1. \end{align} (B.4) Furthermore, by the definition of $$\boldsymbol{z}_{L}$$ and $$e_{L}$$ we have $$\left \|\boldsymbol{x}-\boldsymbol{z}_{L}\right \|_{\ell _{2}}\le e_{L}$$.
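The ‘simple inequality’ invoked above follows from expanding $$(1+\delta )^{2}=1+2\delta +\delta ^{2}$$, which dominates both 1 + δ and 1 + δ² for δ ≥ 0; a quick grid check:

```python
# (1+d)^2 = 1 + 2d + d^2 dominates both 1 + d and 1 + d^2 for d >= 0,
# which is exactly 1 + max(d, d^2) <= (1+d)^2
holds = all(1 + max(d, d * d) <= (1 + d) ** 2 + 1e-12
            for d in (i / 100 for i in range(401)))
```

Taking square roots then converts the two-sided bound on $$\left \|\boldsymbol{H}\boldsymbol{x}\right \|_{\ell _{2}}^{2}$$ into the operator-norm bound (B.4).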
Combining these two inequalities with repeated use of the triangle inequality, we have   \begin{align*} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|=&\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{L}\right\|_{\ell_{2}}+\left\|\boldsymbol{A}\boldsymbol{z}_{L}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\\ \le&\left\|\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L})\right\|_{\ell_{2}}+\left\|\boldsymbol{A}(\boldsymbol{z}_{L}-\boldsymbol{z}_{\tilde{L}})\right\|_{\ell_{2}}\\ \le&\left\|\boldsymbol{A}\right\|\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}+\left\|\sum_{\ell=\tilde{L}+1}^{L}\boldsymbol{A}(\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1})\right\|_{\ell_{2}}\\ \le&\left(\frac{1}{4}2^{\frac{L}{2}}\delta+1\right)e_{L}+\sum_{\ell=\tilde{L}+1}^{L}\left\|\boldsymbol{A}(\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1})\right\|_{\ell_{2}}\!. \end{align*} Using equation (4.4) of Lemma 4.6 in the above inequality and noting that $$2^{\ell /2}\delta \ge 1$$ for $$\ell>\tilde{L}$$, we conclude that   \begin{align} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|&\le\left(\frac{1}{4}2^{\frac{L}{2}}\delta+1\right)e_{L}+\sum_{\ell=\tilde{L}+1}^{L}(1+2^{\ell/2}\delta)\left\|\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}\nonumber\\ &\le\frac{5}{4}2^{L/2}\delta e_{L}+\sum_{\ell=\tilde{L}+1}^{L}2^{\ell/2+1}\delta\left\|\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}\nonumber\\ &\le\frac{5}{4}\delta2^{L/2}e_{L}+4\sqrt{2}\delta\sum_{\ell=\tilde{L}+1}^{L}2^{(\ell-1)/2}e_{\ell-1}\nonumber\\ &\le 4\sqrt{2}\delta\left(\sum_{\ell=\tilde{L}}^{L} 2^{\ell/2}e_{\ell}\right)\nonumber\\ &\le 4\sqrt{2}\delta\gamma_{2}(\mathcal{T}).
\end{align} (B.5) Now note that by equation (4.4) of Lemma 4.6 and the fact that rad$$(\mathcal{T})=1$$, we know that $$\left \|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right \|_{\ell _{2}}\le 1+2^{\tilde{L}/2}\delta \le 2$$. Thus, using this inequality together with (B.5) we arrive at   \begin{align*} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|&\le\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}+\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\nonumber\\ &\le\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|^{2}+2\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\nonumber\\ &\le 32\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})+16\sqrt{2}\delta\gamma_{2}(\mathcal{T}). \end{align*} B.3 Proof of inequality (4.12) As with the second term, we begin by bounding $$\left |\left \|\boldsymbol{x}\right \|_{\ell _{2}}-\left \|\boldsymbol{z}_{\tilde{L}}\right \|_{\ell _{2}}\right |$$. Noting that $$2^{\ell /2}\delta \ge \frac{1}{\sqrt{2}}$$ for $$\ell \ge \tilde{L}$$, we have   \begin{align*} \left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\le\left\|\boldsymbol{x}-\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\le e_{\tilde{L}}\le \sqrt{2}\cdot2^{\tilde{L}/2}\delta e_{\tilde{L}}\le \sqrt{2}\delta\gamma_{2}(\mathcal{T}).
\end{align*} Thus, using this inequality together with the fact that $$\left \|\boldsymbol{z}_{\tilde{L}}\right \|_{\ell _{2}}\le 1$$, we arrive at   \begin{align*} \left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|&=\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\cdot\left(\left\|\boldsymbol{x}\right\|_{\ell_{2}}+\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right)\nonumber\\ &\le\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|{}^{2}+2\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\nonumber\\ &\le 4\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})+4\sqrt{2}\delta\gamma_{2}(\mathcal{T}). \end{align*} Appendix C. Establishing an analog of (4.9) and the bounds (4.10), (4.11) and (4.12) when $$\tilde{L}>L$$ This section describes how an analog of (4.9), as well as the subsequent bounds in Sections B.1, B.2 and B.3, can be derived when $$\tilde{L}>L$$. Using arguments similar to those leading to (4.9), we arrive at   \begin{align} |\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}|\le&\sum_{\ell=1}^{L}\left(\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\right)\nonumber\\ &+\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\right|+\max\left(\delta,\delta^{2}\right)\!.
\end{align} (C.1) The main difference from the $$\tilde{L}\le L$$ case is that we let the summation in the first term go up to L and, instead of studying the second line of (4.9), we directly bound the difference $$\left |\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}\right |-\left |\left \|{\boldsymbol{A}}{\boldsymbol{z}}_{L}\right \|_{\ell _{2}}^{2}-\left \|{\boldsymbol{z}}_{L}\right \|_{\ell _{2}}^{2}\right |$$ in (C.1). We now turn our attention to bounding the first two terms in (C.1). For the first term in (C.1), an argument identical to the derivation of (4.10) in Section B.1 allows us to conclude   \begin{align} \sum_{\ell=1}^{L}\left(\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\right)\le10\sqrt{2}\delta\gamma_{2}(\mathcal{T}).
\end{align} (C.2) To bound the second term in (C.1), note that   \begin{align} &\left|\|{\boldsymbol{A}}\boldsymbol{x}\|_{\ell_{2}}^{2}-\|\boldsymbol{x}\|_{\ell_{2}}^{2}\right|-\left|\|{\boldsymbol{A}}{\boldsymbol{z}}_{L}\|_{\ell_{2}}^{2}-\|{\boldsymbol{z}}_{L}\|_{\ell_{2}}^{2}\right|\nonumber\\ &\quad\le\left|\left(\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\right)-\left(\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\right)\right|\!,\nonumber\\ &\quad=\left|\left(\left\|{\boldsymbol{A}}\left(\boldsymbol{x}-\boldsymbol{z}_{L}\right)+{\boldsymbol{A}}\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\right)-\left(\left\|(\boldsymbol{x}-\boldsymbol{z}_{L})+\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\right)\right|\!,\nonumber\\ &\quad=\left|\left(\left\|\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L})\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\right)+2\left(\langle\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L}),{\boldsymbol{A}}\boldsymbol{z}_{L}\rangle-\langle\boldsymbol{x}-\boldsymbol{z}_{L},\boldsymbol{z}_{L}\rangle\right)\right|\!,\nonumber\\ &\quad\le\left|\left\|\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L})\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\right|+2\left|\langle\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L}),{\boldsymbol{A}}\boldsymbol{z}_{L}\rangle-\langle\boldsymbol{x}-\boldsymbol{z}_{L},\boldsymbol{z}_{L}\rangle\right|\!,\nonumber\\ 
&\quad=\left|\left\|\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L})\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\right|+2\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\left|\left\langle\boldsymbol{A}\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}},{\boldsymbol{A}}\boldsymbol{z}_{L}\right\rangle-\left\langle\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}},\boldsymbol{z}_{L}\right\rangle\right|\!,\nonumber\\ &\quad\le\left|\left\|\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L})\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\right|\nonumber\\ &\qquad+\frac{1}{2}\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\left|\left\Vert{\boldsymbol{A}\left(\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}+\boldsymbol{z}_{L}\right)}\right\Vert{}_{\ell_{2}}^{2}-\left\Vert{\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}+\boldsymbol{z}_{L}}\right\Vert{}_{\ell_{2}}^{2}\right|\nonumber\\ &\qquad+\frac{1}{2}\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\left|\left\Vert{\boldsymbol{A}\left(\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}-\boldsymbol{z}_{L}\right)}\right\Vert{}_{\ell_{2}}^{2}-\left\Vert{\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}-\boldsymbol{z}_{L}}\right\Vert{}_{\ell_{2}}^{2}\right|\!. \end{align} (C.3) To complete the bound, note that since MRIP$$\big(s,\frac{\delta }{4}\big)$$ holds for A with s = 200(1 + η), we have $$s_{L}=2^{L}s=200\times 2^{L}(1+\eta )\ge n$$. 
As a result, for all $$\boldsymbol{w}\in \mathbb{R}^{n}$$, we have   \begin{align*} \left|\left\|\boldsymbol{A}\boldsymbol{w}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{w}\right\|_{\ell_{2}}^{2}\right|\le\max\left(\frac{1}{4}\delta_{L},\frac{1}{16}{\delta_{L}^{2}}\right)\left\|\boldsymbol{w}\right\|_{\ell_{2}}^{2}\!. \end{align*} For $$\tilde{L}>L$$ we have $$\delta _{L}=2^{\frac{L}{2}}\delta \le 1$$, which immediately implies that, for all $$\boldsymbol{w}\in \mathbb{R}^{n}$$,   \begin{align} \left|\left\|\boldsymbol{A}\boldsymbol{w}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{w}\right\|_{\ell_{2}}^{2}\right|\le\frac{1}{4}2^{L/2}\delta\left\|\boldsymbol{w}\right\|_{\ell_{2}}^{2}. \end{align} (C.4) Now using (C.4) with $$\boldsymbol{w}=\boldsymbol{x}-\boldsymbol{z}_{L}, \frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left \|\boldsymbol{x}-\boldsymbol{z}_{L}\right \|_{\ell _{2}}}-\boldsymbol{z}_{L}$$, and $$\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left \|\boldsymbol{x}-\boldsymbol{z}_{L}\right \|_{\ell _{2}}}+\boldsymbol{z}_{L}$$ in (C.3) and noting that $$\left \|\boldsymbol{z}_{L}\right \|_{\ell _{2}}\le \textit{rad}(\mathcal{T})\le 1$$, we conclude that   \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!-\!\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\!-\!\left|\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\!-\!\left\|{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\right|&\le\frac{1}{4}2^{L/2}\delta\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}+\frac{1}{8}2^{L/2}\delta\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\left\Vert{\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}+\boldsymbol{z}_{L}}\right\Vert{}_{\ell_{2}}^{2}\nonumber\\ 
&\quad+\frac{1}{8}2^{L/2}\delta\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\left\Vert{\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}-\boldsymbol{z}_{L}}\right\Vert{}_{\ell_{2}}^{2}\nonumber\\ &\le\frac{1}{4}2^{L/2}\delta\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}+2^{L/2}\delta\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\nonumber\\ &\le 2^{L/2}\delta \left(\frac{1}{4}{{e_{L}^{2}}}+e_{L}\right)\nonumber\\ &\le\frac{3}{2}2^{L/2}\delta e_{L}\nonumber\\ &\le\frac{3}{2}\delta \gamma_{2}(\mathcal{T}). \end{align} (C.5) Plugging (C.2) and (C.5) into (C.1), we arrive at   \begin{align} \left| \|{\boldsymbol{A}}\boldsymbol{x}\|_{\ell_{2}}^{2}-\|\boldsymbol{x}\|_{\ell_{2}}^{2}\right| \le 16\delta \gamma_{2}(\mathcal{T})+\max(\delta,\delta^{2}). \end{align} (C.6) © The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. Information and Inference: A Journal of the IMA, Oxford University Press.

Advance Article – Mar 7, 2018 (20 pages)
Publisher: Oxford University Press
ISSN: 2049-8764 · eISSN: 2049-8772
DOI: 10.1093/imaiai/iax019
\end{align*} Here, $$\boldsymbol{g}\in \mathbb{R}^{n}$$ is a Gaussian random vector distributed as $$\mathcal{N}(\mathbf{0},\boldsymbol{I}_{n})$$. Theorem 1.2 (Gordon’s escape through the mesh) Let δ ∈ (0, 1), $$\mathcal{T}\subset \mathbb{R}^{n}$$ be a subset of the unit sphere ($$\mathcal{T}\subset \mathbb{S}^{n-1}$$) and let $$\boldsymbol{A}\in \mathbb{R}^{m\times n}$$ be a matrix with i.i.d. 
$$\mathcal{N}(0,1/{b_{m}^{2}})$$ entries where $$b_{m}=\sqrt{2}\cdot \varGamma \left (\frac{m+1}{2}\right )/\varGamma \left (\frac{m}{2}\right )\approx \sqrt{m}$$ and Γ denotes the Gamma function. Then,   \begin{align} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}\right|\le \delta\left\|\boldsymbol{x}\right\|_{\ell_{2}}\!, \end{align} (1.2) holds for all $$\boldsymbol{x}\in \mathcal{T}$$ with probability at least $$1-2\mathrm{e}^{-\frac{\eta ^{2}}{2}}$$ as long as   \begin{align} m\ge\frac{\left(\omega(\mathcal{T})+\eta\right)^{2}}{\delta^{2}}. \end{align} (1.3) We note that the JL Lemma for Gaussian matrices follows as a special case. Indeed, for a set $$\mathcal{T}$$ containing a finite number of points $$\left |\mathcal{T}\right |\le p$$, one can show that $$\omega (\mathcal{T})\le \sqrt{2\log p}$$ so that the minimal amount of dimension reduction m allowed by (1.3) is of the same order as Lemma 1.1. Recently, a line of research by Mendelson and collaborators [20,21,24,25] showed that the inequality (1.2) continues to hold for matrices with independent sub-Gaussian rows (albeit at a loss in terms of the constants). More recently, in [28], Oymak & Tropp obtain the precise constants when the entries are i.i.d. sub-Gaussian. See also [12,34] for more recent results and applications. Connected to this, Bourgain et al. [5] have shown that a similar result to Gordon’s theorem holds for certain ensembles of matrices with sparse entries. This paper develops an analogue of Gordon’s result for more structured matrices, particularly those that admit efficient multiplication. At the heart of our analysis is a theorem that shows that matrices that preserve the Euclidean norm of sparse vectors (also known as Restricted Isometry Property (RIP) matrices), when multiplied by a random sign pattern, preserve the Euclidean norm of any set. 
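As a numerical illustration of the tradeoff in (1.3), the sketch below embeds a finite point cloud with a Gaussian matrix. This is a minimal sketch, not part of the paper: we use the simpler $$1/\sqrt{m}$$ scaling in place of the exact $$1/b_{m}$$, estimate $$\omega (\mathcal{T})$$ by Monte Carlo, and the choice of set, sample sizes and seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, delta, eta = 512, 100, 0.3, 2.0

# T: p arbitrary points on the unit sphere S^{n-1}.
T = rng.standard_normal((p, n))
T /= np.linalg.norm(T, axis=1, keepdims=True)

# Monte Carlo estimate of the mean width omega(T) = E sup_{v in T} <g, v>.
G = rng.standard_normal((2000, n))
omega = float(np.mean((G @ T.T).max(axis=1)))
assert omega <= np.sqrt(2 * np.log(p))  # the bound used for finite sets

# Gordon: m >= (omega + eta)^2 / delta^2 rows give distortion <= delta (w.h.p.).
m = int(np.ceil((omega + eta) ** 2 / delta ** 2))
A = rng.standard_normal((m, n)) / np.sqrt(m)  # entries approx N(0, 1/b_m^2)
distortion = np.abs(np.linalg.norm(A @ T.T, axis=0) - 1.0)
assert distortion.max() <= delta
```

With these values m is a few hundred, far below n = 512, yet every point's norm is preserved to within the target distortion, matching the order predicted by (1.3).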
Roughly stated, linear transforms that provide low distortion embedding of sparse vectors also allow low distortion embedding of any set! We believe that our result provides a rigorous justification for replacing ‘slow’ Gaussian matrices with ‘fast’ and computationally friendly matrices in many scientific and engineering applications. Indeed, in a companion paper [27] we utilize the results of this paper to develop sharp rates of convergence for various optimization problems involving such structured matrices. Our results imply faster algorithms for a variety of other problems including: sparse and low-rank approximation from underdetermined samples [6,7], subspace embeddings [32], and sketched least-squares problems [30]. 2. Isometric sketching of sparse vectors To connect isometric sketching of sparse vectors to isometric sketching of general sets, we begin by defining the RIP. Roughly stated, RIP ensures that a matrix preserves the Euclidean norm of sparse vectors up to a multiplicative distortion δ. This definition immediately implies that RIP matrices can be utilized for isometric sketching of sparse vectors. Definition 2.1 (RIP) A matrix $$\boldsymbol{A}\in \mathbb{R}^{m\times n}$$ satisfies the RIP with distortion δ > 0 at a sparsity level s, if for all vectors x with sparsity at most s, we have   \begin{align} \left| \left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\leq \max(\delta,\delta^{2})\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!. \end{align} (2.1) We shall use the short-hand RIP(δ, s) to denote this property. This definition is essentially identical to the classical definition of RIP [7]. The only difference is that we do not restrict δ to lie in the interval [0, 1]. Consequently, the correct dependence on δ in the right-hand side of (2.1) is $$\max (\delta ,\delta ^{2})$$. For the purposes of this paper we need a more refined notion of RIP. 
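Before refining this notion, the basic condition (2.1) can be probed numerically. The sketch below samples random s-sparse unit vectors for a Gaussian matrix; note that random sampling only gives evidence, it does not certify RIP, which requires a supremum over all s-sparse vectors. Dimensions and seed are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, m, delta = 256, 5, 200, 0.5

A = rng.standard_normal((m, n)) / np.sqrt(m)

# Largest observed distortion | ||Ax||^2 - 1 | over random s-sparse unit vectors.
worst = 0.0
for _ in range(500):
    x = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)
    x[support] = rng.standard_normal(s)
    x /= np.linalg.norm(x)
    y = A @ x
    worst = max(worst, abs(float(np.dot(y, y)) - 1.0))

# For delta < 1, the right-hand side of (2.1) is simply delta * ||x||_2^2.
assert worst <= max(delta, delta**2)
```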
More specifically, we need RIP to simultaneously hold for different sparsity and distortion levels. Definition 2.2 (Multiresolution RIP (MRIP)) Let $$L=\lceil \log _{2} n\rceil$$. Given δ > 0 and a number s ≥ 1, for ℓ = 0, 1, 2, …, L, let $$(\delta _{\ell }, s_{\ell }) = (2^{\ell /2}\delta , 2^{\ell }s)$$ be a sequence of distortion and sparsity levels. We say a matrix $${\boldsymbol{A}}\in \mathbb{R}^{m\times n}$$ satisfies the MRIP with distortion δ > 0 at sparsity s, if for all ℓ ∈ {0, 1, 2, …, L}, RIP$$(\delta _{\ell }, s_{\ell })$$ holds. More precisely, for vectors of sparsity at most $$s_{\ell }$$ ($$\left \|\boldsymbol{x} \right \|_{\ell _{0}}\le s_{\ell }$$) the sequence of inequalities   \begin{align} \left| \left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\leq \max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!, \end{align} (2.2) simultaneously holds for all ℓ ∈ {0, 1, 2, …, L}. We shall use the short-hand MRIP(δ, s) to denote this property. At the lowest scale, this definition reduces to the standard RIP(δ, s) definition. Noting that $$s_{L} = 2^{L}s \ge n$$, at the highest scale this condition requires   \begin{align*} \left| \left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\leq \max\left(\delta_{L},{\delta_{L}^{2}}\right)\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!, \end{align*} to hold for all vectors $$\boldsymbol{x}\in \mathbb{R}^{n}$$. While this condition looks considerably more restrictive than the standard definition of RIP, with proper scaling it can be easily satisfied for popular random matrix ensembles used for dimensionality reduction. These include dense matrices with i.i.d. sub-Gaussian entries as well as structured matrices such as randomly subsampled Hadamard or the Discrete Cosine Transform matrix. 
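The schedule of sparsity/distortion pairs in Definition 2.2 is straightforward to enumerate; a minimal sketch (the helper name `mrip_levels` is ours):

```python
import math

def mrip_levels(n, s, delta):
    """Return the MRIP schedule (delta_l, s_l) = (2**(l/2) * delta, 2**l * s), l = 0..L."""
    L = math.ceil(math.log2(n))
    return [(2 ** (l / 2) * delta, (2 ** l) * s) for l in range(L + 1)]

levels = mrip_levels(n=1000, s=4, delta=0.05)
assert levels[0] == (0.05, 4)   # lowest scale: the standard RIP(delta, s)
assert levels[-1][1] >= 1000    # s_L = 2^L * s >= n: the top scale covers all of R^n
```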
The latter are special cases of Subsampled Orthogonal with Random Sign (SORS) matrices described in detail in Definition 3.2, for which matrix-vector multiply can be computed in log-linear time. 3. From isometric sketching of sparse vectors to general sets Our main result states that a matrix obeying MRIP with the right distortion level $$\tilde{\delta }$$ can be used for embedding any subset $$\mathcal{T}$$ of $$\mathbb{R}^{n}$$. Theorem 3.1 For a set $$\mathcal{T}\subset \mathbb{R}^{n}$$ let $$\textit{rad}(\mathcal{T})=\sup _{\boldsymbol{v}\in \mathcal{T}} \left \|\boldsymbol{v}\right \|_{\ell _{2}}$$ be the maximum Euclidean norm of a point inside $$\mathcal{T}$$. Suppose the matrix $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ obeys the MRIP with sparsity and distortion levels   \begin{align} s=200(1+\eta)\quad\textrm{and}\quad \tilde{\delta}= \frac{\delta\cdot \textit{rad}(\mathcal{T})}{C\max\left( \textit{rad}(\mathcal{T}),\omega(\mathcal{T})\right)}, \end{align} (3.1) with C > 0 an absolute constant. Then, for a diagonal matrix D with an i.i.d. random sign pattern on the diagonal, the matrix A = HD obeys   \begin{align} \sup_{\boldsymbol{x}\in\mathcal{T}}|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}|\leq \max\left(\delta,\delta^{2}\right)\cdot\left( \textit{rad}(\mathcal{T})\right)^{2}\!, \end{align} (3.2) with probability at least $$1-\exp (-\eta )$$. This theorem shows that if a matrix isometrically embeds sparse vectors at all scales, then it becomes suitable for isometric embedding of any set when multiplied by a random sign pattern. For random matrix ensembles that are commonly used for dimensionality reduction, the minimum dimension m for the MRIP$$(s,\tilde{\delta })$$ to hold grows as $$m\sim \frac{s}{\tilde{\delta }^{2}}$$. 
In Theorem 3.1, we have s ∼ 1 and $$\tilde{\delta }\sim \frac{\delta }{\omega (\mathcal{T})}$$ so that the minimum dimension m for (3.2) to hold is of the order of $$m\sim \frac{\omega ^{2}(\mathcal{T})}{\delta ^{2}}$$. This is exactly the same scaling one would obtain by using Gaussian random matrices via Gordon’s lemma in (1.3). To see this more clearly we now focus on applying Theorem 3.1 to random matrices obtained by subsampling a unitary matrix. Definition 3.2 (SORS matrices) Let $$\boldsymbol{F}\in \mathbb{R}^{n\times n}$$ denote an orthonormal matrix obeying   \begin{align} \boldsymbol{F}^{\ast}\ \boldsymbol{F} =\boldsymbol{ I }\quad\textrm{and}\quad\max_{i, \ j}\left|\boldsymbol{F}_{i j}\right|\le \frac{\varDelta}{\sqrt{n}}. \end{align} (3.3) Define the random subsampled matrix $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ with i.i.d. rows chosen uniformly at random from the rows of F. Now we define the SORS measurement ensemble as $$A=\sqrt{n/m}\ HD$$, where $$\boldsymbol{D}\in \mathbb{R}^{n\times n}$$ is a random diagonal matrix with the diagonal entries i.i.d. ± 1 with equal probability. To simplify exposition, in the definition above we have focused on SORS matrices based on subsampled orthonormal matrices H with i.i.d. rows chosen uniformly at random from the rows of an orthonormal matrix F obeying (3.3). However, our results continue to hold for SORS matrices defined via a much broader class of random matrices H with i.i.d. rows chosen according to a probability measure on Bounded Orthonormal Systems. Please see [14, Section 12.1] for further details on such ensembles. By utilizing results on RIP of subsampled orthogonal random matrices obeying (3.3) we can show that the MRIP holds at the sparsity and distortion levels required by (3.1). Therefore, Theorem 3.1 immediately implies a result similar to Gordon’s lemma for SORS matrices. 
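As a concrete instance of Definition 3.2, one can take F to be the orthonormal DCT-II matrix, whose entries are bounded by $$\sqrt{2/n}$$ (so $$\varDelta =\sqrt{2}$$), and apply A in $$\mathcal{O}(n\log n)$$ time without ever forming it. The following is a minimal sketch assuming SciPy's `scipy.fft.dct`; the choice of transform, dimensions and seed are ours, not the paper's:

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(3)
n, m = 1024, 256

signs = rng.choice([-1.0, 1.0], size=n)  # diagonal of D
rows = rng.integers(0, n, size=m)        # i.i.d. uniform rows of F defining H

def sors_apply(x):
    """A x = sqrt(n/m) * H D x, computed with a fast DCT instead of a dense matrix."""
    return np.sqrt(n / m) * dct(signs * x, norm="ortho")[rows]

x = rng.standard_normal(n)
y = sors_apply(x)
assert y.shape == (m,)

# By construction E||Ax||^2 = ||x||^2; a single draw should already be close.
ratio = float(np.dot(y, y) / np.dot(x, x))
assert abs(ratio - 1.0) < 0.5
```

The same pattern works for any bounded orthonormal system (e.g. a fast Hadamard transform); only the transform call changes.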
Theorem 3.3 Let $$\mathcal{T}\subset \mathbb{R}^{n}$$ and suppose $${\boldsymbol{A}}\in \mathbb{R}^{m\times n}$$ is selected from the SORS distribution of Definition 3.2. Then,   \begin{align} \sup_{\boldsymbol{x}\in\mathcal{T}}|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}|\leq \max\{\delta,\delta^{2}\}\cdot \left( \textit{rad}(\mathcal{T})\right)^{2}\!, \end{align} (3.4) holds with probability at least 1 − 2e−η as long as   \begin{align} m\geq C\varDelta^{2}(1+\eta)^{2}(\log n)^{4}\ \frac{\max\left(1,\frac{\omega^{2}(\mathcal{T})}{\left( \textit{rad}(\mathcal{T})\right)^{2}}\right)}{\delta^{2}}. \end{align} (3.5) We would like to point out that one can improve the dependence on η and potentially replace a few $$\log n$$ factors with $$\log \left (\omega (\mathcal{T})\right )$$ by utilizing improved RIP bounds such as [9,11,31]. We also note that any future result that reduces log factors in the sample complexity of RIP will also automatically improve the bound on m in our results. In fact, after the first version of this manuscript became available there has been a very interesting reduction of log factors in the required sample complexity of RIP by Haviv & Regev in [17] (see also a related, earlier improved RIP result of Bourgain [4], brought to our attention by Jelani Nelson). We believe that utilizing this new RIP result it may be possible to improve the bound in (3.5) to   \begin{align} m\geq C\varDelta^{2}(1+\eta)^{2}(\log \omega(\mathcal{T}))^{2}\log n\ \frac{\max\left(1,\frac{\omega^{2}(\mathcal{T})}{\left( \textit{rad}(\mathcal{T})\right)^{2}}\right)}{\delta^{2}}. \end{align} (3.6) Unfortunately, (3.6) does not trivially follow from the results in [17]. 
The reason is twofold: (1) the results of [17] are based on more classical definitions of RIP (without the $$\max (\delta ,\delta ^{2})$$ as in (2.1)) and (2) the dependence on the distortion level δ in terms of sample complexity is not of the form $$1/\delta ^{2}$$, but rather the slightly weaker $$\frac{\log ^{4}(1/\delta )}{\delta ^{2}}$$, which holds for sufficiently small δ. Closing this gap is an interesting future research direction. Ignoring constant/logarithmic factors, Theorem 3.3 is an exact analogue of Gordon’s lemma for Gaussian matrices in terms of the tradeoff between the reduced dimension m and the distortion level δ. Gordon’s result for Gaussian matrices has been utilized in numerous applications. Theorem 3.3 above allows one to replace Gaussian matrices with SORS matrices for such problems. For example, Chandrasekaran et al. [8] use Gordon’s lemma to obtain near optimal sample complexity bounds for linear inverse problems involving Gaussian matrices. An immediate application of Theorem 3.3 implies near optimal sample complexity results using SORS matrices. To the best of our knowledge, this is the first sample-optimal result using a matrix with a fast multiply. We refer the reader to our companion paper for further detail [27]. Theorem 3.3 holds for all sets $$\mathcal{T}$$, while using matrices that have fast multiplication. We would like to pause to mention a few interesting results that hold with additional assumptions on the set $$\mathcal{T}$$. Perhaps the first results of this kind were established for the RIP in [7,31], where the set $$\mathcal{T}$$ is the set of vectors with a certain sparsity level. Krahmer & Ward established a JL type embedding for RIP matrices with columns multiplied by a random sign pattern [22]. That is, the authors show that Theorem 3.3 holds when $$\mathcal{T}$$ is a finite point cloud. 
More recently, in [37] the authors show a Gordon-type embedding result holds for manifold signals using RIP matrices whose columns are multiplied by a random sign pattern. All of these interesting results on embedding of finite points serve as a precursor to our results. Indeed, [22] plays a crucial role in our proof. A practical contribution of our work is that SORS matrices can be used to embed any set, which not only unifies the existing set-specific results, but also implies new optimal embedding bounds for several tasks including low-rank approximation, least-squares sketching, and group-sparse and dictionary-sparse signal modeling. Earlier, we mentioned the very interesting paper of Bourgain et al. [5] which establishes a result in the spirit of Theorem 3.3 for sparse matrices. Indeed, [5] shows that for certain random matrices with sparse columns the dependence of the minimum dimension m on the mean width $$\omega (\mathcal{T})$$ and distortion δ is of the form $$m\gtrsim \frac{\omega ^{2}(\mathcal{T})}{\delta ^{2}}$$polylog$$(\frac{n}{\delta })$$. In this result, the sparsity level of the columns of the matrix (and in turn the computational complexity of the dimension reduction scheme) is controlled by a parameter which characterizes the spikiness of the set $$\mathcal{T}$$. In addition, the authors of [5] also establish results for particular $$\mathcal{T}$$ using Fast JL matrices, e.g. see [5, Section 6.2]. Recently, Pilanci & Wainwright in [29] have established a result of similar flavor to Theorem 3.3, but with suboptimal tradeoff between the allowed dimension reduction and the complexity of the set $$\mathcal{T}$$. Roughly stated, this result requires $$m\gtrsim \left (\log n\right )^{4}\frac{\omega ^{4}(\mathcal{T})}{\delta ^{2}}$$ using a subsampled Hadamard matrix combined with a diagonal matrix of i.i.d. Rademacher random variables. 
We would like to point out that our proofs also hint at an alternative proof strategy to that of [29] if one is interested in establishing $$m\gtrsim (\log n)^{4}\frac{\omega ^{4}(\mathcal{T})}{\delta ^{2}}$$. In particular, one can cover the set $$\mathcal{T}$$ with Euclidean balls of size δ. Based on Sudakov’s inequality the logarithm of the size of this cover is at most $$\frac{\omega ^{2}(\mathcal{T})}{\delta ^{2}}$$. One can then relate this cover to a cover obtained by using a random pseudo-metric such as the one defined in [31]. As a result one incurs an additional factor $$(\log n)^{4}\omega ^{2}(\mathcal{T})$$. Multiplying these two factors leads to the requirement $$m\gtrsim (\log n)^{4}\frac{\omega ^{4}(\mathcal{T})}{\delta ^{2}}$$. 4. Proofs Before we move to the proof of the main theorem, we begin by stating known results on RIP for bounded orthogonal systems and show how Theorem 3.3 follows from our main theorem (Theorem 3.1). 4.1. Proof of Theorem 3.3 for SORS matrices We first state a classical result on RIP originally due to Rudelson & Vershynin [31,35]. We state the version in [14] which holds generally for bounded orthogonal systems. We remark that the results in [31,35] as well as those of [14] are stated for the regime δ < 1. However, by going through the analysis contained in these papers carefully one can confirm that our definition of RIP (with max(δ, δ2) on the right-hand side in lieu of δ) continues to hold for δ ≥ 1. Lemma 4.1 (RIP for sparse signals, [14,31,35]) Let $$\boldsymbol{F}\in \mathbb{R}^{n\times n}$$ denote an orthonormal matrix obeying   \begin{align} \boldsymbol{F}^{\ast}\boldsymbol{F}=\boldsymbol{I}\quad\textrm{and}\quad\max_{i, j}\left|\boldsymbol{F}_{ij}\right|\le \frac{\varDelta}{\sqrt{n}}. \end{align} (4.1) Define the random subsampled matrix $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ with i.i.d. rows chosen uniformly at random from the rows of F. 
Then RIP(δ, s) holds with probability at least 1 − e−η for all δ > 0 as long as   \begin{align*} m\ge C\varDelta^{2}\frac{s\left(\log^{3} n\log m+\eta\right)}{\delta^{2}}. \end{align*} Here C > 0 is a fixed numerical constant. Applying the union bound over $$L=\lceil \log _{2} n\rceil$$ sparsity levels and using the change of variable $$\eta \rightarrow \eta +\log L$$, together with the fact that $$(\log n)^{4}+\eta \le (1+\eta )(\log n)^{4}$$, Lemma 4.1 immediately leads to the following lemma. Lemma 4.2 Consider $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ distributed as in Lemma 4.1. H obeys MRIP with sparsity s and distortion $$\tilde{\delta }>0$$ with probability 1 − e−η as long as   \begin{align*} m\geq C(1+\eta)\varDelta^{2}\frac{s(\log n)^{4}}{\tilde{\delta}^{2}}\nonumber. \end{align*} Theorem 3.3 now follows by using s = C(1 + η) and $$\tilde{\delta }=\frac{\delta }{C\max \left (1,\frac{\omega (\mathcal{T})}{\textrm{rad}(\mathcal{T})}\right )}$$ in Theorem 3.1. 4.2. Connection between JL-embedding and RIP A critical tool in our proof is a powerful result of Krahmer & Ward [22] that shows that RIP matrices with columns multiplied by a random sign pattern obey the JL Lemma. Theorem 4.3 (Discrete JL embedding via RIP, [22]) Assume $$\mathcal{T}\subset \mathbb{R}^{n}$$ is a finite set of points. Suppose $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ is a matrix satisfying RIP(δ, s) with sparsity s and distortion δ > 0 obeying   \begin{align*} {s\ge\min\left(40(\log\left(4|\mathcal{T}|\right)+\eta),n\right)\quad\textrm{and}\quad 0<\delta\leq \frac{\varepsilon}{4},} \end{align*} and let $$\boldsymbol{D}\in \mathbb{R}^{n\times n}$$ be a random diagonal matrix with the diagonal entries i.i.d. ± 1 with equal probability. 
Then the matrix A = HD obeys   \begin{align} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\leq \max(\varepsilon,\varepsilon^{2})\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!, \end{align} (4.2) simultaneously for all $$\boldsymbol{x}\in \mathcal{T}$$ with probability at least 1 − e−η. As above, the result stated in [22] restricts ε to [0, 1], and has ε rather than $$\max (\varepsilon ,\varepsilon ^{2})$$ in the right-hand side of the inequality (4.2). However, it is easy to verify that their proof (with essentially no modifications) can accommodate the result stated above. 4.3. Connecting JL to Gordon (overview of proof of Theorem 3.1) Before we provide a complete proof of Theorem 3.1, in this section we wish to provide a high-level description of our proof. The full proof can be found in Section 4.5 with some details deferred to the Appendix. For simplicity let us focus on the case where rad$$(\mathcal{T})=1$$. The main aim of Theorem 3.1 is to prove the bound   \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le\max(\delta,\delta^{2}), \end{align} (4.3) for all $$\boldsymbol{x}\in \mathcal{T}$$. A natural way to establish this bound is to use a covering number argument. Let us explain why this approach fails and then how we can fix it. In the covering number approach we cover the set $$\mathcal{T}$$ with balls of size ε as depicted in Fig. 1a. We denote the centers of these covers by $$\mathcal{N}$$. 
We then try to prove the bound in (4.3) by first controlling $$\big |\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}\big |$$ on the cover via the Discrete JL embedding result of Theorem 4.3, and then try to control how much $$\big |\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}\big |$$ deviates from $$\big |\left \|{\boldsymbol{A}}\boldsymbol{z}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{z}\right \|_{\ell _{2}}^{2}\big |$$ for the closest $$\boldsymbol{z}\in \mathcal{N}$$ inside the cover to x. This proof strategy fails because, for the deviation term to be smaller than $$\frac{1}{2}\max (\delta ,\delta ^{2})$$, ε has to be very small. In particular, for most matrices that obey the RIP the spectral norm of A roughly scales with $$\sqrt{\frac{n}{m}}$$, so that ε must be on the order of $$\frac{\delta }{\|\boldsymbol{A}\|}\sim \sqrt{\frac{m}{n}}\delta$$ where ∼/$$\gtrsim$$ denote equality/inequality up to a fixed numerical constant. This in turn means that the size of the cover needs to be very large. More specifically, applying the Sudakov inequalities (e.g. see [36, Theorem 2.2]) we have $$\left |\mathcal{N}\right |\lesssim 2^{\frac{n}{m}\frac{\omega ^{2}(\mathcal{T})}{\delta ^{2}}}$$. Now combining Lemma 4.1 together with Theorem 4.3 implies that to achieve a distortion of the order of $$\frac{1}{2}\max (\delta ,\delta ^{2})$$ on the cover, the reduced dimension m must obey   \begin{align*} m\gtrsim \frac{\log |\mathcal{N}|}{\delta^{2}}\log^{4} n \sim \frac{n}{m}\frac{\omega^{2}(\mathcal{T})}{\delta^{4}}\log^{4} n\quad\Leftrightarrow\quad m\gtrsim \sqrt{n}\frac{\omega(\mathcal{T})}{\delta^{2}}\log^{2} n, \end{align*} which is far from optimal. 
To overcome this deficiency we use successively larger and larger discrete sets $$\mathcal{T}_{0},\mathcal{T}_{1},\mathcal{T}_{2},\ldots ,\mathcal{T}_{L}$$ to approximate the set $$\mathcal{T}$$. For a point $$\boldsymbol{x}\in \mathcal{T}$$ let zℓ denote the closest point from $$\mathcal{T}_{\ell }$$ to x. We will obtain the desired bound by utilizing a telescoping sum of the form   \begin{align*} \big|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\big|\le &\left(\big|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\big|-\big|\left\|{\boldsymbol{A}}\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\big|\right)\\ &+\sum_{\ell=1}^{L} \left(\big|\left\|{\boldsymbol{A}}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\big|-\big|\left\|{\boldsymbol{A}}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\big|\right) +\big|\left\|{\boldsymbol{A}}\boldsymbol{z}_{0}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{0}\right\|_{\ell_{2}}^{2}\big|. \end{align*} Fig. 1. The figure on the left shows a standard covering of the set $$\mathcal{T}$$ with balls of radius ε, with the points in black depicting the centers of the cover. The figure on the right depicts the centers of finer and finer covers of the set $$\mathcal{T}$$, shown as black (z0), green (z1) and orange (z2) points. In order to bound $$\big |\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}\big |$$, we bound deviations along the path connecting the closest points to x from these successive covers; in the figure on the right these points are denoted by z0, z1 and z2.
We then apply the discrete JL result of Theorem 4.3 to bound each of these terms separately. We pick the size of the successive approximations to be $$|\mathcal{T}_{\ell }|=2^{2^{\ell }}$$. We also bound the different deviation terms at different distortion levels $$\big(\textrm{roughly of the order of} \ \delta _{\ell }=2^{\ell /2}\frac{\delta }{\omega (\mathcal{T})}\big)$$. The key point in our proofs is that the sizes of the successive approximations and the distortion levels are chosen carefully to balance each other out. That is, the reduction in the dimension m will take the form   \begin{align*} m\gtrsim\underset{\ell=1,2,\ldots,L}{\max} \frac{\log |\mathcal{T}_{\ell}|}{\delta_{\ell}^{2}}\log^{4} n=\underset{\ell=1,2,\ldots,L}{\max} \frac{\log \left(2^{2^{\ell}}\right)}{2^{\ell} \frac{\delta^{2}}{\omega^{2}(\mathcal{T})}}\log^{4} n= \log(2)\frac{\omega^{2}(\mathcal{T})}{\delta^{2}}\log^{4} n\sim \frac{\omega^{2}(\mathcal{T})}{\delta^{2}}\log^{4} n, \end{align*} which is essentially the result we are interested in. All of this will be made completely rigorous in the coming sections by using a generic chaining style argument. 4.4. Generic chaining related notations and definitions Our proof makes use of the machinery of generic chaining, e.g. see [33]. We gather some of the required definitions and notations in this section.
Define N0 = 1 and $$N_{\ell }=2^{2^{\ell }}$$ for ℓ ≥ 1. Definition 4.4 (Admissible sequence, [33]) Given a set $$\mathcal{T}$$, an admissible sequence is an increasing sequence ($$\mathcal{A}_{\ell }$$) of partitions of $$\mathcal{T}$$ such that $$\left |\mathcal{A}_{\ell }\right |\le N_{\ell }$$. Following [33], an increasing sequence of partitions means that every set of $$\mathcal{A}_{\ell +1}$$ is contained in a set of $$\mathcal{A}_{\ell }$$, and $$\mathcal{A}_{\ell }(t)$$ denotes the unique element of $$\mathcal{A}_{\ell }$$ that contains t. The γ2 functional is then defined as   \begin{align*} \gamma_{2}(\mathcal{T})=\inf\underset{t}{\sup}\sum_{\ell=0}^{\infty} 2^{\ell/2}\textrm{rad}(\mathcal{A}_{\ell}(t)),\nonumber \end{align*} where the infimum is taken over all admissible sequences. Let $$\bar{\mathcal{A}}_{\ell }$$ be one such optimal admissible sequence. Based on this sequence we define the successive covers. Definition 4.5 (Successive covers) Define the center point of a set to be the center of the smallest ball containing that set. Using $$\bar{\mathcal{A}}_{\ell }$$ we construct successive covers $$\mathcal{T}_{\ell }$$ of $$\mathcal{T}$$ by taking the center point of each set of $$\bar{\mathcal{A}}_{\ell }$$. Let eℓ(v) be the associated approximation error of the cover with respect to a point v, i.e. $$e_{\ell }(\boldsymbol{v})=\textrm{dist}(\boldsymbol{v},\mathcal{T}_{\ell })$$. Then for all $$\boldsymbol{v}\in \mathcal{T}$$, the γ2 functional obeys   \begin{align*} \sum_{\ell=0}^{\infty} 2^{\ell/2}e_{\ell}(\boldsymbol{v}) \leq \gamma_{2}(\mathcal{T}).\nonumber \end{align*} It is well known that $$\gamma _{2}(\mathcal{T})$$ and the Gaussian width $$\omega (\mathcal{T})$$ are of the same order.
More precisely, for a fixed numerical constant C  \begin{align*} C^{-1}\omega(\mathcal{T})\leq \gamma_{2}(\mathcal{T})\leq C\omega(\mathcal{T}).\nonumber \end{align*} Given the distortion δ in the statement of Theorem 3.1 we also define different scales of distortion   \begin{align*} \delta_{0}=\delta,\,\delta_{1}=2^{1/2}\delta,\,\dots,\,\delta_{L}=2^{L/2}\delta\nonumber, \end{align*} with $$L=\lceil \log _{2} n\rceil$$. 4.5. Proof of Theorem 3.1 Without loss of generality we assume that rad$$(\mathcal{T})=1$$. We begin by noting that the MRIP property combined with the powerful JL-embedding result stated in Theorem 4.3 allows for JL embedding at different distortion levels. We apply such an argument to successively more refined covers of the set $$\mathcal{T}$$ and at different distortion scales inside a generic chaining type argument to arrive at the proof for an arbitrary (and potentially continuous) set $$\mathcal{T}$$. We should point out that one can also follow an alternative approach which leads to the same conclusion. Instead of using the multi-resolution RIP, we could have defined a ‘multi-resolution embedding property’ for the mapping A that isometrically maps a finite set of points $$\mathcal{T}$$ with a near optimal set cardinality-distortion tradeoff at varying levels. One can show that this property also implies isometric embedding of a continuous set $$\mathcal{T}$$. We begin by stating a lemma which shows isometric embedding as well as a few other properties for points belonging to the refined covers $$\mathcal{T}_{\ell }$$ at different distortion levels δℓ. The proof of this lemma is deferred to Appendix A. Lemma 4.6 Suppose $$\boldsymbol{H}\in \mathbb{R}^{m\times n}$$ obeys MRIP$$\big(s,\frac{\delta }{4}\big)$$ with distortion level δ and sparsity s = 200(1 + η). Furthermore, let $$\boldsymbol{D}\in \mathbb{R}^{n\times n}$$ be a diagonal matrix with a random i.i.d. sign pattern on the diagonal and set A = HD.
Also let $$\mathcal{T}_{\ell }$$ be successive refinements of the set $$\mathcal{T}$$ from Definition 4.5. Then, with probability at least $$1-\exp (-\eta )$$ the following inequalities hold simultaneously for all ℓ = 1, 2, …, L. For all $$\boldsymbol{v}\in \mathcal{T}_{\ell -1}\cup \mathcal{T}_{\ell }\cup (\mathcal{T}_{\ell -1}-\mathcal{T}_{\ell })$$,   \begin{align} \left\|{\boldsymbol{A}}\boldsymbol{v}\right\|_{\ell_{2}}\leq \left(1+2^{\ell/2}\delta\right)\left\|\boldsymbol{v}\right\|_{\ell_{2}}\!. \end{align} (4.4) For all $$\boldsymbol{v}\in \mathcal{T}_{\ell -1}\cup \mathcal{T}_{\ell }\cup (\mathcal{T}_{\ell -1}-\mathcal{T}_{\ell })$$,   \begin{align} |\left\|{\boldsymbol{A}}\boldsymbol{v}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{v}\right\|_{\ell_{2}}^{2}|\leq \max\big(2^{\ell/2}\delta,2^{\ell}\delta^{2}\big)\cdot\left\|\boldsymbol{v}\right\|_{\ell_{2}}^{2}\!.\end{align} (4.5) For all $$\boldsymbol{u}\in \mathcal{T}_{\ell -1}$$ and $$\boldsymbol{v}\in \mathcal{T}_{\ell }-\{\boldsymbol{u}\}:=\{\boldsymbol{y}-\boldsymbol{u}: \boldsymbol{y}\in \mathcal{T}_{\ell }\}$$,   \begin{align} \left|\boldsymbol{u}^{\ast}{\boldsymbol{A}}^{\ast}{\boldsymbol{A}}\boldsymbol{v}-\boldsymbol{u}^{\ast}\boldsymbol{v}\right|\leq \max\big(2^{\ell/2}\delta,2^{\ell}\delta^{2}\big)\cdot\left\|\boldsymbol{u}\right\|_{\ell_{2}}\left\|\boldsymbol{v}\right\|_{\ell_{2}}\!. \end{align} (4.6) With this lemma in place we are ready to prove our main theorem. To this aim, given a point $$\boldsymbol{x}\in \mathcal{T}$$, for ℓ = 1, 2, …, L let zℓ be the closest neighbor of x in $$\mathcal{T}_{\ell }$$. We also define zL+1 = x. We note that zℓ depends on x; for ease of presentation we do not make this dependence explicit. We also drop x from the distortion term eℓ(x) and simply write eℓ.
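The chain points zℓ and errors eℓ are easy to see on a toy example. The snippet below builds successively larger random subsets of a random point set as stand-in covers (true successive covers come from an optimal admissible sequence; random subsets are purely an illustrative assumption) and checks the triangle-inequality bound ‖zℓ − zℓ−1‖ ≤ eℓ + eℓ−1 that underlies (4.7).

```python
import numpy as np

rng = np.random.default_rng(2)

d, L = 8, 4
# Toy set T: random points on the unit sphere, so rad(T) <= 1.
T = rng.standard_normal((2000, d))
T /= np.linalg.norm(T, axis=1, keepdims=True)

# Stand-in covers: T_ell is a random subset of T of size 2^(2^ell),
# clipped to |T|, mimicking the doubly exponential cover sizes N_ell.
covers = [T[rng.choice(len(T), size=min(2**2**ell, len(T)), replace=False)]
          for ell in range(L + 1)]

x = T[0]
z = [C[np.argmin(np.linalg.norm(C - x, axis=1))] for C in covers]  # chain points
e = [np.linalg.norm(C - x, axis=1).min() for C in covers]          # e_ell(x)

# ||z_ell - z_{ell-1}|| <= e_ell + e_{ell-1}, as in (4.7).
ok = all(np.linalg.norm(z[l] - z[l - 1]) <= e[l] + e[l - 1] + 1e-12
         for l in range(1, L + 1))
print(ok)
```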
Now observe that for all ℓ = 1, 2, …, L, we have   \begin{align} \left\|{\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1}\right\|_{\ell_{2}}\leq \left\|{\boldsymbol{z}}_{\ell}-\boldsymbol{x}\right\|_{\ell_{2}}+\left\|{\boldsymbol{z}}_{\ell-1}-\boldsymbol{x}\right\|_{\ell_{2}}\leq e_{\ell}+e_{\ell-1}\leq 2e_{\ell-1}. \end{align} (4.7) We are interested in bounding $$|\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}|$$ for all $$\boldsymbol{x}\in \mathcal{T}$$. Define $$\tilde{L}=\max \left (0,\lfloor 2\log _{2}\left (\frac{1}{\delta }\right )\rfloor \right )$$, and note that applying the triangle inequality, we have   \begin{align} |\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}|\le&\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{A}}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|\nonumber\\ \le&\sum_{\ell=1}^{\tilde{L}}\left(\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\right)\nonumber\\ &+\left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{A}}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|\boldsymbol{A}\boldsymbol{z}_{0}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{0}\right\|_{\ell_{2}}^{2}\right|\!.
\end{align} (4.8) First note that by Lemma 4.6  \begin{align*}\left |\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{0}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{z}}_{0}\right\|_{\ell_{2}}^{2}\right|\leq \max\left(\delta,\delta^{2}\right)\left\|{\boldsymbol{z}}_{0}\right\|_{\ell_{2}}^{2}\leq \max\left(\delta,\delta^{2}\right)\!. \end{align*} Using the above inequality in (4.8), we arrive at   \begin{align}\left |\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le&\sum_{\ell=1}^{\tilde{L}}\left(\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\right)\nonumber\\ &+\left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{A}}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|+\max\left(\delta,\delta^{2}\right)\!. \end{align} (4.9) Lemma 4.7, whose proof is deferred to Appendix B, utilizes Lemma 4.6 to bound each of the first three terms in (4.9). Before getting into the details of these bounds we would like to point out that (4.9), as well as the results presented in Lemma 4.7 below, are derived under the assumption that $$\tilde{L}\le L$$. A proper modification of the argument allows us to bound $$|\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}|$$ even when $$\tilde{L}> L$$. We explain this argument in complete detail in Appendix C.
Lemma 4.7 Under the assumptions of Theorem 3.1 the following three inequalities hold   \begin{align} \sum_{\ell=1}^{\tilde{L}}\left(\left|\|\boldsymbol{A}\boldsymbol{z}_{\ell}\|_{\ell_{2}}^{2}-\|\boldsymbol{z}_{\ell}\|_{\ell_{2}}^{2}\right|-\left|\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\|_{\ell_{2}}^{2}-\|\boldsymbol{z}_{\ell-1}\|_{\ell_{2}}^{2}\right|\right)\le 10\sqrt{2}\delta\gamma_{2}(\mathcal{T}), \end{align} (4.10)  \begin{align} \left|\|{\boldsymbol{A}}\boldsymbol{x}\|_{\ell_{2}}^{2}-\|{\boldsymbol{A}}\boldsymbol{z}_{\tilde{L}}\|_{\ell_{2}}^{2}\right|+\left|\|\boldsymbol{x}\|_{\ell_{2}}^{2}-\|\boldsymbol{z}_{\tilde{L}}\|_{\ell_{2}}^{2}\right|\le 32\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})+16\sqrt{2}\delta\gamma_{2}(\mathcal{T}) \end{align} (4.11) and   \begin{align} \left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|\le4\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})+4\sqrt{2}\delta\gamma_{2}(\mathcal{T}). \end{align} (4.12) We now utilize this lemma to complete the proof of the theorem. 
We plug in the bounds from (4.10), (4.11) and (4.12) into (4.9) and use the fact that $$\gamma _{2}(\mathcal{T})\le C\omega (\mathcal{T})$$ for a fixed numerical constant C, to conclude that for $$\tilde{L}\le L$$, we have   \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|&\le 10\sqrt{2}\delta\gamma_{2}(\mathcal{T})+32\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})+16\sqrt{2}\delta\gamma_{2}(\mathcal{T})+4\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})\nonumber\\ &\quad +4\sqrt{2}\delta\gamma_{2}(\mathcal{T})+\max(\delta,\delta^{2})\nonumber\\ &\le 36\delta^{2}C^{2}\omega^{2}(\mathcal{T})+30\sqrt{2}C\delta\omega(\mathcal{T})+\max(\delta,\delta^{2})\nonumber\\ &\le 79\cdot\max\left(C\delta\omega(\mathcal{T}),C^{2}\delta^{2}\omega^{2}(\mathcal{T})\right)+\max(\delta,\delta^{2})\nonumber\\ &\le 80\cdot\max\left(C\delta\left(\max(1,\omega(\mathcal{T}))\right),C^{2}\delta^{2}\left(\max(1,\omega(\mathcal{T}))\right)^{2}\right)\!. \end{align} (4.13) We thus conclude that for all $$\boldsymbol{x}\in \mathcal{T}$$  \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le 80\cdot\max\left(C\delta\left(\max(1,\omega(\mathcal{T}))\right),C^{2}\delta^{2}\left(\max(1,\omega(\mathcal{T}))\right)^{2}\right)\! . \end{align} (4.14) Note that, assuming MRIP$$\big(s,\frac{\delta }{4}\big)$$ with s = 200(1 + η), we have arrived at (4.14). Applying the change of variable   \begin{align*} \delta\rightarrow\frac{\delta}{320C\max\left(1,\omega(\mathcal{T})\right)}, \end{align*} we can conclude that under the stated assumptions of the theorem for all $$\boldsymbol{x}\in \mathcal{T}$$  \begin{align*} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le\max(\delta,\delta^{2}), \end{align*} completing the proof. Funding B.R.
is generously supported by ONR awards N00014-11-1-0723 and N00014-13-1-0129, NSF awards CCF-1148243 and CCF-1217058, AFOSR award FA9550-13-1-0138 and a Sloan Research Fellowship. S.O. was generously supported by the Simons Institute for the Theory of Computing and NSF award CCF-1217058. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018 and DARPA XData Award FA8750-12-2-0331. Acknowledgements The authors are thankful for the gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc., Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware, and also want to thank Ahmed El Alaoui for a careful reading of the manuscript. We also thank Sjoerd Dirksen for helpful comments and for pointing us to some useful references on generalizing Gordon’s result to matrices with sub-Gaussian entries. We also thank Jelani Nelson for a careful reading of this paper, very helpful comments/insights and for pointing us to the improved RIP result of Bourgain [4], and Mien Wang for noticing that the telescoping sum is not necessary at the beginning of section 4.4.3. We would also like to thank Christopher J. Rozell for bringing the paper [37] on stable and efficient embedding of manifold signals to our attention. References 1. Achlioptas, D. (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66, 671--687. 2. Ailon, N. & Liberty, E. (2013) An almost optimal unrestricted fast Johnson-Lindenstrauss transform. ACM Trans. Algorithms (TALG), 9, 21. 3. Ailon, N. & Rauhut, H. (2014) Fast and RIP-optimal transforms. Discrete Comput. Geometry, 52, 780--798. 4. Bourgain, J.
(2014) An improved estimate in the Restricted Isometry problem. In Geometric Aspects of Functional Analysis. Cham: Springer, pp. 65--70. 5. Bourgain, J., Dirksen, S. & Nelson, J. (2013) Toward a unified theory of sparse dimensionality reduction in Euclidean space. ArXiv preprint, arXiv:1311.2542. 6. Candes, E. & Recht, B. (2012) Exact matrix completion via convex optimization. Commun. ACM, 55, 111--119. 7. Candes, E. J. & Tao, T. (2005) Decoding by linear programming. IEEE Trans. Info. Theory, 51, 4203--4215. 8. Chandrasekaran, V., Recht, B., Parrilo, P. A. & Willsky, A. S. (2012) The convex geometry of linear inverse problems. Foundations of Comput. Mathematics, 12, 805--849. 9. Cheraghchi, M., Guruswami, V. & Velingker, A. (2013) Restricted isometry of Fourier matrices and list decodability of random linear codes. SIAM J. Comput., 42, 1888--1914. 10. Dasgupta, S. & Gupta, A. (2003) An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22, 60--65. 11. Dirksen, S. (2013) Tail bounds via generic chaining. ArXiv preprint, arXiv:1309.3522. 12. Dirksen, S. (2014) Dimensionality reduction with subgaussian matrices: a unified theory. ArXiv preprint, arXiv:1402.3973. 13. Do, T. T., Gan, L., Chen, Y., Nguyen, N. & Tran, T. D. (2009) Fast and efficient dimensionality reduction using structurally random matrices. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 1821--1824. 14. Foucart, S. & Rauhut, H. (2013) Random sampling in bounded orthonormal systems. A Mathematical Introduction to Compressive Sensing. Basel: Birkhäuser Springer, pp. 367--433. 15. Frankl, P. & Maehara, H. (1988) The Johnson-Lindenstrauss lemma and the sphericity of some graphs. J. Combinatorial Theory, Series B, 44, 355--362.
16. Gordon, Y. (1988) On Milman’s inequality and random subspaces which escape through a mesh in $$\mathbb{R}^n$$. In Geometric Aspects of Functional Analysis. Berlin, Heidelberg: Springer, pp. 84--106. 17. Haviv, I. & Regev, O. (2015) The restricted isometry property of subsampled Fourier matrices. ArXiv preprint, arXiv:1507.01768. 18. Johnson, W. B. & Lindenstrauss, J. (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemp. Mathematics, 26, 189--206. 19. Kane, D. M. & Nelson, J. (2010) A derandomized sparse Johnson-Lindenstrauss transform. ArXiv preprint, arXiv:1006.3585. 20. Klartag, B. & Mendelson, S. (2005) Empirical processes and random projections. J. Funct. Anal., 225, 229--245. 21. Koltchinskii, V. & Mendelson, S. (2013) Bounding the smallest singular value of a random matrix without concentration. ArXiv preprint, arXiv:1312.3580. 22. Krahmer, F. & Ward, R. (2011) New and improved Johnson-Lindenstrauss embeddings via the Restricted Isometry Property. SIAM J. Math. Anal., 43, 1269--1281. 23. Liberty, E., Ailon, N. & Singer, A. (2011) Dense fast random projections and lean Walsh Transforms. Discrete Comput. Geometry, 45, 34--44. 24. Mendelson, S. (2014) Learning without concentration. ArXiv preprint, arXiv:1401.0304. 25. Mendelson, S., Pajor, A. & Tomczak-Jaegermann, N. (2007) Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric Funct. Anal., 17, 1248--1282. 26. Nelson, J., Price, E. & Wootters, M. (2014) New constructions of RIP matrices with fast multiplication and fewer rows. Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 1515--1528. 27. Oymak, S., Recht, B. & Soltanolkotabi, M.
(2017) Sharp Time–Data Tradeoffs for Linear Inverse Problems. IEEE Transactions on Information Theory. ArXiv preprint, arXiv:1507.04793. 28. Oymak, S. & Tropp, J. A. (2015) Universality laws for randomized dimension reduction, with applications. Information and Inference: A Journal of the IMA. ArXiv preprint, arXiv:1511.09433. 29. Pilanci, M. & Wainwright, M. J. (2014) Randomized sketches of convex programs with sharp guarantees. IEEE International Symposium on Information Theory (ISIT), 921--925. 30. Pilanci, M. & Wainwright, M. J. (2015) Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61, 5096--5115. 31. Rudelson, M. & Vershynin, R. (2006) Sparse reconstruction by convex relaxation: Fourier and gaussian measurements. 40th Annual Conference on Information Sciences and Systems, pp. 207--212. 32. Sarlos, T. (2006) Improved approximation algorithms for large matrices via random projections. Foundations of Computer Science FOCS’06. 47th Annual IEEE Symposium on, IEEE, pp. 143--151. 33. Talagrand, M. (2006) The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer Science & Business Media. 34. Tropp, J. A. (2014) Convex recovery of a structured signal from independent random linear measurements. ArXiv preprint, arXiv:1405.1102. 35. Vershynin, R. (2010) Introduction to the non-asymptotic analysis of random matrices. ArXiv preprint, arXiv:1011.3027. 36. Vershynin, R. (2011) Lectures in geometric functional analysis. Unpublished manuscript. Available at http://www-personal.umich.edu/romanv/papers/GFA-book/GFA-book.pdf. 37. Yap, H. L., Wakin, M. B. & Rozell, C. J. (2013) Stable manifold embeddings with structured random matrices. IEEE J. Selected Top. Signal Processing, 7, 720--730. Appendix A.
Proof of Lemma 4.6 For a set $$\mathcal{M}$$ we define the normalized set $$\widetilde{\mathcal{M}}=\left \{\frac{\boldsymbol{v}}{\left \|\boldsymbol{v}\right \|_{\ell _{2}}}:\ \boldsymbol{v}\in \mathcal{M}\right \}$$. We shall also define   \begin{align*} \mathcal{Q}_{\ell}=\mathcal{T}_{\ell-1}\cup \mathcal{T}_{\ell}\cup (\mathcal{T}_{\ell}-\mathcal{T}_{\ell-1})\cup \left(\widetilde{(\mathcal{T}_{\ell}-\mathcal{T}_{\ell-1})}-\widetilde{\mathcal{T}}_{\ell-1}\right)\cup \left(\widetilde{(\mathcal{T}_{\ell}-\mathcal{T}_{\ell-1})}+\widetilde{\mathcal{T}}_{\ell-1}\right)\!. \end{align*} We will first prove that for ℓ = 1, 2, …, L and every $$\boldsymbol{v}\in \mathcal{Q}_{\ell }$$  \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{v}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{v}\right\|_{\ell_{2}}^{2}\right|\leq \max\left(2^{\ell/2}\delta,2^{\ell}\delta^{2}\right)\cdot\left\|\boldsymbol{v}\right\|_{\ell_{2}}^{2}, \end{align} (A.1) holds with probability at least 1 − e−η. We then explain how the other claims follow from this result. To this aim, note that by the assumptions of the lemma MRIP$$\big(s,\frac{\delta }{4}\big)$$ holds for the matrix H with s = 200(1 + η). By definition this is equivalent to RIP$$\big(s_{\ell },\frac{\delta _{\ell }}{4}\big)$$ holding for ℓ = 1, 2, …, L with $$\big(s_{\ell },\frac{\delta _{\ell }}{4}\big)=\big(2^{\ell } s,\frac{2^{\ell /2}\delta }{4}\big)$$.
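The proof below gives level ℓ a failure budget of e−ℓ(η+1) and combines the levels with a union bound; the geometric-series estimate $$\sum_{\ell \ge 1} \textrm{e}^{-\ell(\eta+1)} = \frac{\textrm{e}^{-(\eta+1)}}{1-\textrm{e}^{-(\eta+1)}} \le \textrm{e}^{-\eta}$$ is easy to verify numerically (the η values and the truncation level below are illustrative assumptions):

```python
import math

def total_failure_budget(eta, L=60):
    # Union-bound total: sum_{ell=1}^{L} exp(-ell * (eta + 1)).
    return sum(math.exp(-ell * (eta + 1)) for ell in range(1, L + 1))

for eta in [0.0, 0.5, 1.0, 5.0]:
    total = total_failure_budget(eta)
    closed_form = math.exp(-(eta + 1)) / (1 - math.exp(-(eta + 1)))
    assert abs(total - closed_form) < 1e-12  # truncation error is negligible
    assert total <= math.exp(-eta)           # the bound used in the proof
print("union bound verified")
```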
Now observe that the cardinality of $$\mathcal{Q}_{\ell }$$ obeys $$\left |\mathcal{Q}_{\ell }\right |\le 5N_{\ell }^{2}$$ with $$N_{\ell }=2^{2^{\ell }}$$ which implies   \begin{align} s_{\ell}&=2^{\ell} s\nonumber\\ &=2^{\ell}\left(200+200\eta\right)\nonumber\\ &\ge2^{\ell}\left(40(\log 2)(\log_{2} (20)+2)+\frac{40}{2}(\eta+1)\right)\nonumber\\ &\ge 2^{\ell}\left(40(\log 2)\left(\frac{\log_{2} (20)}{2^{\ell}}+2\right)+\frac{40\ell}{2^{\ell}}(\eta+1)\right)\nonumber\\ &\ge 40(\log 2)\left(\log_{2}(20)+2^{\ell+1}\right)+40\ell(\eta+1)\nonumber\\[3pt] &\ge 40\log\left(4\left|\mathcal{Q}_{\ell}\right|\right)+40\ell(\eta+1)\nonumber\\[3pt] &\ge\min\left(40\log\left(4\left|\mathcal{Q}_{\ell}\right|\right)+40\ell(\eta+1),n\right)\!. \end{align} (A.2) By the MRIP assumption, RIP$$\big(s_{\ell },\frac{\delta _{\ell }}{4}\big)$$ holds for H. This together with (A.2) allows us to apply Theorem 4.3 to conclude that for each ℓ = 1, 2, …, L and every $$\boldsymbol{x}\in \mathcal{Q}_{\ell }$$  \begin{align*} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le\max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!, \end{align*} holds with probability at least 1 − e−ℓ(η+1). Noting that   \begin{align*} \sum_{\ell=1}^{L} \textrm{e}^{-\ell(\eta+1)}\le\sum_{\ell=1}^{\infty} \textrm{e}^{-\ell(\eta+1)}=\frac{\textrm{e}^{-(\eta+1)}}{1-\textrm{e}^{-(\eta+1)}}\le \textrm{e}^{-\eta}, \end{align*} completes the proof of (A.1) by the union bound. We note that since $$\mathcal{T}_{\ell -1}\cup \mathcal{T}_{\ell }\cup (\mathcal{T}_{\ell }-\mathcal{T}_{\ell -1})\subset \mathcal{Q}_{\ell }$$, (A.1) immediately implies (4.5). The proof of (4.4) follows from the proof of (4.5) by noting that   \begin{align*} (1+\delta_{\ell})^{2}\geq 1+\max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\!.
\end{align*} To prove (4.6), first note that $$\frac{\boldsymbol{v}}{\left \|\boldsymbol{v}\right \|_{\ell _{2}}}-\frac{\boldsymbol{u}}{\left \|\boldsymbol{u}\right \|_{\ell _{2}}}\in \widetilde{(\mathcal{T}_{\ell }-\mathcal{T}_{\ell -1})}-\widetilde{\mathcal{T}}_{\ell -1}$$ and $$\frac{\boldsymbol{u}}{\left \|\boldsymbol{u}\right \|_{\ell _{2}}}+\frac{\boldsymbol{v}}{\left \|\boldsymbol{v}\right \|_{\ell _{2}}}\in \widetilde{(\mathcal{T}_{\ell }-\mathcal{T}_{\ell -1})}+\widetilde{\mathcal{T}}_{\ell -1}$$. Hence, applying (A.1)   \begin{align*} \left|\left\|{\boldsymbol{A}}\left(\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}+\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}\right)\right\|_{\ell_{2}}^{2}-\left\|\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}+\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}\right|\leq& \max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\left\|\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}+\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}\nonumber\\[3pt] \left|\left\|{\boldsymbol{A}}\left(\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}-\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}\right)\right\|_{\ell_{2}}^{2}-\left\|\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}-\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}\right|\leq& \max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\left\|\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}-\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}. 
\end{align*} Summing these two inequalities and applying the triangle inequality, we conclude that   \begin{align*} \frac{1}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}\left\|\boldsymbol{v}\right\|_{\ell_{2}}}\left|\boldsymbol{u}^{\ast}{\boldsymbol{A}}^{\ast}{\boldsymbol{A}}\boldsymbol{v}-\boldsymbol{u}^{\ast}\boldsymbol{v}\right|&\leq \frac{1}{4}\max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\left(\left\|\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}+\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}+\left\|\frac{\boldsymbol{v}}{\left\|\boldsymbol{v}\right\|_{\ell_{2}}}-\frac{\boldsymbol{u}}{\left\|\boldsymbol{u}\right\|_{\ell_{2}}}\right\|_{\ell_{2}}^{2}\right)\\ &=\max\left(\delta_{\ell},\delta_{\ell}^{2}\right)\!, \end{align*} completing the proof of (4.6). Appendix B. Proof of Lemma 4.7 We prove each of the three inequalities in the next three sections. B.1 Proof of inequality (4.10) For $$1\le \ell \le \tilde{L}$$, we have δℓ = 2ℓ/2δ ≤ 1 so that $$\max (\delta _{\ell },\delta _{\ell }^{2})=\delta _{\ell }$$. Thus, applying Lemma 4.6 together with (4.7) we arrive at   \begin{align} \left|\left\|{\boldsymbol{A}}({\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1})\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\leq 2^{\ell/2}\delta\left\|\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\le 2^{\ell/2+2}\delta e_{\ell-1}^{2}, \end{align} (B.1) and   \begin{align} \left|\langle{\boldsymbol{A}}({\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1}),{\boldsymbol{A}}{\boldsymbol{z}}_{\ell-1}\rangle- \langle{\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1},{\boldsymbol{z}}_{\ell-1}\rangle\right|\leq 2^{\ell/2+1}\delta e_{\ell-1}\!.
\end{align} (B.2) The triangle inequality yields   \begin{align*} \left|\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|=& \left|\left\|{\boldsymbol{A}}({\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1})+{\boldsymbol{A}}{\boldsymbol{z}}_{\ell-1} \right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|\\ \le&\left|\left\|{\boldsymbol{A}}({\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1})\right\|_{\ell_{2}}^{2}- \left\|\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|+\left|\left\|{\boldsymbol{A}} {\boldsymbol{z}}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\\ &+2\left|\langle{\boldsymbol{A}}({\boldsymbol{z}}_{\ell}-{\boldsymbol{z}}_{\ell-1}),{\boldsymbol{A}} {\boldsymbol{z}}_{\ell-1}\rangle-\langle\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1},\boldsymbol{z}_{\ell-1}\rangle\right|\!. \end{align*} Combining the latter with (B.1) and (B.2) we arrive at the following recursion   \begin{align} \left|\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\le\delta\left(2e_{\ell-1}+4e_{\ell-1}^{2}\right)2^{\ell/2}.
\end{align} (B.3) Adding both sides of the above inequality for $$1\leq \ell \leq \tilde{L}$$, and using $$e_{\ell }^{2}\leq 2e_{\ell }\leq 4$$, we arrive at   \begin{align*} \sum_{\ell=1}^{\tilde{L}}\left(\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\right)\le&10\delta\left(\sum_{\ell=1}^{\tilde{L}}2^{\ell/2}e_{\ell-1}\right)\nonumber\\ =&10\sqrt{2}\delta\left(\sum_{\ell=0}^{\tilde{L}-1}2^{\ell/2}e_{\ell}\right)\nonumber\\ \le&10\sqrt{2}\delta\gamma_{2}(\mathcal{T}). \end{align*} B.2 Proof of inequality (4.11) To bound the second term we begin by bounding $$\left |\left \|\boldsymbol{A}\boldsymbol{x}\right \|_{\ell _{2}}-\left \|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right \|_{\ell _{2}}\right |$$. To this aim first note that, since MRIP$$\big(s,\frac{\delta }{4}\big)$$ holds for H with s = 200(1 + η), then sL = 200 × 2L(1 + η) ≥ n. As a result for all $$\boldsymbol{x}\in \mathbb{R}^{n}$$, we have   \begin{align*} \left|\left\|\boldsymbol{H}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le\max\left(\frac{1}{4}\delta_{L},\frac{1}{16}{\delta_{L}^{2}}\right)\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!. \end{align*} Using the simple inequality $$1+\max (\delta ,\delta ^{2})\le (1+\delta )^{2}$$, this immediately implies   \begin{align} \left\|\boldsymbol{A}\right\|=\left\|\boldsymbol{H}\right\|\le\frac{1}{4}2^{\frac{L}{2}}\delta+1. \end{align} (B.4) Furthermore, by the definition of eL we have $$\left \|\boldsymbol{x}-\boldsymbol{z}_{L}\right \|_{\ell _{2}}\le e_{L}$$.
Combining these two inequalities with repeated use of the triangle inequality, we obtain   \begin{align*} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|=&\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{L}\right\|_{\ell_{2}}+\left\|\boldsymbol{A}\boldsymbol{z}_{L}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\\ \le&\left\|\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L})\right\|_{\ell_{2}}+\left\|\boldsymbol{A}(\boldsymbol{z}_{L}-\boldsymbol{z}_{\tilde{L}})\right\|_{\ell_{2}}\\ \le&\left\|\boldsymbol{A}\right\|\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}+\left\|\sum_{\ell=\tilde{L}+1}^{L}\boldsymbol{A}(\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1})\right\|_{\ell_{2}}\\ \le&\left(\frac{1}{4}2^{\frac{L}{2}}\delta+1\right)e_{L}+\sum_{\ell=\tilde{L}+1}^{L}\left\|\boldsymbol{A}(\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1})\right\|_{\ell_{2}}\!. \end{align*} Using equation (4.4) of Lemma 4.6 in the above inequality and noting that $$2^{\ell/2}\delta\ge 1$$ for $$\ell>\tilde{L}$$, we conclude that   \begin{align} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|&\le\left(\frac{1}{4}2^{\frac{L}{2}}\delta+1\right)e_{L}+\sum_{\ell=\tilde{L}+1}^{L}(1+2^{\ell/2}\delta)\left\|\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}\nonumber\\ &\le\frac{5}{4}2^{L/2}\delta e_{L}+\sum_{\ell=\tilde{L}+1}^{L}2^{\ell/2+1}\delta\left\|\boldsymbol{z}_{\ell}-\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}\nonumber\\ &\le\frac{5}{4}\delta2^{L/2}e_{L}+4\sqrt{2}\delta\sum_{\ell=\tilde{L}+1}^{L}2^{(\ell-1)/2}e_{\ell-1}\nonumber\\ &\le 4\sqrt{2}\delta\left(\sum_{\ell=\tilde{L}}^{L} 2^{\ell/2}e_{\ell}\right)\nonumber\\ &\le 4\sqrt{2}\delta\gamma_{2}(\mathcal{T}).
\end{align} (B.5) Now note that by equation (4.4) of Lemma 4.6 and using the fact that rad$$(\mathcal{T})=1$$, we know that $$\left \|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right \|_{\ell _{2}}\le 1+2^{\tilde{L}/2}\delta \le 2$$. Thus, using this inequality together with (B.5) we arrive at   \begin{align*} \left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|&\le\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}+\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\nonumber\\ &\le\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|^{2}+2\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\left\|\boldsymbol{A}\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\nonumber\\ &\le 32\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})+16\sqrt{2}\delta\gamma_{2}(\mathcal{T}). \end{align*}

B.3 Proof of inequality (4.12)

As with the second term, we begin by bounding $$\left |\left \|\boldsymbol{x}\right \|_{\ell _{2}}-\left \|\boldsymbol{z}_{\tilde{L}}\right \|_{\ell _{2}}\right |$$. Noting that $$2^{\ell /2}\delta \ge \frac{1}{\sqrt{2}}$$ for $$\ell \ge \tilde{L}$$, we have   \begin{align*} \left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\le\left\|\boldsymbol{x}-\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\le e_{\tilde{L}}\le \sqrt{2}\cdot2^{\tilde{L}/2}\delta e_{\tilde{L}}\le \sqrt{2}\delta\gamma_{2}(\mathcal{T}).
\end{align*} Thus, using this inequality together with the fact that $$\left \|\boldsymbol{z}_{\tilde{L}}\right \|_{\ell _{2}}\le 1$$, we arrive at   \begin{align*} \left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}^{2}\right|&=\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\cdot\left(\left\|\boldsymbol{x}\right\|_{\ell_{2}}+\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right)\nonumber\\ &\le\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|^{2}+2\left|\left\|\boldsymbol{x}\right\|_{\ell_{2}}-\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\right|\left\|\boldsymbol{z}_{\tilde{L}}\right\|_{\ell_{2}}\nonumber\\ &\le 4\delta^{2}{\gamma_{2}^{2}}(\mathcal{T})+4\sqrt{2}\delta\gamma_{2}(\mathcal{T}). \end{align*}

Appendix C. Establishing an analog of (4.9) and the bounds (4.10), (4.11) and (4.12) when $$\tilde{L}>L$$

This section describes how an analog of (4.9), as well as the subsequent bounds in Sections B.1, B.2 and B.3, can be derived when $$\tilde{L}>L$$. Using arguments similar to those leading to the derivation of (4.9), we arrive at   \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\le&\sum_{\ell=1}^{L}\left(\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\right)\nonumber\\ &+\left|\left\|\boldsymbol{A}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\right|+\max\left(\delta,\delta^{2}\right)\!.
\end{align} (C.1) The main difference from the $$\tilde{L}\le L$$ case is that we let the summation in the first term go up to L and, instead of studying the second line of (4.9), we directly bound the difference $$\left |\left \|{\boldsymbol{A}}\boldsymbol{x}\right \|_{\ell _{2}}^{2}-\left \|\boldsymbol{x}\right \|_{\ell _{2}}^{2}\right |-\left |\left \|{\boldsymbol{A}}{\boldsymbol{z}}_{L}\right \|_{\ell _{2}}^{2}-\left \|{\boldsymbol{z}}_{L}\right \|_{\ell _{2}}^{2}\right |$$ in (C.1). We now turn our attention to bounding the first two terms in (C.1). For the first term in (C.1), an argument identical to the derivation of (4.10) in Section B.1 allows us to conclude   \begin{align} \sum_{\ell=1}^{L}\left(\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell}\right\|_{\ell_{2}}^{2}\right|-\left|\left\|\boldsymbol{A}\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{z}_{\ell-1}\right\|_{\ell_{2}}^{2}\right|\right)\le10\sqrt{2}\delta\gamma_{2}(\mathcal{T}).
\end{align} (C.2) To bound the second term in (C.1), note that we have   \begin{align} &\left|\|{\boldsymbol{A}}\boldsymbol{x}\|_{\ell_{2}}^{2}-\|\boldsymbol{x}\|_{\ell_{2}}^{2}\right|-\left|\|{\boldsymbol{A}}{\boldsymbol{z}}_{L}\|_{\ell_{2}}^{2}-\|{\boldsymbol{z}}_{L}\|_{\ell_{2}}^{2}\right|\nonumber\\ &\quad\le\left|\left(\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\right)-\left(\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\right)\right|\!,\nonumber\\ &\quad=\left|\left(\left\|{\boldsymbol{A}}\left(\boldsymbol{x}-\boldsymbol{z}_{L}\right)+{\boldsymbol{A}}\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\right)-\left(\left\|(\boldsymbol{x}-\boldsymbol{z}_{L})+\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}-\left\|{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\right)\right|\!,\nonumber\\ &\quad=\left|\left(\left\|\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L})\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\right)+2\left(\langle\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L}),{\boldsymbol{A}}\boldsymbol{z}_{L}\rangle-\langle\boldsymbol{x}-\boldsymbol{z}_{L},\boldsymbol{z}_{L}\rangle\right)\right|\!,\nonumber\\ &\quad\le\left|\left\|\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L})\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\right|+2\left|\langle\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L}),{\boldsymbol{A}}\boldsymbol{z}_{L}\rangle-\langle\boldsymbol{x}-\boldsymbol{z}_{L},\boldsymbol{z}_{L}\rangle\right|\!,\nonumber\\
&\quad=\left|\left\|\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L})\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\right|+2\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\left|\left\langle\boldsymbol{A}\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}},{\boldsymbol{A}}\boldsymbol{z}_{L}\right\rangle-\left\langle\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}},\boldsymbol{z}_{L}\right\rangle\right|\!,\nonumber\\ &\quad\le\left|\left\|\boldsymbol{A}(\boldsymbol{x}-\boldsymbol{z}_{L})\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}\right|\nonumber\\ &\qquad+\frac{1}{2}\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\left|\left\Vert{\boldsymbol{A}\left(\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}+\boldsymbol{z}_{L}\right)}\right\Vert{}_{\ell_{2}}^{2}-\left\Vert{\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}+\boldsymbol{z}_{L}}\right\Vert{}_{\ell_{2}}^{2}\right|\nonumber\\ &\qquad+\frac{1}{2}\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\left|\left\Vert{\boldsymbol{A}\left(\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}-\boldsymbol{z}_{L}\right)}\right\Vert{}_{\ell_{2}}^{2}-\left\Vert{\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}-\boldsymbol{z}_{L}}\right\Vert{}_{\ell_{2}}^{2}\right|\!. \end{align} (C.3) To complete our bound, note that since MRIP$$\big(s,\frac{\delta }{4}\big)$$ holds for $$\boldsymbol{A}$$ with $$s=200(1+\eta)$$, we have $$s_{L}=200\times 2^{L}(1+\eta)\ge n$$.
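The last step of (C.3) replaces the inner-product deviation by two squared-norm deviations via the polarization identity $$\langle \boldsymbol{A}\boldsymbol{u},\boldsymbol{A}\boldsymbol{v}\rangle-\langle \boldsymbol{u},\boldsymbol{v}\rangle=\frac{1}{4}\left[\left(\left\|\boldsymbol{A}(\boldsymbol{u}+\boldsymbol{v})\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{u}+\boldsymbol{v}\right\|_{\ell_{2}}^{2}\right)-\left(\left\|\boldsymbol{A}(\boldsymbol{u}-\boldsymbol{v})\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{u}-\boldsymbol{v}\right\|_{\ell_{2}}^{2}\right)\right]$$, applied with the normalized difference in place of $$\boldsymbol{u}$$ and $$\boldsymbol{v}=\boldsymbol{z}_{L}$$; the outer factor 2 then produces the two $$\frac{1}{2}\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}$$ terms. A minimal numerical check of the identity, with a generic random matrix standing in for $$\boldsymbol{A}$$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 15, 10
A = rng.standard_normal((m, n))
u = rng.standard_normal(n)
v = rng.standard_normal(n)

sq = lambda w: np.dot(w, w)  # squared Euclidean norm

# Polarization identity used in the last step of (C.3):
# <Au, Av> - <u, v>
#   = 1/4 * [ (||A(u+v)||^2 - ||u+v||^2) - (||A(u-v)||^2 - ||u-v||^2) ]
lhs = np.dot(A @ u, A @ v) - np.dot(u, v)
rhs = 0.25 * ((sq(A @ (u + v)) - sq(u + v)) - (sq(A @ (u - v)) - sq(u - v)))
assert np.isclose(lhs, rhs)
```

The identity holds for any matrix, since $$\left\|\boldsymbol{A}(\boldsymbol{u}+\boldsymbol{v})\right\|^{2}-\left\|\boldsymbol{A}(\boldsymbol{u}-\boldsymbol{v})\right\|^{2}=4\langle\boldsymbol{A}\boldsymbol{u},\boldsymbol{A}\boldsymbol{v}\rangle$$ and likewise without $$\boldsymbol{A}$$.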
As a result, for all $$\boldsymbol{w}\in \mathbb{R}^{n}$$ we have   \begin{align*} \left|\left\|\boldsymbol{A}\boldsymbol{w}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{w}\right\|_{\ell_{2}}^{2}\right|\le\max\left(\frac{1}{4}\delta_{L},\frac{1}{16}{\delta_{L}^{2}}\right)\left\|\boldsymbol{w}\right\|_{\ell_{2}}^{2}\!. \end{align*} For $$\tilde{L}>L$$ we have $$\delta _{L}=2^{\frac{L}{2}}\delta \le 1$$, which immediately implies that for all $$\boldsymbol{w}\in \mathbb{R}^{n}$$ we have   \begin{align} \left|\left\|\boldsymbol{A}\boldsymbol{w}\right\|_{\ell_{2}}^{2}-\left\|\boldsymbol{w}\right\|_{\ell_{2}}^{2}\right|\le\frac{1}{4}2^{L/2}\delta\left\|\boldsymbol{w}\right\|_{\ell_{2}}^{2}. \end{align} (C.4) Now using (C.4) with $$\boldsymbol{w}=\boldsymbol{x}-\boldsymbol{z}_{L}, \frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left \|\boldsymbol{x}-\boldsymbol{z}_{L}\right \|_{\ell _{2}}}-\boldsymbol{z}_{L}$$, and $$\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left \|\boldsymbol{x}-\boldsymbol{z}_{L}\right \|_{\ell _{2}}}+\boldsymbol{z}_{L}$$ in (C.3), and noting that $$\left\|\boldsymbol{z}_{L}\right\|_{\ell_{2}}\le \textrm{rad}(\mathcal{T})\le 1$$, we conclude that   \begin{align} \left|\left\|{\boldsymbol{A}}\boldsymbol{x}\right\|_{\ell_{2}}^{2}\!-\!\left\|\boldsymbol{x}\right\|_{\ell_{2}}^{2}\right|\!-\!\left|\left\|{\boldsymbol{A}}{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\!-\!\left\|{\boldsymbol{z}}_{L}\right\|_{\ell_{2}}^{2}\right|&\le\frac{1}{4}2^{L/2}\delta\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}+\frac{1}{8}2^{L/2}\delta\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\left\Vert{\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}+\boldsymbol{z}_{L}}\right\Vert{}_{\ell_{2}}^{2}\nonumber\\
&\quad+\frac{1}{8}2^{L/2}\delta\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\left\Vert{\frac{\boldsymbol{x}-\boldsymbol{z}_{L}}{\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}}-\boldsymbol{z}_{L}}\right\Vert{}_{\ell_{2}}^{2}\nonumber\\ &\le\frac{1}{4}2^{L/2}\delta\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}^{2}+2^{L/2}\delta\left\|\boldsymbol{x}-\boldsymbol{z}_{L}\right\|_{\ell_{2}}\nonumber\\ &\le 2^{L/2}\delta \left(\frac{1}{4}{{e_{L}^{2}}}+e_{L}\right)\nonumber\\ &\le\frac{3}{2}2^{L/2}\delta e_{L}\nonumber\\ &\le\frac{3}{2}\delta \gamma_{2}(\mathcal{T}). \end{align} (C.5) Plugging (C.2) and (C.5) into (C.1), and using $$10\sqrt{2}+\frac{3}{2}\le 16$$, we arrive at   \begin{align} \left| \|{\boldsymbol{A}}\boldsymbol{x}\|_{\ell_{2}}^{2}-\|\boldsymbol{x}\|_{\ell_{2}}^{2}\right| \le 16\delta \gamma_{2}(\mathcal{T})+\max(\delta,\delta^{2}). \end{align} (C.6)

© The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices). For permissions, please e-mail: journals.permissions@oup.com

### Journal

Information and Inference: A Journal of the IMA, Oxford University Press

Published: Mar 7, 2018
