# Phase retrieval via randomized Kaczmarz: theoretical guarantees

## Abstract

We consider the problem of phase retrieval, i.e. that of solving systems of quadratic equations. A simple variant of the randomized Kaczmarz method was recently proposed for phase retrieval, and it was shown numerically to have a computational edge over state-of-the-art Wirtinger flow methods. In this paper, we provide the first theoretical guarantee for the convergence of the randomized Kaczmarz method for phase retrieval. We show that it is sufficient to have as many Gaussian measurements as the dimension, up to a constant factor. Along the way, we introduce a sufficient condition on measurement sets for which the randomized Kaczmarz method is guaranteed to work. We show that Gaussian sampling vectors satisfy this property with high probability; this is proved using a chaining argument coupled with bounds on Vapnik–Chervonenkis (VC) dimension and metric entropy.

## 1. Introduction

The phase retrieval problem is that of solving a system of quadratic equations

$$\lvert\langle a_{i},z\rangle\rvert^{2} = b_{i}^{2}, \quad\quad i = 1,2,\ldots,m,$$ (1.1)

where $$a_{i} \in \mathbb{R}^{n}$$ (or $$\mathbb{C}^{n}$$) are known sampling vectors, $$b_{i}> 0$$ are observed measurements and $$z \in \mathbb{R}^{n}$$ (or $$\mathbb{C}^{n}$$) is the decision variable. This problem is well motivated by practical concerns [13] and has been a topic of study from at least the early 1980s. Over the last half decade, there has been great interest in constructing and analysing algorithms with provable guarantees given certain classes of sampling vector sets. One line of research involves ‘lifting’ the quadratic system to a linear system, which is then solved using convex relaxation (PhaseLift) [5]. A second method is to formulate and solve a linear program in the natural parameter space using an anchor vector (PhaseMax) [2,14,15].
Although both of these methods can be proved to have near optimal sample efficiency, the most empirically successful approach has been to directly optimize various naturally formulated non-convex loss functions, the most notable of which are displayed in Table 1.

Table 1. Non-convex loss functions for phase retrieval

| Loss function | Name | Papers |
| --- | --- | --- |
| $$f(z) = \sum _{i=1}^{m} (\lvert \langle a_{i},z\rangle \rvert ^{2}-b_{i}^{2} )^{2}$$ | Squared loss for intensities | [4,19] |
| $$f(z) = \sum _{i=1}^{m} \left (\lvert \langle a_{i},z\rangle \rvert -b_{i}\right )^{2}$$ | Squared loss for amplitudes | [24,26] |
| $$f(z) = \sum _{i=1}^{m} \vert \lvert \langle a_{i},z\rangle \rvert ^{2}- b_{i}^{2} \vert$$ | $$\ell _{1}$$ loss for intensities | [9,11,12] |

These loss functions enjoy nice properties, which make them amenable to various optimization schemes [12,19]. Those with provable guarantees include the prox-linear method of [11], and various gradient descent methods [4,6,9,24,26]. Some of these methods also involve adaptive measurement pruning to enhance performance.

In 2015, Wei [25] proposed adapting a family of randomized Kaczmarz methods for solving the phase retrieval problem. He was able to show using numerical experiments that these methods perform comparably with state-of-the-art Wirtinger flow (gradient descent) methods when the sampling vectors are real or complex Gaussian, or when they follow the coded diffraction pattern (CDP) model [4]. He also showed that randomized Kaczmarz methods outperform Wirtinger flow when the sampling vectors are the concatenation of a few unitary bases. Unfortunately, [25] was not able to provide adequate theoretical justification for the convergence of these methods (see Theorem 2.6 in [25]).

In this paper, we attempt to bridge this gap by showing that the basic randomized Kaczmarz scheme used in conjunction with truncated spectral initialization achieves linear convergence to the solution with high probability, whenever the sampling vectors are drawn uniformly from the sphere $$S^{n-1}$$ and the number of measurements m is larger than a constant times the dimension n. It is also interesting to note that the basic randomized Kaczmarz scheme is exactly stochastic gradient descent for the Amplitude Flow objective, which suggests that other gradient descent schemes can also be accelerated using stochasticity.

### 1.1. Randomized Kaczmarz for solving linear systems

The Kaczmarz method is a fast iterative method for solving systems of overdetermined linear equations that works by iteratively satisfying one equation at a time.
In 2009, Strohmer and Vershynin [18] were able to give a provable guarantee on its rate of convergence, provided that the equation to be satisfied at each step is selected using a prescribed randomized scheme.

Suppose our system to be solved is given by

$$Ax = b,$$ (1.2)

where A is an m by n matrix. Denoting the rows of A by $${a_{1}^{T}},\ldots ,{a_{m}^{T}}$$, we can write (1.2) as the system of linear equations

$$\langle a_{i},x\rangle = b_{i}, \quad i=1,\ldots,m.$$

The solution set of each equation is a hyperplane. The randomized Kaczmarz method is a simple iterative algorithm in which we project the running approximation onto the hyperplane of a randomly chosen equation. More formally, at each step k we randomly choose an index r(k) from [m] such that the probability that r(k) = i is proportional to $$\lVert a_{i}{\rVert _{2}^{2}}$$, and update the running approximation as follows:

$$x_{k} := x_{k-1} + \frac{b_{r(k)} - \langle a_{r(k)},x_{k-1}\rangle}{\lVert a_{r(k)}{\rVert_{2}^{2}}}a_{r(k)}.$$

Strohmer and Vershynin [18] were able to prove the following theorem:

Theorem 1.1 (Linear convergence for linear systems) Let $$\kappa (A) = \lVert A\rVert _{F} / \sigma _{\min }(A)$$. Then for any initialization $$x_{0}$$ to the equation (1.2), the estimates given to us by randomized Kaczmarz satisfy

$$\mathbb{E}\lVert x_{k}-x{\rVert_{2}^{2}} \leq ({1-\kappa(A)^{-2}})^{k} \lVert x_{0} -x{\rVert_{2}^{2}}.$$

Note that if A has bounded condition number, then $$\kappa (A) \asymp \sqrt{n}$$.

### 1.2. Randomized Kaczmarz for phase retrieval

In the phase retrieval problem (1.1), each equation

$$\lvert\langle a_{i},x\rangle\rvert = b_{i}$$

defines two hyperplanes, one corresponding to each of ± x. A natural adaptation of the randomized Kaczmarz update for this situation is then to project the running approximation to the closer hyperplane.
We restrict to the case where each measurement vector $$a_{i}$$ has unit norm, so that in equations, this is given by

$$x_{k} := x_{k-1} + \eta_{k} a_{r(k)},$$ (1.3)

where

$$\eta_{k} = \textrm{sign}(\langle a_{r(k)},x_{k-1}\rangle) b_{r(k)} - \langle a_{r(k)},x_{k-1}\rangle.$$

In order to obtain a convergence guarantee for this algorithm, we need to choose $$x_{0}$$ so that it is close enough to the signal vector x. This is unlike the case for linear systems where we could start with an arbitrary initial estimate $$x_{0} \in \mathbb{R}^{n}$$, but the requirement is par for the course for phase retrieval algorithms. Unsurprisingly, there is a rich literature on how to obtain such estimates [5,6,24,26]. The best methods are able to obtain a good initial estimate using O(n) samples.

### 1.3. Contributions and main result

The main result of our paper guarantees the linear convergence of the randomized Kaczmarz algorithm for phase retrieval for random measurements $$a_{i}$$ that are drawn independently and uniformly from the unit sphere.

Theorem 1.2 (Convergence guarantee for algorithm) Fix $$\epsilon> 0$$, $$0 < \delta _{1} \leq 1/2$$ and $$0 < \delta ,\delta _{2} \leq 1$$. There are absolute constants C, c > 0 such that if

$$m \geq C(n\log(m/n) + \log(1/\delta)),$$

then with probability at least $$1-\delta$$, m sampling vectors selected uniformly and independently from the unit sphere $$S^{n-1}$$ form a set such that the following holds: let $$x \in \mathbb{R}^{n}$$ be a signal vector and let $$x_{0}$$ be an initial estimate satisfying $$\lVert x_{0}-x\rVert _{2} \leq c\sqrt{\delta _{1}}\lVert x\rVert _{2}$$. Then for any $$\epsilon> 0$$, if

$$K \geq 2(\log(1/\epsilon) + \log(2/\delta_{2}))n,$$

then the Kth step randomized Kaczmarz estimate $$x_{K}$$ satisfies $$\lVert x_{K}-x{\rVert _{2}^{2}} \leq \epsilon \lVert x_{0} - x{\rVert _{2}^{2}}$$ with probability at least $$1-\delta _{1}-\delta _{2}$$.

Comparing this result with Theorem 1.1, we observe two key differences.
First, there are now two sources of randomness: one is in the creation of the measurements $$a_{i}$$, and the other is in the selection of the equation at every iteration of the algorithm. The theorem gives a guarantee that holds with high probability over both sources of randomness. Second, Theorem 1.2 requires an initial estimate $$x_{0}$$. This is not hard to obtain. Indeed, using the truncated spectral initialization method of [6], we may obtain such an estimate with high probability given $$m \gtrsim n$$. For more details, see Proposition B.1.

The proof of this theorem is more involved than the Strohmer–Vershynin analysis of the randomized Kaczmarz algorithm for linear systems [18]. We break down the argument into smaller steps, each of which may be of independent interest to researchers in this field.

First, we generalize the Kaczmarz update formula (1.3) and define what it means to take a randomized Kaczmarz step with respect to any probability measure on the sphere $$S^{n-1}$$: we choose a measurement vector at each step according to this measure. Using a simple geometric argument, we then provide a bound for the expected decrement in distance to the solution set in a single step, where the quality of the bound is given in terms of the properties of the measure we are using for the Kaczmarz update (Lemma 2.1).

Performing the generalized Kaczmarz update with respect to the uniform measure on the sphere corresponds to running the algorithm with unlimited measurements. We utilize the symmetry of the uniform measure to compute an explicit formula for the bound on the stepwise expected decrement in distance. This decrement is geometric whenever we make the update from a point making an angle of less than $$\pi /8$$ with the true solution, so we obtain linear convergence conditioned on no iterates escaping from the ‘basin of linear convergence’. We are able to bound the probability of this bad event using a supermartingale inequality (Theorem 3.1).
Next, we abstract out the property of the uniform measure that allows us to obtain local linear convergence. We call this property the anti-concentration on wedges property, or ACW for short. Using this convenient definition, we can easily generalize our previous proofs for the uniform measure to show that all ACW measures give rise to randomized Kaczmarz update schemes with local linear convergence (Theorem 4.3).

The usual Kaczmarz update corresponds to running the generalized Kaczmarz update with respect to $$\mu _{A} := \frac{1}{m}\sum _{i=1}^{m}\delta _{a_{i}}$$. We are able to prove that when the $$a_{i}$$s are selected uniformly and independently from the sphere, then $$\mu _{A}$$ satisfies the ACW condition with high probability, so long as $$m \gtrsim n$$ (Theorem 5.9). The proof of this fact uses Vapnik–Chervonenkis (VC) theory and a chaining argument, together with metric entropy estimates.

Finally, we are able to put everything together to prove a guarantee for the full algorithm in Section 6. In that section, we also discuss the failure probabilities $$\delta$$, $$\delta _{1}$$ and $$\delta _{2}$$, and how they can be controlled.

### 1.4. Related work

During the preparation of this manuscript, we became aware of independent simultaneous work done by Jeong and Güntürk. They also studied the randomized Kaczmarz method adapted to phase retrieval, and obtained almost the same result that we did (see [16] and Theorem 1.1 therein). In order to prove their guarantee, they use a stopping time argument similar to ours, but replace the ACW condition with a stronger condition called admissibility. They prove that measurement systems comprising vectors drawn independently and uniformly from the sphere satisfy this property with high probability, and the main tools they use in their proof are hyperplane tessellations and a net argument together with Lipschitz relaxation of indicator functions.
After submitting the first version of this manuscript, we also became aware of independent work done by Zhang, Zhou, Liang and Chi [26]. Their work examines stochastic schemes in more generality (see Section 3 in their paper), and they claim to prove linear convergence for both the randomized Kaczmarz method and what they call Incremental Reshaped Wirtinger Flow. However, they only prove that the distance to the solution decreases in expectation under a single Kaczmarz update (an analogue of our Lemma 2.1 specialized to real Gaussian measurements). As we will see in our paper, this bound cannot be naively iterated.

### 1.5. Notation

Throughout the paper, C and c are absolute constants that can change from line to line.

## 2. Computations for a single step

In this section, we will compute what happens in expectation for a single update step of the randomized Kaczmarz method. It will be convenient to generalize our sampling scheme slightly as follows. When we work with a fixed matrix A, we may view our selection of a random row $$a_{r(k)}$$ as drawing a random vector according to the measure $$\mu _{A} := \frac{1}{m}\sum _{i=1}^{m} \delta _{a_{i}}$$. We need not restrict ourselves to sums of Diracs. For any probability measure $$\mu$$ on the sphere $$S^{n-1}$$, we define the random map $$P = P_{\mu }$$ on vectors $$z \in \mathbb{R}^{n}$$ by setting

$$P z := z + \eta a,$$ (2.1)

where

$$\eta = \textrm{sign}(\langle a,z\rangle)\lvert\langle a,x\rangle\rvert - \langle a,z\rangle \quad\textrm{and}\quad a \sim \mu.$$ (2.2)

Note that as before, x is a fixed vector in $$\mathbb{R}^{n}$$ (think of x as the actual solution of the phase retrieval problem). We call $$P_{\mu }$$ the generalized Kaczmarz projection with respect to $$\mu$$.
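To make the map in (2.1) and (2.2) concrete, the following is a minimal Python sketch of a single generalized Kaczmarz projection for one given unit vector a. The names `dot` and `kaczmarz_projection` are ours and purely illustrative, and the convention sign(0) = +1 is an assumption of the sketch (for a continuous measure $$\mu$$ the event ⟨a, z⟩ = 0 has probability zero).

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def kaczmarz_projection(z, a, x):
    """Return Pz = z + eta * a as in (2.1)-(2.2), for a unit vector a.

    The signal x enters only through the phaseless measurement |<a, x>|.
    """
    b = abs(dot(a, x))
    az = dot(a, z)
    sign = 1.0 if az >= 0 else -1.0  # assumed convention: sign(0) = +1
    eta = sign * b - az
    return [zi + eta * ai for zi, ai in zip(z, a)]


# Illustrative values only (not taken from the paper).
x = [1.0, 2.0, -1.0]           # signal
z = [0.8, 2.3, -0.7]           # current estimate
a = [2 / 3, -1 / 3, 2 / 3]     # unit-norm sampling vector
Pz = kaczmarz_projection(z, a, x)
```

After the step, Pz lies on one of the two hyperplanes ⟨y, a⟩ = ±|⟨a, x⟩|, namely the one closer to z.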
Using this update rule over independent realizations of P, $$P_{1},P_{2},\ldots,$$ together with an initial estimate $$x_{0}$$, gives rise to a generalized randomized Kaczmarz algorithm for finding x: set the kth step estimate to be

\begin{align} x_{k} := P_{k}P_{k-1}\cdots P_{1} x_{0}. \end{align} (2.3)

Fix a vector $$z \in \mathbb{R}^{n}$$ that is closer to x than to −x, i.e. so that ⟨x, z⟩ > 0, and suppose that we are trying to find x. Examining the formula in (2.2), we see that P projects z onto the right hyperplane (i.e. the one passing through x instead of the one passing through −x) if and only if ⟨a, z⟩ and ⟨a, x⟩ have the same sign. In other words, this occurs if and only if the random vector a does not fall into the region of the sphere defined by

$$W_{x,z} := \lbrace v \in S^{n-1} \ | \ \textrm{sign}(\langle v,x\rangle) \neq \textrm{sign}(\langle v,z\rangle)\rbrace.$$ (2.4)

This is the region lying between the two hemispheres with normal vectors x and z. We call such a region a spherical wedge, since in three dimensions it has the shape depicted in Fig. 1.

Fig. 1. Geometry of $$W_{x,z}$$.

When $$a \notin W_{x,z}$$, we can use the Pythagorean theorem to write

$$\lVert z-x{\rVert_{2}^{2}} = \lVert Pz-x{\rVert_{2}^{2}} + \langle z-x,a\rangle^{2}.$$ (2.5)

Rearranging gives

$$\lVert Pz-x{\rVert_{2}^{2}} = \lVert z-x{\rVert_{2}^{2}}(1 - \langle\tilde{z},a\rangle^{2}),$$ (2.6)

where $$\tilde{z} = (z-x)/\lVert z-x\rVert _{2}$$. In the complement of this event, we get

$$Pz = z + \langle a,(-x)-z\rangle a = z - \langle a,z-x\rangle a + \langle a,-2x\rangle a,$$

and using orthogonality,

$$\lVert Pz-x{\rVert_{2}^{2}} = \lVert z-x{\rVert_{2}^{2}} - \langle a,z-x\rangle^{2} + \langle a,2x\rangle^{2}.$$ (2.7)

Fig. 2. Orientation of x, z, and Pz when $$a \in W_{x,z}$$ and when $$a \notin W_{x,z}$$. $$H_{+}$$ and $$H_{-}$$ denote the hyperplanes defined by the equations ⟨y, a⟩ = b and ⟨y, a⟩ = −b, respectively. $$H_{0}$$ denotes the hyperplane defined by the equation ⟨y, a⟩ = 0. The left diagram demonstrates the situation when $$a \in W_{x,z}$$, thereby justifying (2.5). The right diagram demonstrates the situation when $$a \notin W_{x,z}$$, thereby justifying (2.8).

Since z gets projected to the hyperplane containing −x, it may move further away from x. However, we can bound how far away it can move. Because ⟨a, x⟩ and ⟨a, z⟩ have opposite signs, we have

$$\lvert\langle a,z+x\rangle\rvert < \lvert\langle a,z-x\rangle\rvert,$$

and so

$$\lvert\langle a,2x\rangle\rvert = \lvert\langle a,(z-x) - (z + x)\rangle\rvert < 2 \lvert\langle a,z-x\rangle\rvert.$$

Substituting this into (2.7), we get the bound

$$\lVert Pz-x{\rVert_{2}^{2}} \leq \lVert z-x{\rVert_{2}^{2}} + 3\langle a,z-x\rangle^{2} = \lVert z-x{\rVert_{2}^{2}}(1 + 3\langle\tilde{z},a\rangle^{2}),$$ (2.8)

where $$\tilde{z}$$ is as before.
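The identity (2.6) and the bound (2.8) are easy to sanity-check numerically. The sketch below is only an illustration under our own assumptions: it draws unit vectors uniformly from the sphere by normalizing standard Gaussian vectors, uses an arbitrary pair x, z in dimension 4 with ⟨x, z⟩ > 0, and all function names are hypothetical. It records the largest deviation from (2.6) when $$a \notin W_{x,z}$$ and the largest excess over (2.8) when $$a \in W_{x,z}$$.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def random_unit_vector(n, rng):
    """Uniform point on the sphere, obtained by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(dot(g, g))
    return [gi / norm for gi in g]


def step(z, a, x):
    """Generalized Kaczmarz projection Pz = z + eta * a from (2.1)-(2.2)."""
    az = dot(a, z)
    eta = math.copysign(abs(dot(a, x)), az) - az
    return [zi + eta * ai for zi, ai in zip(z, a)]


def check_single_step_bounds(x, z, trials=2000, seed=0):
    rng = random.Random(seed)
    d = [zi - xi for zi, xi in zip(z, x)]
    d2 = dot(d, d)
    worst_equality_gap = 0.0   # deviation from (2.6) outside the wedge
    worst_bound_excess = 0.0   # stays 0.0 unless (2.8) is violated
    wedge_hits = 0
    for _ in range(trials):
        a = random_unit_vector(len(x), rng)
        e = [pi - xi for pi, xi in zip(step(z, a, x), x)]
        ratio = dot(e, e) / d2          # ||Pz - x||^2 / ||z - x||^2
        t = dot(d, a) ** 2 / d2         # <z~, a>^2
        in_wedge = (dot(a, x) > 0) != (dot(a, z) > 0)
        if in_wedge:
            wedge_hits += 1
            worst_bound_excess = max(worst_bound_excess, ratio - (1 + 3 * t))
        else:
            worst_equality_gap = max(worst_equality_gap, abs(ratio - (1 - t)))
    return worst_equality_gap, worst_bound_excess, wedge_hits


gap, excess, hits = check_single_step_bounds([1.0, 0.0, 0.0, 0.0],
                                             [1.0, 1.0, 0.0, 0.0])
```

Here `gap` should be at the level of floating-point rounding and `excess` should not exceed it, while `hits` counts how often the sampled vector landed in the wedge.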
We can combine (2.6) and (2.8) into a single inequality by writing

\begin{align*} \lVert Pz-x\rVert_{2}^{2} & \leq \lVert z-x\rVert_{2}^{2}(1 - \langle\tilde{z},a\rangle^{2})1_{W_{x,z}^{c}}(a) + \lVert z-x\rVert_{2}^{2}(1 + 3\langle\tilde{z},a\rangle^{2})1_{W_{x,z}}(a) \\ & = \lVert z-x\rVert_{2}^{2}\left(1 - (1-4\cdot 1_{W_{x,z}}(a))\langle\tilde{z},a\rangle^{2}\right) \\ & = \lVert z-x\rVert_{2}^{2}\left(1 - \left\langle\tilde{z},(1-4\cdot 1_{W_{x,z}}(a))aa^{T} \tilde{z}\right\rangle\right). \end{align*}

Taking expectations, we can remove the role that $$\tilde{z}$$ plays by bounding this as follows:

\begin{align*} \mathbb{E}\Big[\lVert z-x{\rVert_{2}^{2}}\left(1 - \left\langle\tilde{z},(1-4\cdot 1_{W_{x,z}}(a))aa^{T} \tilde{z}\right\rangle\right)\Big] & = \lVert z-x{\rVert_{2}^{2}}\left(1 - \left\langle\tilde{z},\mathbb{E}[(1-4\cdot 1_{W_{x,z}}(a))aa^{T}] \tilde{z}\right\rangle\right) \\ & \leq \lVert z-x{\rVert_{2}^{2}}\left[{1 -\lambda_{\min}\left(\mathbb{E} aa^{T}-4\mathbb{E} aa^{T}1_{W_{x,z}}(a)\right)}\right]. \end{align*}

We may thus summarize what we have obtained in the following lemma.

Lemma 2.1 (Expected decrement) Fix vectors $$x,z \in \mathbb{R}^{n}$$, a probability measure $$\mu$$ on $$S^{n-1}$$, and let $$P = P_{\mu }$$, $$W_{x,z}$$ be defined as in (2.1) and (2.4), respectively. Then

$$\mathbb{E}\lVert Pz-x{\rVert_{2}^{2}} \leq \left[{1 -\lambda_{\min}\left(\mathbb{E} aa^{T}-4\mathbb{E} aa^{T}1_{W_{x,z}}(a)\right)}\right]\lVert z-x{\rVert_{2}^{2}}.$$

Let us next compute what happens for $$\mu = \sigma$$, the uniform measure on the sphere. It is easy to see that $$\mathbb{E} aa^{T} = \frac{1}{n}I_{n}$$, so it remains to compute $$\mathbb{E} aa^{T}1_{W_{x,z}}(a)$$. To do this, we make a convenient choice of coordinates: let $$\theta$$ be the angle between z and x.
We assume that both points lie in the plane spanned by $$e_{1}$$ and $$e_{2}$$, the first two basis vectors, and that the angle between z and x is bisected by $$e_{1}$$, as illustrated in Fig. 3.

Fig. 3. Choice of coordinates.

For convenience, denote $$M := \mathbb{E} aa^{T} 1_{W_{x,z}}(a)$$. Let Q denote the orthogonal projection operator onto the span of $$e_{1}$$ and $$e_{2}$$. Then $$Q(W_{x,z})$$ is the union of two sectors of angle $$\theta$$, which are respectively bisected by $$e_{2}$$ and $$-e_{2}$$. Recall that all coordinate projections of the uniform random vector a are uncorrelated. It is clear from the symmetry in Fig. 3 that they remain uncorrelated even when conditioning on the event that $$a \in W_{x,z}$$. As such, M is a diagonal matrix.

Let $$\phi$$ denote the anti-clockwise angle of Qa from $$e_{2}$$ (see Fig. 3). We may write

$$\langle a,e_{1}\rangle^{2} = {\lVert Qa\rVert_{2}^{2}} \langle Qa/\lVert Qa\rVert_{2}, e_{1}\rangle^{2} = {\lVert Qa\rVert_{2}^{2}} \sin^{2}\phi.$$

Note that the magnitude and direction of Qa are independent, and $$a \in W_{x,z}$$ if either $$\phi$$ or $$\phi - \pi$$ lies between $$-\theta /2$$ and $$\theta /2$$. We therefore have

$$M_{11} = \mathbb{E} [\langle a,e_{1}\rangle^{2}1_{W_{x,z}}(a)] = \mathbb{E}\big[{\lVert Qa\rVert_{2}^{2}}\mathbb{E}\sin^{2}\phi 1_{(-\theta/2,\theta/2)}(\phi \textrm{ or }\phi-\pi)\big].$$

By a standard calculation using symmetry, we have $$\mathbb{E}{\lVert Qa\rVert _{2}^{2}} = 2/n$$.
Since $$\phi$$ is distributed uniformly on the circle, we can compute

$$\mathbb{E}\sin^{2}\phi 1_{(-\theta/2,\theta/2)}(\phi \textrm{ or }\phi-\pi) = \frac{1}{\pi}\int_{-\theta/2}^{\theta/2} \sin^{2} t \,\mathrm{d}t = \frac{1}{\pi} \int_{-\theta/2}^{\theta/2} \frac{1-\cos(2t)}{2} \,\mathrm{d}t = \frac{\theta-\sin\theta}{2\pi}.$$

As such, we have $$M_{11} = (\theta - \sin \theta )/(n\pi)$$, and by a similar calculation, $$M_{22} = (\theta + \sin \theta )/(n\pi)$$. Meanwhile, for i ≥ 3 we have

\begin{align*} M_{ii} & = \frac{\textrm{Tr}(M) - M_{11} - M_{22}}{n-2} \\ & = \frac{\mathbb{E}\left[\lVert(I-Q)a{\rVert_{2}^{2}}1_{W_{x,z}}(a)\right]}{n-2} \\ & = \frac{\mathbb{E}\lVert(I-Q)a{\rVert_{2}^{2}}\mathbb{E} 1_{(-\theta/2,\theta/2)}(\phi \textrm{ or }\phi-\pi)}{n-2} \\ & = \frac{(n-2)/n \cdot \theta/\pi }{n-2} = \frac{\theta}{n\pi}. \end{align*}

This implies that

$$\lambda_{\max}(M_{\theta}) = \frac{\theta + \sin\theta}{n\pi}.$$ (2.9)

We have now completed proving the following lemma.

Lemma 2.2 (Expected decrement for uniform measure) Fix vectors $$x, z \in \mathbb{R}^{n}$$ such that ⟨z, x⟩ > 0, and let $$P = P_{\sigma }$$ denote the generalized Kaczmarz projection with respect to $$\sigma$$, the uniform measure on the sphere. Let $$\theta$$ be the angle between z and x. Then

$$\mathbb{E}\lVert Pz-x{\rVert_{2}^{2}} \leq \Bigg[ 1 - \frac{1-4(\theta+\sin\theta)/\pi}{n} \Bigg] \lVert z-x{\rVert_{2}^{2}}.$$

Remark 2.3 By being more careful, one may compute an exact formula for the expected decrement rather than a bound as in the previous lemma. This is not necessary for our purposes and does not give better guarantees in our analysis, so the computation is omitted.

## 3. Local linear convergence using unlimited uniform measurements

In this section, we will show that if we start with an initial estimate that is close enough to the ground truth x, then repeatedly applying generalized Kaczmarz projections with respect to the uniform measure $$\sigma$$ gives linear convergence in expectation. This is exactly the situation we would be in if we were to run randomized Kaczmarz, given an unlimited supply of independent sampling vectors $$a_{1},a_{2},\ldots$$ drawn uniformly from the sphere.

We would like to imitate the proof for linear convergence of randomized Kaczmarz for linear systems (Theorem 1.1) given in [18]. We denote by $$X_{k}$$ the estimate after k steps, using capital letters to emphasize the fact that it is a random variable. If we know that $$X_{k}$$ takes the value $$x_{k} \in \mathbb{R}^{n}$$, and the angle $$\theta _{k}$$ that x makes with $$x_{k}$$ is smaller than $$\pi /8$$, then Lemma 2.2 tells us

$$\mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}} \ | \ X_{k} = x_{k}\big] \leq (1-\alpha_{\sigma}/n)\lVert x_{k}-x{\rVert_{2}^{2}},$$ (3.1)

where $$\alpha _{\sigma } := 1/2 - 4\sin (\pi /8)/\pi> 0$$.

The proof for Theorem 1.1 proceeds by unconditioning and iterating a bound similar to (3.1). Unfortunately, our bound depends on $$x_{k}$$ being in a specific region in $$\mathbb{R}^{n}$$ and does not hold arbitrarily. Nonetheless, by using some basic concepts from stochastic process theory, we may derive a conditional linear convergence estimate. The details are as follows.

For each k, let $$\mathcal{F}_{k}$$ denote the $$\sigma$$-algebra generated by $$a_{1},a_{2},\ldots ,a_{k}$$, where $$a_{k}$$ is the sampling vector used in step k. Let $$B \subset \mathbb{R}^{n}$$ be the region comprising all points making an angle less than or equal to $$\pi /8$$ with x. This is our basin of linear convergence. Let us assume a fixed initial estimate $$x_{0} \in B$$.
Now define a stopping time $$\tau$$ via

\begin{align} \tau := \min \lbrace k \colon X_{k} \notin B\rbrace. \end{align} (3.2)

For each k, and $$x_{k} \in B$$, we have

\begin{align*} \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}}1_{\tau>{k+1}} \ | \ X_{k} = x_{k}\big] & \leq \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}} 1_{\tau > k} \ | \ X_{k} = x_{k}\big] \\ & = \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}}1_{\tau > k} \ | \ X_{k} = x_{k}, \mathcal{F}_{k}\big] \\ & = \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}} \ | \ X_{k} = x_{k}, \mathcal{F}_{k}\big]1_{\tau > k} \\ & \leq (1-\alpha_{\sigma}/n)\lVert x_{k}-x{\rVert_{2}^{2}} 1_{\tau > k}. \end{align*}

Here, the first inequality follows from the inclusion $$\lbrace \tau>{k+1}\rbrace \subset \lbrace \tau > k\rbrace$$, the first equality statement from the Markov nature of the process $$(X_{k})$$, the second equality statement from the fact that $$\tau$$ is a stopping time, while the second inequality is simply (3.1). Taking expectations with respect to $$X_{k}$$ then gives

\begin{align*} \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}} 1_{\tau>{k+1}}\big] & = \mathbb{E}\big[\mathbb{E}\big[{\lVert X_{k+1}-x{\rVert_{2}^{2}}1_{\tau >{k+1}} \ | \ X_{k}} \big]\big] \\ & \leq (1-\alpha_{\sigma}/n)\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} 1_{\tau > k}\big]. \end{align*}

By induction, we therefore obtain

$$\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} 1_{\tau> k}\big] \leq (1-\alpha_{\sigma}/n)^{k}\lVert x_{0}-x{\rVert_{2}^{2}}.$$

We have thus proven the first part of the following convergence theorem.

Theorem 3.1 (Linear convergence from unlimited measurements) Let x be a vector in $$\mathbb{R}^{n}$$, let $$\delta> 0$$, and let $$x_{0}$$ be an initial estimate to x such that $$\lVert x_{0} - x\rVert _{2} \leq \delta \lVert x\rVert _{2}$$. Suppose that our measurements $$a_{1},a_{2},\ldots$$ are fully independent random vectors distributed uniformly on the sphere $$S^{n-1}$$.
Let $$X_{k}$$ be the estimate given by the randomized Kaczmarz update formula (2.3) at step k, and let $$\tau$$ be the stopping time defined via (3.2). Then for every $$k \in \mathbb{Z}_{+}$$,

$$\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} 1_{\tau = \infty}\big] \leq (1-\alpha_{\sigma}/n)^{k}\lVert x_{0}-x{\rVert_{2}^{2}},$$ (3.3)

where $$\alpha _{\sigma } = 1/2 - 4\sin (\pi /8)/\pi> 0$$. Furthermore, $$\mathbb{P}(\tau < \infty ) \leq (\delta /\sin (\pi /8))^{2}$$.

Proof. In order to prove the second statement, we combine a stopping time argument with a supermartingale maximal inequality. Set $$Y_{k} := \lVert X_{\tau \wedge\, k}-x{\rVert _{2}^{2}}$$. We claim that $$Y_{k}$$ is a supermartingale. To see this, we break up its conditional expectation as follows:

\begin{align*} \mathbb{E}[Y_{k+1} \ | \ \mathcal{F}_{k}] & = \mathbb{E}\big[\lVert X_{\tau \wedge (k+1)}-x{\rVert_{2}^{2}}1_{\tau \leq\, k} \ | \ \mathcal{F}_{k}\big] + \mathbb{E}\big[\lVert X_{\tau \wedge (k+1)}-x{\rVert_{2}^{2}}1_{\tau> k} \ | \ \mathcal{F}_{k}\big] \\ & = \mathbb{E}\big[\lVert X_{\tau \wedge\, k}-x{\rVert_{2}^{2}}1_{\tau \leq\, k} \ | \ \mathcal{F}_{k}\big] + \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}}1_{\tau > k} \ | \ \mathcal{F}_{k}\big]. \end{align*}

Since $$\lVert X_{\tau \wedge k}-x{\rVert _{2}^{2}}$$ is measurable with respect to $$\mathcal{F}_{k}$$, we get

$$\mathbb{E}\big[\lVert X_{\tau \wedge\, k}-x{\rVert_{2}^{2}}1_{\tau \leq\, k} \ | \ \mathcal{F}_{k}\big] = \lVert X_{\tau \wedge\, k}-x{\rVert_{2}^{2}}1_{\tau \leq\, k} = Y_{k} 1_{\tau \leq\, k}.$$

Meanwhile, on the event $$\tau> k$$, we have $$X_{k} \in B$$, so we may use (3.1) to obtain

$$\mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}}1_{\tau> k} \ | \ \mathcal{F}_{k}\big] = \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}} \ | \ \mathcal{F}_{k}\big]1_{\tau > k} \leq (1-\alpha_{\sigma}/n)\lVert X_{k}-x{\rVert_{2}^{2}}1_{\tau > k}.$$

Next, notice that

$$\lVert X_{k}-x{\rVert_{2}^{2}}1_{\tau> k} = \lVert X_{\tau \wedge\, k}-x{\rVert_{2}^{2}}1_{\tau > k} = Y_{k} 1_{\tau > k}.$$

Combining these calculations gives

$$\mathbb{E}[Y_{k+1} \ | \ \mathcal{F}_{k}] \leq Y_{k} 1_{\tau \leq\, k} + (1-\alpha_{\sigma}/n)Y_{k} 1_{\tau> k} \leq Y_{k}.$$

Now define a second stopping time T to be the earliest time k such that $$\lVert X_{k}-x\rVert _{2} \geq \sin (\pi /8) \cdot \lVert x\rVert _{2}$$. A simple geometric argument tells us that $$T \leq \tau$$, and that T also satisfies

$$T = \inf\big\lbrace k \ | \ Y_{k} \geq \sin^{2}(\pi/8){\lVert x\rVert_{2}^{2}}\big\rbrace.$$

As such, we have

$$\mathbb{P}(\tau < \infty ) \leq \mathbb{P}(T < \infty) = \mathbb{P}\Bigg(\sup_{1 \leq\, k <\, \infty} Y_{k} \geq \sin^{2}(\pi/8) {\lVert x\rVert_{2}^{2}}\Bigg).$$

Since $$(Y_{k})$$ is a non-negative supermartingale, we may apply the supermartingale maximal inequality to obtain a bound on the right-hand side:

$$\mathbb{P}\Bigg(\sup_{1 \leq\, k <\, \infty} Y_{k} \geq \sin^{2}(\pi/8) {\lVert x\rVert_{2}^{2}}\Bigg) \leq \frac{\mathbb{E} Y_{0}}{\sin^{2}(\pi/8){\lVert x\rVert_{2}^{2}}} \leq (\delta/\sin(\pi/8))^{2}.$$

This completes the proof of the theorem.

Corollary 3.2 Fix $$\epsilon> 0$$, $$0 < \delta _{1} \leq 1/2$$ and $$0 < \delta _{2} \leq 1$$.
In the setting of Theorem 3.1, suppose that $$\lVert x_{0}-x\rVert _{2} \leq \sqrt{\delta _{1}}\sin (\pi /8)\lVert x\rVert _{2}$$. Then with probability at least $$1-\delta _{1}-\delta _{2}$$, if $$k \geq (\log (2/\epsilon )+\log (1/\delta _{2}))n/\alpha _{\sigma }$$ then $$\lVert X_{k}-x{\rVert _{2}^{2}} \leq \epsilon \lVert x_{0}-x{\rVert _{2}^{2}}$$.

Proof. First observe that

$$\mathbb{P}(\tau < \infty) \leq \left({\frac{\sqrt{\delta_{1}}\sin(\pi/8)}{\sin(\pi/8)}}\right)^{2} = \delta_{1} \leq 1/2.$$

Next, since

\begin{align*} \mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} 1_{\tau = \infty}\big] & = \mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\big]\mathbb{P}(\tau = \infty) + 0 \cdot \mathbb{P}(\tau < \infty) \\ & \geq \frac{1}{2}\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\big], \end{align*}

applying Theorem 3.1 gives

$$\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\big] \leq 2(1-\alpha_{\sigma}/n)^{k}\lVert x_{0}-x{\rVert_{2}^{2}}.$$

Applying Markov’s inequality then gives

\begin{align*} \mathbb{P}\big(\lVert X_{k}-x{\rVert_{2}^{2}}> \epsilon \lVert x_{0}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\big) & \leq \frac{\mathbb{E}\left[\lVert X_{k}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\right]}{\epsilon \lVert x_{0}-x{\rVert_{2}^{2}}} \\ & \leq \frac{2(1-\alpha_{\sigma}/n)^{k}}{\epsilon}. \end{align*}

Plugging our choice of k into this last bound shows that it is in turn bounded by $$\delta _{2}$$. We therefore have

\begin{align*} \mathbb{P}\big(\lVert X_{k}-x{\rVert_{2}^{2}} \leq \epsilon \lVert x_{0}-x{\rVert_{2}^{2}} \big) & \geq \mathbb{P}\big(\lVert X_{k}-x{\rVert_{2}^{2}} \leq \epsilon \lVert x_{0}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\big) \mathbb{P}(\tau = \infty) \\ & \geq (1-\delta_{2})(1-\delta_{1}) \\ & \geq 1 - \delta_{1} - \delta_{2} \end{align*}

as we wanted.

## 4. Local linear convergence for $$\textrm{ACW}(\theta ,\alpha )$$ measures

We would like to extend the analysis in the previous section to the setting where we only have access to finitely many uniform measurements, i.e. when we are back in the situation of (1.1). When we sample uniformly from the rows of A, this can be seen as running the generalized randomized Kaczmarz algorithm using the measure $$\mu _{A} = \frac{1}{m}\sum _{i=1}^{m} \delta _{a_{i}}$$ as opposed to $$\mu = \sigma$$.

If we retrace our steps, we will see that the key property of the uniform measure $$\sigma$$ that we used was that if $$W \subset S^{n-1}$$ is a wedge of angle $$\theta$$, then we could make $$\lambda _{\max }(\mathbb{E}_{\sigma } aa^{T} 1_{W}(a))$$ arbitrarily small by taking $$\theta$$ small enough (see equation (2.9)). We do not actually need such a strong statement. It suffices for there to be an absolute constant $$\alpha$$ such that

$$\lambda_{\min}(\mathbb{E} aa^{T}-4\mathbb{E} aa^{T}1_{W}(a)) \geq \frac{\alpha}{n}$$ (4.1)

holds for $$\theta$$ small enough.

Definition 4.1 (Anti-concentration) If a probability measure $$\mu$$ on $$S^{n-1}$$ satisfies (4.1) for all wedges W of angle less than $$\theta$$, we say that it is anti-concentrated on wedges of angle $$\theta$$ at level $$\alpha$$, or for short, that it satisfies the $$\textrm{ACW}(\theta ,\alpha )$$ condition. Abusing notation, we say that a measurement matrix A is $$\textrm{ACW}(\theta ,\alpha )$$ if the uniform measure on its rows is $$\textrm{ACW}(\theta ,\alpha )$$.

Plugging this definition into Lemma 2.1, we immediately get the following statement.

Lemma 4.2 (Expected decrement for ACW measure) Let $$\mu$$ be a probability measure on the sphere $$S^{n-1}$$ satisfying the $$\textrm{ACW}(\theta ,\alpha )$$ condition for some $$\alpha> 0$$ and some acute angle $$\theta> 0$$. Let $$P = P_{\mu }$$ denote the generalized Kaczmarz projection with respect to $$\mu$$.
Then for any $$x, z \in \mathbb{R}^{n}$$ such that the angle between them is less than $$\theta$$, we have   $$\mathbb{E}\lVert Pz-x{\rVert_{2}^{2}} \leq (1-\alpha/n) \lVert z-x{\rVert_{2}^{2}}.$$ (4.2) We may now imitate the arguments in the previous section to obtain a guarantee for local linear convergence for the generalized randomized Kaczmarz algorithm using such a measure $$\mu$$. Theorem 4.3 (Linear convergence for ACW measure) Suppose $$\mu$$ is an $$\textrm{ACW}(\theta ,\alpha )$$ measure. Let x be a vector in $$\mathbb{R}^{n}$$, let $$\delta> 0$$, and let $$x_{0}$$ be an initial estimate of x such that $$\lVert x_{0} - x\rVert _{2} \leq \delta \lVert x\rVert _{2}$$. Let $$X_{k}$$ denote the kth step of the generalized randomized Kaczmarz method with respect to the measure $$\mu$$, defined as in (2.3). Let $$\varOmega$$ be the event that for every $$k \in \mathbb{Z}_{+}$$, $$X_{k}$$ makes an angle less than $$\theta$$ with x. Then for every $$k \in \mathbb{Z}_{+}$$,   $$\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} 1_{\varOmega}\big] \leq (1-\alpha/n)^{k}\lVert x_{0}-x{\rVert_{2}^{2}}.$$ (4.3)Furthermore, $$\mathbb{P}(\varOmega ^{c}) \leq (\delta /\sin \theta )^{2}$$. Proof. We repeat the proof of Theorem 3.1, with $$\theta$$ in place of $$\pi /8$$. Let $$B_{\mu } \subset S^{n-1}$$ be the region on the sphere comprising all points making an angle less than $$\theta$$ with x. Define stopping times $$\tau _{\mu }$$ and $$T_{\mu }$$ as the earliest times that $$X_{k} \notin B_{\mu }$$ and $$\lVert X_{k} - x\rVert _{2} \geq \sin (\theta )\lVert x\rVert _{2}$$, respectively. Again, $$Y_{k} := \lVert X_{k\wedge \tau _{\mu }}-x{\rVert _{2}^{2}}$$ is a non-negative supermartingale, so we may use the supermartingale maximal inequality to bound the probability of $$\varOmega ^{c}$$. Conditioned on the event $$\varOmega$$, we may iterate the bound given by Lemma 4.2 to obtain (4.3). Corollary 4.4 Fix $$\epsilon> 0$$, $$0 < \delta _{1} \leq 1/2$$ and $$0 < \delta _{2} \leq 1$$. 
In the setting of Theorem 4.3, suppose that $$\lVert x_{0}-x\rVert _{2} \leq \sqrt{\delta _{1}}\sin (\theta )\lVert x\rVert _{2}$$. Then with probability at least $$1-\delta _{1}-\delta _{2}$$, if $$k \geq (\log (2/\epsilon )+\log (1/\delta _{2}))n/\alpha$$ then $$\lVert X_{k}-x{\rVert _{2}^{2}} \leq \epsilon \lVert x_{0}-x{\rVert _{2}^{2}}$$. 5. $$\textrm{ACW}(\theta ,\alpha )$$ condition for finitely many uniform measurements Following the theory in the previous section, we see that to prove linear convergence from finitely many uniform measurements, it suffices to show that the measurement matrix A is $$\textrm{ACW}(\theta ,\alpha )$$ for some $$\theta$$ and $$\alpha$$. For a fixed wedge W, we can easily achieve (4.1) by using a standard matrix concentration theorem. By taking a union bound, we can guarantee that it holds over exponentially many wedges with high probability. However, the function $$W \mapsto \lambda _{\max } (\mathbb{E} aa^{T} 1_{W}(a) )$$ is not Lipschitz with respect to any natural parametrization of wedges in $$S^{n-1}$$, so a naive net argument fails. To get around this, we use VC theory, metric entropy and a chaining theorem from [10]. First, we will use the theory of VC dimension and growth functions to argue that all wedges contain approximately the right fraction of points. This is the content of Lemma 5.1. In order to prove this, a fair number of standard definitions and results are required. These are all provided in Appendix A. Lemma 5.1 (Uniform concentration of empirical measure over wedges) Fix an acute angle $$\theta> 0$$. Let $$\mathcal{W}_{\theta }$$ denote the collection of all wedges of $$S^{n-1}$$ of angle less than $$\theta$$. Suppose A is an m by n matrix with rows $$a_{i}$$ that are independent uniform random vectors on $$S^{n-1}$$, and let $$\mu _{A} = \frac{1}{m}\sum _{i=1}^{m} \delta _{a_{i}}$$. 
Then if $$m \geq (4\pi /\theta )^{2}(2n\log (2em/n)+\log (2/\delta ))$$, with probability at least $$1 - \delta$$, we have   $$\sup_{W \in \mathcal{W}_{\theta}} \mu_{A}(W) \leq 2\theta/\pi.$$ Proof. Let $$\mathcal{W}$$ be the collection of all wedges of any angle, and let $$\mathcal{H}$$ denote the collection of all hemispheres. Using VC theory [21] (see Proposition A.2), we have   $$\mathbb{P}\Bigg(\sup_{W \in \mathcal{W}} \lvert\mu_{A}(W) - \sigma(W)\rvert \geq u\Bigg ) \leq 4\varPi_{\mathcal{W}}(2m)\exp(-mu^{2}/16)$$ (5.1)whenever $$m \geq 2/u^{2}$$. By Claim A.3 and the Sauer–Shelah lemma (Lemma A.1) relating VC dimension to growth functions, we have $$\varPi _{\mathcal{H}}(2m) \leq (2em/n)^{n}$$. Next, notice that using the notation in (A.2), we have $$\mathcal{W} = \mathcal{H}\varDelta \mathcal{H}$$. As such, we may apply Claim A.4 to get   $$\varPi_{\mathcal{W}}(2m) \leq (2em/n)^{2n}.$$ We now plug this bound into the right-hand side of (5.1), set $$u = \theta /\pi$$ and simplify to get   $$\mathbb{P}\Bigg(\sup_{W \in \mathcal{W}} \lvert\mu_{A}(W) - \sigma(W)\rvert \geq \theta/\pi \Bigg) \leq 4\exp(2n\log(2em/n)-m(\theta/\pi)^{2}/16).$$ Our assumption implies that $$m \geq 2/(\theta /\pi )^{2}$$ so the bound holds, and also that the bound is less than $$\delta$$. Finally, since $$\mathcal{W}_{\theta } \subset \mathcal{W}$$ and a wedge of angle less than $$\theta$$ has $$\sigma(W) \leq \theta /\pi$$, on the complement of this event, any $$W \in \mathcal{W}_{\theta }$$ satisfies   $$\mu_{A}(W) \leq \sigma(W) + \theta/\pi \leq 2\theta/\pi$$as we wanted. For every wedge $$W \in \mathcal{W}_{\theta }$$, we may associate the configuration vector   $$s_{W,\,A} := (1_{W}(a_{1}),1_{W}(a_{2}),\ldots,1_{W}(a_{m})).$$We can write   $$\lambda_{\max}(\mathbb{E}_{\mu_{A}}aa^{T}1_{W}(a)) = \frac{1}{m}\lambda_{\max}(A^{T}S_{W,\,A}A),$$ (5.2)where $$S_{W,\,A} = \textrm{diag}(s_{W,\,A})$$. $$S_{W,\,A}$$ is thus a selector matrix, and if we condition on the good event given to us by Lemma 5.1, it selects at most a $$2\theta /\pi$$ fraction of the rows of A. 
This means that $$s_{W,\,A} \in \mathcal{S}_{2\theta /\pi }$$, where we define   $$\mathcal{S}_{\tau} := \lbrace d \in \{0,1\}^{m} \ | \ \langle d,1\rangle \leq \tau \cdot m \rbrace.$$ We would like to majorize the quantity in (5.2) uniformly over all wedges W by the quantity $$\frac{1}{4}\lambda _{\min } \left(\mathbb{E}_{\mu _{A}} aa^{T} \right)$$. In order to do this, we define a stochastic process $$(Y_{s,v})$$ indexed by $$s \in \mathcal{S}_{2\theta /\pi }$$ and $$v \in{B^{n}_{2}}$$, setting   $$Y_{s,v} := n v^{T} A^{T}\textrm{diag}(s)Av = \sum_{i=1}^{m} s_{i} \langle\sqrt{n}a_{i},v\rangle^{2}.$$ (5.3)If we condition on the good event in Lemma 5.1, it is clear that   $$\sup_{W \in \mathcal{W}_{\theta}} \frac{1}{m}\lambda_{\max}(A^{T}S_{W,\,A}A) \leq \frac{1}{nm}\sup_{s \in \mathcal{S}_{2\theta/\pi},v \in{B^{n}_{2}}} Y_{s,v},$$so it suffices to bound the quantity on the right. We will do this using a slightly sophisticated form of chaining, which requires us to make a few definitions. Let (T, d) be a metric space. A sequence $$\mathcal{T} = (T_{k})_{k \in \mathbb{Z}_{+}}$$ of subsets of T is called admissible if $$\lvert T_{0}\rvert = 1$$, and $$\lvert T_{k}\rvert \leq 2^{2^{k}}$$ for all k ≥ 1. For any $$0 < \alpha < \infty$$, we define the $$\gamma _{\alpha }$$ functional of (T, d) to be   $$\gamma_{\alpha}(T,d) := \inf_{\mathcal{T}}\sup_{t \in T} \sum_{k=0}^{\infty} 2^{k/\alpha} d(t,T_{k}).$$ Let $$d_{1}$$ and $$d_{2}$$ be two metrics on T. We say that a process $$(Y_{t})$$ has mixed tail increments with respect to $$(d_{1},d_{2})$$ if there are constants c and C such that for all s, t ∈ T and all u > 0, we have the bound   $$\mathbb{P}(\lvert Y_{s}-Y_{t}\rvert \geq c(\sqrt{u}d_{2}(s,t) + ud_{1}(s,t)) ) \leq Ce^{-u}.$$ (5.4) Remark 5.2 In [10], processes with mixed tail increments are defined as above, but with the further restriction that c = 1 and C = 2. This is not necessary for the result that we need (Lemma 5.3) to hold. 
The indeterminacy of c and C gets absorbed into the final constant in the bound. Lemma 5.3 (Theorem 5, [10]) If $$(Y_{t})_{t \in T}$$ has mixed tail increments, then there is a constant C such that for any u ≥ 1, with probability at least $$1 - e^{-u}$$,   $$\sup_{t \in T}\lvert Y_{t} - Y_{t_{0}}\rvert \leq C(\gamma_{2}(T,d_{2}) + \gamma_{1}(T,d_{1}) + \sqrt{u}\textrm{diam}(T,d_{2}) + u\textrm{diam}(T,d_{1})).$$ At first glance, the $$\gamma _{2}$$ and $$\gamma _{1}$$ quantities seem mysterious and intractable. We will show, however, that they can be bounded by more familiar quantities that are easily computable in our situation. Let us postpone this for the moment, and first show that our process $$(Y_{s,v})$$ has mixed tail increments. Lemma 5.4 ($$(Y_{s,v})$$ has mixed tail increments) Let $$(Y_{s,v})$$ be the process defined in (5.3). Define the metrics $$d_{1}$$ and $$d_{2}$$ on $$\mathcal{S}_{2\theta /\pi } \times{B^{n}_{2}}$$ using the norms $$\lvert \kern -0.25ex\lvert \kern -0.25ex\lvert (w,v)\rvert \kern -0.25ex\rvert \kern -0.25ex\rvert _{1} = \max \lbrace \lVert w\rVert _{\infty },\lVert v\rVert _{2}\rbrace$$ and $$\lvert \kern -0.25ex\lvert \kern -0.25ex\lvert (w,v)\rvert \kern -0.25ex\rvert \kern -0.25ex\rvert _{2} = \max \lbrace \lVert w\rVert _{2},\sqrt{2m\theta /\pi }\lVert v\rVert _{2}\rbrace$$. Then the process has mixed tail increments with respect to $$(d_{1},d_{2})$$. Proof. The main tool that we use is Bernstein’s inequality [23] for sums of subexponential random variables. Observe that each $$\sqrt{n}a_{i}$$ is a sub-Gaussian random vector with bounded sub-Gaussian norm $$\lVert \sqrt{n}a_{i}\rVert _{\psi _{2}} \leq C$$, where C is an absolute constant. As such, for any $$v \in{B^{n}_{2}}$$, $$\langle \sqrt{n}a_{i},v\rangle ^{2}$$ is a subexponential random variable with bounded subexponential norm $$\lVert \langle \sqrt{n}a_{i},v \rangle ^{2} \rVert _{\psi _{1}} \leq C^{2}$$ [23]. 
Now fix v and let $$s, s^\prime \in \mathcal{S}_{2\theta /\pi }$$. Then   $$Y_{s,v} - Y_{s^\prime,v} = \sum_{i=1}^{m} \left(s_{i}- s_{i}^\prime\right) \langle\sqrt{n}a_{i},v\rangle^{2}.$$Using Bernstein's inequality, we have   $$\mathbb{P}(\lvert Y_{s,v} - Y_{s^\prime,v}\rvert \geq u ) \leq 2\exp\big(-c\min\big\lbrace u^{2}/\lVert s-s^\prime{\rVert_{2}^{2}},u/\lVert s-s^\prime\rVert_{\infty}\big\rbrace\big).$$ (5.5) Similarly, if we fix $$s \in \mathcal{S}_{2\theta /\pi }$$ and let $$v, v^\prime \in{B^{n}_{2}}$$, then   \begin{align*} Y_{s,v} - Y_{s,\,v^\prime} & = \sum_{i=1}^{m} s_{i} (\langle\sqrt{n}a_{i},v\rangle^{2} - \langle\sqrt{n}a_{i},v^\prime\rangle^{2} ) \\[-2pt] & = \sum_{i=1}^{m} s_{i} \langle\sqrt{n}a_{i},v-v^\prime\rangle\langle\sqrt{n}a_{i},v+v^\prime\rangle. \end{align*}Since the subexponential norm of a product is bounded by the product of the sub-Gaussian norms of its factors, we can bound the subexponential norm of each summand via   \begin{align*} \lVert s_{i}\langle\sqrt{n}a_{i},v-v^\prime\rangle\langle\sqrt{n}a_{i},v+v^\prime\rangle\rVert_{\psi_{1}} & \leq s_{i}\lVert\langle\sqrt{n}a_{i},v-v^\prime\rangle\rVert_{\psi_{2}} \cdot \lVert\langle\sqrt{n}a_{i},v+v^\prime\rangle\rVert_{\psi_{2}} \\[-2pt] & \leq Cs_{i}\lVert v-v^\prime\rVert_{2}. 
\end{align*}As such,   $$\sum_{i=1}^{m} \lVert s_{i}\langle\sqrt{n}a_{i},v-v^{\prime}\rangle\langle\sqrt{n}a_{i},v+v^{\prime}\rangle\rVert_{\psi_{1}}^{2} \leq C \lVert v-v^{\prime}\rVert_{2}^{2}\sum_{i=1}^{m} s_{i}^{2} \leq C(2\theta/\pi)m \lVert v-v^{\prime}\rVert_{2}^{2}.$$Applying Bernstein's inequality as before, we get   $$\mathbb{P}(\lvert Y_{s,v} - Y_{s,v^{\prime}}\rvert \geq u ) \leq 2\exp\big(-c\min\big\lbrace u^{2}/\big((2\theta/\pi)m\lVert v-v^{\prime}\rVert_{2}^{2}\big),u/\lVert v-v^{\prime}\rVert_{2}\big\rbrace\big).$$ (5.6) Now, recall the simple observation that for any numbers $$a, b \in \mathbb{R}$$, we have   $$\max\lbrace\lvert a\rvert,\lvert b\rvert\rbrace \leq \lvert a\rvert + \lvert b\rvert \leq 2\max\lbrace\lvert a\rvert,\lvert b\rvert\rbrace.$$As such, for any u > 0, given $$s,s^\prime \in \mathcal{S}_{2\theta /\pi }$$, $$v,v^\prime \in{B^{n}_{2}}$$, we have   \begin{align*} &\sqrt{u}\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2} + u\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1} \nonumber\\[-2pt] &\quad \geq \frac{1}{2}\left(\sqrt{u}\lVert s-s^\prime\rVert_{2} + \sqrt{u}\sqrt{2m\theta/\pi}\lVert v-v^\prime\rVert_{2} + u\lVert s-s^\prime\rVert_{\infty} + u\lVert v-v^\prime\rVert_{2}\right) \\[-2pt] & \quad\geq \frac{1}{2}\max\left\{\sqrt{u}\lVert s-s^\prime\rVert_{2} + u\lVert s-s^\prime\rVert_{\infty}, \sqrt{u}\sqrt{2m\theta/\pi}\lVert v-v^\prime\rVert_{2} + u\lVert v-v^\prime\rVert_{2} \right\}. 
\end{align*} Since   $$\lvert Y_{s,v} - Y_{s^\prime,v^\prime}\rvert \leq \lvert Y_{s,v} - Y_{s^\prime,v}\rvert + \lvert Y_{s^\prime,v} - Y_{s^\prime,v^\prime}\rvert,$$we have that if   $$\lvert Y_{s,v} - Y_{s^\prime,v^\prime}\rvert \geq c\left({\sqrt{u}\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2} + u\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1}}\right)\!,$$then either   $$\lvert Y_{s,v} - Y_{s^\prime,v}\rvert \geq \frac{c}{4}\left({\sqrt{u}\lVert s-s^\prime\rVert_{2} + u\lVert s-s^\prime\rVert_{\infty}}\right)$$or   $$\lvert Y_{s^\prime,v} - Y_{s^\prime,v^\prime}\rvert \geq \frac{c}{4}\left({\sqrt{u}\sqrt{2m\theta/\pi}\lVert v-v^{\prime}\rVert_{2} + u\lVert v-v^{\prime}\rVert_{2} }\right).$$ We can then combine the bounds (5.5) and (5.6) to get   $$\mathbb{P}\left(\lvert Y_{s,v} - Y_{s^\prime,v^\prime}\rvert \geq c\left(\sqrt{u}\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2} + u\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1}\right)\right) \leq 4e^{-u}.$$Hence, the process $$(Y_{s,v})$$ satisfies the definition (5.4) for having mixed tail increments. We next bound the $$\gamma _{1}$$ and $$\gamma _{2}$$ functionals for $$\mathcal{S}_{2\theta /\pi } \times{B^{n}_{2}}$$. Lemma 5.5 We may bound the $$\gamma _{1}$$ functional of $$\mathcal{S}_{2\theta /\pi } \times{B^{n}_{2}}$$ by   $$\gamma_{1}\big(\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1}\big) \leq C((2\theta/\pi)\log(\pi/2\theta)m + n).$$ Proof. The proof of the bound uses metric entropy and a version of Dudley’s inequality. Let (T, d) be a metric space, and for any u > 0, let N(T, d, u) denote the covering number of T at scale u, i.e. 
the smallest number of radius u balls needed to cover T. Dudley’s inequality (see [20]) states that there is an absolute constant C for which   $$\gamma_{1}(T,d) \leq C \int_{0}^{\infty} \log N(T,d,u)\, \mathrm{d}u.$$ (5.7) Recall that $$\mathcal{S}_{2\theta /\pi }$$ is the set of all {0, 1} vectors with at most $$2m\theta /\pi$$ ones. For convenience, let us assume that $$2m\theta /\pi$$ is an integer. We then have the inclusion   $$\mathcal{S}_{2\theta/\pi} \subset \bigcup_{I \in \mathcal{I}} [0,1]^{I},$$ where $$\mathcal{I}$$ is the collection of all subsets of [m] of size $$2m\theta /\pi$$, and $$[0,1]^{I}$$ denotes the unit cube in the coordinate set I. We may then also write   $$\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}} \subset \bigcup_{I \in \mathcal{I}} \big([0,1]^{I} \times{B^{n}_{2}}\big).$$ Note that a union of covers for each $$[0,1]^{I} \times{B^{n}_{2}}$$ gives a cover for $$\mathcal{S}_{2\theta /\pi } \times{B^{n}_{2}}$$. This, together with the symmetry of $$\lVert \cdot \rVert _{\infty }$$ with respect to permutation of the coordinates, gives   $$N\left(\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1},u\right) \leq \lvert\mathcal{I}\rvert \cdot N\big([0,1]^{I} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1},u\big)$$for some fixed index set I. We next generalize the notion of covering numbers slightly. Given two sets T and K, we let N(T, K) denote the number of translates of K needed to cover the set T. It is easy to see that we have $$N(T,d,u) = N(T,uB_{d})$$, where $$B_{d}$$ is the unit ball with respect to the metric d. 
Since the unit ball for $$\lvert \kern -0.25ex\lvert \kern -0.25ex\lvert \cdot \rvert \kern -0.25ex\rvert \kern -0.25ex\rvert _{1}$$ is $$B_{\infty }^{m} \times{B_{2}^{n}}$$, we therefore have   \begin{align*} N\big([0,1]^{I} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1},u\big) & = N\big([0,1]^{I} \times{B^{n}_{2}}, u\big(B_{\infty}^{m} \times{B_{2}^{n}}\big)\big) \\ & \leq N\big(B^{(2\theta/\pi)m}_{\infty} \times{B^{n}_{2}}, u\big(B^{(2\theta/\pi)m}_{\infty}\times{B_{2}^{n}}\big) \big). \end{align*} Such a quantity can be bounded using a volumetric argument. Generally, for any centrally symmetric convex body K in $$\mathbb{R}^{n}$$ and any $$0 < u \leq 1$$, we have (see Corollary 4.1.15 in [1])   $$N(K,uK) \leq (3/u)^{n}.$$ (5.8)This implies that   $$\log N([0,1]^{I} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1},u) \leq \log(3/u)((2\theta/\pi)m+n).$$Finally, observe that   $$\log\lvert\mathcal{I}\rvert = \log{{m}\choose{(2\theta/\pi)m}} \leq (2\theta/\pi)m\log(e\pi/2\theta).$$ We can thus plug these last two bounds into (5.7), noting that the integrand is zero for u ≥ 1, to get   \begin{align*} \gamma_{1}\big(\mathcal{S}_{2\theta/\pi}\times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1}\big) & \leq C {\int_{0}^{1}} (2\theta/\pi)m\log(e\pi/2\theta) + \log(3/u)((2\theta/\pi)m+n) \,\mathrm{d}u \\ & \leq C((2\theta/\pi)\log(\pi/2\theta)m + n) \end{align*}as was to be shown. Lemma 5.6 We may bound the $$\gamma _{2}$$ functional of $$\mathcal{S}_{2\theta /\pi }\times{B^{n}_{2}}$$ by   $$\gamma_{2}\left(\mathcal{S}_{2\theta/\pi}\times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2}\right) \leq C\sqrt{2\theta/\pi}\left(m + \sqrt{mn}\right)\!.$$ Proof. Since $$\alpha = 2$$, we may appeal directly to the theory of Gaussian complexity [22]. 
However, since we have already introduced some of the theory of metric entropy in the previous lemma, we might as well continue down this path. In this case, we have the Dudley bound   $$\gamma_{2}(T,d) \leq C \int_{0}^{\infty} \sqrt{\log N(T,d,u)}\, \mathrm{d}u$$ (5.9)for any metric space (T, d). Observe that the unit ball for $$\lvert \kern -0.25ex\lvert \kern -0.25ex\lvert \cdot \rvert \kern -0.25ex\rvert \kern -0.25ex\rvert _{2}$$ is $${B^{m}_{2}} \times (2m\theta /\pi )^{-1/2}{B^{n}_{2}}$$. On the other hand, we conveniently have   $$\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}} \subset \sqrt{2m\theta/\pi}{B^{m}_{2}} \times{B^{n}_{2}}.$$As such, we have   \begin{align*} N\big(\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2},u\big) & \leq N\big(\sqrt{2m\theta/\pi}{B^{m}_{2}} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2},u\big) \\ & = N\big(\sqrt{2m\theta/\pi}{B^{m}_{2}} \times{B^{n}_{2}},u \big({B^{m}_{2}} \times (2m\theta/\pi)^{-1/2}{B^{n}_{2}}\big)\big) \\ & = N(T,(2m\theta/\pi)^{-1/2}uT), \end{align*}where $$T = \sqrt{2m\theta /\pi }{B^{m}_{2}} \times{B^{n}_{2}}$$. Plugging this into (5.9) and subsequently using the volumetric bound (5.8), we get   \begin{align*} \gamma_{2}\left(\mathcal{S}_{2\theta/\pi}\times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2}\right) & \leq C \int_{0}^{\infty} \sqrt{\log N(T,(2m\theta/\pi)^{-1/2}uT)} \,\mathrm{d}u \\ & = C\sqrt{2m\theta/\pi}\int_{0}^{\infty} \sqrt{\log N(T,uT)} \,\mathrm{d}u \\ & \leq C \sqrt{2m\theta/\pi}\sqrt{m+n}, \end{align*}which is clearly equivalent to the bound that we want. At this stage, we can put everything together to bound the supremum of our stochastic process. Theorem 5.7 (Bound on supremum of $$(Y_{s,v})$$) Let $$(Y_{s,v})$$ be the process defined in (5.3). 
Let $$0 < \delta < 1/e$$, let $$\theta$$ be an acute angle, and suppose $$m \geq \max \lbrace n,\log (1/\delta )\pi /2\theta \rbrace$$. Then with probability at least $$1-\delta$$, the supremum of the process satisfies   $$\sup_{s \in \mathcal{S}_{2\theta/\pi},v \in{B^{n}_{2}}} Y_{s,v} \leq C\sqrt{2\theta/\pi}\cdot m.$$ (5.10) Proof. It is easy to see that we have   $$\textrm{diam}\left(\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1}\right) = 2,$$and   $$\textrm{diam}\left(\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2}\right) = 2\sqrt{2m\theta/\pi}.$$Also, observe that we have $$Y_{s,0} = 0$$ for any $$s \in \mathcal{S}_{2\theta /\pi }$$. Using these, together with the previous two lemmas bounding the $$\gamma _{1}$$ and $$\gamma _{2}$$ functionals, we may apply Lemma 5.3 to see that   $$\sup_{s \in \mathcal{S}_{2\theta/\pi},v \in{B^{n}_{2}}} Y_{s,v} \leq C\left({(2\theta/\pi)\log(\pi/2\theta)m + n} + \sqrt{2\theta/\pi}({m + \sqrt{mn} }) + u + \sqrt{u} \sqrt{2m\theta/\pi}\right),$$with probability at least $$1-e^{-u}$$. Using our assumptions on m, we may simplify this bound to obtain (5.10). Finally, we show that $$\frac{1}{m}\sum _{i=1}^{m} a_{i} {a_{i}^{T}}$$ is well-behaved. Lemma 5.8 Let $$\delta>0$$. Then if $$m \geq C(n+\sqrt{\log (1/\delta )})$$, with probability at least $$1-\delta$$, we have   $$\left\lVert{\frac{n}{m}\sum_{i=1}^{m}a_{i}{a_{i}^{T}} - I_{n}}\right\rVert \leq 0.1.$$ Proof. Note, as before, that the $$\sqrt{n}a_{i}$$s are isotropic sub-Gaussian random variables with sub-Gaussian norm bounded by an absolute constant. The claim then follows immediately from Theorem 5.39 in [23], which itself is proved using Bernstein and a simple net argument. 
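The near-isotropy asserted in Lemma 5.8 is easy to probe numerically. The following is only an illustrative sketch, not part of the proof: it draws m uniform vectors on the sphere by normalizing Gaussian vectors and reports the Frobenius norm of $$\frac{n}{m}\sum _{i=1}^{m}a_{i}{a_{i}^{T}} - I_{n}$$, which upper-bounds the operator norm appearing in the lemma. The function names, the seed and the values of n and m are our own choices for illustration.

```python
import math
import random


def sample_sphere(n, rng):
    """Draw one vector uniformly from S^{n-1} by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(t * t for t in g))
    return [t / norm for t in g]


def isotropy_deviation(n, m, seed=0):
    """Frobenius norm of (n/m) * sum_i a_i a_i^T - I_n.

    The Frobenius norm is an upper bound on the operator norm, so a value
    below 0.1 certifies the conclusion of Lemma 5.8 for this sample.
    The seed and sample sizes are illustrative assumptions.
    """
    rng = random.Random(seed)
    gram = [[0.0] * n for _ in range(n)]
    for _ in range(m):
        a = sample_sphere(n, rng)
        for i in range(n):
            for j in range(n):
                gram[i][j] += a[i] * a[j]
    total = 0.0
    for i in range(n):
        for j in range(n):
            entry = (n / m) * gram[i][j] - (1.0 if i == j else 0.0)
            total += entry * entry
    return math.sqrt(total)


if __name__ == "__main__":
    for m in (200, 2000, 20000):
        print(m, isotropy_deviation(n=5, m=m))
```

The deviation should shrink roughly like $$\sqrt{n/m}$$ as m grows, in line with the linear dependence of m on n in the lemma.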
Theorem 5.9 (Finite measurement sets satisfy ACW condition) There is some $$\theta _{0}> 0$$ and an absolute constant C such that for all angles $$0 < \theta \leq \theta _{0}$$, for all dimensions n, and any $$\delta> 0$$, if m satisfies   $$m \geq C(\pi/2\theta)^{2}(n\log(m/n) + \log(1/\delta)),$$ (5.11)then with probability at least $$1-\delta$$, the measurement set A comprising m independent random vectors drawn uniformly from $$S^{n-1}$$ satisfies the $$\textrm{ACW}(\theta ,\alpha )$$ condition with $$\alpha = 1/2$$. Proof. Fix $$n, \delta> 0$$. Choose $$\theta _{0}$$ such that the constant C in the statement in Theorem 5.7 satisfies $$C\sqrt{2\theta _{0}/\pi } \leq 0.1$$. Fix $$0 < \theta \leq \theta _{0}$$, and let $$\varOmega _{1}$$, $$\varOmega _{2}$$ and $$\varOmega _{3}$$ denote the good events in Lemma 5.1, Theorem 5.7 and Lemma 5.8 with this choice of $$\theta$$. Whenever m satisfies our assumption (5.11), the intersection of these events occurs with probability at least $$1-3\delta$$ by the union bound. Let us condition on being in the intersection of these events. For any wedge $$W \in \mathcal{W}_{\theta }$$ (i.e of angle less than $$\theta$$), Lemma 5.1 tells us that its associated selector vector satisfies $$s_{W,\,A} \in \mathcal{S}_{2\theta /\pi }$$ (i.e. that it has at most $$2m\theta /\pi$$ ones). By Theorem 5.7 and our assumption on $$\theta _{0}$$, we then have   $$\lambda_{\max}\left({\frac{1}{m}\sum_{i=1}^{m} a_{i}{a_{i}^{T}}1_{W}(a_{i})}\right) \leq \frac{1}{nm}\sup_{s \in \mathcal{S}_{2\theta/\pi}, v \in{B^{n}_{2}}} Y_{s,v} \leq \frac{0.1}{n}.$$ On the other hand, Lemma 5.8 guarantees that   $$\lambda_{\min}\left({\frac{1}{m}\sum_{i=1}^{m} a_{i}{a_{i}^{T}}}\right) \geq \frac{0.9}{n}.$$ Combining these, we get   $$\lambda_{\min}\left({\frac{1}{m}\sum_{i=1}^{m} a_{i}{a_{i}^{T}} - \frac{4}{m}\sum_{i=1}^{m} a_{i}{a_{i}^{T}}1_{W}(a_{i})}\right) \geq \frac{1}{2n},$$which was to be shown. 6. 
Proof and discussion of Theorem 1.2 We restate the theorem here for convenience. Theorem 6.1 Fix $$\epsilon> 0$$, $$0 < \delta _{1} \leq 1/2$$ and $$0 < \delta ,\delta _{2} \leq 1$$. There are absolute constants C, c > 0 such that if   $$m \geq C(n\log(m/n) + \log(1/\delta)),$$then with probability at least $$1-\delta$$, m sampling vectors selected uniformly and independently from the unit sphere $$S^{n-1}$$ form a set such that the following holds: let $$x \in \mathbb{R}^{n}$$ be a signal vector and let $$x_{0}$$ be an initial estimate satisfying $$\lVert x_{0}-x\rVert _{2} \leq c\sqrt{\delta _{1}}\lVert x\rVert _{2}$$. Then if   $$K \geq 2(\log(1/\epsilon) + \log(2/\delta_{2}))n,$$the Kth step randomized Kaczmarz estimate $$x_{K}$$ satisfies $$\lVert x_{K}-x{\rVert _{2}^{2}} \leq \epsilon \lVert x_{0} - x{\rVert _{2}^{2}}$$ with probability at least $$1-\delta _{1}-\delta _{2}$$. Proof. Let A be our m by n measurement matrix. By Theorem 5.9, there is an angle $$\theta _{0}$$, and a constant C such that for $$m \geq C(n\log (m/n) + \log (1/\delta ))$$, A is $$\textrm{ACW}(\theta _{0},1/2)$$ with probability at least $$1-\delta$$. Taking $$c = \sin (\theta _{0})$$, we can then use Corollary 4.4 with $$\alpha = 1/2$$ to guarantee that with probability at least $$1-\delta _{1}-\delta _{2}$$, running the randomized Kaczmarz update K times gives an estimate $$x_{K}$$ satisfying   $$\lVert x_{K}-x{\rVert_{2}^{2}} \leq \epsilon\lVert x_{0}-x{\rVert_{2}^{2}}.$$This completes the proof of the theorem. Inspecting the statement of the theorem, we see that we can make the failure probability $$\delta$$ as small as we like by making m large enough. Likewise, we can do the same with $$\delta _{2}$$ by adjusting K. Proposition B.1 shows that we can also make $$\delta _{1}$$ smaller by increasing m. However, while the dependence of m and K on $$\delta$$ and $$\delta _{2}$$, respectively, is logarithmic, the dependence of m on $$\delta _{1}$$ is polynomial (we need $$m \gtrsim 1/{\delta _{1}^{2}}$$). 
This is rather unsatisfactory, but can be overcome by a simple ensemble method. We encapsulate this idea in the following algorithm.
Algorithm 1 Ensemble Randomized Kaczmarz
Require: Measurements $$b_{1},\ldots ,b_{m}$$, sampling vectors $$a_{1},\ldots ,a_{m}$$, relative error tolerance $$\epsilon$$, iteration count K, trial count L.
Ensure: An estimate $$\hat{x}$$ for x.
1: Obtain an initial estimate $$x_{0}$$ using the truncated spectral initialization method (see Appendix B).
2: for l = 1, …, L, run K randomized Kaczmarz update steps starting from $$x_{0}$$ to obtain an estimate $$x_{K}^{(l)}$$.
3: for l = 1, …, L, do
4: if $$\lvert\ B(x_{K}^{(l)},2\sqrt{\epsilon }) \cap \lbrace x_{K}^{(1)},\ldots ,x_{K}^{(L)}\rbrace \rvert \geq L/2$$
5: return $$\hat{x} := x_{K}^{(l)}$$.
Proposition 6.2 (Guarantee for ensemble method) Given the assumptions of Theorem 6.1, further assume that $$\delta _{1} + \delta _{2} \leq 1/3$$. For any $$\delta ^\prime> 0$$, there is an absolute constant C such that if $$L \geq C\log (1/\delta ^\prime )$$, then the estimate $$\hat{x}$$ given by Algorithm 1 satisfies $$\lVert \hat{x}-x{\rVert _{2}^{2}} \leq 9\epsilon \lVert x_{0}-x{\rVert _{2}^{2}}$$ with probability at least $$1-\delta ^\prime$$. Proof. For 1 ≤ l ≤ L, let $$\chi _{l}$$ be the indicator variable for $$\lVert x_{K}^{(l)}-x{ \rVert _{2}^{2}} \leq \epsilon \lVert x_{0}-x{ \rVert _{2}^{2}}$$. Then $$\chi _{1},\ldots ,\chi _{L}$$ are i.i.d. Bernoulli random variables each with success probability at least 2/3. Let I be the set of indices l for which $$\chi _{l} = 1$$. Using a Chernoff bound [22], we see that with probability at least $$1-e^{-cL}$$, $$\lvert I\rvert \geq L/2$$. Now let I′ be the set of indices l for which $$\lvert B (x_{K}^{(l)},2\sqrt{\epsilon }) \cap \lbrace x_{K}^{(1)},\ldots ,x_{K}^{(L)} \rbrace \rvert \geq L/2$$. 
Observe that for all $$l,l^\prime \in I$$, we have   $$\big\lVert x_{K}^{(l)}- x_{K}^{(l^\prime)}\big\rVert_{2} \leq \big\lVert x_{K}^{(l)}- x\big\rVert_{2} + \big\lVert x- x_{K}^{(l^\prime)}\big\rVert_{2} \leq 2\sqrt{\epsilon}.$$This implies that $$I \subset I^\prime$$, so $$I^\prime \neq \emptyset$$. Furthermore, for all $$l^\prime \in I^\prime$$, there is l ∈ I for which $$\lVert x_{K}^{(l)}-x_{K}^{(l^\prime )} \rVert _{2} \leq 2\sqrt{\epsilon }$$. As such, we have   $$\big\lVert x_{K}^{(l^\prime)}-x\big\rVert_{2} \leq \big\lVert x_{K}^{(l^\prime)}- x_{K}^{(l)}\big\rVert_{2} + \big\lVert x_{K}^{(l)}-x\big\rVert_{2} \leq 3\sqrt{\epsilon}.$$Now, observe that the estimate $$\hat{x}$$ returned by Algorithm 1 is precisely some $$x_{K}^{(l^\prime )}$$ for which $$l^\prime \in I^\prime$$. This shows that on the good event, we indeed have $$\big \lVert \hat{x}-x{\big \rVert _{2}^{2}} \leq 9\epsilon \lVert x_{0}-x{\rVert _{2}^{2}}$$. By our assumption on L, we see that the failure probability is bounded by $$\delta ^\prime$$. In practice however, the ensemble method is not required. Numerical experiments show that the randomized Kaczmarz method always eventually converges from any initial estimate. 7. Extensions 7.1. Arbitrary initialization In order to obtain a convergence guarantee, we used a truncated spectral initialization to obtain an initial estimate before running randomized Kaczmarz updates. Since the number of steps that we require is only linear in the dimension, and each step requires only linear time, the iteration phase of the algorithm only requires $$O (n^{2} )$$ time, and furthermore does not need to see all the data in order to start running. The spectral initialization on the other hand requires one to see all the data. Forming the matrix from which we obtain the estimate involves adding m rank 1 matrices, and hence naively requires $$O (mn^{2} )$$ time. 
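To make the per-step cost concrete, here is a minimal sketch of the iteration phase, under the assumption that update (1.3), which is not reproduced in this section, takes the form $$x_{k} = x_{k-1} + \eta _{k}a_{r(k)}$$ with $$\eta _{k} = \textrm{sign}(\langle a_{r(k)},x_{k-1}\rangle )b_{r(k)} - \langle a_{r(k)},x_{k-1}\rangle$$ for unit-norm sampling vectors. All function names, the problem sizes, the seed and the choice of a starting point at relative distance 0.1 from x are illustrative assumptions, not the experimental set-up of the paper.

```python
import math
import random


def unit_vector(n, rng):
    """Uniform random vector on S^{n-1} (normalized Gaussian)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(t * t for t in g))
    return [t / norm for t in g]


def kaczmarz_phase_retrieval(A, b, x0, steps, rng):
    """Run `steps` randomized Kaczmarz updates for |<a_i, x>| = b_i.

    ASSUMPTION: rows of A have unit norm and the update is
    x <- x + (sign(<a, x>) * b - <a, x>) * a, as we read (1.3).
    Each update touches one row, so it costs O(n) operations.
    """
    x = list(x0)
    m = len(A)
    for _ in range(steps):
        r = rng.randrange(m)  # sample a row uniformly at random
        a = A[r]
        inner = sum(ai * xi for ai, xi in zip(a, x))
        sign = 1.0 if inner >= 0 else -1.0
        eta = sign * b[r] - inner
        x = [xi + eta * ai for xi, ai in zip(x, a)]
    return x


def demo(n=10, m=100, steps=2000, seed=0):
    """Return (initial squared error, final squared error) on a toy instance."""
    rng = random.Random(seed)
    x = unit_vector(n, rng)  # hypothetical ground-truth signal
    A = [unit_vector(n, rng) for _ in range(m)]
    b = [abs(sum(ai * xi for ai, xi in zip(a, x))) for a in A]
    d = unit_vector(n, rng)
    x0 = [xi + 0.1 * di for xi, di in zip(x, d)]  # start close to x
    xK = kaczmarz_phase_retrieval(A, b, x0, steps, rng)
    err0 = sum((u - v) ** 2 for u, v in zip(x0, x))
    errK = sum((u - v) ** 2 for u, v in zip(xK, x))
    return err0, errK


if __name__ == "__main__":
    print(demo())
```

The sketch takes $$x_{0}$$ as given and never forms an n by n matrix; the spectral initialization is deliberately left out, since it is the costlier part of the pipeline.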
There is hence an incentive to do away with this step altogether, and ask whether the randomized Kaczmarz algorithm works well even if we start from an arbitrary initialization. We have some numerical evidence that this is indeed true, at least for real Gaussian measurements. Unfortunately, we do not have any theoretical justification for this phenomenon, and it will be interesting to see if any results can be obtained in this direction. 7.2. Complex Gaussian measurements We have proved our main results for measurement systems comprising random vectors drawn independently and uniformly from the sphere, or equivalently, for real Gaussian measurements. These are not the measurement sets that are used in practical applications, which often deal with imaging and hence make use of complex measurements. While most theoretical guarantees for phase retrieval algorithms are in terms of real Gaussian measurements, some also hold for complex Gaussian measurements, even with identical proofs. This is the case for PhaseMax [5] and for Wirtinger flow [4]. We believe that a similar situation should hold for the randomized Kaczmarz method, but are not yet able to recalibrate our tools to handle the complex setting. It is easy to adapt the randomized Kaczmarz update formula (1.3) itself: we simply replace the sign of $$\langle a_{r(k)},x_{k-1}\rangle$$ with its phase $$\left(\textrm{i.e.} \frac{\langle a_{r(k)},x_{k-1}\rangle }{\lvert \langle a_{r(k)},x_{k-1}\rangle \rvert } \right)$$. Numerical experiments also show that convergence does occur for complex Gaussian measurements (and even CDP measurements) [25]. Nonetheless, in trying to adapt the proof to this situation, we meet an obstacle at the first step: when computing the error term, we can no longer simply sum up the influence of ‘bad measurements’ as we did in Lemma 2.1. 
Instead, every term contributes an error that scales with the phase difference   $$\frac{\langle a_{i},z\rangle}{\lvert\langle a_{i},z\rangle\rvert} - \frac{\langle a_{i},x\rangle}{\lvert\langle a_{i},x\rangle\rvert}.$$ Since the argument of Jeong and Güntürk also heavily relies on the decomposition of the measurement set into ‘good’ and ‘bad’ measurements, their method likewise does not easily generalize to cover the complex setting. We leave it to future work to prove convergence in this setting, whether by adapting our methods, or by proposing completely new ones. 7.3. Deterministic constructions of measurement sets The theory that we have developed in this paper does not apply solely to Gaussian measurements, and generalizes to any measurement sets that satisfy the ACW condition that we introduced in Section 5. It will be interesting to investigate what natural classes of measurement sets satisfy this condition. Acknowledgements Y.T. would like to thank Halyun Jeong for insightful discussions on this topic. We would also like to thank the anonymous reviewers for their many helpful comments. Funding Juha Heinonen Memorial Graduate Fellowship at the University of Michigan to [Y.T.]. National Science Foundation Grant DMS [1265782] and U.S. Air Force Grant [FA9550-18-1-0031] to R.V. Footnotes 1  This is essentially equivalent to being real Gaussian because of the concentration of norm phenomenon in high dimensions. Also, one may normalize vectors easily. 2 Recall that a wedge of angle $$\theta$$ is the region of the sphere between two hemispheres with normal vectors making an angle of $$\theta$$. Appendix A. Growth functions and VC dimension In this section, we define growth functions and VC dimension. We also state some standard results on these topics that we require for our proofs in Section 5. We refer the interested reader to [17] for a more in-depth exposition on these topics. 
Let $$\mathcal{X}$$ be a set and $$\mathcal{C}$$ be a family of subsets of $$\mathcal{X}$$. For a given set $$C \in \mathcal{C}$$, we slightly abuse notation and identify it with its indicator function $$1_{C} \colon \mathcal{X} \to \lbrace 0,1\rbrace$$. The growth function $$\varPi _{\mathcal{C}}\colon \mathbb{N} \to \mathbb{R}$$ of $$\mathcal{C}$$ is defined via

$$\varPi_{\mathcal{C}}(m) := \max_{x_{1},\ldots,\,x_{m} \in \mathcal{X}} \lvert\left\{{(C(x_{1}),C(x_{2}),\ldots,C(x_{m})) \colon C \in \mathcal{C}}\right\}\rvert.$$

Meanwhile, the VC dimension of $$\mathcal{C}$$ is defined to be the largest integer m for which $$\varPi _{\mathcal{C}}(m) = 2^{m}$$. These two concepts are fundamental to statistical learning theory. The key connection between them is given by the Sauer–Shelah lemma.

Lemma A.1 (Sauer–Shelah, Corollary 3.3 in [17]). Let $$\mathcal{C}$$ be a collection of subsets of VC dimension d. Then for all $$m \geq d$$, we have

$$\varPi_{\mathcal{C}}(m) \leq \left({\frac{em}{d}}\right)^{d}.$$

We are interested in the growth function of a family of subsets $$\mathcal{C}$$ because it yields the following guarantee on the uniform convergence of the empirical measures of sets belonging to $$\mathcal{C}$$.

Proposition A.2 (Uniform deviation, Theorem 2 in [21]). Let $$\mathcal{C}$$ be a family of subsets of a set $$\mathcal{X}$$. Let $$\mu$$ be a probability measure on $$\mathcal{X}$$, and let $$\hat{\mu }_{m} := \frac{1}{m} \sum _{i=1}^{m} \delta _{X_{i}}$$ be the empirical measure obtained from m independent copies of a random variable X with distribution $$\mu$$. For every u such that $$m \geq 2/u^{2}$$, the following deviation inequality holds:

$$\mathbb{P}\!\left(\sup_{C \in \mathcal{C}} \lvert\hat{\mu}_{m}(C) - \mu(C)\rvert \geq u \right) \leq 4\varPi_{\mathcal{C}}(2m)\exp(-mu^{2}/16).$$ (A.1)

We now state and prove two simple claims.

Claim A.3. Let $$\mathcal{C}$$ be the collection of all hemispheres in $$S^{n-1}$$. Then the VC dimension of $$\mathcal{C}$$ is bounded from above by $$n+1$$.

Proof. It is a standard fact from statistical learning theory [17] that the VC dimension of half-spaces in $$\mathbb{R}^{n}$$ is $$n+1$$. Since $$S^{n-1}$$ is a subset of $$\mathbb{R}^{n}$$, the claim follows by the definition of VC dimension.

Claim A.4. Let $$\mathcal{C}$$ and $$\mathcal{D}$$ be two collections of functions from a set $$\mathcal{X}$$ to $$\lbrace 0,1\rbrace$$. Using $$\varDelta$$ to denote symmetric difference, we define

$$\mathcal{C}\varDelta\mathcal{D} := \lbrace C\varDelta D \ | \ C \in \mathcal{C}, D \in \mathcal{D}\rbrace.$$ (A.2)

Then the growth function $$\varPi _{\mathcal{C}\varDelta \mathcal{D}}$$ of $$\mathcal{C}\varDelta \mathcal{D}$$ satisfies $$\varPi _{\mathcal{C}\varDelta \mathcal{D}}(m) \leq \varPi _{\mathcal{C}}(m)\cdot \varPi _{\mathcal{D}}(m)$$ for all $$m \in \mathbb{Z}_{+}$$.

Proof. Fix m, and points $$x_{1},\ldots ,x_{m} \in \mathcal{X}$$. Then every possible configuration $$(f(x_{1}),f(x_{2}),\ldots ,f(x_{m}))$$ arising from some $$f \in \mathcal{C}\varDelta \mathcal{D}$$ is the point-wise symmetric difference

$$(f(x_{1}),f(x_{2}),\ldots,f(x_{m})) = (C(x_{1}),C(x_{2}),\ldots,C(x_{m}))\varDelta (D(x_{1}),D(x_{2}),\ldots,D(x_{m}))$$

of configurations arising from some $$C \in \mathcal{C}$$ and $$D \in \mathcal{D}$$. By the definition of growth functions, there are at most $$\varPi _{\mathcal{C}}(m)\cdot \varPi _{\mathcal{D}}(m)$$ pairs of these configurations, from which the bound follows.

Remark A.5. There is an extensive literature on how to bound the VC dimension of concept classes that arise from finite intersections or unions of those from a known collection of concept classes, each of which has bounded VC dimension. We won’t require this much sophistication here, and refer the reader to [3] for more details.

### Appendix B. Initialization

Several different schemes have been proposed for obtaining initial estimates for PhaseMax and gradient descent methods for phase retrieval. Surprisingly, these are all spectral in nature: the initial estimate $$x_{0}$$ is obtained as the leading eigenvector of a matrix that is constructed out of the sampling vectors $$a_{1},\ldots ,a_{m}$$ and their associated measurements $$b_{1},\ldots ,b_{m}$$ [4,6,24,26]. There seems to be empirical evidence, at least for Gaussian measurements, that the best performing method is the orthogonality-promoting method of [24]. Nonetheless, for any given relative error tolerance, all the methods seem to require sample complexity of the same order. Hence, we focus on the truncated spectral method of [6] for expositional clarity, and refer the reader to the respective papers on the other methods for more details.

The truncated spectral method initializes $$x_{0} := \lambda _{0} \tilde{x}_{0}$$, where $$\lambda _{0} = \sqrt{\frac{1}{m}\sum _{i=1}^{m}{b_{i}^{2}}}$$, and $$\tilde{x}_{0}$$ is the leading eigenvector of

$$Y = \frac{1}{m}\sum_{i=1}^{m}{b_{i}^{2}}a_{i}{a_{i}^{T}}1(b_{i} \leq 3\lambda_{0}).$$

Note that when constructing Y, we sum up only those sampling vectors whose corresponding measurements satisfy $$b_{i} \leq 3\lambda _{0}$$. The point of this is to remove the influence of unduly large measurements, and allow for good concentration estimates, as we shall soon demonstrate.

Suppose from now on that the $$a_{i}$$s are independent standard Gaussian vectors. In [6], the authors prove that with probability at least $$1-\exp (-\varOmega (m))$$, we have $$\lVert x_{0} - x\rVert _{2} \leq \epsilon \lVert x\rVert _{2}$$ for any fixed relative error tolerance $$\epsilon$$ (see their Proposition 3). They do not, however, examine the dependence of the probability bound on $$\epsilon$$. Nonetheless, by examining the proof more carefully, we can make this dependence explicit.
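The truncated spectral method described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of [6]: the function name `truncated_spectral_init`, the use of power iteration to find the leading eigenvector, the problem sizes and the random seed are all choices made here. Since an eigenvector is only defined up to sign, the sketch measures the error against both $$x$$ and $$-x$$.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def truncated_spectral_init(rows, b, rng, power_steps=60):
    """Return x0 = lambda0 * (leading eigenvector of Y), with Y as in Appendix B.

    Hypothetical helper: the leading eigenvector is approximated by power
    iteration, which is an assumption of this sketch.
    """
    m, n = len(rows), len(rows[0])
    lam0 = math.sqrt(sum(bi * bi for bi in b) / m)
    # Weights b_i^2 * 1(b_i <= 3 * lambda0): unduly large measurements are dropped.
    weights = [bi * bi if bi <= 3.0 * lam0 else 0.0 for bi in b]

    # Power iteration on Y = (1/m) * sum_i weights[i] * a_i a_i^T,
    # applied implicitly so that the n-by-n matrix Y is never formed.
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(power_steps):
        yv = [0.0] * n
        for a, w in zip(rows, weights):
            coeff = w * dot(a, v) / m
            for j in range(n):
                yv[j] += coeff * a[j]
        norm = math.sqrt(dot(yv, yv))
        v = [c / norm for c in yv]
    return [lam0 * c for c in v]


rng = random.Random(2)
n, m = 6, 1500
x = [rng.gauss(0.0, 1.0) for _ in range(n)]                         # signal
rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]  # Gaussian a_i
b = [abs(dot(a, x)) for a in rows]                                  # b_i = |<a_i, x>|

x0 = truncated_spectral_init(rows, b, rng)
# Compare against both x and -x because of the sign ambiguity.
dist = min(
    math.sqrt(sum((u - v) ** 2 for u, v in zip(x0, x))),
    math.sqrt(sum((u + v) ** 2 for u, v in zip(x0, x))),
)
rel_error = dist / math.sqrt(dot(x, x))
print(rel_error)
```

Applying Y implicitly costs $$O(mn)$$ per power step, which is one way to avoid the naive $$O(mn^{2})$$ cost of forming the matrix.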
Making the dependence on $$\epsilon$$ explicit, we obtain the following proposition.

Proposition B.1 (Relative error guarantee for initialization). Let $$a_{1},\ldots ,a_{m}$$, $$b_{1},\ldots ,b_{m}$$, $$Y$$ and $$x_{0}$$ be defined as in the preceding discussion. Fix $$\epsilon> 0$$ and $$0 < \delta < 1$$. Then with probability at least $$1-\delta$$, we have $$\lVert x_{0}-x\rVert _{2} \leq \epsilon \lVert x\rVert _{2}$$ so long as $$m \geq C(\log (1/\delta )+n)/\epsilon ^{2}$$.

Proof. We simply make the following observations while following the proof of Proposition 3 in [6]. First, since all quantities are 2-homogeneous in $$\lVert x\rVert _{2}$$, we may assume without loss of generality that $$\lVert x\rVert _{2} = 1$$. Next, there is some absolute constant c such that if we define $$Y_{1}$$ and $$Y_{2}$$ by choosing $$\gamma _{1} = 3+ c\epsilon$$, $$\gamma _{2} = 3 - c\epsilon$$, we have the bound $$\lVert \mathbb{E} Y_{1} - \mathbb{E} Y_{2}\rVert \leq C\epsilon$$. Note also that the deviation estimates $$\lVert Y_{1}-\mathbb{E} Y_{1}\rVert$$, $$\lVert Y_{2}-\mathbb{E} Y_{2}\rVert$$ are bounded by $$C\epsilon$$ given our assumptions on m. This implies that with high probability,

$$\lVert Y-\beta_{1} xx^{T} - \beta_{2} I_{n}\rVert \leq C\epsilon.$$

Adjust our constants so that C in the last equation is bounded by $$\beta _{1} - \beta _{2}$$. We may then apply Davis–Kahan [8] to get

$$\lVert\tilde{x}_{0} - x\rVert_{2} \leq \frac{\lVert Y-\beta_{1} xx^{T} - \beta_{2} I_{n}\rVert}{\beta_{1} - \beta_{2}} \leq \epsilon$$

as we wanted.

By examining the proof carefully, the astute reader will observe that the crucial properties that we used were the rotational invariance of the $$a_{i}$$s (to compute the formulas for $$\mathbb{E} Y_{1}$$ and $$\mathbb{E} Y_{2}$$) and their sub-Gaussian tails (to derive the deviation estimates). These properties also hold for sampling vectors that are uniformly distributed on the sphere. As such, a more lengthy and tedious calculation can be done to show that the guarantee also holds for such sampling vectors. If the reader has any residual doubt, perhaps this can be assuaged by noting that a uniform sampling vector and its associated measurement $$(a_{i},b_{i})$$ can be turned into an honest real Gaussian vector by multiplying both quantities by an independent $$\chi$$ random variable with n degrees of freedom (the distribution of the norm of a standard Gaussian vector in $$\mathbb{R}^{n}$$).

### References

1. Artstein-Avidan, S., Giannopoulos, A. & Milman, V. D. (2015) Asymptotic Geometric Analysis, Part I. Providence, RI, USA: American Mathematical Society.
2. Bahmani, S. & Romberg, J. (2016) Phase retrieval meets statistical learning theory: a flexible convex relaxation. pp. 1–17. Preprint, available as arXiv:1610.04210.
3. Blumer, A., Ehrenfeucht, A., Haussler, D. & Warmuth, M. K. (1989) Learnability and the Vapnik–Chervonenkis dimension. J. ACM, 36, 929–965.
4. Candes, E. J., Li, X. & Soltanolkotabi, M. (2015) Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Info. Theory, 61, 1985–2007.
5. Candes, E. J., Strohmer, T. & Voroninski, V. (2013) PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Commun. Pure Appl. Math., 66, 1241–1274.
6. Chen, Y. & Candes, E. J. (2015) Solving random quadratic systems of equations is nearly as easy as solving linear systems. Adv. Neural Inf. Process. Syst., 2, 739–747.
7. Chi, Y. & Lu, Y. M. (2016) Kaczmarz method for solving quadratic equations. IEEE Signal Process. Lett., 23, 1183–1187.
8. Davis, C. & Kahan, W. M. (1970) The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7, 1–46.
9. Davis, D., Drusvyatskiy, D. & Paquette, C. (2017) The nonsmooth landscape of phase retrieval. Preprint, available as arXiv:1711.03247.
10. Dirksen, S. (2015) Tail bounds via generic chaining. Electron. J. Probab., 20, 1–29.
11. Duchi, J. C. & Ruan, F. (2017) Solving (most) of a set of quadratic equalities: composite optimization for robust phase retrieval. pp. 1–49. Preprint, available as arXiv:1705.02356.
12. Eldar, Y. C. & Mendelson, S. (2014) Phase retrieval: stability and recovery guarantees. Appl. Comput. Harmon. Anal., 36, 473–494.
13. Fienup, J. R. (1982) Phase retrieval algorithms: a comparison. Appl. Opt., 21, 2758.
14. Goldstein, T. & Studer, C. (2016) PhaseMax: Convex Phase Retrieval via Basis Pursuit. pp. 1–28. Preprint, available as arXiv:1610.07531.
15. Hand, P. & Voroninski, V. (2016) An elementary proof of convex phase retrieval in the natural parameter space via the linear program PhaseMax. pp. 1–8. Preprint, available as arXiv:1611.03935.
16. Jeong, H. & Güntürk, C. S. (2017) Convergence of the randomized Kaczmarz method for phase retrieval. pp. 1–13. Preprint, available as arXiv:1706.10291.
17. Mohri, M., Rostamizadeh, A. & Talwalkar, A. (2012) Foundations of Machine Learning. Cambridge, MA, USA: MIT Press.
18. Strohmer, T. & Vershynin, R. (2009) A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15, 262–278.
19. Sun, J., Qu, Q. & Wright, J. (2017) A geometric analysis of phase retrieval. Foundations of Computational Mathematics, pp. 1–68. DOI: https://doi.org/10.1007/s10208-017-9365-9.
20. Talagrand, M. (2005) The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer Monographs in Mathematics. Heidelberg, Berlin: Springer.
21. Vapnik, V. N. & Chervonenkis, A. Y. (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16, 264–280.
22. Vershynin, R. High-Dimensional Probability. Cambridge, UK: Cambridge University Press.
23. Vershynin, R. (2011) Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing (Y. C. Eldar & G. Kutyniok eds). Cambridge: Cambridge University Press, pp. 210–268.
24. Wang, G., Giannakis, G. B. & Eldar, Y. C. (2016) Solving systems of random quadratic equations via truncated amplitude flow. IEEE Trans. Signal Process., 65, 1961–1974.
25. Wei, K. (2015) Solving systems of phaseless equations via Kaczmarz methods: a proof of concept study. Inverse Probl., 31, 12125008.
26. Zhang, H., Zhou, Y., Liang, Y. & Chi, Y. (2016) Reshaped Wirtinger flow for solving quadratic system of equations. NIPS Proc., 2622–2630. Preprint, available as arXiv:1605.07719.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices). For permissions, please e-mail: journals.permissions@oup.com

Information and Inference: A Journal of the IMA, Oxford University Press

# Phase retrieval via randomized Kaczmarz: theoretical guarantees

Information and Inference: A Journal of the IMA, Advance Article – Apr 3, 2018
27 pages

Publisher: Oxford University Press
ISSN: 2049-8764
eISSN: 2049-8772
DOI: 10.1093/imaiai/iay005

### Abstract

We consider the problem of phase retrieval, i.e. that of solving systems of quadratic equations. A simple variant of the randomized Kaczmarz method was recently proposed for phase retrieval, and it was shown numerically to have a computational edge over state-of-the-art Wirtinger flow methods. In this paper, we provide the first theoretical guarantee for the convergence of the randomized Kaczmarz method for phase retrieval. We show that it is sufficient to have as many Gaussian measurements as the dimension, up to a constant factor. Along the way, we introduce a sufficient condition on measurement sets for which the randomized Kaczmarz method is guaranteed to work. We show that Gaussian sampling vectors satisfy this property with high probability; this is proved using a chaining argument coupled with bounds on Vapnik–Chervonenkis (VC) dimension and metric entropy.

### 1. Introduction

The phase retrieval problem is that of solving a system of quadratic equations

$$\lvert\langle a_{i},z\rangle\rvert^{2} = b_{i}^{2}, \quad\quad i = 1,2,\ldots,m,$$ (1.1)

where $$a_{i} \in \mathbb{R}^{n}$$ (or $$\mathbb{C}^{n}$$) are known sampling vectors, $$b_{i}> 0$$ are observed measurements and $$z \in \mathbb{R}^{n}$$ (or $$\mathbb{C}^{n}$$) is the decision variable. This problem is well motivated by practical concerns [13] and has been a topic of study from at least the early 1980s.

Over the last half a decade, there has been great interest in constructing and analysing algorithms with provable guarantees given certain classes of sampling vector sets. One line of research involves ‘lifting’ the quadratic system to a linear system, which is then solved using convex relaxation (PhaseLift) [5]. A second method is to formulate and solve a linear program in the natural parameter space using an anchor vector (PhaseMax) [2,14,15]. Although both of these methods can be proved to have near optimal sample efficiency, the most empirically successful approach has been to directly optimize various naturally formulated non-convex loss functions, the most notable of which are displayed in Table 1.

Table 1. Non-convex loss functions for phase retrieval

| Loss function | Name | Papers |
| --- | --- | --- |
| $$f(z) = \sum_{i=1}^{m} \left(\lvert \langle a_{i},z\rangle \rvert^{2}-b_{i}^{2}\right)^{2}$$ | Squared loss for intensities | [4,19] |
| $$f(z) = \sum_{i=1}^{m} \left(\lvert \langle a_{i},z\rangle \rvert -b_{i}\right)^{2}$$ | Squared loss for amplitudes | [24,26] |
| $$f(z) = \sum_{i=1}^{m} \vert \lvert \langle a_{i},z\rangle \rvert^{2}- b_{i}^{2} \vert$$ | $$\ell_{1}$$ loss for intensities | [9,11,12] |

These loss functions enjoy nice properties, which make them amenable to various optimization schemes [12,19]. Those with provable guarantees include the prox-linear method of [11], and various gradient descent methods [4,6,9,24,26]. Some of these methods also involve adaptive measurement pruning to enhance performance.

In 2015, Wei [25] proposed adapting a family of randomized Kaczmarz methods for solving the phase retrieval problem. He was able to show using numerical experiments that these methods perform comparably with state-of-the-art Wirtinger flow (gradient descent) methods when the sampling vectors are real or complex Gaussian, or when they follow the coded diffraction pattern (CDP) model [4]. He also showed that randomized Kaczmarz methods outperform Wirtinger flow when the sampling vectors are the concatenation of a few unitary bases. Unfortunately, [25] was not able to provide adequate theoretical justification for the convergence of these methods (see Theorem 2.6 in [25]).

In this paper, we attempt to bridge this gap by showing that the basic randomized Kaczmarz scheme used in conjunction with truncated spectral initialization achieves linear convergence to the solution with high probability, whenever the sampling vectors are drawn uniformly from the sphere $$S^{n-1}$$ (see footnote 1) and the number of measurements m is larger than a constant times the dimension n.

It is also interesting to note that the basic randomized Kaczmarz scheme is exactly stochastic gradient descent for the Amplitude Flow objective, which suggests that other gradient descent schemes can also be accelerated using stochasticity.

#### 1.1. Randomized Kaczmarz for solving linear systems

The Kaczmarz method is a fast iterative method for solving overdetermined systems of linear equations that works by iteratively satisfying one equation at a time.
In 2009, Strohmer and Vershynin [18] were able to give a provable guarantee on its rate of convergence, provided that the equation to be satisfied at each step is selected using a prescribed randomized scheme.

Suppose our system to be solved is given by

$$Ax = b,$$ (1.2)

where A is an m by n matrix. Denoting the rows of A by $$a_{1}^{T},\ldots ,a_{m}^{T}$$, we can write (1.2) as the system of linear equations

$$\langle a_{i},x\rangle = b_{i}, \quad i=1,\ldots,m.$$

The solution set of each equation is a hyperplane. The randomized Kaczmarz method is a simple iterative algorithm in which we project the running approximation onto the hyperplane of a randomly chosen equation. More formally, at each step k we randomly choose an index r(k) from [m] such that the probability that r(k) = i is proportional to $$\lVert a_{i}\rVert_{2}^{2}$$, and update the running approximation as follows:

$$x_{k} := x_{k-1} + \frac{b_{r(k)} - \langle a_{r(k)},x_{k-1}\rangle}{\lVert a_{r(k)}\rVert_{2}^{2}}a_{r(k)}.$$

Strohmer and Vershynin [18] were able to prove the following theorem:

Theorem 1.1 (Linear convergence for linear systems). Let $$\kappa (A) = \lVert A\rVert _{F} / \sigma _{\min }(A)$$. Then for any initialization $$x_{0}$$, the estimates given by randomized Kaczmarz for the system (1.2) satisfy

$$\mathbb{E}\lVert x_{k}-x\rVert_{2}^{2} \leq \left(1-\kappa(A)^{-2}\right)^{k} \lVert x_{0} -x\rVert_{2}^{2}.$$

Note that if A has bounded condition number, then $$\kappa (A) \asymp \sqrt{n}$$.

#### 1.2. Randomized Kaczmarz for phase retrieval

In the phase retrieval problem (1.1), each equation

$$\lvert\langle a_{i},x\rangle\rvert = b_{i}$$

defines two hyperplanes, one corresponding to each of $$\pm x$$. A natural adaptation of the randomized Kaczmarz update for this situation is then to project the running approximation to the closer hyperplane.
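As a point of reference for the adaptation just described, the linear-system update of Section 1.1 can be sketched as follows. This is a minimal illustration rather than code from [18]: the function name `randomized_kaczmarz`, the Gaussian test system, the step count and the seed are assumptions made here.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def randomized_kaczmarz(A, b, x0, num_steps, rng):
    """Run num_steps randomized Kaczmarz updates for the system A x = b."""
    sq_norms = [dot(row, row) for row in A]
    x = list(x0)
    for _ in range(num_steps):
        # Choose row i with probability proportional to ||a_i||_2^2.
        i = rng.choices(range(len(A)), weights=sq_norms, k=1)[0]
        # Project x onto the hyperplane <a_i, x> = b_i.
        step = (b[i] - dot(A[i], x)) / sq_norms[i]
        x = [xj + step * aj for xj, aj in zip(x, A[i])]
    return x


rng = random.Random(0)
m, n = 60, 5
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
b = [dot(row, x_true) for row in A]  # consistent system, so x_true solves it

x_hat = randomized_kaczmarz(A, b, [0.0] * n, num_steps=2000, rng=rng)
error = sum((u - v) ** 2 for u, v in zip(x_hat, x_true)) ** 0.5
print(error)
```

Each step touches a single row, so its cost is linear in n, which is the property the phase retrieval variant inherits.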
We restrict to the case where each measurement vector $$a_{i}$$ has unit norm, so that the projection onto the closer hyperplane is given by

$$x_{k} := x_{k-1} + \eta_{k} a_{r(k)},$$ (1.3)

where

$$\eta_{k} = \textrm{sign}(\langle a_{r(k)},x_{k-1}\rangle) b_{r(k)} - \langle a_{r(k)},x_{k-1}\rangle.$$

In order to obtain a convergence guarantee for this algorithm, we need to choose $$x_{0}$$ so that it is close enough to the signal vector x. This is unlike the case for linear systems where we could start with an arbitrary initial estimate $$x_{0} \in \mathbb{R}^{n}$$, but the requirement is par for the course for phase retrieval algorithms. Unsurprisingly, there is a rich literature on how to obtain such estimates [5,6,24,26]. The best methods are able to obtain a good initial estimate using O(n) samples.

#### 1.3. Contributions and main result

The main result of our paper guarantees the linear convergence of the randomized Kaczmarz algorithm for phase retrieval for random measurements $$a_{i}$$ that are drawn independently and uniformly from the unit sphere.

Theorem 1.2 (Convergence guarantee for algorithm). Fix $$\epsilon> 0$$, $$0 < \delta _{1} \leq 1/2$$ and $$0 < \delta ,\delta _{2} \leq 1$$. There are absolute constants C, c > 0 such that if

$$m \geq C(n\log(m/n) + \log(1/\delta)),$$

then with probability at least $$1-\delta$$, m sampling vectors selected uniformly and independently from the unit sphere $$S^{n-1}$$ form a set such that the following holds: let $$x \in \mathbb{R}^{n}$$ be a signal vector and let $$x_{0}$$ be an initial estimate satisfying $$\lVert x_{0}-x\rVert _{2} \leq c\sqrt{\delta _{1}}\lVert x\rVert _{2}$$. Then for any $$\epsilon> 0$$, if

$$K \geq 2(\log(1/\epsilon) + \log(2/\delta_{2}))n,$$

then the Kth step randomized Kaczmarz estimate $$x_{K}$$ satisfies $$\lVert x_{K}-x\rVert_{2}^{2} \leq \epsilon \lVert x_{0} - x\rVert_{2}^{2}$$ with probability at least $$1-\delta _{1}-\delta _{2}$$.

Comparing this result with Theorem 1.1, we observe two key differences. First, there are now two sources of randomness: one is in the creation of the measurements $$a_{i}$$, and the other is in the selection of the equation at every iteration of the algorithm. The theorem gives a guarantee that holds with high probability over both sources of randomness. Second, Theorem 1.2 requires an initial estimate $$x_{0}$$. This is not hard to obtain. Indeed, using the truncated spectral initialization method of [6], we may obtain such an estimate with high probability given $$m \gtrsim n$$. For more details, see Proposition B.1.

The proof of this theorem is more involved than the Strohmer–Vershynin analysis of the randomized Kaczmarz algorithm for linear systems [18]. We break down the argument into smaller steps, each of which may be of independent interest to researchers in this field.

First, we generalize the Kaczmarz update formula (1.3) and define what it means to take a randomized Kaczmarz step with respect to any probability measure on the sphere $$S^{n-1}$$: we choose a measurement vector at each step according to this measure. Using a simple geometric argument, we then provide a bound for the expected decrement in distance to the solution set in a single step, where the quality of the bound is given in terms of the properties of the measure we are using for the Kaczmarz update (Lemma 2.1).

Performing the generalized Kaczmarz update with respect to the uniform measure on the sphere corresponds to running the algorithm with unlimited measurements. We utilize the symmetry of the uniform measure to compute an explicit formula for the bound on the stepwise expected decrement in distance. This decrement is geometric whenever we make the update from a point making an angle of less than $$\pi /8$$ with the true solution, so we obtain linear convergence conditioned on no iterates escaping from the ‘basin of linear convergence’. We are able to bound the probability of this bad event using a supermartingale inequality (Theorem 3.1).
Next, we abstract out the property of the uniform measure that allows us to obtain local linear convergence. We call this property the anti-concentration on wedges property, or ACW for short. Using this convenient definition, we can easily generalize our previous proofs for the uniform measure to show that all ACW measures give rise to randomized Kaczmarz update schemes with local linear convergence (Theorem 4.3).

The usual Kaczmarz update corresponds to running the generalized Kaczmarz update with respect to $$\mu _{A} := \frac{1}{m}\sum _{i=1}^{m}\delta _{a_{i}}$$. We are able to prove that when the $$a_{i}$$s are selected uniformly and independently from the sphere, then $$\mu _{A}$$ satisfies the ACW condition with high probability, so long as $$m \gtrsim n$$ (Theorem 5.9). The proof of this fact uses Vapnik–Chervonenkis (VC) theory and a chaining argument, together with metric entropy estimates.

Finally, we are able to put everything together to prove a guarantee for the full algorithm in Section 6. In that section, we also discuss the failure probabilities $$\delta$$, $$\delta _{1}$$ and $$\delta _{2}$$, and how they can be controlled.

#### 1.4. Related work

During the preparation of this manuscript, we became aware of independent simultaneous work done by Jeong and Güntürk. They also studied the randomized Kaczmarz method adapted to phase retrieval, and obtained almost the same result that we did (see [16] and Theorem 1.1 therein). In order to prove their guarantee, they use a stopping time argument similar to ours, but replace the ACW condition with a stronger condition called admissibility. They prove that measurement systems comprising vectors drawn independently and uniformly from the sphere satisfy this property with high probability, and the main tools they use in their proof are hyperplane tessellations and a net argument together with Lipschitz relaxation of indicator functions.

After submitting the first version of this manuscript, we also became aware of independent work done by Zhang, Zhou, Liang and Chi [26]. Their work examines stochastic schemes in more generality (see Section 3 in their paper), and they claim to prove linear convergence for both the randomized Kaczmarz method as well as what they called Incremental Reshaped Wirtinger Flow. However, they only prove that the distance to the solution decreases in expectation under a single Kaczmarz update (an analogue of our Lemma 2.1 specialized to real Gaussian measurements). As we will see in our paper, this bound cannot be naively iterated.

#### 1.5. Notation

Throughout the paper, C and c are absolute constants that can change from line to line.

### 2. Computations for a single step

In this section, we will compute what happens in expectation for a single update step of the randomized Kaczmarz method. It will be convenient to generalize our sampling scheme slightly as follows. When we work with a fixed matrix A, we may view our selection of a random row $$a_{r(k)}$$ as drawing a random vector according to the measure $$\mu _{A} := \frac{1}{m}\sum _{i=1}^{m} \delta _{a_{i}}$$. We need not restrict ourselves to sums of Diracs. For any probability measure $$\mu$$ on the sphere $$S^{n-1}$$, we define the random map $$P = P_{\mu }$$ on vectors $$z \in \mathbb{R}^{n}$$ by setting

$$P z := z + \eta a,$$ (2.1)

where

$$\eta = \textrm{sign}(\langle a,z\rangle)\lvert\langle a,x\rangle\rvert - \langle a,z\rangle \quad\textrm{and}\quad a \sim \mu.$$ (2.2)

Note that as before, x is a fixed vector in $$\mathbb{R}^{n}$$ (think of x as the actual solution of the phase retrieval problem). We call $$P_{\mu }$$ the generalized Kaczmarz projection with respect to $$\mu$$.
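The projection (2.1)–(2.2) with $$\mu = \mu_{A}$$ can be sketched as follows. This is an illustration under stated assumptions: the helper names `unit_vector` and `kaczmarz_projection`, the convention sign(0) = +1, the problem sizes, the step count and the seed are choices made here, and a small perturbation of x stands in for the spectral initialization.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def unit_vector(n, rng):
    """Draw a vector uniformly from the unit sphere S^{n-1}."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(dot(g, g))
    return [gi / norm for gi in g]


def kaczmarz_projection(z, a, b):
    """Apply (2.1)-(2.2): return z + eta * a, eta = sign(<a, z>) * b - <a, z>.

    Here b stands for |<a, x>| and a is assumed to have unit norm.
    Assumption of this sketch: sign(0) is taken to be +1.
    """
    inner = dot(a, z)
    sign = 1.0 if inner >= 0 else -1.0
    eta = sign * b - inner
    return [zi + eta * ai for zi, ai in zip(z, a)]


rng = random.Random(1)
n, m = 10, 80
x = unit_vector(n, rng)                      # signal to recover
rows = [unit_vector(n, rng) for _ in range(m)]
b = [abs(dot(a, x)) for a in rows]           # phaseless measurements

# Start close to x (stand-in for a spectral initialization).
noise = unit_vector(n, rng)
z = [xi + 0.1 * ni for xi, ni in zip(x, noise)]

for _ in range(300 * n):
    i = rng.randrange(m)                     # a ~ mu_A, uniform over the rows
    z = kaczmarz_projection(z, rows[i], b[i])

error = math.sqrt(sum((zi - xi) ** 2 for zi, xi in zip(z, x)))
print(error)
```

Because all rows have unit norm, sampling proportionally to $$\lVert a_{i}\rVert_{2}^{2}$$ reduces to uniform sampling, which is why `randrange` suffices here.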
Using this update rule over independent realizations of P, $$P_{1},P_{2},\ldots,$$ together with an initial estimate $$x_{0}$$, gives rise to a generalized randomized Kaczmarz algorithm for finding x: set the kth step estimate to be   \begin{align} x_{k} := P_{k}P_{k-1}\cdots P_{1} x_{0}. \end{align} (2.3) Fix a vector $$z \in \mathbb{R}^{n}$$ that is closer to x than to −x, i.e. so that ⟨x, z⟩ > 0, and suppose that we are trying to find x. Examining the formula in (2.2), we see that P projects z onto the correct hyperplane (i.e. the one passing through x instead of the one passing through −x) if and only if ⟨a, z⟩ and ⟨a, x⟩ have the same sign. In other words, this occurs if and only if the random vector a does not fall into the region of the sphere defined by   $$W_{x,z} := \lbrace v \in S^{n-1} \ | \ \textrm{sign}(\langle v,x\rangle) \neq \textrm{sign}(\langle v,z\rangle)\rbrace.$$ (2.4) This is the region lying between the two hemispheres with normal vectors x and z. We call such a region a spherical wedge, since in three dimensions it has the shape depicted in Fig. 1. Fig. 1. Geometry of $$W_{x,z}$$. When $$a \notin W_{x,z}$$, we can use the Pythagorean theorem to write   $$\lVert z-x{\rVert_{2}^{2}} = \lVert Pz-x{\rVert_{2}^{2}} + \langle z-x,a\rangle^{2}.$$ (2.5) Rearranging gives   $$\lVert Pz-x{\rVert_{2}^{2}} = \lVert z-x{\rVert_{2}^{2}}(1 - \langle\tilde{z},a\rangle^{2}),$$ (2.6) where $$\tilde{z} = (z-x)/\lVert z-x\rVert _{2}$$. In the complement of this event, i.e. when $$a \in W_{x,z}$$, we get   $$Pz = z + \langle a,(-x)-z\rangle a = z - \langle a,z-x\rangle a + \langle a,-2x\rangle a,$$ and using orthogonality,   $$\lVert Pz-x{\rVert_{2}^{2}} = \lVert z-x{\rVert_{2}^{2}} - \langle a,z-x\rangle^{2} + \langle a,2x\rangle^{2}.$$ (2.7) Fig. 2. Orientation of x, z and Pz when $$a \notin W_{x,z}$$ and when $$a \in W_{x,z}$$. $$H_{+}$$ and $$H_{-}$$ denote the hyperplanes defined by the equations ⟨y, a⟩ = b and ⟨y, a⟩ = −b, respectively. $$H_{0}$$ denotes the hyperplane defined by the equation ⟨y, a⟩ = 0. The case $$a \notin W_{x,z}$$ justifies (2.5), and the case $$a \in W_{x,z}$$ justifies (2.8). When $$a \in W_{x,z}$$, z gets projected to the hyperplane containing −x, so it may move further away from x. However, we can bound how far away it can move. Because ⟨a, x⟩ has the opposite sign to ⟨a, z⟩, we have   $$\lvert\langle a,z+x\rangle\rvert < \lvert\langle a,z-x\rangle\rvert,$$ and so   $$\lvert\langle a,2x\rangle\rvert = \lvert\langle a,(z-x) - (z + x)\rangle\rvert < 2 \lvert\langle a,z-x\rangle\rvert.$$ Substituting this into (2.7), we get the bound   $$\lVert Pz-x{\rVert_{2}^{2}} \leq \lVert z-x{\rVert_{2}^{2}} + 3\langle a,z-x\rangle^{2} = \lVert z-x{\rVert_{2}^{2}}(1 + 3\langle\tilde{z},a\rangle^{2}),$$ (2.8) where $$\tilde{z}$$ is as before. 
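The identity (2.6) and the bound (2.8) can be checked numerically on random directions. The sketch below assumes NumPy; the function name `one_step_ratio` and the test vectors are illustrative choices, not part of the paper.

```python
import numpy as np

def one_step_ratio(z, a, x):
    """Return (ratio, in_wedge, t2) for a single Kaczmarz step.

    ratio    = ||Pz - x||^2 / ||z - x||^2
    in_wedge = True when sign<a,x> != sign<a,z>, i.e. a lies in W_{x,z}
    t2       = <z_tilde, a>^2 with z_tilde = (z - x) / ||z - x||
    """
    az, ax = a @ z, a @ x
    Pz = z + (np.sign(az) * abs(ax) - az) * a
    d2 = np.sum((z - x) ** 2)
    ratio = np.sum((Pz - x) ** 2) / d2
    t2 = (a @ (z - x)) ** 2 / d2
    return ratio, bool(np.sign(az) != np.sign(ax)), t2

rng = np.random.default_rng(1)
n = 6
x = rng.standard_normal(n)
z = x + 0.8 * rng.standard_normal(n)   # a fairly poor estimate, so wedge hits occur

max_gap_outside = 0.0        # largest |ratio - (1 - t2)| over a outside the wedge
max_excess_inside = -np.inf  # largest ratio - (1 + 3 t2) over a inside the wedge
hits = 0
for _ in range(5000):
    g = rng.standard_normal(n)
    a = g / np.linalg.norm(g)
    ratio, in_wedge, t2 = one_step_ratio(z, a, x)
    if in_wedge:
        hits += 1
        max_excess_inside = max(max_excess_inside, ratio - (1 + 3 * t2))
    else:
        max_gap_outside = max(max_gap_outside, abs(ratio - (1 - t2)))
```

Outside the wedge the ratio matches 1 − ⟨z̃, a⟩² up to rounding, as in (2.6); inside the wedge it stays below 1 + 3⟨z̃, a⟩², as in (2.8).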
We can combine (2.6) and (2.8) into a single inequality by writing   \begin{align*} \lVert Pz-x\rVert_{2}^{2} & \leq \lVert z-x\rVert_{2}^{2}(1 - \langle\tilde{z},a\rangle^{2})1_{W_{x,z}^{c}}(a) + \lVert z-x\rVert_{2}^{2}(1 + 3\langle\tilde{z},a\rangle^{2})1_{W_{x,z}}(a) \\ & = \lVert z-x\rVert_{2}^{2}\left(1 - (1-4\cdot 1_{W_{x,z}}(a))\langle\tilde{z},a\rangle^{2}\right) \\ & = \lVert z-x\rVert_{2}^{2}\left(1 - \left\langle\tilde{z},(1-4\cdot 1_{W_{x,z}}(a))aa^{T} \tilde{z}\right\rangle\right). \end{align*}Taking expectations, we can remove the role that $$\tilde{z}$$ plays by bounding this as follows:   \begin{align*} \mathbb{E}\Big[\lVert z-x{\rVert_{2}^{2}}\left(1 - \left\langle\tilde{z},(1-4\cdot 1_{W_{x,z}}(a))aa^{T} \tilde{z}\right\rangle\right)\Big] & = \lVert z-x{\rVert_{2}^{2}}\left(1 - \left\langle\tilde{z},\mathbb{E}[(1-4\cdot 1_{W_{x,z}}(a))aa^{T}] \tilde{z}\right\rangle\right) \\ & \leq \lVert z-x{\rVert_{2}^{2}}\left[{1 -\lambda_{\min}\left(\mathbb{E} aa^{T}-4\mathbb{E} aa^{T}1_{W_{x,z}}(a)\right)}\right]. \end{align*} We may thus summarize what we have obtained in the following lemma. Lemma 2.1 (Expected decrement) Fix vectors $$x,z \in \mathbb{R}^{n}$$, a probability measure $$\mu$$ on $$S^{n-1}$$, and let $$P = P_{\mu }$$, $$W_{x,z}$$ be defined as in (2.1) and (2.4), respectively. Then   $$\mathbb{E}\lVert Pz-x{\rVert_{2}^{2}} \leq \left[{1 -\lambda_{\min}\left(\mathbb{E} aa^{T}-4\mathbb{E} aa^{T}1_{W_{x,z}}(a)\right)}\right]\lVert z-x{\rVert_{2}^{2}}.$$ Let us next compute what happens for $$\mu = \sigma$$, the uniform measure on the sphere. It is easy to see that $$\mathbb{E} aa^{T} = \frac{1}{n}I_{n}$$, so it remains to compute $$\mathbb{E} aa^{T}1_{W_{x,z}}(a)$$. To do this, we make a convenient choice of coordinates: let $$\theta$$ be the angle between z and x. 
We assume that both points lie in the plane spanned by $$e_{1}$$ and $$e_{2}$$, the first two basis vectors, and that the angle between z and x is bisected by $$e_{1}$$, as illustrated in Fig. 3. Fig. 3. Choice of coordinates. For convenience, denote $$M := \mathbb{E} aa^{T} 1_{W_{x,z}}(a)$$. Let Q denote the orthogonal projection operator onto the span of $$e_{1}$$ and $$e_{2}$$. Then $$Q(W_{x,z})$$ is the union of two sectors of angle $$\theta$$, which are respectively bisected by $$e_{2}$$ and $$-e_{2}$$. Recall that all coordinate projections of the uniform random vector a are uncorrelated. It is clear from the symmetry in Fig. 3 that they remain uncorrelated even when conditioning on the event that $$a \in W_{x,z}$$. As such, M is a diagonal matrix. Let $$\phi$$ denote the anti-clockwise angle of Qa from $$e_{2}$$ (see Fig. 3). We may write   $$\langle a,e_{1}\rangle^{2} = {\lVert Qa\rVert_{2}^{2}} \langle Qa/\lVert Qa\rVert_{2}, e_{1}\rangle^{2} = {\lVert Qa\rVert_{2}^{2}} \sin^{2}\phi.$$ Note that the magnitude and direction of Qa are independent, and $$a \in W_{x,z}$$ if and only if either $$\phi$$ or $$\phi - \pi$$ lies between $$-\theta /2$$ and $$\theta /2$$. We therefore have   $$M_{11} = \mathbb{E} [\langle a,e_{1}\rangle^{2}1_{W_{x,z}}(a)] = \mathbb{E}\big[{\lVert Qa\rVert_{2}^{2}}\big]\,\mathbb{E}\big[\sin^{2}\phi\, 1_{(-\theta/2,\theta/2)}(\phi \textrm{ or }\phi-\pi)\big].$$ By a standard calculation using symmetry, we have $$\mathbb{E}{\lVert Qa\rVert _{2}^{2}} = 2/n$$. 
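The symmetry claims above (that M is diagonal and that the expected squared norm of Qa is 2/n) can be sanity-checked by Monte Carlo. The following is a rough sketch assuming NumPy; the dimension, angle and sample size are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta, N = 5, 1.0, 200_000   # illustrative dimension, wedge angle, sample count

# Coordinates as in the text: x and z lie in span(e1, e2) and are bisected by e1.
x = np.zeros(n)
x[0], x[1] = np.cos(theta / 2), np.sin(theta / 2)
z = np.zeros(n)
z[0], z[1] = np.cos(theta / 2), -np.sin(theta / 2)

# N independent uniform samples on the sphere, stored as rows.
G = rng.standard_normal((N, n))
A = G / np.linalg.norm(G, axis=1, keepdims=True)

# Indicator of the wedge W_{x,z}.
in_wedge = np.sign(A @ x) != np.sign(A @ z)

# Empirical version of M = E[a a^T 1_W(a)].
M = (A[in_wedge].T @ A[in_wedge]) / N

mean_Qa_sq = np.mean(A[:, 0] ** 2 + A[:, 1] ** 2)     # estimate of E||Qa||^2
off_diag = np.max(np.abs(M - np.diag(np.diag(M))))    # largest off-diagonal entry
```

Up to sampling error, `off_diag` is close to zero and `mean_Qa_sq` is close to 2/n.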
Since $$\phi$$ is distributed uniformly on the circle, we can compute   $$\mathbb{E}\sin^{2}\phi 1_{(-\theta/2,\theta/2)}(\phi \textrm{ or }\phi-\pi) = \frac{1}{\pi}\int_{-\theta/2}^{\theta/2} \sin^{2} t \,\mathrm{d}t = \frac{1}{\pi} \int_{-\theta/2}^{\theta/2} \frac{1-\cos(2t)}{2} \,\mathrm{d}t = \frac{\theta-\sin\theta}{2\pi}.$$ As such, we have $$M_{11} = (\theta - \sin \theta )/(n\pi)$$, and by a similar calculation, $$M_{22} = (\theta + \sin \theta )/(n\pi)$$. Meanwhile, for i ≥ 3 the entries $$M_{ii}$$ are all equal by symmetry, so we have   \begin{align*} M_{ii} & = \frac{\textrm{Tr}(M) - M_{11} - M_{22}}{n-2} \\ & = \frac{\mathbb{E}\left[\lVert(I-Q)a{\rVert_{2}^{2}}1_{W_{x,z}}(a)\right]}{n-2} \\ & = \frac{\mathbb{E}\lVert(I-Q)a{\rVert_{2}^{2}}\mathbb{E} 1_{(-\theta/2,\theta/2)}(\phi \textrm{ or }\phi-\pi)}{n-2} \\ & = \frac{(n-2)/n \cdot \theta/\pi }{n-2} = \frac{\theta}{n\pi}. \end{align*} This implies that   $$\lambda_{\max}(M) = \frac{\theta + \sin\theta}{n\pi}.$$ (2.9) We have now completed proving the following lemma. Lemma 2.2 (Expected decrement for uniform measure) Fix vectors $$x, z \in \mathbb{R}^{n}$$ such that ⟨z, x⟩ > 0, and let $$P = P_{\sigma }$$ denote the generalized Kaczmarz projection with respect to $$\sigma$$, the uniform measure on the sphere. Let $$\theta$$ be the angle between z and x. Then   $$\mathbb{E}\lVert Pz-x{\rVert_{2}^{2}} \leq \Bigg[ 1 - \frac{1-4(\theta+\sin\theta)/\pi}{n} \Bigg] \lVert z-x{\rVert_{2}^{2}}.$$ Remark 2.3 By being more careful, one may compute an exact formula for the expected decrement rather than a bound as in the previous lemma. This is not necessary for our purposes and does not give better guarantees in our analysis, so the computation is omitted. 3. 
Local linear convergence using unlimited uniform measurements In this section, we will show that if we start with an initial estimate that is close enough to the ground truth x, then repeatedly applying generalized Kaczmarz projections with respect to the uniform measure $$\sigma$$ gives linear convergence in expectation. This is exactly the situation we would be in if we were to run randomized Kaczmarz, given an unlimited supply of independent sampling vectors $$a_{1},a_{2},\ldots$$ drawn uniformly from the sphere. We would like to imitate the proof for linear convergence of randomized Kaczmarz for linear systems (Theorem 1.1) given in [18]. We denote by $$X_{k}$$ the estimate after k steps, using capital letters to emphasize the fact that it is a random variable. If we know that $$X_{k}$$ takes the value $$x_{k} \in \mathbb{R}^{n}$$, and the angle $$\theta _{k}$$ that $$x_{k}$$ makes with x is smaller than $$\pi /8$$, then Lemma 2.2 tells us   $$\mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}} \ | \ X_{k} = x_{k}\big] \leq (1-\alpha_{\sigma}/n)\lVert x_{k}-x{\rVert_{2}^{2}},$$ (3.1) where $$\alpha _{\sigma } := 1/2 - 4\sin (\pi /8)/\pi> 0$$. The proof for Theorem 1.1 proceeds by unconditioning and iterating a bound similar to (3.1). Unfortunately, our bound depends on $$x_{k}$$ being in a specific region in $$\mathbb{R}^{n}$$ and does not hold for arbitrary $$x_{k}$$. Nonetheless, by using some basic concepts from stochastic process theory, we may derive a conditional linear convergence estimate. The details are as follows. For each k, let $$\mathcal{F}_{k}$$ denote the $$\sigma$$-algebra generated by $$a_{1},a_{2},\ldots ,a_{k}$$, where $$a_{k}$$ is the sampling vector used in step k. Let $$B \subset \mathbb{R}^{n}$$ be the region comprising all points making an angle less than or equal to $$\pi /8$$ with x. This is our basin of linear convergence. Let us assume a fixed initial estimate $$x_{0} \in B$$. 
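To make the constant in (3.1) concrete, the sketch below evaluates $$\alpha _{\sigma }$$ and runs the iteration with fresh uniform sampling vectors from a starting point inside B. It assumes NumPy; `run_kaczmarz`, the dimension, the step count and the starting error are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

# Constant from (3.1): roughly 0.0127, so the guaranteed rate is conservative.
alpha_sigma = 0.5 - 4 * np.sin(np.pi / 8) / np.pi

def run_kaczmarz(x, x0, steps, rng):
    """Iterate the Kaczmarz update with fresh uniform directions.

    Returns the array of squared errors ||X_k - x||^2 for k = 0, ..., steps.
    """
    z = x0.copy()
    errs = [np.sum((z - x) ** 2)]
    for _ in range(steps):
        g = rng.standard_normal(x.size)
        a = g / np.linalg.norm(g)                    # a ~ uniform on the sphere
        az = a @ z
        z = z + (np.sign(az) * abs(a @ x) - az) * a  # update (2.1)-(2.2)
        errs.append(np.sum((z - x) ** 2))
    return np.array(errs)

rng = np.random.default_rng(3)
n = 10
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                    # unit-norm ground truth (assumed)
d = rng.standard_normal(n)
x0 = x + 0.02 * d / np.linalg.norm(d)     # relative error 0.02, well inside B
errs = run_kaczmarz(x, x0, 500, rng)
```

In such runs the observed decay is typically much faster than the factor $$(1-\alpha _{\sigma }/n)$$ per step, which is consistent with (3.1) being an upper bound over the whole basin.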
Now define a stopping time $$\tau$$ via   \begin{align} \tau := \min \lbrace k \colon X_{k} \notin B\rbrace. \end{align} (3.2) For each k, and $$x_{k} \in B$$, we have   \begin{align*} \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}}1_{\tau>{k+1}} \ | \ X_{k} = x_{k}\big] & \leq \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}} 1_{\tau > k} \ | \ X_{k} = x_{k}\big] \\ & = \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}}1_{\tau > k} \ | \ X_{k} = x_{k}, \mathcal{F}_{k}\big] \\ & = \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}} \ | \ X_{k} = x_{k}, \mathcal{F}_{k}\big]1_{\tau > k} \\ & \leq (1-\alpha_{\sigma}/n)\lVert x_{k}-x{\rVert_{2}^{2}} 1_{\tau > k}. \end{align*} Here, the first inequality follows from the inclusion $$\lbrace \tau>{k+1}\rbrace \subset \lbrace \tau > k\rbrace$$, the first equality statement from the Markov nature of the process $$(X_{k})$$, the second equality statement from the fact that $$\tau$$ is a stopping time, while the second inequality is simply (3.1). Taking expectations with respect to $$X_{k}$$ then gives   \begin{align*} \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}} 1_{\tau>{k+1}}\big] & = \mathbb{E}\big[\mathbb{E}\big[{\lVert X_{k+1}-x{\rVert_{2}^{2}}1_{\tau >{k+1}} \ | \ X_{k}} \big]\big] \\ & \leq (1-\alpha_{\sigma}/n)\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} 1_{\tau > k}\big]. \end{align*} By induction, we therefore obtain   $$\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} 1_{\tau> k}\big] \leq (1-\alpha_{\sigma}/n)^{k}\lVert x_{0}-x{\rVert_{2}^{2}}.$$ We have thus proven the first part of the following convergence theorem. Theorem 3.1 (Linear convergence from unlimited measurements) Let x be a vector in $$\mathbb{R}^{n}$$, let $$\delta> 0$$, and let $$x_{0}$$ be an initial estimate to x such that $$\lVert x_{0} - x\rVert _{2} \leq \delta \lVert x\rVert _{2}$$. Suppose that our measurements $$a_{1},a_{2},\ldots$$ are fully independent random vectors distributed uniformly on the sphere $$S^{n-1}$$. 
Let $$X_{k}$$ be the estimate given by the randomized Kaczmarz update formula (2.3) at step k, and let $$\tau$$ be the stopping time defined via (3.2). Then for every $$k \in \mathbb{Z}_{+}$$,   $$\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} 1_{\tau = \infty}\big] \leq (1-\alpha_{\sigma}/n)^{k}\lVert x_{0}-x{\rVert_{2}^{2}},$$ (3.3)where $$\alpha _{\sigma } = 1/2 - 4\sin (\pi /8)/\pi> 0$$. Furthermore, $$\mathbb{P}(\tau < \infty ) \leq (\delta /\sin (\pi /8))^{2}$$. Proof. In order to prove the second statement, we combine a stopping time argument with a supermartingale maximal inequality. Set $$Y_{k} := \lVert X_{\tau \wedge\, k}-x{\rVert _{2}^{2}}$$. We claim that $$Y_{k}$$ is a supermartingale. To see this, we break up its conditional expectation as follows:   \begin{align*} \mathbb{E}[Y_{k+1} \ | \ \mathcal{F}_{k}] & = \mathbb{E}\big[\lVert X_{\tau \wedge (k+1)}-x{\rVert_{2}^{2}}1_{\tau \leq\, k} \ | \ \mathcal{F}_{k}\big] + \mathbb{E}\big[\lVert X_{\tau \wedge (k+1)}-x{\rVert_{2}^{2}}1_{\tau> k} \ | \ \mathcal{F}_{k}\big] \\ & = \mathbb{E}\big[\lVert X_{\tau \wedge\, k}-x{\rVert_{2}^{2}}1_{\tau \leq\, k} \ | \ \mathcal{F}_{k}\big] + \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}}1_{\tau > k} \ | \ \mathcal{F}_{k}\big]. 
\end{align*} Since $$\lVert X_{\tau \wedge k}-x{\rVert _{2}^{2}}$$ is measurable with respect to $$\mathcal{F}_{k}$$, we get   $$\mathbb{E}\big[\lVert X_{\tau \wedge\, k}-x{\rVert_{2}^{2}}1_{\tau \leq\, k} \ | \ \mathcal{F}_{k}\big] = \lVert X_{\tau \wedge\, k}-x{\rVert_{2}^{2}}1_{\tau \leq\, k} = Y_{k} 1_{\tau \leq\, k}.$$Meanwhile, on the event $$\tau> k$$, we have $$X_{k} \in B$$, so we may use (3.1) to obtain   $$\mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}}1_{\tau> k} \ | \ \mathcal{F}_{k}\big] = \mathbb{E}\big[\lVert X_{k+1}-x{\rVert_{2}^{2}} \ | \ \mathcal{F}_{k}\big]1_{\tau > k} \leq (1-\alpha_{\sigma}/n)\lVert X_{k}-x{\rVert_{2}^{2}}1_{\tau > k}.$$Next, notice that   $$\lVert X_{k}-x{\rVert_{2}^{2}}1_{\tau> k} = \lVert X_{\tau \wedge\, k}-x{\rVert_{2}^{2}}1_{\tau > k} = Y_{k} 1_{\tau > k}.$$Combining these calculations gives   $$\mathbb{E}[Y_{k+1} \ | \ \mathcal{F}_{k}] \leq Y_{k} 1_{\tau \leq\, k} + (1-\alpha_{\sigma}/n)Y_{k} 1_{\tau> k} \leq Y_{k}.$$ Now define a second stopping time T to be the earliest time k such that $$\lVert X_{k}-x\rVert _{2} \geq \sin (\pi /8) \cdot \lVert x\rVert _{2}$$. A simple geometric argument tells us that $$T \leq \tau$$, and that T also satisfies   $$T = \inf\big\lbrace k \ | \ Y_{k} \geq \sin^{2}(\pi/8){\lVert x\rVert_{2}^{2}}\big\rbrace.$$As such, we have   $$\mathbb{P}(\tau < \infty ) \leq \mathbb{P}(T < \infty) = \mathbb{P}\Bigg(\sup_{1 \leq\, k <\, \infty} Y_{k} \geq \sin^{2}(\pi/8) {\lVert x\rVert_{2}^{2}}\Bigg).$$Since $$(Y_{k})$$ is a non-negative supermartingale, we may apply the supermartingale maximal inequality to obtain a bound on the right-hand side:   $$\mathbb{P}\Bigg(\sup_{1 \leq\, k <\, \infty} Y_{k} \geq \sin^{2}(\pi/8) {\lVert x\rVert_{2}^{2}}\Bigg) \leq \frac{\mathbb{E} Y_{0}}{\sin^{2}(\pi/8){\lVert x\rVert_{2}^{2}}} \leq (\delta/\sin(\pi/8))^{2}.$$This completes the proof of the theorem. Corollary 3.2 Fix $$\epsilon> 0$$, $$0 < \delta _{1} \leq 1/2$$ and $$0 < \delta _{2} \leq 1$$. 
In the setting of Theorem 3.1, suppose that $$\lVert x_{0}-x\rVert _{2} \leq \sqrt{\delta _{1}}\sin (\pi /8)\lVert x\rVert _{2}$$. Then with probability at least $$1-\delta _{1}-\delta _{2}$$, if $$k \geq (\log (2/\epsilon )+\log (1/\delta _{2}))n/\alpha _{\sigma }$$ then $$\lVert X_{k}-x{\rVert _{2}^{2}} \leq \epsilon \lVert x_{0}-x{\rVert _{2}^{2}}$$. Proof. First observe that   $$\mathbb{P}(\tau < \infty) \leq \left({\frac{\sqrt{\delta_{1}}\sin(\pi/8)}{\sin(\pi/8)}}\right)^{2} = \delta_{1} \leq 1/2.$$ Next, since   \begin{align*} \mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} 1_{\tau = \infty}\big] & = \mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\big]\mathbb{P}(\tau = \infty) + 0 \cdot \mathbb{P}(\tau < \infty) \\ & \geq \frac{1}{2}\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\big], \end{align*} applying Theorem 3.1 gives   $$\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\big] \leq 2(1-\alpha_{\sigma}/n)^{k}\lVert x_{0}-x{\rVert_{2}^{2}}.$$ Applying Markov’s inequality then gives   \begin{align*} \mathbb{P}\big(\lVert X_{k}-x{\rVert_{2}^{2}}> \epsilon \lVert x_{0}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\big) & \leq \frac{\mathbb{E}\left[\lVert X_{k}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\right]}{\epsilon \lVert x_{0}-x{\rVert_{2}^{2}}} \\ & \leq \frac{2(1-\alpha_{\sigma}/n)^{k}}{\epsilon}. \end{align*} Plugging our choice of k into this last bound shows that it is in turn bounded by $$\delta _{2}$$. We therefore have   \begin{align*} \mathbb{P}\big(\lVert X_{k}-x{\rVert_{2}^{2}} \leq \epsilon \lVert x_{0}-x{\rVert_{2}^{2}} \big) & \geq \mathbb{P}\big(\lVert X_{k}-x{\rVert_{2}^{2}} \leq \epsilon \lVert x_{0}-x{\rVert_{2}^{2}} \ | \ \tau = \infty\big) \mathbb{P}(\tau = \infty) \\ & \geq (1-\delta_{2})(1-\delta_{1}) \\ & \geq 1 - \delta_{1} - \delta_{2} \end{align*} as we wanted. 4. 
Local linear convergence for $$\textrm{ACW}(\theta ,\alpha )$$ measures We would like to extend the analysis in the previous section to the setting where we only have access to finitely many uniform measurements, i.e. when we are back in the situation of (1.1). When we sample uniformly from the rows of A, this can be seen as running the generalized randomized Kaczmarz algorithm using the measure $$\mu _{A} = \frac{1}{m}\sum _{i=1}^{m} \delta _{a_{i}}$$ as opposed to $$\mu = \sigma$$. If we retrace our steps, we will see that the key property of the uniform measure $$\sigma$$ that we used was that if $$W \subset S^{n-1}$$ is a wedge of angle $$\theta$$, then we could make $$\lambda _{\max }(\mathbb{E}_{\sigma } aa^{T} 1_{W}(a))$$ arbitrarily small by taking $$\theta$$ small enough (see equation (2.9)). We do not actually need such a strong statement. It suffices for there to be an absolute constant $$\alpha > 0$$ such that   $$\lambda_{\min}(\mathbb{E} aa^{T}-4\mathbb{E} aa^{T}1_{W}(a)) \geq \frac{\alpha}{n}$$ (4.1) holds for $$\theta$$ small enough. Definition 4.1 (Anti-concentration) If a probability measure $$\mu$$ on $$S^{n-1}$$ satisfies (4.1) for all wedges W of angle less than $$\theta$$, we say that it is anti-concentrated on wedges of angle $$\theta$$ at level $$\alpha$$, or for short, that it satisfies the $$\textrm{ACW}(\theta ,\alpha )$$ condition. Abusing notation, we say that a measurement matrix A is $$\textrm{ACW}(\theta ,\alpha )$$ if the uniform measure on its rows is $$\textrm{ACW}(\theta ,\alpha )$$. Plugging this definition into Lemma 2.1, we immediately get the following statement. Lemma 4.2 (Expected decrement for ACW measure) Let $$\mu$$ be a probability measure on the sphere $$S^{n-1}$$ satisfying the $$\textrm{ACW}(\theta ,\alpha )$$ condition for some $$\alpha> 0$$ and some acute angle $$\theta> 0$$. Let $$P = P_{\mu }$$ denote the generalized Kaczmarz projection with respect to $$\mu$$. 
Then for any $$x, z \in \mathbb{R}^{n}$$ such that the angle between them is less than $$\theta$$, we have   $$\mathbb{E}\lVert Pz-x{\rVert_{2}^{2}} \leq (1-\alpha/n) \lVert z-x{\rVert_{2}^{2}}.$$ (4.2) We may now imitate the arguments in the previous section to obtain a guarantee for local linear convergence for the generalized randomized Kaczmarz algorithm using such a measure $$\mu$$. Theorem 4.3 (Linear convergence for ACW measure) Suppose $$\mu$$ is an $$\textrm{ACW}(\theta ,\alpha )$$ measure. Let x be a vector in $$\mathbb{R}^{n}$$, let $$\delta> 0$$, and let $$x_{0}$$ be an initial estimate to x such that $$\lVert x_{0} - x\rVert _{2} \leq \delta \lVert x\rVert _{2}$$. Let $$X_{k}$$ denote the kth step of the generalized randomized Kaczmarz method with respect to the measure $$\mu$$, defined as in (2.3). Let $$\varOmega$$ be the event that for every $$k \in \mathbb{Z}_{+}$$, $$X_{k}$$ makes an angle less than $$\theta$$ with x. Then for every $$k \in \mathbb{Z}_{+}$$,   $$\mathbb{E}\big[\lVert X_{k}-x{\rVert_{2}^{2}} 1_{\varOmega}\big] \leq (1-\alpha/n)^{k}\lVert x_{0}-x{\rVert_{2}^{2}}.$$ (4.3) Furthermore, $$\mathbb{P}(\varOmega ^{c}) \leq (\delta /\sin \theta )^{2}$$. Proof. We repeat the proof of Theorem 3.1. Let $$B_{\mu } \subset \mathbb{R}^{n}$$ be the region comprising all points making an angle less than $$\theta$$ with x. Define stopping times $$\tau _{\mu }$$ and $$T_{\mu }$$ as the earliest times that $$X_{k} \notin B_{\mu }$$ and $$\lVert X_{k} - x\rVert _{2} \geq \sin (\theta )\lVert x\rVert _{2}$$, respectively. Again, $$Y_{k} := \lVert X_{k\wedge \tau _{\mu }} - x{\rVert _{2}^{2}}$$ is a supermartingale, so we may use the supermartingale maximal inequality to bound the probability of $$\varOmega ^{c}$$. On the event $$\varOmega$$, we may iterate the bound given by Lemma 4.2 to obtain (4.3). Corollary 4.4 Fix $$\epsilon> 0$$, $$0 < \delta _{1} \leq 1/2$$ and $$0 < \delta _{2} \leq 1$$. 
In the setting of Theorem 4.3, suppose that $$\lVert x_{0}-x\rVert _{2} \leq \sqrt{\delta _{1}}\sin (\theta )\lVert x\rVert _{2}$$. Then with probability at least $$1-\delta _{1}-\delta _{2}$$, if $$k \geq (\log (2/\epsilon )+\log (1/\delta _{2}))n/\alpha$$ then $$\lVert X_{k}-x{\rVert _{2}^{2}} \leq \epsilon \lVert x_{0}-x{\rVert _{2}^{2}}$$. 5. $$\textrm{ACW}(\theta ,\alpha )$$ condition for finitely many uniform measurements Following the theory in the previous section, we see that to prove linear convergence from finitely many uniform measurements, it suffices to show that the measurement matrix A is $$\textrm{ACW}(\theta ,\alpha )$$ for some $$\theta$$ and $$\alpha$$. For a fixed wedge W, we can easily achieve (4.1) by using a standard matrix concentration theorem. By taking a union bound, we can guarantee that it holds over exponentially many wedges with high probability. However, the function $$W \mapsto \lambda _{\max } (\mathbb{E} aa^{T} 1_{W}(a) )$$ is not Lipschitz with respect to any natural parametrization of wedges in $$S^{n-1}$$, so a naive net argument fails. To get around this, we use VC theory, metric entropy and a chaining theorem from [10]. First, we will use the theory of VC dimension and growth functions to argue that all wedges contain approximately the right fraction of points. This is the content of Lemma 5.1. In order to prove this, a fair number of standard definitions and results are required. These are all provided in Appendix A. Lemma 5.1 (Uniform concentration of empirical measure over wedges) Fix an acute angle $$\theta> 0$$. Let $$\mathcal{W}_{\theta }$$ denote the collection of all wedges of $$S^{n-1}$$ of angle less than $$\theta$$. Suppose A is an m by n matrix with rows $$a_{i}$$ that are independent uniform random vectors on $$S^{n-1}$$, and let $$\mu _{A} = \frac{1}{m}\sum _{i=1}^{m} \delta _{a_{i}}$$. 
Then if $$m \geq (4\pi /\theta )^{2}(2n\log (2em/n)+\log (2/\delta ))$$, with probability at least $$1 - \delta$$, we have   $$\sup_{W \in \mathcal{W}_{\theta}} \mu_{A}(W) \leq 2\theta/\pi.$$ Proof. Let $$\mathcal{W}$$ be the collection of all wedges of any angle, and let $$\mathcal{H}$$ denote the collection of all hemispheres. Using VC theory [21], we have   $$\mathbb{P}\Bigg(\sup_{W \in \mathcal{W}} \lvert\mu_{A}(W) - \sigma(W)\rvert \geq u\Bigg ) \leq 4\varPi_{\mathcal{W}}(2m)\exp(-mu^{2}/16)$$ (5.1) whenever $$m \geq 2/u^{2}$$. By Claim A.3 and the Sauer–Shelah lemma (Lemma A.1) relating VC dimension to growth functions, we have $$\varPi _{\mathcal{H}}(2m) \leq (2em/n)^{n}$$. Next, notice that using the notation in (A.2), we have $$\mathcal{W} = \mathcal{H}\varDelta \mathcal{H}$$. As such, we may apply Claim A.4 to get   $$\varPi_{\mathcal{W}}(2m) \leq (2em/n)^{2n}.$$ We now plug this bound into the right-hand side of (5.1), set $$u = \theta /\pi$$ and simplify to get   $$\mathbb{P}\Bigg(\sup_{W \in \mathcal{W}} \lvert\mu_{A}(W) - \sigma(W)\rvert \geq \theta/\pi \Bigg) \leq 4\exp(2n\log(2em/n)-m(\theta/\pi)^{2}/16).$$ Our assumption implies that $$m \geq 2/(\theta /\pi )^{2}$$ so the bound holds, and also that the bound is less than $$\delta$$. Finally, since $$\mathcal{W}_{\theta } \subset \mathcal{W}$$, on the complement of this event, any $$W \in \mathcal{W}_{\theta }$$ satisfies   $$\mu_{A}(W) \leq \sigma(W) + \theta/\pi \leq 2\theta/\pi$$ as we wanted. For every wedge $$W \in \mathcal{W}_{\theta }$$, we may associate the configuration vector   $$s_{W,\,A} := (1_{W}(a_{1}),1_{W}(a_{2}),\ldots,1_{W}(a_{m})).$$ We can write   $$\lambda_{\max}(\mathbb{E}_{\mu_{A}}aa^{T}1_{W}(a)) = \frac{1}{m}\lambda_{\max}(A^{T}S_{W,\,A}A),$$ (5.2) where $$S_{W,\,A} = \textrm{diag}(s_{W,\,A})$$. $$S_{W,\,A}$$ is thus a selector matrix, and if we condition on the good event given to us by Lemma 5.1, it selects at most a $$2\theta /\pi$$ fraction of the rows of A. 
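The identity (5.2) and the role of the configuration vector can be illustrated with a small numerical sketch. It assumes NumPy; the values of m, n and the wedge angle are arbitrary, and the wedge is specified by two unit normals x and z chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, theta = 2000, 6, 0.3   # illustrative sizes and wedge angle

# Measurement matrix with rows drawn uniformly from the sphere.
G = rng.standard_normal((m, n))
A = G / np.linalg.norm(G, axis=1, keepdims=True)

# A wedge of angle theta, given by two unit normals at angle theta.
x = np.zeros(n)
x[0] = 1.0
z = np.zeros(n)
z[0], z[1] = np.cos(theta), np.sin(theta)

# Configuration vector s_{W,A}: which rows of A fall in the wedge.
s = (np.sign(A @ x) != np.sign(A @ z)).astype(float)

# Left side of (5.2): empirical expectation of a a^T 1_W(a), built row by row.
E_emp = np.zeros((n, n))
for s_i, a_i in zip(s, A):
    E_emp += s_i * np.outer(a_i, a_i)
E_emp /= m
lhs = np.linalg.eigvalsh(E_emp)[-1]

# Right side of (5.2): (1/m) lambda_max(A^T S A); s[:, None] * A equals diag(s) @ A.
rhs = np.linalg.eigvalsh(A.T @ (s[:, None] * A))[-1] / m

fraction = s.mean()   # mu_A(W), the fraction of rows selected
```

Both sides agree up to rounding, and for this sample size `fraction` sits near θ/π, comfortably below the 2θ/π threshold used in Lemma 5.1.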
This means that $$s_{W,\,A} \in \mathcal{S}_{2\theta /\pi }$$, where we define   $$\mathcal{S}_{\tau} := \lbrace d \in \{0,1\}^{m} \ | \ \langle d,1\rangle \leq \tau \cdot m \rbrace.$$ We would like to majorize the quantity in (5.2) uniformly over all wedges W by the quantity $$\frac{1}{4}\lambda _{\min } \left(\mathbb{E}_{\mu _{A}} aa^{T} \right)$$. In order to do this, we define a stochastic process $$(Y_{s,v})$$ indexed by $$s \in \mathcal{S}_{2\theta /\pi }$$ and $$v \in{B^{n}_{2}}$$, setting   $$Y_{s,v} := n v^{T} A^{T}\textrm{diag}(s)Av = \sum_{i=1}^{m} s_{i} \langle\sqrt{n}a_{i},v\rangle^{2}.$$ (5.3) If we condition on the good event in Lemma 5.1, it is clear that   $$\sup_{W \in \mathcal{W}_{\theta}} \frac{1}{m}\lambda_{\max}(A^{T}S_{W,\,A}A) \leq \frac{1}{nm}\sup_{s \in \mathcal{S}_{2\theta/\pi},v \in{B^{n}_{2}}} Y_{s,v},$$ so it suffices to bound the quantity on the right. We will do this using a slightly sophisticated form of chaining, which requires us to make a few definitions. Let (T, d) be a metric space. A sequence $$\mathcal{T} = (T_{k})_{k \in \mathbb{Z}_{+}}$$ of subsets of T is called admissible if $$\lvert T_{0}\rvert = 1$$, and $$\lvert T_{k}\rvert \leq 2^{2^{k}}$$ for all k ≥ 1. For any $$0 < \alpha < \infty$$, we define the $$\gamma _{\alpha }$$ functional of (T, d) to be   $$\gamma_{\alpha}(T,d) := \inf_{\mathcal{T}}\sup_{t \in T} \sum_{k=0}^{\infty} 2^{k/\alpha} d(t,T_{k}).$$ Let $$d_{1}$$ and $$d_{2}$$ be two metrics on T. We say that a process $$(Y_{t})$$ has mixed tail increments with respect to $$(d_{1},d_{2})$$ if there are constants c and C such that for all s, t ∈ T and all u ≥ 0, we have the bound   $$\mathbb{P}(\lvert Y_{s}-Y_{t}\rvert \geq c(\sqrt{u}d_{2}(s,t) + ud_{1}(s,t)) ) \leq Ce^{-u}.$$ (5.4) Remark 5.2 In [10], processes with mixed tail increments are defined as above, but with the further restriction that c = 1 and C = 2. This is not necessary for the result that we need (Lemma 5.3) to hold. 
The indeterminacy of c and C gets absorbed into the final constant in the bound. Lemma 5.3 (Theorem 5, [10]) If $$(Y_{t})_{t \in T}$$ has mixed tail increments, then there is a constant C such that for any u ≥ 1, with probability at least $$1 - e^{-u}$$,   $$\sup_{t \in T}\lvert Y_{t} - Y_{t_{0}}\rvert \leq C(\gamma_{2}(T,d_{2}) + \gamma_{1}(T,d_{1}) + \sqrt{u}\textrm{diam}(T,d_{2}) + u\textrm{diam}(T,d_{1})).$$ At first glance, the $$\gamma _{2}$$ and $$\gamma _{1}$$ quantities seem mysterious and intractable. We will show, however, that they can be bounded by more familiar quantities that are easily computable in our situation. Let us postpone this for the moment, and first show that our process $$(Y_{s,v})$$ has mixed tail increments. Lemma 5.4 ($$(Y_{s,v})$$ has mixed tail increments) Let $$(Y_{s,v})$$ be the process defined in (5.3). Define the metrics $$d_{1}$$ and $$d_{2}$$ on $$\mathcal{S}_{2\theta /\pi } \times{B^{n}_{2}}$$ using the norms $$\lvert \kern -0.25ex\lvert \kern -0.25ex\lvert (w,v)\rvert \kern -0.25ex\rvert \kern -0.25ex\rvert _{1} = \max \lbrace \lVert w\rVert _{\infty },\lVert v\rVert _{2}\rbrace$$ and $$\lvert \kern -0.25ex\lvert \kern -0.25ex\lvert (w,v)\rvert \kern -0.25ex\rvert \kern -0.25ex\rvert _{2} = \max \lbrace \lVert w\rVert _{2},\sqrt{2m\theta /\pi }\lVert v\rVert _{2}\rbrace$$. Then the process has mixed tail increments with respect to $$(d_{1},d_{2})$$. Proof. The main tool that we use is Bernstein’s inequality [23] for sums of subexponential random variables. Observe that each $$\sqrt{n}a_{i}$$ is a sub-Gaussian random vector with bounded sub-Gaussian norm $$\lVert \sqrt{n}a_{i}\rVert _{\psi _{2}} \leq C$$, where C is an absolute constant. As such, for any $$v \in{B^{n}_{2}}$$, $$\langle \sqrt{n}a_{i},v\rangle ^{2}$$ is a subexponential random variable with bounded subexponential norm $$\lVert \langle \sqrt{n}a_{i},v \rangle ^{2} \rVert _{\psi _{1}} \leq C^{2}$$ [23]. 
Now fix v and let $$s, s^\prime \in \mathcal{S}_{2\theta /\pi }$$. Then   $$Y_{s,v} - Y_{s^\prime,v} = \sum_{i=1}^{m} \left(s_{i}- s_{i}^\prime\right) \langle\sqrt{n}a_{i},v\rangle^{2}.$$ Using Bernstein, we have   $$\mathbb{P}(\lvert Y_{s,v} - Y_{s^\prime,v}\rvert \geq u ) \leq 2\exp\big(-c\min\big\lbrace u^{2}/\lVert s-s^\prime{\rVert_{2}^{2}},u/\lVert s-s^\prime\rVert_{\infty}\big\rbrace\big).$$ (5.5) Similarly, if we fix $$s \in \mathcal{S}_{2\theta /\pi }$$ and let $$v, v^\prime \in{B^{n}_{2}}$$, then   \begin{align*} Y_{s,v} - Y_{s,\,v^\prime} & = \sum_{i=1}^{m} s_{i} (\langle\sqrt{n}a_{i},v\rangle^{2} - \langle\sqrt{n}a_{i},v^\prime\rangle^{2} ) \\[-2pt] & = \sum_{i=1}^{m} s_{i} \langle\sqrt{n}a_{i},v-v^\prime\rangle\langle\sqrt{n}a_{i},v+v^\prime\rangle. \end{align*} Since the product of two sub-Gaussian random variables is subexponential, we can bound the subexponential norm of each summand via   \begin{align*} \lVert s_{i}\langle\sqrt{n}a_{i},v-v^\prime\rangle\langle\sqrt{n}a_{i},v+v^\prime\rangle\rVert_{\psi_{1}} & \leq s_{i}\lVert\langle\sqrt{n}a_{i},v-v^\prime\rangle\rVert_{\psi_{2}} \cdot \lVert\langle\sqrt{n}a_{i},v+v^\prime\rangle\rVert_{\psi_{2}} \\[-2pt] & \leq Cs_{i}\lVert v-v^\prime\rVert_{2}. 
\end{align*} As such,   $$\sum_{i=1}^{m} \lVert s_{i}\langle\sqrt{n}a_{i},v-v^{\prime}\rangle\langle\sqrt{n}a_{i},v+v^{\prime}\rangle\rVert_{\psi_{1}}^{2} \leq C \lVert v-v^{\prime}\rVert_{2}^{2}\sum_{i=1}^{m} s_{i}^{2} \leq C(2\theta/\pi)m \lVert v-v^{\prime}\rVert_{2}^{2}.$$ Applying Bernstein as before, we get   $$\mathbb{P}(\lvert Y_{s,v} - Y_{s,v^{\prime}}\rvert \geq u ) \leq 2\exp\big(-c\min\big\lbrace u^{2}/\big((2\theta/\pi)m\lVert v-v^{\prime}\rVert_{2}^{2}\big),u/\lVert v-v^{\prime}\rVert_{2}\big\rbrace\big).$$ (5.6) Now, recall the simple observation that for any numbers $$a, b \in \mathbb{R}$$, we have   $$\max\lbrace\lvert a\rvert,\lvert b\rvert\rbrace \leq \lvert a\rvert + \lvert b\rvert \leq 2\max\lbrace\lvert a\rvert,\lvert b\rvert\rbrace.$$ As such, for any u > 0, given $$s,s^\prime \in \mathcal{S}_{2\theta /\pi }$$, $$v,v^\prime \in{B^{n}_{2}}$$, we have   \begin{align*} &\sqrt{u}\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2} + u\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1} \nonumber\\[-2pt] &\quad \geq \frac{1}{2}\left(\sqrt{u}\lVert s-s^\prime\rVert_{2} + \sqrt{u}\sqrt{2m\theta/\pi}\lVert v-v^\prime\rVert_{2} + u\lVert s-s^\prime\rVert_{\infty} + u\lVert v-v^\prime\rVert_{2}\right) \\[-2pt] & \quad\geq \frac{1}{2}\max\left\{\sqrt{u}\lVert s-s^\prime\rVert_{2} + u\lVert s-s^\prime\rVert_{\infty}, \sqrt{u}\sqrt{2m\theta/\pi}\lVert v-v^\prime\rVert_{2} + u\lVert v-v^\prime\rVert_{2} \right\}. 
\end{align*} Since   $$\lvert Y_{s,v} - Y_{s^\prime,v^\prime}\rvert \leq \lvert Y_{s,v} - Y_{s^\prime,v}\rvert + \lvert Y_{s^\prime,v} - Y_{s^\prime,v^\prime}\rvert,$$ we have that if   $$\lvert Y_{s,v} - Y_{s^\prime,v^\prime}\rvert \geq c\left({\sqrt{u}\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2} + u\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1}}\right)\!,$$ then either   $$\lvert Y_{s,v} - Y_{s^\prime,v}\rvert \geq \frac{c}{4}\left({\sqrt{u}\lVert s-s^\prime\rVert_{2} + u\lVert s-s^\prime\rVert_{\infty}}\right)$$ or   $$\lvert Y_{s^\prime,v} - Y_{s^\prime,v^\prime}\rvert \geq \frac{c}{4}\left({\sqrt{u}\sqrt{2m\theta/\pi}\lVert v-v^{\prime}\rVert_{2} + u\lVert v-v^{\prime}\rVert_{2} }\right).$$ We can then combine the bounds (5.5) and (5.6) to get   $$\mathbb{P}\left(\lvert Y_{s,v} - Y_{s^\prime,v^\prime}\rvert \geq c\left(\sqrt{u}\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2} + u\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert(s,v) - (s^\prime,v^\prime)\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1}\right)\right) \leq 4e^{-u}.$$ Hence, the process $$(Y_{s,v})$$ satisfies the definition (5.4) for having mixed tail increments. We next bound the $$\gamma _{1}$$ and $$\gamma _{2}$$ functionals for $$\mathcal{S}_{2\theta /\pi } \times{B^{n}_{2}}$$. Lemma 5.5 We may bound the $$\gamma _{1}$$ functional of $$\mathcal{S}_{2\theta /\pi } \times{B^{n}_{2}}$$ by   $$\gamma_{1}\big(\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1}\big) \leq C((2\theta/\pi)\log(\pi/2\theta)m + n).$$ Proof. The proof of the bound uses metric entropy and a version of Dudley’s inequality. Let (T, d) be a metric space, and for any u > 0, let N(T, d, u) denote the covering number of T at scale u, i.e. 
the smallest number of radius u balls needed to cover T. Dudley’s inequality (see [20]) states that there is an absolute constant C for which   $$\gamma_{1}(T,d) \leq C \int_{0}^{\infty} \log N(T,d,u)\, \mathrm{d}u.$$ (5.7) Recall that $$\mathcal{S}_{2\theta /\pi }$$ is the set of all {0, 1} vectors with at most $$2m\theta /\pi$$ ones. For convenience, let us assume that $$2m\theta /\pi$$ is an integer. We then have the inclusion   $$\mathcal{S}_{2\theta/\pi} \subset \bigcup_{I \in \mathcal{I}} [0,1]^{I},$$ where $$\mathcal{I}$$ is the collection of all subsets of [m] of size $$2m\theta /\pi$$, and $$[0,1]^{I}$$ denotes the unit cube in the coordinate set I. We may then also write   $$\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}} \subset \bigcup_{I \in \mathcal{I}} \big([0,1]^{I} \times{B^{n}_{2}}\big).$$ Note that a union of covers for each $$[0,1]^{I} \times{B^{n}_{2}}$$ gives a cover for $$\mathcal{S}_{2\theta /\pi } \times{B^{n}_{2}}$$. This, together with the symmetry of $$\lVert \cdot \rVert _{\infty }$$ with respect to permutation of the coordinates, gives   $$N\left(\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1},u\right) \leq \lvert\mathcal{I}\rvert \cdot N\big([0,1]^{I} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1},u\big)$$ for some fixed index set I. We next generalize the notion of covering numbers slightly. Given two sets T and K, we let N(T, K) denote the number of translates of K needed to cover the set T. It is easy to see that we have $$N(T,d,u) = N(T,uB_{d})$$, where $$B_{d}$$ is the unit ball with respect to the metric d. 
Since the unit ball for $$\lvert \kern -0.25ex\lvert \kern -0.25ex\lvert \cdot \rvert \kern -0.25ex\rvert \kern -0.25ex\rvert _{1}$$ is $$B_{\infty }^{m} \times{B_{2}^{n}}$$, we therefore have   \begin{align*} N\big([0,1]^{I} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1},u\big) & = N\big([0,1]^{I} \times{B^{n}_{2}}, u\big(B_{\infty}^{m} \times{B_{2}^{n}}\big)\big) \\ & \leq N\big(B^{(2\theta/\pi)m}_{\infty} \times{B^{n}_{2}}, u\big(B^{(2\theta/\pi)m}_{\infty}\times{B_{2}^{n}}\big) \big). \end{align*} Such a quantity can be bounded using a volumetric argument. Generally, for any centrally symmetric convex body K in $$\mathbb{R}^{n}$$, we have (see Corollary 4.1.15 in [1])   $$N(K,uK) \leq (3/u)^{n}.$$ (5.8) This implies that   $$\log N([0,1]^{I} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1},u) \leq \log(3/u)((2\theta/\pi)m+n).$$ Finally, observe that   $$\log\lvert\mathcal{I}\rvert = \log{{m}\choose{(2\theta/\pi)m}} \leq (2\theta/\pi)m\log(e\pi/2\theta).$$ We can thus plug these last two bounds into (5.7), noting that the integrand is zero for u ≥ 1, to get   \begin{align*} \gamma_{1}\big(\mathcal{S}_{2\theta/\pi}\times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1}\big) & \leq C {\int_{0}^{1}} \big[(2\theta/\pi)m\log(e\pi/2\theta) + \log(3/u)((2\theta/\pi)m+n)\big] \,\mathrm{d}u \\ & \leq C((2\theta/\pi)\log(\pi/2\theta)m + n) \end{align*} as was to be shown. Lemma 5.6 We may bound the $$\gamma _{2}$$ functional of $$\mathcal{S}_{2\theta /\pi }\times{B^{n}_{2}}$$ by   $$\gamma_{2}\left(\mathcal{S}_{2\theta/\pi}\times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2}\right) \leq C\sqrt{2\theta/\pi}\left(m + \sqrt{mn}\right)\!.$$ Proof. Since $$\alpha = 2$$, we may appeal directly to the theory of Gaussian complexity [22]. 
However, since we have already introduced some of the theory of metric entropy in the previous lemma, we might as well continue down this path. In this case, we have the Dudley bound   $$\gamma_{2}(T,d) \leq C \int_{0}^{\infty} \sqrt{\log N(T,d,u)}\, \mathrm{d}u$$ (5.9)for any metric space (T, d). Observe that the unit ball for $$\lvert \kern -0.25ex\lvert \kern -0.25ex\lvert \cdot \rvert \kern -0.25ex\rvert \kern -0.25ex\rvert _{2}$$ is $${B^{m}_{2}} \times (2m\theta /\pi )^{-1/2}{B^{n}_{2}}$$. On the other hand, we conveniently have   $$\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}} \subset \sqrt{2m\theta/\pi}{B^{m}_{2}} \times{B^{n}_{2}}.$$As such, we have   \begin{align*} N\big(\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2},u\big) & \leq N\big(\sqrt{2m\theta/\pi}{B^{m}_{2}} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2},u\big) \\ & = N\big(\sqrt{2m\theta/\pi}{B^{m}_{2}} \times{B^{n}_{2}},u \big({B^{m}_{2}} \times (2m\theta/\pi)^{-1/2}{B^{n}_{2}}\big)\big) \\ & = N(T,(2m\theta/\pi)^{-1/2}uT), \end{align*}where $$T = \sqrt{2m\theta /\pi }{B^{m}_{2}} \times{B^{n}_{2}}$$. Plugging this into (5.9) and subsequently using the volumetric bound (5.8), we get   \begin{align*} \gamma_{2}\left(\mathcal{S}_{2\theta/\pi}\times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2}\right) & \leq C \int_{0}^{\infty} \sqrt{\log N(T,(2m\theta/\pi)^{-1/2}uT)} \,\mathrm{d}u \\ & = C\sqrt{2m\theta/\pi}\int_{0}^{\infty} \sqrt{\log N(T,uT)} \,\mathrm{d}u \\ & \leq C \sqrt{2m\theta/\pi}\sqrt{m+n}, \end{align*}which is clearly equivalent to the bound that we want. At this stage, we can put everything together to bound the supremum of our stochastic process. Theorem 5.7 (Bound on supremum of $$(Y_{s,v})$$) Let $$(Y_{s,v})$$ be the process defined in (5.3). 
Let $$0 < \delta < 1/e$$, let $$\theta$$ be an acute angle, and suppose $$m \geq \max \lbrace n,\log (1/\delta )\pi /2\theta \rbrace$$. Then with probability at least $$1-\delta$$, the supremum of the process satisfies   $$\sup_{s \in \mathcal{S}_{2\theta/\pi},v \in{B^{n}_{2}}} Y_{s,v} \leq C\sqrt{2\theta/\pi}\cdot m.$$ (5.10) Proof. It is easy to see that we have   $$\textrm{diam}\left(\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{1}\right) = 2,$$and   $$\textrm{diam}\left(\mathcal{S}_{2\theta/\pi} \times{B^{n}_{2}},\lvert\kern-0.25ex\lvert\kern-0.25ex\lvert\cdot\rvert\kern-0.25ex\rvert\kern-0.25ex\rvert_{2}\right) = 2\sqrt{2m\theta/\pi}.$$Also, observe that we have $$Y_{s,0} = 0$$ for any $$s \in \mathcal{S}_{2\theta /\pi }$$. Using these, together with the previous two lemmas bounding the $$\gamma _{1}$$ and $$\gamma _{2}$$ functionals, we may apply Lemma 5.3 to see that   $$\sup_{s \in \mathcal{S}_{2\theta/\pi},v \in{B^{n}_{2}}} Y_{s,v} \leq C\left({(2\theta/\pi)\log(\pi/2\theta)m + n} + \sqrt{2\theta/\pi}({m + \sqrt{mn} }) + u + \sqrt{u} \sqrt{2m\theta/\pi}\right),$$with probability at least $$1-e^{-u}$$. Using our assumptions on m, we may simplify this bound to obtain (5.10). Finally, we show that $$\frac{1}{m}\sum _{i=1}^{m} a_{i} {a_{i}^{T}}$$ is well-behaved. Lemma 5.8 Let $$\delta>0$$. Then if $$m \geq C(n+\sqrt{\log (1/\delta )})$$, with probability at least $$1-\delta$$, we have   $$\left\lVert{\frac{n}{m}\sum_{i=1}^{m}a_{i}{a_{i}^{T}} - I_{n}}\right\rVert \leq 0.1.$$ Proof. Note, as before, that the $$\sqrt{n}a_{i}$$s are isotropic sub-Gaussian random variables with sub-Gaussian norm bounded by an absolute constant. The claim then follows immediately from Theorem 5.39 in [23], which itself is proved using Bernstein and a simple net argument. 
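As a purely numerical illustration of Lemma 5.8 (it plays no role in the proof), the following Python sketch draws m points uniformly from $$S^{n-1}$$ and evaluates the operator norm of $$\frac{n}{m}\sum_{i} a_{i}a_{i}^{T} - I_{n}$$. The function names `sample_sphere` and `isotropy_deviation` are hypothetical helpers introduced here, and NumPy is assumed to be available.

```python
import numpy as np


def sample_sphere(m, n, rng):
    """Draw m points uniformly from the unit sphere S^{n-1} (rows of the output)."""
    g = rng.standard_normal((m, n))
    # Normalizing a standard Gaussian vector gives a uniform point on the sphere.
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def isotropy_deviation(a):
    """Operator norm of (n/m) * sum_i a_i a_i^T - I_n for the rows a_i of `a`."""
    m, n = a.shape
    scaled_gram = (n / m) * (a.T @ a)
    return np.linalg.norm(scaled_gram - np.eye(n), ord=2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 20
    for m in (2 * n, 20 * n, 200 * n):
        dev = isotropy_deviation(sample_sphere(m, n, rng))
        print(f"m = {m:5d}  deviation = {dev:.3f}")
```

One would expect the printed deviation to shrink as m grows relative to n, in line with the lemma; the exact values depend on the random seed.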
Theorem 5.9 (Finite measurement sets satisfy ACW condition) There is some $$\theta _{0}> 0$$ and an absolute constant C such that for all angles $$0 < \theta \leq \theta _{0}$$, for all dimensions n, and any $$\delta> 0$$, if m satisfies   $$m \geq C(\pi/2\theta)^{2}(n\log(m/n) + \log(1/\delta)),$$ (5.11)then with probability at least $$1-\delta$$, the measurement set A comprising m independent random vectors drawn uniformly from $$S^{n-1}$$ satisfies the $$\textrm{ACW}(\theta ,\alpha )$$ condition with $$\alpha = 1/2$$. Proof. Fix $$n, \delta> 0$$. Choose $$\theta _{0}$$ such that the constant C in the statement in Theorem 5.7 satisfies $$C\sqrt{2\theta _{0}/\pi } \leq 0.1$$. Fix $$0 < \theta \leq \theta _{0}$$, and let $$\varOmega _{1}$$, $$\varOmega _{2}$$ and $$\varOmega _{3}$$ denote the good events in Lemma 5.1, Theorem 5.7 and Lemma 5.8 with this choice of $$\theta$$. Whenever m satisfies our assumption (5.11), the intersection of these events occurs with probability at least $$1-3\delta$$ by the union bound. Let us condition on being in the intersection of these events. For any wedge $$W \in \mathcal{W}_{\theta }$$ (i.e of angle less than $$\theta$$), Lemma 5.1 tells us that its associated selector vector satisfies $$s_{W,\,A} \in \mathcal{S}_{2\theta /\pi }$$ (i.e. that it has at most $$2m\theta /\pi$$ ones). By Theorem 5.7 and our assumption on $$\theta _{0}$$, we then have   $$\lambda_{\max}\left({\frac{1}{m}\sum_{i=1}^{m} a_{i}{a_{i}^{T}}1_{W}(a_{i})}\right) \leq \frac{1}{nm}\sup_{s \in \mathcal{S}_{2\theta/\pi}, v \in{B^{n}_{2}}} Y_{s,v} \leq \frac{0.1}{n}.$$ On the other hand, Lemma 5.8 guarantees that   $$\lambda_{\min}\left({\frac{1}{m}\sum_{i=1}^{m} a_{i}{a_{i}^{T}}}\right) \geq \frac{0.9}{n}.$$ Combining these, we get   $$\lambda_{\min}\left({\frac{1}{m}\sum_{i=1}^{m} a_{i}{a_{i}^{T}} - \frac{4}{m}\sum_{i=1}^{m} a_{i}{a_{i}^{T}}1_{W}(a_{i})}\right) \geq \frac{1}{2n},$$which was to be shown. 6. 
Proof and discussion of Theorem 1.2 We restate the theorem here for convenience. Theorem 6.1 Fix $$\epsilon> 0$$, $$0 < \delta _{1} \leq 1/2$$ and $$0 < \delta ,\delta _{2} \leq 1$$. There are absolute constants C, c > 0 such that if   $$m \geq C(n\log(m/n) + \log(1/\delta)),$$ then with probability at least $$1-\delta$$, m sampling vectors selected uniformly and independently from the unit sphere $$S^{n-1}$$ form a set such that the following holds: let $$x \in \mathbb{R}^{n}$$ be a signal vector and let $$x_{0}$$ be an initial estimate satisfying $$\lVert x_{0}-x\rVert _{2} \leq c\sqrt{\delta _{1}}\lVert x\rVert _{2}$$. Then, if   $$K \geq 2(\log(1/\epsilon) + \log(2/\delta_{2}))n,$$ the Kth step randomized Kaczmarz estimate $$x_{K}$$ satisfies $$\lVert x_{K}-x{\rVert _{2}^{2}} \leq \epsilon \lVert x_{0} - x{\rVert _{2}^{2}}$$ with probability at least $$1-\delta _{1}-\delta _{2}$$. Proof. Let A be our m by n measurement matrix. By Theorem 5.9, there is an angle $$\theta _{0}$$, and a constant C such that for $$m \geq C(n\log (m/n) + \log (1/\delta ))$$, A is $$\textrm{ACW}(\theta _{0},1/2)$$ with probability at least $$1-\delta$$. We can then use Corollary 4.4 to guarantee that with probability at least $$1-\delta _{1}-\delta _{2}$$, running the randomized Kaczmarz update K times gives an estimate $$x_{K}$$ satisfying   $$\lVert x_{K}-x{\rVert_{2}^{2}} \leq \epsilon\lVert x_{0}-x{\rVert_{2}^{2}}.$$ This completes the proof of the theorem. Inspecting the statement of the theorem, we see that we can make the failure probability $$\delta$$ as small as desired by making m large enough. Likewise, we can do the same with $$\delta _{2}$$ by adjusting K. Proposition B.1 shows that we can also make $$\delta _{1}$$ smaller by increasing m. However, while the dependence of m and K on $$\delta$$ and $$\delta _{2}$$, respectively, is logarithmic, the dependence of m on $$\delta _{1}$$ is polynomial (we need $$m \gtrsim 1/{\delta _{1}^{2}}$$). 
This is rather unsatisfactory, but can be overcome by a simple ensemble method. We encapsulate this idea in the following algorithm.

Algorithm 1 Ensemble Randomized Kaczmarz
Require: Measurements $$b_{1},\ldots ,b_{m}$$, sampling vectors $$a_{1},\ldots ,a_{m}$$, relative error tolerance $$\epsilon$$, iteration count K, trial count L.
Ensure: An estimate $$\hat{x}$$ for x.
1: Obtain an initial estimate $$x_{0}$$ using the truncated spectral initialization method (see Appendix B).
2: for l = 1, …, L, run K randomized Kaczmarz update steps starting from $$x_{0}$$ to obtain an estimate $$x_{K}^{(l)}$$.
3: for l = 1, …, L, do
4: if $$\lvert B(x_{K}^{(l)},2\sqrt{\epsilon }) \cap \lbrace x_{K}^{(1)},\ldots ,x_{K}^{(L)}\rbrace \rvert \geq L/2$$
5: return $$\hat{x} := x_{K}^{(l)}$$.

Proposition 6.2 (Guarantee for ensemble method) Given the assumptions of Theorem 6.1, further assume that $$\delta _{1} + \delta _{2} \leq 1/3$$. For any $$\delta ^\prime> 0$$, there is an absolute constant C such that if $$L \geq C\log (1/\delta ^\prime )$$, then the estimate $$\hat{x}$$ given by Algorithm 1 satisfies $$\lVert \hat{x}-x{\rVert _{2}^{2}} \leq 9\epsilon \lVert x_{0}-x{\rVert _{2}^{2}}$$ with probability at least $$1-\delta ^\prime$$. Proof. For 1 ≤ l ≤ L, let $$\chi _{l}$$ be the indicator variable for $$\lVert x_{K}^{(l)}-x{ \rVert _{2}^{2}} \leq \epsilon \lVert x_{0}-x{ \rVert _{2}^{2}}$$. Then $$\chi _{1},\ldots ,\chi _{L}$$ are i.i.d. Bernoulli random variables each with success probability at least 2/3. Let I be the set of indices l for which $$\chi _{l} = 1$$. Using a Chernoff bound [22], we see that with probability at least $$1-e^{-cL}$$, $$\lvert I\rvert \geq L/2$$. Now let I′ be the set of indices l for which $$\lvert B (x_{K}^{(l)},2\sqrt{\epsilon }) \cap \lbrace x_{K}^{(1)},\ldots ,x_{K}^{(L)} \rbrace \rvert \geq L/2$$. 
Observe that for all $$l,l^\prime \in I$$, we have   $$\big\lVert x_{K}^{(l)}- x_{K}^{(l^\prime)}\big\rVert_{2} \leq \big\lVert x_{K}^{(l)}- x\big\rVert_{2} + \big\lVert x- x_{K}^{(l^\prime)}\big\rVert_{2} \leq 2\sqrt{\epsilon}.$$This implies that $$I \subset I^\prime$$, so $$I^\prime \neq \emptyset$$. Furthermore, for all $$l^\prime \in I^\prime$$, there is l ∈ I for which $$\lVert x_{K}^{(l)}-x_{K}^{(l^\prime )} \rVert _{2} \leq 2\sqrt{\epsilon }$$. As such, we have   $$\big\lVert x_{K}^{(l^\prime)}-x\big\rVert_{2} \leq \big\lVert x_{K}^{(l^\prime)}- x_{K}^{(l)}\big\rVert_{2} + \big\lVert x_{K}^{(l)}-x\big\rVert_{2} \leq 3\sqrt{\epsilon}.$$Now, observe that the estimate $$\hat{x}$$ returned by Algorithm 1 is precisely some $$x_{K}^{(l^\prime )}$$ for which $$l^\prime \in I^\prime$$. This shows that on the good event, we indeed have $$\big \lVert \hat{x}-x{\big \rVert _{2}^{2}} \leq 9\epsilon \lVert x_{0}-x{\rVert _{2}^{2}}$$. By our assumption on L, we see that the failure probability is bounded by $$\delta ^\prime$$. In practice however, the ensemble method is not required. Numerical experiments show that the randomized Kaczmarz method always eventually converges from any initial estimate. 7. Extensions 7.1. Arbitrary initialization In order to obtain a convergence guarantee, we used a truncated spectral initialization to obtain an initial estimate before running randomized Kaczmarz updates. Since the number of steps that we require is only linear in the dimension, and each step requires only linear time, the iteration phase of the algorithm only requires $$O (n^{2} )$$ time, and furthermore does not need to see all the data in order to start running. The spectral initialization on the other hand requires one to see all the data. Forming the matrix from which we obtain the estimate involves adding m rank 1 matrices, and hence naively requires $$O (mn^{2} )$$ time. 
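For concreteness, Algorithm 1 and the linear per-step cost mentioned above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: it assumes unit-norm sampling vectors and real measurements $$b_{i} = \lvert \langle a_{i},x\rangle \rvert$$, it writes the randomized Kaczmarz step in the form $$x \leftarrow x + (\mathrm{sign}(\langle a_{r},x\rangle )b_{r} - \langle a_{r},x\rangle )a_{r}$$, and the names `kaczmarz_phase` and `ensemble_kaczmarz` are hypothetical. The initial estimate $$x_{0}$$ is taken as an input.

```python
import numpy as np


def kaczmarz_phase(a, b, x0, num_steps, rng):
    """Run randomized Kaczmarz steps for b_i = |<a_i, x>| with unit-norm rows a_i."""
    x = np.array(x0, dtype=float)
    m = a.shape[0]
    # Each step touches a single row, so it costs O(n) time.
    for r in rng.integers(0, m, size=num_steps):
        inner = a[r] @ x
        sign = 1.0 if inner >= 0 else -1.0
        # Project onto the hyperplane <a_r, z> = sign * b_r.
        x += (sign * b[r] - inner) * a[r]
    return x


def ensemble_kaczmarz(a, b, x0, eps, num_steps, num_trials, rng):
    """Sketch of Algorithm 1: return a run that is close to at least half of all runs."""
    runs = [kaczmarz_phase(a, b, x0, num_steps, rng) for _ in range(num_trials)]
    radius = 2.0 * np.sqrt(eps)
    for cand in runs:
        close = sum(np.linalg.norm(cand - other) <= radius for other in runs)
        if close >= num_trials / 2:
            return cand
    return None  # no run passed the majority test


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 10, 200
    g = rng.standard_normal((m, n))
    a = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    b = np.abs(a @ x)
    p = rng.standard_normal(n)
    x0 = x + 0.1 * p / np.linalg.norm(p)  # stand-in for a good initial estimate
    xhat = ensemble_kaczmarz(a, b, x0, eps=1e-4, num_steps=3000, num_trials=5, rng=rng)
    print("returned an estimate:", xhat is not None)
```

Note that the sketch never forms an n by n matrix during the iteration phase; only the computation of $$x_{0}$$ by spectral initialization, which is left outside the sketch, requires a pass over all the data.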
There is hence an incentive to do away with this step altogether, and ask whether the randomized Kaczmarz algorithm works well even if we start from an arbitrary initialization. We have some numerical evidence that this is indeed true, at least for real Gaussian measurements. Unfortunately, we do not have any theoretical justification for this phenomenon, and it will be interesting to see if any results can be obtained in this direction. 7.2. Complex Gaussian measurements We have proved our main results for measurement systems comprising random vectors drawn independently and uniformly from the sphere, or equivalently, for real Gaussian measurements. These are not the measurement sets that are used in practical applications, which often deal with imaging and hence make use of complex measurements. While most theoretical guarantees for phase retrieval algorithms are in terms of real Gaussian measurements, some also hold for complex Gaussian measurements, even with identical proofs. This is the case for PhaseMax [5] and for Wirtinger flow [4]. We believe that a similar situation should hold for the randomized Kaczmarz method, but are not yet able to recalibrate our tools to handle the complex setting. It is easy to adapt the randomized Kaczmarz update formula (1.3) itself: we simply replace the sign of $$\langle a_{r(k)},x_{k-1}\rangle$$ with its phase $$\left(\textrm{i.e.} \frac{\langle a_{r(k)},x_{k-1}\rangle }{\lvert \langle a_{r(k)},x_{k-1}\rangle \rvert } \right)$$. Numerical experiments also show that convergence does occur for complex Gaussian measurements (and even CDP measurements) [25]. Nonetheless, in trying to adapt the proof to this situation, we meet an obstacle at the first step: when computing the error term, we can no longer simply sum up the influence of ‘bad measurements’ as we did in Lemma 2.1. 
Instead, every term contributes an error that scales with the phase difference   $$\frac{\langle a_{i},z\rangle}{\lvert\langle a_{i},z\rangle\rvert} - \frac{\langle a_{i},x\rangle}{\lvert\langle a_{i},x\rangle\rvert}.$$ Since the argument of Jeong and Güntürk also heavily relies on the decomposition of the measurement set into ‘good’ and ‘bad’ measurements, their method likewise does not easily generalize to cover the complex setting. We leave it to future work to prove convergence in this setting, whether by adapting our methods, or by proposing completely new ones. 7.3. Deterministic constructions of measurement sets The theory that we have developed in this paper does not apply solely to Gaussian measurements, and generalizes to any measurement sets that satisfy the ACW condition that we introduced in Section 5. It will be interesting to investigate what natural classes of measurement sets satisfy this condition. Acknowledgements Y.T. would like to thank Halyun Jeong for insightful discussions on this topic. We would also like to thank the anonymous reviewers for their many helpful comments. Funding Juha Heinonen Memorial Graduate Fellowship at the University of Michigan to [Y.T.]. National Science Foundation Grant DMS [1265782] and U.S. Air Force Grant [FA9550-18-1-0031] to R.V. Footnotes 1  This is essentially equivalent to being real Gaussian because of the concentration of norm phenomenon in high dimensions. Also, one may normalize vectors easily. 2 Recall that a wedge of angle $$\theta$$ is the region of the sphere between two hemispheres with normal vectors making an angle of $$\theta$$. Appendix A. Growth functions and VC dimension In this section, we define growth functions and VC dimension. We also state some standard results on these topics that we require for our proofs in Section 5. We refer the interested reader to [17] for a more in-depth exposition on these topics. 
Let $$\mathcal{X}$$ be a set and $$\mathcal{C}$$ be a family of subsets of $$\mathcal{X}$$. For a given set $$C \in \mathcal{C}$$, we slightly abuse notation and identify it with its indicator function $$1_{C} \colon \mathcal{X} \to \lbrace 0,1\rbrace$$. The growth function $$\varPi _{\mathcal{C}}\colon \mathbb{N} \to \mathbb{R}$$ of $$\mathcal{C}$$ is defined via   $$\varPi_{\mathcal{C}}(m) := \max_{x_{1},\ldots,\,x_{m} \in \mathcal{X}} \lvert\left\{{(C(x_{1}),C(x_{2}),\ldots,C(x_{m})) \colon C \in \mathcal{C}}\right\}\rvert.$$ Meanwhile, the VC dimension of $$\mathcal{C}$$ is defined to be the largest integer m for which $$\varPi _{\mathcal{C}}(m) = 2^{m}$$. These two concepts are fundamental to statistical learning theory. The key connection between them is given by the Sauer–Shelah lemma. Lemma A.1 (Sauer–Shelah, Corollary 3.3 in [17]) Let $$\mathcal{C}$$ be a collection of subsets of VC dimension d. Then for all m ≥ d, we have   $$\varPi_{\mathcal{C}}(m) \leq \left({\frac{em}{d}}\right)^{d}\!.$$ We are interested in the growth function of a family of subsets $$\mathcal{C}$$ because it yields the following guarantee for the uniform convergence of the empirical measures of sets belonging to $$\mathcal{C}$$. Proposition A.2 (Uniform deviation, Theorem 2 in [21]) Let $$\mathcal{C}$$ be a family of subsets of a set $$\mathcal{X}$$. Let $$\mu$$ be a probability measure on $$\mathcal{X}$$, and let $$\hat{\mu }_{m} := \frac{1}{m} \sum _{i=1}^{m} \delta _{X_{i}}$$ be the empirical measure obtained from m independent copies of a random variable X with distribution $$\mu$$. For every u such that $$m \geq 2/u^{2}$$, the following deviation inequality holds:   $$\mathbb{P}\!\left(\sup_{C \in \mathcal{C}} \lvert\hat{\mu}_{m}(C) - \mu(C)\rvert \geq u \right) \leq 4\varPi_{\mathcal{C}}(2m)\exp(-mu^{2}/16).$$ (A.1) We now state and prove two simple claims. Claim A.3 Let $$\mathcal{C}$$ be the collection of all hemispheres in $$S^{n-1}$$. 
Then the VC dimension of $$\mathcal{C}$$ is bounded from above by $$n+1$$. Proof. It is a standard fact from statistical learning theory [17] that the VC dimension of half-spaces in $$\mathbb{R}^{n}$$ is $$n+1$$. Since $$S^{n-1}$$ is a subset of $$\mathbb{R}^{n}$$, the claim follows by the definition of VC dimension. Claim A.4 Let $$\mathcal{C}$$ and $$\mathcal{D}$$ be two collections of functions from a set $$\mathcal{X}$$ to {0, 1}. Using $$\varDelta$$ to denote symmetric difference, we define   $$\mathcal{C}\varDelta\mathcal{D} := \lbrace C\varDelta D \ | \ C \in \mathcal{C}, D \in \mathcal{D}\rbrace.$$ (A.2)Then the growth function $$\varPi _{\mathcal{C}\varDelta \mathcal{D}}$$ of $$\mathcal{C}\varDelta \mathcal{D}$$ satisfies $$\varPi _{\mathcal{C}\varDelta \mathcal{D}}(m) \leq \varPi _{\mathcal{C}}(m)\cdot \varPi _{\mathcal{D}}(m)$$ for all $$m \in \mathbb{Z}_{+}$$. Proof. Fix m, and points $$x_{1},\ldots ,x_{m} \in \mathcal{X}$$. Then every possible configuration $$(\,f(x_{1}),f(x_{2}),\ldots ,f(x_{m}))$$ arising from some $$f \in \mathcal{C}\varDelta \mathcal{D}$$ is the point-wise symmetric difference   $$(\,f(x_{1}),f(x_{2}),\ldots,f(x_{m})) = (C(x_{1}),C(x_{2}),\ldots,C(x_{m}))\varDelta (D(x_{1}),D(x_{2}),\ldots,D(x_{m}))$$of configurations arising from some $$C \in \mathcal{C}$$ and $$D \in \mathcal{D}$$. By the definition of growth functions, there are at most $$\varPi _{\mathcal{C}}(m)\cdot \varPi _{\mathcal{D}}(m)$$ pairs of these configurations, from which the bound follows. Remark A.5 There is an extensive literature on how to bound the VC dimension of concept classes that arise from finite intersections or unions of those from a known collection of concept classes, each of which has bounded VC dimension. We won’t require this much sophistication here, and refer the reader to [3] for more details. Appendix B. 
Initialization Several different schemes have been proposed for obtaining initial estimates for PhaseMax and gradient descent methods for phase retrieval. Surprisingly, these are all spectral in nature: the initial estimate $$x_{0}$$ is obtained as the leading eigenvector to a matrix that is constructed out of the sampling vectors $$a_{1},\ldots ,a_{m}$$ and their associated measurements $$b_{1},\ldots ,b_{m}$$ [4,6,24,26]. There seems to be empirical evidence, at least for Gaussian measurements, that the best performing method is the orthogonality-promoting method of [24]. Nonetheless, for any given relative error tolerance, all the methods seem to require sample complexity of the same order. Hence, we focus on the truncated spectral method of [6] for expositional clarity, and refer the reader to the respective papers on the other methods for more details. The truncated spectral method initializes $$x_{0} := \lambda _{0} \tilde{x}_{0}$$, where $$\lambda _{0} = \sqrt{\frac{1}{m}\sum _{i=1}^{m}{b_{i}^{2}}}$$, and $$\tilde{x}_{0}$$ is the leading eigenvector of   $$Y = \frac{1}{m}\sum_{i=1}^{m}{b_{i}^{2}}a_{i}{a_{i}^{T}}1(b_{i} \leq 3\lambda_{0}).$$Note that when constructing Y, we sum up only those sampling vectors whose corresponding measurements satisfy $$b_{i} \leq 3\lambda _{0}$$. The point of this is to remove the influence of unduly large measurements, and allow for good concentration estimates, as we shall soon demonstrate. Suppose from now on that the $$a_{i}$$s are independent standard Gaussian vectors. In [6], the authors prove that with probability at least $$1-\exp (-\varOmega (m))$$, we have $$\lVert \tilde{x}_{0} - x\rVert _{2} \leq \epsilon \lVert x\rVert _{2}$$ for any fixed relative error tolerance $$\epsilon$$ (see their Proposition 3). They do not, however, examine the dependence of the probability bound on $$\epsilon$$. Nonetheless, by examining the proof more carefully, we can make this dependence explicit. 
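The construction of $$x_{0}$$ described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows of `a` are standard Gaussian sampling vectors, `b` holds the amplitudes $$b_{i}$$, the function name `truncated_spectral_init` is hypothetical, and a dense eigendecomposition is used for simplicity where a power method would be the cheaper choice in practice. Since an eigenvector is only determined up to sign, the output should be compared with both x and −x.

```python
import numpy as np


def truncated_spectral_init(a, b):
    """Truncated spectral estimate x0 = lambda0 * (leading eigenvector of Y)."""
    m = a.shape[0]
    lam0 = np.sqrt(np.mean(b ** 2))
    # Keep only the measurements that are not unduly large.
    keep = b <= 3.0 * lam0
    a_kept = a[keep]
    weights = b[keep] ** 2
    # Y = (1/m) * sum_i b_i^2 a_i a_i^T over the kept indices.
    y = (a_kept * weights[:, None]).T @ a_kept / m
    # eigh returns eigenvalues in ascending order, so the last column is the leading one.
    _, eigvecs = np.linalg.eigh(y)
    return lam0 * eigvecs[:, -1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 10, 20000
    a = rng.standard_normal((m, n))
    x = rng.standard_normal(n)
    b = np.abs(a @ x)
    x0 = truncated_spectral_init(a, b)
    err = min(np.linalg.norm(x0 - x), np.linalg.norm(x0 + x)) / np.linalg.norm(x)
    print(f"relative error up to sign: {err:.3f}")
```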
Making the dependence on $$\epsilon$$ explicit, we obtain the following proposition. Proposition B.1 (Relative error guarantee for initialization) Let $$a_{1},\ldots ,a_{m}$$, $$b_{1},\ldots ,b_{m}$$, $$Y$$ and $$x_{0}$$ be defined as in the preceding discussion. Fix $$\epsilon> 0$$ and $$0 < \delta < 1$$. Then with probability at least $$1-\delta$$, we have $$\lVert x_{0}-x\rVert _{2} \leq \epsilon \lVert x\rVert _{2}$$ so long as $$m \geq C(\log (1/\delta )+n)/\epsilon ^{2}$$. Proof. We simply make the following observations while following the proof of Proposition 3 in [6]. First, since all quantities are 2-homogeneous in $$\lVert x\rVert _{2}$$, we may assume without loss of generality that $$\lVert x\rVert _{2} = 1$$. Next, there is some absolute constant c such that if we define $$Y_{1}$$ and $$Y_{2}$$ by choosing $$\gamma _{1} = 3+ c\epsilon$$, $$\gamma _{2} = 3 - c\epsilon$$, we have the bound $$\lVert \mathbb{E} Y_{1} - \mathbb{E} Y_{2}\rVert \leq C\epsilon$$. Note also that the deviation estimates $$\lVert Y_{1}-\mathbb{E} Y_{1}\rVert$$, $$\lVert Y_{2}-\mathbb{E} Y_{2}\rVert$$ are bounded by $$C\epsilon$$ given our assumptions on m. This implies that with high probability,   $$\lVert Y-\beta_{1} xx^{T} - \beta_{2} I_{n}\rVert \leq C\epsilon.$$ Adjust our constants so that C in the last equation is bounded by $$\beta _{1} - \beta _{2}$$. We may then apply the Davis–Kahan theorem [8] to get   $$\lVert\tilde{x}_{0} - x\rVert_{2} \leq \frac{\lVert Y-\beta_{1} xx^{T} - \beta_{2} I_{n}\rVert}{\beta_{1} - \beta_{2}} \leq \epsilon$$ as we wanted. By examining the proof carefully, the astute reader will observe that the crucial properties that we used were the rotational invariance of the $$a_{i}$$s (to compute the formulas for $$\mathbb{E} Y_{1}$$ and $$\mathbb{E} Y_{2}$$) and their sub-Gaussian tails (to derive the deviation estimates). These properties also hold for sampling vectors that are uniformly distributed on the sphere. 
As such, a more lengthy and tedious calculation can be done to show that the guarantee also holds for such sampling vectors. If the reader has any residual doubt, perhaps this can be assuaged by noting that a uniform sampling vector and its associated measurement $$(a_{i},b_{i})$$ can be turned into an honest real Gaussian vector and its measurement by multiplying both quantities by an independent $$\chi$$ random variable with n degrees of freedom.

References
1. Artstein-Avidan, S., Giannopoulos, A. & Milman, V. D. (2015) Asymptotic Geometric Analysis, Part I. Providence, RI, USA: American Mathematical Society.
2. Bahmani, S. & Romberg, J. (2016) Phase retrieval meets statistical learning theory: a flexible convex relaxation. pp. 1–17. Preprint, available as arXiv:1610.04210.
3. Blumer, A., Ehrenfeucht, A., Haussler, D. & Warmuth, M. K. (1989) Learnability and the Vapnik–Chervonenkis dimension. J. ACM, 36, 929–965.
4. Candes, E. J., Li, X. & Soltanolkotabi, M. (2015) Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Info. Theory, 61, 1985–2007.
5. Candes, E. J., Strohmer, T. & Voroninski, V. (2013) PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Commun. Pure Appl. Math., 66, 1241–1274.
6. Chen, Y. & Candes, E. J. (2015) Solving random quadratic systems of equations is nearly as easy as solving linear systems. Adv. Neural Inf. Process. Syst., 2, 739–747.
7. Chi, Y. & Lu, Y. M. (2016) Kaczmarz method for solving quadratic equations. IEEE Signal Process. Lett., 23, 1183–1187.
8. Davis, C. & Kahan, W. M. (1970) The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7, 1–46.
9. Davis, D., Drusvyatskiy, D. & Paquette, C. (2017) The nonsmooth landscape of phase retrieval. Preprint, available as arXiv:1711.03247.
10. Dirksen, S. (2015) Tail bounds via generic chaining. Electron. J. Probab., 20, 1–29.
11. Duchi, J. C. & Ruan, F. (2017) Solving (most) of a set of quadratic equalities: composite optimization for robust phase retrieval. pp. 1–49. Preprint, available as arXiv:1705.02356.
12. Eldar, Y. C. & Mendelson, S. (2014) Phase retrieval: stability and recovery guarantees. Appl. Comput. Harmon. Anal., 36, 473–494.
13. Fienup, J. R. (1982) Phase retrieval algorithms: a comparison. Appl. Opt., 21, 2758.
14. Goldstein, T. & Studer, C. (2016) PhaseMax: Convex Phase Retrieval via Basis Pursuit. pp. 1–28. Preprint, available as arXiv:1610.07531.
15. Hand, P. & Voroninski, V. (2016) An elementary proof of convex phase retrieval in the natural parameter space via the linear program PhaseMax. pp. 1–8. Preprint, available as arXiv:1611.03935.
16. Jeong, H. & Güntürk, C. S. (2017) Convergence of the randomized Kaczmarz method for phase retrieval. pp. 1–13. Preprint, available as arXiv:1706.10291.
17. Mohri, M., Rostamizadeh, A. & Talwalkar, A. (2012) Foundations of Machine Learning. Cambridge, MA, USA: MIT Press.
18. Strohmer, T. & Vershynin, R. (2009) A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15, 262–278.
19. Sun, J., Qu, Q. & Wright, J. (2017) A geometric analysis of phase retrieval. Foundations of Computational Mathematics, pp. 1–68. DOI: https://doi.org/10.1007/s10208-017-9365-9.
20. Talagrand, M. (2005) The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer Monographs in Mathematics. Heidelberg, Berlin: Springer.
21. Vapnik, V. N. & Chervonenkis, A. Y. (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16, 264–280.
22. Vershynin, R. High-Dimensional Probability. Cambridge, UK: Cambridge University Press.
23. Vershynin, R. (2011) Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing (Y. C. Eldar & G. Kutyniok eds). Cambridge: Cambridge University Press, pp. 210–268.
24. Wang, G., Giannakis, G. B. & Eldar, Y. C. (2016) Solving systems of random quadratic equations via truncated amplitude flow. IEEE Trans. Signal Process., 65, 1961–1974.
25. Wei, K. (2015) Solving systems of phaseless equations via Kaczmarz methods: a proof of concept study. Inverse Probl., 31, 12125008.
26. Zhang, H., Zhou, Y., Liang, Y. & Chi, Y. (2016) Reshaped Wirtinger flow for solving quadratic system of equations. NIPS Proc., 2622–2630. Preprint, available as arXiv:1605.07719.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices). For permissions, please e-mail: journals.permissions@oup.com

### Journal

Information and Inference: A Journal of the IMA, Oxford University Press

Published: Apr 3, 2018
