Abstract We develop procedures, based on minimization of the composition |$f(x) = h(c(x))$| of a convex function |$h$| and a smooth function |$c$|, for solving random collections of quadratic equalities, applying our methodology to phase retrieval problems. We show that the prox-linear algorithm we develop can solve phase retrieval problems—even with adversarially faulty measurements—with high probability as soon as the number of measurements |$m$| is a constant factor larger than the dimension |$n$| of the signal to be recovered. The algorithm requires essentially no tuning—it consists of solving a sequence of convex problems—and it is implementable without any particular assumptions on the measurements taken. We provide substantial experiments investigating our methods, indicating the practical effectiveness of the procedures and showing that they succeed with high probability as soon as |$m / n \ge 2$| when the signal is real-valued. 1. Introduction We wish to solve the following problem: we have a set of |$m$| vectors |$a_i \in{\mathbb{C}}^n$| and non-negative scalars |$b_i \in{\mathbb{R}}_+$|, |$i = 1, \ldots , m$|, and wish to find a vector |$x \in{\mathbb{C}}^n$| such that \begin{equation} b_i = |\langle a_i, x\rangle|^2 \ \textrm{for}\ \textrm{most}\, i \in \{1, \ldots, m\}. \end{equation} (1) As stated, this is a combinatorial problem that is, in the worst case, NP-hard [20]. Yet it naturally arises in a number of real-world situations, including phase retrieval [21,22,24], in which one receives measurements of the form \begin{equation*} b_i = | \langle a_i, x_\star \rangle |^2 \end{equation*} for known measurement vectors |$a_i \in{\mathbb{C}}^n$|, while |$x_\star \in{\mathbb{C}}^n$| is unknown. The problem in phase retrieval arises due to limitations of optical sensors, where one illuminates an object |$x_\star$|, which yields the diffraction pattern |$A x_\star$|, but sensors may measure only the amplitudes |$b = |A x_\star |^2$|, where |$|\cdot |^2$| denotes the element-wise squared magnitude [38]. In the case in which some measurements may be corrupted, the problem is even more challenging. An alternative objective for the problem (1) is an exact penalty formulation [26], which replaces the equality constraint |$b_i = | \langle a_i, x \rangle |^2$| with a non-differentiable cost measuring the error |$b_i - | \langle a_i, x \rangle |^2$|, yielding the formulation \begin{align} \mathop{\textrm{minimize}}_x ~ f(x) := \frac{1}{m} \sum_{i = 1}^m \big|| \langle a_i, x \rangle |^2 - b_i\big| = \frac{1}{m} \big\|{|Ax|^2 - b}\big\|_1. \end{align} (2) This objective is a natural replacement of the equality-constrained problem (1), and the |$\ell _1$|-loss handles gross errors in the measurements |$b_i$| in a relatively benign way (as is well known in the statistics and optimization literature on |$\ell _1$|-based losses and median estimators). Moreover, in the case when |$b_i = | \langle a_i, x_\star \rangle |^2$| for all |$i$|, it is clear that, taking |$\iota = \sqrt{-1}$| to be the imaginary unit, the set |$\{\textrm{e}^{\iota \theta } x_\star \mid \theta \in \left[{0},{2\pi }\right)\}$| globally minimizes |$f(x)$|, though it is only possible to recover |$x_\star$| up to its phase (or sign flip in the real case). Candès et al.
[9] and Eldar and Mendelson [19], as well as results we discuss later in the paper, show (roughly) that |$f(x)$| stably identifies |$x_\star$|, in that |$f(x)$| grows very quickly as |$\mathop{\textrm{dist}}(x, X_\star ) = \inf \{\left \|{x-x_\star }\right \|_2 \mid x_\star \in X_\star \}$|, where |$X_\star$| denotes the set of global minimizers of |$f$|, grows. The objective is, unfortunately, non-smooth, non-convex—not even locally convex near |$x_\star$|, as is clear in the special case when |$x\in{\mathbb{R}}$| and |$f(x) = |x^2 - 1|$|, so that a local analysis based on convexity is impossible—and, at least a priori, |$f(x)$| seems difficult to minimize. Nonetheless, the objective (2) enjoys a number of structural properties that, as we explore below, make solving problem (1) tractable as long as the measurement vectors |$a_i$| are sufficiently random. In particular, we can write |$f$| as the composition |$f(x) = h(c(x))$| of a convex function |$h$| and smooth function |$c$|, a structure known in the optimization literature to be amenable to efficient algorithms [7,16,23]. This compositional structure lends itself nicely to the prox-linear algorithm, a variant of the Gauss–Newton procedure, which we describe briefly here for the real case. The composite optimization problem, which Fletcher & Watson [23] originally develop (see also [6,37]) and a number of researchers [7,8,16,18] have studied further, is \begin{equation} \mathop{\textrm{minimize}}_x~ f(x) := h(c(x))\ \textrm{subject}\ \textrm{to}\ x \in X, \end{equation} (3) where the function |$h : {\mathbb{R}}^m \to{\mathbb{R}}$| is convex, |$c : {\mathbb{R}}^n \to{\mathbb{R}}^m$| is smooth and |$X$| is a convex set. Extended to the complex case, this general form encompasses our objective (2), where we take |$h(z) = \frac{1}{m} \left \|{z}\right \|_1$| and |$c(x) = [| \langle a_i, x \rangle |^2 - b_i]_{i = 1}^m$|. Using the common idea of most optimization schemes—trust region, gradient descent, Newton’s method—to build a simpler-to-optimize local model of the objective and repeatedly minimize this model, we can replace |$h(c(x))$| in problem (3) by linearizing only |$c$|. This immediately gives a convex surrogate and leads to the prox-linear algorithm developed by Burke & Ferris [7,8], among others [16–18]. Fixing |$x \in{\mathbb{R}}^n$|, for any |$y \in{\mathbb{R}}^n$| we define the local ‘linearization’ of |$f$| at |$x$| by \begin{equation} f_x(y) := h\big(c(x) + \nabla c(x)^T (y - x)\big), \end{equation} (4) where |$\nabla c(x) \in{\mathbb{R}}^{n \times m}$| denotes the Jacobian transpose of |$c$| at |$x$|. This function is evidently convex in |$y$|, and the prox-linear algorithm proceeds iteratively, generating |$x_1, x_2, \ldots$| by minimizing the regularized models \begin{equation} x_{k + 1} = \mathop{{\textrm{argmin}}}_{x \in X} \left\{f_{x_k}(x) + \frac{1}{2 \alpha_k} \big\|{x - x_k}\big\|_2^2 \right\}, \end{equation} (5) where |$\alpha_k > 0$| is a stepsize. If |$h$| is |$L$|-Lipschitz and |$\nabla c$| is |$\beta$|-Lipschitz, then choosing any stepsize |$\alpha \le \frac{1}{\beta L}$| guarantees that the method (5) is a descent method and finds approximate stationary points of the problem (3) [16,17]. The case in which |$c : {\mathbb{C}}^n \to{\mathbb{R}}^m$| requires a bit more elaboration based on the Wirtinger calculus, which we address later. We briefly summarize our main contribution as follows.
Broadly, this work provides a general method for robust non-convex modeling, focusing carefully on the problem (2); our work continues a line of research identifying statistical scenarios that nominally yield non-convex optimization problems, yet admit computationally efficient estimation and optimization procedures [1,15,28,31]. In this direction, our work develops analytic and statistical tools to analyze a collection of optimization and modeling approaches beyond gradient-based (and Riemannian gradient-based) procedures, using both non-smooth and non-convex models while leveraging statistical structure. More precisely, we show how to apply the prox-linear method (5) to any measurement matrix |$A$| with no tuning parameters, except that the stepsize satisfies |$\alpha \le \big(\frac{1}{m} \left |\!\left |\!\left |{A^{\textrm{H}} A} \right |\!\right |\!\right |_{\textrm{op}}\big)^{-1}$|. Each iteration requires solving a convex quadratic program (QP) in |$n$| variables, which is efficiently solvable using standard convex programming approaches. We show that—with extremely high probability under appropriate random measurement models—our prox-linear method exhibits local quadratic convergence to the signal as soon as the ratio |$m/n$| of measurements to dimension is greater than some numerical constant, meaning we must solve only |$\log _2 \log _2 \frac{1}{\epsilon }$| such convex problems to find an estimate |$\widehat{x}$| of |$x_\star$| such that |$\mathop{\textrm{dist}}(\widehat{x}, X_\star ) \le \epsilon$|. In practice, this means solving roughly five convex quadratic programs. Our procedure applies both in the noiseless setting and when a (constant but random) fraction of the measurements are even adversarially corrupted. 1.1 Related work and approaches to phase retrieval Our work should be viewed in the context of the recent and successful collection of work on phase retrieval. A natural strategy for problem (1), when we wish to find |$x$| satisfying |$| \langle a_i, x \rangle |^2 = b_i$| for all |$i \in [m]$|, is to lift the problem into a semidefinite program (SDP) by setting |$X = xx^{\textrm{H}}$|, relaxing the rank 1 constraint, and solving \begin{equation*} \mathop{\textrm{minimize}}_X ~ \mathop{\textrm{tr}}(X)\ \textrm{subject}\ \textrm{to}\ X \succeq 0, ~ \mathop{\textrm{tr}}(X a_i a_i^{\textrm{H}} ) = b_i. \end{equation*} This is the route that a number of convex approaches to phase retrieval take [9–11,14,45]. The resulting SDP is computationally challenging for large |$n$|, as it requires storing and manipulating an |$n \times n$| matrix variable. Moreover, computation times to achieve |$\epsilon$|-accurate solutions to this problem generally scale on the order of |$n^3 / \mathsf{poly}(\epsilon )$|, where |$\mathsf{poly}(\epsilon )$| denotes a polynomial in |$\epsilon$|. These difficulties have led a number of researchers to consider non-convex approaches to the phase retrieval problem that—as we do—maintain only a vector |$x \in{\mathbb{C}}^n$|, rather than forming a full matrix |$X \in{\mathbb{C}}^{n \times n}$|. We necessarily give only a partial overview, focusing on recent work on provably convergent schemes. Early work in computational approaches to phase retrieval is based on (non-convex) alternating projection approaches, notably those by Gerchberg & Saxton [24] and Fienup [22]. Motivated by the challenges of convex approaches and the success of alternating minimization [22,24], Netrapalli et al.
[33] develop an algorithm (AltMinPhase) that alternates between minimizing |$\left \|{Ax - C b}\right \|_2$| in |$x$| and in |$C$| over diagonal matrices of phases (signs) with modulus 1. Their algorithm is elegant, but the analysis requires resampling a new measurement matrix |$A$| and measurements |$b$| in each iteration. More recently, Candès et al. [12] develop Wirtinger flow, a gradient-based method that performs a careful modification of gradient descent on the objective \begin{align*} F(x) := \frac{1}{2m} \sum_{i = 1}^m \big(\big|\langle a_i, x\rangle\big|^2 - b_i\big)^2, \end{align*} where |$x \in{\mathbb{C}}^n$| may be complex. Wang et al. [46] build on this work by attacking a modification of this objective, showing how to perform a generalized descent method on \begin{equation*} F(x) := \frac{1}{2m} \sum_{i = 1}^m \left(\big|\langle a_i, x \rangle\big| - \sqrt{b_i}\right)^2, \end{equation*} and providing arguments for the convergence of their method. Wang et al. achieve striking empirical results when the measurements and signals are real-valued, achieving better than 50% perfect signal recovery when the measurement ratio |$m/n = 2$|⁠, which is essentially at the threshold for injectivity of the real-valued measurements |$b = (Ax_\star )^2$|⁠. Zhang et al. [47] also study a variant of Wirtinger flow based on median estimates that handles some outliers. Unfortunately, these procedures rely fairly strongly on Gaussianity assumptions, and their gradient descent approaches require subsampling schemes (to select ‘good’ terms in the sum); these procedures have parameters chosen carefully to reflect Gaussianity in the measurement matrices |$A$|⁠, and it is not always clear how to extend them to non-Gaussian measurements. 1.2 Our contributions and outline In this paper, we focus on prox-linear methods, the iterations (5) for the non-smooth, non-convex problem (2). In addition to being (to us at least) aesthetically pleasing, as we minimize the natural objective (2), our approach yields a number of theoretical and practical benefits. In the literature on signal recovery from phaseless measurements, stability of the reconstruction of a signal is of paramount importance. To solve the phase retrieval problem at all, one requires injectivity of the measurements |$b = |Ax|^2$|⁠, which for real |$A \in{\mathbb{R}}^{m \times n}$| in general position necessitates |$m \ge 2n - 1$| and for complex |$A \in{\mathbb{C}}^{m \times n}$| in general position necessitates |$m \ge 4n - 2$| (cf. [2]). Stability makes this injectivity more robust. Consider the real-valued case first. Eldar & Mendelson [19] say that a measurement matrix |$A \in{\mathbb{R}}^{m \times n}$| is |$\lambda \ge 0$| stable if \begin{equation} \big\|{(Ax)^2 - (Ay)^2}\big\|_1 \ge \lambda \left\|{x - y}\right\|_2 \left\|{x + y}\right\|_2 \ \mbox{for all } x, y \in{\mathbb{R}}^n. \end{equation} (6a) Such conditions, which hold with high probability for suitable designs |$A$|⁠, are also common in semidefinite relaxation approaches to phase retrieval; cf. Candès et al. [9, Lemma 3.2]. (See also the paper [4].) This condition means that distant signals |$x, x^{\prime} \in{\mathbb{R}}^n$| cannot be confused in the measurement domain |$\{(Ay)^2 \mid y \in{\mathbb{R}}^n\} \subset{\mathbb{R}}^m_+$| because |$A$| does a good job of separating them; the more stable a measurement matrix, the ‘easier’ the recovery problem should be. 
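To make the condition concrete, the following is a minimal numerical sketch (our own, not from the paper's experiments) that estimates the stability ratio in (6a) for an i.i.d. Gaussian measurement matrix over randomly drawn pairs of signals; the left-hand side is normalized by |$1/m$| as in the objective (2), the problem sizes are hypothetical, and random pairs only give an optimistic estimate, since the condition must hold uniformly over all pairs.

import numpy as np

# Empirical illustration of the stability condition (6a) for Gaussian measurements:
# estimate the smallest value of (1/m) ||(Ax)^2 - (Ay)^2||_1 / (||x - y||_2 ||x + y||_2)
# over randomly drawn pairs (x, y).  Illustrative only: the theory requires the bound
# to hold uniformly over all pairs, which random sampling does not certify.
rng = np.random.default_rng(0)
n, m, trials = 50, 200, 2000  # hypothetical problem sizes

A = rng.standard_normal((m, n))
ratios = []
for _ in range(trials):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    lhs = np.abs((A @ x) ** 2 - (A @ y) ** 2).sum() / m
    rhs = np.linalg.norm(x - y) * np.linalg.norm(x + y)
    ratios.append(lhs / rhs)

print(f"smallest empirical stability ratio over {trials} random pairs: {min(ratios):.3f}")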
In the case in which |$x$| is complex, the stability condition (6a) becomes \begin{equation} \big\|{|Ax|^2 - |Ay|^2}\big\|_1 \ge \lambda \inf_\theta \|{x - \textrm{e}^{\iota \theta} y}\|_2 \cdot \sup_\theta \|{x - \textrm{e}^{\iota \theta} y}\|_2 \ \mbox{for all } x, y \in{\mathbb{C}}^n, \end{equation} (6b) a slightly stronger condition. We provide stability guarantees for both situations for general classes of random matrices by adapting Mendelson’s ‘small ball’ techniques [32] (Section 3.1). Most literature on non-convex approaches to phase retrieval requires such a stability condition—and usually more because of the quadratic objectives often used—to guarantee signal recovery. In contrast, our procedure requires essentially only the stability condition (6), a mild bound on the operator norm |$\frac{1}{m} \left |\!\left |\!\left |{A} \right |\!\right |\!\right |_{\textrm{op}}^2$|⁠, and an initialization within some constant factor of |$\left \|{x_\star }\right \|_2$| to guarantee both fast convergence and exact signal recovery. With this in mind, in Section 2 we develop purely optimization-based deterministic results, which build off of classical results on composite optimization, that rely on the stability condition (6). By identifying the conditions required for fast convergence and recovery, we can then spend the remainder of the paper showing how various measurement models guarantee sufficient conditions for our convergence results. In particular, in Section 3, we show how a number of sensing matrices |$A$| suffice to guarantee convergence and signal recovery in the noiseless setting, that is, when |$b = |A x_\star |^2$|⁠. In Section 4, we extend these results to the case when a constant fraction of the measurements |$b_i$| may be arbitrarily corrupted, showing that stability and a somewhat stronger condition on |$\left |\!\left |\!\left |{A} \right |\!\right |\!\right |_{\textrm{op}}$| are still sufficient to guarantee signal recovery; again, these results hold for our basic algorithm with no tuning parameters. In the final sections of the paper, we provide a substantial empirical evaluation of our proposed algorithms. While our method in principle requires no tuning—it solves a sequence of explicit convex problems—there is some art in developing efficient methods for the solution of the sequence of convex optimization problems we solve. In Section 5, we describe these implementation details, and in Section 6 we provide experimental evidence of the success of our proposed approach. In reasonably high-dimensional settings (⁠|$n \ge 1000$|⁠), with real-valued random Gaussian measurements our method achieves perfect signal recovery in about 80–90% of cases, even when |$m/n = 2$|⁠. The method also handles outlying measurements well, substantially improving state-of-the-art performance, and we give applications with measurement matrices that demonstrably fail all of our conditions, but for which the method is still straightforward to implement and empirically successful. Notation We collect our common notation here. We let |$\left \|{\cdot }\right \|$| and |$\left \|{\cdot }\right \|_2$| denote the usual vector |$\ell _2$|-norm, and for a matrix |$A \in{\mathbb{C}}^{m \times n}$|⁠, |$\left |\!\left |\!\left |{A} \right |\!\right |\!\right |_{\textrm{op}}$| denotes its |$\ell _2$|-operator norm. The notation |$A^{\textrm{H}}$| means the Hermitian conjugate (conjugate transpose) of |$A \in{\mathbb{C}}^{m \times n}$|⁠. 
For |$a \in{\mathbb{C}}$|⁠, |${\textrm{Re}}(a)$| denotes its real part and |${\textrm{Im}}(a)$| its imaginary part. We take |$ \langle \cdot , \cdot \rangle $| to be the standard inner product on whatever space it applies; for |$u, v \in{\mathbb{C}}^n$|⁠, this is |$ \langle u, v \rangle = u^{\textrm{H}} v$|⁠, while for |$A, B \in{\mathbb{C}}^{m \times n}$|⁠, this is |$ \langle A, B \rangle = \mathop{\textrm{tr}}(A^{\textrm{H}} B)$|⁠. Let |${\mathsf{quant}}_\alpha (\{c_i\})$| denote the |$\alpha $|-quantile of a vector |$c \in{\mathbb{R}}^m$|⁠, that is, if |$c_{(1)} \le c_{(2)} \le \cdots \le c_{(m)}$|⁠, the |$\alpha $|th quantile linearly interpolates |$c_{(\left \lfloor{m \alpha }\right \rfloor )}$| and |$c_{(\left \lceil{m \alpha }\right \rceil )}$|⁠. For a random variable |$X$|⁠, |${\mathsf{quant}}_\alpha (X)$| denotes its |$\alpha $|th quantile. 2. Composite optimization, algorithm and convergence analysis We begin our development by providing convergence guarantees—under appropriate conditions—for the prox-linear algorithm (the iteration (5)) applied to the composite optimization problem (3), which we recall is to \begin{align*} \mathop{\textrm{minimize}}_x~ f(x) := h(c(x))\ \textrm{subject}\ \textrm{to}\ x \in X, \end{align*} where |$h : {\mathbb{R}}^m \to{\mathbb{R}}$| is convex and |$c : {\mathbb{C}}^n \to{\mathbb{R}}^m$| is appropriately smooth. In our context, as |$c$| is a real-valued complex function, it cannot have an ordinary complex derivative, so some care is required; we use the Wirtinger Calculus, also known as the |${\mathbb{C}}{\mathbb{R}}$|-calculus, referring the interested reader to the survey of Kreutz-Delgado [29] for more (see especially [29, Eq. (31)]). In brief, however, because |$c$| is real-valued, we let |$\nabla c(x)$| denote the Hermitian conjugate of the Jacobian of |$c$|⁠. This allows us to treat |$c$| as a mapping from |${\mathbb{R}}^{2n}$| to |${\mathbb{R}}$|⁠, so that |$\nabla c(x) \in{\mathbb{C}}^{n \times m}$| satisfies \begin{equation*} c(y) = c(x) + {\textrm{Re}}\big(\nabla c(x)^{\textrm{H}} (y - x)\big) + O(\|{y - x}\|^2) \end{equation*} as |$y \to x$|⁠. We summarize the procedure in Algorithm 1 for further reference. In our application to quadratic constraints and phase retrieval, |$h(z) = \frac{1}{m} \left \|{z}\right \|_1$| and |$c(x) = |Ax|^2 - b$|⁠, so that the iteration (5) is the solution of a quadratic problem. A number of researchers have studied convergence and stopping conditions for Algorithm 1, showing that it converges to stationary points [7], as well as demonstrating that the stopping condition |$\left \|{x_k - x_{k-1}}\right \|_2 \le \alpha \epsilon $| holds after |$O(\epsilon ^{-2})$| iterations and guarantees approximate stationarity [16,17]. Algorithm 1 and the iteration (5) sometimes enjoy fast (local) convergence rates as well. To describe this phenomenon, define the distance function |$\mathop{\textrm{dist}}(x, S) := \inf _y \{\left \|{x-y}\right \|_2 \mid y\in S\}$|⁠. We say that |$h$| has weak sharp minima if it grows linearly away from its minima, meaning |$h(z) \ge \inf _z h(z) + \lambda \mathop{\textrm{dist}}(z, {\textrm{argmin}}\ h)$| for some |$\lambda> 0$|⁠. 
Under this condition (with an additional transversality condition between |$c$| and |$\textrm{argmin}\ h$|⁠), Burke & Ferris [8] show that convergence of the prox-linear algorithm near points in |$X_\star := \{x : c(x) \in \mathop{{\textrm{argmin}}} h\}$| is quadratic, because the model (4) of |$f$| is quadratically good, but |$h(c(x))$| grows linearly away from |$X_\star $|⁠. These assumptions can be weakened to growth of |$h \circ c$| along its minimizing set [17, Thm. 7.2]. We build from this elegant development—though our problems do not precisely satisfy the weak sharp minima conditions because of outliers—to show how the prox-linear algorithm provides an effective, easy-to-implement and elegant approach to problems involving solution of quadratic equalities, specifically focusing on phase retrieval. 2.1 Quadratic convergence and the prox-linear method for phase retrieval We turn now to an analysis of the prox-linear algorithm for phase retrieval problems, providing conditions on the function |$f$| sufficient for quadratic convergence. We introduce two conditions on the function |$f(x)$| and its linearized form |$f_x(y)$| that suffice for this desideratum; as we show in the sequel, these conditions hold with extremely high probability for a number of random measurement models. These conditions are the keystones of our analysis of the (robust) phase retrieval problem. As motivation for our first condition, consider the phase retrieval problem. If the measurement matrix |$A$| satisfies conditions (6) and the measurements |$b_i$| are noiseless with |$b = |Ax_\star |^2$|⁠, then for |$f(x) = \frac{1}{m} \|{|Ax|^2 - b}\|_1$| we have |$f(x) - f(x_\star ) \ge \lambda \mathop{\textrm{dist}}(x, X_\star )\left \|{x_\star }\right \|_2$| for the set of signals |$X_\star = \{ \textrm{e}^{\iota \theta } x_\star \mid \theta \in{\mathbb{R}}\}$|⁠. When the measurements have noise or outliers, this may still hold, prompting us to define the following. Condition C1 There exists |$\lambda> 0$| such that for all |$x \in{\mathbb{R}}^n$| (or |$x\in{\mathbb{C}}^n$| in the complex case) \begin{equation*} f(x) - f(x_\star) \geq \lambda \mathop{\mathrm{dist}}(0, X_\star) \mathop{\mathrm{dist}}(x, X_\star), \end{equation*} where |$X_\star $| denotes the set of global minima of |$f$|⁠. This condition is a close cousin of Burke & Ferris’s sharp minima condition [8], though it does not require that |$c(x_\star ) \in \mathop{\textrm{argmin}}_z h(z)$|⁠; based on their work, it is intuitive that it should prove useful in establishing fast convergence of the prox-linear algorithm. The second condition, which is essentially automatically satisfied for the linear approximation (4), is a requirement that the linearized function |$f_x(y)$| is quadratically close to |$f(y)$|⁠. Condition C2 There exists |$L < \infty $| such that for all |$x, y\in{\mathbb{R}}^n$| (or |$x, y \in{\mathbb{C}}^n$| in the complex case) \begin{equation*} |\,f(y) - f_x(y)| \leq \frac{L}{2} \left\|{x-y}\right\|_2^2. \end{equation*} Locally, Condition C2 holds for any composition |$f(x) = h(c(x))$| of a convex |$h$| with smooth |$c$|⁠, but the phase retrieval objective (2) satisfies the bound globally. 
Indeed, for |$a, x, y \in{\mathbb{C}}^n$| we have \begin{equation*} | \langle a, y \rangle |^2 = | \langle a, x \rangle |^2 + 2 {\textrm{Re}}( \langle x, a \rangle \langle a, y - x \rangle ) + | \langle a, y - x \rangle |^2, \end{equation*} so the linearization (4) of |$f$| around |$x \in{\mathbb{C}}^n$| is \begin{equation} f_x(y) = \frac{1}{m} \sum_{i=1}^m \big|| \langle a_i, x \rangle |^2 - b_i + 2 {\textrm{Re}} ( \langle x, a_i \rangle \langle a_i, y-x \rangle )\big|. \end{equation} (7) Letting |$A = [a_1 ~ \cdots ~ a_m]^{\textrm{H}}$| denote the measurement matrix, then using the preceding expansion of |$| \langle a, y \rangle |^2$|⁠, we have immediately by the triangle inequality that \begin{align*} -(x - y)^{\textrm{H}} \left(\frac{1}{m} A^{\textrm{H}} A\right) (x - y) + f_x(y) \le \frac{1}{m} \sum_{i = 1}^m \, &|| \langle a_i, y \rangle |^2 - b_i| = f(y) \\ & \le f_x(y) + (x - y)^{\textrm{H}} \left(\frac{1}{m} A^{\textrm{H}} A\right) (x - y). \end{align*} That is, Condition C2 holds with |$L = 2 |\!|\!|{\frac{1}{m} A^{\textrm{H}} A}|\!|\!|_{\textrm{op}}$|⁠: \begin{equation} \left|f(y) - f_x(y)\right| \le \frac{1}{m}\big|\!\big|\!\big|{A^{\textrm{H}} A} \big|\!\big|\!\big|_{\textrm{op}} \big\|{x - y}\big\|_2^2. \end{equation} (8) Given Conditions C1 and C2, we turn to convergence guarantees for the prox-linear Algorithm 1, which in our case requires solving a sequence of convex quadratic programs. An implementation of Algorithm 1 that solves iteration (5) exactly may be computationally challenging. Thus, we allow inaccuracy in the solutions, assuming there exists a sequence of additive accuracy parameters |$\epsilon _k \ge 0$| such that the iterates |$x_k$| satisfy \begin{equation} f_{x_k}(x_{k + 1}) + \frac{L}{2} \left\|{x_{k + 1} - x_k}\right\|_2^2 \le \inf_x \left\{ f_{x_k}(x) + \frac{L}{2} \left\|{x - x_k}\right\|_2^2 \right\} + \epsilon_k. \end{equation} (9) We have the following theorem, whose proof we provide in Section 2.2. Theorem 1 Let Conditions C1 and C2 hold. Assume that in each step of Algorithm 1, we solve the intermediate optimization problem to accuracy |$\epsilon _k$|⁠, and define the relative error measures |$\beta _k = \frac{2 \epsilon _k}{\lambda \mathop{\textrm{dist}}(0, X_\star )^2}$|⁠. Then \begin{equation*} \frac{\mathop{\textrm{dist}}(x_k, X_\star)}{\mathop{\textrm{dist}}(0, X_\star)} \le \frac{\lambda}{2 L} \max\left\{ \left(\frac{2 L}{\lambda} \cdot \frac{\mathop{\textrm{dist}}(x_0, X_\star)}{\mathop{\textrm{dist}}(0, X_\star)} \right)^{2^k}, \max_{0 \le\, j < k} \left(\frac{4 L}{\lambda} \cdot \beta_j\right)^{2^{k - j - 1}}\right\}. \end{equation*} If |$\epsilon _k = 0$| in the solution quality inequality (9), then \begin{equation*} \frac{\mathop{\textrm{dist}}(x_k, X_\star)}{\mathop{\textrm{dist}}(0, X_\star)} \le \frac{\lambda}{L} \left(\frac{L}{\lambda} \cdot \frac{\mathop{\textrm{dist}}(x_0, X_\star)}{\mathop{\textrm{dist}}(0, X_\star)}\right)^{2^k}. 
\end{equation*} Theorem 1 motivates our approach for the remainder of the paper; we can guarantee exact, accurate and fast solutions to the phase retrieval problem under the following three conditions: Stability (Condition C1), Quadratic approximation (Condition C2), via an upper bound on |$\left |\!\left |\!\left |{A^{\textrm{H}} A} \right |\!\right |\!\right |_{\textrm{op}}$| and application of inequality (8) and An initializer |$x_0$| of the iterations that is good enough, meaning that it satisfies the constant relative error guarantee |$\mathop{\textrm{dist}}(x_0, X_\star ) \le \mathop{\textrm{dist}}(0, X_\star ) \frac{\lambda }{L}$|⁠. In the coming sections, we show that each of these three conditions holds in both noiseless measurement models (Section 3) and with adversarially perturbed measurements |$b_i$| (Section 4). Before continuing, we provide a few additional remarks regarding Theorem 1. First, if |$\epsilon _k$| are very small because we solve the intermediate steps (5) to (near) machine precision, then for all intents and purposes about five iterations suffice for machine precision accurate solutions. Quadratic convergence is also achievable with errors in inequality (9); if the minimization accuracies decrease quickly enough that |$\epsilon _k \le 2^{-2^k}$|⁠, then we certainly still have quadratic convergence. More broadly, Theorem 1 shows that the accuracy of solution in iteration |$j$| need not be very high to guarantee high accuracy reconstruction of the signal |$x_\star $|⁠; only in the last few iterations is moderate to high accuracy necessary. If it is computationally cheap, it is thus advantageous—as we explore in our experimental work—to solve early iterations of the prox-linear method inaccurately. 2.2 Proof of Theorem 1 We prove the result in two steps as follows: we first provide a per-iteration progress guarantee, and then we use this guarantee to show quadratic convergence. Let |$x_\star \in X_\star $| be any global optimum of the objective |$f(x)$|⁠. The function |$x \mapsto f_{x_k}(x) + \frac{L}{2} \left \|{x - x_k}\right \|_2^2$| is |$L$|-strongly convex in |$x$|⁠. If we define |$x_{_\star ,k + 1}$| to be the exact minimizer of |$f_{x_k}(x) + \frac{L}{2} \left \|{x - x_k}\right \|_2^2$|⁠, the standard optimality conditions for strongly convex minimization imply \begin{align*} f_{x_k}(x_{k+1}) + \frac{L}{2} \left\|{x_{k + 1} - x_k}\right\|_2^2 & \le f_{x_k}(x_{_\star,k+1}) + \frac{L}{2} \left\|{x_{_\star,k+1} - x_k}\right\|_2^2 + \epsilon_k \nonumber \\ & \le f_{x_k}(x_\star) + \frac{L}{2} \left\|{x_\star - x_k}\right\|_2^2 - \frac{L}{2} \left\|{x_\star - x_{_\star,k+1}}\right\|_2^2 + \epsilon_k. \end{align*} Using the approximation Condition C2, so that |$\frac{L}{2} \left \|{x_k - x_{k+1}}\right \|_2^2 + f_{x_k}(x_{k + 1}) \ge f(x_{k + 1})$| and |$f_{x_k}(x_\star ) \le f(x_\star ) + \frac{L}{2} \left \|{x_k - x_\star }\right \|_2^2$|⁠, we have by substituting in the preceding inequality that \begin{align*} f(x_{k + 1}) & \le f_{x_k}(x_\star) + \frac{L}{2} \left\|{x_\star - x_k}\right\|_2^2 - \frac{L}{2} \left\|{x_\star - x_{_\star,k+1}}\right\|_2^2 + \epsilon_k \\ & \le f(x_\star) + L \left\|{x_k - x_\star}\right\|_2^2 - \frac{L}{2} \left\|{x_\star - x_{_\star,k+1}}\right\|_2^2 + \epsilon_k. \end{align*} Summarizing, we get for all |$x_\star \in X_\star $| and |$k \in{\mathbb{N}}$|⁠, \begin{equation} f(x_{k + 1}) - f(x_\star) + \frac{L}{2} \left\|{x_\star - x_{_\star,k+1}}\right\|_2^2 \le L \left\|{x_k - x_\star}\right\|_2^2 + \epsilon_k. 
\end{equation} (10) By applying the stability Condition C1, we immediately obtain the progress guarantee \begin{equation} \begin{split} \lambda \left\|{x_\star}\right\|_2 \mathop{\textrm{dist}}\left(x_{k+1}, X_\star\right) + \frac{L}{2} \big\|{x_{_\star,k+1} - x_\star}\big\|_2^2 \le L \left\|{x_k - x_\star}\right\|_2^2 + \epsilon_k. \end{split} \end{equation} (11) We now transform the guarantee (11) into one involving only |$x_k$|, |$x_{k + 1}$| and |$x_\star$|, rather than |$x_{_\star ,k+1}$|, by bounding the difference between |$x_{_\star ,k+1}$| and |$x_{k+1}$|. The |$L$|-strong convexity of |$f_{x_k}(\cdot ) + \frac{L}{2} \left \|{\cdot - x_k}\right \|_2^2$| implies that \begin{align*} \epsilon_k + f_{x_k}(x_{_\star,k+1}) + \frac{L}{2} \big\|{x_{_\star,k+1} - x_k}\big\|_2^2 & \ge f_{x_k}(x_{k+1}) + \frac{L}{2} \left\|{x_{k+1} - x_k}\right\|_2^2 \\ & \ge f_{x_k}(x_{_\star,k+1}) + \frac{L}{2} \left\|{x_{_\star,k+1} - x_k}\right\|_2^2 + \frac{L}{2} \left\|{x_{k+1} - x_{_\star,k+1}}\right\|_2^2, \end{align*} whence we obtain |$\frac{L}{2} \left \|{x_{_\star ,k+1} - x_{k + 1}}\right \|_2^2 \le \epsilon _k$|. Using the standard quadratic inequality |$\left \|{x_{k+1} - x_\star }\right \|_2^2 \le 2 \left \|{x_{k+1} - x_{_\star ,k+1}}\right \|_2^2 + 2\left \|{x_{_\star ,k+1} - x_\star }\right \|_2^2$|, we thus have by expression (11) that, for all |$x_\star \in X_\star$|, \begin{equation} \begin{split} \lambda \left\|{x_\star}\right\|_2 \mathop{\textrm{dist}}\left(x_{k+1}, X_\star\right) + \frac{L}{4} \left\|{x_{k+1} - x_\star}\right\|_2^2 \le L \left\|{x_k - x_\star}\right\|_2^2 + 2\epsilon_k. \end{split} \end{equation} (12) Now, taking the infimum over |$x_\star \in X_\star$| on both sides of inequality (12), we find that \begin{equation*} \lambda \mathop{\textrm{dist}}(0, X_\star) \mathop{\textrm{dist}}(x_{k+1}, X_\star) \le L \mathop{\textrm{dist}}(x_k, X_\star)^2 + 2 \epsilon_k. \end{equation*} Dividing each side by |$\lambda \mathop{\textrm{dist}}(0, X_\star )^2$| yields \begin{equation} \frac{\mathop{\textrm{dist}}(x_{k+1}, X_\star)}{\mathop{\textrm{dist}}(0, X_\star)} \le \frac{L}{\lambda} \frac{\mathop{\textrm{dist}}(x_k, X_\star)^2}{\mathop{\textrm{dist}}(0, X_\star)^2} + \frac{2 \epsilon_k}{\lambda \mathop{\textrm{dist}}(0, X_\star)^2}. \end{equation} (13) Inductively applying inequality (13) when |$\epsilon _k = 0$| yields the second statement of the theorem. When |$\epsilon _k > 0$|, a brief technical lemma shows the convergence rate. Lemma 2.1 Let the sequences |$a_k \ge 0$|, |$\epsilon _k \ge 0$| satisfy |$a_k \le \kappa a_{k-1}^2 + \epsilon _{k-1}$|. Then \begin{equation*} a_k \le \max\left\{ (2 \kappa)^{2^k - 1} a_0^{2^k}, \max_{0 \le j < k} (2 \kappa)^{2^{k - j - 1} - 1} (2 \epsilon_j)^{2^{k - j - 1}} \right\}. \end{equation*} Proof. The proof is by induction. For |$a_1$|, we certainly have |$a_1 \le \kappa a_0^2 + \epsilon_0 \le 2 \kappa a_0^2 \vee 2 \epsilon_0$| because both terms are non-negative. For the general case, assume the result holds for |$a_k$|, where |$k$| is arbitrary. Then \begin{align*} a_{k + 1} & \le \kappa a_k^2 + \epsilon_k \le \max\left\{2 \kappa a_k^2, 2 \epsilon_k\right\} \\ & \le 2 \kappa (2 \kappa)^{2^{k + 1} - 2} a_0^{2^{k + 1}} \vee \max_{0 \le j < k} \left\{2 \kappa (2 \kappa)^{2^{k - j} - 2} (2 \epsilon_j)^{2^{k - j}}\right\} \vee 2 \epsilon_k \\ & = \max \left\{ (2 \kappa)^{2^{k + 1} - 1} a_0^{2^{k + 1}}, \max_{0 \le j < k + 1} (2 \kappa)^{2^{k - j} - 1} (2 \epsilon_j)^{2^{k - j}}\right\} \end{align*} as desired.
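As a small numerical aside (with hypothetical constants, not values from the paper), the recursion controlled by Lemma 2.1 illustrates the behavior the lemma captures: the error collapses doubly exponentially until the solver-error floor dominates, matching the maximum in the lemma's bound.

# Illustration of the recursion a_k <= kappa * a_{k-1}^2 + eps_{k-1} from Lemma 2.1,
# with hypothetical values kappa = 2, a_0 = 0.1 and a fixed solver error eps = 1e-12.
kappa, a, eps = 2.0, 0.1, 1e-12
for k in range(1, 7):
    a = kappa * a ** 2 + eps
    print(k, a)   # roughly 2e-2, 8e-4, 1.3e-6, 4.3e-12, then the eps floor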
Applying Lemma 2.1 in inequality (13) with |$\kappa = \frac{L}{\lambda }$| and |$a_k = \frac{\mathop{\textrm{dist}}(x_k, X_\star )}{\mathop{\textrm{dist}}(0, X_\star )}$| yields the theorem. 3. Noiseless phase retrieval problem We begin our discussion of the phase retrieval problem by considering the noiseless case, that is, when the observations |$b_i = | \langle a_i, x_\star \rangle |^2$|. Throughout this section and the remainder of the paper, the vectors |$a_i$| are assumed to be independent and identically distributed copies of a random vector |$a \in{\mathbb{C}}^n$|. Based on our discussion after Theorem 1, we show (i) that the objective (2) is stable (Condition C1), (ii) that it is quadratically approximable (Condition C2) and (iii) that we have a good initializer |$x_0$|. In the coming three sections, we address each of these in turn, providing progressively stronger assumptions that are sufficient for each condition to hold with high probability as soon as the measurement ratio |$m/n > c$|, where |$c$| is a numerical constant. In Section 3.4 we provide a summary theorem that encapsulates our results. For readability, we defer proofs to Appendix A. 3.1 Stability Our first step is to provide conditions under which stability holds. We divide our discussion of stability conditions into the real and complex cases, as the real case is considerably easier. 3.1.1 Stability for real-valued vectors With the stability condition in mind, we make the following assumption on the random measurement vectors |$a\in{\mathbb{R}}^n$|. Assumption 1 There exist constants |$\kappa _{\textrm{st}}> 0$| and |$p_0>0$| such that for all |$u, v \in{\mathbb{R}}^n$|, |$\left \|{u}\right \|_2 = \left \|{v}\right \|_2 = 1$|, we have \begin{equation*} {\mathbb{P}}\left(| \langle a, u \rangle | \wedge | \langle a,v \rangle | \geq \kappa_{\textrm{st}} \right) \geq p_0. \end{equation*} Intuitively, Assumption 1 says that the measurement vectors |$a \in{\mathbb{R}}^n$| have sufficient support in all directions |$u, v \in{\mathbb{R}}^n$|. One simple sufficient condition for this is a type of small-ball condition [32], as follows, which makes it clear that Assumption 1 requires no light tails: just that the probability of one of |$| \langle a, u \rangle |$| and |$| \langle a, v \rangle |$| being small is small. Example 1 (Stability by the small-ball method) Assume that we may choose positive but small enough |$\varepsilon> 0$| that \begin{equation*} \sup_{\left\|{u}\right\|_2 = 1} {\mathbb{P}}\left(| \langle a, u \rangle | < \varepsilon\right) < \frac{1}{2}. \end{equation*} Then by the union bound, the choice |$p_0 = 1 - 2 \sup _{\left \|{u}\right \|_2 = 1} {\mathbb{P}}(| \langle a, u \rangle | < \varepsilon )> 0$| and |$\kappa _{\textrm{st}} = \varepsilon$| immediately yields \begin{align*} {\mathbb{P}}\left(| \langle a, u \rangle | \wedge | \langle a,v \rangle | \ge \kappa_{\textrm{st}}\right) & = 1 - {\mathbb{P}}\left(| \langle a, u \rangle | < \varepsilon ~ \mbox{or} ~ | \langle a,v \rangle | < \varepsilon \right) \\ & \ge 1 - {\mathbb{P}}\left(| \langle a, u \rangle | < \varepsilon\right) - {\mathbb{P}}\left(| \langle a, v \rangle | < \varepsilon\right) \ge p_0. \end{align*} As a further specialization, if |$a \sim{\mathsf{N}}(0, I_n)$| is an isotropic Gaussian, then the choice |$\kappa _{\textrm{st}} = 0.31$| yields |${\mathbb{P}}(| \langle a, u \rangle | \le \kappa _{\textrm{st}}) \le \frac{1}{4}$|, so we may take |$\kappa _{\textrm{st}} = 0.31$| and |$p_0 = \frac{1}{2}$|.
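As a quick sanity check on the Gaussian constants in Example 1 (our own check, using only the standard normal CDF), one can verify numerically that |$\kappa _{\textrm{st}} = 0.31$| and |$p_0 = \frac{1}{2}$| are valid choices.

from scipy.stats import norm

# For a ~ N(0, I_n) and any unit vector u, <a, u> ~ N(0, 1), so
# P(|<a, u>| <= 0.31) = 2 * Phi(0.31) - 1 ~= 0.243 <= 1/4.
p_small = 2 * norm.cdf(0.31) - 1
print(p_small)          # ~0.243
print(1 - 2 * p_small)  # the union-bound value of p_0, ~0.51 >= 1/2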
As we note in the discussion preceding Condition C1, for the objective (2) it is immediate in the noiseless case that \begin{equation*} f(x) - f(\pm x_\star) = f(x) = \frac{1}{m} \big\|{(Ax)^2 - (A x_\star)^2}\big\|_1 = \frac{1}{m} \sum_{i=1}^m | \langle a_i, x-x_\star \rangle \langle a_i, x+x_\star \rangle |. \end{equation*} Thus, if we can show the stability condition (6a)—equivalently, that |$\sum _{i = 1}^m | \langle a_i, u \rangle \langle a_i, v \rangle | \ge \lambda m$| for |$u, v \in{\mathbb{S}}^{n-1}$|—then Condition C1 holds for |$X_\star = \{\pm x_\star \}$|. To that end, we provide the following guarantee, which we prove in Appendix A.1. Proposition 1 Let Assumption 1 hold. There exists a numerical constant |$c < \infty$| such that for any |$t \ge 0$|, \begin{equation*} {\mathbb{P}}\left(\frac{1}{m} \sum_{i=1}^m | \langle a_i, u \rangle \langle a_i, v \rangle | \ge \kappa_{\textrm{st}}^2 \cdot \left(p_0 - c \sqrt{\frac{n}{m}}-\sqrt{2t}\right) ~\mbox{for all}~ u, v \in{\mathbb{S}}^{n-1}\right) \ge 1 - 2 \textrm{e}^{-mt}. \end{equation*} Proposition 1 immediately yields the following corollary, which shows that the stability condition C1 holds with high probability for |$m/n \gtrsim 1$|. Corollary 3.1 Let Assumption 1 hold. Then there exists a numerical constant |$c < \infty$| such that if |$m p_0^2 \ge c n$|, then \begin{equation*} {\mathbb{P}}\left(f(x) - f(x_\star) \geq \frac{1}{2} \kappa_{\textrm{st}}^2 p_0 \left\|{x-x_\star}\right\|_2\left\|{x+x_\star}\right\|_2 \ \mbox{for all }x \in \mathbb{R}^n \right) \ge 1 - 2 \exp\left(-\frac{mp_0^2}{32}\right). \end{equation*} Eldar & Mendelson [19] establish the stability Condition C1 under more restrictive assumptions on the distribution of the measurement vectors |$\{a_i\}_{i=1}^m$|. Concretely, they require that the distribution of |$a$| is sub-Gaussian (see Definition 3.1 to come) and isotropic (meaning that |${\mathbb{E}} [aa^T] = I_n$|). As Example 1 makes clear, our result requires only weaker small-ball assumptions, without any restrictions on tails or the covariance structure of the random vector |$a$|. 3.1.2 The complex case We now investigate conditions sufficient for stability of |$A$| when the measurement vectors |$a_i$| are complex-valued. In this case, the argument is not quite so simple, as stability for complex vectors requires more uniform notions of function growth. Accordingly, we make the following two assumptions on the random vectors |$a_i \in{\mathbb{C}}^n$|. The first is a small-ball type assumption, while the second requires that the vectors |$a_i$| are appropriately uniform in direction. To fully state our assumptions, we require an additional definition on the sub-Gaussianity of random vectors. We define this in terms of Orlicz norms (following [43,44, Ch. 2.2]). Definition 3.1 The random vector |$a \in{\mathbb{C}}^n$| is |$\sigma ^2$|-sub-Gaussian if for all |$v \in{\mathbb{C}}^n$|, |$\left \|{v}\right \|_2 = 1$|, \begin{equation*} {\mathbb{E}} \left[\exp\bigg(\frac{| \langle a, v \rangle |^2}{\sigma^2} \bigg)\right] \leq e. \end{equation*} With this definition, we now provide assumptions on the random measurement vector |$a \in{\mathbb{C}}^n$|. Assumption 2 There exists a non-decreasing function |$h : {\mathbb{R}}_+ \to{\mathbb{R}}_+$| with |$h(0) = 0$| such that for |$\epsilon \ge 0$|, \begin{equation*} {\mathbb{P}}(\left\|{a}\right\|_2 \le \epsilon \sqrt{n}) \le h(\epsilon). \end{equation*} We also require that the distribution of |$a$| is sufficiently uniform in direction.
Assumption 3 Let |$w = \sqrt{n} a / \left \|{a}\right \|_2$|⁠. The random vector |$w$| is |$\sigma ^2$|-sub-Gaussian (Definition 3.1). In addition, for any matrix |$X \in{\mathbb{C}}^{n \times n}$| with rank at most |$2$|⁠, \begin{equation*} {\mathbb{E}}\left[\left|w^{\textrm{H}} X w\right|\right] \ge \tau^2 \left\|{X}\right\|_{\textrm{Fr}}. \end{equation*} Assumption 3 may seem somewhat challenging to verify, but it holds for any rotationally symmetric distribution, and moreover, in this case we have that |$\tau \ge c\sigma $| for a numerical constant |$c> 0$|⁠. Example 2 (Rotationally symmetric measurements) Let the measurement vectors |$a_i$| be rotationally symmetric, so that for unitary |$U \in{\mathbb{C}}^{n \times n}$|⁠, the distribution of |$U a$| is identical to |$a$|⁠. We show that Assumption 3 holds. In this case, |$w = \sqrt{n} a / \left \|{a}\right \|_2$| is uniform on the radius-|$\sqrt{n}$| sphere in |${\mathbb{C}}^n$|⁠, and standard results in convex geometry [3] show |$w$| is |$O(1)$|-sub-Gaussian (Definition 3.1). As |$a$| is rotationally symmetric, |$w$| is also equal in distribution to |$\sqrt{n} z / \left \|{z}\right \|_2$|⁠, where |$z$| is standard complex normal; thus, we have for any rank |$2$| or less Hermitian |$X$| that \begin{equation*} {\mathbb{E}}[|w^{\textrm{H}} X w|] = n {\mathbb{E}}[|z^{\textrm{H}} X z|] {\mathbb{E}}[1 / \|{z}\|_2^2] \stackrel{(i)}{\ge} {\mathbb{E}}[|z ^{\textrm{H}} X z|] \stackrel{(ii)}{\ge} \frac{2 \sqrt{2}}{\pi} \|{X}\|_{\textrm{Fr}}, \end{equation*} where inequality |$(i)$| is a consequence of |${\mathbb{E}}[1 / \left \|{z}\right \|_2^2] \ge \frac{1}{n}$| and inequality |$(ii)$| is a calculation (see Lemma C.2 in Appendix C.3) as |$X$| is rank |$2$|⁠. With these assumptions in place, we have the following stability guarantee for the random matrix |$A$|⁠. We defer the proof to Appendix A.2. Proposition 2 Let Assumptions 2 and 3 hold. Let |$c> 0$| be chosen such that |$h(c) \le \frac{1}{2(1 + e)} \frac{\tau ^4}{\sigma ^4}$|⁠. There exist numerical constants |$c_0> 0$| and |$c_1 < \infty $| such that with probability at least \begin{equation*} 1 - \exp\left(-c_0 m \frac{\tau^4}{\sigma^4} + c_1 n \log \frac{\sigma^2}{\tau^2}\right), \end{equation*} we have \begin{equation*} \frac{1}{m} \big\|{|Ax|^2 - |Ay|^2}\big\|_1 \ge \frac{c^2 \tau^2}{4} \cdot \inf_\theta \|{x - \textrm{e}^{\iota\theta} y}\|_2 \cdot \sup_\theta \|{x - \textrm{e}^{\iota\theta} y}\|_2 \end{equation*} simultaneously for all |$x, y \in{\mathbb{C}}^n$|⁠. The proposition shows that if the ratio between the growth constant |$\tau $| and the sub-Gaussian constant |$\sigma $| in Assumption 3 is bounded, as it is in the case of rotationally invariant vectors |$a$| (Example 2), then we have the complex stability guarantee (6b) as soon as |$m \gtrsim n$| with extremely high probability. 3.2 Quadratic approximation With the stability condition C1 in place, we turn to a discussion of the approximation condition C2. As implied by the estimate in inequality (8), the quadratic approximation condition is satisfied with parameter |$L = 2|\!|\!|{\frac{1}{m} A^{\textrm{H}} A}|\!|\!|_{\textrm{op}}$|⁠. To control this quantity, we require that the rows of the matrix |$A \in{\mathbb{C}}^{m \times n}$| be sufficiently light-tailed. Assumption 4 The random vector |$a \in{\mathbb{C}}^n$| is |$\sigma ^2$|-sub-Gaussian. 
It is of course possible to bound |$\left |\!\left |\!\left |{A} \right |\!\right |\!\right |_{\textrm{op}}$| when the rows have heavier tails using appropriate symmetrization techniques and matrix Khintchine inequalities (cf. [44, Section 2.6]); the extension is clear, so we do not address such issues. Certainly, not all measurement vectors |$a_i$| satisfy Assumption 4. Using that |${\mathbb{E}}[\textrm{e}^{\lambda Z^2}] =\left [{1 - 2 \lambda }\right ]_+^{-\frac{1}{2}}$| for |$Z \sim{\mathsf{N}}(0, 1)$|, the assumption holds for standard real normal vectors |$a_i \stackrel{\textrm{iid}}{\sim } {\mathsf{N}}(0, I_n)$| or complex normal vectors |$a_i \stackrel{\textrm{iid}}{\sim } \frac{1}{\sqrt{2}}\left ({\mathsf{N}}(0, I_n) + \iota{\mathsf{N}}(0, I_n)\right )$| with |$\sigma ^2 = \frac{2 e^2}{e^2 - 1} \approx 2.313$|. Similarly, it also holds for |$a_i$| uniform on |${\mathbb{S}}^{n-1}$| with |$\sigma ^2 = O(1) \cdot \frac{1}{n}$|. In practice it may be useful to apply our algorithm to the transformed data |$\{a_i / \left \|{a_i}\right \|_2\}_{i = 1}^m$| and |$\{b_i / \left \|{a_i}\right \|_2^2\}_{i=1}^m$|, which (in the noiseless case or the case of infrequent but arbitrary corruptions of |$b_i$|) is likely to make the problem better conditioned. There are two heuristic motivations for this: first, those measurement vectors with larger magnitudes |$\left \|{a_i}\right \|$| tend to carry higher weight in the optimization problem (2), and thus normalization can make the observations ‘comparable’ to each other; second, normalization guarantees that the measurement vectors |$\{a_i\}_{i=1}^m$| satisfy Assumption 4, yielding easier verification of Condition C2. (It may be more challenging to verify Assumption 1, but if the |$a_i$| are sufficiently isotropic this presents no special difficulties.) Standard results guarantee that the random matrices |$A$| have well-behaved singular values whenever Assumption 4 holds; we provide one such result due to Vershynin [44, Thm. 39, Eq. (25)] with constants that are achievable by tracing his proof. Lemma 3.1 Let Assumption 4 hold and |$\Sigma = {\mathbb{E}}[aa^{\textrm{H}}]$|. Then for all |$t \ge 0$|, \begin{align*} {\mathbb{P}}\left(\left|\!\left|\!\left|{\frac{1}{m} A^{\textrm{H}} A - \Sigma} \right|\!\right|\!\right|_{\textrm{op}} \ge 11 \sigma^2 \max\left\{ \sqrt{\frac{4n}{m} + t}, \frac{4n}{m} + t \right\} \right) \le \exp(-m t). \end{align*} Moreover, |$\left |\!\left |\!\left |{\Sigma } \right |\!\right |\!\right |_{\textrm{op}} \le \sigma ^2$| and |${\mathbb{E}}[|\langle a, v\rangle |^k] \le \Gamma \big(\frac{k}{2} + 1\big) e \sigma ^k$| for all |$k \ge 0$|. Thus, we have the following corollary of Lemma 3.1, which guarantees that Condition C2 holds with high probability for |$m/n \gtrsim 1$|. Corollary 3.2 Let Assumption 4 hold. Then there exists a numerical constant |$c < \infty$| such that whenever |$m \ge c n$| \begin{equation*} {\mathbb{P}}\left(\left|\,f_x(y) - f(y)\right| \leq 2 \sigma^2 \left\|{x-y}\right\|_2^2 ~\mbox{holds uniformly for}~ x, y\in{\mathbb{R}}^n\right) \ge 1- \exp\left(-\frac{m}{c}\right). \end{equation*} Proof. Assume |$m \geq cn$| for |$c$| large enough, and choose |$t$| small enough in Lemma 3.1 that |$11 \sigma ^2 \sqrt{4n / m + t} \le \sigma ^2$|. The triangle inequality then gives |$|\!|\!|{\frac{1}{m} A^{\textrm{H}} A}|\!|\!|_{\textrm{op}} \le |\!|\!|{\frac{1}{m} A^{\textrm{H}} A - \Sigma }|\!|\!|_{\textrm{op}} + |\!|\!|{\Sigma}|\!|\!|_{\textrm{op}} \le 2 \sigma ^2$|, which yields the result by inequality (8).
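Collecting the pieces so far, the following is a minimal sketch (ours, not the authors' implementation; real-valued measurements only, with the off-the-shelf solver cvxpy and a hypothetical helper name prox_linear_step) of one prox-linear step (5) for the objective (2), using the linearization (7) and the constant |$L = 2 |\!|\!|{\frac{1}{m} A^{\textrm{H}} A}|\!|\!|_{\textrm{op}}$| from inequality (8) in place of |$1/\alpha$|; the final print reports the distance to the signal up to sign, mirroring the set |$X_\star = \{\pm x_\star\}$|.

import numpy as np
import cvxpy as cp

def prox_linear_step(A, b, x):
    """One prox-linear step (5) for f(x) = (1/m) ||(Ax)^2 - b||_1, real case.

    Sketch only: it forms the convex model f_x(y) from equation (7) and adds the
    proximal term with L = 2 |||A^T A / m|||_op, as inequality (8) suggests.
    """
    m, n = A.shape
    L = 2 * np.linalg.norm(A, ord=2) ** 2 / m      # 2 * |||A^T A|||_op / m
    Ax = A @ x
    resid = Ax ** 2 - b                            # c(x)
    J = 2 * Ax[:, None] * A                        # Jacobian of c at x (m x n)
    y = cp.Variable(n)
    model = cp.sum(cp.abs(resid + J @ (y - x))) / m
    prob = cp.Problem(cp.Minimize(model + (L / 2) * cp.sum_squares(y - x)))
    prob.solve()
    return y.value

# Tiny usage example with synthetic noiseless measurements and a warm start.
rng = np.random.default_rng(1)
n, m = 20, 100
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = (A @ x_star) ** 2
x = x_star + 0.1 * rng.standard_normal(n)          # a good initializer
for _ in range(5):                                 # roughly five QPs, as in Section 1
    x = prox_linear_step(A, b, x)
print(min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star)))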
3.3 Initialization The last ingredient in achieving strong convergence guarantees for our prox-linear procedure for phase retrieval is to provide a good initialization. There are a number of initialization strategies in the literature [12,46,47] based on spectral techniques, which work as follows. We decompose the initialization into the following two steps: we (i) find an estimate of the direction |$d_\star := x_\star /\left \|{x_\star }\right \|_2$| and (ii) estimate the magnitude |$r_\star := \left \|{x_\star }\right \|_2$|. The latter is easy; assuming that |${\mathbb{E}}[aa^{\textrm{H}}] = I_n$|, one simply uses |$\widehat{r}^2 = m^{-1}\sum _{i = 1}^m b_i$|, which is unbiased and tightly concentrated. The former, the direction estimate, is somewhat trickier. Wang et al. [46] provide an empirically excellent initialization whose heuristic justification is as follows. First, for random vectors |$a_i$| in high dimensions, we expect |$a_i$| to usually be orthogonal to the direction |$d_\star$|. Thus, by extracting the indices with the smallest magnitudes |$b_i = | \langle a_i, x_\star \rangle |^2$|, we obtain the vectors |$a_i$| that are ‘most’ orthogonal to the direction |$d_\star$|; letting |$\mathcal{I}_{\textrm{sel}}$| denote these indices, the eigenvector corresponding to the smallest eigenvalue (for brevity, we call this the smallest eigenvector) of |$\sum _{i \in \mathcal{I}_{\textrm{sel}}} a_i a_i^{\textrm{H}}$| should be close to the direction |$d_\star$|. A variant of this procedure is to note that |$\frac{1}{m} \sum _{i = 1}^m a_i a_i^{\textrm{H}} \approx I_n$| when the |$a_i$| are isotropic, so that the largest eigenvector of |$\sum _{i \not \in \mathcal{I}_{\textrm{sel}}} a_i a_i^{\textrm{H}}$| should also be close to |$d_\star$|. This initialization strategy has the added benefit that—unlike the original spectral initialization schemes developed by Candès et al. [12], which rely on eigenvectors of |$\sum _{i = 1}^m b_i a_i a_i^{\textrm{H}}$| that may not concentrate at sub-Gaussian rates (as the sum involves fourth moments of random vectors)—the sums |$\sum _i a_i a_i^{\textrm{H}}$| are tightly concentrated. Unfortunately, we believe Wang et al.’s proof that this initialization works contains a mistake (note that they consider only the case where |$\{a_i\}_{i=1}^m$| are real); letting |$U \in{\mathbb{R}}^{n \times (n-1)}$| be an orthogonal matrix whose columns are all orthogonal to |$d_\star$|, in the proof of Lemma 2 (Eqs. (68–70) in [46]) they assert that |$|\mathcal{I}_{\textrm{sel}}^c|^{-1} \sum _{i \in \mathcal{I}_{\textrm{sel}}^c} U^{\textrm{H}} a_i a_i^{\textrm{H}} U \to I_{n-1}$| when the |$a_i$| are uniform on |$\sqrt{n} {\mathbb{S}}^{n-1}$|. This is not true (nor does appropriate normalization by |$n$| or |$n-1$| make it true), as it ignores the subtle effects of conditioning in the construction of |$\mathcal{I}_{\textrm{sel}}$|. In spite of this issue, the initialization they propose works remarkably well, and as we show presently, it provably provides a good estimate |$\widehat{d}$| of |$d_\star$|. We include the initialization procedure in Algorithm 2. With the previous discussion in mind, we provide a general assumption that is sufficient for Algorithm 2 to return a direction and magnitude estimate sufficiently accurate for phase retrieval.
Assumption 5 For some |$\epsilon _0 \in \left ({0},{1}\right ]$| and |$\mathsf{p}_0(d_\star )> 0$|⁠, the following hold: (i) For all |$\epsilon \in \left ({0},{\epsilon _0}\right ]$|⁠, the following continuity and directional likelihood conditions hold: \begin{equation*} {\mathbb{P}}\left(| \langle a, d_\star \rangle |^2 \in \left[\frac{1-\epsilon}{2}, \frac{1+\epsilon}{2}\right]\right) \leq \kappa \epsilon ~~\textrm{and}~~{\mathbb{P}}\left(| \langle a, d_\star \rangle |^2 \leq \frac{1-\epsilon}{2}\right) \geq \mathsf{p}_0(d_\star) >0. \end{equation*} (ii) There exist functions |$\phi : [0, \epsilon _0] \to{\mathbb{R}}_+$| and |$\Delta : [0, \epsilon _0] \to{\mathbb{C}}^{n \times n}$| such that \begin{equation*} {\mathbb{E}} \left[aa^{\textrm{H}} \mid | \langle a, d_\star \rangle |^2 \leq \frac{1-\epsilon}{2} \right] = I_n - \phi(\epsilon) d_\star d_\star^{\textrm{H}} + \Delta(\epsilon) ~~ \mbox{for} ~ \epsilon \in [0, \epsilon_0]. \end{equation*} (iii) |${\mathbb{E}}[aa^{\textrm{H}}] = I_n$|⁠. Assumption 5 on its face seems complex, but each of its components is not too stringent. Part (i) says that |$| \langle a, d_\star \rangle |^2$| has no point mass at |$| \langle a, d_\star \rangle |^2 = \frac{1}{2}$| and that |$| \langle a, d_\star \rangle |^2$| has reasonable probability of being smaller than |$\frac{1}{2}$|⁠. Part (iii) simply states that in expectation, |$a$| is isotropic (and a rescaling of |$a$| can guarantee this). Part (ii) is the most subtle and essential for our derivation; it says that |$a \in{\mathbb{R}}^n/{\mathbb{C}}^n$| is reasonably isotropic, even if we condition on |$| \langle a, d_\star \rangle |$| being near zero for some direction |$d_\star $|⁠, so that most mass of |$aa^{\textrm{H}}$| is distributed uniformly in the orthogonal directions |$I_n - d_\star d_\star ^{\textrm{H}}$|⁠. The error terms |$\phi $| and |$\Delta $| allow non-trivial latitude in this condition, so that Assumption 5 holds for more than just Gaussian vectors. That said, for concreteness we provide the following example. Example 3 (Gaussian vectors and conditional directions) We consider the standard cases that |$a_i \stackrel{\textrm{iid}}{\sim } {\mathsf{N}}(0, I_n)$|⁠, or the complex counterpart |$a_i \stackrel{\textrm{iid}}{\sim } \frac{1}{\sqrt{2}} \left ({\mathsf{N}}(0, I_n) + \iota{\mathsf{N}}(0, I_n)\right )$|⁠, showing that such |$a_i$| satisfy Assumption 5 for any |$\epsilon _0 \in (0, 1)$| with residual error |$\Delta \equiv 0$|⁠. Clearly Part (iii) holds. For Part (i), note that |$| \langle a, d_\star \rangle |^2$| is |$\chi _1^2$|-distributed with density |$f(t) = \textrm{e}^{-t/2} (2 \pi t)^{-\frac{1}{2}}$|⁠. Integrating the density using its upper bound, we may set |$\kappa \le f\big(\frac{1 - \epsilon _0}{2}\big) = \exp \big(\frac{\epsilon _0 - 1}{4}\big) / \sqrt{\pi (1 - \epsilon _0)}$| and |$\mathsf{p}_0(d_\star ) = {\mathbb{P}}\big(\chi _1^2 \leq \frac{1-\epsilon _0}{2}\big) \ge \frac{1}{2} \sqrt{1 - \epsilon _0}$|⁠. Part (ii) is all that remains. By the rotational invariance of |$a \sim{\mathsf{N}}(0, I_n)$|⁠, we see for any |$t \in{\mathbb{R}}_+$| that \begin{equation*} {\mathbb{E}}[aa^{\textrm{H}} \mid | \langle a, d_\star \rangle |^2 \leq t] = I_n - d_\star d_\star^{\textrm{H}} + {\mathbb{E}} [| \langle a, d_\star \rangle |^2 \mid | \langle a, d_\star \rangle |^2 \leq t] d_\star d_\star^{\textrm{H}}. \end{equation*} We claim the following lemma, whose proof we provide in Appendix C.1. 
Lemma 3.2 Let |$Z$| be a continuous random variable with a density symmetric about zero and decreasing on |${\mathbb{R}}_+$|⁠. Then for any |$c \in{\mathbb{R}}$| we have |${\mathbb{E}}[Z^2 \mid Z^2 \le c^2] \le \frac{c^2}{3}$|⁠. Thus, setting |$\phi (\epsilon ) = 1 - {\mathbb{E}}[| \langle a, d_\star \rangle |^2 \mid | \langle a, d_\star \rangle |^2 \le \frac{1 - \epsilon }{2}] \ge 1 - \frac{(1 - \epsilon )^2}{6} \ge \frac{5}{6}$|⁠, we see that part (ii) of Assumption 5 holds with |$\phi (\epsilon ) \ge \frac{5}{6}$| and |$\Delta (\epsilon ) \equiv 0$|⁠. We now state our main proposition of this section. Proposition 3 Let Assumptions 4 and 5 hold and let |$\epsilon _0$| and |$\mathsf{p}_0(d_\star )$| be as in Assumption 5. Define the error measure \begin{equation*} \nu(\epsilon) := \left[1 + \frac{4 e \big(1 + \log \frac{1}{1 \wedge \kappa \epsilon}\big) \sigma^2 \kappa}{\mathsf{p}_0(d_\star)} + \frac{2 \kappa (1+\epsilon)}{{\mathsf{p}_0(d_\star)}^2} \right] \epsilon. \end{equation*} There exists a numerical constant |$c> 0$| such that the following holds. Let |$(\widehat{r}, \widehat{d})$| be the estimated magnitude and direction of Algorithm 2 and define |$x_0 = \widehat{r} \widehat{d}$|⁠. For any |$\epsilon \in [0, \epsilon _0]$|⁠, if \begin{equation*} c \cdot \frac{m}{n} \ge \frac{\sigma^4 \log^2 \mathsf{p}_0(d_\star)}{ \mathsf{p}_0(d_\star) \epsilon^2} \vee \frac{1}{(\kappa \epsilon)^2} \end{equation*} then \begin{equation} \frac{\mathop{\textrm{dist}}(x_0, X_\star)}{\left\|{x_\star}\right\|_2} \le \sqrt{2(1 + \epsilon)}\frac{ \left|\!\left|\!\left|{\Delta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} + \nu(\epsilon)}{ \left[{\phi(\epsilon) - \left|\!\left|\!\left|{\Delta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} - \nu(\epsilon)}\right]_+} + \epsilon \end{equation} (14) with probability at least \begin{equation*} 1 - \exp\left(-\frac{c m \epsilon^2}{\sigma^4}\right) - 2\exp\left(-c m \epsilon^2 \kappa^2\right) - \exp\left(-\frac{m \mathsf{p}_0(d_\star)^2}{2}\right) - \exp\left(-\frac{cm \kappa \epsilon^2}{ \sigma^4 \log^2 \mathsf{p}_0(d_\star)}\right). \end{equation*} We prove Proposition 3 in two parts. In the first part (Appendix A.3), we define a number of events and proceed conditionally, showing that if each of the events occurs then the conclusion (14) holds. In the second part (Appendix A.4) we show that the events occur with high probability. We provide a few remarks to make the result clearer. Let us make the simplifying assumptions that the constants in Assumption 5 are absolute (which is true for Gaussian measurements), that is, that |$\sigma ^2 = O(1)$| and |$\mathsf{p}_0(d_\star ) = \Omega (1)$| (it is no loss of generality to assume |$\kappa \ge 1$|⁠). Then for numerical constants |$c> 0, C < \infty $| we have for any |$\epsilon \in [0, \epsilon _0]$| that the error measure |$\nu (\epsilon ) \le C \epsilon \log \frac{1}{\epsilon }$|⁠, and with probability at least |$1 - 5 \exp (-c m \epsilon ^2)$| that \begin{equation*} \frac{m}{n} \ge \frac{C}{\epsilon^2} ~~ \mbox{implies} ~~ \frac{\mathop{\textrm{dist}}(x_0, X_\star)}{\left\|{x_\star}\right\|_2} \le C\frac{\left|\!\left|\!\left|{\Delta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} + \nu(\epsilon)}{ \left[{\phi(\epsilon) - \left|\!\left|\!\left|{\Delta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} - \nu(\epsilon)}\right]_+} + \epsilon. \end{equation*} Here, we see three competing terms. 
The first two, the separation |$\phi (\epsilon )$| and error |$\Delta (\epsilon )$|, arise from the conditional expectation of Assumption 5 (ii), where |${\mathbb{E}}[aa^{\textrm{H}} \mid | \langle a, d_\star \rangle |^2 \le \frac{1 - \epsilon }{2}] = I_n - \phi (\epsilon ) d_\star d_\star ^{\textrm{H}} + \Delta (\epsilon )$|. This is intuitive; the larger the separation |$\phi (\epsilon )$| from uniformity in the conditional expectation of |$aa^{\textrm{H}}$|, the easier it is for spectral initialization to succeed, while a larger error |$\Delta (\epsilon )$| will hide the directional signal |$d_\star $|. The last term is the error |$\nu (\epsilon ) \lesssim \epsilon \log \frac{1}{\epsilon }$|, which approaches zero nearly as quickly as |$\epsilon $| itself as |$\epsilon \to 0$|. In the case that the error |$\Delta (\epsilon ) = 0$| and the gap |$\phi (\epsilon )$| is bounded away from zero, which holds for elliptical distributions with identity covariance—the Gaussian distribution and uniform distribution on the sphere being the primary examples—we thus see that as soon as |$\frac{m}{n} \gtrsim \epsilon ^{-2}$| we have relative error \begin{equation} \frac{\mathop{\textrm{dist}}(x_0, X_\star)}{\left\|{x_\star}\right\|_2} \lesssim \epsilon \log \frac{1}{\epsilon} \ \mbox{with probability } \ge 1 - 5 \exp(-c m \epsilon^2). \end{equation} (15) That is, we can construct an arbitrarily good initialization with large enough sample size. (This proves that the initialization scheme of Wang et al. [46] also succeeds with high probability.) On the other hand, when |$\Delta (\epsilon ) \neq 0$| for all |$\epsilon \in [0, \epsilon _0]$|, then Proposition 3 cannot guarantee arbitrarily good initialization; the error term |$|\!|\!|{\Delta (\epsilon )}|\!|\!|_{\textrm{op}}$| is never zero. However, if it is small enough, we still achieve initializers that are within constant relative distance of |$x_\star $|, which is good enough for Theorem 1. 3.4 Summary and success guarantees We now have guarantees of stability, quadratic approximation and good initialization for appropriate measurement matrices |$A \in{\mathbb{R}}^{m \times n}$| when the observations |$b = |Ax_\star |^2$|. We provide a summary theorem showing that our composite optimization procedure works as soon as the sample size is large enough. In stating the theorem, we assume that each of Assumptions 1, 4 and 5 holds with all of their constants being numerical constants. The one somewhat technical assumption we require relates the sub-Gaussian parameter |$\sigma ^2$|, the stability parameters |$\kappa _{\textrm{st}}$| and |$p_0$| of Assumption 1 (alternatively, in the complex case the growth constant |$\tau ^2$| and small-ball function |$h(\cdot )$| of Assumptions 2 and 3), and the error |$\Delta (\epsilon )$| and directional separation constants |$\phi (\epsilon )$| of Assumption 5. In the real case, define |$\nu := \kappa _{\textrm{st}}^2 p_0$| and in the complex case, recall Proposition 2 and define |$\nu := \bigg(h^{-1}\bigg(\frac{\tau ^4}{2(1+e)\sigma ^4}\bigg)\tau \bigg)^2$|. Assume then that for a suitably small numerical constant |$c> 0$| we have \begin{equation*} \left|\!\left|\!\left|{\Delta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} \le c \frac{\nu}{\sigma^2}, ~~~ \phi(\epsilon) \ge \frac{1}{2} \left|\!\left|\!\left|{\Delta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} ~~ \mbox{and} ~~ \phi(\epsilon) \ge \phi_\star> 0 \end{equation*} for all |$\epsilon \in [0, \epsilon _0]$|.
We then have the following theorem, which follows by combining our convergence Theorem 1 with Proposition 1, Corollary 3.2 and Proposition 3. Theorem 2 Let the conditions of the preceding paragraph hold. There exists a numerical constant |$c> 0$| such that if |$\frac{n}{m} < c$|, then the initializer |$x_0$| returned by Algorithm 2 satisfies |$\mathop{\textrm{dist}}(x_0, X_\star ) \le \frac{\nu }{8 \sigma ^2} \left \|{x_\star }\right \|_2$|, and assuming no error in the minimization steps of the prox-linear method (Algorithm 1), \begin{equation*} \mathop{\textrm{dist}}(x_k, X_\star) \le \left\|{x_\star}\right\|_2 2^{-2^k} \end{equation*} for all |$k \in{\mathbb{N}}$| with probability at least |$1 - \textrm{e}^{-c m}$|. We make two brief summarizing remarks. First, it is necessary to have |$m \gtrsim n$| to achieve exact recovery of the signal, as the parameter |$x_\star \in{\mathbb{C}}^n$| has |$2n$| unknowns and we have |$m$| equations (indeed, |$m \ge 4n - 2$| is necessary for injectivity of the measurements in the complex case [2]). Thus, the sample complexity of Theorem 2 is optimal to within numerical constants. Secondly, Theorem 2 shows that the prox-linear algorithm exhibits local quadratic convergence to the signal |$x_\star $|, which is in contrast to the local linear convergence of other non-convex methods based on gradients and generalized gradients [12,15,46]. On the other hand, each iteration of our algorithm requires solving a structured convex quadratic program, which is somewhat more expensive than the typical gradient iterations; as we demonstrate in our experiments (Section 6), this means our methods are about four times slower in overall run time than the best gradient-based methods, though their recovery properties are better. 4. Phase retrieval with outliers The objective (2) is the analogue of the least-absolute deviation estimator—the median in |${\mathbb{R}}$|—so in analogy with our understanding of robustness [27], it is natural to expect it should be robust to outliers. We show that this is indeed the case, and that the prox-linear method we develop is effective. For simplicity in our development, we assume for this section that all measurements and signals are real-valued, and we consider the following corruption model: let |$\{\xi _i\} \subset{\mathbb{R}}$| be an arbitrary sequence, and given the |$m$| measurement vectors |$a_i$|, we observe \begin{equation*} b_i = \begin{cases} \langle a_i, x_\star \rangle ^2 & \mbox{if }i\in \mathcal{I}^{\textrm{in}}\\ \xi_i & \mbox{if }i\in \mathcal{I}^{\textrm{out}}, \end{cases} \end{equation*} where |$\mathcal{I}^{\textrm{out}} \subset [m]$| and |$\mathcal{I}^{\textrm{in}} \subset [m]$| denote the outliers and inliers, respectively. We assume there is a pre-specified measurement failure probability |$p_{\textrm{fail}} \in \left [{0},{\frac{1}{2}}\right )$|, that |$|\mathcal{I}^{\textrm{out}}| = p_{\textrm{fail}} m$|, and that the indices |$i \in \mathcal{I}^{\textrm{out}}$| are chosen randomly. That is, measurement failures are random, though the noise sequence |$\xi _i$| may depend on |$a_i$| (even adversarially), as we specify presently. We assume no prior knowledge of which indices |$i \in [m]$| actually satisfy |$i \in \mathcal{I}^{\textrm{out}}$|, or even of |$p_{\textrm{fail}}$|. We consider the following two models for errors: Model M1. The measurement vectors |$\{a_i\}_{i=1}^m$| are independent of all the values |$\{\xi _i\}_{i=1}^m$|. Model M2.
The inlying measurement vectors |$\{a_i\}_{i\in \mathcal{I}^{\textrm{in}}}$| are independent of the values |$\{\xi _i\}_{i \in \mathcal{I}^{\textrm{out}}}$| of the corrupted observations. Model M1 requires independence between the noise and measurements; the adversary may only corrupt |$\xi _i$| without observing |$a_i$|⁠. Model M2 relaxes this, allowing arbitrary dependence between the corrupted data and the measurement vectors |$a_i$| for |$i \in \mathcal{I}^{\textrm{out}}$|⁠. This is natural in scenarios in which the corruption may depend on the measurement |$a_i$|⁠. The arbitrary corruption causes some technical challenges, but we may still follow the outline in our analysis of phase retrieval without noise in Section 3. As we show in Section 4.1, the objective |$f(x)$| is still stable (Condition C1) as long as the measurement vectors are light-tailed, though Gaussianity is unnecessary. The quadratic approximation conditions (Condition C2) are completely identical to those in Section 3.2, so we ignore them. Thus, as long as |$\left |\!\left |\!\left |{A} \right |\!\right |\!\right |_{\textrm{op}}$| is not too large (meaning |$f_x(y) \approx f(y)$|⁠) and we can find a good initializer, the prox-linear iterations (5) will converge quadratically to |$x_\star $|⁠. Finding a good initializer |$x_0$| is somewhat trickier, but in Section 4.2 we provide a spectral method, inspired by Wang et al. [46], that works with high probability as soon as |$m/n \ge C$| for some numerical constant |$C$|⁠. We defer our arguments to Appendix B. 4.1 Stability The outlying indices, even when corruptions are chosen adversarially, have limited effect on the growth and identification behavior of |$f(x) = \frac{1}{m} \left \|{(Ax)^2 - b}\right \|_1$|⁠. In particular, for |$p_{\textrm{fail}}$| smaller than a numerical constant, which we can often specify, the stability condition C1 holds with high probability whenever |$m/n$| is large. More precisely, we have the following proposition, which applies to independent |$\sigma ^2$|-sub-Gaussian measurements. (See Appendix B.1 for a proof.) Proposition 4 Let Assumption 4 hold and |$\kappa _{\textrm{st}} = \inf _{u, v \in{\mathbb{S}}^{n-1}} {\mathbb{E}}[| \langle a, u \rangle \langle a, v \rangle |]$|⁠. Then under either of the models M1 or M2, there are numerical constants |$c> 0$| and |$C < \infty $| such that \begin{equation*} f(x) - f(x_\star) \ge \left(\kappa_{\textrm{st}} - 2 p_{\textrm{fail}} - C \sigma^2 \sqrt[3]{\frac{n}{m}} - C \sigma^2 t \right) \left\|{x - x_\star}\right\|_2 \left\|{x + x_\star}\right\|_2 \ \mbox{for all } x \in \mathbb{R}^{n} \end{equation*} with probability at least |$1 - 2 \textrm{e}^{-c m} - 2 \textrm{e}^{-m t^2}$|⁠. We continue our running example of Gaussian random variables to motivate the proposition. Example 4 (Gaussian vectors) We claim that for |$a \sim{\mathsf{N}}(0, I_n)$| we have \begin{equation*} \kappa_{\textrm{st}} := \inf_{u, v \in{\mathbb{S}}^{n-1}} {\mathbb{E}}[|\langle a, u \rangle \langle a, v \rangle |] = \frac{2}{\pi}. \end{equation*} Let |$Z_u = \langle a, u \rangle $|⁠, |$Z_v = \langle a, v \rangle $|⁠, and let |$X, Y$| be independent |${\mathsf{N}}(0, 1)$|⁠. Then \begin{align*} \inf_{u, v \in{\mathbb{S}}^{n-1}} {\mathbb{E}}[|Z_u Z_v|] & = \inf_{\rho \in [0, 1]} \big\{f(\rho) := {\mathbb{E}}\big[\big|\rho X^2 - (1- \rho) Y^2\big|\big]\big\}. \end{align*} The function |$f(\cdot )$| is convex and symmetric around |$\frac{1}{2}$|⁠. Thus, |$f(\rho ) \geq f(1/2) = 2/\pi $|⁠. 
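As a quick numerical check of this computation, the following Julia snippet estimates |$f(\rho ) = {\mathbb{E}}\big[\big|\rho X^2 - (1- \rho) Y^2\big|\big]$| by Monte Carlo and compares it with |$2/\pi $|. It is only an illustrative sketch (the function name and sample size are our choices); the exact value at |$\rho = 1/2$| also follows from |${\mathbb{E}}|X^2 - Y^2| = {\mathbb{E}}|X - Y| \, {\mathbb{E}}|X + Y| = 4/\pi $|, since |$X - Y$| and |$X + Y$| are independent |${\mathsf{N}}(0, 2)$| variables.

```julia
# Monte Carlo sketch of Example 4: estimate f(rho) = E|rho*X^2 - (1-rho)*Y^2|
# for independent standard normals X, Y, and compare with the minimum 2/pi at rho = 1/2.
using Statistics

function f_estimate(rho::Float64; N::Int = 10^7)
    X, Y = randn(N), randn(N)
    return mean(abs.(rho .* X .^ 2 .- (1 - rho) .* Y .^ 2))
end

for rho in (0.5, 0.6, 0.75, 1.0)
    println("f(", rho, ") ≈ ", round(f_estimate(rho), digits = 4))
end
println("2/pi = ", round(2 / pi, digits = 4))   # the claimed minimum, attained at rho = 1/2
```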
In the Gaussian measurement case, whenever |$p_{\textrm{fail}} < \frac{1}{\pi } \approx 0.318$|, there is a numerical constant |$\lambda> 0$| such that we have the stability |$f(x) - f(x_\star ) \ge \lambda \left \|{x - x_\star }\right \| \left \|{x + x_\star }\right \|$| with high probability as long as |$m/n$| is larger than some numerical constant. 4.2 Initialization The last ingredient for achieving strong convergence guarantees for the prox-linear algorithm for phase retrieval is to provide a good initialization |$x_0 \approx x_\star $|. The strategies in the noiseless setting in Section 3 will fail because of corruptions. With this in mind, we present Algorithm 3, which provides an initializer in corrupted problems. Before turning to the analysis, we provide some intuition for the algorithm. We must construct two estimates: an estimate |$\widehat{d}$| of the direction |$d_\star = x_\star / \left \|{x_\star }\right \|_2$| and an estimate |$\widehat{r}$| of the radius, or magnitude, of the signal |$r_\star = \left \|{x_\star }\right \|_2$|. For the former, a variant of the spectral initialization (Algorithm 2) suffices. If we take |$\mathcal{I}_{\textrm{sel}} \subset [m]$| to be a set of indices corresponding to (say) the smallest values |$b_i$|, in either model M1 or M2 the indices |$i \in \mathcal{I}^{\textrm{out}}$| are independent of the measurements |$a_i$|, so we expect as in Section 3.3 that |$X^{\textrm{init}} = |\mathcal{I}_{\textrm{sel}}|^{-1} \sum _{i \in \mathcal{I}_{\textrm{sel}}} a_i a_i^T = z I_n - z^{\prime} d_\star{d_\star }^T + \Delta $|, where |$z, z^{\prime}$| are random positive constants and |$\Delta $| is an error matrix coming from both randomness in the |$a_i$| and the corruptions. As long as the error |$\Delta $| is small, the minimum eigenvector of |$X^{\textrm{init}}$| should be approximately |$d_\star $|. Once we have a good direction estimate |$\widehat{d} \approx d_\star $|, a natural idea to estimate |$r_\star $| is to pretend that |$\widehat{d}$| is the direction of the signal, substitute the variable |$x = \sqrt{r} \widehat{d}$| into the objective (2), and solve for |$r$| to get a robust estimate of the signal strength |$\left \|{x_\star }\right \|$|. As we show presently, this procedure succeeds with high probability (and the estimate |$\widehat{r}$| is good even when the data are non-Gaussian). Let us make these ideas precise. First, we show that our estimate |$\widehat{r}$| of |$\left \|{x_\star }\right \|_2$| is accurate (see Appendix B.2 for a proof). Proposition 5 Let Assumption 4 hold and |${\mathbb{E}}[aa^T] = I_n$|. Let |$\delta \in [0, 1]$| and |$p_{\textrm{fail}} \in [0, \frac{1}{2}]$|. There exist numerical constants |$0 < c$|, |$C < \infty $| such that if |$\widehat{d}$| is an estimate of |$d_\star $| for which \begin{equation} \delta := \frac{C \sigma^2}{1 - 2 p_{\textrm{fail}}} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}) \le 1, \end{equation} (16) then with probability at least |$1 - 2 \textrm{e}^{-c m (1 - 2 p_{\textrm{fail}})^2 / \sigma ^4}$| all minimizers |$\widehat{r}^2$| of |$G(r) = \frac{1}{m} \sum _{i = 1}^m|b_i - r \langle a_i, \widehat{d} \rangle ^2|$|, defined in Algorithm 3, satisfy |$\widehat{r}^2 \in [1 \pm \delta ] \left \|{x_\star }\right \|_2^2$|. Given Proposition 5, finding a good initialization of |$x_\star $| reduces to finding a good estimate |$\widehat{d}$| of the direction |$d_\star = x_\star / \left \|{x_\star }\right \|_2$|.
To make this precise, let |$r_\star := \left \|{x_\star }\right \|$| and assume that |$\delta = \! \frac{C \sigma ^2}{1 - 2p_{\textrm{fail}}} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star \}) \le 1$| as in Proposition 5; assume also the relative error guarantee |$|\widehat{r} - r_\star | \le \delta r_\star $|⁠. Using the triangle inequality and Proposition 5, for |$x_0 = \widehat{r} \widehat{d}$| we have \begin{align*} \mathop{\textrm{dist}}(x_0, X_\star) &\leq \widehat{r} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}) + |r_\star - \widehat{r}| \\ &\leq r_\star\left[(1+\delta)\mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}) + \delta\right] \leq 2\delta r_\star = \frac{C \sigma^2}{1 - 2p_{\textrm{fail}}} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}) r_\star, \end{align*} as claimed. We turn to the directional estimate; to make the analysis cleaner we make the normality Assumption 6 The measurement vectors |$a_i \stackrel{\textrm{iid}}{\sim } {\mathsf{N}}(0, I_n)$|⁠. To state our guarantee on |$\widehat{d}$|⁠, we require additional notation for quantiles of Gaussian and |$\chi ^2$|-random variables. Let |$W \sim{\mathsf{N}}(0, 1)$| and define the constant |$q_{\textrm{fail}}$| and its associated |$\chi ^2$|-quantile |$w_q^2$| by \begin{equation} {\mathbb{P}}(W^2 \le w_q^2) = q_{\textrm{fail}} := \frac{1}{2(1-p_{\textrm{fail}} )} + \frac{1-2p_{\textrm{fail}} }{4(1-p_{\textrm{fail}})} = \frac{3 - 2 p_{\textrm{fail}}}{4(1 - p_{\textrm{fail}})} < 1. \end{equation} (17) Secondly, define the constant |$\delta _q = 1 - {\mathbb{E}}[W^2 \mid W^2 \le w_q^2]$|⁠. We have the following guarantee. Proposition 6 Let |$\widehat{d}$| be the smallest eigenvector of |$X^{\textrm{init}} := \frac{1}{m} \sum _{i=1}^m a_i a_i^T 1\left \{i\in \mathcal{I}_{\textrm{sel}}\right \}$|⁠, as in Algorithm 3. Let the constant |$M = 0$| if Model M1 holds and |$M = 1$| if Model M2 holds. There are numerical constants |$0 < c, C < \infty $| such that for |$t \ge 0$|⁠, with probability at least \begin{equation*} 1 - \exp(-mt) - \exp(-c(1 - 2p_{\textrm{fail}}) m) - \exp\!\left(-c \frac{m \delta_q^2}{w_q^2}\right) \end{equation*} we have \begin{equation*} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}) \le \frac{C \sqrt{\frac{n}{m} + t}}{ \big[{(1 - 2p_{\textrm{fail}}) \delta_q - C \sqrt{\frac{n}{m} + t} - M p_{\textrm{fail}}}\big]_+}. \end{equation*} For intuition, we provide a few simplifications of Proposition 6 by bounding the quantities |$w_q$| and |$\delta _q$| defined in Equation (17). Using the conditional expectation bound in Lemma 3.2 and a more careful calculation for Gaussian random variables (see Lemma C.1 in the Appendices) we have \begin{equation*} {\mathbb{E}} \big[a_{i, 1}^2 \mid a_{i, 1}^2 \leq w_q^2 \big] \leq \min\left\{\frac{w_q^2}{3}, 1 - \frac{1}{2} w_q^2 {\mathbb{P}}(a_{i,1}^2 \ge w_q^2)\right\} < 1, \end{equation*} so |$\delta _q \ge \max \big\{1 - \frac{w_q^2}{3}, \frac{1}{2} w_q^2 {\mathbb{P}}(W^2 \ge w_q^2)\big\}$|⁠. For |$p_{\textrm{fail}} \le \frac{3}{8}$|⁠, we may take |$w_q^2 \le 2.71$| and |$\delta _q> \frac{1}{11}$|⁠. More generally, a standard Gaussian calculation that |$\Phi ^{-1}(p) \ge \sqrt{|\log (1 - p)|}$| as |$p \to 1$| shows that |$w_q^2 \ge \log \frac{8(1 - p_{\textrm{fail}})}{1 - 2p_{\textrm{fail}}}$| as |$p_{\textrm{fail}} \to \frac{1}{2}$|⁠, so |$\delta _q \ge \frac{1 - 2p_{\textrm{fail}}}{8(1 - p_{\textrm{fail}})} \log \frac{8(1 - p_{\textrm{fail}})}{1 - 2p_{\textrm{fail}}}$|⁠. 
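These constants are straightforward to evaluate numerically. The following Julia sketch (which assumes the Distributions.jl package and uses the closed form |$\delta _q = 2 w_q \varphi (w_q) / (2\varPhi (w_q) - 1)$|, obtained by integrating |$t^2 \varphi (t)$| by parts, where |$\varphi $| and |$\varPhi $| denote the standard normal density and distribution function) computes |$q_{\textrm{fail}}$|, |$w_q^2$| and |$\delta _q$| for a given |$p_{\textrm{fail}}$|; at |$p_{\textrm{fail}} = 3/8$| it returns |$w_q^2 \approx 2.71$| and |$\delta _q \approx 0.38$|, consistent with the bounds above.

```julia
# Evaluate the quantile constants in equation (17): q_fail, w_q^2 and
# delta_q = 1 - E[W^2 | W^2 <= w_q^2] for a standard normal W.
using Distributions

function quantile_constants(pfail::Float64)
    q  = (3 - 2 * pfail) / (4 * (1 - pfail))     # q_fail from (17)
    w  = quantile(Normal(), (1 + q) / 2)         # P(W^2 <= w^2) = q  <=>  Phi(w) = (1 + q)/2
    dq = 2 * w * pdf(Normal(), w) / (2 * cdf(Normal(), w) - 1)
    return (qfail = q, wq2 = w^2, deltaq = dq)
end

quantile_constants(3 / 8)   # gives q_fail = 0.9, w_q^2 ≈ 2.71, delta_q ≈ 0.38
```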
Under Model M1, then, as long as the sample is large enough and |$p_{\textrm{fail}} < \frac{1}{2}$|, we can achieve constant accuracy in the directional estimate |$\mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star \})$| with failure probability |$\textrm{e}^{-cm}$|. Under the more adversarial noise model M2, we require a bit more; more precisely, we must have |$(1 - 2p_{\textrm{fail}}) \delta _q - p_{\textrm{fail}}> 0$| to achieve accurate estimates. A numerical calculation shows that if |$p_{\textrm{fail}} < \frac{1}{4}$|, then this condition holds, so that under Model M2 we can achieve constant accuracy in the directional estimate |$\mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star \})$| with high probability. 4.3 Summary and success guarantees With the guarantees of stability, quadratic approximation and good initialization for suitably random matrices |$A \in{\mathbb{R}}^{m \times n}$|, we can provide a theorem showing that the prox-linear approach to the composite optimization phase retrieval objective succeeds with high probability. Roughly, once |$m/n$| is larger than a numerical constant, the prox-linear method with noisy initialization succeeds with exponentially high probability, even with outliers. Combining the convergence Theorem 1 with Propositions 4, 5, 6 and Corollary 3.2, we have the following theorem. Theorem 3 Let Assumption 6 hold. There exist numerical constants |$c> 0$| and |$C < \infty $| such that the following hold for any |$t \ge 0$|. Let either the independent outliers Model M1 hold with |$p_{\textrm{fail}} < \frac{1}{\pi }$| or the adversarial outliers Model M2 hold with |$p_{\textrm{fail}} < \frac{1}{4}$|. Let |$x_0$| be the initializer returned by Algorithm 3, and assume the iterates |$x_k$| of Algorithm 1 are generated without error. Then \begin{equation*} \mathop{\textrm{dist}}(x_0, X_\star) \le \frac{C}{(1 - 2p_{\textrm{fail}})^2} \sqrt{\frac{n}{m} + t} \cdot \left\|{x_\star}\right\|_2 ~~~ \mbox{and} ~~~ \mathop{\textrm{dist}}(x_k, X_\star) \le \left\|{x_\star}\right\|_2 2^{-2^k} ~~\mbox{for all}~ k \in{\mathbb{N}} \end{equation*} with probability at least |$1 - 4 \textrm{e}^{-mt} - \textrm{e}^{-c(1 - 2 p_{\textrm{fail}})^2 m}$|. Thus, we see that the method succeeds with high probability as long as the sample size is large enough, though there is non-trivial degradation when |$p_{\textrm{fail}}$| is large. 5. Optimization methods In practice, we require some care to solve the sub-problems (5), that is, to minimize |$f_x(y) + \frac{L}{2} \left \|{x - y}\right \|_2^2$|, at large scale. In this section, we describe the three schemes we use to solve the sub-problems (one is simply using the industrial Mosek solver). We evaluate these more carefully in the experimental section to come. To fix notation, let |$\phi (\cdot )$| be the element-wise square operator. Recalling that |$f(x) = \frac{1}{m} \left \|{\phi (A x) - b}\right \|_1$|, we perform a few simplifications to more easily describe the schemes we use to solve the prox-linear sub-problems. Assuming we begin an iteration at the point |$x_0$|, defining the diagonal matrix |$D = 2 \mathop{\textrm{diag}}(A x_0) \in{\mathbb{R}}^{m \times m}$| and setting |$c = b + \phi (A x_0)$|, we recall inequality (8) and have \begin{align*} f(y) & \le f_{x_0}(y) + \frac{1}{m} (y - x_0)^T A^T A (y - x_0) \le \frac{1}{m} \left\|{D A y - c}\right\|_1 + \left|\!\left|\!\left|{m^{-1} A^T A} \right|\!\right|\!\right|_{\textrm{op}} \left\|{y - x_0}\right\|_2^2.
\end{align*} Rewriting this with an appropriately rescaled diagonal matrix |$D$| and vector |$c$|, implementing Algorithm 1 becomes equivalent to solving a sequence of optimization problems of the form \begin{equation} \mathop{\textrm{minimize}}_x ~ \left\|{D A x - c}\right\|_1 + \frac{1}{2} \left\|{x}\right\|_2^2. \end{equation} (18) In small-scale scenarios, the problem (18) is straightforward to solve via standard interior point method software; we use Mosek via the Convex.jl toolbox [42]. We do not describe this further. In larger-scale scenarios, we use more specialized methods, which we now describe. 5.1 Graph splitting methods for the prox-linear sub-problem When the matrices are large, we use a variant of the Alternating Directions Method of Multipliers procedure known as the proximal operator graph splitting (POGS) method of Parikh and Boyd [34,35], which minimizes objectives of the form |$f(x) + g(y)$| subject to a linear constraint |$Bx = y$|. We experimented with a number of specialized first-order and interior-point methods for solving problem (18), but in our experience, POGS offers the best empirical performance. Let us describe the POGS method. Let the matrix |$B = DA$| for shorthand; evidently, problem (18) has precisely this form and is equivalent to \begin{equation*} \mathop{\textrm{minimize}}_{x \in{\mathbb{R}}^n, y \in{\mathbb{R}}^m} ~ \left\|{y - c}\right\|_1 + \frac{1}{2} \left\|{x}\right\|_2^2\ \textrm{subject}\ \textrm{to}\ B x = y. \end{equation*} The POGS method iterates to solve problem (18) as follows. Introduce dual variables |$\lambda _k \in{\mathbb{R}}^n$| and |$\nu _k \in{\mathbb{R}}^m$| associated to |$x$| and |$y$|, and consider the iterations \begin{equation} \begin{split} x_{k + \frac{1}{2}} & = \mathop{{\textrm{argmin}}}_x \left\{\frac{1}{2} \left\|{x}\right\|_2^2 + \frac{\rho}{2} \left\|{x - (x_k - \lambda_k)}\right\|_2^2 \right\} \\ y_{k + \frac{1}{2}} & = \mathop{{\textrm{argmin}}}_y \left\{\left\|{y - c}\right\|_1 + \frac{\rho}{2} \left\|{y - (y_k - \nu_k)}\right\|_2^2 \right\} \\ \left[\begin{matrix} x_{k + 1} \\ y_{k + 1} \end{matrix}\right] & = \left[\begin{matrix} I_n & B^T \\ B & -I_m \end{matrix}\right]^{-1} \left[\begin{matrix} I_n & B^T \\ 0 & 0 \end{matrix}\right] \left[\begin{matrix} x_{k + \frac{1}{2}} + \lambda_k \\ y_{k + \frac{1}{2}} + \nu_k \end{matrix} \right] \\ \lambda_{k+1} & = \lambda_k + (x_{k + \frac{1}{2}} - x_{k + 1}), ~~ \nu_{k+1} = \nu_k + (y_{k + \frac{1}{2}} - y_{k + 1}). \end{split} \end{equation} (19) Each of the steps of the method (19) is trivial except for the matrix inversion, or the ‘graph projection’ step, which projects the pair |$(x_{k + \frac{1}{2}} + \lambda _k, y_{k + \frac{1}{2}} + \nu _k)$| to the set |$\{x, y : Bx = y\}$|. The first two updates amount to \begin{equation*} x_{k + \frac{1}{2}} = \frac{\rho}{1 + \rho} (x_k - \lambda_k) ~~ \mbox{and} ~~ y_{k + \frac{1}{2}} = c + \mathop{\textrm{sgn}}(y_k - \nu_k - c) \odot \left[{|y_k - \nu_k - c| - 1/\rho}\right]_+, \end{equation*} where |$\odot $| denotes element-wise multiplication and each operation is element-wise. The matrix |$B$| is tall, so, setting |$v_k = x_{k + \frac{1}{2}} + \lambda _k + B^T (y_{k + \frac{1}{2}} + \nu _k) \in{\mathbb{R}}^n$|, the solution of the system \begin{equation} \left[\begin{matrix} I_n & B^T \\ B & -I_m \end{matrix} \right] \left[\begin{matrix} x \\ y \end{matrix} \right] = \left[\begin{matrix} v_k \\ 0_m \end{matrix}\right] ~~ \mbox{is} ~~ x_{k+1} = (I_n + B^T B)^{-1} v_k ~~ \mbox{and} ~~ y_{k+1} = B x_{k+1}.
\end{equation} (20) In this iteration, it is straightforward to cache the matrix |$(I_n + B^T B)^{-1}$|, or a Cholesky factorization of the matrix, so that we can repeatedly compute the multiplication (20) in time |$n^2 + nm$|. Following Parikh and Boyd [35], we define the primal and dual residuals \begin{equation*} r^{\textrm{pri}}_{k + 1} := \left[ \begin{matrix} x_{k+1} - x_{k + \frac{1}{2}} \\ y_{k+1} - y_{k + \frac{1}{2}} \end{matrix}\right] ~~ \mbox{and} ~~ r^{\textrm{dual}}_{k + 1} := \rho \left[\begin{matrix} x_k - x_{k + 1} \\ y_k - y_{k + 1} \end{matrix}\right]. \end{equation*} These residuals define a natural stopping criterion [35], where one terminates the iteration (19) once the residuals satisfy \begin{equation} \left\|{r^{\textrm{pri}}_k}\right\|_2 < \epsilon \left(\sqrt{n} + \max\{\left\|{x_k}\right\|_2, \left\|{y_k}\right\|_2\}\right) ~~ \mbox{and} ~~ \left\|{r^{\textrm{dual}}_k}\right\|_2 < \epsilon \left(\sqrt{n} + \max\{\left\|{\lambda_k}\right\|_2,\left\|{\nu_k}\right\|_2\}\right) \end{equation} (21) for some |$\epsilon > 0$|, which must be specified. In our case, the quadratic convergence guarantees of Theorem 1 suggest a strategy of decreasing |$\epsilon $| across iterative solutions of the sub-problem (18). That is, we begin with some |$\epsilon = \epsilon _0$|. We perform iterations of the prox-linear Algorithm 1 using the POGS iteration (19) (until the residual criterion (21) holds) to solve the inner problem (18). Periodically, in the outer prox-linear iterations of Algorithm 1, we decrease |$\epsilon $| by some large multiplicative factor. In our experiments with the POGS method for sub-problem solutions, we perform two phases. In the first phase, we iterate the prox-linear method (Algorithm 1) until either |$k = 25$| or |$\left \|{x_k - x_{k + 1}}\right \|_2 \le \delta / \left |\!\left |\!\left |{(1/m) A^TA} \right |\!\right |\!\right |_{\textrm{op}}$|, where |$\delta = 10^{-3}$|, using accuracy parameter |$\epsilon = 10^{-5}$| for POGS (usually, the iterations terminate well before |$k = 25$|). We then decrease this parameter to |$\epsilon = 10^{-8}$|, and begin the iterations again. 5.2 Conjugate gradient methods for sub-problems In a number of scenarios, it is possible to multiply by the matrix |$A$| quickly, though the direct computation of |$(I_n + A^T D^2 A)^{-1}$| in expression (20) (recall |$B = DA$|) may be difficult. For example, if |$A$| is structured (say, a Fourier or Hadamard transform matrix, or sparse), computing multiplication by |$I_n + A^T D^2 A$| quickly is possible. This suggests [40, Part VI] using conjugate gradient methods. We make this more explicit here to mesh with our experiments to come. Let |$H_n$| be an |$n \times n$| orthogonal matrix for which computing the multiplication |$H_n v$| is efficient (e.g. a Hadamard or discrete cosine transform matrix). We assume that the measurement matrix |$A$| takes the form of repeated randomized measurements under |$H_n$|, that is, \begin{equation} A = \left[\begin{array}{cccc} H_n S_1 & H_n S_2 & \cdots & H_n S_k \end{array}\right]^T, \end{equation} (22) where |$S_l \in{\mathbb{R}}^{n \times n}$| are (random) diagonal matrices, so that |$A \in{\mathbb{R}}^{m \times n}$| with |$m = kn$|. In this case, the expensive part of the system (20) is the solution of |$(I + A^T D^2 A) x = v$|.
This is a positive definite system, and it is possible to compute the multiplication |$(I + A^T D^2 A)x$| in time |$O(k n + k T_{\textrm{mult}})$|, where |$T_{\textrm{mult}}$| is the time to multiply an |$n$|-vector by the matrix |$H_n$| and |$H_n^T$|. When the |$S_l$| are random sign matrices and |$H_n$| is a Hadamard or FFT matrix, the matrix |$A^TD^2 A$| is well-conditioned, as the analysis of random (subsampled) Hadamard and Fourier transforms shows [41]. Thus, in our experiments with structured matrices we use the conjugate gradient method without preconditioning [40], iterating until we have the relative error guarantee |$\left \|{(I + A^T D^2 A)x - v}\right \| \le \epsilon _{\textrm{cg}} \left \|{v}\right \|$|, where |$\epsilon _{\textrm{cg}} = 10^{-6}$|. 6. Experiments We perform a number of simulations as well as experiments on real images to evaluate our method and compare with other state-of-the-art (non-convex) methods for real-valued phase retrieval problems. To standardize notation and remind the reader, in each experiment we generate data via (variants of) the following process. We take a measurement matrix |$A \in{\mathbb{R}}^{m \times n}$| from one of a few distributions and generate a signal |$x_\star \in{\mathbb{R}}^n$| either by drawing |$x_\star \sim{\mathsf{N}}(0, I_n)$| or by taking |$x_\star \in \{-1, 1\}^n$| uniformly at random. We receive observations of the form \begin{equation*} b_i = \langle a_i, x_\star \rangle ^2, \end{equation*} and with probability |$p_{\textrm{fail}} \in [0, \frac{1}{2}]$|, we corrupt the measurements |$b_i$|. In our experiments, to more carefully isolate the relative performance of the iterative algorithms, rather than the initialization used, we compare three initializations. The first two rely on Gaussianity of the measurement matrix |$A$|. (i) Big: The initialization of Wang et al. [46]. Defining \begin{equation*} \mathcal{I}_0 := \left\{i \in [m] : b_i / \left\|{a_i}\right\|_2^2 \ge{\mathsf{quant}}_{5/6}(\{b_i / \left\|{a_i}\right\|_2^2\})\right\} ~~ \mbox{and} ~~ X^{\textrm{init}} := \sum_{i \in \mathcal{I}_0} \frac{1}{\left\|{a_i}\right\|_2^2} a_i a_i^T, \end{equation*} we set the direction |$\widehat{d} = \mathop{\textrm{argmax}}_{\left \|{d}\right \|_2 = 1} d^T X^{\textrm{init}} d$| and |$x_0 = \big(\frac{1}{m} \sum _{i = 1}^m b_i\big)^{\frac{1}{2}} \widehat{d}$|. (ii) Median: The initialization of Zhang et al. [47]. We set |$\widehat{r}^2 = {\mathsf{quant}}_\frac{1}{2}(\{b_i\}) / 0.455$| and define \begin{equation*} \mathcal{I}_0 := \left\{i \in [m] : |b_i| \le 9 \lambda_0\right\} ~~ \mbox{and} ~~ X^{\textrm{init}} := \sum_{i \in \mathcal{I}_0} b_i a_i a_i^T. \end{equation*} We then set the direction |$\widehat{d} = \mathop{\textrm{argmax}}_{\left \|{d}\right \|_2 = 1} d^T X^{\textrm{init}} d$| and |$x_0 = \widehat{r} \widehat{d}$|. (iii) Small: The outlier-aware initialization we describe in Section 4.2. Our convergence results in Section 2 rely on the following four quantities: the quality of the initialization, |$\mathop{\textrm{dist}}(x_0, X_\star )$|; the stability parameter |$\lambda $| such that |$f(x) - f(x_\star ) \ge \lambda \left \|{x - x_\star }\right \|_2 \left \|{x + x_\star }\right \|_2$|; the quadratic upper bound guaranteed by our linearized models |$f_x(y)$|, so that |$|\,f_x(y) - f(y)| \le \frac{L}{2} \left \|{x - y}\right \|_2^2$|; and the accuracy to which we solve the optimization problems.
The random matrix |$A$| governs the middle two quantities—stability |$\lambda $| and closeness |$L$|—and we take |$L = \frac{2}{m}\left |\!\left |\!\left |{A} \right |\!\right |\!\right |_{\textrm{op}}^2$|. Thus, in our experiments we directly vary the initialization scheme used to generate |$x_0$| and the accuracy to which we solve the sub-problems (5), that is, \begin{equation*} x_{k + 1} = \mathop{{\textrm{argmin}}}_x \left\{ \frac{1}{m} \sum_{i = 1}^m \big| \langle a_i, x_k \rangle ^2 + 2 \langle a_i, x_k \rangle \langle a_i, x - x_k \rangle - b_i \big| + \frac{L}{2} \left\|{x - x_k}\right\|_2^2\right\}. \end{equation*} We perform each of our experiments on a server with a 16-core 2.6 GHz Intel Xeon processor with 128 GB of RAM, using Julia as our language, with OpenBLAS as the BLAS library. We restrict each optimization method to use four of the cores (OpenBLAS is multi-threaded). 6.1 Simulations with zero noise and Gaussian matrices For our first collection of experiments, we evaluate the performance of our method for recovering random signals |$x_\star \in{\mathbb{R}}^n$| using Gaussian measurement matrices |$A \in{\mathbb{R}}^{m \times n}$|, varying the number of measurements |$m$| over the ten values |$m \in \{1.8 n, 1.9 n, \ldots , 2.7 n\}$|. (Taking |$m \ge 2.7n$| yielded 100% exact recovery in all the methods we experiment with.) We assume a noiseless measurement model, so that for |$A = [a_1 ~ \cdots ~ a_m]^T \in{\mathbb{R}}^{m \times n}$|, we have |$b_i = \langle a_i, x_\star \rangle ^2$| for each |$i \in [m]$|. We perform experiments with dimensions |$n \in \{400, 600, 800, 1000, 1500, 2000, 3000\}$|, and within each experiment we vary a number of problem parameters, including the initialization scheme and the algorithm we use to solve the sub-problems (5). To serve as our baseline for comparison, we use Truncated Amplitude Flow (TAF) [46], given that it outperforms other non-convex iterative methods for phase retrieval, including Wirtinger Flow [12], Truncated Wirtinger Flow [15] and Amplitude Flow [46]. Setting |$\psi _i = \sqrt{b_i}$|, TAF tries to minimize the loss |$\frac{1}{2 m} \sum _{i = 1}^m (\psi _i - | \langle a_i, x \rangle |)^2$| via a carefully designed (generalized) gradient method. For convenience, we replicate the method (as described by Wang et al. [46]) in Algorithm 4. 6.1.1 Low-dimensional experiments with accurate prox-linear steps We begin by describing our experiments with dimensions |$n \in \{400, 600, 800\}$|. For each of these, we solve the iterative sub-problems to machine precision, using Mosek and the Julia package Convex.jl [42]. We plot representative results for |$n = 400$| and |$n = 800$| in Fig. 1, which summarize 400 independent experiments, and we generate the true |$x_\star $| by taking |$x_\star \in \{-1, 1\}^n$| uniformly at random. (In separate experiments, we drew |$x_\star \sim{\mathsf{N}}(0, I_n)$|, and the results were essentially identical.) In these figures, we use the ‘big’ initialization (i) (the initialization of Wang et al., as it yields the best empirical performance). Following Wang et al. [46], we perform 1000 iterations of TAF (Algorithm 4), and within each experiment, both TAF and the prox-linear method use an identical initializer and data matrix |$A$|. We run the prox-linear method until sequential updates |$x_k$| and |$x_{k + 1}$| satisfy |$\left \|{x_k - x_{k + 1}}\right \|_2 \le \epsilon = 10^{-5}$|, which in every successful experiment we perform (with these data settings) requires six or fewer iterations.
We declare an experiment successful if the output |$\widehat{x}$| of the algorithm satisfies \begin{equation} \mathop{\textrm{dist}}(\,\widehat{x}, X_\star) = \min\left\{\|{\widehat{x} - x_\star}\|_2, \|{\widehat{x} + x_\star}\|_2\right\} \le \epsilon_{\textrm{acc}} \|{x_\star}\|_2, ~~ \mbox{where} ~~ \epsilon_{\textrm{acc}} = 10^{-5}. \end{equation} (23) Fig. 1. Zero-noise experiments in dimensions |$n = 400$| and |$n = 800$|. (a) Fraction of times (in 400 experiments) that one method succeeded while the other failed for |$n = 400$|. (b) Fraction of successful recoveries for |$n = 400$|. (c) Fraction of successful recoveries for |$n = 800$|. In Fig. 1(a), we plot the number of times that one method succeeds (out of the 400 experiments) while the other does not as a function of the ratio |$m/n$|. We see that for these relatively small-dimensional problems, the prox-linear method has mildly better performance than TAF for small |$m/n$|. In Fig. 1(b), we plot the fraction of successful runs of the algorithms, again against |$m/n$| for |$n = 400$|, and in Fig. 1(c) we plot the fraction of successful runs for |$n = 800$|. Even when |$m/n = 2$|, the prox-linear method has a success rate of around 0.6. 6.1.2 Medium-dimensional experiments with inaccurate prox-linear steps We now shift to a description of our experiments in dimensions |$n \in \{1000, 1500, 2000, 3000\}$|, again without noise in the measurements; we perform between 100 and 400 experiments for each dimension in this regime, and we use the ‘big’ initialization (i) [46]. In this case, we use Parikh and Boyd’s POGS method [34] for the prox-linear steps, as we describe in Section 5.1. We perform two phases of the prox-linear method: in the first, we run POGS until the residual errors (21) are below |$\epsilon = 10^{-5}$| within the prox-linear steps, and in the second we solve to accuracy |$\epsilon = 10^{-8}$|. We apply the prox-linear method to the matrix |$A / \sqrt{m}$| with data |$b / m$|, which is equivalent but numerically more stable because |$A / \sqrt{m}$| and |$I_n$| are comparable (see the recommendations [34]). In these accuracy regimes, the time for solution of the prox-linear method and that required for 1000 iterations of TAF are comparable; TAF requires about 1.5 s while the two-phase prox-linear method requires around 4 s in dimension |$n = 1000$|, and TAF requires about 5 s while the two-phase prox-linear method requires around 20 s in dimension |$n = 2000$|, each with |$m = 2n$|. We provide two sets of plots for these results, where again we measure success by the criterion (23) with |$\epsilon _{\textrm{acc}} = 10^{-5}$|. In the first, Fig. 2, we show performance of the prox-linear method against TAF specifically for dimensions |$n = 1000$| and |$3000$|. In Fig. 2(a), we plot the number of trials in which one method succeeds (out of 400 experiments) while the other does not as a function of the ratio |$m/n$|.
In Fig. 2(b) we plot the number of trials in which the prox-linear or TAF method succeeds, while the other does not, for |$n = 3000$| with |$x_\star $| chosen either |${\mathsf{N}}(0, I_n)$| or uniform in |$\{\pm 1\}^n$| (|$\triangledown $| and |$+$| markers, respectively). We ignore ratios |$m/n \ge 2.2$| as both methods succeed in all of our trials. Out of 100 trials, there is only one (with |$m/n = 2$|) in which TAF succeeds, but the prox-linear method does not. In Fig. 2(c), we plot the fraction of successful runs of the algorithms, again against |$m/n$| for |$n = 3000$|. For these larger problems, there is a substantial gap in recovery probability between the prox-linear method and TAF, where with |$m/n = 2$| the prox-linear method achieves recovery more than 78% (|$\pm 4$|), 88% (|$\pm 6$|) and 91% (|$\pm 6$|) of the time, with 95% confidence intervals, for |$n = 1000,\ 2000$| and |$3000$|, respectively. Fig. 2. Zero-noise experiments. (a) Fraction of trials (of 400) in which one method succeeds while the other fails for |$n = 1000$|. (b) Fraction of trials (of 100) in which one method succeeds while the other fails for |$n = 3000$|. (c) Fraction of successful recoveries for |$n = 3000$|. 6.2 Phase retrieval with outlying measurements One of the advantages we claim for the objective (2) is that it is robust to outliers. Zhang et al. [47] develop a method, which they term median-truncated Wirtinger flow, for handling outlying measurements, which (roughly) sets the index set |$\mathcal{I}_k$| used for updates in Algorithm 4 to be a set of measurements near the median values of the |$b_i$|. Their algorithm requires a number of parameters that rely strongly on the assumption that |$a_i \stackrel{\textrm{iid}}{\sim } {\mathsf{N}}(0, I_n)$|; in contrast, the objective (2) and prox-linear method we investigate are straightforward to implement without any particular assumptions on |$A$| (and require no parameter tuning, relying on an explicit sequence of convex optimizations). Nonetheless, to make comparisons between the algorithms as fair as possible, we implement their procedure and perform experiments in dimension |$n \in \{100, 200\}$| with i.i.d. standard normal data matrices |$A$|. We perform 100 experiments as follows. Within each experiment, we evaluate each |$m \in \{1.8 n, 2n, 3n, 4n, 6n, 8n\}$| and failure probability |$p_{\textrm{fail}} \in \{0, 0.01, 0.02, \ldots , 0.29, 0.3\}$|. For fixed |$m, n$|, we draw a data matrix |$A \in{\mathbb{R}}^{m \times n}$|, then choose |$x_\star \in{\mathbb{R}}^n$| either by drawing |$x_\star \sim{\mathsf{Uni}}(\{-1, 1\}^n)$| or |$x_\star \sim{\mathsf{N}}(0, I_n)$|. We then generate |$b_i = \langle a_i, x_\star \rangle ^2$| for |$i \in [m]$|. For our experiments with |$n = 100$|, we corrupt measurements by simply setting the corresponding |$b_i = 0$|, which is more difficult for our initialization strategy, as it corrupts a large fraction of the vectors |$a_i$| used to initialize |$x_0$|. For our experiments with |$n = 200$|, we draw the corrupted |$b_i$| from a Cauchy distribution. Each problem setting (|$b_i$| Cauchy vs. zeroing and |$x_\star $| discrete or normal) yields qualitatively similar results.
Fig. 3. Probability of success for median-truncated Wirtinger flow (left) and the prox-linear method (right) for different initializations, where dimension |$n = 100$|. Horizontal axis indexes |$p_{\textrm{fail}}$| while vertical axis indexes the measurement ratio |$m / n$|. Each pixel represents the fraction of successful recoveries (defined as |$\mathop{\textrm{dist}}(\,\widehat{x}, X_\star ) / \left \|{x_\star }\right \|_2 \le 10^{-5}$|) over 100 experiments. (a) and (b): ‘Big’ initialization, with |$\widehat{r}$| estimated by Algorithm 3. (c) and (d): ‘Median’ initialization. (e) and (f): ‘Small’ initialization. Fig. 4. Probability of success for median-truncated Wirtinger flow (left) and the prox-linear method (right), using the approximate proximal graph splitting method for sub-problem solves. Dimension |$n = 200$|. (a) and (b): ‘Big’ initialization, with |$\widehat{r}$| estimated by Algorithm 3. (c) and (d): ‘Small’ initialization. Figs 3–5 summarize our results. In Figs 3 and 4, we display success rates of the median truncated Wirtinger flow (the left column in each figure) and our composite optimization-based procedure (right column) for a number of initializations; each plot represents results of 100 experiments. Within each plot, a white square indicates that 100% of trials were successful, meaning the signal is recovered to accuracy |$\mathop{\textrm{dist}}(\,\widehat{x}, X_\star ) \le 10^{-5} \left \|{x_\star }\right \|_2$|, while black squares indicate 0% success rates. Within each row of the figure, we present results for the two methods using the same initialization scheme. Fig. 3 gives results with |$n = 100$|, using precise (Mosek-based) solutions of the prox-linear updates, while Fig. 4 gives results with |$n = 200$| using the POGS-based updates (recall step (19)), with identical parameters as in the previous section. It is clear from the figures that the composite objective yields better recovery. Fig. 5. Comparison of median-truncated Wirtinger flow and composite minimization with ‘Median’ initialization. Horizontal axis of each plot indexes |$p_{\textrm{fail}} \in [0, 0.3]$|. (a) Proportion of successful solves for prox-linear method. (b) Number of iterations of prox-linear step (5) over successful solves (with error bars). (c) Proportion of successful solves for median-truncated Wirtinger flow method. (d) Number of matrix multiplications of the form |$x \mapsto (I + A^T D^2 A)^{-1} A x$| for the prox-linear method with proximal graph solves.
In Fig. 5, we present a different view into the behavior of the prox-linear and median-truncated Wirtinger flow (MTWF) methods. In the left two plots, we show the recovery probability (over 100 trials) for our composite optimization method (Fig. 5(a)) and MTWF (Fig. 5(c)). The success probability for the composite method is higher. In Fig. 5(b), we plot the average number of iterations (along with standard error bars) the prox-linear method performs when the dimension |$n = 100$| and we use accurate sub-problem solves. We give iteration counts only for those trials that result in successful recovery; the iteration counts on unsuccessful trials are larger (indeed, if the method is not converging rapidly, this serves as a proxy for failure). In the high measurement regime, |$m/n \ge 2.5$| or so, we see that if |$p_{\textrm{fail}} = 0$| no more than seven iterations are required: this is the quadratic convergence of the method. (Indeed, for |$m/n = 8$|, for |$p_{\textrm{fail}} \le 0.15$| each execution of the prox-linear method uses precisely five iterations, never more, and never fewer.) In Fig. 5(d), we show the number of matrix multiplications by the inverse matrix |$(I_n + A^T D^2 A)^{-1}$| the method uses (recall the update (19) in the proximal graph operator splitting method). We see that for well-conditioned problems and those with little noise, the methods require relatively few matrix multiplications, while for more outliers and when |$m/n$| shrinks, there is a non-trivial increase. Fig. 6. Example recoveries of digits. Left digit is the true digit, middle is the initialization and right is the recovered image. Fig. 7. Performance of the composite optimization scheme using the conjugate gradient method and proximal graph splitting to solve sub-problems (Sections 5.1 and 5.2). Fig. 8. Reconstruction of RNA image (|$n = 2^{22}$|) from |$m = 3n$| measurements. (a) Initialization |$x_0$| of the prox-linear method. (b) Result of 10 inaccurate (|$\epsilon = 10^{-3}$|) solutions of prox-linear step (5). (c) Result of one additional accurate (|$\epsilon = 10^{-7}$|) solution of prox-linear step (5). (d) Original image.
6.3 Recovery of real images Our final collection of experiments investigates recovery of real-world images using more specialized measurement matrices. In this case, we let |$H_n \in \{-1, 1\}^{n \times n} / \sqrt{n}$| denote a normalized Hadamard matrix, where the multiplication |$H_n v$| requires time |$n \log n$|, and |$H_n$| satisfies |$H_n = H_n^T$| and |$H_n^2 = I_n$|. For some |$k \in{\mathbb{N}}$|, we then take |$k$| i.i.d. diagonal sign matrices |$S_1, \ldots , S_k \in \mathop{\textrm{diag}}(\{-1, 1\}^n)$|, uniformly at random, and define |$A = [H_n S_1 ~ H_n S_2 ~ \cdots ~ H_n S_k]^T \in{\mathbb{R}}^{kn \times n}$|, as in expression (22). We note that this matrix explicitly does not satisfy the stability conditions that we use for our theoretical guarantees. Indeed, letting |$e_1$| and |$e_2$| be the first two standard basis vectors, we have |$H_n S (e_1 + e_2) \perp H_n S (e_1 - e_2)$| no matter the sign matrix |$S$|; there are similarly pathological vectors for FFT, discrete cosine and other structured matrices. Nonetheless, we perform experiments with this structured |$A$| matrix as follows.1 Given an image |$X$| represented as a matrix, we define |$x_\star = \mathop{\textrm{Vec}}(X)$|, the vectorized representation of |$X$|. (In the case of colored images, where |$X \in{\mathbb{R}}^{n_1 \times n_2 \times 3}$| because of the 3 RGB channels, we vectorize the channels as well.) We then set |$b = \phi (Ax_\star)$|, where |$\phi (\cdot )$| denotes element-wise squaring, and corrupt a fraction |$p_{\textrm{fail}} \in [0, 0.2]$| of the measurements |$b_i$| by zeroing them. We then follow our standard experimental protocol, initializing |$x_0$| by the ‘small’ initialization scheme (iii), with the slight twist that we now use the POGS method with conjugate gradient methods (Section 5.2) to solve the graph projection step (20). We first give results on a collection of 500 images of handwritten digits (using |$k = 3$|), available on the website for the book [25]. We provide example results of the execution of our procedure in Fig. 6, which shows that while there is signal in the initialization, there is substantial work that the prox-linear method must perform to recover the images. For each of the 500 images, we vary |$p_{\textrm{fail}} \in \{0, 0.025, 0.05, \ldots , 0.175, 0.2\}$|, then execute the prox-linear method. We plot summary results in Fig. 7, which in the blue curve with square markers gives the probability of successful recovery of the digit (left axis) vs. |$p_{\textrm{fail}}$| (horizontal axis). The right axis indexes the number of matrix multiplications the method executes until completion (black line with circular marks). We see that in spite of the demonstrated failure of the matrix |$A$| to satisfy our stability assumptions, we have frequent recovery of the images to accuracy |$10^{-3}$| or better. Finally, we perform experiments with eight real color images with sizes up to |$1024 \times 1024$| (yielding |$n = 2^{22}$|-dimensional problems), where we use |$k = 3$| random Hadamard sensing matrices. The prox-linear method successfully recovers each of the eight images to relative accuracy at least |$10^{-4}$|, and performs an average of |$15100$| matrix-vector multiplications (i.e. fast Hadamard transforms) over the eight experiments, with a standard deviation of |$2600$| multiplications.
To give a sense of the importance of different parts of our procedure, and the relative accuracies to which we solve sub-problems (5), we display one example in Fig. 8. In this example, we perform phase retrieval on an image of RNA nanoparticles in cancer cells [36]. In Fig. 8(a), we show the result of initialization, which displays non-trivial structure, though it is clearly noisy. In Fig. 8(b), we show the result of 10 steps of the prox-linear method, solving each step using POGS until the residual errors (21) are below |$\epsilon = 10^{-3}$|. There are clear artifacts in this image. We then perform one refinement step with higher accuracy (|$\epsilon = 10^{-7}$|), which results in Fig. 8(c). This is indistinguishable, at least to our eyes, from the original image (Fig. 8(d)). Acknowledgements We thank Dima Drusvyatskiy and Courtney Paquette for a number of inspirational and motivating conversations. F. R. was additionally supported by a Stanford Graduate Fellowship. Funding NSF-CAREER Award (1553086 to J.C.D. and F.R.); Toyota Research Institute (to J.C.D. and F.R.). Footnotes 1 We do not compare to other methods designed for outliers because we could not set the constants the methods require in such a way as to yield good performance. References 1. Balakrishnan, S., Wainwright, M. J. & Yu, B. (2017) Statistical guarantees for the EM algorithm: from population to sample-based analysis. Ann. Stat., 45, 77–120. 2. Balan, R., Casazza, P. & Edidin, D. (2006) On signal reconstruction without phase. Appl. Comput. Harmon. Anal., 20, 345–356. 3. Ball, K. (1997) An elementary introduction to modern convex geometry. Flavors of Geometry (S. Levy ed.), pp. 1–58. Berkeley, USA: MSRI Publications. 4. Bandeira, A., Cahill, J., Mixon, D. & Nelson, A. (2014) Saving phase: injectivity and stability for phase retrieval. Appl. Comput. Harmon. Anal., 37, 106–125. 5. Bartlett, P. L. & Mendelson, S. (2002) Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res., 3, 463–482. 6. Bertsekas, D. P. (1977) Approximation procedures based on the method of multipliers. J. Optim. Theory Appl., 23, 487–510. 7. Burke, J. (1985) Descent methods for composite nondifferentiable optimization problems. Math. Program., 33, 260–279. 8. Burke, J. & Ferris, M. (1995) A Gauss–Newton method for convex composite optimization. Math. Program., 71, 179–194. 9. Candès, E., Strohmer, T. & Voroninski, V. (2013a) PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Commun. Pure Appl. Math., 66, 1241–1274. 10. Candès, E. J., Eldar, Y., Strohmer, T. & Voroninski, V. (2013b) Phase retrieval via matrix completion. SIAM J. Imaging Sci., 6, 199–225. 11. Candès, E. J. & Li, X. (2014) Solving quadratic equations via PhaseLift when there are about as many equations as unknowns. Found. Comput. Math., 14, 1017–1026. 12. Candès, E. J., Li, X. & Soltanolkotabi, M. (2015) Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory, 61, 1985–2007.
13. Candès, E. J. & Plan, Y. (2011) Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inf. Theory, 57, 2342–2359. 14. Chai, A., Moscoso, M. & Papanicolaou, G. (2011) Array imaging using intensity-only measurements. Inverse Problems, 27, 015005. 15. Chen, Y. & Candès, E. (2017) Solving random quadratic systems of equations is nearly as easy as solving linear systems. Commun. Pure Appl. Math., 70, 822–883. 16. Drusvyatskiy, D., Ioffe, A. & Lewis, A. (2016) Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria. arXiv:1610.03446 [math.OC]. 17. Drusvyatskiy, D. & Lewis, A. (2018) Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res., to appear. 18. Drusvyatskiy, D. & Paquette, C. (2016) Efficiency of minimizing compositions of convex functions and smooth maps. arXiv:1605.00125 [math.OC]. 19. Eldar, Y. & Mendelson, S. (2014) Phase retrieval: stability and recovery guarantees. Appl. Comput. Harmon. Anal., 36, 473–494. 20. Fickus, M., Mixon, D. G., Nelson, A. A. & Wang, Y. (2014) Phase retrieval from very few measurements. Linear Algebra Appl., 449, 475–499. 21. Fienup, J. R. (1978) Reconstruction of an object from the modulus of its Fourier transform. Opt. Lett., 3, 27–29. 22. Fienup, J. R. (1982) Phase retrieval algorithms: a comparison. Appl. Opt., 21, 2758–2769. 23. Fletcher, R. & Watson, G. A. (1980) First and second order conditions for a class of nondifferentiable optimization problems. Math. Program., 18, 291–307. 24. Gerchberg, R. W. & Saxton, W. O. (1972) A practical algorithm for the determination of phase from image and diffraction plane pictures. Optik, 35, 237–246. 25. Hastie, T., Tibshirani, R. & Friedman, J. (2009) The Elements of Statistical Learning, 2nd edn. New York, USA: Springer. 26. Hiriart-Urruty, J. & Lemaréchal, C. (1993) Convex Analysis and Minimization Algorithms I. New York: Springer. 27. Huber, P. J. & Ronchetti, E. M. (2009) Robust Statistics, 2nd edn. San Francisco, USA: John Wiley. 28. Keshavan, R. H., Montanari, A. & Oh, S. (2010) Matrix completion from a few entries. IEEE Trans. Inf. Theory, 56, 2980–2998. 29. Kreutz-Delgado, K. (2009) The complex gradient operator and the |$\mathbb C \mathbb R$|-calculus. arXiv:0906.4835 [math.OC]. 30. Ledoux, M. (2001) The Concentration of Measure Phenomenon. Providence, USA: American Mathematical Society. 31. Loh, P.-L. & Wainwright, M. J. (2013) Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res., 16, 559–616. 32. Mendelson, S. (2014) Learning without concentration. Proceedings of the Twenty Seventh Annual Conference on Computational Learning Theory, Amsterdam, the Netherlands.
33. Netrapalli, P., Jain, P. & Sanghavi, S. (2013) Phase retrieval using alternating minimization. Adv. Neural Inf. Process. Syst., 26, 2796–2804.
34. Parikh, N. & Boyd, S. (2013) Proximal algorithms. Found. Trends Optim., 1, 123–231.
35. Parikh, N. & Boyd, S. (2014) Block splitting for distributed optimization. Math. Program. Comput., 6, 77–102.
36. Pi, F., Zhang, H. & Guo, P. (2016) RNA nanoparticles in cancer cells. National Cancer Institute Visuals Online. URL https://visualsonline.cancer.gov/details.cfm?imageid=11167. Accessed April 1, 2017.
37. Polyak, B. T. (1979) On the Bertsekas' method for minimization of composite functions. International Symposium on Systems Optimization and Analysis (A. Bensoussan and J. L. Lions eds), vol. 14, Lecture Notes in Control and Information Sciences, pp. 178–186. New York, USA: Springer.
38. Schechtman, Y., Eldar, Y. C., Cohen, O., Chapman, H. N., Miao, J. & Segev, M. (2015) Phase retrieval with application to optical imaging. IEEE Signal Process. Mag., May, 87–109.
39. Stewart, G. W. & Sun, J.-G. (1990) Matrix Perturbation Theory. Cambridge, USA: Academic Press.
40. Trefethen, L. N. & Bau III, D. (1997) Numerical Linear Algebra. Philadelphia, USA: SIAM.
41. Tropp, J. A. (2011) Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data Anal., 3, 115–126. Special issue on Sparse Representation of Data and Images.
42. Udell, M., Mohan, K., Zeng, D., Hong, J., Diamond, S. & Boyd, S. (2014) Convex optimization in Julia. First Workshop on High Performance Technical Computing in Dynamic Languages, pp. 18–28. IEEE.
43. van der Vaart, A. W. & Wellner, J. A. (1996) Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer.
44. Vershynin, R. (2012) Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing: Theory and Applications, Chapter 5, pp. 210–268. Cambridge University Press.
45. Waldspurger, I., d'Aspremont, A. & Mallat, S. (2015) Phase recovery, MaxCut and complex semidefinite programming. Math. Program. A, 149, 47–81.
46. Wang, G., Giannakis, G. B. & Eldar, Y. C. (2016) Solving systems of random quadratic equations via truncated amplitude flow. arXiv:1605.08285 [stat.ML].
47. Zhang, H., Chi, Y. & Liang, Y. (2016) Provable non-convex phase retrieval with outliers: median truncated Wirtinger flow. Proceedings of the 33rd International Conference on Machine Learning. New York, USA.

Appendices

A. Proofs for noiseless phase retrieval

In this Appendix, we collect the proofs of the propositions and results in Section 3. Because we use it multiple times in what follows, we state a standard eigenvector perturbation result, a variant of the Davis–Kahan |$\sin$|-|$\varTheta $| theorem.

Lemma A.1 (Stewart & Sun [39], Theorem 3.6) For vectors |$u, v \in{\mathbb{S}}^{n-1}$|, define the angle |$\theta (u, v) = \cos ^{-1} | \langle u, v \rangle |$|. Let |$X \in{\mathbb{C}}^{n \times n}$| be Hermitian, |$\varDelta $| a Hermitian perturbation and |$Z = X + \varDelta $|, and define |${\mathsf{gap}}(X) = \lambda _1(X) - \lambda _2(X)$| to be the eigengap of |$X$|.
Let |$v_1$| and |$u_1$| be the first eigenvectors of |$X$| and |$Z$|, respectively. Then
\begin{equation*} \sqrt{1 - | \langle u_1, v_1 \rangle |^2} = |\!\sin \theta(u_1, v_1)| \le \frac{\left|\!\left|\!\left|{\varDelta} \right|\!\right|\!\right|_{\textrm{op}}}{\left[{{\mathsf{gap}}(X) - \left|\!\left|\!\left|{\varDelta} \right|\!\right|\!\right|_{\textrm{op}}}\right]_+}. \end{equation*}
We will also have occasion to use the following one-sided variant of Bernstein's inequality.
Lemma A.2 (One-sided Bernstein inequality) Let |$X_i$| be non-negative random variables with |${\textrm{Var}}(X_i) \le \sigma ^2$|. Then
\begin{equation*} {\mathbb{P}}\left(\frac{1}{m} \sum_{i = 1}^m X_i \le \frac{1}{m} \sum_{i = 1}^m {\mathbb{E}}[X_i] - t \right) \le \exp\left(-\frac{m t^2}{2 \sigma^2}\right). \end{equation*}
A.1 Proof of Proposition 1
Our proof uses Mendelson's 'small-ball' techniques for concentration [32] along with control over a particular VC-dimension condition. If we define |$h_i(u, v) := \kappa _{\textrm{st}}^2 1\left \{| \langle a_i, u \rangle | \wedge | \langle a_i, v \rangle | \ge \kappa _{\textrm{st}}\right \}$|, then we certainly have
\begin{equation*} \frac{1}{m} \sum_{i=1}^m | \langle a_i, u \rangle \langle a_i, v \rangle | \geq \frac{1}{m} \sum_{i=1}^m h_i(u, v) ~~ \mbox{for all } u, v \in{\mathbb{R}}^n. \end{equation*}
We now control the class |$\mathcal{F} := \{a \mapsto f(a) = | \langle a, u \rangle | \wedge | \langle a, v \rangle | \mid u, v \in{\mathbb{R}}^n\}$| by its VC-dimension.
Lemma A.3 There exists a numerical constant |$C> 0$| such that, for any |$c \geq 0$|, the VC-dimension of the collection of sets
\begin{equation*} \mathcal{G} := \left\{\left\{x \in{\mathbb{R}}^n ~\mbox{s.t.}~ | \langle x, u \rangle | \wedge | \langle x, v \rangle | \geq c\right\} \mid u \in{\mathbb{R}}^n, v\in{\mathbb{R}}^n\right\} \end{equation*}
is upper bounded by |$Cn$|.
Proof. The collection of half-spaces
\begin{equation*} \mathcal{G}_{\textrm{plane}} := \left\{\left\{x \in{\mathbb{R}}^n ~\mbox{s.t.}~ \langle u, x \rangle + b \geq 0\right\} \mid u \in{\mathbb{R}}^n, b \in{\mathbb{R}}\right\} \end{equation*}
has VC-dimension at most |$n+2$| [43, Lemma 2.6.15]. We have the containment
\begin{equation*} \mathcal{G} \subset \left\{(G_1 \cup G_2) \cap (G_3 \cup G_4) \mid G_1, G_2, G_3, G_4 \in \mathcal{G}_{\textrm{plane}}\right\}, \end{equation*}
and so standard results on the preservation of VC-dimension under set operations [43, Lemma 2.6.17] imply that the VC-dimension of |$\mathcal{G}$| is at most |$C$| times the VC-dimension of |$\mathcal{G}_{\textrm{plane}}$| for some numerical constant |$C> 0$|.
The associated thresholds |$a \mapsto 1\left \{| \langle a, u \rangle | \wedge | \langle a, v \rangle | \ge \kappa _{\textrm{st}}\right \}$| likewise have VC-dimension at most |$Cn$| as |$u, v$| vary. Applying standard VC-concentration inequalities [5, 43, Ch. 2.6] we immediately obtain that
\begin{equation*} {\mathbb{P}}\left(\sup_{u, v \in{\mathbb{S}}^{n-1}} \frac{1}{\kappa_{\textrm{st}}} \left|\frac{1}{m} \sum_{i = 1}^m h_i(u, v) - {\mathbb{E}}[h_i(u, v)]\right| \ge c \sqrt{\frac{n}{m}} + t\right) \le 2 \exp\left(-\frac{m t^2}{2}\right) \end{equation*}
for all |$t \ge 0$|, where |$c < \infty $| is a numerical constant. Substitute |$t^2 \mapsto 2t$| to achieve the result.
A.2 Proof of Proposition 2
We begin with a small lemma relating the Frobenius norm of certain rank 2 matrices to the distances between vectors.
Lemma A.4 Let |$X = xx^{\textrm{H}} - yy^{\textrm{H}} \in{\mathbb{C}}^{n \times n}$|.
Then |$\sqrt{2}\, \left \|{X}\right \|_{\textrm{Fr}} \ge \inf _\theta \|{x - \textrm{e}^{\iota \theta } y}\|_2 \sup _\theta \|{x - \textrm{e}^{\iota \theta } y}\|_2$|. We defer the proof of the lemma to Appendix A.2.1.
To prove the proposition, we define the function |$F : {\mathbb{C}}^{m \times n} \times{\mathbb{C}}^{n \times n} \to {\mathbb{R}}$|, for |$Z = [z_1 ~ \cdots ~ z_m]^{\textrm{H}}$|, by
\begin{equation} F(Z, X) = \frac{1}{m} \sum_{i = 1}^m \big| \langle z_i, X\, z_i \rangle \big|. \end{equation} (A.1)
Now, for a constant |$c> 0$| to be chosen, define the truncated variables
\begin{equation} z_i := \sqrt{n} a_i / \left\|{a_i}\right\|_2 1\left\{\left\|{a_i}\right\|_2 \ge c \sqrt{n}\right\}. \end{equation} (A.2)
Then for any |$A \in{\mathbb{C}}^{m \times n}$| we have
\begin{equation*} F(A, xx^{\textrm{H}} - yy^{\textrm{H}}) \ge c^2 F(Z, xx^{\textrm{H}} - yy^{\textrm{H}}). \end{equation*}
As the function |$F$| is homogeneous in its second argument, based on Lemma A.4, the result of the theorem will hold if we can show that (with suitably high probability)
\begin{equation} F(Z, X) \ge \frac{\tau^2}{4} ~~~ \mbox{for all rank 2 or less}~ X \in{\mathbb{C}}^{n \times n} ~ \mbox{with}~ \left\|{X}\right\|_{\textrm{Fr}} = 1. \end{equation} (A.3)
It is, of course, no loss of generality to assume that |$X$| is Hermitian in inequality (A.3). With our desired inequality (A.3) in mind, we present a covering-number-based argument to prove the theorem. The first step in this direction is to lower bound the expectation of |$F(Z, X)$|. Let |$X \in \mathcal{S}_r$|, |$X = \sum _{j = 1}^r \lambda _j u_j u_j^{\textrm{H}}$|. Then for |$w_i = \sqrt{n} a_i / \left \|{a_i}\right \|_2$| and our choice of |$z_i$|, we have
\begin{align*} {\mathbb{E}}\, [|z_i^{\textrm{H}} X\, z_i|] & = {\mathbb{E}}\,[|w_i^{\textrm{H}} X\, w_i|] - {\mathbb{E}}\,[|w_i^{\textrm{H}} X\, w_i| 1\,\{\|{a_i}\|_2 \le c \sqrt{n}\}]. \end{align*}
Now, we have
\begin{equation*} {\mathbb{E}}\big[|w_i^{\textrm{H}} X \,w_i| 1\big\{\big\|{a_i}\big\|_2 \le c \sqrt{n}\big\}\big] \stackrel{(i)}{\le} \sqrt{{\mathbb{E}}[|w_i^{\textrm{H}} X\, w_i|^2]} \sqrt{h(c)} \stackrel{(ii)}{\le} \bigg(\sum_{j=1}^r \lambda_j^2\bigg)^{\frac{1}{2}} {\mathbb{E}}\bigg[\sum_{j=1}^r | \langle u_j, w_i \rangle |^4 \bigg]^{\frac{1}{2}} \sqrt{h(c)}, \end{equation*}
where the inequalities follow |$(i)$| by the Cauchy–Schwarz inequality and Assumption 3 (for the small-ball probability bound |$h(c)$|) and |$(ii)$| by the Cauchy–Schwarz inequality, respectively. As |$w_i$| is a |$\sigma ^2$|-sub-Gaussian vector, standard results on sub-Gaussians imply that |${\mathbb{E}}[| \langle u, w_i \rangle |^4] \le (1 + e) \sigma ^4$| for any unit vector |$u$|, so the final expectation has bound |$\sum _{j=1}^r {\mathbb{E}}[ | \langle u_j, w_i \rangle |^4] \le (1 + e) r \sigma ^4$|, and thus we obtain for any choice of |$c$| in our construction (A.2) of |$z_i$| and any rank |$r$| matrix |$X$| that
\begin{equation*} {\mathbb{E}}\,\big[|z_i^{\textrm{H}} X\, z_i|\big] \ge{\mathbb{E}}\,\big[|w_i^{\textrm{H}} X \,w_i|\big] - \sqrt{1 + e} \|{X}\|_{\textrm{Fr}} \,\sigma^2 \sqrt{r h(c)}.
\end{equation*} In particular, as we consider only rank |$r = 2$| matrices, Assumption 3 allows us to choose |$c$| in the definition (A.2) small enough that \begin{equation*} h(c) \le \frac{1}{2(1 + e)} \cdot \frac{\tau^4}{\sigma^4}, \end{equation*} and then \begin{equation} {\mathbb{E}}\,\big[|z_i^{\textrm{H}} X\, z_i|\big] \ge{\mathbb{E}}\,\big[|w_i^{\textrm{H}} X \,w_i|\big] - \tau^2 \left\|{X}\right\|_{\textrm{Fr}} / 2 \ge \frac{\tau^2}{2} \left\|{X}\right\|_{\textrm{Fr}}. \end{equation} (A.4) The remainder of our argument now proceeds by a covering argument, which we begin with a lemma on the continuity properties of |$F$|⁠. Lemma A.5 Let Hermitian matrices |$X, Y \in{\mathbb{C}}^{n \times n}$| have rank at most |$r$| and eigen-decompositions |$X = U \Lambda U^{\textrm{H}}$| and |$Y = V \varSigma V^{\textrm{H}}$|⁠, and let |$Z = [z_1 ~ \cdots ~ z_m]^{\textrm{H}} \in{\mathbb{C}}^{m \times n}$|⁠. Then \begin{equation*} |F(Z, X) - F(Z, Y)| \le \frac{2 \sqrt{r}}{m} \left|\!\left|\!\left|{Z} \right|\!\right|\!\right|_{\textrm{op}}^2 \left[ \left\|{U - V}\right\|_{1,2} + \left\|{\varLambda - \varSigma}\right\|_{\textrm{Fr}} \right]. \end{equation*} We provide the proof of Lemma A.5 in Appendix A.2.2. Based on Lemma A.5, we develop a particular covering set of the |$n \times n$| Hermitian matrices of rank |$r$|⁠, following a construction due to Candès and Plan [13, Thm. 2.3]. Let |$\mathcal{S}_r \subset \{X \in{\mathbb{C}}^{n \times n}\}$| be the set of Hermitian rank |$r$| matrices with |$\left \|{X}\right \|_{\textrm{Fr}} = 1$|⁠, and let |$\mathsf{O}_{n,r} = \{U \in{\mathbb{C}}^{n \times r} \mid U^{\textrm{H}} U = I_r\}$| denote the set of |$n\times r$| unitary matrices; for each |$\epsilon> 0$| there exists an |$\epsilon $|-cover |$\overline{\mathsf{O}}_{n,r}$| of |$\mathsf{O}_{n,r}$| in |$\left \|{\cdot }\right \|_{1,2}$|-norm of cardinality at most |$(3 / \epsilon )^{2nr}$|⁠. Similarly, there exists an |$\epsilon $|-cover |$D \subset{\mathbb{R}}^r$| of all vectors |$v \in{\mathbb{R}}^r$| in |$\ell _2$|-norm of cardinality at most |$(3 / \epsilon )^{r}$|⁠. Now, let |$\overline{\mathcal{S}}_r = \{U \mathop{\textrm{diag}}(d) U^{\textrm{H}} \mid U \in \overline{\mathsf{O}}_{n,r}, d \in D\}$| be a subset of the rank-|$r$| Hermitian matrices. Then |$\mathop{\textrm{card}}(\overline{\mathcal{S}}_r) \le (3 / \epsilon )^{(2n + 1)r}$|⁠, and we have the following immediate consequence of Lemma A.5. Lemma A.6 For any |$\epsilon> 0$|⁠, there exists a set |$\overline{\mathcal{S}}_r \subset \mathcal{S}_r$| of cardinality at most |$\mathop{\textrm{card}} (\overline{\mathcal{S}}_r) \le (3/\epsilon )^{(2 n + 1)r}$| such that \begin{equation*} \inf_{X \in \mathcal{S}_r} F(Z, X) \ge \min_{X \in \overline{\mathcal{S}}_r} F(Z, X) - \frac{4\sqrt{r}}{m} \left|\!\left|\!\left|{Z} \right|\!\right|\!\right|_{\textrm{op}}^2 \epsilon. \end{equation*} For our final step, we control the variance of |$F(Z, Y)$|⁠, so that we can apply the one-sided Bernstein inequality (Lemma A.2) and then a covering argument. 
For |$X \in \mathcal{S}_r$|, then, we have
\begin{equation*} {\textrm{Var}}\big(z_i^{\textrm{H}} X\, z_i\big) \le{\mathbb{E}}\left[\left|z_i^{\textrm{H}} X \,z_i\right|^2\right] = {\mathbb{E}}\left[\bigg|\sum_{j = 1}^r \lambda_j | \langle z_i, u_j \rangle |^2\bigg|^2 \right] \le \left\|{X}\right\|_{\textrm{Fr}}^2 \sum_{j = 1}^r {\mathbb{E}}[| \langle z_i, u_j \rangle |^4] \le r (1 + e) \left\|{X}\right\|_{\textrm{Fr}}^2 \sigma^4 \end{equation*}
by the Cauchy–Schwarz inequality and the bound |${\mathbb{E}}[| \langle z_i, u_j \rangle |^4] \le (1 + e) \sigma ^4$|, which follows from sub-Gaussian moment bounds. Thus, Lemma A.2 implies that for any fixed |$X \in \mathcal{S}_r$| we have
\begin{equation*} {\mathbb{P}}\left(F(Z, X) \le \frac{\tau^2}{2} - \sqrt{2(1 + e) r} \cdot \sigma^2 t\right) \le \textrm{e}^{-mt^2} \end{equation*}
for all |$t \ge 0$|. Now, let |$\mathcal{E}_t$| be the event that |$\frac{1}{m} \left |\!\left |\!\left |{Z} \right |\!\right |\!\right |_{\textrm{op}}^2 \le \varepsilon _{n,m}(t) := \sigma ^2 + 11 \sigma ^2 \max \{ \sqrt{4n / m + t}, 4n/m + t\}$|. Then by Lemmas A.6 and 3.1 with |$r = 2$|, we have for any |$K$| and |$\epsilon $| that
\begin{align*} {\mathbb{P}}\left(\inf_{X \in \mathcal{S}_r} F(Z, X) \le K \right) & \le{\mathbb{P}}\left(\min_{X \in \overline{\mathcal{S}}_r} F(Z, X) - \frac{4}{m} \left|\!\left|\!\left|{Z} \right|\!\right|\!\right|_{\textrm{op}}^2 \epsilon \le K\right) \\ & \le{\mathbb{P}}\left(\min_{X \in \overline{\mathcal{S}}_r} F(Z, X) - \frac{4}{m} \left|\!\left|\!\left|{Z} \right|\!\right|\!\right|_{\textrm{op}}^2 \epsilon \le K, \mathcal{E}_t\right) + {\mathbb{P}}(\mathcal{E}_t^c) \\ & \le{\mathbb{P}}\left(\min_{X \in \overline{\mathcal{S}}_r} F(Z, X) - 4 \varepsilon_{n,m}(t) \epsilon \le K \right) + \textrm{e}^{-mt} \\ & \le \left(\frac{3}{\epsilon}\right)^{(2n + 1) r} \sup_{X \in \mathcal{S}_r} {\mathbb{P}}\left(F(Z, X) - 4 \varepsilon_{n,m}(t) \epsilon \le K\right) + \textrm{e}^{-mt}. \end{align*}
Letting |$K = \frac{\tau ^2}{2} - 2\sqrt{(1 + e)} \sigma ^2 t - 4 \varepsilon _{n,m}(t) \epsilon $|, we thus see that
\begin{equation*} {\mathbb{P}}\left(\inf_{X \in \mathcal{S}_2} F(Z, X) \le \frac{\tau^2}{2} - 2 \sqrt{(1 + e)} \sigma^2 t - 4 \varepsilon_{n,m}(t) \epsilon\right) \le \left(\frac{3}{\epsilon}\right)^{2 (2n + 1)} \textrm{e}^{-mt^2} + \textrm{e}^{-mt}. \end{equation*}
Summarizing, we have the following lemma.
Lemma A.7 There exist numerical constants |$C_1 \le 2 \sqrt{(1 + e)}$| and |$C_2 \le 44$| such that
\begin{align*} & {\mathbb{P}}\left(\inf_{X \in \mathcal{S}_2} F(Z, X) \le \frac{\tau^2}{2} - C_1 \sigma^2 t - C_2 \sigma^2 \left(\max\left\{\sqrt{\frac{n}{m} + t}, \frac{n}{m} + t\right\} \epsilon + \epsilon \right) \right) \\ & \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad \le \exp\left(-mt^2 + 2 (2n + 1)\log\frac{3}{\epsilon}\right) + \textrm{e}^{-mt}. \end{align*}
Rewriting this slightly by taking |$\epsilon , t \lesssim \frac{\tau ^2}{\sigma ^2}$|, we see that there exist numerical constants |$c_0> 0$| and |$c_1 < \infty $| such that
\begin{equation*} {\mathbb{P}}\left(\inf_{X \in \mathcal{S}_2} F(Z, X) \le \frac{\tau^2}{4} \right) \le \exp\left(- c_0 m \frac{\tau^4}{\sigma^4} + c_1 n \log \frac{\sigma^2}{\tau^2}\right).
\end{equation*}
Now, returning to inequality (A.3), we see that for |$z_i = \sqrt{n} a_i / \left \|{a_i}\right \|_2 1\{\left \|{a_i}\right \|_2 \ge c \sqrt{n}\}$| chosen as in (A.2) and |$c> 0$| chosen small enough that inequality (A.4) holds, we have
\begin{equation*} \inf_{X \in \mathcal{S}_2} F(Z, X) \ge \frac{\tau^2}{4} \end{equation*}
with probability at least |$1 - \textrm{e}^{-c_0 m \frac{\tau ^4}{\sigma ^4} + c_1 n \log \frac{\sigma ^2}{\tau ^2}}$|. Comparing with inequality (A.3) gives the theorem.
A.2.1 Proof of Lemma A.4
For |$X = xx^{\textrm{H}} - yy^{\textrm{H}}$| we have |$\left \|{X}\right \|_{\textrm{Fr}}^2 = \left \|{x}\right \|_2^4 + \left \|{y}\right \|_2^4 - 2 | \langle x, y \rangle |^2$|, so that
\begin{align*} 2 \left\|{X}\right\|_{\textrm{Fr}}^2 &= 2 \left\|{x}\right\|_2^4 + 2 \left\|{y}\right\|_2^4 - 4 | \langle x, y \rangle |^2 \\ & = 2 \left\|{x}\right\|_2^4 + 2 \left\|{y}\right\|_2^4 - 4\big( {\textrm{Re}}( \langle x, y \rangle )^2 + {\textrm{Im}}( \langle x, y \rangle )^2\big) \\ & = \left( \sqrt{2 \left\|{x}\right\|^4 + 2 \left\|{y}\right\|^4} - 2 \sqrt{{\textrm{Re}}( \langle x, y \rangle )^2 + {\textrm{Im}}( \langle x, y \rangle )^2}\right) \\ &\quad \left( \sqrt{2 \left\|{x}\right\|^4 + 2 \left\|{y}\right\|^4} + 2 \sqrt{{\textrm{Re}}( \langle x, y \rangle )^2 + {\textrm{Im}}( \langle x, y \rangle )^2}\right). \end{align*}
By using that
\begin{equation*} \|{x - \textrm{e}^{\iota \theta} y}\|^2 = \left\|{x}\right\|^2 + \left\|{y}\right\|^2 - 2\left(\cos \theta \cdot{\textrm{Re}}( \langle x, y \rangle ) - \sin \theta \cdot{\textrm{Im}}( \langle x, y \rangle )\right)\!, \end{equation*}
and that |$2 s^2 + 2 t^2 \ge (s + t)^2$| for real |$s, t$| implies |$\sqrt{2 a^4 + 2 b^4} \ge a^2 + b^2$|, the final expression above has lower bound
\begin{align*} & \left( \left\|{x}\right\|^2 + \left\|{y}\right\|^2 - 2 \sqrt{{\textrm{Re}}( \langle x, y \rangle )^2 + {\textrm{Im}}( \langle x, y \rangle )^2}\right) \left( \left\|{x}\right\|^2 + \left\|{y}\right\|^2 + 2 \sqrt{{\textrm{Re}}( \langle x, y \rangle )^2 + {\textrm{Im}}( \langle x, y \rangle )^2}\right)\\ & ~~ = \inf_\theta \|{x - \textrm{e}^{\iota \theta} y}\|^2 \cdot \sup_\theta \|{x - \textrm{e}^{\iota \theta} y}\|^2. \end{align*}
That is, |$2 \left \|{X}\right \|_{\textrm{Fr}}^2 \ge \inf _\theta \|{x - \textrm{e}^{\iota \theta } y}\|_2^2 \, \sup _\theta \|{x - \textrm{e}^{\iota \theta } y}\|_2^2$|; taking square roots gives the lemma.
A.2.2 Proof of Lemma A.5
We have
\begin{align} \nonumber m |F(Z, X) - F(Z, Y)| &\le \sum_{i = 1}^m \left|\sum_{j = 1}^r \left(\lambda_j | \langle u_j, z_i \rangle |^2 - \sigma_j | \langle v_j, z_i \rangle |^2 \right)\right| \\ & \le \sum_{i = 1}^m \left|\sum_{j = 1}^r \left(\lambda_j (| \langle u_j, z_i \rangle |^2 - | \langle v_j, z_i \rangle |^2) \right)\right| + \sum_{i = 1}^m \left|\sum_{j = 1}^r \left((\lambda_j - \sigma_j)| \langle v_j, z_i \rangle |^2 \right)\right|. \end{align} (A.5)
We bound each of the terms in expression (A.5) in turn. For the first, we note that for any complex |$u, v, z \in{\mathbb{C}}^n$|, we have
\begin{equation*} | \langle u, z \rangle |^2 - | \langle v, z \rangle |^2 \le \big|z^{\textrm{H}} uu^{\textrm{H}} z - z^{\textrm{H}} vv^{\textrm{H}} z + u^{\textrm{H}} zz^{\textrm{H}} v - v^{\textrm{H}} zz^{\textrm{H}} u\big| = \left| \langle u - v, z \rangle \langle z, u + v \rangle \right|\!, \end{equation*}
as |$u^{\textrm{H}} zz^{\textrm{H}} v - v^{\textrm{H}} zz^{\textrm{H}} u$| is purely imaginary.
As a consequence, we have
\begin{align*} \sum_{i = 1}^m \left|\sum_{j = 1}^r \lambda_j (| \langle u_j, z_i \rangle |^2 - | \langle v_j, z_i \rangle |^2) \right| & = \sum_{i = 1}^m \left|\sum_{j = 1}^r \lambda_j \langle u_j u_j^{\textrm{H}} - v_j v_j^{\textrm{H}}, z_i z_i^{\textrm{H}} \rangle \right| \\ & \le \sum_{j = 1}^r |\lambda_j| \sum_{i = 1}^m | \langle z_i, u_j - v_j \rangle | | \langle z_i, u_j + v_j \rangle | \\ & \le \sum_{j = 1}^r |\lambda_j| \bigg(\sum_{i = 1}^m | \langle z_i, u_j - v_j \rangle |^2 \bigg)^{\frac{1}{2}} \bigg(\sum_{i = 1}^m | \langle z_i, u_j + v_j \rangle |^2 \bigg)^{\frac{1}{2}} \\ & \le 2 \sum_{j = 1}^r |\lambda_j| \big|\!\big|\!\big|{Z^{\textrm{H}} Z} \big|\!\big|\!\big|_{\textrm{op}} \left\|{u_j - v_j}\right\|_2 \le 2 \big\|{\lambda}\big\|_1 \big|\!\big|\!\big|{Z^{\textrm{H}} Z} \big|\!\big|\!\big|_{\textrm{op}} \big\|{U - V}\big\|_{1,2}, \end{align*}
where we have used that |$\left \|{u_j}\right \|_2 = \left \|{v_j}\right \|_2 = 1$| so |$\left \|{u_j + v_j}\right \|_2 \le 2$|. For the second term in inequality (A.5), we have
\begin{align*} \sum_{i = 1}^m \left|\sum_{j = 1}^r \left((\lambda_j - \sigma_j) | \langle v_j, z_i \rangle |^2 \right)\right| & \le \sum_{j = 1}^r |\lambda_j - \sigma_j| \sum_{i = 1}^m | \langle v_j, z_i \rangle |^2 \le \left\|{\lambda - \sigma}\right\|_1 \big|\!\big|\!\big|{Z^{\textrm{H}} Z} \big|\!\big|\!\big|_{\textrm{op}}. \end{align*}
Noting that for vectors |$\lambda , \sigma \in{\mathbb{R}}^r$| we have |$\left \|{\lambda }\right \|_1 \le \sqrt{r} \left \|{\lambda }\right \|_2$|, we obtain the lemma.
A.3 Proof of Proposition 3: deterministic part
Our proof of Proposition 3 eventually reduces to applying eigenvector perturbation results to the random matrices |$X^{\textrm{init}}$|. To motivate our approach, note that
\begin{equation} X^{\textrm{init}} := \frac{1}{|\mathcal{I}_{\textrm{sel}}|} \sum_{i \in \mathcal{I}_{\textrm{sel}}} a_i a_i^{\textrm{H}} = I_n - \phi(\epsilon) \cdot d_\star d_\star^{\textrm{H}} + \varDelta, \end{equation} (A.6)
where |$\varDelta $| is a random error term that we will show has small norm. From the quantity (A.6) we can apply eigenvector perturbation arguments to derive that the directional estimate |$\widehat{d} = \mathop{{\textrm{argmin}}}_{d \in{\mathbb{S}}^{n-1}} d^{\textrm{H}} X^{\textrm{init}} d$| satisfies |$\widehat{d} \approx d_\star $|. This will hold so long as |$\left |\!\left |\!\left |{\varDelta } \right |\!\right |\!\right |_{\textrm{op}}$| is small because there is substantial separation in the eigenvalues of |$I_n$| and |$I_n - \phi (\epsilon ) d_\star d_\star ^{\textrm{H}} $|. With this goal in mind, we define two index sets that with high probability surround |$\mathcal{I}_{\textrm{sel}}$|. Let
\begin{align*} \mathcal{I}_{-\epsilon} := \left\{i \in [m] \mid | \langle a_i, d_\star \rangle |^2 < \frac{1-\epsilon}{2} \right\} ~~\textrm{and}~~ \mathcal{I}_{+\epsilon} := \left\{i \in[m] \mid | \langle a_i, d_\star \rangle |^2 \leq \frac{1+\epsilon}{2} \right\}. \end{align*}
We now define events |$\mathcal{E}_1$| through |$\mathcal{E}_5$|, showing that conditional on these five, the result of the proposition holds.
These events roughly guarantee (⁠|$\mathcal{E}_1$|⁠) that |$\sum _{i = 1}^m a_i a_i^{\textrm{H}} $| is well-behaved, (⁠|$\mathcal{E}_2$| and |$\mathcal{E}_3$|⁠) that |$\mathcal{I}_{+\epsilon } \setminus \mathcal{I}_{-\epsilon }$| is small, and (⁠|$\mathcal{E}_4$| and |$\mathcal{E}_5$|⁠) that most of the vectors |$a_i$| for indices |$i \in \mathcal{I}_{\textrm{sel}}$| are close enough to uniform on the subspace perpendicular to |$d_\star $| that we have a good directional estimate. Now, let |$q \in [1, \infty ]$| and let |$1/p + 1/q = 1$|⁠, so that |$p$| is its conjugate. Recalling the definition of the error |$\varDelta (\epsilon )$| in Assumption 5(ii), we define \begin{equation} \begin{split} \mathcal{E}_1 & := \left\{ \frac{1}{m} \left|\!\left|\!\left|{A^{\textrm{H}} A} \right|\!\right|\!\right|_{\textrm{op}} \in [1 - \epsilon, 1 + \epsilon] \right\}, ~~ \mathcal{E}_2 := \left\{|\mathcal{I}_{+\epsilon}| \le |\mathcal{I}_{-\epsilon}| + 2 \epsilon \kappa m \right\}\!, \\ \mathcal{E}_3 & := \left\{|\mathcal{I}_{-\epsilon}| \ge \frac{1}{2} m \mathsf{p}_0(d_\star) \right\}, ~~ \mathcal{E}_4 := \left\{ \bigg|\!\bigg|\!\bigg|{\frac{1}{m} \sum_{i \in \mathcal{I}_{+\epsilon} \setminus \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} }\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \le 4 q \sigma^2 (\kappa \epsilon)^{\frac{1}{p}} \right\} \\ \mathcal{E}_5 & := \left\{\bigg|\!\bigg|\!\bigg|{ \frac{1}{|\mathcal{I}_{-\epsilon}|} \sum_{i \in \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} - \left(I_n - \phi(\epsilon) d_\star d_\star^{\textrm{H}} \right)}\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \le \left|\!\left|\!\left|{\varDelta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} + \epsilon \right\}. \end{split} \end{equation} (A.7) We prove the result of the proposition when each |$\mathcal{E}_i$| occurs. Decompose the matrix |$X^{\textrm{init}}$| into \begin{align*} X^{\textrm{init}} & = \underbrace{I_n - \phi(\epsilon) d_\star d_\star^{\textrm{H}} }_{:= Z_0} \\ & ~~ + \underbrace{\left[\frac{1}{|\mathcal{I}_{-\epsilon}|} \sum_{i\in \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} - (I_n - \phi(\epsilon)d_\star d_\star^{\textrm{H}} )\right]}_{:= Z_1} + \underbrace{\frac{1}{|\mathcal{I}_{-\epsilon}|} \sum_{i \in \mathcal{I}_{\textrm{sel}} \setminus \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} }_{:= Z_2} - \underbrace{\left(\frac{1}{|\mathcal{I}_{-\epsilon}|} - \frac{1}{|\mathcal{I}_{\textrm{sel}}|}\right) \sum_{i \in \mathcal{I}_{\textrm{sel}}} a_i a_i^{\textrm{H}} }_{:= Z_3}. \end{align*} We bound the operator norms of |$Z_1, Z_2, Z_3$| in turn. On the event |$\mathcal{E}_5$|⁠, we have \begin{equation} \left|\!\left|\!\left|{Z_1} \right|\!\right|\!\right|_{\textrm{op}} \leq \left|\!\left|\!\left|{\varDelta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} + \epsilon. \end{equation} (A.8) We turn to the error matrix |$Z_2$|⁠. On the event |$\mathcal{E}_1$|⁠, we evidently have |$\widehat{r} \in \left \|{x_\star }\right \|_2 (1 \pm \epsilon )$| by definition of |$\widehat{r}^{\,2} = \frac{1}{m} \left \|{A x_\star }\right \|_2^2$|⁠, so that \begin{equation*} \mathcal{I}_{-\epsilon} \subset \mathcal{I}_{\textrm{sel}} \subset \mathcal{I}_{+\epsilon}. 
\end{equation*} Using that the summands |$a_i a_i^{\textrm{H}} $| are all positive semidefinite, we thus obtain the upper bound \begin{equation} \left|\!\left|\!\left|{Z_2} \right|\!\right|\!\right|_{\textrm{op}} = \frac{1}{|\mathcal{I}_{-\epsilon}|} \bigg|\!\bigg|\!\bigg|{ \sum_{i\in \mathcal{I}_{\textrm{sel}} \setminus \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} }\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \leq \frac{1}{|\mathcal{I}_{-\epsilon}|} \bigg|\!\bigg|\!\bigg|{ \sum_{i\in \mathcal{I}_{+\epsilon} \setminus \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} }\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \stackrel{(i)}{\leq} \frac{1}{\mathsf{p}_0(d_\star)} \cdot 4 q \sigma^2 (\kappa \epsilon)^{\frac{1}{p}}, \end{equation} (A.9) where in inequality |$(i)$| we use that |$2|\mathcal{I}_{-\epsilon }| \geq \mathsf{p}_0(d_\star ) m$| on |$\mathcal{E}_3$| and |$|\!|\!|{\sum _{i\in \mathcal{I}_{+\epsilon } \setminus \mathcal{I}_{-\epsilon }} a_i a_i^{\textrm{H}} }|\!|\!|_{\textrm{op}} \leq 4 q m\sigma ^2 (\kappa \epsilon )^{\frac{1}{p}}$| on |$\mathcal{E}_4$|⁠. Lastly, we provide an upper bound on |$\left |\!\left |\!\left |{Z_3} \right |\!\right |\!\right |_{\textrm{op}}$|⁠. Again using that |$\mathcal{I}_{-\epsilon } \subset \mathcal{I}_{\textrm{sel}} \subset \mathcal{I}_{+\epsilon }$| on event |$\mathcal{E}_1$|⁠, we have \begin{equation*} \left|\frac{1}{|\mathcal{I}_{-\epsilon}|} - \frac{1}{|\mathcal{I}_{\textrm{sel}}|}\right| \leq \frac{1}{|\mathcal{I}_{-\epsilon}|} - \frac{1}{|\mathcal{I}_{+\epsilon}|} = \frac{|\mathcal{I}_{+\epsilon}| - |\mathcal{I}_{-\epsilon}|}{|\mathcal{I}_{-\epsilon}||\mathcal{I}_{+\epsilon}|} \leq \frac{2 \epsilon \kappa }{\mathsf{p}_0(d_\star)^2 m}, \end{equation*} where in the last inequality, we use that |$|\mathcal{I}_{+\epsilon }| - |\mathcal{I}_{-\epsilon }| \leq 2 \epsilon \kappa m$| on |$\mathcal{E}_2$| and that |$|\mathcal{I}_{+\epsilon }| \geq |\mathcal{I}_{-\epsilon }| \geq \mathsf{p}_0(d_\star ) m/2$| on |$\mathcal{E}_3$|⁠. Thus, by the definition of |$\mathcal{E}_1$|⁠, we have \begin{equation} \left|\!\left|\!\left|{Z_3} \right|\!\right|\!\right|_{\textrm{op}} \leq \frac{2 \epsilon \kappa}{\mathsf{p}_0(d_\star)^2} \left|\!\left|\!\left|{\frac{1}{m}\sum_{i=1}^m a_i a_i^{\textrm{H}} } \right|\!\right|\!\right|_{\textrm{op}} \leq \frac{2 \epsilon(1+\epsilon) \kappa}{\mathsf{p}_0(d_\star)^2}. \end{equation} (A.10) Combining inequalities (A.8), (A.9) and (A.10) on the error matrices |$Z_i$|⁠, the triangle inequality gives \begin{equation*} \left|\!\left|\!\left|{Z_1 + Z_2 + Z_3} \right|\!\right|\!\right|_{\textrm{op}} \le \left|\!\left|\!\left|{\varDelta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} + \underbrace{\left[ \epsilon + \frac{4 q \sigma^2 (\kappa \epsilon)^{\frac{1}{p}}}{ \mathsf{p}_0(d_\star)} + \frac{2 \epsilon (1 + \epsilon) \kappa}{ \mathsf{p}_0(d_\star)^2}\right]}_{:= \nu(\epsilon)}. \end{equation*} This implies equality (A.6) with error bound |$\left |\!\left |\!\left |{\varDelta } \right |\!\right |\!\right |_{\textrm{op}} \le \left |\!\left |\!\left |{\varDelta (\epsilon )} \right |\!\right |\!\right |_{\textrm{op}} + \nu (\epsilon )$|⁠. Recall the definition of |$Z_0 = I_n - \phi (\epsilon ) d_\star d_\star ^{\textrm{H}} $|⁠, which has smallest eigenvector |$d_\star $| and eigengap |$\phi (\epsilon )$|⁠. Lastly, we simplify |$\nu (\epsilon )$| by a specific choice of |$p$| and |$q$| in the definition (A.7) of |$\mathcal{E}_4$|⁠. 
Without loss of generality, we assume |$\kappa \epsilon < 1$| (recall Assumption 5 on |$\kappa $|), and define |$p = 1 + \frac{1}{\log \frac{1}{\kappa \epsilon }}$| and |$q = 1 + \log \frac{1}{ \kappa \epsilon }$|. Using that |$\exp (\frac{z}{1 - 1/z}) \le \textrm{e}^{z + 1}$| for any |$z < 0$|, we have |$(\kappa \epsilon )^{\frac{1}{p}} \le e \kappa \epsilon $|, which allows us to bound |$q (\kappa \epsilon )^{\frac{1}{p}} \le e \big(1 + \log \frac{1}{\kappa \epsilon }\big) (\kappa \epsilon )$| in the error term |$\nu (\epsilon )$|.
We now apply the eigenvector perturbation inequality of Lemma A.1. Using that, for |$\theta \in{\mathbb{R}}$|, |$\left \|{u- \textrm{e}^{\iota \theta } v}\right \|_2^2 \ge 2 - 2| \langle u, v \rangle | \ge 2 - 2 | \langle u, v \rangle |^2$| for |$\left \|{u}\right \|_2 = \left \|{v}\right \|_2 = 1$|, a minor rearrangement of Lemma A.1 applied to |$X^{\textrm{init}} = Z_0 + \varDelta $| for |$\varDelta = Z_1 + Z_2 + Z_3$| yields
\begin{equation*} \mathop{\textrm{dist}} \left(\widehat{d}, \frac{1}{r_\star} X_\star\right) = \inf_{\theta \in [0, 2\pi]} \left\|{\widehat{d} - \textrm{e}^{\iota\theta} d_\star}\right\|_2 \le \frac{\sqrt{2} (\left|\!\left|\!\left|{\varDelta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} + \nu(\epsilon))}{ \left[{\phi(\epsilon) - (\left|\!\left|\!\left|{\varDelta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} + \nu(\epsilon))}\right]_+}. \end{equation*}
Finally, using that |$x_0 = \widehat{r}\ \widehat{d}$| and defining |$r_\star = \left \|{x_\star }\right \|_2$|, we have by the triangle inequality that
\begin{equation*} \mathop{\textrm{dist}}(x_0, X_\star) \le \widehat{r} \mathop{\textrm{dist}} \left(\widehat{d}, \frac{1}{r_\star}X_\star\right) + |r_\star - \widehat{r}| \stackrel{(i)}{\le} \left[\sqrt{1 + \epsilon} \mathop{\textrm{dist}}\left(\widehat{d}, \frac{1}{r_\star}X_\star\right) + \epsilon\right] r_\star, \end{equation*}
where inequality |$(i)$| uses |$\mathcal{E}_1$|, which we recall implies |$(1 - \epsilon ) \left \|{x_\star }\right \|_2^2 \le \widehat{r}^2 \le (1 + \epsilon ) \left \|{x_\star }\right \|_2^2$|, where |$\epsilon \in [0, 1]$|. This is the claimed consequence (14) in Proposition 3.
A.4 Proof of Proposition 3: high probability events
It remains to demonstrate that each of the events |$\mathcal{E}_1, \ldots , \mathcal{E}_5$| (recall definition (A.7)) holds with high probability; we dedicate the remainder of the argument to a series of lemmas, each of which shows that one of the five events occurs with high probability. Before the statement of each lemma, we recall the corresponding event. Throughout, |$c> 0$| and |$C < \infty $| denote numerical constants whose values may change.
We begin with |$\mathcal{E}_1 := \{ \frac{1}{m} |\!|\!|{A^{\textrm{H}} A}|\!|\!|_{\textrm{op}} \in [1 \pm \epsilon ]\}$|.
Lemma A.8 We have |${\mathbb{P}}(\mathcal{E}_1) \ge 1 - \exp (-c m \epsilon ^2 / \sigma ^4)$| for |$m$| large enough that |$m/n \ge \sigma ^4/(c\epsilon ^2)$|.
Proof. Set |$t = c \frac{\epsilon ^2}{\sigma ^4}$| in Lemma 3.1, noting that we must have |$\sigma ^2 \ge 1$| because |${\mathbb{E}}[aa^{\textrm{H}} ] = I_n$| and |$\left |\!\left |\!\left |{{\mathbb{E}}[aa^{\textrm{H}} ]} \right |\!\right |\!\right |_{\textrm{op}} \le \sigma ^2$| (recall the final part of Lemma 3.1). Moreover, |$\epsilon \in [0, 1]$| by assumption. Then taking |$c$| small enough, once we have |$\frac{n}{m} \le c \frac{\epsilon ^2}{\sigma ^4}$| we obtain the result.
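To build intuition for the requirement |$m/n \gtrsim \sigma^4/\epsilon^2$| in Lemma A.8, the short simulation below is purely illustrative: it assumes standard Gaussian measurement vectors (for which |$\sigma^2$| is a fixed constant), and the dimension and number of repetitions are arbitrary choices. It tracks how far the top eigenvalue of |$\frac{1}{m} A^{\textrm{T}} A$| is from |$1$| as the ratio |$m/n$| grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Deviation of the top eigenvalue of (1/m) A^T A from 1 as m/n grows.
for ratio in [2, 4, 8, 16, 32, 64]:
    m = ratio * n
    devs = []
    for _ in range(20):
        A = rng.standard_normal((m, n))
        top = np.linalg.eigvalsh(A.T @ A / m).max()
        devs.append(abs(top - 1.0))
    print(f"m/n = {ratio:3d}: mean |lambda_max - 1| = {np.mean(devs):.3f}")
```

The deviation shrinks at roughly the |$\sqrt{n/m}$| rate, consistent with the |$\max\{\sqrt{n/m + t}, n/m + t\}$| term appearing in Lemma 3.1.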
The event |$\mathcal{E}_2 := \{|\mathcal{I}_{-\epsilon }| \ge |\mathcal{I}_{+\epsilon }| - 2 \kappa \epsilon m\}$| likewise holds with high probability.
Lemma A.9 We have |${\mathbb{P}}(\mathcal{E}_2) \ge 1-\exp (-2 \epsilon ^2 \kappa ^2 m)$|.
Proof. We always have that
\begin{align*} \mathcal{I}_{-\epsilon} \subset \mathcal{I}_{+\epsilon} ~~\textrm{and}~~\mathcal{I}_{+\epsilon} \setminus \mathcal{I}_{-\epsilon} = \left\{i\in [m]: \frac{1}{2} (1-\epsilon) \leq | \langle a_i, d_\star \rangle |^2 \leq \frac{1}{2}(1+\epsilon)\right\}. \end{align*}
Therefore, the normalized difference between the cardinalities of |$\mathcal{I}_{+\epsilon }$| and |$\mathcal{I}_{-\epsilon }$| is
\begin{equation*} \frac{1}{m} \left(|\mathcal{I}_{+\epsilon}| - |\mathcal{I}_{-\epsilon}|\right) = \frac{1}{m}\sum_{i=1}^m 1\left\{1-\epsilon \leq 2| \langle a_i, d_\star \rangle |^2 \leq 1+\epsilon\right\}. \end{equation*}
The right-hand side is an average of i.i.d. Bernoulli random variables with means bounded by |$\kappa \epsilon $| by Assumption 5(i). Hoeffding's inequality gives the result that |${\mathbb{P}}(|\mathcal{I}_{+\epsilon }|> |\mathcal{I}_{-\epsilon }| + 2\kappa m \epsilon ) \le \textrm{e}^{-2\kappa ^2 \epsilon ^2 m}$|.
Lemma A.10 The event |$\mathcal{E}_3 := \{|\mathcal{I}_{-\epsilon }| \geq \frac{1}{2} m \mathsf{p}_0(d_\star )\}$| satisfies |${\mathbb{P}}(\mathcal{E}_3) \ge 1 - \exp (-\frac{1}{2} m \mathsf{p}_0(d_\star )^2)$|.
Proof. As in Lemma A.9, this result is immediate from Hoeffding's inequality. We have |$|\mathcal{I}_{-\epsilon }| = \sum _{i=1}^m 1\{| \langle a_i, d_\star \rangle |^2 \leq (1-\epsilon )/2\}$|, an i.i.d. sum of Bernoulli random variables with |${\mathbb{P}}(| \langle a_i, d_\star \rangle |^2 \le (1 - \epsilon )/2) \ge \mathsf{p}_0(d_\star )$| by Assumption 5(i). Hoeffding's inequality gives the result.
Showing that events |$\mathcal{E}_4$| and |$\mathcal{E}_5$| in Equation (A.7) each happen with high probability requires a little more work. We begin with |$\mathcal{E}_4$|, defined in terms of a conjugate pair |$p, q \ge 1$| with |$1/p + 1/q = 1$|, as |$\mathcal{E}_4 := \{|\!|\!|{\frac{1}{m}\sum _{i \in \mathcal{I}_{+\epsilon } \setminus \mathcal{I}_{-\epsilon }} a_i a_i^{\textrm{H}} }|\!|\!|_{\textrm{op}} \le 4 q \sigma ^2 (\kappa \epsilon )^{\frac{1}{p}}\}$|.
Lemma A.11 If |$m/n> c^{-1} (\kappa \epsilon )^{-\frac{2}{p}}$|, then |${\mathbb{P}}(\mathcal{E}_4) \ge 1-\exp (-c m (\kappa \epsilon )^{\frac{2}{p}})$|. It is no loss of generality to assume that |$\kappa \epsilon \le 1$| by Assumption 5, so |$(\kappa \epsilon )^{\frac{2}{p}} \ge (\kappa \epsilon )^2$|.
Proof. The |$\{a_i\}_{i=1}^m$| are independent |$\sigma ^2$|-sub-Gaussian random vectors by Assumption 4, and for any random variable |$B_i$| with |$|B_i| \le 1$|, which may depend on |$a_i$|, it is clear that the collection |$\{B_i a_i\}_{i = 1}^m$| are mutually independent and still satisfy Definition 3.1. To that end, define the Bernoulli variables |$B(a) = 1\big\{| \langle a, d_\star \rangle |^2 \in \big[\frac{1 - \epsilon }{2}, \frac{1 + \epsilon }{2}\big]\big\}$| (letting |$B_i = B(a_i) = 1\left \{i \in \mathcal{I}_{+\epsilon } \setminus \mathcal{I}_{-\epsilon }\right \}$| for shorthand). Then Lemma 3.1 implies for a numerical constant |$C < \infty $| that
\begin{equation} {\mathbb{P}} \left(\bigg|\!\bigg|\!\bigg|{\frac{1}{m}\sum_{i=1}^m a_i a_i^{\textrm{H}} B_i - {\mathbb{E}}[aa^{\textrm{H}} B(a)]}\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \ge C \sigma^2 \max\left\{\sqrt{\frac{n}{m} + t}, \frac{n}{m} + t\right\}\right) \leq \textrm{e}^{-m t}.
\end{equation} (A.11) Now, note by Hölder’s inequality that \begin{align*} {\mathbb{E}}[ \langle v, a \rangle ^2 B(a)] & \le{\mathbb{E}}[ \langle v, a \rangle ^{2q}]^{\frac{1}{q}} {\mathbb{P}}\left(| \langle a, d_\star \rangle |^2 \in \left[\frac{1 - \epsilon}{2}, \frac{1 + \epsilon}{2}\right]\right)^{\frac{1}{p}} \\ & \le (\sigma^{2q} \Gamma(q + 1) e)^{\frac{1}{q}} (\kappa \epsilon)^{\frac{1}{p}} \le q \textrm{e}^{1/q} \sigma^2 (\kappa \epsilon)^{\frac{1}{p}}, \end{align*} where we have applied Assumption 5(i) and Lemma 3.1 to bound |${\mathbb{E}}[ \langle a, d_\star \rangle ^{2q}]$|⁠. Using the triangle inequality and substituting |$t = \frac{1}{4C}(\kappa \epsilon )^{\frac{2}{p}}$| into inequality (A.11), we find that for any |$q \in (1, \infty )$| and |$1/p + 1/q = 1$|⁠, if |$\frac{n}{m} \le \frac{1}{4C}(\kappa \epsilon )^{\frac{2}{p}}$| we have \begin{equation*} {\mathbb{P}} \left(\bigg|\!\bigg|\!\bigg|{\frac{1}{m}\sum_{i=1}^m a_i a_i^{\textrm{H}} B_i - {\mathbb{E}}[aa^{\textrm{H}} B(a)]}\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \ge q \sigma^2 (\kappa \epsilon)^{\frac{1}{p}}\right) \leq \exp\left(-\frac{m(\kappa \epsilon)^{\frac{2}{p}}}{4C}\right). \end{equation*} Applying the triangle inequality and that |$1 + \textrm{e}^{1/q} < 4$| gives the result. The final high probability guarantee is the most complex, and applies to the event |$\mathcal{E}_5 := \big\{\big|\!\big|\!\big|{\frac{1}{|\mathcal{I}_{-\epsilon }|} \sum _{i \in \mathcal{I}_{-\epsilon }} a_i a_i^{\textrm{H}} - (I_n - \phi (\epsilon ) d_\star d_\star ^{\textrm{H}} )}\big|\!\big|\!\big|_{\textrm{op}} \le |\!|\!|{\varDelta (\epsilon )}|\!|\!|_{\textrm{op}} + \epsilon \big\}$|⁠. Lemma A.12 Let |$\mathcal{E}_3 = \{|\mathcal{I}_{-\epsilon }| \ge \frac{1}{2} m \mathsf{p}_0(d_\star )\}$| as in Equation (A.7). Then \begin{equation*} c \frac{m}{n} \ge \frac{\sigma^4\log^2 \mathsf{p}_0(d_\star)}{ \mathsf{p}_0(d_\star) \epsilon^2} \end{equation*} implies |${\mathbb{P}}(\mathcal{E}_5 \mid \mathcal{E}_3 ) \geq 1 - \exp \bigg(-c \frac{m \mathsf{p}_0(d_\star ) \epsilon ^2}{\sigma ^4 \log ^2 \mathsf{p}_0(d_\star )}\bigg)$|⁠. Proof. For notational simplicity, define the following shorthand: \begin{equation*} E_{d_\star} := {\mathbb{E}}\left[aa^{\textrm{H}} \mid | \langle a, d_\star \rangle |^2 \le \frac{1 - \epsilon}{2}\right] = I_n - \phi(\epsilon) d_\star d_\star^{\textrm{H}} + \varDelta(\epsilon), \end{equation*} where the equality uses Assumption 5 (ii). The main idea of the proof is to show the following crucial fact: define the new sub-Gaussian parameter |$\tau ^2 = \sigma ^2 \log \frac{e}{\mathsf{p}_0(d_\star )} \ge 1$|⁠. Then there exists a numerical constant |$1 \le C < \infty $| such that for all |$t \ge 0$|⁠, \begin{equation} {\mathbb{P}}\left(\bigg|\!\bigg|\!\bigg|{ \frac{1}{|\mathcal{I}_{-\epsilon}|} \sum_{i \in \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} - E_{d_\star}}\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \ge C \tau^2 \max\left\{\sqrt{\frac{n}{|\mathcal{I}_{-\epsilon}|} + t}, \frac{n}{|\mathcal{I}_{-\epsilon}|} + t\right\} \mid \mathcal{I}_{-\epsilon}\right) \leq \exp(- |\mathcal{I}_{-\epsilon}| t) . \end{equation} (A.12) Suppose that the bound (A.12) holds. 
On the event |$\mathcal{E}_3$|⁠, we have that |$|\mathcal{I}_{-\epsilon }| \ge \frac{1}{2} m \mathsf{p}_0(d_\star )$|⁠, and so choosing \begin{equation*} t = \frac{\epsilon^2}{4 C^2 \tau^4} < 1, ~~ \mbox{and letting} ~~ \frac{m}{n} \ge \frac{2 C^2 \tau^4}{\mathsf{p}_0(d_\star) \epsilon^2} \end{equation*} yields that |$\frac{n}{|\mathcal{I}_{-\epsilon }|} + t < 1$| on |$\mathcal{E}_3$| and \begin{equation*} {\mathbb{P}}\left(\bigg|\!\bigg|\!\bigg|{ \frac{1}{|\mathcal{I}_{-\epsilon}|} \sum_{i \in \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} - E_{d_\star}}\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \ge \epsilon \mid \mathcal{I}_{-\epsilon}\right) \le \exp\left(-\frac{m \mathsf{p}_0(d_\star) \epsilon^2}{8 C^2 \tau^4}\right). \end{equation*} By the definition of |$E_{d_\star }$| and the triangle inequality we have \begin{align*} & {\mathbb{P}}\left(\bigg|\!\bigg|\!\bigg|{ \frac{1}{|\mathcal{I}_{-\epsilon}|} \sum_{i \in \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} - (I_n - \phi\left(\epsilon\right) d_\star d_\star^{\textrm{H}} )}\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \ge \left|\!\left|\!\left|{\varDelta(\epsilon)} \right|\!\right|\!\right|_{\textrm{op}} + \epsilon \mid \mathcal{I}_{-\epsilon} \right) \\ & \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad~ \le{\mathbb{P}}\left(\bigg|\!\bigg|\!\bigg|{ \frac{1}{|\mathcal{I}_{-\epsilon}|} \sum_{i \in \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} - E_{d_\star}}\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \ge \epsilon \mid \mathcal{I}_{-\epsilon}\right). \end{align*} The lemma follows from the fact that |$\mathcal{E}_3$| is measurable with respect to the indices |$\mathcal{I}_{-\epsilon }$|⁠. Now, we show the key inequality (A.12). The main idea is to show that, conditioning on the set |$\mathcal{I}_{-\epsilon }$|⁠, the distribution |$\{a_i\}_{i \in \mathcal{I}_{-\epsilon }}$| is still conditionally independent and sub-Gaussian. To do so, we introduce a bit of (more or less standard) notation. For a random variable |$X$|⁠, let |$\mathcal{L}(X)$| denote the law of distribution of |$X$|⁠. Using the independence of the vectors |$a_i$|⁠, we have the fact that for any fixed subset |$\mathcal{I} \subset [m]$|⁠, the collection |$\{a_i\}_{i\in \mathcal{I}}$| is independent of |$\{a_i\}_{i\not \in \mathcal{I}}$|⁠. Therefore, using the definition that |$\mathcal{I}_{-\epsilon } = \big\{i \in [m] : | \langle a_i, d_\star \rangle |^2 \le \frac{1 - \epsilon }{2}\big\}$|⁠, we have the key identity \begin{equation*} \mathcal{L}\bigg(\frac{1}{|\mathcal{I}_{-\epsilon}|} \sum_{i \in \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} \mid \mathcal{I}_{-\epsilon} = \mathcal{I}\bigg) \stackrel{{\textrm{dist}}}{=} \mathcal{L}\bigg(\frac{1}{|\mathcal{I}|} \sum_{i\in \mathcal{I}} a_i a_i^{\textrm{H}} \mid \max_{i\in \mathcal{I}} | \langle a_i, d_\star \rangle |^2 \leq \frac{1-\epsilon}{2}\bigg). \end{equation*} This implies that, conditioning on |$\mathcal{I}_{-\epsilon } = \mathcal{I}$|⁠, the vectors |$\{a_i\}_{i \in \mathcal{I}}$| are still conditionally independent, and their conditional distribution is identical to the law |$\mathcal{L}\big(a\mid | \langle a, d_\star \rangle |^2 \leq \frac{1-\epsilon }{2}\big)$|⁠. The claimed inequality (A.12) will thus follow by the matrix concentration inequality in Lemma 3.1, so long as we can demonstrate appropriate sub-Gaussianity of the conditional law |$\mathcal{L}\big(a\mid | \langle a, d_\star \rangle |^2 \leq \frac{1 - \epsilon }{2}\big)$|⁠. 
Indeed, let us temporarily assume that |$a \mid | \langle a, d_\star \rangle |^2 \le \frac{1 - \epsilon }{2}$| is |$\tau ^2$|-sub-Gaussian, let |$\mathcal{J}$| denote all subsets |$\mathcal{I} \subset [m]$| such that |$|\mathcal{I}| \ge \frac{1}{2} m \mathsf{p}_0(d_\star )$|⁠, and define the shorthand |$E_{d_\star } = {\mathbb{E}}\big[aa^{\textrm{H}} \mid | \langle a, d_\star \rangle |^2 \le \frac{1 - \epsilon }{2}\big]$|⁠. Then by summing over |$\mathcal{J}$|⁠, we have on the event |$\mathcal{E}_3$| that for a numerical constant |$C < \infty $|⁠, \begin{align*} & {\mathbb{P}}\left(\bigg|\!\bigg|\!\bigg|{\frac{1}{|\mathcal{I}_{-\epsilon}|} \sum_{i \in \mathcal{I}_{-\epsilon}} a_i a_i^{\textrm{H}} - E_{d_\star} }\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \ge C \tau^2 \max\left\{\sqrt{\frac{n}{|\mathcal{I}_{-\epsilon}|} + t}, \frac{n}{|\mathcal{I}_{-\epsilon}|} + t\right\} \mid \mathcal{E}_3 \right) \\ &\quad =\sum_{\mathcal{I} \in \mathcal{J}} {\mathbb{P}}\left( \bigg|\!\bigg|\!\bigg|{ \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} a_i a_i^{\textrm{H}} - E_{d_\star}}\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \ge C \tau^2 \max\left\{\sqrt{\frac{n}{|\mathcal{I}|} + t}, \frac{n}{|\mathcal{I}|} + t\right\} \mid \mathcal{I}_{-\epsilon} = \mathcal{I} \right) {\mathbb{P}}(\mathcal{I} = \mathcal{I}_{-\epsilon} \mid \mathcal{E}_3) \\ &\quad = \sum_{\mathcal{I} \in \mathcal{J}} {\mathbb{P}}\left( \bigg|\!\bigg|\!\bigg|{ \frac{1}{|\mathcal{I}|} \!\sum_{i \in \mathcal{I}} a_i a_i^{\textrm{H}} - E_{d_\star}}\!\bigg|\!\bigg|\!\!\bigg|_{\textrm{op}} \ge C \tau^2 \max\!\left\{\sqrt{\frac{n}{|\mathcal{I}|} + t}, \frac{n}{|\mathcal{I}|} + t\right\} \mid \max_{i \in \mathcal{I}} | \langle a_i, d_\star \rangle |^2 \le \frac{1 - \epsilon}{2} \right)\\&\qquad{\mathbb{P}}(\mathcal{I} = \mathcal{I}_{-\epsilon} \!\mid\! \mathcal{E}_3) \\ &\quad \stackrel{(i)}{\le} \sum_{\mathcal{I} \in \mathcal{J}} \textrm{e}^{-|\mathcal{I}| t} \cdot{\mathbb{P}}(\mathcal{I} = \mathcal{I}_{-\epsilon} \mid \mathcal{E}_3) \le \textrm{e}^{- \frac{1}{2} m \mathsf{p}_0(d_\star) t}, \end{align*} where inequality |$(i)$| is an application of Lemma 3.1. This is evidently inequality (A.12) with appropriate choice of |$\tau ^2$|⁠. We thus show that |$\mathcal{L}\big(a\mid | \langle a, d_\star \rangle |^2 \leq \frac{1-\epsilon }{2}\big)$| is sub-gaussian with parameter |$\tau ^2 = \sigma ^2 \log \frac{e}{\mathsf{p}_0(d_\star )}$| by bounding the conditional moment generating function. Let |$\lambda \in [1, \infty ]$| and |$\lambda ^{\prime}$| be conjugate, so that |$1/ \lambda + 1/\lambda ^{\prime} = 1$|⁠. Then by Hölder’s inequality, for any |$v \in{\mathbb{S}}^{n-1}$| we have \begin{align*} {\mathbb{E}} \left[\exp\left(\frac{| \langle a, v \rangle |^2}{\lambda \sigma^2}\right) \mid | \langle a, v \rangle |^2 \leq \frac{1-\epsilon}{2}\right] &= \frac{{\mathbb{E}} \left[\exp\left(\frac{| \langle a, v \rangle |^2}{\lambda \sigma^2}\right) 1\left\{| \langle a, v \rangle |^2 \leq \frac{1-\epsilon}{2}\right\}\right]}{ {\mathbb{P}}\left(| \langle a, v \rangle |^2 \leq \frac{1-\epsilon}{2}\right)} \\ & \leq \frac{\left({\mathbb{E}}\left[\exp\left(\frac{| \langle a, v \rangle |^2}{\sigma^2}\right)\right]\right)^{\frac{1}{\lambda}} {\mathbb{P}}\left(| \langle a, v \rangle |^2 \leq \frac{1-\epsilon}{2}\right)^{\frac{1}{\lambda^{\prime}}}}{ {\mathbb{P}}\left(| \langle a, v \rangle |^2 \leq \frac{1-\epsilon}{2}\right)} \\ & = {\mathbb{E}}\left[\!\exp\!\left(\frac{| \langle a, v \rangle |^2}{\sigma^2}\right)\! 
\right]^{\frac{1}{\lambda}} \!{\mathbb{P}}\left(\!| \langle a, v \rangle |^2 \!\leq\! \frac{1-\epsilon}{2}\!\right)^{-\frac{1}{\lambda}} \!\leq\! \left(\frac{e}{\mathsf{p}_0(d_\star)}\right)^{\frac{1}{\lambda}}, \end{align*} where the final inequality uses the |$\sigma ^2$|-sub-Gaussianity of |$a$|⁠. Set |$\lambda = \log \frac{e}{\mathsf{p}_0(d_\star )}$| to see that conditional on |$| \langle a, d_\star \rangle |^2 \le \frac{1 - \epsilon }{2}$|⁠, the vector |$a$| is |$\sigma ^2 \log \frac{e}{\mathsf{p}_0(d_\star )}$|-sub-Gaussian, as desired. B. Proofs for phase retrieval with outliers In this section, we collect the proofs of the various results in Section 4. Before providing the proofs, we state one inequality that we use frequently that will be quite useful. Let |$W_i \in \{0, 1\}$| satisfy |$W_i = 1$| if |$i \in \mathcal{I}^{\textrm{out}}$| and |$W_i = 0$| otherwise, so that |$W$| indexes the outlying measurements. Because |$W_i$| are independent of the |$a_i$| vectors and |$\sum _i W_i = p_{\textrm{fail}} m$|⁠, Lemma 3.1 implies for all |$t \ge 0$| that \begin{equation} {\mathbb{P}}\left(\bigg|\!\bigg|\!\bigg|{\frac{1}{m} \sum_{i = 1}^m W_i a_i a_i^T - p_{\textrm{fail}} {\mathbb{E}}[a_i a_i^T]}\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \ge C \sigma^2 \max\left\{\sqrt{\frac{n}{m} + t}, \frac{n}{m} + t \right\}\right) \le \textrm{e}^{-m t}. \end{equation} (B.1) B.1 Proof of Proposition 4 Recalling the set |$\mathcal{I}^{\textrm{out}}$| of outlying indices, we evidently have \begin{align} f(x) - f(x_\star) & = \frac{1}{m} \sum_{i = 1}^m | \langle a_i, x \rangle ^2 - \langle a_i, x_\star \rangle ^2| + \frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{out}}} \left(| \langle a_i, x \rangle ^2 - \xi_i| - | \langle a_i, x \rangle ^2 - \langle a_i, x_\star \rangle ^2|\right) - f(x_\star) \nonumber \\ & = \frac{\left\|{(Ax)^2 - (Ax_\star)^2}\right\|_1}{m} + \frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{out}}} \left(| \langle a_i, x \rangle ^2 - \xi_i| - | \langle a_i, x \rangle ^2 - \langle a_i, x_\star \rangle ^2| - | \langle a_i, x_\star \rangle ^2 - \xi_i|\right) \nonumber \\ & \ge \frac{\left\|{(Ax)^2 - (Ax_\star)^2}\right\|_1}{m} - \frac{2}{m} \sum_{i \in \mathcal{I}^{\textrm{out}}} | \langle a_i, x \rangle ^2 - \langle a_i, x_\star \rangle ^2|. \end{align} (B.2) Now, we note the trivial fact that if we define |$W_i = 1$| for |$i \in \mathcal{I}^{\textrm{out}}$| and |$W_i = 0$| for |$i \in \mathcal{I}^{\textrm{in}}$|⁠, then \begin{align*} \sum_{i \in \mathcal{I}^{\textrm{out}}} | \langle a_i, x \rangle ^2 - \langle a_i, x_\star \rangle ^2| & = \sum_{i = 1}^m W_i |(x - x_\star)^T a_i a_i^T (x + x_\star)| \\ & \le \left|\!\left|\!\left|{A^T \mathop{\textrm{diag}}(W) A} \right|\!\right|\!\right|_{\textrm{op}} \left\|{x - x_\star}\right\|_2 \left\|{x + x_\star}\right\|_2. \end{align*} Inequality (B.1) shows that the matrix |$\frac{1}{m}\sum _{i = 1}^m W_i a_i a_i^T$| is well concentrated. Now, let |$\mathcal{E}$| denote the event that |$\frac{1}{m} |\!|\!|{A^T \mathop{\textrm{diag}}(W) A}|\!|\!|_{\textrm{op}} \le p_{\textrm{fail}} + C \sigma ^2 \sqrt{n/m + t}$|⁠, where |$t$| is chosen so that |$n/m + t \le 1$|⁠. Returning to inequality (B.2), we have \begin{equation*} f(x) - f(x_\star) \ge \frac{1}{m} \left\|{(Ax)^2 - (Ax_\star)^2}\right\|_1 - 2 (p_{\textrm{fail}} + C \sigma^2 \sqrt{n/m + t}) \left\|{x - x_\star}\right\|_2 \left\|{x + x_\star}\right\|_2 \end{equation*} for all |$x \in{\mathbb{R}}^n$| with probability at least |$1 - \textrm{e}^{-mt}$| by inequality (B.1). 
We finish with the following lemma, which is a minor sharpening of Theorem 2.4 of Eldar and Mendelson [19] so that we have sharp concentration in all dimensions |$n$|. We provide a proof for completeness in Appendix C.2.
Lemma B.1 Let |$a_i$| be independent |$\sigma ^2$|-sub-Gaussian vectors, and define |$\kappa _{\textrm{st}}(u, v) := {\mathbb{E}}[| \langle a, u \rangle \langle a, v \rangle |]$| for |$u, v \in{\mathbb{R}}^n$|. Then there exist numerical constants |$c> 0$| and |$C < \infty $| such that
\begin{equation*} \frac{1}{m} \sum_{i = 1}^m | \langle a_i, u \rangle \langle a_i, v \rangle | \ge \kappa_{\textrm{st}}(u, v) - C \sigma^2 \sqrt[3]{\frac{n}{m}} - \sigma^2 t ~~ \mbox{for all } u, v \in{\mathbb{S}}^{n - 1} \end{equation*}
with probability at least |$1 - \textrm{e}^{-c mt^2} - \textrm{e}^{-cm}$| when |$m/n \ge C$|.
Noting that |$| \langle a_i, x \rangle ^2 - \langle a_i, x_\star \rangle ^2| = | \langle a_i, x - x_\star \rangle \langle a_i, x + x_\star \rangle |$| and substituting the result of the lemma into the preceding display, we have
\begin{equation*} f(x) - f(x_\star) \ge \left(\kappa_{\textrm{st}} - 2 p_{\textrm{fail}} - C \sigma^2 \sqrt[3]{\frac{n}{m}} - C \sigma^2 t\right) \left\|{x - x_\star}\right\|_2 \left\|{x + x_\star}\right\|_2 \end{equation*}
uniformly in |$x$| with probability at least |$1 - 2 \textrm{e}^{-c m t^2} - \textrm{e}^{-c m}$|.
B.2 Proof of Proposition 5
We first state a lemma providing a deterministic bound on the errors of the minimizing radius.
Lemma B.2 Let
\begin{equation} \delta = \frac{6 \left\|{x_\star}\right\|_2^2 \left|\!\left|\!\left|{A^T A} \right|\!\right|\!\right|_{\textrm{op}}}{ \sum_{i \in \mathcal{I}^{\textrm{in}}} \langle a_i, x_\star \rangle ^2 - \sum_{i \in \mathcal{I}^{\textrm{out}}} \langle a_i, x_\star \rangle ^2} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}). \end{equation} (B.3)
If |$\delta \le 1$|, then all minimizers of |$G(\cdot )$| belong to the set |$[1 \pm \delta ] \left \|{x_\star }\right \|_2^2$|.
Temporarily assuming the conclusions of the lemma, let us show that the random quantities in the bound (B.3) are small with high probability. We apply the matrix concentration inequality (B.1) to see that for a numerical constant |$C < \infty $| and all |$t \in [0, 1 - \frac{n}{m}]$|, we have
\begin{equation*} \frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{in}}} \langle a_i, v \rangle ^2 \ge \left[(1 - p_{\textrm{fail}}) - C \sigma^2 \sqrt{\frac{n}{m} + t} \right] ~~ \mbox{and} ~~ \frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{out}}} \langle a_i, v \rangle ^2 \le \left[p_{\textrm{fail}} + C \sigma^2 \sqrt{\frac{n}{m} + t} \right] \end{equation*}
with probability at least |$1 - \textrm{e}^{-mt}$| for all vectors |$v \in{\mathbb{S}}^{n - 1}$|, and |$\frac{1}{m} |\!|\!|{A^TA}|\!|\!|_{\textrm{op}} \le \sigma ^2\big(1 + C \sqrt{\frac{n}{m} + t}\big)$| with the same probability. That is, for |$t \in \big[0, 1 - \frac{n}{m}\big]$| with probability at least |$1 - 2 \textrm{e}^{-mt}$| we have that |$\delta $| in expression (B.3) satisfies
\begin{equation*} \delta \le \frac{6 \sigma^2}{ 1 - 2 p_{\textrm{fail}} - C \sigma^2 \sqrt{\frac{n}{m} + t}} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}).
\end{equation*}
If we assume that |$\frac{n}{m} \le c (1 - 2p_{\textrm{fail}})^2 / \sigma ^4$| and replace |$t$| with |$c (1 - 2p_{\textrm{fail}})^2 / \sigma ^4$| for a small enough constant |$c$|, we find that
\begin{equation*} \delta \le \frac{C \sigma^2}{1 - 2p_{\textrm{fail}}} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}) \end{equation*}
with probability at least |$1 - 2 \textrm{e}^{-c m(1 - 2p_{\textrm{fail}})^2 / \sigma ^4}$|, where |$C$| is a numerical constant. This is our desired result.
Proof. We define a few pieces of notation for shorthand. Let
\begin{equation*} \widehat{\sigma}^2 := \frac{1}{m} \left|\!\left|\!\left|{A^T A} \right|\!\right|\!\right|_{\textrm{op}} ~~ \mbox{and} ~~ \widehat{\sigma}_{\mathcal{I}^{\textrm{out}}}^2 := \frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{in}}} \langle a_i, x_\star \rangle ^2 - \frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{out}}} \langle a_i, x_\star \rangle ^2, \end{equation*}
and define the function |$\widehat{g}(\delta ) = G((1 + \delta ) \left \|{x_\star }\right \|_2^2)$|, equivalently
\begin{equation*} \widehat{g}(\delta) := \frac{1}{m} \sum_{i = 1}^m |b_i - \left\|{x_\star}\right\|_2^2 \langle a_i, \widehat{d}\rangle^2 (1 + \delta)|, \end{equation*}
and a slightly more accurate counterpart
\begin{equation*} g(\delta) := \frac{1}{m} \!\sum_{i=1}^m \!\left|b_i - (1+\delta) \langle a_i, x_\star \rangle ^2\right|= \frac{1}{m} \!\sum_{i\in \mathcal{I}^{\textrm{in}}} \!\left| \langle a_i, x_\star \rangle ^2 - (1+\delta) \langle a_i, x_\star \rangle ^2\right| + \frac{1}{m} \sum_{i\in \mathcal{I}^{\textrm{out}}} \left|b_i - (1+\delta) \langle a_i, x_\star \rangle ^2 \right|. \end{equation*}
Note that if |$\delta $| minimizes |$\widehat{g}(\delta )$|, then |$(1 + \delta ) \left \|{x_\star }\right \|_2^2$| minimizes |$G(r)$|. By inspection we find that the subgradients of |$g$| with respect to |$\delta $| are
\begin{equation*} \partial_\delta g(\delta) = \frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{in}}} \mathop{\textrm{sgn}}(\delta) \langle a_i, x_\star \rangle ^2 + \frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{out}}} \mathop{\textrm{sgn}}((1 + \delta) \langle a_i, x_\star \rangle ^2 - b_i) \langle a_i, x_\star \rangle ^2, \end{equation*}
where |$\mathop{\textrm{sgn}}(t) = 1$| if |$t> 0$|, |$-1$| if |$t < 0$|, and |$\mathop{\textrm{sgn}}(0) = [-1, 1]$|. Evidently, for |$\delta> 0$| we have |$g^{\prime}(\delta ) \ge \widehat{\sigma }_{\mathcal{I}^{\textrm{out}}}^2$| and |$g^{\prime}(-\delta ) \le -\widehat{\sigma }_{\mathcal{I}^{\textrm{out}}}^2$|, so that
\begin{equation} g(\delta) \ge \widehat{\sigma}_{\mathcal{I}^{\textrm{out}}}^2 |\delta| + g(0). \end{equation} (B.4)
Now, we consider the gaps between |$\widehat{g}$| and |$g$|; for |$\delta \in [-1, 1]$|, we have the gap
\begin{align*} |g(\delta) - \widehat{g}(\delta)| & \le \frac{(1 + \delta) \left\|{x_\star}\right\|_2^2}{m} \sum_{i = 1}^m |\langle a_i, d_\star\rangle^2 - \langle a_i, \widehat{d}\rangle^2| \\ & \le \frac{(1 + \delta) \left\|{x_\star}\right\|_2^2}{m} \left|\!\left|\!\left|{A^TA} \right|\!\right|\!\right|_{\textrm{op}} \|{d_\star - \widehat{d}}\|_2 \|{d_\star + \widehat{d}}\|_2 \le 4 \left\|{x_\star}\right\|_2^2 \widehat{\sigma}^2 \mathop{\textrm{dist}}(d_\star, \{\pm \widehat{d}\}), \end{align*}
where we have used the triangle inequality and the Cauchy–Schwarz inequality.
Thus, we obtain
\begin{equation*} \widehat{g}(\delta) - \widehat{g}(0) = g(\delta) - g(0) + \widehat{g}(\delta) - g(\delta) + g(0) - \widehat{g}(0) \ge \widehat{\sigma}_{\mathcal{I}^{\textrm{out}}}^2 |\delta| - 6 \left\|{x_\star}\right\|_2^2 \widehat{\sigma}^2 \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}), \end{equation*}
where we have applied inequality (B.4). Rearranging, we have that if |$\widehat{g}(\delta ) \le \widehat{g}(0)$| we must have
\begin{equation*} |\delta| \le \frac{6 \left\|{x_\star}\right\|_2^2 \widehat{\sigma}^2}{ \widehat{\sigma}_{\mathcal{I}^{\textrm{out}}}^2} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}). \end{equation*}
By convexity, any minimizer of |$\widehat{g}$| must thus lie in the above region, which gives the result when we recall that minimizers |$\delta $| of |$\widehat{g}$| are equivalent to minimizers |$(1 + \delta ) \left \|{x_\star }\right \|_2^2$| of |$G$|.
B.3 Proof of Proposition 6
We introduce a bit of notation before giving the proof proper. Recall that |$\mathcal{I}^{\textrm{out}} \subset [m]$| denotes the outliers, or failed measurements, and |$\mathcal{I}^{\textrm{in}} = [m] \setminus \mathcal{I}^{\textrm{out}}$| the set of |$i$| such that |$b_i = \langle a_i, x_\star \rangle ^2$| (the inliers). Recalling the selected set of indices |$\mathcal{I}_{\textrm{sel}}$|, we define the shorthand
\begin{equation*} \mathcal{I}^{\textrm{in}}_{\textrm{sel}} := \mathcal{I}^{\textrm{in}} \cap \mathcal{I}_{\textrm{sel}} ~~ \mbox{and} ~~ \mathcal{I}^{\textrm{out}}_{\textrm{sel}} := \mathcal{I}^{\textrm{out}} \cap \mathcal{I}_{\textrm{sel}} \end{equation*}
for the chosen inliers and outliers. We decompose the matrix |$X^{\textrm{init}}$| into four matrices, each of which we control to guarantee that |$X^{\textrm{init}} \approx I_n - c d_\star{d_\star }^T$| for some constant |$c$|, thus guaranteeing |$\widehat{d} \approx d_\star $|. Let |$P = d_\star{d_\star }^T$| and |$P_\perp = I_n - d_\star{d_\star }^T$| be the projection operators onto the span of |$d_\star $| and its orthogonal complement, respectively. Then we may decompose the matrix |$X^{\textrm{init}}$| into the four parts
\begin{align} X^{\textrm{init}} = \underbrace{\frac{1}{m} \sum_{i=1}^m P a_i a_i^T P 1\left\{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right\}}_{:= Z_0} &+ \underbrace{\frac{1}{m} \sum_{i=1}^m \left[P a_i a_i^T P_\perp + P_\perp a_i a_i^T P\right] 1\left\{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right\}}_{:= {Z_1}}\nonumber\\ \qquad ~ &+ \underbrace{\frac{1}{m} \sum_{i=1}^m P_\perp a_i a_i^T P_\perp 1\left\{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right\}}_{:= Z_2} + \underbrace{\frac{1}{m} \sum_{i=1}^m a_i a_i^T 1\left\{i \in \mathcal{I}^{\textrm{out}}_{\textrm{sel}}\right\}}_{:= Z_3}. \end{align} (B.5)
Let us briefly motivate this decomposition. We expect that |$Z_0$| should be small because we choose the indices |$\mathcal{I}_{\textrm{sel}}$| by taking the smallest values of |$b_i$|, which should be least correlated with |$d_\star $| (recall that |$P = d_\star{d_\star }^T$|). We expect |$Z_1$| to be small because of the independence of the vectors |$P a_i$| and |$P_\perp a_i$| for Gaussian measurement vectors, and |$Z_3$| to be small because |$\mathcal{I}^{\textrm{out}}_{\textrm{sel}}$| should be not too large. This leaves |$Z_2$|, which (by Gaussianity) we expect to be some multiple of |$I_n - d_\star{d_\star }^T$|, which will then allow us to apply eigenvector perturbation guarantees using the eigengap of the matrix |$X^{\textrm{init}}$|.
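Before turning to the individual terms, the following numpy sketch makes the initialization under study concrete: it keeps the indices with the smallest |$b_i$| (here, those with |$b_i \le \widehat{r}^{\,2}/2$|, in the spirit of the thresholds |$(1 \pm \epsilon )/2$| defining |$\mathcal{I}_{\mp \epsilon }$|), forms |$X^{\textrm{init}}$| from the corresponding |$a_i a_i^T$|, and takes |$\widehat{d}$| to be the eigenvector of the smallest eigenvalue, scaled by |$\widehat{r}$|. The Gaussian data, the noiseless radius estimate |$\widehat{r}^{\,2} = \frac{1}{m}\sum _i b_i$| and the specific selection threshold are illustrative assumptions only; they are not the exact rules analyzed in the propositions (in particular, the outlier-robust radius estimate of Proposition 5 is not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3000, 100

x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
b = (A @ x_star) ** 2                  # noiseless measurements b_i = <a_i, x_star>^2

r_hat = np.sqrt(b.mean())              # radius estimate: r_hat^2 = (1/m) ||A x_star||^2
sel = b <= 0.5 * r_hat ** 2            # keep indices with small b_i

X_init = (A[sel].T @ A[sel]) / m       # initialization matrix from the selected rows
evals, evecs = np.linalg.eigh(X_init)
d_hat = evecs[:, 0]                    # eigenvector of the smallest eigenvalue
x0 = r_hat * d_hat

dist = min(np.linalg.norm(x0 - x_star), np.linalg.norm(x0 + x_star))
print(f"dist(x0, {{+-x_star}}) = {dist:.3f}")
```

With Gaussian measurements and no outliers, the reported distance is typically a small fraction of |$\left \|{x_\star }\right \|_2$| once |$m/n$| is a moderate constant, in line with the guarantees proved in this appendix.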
The rotational invariance of the Gaussian means that it is no loss of generality to assume that |$d_\star = e_1$|, the first standard basis vector, so for the remainder of the argument we assume this without further comment. We may then decompose |$a_i$| as |$a_i = [a_{i,1} ~ a_{i, 2} ~ \cdots ~ a_{i, n}]^T = [a_{i,1} ~ a_{i, \setminus 1}^T]^T$|, and we use this notation throughout the remainder of the proof. We now present four lemmas, each controlling one of the terms |$Z_l$| in the expansion (B.5). We defer proofs of each of the lemmas to the end of this argument. We begin by considering the |$Z_0$| term, which (because |$P$| is rank 1) satisfies \begin{equation*} Z_0 = \underbrace{\bigg(\frac{1}{m} \sum_{i = 1}^m \langle a_i, d_\star \rangle ^2 1\left\{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right\}\bigg)}_{:= z_0} d_\star{d_\star}^T. \end{equation*} Recalling the definition (17) of the constants |$\delta _q$| and |$w_q$| in the statement of the proposition, we have the following lemma. Lemma B.3 Define the random quantities |$z_0 = \frac{1}{m} \sum _{i=1}^m \langle a_i, d_\star \rangle ^2 1\left \{i\in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right \}$| and |$z_2 = \frac{1}{m} |\mathcal{I}^{\textrm{in}}_{\textrm{sel}}|$|. Then for |$t \in [0, 1]$|, \begin{equation*} {\mathbb{P}}\left(z_2 \ge z_0 + (1 - 2p_{\textrm{fail}})(1 - t)\delta_q \right) \ge 1 - \exp\left(-\frac{1}{10} (1 - 2p_{\textrm{fail}}) m\right) - 2 \exp\left(-\frac{m}{4}\right) - \exp\left(-\frac{m t^2 \delta_q^2}{4 w_q^2}\right). \end{equation*} See Appendix B.3.1 for a proof of the lemma. We thus see that |$z_0$| is likely to be substantially smaller than |$z_2$|, the fraction of inlying indices selected. We now argue that |$Z_1$| is likely to be small because it is the sum of products of independent vectors. Lemma B.4 For |$t \ge 0$| we have \begin{equation*} {\mathbb{P}}\left(\left|\!\left|\!\left|{Z_1} \right|\!\right|\!\right|_{\textrm{op}} \geq 2 \sqrt{\frac{n}{m}} + t \right) \leq \exp \left(-\frac{mt^2}{8}\right) + \exp\left(-\frac{m}{2}\right). \end{equation*} See Appendix B.3.2 for a proof. We can also show that |$Z_2$| is well behaved in the sense that it is approximately a multiple of |$(I_n - d_\star{d_\star}^T)$|. Lemma B.5 Let |$z_2$| be the random quantity |$z_2 := \frac{1}{m}|\mathcal{I}^{\textrm{in}}_{\textrm{sel}}|$|. There exists a numerical constant |$C$| such that for |$t \in [0, 1]$| we have \begin{equation*} {\mathbb{P}} \left(\left|\!\left|\!\left|{Z_2 - z_2 (I_n - d_\star{d_\star}^T)} \right|\!\right|\!\right|_{\textrm{op}} \geq C\left(\sqrt{\frac{n}{m} + t}\right) \right) \le \exp(-mt). \end{equation*} See Appendix B.3.3 for a proof of the lemma. Finally, we control the size of the error matrix |$Z_3$| in the expansion (B.5), which corresponds to the contribution of the outlying measurements |$a_i$| that are included in the initialization matrix |$X^{\textrm{init}}$|. We provide two slightly different guarantees, depending on the assumed noise model for |$\xi $|. Lemma B.6 Define the random quantity |$z_3 := \frac{1}{m}|\mathcal{I}^{\textrm{out}} \cap \mathcal{I}_{\textrm{sel}}| \le p_{\textrm{fail}}$|. There is a numerical constant |$C$| such that the following hold: Under the independent noise model M1, for all |$t\in [0, 1]$|, \begin{equation*} {\mathbb{P}}\left(\left|\!\left|\!\left|{Z_3 - z_3 I_n} \right|\!\right|\!\right|_{\textrm{op}} \geq C\sqrt{\frac{n}{m} + t}\right) \leq \exp(-mt).
\end{equation*} Under the adversarial noise model M2, for all |$t\in [0, 1]$|⁠, \begin{equation*} {\mathbb{P}}\left(\left|\!\left|\!\left|{Z_{3}} \right|\!\right|\!\right|_{\textrm{op}} \geq p_{\textrm{fail}} + C \sqrt{\frac{n}{m} + t}\right) \leq \exp(-mt). \end{equation*} See Appendix B.3.4 for a proof. With these four lemmas in hand, we can prove the result of the proposition by applying the eigenvector perturbation Lemma A.1. The expansion (B.5), coupled with the four lemmas defining the constants |$z_0 = \frac{1}{m} \sum _{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} \langle a_i, d_\star \rangle ^2$|⁠, |$z_2 = \frac{1}{m} |\mathcal{I}^{\textrm{in}}_{\textrm{sel}}|$| and |$z_3 = \frac{1}{m} |\mathcal{I}^{\textrm{out}}_{\textrm{sel}}|$|⁠, guarantees that \begin{align*} X^{\textrm{init}} = Z_0 + Z_1 + Z_2 + Z_3 &= z_0 d_\star{d_\star}^T + z_2 (I_n - d_\star{d_\star}^T) + z_3 I_n + \varDelta \\ &= (z_2 + z_3) I_n - (z_2 - z_0) d_\star{d_\star}^T + \varDelta, \end{align*} where the perturbation |$\varDelta \in{\mathbb{R}}^{n \times n}$| satisfies \begin{equation*} \left|\!\left|\!\left|{\varDelta} \right|\!\right|\!\right|_{\textrm{op}} \le \left|\!\left|\!\left|{Z_1} \right|\!\right|\!\right|_{\textrm{op}} + \left|\!\left|\!\left|{Z_2 - z_2(I_n - d_\star{d_\star}^T)} \right|\!\right|\!\right|_{\textrm{op}} + \begin{cases} \left|\!\left|\!\left|{Z_3 - z_3 I_n} \right|\!\right|\!\right|_{\textrm{op}} & \textrm{under}\ \textrm{model}\ \textrm{M1} \\ \left|\!\left|\!\left|{Z_3} \right|\!\right|\!\right|_{\textrm{op}} & \textrm{under}\ \textrm{model}\ \textrm{M2}. \end{cases} \end{equation*} On the event that |$z_2> z_0$|⁠, the minimal eigenvector of |$(z_2 + z_3) I_n - (z_2 - z_0) d_\star{d_\star }^T$| is |$d_\star $| with eigengap |$z_2 - z_0$|⁠. Applying Lemma A.1 gives that |$\widehat{d} = \mathop{{\textrm{argmin}}}_{d \in{\mathbb{S}}^{n-1}} d^T X^{\textrm{init}} d$| satisfies \begin{equation} 2^{-\frac{1}{2}} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}) \le \sqrt{1 - \langle \widehat{d}, d_\star \rangle^2} \le \frac{\left|\!\left|\!\left|{\Delta} \right|\!\right|\!\right|_{\textrm{op}}}{ \left[{z_2 - z_0 - \left|\!\left|\!\left|{\Delta} \right|\!\right|\!\right|_{\textrm{op}}}\right]_+}. \end{equation} (B.6) Applying Lemmas B.4 and B.5, we have for some numerical constant |$C < \infty $| that \begin{equation*} \left|\!\left|\!\left|{Z_1} \right|\!\right|\!\right|_{\textrm{op}} + \left|\!\left|\!\left|{Z_2 - z_2 (I_n - d_\star{d_\star}^T)} \right|\!\right|\!\right|_{\textrm{op}} \le C \sqrt{\frac{n}{m} + t} ~~ \mbox{with probability} \ge 1 - \textrm{e}^{-mt} - \textrm{e}^{-m/2} \end{equation*} for any |$t \ge 0$|⁠. We now consider the two noise models in turn. Under Model M1, Lemma B.6 then implies that |$\left |\!\left |\!\left |{\varDelta } \right |\!\right |\!\right |_{\textrm{op}} \le C \sqrt{\frac{n}{m} + t}$| with probability at least |$1 - \textrm{e}^{-mt} - \textrm{e}^{-m/2}$|⁠. Recalling Lemma B.3, we have for the constants |$w_q^2$| and |$\delta _q> 0$| defined in the lemma that |$z_2 \ge z_0 + \frac{1 - 2 p_{\textrm{fail}}}{2} \delta _q$| with probability at least |$1 - \exp (-c (1 - 2 p_{\textrm{fail}}) m) - \exp (-c m \delta _q^2 / w_q^2)$|⁠, where |$c> 0$| is a numerical constant. 
That is, the perturbation inequality (B.6) implies that under Model M1, we have with probability at least |$1 - \exp (-m t) - \exp (-c(1 - 2 p_{\textrm{fail}}) m) - \exp (-c m \delta _q^2 / w_q^2)$| that \begin{equation*} 2^{-\frac{1}{2}} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}) \le \frac{C \sqrt{n/m + t}}{ \left[{(1 - 2 p_{\textrm{fail}}) \delta_q - C \sqrt{n/m + t}}\right]_+}, \end{equation*} where |$0 < c, C < \infty $| are numerical constants. Under Model M2, we can bound |$\left |\!\left |\!\left |{Z_3} \right |\!\right |\!\right |_{\textrm{op}}$| by |$p_{\textrm{fail}} + C \sqrt{n/m + t}$| with probability at least |$1 - \textrm{e}^{-mt} - \textrm{e}^{-m/2}$| (recall Lemma B.6), so that with the same probability as above, we have \begin{equation*} 2^{-\frac{1}{2}} \mathop{\textrm{dist}}(\widehat{d}, \{\pm d_\star\}) \le \frac{p_{\textrm{fail}} + C \sqrt{n/m + t}}{ \left[{(1 - 2p_{\textrm{fail}}) \delta_q - C \sqrt{n/m + t} - p_{\textrm{fail}}}\right]_+}\ \textrm{under}\ \textrm{Model}\ \textrm{M2}. \end{equation*} This is the proposition. B.3.1 Proof of Lemma B.3 As noted earlier, it is no loss of generality to assume that |$d_\star = e_1$|, the first standard basis vector, so that using our definitions of |$z_0 = \frac{1}{m} \sum _{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} \langle a_i, d_\star \rangle ^2$| and |$z_2 = \frac{1}{m} |\mathcal{I}^{\textrm{in}}_{\textrm{sel}}|$|, we have \begin{equation} z_2 - z_0 = \frac{1}{m} \sum_{i=1}^m (1 - a_{i, 1}^2) 1\left\{i\in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right\}. \end{equation} (B.7) Given that we choose the indices |$\mathcal{I}_{\textrm{sel}}$| via a median, it is helpful to have the following median concentration result. Lemma B.7 Let |$\{W_i\}_{i=1}^m \stackrel{\textrm{iid}}{\sim } {\mathsf{N}}(0, 1)$|. Fix |$p \in (0, 1)$| and choose |$w_q \ge 0$| so that |$q := {\mathbb{P}}(W^2 \le w_q^2) = 2\varPhi(w_q) - 1 > p$|. Then \begin{equation*} {\mathbb{P}}\left({\mathsf{quant}}_p \left(\{W_i^2\}_{i=1}^m\right) \geq w_q^2\right) \leq \exp\left(-\frac{m (q-p)^2}{2(q-p)/3 + 2q(1-q)}\right). \end{equation*} Proof. Note that |${\mathsf{quant}}_p (\{W_i^2\}_{i=1}^m) \geq w_q^2$| if and only if |$\frac{1}{m} \sum _{i=1}^m 1\left \{W_i^2 \leq w_q^2\right \} \leq p$|. Using that |${\textrm{Var}}\bigg(1\left \{W_i^2 \le w_q^2\right \}\bigg) = q(1 - q)$|, Bernstein’s inequality applied to the i.i.d. sum |$\sum _{i = 1}^m (1\{W_i^2 \le w_q^2\} - q)$| gives the result. We now control the median of the perturbed vector |$b \in{\mathbb{R}}^m$|. Since |$|\mathcal{I}^{\textrm{out}}| \leq p_{\textrm{fail}} m$|, we have the deterministic bound \begin{equation*} {\mathsf{med}}\left(\{b_i\}_{i=1}^m\right) \leq{\mathsf{quant}}_{\frac{1}{2(1-p_{\textrm{fail}})}} \left(\{ \langle a_i, x_\star \rangle ^2\}_{i\in \mathcal{I}^{\textrm{in}}}\right), \end{equation*} so by upper bounding the right-hand quantity we can upper bound |${\mathsf{med}}(\{b_i\})$|, which in turn allows us to control |$\mathcal{I}_{\textrm{sel}}$|.
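As a quick aside, this deterministic comparison is easy to check numerically. The sketch below (our illustration; the contamination level and the adversarially large outliers are arbitrary choices) corrupts a |$p_{\textrm{fail}}$| fraction of the measurements and compares the median of |$b$| with the |$\frac{1}{2(1-p_{\textrm{fail}})}$| empirical quantile of the inlying values.

```python
import numpy as np

# Numerical check (illustrative, not part of the proof) of the deterministic
# bound med(b) <= quant_{1/(2(1-p_fail))} of the inlying values <a_i, x_star>^2.
rng = np.random.default_rng(1)
m, n, p_fail = 2000, 20, 0.2
x_star = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = (A @ x_star) ** 2
out = rng.choice(m, size=int(p_fail * m), replace=False)
b_corrupt = b.copy()
b_corrupt[out] = 1e6 * rng.random(out.size)      # adversarially large entries

inliers = np.delete(b, out)                      # uncorrupted values <a_i, x_star>^2
level = 1.0 / (2.0 * (1.0 - p_fail))
print("median of corrupted b      :", np.median(b_corrupt))
print("inlier quantile at", round(level, 3), ":", np.quantile(inliers, level))
# the median should not exceed the inlier quantile, whatever the corruption
```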
Recalling the definition (17) of |$w_q$| and |$q_{\textrm{fail}}$|, set |$\delta := q_{\textrm{fail}} - \frac{1}{2(1 - p_{\textrm{fail}})} = \frac{1 - 2 p_{\textrm{fail}}}{4 (1 - p_{\textrm{fail}})}$|. Lemma B.7 with |$q = q_{\textrm{fail}}$|, |$p = \frac{1}{2(1 - p_{\textrm{fail}})}$| and the fact that |$|\mathcal{I}^{\textrm{in}}| \ge (1 - p_{\textrm{fail}}) m$| then implies \begin{align} &{\mathbb{P}}\left({\mathsf{med}}(\{b_i\}_{i=1}^m) \ge w_q^2 \left\|{x_\star}\right\|_2^2 \right) \le{\mathbb{P}}\left({\mathsf{quant}}_{(2(1 - p_{\textrm{fail}}))^{-1}}(\{ \langle a_i, x_\star \rangle ^2\}_{ i \in \mathcal{I}^{\textrm{in}}}) \ge w_q^2 \left\|{x_\star}\right\|_2^2\right) \nonumber \\ & \qquad ~ \le \exp\left(-\frac{m(1 - p_{\textrm{fail}})\delta^2}{2\delta/3 + 2\delta(1-\delta)}\right) = \exp\left(-\frac{3(1 - 2p_{\textrm{fail}}) m}{4 (2 + 6(1 - \delta))}\right) \le \exp\left(-\frac{3}{32}(1-2p_{\textrm{fail}})m\right). \end{align} (B.8) We now consider the inlying indices |$i$| for which |$ \langle a_i, d_\star \rangle ^2$| is small; again letting |$w_q \ge 0$| be defined as in the quantile definition (17), we define \begin{equation*} \mathcal{I}^{\textrm{in}}_q := \left\{i \in \mathcal{I}^{\textrm{in}} \mid \langle a_i, d_\star \rangle ^2 \le w_q^2\right\} = \left\{i \in \mathcal{I}^{\textrm{in}} \mid \langle a_i, x_\star \rangle ^2 \le w_q^2 \left\|{x_\star}\right\|_2^2 \right\}. \end{equation*} Lemma B.8 Let the set of inliers |$\mathcal{I}^{\textrm{in}}_q$| be defined as above, and let |$\delta _q = 1 - {\mathbb{E}}[W^2 \mid W^2 \le w_q^2]$| for |$W \sim{\mathsf{N}}(0, 1)$|. Then for all |$t \ge 0$| we have \begin{equation*} {\mathbb{P}}\left(\frac{1}{|\mathcal{I}^{\textrm{in}}_q|} \sum_{i \in \mathcal{I}^{\textrm{in}}_q} (1 - a_{i,1}^2) \le \delta_q - t \mid \mathcal{I}^{\textrm{in}}_q \right) \le \exp\left(-\frac{2 |\mathcal{I}^{\textrm{in}}_q| t^2}{ w_q^2}\right). \end{equation*} We defer the proof of Lemma B.8 to the end of this subsection, continuing on with the proof of Lemma B.3. We now integrate out the conditioning in Lemma B.8. Recalling the definition (17) of |$w_q$| in terms of |$p_{\textrm{fail}}$|, we have that |$q_{\textrm{fail}} = \frac{3 - 2 p_{\textrm{fail}}}{4(1 - p_{\textrm{fail}})}> \frac{1}{2}$| for |$p_{\textrm{fail}} \in \left [{0},{1/2}\right )$|. By Hoeffding’s inequality we have |${\mathbb{P}}(|\mathcal{I}^{\textrm{in}}_q| \le m/8) \le \textrm{e}^{-m/4}$| because |$|\mathcal{I}^{\textrm{in}}| \ge (1 - p_{\textrm{fail}}) m \ge m/2$|, whence we obtain \begin{align*} {\mathbb{P}}\left(\frac{1}{|\mathcal{I}^{\textrm{in}}_q|} \sum_{i \in \mathcal{I}^{\textrm{in}}_q} (1 - a_{i,1}^2) \le \delta_q - t \right) & \le \exp\left(-\frac{m t^2}{4 w_q^2}\right) {\mathbb{P}}(|\mathcal{I}^{\textrm{in}}_q| \ge m/8) + {\mathbb{P}}(|\mathcal{I}^{\textrm{in}}_q| \le m/8) \\ & \le \exp\left(-\frac{m t^2}{4 w_q^2}\right) + \exp\left(-\frac{m}{4}\right). \end{align*} Using the notation |$\delta _q$| in Lemma B.8, for |$t \in [0, 1]$| we define the event \begin{equation*} \mathcal{E} := \left\{{\mathsf{med}}(\{b_i\}_{i = 1}^m) \le w_q^2 \left\|{x_\star}\right\|_2^2, |\mathcal{I}^{\textrm{in}}_q| \ge \frac{m}{8}, \frac{1}{|\mathcal{I}^{\textrm{in}}_q|}\sum_{i \in \mathcal{I}^{\textrm{in}}_q} (1 - a_{i,1}^2) \ge (1 - t) \delta_q \right\}.
\end{equation*} We immediately find that \begin{equation} {\mathbb{P}}(\mathcal{E}) \ge 1 - \exp\left(-\frac{1}{10} (1 - 2 p_{\textrm{fail}}) m\right) - 2 \exp\left(-\frac{m}{4}\right) - \exp\left(-\frac{m t^2 \delta_q^2}{4 w_q^2}\right) \end{equation} (B.9) by the preceding display and inequality (B.8). Recalling the set |$\mathcal{I}^{\textrm{in}}_q = \{i\in \mathcal{I}^{\textrm{in}}: \langle a_i, d_\star \rangle ^2 \leq w_q^2\}$|, we have on the event |$\mathcal{E}$| that the selected inliers satisfy |$\mathcal{I}^{\textrm{in}}_{\textrm{sel}} \subset \mathcal{I}^{\textrm{in}}_q$| (because |${\mathsf{med}}(b) \le w_q^2 \left \|{x_\star }\right \|_2^2$|). Because the selection mechanism takes |$\mathcal{I}_{\textrm{sel}}$| to be the indices of the smallest |$b_i$| in the sample, the set |$\mathcal{I}^{\textrm{in}}_{\textrm{sel}} \subset \mathcal{I}^{\textrm{in}}_q$| consists of inliers with the smallest values of |$a_{i,1}^2$|, and hence \begin{equation*} \frac{1}{|\mathcal{I}^{\textrm{in}}_{\textrm{sel}}|} \sum_{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} (1 - a_{i,1}^2) \ge \frac{1}{|\mathcal{I}^{\textrm{in}}_q|} \sum_{i \in \mathcal{I}^{\textrm{in}}_q} (1 - a_{i, 1}^2). \end{equation*} Moreover, on the event |$\mathcal{E}$| the rightmost sum is positive, and using that |$|\mathcal{I}^{\textrm{in}}_{\textrm{sel}}| \geq |\mathcal{I}^{\textrm{in}}| + |\mathcal{I}_{\textrm{sel}}| - m \geq \big(\frac{1}{2} - p_{\textrm{fail}}\big) m$|, we obtain that on |$\mathcal{E}$| we have \begin{equation*} \frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} (1- a_{i, 1}^2) \ge \frac{1- 2p_{\textrm{fail}}}{2 |\mathcal{I}^{\textrm{in}}_{\textrm{sel}}|} \sum_{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} (1 - a_{i, 1}^2) \ge \frac{1- 2p_{\textrm{fail}}}{2 |\mathcal{I}^{\textrm{in}}_q|} \sum_{i \in \mathcal{I}^{\textrm{in}}_q} (1-a_{i, 1}^2) \ge \frac{1 - 2p_{\textrm{fail}}}{2} \delta_q. \end{equation*} Recalling expression (B.7) thus gives Lemma B.3. Proof of Lemma B.8. Fix any set of indices |$\mathcal{I}_0 \subset [m]$|, and note that by Hoeffding’s inequality for bounded random variables we have \begin{equation*} {\mathbb{P}}\left(\frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} a_{i,1}^2 - {\mathbb{E}}[a_{i,1}^2 \mid a_{i,1}^2 \le w_q^2] \ge t \mid a_{i,1}^2 \le w_q^2 ~ \mbox{for}~ i \in \mathcal{I}_0\right) \le \exp\left(-\frac{2 |\mathcal{I}_0| t^2}{w_q^2}\right) \end{equation*} for |$t \ge 0$|. Recalling the definition |$\delta _q := 1 - {\mathbb{E}}[W^2 \mid W^2 \le w_q^2]$| for |$W \sim{\mathsf{N}}(0, 1)$|, we thus find that for any index set |$\mathcal{I}_0 \subset [m]$|, we have \begin{align*} &{\mathbb{P}}\left(\frac{1}{|\mathcal{I}^{\textrm{in}}_q|} \sum_{i \in \mathcal{I}^{\textrm{in}}_q} (1 - a_{i,1}^2) \le \delta_q - t \mid \mathcal{I}^{\textrm{in}}_q = \mathcal{I}_0\right) \\ & = {\mathbb{P}}\left(\frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} (1 - a_{i,1}^2) \le \delta_q - t \mid a_{i,1}^2 \le w_q^2 ~ \mbox{for}~ i \in \mathcal{I}_0\right) \le \exp\left(-\frac{2|\mathcal{I}_0| t^2}{w_q^2}\right). \end{align*} Noticing that the random vectors |$\{a_i\}_{i \in \mathcal{I}_0}$| are independent of |$\{a_i\}_{i \not \in \mathcal{I}_0}$|, we have for any measurable set |$\mathcal{C} \subset{\mathbb{R}}$| that \begin{equation*} {\mathbb{P}}\left(\sum_{i \in \mathcal{I}^{\textrm{in}}_q} (1 - a_{i,1}^2) \in \mathcal{C} \mid \mathcal{I}^{\textrm{in}}_q = \mathcal{I}_0\right) = {\mathbb{P}}\left(\sum_{i \in \mathcal{I}_0} (1 - a_{i,1}^2) \in \mathcal{C} \mid a_{i,1}^2 \le w_q^2 ~ \mbox{for}~ i \in \mathcal{I}_0\right). \end{equation*} Combining the preceding two displays yields Lemma B.8.
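For intuition about the size of |$\delta _q$|, note that for the standard normal the truncated second moment has the closed form |${\mathbb{E}}[W^2 \mid W^2 \le w_q^2] = 1 - 2 w_q \varphi (w_q) / (2\varPhi (w_q) - 1)$|, where |$\varphi $| denotes the standard normal density; this formula is a standard fact that we add here purely for illustration. The sketch below compares it with a Monte Carlo estimate and with the spread of the conditional averages appearing in Lemma B.8.

```python
import numpy as np
from scipy.stats import norm

# Illustration of delta_q = 1 - E[W^2 | W^2 <= w_q^2] for W ~ N(0, 1).
# The closed form below is the standard truncated-normal second moment; it
# is our own addition for illustration and is not quoted from the paper.
rng = np.random.default_rng(2)
w_q = 1.15                                       # an illustrative threshold
closed_form = 2 * w_q * norm.pdf(w_q) / (2 * norm.cdf(w_q) - 1)

W = rng.standard_normal(2_000_000)
kept = W[W ** 2 <= w_q ** 2]
monte_carlo = 1.0 - np.mean(kept ** 2)
print("delta_q closed form :", round(closed_form, 4))
print("delta_q Monte Carlo :", round(monte_carlo, 4))

# Concentration of the conditional average in Lemma B.8: averages of
# 1 - a_{i,1}^2 over batches of conditioned samples cluster around delta_q.
batch = kept[: (kept.size // 1000) * 1000].reshape(1000, -1)
means = 1.0 - (batch ** 2).mean(axis=1)
print("spread of batch averages:", round(means.std(), 4))
```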
B.3.2 Proof of Lemma B.4 By our assumption (w.l.o.g.) that |$d_\star = e_1$|, we have \begin{equation} \left|\!\left|\!\left|{Z_1} \right|\!\right|\!\right|_{\textrm{op}} = \left\|{\frac{1}{m} \sum_{i=1}^m a_{i, 1} a_{i, \setminus 1} 1\left\{i\in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right\}}\right\|_2. \end{equation} (B.10) Let |$\{a_i^{\prime}\}_{i=1}^m \stackrel{\textrm{iid}}{\sim } {\mathsf{N}}(0, I_n)$| be an independent collection of vectors. The collections |$\{a_i\}_{i \in \mathcal{I}^{\textrm{in}}}$| and |$\{\xi _i\}_{i\in \mathcal{I}^{\textrm{out}}}$| are independent, as are |$a_{i,1}$| and |$a_{i, \setminus 1}$| for each |$i$|. Thus, because |$\mathcal{I}^{\textrm{in}}_{\textrm{sel}} \subset \mathcal{I}^{\textrm{in}}$|, we see that for any measurable set |$\mathcal{C} \subset{\mathbb{R}}^{n-1}$| we have \begin{align} {\mathbb{P}} \left( \sum_{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} a_{i, 1} a_{i, \setminus 1} \in \mathcal{C} \mid \{a_{i, 1}\}_{i\in \mathcal{I}^{\textrm{in}}}, \{\xi_i\}_{i\in \mathcal{I}^{\textrm{out}}}\right) = {\mathbb{P}} \left( \sum_{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} a_{i,1} a_{i, \setminus 1}^{\prime} \in \mathcal{C} \mid \{a_{i, 1}\}_{i\in \mathcal{I}^{\textrm{in}}}, \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right). \end{align} (B.11) Now, we use the standard result [30] that if |$f$| is an |$L$|-Lipschitz function with respect to the |$\ell _2$|-norm, then for any standard Gaussian vector |$Z$| we have |${\mathbb{P}}(f(Z) - {\mathbb{E}}[f(Z)] \ge t) \le \exp \big(-\frac{t^2}{2 L^2}\big)$| for all |$t \ge 0$|. Thus, defining |$\widehat{L}^2 = \frac{1}{m} \sum _{i \in \mathcal{I}^{\textrm{in}}} a_{i,1}^2$|, so that |$\widehat{L}/\sqrt{m}$| bounds the Lipschitz constant (in the vectors |$a^{\prime}_{i, \setminus 1}$|) of the norm appearing below, we obtain \begin{equation*} {\mathbb{E}}\left[\bigg\|{\frac{1}{m} \sum_{i = 1}^m a_{i,1} a_{i, \setminus 1}^{\prime} 1\left\{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right\} }\bigg\|_2^2 \mid \mathcal{I}^{\textrm{in}}_{\textrm{sel}}, \{a_{i,1}\}_{i \in \mathcal{I}^{\textrm{in}}}\right] = \frac{1}{m^2} \sum_{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} a_{i,1}^2 (n - 1) \le \frac{n-1}{m} \widehat{L}^2, \end{equation*} and thus for |$t \ge 0,$| \begin{equation*} {\mathbb{P}}\left(\bigg\|{\frac{1}{m} \sum_{i = 1}^m a_{i,1} a^{\prime}_{i,\setminus 1} 1\left\{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right\} }\bigg\|_2 \ge \widehat{L} \sqrt{\frac{n-1}{m}} + t \mid \{a_{i,1}\}_{i \in \mathcal{I}^{\textrm{in}}}, \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right) \le \exp\left(-\frac{m t^2}{2\widehat{L}^2}\right). \end{equation*} Moreover, |$\widehat{L}$| is |$\{a_{i,1}\}_{i \in \mathcal{I}^{\textrm{in}}}$|-measurable, and |${\mathbb{P}}(\widehat{L} \ge 2) \le \exp (-m / 2)$|, again by the Lipschitz concentration of Gaussian random variables. We thus apply Equation (B.11) and obtain \begin{align*} &{\mathbb{P}}\left( \bigg\|{\frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} a_{i,1} a_{i, \setminus 1}}\bigg\|_2 \ge 2 \sqrt{\frac{n - 1}{m}} + t \right) \\ & \le{\mathbb{P}} \left( \bigg\|{\frac{1}{m} \sum_{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} a_{i,1} a_{i, \setminus 1}}\bigg\|_2 \ge 2 \sqrt{\frac{n - 1}{m}} + t \mid \widehat{L} \le 2 \right) {\mathbb{P}}(\widehat{L} \le 2) + {\mathbb{P}}(\widehat{L} \ge 2) \\ & \le \exp\left(-\frac{mt^2}{8}\right) + \exp\left(-\frac{m}{2}\right). \end{align*} Recalling the equality (B.10) on |$\left |\!\left |\!\left |{Z_1} \right |\!\right |\!\right |_{\textrm{op}}$| shows that the previous display gives the lemma.
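To see the scale of the cross term in Lemma B.4, the following sketch (illustrative sizes, with the same median-based selection rule as in the earlier sketch; none of this is part of the proof) computes |$\|\frac{1}{m}\sum _{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} a_{i,1} a_{i,\setminus 1}\|_2$| on simulated data and compares it with the benchmark |$2\sqrt{n/m}$|.

```python
import numpy as np

# Illustration of Lemma B.4: the cross term built from selected inliers has
# Euclidean norm on the order of sqrt(n/m).  Sizes and the median-based
# selection rule are illustrative assumptions, as in the earlier sketch.
rng = np.random.default_rng(3)
m, n, p_fail = 4000, 50, 0.1
A = rng.standard_normal((m, n))
b = A[:, 0] ** 2                                 # inliers, with d_star = e_1
out = rng.choice(m, size=int(p_fail * m), replace=False)
b[out] = rng.exponential(1.0, size=out.size)
inlier = np.ones(m, dtype=bool); inlier[out] = False
sel_in = (b <= np.median(b)) & inlier            # selected inliers

cross = (A[sel_in, :1] * A[sel_in, 1:]).sum(axis=0) / m   # (1/m) sum a_{i,1} a_{i,\1}
print("norm of the cross term :", round(np.linalg.norm(cross), 4))
print("benchmark 2 sqrt(n/m)  :", round(2 * np.sqrt(n / m), 4))
```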
B.3.3 Proof of Lemma B.5 Our first observation is simply the definition of the projection operator |$P_\perp $|, which gives \begin{equation*} \left|\!\left|\!\left|{Z_2 - z_2 (I_n - d_\star{d_\star}^T)} \right|\!\right|\!\right|_{\textrm{op}} = \bigg|\!\bigg|\!\bigg|{\frac{1}{m}\sum_{i\in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} a_{i, \setminus 1} a_{i, \setminus 1}^T - z_2 I_{n - 1}}\bigg|\!\bigg|\!\bigg|_{\textrm{op}}. \end{equation*} As in the proof of Lemma B.4, we let |$\{a_i^{\prime}\}_{i=1}^m \stackrel{\textrm{iid}}{\sim } {\mathsf{N}}(0, I_n)$| be an independent copy of the |$a_i$|. The independence of |$\{a_i\}_{i\in \mathcal{I}^{\textrm{in}}}$| and |$\{\xi _i\}_{i\in \mathcal{I}^{\textrm{out}}}$| implies that for any measurable set |$\mathcal{C}$| we have \begin{align*} {\mathbb{P}} \left(\sum_{i\in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} (a_{i, \setminus 1} a_{i, \setminus 1}^T - I_{n - 1}) \in \mathcal{C} \mid \{a_{i, 1}\}_{i\in \mathcal{I}^{\textrm{in}}}, \{\xi_i\}_{i\in \mathcal{I}^{\textrm{out}}}\right) &= {\mathbb{P}} \left(\sum_{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}} \left(a^{\prime}_{i, \setminus 1} {a^{\prime}_{i, \setminus 1}}^T - I_{n - 1} \right) \in \mathcal{C} \mid \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right). \end{align*} As |$\{a_{i, \setminus 1}^{\prime}\}_{i=1}^m$| are i.i.d. Gaussian vectors, Lemma 3.1 implies that for |$t\in [0, 1]$|, \begin{equation*} {\mathbb{P}} \left(\left|\!\left|\!\left|{\frac{1}{m} \sum_{i=1}^m \left(a^{\prime}_{i, \setminus 1} {a^{\prime}_{i, \setminus 1}}^T - I_{n - 1} \right)1\left\{i \in \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right\} } \right|\!\right|\!\right|_{\textrm{op}} \geq C \sqrt{\frac{n}{m} + t} \mid \mathcal{I}^{\textrm{in}}_{\textrm{sel}}\right) \leq \exp(-mt) \end{equation*} for some constant |$C$|. The claim of the lemma follows by integration of the above inequality. B.3.4 Proof of Lemma B.6 For notational convenience, let us denote the selected outlying indices by |$\mathcal{I}^{\textrm{out}}_{\textrm{sel}} = \mathcal{I}^{\textrm{out}}\cap \mathcal{I}_{\textrm{sel}}$|. We begin our proof by considering Model M1, in which case the proof is essentially identical to that of Lemma B.5 (see Appendix B.3.3). Indeed, we have the identity \begin{equation*} \left|\!\left|\!\left|{Z_3 - z_3 I_n} \right|\!\right|\!\right|_{\textrm{op}} = \bigg|\!\bigg|\!\bigg|{\frac{1}{m}\sum_{i=1}^m \left(a_i a_i^T - I_n\right)1\left\{i\in \mathcal{I}^{\textrm{out}}_{\textrm{sel}}\right\} }\bigg|\!\bigg|\!\bigg|_{\textrm{op}}, \end{equation*} and so following the completely parallel route of introducing the independent sample |$\{a_i^{\prime}\}_{i = 1}^m \stackrel{\textrm{iid}}{\sim } {\mathsf{N}}(0, I_n)$|, we have for any |$t \ge 0$| that \begin{align*} {\mathbb{P}}\left(\left|\!\left|\!\left|{Z_3 - z_3 I_n} \right|\!\right|\!\right|_{\textrm{op}} \ge t\right) & = {\mathbb{E}}\left[{\mathbb{P}}\left(\left|\!\left|\!\left|{Z_3 - z_3 I_n} \right|\!\right|\!\right|_{\textrm{op}} \ge t \mid \mathcal{I}^{\textrm{out}}_{\textrm{sel}}\right)\right] \\ & = {\mathbb{E}}\left[{\mathbb{P}}\left( \bigg|\!\bigg|\!\bigg|{\frac{1}{m} \sum_{i = 1}^m (a_i^{\prime} {a_i^{\prime}}^T - I_n) 1\left\{i \in \mathcal{I}^{\textrm{out}}_{\textrm{sel}}\right\}}\bigg|\!\bigg|\!\bigg|_{\textrm{op}} \ge t \mid \mathcal{I}^{\textrm{out}}_{\textrm{sel}} \right)\right]. \end{align*} Applying Lemma 3.1 to the vectors |$a_i^{\prime} 1\left \{i \in \mathcal{I}^{\textrm{out}}_{\textrm{sel}}\right \}$| gives the result.
In the case in which we assume Model M2, we have the deterministic upper bound \begin{equation*} \left|\!\left|\!\left|{Z_3} \right|\!\right|\!\right|_{\textrm{op}} = \left|\!\left|\!\left|{\frac{1}{m} \sum_{i=1}^m a_i a_i^T 1\left\{i\in \mathcal{I}^{\textrm{out}}_{\textrm{sel}}\right\}} \right|\!\right|\!\right|_{\textrm{op}} \leq \left|\!\left|\!\left|{\frac{1}{m} \sum_{i=1}^m a_i a_i^T 1\left\{i\in \mathcal{I}^{\textrm{out}}\right\}} \right|\!\right|\!\right|_{\textrm{op}}. \end{equation*} As |$a_i 1\left \{i \in \mathcal{I}^{\textrm{out}}\right \}$| is an |$O(1)$|-sub-Gaussian vector, Lemma 3.1 implies \begin{equation*} {\mathbb{P}} \left(\left|\!\left|\!\left|{\frac{1}{m} \sum_{i=1}^m \left(a_{i} a_{i}^T - I_{n} \right)1\left\{i \in \mathcal{I}^{\textrm{out}}\right\} } \right|\!\right|\!\right|_{\textrm{op}} \geq C\sqrt{\frac{n}{m} + t} \mid \mathcal{I}^{\textrm{out}}_{\textrm{sel}} \right) \leq \exp(-mt). \end{equation*} Since |$|\mathcal{I}^{\textrm{out}}|/m \le p_{\textrm{fail}}$|, the desired claim follows via the triangle inequality. C. Technical results C.1 Proof of Lemma 3.2 Let |$p(z)$| denote the density of |$z$| and |$p_c(z^{\prime}) = p(z^{\prime}) / \int _{-c}^c p(z) \,\mathrm{d}z$| for |$z^{\prime} \in [-c, c]$|. Let |$F_c(z) = \int _{-c}^{z} p_c(z^{\prime}) \,\mathrm{d}z^{\prime}$| be the CDF of |$p_c$|, and define the mapping |$T : [-c, c] \to [-c, c]$| by \begin{equation*} T(z) = 2 c F_c(z) - c ~ \mbox{for}~ z \in [-c, c], ~\mbox{i.e.}~ T(z) ~\mbox{is the}~ u ~\mbox{s.t.}~ F_c(z) = \frac{u + c}{2c}. \end{equation*} Evidently, for |$Z \sim P_c$| we have |$T(Z) \sim{\mathsf{Uni}}[-c, c]$|, and by the symmetry and monotonicity properties of |$p$| and |$p_c$| we have |$|T(z)| \ge |z|$| for |$z \in [-c, c]$|. In particular, letting |$U \sim{\mathsf{Uni}}[-c ,c]$|, we have \begin{equation*} {\textrm{Var}}(U) = {\mathbb{E}}[U^2] = {\mathbb{E}}[T(Z)^2 \mid |Z| \le c] \ge{\mathbb{E}}[Z^2 \mid |Z| \le c]. \end{equation*} Using that |${\textrm{Var}}(U) = \frac{1}{2c} \int _{-c}^c u^2 \,\mathrm{d}u = \frac{c^2}{3}$| gives the result. C.2 Proof of Lemma B.1 The proof is an essentially standard concentration and covering number argument, with a few minor wrinkles. First, we note that \begin{equation*} {\textrm{Var}}(| \langle u, a_i \rangle \langle v, a_i \rangle |) \le{\mathbb{E}}[| \langle u, a_i \rangle \langle v, a_i \rangle |^2] \stackrel{(i)}{\le} \sup_{u \in{\mathbb{S}}^{n-1}} {\mathbb{E}}[ \langle u, a_i \rangle ^4] \le 2 e \sigma^4, \end{equation*} where inequality |$(i)$| is Cauchy–Schwarz and the final inequality follows from Lemma 3.1. Thus, the lower-tail Bernstein inequality (Lemma A.2), applied to the non-negative random variables |$| \langle u, a_i \rangle \langle v, a_i \rangle | \ge 0$| with variance bounded by |$2 e \sigma ^4$|, implies that \begin{equation} {\mathbb{P}}\left(\frac{1}{m} \sum_{i = 1}^m | \langle u, a_i \rangle \langle v, a_i \rangle | - \kappa_{\textrm{st}}(u, v) \le -t\right) \le \exp\left(-\frac{m t^2}{4e \sigma^4}\right). \end{equation} (C.1) Define |$Z_{u,v} := \frac{1}{m} \sum _{i = 1}^m | \langle u, a_i \rangle \langle v, a_i \rangle |$| for shorthand.
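Before turning to the uniform (covering) step, the pointwise concentration behind inequality (C.1) is easy to visualize: for a fixed pair |$u, v$| on the sphere, |$Z_{u,v}$| fluctuates around its expectation at scale roughly |$\sigma ^2/\sqrt{m}$|. The sketch below (Gaussian measurements with |$\sigma = 1$| and illustrative sizes; the expectation is replaced by a large-sample estimate) is our own illustration of this behaviour.

```python
import numpy as np

# Pointwise concentration of Z_{u,v} = (1/m) sum_i |<u, a_i><v, a_i>| around
# its expectation, as in (C.1).  Gaussian a_i with sigma = 1 and the sample
# sizes are illustrative assumptions; the expectation is estimated from a
# large independent reference sample.
rng = np.random.default_rng(4)
n, m, reps = 30, 500, 200
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)

def Z(u, v, m_samples):
    A = rng.standard_normal((m_samples, n))
    return np.mean(np.abs((A @ u) * (A @ v)))

reference = Z(u, v, 200_000)                     # stand-in for the expectation
draws = np.array([Z(u, v, m) for _ in range(reps)])
print("reference mean            :", round(reference, 4))
print("std of Z_{u,v} over draws :", round(draws.std(), 4))
print("1/sqrt(m)                 :", round(1 / np.sqrt(m), 4))
```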
Using that for vectors |$w, x \in{\mathbb{R}}^n$| we have |$\sum _{i = 1}^m | \langle a_i, w \rangle \langle a_i, x \rangle | \le \left \|{A w}\right \|_2 \left \|{A x}\right \|_2$| by Cauchy–Schwarz, we see that for any |$u, v, u^{\prime}, v^{\prime} \in{\mathbb{S}}^{n-1}$|, \begin{align*} |Z_{u,v} - Z_{u^{\prime},v^{\prime}}| & \le \frac{1}{m} \sum_{i = 1}^m | \langle a_i, u \rangle \langle a_i, v \rangle - \langle a_i, u^{\prime} \rangle \langle a_i, v^{\prime} \rangle | \\ & \le \frac{1}{m} \sum_{i = 1}^m \left(| \langle a_i, u - u^{\prime} \rangle \langle a_i, v \rangle | + | \langle a_i, u \rangle \langle a_i, v - v^{\prime} \rangle |\right) \\ & \le \frac{1}{m} \left\|{A (u - u^{\prime})}\right\|_2 \left\|{A v}\right\|_2 + \frac{1}{m} \left\|{A (v - v^{\prime})}\right\|_2 \left\|{A u}\right\|_2 \le \frac{\left|\!\left|\!\left|{A} \right|\!\right|\!\right|_{\textrm{op}}^2}{m} \left(\left\|{u - u^{\prime}}\right\|_2 + \left\|{v - v^{\prime}}\right\|_2\right). \end{align*} Now, let |$\mathcal{N}$| be an |$\epsilon $|-cover of the sphere |${\mathbb{S}}^{n-1}$|, which we use to control |$\inf _{u, v \in{\mathbb{S}}^{n-1}} \{Z_{u, v} - {\mathbb{E}}[Z_{u, v}]\}$|. Using the previous display and the definition of an |$\epsilon $|-cover, we obtain \begin{align*} {\mathbb{P}}\left( \inf_{u, v \in{\mathbb{S}}^{n-1}} \{Z_{u,v} - \kappa_{\textrm{st}}(u, v)\} \le -t \right) & \le{\mathbb{P}}\left( \min_{u, v \in \mathcal{N}} \{Z_{u, v} - \kappa_{\textrm{st}}(u, v)\} \le -t + \frac{2 \left|\!\left|\!\left|{A^TA} \right|\!\right|\!\right|_{\textrm{op}}}{m} \epsilon \right). \end{align*} From Lemma 3.1, we know that |$|\!|\!|{A^TA}|\!|\!|_{\textrm{op}}$| is well concentrated; splitting on the event that |$|\!|\!|{A^TA}|\!|\!|_{\textrm{op}} \ge C m \sigma ^2$|, we thus have for some numerical constant |$C < \infty $| that \begin{align} &{\mathbb{P}}\left( \inf_{u, v \in{\mathbb{S}}^{n-1}} \{Z_{u,v} - \kappa_{\textrm{st}}(u, v)\} \le -t \right) \nonumber \\ & \le{\mathbb{P}}\left( \min_{u, v \in \mathcal{N}} \{Z_{u, v} - \kappa_{\textrm{st}}(u, v)\} \le -t + 2C\sigma^2 \epsilon \right) + {\mathbb{P}}\left(\left|\!\left|\!\left|{A^TA} \right|\!\right|\!\right|_{\textrm{op}} \ge C m \sigma^2\right) \nonumber \\ & \le \mathop{\textrm{card}}(\mathcal{N})^2 \exp\left(-\frac{c m \left[{t - C \sigma^2 \epsilon}\right]_+^2}{\sigma^4} \right) + \exp(-m), \end{align} (C.2) where the first term comes from Bernstein’s inequality (C.1) and the second from Lemma 3.1. Now, let us assume that |$\mathcal{N}$| is an |$\epsilon $|-cover of |${\mathbb{S}}^{n-1}$| with minimal cardinality, which by standard volume arguments [44, Lemma 2] satisfies |$N(\epsilon ) := \mathop{\textrm{card}}(\mathcal{N}) \le (1 + 2/\epsilon )^n$| for |$\epsilon> 0$|. Noting that |$(1 + 2 / \epsilon )^{2n} \le \exp (n / \epsilon )$| for |$\epsilon \le \epsilon _0 := 0.21398$|, we may replace inequality (C.2) with \begin{equation*} {\mathbb{P}}\left( \inf_{u, v \in{\mathbb{S}}^{n-1}} \{Z_{u,v} - \kappa_{\textrm{st}}(u, v)\} \le -t \right) \le \exp\left(\frac{n}{\epsilon} - \frac{c m \left[{t - C \sigma^2 \epsilon}\right]_+^2}{\sigma^4} \right) + \textrm{e}^{-m} \end{equation*} valid for all |$\epsilon \le \epsilon _0$|.
Now, if we set |$\epsilon = \sqrt [3]{n/m}$| and define |$t = \sigma ^2 \overline{t} + (C + 1)\sigma ^2 \epsilon $|, then we find that if |$m/n \ge \epsilon _0^{-3}$|, we have \begin{equation*} {\mathbb{P}}\left( \inf_{u, v \in{\mathbb{S}}^{n-1}} \{Z_{u,v} - \kappa_{\textrm{st}}(u, v)\} \le -\sigma^2 \overline{t} - (C + 1) \sigma^2 \sqrt[3]{\frac{n}{m}} \right) \le \exp\left(-c m \overline{t}^2\right) + \textrm{e}^{-m}. \end{equation*} This is the desired result. C.3 Properties of Gaussian random variables Lemma C.1 Let |$W \sim{\mathsf{N}}(0, 1)$|. Then for all |$c \in{\mathbb{R}}_+$| \begin{equation*} {\mathbb{E}}\left[W^2 \mid W^2 \leq c \right] \le 1 - \frac{1}{2} c {\mathbb{P}}(W^2> c). \end{equation*} Proof. By the law of total probability, we have \begin{equation*} {\mathbb{E}}[W^2] = {\mathbb{E}}[W^2 \mid W^2 \leq c] {\mathbb{P}}(W^2 \leq c) + {\mathbb{E}}[W^2 \mid W^2> c] {\mathbb{P}}(W^2 > c). \end{equation*} Using that |${\mathbb{E}}[W^2] = 1$| yields \begin{align*} 1 - {\mathbb{E}}[W^2 \mid W^2 \leq c] = {\mathbb{P}}(W^2> c) \left({\mathbb{E}}[W^2 \mid W^2 > c] -{\mathbb{E}}[W^2 \mid W^2 \leq c] \right). \end{align*} For |$t \in [0, c]$|, we have |${\mathbb{P}}(W^2 \in [c-t, c]) \leq{\mathbb{P}}(W^2 \in [0, t])$|, so that for such |$t$| we have |${\mathbb{P}}(W^2 \geq c - t \mid W^2 \leq c) \leq{\mathbb{P}}(W^2 \leq t \mid W^2 \leq c)$|, or |${\mathbb{P}}(W^2 \geq t \mid W^2 \leq c) + {\mathbb{P}}(W^2 \geq c-t \mid W^2 \leq c) \leq 1$|. Performing the standard change of variables to compute |${\mathbb{E}}[W^2 \mid W^2 \le c]$|, we thus obtain \begin{align*} {\mathbb{E}}[W^2 \mid W^2 \leq c] &= \int_0^c {\mathbb{P}}(W^2 \geq t \mid W^2 \leq c)\,\mathrm{d}t \\ &= \int_0^c \frac{1}{2}\left[{\mathbb{P}}(W^2 \geq t \mid W^2 \leq c) + {\mathbb{P}}(W^2 \geq c-t \mid W^2 \leq c)\right] \,\mathrm{d}t \\ &\leq \int_0^c \frac{1}{2} \,\mathrm{d}t = \frac{c}{2}. \end{align*} Using that |${\mathbb{E}}[W^2 \mid W^2> c] \geq c$| gives the lemma. Lemma C.2 Let |$X \in{\mathbb{C}}^{n \times n}$| be rank 2 and Hermitian and |$z = \frac{1}{\sqrt{2}} ({\mathsf{N}}(0, I_n) + \iota{\mathsf{N}}(0, I_n))$| be standard complex normal. Then |${\mathbb{E}}[|z^{\textrm{H}} X z|] \ge \frac{2 \sqrt{2}}{\pi } \left \|{X}\right \|_{\textrm{Fr}}$|. Proof. Without loss of generality, we assume that |$\left \|{X}\right \|_{\textrm{Fr}} = 1$|, so that |$X = \lambda _1 u_1 u_1^{\textrm{H}} + \lambda _2 u_2 u_2^{\textrm{H}}$| for |$\lambda _1^2 + \lambda _2^2 = 1$| and |$u_i$| orthonormal. Thus, if |$Z_1, Z_2$| are standard normal, we have \begin{equation*} z^{\textrm{H}} X z = \lambda_1 | \langle z, u_1 \rangle |^2 + \lambda_2 | \langle z, u_2 \rangle |^2 \stackrel{{\textrm{dist}}}{=} \lambda_1 Z_1^2 + \lambda_2 Z_2^2. \end{equation*} Without loss of generality, we assume |$\lambda _1 \ge 0$|, so that by inspection we have for |$\lambda = \lambda _1$| that |${\mathbb{E}}[|\lambda _1 Z_1^2 + \lambda _2 Z_2^2|] \ge{\mathbb{E}}[|\lambda Z_1^2 - \sqrt{1 - \lambda ^2} Z_2^2|]$|. Letting |$f(\lambda ) = {\mathbb{E}}[|\lambda Z_1^2 - \sqrt{1 - \lambda ^2} Z_2^2|]$|, we have that \begin{equation*} f^{\prime}(\lambda) = {\mathbb{P}}\left(\frac{\lambda}{\sqrt{1 - \lambda^2}} Z_1^2 \ge Z_2^2\right) - {\mathbb{P}}\left(\frac{\lambda}{\sqrt{1 - \lambda^2}} Z_1^2 \le Z_2^2\right) = \begin{cases}>0 & \mbox{if}~ \lambda > \frac{1}{\sqrt{2}} \\ = 0 & \mbox{if}~ \lambda = \frac{1}{\sqrt{2}} \\ < 0 & \mbox{if}~ \lambda < \frac{1}{\sqrt{2}}.
\end{cases} \end{equation*} In particular, |$f$| is quasi-convex and minimized at |$\lambda = 1 / \sqrt{2}$|⁠, and thus \begin{equation*} \frac{1}{\left\|{X}\right\|_{\textrm{Fr}}} {\mathbb{E}}[|z^{\textrm{H}} X z|] \ge \frac{1}{\sqrt{2}} {\mathbb{E}}[|Z_1^2 - Z_2^2|] = \frac{2 \sqrt{2}}{\pi}, \end{equation*} where the final equality is a calculation. © The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) TI - Solving (most) of a set of quadratic equalities: composite optimization for robust phase retrieval JO - Information and Inference: A Journal of the IMA DO - 10.1093/imaiai/iay015 DA - 2019-09-19 UR - https://www.deepdyve.com/lp/oxford-university-press/solving-most-of-a-set-of-quadratic-equalities-composite-optimization-KLoaikSDFP SP - 471 VL - 8 IS - 3 DP - DeepDyve ER -