Information and Inference: A Journal of the IMA, Volume 7 (4) – Dec 11, 2018


- Publisher
- Oxford University Press
- Copyright
- © The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.
- ISSN
- 2049-8764
- eISSN
- 2049-8772
- D.O.I.
- 10.1093/imaiai/iay002

Abstract Many problems in high-dimensional statistics and optimization involve minimization over non-convex constraints—for instance, a rank constraint for a matrix estimation problem—but little is known about the theoretical properties of such optimization problems for a general non-convex constraint set. In this paper we study the interplay between the geometric properties of the constraint set and the convergence behavior of gradient descent for minimization over this set. We develop the notion of local concavity coefficients of the constraint set, measuring the extent to which convexity is violated, which governs the behavior of projected gradient descent over this set. We demonstrate the versatility of these concavity coefficients by computing them for a range of problems in low-rank estimation, sparse estimation and other examples. Through our understanding of the role of these geometric properties in optimization, we then provide a convergence analysis when projections are calculated only approximately, leading to a more efficient method for projected gradient descent in low-rank estimation problems. 1. Introduction Non-convex optimization problems arise naturally in many areas of high-dimensional statistics and data analysis, and pose particular difficulty due to the possibility of becoming trapped in a local minimum or failing to converge. Nonetheless, recent results have begun to extend some of the broad convergence guarantees that have been achieved in the literature on convex optimization, into a non-convex setting. In this work, we consider a general question: when minimizing a function g(x) over a non-convex constraint set |$\mathscr{C}\subset \mathbb{R}^{d}$|, $$ \widehat{x} = \mathop{\operatorname{arg\,min}}\limits_{x\in\mathscr{C}} \mathsf{g}(x),$$ what types of conditions on g and on |$\mathscr{C}$| are sufficient to guarantee the success of projected gradient descent? 
More concretely, when can we expect that optimization of this non-convex problem will converge at essentially the same rate as a convex problem? In examining this question, we find that local geometric properties of the non-convex constraint set |$\mathscr{C}$| are closely tied to the behavior of gradient descent methods, and the main results of this paper study the equivalence between local geometric conditions on the boundary of |$\mathscr{C}$|, and the local behavior of optimization problems constrained to |$\mathscr{C}$|. The main contributions of this paper are the following: We develop the notion of local concavity coefficients of a non-convex constraint set |$\mathscr{C}$|, characterizing the extent to which |$\mathscr{C}$| is non-convex relative to each of its points. These coefficients, a generalization of the notions of prox-regular sets and sets of positive reach in the analysis literature, bound the set’s violations of four different characterizations of convexity—e.g. convex combinations of points must lie in the set, and the first-order optimality conditions for minimization over the set—with respect to a structured norm, such as the ℓ1 norm for sparse problems, chosen to capture the natural structure of the problem. The local concavity coefficients allow us to characterize the geometric properties of the constraint set |$\mathscr{C}$| that are favorable for analyzing the convergence of projected gradient descent. Our key results Theorems 2.1 and 2.2 prove that these multiple notions of non-convexity are in fact exactly equivalent, shedding light on the interplay between geometric properties such as curvature, and optimality properties such as the first-order conditions, in a non-convex setting. 
We next prove convergence results for projected gradient descent over a non-convex constraint set, minimizing a function g assumed to exhibit restricted strong convexity (RSC) and restricted smoothness (RSM) (these types of conditions are common in the high-dimensional statistics literature—see e.g. the study by Negahban et al., 2009 for background). We also allow for the projection step, i.e. projection to |$\mathscr{C}$|, to be calculated approximately, which enables greater computational efficiency. Our main convergence analysis shows that, as long as we initialize at a point x0 that is not too far away from |$\widehat{x}$|, projected gradient descent converges linearly to |$\widehat{x}$| when the constraint space |$\mathscr{C}$| satisfies the geometric properties described above. Finally, we apply these ideas to a range of specific examples: low-rank matrix estimation (where optimization is carried out under a rank constraint), sparse estimation (with non-convex regularizers such as Smoothly Clipped Absolute Deviation (SCAD) offering a lower-shrinkage alternative to the ℓ1 norm) and several other non-convex constraints. We discuss some interesting differences between constraining vs. penalizing a non-convex regularization function, in the context of sparse estimation. For the low-rank setting, we propose an approximate projection step that provides a computationally efficient alternative for low-rank estimation problems, which we then explore empirically with simulations. 2. Concavity coefficients for a non-convex constraint space We begin by studying several properties which describe the extent to which the constraint set |$\mathscr{C}\subset \mathbb{R}^{d}$| deviates from convexity. 
To quantify the concavity of |$\mathscr{C}$|, we will define the (global) concavity coefficient of |$\mathscr{C}$|, denoted |$\gamma = \gamma (\mathscr{C})$|, which we will later expand to local measures of concavity, |$\gamma _{x}(\mathscr{C})$|, indexed over points |$x\in \mathscr{C}$|. We examine several definitions of this concavity coefficient: essentially, we consider four properties that would hold if |$\mathscr{C}$| were convex, and then use γ to characterize the extent to which these properties are violated. Our definitions are closely connected to the notion of prox-regular sets in the analysis literature, and we will discuss this connection in detail in Section 2.3 below. Since we are interested in developing flexible tools for high-dimensional optimization problems, several different norms will appear in the definitions of the concavity coefficients: The Euclidean ℓ2 norm, |$\lVert{\cdot }\rVert _{2}$|. Projections to |$\mathscr{C}$| will always be taken with respect to the ℓ2 norm, and our later convergence guarantees will also be given with respect to this norm. If our variable is a matrix |$X\in \mathbb{R}^{n\times m}$|, the Euclidean ℓ2 norm is known as the Frobenius norm, |$\lVert{{X}}\rVert _{\mathsf{F}}=\sqrt{\sum _{ij} X_{ij}^{2}}$|. A ‘structured’ norm |$\lVert{\cdot }\rVert $|, which can be chosen to be any norm on |$\mathbb{R}^{d}$|. In some cases it may be the ℓ2 norm, but often it will be a different norm reflecting natural structure in the problem. For instance, for a low-rank estimation problem, if |$\mathscr{C}$| is a set of rank-constrained matrices then we will work with the nuclear norm, |$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\textrm{nuc}}$| (defined as the sum of the singular values of the matrix). For sparse signals, we will instead use the ℓ1 norm, |$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert _{1}$|. A norm |$\lVert{\cdot }\rVert ^{\ast }$|, which is the dual norm to the structured norm |$\lVert{\cdot }\rVert $|. 
For low-rank matrix problems, if we work with the nuclear norm, |$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\textrm{nuc}}$|, then the dual norm is given by the spectral norm, |$\lVert{\cdot }\rVert ^{\ast }=\lVert{{\cdot }}\rVert _{\textrm{sp}}$| (i.e. the largest singular value of the matrix, also known as the matrix operator norm). For sparse problems, if |$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert _{1}$| then its dual is given by the |$\ell _{\infty }$| norm, |$\lVert{\cdot }\rVert ^{\ast }=\lVert{\cdot }\rVert _{\infty }$|.
When we take projections to the constraint set |$\mathscr{C}$|, if the minimizer |$P_{{\mathscr{C}}}(z)\in \operatorname{arg\,min}_{x\in \mathscr{C}}\lVert{x-z}\rVert _{2}$| is non-unique, then we write |$P_{{\mathscr{C}}}(z)$| to denote any point chosen from this set. Throughout, any assumption or claim involving |$P_{{\mathscr{C}}}(z)$| should be interpreted as holding for any choice of |$P_{{\mathscr{C}}}(z)$|. From this point on, we will assume without comment that |$\mathscr{C}$| is closed and non-empty so that the set |$\operatorname{arg\,min}_{x\in \mathscr{C}}\lVert{x-z}\rVert _{2}$| is non-empty for any z. We now present several definitions of the concavity coefficient of |$\mathscr{C}$|.
Curvature First, we define γ as a bound on the extent to which a convex combination of two elements of |$\mathscr{C}$| may lie outside of |$\mathscr{C}$|: for |$x,y\in \mathscr{C}$|, \begin{align}\limsup_{t\searrow 0}\frac{\min_{z\in\mathscr{C}}\left\lVert{z - \left((1-t)x + ty\right)}\right\rVert}{t} \leqslant \gamma\lVert{x - y}{\rVert^{2}_{2}}. \end{align} (2.1)
Approximate contraction Secondly, we define γ via a condition requiring that the projection operator |$P_{{\mathscr{C}}}$| is approximately contractive in a neighborhood of the set |$\mathscr{C}$|, that is, |$\lVert{P_{{\mathscr{C}}}(z) - P_{{\mathscr{C}}}(w)}\rVert _{2}$| is not much larger than |$\lVert{z-w}\rVert _{2}$|: for |$x,y\in \mathscr{C}$|,
\begin{align}&\text{For any }z,w\in\mathbb{R}^{d} \text{ with } P_{{\mathscr{C}}}(z)=x\text{ and } P_{{\mathscr{C}}}(w)=y,\nonumber\\&\quad\big(1-\gamma\lVert{z-x}\rVert^{\ast}-\gamma\lVert{w-y}\rVert^{\ast}\big) \cdot \lVert{x - y}\rVert_{2} \leqslant \lVert{z - w}\rVert_{2}.\end{align} (2.2) For convenience in our theoretical analysis we will also consider a weaker ‘one-sided’ version of this property, where one of the two points is assumed to already lie in |$\mathscr{C}$|: for |$x,y\in \mathscr{C}$|, \begin{align}\text{For any }z\in\mathbb{R}^{d}\text{ with }P_{{\mathscr{C}}}(z)=x,\quad \left(1-\gamma\lVert{z-x}\rVert^{\ast}\right) \cdot \lVert{x - y}\rVert_{2} \leqslant \lVert{z-y}\rVert_{2}. \end{align} (2.3)
First-order optimality For our third characterization of the concavity coefficient, we consider the standard first-order optimality conditions for minimization over a convex set, and measure the extent to which they are violated when optimizing over |$\mathscr{C}$|: for |$x,y\in \mathscr{C}$|,1 \begin{align} &\text{For any differentiable }\mathsf{f}:\mathbb{R}^{d}\rightarrow \mathbb{R}\text{ such that }x\text{ is a local minimizer of }\mathsf{f}\text{ over }\mathscr{C},\nonumber\\&\quad\langle{y-x},{\nabla\mathsf{f}(x)}\rangle\geqslant - \gamma\lVert{\nabla\mathsf{f}(x)}\rVert^{\ast}\lVert{y-x}{\rVert^{2}_{2}}. \end{align} (2.4)
Inner products Fourthly, we introduce an inner product condition, requiring that projection to the constraint set |$\mathscr{C}$| behaves similarly to a convex projection: for |$x,y\in \mathscr{C}$|, \begin{align} \text{For any }z\in\mathbb{R}^{d}\text{ with }P_{{\mathscr{C}}}(z)=x,\quad \langle{y-x},\ {z-x}\rangle \leqslant \gamma \lVert{z-x}\rVert^{\ast}\lVert{y-x}{\rVert^{2}_{2}}. 
\end{align} (2.5) We will see later that, by choosing |$\lVert{\cdot }\rVert $| to reflect the structure in the signal (rather than working only with the ℓ2 norm), we are able to obtain a more favorable scaling in our concavity coefficients, and hence to prove meaningful convergence results in high-dimensional settings. On the other hand, regardless of our choice of |$\lVert{\cdot }\rVert $|, note that the ℓ2 norm also appears in the definition of the concavity coefficients, as is natural when working with inner products. Our first main result shows that the above conditions are in fact exactly equivalent: Theorem 2.1 The properties (2.1), (2.2), (2.3), (2.4) and (2.5) are equivalent; that is, for a fixed choice |$\gamma \in [0,\infty ]$|, they either all hold for every |$x,y\in \mathscr{C}$|, or all fail to hold for some |$x,y\in \mathscr{C}$|. Formally, we will define |$\gamma (\mathscr{C})$| to be the smallest value such that the above properties hold: $$ \gamma(\mathscr{C}):= \min\left\{\gamma\in[0,\infty] : \text{Properties 2.1, 2.2, 2.3, 2.4, 2.5 hold for all}\ x,y\in\mathscr{C}\,\right\}\!.$$ However, this global coefficient |$\gamma (\mathscr{C})$| is often of limited use in practical settings, since many sets are well behaved locally but not globally. For instance, the set |$\mathscr{C}\!=\!\left \{X\!\in\! \mathbb{R}^{n\times m}:\operatorname{rank}(X)\!\leqslant\! r\right \}$| has |$\gamma (\mathscr{C})\!=\!\infty $|, but exhibits smooth curvature and good convergence behavior as long as we stay away from rank-degenerate matrices (that is, matrices with rank(X) < r). Since we may often want to ensure convergence in this type of setting where global concavity cannot be bounded, we next turn to a local version of the same concavity bounds. 2.1. Local concavity coefficients We now consider the local concavity coefficients|$\gamma _{x}(\mathscr{C})$|, measuring the concavity in a set |$\mathscr{C}$| relative to a specific point x in the set. 
We will see examples later on where |$\gamma (\mathscr{C})=\infty $| but |$\gamma _{x}(\mathscr{C})$| is bounded for many points |$x\in \mathscr{C}$|. First we define a set of ‘degenerate points’, $$ \mathscr{C}_{\mathsf{dgn}} = \left\{x\in\mathscr{C}:P_{{\mathscr{C}}}\text{ is not continuous over any neighborhood of }x\right\}\!,$$ and then let \begin{align}\gamma_{x}(\mathscr{C}) = \begin{cases} \infty,&x\in\mathscr{C}_{\mathsf{dgn}},\\ \min\left\{\gamma\in[0,\infty]: \text{Property (*) holds for this point }x\text{ and any }y\in\mathscr{C}\right\}\!,&x\not\in\mathscr{C}_{\mathsf{dgn}}, \end{cases} \end{align} (2.6) where the property (*) may refer to any of the four definitions of the concavity coefficients,2 namely (2.1), (2.3), (2.4) or (2.5). We will see shortly why it is necessary to make an exception for the degenerate points |$x\in \mathscr{C}_{\mathsf{dgn}}$| in the definition of these coefficients. Our next main result shows that the equivalence between the four properties (2.1), (2.3), (2.4) and (2.5), established above for the global concavity coefficient |$\gamma (\mathscr{C})$|, also holds for the local coefficients: Theorem 2.2 For all |$x\in \mathscr{C}$|, the definition (2.6) of |$\gamma _{x}(\mathscr{C})$| is equivalent for all four choices of the property (*), namely the conditions (2.1), (2.3), (2.4) or (2.5).
To develop an intuition for the global and local concavity coefficients, we give a simple example in |$\mathbb{R}^{2}$| (relative to the ℓ2 norm, i.e. |$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert ^{\ast }=\lVert{\cdot }\rVert _{2}$|), displayed in Fig. 1. Define |$\mathscr{C}=\left \{x\in \mathbb{R}^{2}: x_{1}\leqslant 0\textrm{ or }x_{2}\leqslant 0\right \}$|. Due to the degenerate point x = (0, 0), we can see that |$\gamma (\mathscr{C})=\infty $| in this case. 
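
As a rough numerical sanity check on this example (a sketch, not part of the original analysis), the following Python snippet estimates the smallest γ satisfying the inner product condition (2.5) at a boundary point x = (t, 0) with t > 0, by maximizing the ratio in (2.5) over a grid of points y in the set. The helper names, the grid and the choice z = (t, t/2) are illustrative assumptions; any z = (t, s) with 0 < s ≤ t has x as a closest point of the set and gives the same ratio.

```python
import numpy as np

def in_C(Y):
    """Membership in C = {x in R^2 : x1 <= 0 or x2 <= 0} (row-wise)."""
    return (Y[:, 0] <= 0) | (Y[:, 1] <= 0)

def gamma_estimate(t, half_width=5.0, n=200):
    """Grid estimate of the smallest gamma in (2.5) at x = (t, 0), t > 0.

    Uses the l2 norm as both the structured norm and its dual.
    """
    x = np.array([t, 0.0])
    z = np.array([t, 0.5 * t])           # assumed test point with P_C(z) = x
    # Symmetric grid that contains 0 exactly, so the axes belong to the grid.
    grid = half_width * np.arange(-n, n + 1) / n
    Y = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)
    Y = Y[in_C(Y)]
    Y = Y[np.linalg.norm(Y - x, axis=1) > 0]   # drop y = x to avoid 0/0
    num = (Y - x) @ (z - x)
    den = np.linalg.norm(z - x) * np.sum((Y - x) ** 2, axis=1)
    return np.max(num / den)

for t in (2.0, 1.0, 0.5):
    print(t, gamma_estimate(t))
```

On this grid the maximum is attained at y = (0, t) and equals 1/(2t), so the estimate grows without bound as t decreases to 0, that is, as x approaches the degenerate point (0, 0).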
The local concavity coefficients are given by $$ \begin{cases} \gamma_{x}(\mathscr{C}) = \infty,&\textrm{ if }x=(0,0),\\[5pt] \gamma_{x}(\mathscr{C}) = \frac{1}{2t},&\textrm{ if } x = (t,0)\textrm{ or }(0,t)\textrm{ for }t>0,\\[5pt] \gamma_{x}(\mathscr{C}) = 0,&\textrm{ if }x_{1}<0\textrm{ or }x_{2}<0.\end{cases}$$ Note that at the degenerate point x = (0, 0), |$\mathscr{C}$| actually contains all convex combinations of this point x with any |$y\in \mathscr{C}$|, and so the curvature condition (2.1) is satisfied with γ = 0. However, |$x\in \mathscr{C}_{\mathsf{dgn}}$|, so we nonetheless set |$\gamma _{x}(\mathscr{C})=\infty $|.
Fig. 1. A simple example of the local concavity coefficients on |$\mathscr{C}=\{x\in \mathbb{R}^{2}:x_{1}\leqslant 0\textrm{ or }x_{2}\leqslant 0\}$|. The gray shaded area represents |$\mathscr{C}$| while the numbers give the local concavity coefficients at each marked point.
Practical high-dimensional examples, such as a rank constraint, will be discussed in depth in Section 5. For example we will see that, for the rank-constrained set |$\mathscr{C}=\left \{X\in \mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right \}$|, the local concavity coefficients satisfy |$\gamma _{X}(\mathscr{C})= \frac{1}{2\sigma _{r}(X)}$| relative to the nuclear norm. In general, the local coefficients can be interpreted as follows: If x lies in the interior of |$\mathscr{C}$|, or if |$\mathscr{C}$| is convex, then |$\gamma _{x}(\mathscr{C})=0$|. If x lies on the boundary of |$\mathscr{C}$|, which is a non-convex set with a smooth boundary, then we will typically see a finite but non-zero |$\gamma _{x}(\mathscr{C})$|. 
|$\gamma _{x}(\mathscr{C})=\infty $| can indicate a non-convex cusp or other degeneracy at the point x. 2.2. Properties We next prove some properties of the local coefficients |$\gamma _{x}(\mathscr{C})$| that will be useful for our convergence analysis, as well as for gaining intuition for these coefficients. First, the global and local coefficients are related in the natural way: Lemma 2.3 For any |$\mathscr{C}$|, |$\gamma (\mathscr{C})=\sup _{x\in \mathscr{C}}\gamma _{x}(\mathscr{C})$|. Next, observe that |$x\mapsto \gamma _{x}(\mathscr{C})$| is not continuous in general (in particular, since |$\gamma _{x}(\mathscr{C})=0$| in the interior of |$\mathscr{C}\!,$| but is often positive on the boundary). However, this map does satisfy upper semi-continuity: Lemma 2.4 The function |$x\mapsto \gamma _{x}(\mathscr{C})$| is upper semi-continuous over |$x\in \mathscr{C}$|. Furthermore, setting |$\gamma _{x}(\mathscr{C})=\infty $| at the degenerate points |$x\in \mathscr{C}_{\mathsf{dgn}}$| is natural in the following sense: the resulting map |$x\mapsto \gamma _{x}(\mathscr{C})$| is the minimal upper semi-continuous map such that the relevant local concavity properties are satisfied. We formalize this with the following lemma: Lemma 2.5 For any |$u\in \mathscr{C}_{\mathsf{dgn}}$|, for any of the four conditions, (2.1), (2.3), (2.4) or (2.5), this property does not hold in any neighborhood of u for any finite γ. That is, for any r > 0, $$ \min\Big\{\gamma\geqslant 0:\text{ Property (*) holds for all {$x\in\mathscr{C}\cap\mathbb{B}_{2}(u,r)$} and for all {$y\in\mathscr{C}$}}\Big\}= \infty,$$ where (*) may refer to any of the four equivalent properties, i.e. (2.1), (2.3), (2.4) and (2.5). (Here, |$\mathbb{B}_{2}(u,r)$| is the ball of radius r around the point u, with respect to the ℓ2 norm.) 
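
Lemma 2.5 can be illustrated on the rank-constrained set with r = 1 in |$\mathbb{R}^{2\times 2}$|, taking the nuclear norm as the structured norm (so the dual norm is the spectral norm). The sketch below is an illustration under these assumptions rather than part of the paper's argument: it evaluates a single instance of the inner product condition (2.5) at x = diag(a, 0), and the particular z and y, as well as the function names, are hypothetical choices. Each such instance gives a lower bound on the local coefficient at x.

```python
import numpy as np

def project_rank(Z, r):
    """Frobenius-norm projection onto {rank <= r} via truncated SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def gamma_bound_near_degenerate(a, r=1):
    """Lower bound on gamma_x at x = diag(a, 0) from one instance of (2.5)."""
    x = np.diag([a, 0.0])
    z = np.diag([a, 0.5 * a])        # off the set; its projection is x
    y = np.diag([0.0, a])            # another rank-1 matrix in the set
    assert np.allclose(project_rank(z, r), x)
    inner = np.sum((y - x) * (z - x))            # trace inner product
    dual = np.linalg.norm(z - x, ord=2)          # spectral norm of z - x
    return inner / (dual * np.linalg.norm(y - x, "fro") ** 2)

for a in (1.0, 0.1, 0.01):
    print(a, gamma_bound_near_degenerate(a))
```

For these choices the bound works out to 1/(2a), so it grows without limit as x approaches the rank-deficient zero matrix, and no finite γ can serve in every neighborhood of that point.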
Finally, the next result shows that the two-sided contraction property (2.2) holds using local coefficients, meaning that all five definitions of concavity coefficients are equivalent: Lemma 2.6 For any |$z,w\in \mathbb{R}^{d}$|, $$ \left(1-\gamma_{P_{{\mathscr{C}}}(z)}(\mathscr{C})\lVert{z-P_{{\mathscr{C}}}(z)}\rVert^{\ast}-\gamma_{P_{{\mathscr{C}}}(w)}(\mathscr{C})\lVert{w-P_{{\mathscr{C}}}(w)}\rVert^{\ast}\right) \cdot \lVert{P_{{\mathscr{C}}}(z)-P_{{\mathscr{C}}}(w)}\rVert_{2} \leqslant \lVert{z - w}\rVert_{2}.$$ In particular, for any fixed c ∈ (0, 1), Lemma 2.6 proves that \begin{align} P_{{\mathscr{C}}}\text{ is }c^{-1}\text{-Lipschitz over the set }\left\{z\in\mathbb{R}^{d}:2\gamma_{P_{{\mathscr{C}}}(z)}(\mathscr{C})\lVert{z-P_{{\mathscr{C}}}(z)}\rVert^{\ast}\leqslant 1-c\right\}, \end{align} (2.7) where the Lipschitz constant is defined with respect to the ℓ2 norm; indeed, for z and w in this set each of the two subtracted terms in Lemma 2.6 is at most (1 − c)/2, so the leading factor is at least c. This provides a sort of converse to our definition of the degenerate points, where we set |$\gamma _{x}(\mathscr{C})=\infty $| for all |$x\in \mathscr{C}_{\mathsf{dgn}}$|, i.e. all points x where |$P_{{\mathscr{C}}}$| is not continuous in any neighborhood of x.
2.3. Connection to prox-regular sets The notion of prox-regular sets and sets of positive reach arises in the literature on non-smooth analysis in Hilbert spaces; see, for instance, the study by Colombo & Thibault (2010) for a comprehensive overview of the key results in this area. The work on prox-regular sets generalizes also to the notion of prox-regular functions (see e.g. Rockafellar & Wets, 2009, Chapter 13.F). A prox-regular set is a set |$\mathscr{C}\subset \mathbb{R}^{d}$| that satisfies3 \begin{align} \langle\,{y-x},\ {z-x}\,\rangle\leqslant\frac{1}{2\rho}\lVert{z-x}\rVert_{2}\lVert{y-x}{\rVert^{2}_{2}}, \end{align} (2.8) for all |$x,y\in \mathscr{C}$| and all |$z\in \mathbb{R}^{d}$| with |$P_{{\mathscr{C}}}(z)=x$|, for some constant ρ > 0. 
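
For intuition about (2.8), a circle of radius R in |$\mathbb{R}^{2}$| is a standard example of a set with reach R. The following sketch (illustrative only; the function name, the sampling scheme and the choice of z are assumptions) evaluates the ratio of the left-hand side of (2.8) to |$\lVert{z-x}\rVert_{2}\lVert{y-x}{\rVert^{2}_{2}}$| for a point z inside the circle whose projection is x, and for many points y on the circle.

```python
import numpy as np

def prox_regularity_ratio(R, n_y=2000, shrink=0.5):
    """Largest value of <y-x, z-x> / (||z-x||_2 * ||y-x||_2^2) on a circle.

    x is a fixed point on the circle of radius R, z = shrink * x lies
    inside the circle on the same ray (so its projection is x), and y
    ranges over a sample of the remaining points of the circle.
    """
    x = np.array([R, 0.0])
    z = shrink * x
    theta = np.linspace(0.0, 2 * np.pi, n_y, endpoint=False)[1:]  # skip y = x
    Y = R * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    num = (Y - x) @ (z - x)
    den = np.linalg.norm(z - x) * np.sum((Y - x) ** 2, axis=1)
    return np.max(num / den)

print(prox_regularity_ratio(2.0))   # compare with 1 / (2 * R)
```

For this particular z the ratio equals 1/(2R) at every y, so (2.8) holds with ρ = R and would fail for any larger ρ.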
To capture the local variations in concavity over the set |$\mathscr{C}$|, |$\mathscr{C}$| is prox-regular with respect to a continuous function |$\rho :\mathscr{C}\rightarrow (0,\infty ]$| if \begin{align} \langle\,{y-x},{z-x}\,\rangle\leqslant\frac{1}{2\rho(x)}\lVert{z-x}\rVert_{2}\lVert{y-x}{\rVert^{2}_{2}} \end{align} (2.9) for all |$x,y\in \mathscr{C}$| and all |$z\in \mathbb{R}^{d}$| with |$P_{{\mathscr{C}}}(z)=x$| (see e.g. Colombo & Thibault, 2010, Theorem 3b).4 Historically, prox-regularity was first formulated via the notion of ‘positive reach’ (Federer, 1959): the parameter ρ appearing in (2.8) is the largest radius such that the projection operator |$P_{{\mathscr{C}}}$| is unique for all points z within distance ρ of the set |$\mathscr{C}$|; in the local version (2.9), the radius is allowed to vary locally as a function of |$x\in \mathscr{C}$|. The definitions (2.8) and (2.9) exactly coincide with our inner product condition (2.5), in the special case that |$\lVert{\cdot }\rVert $| is the ℓ2 norm, by taking |$\gamma = \frac{1}{2\rho }$| or, for the local coefficients, |$\gamma =\frac{1}{2\rho (x)}$|. In the ℓ2 setting, there is substantial literature exploring the equivalence between many different characterizations of prox-regularity, including properties that are equivalent to each of our characterizations of the local concavity coefficients. Here we note a few places in the literature where these conditions appear, and refer the reader to the study by Colombo & Thibault (2010) for historical background on these ideas. The curvature condition (2.1) is proved in the study by Colombo & Thibault (2010, Proposition 9, Theorem 14(q)). 
The one- and two-sided contraction conditions (2.3) and (2.2) appear in the studies by Federer (1959, Section 4.8) and Colombo & Thibault (2010, Theorem 14(g)); the inner product condition (2.5) can be found in the studies by Federer (1959, Section 4.8), Colombo & Thibault (2010, Theorem 3(b)), Canino (1988, Definition 1.5) and Colombo & Marques (2003, Definition 2.1). The first-order optimality condition (2.4) is closely related to the inner product condition, when formulated using the ideas of normal cones and proximal normal cones (for instance, Theorem 6.12 in the study by Rockafellar & Wets, 2009, relates gradients of f to normal cones at x). The distinctions between our definitions and results on local concavity coefficients, and the literature on prox-regularity, center on two key differences: the role of continuity, and the flexibility of the structured norm |$\lVert{\cdot }\rVert $| (rather than the ℓ2 norm). We discuss these two separately.
Continuity In the literature on prox-regular sets, the ‘reach’ function |$x\mapsto \rho (x)\in (0,\infty ]$| is assumed to be continuous (Colombo & Thibault, 2010, Definition 1). Equivalently, we could take a continuous function |$x\mapsto \gamma _{x} = \frac{1}{2\rho (x)}\in [0,\infty )$| to agree with the notation of our local concavity coefficients. However, this is not the same as finding the smallest value γx such that the concavity coefficient conditions are satisfied (locally at the point x). For our definitions, we do not enforce continuity of the map x ↦ γx, and instead define |$\gamma _{x}(\mathscr{C})$| as the smallest value such that the conditions are satisfied. This leads to substantial challenges in proving the equivalence of the various conditions; in Lemma 2.4 we prove that the map is naturally upper semi-continuous, which allows us to show the desired equivalences. 
In terms of practical implications, in order to use the local concavity coefficients to describe the convergence behavior of optimization problems, it is critical that we allow for non-continuity. For instance, suppose that |$\mathscr{C}$| is non-convex, and its interior |$\mathsf{Int}(\mathscr{C})$| is non-empty. For any |$x\in \mathsf{Int}(\mathscr{C})$|, the concavity coefficient conditions are satisfied with γx = 0. In particular, consider the first-order optimality condition (2.4): if |$x\in \mathsf{Int}(\mathscr{C})$| is a local minimizer of some function f, then x is in fact the global minimizer of f(x) and we must have ∇f(x) = 0. On the other hand, since |$\mathscr{C}$| is non-convex, we must have γx > 0 for at least some of the points x on the boundary of |$\mathscr{C}$|. If we do require a continuity assumption on the function x↦γx, then we would be forced to have γx > 0 for some points |$x\in \mathsf{Int}(\mathscr{C})$| as well. This means that γx would not give a precise description of the behavior of first-order methods when constraining to |$\mathscr{C}$|—it would not reveal that non-global minima are impossible in the interior of the set. More generally, we will show in Lemma 3.2 that the local concavity coefficients (defined as the lowest possible constants, as in (2.6)) provide a tight characterization of the convergence behavior of projected gradient descent over the constraint set |$\mathscr{C}$|; if we enforce continuity, we would be forced to choose larger values for |$\gamma _{x}(\mathscr{C})$| at some points |$x\in \mathscr{C}$|, and the concavity coefficients would no longer be both necessary and sufficient for convergence. 
One related point is that, by allowing for |$\gamma _{x}(\mathscr{C})$| to be infinite if needed (which would be equivalent to allowing the ‘reach’ ρ(x) to be zero for some x), we can accommodate constraint sets such as the low-rank matrix constraint, |$\mathscr{C}=\left \{X\in \mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right \}$|. Recalling that |$\gamma _{X}(\mathscr{C})=\frac{1}{2\sigma _{r}(X)}$| as mentioned earlier, we see that a rank-deficient matrix X (i.e. rank(X) < r) will have |$\gamma _{X}(\mathscr{C})=\infty $|. By not requiring that the concavity coefficient is finite (equivalently, that the reach is positive), we avoid the need for any inelegant modifications (e.g. working with a truncated set such as |$\mathscr{C}=\left \{X:\operatorname{rank}(X)\leqslant r,\sigma _{r}(X)\geqslant \varepsilon \right \}$|). Structured norms Prox-regularity (or equivalently the notion of positive reach) is studied in the literature in a Hilbert space, with respect to its norm, which in |$\mathbb{R}^{d}$| means the ℓ2 norm (or a weighted ℓ2 norm).5 In contrast, our work defines local concavity coefficients with respect to a general structured norm |$\lVert{\cdot }\rVert $|, such as the ℓ1 norm in a sparse signal estimation setting. To see the distinction, compare our inner product condition (2.5) with the definition of prox-regularity (2.8). Of course, the equivalence of all norms on |$\mathbb{R}^{d}$| means that if |$\gamma (\mathscr{C})$| is finite when defined with respect to the ℓ2 norm (i.e. |$\mathscr{C}$| is prox-regular), then it is finite with respect to any other norm—so the importance of the distinction may not be immediately clear. As an example, let |$\gamma ^{\ell _{1}}(\mathscr{C})$| and |$\gamma ^{\ell _{2}}(\mathscr{C})$| denote the concavity coefficients with respect to the ℓ1 and ℓ2 norms. 
Since |$\lVert{\cdot }\rVert _{2}\leqslant \lVert{\cdot }\rVert _{1}\leqslant \sqrt{d}\lVert{\cdot }\rVert _{2}$|, we could trivially show that $$ \gamma^{\ell_{2}}(\mathscr{C})\leqslant \gamma^{\ell_{1}}(\mathscr{C})\leqslant \sqrt{d}\cdot \gamma^{\ell_{2}}(\mathscr{C}),$$ but the factor |$\sqrt{d}$| is unfavorable, so in many settings this is a very poor bound on |$\gamma ^{\ell _{1}}(\mathscr{C})$|. We may then ask, why can we not simply define the coefficients in terms of the ℓ2 norm? The reason is that in optimization problems arising in high-dimensional settings (for instance, high-dimensional regression in statistics), structured norms such as the ℓ1 norm (for problems involving sparse signals) or the nuclear norm (for low-rank signals) allow for statistical and computational analyses that would not be possible with the ℓ2 norm. In particular, we will see later on that convergence for the minimization problem |$\min _{x\in \mathscr{C}}\mathsf{g}(x)$| will depend on bounding |$\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast }$|. If |$\lVert{\cdot }\rVert $| is the ℓ1 norm, for instance, then |$\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast } = \lVert{\nabla \mathsf{g}(x)}\rVert _{\infty }$| will in general be much smaller than |$\lVert{\nabla \mathsf{g}(x)}\rVert _{2}$|. For instance, in a statistical problem, if ∇g(x) consists of Gaussian or sub-Gaussian noise at the true parameter vector x, then |$ \lVert{\nabla \mathsf{g}(x)}\rVert _{\infty }\sim \sqrt{\log (d)}$| while |$ \lVert{\nabla \mathsf{g}(x)}\rVert _{2}\sim \sqrt{d}$|. Therefore, being able to bound the concavity of |$\mathscr{C}$| with respect to the ℓ1 norm rather than the ℓ2 norm is crucial for analyzing convergence in a high-dimensional setting. In the next section, we will study how the choice of the norm |$\lVert{\cdot }\rVert $| and its dual |$\lVert{\cdot }\rVert ^{\ast }$| relates to the convergence properties of projected gradient descent. 3. 
Fast convergence of projected gradient descent Consider an optimization problem constrained to a non-convex set, |$\min \{ \mathsf{g}(x)\! :\! x\in \mathscr{C}\}$|, where |$\mathsf{g}:\mathbb{R}^{d}\!\rightarrow\! \mathbb{R}$| is a differentiable function. We will work with projected gradient descent algorithms in the setting where g is convex or approximately convex, while |$\mathscr{C}$| is non-convex with local concavity coefficients |$\gamma _{x}(\mathscr{C})$|. After choosing some initial point |$x_{0}\in \mathscr{C}$|, for each t ⩾ 0 we define \begin{align}\begin{cases} x^{\prime}_{t+1} = x_{t} - \eta\nabla\mathsf{g}\left(x_{t}\right)\!,\\ x_{t+1} = P_{{\mathscr{C}}}\left(x^\prime_{t+1}\right)\!,\end{cases} \end{align} (3.1) where if |$P_{{\mathscr{C}}}\!\left (x^\prime _{t+1}\right )$| is not unique then any closest point may be chosen. 3.1. Assumptions Assumptions on g We first consider the objective function g. Let |$\widehat{x}$| be the target of our optimization procedure, |$\widehat{x} \in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$|. We assume that g satisfies RSC and RSM conditions over |$x,y\in \mathscr{C}$|, \begin{align} \mathsf{g}(y)\geqslant \mathsf{g}(x) + \langle\,{y-x},{\nabla\mathsf{g}(x)}\rangle + \frac{\alpha}{2}\lVert{x-y}{\rVert^{2}_{2}} - \frac{\alpha}{2}\varepsilon_{\mathsf{g}}^{2} \end{align} (3.2) and \begin{align} \mathsf{g}(y)\leqslant \mathsf{g}(x) + \langle\,{y-x},{\nabla\mathsf{g}(x)}\rangle + \frac{\beta}{2}\lVert{x-y}{\rVert^{2}_{2}} + \frac{\alpha}{2}\varepsilon_{\mathsf{g}}^{2}. \end{align} (3.3) Without loss of generality we can take |$\alpha \leqslant \beta $|. As is common in the low-rank factorized optimization literature, we will work in a local neighborhood of the target |$\widehat{x}$| by assuming that our initialization point lies within radius ρ of |$\widehat{x}$|, which will allow us to require these conditions on g to hold only locally. 
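
The iteration (3.1) can be sketched in a few lines of Python. The example below is a minimal illustration under assumptions that are not taken from the paper: the constraint set is the set of matrices of rank at most r (projected by truncated SVD), the objective is a weighted least-squares loss g(X) = ½ Σ W_ij (X_ij − M_ij)² with weights in [α, β] = [0.5, 1.5] (so that g is α-strongly convex and β-smooth), the target M has rank exactly r, and the starting point is a projected small perturbation of M. All function and variable names are hypothetical.

```python
import numpy as np

def project_rank(Z, r):
    """Frobenius-norm projection onto {rank <= r} via truncated SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def projected_gradient_descent(grad, project, x0, step, n_iter):
    """Iteration (3.1): a gradient step followed by projection onto the set."""
    x = x0
    iterates = [x0]
    for _ in range(n_iter):
        x = project(x - step * grad(x))
        iterates.append(x)
    return iterates

rng = np.random.default_rng(0)
n, m, r = 20, 15, 2
alpha, beta = 0.5, 1.5
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-r target
W = rng.uniform(alpha, beta, size=(n, m))                      # assumed weights

def grad(X):
    # Gradient of g(X) = 0.5 * sum(W * (X - M) ** 2).
    return W * (X - M)

# Initialize close to the target, as the local analysis requires.
x0 = project_rank(M + 0.01 * rng.standard_normal((n, m)), r)
iterates = projected_gradient_descent(
    grad, lambda Z: project_rank(Z, r), x0, step=1.0 / beta, n_iter=100
)
errors = [np.linalg.norm(X - M, "fro") for X in iterates]
print(errors[0], errors[10], errors[-1])
```

In this noise-free setup the gradient vanishes at the minimizer M, so the statistical error term plays no role and the printed errors should shrink geometrically.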
The term εg gives some ‘slack’ in our assumption on g, and is intended to capture some vanishingly small error level. This term is often referred to as the ‘statistical error’ in the high-dimensional statistics literature, which represents the best-case scaling of the accuracy of our recovered solution. Often |$\widehat{x}$| may represent a global minimizer which is within radius εg of some ‘true’ parameter in a statistical setting; therefore, converging to |$\widehat{x}$| up to an error of magnitude εg means that the recovered solution is as accurate as |$\widehat{x}$| at recovering the true parameter. For instance, often we will have |$\varepsilon _{\mathsf{g}}\sim{\sqrt{\frac{\log (d)}{n}}}$| in a statistical setting where we are solving a sparse estimation problem of dimension d with sample size n. Assumptions on |$\mathscr{C}$| Next, turning to the non-convexity of |$\mathscr{C}$|, we will assume local concavity coefficients |$\gamma _{x}(\mathscr{C})$| that are not too large in a neighborhood of |$\widehat{x}$|, with details given below. We furthermore assume a norm compatibility condition, \begin{align} \left\lVert{z - P_{{\mathscr{C}}}(z)}\right\rVert^{\ast} \leqslant \phi \min_{x\in\mathscr{C}}\lVert{z-x}\rVert^{\ast}\text{ for all }z\in\mathbb{R}^{d}, \end{align} (3.4) for some constant |$\phi \geqslant 1$|. The norm compatibility condition is trivially true with ϕ = 1 if |$\lVert{\cdot }\rVert $| is the ℓ2 norm, since |$P_{{\mathscr{C}}}$| is a projection with respect to the ℓ2 norm. We will see that in many natural settings it holds even for other norms, often with ϕ = 1. 
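
As one concrete instance of the norm compatibility condition (3.4), consider the rank-constrained set with the nuclear norm as the structured norm, so that the dual norm is the spectral norm. By the Eckart-Young-Mirsky theorem, truncated SVD is a best rank-r approximation in the spectral norm as well as in the Frobenius norm, with spectral error equal to the (r+1)-th singular value, which gives (3.4) with ϕ = 1 for this set. The sketch below checks this numerically on a random matrix; the function names and the random comparison matrices are illustrative assumptions.

```python
import numpy as np

def project_rank(Z, r):
    """Frobenius-norm projection onto {rank <= r} via truncated SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def norm_compatibility_check(Z, r, n_trials=200, seed=0):
    """Compare both sides of (3.4) in spectral norm, for r < min(Z.shape)."""
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    sigma = np.linalg.svd(Z, compute_uv=False)
    # Left-hand side: spectral distance from Z to its l2 projection.
    lhs = np.linalg.norm(Z - project_rank(Z, r), ord=2)
    # Smallest achievable spectral distance to the set: sigma_{r+1}(Z).
    best = sigma[r]
    # Sanity check: random rank-r candidates never beat sigma_{r+1}(Z).
    trials = [
        np.linalg.norm(
            Z - rng.standard_normal((n, r)) @ rng.standard_normal((r, m)), ord=2
        )
        for _ in range(n_trials)
    ]
    return lhs, best, min(trials)

Z = np.random.default_rng(1).standard_normal((8, 6))
lhs, best, best_random = norm_compatibility_check(Z, r=2)
print(lhs, best, best_random)
```

The first two printed values should agree up to rounding, and the third should be no smaller than either.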
Assumptions on gradient and initialization Finally, we assume a gradient condition that reveals the connection between the curvature of the non-convex set |$\mathscr{C}$| and the target function g: we require that \begin{align} 2\phi\cdot\max_{x,x{^\prime}\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)}\gamma_{x}(\mathscr{C})\lVert{\nabla\mathsf{g}(x^\prime)}\rVert^{\ast} \leqslant (1-c_{0}) \cdot \alpha. \end{align} (3.5) (Since |$x\mapsto \gamma _{x}(\mathscr{C})$| is upper semi-continuous, if g is continuously differentiable, then we can find some radius ρ > 0 and some constant c0 > 0 satisfying this condition, as long as |$2\phi \gamma _{\widehat{x}}(\mathscr{C}) \lVert{\nabla \mathsf{g}\left (\widehat{x}\right )}\rVert ^{\ast } < \alpha $|.) Our projected gradient descent algorithm will then succeed if initialized within this radius ρ from the target point |$\widehat{x}$|, with an appropriate step size. We will discuss the necessity of this type of initialization condition below in Section 3.4. In practice, relaxing the constraint |$x\in \mathscr{C}$| to a convex constraint (or convex penalty) is often sufficient for providing a good initialization point. For example, in the low-rank matrix setting, if we would like to solve |$\operatorname{arg\,min}\{\mathsf{g}(X):\operatorname{rank}(X)\leqslant r\}$|, we may first solve |$\operatorname{arg\,min}_{X}\left \{\mathsf{g}(X) + \lambda \lVert{{X}}\rVert _{\textrm{nuc}}\right \}$|, where |$\lVert{{X}}\rVert _{\textrm{nuc}}$| is the nuclear norm and |$\lambda \geqslant 0$| is a penalty parameter (which we would tune to obtain the desired rank for X). Alternatively, in some settings, it may be sufficient to solve the unconstrained problem |$\operatorname{arg\,min}_{X}\mathsf{g}(X)$| and then project the solution X onto the constraint set, |$P_{{\mathscr{C}}}(X)$|. For some detailed examples of suitable initialization procedures for various low-rank matrix estimation problems, see e.g. the studies by Chen & Wainwright (2015) and Tu et al. (2015). 3.2. 
Convergence guarantee We now state our main result, which proves that under these conditions, initializing at some |$x_{0}\in \mathscr{C}$| sufficiently close to |$\widehat{x}$| will guarantee fast convergence to |$\widehat{x}$|. Theorem 3.1 Let |$\mathscr{C}\subset \mathbb{R}^{d}$| be a constraint set and let g be a differentiable function, with minimizer |$\widehat{x}\in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$|. Suppose |$\mathscr{C}$| satisfies the norm compatibility condition (3.4) with parameter ϕ, and g satisfies RSC (3.2) and RSM (3.3) with parameters α, β, εg for all |$x,y\in \mathbb{B}_{2}(\widehat{x},\rho )$|, and the initialization condition (3.5) for some c0 > 0. If the initial point |$x_{0}\in \mathscr{C}$| and the error level εg satisfy |$\lVert{x_{0} -\widehat{x}}{\rVert ^{2}_{2}}<\rho ^{2}$| and |$\varepsilon _{\mathsf{g}}^{2}< \frac{c_{0} \rho ^{2}}{1.5}$|, then for each step |$t\geqslant 0$| of the projected gradient descent algorithm (3.1) with step size η = 1/β, $$ \lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} \leqslant \left(1 - c_{0}\cdot \frac{2\alpha}{\alpha+\beta}\right)^{t} \lVert{x_{0} - \widehat{x}}{\rVert^{2}_{2}} +\frac{1.5{\varepsilon}_{\mathsf{g}}^{2}}{c_{0}}.$$ In other words, the iterates xt converge linearly to the minimizer |$\widehat{x}$|, up to precision level εg. 3.3. Comparison to related work We now compare to several related results for convex and non-convex projected gradient descent. (For methods that are specific to the problem of optimization over low-rank matrices, we will discuss this comparison and perform simulations later on.) Comparison to convex optimization To compare this result to the convex setting, if |$\mathscr{C}$| is a convex set and g is α-strongly convex and β-smooth, then we can set c0 = 1 and εg = 0. 
Our result then yields $$ \lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} \leqslant \left(1 - \frac{2\alpha}{\alpha+\beta}\right)^{t} \lVert{x_{0} - \widehat{x}}{\rVert^{2}_{2}} = \left(\frac{\beta-\alpha}{\beta+\alpha}\right)^{t}\lVert{x_{0}-\widehat{x}}{\rVert^{2}_{2}},$$ matching known rates for the convex setting (see e.g. Bubeck, 2015, Theorem 3.10). Comparison to known results using descent cones Oymak et al. (2015) study projected gradient descent for a linear regression setting, |$\mathsf{g}(x) = \frac{1}{2}\lVert{b-Ax}{\rVert ^{2}_{2}}$|, while constraining some potentially non-convex regularizer, |$\mathscr{C} = \{x:\textrm{Pen}(x)\leqslant c\}$|. Given a true solution |$x^{\star }\in \mathscr{C}$| (for instance, in a statistical setting, we may have b = Ax⋆ + (noise)), their work focuses on the descent cone of |$\mathscr{C}$| at x⋆, given by $$ \textrm{DC}_{x^{\star}} = \textrm{Smallest closed cone containing }\left\{u: \textrm{Pen}\left(x^{\star}+u\right) \leqslant c\right\}\!\!.$$ (Trivially we will have |$x_{t} - x^{\star } \in \textrm{DC}_{x^{\star }}$| since |$x_{t}\in \mathscr{C}$|.) Their results characterize the convergence of projected gradient descent in terms of the eigenvalues of A⊤A restricted to this cone. For simplicity, we show their result specialized to the noiseless setting, i.e. b = Ax⋆, given in the study by Oymak et al. (2015, Theorem 1.2): \begin{align} \lVert{x_{t} - x^{\star}}\rVert_{2} \leqslant \left(2\cdot \max_{u,v\in\textrm{DC}_{x^{\star}}\cap\mathbb{S}^{d-1}} u^{\top} \left(\mathbf{I}_{d} - \eta A^{\top} A\right) v\right)^{t} \lVert{x^{\star}}\rVert_{2}. \end{align} (3.6) For this result to be meaningful we of course need the radius of convergence to be < 1. For a convex constraint set |$\mathscr{C}$| (i.e. if Pen(x) is convex), the factor of 2 can be removed. In the non-convex setting, however, the factor of 2 means that the maximum in (3.6) must be |$<\frac{1}{2}$| for the bound to ensure convergence. 
Noting that ∇g(x) = A⊤A(x − x⋆) in this problem, by setting u = v ∝ x − x⋆ we see that (3.6) effectively requires that |$\eta> \frac{1}{2\alpha }$|, where α is the RSC parameter (3.2): for a unit vector u the quadratic form in (3.6) equals |$1-\eta \lVert{Au}{\rVert ^{2}_{2}}$|, which must fall below |$\frac{1}{2}$| even in directions where |$\lVert{Au}{\rVert ^{2}_{2}}$| is as small as α. However, we also know that |$\eta \leqslant \frac{1}{\beta }$| is generally a necessary condition to ensure stability of projected gradient descent; if |$\eta>\frac{1}{\beta }$| then we may see values of g increase over iterations, i.e. g(x1) > g(x0). Therefore, the condition (3.6) effectively requires that g is well conditioned with |$\beta \lesssim 2\alpha $|, and furthermore that x⋆ is not in the interior of |$\mathscr{C}$| (since, if this were the case, then |$\textrm{DC}_{x^{\star }} = \mathbb{R}^{d}$|). On the other hand, if the radius in (3.6) is indeed < 1, then their work does not assume any type of initialization condition for convergence to be successful, in contrast to our initialization assumption (3.5). Comparison to known results for iterative hard thresholding We now compare our results to those of Jain et al. (2014), which specifically treat the iterative hard thresholding algorithm for a sparsity constraint or a rank constraint, $$ \mathscr{C} = \left\{x\in\mathbb{R}^{d}:|\operatorname{support}(x)|\leqslant k\right\}\textrm{ or } \mathscr{C} = \left\{X\in\mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right\}\!\!.$$ In their work, they take a substantially different approach: rather than bounding the distance between xt and the minimizer |$\widehat{x}\in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$|, they take |$\widehat{x}$| to be a minimizer over a stronger constraint, $$ \widehat{x} =\mathop{\operatorname{arg\,min}}\limits_{|\operatorname{support}(x)|\leqslant k^{\star}}\mathsf{g}(x)\quad\textrm{ or }\quad\widehat{X}=\mathop{\operatorname{arg\,min}}\limits_{\operatorname{rank}(X)\leqslant r^{\star}}\mathsf{g}(X),$$ taking k⋆ ≪ k or r⋆ ≪ r to enforce that the sparsity of |$\widehat{x}$| or rank of |$\widehat{X}$| is much lower than 
the optimization constraint set |$\mathscr{C}$|. With this definition, they then bound the gap in objective function values, |$\mathsf{g}(x_{t}) - \mathsf{g}(\widehat{x})$|. In other words, the objective function value g(xt) is, up to a small error, no larger than the best value obtained over the substantially more restricted set of k⋆-sparse vectors or of rank-r⋆ matrices. By careful use of this gap k⋆ ≪ k or r⋆ ≪ r, their analysis allows for convergence results from any initialization point |$x_{0}\in \mathscr{C}$|. In contrast, our work allows |$\widehat{x}$| to lie anywhere in |$\mathscr{C}$|, but this comes at the cost of assuming a local initialization point |$x_{0}\in \mathscr{C}\cap \mathbb{B}_{2}\left (\widehat{x},\rho \right )$|. This result suggests a possible two-phase approach: first, we might optimize over a larger rank constraint |$\mathscr{C}=\{X:\operatorname{rank}(X)\leqslant k\},$| where k ≫ k⋆, to obtain the convergence guarantees of the study by Jain et al. (2014) (which do not assume a good initialization point, but obtain weaker guarantees); then, using the solution over rank k as a good initialization point, we would optimize over the tighter constraint |$\mathscr{C}=\{X:\operatorname{rank}(X)\leqslant k^{\star }\}$| to obtain our stronger guarantees. Comparison to results on prox-regular functions Pennanen (2002) studies conditions for linear convergence of the proximal point method for minimizing a function f(x), and shows that prox-regularity of f(x) is sufficient; Lewis & Wright (2016) also study this problem in a more general setting. For our optimization problem, this translates to setting |$\mathsf{f}(x) = \mathsf{g}(x) + \delta _{\mathscr{C}}(x)$|, where $$ \delta_{\mathscr{C}}(x)=\begin{cases}0,&x\in\mathscr{C},\\\infty,&x\not\in\mathscr{C}.\end{cases}$$ (This is usually called the ‘indicator function’ for the set |$\mathscr{C}$|.) If |$\mathsf{g}(x) + \frac{\mu }{2}\lVert{x}{\rVert ^{2}_{2}}$| is convex (i.e. 
the concavity of g is bounded) and |$\mathscr{C}$| is a prox-regular set (i.e. |$\gamma (\mathscr{C})<\infty $|, see Section 2.3), then f(x) is a prox-regular function. This work was extended by Iusem et al. (2003) and others to an inexact proximal point method, allowing for error in each iteration, which can be formulated to encompass the projected gradient descent algorithm studied here. Our first convergence result, Theorem 3.1, extends these results into a high-dimensional setting by using the structured norm |$\lVert{\cdot }\rVert $| and its dual |$\lVert{\cdot }\rVert ^{\ast }$| (e.g. the ℓ1 norm and its dual the |$\ell _{\infty }$| norm), and by requiring only RSC and RSM on g; without these relaxations we would not be able to obtain convergence guarantees in settings such as high-dimensional sparse regression or low-rank matrix estimation. 3.4. Initialization point and the gradient assumption In this result, we assume that the initialization point x0 is within some radius ρ of the target |$\widehat{x}$|, ensuring that |$2\phi \gamma _{x}(\mathscr{C})\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast }<\alpha $| for all x in the initialization neighborhood, where α is the RSC (3.2) parameter. This type of assumption arises in much of the related literature; for example, in the setting of optimization over low-rank matrices, as we will see in Section 5.1, we will require that |$\lVert{{X_{0} - \widehat{X}}}\rVert _{\mathsf{F}}\lesssim \sigma _{r}\left (\widehat{X}\right )$|, which is the same condition found in existing work such as that of Chen & Wainwright (2015). 
In fact, the following result demonstrates that the bound (3.5) is in a sense necessary: Lemma 3.2 For any constraint set |$\mathscr{C}$| and any point |$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$| with |$\gamma _{x}(\mathscr{C})>0$|, for any α, ε > 0 there exists an α-strongly convex g such that the gradient condition (3.5) is nearly satisfied at x, with |$2\gamma _{x}(\mathscr{C})\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast }\leqslant \alpha (1+ \varepsilon )$|, and x is a stationary point of the projected gradient descent algorithm (3.1) for all sufficiently small step sizes η > 0, but x does not minimize g over |$\mathscr{C}$|. That is, if projected gradient descent is initialized at the point x, then the algorithm will never leave this point, even though it is not optimal (i.e. x is not the global minimizer). We can see with a concrete example that the condition (3.5) may be even more critical than this lemma suggests: without this bound, we may find that projected gradient descent becomes trapped at a stationary point x which is not even a local minimum, as in the following example. Example 3.3 Let |$\mathscr{C}=\left \{X\in \mathbb{R}^{2\times 2}:\operatorname{rank}(X)\leqslant 1\right \}$|, let |$\mathsf{g}(X) = \frac{1}{2}\left \lVert{{X - \left ({1 \atop 0} \quad {0 \atop 1+\varepsilon}\right )}}\right \rVert _{\mathsf{F}}^{2}$|, and let |$X_{0} = \left ({1\atop 0} \quad {0 \atop 0}\right ).$| Then trivially, we can see that g is α-strongly convex for α = 1, and that X0 is a stationary point of the projected gradient descent algorithm (3.1) for any step size |$\eta < \frac{1}{1+\varepsilon }$|. However, for any |$0<t<\sqrt{2\varepsilon }$|, setting |$X=\left ({1 \atop t} \quad {t \atop t^{2}}\right )\in \mathscr{C}$|, we can see that g(X) < g(X0); that is, X0 is a stationary point, but is not a local minimum. 
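The two claims of Example 3.3 can be checked numerically. The sketch below is an illustration under stated assumptions: the values ε = 0.1, η = 0.5 and t = ½√(2ε) are arbitrary choices satisfying the stated ranges, and the rank-one projection is computed by a simple power iteration (a hypothetical helper that assumes a nonzero input with distinct singular values) rather than by a full singular value decomposition.

```python
import math

eps = 0.1
M = [[1.0, 0.0], [0.0, 1.0 + eps]]   # the target matrix inside g
X0 = [[1.0, 0.0], [0.0, 0.0]]


def g(X):
    """g(X) = 0.5 * ||X - M||_F^2."""
    return 0.5 * sum((X[i][j] - M[i][j]) ** 2 for i in range(2) for j in range(2))


def grad_g(X):
    """Gradient of g, which is X - M."""
    return [[X[i][j] - M[i][j] for j in range(2)] for i in range(2)]


def rank1_projection(Z, iters=500):
    """Best rank-one approximation Z v v^T, with v found by power iteration
    on Z^T Z (assumes Z is nonzero with distinct singular values)."""
    v = [1.0, 0.7]
    for _ in range(iters):
        Zv = [Z[0][0] * v[0] + Z[0][1] * v[1], Z[1][0] * v[0] + Z[1][1] * v[1]]
        w = [Z[0][0] * Zv[0] + Z[1][0] * Zv[1], Z[0][1] * Zv[0] + Z[1][1] * Zv[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    Zv = [Z[0][0] * v[0] + Z[0][1] * v[1], Z[1][0] * v[0] + Z[1][1] * v[1]]
    return [[Zv[i] * v[j] for j in range(2)] for i in range(2)]


# One step of (3.1) from X0 with a step size eta < 1 / (1 + eps).
eta = 0.5
G = grad_g(X0)
X_prime = [[X0[i][j] - eta * G[i][j] for j in range(2)] for i in range(2)]
X1 = rank1_projection(X_prime)
step_change = max(abs(X1[i][j] - X0[i][j]) for i in range(2) for j in range(2))

# A nearby rank-one matrix with a smaller objective value, 0 < t < sqrt(2 * eps).
t = 0.5 * math.sqrt(2 * eps)
X_t = [[1.0, t], [t, t * t]]

print(step_change, g(X0), g(X_t))
```

The step from X0 returns to X0 up to numerical error, while the rank-one matrix X_t attains a strictly smaller value of g. Expanding the square gives g(X_t) − g(X0) = ½(t⁴ − 2εt²), which is negative exactly when t² < 2ε.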
We will later calculate that |$\gamma _{X_{0}}(\mathscr{C}) = \frac{1}{2\sigma _{1}(X_{0})}=\frac{1}{2}$| relative to the nuclear norm |$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\textrm{nuc}}$|, with norm compatibility constant ϕ = 1 (see Section 5.1 for this calculation). Comparing against the condition (3.5) on the gradient of g, since the dual norm to |$\lVert{{\cdot }}\rVert _{\textrm{nuc}}$| is the matrix spectral norm |$\lVert{{\cdot }}\rVert _{\textrm{sp}}$|, we see that $$ 2\phi\gamma_{X_{0}}(\mathscr{C}) \cdot\lVert{{\nabla\mathsf{g}(X_{0})}}\rVert_{\textrm{sp}} = 2\cdot1\cdot \frac{1}{2} \cdot \left\lVert{{-\left(\begin{array}{cc}0 & 0 \\ 0 & 1+\varepsilon\end{array}\right) }}\right\rVert_{\textrm{sp}}= 1+ \varepsilon =\alpha(1+\varepsilon).$$ Therefore, when the initial gradient condition (3.5) is even slightly violated in this example (i.e. small ε > 0), the projected gradient descent algorithm can become trapped at a point that is not even a local minimum. While we might observe that in this particular example, the ‘bad’ stationary point X0 could be avoided by increasing the step size, in other settings if g has strong curvature in some directions (i.e. the smoothness parameter β is large), then we cannot afford a large step size η as it can cause the algorithm to fail to converge. 4. Convergence analysis using approximate projections In some settings, computing projections |$P_{{\mathscr{C}}}(x^\prime _{t+1})$| at each step of the projected gradient descent algorithm may be prohibitively expensive; for instance in a low-rank matrix optimization problem of dimension d × d, this would generally involve taking the singular value decomposition of a dense d × d matrix at each step. In these cases we may sometimes have access to a fast but approximate computation of this projection, which may come at the cost of slower convergence. 
We now generalize to the idea of a family of approximate projections, which allows for operators that approximate projection to |$\mathscr{C}$|. Specifically, the approximations are carried out locally: \begin{align}\begin{cases}x^\prime_{t+1} = x_{t} - \eta\nabla\mathsf{g}(x_{t}),\\[3pt] x_{t+1} = P_{{x_{t}}}\left(x^\prime_{t+1}\right)\!,\end{cases} \end{align} (4.1) where |$P_{{x_{t}}}$| comes from a family of operators |$P_{{x}}:\mathbb{R}^{d}\rightarrow \mathscr{C}$| indexed by |$x\in \mathscr{C}$|. Intuitively, we think of Px(z) as providing a very accurate approximation to |$P_{{\mathscr{C}}}(z)$| locally for z near x, but it may distort the projection more as we move farther away. To allow for our convergence analysis to carry through even with these approximate projections, we assume that the family of operators {Px} satisfies a relaxed inner product condition: \begin{align} &\text{For any }x\in\mathscr{C}\text{ and }z\in\mathbb{R}^{d}\text{ with }x,P_{{x}}(z)\in\mathbb{B}_{2}\!\left(\widehat{x},\rho\right), \\ &\quad\left\langle{\,\widehat{x}-P_{{x}}(z)},{z-P_{{x}}(z)}\right\rangle \leqslant \max\{ \underbrace{\left\lVert{z-P_{{x}}(z)}\right\rVert^{\ast}}_{\textrm{concavity term}}, \underbrace{\left\lVert{z-x}\right\rVert^{\ast}}_{\textrm{distortion term}}\}\cdot \big(\underbrace{\gamma^{\textrm{c}}\lVert{\,\widehat{x}-P_{{x}}(z)}{\rVert^{2}_{2}}}_{\textrm{concavity term}} + \underbrace{\gamma^{\textrm{d}}\lVert{\,\widehat{x}-x}{\rVert^{2}_{2}}}_{\textrm{distortion term}}\big).\nonumber \end{align} (4.2) Here the ‘concavity’ terms are analogous to the inner product bound in (2.5) for exact projection to the non-convex set |$\mathscr{C}$|, except with the projection |$P_{{\mathscr{C}}}$| replaced by the operator Px; the ‘distortion’ terms mean that as we move farther away from x the bound becomes looser, as Px becomes a less accurate approximation to |$P_{{\mathscr{C}}}$|. 
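The only algorithmic change in (4.1) relative to (3.1) is that the operator applied at step t may depend on the current iterate. A minimal sketch of that interface follows. The family used in the demonstration is the trivial one in which every Px is exact hard thresholding onto k-sparse vectors, chosen purely as a placeholder; the toy objective and all names are likewise assumptions made for this illustration.

```python
def approx_projected_gradient_descent(grad, approx_project, x0, eta, steps):
    """Run the update (4.1): approx_project(x, z) plays the role of P_x(z)."""
    x = list(x0)
    for _ in range(steps):
        gradient = grad(x)
        x_prime = [xi - eta * gi for xi, gi in zip(x, gradient)]  # gradient step
        x = approx_project(x, x_prime)  # operator indexed by the current iterate
    return x


def hard_threshold(z, k):
    """Euclidean projection onto k-sparse vectors: keep the k largest |z_i|."""
    keep = set(sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)[:k])
    return [z[i] if i in keep else 0.0 for i in range(len(z))]


def trivial_family(x, z, k=2):
    """Placeholder family: ignores x and projects exactly. A genuinely
    approximate operator would use x to make the computation cheaper."""
    return hard_threshold(z, k)


# Hypothetical toy objective: g(x) = 0.5 * sum_i w[i] * (x[i] - b[i])**2.
w = [1.0, 1.0, 4.0]
b = [1.0, 0.2, -2.0]
beta = max(w)

x_final = approx_projected_gradient_descent(
    grad=lambda x: [wi * (xi - bi) for wi, xi, bi in zip(w, x, b)],
    approx_project=trivial_family,
    x0=[0.8, 0.0, -1.8],
    eta=1.0 / beta,
    steps=200,
)
print(x_final)
```

Passing the current iterate to the operator is what allows a local approximation, accurate near x_t, to stand in for an expensive exact projection.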
We now present a convergence guarantee nearly identical to the result for the exact projection case, Theorem 3.1. We first need to state a version of the norm compatibility condition, modified for approximate projections: \begin{align} \left\lVert{z - P_{{x}}(z)}\right\rVert^{\ast} \leqslant \phi \lVert{z-x}\rVert^{\ast}\ \ \text{for all }x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)\text{ and }z\in\mathbb{R}^{d}. \end{align} (4.3) We also require a modified initialization condition, \begin{align} 2\phi(\gamma^{\textrm{c}}+\gamma^{\textrm{d}})\max_{x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)}\lVert{\nabla\mathsf{g}(x)}\rVert^{\ast} \leqslant (1-c_{0})\alpha, \end{align} (4.4) and a modified version of local uniform continuity (compare to (2.7) for exact projections), \begin{align} &\text{for any }x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho),\text{ for any }(\varepsilon > 0),\text{ there exists a }(\delta > 0)\text{ such that,}\nonumber\\ &\quad\text{for any }z,w\in\mathbb{R}^{d}\text{ such that }P_{x}(z)\in\mathbb{B}_{2}(\widehat{x},\rho)\text{ and }2(\gamma^{\textrm{c}}+\gamma^{\textrm{d}})\lVert{z-P_{{x}}(z)}\rVert^{\ast}\leqslant 1-c_{0},\nonumber\\ &\qquad\text{if }\lVert{z-w}\rVert_{2}\leqslant\delta\text{ then }\lVert{P_{{x}}(z)-P_{{x}}(w)}\rVert_{2}\leq\varepsilon. \end{align} (4.5) Our result for this setting now follows. Theorem 4.1 Let |$\mathscr{C}\subset \mathbb{R}^{d}$| be a constraint set and let g be a differentiable function, with minimizer |$\widehat{x}\in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$|. Let {Px} be a family of operators satisfying the inner product condition (4.2), the norm compatibility condition (4.3) and the local continuity condition (4.5) with parameters γc, γd, ϕ and radius ρ. Assume that g satisfies RSC (3.2) and restricted smoothness (3.3) with parameters α, β, εg for all |$x,\ y\in \mathbb{B}_{2}(\,\widehat{x},\ \rho )$|, and the initialization condition (4.4) for some c0 > 0. 
If the initial point |$x_{0}\in \mathscr{C}$| and the error level εg satisfy |$\lVert{x_{0} -\widehat{x}}{\rVert ^{2}_{2}}<\rho ^{2}$| and |$\varepsilon _{\mathsf{g}}^{2}< \frac{c_{0} \rho ^{2}}{1.5}$|, then for each step |$t\geqslant 0$| of the approximate projected gradient descent algorithm (4.1) with step size η = 1/β, $$ \lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} \leqslant \left(1 - c_{0}\cdot \frac{2\alpha}{\alpha+\beta}\right)^{t} \lVert{x_{0} - \widehat{x}}{\rVert^{2}_{2}} +\frac{1.5\varepsilon_{\mathsf{g}}^{2}}{c_{0}}.$$ This convergence rate is identical to that obtained in Theorem 3.1 for exact projections—the only differences lie in the assumptions. 4.1. Exact vs. approximate projections To compare the two settings we have considered, exact projections |$P_{{\mathscr{C}}}$| vs. approximate projections Px, we focus on a local form of the inner product condition (4.2) for the family of approximate operators {Px}, rewritten to be analogous to the inner product condition (2.5) for exact projections. Suppose that |$\gamma ^{\text{c}}_{u}(\mathscr{C})$| and |$\gamma ^{\text{d}}_{u}(\mathscr{C})$| satisfy the property that \begin{align} &\text{for any }x,\ y\in\mathscr{C}\text{ and any }z\in\mathbb{R}^{d}\text{, writing }u=P_{{x}}(z),\nonumber\\ &\quad\langle{y-u},\ {z-u}\rangle \leqslant \max\{ \underbrace{\lVert{z-u}\rVert^{\ast}}_{\text{concavity term}}, \underbrace{\lVert{z-x}\rVert^{\ast}}_{\text{distort