# Gradient descent with non-convex constraints: local concavity determines convergence

## Abstract

Many problems in high-dimensional statistics and optimization involve minimization over non-convex constraints (for instance, a rank constraint in a matrix estimation problem), but little is known about the theoretical properties of such optimization problems for a general non-convex constraint set. In this paper we study the interplay between the geometric properties of the constraint set and the convergence behavior of gradient descent for minimization over this set. We develop the notion of local concavity coefficients of the constraint set, measuring the extent to which convexity is violated, which governs the behavior of projected gradient descent over this set. We demonstrate the versatility of these concavity coefficients by computing them for a range of problems in low-rank estimation, sparse estimation and other examples. Through our understanding of the role of these geometric properties in optimization, we then provide a convergence analysis when projections are calculated only approximately, leading to a more efficient method for projected gradient descent in low-rank estimation problems.

## 1. Introduction

Non-convex optimization problems arise naturally in many areas of high-dimensional statistics and data analysis, and pose particular difficulty due to the possibility of becoming trapped in a local minimum or failing to converge. Nonetheless, recent results have begun to extend some of the broad convergence guarantees achieved in the literature on convex optimization into the non-convex setting.
In this work, we consider a general question: when minimizing a function g(x) over a non-convex constraint set $$\mathscr{C}\subset \mathbb{R}^{d}$$,

$$\widehat{x} = \mathop{\operatorname{arg\,min}}\limits_{x\in\mathscr{C}} \mathsf{g}(x),$$

what types of conditions on g and on $$\mathscr{C}$$ are sufficient to guarantee the success of projected gradient descent? More concretely, when can we expect optimization of this non-convex problem to converge at essentially the same rate as a convex problem? In examining this question, we find that local geometric properties of the non-convex constraint set $$\mathscr{C}$$ are closely tied to the behavior of gradient descent methods, and the main results of this paper study the equivalence between local geometric conditions on the boundary of $$\mathscr{C}$$ and the local behavior of optimization problems constrained to $$\mathscr{C}$$.

The main contributions of this paper are the following:

- We develop the notion of local concavity coefficients of a non-convex constraint set $$\mathscr{C}$$, characterizing the extent to which $$\mathscr{C}$$ is non-convex relative to each of its points. These coefficients, a generalization of the notions of prox-regular sets and sets of positive reach in the analysis literature, bound the set's violations of four different characterizations of convexity (e.g. that convex combinations of points must lie in the set, and that the first-order optimality conditions hold for minimization over the set) with respect to a structured norm, such as the ℓ1 norm for sparse problems, chosen to capture the natural structure of the problem. The local concavity coefficients allow us to characterize the geometric properties of the constraint set $$\mathscr{C}$$ that are favorable for analyzing the convergence of projected gradient descent. Our key results, Theorems 2.1 and 2.2, prove that these multiple notions of non-convexity are in fact exactly equivalent, shedding light on the interplay between geometric properties such as curvature, and optimality properties such as the first-order conditions, in a non-convex setting.
- We next prove convergence results for projected gradient descent over a non-convex constraint set, minimizing a function g assumed to exhibit restricted strong convexity (RSC) and restricted smoothness (RSM); these types of conditions are common in the high-dimensional statistics literature (see e.g. the study by Negahban et al., 2009 for background). We also allow for the projection step, i.e. projection to $$\mathscr{C}$$, to be calculated approximately, which enables greater computational efficiency. Our main convergence analysis shows that, as long as we initialize at a point x0 that is not too far from $$\widehat{x}$$, projected gradient descent converges linearly to $$\widehat{x}$$ when the constraint set $$\mathscr{C}$$ satisfies the geometric properties described above.
- Finally, we apply these ideas to a range of specific examples: low-rank matrix estimation (where optimization is carried out under a rank constraint), sparse estimation (with non-convex regularizers such as the Smoothly Clipped Absolute Deviation (SCAD) penalty offering a lower-shrinkage alternative to the ℓ1 norm) and several other non-convex constraints. We discuss some interesting differences between constraining vs. penalizing a non-convex regularization function in the context of sparse estimation. For the low-rank setting, we propose an approximate projection step that provides a computationally efficient alternative for low-rank estimation problems, which we then explore empirically with simulations.
## 2. Concavity coefficients for a non-convex constraint space

We begin by studying several properties which describe the extent to which the constraint set $$\mathscr{C}\subset \mathbb{R}^{d}$$ deviates from convexity. To quantify the concavity of $$\mathscr{C}$$, we will define the (global) concavity coefficient of $$\mathscr{C}$$, denoted $$\gamma = \gamma (\mathscr{C})$$, which we will later refine to local measures of concavity, $$\gamma _{x}(\mathscr{C})$$, indexed over points $$x\in \mathscr{C}$$. We examine several definitions of this concavity coefficient: essentially, we consider four properties that would hold if $$\mathscr{C}$$ were convex, and then use γ to characterize the extent to which these properties are violated. Our definitions are closely connected to the notion of prox-regular sets in the analysis literature, and we will discuss this connection in detail in Section 2.3 below.

Since we are interested in developing flexible tools for high-dimensional optimization problems, several different norms will appear in the definitions of the concavity coefficients:

- The Euclidean ℓ2 norm, $$\lVert{\cdot }\rVert _{2}$$. Projections to $$\mathscr{C}$$ will always be taken with respect to the ℓ2 norm, and our later convergence guarantees will also be given with respect to this norm. If our variable is a matrix $$X\in \mathbb{R}^{n\times m}$$, the Euclidean ℓ2 norm is known as the Frobenius norm, $$\lVert{{X}}\rVert _{\mathsf{F}}=\sqrt{\sum _{ij} X_{ij}^{2}}$$.
- A 'structured' norm $$\lVert{\cdot }\rVert$$, which can be chosen to be any norm on $$\mathbb{R}^{d}$$. In some cases it may be the ℓ2 norm, but often it will be a different norm reflecting natural structure in the problem. For instance, for a low-rank estimation problem, if $$\mathscr{C}$$ is a set of rank-constrained matrices then we will work with the nuclear norm, $$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\textrm{nuc}}$$ (defined as the sum of the singular values of the matrix). For sparse signals, we will instead use the ℓ1 norm, $$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert _{1}$$.
- A norm $$\lVert{\cdot }\rVert ^{\ast }$$, which is the dual norm to the structured norm $$\lVert{\cdot }\rVert$$. For low-rank matrix problems, if we work with the nuclear norm, $$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\textrm{nuc}}$$, then the dual norm is given by the spectral norm, $$\lVert{\cdot }\rVert ^{\ast }=\lVert{{\cdot }}\rVert _{\textrm{sp}}$$ (i.e. the largest singular value of the matrix, also known as the matrix operator norm). For sparse problems, if $$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert _{1}$$ then its dual is given by the $$\ell _{\infty }$$ norm, $$\lVert{\cdot }\rVert ^{\ast }=\lVert{\cdot }\rVert _{\infty }$$.

When we take projections to the constraint set $$\mathscr{C}$$, if the minimizer $$P_{{\mathscr{C}}}(z)\in \operatorname{arg\,min}_{x\in \mathscr{C}}\lVert{x-z}\rVert _{2}$$ is non-unique, then we write $$P_{{\mathscr{C}}}(z)$$ to denote any point chosen from this set. Throughout, any assumption or claim involving $$P_{{\mathscr{C}}}(z)$$ should be interpreted as holding for any choice of $$P_{{\mathscr{C}}}(z)$$. From this point on, we will assume without comment that $$\mathscr{C}$$ is closed and non-empty, so that the set $$\operatorname{arg\,min}_{x\in \mathscr{C}}\lVert{x-z}\rVert _{2}$$ is non-empty for any z.

We now present several definitions of the concavity coefficient of $$\mathscr{C}$$.

**Curvature.** First, we define γ as a bound on the extent to which a convex combination of two elements of $$\mathscr{C}$$ may lie outside of $$\mathscr{C}$$: for $$x,y\in \mathscr{C}$$,

\begin{align}\limsup_{t\searrow 0}\frac{\min_{z\in\mathscr{C}}\left\lVert{z - \left((1-t)x + ty\right)}\right\rVert}{t} \leqslant \gamma\lVert{x - y}{\rVert^{2}_{2}}.
\end{align} (2.1)

**Approximate contraction.** Secondly, we define γ via a condition requiring that the projection operator $$P_{{\mathscr{C}}}$$ is approximately contractive in a neighborhood of the set $$\mathscr{C}$$, that is, $$\lVert{P_{{\mathscr{C}}}(z) - P_{{\mathscr{C}}}(w)}\rVert _{2}$$ is not much larger than $$\lVert{z-w}\rVert _{2}$$: for $$x,y\in \mathscr{C}$$,

\begin{align}&\text{For any }z,w\in\mathbb{R}^{d} \text{ with } P_{{\mathscr{C}}}(z)=x\text{ and } P_{{\mathscr{C}}}(w)=y,\nonumber\\&\quad\big(1-\gamma\lVert{z-x}\rVert^{\ast}-\gamma\lVert{w-y}\rVert^{\ast}\big) \cdot \lVert{x - y}\rVert_{2} \leqslant \lVert{z - w}\rVert_{2}.\end{align} (2.2)

For convenience in our theoretical analysis we will also consider a weaker 'one-sided' version of this property, where one of the two points is assumed to already lie in $$\mathscr{C}$$: for $$x,y\in \mathscr{C}$$,

\begin{align}\text{For any }z\in\mathbb{R}^{d}\text{ with }P_{{\mathscr{C}}}(z)=x,\quad \left(1-\gamma\lVert{z-x}\rVert^{\ast}\right) \cdot \lVert{x - y}\rVert_{2} \leqslant \lVert{z-y}\rVert_{2}. \end{align} (2.3)

**First-order optimality.** For our third characterization of the concavity coefficient, we consider the standard first-order optimality conditions for minimization over a convex set, and measure the extent to which they are violated when optimizing over $$\mathscr{C}$$: for $$x,y\in \mathscr{C}$$,1

\begin{align} &\text{For any differentiable }\mathsf{f}:\mathbb{R}^{d}\rightarrow \mathbb{R}\text{ such that }x\text{ is a local minimizer of }\mathsf{f}\text{ over }\mathscr{C},\nonumber\\&\quad\langle{y-x},{\nabla\mathsf{f}(x)}\rangle\geqslant - \gamma\lVert{\nabla\mathsf{f}(x)}\rVert^{\ast}\lVert{y-x}{\rVert^{2}_{2}}. \end{align} (2.4)

**Inner products.** Fourthly, we introduce an inner product condition, requiring that projection to the constraint set $$\mathscr{C}$$ behaves similarly to a convex projection: for $$x,y\in \mathscr{C}$$,
\begin{align} \text{For any }z\in\mathbb{R}^{d}\text{ with }P_{{\mathscr{C}}}(z)=x,\quad \langle{y-x},\ {z-x}\rangle \leqslant \gamma \lVert{z-x}\rVert^{\ast}\lVert{y-x}{\rVert^{2}_{2}}. \end{align} (2.5)

We will see later that, by choosing $$\lVert{\cdot }\rVert$$ to reflect the structure in the signal (rather than working only with the ℓ2 norm), we are able to obtain a more favorable scaling in our concavity coefficients, and hence to prove meaningful convergence results in high-dimensional settings. On the other hand, regardless of our choice of $$\lVert{\cdot }\rVert$$, note that the ℓ2 norm also appears in the definition of the concavity coefficients, as is natural when working with inner products.

Our first main result shows that the above conditions are in fact exactly equivalent:

**Theorem 2.1** The properties (2.1), (2.2), (2.3), (2.4) and (2.5) are equivalent; that is, for a fixed choice $$\gamma \in [0,\infty ]$$, they either all hold for every $$x,y\in \mathscr{C}$$, or all fail to hold for some $$x,y\in \mathscr{C}$$.

Formally, we will define $$\gamma (\mathscr{C})$$ to be the smallest value such that the above properties hold:

$$\gamma(\mathscr{C}):= \min\left\{\gamma\in[0,\infty] : \text{Properties (2.1), (2.2), (2.3), (2.4), (2.5) hold for all}\ x,y\in\mathscr{C}\,\right\}\!.$$

However, this global coefficient $$\gamma (\mathscr{C})$$ is often of limited use in practical settings, since many sets are well behaved locally but not globally. For instance, the set $$\mathscr{C}=\left \{X\in \mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right \}$$ has $$\gamma (\mathscr{C})=\infty$$, but exhibits smooth curvature and good convergence behavior as long as we stay away from rank-degenerate matrices (that is, matrices with rank(X) < r). Since we may often want to ensure convergence in this type of setting, where global concavity cannot be bounded, we next turn to a local version of the same concavity bounds.

## 2.1. Local concavity coefficients

We now consider the local concavity coefficients $$\gamma _{x}(\mathscr{C})$$, measuring the concavity of a set $$\mathscr{C}$$ relative to a specific point x in the set. We will see examples later on where $$\gamma (\mathscr{C})=\infty$$, but $$\gamma _{x}(\mathscr{C})$$ is bounded for many points $$x\in \mathscr{C}$$. First we define a set of 'degenerate points',

$$\mathscr{C}_{\mathsf{dgn}} = \left\{x\in\mathscr{C}:P_{{\mathscr{C}}}\text{ is not continuous over any neighborhood of }x\right\}\!,$$

and then let

\begin{align}\gamma_{x}(\mathscr{C}) = \begin{cases} \infty,&x\in\mathscr{C}_{\mathsf{dgn}},\\ \min\left\{\gamma\in[0,\infty]: \text{Property (*) holds for this point }x\text{ and any }y\in\mathscr{C}\right\}\!,&x\not\in\mathscr{C}_{\mathsf{dgn}}, \end{cases} \end{align} (2.6)

where the property (*) may refer to any of the four definitions of the concavity coefficients,2 namely (2.1), (2.3), (2.4) or (2.5). We will see shortly why it is necessary to make an exception for the degenerate points $$x\in \mathscr{C}_{\mathsf{dgn}}$$ in the definition of these coefficients. Our next main result shows that the equivalence between the four properties (2.1), (2.3), (2.4) and (2.5), established for the global concavity coefficient $$\gamma (\mathscr{C})$$, holds also for the local coefficients:

**Theorem 2.2** For all $$x\in \mathscr{C}$$, the definition (2.6) of $$\gamma _{x}(\mathscr{C})$$ is equivalent for all four choices of the property (*), namely the conditions (2.1), (2.3), (2.4) or (2.5).

To develop an intuition for the global and local concavity coefficients, we give a simple example in $$\mathbb{R}^{2}$$ (relative to the ℓ2 norm, i.e. $$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert ^{\ast }=\lVert{\cdot }\rVert _{2}$$), displayed in Fig. 1. Define $$\mathscr{C}=\left \{x\in \mathbb{R}^{2}: x_{1}\leqslant 0\textrm{ or }x_{2}\leqslant 0\right \}$$.
Due to the degenerate point x = (0, 0), we can see that $$\gamma (\mathscr{C})=\infty$$ in this case. The local concavity coefficients are given by

$$\gamma_{x}(\mathscr{C}) = \begin{cases} \infty,&\textrm{ if }x=(0,0),\\ \frac{1}{2t},&\textrm{ if } x = (t,0)\textrm{ or }(0,t)\textrm{ for }t>0,\\ 0,&\textrm{ if }x_{1}<0\textrm{ or }x_{2}<0.\end{cases}$$

Note that at the degenerate point x = (0, 0), $$\mathscr{C}$$ actually contains all convex combinations of this point x with any $$y\in \mathscr{C}$$, and so the curvature condition (2.1) is satisfied with γ = 0. However, $$x\in \mathscr{C}_{\mathsf{dgn}}$$, so we nonetheless set $$\gamma _{x}(\mathscr{C})=\infty$$.

Fig. 1. A simple example of the local concavity coefficients on $$\mathscr{C}=\{x\in \mathbb{R}^{2}:x_{1}\leqslant 0\textrm{ or }x_{2}\leqslant 0\}$$. The gray shaded area represents $$\mathscr{C}$$, while the numbers give the local concavity coefficients at each marked point.

Practical high-dimensional examples, such as a rank constraint, will be discussed in depth in Section 5. For example we will see that, for the rank-constrained set $$\mathscr{C}=\left \{X\in \mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right \}$$, the local concavity coefficients satisfy $$\gamma _{X}(\mathscr{C})= \frac{1}{2\sigma _{r}(X)}$$ relative to the nuclear norm. In general, the local coefficients can be interpreted as follows:

- If x lies in the interior of $$\mathscr{C}$$, or if $$\mathscr{C}$$ is convex, then $$\gamma _{x}(\mathscr{C})=0$$.
- If x lies on the boundary of $$\mathscr{C}$$, near a portion of the boundary that is smooth but non-convex, then we will typically see a finite but non-zero $$\gamma _{x}(\mathscr{C})$$.
- $$\gamma _{x}(\mathscr{C})=\infty$$ can indicate a non-convex cusp or other degeneracy at the point x.

## 2.2. Properties

We next prove some properties of the local coefficients $$\gamma _{x}(\mathscr{C})$$ that will be useful for our convergence analysis, as well as for gaining intuition for these coefficients. First, the global and local coefficients are related in the natural way:

**Lemma 2.3** For any $$\mathscr{C}$$, $$\gamma (\mathscr{C})=\sup _{x\in \mathscr{C}}\gamma _{x}(\mathscr{C})$$.

Next, observe that $$x\mapsto \gamma _{x}(\mathscr{C})$$ is not continuous in general (in particular, since $$\gamma _{x}(\mathscr{C})=0$$ in the interior of $$\mathscr{C}$$, but is often positive on the boundary). However, this map does satisfy upper semi-continuity:

**Lemma 2.4** The function $$x\mapsto \gamma _{x}(\mathscr{C})$$ is upper semi-continuous over $$x\in \mathscr{C}$$.

Furthermore, setting $$\gamma _{x}(\mathscr{C})=\infty$$ at the degenerate points $$x\in \mathscr{C}_{\mathsf{dgn}}$$ is natural in the following sense: the resulting map $$x\mapsto \gamma _{x}(\mathscr{C})$$ is the minimal upper semi-continuous map such that the relevant local concavity properties are satisfied. We formalize this with the following lemma:

**Lemma 2.5** For any $$u\in \mathscr{C}_{\mathsf{dgn}}$$ and any of the four conditions (2.1), (2.3), (2.4) or (2.5), the property does not hold in any neighborhood of u for any finite γ. That is, for any r > 0,

$$\min\Big\{\gamma\geqslant 0:\ \text{Property (*) holds for all }x\in\mathscr{C}\cap\mathbb{B}_{2}(u,r)\text{ and for all }y\in\mathscr{C}\Big\}= \infty,$$

where (*) may refer to any of the four equivalent properties, i.e. (2.1), (2.3), (2.4) and (2.5). (Here, $$\mathbb{B}_{2}(u,r)$$ is the ball of radius r around the point u, with respect to the ℓ2 norm.)
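To make the Fig. 1 example concrete, the following short numerical sketch (ours, not part of the paper's development; the helper names `proj_C` and `curvature_ratio` are illustrative) evaluates the left-hand side of the curvature condition (2.1) at boundary points x = (t, 0). The smallest admissible γ matches the claimed coefficient 1/(2t), and it diverges as t → 0, consistent with Lemma 2.5 at the degenerate point (0, 0):

```python
import numpy as np

def proj_C(z):
    """Euclidean projection onto C = {x in R^2 : x1 <= 0 or x2 <= 0}.
    If both coordinates are positive, the nearest point of C is obtained
    by zeroing out the smaller coordinate."""
    z = np.asarray(z, dtype=float)
    if min(z) <= 0:
        return z.copy()
    out = z.copy()
    out[np.argmin(z)] = 0.0
    return out

def curvature_ratio(x, y, s):
    """dist((1-s)x + s*y, C) / s, the quantity bounded by gamma*||x-y||_2^2
    in condition (2.1)."""
    p = (1 - s) * np.asarray(x, float) + s * np.asarray(y, float)
    return float(np.linalg.norm(p - proj_C(p)) / s)

s = 1e-8
for t in [1.0, 0.1, 0.01]:
    x, y = np.array([t, 0.0]), np.array([0.0, t])
    ratio = curvature_ratio(x, y, s)
    bound = (1.0 / (2 * t)) * np.linalg.norm(x - y) ** 2  # gamma_x * ||x-y||_2^2
    print(t, ratio, bound)  # ratio matches the bound, so gamma_x = 1/(2t) is tight
```

Since the bound is attained for this choice of y, the coefficient 1/(2t) cannot be improved, and the required γ grows without bound as x approaches the degenerate corner.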
Finally, the next result shows that the two-sided contraction property (2.2) holds using local coefficients, meaning that all five definitions of the concavity coefficients are equivalent:

**Lemma 2.6** For any $$z,w\in \mathbb{R}^{d}$$,

$$\left(1-\gamma_{P_{{\mathscr{C}}}(z)}(\mathscr{C})\lVert{z-P_{{\mathscr{C}}}(z)}\rVert^{\ast}-\gamma_{P_{{\mathscr{C}}}(w)}(\mathscr{C})\lVert{w-P_{{\mathscr{C}}}(w)}\rVert^{\ast}\right) \cdot \lVert{P_{{\mathscr{C}}}(z)-P_{{\mathscr{C}}}(w)}\rVert_{2} \leqslant \lVert{z - w}\rVert_{2}.$$

In particular, for any fixed c ∈ (0, 1), Lemma 2.4 proves that

\begin{align} P_{{\mathscr{C}}}\text{ is }(1/c)\text{-Lipschitz over the set }\left\{z\in\mathbb{R}^{d}:2\gamma_{P_{{\mathscr{C}}}(z)}(\mathscr{C})\lVert{z-P_{{\mathscr{C}}}(z)}\rVert^{\ast}\leqslant 1-c\right\}, \end{align} (2.7)

where the Lipschitz constant is defined with respect to the ℓ2 norm. This provides a sort of converse to our definition of the degenerate points, where we set $$\gamma _{x}(\mathscr{C})=\infty$$ for all $$x\in \mathscr{C}_{\mathsf{dgn}}$$, i.e. all points x where $$P_{{\mathscr{C}}}$$ is not continuous in any neighborhood of x.

## 2.3. Connection to prox-regular sets

The notion of prox-regular sets and sets of positive reach arises in the literature on non-smooth analysis in Hilbert spaces; see the study by Colombo & Thibault (2010) for a comprehensive overview of the key results in this area. The work on prox-regular sets also generalizes to the notion of prox-regular functions (see e.g. Rockafellar & Wets, 2009, Chapter 13.F). A prox-regular set is a set $$\mathscr{C}\subset \mathbb{R}^{d}$$ that satisfies3

\begin{align} \langle\,{y-x},\ {z-x}\,\rangle\leqslant\frac{1}{2\rho}\lVert{z-x}\rVert_{2}\lVert{y-x}{\rVert^{2}_{2}}, \end{align} (2.8)

for all $$x,y\in \mathscr{C}$$ and all $$z\in \mathbb{R}^{d}$$ with $$P_{{\mathscr{C}}}(z)=x$$, for some constant ρ > 0.
To capture local variations in concavity over the set, we say that $$\mathscr{C}$$ is prox-regular with respect to a continuous function $$\rho :\mathscr{C}\rightarrow (0,\infty ]$$ if

\begin{align} \langle\,{y-x},{z-x}\,\rangle\leqslant\frac{1}{2\rho(x)}\lVert{z-x}\rVert_{2}\lVert{y-x}{\rVert^{2}_{2}} \end{align} (2.9)

for all $$x,y\in \mathscr{C}$$ and all $$z\in \mathbb{R}^{d}$$ with $$P_{{\mathscr{C}}}(z)=x$$ (see e.g. Colombo & Thibault, 2010, Theorem 3b).4 Historically, prox-regularity was first formulated via the notion of 'positive reach' (Federer, 1959): the parameter ρ appearing in (2.8) is the largest radius such that the projection $$P_{{\mathscr{C}}}(z)$$ is unique for all points z within distance ρ of the set $$\mathscr{C}$$; in the local version (2.9), the radius is allowed to vary locally as a function of $$x\in \mathscr{C}$$.

The definitions (2.8) and (2.9) exactly coincide with our inner product condition (2.5) in the special case that $$\lVert{\cdot }\rVert$$ is the ℓ2 norm, by taking $$\gamma = \frac{1}{2\rho }$$ or, for the local coefficients, $$\gamma =\frac{1}{2\rho (x)}$$. In the ℓ2 setting, there is a substantial literature exploring the equivalence between many different characterizations of prox-regularity, including properties that are equivalent to each of our characterizations of the local concavity coefficients. Here we note a few places in the literature where these conditions appear, and refer the reader to the study by Colombo & Thibault (2010) for historical background on these ideas:

- The curvature condition (2.1) is proved in the study by Colombo & Thibault (2010, Proposition 9, Theorem 14(q)).
- The one- and two-sided contraction conditions (2.3) and (2.2) appear in the studies by Federer (1959, Section 4.8) and Colombo & Thibault (2010, Theorem 14(g)); the inner product condition (2.5) can be found in the studies by Federer (1959, Section 4.8), Colombo & Thibault (2010, Theorem 3(b)), Canino (1988, Definition 1.5) and Colombo & Marques (2003, Definition 2.1).
- The first-order optimality condition (2.4) is closely related to the inner product condition when formulated using the ideas of normal cones and proximal normal cones (for instance, Rockafellar & Wets, 2009, Theorem 6.12 relates gradients of f to normal cones at x).

The distinctions between our definitions and results on local concavity coefficients and the literature on prox-regularity center on two key differences: the role of continuity, and the flexibility of the structured norm $$\lVert{\cdot }\rVert$$ (rather than the ℓ2 norm). We discuss these two separately.

**Continuity.** In the literature on prox-regular sets, the 'reach' function $$x\mapsto \rho (x)\in (0,\infty ]$$ is assumed to be continuous (Colombo & Thibault, 2010, Definition 1). Equivalently, we could take a continuous function $$x\mapsto \gamma _{x} = \frac{1}{2\rho (x)}\in [0,\infty )$$ to agree with the notation of our local concavity coefficients. However, this is not the same as finding the smallest value γx such that the concavity coefficient conditions are satisfied (locally at the point x). In our definitions, we do not enforce continuity of the map x ↦ γx, and instead define $$\gamma _{x}(\mathscr{C})$$ as the smallest value such that the conditions are satisfied. This leads to substantial challenges in proving the equivalence of the various conditions; in Lemma 2.4 we prove that the map is naturally upper semi-continuous, which allows us to show the desired equivalences.
In terms of practical implications, in order to use the local concavity coefficients to describe the convergence behavior of optimization problems, it is critical that we allow for discontinuity. For instance, suppose that $$\mathscr{C}$$ is non-convex and its interior $$\mathsf{Int}(\mathscr{C})$$ is non-empty. For any $$x\in \mathsf{Int}(\mathscr{C})$$, the concavity coefficient conditions are satisfied with γx = 0. In particular, consider the first-order optimality condition (2.4): if $$x\in \mathsf{Int}(\mathscr{C})$$ is a local minimizer of some convex function f, then we must have ∇f(x) = 0, so x is in fact the global minimizer of f. On the other hand, since $$\mathscr{C}$$ is non-convex, we must have γx > 0 for at least some points x on the boundary of $$\mathscr{C}$$. If we were to require continuity of the function x ↦ γx, we would then be forced to have γx > 0 for some points $$x\in \mathsf{Int}(\mathscr{C})$$ as well. This means that γx would not give a precise description of the behavior of first-order methods when constraining to $$\mathscr{C}$$: it would not reveal that non-global minima are impossible in the interior of the set. More generally, we will show in Lemma 3.2 that the local concavity coefficients (defined as the lowest possible constants, as in (2.6)) provide a tight characterization of the convergence behavior of projected gradient descent over the constraint set $$\mathscr{C}$$; if we enforced continuity, we would be forced to choose larger values for $$\gamma _{x}(\mathscr{C})$$ at some points $$x\in \mathscr{C}$$, and the concavity coefficients would no longer be both necessary and sufficient for convergence.
One related point: by allowing $$\gamma _{x}(\mathscr{C})$$ to be infinite if needed (which is equivalent to allowing the 'reach' ρ(x) to be zero for some x), we can accommodate constraint sets such as the low-rank matrix constraint, $$\mathscr{C}=\left \{X\in \mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right \}$$. Recalling that $$\gamma _{X}(\mathscr{C})=\frac{1}{2\sigma _{r}(X)}$$ as mentioned earlier, we see that a rank-deficient matrix X (i.e. rank(X) < r) has $$\gamma _{X}(\mathscr{C})=\infty$$. By not requiring that the concavity coefficient is finite (equivalently, that the reach is positive), we avoid the need for inelegant modifications (e.g. working with a truncated set such as $$\mathscr{C}=\left \{X:\operatorname{rank}(X)\leqslant r,\sigma _{r}(X)\geqslant \varepsilon \right \}$$).

**Structured norms.** Prox-regularity (or, equivalently, the notion of positive reach) is studied in the literature in a Hilbert space, with respect to its norm, which in $$\mathbb{R}^{d}$$ means the ℓ2 norm (or a weighted ℓ2 norm).5 In contrast, our work defines local concavity coefficients with respect to a general structured norm $$\lVert{\cdot }\rVert$$, such as the ℓ1 norm in a sparse signal estimation setting. To see the distinction, compare our inner product condition (2.5) with the definition of prox-regularity (2.8). Of course, the equivalence of all norms on $$\mathbb{R}^{d}$$ means that if $$\gamma (\mathscr{C})$$ is finite when defined with respect to the ℓ2 norm (i.e. $$\mathscr{C}$$ is prox-regular), then it is finite with respect to any other norm, so the importance of the distinction may not be immediately clear. As an example, let $$\gamma ^{\ell _{1}}(\mathscr{C})$$ and $$\gamma ^{\ell _{2}}(\mathscr{C})$$ denote the concavity coefficients with respect to the ℓ1 and ℓ2 norms.
Since $$\lVert{\cdot }\rVert _{2}\leqslant \lVert{\cdot }\rVert _{1}\leqslant \sqrt{d}\lVert{\cdot }\rVert _{2}$$, we could trivially show that

$$\gamma^{\ell_{2}}(\mathscr{C})\leqslant \gamma^{\ell_{1}}(\mathscr{C})\leqslant \sqrt{d}\cdot \gamma^{\ell_{2}}(\mathscr{C}),$$

but the factor $$\sqrt{d}$$ is unfavorable, so in many settings this is a very poor bound on $$\gamma ^{\ell _{1}}(\mathscr{C})$$. We may then ask: why not simply define the coefficients in terms of the ℓ2 norm? The reason is that in optimization problems arising in high-dimensional settings (for instance, high-dimensional regression in statistics), structured norms such as the ℓ1 norm (for problems involving sparse signals) or the nuclear norm (for low-rank signals) allow for statistical and computational analyses that would not be possible with the ℓ2 norm. In particular, we will see later on that convergence for the minimization problem $$\min _{x\in \mathscr{C}}\mathsf{g}(x)$$ will depend on bounding $$\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast }$$. If $$\lVert{\cdot }\rVert$$ is the ℓ1 norm, for instance, then $$\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast } = \lVert{\nabla \mathsf{g}(x)}\rVert _{\infty }$$ will in general be much smaller than $$\lVert{\nabla \mathsf{g}(x)}\rVert _{2}$$. For instance, in a statistical problem, if ∇g(x) consists of Gaussian or sub-Gaussian noise at the true parameter vector x, then $$\lVert{\nabla \mathsf{g}(x)}\rVert _{\infty }\sim \sqrt{\log (d)}$$ while $$\lVert{\nabla \mathsf{g}(x)}\rVert _{2}\sim \sqrt{d}$$. Therefore, being able to bound the concavity of $$\mathscr{C}$$ with respect to the ℓ1 norm rather than the ℓ2 norm is crucial for analyzing convergence in a high-dimensional setting. In the next section, we will study how the choice of the norm $$\lVert{\cdot }\rVert$$ and its dual $$\lVert{\cdot }\rVert ^{\ast }$$ relates to the convergence properties of projected gradient descent.
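The claimed scaling of the two dual norms is easy to check empirically. In this sketch (our illustration, not from the paper), we stand in for a noise gradient with a standard Gaussian vector:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [100, 10_000]:
    grad = rng.standard_normal(d)        # stand-in for a Gaussian noise gradient
    linf = np.linalg.norm(grad, np.inf)  # dual of the l1 norm
    l2 = np.linalg.norm(grad, 2)
    # both ratios stay O(1) as d grows, confirming the sqrt(log d) vs sqrt(d) scaling
    print(d, linf / np.sqrt(np.log(d)), l2 / np.sqrt(d))
```

So the ℓ∞ norm of the gradient is smaller than its ℓ2 norm by a factor of roughly $$\sqrt{d/\log d}$$, which is exactly the gain from working with the structured norm.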
## 3. Fast convergence of projected gradient descent

Consider an optimization problem constrained to a non-convex set, $$\min \{ \mathsf{g}(x) : x\in \mathscr{C}\}$$, where $$\mathsf{g}:\mathbb{R}^{d}\rightarrow \mathbb{R}$$ is a differentiable function. We will work with projected gradient descent in the setting where g is convex or approximately convex, while $$\mathscr{C}$$ is non-convex with local concavity coefficients $$\gamma _{x}(\mathscr{C})$$. After choosing some initial point $$x_{0}\in \mathscr{C}$$, for each t ⩾ 0 we define

\begin{align}\begin{cases} x^{\prime}_{t+1} = x_{t} - \eta\nabla\mathsf{g}\left(x_{t}\right)\!,\\ x_{t+1} = P_{{\mathscr{C}}}\left(x^\prime_{t+1}\right)\!,\end{cases} \end{align} (3.1)

where, if $$P_{{\mathscr{C}}}\left (x^\prime _{t+1}\right )$$ is not unique, any closest point may be chosen.

## 3.1. Assumptions

**Assumptions on g.** We first consider the objective function g. Let $$\widehat{x}$$ be the target of our optimization procedure, $$\widehat{x} \in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$$. We assume that g satisfies RSC and RSM conditions over $$x,y\in \mathscr{C}$$:

\begin{align} \mathsf{g}(y)\geqslant \mathsf{g}(x) + \langle\,{y-x},{\nabla\mathsf{g}(x)}\rangle + \frac{\alpha}{2}\lVert{x-y}{\rVert^{2}_{2}} - \frac{\alpha}{2}\varepsilon_{\mathsf{g}}^{2} \end{align} (3.2)

and

\begin{align} \mathsf{g}(y)\leqslant \mathsf{g}(x) + \langle\,{y-x},{\nabla\mathsf{g}(x)}\rangle + \frac{\beta}{2}\lVert{x-y}{\rVert^{2}_{2}} + \frac{\alpha}{2}\varepsilon_{\mathsf{g}}^{2}. \end{align} (3.3)

Without loss of generality we take $$\alpha \leqslant \beta$$. As is common in the low-rank factorized optimization literature, we will work in a local neighborhood of the target $$\widehat{x}$$ by assuming that our initialization point lies within radius ρ of $$\widehat{x}$$, which allows us to require these conditions on g to hold only locally.
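As a concrete instance of iteration (3.1), here is a minimal sketch (ours; the helper names are illustrative) for a rank-constrained problem, where the ℓ2 (Frobenius) projection onto $$\mathscr{C}$$ is computed by a truncated SVD:

```python
import numpy as np

def proj_rank(Z, r):
    """Frobenius-norm projection onto {X : rank(X) <= r}, via truncated SVD
    (Eckart-Young theorem)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def projected_gradient_descent(grad_g, proj, X0, eta, n_iters):
    """Iteration (3.1): a gradient step on g, followed by projection onto C."""
    X = X0
    for _ in range(n_iters):
        X = proj(X - eta * grad_g(X))
    return X

# Toy instance: g(X) = 0.5*||X - M||_F^2 with a rank-2 target M, so that
# grad g(X) = X - M and alpha = beta = 1 (hence step size eta = 1/beta = 1).
rng = np.random.default_rng(1)
M = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
X_hat = projected_gradient_descent(lambda X: X - M,
                                   lambda Z: proj_rank(Z, 2),
                                   X0=np.zeros((20, 15)), eta=1.0, n_iters=5)
print(np.linalg.norm(X_hat - M))   # essentially zero
```

This toy objective is deliberately simple (the gradient step lands exactly on M), but the same two-line loop applies verbatim to any differentiable g and any computable projection.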
The term εg gives some 'slack' in our assumption on g, and is intended to capture a vanishingly small error level. This term is often referred to as the 'statistical error' in the high-dimensional statistics literature, and represents the best-case scaling of the accuracy of our recovered solution. Often $$\widehat{x}$$ may represent a global minimizer which is within radius εg of some 'true' parameter in a statistical setting; therefore, converging to $$\widehat{x}$$ up to an error of magnitude εg means that the recovered solution is as accurate as $$\widehat{x}$$ at recovering the true parameter. For instance, we will often have $$\varepsilon _{\mathsf{g}}\sim{\sqrt{\frac{\log (d)}{n}}}$$ in a statistical setting where we are solving a sparse estimation problem of dimension d with sample size n.

**Assumptions on $$\mathscr{C}$$.** Next, turning to the non-convexity of $$\mathscr{C}$$, we will assume local concavity coefficients $$\gamma _{x}(\mathscr{C})$$ that are not too large in a neighborhood of $$\widehat{x}$$, with details given below. We furthermore assume a norm compatibility condition,

\begin{align} \left\lVert{z - P_{{\mathscr{C}}}(z)}\right\rVert^{\ast} \leqslant \phi \min_{x\in\mathscr{C}}\lVert{z-x}\rVert^{\ast}\text{ for all }z\in\mathbb{R}^{d}, \end{align} (3.4)

for some constant $$\phi \geqslant 1$$. The norm compatibility condition is trivially true with ϕ = 1 if $$\lVert{\cdot }\rVert$$ is the ℓ2 norm, since $$P_{{\mathscr{C}}}$$ is a projection with respect to the ℓ2 norm. We will see that in many natural settings it holds even for other norms, often with ϕ = 1.
Assumptions on gradient and initialization Finally, we assume a gradient condition that reveals the connection between the curvature of the non-convex set $$\mathscr{C}$$ and the target function g: we require that   \begin{align} 2\phi\cdot\max_{x,x^{\prime}\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)}\gamma_{x}(\mathscr{C})\lVert{\nabla\mathsf{g}(x^\prime)}\rVert^{\ast} \leqslant (1-c_{0}) \cdot \alpha. \end{align} (3.5)(Since $$x\mapsto \gamma _{x}(\mathscr{C})$$ is upper semi-continuous, if g is continuously differentiable, then we can find some radius ρ > 0 and some constant c0 > 0 satisfying this condition, as long as $$2\phi \gamma _{\widehat{x}}(\mathscr{C}) \lVert{\nabla \mathsf{g}\left (\widehat{x}\right )}\rVert ^{\ast } < \alpha$$.) Our projected gradient descent algorithm will then succeed if initialized within this radius ρ from the target point $$\widehat{x}$$, with an appropriate step size. We will discuss the necessity of this type of initialization condition below in Section 3.4. In practice, relaxing the constraint $$x\in \mathscr{C}$$ to a convex constraint (or convex penalty) is often sufficient for providing a good initialization point. For example, in the low-rank matrix setting, if we would like to solve $$\operatorname{arg\,min}\{\mathsf{g}(X):\operatorname{rank}(X)\leqslant r\}$$, we may first solve $$\operatorname{arg\,min}_{X}\left \{\mathsf{g}(X) + \lambda \lVert{{X}}\rVert _{\textrm{nuc}}\right \}$$, where $$\lVert{{X}}\rVert _{\textrm{nuc}}$$ is the nuclear norm and $$\lambda \geqslant 0$$ is a penalty parameter (which we would tune to obtain the desired rank for X). Alternatively, in some settings, it may be sufficient to solve an unconstrained problem $$\operatorname{arg\,min}_{X}\mathsf{g}(X)$$ and then project to the constraint set, $$P_{{\mathscr{C}}}(X)$$. For some detailed examples of suitable initialization procedures for various low-rank matrix estimation problems, see e.g. the studies by Chen & Wainwright (2015) and Tu et al. (2015). 3.2.
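The second initialization strategy mentioned above (solve an unconstrained problem, then project onto $$\mathscr{C}$$) can be sketched in a few lines; this is our own illustration under the simplest possible stand-in objective, g(X) = ½‖X − Y‖²_F, whose unconstrained minimizer is Y itself.

```python
import numpy as np

def init_by_projection(Y, r):
    """Illustrative initialization: take the unconstrained minimizer of
    g(X) = 0.5*||X - Y||_F^2 (namely Y) and project it onto the rank-r set
    via a truncated SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 8))
X0 = init_by_projection(Y, r=3)
```

For less trivial objectives the unconstrained solve would be an iterative method, but the project-to-$$\mathscr{C}$$ step is the same.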
Convergence guarantee We now state our main result, which proves that under these conditions, initializing at some $$x_{0}\in \mathscr{C}$$ sufficiently close to $$\widehat{x}$$ will guarantee fast convergence to $$\widehat{x}$$. Theorem 3.1 Let $$\mathscr{C}\subset \mathbb{R}^{d}$$ be a constraint set and let g be a differentiable function, with minimizer $$\widehat{x}\in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$$. Suppose $$\mathscr{C}$$ satisfies the norm compatibility condition (3.4) with parameter ϕ, and g satisfies RSC (3.2) and RSM (3.3) with parameters α, β, εg for all $$x,y\in \mathbb{B}_{2}(\widehat{x},\rho )$$, and the initialization condition (3.5) for some c0 > 0. If the initial point $$x_{0}\in \mathscr{C}$$ and the error level εg satisfy $$\lVert{x_{0} -\widehat{x}}{\rVert ^{2}_{2}}<\rho ^{2}$$ and $$\varepsilon _{\mathsf{g}}^{2}< \frac{c_{0} \rho ^{2}}{1.5}$$, then for each step $$t\geqslant 0$$ of the projected gradient descent algorithm (3.1) with step size η = 1/β,   $$\lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} \leqslant \left(1 - c_{0}\cdot \frac{2\alpha}{\alpha+\beta}\right)^{t} \lVert{x_{0} - \widehat{x}}{\rVert^{2}_{2}} +\frac{1.5{\varepsilon}_{\mathsf{g}}^{2}}{c_{0}}.$$ In other words, the iterates xt converge linearly to the minimizer $$\widehat{x}$$, up to precision level εg. 3.3. Comparison to related work We now compare to several related results for convex and non-convex projected gradient descent. (For methods that are specific to the problem of optimization over low-rank matrices, we will discuss this comparison and perform simulations later on.) Comparison to convex optimization To compare this result to the convex setting, if $$\mathscr{C}$$ is a convex set and g is α-strongly convex and β-smooth, then we can set c0 = 1 and εg = 0. 
Our result then yields   $$\lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} \leqslant \left(1 - \frac{2\alpha}{\alpha+\beta}\right)^{t} \lVert{x_{0} - \widehat{x}}{\rVert^{2}_{2}} = \left(\frac{\beta-\alpha}{\beta+\alpha}\right)^{t}\lVert{x_{0}-\widehat{x}}{\rVert^{2}_{2}},$$matching known rates for the convex setting (see e.g. Bubeck, 2015, Theorem 3.10). Comparison to known results using descent cones Oymak et al. (2015) study projected gradient descent for a linear regression setting, $$\mathsf{g}(x) = \frac{1}{2}\lVert{b-Ax}{\rVert ^{2}_{2}}$$, while constraining some potentially non-convex regularizer, $$\mathscr{C} = \{x:\textrm{Pen}(x)\leqslant c\}$$. Given a true solution $$x^{\star }\in \mathscr{C}$$ (for instance, in a statistical setting, we may have b = Ax⋆ + (noise)), their work focuses on the descent cone of $$\mathscr{C}$$ at x⋆, given by   $$\textrm{DC}_{x^{\star}} = \textrm{Smallest closed cone containing }\left\{u: \textrm{Pen}\left(x^{\star}+u\right) \leqslant c\right\}\!\!.$$(Trivially we will have $$x_{t} - x^{\star } \in \textrm{DC}_{x^{\star }}$$ since $$x_{t}\in \mathscr{C}$$.) Their results characterize the convergence of projected gradient descent in terms of the eigenvalues of A⊤A restricted to this cone. For simplicity, we show their result specialized to the noiseless setting, i.e. b = Ax⋆, given in the study by Oymak et al. (2015, Theorem 1.2):   \begin{align} \lVert{x_{t} - x^{\star}}\rVert_{2} \leqslant \left(2\cdot \max_{u,v\in\textrm{DC}_{x^{\star}}\cap\mathbb{S}^{d-1}} u^{\top} \left(\mathbf{I}_{d} - \eta A^{\top} A\right) v\right)^{t} \lVert{x^{\star}}\rVert_{2}. \end{align} (3.6)For this result to be meaningful we of course need the radius of convergence to be < 1. For a convex constraint set $$\mathscr{C}$$ (i.e. if Pen(x) is convex), the factor of 2 can be removed. In the non-convex setting, however, the factor of 2 means that the maximum in (3.6) must be $$<\frac{1}{2}$$ for the bound to ensure convergence.
Noting that ∇g(x) = A⊤A(x − x⋆) in this problem, by setting u = v ∝ x − x⋆ we see that (3.6) effectively requires that $$\eta> \frac{1}{2\alpha }$$, where α is the RSC parameter (3.2). However, we also know that $$\eta \leqslant \frac{1}{\beta }$$ is generally a necessary condition to ensure stability of projected gradient descent; if $$\eta>\frac{1}{\beta }$$ then we may see values of g increase over iterations, i.e. g(x1) > g(x0). Therefore, the condition (3.6) effectively requires that g is well conditioned with $$\beta \lesssim 2\alpha$$, and furthermore that x⋆ is not in the interior of $$\mathscr{C}$$ (since, if this were the case, then $$\textrm{DC}_{x^{\star }} = \mathbb{R}^{d}$$). On the other hand, if the radius in (3.6) is indeed < 1, then their work does not assume any type of initialization condition for convergence to be successful, in contrast to our initialization assumption (3.5). Comparison to known results for iterative hard thresholding We now compare our results to those of Jain et al. (2014), which specifically treat the iterative hard thresholding algorithm for a sparsity constraint or a rank constraint,   $$\mathscr{C} = \left\{x\in\mathbb{R}^{d}:|\operatorname{support}(x)|\leqslant k\right\}\textrm{ or } \mathscr{C} = \left\{X\in\mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right\}\!\!.$$In their work, they take a substantially different approach: instead of bounding the distance between xt and the minimizer $$\widehat{x}\in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$$, they instead take $$\widehat{x}$$ to be a minimizer over a stronger constraint,   $$\widehat{x} =\mathop{\operatorname{arg\,min}}\limits_{|\operatorname{support}(x)|\leqslant k^{\star}}\mathsf{g}(x)\quad\textrm{ or }\quad\widehat{X}=\mathop{\operatorname{arg\,min}}\limits_{\operatorname{rank}(X)\leqslant r^{\star}}\mathsf{g}(X),$$taking k⋆ ≪ k or r⋆ ≪ r to enforce that the sparsity of $$\widehat{x}$$ or rank of $$\widehat{X}$$ is much lower than the
optimization constraint set $$\mathscr{C}$$. With this definition, they then bound the gap in objective function values, $$\mathsf{g}(x_{t}) - \mathsf{g}(\widehat{x})$$. In other words, the objective function value g(xt) is, up to a small error, no larger than the best value obtained over the substantially more restricted set of k⋆-sparse vectors or of rank-r⋆ matrices. By careful use of this gap k⋆ ≪ k or r⋆ ≪ r, their analysis allows for convergence results from any initialization point $$x_{0}\in \mathscr{C}$$. In contrast, our work allows $$\widehat{x}$$ to lie anywhere in $$\mathscr{C}$$, but this comes at the cost of assuming a local initialization point $$x_{0}\in \mathscr{C}\cap \mathbb{B}_{2}\left (\widehat{x},\rho \right )$$. This result suggests a possible two-phase approach: first, we might optimize over a larger rank constraint $$\mathscr{C}=\{X:\operatorname{rank}(X)\leqslant r\},$$ where r ≫ r⋆ to obtain the convergence guarantees of the study by Jain et al. (2014) (which do not assume a good initialization point, but obtain weaker guarantees); then, given the solution over rank r as a good initialization point, we would then optimize over the tighter constraint $$\mathscr{C}=\{X:\operatorname{rank}(X)\leqslant r^{\star }\}$$ to obtain our stronger guarantees. Comparison to results on prox-regular functions Pennanen (2002) studies conditions for linear convergence of the proximal point method for minimizing a function f(x), and shows that prox-regularity of f(x) is sufficient; Lewis & Wright (2016) also study this problem in a more general setting. For our optimization problem, this translates to setting $$\mathsf{f}(x) = \mathsf{g}(x) + \delta _{\mathscr{C}}(x)$$, where   $$\delta_{\mathscr{C}}(x)=\begin{cases}0,&x\in\mathscr{C},\\\infty,&x\not\in\mathscr{C}.\end{cases}$$(This is usually called the ‘indicator function’ for the set $$\mathscr{C}$$.) If $$\mathsf{g}(x) + \frac{\mu }{2}\lVert{x}{\rVert ^{2}_{2}}$$ is convex (i.e.
the concavity of g is bounded) and $$\mathscr{C}$$ is a prox-regular set (i.e. $$\gamma (\mathscr{C})<\infty$$, see Section 2.3), then f(x) is a prox-regular function. This work was extended by Iusem et al. (2003) and others to an inexact proximal point method, allowing for error in each iteration, which can be formulated to encompass the projected gradient descent algorithm studied here. Our first convergence result Theorem 3.1 extends these results into a high-dimensional setting by using the structured norm $$\lVert{\cdot }\rVert$$ and its dual $$\lVert{\cdot }\rVert ^{\ast }$$ (e.g. the ℓ1 norm and its dual the $$\ell _{\infty }$$ norm), and requiring only RSC and RSM on g, without which we would not be able to obtain convergence guarantees in settings such as high-dimensional sparse regression or low-rank matrix estimation. 3.4. Initialization point and the gradient assumption In this result, we assume that the initialization point x0 is within some radius ρ of the target $$\widehat{x}$$, ensuring that $$2\phi \gamma _{x}(\mathscr{C})\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast }<\alpha$$ for all x in the initialization neighborhood, where α is the RSC (3.2) parameter. This type of assumption arises in much of the related literature; for example in the setting of optimization over low-rank matrices, as we will see in Section 5.1, we will require that $$\lVert{{X_{0} - \widehat{X}}}\rVert _{\mathsf{F}}\lesssim \sigma _{r}\left (\widehat{X}\right )$$, which is the same condition found in existing work such as that of Chen & Wainwright (2015). 
In fact, the following result demonstrates that the bound (3.5) is in a sense necessary: Lemma 3.2 For any constraint set $$\mathscr{C}$$ and any point $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$ with $$\gamma _{x}(\mathscr{C})>0$$, for any α, ε > 0 there exists an α-strongly convex g such that the gradient condition (3.5) is nearly satisfied at x, with $$2\gamma _{x}(\mathscr{C})\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast }\leqslant \alpha (1+ \varepsilon )$$, and x is a stationary point of the projected gradient descent algorithm (3.1) for all sufficiently small step sizes η > 0, but x does not minimize g over $$\mathscr{C}$$. That is, if projected gradient descent is initialized at the point x, then the algorithm will never leave this point, even though it is not optimal (i.e. x is not the global minimizer). We can see with a concrete example that the condition (3.5) may be even more critical than this lemma suggests: without this bound, we may find that projected gradient descent becomes trapped at a stationary point x which is not even a local minimum, as in the following example. Example 3.3 Let $$\mathscr{C}=\left \{X\in \mathbb{R}^{2\times 2}:\operatorname{rank}(X)\leqslant 1\right \}$$, let $$\mathsf{g}(X) = \frac{1}{2}\left \lVert{{X - \left ({1 \atop 0} \quad {0 \atop 1+\varepsilon}\right )}}\right \rVert _{\mathsf{F}}^{2}$$, and let $$X_{0} = \left ({1\atop 0} \quad {0 \atop 0}\right ).$$ Then trivially, we can see that g is α-strongly convex for α = 1, and that X0 is a stationary point of the projected gradient descent algorithm (3.1) for any step size $$\eta < \frac{1}{1+\varepsilon }$$. However, for any $$0<t<\sqrt{2\varepsilon }$$, setting $$X=\left ({1 \atop t} \quad {t \atop t^{2}}\right )\in \mathscr{C}$$, we can see that g(X) < g(X0)—that is, X0 is a stationary point, but not a local minimum.
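Example 3.3 is easy to verify numerically; the following sketch (our illustration, with ε = 0.1) checks both that X0 is a fixed point of the projected gradient step and that a nearby rank-one matrix achieves a strictly smaller objective.

```python
import numpy as np

eps = 0.1
M = np.array([[1.0, 0.0], [0.0, 1.0 + eps]])   # g(X) = 0.5*||X - M||_F^2
X0 = np.array([[1.0, 0.0], [0.0, 0.0]])

def g(X):
    return 0.5 * np.linalg.norm(X - M, 'fro') ** 2

def proj_rank1(Z):
    """Exact projection onto {rank <= 1}: keep the top singular pair."""
    U, s, Vt = np.linalg.svd(Z)
    return s[0] * np.outer(U[:, 0], Vt[0, :])

eta = 0.5                                       # any eta < 1/(1 + eps)
X1 = proj_rank1(X0 - eta * (X0 - M))            # grad g(X) = X - M
# X1 equals X0: the algorithm is stuck at X0 ...
t = 0.3                                         # any 0 < t < sqrt(2*eps)
X_better = np.array([[1.0, t], [t, t ** 2]])    # rank 1: outer([1, t], [1, t])
# ... yet g(X_better) < g(X0), so X0 is not a local minimum
```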
We will later calculate that $$\gamma _{X_{0}}(\mathscr{C}) = \frac{1}{2\sigma _{1}(X_{0})}=\frac{1}{2}$$ relative to the nuclear norm $$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\textrm{nuc}}$$, with norm compatibility constant ϕ = 1 (see Section 5.1 for this calculation). Comparing against the condition (3.5) on the gradient of g, since the dual norm to $$\lVert{{\cdot }}\rVert _{\textrm{nuc}}$$ is the matrix spectral norm $$\lVert{{\cdot }}\rVert _{\textrm{sp}}$$, we see that   $$2\phi\gamma_{X_{0}}(\mathscr{C}) \cdot\lVert{{\nabla\mathsf{g}(X_{0})}}\rVert_{\textrm{sp}} = 2\cdot1\cdot \frac{1}{2} \cdot \left\lVert{{-\left(\begin{array}{cc}0 & 0 \\ 0 & 1+\varepsilon\end{array}\right) }}\right\rVert_{\textrm{sp}}= 1+ \varepsilon =\alpha(1+\varepsilon).$$Therefore, when the initial gradient condition (3.5) is even slightly violated in this example (i.e. small ε > 0), the projected gradient descent algorithm can become trapped at a point that is not even a local minimum. While we might observe that in this particular example, the ‘bad’ stationary point X0 could be avoided by increasing the step size, in other settings if g has strong curvature in some directions (i.e. the smoothness parameter β is large), then we cannot afford a large step size η as it can cause the algorithm to fail to converge. 4. Convergence analysis using approximate projections In some settings, computing projections $$P_{{\mathscr{C}}}(x^\prime _{t+1})$$ at each step of the projected gradient descent algorithm may be prohibitively expensive; for instance in a low-rank matrix optimization problem of dimension d × d, this would generally involve taking the singular value decomposition of a dense d × d matrix at each step. In these cases we may sometimes have access to a fast but approximate computation of this projection, which may come at the cost of slower convergence. 
We now generalize to the idea of a family of approximate projections, which allows for operators that approximate projection to $$\mathscr{C}$$. Specifically, the approximations are carried out locally:   \begin{align}\begin{cases}x^\prime_{t+1} = x_{t} - \eta\nabla\mathsf{g}(x_{t}),\\[3pt] x_{t+1} = P_{{x_{t}}}\left(x^\prime_{t+1}\right)\!,\end{cases} \end{align} (4.1)where $$P_{{x_{t}}}$$ comes from a family of operators $$P_{{x}}:\mathbb{R}^{d}\rightarrow \mathscr{C}$$ indexed by $$x\in \mathscr{C}$$. Intuitively, we think of Px(z) as providing a very accurate approximation to $$P_{{\mathscr{C}}}(z)$$ locally for z near x, but it may distort the projection more as we move farther away. To allow for our convergence analysis to carry through even with these approximate projections, we assume that the family of operators {Px} satisfies a relaxed inner product condition:   \begin{align} &\text{For any }x\in\mathscr{C}\text{ and }z\in\mathbb{R}^{d}\text{ with }x,P_{{x}}(z)\in\mathbb{B}_{2}\!\left(\widehat{x},\rho\right), \\ &\quad\left\langle{\,\widehat{x}-P_{{x}}(z)},{z-P_{{x}}(z)}\right\rangle \leqslant \max\{ \underbrace{\left\lVert{z-P_{{x}}(z)}\right\rVert^{\ast}}_{\textrm{concavity term}}, \underbrace{\left\lVert{z-x}\right\rVert^{\ast}}_{\textrm{distortion term}}\}\cdot \big(\underbrace{\gamma^{\textrm{c}}\lVert{\,\widehat{x}-P_{{x}}(z)}{\rVert^{2}_{2}}}_{\textrm{concavity term}} + \underbrace{\gamma^{\textrm{d}}\lVert{\,\widehat{x}-x}{\rVert^{2}_{2}}}_{\textrm{distortion term}}\big).\nonumber \end{align} (4.2)Here the ‘concavity’ terms are analogous to the inner product bound in (2.5) for exact projection to the non-convex set $$\mathscr{C}$$, except with the projection $$P_{{\mathscr{C}}}$$ replaced by the operator Px; the ‘distortion’ terms mean that as we move farther away from x the bound becomes looser, as Px becomes a less accurate approximation to $$P_{{\mathscr{C}}}$$.
We now present a convergence guarantee nearly identical to the result for the exact projection case, Theorem 3.1. We first need to state a version of the norm compatibility condition, modified for approximate projections:   \begin{align} \left\lVert{z - P_{{x}}(z)}\right\rVert^{\ast} \leqslant \phi \lVert{z-x}\rVert^{\ast}\ \ \text{for all }x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)\text{ and }z\in\mathbb{R}^{d}. \end{align} (4.3)We also require a modified initialization condition,   \begin{align} 2\phi(\gamma^{\textrm{c}}+\gamma^{\textrm{d}})\max_{x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)}\lVert{\nabla\mathsf{g}(x)}\rVert^{\ast} \leqslant (1-c_{0})\alpha, \end{align} (4.4)and a modified version of local uniform continuity (compare to (2.7) for exact projections),   \begin{align} &\text{for any }x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)\text{ and any }\varepsilon > 0,\text{ there exists }\delta > 0\text{ such that,}\nonumber\\ &\quad\text{for any }z,w\in\mathbb{R}^{d}\text{ such that }P_{x}(z)\in\mathbb{B}_{2}(\widehat{x},\rho)\text{ and }2(\gamma^{\textrm{c}}+\gamma^{\textrm{d}})\lVert{z-P_{{x}}(z)}\rVert^{\ast}\leqslant 1-c_{0},\nonumber\\ &\qquad\text{if }\lVert{z-w}\rVert_{2}\leqslant\delta\text{ then }\lVert{P_{{x}}(z)-P_{{x}}(w)}\rVert_{2}\leqslant\varepsilon. \end{align} (4.5) Our result for this setting now follows. Theorem 4.1 Let $$\mathscr{C}\subset \mathbb{R}^{d}$$ be a constraint set and let g be a differentiable function, with minimizer $$\widehat{x}\in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$$. Let {Px} be a family of operators satisfying the inner product condition (4.2), the norm compatibility condition (4.3) and the local continuity condition (4.5) with parameters γc, γd, ϕ and radius ρ. Assume that g satisfies RSC (3.2) and restricted smoothness (3.3) with parameters α, β, εg for all $$x,\ y\in \mathbb{B}_{2}(\,\widehat{x},\ \rho )$$, and the initialization condition (4.4) for some c0 > 0.
If the initial point $$x_{0}\in \mathscr{C}$$ and the error level εg satisfy $$\lVert{x_{0} -\widehat{x}}{\rVert ^{2}_{2}}<\rho ^{2}$$ and $$\varepsilon _{\mathsf{g}}^{2}< \frac{c_{0} \rho ^{2}}{1.5}$$, then for each step $$t\geqslant 0$$ of the approximate projected gradient descent algorithm (4.1) with step size η = 1/β,   $$\lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} \leqslant \left(1 - c_{0}\cdot \frac{2\alpha}{\alpha+\beta}\right)^{t} \lVert{x_{0} - \widehat{x}}{\rVert^{2}_{2}} +\frac{1.5\varepsilon_{\mathsf{g}}^{2}}{c_{0}}.$$ This convergence rate is identical to that obtained in Theorem 3.1 for exact projections—the only differences lie in the assumptions. 4.1. Exact vs. approximate projections To compare the two settings we have considered, exact projections $$P_{{\mathscr{C}}}$$ vs. approximate projections Px, we focus on a local form of the inner product condition (4.2) for the family of approximate operators {Px}, rewritten to be analogous to the inner product condition (2.5) for exact projections. Suppose that $$\gamma ^{\text{c}}_{u}(\mathscr{C})$$ and $$\gamma ^{\text{d}}_{u}(\mathscr{C})$$ satisfy the property that   \begin{align} &\text{for any }x,\ y\in\mathscr{C}\text{ and any }z\in\mathbb{R}^{d}\text{, writing }u=P_{{x}}(z),\nonumber\\ &\quad\langle{y-u},\ {z-u}\rangle \leqslant \max\{ \underbrace{\lVert{z-u}\rVert^{\ast}}_{\text{concavity term}}, \underbrace{\lVert{z-x}\rVert^{\ast}}_{\text{distortion term}}\}\cdot \Big(\underbrace{\gamma^{\text{c}}_{u} (\mathscr{C})\lVert{y-u}{\rVert^{2}_{2}}}_{\text{concavity term}} + \underbrace{\gamma^{\text{d}}_{u}(\mathscr{C})\lVert{y-x}{\rVert^{2}_{2}}}_{\text{distortion term}}\Big), \end{align} (4.6)where $$u\mapsto \gamma ^{\text{c}}_{u}(\mathscr{C})$$ and $$u\mapsto \gamma ^{\text{d}}_{u}(\mathscr{C})$$ are upper semi-continuous maps. 
We now prove that the existence of a family of operators {Px} satisfying this general condition (4.6) is in fact equivalent to bounding the local concavity coefficients of $$\mathscr{C}$$. Lemma 4.2 Consider a constraint set $$\mathscr{C}\subset \mathbb{R}^{d}$$ and a norm $$\lVert{\cdot }\rVert$$ on $$\mathbb{R}^{d}$$ with dual norm $$\lVert{\cdot }\rVert ^{\ast }$$. If $$\mathscr{C}$$ has local concavity coefficients given by $$\gamma _{x}(\mathscr{C})$$ for all $$x\in \mathscr{C}$$, then by defining operators $$P_{{x}} =P_{{\mathscr{C}}}$$ for all $$x\in \mathscr{C}$$, the inner product condition (4.6) holds with $$\gamma ^{\text{c}}_{x}(\mathscr{C})=\gamma _{x}(\mathscr{C})$$ and $$\gamma ^{\text{d}}_{x}(\mathscr{C})=0$$. Conversely, if there is some family of operators $$\{P_{{x}}\}_{x\in \mathscr{C}}$$ satisfying the inner product condition (4.6), then the local concavity coefficients of $$\mathscr{C}$$ satisfy $$\gamma _{x}(\mathscr{C}) \leqslant \gamma ^{\text{c}}_{x}(\mathscr{C}) + \gamma ^{\text{d}}_{x}(\mathscr{C})$$, provided that $$x\mapsto \gamma ^{\text{c}}_{x}(\mathscr{C}),\ x\mapsto \gamma ^{\text{d}}_{x}(\mathscr{C})$$ are upper semi-continuous, and that Px also satisfies a local continuity assumption:   \begin{align} \text{If }\gamma^{\textrm{c}}_{x}(\mathscr{C})+\gamma^{\textrm{d}}_{x}(\mathscr{C})<\infty\text{ and }z_{t}\rightarrow x,\text{ then }P_{{x}}(z_{t}) \rightarrow x. \end{align} (4.7) For this reason, we see that generalizing from exact projection $$P_{{\mathscr{C}}}$$ to a family of operators {Px} does not expand the class of problems whose convergence is ensured by our theory; essentially, if using the approximate projection operators Px guarantees fast convergence, then the same would also be true using exact projection $$P_{{\mathscr{C}}}$$. However, there may be substantial computational gain in switching from exact to approximate projection, which comes with little or no cost in terms of convergence guarantees. 5.
Examples In this section we consider a range of non-convex constraints arising naturally in high-dimensional statistics, and show that these sets come equipped with well-behaved local concavity coefficients (thus allowing for fast convergence of gradient descent, for appropriate functions g). 5.1. Low-rank optimization Estimating a matrix with low-rank structure arises in a variety of problems in high-dimensional statistics and machine learning. A partial list includes principal component analysis (PCA), factor models, matrix completion and reduced rank regression. The past few years have seen extensive results on the specific problem of optimization over the space of low-rank matrices:   $$\min\left\{\mathsf{g}(X):X\in\mathbb{R}^{n\times m},\operatorname{rank}(X)\leqslant r\right\}\!,$$where in various settings g(X) may represent a least-squares loss from a linear matrix sensing problem, an objective function for the matrix completion problem or a more general function satisfying some type of restricted convexity assumption. In addition to extensive earlier work on convex relaxations of this problem via the nuclear norm and other penalties, more recently this problem has been studied using the exact rank-r constraint. The recent literature has generally treated the rank-constrained problem in one of two ways. First, the iterative hard thresholding method (also discussed earlier in Section 3.3) proceeds by taking gradient descent in the space of n × m matrices, then at each step projecting to the nearest rank-r matrix in order to enforce a rank constraint on X. This amounts to optimizing the function g(X) over the non-convex constraint space of rank-r matrices. Convergence results for this setting have been proved in the studies by Meka et al. (2009) and Jain et al. (2014). However, in high dimensions, a computational drawback of this method is the need to take a singular value decomposition of a (potentially dense) n × m matrix at each step. 
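A minimal sketch of this first approach, projected gradient descent with a truncated-SVD projection (our illustration, not code from the cited works); the toy objective g(X) = ½‖X − Y‖²_F is a stand-in, chosen so the constrained minimizer is the best rank-r approximation of Y.

```python
import numpy as np

def proj_rank_r(Z, r):
    """Project onto the nearest rank-r matrix (truncated SVD)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def iht_low_rank(grad_g, X0, r, eta, n_iters):
    """Gradient step on X, then project back onto the rank-r constraint set."""
    X = X0
    for _ in range(n_iters):
        X = proj_rank_r(X - eta * grad_g(X), r)
    return X

# Toy stand-in: g(X) = 0.5*||X - Y||_F^2, so grad g(X) = X - Y and (with
# eta = 1 = 1/beta) a single iteration already lands on the minimizer.
rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 6))
X_hat = iht_low_rank(lambda X: X - Y, np.zeros((6, 6)), r=2, eta=1.0, n_iters=1)
```

The SVD of a dense 6 × 6 matrix is cheap, but in the d × d setting described above this per-iteration SVD is exactly the cost that the approximate projections of Section 4 are designed to avoid.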
Alternately, one can consider the factorized approach, which reparametrizes the problem by taking a low-rank factorization, X = AB⊤ where $$A\in \mathbb{R}^{n\times r}$$ and $$B\in \mathbb{R}^{m\times r}$$, and pursuing alternating minimization or alternating gradient descent on the factors A and B. Recent results in this line of work include the studies by Gunasekar et al. (2013), Jain et al. (2013), Sun & Luo (2016), Tu et al. (2015), Zhao et al. (2015), Zhu et al. (2017) and many others. This reformulation of the problem now consists of a highly non-convex objective function g(AB⊤) optimized over a generally convex space of factor matrices $$(A,B)\in \mathbb{R}^{n\times r}\times \mathbb{R}^{m\times r}$$, via alternating gradient descent or alternating minimization over the factors A and B. In the special case where X is positive semidefinite, we can instead optimize g(AA⊤) via gradient descent on $$A\in \mathbb{R}^{n\times r}$$, which is again a non-convex objective function being minimized over a convex space, and has also been extensively studied, e.g. by Candès et al. (2015), Chen & Wainwright (2015) and Zheng & Lafferty (2015), among others. For both of these cases, the analysis of the optimization problem is complicated by the issue of identifiability, where the factor(s) can only be identified up to rotation. In this section, we will study the set of rank-constrained matrices   $$\mathscr{C}=\left\{X\in\mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right\}$$to determine how our general framework of local concavity applies to this specific low rank setting. To avoid triviality, we assume $$r<\min \{n,m\}$$. Writing $$\sigma _{1}(X)\geqslant \sigma _{2}(X)\geqslant \dots$$ to denote the sorted singular values of any matrix X, we compute the curvature condition and norm compatibility condition of $$\mathscr{C}$$: Lemma 5.1 Let $$\mathscr{C}=\{X\in \mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\}$$. 
Then $$\mathscr{C}$$ has local concavity coefficients given by $$\gamma _{X}(\mathscr{C}) = \frac{1}{2\sigma _{r}(X)}$$ for all $$X\in \mathscr{C}$$, and satisfies the norm compatibility condition (3.4) with parameter ϕ = 1, with respect to norms $$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\text{nuc}}$$ and $$\lVert{\cdot }\rVert ^{\ast }=\lVert{{\cdot }}\rVert _{\text{sp}}$$. Thus, as long as the objective function g satisfies the appropriate conditions, we can expect projected gradient descent over the space of rank-r matrices to converge well when we initialize at some matrix X0 that is within a distance smaller than $$\sigma _{r}\left (\widehat{X}\right )$$ from the target matrix $$\widehat{X}$$, so that $$\gamma _{X}(\mathscr{C})$$ is bounded over all Xs within this radius. This is comparable to results in the factorized setting, for instance in the study by Chen & Wainwright (2015, Theorem 1), where the initialization point is similarly assumed to be within a radius that is smaller than $$\sigma _{r}(\widehat{X})$$ of the true solution $$\widehat{X}$$. Approximate projections The projection to $$\mathscr{C}$$, $$P_{{\mathscr{C}}}$$, can be obtained using a singular value decomposition (SVD), where only the top r singular values and singular vectors of the matrix are retained to compute the best rank r approximation. Nonetheless, it can be expensive to compute the SVD of a dense n × m matrix. We next propose an approximate projection operator for this space to avoid the cost of an SVD on an n × m matrix at each iteration of projected gradient descent. 
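Both quantities in Lemma 5.1 can be sanity-checked numerically. The sketch below (our illustration, with arbitrary dimensions and seed) evaluates $$\gamma_X(\mathscr{C}) = \frac{1}{2\sigma_r(X)}$$ at a generic point of $$\mathscr{C}$$, and checks the norm compatibility claim ϕ = 1: the truncated SVD is optimal in every unitarily invariant norm, so no point of $$\mathscr{C}$$ is nuclear-norm closer to Z than $$P_{\mathscr{C}}(Z)$$.

```python
import numpy as np

def proj_rank_r(Z, r):
    """l2 (Frobenius) projection onto {rank <= r}: truncated SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(1)
r = 2
X = proj_rank_r(rng.standard_normal((5, 5)), r)         # a generic point of C
gamma_X = 1.0 / (2.0 * np.linalg.svd(X, compute_uv=False)[r - 1])

# Norm compatibility (3.4) with phi = 1: compare the nuclear-norm distance
# from Z to its l2 projection against the distance to a candidate X in C.
Z = rng.standard_normal((5, 5))
lhs = np.linalg.norm(Z - proj_rank_r(Z, r), 'nuc')
rhs = np.linalg.norm(Z - X, 'nuc')                      # one candidate from C
```

Here `lhs` equals the sum of the trailing singular values of Z, which lower-bounds the nuclear distance to any rank-r matrix.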
To construct PX, we first define some notation: for any rank-r matrix X, let TX be the tangent space of low-rank matrices at X, given by  \begin{align}T_{X} =&\ \Big\{U A^{\top} +\, BV^{\top} \ : \ U\in\mathbb{R}^{n\times r}, V\in\mathbb{R}^{m\times r}\text{ are orthonormal bases for the column and row span of }X;\nonumber\\ &\qquad A\in\mathbb{R}^{m\times r}, B\in\mathbb{R}^{n\times r}\text{ are any matrices}\Big\}.\end{align} (5.1)(This tangent space has frequently been studied in the context of nuclear norm minimization, see for instance the study by Candès & Recht, 2012.) We then define PX by first projecting to TX, then projecting to the rank-r constraint, that is,   \begin{align}P_{{X}}(Z) = P_{{\mathscr{C}}}\left(P_{{T_{X}}}(Z)\right)\!. \end{align} (5.2) While this approximate projection will introduce some small error into the update steps, thus slowing convergence somewhat, it comes with a potentially large benefit: the SVD computations are always carried out on low-rank matrices. Specifically, defining U, V to be orthonormal bases for column and row spans of X as before, for any $$Z\in \mathbb{R}^{n\times m}$$ we can write   $$P_{{T_{X}}}(Z) = UU^{\top} Z + Z VV^{\top} - UU^{\top} Z VV^{\top} = \underbrace{\left(\begin{array}{cc} U & \left(\mathbf{I}_{n} - UU^{\top}\right) Z V\end{array}\right)}_{n\times 2r}\cdot \underbrace{\left(\begin{array}{cc} Z^{\top} U & V\end{array}\right)}_{m\times 2r}{}^{\top},$$which means that calculating $$P_{{\mathscr{C}}}\left (P_{{T_{X}}}(Z)\right )$$ can be substantially faster than the exact projection $$P_{{\mathscr{C}}}(Z)$$ when the rank bound r is small while dimensions n, m are large. Once this projection is computed, we now have new row and column span matrices U, V ready to use for the next iteration’s approximate projection step. Our next result shows that this family of operators satisfies the conditions needed for our convergence results to be applied.
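In code, the construction (5.2) can be sketched as follows (an illustration, not the authors' implementation): the tangent-space projection is kept in the factored form displayed above, so the only dense SVD taken is of a small 2r × 2r core.

```python
import numpy as np

def proj_rank_r(Z, r):
    """Exact projection P_C: truncated SVD of the full matrix."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def approx_proj(Z, U, V, r):
    """P_X(Z) = P_C(P_{T_X}(Z)) as in (5.2), where U, V are orthonormal bases
    for the column and row span of X. P_{T_X}(Z) = L @ R.T is kept in factored
    form; its SVD is recovered from the QR factors of L and R."""
    L = np.hstack([U, Z @ V - U @ (U.T @ Z @ V)])   # n x 2r
    R = np.hstack([Z.T @ U, V])                     # m x 2r
    Ql, Rl = np.linalg.qr(L)
    Qr, Rr = np.linalg.qr(R)
    Us, s, Vts = np.linalg.svd(Rl @ Rr.T)           # only a 2r x 2r SVD
    U_new = Ql @ Us[:, :r]
    V_new = Qr @ Vts[:r, :].T
    return (U_new * s[:r]) @ V_new.T

# Demo (illustrative): rank-2 matrix with sigma_r = 2, small perturbation
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((6, 2)))
V0, _ = np.linalg.qr(rng.standard_normal((6, 2)))
X = U0 @ np.diag([3.0, 2.0]) @ V0.T
Z = X + 1e-3 * rng.standard_normal((6, 6))
P_exact = proj_rank_r(Z, 2)
P_apx = approx_proj(Z, U0, V0, 2)
```

Since Z here lies close to X, the approximate and exact projections nearly coincide, consistent with the intuition that Px(z) is accurate for z near x.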
Lemma 5.2 Let $$\mathscr{C}=\{X\in \mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\}$$, and define the family of operators PX as in (5.2). Let $$\rho =\frac{\sigma _{r}(\widehat{X})}{4}$$. Then the inner product condition (4.2), the norm compatibility condition (4.3) and the local continuity condition (4.5) are satisfied with $$\gamma ^{\text{c}}=\gamma ^{\text{d}}=\frac{6}{\sigma _{r}\left (\widehat{X}\right )}$$ and ϕ = 3, with respect to norms $$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\text{nuc}}$$ and $$\lVert{\cdot }\rVert ^{\ast }=\lVert{{\cdot }}\rVert _{\text{sp}}$$. We see that up to a constant, this matches the results in Lemma 5.1 for the exact projection $$P_{{\mathscr{C}}}$$, and so we can expect roughly comparable convergence behavior with these approximate projections, while at the same time gaining computational efficiency by avoiding large SVDs. We will compare this approximate projection method to exact projection and factored projection empirically in Section 6. 5.2. Sparsity In many applications in high-dimensional statistics, the signal of interest is believed to be sparse or approximately sparse. Using an ℓ1 penalty or constraint serves as a convex relaxation to the sparsity constraint,   $$\mathop{\arg\min}\limits_{x} \big\{\mathsf{g}(x) + \lambda\lVert{x}\rVert_{1}\} \textrm{ or }\mathop{\arg\min}\limits_{x}\{\mathsf{g}(x) : \lVert{x}\rVert_{1}\leqslant c\big\}$$(i.e. the Lasso method in the study by Tibshirani (1996), in the case of a linear regression problem). The convex ℓ1 norm penalty shrinks many coefficients to zero, but also leads to undesirable shrinkage bias on the large coefficients of x. Optimization with hard sparsity constraints (e.g. the iterative hard thresholding method in the study by Jain et al., 2014), while sometimes prone to local minima, is known to be successful in many settings, and provides an alternative to convex relaxations (like the ℓ1 penalty) which induce shrinkage bias on large coefficients.
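For the hard sparsity constraint, the exact ℓ2 projection is simple hard thresholding: keep the k largest-magnitude entries and zero out the rest. A minimal sketch (our illustration):

```python
import numpy as np

def hard_threshold(z, k):
    """l2 projection onto C = {x : |support(x)| <= k}: zero out all but the
    k largest-magnitude entries (any tie-breaking gives a valid projection)."""
    x = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-k:]
    x[keep] = z[keep]
    return x

z = np.array([0.3, -2.0, 0.1, 1.5])
x_proj = hard_threshold(z, 2)       # keeps -2.0 and 1.5
```

Unlike soft thresholding (the proximal operator of the ℓ1 norm), the surviving entries are kept at full magnitude, which is the absence of shrinkage bias discussed above.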
The shrinkage problem can also be alleviated by turning to non-convex regularization functions, including the ℓq ‘norm’ for q < 1, $$\lVert{x}{\rVert ^{q}_{q}} = \sum _{i} |x_{i}|^{q}$$, whose convergence properties are studied by e.g. Knight & Fu (2000) and Chartrand (2007), as well as the SCAD penalty by Fan & Li (2001), the MCP penalty by Zhang (2010) and the adaptive Lasso/reweighted ℓ1 method by Candès et al. (2008), which is related to a non-convex ‘log-ℓ1’ penalty of the form   \begin{align} \textrm{log L1}_{\nu}(x) = \sum_{i} \nu \log\left(1+ |x_{i}|/\nu\right)\!. \end{align} (5.3) Smaller values of ν > 0 correspond to greater non-convexity, while setting $$\nu =\infty$$ recovers the ℓ1 norm. Loh & Wainwright (2013) study the convergence properties of a gradient descent algorithm for the penalized optimization problem $$\operatorname{arg\,min}_{x}\{\mathsf{g}(x) + \lambda \textrm{Pen}(x)\}$$, where the regularizer takes the form   $$\textrm{Pen}(x) = \sum_{i} \mathsf{p}(|x_{i}|),$$where p(t) is non-decreasing and concave over $$t\geqslant 0$$, but its concavity is bounded and it has a finite derivative as t ↘ 0. Essentially, this means that Pen(x) behaves like a non-convex version of the ℓ1 norm, shrinking small coefficients to zero, but avoiding heavy shrinkage on large coefficients; the SCAD, MCP and log-ℓ1 penalties are all examples. (The ℓq norm for q < 1 does not fit these assumptions, however, due to its infinite derivative as coordinates $$x_{i}\rightarrow 0$$.) Proximal gradient descent with a non-convex penalty such as SCAD is also studied by Lewis & Wright (2016, Section 2.5) in the context of prox-regular functions. In this work, we consider the constrained version of this optimization problem, namely $$\operatorname{arg\,min}_{x}\{\mathsf{g}(x) :\text{Pen}(x)\leqslant c\}$$.
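As a quick numerical illustration of (5.3) (a sketch, with our own function name): since $$\nu \log (1+t/\nu )\leqslant t$$ with equality in the limit $$\nu \rightarrow \infty$$, the log-ℓ1 penalty always sits below the ℓ1 norm and penalizes large coefficients strictly less.

```python
import numpy as np

def log_l1(x, nu):
    """The log-l1 penalty (5.3): sum_i nu * log(1 + |x_i| / nu)."""
    return float(np.sum(nu * np.log1p(np.abs(x) / nu)))

x = np.array([3.0, -1.0, 0.5])
# Very large nu essentially recovers the l1 norm ...
assert abs(log_l1(x, 1e9) - np.abs(x).sum()) < 1e-6
# ... while finite nu penalizes large coefficients strictly less than l1 does.
assert log_l1(x, 0.5) < np.abs(x).sum()
```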
Non-convex regularizers The general non-convex sparsity-inducing penalties studied by Loh & Wainwright (2013) are required to satisfy the following conditions (changing their notation slightly):   \begin{align}\textrm{Pen}(x) = \sum_{i} \mathsf{p}(|x_{i}|)\textrm{ where } \begin{cases} \mathsf{p}(0)= 0\text{ and }\mathsf{p}\text{ is non-decreasing},\\ t\mapsto \mathsf{p}(t)/t\text{ is non-increasing (i.e.~}\mathsf{p}\text{ is concave)},\\ t\mapsto \mathsf{p}(t) + \frac{\mu}{2}t^{2} \textrm{ is convex},\\ \mathsf{p}\text{ is differentiable on }t>0,\text{ with }\lim_{t\searrow 0}\mathsf{p}^\prime(t) = 1.\end{cases} \end{align} (5.4) The following result calculates the local concavity coefficients for $$\mathscr{C} = \{x:\text{Pen}(x)\leqslant c\}$$. Lemma 5.3 Suppose that $$\text{Pen}(x) = \sum _{i} \mathsf{p}(|x_{i}|)$$ where p satisfies conditions (5.4). Then   $$\begin{cases} \gamma_{x}(\mathscr{C}) \leqslant \frac{\mu/2}{\mathsf{p}^\prime(x_{\textrm{min}})},&\textrm{ if } \textrm{Pen}(x)=c,\\[4pt]\gamma_{x}(\mathscr{C}) =0,&\textrm{ if }\textrm{Pen}(x)<c,\end{cases}$$with respect to the norm $$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert _{1}$$ and its dual $$\lVert{\cdot }\rVert ^{\ast }=\lVert{\cdot }\rVert _{\infty }$$, where for any $$x\in \mathbb{R}^{d}\backslash \{0\}$$ we define xmin to be the magnitude of its smallest non-zero entry. As an example, consider the log-ℓ1 penalty (5.3), so that our constraint set is   $$\mathscr{C}=\left\{x\in\mathbb{R}^{d}:\sum_{i} \nu \log(1+ |x_{i}|/\nu) \leqslant c\right\}.$$In this case we have $$\mathsf{p}(t) = \nu \log (1+t/\nu )$$, which satisfies Loh & Wainwright’s (2013) conditions (5.4) with $$\mu = \frac{1}{\nu }$$, and we can calculate $$\mathsf{p}^\prime (t)=\frac{1}{1+t/\nu }$$. Therefore, the local concavity coefficients for points x on the boundary of $$\mathscr{C}$$ are bounded as $$\gamma _{x}(\mathscr{C})\leqslant \frac{1+x_{\text{min}}/\nu }{2\nu }$$.
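For completeness, the claim that the log-ℓ1 penalty satisfies (5.4) with $$\mu =\frac{1}{\nu }$$ can be verified directly (a routine calculation, included here as a check):   $$\mathsf{p}^\prime (t)=\frac{1}{1+t/\nu },\qquad \mathsf{p}^{\prime \prime }(t)=-\frac{1}{\nu (1+t/\nu )^{2}}\in \left [-\frac{1}{\nu },\,0\right ],$$so that $$t\mapsto \mathsf{p}(t)+\frac{1}{2\nu }t^{2}$$ has second derivative $$\frac{1}{\nu }+\mathsf{p}^{\prime \prime }(t)\geqslant 0$$ and is therefore convex, while $$\mathsf{p}(0)=0$$, monotonicity and concavity of $$\mathsf{p}$$ on $$t\geqslant 0$$, and $$\lim _{t\searrow 0}\mathsf{p}^\prime (t)=1$$ hold by inspection.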
In particular, taking a maximum over all $$x\in \mathscr{C}$$, we obtain $$\gamma (\mathscr{C})\leqslant \frac{e^{c/\nu }}{2\nu }$$. We can also check the norm compatibility condition: Lemma 5.4 If $$\mathscr{C} = \{x\in \mathbb{R}^{d}:\sum _{i} \mathsf{p}(|x_{i}|)\leqslant c\}$$ where c > 0 and $$\mathsf{p}:[0,\infty )\rightarrow [0,\infty )$$ satisfies the conditions (5.4), then the norm compatibility condition (3.4) is satisfied with $$\phi = \frac{1}{\mathsf{p}^\prime \left (\mathsf{p}^{-1}(c)\right )}$$. With these results in place, we would therefore expect good convergence for projected gradient descent algorithms over the non-convex sparsity constraint $$\text{Pen}(x)\leqslant c$$, as long as the objective function g and the initialization point satisfy the appropriate assumptions. 5.2.1. Constraints vs. penalties Loh & Wainwright (2013) study the non-convex optimization problem $$\min _{x}\!\left \{\mathsf{g}(x) + \lambda \text{Pen}(x)\right \}$$ with no assumptions on the initial point x0. Instead, they assume that the concavity in Pen is outweighed by the (restricted) convexity in g. In our work on the constrained form of this problem, $$\min \{\mathsf{g}(x):\text{Pen}(x)\leqslant c\}$$, we instead rely heavily on initialization conditions, namely (3.5), for projected gradient descent to succeed. While our result, Lemma 3.1, gives some justification for the necessity of the initialization conditions for general constraint sets $$\mathscr{C}$$, here we consider the specific setting of non-convex sparsity penalties, and offer a more direct comparison of projected and proximal gradient descent (solving the constrained and penalized forms of the problem, respectively).
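As a sanity check on the two bounds derived above for the log-ℓ1 constraint (a numerical sketch with our own function names), we can verify that the pointwise bound $$\gamma _{x}(\mathscr{C})\leqslant \frac{1+x_{\text{min}}/\nu }{2\nu }$$ never exceeds the worst-case bound $$\frac{e^{c/\nu }}{2\nu }$$ when x is placed on the boundary Pen(x) = c:

```python
import numpy as np

def pen_log_l1(x, nu):
    """Pen(x) = sum_i nu * log(1 + |x_i| / nu), the log-l1 penalty (5.3)."""
    return float(np.sum(nu * np.log1p(np.abs(x) / nu)))

def gamma_pointwise(x, nu):
    """Lemma 5.3 bound (mu/2) / p'(x_min) with mu = 1/nu, p'(t) = 1/(1 + t/nu)."""
    x_min = np.min(np.abs(x[x != 0]))
    return (1 + x_min / nu) / (2 * nu)

def gamma_global(c, nu):
    """Maximum of the pointwise bound over the boundary Pen(x) = c."""
    return np.exp(c / nu) / (2 * nu)

rng = np.random.default_rng(0)
nu = 0.5
for _ in range(100):
    x = rng.standard_normal(5)
    c = pen_log_l1(x, nu)          # place x on the boundary of {Pen <= c}
    assert gamma_pointwise(x, nu) <= gamma_global(c, nu) + 1e-12
```

The inequality holds because $$c\geqslant \nu \log (1+x_{\text{min}}/\nu )$$ for any x on the boundary, so $$1+x_{\text{min}}/\nu \leqslant e^{c/\nu }$$.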
Lemma 3.1 suggests that the condition   \begin{align}2\gamma_{x}(\mathscr{C})\lVert{\nabla\mathsf{g}(x)}\rVert^{\ast}<\alpha, \end{align} (5.5) where α is the RSC parameter for the loss g, is, to some extent, necessary to ensure the success of projected gradient descent; otherwise projected gradient descent may have x as a ‘bad’ stationary point. How does this condition relate to the proximal gradient descent algorithm for the penalized form? Specifically, suppose that x is a stationary point of proximal gradient descent for some step size η > 0, namely,   $$x = \mathop{\arg\min}\limits_{y} \left\{\frac{1}{2}\lVert{y - \left(x - \eta\nabla\mathsf{g}(x)\right)}{\rVert^{2}_{2}} + \eta\lambda\textrm{Pen}(y)\right\}.$$By first-order optimality conditions, we must therefore have   $$0 \in\partial \left( \frac{1}{2}\lVert{y - (x - \eta\nabla\mathsf{g}(x))}{\rVert^{2}_{2}} + \eta\lambda\textrm{Pen}(y)\right)\bigg\vert_{y=x},$$where the subdifferential ∂Pen(y) is defined coordinatewise,   $$\partial\mathsf{p}(|y_{i}|) = \begin{cases}\mathsf{p}^\prime(|y_{i}|)\cdot\operatorname{sign}(y_{i}),&y_{i}\neq 0,\\[3pt] [-1,1],&y_{i}=0.\end{cases}$$In other words,   $$x - (x - \eta\nabla\mathsf{g}(x)) + \eta\lambda \partial\textrm{Pen}(x) \ni 0,$$and so for all i, (∇g(x))i ∈−λ∂p(|xi|) ⊂ [−λ, λ]. Therefore, applying the bound on the concavity coefficients given in Lemma 5.3, we have   $$2\gamma_{x}(\mathscr{C})\lVert{\nabla\mathsf{g}(x)}\rVert_{\infty}\leqslant \frac{\mu\lambda}{\mathsf{p}^\prime(x_{\textrm{min}})}$$for any stationary point x of the proximal gradient descent algorithm. The assumptions for convergence of proximal gradient descent in the study by Loh & Wainwright (2013) require that $$\mu \lambda \leqslant \text{(constant)}\cdot \alpha$$.
Therefore, up to some constant, we see that the condition (5.5) is automatically satisfied by any stationary point of proximal gradient descent, while for projected gradient descent this condition can fail, allowing for ‘bad’ stationary points. Of course, if λ is too large, then proximal gradient descent can also fail to find the global minimum; however, if λ is chosen appropriately then no initialization condition is needed, while for the constrained form, an initialization condition is apparently necessary regardless of the constraint value c. To some extent, this suggests that projected and proximal gradient descent may have fundamentally different behavior in the non-convex setting, contradicting the notion that working with a constraint or a regularizer should lead to the same results (up to issues of tuning). 5.3. Spheres, orthogonal groups and orthonormal matrices We next consider a constraint set given by $$\mathscr{C} = \{X\in \mathbb{R}^{n\times r}:X^{\top } X =\mathbf{I}_{r}\}$$, the space of all orthonormal n × r matrices. We can also consider a related set, $$\mathscr{C} = \{X\in \mathbb{R}^{n\times n}:X^{2}=X,\ \operatorname{rank}(X)=r\}$$, the set of all rank-r projection matrices (with the orthogonal group as a special case when r = n). These examples have a different flavor from the low-rank and sparse settings considered above; while the previous examples effectively bound the complexity of the signal (by finding latent sparse or low-rank structure), here we are instead enforcing special properties, namely orthogonality and/or unit norm. Optimization problems over these types of constraint sets arise, for instance, in PCA-type problems where we would like to find the best low-rank representation of a data set. First, we consider n × r orthonormal matrices: Lemma 5.5 Let $$\mathscr{C} = \{X\in \mathbb{R}^{n\times r}: X^{\top } X=\mathbf{I}_{r}\}$$, the space of orthonormal n × r matrices.
Then $$\mathscr{C}$$ has local concavity coefficients $$\gamma _{X}(\mathscr{C})= \frac{1}{2}$$ with respect to $$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\text{nuc}}$$ and dual norm $$\lVert{\cdot }\rVert ^{\ast } = \lVert{{\cdot }}\rVert _{\text{sp}}$$. The norm compatibility condition (3.4) holds with ϕ = 1. Observe that the sphere $$\mathbb{S}^{d-1}=\{x\in \mathbb{R}^{d}:\lVert{x}\rVert _{2}=1\}$$ is a special case, obtained when r = 1. Next, in many problems we may aim to find a rank-r subspace that is optimal in some regard, but the exact choice of basis for this subspace does not matter; that is, an orthonormal basis $$X\in \mathbb{R}^{n\times r}$$ is identifiable only up to a rotation of its columns. In this case, we can instead choose to work with rank-r projection matrices: Lemma 5.6 Let $$\mathscr{C} = \{X\in \mathbb{R}^{n\times n}: \operatorname{rank}(X)=r, X\succeq 0, X^{2}=X\}$$, the space of rank-r projection matrices. Then $$\mathscr{C}$$ has local concavity coefficients $$\gamma _{X}(\mathscr{C})\leqslant 2$$ with respect to $$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\text{nuc}}$$ and $$\lVert{\cdot }\rVert ^{\ast } = \lVert{{\cdot }}\rVert _{\text{sp}}$$. A special case is the setting r = n, when $$\mathscr{C}$$ is the orthogonal group. 6. Experiments We next test the empirical performance of low-rank optimization methods on a small matrix completion problem. We generate an orthonormal matrix $$U^{\star }\in \mathbb{R}^{100\times 5}$$ at random, to produce a rank-5 positive semidefinite signal U⋆U⋆⊤. We choose a subset of observed entries Ω ⊂ [100] × [100] by giving each entry (i, j) with $$i\leqslant j$$ a 20% chance of being observed, then symmetrizing across the diagonal. Our observations are then given by either Yij = (U⋆U⋆⊤)ij in the noiseless setting, or Yij = (U⋆U⋆⊤)ij + Zij in the noisy setting, where Zij ∼ N(0, σ2) for entries $$i\leqslant j$$, and we again symmetrize across the diagonal.
We set $$\sigma ^{2} = 0.2\cdot \frac{\lVert{{U^{\star } U^{\star }{}^{\top }}}\rVert _{\mathsf{F}}^{2}}{100^{2}}$$ for a signal-to-noise ratio of 5. The loss function is   $$\mathsf{g}(X) = \frac{1}{2}\sum_{(i,j)\in\Omega}\left(Y_{ij} - X_{ij}\right)^{2},$$and the constraint set is $$\mathscr{C}=\{X\in \mathbb{R}^{100\times 100}:\operatorname{rank}(X)\leqslant 5\}$$. We initialize at the matrix $$X_{0} = P_{{\mathscr{C}}}(Y_{\varOmega })$$, where YΩ is the zero-filled matrix of observations, i.e. (YΩ)ij = Yij ⋅1(i, j)∈Ω. We then compare three methods: Projected gradient descent, where the projection $$P_{{\mathscr{C}}}$$ to the rank constraint is carried out via an SVD at each step (i.e. the iterative hard thresholding method studied by Jain et al. (2014)). Approximate projected gradient descent, where the projection $$P_{{\mathscr{C}}}$$ is replaced by $$P_{{X}}(Z)= P_{{\mathscr{C}}}\left (P_{{T_{X}}}(Z)\right )$$ as in (5.2), a more efficient computation. Factored gradient descent (as studied by Chen & Wainwright, 2015; Zheng & Lafferty, 2015 for the positive semidefinite setting), where we define $$\widetilde{\mathsf{g}}(U) = \mathsf{g}(UU^{\top })$$ over the variable $$U\in \mathbb{R}^{100\times 5}$$, and perform (unconstrained) gradient descent on the variable U with respect to $$\widetilde{\mathsf{g}}$$. For each of the three methods, we run the method for 50 iterations with a constant step size η, and repeat for 50 trials. We then choose the step size η that achieves the lowest median loss at the last iteration, across all trials. Figure 2 displays the median loss, and the first and third quartiles, across the 50 trials, with respect to iteration number. We see that for both the noiseless and noisy settings, projected gradient descent achieves the lowest loss, with a very fast decay (while, of course, being the most computationally expensive method).
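The projected gradient descent baseline above can be sketched in a few lines (a minimal reconstruction of the noiseless experiment under our own variable names; the paper's step-size tuning over 50 trials is omitted here and a unit step is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 5

# Rank-5 PSD signal U* U*^T, with U* a random orthonormal 100 x 5 matrix
U_star, _ = np.linalg.qr(rng.standard_normal((n, r)))
signal = U_star @ U_star.T

# Each entry (i, j) with i <= j observed with probability 0.2, then symmetrized
upper = np.triu(rng.random((n, n)) < 0.2)
mask = upper | upper.T
Y = np.where(mask, signal, 0.0)      # zero-filled observations (noiseless case)

def project_rank(Z, r):
    """Exact projection onto {rank <= r} via a truncated SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def loss(X):
    return 0.5 * np.sum(mask * (Y - X) ** 2)

X = project_rank(Y, r)               # initialization X_0 = P_C(Y_Omega)
eta = 1.0                            # illustrative constant step size
for _ in range(50):
    grad = mask * (X - Y)            # gradient of g at X
    X = project_rank(X - eta * grad, r)
```

With this step size, each iteration overwrites the observed entries with Y and re-projects, so by the optimality of the projection the loss is non-increasing; in the noiseless setting one expects the iterates to approach the rank-5 signal.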
Approximate projected gradient descent and factored gradient descent show an interesting comparison, where for early iterations (∼5–10) the factored form gives a lower loss, while afterwards the approximate version performs better. By iteration 50, all three methods give nearly identical loss. It is likely that different settings may produce a different comparison between these methods. Fig. 2. Results of the simulated matrix completion experiment comparing projected gradient descent on $$\mathscr{C}=\{X\in \mathbb{R}^{100\times 100}:\operatorname{rank}(X)\leqslant 5\}$$, gradient descent with approximate projections to the same set and factored gradient descent on the variable $$U\in \mathbb{R}^{100\times 5}$$ (which relates to the low-rank matrix X as X = UU⊤). For each method, the line and band show the median and quartiles of the loss g(X) over 50 trials. Noting the rapid decrease in loss for the projected gradient descent method, we next ask whether the strengths of this method can be combined with the computational efficiency of the other two methods. As a second experiment, we repeat the steps above with the modification that, for all three methods being compared: Initialization step: for iteration 1, the update step is carried out with (exact) projected gradient descent. Then, for iterations 2, …, 50, the update step is carried out with the respective method. The results are displayed in Fig. 3.
Remarkably, the three methods now perform nearly identically; starting with a single iteration of the more expensive projected gradient descent method is sufficient to allow the inexpensive methods to perform nearly as well. Fig. 3. Same as in Fig. 2, except that for all three methods, as an initialization step, the first iteration performs one step of exact projected gradient descent on $$\mathscr{C}$$; iterations $$2,3,\dots ,50$$ are then performed via the three different methods. 7. Discussion In this paper we have developed the local concavity coefficients, a measure of the extent to which a constraint set $$\mathscr{C}$$ violates convexity and may therefore be challenging for first-order optimization methods. These coefficients, related to the notion of prox-regularity in the analysis literature, are defined through four different measures of concavity that we then prove to be equivalent, connecting the geometric curvature of $$\mathscr{C}$$ with its behavior with respect to projections and first-order optimality conditions. This reveals a deep connection between geometry and optimization, and allows us to apply our analysis of projected gradient descent to a range of examples such as low-rank estimation problems. Many open questions remain in this area. As discussed earlier, the extent to which constrained and penalized regularization (i.e. projected or proximal gradient descent) differ is not yet understood for non-convex regularizers.
In sparse estimation problems, the non-convex ℓq ‘norm’ is a popular regularizer that is empirically very successful (and has been studied theoretically), but it is not clear whether an ℓq norm constraint can fit into the framework of the concavity coefficients (i.e. whether $$\gamma _{x}(\mathscr{C})$$ is finite on the ℓq norm ball). For a low-rank estimation problem, research on factored gradient descent methods, which optimize over the function U ↦ g(UU⊤) or (U, V) ↦ g(UV⊤), has developed tools to work around the identifiability issue where factors are identifiable only up to rotation. Is there a more general way in which non-identifiability, which can be thought of as a lack of convexity in certain directions, can be accounted for in the theory developed here? Turning to our general results for convergence on an arbitrary non-convex constraint set $$\mathscr{C}$$, it would be interesting to determine whether we can obtain a slower convergence rate assuming g is (restricted) Lipschitz and satisfies RSC, but without an RSM result—standard results in the convex setting for this case suggest that we would want to take step size ηt ∝ 1/t and could expect a convergence rate of $$\lVert{x_{t} -\widehat{x}}{\rVert ^{2}_{2}}\sim{1}/{t}$$. Finally, the strong initialization condition to ensure convergence to a global minimizer suggests that there may be some settings in which we can obtain weaker results—since our examples show that even convergence to a local minimum cannot be assured without checking the local concavity of the constraint set, are local concavity type assumptions sufficient to guarantee that projected gradient descent converges at least to some local minimizer? Acknowledgements The authors thank John Lafferty and Stephen Wright for helpful feedback on an early version of this work. The authors are very grateful to an anonymous reviewer for pointing out many useful connections with the analysis literature. Funding Partial funding: Alfred P. 
Sloan fellowship; National Science Foundation (NSF) award DMS-1654076. Appendix A. Proofs of local concavity coefficient results In this section we prove the equivalence of the multiple notions of the (local or global) concavity of the constraint set $$\mathscr{C}$$, given in Theorems 2.1 and 2.2, as well as some properties of these coefficients (Lemmas 2.3, 2.4, 2.5 and 2.6). Since our equivalent characterizations of the local concavity coefficients are inspired by many related conditions in the prox-regularity literature, some of the equivalence results are well known in the ℓ2 setting (i.e. when the structured norm $$\lVert{\cdot }\rVert$$ is chosen to be the ℓ2 norm), as we have described in Section 2.3. Once we move to the general setting, where $$\lVert{\cdot }\rVert$$ may be any norm chosen to reflect the structure of the underlying signal, and where we do not assume that $$x\mapsto \gamma _{x}(\mathscr{C})$$ is continuous, many of the previously developed proof techniques will no longer apply. Throughout the proof, we will highlight those portions where our proof uses novel techniques due to the challenges of this more general setting. 
In order to help discuss the various definitions of these coefficients before the equivalence is established, we begin by introducing notation for the local concavity coefficients defined using each of these four properties: for all $$x\in \mathscr{C}$$, define   \begin{align*} \gamma^{\textrm{curv}}_{x}(\mathscr{C}) =&\, \min\left\{\gamma\in[0,\infty]: \text{The curvature condition (2.1) holds for this point }x\text{ and any }y\in\mathscr{C}\right\}\!,\\ \gamma^{\textrm{contr}}_{x}(\mathscr{C}) =&\, \min\left\{\gamma\in[0,\infty]: \text{The contraction condition (2.3) holds for this point }x\text{ and any }y\in\mathscr{C}\right\}\!,\\ \gamma^{\textrm{FO}}_{x}(\mathscr{C}) =&\, \min\left\{\gamma\in[0,\infty]: \text{The first-order condition (2.4) holds for this point }x\text{ and any }y\in\mathscr{C}\right\}\!,\\ \gamma^{\textrm{IP}}_{x}(\mathscr{C}) =&\, \min\left\{\gamma\in[0,\infty]: \text{The inner product condition (2.5) holds for this point }x\text{ and any }y\in\mathscr{C}\right\}\!. \end{align*}We emphasize that here we are not explicitly setting these coefficients to equal $$\infty$$ at degenerate points $$x\in \mathscr{C}_{\mathsf{dgn}}$$—they may take finite values (we will need this distinction for some technical parts of our proofs later on). We will prove that these four definitions are all equal for all $$x\not \in \mathscr{C}_{\mathsf{dgn}}$$, which is sufficient for the equivalence result Theorem 2.2, since the local concavity coefficients are set to $$\infty$$ at degenerate points.
Before proceeding, we introduce one more definition: by equivalence of norms on $$\mathbb{R}^{d}$$, we can find some finite constant Bnorm such that   \begin{align} \text{For all }z\in\mathbb{R}^{d}, \begin{cases}B_{\textrm{norm}}^{-1}\lVert{z}\rVert_{2} \leqslant \lVert{z}\rVert\leqslant B_{\textrm{norm}}\lVert{z}\rVert_{2},\\[4pt] B_{\textrm{norm}}^{-1}\lVert{z}\rVert_{2} \leqslant \lVert{z}\rVert^{\ast}\leqslant B_{\textrm{norm}}\lVert{z}\rVert_{2}.\end{cases} \end{align} (A.1) Note that, while Bnorm is finite, it may be extremely large—for instance, $$B_{\text{norm}}=\sqrt{d}$$ when $$\lVert{\cdot }\rVert$$ is the ℓ1 norm. A.1 Proof of upper semi-continuity (Lemma 2.2) Before we can prove the equivalence of the four definitions of the local coefficients in (2.6), we first need to show that these coefficients are upper semi-continuous, as claimed in Lemma 2.2. Of course, since we do not yet know that the four definitions are equivalent, we need to specify which definition we are using. We will work with the inner product condition (2.5). Lemma A.1 The map $$x\mapsto \gamma ^{\text{IP}}_{x}(\mathscr{C})$$ is upper semi-continuous over $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$. This lemma will allow us to prove the equivalence result, Theorem 2.2. Once Theorem 2.2 is proved, then Lemma A.1 becomes equivalent to the original lemma, Lemma 2.2, since $$\gamma _{x}(\mathscr{C})=\infty$$ by definition on the subset $$\mathscr{C}_{\mathsf{dgn}}\subset \mathscr{C}$$, which is a closed subset by definition, while Lemma A.1 proves that $$x\mapsto \gamma _{x}(\mathscr{C})$$ is upper semi-continuous over the open subset $$\mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}} \subset \mathscr{C}$$.
Before proving this result, we first state a well-known fact about projections, which we will use throughout our proofs:   \begin{align} \textrm{for any }z\in\mathbb{R}^{d}\textrm{ and }x\in\mathscr{C}\textrm{ with }P_{{\mathscr{C}}}(z)=x,\textrm{ and any }t\in[0,1],\quad P_{{\mathscr{C}}}\big((1-t)x + tz\big) = x. \end{align} (A.2) Now we prove upper semi-continuity. Proof of Lemma A.1 Take any sequence $$x_{n}\rightarrow x$$, with $$x,x_{1},x_{2},\dots \in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$. We want to prove that   \begin{align} \gamma:= \limsup_{n\rightarrow\infty}\gamma^{\textrm{IP}}_{x_{n}}(\mathscr{C})\leqslant\gamma^{\textrm{IP}}_{x}(\mathscr{C}). \end{align} (A.3) Since $$x\not \in \mathscr{C}_{\mathsf{dgn}}$$ by assumption, we know that $$P_{{\mathscr{C}}}$$ is continuous in some neighborhood of x. Let r > 0 be a radius such that $$P_{{\mathscr{C}}}$$ is continuous on $$\mathbb{B}_{*}(x,r)$$, where $$\mathbb{B}_{*}(x,r)$$ is the ball of radius r around the point x in the dual norm $$\lVert{\cdot }\rVert ^{\ast }$$. Assume also that γ > 0, since otherwise the claim is trivial. Taking a subsequence of the points $$x_{1},x_{2},\dots$$ if necessary, we can assume without loss of generality that   $$\gamma^{\textrm{IP}}_{x_{n}}(\mathscr{C})\rightarrow\gamma.$$Fix any ε > 0 such that ε < γ. For each n, by definition of the local concavity coefficient $$\gamma ^{\text{IP}}_{x_{n}}(\mathscr{C})$$, there must exist some $$y_{n}\in \mathscr{C}$$ and some $$z{^\prime }_{n}\in \mathbb{R}^{d}$$ with $$P_{{\mathscr{C}}}(z{^\prime }_{n})=x_{n}$$, such that   \begin{align} \left\langle{\,y_{n}-x_{n}},{z^\prime_{n}-x_{n}}\right\rangle> \left(\gamma^{\textrm{IP}}_{x_{n}}(\mathscr{C})-\varepsilon\right)\lVert{z^\prime_{n}-x_{n}}\rVert^{\ast}\lVert{y_{n}-x_{n}}{\rVert^{2}_{2}}.
\end{align} (A.4) Define   $$z_{n} =\begin{cases}z^\prime_{n}, & \textrm{ if }\lVert{z^\prime_{n} - x_{n}}\rVert^{\ast}\leqslant r/2,\\[4pt] x_{n} + (z^\prime_{n} - x_{n})\cdot \frac{r/2}{\lVert{z^\prime_{n}-x_{n}}\rVert^{\ast}},&\textrm{ if }\lVert{z^\prime_{n}-x_{n}}\rVert^{\ast}>r/2,\end{cases}$$so that $$\lVert{z_{n}-x_{n}}\rVert ^{\ast }\leqslant r/2$$. By (A.2), $$P_{{\mathscr{C}}}(z_{n})=x_{n}$$. Furthermore, rescaling both sides of the inequality (A.4),   \begin{align} \left\langle{\,y_{n}-x_{n}},{z_{n}-x_{n}}\right\rangle> \left(\gamma^{\textrm{IP}}_{x_{n}}(\mathscr{C})-\varepsilon\right)\lVert{z_{n}-x_{n}}\rVert^{\ast}\lVert{y_{n}-x_{n}}{\rVert^{2}_{2}}. \end{align} (A.5) Since the left-hand side is bounded by $$\lVert{y_{n}-x_{n}}\rVert \lVert{z_{n}-x_{n}}\rVert ^{\ast }$$, we see that   $$\lVert{y_{n}-x_{n}}{\rVert^{2}_{2}} < \frac{\lVert{y_{n}-x_{n}}\rVert}{\gamma^{\textrm{IP}}_{x_{n}}(\mathscr{C})-\varepsilon}\leqslant \frac{\lVert{y_{n}-x_{n}}\rVert}{(\gamma-\varepsilon)/2}$$for all n sufficiently large so that $$\gamma ^{\text{IP}}_{x_{n}}(\mathscr{C})>\gamma - \frac{\gamma -\varepsilon }{2}$$. Therefore, since $$\lVert{y_{n} -x_{n}}\rVert \leqslant B_{\text{norm}}\lVert{y_{n}-x_{n}}\rVert _{2}$$ for some finite Bnorm, for all large n, yn lies in some ball of finite radius around x. The same is true for zn, since $$\lVert{z_{n}-x_{n}}\rVert ^{\ast }\leqslant r/2$$ by construction. Thus we can find a convergent subsequence, that is, $$n_{1},n_{2},\dots$$ such that   $$\begin{cases}y_{n_{i}}\rightarrow y\textrm{ for some point }y,\\ z_{n_{i}} \rightarrow z\textrm{ for some point }z. \end{cases}$$Since $$\mathscr{C}$$ is closed, we must have $$y\in \mathscr{C}$$. And, since $$x_{n_{i}}\rightarrow x$$, for sufficiently large i we have $$\lVert{x_{n_{i}}-x}\rVert ^{\ast }\leqslant r/2$$, so that $$z_{n_{i}}\in \mathbb{B}_{*}(x,r)$$.
Since $$P_{{\mathscr{C}}}$$ is continuous on the ball $$\mathbb{B}_{*}(x,r)$$, the fact that $$P_{{\mathscr{C}}}(z_{n_{i}}) = x_{n_{i}}\rightarrow x$$ implies that we must have $$P_{{\mathscr{C}}}(z)=x$$. And,   \begin{align*} \langle {y-x},{z-x}\rangle=&\, \underset{i\rightarrow\infty}{\rm lim}\left\langle{y_{n_{i}} - x_{n_{i}}},{z_{n_{i}}-x_{n_{i}}}\right\rangle \geqslant \underset{i\rightarrow\infty}{\rm lim} \left(\gamma^{\textrm{IP}}_{x_{n_{i}}}(\mathscr{C}) - \varepsilon\right)\lVert{z_{n_{i}}-x_{n_{i}}}\rVert^{\ast}\lVert{y_{n_{i}}-x_{n_{i}}}{\rVert^{2}_{2}}\\ =&\, (\gamma-\varepsilon)\lVert{z-x}\rVert^{\ast}\lVert{y-x}{\rVert^{2}_{2}},\end{align*}where the inequality applies (A.5) for each $$n_{i}$$. Therefore, $$\gamma ^{\text{IP}}_{x}(\mathscr{C})\geqslant \gamma -\varepsilon$$. Since ε > 0 was chosen to be arbitrarily small, this proves that $$\gamma ^{\text{IP}}_{x}(\mathscr{C})\geqslant \gamma$$, as desired. A.2 Proof of equivalence for local concavity (Theorem 2.2) Now that we have established upper semi-continuity of $$\gamma ^{\text{IP}}_{x}(\mathscr{C})$$ over $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$, we are ready to prove the equivalence of the local concavity coefficients. Recall that if $$x\in \mathscr{C}_{\mathsf{dgn}}$$ then $$\gamma _{x}(\mathscr{C})=\infty$$ under all four definitions. Therefore, from this point on, we only need to show that   $$\gamma^{\textrm{curv}}_{x}(\mathscr{C})=\gamma^{\textrm{contr}}_{x}(\mathscr{C})=\gamma^{\textrm{IP}}_{x}(\mathscr{C})=\gamma^{\textrm{FO}}_{x}(\mathscr{C})\text{ for all }x\in\mathscr{C}\backslash\mathscr{C}_{\mathsf{dgn}}.$$In fact, we will also show that a weaker statement holds for all $$x\in \mathscr{C}$$ (i.e.
without excluding degenerate points), namely   \begin{align} \gamma^{\textrm{IP}}_{x}(\mathscr{C}) \leqslant {\rm min} \left\{\gamma^{\textrm{curv}}_{x}(\mathscr{C}),\ \gamma^{\textrm{contr}}_{x}(\mathscr{C}),\ \gamma^{\textrm{FO}}_{x}(\mathscr{C})\right\}\text{ for all }x\in\mathscr{C}. \end{align} (A.6) This additional bound will be useful later in our characterization of the degenerate points, when we prove Lemma 2.5. A.2.1 Inner products ⇒ first-order optimality Fix any $$u\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$. Let $$\mathsf{f}:\mathbb{R}^{d}\rightarrow \mathbb{R}$$ be differentiable, and suppose that u is a local minimizer of f over $$\mathscr{C}$$. By Rockafellar & Wets (2009, Theorem 6.12), this implies that $$-\nabla \mathsf{f}(u)\in N_{\mathscr{C}}(u)$$, where $$N_{\mathscr{C}}(u)$$ is the normal cone to $$\mathscr{C}$$ at the point u (see Rockafellar & Wets, 2009, Definition 6.3). By Colombo & Thibault (2010, (12)), we know that the normal cone can be obtained as a limit of proximal normal cones,   $$N_{\mathscr{C}}(u) = \limsup_{x\in\mathscr{C},\,x\rightarrow u} \ \ \underbrace{\left\{w\in\mathbb{R}^{d} : P_{{\mathscr{C}}}(x+\varepsilon\cdot w) = x\textrm{ for some }\varepsilon>0\right\}}_{\textrm{Proximal normal cone to }\mathscr{C}\textrm{ at }x}.$$Therefore, we can find some sequences $$u_{1},u_{2},\dots \in \mathscr{C}$$, $$w_{1},w_{2},\dots \in \mathbb{R}^{d}$$, and $$\varepsilon _{1},\varepsilon _{2},\dots>0$$, such that $$P_{{\mathscr{C}}}(u_{n} + \varepsilon _{n}\cdot w_{n})=u_{n}$$ for all $$n\geqslant 1$$, with $$u_{n}\rightarrow u$$ and $$w_{n}\rightarrow - \nabla \mathsf{f}(u)$$. Now fix any $$y\in \mathscr{C}$$.
By the inner product condition (2.5), for each $$n\geqslant 1$$,   $$\left\langle{\,y-u_{n}},{w_{n}}\right\rangle = \left\langle{\,y-u_{n}},{(u_{n} + w_{n}) - u_{n} }\right\rangle \leqslant \gamma^{\textrm{IP}}_{u_{n}}(\mathscr{C})\lVert{w_{n}}\rVert^{\ast}\lVert{y-u_{n}}{\rVert^{2}_{2}}.$$Taking limits on both sides, since $$u_{n}\rightarrow u$$ and $$w_{n}\rightarrow -\nabla \mathsf{f}(u)$$,   $$\langle{\,y-u},{-\nabla\mathsf{f}(u)}\rangle\leqslant \left(\limsup_{n\rightarrow\infty} \gamma^{\textrm{IP}}_{u_{n}}(\mathscr{C})\right)\cdot\lVert{\nabla\mathsf{f}(u)}\rVert^{\ast}\lVert{y-u}{\rVert^{2}_{2}}.$$Finally, recall that Lemma A.1 proves that $$x\mapsto \gamma ^{\text{IP}}_{x}(\mathscr{C})$$ is upper semi-continuous over $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$, and $$\mathscr{C}_{\mathsf{dgn}}\subset \mathscr{C}$$ is a closed subset. Since $$u\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$, we therefore have $$u_{n}\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$ for all sufficiently large n, and therefore $$\limsup _{n\rightarrow \infty } \gamma ^{\text{IP}}_{u_{n}}(\mathscr{C})\leqslant \gamma ^{\text{IP}}_{u}(\mathscr{C})$$. This proves that $$\gamma ^{\text{FO}}_{u}(\mathscr{C})\leqslant \gamma ^{\text{IP}}_{u}(\mathscr{C})$$, as desired. In fact, we can formulate a more general version of the first-order optimality condition:   \begin{align} &\text{For any Lipschitz continuous }\mathsf{f}:\mathbb{R}^{d}\rightarrow \mathbb{R}\text{ such that }x\text{ is a local minimizer of }\mathsf{f}\text{ over }\mathscr{C},\nonumber\\&\quad\langle{y-x},{v}\rangle\geqslant - \gamma\lVert{v}\rVert^{\ast}\lVert{y-x}{\rVert^{2}_{2}}\ \ \text{ for some }v\in\partial \mathsf{f}(x), \end{align} (A.7) where ∂f(x) is the subdifferential of f at x (see Rockafellar & Wets, 2009, Definition 8.3).
To see why (A.7) holds, Rockafellar & Wets (2009, Theorem 8.15) guarantees that, since f is Lipschitz and x is a local minimizer of f over the closed set $$\mathscr{C}$$, we must have $$-v \in N_{\mathscr{C}}(x)$$ for some subgradient v ∈ ∂f(x). The remainder of the proof is identical to the differentiable case treated above, with v in place of ∇f(x); this proves that, for any $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$ and any $$y\in \mathscr{C}$$, the stronger first-order optimality condition (A.7) holds with $$\gamma = \gamma ^{\text{IP}}_{x}(\mathscr{C})$$. Comparing with proofs of related conditions in the literature, we see that avoiding a continuity assumption on the map x↦γx means that the first-order optimality condition (2.4) does not follow immediately from the inner product condition (2.5); however, by first establishing upper semi-continuity of the map $$x\mapsto \gamma ^{\text{IP}}_{x}(\mathscr{C})$$, the result follows. A.2.2 First-order optimality ⇒ inner products This direction of the equivalence is immediate from the definitions of these two conditions. Fix any $$x,y\in \mathscr{C}$$ and $$z\in \mathbb{R}^{d}$$ with $$P_{{\mathscr{C}}}(z)=x$$. Consider the function $$\mathsf{f}(w) = \frac{1}{2}\lVert{w-z}{\rVert ^{2}_{2}}$$, so that $$P_{{\mathscr{C}}}(z)=x\in \mathscr{C}$$ minimizes f over $$\mathscr{C}$$. We see trivially that ∇f(x) = x − z and so   $$\langle{\,y-x},{z-x}\rangle = \langle{\,y-x},{-\nabla\mathsf{f}(x)}\rangle\leqslant \gamma^{\textrm{FO}}_{x}(\mathscr{C})\lVert{\nabla\mathsf{f}(x)}\rVert^{\ast}\lVert{x-y}{\rVert^{2}_{2}} = \gamma^{\textrm{FO}}_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast}\lVert{x-y}{\rVert^{2}_{2}}.$$Therefore $$\gamma ^{\text{IP}}_{x}(\mathscr{C})\leqslant \gamma ^{\text{FO}}_{x}(\mathscr{C})$$ for all $$x\in \mathscr{C}$$, while previously we showed that the reverse inequality holds over $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$.
Therefore, $$\gamma ^{\text{IP}}_{x}(\mathscr{C})=\gamma ^{\text{FO}}_{x}(\mathscr{C})$$ for $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$. A.2.3 Curvature ⇒ inner products Fix any $$x,y\in \mathscr{C}$$ and any $$z\in \mathbb{R}^{d}$$ with $$P_{{\mathscr{C}}}(z)=x$$. For all t ∈ (0, 1), let xt = (1 − t)x + ty, and choose   $$\widetilde{x}_{t} \in\mathop{\operatorname{arg\,min}}\limits_{x^\prime\in\mathscr{C}}\lVert{x^\prime-x_{t}}\rVert \textrm{ such that } \limsup_{t\searrow 0}\frac{\lVert{\widetilde{x}_{t} -x_{t}}\rVert}{t}\leqslant \gamma^{\textrm{curv}}_{x}(\mathscr{C})\lVert{x-y}{\rVert^{2}_{2}},$$as in the definition of $$\gamma ^{\textrm{curv}}_{x}(\mathscr{C})$$. Fix any ε > 0. Then for some t0 > 0, for all t < t0,   $$\frac{\lVert{\widetilde{x}_{t}-x_{t}}\rVert}{t}\leqslant \gamma^{\textrm{curv}}_{x}(\mathscr{C})\lVert{x-y}{\rVert^{2}_{2}} + \varepsilon.$$Since $$x=P_{{\mathscr{C}}}(z)$$, this means that for all t ∈ (0, 1),   $$\lVert{z-x}{\rVert^{2}_{2}}\leqslant\lVert{z-\widetilde{x}_{t}}{\rVert^{2}_{2}} = \lVert{z-x_{t}}{\rVert^{2}_{2}} + \lVert{\widetilde{x}_{t}-x_{t}}{\rVert^{2}_{2}} + 2\langle{z-x_{t}},{x_{t}-\widetilde{x}_{t}}\rangle .$$We can also calculate   $$\lVert{z-x_{t}}{\rVert^{2}_{2}} = \lVert{z-(1-t)x-ty}{\rVert^{2}_{2}} = \lVert{z-x}{\rVert^{2}_{2}} -2t\langle{\,y-x},{z-x}\rangle + t^{2}\lVert{x-y}{\rVert^{2}_{2}}.$$We rearrange terms to obtain   $$\langle{y-x},{z-x}\rangle\leqslant \frac{1}{2t}\left(\lVert{\widetilde{x}_{t}-x_{t}}{\rVert^{2}_{2}} + 2\left\langle{z-x_{t}},{x_{t}-\widetilde{x}_{t}}\right\rangle + t^{2}\lVert{x-y}{\rVert^{2}_{2}}\right).$$Recalling that $$\lVert{\cdot }\rVert _{2}\leqslant B_{\textrm{norm}}\lVert{\cdot }\rVert$$ for some finite constant Bnorm by (A1), we then have   \begin{align*} &\langle{\,y-x},{z-x} \rangle\\ &\quad\leqslant \frac{1}{2t}\left(\left(B_{\textrm{norm}}\right)^{2}\lVert{\widetilde{x}_{t}-x_{t}}\rVert^{2} + 2\lVert{z-x_{t}}\rVert^{\ast}\lVert{\widetilde{x}_{t}-x_{t}}\rVert+
t^{2}\lVert{x-y}{\rVert^{2}_{2}}\right)\\ &\quad\leqslant \frac{1}{2t}\left((B_{\textrm{norm}})^{2}\left(\left(\gamma^{\textrm{curv}}_{x}(\mathscr{C})\lVert{x-y}{\rVert^{2}_{2}}+\varepsilon\right)\!\cdot \!t\right)^{2} \!\!+ 2\lVert{z\!-\!x_{t}}\rVert^{\ast}\!\left(\gamma^{\textrm{curv}}_{x}(\mathscr{C})\lVert{x\!-\!y}{\rVert^{2}_{2}}+\!\varepsilon\right)\!\cdot t+ t^{2}\lVert{x\!-\!y}{\rVert^{2}_{2}}\right)\\ &\quad=\lVert{z-x_{t}}\rVert^{\ast}\left(\gamma^{\textrm{curv}}_{x}(\mathscr{C})\lVert{x-y}{\rVert^{2}_{2}}+\varepsilon\right) + \frac{t}{2}\left(\left(B_{\textrm{norm}}\right)^{2}\left(\gamma^{\textrm{curv}}_{x}(\mathscr{C})\lVert{x-y}{\rVert^{2}_{2}}+\varepsilon\right)^{2}+ \lVert{x-y}{\rVert^{2}_{2}}\right). \end{align*}Taking a limit as t approaches zero,   $$\langle{y-x},{z-x}\rangle\leqslant \left(\gamma^{\textrm{curv}}_{x}(\mathscr{C})\lVert{x-y}{\rVert^{2}_{2}}+\varepsilon\right)\cdot\lVert{z-x}\rVert^{\ast}.$$Since ε > 0 was arbitrary, we conclude that $$\gamma ^{\textrm{IP}}_{x}(\mathscr{C})\leqslant \gamma ^{\textrm{curv}}_{x}(\mathscr{C})$$ for any $$x\in \mathscr{C}$$. To compare this portion of the proof with the existing literature: in the ℓ2 setting, this equivalence is proved by Colombo & Thibault (2010, Theorem 14(q)). Their proof relies on the fact that the projection operator $$P_{{\mathscr{C}}}$$ is taken with respect to the ℓ2 norm, and the curvature condition also seeks to bound the ℓ2 distance between the point xt and the set $$\mathscr{C}$$.
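To make the ℓ2 setting concrete, consider the unit circle $$\mathscr{C}=\{x\in\mathbb{R}^{2}:\lVert{x}\rVert_{2}=1\}$$, for which a direct computation gives $$\gamma_{x}(\mathscr{C})=1/2$$ at every point (taking $$\lVert{\cdot}\rVert=\lVert{\cdot}\rVert_{2}$$). The script below is an illustrative numerical check of this value, not part of the proof: it verifies the inner product condition (2.5) with γ = 1/2, and that the ratio in the curvature condition (2.1) approaches $$\gamma\lVert{x-y}{\rVert^{2}_{2}}$$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5  # local concavity coefficient of the unit circle w.r.t. the l2 norm

def proj_circle(z):
    """Euclidean projection onto the unit circle (well defined for z != 0)."""
    return z / np.linalg.norm(z)

# Inner product condition (2.5): <y - x, z - x> <= gamma * ||z - x||_2 * ||y - x||_2^2
# whenever x is the projection of z.
for _ in range(1000):
    z = rng.normal(size=2)
    x = proj_circle(z)
    y = proj_circle(rng.normal(size=2))
    lhs = np.dot(y - x, z - x)
    rhs = gamma * np.linalg.norm(z - x) * np.linalg.norm(y - x) ** 2
    assert lhs <= rhs + 1e-12

# Curvature condition (2.1): dist((1 - t) x + t y, C) / t -> gamma * ||x - y||_2^2 as t -> 0.
x = np.array([1.0, 0.0])
y = proj_circle(np.array([0.3, 1.0]))
t = 1e-6
x_t = (1 - t) * x + t * y
ratio = (1.0 - np.linalg.norm(x_t)) / t  # distance to the circle, from inside the disc
assert abs(ratio - gamma * np.linalg.norm(x - y) ** 2) < 1e-4
```

For points z inside the disc the inner product bound holds with equality, which is why γ = 1/2 is exact for this set.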
In our more general setting, the curvature condition (2.1) works with the structured norm $$\lVert{\cdot }\rVert$$, while projections are still taken with respect to the ℓ2 norm, and so the same proof technique can no longer be applied (for instance, using this technique in a sparse problem with $$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert _{1}$$, our results would suffer a factor of $$B_{\textrm{norm}}=\sqrt{d}$$ by converting from the ℓ1 norm to the ℓ2 norm). In our proof, a careful treatment of these various notions of distance allows for the bound to hold. A.2.4 Inner products ⇒ curvature To prove the curvature condition, we will actually need to use the stronger form (A.7) of the first-order optimality condition—as proved in Appendix A.2.1, this condition holds with $$\gamma =\gamma ^{\textrm{IP}}_{x}(\mathscr{C})$$ for all $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$. Fix any $$u\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$ and $$y\in \mathscr{C}$$. Let ut = (1 − t) ⋅ u + t ⋅ y, and define $$\mathsf{f}(x) = \lVert{x-u_{t}}\rVert$$. Note that f is a Lipschitz function. Since $$\mathscr{C}$$ is closed, and f is continuous and non-negative, it must attain a minimum over $$\mathscr{C}$$, $$x_{t}\in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{f}(x)$$. Since $$\mathscr{C}_{\mathsf{dgn}}$$ is a closed subset of $$\mathscr{C}$$, this means that $$x_{t}\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$ for any sufficiently small t > 0, since   $$\lVert{x_{t} - u}\rVert\leqslant \lVert{x_{t} - u_{t}}\rVert+\lVert{u_{t} - u}\rVert\leqslant 2\lVert{u-u_{t}}\rVert = 2t\lVert{u-y}\rVert$$(where the second inequality uses the definition of xt), and so $$x_{t}\rightarrow u$$. Next, consider the subdifferential ∂f(xt). It is well known that this subdifferential is not empty, and any element v ∈ ∂f(xt) must satisfy $$\lVert{v}\rVert ^{\ast }\leqslant 1$$ and $$\left \langle{v},{x_{t} -u_{t}}\right \rangle =\lVert{x_{t}-u_{t}}\rVert$$. 
Now, applying the stronger form of the first-order optimality condition given in (A.7), we have   $$\left\langle{v},{y - x_{t}}\right\rangle \geqslant - \gamma^{\textrm{IP}}_{x_{t}}(\mathscr{C})\lVert{v}\rVert^{\ast}\lVert{y-x_{t}}{\rVert^{2}_{2}}= - \gamma^{\textrm{IP}}_{x_{t}}(\mathscr{C})\lVert{y-x_{t}}{\rVert^{2}_{2}}$$and similarly, replacing $$y\in \mathscr{C}$$ with $$u\in \mathscr{C}$$,   $$\left\langle{v},{u - x_{t}}\right\rangle \geqslant - \gamma^{\textrm{IP}}_{x_{t}}(\mathscr{C})\lVert{u-x_{t}}{\rVert^{2}_{2}}.$$Taking the appropriate linear combination of these two inequalities,   $$\left\langle{v},{x_{t} - u_{t}}\right\rangle \leqslant \gamma^{\textrm{IP}}_{x_{t}}(\mathscr{C})\left((1-t)\lVert{u-x_{t}}{\rVert^{2}_{2}} + t\lVert{y-x_{t}}{\rVert^{2}_{2}}\right) = \gamma^{\textrm{IP}}_{x_{t}}(\mathscr{C})\left(t(1-t)\lVert{u-y}{\rVert^{2}_{2}} + \lVert{u_{t} - x_{t}}{\rVert^{2}_{2}}\right),$$where the last step simply uses the definition ut = (1 − t)u + ty and rearranges terms. Finally, $$\lVert{u_{t} - x_{t}}\rVert _{2}\leqslant B_{\textrm{norm}}\lVert{u_{t} - x_{t}}\rVert \leqslant B_{\textrm{norm}}\lVert{u - u_{t}}\rVert = tB_{\textrm{norm}}\lVert{u-y}\rVert$$, by definition of ut and xt, so combining everything we can write   $$\min_{x\in\mathscr{C}}\lVert{x-u_{t}}\rVert = \lVert{x_{t} - u_{t}}\rVert = \left\langle{v},{x_{t} - u_{t}}\right\rangle \leqslant \gamma^{\textrm{IP}}_{x_{t}}(\mathscr{C})\left(t(1-t)\lVert{u-y}{\rVert^{2}_{2}} + t^{2}B_{\textrm{norm}}^{2}\lVert{u-y}\rVert^{2}\right).$$Dividing by t and taking a limit,   $$\lim_{t\searrow 0}\frac{\min_{x\in\mathscr{C}}\lVert{x-u_{t}}\rVert}{t} \leqslant \left(\lim\sup_{t\searrow 0}\gamma^{\textrm{IP}}_{x_{t}}(\mathscr{C})\right)\cdot \lVert{u-y}{\rVert^{2}_{2}}.$$Finally, recall that $$x\mapsto \gamma ^{\textrm{IP}}_{x}(\mathscr{C})$$ is upper semi-continuous by Lemma A.1, and $$x_{t}\rightarrow u$$ as proved above. 
We thus have $$\limsup _{t\searrow 0}\gamma ^{\textrm{IP}}_{x_{t}}(\mathscr{C})\leqslant \gamma ^{\textrm{IP}}_{u}(\mathscr{C})$$. This proves that $$\gamma ^{\textrm{curv}}_{u}(\mathscr{C})\leqslant \gamma ^{\textrm{IP}}_{u}(\mathscr{C})$$, for any $$u\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$. Combining with our previous steps, we now have   $$\gamma^{\textrm{FO}}_{x}(\mathscr{C})=\gamma^{\textrm{IP}}_{x}(\mathscr{C})=\gamma^{\textrm{curv}}_{x}(\mathscr{C})$$for all $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$, while for $$x\in \mathscr{C}_{\mathsf{dgn}}$$ we have the weaker statement $$\gamma ^{\textrm{IP}}_{x}(\mathscr{C})\leqslant \min \{\gamma ^{\textrm{curv}}_{x}(\mathscr{C}),\gamma ^{\textrm{FO}}_{x}(\mathscr{C})\}$$. To compare this portion of the proof with the existing literature: in the ℓ2 setting, the equivalent result is proved by Colombo & Thibault (2010, Proposition 9). In their proof, they use the identity $$\lVert{x_{t}-u_{t}}\rVert _{2}=\frac{\langle{x_{t}-u_{t}},{x_{t}-u_{t}}\rangle }{\lVert{x_{t}-u_{t}}\rVert _{2}}$$, and then upper bound the right-hand side via the inner product condition. To translate this proof into the more general structured norm setting, we write $$\lVert{x_{t}-u_{t}}\rVert = \langle{v},{x_{t}-u_{t}}\rangle$$ for a subgradient v in the subdifferential of the function $$x\mapsto \lVert{x - u_{t}}\rVert$$, and apply results from the analysis literature along with our upper semi-continuity result, Lemma A.1. A.2.5 Approximate contraction ⇔ inner products This proof, for the case of a general norm $$\lVert{\cdot }\rVert$$, proceeds identically to the proof for the case where $$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert _{2}$$ (presented, e.g., by Colombo & Thibault, 2010, Theorem 3(b, d)). For completeness, we reproduce the argument here. First, we show that $$\gamma ^{\textrm{IP}}_{x}(\mathscr{C})\leqslant \gamma ^{\textrm{contr}}_{x}(\mathscr{C})$$.
Fix any $$x\in \mathscr{C}$$, and any $$z\in \mathbb{R}^{d}$$ with $$x=P_{{\mathscr{C}}}(z)$$. Define zt = t ⋅ z + (1 − t) ⋅ x for t ∈ [0, 1]. By (A2), $$x=P_{{\mathscr{C}}}(z_{t})$$ for all t ∈ [0, 1]. Then for any $$y\in \mathscr{C}$$, since $$\lVert{z_{t} - x}\rVert ^{\ast } = t\lVert{z-x}\rVert ^{\ast }$$,   $$\lVert{y - x}\rVert_{2}\left(1 - \gamma^{\textrm{contr}}_{x}(\mathscr{C})\cdot t\lVert{z-x}\rVert^{\ast} \right)\leqslant \lVert{y - z_{t}}\rVert_{2}$$by the approximate contraction property (2.3). For sufficiently small t, the left-hand side is non-negative (except for the trivial case $$\gamma ^{\textrm{contr}}_{x}(\mathscr{C})=\infty$$, in which case there is nothing to prove). Squaring both sides and rearranging some terms,   $$\lVert{y-x}{\rVert^{2}_{2}} \leqslant \lVert{y-z_{t}}{\rVert^{2}_{2}} + \left(2\gamma^{\textrm{contr}}_{x}(\mathscr{C})\cdot t\lVert{z-x}\rVert^{\ast}- \left(\gamma^{\textrm{contr}}_{x}(\mathscr{C})\cdot t\lVert{z-x}\rVert^{\ast}\right)^{2}\right)\lVert{y-x}{\rVert^{2}_{2}}.$$And,   $$\lVert{y-z_{t}}{\rVert^{2}_{2}} = \lVert{y-x}{\rVert^{2}_{2}} + \lVert{x-z_{t}}{\rVert^{2}_{2}} + 2\langle{y-x},{x-z_{t}}\rangle$$so rearranging terms again,   $$2\langle{\,y-x},{z_{t} - x}\rangle \leqslant \lVert{x-z_{t}}{\rVert^{2}_{2}} + \left(2\gamma^{\textrm{contr}}_{x}(\mathscr{C})\cdot t\lVert{z-x}\rVert^{\ast}- \left(\gamma^{\textrm{contr}}_{x}(\mathscr{C}) \cdot t\lVert{z-x}\rVert^{\ast}\right)^{2}\right)\lVert{y-x}{\rVert^{2}_{2}}.$$Plugging in the definition of zt,   $$2 t\langle{\,y-x},{z - x}\rangle \leqslant t^{2} \lVert{x-z}{\rVert^{2}_{2}} + \left(2\gamma^{\textrm{contr}}_{x}(\mathscr{C}) \cdot t \lVert{z-x}\rVert^{\ast} - \gamma^{\textrm{contr}}_{x}(\mathscr{C})^{2}\cdot t^{2}\left(\lVert{z-x}\rVert^{\ast}\right)^{2}\right)\lVert{y-x}{\rVert^{2}_{2}}.$$Dividing by 2t, then taking the limit as t ↘ 0,   $$\langle{\,y-x},{z-x}\rangle\leqslant \gamma^{\textrm{contr}}_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast} 
\lVert{y-x}{\rVert^{2}_{2}}.$$Therefore, for any $$x\in \mathscr{C}$$, $$\gamma ^{\textrm{IP}}_{x}(\mathscr{C})\leqslant \gamma ^{\textrm{contr}}_{x}(\mathscr{C})$$. Now we prove the reverse inequality, i.e. $$\gamma ^{\textrm{contr}}_{x}(\mathscr{C})\leqslant \gamma ^{\textrm{IP}}_{x}(\mathscr{C})$$. Fix any $$x,y\in \mathscr{C}$$ and any $$z\in \mathbb{R}^{d}$$ with $$x=P_{{\mathscr{C}}}(z)$$. Then   $$\lVert{y-x}{\rVert^{2}_{2}} + \langle{y-x},{z-y}\rangle = \langle{y-x},{z-x}\rangle\leqslant\gamma^{\textrm{IP}}_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast}\lVert{y-x}{\rVert^{2}_{2}}.$$Simplifying,   $$\left(1 - \gamma^{\textrm{IP}}_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast} \right)\lVert{y-x}{\rVert^{2}_{2}} \leqslant - \langle{y-x},{z-y}\rangle\leqslant \lVert{y-x}\rVert_{2}\lVert{z-y}\rVert_{2},$$and so   $$\left(1 - \gamma^{\textrm{IP}}_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast} \right)\lVert{y-x}\rVert_{2} \leqslant\lVert{z-y}\rVert_{2}.$$Therefore, for any $$x\in \mathscr{C}$$, $$\gamma ^{\textrm{contr}}_{x}(\mathscr{C})\leqslant \gamma ^{\textrm{IP}}_{x}(\mathscr{C})$$. Combining everything, we have now proved   $$\gamma^{\textrm{contr}}_{x}(\mathscr{C})=\gamma^{\textrm{IP}}_{x}(\mathscr{C})=\gamma^{\textrm{FO}}_{x}(\mathscr{C})=\gamma^{\textrm{curv}}_{x}(\mathscr{C})$$for all $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$, in addition to the weaker bound (A.6) for all $$x\in \mathscr{C}$$, as desired. This completes the proof of Theorem 2.2. A.3 Proof of characterization of degenerate points (Lemma 2.3) Next we prove that the degenerate points $$u\in \mathscr{C}_{\mathsf{dgn}}$$ are precisely those points where any of the four local concavity conditions would fail to hold, in any neighborhood of u and for any finite γ. First, the characterization of prox-regularity given by Poliquin et al.
(2000, Proposition 1.2, Theorem 1.3(i)) proves that, if the projection operator $$P_{{\mathscr{C}}}$$ is not continuous in a neighborhood of $$u\in \mathscr{C}$$, then there are no constants ε > 0 and $$\gamma <\infty$$ such that the inner product condition (2.5) holds for all $$x\in \mathscr{C}\cap \mathbb{B}_{2}(u,\varepsilon )$$. Therefore, for any r > 0, $$\sup _{x\in \mathscr{C}\cap \mathbb{B}_{2}(u,r)}\gamma ^{\textrm{IP}}_{x}(\mathscr{C}) = \infty$$. Finally, in proving Theorem 2.2, we proved (A.6), i.e. $$\gamma ^{\textrm{IP}}_{x}(\mathscr{C})\leqslant \min \{\gamma ^{\textrm{curv}}_{x}(\mathscr{C}),\gamma ^{\textrm{contr}}_{x}(\mathscr{C}),\gamma ^{\textrm{FO}}_{x}(\mathscr{C})\}$$ for all $$x\in \mathscr{C}$$. This implies that   $$\lim_{r\rightarrow 0}\left\{\sup_{x\in\mathscr{C}\cap\mathbb{B}_{2}(u,r)}\gamma^{(*)}_{x}(\mathscr{C}) \right\}= \infty,$$where (*) denotes any of the four properties, i.e. $$\gamma ^{\textrm{curv}}_{x}(\mathscr{C})$$ for the curvature condition (2.1), $$\gamma ^{\textrm{contr}}_{x}(\mathscr{C})$$ for the contraction property (2.3), $$\gamma ^{\textrm{IP}}_{x}(\mathscr{C})$$ for the inner product condition (2.5) or $$\gamma ^{\textrm{FO}}_{x}(\mathscr{C})$$ for the first-order optimality condition (2.4). This proves the lemma. A.4 Proof of two-sided contraction property (Lemma 2.4) This proof, for the case of a general norm $$\lVert{\cdot }\rVert$$, proceeds identically to the proof for the case where $$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert _{2}$$ (presented, e.g., by Colombo & Thibault, 2010, Theorem 3(b,d)). For completeness, we reproduce the argument here. Take any $$x,y\in \mathscr{C}$$ and any $$z,w\in \mathbb{R}^{d}$$ with $$P_{{\mathscr{C}}}(z)=x$$ and $$P_{{\mathscr{C}}}(w)=y$$.
By definition of the local concavity coefficients, applying the inner product bound (2.5) we have   $$\langle{\,y-x},{z-x}\rangle \leqslant \gamma_{x}(\mathscr{C}) \lVert{z-x}\rVert^{\ast}\lVert{x-y}{\rVert^{2}_{2}}.$$Applying the same property with the roles of the variables reversed,   $$\langle{x-y},{w-y}\rangle \leqslant \gamma_{y}(\mathscr{C}) \lVert{w-y}\rVert^{\ast}\lVert{x-y}{\rVert^{2}_{2}}.$$Adding these two inequalities together,   $$\langle{y-x},{z-x-w+y}\rangle\leqslant \gamma_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast}\lVert{x-y}{\rVert^{2}_{2}} + \gamma_{y}(\mathscr{C})\lVert{w-y}\rVert^{\ast}\lVert{x-y}{\rVert^{2}_{2}}.$$Rearranging terms and simplifying,   $$\big(1 - \gamma_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast} - \gamma_{y}(\mathscr{C})\lVert{w-y}\rVert^{\ast}\big) \lVert{x-y}{\rVert^{2}_{2}}\leq \langle{y-x},{w-z}\rangle.$$Since the right-hand side is bounded by $$\lVert{x-y}\rVert _{2}\lVert{z-w}\rVert _{2}$$ by the Cauchy–Schwarz inequality, this proves the lemma. In fact, we will also prove a related result that will be useful for our convergence proofs later on. 
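As an aside, the two-sided contraction property just proved admits a quick numerical sanity check on the unit circle in $$\mathbb{R}^{2}$$, for which $$\gamma_{x}(\mathscr{C})\equiv 1/2$$ with respect to the ℓ2 norm (an illustrative check, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5  # local concavity coefficient of the unit circle w.r.t. the l2 norm

def proj_circle(z):
    """Euclidean projection onto the unit circle (well defined for z != 0)."""
    return z / np.linalg.norm(z)

# Lemma 2.4 (two-sided contraction):
# (1 - gamma_x ||z - x|| - gamma_y ||w - y||) ||x - y||_2 <= ||z - w||_2
# whenever x = P_C(z) and y = P_C(w).
for _ in range(1000):
    z, w = rng.normal(size=2), rng.normal(size=2)
    x, y = proj_circle(z), proj_circle(w)
    lhs = (1.0 - gamma * np.linalg.norm(z - x)
           - gamma * np.linalg.norm(w - y)) * np.linalg.norm(x - y)
    assert lhs <= np.linalg.norm(z - w) + 1e-12
```

Note that the bound can be tight: for z = (0.1, 0) and w = (−0.1, 0), both sides equal 0.2.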
As above, we have   $$\langle{\,y-x},{z-x}\rangle \leqslant \gamma_{x}(\mathscr{C}) \lVert{z-x}\rVert^{\ast}\lVert{x-y}{\rVert^{2}_{2}},$$and since $$y=P_{{\mathscr{C}}}(w)$$, we also have   $$0\geqslant\lVert{y-w}{\rVert^{2}_{2}} - \lVert{x-w}{\rVert^{2}_{2}} = \lVert{x-y}{\rVert^{2}_{2}} +2\langle{\,y-x},{x-w}\rangle.$$Then, adding the two bounds together,   $$\langle{y-x},{z-w}\rangle = \langle{y-x},{z-x}\rangle+\langle{y-x},{x-w}\rangle \leqslant \gamma_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast}\lVert{x-y}{\rVert^{2}_{2}} - \frac{1}{2}\lVert{x-y}{\rVert^{2}_{2}},$$and so   $$\left(\frac{1}{2} - \gamma_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast}\right)\lVert{x-y}{\rVert^{2}_{2}} \leqslant -\langle{y-x},{z-w}\rangle\leqslant\lVert{x-y}\rVert_{2}\lVert{z-w}\rVert_{2}.$$This proves that   \begin{align}\left(\frac{1}{2} - \gamma_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast}\right)\lVert{x-y}\rVert_{2} \leqslant\lVert{z-w}\rVert_{2}. \end{align} (A.8)In a setting where $$\gamma _{x}(\mathscr{C})\lVert{z-x}\rVert ^{\ast }$$ is small, but $$\gamma _{y}(\mathscr{C})\lVert{w-y}\rVert ^{\ast }$$ may be large, this alternate result can give a stronger bound than Lemma 2.4. A.5 Proof of equivalence for global concavity (Theorem 2.1) and local vs global coefficients (Lemma 2.1) We prove Theorem 2.1, which states that the five definitions for the global concavity coefficient $$\gamma (\mathscr{C})$$ are equivalent, alongside Lemma 2.1, which states that $$\gamma (\mathscr{C})=\sup _{x\in \mathscr{C}}\gamma _{x}(\mathscr{C})$$. First, suppose that $$\mathscr{C}$$ contains one or more degenerate points, $$\mathscr{C}_{\mathsf{dgn}}\neq \emptyset$$, in which case $$\sup _{x\in \mathscr{C}}\gamma _{x}(\mathscr{C})=\infty$$. By definition of $$\mathscr{C}_{\mathsf{dgn}}$$, the projection operator $$P_{{\mathscr{C}}}$$ is not continuous on any neighborhood of $$\mathscr{C}$$. Poliquin et al. 
(2000, Theorem 4.1) prove that this implies $$\mathscr{C}$$ is not prox-regular, and so $$\gamma (\mathscr{C})=\infty$$ as discussed in Section 2.3. Next, suppose that $$\mathscr{C}$$ contains no degenerate points. Let $$\gamma ^{\ast }=\sup _{x\in \mathscr{C}}\gamma _{x}(\mathscr{C})$$. Then clearly, by definition of the local coefficients $$\gamma _{x}(\mathscr{C})$$,   $$\gamma^{\ast}=\min\{\gamma\in[0,\infty]:\text{ property (*) holds for all }x,y\in\mathscr{C}\},$$where (*) may refer to any of the four equivalent properties, namely the curvature condition (2.1), the (one-sided) contraction property (2.3), the inner product condition (2.5) and the first-order condition (2.4). Next, let   $$\gamma^{\sharp}=\min\left\{\gamma\in[0,\infty]:\text{ the two-sided contraction property (2.2) holds for all }x,y\in\mathscr{C}\right\}\!.$$Clearly, the two-sided contraction property (2.2) is stronger than its one-sided version (2.3), and so $$\gamma ^{\ast }\leqslant \gamma ^{\sharp }$$. However, Lemma 2.4 shows that they are in fact equal, proving that   $$\big(1-\gamma_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast}-\gamma_{y}(\mathscr{C})\lVert{w-y}\rVert^{\ast}\big) \cdot \lVert{x-y}\rVert_{2} \leqslant \lVert{z - w}\rVert_{2}$$for all $$z,w\in \mathbb{R}^{d}$$ with $$x=P_{{\mathscr{C}}}(z)$$, $$y=P_{{\mathscr{C}}}(w)$$. Since $$\gamma _{x}(\mathscr{C}),\gamma _{y}(\mathscr{C})\leqslant \gamma ^{\ast }$$ for all $$x,y\in \mathscr{C}$$, this implies that   $$\big(1-\gamma^{\ast}\lVert{z-x}\rVert^{\ast}-\gamma^{\ast}\lVert{w-y}\rVert^{\ast}\big) \cdot \lVert{x-y}\rVert_{2} \leqslant \lVert{z - w}\rVert_{2},$$that is, (2.2) holds for all $$x,y\in \mathscr{C}$$ with constant γ = γ*. So, we have γ♯ ≤ γ*. Therefore, the five conditions defining $$\gamma (\mathscr{C})$$ are equivalent, and $$\gamma (\mathscr{C})=\gamma ^{\sharp } = \gamma ^{\ast }=\sup _{x\in \mathscr{C}}\gamma _{x}(\mathscr{C})$$, proving Theorem 2.1 and Lemma 2.1. B. 
Proofs of convergence results In this section we prove our convergence results for projected gradient descent (Theorem 3.1) and approximate projected gradient descent (Theorem 4.1), along with the necessity of the initialization condition (Lemma 3.1) and equivalence of the exact and approximate convergence results (Lemma 4.1). B.1 Proof of Theorem 3.1 This result, using the exact projection operator $$P_{{\mathscr{C}}}$$, is in fact a special case of Theorem 4.1, which provides a convergence guarantee using a family of operators {Px}. To see why, define $$P_{{x}} = P_{{\mathscr{C}}}$$ for all $$x\in \mathscr{C}$$. To apply Theorem 4.1, we only need to check that the relevant assumptions, namely (4.2), (4.3), (4.4) and (4.5), all hold. To check (4.2), by setting   $$\gamma^{\textrm{c}}=\max\left\{\gamma_{x}(\mathscr{C}):x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)\right\}\textrm{ and }\gamma^{\textrm{d}}=0,$$we see that the desired bound is a trivial consequence of the inner product condition (2.5). The norm compatibility condition (4.3) for {Px} holds as a trivial consequence of the original norm compatibility condition (3.4). The initialization condition (4.4) follows directly from the original initialization condition (3.5) by our choice of γc, γd. Finally, we verify the local continuity condition (4.5). Fix any $$x\in \mathscr{C}\cap \mathbb{B}_{2}(\widehat{x},\rho )$$ and any $$z\in \mathbb{R}^{d}$$ such that $$P_{{\mathscr{C}}}(z)\in \mathbb{B}_{2}(\widehat{x},\rho )$$ and $$2(\gamma ^{\textrm{c}}+\gamma ^{\textrm{d}})\lVert{z-P_{{x}}(z)}\rVert ^{\ast }\leqslant 1-c_{0}$$. 
By (A.8), for all $$w\in \mathbb{R}^{d}$$,   $$\left(\frac{1}{2} - \gamma_{P_{{\mathscr{C}}}(z)}(\mathscr{C})\lVert{z-P_{{\mathscr{C}}}(z)}\rVert^{\ast}\right)\lVert{P_{{\mathscr{C}}}(z)-P_{{\mathscr{C}}}(w)}\rVert_{2}\leqslant\lVert{z-w}\rVert_{2}.$$Since $$P_{{x}}(z)=P_{{\mathscr{C}}}(z)\in \mathbb{B}_{2}\left (\widehat{x},\rho \right )$$ and $$\gamma ^{\textrm{c}}+\gamma ^{\textrm{d}} = \max _{u\in \mathscr{C}\cap \mathbb{B}_{2}(\widehat{x},\rho )}\gamma _{u}(\mathscr{C}) \geqslant \gamma _{P_{{\mathscr{C}}}(z)}(\mathscr{C})$$, then,   $$\lVert{P_{{\mathscr{C}}}(z)-P_{{\mathscr{C}}}(w)}\rVert_{2}\leqslant \frac{\lVert{z-w}\rVert_{2}}{1/2 - (\gamma^{\textrm{c}}+\gamma^{\textrm{d}})\lVert{z-P_{{x}}(z)}\rVert^{\ast}} \leqslant \frac{\lVert{z-w}\rVert_{2}}{c_{0}/2}.$$Setting δ = ε ⋅ c0/2 then proves the condition (4.5). With these conditions in place, Theorem 3.1 becomes simply a special case of Theorem 4.1. B.2 Proof of Theorem 4.1 For t = 0, the statement holds trivially. To prove that the bound holds for subsequent steps, we will proceed by induction. Choose any ρ0 ∈ (0, ρ) such that   $$\rho_{0} \geq \max\left\{\lVert{x_{0} -\widehat{x}}\rVert_{2},\sqrt{\frac{1.5\varepsilon_{\mathsf{g}}^{2}}{c_{0}}}\right\},$$where this maximum is < ρ by assumption of the theorem. 
We will prove that   \begin{align} \begin{cases}\lVert{x_{t+1} - \widehat{x}}{\rVert^{2}_{2}}\leqslant \left(1- \frac{2 c_{0} \alpha}{\textrm{Denom}}\right)\lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} + \frac{3\alpha}{\textrm{Denom}}\varepsilon_{\mathsf{g}}^{2},\\ \lVert{x_{t+1} -\widehat{x}}\rVert_{2}\leqslant\rho_{0},\end{cases} \end{align} (B.1) for all $$t\geqslant 0$$, where the denominator term is given by   $$\textrm{Denom} = \beta + \alpha\left(c_{0} - (1-c_{0})\cdot\frac{\gamma^{\textrm{c}}}{\gamma^{\textrm{c}}+\gamma^{\textrm{d}}}\right)\leqslant \alpha+\beta.$$Assuming that this holds, then applying the first bound of (B.1) iteratively, we have   $$\lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} \leqslant \left(1-\frac{2c_{0}\alpha}{\textrm{Denom}}\right)^{t}\lVert{x_{0} - \widehat{x}}{\rVert^{2}_{2}} + \frac{1.5}{c_{0}}\varepsilon_{\mathsf{g}}^{2} \leqslant \left(1-\frac{2c_{0}\alpha}{\alpha+\beta}\right)^{t}\lVert{x_{0} - \widehat{x}}{\rVert^{2}_{2}} + \frac{1.5}{c_{0}}\varepsilon_{\mathsf{g}}^{2},$$which proves the theorem. Now we turn to proving (B.1), assuming that it holds at the previous time step. In order to apply assumptions such as RSC (3.2) and the inner product condition (4.2), we first need to know that $$\lVert{x_{t+1}-\widehat{x}}\rVert _{2}\leqslant \rho$$, which we cannot ensure directly at the start. To get around this, we first consider reducing the step size. For any step size s ∈ [0, η], define   $$x^\prime_{t+1}(s) = x_{t} - s\nabla\mathsf{g}(x_{t})\textrm{ and } x_{t+1}(s) =P_{{x_{t}}}(x^\prime_{t+1}(s)).$$Define   $$\mathscr{S}=\left\{s\in[0,\eta]:\lVert{x_{t+1}(s) - \widehat{x}}\rVert_{2}\leqslant\rho\right\}\textrm{ and }\mathscr{S}_{0} = \left\{s\in[0,\eta]:\lVert{x_{t+1}(s)-\widehat{x}}\rVert_{2}\leqslant\rho_{0}\right\}\!.$$Clearly $$0\in \mathscr{S}_{0}\subseteq \mathscr{S}$$ since xt+1(0) = xt, which satisfies $$\lVert{x_{t}-\widehat{x}}\rVert _{2}\leqslant \rho _{0}$$ by assumption.
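The passage from the one-step bound in (B.1) to the geometric bound above is the standard calculation for a linear recursion: if $$a_{t+1}\leqslant q\,a_{t}+b$$ with q ∈ (0, 1), then $$a_{t}\leqslant q^{t}a_{0}+b/(1-q)$$. As a quick numerical sanity check, with arbitrary illustrative constants (q playing the role of $$1-2c_{0}\alpha/\textrm{Denom}$$ and b of $$3\alpha\varepsilon_{\mathsf{g}}^{2}/\textrm{Denom}$$, so that $$b/(1-q)=1.5\varepsilon_{\mathsf{g}}^{2}/c_{0}$$):

```python
# Iterating a_{t+1} <= q * a_t + b yields a_t <= q**t * a_0 + b * (1 - q**t) / (1 - q),
# which is at most q**t * a_0 + b / (1 - q).  Illustrative constants only:
q, b, a0 = 0.9, 0.05, 4.0
a = a0
for t in range(1, 201):
    a = q * a + b  # worst case: the one-step bound holds with equality
    assert a <= q ** t * a0 + b / (1 - q) + 1e-12
# After many steps the transient term is negligible and a sits at b / (1 - q).
assert abs(a - b / (1 - q)) < 1e-8
```

This is exactly why the additive term $$1.5\varepsilon_{\mathsf{g}}^{2}/c_{0}$$ in the final bound does not grow with t.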
First, we claim that we can find some $$\varDelta > 0$$ such that   \begin{align} \text{If }s\in\mathscr{S}_{0}\text{ then }\min\{s+\varDelta,\eta\}\in\mathscr{S}. \end{align} (B.2) To prove this, we will apply the local uniform continuity assumption (4.5). Let x = xt and ε = ρ − ρ0, and find δ > 0 as in the assumption (4.5). For any $$s\in \mathscr{S}_{0}$$, let $$z=x^\prime_{t+1}(s) = x_{t} - s\nabla\mathsf{g}(x_{t})$$. Then $$P_{{x}}(z)=P_{{x_{t}}}\left (x^\prime _{t+1}(s)\right ) = x_{t+1}(s)\in \mathbb{B}_{2}\left (\widehat{x},\rho _{0}\right )\subset \mathbb{B}_{2}(\widehat{x},\rho )$$, and   \begin{align*} 2\left(\gamma^{\textrm{c}}+\gamma^{\textrm{d}}\right)\lVert{z-P_{{x}}(z)}\rVert^{\ast} \leqslant&\, 2\left(\gamma^{\textrm{c}}+\gamma^{\textrm{d}}\right)\cdot\phi \lVert{z - x_{t}}\rVert^{\ast} = 2\left(\gamma^{\textrm{c}}+\gamma^{\textrm{d}}\right)\phi s \lVert{\nabla\mathsf{g}(x_{t})}\rVert^{\ast}\\ \leqslant&\, 2\left(\gamma^{\textrm{c}}+\gamma^{\textrm{d}}\right)\cdot\phi\eta\max_{x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)}\lVert{\nabla\mathsf{g}(x)}\rVert^{\ast}\leqslant \eta\cdot (1-c_{0})\alpha \leqslant 1-c_{0},\end{align*}where the first inequality uses the norm compatibility condition (4.3), the third uses the initialization condition (4.4) and the fourth uses $$\eta = 1/\beta \leqslant 1/\alpha$$. Therefore, the conditions of the local continuity statement (4.5) are satisfied, and so for any $$w\in \mathbb{R}^{d}$$ with $$\lVert{w-z}\rVert _{2}\leqslant \delta$$, we must have $$\lVert{P_{{x}}(w)-P_{{x}}(z)}\rVert _{2}\leqslant \varepsilon =\rho -\rho _{0}$$. Then   $$\lVert{P_{{x}}(w)-\widehat{x}}\rVert_{2}\leqslant\lVert{P_{{x}}(w)-P_{{x}}(z)}\rVert_{2} + \lVert{P_{{x}}(z)-\widehat{x}}\rVert_{2}\leqslant(\rho-\rho_{0})+\rho_{0} = \rho.$$Now, define $$\varDelta = \delta /\lVert{\nabla\mathsf{g}(x_{t})}\rVert _{2}$$ and set $$w=x^\prime _{t+1}(\min \{s+\varDelta ,\eta \})$$. Then $$\lVert{w-x^\prime _{t+1}(s)}\rVert _{2} \leqslant
\varDelta \lVert{\nabla \mathsf{g}(x_{t})}\rVert _{2}\leqslant \delta$$, and so $$\lVert{P_{{x}}(w)-\widehat{x}}\rVert _{2}\leqslant \rho$$. This proves that $$\min \{s+\varDelta ,\eta \}\in \mathscr{S}$$, and so (B.2) holds. Next, consider any $$s\in \mathscr{S}$$. If s = 0, then $$s\in \mathscr{S}_{0}$$ since $$\lVert{x_{t} - \widehat{x}}\rVert _{2}\leqslant \rho _{0}$$ by the previous step. Otherwise, assume s > 0. We first have   \begin{align}\max\{\lVert{x^\prime_{t+1}(s)-x_{t+1}(s)}\rVert^{\ast},\lVert{x^\prime_{t+1}(s) - x_{t}}\rVert^{\ast}\}=&\,\max\{ \lVert{x^\prime_{t+1}(s) - P_{{x_{t}}}(x^\prime_{t+1}(s))}\rVert^{\ast},\lVert{x^\prime_{t+1}(s) - x_{t}}\rVert^{\ast}\}\nonumber\\ \leqslant&\, \phi\lVert{x^\prime_{t+1}(s)-x_{t}}\rVert^{\ast} = \phi\lVert{-s \nabla\mathsf{g}(x_{t})}\rVert^{\ast}\leqslant \frac{s\alpha(1-c_{0})}{2(\gamma^{\textrm{c}}+\gamma^{\textrm{d}})},\end{align} (B.3)where the first inequality uses the norm compatibility condition (4.3) while the second uses the initialization condition (4.4), since $$\lVert{x_{t}-\widehat{x}}\rVert _{2}\leqslant \rho$$. We can now use the inner product condition (4.2), applied with x = xt and $$z = x^\prime_{t+1}(s)$$, with $$P_{{x}}(z) = P_{{x_{t}}}(x^\prime _{t+1}(s)) = x_{t+1}(s)$$. (This condition can be applied as we have checked that $$x_{t},x_{t+1}(s)\in \mathbb{B}_{2}\left (\widehat{x},\rho \right )$$.)
The inner product condition yields   \begin{align} &\left\langle{\widehat{x}-x_{t+1}(s)},{x^\prime_{t+1}(s)-x_{t+1}(s)}\right\rangle\nonumber\\ &\quad\leqslant\max\{ \lVert{x^\prime_{t+1}(s)-x_{t+1}(s)}\rVert^{\ast},\lVert{x^\prime_{t+1}(s)-x_{t}}\rVert^{\ast}\}\cdot \Big(\gamma^{\textrm{c}}\lVert{\widehat{x}-x_{t+1}(s)}{\rVert^{2}_{2}} +\gamma^{\textrm{d}}\lVert{\widehat{x}-x_{t}}{\rVert^{2}_{2}}\Big)\nonumber\\ &\quad\leqslant \frac{s\alpha(1-c_{0})}{2(\gamma^{\textrm{c}}+\gamma^{\textrm{d}})} \Big(\gamma^{\textrm{c}}\lVert{\widehat{x}-x_{t+1}(s)}{\rVert^{2}_{2}} +\gamma^{\textrm{d}}\lVert{\widehat{x}-x_{t}}{\rVert^{2}_{2}}\Big), \end{align} (B.4)where the last step applies (B.3). Next, abusing notation, define for every $$u\in \mathscr{C}$$  $$\gamma^{\textrm{c}}_{u}(\mathscr{C}) = \begin{cases}\gamma^{\textrm{c}},&\lVert{u-\widehat{x}}\rVert_{2} <\rho,\\ \infty,&\lVert{u-\widehat{x}}\rVert_{2}\geqslant \rho,\end{cases}\quad\textrm{ and }\quad \gamma^{\textrm{d}}_{u}(\mathscr{C}) = \begin{cases}\gamma^{\textrm{d}},&\lVert{u-\widehat{x}}\rVert_{2} <\rho,\\ \infty,&\lVert{u-\widehat{x}}\rVert_{2}\geqslant \rho.\end{cases}$$Trivially these maps are upper semi-continuous. We also need to check the local continuity assumption (4.7), which requires that if $$\gamma ^{\textrm{c}}_{x}(\mathscr{C})+\gamma ^{\textrm{d}}_{x}(\mathscr{C})<\infty$$ and $$z_{t}\rightarrow x$$, then $$P_{{x}}(z_{t}) \rightarrow x$$. In fact, by the norm compatibility assumption (4.3), for all $$x\in \mathscr{C}\cap \mathbb{B}_{2}\left (\widehat{x},\rho \right )$$ we have $$\lVert{z_{t} - P_{{x}}(z_{t})}\rVert ^{\ast }\leqslant \phi \lVert{z_{t} - x}\rVert ^{\ast }$$, and so   $$\lVert{P_{{x}}(z_{t}) - x}\rVert^{\ast}\leqslant \lVert{P_{{x}}(z_{t})-z_{t}}\rVert^{\ast}+\lVert{z_{t}-x}\rVert^{\ast}\leqslant (1+\phi)\lVert{z_{t}-x}\rVert^{\ast}\rightarrow 0,$$proving that $$P_{{x}}(z_{t})\rightarrow x$$, as desired. 
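For a concrete instance of the norm compatibility condition (4.3): any exact Euclidean projection satisfies it with φ = 1, since $$\lVert{z-P_{{\mathscr{C}}}(z)}\rVert_{2}\leqslant\lVert{z-x}\rVert_{2}$$ for every $$x\in\mathscr{C}$$. The sketch below (illustrative only, using the Frobenius norm in the role of both $$\lVert{\cdot}\rVert$$ and its dual) checks this for projection onto rank-1 matrices via truncated SVD, the relevant projection in low-rank estimation problems:

```python
import numpy as np

rng = np.random.default_rng(2)

def proj_rank1(Z):
    """Exact Frobenius-norm projection onto rank-<=1 matrices (truncated SVD)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])

# Norm compatibility (4.3) with phi = 1:
# ||Z - P(Z)||_F <= ||Z - X||_F for every rank-1 matrix X (Eckart-Young).
Z = rng.normal(size=(5, 5))
PZ = proj_rank1(Z)
for _ in range(200):
    X = np.outer(rng.normal(size=5), rng.normal(size=5))  # arbitrary rank-1 point of C
    assert np.linalg.norm(Z - PZ) <= np.linalg.norm(Z - X) + 1e-12
```

An approximate projection family {Px} may trade φ > 1 for cheaper computation, which is the regime Theorem 4.1 is designed to handle.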
By Lemma 4.1, the local concavity coefficients of $$\mathscr{C}$$ then satisfy $$\gamma _{u}(\mathscr{C})\leqslant \gamma ^{\textrm{c}}_{u}(\mathscr{C}) +\gamma ^{\textrm{d}}_{u}(\mathscr{C})$$, and so in particular, $$\gamma_{\widehat{x}}(\mathscr{C})\leqslant \gamma ^{\textrm{c}}+\gamma ^{\textrm{d}}$$. We will now apply the first-order optimality conditions (2.4) at the point $$x=\widehat{x}$$. We have   \begin{align} \mathsf{g}\left(x_{t+1}(s)\right)-\mathsf{g}(\widehat{x}) \nonumber&\geqslant \left\langle{x_{t+1}(s)-\widehat{x}},{\nabla\mathsf{g}(\widehat{x})}\right\rangle + \frac{\alpha}{2}\lVert{x_{t+1}(s)-\widehat{x}}{\rVert^{2}_{2}} - \frac{\alpha\varepsilon_{\mathsf{g}}^{2}}{2}\text{ by RSC (3.2)}\\ \nonumber&\geqslant -(\gamma^{\textrm{c}}+\gamma^{\textrm{d}})\lVert{\nabla\mathsf{g}(\widehat{x})}\rVert^{\ast}\lVert{x_{t+1}(s)-\widehat{x}}{\rVert^{2}_{2}} + \frac{\alpha}{2}\lVert{x_{t+1}(s)-\widehat{x}}{\rVert^{2}_{2}}\\ &\quad- \frac{\alpha\varepsilon_{\mathsf{g}}^{2}}{2}\text{ by first-order optimality}\nonumber\\ \nonumber&\geqslant -\frac{\alpha(1-c_{0})}{2}\lVert{x_{t+1}(s)-\widehat{x}}{\rVert^{2}_{2}} + \frac{\alpha}{2}\lVert{x_{t+1}(s)-\widehat{x}}{\rVert^{2}_{2}} - \frac{\alpha\varepsilon_{\mathsf{g}}^{2}}{2}\\ &= \frac{c_{0} \alpha}{2}\lVert{x_{t+1}(s)-\widehat{x}}{\rVert^{2}_{2}} - \frac{\alpha\varepsilon_{\mathsf{g}}^{2}}{2}, \end{align} (B.5)where the next-to-last step applies the initialization condition (4.4) (plus the fact that $$\phi \geqslant 1$$) to bound $$\lVert{\nabla \mathsf{g}(\widehat{x})}\rVert ^{\ast }$$. 
On the other hand, we have   \begin{align} \mathsf{g}\left(x_{t+1}(s)\right)-\mathsf{g}(\widehat{x}) \nonumber=&\,\mathsf{g}(x_{t+1}(s)) - \mathsf{g}(x_{t}) + \mathsf{g}(x_{t}) - \mathsf{g}\left(\widehat{x}\right)\\ \nonumber\leqslant& \left\langle{x_{t+1}(s)-x_{t}},{\nabla\mathsf{g}(x_{t})}\right\rangle + \frac{\beta}{2}\lVert{x_{t+1}(s)-x_{t}}{\rVert^{2}_{2}}+\frac{\alpha\varepsilon_{\mathsf{g}}^{2}}{2} \nonumber\\ &+ \left\langle{x_{t} - \widehat{x}},{\nabla\mathsf{g}(x_{t})}\right\rangle - \frac{\alpha}{2}\lVert{x_{t}-\widehat{x}}{\rVert^{2}_{2}}+\frac{\alpha\varepsilon_{\mathsf{g}}^{2}}{2}\nonumber\\ =&\, \left\langle{x_{t+1}(s)-\widehat{x}},{\nabla\mathsf{g}(x_{t})}\right\rangle + \frac{\beta}{2}\lVert{x_{t+1}(s)-x_{t}}{\rVert^{2}_{2}} - \frac{\alpha}{2}\lVert{x_{t}-\widehat{x}}{\rVert^{2}_{2}} + \alpha\varepsilon_{\mathsf{g}}^{2}, \end{align} (B.6) where the inequality applies RSC (3.2) and RSM (3.3). To bound the remaining inner product term, we have   \begin{align} \nonumber&\left\langle{x_{t+1}(s)-\widehat{x}},{\nabla\mathsf{g}(x_{t})}\right\rangle = \frac{1}{s}\left\langle{x_{t+1}(s)-\widehat{x}},{x_{t} - x^\prime_{t+1}(s)}\right\rangle \\ \nonumber &\quad = \frac{1}{s}\left\langle{x_{t+1}(s)-\widehat{x}},{x_{t} - x_{t+1}(s)}\right\rangle + \frac{1}{s}\left\langle{x_{t+1}(s)-\widehat{x}},{x_{t+1}(s) - x^\prime_{t+1}(s)}\right\rangle\\ &\quad \leqslant \frac{1}{s}\left\langle{x_{t+1}(s)-\widehat{x}},{x_{t} - x_{t+1}(s)}\right\rangle + \frac{\alpha(1-c_{0})}{2(\gamma^{\textrm{c}}+\gamma^{\textrm{d}})} \Big(\gamma^{\textrm{c}}\lVert{\widehat{x}-x_{t+1}(s)}{\rVert^{2}_{2}} +\gamma^{\textrm{d}}\lVert{\widehat{x}-x_{t}}{\rVert^{2}_{2}}\Big), \end{align} (B.7)where the last step applies (B.4). 
For the first term on the right-hand side, we can trivially check that   \begin{align} \frac{1}{s}\left\langle{x_{t+1}(s)-\widehat{x}},{x_{t} - x_{t+1}(s)}\right\rangle = \frac{1}{2s}\lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} - \frac{1}{2s}\lVert{x_{t+1}(s)-\widehat{x}}{\rVert^{2}_{2}} - \frac{1}{2s}\lVert{x_{t+1}(s)-x_{t}}{\rVert^{2}_{2}}. \end{align} (B.8)Combining steps (B.5), (B.6), (B.7) and (B.8), then, since $$\frac{1}{2s}\geqslant \frac{1}{2\eta }=\frac{\beta }{2}$$,   \begin{align*} \frac{c_{0}\alpha}{2}\lVert{x_{t+1}(s)-\widehat{x}}{\rVert^{2}_{2}} \leqslant&\, \frac{1}{2s}\lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} - \frac{1}{2s}\lVert{x_{t+1}(s)-\widehat{x}}{\rVert^{2}_{2}}\\ &+ \frac{\alpha(1-c_{0})}{2\left(\gamma^{\textrm{c}}+\gamma^{\textrm{d}}\right)} \Big(\gamma^{\textrm{c}}\lVert{\widehat{x}-x_{t+1}(s)}{\rVert^{2}_{2}} +\gamma^{\textrm{d}}\lVert{\widehat{x}-x_{t}}{\rVert^{2}_{2}}\Big) - \frac{\alpha}{2}\lVert{x_{t}-\widehat{x}}{\rVert^{2}_{2}} + 1.5\alpha\varepsilon_{\mathsf{g}}^{2}.\end{align*}Rearranging terms we obtain   \begin{align}\lVert{x_{t+1}(s)-\widehat{x}}{\rVert^{2}_{2}} \!\leqslant\! \left(1 - \frac{2 \alpha c_{0}}{\frac{1}{s} + \alpha\left(c_{0} -\! (1\!-c_{0})\cdot \frac{\gamma^{\textrm{c}}}{\gamma^{\textrm{c}}+\gamma^{\textrm{d}}}\right)}\right)\lVert{x_{t}-\widehat{x}}{\rVert^{2}_{2}} +\! \frac{ 3\alpha }{\frac{1}{s} + \alpha\left(c_{0} - (1\!-c_{0})\cdot \frac{\gamma^{\textrm{c}}}{\gamma^{\textrm{c}}+\gamma^{\textrm{d}}}\right)}\varepsilon_{\mathsf{g}}^{2}. \end{align} (B.9) In particular, since $$\lVert{x_{t}-\widehat{x}}\rVert _{2}\leqslant \rho _{0}$$ and $$\varepsilon _{\mathsf{g}}^{2}\leqslant \frac{c_{0} {\rho _{0}^{2}}}{1.5}$$ by assumption, this proves that   \begin{align}\lVert{x_{t+1}(s)-\widehat{x}}\rVert_{2}\leqslant \rho_{0}. \end{align} (B.10)Therefore, we see that $$s\in \mathscr{S}_{0}$$. 
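As a quick numerical illustration of the one-step bound (B.9) — a sketch only, where all constants below are hypothetical choices rather than values from the paper — iterating the scalar recursion $$e_{t+1} = \kappa e_{t} + C\varepsilon_{\mathsf{g}}^{2}$$ shows the behavior the proof establishes: linear (geometric) convergence to an error floor of size $$O(\varepsilon_{\mathsf{g}}^{2})$$.

```python
# Hypothetical constants (alpha, beta, c0, gamma^c, gamma^d are NOT from the paper):
alpha, beta, c0 = 1.0, 4.0, 0.5
gamma_c, gamma_d = 1.0, 1.0
eps2 = 1e-4  # plays the role of eps_g^2

# Contraction factor and additive term from (B.9) with s = eta = 1/beta:
denom = beta + alpha * (c0 - (1 - c0) * gamma_c / (gamma_c + gamma_d))
kappa = 1 - 2 * alpha * c0 / denom  # must lie in (0, 1) for contraction
C = 3 * alpha / denom

e = 1.0  # e_t stands in for ||x_t - xhat||_2^2
for _ in range(200):
    e = kappa * e + C * eps2

# The iterates approach the fixed point C * eps2 / (1 - kappa), which
# simplifies to 3 * eps2 / (2 * c0): an O(eps_g^2)-sized error floor.
fixed_point = C * eps2 / (1 - kappa)
```

Note that the fixed point does not depend on β or on the concavity coefficients; those only affect the contraction rate κ.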
To summarize, we have proved that for all $$s\in \mathscr{S}$$, we also have $$s\in \mathscr{S}_{0}$$; while for $$s\in \mathscr{S}_{0}$$, we also have $$\min \{s+\varDelta ,\eta \}\in \mathscr{S}$$, where $$\varDelta$$ > 0 is fixed. Starting with $$s=0\in \mathscr{S}$$ and proceeding inductively, we see that $$\varDelta \in \mathscr{S}$$, then $$2\varDelta \in \mathscr{S}$$, and so on, until we obtain $$\eta \in \mathscr{S}$$. Therefore, setting s = η and xt+1(s) = xt+1, the above bounds (B.9) and (B.10) will hold. Looking at (B.9) in particular, since s = η = 1/β, we can simplify to   $$\lVert{x_{t+1}-\widehat{x}}{\rVert^{2}_{2}} \leqslant \left(1 - \frac{ 2\alpha c_{0}}{{\beta} + \alpha\left({c_{0}} - ({1-c_{0}})\cdot \frac{\gamma^{\textrm{c}}}{\gamma^{\textrm{c}}+\gamma^{\textrm{d}}}\right)}\right)\lVert{x_{t}-\widehat{x}}{\rVert^{2}_{2}} + \frac{ 3\alpha }{{\beta} + \alpha\left({c_{0}} - ({1-c_{0}})\cdot \frac{\gamma^{\textrm{c}}}{\gamma^{\textrm{c}}+\gamma^{\textrm{d}}}\right)}\varepsilon_{\mathsf{g}}^{2}.$$This proves that the inductive step (B.1) holds for xt+1, as desired, which completes the proof of Theorem 4.1. B.3 Proof of necessity of the initialization conditions (Lemma 3.1) By definition of the local concavity coefficients (2.6), since $$x\not \in \mathscr{C}_{\mathsf{dgn}}$$ and $$\gamma _{x}(\mathscr{C})>0$$, we see that there must exist some $$y\in \mathscr{C}$$ and $$z\in \mathbb{R}^{d}$$, with $$x=P_{{\mathscr{C}}}(z)$$, such that   $$\langle{\,y-x},{z-x}\rangle> \frac{\gamma_{x}(\mathscr{C})}{{1+\varepsilon}}\lVert{z-x}\rVert^{\ast}\lVert{x-y}{\rVert^{2}_{2}}.$$(If no such y, z exist then the local concavity coefficient at x would be $$\leqslant \frac{\gamma _{x}(\mathscr{C})}{{1+\varepsilon }}$$, which is a contradiction.) Next define $$\mathsf{g}(v) = \frac{\alpha }{2}\lVert{v-\widetilde{z}}{\rVert ^{2}_{2}}$$, where $$\widetilde{z} = x + \frac{z-x}{\lVert{z-x}\rVert ^{\ast }}\cdot \frac{{1+\varepsilon }}{2\gamma _{x}(\mathscr{C})}$$. 
Then g is α-strongly convex, and   $$2\gamma_{x}(\mathscr{C})\cdot \lVert{\nabla\mathsf{g}(x)}\rVert^{\ast} = 2 \gamma_{x}(\mathscr{C})\cdot\left\lVert{\alpha (\widetilde{z}-x) }\right\rVert^{\ast} = \alpha(1+\varepsilon).$$ Furthermore, for sufficiently small step size η > 0, we can see that x − η∇g(x) is a convex combination of x and z, and so we have $$P_{{\mathscr{C}}}\left (x-\eta \nabla \mathsf{g}(x)\right ) = x$$ by (A.2). Thus x is a stationary point of projected gradient descent by (A.2), for sufficiently small η > 0. However, x does not minimize g over $$\mathscr{C}$$, since   \begin{align*} \lVert{y - \widetilde{z}}{\rVert^{2}_{2}} - \lVert{x-\widetilde{z}}{\rVert^{2}_{2}} =&\, \lVert{y-x}{\rVert^{2}_{2}} -2\langle{y-x},{\widetilde{z} - x}\rangle = \lVert{y-x}{\rVert^{2}_{2}} -2\langle{y-x},{z - x}\rangle\cdot \frac{{1+\varepsilon}}{2\gamma_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast}}\\ <&\, \lVert{y-x}{\rVert^{2}_{2}} -2\cdot \frac{\gamma_{x}(\mathscr{C})}{{1+\varepsilon}}\lVert{z-x}\rVert^{\ast}\lVert{x-y}{\rVert^{2}_{2}}\cdot \frac{{1+\varepsilon}}{2\gamma_{x}(\mathscr{C})\lVert{z-x}\rVert^{\ast}} =0, \end{align*}which shows that g(y) < g(x). B.4 Proof of equivalence of the exact and approximate settings (Lemma 4.1) B.4.1 Local concavity coefficients ⇒ family of approximate projections Suppose that $$\mathscr{C}$$ has local concavity coefficients $$\gamma _{x}(\mathscr{C})$$ for all $$x\in \mathscr{C}$$. Then the inner product condition (2.5) for exact projection $$P_{{\mathscr{C}}}$$ gives   $$\langle{\,y-u},{z-u}\rangle\leqslant \gamma_{u}(\mathscr{C})\lVert{z-u}\rVert^{\ast}\lVert{y-u}{\rVert^{2}_{2}}$$for all $$u,y\in \mathscr{C}$$ and $$z\in \mathbb{R}^{d}$$ with $$P_{{\mathscr{C}}}(z)=u$$. Therefore, the general condition (4.6) holds for $$P_{{x}}=P_{{\mathscr{C}}}$$ when we set $$\gamma ^{\textrm{c}}_{u}(\mathscr{C})=\gamma _{u}(\mathscr{C})$$ and $$\gamma ^{\textrm{d}}_{u}(\mathscr{C})=0$$. 
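As a concrete numeric sanity check of this direction (a sketch, not part of the proof), one can verify the inner product condition for the exact projection onto a rank-r constraint, where Lemma 5.1 below gives $$\gamma _{X}(\mathscr{C}) = \frac{1}{2\sigma _{r}(X)}$$; the matrix sizes and random-trial setup below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 8, 6, 2  # arbitrary sizes for the rank-r constraint set C

def proj_rank(Z, r):
    """Exact projection onto {rank <= r} via the truncated SVD (Eckart--Young)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

for _ in range(100):
    Z = rng.standard_normal((n, m))
    X = proj_rank(Z, r)                            # X = P_C(Z)
    Y = proj_rank(rng.standard_normal((n, m)), r)  # an arbitrary point of C
    sigma_r = np.linalg.svd(X, compute_uv=False)[r - 1]
    # Inner product condition (2.5) with gamma = 1/(2 sigma_r(X)), i.e. the
    # exact projection taken with gamma^c_u = gamma_u(C) and gamma^d_u = 0:
    lhs = np.sum((Y - X) * (Z - X))
    rhs = np.linalg.norm(Z - X, 2) * np.linalg.norm(Y - X, "fro") ** 2 / (2 * sigma_r)
    assert lhs <= rhs + 1e-9
```

Here `np.linalg.norm(., 2)` on a matrix is the spectral norm, the dual of the nuclear norm used as ‖·‖* in the low-rank example.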
B.4.2 Family of approximate projections ⇒ local concavity coefficients Suppose there exists a family of operators $$P_{{x}}:\mathbb{R}^{d}\rightarrow \mathscr{C}$$ indexed over $$x\in \mathscr{C}$$, satisfying the inner product condition in (4.6). Assume that the local continuity property (4.7) holds. We now check that the local concavity coefficients satisfy $$\gamma _{x}(\mathscr{C}) \leqslant \gamma ^{\textrm{c}}_{x}(\mathscr{C}) + \gamma ^{\textrm{d}}_{x}(\mathscr{C})$$. Fix any $$x\in \mathscr{C}$$. Assume that $$\gamma ^{\textrm{c}}_{x}(\mathscr{C})+\gamma ^{\textrm{d}}_{x}(\mathscr{C})<\infty$$ (otherwise, there is nothing to prove). We will verify that the inner product condition (2.5) holds at this point x with $$\gamma = \gamma ^{\textrm{c}}_{x}(\mathscr{C})+\gamma ^{\textrm{d}}_{x}(\mathscr{C})$$. Fix any ε > 0, $$y\in \mathscr{C}$$ and $$z\in \mathbb{R}^{d}$$ such that $$x=P_{{\mathscr{C}}}(z)\in \mathscr{C}$$. Define   $$z_{t} = (1-t)x+tz.$$We will take limits as t approaches zero throughout the proof. Below we will prove that there exists some t0 > 0 such that   \begin{align} P_{{x}}(z_{t}) = x\textrm{ for all }t\leqslant t_{0}. \end{align} (B.11)Assume for now that this holds. Take any t ∈ (0, t0). 
By the inner product condition (4.6) for the approximate projections Px, since x = Px(zt) for this small t,   \begin{align*}\langle{y-x},{z_{t}-x}\rangle =&\, \langle{y-P_{{x}}(z_{t})},{z_{t} - P_{{x}}(z_{t})}\rangle\\ \leqslant&\, \max\left\{\lVert{z_{t}-P_{{ x}}(z_{t})}\rVert^{\ast},\lVert{z_{t} - x}\rVert^{\ast}\right\}\cdot \left(\gamma^{\textrm{c}}_{P_{{x}}(z_{t})}(\mathscr{C})\lVert{y-P_{{ x}}(z_{t})}{\rVert^{2}_{2}}+\gamma^{\textrm{d}}_{P_{{x}}(z_{t})}(\mathscr{C})\lVert{y- x}{\rVert^{2}_{2}}\right)\\ =&\,\lVert{z_{t} - x}\rVert^{\ast}\cdot \left(\gamma^{\textrm{c}}_{x}(\mathscr{C}) + \gamma^{\textrm{d}}_{x}(\mathscr{C})\right)\cdot \lVert{y- x}{\rVert^{2}_{2}}.\end{align*}Plugging in the definition of zt, and then dividing by t,   $$\langle{\,y -x},{z-x}\rangle\leqslant \lVert{z-x}\rVert^{\ast} \cdot \left(\gamma^{\textrm{c}}_{x}(\mathscr{C})+\gamma^{\textrm{d}}_{x}(\mathscr{C})\right)\lVert{y- x}{\rVert^{2}_{2}},$$proving that the inner product condition (2.5) holds, at this point x and for any $$y\in \mathscr{C}$$, with $$\gamma = \gamma ^{\textrm{c}}_{x}(\mathscr{C})+\gamma ^{\textrm{d}}_{x}(\mathscr{C})$$. By definition of the local concavity coefficients, if $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$, then $$\gamma _{x}(\mathscr{C})\leqslant \gamma = \gamma ^{\textrm{c}}_{x}(\mathscr{C})+\gamma ^{\textrm{d}}_{x}(\mathscr{C})$$. Now we consider the case that $$x\in \mathscr{C}_{\mathsf{dgn}}$$. Since $$x\mapsto \gamma ^{\textrm{c}}_{x}(\mathscr{C})+\gamma ^{\textrm{d}}_{x}(\mathscr{C})$$ is upper semi-continuous by assumption, we can find some r > 0 such that $$\gamma := \sup _{x{^\prime }\in \mathscr{C}\cap \mathbb{B}_{2}(x,r)}\big (\gamma ^{\textrm{c}}_{x^\prime }(\mathscr{C})+\gamma ^{\textrm{d}}_{x^\prime }(\mathscr{C})\big ) <\infty$$. By the reasoning above, then, the inner product condition (2.5) holds, with any $$x^\prime \in \mathscr{C}\cap \mathbb{B}_{2}(x,r)$$ in place of x and for any $$y\in \mathscr{C}$$, with the constant γ. 
However, if $$x\in \mathscr{C}_{\mathsf{dgn}}$$ then this is not possible, due to Lemma 2.3. Thus we have reached a contradiction—if $$\gamma ^{\textrm{c}}_{x}(\mathscr{C})+\gamma ^{\textrm{d}}_{x}(\mathscr{C})<\infty$$ then we must have $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$. This completes the proof that $$\gamma _{x}(\mathscr{C})\leqslant \gamma ^{\textrm{c}}_{x}(\mathscr{C})+\gamma ^{\textrm{d}}_{x}(\mathscr{C})$$ for all $$x\in \mathscr{C}$$, assuming that (B.11) holds. Now it remains to be shown that (B.11) is indeed true. Since $$z_{t}\rightarrow x$$ trivially, and $$P_{{x}}(z_{t})\rightarrow x$$ by the local continuity assumption (4.7), we see that $$(z_{t} - P_{{x}}(z_{t}))\rightarrow 0$$. Since $$u\mapsto \gamma ^{\textrm{c}}_{u}(\mathscr{C})$$ is upper semi-continuous, then $$\gamma ^{\textrm{c}}_{P_{{x}}(z_{t})}\leqslant \gamma ^{\textrm{c}}_{x}(\mathscr{C})+\varepsilon$$ for sufficiently small t > 0, and also $$\gamma ^{\textrm{c}}_{x}(\mathscr{C})<\infty$$ by assumption. Therefore, for sufficiently small t > 0, we have $$2\gamma ^{\textrm{c}}_{P_{{x}}(z_{t})}(\mathscr{C})\max \left \{\lVert{z_{t}-P_{{x}}(z_{t})}\rVert ^{\ast },\lVert{z_{t} - x}\rVert ^{\ast }\right \}\leqslant 1$$. Then, applying the inner product condition (4.6), with u = Px(zt), with y = x and with zt in place of z, we obtain   \begin{align*}&\lVert{z_{t} - P_{{x}}(z_{t})}{\rVert^{2}_{2}} - \lVert{z_{t} - x}{\rVert^{2}_{2}} \\ &\quad=2\left\langle{z_{t} - P_{{x}}(z_{t})},{ x - P_{{x}}(z_{t})}\right\rangle - \lVert{ x-P_{{x}}(z_{t})}{\rVert^{2}_{2}}\\ &\quad\leqslant 2\max\{\lVert{z_{t}-P_{{x}}(z_{t})}\rVert^{\ast},\lVert{z_{t} \!-\! x}\rVert^{\ast}\}\left(\gamma^{\textrm{c}}_{P_{{x}}(z_{t})}(\mathscr{C})\lVert{ x -\! 
P_{{x}}(z_{t})}{\rVert^{2}_{2}}+\gamma^{\textrm{d}}_{P_{{x}}(z_{t})}(\mathscr{C})\lVert{x-\!x}{\rVert^{2}_{2}}\right) - \lVert{ x-P_{{x}}(z_{t})}{\rVert^{2}_{2}}\\ &\quad= \lVert{ x-P_{{x}}(z_{t})}{\rVert^{2}_{2}}\cdot(2\gamma^{\textrm{c}}_{P_{{x}}(z_{t})}(\mathscr{C})\max\left\{\lVert{z_{t}-P_{{x}}(z_{t})}\rVert^{\ast},\lVert{z_{t} - x}\rVert^{\ast}\right\}-1)\leqslant 0, \end{align*}implying that $$\lVert{z_{t} - P_{{x}}(z_{t})}{\rVert ^{2}_{2}}\leqslant \lVert{z_{t} - x}{\rVert ^{2}_{2}}$$. So,   \begin{align*} 0 \leqslant \lVert{z_{t} - x}{\rVert^{2}_{2}} - \lVert{z_{t}-P_{{x}}(z_{t})}{\rVert^{2}_{2}} =&\, 2\left\langle{z_{t} - x},{P_{{x}}(z_{t})-x}\right\rangle - \lVert{P_{{x}}(z_{t})-x}{\rVert^{2}_{2}} \\ =&\, 2t\left\langle{z - x},{P_{{x}}(z_{t})-x}\right\rangle - \lVert{P_{{x}}(z_{t})-x}{\rVert^{2}_{2}}.\end{align*}Furthermore, $$x = P_{{\mathscr{C}}}(z)$$, so   $$\lVert{z-x}{\rVert^{2}_{2}} \leqslant \lVert{z-P_{{x}}(z_{t})}{\rVert^{2}_{2}} = \lVert{z-x}{\rVert^{2}_{2}} + 2\left\langle{z-x},{x-P_{{x}}(z_{t})}\right\rangle + \lVert{x-P_{{x}}(z_{t})}{\rVert^{2}_{2}},$$and so $$2\langle{z-x},{P_{{x}}(z_{t})-x}\rangle \leqslant \lVert{x-P_{{x}}(z_{t})}{\rVert ^{2}_{2}}$$. Multiplying this last bound by t > 0 and combining with the display above, we obtain $$\lVert{x-P_{{x}}(z_{t})}{\rVert ^{2}_{2}}\leqslant t\lVert{x-P_{{x}}(z_{t})}{\rVert ^{2}_{2}}$$; since t < 1, this forces x = Px(zt) for all sufficiently small t > 0. This proves (B.11), thus completing the proof of the lemma. C. Proofs for examples In this section we prove results by calculating the local concavity coefficients $$\gamma _{x}(\mathscr{C})$$ and the norm compatibility constant ϕ for the constraint sets considered in Section 5. 
C.1 Low-rank constraints For the low-rank constraint, we first recall the various matrix norms used in our analysis: the Frobenius norm $$\lVert{{X}}\rVert _{\mathsf{F}}=\sqrt{\sum _{ij} X_{ij}^{2}}$$ is the Euclidean ℓ2 norm when X is reshaped into a vector; the nuclear norm $$\lVert{{X}}\rVert _{\textrm{nuc}}=\sum _{i} \sigma _{i}(X)$$ is the sum of its singular values, and promotes a low-rank solution X (in the same way that, for sparse vector estimation, the ℓ1 norm promotes sparse solutions); and the spectral norm $$\lVert{{X}}\rVert _{\textrm{sp}} = \sigma _{1}(X)$$ is the largest singular value of X (sometimes called the operator norm). Recalling the subspace TX defined in (5.1) for any rank-r matrix X, we begin with an auxiliary lemma: Lemma C.1 Let $$X,Y\in \mathbb{R}^{n\times m}$$ satisfy $$\operatorname{rank}(X),\operatorname{rank}(Y)\leqslant r$$. Then   $$\left\lVert{{P_{{T_{X}}}^{\perp}(Y)}}\right\rVert_{\textrm{nuc}} \leqslant \frac{1}{2\sigma_{r}(X)} \lVert{{X-Y}}\rVert_{\mathsf{F}}^{2}.$$ Proof of Lemma C.1. Assume σr(X) > 0 (otherwise the statement is trivial). For any matrix M ∈ (TX)⊥ with $$\lVert{{M}}\rVert _{\textrm{sp}}\leqslant 1$$, define a function   $$\mathsf{f}_{M}(Z) = \frac{1}{2\sigma_{r}(X)}\lVert{{Z-X}}\rVert_{\mathsf{F}}^{2} - \langle{Z},{M}\rangle$$over matrices $$Z\in \mathbb{R}^{n\times m}$$. 
We can rewrite this as   $$\mathsf{f}_{M}(Z) = \frac{1}{2\sigma_{r}(X)}\left\lVert{{Z - \left(X + \sigma_{r}(X) M\right)}}\right\rVert_{\mathsf{F}}^{2} +\langle{X},{M}\rangle - \frac{\sigma_{r}(X)}{2}\lVert{{M}}\rVert_{\mathsf{F}}^{2}.$$Now, we minimize fM(Z) over a rank constraint:   $$\mathop{\operatorname{arg\,min}}\limits_{\operatorname{rank}(Z)\leqslant r}\mathsf{f}_{M}(Z) = \mathop{\operatorname{arg\,min}}\limits_{Z}\left\{\lVert{{Z - \left(X + \sigma_{r}(X) M\right)}}\rVert_{\mathsf{F}}^{2} : \operatorname{rank}(Z)\leqslant r\right\} = P_{{\mathscr{C}}}\left(X+\sigma_{r}(X)M \right)\!.$$Since $$\sigma _{1}(X), \dots ,\sigma _{r}(X)\geqslant \sigma _{r}(X)$$ while $$\lVert{{\sigma _{r}(X) M}}\rVert _{\textrm{sp}}\leqslant \sigma _{r}(X)$$, and M ∈ (TX)⊥, we see that   $$X = P_{{\mathscr{C}}}\left(X+\sigma_{r}(X)M \right)\!.$$(It may be the case that X and σr(X)M both have one or more singular values exactly equal to σr(X), in which case the projection is not unique, but X is always one of the values of the projection.) So, Z = X minimizes fM(Z) over rank-r matrices, and therefore, for any Z with $$\operatorname{rank}(Z)\leqslant r$$,   $$\mathsf{f}_{M}(Z) \geqslant \mathsf{f}_{M}(X) = \frac{1}{2\sigma_{r}(X)}\lVert{{X-X}}\rVert_{\mathsf{F}}^{2} + \langle{X},{M}\rangle =0,$$since ⟨X, M⟩ = 0 due to M ∈ (TX)⊥. Therefore, in particular, $$\mathsf{f}_{M}(Y) \geqslant 0$$ which implies that   $$\langle{Y},{M}\rangle \leqslant \frac{1}{2\sigma_{r}(X)}\lVert{{Y-X}}\rVert_{\mathsf{F}}^{2}.$$Since this is true for all M ∈ (TX)⊥ with $$\lVert{{M}}\rVert _{\textrm{sp}}\leqslant 1$$, we have proved that   $$\lVert{{P_{{T_{X}}}^{\perp}(Y)}}\rVert_{\textrm{nuc}} = \max_{M\in(T_{X})^{\perp},\lVert{{M}}\rVert_{\textrm{sp}}=1}\langle{Y},{M}\rangle \leqslant \frac{1}{2\sigma_{r}(X)}\lVert{{Y-X}}\rVert_{\mathsf{F}}^{2},$$as desired. 
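Lemma C.1 is also easy to sanity-check numerically. The following sketch (no part of the proof; the dimensions and random rank-r matrices are arbitrary choices) verifies the bound $$\lVert{{P_{{T_{X}}}^{\perp}(Y)}}\rVert _{\textrm{nuc}} \leqslant \lVert{{X-Y}}\rVert _{\mathsf{F}}^{2}/(2\sigma _{r}(X))$$ on random instances.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 7, 5, 2  # arbitrary dimensions

def random_rank_r(rng, n, m, r):
    # A generic rank-r matrix (sigma_r > 0 almost surely)
    return rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

for _ in range(100):
    X = random_rank_r(rng, n, m, r)
    Y = random_rank_r(rng, n, m, r)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    # P_{T_X}^perp(Y) = (I - U U^T) Y (I - V V^T)
    PY = (np.eye(n) - U @ U.T) @ Y @ (np.eye(m) - V @ V.T)
    nuc = np.linalg.svd(PY, compute_uv=False).sum()  # nuclear norm of the projection
    bound = np.linalg.norm(X - Y, "fro") ** 2 / (2 * s[r - 1])
    assert nuc <= bound + 1e-9
```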
Proof of Lemma 5.1 First, let $$P_{{\mathscr{C}}}(Z)$$ be any closest rank-r matrix to Z (not necessarily unique), and let $$U\in \mathbb{R}^{n\times r}$$ and $$V\in \mathbb{R}^{m\times r}$$ be orthonormal bases for the column span and row span of $$P_{{\mathscr{C}}}(Z)$$ (that is, if $$P_{{\mathscr{C}}}(Z)$$ is unique then the columns of U and V are the top r left and right singular vectors of Z). Regardless of uniqueness we will have $$Z-P_{{\mathscr{C}}}(Z)$$ orthogonal to U on the left and to V on the right, i.e. we can write   $$Z-P_{{\mathscr{C}}}(Z) = (\mathbf{I}-UU^{\top})\cdot(Z-P_{{\mathscr{C}}}(Z))\cdot(\mathbf{I}-VV^{\top}).$$We then have   \begin{align*}\left\langle{Y-P_{{\mathscr{C}}}(Z)},{Z-P_{{\mathscr{C}}}(Z)}\right\rangle &= \left\langle{Y-P_{{\mathscr{C}}}(Z)},{\left(\mathbf{I}-UU^{\top}\right)\cdot\left(Z-P_{{\mathscr{C}}}(Z)\right)\cdot\left(\mathbf{I}-VV^{\top}\right)}\right\rangle\\ &= \left\langle{\left(\mathbf{I}-UU^{\top}\right)\cdot(Y-P_{{\mathscr{C}}}(Z))\cdot\left(\mathbf{I}-VV^{\top}\right)},{Z-P_{{\mathscr{C}}}(Z)}\right\rangle\\ &\leqslant \left\lVert{{\left(\mathbf{I}-UU^{\top}\right)\cdot(Y-P_{{\mathscr{C}}}(Z))\cdot\left(\mathbf{I}-VV^{\top}\right)}}\right\rVert_{\textrm{nuc}} \cdot \lVert{{Z-P_{{\mathscr{C}}}(Z)}}\rVert_{\textrm{sp}}\\ &\leqslant \left\lVert{{\left(\mathbf{I}-UU^{\top}\right)\cdot Y\cdot\left(\mathbf{I}-VV^{\top}\right)}}\right\rVert_{\textrm{nuc}}\cdot \lVert{{Z-P_{{\mathscr{C}}}(Z)}}\rVert_{\textrm{sp}}, \end{align*}where the last step holds since $$P_{{\mathscr{C}}}(Z)$$ is spanned by U on the left and V on the right. 
Applying Lemma C.1 with $$X=P_{{\mathscr{C}}}(Z)$$, which trivially has U, V as its left and right singular vectors, we obtain   $$\left\lVert{{\left(\mathbf{I}-UU^{\top}\right)\cdot Y\cdot\left(\mathbf{I}-VV^{\top}\right)}}\right\rVert_{\textrm{nuc}} \leqslant \frac{1}{2\sigma_{r}(P_{{\mathscr{C}}}(Z))}\lVert{{Y-P_{{\mathscr{C}}}(Z)}}\rVert_{\mathsf{F}}^{2}.$$Therefore,   $$\langle{Y-P_{{\mathscr{C}}}(Z)},{Z-P_{{\mathscr{C}}}(Z)}\rangle \leqslant \frac{1}{2\sigma_{r}(P_{{\mathscr{C}}}(Z))} \lVert{{Z-P_{{\mathscr{C}}}(Z)}}\rVert_{\textrm{sp}}\lVert{{Y-P_{{\mathscr{C}}}(Z)}}\rVert_{\mathsf{F}}^{2}.$$This proves that $$\gamma _{X}(\mathscr{C})\leqslant \frac{1}{2\sigma _{r}(X)}$$ for all $$X\in \mathscr{C}$$ by the inner product condition (2.5). To prove equality, take any $$X\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$ (that is, we assume that rank(X) = r), and let $$X=\sigma _{1} u_{1} v_{1}^{\top } + \dots + \sigma _{r} u_{r} v_{r}^{\top }$$ be a singular value decomposition with $$\sigma _{1}\geqslant \dots \geqslant \sigma _{r}>0$$. Let $$u^\prime \in \mathbb{R}^{n},v^\prime \in \mathbb{R}^{m}$$ be unit vectors orthogonal to the left and right singular vectors of X, respectively. Define   $$Y = \sigma_{1} u_{1}v_{1}^{\top} + \dots + \sigma_{r-1}u_{r-1}v_{r-1}^{\top} + \sigma_{r} u^\prime v^\prime{}^{\top}$$and   $$Z = X + c u^\prime v^\prime{}^{\top},$$for some fixed c ∈ (0, σr). 
Then $$P_{{\mathscr{C}}}(Z)=X$$, and we have   $$\langle{Y-X},{Z-X}\rangle = \langle{\sigma_{r} u^\prime v^\prime{}^{\top} - \sigma_{r} u_{r} v_{r}^{\top}},{c u^\prime v^\prime{}^{\top}}\rangle = c\sigma_{r}$$while   $$\lVert{{Z-X}}\rVert_{\textrm{sp}}\lVert{{Y-X}}\rVert_{\mathsf{F}}^{2} = \lVert{{c u^{\prime} {v^\prime}^{\top}}}\rVert_{\textrm{sp}} \lVert{{\sigma_{r} u^{\prime} {v^{\prime}}^{\top} - \sigma_{r} u_{r} v_{r}^{\top}}}\rVert_{\mathsf{F}}^{2} = 2c{\sigma_{r}^{2}},$$therefore by the inner product condition (2.5), we must have $$\gamma _{X}(\mathscr{C})\geqslant \frac{1}{2\sigma _{r}(X)}$$. Turning to the norm compatibility condition, the desired bound is an immediate result of the Eckart–Young theorem (Eckart & Young, 1936), as   $$\lVert{{Z-P_{{\mathscr{C}}}(Z)}}\rVert_{\textrm{sp}} \leqslant \lVert{{Z - X}}\rVert_{\textrm{sp}},$$for all $$X \in \mathscr{C}$$ and $$Z\in \mathbb{R}^{n\times m}$$. Proof of Lemma 5.2 Let X be any matrix with $$\operatorname{rank}(X)\leqslant r$$ and let Z be any matrix. Assume $$X, P_{{X}}(Z)\in \mathbb{B}_{2}(\widehat{X},\rho )$$ for $$\rho = \frac{\sigma _{r}(\widehat{X})}{4}$$. By Weyl’s inequality, we then have $$\sigma _{r}(X), \sigma _{r}(P_{{X}}(Z)) \geqslant \frac{3\sigma _{r}\left (\widehat{X}\right )}{4}$$. Write T = TX for convenience, and define ZT = PT(Z) and Z⊥ = PT⊥(Z). Then $$P_{{X}}(Z) = P_{{\mathscr{C}}}(P_{{T}}(Z)) = P_{{\mathscr{C}}}(Z_{T})$$, and so   \begin{align*} \left\langle{\,\widehat{X}-P_{{X}}(Z)},{Z-P_{{X}}(Z)}\right\rangle &=\left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{Z-P_{{\mathscr{C}}}(Z_{T})}\right\rangle\\ &=\underbrace{\left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{P_{{T}}\left(Z-P_{{\mathscr{C}}}(Z_{T})\right)}\right\rangle}_{\textrm{(Term 1)}} + \underbrace{\left\langle{\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{P_{{T}}^{\perp}\left(Z-P_{{\mathscr{C}}}(Z_{T})\right)}\right\rangle}_{\textrm{(Term 2)}}. \end{align*}First consider (Term 1). 
We have   \begin{align*} &\left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{P_{{T}}\left(Z-P_{{\mathscr{C}}}(Z_{T})\right)}\right\rangle\\ &=\left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{Z_{T} - P_{{T}}\left(P_{{\mathscr{C}}}(Z_{T})\right)}\right\rangle\\ &=\left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{Z_{T} - P_{{\mathscr{C}}}(Z_{T})}\right\rangle + \left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{P_{{T}}^{\perp}(P_{{\mathscr{C}}}(Z_{T}))}\right\rangle\\ &=\left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{Z_{T} - P_{{\mathscr{C}}}(Z_{T})}\right\rangle - \left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{P_{{T}}^{\perp}(Z_{T} - P_{{\mathscr{C}}}(Z_{T}))}\right\rangle\\ &\leqslant\frac{2}{3\sigma_{r}\left(\widehat{X}\right)}\left\lVert{{Z_{T}-P_{{\mathscr{C}}}(Z_{T})}}\right\rVert_{\textrm{sp}}\left\lVert{{\widehat{X}-P_{{\mathscr{C}}}(Z_{T})}}\right\rVert_{\mathsf{F}}^{2}- \left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{P_{{T}}^{\perp}\left(Z_{T} - P_{{\mathscr{C}}}(Z_{T})\right)}\right\rangle\\ &\leqslant\frac{2}{3\sigma_{r}\left(\widehat{X}\right)}\lVert{{Z_{T}-P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}}\left\lVert{{\widehat{X}-P_{{\mathscr{C}}}(Z_{T})}}\right\rVert_{\mathsf{F}}^{2} \\ &\quad+\lVert{{Z_{T} - P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}}\left( \left\lVert{{P_{{T}}^{\perp}\left(\widehat{X}\right)}}\right\rVert_{\textrm{nuc}} + \left\lVert{{P_{{T}}^{\perp}(P_{{\mathscr{C}}}(Z_{T}))}}\right\rVert_{\textrm{nuc}}\right)\\ &\leqslant\frac{2}{3\sigma_{r}\left(\widehat{X}\right)}\lVert{{Z_{T}-P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}}\left\lVert{{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})}}\right\rVert_{\mathsf{F}}^{2} \\ &\quad+ \lVert{{Z_{T} - P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}}\left(\frac{2}{3\sigma_{r}\left(\widehat{X}\right)}\lVert{{\widehat{X}-X}}\rVert_{\mathsf{F}}^{2}+ \frac{2}{3\sigma_{r}\left(\widehat{X}\right)}\lVert{{P_{{\mathscr{C}}}(Z_{T})-X}}\rVert_{\mathsf{F}}^{2}\right), 
\end{align*}where the first inequality applies the inner product condition (2.5), using the fact that $$\gamma _{P_{{\mathscr{C}}}(Z_{T})} = \frac{1}{2\sigma _{r}(P_{{\mathscr{C}}}(Z_{T}))} \leqslant \frac{2}{3\sigma _{r}\left (\widehat{X}\right )}$$; the second inequality uses the duality between nuclear norm and spectral norm; and the third applies Lemma C.1 to both nuclear norm terms since $$\operatorname{rank}(\widehat{X}),\ \operatorname{rank}(P_{{\mathscr{C}}}(Z_{T}))\leqslant r$$ and $$\frac{1}{2\sigma _{r}(X)}\leqslant \frac{2}{3\sigma _{r}\left (\widehat{X}\right )}$$. Also, since   $$\lVert{{P_{{\mathscr{C}}}(Z_{T})-X}}\rVert_{\mathsf{F}}^{2} \leqslant 2\left\lVert{{P_{{\mathscr{C}}}(Z_{T})-\widehat{X}}}\right\rVert_{\mathsf{F}}^{2} + 2\lVert{{\widehat{X}-X}}\rVert_{\mathsf{F}}^{2},$$we can simplify our bound to   $$\left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{P_{{T}}(Z-P_{{\mathscr{C}}}(Z_{T}))}\right\rangle\leqslant \frac{2}{\sigma_{r}\left(\widehat{X}\right)}\lVert{{Z_{T}-P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}}\left(\left\lVert{{\widehat{X}-P_{{\mathscr{C}}}(Z_{T})}}\right\rVert_{\mathsf{F}}^{2} + \lVert{{\widehat{X}-X}}\rVert_{\mathsf{F}}^{2}\right).$$Finally, we have   \begin{align*}\lVert{{Z_{T} - P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}} \leqslant \lVert{{Z - P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}} + \left\lVert{{P_{{T}}^{\perp}(Z)}}\right\rVert_{\textrm{sp}} = \lVert{{Z - P_{{X}}(Z)}}\rVert_{\textrm{sp}} + \left\lVert{{P_{{T}}^{\perp}(Z-X)}}\right\rVert_{\textrm{sp}} \\ \leqslant \lVert{{Z - P_{{X}}(Z)}}\rVert_{\textrm{sp}} + \lVert{{Z-X}}\rVert_{\textrm{sp}} \leqslant 2 \max\left\{\lVert{{Z - P_{{X}}(Z)}}\rVert_{\textrm{sp}}, \lVert{{Z-X}}\rVert_{\textrm{sp}}\right\}\end{align*}since X ∈ T by definition and PT⊥ is contractive with respect to spectral norm. 
Then, returning to the work above,   \begin{align*}&\left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{P_{{T}}(Z-P_{{\mathscr{C}}}(Z_{T}))}\right\rangle\\ &\quad\leqslant \frac{4}{\sigma_{r}\left(\widehat{X}\right)} \max\left\{\lVert{{Z - P_{{X}}(Z)}}\rVert_{\textrm{sp}}, \lVert{{Z-X}}\rVert_{\textrm{sp}}\right\}\left(\left\lVert{{\widehat{X}-P_{{\mathscr{C}}}(Z_{T})}}\right\rVert_{\mathsf{F}}^{2} + \lVert{{\widehat{X}-X}}\rVert_{\mathsf{F}}^{2}\right).\end{align*} Next we turn to (Term 2). We have   \begin{align*} &\left\langle{\,\widehat{X}-P_{{\mathscr{C}}}(Z_{T})},{P_{{T}}^{\perp}\left(Z-P_{{\mathscr{C}}}(Z_{T})\right)}\right\rangle \\ &\quad=\left\langle{P_{{T}}^{\perp}\left(\widehat{X}\right)},{Z-P_{{\mathscr{C}}}(Z_{T})}\right\rangle- \left\langle{P_{{T}}^{\perp}(P_{{\mathscr{C}}}(Z_{T}))},{Z-P_{{\mathscr{C}}}(Z_{T})}\right\rangle\\ &\quad\leqslant \lVert{{P_{{T}}^{\perp}\left(\widehat{X}\right)}}\rVert_{\textrm{nuc}}\lVert{{Z-P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}} + \left\lVert{{P_{{T}}^{\perp}(P_{{\mathscr{C}}}(Z_{T}))}}\right\rVert_{\textrm{nuc}}\lVert{{Z-P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}}\\ &\quad\leqslant \frac{2}{3\sigma_{r}\left(\widehat{X}\right)}\lVert{{\widehat{X}-X}}\rVert_{\mathsf{F}}^{2}\lVert{{Z-P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}} + \frac{2}{3\sigma_{r}\left(\widehat{X}\right)}\lVert{{P_{{\mathscr{C}}}(Z_{T})-X}}\rVert_{\mathsf{F}}^{2}\lVert{{Z-P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}}\\ &\quad\leqslant \frac{2}{\sigma_{r}\left(\widehat{X}\right)}\lVert{{Z-P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}} \left(\lVert{{\widehat{X}-X}}\rVert_{\mathsf{F}}^{2} +\left\lVert{{\widehat{X}-P_{{\mathscr{C}}}(Z_{T})}}\right\rVert_{\mathsf{F}}^{2}\right), \end{align*}where the second inequality applies Lemma C.1 as before, while the third inequality again uses   $$\lVert{{P_{{\mathscr{C}}}(Z_{T})-X}}\rVert_{\mathsf{F}}^{2} \leqslant 2\left\lVert{{P_{{\mathscr{C}}}(Z_{T})-\widehat{X}}}\right\rVert_{\mathsf{F}}^{2} + 
2\lVert{{\widehat{X}-X}}\rVert_{\mathsf{F}}^{2}.$$ Putting the bounds for (Term 1) and (Term 2) together, we conclude that   $$\left\langle{\widehat{X} - P_{{X}}(Z)},{Z - P_{{X}}(Z)}\right\rangle \leqslant \frac{6}{\sigma_{r}\left(\widehat{X}\right)} \cdot\max\left\{\left\lVert{{Z - P_{{X}}(Z)}}\right\rVert_{\textrm{sp}}, \lVert{{Z-X}}\rVert_{\textrm{sp}}\right\}\left(\left\lVert{{\widehat{X}-P_{{\mathscr{C}}}(Z_{T})}}\right\rVert_{\mathsf{F}}^{2} + \left\lVert{{\widehat{X}-X}}\right\rVert_{\mathsf{F}}^{2}\right),$$thus proving the inner product condition (4.2). Now we turn to the norm compatibility condition (4.3). We have   $$\lVert{{Z - P_{{X}}(Z)}}\rVert_{\textrm{sp}} =\lVert{{Z - P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}} \leqslant \lVert{{Z_{T} - P_{{\mathscr{C}}}(Z_{T})}}\rVert_{\textrm{sp}} + \lVert{{Z_{\perp}}}\rVert_{\textrm{sp}} \leqslant \lVert{{Z_{T} - X}}\rVert_{\textrm{sp}} + \lVert{{Z_{\perp}}}\rVert_{\textrm{sp}},$$where the last step holds by the Eckart–Young theorem (Eckart & Young, 1936). Next, since ZT = Z − Z⊥, we have   $$\lVert{{Z_{T}-X}}\rVert_{\textrm{sp}} \leqslant \lVert{{Z-X}}\rVert_{\textrm{sp}} + \lVert{{Z_{\perp}}}\rVert_{\textrm{sp}},$$while since PT⊥(X) = 0, we have   $$\lVert{{Z_{\perp}}}\rVert_{\textrm{sp}} = \left\lVert{{P_{{T}}^{\perp}(Z-X)}}\right\rVert_{\textrm{sp}} \leqslant \lVert{{Z-X}}\rVert_{\textrm{sp}}.$$Combining everything, then,   $$\lVert{{Z-P_{{X}}(Z)}}\rVert_{\textrm{sp}}\leqslant 3\lVert{{Z-X}}\rVert_{\textrm{sp}},$$for any $$X\in \mathscr{C}$$ and any $$Z\in \mathbb{R}^{n\times m}$$. This proves that the norm compatibility condition holds with ϕ = 3. Finally, we consider the local continuity condition (4.5). 
Fix any c, ε > 0 and any $$X\in \mathscr{C}$$ and $$Z\in \mathbb{R}^{n\times m}$$ so that $$\lVert{{X-\widehat{X}}}\rVert _{\mathsf{F}}\leqslant \rho$$ and $$\left \lVert{{P_{{X}}(Z)-\widehat{X}}}\right \rVert _{\mathsf{F}}\leqslant \rho$$ where again $$\rho = \frac{\sigma _{r}\left (\widehat{X}\right )}{4}$$. Suppose that   $$2\left(\gamma^{\textrm{c}}+\gamma^{\textrm{d}}\right)\lVert{{Z-P_{{X}}(Z)}}\rVert_{\textrm{sp}} = \frac{24}{\sigma_{r}\left(\widehat{X}\right)}\lVert{{Z-P_{{X}}(Z)}}\rVert_{\textrm{sp}} \leqslant 1-c$$and take any $$W\in \mathbb{R}^{n\times m}$$ with $$\lVert{{Z-W}}\rVert _{\mathsf{F}}\leqslant \delta := \varepsilon /4.5$$. Then we calculate   $$\gamma_{P_{{X}}(Z)}(\mathscr{C}) \leqslant \frac{1}{2\sigma_{r}(P_{{X}}(Z))}\leqslant \frac{2}{3\sigma_{r}\left(\widehat{X}\right)},$$as before. And,   \begin{align*} \left\lVert{{P_{{T}}(Z)-P_{{X}}(Z)}}\right\rVert_{\textrm{sp}} &\leqslant \left\lVert{{Z-P_{{X}}(Z)}}\right\rVert_{\textrm{sp}} + \left\lVert{{P_{{T}}^{\perp}(Z)}}\right\rVert_{\textrm{sp}}\quad\textrm{ by the triangle inequality}\\ &= \lVert{{Z-P_{{X}}(Z)}}\rVert_{\textrm{sp}} + \left\lVert{{P_{{T}}^{\perp}(Z-X)}}\right\rVert_{\textrm{sp}}\quad\text{ since }P_{{T}}^{\perp}(X)=0\\ &\leqslant \lVert{{Z-P_{{X}}(Z)}}\rVert_{\textrm{sp}} +\! \lVert{{Z\!-\!X}}\rVert_{\textrm{sp}}\ \ \text{since }(P_{{T}}^{\perp})\text{ is contractive with respect to spectral norm}\\ &\leqslant 2\lVert{{Z-P_{{X}}(Z)}}\rVert_{\textrm{sp}} + \left\lVert{{P_{{X}}(Z)-\widehat{X}}}\right\rVert_{\textrm{sp}} + \left\lVert{{X-\widehat{X}}}\right\rVert_{\textrm{sp}}\quad\textrm{ by the triangle inequality}\\ &\leqslant 2\cdot \frac{(1-c)\sigma_{r}\left(\widehat{X}\right)}{24} + 2\rho\quad\text{ since }\lVert{{\cdot}}\rVert_{\textrm{sp}}\leqslant\lVert{{\cdot}}\rVert_{\mathsf{F}}\\ &\leqslant\frac{7\sigma_{r}\left(\widehat{X}\right)}{12}. 
\end{align*} Then   \begin{align*} \left\lVert{{P_{{X}}(Z)-P_{{X}}(W)}}\right\rVert_{\mathsf{F}} &= \lVert{{P_{{\mathscr{C}}}(P_{{T}}(Z)) - P_{{\mathscr{C}}}(P_{{T}}(W))}}\rVert_{\mathsf{F}}\\ &\leqslant \frac{\lVert{{P_{{T}}(Z)- P_{{T}}(W)}}\rVert_{\mathsf{F}}}{1-2\gamma_{P_{{X}}(Z)}(\mathscr{C})\lVert{{P_{{T}}(Z)-P_{{\mathscr{C}}}(P_{{T}}(Z))}}\rVert_{\textrm{sp}}}\quad\text{ by the contraction property (2.3)}\\ &\leqslant \frac{\lVert{{P_{{T}}(Z)- P_{{T}}(W)}}\rVert_{\mathsf{F}}}{1-2\cdot \frac{2}{3\sigma_{r}\left(\widehat{X}\right)} \cdot \frac{7\sigma_{r}\left(\widehat{X}\right)}{12}}\quad\textrm{ by the calculations above}\\ &= 4.5\lVert{{P_{{T}}(Z)- P_{{T}}(W)}}\rVert_{\mathsf{F}}\\ &\leqslant 4.5\lVert{{Z-W}}\rVert_{\mathsf{F}}, \end{align*}since T is a subspace so PT is contractive with respect to the Frobenius norm. Since $$\lVert{{Z-W}}\rVert _{\mathsf{F}}\leqslant \delta = \varepsilon /4.5$$ by assumption, this proves that $$\lVert{{P_{{X}}(Z)-P_{{X}}(W)}}\rVert _{\mathsf{F}}\leqslant \varepsilon$$, as desired. C.2 Sparsity Proof of Lemma 5.3 We check the local concavity coefficients. Fix any $$x\in \mathscr{C}$$. As before, if x is in the interior (i.e. Pen(x) < c) then $$\gamma _{x}(\mathscr{C})=0$$, so we turn to the case that Pen(x) = c, and in particular, $$x\neq 0$$. Without loss of generality, assume that x1 > 0 and that x1 is the smallest non-zero coordinate of x (and then xmin = x1). Choose any $$y\in \mathscr{C}$$ and t ∈ [0, 1]. 
Let   $$x_{t} = (1-t)x + ty\quad\textrm{ and }\quad z_{t} = x_{t} - s_{t}\mathbf{e}_{{1}},$$where $$\mathbf{e}_{{1}}=(1,0,\dots ,0)$$ and   $$s_{t} = t\cdot \frac{\mu/2}{\mathsf{p}^\prime((x_{t})_{1})} \cdot \lVert{x-y}{\rVert^{2}_{2}} .$$Since $$\lim _{t\searrow 0} x_{t}= x$$, and p is continuously differentiable (being both concave and differentiable on the positive real line, it automatically has a continuous derivative there), we have   $$\lim_{t\searrow 0} \frac{s_{t}}{t} = \frac{\mu/2}{\mathsf{p}^\prime(x_{1})} \cdot \lVert{x-y}{\rVert^{2}_{2}} .$$In particular, this implies that, for sufficiently small t, we have (xt)1 > 0 and (zt)1 > 0. We claim that $$\textrm{Pen}(z_{t})\leqslant c$$, in which case   $$\lim_{t\searrow 0}\frac{\min_{x^\prime\in\mathscr{C}}\lVert{x_{t} - x^\prime}\rVert_{1}}{t} \leqslant \lim_{t\searrow 0}\frac{\lVert{x_{t} - z_{t}}\rVert_{1}}{t}\\ = \lim_{t\searrow 0}\frac{s_{t}}{t} = \frac{\mu/2}{\mathsf{p}^\prime(x_{\textrm{min}})} \cdot \lVert{x-y}{\rVert^{2}_{2}},$$which proves the lemma. It now remains to check that $$\textrm{Pen}(z_{t})\leqslant c$$. We have, for coordinate i = 1,   $$\mathsf{p}(|(z_{t})_{1}|) = \mathsf{p}((x_{t})_{1} - s_{t}) \leqslant \mathsf{p}((x_{t})_{1}) - s_{t} \mathsf{p}^\prime((x_{t})_{1}),$$since 0 < (xt)1 − st < (xt)1 and p is concave over $$\mathbb{R}_{+}$$. And, for every coordinate i,   \begin{align*} \mathsf{p}(|(x_{t})_{i}|) &= \mathsf{p}\left(|(1-t)x_{i} + ty_{i}|\right)\\ &\leqslant \mathsf{p}\left((1-t)|x_{i}| + t|y_{i}|\right)\quad\text{ since }\mathsf{p}\text{ is non-decreasing}\\ &\leqslant (1-t)\mathsf{p}(|x_{i}|) + t\,\mathsf{p}(|y_{i}|) + \frac{\mu}{2}t(1-t)(|x_{i}|-|y_{i}|)^{2}\quad\text{ since }t\mapsto\mathsf{p}(t)+\mu t^{2}/2\text{ is convex}\\ &\leqslant (1-t)\mathsf{p}(|x_{i}|) + t\,\mathsf{p}(|y_{i}|) + \frac{\mu}{2}t(1-t)(x_{i}-y_{i})^{2}. 
\end{align*}Therefore,   \begin{align*}\textrm{Pen}(z_{t}) =&\, \sum_{i} \mathsf{p}(|(z_{t})_{i}|) \leqslant \left(\sum_{i} (1-t)\mathsf{p}(|x_{i}|) + t\,\mathsf{p}(|y_{i}|) + \frac{\mu}{2}t(1-t)(x_{i}-y_{i})^{2}\right) - s_{t}\mathsf{p}^\prime((x_{t})_{1})\\ \leqslant&\, (1-t)\textrm{Pen}(x) + t\,\textrm{Pen}(y) +\frac{\mu}{2}t\lVert{x-y}{\rVert^{2}_{2}} - s_{t}\mathsf{p}^\prime((x_{t})_{1}) \leqslant c+\frac{\mu}{2}t\lVert{x-y}{\rVert^{2}_{2}} - s_{t}\mathsf{p}^\prime((x_{t})_{1}) =c,\end{align*}where the second inequality uses $$t(1-t)\leqslant t$$, the third uses $$\textrm{Pen}(x),\textrm{Pen}(y)\leqslant c$$, and the last step holds by definition of st. Proof of Lemma 5.4 Choose any $$z\in \mathbb{R}^{d}$$ and let $$x=P_{{\mathscr{C}}}(z)$$. Without loss of generality we take $$z,x\geqslant 0$$ and consider only the non-trivial case that $$z\not \in \mathscr{C}$$; then Pen(x) = c, as x cannot lie in the interior of the set. Furthermore, we can see that   $$x_{i} = (P_{{\mathscr{C}}}(z))_{i} = \max\left\{0, z_{i} - \lambda \mathsf{p}^\prime(x_{i})\right\}\text{ for all }i=1,\dots,d$$for some $$\lambda \geqslant 0$$ by optimality of x as the closest point to z under the constraint $$\sum _{i} \mathsf{p}(x_{i})\leqslant c$$. That is, $$P_{{\mathscr{C}}}(z)$$ behaves like projection onto a weighted ℓ1 ball, with weights determined by the projection xi itself. Since Pen(x) = c, we have $$\mathsf{p}(x_{i})\leqslant c$$ and so $$\mathsf{p}^\prime (x_{i})\geqslant \mathsf{p}^\prime \left (\mathsf{p}^{-1}(c)\right )$$ for all i, as p′ is non-increasing. Now consider any w with $$\lVert{w-z}\rVert _{\infty }<\lambda \mathsf{p}^\prime \left (\mathsf{p}^{-1}(c)\right )$$. Let v be the vector with entries $$v_{i} = \min \{\max \{0,w_{i}\},z_{i}\}$$, and let $$S=\{i : x_{i}>0\}$$ denote the support of x. Then for all i ∈ S, we have   $$|v_{i} - z_{i}| < \lambda\mathsf{p}^\prime\left(\mathsf{p}^{-1}(c)\right)\leqslant \lambda\mathsf{p}^\prime(x_{i}) = |x_{i} - z_{i}|.$$And, for i ∉ S, $$|v_{i} - z_{i}|\leqslant |z_{i}| = |x_{i} - z_{i}|$$. So, $$\lVert{v-z}\rVert _{2}<\lVert{x-z}\rVert _{2}$$ (since $$x\neq 0$$ and so $$S\neq \emptyset$$). 
Therefore, since x is the closest point to z in $$\mathscr{C}$$, we must have $$v\not \in \mathscr{C}$$. Since $$|v_{i}|\leqslant |w_{i}|$$ for all i by construction of v, this means that $$\textrm{Pen}(w)\geqslant \textrm{Pen}(v)>c$$, and therefore any w with $$\lVert{w-z}\rVert _{\infty }<\lambda \mathsf{p}^\prime (\mathsf{p}^{-1}(c))$$ cannot lie in $$\mathscr{C}$$. In other words,   $$\min_{w\in\mathscr{C}}\lVert{w-z}\rVert_{\infty}\geqslant \lambda\mathsf{p}^\prime\left(\mathsf{p}^{-1}(c)\right).$$On the other hand,   $$\lVert{z-x}\rVert_{\infty} =\max_{i} \left|z_{i} - \max\left\{0, z_{i} - \lambda \mathsf{p}^\prime(x_{i})\right\}\right| \leqslant \lambda\max_{i} \mathsf{p}^\prime(x_{i}) \leqslant \lambda,$$since $$\mathsf{p}^\prime (t)\leqslant 1$$ for all t. So, the norm compatibility condition (3.4) is satisfied with $$\phi =\frac{1}{\mathsf{p}^\prime \left (\mathsf{p}^{-1}(c)\right )}$$. C.3 Other examples Proof of Lemma 5.5 Let $$X,Y\in \mathscr{C}$$. For a fixed t ∈ (0, 1), let (1 − t)X + tY = ADB⊤ be an SVD. Since $$AB^{\top }\in \mathbb{R}^{n\times r}$$ is an orthonormal matrix, we then have   $$\min_{Z\in\mathscr{C}}\left\lVert{{Z - ((1-t)X+tY)}}\right\rVert_{\textrm{nuc}} \leqslant \left\lVert{{AB^{\top} - ((1-t)X+tY)}}\right\rVert_{\textrm{nuc}} = \left\lVert{{AB^{\top} - ADB^{\top}}}\right\rVert_{\textrm{nuc}} = \sum_{i=1}^{r} (1-D_{ii}).$$Furthermore,   \begin{align*}\lVert{{D}}\rVert_{\mathsf{F}}^{2} & = \lVert{{(1-t)X+tY}}\rVert_{\mathsf{F}}^{2} \\ &= (1-t)^{2}\lVert{{X}}\rVert_{\mathsf{F}}^{2} + t^{2}\lVert{{Y}}\rVert_{\mathsf{F}}^{2} + 2t(1-t)\langle{X},{Y}\rangle\\ &= (1-t)^{2}\lVert{{X}}\rVert_{\mathsf{F}}^{2} + t^{2}\lVert{{Y}}\rVert_{\mathsf{F}}^{2} + t(1-t)\left(\lVert{{X}}\rVert_{\mathsf{F}}^{2}+\lVert{{Y}}\rVert_{\mathsf{F}}^{2}-\lVert{{X-Y}}\rVert_{\mathsf{F}}^{2}\right)\\ &= (1-t)^{2}r + t^{2}r + t(1-t)\left(r+r-\lVert{{X-Y}}\rVert_{\mathsf{F}}^{2}\right)\\ &=r -t(1-t)\lVert{{X-Y}}\rVert_{\mathsf{F}}^{2}. 
\end{align*}A trivial calculation shows that $$1-D_{ii} = \frac{1-D_{ii}^{2}}{2} + \frac{(1-D_{ii})^{2}}{2}$$, so we have   \begin{align*}\min_{Z\in\mathscr{C}}\lVert{{Z \!-\! ((1\!-t)X+tY)}}\rVert_{\textrm{nuc}} \!\leqslant\! \sum_{i=1}^{r} (1\!-D_{ii}) \!=\!&\, \sum_{i=1}^{r} \frac{1-D_{ii}^{2}}{2} + \frac{(1-D_{ii})^{2}}{2} = \frac{r - \lVert{{D}}\rVert_{\mathsf{F}}^{2}}{2} + \sum_{i=1}^{r} \frac{(1-D_{ii})^{2}}{2} \\ =&\, \frac{1}{2}t(1-t)\lVert{{X-Y}}\rVert_{\mathsf{F}}^{2}+ \sum_{i=1}^{r} \frac{(1-D_{ii})^{2}}{2}.\end{align*}Furthermore, we can show that the last term is o(t), as follows. For any unit vector $$u\in \mathbb{R}^{r}$$,   $$\lVert{((1-t)X+tY)u}\rVert_{2} \geqslant (1-t)\lVert{Xu}\rVert_{2} - t\lVert{Yu}\rVert_{2} \geqslant 1-2t$$since X, Y are both orthonormal. Therefore (1 − t)X + tY has all its singular values $$\geqslant 1-2t$$, that is, $$D_{ii}\geqslant 1-2t$$ for all i. And trivially $$\lVert{((1-t)X+tY)u}\rVert _{2}\leqslant 1$$ so $$D_{ii}\leqslant 1$$. Then $$\sum _{i=1}^{r} (1-D_{ii})^{2} \leqslant \sum _{i=1}^{r} (2t)^{2} = 4t^{2}r$$, so we have   $$\min_{Z\in\mathscr{C}}\lVert{{Z - ((1-t)X+tY)}}\rVert_{\textrm{nuc}} \leqslant \frac{1}{2}t(1-t)\lVert{{X-Y}}\rVert_{\mathsf{F}}^{2} + 2t^{2}r.$$Dividing by t and taking a limit,   $$\lim_{t\searrow 0}\frac{\min_{Z\in\mathscr{C}}\lVert{{Z - ((1-t)X+tY)}}\rVert_{\textrm{nuc}} }{t}\leqslant \frac{1}{2}\lVert{{X-Y}}\rVert_{\mathsf{F}}^{2}.$$Comparing to the curvature condition (2.1) we see that $$\gamma _{X}(\mathscr{C})\leqslant \frac{1}{2}$$, as desired. Next, to obtain equality, take any $$X\in \mathscr{C}$$. Fix any c ∈ (0, 1). Let $$Y=-X\in \mathscr{C}$$ and $$Z=cX\in \mathbb{R}^{n\times r}$$. Clearly, $$P_{{\mathscr{C}}}(Z)=X$$. 
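The claim $$P_{{\mathscr{C}}}(Z)=X$$ for Z = cX can be confirmed numerically, along with the exact arithmetic identity that results from plugging Y = −X and Z = cX into the contraction bound with γ = 1/2. A minimal sketch, assuming numpy, with the projection onto the orthonormal matrices computed as the polar factor AB⊤ of an SVD:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, c = 7, 3, 0.4

def nearest_orthonormal(M):
    # projection onto C = {orthonormal n x r matrices}: the polar factor A B^T
    A, _, Bt = np.linalg.svd(M, full_matrices=False)
    return A @ Bt

X, _ = np.linalg.qr(rng.standard_normal((n, r)))   # a random element of C
Z = c * X                                          # the point Z = cX from the text
assert np.allclose(nearest_orthonormal(Z), X)      # P_C(Z) = X
# with Y = -X and gamma = 1/2, the contraction bound holds with equality:
lhs = (1 - 0.5 * (1 - c)) * 2 * np.sqrt(r)         # (1 - gamma(1-c)) ||Y - X||_F
rhs = np.linalg.norm(-X - Z)                       # ||Y - Z||_F = (1 + c) sqrt(r)
assert np.isclose(lhs, rhs)
```

The polar factor is the unique nearest orthonormal matrix whenever the input has full column rank, which holds here since c > 0.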
By the contraction property (2.3), we must have   $$\left(1-\gamma_{X}(\mathscr{C})\lVert{Z-X}\rVert^{\ast}\right)\lVert{{Y-X}}\rVert_{\mathsf{F}}\leqslant\lVert{{Y-Z}}\rVert_{\mathsf{F}}.$$Plugging in our choices for Y and Z, we obtain   $$\left(1-\gamma_{X}(\mathscr{C})\cdot(1-c)\right)\cdot 2\sqrt{r} \leqslant (1+c)\sqrt{r},$$and so $$\gamma _{X}(\mathscr{C})\geqslant \frac{1}{2}$$. Now we check the norm compatibility condition. For any $$X\in \mathbb{R}^{n\times r}$$, write X = ADB⊤. Then $$P_{{\mathscr{C}}}(X)=AB^{\top }$$ and so $$\lVert{X-P_{{\mathscr{C}}}(X)}\rVert ^{\ast }=\left \lVert{{ADB^{\top } - AB^{\top }}}\right \rVert _{\textrm{sp}} = \max \{d_{1}-1,1-d_{r}\}$$, where $$d_{1}\geqslant \dots \geqslant d_{r}\geqslant 0$$ are the diagonal entries of D, i.e. the singular values of X. Let $$u\in \mathbb{R}^{r}$$ be the first column of B, so that $$\lVert{Xu}\rVert _{2} = d_{1}$$. Then for any $$W\in \mathscr{C}$$, we have $$\lVert{Wu}\rVert _{2}=1$$ since W is orthonormal and u is a unit vector, so   $$\lVert{X-W}\rVert^{\ast} = \lVert{{X-W}}\rVert_{\textrm{sp}} \geqslant \lVert{(X-W)u}\rVert_{2} \geqslant \lVert{Xu}\rVert_{2} - \lVert{Wu}\rVert_{2} = d_{1} - 1.$$Now let $$v\in \mathbb{R}^{r}$$ be the rth column of B, so that $$\lVert{Xv}\rVert _{2} = d_{r}$$. Similarly we have   $$\lVert{X-W}\rVert^{\ast} = \lVert{{X-W}}\rVert_{\textrm{sp}} \geqslant \lVert{(X-W)v}\rVert_{2} \geqslant \lVert{Wv}\rVert_{2} - \lVert{Xv}\rVert_{2} = 1 - d_{r}.$$Therefore, $$\lVert{X-W}\rVert ^{\ast } \geqslant \max \{d_{1}-1,1-d_{r}\} = \lVert{X-P_{{\mathscr{C}}}(X)}\rVert ^{\ast }$$, proving that the norm compatibility condition holds with ϕ = 1. Proof of Lemma 5.6 For $$X, Y\in \mathscr{C}$$, write X = UU⊤ and Y = V V⊤ for some orthonormal matrices $$U, V\in \mathbb{R}^{n\times r}$$. For t ∈ (0, 1), let Ut = (1 − t)U + tV, and let Ut = ADB⊤ be an SVD. Then AB⊤ is the projection of Ut onto the set of orthonormal n × r matrices. 
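The norm compatibility claim above, including the identity $$\lVert{X-P_{{\mathscr{C}}}(X)}\rVert_{\textrm{sp}}=\max\{d_{1}-1,1-d_{r}\}$$, can be spot-checked on random instances. A hedged numerical sketch, assuming numpy, with the arbitrary element W of $$\mathscr{C}$$ drawn as a random orthonormal matrix via QR:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 7, 3

for _ in range(200):
    X = rng.standard_normal((n, r))
    A, d, Bt = np.linalg.svd(X, full_matrices=False)   # X = A D B^T, d sorted descending
    PX = A @ Bt                                        # P_C(X) = A B^T
    dist = max(d[0] - 1, 1 - d[-1])                    # claimed ||X - P_C(X)||_sp
    assert np.isclose(np.linalg.norm(X - PX, 2), dist)
    W, _ = np.linalg.qr(rng.standard_normal((n, r)))   # an arbitrary element of C
    assert np.linalg.norm(X - W, 2) >= dist - 1e-9     # norm compatibility with phi = 1
```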
Since $$A\in \mathbb{R}^{n\times r}$$ is orthonormal, we have $$AA^{\top }\in \mathscr{C}$$, and so   \begin{align*} \min_{Z\in\mathscr{C}}\lVert{{Z - ((1-t)X+tY)}}\rVert_{\textrm{nuc}} \leqslant&\, \left\lVert{{AA^{\top} - ((1-t)X+tY)}}\right\rVert_{\textrm{nuc}}\\ \leqslant&\, \underbrace{\left\lVert{{AA^{\top} - U_{t} U_{t}^{\top}}}\right\rVert_{\textrm{nuc}}}_{\textrm{(Term 1)}} + \underbrace{\left\lVert{{U_{t} U_{t}^{\top} - ((1-t)X+tY)}}\right\rVert_{\textrm{nuc}}}_{\textrm{(Term 2)}}.\end{align*}For (Term 1),   \begin{align*} \left\lVert{{AA^{\top} - U_{t}U_{t}^{\top}}}\right\rVert_{\textrm{nuc}} &=\left\lVert{{AA^{\top} - ADB^{\top} \cdot BDA^{\top}}}\right\rVert_{\textrm{nuc}}\\ &=\left\lVert{{A(\mathbf{I}_{r} - D^{2})A^{\top}}}\right\rVert_{\textrm{nuc}}\\ &=r - \lVert{{D}}\rVert_{\mathsf{F}}^{2} =r - \lVert{{U_{t}}}\rVert_{\mathsf{F}}^{2}\\ &=r - \lVert{{(1-t)U+tV}}\rVert_{\mathsf{F}}^{2}\\ &=r - (1-t)^{2}\lVert{{U}}\rVert_{\mathsf{F}}^{2} - t^{2}\lVert{{V}}\rVert_{\mathsf{F}}^{2} - 2t(1-t)\langle{U},{V}\rangle\\ &=r - (1-t)^{2}\lVert{{U}}\rVert_{\mathsf{F}}^{2} - t^{2}\lVert{{V}}\rVert_{\mathsf{F}}^{2} -t(1-t)\left(\lVert{{U}}\rVert_{\mathsf{F}}^{2}+\lVert{{V}}\rVert_{\mathsf{F}}^{2} - \lVert{{U-V}}\rVert_{\mathsf{F}}^{2}\right)\\ &=r - (1-t)^{2}r - t^{2}r -t(1-t)\left(2r-\lVert{{U-V}}\rVert_{\mathsf{F}}^{2}\right)\\ &= t(1-t)\lVert{{U-V}}\rVert_{\mathsf{F}}^{2}. \end{align*} For (Term 2),   \begin{align*} \left\lVert{{U_{t}U_{t}^{\top} - ((1-t)X+tY)}}\right\rVert_{\textrm{nuc}} &=\left\lVert{{((1-t)U+tV)((1-t)U+tV)^{\top} - (1-t)UU^{\top} - tVV^{\top}}}\right\rVert_{\textrm{nuc}}\\ &=\left\lVert{{ -t(1-t)UU^{\top} - t(1-t)VV^{\top} + t(1-t)UV^{\top} + t(1-t)VU^{\top}}}\right\rVert_{\textrm{nuc}}\\ &=\left\lVert{{ -t(1-t)(U-V)(U-V)^{\top}}}\right\rVert_{\textrm{nuc}}\\ &=t(1-t)\lVert{{U-V}}\rVert_{\mathsf{F}}^{2}. 
\end{align*} Combining the two, then,   $$\min_{Z\in\mathscr{C}}\lVert{{Z - ((1-t)X+tY)}}\rVert_{\textrm{nuc}} \leqslant 2t(1-t)\lVert{{U-V}}\rVert_{\mathsf{F}}^{2}.$$ Next, note that the choice of U and V is not unique. Fixing any factorizations X = UU⊤ and Y = VV⊤, let U⊤V = ADB⊤ be an SVD and let $$\widetilde{V} = VBA^{\top }$$. Then $$Y=\widetilde{V}\widetilde{V}^{\top }$$, and following the same steps as above we can calculate   $$\min_{Z\in\mathscr{C}}\lVert{{Z - ((1-t)X+tY)}}\rVert_{\textrm{nuc}} \leqslant 2t(1-t)\left\lVert{{U-\widetilde{V}}}\right\rVert_{\mathsf{F}}^{2}\!.$$Furthermore,   \begin{align*}\left\lVert{{U-\widetilde{V}}}\right\rVert_{\mathsf{F}}^{2} =&\, \lVert{{U}}\rVert_{\mathsf{F}}^{2} + \left\lVert{{\widetilde{V}}}\right\rVert_{\mathsf{F}}^{2} - 2\operatorname{trace}\left(U^{\top}\widetilde{V}\right) = 2r - 2\operatorname{trace}\left(U^{\top} VBA^{\top}\right) \\ =&\, 2r -2\operatorname{trace}\left(ADB^{\top} BA^{\top}\right) = 2r - 2\operatorname{trace}(D).\end{align*}And,   \begin{align*}\lVert{{X-Y}}\rVert_{\mathsf{F}}^{2} =&\, \lVert{{X}}\rVert_{\mathsf{F}}^{2}+\lVert{{Y}}\rVert_{\mathsf{F}}^{2} - 2\operatorname{trace}(XY) = 2r - 2\operatorname{trace}\left(UU^{\top} \widetilde{V}\widetilde{V}^{\top}\right) \\ =&\, 2r - 2\left\lVert{{U^{\top}\widetilde{V}}}\right\rVert_{\mathsf{F}}^{2} = 2r - 2\lVert{{D}}\rVert_{\mathsf{F}}^{2} \geqslant 2r - 2\operatorname{trace}(D),\end{align*}since $$\lVert{{D}}\rVert _{\mathsf{F}}^{2} = \sum _{i} (D_{ii})^{2} \leqslant \sum _{i} D_{ii}$$, as $$0\leqslant D_{ii}\leqslant 1$$ for all i since U, V are both orthonormal matrices. 
Therefore, this proves that $$\left \lVert{{U-\widetilde{V}}}\right \rVert _{\mathsf{F}}^{2} \leqslant \lVert{{X-Y}}\rVert _{\mathsf{F}}^{2}$$, and so   $$\min_{Z\in\mathscr{C}}\lVert{{Z - ((1-t)X+tY)}}\rVert_{\textrm{nuc}} \leqslant 2t(1-t)\lVert{{X-Y}}\rVert_{\mathsf{F}}^{2}.$$Based on the curvature condition characterization (2.1) of the local concavity coefficients, we conclude that $$\gamma _{X}(\mathscr{C})\leqslant 2$$, as desired. Footnotes 1  A more general form of this condition, with f Lipschitz, but not necessarily differentiable, appears in Appendix A.2.1. 2  In this definition, we only consider the ‘one-sided’ formulation (2.3) of the contraction property, since the two-sided formulation (2.2) would involve the local concavity coefficient at both x and y due to symmetry—we will see in Lemma 4 below that a version of the two-sided contraction property still holds using local coefficients. 3  The notion of prox-regularity is typically defined over any Hilbert space with its norm $$|\cdot |=\sqrt{\langle{\cdot },{\cdot }\rangle }$$ in place of the ℓ2 norm; however, we restrict to the case of $$\mathbb{R}^{d}$$ for ease of comparison. 4  There also exists in the literature an alternative notion of local prox-regularity, studied by Shapiro (1994), Poliquin et al. (2000), Mazade & Thibault (2013) and others, where $$\mathscr{C}$$ is said to be prox-regular at a point $$u\in \mathscr{C}$$ if (2.8) holds over $$x,y\in \mathscr{C}$$ that lie in some arbitrarily small neighborhood of u. Importantly, this definition of local prox-regularity differs from our notion of local concavity coefficients, since we allow y to range over the entire set; this distinction is critical for studying convergence to a global rather than local minimizer. 5  The notion of prox-regular sets has been extended to Banach spaces, e.g. by Bernard et al. (2006), but this does not relate directly to our results. 
6  For $$X\in \mathscr{C}$$ which is of rank strictly lower than r, we can define TX by taking $$U\in \mathbb{R}^{n\times r}$$, $$V\in \mathbb{R}^{m\times r}$$ to be any orthonormal matrices which contain the column and row span of X; this choice is not unique, but formally we assume that we have fixed some choice of the space TX for each $$X\in \mathscr{C}$$. 7  Code for the simulated data experiments is available at http://www.stat.uchicago.edu/~rina/concavity.html. 8  More precisely, Rockafellar & Wets (2009, Theorem 8.15) assume only that f is proper and lower semi-continuous, but additionally require the condition that $$\partial ^{\infty }\mathsf{f}(x) \cap \big (-N_{\mathscr{C}}(x)\big ) = \{0\}$$ (see Rockafellar & Wets, 2009, Chapter 8 for definitions). Since the horizon subdifferential $$\partial ^{\infty }\mathsf{f}(x)$$ contains only the zero vector for any Lipschitz function f, this condition is automatically satisfied once we assume that f is Lipschitz. References Bernard, F., Thibault, L. & Zlateva, N. (2006) Characterizations of prox-regular sets in uniformly convex Banach spaces. J. Convex Anal., 13, 525. Bubeck, S. (2015) Convex optimization: algorithms and complexity. Foundations and Trends in Machine Learning, 8, 231–357. Candès, E. & Recht, B. (2012) Exact matrix completion via convex optimization. Commun. ACM, 55, 111–119. Candès, E. J., Li, X. & Soltanolkotabi, M. (2015) Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory, 61, 1985–2007. Candès, E. J., Wakin, M. B. & Boyd, S. P. (2008) Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl., 14, 877–905. Canino, A. (1988) On p-convex sets and geodesics. J. Differ. Equ., 75, 118–157. Chartrand, R. 
(2007) Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Process. Lett., 14, 707–710. Chen, Y. & Wainwright, M. J. (2015) Fast low-rank estimation by projected gradient descent: general statistical and algorithmic guarantees. arXiv:1509.03025. Colombo, G. & Marques, M. D. P. M. (2003) Sweeping by a continuous prox-regular set. J. Differ. Equ., 187, 46–62. Colombo, G. & Thibault, L. (2010) Prox-regular sets and applications. Handbook of Nonconvex Analysis, 99–182. Eckart, C. & Young, G. (1936) The approximation of one matrix by another of lower rank. Psychometrika, 1, 211–218. Fan, J. & Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc., 96, 1348–1360. Federer, H. (1959) Curvature measures. Trans. Am. Math. Soc., 93, 418–491. Gunasekar, S., Acharya, A., Gaur, N. & Ghosh, J. (2013) Noisy matrix completion using alternating minimization. Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Berlin: Springer, pp. 194–209. Iusem, A. N., Pennanen, T. & Svaiter, B. F. (2003) Inexact variants of the proximal point algorithm without monotonicity. SIAM J. Optim., 13, 1080–1097. Jain, P., Netrapalli, P. & Sanghavi, S. (2013) Low-rank matrix completion using alternating minimization. Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing. Palo Alto, CA, USA: ACM, pp. 665–674. Jain, P., Tewari, A. & Kar, P. (2014) On iterative hard thresholding methods for high-dimensional M-estimation. Advances in Neural Information Processing Systems. Curran Associates, Inc., pp. 685–693. Knight, K. & Fu, W. (2000) Asymptotics for lasso-type estimators. Ann. Stat. 
, 28, 1356–1378. Lewis, A. S. & Wright, S. J. (2016) A proximal method for composite minimization. Math. Program., 158, 501–546. Loh, P.-L. & Wainwright, M. J. (2013) Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. Advances in Neural Information Processing Systems, pp. 476–484. Mazade, M. & Thibault, L. (2013) Regularization of differential variational inequalities with locally prox-regular sets. Math. Program., 139, 243–269. Meka, R., Jain, P. & Dhillon, I. S. (2009) Guaranteed rank minimization via singular value projection. arXiv:0909.5457. Negahban, S., Yu, B., Wainwright, M. J. & Ravikumar, P. K. (2009) A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Advances in Neural Information Processing Systems. Curran Associates, Inc., pp. 1348–1356. Oymak, S., Recht, B. & Soltanolkotabi, M. (2015) Sharp time–data tradeoffs for linear inverse problems. arXiv:1507.04793. Pennanen, T. (2002) Local convergence of the proximal point algorithm and multiplier methods without monotonicity. Math. Oper. Res., 27, 170–191. Poliquin, R., Rockafellar, R. & Thibault, L. (2000) Local differentiability of distance functions. Trans. Am. Math. Soc., 352, 5231–5249. Rockafellar, R. T. & Wets, R. J.-B. (2009) Variational Analysis, vol. 317. Springer-Verlag Berlin Heidelberg. Shapiro, A. (1994) Existence and differentiability of metric projections in Hilbert spaces. SIAM J. Optim., 4, 130–141. Sun, R. & Luo, Z. Q. (2016) Guaranteed matrix completion via non-convex factorization. IEEE Trans. Inf. Theory, 62, 6535–6579. Tibshirani, R. 
(1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58, 267–288. Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M. & Recht, B. (2015) Low-rank solutions of linear matrix equations via Procrustes flow. arXiv:1507.03566. Zhang, C.-H. (2010) Nearly unbiased variable selection under minimax concave penalty. Ann. Stat., 38, 894–942. Zhao, Z., Wang, Z. & Liu, H. (2015) A nonconvex optimization framework for low rank matrix estimation. Advances in Neural Information Processing Systems. Curran Associates, Inc., pp. 559–567. Zheng, Q. & Lafferty, J. (2015) A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. Advances in Neural Information Processing Systems, pp. 109–117. Zhu, Z., Li, Q., Tang, G. & Wakin, M. B. (2017) Global optimality in low-rank matrix optimization. arXiv:1702.07945. © The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. Published in Information and Inference: A Journal of the IMA.


Advance Article – Mar 5, 2018. 52 pages. Publisher: Oxford University Press.
Our key results, Theorems 2.1 and 2.2, prove that these multiple notions of non-convexity are in fact exactly equivalent, shedding light on the interplay between geometric properties such as curvature, and optimality properties such as the first-order conditions, in a non-convex setting. 
We next prove convergence results for projected gradient descent over a non-convex constraint set, minimizing a function g assumed to exhibit restricted strong convexity (RSC) and restricted smoothness (RSM) (these types of conditions are common in the high-dimensional statistics literature—see e.g. Negahban et al., 2009 for background). We also allow for the projection step, i.e. projection to $$\mathscr{C}$$, to be calculated approximately, which enables greater computational efficiency. Our main convergence analysis shows that, as long as we initialize at a point x0 that is not too far away from $$\widehat{x}$$, projected gradient descent converges linearly to $$\widehat{x}$$ when the constraint set $$\mathscr{C}$$ satisfies the geometric properties described above. Finally, we apply these ideas to a range of specific examples: low-rank matrix estimation (where optimization is carried out under a rank constraint), sparse estimation (with non-convex regularizers such as the Smoothly Clipped Absolute Deviation (SCAD) penalty offering a lower-shrinkage alternative to the ℓ1 norm) and several other non-convex constraints. We discuss some interesting differences between constraining vs. penalizing a non-convex regularization function, in the context of sparse estimation. For the low-rank setting, we propose an approximate projection step that provides a computationally efficient alternative for low-rank estimation problems, which we then explore empirically with simulations. 2. Concavity coefficients for a non-convex constraint space We begin by studying several properties which describe the extent to which the constraint set $$\mathscr{C}\subset \mathbb{R}^{d}$$ deviates from convexity. 
To quantify the concavity of $$\mathscr{C}$$, we will define the (global) concavity coefficient of $$\mathscr{C}$$, denoted $$\gamma = \gamma (\mathscr{C})$$, which we will later expand to local measures of concavity, $$\gamma _{x}(\mathscr{C})$$, indexed over points $$x\in \mathscr{C}$$. We examine several definitions of this concavity coefficient: essentially, we consider four properties that would hold if $$\mathscr{C}$$ were convex, and then use γ to characterize the extent to which these properties are violated. Our definitions are closely connected to the notion of prox-regular sets in the analysis literature, and we will discuss this connection in detail in Section 2.3 below. Since we are interested in developing flexible tools for high-dimensional optimization problems, several different norms will appear in the definitions of the concavity coefficients: The Euclidean ℓ2 norm, $$\lVert{\cdot }\rVert _{2}$$. Projections to $$\mathscr{C}$$ will always be taken with respect to the ℓ2 norm, and our later convergence guarantees will also be given with respect to this norm. If our variable is a matrix $$X\in \mathbb{R}^{n\times m}$$, the Euclidean ℓ2 norm is known as the Frobenius norm, $$\lVert{{X}}\rVert _{\mathsf{F}}=\sqrt{\sum _{ij} X_{ij}^{2}}$$. A ‘structured’ norm $$\lVert{\cdot }\rVert$$, which can be chosen to be any norm on $$\mathbb{R}^{d}$$. In some cases it may be the ℓ2 norm, but often it will be a different norm reflecting natural structure in the problem. For instance, for a low-rank estimation problem, if $$\mathscr{C}$$ is a set of rank-constrained matrices then we will work with the nuclear norm, $$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\textrm{nuc}}$$ (defined as the sum of the singular values of the matrix). For sparse signals, we will instead use the ℓ1 norm, $$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert _{1}$$. A norm $$\lVert{\cdot }\rVert ^{\ast }$$, which is the dual norm to the structured norm $$\lVert{\cdot }\rVert$$. 
For low-rank matrix problems, if we work with the nuclear norm, $$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\textrm{nuc}}$$, then the dual norm is given by the spectral norm, $$\lVert{\cdot }\rVert ^{\ast }=\lVert{{\cdot }}\rVert _{\textrm{sp}}$$ (i.e. the largest singular value of the matrix, also known as the matrix operator norm). For sparse problems, if $$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert _{1}$$ then its dual is given by the $$\ell _{\infty }$$ norm, $$\lVert{\cdot }\rVert ^{\ast }=\lVert{\cdot }\rVert _{\infty }$$. When we take projections to the constraint set $$\mathscr{C}$$, if the minimizer $$P_{{\mathscr{C}}}(z)\in \operatorname{arg\,min}_{x\in \mathscr{C}}\lVert{x-z}\rVert _{2}$$ is non-unique, then we write $$P_{{\mathscr{C}}}(z)$$ to denote any point chosen from this set. Throughout, any assumption or claim involving $$P_{{\mathscr{C}}}(z)$$ should be interpreted as holding for any choice of $$P_{{\mathscr{C}}}(z)$$. From this point on, we will assume without comment that $$\mathscr{C}$$ is closed and non-empty so that the set $$\operatorname{arg\,min}_{x\in \mathscr{C}}\lVert{x-z}\rVert _{2}$$ is non-empty for any z. We now present several definitions of the concavity coefficient of $$\mathscr{C}$$. Curvature First, we define γ as a bound on the extent to which a convex combination of two elements of $$\mathscr{C}$$ may lie outside of $$\mathscr{C}$$: for $$x,y\in \mathscr{C}$$,   \begin{align}\limsup_{t\searrow 0}\frac{\min_{z\in\mathscr{C}}\left\lVert{z - \left((1-t)x + ty\right)}\right\rVert}{t} \leqslant \gamma\lVert{x - y}{\rVert^{2}_{2}}. \end{align} (2.1)Approximate contraction Secondly, we define γ via a condition requiring that the projection operator $$P_{{\mathscr{C}}}$$ is approximately contractive in a neighborhood of the set $$\mathscr{C}$$, that is, $$\lVert{P_{{\mathscr{C}}}(z) - P_{{\mathscr{C}}}(w)}\rVert _{2}$$ is not much larger than $$\lVert{z-w}\rVert _{2}$$: for $$x,y\in \mathscr{C}$$.   
\begin{align}&\text{For any }z,w\in\mathbb{R}^{d} \text{ with } P_{{\mathscr{C}}}(z)=x\text{ and } P_{{\mathscr{C}}}(w)=y,\nonumber\\&\quad\big(1-\gamma\lVert{z-x}\rVert^{\ast}-\gamma\lVert{w-y}\rVert^{\ast}\big) \cdot \lVert{x - y}\rVert_{2} \leqslant \lVert{z - w}\rVert_{2}.\end{align} (2.2)For convenience in our theoretical analysis we will also consider a weaker ‘one-sided’ version of this property, where one of the two points is assumed to already lie in $$\mathscr{C}$$: for $$x,y\in \mathscr{C}$$,   \begin{align}\text{For any }z\in\mathbb{R}^{d}\text{ with }P_{{\mathscr{C}}}(z)=x,\quad \left(1-\gamma\lVert{z-x}\rVert^{\ast}\right) \cdot \lVert{x - y}\rVert_{2} \leqslant \lVert{z-y}\rVert_{2}. \end{align} (2.3)First-order optimality For our third characterization of the concavity coefficient, we consider the standard first-order optimality conditions for minimization over a convex set, and measure the extent to which they are violated when optimizing over $$\mathscr{C}$$: for $$x,y\in \mathscr{C}$$.1  \begin{align} &\text{For any differentiable }\mathsf{f}:\mathbb{R}^{d}\rightarrow \mathbb{R}\text{ such that }x\text{ is a local minimizer of }\mathsf{f}\text{ over }\mathscr{C},\nonumber\\&\quad\langle{y-x},{\nabla\mathsf{f}(x)}\rangle\geqslant - \gamma\lVert{\nabla\mathsf{f}(x)}\rVert^{\ast}\lVert{y-x}{\rVert^{2}_{2}}. \end{align} (2.4)Inner products Fourthly, we introduce an inner product condition, requiring that projection to the constraint set $$\mathscr{C}$$ behaves similarly to a convex projection: for $$x,y\in \mathscr{C}$$,   \begin{align} \text{For any }z\in\mathbb{R}^{d}\text{ with }P_{{\mathscr{C}}}(z)=x,\quad \langle{y-x},\ {z-x}\rangle \leqslant \gamma \lVert{z-x}\rVert^{\ast}\lVert{y-x}{\rVert^{2}_{2}}. 
\end{align} (2.5) We will see later that, by choosing $$\lVert{\cdot }\rVert$$ to reflect the structure in the signal (rather than working only with the ℓ2 norm), we are able to obtain a more favorable scaling in our concavity coefficients, and hence to prove meaningful convergence results in high-dimensional settings. On the other hand, regardless of our choice of $$\lVert{\cdot }\rVert$$, note that the ℓ2 norm also appears in the definition of the concavity coefficients, as is natural when working with inner products. Our first main result shows that the above conditions are in fact exactly equivalent: Theorem 2.1 The properties (2.1), (2.2), (2.3), (2.4) and (2.5) are equivalent; that is, for a fixed choice $$\gamma \in [0,\infty ]$$, they either all hold for every $$x,y\in \mathscr{C}$$, or all fail to hold for some $$x,y\in \mathscr{C}$$. Formally, we will define $$\gamma (\mathscr{C})$$ to be the smallest value such that the above properties hold:   $$\gamma(\mathscr{C}):= \min\left\{\gamma\in[0,\infty] : \text{Properties 2.1, 2.2, 2.3, 2.4, 2.5 hold for all}\ x,y\in\mathscr{C}\,\right\}\!.$$However, this global coefficient $$\gamma (\mathscr{C})$$ is often of limited use in practical settings, since many sets are well behaved locally but not globally. For instance, the set $$\mathscr{C}\!=\!\left \{X\!\in\! \mathbb{R}^{n\times m}:\operatorname{rank}(X)\!\leqslant\! r\right \}$$ has $$\gamma (\mathscr{C})\!=\!\infty$$, but exhibits smooth curvature and good convergence behavior as long as we stay away from rank-degenerate matrices (that is, matrices with rank(X) < r). Since we may often want to ensure convergence in this type of setting where global concavity cannot be bounded, we next turn to a local version of the same concavity bounds. 2.1. Local concavity coefficients We now consider the local concavity coefficients$$\gamma _{x}(\mathscr{C})$$, measuring the concavity in a set $$\mathscr{C}$$ relative to a specific point x in the set. 
We will see examples later on where $$\gamma (\mathscr{C})=\infty$$, but $$\gamma _{x}(\mathscr{C})$$ is bounded for many points $$x\in \mathscr{C}$$. First we define a set of ‘degenerate points’,   $$\mathscr{C}_{\mathsf{dgn}} = \left\{x\in\mathscr{C}:P_{{\mathscr{C}}}\text{ is not continuous over any neighborhood of }x\right\}\!,$$and then let   \begin{align}\gamma_{x}(\mathscr{C}) = \begin{cases} \infty,&x\in\mathscr{C}_{\mathsf{dgn}},\\ \min\left\{\gamma\in[0,\infty]: \text{Property (*) holds for this point }x\text{ and any }y\in\mathscr{C}\right\}\!,&x\not\in\mathscr{C}_{\mathsf{dgn}}, \end{cases} \end{align} (2.6)where the property (*) may refer to any of the four definitions of the concavity coefficients,2 namely (2.1), (2.3), (2.4) or (2.5). We will see shortly why it is necessary to make an exception for the degenerate points $$x\in \mathscr{C}_{\mathsf{dgn}}$$ in the definition of these coefficients. Our next main result shows that the equivalence between the four properties (2.1), (2.3), (2.4) and (2.5) in terms of the global concavity coefficient $$\gamma (\mathscr{C})$$ holds also for the local coefficients: Theorem 2.2 For all $$x\in \mathscr{C}$$, the definition (2.6) of $$\gamma _{x}(\mathscr{C})$$ is equivalent for all four choices of the property (*), namely the conditions (2.1), (2.3), (2.4) or (2.5). To develop an intuition for the global and local concavity coefficients, we give a simple example in $$\mathbb{R}^{2}$$ (relative to the ℓ2 norm, i.e. $$\lVert{\cdot }\rVert =\lVert{\cdot }\rVert ^{\ast }=\lVert{\cdot }\rVert _{2}$$), displayed in Fig. 1. Define $$\mathscr{C}=\left \{x\in \mathbb{R}^{2}: x_{1}\leqslant 0\textrm{ or }x_{2}\leqslant 0\right \}$$. Due to the degenerate point x = (0, 0), we can see that $$\gamma (\mathscr{C})=\infty$$ in this case. 
The local concavity coefficients are given by  $$\begin{cases} \gamma_{x}(\mathscr{C}) = \infty,&\textrm{ if }x=(0,0),\\[5pt] \gamma_{x}(\mathscr{C}) = \frac{1}{2t},&\textrm{ if } x = (t,0)\textrm{ or }(0,t)\textrm{ for }t>0,\\[5pt] \gamma_{x}(\mathscr{C}) = 0,&\textrm{ if }x_{1}<0\textrm{ or }x_{2}<0.\end{cases}$$Note that at the degenerate point x = (0, 0), $$\mathscr{C}$$ actually contains all convex combinations of this point x with any $$y\in \mathscr{C}$$, and so the curvature condition (2.1) is satisfied with γ = 0. However, $$x\in \mathscr{C}_{\mathsf{dgn}}$$, so we nonetheless set $$\gamma _{x}(\mathscr{C})=\infty$$. Fig. 1. A simple example of the local concavity coefficients on $$\mathscr{C}=\{x\in \mathbb{R}^{2}:x_{1}\leqslant 0\textrm{ or }x_{2}\leqslant 0\}$$. The gray shaded area represents $$\mathscr{C}$$ while the numbers give the local concavity coefficients at each marked point. Practical high-dimensional examples, such as a rank constraint, will be discussed in depth in Section 5. For example, we will see that, for the rank-constrained set $$\mathscr{C}=\left \{X\in \mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right \}$$, the local concavity coefficients satisfy $$\gamma _{X}(\mathscr{C})= \frac{1}{2\sigma _{r}(X)}$$ relative to the nuclear norm. In general, the local coefficients can be interpreted as follows: If x lies in the interior of $$\mathscr{C}$$, or if $$\mathscr{C}$$ is convex, then $$\gamma _{x}(\mathscr{C})=0$$. If x lies on the boundary of $$\mathscr{C}$$, which is a non-convex set with a smooth boundary, then we will typically see a finite but non-zero $$\gamma _{x}(\mathscr{C})$$. 
$$\gamma _{x}(\mathscr{C})=\infty$$ can indicate a non-convex cusp or other degeneracy at the point x. 2.2. Properties We next prove some properties of the local coefficients $$\gamma _{x}(\mathscr{C})$$ that will be useful for our convergence analysis, as well as for gaining intuition for these coefficients. First, the global and local coefficients are related in the natural way: Lemma 2.3 For any $$\mathscr{C}$$, $$\gamma (\mathscr{C})=\sup _{x\in \mathscr{C}}\gamma _{x}(\mathscr{C})$$. Next, observe that $$x\mapsto \gamma _{x}(\mathscr{C})$$ is not continuous in general (in particular, since $$\gamma _{x}(\mathscr{C})=0$$ in the interior of $$\mathscr{C}$$, but is often positive on the boundary). However, this map does satisfy upper semi-continuity: Lemma 2.4 The function $$x\mapsto \gamma _{x}(\mathscr{C})$$ is upper semi-continuous over $$x\in \mathscr{C}$$. Furthermore, setting $$\gamma _{x}(\mathscr{C})=\infty$$ at the degenerate points $$x\in \mathscr{C}_{\mathsf{dgn}}$$ is natural in the following sense: the resulting map $$x\mapsto \gamma _{x}(\mathscr{C})$$ is the minimal upper semi-continuous map such that the relevant local concavity properties are satisfied. We formalize this with the following lemma: Lemma 2.5 For any $$u\in \mathscr{C}_{\mathsf{dgn}}$$, for any of the four conditions, (2.1), (2.3), (2.4) or (2.5), this property does not hold in any neighborhood of u for any finite γ. That is, for any r > 0,   $$\min\Big\{\gamma\geqslant 0:\text{ Property (*) holds for all }x\in\mathscr{C}\cap\mathbb{B}_{2}(u,r)\text{ and for all }y\in\mathscr{C}\Big\}= \infty,$$where (*) may refer to any of the four equivalent properties, i.e. (2.1), (2.3), (2.4) and (2.5). (Here, $$\mathbb{B}_{2}(u,r)$$ is the ball of radius r around the point u, with respect to the ℓ2 norm.) 
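To make the role of the degenerate points concrete, the Fig. 1 set $$\mathscr{C}=\{x\in \mathbb{R}^{2}:x_{1}\leqslant 0\textrm{ or }x_{2}\leqslant 0\}$$ can be checked numerically. The sketch below (our own illustration, not from the paper) computes $$P_{{\mathscr{C}}}$$ in closed form and shows that the projection jumps across the diagonal arbitrarily close to the degenerate point (0, 0), while it is continuous near a point (t, 0) with t > 0, where $$\gamma _{x}(\mathscr{C})=\frac{1}{2t}$$ is finite.

```python
import numpy as np

# l2 projection onto C = {x in R^2 : x1 <= 0 or x2 <= 0} (the set of Fig. 1).
def proj(z):
    z1, z2 = z
    if z1 <= 0 or z2 <= 0:
        return np.array([z1, z2], dtype=float)  # z is already in C
    # otherwise zero out whichever positive coordinate is cheaper to drop
    return np.array([0.0, z2]) if z1 < z2 else np.array([z1, 0.0])

# Near (0,0): two inputs at distance 2*delta project to points at distance
# about s*sqrt(2), so P_C is discontinuous in every neighborhood of (0,0).
s, delta = 0.1, 1e-6
p_above = proj((s, s + delta))          # lands on the x2-axis
p_below = proj((s + delta, s))          # lands on the x1-axis
gap = np.linalg.norm(p_above - p_below)
assert gap > s                          # the jump does not vanish with delta

# Near x = (1, 0), by contrast, the projection varies continuously.
q1, q2 = proj((1.0, 1e-6)), proj((1.0, -1e-6))
assert np.linalg.norm(q1 - q2) < 1e-5
```

Shrinking s moves the jump closer to the origin without shrinking its relative size, which is exactly why (0, 0) belongs to $$\mathscr{C}_{\mathsf{dgn}}$$.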
Finally, the next result shows that the two-sided contraction property (2.2) holds using local coefficients, meaning that all five definitions of concavity coefficients are equivalent: Lemma 2.6 For any $$z,w\in \mathbb{R}^{d}$$,   $$\left(1-\gamma_{P_{{\mathscr{C}}}(z)}(\mathscr{C})\lVert{z-P_{{\mathscr{C}}}(z)}\rVert^{\ast}-\gamma_{P_{{\mathscr{C}}}(w)}(\mathscr{C})\lVert{w-P_{{\mathscr{C}}}(w)}\rVert^{\ast}\right) \cdot \lVert{P_{{\mathscr{C}}}(z)-P_{{\mathscr{C}}}(w)}\rVert_{2} \leqslant \lVert{z - w}\rVert_{2}.$$In particular, for any fixed c ∈ (0, 1), it follows that   \begin{align} P_{{\mathscr{C}}}\text{ is }\tfrac{1}{c}\text{-Lipschitz over the set }\left\{z\in\mathbb{R}^{d}:2\gamma_{P_{{\mathscr{C}}}(z)}(\mathscr{C})\lVert{z-P_{{\mathscr{C}}}(z)}\rVert^{\ast}\leqslant 1-c\right\}, \end{align} (2.7)where the Lipschitz constant is defined with respect to the ℓ2 norm. This provides a sort of converse to our definition of the degenerate points, where we set $$\gamma _{x}(\mathscr{C})=\infty$$ for all $$x\in \mathscr{C}_{\mathsf{dgn}}$$, i.e. all points x where $$P_{{\mathscr{C}}}$$ is not continuous in any neighborhood of x. 2.3. Connection to prox-regular sets The notion of prox-regular sets and sets of positive reach arises in the literature on non-smooth analysis in Hilbert spaces, for instance see the study by Colombo & Thibault (2010) for a comprehensive overview of the key results in this area. The work on prox-regular sets generalizes also to the notion of prox-regular functions (see e.g. Rockafellar & Wets, 2009, Chapter 13.F). A prox-regular set is a set $$\mathscr{C}\subset \mathbb{R}^{d}$$ that satisfies3  \begin{align} \langle\,{y-x},\ {z-x}\,\rangle\leqslant\frac{1}{2\rho}\lVert{z-x}\rVert_{2}\lVert{y-x}{\rVert^{2}_{2}}, \end{align} (2.8)for all $$x,y\in \mathscr{C}$$ and all $$z\in \mathbb{R}^{d}$$ with $$P_{{\mathscr{C}}}(z)=x$$, for some constant ρ > 0. 
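As a quick numerical sanity check of (2.8) (our own illustration, not from the paper): the unit circle in $$\mathbb{R}^{2}$$ has reach ρ = 1, with $$P_{{\mathscr{C}}}(z)=z/\lVert{z}\rVert _{2}$$ for z ≠ 0, so the inequality can be verified directly by simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# C = unit circle in R^2 (the circle itself, not the disk); its reach is rho = 1.
for _ in range(1000):
    z = rng.normal(size=2)                 # a generic point of R^2
    x = z / np.linalg.norm(z)              # x = P_C(z), nearest point on the circle
    theta = rng.uniform(0.0, 2.0 * np.pi)
    y = np.array([np.cos(theta), np.sin(theta)])   # any other point of C
    lhs = np.dot(y - x, z - x)
    # condition (2.8) with rho = 1, i.e. concavity coefficient gamma = 1/2
    rhs = 0.5 * np.linalg.norm(z - x) * np.linalg.norm(y - x) ** 2
    assert lhs <= rhs + 1e-9
```

For z strictly inside the circle the inequality holds with equality, so ρ = 1 cannot be improved for this set.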
To capture the local variations in concavity over the set $$\mathscr{C}$$, $$\mathscr{C}$$ is prox-regular with respect to a continuous function $$\rho :\mathscr{C}\rightarrow (0,\infty ]$$ if   \begin{align} \langle\,{y-x},{z-x}\,\rangle\leqslant\frac{1}{2\rho(x)}\lVert{z-x}\rVert_{2}\lVert{y-x}{\rVert^{2}_{2}} \end{align} (2.9)for all $$x,y\in \mathscr{C}$$ and all $$z\in \mathbb{R}^{d}$$ with $$P_{{\mathscr{C}}}(z)=x$$ (see e.g. Colombo & Thibault, 2010, Theorem 3b).4 Historically, prox-regularity was first formulated via the notion of ‘positive reach’ (Federer, 1959): the parameter ρ appearing in (2.8) is the largest radius such that the projection operator $$P_{{\mathscr{C}}}$$ is unique for all points z within distance ρ of the set $$\mathscr{C}$$; in the local version (2.9), the radius is allowed to vary locally as a function of $$x\in \mathscr{C}$$. The definitions (2.8) and (2.9) exactly coincide with our inner product condition (2.5), in the special case that $$\lVert{\cdot }\rVert$$ is the ℓ2 norm, by taking $$\gamma = \frac{1}{2\rho }$$ or, for the local coefficients, $$\gamma =\frac{1}{2\rho (x)}$$. In the ℓ2 setting, there is substantial literature exploring the equivalence between many different characterizations of prox-regularity, including properties that are equivalent to each of our characterizations of the local concavity coefficients. Here we note a few places in the literature where these conditions appear, and refer the reader to the study by Colombo & Thibault (2010) for historical background on these ideas. The curvature condition (2.1) is proved in the study by Colombo & Thibault (2010, Proposition 9, Theorem 14(q)). 
The one- and two-sided contraction conditions (2.3) and (2.2) appear in the studies by Federer (1959, Section 4.8) and Colombo & Thibault (2010, Theorem 14(g)); the inner product condition (2.5) can be found in the studies by Federer (1959, Section 4.8), Colombo & Thibault (2010, Theorem 3(b)), Canino (1988, Definition 1.5) and Colombo & Marques (2003, Definition 2.1). The first-order optimality condition (2.4) is closely related to the inner product condition, when formulated using the ideas of normal cones and proximal normal cones (for instance, in the study by Rockafellar & Wets, 2009, Theorem 6.12 relates gradients of f to normal cones at x). The distinctions between our definitions and results on local concavity coefficients, and the literature on prox-regularity, center on two key differences: the role of continuity, and the flexibility of the structured norm $$\lVert{\cdot }\rVert$$ (rather than the ℓ2 norm). We discuss these two separately. Continuity In the literature on prox-regular sets, the ‘reach’ function $$x\mapsto \rho (x)\in (0,\infty ]$$ is assumed to be continuous (Colombo & Thibault, 2010, Definition 1). Equivalently, we could take a continuous function $$x\mapsto \gamma _{x} = \frac{1}{2\rho (x)}\in [0,\infty )$$ to agree with the notation of our local concavity coefficients. However, this is not the same as finding the smallest value γx such that the concavity coefficient conditions are satisfied (locally at the point x). For our definitions, we do not enforce continuity of the map x ↦ γx, and instead define $$\gamma _{x}(\mathscr{C})$$ as the smallest value such that the conditions are satisfied. This leads to substantial challenges in proving the equivalence of the various conditions; in Lemma 2.4 we prove that the map is naturally upper semi-continuous, which allows us to show the desired equivalences. 
In terms of practical implications, in order to use the local concavity coefficients to describe the convergence behavior of optimization problems, it is critical that we allow for discontinuity. For instance, suppose that $$\mathscr{C}$$ is non-convex, and its interior $$\mathsf{Int}(\mathscr{C})$$ is non-empty. For any $$x\in \mathsf{Int}(\mathscr{C})$$, the concavity coefficient conditions are satisfied with γx = 0. In particular, consider the first-order optimality condition (2.4): if $$x\in \mathsf{Int}(\mathscr{C})$$ is a local minimizer of some function f over $$\mathscr{C}$$, then x is in fact an unconstrained local minimizer of f, and we must have ∇f(x) = 0. On the other hand, since $$\mathscr{C}$$ is non-convex, we must have γx > 0 for at least some of the points x on the boundary of $$\mathscr{C}$$. If we were to require a continuity assumption on the function x↦γx, then we would be forced to have γx > 0 for some points $$x\in \mathsf{Int}(\mathscr{C})$$ as well. This means that γx would not give a precise description of the behavior of first-order methods when constraining to $$\mathscr{C}$$: it would not reveal that non-global minima are impossible in the interior of the set. More generally, we will show in Lemma 3.2 that the local concavity coefficients (defined as the lowest possible constants, as in (2.6)) provide a tight characterization of the convergence behavior of projected gradient descent over the constraint set $$\mathscr{C}$$; if we enforced continuity, we would be forced to choose larger values for $$\gamma _{x}(\mathscr{C})$$ at some points $$x\in \mathscr{C}$$, and the concavity coefficients would no longer be both necessary and sufficient for convergence. 
One related point is that, by allowing for $$\gamma _{x}(\mathscr{C})$$ to be infinite if needed (which would be equivalent to allowing the ‘reach’ ρ(x) to be zero for some x), we can accommodate constraint sets such as the low-rank matrix constraint, $$\mathscr{C}=\left \{X\in \mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right \}$$. Recalling that $$\gamma _{X}(\mathscr{C})=\frac{1}{2\sigma _{r}(X)}$$ as mentioned earlier, we see that a rank-deficient matrix X (i.e. rank(X) < r) will have $$\gamma _{X}(\mathscr{C})=\infty$$. By not requiring that the concavity coefficient is finite (equivalently, that the reach is positive), we avoid the need for any inelegant modifications (e.g. working with a truncated set such as $$\mathscr{C}=\left \{X:\operatorname{rank}(X)\leqslant r,\sigma _{r}(X)\geqslant \varepsilon \right \}$$). Structured norms Prox-regularity (or equivalently the notion of positive reach) is studied in the literature in a Hilbert space, with respect to its norm, which in $$\mathbb{R}^{d}$$ means the ℓ2 norm (or a weighted ℓ2 norm).5 In contrast, our work defines local concavity coefficients with respect to a general structured norm $$\lVert{\cdot }\rVert$$, such as the ℓ1 norm in a sparse signal estimation setting. To see the distinction, compare our inner product condition (2.5) with the definition of prox-regularity (2.8). Of course, the equivalence of all norms on $$\mathbb{R}^{d}$$ means that if $$\gamma (\mathscr{C})$$ is finite when defined with respect to the ℓ2 norm (i.e. $$\mathscr{C}$$ is prox-regular), then it is finite with respect to any other norm—so the importance of the distinction may not be immediately clear. As an example, let $$\gamma ^{\ell _{1}}(\mathscr{C})$$ and $$\gamma ^{\ell _{2}}(\mathscr{C})$$ denote the concavity coefficients with respect to the ℓ1 and ℓ2 norms. 
Since $$\lVert{\cdot }\rVert _{2}\leqslant \lVert{\cdot }\rVert _{1}\leqslant \sqrt{d}\lVert{\cdot }\rVert _{2}$$, we could trivially show that   $$\gamma^{\ell_{2}}(\mathscr{C})\leqslant \gamma^{\ell_{1}}(\mathscr{C})\leqslant \sqrt{d}\cdot \gamma^{\ell_{2}}(\mathscr{C}),$$but the factor $$\sqrt{d}$$ is unfavorable, so in many settings this is a very poor bound on $$\gamma ^{\ell _{1}}(\mathscr{C})$$. We may then ask, why can we not simply define the coefficients in terms of the ℓ2 norm? The reason is that in optimization problems arising in high-dimensional settings (for instance, high-dimensional regression in statistics), structured norms such as the ℓ1 norm (for problems involving sparse signals) or the nuclear norm (for low-rank signals) allow for statistical and computational analyses that would not be possible with the ℓ2 norm. In particular, we will see later on that convergence for the minimization problem $$\min _{x\in \mathscr{C}}\mathsf{g}(x)$$ will depend on bounding $$\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast }$$. If $$\lVert{\cdot }\rVert$$ is the ℓ1 norm, for instance, then $$\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast } = \lVert{\nabla \mathsf{g}(x)}\rVert _{\infty }$$ will in general be much smaller than $$\lVert{\nabla \mathsf{g}(x)}\rVert _{2}$$. For instance, in a statistical problem, if ∇g(x) consists of Gaussian or sub-Gaussian noise at the true parameter vector x, then $$\lVert{\nabla \mathsf{g}(x)}\rVert _{\infty }\sim \sqrt{\log (d)}$$ while $$\lVert{\nabla \mathsf{g}(x)}\rVert _{2}\sim \sqrt{d}$$. Therefore, being able to bound the concavity of $$\mathscr{C}$$ with respect to the ℓ1 norm rather than the ℓ2 norm is crucial for analyzing convergence in a high-dimensional setting. In the next section, we will study how the choice of the norm $$\lVert{\cdot }\rVert$$ and its dual $$\lVert{\cdot }\rVert ^{\ast }$$ relates to the convergence properties of projected gradient descent. 3. 
Fast convergence of projected gradient descent Consider an optimization problem constrained to a non-convex set, $$\min \{ \mathsf{g}(x)\! :\! x\in \mathscr{C}\}$$, where $$\mathsf{g}:\mathbb{R}^{d}\!\rightarrow\! \mathbb{R}$$ is a differentiable function. We will work with projected gradient descent algorithms in the setting where g is convex or approximately convex, while $$\mathscr{C}$$ is non-convex with local concavity coefficients $$\gamma _{x}(\mathscr{C})$$. After choosing some initial point $$x_{0}\in \mathscr{C}$$, for each t ⩾ 0 we define   \begin{align}\begin{cases} x^{\prime}_{t+1} = x_{t} - \eta\nabla\mathsf{g}\left(x_{t}\right)\!,\\ x_{t+1} = P_{{\mathscr{C}}}\left(x^\prime_{t+1}\right)\!,\end{cases} \end{align} (3.1)where if $$P_{{\mathscr{C}}}\!\left (x^\prime _{t+1}\right )$$ is not unique then any closest point may be chosen. 3.1. Assumptions Assumptions on g We first consider the objective function g. Let $$\widehat{x}$$ be the target of our optimization procedure, $$\widehat{x} \in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$$. We assume that g satisfies restricted strong convexity (RSC) and restricted smoothness (RSM) conditions over $$x,y\in \mathscr{C}$$,   \begin{align} \mathsf{g}(y)\geqslant \mathsf{g}(x) + \langle\,{y-x},{\nabla\mathsf{g}(x)}\rangle + \frac{\alpha}{2}\lVert{x-y}{\rVert^{2}_{2}} - \frac{\alpha}{2}\varepsilon_{\mathsf{g}}^{2} \end{align} (3.2)and   \begin{align} \mathsf{g}(y)\leqslant \mathsf{g}(x) + \langle\,{y-x},{\nabla\mathsf{g}(x)}\rangle + \frac{\beta}{2}\lVert{x-y}{\rVert^{2}_{2}} + \frac{\alpha}{2}\varepsilon_{\mathsf{g}}^{2}. \end{align} (3.3)Without loss of generality we can take $$\alpha \leqslant \beta$$. As is common in the low-rank factorized optimization literature, we will work in a local neighborhood of the target $$\widehat{x}$$ by assuming that our initialization point lies within radius ρ of $$\widehat{x}$$, which will allow us to require these conditions on g to hold only locally. 
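The iteration (3.1) is simple to implement whenever $$P_{{\mathscr{C}}}$$ is available. The sketch below is our own minimal illustration; the sparse least-squares instance, with a diagonal design matrix and hard-thresholding projection, is chosen for the example and is not from the paper.

```python
import numpy as np

def projected_gradient_descent(grad_g, proj_C, x0, eta, n_iters=200):
    """The iteration (3.1): a gradient step on g, then projection onto C."""
    x = x0
    for _ in range(n_iters):
        x = proj_C(x - eta * grad_g(x))
    return x

# l2 projection onto C = {x : |support(x)| <= k}: keep the k largest entries.
def hard_threshold(z, k):
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-k:]
    out[keep] = z[keep]
    return out

d, k = 20, 3
x_star = np.zeros(d)
x_star[:k] = [1.0, -2.0, 0.5]                 # the k-sparse target
A = np.diag(np.linspace(1.0, 2.0, d))         # a well-conditioned toy design
grad = lambda x: A.T @ (A @ x - A @ x_star)   # g(x) = 0.5*||A x_star - A x||_2^2
beta = np.linalg.norm(A, 2) ** 2              # smoothness constant of g
x_hat = projected_gradient_descent(grad, lambda z: hard_threshold(z, k),
                                   np.zeros(d), eta=1.0 / beta)
assert np.linalg.norm(x_hat - x_star) < 1e-8  # linear convergence to x_star
```

On this noiseless, well-conditioned instance the iterates contract geometrically toward the sparse minimizer, matching the qualitative behavior described by Theorem 3.1.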
The term εg gives some ‘slack’ in our assumption on g, and is intended to capture some vanishingly small error level. This term is often referred to as the ‘statistical error’ in the high-dimensional statistics literature, which represents the best-case scaling of the accuracy of our recovered solution. Often $$\widehat{x}$$ may represent a global minimizer which is within radius εg of some ‘true’ parameter in a statistical setting; therefore, converging to $$\widehat{x}$$ up to an error of magnitude εg means that the recovered solution is as accurate as $$\widehat{x}$$ at recovering the true parameter. For instance, often we will have $$\varepsilon _{\mathsf{g}}\sim{\sqrt{\frac{\log (d)}{n}}}$$ in a statistical setting where we are solving a sparse estimation problem of dimension d with sample size n. Assumptions on $$\mathscr{C}$$ Next, turning to the non-convexity of $$\mathscr{C}$$, we will assume local concavity coefficients $$\gamma _{x}(\mathscr{C})$$ that are not too large in a neighborhood of $$\widehat{x}$$, with details given below. We furthermore assume a norm compatibility condition,   \begin{align} \left\lVert{z - P_{{\mathscr{C}}}(z)}\right\rVert^{\ast} \leqslant \phi \min_{x\in\mathscr{C}}\lVert{z-x}\rVert^{\ast}\text{ for all }z\in\mathbb{R}^{d}, \end{align} (3.4)for some constant $$\phi \geqslant 1$$. The norm compatibility condition is trivially true with ϕ = 1 if $$\lVert{\cdot }\rVert$$ is the ℓ2 norm, since $$P_{{\mathscr{C}}}$$ is a projection with respect to the ℓ2 norm. We will see that in many natural settings it holds even for other norms, often with ϕ = 1. 
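As an illustration of (3.4) beyond the ℓ2 case (our own check, for the specific instance where $$\mathscr{C}$$ is the k-sparse set and $$\lVert{\cdot }\rVert$$ is the ℓ1 norm, so $$\lVert{\cdot }\rVert ^{\ast }$$ is the $$\ell _{\infty }$$ norm): the ℓ2 projection is hard thresholding, whose $$\ell _{\infty }$$ residual is the (k + 1)-st largest magnitude of z, and no k-sparse point can do better, since it must leave at least d − k coordinates of z untouched. Hence ϕ = 1 here.

```python
import numpy as np

# l2 projection onto the k-sparse set: keep the k largest-magnitude entries.
def hard_threshold(z, k):
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-k:]
    out[keep] = z[keep]
    return out

rng = np.random.default_rng(1)
d, k = 12, 4
z = rng.normal(size=d)

# l-infinity residual of the l2 projection:
lhs = np.max(np.abs(z - hard_threshold(z, k)))

# best possible l-infinity residual over all k-sparse x: the (k+1)-st
# largest magnitude of z (at least d-k coordinates of z are left untouched)
rhs = np.sort(np.abs(z))[-(k + 1)]
assert np.isclose(lhs, rhs)              # so (3.4) holds with phi = 1 here

# sanity check: random k-sparse candidates never beat the projection
for _ in range(1000):
    x = np.zeros(d)
    idx = rng.choice(d, size=k, replace=False)
    x[idx] = rng.normal(size=k)
    assert np.max(np.abs(z - x)) >= lhs - 1e-12
```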
Assumptions on gradient and initialization Finally, we assume a gradient condition that reveals the connection between the curvature of the non-convex set $$\mathscr{C}$$ and the target function g: we require that   \begin{align} 2\phi\cdot\max_{x,x^{\prime}\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)}\gamma_{x}(\mathscr{C})\lVert{\nabla\mathsf{g}(x^\prime)}\rVert^{\ast} \leqslant (1-c_{0}) \cdot \alpha. \end{align} (3.5)(Since $$x\mapsto \gamma _{x}(\mathscr{C})$$ is upper semi-continuous, if g is continuously differentiable, then we can find some radius ρ > 0 and some constant c0 > 0 satisfying this condition, as long as $$2\phi \gamma _{\widehat{x}}(\mathscr{C}) \lVert{\nabla \mathsf{g}\left (\widehat{x}\right )}\rVert ^{\ast } < \alpha$$.) Our projected gradient descent algorithm will then succeed if initialized within this radius ρ from the target point $$\widehat{x}$$, with an appropriate step size. We will discuss the necessity of this type of initialization condition below in Section 3.4. In practice, relaxing the constraint $$x\in \mathscr{C}$$ to a convex constraint (or convex penalty) is often sufficient for providing a good initialization point. For example, in the low-rank matrix setting, if we would like to solve $$\operatorname{arg\,min}\{\mathsf{g}(X):\operatorname{rank}(X)\leqslant r\}$$, we may first solve $$\operatorname{arg\,min}_{X}\left \{\mathsf{g}(X) + \lambda \lVert{{X}}\rVert _{\textrm{nuc}}\right \}$$, where $$\lVert{{X}}\rVert _{\textrm{nuc}}$$ is the nuclear norm and $$\lambda \geqslant 0$$ is a penalty parameter (which we would tune to obtain the desired rank for X). Alternatively, in some settings, it may be sufficient to solve an unconstrained problem $$\operatorname{arg\,min}_{X}\mathsf{g}(X)$$ and then project to the constraint set, $$P_{{\mathscr{C}}}(X)$$. For some detailed examples of suitable initialization procedures for various low-rank matrix estimation problems, see e.g. the studies by Chen & Wainwright (2015) and Tu et al. (2015). 3.2. 
Convergence guarantee We now state our main result, which proves that under these conditions, initializing at some $$x_{0}\in \mathscr{C}$$ sufficiently close to $$\widehat{x}$$ will guarantee fast convergence to $$\widehat{x}$$. Theorem 3.1 Let $$\mathscr{C}\subset \mathbb{R}^{d}$$ be a constraint set and let g be a differentiable function, with minimizer $$\widehat{x}\in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$$. Suppose $$\mathscr{C}$$ satisfies the norm compatibility condition (3.4) with parameter ϕ, and g satisfies RSC (3.2) and RSM (3.3) with parameters α, β, εg for all $$x,y\in \mathbb{B}_{2}(\widehat{x},\rho )$$, and the initialization condition (3.5) for some c0 > 0. If the initial point $$x_{0}\in \mathscr{C}$$ and the error level εg satisfy $$\lVert{x_{0} -\widehat{x}}{\rVert ^{2}_{2}}<\rho ^{2}$$ and $$\varepsilon _{\mathsf{g}}^{2}< \frac{c_{0} \rho ^{2}}{1.5}$$, then for each step $$t\geqslant 0$$ of the projected gradient descent algorithm (3.1) with step size η = 1/β,   $$\lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} \leqslant \left(1 - c_{0}\cdot \frac{2\alpha}{\alpha+\beta}\right)^{t} \lVert{x_{0} - \widehat{x}}{\rVert^{2}_{2}} +\frac{1.5{\varepsilon}_{\mathsf{g}}^{2}}{c_{0}}.$$ In other words, the iterates xt converge linearly to the minimizer $$\widehat{x}$$, up to precision level εg. 3.3. Comparison to related work We now compare to several related results for convex and non-convex projected gradient descent. (For methods that are specific to the problem of optimization over low-rank matrices, we will discuss this comparison and perform simulations later on.) Comparison to convex optimization To compare this result to the convex setting, if $$\mathscr{C}$$ is a convex set and g is α-strongly convex and β-smooth, then we can set c0 = 1 and εg = 0. 
Our result then yields   $$\lVert{x_{t} - \widehat{x}}{\rVert^{2}_{2}} \leqslant \left(1 - \frac{2\alpha}{\alpha+\beta}\right)^{t} \lVert{x_{0} - \widehat{x}}{\rVert^{2}_{2}} = \left(\frac{\beta-\alpha}{\beta+\alpha}\right)^{t}\lVert{x_{0}-\widehat{x}}{\rVert^{2}_{2}},$$matching known rates for the convex setting (see e.g. Bubeck, 2015, Theorem 3.10). Comparison to known results using descent cones Oymak et al. (2015) study projected gradient descent for a linear regression setting, $$\mathsf{g}(x) = \frac{1}{2}\lVert{b-Ax}{\rVert ^{2}_{2}}$$, while constraining via some potentially non-convex regularizer, $$\mathscr{C} = \{x:\textrm{Pen}(x)\leqslant c\}$$. Given a true solution $$x^{\star }\in \mathscr{C}$$ (for instance, in a statistical setting, we may have b = Ax⋆ + (noise)), their work focuses on the descent cone of $$\mathscr{C}$$ at x⋆, given by   $$\textrm{DC}_{x^{\star}} = \textrm{Smallest closed cone containing }\left\{u: \textrm{Pen}\left(x^{\star}+u\right) \leqslant c\right\}\!\!.$$(Trivially we will have $$x_{t} - x^{\star } \in \textrm{DC}_{x^{\star }}$$ since $$x_{t}\in \mathscr{C}$$.) Their results characterize the convergence of projected gradient descent in terms of the eigenvalues of A⊤A restricted to this cone. For simplicity, we show their result specialized to the noiseless setting, i.e. b = Ax⋆, given in the study by Oymak et al. (2015, Theorem 1.2):   \begin{align} \lVert{x_{t} - x^{\star}}\rVert_{2} \leqslant \left(2\cdot \max_{u,v\in\textrm{DC}_{x^{\star}}\cap\mathbb{S}^{d-1}} u^{\top} \left(\mathbf{I}_{d} - \eta A^{\top} A\right) v\right)^{t} \lVert{x^{\star}}\rVert_{2}. \end{align} (3.6)For this result to be meaningful we of course need the radius of convergence to be < 1. For a convex constraint set $$\mathscr{C}$$ (i.e. if Pen(x) is convex), the factor of 2 can be removed. In the non-convex setting, however, the factor of 2 means that the maximum in (3.6) must be $$<\frac{1}{2}$$ for the bound to ensure convergence. 
Noting that ∇g(x) = A⊤A(x − x⋆) in this problem, by setting u = v ∝ x − x⋆ we see that (3.6) effectively requires that $$\eta> \frac{1}{2\alpha }$$, where α is the RSC parameter (3.2). However, we also know that $$\eta \leqslant \frac{1}{\beta }$$ is generally a necessary condition to ensure stability of projected gradient descent; if $$\eta>\frac{1}{\beta }$$ then we may see values of g increase over iterations, i.e. g(x1) > g(x0). Therefore, the condition (3.6) effectively requires that g is well conditioned with $$\beta \lesssim 2\alpha$$, and furthermore that x⋆ is not in the interior of $$\mathscr{C}$$ (since, if this were the case, then $$\textrm{DC}_{x^{\star }} = \mathbb{R}^{d}$$). On the other hand, if the radius in (3.6) is indeed < 1, then their work does not assume any type of initialization condition for convergence to be successful, in contrast to our initialization assumption (3.5). Comparison to known results for iterative hard thresholding We now compare our results to those of Jain et al. (2014), which specifically treat the iterative hard thresholding algorithm for a sparsity constraint or a rank constraint,   $$\mathscr{C} = \left\{x\in\mathbb{R}^{d}:|\operatorname{support}(x)|\leqslant k\right\}\textrm{ or } \mathscr{C} = \left\{X\in\mathbb{R}^{n\times m}:\operatorname{rank}(X)\leqslant r\right\}\!\!.$$In their work, they take a substantially different approach: instead of bounding the distance between xt and the minimizer $$\widehat{x}\in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$$, they instead take $$\widehat{x}$$ to be a minimizer over a stronger constraint,   $$\widehat{x} =\mathop{\operatorname{arg\,min}}\limits_{|\operatorname{support}(x)|\leqslant k^{\star}}\mathsf{g}(x)\quad\textrm{ or }\quad\widehat{X}=\mathop{\operatorname{arg\,min}}\limits_{\operatorname{rank}(X)\leqslant r^{\star}}\mathsf{g}(X),$$taking k⋆ ≪ k or r⋆ ≪ r to enforce that the sparsity of $$\widehat{x}$$ or rank of $$\widehat{X}$$ is much lower than the 
optimization constraint set $$\mathscr{C}$$. With this definition, they then bound the gap in objective function values, $$\mathsf{g}(x_{t}) - \mathsf{g}(\widehat{x})$$. In other words, the objective function value g(xt) is, up to a small error, no larger than the best value obtained over the substantially more restricted set of k⋆-sparse vectors or of rank-r⋆ matrices. By careful use of this gap k⋆ ≪ k or r⋆ ≪ r, their analysis allows for convergence results from any initialization point $$x_{0}\in \mathscr{C}$$. In contrast, our work allows $$\widehat{x}$$ to lie anywhere in $$\mathscr{C}$$, but this comes at the cost of assuming a local initialization point $$x_{0}\in \mathscr{C}\cap \mathbb{B}_{2}\left (\widehat{x},\rho \right )$$. This result suggests a possible two-phase approach: first, we might optimize over a larger rank constraint $$\mathscr{C}=\{X:\operatorname{rank}(X)\leqslant k\},$$ where k ≫ k⋆ to obtain the convergence guarantees of the study by Jain et al. (2014) (which do not assume a good initialization point, but obtain weaker guarantees); then, given the solution over rank k as a good initialization point, we would then optimize over the tighter constraint $$\mathscr{C}=\{X:\operatorname{rank}(X)\leqslant k^{\star }\}$$ to obtain our stronger guarantees. Comparison to results on prox-regular functions Pennanen (2002) studies conditions for linear convergence of the proximal point method for minimizing a function f(x), and shows that prox-regularity of f(x) is sufficient; Lewis & Wright (2016) also study this problem in a more general setting. For our optimization problem, this translates to setting $$\mathsf{f}(x) = \mathsf{g}(x) + \delta _{\mathscr{C}}(x)$$, where   $$\delta_{\mathscr{C}}(x)=\begin{cases}0,&x\in\mathscr{C},\\\infty,&x\not\in\mathscr{C}.\end{cases}$$(This is usually called the ‘indicator function’ for the set $$\mathscr{C}$$.) If $$\mathsf{g}(x) + \frac{\mu }{2}\lVert{x}{\rVert ^{2}_{2}}$$ is convex (i.e. 
the concavity of g is bounded) and $$\mathscr{C}$$ is a prox-regular set (i.e. $$\gamma (\mathscr{C})<\infty$$, see Section 2.3), then f(x) is a prox-regular function. This work was extended by Iusem et al. (2003) and others to an inexact proximal point method, allowing for error in each iteration, which can be formulated to encompass the projected gradient descent algorithm studied here. Our first convergence result, Theorem 3.1, extends these results into a high-dimensional setting by using the structured norm $$\lVert{\cdot }\rVert$$ and its dual $$\lVert{\cdot }\rVert ^{\ast }$$ (e.g. the ℓ1 norm and its dual the $$\ell _{\infty }$$ norm), and requiring only restricted strong convexity (RSC) and restricted smoothness (RSM) on g, without which we would not be able to obtain convergence guarantees in settings such as high-dimensional sparse regression or low-rank matrix estimation.
3.4. Initialization point and the gradient assumption
In this result, we assume that the initialization point x0 is within some radius ρ of the target $$\widehat{x}$$, ensuring that $$2\phi \gamma _{x}(\mathscr{C})\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast }<\alpha$$ for all x in the initialization neighborhood, where α is the RSC parameter (3.2). This type of assumption arises in much of the related literature; for example, in the setting of optimization over low-rank matrices, as we will see in Section 5.1, we will require that $$\lVert{{X_{0} - \widehat{X}}}\rVert _{\mathsf{F}}\lesssim \sigma _{r}\left (\widehat{X}\right )$$, which is the same condition found in existing work such as that of Chen & Wainwright (2015).
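To fix ideas before turning to the necessity of the gradient condition, the following is a minimal sketch of projected gradient descent (3.1) with the sparsity constraint from the iterative hard thresholding comparison above, applied to a least-squares objective. The problem sizes, the noiseless data, the zero initialization and the step size η = 1/β are illustrative assumptions on our part, not the paper's experimental setup.

```python
import numpy as np

def hard_threshold(x, k):
    """Exact projection onto {x : |support(x)| <= k}: keep the k largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def projected_gradient_descent(A, b, k, n_iters=500):
    """Minimize g(x) = 0.5*||Ax - b||_2^2 over k-sparse vectors via iteration (3.1)."""
    d = A.shape[1]
    beta = np.linalg.eigvalsh(A.T @ A).max()   # smoothness constant of g
    eta = 1.0 / beta                           # step size eta = 1/beta
    x = np.zeros(d)
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)               # gradient of g at the current iterate
        x = hard_threshold(x - eta * grad, k)  # gradient step, then project onto C
    return x

rng = np.random.default_rng(0)
n, d, k = 100, 20, 3
A = rng.standard_normal((n, d))
x_star = np.zeros(d)
x_star[:k] = [2.0, -1.5, 1.0]
b = A @ x_star                                 # noiseless observations
x_hat = projected_gradient_descent(A, b, k)
print(np.linalg.norm(x_hat - x_star))          # near-zero in this well-conditioned setting
```

In this overdetermined, noiseless regime the RSC/RSM constants are well behaved, so the iterates contract toward the k-sparse minimizer at a linear rate.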
In fact, the following result demonstrates that the bound (3.5) is in a sense necessary: Lemma 3.2 For any constraint set $$\mathscr{C}$$ and any point $$x\in \mathscr{C}\backslash \mathscr{C}_{\mathsf{dgn}}$$ with $$\gamma _{x}(\mathscr{C})>0$$, for any α, ε > 0 there exists an α-strongly convex g such that the gradient condition (3.5) is nearly satisfied at x, with $$2\gamma _{x}(\mathscr{C})\lVert{\nabla \mathsf{g}(x)}\rVert ^{\ast }\leqslant \alpha (1+ \varepsilon )$$, and x is a stationary point of the projected gradient descent algorithm (3.1) for all sufficiently small step sizes η > 0, but x does not minimize g over $$\mathscr{C}$$. That is, if projected gradient descent is initialized at the point x, then the algorithm will never leave this point, even though it is not optimal (i.e. x is not the global minimizer). We can see with a concrete example that the condition (3.5) may be even more critical than this lemma suggests: without this bound, we may find that projected gradient descent becomes trapped at a stationary point x which is not even a local minimum, as in the following example. Example 3.3 Let $$\mathscr{C}=\left \{X\in \mathbb{R}^{2\times 2}:\operatorname{rank}(X)\leqslant 1\right \}$$, let $$\mathsf{g}(X) = \frac{1}{2}\left \lVert{{X - \left ({1 \atop 0} \quad {0 \atop 1+\varepsilon}\right )}}\right \rVert _{\mathsf{F}}^{2}$$, and let $$X_{0} = \left ({1\atop 0} \quad {0 \atop 0}\right ).$$ Then trivially, we can see that g is α-strongly convex for α = 1, and that X0 is a stationary point of the projected gradient descent algorithm (3.1) for any step size $$\eta < \frac{1}{1+\varepsilon }$$. However, for any $$0<t<\sqrt{2\varepsilon }$$, setting $$X=\left ({1 \atop t} \quad {t \atop t^{2}}\right )\in \mathscr{C}$$, we can see that g(X) < g(X0)—that is, X0 is a stationary point, but is not a local minimum.
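Example 3.3 can be checked numerically. The sketch below is illustrative (it computes the rank-1 projection by truncated SVD, and the values ε = 0.1, η = 0.5, t = 0.3 are assumed for concreteness); it verifies both claims: one projected gradient step from X0 returns X0 itself, yet the rank-1 perturbation X attains a strictly smaller objective value.

```python
import numpy as np

def project_rank1(Z):
    """Exact projection onto {X : rank(X) <= 1} via truncated SVD."""
    U, s, Vt = np.linalg.svd(Z)
    return s[0] * np.outer(U[:, 0], Vt[0, :])

eps = 0.1
D = np.array([[1.0, 0.0], [0.0, 1.0 + eps]])  # target in g(X) = 0.5*||X - D||_F^2
g = lambda X: 0.5 * np.linalg.norm(X - D, "fro") ** 2
X0 = np.array([[1.0, 0.0], [0.0, 0.0]])

# One step of projected gradient descent (3.1) with eta < 1/(1+eps): X0 is a fixed point,
# because the gradient step [[1,0],[0,eta*(1+eps)]] still has top singular pair equal to X0.
eta = 0.5
X1 = project_rank1(X0 - eta * (X0 - D))       # gradient of g at X is X - D
print(np.allclose(X1, X0))                    # True

# Yet X0 is not a local minimum: for 0 < t < sqrt(2*eps), this rank-1 matrix does better.
t = 0.3                                       # t^2 = 0.09 < 2*eps = 0.2
X = np.array([[1.0, t], [t, t ** 2]])
print(g(X) < g(X0))                           # True
```

A short calculation explains the second check: writing s = t², g(X) − g(X0) = s(s − 2ε)/2, which is negative exactly when 0 < s < 2ε.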
We will later calculate that $$\gamma _{X_{0}}(\mathscr{C}) = \frac{1}{2\sigma _{1}(X_{0})}=\frac{1}{2}$$ relative to the nuclear norm $$\lVert{\cdot }\rVert =\lVert{{\cdot }}\rVert _{\textrm{nuc}}$$, with norm compatibility constant ϕ = 1 (see Section 5.1 for this calculation). Comparing against the condition (3.5) on the gradient of g, since the dual norm to $$\lVert{{\cdot }}\rVert _{\textrm{nuc}}$$ is the matrix spectral norm $$\lVert{{\cdot }}\rVert _{\textrm{sp}}$$, we see that   $$2\phi\gamma_{X_{0}}(\mathscr{C}) \cdot\lVert{{\nabla\mathsf{g}(X_{0})}}\rVert_{\textrm{sp}} = 2\cdot1\cdot \frac{1}{2} \cdot \left\lVert{{-\left(\begin{array}{cc}0 & 0 \\ 0 & 1+\varepsilon\end{array}\right) }}\right\rVert_{\textrm{sp}}= 1+ \varepsilon =\alpha(1+\varepsilon).$$Therefore, when the initial gradient condition (3.5) is even slightly violated in this example (i.e. for small ε > 0), the projected gradient descent algorithm can become trapped at a point that is not even a local minimum. While we might observe that, in this particular example, the ‘bad’ stationary point X0 could be avoided by increasing the step size, in other settings, if g has strong curvature in some directions (i.e. the smoothness parameter β is large), then we cannot afford a large step size η, as it can cause the algorithm to fail to converge.
4. Convergence analysis using approximate projections
In some settings, computing the projection $$P_{{\mathscr{C}}}(x^\prime _{t+1})$$ at each step of the projected gradient descent algorithm may be prohibitively expensive; for instance, in a low-rank matrix optimization problem of dimension d × d, this would generally involve taking the singular value decomposition of a dense d × d matrix at each step. In these cases we may sometimes have access to a fast but approximate computation of this projection, which may come at the cost of slower convergence.
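To illustrate this trade-off, the sketch below compares the exact rank-r projection (a full SVD of a dense d × d matrix) with a cheaper randomized approximation in the style of sketched subspace methods. The randomized scheme and its parameters (oversampling of 10, two power iterations) are our own illustrative assumptions, not the specific approximate projection constructed later in this paper.

```python
import numpy as np

def project_rank_r(Z, r):
    """Exact projection onto {X : rank(X) <= r}: a full SVD, costly for dense d x d matrices."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def approx_project_rank_r(Z, r, oversample=10, power_iters=2, seed=0):
    """Randomized approximation: sketch the range of Z, then project within the sketch."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((Z.shape[1], r + oversample))
    Y = Z @ G
    for _ in range(power_iters):         # power iterations sharpen the captured subspace
        Y = Z @ (Z.T @ Y)
    Q, _ = np.linalg.qr(Y)               # orthonormal basis for the sketched range
    B = Q.T @ Z                          # small (r + oversample) x d matrix: cheap SVD
    Ub, sb, Vbt = np.linalg.svd(B, full_matrices=False)
    return Q @ ((Ub[:, :r] * sb[:r]) @ Vbt[:r, :])

rng = np.random.default_rng(1)
d, r = 200, 5
U = np.linalg.qr(rng.standard_normal((d, r)))[0]
V = np.linalg.qr(rng.standard_normal((d, r)))[0]
M = U @ np.diag([10.0, 9.0, 8.0, 7.0, 6.0]) @ V.T + 1e-3 * rng.standard_normal((d, d))
exact = project_rank_r(M, r)
approx = approx_project_rank_r(M, r)
rel_err = np.linalg.norm(approx - exact, "fro") / np.linalg.norm(exact, "fro")
print(rel_err)                           # tiny when the spectrum has a clear gap at rank r
```

The randomized variant only ever factorizes thin or small matrices, which is the source of the computational savings that motivate the analysis of this section.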
We now generalize to the idea of a family of approximate projections: operators that approximate the projection onto $$\mathscr{C}$$. Specifically, the approximations are carried out locally:   \begin{align}\begin{cases}x^\prime_{t+1} = x_{t} - \eta\nabla\mathsf{g}(x_{t}),\\[3pt] x_{t+1} = P_{{x_{t}}}\left(x^\prime_{t+1}\right)\!,\end{cases} \end{align} (4.1)where $$P_{{x_{t}}}$$ comes from a family of operators $$P_{{x}}:\mathbb{R}^{d}\rightarrow \mathscr{C}$$ indexed by $$x\in \mathscr{C}$$. Intuitively, we think of Px(z) as providing a very accurate approximation to $$P_{{\mathscr{C}}}(z)$$ locally for z near x, but it may distort the projection more as we move farther away. To allow for our convergence analysis to carry through even with these approximate projections, we assume that the family of operators {Px} satisfies a relaxed inner product condition:   \begin{align} &\text{For any }x\in\mathscr{C}\text{ and }z\in\mathbb{R}^{d}\text{ with }x,P_{{x}}(z)\in\mathbb{B}_{2}\!\left(\widehat{x},\rho\right), \\ &\quad\left\langle{\,\widehat{x}-P_{{x}}(z)},{z-P_{{x}}(z)}\right\rangle \leqslant \max\{ \underbrace{\left\lVert{z-P_{{x}}(z)}\right\rVert^{\ast}}_{\textrm{concavity term}}, \underbrace{\left\lVert{z-x}\right\rVert^{\ast}}_{\textrm{distortion term}}\}\cdot \big(\underbrace{\gamma^{\textrm{c}}\lVert\,\widehat{x}-P_{{x}}(z)\rVert^{2}_{2}}_{\textrm{concavity term}} + \underbrace{\gamma^{\textrm{d}}\lVert\,\widehat{x}-x\rVert^{2}_{2}}_{\textrm{distortion term}}\big).\nonumber \end{align} (4.2)Here the ‘concavity’ terms are analogous to the inner product bound in (2.5) for exact projection onto the non-convex set $$\mathscr{C}$$, except with the projection $$P_{{\mathscr{C}}}$$ replaced by the operator Px; the ‘distortion’ terms mean that as we move farther away from x the bound becomes looser, as Px becomes a less accurate approximation to $$P_{{\mathscr{C}}}$$.
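As a concrete illustration of such a family, one plausible construction in the low-rank setting (our own assumption, not the operator developed in this paper) is warm-started subspace iteration: run a few power steps on z starting from the singular subspace of the current iterate x. For z near x the starting subspace is already nearly correct, so Px(z) closely tracks the exact projection, while for distant z the approximation degrades; this is exactly the local-accuracy behavior that condition (4.2) is designed to tolerate.

```python
import numpy as np

def make_local_projection(x, r, iters=2):
    """Return P_x: an approximate rank-r projection warm-started at x's singular subspace."""
    Q0 = np.linalg.svd(x, full_matrices=False)[0][:, :r]   # top-r left singular vectors of x
    def P_x(z):
        Q = Q0
        for _ in range(iters):                             # subspace iteration on z, from Q0
            Q, _ = np.linalg.qr(z @ (z.T @ Q))
        return Q @ (Q.T @ z)                               # rank-r output in the captured subspace
    return P_x

rng = np.random.default_rng(2)
d, r = 60, 3
U = np.linalg.qr(rng.standard_normal((d, r)))[0]
V = np.linalg.qr(rng.standard_normal((d, r)))[0]
x = U @ np.diag([3.0, 2.0, 1.0]) @ V.T                     # current iterate, exactly rank r
z = x + 0.01 * rng.standard_normal((d, d))                 # a gradient step landing near x

P_x = make_local_projection(x, r)
Ue, se, Vte = np.linalg.svd(z, full_matrices=False)
exact = (Ue[:, :r] * se[:r]) @ Vte[:r, :]                  # exact projection P_C(z)
print(np.linalg.norm(P_x(z) - exact, "fro"))               # small for z near x
```

Here Q @ (Q.T @ z) is the best approximation of z within the captured column space, so the operator always outputs a member of the rank-r constraint set, as the definition of Px requires.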
We now present a convergence guarantee nearly identical to the result for the exact projection case, Theorem 3.1. We first need to state a version of the norm compatibility condition, modified for approximate projections:   \begin{align} \left\lVert{z - P_{{x}}(z)}\right\rVert^{\ast} \leqslant \phi \lVert{z-x}\rVert^{\ast}\ \ \text{for all }x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)\text{ and }z\in\mathbb{R}^{d}. \end{align} (4.3)We also require a modified initialization condition,   \begin{align} 2\phi(\gamma^{\textrm{c}}+\gamma^{\textrm{d}})\max_{x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)}\lVert{\nabla\mathsf{g}(x)}\rVert^{\ast} \leqslant (1-c_{0})\alpha, \end{align} (4.4)and a modified version of local uniform continuity (compare to (2.7) for exact projections),   \begin{align} &\text{for any }x\in\mathscr{C}\cap\mathbb{B}_{2}(\widehat{x},\rho)\text{ and any }\varepsilon > 0,\text{ there exists }\delta > 0\text{ such that,}\nonumber\\ &\quad\text{for any }z,w\in\mathbb{R}^{d}\text{ such that }P_{x}(z)\in\mathbb{B}_{2}(\widehat{x},\rho)\text{ and }2(\gamma^{\textrm{c}}+\gamma^{\textrm{d}})\lVert{z-P_{{x}}(z)}\rVert^{\ast}\leqslant 1-c_{0},\nonumber\\ &\qquad\text{if }\lVert{z-w}\rVert_{2}\leqslant\delta\text{ then }\lVert{P_{{x}}(z)-P_{{x}}(w)}\rVert_{2}\leqslant\varepsilon. \end{align} (4.5) Our result for this setting now follows. Theorem 4.1 Let $$\mathscr{C}\subset \mathbb{R}^{d}$$ be a constraint set and let g be a differentiable function, with minimizer $$\widehat{x}\in \operatorname{arg\,min}_{x\in \mathscr{C}}\mathsf{g}(x)$$. Let {Px} be a family of operators satisfying the inner product condition (4.2), the norm compatibility condition (4.3) and the local continuity condition (4.5) with parameters γc, γd, ϕ and radius ρ. Assume that g satisfies RSC (3.2) and restricted smoothness (3.3) with parameters α, β, εg for all $$x, y\in \mathbb{B}_{2}(\widehat{x},\rho )$$, and the initialization condition (4.4) for some c0 > 0.
If the initial point $$x_{0}\in \mathscr{C}$$ and the error level εg satisfy $$\lVert x_{0} -\widehat{x}\rVert ^{2}_{2}<\rho ^{2}$$ and $$\varepsilon _{\mathsf{g}}^{2}< \frac{c_{0} \rho ^{2}}{1.5}$$, then for each step $$t\geqslant 0$$ of the approximate projected gradient descent algorithm (4.1) with step size η = 1/β,   $$\lVert x_{t} - \widehat{x}\rVert^{2}_{2} \leqslant \left(1 - c_{0}\cdot \frac{2\alpha}{\alpha+\beta}\right)^{t} \lVert x_{0} - \widehat{x}\rVert^{2}_{2} +\frac{1.5\varepsilon_{\mathsf{g}}^{2}}{c_{0}}.$$ This convergence rate is identical to that obtained in Theorem 3.1 for exact projections—the only differences lie in the assumptions.
4.1. Exact vs. approximate projections
To compare the two settings we have considered, exact projections $$P_{{\mathscr{C}}}$$ vs. approximate projections Px, we focus on a local form of the inner product condition (4.2) for the family of approximate operators {Px}, rewritten to be analogous to the inner product condition (2.5) for exact projections. Suppose that $$\gamma ^{\text{c}}_{u}(\mathscr{C})$$ and $$\gamma ^{\text{d}}_{u}(\mathscr{C})$$ satisfy the property that   \begin{align} &\text{for any }x,\ y\in\mathscr{C}\text{ and any }z\in\mathbb{R}^{d}\text{, writing }u=P_{{x}}(z),\nonumber\\ &\quad\langle{y-u},\ {z-u}\rangle \leqslant \max\{ \underbrace{\lVert{z-u}\rVert^{\ast}}_{\text{concavity term}}, \underbrace{\lVert{z-x}\rVert^{\ast}}_{\text{distortion term}}\}\cdot \Big(\underbrace{\gamma^{\text{c}}_{u} (\mathscr{C})\lVert y-u\rVert^{2}_{2}}_{\text{concavity term}} + \underbrace{\gamma^{\text{d}}_{u}(\mathscr{C})\lVert y-x\rVert^{2}_{2}}_{\text{distortion term}}\Big), \end{align} (4.6)where $$u\mapsto \gamma ^{\text{c}}_{u}(\mathscr{C})$$ and $$u\mapsto \gamma ^{\text{d}}_{u}(\mathscr{C})$$ are upper semi-continuous maps.
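Condition (4.6) can be spot-checked numerically for the exact projection case (Px = PC, so the distortion coefficient is zero) on the rank-1 constraint, using the coefficient γ = 1/(2σ1(u)) relative to the nuclear norm quoted from Section 5.1. The test dimensions, number of random draws and tolerance below are our own assumptions for this sanity check.

```python
import numpy as np

def project_rank1(Z):
    """Exact projection onto {X : rank(X) <= 1} via truncated SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0, :])

# Spot-check (4.6) with P_x = P_C and gamma^d = 0 on C = {rank <= 1}, taking
# gamma^c_u = 1/(2*sigma_1(u)) relative to the nuclear norm (dual norm: spectral).
rng = np.random.default_rng(3)
ok = True
for _ in range(1000):
    z = rng.standard_normal((4, 4))
    u = project_rank1(z)                                # u = P_C(z)
    y = project_rank1(rng.standard_normal((4, 4)))      # an arbitrary point of C
    lhs = np.sum((y - u) * (z - u))                     # <y - u, z - u>
    gamma_c = 1.0 / (2.0 * np.linalg.svd(u, compute_uv=False)[0])
    rhs = np.linalg.norm(z - u, 2) * gamma_c * np.linalg.norm(y - u, "fro") ** 2
    ok = ok and (lhs <= rhs + 1e-9)
print(ok)
```

(Here `np.linalg.norm(z - u, 2)` is the matrix spectral norm, the dual of the nuclear norm; a randomized check of this kind cannot prove the inequality, but a violation would immediately falsify the claimed coefficient.)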
We now prove that the existence of a family of operators {Px} satisfying this general condition (4.6) is in fact equivalent to bounding the local concavity coefficients of $$\mathscr{C}$$. Lemma 4.2 Consider a constraint set $$\mathscr{C}\subset \mathbb{R}^{d}$$ and a norm $$\lVert{\cdot }\rVert$$ on $$\mathbb{R}^{d}$$ with dual norm $$\lVert{\cdot }\rVert ^{\ast }$$. If $$\mathscr{C}$$ has local concavity coefficients given by $$\gamma _{x}(\mathscr{C})$$ for all $$x\in \mathscr{C}$$, then by defining operators $$P_{{x}} =P_{{\mathscr{C}}}$$ for all $$x\in \mathscr{C}$$, the inner product condition (4.6) holds with $$\gamma ^{\text{c}}_{x}(\mathscr{C})=\gamma _{x}(\mathscr{C})$$ and $$\gamma ^{\text{d}}_{x}(\mathscr{C})=0$$. Conversely, if there is some family of operators $$\{P_{{x}}\}_{x\in \mathscr{C}}$$ satisfying the inner product condition (4.6), then the local concavity coefficients of $$\mathscr{C}$$ satisfy $$\gamma _{x}(\mathscr{C}) \leqslant \gamma ^{\text{c}}_{x}(\mathscr{C}) + \gamma ^{\text{d}}_{x}(\mathscr{C})$$, provided that $$x\mapsto \gamma ^{\text{c}}_{x}(\mathscr{C}),\ x\mapsto \gamma ^{\text{d}}_{x}(\mathscr{C})$$ are upper semi-continuous, and that Px also satisfies the following local continuity assumption:   \begin{align} \text{If }\gamma^{\textrm{c}}_{x}(\mathscr{C})+\gamma^{\textrm{d}}_{x}(\mathscr{C})<\infty\text{ and }z_{t}\rightarrow x,\text{ then }P_{{x}}(z_{t}) \rightarrow x. \end{align} (4.7) For this reason, we see that generalizing from the exact projection $$P_{{\mathscr{C}}}$$ to a family of operators {Px} does not expand the class of problems whose convergence is ensured by our theory; essentially, if using the approximate projection operators Px guarantees fast convergence, then the same would also be true using the exact projection $$P_{{\mathscr{C}}}$$. However, there may be substantial computational gain in switching from exact to approximate projection, which comes with little or no cost in terms of convergence guarantees. 5.
Examples In this section we consider a range of non-convex constraints arising naturally in high-dimensional statistics, and show that these sets come equipped with well-behaved local concavity coefficients (thus allowing for fast convergence of gradient descent, for appropriate functions g).
5.1. Low-rank optimization
Estimating a matrix with low-rank structure arises in a variety of problems in high-dimensional statistics and machine learning. A partial list includes principal component analysis (PCA), fact