Global rates of convergence for nonconvex optimization on manifolds

Abstract

We consider the minimization of a cost function f on a manifold $$\mathcal{M}$$ using Riemannian gradient descent and Riemannian trust regions (RTR). We focus on satisfying necessary optimality conditions within a tolerance ε. Specifically, we show that, under Lipschitz-type assumptions on the pullbacks of f to the tangent spaces of $$\mathcal{M}$$, both of these algorithms produce points with Riemannian gradient smaller than ε in $$\mathcal{O}\big(1/\varepsilon ^{2}\big)$$ iterations. Furthermore, RTR returns a point where also the Riemannian Hessian’s least eigenvalue is larger than −ε in $$\mathcal{O} \big(1/\varepsilon ^{3}\big)$$ iterations. There are no assumptions on initialization. The rates match their (sharp) unconstrained counterparts as a function of the accuracy ε (up to constants) and hence are sharp in that sense. These are the first deterministic results for global rates of convergence to approximate first- and second-order Karush-Kuhn-Tucker points on manifolds. They apply in particular for optimization constrained to compact submanifolds of $${\mathbb{R}^{n}}$$, under simpler assumptions.

1. Introduction

Optimization on manifolds is concerned with solving nonlinear and typically nonconvex computational problems of the form \begin{align} \min_{x \in \mathcal{M}} \ f(x), \end{align} (P) where $$\mathcal{M}$$ is a (smooth) Riemannian manifold and $$f \colon \mathcal{M} \to{\mathbb{R}}$$ is a (sufficiently smooth) cost function (Gabay, 1982; Smith, 1994; Edelman et al., 1998; Absil et al., 2008). Applications abound in machine learning, computer vision, scientific computing, numerical linear algebra, signal processing, etc. In typical applications x is a matrix and $$\mathcal{M}$$ could be a Stiefel manifold of orthonormal frames (including spheres and groups of rotations), a Grassmann manifold of subspaces, a cone of positive definite matrices or simply a Euclidean space such as $${\mathbb{R}^{n}}$$.

The standard theory for optimization on manifolds takes the standpoint that optimizing on a manifold $$\mathcal{M}$$ is not fundamentally different from optimizing in $${\mathbb{R}^{n}}$$. Indeed, many classical algorithms from unconstrained nonlinear optimization, such as gradient descent, nonlinear conjugate gradients, BFGS, Newton’s method and trust-region methods (Ruszczyński, 2006; Nocedal & Wright, 1999), have been adapted to the larger framework of (P) (Adler et al., 2002; Absil et al., 2007, 2008; Ring & Wirth, 2012; Huang et al., 2015; Sato, 2016). On the software side, a few general toolboxes for optimization on manifolds now exist, for example, Manopt (Boumal et al., 2014), PyManopt (Townsend et al., 2016) and ROPTLIB (Huang et al., 2016).

As (P) is typically nonconvex, one does not expect general purpose, efficient algorithms to converge to global optima of (P) in general. Indeed, the class of problems (P) includes known NP-hard problems. In fact, even computing local optima is NP-hard in general (Vavasis, 1991, Chapter 5). Nevertheless, one may still hope to compute points of $$\mathcal{M}$$ that satisfy first- and second-order necessary optimality conditions. These conditions take the same form as in unconstrained nonlinear optimization, with Riemannian notions of gradient and Hessian.
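To make the Riemannian notion of gradient just mentioned concrete, here is a minimal numerical sketch (not from the paper, purely illustrative) for the special case where $$\mathcal{M}$$ is the unit sphere in $${\mathbb{R}^{n}}$$ viewed as a Riemannian submanifold: the Riemannian gradient is the orthogonal projection of the Euclidean gradient onto the tangent space at x, and it vanishes at critical points even when the Euclidean gradient does not. The toy cost f(x) = ½ xᵀAx is an assumption made only for this example.

```python
import numpy as np

# Toy cost f(x) = 0.5 * x' A x restricted to the unit sphere in R^n.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A = (A + A.T) / 2          # make A symmetric

def egrad(x):
    """Euclidean gradient of f at x."""
    return A @ x

def rgrad(x):
    """Riemannian gradient on the sphere: project the Euclidean
    gradient onto the tangent space {eta : <x, eta> = 0}."""
    g = egrad(x)
    return g - (x @ g) * x

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
print("||grad f(x)|| at a random point:", np.linalg.norm(rgrad(x)))

# At an eigenvector of A (a critical point of f on the sphere), the
# Riemannian gradient vanishes even though the Euclidean one does not.
w, V = np.linalg.eigh(A)
print("||grad f(x)|| at an eigenvector:", np.linalg.norm(rgrad(V[:, 0])))
```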
For $$\mathcal{M}$$ defined by equality constraints, these necessary optimality conditions are equivalent to first- and second-order Karush-Kuhn-Tucker (KKT) conditions, but are simpler to manipulate because the Lagrangian multipliers are automatically determined. The proposition below states these necessary optimality conditions. Recall that to each point x of $$\mathcal{M}$$ there corresponds a tangent space (a linearization) $$\mathrm{T}_{x}\mathcal{M}$$. The Riemannian gradient grad f(x) is the unique tangent vector at x such that $$\mathrm{D} f(x)[\eta ] = \left \langle{\eta },{\textrm{grad}\ f(x)}\right \rangle $$ for all tangent vectors η, where $$\left \langle{\cdot },{\cdot }\right \rangle $$ is the Riemannian metric on $$\mathrm{T}_{x}\mathcal{M}$$ and Df(x)[η] is the directional derivative of f at x along η. The Riemannian Hessian Hess f(x) is a symmetric operator on $$\mathrm{T}_{x}\mathcal{M}$$, corresponding to the derivative of the gradient vector field with respect to the Levi‐Civita connection—see Absil et al. (2008, Chapter 5). These objects are easily computed in applications. A summary of relevant concepts about manifolds can be found in Appendix A.

Proposition 1.1 (Necessary optimality conditions). Let $$x\in \mathcal{M}$$ be a local optimum for (P). If f is differentiable at x then grad f(x) = 0. If f is twice differentiable at x then Hess f(x) $$\succeq $$ 0 (positive semidefinite).

Proof. See Yang et al. (2014, Rem. 4.2 and Cor. 4.2).

A point $$x\in \mathcal{M}$$ that satisfies grad f(x) = 0 is a (first-order) critical point (also called a stationary point). If x furthermore satisfies Hess f(x) $$\succeq $$ 0, it is a second-order critical point.

Existing theory for optimization algorithms on manifolds is mostly concerned with establishing global convergence to critical points without rates (where global means regardless of initialization), as well as local rates of convergence. For example, gradient descent is known to converge globally to critical points, and the convergence rate is linear once the iterates reach a sufficiently small neighborhood of the limit point (Absil et al., 2008, Chapter 4). Early work of Udriste (1994) on local convergence rates even bounds distance to optimizers as a function of iteration count, assuming initialization in a set where the Hessian of f is positive definite, with lower and upper bounds on the eigenvalues; see also Absil et al. (2008, Thm. 4.5.6, Thm. 7.4.11). Such guarantees adequately describe the empirical behavior of those methods, but give no information about how many iterations are required to reach the local regime from an arbitrary initial point x0; that is, the worst-case scenarios are not addressed.

For classical unconstrained nonlinear optimization this caveat has been addressed by bounding the number of iterations required by known algorithms to compute points that satisfy necessary optimality conditions within some tolerance, without assumptions on the initial iterate. Among others, Nesterov (2004) gives a proof that, for $$\mathcal{M} = {\mathbb{R}^{n}}$$ and Lipschitz differentiable f, gradient descent with an appropriate step size computes a point x where ∥grad f(x)∥ ≤ ε in $$\mathcal{O}\big (1/\varepsilon ^{2}\big )$$ iterations. This is sharp (Cartis et al., 2010).
Cartis et al. (2012) prove the same for trust-region methods, and further show that if f is twice Lipschitz continuously differentiable, then a point x where ∥grad f(x)∥ ≤ ε and Hess f(x) $$\succeq $$ −ε Id is computed in $$\mathcal{O}\big(1/\varepsilon ^{3}\big)$$ iterations, also with examples showing sharpness.

In this paper we extend the unconstrained results to the larger class of optimization problems on manifolds (P). This work builds upon the original proofs (Nesterov, 2004; Cartis et al., 2012) and on existing adaptations of gradient descent and trust-region methods to manifolds (Absil et al., 2007, 2008). One key step is the identification of a set of relevant Lipschitz-type regularity assumptions, which allow the proofs to carry over from $${\mathbb{R}^{n}}$$ to $$\mathcal{M}$$ with relative ease.

Main results

We state the main results here informally. We use the notion of retraction Retrx (see Definition 2.1), which maps tangent vectors at x to points on $$\mathcal{M}$$. Iterates are related by $$x_{k+1} = \textrm{Retr}_{x_{k}}(\eta _{k})$$ for some tangent vector ηk at xk (the step). Hence, f ∘ Retrx is a lift of the cost function from $$\mathcal{M}$$ to the tangent space at x. For $$\mathcal{M} = {\mathbb{R}^{n}}$$, the standard retraction gives $$\textrm{Retr}_{x_{k}}(\eta _{k}) = x_{k} + \eta _{k}$$. By ∥⋅∥ we denote the norm associated to the Riemannian metric.

About gradient descent. (See Theorems 2.5 and 2.8.) For problem (P), if f is bounded below on $$\mathcal{M}$$ and f ∘ Retrx has Lipschitz gradient with constant Lg independent of x, then Riemannian gradient descent with constant step size 1/Lg or with backtracking Armijo line-search returns x with ∥grad f(x)∥≤ ε in $$\mathcal{O}\big(1/\varepsilon ^{2}\big)$$ iterations.

About trust regions. (See Theorem 3.9.) For problem (P), if f is bounded below on $$\mathcal{M}$$ and f ∘ Retrx has Lipschitz gradient with constant independent of x then Riemannian trust region (RTR) returns x with ∥grad f(x)∥ ≤ εg in $$\mathcal{O}\big (1/{\varepsilon _{g}^{2}}\big )$$ iterations, under weak assumptions on the model quality. If furthermore f ∘ Retrx has Lipschitz Hessian with constant independent of x then RTR returns x with ∥grad f(x)∥ ≤ εg and Hess f(x) $$\succeq -\varepsilon _{H}$$ Id in $$\mathcal{O}\big (\max \big \{1/{\varepsilon _{H}^{3}},\ 1/\big({\varepsilon _{g}^{2}} \varepsilon _{H}\big)\big \} \big )$$ iterations, provided the true Hessian is used in the model and a second-order retraction is used.

About compact submanifolds. (See Lemmas 2.7 and 3.3.) The first-order regularity conditions above hold in particular if $$\mathcal{M}$$ is a compact submanifold of a Euclidean space $$\mathcal{E}$$ (such as $${\mathbb{R}^{n}}$$) and $$f \colon \mathcal{E} \to{\mathbb{R}}$$ has a locally Lipschitz continuous gradient. The second-order regularity conditions hold if furthermore f has a locally Lipschitz continuous Hessian on $$\mathcal{E}$$ and the retraction is second order (Definition 3.10).

Since the rates $$\mathcal{O}\big(1/\varepsilon ^{2}\big)$$ and $$\mathcal{O}\big(1/\varepsilon ^{3}\big)$$ are sharp for gradient descent and trust regions when $$\mathcal{M} = {\mathbb{R}^{n}}$$ (Cartis et al., 2010, 2012), they are also sharp for $$\mathcal{M}$$ a generic Riemannian manifold. Below, constants are given explicitly, thus precisely bounding the total amount of work required in the worst case to attain a prescribed tolerance.
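Because the constants are explicit, the bounds can be evaluated directly. The short sketch below (with made-up values of f(x0) − f*, Lg and the tolerance, purely for illustration) evaluates the worst-case iteration count ⌈2(f(x0) − f*)Lg/ε²⌉ from the gradient descent result and shows the 1/ε² scaling.

```python
import math

# Worst-case iteration bound of the fixed step-size result (Theorem 2.8),
# ceil(2 (f(x0) - f*) L_g / eps^2), for made-up illustrative constants.
gap = 10.0   # assumed upper bound on f(x0) - f*
L_g = 2.0    # assumed Lipschitz-type constant of the pullbacks

for eps in (1e-1, 1e-2, 1e-3):
    bound = math.ceil(2 * gap * L_g / eps ** 2)
    print(f"eps = {eps:.0e}: at most {bound} iterations")
# Dividing eps by 10 multiplies the bound by 100: the 1/eps^2 behaviour above.
```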
The theorems presented here are the first deterministic results about the worst-case iteration complexity of computing (approximate) first- and second-order critical points on manifolds. The choice of analysing Riemannian gradient descent and RTR first is guided by practical concerns, as these are among the most commonly used methods on manifolds so far. The proposed complexity bounds are particularly relevant when applied to problems for which second-order necessary optimality conditions are also sufficient. See for example Sun et al. (2015, 2016), Boumal (2015b, 2016), Bandeira et al. (2016), Bhojanapalli et al. (2016), Ge et al. (2016) and the example in Section 4.

Related work

The complexity of Riemannian optimization is discussed in a few recent lines of work. Zhang & Sra (2016) treat geodesically convex problems over Hadamard manifolds. This is a remarkable extension of important pieces of classical convex optimization theory to manifolds with negative curvature. Because of the focus on geodesically convex problems, those results do not apply to the more general problem (P), but have the clear advantage of guaranteeing global optimality.

In Zhang et al. (2016), which appeared a day before the present paper on public repositories, the authors also study the iteration complexity of nonconvex optimization on manifolds. Their results differ from the ones presented here in that they focus on stochastic optimization algorithms, aiming for first-order conditions. Their results assume bounded curvature for the manifold. Furthermore, their analysis relies on the Riemannian exponential map, whereas we cover the more general class of retraction maps (which is computationally advantageous). We also do not use the notions of Riemannian parallel transport or logarithmic map, which, in our view, makes for a simpler analysis.

Sun et al. (2015, 2016) consider dictionary learning and phase retrieval and show that these problems, when appropriately framed as optimization on a manifold, are low-dimensional and have no spurious local optimizers. They derive the complexity of RTR specialized to their application. In particular, they combine the global rate with a local convergence rate, which allows them to establish an overall better complexity than $$\mathcal{O}\big(1/\varepsilon ^{3}\big)$$, but with an idealized version of the algorithm and restricted to these relevant applications. In this paper we favor a more general approach, focused on algorithms closer to the ones implemented in practice. Recent work by Bento et al. (2017) (which appeared after a first version of this paper) focuses on iteration complexity of gradient, subgradient and proximal point methods for the case of convex cost functions on manifolds, using the exponential map as retraction.

For the classical, unconstrained case, optimal complexity bounds of order $$\mathcal{O}\big(1/\varepsilon ^{1.5}\big)$$ to generate x with ∥grad f(x)∥ ≤ ε have also been given for cubic regularization methods (Cartis et al., 2011a, b) and sophisticated trust region variants (Curtis et al., 2016). Bounds for regularization methods can be further improved if higher-order derivatives are available (Birgin et al., 2017). Worst-case evaluation complexity bounds have been extended to constrained smooth problems in Cartis et al. (2014, 2015a,b). There, it is shown that some carefully devised, albeit impractical, phase 1–phase 2 methods can compute approximate KKT points with global rates of convergence of the same order as in the unconstrained case.
We note that when the constraints are convex (but the objective may not be), practical, feasible methods have been devised (Cartis et al., 2015a) that connect to our approach below. Second-order optimality for the case of convex constraints with nonconvex cost has been recently addressed in Cartis et al. (2016).

2. Riemannian gradient descent methods

Consider the generic Riemannian descent method described in Algorithm 1. We first prove that, provided sufficient decrease in the cost function is achieved at each iteration, the algorithm computes a point xk such that ∥grad f(xk)∥ ≤ ε with $$k = \mathcal{O}\big(1/\varepsilon ^{2}\big)$$. Then we propose a Lipschitz-type assumption, which is sufficient to guarantee that simple strategies to pick the steps ηk indeed ensure sufficient decrease. The proofs parallel the standard ones (Nesterov, 2004, Sect. 1.2.3). The main novelty is the careful extension to the Riemannian setting, which requires the well-known notion of retraction (Definition 2.1) and the new Assumption 2.6 (see below).

The step ηk is a tangent vector to $$\mathcal{M}$$ at xk. Because $$\mathcal{M}$$ is nonlinear (in general) the operation xk + ηk is undefined. The notion of retraction provides a theoretically sound replacement. Informally, $$x_{k+1} = \textrm{Retr}_{x_{k}}(\eta _{k})$$ is a point on $$\mathcal{M}$$ that one reaches by moving away from xk, along the direction ηk, while remaining on the manifold. The Riemannian exponential map (which generates geodesics) is a retraction. The crucial point is that many other maps are retractions, often far less difficult to compute than the exponential. The definition of retraction below can be traced back to Shub (1986) and it appears under that name in Adler et al. (2002); see also Absil et al. (2008, Def. 4.1.1 and Sect. 4.10) for additional references.

Definition 2.1 (Retraction). A retraction on a manifold $$\mathcal{M}$$ is a smooth mapping Retr from the tangent bundle $$\mathrm{T}\mathcal{M}$$ to $$\mathcal{M}$$ with the following properties. Let $$\textrm{Retr}_{x} \colon \mathrm{T}_{x}\mathcal{M} \to \mathcal{M}$$ denote the restriction of Retr to $$\mathrm{T}_{x}\mathcal{M}$$: (i) Retrx(0x) = x, where 0x is the zero vector in $$\mathrm{T}_{x}\mathcal{M}$$; (ii) the differential of Retrx at 0x, D Retrx(0x), is the identity map.

These combined conditions ensure retraction curves t ↦ Retrx(tη) agree up to first order with geodesics passing through x with velocity η, around t = 0. Sometimes we allow Retrx to be defined only locally, in a closed ball of radius ϱ(x) > 0 centered at 0x in $$\mathrm{T}_{x}\mathcal{M}$$. In linear spaces such as $${\mathbb{R}^{n}}$$ the typical choice is Retrx(η) = x + η. On the sphere, a popular choice is $$\textrm{Retr}_{x}(\eta ) = \frac{x+\eta }{\|x+\eta \|}$$.

Remark 2.2 If the retraction at xk is defined only in a ball of radius ϱk = ϱ(xk) around the origin in $$\mathrm{T}_{x_{k}}\mathcal{M}$$, we limit the size of step ηk to ϱk. Theorems in this section provide a complexity result provided $$\varrho = \inf _{k} \varrho _{k}> 0$$. If the injectivity radius of the manifold is positive, retractions satisfying the condition $$\inf _{x\in \mathcal{M}} \varrho (x)> 0$$ exist. In particular, compact manifolds have positive injectivity radius (Chavel, 2006, Thm. III.2.3). The option to limit the step sizes is also useful when the constant Lg in Assumption 2.6 below does not exist globally.
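As a concrete illustration of Definition 2.1 (a sketch, not part of the paper), the code below implements the sphere retraction Retrx(η) = (x + η)/∥x + η∥ mentioned above and checks properties (i) and (ii) numerically, the latter with a central finite difference.

```python
import numpy as np

def retr(x, eta):
    """Projective retraction on the unit sphere: Retr_x(eta) = (x+eta)/||x+eta||."""
    y = x + eta
    return y / np.linalg.norm(y)

rng = np.random.default_rng(1)
x = rng.standard_normal(4); x /= np.linalg.norm(x)
eta = rng.standard_normal(4); eta -= (x @ eta) * x   # tangent vector: <x, eta> = 0

# (i) Retr_x(0_x) = x
print(np.allclose(retr(x, np.zeros_like(x)), x))

# (ii) d/dt Retr_x(t * eta) at t = 0 equals eta (finite-difference check)
t = 1e-6
deriv = (retr(x, t * eta) - retr(x, -t * eta)) / (2 * t)
print(np.linalg.norm(deriv - eta))   # close to 0, up to discretization error
```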
The two central assumptions and a general theorem about Algorithm 1 follow.

Assumption 2.3 (Lower bound). There exists $$f^{\ast }> -\infty $$ such that f(x) ≥ f* for all $$x \in \mathcal{M}$$.

Assumption 2.4 (Sufficient decrease). There exist $$c,c^{\prime}> 0$$ such that, for all k ≥ 0, \begin{align*} f(x_{k}) - f(x_{k+1}) \geq \min\left( c \|\textrm{grad}\ f(x_{k})\|, c^{\prime} \right) \|\textrm{grad}\ f(x_{k})\|. \end{align*}

Algorithm 1 Generic Riemannian descent algorithm

Theorem 2.5 Under Assumptions 2.3 and 2.4, Algorithm 1 returns $$x\in \mathcal{M}$$ satisfying f(x) ≤ f(x0) and ∥grad f(x)∥ ≤ ε in at most \begin{align*} \left\lceil \frac{f(x_{0})-f^{\ast}}{c} \cdot \frac{1}{\varepsilon^{2}} \right\rceil \end{align*} iterations, provided $$\varepsilon \leq \frac{c^{\prime }}{c}$$. If $$\varepsilon> \frac{c^{\prime }}{c}$$, at most $$\left \lceil \frac{f(x_{0})-f^{\ast }}{c^{\prime }} \cdot \frac{1}{\varepsilon } \right \rceil $$ iterations are required.

Proof. If Algorithm 1 executes K iterations without terminating, then ∥grad f(xk)∥ > ε for all k in 0, …, K − 1. Then using Assumptions 2.3 and 2.4 in a classic telescoping sum argument gives \begin{align*} f(x_{0}) - f^{\ast} \geq f(x_{0}) - f(x_{K}) & = \sum_{k=0}^{K-1} f(x_{k}) - f(x_{k+1})> K \min(c\varepsilon, c^{\prime}) \varepsilon. \end{align*} By contradiction, the algorithm must have terminated if $$K \geq \frac{f(x_{0})-f^{\ast }}{\min (c\varepsilon , c^{\prime })\varepsilon }$$.

To ensure Assumption 2.4 with simple rules for the choice of ηk it is necessary to restrict the class of functions f. For the particular case $$\mathcal{M} = {\mathbb{R}^{n}}$$ and Retrx(η) = x + η, the classical assumption is to require f to have a Lipschitz continuous gradient (Nesterov, 2004), that is, existence of Lg such that \begin{align} \forall\ x, y \in{\mathbb{R}^{n}}, \quad \| \textrm{grad} \ f(x) - \textrm{grad}\ f(y) \| \leq L_{g} \|x - y\|. \end{align} (2.1) As we argue momentarily, generalizing this property to manifolds is impractical. On the other hand, it is well known that (2.1) implies (see for example Nesterov, 2004, Lemma 1.2.3; see also Berger, 2017, Appendix A for a converse): \begin{align} \forall\ x, y \in{\mathbb{R}^{n}}, \quad \left|\, f(y) - \left[\, f(x) + \left\langle{y-x},{\textrm{grad}\ f(x)}\right\rangle \right] \right| \leq \frac{L_{g}}{2} \|y-x\|^{2}. \end{align} (2.2) It is the latter we adapt to manifolds.

Consider the pullback $$\hat f_{x} = f \circ \textrm{Retr}_{x} \colon \mathrm{T}_{x}\mathcal{M} \to{\mathbb{R}}$$, conveniently defined on a vector space. It follows from the definition of retraction that $$\textrm{grad}\, \hat f_{x}(0_{x}) = \textrm{grad}\ f(x)$$. Thinking of x as xk and of y as $$\textrm{Retr}_{{x_{k}}}(\eta )$$, we require the following.

Assumption 2.6 (Restricted Lipschitz-type gradient for pullbacks). There exists Lg ≥ 0 such that, for all xk among x0, x1, . . . generated by a specified algorithm, the composition $$\hat f_{k} = f \circ \textrm{Retr}_{x_{k}}$$ satisfies \begin{align} \left| \, \hat f_{k}(\eta) - \left[f(x_{k}) + \left\langle{\eta},{\textrm{grad}\ f(x_{k})}\right\rangle\right] \right| \leq \frac{L_{g}}{2} \|\eta\|^{2} \end{align} (2.3) for all $$\eta \in \mathrm{T}_{x_{k}}\mathcal{M}$$ such that ∥η∥≤ ϱk.

In words, the pullbacks $$\hat f_{k}$$, possibly restricted to certain balls, are uniformly well approximated by their first-order Taylor expansions around the origin. To the best of our knowledge, this specific assumption has not been used to analyse convergence of optimization algorithms on manifolds before.
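To illustrate Assumption 2.6 numerically (a sketch under the same toy assumptions as in the earlier snippets: the unit sphere with the projective retraction and f(x) = ½ xᵀAx, a smooth cost on a compact submanifold, so Lemma 2.7 below guarantees a finite Lg exists), the code below measures the left-hand side of (2.3) against ½∥η∥² for randomly sampled tangent vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2

f = lambda x: 0.5 * x @ A @ x
def rgrad(x):                      # Riemannian gradient on the sphere
    g = A @ x
    return g - (x @ g) * x
def retr(x, eta):                  # projective retraction
    y = x + eta
    return y / np.linalg.norm(y)

x = rng.standard_normal(n); x /= np.linalg.norm(x)

# Ratio |f(Retr_x(eta)) - f(x) - <eta, grad f(x)>| / (0.5 ||eta||^2):
# Assumption 2.6 asks that this stay below some L_g for all admissible eta
# (here we only probe randomly sampled tangent vectors).
ratios = []
for _ in range(1000):
    eta = rng.standard_normal(n); eta -= (x @ eta) * x
    eta *= rng.uniform(1e-3, 2.0) / np.linalg.norm(eta)
    lhs = abs(f(retr(x, eta)) - f(x) - eta @ rgrad(x))
    ratios.append(lhs / (0.5 * np.linalg.norm(eta) ** 2))
print("largest observed ratio (a lower estimate of L_g):", max(ratios))
```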
As will become clear, Assumption 2.6 allows for simple extensions of existing proofs in $${\mathbb{R}^{n}}$$. Notice that if each $$\hat f_{k}$$ has a Lipschitz continuous gradient with constant Lg independent of k, then Assumption 2.6 holds, but the reverse is not necessarily true, as Assumption 2.6 gives a special role to the origin. In this sense the condition on $$\hat f_{k}$$ is weaker than Lipschitz continuity of the gradient of $$\hat f_{k}$$. On the other hand, we are requiring this condition to hold for all xk with the same constant Lg. This is why we call the condition Lipschitz type rather than Lipschitz.

The following lemma states that if $$\mathcal{M}$$ is a compact submanifold of $${\mathbb{R}^{n}}$$ then a sufficient condition for Assumption 2.6 to hold is for $$f\colon{\mathbb{R}^{n}}\to{\mathbb{R}}$$ to have a locally Lipschitz continuous gradient (so that it has a Lipschitz continuous gradient on any compact subset of $${\mathbb{R}^{n}}$$). The proof is in Appendix B.

Lemma 2.7 Let $$\mathcal{E}$$ be a Euclidean space (for example, $$\mathcal{E} = {\mathbb{R}^{n}}$$) and let $$\mathcal{M}$$ be a compact Riemannian submanifold of $$\mathcal{E}$$. Let Retr be a retraction on $$\mathcal{M}$$ (globally defined). If $$f \colon \mathcal{E} \to{\mathbb{R}}$$ has Lipschitz continuous gradient in the convex hull of $$\mathcal{M}$$, then the pullbacks f ∘ Retrx satisfy (2.3) globally with some constant Lg independent of x; hence, Assumption 2.6 holds for any sequence of iterates and with $$\varrho _{k} = \infty $$ for all k.

There are mainly two difficulties with generalizing (2.1) directly to manifolds. First, grad f(x) and grad f(y) live in two different tangent spaces, so that their difference is not defined; instead, grad f(x) must be transported to $$\mathrm{T}_{y}\mathcal{M}$$, which requires the introduction of a parallel transport $$\mathrm{P}_{x \rightarrow y} \colon \mathrm{T}_{x}\mathcal{M} \to \mathrm{T}_{y}\mathcal{M}$$ along a minimal geodesic connecting x and y. Second, the right-hand side ∥x − y∥ should become dist(x, y): the geodesic distance on $$\mathcal{M}$$. Both notions involve subtle definitions, and transports may not be defined on all of $$\mathcal{M}$$. Overall, the resulting assumption would read that there exists Lg such that \begin{align} \forall\ x, y \in \mathcal{M}, \quad \| \mathrm{P}_{x\to y} \textrm{grad} \ f(x) - \textrm{grad} \ f(y) \| \leq L_{g} \textrm{dist}(x, y). \end{align} (2.4) It is of course possible to work with (2.4)—see for example Absil et al. (2008, Def. 7.4.3) and recent work of Zhang & Sra (2016) and Zhang et al. (2016)—but we argue that it is conceptually and computationally advantageous to avoid it if possible. The computational advantage comes from the freedom in Assumption 2.6 to work with any retraction, whereas parallel transport and geodesic distance are tied to the exponential map. We note that, if the retraction is the exponential map, then it is known that Assumption 2.6 holds if (2.4) holds—see for example Bento et al. (2017, Def. 2.2 and Lemma 2.1).

2.1. Fixed step-size gradient descent method

Leveraging the regularity Assumption 2.6, an easy strategy is to pick the step ηk as a fixed scaling of the negative gradient, possibly restricted to a ball of radius ϱk.

Theorem 2.8 (Riemannian gradient descent with fixed step size).
Under Assumptions 2.3 and 2.6, Algorithm 1 with the explicit strategy \begin{align*} \eta_{k} = -\min\left( \frac{1}{L_{g}}, \frac{\varrho_{k}}{\|\textrm{grad}\ f(x_{k})\|} \right) \textrm{grad}\ f(x_{k}) \end{align*} returns a point $$x\in \mathcal{M}$$ satisfying f(x) ≤ f(x0) and ∥grad f(x)∥≤ ε in at most \begin{align*} \left\lceil 2\left(f(x_{0}) - f^{\ast}\right)L_{g} \cdot \frac{1}{\varepsilon^{2}} \right\rceil \end{align*} iterations provided ε ≤ ϱLg, where $$\varrho = \inf _{k} \varrho _{k}$$. If ε > ϱLg the algorithm succeeds in at most $$\big \lceil 2\left (f(x_{0}) - f^{\ast }\right ) \frac{1}{\varrho } \cdot \frac{1}{\varepsilon } \big \rceil $$ iterations. Each iteration requires one cost and gradient evaluation and one retraction.

Proof. The regularity Assumption 2.6 provides an upper bound for the pullback for all k: \begin{align} \forall\ \eta \in \mathrm{T}_{x_{k}}\mathcal{M}\ \textrm{with}\ \|\eta\| \leq \varrho_{k}, \quad f(\textrm{Retr}_{x_{k}}(\eta)) \leq f(x_{k}) + \left\langle{\eta},{\textrm{grad}\ f(x_{k})}\right\rangle + \frac{L_{g}}{2} \|\eta\|^{2}. \end{align} (2.5) For the given choice of ηk and using $$x_{k+1} = \textrm{Retr}_{x_{k}}(\eta _{k})$$, it follows easily that \begin{align*} f(x_{k}) - f(x_{k+1}) \geq \min\left( \frac{\|\textrm{grad}\ f(x_{k})\|}{L_{g}}, \varrho_{k} \right) \left[ 1 - \frac{L_{g}}{2} \min\left( \frac{1}{L_{g}}, \frac{\varrho_{k}}{\|\textrm{grad}\ f(x_{k})\|} \right) \right] \|\textrm{grad}\ f(x_{k})\|. \end{align*} The term in brackets is at least 1/2. Thus, Assumption 2.4 holds with $$c = \frac{1}{2L_{g}}$$ and $$c^{\prime } = \frac{\varrho }{2}$$, allowing us to conclude with Theorem 2.5.

Corollary 2.9 If there are no step-size restrictions in Theorem 2.8 ($$\varrho _{k} \equiv \infty $$), the explicit strategy \begin{align*} \eta_{k} = -\frac{1}{L_{g}} \textrm{grad}\ f(x_{k}) \end{align*} returns a point $$x\in \mathcal{M}$$ satisfying f(x) ≤ f(x0) and ∥grad f(x)∥≤ ε in at most \begin{align*} \left\lceil 2\left(f(x_{0}) - f^{\ast}\right)L_{g} \cdot \frac{1}{\varepsilon^{2}} \right\rceil \end{align*} iterations for any ε > 0.
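For concreteness, here is a minimal sketch (not the paper's code) of Algorithm 1 with the fixed step size of Corollary 2.9, ηk = −(1/Lg) grad f(xk), on the toy sphere problem used in the previous snippets; the retraction is globally defined, so ϱk = ∞ and no step-size cap is needed. The value of Lg used here is a crude guess rather than a certified constant.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
A = rng.standard_normal((n, n)); A = (A + A.T) / 2

f = lambda x: 0.5 * x @ A @ x
def rgrad(x):
    g = A @ x
    return g - (x @ g) * x
def retr(x, eta):
    y = x + eta
    return y / np.linalg.norm(y)

L_g = 2 * np.linalg.norm(A, 2)     # crude guess for the Lipschitz-type constant
eps = 1e-6

x = rng.standard_normal(n); x /= np.linalg.norm(x)
for k in range(10_000):
    g = rgrad(x)
    if np.linalg.norm(g) <= eps:   # approximate first-order criticality reached
        break
    x = retr(x, -g / L_g)          # x_{k+1} = Retr_{x_k}(-grad f(x_k) / L_g)

print(f"stopped after {k} iterations, ||grad f(x)|| = {np.linalg.norm(rgrad(x)):.2e}")
# The theory only guarantees an approximate critical point; on this toy problem
# the value found can be compared with the global optimum lambda_min(A)/2.
print("f(x) =", f(x), " vs  lambda_min(A)/2 =", np.linalg.eigvalsh(A)[0] / 2)
```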
2.2. Gradient descent with backtracking Armijo line-search

The following lemma shows that a basic Armijo-type backtracking line-search, Algorithm 2, computes a step ηk satisfying Assumption 2.4 in a bounded number of function calls, without the need to know Lg. The statement allows search directions other than −grad f(xk), provided they remain ‘related’ to −grad f(xk). This result is well known in the Euclidean case and carries over seamlessly under Assumption 2.6.

Algorithm 2 Backtracking Armijo line-search

Lemma 2.10 For each iteration k of Algorithm 1, let $${\eta _{k}^{0}} \in \mathrm{T}_{x_{k}}\mathcal{M}$$ be the initial search direction to be considered for line-search. Assume there exist constants c2 ∈ (0, 1] and 0 < c3 ≤ c4 such that, for all k, \begin{align*} \left\langle{-\textrm{grad}\ f(x_{k})},{{\eta_{k}^{0}}}\right\rangle \geq c_{2} \|\textrm{grad}\ f(x_{k})\|\left\|{\eta_{k}^{0}}\right\| \qquad \textrm{and} \qquad c_{3} \|\textrm{grad}\ f(x_{k})\| \leq \left\|{\eta_{k}^{0}}\right\| \leq c_{4} \|\textrm{grad}\ f(x_{k})\|. \end{align*} Under Assumption 2.6, backtracking Armijo (Algorithm 2) with initial step size $${\bar t}_{k}$$ such that $${\bar t}_{k} \left \|{\eta _{k}^{0}}\right \| \leq \varrho _{k}$$ returns a positive t and $$\eta _{k} = t{\eta _{k}^{0}}$$ such that \begin{align} f(x_{k}) - f\left(\textrm{Retr}_{x_{k}}(\eta_{k})\right) \geq c_{1}c_{2}c_{3}t\|\textrm{grad}\ f(x_{k})\|^{2} \qquad \textrm{and} \qquad t \geq \min\left({\bar t}_{k}, \frac{2\tau c_{2} (1-c_{1})}{c_{4}L_{g}}\right) \end{align} (2.6) in \begin{align*} 1+\log_{\tau}\left(t/{\bar t}_{k}\right) \leq \max\left( 1, 2 + \left\lceil \log_{\tau^{-1}}\left(\frac{c_{4}{\bar t}_{k}L_{g}}{2c_{2}(1-c_{1})}\right) \right\rceil \right) \end{align*} retractions and cost evaluations (not counting evaluation of f at xk).

Proof. See Appendix C.

The previous discussion can be particularized to bound the amount of work required by a gradient descent method using a backtracking Armijo line-search on manifolds. The constant Lg appears in the bounds but need not be known. Note that, at iteration k, the last cost evaluation of the line-search algorithm is the cost at xk+1: it need not be recomputed.

Theorem 2.11 (Riemannian gradient descent with backtracking line-search). Under Assumptions 2.3 and 2.6, Algorithm 1 with Algorithm 2 for line-search using initial search direction $${\eta _{k}^{0}} = -\textrm{grad}\ f(x_{k})$$ with parameters c1, τ and $${\bar t}_{k} \triangleq \min \left ({\bar t}, \varrho _{k} / \|\textrm{grad}\ f(x_{k})\| \right )$$ for some $$\bar t> 0$$ returns a point $$x\in \mathcal{M}$$ satisfying f(x) ≤ f(x0) and ∥grad f(x)∥≤ ε in at most \begin{align*} \left\lceil \frac{f(x_{0}) - f^{\ast}}{c_{1} \min\left(\bar t, \frac{2\tau(1-c_{1})}{L_{g}}\right)} \cdot \frac{1}{\varepsilon^{2}} \right\rceil \end{align*} iterations, provided $$\varepsilon \leq \frac{\varrho }{\min \left (\bar t, \frac{2\tau (1-c_{1})}{L_{g}} \right )}$$, where $$\varrho = \inf _{k} \varrho _{k}$$. If ε exceeds this threshold, the algorithm succeeds in at most $$\big \lceil \frac{f(x_{0}) - f^{\ast }}{c_{1} \varrho } \cdot \frac{1}{\varepsilon } \big \rceil $$ iterations. After computing f(x0) and grad f(x0), each iteration requires one gradient evaluation and at most $$\max \big ( 1, 2 + \big \lceil \log _{\tau ^{-1}}\big (\frac{{\bar t}L_{g}}{2(1-c_{1})}\big ) \big \rceil \big )$$ cost evaluations and retractions.

Proof. Using $${\eta _{k}^{0}} = -\textrm{grad}\ f(x_{k})$$, one can take c2 = c3 = c4 = 1 in Lemma 2.10. Equation (2.6) in that lemma combined with the definition of $${\bar t}_{k}$$ ensures \begin{align*} f(x_{k}) - f(x_{k+1}) \geq c_{1} \min\left({\bar t}, \frac{2\tau(1-c_{1})}{L_{g}}, \frac{\varrho_{k}}{\|\textrm{grad}\ f(x_{k})\|} \right) \| \textrm{grad}\ f(x_{k}) \|^{2}. \end{align*} Thus, Assumption 2.4 holds with $$c = c_{1} \min \left ({\bar t}, \frac{2\tau (1-c_{1})}{L_{g}} \right )$$ and c′ = c1ϱ. Conclude with Theorem 2.5.
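Before moving on to trust regions, here is a sketch of the backtracking procedure of Algorithm 2 with search direction ηk⁰ = −grad f(xk), in the same hedged spirit as the previous snippets (the function names f, retr and rgrad refer to the toy sphere setup introduced earlier and are assumptions of this example, not the paper's implementation): starting from an initial step t̄, the step is multiplied by τ until the Armijo decrease condition with parameter c1 holds.

```python
import numpy as np

def armijo_linesearch(f, retr, x, g, t_bar=1.0, c1=1e-4, tau=0.5, max_backtracks=50):
    """Backtracking Armijo line-search along eta0 = -g (a sketch of Algorithm 2).

    Returns t and the new point Retr_x(-t g) such that
    f(x) - f(Retr_x(-t g)) >= c1 * t * ||g||^2.
    """
    fx, sq_norm = f(x), g @ g
    t = t_bar
    for _ in range(max_backtracks):
        x_new = retr(x, -t * g)
        if fx - f(x_new) >= c1 * t * sq_norm:   # sufficient decrease reached
            return t, x_new
        t *= tau                                # shrink the step and try again
    return t, x_new   # give up after max_backtracks; under Assumption 2.6 this
                      # should not happen, by Lemma 2.10

# Usage with the sphere setup from the previous snippet:
# g = rgrad(x); t, x = armijo_linesearch(f, retr, x, g)
```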
3. RTR methods

The RTR method is a generalization of the classical trust-region method to manifolds (Absil et al., 2007; Conn et al., 2000)—see Algorithm 3. The algorithm is initialized with a point $$x_{0}\in \mathcal{M}$$ and a trust-region radius Δ0. At iteration k, the pullback $$\hat f_{k} = f \circ \textrm{Retr}_{x_{k}}$$ is approximated by a model $$\hat m_{k} \colon \mathrm{T}_{x_{k}}\mathcal{M} \to{\mathbb{R}}$$, \begin{align} \hat m_{k}(\eta) = f(x_{k}) + \left\langle{\eta},{\textrm{grad}\ f(x_{k})}\right\rangle + \tfrac{1}{2} \left\langle{\eta},{H_{k}[\eta]}\right\rangle, \end{align} (3.1) where $$H_{k} \colon \mathrm{T}_{x_{k}}\mathcal{M} \to \mathrm{T}_{x_{k}}\mathcal{M}$$ is a map chosen by the user. The tentative step ηk is obtained by approximately solving the associated trust-region subproblem: \begin{align} \min_{\eta \in \mathrm{T}_{x_{k}}\mathcal{M}} \ \hat m_{k}(\eta) \quad \textrm{subject to} \quad \|\eta\| \leq \Delta_{k}. \end{align} (3.2) The candidate next iterate $$x_{k}^{+} = \textrm{Retr}_{x_{k}}(\eta _{k})$$ is accepted ($$x_{k+1} = x_{k}^{+}$$) if the actual cost decrease $$f(x_{k}) - f\left (x_{k}^{+}\right )$$ is a sufficiently large fraction of the model decrease $$\hat m_{k}\left (0_{x_{k}}\right ) - \hat m_{k}(\eta _{k})$$. Otherwise, the candidate is rejected (xk+1 = xk). Depending on the level of agreement between the model decrease and the actual decrease, the trust-region radius Δk can be reduced, kept unchanged or increased, but never above some parameter $$\bar \Delta $$. The parameter $$\bar \Delta $$ can be used in particular in case of a nonglobally defined retraction or if the regularity conditions on the pullbacks hold only locally.

We establish worst-case iteration complexity bounds for the computation of points $$x\in \mathcal{M}$$ such that ∥grad f(x)∥≤ εg and Hess f(x) ≽−εH Id, where Hess f(x) is the Riemannian Hessian of f at x. Besides Lipschitz-type conditions on the problem itself, essential algorithmic requirements are that (i) the models $$\hat m_{k}$$ should agree sufficiently with the pullbacks $$\hat f_{k}$$ (locally) and (ii) sufficient decrease in the model should be achieved at each iteration. The analysis presented here is a generalization of the one in Cartis et al. (2012) to manifolds.

3.1. Regularity conditions

In what follows, for iteration k, we make assumptions involving the ball of radius $$\Delta _{k} \leq \bar{\Delta }$$ around $$0_{x_{k}}$$ in the tangent space at xk. If Retrx is defined only in a ball of radius ϱ(x), one (conservative) strategy to ensure ϱk ≥Δk as required in the assumption below is to set $$\bar \Delta \leq \inf _{x\in \mathcal{M} : f(x) \leq f(x_{0})} \varrho (x)$$, provided this is positive (see Remark 2.2).

Assumption 3.1 (Restricted Lipschitz-type gradient for pullbacks). Assumption 2.6 holds in the respective trust regions of the iterates produced by Algorithm 3, that is, with ϱk ≥Δk.

Algorithm 3 RTR, modified to attain second-order optimality

Assumption 3.2 (Restricted Lipschitz-type Hessian for pullbacks). If $$\varepsilon _{H} < \infty $$ there exists LH ≥ 0 such that, for all xk among x0, x1, . . . generated by Algorithm 3 and such that ∥grad f(xk)∥≤ εg, $$\hat f_{k}$$ satisfies \begin{align} \left| \hat f_{k}(\eta) - \left[ f(x_{k}) + \left\langle{\eta},{\textrm{grad}\ f(x_{k})}\right\rangle + \frac{1}{2}\left\langle{\eta},{\nabla^{2} \hat f_{k}(0_{x_{k}})[\eta]}\right\rangle \right] \right| \leq \frac{L_{H}}{6} \|\eta\|^{3} \end{align} (3.4) for all $$\eta \in \mathrm{T}_{x_{k}}\mathcal{M}$$ such that ∥η∥≤Δk.
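As a numerical illustration of the cubic bound (3.4) (a sketch under the usual toy assumptions: unit sphere, projective retraction, f(x) = ½ xᵀAx), the code below builds the quadratic model (3.1) with Hk taken to be the Riemannian Hessian of f, which for the sphere with this choice of cost is Hess f(x)[η] = Projₓ(Aη) − (xᵀAx)η; since the projective retraction is second order, this coincides with the Hessian of the pullback (see the next paragraph and Section 3.5). The model error divided by ∥η∥³ stays roughly constant as η shrinks.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2

f = lambda x: 0.5 * x @ A @ x
proj = lambda x, v: v - (x @ v) * x                    # projection onto T_x M
rgrad = lambda x: proj(x, A @ x)
rhess = lambda x, u: proj(x, A @ u) - (x @ A @ x) * u  # Riemannian Hessian (sphere)
retr = lambda x, eta: (x + eta) / np.linalg.norm(x + eta)

x = rng.standard_normal(n); x /= np.linalg.norm(x)
eta = proj(x, rng.standard_normal(n)); eta /= np.linalg.norm(eta)

# |f(Retr_x(eta)) - model(eta)| / ||eta||^3 should stay bounded, in line with
# (3.4), when H_k is the Hessian of the pullback (here: the Riemannian Hessian).
for t in (1e-1, 1e-2, 1e-3):
    e = t * eta
    model = f(x) + e @ rgrad(x) + 0.5 * (e @ rhess(x, e))
    print(t, abs(f(retr(x, e)) - model) / t ** 3)
```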
As discussed in Section 3.5 below, if Retr is a second-order retraction, then $$\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$ coincides with the Riemannian Hessian of f at xk. In the previous section, Lemma 2.7 gives a sufficient condition for Assumption 3.1 to hold; we complement this statement with a sufficient condition for Assumption 3.2 to hold as well. In a nutshell, if $$\mathcal{M}$$ is a compact submanifold of $${\mathbb{R}^{n}}$$ and $$f\colon{\mathbb{R}^{n}}\to{\mathbb{R}}$$ has locally Lipschitz continuous Hessian, then both assumptions hold.

Lemma 3.3 Let $$\mathcal{E}$$ be a Euclidean space (for example, $$\mathcal{E} = {\mathbb{R}^{n}}$$) and let $$\mathcal{M}$$ be a compact Riemannian submanifold of $$\mathcal{E}$$. Let Retr be a second-order retraction on $$\mathcal{M}$$ (globally defined). If $$f \colon \mathcal{E} \to{\mathbb{R}}$$ has Lipschitz continuous Hessian in the convex hull of $$\mathcal{M}$$, then the pullbacks f ∘ Retrx obey (3.4) with some constant LH independent of x; hence, Assumption 3.2 holds for any sequence of iterates and trust-region radii.

The proof is in Appendix B. Here too, if $$\mathcal{M}$$ is a Euclidean space and Retrx(η) = x + η, then Assumptions 3.1 and 3.2 are satisfied if f has Lipschitz continuous Hessian in the usual sense.

3.2. Assumptions about the models

The model at iteration k is the function $$\hat m_{k}$$ (3.1) whose purpose is to approximate the pullback $$\hat f_{k} = f \circ \textrm{Retr}_{x_{k}}$$. It involves a map $$H_{k} \colon \mathrm{T}_{x_{k}}\mathcal{M} \to \mathrm{T}_{x_{k}}\mathcal{M}$$. Depending on the type of step being performed (aiming for first- or second-order optimality conditions), we require different properties of the maps Hk. Conditions for first-order optimality are particularly lax.

Assumption 3.4 If ∥grad f(xk)∥ > εg (so that we are aiming only for a first-order condition at this step) then Hk is radially linear. That is, \begin{align} \forall\ \eta\in\mathrm{T}_{x_{k}}\mathcal{M}, \forall\ \alpha \geq 0, \quad H_{k}[\alpha \eta] = \alpha H_{k}[\eta]. \end{align} (3.5) Furthermore, there exists c0 ≥ 0 (the same for all first-order steps) such that \begin{align} \|H_{k}\| \triangleq \sup_{\eta \in \mathrm{T}_{x_{k}}\mathcal{M} : \|\eta\| \leq 1} \left\langle{\eta},{H_{k}[\eta]}\right\rangle \leq c_{0}. \end{align} (3.6)

Radial linearity and boundedness are sufficient to ensure first-order agreement between $$\hat m_{k}$$ and $$\hat f_{k}$$. This relaxation from complete linearity of Hk, which would be the standard assumption, notably allows the use of nonlinear finite difference approximations of the Hessian (Boumal, 2015a). To reach second-order agreement, the conditions are stronger.

Assumption 3.5 If ∥grad f(xk)∥≤ εg and $$\varepsilon _{H} < \infty $$ (so that we are aiming for a second-order condition), then Hk is linear and symmetric. Furthermore, Hk is close to $$\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$ along ηk in the sense that there exists c1 ≥ 0 (the same for all second-order steps) such that \begin{align} \left| \left\langle{\eta_{k}},{\left(\nabla^{2} \hat f_{k}(0_{x_{k}}) - H_{k}\right)[\eta_{k}]}\right\rangle \right| \leq \frac{c_{1} \Delta_{k}}{3} \|\eta_{k}\|^{2}. \end{align} (3.7)

The smaller Δk, the more precisely Hk must approximate the Hessian of the pullback along ηk. Lemma 3.6 (below) shows Δk is lower bounded in relation to εg and εH. Equation (3.7) involves ηk, the ultimately chosen step, which typically depends on Hk.
The stronger condition below does not reference ηk and yet still ensures (3.7) is satisfied: \begin{align*} \left\|\nabla^{2} \hat f_{k}\left(0_{x_{k}}\right) - H_{k}\right\| \leq \frac{c_{1} \Delta_{k}}{3}. \end{align*} Refer to Section 3.5 to relate Hk, $$\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$ and Hess f(xk).

3.3. Assumptions about sufficient model decrease

The steps ηk can be obtained in a number of ways, leading to different local convergence rates and empirical performance. As far as global convergence guarantees are concerned though, the requirements are modest. It is required only that, at each iteration, the candidate ηk induces sufficient decrease in the model. Known explicit strategies achieve these decreases. In particular, solving the trust-region subproblem (3.2) within some tolerance (which can be done in polynomial time if Hk is linear; see Vavasis, 1991, Sect. 4.3) is certain to satisfy the assumptions. The Steihaug–Toint truncated conjugate gradients method is a popular choice (Toint, 1981; Steihaug, 1983; Conn et al., 2000; Absil et al., 2007). See also Sorensen (1982) and Moré & Sorensen (1983) for more about the trust-region subproblem. Here we describe simpler yet satisfactory strategies. For first-order steps we require the following.

Assumption 3.6 There exists c2 > 0 such that, for all k such that ∥grad f(xk)∥ > εg, the step ηk satisfies \begin{align} \hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k}) & \geq c_{2} \min\left( \Delta_{k}, \frac{\varepsilon_{g}}{c_{0}} \right) \varepsilon_{g}. \end{align} (3.8)

As is well known, the explicitly computable Cauchy step satisfies this requirement. For convenience let gk = grad f(xk). By definition the Cauchy step minimizes $$\hat m_{k}$$ (3.1) in the trust region along the steepest descent direction −gk. Owing to radial linearity (Assumption 3.4), this reads \begin{align*} \min_{\alpha \geq 0} \ \hat m_{k}(-\alpha g_{k}) &= f(x_{k}) - \alpha \|g_{k}\|^{2} + \frac{\alpha^{2}}{2} \left\langle{g_{k}},{H_{k}[g_{k}]}\right\rangle \\ \textrm{such that} \ \alpha \|g_{k}\| &\leq \Delta_{k}.\nonumber \end{align*} This corresponds to minimizing a quadratic in α over the interval [0, Δk/∥gk∥]. The optimal α is easily seen to be (Conn et al., 2000) \begin{align*} {\alpha_{k}^{\textrm{C}}} & = \begin{cases} \min\left( \frac{\|g_{k}\|^{2}}{\left\langle{g_{k}},{H_{k}[g_{k}]}\right\rangle}, \frac{\Delta_{k}}{\|g_{k}\|} \right) & \textrm{if}\ \left\langle{g_{k}},{H_{k}[g_{k}]}\right\rangle> 0, \\ \frac{\Delta_{k}}{\|g_{k}\|} & \textrm{otherwise.} \end{cases} \end{align*}

Lemma 3.7 Let gk = grad f(xk). Under Assumption 3.4, setting ηk to be the Cauchy step $${\eta _{k}^{\textrm{C}}} = -{\alpha _{k}^{\textrm{C}}} g_{k}$$ for first-order steps fulfills Assumption 3.6 with c2 = 1/2. Computing $${\eta _{k}^{\textrm{C}}}$$ involves one gradient evaluation and one application of Hk.

Proof. The claim follows as an exercise from $$\hat m_{k}\left (0_{x_{k}}\right ) - \hat m_{k}\left ({\eta _{k}^{\textrm{C}}}\right ) = {\alpha _{k}^{\textrm{C}}} \|g_{k}\|^{2} - \frac{\left ({\alpha _{k}^{\textrm{C}}}\right )^{2}}{2}\left \langle{g_{k}},{H_{k}[g_{k}]}\right \rangle $$ and $$\left \langle{g_{k}},{H_{k}[g_{k}]}\right \rangle \leq c_{0} \|g_{k}\|^{2}$$ owing to Assumption 3.4.

The Steihaug–Toint truncated conjugate gradient method (Toint, 1981; Steihaug, 1983) is a monotonically improving iterative method for the trust-region subproblem whose first iterate is the Cauchy step; as such, it necessarily achieves the required model decrease.
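The Cauchy step formula above translates directly into code. The sketch below (toy assumptions as in the earlier snippets, with Hk supplied as a callable acting on tangent vectors) returns ηkᶜ = −αkᶜ gk.

```python
import numpy as np

def cauchy_step(g, H_apply, Delta):
    """Cauchy step for the model m(eta) = f + <eta, g> + 0.5 <eta, H[eta]>,
    minimized along -g within the trust region of radius Delta."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0:
        return np.zeros_like(g)
    gHg = g @ H_apply(g)
    if gHg > 0:
        alpha = min(gnorm ** 2 / gHg, Delta / gnorm)
    else:                        # nonpositive curvature along -g: go to the boundary
        alpha = Delta / gnorm
    return -alpha * g

# Example with the sphere setup used earlier (names from the previous snippets):
# eta_C = cauchy_step(rgrad(x), lambda u: rhess(x, u), Delta=0.5)
```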
For second-order steps the requirement is as follows.

Assumption 3.8 There exists c3 > 0 such that, for all k such that ∥grad f(xk)∥≤ εg and $$\lambda _{\min }(H_{k}) < -\varepsilon _{H}$$, the step ηk satisfies \begin{align} \hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k}) & \geq c_{3} {\Delta_{k}^{2}} \varepsilon_{H}. \end{align} (3.9)

This can be achieved by making a step of maximal length along a direction that certifies that $$\lambda _{\min }(H_{k}) < -\varepsilon _{H}$$ (Conn et al., 2000): this is called an eigenstep. Like Cauchy steps, eigensteps can be computed in a finite number of operations, independently of εg and εH.

Lemma 3.3 Under Assumption 3.5, if $$\lambda _{\min }(H_{k}) < -\varepsilon _{H}$$, there exists a tangent vector $$u_{k}\in \mathrm{T}_{x_{k}}\mathcal{M}$$ with \begin{equation*} \|u_{k}\| = 1,\qquad \left\langle{u_{k}},{\textrm{grad}\ f(x_{k})}\right\rangle \leq 0 \qquad \textrm{and}\qquad \left\langle{u_{k}},{H_{k}[u_{k}]}\right\rangle < -\varepsilon_{H}. \end{equation*} Setting ηk to be any eigenstep $${\eta _{k}^{E}} = \Delta _{k} u_{k}$$ for second-order steps fulfills Assumption 3.8 with c3 = 1/2. Let v1, … ,vn be an orthonormal basis of $$\mathrm{T}_{x_{k}}\mathcal{M}$$, where $$n = \dim \mathcal{M}$$. One way of computing $${\eta _{k}^{E}}$$ involves the application of Hk to v1, … ,vn plus $$\mathcal{O}\big(n^{3}\big )$$ arithmetic operations. The amount of work is independent of εg and εH.

Proof. Compute H, a symmetric matrix of size n that represents Hk in the basis v1, … ,vn, as $$H_{ij} = \left \langle{v_{i}},{H_{k}\left [v_{j}\right ]}\right \rangle $$. Compute a factorization LDLT = H + εHI, where I is the identity matrix, L is invertible and triangular and D is block diagonal with blocks of size 1 × 1 and 2 × 2. The factorization can be computed in $$\mathcal{O}\big(n^{3}\big)$$ operations (Golub & Van Loan, 2012, Sect. 4.4)—see the reference for a word of caution regarding pivoting for stability; pivoting is easily incorporated in the present argument. The matrix D has the same inertia as H + εHI; hence D is not positive semidefinite (otherwise we would have H ≽ −εHI). The structure of D makes it easy to find $$x \in{\mathbb{R}^{n}}$$ with xTDx < 0. Solve the triangular system LTy = x for $$y\in{\mathbb{R}^{n}}$$. Now 0 > xTDx = yTLDLTy = yT (H + εHI)y. Consequently, yTHy < −εH∥y∥2. We can set $$u_{k} = \pm \sum _{i=1}^{n} y_{i}v_{i} / \|y\|$$, where the sign is chosen to ensure $$\left \langle{u_{k}},{\textrm{grad}\ f(x_{k})}\right \rangle \leq 0$$. To conclude, check that $$\hat m_{k}\left (0_{x_{k}}\right ) - \hat m_{k}\left ({\eta _{k}^{E}}\right ) = -\left \langle{{\eta _{k}^{E}}},{\textrm{grad}\ f(x_{k})}\right \rangle - \frac{1}{2}\left \langle{{\eta _{k}^{E}}},{H_{k}\left [{\eta _{k}^{E}}\right ]}\right \rangle \geq \frac{1}{2} {\Delta _{k}^{2}} \varepsilon _{H}$$.

Notice from the proof that this strategy either certifies that $$H_{k} \succeq -\varepsilon _{H} \operatorname{Id}$$ (which must be checked at step 8 in Algorithm 3) or provides an escape direction otherwise. We further note that, in practice, one usually prefers to use iterative methods to compute an approximate leftmost eigenvector of Hk without representing it as a matrix.
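The proof above uses an LDLᵀ factorization. As a simpler (more expensive, but equally finite) sketch of the same idea, the code below obtains an escape direction from a leftmost eigenvector of the matrix representing Hk in an orthonormal basis, scales it to the trust-region boundary and orients it against the gradient; it is an illustration of the eigenstep, not the factorization-based procedure of the proof.

```python
import numpy as np

def eigenstep(H, g, Delta, eps_H):
    """Escape direction when lambda_min(H) < -eps_H.

    H: symmetric matrix representing H_k in an orthonormal basis of T_{x_k} M,
    g: coordinates of grad f(x_k) in the same basis.
    Returns Delta * u with ||u|| = 1, <u, g> <= 0 and <u, H u> < -eps_H,
    or None if lambda_min(H) >= -eps_H (no escape direction needed)."""
    w, V = np.linalg.eigh(H)
    if w[0] >= -eps_H:
        return None
    u = V[:, 0]                  # leftmost eigenvector, unit norm
    if u @ g > 0:                # orient so that the linear term is non-positive
        u = -u
    return Delta * u

# With such a step, m(0) - m(eta) = -<eta, g> - 0.5 <eta, H eta>
#                                >= 0.5 * Delta^2 * eps_H, as in Assumption 3.8.
```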
3.4. Main results and proofs for RTR

Under the discussed assumptions, we now establish our main theorem about computation of approximate first- and second-order critical points for (P) using RTR in a bounded number of iterations. The following constants will be useful: \begin{equation} \lambda_{g} = \frac{1}{4}\min\left( \frac{1}{c_{0}}, \frac{c_{2}}{L_{g} + c_{0}} \right) \qquad \textrm{and}\qquad \lambda_{H} = \frac{3}{4} \frac{c_{3}}{L_{H} + c_{1}}. \end{equation} (3.10)

Theorem 3.9 Under Assumptions 2.3, 3.1, 3.4, 3.6 and assuming $$\varepsilon _{g} \leq \frac{\Delta _{0}}{\lambda _{g}}$$, Algorithm 3 produces an iterate $$x_{N_{1}}$$ satisfying $$\|\textrm{grad}\ f(x_{N_{1}})\| \leq \varepsilon _{g}$$ with \begin{align} N_{1} \leq \frac{3}{2} \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{2} \lambda_{g}} \frac{1}{{\varepsilon_{g}^{2}}} + \frac{1}{2} \log_{2}\left( \frac{\Delta_{0}}{\lambda_{g} \varepsilon_{g}} \right) = \mathcal{O} \left( \frac{1}{{\varepsilon_{g}^{2}}} \right). \end{align} (3.11) Furthermore, if $$\varepsilon _{H} < \infty $$ then under additional Assumptions 3.2, 3.5, and 3.8 and assuming $$\varepsilon _{g} \leq \frac{c_{2}}{c_{3}} \frac{\lambda _{H}}{{\lambda _{g}^{2}}}$$ and $$\varepsilon _{H} \leq \frac{c_{2}}{c_{3}} \frac{1}{\lambda _{g}}$$, Algorithm 3 also produces an iterate $$x_{N_{2}}$$ satisfying $$\|\textrm{grad}\ f(x_{N_{2}})\| \leq \varepsilon _{g}$$ and $$\lambda _{\min }(H_{N_{2}}) \geq -\varepsilon _{H}$$ with \begin{align} N_{1} \leq N_{2} \leq \frac{3}{2} \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{3} \lambda^{2}} \frac{1}{\varepsilon^{2} \varepsilon_{H}} + \frac{1}{2}\log_{2}\left( \frac{\Delta_{0}}{\lambda \varepsilon} \right) = \mathcal{O} \left( \frac{1}{\varepsilon^{2} \varepsilon_{H}} \right), \end{align} (3.12) where we defined (λ, ε) = (λg, εg) if λgεg ≤ λHεH and (λ, ε) = (λH, εH) otherwise. Since the algorithm is a descent method, $$f(x_{N_{2}}) \leq f(x_{N_{1}}) \leq f(x_{0})$$.

Remark 3.10 Theorem 3.9 makes a statement about $$\lambda _{\min }(H_{k})$$ at termination, not about $$\lambda _{\min }(\textrm{Hess}\ f(x_{k}))$$. See Section 3.5 to connect these two quantities.

To establish Theorem 3.9 we work through a few lemmas, following the proof technique in Cartis et al. (2012). We first show Δk is bounded below in proportion to the tolerances εg and εH. This is used to show that the number of successful iterations in Algorithm 3 before termination (that is, iterations where ρk > ρ′; see (3.3)) is bounded above. It is then shown that the total number of iterations is at most a constant multiple of the number of successful iterations, which implies termination in bounded time.

We start by showing that the trust-region radius is bounded away from zero. Essentially, this is because if Δk becomes too small, then the Cauchy step and eigenstep are certain to be successful owing to the quality of the model in such a small region, so that the trust-region radius could not decrease any further.

Lemma 3.11 Under the assumptions of Theorem 3.9, if Algorithm 3 executes N iterations without terminating then \begin{align} \Delta_{k} \geq \min\left( \Delta_{0}, \lambda_{g} \varepsilon_{g}, \lambda_{H} \varepsilon_{H} \right) \end{align} (3.13) for k = 0, … , N, where λg and λH are defined in (3.10).

Proof. This follows essentially the proof of Absil et al. (2008, Thm. 7.4.2), which itself follows classical proofs (Conn et al., 2000). The core idea is to show that ρk (see (3.3)) stays close to 1 once Δk is small, so that there cannot be arbitrarily many trust-region radius reductions. The proof is in two parts. For the first part, assume ∥grad f(xk)∥ > εg.
Then consider the gap \begin{align} |\rho_{k} - 1| & = \left| \frac{\hat f_{k}(0_{x_{k}}) - \hat f_{k}(\eta_{k})}{\hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k})} - 1 \right| = \left| \frac{\hat m_{k}(\eta_{k}) - \hat f_{k}(\eta_{k})}{\hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k})} \right|. \end{align} (3.14) From Assumption 3.6, we know the denominator is not too small: \begin{align*} \hat m_{k}\left(0_{x_{k}}\right) - \hat m_{k}(\eta_{k}) & \geq c_{2} \min\left( \Delta_{k}, \frac{\varepsilon_{g}}{c_{0}} \right) \varepsilon_{g}. \end{align*} Now consider the numerator: \begin{align*} |\hat m_{k}(\eta_{k}) - \hat f_{k}(\eta_{k})| & = \left| f(x_{k}) + \left\langle{\textrm{grad}\,f(x_{k})},{\eta_{k}}\right\rangle + \tfrac{1}{2}\left\langle{\eta_{k}},{H_{k}[\eta_{k}]}\right\rangle - \hat f_{k}(\eta_{k}) \right| \\ & \leq \left| f(x_{k}) + \left\langle{\textrm{grad}\,f(x_{k})},{\eta_{k}}\right\rangle - \hat f_{k}(\eta_{k}) \right| + \tfrac{1}{2}\left| \left\langle{\eta_{k}},{H_{k}[\eta_{k}]}\right\rangle \right| \\ & \leq \tfrac{1}{2}\left( L_{g} + c_{0} \right)\|\eta_{k}\|^{2}, \end{align*} where we used Assumption 3.1 for the first term, and Assumption 3.4 for the second term. Assume for the time being that $$\Delta _{k} \leq \min \left ( \frac{\varepsilon _{g}}{c_{0}}, \frac{c_{2} \varepsilon _{g}}{L_{g} + c_{0}} \right ) = 4\lambda _{g} \varepsilon _{g}$$. Then, using ∥ηk∥≤Δk, it follows that \begin{align*} |\rho_{k} - 1| \leq \frac{1}{2} \frac{L_{g} + c_{0}}{c_{2} \min\left( \Delta_{k}, \frac{\varepsilon_{g}}{c_{0}} \right)\varepsilon_{g}} {\Delta_{k}^{2}} \leq \frac{1}{2}\frac{L_{g}+c_{0}}{c_{2} \varepsilon_{g}} \Delta_{k} \leq \frac{1}{2}. \end{align*} Hence, ρk ≥ 1/2, and by the mechanism of Algorithm 3, it follows that Δk+1 ≥ Δk.

For the second part, assume ∥grad f(xk)∥ ≤ εg and $$\lambda _{\min }(H_{k}) < -\varepsilon _{H}$$. Then, by Assumption 3.8, \begin{align*} \hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k}) & \geq c_{3} {\Delta_{k}^{2}} \varepsilon_{H}. \end{align*} Thus, by Assumptions 3.2 and 3.5, \begin{align*} |\hat m_{k}(\eta_{k}) - \hat f_{k}(\eta_{k})| & = \left| f(x_{k}) + \left\langle{\textrm{grad}\,f(x_{k})},{\eta_{k}}\right\rangle + \tfrac{1}{2}\left\langle{\eta_{k}},{H_{k}[\eta_{k}]}\right\rangle - \hat f_{k}(\eta_{k}) \right| \\ & \leq \frac{L_{H}}{6} \|\eta_{k}\|^{3} + \tfrac{1}{2} \left| \left\langle{\eta_{k}},{\left(\nabla^{2} \hat f_{k}\left(0_{x_{k}}\right) - H_{k}\right)[\eta_{k}]}\right\rangle \right| \\ & \leq \frac{L_{H} + c_{1}}{6} {\Delta_{k}^{3}}. \end{align*} As previously, combine these observations into (3.14) to see that if $$\Delta _{k} \leq \frac{3c_{3}}{L_{H} + c_{1}}\varepsilon _{H} = 4\lambda _{H} \varepsilon _{H}$$ then \begin{align} |\rho_{k} - 1| & \leq \tfrac{1}{2} \frac{L_{H} + c_{1}}{3c_{3} \varepsilon_{H}} \Delta_{k} \leq \tfrac{1}{2}. \end{align} (3.15) Again, this implies Δk+1 ≥ Δk.

Now combine the two parts. We have established that, if $$\Delta _{k} \leq 4\min \left (\lambda _{g} \varepsilon _{g}, \lambda _{H}\varepsilon _{H}\right )$$, then Δk+1 ≥ Δk. To conclude the proof, consider the fact that Algorithm 3 cannot reduce the radius by more than a factor of 4 in one step, that is, Δk+1 ≥ Δk/4.

By an argument similar to the one used for gradient methods, Lemma 3.11 implies an upper bound on the number of successful iterations required in Algorithm 3 to reach termination.
Lemma 3.12 Under the assumptions of Theorem 3.9, if Algorithm 3 executes N iterations without terminating, define the set of successful steps as \begin{align*} S_{N} = \{ k \in \{0, \ldots, N\} : \rho_{k}> \rho^{\prime} \} \end{align*} and let UN designate the unsuccessful steps, so that SN and UN form a partition of {0, … ,N}. Assume εg ≤Δ0/λg. If $$\varepsilon _{H} = \infty $$, the number of successful steps obeys \begin{align} |S_{N}| \leq \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{2} \lambda_{g}} \frac{1}{{\varepsilon_{g}^{2}}}. \end{align} (3.16) Otherwise, if additionally $$\varepsilon _{g} \leq \frac{c_{2}}{c_{3}} \frac{\lambda _{H}}{{\lambda _{g}^{2}}}$$ and $$\varepsilon _{H} \leq \frac{c_{2}}{c_{3}} \frac{1}{\lambda _{g}}$$, we have the bound \begin{align} |S_{N}| \leq \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{3} } \frac{1}{\min(\lambda_{g}\varepsilon_{g}, \lambda_{H}\varepsilon_{H})^{2} \varepsilon_{H}}. \end{align} (3.17)

Proof. The proof parallels Cartis et al. (2012, Lemma 4.5). Clearly, if k ∈ UN, then f(xk) = f(xk+1). On the other hand, if k ∈ SN then ρk > ρ′ (see (3.3)). Combine this with Assumptions 3.6 and 3.8 to see that, for k ∈ SN, \begin{align*} f(x_{k}) - f(x_{k+1}) & \geq \rho^{\prime} \left(\hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k})\right) \\ & \geq \rho^{\prime} \min\left( c_{2} \min\left( \Delta_{k}, \frac{\varepsilon_{g}}{c_{0}} \right) \varepsilon_{g},\ c_{3} {\Delta_{k}^{2}} \varepsilon_{H} \right). \end{align*} By Lemma 3.11 and the assumption λgεg ≤Δ0, it holds that $$\Delta _{k} \geq \min \left ( \lambda _{g} \varepsilon _{g}, \lambda _{H} \varepsilon _{H} \right )$$. Furthermore, using λg ≤ 1/c0 shows that $$\min (\Delta _{k}, \varepsilon _{g} / c_{0}) \geq \min (\Delta _{k}, \lambda _{g} \varepsilon _{g}) \geq \min \left ( \lambda _{g} \varepsilon _{g}, \lambda _{H} \varepsilon _{H} \right )$$. Hence, \begin{align} f(x_{k}) - f(x_{k+1}) & \geq \rho^{\prime} \min\left( c_{2} \lambda_{g} {\varepsilon_{g}^{2}}, c_{2} \lambda_{H} \varepsilon_{g} \varepsilon_{H}, c_{3} {\lambda_{g}^{2}} {\varepsilon_{g}^{2}} \varepsilon_{H}, c_{3} {\lambda_{H}^{2}} {\varepsilon_{H}^{3}} \right). \end{align} (3.18)

If $$\varepsilon _{H} = \infty $$ this simplifies to \begin{align*} f(x_{k}) - f(x_{k+1}) \geq \rho^{\prime} c_{2} \lambda_{g} {\varepsilon_{g}^{2}}. \end{align*} Sum over iterations up to N and use Assumption 2.3 (bounded f): \begin{align*} f(x_{0}) - f^{\ast} \geq f(x_{0}) - f(x_{N+1}) & = \sum_{k\in S_{N}} f(x_{k}) - f(x_{k+1}) \geq |S_{N}| \rho^{\prime} c_{2} \lambda_{g} {\varepsilon_{g}^{2}}. \end{align*} Hence, \begin{align*} |S_{N}| \leq \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{2} \lambda_{g}} \frac{1}{{\varepsilon_{g}^{2}}}. \end{align*}

On the other hand, if $$\varepsilon _{H} < \infty $$ then, starting over from (3.18) and assuming both $$c_{3} {\lambda _{g}^{2}} {\varepsilon _{g}^{2}} \varepsilon _{H} \leq c_{2} \lambda _{H} \varepsilon _{g} \varepsilon _{H}$$ and $$c_{3} {\lambda _{g}^{2}} {\varepsilon _{g}^{2}} \varepsilon _{H} \leq c_{2}\lambda _{g}{\varepsilon _{g}^{2}}$$ (which is equivalent to $$\varepsilon _{g} \leq \frac{c_{2}\lambda _{H}}{c_{3}{\lambda _{g}^{2}}}$$ and $$\varepsilon _{H} \leq \frac{c_{2}}{c_{3}\lambda _{g}}$$), the same telescoping sum gives \begin{align*} f(x_{0}) - f^{\ast} \geq |S_{N}| \rho^{\prime} c_{3} \min(\lambda_{g}\varepsilon_{g}, \lambda_{H}\varepsilon_{H})^{2} \varepsilon_{H}. \end{align*} Solve for |SN| to conclude.
Finally, we show that the total number of steps N before termination cannot be more than a fixed multiple of the number of successful steps |SN|.

Lemma 3.13 Under the assumptions of Theorem 3.9, if Algorithm 3 executes N iterations without terminating, using the notation SN and UN of Lemma 3.12, it holds that \begin{align} |S_{N}| & \geq \tfrac{2}{3}(N+1) - \tfrac{1}{3}\max\left( 0, \log_{2}\left( \frac{\Delta_{0}}{\lambda_{g} \varepsilon_{g}} \right), \log_{2}\left( \frac{\Delta_{0}}{\lambda_{H} \varepsilon_{H}} \right) \right). \end{align} (3.19)

Proof. The proof rests on the lower bound for Δk obtained in Lemma 3.11. It parallels Cartis et al. (2012, Lemma 4.6). For all k ∈ SN, it holds that Δk+1 ≤ 2Δk. For all k ∈ UN, it holds that $$\Delta _{k+1} \leq \frac{1}{4} \Delta _{k}$$. Hence, \begin{align*} \Delta_{N} \leq 2^{|S_{N}|} \left(\tfrac{1}{4}\right)^{|U_{N}|} \Delta_{0}. \end{align*} On the other hand, Lemma 3.11 gives \begin{align*} \Delta_{N} \geq \min\left( \Delta_{0}, \lambda_{g} \varepsilon_{g}, \lambda_{H} \varepsilon_{H} \right). \end{align*} Combine, divide by Δ0 and take the log in base 2: \begin{align*} |S_{N}| - 2|U_{N}| \geq \min\left( 0, \log_{2}\left( \frac{\lambda_{g} \varepsilon_{g}}{\Delta_{0}}\right), \log_{2}\left( \frac{\lambda_{H} \varepsilon_{H}}{\Delta_{0}} \right) \right). \end{align*} Use |SN| + |UN| = N + 1 to conclude.

We can now prove the main theorem.

Proof of Theorem 3.9. It is sufficient to combine Lemmas 3.12 and 3.13 in both regimes. First, we get that if ∥grad f(xk)∥ > εg for k = 0, … , N, then \begin{align*} N+1 \leq \tfrac{3}{2} \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{2} \lambda_{g}} \frac{1}{{\varepsilon_{g}^{2}}} + \tfrac{1}{2} \log_{2}\left( \frac{\Delta_{0}}{\lambda_{g} \varepsilon_{g}} \right). \end{align*} (The term $$\log _{2}\big ( \frac{\Delta _{0}}{\lambda _{H} \varepsilon _{H}} \big )$$ from Lemma 3.13 is irrelevant up to that point, as εH could just as well have been infinite.) Thus, after a number of iterations larger than the right-hand side, an iterate with sufficiently small gradient must have been produced, to avoid a contradiction. Second, we get that if for k = 0, … , N no iterate satisfies both ∥grad f(xk)∥≤ εg and $$\lambda _{\min }(H_{k}) \geq -\varepsilon _{H}$$, then \begin{align*} N+1 \leq \tfrac{3}{2} \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{3} } \frac{1}{\min(\lambda_{g}\varepsilon_{g}, \lambda_{H}\varepsilon_{H})^{2} \varepsilon_{H}} + \tfrac{1}{2}\max\left( \log_{2}\left( \frac{\Delta_{0}}{\lambda_{g} \varepsilon_{g}} \right), \log_{2}\left( \frac{\Delta_{0}}{\lambda_{H} \varepsilon_{H}} \right) \right). \end{align*} Conclude with the same argument.

3.5. Connecting Hk and Hess f(xk)

Theorem 3.9 states termination of Algorithm 3 in terms of ∥grad f(xk)∥ and $$\lambda _{\min }(H_{k})$$. Ideally, the latter must be turned into a statement about $$\lambda _{\min }(\textrm{Hess}\ f(x_{k}))$$, to match the second-order necessary optimality conditions of (P) more closely (recall Proposition 1.1). Assumption 3.5 itself requires only Hk to be (weakly) related to $$\nabla ^{2} \hat f_{k}(0_{x_{k}})$$ (the Hessian of the pullback of f at xk), which is different from the Riemannian Hessian of f at xk in general. It is up to the user to provide Hk sufficiently related to $$\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$. Additional control over the retraction at xk can further relate $$\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$ to Hess f(xk), as we do now. Proofs for this section are in Appendix D.
Lemma 3.9 Define the maximal acceleration of Retr at x as the smallest real a such that \begin{align*} \forall\ \eta \in \mathrm{T}_{x}\mathcal{M}\ \textrm{with}\ \|\eta\| = 1, \quad \bigg\| \frac{\mathrm{D}^{2}}{\mathrm{d}t^{2}} \textrm{Retr}_{x}(t\eta) |{}_{t=0} \bigg\| \leq a, \end{align*} where $$\frac{\mathrm{D}^{2}}{\mathrm{d}t^{2}} \gamma $$ denotes acceleration of the curve t ↦ γ(t) on $$\mathcal{M}$$ (Absil et al., 2008, Chapter 5). Then \begin{align*} \left\| \textrm{Hess}\ f(x) - \nabla^{2} \hat f_{x}(0_{x}) \right\| \leq a \cdot \|\textrm{grad}\ f(x)\|. \end{align*} In particular, if x is a critical point or if a = 0, the Hessians agree: $$\textrm{Hess}\ f(x) = \nabla ^{2} \hat f_{x}(0_{x})$$. The particular cases appear as Absil et al. (2008, Props. 5.5.5, 5.5.6). This result highlights the crucial role of retractions with zero acceleration, known as second-order retractions and defined in Absil et al. (2008, Prop. 5.5.5); we are not aware of earlier references to this notion. Definition 3.10 A retraction is a second-order retraction if it has zero acceleration, as defined in Lemma 3.9. Retracted curves then locally agree with geodesics up to second order. Proposition 3.11 Let $$x_{k}\in \mathcal{M}$$ be the iterate returned by Algorithm 3 under the assumptions of Theorem 3.4. It satisfies ∥grad f(xk)∥≤ εg and Hk ≽−εH Id. Assume Hk is related to the Hessian of the pullback as $$ \|\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right ) - H_{k} \| \leq \delta _{k}$$. Further assume the retraction has acceleration at xk bounded by ak, as defined in Lemma 3.9. Then \begin{align*} \textrm{Hess}\ f(x_{k}) \succeq -\left( \varepsilon_{H} + a_{k} \varepsilon_{g} + \delta_{k} \right) \operatorname{Id}. \end{align*} In particular, if the retraction is second order and $$H_{k} = \nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$, then Hess f(xk) ≽−εH Id. We note that second-order retractions are frequently available in applications. Indeed, retractions for submanifolds obtained as (certain types of) projections—arguably one of the most natural classes of retractions for submanifolds—are second order (Absil & Malick, 2012, Thm. 22). For example, the sphere retraction Retrx(η) = (x + η)/∥x + η∥ is second order. Such retractions for low-rank matrices are also known (Absil & Oseledets, 2015). 4. Example: smooth semidefinite programs This example is based on Boumal et al. (2016). Consider the following semidefinite program, which occurs in robust PCA (McCoy & Tropp, 2011) and as a convex relaxation of combinatorial problems such as Max-Cut, $$\mathbb{Z}_{2}$$-synchronization and community detection in the stochastic block model (Goemans & Williamson, 1995; Bandeira et al., 2016): \begin{align} \min_{X\in{\mathbb{R}^{n\times n}}} \textrm{Tr}(CX)\ \textrm{subject to}\ \textrm{diag}(X) = \boldsymbol{1}, X \succeq 0. \end{align} (4.1) The symmetric cost matrix C depends on the application. Interior point methods solve this problem in polynomial time, though they involve significant work to enforce the conic constraint X ≽ 0 (X symmetric, positive semidefinite). This motivates the approach of Burer & Monteiro (2005) to parameterize the search space as X = YYT, where Y is in $${\mathbb{R}^{n\times p}}$$ for some well-chosen p: \begin{align} \min_{Y\in{\mathbb{R}^{n\times p}}} \textrm{Tr}\left(CYY^{\textrm{T}}\! \right)\ \textrm{subject to}\ \textrm{diag}\left(YY^{\textrm{T}}\! \right) = \boldsymbol{1}.
\end{align} (4.2) This problem is of the form of (P), where $$f(Y) = \textrm{Tr}\left (CYY^{\textrm{T}}\right )$$ and the manifold is a product of n unit spheres in $${\mathbb{R}^{p}}$$: \begin{align} \mathcal{M} & = \left\{ Y\in{\mathbb{R}^{n\times p}} : \textrm{diag}\left(YY^{\textrm{T}}\! \right) = \boldsymbol{1} \right\} = \left\{ Y\in{\mathbb{R}^{n\times p}} : \textrm{each row of } Y \textrm{ has unit norm} \right\}. \end{align} (4.3) In principle, since the parameterization X = YYT breaks convexity, the new problem could have many spurious local optimizers and saddle points. Yet, for p = n + 1, it has recently been shown that approximate second-order critical points Y map to approximate global optimizers X = YYT, as stated in the following proposition. (In this particular case there is no need to control ∥grad f(Y )∥ explicitly.) Proposition 4.1 (Boumal et al., 2016). If X⋆ is optimal for (4.1) and Y is feasible for (4.2) with p > n and Hess f(Y ) ≽−εH Id, the optimality gap is bounded as \begin{align*} 0 \leq \textrm{Tr}\left(CYY^{\textrm{T}}\! \right) - \textrm{Tr}(CX^{\star}) \leq \frac{n}{2}\varepsilon_{H}. \end{align*} Since f is smooth in $${\mathbb{R}^{n\times p}}$$ and $$\mathcal{M}$$ is a compact submanifold of $${\mathbb{R}^{n\times p}}$$, the regularity Assumptions 3.1 and 3.2 hold with any second-order retraction (Lemmas 2.4 and 3.1). In particular, they hold if $$\textrm{Retr}_{Y}(\dot Y)$$ is the result of normalizing each row of $$Y+\dot Y$$ (Section 3.5) or if the exponential map is used (which is cheap for this manifold; see Appendix E). Theorem 3.4 then implies that RTR applied to the nonconvex problem (4.2) computes a point X = YYT feasible for (4.1) such that Tr(CX) −Tr(CX⋆) ≤ δ in $$\mathcal{O}\big(1/\delta ^{3}\big)$$ iterations. Appendix E bounds the total work with an explicit dependence on the problem dimension n as $$\mathcal{O}\big(n^{10} / \delta ^{3}\big)$$ arithmetic operations, where $$\mathcal{O}$$ hides factors depending on the data C and an additive log term. This result follows from a bound $$L_{H} \leq 8\left \|{C}\right \|_{\mathrm{2}} \sqrt{n}$$ for Assumption 3.2, which is responsible for a factor of n in the complexity—the remaining factors could be improved; see below. In Boumal et al. (2016), it is shown that, generically in C, if $$p \geq \big \lceil \sqrt{2n}\, \big \rceil $$ then all second-order critical points of (4.2) are globally optimal (despite nonconvexity). This means RTR converges to global optimizers regardless of initialization, with cheaper iterations (owing to the reduced dimension). Unfortunately, there is no statement of quality pertaining to approximate second-order critical points for such small p, so that this analysis is not sufficient to obtain an improved worst-case complexity bound. These bounds are worse than guarantees provided by interior point methods. Indeed, following Nesterov (2004, Sect. 4.3.3, with eq. (4.3.12)), interior point methods achieve a solution in $$\mathcal{O}\big(n^{3.5}\log (n/\delta )\big)$$ arithmetic operations. Yet, numerical experiments in Boumal et al. (2016) suggest RTR often outperforms interior point methods, indicating that the bound $$\mathcal{O}\big(n^{10}/\delta ^{3}\big)$$ is wildly pessimistic. We report it here mainly because, to the best of our knowledge, this is the first explicit bound for a Burer–Monteiro approach to solving a semidefinite program.
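To make the geometry of (4.2)–(4.3) concrete, the following minimal NumPy sketch (ours, for illustration only; it is not taken from Manopt or any other toolbox) spells out the cost, the Riemannian gradient obtained by projecting the Euclidean gradient onto the tangent space row by row, and the row-normalization retraction, together with a finite-difference check of the gradient:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 8, 3
    C = rng.standard_normal((n, n))
    C = (C + C.T) / 2                       # symmetric cost matrix

    def cost(Y):                            # f(Y) = Tr(C Y Y^T)
        return np.trace(C @ Y @ Y.T)

    def proj(Y, Z):                         # orthogonal projection of Z onto T_Y M, row by row
        return Z - np.sum(Z * Y, axis=1, keepdims=True) * Y

    def rgrad(Y):                           # Riemannian gradient: projection of the Euclidean gradient 2CY
        return proj(Y, 2 * C @ Y)

    def retr(Y, V):                         # retraction: normalize each row of Y + V
        W = Y + V
        return W / np.linalg.norm(W, axis=1, keepdims=True)

    Y = retr(rng.standard_normal((n, p)), 0.0)     # a feasible point (unit-norm rows)
    V = proj(Y, rng.standard_normal((n, p)))       # a tangent vector at Y
    t = 1e-6
    fd_slope = (cost(retr(Y, t * V)) - cost(Y)) / t
    print(fd_slope, np.sum(rgrad(Y) * V))          # should agree to several digits

Any other retraction, in particular the rowwise exponential map of Appendix E, would pass the same first-order check, since all retractions agree to first order.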
A number of factors drive the gap between our worst-case bound and practice. In particular, strategies far more efficient than the LDLT factorization in Lemma 3.3 are used to compute second-order steps, and they can exploit structure in C. High-accuracy solutions are reached owing to RTR typically converging superlinearly in the local regime. Moreover, p is chosen much smaller than n + 1. See also Mei et al. (2017) for formal complexity results in a setting where p is allowed to be independent of n; this precludes reaching an objective value arbitrarily close to optimal, in exchange for cheaper computations. 5. Conclusions and perspectives We presented bounds on the number of iterations required by the Riemannian gradient descent algorithm and the RTR algorithm to reach points that approximately satisfy first- and second-order necessary optimality conditions, under some regularity assumptions but regardless of initialization. When the search space $$\mathcal{M}$$ is a Euclidean space these bounds are already known. For the more general case of $$\mathcal{M}$$ being a Riemannian manifold, these bounds are new. As a subclass of interest, we showed the regularity requirements are satisfied if $$\mathcal{M}$$ is a compact submanifold of $${\mathbb{R}^{n}}$$ and f has locally Lipschitz continuous derivatives of appropriate order. This covers a rich class of practical optimization problems. While there are no explicit assumptions made about $$\mathcal{M}$$, the smoothness requirements for the pullback of the cost—Assumptions 2.6, 3.1 and 3.2—implicitly restrict the class of manifolds to which these results apply. Indeed, for certain manifolds, even for nice cost functions f, there may not exist retractions that ensure the assumptions hold. This is the case in particular for certain incomplete manifolds, such as open Riemannian submanifolds of $${\mathbb{R}^{n}}$$ and certain geometries of the set of fixed-rank matrices—see also Remark 2.2 about the injectivity radius. For such sets it may be necessary to adapt the assumptions. For fixed-rank matrices, for example, Vandereycken (2013, Sect. 4.1) obtains convergence results assuming a kind of coercivity of the cost function: for any sequence of rank-k matrices (Xi)i=1, 2, … such that the first singular value $$\sigma _{1}(X_{i}) \to \infty $$ or the kth singular value σk(Xi) → 0, it holds that $$f(X_{i}) \to \infty $$. This ensures iterates stay away from the open boundary. The iteration bounds are sharp, but additional information may yield more favorable bounds in specific contexts. In particular, when the studied algorithms converge to a nondegenerate local optimizer, they do so with an at least linear rate, so that the number of iterations is merely $$\mathcal{O}(\log (1/\varepsilon ))$$ once in the linear regime. This suggests a stitching approach: for a given application, it may be possible to show that rough approximate second-order critical points are in a local attraction basin; the iteration cost can then be bounded by the total work needed to attain such a crude point starting from anywhere, plus the total work needed to refine the crude point to high accuracy with a linear or even quadratic convergence rate. This is, to some degree, the successful strategy in Sun et al. (2015, 2016).
Finally, we note that it would also be interesting to study the global convergence rates of Riemannian versions of adaptive regularization algorithms using cubics (ARC), since in the Euclidean case these can achieve approximate first-order criticality in $$\mathcal{O}\big(1/\varepsilon ^{1.5}\big)$$ instead of $$\mathcal{O} \big(1/\varepsilon ^{2}\big)$$ (Cartis et al., 2011a). Work in that direction could start with the convergence analyses proposed in Qi (2011). Acknowledgements This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme initiated by the Belgian Science Policy Office. This work was supported by the ARC ‘Mining and Optimization of Big Data Models’. We thank Alex d’Aspremont, Simon Lacoste-Julien, Ju Sun, Bart Vandereycken and Paul Van Dooren for helpful discussions. Funding Fonds Spéciaux de Recherche (FSR) at UCLouvain (to N.B.); Chaire Havas ‘Chaire Economie et gestion des nouvelles données’ (to N.B.); ERC Starting Grant SIPA (to N.B.); Research in Paris grant at Inria & ENS (to N.B.); National Science Foundation (DMS-1719558 to N.B.); The Natural Environment Research Council (grant NE/L012146/1 to C.C.). Footnotes 1  Informally, the tangent bundle $$\mathrm{T}\mathcal{M}$$ is the set of all pairs (x, ηx) where $$x\in \mathcal{M}$$ and $$\eta _{x}\in \mathrm{T}_{x}\mathcal{M}$$. See the reference for a proper definition of $$\mathrm{T}\mathcal{M}$$ and of what it means for Retr to be smooth. 2  The composition f ∘ Retrx is called the pullback because it, quite literally, pulls back the cost function f from the manifold $$\mathcal{M}$$ to the linear space $$\mathrm{T}_{x}\mathcal{M}$$. 3  $$\forall \ \eta \in \mathrm{T}_{x}\mathcal{M}, \langle{\textrm{grad}\ \hat f_{x}(0_{x})},{\eta }\rangle = \mathrm{D} \hat f_{x}(0_{x})[\eta ] = \mathrm{D} f({x})[\mathrm{D} \textrm{Retr}_{{x}}(0_{x})[\eta ]] = \mathrm{D} f({x})[\eta ] = \left \langle{\textrm{grad}\ f({x})},{\eta }\right \rangle .$$ 4  See Remark 2.2; $$\varrho _{k} = \infty $$ is valid if the retraction is globally defined and f is sufficiently nice (for example, Lemma 2.7). 5  This holds in particular in the classical setting where $$\mathcal{M} = {\mathbb{R}^{n}}$$, Retrx(η) = x + η and grad f is Lg-Lipschitz. 6  This is typically not an issue in practice. For example, globally defined, practical retractions are known for the sphere, Stiefel manifold, orthogonal group, their products and many others (Absil et al., 2008, Chapter 4). 7  Theorem 3.4 is scale invariant, in that if the cost function f(x) is replaced by αf(x) for some positive α (which does not meaningfully change (P)), it is sensible to also multiply Lg, LH, c0, c1, εg and εH by α; consequently, the upper bounds on εg and εH and the upper bounds on N1 and N2 are invariant under this scaling. If it is desirable to always allow εg, εH in, say, (0, 1], one possibility is to artificially make Lg, LH, c0, c1 larger (which is always allowed). 8  The function f need not be defined outside of $$\mathcal{M}$$; it often is in applications, though, which simplifies exposition. 9  Proper definition of Riemannian Hessians requires the notion of Riemannian connections, which we omit here; see Absil et al. (2008, Chapter 5).
10  Indeed, any $$y\in \mathcal{S}^{n-1}$$ can be written as y = αx + βη with xTη = 0, ∥η∥ = 1 and α2 + β2 = 1; then, yTAy = α2xTAx + β2ηTAη + 2αβηTAx; by first-order condition, ηTAx = (xTAx)ηTx = 0, and by second-order condition, yTAy ≥ (α2 + β2)xTAx = xTAx; hence xTAx is minimal over $$\mathcal{S}^{n-1}$$.
References
Absil, P.-A., Baker, C. G. & Gallivan, K. A. (2007) Trust-region methods on Riemannian manifolds. Found. Comput. Math., 7, 303–330. doi:10.1007/s10208-005-0179-9.
Absil, P.-A., Mahony, R. & Sepulchre, R. (2008) Optimization Algorithms on Matrix Manifolds. Princeton, NJ: Princeton University Press. ISBN 978-0-691-13298-3.
Absil, P.-A., Mahony, R. & Trumpf, J. (2013) An extrinsic look at the Riemannian Hessian. Geometric Science of Information (F. Nielsen & F. Barbaresco eds). Lecture Notes in Computer Science, vol. 8085. Berlin, Heidelberg: Springer, pp. 361–368. doi:10.1007/978-3-642-40020-9_39. URL http://sites.uclouvain.be/absil/2013.01
Absil, P.-A. & Malick, J. (2012) Projection-like retractions on matrix manifolds. SIAM J. Optim., 22, 135–158. doi:10.1137/100802529.
Absil, P.-A. & Oseledets, I. (2015) Low-rank retractions: a survey and new results. Comput. Optim. Appl., 62, 5–29. doi:10.1007/s10589-014-9714-4.
Absil, P.-A., Trumpf, J., Mahony, R. & Andrews, B. (2009) All roads lead to Newton: feasible second-order methods for equality-constrained optimization. Technical Report UCL-INMA-2009.024. Département d'ingénierie mathématique, UCLouvain, Belgium.
Adler, R., Dedieu, J., Margulies, J., Martens, M. & Shub, M. (2002) Newton's method on Riemannian manifolds and a geometric model for the human spine. IMA J. Numer. Anal., 22, 359–390. doi:10.1093/imanum/22.3.359.
Bandeira, A., Boumal, N. & Voroninski, V. (2016) On the low-rank approach for semidefinite programs arising in synchronization and community detection. Proceedings of the 29th Conference on Learning Theory (COLT 2016), New York, June 23–26. PMLR.
Bento, G., Ferreira, O. & Melo, J. (2017) Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. J. Optim. Theory Appl., 173, 548–562. doi:10.1007/s10957-017-1093-4.
Berger, G. (2017) Fast matrix multiplication. Master's thesis, Ecole polytechnique de Louvain. Available at http://hdl.handle.net/2078.1/thesis:10630
Bhojanapalli, S., Neyshabur, B. & Srebro, N. (2016) Global optimality of local search for low rank matrix recovery. Preprint arXiv:1605.07221.
Birgin, E., Gardenghi, J., Martínez, J., Santos, S. & Toint, P. (2017) Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math. Prog., 163, 359–368. doi:10.1007/s10107-016-1065-8.
Boumal, N. (2015a) Riemannian trust regions with finite-difference Hessian approximations are globally convergent. Geometric Science of Information (F. Nielsen & F. Barbaresco eds). Lecture Notes in Computer Science, vol. 9389. Springer International Publishing, pp. 467–475. doi:10.1007/978-3-319-25040-3_50.
Boumal, N. (2015b) A Riemannian low-rank method for optimization over semidefinite matrices with block-diagonal constraints. Preprint arXiv:1506.00575.
Boumal, N. (2016) Nonconvex phase synchronization. SIAM J. Optim., 26, 2355–2377. doi:10.1137/16M105808X.
Boumal, N., Mishra, B., Absil, P.-A. & Sepulchre, R. (2014) Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res., 15, 1455–1459. http://www.manopt.org
Boumal, N., Voroninski, V. & Bandeira, A. (2016) The non-convex Burer-Monteiro approach works on smooth semidefinite programs. Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon & R. Garnett eds). Curran Associates, pp. 2757–2765.
Burer, S. & Monteiro, R. (2005) Local minima and convergence in low-rank semidefinite programming. Math. Prog., 103, 427–444.
Cartis, C., Gould, N. I. M. & Toint, P. L. (2010) On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization problems. SIAM J. Optim., 20, 2833–2852. doi:10.1137/090774100.
Cartis, C., Gould, N. & Toint, P. (2011a) Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Prog., 130, 295–319. doi:10.1007/s10107-009-0337-y.
Cartis, C., Gould, N. & Toint, P. (2011b) Optimal Newton-type methods for nonconvex smooth optimization problems. ERGO Technical Report 11-009. School of Mathematics, University of Edinburgh.
Cartis, C., Gould, N. & Toint, P. (2012) Complexity bounds for second-order optimality in unconstrained optimization. J. Complexity, 28, 93–108. doi:10.1016/j.jco.2011.06.001.
Cartis, C., Gould, N. & Toint, P. (2014) On the complexity of finding first-order critical points in constrained nonlinear optimization. Math. Prog., 144, 93–106. doi:10.1007/s10107-012-0617-9.
Cartis, C., Gould, N. & Toint, P. (2015a) Evaluation complexity bounds for smooth constrained nonlinear optimization using scaled KKT conditions and high-order models. NA Technical Report 1912, Maths E-print Archive. Mathematical Institute, Oxford University.
Cartis, C., Gould, N. & Toint, P. (2015b) On the evaluation complexity of constrained nonlinear least-squares and general constrained nonlinear optimization using second-order methods. SIAM J. Numer. Anal., 53, 836–851. doi:10.1137/130915546.
Cartis, C., Gould, N. I. M. & Toint, P. L. (2017) Second-order optimality and beyond: characterization and evaluation complexity in convexly constrained nonlinear optimization. Found. Comput. Math. doi:10.1007/s10208-017-9363-y.
Chavel, I. (2006) Riemannian Geometry: A Modern Introduction. Cambridge Tracts in Mathematics, vol. 108. Cambridge: Cambridge University Press.
Conn, A., Gould, N. & Toint, P. (2000) Trust-Region Methods. MPS-SIAM Series on Optimization. SIAM. ISBN 978-0-89871-460-9. doi:10.1137/1.9780898719857.
Curtis, F. E., Robinson, D. P. & Samadi, M. (2016) A trust region algorithm with a worst-case iteration complexity of $$\mathcal{O}(\epsilon ^{-3/2})$$ for nonconvex optimization. Math. Prog., 162, 1–32. doi:10.1007/s10107-016-1026-2.
Edelman, A., Arias, T. & Smith, S. (1998) The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20, 303–353.
Gabay, D. (1982) Minimizing a differentiable function over a differential manifold. J. Optim. Theory Appl., 37, 177–219.
Ge, R., Lee, J. & Ma, T. (2016) Matrix completion has no spurious local minimum. Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon & R. Garnett eds). Curran Associates, pp. 2973–2981. http://papers.nips.cc/paper/6048-matrix-completion-has-no-spurious-local-minimum.pdf
Goemans, M. & Williamson, D. (1995) Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42, 1115–1145. doi:10.1145/227683.227684.
Golub, G. & Van Loan, C. (2012) Matrix Computations, 4th edn. Johns Hopkins Studies in the Mathematical Sciences, vol. 3. Baltimore, Maryland: Johns Hopkins University Press.
Huang, W., Absil, P.-A., Gallivan, K. & Hand, P. (2016) ROPTLIB: an object-oriented C++ library for optimization on Riemannian manifolds. Technical Report FSU16-14.v2. Florida State University.
Huang, W., Gallivan, K. & Absil, P.-A. (2015) A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim., 25, 1660–1685. doi:10.1137/140955483.
McCoy, M. & Tropp, J. (2011) Two proposals for robust PCA using semidefinite programming. Electron. J. Stat., 5, 1123–1160. doi:10.1214/11-EJS636.
Mei, S., Misiakiewicz, T., Montanari, A. & Oliveira, R. (2017) Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. Preprint arXiv:1703.08729.
Monera, M. G., Montesinos-Amilibia, A. & Sanabria-Codesal, E. (2014) The Taylor expansion of the exponential map and geometric applications. Rev. R. Acad. Cienc. Exactas, Físicas Nat. Ser. A Math. RACSAM, 108, 881–906. doi:10.1007/s13398-013-0149-z.
Moré, J. & Sorensen, D. (1983) Computing a trust region step. SIAM J. Sci. Stat. Comput., 4, 553–572. doi:10.1137/0904038.
Nesterov, Y. (2004) Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Boston, MA: Springer. ISBN 978-1-4020-7553-7.
Nocedal, J. & Wright, S. (1999) Numerical Optimization. New York: Springer.
O'Neill, B. (1983) Semi-Riemannian Geometry: With Applications to Relativity, vol. 103. Academic Press.
Qi, C. (2011) Numerical optimization methods on Riemannian manifolds. PhD thesis, Florida State University, Tallahassee, FL.
Ring, W. & Wirth, B. (2012) Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim., 22, 596–627. doi:10.1137/11082885X.
Ruszczyński, A. (2006) Nonlinear Optimization. Princeton, NJ: Princeton University Press.
Sato, H. (2016) A Dai–Yuan-type Riemannian conjugate gradient method with the weak Wolfe conditions. Comput. Optim. Appl., 64, 101–118. doi:10.1007/s10589-015-9801-1.
Shub, M. (1986) Some remarks on dynamical systems and numerical analysis. Proceedings of VII ELAM (L. Lara-Carrero & J. Lewowicz eds). Equinoccio, Univ. Simón Bolívar, Caracas, pp. 69–92.
Smith, S. (1994) Optimization techniques on Riemannian manifolds. Fields Inst. Commun., 3, 113–135.
Sorensen, D. (1982) Newton's method with a model trust region modification. SIAM J. Numer. Anal., 19, 409–426. doi:10.1137/0719026.
Steihaug, T. (1983) The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numer. Anal., 20, 626–637.
Sun, J., Qu, Q. & Wright, J. (2017a) Complete dictionary recovery over the sphere II: recovery by Riemannian trust-region method. IEEE Trans. Info. Theory, 63, 885–914. doi:10.1109/TIT.2016.2632149.
Sun, J., Qu, Q. & Wright, J. (2017b) A geometric analysis of phase retrieval. Found. Comput. Math. doi:10.1007/s10208-017-9365-9.
Toint, P. (1981) Towards an efficient sparsity exploiting Newton method for minimization. Sparse Matrices and Their Uses (I. Duff ed). Academic Press, pp. 57–88.
Townsend, J., Koep, N. & Weichwald, S. (2016) PyManopt: a Python toolbox for optimization on manifolds using automatic differentiation. J. Mach. Learn. Res., 17, 1–5.
Udriste, C. (1994) Convex Functions and Optimization Methods on Riemannian Manifolds. Mathematics and its Applications, vol. 297. Dordrecht: Kluwer Academic Publishers. doi:10.1007/978-94-015-8390-9.
Vandereycken, B. (2013) Low-rank matrix completion by Riemannian optimization. SIAM J. Optim., 23, 1214–1236. doi:10.1137/110845768.
Vavasis, S. (1991) Nonlinear Optimization: Complexity Issues. Oxford: Oxford University Press.
Yang, W., Zhang, L.-H. & Song, R. (2014) Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific J. Optim., 10, 415–434.
Zhang, H. & Sra, S. (2016) First-order methods for geodesically convex optimization. Conference on Learning Theory. Curran Associates, pp. 1617–1638.
Zhang, H., Reddi, S. & Sra, S. (2016) Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. Advances in Neural Information Processing Systems. Curran Associates, pp. 4592–4600.
Appendix A. Essentials about manifolds We give here a simplified refresher of differential geometric concepts used in the paper, restricted to Riemannian submanifolds. All concepts are illustrated with the sphere. See Absil et al. (2008) for a more complete discussion, including quotient manifolds.
We endow $${\mathbb{R}^{n}}$$ with the classical Euclidean metric: for all $$x, y \in{\mathbb{R}^{n}}$$, $$\left \langle{x},{y}\right \rangle = x^{\textrm{T} }\! y$$. Consider the smooth map $$h \colon{\mathbb{R}^{n}} \mapsto{\mathbb{R}^{m}}$$ with m ≤ n and the constraint set \begin{align*} \mathcal{M} = \left\{ x \in{\mathbb{R}^{n}} : h(x) = 0 \right\}. \end{align*} Locally around each x, this set can be linearized by differentiating the constraint. The subspace corresponding to this linearization is the kernel of the differential of h at x (Absil et al., 2008, eq. (3.19)): \begin{align*} \mathrm{T}_{x}\mathcal{M} =\left \{ \eta \in{\mathbb{R}^{n}} : \mathrm{D} h(x)[\eta] = 0 \right\}. \end{align*} If this subspace has dimension n − m for all $$x\in \mathcal{M}$$ then $$\mathcal{M}$$ is a submanifold of dimension n − m of $${\mathbb{R}^{n}}$$ (Absil et al., 2008, Prop. 3.3.3) and $$\mathrm{T}_{x}\mathcal{M}$$ is called the tangent space to $$\mathcal{M}$$ at x. For example, the unit sphere in $${\mathbb{R}^{n}}$$ is a submanifold of dimension n − 1 defined by \begin{align*} \mathcal{S}^{n-1} = \left\{ x \in{\mathbb{R}^{n}} : x^{\textrm{T}}\! x = 1 \right\}, \end{align*} and the tangent space at x is \begin{align*} \mathrm{T}_{x}\mathcal{S}^{n-1} = \left\{ \eta \in{\mathbb{R}^{n}} : x^{\textrm{T}}\! \eta = 0 \right\}. \end{align*} By endowing each tangent space with the (restricted) Euclidean metric we turn $$\mathcal{M}$$ into a Riemannian submanifold of the Euclidean space $${\mathbb{R}^{n}}$$. (In general, the metric could be different and would depend on x; to disambiguate, one would write $$\left \langle{\cdot },{\cdot }\right \rangle _{x}$$.) An obvious retraction for the sphere (see Definition 2.1) is to normalize: \begin{align*} \textrm{Retr}_{x}(\eta) = \frac{x+\eta}{\|x+\eta\|}. \end{align*} Being an orthogonal projection to the manifold, this is actually a second-order retraction; see Definition 3.10 and Absil & Malick (2012, Thm. 22). The Riemannian metric leads to the notion of Riemannian gradient of a real function f defined in an open set of $${\mathbb{R}^{n}}$$ containing $$\mathcal{M}$$.8 The Riemannian gradient of f at x is the (unique) tangent vector grad f(x) at x satisfying \begin{align*} \forall\ \eta \in \mathrm{T}_{x}\mathcal{M}, \quad \mathrm{D} f(x)[\eta] = \lim_{t \to 0} \frac{f(x+t\eta)-f(x)}{t} = \left\langle{\eta},{\textrm{grad}\ f(x)}\right\rangle. \end{align*} In this setting the Riemannian gradient is nothing but the orthogonal projection of the Euclidean (classical) gradient ∇f(x) to the tangent space. Writing $$\textrm{Proj}_{x} \colon{\mathbb{R}^{n}} \to \mathrm{T}_{x}\mathcal{M}$$ for the orthogonal projector we have (Absil et al., 2008, eq. (3.37)) \begin{align*} \textrm{grad}\ f(x) = \textrm{Proj}_{x}\!\left( \nabla f(x) \right). \end{align*} Continuing the sphere example, the orthogonal projector is $$\textrm{Proj}_{x}(y) = y - \left (x^{\textrm{T} }\! y\right ) x$$, and if $$f(x) = \frac{1}{2} x^{\textrm{T} }\! A x$$ for some symmetric matrix A then \begin{align*} \nabla f(x) = Ax, \qquad \textrm{and} \qquad \textrm{grad}\ f(x) = Ax - (x^{\textrm{T}}\! A x) x. \end{align*} Notice that the critical points of f on $$\mathcal{S}^{n-1}$$ coincide with the unit eigenvectors of A. We can further define a notion of Riemannian Hessian as the projected differential of the Riemannian gradient:9 \begin{align*} \textrm{Hess}\ f(x)[\eta] = \textrm{Proj}_{x} \left( \mathrm{D}\left( x \mapsto \textrm{Proj}_{x} \nabla f(x) \right)(x)[\eta]\right). 
\end{align*} Hess f(x) is a linear map from $$\mathrm{T}_{x}\mathcal{M}$$ to itself, symmetric with respect to the Riemannian metric. Given a second-order retraction (Definition 3.10), it is equivalently defined by \begin{align*} \forall\ \eta \in \mathrm{T}_{x}\mathcal{M}, \quad \left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle = \left. \frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}} f(\textrm{Retr}_{x}(t\eta))\right|{}_{t=0}; \end{align*} see Absil et al. (2008, eq. (5.35)). Continuing our sphere example, \begin{align*} \mathrm{D}\left( x \mapsto \textrm{Proj}_{x} \nabla f(x) \right)(x)[\eta] = \mathrm{D}\left( x \mapsto Ax - \left(x^{\textrm{T}}\! Ax\right)x \right)(x)[\eta] = A\eta - (x^{\textrm{T}}\! Ax)\eta - 2 (x^{\textrm{T}}\! A \eta) x. \end{align*} Projection of the latter gives the Hessian: \begin{align*} \textrm{Hess}\ f(x)[\eta] = \textrm{Proj}_{x} (A\eta) - (x^{\textrm{T}}\! Ax)\eta. \end{align*} Consider the implications of a positive semidefinite Hessian (on the tangent space): \begin{align*} \textrm{Hess}\ f(x) \succeq 0 & \iff \left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle \geq 0 \qquad \forall\ \eta\in\mathrm{T}_{x}\mathcal{S}^{n-1} \\ & \iff \eta^{\textrm{T}}\! A\eta \geq x^{\textrm{T}}\! A x \qquad \forall\ \eta\in\mathrm{T}_{x}\mathcal{S}^{n-1}, \|\eta\| = 1. \end{align*} Together with first-order conditions this implies that x is a leftmost eigenvector of A.10 This is an example of an optimization problem on a manifold for which second-order necessary optimality conditions are also sufficient. This is not the norm. As another (very) special example consider the case $$\mathcal{M} = {\mathbb{R}^{n}}$$; then, $$\mathrm{T}_{x}{\mathbb{R}^{n}} = {\mathbb{R}^{n}}$$, Retrx(η) = x + η is the exponential map (a fortiori a second-order retraction), Projx is the identity, grad f(x) = ∇f(x) and Hess f(x) = ∇2f(x). Appendix B. Compact submanifolds of Euclidean spaces In this appendix we prove Lemmas 2.4 and 3.1, showing that if f has locally Lipschitz continuous gradient or Hessian in a Euclidean space $$\mathcal{E}$$ (in the usual sense), and it is to be minimized over a compact submanifold of $$\mathcal{E}$$, then Assumptions 2.6, 3.1 and 3.2 hold. Proof of Lemma 2.4 By assumption, ∇f is Lipschitz continuous along any line segment in $$\mathcal{E}$$ joining x and y in $$\mathcal{M}$$. Hence, there exists L such that, for all $$x, y \in \mathcal{M}$$, \begin{align} \left|\, f(y) - \left[\, f(x) + \left\langle{\nabla f(x)},{y-x}\right\rangle \right] \right| \leq \frac{L}{2} \|y-x\|^{2}. \end{align} (B.1) In particular this holds for all y =Retrx(η), for any $$\eta \in \mathrm{T}_{x}\mathcal{M}$$. Writing grad f(x) for the Riemannian gradient of $$f|_{\mathcal{M}}$$ and using that grad f(x) is the orthogonal projection of ∇f(x) to $$\mathrm{T}_{x}\mathcal{M}$$ (Absil et al., 2008, eq. (3.37)) the inner product above decomposes as \begin{align} \left\langle{\nabla f(x)},{\textrm{Retr}_{x}(\eta)-x}\right\rangle & = \left\langle{\nabla f(x)},{\eta + \textrm{Retr}_{x}(\eta) - x - \eta}\right\rangle \nonumber\\ & = \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle + \left\langle{\nabla f(x)},{\textrm{Retr}_{x}(\eta) - x - \eta}\right\rangle. \end{align} (B.2) Combining (B.1) with (B.2) and using the triangle inequality yields \begin{align*} \left| f(\textrm{Retr}_{x}(\eta)) - \left[ f(x) + \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle \right] \right| \leq \frac{L}{2} \|\textrm{Retr}_{x}(\eta)-x\|^{2} + \|\nabla f(x)\| \|\textrm{Retr}_{x}(\eta) - x - \eta\|. 
\end{align*} Since ∇f(x) is continuous on the compact set $$\mathcal{M}$$ there exists finite G such that ∥∇f(x)∥≤ G for all $$x\in \mathcal{M}$$. It remains to show there exist finite constants α, β ≥ 0 such that, for all $$x\in \mathcal{M}$$ and for all $$\eta \in \mathrm{T}_{x}\mathcal{M}$$, \begin{align} \|\textrm{Retr}_{x}(\eta)-x\| \leq \alpha\|\eta\|\ \textrm{and} \end{align} (B.3) \begin{align} \|\textrm{Retr}_{x}(\eta) - x - \eta\| \leq \beta\|\eta\|^{2}. \end{align} (B.4) For small η, this will follow from $$\textrm{Retr}_{x}(\eta ) = x + \eta + \mathcal{O}\left (\|\eta \|^{2}\right )$$ by Definition 2.1; for large η this will follow a fortiori from compactness. This will be sufficient to conclude, as then we will have for all $$x\in \mathcal{M}$$ and $$\eta \in \mathrm{T}_{x}\mathcal{M}$$ that \begin{align*} \left| f(\textrm{Retr}_{x}(\eta)) - \left[ f(x) + \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle \right] \right| \leq \left(\frac{L}{2} \alpha^{2} + G \beta \right)\|\eta\|^{2}. \end{align*} More formally our assumption that the retraction is defined and smooth over the whole tangent bundle a fortiori ensures the existence of r > 0 such that Retr is smooth on $$K = \{\eta \in \mathrm{T}\mathcal{M}: \|\eta \| \leq r\}$$, a compact subset of the tangent bundle (K consists of a ball in each tangent space). First, we determine α; see (B.3). For all η ∈ K we have \begin{align*} \|\textrm{Retr}_{x}(\eta) - x \| & \leq{\int_{0}^{1}} \left\|\frac{\mathrm{d}}{\mathrm{d}t} \textrm{Retr}_{x}(t\eta)\right\| \mathrm{d}t = {\int_{0}^{1}} \|\mathrm{D}\textrm{Retr}_{x}(t\eta)[\eta]\|\ \mathrm{d}t \\ & \leq{\int_{0}^{1}} \max_{\xi\in K} \|\mathrm{D}\textrm{Retr}(\xi)\| \|\eta\|\ \mathrm{d}t = \max_{\xi\in K} \|\mathrm{D}\textrm{Retr}(\xi)\| \|\eta\|, \end{align*} where the $$\max $$ exists and is finite owing to compactness of K and smoothness of Retr on K; note that this is uniform over both x and η. (If $$\xi \in \mathrm{T}_{z}\mathcal{M}$$ the notation DRetr(ξ) refers to DRetrz(ξ).) For all η ∉ K we have \begin{align*} \|\textrm{Retr}_{x}(\eta) - x \| \leq \textrm{diam}(\mathcal{M}) \leq \frac{\textrm{diam}(\mathcal{M})}{r} \|\eta\|, \end{align*} where $$\textrm{diam}(\mathcal{M})$$ is the maximal distance between any two points on $$\mathcal{M}$$: finite by compactness of $$\mathcal{M}$$. Combining, we find that (B.3) holds with \begin{align*} \alpha = \max\left( \max_{\xi\in K} \|\mathrm{D}\textrm{Retr}(\xi)\|, \frac{\textrm{diam}(\mathcal{M})}{r} \right). \end{align*} Inequality (B.4) is established along similar lines. For all η ∈ K we have \begin{align*} \|\textrm{Retr}_{x}(\eta) - x - \eta\| & \leq{\int_{0}^{1}} \left\|\frac{\textrm{d}}{\mathrm{d}t} (\textrm{Retr}_{x}(t\eta)-x-t\eta) \right\| \mathrm{d}t = {\int_{0}^{1}} \|\mathrm{D}\textrm{Retr}_{x}(t\eta)[\eta] - \eta\|\ \mathrm{d}t \\ & \leq{\int_{0}^{1}} \|\mathrm{D}\textrm{Retr}_{x}(t\eta)-\operatorname{Id}\| \|\eta\|\ \mathrm{d}t \leq \frac{1}{2} \max_{\xi\in K} \|\mathrm{D}^{2}\textrm{Retr}(\xi)\| \|\eta\|^{2}, \end{align*} where the last inequality follows from DRetrx(0x) = Id and \begin{align*} \|\mathrm{D}\textrm{Retr}_{x}(t\eta)-\operatorname{Id}\| \leq{\int_{0}^{1}} \left\| \frac{\mathrm{d}}{\mathrm{d}s} \mathrm{D}\textrm{Retr}_{x}(st\eta) \right\| \mathrm{d}s \leq \|t\eta\| {\int_{0}^{1}} \left\| \mathrm{D}^{2}\textrm{Retr}_{x}(t\eta) \right\| \mathrm{d}s. 
\end{align*} The case η ∉ K is treated as before: \begin{align*} \|\textrm{Retr}_{x}(\eta) - x - \eta \| \leq \|\textrm{Retr}_{x}(\eta) - x\| + \|\eta \| \leq \frac{\textrm{diam}(\mathcal{M}) + r}{r^{2}} \|\eta\|^{2}. \end{align*} Combining, we find that (B.4) holds with \begin{align*} \beta = \max\left( \frac{1}{2} \max_{\xi\in K} \big\|\mathrm{D}^{2}\textrm{Retr}(\xi)\big\|, \frac{\textrm{diam}(\mathcal{M}) + r}{r^{2}} \right), \end{align*} which concludes the proof. We now prove the corresponding second-order result, whose aim is to verify Assumptions 3.2. Proof of Lemma 3.1 By assumption ∇2f is Lipschitz continuous along any line segment in $$\mathcal{E}$$ joining x and y in $$\mathcal{M}$$. Hence, there exists L such that, for all $$x, y \in \mathcal{M}$$, \begin{align} \left| f(y) - \left[ f(x) + \left\langle{\nabla f(x)},{y-x}\right\rangle + \tfrac{1}{2}\left\langle{y-x},{\nabla^{2} f(x)[y-x]}\right\rangle \right] \right| \leq \frac{L}{6} \|y-x\|^{3}. \end{align} (B.5) Fix $$x\in \mathcal{M}$$. Let Projx denote the orthogonal projector from $$\mathcal{E}$$ to $$\mathrm{T}_{x}\mathcal{M}$$. Let grad f(x) be the Riemannian gradient of $$f|_{\mathcal{M}}$$ at x and let Hess f(x) be the Riemannian Hessian of $$f|_{\mathcal{M}}$$ at x (a symmetric operator on $$\mathrm{T}_{x}\mathcal{M}$$). For Riemannian submanifolds of Euclidean spaces we have these explicit expressions with $$\eta \in \mathrm{T}_{x}\mathcal{M}$$—see Absil et al. (2008, eqs. (3.37), (5.15), Def. (5.5.1)) and Absil et al. (2013): \begin{align*} \textrm{grad}\ f(x) & = \textrm{Proj}_{x} \nabla f(x),\ \textrm{and} \\ \left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle & = \left\langle{\eta},{ \mathrm{D}\left( x \mapsto \textrm{Proj}_{x} \nabla f(x) \right)(x)[\eta]}\right\rangle \\ & = \left\langle{\eta},{\left( \mathrm{D}\left( x \mapsto \textrm{Proj}_{x} \right)(x)[\eta]\right)[\nabla f(x)] + \textrm{Proj}_{x} \nabla^{2}f(x)[\eta]}\right\rangle \\ & = \left\langle{II(\eta,\eta)},{\nabla f(x)}\right\rangle + \left\langle{\eta},{\nabla^{2} f(x)[\eta]}\right\rangle, \end{align*} where II, as implicitly defined above, is the second fundamental form of $$\mathcal{M}$$: II(η, η) is a normal vector to the tangent space at x, capturing the second-order geometry of $$\mathcal{M}$$—see Absil et al. (2009, 2013) and Monera et al. (2014) for presentations relevant to our setting. In particular, II(η, η) is the acceleration in $$\mathcal{E}$$ at x of a geodesic γ(t) on $$\mathcal{M}$$ defined by γ(0) = x and $$\dot \gamma (0) = \eta $$: $$\ddot \gamma (0) = II(\eta ,\eta )$$ (O’Neill, 1983, Cor. 4.9). Let $$\eta \in \mathrm{T}_{x}\mathcal{M}$$ be arbitrary; $$y = \textrm{Retr}_{x}(\eta ) \in \mathcal{M}$$. Then \begin{align*} \left\langle{\nabla f(x)},{y-x}\right\rangle - \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle &= \left\langle{\nabla f(x)},{y - x - \eta}\right\rangle\ \textrm{and} \\ \left\langle{y-x},{\nabla^{2} f(x)[y-x]}\right\rangle - \left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle &= 2\left\langle{\eta},{\nabla^{2} f(x)[y-x-\eta]}\right\rangle\\ &\quad + \left\langle{y-x-\eta},{\nabla^{2} f(x)[y-x-\eta]}\right\rangle\\ &\quad - \left\langle{\nabla f(x)},{II(\eta,\eta)}\right\rangle. \end{align*} Since $$\mathcal{M}$$ is compact and f is twice continuously differentiable, there exist G, H, independent of x, such that ∥∇f(x)∥≤ G and $$\left \|\nabla ^{2} f(x)\right \| \leq H$$ (the latter is the induced operator norm). 
Combining with (B.5) and using the triangle and Cauchy–Schwarz inequalities multiple times, \begin{align*} &\left|\, f(y) - \left[ f(x) + \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle + \frac{1}{2}\left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle \right] \right| \\ &\quad\leq \frac{L}{6} \|y-x\|^{3} + G\left\|y-x-\eta-\frac{1}{2}II(\eta,\eta)\right\| + H\|\eta\|\|y-x-\eta\| + \frac{1}{2} H \|y-x-\eta\|^{2}. \end{align*} Using the same argument as for Lemma 2.4 we can find finite constants α, β independent of x and η such that (B.3) and (B.4) hold. Use $$\|y-x-\eta \|^{2} \leq \|y-x-\eta \| \left ( \|y-x\| + \|\eta \| \right ) \leq \beta (\alpha +1)\|\eta \|^{3}$$ to upper bound the right-hand side above with \begin{align*} \left(\frac{L}{6}\alpha^{3} + H\beta + \frac{H\beta(\alpha+1)}{2}\right) \|\eta\|^{3} + G\left\|y-x-\eta-\frac{1}{2}II(\eta,\eta)\right\|. \end{align*} We turn to the last term. Consider $$K \subset \mathrm{T}\mathcal{M}$$ as defined in the proof of Lemma 2.4 for some r > 0. If η ∉ K, that is, ∥η∥ > r, then since II is bilinear for a fixed $$x\in \mathcal{M}$$, we can define \begin{align*} \|II\| = \max_{x \in \mathcal{M}} \max_{\xi \in \mathrm{T}_{x}\mathcal{M}, \|\xi\| \leq 1} \|II(\xi, \xi)\| \end{align*} (finite by continuity and compactness) so that ∥II(η, η)∥≤∥II∥∥η∥2. Then, \begin{align*} \left\|y-x-\eta-\frac{1}{2}II(\eta,\eta)\right\| \leq \|y-x\| + \|\eta\| + \frac{1}{2} \|II(\eta, \eta)\| \leq \left( \frac{\operatorname{diam}(\mathcal{M})}{r^{3}} + \frac{1}{r^{2}} + \frac{1}{2} \frac{\|II\|}{r} \right) \|\eta\|^{3}. \end{align*} Now assume η ∈ K, that is, ∥η∥≤ r. Consider ϕ(t) = Retrx(tη) (a curve on $$\mathcal{M}$$) and let ϕ′′ denote its acceleration on $$\mathcal{M}$$ and $$\ddot \phi $$ denote its acceleration in $$\mathcal{E}$$, while $$\dot \phi = \phi ^{\prime }$$ denotes velocity along the curve. It holds that $$\ddot \phi (t) = \phi ^{\prime \prime }(t) + II(\dot \phi (t), \dot \phi (t))$$ (O’Neill, 1983, Cor. 4.9). Since Retr is a second-order retraction, acceleration on $$\mathcal{M}$$ is zero at t = 0, that is, ϕ′′(0) = 0, so that ϕ(0) = x, $$\dot \phi (0) = \eta $$ and $$\ddot \phi (0) = II(\eta , \eta )$$. Then by Taylor expansion of ϕ in $$\mathcal{E}$$, \begin{align*} y = \textrm{Retr}_{x}(\eta) = \phi(1) = x + \eta + \frac{1}{2} II(\eta,\eta) + R_{3}(\eta), \end{align*} where \begin{align*} \| R_{3}(\eta) \| = \left\| {\int_{0}^{1}} \frac{(1-t)^{2}}{2} \dddot \phi(t)\ \mathrm{d}t \right\| \leq \frac{1}{6} \max_{\xi\in K} \left\| \mathrm{D}^{3} \textrm{Retr}(\xi) \right\| \|\eta\|^{3}. \end{align*} The combined arguments ensure existence of a constant γ, independent of x and η, such that \begin{align*} \left\|y-x-\eta-\frac{1}{2}II(\eta,\eta)\right\| \leq \gamma\|\eta\|^{3}. \end{align*} Combining, we find that for all $$x\in \mathcal{M}$$ and $$\eta \in \mathrm{T}_{x}\mathcal{M}$$, \begin{align*} \left|\, f(\textrm{Retr}_{x}(\eta)) - \left[ f(x) + \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle + \frac{1}{2}\left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle \right] \right| \leq \left(\frac{L}{6}\alpha^{3} + \frac{H\beta(\alpha+3)}{2} + \gamma\right) \|\eta\|^{3}. \end{align*} Since Retr is a second-order retraction, Hess f(x) coincides with the Hessian of the pullback f ∘ Retrx (Lemma 3.9). This establishes Assumption 3.2. Appendix C. 
Proof of Lemma 2.7 about Armijo line-search Proof of Lemma 2.7 By Assumption 2.6, upper bound (2.5) holds with $$\eta = t{\eta _{k}^{0}}$$ for any t such that ∥η∥≤ ϱk: \begin{align} f(x_{k}) - f\left(\textrm{Retr}_{x_{k}}\left(t \cdot{\eta_{k}^{0}}\right)\right) \geq t\left\langle{-\textrm{grad}\ f(x_{k})},{{\eta_{k}^{0}}}\right\rangle - \frac{Lt^{2}}{2} \left\|{\eta_{k}^{0}}\right\|^{2}. \end{align} (C.1) We determine a sufficient condition on t for the stopping criterion in Algorithm 2 to trigger. To this end observe that the right-hand side of (C.1) dominates $$c_{1} t \left \langle{-\textrm{grad}\ f(x_{k})},{{\eta _{k}^{0}}}\right \rangle $$ if \begin{align*} t(1-c_{1}) \cdot \left\langle{-\textrm{grad}\ f(x_{k})},{{\eta_{k}^{0}}}\right\rangle \geq \frac{Lt^{2}}{2} \left\|{\eta_{k}^{0}}\right\|^{2}. \end{align*} Thus, the stopping criterion in Algorithm 2 is satisfied in particular for all t in \begin{align*} \left[0, \frac{2(1-c_{1})\left\langle{-\textrm{grad} \ f(x_{k})},{{\eta_{k}^{0}}}\right\rangle}{L_{g}\|{\eta_{k}^{0}}\|^{2}}\right] \supseteq \left[ 0, \frac{2c_{2} (1-c_{1})\|\textrm{grad}\ f(x_{k})\|}{L_{g}\|{\eta_{k}^{0}}\|}\right] \supseteq \left[ 0, \frac{2c_{2} (1-c_{1})}{c_{4}L_{g}}\right]. \end{align*} Unless it equals $${\bar t}_{k}$$, the returned t cannot be smaller than τ times the last upper bound. In all cases the cost decrease satisfies \begin{align*} f(x_{k}) - f\left(\textrm{Retr}_{x_{k}}\left(t \cdot{\eta_{k}^{0}}\right)\right) & \geq c_{1} t \left\langle{-\textrm{grad}\ f(x_{k})},{{\eta_{k}^{0}}}\right\rangle \\ & \geq c_{1}c_{2}t\|\textrm{grad}\ f(x_{k})\|\left\|{\eta_{k}^{0}}\right\| \\ & \geq c_{1}c_{2}c_{3}t\|\textrm{grad}\ f(x_{k})\|^{2}. \end{align*} To count the number of iterations consider that checking whether $$t ={\bar t}_{k}$$ satisfies the stopping criterion requires one cost evaluation. Following that, t is reduced by a factor τ exactly $$\log _{\tau }(t/{\bar t}_{k}) = \log _{\tau ^{-1}}({\bar t}_{k}/t)$$ times, each followed by one cost evaluation. Appendix D. Proofs for Section 3.5 about Hk and the Hessians Proof of Lemma 3.9 The Hessian of f and that of the pullback are related by the following formulas. See (Absil et al., 2008, Chapter 5) for the precise meanings of the differential operators D and d. For all η in $$\mathrm{T}_{x}\mathcal{M}$$, writing $$\hat f_{x} = f\circ \textrm{Retr}_{x}$$ for convenience, \begin{align*} \frac{\mathrm{d}}{\mathrm{d}t} f(\textrm{Retr}_{x}(t\eta)) =& \left\langle{\textrm{grad}\ f(\textrm{Retr}_{x}(t\eta))},{\frac{\mathrm{D}}{\mathrm{d}t} \textrm{Retr}_{x}(t\eta)}\right\rangle, \\ \left\langle{\nabla^{2} \hat f_{x}(0_{x})[\eta]},{\eta}\right\rangle =& \left. \frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}} f(\textrm{Retr}_{x}(t\eta)) \right|{}_{t=0} \\ =& \left\langle{\textrm{Hess}\ f(x)\left[\mathrm{D}\textrm{Retr}_{x}(0_{x})[\eta]\right]},{\left. \frac{\mathrm{D}}{\mathrm{d}t} \textrm{Retr}_{x}(t\eta) \right|{}_{t=0}}\right\rangle \\ & + \left\langle{\textrm{grad}\ f(x)},{\left. \frac{\mathrm{D}^{2}}{\mathrm{d}t^{2}} \textrm{Retr}_{x}(t\eta) \right|{}_{t=0}}\right\rangle \\ =& \left\langle{\textrm{Hess}\ f(x)[\eta]},{\eta}\right\rangle + \left\langle{\textrm{grad}\ f(x)},{\left. \frac{\mathrm{D}^{2}}{\mathrm{d}t^{2}} \textrm{Retr}_{x}(t\eta) \right|{}_{t=0}}\right\rangle. \end{align*} (To get the third equality it is assumed one is working with the Levi-Civita connection, so that Hess f is indeed the Riemannian Hessian.) Since the acceleration of the retraction is bounded, we get the result via Cauchy–Schwarz. 
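As a concrete numerical check of Lemma 3.9 in the second-order case (this sketch is ours; it uses the closed-form sphere expressions recalled in Appendix A together with the projective retraction, which is second order), the second derivative at t = 0 of the pullback t ↦ f(Retrx(tη)) should match ⟨η, Hess f(x)[η]⟩ for f(x) = ½xTAx:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 6
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2                                  # symmetric matrix

    f = lambda z: 0.5 * z @ A @ z                      # cost, defined on all of R^n
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                             # point on the sphere
    eta = rng.standard_normal(n)
    eta -= (x @ eta) * x                               # tangent vector at x

    retr = lambda v: (x + v) / np.linalg.norm(x + v)   # projective (second-order) retraction
    hess_form = eta @ (A @ eta) - (x @ A @ x) * (eta @ eta)   # <eta, Hess f(x)[eta]>, Appendix A

    t = 1e-4                                           # central finite difference of the pullback at t = 0
    pullback_d2 = (f(retr(t * eta)) - 2 * f(x) + f(retr(-t * eta))) / t**2
    print(pullback_d2, hess_form)                      # should agree to several digits

With a retraction that is only first order, the two quantities could differ, but by at most a·∥grad f(x)∥·∥η∥², as quantified by Lemma 3.9.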
Proof of Proposition 3.11 Combine ∥grad f(xk)∥≤ εg and Hk ≽−εH Id with \begin{align*} \left\| \textrm{Hess}\ f(x_{k}) - \nabla^{2} \hat f_{x_{k}}(0_{x_{k}}) \right\| \leq a_{k} \cdot \|\textrm{grad}\ f(x_{k})\|\ \quad\textrm{and} \qquad \left\|\nabla^{2} \hat f_{k}(0_{x_{k}}) - H_{k}\right\| \leq \delta_{k} \end{align*} by the triangular inequality. Appendix E. Complexity dependence on n in the Max-Cut example This appendix supports Section 4. By Proposition 4.1, running Algorithm 3 with $$\varepsilon _{g} = \infty $$ and $$\varepsilon _{H} = \frac{2\delta }{n}$$ yields a solution Y within a gap δ from the optimal value of (4.2). Let $$\underline{f}$$ and $$\overline{f}$$ denote the minimal and maximal values of $$f(Y) = \left \langle{C},{YY^{\textrm{T} } }\right \rangle $$ over $$\mathcal{M}$$ (see (4.3)), respectively, with metric $$\left \langle{A},{B}\right \rangle = \textrm{Tr}\left (A^{\textrm{T} }\! B\right )$$ and associated Frobenius norm ∥⋅∥F. Then using ρ′ = 1/10, setting c3 = 1/2 in Assumption 3.8 as allowed by Lemma 3.4 and using the true Hessian of the pullbacks for Hk so that c1 = 0 in Assumption 3.5, Theorem 3.4 guarantees that Algorithm 3 returns an answer in at most \begin{align} 214(\overline{f} - \underline{f}) \cdot{L_{H}^{2}} \cdot \frac{1}{{\varepsilon_{H}^{3}}}\ \textrm{+ log term} \end{align} (E.1) iterations. Using the LDLT-factorization strategy of Lemma 3.3 with a randomly generated orthonormal basis at each tangent space encountered, since $$\dim \mathcal{M} = n^{2}$$ for p = n + 1, the cost of each iteration is $$\mathcal{O}\left (n^{6}\right )$$ arithmetic operations (dominated by the cost of the LDLT factorization). It remains to bound LH, in compliance with Assumption 3.2. Let $$g \colon{\mathbb{R}} \to{\mathbb{R}}$$ be defined as $$g(t) = f(\textrm{Retr}_{Y}(t\dot Y))$$. Then using a Taylor expansion, \begin{align} f(\textrm{Retr}_{Y}(\dot Y)) = g(1) = g(0) + g^{\prime}(0) + \frac{1}{2} g^{\prime\prime}(0) + \frac{1}{6} g^{\prime\prime\prime}(t) \end{align} (E.2) for some t ∈ (0, 1). Let $$\hat f_{Y} = f \circ \textrm{Retr}_{Y}$$. Definition 2.1 for retractions implies \begin{align} g(0) = f(Y), \qquad g^{\prime}(0) = \left\langle{\textrm{grad}\ f(Y)},{\dot Y}\right\rangle, \qquad g^{\prime\prime}(0) = \left\langle{\dot Y},{\nabla^{2} \hat f_{Y}(0_{Y})[\dot Y]}\right\rangle, \end{align} (E.3) so that it only remains to bound |g′′′(t)| uniformly over $$Y, \dot Y$$ and t ∈ [0, 1]. For this example it is easier to handle g′′′ if the retraction used is the exponential map (similar bounds can be obtained with the orthogonal projection retraction; see Mei et al., 2017, Lemmas 4 and 5). This map is known in explicit form and is cheap to compute for the sphere $$\mathbb{S}^{n} = \left \{ x \in{\mathbb{R}}^{n+1} : x^{\textrm{T} }\! x = 1 \right \}$$. Indeed, if $$x \in \mathbb{S}^{n}$$ and $$\eta \in \mathrm{T}_{x}\mathbb{S}^{n}$$, following Absil et al. (2008, Ex. 5.4.1), \begin{align} \gamma(t) = \textrm{Exp}_{x}(t\eta) = \cos(t\|\eta\|) x + \sin(t\|\eta\|) \frac{1}{\|\eta\|} \eta. \end{align} (E.4) Conceiving of γ as a map from $${\mathbb{R}}$$ to $${\mathbb{R}}^{n+1}$$ its differentials are easily derived: \begin{align} \dot \gamma(t) = -\|\eta\|\sin(t\|\eta\|) x + \cos(t\|\eta\|) \eta, \qquad \ddot \gamma(t) = - \|\eta\|^{2} \gamma(t), \qquad \dddot \gamma(t) = - \|\eta\|^{2} \dot \gamma(t). \end{align} (E.5) Extending this map rowwise gives the exponential map for $$\mathcal{M}$$—of course, this is a second-order retraction. 
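The identities (E.5) are straightforward to confirm numerically; the following short sketch (ours, purely illustrative) compares a central finite difference of γ with −∥η∥²γ(t):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5
    x = rng.standard_normal(n + 1)
    x /= np.linalg.norm(x)                             # point on S^n
    eta = rng.standard_normal(n + 1)
    eta -= (x @ eta) * x                               # tangent vector at x

    def gamma(t):                                      # exponential map (E.4)
        r = np.linalg.norm(eta)
        return np.cos(t * r) * x + np.sin(t * r) * eta / r

    t, h = 0.3, 1e-4
    acc = (gamma(t + h) - 2 * gamma(t) + gamma(t - h)) / h**2   # finite-difference acceleration
    print(np.linalg.norm(acc + np.linalg.norm(eta)**2 * gamma(t)))   # small, in line with (E.5)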
We define $$\varPhi (t) = \textrm{Retr}_{Y}(t\dot Y)$$ and $$g(t) = f(\textrm{Retr}_{Y}(t \dot Y)) = \left \langle{C\varPhi (t)},{\varPhi (t)}\right \rangle $$. In particular $$\ddot \varPhi (t) = -D\varPhi (t)$$ and $$\dddot \varPhi (t) = -D\dot \varPhi (t)$$, where $$D = \textrm{diag}\left (\|\dot y_{1}\|^{2}, \ldots , \|\dot y_{n}\|^{2}\right )$$ and $$\dot y_{k}^{\textrm{T} }\! $$ is the kth row of $$\dot Y$$. As a result, for a given Y and $$\dot Y$$, a short calculation gives \begin{align} g^{\prime\prime\prime}(t) = -6\left\langle{C\dot\varPhi(t)},{D\varPhi(t)}\right\rangle - 2\left\langle{C\varPhi(t)},{D\dot\varPhi(t)}\right\rangle. \end{align} (E.6) Using Cauchy–Schwarz multiple times, as well as the inequality $$\|{AB}\|_{\mathrm{F}} \leq \left \|{A}\right \|_{\mathrm{2}}\|{B}\|_{\mathrm{F}}$$ where $$\left \|{A}\right \|_{\mathrm{2}}$$ denotes the largest singular value of A, and using that $$\|{\varPhi (t)}\|_{\mathrm{F}} = \sqrt{n}$$ and $$\|{\dot \varPhi (t)}\|_{\mathrm{F}} = \|{\dot Y}\|_{\mathrm{F}}$$ for all t, and additionally that $$\left \|{D}\right \|_{\mathrm{2}} \leq \textrm{Tr}(D) = \|{\dot Y}\|_{\mathrm{F}}^{2}$$, it follows that \begin{align} \sup_{Y\in\mathcal{M}, \dot{Y}\in\mathrm{T}_{Y}\mathcal{M}, \dot Y\neq 0, t\in(0, 1)} \frac{|g^{\prime\prime\prime}(t)|}{\|{\dot Y}\|_{\mathrm{F}}^{3}} \leq 8\left\|{C}\right\|_{\mathrm{2}}\sqrt{n}. \end{align} (E.7) As a result an acceptable constant LH for Assumption 3.2 is $$L_{H} = 8\left \|{C}\right \|_{\mathrm{2}} \sqrt{n}.$$ Combining all the statements of this section it follows that a solution Y within an absolute gap δ of the optimal value can be obtained for problem (4.2) using Algorithm 3 in at most $$\mathcal{O}\big ( (\overline{f} - \underline{f}) \left \|{C}\right \|_{\mathrm{2}}^{2} \cdot n^{10} \cdot \frac{1}{\delta ^{3}} \big )$$ arithmetic operations, neglecting the additive logarithmic term. Note that, following Mei et al. (2017, Appendix A.2, points 1 and 2), it is also possible to bound LH as $$6\left \|{C}\right \|_{\mathrm{2}} + 2\|C\|_{1}$$, where ∥⋅∥1 is the ℓ1 operator norm. This reduces the explicit dependence on n from n10 to n9 in the bound on the total amount of work. © The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.

Global rates of convergence for nonconvex optimization on manifolds

Loading next page...
 
/lp/ou_press/global-rates-of-convergence-for-nonconvex-optimization-on-manifolds-sm1zjEDcgP
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.
ISSN
0272-4979
eISSN
1464-3642
D.O.I.
10.1093/imanum/drx080
Publisher site
See Article on Publisher Site

Abstract

Abstract We consider the minimization of a cost function f on a manifold $$\mathcal{M}$$ using Riemannian gradient descent and Riemannian trust regions (RTR). We focus on satisfying necessary optimality conditions within a tolerance ε. Specifically, we show that, under Lipschitz-type assumptions on the pullbacks of f to the tangent spaces of $$\mathcal{M}$$, both of these algorithms produce points with Riemannian gradient smaller than ε in $$\mathcal{O}\big(1/\varepsilon ^{2}\big)$$ iterations. Furthermore, RTR returns a point where also the Riemannian Hessian’s least eigenvalue is larger than −ε in $$\mathcal{O} \big(1/\varepsilon ^{3}\big)$$ iterations. There are no assumptions on initialization. The rates match their (sharp) unconstrained counterparts as a function of the accuracy ε (up to constants) and hence are sharp in that sense. These are the first deterministic results for global rates of convergence to approximate first- and second-order Karush-Kuhn-Tucker points on manifolds. They apply in particular for optimization constrained to compact submanifolds of $${\mathbb{R}^{n}}$$, under simpler assumptions. 1. Introduction Optimization on manifolds is concerned with solving nonlinear and typically nonconvex computational problems of the form \begin{align} \min_{x \in \mathcal{M}} \ f(x), \end{align} (P) where $$\mathcal{M}$$ is a (smooth) Riemannian manifold and $$f \colon \mathcal{M} \to{\mathbb{R}}$$ is a (sufficiently smooth) cost function (Gabay, 1982; Smith, 1994; Edelman et al., 1998; Absil et al., 2008). Applications abound in machine learning, computer vision, scientific computing, numerical linear algebra, signal processing, etc. In typical applications x is a matrix and $$\mathcal{M}$$ could be a Stiefel manifold of orthonormal frames (including spheres and groups of rotations), a Grassmann manifold of subspaces, a cone of positive definite matrices or simply a Euclidean space such as $${\mathbb{R}^{n}}$$. The standard theory for optimization on manifolds takes the standpoint that optimizing on a manifold $$\mathcal{M}$$ is not fundamentally different from optimizing in $${\mathbb{R}^{n}}$$. Indeed, many classical algorithms from unconstrained nonlinear optimization such as gradient descent, nonlinear conjugate gradients, BFGS, Newton’s method and trust-region methods (Ruszczyński, 2006; Nocedal & Wright, 1999) have been adapted to apply to the larger framework of (P) (Adler et al., 2002; Absil et al., 2007, 2008; Ring & Wirth, 2012; Huang et al., 2015; Sato, 2016). Softwarewise, a few general toolboxes for optimization on manifolds exist now, for example, Manopt (Boumal et al., 2014), PyManopt (Townsend et al., 2016) and ROPTLIB (Huang et al., 2016). As (P) is typically nonconvex, one does not expect general purpose, efficient algorithms to converge to global optima of (P) in general. Indeed, the class of problems (P) includes known NP-hard problems. In fact, even computing local optima is NP-hard in general (Vavasis, 1991, Chapter 5). Nevertheless, one may still hope to compute points of $$\mathcal{M}$$ that satisfy first- and second-order necessary optimality conditions. These conditions take up the same form as in unconstrained nonlinear optimization, with Riemannian notions of gradient and Hessian. For $$\mathcal{M}$$ defined by equality constraints these conditions are equivalent to first- and second-order Karush-Kuhn-Tucker (KKT) conditions, but are simpler to manipulate because the Lagrangian multipliers are automatically determined. 
(2012) prove the same for trust-region methods, and further show that if f is twice Lipschitz continuously differentiable, then a point x where ∥grad f(x)∥ ≤ ε and Hess f(x) $$\succeq $$ −ε Id is computed in $$\mathcal{O}\big(1/\varepsilon ^{3}\big)$$ iterations, also with examples showing sharpness. In this paper we extend the unconstrained results to the larger class of optimization problems on manifolds (P).
This work builds upon the original proofs (Nesterov, 2004; Cartis et al., 2012) and on existing adaptations of gradient descent and trust-region methods to manifolds (Absil et al., 2007, 2008). One key step is the identification of a set of relevant Lipschitz-type regularity assumptions, which allow the proofs to carry over from $${\mathbb{R}^{n}}$$ to $$\mathcal{M}$$ with relative ease. Main results We state the main results here informally. We use the notion of retraction Retrx (see Definition 2.1), which allows one to map tangent vectors at x to points on $$\mathcal{M}$$. Iterates are related by $$x_{k+1} = \textrm{Retr}_{x_{k}}(\eta _{k})$$ for some tangent vector ηk at xk (the step). Hence, f ∘ Retrx is a lift of the cost function from $$\mathcal{M}$$ to the tangent space at x. For $$\mathcal{M} = {\mathbb{R}^{n}}$$, the standard retraction gives $$\textrm{Retr}_{x_{k}}(\eta _{k}) = x_{k} + \eta _{k}$$. By ∥⋅∥ we denote the norm associated with the Riemannian metric. About gradient descent. (See Theorems 2.5 and 2.8.) For problem (P), if f is bounded below on $$\mathcal{M}$$ and f ∘ Retrx has Lipschitz gradient with constant Lg independent of x, then Riemannian gradient descent with constant step size 1/Lg or with backtracking Armijo line-search returns x with ∥grad f(x)∥≤ ε in $$\mathcal{O}\big(1/\varepsilon ^{2}\big)$$ iterations. About trust regions. (See Theorem 3.4.) For problem (P), if f is bounded below on $$\mathcal{M}$$ and f ∘ Retrx has Lipschitz gradient with constant independent of x then Riemannian trust region (RTR) returns x with ∥grad f(x)∥ ≤ εg in $$\mathcal{O}\big (1/{\varepsilon _{g}^{2}}\big )$$ iterations, under weak assumptions on the model quality. If furthermore f ∘ Retrx has Lipschitz Hessian with constant independent of x then RTR returns x with ∥grad f(x)∥ ≤ εg and Hess f(x) $$\succeq -\varepsilon _{H}$$ Id in $$\mathcal{O}\big (\max \big \{1/{\varepsilon _{H}^{3}}, 1/\big({\varepsilon _{g}^{2}} \varepsilon _{H}\big)\big \} \big )$$ iterations, provided the true Hessian is used in the model and the retraction is second order. About compact submanifolds. (See Lemmas 2.4 and 3.1.) The first-order regularity conditions above hold in particular if $$\mathcal{M}$$ is a compact submanifold of a Euclidean space $$\mathcal{E}$$ (such as $${\mathbb{R}^{n}}$$) and $$f \colon \mathcal{E} \to{\mathbb{R}}$$ has a locally Lipschitz continuous gradient. The second-order regularity conditions hold if furthermore f has a locally Lipschitz continuous Hessian on $$\mathcal{E}$$ and the retraction is second order (Definition 3.10). Since the rates $$\mathcal{O}\big(1/\varepsilon ^{2}\big)$$ and $$\mathcal{O}\big(1/\varepsilon ^{3}\big)$$ are sharp for gradient descent and trust regions when $$\mathcal{M} = {\mathbb{R}^{n}}$$ (Cartis et al., 2010, 2012), they are also sharp for $$\mathcal{M}$$ a generic Riemannian manifold. Below, constants are given explicitly, thus precisely bounding the total amount of work required in the worst case to attain a prescribed tolerance. The theorems presented here are the first deterministic results about the worst-case iteration complexity of computing (approximate) first- and second-order critical points on manifolds. The choice of analysing Riemannian gradient descent and RTR first is guided by practical concerns, as these are among the most commonly used methods on manifolds so far. The proposed complexity bounds are particularly relevant when applied to problems for which second-order necessary optimality conditions are also sufficient.
See for example Sun et al. (2015, 2016), Boumal (2015b, 2016), Bandeira et al. (2016), Bhojanapalli et al. (2016), Ge et al. (2016) and the example in Section 4. Related work The complexity of Riemannian optimization is discussed in a few recent lines of work. Zhang & Sra (2016) treat geodesically convex problems over Hadamard manifolds. This is a remarkable extension of important pieces of classical convex optimization theory to manifolds with negative curvature. Because of the focus on geodesically convex problems, those results do not apply to the more general problem (P), but have the clear advantage of guaranteeing global optimality. In Zhang et al. (2016), which appeared a day before the present paper on public repositories, the authors also study the iteration complexity of nonconvex optimization on manifolds. Their results differ from the ones presented here in that they focus on stochastic optimization algorithms, aiming for first-order conditions. Their results assume bounded curvature for the manifold. Furthermore, their analysis relies on the Riemannian exponential map, whereas we cover the more general class of retraction maps (which is computationally advantageous). We also do not use the notions of Riemannian parallel transport or logarithmic map, which, in our view, makes for a simpler analysis. Sun et al. (2015, 2016) consider dictionary learning and phase retrieval and show that these problems, when appropriately framed as optimization on a manifold, are low-dimensional and have no spurious local optimizers. They derive the complexity of RTR specialized to their application. In particular, they combine the global rate with a local convergence rate, which allows them to establish an overall better complexity than $$\mathcal{O}\big(1/\varepsilon ^{3}\big)$$, but with an idealized version of the algorithm and restricted to these relevant applications. In this paper we favor a more general approach, focused on algorithms closer to the ones implemented in practice. Recent work by Bento et al. (2017) (which appeared after a first version of this paper) focuses on iteration complexity of gradient, subgradient and proximal point methods for the case of convex cost functions on manifolds, using the exponential map as retraction. For the classical, unconstrained case, optimal complexity bounds of order $$\mathcal{O}\big(1/\varepsilon ^{1.5}\big)$$ to generate x with ∥grad f(x)∥ ≤ ε have also been given for cubic regularization methods (Cartis et al., 2011a, b) and sophisticated trust region variants (Curtis et al., 2016). Bounds for regularization methods can be further improved if higher-order derivatives are available (Birgin et al., 2017). Worst-case evaluation complexity bounds have been extended to constrained smooth problems in Cartis et al. (2014, 2015a,b). There, it is shown that some carefully devised, albeit impractical, phase 1–phase 2 methods can compute approximate KKT points with global rates of convergence of the same order as in the unconstrained case. We note that when the constraints are convex (but the objective may not be), practical, feasible methods have been devised (Cartis et al., 2015a) that connect to our approach below. Second-order optimality for the case of convex constraints with nonconvex cost has been recently addressed in Cartis et al. (2016). 2. Riemannian gradient descent methods Consider the generic Riemannian descent method described in Algorithm 1.
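For concreteness, the following minimal Python (NumPy) sketch, which is not taken from the paper, shows the kind of loop Algorithm 1 describes, instantiated on the unit sphere with the normalization retraction and the fixed step of Theorem 2.8 and Corollary 2.9 below; the constant Lg it uses is a crude illustrative choice, not a derived Lipschitz constant.

import numpy as np

def riemannian_descent(rgrad, retr, x0, pick_step, eps=1e-6, max_iters=10_000):
    """Generic descent loop in the spirit of Algorithm 1: stop once the Riemannian
    gradient is small, otherwise retract along the chosen tangent step."""
    x = x0
    for _ in range(max_iters):
        g = rgrad(x)
        if np.linalg.norm(g) <= eps:
            break
        x = retr(x, pick_step(x, g))
    return x

# Concrete instance: minimize f(x) = x^T A x over the unit sphere in R^n.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                                          # symmetric test matrix

f = lambda x: x @ A @ x
rgrad = lambda x: 2 * (A @ x - (x @ A @ x) * x)            # tangent projection of the Euclidean gradient 2Ax
retr = lambda x, eta: (x + eta) / np.linalg.norm(x + eta)  # normalization retraction

L_g = 4 * np.linalg.norm(A, 2)             # illustrative constant only, not a derived bound
fixed_step = lambda x, g: -g / L_g         # eta_k = -grad f(x_k) / L_g (no step-size cap, as in Corollary 2.9)

x0 = rng.standard_normal(n)
x0 /= np.linalg.norm(x0)
x = riemannian_descent(rgrad, retr, x0, fixed_step)
print(f(x), np.linalg.norm(rgrad(x)))      # f(x) typically approaches the smallest eigenvalue of A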
We first prove that, provided sufficient decrease in the cost function is achieved at each iteration, the algorithm computes a point xk such that ∥grad f(xk)∥ ≤ ε with $$k = \mathcal{O}\big(1/\varepsilon ^{2}\big)$$. Then we propose a Lipschitz-type assumption, which is sufficient to guarantee that simple strategies to pick the steps ηk indeed ensure sufficient decrease. The proofs parallel the standard ones (Nesterov, 2004, Sect. 1.2.3). The main novelty is the careful extension to the Riemannian setting, which requires the well-known notion of retraction (Definition 2.1) and the new Assumption 2.6 (see below). The step ηk is a tangent vector to $$\mathcal{M}$$ at xk. Because $$\mathcal{M}$$ is nonlinear (in general) the operation xk + ηk is undefined. The notion of retraction provides a theoretically sound replacement. Informally, $$x_{k+1} = \textrm{Retr}_{x_{k}}(\eta _{k})$$ is a point on $$\mathcal{M}$$ that one reaches by moving away from xk, along the direction ηk, while remaining on the manifold. The Riemannian exponential map (which generates geodesics) is a retraction. The crucial point is that many other maps are retractions, often far less difficult to compute than the exponential. The definition of retraction below can be traced back to Shub (1986) and it appears under that name in Adler et al. (2002); see also Absil et al. (2008, Def. 4.1.1 and Sect. 4.10) for additional references. Definition 2.1 (Retraction). A retraction on a manifold $$\mathcal{M}$$ is a smooth mapping Retr from the tangent bundle1$$\mathrm{T}\mathcal{M}$$ to $$\mathcal{M}$$ with the following properties. Let $$\textrm{Retr}_{x} \colon \mathrm{T}_{x}\mathcal{M} \to \mathcal{M}$$ denote the restriction of Retr to $$\mathrm{T}_{x}\mathcal{M}$$: (i) Retrx(0x) = x, where 0x is the zero vector in $$\mathrm{T}_{x}\mathcal{M}$$; (ii) the differential of Retrx at 0x, D Retrx(0x), is the identity map. These combined conditions ensure retraction curves t ↦ Retrx(tη) agree up to first order with geodesics passing through x with velocity η, around t = 0. Sometimes we allow Retrx to be defined only locally, in a closed ball of radius ϱ(x) > 0 centered at 0x in $$\mathrm{T}_{x}\mathcal{M}$$. In linear spaces such as $${\mathbb{R}^{n}}$$ the typical choice is Retrx(η) = x + η. On the sphere, a popular choice is $$\textrm{Retr}_{x}(\eta ) = \frac{x+\eta }{\|x+\eta \|}$$. Remark 2.2 If the retraction at xk is defined only in a ball of radius ϱk = ϱ(xk) around the origin in $$\mathrm{T}_{x_{k}}\mathcal{M}$$, we limit the size of step ηk to ϱk. Theorems in this section provide a complexity result provided $$\varrho = \inf _{k} \varrho _{k}> 0$$. If the injectivity radius of the manifold is positive, retractions satisfying the condition $$\inf _{x\in \mathcal{M}} \varrho (x)> 0$$ exist. In particular, compact manifolds have positive injectivity radius (Chavel, 2006, Thm. III.2.3). The option to limit the step sizes is also useful when the constant Lg in Assumption 2.6 below does not exist globally. The two central assumptions and a general theorem about Algorithm 1 follow. Assumption 2.3 (Lower bound). There exists $$f^{\ast }> -\infty $$ such that f(x) ≥ f* for all $$x \in \mathcal{M}$$. Assumption 2.4 (Sufficient decrease). There exist $$c,c^{\prime}> 0$$ such that, for all k ≥ 0, \begin{align*} f(x_{k}) - f(x_{k+1}) \geq \min\left( c \|\textrm{grad}\ f(x_{k})\|, c^{\prime} \right) \|\textrm{grad}\ f(x_{k})\|. 
\end{align*} Algorithm 1 Generic Riemannian descent algorithm Theorem 2.5 Under Assumptions 2.3 and 2.4, Algorithm 1 returns $$x\in \mathcal{M}$$ satisfying f(x) ≤ f(x0) and ∥grad f(x)∥ ≤ ε in at most \begin{align*} \left\lceil \frac{f(x_{0})-f^{\ast}}{c} \cdot \frac{1}{\varepsilon^{2}} \right\rceil \end{align*} iterations, provided $$\varepsilon \leq \frac{c^{\prime }}{c}$$. If $$\varepsilon> \frac{c^{\prime }}{c}$$, at most $$\left \lceil \frac{f(x_{0})-f^{\ast }}{c^{\prime }} \cdot \frac{1}{\varepsilon } \right \rceil $$ iterations are required. Proof. If Algorithm 1 executes K − 1 iterations without terminating, then ∥grad f(xk)∥ > ε for all k in 0, …, K − 1. Then using Assumptions 2.3 and 2.4 in a classic telescoping sum argument gives \begin{align*} f(x_{0}) - f^{\ast} \geq f(x_{0}) - f(x_{K}) & = \sum_{k=0}^{K-1} f(x_{k}) - f(x_{k+1})> K \min(c\varepsilon, c^{\prime}) \varepsilon. \end{align*} By contradiction, the algorithm must have terminated if $$K \geq \frac{f(x_{0})-f^{\ast }}{\min (c\varepsilon , c^{\prime })\varepsilon }$$. To ensure Assumption 2.4 with simple rules for the choice of ηk it is necessary to restrict the class of functions f. For the particular case $$\mathcal{M} = {\mathbb{R}^{n}}$$ and Retrx(η) = x + η, the classical assumption is to require f to have a Lipschitz continuous gradient (Nesterov, 2004), that is, existence of Lg such that \begin{align} \forall\ x, y \in{\mathbb{R}^{n}}, \quad \| \textrm{grad} \ f(x) - \textrm{grad}\ f(y) \| \leq L_{g} \|x - y\|. \end{align} (2.1) As we argue momentarily, generalizing this property to manifolds is impractical. On the other hand, it is well known that (2.1) implies (see for example Nesterov, 2004, Lemma 1.2.3; see also Berger, 2017, Appendix A for a converse): \begin{align} \forall\ x, y \in{\mathbb{R}^{n}}, \quad \left|\, f(y) - \left[\, f(x) + \left\langle{y-x},{\textrm{grad}\ f(x)}\right\rangle \right] \right| \leq \frac{L_{g}}{2} \|y-x\|^{2}. \end{align} (2.2) It is the latter we adapt to manifolds. Consider the pullback2$$\hat f_{x} = f \circ \textrm{Retr}_{x} \colon \mathrm{T}_{x}\mathcal{M} \to{\mathbb{R}}$$, conveniently defined on a vector space. It follows from the definition of retraction that $$\textrm{grad}\, \hat f_{x}(0_{x}) = \textrm{grad}\ f(x)$$.3 Thinking of x as xk and of y as $$\textrm{Retr}_{{x_{k}}}(\eta )$$, we require the following. Assumption 2.6 (Restricted Lipschitz-type gradient for pullbacks). There exists Lg ≥ 0 such that, for all xk among x0, x1, … generated by a specified algorithm, the composition $$\hat f_{k} = f \circ \textrm{Retr}_{x_{k}}$$ satisfies \begin{align} \left| \, \hat f_{k}(\eta) - \left[f(x_{k}) + \left\langle{\eta},{\textrm{grad}\ f(x_{k})}\right\rangle\right] \right| \leq \frac{L_{g}}{2} \|\eta\|^{2} \end{align} (2.3) for all $$\eta \in \mathrm{T}_{x_{k}}\mathcal{M}$$ such that ∥η∥≤ ϱk.4 In words, the pullbacks $$\hat f_{k}$$, possibly restricted to certain balls, are uniformly well approximated by their first-order Taylor expansions around the origin. To the best of our knowledge, this specific assumption has not been used to analyse convergence of optimization algorithms on manifolds before. As will become clear, it allows for simple extensions of existing proofs in $${\mathbb{R}^{n}}$$. Notice that if each $$\hat f_{k}$$ has a Lipschitz continuous gradient with constant Lg independent of k,5 then Assumption 2.6 holds but the reverse is not necessarily true as Assumption 2.6 gives a special role to the origin.
In this sense the condition on $$\hat f_{k}$$ is weaker than Lipschitz continuity of the gradient of $$\hat f_{k}$$. On the other hand, we are requiring this condition to hold for all xk with the same constant Lg. This is why we call the condition Lipschitz type rather than Lipschitz. The following lemma states that if $$\mathcal{M}$$ is a compact submanifold of $${\mathbb{R}^{n}}$$ then a sufficient condition for Assumption 2.6 to hold is for $$f\colon{\mathbb{R}^{n}}\to{\mathbb{R}}$$ to have locally Lipschitz continuous gradient (so that it has Lipschitz continuous gradient on any compact subset of $${\mathbb{R}^{n}}$$). The proof is in Appendix B. Lemma 2.7 Let $$\mathcal{E}$$ be a Euclidean space (for example, $$\mathcal{E} = {\mathbb{R}^{n}}$$) and let $$\mathcal{M}$$ be a compact Riemannian submanifold of $$\mathcal{E}$$. Let Retr be a retraction on $$\mathcal{M}$$ (globally6 defined). If $$f \colon \mathcal{E} \to{\mathbb{R}}$$ has Lipschitz continuous gradient in the convex hull of $$\mathcal{M}$$, then the pullbacks f ∘ Retrx satisfy (2.3) globally with some constant Lg independent of x; hence, Assumption 2.6 holds for any sequence of iterates and with $$\varrho _{k} = \infty $$ for all k. There are mainly two difficulties with generalizing (2.1) directly to manifolds. First, grad f(x) and grad f(y) live in two different tangent spaces, so that their difference is not defined; instead, grad f(x) must be transported to $$\mathrm{T}_{y}\mathcal{M}$$, which requires the introduction of a parallel transport $$\mathrm{P}_{x \rightarrow y} \colon \mathrm{T}_{x}\mathcal{M} \to \mathrm{T}_{y}\mathcal{M}$$ along a minimal geodesic connecting x and y. Second, the right-hand side ∥x − y∥ should become dist(x, y): the geodesic distance on $$\mathcal{M}$$. Both notions involve subtle definitions and transports may not be defined on all of $$\mathcal{M}$$. Overall, the resulting assumption would read that there exists Lg such that \begin{align} \forall\ x, y \in \mathcal{M}, \quad \| \mathrm{P}_{x\to y} \textrm{grad} \ f(x) - \textrm{grad} \ f(y) \| \leq L_{g} \textrm{dist}(\,x, y). \end{align} (2.4) It is of course possible to work with (2.4)—see for example Absil et al. (2008, Def. 7.4.3) and recent work of Zhang & Sra (2016) and Zhang et al. (2016)—but we argue that it is conceptually and computationally advantageous to avoid it if possible. The computational advantage comes from the freedom in Assumption 2.6 to work with any retraction, whereas parallel transport and geodesic distance are tied to the exponential map. We note that, if the retraction is the exponential map, then it is known that Assumption 2.6 holds if (2.4) holds—see for example Bento et al. (2017, Def. 2.2 and Lemma 2.1). 2.1. Fixed step-size gradient descent method Leveraging the regularity Assumption 2.6, an easy strategy is to pick the step ηk as a fixed scaling of the negative gradient, possibly restricted to a ball of radius ϱk. Theorem 2.8 (Riemannian gradient descent with fixed step size). Under Assumptions 2.3 and 2.6, Algorithm 1 with the explicit strategy \begin{align*} \eta_{k} = -\min\left( \frac{1}{L_{g}}, \frac{\varrho_{k}}{\|\textrm{grad}\ f(x_{k})\|} \right) \textrm{grad}\ f(x_{k}) \end{align*} returns a point $$x\in \mathcal{M}$$ satisfying f(x) ≤ f(x0) and ∥grad f(x)∥≤ ε in at most \begin{align*} \left\lceil 2\left(f(x_{0}) - f^{\ast}\right)L_{g} \cdot \frac{1}{\varepsilon^{2}} \right\rceil \end{align*} iterations provided ε ≤ ϱLg, where $$\varrho = \inf _{k} \varrho _{k}$$.
If ε > ϱLg the algorithm succeeds in at most $$\big \lceil 2\left (f(x_{0}) - f^{\ast }\right ) \frac{1}{\varrho } \cdot \frac{1}{\varepsilon } \big \rceil $$ iterations. Each iteration requires one cost and gradient evaluation and one retraction. Proof. The regularity Assumption 2.6 provides an upper bound for the pullback for all k: \begin{align} \forall\ \eta \in \mathrm{T}_{x_{k}}\mathcal{M}\ \textrm{with}\ \|\eta\| \leq \varrho_{k}, \quad f(\textrm{Retr}_{x_{k}}(\eta)) \leq f(x_{k}) + \left\langle{\eta},{\textrm{grad}\ f(x_{k})}\right\rangle + \frac{L_{g}}{2} \|\eta\|^{2}. \end{align} (2.5) For the given choice of ηk and using $$x_{k+1} = \textrm{Retr}_{x_{k}}(\eta _{k})$$, it follows easily that \begin{align*} f(x_{k}) - f(x_{k+1}) \geq \min\left( \frac{\|\textrm{grad}\ f(x_{k})\|}{L_{g}}, \varrho_{k} \right) \left[ 1 - \frac{L_{g}}{2} \min\left( \frac{1}{L_{g}}, \frac{\varrho_{k}}{\|\textrm{grad}\ f(x_{k})\|} \right) \right] \|\textrm{grad}\ f(x_{k})\|. \end{align*} The term in brackets is at least 1/2. Thus, Assumption 2.4 holds with $$c = \frac{1}{2L_{g}}$$ and $$c^{\prime } = \frac{\varrho }{2}$$, allowing us to conclude with Theorem 2.5. Corollary 2.9 If there are no step-size restrictions in Theorem 2.8 ($$\varrho _{k} \equiv \infty $$), the explicit strategy \begin{align*} \eta_{k} = -\frac{1}{L_{g}} \textrm{grad}\ f(x_{k}) \end{align*} returns a point $$x\in \mathcal{M}$$ satisfying f(x) ≤ f(x0) and ∥grad f(x)∥≤ ε in at most \begin{align*} \left\lceil 2\left(f(x_{0}) - f^{\ast}\right)L_{g} \cdot \frac{1}{\varepsilon^{2}} \right\rceil \end{align*} iterations for any ε > 0. 2.2. Gradient descent with backtracking Armijo line-search The following lemma shows that a basic Armijo-type backtracking line-search, Algorithm 2, computes a step ηk satisfying Assumption 2.4 in a bounded number of function calls, without the need to know Lg. The statement allows search directions other than −grad f(xk), provided they remain ‘related’ to −grad f(xk). This result is well known in the Euclidean case and carries over seamlessly under Assumption 2.6. Algorithm 2 Backtracking Armijo line-search Lemma 2.10 For each iteration k of Algorithm 1, let $${\eta _{k}^{0}} \in \mathrm{T}_{x_{k}}\mathcal{M}$$ be the initial search direction to be considered for line-search. Assume there exist constants c2 ∈ (0, 1] and 0 < c3 ≤ c4 such that, for all k, \begin{align*} \left\langle{-\textrm{grad}\ f(x_{k})},{{\eta_{k}^{0}}}\right\rangle & \geq c_{2} \|\textrm{grad}\ f(x_{k})\|\left\|{\eta_{k}^{0}}\right\| &\ \textrm{and}\ \qquad c_{3} \|\textrm{grad}\ f(x_{k})\| & \leq \left\|{\eta_{k}^{0}}\right\| \leq c_{4} \|\textrm{grad}\ f(x_{k})\|. \end{align*} Under Assumption 2.6, backtracking Armijo (Algorithm 2) with initial step size $${\bar t}_{k}$$ such that $${\bar t}_{k} \left \|{\eta _{k}^{0}}\right \| \leq \varrho _{k}$$ returns a positive t and $$\eta _{k} = t{\eta _{k}^{0}}$$ such that \begin{align} f(x_{k}) - f\left(\textrm{Retr}_{x_{k}}(\eta_{k})\right) & \geq c_{1}c_{2}c_{3}t\|\textrm{grad}\ f(x_{k})\|^{2} & \textrm{and} \qquad t & \geq \min\left({\bar t}_{k}, \frac{2\tau c_{2} (1-c_{1})}{c_{4}L_{g}}\right) \end{align} (2.6) in \begin{align*} 1+\log_{\tau}\left(t/{\bar t}_{k}\right) \leq \max\left( 1, 2 + \left\lceil \log_{\tau^{-1}}\left(\frac{c_{4}{\bar t}_{k}L_{g}}{2c_{2}(1-c_{1})}\right) \right\rceil \right) \end{align*} retractions and cost evaluations (not counting evaluation of f at xk). Proof. See Appendix C.
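Algorithm 2 itself appears only as a float; a standard Riemannian Armijo backtracking routine consistent with the statement of Lemma 2.10 could be sketched as follows (an illustration under the usual Armijo decrease test, not the paper's implementation; parameter names mirror c1, τ and t̄k). With the choice ηk0 = −grad f(xk), this is the setting of Theorem 2.11 below, where c2 = c3 = c4 = 1.

import numpy as np

def armijo_backtracking(f, retr, x, fx, grad, eta0, tbar, c1=1e-4, tau=0.5, max_trials=50):
    """Backtracking line-search in the spirit of Algorithm 2 (sketch only).
    Starting from t = tbar, shrink t <- tau * t until the Armijo-type test
        f(x) - f(Retr_x(t * eta0)) >= c1 * t * <-grad f(x), eta0>
    holds; this yields the sufficient decrease (2.6) of Lemma 2.10. Under
    Assumption 2.6 the loop stops after a bounded number of trials, so
    max_trials is only a safeguard for this illustration."""
    target = c1 * np.dot(-grad, eta0)   # inner product of the embedded (Euclidean) metric
    t = tbar
    for _ in range(max_trials):
        candidate = retr(x, t * eta0)
        if fx - f(candidate) >= t * target:
            return t, candidate
        t *= tau
    return t, retr(x, t * eta0)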
The previous discussion can be particularized to bound the amount of work required by a gradient descent method using a backtracking Armijo line-search on manifolds. The constant Lg appears in the bounds but need not be known. Note that, at iteration k, the last cost evaluation of the line-search algorithm is the cost at xk+1: it need not be recomputed. Theorem 2.11 (Riemannian gradient descent with backtracking line-search). Under Assumptions 2.3 and 2.6, Algorithm 1 with Algorithm 2 for line-search using initial search direction $${\eta _{k}^{0}} = -\textrm{grad}\ f(x_{k})$$ with parameters c1, τ and $${\bar t}_{k} \triangleq \min \left ({\bar t}, \varrho _{k} / \|\textrm{grad}\ f(x_{k})\| \right )$$ for some $$\bar t> 0$$ returns a point $$x\in \mathcal{M}$$ satisfying f(x) ≤ f(x0) and ∥grad f(x)∥≤ ε in at most \begin{align*} \left\lceil \frac{f(x_{0}) - f^{\ast}}{c_{1} \min\left(\bar t, \frac{2\tau(1-c_{1})}{L_{g}}\right)} \cdot \frac{1}{\varepsilon^{2}} \right\rceil \end{align*} iterations, provided $$\varepsilon \leq \frac{\varrho }{\min \left (\bar t, \frac{2\tau (1-c_{1})}{L_{g}} \right )} \triangleq c$$, where $$\varrho = \inf _{k} \varrho _{k}$$. If ε > c, the algorithm succeeds in at most $$\big \lceil \frac{f(x_{0}) - f^{\ast }}{c_{1} \varrho } \cdot \frac{1}{\varepsilon } \big \rceil $$ iterations. After computing f(x0) and grad f(x0), each iteration requires one gradient evaluation and at most $$\max \big ( 1, 2 + \big \lceil \log _{\tau ^{-1}}\big (\frac{{\bar t}L_{g}}{2(1-c_{1})}\big ) \big \rceil \big )$$ cost evaluations and retractions. Proof. Using $${\eta _{k}^{0}} = -\textrm{grad}\ f(x_{k})$$, one can take c2 = c3 = c4 = 1 in Lemma 2.10. Equation (2.6) in that lemma combined with the definition of $${\bar t}_{k}$$ ensures \begin{align*} f(x_{k}) - f(x_{k+1}) \geq c_{1} \min\left({\bar t}, \frac{2\tau(1-c_{1})}{L_{g}}, \frac{\varrho_{k}}{\|\textrm{grad}\ f(x_{k})\|} \right) \| \textrm{grad}\ f(x_{k}) \|^{2}. \end{align*} Thus, Assumption 2.4 holds with $$c = c_{1} \min \left ({\bar t}, \frac{2\tau (1-c_{1})}{L_{g}} \right )$$ and c′ = c1ϱ. Conclude with Theorem 2.5. 3. RTR methods The RTR method is a generalization of the classical trust-region method to manifolds (Absil et al., 2007; Conn et al., 2000)—see Algorithm 3. The algorithm is initialized with a point $$x_{0}\in \mathcal{M}$$ and a trust-region radius Δ0. At iteration k, the pullback $$\hat f_{k} = f \circ \textrm{Retr}_{x_{k}}$$ is approximated by a model $$\hat m_{k} \colon \mathrm{T}_{x_{k}}\mathcal{M} \to{\mathbb{R}}$$, \begin{align} \hat m_{k}(\eta) = f(x_{k}) + \left\langle{\eta},{\textrm{grad}\ f(x_{k})}\right\rangle + \tfrac{1}{2} \left\langle{\eta},{H_{k}[\eta]}\right\rangle, \end{align} (3.1) where $$H_{k} \colon \mathrm{T}_{x_{k}}\mathcal{M} \to \mathrm{T}_{x_{k}}\mathcal{M}$$ is a map chosen by the user. The tentative step ηk is obtained by approximately solving the associated trust-region subproblem: \begin{align} \min_{\eta \in \mathrm{T}_{x_{k}}\mathcal{M}} \ \hat m_{k}(\eta) \quad \textrm{subject to} \quad \|\eta\| \leq \Delta_{k}. \end{align} (3.2) The candidate next iterate $$x_{k}^{+} = \textrm{Retr}_{x_{k}}(\eta _{k})$$ is accepted ($$x_{k+1} = x_{k}^{+}$$) if the actual cost decrease $$f(x_{k}) - f\left (x_{k}^{+}\right )$$ is a sufficiently large fraction of the model decrease $$\hat m_{k}\left (0_{x_{k}}\right ) - \hat m_{k}(\eta _{k})$$. Otherwise, the candidate is rejected (xk+1 = xk).
Depending on the level of agreement between the model decrease and actual decrease, the trust-region radius Δk can be reduced, kept unchanged or increased, but never above some parameter $$\bar \Delta $$. The parameter $$\bar \Delta $$ is useful in particular when the retraction is not globally defined or when the regularity conditions on the pullbacks hold only locally. We establish worst-case iteration complexity bounds for the computation of points $$x\in \mathcal{M}$$ such that ∥grad f(x)∥≤ εg and Hess f(x) ≽−εH Id, where Hess f(x) is the Riemannian Hessian of f at x. Besides Lipschitz-type conditions on the problem itself, essential algorithmic requirements are that (i) the models $$\hat m_{k}$$ should agree sufficiently with the pullbacks $$\hat f_{k}$$ (locally) and (ii) sufficient decrease in the model should be achieved at each iteration. The analysis presented here is a generalization of the one in Cartis et al. (2012) to manifolds. 3.1. Regularity conditions In what follows, for iteration k, we make assumptions involving the ball of radius $$\Delta _{k} \leq \bar{\Delta }$$ around $$0_{x_{k}}$$ in the tangent space at xk. If Retrx is defined only in a ball of radius ϱ(x), one (conservative) strategy to ensure ϱk ≥Δk as required in the assumption below is to set $$\bar \Delta \leq \inf _{x\in \mathcal{M} : f(x) \leq f(x_{0})} \varrho (x)$$, provided this is positive (see Remark 2.2). Assumption 3.1 (Restricted Lipschitz-type gradient for pullbacks). Assumption 2.6 holds in the respective trust regions of the iterates produced by Algorithm 3, that is, with ϱk ≥Δk. Algorithm 3 RTR, modified to attain second-order optimality Assumption 3.2 (Restricted Lipschitz-type Hessian for pullbacks). If $$\varepsilon _{H} < \infty $$ there exists LH ≥ 0 such that, for all xk among x0, x1, … generated by Algorithm 3 and such that ∥grad f(xk)∥≤ εg, $$\hat f_{k}$$ satisfies \begin{align} \left| \hat f_{k}(\eta) - \left[ f(x_{k}) + \left\langle{\eta},{\textrm{grad}\ f(x_{k})}\right\rangle + \frac{1}{2}\left\langle{\eta},{\nabla^{2} \hat f_{k}(0_{x_{k}})[\eta]}\right\rangle \right] \right| \leq \frac{L_{H}}{6} \|\eta\|^{3} \end{align} (3.4) for all $$\eta \in \mathrm{T}_{x_{k}}\mathcal{M}$$ such that ∥η∥≤Δk. As discussed in Section 3.5 below, if Retr is a second-order retraction, then $$\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$ coincides with the Riemannian Hessian of f at xk. In the previous section, Lemma 2.7 gives a sufficient condition for Assumption 3.1 to hold; we complement this statement with a sufficient condition for Assumption 3.2 to hold as well. In a nutshell, if $$\mathcal{M}$$ is a compact submanifold of $${\mathbb{R}^{n}}$$ and $$f\colon{\mathbb{R}^{n}}\to{\mathbb{R}}$$ has locally Lipschitz continuous Hessian, then both assumptions hold. Lemma 3.3 Let $$\mathcal{E}$$ be a Euclidean space (for example, $$\mathcal{E} = {\mathbb{R}^{n}}$$) and let $$\mathcal{M}$$ be a compact Riemannian submanifold of $$\mathcal{E}$$. Let Retr be a second-order retraction on $$\mathcal{M}$$ (globally defined). If $$f \colon \mathcal{E} \to{\mathbb{R}}$$ has Lipschitz continuous Hessian in the convex hull of $$\mathcal{M}$$, then the pullbacks f ∘ Retrx obey (3.4) with some constant LH independent of x; hence, Assumption 3.2 holds for any sequence of iterates and trust-region radii. The proof is in Appendix B.
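As a quick numerical sanity check, which is not part of the paper, one can evaluate the third-order error in (3.4) for the cost x ↦ xᵀAx on the unit sphere, using the normalization retraction (a second-order retraction; see Section 3.5) together with the standard projection formulas for the Riemannian gradient and Hessian of this cost; the ratios printed below stay bounded, consistent with a finite LH, though this is an illustration rather than a proof.

import numpy as np

rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n))
A = (A + A.T) / 2

f = lambda x: x @ A @ x

def proj(x, z):                 # orthogonal projection onto the tangent space of the sphere at x
    return z - (x @ z) * x

def retr(x, eta):               # normalization retraction (second order on the sphere)
    y = x + eta
    return y / np.linalg.norm(y)

def rgrad(x):                   # Riemannian gradient of x -> x^T A x
    return 2 * proj(x, A @ x)

def rhess(x, eta):              # Riemannian Hessian along eta (projection plus curvature correction)
    return 2 * (proj(x, A @ eta) - (x @ A @ x) * eta)

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
fx, g = f(x), rgrad(x)
for _ in range(5):
    eta = 1e-2 * proj(x, rng.standard_normal(n))          # small random tangent vector
    model = fx + g @ eta + 0.5 * (eta @ rhess(x, eta))    # second-order Taylor model of the pullback
    print(abs(f(retr(x, eta)) - model) / np.linalg.norm(eta) ** 3)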
Here too, if $$\mathcal{M}$$ is a Euclidean space and Retrx(η) = x + η, then Assumptions 3.1 and 3.2 are satisfied if f has Lipschitz continuous Hessian in the usual sense. 3.2. Assumptions about the models The model at iteration k is the function $$\hat m_{k}$$ (3.1) whose purpose is to approximate the pullback $$\hat f_{k} = f \circ \textrm{Retr}_{x_{k}}$$. It involves a map $$H_{k} \colon \mathrm{T}_{x_{k}}\mathcal{M} \to \mathrm{T}_{x_{k}}\mathcal{M}$$. Depending on the type of step being performed (aiming for first- or second-order optimality conditions), we require different properties of the maps Hk. Conditions for first-order optimality are particularly lax. Assumption 3.4 If ∥grad f(xk)∥ > εg (so that we are aiming only for a first-order condition at this step) then Hk is radially linear. That is, \begin{align} \forall\ \eta\in\mathrm{T}_{x_{k}}\mathcal{M}, \forall\ \alpha \geq 0, \quad H_{k}[\alpha \eta] = \alpha H_{k}[\eta]. \end{align} (3.5) Furthermore, there exists c0 ≥ 0 (the same for all first-order steps) such that \begin{align} \|H_{k}\| \triangleq \sup_{\eta \in \mathrm{T}_{x_{k}}\mathcal{M} : \|\eta\| \leq 1} \left\langle{\eta},{H_{k}[\eta]}\right\rangle \leq c_{0}. \end{align} (3.6) Radial linearity and boundedness are sufficient to ensure first-order agreement between $$\hat m_{k}$$ and $$\hat f_{k}$$. This relaxation from complete linearity of Hk, which would be the standard assumption, notably allows the use of nonlinear finite difference approximations of the Hessian (Boumal, 2015a). To reach second-order agreement, the conditions are stronger. Assumption 3.5 If ∥grad f(xk)∥≤ εg and $$\varepsilon _{H} < \infty $$ (so that we are aiming for a second-order condition), then Hk is linear and symmetric. Furthermore, Hk is close to $$\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$ along ηk in the sense that there exists c1 ≥ 0 (the same for all second-order steps) such that \begin{align} \left| \left\langle{\eta_{k}},{\left(\nabla^{2} \hat f_{k}(0_{x_{k}}) - H_{k}\right)[\eta_{k}]}\right\rangle \right| \leq \frac{c_{1} \Delta_{k}}{3} \|\eta_{k}\|^{2}. \end{align} (3.7) The smaller Δk, the more precisely Hk must approximate the Hessian of the pullback along ηk. Lemma 3.6 (below) shows Δk is lower bounded in relation with εg and εH. Equation (3.7) involves ηk, the ultimately chosen step that typically depends on Hk. The stronger condition below does not reference ηk and yet still ensures (3.7) is satisfied: \begin{align*} \left\|\nabla^{2} \hat f_{k}\left(0_{x_{k}}\right) - H_{k}\right\| \leq \frac{c_{1} \Delta_{k}}{3}. \end{align*} Refer to Section 3.5 to relate Hk, $$\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$ and Hess f(xk). 3.3. Assumptions about sufficient model decrease The steps ηk can be obtained in a number of ways, leading to different local convergence rates and empirical performance. As far as global convergence guarantees are concerned though, the requirements are modest. It is required only that, at each iteration, the candidate ηk induces sufficient decrease in the model. Known explicit strategies achieve these decreases. In particular, solving the trust-region subproblem (3.2) within some tolerance (which can be done in polynomial time if Hk is linear; see Vavasis, 1991, Sect. 4.3) is certain to satisfy the assumptions. The Steihaug–Toint truncated conjugate gradients method is a popular choice (Toint, 1981; Steihaug, 1983; Conn et al., 2000; Absil et al., 2007). 
See also Sorensen (1982) and Moré & Sorensen (1983) for more about the trust-region subproblem. Here we describe simpler yet satisfactory strategies. For first-order steps we require the following. Assumption 3.6 There exists c2 > 0 such that, for all k such that ∥grad f(xk)∥ > εg, the step ηk satisfies \begin{align} \hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k}) & \geq c_{2} \min\left( \Delta_{k}, \frac{\varepsilon_{g}}{c_{0}} \right) \varepsilon_{g}. \end{align} (3.8) As is well known, the explicitly computable Cauchy step satisfies this requirement. For convenience let gk = grad f(xk). By definition the Cauchy step minimizes $$\hat m_{k}$$ (3.1) in the trust region along the steepest descent direction –gk. Owing to radial linearity (Assumption 3.4), this reads \begin{align*} \min_{\alpha \geq 0} \ \hat m_{k}(-\alpha g_{k}) &= f(x_{k}) - \alpha \|g_{k}\|^{2} + \frac{\alpha^{2}}{2} \left\langle{g_{k}},{H_{k}[g_{k}]}\right\rangle \\ \textrm{such that} \ \alpha \|g_{k}\| &\leq \Delta_{k}.\nonumber \end{align*} This corresponds to minimizing a quadratic in α over the interval [0, Δk/∥gk∥]. The optimal value is easily seen to be (Conn et al., 2000) \begin{align*} {\alpha_{k}^{\textrm{C}}} & = \begin{cases} \min\left( \frac{\|g_{k}\|^{2}}{\left\langle{g_{k}},{H_{k}[g_{k}]}\right\rangle}, \frac{\Delta_{k}}{\|g_{k}\|} \right) & \textrm{if}\ \left\langle{g_{k}},{H_{k}[g_{k}]}\right\rangle> 0, \\ \frac{\Delta_{k}}{\|g_{k}\|} & \textrm{otherwise.} \end{cases} \end{align*} Lemma 3.7 Let gk = grad f(xk). Under Assumption 3.4, setting ηk to be the Cauchy step $${\eta _{k}^{\textrm{C}}} = -{\alpha _{k}^{\textrm{C}}} g_{k}$$ for first-order steps fulfills Assumption 3.6 with c2 = 1/2. Computing $${\eta _{k}^{\textrm{C}}}$$ involves one gradient evaluation and one application of Hk. Proof. The claim follows as an exercise from $$\hat m_{k}\left (0_{x_{k}}\right ) - \hat m_{k}\left ({\eta _{k}^{\textrm{C}}}\right ) = {\alpha _{k}^{\textrm{C}}} \|g_{k}\|^{2} - \frac{\left ({\alpha _{k}^{\textrm{C}}}\right )^{2}}{2}\left \langle{g_{k}},{H_{k}[g_{k}]}\right \rangle $$ and $$\left \langle{g_{k}},{H_{k}[g_{k}]}\right \rangle \leq c_{0} \|g_{k}\|^{2}$$ owing to Assumption 3.4. The Steihaug–Toint truncated conjugate gradient method (Toint, 1981; Steihaug, 1983) is a monotonically improving iterative method for the trust-region subproblem whose first iterate is the Cauchy step; as such, it necessarily achieves the required model decrease. For second-order steps the requirement is as follows. Assumption 3.8 There exists c3 > 0 such that, for all k such that ∥grad f(xk)∥≤ εg and $$\lambda _{\min }(H_{k}) < -\varepsilon _{H}$$, the step ηk satisfies \begin{align} \hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k}) & \geq c_{3} {\Delta_{k}^{2}} \varepsilon_{H}. \end{align} (3.9) This can be achieved by making a step of maximal length along a direction that certifies that $$\lambda _{\min }(H_{k}) < -\varepsilon _{H}$$ (Conn et al., 2000): this is called an eigenstep. Like Cauchy steps, eigensteps can be computed in a finite number of operations, independently of εg and εH. Lemma 3.3 Under Assumption 3.5, if $$\lambda _{\min }(H_{k}) < -\varepsilon _{H}$$, there exists a tangent vector $$u_{k}\in \mathrm{T}_{x_{k}}\mathcal{M}$$ with \begin{equation*} \|u_{k}\| = 1,\qquad \left\langle{u_{k}},{\textrm{grad}\ f(x_{k})}\right\rangle \leq 0 \qquad \textrm{and}\qquad \left\langle{u_{k}},{H_{k}[u_{k}]}\right\rangle < -\varepsilon_{H}. 
\end{equation*} Setting ηk to be any eigenstep $${\eta _{k}^{E}} = \Delta _{k} u_{k}$$ for second-order steps fulfills Assumption 3.8 with c3 = 1/2. Let v1, … ,vn be an orthonormal basis of $$\mathrm{T}_{x_{k}}\mathcal{M}$$, where $$n = \dim \mathcal{M}$$. One way of computing $${\eta _{k}^{E}}$$ involves the application of Hk to v1, … ,vn plus $$\mathcal{O}\big(n^{3}\big )$$ arithmetic operations. The amount of work is independent of εg and εH. Proof. Compute H, a symmetric matrix of size n that represents Hk in the basis v1, … ,vn, as $$H_{ij} = \left \langle{v_{i}},{H_{k}\left [v_{j}\right ]}\right \rangle $$. Compute a factorization $$LDL^{\textrm{T}} = H + \varepsilon _{H} I$$ where I is the identity matrix, L is invertible and triangular and D is block diagonal with blocks of size 1 × 1 and 2 × 2. The factorization can be computed in $$\mathcal{O}\big(n^{3}\big)$$ operations (Golub & Van Loan, 2012, Sect. 4.4)—see the reference for a word of caution regarding pivoting for stability; pivoting is easily incorporated in the present argument. The matrix D has the same inertia as H + εHI, hence D is not positive semidefinite (otherwise we would have $$H \succeq -\varepsilon _{H} I$$). The structure of D makes it easy to find $$x \in{\mathbb{R}^{n}}$$ with $$x^{\textrm{T}} D x < 0$$. Solve the triangular system $$L^{\textrm{T}} y = x$$ for $$y\in{\mathbb{R}^{n}}$$. Now $$0 > x^{\textrm{T}} D x = y^{\textrm{T}} L D L^{\textrm{T}} y = y^{\textrm{T}} (H + \varepsilon _{H} I) y$$. Consequently, $$y^{\textrm{T}} H y < -\varepsilon _{H} \|y\|^{2}$$. We can set $$u_{k} = \pm \sum _{i=1}^{n} y_{i}v_{i} / \|y\|$$, where the sign is chosen to ensure $$\left \langle{u_{k}},{\textrm{grad}\ f(x_{k})}\right \rangle \leq 0$$. To conclude check that $$\hat m_{k}\left (0_{x_{k}}\right ) - \hat m_{k}\left ({\eta _{k}^{E}}\right ) = -\left \langle{{\eta _{k}^{E}}},{\textrm{grad}\ f(x_{k})}\right \rangle - \frac{1}{2}\left \langle{{\eta _{k}^{E}}},{H_{k}\left [{\eta _{k}^{E}}\right ]}\right \rangle \geq \frac{1}{2} {\Delta _{k}^{2}} \varepsilon _{H}$$. Notice from the proof that this strategy either certifies that $$\lambda _{\min }(H_{k}) \geq -\varepsilon _{H}$$ (which must be checked at step 8 in Algorithm 3) or certifies otherwise by providing an escape direction. We further note that, in practice, one usually prefers to use iterative methods to compute an approximate leftmost eigenvector of Hk without representing it as a matrix. 3.4. Main results and proofs for RTR Under the discussed assumptions, we now establish our main theorem about computation of approximate first- and second-order critical points for (P) using RTR in a bounded number of iterations. The following constants will be useful: \begin{equation} \lambda_{g} = \frac{1}{4}\min\left( \frac{1}{c_{0}}, \frac{c_{2}}{L_{g} + c_{0}} \right) \qquad \textrm{and}\qquad \lambda_{H} = \frac{3}{4} \frac{c_{3}}{L_{H} + c_{1}}. \end{equation} (3.10) Theorem 3.9 Under Assumptions 2.3, 3.1, 3.4, 3.6 and assuming $$\varepsilon _{g} \leq \frac{\Delta _{0}}{\lambda _{g}}$$,7 Algorithm 3 produces an iterate $$x_{N_{1}}$$ satisfying $$\|\textrm{grad}\ f(x_{N_{1}})\| \leq \varepsilon _{g}$$ with \begin{align} N_{1} \leq \frac{3}{2} \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{2} \lambda_{g}} \frac{1}{{\varepsilon_{g}^{2}}} + \frac{1}{2} \log_{2}\left( \frac{\Delta_{0}}{\lambda_{g} \varepsilon_{g}} \right) = \mathcal{O} \left( \frac{1}{{\varepsilon_{g}^{2}}} \right).
\end{align} (3.11) Furthermore, if $$\varepsilon _{H} < \infty $$ then under additional Assumptions 3.2, 3.5, and 3.8 and assuming $$\varepsilon _{g} \leq \frac{c_{2}}{c_{3}} \frac{\lambda _{H}}{{\lambda _{g}^{2}}}$$ and $$\varepsilon _{H} \leq \frac{c_{2}}{c_{3}} \frac{1}{\lambda _{g}}$$, Algorithm 3 also produces an iterate $$x_{N_{2}}$$ satisfying $$\|\textrm{grad}\ f(x_{N_{2}})\| \leq \varepsilon _{g}$$ and $$\lambda _{\min }(H_{N_{2}}) \geq -\varepsilon _{H}$$ with \begin{align} N_{1} \leq N_{2} \leq \frac{3}{2} \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{3} \lambda^{2}} \frac{1}{\varepsilon^{2} \varepsilon_{H}} + \frac{1}{2}\log_{2}\left( \frac{\Delta_{0}}{\lambda \varepsilon} \right) = \mathcal{O} \left( \frac{1}{\varepsilon^{2} \varepsilon_{H}} \right), \end{align} (3.12) where we defined (λ, ε) = (λg, εg) if λgεg ≤ λHεH and (λ, ε) = (λH, εH) otherwise. Since the algorithm is a descent method, $$f(x_{N_{2}}) \leq f(x_{N_{1}}) \leq f(x_{0})$$. Remark 3.10 Theorem 3.4 makes a statement about $$\lambda _{\min }(H_{k})$$ at termination, not about $$\lambda _{\min }(\textrm{Hess}\ f(x_{k}))$$. See Section 3.5 to connect these two quantities. To establish Theorem 3.4 we work through a few lemmas, following the proof technique in Cartis et al. (2012). We first show Δk is bounded below in proportion to the tolerances εg and εH. This is used to show that the number of successful iterations in Algorithm 3 before termination (that is, iterations where ρk > ρ′; see (3.3)) is bounded above. It is then shown that the total number of iterations is at most a constant multiple of the number of successful iterations, which implies termination in bounded time. We start by showing that the trust-region radius is bounded away from zero. Essentially, this is because if Δk becomes too small, then the Cauchy step and eigenstep are certain to be successful owing to the quality of the model in such a small region, so that the trust-region radius could not decrease any further. Lemma 3.11 Under the assumptions of Theorem 3.4, if Algorithm 3 executes N iterations without terminating then \begin{align} \Delta_{k} \geq \min\left( \Delta_{0}, \lambda_{g} \varepsilon_{g}, \lambda_{H} \varepsilon_{H} \right) \end{align} (3.13) for k = 0, … , N, where λg and λH are defined in (3.10). Proof. This follows essentially the proof of Absil et al. (2008, Thm. 7.4.2), which itself follows classical proofs (Conn et al., 2000). The core idea is to control ρk (see (3.3)) close to 1, to show that there cannot be arbitrarily many trust-region radius reductions. The proof is in two parts. For the first part, assume ∥grad f(xk)∥ > εg. Then consider the gap \begin{align} |\rho_{k} - 1| & = \left| \frac{\hat f_{k}(0_{x_{k}}) - \hat f_{k}(\eta_{k})}{\hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k})} - 1 \right| = \left| \frac{\hat m_{k}(\eta_{k}) - \hat f_{k}(\eta_{k})}{\hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k})} \right|. \end{align} (3.14) From Assumption 3.6, we know the denominator is not too small: \begin{align*} \hat m_{k}\left(0_{x_{k}}\right) - \hat m_{k}(\eta_{k}) & \geq c_{2} \min\left( \Delta_{k}, \frac{\varepsilon_{g}}{c_{0}} \right) \varepsilon_{g}. 
\end{align*} Now consider the numerator: \begin{align*} |\hat m_{k}(\eta_{k}) - \hat f_{k}(\eta_{k})| & = \left| f(x_{k}) + \left\langle{\textrm{grad}\,f(x_{k})},{\eta_{k}}\right\rangle + \tfrac{1}{2}\left\langle{\eta_{k}},{H_{k}[\eta_{k}]}\right\rangle - \hat f_{k}(\eta_{k}) \right| \\ & \leq \left| f(x_{k}) + \left\langle{\textrm{grad}\,f(x_{k})},{\eta_{k}}\right\rangle - \hat f_{k}(\eta_{k}) \right| + \tfrac{1}{2}\left| \left\langle{\eta_{k}},{H_{k}[\eta_{k}]}\right\rangle \right| \\ & \leq \tfrac{1}{2}\left( L_{g} + c_{0} \right)\|\eta_{k}\|^{2}, \end{align*} where we used Assumption 3.1 for the first term, and Assumption 3.4 for the second term. Assume for the time being that $$\Delta _{k} \leq \min \left ( \frac{\varepsilon _{g}}{c_{0}}, \frac{c_{2} \varepsilon _{g}}{L_{g} + c_{0}} \right ) = 4\lambda _{g} \varepsilon _{g}$$. Then, using ∥ηk∥≤Δk, it follows that \begin{align*} |\rho_{k} - 1| \leq \frac{1}{2} \frac{L_{g} + c_{0}}{c_{2} \min\left( \Delta_{k}, \frac{\varepsilon_{g}}{c_{0}} \right)\varepsilon_{g}} {\Delta_{k}^{2}} \leq \frac{1}{2}\frac{L_{g}+c_{0}}{c_{2} \varepsilon_{g}} \Delta_{k} \leq \frac{1}{2}. \end{align*} Hence, ρk ≥ 1/2, and by the mechanism of Algorithm 3, it follows that Δk+1 ≥Δk. For the second part, assume ∥grad f(xk)∥ < εg and $$\lambda _{\min }(H_{k}) < -\varepsilon _{H}$$. Then, by Assumption 3.8, \begin{align*} \hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k}) & \geq c_{3} {\Delta_{k}^{2}} \varepsilon_{H}. \end{align*} Thus, by Assumptions 3.2 and 3.5, \begin{align*} |\hat m_{k}(\eta_{k}) - \hat f_{k}(\eta_{k})| & = \left| f(x_{k}) + \left\langle{\textrm{grad}\,f(x_{k})},{\eta_{k}}\right\rangle + \tfrac{1}{2}\left\langle{\eta_{k}},{H_{k}[\eta_{k}]}\right\rangle - \hat f_{k}(\eta_{k}) \right| \\ & \leq \frac{L_{H}}{6} \|\eta_{k}\|^{3} + \tfrac{1}{2} \left| \left\langle{\eta_{k}},{\left(\nabla^{2} \hat f_{k}\left(0_{x_{k}}\right) - H_{k}\right)[\eta_{k}]}\right\rangle \right| \\ & \leq \frac{L_{H} + c_{1}}{6} {\Delta_{k}^{3}}. \end{align*} As previously, combine these observations into (3.14) to see that if $$\Delta _{k} \leq \frac{3c_{3}}{L_{H} + c_{1}}\varepsilon _{H} = 4\lambda _{H} \varepsilon _{H}$$ then \begin{align} |\rho_{k} - 1| & \leq \tfrac{1}{2} \frac{L_{H} + c_{1}}{3c_{3} \varepsilon_{H}} \Delta_{k} \leq \tfrac{1}{2}. \end{align} (3.15) Again, this implies Δk+1 ≥Δk. Now combine the two parts. We have established that, if $$\Delta _{k} \leq 4\min \left (\lambda _{g} \varepsilon _{g}, \lambda _{H}\varepsilon _{H}\right )$$, then Δk+1 ≥Δk. To conclude the proof, consider the fact that Algorithm 3 cannot reduce the radius by more than a factor of 4 in one step. By an argument similar to the one used for gradient methods, Lemma 3.6 implies an upper bound on the number of successful iterations required in Algorithm 3 to reach termination. Lemma 3.12 Under the assumptions of Theorem 3.4, if Algorithm 3 executes N iterations without terminating, define the set of successful steps as \begin{align*} S_{N} = \{ k \in \{0, \ldots, N\} : \rho_{k}> \rho^{\prime} \} \end{align*} and let UN designate the unsuccessful steps, so that SN and UN form a partition of {0, … ,N}. Assume εg ≤Δ0/λg. If $$\varepsilon _{H} = \infty $$, the number of successful steps obeys \begin{align} |S_{N}| \leq \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{2} \lambda_{g}} \frac{1}{{\varepsilon_{g}^{2}}}.
\end{align} (3.16) Otherwise, if additionally $$\varepsilon _{g} \leq \frac{c_{2}}{c_{3}} \frac{\lambda _{H}}{{\lambda _{g}^{2}}}$$ and $$\varepsilon _{H} \leq \frac{c_{2}}{c_{3}} \frac{1}{\lambda _{g}}$$, we have the bound \begin{align} |S_{N}| \leq \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{3} } \frac{1}{\min(\lambda_{g}\varepsilon_{g}, \lambda_{H}\varepsilon_{H})^{2} \varepsilon_{H}}. \end{align} (3.17) Proof. The proof parallels Cartis et al. (2012, Lemma 4.5). Clearly, if k ∈ UN, then f(xk) = f(xk+1). On the other hand, if k ∈ SN then ρk ≥ ρ′ (see (3.3)). Combine this with Assumptions 3.6 and 3.8 to see that, for k ∈ SN, \begin{align*} f(x_{k}) - f(x_{k+1}) & \geq \rho^{\prime} \left(\hat m_{k}(0_{x_{k}}) - \hat m_{k}(\eta_{k})\right) \\ & \geq \rho^{\prime} \min\left( c_{2} \min\left( \Delta_{k}, \frac{\varepsilon_{g}}{c_{0}} \right) \varepsilon_{g},\ c_{3} {\Delta_{k}^{2}} \varepsilon_{H} \right). \end{align*} By Lemma 3.6 and the assumption λgεg ≤Δ0, it holds that $$\Delta _{k} \geq \min \left ( \lambda _{g} \varepsilon _{g}, \lambda _{H} \varepsilon _{H} \right )$$. Furthermore, using λg ≤ 1/c0 shows that $$\min (\Delta _{k}, \varepsilon _{g} / c_{0}) \geq \min (\Delta _{k}, \lambda _{g} \varepsilon _{g}) \geq \min \left ( \lambda _{g} \varepsilon _{g}, \lambda _{H} \varepsilon _{H} \right )$$. Hence, \begin{align} f(x_{k}) - f(x_{k+1}) & \geq \rho^{\prime} \min\left( c_{2} \lambda_{g} {\varepsilon_{g}^{2}}, c_{2} \lambda_{H} \varepsilon_{g} \varepsilon_{H}, c_{3} {\lambda_{g}^{2}} {\varepsilon_{g}^{2}} \varepsilon_{H}, c_{3} {\lambda_{H}^{2}} {\varepsilon_{H}^{3}} \right). \end{align} (3.18) If $$\varepsilon _{H} = \infty $$ this simplifies to \begin{align*} f(x_{k}) - f(x_{k+1}) \geq \rho^{\prime} c_{2} \lambda_{g} {\varepsilon_{g}^{2}}. \end{align*} Sum over iterations up to N and use Assumption 2.3 (bounded f): \begin{align*} f(x_{0}) - f^{\ast} \geq f(x_{0}) - f(x_{N+1}) & = \sum_{k\in S_{N}} f(x_{k}) - f(x_{k+1}) \geq |S_{N}| \rho^{\prime} c_{2} \lambda_{g} {\varepsilon_{g}^{2}}. \end{align*} Hence, \begin{align*} |S_{N}| \leq \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{2} \lambda_{g}} \frac{1}{{\varepsilon_{g}^{2}}}. \end{align*} On the other hand, if $$\varepsilon _{H} < \infty $$ then, starting over from (3.18) and assuming both $$c_{3} {\lambda _{g}^{2}} {\varepsilon _{g}^{2}} \varepsilon _{H} \leq c_{2} \lambda _{H} \varepsilon _{g} \varepsilon _{H}$$ and $$c_{3} {\lambda _{g}^{2}} {\varepsilon _{g}^{2}} \varepsilon _{H} \leq c_{2}\lambda _{g}{\varepsilon _{g}^{2}}$$ (which is equivalent to $$\varepsilon _{g} \leq c_{2}\lambda _{H} / \big(c_{3}{\lambda _{g}^{2}}\big)$$ and $$\varepsilon _{H} \leq c_{2}/(c_{3}\lambda _{g})$$), the same telescoping sum gives \begin{align*} f(x_{0}) - f^{\ast} \geq |S_{N}| \rho^{\prime} c_{3} \min(\lambda_{g}\varepsilon_{g}, \lambda_{H}\varepsilon_{H})^{2} \varepsilon_{H}. \end{align*} Solve for |SN| to conclude. Finally, we show that the total number of steps N before termination cannot exceed a fixed multiple of the number of successful steps |SN|, up to an additive logarithmic term. Lemma 3.13 Under the assumptions of Theorem 3.4, if Algorithm 3 executes N iterations without terminating, using the notation SN and UN of Lemma 3.7, it holds that \begin{align} |S_{N}| & \geq \tfrac{2}{3}(N+1) - \tfrac{1}{3}\max\left( 0, \log_{2}\left( \frac{\Delta_{0}}{\lambda_{g} \varepsilon_{g}} \right), \log_{2}\left( \frac{\Delta_{0}}{\lambda_{H} \varepsilon_{H}} \right) \right). \end{align} (3.19) Proof. The proof rests on the lower bound for Δk obtained in Lemma 3.6. It parallels Cartis et al.
(2012, Lemma 4.6). For all k ∈ SN, it holds that Δk+1 ≤ 2Δk. For all k ∈ UN, it holds that $$\Delta _{k+1} \leq \frac{1}{4} \Delta _{k}$$. Hence, \begin{align*} \Delta_{N} \leq 2^{|S_{N}|} \left(\tfrac{1}{4}\right)^{|U_{N}|} \Delta_{0}. \end{align*} On the other hand, Lemma 3.6 gives \begin{align*} \Delta_{N} \geq \min\left( \Delta_{0}, \lambda_{g} \varepsilon_{g}, \lambda_{H} \varepsilon_{H} \right). \end{align*} Combine, divide by Δ0 and take the log in base 2: \begin{align*} |S_{N}| - 2|U_{N}| \geq \min\left( 0, \log_{2}\left( \frac{\lambda_{g} \varepsilon_{g}}{\Delta_{0}}\right), \log_{2}\left( \frac{\lambda_{H} \varepsilon_{H}}{\Delta_{0}} \right) \right). \end{align*} Use |SN| + |UN| = N + 1 to conclude. We can now prove the main theorem. Proof of Theorem 3.4. It is sufficient to combine Lemmas 3.6 and 3.7 in both regimes. First, we get that if ∥grad f(xk)∥ > εg for k = 0, … , N, then \begin{align*} N+1 \leq \tfrac{3}{2} \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{2} \lambda_{g}} \frac{1}{{\varepsilon_{g}^{2}}} + \tfrac{1}{2} \log_{2}\left( \frac{\Delta_{0}}{\lambda_{g} \varepsilon_{g}} \right). \end{align*} (The term $$\log _{2}\big ( \frac{\Delta _{0}}{\lambda _{H} \varepsilon _{H}} \big )$$ from Lemma 3.8 is irrelevant up to that point, as εH could just as well have been infinite.) Thus, after a number of iterations larger than the right-hand side, an iterate with sufficiently small gradient must have been produced, to avoid a contradiction. Second, we get that if for k = 0, … , N no iterate satisfies both ∥grad f(xk)∥≤ εg and $$\lambda _{\min }(H_{k}) \geq -\varepsilon _{H}$$, then \begin{align*} N+1 \leq \tfrac{3}{2} \frac{f(x_{0}) - f^{\ast}}{\rho^{\prime} c_{3} } \frac{1}{\min(\lambda_{g}\varepsilon_{g}, \lambda_{H}\varepsilon_{H})^{2} \varepsilon_{H}} + \tfrac{1}{2}\max\left( \log_{2}\left( \frac{\Delta_{0}}{\lambda_{g} \varepsilon_{g}} \right), \log_{2}\left( \frac{\Delta_{0}}{\lambda_{H} \varepsilon_{H}} \right) \right). \end{align*} Conclude with the same argument. 3.5. Connecting Hk and Hess f(xk) Theorem 3.4 states termination of Algorithm 3 in terms of ∥grad f(xk)∥ and $$\lambda _{\min }(H_{k})$$. Ideally, the latter should be turned into a statement about $$\lambda _{\min }(\textrm{Hess}\ f(x_{k}))$$, to match the second-order necessary optimality conditions of (P) more closely (recall Proposition 1.1). Assumption 3.5 itself requires only Hk to be (weakly) related to $$\nabla ^{2} \hat f_{k}(0_{x_{k}})$$ (the Hessian of the pullback of f at xk), which is different from the Riemannian Hessian of f at xk in general. It is up to the user to provide Hk sufficiently related to $$\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$. Additional control over the retraction at xk can further relate $$\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$ to Hess f(xk), as we do now. Proofs for this section are in Appendix D. Lemma 3.9 Define the maximal acceleration of Retr at x as the real a such that \begin{align*} \forall\ \eta \in \mathrm{T}_{x}\mathcal{M}\ \textrm{with}\ \|\eta\| = 1, \quad \bigg\| \frac{\mathrm{D}^{2}}{\mathrm{d}t^{2}} \textrm{Retr}_{x}(t\eta) |{}_{t=0} \bigg\| \leq a, \end{align*} where $$\frac{\mathrm{D}^{2}}{\mathrm{d}t^{2}} \gamma $$ denotes the acceleration of the curve t ↦ γ(t) on $$\mathcal{M}$$ (Absil et al., 2008, Chapter 5). Then \begin{align*} \left\| \textrm{Hess}\ f(x) - \nabla^{2} \hat f_{x}(0_{x}) \right\| \leq a \cdot \|\textrm{grad}\ f(x)\|.
\end{align*} In particular, if x is a critical point or if a = 0, the Hessians agree: $$\textrm{Hess}\ f(x) = \nabla ^{2} \hat f_{x}(0_{x})$$. The particular cases appear as Absil et al. (2008, Props. 5.5.5, 5.5.6). This result highlights the crucial role of retractions with zero acceleration, known as second-order retractions and defined in Absil et al. (2008, Prop. 5.5.5); we are not aware of earlier references to this notion. Definition 3.14 A retraction is a second-order retraction if it has zero acceleration, as defined in Lemma 3.9. Then retracted curves locally agree with geodesics up to second order. Proposition 3.15 Let $$x_{k}\in \mathcal{M}$$ be the iterate returned by Algorithm 3 under the assumptions of Theorem 3.4. It satisfies ∥grad f(xk)∥≤ εg and Hk ≽−εH Id. Assume Hk is related to the Hessian of the pullback as $$ \|\nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right ) - H_{k} \| \leq \delta _{k}$$. Further assume the retraction has acceleration at xk bounded by ak, as defined in Lemma 3.9. Then \begin{align*} \textrm{Hess}\ f(x_{k}) \succeq -\left( \varepsilon_{H} + a_{k} \varepsilon_{g} + \delta_{k} \right) \operatorname{Id}. \end{align*} In particular, if the retraction is second order and $$H_{k} = \nabla ^{2} \hat f_{k}\left (0_{x_{k}}\right )$$, then Hess f(xk) ≽−εH Id. We note that second-order retractions are frequently available in applications. Indeed, retractions for submanifolds obtained as (certain types of) projections—arguably one of the most natural classes of retractions for submanifolds—are second order (Absil & Malick, 2012, Thm. 22). For example, the sphere retraction Retrx(η) = (x + η)/∥x + η∥ is second order. Such retractions for low-rank matrices are also known (Absil & Oseledets, 2015). 4. Example: smooth semidefinite programs This example is based on Boumal et al. (2016). Consider the following semidefinite program, which occurs in robust PCA (McCoy & Tropp, 2011) and as a convex relaxation of combinatorial problems such as Max-Cut, $$\mathbb{Z}_{2}$$-synchronization and community detection in the stochastic block model (Goemans & Williamson, 1995; Bandeira et al., 2016): \begin{align} \min_{X\in{\mathbb{R}^{n\times n}}} \textrm{Tr}(CX)\ \textrm{subject to}\ \textrm{diag}(X) = \boldsymbol{1}, X \succeq 0. \end{align} (4.1) The symmetric cost matrix C depends on the application. Interior point methods solve this problem in polynomial time, though they involve significant work to enforce the conic constraint X ≽ 0 (X symmetric, positive semidefinite). This motivates the approach of Burer & Monteiro (2005) to parameterize the search space as $$X = YY^{\textrm{T}}$$, where Y is in $${\mathbb{R}^{n\times p}}$$ for some well-chosen p: \begin{align} \min_{Y\in{\mathbb{R}^{n\times p}}} \textrm{Tr}\left(CYY^{\textrm{T}}\! \right)\ \textrm{subject to}\ \textrm{diag}\left(YY^{\textrm{T}}\! \right) = \boldsymbol{1}. \end{align} (4.2) This problem is of the form of (P), where $$f(Y) = \textrm{Tr}\left (CYY^{\textrm{T}}\right )$$ and the manifold is a product of n unit spheres in $${\mathbb{R}^{p}}$$: \begin{align} \mathcal{M} & = \left\{ Y\in{\mathbb{R}^{n\times p}} : \textrm{diag}\left(YY^{\textrm{T}}\! \right) = \boldsymbol{1} \right\} = \left\{ Y\in{\mathbb{R}^{n\times p}} : \textrm{each row of } Y \textrm{ has unit norm} \right\}. \end{align} (4.3) In principle, since the parameterization $$X = YY^{\textrm{T}}$$ breaks convexity, the new problem could have many spurious local optimizers and saddle points.
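Before turning to the guarantees, the following NumPy sketch, which is again not from the paper, spells out the manifold ingredients for (4.2) and (4.3): the cost, the row-wise tangent projection, the Riemannian gradient obtained by projecting the Euclidean gradient 2CY, and the row-normalization retraction of Section 3.5; it then runs plain Riemannian gradient descent with an illustrative step size purely to exercise these ingredients, whereas the guarantee discussed next concerns RTR with the true Hessian.

import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 6
C = rng.standard_normal((n, n))
C = (C + C.T) / 2                                   # symmetric cost matrix (application dependent)

cost = lambda Y: np.trace(C @ Y @ Y.T)              # f(Y) = Tr(C Y Y^T)

def proj(Y, Z):
    # Tangent projection for the product of spheres (4.3): remove, row by row,
    # the component of Z along the corresponding row of Y.
    return Z - np.sum(Y * Z, axis=1, keepdims=True) * Y

def retr(Y, V):
    # Retraction by row-wise normalization of Y + V (cf. Section 3.5).
    W = Y + V
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def rgrad(Y):
    # Riemannian gradient: project the Euclidean gradient 2 C Y.
    return proj(Y, 2 * C @ Y)

Y = retr(rng.standard_normal((n, p)), 0.0)          # random feasible point (unit-norm rows)
step = 1 / (4 * np.linalg.norm(C, 2))               # illustrative step size, not a derived 1/L_g
for _ in range(500):
    G = rgrad(Y)
    if np.linalg.norm(G) <= 1e-6:
        break
    Y = retr(Y, -step * G)
print(cost(Y), np.linalg.norm(rgrad(Y)))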
Yet, for p = n + 1, it has recently been shown that approximate second-order critical points Y map to approximate global optimizers X = YYT, as stated in the following proposition. (In this particular case there is no need to control ∥grad f(Y )∥ explicitly.)

Proposition 4.1 (Boumal et al., 2016). If X⋆ is optimal for (4.1) and Y is feasible for (4.2) with p > n and Hess f(Y ) ≽−εH Id, the optimality gap is bounded as
\begin{align*} 0 \leq \textrm{Tr}\left(CYY^{\textrm{T}}\! \right) - \textrm{Tr}(CX^{\star}) \leq \frac{n}{2}\varepsilon_{H}. \end{align*}

Since f is smooth in $${\mathbb{R}^{n\times p}}$$ and $$\mathcal{M}$$ is a compact submanifold of $${\mathbb{R}^{n\times p}}$$, the regularity Assumptions 3.1 and 3.2 hold with any second-order retraction (Lemmas 2.4 and 3.1). In particular, they hold if $$\textrm{Retr}_{Y}(\dot Y)$$ is the result of normalizing each row of $$Y+\dot Y$$ (Section 3.5) or if the exponential map is used (which is cheap for this manifold; see Appendix E). Theorem 3.4 then implies that RTR applied to the nonconvex problem (4.2) computes a point X = Y YT feasible for (4.1) such that Tr(CX) −Tr(CX⋆) ≤ δ in $$\mathcal{O}\big(1/\delta ^{3}\big)$$ iterations. Appendix E bounds the total work with an explicit dependence on the problem dimension n as $$\mathcal{O}\big(n^{10} / \delta ^{3}\big)$$ arithmetic operations, where $$\mathcal{O}$$ hides factors depending on the data C and an additive log term. This result follows from a bound $$L_{H} \leq 8\left \|{C}\right \|_{\mathrm{2}} \sqrt{n}$$ for Assumption 3.2, which is responsible for a factor of n in the complexity—the remaining factors could be improved; see below.

In Boumal et al. (2016), it is shown that, generically in C, if $$p \geq \big \lceil \sqrt{2n}\, \big \rceil $$ then all second-order critical points of (4.2) are globally optimal (despite nonconvexity). This means RTR globally converges to global optimizers with cheaper iterations (due to reduced dimensionality). Unfortunately, there is no statement of quality pertaining to approximate second-order critical points for such small p, so this analysis is not sufficient to obtain an improved worst-case complexity bound.

These bounds are worse than the guarantees provided by interior point methods. Indeed, following Nesterov (2004, Sect. 4.3.3, with eq. (4.3.12)), interior point methods achieve a solution in $$\mathcal{O}\big(n^{3.5}\log (n/\delta )\big)$$ arithmetic operations. Yet, numerical experiments in Boumal et al. (2016) suggest RTR often outperforms interior point methods, indicating that the bound $$\mathcal{O}\big(n^{10}/\delta ^{3}\big)$$ is wildly pessimistic. We report it here mainly because, to the best of our knowledge, this is the first explicit bound for a Burer–Monteiro approach to solving a semidefinite program. A number of factors drive the gap between our worst-case bound and practice. In particular, strategies far more efficient than the LDLT factorization in Lemma 3.3 are used to compute second-order steps, and they can exploit structure in C. High-accuracy solutions are reached owing to RTR typically converging superlinearly, locally. And p is chosen much smaller than n + 1. See also Mei et al. (2017) for formal complexity results in a setting where p is allowed to be independent of n; this precludes reaching an objective value arbitrarily close to optimal, in exchange for cheaper computations.
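Proposition 4.1 turns a spectral bound on the Riemannian Hessian at a feasible Y into an optimality gap for (4.1). The sketch below shows how such a certificate could be checked numerically for small n with dense linear algebra; the Hessian-vector product it uses is the standard expression for Riemannian submanifolds specialized to this cost, stated here as an assumption of the sketch rather than quoted from the text, and the function names are ad hoc.

```python
# Sketch: certify an optimality gap for (4.1) from the Riemannian Hessian at a
# feasible Y of (4.2), via Proposition 4.1 with p = n + 1.  numpy only, small n.
# The Hessian-vector product is the standard Riemannian-submanifold formula
# specialized to f(Y) = Tr(C Y Y^T); it is an assumption of this sketch.
import numpy as np

def proj(Y, Z):
    """Tangent-space projection at Y (remove rowwise components along Y)."""
    return Z - np.sum(Y * Z, axis=1, keepdims=True) * Y

def hess_vec(C, Y, Z):
    """Riemannian Hessian of f(Y) = Tr(C Y Y^T) applied to a tangent vector Z."""
    d = np.sum((C @ Y) * Y, axis=1, keepdims=True)   # d_i = (C Y Y^T)_{ii}
    return proj(Y, 2 * (C @ Z - d * Z))

def min_hess_eig(C, Y):
    """Smallest eigenvalue of Hess f(Y) restricted to the tangent space at Y.
    The operator is extended to the ambient space by sandwiching with the
    projector, which only adds zero eigenvalues for normal directions."""
    n, p = Y.shape
    H = np.zeros((n * p, n * p))
    for j in range(n * p):
        E = np.zeros(n * p); E[j] = 1.0
        H[:, j] = proj(Y, hess_vec(C, Y, proj(Y, E.reshape(n, p)))).ravel()
    return np.linalg.eigvalsh((H + H.T) / 2)[0]

rng = np.random.default_rng(0)
n, p = 15, 16                                        # p = n + 1
C = rng.standard_normal((n, n)); C = (C + C.T) / 2
Y = rng.standard_normal((n, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)        # any feasible point will do
eps_H = max(0.0, -min_hess_eig(C, Y))
print("certified gap of X = Y Y^T for (4.1):", n / 2 * eps_H)
# At a random Y the certificate is weak; RTR drives eps_H below any tolerance,
# at which point the gap bound (n / 2) * eps_H of Proposition 4.1 becomes small.
```

For larger n, the same Hessian-vector product can be fed to an iterative eigensolver instead of forming a dense matrix.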
5. Conclusions and perspectives

We presented bounds on the number of iterations required by the Riemannian gradient descent algorithm and the RTR algorithm to reach points that approximately satisfy first- and second-order necessary optimality conditions, under some regularity assumptions but regardless of initialization. When the search space $$\mathcal{M}$$ is a Euclidean space these bounds are already known. For the more general case of $$\mathcal{M}$$ being a Riemannian manifold, these bounds are new. As a subclass of interest, we showed the regularity requirements are satisfied if $$\mathcal{M}$$ is a compact submanifold of $${\mathbb{R}^{n}}$$ and f has locally Lipschitz continuous derivatives of appropriate order. This covers a rich class of practical optimization problems.

While there are no explicit assumptions made about $$\mathcal{M}$$, the smoothness requirements for the pullback of the cost—Assumptions 2.6, 3.1 and 3.2—implicitly restrict the class of manifolds to which these results apply. Indeed, for certain manifolds, even for nice cost functions f, there may not exist retractions that ensure the assumptions hold. This is the case in particular for certain incomplete manifolds, such as open Riemannian submanifolds of $${\mathbb{R}^{n}}$$ and certain geometries of the set of fixed-rank matrices—see also Remark 2.2 about the injectivity radius. For such sets it may be necessary to adapt the assumptions. For fixed-rank matrices, for example, Vandereycken (2013, Sect. 4.1) obtains convergence results assuming a kind of coercivity on the cost function: for any sequence of rank-k matrices (Xi)i=1, 2, … such that the first singular value $$\sigma _{1}(X_{i}) \to \infty $$ or the kth singular value σk(Xi) → 0, it holds that $$f(X_{i}) \to \infty $$. This ensures iterates stay away from the boundary of this open set.

The iteration bounds are sharp, but additional information may yield more favorable bounds in specific contexts. In particular, when the studied algorithms converge to a nondegenerate local optimizer, they do so with an at least linear rate, so that the number of iterations is merely $$\mathcal{O}(\log (1/\varepsilon ))$$ once in the linear regime. This suggests a stitching approach: for a given application, it may be possible to show that rough approximate second-order critical points are in a local attraction basin; the total cost can then be bounded by the work needed to attain such a crude point starting from anywhere, plus the work needed to refine the crude point to high accuracy with a linear or even quadratic convergence rate. This is, to some degree, the successful strategy in Sun et al. (2017a,b).

Finally, we note that it would also be interesting to study the global convergence rates of Riemannian versions of adaptive regularization algorithms using cubics (ARC), since in the Euclidean case these can achieve approximate first-order criticality in $$\mathcal{O}\big(1/\varepsilon ^{1.5}\big)$$ instead of $$\mathcal{O} \big(1/\varepsilon ^{2}\big)$$ (Cartis et al., 2011a). Work in that direction could start with the convergence analyses proposed in Qi (2011).

Acknowledgements

This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme initiated by the Belgian Science Policy Office. This work was supported by the ARC 'Mining and Optimization of Big Data Models'.
We thank Alex d'Aspremont, Simon Lacoste-Julien, Ju Sun, Bart Vandereycken and Paul Van Dooren for helpful discussions.

Funding

Fonds Spéciaux de Recherche (FSR) at UCLouvain (to N.B.); Chaire Havas 'Chaire Economie et gestion des nouvelles données' (to N.B.); ERC Starting Grant SIPA (to N.B.); Research in Paris grant at Inria & ENS (to N.B.); National Science Foundation (DMS-1719558 to N.B.); The Natural Environment Research Council (grant NE/L012146/1 to C.C.).

Footnotes

1. Informally, the tangent bundle $$\mathrm{T}\mathcal{M}$$ is the set of all pairs (x, ηx) where $$x\in \mathcal{M}$$ and $$\eta _{x}\in \mathrm{T}_{x}\mathcal{M}$$. See the reference for a proper definition of $$\mathrm{T}\mathcal{M}$$ and of what it means for Retr to be smooth.

2. The composition f ∘ Retrx is called the pullback because it, quite literally, pulls back the cost function f from the manifold $$\mathcal{M}$$ to the linear space $$\mathrm{T}_{x}\mathcal{M}$$.

3. $$\forall \ \eta \in \mathrm{T}_{x}\mathcal{M}, \langle{\textrm{grad}\ \hat f_{x}(0_{x})},{\eta }\rangle = \mathrm{D} \hat f_{x}(0_{x})[\eta ] = \mathrm{D} f({x})[\mathrm{D} \textrm{Retr}_{{x}}(0_{x})[\eta ]] = \mathrm{D} f({x})[\eta ] = \left \langle{\textrm{grad}\ f({x})},{\eta }\right \rangle .$$

4. See Remark 2.2; $$\rho _{k} = \infty $$ is valid if the retraction is globally defined and f is sufficiently nice (for example, Lemma 2.7).

5. This holds in particular in the classical setting $$\mathcal{M} = {\mathbb{R}^{n}}$$, Retrx(η) = x + η and grad f is Lg-Lipschitz.

6. This is typically not an issue in practice. For example, globally defined, practical retractions are known for the sphere, the Stiefel manifold, the orthogonal group, their products and many others (Absil et al., 2008, Chapter 4).

7. Theorem 3.4 is scale invariant, in that if the cost function f(x) is replaced by αf(x) for some positive α (which does not meaningfully change (P)), it is sensible to also multiply Lg, LH, c0, c1, εg and εH by α; consequently, the upper bounds on εg and εH and the upper bounds on N1 and N2 are invariant under this scaling. If it is desirable to always allow εg, εH in, say, (0, 1], one possibility is to artificially make Lg, LH, c0, c1 larger (which is always allowed).

8. The function f need not be defined outside of $$\mathcal{M}$$, but this is often the case in applications and simplifies exposition.

9. Proper definition of Riemannian Hessians requires the notion of Riemannian connections, which we omit here; see Absil et al. (2008, Chapter 5).

10. Indeed, any $$y\in \mathcal{S}^{n-1}$$ can be written as y = αx + βη with xTη = 0, ∥η∥ = 1 and α2 + β2 = 1; then, yTAy = α2xTAx + β2ηTAη + 2αβηTAx; by the first-order condition, ηTAx = (xTAx)ηTx = 0, and by the second-order condition, yTAy ≥ (α2 + β2)xTAx = xTAx; hence xTAx is minimal over $$\mathcal{S}^{n-1}$$.

References

Absil, P.-A., Baker, C. G. & Gallivan, K. A. (2007) Trust-region methods on Riemannian manifolds. Found. Comput. Math., 7, 303--330. doi:10.1007/s10208-005-0179-9.

Absil, P.-A., Mahony, R. & Sepulchre, R. (2008) Optimization Algorithms on Matrix Manifolds. Princeton, NJ: Princeton University Press. ISBN 978-0-691-13298-3.

Absil, P.-A., Mahony, R. & Trumpf, J. (2013) An extrinsic look at the Riemannian Hessian. Geometric Science of Information (F. Nielsen & F. Barbaresco eds). Lecture Notes in Computer Science, vol. 8085. Berlin, Heidelberg: Springer, pp. 361--368. ISBN 978-3-642-40019-3. doi:10.1007/978-3-642-40020-9_39. URL http://sites.uclouvain.be/absil/2013.01.

Absil, P.-A. & Malick, J. (2012) Projection-like retractions on matrix manifolds. SIAM J. Optim., 22, 135--158. doi:10.1137/100802529.

Absil, P.-A. & Oseledets, I. (2015) Low-rank retractions: a survey and new results. Comput. Optim. Appl., 62, 5--29. doi:10.1007/s10589-014-9714-4.

Absil, P.-A., Trumpf, J., Mahony, R. & Andrews, B. (2009) All roads lead to Newton: feasible second-order methods for equality-constrained optimization. Technical Report UCL-INMA-2009.024. Belgium: Département d'ingénierie mathématique, UCLouvain.

Adler, R., Dedieu, J., Margulies, J., Martens, M. & Shub, M. (2002) Newton's method on Riemannian manifolds and a geometric model for the human spine. IMA J. Numer. Anal., 22, 359--390. doi:10.1093/imanum/22.3.359.

Bandeira, A., Boumal, N. & Voroninski, V. (2016) On the low-rank approach for semidefinite programs arising in synchronization and community detection. Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, June 23--26. PMLR.

Bento, G., Ferreira, O. & Melo, J. (2017) Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. J. Optim. Theory Appl., 173, 548--562. doi:10.1007/s10957-017-1093-4.

Berger, G. (2017) Fast matrix multiplication. Master's thesis, École polytechnique de Louvain. Available at http://hdl.handle.net/2078.1/thesis:10630.

Bhojanapalli, S., Neyshabur, B. & Srebro, N. (2016) Global optimality of local search for low rank matrix recovery. Preprint arXiv:1605.07221.

Birgin, E., Gardenghi, J., Martínez, J., Santos, S. & Toint, P. (2017) Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math. Prog., 163, 359--368. doi:10.1007/s10107-016-1065-8.

Boumal, N. (2015a) Riemannian trust regions with finite-difference Hessian approximations are globally convergent. Geometric Science of Information (F. Nielsen & F. Barbaresco eds). Lecture Notes in Computer Science, vol. 9389. Springer International Publishing, pp. 467--475. doi:10.1007/978-3-319-25040-3_50.

Boumal, N. (2015b) A Riemannian low-rank method for optimization over semidefinite matrices with block-diagonal constraints. Preprint arXiv:1506.00575.

Boumal, N. (2016) Nonconvex phase synchronization. SIAM J. Optim., 26, 2355--2377. doi:10.1137/16M105808X.

Boumal, N., Mishra, B., Absil, P.-A. & Sepulchre, R. (2014) Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res., 15, 1455--1459. http://www.manopt.org.

Boumal, N., Voroninski, V. & Bandeira, A. (2016) The non-convex Burer–Monteiro approach works on smooth semidefinite programs. Advances in Neural Information Processing Systems, vol. 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon & R. Garnett eds). Curran Associates, pp. 2757--2765.

Burer, S. & Monteiro, R. (2005) Local minima and convergence in low-rank semidefinite programming. Math. Prog., 103, 427--444.

Cartis, C., Gould, N. I. M. & Toint, P. L. (2010) On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization problems. SIAM J. Optim., 20, 2833--2852. doi:10.1137/090774100.

Cartis, C., Gould, N. & Toint, P. (2011a) Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Prog., 130, 295--319. doi:10.1007/s10107-009-0337-y.

Cartis, C., Gould, N. & Toint, P. (2011b) Optimal Newton-type methods for nonconvex smooth optimization problems. ERGO Technical Report 11-009. School of Mathematics, University of Edinburgh.

Cartis, C., Gould, N. & Toint, P. (2012) Complexity bounds for second-order optimality in unconstrained optimization. J. Complexity, 28, 93--108. doi:10.1016/j.jco.2011.06.001.

Cartis, C., Gould, N. & Toint, P. (2014) On the complexity of finding first-order critical points in constrained nonlinear optimization. Math. Prog., 144, 93--106. doi:10.1007/s10107-012-0617-9.

Cartis, C., Gould, N. & Toint, P. (2015a) Evaluation complexity bounds for smooth constrained nonlinear optimization using scaled KKT conditions and high-order models. NA Technical Report, Maths E-print Archive 1912. Mathematical Institute, Oxford University.

Cartis, C., Gould, N. & Toint, P. (2015b) On the evaluation complexity of constrained nonlinear least-squares and general constrained nonlinear optimization using second-order methods. SIAM J. Numer. Anal., 53, 836--851. doi:10.1137/130915546.

Cartis, C., Gould, N. I. M. & Toint, P. L. (2017) Second-order optimality and beyond: characterization and evaluation complexity in convexly constrained nonlinear optimization. Found. Comput. Math. doi:10.1007/s10208-017-9363-y.

Chavel, I. (2006) Riemannian Geometry: A Modern Introduction. Cambridge Tracts in Mathematics, vol. 108. Cambridge: Cambridge University Press.

Conn, A., Gould, N. & Toint, P. (2000) Trust-Region Methods. MPS-SIAM Series on Optimization. SIAM. ISBN 978-0-89871-460-9. doi:10.1137/1.9780898719857.

Curtis, F. E., Robinson, D. P. & Samadi, M. (2016) A trust region algorithm with a worst-case iteration complexity of $$\mathcal{O}(\epsilon ^{-3/2})$$ for nonconvex optimization. Math. Prog., 162, 1--32. doi:10.1007/s10107-016-1026-2.

Edelman, A., Arias, T. & Smith, S. (1998) The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20, 303--353.

Gabay, D. (1982) Minimizing a differentiable function over a differential manifold. J. Optim. Theory Appl., 37, 177--219.

Ge, R., Lee, J. & Ma, T. (2016) Matrix completion has no spurious local minimum. Advances in Neural Information Processing Systems, vol. 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon & R. Garnett eds). Curran Associates, pp. 2973--2981. http://papers.nips.cc/paper/6048-matrix-completion-has-no-spurious-local-minimum.pdf.

Goemans, M. & Williamson, D. (1995) Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42, 1115--1145. doi:10.1145/227683.227684.

Golub, G. & Van Loan, C. (2012) Matrix Computations, 4th edn. Johns Hopkins Studies in the Mathematical Sciences, vol. 3. Baltimore, MD: Johns Hopkins University Press.

Huang, W., Absil, P.-A., Gallivan, K. & Hand, P. (2016) ROPTLIB: an object-oriented C++ library for optimization on Riemannian manifolds. Technical Report FSU16-14.v2. Florida State University.

Huang, W., Gallivan, K. & Absil, P.-A. (2015) A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim., 25, 1660--1685. doi:10.1137/140955483.

McCoy, M. & Tropp, J. (2011) Two proposals for robust PCA using semidefinite programming. Electron. J. Stat., 5, 1123--1160. doi:10.1214/11-EJS636.

Mei, S., Misiakiewicz, T., Montanari, A. & Oliveira, R. (2017) Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. Preprint arXiv:1703.08729.

Monera, M. G., Montesinos-Amilibia, A. & Sanabria-Codesal, E. (2014) The Taylor expansion of the exponential map and geometric applications. Rev. R. Acad. Cienc. Exactas, Físicas Nat. Ser. A Math. RACSAM, 108, 881--906. doi:10.1007/s13398-013-0149-z.

Moré, J. & Sorensen, D. (1983) Computing a trust region step. SIAM J. Sci. Stat. Comput., 4, 553--572. doi:10.1137/0904038.

Nesterov, Y. (2004) Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Boston, MA: Springer. ISBN 978-1-4020-7553-7.

Nocedal, J. & Wright, S. (1999) Numerical Optimization. New York: Springer.

O'Neill, B. (1983) Semi-Riemannian Geometry: With Applications to Relativity, vol. 103. Academic Press.

Qi, C. (2011) Numerical optimization methods on Riemannian manifolds. PhD thesis, Florida State University, Tallahassee, FL.

Ring, W. & Wirth, B. (2012) Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim., 22, 596--627. doi:10.1137/11082885X.

Ruszczyński, A. (2006) Nonlinear Optimization. Princeton, NJ: Princeton University Press.

Sato, H. (2016) A Dai–Yuan-type Riemannian conjugate gradient method with the weak Wolfe conditions. Comput. Optim. Appl., 64, 101--118. doi:10.1007/s10589-015-9801-1.

Shub, M. (1986) Some remarks on dynamical systems and numerical analysis. Proceedings of VII ELAM (L. Lara-Carrero & J. Lewowicz eds). Equinoccio, Univ. Simón Bolívar, Caracas, pp. 69--92.

Smith, S. (1994) Optimization techniques on Riemannian manifolds. Fields Inst. Commun., 3, 113--135.

Sorensen, D. (1982) Newton's method with a model trust region modification. SIAM J. Numer. Anal., 19, 409--426. doi:10.1137/0719026.

Steihaug, T. (1983) The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numer. Anal., 20, 626--637.
Sun, J., Qu, Q. & Wright, J. (2017a) Complete dictionary recovery over the sphere II: recovery by Riemannian trust-region method. IEEE Trans. Inf. Theory, 63, 885--914. doi:10.1109/TIT.2016.2632149.

Sun, J., Qu, Q. & Wright, J. (2017b) A geometric analysis of phase retrieval. Found. Comput. Math. doi:10.1007/s10208-017-9365-9.

Toint, P. (1981) Towards an efficient sparsity exploiting Newton method for minimization. Sparse Matrices and Their Uses (I. Duff ed). Academic Press, pp. 57--88.

Townsend, J., Koep, N. & Weichwald, S. (2016) PyManopt: a Python toolbox for optimization on manifolds using automatic differentiation. J. Mach. Learn. Res., 17, 1--5.

Udriste, C. (1994) Convex Functions and Optimization Methods on Riemannian Manifolds. Mathematics and its Applications, vol. 297. Dordrecht: Kluwer Academic Publishers. doi:10.1007/978-94-015-8390-9.

Vandereycken, B. (2013) Low-rank matrix completion by Riemannian optimization. SIAM J. Optim., 23, 1214--1236. doi:10.1137/110845768.

Vavasis, S. (1991) Nonlinear Optimization: Complexity Issues. Oxford: Oxford University Press.

Yang, W., Zhang, L.-H. & Song, R. (2014) Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific J. Optim., 10, 415--434.

Zhang, H. & Sra, S. (2016) First-order methods for geodesically convex optimization. Conference on Learning Theory. Curran Associates, pp. 1617--1638.

Zhang, H., Reddi, S. & Sra, S. (2016) Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. Advances in Neural Information Processing Systems. Curran Associates, pp. 4592--4600.

Appendix A. Essentials about manifolds

We give here a simplified refresher of differential geometric concepts used in the paper, restricted to Riemannian submanifolds. All concepts are illustrated with the sphere. See Absil et al. (2008) for a more complete discussion, including quotient manifolds.

We endow $${\mathbb{R}^{n}}$$ with the classical Euclidean metric: for all $$x, y \in{\mathbb{R}^{n}}$$, $$\left \langle{x},{y}\right \rangle = x^{\textrm{T} }\! y$$. Consider the smooth map $$h \colon{\mathbb{R}^{n}} \to{\mathbb{R}^{m}}$$ with m ≤ n and the constraint set
\begin{align*} \mathcal{M} = \left\{ x \in{\mathbb{R}^{n}} : h(x) = 0 \right\}. \end{align*}
Locally around each x, this set can be linearized by differentiating the constraint. The subspace corresponding to this linearization is the kernel of the differential of h at x (Absil et al., 2008, eq. (3.19)):
\begin{align*} \mathrm{T}_{x}\mathcal{M} =\left \{ \eta \in{\mathbb{R}^{n}} : \mathrm{D} h(x)[\eta] = 0 \right\}. \end{align*}
If this subspace has dimension n − m for all $$x\in \mathcal{M}$$ then $$\mathcal{M}$$ is a submanifold of dimension n − m of $${\mathbb{R}^{n}}$$ (Absil et al., 2008, Prop. 3.3.3) and $$\mathrm{T}_{x}\mathcal{M}$$ is called the tangent space to $$\mathcal{M}$$ at x. For example, the unit sphere in $${\mathbb{R}^{n}}$$ is a submanifold of dimension n − 1 defined by
\begin{align*} \mathcal{S}^{n-1} = \left\{ x \in{\mathbb{R}^{n}} : x^{\textrm{T}}\! x = 1 \right\}, \end{align*}
and the tangent space at x is
\begin{align*} \mathrm{T}_{x}\mathcal{S}^{n-1} = \left\{ \eta \in{\mathbb{R}^{n}} : x^{\textrm{T}}\! \eta = 0 \right\}. \end{align*}
By endowing each tangent space with the (restricted) Euclidean metric we turn $$\mathcal{M}$$ into a Riemannian submanifold of the Euclidean space $${\mathbb{R}^{n}}$$. (In general, the metric could be different and would depend on x; to disambiguate, one would write $$\left \langle{\cdot },{\cdot }\right \rangle _{x}$$.) An obvious retraction for the sphere (see Definition 2.1) is to normalize:
\begin{align*} \textrm{Retr}_{x}(\eta) = \frac{x+\eta}{\|x+\eta\|}. \end{align*}
Being an orthogonal projection to the manifold, this is actually a second-order retraction; see Definition 3.10 and Absil & Malick (2012, Thm. 22).

The Riemannian metric leads to the notion of Riemannian gradient of a real function f defined in an open set of $${\mathbb{R}^{n}}$$ containing $$\mathcal{M}$$.8 The Riemannian gradient of f at x is the (unique) tangent vector grad f(x) at x satisfying
\begin{align*} \forall\ \eta \in \mathrm{T}_{x}\mathcal{M}, \quad \mathrm{D} f(x)[\eta] = \lim_{t \to 0} \frac{f(x+t\eta)-f(x)}{t} = \left\langle{\eta},{\textrm{grad}\ f(x)}\right\rangle. \end{align*}
In this setting the Riemannian gradient is nothing but the orthogonal projection of the Euclidean (classical) gradient ∇f(x) to the tangent space. Writing $$\textrm{Proj}_{x} \colon{\mathbb{R}^{n}} \to \mathrm{T}_{x}\mathcal{M}$$ for the orthogonal projector we have (Absil et al., 2008, eq. (3.37))
\begin{align*} \textrm{grad}\ f(x) = \textrm{Proj}_{x}\!\left( \nabla f(x) \right). \end{align*}
Continuing the sphere example, the orthogonal projector is $$\textrm{Proj}_{x}(y) = y - \left (x^{\textrm{T} }\! y\right ) x$$, and if $$f(x) = \frac{1}{2} x^{\textrm{T} }\! A x$$ for some symmetric matrix A then
\begin{align*} \nabla f(x) = Ax, \qquad \textrm{and} \qquad \textrm{grad}\ f(x) = Ax - (x^{\textrm{T}}\! A x) x. \end{align*}
Notice that the critical points of f on $$\mathcal{S}^{n-1}$$ coincide with the unit eigenvectors of A.

We can further define a notion of Riemannian Hessian as the projected differential of the Riemannian gradient:9
\begin{align*} \textrm{Hess}\ f(x)[\eta] = \textrm{Proj}_{x} \left( \mathrm{D}\left( x \mapsto \textrm{Proj}_{x} \nabla f(x) \right)(x)[\eta]\right). \end{align*}
Hess f(x) is a linear map from $$\mathrm{T}_{x}\mathcal{M}$$ to itself, symmetric with respect to the Riemannian metric. Given a second-order retraction (Definition 3.10), it is equivalently defined by
\begin{align*} \forall\ \eta \in \mathrm{T}_{x}\mathcal{M}, \quad \left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle = \left. \frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}} f(\textrm{Retr}_{x}(t\eta))\right|{}_{t=0}; \end{align*}
see Absil et al. (2008, eq. (5.35)). Continuing our sphere example,
\begin{align*} \mathrm{D}\left( x \mapsto \textrm{Proj}_{x} \nabla f(x) \right)(x)[\eta] = \mathrm{D}\left( x \mapsto Ax - \left(x^{\textrm{T}}\! Ax\right)x \right)(x)[\eta] = A\eta - (x^{\textrm{T}}\! Ax)\eta - 2 (x^{\textrm{T}}\! A \eta) x. \end{align*}
Projection of the latter gives the Hessian:
\begin{align*} \textrm{Hess}\ f(x)[\eta] = \textrm{Proj}_{x} (A\eta) - (x^{\textrm{T}}\! Ax)\eta. \end{align*}
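As a quick numerical sanity check of the sphere formulas above (a small illustrative sketch in plain numpy; the function names are ad hoc), one can verify that unit eigenvectors of A are critical points, and that the Hessian is positive semidefinite on the tangent space at a leftmost eigenvector but not at a rightmost one:

```python
# Numerical check of the sphere formulas above (illustrative sketch, numpy only).
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2

def grad(x):
    return A @ x - (x @ A @ x) * x                      # grad f(x) = Ax - (x^T A x) x

def hess(x, eta):
    Aeta = A @ eta
    return (Aeta - (x @ Aeta) * x) - (x @ A @ x) * eta  # Proj_x(A eta) - (x^T A x) eta

w, V = np.linalg.eigh(A)                                # eigenvalues in ascending order
for i in (0, n - 1):                                    # leftmost, then rightmost eigenvector
    x = V[:, i]
    # Orthonormal basis of the tangent space {eta : x^T eta = 0}.
    B = np.linalg.svd(np.eye(n) - np.outer(x, x))[0][:, :n - 1]
    H = B.T @ np.column_stack([hess(x, B[:, j]) for j in range(n - 1)])
    print(np.linalg.norm(grad(x)),                      # ~0: eigenvectors are critical points
          np.linalg.eigvalsh((H + H.T) / 2).min())      # >= 0 only at the leftmost eigenvector
```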
Consider the implications of a positive semidefinite Hessian (on the tangent space):
\begin{align*} \textrm{Hess}\ f(x) \succeq 0 & \iff \left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle \geq 0 \qquad \forall\ \eta\in\mathrm{T}_{x}\mathcal{S}^{n-1} \\ & \iff \eta^{\textrm{T}}\! A\eta \geq x^{\textrm{T}}\! A x \qquad \forall\ \eta\in\mathrm{T}_{x}\mathcal{S}^{n-1}, \|\eta\| = 1. \end{align*}
Together with first-order conditions this implies that x is a leftmost eigenvector of A.10 This is an example of an optimization problem on a manifold for which second-order necessary optimality conditions are also sufficient. This is not the norm. As another (very) special example consider the case $$\mathcal{M} = {\mathbb{R}^{n}}$$; then, $$\mathrm{T}_{x}{\mathbb{R}^{n}} = {\mathbb{R}^{n}}$$, Retrx(η) = x + η is the exponential map (a fortiori a second-order retraction), Projx is the identity, grad f(x) = ∇f(x) and Hess f(x) = ∇2f(x).

Appendix B. Compact submanifolds of Euclidean spaces

In this appendix we prove Lemmas 2.4 and 3.1, showing that if f has locally Lipschitz continuous gradient or Hessian in a Euclidean space $$\mathcal{E}$$ (in the usual sense), and it is to be minimized over a compact submanifold of $$\mathcal{E}$$, then Assumptions 2.6, 3.1 and 3.2 hold.

Proof of Lemma 2.4. By assumption, ∇f is Lipschitz continuous along any line segment in $$\mathcal{E}$$ joining x and y in $$\mathcal{M}$$. Hence, there exists L such that, for all $$x, y \in \mathcal{M}$$,
\begin{align} \left|\, f(y) - \left[\, f(x) + \left\langle{\nabla f(x)},{y-x}\right\rangle \right] \right| \leq \frac{L}{2} \|y-x\|^{2}. \end{align} (B.1)
In particular this holds for all y = Retrx(η), for any $$\eta \in \mathrm{T}_{x}\mathcal{M}$$. Writing grad f(x) for the Riemannian gradient of $$f|_{\mathcal{M}}$$ and using that grad f(x) is the orthogonal projection of ∇f(x) to $$\mathrm{T}_{x}\mathcal{M}$$ (Absil et al., 2008, eq. (3.37)) the inner product above decomposes as
\begin{align} \left\langle{\nabla f(x)},{\textrm{Retr}_{x}(\eta)-x}\right\rangle & = \left\langle{\nabla f(x)},{\eta + \textrm{Retr}_{x}(\eta) - x - \eta}\right\rangle \nonumber\\ & = \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle + \left\langle{\nabla f(x)},{\textrm{Retr}_{x}(\eta) - x - \eta}\right\rangle. \end{align} (B.2)
Combining (B.1) with (B.2) and using the triangle inequality yields
\begin{align*} \left| f(\textrm{Retr}_{x}(\eta)) - \left[ f(x) + \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle \right] \right| \leq \frac{L}{2} \|\textrm{Retr}_{x}(\eta)-x\|^{2} + \|\nabla f(x)\| \|\textrm{Retr}_{x}(\eta) - x - \eta\|. \end{align*}
Since ∇f(x) is continuous on the compact set $$\mathcal{M}$$ there exists finite G such that ∥∇f(x)∥≤ G for all $$x\in \mathcal{M}$$. It remains to show there exist finite constants α, β ≥ 0 such that, for all $$x\in \mathcal{M}$$ and for all $$\eta \in \mathrm{T}_{x}\mathcal{M}$$,
\begin{align} \|\textrm{Retr}_{x}(\eta)-x\| \leq \alpha\|\eta\|\ \textrm{and} \end{align} (B.3)
\begin{align} \|\textrm{Retr}_{x}(\eta) - x - \eta\| \leq \beta\|\eta\|^{2}. \end{align} (B.4)
For small η, this will follow from $$\textrm{Retr}_{x}(\eta ) = x + \eta + \mathcal{O}\left (\|\eta \|^{2}\right )$$ by Definition 2.1; for large η this will follow a fortiori from compactness.
This will be sufficient to conclude, as then we will have for all $$x\in \mathcal{M}$$ and $$\eta \in \mathrm{T}_{x}\mathcal{M}$$ that \begin{align*} \left| f(\textrm{Retr}_{x}(\eta)) - \left[ f(x) + \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle \right] \right| \leq \left(\frac{L}{2} \alpha^{2} + G \beta \right)\|\eta\|^{2}. \end{align*} More formally our assumption that the retraction is defined and smooth over the whole tangent bundle a fortiori ensures the existence of r > 0 such that Retr is smooth on $$K = \{\eta \in \mathrm{T}\mathcal{M}: \|\eta \| \leq r\}$$, a compact subset of the tangent bundle (K consists of a ball in each tangent space). First, we determine α; see (B.3). For all η ∈ K we have \begin{align*} \|\textrm{Retr}_{x}(\eta) - x \| & \leq{\int_{0}^{1}} \left\|\frac{\mathrm{d}}{\mathrm{d}t} \textrm{Retr}_{x}(t\eta)\right\| \mathrm{d}t = {\int_{0}^{1}} \|\mathrm{D}\textrm{Retr}_{x}(t\eta)[\eta]\|\ \mathrm{d}t \\ & \leq{\int_{0}^{1}} \max_{\xi\in K} \|\mathrm{D}\textrm{Retr}(\xi)\| \|\eta\|\ \mathrm{d}t = \max_{\xi\in K} \|\mathrm{D}\textrm{Retr}(\xi)\| \|\eta\|, \end{align*} where the $$\max $$ exists and is finite owing to compactness of K and smoothness of Retr on K; note that this is uniform over both x and η. (If $$\xi \in \mathrm{T}_{z}\mathcal{M}$$ the notation DRetr(ξ) refers to DRetrz(ξ).) For all η ∉ K we have \begin{align*} \|\textrm{Retr}_{x}(\eta) - x \| \leq \textrm{diam}(\mathcal{M}) \leq \frac{\textrm{diam}(\mathcal{M})}{r} \|\eta\|, \end{align*} where $$\textrm{diam}(\mathcal{M})$$ is the maximal distance between any two points on $$\mathcal{M}$$: finite by compactness of $$\mathcal{M}$$. Combining, we find that (B.3) holds with \begin{align*} \alpha = \max\left( \max_{\xi\in K} \|\mathrm{D}\textrm{Retr}(\xi)\|, \frac{\textrm{diam}(\mathcal{M})}{r} \right). \end{align*} Inequality (B.4) is established along similar lines. For all η ∈ K we have \begin{align*} \|\textrm{Retr}_{x}(\eta) - x - \eta\| & \leq{\int_{0}^{1}} \left\|\frac{\textrm{d}}{\mathrm{d}t} (\textrm{Retr}_{x}(t\eta)-x-t\eta) \right\| \mathrm{d}t = {\int_{0}^{1}} \|\mathrm{D}\textrm{Retr}_{x}(t\eta)[\eta] - \eta\|\ \mathrm{d}t \\ & \leq{\int_{0}^{1}} \|\mathrm{D}\textrm{Retr}_{x}(t\eta)-\operatorname{Id}\| \|\eta\|\ \mathrm{d}t \leq \frac{1}{2} \max_{\xi\in K} \|\mathrm{D}^{2}\textrm{Retr}(\xi)\| \|\eta\|^{2}, \end{align*} where the last inequality follows from DRetrx(0x) = Id and \begin{align*} \|\mathrm{D}\textrm{Retr}_{x}(t\eta)-\operatorname{Id}\| \leq{\int_{0}^{1}} \left\| \frac{\mathrm{d}}{\mathrm{d}s} \mathrm{D}\textrm{Retr}_{x}(st\eta) \right\| \mathrm{d}s \leq \|t\eta\| {\int_{0}^{1}} \left\| \mathrm{D}^{2}\textrm{Retr}_{x}(t\eta) \right\| \mathrm{d}s. \end{align*} The case η ∉ K is treated as before: \begin{align*} \|\textrm{Retr}_{x}(\eta) - x - \eta \| \leq \|\textrm{Retr}_{x}(\eta) - x\| + \|\eta \| \leq \frac{\textrm{diam}(\mathcal{M}) + r}{r^{2}} \|\eta\|^{2}. \end{align*} Combining, we find that (B.4) holds with \begin{align*} \beta = \max\left( \frac{1}{2} \max_{\xi\in K} \big\|\mathrm{D}^{2}\textrm{Retr}(\xi)\big\|, \frac{\textrm{diam}(\mathcal{M}) + r}{r^{2}} \right), \end{align*} which concludes the proof. We now prove the corresponding second-order result, whose aim is to verify Assumptions 3.2. Proof of Lemma 3.1 By assumption ∇2f is Lipschitz continuous along any line segment in $$\mathcal{E}$$ joining x and y in $$\mathcal{M}$$. 
Hence, there exists L such that, for all $$x, y \in \mathcal{M}$$, \begin{align} \left| f(y) - \left[ f(x) + \left\langle{\nabla f(x)},{y-x}\right\rangle + \tfrac{1}{2}\left\langle{y-x},{\nabla^{2} f(x)[y-x]}\right\rangle \right] \right| \leq \frac{L}{6} \|y-x\|^{3}. \end{align} (B.5) Fix $$x\in \mathcal{M}$$. Let Projx denote the orthogonal projector from $$\mathcal{E}$$ to $$\mathrm{T}_{x}\mathcal{M}$$. Let grad f(x) be the Riemannian gradient of $$f|_{\mathcal{M}}$$ at x and let Hess f(x) be the Riemannian Hessian of $$f|_{\mathcal{M}}$$ at x (a symmetric operator on $$\mathrm{T}_{x}\mathcal{M}$$). For Riemannian submanifolds of Euclidean spaces we have these explicit expressions with $$\eta \in \mathrm{T}_{x}\mathcal{M}$$—see Absil et al. (2008, eqs. (3.37), (5.15), Def. (5.5.1)) and Absil et al. (2013): \begin{align*} \textrm{grad}\ f(x) & = \textrm{Proj}_{x} \nabla f(x),\ \textrm{and} \\ \left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle & = \left\langle{\eta},{ \mathrm{D}\left( x \mapsto \textrm{Proj}_{x} \nabla f(x) \right)(x)[\eta]}\right\rangle \\ & = \left\langle{\eta},{\left( \mathrm{D}\left( x \mapsto \textrm{Proj}_{x} \right)(x)[\eta]\right)[\nabla f(x)] + \textrm{Proj}_{x} \nabla^{2}f(x)[\eta]}\right\rangle \\ & = \left\langle{II(\eta,\eta)},{\nabla f(x)}\right\rangle + \left\langle{\eta},{\nabla^{2} f(x)[\eta]}\right\rangle, \end{align*} where II, as implicitly defined above, is the second fundamental form of $$\mathcal{M}$$: II(η, η) is a normal vector to the tangent space at x, capturing the second-order geometry of $$\mathcal{M}$$—see Absil et al. (2009, 2013) and Monera et al. (2014) for presentations relevant to our setting. In particular, II(η, η) is the acceleration in $$\mathcal{E}$$ at x of a geodesic γ(t) on $$\mathcal{M}$$ defined by γ(0) = x and $$\dot \gamma (0) = \eta $$: $$\ddot \gamma (0) = II(\eta ,\eta )$$ (O’Neill, 1983, Cor. 4.9). Let $$\eta \in \mathrm{T}_{x}\mathcal{M}$$ be arbitrary; $$y = \textrm{Retr}_{x}(\eta ) \in \mathcal{M}$$. Then \begin{align*} \left\langle{\nabla f(x)},{y-x}\right\rangle - \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle &= \left\langle{\nabla f(x)},{y - x - \eta}\right\rangle\ \textrm{and} \\ \left\langle{y-x},{\nabla^{2} f(x)[y-x]}\right\rangle - \left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle &= 2\left\langle{\eta},{\nabla^{2} f(x)[y-x-\eta]}\right\rangle\\ &\quad + \left\langle{y-x-\eta},{\nabla^{2} f(x)[y-x-\eta]}\right\rangle\\ &\quad - \left\langle{\nabla f(x)},{II(\eta,\eta)}\right\rangle. \end{align*} Since $$\mathcal{M}$$ is compact and f is twice continuously differentiable, there exist G, H, independent of x, such that ∥∇f(x)∥≤ G and $$\left \|\nabla ^{2} f(x)\right \| \leq H$$ (the latter is the induced operator norm). Combining with (B.5) and using the triangle and Cauchy–Schwarz inequalities multiple times, \begin{align*} &\left|\, f(y) - \left[ f(x) + \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle + \frac{1}{2}\left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle \right] \right| \\ &\quad\leq \frac{L}{6} \|y-x\|^{3} + G\left\|y-x-\eta-\frac{1}{2}II(\eta,\eta)\right\| + H\|\eta\|\|y-x-\eta\| + \frac{1}{2} H \|y-x-\eta\|^{2}. \end{align*} Using the same argument as for Lemma 2.4 we can find finite constants α, β independent of x and η such that (B.3) and (B.4) hold. 
Use $$\|y-x-\eta \|^{2} \leq \|y-x-\eta \| \left ( \|y-x\| + \|\eta \| \right ) \leq \beta (\alpha +1)\|\eta \|^{3}$$ to upper bound the right-hand side above with
\begin{align*} \left(\frac{L}{6}\alpha^{3} + H\beta + \frac{H\beta(\alpha+1)}{2}\right) \|\eta\|^{3} + G\left\|y-x-\eta-\frac{1}{2}II(\eta,\eta)\right\|. \end{align*}
We turn to the last term. Consider $$K \subset \mathrm{T}\mathcal{M}$$ as defined in the proof of Lemma 2.4 for some r > 0. If η ∉ K, that is, ∥η∥ > r, then since II is bilinear for a fixed $$x\in \mathcal{M}$$, we can define
\begin{align*} \|II\| = \max_{x \in \mathcal{M}} \max_{\xi \in \mathrm{T}_{x}\mathcal{M}, \|\xi\| \leq 1} \|II(\xi, \xi)\| \end{align*}
(finite by continuity and compactness) so that ∥II(η, η)∥≤∥II∥∥η∥2. Then,
\begin{align*} \left\|y-x-\eta-\frac{1}{2}II(\eta,\eta)\right\| \leq \|y-x\| + \|\eta\| + \frac{1}{2} \|II(\eta, \eta)\| \leq \left( \frac{\operatorname{diam}(\mathcal{M})}{r^{3}} + \frac{1}{r^{2}} + \frac{1}{2} \frac{\|II\|}{r} \right) \|\eta\|^{3}. \end{align*}
Now assume η ∈ K, that is, ∥η∥≤ r. Consider ϕ(t) = Retrx(tη) (a curve on $$\mathcal{M}$$) and let ϕ′′ denote its acceleration on $$\mathcal{M}$$ and $$\ddot \phi $$ denote its acceleration in $$\mathcal{E}$$, while $$\dot \phi = \phi ^{\prime }$$ denotes velocity along the curve. It holds that $$\ddot \phi (t) = \phi ^{\prime \prime }(t) + II(\dot \phi (t), \dot \phi (t))$$ (O'Neill, 1983, Cor. 4.9). Since Retr is a second-order retraction, acceleration on $$\mathcal{M}$$ is zero at t = 0, that is, ϕ′′(0) = 0, so that ϕ(0) = x, $$\dot \phi (0) = \eta $$ and $$\ddot \phi (0) = II(\eta , \eta )$$. Then by Taylor expansion of ϕ in $$\mathcal{E}$$,
\begin{align*} y = \textrm{Retr}_{x}(\eta) = \phi(1) = x + \eta + \frac{1}{2} II(\eta,\eta) + R_{3}(\eta), \end{align*}
where
\begin{align*} \| R_{3}(\eta) \| = \left\| {\int_{0}^{1}} \frac{(1-t)^{2}}{2} \dddot \phi(t)\ \mathrm{d}t \right\| \leq \frac{1}{6} \max_{\xi\in K} \left\| \mathrm{D}^{3} \textrm{Retr}(\xi) \right\| \|\eta\|^{3}. \end{align*}
The combined arguments ensure existence of a constant γ, independent of x and η, such that
\begin{align*} \left\|y-x-\eta-\frac{1}{2}II(\eta,\eta)\right\| \leq \gamma\|\eta\|^{3}. \end{align*}
Combining, we find that for all $$x\in \mathcal{M}$$ and $$\eta \in \mathrm{T}_{x}\mathcal{M}$$,
\begin{align*} \left|\, f(\textrm{Retr}_{x}(\eta)) - \left[ f(x) + \left\langle{\textrm{grad}\ f(x)},{\eta}\right\rangle + \frac{1}{2}\left\langle{\eta},{\textrm{Hess}\ f(x)[\eta]}\right\rangle \right] \right| \leq \left(\frac{L}{6}\alpha^{3} + \frac{H\beta(\alpha+3)}{2} + G\gamma\right) \|\eta\|^{3}. \end{align*}
Since Retr is a second-order retraction, Hess f(x) coincides with the Hessian of the pullback f ∘ Retrx (Lemma 3.9). This establishes Assumption 3.2.

Appendix C. Proof of Lemma 2.7 about Armijo line-search

Proof of Lemma 2.7. By Assumption 2.6, upper bound (2.5) holds with $$\eta = t{\eta _{k}^{0}}$$ for any t such that ∥η∥≤ ϱk:
\begin{align} f(x_{k}) - f\left(\textrm{Retr}_{x_{k}}\left(t \cdot{\eta_{k}^{0}}\right)\right) \geq t\left\langle{-\textrm{grad}\ f(x_{k})},{{\eta_{k}^{0}}}\right\rangle - \frac{Lt^{2}}{2} \left\|{\eta_{k}^{0}}\right\|^{2}. \end{align} (C.1)
We determine a sufficient condition on t for the stopping criterion in Algorithm 2 to trigger.
To this end observe that the right-hand side of (C.1) dominates $$c_{1} t \left \langle{-\textrm{grad}\ f(x_{k})},{{\eta _{k}^{0}}}\right \rangle $$ if \begin{align*} t(1-c_{1}) \cdot \left\langle{-\textrm{grad}\ f(x_{k})},{{\eta_{k}^{0}}}\right\rangle \geq \frac{Lt^{2}}{2} \left\|{\eta_{k}^{0}}\right\|^{2}. \end{align*} Thus, the stopping criterion in Algorithm 2 is satisfied in particular for all t in \begin{align*} \left[0, \frac{2(1-c_{1})\left\langle{-\textrm{grad} \ f(x_{k})},{{\eta_{k}^{0}}}\right\rangle}{L_{g}\|{\eta_{k}^{0}}\|^{2}}\right] \supseteq \left[ 0, \frac{2c_{2} (1-c_{1})\|\textrm{grad}\ f(x_{k})\|}{L_{g}\|{\eta_{k}^{0}}\|}\right] \supseteq \left[ 0, \frac{2c_{2} (1-c_{1})}{c_{4}L_{g}}\right]. \end{align*} Unless it equals $${\bar t}_{k}$$, the returned t cannot be smaller than τ times the last upper bound. In all cases the cost decrease satisfies \begin{align*} f(x_{k}) - f\left(\textrm{Retr}_{x_{k}}\left(t \cdot{\eta_{k}^{0}}\right)\right) & \geq c_{1} t \left\langle{-\textrm{grad}\ f(x_{k})},{{\eta_{k}^{0}}}\right\rangle \\ & \geq c_{1}c_{2}t\|\textrm{grad}\ f(x_{k})\|\left\|{\eta_{k}^{0}}\right\| \\ & \geq c_{1}c_{2}c_{3}t\|\textrm{grad}\ f(x_{k})\|^{2}. \end{align*} To count the number of iterations consider that checking whether $$t ={\bar t}_{k}$$ satisfies the stopping criterion requires one cost evaluation. Following that, t is reduced by a factor τ exactly $$\log _{\tau }(t/{\bar t}_{k}) = \log _{\tau ^{-1}}({\bar t}_{k}/t)$$ times, each followed by one cost evaluation. Appendix D. Proofs for Section 3.5 about Hk and the Hessians Proof of Lemma 3.9 The Hessian of f and that of the pullback are related by the following formulas. See (Absil et al., 2008, Chapter 5) for the precise meanings of the differential operators D and d. For all η in $$\mathrm{T}_{x}\mathcal{M}$$, writing $$\hat f_{x} = f\circ \textrm{Retr}_{x}$$ for convenience, \begin{align*} \frac{\mathrm{d}}{\mathrm{d}t} f(\textrm{Retr}_{x}(t\eta)) =& \left\langle{\textrm{grad}\ f(\textrm{Retr}_{x}(t\eta))},{\frac{\mathrm{D}}{\mathrm{d}t} \textrm{Retr}_{x}(t\eta)}\right\rangle, \\ \left\langle{\nabla^{2} \hat f_{x}(0_{x})[\eta]},{\eta}\right\rangle =& \left. \frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}} f(\textrm{Retr}_{x}(t\eta)) \right|{}_{t=0} \\ =& \left\langle{\textrm{Hess}\ f(x)\left[\mathrm{D}\textrm{Retr}_{x}(0_{x})[\eta]\right]},{\left. \frac{\mathrm{D}}{\mathrm{d}t} \textrm{Retr}_{x}(t\eta) \right|{}_{t=0}}\right\rangle \\ & + \left\langle{\textrm{grad}\ f(x)},{\left. \frac{\mathrm{D}^{2}}{\mathrm{d}t^{2}} \textrm{Retr}_{x}(t\eta) \right|{}_{t=0}}\right\rangle \\ =& \left\langle{\textrm{Hess}\ f(x)[\eta]},{\eta}\right\rangle + \left\langle{\textrm{grad}\ f(x)},{\left. \frac{\mathrm{D}^{2}}{\mathrm{d}t^{2}} \textrm{Retr}_{x}(t\eta) \right|{}_{t=0}}\right\rangle. \end{align*} (To get the third equality it is assumed one is working with the Levi-Civita connection, so that Hess f is indeed the Riemannian Hessian.) Since the acceleration of the retraction is bounded, we get the result via Cauchy–Schwarz. Proof of Proposition 3.11 Combine ∥grad f(xk)∥≤ εg and Hk ≽−εH Id with \begin{align*} \left\| \textrm{Hess}\ f(x_{k}) - \nabla^{2} \hat f_{x_{k}}(0_{x_{k}}) \right\| \leq a_{k} \cdot \|\textrm{grad}\ f(x_{k})\|\ \quad\textrm{and} \qquad \left\|\nabla^{2} \hat f_{k}(0_{x_{k}}) - H_{k}\right\| \leq \delta_{k} \end{align*} by the triangular inequality. Appendix E. Complexity dependence on n in the Max-Cut example This appendix supports Section 4. 
By Proposition 4.1, running Algorithm 3 with $$\varepsilon _{g} = \infty $$ and $$\varepsilon _{H} = \frac{2\delta }{n}$$ yields a solution Y within a gap δ from the optimal value of (4.2). Let $$\underline{f}$$ and $$\overline{f}$$ denote the minimal and maximal values of $$f(Y) = \left \langle{C},{YY^{\textrm{T} } }\right \rangle $$ over $$\mathcal{M}$$ (see (4.3)), respectively, with metric $$\left \langle{A},{B}\right \rangle = \textrm{Tr}\left (A^{\textrm{T} }\! B\right )$$ and associated Frobenius norm ∥⋅∥F. Then using ρ′ = 1/10, setting c3 = 1/2 in Assumption 3.8 as allowed by Lemma 3.4 and using the true Hessian of the pullbacks for Hk so that c1 = 0 in Assumption 3.5, Theorem 3.4 guarantees that Algorithm 3 returns an answer in at most \begin{align} 214(\overline{f} - \underline{f}) \cdot{L_{H}^{2}} \cdot \frac{1}{{\varepsilon_{H}^{3}}}\ \textrm{+ log term} \end{align} (E.1) iterations. Using the LDLT-factorization strategy of Lemma 3.3 with a randomly generated orthonormal basis at each tangent space encountered, since $$\dim \mathcal{M} = n^{2}$$ for p = n + 1, the cost of each iteration is $$\mathcal{O}\left (n^{6}\right )$$ arithmetic operations (dominated by the cost of the LDLT factorization). It remains to bound LH, in compliance with Assumption 3.2. Let $$g \colon{\mathbb{R}} \to{\mathbb{R}}$$ be defined as $$g(t) = f(\textrm{Retr}_{Y}(t\dot Y))$$. Then using a Taylor expansion, \begin{align} f(\textrm{Retr}_{Y}(\dot Y)) = g(1) = g(0) + g^{\prime}(0) + \frac{1}{2} g^{\prime\prime}(0) + \frac{1}{6} g^{\prime\prime\prime}(t) \end{align} (E.2) for some t ∈ (0, 1). Let $$\hat f_{Y} = f \circ \textrm{Retr}_{Y}$$. Definition 2.1 for retractions implies \begin{align} g(0) = f(Y), \qquad g^{\prime}(0) = \left\langle{\textrm{grad}\ f(Y)},{\dot Y}\right\rangle, \qquad g^{\prime\prime}(0) = \left\langle{\dot Y},{\nabla^{2} \hat f_{Y}(0_{Y})[\dot Y]}\right\rangle, \end{align} (E.3) so that it only remains to bound |g′′′(t)| uniformly over $$Y, \dot Y$$ and t ∈ [0, 1]. For this example it is easier to handle g′′′ if the retraction used is the exponential map (similar bounds can be obtained with the orthogonal projection retraction; see Mei et al., 2017, Lemmas 4 and 5). This map is known in explicit form and is cheap to compute for the sphere $$\mathbb{S}^{n} = \left \{ x \in{\mathbb{R}}^{n+1} : x^{\textrm{T} }\! x = 1 \right \}$$. Indeed, if $$x \in \mathbb{S}^{n}$$ and $$\eta \in \mathrm{T}_{x}\mathbb{S}^{n}$$, following Absil et al. (2008, Ex. 5.4.1), \begin{align} \gamma(t) = \textrm{Exp}_{x}(t\eta) = \cos(t\|\eta\|) x + \sin(t\|\eta\|) \frac{1}{\|\eta\|} \eta. \end{align} (E.4) Conceiving of γ as a map from $${\mathbb{R}}$$ to $${\mathbb{R}}^{n+1}$$ its differentials are easily derived: \begin{align} \dot \gamma(t) = -\|\eta\|\sin(t\|\eta\|) x + \cos(t\|\eta\|) \eta, \qquad \ddot \gamma(t) = - \|\eta\|^{2} \gamma(t), \qquad \dddot \gamma(t) = - \|\eta\|^{2} \dot \gamma(t). \end{align} (E.5) Extending this map rowwise gives the exponential map for $$\mathcal{M}$$—of course, this is a second-order retraction. We define $$\varPhi (t) = \textrm{Retr}_{Y}(t\dot Y)$$ and $$g(t) = f(\textrm{Retr}_{Y}(t \dot Y)) = \left \langle{C\varPhi (t)},{\varPhi (t)}\right \rangle $$. In particular $$\ddot \varPhi (t) = -D\varPhi (t)$$ and $$\dddot \varPhi (t) = -D\dot \varPhi (t)$$, where $$D = \textrm{diag}\left (\|\dot y_{1}\|^{2}, \ldots , \|\dot y_{n}\|^{2}\right )$$ and $$\dot y_{k}^{\textrm{T} }\! $$ is the kth row of $$\dot Y$$. 
As a result, for a given Y and $$\dot Y$$, a little bit of calculus gives
\begin{align} g^{\prime\prime\prime}(t) = -6\left\langle{C\dot\varPhi(t)},{D\varPhi(t)}\right\rangle - 2\left\langle{C\varPhi(t)},{D\dot\varPhi(t)}\right\rangle. \end{align} (E.6)
Using Cauchy–Schwarz multiple times, as well as the inequality $$\|{AB}\|_{\mathrm{F}} \leq \left \|{A}\right \|_{\mathrm{2}}\|{B}\|_{\mathrm{F}}$$ where $$\left \|{A}\right \|_{\mathrm{2}}$$ denotes the largest singular value of A, and using that $$\|{\varPhi (t)}\|_{\mathrm{F}} = \sqrt{n}$$ and $$\|{\dot \varPhi (t)}\|_{\mathrm{F}} = \|{\dot Y}\|_{\mathrm{F}}$$ for all t, and additionally that $$\left \|{D}\right \|_{\mathrm{2}} \leq \textrm{Tr}(D) = \|{\dot Y}\|_{\mathrm{F}}^{2}$$, it follows that
\begin{align} \sup_{Y\in\mathcal{M}, \dot{Y}\in\mathrm{T}_{Y}\mathcal{M}, \dot Y\neq 0, t\in(0, 1)} \frac{|g^{\prime\prime\prime}(t)|}{\|{\dot Y}\|_{\mathrm{F}}^{3}} \leq 8\left\|{C}\right\|_{\mathrm{2}}\sqrt{n}. \end{align} (E.7)
As a result an acceptable constant LH for Assumption 3.2 is $$L_{H} = 8\left \|{C}\right \|_{\mathrm{2}} \sqrt{n}.$$ Combining all the statements of this section it follows that a solution Y within an absolute gap δ of the optimal value can be obtained for problem (4.2) using Algorithm 3 in at most $$\mathcal{O}\big ( (\overline{f} - \underline{f}) \left \|{C}\right \|_{\mathrm{2}}^{2} \cdot n^{10} \cdot \frac{1}{\delta ^{3}} \big )$$ arithmetic operations, neglecting the additive logarithmic term. Note that, following Mei et al. (2017, Appendix A.2, points 1 and 2), it is also possible to bound LH as $$6\left \|{C}\right \|_{\mathrm{2}} + 2\|C\|_{1}$$, where ∥⋅∥1 is the ℓ1 operator norm. This reduces the explicit dependence on n from $$n^{10}$$ to $$n^{9}$$ in the bound on the total amount of work.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.