On stochastic accelerated gradient with convergence rate

1 Introduction

Large-scale machine learning problems are becoming ubiquitous in science, engineering, government, business, and almost all other areas. Faced with huge data sets, investigators typically prefer algorithms that process each observation only once, or at most a few times. Stochastic approximation (SA) algorithms such as stochastic gradient descent (SGD), although introduced more than 60 years ago [1], are still widely used and studied in many contexts (see [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]).

To our knowledge, Robbins and Monro [1] first proposed SA for the gradient descent method. Since then, SA algorithms have been widely used in stochastic optimization and machine learning. Polyak [2] and Polyak and Juditsky [3] developed an important improvement of the SA method by using longer stepsizes with averaging of the obtained iterates. The mirror-descent SA was introduced by Nemirovski et al. [6], who showed that it attains an unimprovable expected rate for solving nonstrongly convex programming (CP) problems. Shalev-Shwartz et al. [5] and Nemirovski et al. [6] studied averaged SGD and achieved the rate $O(1/\mu n)$ in the strongly convex case, but only $O(1/\sqrt{n})$ in the nonstrongly convex case. Bach and Moulines [10] analyzed SA algorithms that achieve a rate of $O(1/n)$ for least-square regression and logistic regression learning problems in the nonstrongly convex case. Their convergence rates for least-square regression and logistic regression are almost optimal; however, they require assumptions (A1)–(A6). It is therefore natural to ask whether the $O(1/n)$ rate for least-square regression can be obtained under fewer assumptions. In this article, we consider an accelerated SA type learning algorithm for the least-square regression and logistic regression problems and achieve a rate of $O(1/n)$ for least-square regression learning under only assumptions (A1)–(A4) of [10]. For solving a class of CP problems, Nesterov presented the accelerated gradient method in a celebrated work [12]. The accelerated gradient method has since been generalized by Beck and Teboulle [13], Tseng [14], and Nesterov [15,16] to an emerging class of composite CP problems. In 2012, Lan [17] further showed that the accelerated gradient method is optimal for solving not only smooth CP problems but also general nonsmooth and stochastic CP problems. The accelerated stochastic approximation (AC-SA) algorithm was proposed by Ghadimi and Lan [18,19] by suitably modifying Nesterov's optimal method for smooth CP. They also developed a generic AC-SA algorithmic framework [20,21], which can be specialized to yield optimal or nearly optimal methods for strongly convex stochastic composite optimization problems. Motivated by these works, we analyze an accelerated SA algorithm that achieves a rate of $O(1/n)$ for the classical least-square regression and logistic regression problems, respectively.

Zhu [25] introduced Katyusha, a direct, primal-only stochastic gradient method, to address this issue.
It has a provably accelerated convergence rate in convex (offline) stochastic optimization, and it can be incorporated into a variance-reduction-based algorithm and speed it up, in terms of both sequential and parallel performance. A new gradient-based optimization approach that automatically adjusts the learning rate was proposed by Cao [26]. This approach can be used to design both nonadaptive and adaptive learning rates, and it offers an alternative way to tune the learning rate of the SGD algorithm besides the current nonadaptive learning rate methods (e.g., SGD, momentum, and Nesterov) and the adaptive learning rate methods (e.g., AdaGrad, AdaDelta, and Adam).

In this article, we consider minimizing a convex function $f$, defined on a closed convex set in Euclidean space, given by $f(\theta)=\frac{1}{2}\mathbb{E}[\ell(y,\langle\theta,x\rangle)]$, where $(x,y)\in X\times\mathbb{R}$ denotes the sample data and $\ell$ denotes a loss function that is convex with respect to its second variable. This family of losses includes least-square regression and logistic regression. In the SA framework, $\mathbf{z}=\{z_i\}_{i=1}^{n}=\{(x_i,y_i)\}_{i=1}^{n}\in Z^{n}$ denotes a set of random samples drawn independently according to the unknown probability measure $\rho$, and the predictor defined by $\theta$ is updated after each pair is seen.

The rest of this article is organized as follows. In Section 2, we give a brief introduction to the accelerated gradient algorithm for least-square regression. In Section 3, we study the accelerated gradient algorithm for logistic regression. In Section 4, we compare our results with related work. Finally, we conclude the article with the obtained results.

2 The stochastic accelerated gradient algorithm for least-square regression

In this section, we consider the accelerated gradient algorithm for least-square regression. The novelty of this article is that our convergence result attains a nonasymptotic rate $O(1/n)$. To establish the convergence property of the stochastic accelerated gradient algorithm for the regression problem, we make the following assumptions:

(a) $\mathcal{F}$ is a $d$-dimensional Euclidean space, with $d\ge 1$.

(b) Let $(X,d)$ be a compact metric space and let $Y=\mathbb{R}$. Let $\rho$ be a probability distribution on $Z=\mathcal{F}\times Y$ and let $(X,Y)$ be a corresponding random variable.

(c) $\mathbb{E}\Vert x_k\Vert^{2}$ is finite, i.e., $\mathbb{E}\Vert x_k\Vert^{2}\le M$ for any $k\ge 1$.

(d) The global minimum of $f(\theta)=\frac{1}{2}\mathbb{E}[\langle\theta,x_k\rangle^{2}-2y_k\langle\theta,x_k\rangle]$ is attained at a certain $\theta^{\ast}\in\mathbb{R}^{d}$. Let $\xi_k=(y_k-\langle\theta^{\ast},x_k\rangle)x_k$ denote the residual. For any $k\ge 1$, we have $\mathbb{E}\xi_k=0$. We also assume that $\mathbb{E}\xi_k^{2}\le\sigma^{2}$ for every $k$, and we write $\bar\xi_k=\frac{1}{k}\sum_{i=1}^{k}\xi_i$.

Assumptions (a)–(d) are standard in SA (see, e.g., [9,10,22]). Compared with the work of Bach and Moulines [10], we do not need the conditions that the covariance operator $\mathcal{H}=\mathbb{E}(x_k\otimes x_k)$ is invertible for any $k\ge 1$, nor that it satisfies $\mathbb{E}[\xi_i\otimes\xi_i]\preccurlyeq\sigma^{2}\mathcal{H}$ and $\mathbb{E}(\Vert x_i\Vert^{2}x_k\otimes x_k)\preccurlyeq R^{2}\mathcal{H}$ for some positive number $R$.
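For illustration only (this sketch is not part of the original analysis, and all names in it are our own), the following Python snippet draws synthetic pairs $(x_k,y_k)$ with a known minimizer $\theta^{\ast}$, evaluates an empirical version of the quadratic objective in assumption (d), and forms the residual $\xi_k$; for standard Gaussian inputs in dimension $d$, taking $M=d$ is consistent with assumption (c).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_star = rng.normal(size=d)          # minimizer of assumption (d), known here by construction

def sample_pair():
    """One observation (x, y) with y = <theta_star, x> + noise."""
    x = rng.normal(size=d)               # E||x||^2 = d, so M = d satisfies assumption (c)
    y = x @ theta_star + 0.1 * rng.normal()
    return x, y

def f_hat(theta, pairs):
    """Empirical version of f(theta) = (1/2) E[<theta, x>^2 - 2 y <theta, x>]."""
    return np.mean([0.5 * (x @ theta) ** 2 - y * (x @ theta) for x, y in pairs])

def residual(x, y):
    """xi = (y - <theta_star, x>) x from assumption (d); E[xi] = 0 for this data model."""
    return (y - theta_star @ x) * x

pairs = [sample_pair() for _ in range(2000)]
print(f_hat(theta_star, pairs), f_hat(np.zeros(d), pairs))  # theta_star should give the smaller value
```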
The accelerated gradient algorithm for least-square regression is stated as follows. Let $\theta_0\in\mathcal{F}$, let $\{\alpha_k\}$ satisfy $\alpha_1=1$ and $\alpha_k>0$ for any $k\ge 2$, and let $\beta_k>0$ and $\lambda_k>0$.

(i) Set the initial $\theta_0^{ag}=\theta_0$ and
$$\theta_k^{md}=(1-\alpha_k)\theta_{k-1}^{ag}+\alpha_k\theta_{k-1}.\tag{1}$$

(ii) Set
$$\theta_k=\theta_{k-1}-\lambda_k\nabla f(\theta_k^{md})=\theta_{k-1}-\lambda_k\{\mathbb{E}(\langle\theta_k^{md},x_k\rangle x_k-y_k x_k)\},\tag{2}$$
$$\theta_k^{ag}=\theta_k^{md}-\beta_k(\nabla f(\theta_k^{md})+\bar\xi_k)=\theta_k^{md}-\beta_k\{\mathbb{E}(\langle\theta_k^{md},x_k\rangle x_k-y_k x_k)+\bar\xi_k\}.\tag{3}$$

(iii) Set $k\leftarrow k+1$ and go to step (i).
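The following minimal NumPy sketch (ours, not from the original article) implements scheme (1)–(3) as it would be run on a data stream: the expectation in (2)–(3) is replaced by the per-sample unbiased gradient estimate $\langle\theta_k^{md},x_k\rangle x_k-y_k x_k$ discussed in Section 4, and the averaged residual $\bar\xi_k$ of assumption (d) is formed using a known $\theta^{\ast}$, as in a synthetic experiment; the name accelerated_sa_least_squares and the callable step-size arguments are our own conventions.

```python
import numpy as np

def accelerated_sa_least_squares(stream, theta0, alphas, betas, lambdas, theta_star):
    """Sketch of scheme (1)-(3) for least-square regression.

    stream                 : iterable of pairs (x_k, y_k)
    alphas, betas, lambdas : callables k -> alpha_k, beta_k, lambda_k
    theta_star             : minimizer of assumption (d), needed to form the residual xi_k
    """
    theta = np.array(theta0, dtype=float)        # theta_k
    theta_ag = theta.copy()                      # theta_k^{ag}
    xi_sum = np.zeros_like(theta)                # running sum of residuals xi_i
    for k, (x, y) in enumerate(stream, start=1):
        a, b, lam = alphas(k), betas(k), lambdas(k)
        theta_md = (1 - a) * theta_ag + a * theta            # step (1)
        grad = (theta_md @ x) * x - y * x                    # unbiased estimate of grad f(theta_md)
        theta = theta - lam * grad                           # step (2)
        xi_sum += (y - theta_star @ x) * x                   # residual xi_k of assumption (d)
        xi_bar = xi_sum / k                                  # averaged residual
        theta_ag = theta_md - b * (grad + xi_bar)            # step (3)
    return theta_ag
```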
To establish the convergence rate of the accelerated gradient algorithm, we need the following lemma (see Lemma 1 of [7]).

Lemma 1. Let $\alpha_k$ be the stepsizes in the accelerated gradient algorithm and let the sequence $\{\eta_k\}$ satisfy
$$\eta_k=(1-\alpha_k)\eta_{k-1}+\tau_k,\qquad k=1,2,\ldots,$$
where
$$\Gamma_k=\begin{cases}1, & k=1,\\ (1-\alpha_k)\Gamma_{k-1}, & k\ge 2.\end{cases}\tag{4}$$
Then we have $\eta_k\le\Gamma_k\sum_{i=1}^{k}\frac{\tau_i}{\Gamma_i}$ for any $k\ge 1$.

We now establish the convergence rate of the developed algorithm. The goal is to bound the expectation $\mathbb{E}[f(\theta_n^{ag})-f(\theta^{\ast})]$. Theorem 1 describes the convergence property of the accelerated gradient algorithm for least-square regression.

Theorem 1. Let $\{\theta_k^{md},\theta_k^{ag}\}$ be computed by the accelerated gradient algorithm and let $\Gamma_k$ be defined in (4). Assume (a)–(d). If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that
$$\alpha_k\lambda_k\le\beta_k\le\frac{1}{2M},\qquad\frac{\alpha_1}{\lambda_1\Gamma_1}\ge\frac{\alpha_2}{\lambda_2\Gamma_2}\ge\cdots,$$
then for any $n\ge 1$, we have
$$\mathbb{E}[f(\theta_n^{ag})-f(\theta^{\ast})]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta^{\ast}\Vert^{2}+M\sigma^{2}\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^{2}}{k\Gamma_k}.$$

Proof. By the Taylor expansion of $f$ and (3), we have
$$\begin{aligned}
f(\theta_k^{ag})&=f(\theta_k^{md})+\langle\nabla f(\theta_k^{md}),\theta_k^{ag}-\theta_k^{md}\rangle+\tfrac{1}{2}(\theta_k^{ag}-\theta_k^{md})^{T}\nabla^{2}f(\theta_k^{md})(\theta_k^{ag}-\theta_k^{md})\\
&\le f(\theta_k^{md})-\beta_k\Vert\nabla f(\theta_k^{md})\Vert^{2}-\beta_k\langle\nabla f(\theta_k^{md}),\bar\xi_k\rangle+\beta_k^{2}\,\mathbb{E}\Vert x_k\Vert^{2}\,\Vert\nabla f(\theta_k^{md})+\bar\xi_k\Vert^{2}\\
&\le f(\theta_k^{md})-\beta_k\Vert\nabla f(\theta_k^{md})\Vert^{2}-\beta_k\langle\nabla f(\theta_k^{md}),\bar\xi_k\rangle+\beta_k^{2}M\Vert\nabla f(\theta_k^{md})+\bar\xi_k\Vert^{2},
\end{aligned}$$
where the last inequality follows from assumption (c).

Since
$$f(\mu)-f(\nu)=\langle\nabla f(\nu),\mu-\nu\rangle+\tfrac{1}{2}(\mu-\nu)^{T}\mathbb{E}(x_k x_k^{T})(\mu-\nu),$$
we have
$$f(\nu)-f(\mu)=\langle\nabla f(\nu),\nu-\mu\rangle-\tfrac{1}{2}(\mu-\nu)^{T}\mathbb{E}(x_k x_k^{T})(\mu-\nu)\le\langle\nabla f(\nu),\nu-\mu\rangle,\tag{5}$$
where the inequality follows from the positive semidefiniteness of the matrix $\mathbb{E}(x_k x_k^{T})$.

By (1) and (5), we have
$$\begin{aligned}
f(\theta_k^{md})-[(1-\alpha_k)f(\theta_{k-1}^{ag})+\alpha_k f(\theta)]&=\alpha_k[f(\theta_k^{md})-f(\theta)]+(1-\alpha_k)[f(\theta_k^{md})-f(\theta_{k-1}^{ag})]\\
&\le\alpha_k\langle\nabla f(\theta_k^{md}),\theta_k^{md}-\theta\rangle+(1-\alpha_k)\langle\nabla f(\theta_k^{md}),\theta_k^{md}-\theta_{k-1}^{ag}\rangle\\
&=\langle\nabla f(\theta_k^{md}),\alpha_k(\theta_k^{md}-\theta)+(1-\alpha_k)(\theta_k^{md}-\theta_{k-1}^{ag})\rangle\\
&=\alpha_k\langle\nabla f(\theta_k^{md}),\theta_{k-1}-\theta\rangle.
\end{aligned}$$
So we obtain
$$\begin{aligned}
f(\theta_k^{ag})&\le(1-\alpha_k)f(\theta_{k-1}^{ag})+\alpha_k f(\theta)+\alpha_k\langle\nabla f(\theta_k^{md}),\theta_{k-1}-\theta\rangle\\
&\quad-\beta_k\Vert\nabla f(\theta_k^{md})\Vert^{2}-\beta_k\langle\nabla f(\theta_k^{md}),\bar\xi_k\rangle+\beta_k^{2}M\Vert\nabla f(\theta_k^{md})+\bar\xi_k\Vert^{2}.
\end{aligned}$$
It follows from (2) that
$$\Vert\theta_k-\theta\Vert^{2}=\Vert\theta_{k-1}-\lambda_k\nabla f(\theta_k^{md})-\theta\Vert^{2}=\Vert\theta_{k-1}-\theta\Vert^{2}-2\lambda_k\langle\nabla f(\theta_k^{md}),\theta_{k-1}-\theta\rangle+\lambda_k^{2}\Vert\nabla f(\theta_k^{md})\Vert^{2}.$$
Then, we have
$$\langle\nabla f(\theta_k^{md}),\theta_{k-1}-\theta\rangle=\frac{1}{2\lambda_k}[\Vert\theta_{k-1}-\theta\Vert^{2}-\Vert\theta_k-\theta\Vert^{2}]+\frac{\lambda_k}{2}\Vert\nabla f(\theta_k^{md})\Vert^{2},\tag{6}$$
and meanwhile,
$$\Vert\nabla f(\theta_k^{md})+\bar\xi_k\Vert^{2}=\Vert\nabla f(\theta_k^{md})\Vert^{2}+\Vert\bar\xi_k\Vert^{2}+2\langle\nabla f(\theta_k^{md}),\bar\xi_k\rangle.\tag{7}$$
Combining (6) and (7) with the previous bound, we obtain
$$\begin{aligned}
f(\theta_k^{ag})&\le(1-\alpha_k)f(\theta_{k-1}^{ag})+\alpha_k f(\theta)+\frac{\alpha_k}{2\lambda_k}[\Vert\theta_{k-1}-\theta\Vert^{2}-\Vert\theta_k-\theta\Vert^{2}]\\
&\quad-\beta_k\left(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\right)\Vert\nabla f(\theta_k^{md})\Vert^{2}+M\beta_k^{2}\Vert\bar\xi_k\Vert^{2}+\langle\bar\xi_k,(2\beta_k^{2}M-\beta_k)\nabla f(\theta_k^{md})\rangle.
\end{aligned}$$
Subtracting $f(\theta)$ from both sides, this is equivalent to
$$\begin{aligned}
f(\theta_k^{ag})-f(\theta)&\le(1-\alpha_k)[f(\theta_{k-1}^{ag})-f(\theta)]+\frac{\alpha_k}{2\lambda_k}[\Vert\theta_{k-1}-\theta\Vert^{2}-\Vert\theta_k-\theta\Vert^{2}]\\
&\quad-\beta_k\left(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\right)\Vert\nabla f(\theta_k^{md})\Vert^{2}+M\beta_k^{2}\Vert\bar\xi_k\Vert^{2}+\langle\bar\xi_k,(2\beta_k^{2}M-\beta_k)\nabla f(\theta_k^{md})\rangle.
\end{aligned}$$
By using Lemma 1, we have
$$\begin{aligned}
f(\theta_n^{ag})-f(\theta)&\le\Gamma_n\sum_{k=1}^{n}\frac{\alpha_k}{2\lambda_k\Gamma_k}[\Vert\theta_{k-1}-\theta\Vert^{2}-\Vert\theta_k-\theta\Vert^{2}]-\Gamma_n\sum_{k=1}^{n}\frac{\beta_k}{\Gamma_k}\left(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\right)\Vert\nabla f(\theta_k^{md})\Vert^{2}\\
&\quad+\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^{2}M}{\Gamma_k}\Vert\bar\xi_k\Vert^{2}+\Gamma_n\sum_{k=1}^{n}\frac{1}{\Gamma_k}\langle\bar\xi_k,(2\beta_k^{2}M-\beta_k)\nabla f(\theta_k^{md})\rangle.
\end{aligned}$$
Since
$$\frac{\alpha_1}{\lambda_1\Gamma_1}\ge\frac{\alpha_2}{\lambda_2\Gamma_2}\ge\cdots,\qquad\alpha_1=\Gamma_1=1,$$
we have
$$\sum_{k=1}^{n}\frac{\alpha_k}{2\lambda_k\Gamma_k}[\Vert\theta_{k-1}-\theta\Vert^{2}-\Vert\theta_k-\theta\Vert^{2}]\le\frac{\alpha_1}{2\lambda_1\Gamma_1}\Vert\theta_0-\theta\Vert^{2}=\frac{1}{2\lambda_1}\Vert\theta_0-\theta\Vert^{2}.$$
So we obtain
$$f(\theta_n^{ag})-f(\theta)\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta\Vert^{2}+\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^{2}M}{\Gamma_k}\Vert\bar\xi_k\Vert^{2}+\Gamma_n\sum_{k=1}^{n}\frac{1}{\Gamma_k}\langle\bar\xi_k,(2\beta_k^{2}M-\beta_k)\nabla f(\theta_k^{md})\rangle,\tag{8}$$
where the inequality follows from the assumption $\alpha_k\lambda_k\le\beta_k\le\frac{1}{2M}$, which implies $1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\ge 0$. Under assumption (d), we have
$$\mathbb{E}\bar\xi_k=\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\xi_i=0,\qquad\mathbb{E}\bar\xi_k^{2}=\mathbb{E}\left(\frac{1}{k}\sum_{i=1}^{k}\xi_i\right)^{2}\le\frac{\sigma^{2}}{k}.$$
Taking expectations on both sides of inequality (8) with respect to $(x_i,y_i)$, we obtain, for any $\theta\in\mathbb{R}^{d}$,
$$\mathbb{E}[f(\theta_n^{ag})-f(\theta)]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta\Vert^{2}+M\sigma^{2}\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^{2}}{k\Gamma_k}.$$
Now, fixing $\theta=\theta^{\ast}$, we have
$$\mathbb{E}[f(\theta_n^{ag})-f(\theta^{\ast})]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta^{\ast}\Vert^{2}+M\sigma^{2}\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^{2}}{k\Gamma_k}.$$
This finishes the proof of Theorem 1. □

In the following, we apply the results of Theorem 1 to some particular selections of $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$.
We obtain the following corollary.

Corollary 1. Suppose that $\alpha_k$, $\beta_k$, and $\lambda_k$ in the accelerated gradient algorithm for regression learning are set to
$$\alpha_k=\frac{1}{k+1},\qquad\beta_k=\frac{1}{M(k+1)},\qquad\lambda_k=\frac{1}{2M}\qquad\forall k\ge 1.\tag{9}$$
Then for any $n\ge 1$, we have
$$\mathbb{E}[f(\theta_n^{ag})-f(\theta^{\ast})]\le\frac{M^{2}\Vert\theta_0-\theta^{\ast}\Vert^{2}+\sigma^{2}}{M(n+1)}.$$

Proof. In view of (4) and (9), we have for $k\ge 2$
$$\Gamma_k=(1-\alpha_k)\Gamma_{k-1}=\frac{k}{k+1}\times\frac{k-1}{k}\times\frac{k-2}{k-1}\times\cdots\times\frac{2}{3}\times\Gamma_1=\frac{2}{k+1}.$$
It is easy to verify that
$$\alpha_k\lambda_k=\frac{1}{2M(k+1)}\le\beta_k=\frac{1}{M(k+1)}\le\frac{1}{2M}$$
and that the sequence $\frac{\alpha_k}{\lambda_k\Gamma_k}$ is nonincreasing, so the conditions of Theorem 1 are satisfied. Then we obtain
$$\begin{aligned}
M\Gamma_n\sigma^{2}\sum_{k=1}^{n}\frac{\beta_k^{2}}{k\Gamma_k}&=\frac{2\sigma^{2}}{n+1}\sum_{k=1}^{n}\frac{\frac{M}{M^{2}(k+1)^{2}}}{\frac{2k}{k+1}}=\frac{\sigma^{2}}{M(n+1)}\sum_{k=1}^{n}\frac{1}{k(k+1)}\\
&=\frac{\sigma^{2}}{M(n+1)}\left\{1-\frac{1}{2}+\frac{1}{2}-\frac{1}{3}+\cdots+\frac{1}{n}-\frac{1}{n+1}\right\}\\
&\le\frac{\sigma^{2}}{M(n+1)}.
\end{aligned}$$
From the result of Theorem 1, we have
$$\mathbb{E}[f(\theta_n^{ag})-f(\theta^{\ast})]\le\frac{M}{n+1}\Vert\theta_0-\theta^{\ast}\Vert^{2}+\frac{\sigma^{2}}{M(n+1)}=\frac{M^{2}\Vert\theta_0-\theta^{\ast}\Vert^{2}+\sigma^{2}}{M(n+1)}.$$
The proof of Corollary 1 is completed. □

Corollary 1 shows that the developed algorithm achieves a convergence rate of $O(1/n)$ without strong convexity or Lipschitz-continuous-gradient assumptions.
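As a usage illustration (ours, with hypothetical names), the step sizes of Corollary 1 can be plugged into the accelerated_sa_least_squares sketch given after the algorithm in Section 2; for standard Gaussian inputs in dimension $d$, taking $M=d$ is consistent with assumption (c).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 20_000
theta_star = rng.normal(size=d)
M = float(d)                                   # E||x||^2 = d for standard Gaussian x
X = rng.normal(size=(n, d))
y = X @ theta_star + 0.1 * rng.normal(size=n)

theta_ag = accelerated_sa_least_squares(
    stream=zip(X, y),
    theta0=np.zeros(d),
    alphas=lambda k: 1.0 / (k + 1),            # Corollary 1: alpha_k = 1/(k+1)
    betas=lambda k: 1.0 / (M * (k + 1)),       # beta_k = 1/(M(k+1))
    lambdas=lambda k: 1.0 / (2.0 * M),         # lambda_k = 1/(2M)
    theta_star=theta_star,
)
print(np.linalg.norm(theta_ag - theta_star))   # distance to theta_star; expected to decrease as n grows
```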
3 The stochastic accelerated gradient algorithm for logistic regression

In this section, we consider the convergence property of the accelerated gradient algorithm for logistic regression. We make the following assumptions:

(B1) $\mathcal{F}$ is a $d$-dimensional Euclidean space, with $d\ge 1$.

(B2) The observations $(x_i,y_i)\in\mathcal{F}\times\{-1,1\}$ are independent and identically distributed.

(B3) $\mathbb{E}\Vert x_i\Vert^{2}$ is finite, i.e., $\mathbb{E}\Vert x_i\Vert^{2}\le M$ for any $i\ge 1$.

(B4) We consider $l(\theta)=\mathbb{E}[\log(1+\exp(-y_i\langle x_i,\theta\rangle))]$. We denote by $\theta^{\ast}\in\mathbb{R}^{d}$ a global minimizer of $l$, which we assume to exist. Let $\xi_i=(y_i-\langle\theta^{\ast},x_i\rangle)x_i$ denote the residual. For any $i\ge 1$, we have $\mathbb{E}\xi_i=0$. We also assume that $\mathbb{E}\xi_i^{2}\le\sigma^{2}$ for every $i$, and we write $\bar\xi_k=\frac{1}{k}\sum_{i=1}^{k}\xi_i$.

The algorithm is as follows. Let $\theta_0\in\mathcal{F}$, let $\{\alpha_k\}$ satisfy $\alpha_1=1$ and $\alpha_k>0$ for any $k\ge 2$, and let $\beta_k>0$ and $\lambda_k>0$.

(i) Set the initial $\theta_0^{ag}=\theta_0$ and
$$\theta_k^{md}=(1-\alpha_k)\theta_{k-1}^{ag}+\alpha_k\theta_{k-1}.\tag{10}$$

(ii) Set
$$\theta_k=\theta_{k-1}-\lambda_k\nabla l(\theta_k^{md})=\theta_{k-1}-\lambda_k\frac{-y_k\exp\{-y_k\langle x_k,\theta_k^{md}\rangle\}x_k}{1+\exp\{-y_k\langle x_k,\theta_k^{md}\rangle\}},\tag{11}$$
$$\theta_k^{ag}=\theta_k^{md}-\beta_k(\nabla l(\theta_k^{md})+\bar\xi_k)=\theta_k^{md}-\beta_k\left\{\frac{-y_k\exp\{-y_k\langle x_k,\theta_k^{md}\rangle\}x_k}{1+\exp\{-y_k\langle x_k,\theta_k^{md}\rangle\}}+\bar\xi_k\right\}.\tag{12}$$

(iii) Set $k\leftarrow k+1$ and go to step (i).

Theorem 2 describes the convergence property of the accelerated gradient algorithm for logistic regression.

Theorem 2. Let $\{\theta_k^{md},\theta_k^{ag}\}$ be computed by the accelerated gradient algorithm and let $\Gamma_k$ be defined in (4). Assume (B1)–(B4). If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that
$$\alpha_k\lambda_k\le\beta_k\le\frac{1}{2M},\qquad\frac{\alpha_1}{\lambda_1\Gamma_1}\ge\frac{\alpha_2}{\lambda_2\Gamma_2}\ge\cdots,$$
then for any $n\ge 1$, we have
$$\mathbb{E}[l(\theta_n^{ag})-l(\theta^{\ast})]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta^{\ast}\Vert^{2}+M\sigma^{2}\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^{2}}{k\Gamma_k}.$$

Proof. By the Taylor expansion of $l$, there exists a $\vartheta$ such that
$$\begin{aligned}
l(\theta_k^{ag})&=l(\theta_k^{md})+\langle\nabla l(\theta_k^{md}),\theta_k^{ag}-\theta_k^{md}\rangle+\tfrac{1}{2}(\theta_k^{ag}-\theta_k^{md})^{T}\nabla^{2}l(\vartheta)(\theta_k^{ag}-\theta_k^{md})\\
&=l(\theta_k^{md})-\beta_k\Vert\nabla l(\theta_k^{md})\Vert^{2}-\beta_k\langle\nabla l(\theta_k^{md}),\bar\xi_k\rangle+\tfrac{1}{2}(\theta_k^{ag}-\theta_k^{md})^{T}\,\mathbb{E}\!\left[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}}{(1+\exp\{-y_k\langle x_k,\vartheta\rangle\})^{2}}x_k x_k^{T}\right](\theta_k^{ag}-\theta_k^{md}).
\end{aligned}\tag{13}$$
It is easy to verify that the matrix $\mathbb{E}\!\left[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}}{(1+\exp\{-y_k\langle x_k,\vartheta\rangle\})^{2}}x_k x_k^{T}\right]$ is positive semidefinite and that its largest eigenvalue satisfies
$$\lambda_{\max}\!\left(\mathbb{E}\!\left[\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}}{(1+\exp\{-y_k\langle x_k,\vartheta\rangle\})^{2}}x_k x_k^{T}\right]\right)\le\mathbb{E}\Vert x_k\Vert^{2}\le M.$$
Combining (12) and (13), we have
$$l(\theta_k^{ag})\le l(\theta_k^{md})-\beta_k\Vert\nabla l(\theta_k^{md})\Vert^{2}-\beta_k\langle\nabla l(\theta_k^{md}),\bar\xi_k\rangle+\beta_k^{2}M\Vert\nabla l(\theta_k^{md})+\bar\xi_k\Vert^{2}.$$
Similar to (13), there exists a $\zeta\in\mathbb{R}^{d}$ satisfying
$$l(\mu)-l(\nu)=\langle\nabla l(\nu),\mu-\nu\rangle+\tfrac{1}{2}(\mu-\nu)^{T}\,\mathbb{E}\!\left[\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}}{(1+\exp\{-y_k\langle x_k,\zeta\rangle\})^{2}}x_k x_k^{T}\right](\mu-\nu),\qquad\mu,\nu\in\mathbb{R}^{d},$$
and hence
$$l(\nu)-l(\mu)=\langle\nabla l(\nu),\nu-\mu\rangle-\tfrac{1}{2}(\mu-\nu)^{T}\,\mathbb{E}\!\left[\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}}{(1+\exp\{-y_k\langle x_k,\zeta\rangle\})^{2}}x_k x_k^{T}\right](\mu-\nu)\le\langle\nabla l(\nu),\nu-\mu\rangle,$$
where the inequality follows from the positive semidefiniteness of the matrix in the quadratic term.

Similar to (5), we have
$$l(\theta_k^{md})-[(1-\alpha_k)l(\theta_{k-1}^{ag})+\alpha_k l(\theta)]\le\alpha_k\langle\nabla l(\theta_k^{md}),\theta_{k-1}-\theta\rangle.$$
So we obtain
$$l(\theta_k^{ag})\le(1-\alpha_k)l(\theta_{k-1}^{ag})+\alpha_k l(\theta)+\alpha_k\langle\nabla l(\theta_k^{md}),\theta_{k-1}-\theta\rangle-\beta_k\Vert\nabla l(\theta_k^{md})\Vert^{2}-\beta_k\langle\nabla l(\theta_k^{md}),\bar\xi_k\rangle+\beta_k^{2}M\Vert\nabla l(\theta_k^{md})+\bar\xi_k\Vert^{2}.$$
It follows from (11) that
$$\Vert\theta_k-\theta\Vert^{2}=\Vert\theta_{k-1}-\lambda_k\nabla l(\theta_k^{md})-\theta\Vert^{2}=\Vert\theta_{k-1}-\theta\Vert^{2}-2\lambda_k\langle\nabla l(\theta_k^{md}),\theta_{k-1}-\theta\rangle+\lambda_k^{2}\Vert\nabla l(\theta_k^{md})\Vert^{2}.$$
Then, we have
$$\langle\nabla l(\theta_k^{md}),\theta_{k-1}-\theta\rangle=\frac{1}{2\lambda_k}[\Vert\theta_{k-1}-\theta\Vert^{2}-\Vert\theta_k-\theta\Vert^{2}]+\frac{\lambda_k}{2}\Vert\nabla l(\theta_k^{md})\Vert^{2},\tag{14}$$
and meanwhile,
$$\Vert\nabla l(\theta_k^{md})+\bar\xi_k\Vert^{2}=\Vert\nabla l(\theta_k^{md})\Vert^{2}+\Vert\bar\xi_k\Vert^{2}+2\langle\nabla l(\theta_k^{md}),\bar\xi_k\rangle.\tag{15}$$
Combining (14) and (15) with the previous bound, we obtain
$$\begin{aligned}
l(\theta_k^{ag})&\le(1-\alpha_k)l(\theta_{k-1}^{ag})+\alpha_k l(\theta)+\frac{\alpha_k}{2\lambda_k}[\Vert\theta_{k-1}-\theta\Vert^{2}-\Vert\theta_k-\theta\Vert^{2}]\\
&\quad-\beta_k\left(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\right)\Vert\nabla l(\theta_k^{md})\Vert^{2}+M\beta_k^{2}\Vert\bar\xi_k\Vert^{2}+\langle\bar\xi_k,(2\beta_k^{2}M-\beta_k)\nabla l(\theta_k^{md})\rangle.
\end{aligned}$$
Subtracting $l(\theta)$ from both sides, this is equivalent to
$$\begin{aligned}
l(\theta_k^{ag})-l(\theta)&\le(1-\alpha_k)[l(\theta_{k-1}^{ag})-l(\theta)]+\frac{\alpha_k}{2\lambda_k}[\Vert\theta_{k-1}-\theta\Vert^{2}-\Vert\theta_k-\theta\Vert^{2}]\\
&\quad-\beta_k\left(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\right)\Vert\nabla l(\theta_k^{md})\Vert^{2}+M\beta_k^{2}\Vert\bar\xi_k\Vert^{2}+\langle\bar\xi_k,(2\beta_k^{2}M-\beta_k)\nabla l(\theta_k^{md})\rangle.
\end{aligned}$$
By using Lemma 1, we have
$$\begin{aligned}
l(\theta_n^{ag})-l(\theta)&\le\Gamma_n\sum_{k=1}^{n}\frac{\alpha_k}{2\lambda_k\Gamma_k}[\Vert\theta_{k-1}-\theta\Vert^{2}-\Vert\theta_k-\theta\Vert^{2}]-\Gamma_n\sum_{k=1}^{n}\frac{\beta_k}{\Gamma_k}\left(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\right)\Vert\nabla l(\theta_k^{md})\Vert^{2}\\
&\quad+\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^{2}M}{\Gamma_k}\Vert\bar\xi_k\Vert^{2}+\Gamma_n\sum_{k=1}^{n}\frac{1}{\Gamma_k}\langle\bar\xi_k,(2\beta_k^{2}M-\beta_k)\nabla l(\theta_k^{md})\rangle.
\end{aligned}$$
Since
$$\frac{\alpha_1}{\lambda_1\Gamma_1}\ge\frac{\alpha_2}{\lambda_2\Gamma_2}\ge\cdots,\qquad\alpha_1=\Gamma_1=1,$$
we have
$$\sum_{k=1}^{n}\frac{\alpha_k}{2\lambda_k\Gamma_k}[\Vert\theta_{k-1}-\theta\Vert^{2}-\Vert\theta_k-\theta\Vert^{2}]\le\frac{\alpha_1}{2\lambda_1\Gamma_1}\Vert\theta_0-\theta\Vert^{2}=\frac{1}{2\lambda_1}\Vert\theta_0-\theta\Vert^{2}.$$
So we obtain
$$l(\theta_n^{ag})-l(\theta)\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta\Vert^{2}+\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^{2}M}{\Gamma_k}\Vert\bar\xi_k\Vert^{2}+\Gamma_n\sum_{k=1}^{n}\frac{1}{\Gamma_k}\langle\bar\xi_k,(2\beta_k^{2}M-\beta_k)\nabla l(\theta_k^{md})\rangle,\tag{16}$$
where the inequality follows from the assumption $\alpha_k\lambda_k\le\beta_k\le\frac{1}{2M}$, which implies $1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\ge 0$. Under assumption (B4), we have
$$\mathbb{E}\bar\xi_k=\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\xi_i=0,\qquad\mathbb{E}\bar\xi_k^{2}=\mathbb{E}\left(\frac{1}{k}\sum_{i=1}^{k}\xi_i\right)^{2}\le\frac{\sigma^{2}}{k}.$$
Taking expectations on both sides of inequality (16) with respect to $(x_i,y_i)$, we obtain, for any $\theta\in\mathbb{R}^{d}$,
$$\mathbb{E}[l(\theta_n^{ag})-l(\theta)]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta\Vert^{2}+M\sigma^{2}\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^{2}}{k\Gamma_k}.$$
Now, fixing $\theta=\theta^{\ast}$, we have
$$\mathbb{E}[l(\theta_n^{ag})-l(\theta^{\ast})]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta^{\ast}\Vert^{2}+M\sigma^{2}\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^{2}}{k\Gamma_k}.$$
This finishes the proof of Theorem 2. □

Similar to Corollary 1, we specialize the results of Theorem 2 to particular selections of $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$.

Corollary 2. Suppose that $\alpha_k$, $\beta_k$, and $\lambda_k$ in the accelerated gradient algorithm for logistic regression learning are set to
$$\alpha_k=\frac{1}{k+1},\qquad\beta_k=\frac{1}{M(k+1)},\qquad\lambda_k=\frac{1}{2M}\qquad\forall k\ge 1.$$
Then for any $n\ge 1$, we have
$$\mathbb{E}[l(\theta_n^{ag})-l(\theta^{\ast})]\le\frac{M^{2}\Vert\theta_0-\theta^{\ast}\Vert^{2}+\sigma^{2}}{M(n+1)}.$$
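A minimal NumPy sketch of scheme (10)–(12) is given below (ours, not from the original article): labels are in $\{-1,+1\}$ as in (B2), the per-sample gradient of the logistic loss replaces the expectation, and the residual of (B4) is formed with a known $\theta^{\ast}$, as in a synthetic experiment; accelerated_sa_logistic is a hypothetical name.

```python
import numpy as np

def accelerated_sa_logistic(stream, theta0, alphas, betas, lambdas, theta_star):
    """Sketch of scheme (10)-(12) for logistic regression with labels y in {-1, +1}."""
    theta = np.array(theta0, dtype=float)        # theta_k
    theta_ag = theta.copy()                      # theta_k^{ag}
    xi_sum = np.zeros_like(theta)                # running sum of residuals from (B4)
    for k, (x, y) in enumerate(stream, start=1):
        a, b, lam = alphas(k), betas(k), lambdas(k)
        theta_md = (1 - a) * theta_ag + a * theta                # step (10)
        # gradient of log(1 + exp(-y <x, theta_md>)):
        #   -y x exp(-y<x,theta>) / (1 + exp(-y<x,theta>)) = -y x / (1 + exp(y<x,theta>))
        grad = -y * x / (1.0 + np.exp(y * (x @ theta_md)))
        theta = theta - lam * grad                               # step (11)
        xi_sum += (y - theta_star @ x) * x                       # residual of (B4)
        xi_bar = xi_sum / k
        theta_ag = theta_md - b * (grad + xi_bar)                # step (12)
    return theta_ag
```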
4 Comparisons with related work

In Sections 2 and 3, we have studied AC-SA type algorithms for the least-square regression and logistic regression learning problems, respectively. We have derived upper bounds for these AC-SA learning algorithms by using the convexity of the objective function. In this section, we discuss how our results relate to other recent studies.

4.1 Comparison with convergence rates for stochastic optimization

Our convergence analysis of SA learning algorithms is based on a similar analysis for stochastic composite optimization by Ghadimi and Lan [8]. There are two differences between our work and theirs. The first difference is that our convergence analysis of SA algorithms holds for any iteration rather than for a fixed iteration limit; that is, the parameters $\beta_k,\lambda_k$ of Corollary 3 in [8] depend on the iteration limit $N$, whereas we do not need this assumption. The second difference lies in the error bounds: Ghadimi and Lan obtained a rate of $O(1/\sqrt{n})$ for stochastic composite optimization, while we obtain the rate $O(1/n)$ for the regression problem.

Our developed accelerated stochastic gradient (SA) algorithm for least-square regression is summarized in (1)–(3). The algorithm takes a stream of data $(x_k,y_k)$ as input, together with an initial guess of the parameter $\theta_0$. The other requirements are $\{\alpha_k\}$, which satisfies $\alpha_1=1$ and $\alpha_k>0$ for any $k\ge 2$, $\beta_k>0$, and $\lambda_k>0$. The algorithm involves two intermediate variables, $\theta_k^{ag}$ (initialized to $\theta_0$) and $\theta_k^{md}$. In (1), $\theta_k^{md}$ is updated as a linear combination of $\theta_{k-1}^{ag}$ and the current parameter estimate $\theta_{k-1}$, with $\alpha_k$ as the coefficient. The parameter $\theta_k$ is then estimated in (2) with $\lambda_k$ as the stepsize. The residual $\xi_k$ and the average $\bar\xi_k$ of the residuals up to the $k$th data point (i.e., $\bar\xi_k=\frac{1}{k}\sum_{i=1}^{k}\xi_i$) are used in (3), where $\theta_k^{ag}$ is updated from $\theta_k^{md}$ with $\beta_k$ as the stepsize. The process continues whenever a new pair of data is seen.

The unbiased estimate of the gradient, i.e., $\langle\theta_k^{md},x_k\rangle x_k-y_k x_k$ for each data point $(x_k,y_k)$, is used in (2). From this perspective, the update of $\theta_k$ is actually the same as in the SGD (also called least-mean-square) algorithm if we set $\alpha_k=1$. Throughout training, the residual $\xi_k$ is computed, all residuals seen so far are averaged, and this averaged residual enters the update of $\theta_k^{ag}$. This differs from the stochastic accelerated gradient algorithm in [22], where no residual is computed or used during training.

4.2 Comparison with the work of Bach and Moulines

The work that is perhaps most closely related to ours is that of Bach and Moulines [10], who studied the SA problem in which a convex function has to be minimized given only unbiased estimates of its gradients at certain points, a framework that includes machine learning methods based on the minimization of the empirical risk. The sample setting considered by Bach and Moulines is similar to ours: the learner is given a sample set $\{(x_i,y_i)\}_{i=1}^{n}$, and the goal of the regression learning problem is to learn a linear function $\langle\theta,x\rangle$ that predicts the output for other inputs in $X$ drawn from the same distribution. Both we and Bach and Moulines obtain rates of $O(1/n)$ for SA algorithms for least-square regression, without strong-convexity assumptions.
To our knowledge, the convergence rate $O(1/n)$ is optimal for least-square regression and logistic regression. Although uniform convergence bounds for regression learning algorithms rely on assumptions on the input $x_k$ and the residual $\xi_k$, we have obtained the optimal upper bound $O(1/n)$ for stochastic learning algorithms, and the order of the upper bound is independent of the dimension of the input space. There are some important differences between our work and that of [10]. Bach and Moulines considered generalization properties of stochastic learning algorithms under the assumption that the covariance operator $\mathbb{E}(x_k\otimes x_k)$ is invertible. However, some covariance operators may not be invertible, such as the covariance operator $\mathbb{E}(x_k\otimes x_k)$ in $\mathbb{R}^{2}$, which is given by
$$\mathbb{E}(x_k\otimes x_k)=\begin{pmatrix}\mathbb{E}x_{k1}^{2} & \mathbb{E}x_{k1}x_{k2}\\ \mathbb{E}x_{k1}x_{k2} & \mathbb{E}x_{k2}^{2}\end{pmatrix}.$$
When the two random components $x_{k1}$ and $x_{k2}$ of $x_k$ satisfy $x_{k1}=x_{k2}$, the determinant of the covariance operator $\mathbb{E}(x_k\otimes x_k)$ equals zero. In contrast, under assumptions (a)–(d) alone, the rate of our algorithm still reaches $O(1/n)$.

5 Conclusion

In this article, we have considered two SA algorithms that achieve rates of $O(1/n)$ for least-square regression and logistic regression, respectively, without strong-convexity assumptions. Without strong convexity, we focus on problems for which well-known algorithms achieve a convergence rate for function values of $O(1/n)$. We consider and analyze an accelerated SA algorithm that achieves a rate of $O(1/n)$ for the classical least-square regression and logistic regression problems. Compared with the well-known results, we need fewer conditions to obtain this tight convergence rate for the least-square regression and logistic regression problems. For the accelerated SA algorithm, we provide a nonasymptotic analysis of the generalization error (in expectation) and experimentally study our theoretical analysis.

Open Mathematics, Volume 20 (1): 11 – Jan 1, 2022


Publisher
de Gruyter
Copyright
© 2022 Xingxing Zha et al., published by De Gruyter
ISSN
2391-5455
eISSN
2391-5455
DOI
10.1515/math-2022-0499
Publisher site
See Article on Publisher Site

Abstract

1IntroductionLarge-scale machine learning problems are becoming ubiquitous in science, engineering, government business, and almost all areas. Faced with huge data, investigators typically prefer algorithms that process each observation only once, or a few times. Stochastic approximation (SA) algorithms such as stochastic gradient descent (SGD), although introduced more than 60 years ago [1], still were widely used and studied method in some contexts (see [2,3, 4,5,6, 7,8,9, 10,11,12, 13,14,15, 16,17,18, 19,20,21, 22,23,24, 25,26]).To our knowledge, Robbins and Monro [1] first proposed the SA on the gradient descent method. From then on, SA algorithms were widely used in stochastic optimization and machine learning. Polyak [2] and Polyak and Juditsky [3] developed an important improvement of the SA method by using longer stepsizes with consequent averaging of the obtained iterates. The mirror-descent SA was demonstrated by Nemirovski et al. [6] who showed that the mirror-descent SA exhibited an unimprovable expected rate for solving nonstrongly convex programming (CP) problems. Shalev-Shwartz et al. [5] and Nemirovski et al. [6] studied averaged SGD and achieved the rate of O(1/μn)O\left(1\hspace{0.1em}\text{/}\hspace{0.1em}\mu n)in the strongly convex case, and they obtained only O(1/n)O\left(1\hspace{0.1em}\text{/}\hspace{-0.1em}\sqrt{n})in the non strongly convex case. Bach and Moulines [10] considered and analyzed SA algorithms that achieve a rate of O(1/n)O\left(1\hspace{0.1em}\text{/}\hspace{0.1em}n)for least-square regression and logistic regression learning problems in the non strongly-convex case. The convergence rate of the SA algorithm for least-square regression and logistic regression is almost optimal, respectively. However, they need some assumptions (A1–A6). It is natural to ask that the convergence rate for least-square regression is O(1/n)O\left(1\hspace{0.1em}\text{/}\hspace{0.1em}n)under fewer assumptions. In this article, we consider an accelerated SA type learning algorithm for solving the least-square regression and logistic regression problem and achieve a rate of O(1/n)O\left(1\hspace{0.1em}\text{/}\hspace{0.1em}n)for least-square regression learning problems under assumptions A1–A4 in [10]. For solving a class of CP problems, Nesterov presented the accelerated gradient method in a celebrated work [12]. Now, the accelerated gradient method has also been generalized by Beck and Teboulle [13], Tseng [14], Nesterov [15,16] to solve an emerging class of composite CP problems. In 2012, Lan [17] further showed that the accelerated gradient method is optimal for solving not only smooth CP problems but also general nonsmooth and stochastic CP problems. The accelerated stochastic approximation (AC-SA) algorithm was proposed by Ghadimi and Lan [18,19] using properly modifying Nesterov’s optimal method for smooth CP. Recently, they [20,21] also developed a generic AC-SA algorithmic framework, which can be specialized to yield optimal or nearly optimal methods for solving strongly convex stochastic composite optimization problems. Motivated by those mentioned jobs, we aim to consider and analyze an accelerated SA algorithm that achieves a rate of O(1/n)O\left(1\hspace{0.1em}\text{/}\hspace{0.1em}n)for classical least-square regression and logistic regression problems, respectively.Zhu [25] introduced Katyusha, a direct, primal-only stochastic gradient method to fix this issue. It has a provably accelerated convergence rate in convex (offline) stochastic optimization. 
It can be incorporated into a variance-reduction-based algorithm and speed it up, in terms of both sequential and parallel performance. A new gradient-based optimization approach by automatically adjusting the learning rate is proposed by Cao [26]. This approach can be applied to design nonadaptive learning rate and adaptive learning rate. This approach could be an alternative method to optimize the learning rate based on the SGD algorithm besides the current nonadaptive learning rate methods e.g. SGD, momentum, Nesterov and the adaptive learning rate methods, e.g., AdaGrad, AdaDelta, and Adam.In this article, we consider minimizing a convex function ff, which is defined on a closed convex set in Euclidean space, given by f(θ)=12E[ℓ(y,⟨θ,x⟩)]f\left(\theta )=\frac{1}{2}{\mathbb{E}}{[}\ell (y,\langle \theta ,x\rangle )], where (x,y)∈X×R\left(x,y)\in X\times {\mathbb{R}}denotes the sample data and ℓ\ell denotes a loss function that is convex with respect to the second variable. This loss function includes least-square regression and logistic regression. In the SA framework, z={zi}i=1n={(xi,yi)}i=1n∈Zn{\bf{z}}={\left\{{z}_{i}\right\}}_{i=1}^{n}={\left\{\left({x}_{i},{y}_{i})\right\}}_{i=1}^{n}\in {Z}^{n}denote a set of random samples, which are independently drawn according to the unknown probability measure ρ\rho and the predictor defined by θ\theta is updated after each pair is seen.The rest of this article is organized as follows. In Section 2, we give a brief introduction to the accelerated gradient algorithm for least-square regression. In Section 3, we study the accelerated gradient algorithm for logistic regression. In Section 4, we compare our results with the known related work. Finally, we conclude this article with the obtained results.2The stochastic accelerated gradient algorithm for least-square regressionIn this section, we consider the accelerated gradient algorithm for least-square regression. The novelty of this article is that our convergence result can obtain a nonasymptotic rate O(1/n)O\left(1\hspace{0.1em}\text{/}\hspace{0.1em}n). To give the convergence property of the stochastic accelerated gradient algorithm for the regression problem, we make the following assumptions: (a)ℱ{\mathcal{ {\mathcal F} }}is a dd-dimensional Euclidean space, with d≥1d\ge 1.(b)Let (X,d)\left(X,d)be a compact metric space and let Y=RY={\mathbb{R}}. Let ρ\rho be a probability distribution on Z=ℱ×YZ={\mathcal{ {\mathcal F} }}\times Yand (X,Y)\left(X,Y)be a corresponding random variable.(c)E‖xn‖2{\mathbb{E}}\Vert {x}_{n}{\Vert }^{2}is finite, i.e., E‖xk‖2≤M{\mathbb{E}}\Vert {x}_{k}{\Vert }^{2}\le Mfor any k≥1k\ge 1.(d)The global minimum of f(θ)=12E[⟨θ,xk⟩2−2yk⟨θ,xk⟩]f\left(\theta )=\frac{1}{2}{\mathbb{E}}\left[{\langle \theta ,{x}_{k}\rangle }^{2}-2{y}_{k}\langle \theta ,{x}_{k}\rangle ]is attained at a certain θ∗∈Rd{\theta }^{\ast }\in {{\mathbb{R}}}^{d}. Let ξk=(yk−⟨θ∗,xk⟩)xk{\xi }_{k}=({y}_{k}-\langle {\theta }^{\ast },{x}_{k}\rangle ){x}_{k}denote the residual. For any k≥1k\ge 1, we have Eξk=0{\mathbb{E}}{\xi }_{k}=0. We also assume that Eξk2≤σ2{\mathbb{E}}{\xi }_{k}^{2}\le {\sigma }^{2}for every kkand ξ¯k=1k∑i=1kξi{\overline{\xi }}_{k}=\frac{1}{k}{\sum }_{i=1}^{k}{\xi }_{i}.Assumptions (a)–(d) are standard in SA (see, e.g., [9,10,22]). 
Compared with the work of Bach and Moulines [10], we do not need the conditions that the covariance operator ℋ=E(xk⨂xk){\mathcal{ {\mathcal H} }}={\mathbb{E}}\left({x}_{k}\hspace{0.33em}\bigotimes \hspace{0.33em}{x}_{k})is invertible for any k≥1k\ge 1, and that the operator E(xk⨂xk){\mathbb{E}}\left({x}_{k}\hspace{0.33em}\bigotimes \hspace{0.33em}{x}_{k})satisfies E[ξi⨂ξi]≼σ2ℋ{\mathbb{E}}\left[{\xi }_{i}\hspace{0.33em}\bigotimes \hspace{0.33em}{\xi }_{i}]\preccurlyeq {\sigma }^{2}{\mathcal{ {\mathcal H} }}and E(‖xi‖2xk⨂xk)≼R2ℋ{\mathbb{E}}\left(\Vert {x}_{i}{\Vert }^{2}{x}_{k}\hspace{0.33em}\bigotimes \hspace{0.33em}{x}_{k})\preccurlyeq {R}^{2}{\mathcal{ {\mathcal H} }}for a positive number RR.Let x0∈ℱ{x}_{0}\in {\mathcal{ {\mathcal F} }}, {αk}\left\{{\alpha }_{k}\right\}satisfy α1=1{\alpha }_{1}=1and αk>0{\alpha }_{k}\gt 0for any k≥2k\ge 2, βk>0{\beta }_{k}\gt 0, and λk{\lambda }_{k}. (i)Set the initial θ0ag=θ0{\theta }_{0}^{ag}={\theta }_{0}and(1)θkmd=(1−αk)θk−1ag+αkθk−1.{\theta }_{k}^{md}=\left(1-{\alpha }_{k}){\theta }_{k-1}^{ag}+{\alpha }_{k}{\theta }_{k-1}.(ii)Set(2)θk=θk−1−λk∇f(θkmd)=θk−1−λk{E(⟨θkmd,xk⟩xk−ykxk)},{\theta }_{k}={\theta }_{k-1}-{\lambda }_{k}\nabla f\left({\theta }_{k}^{md})={\theta }_{k-1}-{\lambda }_{k}\{{\mathbb{E}}(\langle {\theta }_{k}^{md},{x}_{k}\rangle {x}_{k}-{y}_{k}{x}_{k})\},(3)θkag=θkmd−βk(∇f(θkmd)+ξ¯k)=θkmd−βk{E(⟨θkmd,xk⟩xk−ykxk)+ξ¯k}.{\theta }_{k}^{ag}={\theta }_{k}^{md}-{\beta }_{k}(\nabla f\left({\theta }_{k}^{md})+{\overline{\xi }}_{k})={\theta }_{k}^{md}-{\beta }_{k}\{{\mathbb{E}}(\langle {\theta }_{k}^{md},{x}_{k}\rangle {x}_{k}-{y}_{k}{x}_{k})+{\overline{\xi }}_{k}\}.(iii)Set k←k+1k\leftarrow k+1and go to step (i).To establish the convergence rate of the accelerated gradient algorithm, we need the following Lemma (see Lemma 1 of [7]).Lemma 1Let αk{\alpha }_{k}be the stepsizes in the accelerated gradient algorithm and the sequence {ηk}\left\{{\eta }_{k}\right\}satisfiesηk=(1−αk)ηk−1+τk,k=1,2,…,{\eta }_{k}=\left(1-{\alpha }_{k}){\eta }_{k-1}+{\tau }_{k},\hspace{1em}k=1,2,\ldots ,where(4)Γk=1,k=1,(1−αk)Γk−1,k≥2.{\Gamma }_{k}=\left\{\begin{array}{ll}1,& k=1,\\ \left(1-{\alpha }_{k}){\Gamma }_{k-1},& k\ge 2.\end{array}\right.Then we have ηk≤Γk∑i=1kτiΓi{\eta }_{k}\le {\Gamma }_{k}{\sum }_{i=1}^{k}\frac{{\tau }_{i}}{{\Gamma }_{i}}for any k≥1k\ge 1.We establish the convergence rate of the developed algorithm. The goal is to estimate the bound on the expectation E[f(θnag)−f(θ∗)]{\mathbb{E}}[f\left({\theta }_{n}^{ag})-f\left({\theta }^{\ast })]. Theorem 1 describes the convergence property of the accelerated gradient algorithm for least-square regression.Theorem 1Let {θkmd,θkag}\left\{{\theta }_{k}^{md},{\theta }_{k}^{ag}\right\}be computed by the accelerated gradient algorithm and Γk{\Gamma }_{k}be defined in (4). Assume (a)–(d). 
If {αk},{βk},and{λk}\left\{{\alpha }_{k}\right\},\left\{{\beta }_{k}\right\},and\hspace{0.25em}\left\{{\lambda }_{k}\right\}are chosen such thatαkλk≤βk≤12M,α1λ1Γ1≥α2λ2Γ2≥⋯,\begin{array}{l}{\alpha }_{k}{\lambda }_{k}\le {\beta }_{k}\le \frac{1}{2M},\\ \frac{{\alpha }_{1}}{{\lambda }_{1}{\Gamma }_{1}}\ge \frac{{\alpha }_{2}}{{\lambda }_{2}{\Gamma }_{2}}\ge \cdots ,\end{array}then for any n≥1n\ge 1, we haveE[f(θnag)−f(θ∗)]≤Γn2λ1‖θ0−θ∗‖2+Mσ2Γn∑k=1nβk2kΓk.{\mathbb{E}}{[}f({\theta }_{n}^{ag})-f\left({\theta }^{\ast })]\le \frac{{\Gamma }_{n}}{2{\lambda }_{1}}\Vert {\theta }_{0}-{\theta }^{\ast }{\Vert }^{2}+M{\sigma }^{2}{\Gamma }_{n}\mathop{\sum }\limits_{k=1}^{n}\frac{{\beta }_{k}^{2}}{k{\Gamma }_{k}}.ProofBy Taylor expansion of the function ffand (2), we have f(θkag)=f(θkmd)+⟨∇f(θkmd),θkag−θkmd⟩+(θkag−θkmd)T∇2f(θkmd)(θkag−θkmd)≤f(θkmd)−βk‖∇f(θkmd)‖2−βk⟨∇f(θkmd),ξ¯k⟩+βk2E‖xk‖2‖∇f(θkmd)+ξ¯k‖2≤f(θkmd)−βk‖∇f(θkmd)‖2−βk⟨∇f(θkmd),ξ¯k⟩+βk2M‖∇f(θkmd)+ξ¯k‖2.\begin{array}{rcl}f({\theta }_{k}^{ag})& =& f\left({\theta }_{k}^{md})+\langle \nabla f\left({\theta }_{k}^{md}),{\theta }_{k}^{ag}-{\theta }_{k}^{md}\rangle +{\left({\theta }_{k}^{ag}-{\theta }_{k}^{md})}^{T}{\nabla }^{2}f\left({\theta }_{k}^{md})\left({\theta }_{k}^{ag}-{\theta }_{k}^{md})\\ & \le & f\left({\theta }_{k}^{md})-{\beta }_{k}\Vert \nabla f\left({\theta }_{k}^{md}){\Vert }^{2}-{\beta }_{k}\langle \nabla f\left({\theta }_{k}^{md}),{\overline{\xi }}_{k}\rangle +{\beta }_{k}^{2}{\mathbb{E}}\Vert {x}_{k}{\Vert }^{2}\Vert \nabla f\left({\theta }_{k}^{md})+{\overline{\xi }}_{k}{\Vert }^{2}\\ & \le & f\left({\theta }_{k}^{md})-{\beta }_{k}\Vert \nabla f\left({\theta }_{k}^{md}){\Vert }^{2}-{\beta }_{k}\langle \nabla f\left({\theta }_{k}^{md}),{\overline{\xi }}_{k}\rangle +{\beta }_{k}^{2}M\Vert \nabla f\left({\theta }_{k}^{md})+{\overline{\xi }}_{k}{\Vert }^{2}.\end{array}where the last inequality follows from the assumption (c).Since f(μ)−f(ν)=⟨∇f(ν),μ−ν⟩+(μ−ν)TE(xkxkT)(μ−ν),f\left(\mu )-f\left(\nu )=\langle \nabla f\left(\nu ),\mu -\nu \rangle +{\left(\mu -\nu )}^{T}{\mathbb{E}}\left({x}_{k}{x}_{k}^{T})\left(\mu -\nu ),we have (5)f(ν)−f(μ)=⟨∇f(ν),ν−μ⟩−(μ−ν)TE(xkxkT)(μ−ν)≤⟨∇f(ν),ν−μ⟩,f\left(\nu )-f\left(\mu )=\langle \nabla f\left(\nu ),\nu -\mu \rangle -{\left(\mu -\nu )}^{T}{\mathbb{E}}\left({x}_{k}{x}_{k}^{T})\left(\mu -\nu )\le \langle \nabla f\left(\nu ),\nu -\mu \rangle ,where the inequality follows from the positive semidefinition of matrix E(xkxkT){\mathbb{E}}\left({x}_{k}{x}_{k}^{T}).By (1) and (5), we have f(θkmd)−[(1−αk)f(θk−1ag)+αkf(θ)]=αk[f(θkmd)−f(θ)]+(1−αk)[f(θkmd)−f(θk−1ag)]≤αk⟨∇f(θkmd),θkmd−θ⟩+(1−αk)⟨∇f(θkmd),θkmd−θk−1ag⟩=⟨∇f(θkmd),αk(θkmd−θ)+(1−αk)(θkmd−θk−1ag)⟩=αk⟨∇f(θkmd),θk−1−θ⟩.\begin{array}{rcl}f\left({\theta }_{k}^{md})-\left[\left(1-{\alpha }_{k})f\left({\theta }_{k-1}^{ag})+{\alpha }_{k}f\left(\theta )]& =& {\alpha }_{k}[f\left({\theta }_{k}^{md})-f\left(\theta )]+\left(1-{\alpha }_{k})[f\left({\theta }_{k}^{md})-f\left({\theta }_{k-1}^{ag})]\\ & \le & {\alpha }_{k}\langle \nabla f\left({\theta }_{k}^{md}),{\theta }_{k}^{md}-\theta \rangle +\left(1-{\alpha }_{k})\langle \nabla f\left({\theta }_{k}^{md}),{\theta }_{k}^{md}-{\theta }_{k-1}^{ag}\rangle \\ & =& \langle \nabla f\left({\theta }_{k}^{md}),{\alpha }_{k}\left({\theta }_{k}^{md}-\theta )+\left(1-{\alpha }_{k})\left({\theta }_{k}^{md}-{\theta }_{k-1}^{ag})\rangle \\ & =& {\alpha }_{k}\langle \nabla f\left({\theta }_{k}^{md}),{\theta }_{k-1}-\theta \rangle .\end{array}So we obtain 
We now establish the convergence rate of the developed algorithm. The goal is to bound the expectation $\mathbb{E}[f(\theta_n^{ag})-f(\theta^{\ast})]$. Theorem 1 describes the convergence property of the accelerated gradient algorithm for least-square regression.

Theorem 1. Let $\{\theta_k^{md},\theta_k^{ag}\}$ be computed by the accelerated gradient algorithm and let $\Gamma_k$ be defined in (4). Assume (a)–(d). If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that
$$\alpha_k\lambda_k\le\beta_k\le\frac{1}{2M},\qquad \frac{\alpha_1}{\lambda_1\Gamma_1}\ge\frac{\alpha_2}{\lambda_2\Gamma_2}\ge\cdots,$$
then for any $n\ge 1$ we have
$$\mathbb{E}[f(\theta_n^{ag})-f(\theta^{\ast})]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta^{\ast}\Vert^2+M\sigma^2\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^2}{k\Gamma_k}.$$

Proof. By the Taylor expansion of $f$ and the update (3), we have
$$\begin{aligned}
f(\theta_k^{ag})&=f(\theta_k^{md})+\langle\nabla f(\theta_k^{md}),\theta_k^{ag}-\theta_k^{md}\rangle+(\theta_k^{ag}-\theta_k^{md})^{T}\nabla^2 f(\theta_k^{md})(\theta_k^{ag}-\theta_k^{md})\\
&\le f(\theta_k^{md})-\beta_k\Vert\nabla f(\theta_k^{md})\Vert^2-\beta_k\langle\nabla f(\theta_k^{md}),\bar\xi_k\rangle+\beta_k^2\,\mathbb{E}\Vert x_k\Vert^2\,\Vert\nabla f(\theta_k^{md})+\bar\xi_k\Vert^2\\
&\le f(\theta_k^{md})-\beta_k\Vert\nabla f(\theta_k^{md})\Vert^2-\beta_k\langle\nabla f(\theta_k^{md}),\bar\xi_k\rangle+\beta_k^2 M\Vert\nabla f(\theta_k^{md})+\bar\xi_k\Vert^2,
\end{aligned}$$
where the last inequality follows from assumption (c).

Since $f(\mu)-f(\nu)=\langle\nabla f(\nu),\mu-\nu\rangle+(\mu-\nu)^{T}\mathbb{E}(x_k x_k^{T})(\mu-\nu)$, we have

(5) $$f(\nu)-f(\mu)=\langle\nabla f(\nu),\nu-\mu\rangle-(\mu-\nu)^{T}\mathbb{E}(x_k x_k^{T})(\mu-\nu)\le\langle\nabla f(\nu),\nu-\mu\rangle,$$

where the inequality follows from the positive semidefiniteness of the matrix $\mathbb{E}(x_k x_k^{T})$.

By (1) and (5), we have
$$\begin{aligned}
f(\theta_k^{md})-[(1-\alpha_k)f(\theta_{k-1}^{ag})+\alpha_k f(\theta)]&=\alpha_k[f(\theta_k^{md})-f(\theta)]+(1-\alpha_k)[f(\theta_k^{md})-f(\theta_{k-1}^{ag})]\\
&\le\alpha_k\langle\nabla f(\theta_k^{md}),\theta_k^{md}-\theta\rangle+(1-\alpha_k)\langle\nabla f(\theta_k^{md}),\theta_k^{md}-\theta_{k-1}^{ag}\rangle\\
&=\langle\nabla f(\theta_k^{md}),\alpha_k(\theta_k^{md}-\theta)+(1-\alpha_k)(\theta_k^{md}-\theta_{k-1}^{ag})\rangle\\
&=\alpha_k\langle\nabla f(\theta_k^{md}),\theta_{k-1}-\theta\rangle.
\end{aligned}$$

So we obtain
$$f(\theta_k^{ag})\le(1-\alpha_k)f(\theta_{k-1}^{ag})+\alpha_k f(\theta)+\alpha_k\langle\nabla f(\theta_k^{md}),\theta_{k-1}-\theta\rangle-\beta_k\Vert\nabla f(\theta_k^{md})\Vert^2-\beta_k\langle\nabla f(\theta_k^{md}),\bar\xi_k\rangle+\beta_k^2 M\Vert\nabla f(\theta_k^{md})+\bar\xi_k\Vert^2.$$

It follows from (2) that
$$\Vert\theta_k-\theta\Vert^2=\Vert\theta_{k-1}-\lambda_k\nabla f(\theta_k^{md})-\theta\Vert^2=\Vert\theta_{k-1}-\theta\Vert^2-2\lambda_k\langle\nabla f(\theta_k^{md}),\theta_{k-1}-\theta\rangle+\lambda_k^2\Vert\nabla f(\theta_k^{md})\Vert^2.$$
Then we have

(6) $$\langle\nabla f(\theta_k^{md}),\theta_{k-1}-\theta\rangle=\frac{1}{2\lambda_k}\big[\Vert\theta_{k-1}-\theta\Vert^2-\Vert\theta_k-\theta\Vert^2\big]+\frac{\lambda_k}{2}\Vert\nabla f(\theta_k^{md})\Vert^2,$$

and meanwhile

(7) $$\Vert\nabla f(\theta_k^{md})+\bar\xi_k\Vert^2=\Vert\nabla f(\theta_k^{md})\Vert^2+\Vert\bar\xi_k\Vert^2+2\langle\nabla f(\theta_k^{md}),\bar\xi_k\rangle.$$

Combining (6) and (7), we obtain
$$\begin{aligned}
f(\theta_k^{ag})&\le(1-\alpha_k)f(\theta_{k-1}^{ag})+\alpha_k f(\theta)+\frac{\alpha_k}{2\lambda_k}\big[\Vert\theta_{k-1}-\theta\Vert^2-\Vert\theta_k-\theta\Vert^2\big]\\
&\quad-\beta_k\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\Big)\Vert\nabla f(\theta_k^{md})\Vert^2+M\beta_k^2\Vert\bar\xi_k\Vert^2+\langle\bar\xi_k,(2\beta_k^2 M-\beta_k)\nabla f(\theta_k^{md})\rangle,
\end{aligned}$$
which is equivalent to
$$\begin{aligned}
f(\theta_k^{ag})-f(\theta)&\le(1-\alpha_k)[f(\theta_{k-1}^{ag})-f(\theta)]+\frac{\alpha_k}{2\lambda_k}\big[\Vert\theta_{k-1}-\theta\Vert^2-\Vert\theta_k-\theta\Vert^2\big]\\
&\quad-\beta_k\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\Big)\Vert\nabla f(\theta_k^{md})\Vert^2+M\beta_k^2\Vert\bar\xi_k\Vert^2+\langle\bar\xi_k,(2\beta_k^2 M-\beta_k)\nabla f(\theta_k^{md})\rangle.
\end{aligned}$$

By Lemma 1, we have
$$\begin{aligned}
f(\theta_n^{ag})-f(\theta)&\le\Gamma_n\sum_{k=1}^{n}\frac{\alpha_k}{2\lambda_k\Gamma_k}\big[\Vert\theta_{k-1}-\theta\Vert^2-\Vert\theta_k-\theta\Vert^2\big]-\Gamma_n\sum_{k=1}^{n}\frac{\beta_k}{\Gamma_k}\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\Big)\Vert\nabla f(\theta_k^{md})\Vert^2\\
&\quad+\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^2 M}{\Gamma_k}\Vert\bar\xi_k\Vert^2+\Gamma_n\sum_{k=1}^{n}\frac{1}{\Gamma_k}\langle\bar\xi_k,(2\beta_k^2 M-\beta_k)\nabla f(\theta_k^{md})\rangle.
\end{aligned}$$

Since $\frac{\alpha_1}{\lambda_1\Gamma_1}\ge\frac{\alpha_2}{\lambda_2\Gamma_2}\ge\cdots$ and $\alpha_1=\Gamma_1=1$, we have
$$\sum_{k=1}^{n}\frac{\alpha_k}{2\lambda_k\Gamma_k}\big[\Vert\theta_{k-1}-\theta\Vert^2-\Vert\theta_k-\theta\Vert^2\big]\le\frac{\alpha_1}{2\lambda_1\Gamma_1}\Vert\theta_0-\theta\Vert^2=\frac{1}{2\lambda_1}\Vert\theta_0-\theta\Vert^2.$$

So we obtain

(8) $$f(\theta_n^{ag})-f(\theta)\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta\Vert^2+\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^2 M}{\Gamma_k}\Vert\bar\xi_k\Vert^2+\Gamma_n\sum_{k=1}^{n}\frac{1}{\Gamma_k}\langle\bar\xi_k,(2\beta_k^2 M-\beta_k)\nabla f(\theta_k^{md})\rangle,$$

where the gradient-norm sum has been dropped because the assumption $\alpha_k\lambda_k\le\beta_k\le\frac{1}{2M}$ makes its coefficient $1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M$ nonnegative.

Under assumption (d), we have
$$\mathbb{E}\bar\xi_k=\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\xi_i=0,\qquad \mathbb{E}\Vert\bar\xi_k\Vert^2=\mathbb{E}\Big\Vert\frac{1}{k}\sum_{i=1}^{k}\xi_i\Big\Vert^2\le\frac{\sigma^2}{k}.$$
Taking expectations on both sides of inequality (8) with respect to $(x_i,y_i)$, we obtain, for any $\theta\in\mathbb{R}^d$,
$$\mathbb{E}[f(\theta_n^{ag})-f(\theta)]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta\Vert^2+M\sigma^2\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^2}{k\Gamma_k}.$$
Now, fixing $\theta=\theta^{\ast}$, we have
$$\mathbb{E}[f(\theta_n^{ag})-f(\theta^{\ast})]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta^{\ast}\Vert^2+M\sigma^2\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^2}{k\Gamma_k}.$$
This finishes the proof of Theorem 1. □
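To see that the right-hand side of Theorem 1 indeed decays like $O(1/n)$, the short sketch below evaluates the bound under the stepsize schedule introduced in Corollary 1 below; the constants $M$, $\sigma^2$, and $\Vert\theta_0-\theta^{\ast}\Vert^2$ are illustrative values of our own choosing.

```python
import numpy as np

M, sigma2, dist2 = 4.0, 1.0, 1.0          # illustrative constants

def theorem1_bound(n):
    k = np.arange(1, n + 1)
    Gamma = 2.0 / (k + 1)                 # Gamma_k = 2/(k+1) for this schedule
    beta = 1.0 / (M * (k + 1))            # beta_k = 1/(M(k+1))
    lam1 = 1.0 / (2.0 * M)                # lambda_1 = 1/(2M)
    return Gamma[-1] / (2 * lam1) * dist2 + M * sigma2 * Gamma[-1] * np.sum(beta**2 / (k * Gamma))

for n in (10, 100, 1000, 10000):
    print(n, (n + 1) * theorem1_bound(n))  # roughly constant, i.e. the bound is O(1/n)
```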
In the following, we apply the result of Theorem 1 to some particular selections of $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ and obtain Corollary 1.

Corollary 1. Suppose that $\alpha_k$, $\beta_k$, and $\lambda_k$ in the accelerated gradient algorithm for least-square regression learning are set to

(9) $$\alpha_k=\frac{1}{k+1},\qquad\beta_k=\frac{1}{M(k+1)},\qquad\lambda_k=\frac{1}{2M}\qquad\forall k\ge 1.$$

Then for any $n\ge 1$ we have
$$\mathbb{E}[f(\theta_n^{ag})-f(\theta^{\ast})]\le\frac{M^2\Vert\theta_0-\theta^{\ast}\Vert^2+\sigma^2}{M(n+1)}.$$

Proof. In view of (4) and (9), we have, for $k\ge 2$,
$$\Gamma_k=(1-\alpha_k)\Gamma_{k-1}=\frac{k}{k+1}\times\frac{k-1}{k}\times\frac{k-2}{k-1}\times\cdots\times\frac{2}{3}\times\Gamma_1=\frac{2}{k+1}.$$
It is easy to verify that
$$\alpha_k\lambda_k=\frac{1}{2M(k+1)}\le\beta_k=\frac{1}{M(k+1)}\le\frac{1}{2M},\qquad\frac{\alpha_1}{\lambda_1\Gamma_1}\ge\frac{\alpha_2}{\lambda_2\Gamma_2}\ge\cdots,$$
so both conditions of Theorem 1 are satisfied. Then we obtain
$$M\Gamma_n\sigma^2\sum_{k=1}^{n}\frac{\beta_k^2}{k\Gamma_k}=\frac{2\sigma^2}{n+1}\sum_{k=1}^{n}\frac{M/\big(M^2(k+1)^2\big)}{2k/(k+1)}=\frac{\sigma^2}{M(n+1)}\sum_{k=1}^{n}\frac{1}{k(k+1)}=\frac{\sigma^2}{M(n+1)}\Big(1-\frac{1}{n+1}\Big)\le\frac{\sigma^2}{M(n+1)}.$$
From the result of Theorem 1, we have
$$\mathbb{E}[f(\theta_n^{ag})-f(\theta^{\ast})]\le\frac{M}{n+1}\Vert\theta_0-\theta^{\ast}\Vert^2+\frac{\sigma^2}{M(n+1)}=\frac{M^2\Vert\theta_0-\theta^{\ast}\Vert^2+\sigma^2}{M(n+1)}.$$
This completes the proof of Corollary 1. □

Corollary 1 shows that the developed algorithm achieves a convergence rate of $O(1/n)$ without strong-convexity or Lipschitz-continuous-gradient assumptions.

3 The stochastic accelerated gradient algorithm for logistic regression

In this section, we consider the convergence property of the accelerated gradient algorithm for logistic regression. We make the following assumptions:

(B1) $\mathcal{F}$ is a $d$-dimensional Euclidean space, with $d\ge 1$.

(B2) The observations $(x_i,y_i)\in\mathcal{F}\times\{-1,1\}$ are independent and identically distributed.

(B3) $\mathbb{E}\Vert x_i\Vert^2$ is finite, i.e., $\mathbb{E}\Vert x_i\Vert^2\le M$ for any $i\ge 1$.

(B4) We consider $l(\theta)=\mathbb{E}[\log(1+\exp(-y_i\langle x_i,\theta\rangle))]$ and denote by $\theta^{\ast}\in\mathbb{R}^d$ a global minimizer of $l$, which we assume to exist. Let $\xi_i=(y_i-\langle\theta^{\ast},x_i\rangle)x_i$ denote the residual; for any $i\ge 1$, we have $\mathbb{E}\xi_i=0$. We also assume that $\mathbb{E}\Vert\xi_i\Vert^2\le\sigma^2$ for every $i$, and we write $\bar\xi_k=\frac{1}{k}\sum_{i=1}^{k}\xi_i$.
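For reference, the following small helper is a sketch, under assumption (B4), of the pointwise logistic loss and the single-sample gradient that appears in the updates (11)–(12) below; the function names are ours.

```python
import numpy as np

def logistic_loss(theta, x, y):
    """Pointwise loss log(1 + exp(-y <x, theta>)) from (B4); y is -1 or +1."""
    return np.logaddexp(0.0, -y * (x @ theta))

def logistic_grad(theta, x, y):
    """Single-sample gradient -y exp(-y<x,theta>) x / (1 + exp(-y<x,theta>))."""
    m = -y * (x @ theta)
    s = 0.5 * (1.0 + np.tanh(0.5 * m))   # numerically stable sigmoid(m)
    return -y * s * x
```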
The accelerated gradient algorithm for logistic regression is as follows. Let $\theta_0\in\mathcal{F}$, let $\{\alpha_k\}$ satisfy $\alpha_1=1$ and $\alpha_k>0$ for any $k\ge 2$, and let $\beta_k>0$ and $\lambda_k>0$.

(i) Set the initial $\theta_0^{ag}=\theta_0$ and

(10) $$\theta_k^{md}=(1-\alpha_k)\,\theta_{k-1}^{ag}+\alpha_k\,\theta_{k-1}.$$

(ii) Set

(11) $$\theta_k=\theta_{k-1}-\lambda_k\nabla l(\theta_k^{md})=\theta_{k-1}-\lambda_k\,\frac{-y_k\exp\{-y_k\langle x_k,\theta_k^{md}\rangle\}\,x_k}{1+\exp\{-y_k\langle x_k,\theta_k^{md}\rangle\}},$$

(12) $$\theta_k^{ag}=\theta_k^{md}-\beta_k\big(\nabla l(\theta_k^{md})+\bar\xi_k\big)=\theta_k^{md}-\beta_k\Big\{\frac{-y_k\exp\{-y_k\langle x_k,\theta_k^{md}\rangle\}\,x_k}{1+\exp\{-y_k\langle x_k,\theta_k^{md}\rangle\}}+\bar\xi_k\Big\}.$$

(iii) Set $k\leftarrow k+1$ and go to step (i).

Theorem 2 describes the convergence property of the accelerated gradient algorithm for logistic regression.

Theorem 2. Let $\{\theta_k^{md},\theta_k^{ag}\}$ be computed by the accelerated gradient algorithm and let $\Gamma_k$ be defined in (4). Assume (B1)–(B4). If $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$ are chosen such that
$$\alpha_k\lambda_k\le\beta_k\le\frac{1}{2M},\qquad\frac{\alpha_1}{\lambda_1\Gamma_1}\ge\frac{\alpha_2}{\lambda_2\Gamma_2}\ge\cdots,$$
then for any $n\ge 1$ we have
$$\mathbb{E}[l(\theta_n^{ag})-l(\theta^{\ast})]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta^{\ast}\Vert^2+M\sigma^2\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^2}{k\Gamma_k}.$$

Proof. By the Taylor expansion of $l$, there exists $\vartheta$ such that

(13) $$\begin{aligned}
l(\theta_k^{ag})&=l(\theta_k^{md})+\langle\nabla l(\theta_k^{md}),\theta_k^{ag}-\theta_k^{md}\rangle+(\theta_k^{ag}-\theta_k^{md})^{T}\nabla^2 l(\vartheta)(\theta_k^{ag}-\theta_k^{md})\\
&=l(\theta_k^{md})-\beta_k\Vert\nabla l(\theta_k^{md})\Vert^2+\beta_k\langle\nabla l(\theta_k^{md}),\bar\xi_k\rangle+(\theta_k^{ag}-\theta_k^{md})^{T}\,\mathbb{E}\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}x_k x_k^{T}}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\,(\theta_k^{ag}-\theta_k^{md}).
\end{aligned}$$

It is easy to verify that the matrix $\mathbb{E}\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}x_k x_k^{T}}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}$ is positive semidefinite and that its largest eigenvalue satisfies
$$\lambda_{\max}\Big(\mathbb{E}\frac{\exp\{-y_k\langle x_k,\vartheta\rangle\}x_k x_k^{T}}{1+\exp\{-y_k\langle x_k,\vartheta\rangle\}}\Big)\le\mathbb{E}\Vert x_k\Vert^2\le M.$$
Combining (12) and (13), we have
$$l(\theta_k^{ag})\le l(\theta_k^{md})-\beta_k\Vert\nabla l(\theta_k^{md})\Vert^2+\beta_k\langle\nabla l(\theta_k^{md}),\bar\xi_k\rangle+\beta_k^2 M\Vert\nabla l(\theta_k^{md})+\bar\xi_k\Vert^2.$$

Similarly to (13), there exists $\zeta\in\mathbb{R}^d$ such that
$$l(\mu)-l(\nu)=\langle\nabla l(\nu),\mu-\nu\rangle+(\mu-\nu)^{T}\,\mathbb{E}\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}x_k x_k^{T}}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}\,(\mu-\nu),\qquad\mu,\nu\in\mathbb{R}^d,$$
and hence
$$l(\nu)-l(\mu)=\langle\nabla l(\nu),\nu-\mu\rangle-(\mu-\nu)^{T}\,\mathbb{E}\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}x_k x_k^{T}}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}\,(\mu-\nu)\le\langle\nabla l(\nu),\nu-\mu\rangle,$$
where the inequality follows from the positive semidefiniteness of the matrix $\mathbb{E}\frac{\exp\{-y_k\langle x_k,\zeta\rangle\}x_k x_k^{T}}{1+\exp\{-y_k\langle x_k,\zeta\rangle\}}$.

Similarly to (5), we have
$$l(\theta_k^{md})-[(1-\alpha_k)l(\theta_{k-1}^{ag})+\alpha_k l(\theta)]\le\alpha_k\langle\nabla l(\theta_k^{md}),\theta_{k-1}-\theta\rangle.$$
So we obtain
$$l(\theta_k^{ag})\le(1-\alpha_k)l(\theta_{k-1}^{ag})+\alpha_k l(\theta)+\alpha_k\langle\nabla l(\theta_k^{md}),\theta_{k-1}-\theta\rangle-\beta_k\Vert\nabla l(\theta_k^{md})\Vert^2+\beta_k\langle\nabla l(\theta_k^{md}),\bar\xi_k\rangle+\beta_k^2 M\Vert\nabla l(\theta_k^{md})+\bar\xi_k\Vert^2.$$

It follows from (11) that
$$\Vert\theta_k-\theta\Vert^2=\Vert\theta_{k-1}-\lambda_k\nabla l(\theta_k^{md})-\theta\Vert^2=\Vert\theta_{k-1}-\theta\Vert^2-2\lambda_k\langle\nabla l(\theta_k^{md}),\theta_{k-1}-\theta\rangle+\lambda_k^2\Vert\nabla l(\theta_k^{md})\Vert^2.$$
Then we have

(14) $$\langle\nabla l(\theta_k^{md}),\theta_{k-1}-\theta\rangle=\frac{1}{2\lambda_k}\big[\Vert\theta_{k-1}-\theta\Vert^2-\Vert\theta_k-\theta\Vert^2\big]+\frac{\lambda_k}{2}\Vert\nabla l(\theta_k^{md})\Vert^2,$$

and meanwhile

(15) $$\Vert\nabla l(\theta_k^{md})+\bar\xi_k\Vert^2=\Vert\nabla l(\theta_k^{md})\Vert^2+\Vert\bar\xi_k\Vert^2+2\langle\nabla l(\theta_k^{md}),\bar\xi_k\rangle.$$

Combining (14) and (15), we obtain
$$\begin{aligned}
l(\theta_k^{ag})&\le(1-\alpha_k)l(\theta_{k-1}^{ag})+\alpha_k l(\theta)+\frac{\alpha_k}{2\lambda_k}\big[\Vert\theta_{k-1}-\theta\Vert^2-\Vert\theta_k-\theta\Vert^2\big]\\
&\quad-\beta_k\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\Big)\Vert\nabla l(\theta_k^{md})\Vert^2+M\beta_k^2\Vert\bar\xi_k\Vert^2+\langle\bar\xi_k,(2\beta_k^2 M-\beta_k)\nabla l(\theta_k^{md})\rangle,
\end{aligned}$$
which is equivalent to
$$\begin{aligned}
l(\theta_k^{ag})-l(\theta)&\le(1-\alpha_k)[l(\theta_{k-1}^{ag})-l(\theta)]+\frac{\alpha_k}{2\lambda_k}\big[\Vert\theta_{k-1}-\theta\Vert^2-\Vert\theta_k-\theta\Vert^2\big]\\
&\quad-\beta_k\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\Big)\Vert\nabla l(\theta_k^{md})\Vert^2+M\beta_k^2\Vert\bar\xi_k\Vert^2+\langle\bar\xi_k,(2\beta_k^2 M-\beta_k)\nabla l(\theta_k^{md})\rangle.
\end{aligned}$$

By Lemma 1, we have
$$\begin{aligned}
l(\theta_n^{ag})-l(\theta)&\le\Gamma_n\sum_{k=1}^{n}\frac{\alpha_k}{2\lambda_k\Gamma_k}\big[\Vert\theta_{k-1}-\theta\Vert^2-\Vert\theta_k-\theta\Vert^2\big]-\Gamma_n\sum_{k=1}^{n}\frac{\beta_k}{\Gamma_k}\Big(1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M\Big)\Vert\nabla l(\theta_k^{md})\Vert^2\\
&\quad+\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^2 M}{\Gamma_k}\Vert\bar\xi_k\Vert^2+\Gamma_n\sum_{k=1}^{n}\frac{1}{\Gamma_k}\langle\bar\xi_k,(2\beta_k^2 M-\beta_k)\nabla l(\theta_k^{md})\rangle.
\end{aligned}$$

Since $\frac{\alpha_1}{\lambda_1\Gamma_1}\ge\frac{\alpha_2}{\lambda_2\Gamma_2}\ge\cdots$ and $\alpha_1=\Gamma_1=1$, we have
$$\sum_{k=1}^{n}\frac{\alpha_k}{2\lambda_k\Gamma_k}\big[\Vert\theta_{k-1}-\theta\Vert^2-\Vert\theta_k-\theta\Vert^2\big]\le\frac{\alpha_1}{2\lambda_1\Gamma_1}\Vert\theta_0-\theta\Vert^2=\frac{1}{2\lambda_1}\Vert\theta_0-\theta\Vert^2.$$
So we obtain
(16) $$l(\theta_n^{ag})-l(\theta)\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta\Vert^2+\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^2 M}{\Gamma_k}\Vert\bar\xi_k\Vert^2+\Gamma_n\sum_{k=1}^{n}\frac{1}{\Gamma_k}\langle\bar\xi_k,(2\beta_k^2 M-\beta_k)\nabla l(\theta_k^{md})\rangle,$$

where the gradient-norm sum has been dropped because the assumption $\alpha_k\lambda_k\le\beta_k\le\frac{1}{2M}$ makes its coefficient $1-\frac{\lambda_k\alpha_k}{2\beta_k}-\beta_k M$ nonnegative.

Under assumption (B4), we have
$$\mathbb{E}\bar\xi_k=\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\xi_i=0,\qquad\mathbb{E}\Vert\bar\xi_k\Vert^2=\mathbb{E}\Big\Vert\frac{1}{k}\sum_{i=1}^{k}\xi_i\Big\Vert^2\le\frac{\sigma^2}{k}.$$
Taking expectations on both sides of inequality (16) with respect to $(x_i,y_i)$, we obtain, for any $\theta\in\mathbb{R}^d$,
$$\mathbb{E}[l(\theta_n^{ag})-l(\theta)]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta\Vert^2+M\sigma^2\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^2}{k\Gamma_k}.$$
Now, fixing $\theta=\theta^{\ast}$, we have
$$\mathbb{E}[l(\theta_n^{ag})-l(\theta^{\ast})]\le\frac{\Gamma_n}{2\lambda_1}\Vert\theta_0-\theta^{\ast}\Vert^2+M\sigma^2\Gamma_n\sum_{k=1}^{n}\frac{\beta_k^2}{k\Gamma_k}.$$
This finishes the proof of Theorem 2. □

Similarly to Corollary 1, we specialize the result of Theorem 2 to some particular selections of $\{\alpha_k\}$, $\{\beta_k\}$, and $\{\lambda_k\}$.

Corollary 2. Suppose that $\alpha_k$, $\beta_k$, and $\lambda_k$ in the accelerated gradient algorithm for logistic regression learning are set to
$$\alpha_k=\frac{1}{k+1},\qquad\beta_k=\frac{1}{M(k+1)},\qquad\lambda_k=\frac{1}{2M},\qquad\forall k\ge 1.$$
Then for any $n\ge 1$ we have
$$\mathbb{E}[l(\theta_n^{ag})-l(\theta^{\ast})]\le\frac{M^2\Vert\theta_0-\theta^{\ast}\Vert^2+\sigma^2}{M(n+1)}.$$
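A minimal sketch of the logistic-regression iteration (10)–(12) with the parameter schedule of Corollary 2 is given below. As before, this is only an illustration: the function name and arguments are ours, the population gradient is replaced by its single-sample estimate, $M$ is assumed to upper-bound $\mathbb{E}\Vert x\Vert^2$ as in (B3), and the averaged residual of (B4) can be formed only when $\theta^{\ast}$ is known, e.g., on synthetic data.

```python
import numpy as np

def accelerated_sa_logistic(data_stream, theta0, M, theta_star=None):
    """Sketch of (10)-(12) with alpha_k = 1/(k+1), beta_k = 1/(M(k+1)), lambda_k = 1/(2M)."""
    theta = theta0.copy()
    theta_ag = theta0.copy()
    residual_sum = np.zeros_like(theta0)

    for k, (x, y) in enumerate(data_stream, start=1):
        a = 1.0 / (k + 1)
        b = 1.0 / (M * (k + 1))
        lam = 1.0 / (2.0 * M)

        # (10): momentum combination
        theta_md = (1.0 - a) * theta_ag + a * theta

        # single-sample logistic gradient  -y exp(-y<x,th>) x / (1 + exp(-y<x,th>))
        s = 0.5 * (1.0 + np.tanh(0.5 * (-y * (x @ theta_md))))
        grad = -y * s * x

        # (11): gradient step on theta
        theta = theta - lam * grad

        # averaged residual bar{xi}_k from (B4), omitted when theta* is unavailable
        if theta_star is not None:
            residual_sum += (y - x @ theta_star) * x
        xi_bar = residual_sum / k

        # (12): aggregated update
        theta_ag = theta_md - b * (grad + xi_bar)

    return theta_ag
```

Each step costs $O(d)$ time and memory, which matches the streaming setting considered throughout the article.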
4 Comparisons with related work

In Sections 2 and 3, we have studied AC-SA type algorithms for least-square regression and logistic regression, respectively, and derived upper bounds for these learning algorithms using only the convexity of the objective function. In this section, we discuss how our results relate to other recent studies.

4.1 Comparison with convergence rates for stochastic optimization

Our convergence analysis of the SA learning algorithms is based on a similar analysis for stochastic composite optimization by Ghadimi and Lan [8]. There are two differences between our work and theirs. The first difference is that our bounds hold at every iteration rather than only at a prescribed iteration limit: the parameters $\beta_k,\lambda_k$ in Corollary 3 of [8] depend on the iteration limit $N$, whereas we do not need this assumption. The second difference lies in the error bounds: Ghadimi and Lan obtained a rate of $O(1/\sqrt{n})$ for stochastic composite optimization, while we obtain a rate of $O(1/n)$ for the regression problems considered here.

Our developed accelerated stochastic gradient algorithm for least-square regression is summarized in (1)–(3). The algorithm takes a stream of data $(x_k,y_k)$ as input, together with an initial guess $\theta_0$ of the parameter. The other requirements are $\{\alpha_k\}$ satisfying $\alpha_1=1$ and $\alpha_k>0$ for any $k\ge 2$, $\beta_k>0$, and $\lambda_k>0$. The algorithm maintains two intermediate variables, $\theta_k^{ag}$ (initialized to $\theta_0$) and $\theta_k^{md}$. The variable $\theta_k^{md}$ is updated in (1) as a linear combination of $\theta_k^{ag}$ and the current parameter estimate $\theta_k$, with coefficient $\alpha_k$. The parameter $\theta_k$ is updated in (2) with stepsize $\lambda_k$. The residual $\xi_k$ and the average $\bar\xi_k$ of the residuals up to the $k$th observation (i.e., $\bar\xi_k=\frac{1}{k}\sum_{i=1}^{k}\xi_i$) enter the update (3), in which $\theta_k^{ag}$ is obtained from $\theta_k^{md}$ by a step of size $\beta_k$ along the noisy gradient direction. The process continues whenever a new pair of data arrives.

The unbiased estimate of the gradient, namely $(\langle\theta_k^{md},x_k\rangle x_k-y_k x_k)$ for each data point $(x_k,y_k)$, is used in (2). From this perspective, the update of $\theta_k$ is the same as in the SGD (also called least-mean-square) algorithm if we set $\alpha_k=1$; a small sketch illustrating this reduction is given below. Across the training, the residual $\xi_k$ is computed, all residuals observed so far are averaged, and the averaged residual influences the update of $\theta_k^{ag}$. This differs from the stochastic accelerated gradient algorithm in [22], where no residual is computed or used during training.
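A minimal sketch of the reduction just mentioned, with random placeholder data of our own: with $\alpha_k=1$ we have $\theta_k^{md}=\theta_{k-1}$, so the update (2), with the single-sample gradient estimate, coincides with the plain SGD/least-mean-square step.

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 5, 0.01
theta = rng.normal(size=d)
theta_ag = rng.normal(size=d)        # value is irrelevant once alpha_k = 1
x, y = rng.normal(size=d), rng.normal()

# accelerated update (1)-(2) with alpha_k = 1: theta_k^{md} = theta_{k-1}
theta_md = (1.0 - 1.0) * theta_ag + 1.0 * theta
theta_acc = theta - lam * (x @ theta_md - y) * x

# plain SGD / least-mean-square update on the same sample
theta_lms = theta - lam * (x @ theta - y) * x

print(np.allclose(theta_acc, theta_lms))   # True: the two updates coincide
```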
4.2 Comparison with the work of Bach and Moulines

The work perhaps most closely related to ours is that of Bach and Moulines [10], who studied the SA problem in which a convex function has to be minimized given only unbiased estimates of its gradients at certain points, a framework that includes machine learning methods based on the minimization of the empirical risk. The sample setting considered by Bach and Moulines is similar to ours: the learner is given a sample set $\{(x_i,y_i)\}_{i=1}^{n}$, and the goal of the regression learning problem is to learn a linear function $\langle\theta,x\rangle$ that predicts the output from new inputs in $X$ drawn from the same distribution. Both we and Bach and Moulines obtain a rate of $O(1/n)$ for the SA algorithm applied to least-square regression, without strong-convexity assumptions. To our knowledge, the convergence rate $O(1/n)$ is optimal for least-square regression and logistic regression.

Although uniform convergence bounds for regression learning algorithms have relied on assumptions on the input $x_k$ and the residual $\xi_k$, we obtain the optimal upper bound $O(1/n)$ for the stochastic learning algorithms, and the order of this upper bound is independent of the dimension of the input space. There are some important differences between our work and that of [10]. Bach and Moulines considered generalization properties of stochastic learning algorithms under the assumption that the covariance operator $\mathbb{E}(x_k\otimes x_k)$ is invertible. However, some covariance operators may not be invertible; for example, in $\mathbb{R}^2$ the covariance operator
$$\mathbb{E}(x_k\otimes x_k)=\begin{pmatrix}\mathbb{E}x_{k1}^2 & \mathbb{E}x_{k1}x_{k2}\\ \mathbb{E}x_{k1}x_{k2} & \mathbb{E}x_{k2}^2\end{pmatrix}$$
has determinant zero whenever the two random components of $x_k$ satisfy $x_{k1}=x_{k2}$. In contrast, our algorithm attains the rate $O(1/n)$ under assumptions (a)–(d) alone.

5 Conclusion

In this article, we have considered two SA algorithms that achieve rates of $O(1/n)$ for least-square regression and logistic regression, respectively, without strong-convexity assumptions. Without strong convexity, we focus on problems for which well-known algorithms achieve a convergence rate of $O(1/n)$ for function values, and we consider and analyze an accelerated SA algorithm that achieves a rate of $O(1/n)$ for classical least-square regression and logistic regression problems. Compared with the well-known results, we need fewer conditions to obtain this tight convergence rate for least-square regression and logistic regression. For the accelerated SA algorithm, we provide a nonasymptotic analysis of the generalization error (in expectation) and study our theoretical analysis experimentally.


Open Mathematics, de Gruyter

Published: Jan 1, 2022

Keywords: least-square regression; logistic regression; accelerated stochastic approximation; convergence rate; 68Q19; 68Q25; 68Q30
