# Iterative reconstruction of rank-one matrices in noise

## Abstract

We consider the problem of estimating a rank-one matrix in Gaussian noise under a probabilistic model for the left and right factors of the matrix. The probabilistic model can impose constraints on the factors, including sparsity and positivity, that arise commonly in learning problems. We propose a family of algorithms that reduce the problem to a sequence of scalar estimation computations. These algorithms are similar to approximate message-passing techniques based on Gaussian approximations of loopy belief propagation that have been used recently in compressed sensing. Leveraging analysis methods of Bayati and Montanari, we show that the asymptotic behavior of the algorithm is described by a simple scalar equivalent model, where the distribution of the estimates at each iteration is identical to certain scalar estimates of the variables in Gaussian noise. Moreover, the effective Gaussian noise level is described by a set of state evolution equations. The proposed approach to deriving algorithms thus provides a computationally simple and general method for rank-one estimation problems with a precise analysis in certain high-dimensional settings.

## 1. Introduction

We consider the problem of estimating vectors $${\mathbf{u}}_0 \in {\mathbb{R}}^m$$ and $${\mathbf{v}}_0 \in{\mathbb{R}}^n$$ from a matrix $${\mathbf{A}}\in {\mathbb{R}}^{m {\times} n}$$ of the form $$\label{eq:ArankOne} {\mathbf{A}} = {\mathbf{u}}_0{\mathbf{v}}^\top_0 + \sqrt{m}{\mathbf{W}},$$ (1.1) where $${\mathbf{W}}$$ represents some unknown noise and $$\sqrt{m}$$ is a normalization factor. The problem can be considered a rank-one special case of estimating a low-rank matrix in the presence of noise. Such low-rank estimation problems arise in a range of applications including blind channel estimation [13], antenna array processing [26], subspace system identification [32] and principal component or factor analysis [29].
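As a concrete reference point, the following sketch generates a synthetic instance of the model (1.1). The particular dimensions, noise level and factor distributions (a Gaussian left factor and a sparse Bernoulli-Gaussian right factor) are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 800, 400        # problem dimensions (illustrative)
tau_w = 0.1            # per-entry noise variance

# Factors with separable priors: Gaussian u0 and a sparse
# (Bernoulli-Gaussian) v0 -- assumed distributions for this demo.
u0 = rng.standard_normal(m)
v0 = rng.standard_normal(n) * (rng.random(n) < 0.2)

# Observation model (1.1): A = u0 v0^T + sqrt(m) W, with W_ij ~ N(0, tau_w)
W = rng.normal(0.0, np.sqrt(tau_w), size=(m, n))
A = np.outer(u0, v0) + np.sqrt(m) * W
```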
When the noise term $${\mathbf{W}}$$ is zero, the vector pair $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$ can be recovered exactly, up to a scaling, from the maximal left and right singular vectors of $${\mathbf{A}}$$ [25]. However, in the presence of noise, the rank-one matrix can in general only be estimated approximately. In this case, a priori information or constraints on $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$ may improve the estimation. Such constraints arise, for example, in factor analysis in statistics, where one of the factors is often constrained to be either positive or sparse [5]. Similar sparsity constraints occur in the problem of dictionary learning [41]. Also, in digital communications, one of the factors could come from a discrete quadrature amplitude modulation constellation. In this article, we impose the constraints on $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ in a Bayesian setting, where $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ are assumed to be independent of one another with independent and identically distributed (i.i.d.) components, so that their densities take the form $$\label{eq:puvDist} p_{{\mathbf{u}}_0}({\mathbf{u}}_0) = \prod_{i=1}^m p_{U_0}(u_{0i}), \quad p_{{\mathbf{v}}_0}({\mathbf{v}}_0) = \prod_{j=1}^n p_{V_0}(v_{0j}),$$ (1.2) for some scalar random variables $$U_0$$ and $$V_0$$. The noise $${\mathbf{W}}$$ in (1.1) is assumed to have i.i.d. Gaussian components, $$W_{ij} \sim {\mathcal N}(0,\tau_w)$$ for some variance $$\tau_w > 0$$. Under this model, the posterior density of $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ is given by $$\label{eq:puvA} p_{{\mathbf{u}}_0,{\mathbf{v}}_0|{\mathbf{A}}}({\mathbf{u}}_0,{\mathbf{v}}_0) \propto p_{{\mathbf{u}}_0}({\mathbf{u}}_0)p_{{\mathbf{v}}_0}({\mathbf{v}}_0) \exp\left[ -\frac{1}{2m\tau_w}\|{\mathbf{A}}-{\mathbf{u}}_0{\mathbf{v}}_0^\top\|^2_F \right]\!,$$ (1.3) where $$\|\cdot\|_F$$ is the Frobenius norm. Given this posterior density, we consider two estimation problems.

**MAP estimation.**
In this problem, we wish to find the maximum a posteriori estimates $$\label{eq:uvMAP} ({\widehat{\mathbf{u}}},{\widehat{\mathbf{v}}}) = \mathop{\mathrm{arg\,max}}_{{\mathbf{u}},{\mathbf{v}}} p_{{\mathbf{u}}_0,{\mathbf{v}}_0|{\mathbf{A}}}({\mathbf{u}},{\mathbf{v}}).$$ (1.4)

**MMSE estimation.** In this problem, we wish to find the posterior mean or, equivalently, the minimum mean-squared error (MMSE) estimate, $$\label{eq:uvMMSE} {\widehat{\mathbf{u}}} = {\mathbb{E}}({\mathbf{u}}_0|{\mathbf{A}}), \quad {\widehat{\mathbf{v}}} = {\mathbb{E}}({\mathbf{v}}_0|{\mathbf{A}}).$$ (1.5) We may also be interested in the posterior variances and posterior marginals. Exact computation of either of these estimates is generally intractable. Even though $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ are assumed to have independent components, the posterior density (1.3) is not, in general, separable due to the term $${\mathbf{u}}_0{\mathbf{v}}_0^\top$$. Thus, the MAP estimate requires a search over the $$m$$- and $$n$$-dimensional vectors $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$, and the MMSE estimate requires integration over this $$(m+n)$$-dimensional space. However, due to the separability assumption on the priors (1.2), it is computationally simpler to alternately estimate the factors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$. This form of alternating estimation is used, for example, in the classic alternating power method for finding maximal singular values [25] and in some alternating methods for sparse or non-negative dictionary learning [1,35,37,41,42]. The approach in this work also uses a form of alternating estimation, but is based on a powerful, recently developed class of algorithms known as approximate message passing (AMP). AMP methods are derived from Gaussian and quadratic approximations of loopy belief propagation in graphical models and were originally used for problems in compressed sensing [3,15–17,45,46].
The AMP methodology has been successfully applied in a range of applications [8,18,19,58] and has seen a number of improvements for stability and convergence [6,38,49–51,59]. In recent years, there has been growing interest in AMP for matrix factorization problems as well [34,36,43,44,47]. The problem considered in this article can be viewed as a simple rank-one case of these problems. Our main contribution in this work is to show that, for the rank-one MMSE and MAP estimation problems, the proposed iterative factorization (IterFac) algorithm admits an asymptotically exact characterization in the limit when $$m,n {\rightarrow} \infty$$ with $$m/n$$ constant and the components of the true vectors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ have limiting empirical distributions. In this scenario, we show that the empirical joint distribution of the components of $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ and the corresponding estimates from the IterFac algorithm is described by a simple scalar equivalent model, where the IterFac component estimates are identically distributed to scalar estimates of the variables corrupted by Gaussian noise. Moreover, the effective Gaussian noise level in this model is described by a simple set of scalar state evolution (SE) equations. This scalar equivalent model is common in analyses of AMP methods [3,15–17,45,46] as well as in replica analyses of related estimation problems [21,48,55]. From the scalar equivalent model, one can compute the asymptotic value of almost any component-separable metric, including mean-squared error or correlation. Thus, in addition to being computationally simple and general, the IterFac algorithm admits a precise analysis in the case of Gaussian noise. Moreover, since fixed points of the IterFac algorithm correspond, under suitable circumstances, to local maxima of objectives such as (1.4), the analysis can be used to characterize the behavior of such maxima, even if algorithms other than IterFac are used.
The main analytical tool is a recently developed technique by Bayati and Montanari [3] used in the analysis of AMP algorithms. That work proved that, in the limit of large Gaussian mixing matrices, the behavior of AMP estimates can be described by a scalar equivalent model with effective noise levels defined by certain scalar SE equations. Similar SE analyses have appeared in related contexts [4,22,23,40,45,46]. To prove the SE equations for the IterFac algorithm, we apply a key theorem from [3] with a simple change of variables and a slight modification to account for parameter adaptation. A conference version of this article appeared in [47]. This article provides all the proofs along with more detailed discussions and simulations. Since the original publication of the conference version [47], several other research groups have extended and built on the work. Importantly, [11] has shown that, for certain discrete priors, the IterFac algorithm is provably Bayes optimal in a large system limit. It was shown in [12] that an algorithm closely related to IterFac provides the best known scaling laws for the hidden clique problem. More recently, [43] and [31] have used related AMP-type methods for matrix factorization problems with rank greater than one.

## 2. Iterative rank-one factorization

The proposed IterFac algorithm is shown in Algorithm 1. Given a matrix $${\mathbf{A}} \in {\mathbb{R}}^{m{\times} n}$$, the algorithm outputs a sequence of estimates $$({\mathbf{u}}(t),{\mathbf{v}}(t))$$, $$t=0,1,\,{\ldots}\,$$, for $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$. The algorithm has several parameters, including the initial conditions, the parameters in lines 5 and 10, the termination condition and, most importantly, the functions $$G_u(\cdot)$$ and $$G_v(\cdot)$$.
In each iteration, the functions $$G_u(\cdot)$$ and $$G_v(\cdot)$$ are used to generate the estimates of the factors $${\mathbf{u}}(t)$$ and $${\mathbf{v}}(t)$$ and will be called the factor selection functions. Throughout this work, we assume that the factor selection functions are separable, meaning that they act componentwise on the vectors $${\mathbf{p}}(t)$$ and $${\mathbf{q}}(t)$$: $$\label{eq:Guvsep} u_i({t\! + \! 1}) = G_u(p_i(t),\lambda_u(t)), \quad v_j({t\! + \! 1}) = G_v(q_j(t),\lambda_v(t)),$$ (2.1) for some scalar functions $$G_u(\cdot)$$ and $$G_v(\cdot)$$. The choice of the factor selection function will depend on whether IterFac is used for MAP or MMSE estimation.

*Algorithm 1: Iterative Factorization (IterFac).*

**MAP estimation.** To describe the factor selection functions for the MAP estimation problem (1.4), let $$\lambda_u(t)$$ and $$\lambda_v(t)$$ be sequences of pairs of parameters $$\label{eq:lamuvDef} \lambda_u(t) := (\gamma_u(t),\nu_u(t)), \quad \lambda_v(t) := (\gamma_v(t),\nu_v(t)).$$ (2.2) Then, for each $$t$$, consider random vectors $${\mathbf{p}}(t)$$ and $${\mathbf{q}}(t)$$ given by \begin{align} \label{eq:pqAwgn} \begin{split} &{\mathbf{p}}(t) = \gamma_u(t){\mathbf{u}}_0 + {\mathbf{z}}_u(t), \quad {\mathbf{u}}_0 \sim p_{{\mathbf{u}}_0}({\mathbf{u}}_0), \quad {\mathbf{z}}_u(t) \sim {\mathcal N}(0,\nu_u(t){\mathbf{I}}), \\ &{\mathbf{q}}(t) = \gamma_v(t){\mathbf{v}}_0 + {\mathbf{z}}_v(t), \quad {\mathbf{v}}_0 \sim p_{{\mathbf{v}}_0}({\mathbf{v}}_0), \quad {\mathbf{z}}_v(t) \sim {\mathcal N}(0,\nu_v(t){\mathbf{I}}). \end{split} \end{align} (2.3) The random variables $${\mathbf{p}}(t)$$ and $${\mathbf{q}}(t)$$ in (2.3) are simply scaled versions of the true vectors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ with additive white Gaussian noise (AWGN).
Then, for the MAP problem, we take the factor selection functions to be the MAP estimates of $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$, given the observations $${\mathbf{p}}(t)$$ and $${\mathbf{q}}(t)$$: $$\label{eq:GuvMAP} G_u({\mathbf{p}}(t),\lambda_u(t)) := \mathop{\mathrm{arg\,max}}_{{\mathbf{u}}_0} p({\mathbf{u}}_0|{\mathbf{p}}(t),\gamma_u(t),\nu_u(t)), \quad G_v({\mathbf{q}}(t),\lambda_v(t)) := \mathop{\mathrm{arg\,max}}_{{\mathbf{v}}_0} p({\mathbf{v}}_0|{\mathbf{q}}(t),\gamma_v(t),\nu_v(t)).$$ (2.4) Importantly, due to the separability assumption on the priors (1.2), these MAP estimates can be computed componentwise: \begin{align} \label{eq:GuvMAPsep} \begin{split} & G_u(p_i,\lambda_u) = \mathop{\mathrm{arg\,min}}_{u_{0i}} \left[ -\log p_{U_0}(u_{0i}) + \frac{(\gamma_uu_{0i}-p_i)^2}{2\nu_u} \right]\!, \\ & G_v(q_j,\lambda_v) = \mathop{\mathrm{arg\,min}}_{v_{0j}} \left[ -\log p_{V_0}(v_{0j}) + \frac{(\gamma_vv_{0j}-q_j)^2}{2\nu_v} \right]\!. \end{split} \end{align} (2.5) Hence, the IterFac algorithm replaces the vector-valued MAP estimation of $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$ from the joint density (1.3) with a sequence of scalar MAP estimation problems along with multiplications by $${\mathbf{A}}$$ and $${\mathbf{A}}^\top$$.

**MMSE estimation.** For the MMSE estimation problem, we simply take the factor selection functions to be the MMSE estimates with respect to the random variables (2.3): $$\label{eq:GuvMMSE} G_u({\mathbf{p}}(t),\lambda_u(t)) = {\mathbb{E}}({\mathbf{u}}_0|{\mathbf{p}}(t),\gamma_u(t),\nu_u(t)), \quad G_v({\mathbf{q}}(t),\lambda_v(t)) = {\mathbb{E}}({\mathbf{v}}_0|{\mathbf{q}}(t),\gamma_v(t),\nu_v(t)).$$ (2.6) Again, due to the separability assumption (1.2), these MMSE estimation problems can be performed componentwise. Hence, MMSE IterFac reduces the vector MMSE problem to a sequence of scalar problems in Gaussian noise.
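For intuition, the scalar maps (2.5) and (2.6) have closed forms for simple priors. The sketch below assumes a Laplace prior for the MAP map (giving soft thresholding) and a zero-mean Gaussian prior for the MMSE map (giving linear shrinkage); both priors are illustrative choices, not requirements of the algorithm.

```python
import numpy as np

def G_map_laplace(p, gamma, nu, alpha=1.0):
    """Scalar MAP map (2.5) for an assumed Laplace prior
    p_U0(u) ∝ exp(-alpha*|u|): minimizing
    alpha*|u| + (gamma*u - p)^2/(2*nu) over u gives soft thresholding."""
    t = alpha * nu / gamma
    return np.sign(p) * np.maximum(np.abs(p) - t, 0.0) / gamma

def G_mmse_gauss(p, gamma, nu, tau=1.0):
    """Scalar MMSE map (2.6) for an assumed Gaussian prior U0 ~ N(0, tau):
    the conditional mean under p = gamma*U0 + N(0, nu) is linear in p."""
    return gamma * tau * p / (gamma ** 2 * tau + nu)
```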
## 3. Intuitive algorithm derivation and analysis

### 3.1 Algorithm intuition

Before we formally analyze the algorithm, it is instructive to understand the intuition behind the method. For both the MAP and MMSE versions of IterFac, first observe that $$\label{eq:pintuit} {\mathbf{p}}(t) \stackrel{(a)}{=} \frac{1}{m}{\mathbf{A}}{\mathbf{v}}(t)+\mu_u(t){\mathbf{u}}(t) \stackrel{(b)}{=} \frac{{\mathbf{v}}_0^\top{\mathbf{v}}(t)}{m} {\mathbf{u}}_0 + \frac{1}{\sqrt{m}}{\mathbf{W}}{\mathbf{v}}(t) + \mu_u(t){\mathbf{u}}(t) \stackrel{(c)}{=} \gamma_u(t){\mathbf{u}}_0 + {\mathbf{z}}_u(t),$$ (3.1) where (a) follows from line 6 in Algorithm 1, (b) follows from the measurement model (1.1), and in (c) we have used the definitions $$\label{eq:gamzu} \gamma_u(t) = \frac{1}{m}{\mathbf{v}}_0^\top{\mathbf{v}}(t), \quad {\mathbf{z}}_u(t) = \frac{1}{\sqrt{m}}{\mathbf{W}}{\mathbf{v}}(t) + \mu_u(t){\mathbf{u}}(t).$$ (3.2) For the first iteration, $${\mathbf{W}}$$ is a large zero-mean matrix independent of the initial condition $${\mathbf{v}}(0)$$. Also, $$\mu_u(0)=0$$ and hence the components of $${\mathbf{z}}_u(0)$$ will be asymptotically Gaussian zero-mean variables due to the central limit theorem. Hence, the vector $${\mathbf{p}}(0)$$ will be distributed as the true vector $${\mathbf{u}}_0$$ with Gaussian noise as in the model (2.3). Therefore, we can take an initial MAP or MMSE estimate of $${\mathbf{u}}_0$$ from the scalar estimation functions in (2.4) or (2.6). Unfortunately, in subsequent iterations with $$t>0$$, $${\mathbf{v}}(t)$$ is no longer independent of $${\mathbf{W}}$$ and hence $${\mathbf{W}}{\mathbf{v}}(t)$$ will not, in general, be a Gaussian zero-mean random vector. However, remarkably, we will show that with the specific choice of $$\mu_u(t)$$ in Algorithm 1, the addition of the term $$\mu_u(t){\mathbf{u}}(t)$$ in (3.2) ‘debiases’ the noise term, so that $${\mathbf{z}}_u(t)$$ is asymptotically zero-mean Gaussian.
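This iteration can be exercised numerically. The sketch below uses linear, norm-normalizing selection functions, and writes the debiasing terms in the AMP Onsager form $\mu_u(t) = -\beta\tau_w\langle G_v'\rangle$, $\mu_v(t) = -\tau_w\langle G_u'\rangle$; since the listing of Algorithm 1 is not reproduced in the text, both this form and the normalization choice are assumptions of the sketch, not the paper's exact specification.

```python
import numpy as np

def iterfac_linear(A, tau_w, T=30, seed=0):
    """Minimal sketch of the IterFac iteration with linear selection
    functions G_u(p) = lam_u*p, G_v(q) = lam_v*q, where lam_u and lam_v
    normalize the iterates (a data-adaptive choice in the spirit of the
    parameter-adaptation rules later in the text).  The debiasing terms
    use the assumed Onsager form mu_u = -beta*tau_w*G_v',
    mu_v = -tau_w*G_u'."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    beta = n / m
    u = np.zeros(m)                          # u(0) = 0, so mu_u(0)*u(0) = 0
    v = rng.standard_normal(n)
    v *= np.sqrt(n) / np.linalg.norm(v)      # ||v(0)||^2 / n = 1
    lam_v = 0.0                              # slope of G_v from previous step
    for _ in range(T):
        # p(t) = (1/m) A v(t) + mu_u(t) u(t), as in (3.1)
        p = A @ v / m - beta * tau_w * lam_v * u
        lam_u = np.sqrt(m) / np.linalg.norm(p)
        u = lam_u * p                        # u(t+1) = G_u(p(t))
        # q(t) = (1/m) A^T u(t+1) + mu_v(t) v(t)
        q = A.T @ u / m - tau_w * lam_u * v
        lam_v = np.sqrt(n) / np.linalg.norm(q)
        v = lam_v * q                        # v(t+1) = G_v(q(t))
    return u, v

# Demo on a synthetic instance of (1.1) with Gaussian factors (assumed)
rng = np.random.default_rng(1)
m, n, tau_w = 1000, 500, 0.05
u0, v0 = rng.standard_normal(m), rng.standard_normal(n)
A = np.outer(u0, v0) + np.sqrt(m) * rng.normal(0, np.sqrt(tau_w), (m, n))
u, v = iterfac_linear(A, tau_w)
rho_u = (u @ u0) ** 2 / ((u @ u) * (u0 @ u0))   # squared correlation
```

At this noise level the estimate ends up strongly, though not perfectly, correlated with the true factor, consistent with the partial-recovery behavior analyzed later in the text.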
Thus, $${\mathbf{p}}(t)$$ continues to be distributed as in (2.3), and we can construct MAP or MMSE estimates of $${\mathbf{u}}_0$$ from the vector $${\mathbf{p}}(t)$$. For example, using the MMSE estimation function (2.6) with appropriate selection of the parameters $$\lambda_u(t)$$, we obtain that the estimate for $${\mathbf{u}}({t\! + \! 1})$$ is given by the MMSE estimate ${\mathbf{u}}({t\! + \! 1}) = {\mathbb{E}}({\mathbf{u}}_0|{\mathbf{p}}(t)), \quad {\mathbf{p}}(t)=\gamma_u(t){\mathbf{u}}_0 + {\mathbf{z}}_u(t).$ For the MAP estimation problem, the MAP selection function (2.4) computes the MAP estimate for $${\mathbf{u}}_0$$. Similarly, for $${\mathbf{q}}(t)$$, we see that \begin{align} {\mathbf{q}}(t) &= \frac{1}{m}{\mathbf{A}}^\top{\mathbf{u}}({t\! + \! 1})+\mu_v(t){\mathbf{v}}(t) \nonumber \\ &= \frac{{\mathbf{u}}_0^\top{\mathbf{u}}({t\! + \! 1})}{m}{\mathbf{v}}_0 + \frac{1}{\sqrt{m}}{\mathbf{W}}^\top{\mathbf{u}}({t\! + \! 1}) + \mu_v(t){\mathbf{v}}(t) = \gamma_v(t){\mathbf{v}}_0 + {\mathbf{z}}_v(t), \label{eq:qintuit} \end{align} (3.3) for $$\gamma_v(t)={\mathbf{u}}_0^\top{\mathbf{u}}({t\! + \! 1})/m$$ and $${\mathbf{z}}_v(t)= (1/\sqrt{m}){\mathbf{W}}^\top{\mathbf{u}}({t\! + \! 1}) + \mu_v(t){\mathbf{v}}(t)$$. Then, $${\mathbf{v}}({t\! + \! 1})$$ is taken as the MMSE or MAP estimate of $${\mathbf{v}}_0$$ from the vector $${\mathbf{q}}(t)$$. Thus, we see that the algorithm attempts to produce a sequence of unbiased estimates $${\mathbf{p}}(t)$$ and $${\mathbf{q}}(t)$$ for scaled versions of $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$. Then, it constructs the MAP or MMSE estimates $${\mathbf{u}}(t)$$ and $${\mathbf{v}}({t\! + \! 1})$$ from these unbiased estimates.

### 3.2 SE analysis

The above ‘derivation’ of the algorithm suggests a possible method to analyze the IterFac algorithm.
Consider a sequence of problems indexed by $$n$$, and suppose that the dimension $$m=m(n)$$ grows linearly with $$n$$, in the sense that $$\label{eq:betaDef} \lim_{n {\rightarrow} \infty} n/m(n) = \beta,$$ (3.4) for some $$\beta > 0$$. For each $$n$$ and iteration number $$t$$, define the sets $$\label{eq:thetauvi} \theta_{u}(t) = \bigl\{ (u_{0i},u_i(t)), i=1,\ldots,m \bigr\}, \quad \theta_{v}(t) = \bigl\{ (v_{0j},v_j(t)), j =1,\ldots,n \bigr\},$$ (3.5) which are the sets of the components of the true vectors $$u_{0i}$$ and $$v_{0j}$$ and their corresponding estimates $$u_i(t)$$ and $$v_j(t)$$. To characterize the quality of the estimates, we would like to describe the distributions of the pairs in $$\theta_u(t)$$ and $$\theta_v(t)$$. Our formal analysis below will provide an exact characterization of these distributions. Specifically, we will show that the sets have empirical limits of the form $$\label{eq:thetauvLim} \lim_{n {\rightarrow} \infty} \theta_{u}(t) \stackrel{d}{=} (U_0,U(t)), \quad \lim_{n {\rightarrow} \infty} \theta_{v}(t) \stackrel{d}{=} (V_0,V(t)),$$ (3.6) for some random variable pairs $$(U_0,U(t))$$ and $$(V_0,V(t))$$. The precise definition of empirical limits is given in Appendix A.1, but loosely, it means that the empirical distribution of the components in $$\theta_u(t)$$ and $$\theta_v(t)$$ converges to that of certain random variable pairs. In addition, we can inductively derive the distributions of the limits in (3.6). To keep matters simple, suppose that the parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ are not selected adaptively but instead given by fixed sequences $${\overline{\lambda}}_u(t)$$ and $${\overline{\lambda}}_v(t)$$.
Also, given random variables in the limits of (3.6), define the deterministic constants $$\label{eq:alphaSE} {\overline{\alpha}}_{u0}(t) = {\mathbb{E}}\left[ U(t)^2 \right]\!, \quad {\overline{\alpha}}_{u1}(t) = {\mathbb{E}}\left[ U_0U(t) \right]\!, \quad {\overline{\alpha}}_{v0}(t) = {\mathbb{E}}\left[ V(t)^2 \right]\!, \quad {\overline{\alpha}}_{v1}(t) = {\mathbb{E}}\left[ V_0V(t) \right]\!.$$ (3.7) Now suppose that the second limit in (3.6) holds for some $$t$$. Then, $$\gamma_u(t)$$ in (3.2) would have the following limit: $\lim_{n {\rightarrow} \infty} \gamma_u(t) = \lim_{n{\rightarrow} \infty} \frac{n}{m} \frac{1}{n} \sum_{j=1}^n v_{0j}v_j(t) = \beta {\overline{\alpha}}_{v1}(t).$ Also, if we ignore the debiasing term with $$\mu_u(t)$$ and assume that $${\mathbf{W}}$$ is independent of $${\mathbf{v}}(t)$$, the variance of the components of $${\mathbf{z}}_u(t)$$ in (3.2) would be ${\bf var}(z_{ui}(t)) = \frac{1}{m}\sum_{j=1}^n {\bf var}(W_{ij}v_j(t)) = \beta \tau_w {\overline{\alpha}}_{v0}(t).$ Thus, we would expect a typical component of $$p_i(t)$$ in (3.1) to be distributed as $P(t) = \beta {\overline{\alpha}}_{v1}(t)U_0+Z_u(t), \quad Z_u(t) \sim {\mathcal N}(0,\beta\tau_w{\overline{\alpha}}_{v0}(t)).$ Due to the separability assumption (2.1), each component is given by $$u_i({t\! + \! 1}) = G_u(p_i(t),{\overline{\lambda}}_u(t))$$. So, we would expect the components to follow the distribution $$\label{eq:USE} U({t\! + \! 1}) = G_u(P(t),{\overline{\lambda}}_u(t)), \quad P(t) = \beta {\overline{\alpha}}_{v1}(t)U_0+Z_u(t), \quad Z_u(t) \sim {\mathcal N}(0,\beta\tau_w{\overline{\alpha}}_{v0}(t)).$$ (3.8) This provides an exact description of the expected joint density of $$(U_0,U({t\! + \! 1}))$$. From this density, we can then compute $${\overline{\alpha}}_{u0}({t\! + \! 1})$$ and $${\overline{\alpha}}_{u1}({t\! + \! 1})$$ in (3.7). 
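The $u$-side update above, together with the analogous $v$-side update, yields a recursion that is straightforward to evaluate by Monte Carlo. The sketch below assumes unit-variance Gaussian priors for $U_0$ and $V_0$ with the corresponding (linear) MMSE selection functions; these priors, and the positive initial correlation (the zero start is a trivial fixed point for a zero-mean prior), are illustrative choices.

```python
import numpy as np

def se_iteration(beta, tau_w, T=10, N=200_000, seed=0):
    """Monte Carlo evaluation of the state evolution (3.7)-(3.9) for
    assumed unit-variance Gaussian priors U0, V0 and the matching
    (linear) MMSE selection functions."""
    rng = np.random.default_rng(seed)
    a_v0, a_v1 = 1.0, 0.1          # initial alpha_v0, alpha_v1 (assumed)
    for _ in range(T):
        # u-update (3.8): P = beta*a_v1*U0 + Z, Z ~ N(0, beta*tau_w*a_v0)
        U0 = rng.standard_normal(N)
        gam, nu = beta * a_v1, beta * tau_w * a_v0
        P = gam * U0 + np.sqrt(nu) * rng.standard_normal(N)
        U = gam * P / (gam ** 2 + nu)          # MMSE map for U0 ~ N(0,1)
        a_u0, a_u1 = np.mean(U ** 2), np.mean(U0 * U)      # as in (3.7)
        # v-update (3.9): Q = a_u1*V0 + Z, Z ~ N(0, tau_w*a_u0)
        V0 = rng.standard_normal(N)
        gam, nu = a_u1, tau_w * a_u0
        Q = gam * V0 + np.sqrt(nu) * rng.standard_normal(N)
        V = gam * Q / (gam ** 2 + nu)
        a_v0, a_v1 = np.mean(V ** 2), np.mean(V0 * V)
    return a_u0, a_u1, a_v0, a_v1

a_u0, a_u1, a_v0, a_v1 = se_iteration(beta=0.5, tau_w=0.05)
rho_u = a_u1 ** 2 / a_u0    # squared correlation implied by the statistics
```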
A similar calculation, again taking the debiasing and Gaussianity arguments as valid, shows that the limiting empirical distribution of $$\theta_v({t\! + \! 1})$$ in (3.6) should follow $$\label{eq:VSE} V({t\! + \! 1}) = G_v( Q(t),{\overline{\lambda}}_v(t)), \quad Q(t) = {\overline{\alpha}}_{u1}({t\! + \! 1})V_0+Z_v(t), \quad Z_v(t) \sim {\mathcal N}(0,\tau_w{\overline{\alpha}}_{u0}({t\! + \! 1}) ).$$ (3.9) This provides the joint density of $$(V_0,V({t\! + \! 1}))$$, from which we can compute $${\overline{\alpha}}_{v0}({t\! + \! 1})$$ and $${\overline{\alpha}}_{v1}({t\! + \! 1})$$ in (3.7). Thus, we have provided a simple method to recursively compute the joint densities of the limits in (3.6) and their second-order statistics (3.7).

## 4. Asymptotic analysis under Gaussian noise

### 4.1 Main results

In the above intuitive analysis, we did not formally describe the sense of convergence nor offer any formal justification for the Gaussianity assumptions. In addition, we assumed that the parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ were fixed sequences. In reality, we will need them to be data adaptive. We will make the above arguments rigorous under the following assumptions. Note that although we will be interested in MAP or MMSE estimation functions, our analysis will apply to arbitrary factor selection functions $$G_u(\cdot)$$ and $$G_v(\cdot)$$.

**Assumption 4.1** Consider a sequence of random realizations of the estimation problem in Section 1 indexed by the dimension $$n$$. The matrix $${\mathbf{A}}$$ and the parameters in Algorithm 1 satisfy the following: (a) For each $$n$$, the output dimension $$m=m(n)$$ is deterministic and scales linearly with $$n$$ as in (3.4). (b) The matrix $${\mathbf{A}}$$ has the form (1.1), where $${\mathbf{u}}_0 \in {\mathbb{R}}^m$$ and $${\mathbf{v}}_0 \in {\mathbb{R}}^n$$ represent ‘true’ left and right factors of a rank-one term, and $${\mathbf{W}} \in {\mathbb{R}}^{m {\times} n}$$ is an i.i.d.
Gaussian matrix with components $$W_{ij} \sim {\mathcal N}(0,\tau_w)$$ for some $$\tau_w > 0$$. (c) The factor selection functions $$G_u({\mathbf{p}},\lambda_u)$$ and $$G_v({\mathbf{q}},\lambda_v)$$ in lines 7 and 12 are componentwise separable in that for all component indices $$i$$ and $$j$$, $$\label{eq:GuvSep} G_u({\mathbf{p}},\lambda_u)_i = G_u(p_i,\lambda_u), \quad G_v({\mathbf{q}},\lambda_v)_j = G_v(q_j,\lambda_v),$$ (4.1) for some scalar functions $$G_u(p,\lambda_u)$$ and $$G_v(q,\lambda_v)$$. The scalar functions must be differentiable in $$p$$ and $$q$$. Moreover, for each $$t$$, the functions $$G_u(p,\lambda_u)$$ and $$\partial G_u(p,\lambda_u)/\partial p$$ must be Lipschitz continuous in $$p$$ with a Lipschitz constant that is continuous in $$\lambda_u$$; in addition, they must be continuous in $$\lambda_u$$ uniformly over $$p$$. Similarly, for each $$t$$, the functions $$G_v(q,\lambda_v)$$ and $$\partial G_v(q,\lambda_v)/\partial q$$ must be Lipschitz continuous in $$q$$ with a Lipschitz constant that is continuous in $$\lambda_v$$; in addition, they must be continuous in $$\lambda_v$$ uniformly over $$q$$. (d) The parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ are computed via $$\label{eq:lamRankOne} \lambda_u(t) = \frac{1}{n} \sum_{j=1}^n \phi_{\lambda u}(t,v_{0j},v_j(t)), \quad \lambda_v(t) = \frac{1}{m} \sum_{i=1}^m \phi_{\lambda v}(t,u_{0i},u_i({t\! + \! 1})),$$ (4.2) for (possibly vector-valued) functions $$\phi_{\lambda u}(\cdot)$$ and $$\phi_{\lambda v}(\cdot)$$ that are pseudo-Lipschitz continuous of order $$p=2$$. (e) For $$t=0$$, the sets (3.6) empirically converge with bounded moments of order $$2$$ to the limits $$\label{eq:thetauvLimInit} \lim_{n {\rightarrow} \infty} \theta_u(0) \stackrel{d}{=} (U_0,U(0)), \quad \lim_{n {\rightarrow} \infty} \theta_v(0) \stackrel{d}{=} (V_0,V(0)),$$ (4.3) for some random variable pairs $$(U_0,U(0))$$ and $$(V_0,V(0))$$. See Appendix A.1 for a precise definition of the empirical convergence used here. The assumptions require some explanation.
Assumptions 4.1(a) and (b) simply state that we are considering an asymptotic analysis for certain large matrices $${\mathbf{A}}$$ consisting of a random rank-one matrix plus Gaussian noise. The analysis of Algorithm 1 for higher ranks is not yet known, but we discuss some possible directions later. Assumption 4.1(c) is a mild condition on the factor selection functions. In particular, the separability assumption holds for the MAP or MMSE functions (2.4) and (2.6) under separable priors. Assumption 4.1(d) allows for the parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ in the factor selection functions to be data dependent, provided that they can each be determined via empirical averages of some function of the most recent data. Assumption 4.1(e) is the initial induction hypothesis. Under these assumptions, we can recursively define the sequences of random variables $$(U_0,U(t))$$ and $$(V_0,V(t))$$ as described above. For the parameters $${\overline{\lambda}}_u(t)$$ and $${\overline{\lambda}}_v(t)$$, define them from the expectations $$\label{eq:lambdaSE} {\overline{\lambda}}_{u}(t) = {\mathbb{E}}\left[ \phi_{\lambda u}(t,V_0,V(t)) \right]\!, \quad {\overline{\lambda}}_{v}(t) = {\mathbb{E}}\left[ \phi_{\lambda v}(t,U_0,U({t\! + \! 1})) \right]\!.$$ (4.4) These are the limiting values we would expect given the adaptation rules (4.2).

**Theorem 4.2** Under Assumption 4.1, the sets $$\theta_u(t)$$ and $$\theta_v(t)$$ in (3.5) converge empirically with bounded moments of order $$p=2$$ to the limits in (3.6). Proof. See Appendix A.3. □

### 4.2 Scalar equivalent model

The main contribution of Theorem 4.2 is that it provides a simple scalar equivalent model for the asymptotic behavior of the algorithm. The sets $$\theta_u(t) = \{(u_{0i},u_i(t))\}$$ and $$\theta_v(t) = \{ (v_{0j},v_j(t)) \}$$ in (3.5) are the components of the true vectors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ and their estimates $${\mathbf{u}}(t)$$ and $${\mathbf{v}}(t)$$.
The theorem shows that empirical distributions of these components are asymptotically equivalent to simple random variable pairs $$(U_0,U(t))$$ and $$(V_0,V(t))$$ given by (3.8) and (3.9). In this scalar system, the variable $$U({t\! + \! 1})$$ is the output of the factor selection function $$G_u(\cdot)$$ applied to a scaled and Gaussian-noise-corrupted version of the true variable $$U_0$$. Similarly, $$V({t\! + \! 1})$$ is the output of the factor selection function $$G_v(\cdot)$$ applied to a scaled and Gaussian-noise-corrupted version of the true variable $$V_0$$. Following [20], we can thus call the result a single-letter characterization of the algorithm. From this single-letter characterization, one can exactly compute a large class of performance metrics of the algorithm. Specifically, the empirical convergence of $$\theta_u(t)$$ shows that for any pseudo-Lipschitz function $$\phi(u_0,u)$$ of order $$p$$, the following limit exists almost surely: $$\label{eq:phiuLim} \lim_{n {\rightarrow} \infty} \frac{1}{m} \sum_{i=1}^m \phi(u_{0i},u_i(t)) = {\mathbb{E}}\left[ \phi(U_0,U(t)) \right]\!,$$ (4.5) where the expectation on the right-hand side is over the variables $$(U_0,U(t))$$ with $$U_0$$ identical to the variable in the limit in (4.3) and $$U(t)$$ given by (3.8). This expectation can thus be explicitly evaluated by a simple two-dimensional integral and consequently any component-separable performance metric based on a suitably continuous loss function $$\phi(u_0,u)$$ can be exactly computed. For example, if we take $$\phi(u_0,u) = (u-u_0)^2$$, we can compute the asymptotic mean-squared error of the estimate, \begin{eqnarray} \lim_{n {\rightarrow} \infty} \frac{1}{m} \|{\mathbf{u}}(t)-{\mathbf{u}}_0\|^2 = \lim_{n {\rightarrow} \infty} \frac{1}{m} \sum_{i=1}^m (u_i(t)-u_{0i})^2 \nonumber = {\mathbb{E}}\left[ (U_0-U(t))^2 \right]\!. 
\label{eq:mseuLim} \end{eqnarray} Also, for each $$t$$, define the empirical second-order statistics $$\label{eq:alphauv} \alpha_{u0}(t) = \frac{1}{m}\|{\mathbf{u}}(t)\|^2, ~ \alpha_{u1}(t) = \frac{1}{m}{\mathbf{u}}(t)^\top{\mathbf{u}}_0, ~ \alpha_{v0}(t) = \frac{1}{n}\|{\mathbf{v}}(t)\|^2, ~ \alpha_{v1}(t) = \frac{1}{n}{\mathbf{v}}(t)^\top{\mathbf{v}}_0.$$ (4.6) Since $$\|{\mathbf{u}}(t)\|^2 = \sum_i u_i(t)^2$$, it follows that $$\alpha_{u0}(t) {\rightarrow} {\mathbb{E}}(U(t)^2)$$ almost surely as $$n {\rightarrow} \infty$$. In this way, we obtain that the following limits hold almost surely: $$\label{eq:alphauvLim} \lim_{n {\rightarrow} \infty} \alpha_{u0}(t) = {\overline{\alpha}}_{u0}(t), ~ \lim_{n {\rightarrow} \infty} \alpha_{u1}(t) = {\overline{\alpha}}_{u1}(t), ~ \lim_{n {\rightarrow} \infty} \alpha_{v0}(t) = {\overline{\alpha}}_{v0}(t), ~ \lim_{n {\rightarrow} \infty} \alpha_{v1}(t) = {\overline{\alpha}}_{v1}(t).$$ (4.7) We will also use the definitions $$\label{eq:tauuv} \tau_u := {\mathbb{E}}[U_0^2], \quad \tau_v := {\mathbb{E}}[V_0^2].$$ (4.8) From the second-order statistics, we can compute the asymptotic correlation coefficient between $${\mathbf{u}}_0$$ and its estimate $${\mathbf{u}}$$, given by \begin{eqnarray} \rho_u(t) &:=& \lim_{n {\rightarrow} \infty} \frac{|{\mathbf{u}}(t)^\top{\mathbf{u}}_0|^2} {\|{\mathbf{u}}(t)\|^2\|{\mathbf{u}}_0\|^2} = \lim_{n {\rightarrow} \infty} \frac{|({\mathbf{u}}(t)^\top{\mathbf{u}}_0)/m|^2} {(\|{\mathbf{u}}(t)\|^2/m)(\|{\mathbf{u}}_0\|^2/m)} \nonumber \\ &=& \frac{ \bigl[{\mathbb{E}}(U(t)U_0)\bigr]^2 }{ {\mathbb{E}} U(t)^2 {\mathbb{E}} U_0^2} = \frac{{\overline{\alpha}}_{u1}^2(t)}{{\overline{\alpha}}_{u0}(t)\tau_u}.
\label{eq:corruLim} \end{eqnarray} (4.9) Similarly, the asymptotic correlation coefficient between $${\mathbf{v}}_0$$ and $${\mathbf{v}}$$ has the simple expression $$\label{eq:corrvLim} \rho_v(t) := \lim_{n {\rightarrow} \infty} \frac{|{\mathbf{v}}(t)^\top{\mathbf{v}}_0|^2} {\|{\mathbf{v}}(t)\|^2\|{\mathbf{v}}_0\|^2} = \frac{{\overline{\alpha}}_{v1}^2(t)}{{\overline{\alpha}}_{v0}(t)\tau_v}.$$ (4.10) The correlation coefficient is a useful metric since, without additional constraints, the factors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ can be estimated only up to a scalar, and the correlation coefficient is scale invariant.

## 5. Examples

The SE analysis can be used to exactly predict the asymptotic behavior of the IterFac algorithm for any smooth scalar estimation functions, including the MAP or MMSE functions (2.4) and (2.6). There are, however, two cases where the SE equations have particularly simple and interesting solutions: linear estimation functions and MMSE functions.

### 5.1 Linear selection functions

Suppose we use linear selection functions of the form $$\label{eq:Guvlin} G_u({\mathbf{p}},\lambda_u) =\lambda_u{\mathbf{p}},\quad G_v({\mathbf{q}},\lambda_v)= \lambda_v{\mathbf{q}},$$ (5.1) where the parameters $$\lambda_u$$ and $$\lambda_v$$ allow for normalization or other scalings of the outputs. Linear selection functions of the form (5.1) arise when one selects $$G_u(\cdot)$$ and $$G_v(\cdot)$$ from the MAP or MMSE functions (2.4) or (2.6) with Gaussian priors. With Gaussian priors, the solution to the MAP problem (1.4) is for $$({\widehat{\mathbf{u}}},{\widehat{\mathbf{v}}})$$ to be the (appropriately scaled) left and right maximal singular vectors of $${\mathbf{A}}$$. We will thus call the estimates $$({\mathbf{u}}(t),{\mathbf{v}}(t))$$ of Algorithm 1 with the linear selection functions (5.1) the estimated maximal singular vectors.
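As a quick numerical sanity check of this interpretation, one can compare the top singular vectors of a synthetic matrix of the form (1.1) against the true factors using the scale-invariant metric (4.9)-(4.10). The dimensions, noise level and Gaussian factor distributions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, tau_w = 2000, 1000, 0.05
u0 = rng.standard_normal(m)
v0 = rng.standard_normal(n)
A = np.outer(u0, v0) + np.sqrt(m) * rng.normal(0, np.sqrt(tau_w), (m, n))

# Maximal left/right singular vectors as the estimates
U, s, Vt = np.linalg.svd(A, full_matrices=False)
u_hat, v_hat = U[:, 0], Vt[0]

def corr(x_hat, x0):
    """Scale-invariant squared correlation, as in (4.9)/(4.10)."""
    return (x_hat @ x0) ** 2 / (np.dot(x_hat, x_hat) * np.dot(x0, x0))

rho_u, rho_v = corr(u_hat, u0), corr(v_hat, v0)
```

At this noise level the correlations come out high but visibly below one, reflecting that the factors are only partially recoverable in noise.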
Theorem 5.1 Consider the state evolution equation (3.7), (3.8) and (3.9) with the linear selection functions (5.1). Then we have the following properties: (a) The asymptotic correlation coefficients (4.9) and (4.10) satisfy the following recursive rules: $$\label{eq:corrLinSE} \rho_u({t\! + \! 1}) = \frac{\beta \tau_u\tau_v \rho_v(t)}{ \beta \tau_u\tau_v \rho_v(t) + \tau_w}, \quad \rho_v(t) = \frac{\tau_u\tau_v \rho_u(t)}{\tau_u\tau_v \rho_u(t) + \tau_w}.$$ (5.2) (b) For any positive initial condition $$\rho_v(0) > 0$$, the asymptotic correlation coefficients converge to the limits $$\label{eq:corrLinLimit} \lim_{t {\rightarrow} \infty} \rho_u(t) = \rho_u^* := \frac{\left[ \beta\tau_u^2\tau_v^2 - \tau_w^2 \right]_+}{ \tau_u\tau_v(\beta\tau_u\tau_v + \tau_w)}, \quad \lim_{t {\rightarrow} \infty} \rho_v(t) = \rho_v^* := \frac{\left[\beta\tau_u^2\tau_v^2 - \tau_w^2\right]_+}{ \beta \tau_u\tau_v(\tau_u\tau_v + \tau_w)},$$ (5.3) where $$[x]_+= \max\{0,x\}$$. Proof. See Appendix A.4. □ The theorem provides a set of recursive equations for the asymptotic correlation coefficients $$\rho_u(t)$$ and $$\rho_v(t)$$ along with simple expressions for the limiting values as $$t {\rightarrow} \infty$$. We thus obtain exactly how correlated the estimated maximal singular vectors of a matrix $${\mathbf{A}}$$ of the form (1.1) are to the rank-one factors $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$. The proof of the theorem also provides expressions for the second-order statistics in (3.7) to be used in the scalar equivalent model. The fixed point expressions (5.3) agree with the more general results in [33] that derive the correlations for ranks greater than one and low-rank recovery with missing entries. Similar results can also be found in [2,7,28]. An interesting consequence of the expressions in (5.3) is that unless $$\label{eq:snrMinLin} \sqrt{\beta}\tau_u\tau_v > \tau_w,$$ (5.4) the asymptotic correlation coefficients are exactly zero. 
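The recursion in part (a) and the limits in part (b) are straightforward to check numerically. The sketch below (our own illustration; the function names are not from the article) iterates (5.2) from a positive initial condition and compares the result against the closed-form limits (5.3):

```python
def linear_se(beta, tau_u, tau_v, tau_w, rho_v=0.5, iters=200):
    """Iterate the state evolution recursion (5.2) for linear selection functions."""
    a = tau_u * tau_v
    rho_u = 0.0
    for _ in range(iters):
        rho_u = beta * a * rho_v / (beta * a * rho_v + tau_w)  # first relation in (5.2)
        rho_v = a * rho_u / (a * rho_u + tau_w)                # second relation in (5.2)
    return rho_u, rho_v

def linear_se_limits(beta, tau_u, tau_v, tau_w):
    """Closed-form limits (5.3); both are zero below the threshold (5.4)."""
    a = tau_u * tau_v
    num = max(beta * a**2 - tau_w**2, 0.0)
    return num / (a * (beta * a + tau_w)), num / (beta * a * (a + tau_w))
```

Above the threshold (5.4), $$\sqrt{\beta}\tau_u\tau_v > \tau_w$$, the iteration converges to the positive limits; below it, both correlations decay geometrically to zero, consistent with the theorem.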
The ratio $$\tau_u\tau_v/\tau_w$$ can be interpreted as a scaled signal-to-noise ratio (SNR). 5.2 MMSE estimation Next, suppose we use the MMSE selection functions (2.6). Using the scalar equivalent models (3.8) and (3.9), we take the scaling factor and noise parameters as \begin{align} \label{eq:mmseParam} \begin{split} &\gamma_u(t) = \beta {\overline{\alpha}}_{v1}(t), \quad \nu_u(t)=\beta \tau_w {\overline{\alpha}}_{v0}(t), \\ &\gamma_v(t) = {\overline{\alpha}}_{u1}({t\! + \! 1}), \quad \nu_v(t)= \tau_w {\overline{\alpha}}_{u0}({t\! + \! 1}). \end{split} \end{align} (5.5) Observe that these parameters can be computed from the SE equations; hence they can be determined offline and are not data dependent. We can use the initial condition $$v_j(0) = {\mathbb{E}}[V_0]$$ for all $$j$$, so that the initial variable in (4.3) is $$V(0) = {\mathbb{E}}[V_0]$$. To analyze the algorithm, define $$\label{eq:mmseuv} {\mathcal E}_u(\eta_u) = {\bf var}(U_0 {\,|\,} Y=\sqrt{\eta_u}U_0 + D), \quad {\mathcal E}_v(\eta_v) = {\bf var}(V_0 {\,|\,} Y=\sqrt{\eta_v}V_0 + D),$$ (5.6) where $$D \sim \mathcal{N}(0,1)$$ is independent of $$U_0$$ and $$V_0$$. That is, $${\mathcal E}_u(\eta_u)$$ and $${\mathcal E}_v(\eta_v)$$ are the mean-squared errors of estimating $$U_0$$ and $$V_0$$ from observations $$Y$$ with SNRs of $$\eta_u$$ and $$\eta_v$$. The functions $${\mathcal E}_u(\cdot)$$ and $${\mathcal E}_v(\cdot)$$ arise in a range of estimation problems, and the analytic and functional properties of these functions can be found in [24,60]. Theorem 5.2 Consider the solutions to the SE equations (3.7), (3.8) and (3.9) under the MMSE selection functions (2.6) with parameters (5.5) and initial condition $$V(0) = {\mathbb{E}}[V_0]$$. Then we have the following properties: (a) For all $$t$$, the asymptotic correlation coefficients (4.9) and (4.10) satisfy the recursive relationships $$\label{eq:corrSEmmse} \rho_u({t\! + \! 1}) = 1 - \frac{1}{\tau_u}{\mathcal E}_u( \beta\tau_v\rho_v(t)/\tau_w), \quad \rho_v(t) = 1 - \frac{1}{\tau_v}{\mathcal E}_v( \tau_u\rho_u(t)/\tau_w),$$ (5.7) with the initial condition $$\rho_v(0)=({\mathbb{E}} V_0)^2/\tau_v$$. (b) If, in addition, $${\mathcal E}_u(\eta_u)$$ and $${\mathcal E}_v(\eta_v)$$ are continuous, then for any positive initial condition, $$\rho_v(0)>0$$, as $$t {\rightarrow} \infty$$, the asymptotic correlation coefficients $$(\rho_u(t),\rho_v(t))$$ increase monotonically to fixed points $$(\rho_u^*, \rho_v^*)$$ of (5.7) with $$\rho_v^* > 0$$. Proof. See Appendix A.5. □ Again, we see that we can obtain simple, explicit recursions for the asymptotic correlations. Moreover, the asymptotic correlations provably converge to fixed points of the SE equations. The proof of the theorem also provides expressions for the second-order statistics in (3.7) to be used in the scalar equivalent model. 5.3 Zero initial conditions The limiting condition in part (b) of Theorem 5.2 requires that $$\rho_v(0) > 0$$, which occurs when $${\mathbb{E}}[V_0] \neq 0$$. Suppose, on the other hand, that $${\mathbb{E}}[U_0] = {\mathbb{E}}[V_0] = 0$$. Then, the initial condition will be $$V(0)= {\mathbb{E}}[V_0] = 0$$. Under this initial condition, a simple calculation shows that the SE equations (5.7) will generate a sequence with $$\rho_v(t) = \rho_u(t)=0$$ for all $$t$$. Thus, the IterFac algorithm will produce no useful estimates. Of course, with zero-mean random variables, a more sensible initial condition is to take $${\mathbf{v}}(0)$$ to be some non-zero random vector, as is commonly done in power-iteration recursions for computing maximal singular vectors. To understand the behavior of the algorithm under this random initial condition, let $$\label{eq:rhovn} \rho_v(t,n) := \frac{|{\mathbf{v}}(t)^\top{\mathbf{v}}_0|^2}{\|{\mathbf{v}}_0\|^2 \|{\mathbf{v}}(t)\|^2},$$ (5.8) where we have explicitly denoted the dependence on the problem dimension $$n$$.
From (4.10), we have that $$\lim_{n {\rightarrow}\infty} \rho_v(t,n) = \rho_v(t)$$ for all $$t$$. Also, with a random initial condition $${\mathbf{v}}(0)$$ independent of $${\mathbf{v}}_0$$, it can be checked that $$\rho_v(0,n) = O(1/n)$$, so that $\rho_v(0) = \lim_{n {\rightarrow}\infty} \rho_v(0,n) = 0.$ Hence, from the SE equations (5.7), $$\rho_v(t) = \rho_u(t)=0$$ for all $$t$$. That is, $$\label{eq:corrLimZero} \lim_{t {\rightarrow} \infty} \lim_{n {\rightarrow}\infty} \rho_v(t,n) = 0.$$ (5.9) This limit suggests that, even with random initial conditions, the IterFac algorithm will not produce a useful estimate. However, it is still possible that the limit in the opposite order to (5.9) may be non-zero: $$\label{eq:rhontLim} \lim_{n {\rightarrow} \infty} \lim_{t {\rightarrow}\infty} \rho_v(t,n) > 0.$$ (5.10) That is, for each $$n$$, it may be possible to obtain a non-zero correlation, but the number of iterations for convergence increases with $$n$$, since the algorithm starts from a vanishingly small initial correlation. Unfortunately, our SE analysis cannot make predictions on limits in the order of (5.10). We can, however, analyze the following limit: Lemma 5.1 Consider the MMSE SE equations (5.7) with random variables $$U_0$$ and $$V_0$$ such that $${\mathbb{E}}[V_0] = {\mathbb{E}}[U_0] = 0$$. For each $$\epsilon > 0$$, let $$\rho_v^\epsilon(t)$$ be the solution to the SE equations with an initial condition $$\rho_v(0)= \epsilon$$. Then, we have the following results: (a) If $$\sqrt{\beta} \tau_u \tau_v > \tau_w$$, $$\label{eq:corrvLimEpsPos} \lim_{\epsilon {\rightarrow} 0} \lim_{t {\rightarrow} \infty} \rho^\epsilon_v(t) > 0.$$ (5.11) (b) Conversely, if $$\sqrt{\beta} \tau_u \tau_v < \tau_w$$, $$\label{eq:corrvLimEpsNeg} \lim_{\epsilon {\rightarrow} 0} \lim_{t {\rightarrow} \infty} \rho^\epsilon_v(t) = 0.$$ (5.12) Proof. See Appendix A.6. □ The result of the lemma is somewhat disappointing.
The lemma shows that $$\sqrt{\beta} \tau_u \tau_v > \tau_w$$ is essentially necessary and sufficient for the IterFac algorithm with MMSE estimates to be able to overcome arbitrarily small initial conditions and obtain an estimate with a non-zero correlation to the true vector. Unfortunately, this is identical to condition (5.4) for the linear estimator to obtain a non-zero correlation. Thus, the IterFac algorithm with MMSE estimates performs no better than simple linear estimation in the initial iterations when the priors have zero means. Since linear estimation is equivalent to finding maximal singular vectors without any particular constraints, we could interpret Lemma 5.1 as saying that the IterFac algorithm under MMSE estimation cannot exploit structure in the components in the initial iterations. As a result, at low SNRs it may be necessary to use other algorithms to produce an initial condition for IterFac; such procedures require further study. 6. Numerical simulation 6.1 Bernoulli–exponential To validate the SE analysis, we first consider a simple case where the left factor $${\mathbf{u}}_0 \in {\mathbb{R}}^m$$ is i.i.d. Gaussian, zero-mean and $${\mathbf{v}}_0 \in {\mathbb{R}}^n$$ has Bernoulli–exponential components:1 $$\label{eq:vBernExp} v_{0j} \sim \left\{ \begin{array}{ll} 0 & \mbox{with prob } 1-\lambda, \\ \mbox{Exp}(1) & \mbox{with prob } \lambda, \end{array} \right.$$ (6.1) which provides a simple model for a sparse, positive vector. The parameter $$\lambda$$ is the fraction of non-zero components and is set in this simulation to $$\lambda=0.1$$. Note that these components have a non-zero mean so the difficulties of Section 5.3 are avoided. 
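The first two moments of the prior (6.1) are available in closed form: $${\mathbb{E}}[V_0] = \lambda$$ and $$\tau_v = {\mathbb{E}}[V_0^2] = 2\lambda$$, since an Exp(1) variable has mean 1 and second moment 2. A quick Monte Carlo check of these values (our own sketch, not code from the article) at the simulated sparsity $$\lambda = 0.1$$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 0.1, 1_000_000

# Draw Bernoulli-exponential components per (6.1): 0 w.p. 1-lam, Exp(1) w.p. lam.
v0 = np.where(rng.random(n) < lam, rng.exponential(1.0, size=n), 0.0)

mean_v0 = v0.mean()       # analytic value: lam = 0.1, so rho_v(0) > 0
tau_v = np.mean(v0**2)    # analytic value: 2*lam = 0.2
```

The non-zero mean is exactly what guarantees a positive initial correlation $$\rho_v(0) = ({\mathbb{E}} V_0)^2/\tau_v = \lambda/2$$ in Theorem 5.2.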
The dimensions are $$(m,n)=(1000,500)$$, and the noise level $$\tau_w$$ is set according to the scaled SNR defined as $$\label{eq:snrDef} {{\small \sf SNR}} = 10\log_{10}(\tau_u\tau_v/\tau_w).$$ (6.2) Estimating the vector $${\mathbf{v}}_0$$ in this setup is related to finding sparse principal vectors of the matrix $${\mathbf{A}}^\top{\mathbf{A}}$$, for which there are a large number of excellent methods, including [5,9,30,53,61,62]. These algorithms include methods based on thresholding, $$\ell_1$$-regularization and semidefinite programming. A comparison of IterFac against these methods would be an interesting avenue of future research. Here, we simply wish to verify the SE predictions of the IterFac method. The results of the simulation are shown in Fig. 1, which shows the simulated and SE-predicted performance of the IterFac algorithm with both the linear and MMSE selection functions for the priors on $${\mathbf{u}}$$ and $${\mathbf{v}}$$. The algorithm is run for $$t=10$$ iterations, and the plot shows the median of the final correlation coefficient $$\rho_v(t)$$ over 50 Monte Carlo trials at each value of SNR. It can be seen that the performance of the IterFac algorithm for both the linear and MMSE estimates is in excellent agreement with the SE predictions. The correlation coefficient of the linear estimator also matches the correlation of the estimates produced from the maximal singular vectors of $${\mathbf{A}}$$. This is not surprising since, with linear selection functions, the IterFac algorithm is essentially an iterative method to find the maximal singular vectors. The figure also shows the benefit of exploiting the prior on $${\mathbf{v}}_0$$, which is evident from the superior performance of the MMSE estimate over the linear reconstruction. Fig. 1.
Simulation of the IterFac algorithm for both the linear selection functions in Section 5.1 (labeled IterFac-lin) and MMSE selection functions in Section 5.2 (labeled IterFac-MMSE). Plotted are the correlation values after 10 iterations. The simulated values are compared against the SE predictions. Also shown is the simple estimate from the maximal singular vectors of $${\mathbf{A}}$$. Figure 2 shows the correlation coefficient as a function of the iteration number for the MMSE and linear methods for two values of the SNR. Again, we see that the SE predictions are closely matched to the median simulated values. In addition, we see that we get good convergence within 4 to 8 iterations. Based on the SE predictions, this number will not scale with dimension and hence the simulation suggests that only a small number of iterations will be necessary even for very large problems. All code for the simulation can be found in the public GAMP sourceforge project [54]. Fig. 2. Per iteration performance of the IterFac algorithm for both the linear selection functions in Section 5.1 (IF-lin) and MMSE selection functions in Section 5.2 (IF-MMSE). The simulated values are compared with the SE predictions.
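As noted above, with linear selection functions IterFac is essentially an iterative method for computing maximal singular vectors. A plain power iteration, shown below as a generic stand-in (it is not the exact Algorithm 1, which includes additional correction terms), makes the connection concrete on the rank-one model (1.1):

```python
import numpy as np

def max_singular_vectors(A, iters=200, seed=0):
    """Estimate the left/right maximal singular vectors of A by power iteration.
    Generic sketch only; NOT the exact IterFac Algorithm 1."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    for _ in range(iters):
        u = A @ v
        u /= np.linalg.norm(u)
        v = A.T @ u
        v /= np.linalg.norm(v)
    return u, v

# Rank-one model (1.1): A = u0 v0^T + sqrt(m) W, with a small noise level here.
rng = np.random.default_rng(1)
m, n = 1000, 500
u0, v0 = rng.standard_normal(m), rng.standard_normal(n)
A = np.outer(u0, v0) + np.sqrt(m) * 0.01 * rng.standard_normal((m, n))

u, v = max_singular_vectors(A)
rho_v = (v @ v0)**2 / (v0 @ v0)   # correlation coefficient (4.10); ||v|| = 1
```

At this high SNR the estimated right singular vector is almost perfectly aligned with $${\mathbf{v}}_0$$, mirroring the high-SNR end of Fig. 1.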
6.2 Bernoulli–Gaussian The above experiment considers a positive random vector $$V_0$$, so that the initial estimate $${\mathbb{E}}(V_0)$$ is positive and, hence, has a non-zero correlation with $$V_0$$. As discussed in Section 5.3, predicting the performance of IterFac with zero-mean priors (i.e., $${\mathbb{E}}(V_0)={\mathbb{E}}(U_0)=0$$) and a small random initial condition requires that we exchange the limits in iteration and problem size. To validate these predictions, we took a Bernoulli–Gaussian (BG) prior where the components of $${\mathbf{v}}_0$$ are distributed as $$\label{eq:vBernGauss} v_{0j} \sim \left\{ \begin{array}{ll} 0 & \mbox{with prob } 1-\lambda, \\ {\mathcal N}(0,1) & \mbox{with prob } \lambda, \end{array} \right.$$ (6.3) where, as before, $$\lambda$$ represents the sparsity rate. Importantly, $${\mathbb{E}}(V_0)=0$$. All other parameters are identical to the previous section. We used the MMSE factor selection functions (2.6), but with normalization and fixed parameters $$\lambda_v(t)=(\gamma_v(t),\nu_v(t))$$. The reason that it is necessary to fix the parameters is that, in the case of zero initial conditions, the estimator does not know the true scaling factor $$\gamma_v(t)$$ and noise level $$\nu_v(t)$$: these depend on the random initial condition and cannot be predicted accurately in the early iterations. To circumvent the problem, we used a procedure where, in each iteration $$t$$, $${\mathbf{p}}(t)$$ was first normalized so that its average second moment matched $$\tau_v$$. Then, we ran the MMSE estimator with parameters $$\gamma_v(t)=1$$ and $$\nu_v = \alpha_v \tau_v$$ for a scaling factor $$\alpha_v=0.5$$. The idea is that, in every iteration, the MMSE estimator assumes that the SNR is some particular value as opposed to using the predicted value from SE. This is not optimal but, interestingly, the performance of this suboptimal estimator can still be predicted.
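For concreteness, the fixed-parameter procedure just described can be sketched as follows. This is our own illustration with assumed names: the active component of (6.3) has prior $${\mathcal N}(0,1)$$, and we assume the simplified scalar model $$q = v_0 + {\mathcal N}(0,\nu)$$ after normalization (the article's scalar equivalent model also carries a scaling $$\gamma_v$$, set to 1 here):

```python
import numpy as np

def bg_posterior_mean(q, lam, nu):
    """Scalar MMSE (posterior mean) for the Bernoulli-Gaussian prior (6.3),
    assuming q = v0 + N(0, nu). Given the active component, q ~ N(0, 1 + nu)."""
    l_off = np.exp(-q**2 / (2 * nu)) / np.sqrt(2 * np.pi * nu)
    l_on = np.exp(-q**2 / (2 * (1 + nu))) / np.sqrt(2 * np.pi * (1 + nu))
    p_on = lam * l_on / (lam * l_on + (1 - lam) * l_off)  # P(v0 != 0 | q)
    return p_on * q / (1 + nu)                            # E[v0 | q]

def fixed_parameter_step(p, lam, tau_v, alpha_v=0.5):
    """The Section 6.2 procedure: rescale p so its average second moment is
    tau_v, then denoise with the fixed assumed noise level nu = alpha_v*tau_v."""
    p = p * np.sqrt(tau_v / np.mean(p**2))
    return bg_posterior_mean(p, lam, alpha_v * tau_v)
```

The denoiser shrinks each component toward zero by the posterior probability of being active, which is how the sparsity of (6.3) is exploited.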
Figure 3 compares the simulated performance of IterFac with the theoretical prediction from SE for a BG prior with $$\lambda = 0.1$$ and Gaussian prior with $$\lambda=1$$. Note that for $$\lambda=1$$, the MMSE factor selection function is linear. We see an excellent match between simulations and SE predictions. Also, we see that the two curves exhibit the same SNR threshold above which a non-zero correlation is attainable. This latter point confirms that the phase-transition point for linear estimation is identical to that of MMSE estimation. Hence, the particular distribution on $$V_0$$ does not matter; only the SNR does. Fig. 3. Simulated performance of IterFac under Bernoulli–Gaussian (BG) and Gaussian priors. The performance is compared with the theoretically predicted performance from the limit of the SE. 7. Limitations and extensions There are several potential lines for future work, some of which have already been explored in work appearing since the original publication of this method in [47]. 7.1 Extensions to higher rank The algorithm presented in this article considers only rank-one matrices. The works [31,34,43,52] have proposed AMP-type algorithms for more general classes of matrix factorization problems and provided analyses of these methods based on replica methods and other ideas from statistical physics. 7.2 Unknown priors The MMSE estimator in Section 5.2 requires exact knowledge of the priors on $$U_0$$ and $$V_0$$ as well as the Gaussian noise level $$\tau_w$$. In many problems in statistics, these are not known. There are two possible solutions that may be investigated in the future.
One method is to parameterize the distributions of $$U_0$$ and $$V_0$$ and estimate these parameters in the MMSE selection functions (2.6), potentially through an expectation maximization (EM)-type procedure as in [10]. This EM-type approach with hidden hyperparameters has recently been used successfully in a related approximate message passing method in [57]. The analysis of such parameter learning could possibly be accounted for through the adaptation parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$. A second approach is to assume that the distributions of $$U_0$$ and $$V_0$$ belong to a known family of distributions and then find a min-max solution. Such a min-max technique was proposed for AMP recovery of sparse vectors in [15]. See also [14]. 7.3 Optimality Although this article characterizes the performance of the IterFac algorithm, it remains open how far that performance is from optimal estimation, such as the joint MMSE estimates of $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$. AMP methods have been shown to be provably optimal in a wide range of scenarios [3,4,22,23,27,45,46]. More recently, [11] proved that, for sparse binary priors, the IterFac algorithm is Bayes optimal for certain sparsity levels. A potential line of future work is to see whether these results can be extended to more general settings. 8. Conclusions We have presented a computationally efficient method for estimating rank-one matrices in noise. The estimation problem is reduced to a sequence of scalar AWGN estimation problems which can be performed easily for a large class of priors or regularization functions on the coefficients. In the case of Gaussian noise, the asymptotic performance of the algorithm is exactly characterized by a set of scalar state evolution equations that appear to match the performance at moderate dimensions well. Thus, the methodology is computationally simple, general and admits a precise analysis in certain asymptotic, random settings.
Future work includes extensions to higher-rank matrices and handling of the cases where the priors are not known. Acknowledgements The authors thank Vivek Goyal, Ulugbek Kamilov, Andrea Montanari and Phil Schniter for detailed comments on an earlier draft and the industrial affiliate members of NYU WIRELESS (http://wireless.engineering.nyu.edu). Funding National Science Foundation (in part) [1254204 and 1738286 to A.K.F.]; Office of Naval Research [N00014-15-1-2677 to A.K.F.]; National Science Foundation (in part) [1302336, 1564142 and 1547332 to S.R.]. Footnotes 1 All code for these experiments can be found in the gampmatlab repository https://sourceforge.net/projects/gampmatlab. References 1. Aharon M., Elad M. & Bruckstein A. (2006) K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process., 54, 4311–4322. 2. Baik J., Arous G. B. & Péché S. (2005) Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab., 33, 1643–1697. 3. Bayati M. & Montanari A. (2011) The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory, 57, 764–785. 4. Boutros J. & Caire G. (2002) Iterative multiuser joint decoding: unified framework and asymptotic analysis. IEEE Trans. Inform. Theory, 48, 1772–1793. 5. Cadima J. & Jolliffe I. T. (1995) Loadings and correlations in the interpretation of principal components. J. Appl. Stat., 22, 208–214. 6. Caltagirone F., Zdeborová L. & Krzakala F. (2014) On convergence of approximate message passing. IEEE International Symposium on Information Theory (ISIT), Honolulu, Hawaii, June 2014. IEEE, pp. 1812–1816. 7. Capitaine M., Donati-Martin C. & Féral D.
(2009) The largest eigenvalues of finite rank deformation of large Wigner matrices: convergence and nonuniversality of the fluctuations. Ann. Probab., 37, 1–47. 8. Chen S. Y., Tong H., Wang Z., Liu S., Li M. & Zhang B. (2011) Improved generalized belief propagation for vision processing. Math. Probl. Eng., 2011, 416963. 9. D’Aspremont A., El Ghaoui L., Jordan M. I. & Lanckriet G. R. G. (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Rev., 49, 434–448. 10. Dempster A., Laird N. M. & Rubin D. B. (1977) Maximum-likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc., 39, 1–17. 11. Deshpande Y. & Montanari A. (2014) Information-theoretically optimal sparse PCA. IEEE International Symposium on Information Theory (ISIT), Honolulu, Hawaii, June 2014. IEEE, pp. 2197–2201. 12. Deshpande Y. & Montanari A. (2015) Finding hidden cliques of size $$\sqrt{N/e}$$ in nearly linear time. Found. Comput. Math., 15, 1069–1128. 13. Ding Z. (1997) Matrix outer-product decomposition method for blind multiple channel identification. IEEE Trans. Signal Process., 45, 3053–3061. 14. Donoho D. L. & Johnstone I. M. (1994) Minimax risk over $$l_p$$-balls for $$l_q$$-error. Probab. Theory Relat. Fields, 99, 277–303. 15. Donoho D. L., Maleki A. & Montanari A. (2009) Message-passing algorithms for compressed sensing. Proc. Nat. Acad. Sci., 106, 18914–18919. 16. Donoho D. L., Maleki A. & Montanari A. (2010) Message passing algorithms for compressed sensing I: motivation and construction. Proc. IEEE Information Theory Workshop, Cairo, Jan. 2010. IEEE, pp. 1–5. 17. Donoho D. L., Maleki A. & Montanari A.
(2010) Message passing algorithms for compressed sensing II: analysis and validation. Proc. IEEE Information Theory Workshop, Cairo, Jan. 2010. IEEE, pp. 1–5. 18. Fletcher A. K. & Rangan S. (2014) Scalable inference for neuronal connectivity from calcium imaging. Advances in Neural Information Processing Systems (Ghahramani Z., Welling M., Cortes C., Lawrence N. D. & Weinberger K. Q. eds). Montreal, Canada: MIT Press, pp. 2843–2851. 19. Fletcher A. K., Rangan S., Varshney L. R. & Bhargava A. (2011) Neural reconstruction with approximate message passing (NeuRAMP). Advances in Neural Information Processing Systems (Shawe-Taylor J., Zemel R. S., Bartlett P. L., Pereira F. & Weinberger K. Q. eds). Granada, Spain: MIT Press, pp. 2555–2563. 20. Guo D., Baron D. & Shamai S. (2009) A single-letter characterization of optimal noisy compressed sensing. Proc. Allerton Conference on Communication, Control, and Computing, Monticello, IL, Sept. 2009. IEEE, pp. 52–59. 21. Guo D. & Verdú S. (2005) Randomly spread CDMA: asymptotics via statistical physics. IEEE Trans. Inform. Theory, 51, 1983–2010. 22. Guo D. & Wang C.-C. (2006) Asymptotic mean-square optimality of belief propagation for sparse linear systems. IEEE Information Theory Workshop (ITW ’06), Chengdu, 2006. IEEE, pp. 194–198. 23. Guo D. & Wang C.-C. (2007) Random sparse linear systems observed via arbitrary channels: a decoupling principle. IEEE International Symposium on Information Theory (ISIT), Nice, France, June 2007. IEEE, pp. 946–950. 24. Guo D., Wu Y., Shamai S. & Verdú S. (2011) Estimation in Gaussian noise: properties of the minimum mean-square error. IEEE Trans. Inform. Theory, 57, 2371–2385. 25. Horn R. A. & Johnson C. R. (1985) Matrix Analysis. Cambridge, England: Cambridge University Press, reprinted with corrections 1987. 26. Jafar S. A. & Goldsmith A.
(2004) Transmitter optimization and optimality of beamforming for multiple antenna systems. IEEE Trans. Wireless Commun., 3, 1165–1175. 27. Javanmard A. & Montanari A. (2013) State evolution for general approximate message passing algorithms, with applications to spatial coupling. Inf. Infer., 2, 115–144. 28. Johnstone I. M. & Lu A. Y. (2009) On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Stat. Assoc., 104, 682–703. 29. Jolliffe I. T. (1986) Principal Component Analysis. New York, NY: Springer. 30. Jolliffe I. T., Trendafilov N. & Uddin M. (2003) A modified principal component technique based on the LASSO. J. Comput. Graphical Stat., 12, 531–547. 31. Kabashima Y., Krzakala F., Mézard M., Sakata A. & Zdeborová L. (2016) Phase transitions and sample complexity in Bayes-optimal matrix factorization. IEEE Trans. Inform. Theory, 62, 4228–4265. 32. Katayama T. (2010) Subspace Methods for System Identification. New York: Springer. 33. Keshavan R., Montanari A. & Oh S. (2010) Matrix completion from a few entries. IEEE Trans. Inform. Theory, 56, 2980–2998. 34. Krzakala F., Mézard M. & Zdeborová L. (2013) Phase diagram and approximate message passing for blind calibration and dictionary learning. IEEE International Symposium on Information Theory (ISIT), Istanbul, Turkey, 2013. IEEE, pp. 659–663. 35. Lee D. D. & Seung H. S. (2001) Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems (Dietterich T. G., Becker S. & Ghahramani Z. eds). Vancouver, Canada: MIT Press, pp. 556–562. 36. Lesieur T., Krzakala F. & Zdeborová L.
(2015) MMSE of probabilistic low-rank matrix estimation: universality with respect to the output channel. Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, Oct. 2015. IEEE. 37. Lewicki M. S. & Sejnowski T. J. (2000) Learning overcomplete representations. Neural Comp., 12, 337–365. 38. Manoel A., Krzakala F., Tramel E. W. & Zdeborová L. (2015) Swept approximate message passing for sparse estimation. International Conference on Machine Learning, Lille, France, July 2015 (Bach F. & Blei D. eds), pp. 1123–1132. 39. Marčenko V. A. & Pastur L. A. (1967) Distribution of eigenvalues for some sets of random matrices. Math. USSR–Sbornik, 1, 457–483. 40. Montanari A. & Tse D. (2006) Analysis of belief propagation for non-linear problems: the example of CDMA (or: how to prove Tanaka’s formula). arXiv:cs/0602028v1 [cs.IT]. 41. Olshausen B. A. & Field D. J. (1996) Natural image statistics and efficient coding. Network Comp. Neural Syst., 7, 333–339. 42. Olshausen B. A. & Field D. J. (1997) Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Res., 37, 3311–3325. 43. Parker J. T., Schniter P. & Cevher V. (2014) Bilinear generalized approximate message passing—Part I: derivation. IEEE Trans. Signal Process., 62, 5839–5853. 44. Parker J. T., Schniter P. & Cevher V. (2014) Bilinear generalized approximate message passing—Part II: applications. IEEE Trans. Signal Process., 62, 5854–5867. 45. Rangan S. (2010) Estimation with random linear mixing, belief propagation and compressed sensing. IEEE Conference on Information Sciences and Systems (CISS), Princeton, NJ, March 2010. IEEE, pp. 1–6. 46. Rangan S.
(2011) Generalized approximate message passing for estimation with random linear mixing. IEEE International Symposium on Information Theory (ISIT), St. Petersburg, Russia, June 2011. IEEE, pp. 2174–2178. 47. Rangan S. & Fletcher A. K. (2012) Iterative estimation of constrained rank-one matrices in noise. IEEE International Symposium on Information Theory (ISIT), Cambridge, MA, July 2012. IEEE. 48. Rangan S., Fletcher A. & Goyal V. K. (2012) Asymptotic analysis of MAP estimation via the replica method and applications to compressed sensing. IEEE Trans. Inform. Theory, 58, 1902–1923. 49. Rangan S., Fletcher A. K., Schniter P. & Kamilov U. S. (2017) Inference for generalized linear models via alternating directions and Bethe free energy minimization. IEEE Trans. Inform. Theory, 63, 676–697. 50. Rangan S., Schniter P. & Fletcher A. (2014) On the convergence of approximate message passing with arbitrary matrices. IEEE International Symposium on Information Theory (ISIT), Honolulu, Hawaii, June 2014. IEEE, pp. 236–240. 51. Rangan S., Schniter P., Riegler E., Fletcher A. & Cevher V. (2016) Fixed points of generalized approximate message passing with arbitrary matrices. IEEE Trans. Inform. Theory, 62, 7464–7474. 52. Sakata A. & Kabashima Y. (2013) Sample complexity of Bayesian optimal dictionary learning. IEEE International Symposium on Information Theory (ISIT), Istanbul, Turkey, June 2013. IEEE, pp. 669–673. 53. Shen H. & Huang J. Z. (2008) Sparse principal component analysis. J. Multivariate Anal., 99, 1015–1034. 54. SourceForge (2015) Generalized approximate message passing. SourceForge.net project gampmatlab. Available online at http://gampmatlab.sourceforge.net/. 55. Tanaka T.
(2002) A statistical-mechanics approach to large-system analysis of CDMA multiuser detectors. IEEE Trans. Inform. Theory, 48, 2888–2910. 56. Vidyasagar M. (1978) Nonlinear Systems Analysis. Englewood Cliffs, NJ: Prentice-Hall. 57. Vila J. P. & Schniter P. (2011) Expectation-maximization Bernoulli–Gaussian approximate message passing. IEEE Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Sept. 2011. IEEE, pp. 799–803. 58. Vila J., Schniter P. & Meola J. (2013) Hyperspectral image unmixing via bilinear generalized approximate message passing. SPIE Defense, Security, and Sensing, Baltimore, MD, May 2013. SPIE, paper 87430Y. 59. Vila J., Schniter P., Rangan S., Krzakala F. & Zdeborová L. (2015) Adaptive damping and mean removal for the generalized approximate message passing algorithm. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, April 2015. IEEE, pp. 2021–2025. 60. Wu Y. & Verdú S. (2012) Functional properties of minimum mean-square error and mutual information. IEEE Trans. Inform. Theory, 58, 1289–1301. 61. Zhang Z., Zha H. & Simon H. (2002) Low rank approximations with sparse factors I: basic algorithms and error analysis. SIAM J. Matrix Anal. Appl., 23, 706–727. 62. Zou H., Hastie T. & Tibshirani R. (2006) Sparse principal component analysis. J. Comput. Graphical Stat., 15, 265–286. Appendix A. Appendices: proofs A.1 Empirical convergence of random variables Bayati and Montanari’s analysis in [3] employs certain deterministic models on the vectors and then proves convergence properties of related empirical distributions. To apply the same analysis here, we need to review some of their definitions.
We say a function $$\phi:{\mathbb{R}}^r {\rightarrow} {\mathbb{R}}^s$$ is pseudo-Lipschitz of order $$p>1$$, if there exists an $$L > 0$$ such that for any $${\mathbf{x}}$$, $${\mathbf{y}} \in {\mathbb{R}}^r$$, $\|\phi({\mathbf{x}}) - \phi({\mathbf{y}})\| \leq L(1+\|{\mathbf{x}}\|^{p-1}+\|{\mathbf{y}}\|^{p-1})\|{\mathbf{x}}-{\mathbf{y}}\|.$ Now suppose that for each $$n=1,2,\,{\ldots}\,$$, we have a set of vectors $\theta(n) = \left\{ {\mathbf{v}}_i(n), i=1,\ldots,\ell(n)\right\}\!,$ where the elements are vectors $${\mathbf{v}}_i(n) \in {\mathbb{R}}^s$$ and the size of the set is given by $$\ell(n)$$. We say that the set of components of $$\theta(n)$$ empirically converges with bounded moments of order $$p$$ as $$n{\rightarrow}\infty$$ to a random vector $${\mathbf{V}}$$ on $${\mathbb{R}}^s$$ if, for all pseudo-Lipschitz continuous functions, $$\phi$$, of order $$p$$, $$\label{eq:phiConv} \lim_{n {\rightarrow}\infty} \frac{1}{\ell(n)} \sum_{i=1}^{\ell(n)} \phi({\mathbf{v}}_i(n)) = {\mathbb{E}}[\phi({\mathbf{V}})] < \infty.$$ (A.1) When the nature of convergence is clear, we may write (with some abuse of notation) $\lim_{n {\rightarrow} \infty} \theta(n) \stackrel{d}{=} {\mathbf{V}}.$ A.2 Bayati–Montanari recursions with adaptation Our main result will need an adaptive version of the recursion theorem of Bayati and Montanari [3]. Let $$H_u(t,b,u_0,\nu_u)$$ and $$H_v(t,d,v_0,\nu_v)$$ be two functions defined on arguments $$t=0,1,2,\ldots$$ and $$d$$, $$b$$, $$u_0$$ and $$v_0 \in {\mathbb{R}}$$ as well as vectors $$\nu_u$$ and $$\nu_v$$. Given a matrix $${\mathbf{S}} \in {\mathbb{R}}^{m {\times} n}$$ and vectors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$, generate a sequence of vectors $${\mathbf{b}}(t)$$ and $${\mathbf{d}}(t)$$ by the iterations \begin{eqnarray} {\mathbf{b}}(t) &=& {\mathbf{S}} {\mathbf{v}}(t) + \xi_u(t){\mathbf{u}}(t), \label{eq:genAMPb} \\ \end{eqnarray} (A.2a) \begin{eqnarray} {\mathbf{d}}(t) &=& {\mathbf{S}}^\top{\mathbf{u}}({t\! + \!
1}) + \xi_v(t){\mathbf{v}}(t), \label{eq:genAMPd} \end{eqnarray} (A.2b) where \begin{eqnarray} u_i({t\! + \! 1}) &=& H_u(t,b_i(t),u_{0i},\nu_u(t)), \\ \end{eqnarray} (A.3a) \begin{eqnarray} v_j({t\! + \! 1}) &=& H_v(t,d_j(t),v_{0j},\nu_v(t)), \end{eqnarray} (A.3b) and $$\xi_v(t)$$ and $$\xi_u(t)$$ are scalar step sizes given by \begin{eqnarray} \xi_v(t) &=& -\frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial b_i}H_u(t,b_i(t),u_{0i},\nu_u(t)), \\ \end{eqnarray} (A.4a) \begin{eqnarray} \xi_u({t\! + \! 1}) &=& -\frac{1}{m} \sum_{j=1}^n \frac{\partial}{\partial d_j} H_v(t,d_j(t),v_{0j},\nu_v(t)). \hspace{0.7cm} \label{eq:genAMPmuu} \end{eqnarray} (A.4b) The recursions (A.2) to (A.4) are identical to the recursions analyzed in [3], except for the introduction of the parameters $$\nu_u(t)$$ and $$\nu_v(t)$$. We will call these parameters adaptation parameters, since they enable the functions $$H_u(\cdot)$$ and $$H_v(\cdot)$$ to have some data dependence not explicitly considered in [3]. Similarly to the selection of the parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ in (4.2), we assume that, in each iteration $$t$$, the adaptation parameters are selected by functions of the form $$\label{eq:genNu} \nu_u(t) = \frac{1}{n} \sum_{j=1}^n \phi_u(t,v_{0j},v_j(t)), \quad \nu_v(t) = \frac{1}{m} \sum_{i=1}^m \phi_v(t,u_{0i},u_i({t\! + \! 1})),$$ (A.5) where $$\phi_u(\cdot)$$ and $$\phi_v(\cdot)$$ are (possibly vector-valued) pseudo-Lipschitz continuous functions of order $$p$$ for some $$p > 1$$. Thus, the values of $$\nu_u(t)$$ and $$\nu_v(t)$$ depend on the outputs $${\mathbf{v}}(t)$$ and $${\mathbf{u}}({t\! + \! 1})$$. Note that in equations (A.3) to (A.5), $$b_i$$, $$u_{0i}$$, $$d_j$$ and $$v_{0j}$$ are the components of the vectors $${\mathbf{b}}$$, $${\mathbf{u}}_0$$, $${\mathbf{d}}$$ and $${\mathbf{v}}_0$$, respectively. The algorithm is initialized with $$t=0$$, $$\xi_u(0)=0$$ and some vector $${\mathbf{v}}(0)$$.
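The recursions (A.2) to (A.4) are straightforward to simulate. The sketch below is a minimal illustration and is not the paper's Algorithm 1: the dimensions are arbitrary, the adaptation parameters are ignored, and the assumed choice $$H_u = H_v = \tanh$$ (acting on the first argument) supplies the step sizes (A.4) through its derivative $$1-\tanh^2$$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 400, 200                          # so beta = n/m = 0.5
S = rng.standard_normal((m, n)) / np.sqrt(m)    # s_ij ~ N(0, 1/m)

# Assumed illustrative choices: H_u(t,b,u0,nu) = tanh(b) and
# H_v(t,d,v0,nu) = tanh(d), with no adaptation parameters.
H = np.tanh
dH = lambda x: 1.0 - np.tanh(x) ** 2     # derivative of tanh

v = rng.standard_normal(n)               # initialization v(0)
u = np.zeros(m)
xi_u = 0.0                               # xi_u(0) = 0
for t in range(10):
    b = S @ v + xi_u * u                 # (A.2a)
    u = H(b)                             # (A.3a): u(t+1)
    xi_v = -np.mean(dH(b))               # (A.4a): -(1/m) sum_i dH_u/db_i
    d = S.T @ u + xi_v * v               # (A.2b)
    v = H(d)                             # (A.3b): v(t+1)
    xi_u = -np.sum(dH(d)) / m            # (A.4b): -(1/m) sum_j dH_v/dd_j

print(np.mean(u ** 2), np.mean(v ** 2))
```

Note the asymmetric normalization in (A.4b): the sum runs over the $$n$$ components of $${\mathbf{d}}$$ but is divided by $$m$$.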
Now, similarly to Section 4, consider a sequence of random realizations of the parameters indexed by the input dimension $$n$$. For each $$n$$, we assume that the output dimension $$m=m(n)$$ is deterministic and scales linearly as in (3.4) for some $$\beta \geq 0$$. Assume that the transform matrix $${\mathbf{S}}$$ has i.i.d. Gaussian components $$s_{ij} \sim \mathcal{N}(0,1/m)$$. Also assume that the empirical limits in (4.3) hold with bounded moments of order $$2p-2$$ for some limiting random variables $$(U_0,U(0))$$ and $$(V_0,V(0))$$. We will also assume the following continuity assumptions on $$H_u(\cdot)$$ and $$H_v(\cdot)$$: Assumption A1 The function $$H_u(t,b,u_0,\nu_u)$$ satisfies the following continuity conditions: (a) For every $$\nu_u$$ and $$t$$, $$H_u(t,b,u_0,\nu_u)$$ and its derivative $$\partial H_u(t,b,u_0,\nu_u)/\partial b$$ are Lipschitz continuous in $$b$$ and $$u_0$$ for some Lipschitz constant that is continuous in $$\nu_u$$; and (b) for every $$\nu_u$$ and $$t$$, $$H_u(t,b,u_0,\nu_u)$$ and $$\partial H_u(t,b,u_0,\nu_u)/\partial b$$ are continuous at $$\nu_u$$ uniformly over $$(b,u_0)$$. The function $$H_v(t,d,v_0,\nu_v)$$ satisfies the analogous continuity assumptions in $$d$$, $$v_0$$ and $$\nu_v$$. Under these assumptions, we will show, as in Section 4, that for any fixed iteration $$t$$, the sets $$\theta_u(t)$$ and $$\theta_v(t)$$ in (3.5) converge empirically to the limits (3.6) for some random variable pairs $$(U_0,U(t))$$ and $$(V_0,V(t))$$. The random variable $$U_0$$ is identical to the variable in the limit (4.3) and, for $$t \geq 0$$, $$U(t)$$ is given by $$\label{eq:UtGen} U({t\! + \! 1}) = H_u(t,B(t),U_0,{\overline{\nu}}_u(t)), \quad B(t) \sim {\mathcal N}(0,\tau^b(t)),$$ (A.6) for some deterministic constants $${\overline{\nu}}_u(t)$$ and $$\tau^b(t)$$ that will be defined below.
Similarly, the random variable $$V_0$$ is identical to the variable in the limit (4.3) and, for $$t \geq 0$$, $$V(t)$$ is given by $$\label{eq:VtGen} V({t\! + \! 1}) = H_v(t,D(t),V_0,{\overline{\nu}}_v(t)), \quad D(t) \sim {\mathcal N}(0,\tau^d(t)),$$ (A.7) for some constants $${\overline{\nu}}_v(t)$$ and $$\tau^d(t)$$, also defined below. The constants $$\tau^b(t)$$, $$\tau^d(t)$$, $${\overline{\nu}}_u(t)$$ and $${\overline{\nu}}_v(t)$$ can be computed recursively through the following state evolution equations: \begin{eqnarray} \tau^d(t) &=& {\mathbb{E}}\left[ U({t\! + \! 1})^2 \right], \quad \tau^b(t) = \beta {\mathbb{E}}\left[ V(t)^2 \right], \label{eq:taub} \\ \end{eqnarray} (A.8a) \begin{eqnarray} {\overline{\nu}}_u(t) &=& {\mathbb{E}}\left[\phi_u(t,V_0,V(t))\right], \quad {\overline{\nu}}_v(t) = {\mathbb{E}}\left[\phi_v(t,U_0,U({t\! + \! 1}))\right], \label{eq:nuv} \end{eqnarray} (A.8b) where the expectations are over the random variables $$U(t)$$ and $$V(t)$$ above and initialized with $$\label{eq:SEGenInit} \tau^b(0) := \beta {\mathbb{E}}\left[ V(0)^2 \right].$$ (A.9) With these definitions, we can now state the adaptive version of the result from Bayati and Montanari [3]. Although the full proof requires that $$p=2$$, much of the proof is valid for $$p\geq 2$$. Hence, where possible, we provide the steps for the general $$p$$ case. Theorem A2 Consider the recursion in (A.2) to (A.5) satisfying the above assumptions for the case when $$p=2$$. Then, for any fixed iteration number $$t$$, the sets $$\theta_u(t)$$ and $$\theta_v(t)$$ in (3.5) converge empirically to the limits (3.6) with bounded moments of order $$p=2$$ to the random variable pairs $$(U_0,U(t))$$ and $$(V_0,V(t))$$ described above. Proof. We use an asterisk superscript to denote the outputs of the non-adaptive version of the recursions (A.2) to (A.5). 
That is, quantities such as $${\mathbf{u}}^*(t), {\mathbf{v}}^*(t), {\mathbf{b}}^*(t), {\mathbf{d}}^*(t),\,{\ldots}\,$$, will represent the outputs generated by recursions (A.2) to (A.5) with the same initial conditions ($${\mathbf{v}}^*(0)={\mathbf{v}}(0)$$ and $$\xi_u(0)=\xi_u^*(0)=0$$), but in (A.3) and (A.4), $$\nu_u(t)$$ and $$\nu_v(t)$$ are replaced by their deterministic limits $${\overline{\nu}}_u(t)$$ and $${\overline{\nu}}_v(t)$$. Therefore, \begin{eqnarray} u^*_i({t\! + \! 1}) &=& H_u(t,b^*_i(t),u_{0i},{\overline{\nu}}_u(t)), \\ \end{eqnarray} (A.10a) \begin{eqnarray} v^*_j({t\! + \! 1}) &=& H_v(t,d^*_j(t),v_{0j},{\overline{\nu}}_v(t)), \end{eqnarray} (A.10b) and \begin{eqnarray} \xi^*_v(t) &=& -\frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial b_i}H_u(t,b^*_i(t),u_{0i},{\overline{\nu}}_u(t)), \\ \end{eqnarray} (A.11a) \begin{eqnarray} \xi^*_u({t\! + \! 1}) &=& -\frac{1}{m} \sum_{j=1}^n \frac{\partial}{\partial d_j} H_v(t,d^*_j(t),v_{0j},{\overline{\nu}}_v(t)). \hspace{0.7cm} \end{eqnarray} (A.11b) Now, Bayati and Montanari’s result in [3] shows that this non-adaptive algorithm satisfies the required properties. That is, the following limits hold with bounded moments of order $$p$$: \begin{eqnarray} \lim_{n {\rightarrow} \infty} \{(u_{0i},u^*_i(t)), i=1,\ldots,m\} &=& (U_0,U(t)), \\ \end{eqnarray} (A.12a) \begin{eqnarray} \lim_{n {\rightarrow} \infty} \{(v_{0j},v^*_j(t)), j=1,\ldots,n\} &=& (V_0,V(t)). \end{eqnarray} (A.12b) So, the limits (3.6) will be shown if we can prove the following limits hold almost surely for all $$t$$: $$\label{eq:uvlimd} \lim_{n {\rightarrow} \infty} \frac{1}{m} \|{\mathbf{u}}(t)-{\mathbf{u}}^*(t)\|_p^p = 0, \quad \lim_{n {\rightarrow} \infty} \frac{1}{n} \|{\mathbf{v}}(t)-{\mathbf{v}}^*(t)\|_p^p = 0,$$ (A.13) where $$\|\cdot\|_p$$ is the $$p$$-norm.
In the course of proving (A.13), we will also show the following limits hold almost surely: \begin{eqnarray} \lim_{n {\rightarrow} \infty} \frac{1}{m} \|{\mathbf{b}}(t)-{\mathbf{b}}^*(t)\|_p^p &=& 0, \label{eq:blimd} \\ \end{eqnarray} (A.14a) \begin{eqnarray} \lim_{n {\rightarrow} \infty} \frac{1}{n} \|{\mathbf{d}}(t)-{\mathbf{d}}^*(t)\|_p^p &=& 0, \label{eq:dlimd} \\ \end{eqnarray} (A.14b) \begin{eqnarray} \lim_{n {\rightarrow} \infty} |\xi_u(t)-\xi^*_u(t)| &=& 0, \label{eq:xiulimd} \\ \end{eqnarray} (A.14c) \begin{eqnarray} \lim_{n {\rightarrow} \infty} |\xi_v(t)-\xi^*_v(t)| &=& 0, \label{eq:xivlimd} \\ \end{eqnarray} (A.14d) \begin{eqnarray} \lim_{n {\rightarrow} \infty} \nu_u(t) &=& {\overline{\nu}}_u(t), \label{eq:nuulimd} \\ \end{eqnarray} (A.14e) \begin{eqnarray} \lim_{n {\rightarrow} \infty} \nu_v(t) &=& {\overline{\nu}}_v(t). \label{eq:nuvlimd} \end{eqnarray} (A.14f) The limits (A.13) and (A.14) can be proved by induction on $$t$$ with the following straightforward (but somewhat tedious) continuity argument. To begin the induction argument, first note that the non-adaptive algorithm has the same initial condition as the adaptive algorithm. That is, $${\mathbf{v}}^*(0)={\mathbf{v}}(0)$$ and $$\xi_u(0)=\xi_u^*(0)=0$$. Also, since $$\xi_u(0)=\xi_u^*(0)=0$$, from (A.2a), the initial value $${\mathbf{u}}(0)$$ does not matter. So, without loss of generality, we can assume that the initial condition satisfies $${\mathbf{u}}(0)={\mathbf{u}}^*(0)$$. Thus, the limits (A.13) and (A.14c) hold for $$t=0$$. We now proceed by induction. Suppose that the limits (A.13) and (A.14c) hold almost surely for some $$t \geq 0$$. Since $${\mathbf{S}}$$ has i.i.d. components with mean zero and variance $$1/m$$, by the Marčenko–Pastur theorem [39], the maximum singular value of $${\mathbf{S}}$$ is bounded.
For $$p=2$$, the maximum singular value is the $$p$$-norm operator norm, and therefore, there exists a constant $$C_S > 0$$ such that $$\label{eq:Snorm} \limsup_{n {\rightarrow} \infty} \|{\mathbf{S}}\|_p \leq C_S, \quad \limsup_{n {\rightarrow} \infty} \|{\mathbf{S}}^\top\|_p \leq C_S.$$ (A.15) Substituting the bound (A.15) into (A.2a), we obtain \begin{align} \|{\mathbf{b}}(t)-{\mathbf{b}}^*(t)\|_p &\leq \|{\mathbf{S}}\|_p\|{\mathbf{v}}(t)-{\mathbf{v}}^*(t)\|_p + |\xi_u(t)|\|{\mathbf{u}}(t)-{\mathbf{u}}^*(t)\|_p + |\xi_u(t)-\xi^*_u(t)|\|{\mathbf{u}}^*(t)\|_p \nonumber \\ & \leq C_S\|{\mathbf{v}}(t)-{\mathbf{v}}^*(t)\|_p + |\xi_u(t)|\|{\mathbf{u}}(t)-{\mathbf{u}}^*(t)\|_p + |\xi_u(t)-\xi^*_u(t)|\|{\mathbf{u}}^*(t)\|_p. \label{eq:blimbnd1} \end{align} (A.16) Now, since $$p \geq 1$$, we have that for any positive numbers $$a$$ and $$b$$, $$\label{eq:abpIneq} (a+b)^p \leq 2^{p-1}(a^p + b^p).$$ (A.17) Applying (A.17) to (A.16), and the fact that $$\lim_n m/n = \beta$$, we obtain that \begin{align} \frac{1}{m}\|{\mathbf{b}}(t)-{\mathbf{b}}^*(t)\|^p_p &\leq C'_S\Bigl[ \frac{1}{n}\|{\mathbf{v}}(t)-{\mathbf{v}}^*(t)\|^p_p \nonumber \\ &\quad + \frac{|\xi_u(t)|^p}{m}\|{\mathbf{u}}(t)-{\mathbf{u}}^*(t)\|^p_p +\frac{|\xi_u(t)-\xi^*_u(t)|^p}{m} \|{\mathbf{u}}^*(t)\|^p_p\Bigr], \label{eq:blimbnd2} \end{align} (A.18) for some other constant $$C'_S > 0$$. Now, since $${\mathbf{u}}^*(t)$$ is the output of the non-adaptive algorithm, it satisfies the limit $$\label{eq:ubndNA} \lim_{n {\rightarrow} \infty} \frac{1}{m}\|{\mathbf{u}}^*(t)\|_p^p = \lim_{n {\rightarrow} \infty} \frac{1}{m}\sum_{i=1}^m|u_i^*(t)|^p = {\mathbb{E}}|U(t)|^p < \infty.$$ (A.19) Substituting the bound (A.19) along with induction hypotheses, (A.13) and (A.14c) into (A.18) shows (A.14a).
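The operator-norm bound (A.15) can be checked numerically: for $${\mathbf{S}}$$ with i.i.d. $$\mathcal{N}(0,1/m)$$ entries, the largest singular value concentrates near the Marčenko–Pastur bulk edge $$1+\sqrt{\beta}$$. The dimensions below are arbitrary illustrative choices:

```python
import numpy as np

# Check of (A.15): the largest singular value of S with i.i.d.
# N(0,1/m) entries concentrates near 1 + sqrt(beta), beta = n/m.
rng = np.random.default_rng(3)
m, n = 1200, 600                         # beta = 0.5
S = rng.standard_normal((m, n)) / np.sqrt(m)

sigma_max = np.linalg.svd(S, compute_uv=False)[0]  # largest singular value
edge = 1.0 + np.sqrt(n / m)                        # Marchenko-Pastur edge
print(sigma_max, edge)                             # close to each other
```

Any constant $$C_S$$ slightly larger than $$1+\sqrt{\beta}$$ therefore works in (A.15) for $$p=2$$.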
Next, to prove the limit (A.14e), first observe that since $$\phi_u(\cdot)$$ is pseudo-Lipschitz continuous of order $$p$$, we have that $${\overline{\nu}}_u(t)$$ in (A.8b) can be replaced by the limit of the empirical means $$\label{eq:nuuLimNA} {\overline{\nu}}_u(t) = \lim_{n {\rightarrow}\infty} \frac{1}{n} \sum_{j=1}^n \phi_u(t,v_{0j},v_j^*(t)),$$ (A.20) where the limit holds almost surely. Combining (A.20) with (A.5), \begin{eqnarray} \limsup_{n {\rightarrow} \infty} |\nu_u(t)-{\overline{\nu}}_u(t)| \leq \limsup_{n {\rightarrow}\infty} \frac{1}{n} \sum_{j=1}^n |\phi_u(t,v_{0j},v_j(t)) -\phi_u(t,v_{0j},v_j^*(t))|. \label{eq:nuuDiff1} \end{eqnarray} (A.21) Applying the fact that $$\phi_u(\cdot)$$ is pseudo-Lipschitz continuous of order $$p$$ to (A.21), we obtain that there exists a constant $$L_v > 0$$ such that \begin{align} \limsup_{n {\rightarrow} \infty} |\nu_u(t)-{\overline{\nu}}_u(t)| &\leq \limsup_{n {\rightarrow} \infty} \frac{L_v}{n}\sum_{j=1}^{n} \left[ |v_j(t)|^{p-1}+|v_j^*(t)|^{p-1}\right] |v_j(t)-v_j^*(t)| \nonumber \\ &\leq \limsup_{n {\rightarrow} \infty} L_v\left[ \left( \frac{1}{n}\|{\mathbf{v}}(t)\|_p^p \right)^{(p-1)/p} + \left( \frac{1}{n}\|{\mathbf{v}}^*(t)\|_p^p \right)^{(p-1)/p} \right] \left( \frac{1}{n}\|{\mathbf{v}}(t)-{\mathbf{v}}^*(t)\|_p^p \right)^{1/p}, \end{align} (A.22) where the last step is due to Hölder’s inequality with the exponents $\frac{p-1}{p} + \frac{1}{p} = 1.$ Now, similarly to the proof of (A.19), one can show that the non-adaptive output satisfies the limit $$\label{eq:vbnd1} \lim_n \frac{1}{n} \|{\mathbf{v}}^*(t)\|_p^p = {\mathbb{E}}|V(t)|^p < \infty.$$ (A.23) Also, from the induction hypothesis (A.13), it follows that the adaptive output must satisfy the same limit $$\label{eq:vbnd2} \lim_n \frac{1}{n} \|{\mathbf{v}}(t)\|_p^p = {\mathbb{E}}|V(t)|^p < \infty.$$ (A.24) Applying the bounds (A.23) and
(A.24) to the bound (A.22) shows that (A.14e) holds almost surely. Now the limit in (A.14e) and the second limit in (A.13), together with the continuity conditions on $$H_u(\cdot)$$ in Assumption A1, show that the first limit in (A.13) holds almost surely for $$t+1$$ and (A.14d) holds almost surely for $$t$$. Using (A.2b), the proof of the limit (A.14b) is similar to the proof of (A.14a). These limits in turn show that the second limit in (A.13) and the limit (A.14c) hold almost surely for $$t+1$$. We have thus shown that if (A.13) and (A.14c) hold almost surely for some $$t$$, they hold for $$t+1$$. Thus, by induction they hold for all $$t$$. Finally, applying the limits (A.12), (A.13) and a continuity argument shows that the desired limits (3.6) hold almost surely. □ A.3 Proof of Theorem 4.2 The theorem directly follows from the adaptive Bayati–Montanari recursion theorem, Theorem A2 in Appendix A.2 above, with some changes of variables. Specifically, let $$\label{eq:Sdef} {\mathbf{S}} = \frac{1}{\sqrt{m\tau_w}} {\mathbf{W}},$$ (A.25) where $${\mathbf{W}}$$ is the Gaussian noise in the rank-one model in Assumption 4.1(b). Since $${\mathbf{W}}$$ has i.i.d. components with Gaussian distributions $${\mathcal N}(0,\tau_w)$$, the components of $${\mathbf{S}}$$ will be i.i.d. with distributions $${\mathcal N}(0,1/m)$$. Now, using the rank-one model for $${\mathbf{A}}$$ in (1.1), $$\label{eq:Avpf} {\mathbf{A}}{\mathbf{v}}(t) = {\mathbf{u}}_0{\mathbf{v}}_0^\top{\mathbf{v}}(t) + \sqrt{m}{\mathbf{W}}{\mathbf{v}}(t) = n\alpha_{v1}(t){\mathbf{u}}_0 + \sqrt{m}{\mathbf{W}}{\mathbf{v}}(t),$$ (A.26) where the last step is from the definition of $$\alpha_{v1}(t)$$ in (4.6).
Substituting (A.26) into the update rule for $${\mathbf{p}}(t)$$ in line 6 of Algorithm 1, we obtain \begin{eqnarray} {\mathbf{p}}(t) &=& (1/m){\mathbf{A}}{\mathbf{v}}(t) + \mu_u(t){\mathbf{u}}(t) \nonumber \\ &=& (1/\sqrt{m}){\mathbf{W}}{\mathbf{v}}(t) + \beta \alpha_{v1}(t){\mathbf{u}}_0 + \mu_u(t){\mathbf{u}}(t). \label{eq:Wvpf} \end{eqnarray} (A.27) Note that we have used the fact that $$\beta=n/m$$. Hence, if we define $$\label{eq:bdefpf} {\mathbf{b}}(t) = \frac{1}{\sqrt{\tau_w}}({\mathbf{p}}(t)-\beta\alpha_{v1}(t){\mathbf{u}}_0),$$ (A.28) then (A.25) and (A.27) show that $$\label{eq:bSpf} {\mathbf{b}}(t) = {\mathbf{S}}{\mathbf{v}}(t) + \xi_u(t){\mathbf{u}}(t), \quad \xi_u(t) = \mu_u(t)/\sqrt{\tau_w}.$$ (A.29) Similarly, one can show that if we define $$\label{eq:ddefpf} {\mathbf{d}}(t) = \frac{1}{\sqrt{\tau_w}}({\mathbf{q}}(t)-\alpha_{u1}(t){\mathbf{v}}_0),$$ (A.30) then $$\label{eq:dSpf} {\mathbf{d}}(t) = {\mathbf{S}}^\top{\mathbf{u}}({t\! + \! 1}) + \xi_v(t){\mathbf{v}}(t), \quad  \xi_v(t) = \mu_v(t)/\sqrt{\tau_w}.$$ (A.31) Next define the adaptation functions $$\label{eq:phiuvpf} \phi_u(t,v_0,v) := (vv_0, \phi_{\lambda u}(t,v_0,v)), \quad \phi_v(t,u_0,u) := (uu_0, \phi_{\lambda v}(t,u_0,u)),$$ (A.32) which are the adaptation functions in (4.2) with additional components for the second-order statistics $$uu_0$$ and $$vv_0$$. Since $$\phi_{\lambda u}(t,\cdot)$$ and $$\phi_{\lambda v}(t,\cdot)$$ are pseudo-Lipschitz of order $$p$$, so are $$\phi_{u}(t,\cdot)$$ and $$\phi_{v}(t,\cdot)$$. Taking the empirical means over each of the two components of $$\phi_u(\cdot)$$ and $$\phi_v(\cdot)$$, and applying (4.2) and (4.6), we see that if $$\nu_u(t)$$ and $$\nu_v(t)$$ are defined as in (A.5), \begin{eqnarray} \nu_u(t) &=& \frac{1}{n}\sum_{j=1}^n \phi_u(t,v_{0j},v_j(t)) = (\alpha_{v1}(t),\lambda_u(t)), \\ \end{eqnarray} (A.33a) \begin{eqnarray} \nu_v(t) &=& \frac{1}{m}\sum_{i=1}^m \phi_v(t,u_{0i},u_i({t\! + \! 1})) = (\alpha_{u1}({t\! + \! 1}),\lambda_v(t)).
\end{eqnarray} (A.33b) Therefore, $$\nu_u(t)$$ and $$\nu_v(t)$$ are vectors containing the parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ for the factor selection functions in lines 7 and 12 of Algorithm 1 as well as the second-order statistics $$\alpha_{u1}(t)$$ and $$\alpha_{v1}(t)$$. Now, for $$\nu_u = (\alpha_{v1},\lambda_u)$$ and $$\nu_v = (\alpha_{u1},\lambda_v)$$ define the scalar functions \begin{eqnarray} H_u(t,b,u_0,\nu_u) &:=& G_u(\sqrt{\tau_w} b + \beta\alpha_{v1}u_0,\lambda_u), \label{eq:Hudefpf} \\ \end{eqnarray} (A.34a) \begin{eqnarray} H_v(t,d,v_0,\nu_v) &:=& G_v(\sqrt{\tau_w} d + \alpha_{u1}v_0,\lambda_v). \label{eq:Hvdefpf} \end{eqnarray} (A.34b) Since $$G_u(p,\lambda_u)$$ and $$G_v(q,\lambda_v)$$ satisfy the continuity conditions in Assumption 4.1(c), $$H_u(t,b,u_0,\nu_u)$$ and $$H_v(t,d,v_0,\nu_v)$$ satisfy Assumption A1. In addition, the componentwise separability assumption in (4.1) implies that the updates in lines 7 and 12 of Algorithm 1 can be rewritten as $$\label{eq:Guvpf} u_i({t\! + \! 1}) = G_u(p_i(t),\lambda_u(t)), \quad v_j({t\! + \! 1}) = G_v(q_j(t),\lambda_v(t)).$$ (A.35) Thus, combining (A.34) and (A.35) with the definitions of $${\mathbf{b}}(t)$$ and $${\mathbf{d}}(t)$$ in (A.28) and (A.30), we obtain $$\label{eq:Huvpf} u_i({t\! + \! 1}) = H_u(t,b_i(t),u_{0i},\nu_u(t)), \quad v_j({t\! + \! 1}) = H_v(t,d_j(t),v_{0j},\nu_v(t)).$$ (A.36) Next observe that \begin{eqnarray} \xi_u({t\! + \! 1}) &\stackrel{(a)}{=}& \mu_u({t\! + \! 1})/\sqrt{\tau_w} \stackrel{(b)}{=} -\frac{\sqrt{\tau_w}}{m}\sum_{j=1}^n \frac{\partial}{\partial q_j}G_v(q_j(t),\lambda_v(t)) \nonumber \\ &\stackrel{(c)}{=}& -\frac{1}{m}\sum_{j=1}^n \frac{\partial}{\partial d_j}H_v(t,d_j(t),v_{0j},\nu_v(t)), \hspace{3cm} \label{eq:xiuH} \end{eqnarray} (A.37) where ($$a$$) follows from the definition of $$\xi_u(t)$$ in (A.29), ($$b$$) is the setting for $$\mu_u({t\! + \! 1})$$ in line 8 and ($$c$$) follows from the definition of $$H_v(\cdot)$$ in (A.34b).
Similarly, one can show that $$\label{eq:xivH} \xi_v(t) = -\frac{1}{m}\sum_{i=1}^m \frac{\partial}{\partial b_i}H_u(t,b_i(t),u_{0i},\nu_u(t)).$$ (A.38) Equations (A.29), (A.31), (A.33), (A.34), (A.37) and (A.38) exactly match the recursions in equations (A.2) to (A.5). Therefore, Theorem A2 shows that the limits (3.6) hold in the sense that the sets $$\theta_u(t)$$ and $$\theta_v(t)$$ converge empirically with bounded moments of order $$p$$. We next show that the limits $$U(t)$$ and $$V(t)$$ on the right-hand side of (3.6) match the descriptions in (3.8) and (3.9). First, define $${\overline{\alpha}}_{u1}(t)$$ and $${\overline{\alpha}}_{v1}(t)$$ as in (3.7) and $${\overline{\lambda}}_u(t)$$ and $${\overline{\lambda}}_v(t)$$ as in (4.4). Then, from (A.32), the expectations $${\overline{\nu}}_u(t)$$ and $${\overline{\nu}}_v(t)$$ in (A.8b) are given by $$\label{eq:nubarpf} {\overline{\nu}}_u(t) = ({\overline{\alpha}}_{v1}(t),{\overline{\lambda}}_u(t)), \quad {\overline{\nu}}_v(t) = ({\overline{\alpha}}_{u1}({t\! + \! 1}),{\overline{\lambda}}_v(t)).$$ (A.39) Using (A.6), (A.34a) and (A.39), we see that \begin{eqnarray} U({t\! + \! 1}) &=& H_u(t,B(t),U_0,{\overline{\nu}}_u(t)) \nonumber \\ &=& G_u\bigl(\beta{\overline{\alpha}}_{v1}(t)U_0 + \sqrt{\tau_w} B(t), {\overline{\lambda}}_u(t) \bigr), \label{eq:UtGB} \end{eqnarray} (A.40) where $$B(t) \sim {\mathcal N}(0,\tau^b(t))$$. Therefore, if we let $$Z_u(t) = \sqrt{\tau_w} B(t)$$, then $$Z_u(t)$$ is zero-mean Gaussian with variance ${\mathbb{E}}\left[Z_u^2(t)\right] = \tau_w \tau^b(t) \stackrel{(a)}{=} \beta \tau_w {\mathbb{E}}[V(t)^2] \stackrel{(b)}{=} \beta \tau_w {\overline{\alpha}}_{v0}(t),$ where ($$a$$) follows from (A.8a) and ($$b$$) follows from the definition of $${\overline{\alpha}}_{v0}(t)$$ in (3.7). Substituting $$Z_u(t) = \sqrt{\tau_w} B(t)$$ into (A.40), we obtain the model for $$U({t\! + \! 1})$$ in (3.8). Similarly, using (A.7) and (A.34b), we can obtain the model $$V({t\! + \! 1})$$ in (3.9).
Thus, we have proved that the random variables $$(U_0,U(t))$$ and $$(V_0,V(t))$$ are described by (3.8) and (3.9), and this completes the proof. A.4 Proof of Theorem 5.1 The theorem is proved by simply evaluating the second-order statistics. We begin with $${\overline{\alpha}}_{u1}({t\! + \! 1})$$: \begin{eqnarray} {\overline{\alpha}}_{u1}({t\! + \! 1}) &\stackrel{(a)}{=}& {\mathbb{E}}[ U_0U({t\! + \! 1}) ] \stackrel{(b)}{=} {\overline{\lambda}}_u(t) {\mathbb{E}}[ U_0P(t) ] \nonumber \\ &\stackrel{(c)}{=}& {\overline{\lambda}}_u(t) {\mathbb{E}}\left[ U_0(\beta{\overline{\alpha}}_{v1}(t)U_0+Z_u(t)) \right] \stackrel{(d)}{=} {\overline{\lambda}}_u(t)\beta\tau_u{\overline{\alpha}}_{v1}(t), \label{eq:alphau1Lin} \end{eqnarray} (A.41) where ($$a$$) is the definition in (3.7), ($$b$$) follows from (3.8) and (5.1), ($$c$$) follows from (3.8) and ($$d$$) follows from the independence of $$Z_u(t)$$ and $$U_0$$ and the definition of $$\tau_u$$ in (4.8). Similarly, one can show that $$\label{eq:alphau0Lin} {\overline{\alpha}}_{u0}({t\! + \! 1}) = {\overline{\lambda}}_u(t)^2\left[ \beta^2\tau_u{\overline{\alpha}}^2_{v1}(t) + \beta\tau_w{\overline{\alpha}}_{v0}(t)\right]\!.$$ (A.42) Substituting (A.41) and (A.42) into (4.9), we obtain the asymptotic correlation coefficient \begin{align} \rho_u({t\! + \! 1}) &= \frac{{\overline{\lambda}}^2_u(t)\beta^2\tau_u^2{\overline{\alpha}}^2_{v1}(t)} {{\overline{\lambda}}^2_u(t)\left[ \beta^2\tau_u{\overline{\alpha}}^2_{v1}(t) + \beta\tau_w{\overline{\alpha}}_{v0}(t)\right]\tau_u} \nonumber \\ &= \frac{\beta\tau_u{\overline{\alpha}}_{v1}^2(t)} {\beta\tau_u{\overline{\alpha}}^2_{v1}(t) + \tau_w{\overline{\alpha}}_{v0}(t)} = \frac{\beta\tau_u\tau_v\rho_{v}(t)} {\beta\tau_u\tau_v\rho_{v}(t) + \tau_w}, \nonumber \end{align} where the last step follows from (4.10). This proves the first equation in (5.2). A similar set of calculations shows that \begin{eqnarray} {\overline{\alpha}}_{v1}({t\! + \!
1}) &=& {\overline{\lambda}}_v(t)\tau_v{\overline{\alpha}}_{u1}({t\! + \! 1}), \\ \end{eqnarray} (A.43a) \begin{eqnarray} {\overline{\alpha}}_{v0}({t\! + \! 1}) &=& {\overline{\lambda}}_v(t)^2\left[ \tau_v{\overline{\alpha}}^2_{u1}({t\! + \! 1}) + \tau_w{\overline{\alpha}}_{u0}({t\! + \! 1})\right]\!. \end{eqnarray} (A.43b) Substituting these expressions into (4.9) and (4.10), we obtain the recursion (5.2). Hence, we have proved part (a) of the theorem. For part (b), we need the following simple lemma. Lemma A1 Suppose that $$H:[0,1] {\rightarrow} [0,1]$$ is continuous and monotonically increasing, and $$x(t)$$ is a sequence satisfying the recursive relation $$\label{eq:monoiter} x(t+1) = H(x(t)),$$ (A.44) for some initial condition $$x(0) \in [0,1]$$. Then, $$x(t)$$ either monotonically increases or monotonically decreases to some limit $$x^* = H(x^*)$$. Proof. This can be proved similarly to [45, Lemma 7]. □ To apply Lemma A1, observe that the recursions (5.2) show that $\rho_u({t\! + \! 1}) = \frac{\beta \tau_u\tau_v \rho_v(t)}{ \beta \tau_u\tau_v \rho_v(t) + \tau_w} = \frac{\beta \tau_u^2\tau_v^2 \rho_u(t)}{ \beta \tau_u^2\tau_v^2 \rho_u(t) + \tau_w(\tau_u\tau_v \rho_u(t) + \tau_w)}.$ So, if we define $$\label{eq:HdefLin} H(\rho_u) := \frac{\beta \tau_u^2\tau_v^2 \rho_u}{ (\beta \tau_u^2\tau_v^2 + \tau_w\tau_u\tau_v)\rho_u + \tau_w^2},$$ (A.45) then it follows from (5.2) that $$\rho_u({t\! + \! 1}) = H(\rho_u(t))$$ for all $$t$$. By taking the derivative, it can be checked that $$H(\rho_u)$$ is monotonically increasing. It follows from Lemma A1 that $$\rho_u(t) {\rightarrow} \rho_u^*$$ for some fixed point $$\rho_u^* = H(\rho_u^*)$$ with $$\rho_u^* \in [0,1]$$.
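The convergence guaranteed by Lemma A1 is easy to observe numerically. The sketch below iterates $$H$$ in (A.45) for one illustrative (assumed) choice of constants and compares the limit with the nonzero solution of $$\rho_u = H(\rho_u)$$, solved in closed form:

```python
# Iterate rho_u(t+1) = H(rho_u(t)) with H from (A.45) and compare the
# limit with the nonzero fixed point of rho = H(rho).  The constants
# beta, tau_u, tau_v, tau_w are illustrative choices.
beta, tau_u, tau_v, tau_w = 1.0, 1.0, 1.0, 0.5

def H(rho_u):
    num = beta * tau_u**2 * tau_v**2 * rho_u
    den = (beta * tau_u**2 * tau_v**2 + tau_w * tau_u * tau_v) * rho_u + tau_w**2
    return num / den

rho = 0.05                               # any positive initialization
for _ in range(200):
    rho = H(rho)                         # monotone convergence (Lemma A1)

# Nonzero solution of rho = H(rho), obtained by clearing the denominator:
rho_star = (beta * tau_u**2 * tau_v**2 - tau_w**2) / \
           (tau_u * tau_v * (beta * tau_u * tau_v + tau_w))
print(rho, rho_star)                     # both close to 0.5
```

With these constants $$\beta\tau_u^2\tau_v^2 > \tau_w^2$$, so the iteration escapes zero and settles at the positive fixed point.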
Now, there are only two fixed-point solutions to $$\rho_u^* = H(\rho_u^*)$$: $$\rho_u^* = 0$$ and $$\label{eq:corruLinLimitPos} \rho_u^* = \frac{\beta\tau_u^2\tau_v^2 - \tau_w^2}{ \tau_u\tau_v(\beta\tau_u\tau_v + \tau_w)}.$$ (A.46) When $$\label{eq:snrMinPf} \beta\tau_u^2\tau_v^2 \leq \tau_w^2,$$ (A.47) the value $$\rho_u^*$$ in (A.46) is not positive, so the only fixed point in $$[0,1]$$ is $$\rho_u^* = 0$$. Therefore, when (A.47) is satisfied, $$\rho_u(t)$$ must converge to the zero fixed point: $$\rho_u(t) {\rightarrow} 0$$. Now, suppose that (A.47) is not satisfied. In this case, we claim that $$\rho_u(t) {\rightarrow} \rho_u^*$$, where $$\rho_u^*$$ is in (A.46). We prove this claim by contradiction and suppose, instead, that $$\rho_u(t)$$ converges to the other fixed point: $$\rho_u(t) {\rightarrow} 0$$. Since Lemma A1 shows that $$\rho_u(t)$$ must be either monotonically increasing or decreasing, the only way $$\rho_u(t) {\rightarrow} 0$$ is that $$\rho_u(t)$$ monotonically decreases to zero. But, when (A.47) is not satisfied, it can be checked that for $$\rho_u(t)$$ sufficiently small and positive, $$\rho_u({t\! + \! 1}) = H(\rho_u(t)) > \rho_u(t)$$. This contradicts the fact that $$\rho_u(t)$$ is monotonically decreasing and, therefore, $$\rho_u(t)$$ must converge to the other fixed point $$\rho_u^*$$ in (A.46). Hence, we have shown that when (A.47) is satisfied, $$\rho_u(t) {\rightarrow} 0$$, and when (A.47) is not satisfied, $$\rho_u(t) {\rightarrow} \rho_u^*$$ in (A.46). This is equivalent to the first limit in (5.3). The second limit in (5.3) is proved similarly. A.5 Proof of Theorem 5.2 Similarly to the proof of Theorem 5.1, we begin by computing the second-order statistics of $$(U_0,U(t))$$. Since $$U(t) = {\mathbb{E}}[U_0 {\,|\,} P({t\! - \! 1})]$$, $$U(t)$$ must be uncorrelated with the error: $${\mathbb{E}}[U(t)(U_0-U(t))]=0$$.
Hence, $${\overline{\alpha}}_{u0}(t) - {\overline{\alpha}}_{u1}(t) = {\mathbb{E}}[U(t)U(t) - U(t)U_0] = 0, \label{eq:aueqmmse}$$ (A.48) and therefore $${\overline{\alpha}}_{u0}(t) = {\overline{\alpha}}_{u1}(t)$$. Now, consider the measurement $$P(t)$$ in (3.8). The SNR in this channel is $$\label{eq:snru} \eta_u(t) = \frac{\beta^2{\overline{\alpha}}^2_{v1}(t)}{\beta\tau_w{\overline{\alpha}}_{v0}(t)} = \frac{\beta \tau_v\rho_v(t)}{\tau_w}.$$ (A.49) Since $$U({t\! + \! 1})$$ is the conditional expectation of $$U_0$$ given $$P(t)$$, the mean-squared error is given by $${\mathcal E}_u(\eta_u(t))$$ defined in (5.6). Therefore, \begin{eqnarray} {\mathcal E}_u(\eta_u(t)) = {\mathbb{E}}\left[(U({t\! + \! 1})-U_0)^2\right] \stackrel{(a)}{=} {\overline{\alpha}}_{u0}({t\! + \! 1})-2{\overline{\alpha}}_{u1}({t\! + \! 1}) + \tau_u \stackrel{(b)}{=} \tau_u - {\overline{\alpha}}_{u0}({t\! + \! 1}), \label{eq:mmseau0} \end{eqnarray} (A.50) where ($$a$$) follows from expanding the square and substituting in the definitions in (3.7) and (4.8) and ($$b$$) follows from the fact that $${\overline{\alpha}}_{u0}({t\! + \! 1}) = {\overline{\alpha}}_{u1}({t\! + \! 1})$$ proven above. We have thus proved that $$\label{eq:alphaummse} {\overline{\alpha}}_{u0}({t\! + \! 1}) = {\overline{\alpha}}_{u1}({t\! + \! 1}) = \tau_u - {\mathcal E}_u(\eta_u(t)).$$ (A.51) Therefore, the asymptotic correlation coefficient is given by \begin{eqnarray} \rho_u({t\! + \! 1}) \stackrel{(a)}{=} \frac{{\overline{\alpha}}_{u1}^2({t\! + \! 1})}{ \tau_u{\overline{\alpha}}_{u0}({t\! + \! 1})} \stackrel{(b)}{=} 1 - \tau_u^{-1}{\mathcal E}_u(\eta_u(t)), \label{eq:corruMmsePf} \end{eqnarray} (A.52) where ($$a$$) follows from (4.9) and ($$b$$) follows from (A.51). Substituting (A.49) into (A.52) proves the first equation in (5.7). The second recursion in (5.7) can be proved similarly.
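As a worked instance of the recursion (5.7), consider the assumed case of Gaussian priors $$U_0 \sim \mathcal{N}(0,\tau_u)$$ and $$V_0 \sim \mathcal{N}(\mu,\sigma^2)$$ (an illustration, not a case treated in the paper), for which the MMSE functions in (5.6) have the closed forms $${\mathcal E}_u(\eta) = \tau_u/(1+\eta\tau_u)$$ and $${\mathcal E}_v(\eta) = \sigma^2/(1+\eta\sigma^2)$$:

```python
# Worked instance of the recursion (5.7) under assumed Gaussian priors
# U0 ~ N(0, tau_u) and V0 ~ N(mu, sig2); the AWGN posterior variance of a
# Gaussian prior gives the closed-form MMSE functions below.
beta, tau_w, tau_u = 1.0, 0.5, 1.0
mu, sig2 = 0.4, 0.84
tau_v = mu**2 + sig2                     # second moment of V0 (= 1.0 here)

E_u = lambda eta: tau_u / (1.0 + eta * tau_u)
E_v = lambda eta: sig2 / (1.0 + eta * sig2)

rho_v = mu**2 / tau_v                    # initial condition rho_v(0)
hist = [rho_v]
for _ in range(100):
    rho_u = 1.0 - E_u(beta * tau_v * rho_v / tau_w) / tau_u   # (5.7)
    rho_v = 1.0 - E_v(tau_u * rho_u / tau_w) / tau_v          # (5.7)
    hist.append(rho_v)

print(hist[0], hist[-1])                 # monotonically increasing
```

The trajectory increases monotonically from $$\rho_v(0) = \mu^2/\tau_v$$ toward a fixed point, matching the monotone-convergence argument that follows.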
For the initial condition in the recursion, observe that with $$V(0) = {\mathbb{E}}[V_0]$$, the second-order statistics are given by \begin{eqnarray*} {\overline{\alpha}}_{v0}(0) = {\mathbb{E}}[ V(0)^2] = ({\mathbb{E}}[V_0])^2, \quad {\overline{\alpha}}_{v1}(0) = {\mathbb{E}}[ V_0V(0)] = ({\mathbb{E}}[V_0])^2. \end{eqnarray*} Hence, from (4.10), the initial correlation coefficient is $\rho_v(0) = \frac{{\overline{\alpha}}^2_{v1}(0)}{\tau_v{\overline{\alpha}}_{v0}(0)} = \frac{({\mathbb{E}}[V_0])^2}{\tau_v},$ which agrees with the statement in the theorem. This proves part (a). To prove part (b), we again use Lemma A1. Define the functions $$\label{eq:Huvmmse} H_u(\rho_v) := 1 - \tau_u^{-1}{\mathcal E}_u(\beta \tau_v\rho_v/\tau_w), \quad H_v(\rho_u) := 1 - \tau_v^{-1}{\mathcal E}_v(\tau_u\rho_u/\tau_w),$$ (A.53) and their composition $$\label{eq:Hmmse} H(\rho_v) = H_v(H_u(\rho_v)).$$ (A.54) From (5.7), it follows that $$\rho_v({t\! + \! 1}) = H(\rho_v(t))$$. Now, $${\mathcal E}_u(\eta_u)$$ and $${\mathcal E}_v(\eta_v)$$ defined in (5.6) are the mean-squared errors for estimating $$U_0$$ and $$V_0$$ from AWGN measurements with SNRs $$\eta_u$$ and $$\eta_v$$, so they must be monotonically decreasing in $$\eta_u$$ and $$\eta_v$$. Therefore, $$H_u(\rho_v)$$ and $$H_v(\rho_u)$$ in (A.53) are monotonically increasing functions, and thus so is the composed function $$H(\rho_v)$$ in (A.54). Also, since the assumption of part (b) is that $${\mathcal E}_u(\eta_u)$$ and $${\mathcal E}_v(\eta_v)$$ are continuous, $$H(\rho_v)$$ is also continuous. It follows from Lemma A1 that $$\rho_v(t) {\rightarrow} \rho_v^*$$, where $$\rho_v^*$$ is a fixed point of (5.7). It remains to show $$\rho_v^* > 0$$.
Observe that \begin{eqnarray} {\mathcal E}_v(\eta_v) \stackrel{(a)}{=} {\mathbb{E}}\left[(V_0 - {\mathbb{E}}[V_0 {\,|\,} Y])^2\right] \leq {\bf var}(V_0) \stackrel{(b)}{=} {\mathbb{E}}[V_0^2] - ({\mathbb{E}}[V_0])^2 = \tau_v(1-\rho_v(0)), \nonumber \end{eqnarray} where $$Y=\sqrt{\eta_v}V_0+D$$ with $$D \sim {\mathcal N}(0,1)$$, ($$a$$) follows from the definition of $${\mathcal E}_v(\eta_v)$$ in (5.6) and ($$b$$) follows from the definition of $$\tau_v$$ in (4.8) and the initial condition $$\rho_v(0) = ({\mathbb{E}}[V_0])^2/\tau_v$$. It follows from (5.7) that $\rho_v({t\! + \! 1}) = 1-\frac{1}{\tau_v}{\mathcal E}_v(\eta_v(t)) \geq \rho_v(0).$ Therefore, the limit point $$\rho_v^*$$ of $$\rho_v(t)$$ must satisfy $$\rho_v^* \geq \rho_v(0) > 0$$. A.6 Proof of Lemma 5.1 Define the functions $$H_u$$, $$H_v$$ and $$H$$ as in (A.53) and (A.54) from the previous proof. We know that $$\rho_v({t\! + \! 1}) = H(\rho_v(t))$$. When $${\mathbb{E}}[U_0] = {\mathbb{E}}[V_0] = 0$$, then $$\rho_v = 0$$ is a fixed point of the update. We can determine the stability of this fixed point by computing the derivative of $$H(\rho_v)$$ at $$\rho_v=0$$. Towards this end, we first use a standard result (see, e.g., [21]) that, for any prior on $$U_0$$ with bounded second moments, the mean-squared error in (5.6) satisfies $$\label{eq:mmseuLow} {\mathcal E}_u(\eta_u) = \frac{\tau_u}{1 + \eta_u\tau_u} + O(\eta_u^2).$$ (A.55) The term $$\tau_u/(1+\eta_u\tau_u)$$ is the minimum mean-squared error of the linear estimate of $$U_0$$ from an AWGN noise-corrupted measurement $$Y = \sqrt{\eta_u}U_0 + D$$, $$D \sim {\mathcal N}(0,1)$$. Equation (A.55) reflects the fact that linear estimation is optimal in the low-SNR limit; see [21] for details. Using (A.55), we can compute the derivative of $$H_u$$, \begin{eqnarray} H_u'(0) = -\frac{1}{\tau_u}\left. \frac{\partial}{\partial \rho_v} {\mathcal E}_u(\beta \tau_v\rho_v/\tau_w) \right|_{\rho_v = 0} = -\frac{\beta \tau_v}{\tau_u\tau_w}{\mathcal E}_u'(0) = \frac{\beta \tau_u \tau_v}{\tau_w}.
\label{eq:mmseuDeriv} \end{eqnarray} (A.56) Similarly, one can show that $$\label{eq:mmsevDeriv} H_v'(0) = \frac{\tau_u \tau_v}{\tau_w},$$ (A.57) and hence $$\label{eq:HzeroDeriv} H'(0) = H_v'(0)H_u'(0) = \frac{\beta \tau_u^2 \tau_v^2}{\tau_w^2}.$$ (A.58) We now apply a standard linearization analysis of the nonlinear system $$\rho_v({t\! + \! 1}) = H(\rho_v(t))$$ around the fixed point $$\rho_v=0$$. See, for example, [56]. If $$\sqrt{\beta}\tau_u\tau_v < \tau_w$$, then $$H'(0) < 1$$ and the fixed point is stable. Thus, for any $$\rho_v(0)$$ sufficiently small, $$\rho_v(t) {\rightarrow} 0$$. This proves part (b) of the lemma. On the other hand, if $$\sqrt{\beta}\tau_u\tau_v > \tau_w$$, then $$H'(0) > 1$$ and the fixed point is unstable. This implies that, for any $$\rho_v(0) > 0$$, $$\rho_v(t)$$ moves away from zero. But we know from Theorem 5.2 that $$\rho_v(t)$$ must converge to some fixed point of $$\rho_v = H(\rho_v)$$. So, the limit point must be positive. This proves part (a) of the lemma. © The authors 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

*Information and Inference: A Journal of the IMA*, Volume 7 (3), 19 September 2018, 32 pages.
Publisher: Oxford University Press. ISSN: 2049-8764; eISSN: 2049-8772. DOI: 10.1093/imaiai/iax014.

When the noise term $${\mathbf{W}}$$ is zero, the vector pair $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$ can be recovered exactly, up to a scaling, from the maximal left and right singular vectors of $${\mathbf{A}}$$ [25]. However, in the presence of noise, the rank-one matrix can in general only be estimated approximately. In this case, a priori information or constraints on $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$ may improve the estimation. Such constraints arise, for example, in factor analysis in statistics, where one of the factors is often constrained to be either positive or sparse [5]. Similar sparsity constraints occur in the problem of dictionary learning [41]. Also, in digital communications, one of the factors could come from a discrete quadrature amplitude modulation constellation. In this article, we impose the constraints on $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ in a Bayesian setting, where $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ are assumed to be independent of one another with independent and identically distributed (i.i.d.) densities of the form $$\label{eq:puvDist} p_{{\mathbf{u}}_0}({\mathbf{u}}_0) = \prod_{i=1}^m p_{U_0}(u_{0i}), \quad p_{{\mathbf{v}}_0}({\mathbf{v}}_0) = \prod_{j=1}^n p_{V_0}(v_{0j}),$$ (1.2) for some scalar random variables $$U_0$$ and $$V_0$$. The noise $${\mathbf{W}}$$ in (1.1) is assumed to have i.i.d. Gaussian components, where $$W_{ij} \sim {\mathcal N}(0,\tau_w)$$ for some variance $$\tau_w > 0$$. Under this model, the posterior density of $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ is given by $$\label{eq:puvA} p_{{\mathbf{u}}_0,{\mathbf{v}}_0|{\mathbf{A}}}({\mathbf{u}}_0,{\mathbf{v}}_0) \propto p_{{\mathbf{u}}_0}({\mathbf{u}}_0)p_{{\mathbf{v}}_0}({\mathbf{v}}_0) \exp\left[ -\frac{1}{2m\tau_w}\|{\mathbf{A}}-{\mathbf{u}}_0{\mathbf{v}}_0^\top\|^2_F \right]\!,$$ (1.3) where $$\|\cdot\|_F$$ is the Frobenius norm. Given this posterior density, we consider two estimation problems: MAP estimation.
In this problem, we wish to find the maximum a posteriori estimates $$\label{eq:uvMAP} ({\widehat{\mathbf{u}}},{\widehat{\mathbf{v}}}) = \mathop{\mathrm{arg\,max}}_{{\mathbf{u}},{\mathbf{v}}} p_{{\mathbf{u}}_0,{\mathbf{v}}_0|{\mathbf{A}}}({\mathbf{u}},{\mathbf{v}}).$$ (1.4) MMSE estimation. In this problem, we wish to find the posterior mean or, equivalently, the minimum mean-squared error estimate, $$\label{eq:uvMMSE} {\widehat{\mathbf{u}}} = {\mathbb{E}}({\mathbf{u}}_0|{\mathbf{A}}), \quad {\widehat{\mathbf{v}}} = {\mathbb{E}}({\mathbf{v}}_0|{\mathbf{A}}).$$ (1.5) We may also be interested in the posterior variances and posterior marginals. Exact computation of either of these estimates is generally intractable. Even though $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ are assumed to have independent components, the posterior density (1.3) is not, in general, separable due to the term $${\mathbf{u}}_0{\mathbf{v}}_0^\top$$. Thus, the MAP estimate requires a search over the $$m$$- and $$n$$-dimensional vectors $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$ and the MMSE estimate requires integration over this $$(m+n)$$-dimensional space. However, due to the separability assumption on the priors (1.2), it is computationally simpler to alternately estimate the factors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$. This form of alternating estimation is used, for example, in the classic alternating power method for finding maximal singular values [25] and some alternating methods in sparse or non-negative dictionary learning [1,35,37,41,42]. The approach in this work also uses a form of alternating estimation, but based on a recently developed powerful class of algorithms known as approximate message passing (AMP). AMP methods are derived from Gaussian and quadratic approximations of loopy belief propagation in graphical models and were originally used for problems in compressed sensing [3,15–17,45,46].
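For concreteness, the observation model (1.1) with separable priors (1.2) is simple to instantiate. The sketch below uses hypothetical dimensions, a positivity prior on $${\mathbf{u}}_0$$ and a Bernoulli–Gaussian sparsity prior on $${\mathbf{v}}_0$$; all of these choices are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, tau_w = 500, 250, 0.5     # hypothetical sizes and noise variance

# Priors (1.2): positive entries for u0, sparse Bernoulli-Gaussian for v0.
u0 = np.abs(rng.normal(size=m))                  # positivity constraint
v0 = rng.normal(size=n) * (rng.random(n) < 0.1)  # ~10% nonzero entries

# Observation model (1.1): A = u0 v0^T + sqrt(m) W, with W_ij ~ N(0, tau_w).
W = rng.normal(scale=np.sqrt(tau_w), size=(m, n))
A = np.outer(u0, v0) + np.sqrt(m) * W
```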
The AMP methodology has been successfully applied in a range of applications [8,18,19,58] and has seen a number of improvements for stability and convergence [6,38,49–51,59]. In recent years, there has been growing interest in AMP for matrix factorization problems as well [34,36,43,44,47]. The problem considered in this article can be viewed as a simple rank-one case of these problems. Our main contribution in this work is to show that, for the rank-one MMSE and MAP estimation problems, the proposed IterFac algorithm admits an asymptotically exact characterization in the limit when $$m,n {\rightarrow} \infty$$ with $$m/n$$ constant and the components of the true vectors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ have limiting empirical distributions. In this scenario, we show that the empirical joint distribution of the components of $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ and the corresponding estimates from the IterFac algorithm are described by a simple scalar equivalent model, where the IterFac component estimates are identically distributed to scalar estimates of the variables corrupted by Gaussian noise. Moreover, the effective Gaussian noise level in this model is described by a simple set of scalar state evolution (SE) equations. This scalar equivalent model is common in analyses of AMP methods [3,15–17,45,46] as well as replica analyses of related estimation problems [21,48,55]. From the scalar equivalent model, one can compute the asymptotic value of almost any component-separable metric including mean-squared error or correlation. Thus, in addition to being computationally simple and general, the IterFac algorithm admits a precise analysis in the case of Gaussian noise. Moreover, since fixed points of the IterFac algorithm correspond, under suitable circumstances, to local maxima of objectives such as (1.4), the analysis can be used to characterize the behavior of such maxima—even if alternate algorithms to IterFac are used.
The main analytical tool is a recently developed technique by Bayati and Montanari [3] used in the analysis of AMP algorithms. This work proved that, in the limit for large Gaussian mixing matrices, the behavior of AMP estimates can be described by a scalar equivalent model with effective noise levels defined by certain scalar SE equations. Similar SE analyses have appeared in related contexts [4,22,23,40,45,46]. To prove the SE equations for the IterFac algorithm, we apply a key theorem from [3] with a simple change of variables and a slight modification to account for parameter adaptation. A conference version of this article appeared in [47]. This article provides all the proofs along with more detailed discussions and simulations. Since the original publication of the conference version in [47], several other research groups have extended and built on the work. Importantly, [11] has shown that in the case of certain discrete priors, the IterFac algorithm is provably Bayes optimal in a large system limit. It was shown in [12] that an algorithm closely related to IterFac could provide the best known scaling laws for the hidden clique problem. More recently, [43] and [31] have used related AMP-type methods for matrix factorization problems with rank greater than one. 2. Iterative rank-one factorization The proposed IterFac algorithm is shown in Algorithm 1. Given a matrix $${\mathbf{A}} \in {\mathbb{R}}^{m{\times} n}$$, the algorithm outputs a sequence of estimates $$({\mathbf{u}}(t),{\mathbf{v}}(t))$$, $$t=0,1,\,{\ldots}\,$$, for $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$. The algorithm has several parameters including the initial conditions, the parameters in lines 5 and 10, the termination condition and, most importantly, the functions $$G_u(\cdot)$$ and $$G_v(\cdot)$$. 
In each iteration, the functions $$G_u(\cdot)$$ and $$G_v(\cdot)$$ are used to generate the estimates of the factors $${\mathbf{u}}(t)$$ and $${\mathbf{v}}(t)$$ and will be called the factor selection functions. Throughout this work, we assume that the factor selection functions are separable, meaning that they act componentwise on the vectors $${\mathbf{p}}(t)$$ and $${\mathbf{q}}(t)$$: $$\label{eq:Guvsep} u_i({t\! + \! 1}) = G_u(p_i(t),\lambda_u(t)), \quad v_j({t\! + \! 1}) = G_v(q_j(t),\lambda_v(t)),$$ (2.1) for some scalar functions $$G_u(\cdot)$$ and $$G_v(\cdot)$$. The choice of the factor selection function will depend on whether IterFac is used for MAP or MMSE estimation. Algorithm 1: Iterative Factorization (IterFac). MAP estimation To describe the factor selection functions for the MAP estimation problem (1.4), let $$\lambda_u(t)$$ and $$\lambda_v(t)$$ be sequences of pairs of parameters $$\label{eq:lamuvDef} \lambda_u(t) := (\gamma_u(t),\nu_u(t)), \quad \lambda_v(t) := (\gamma_v(t),\nu_v(t)).$$ (2.2) Then, for each $$t$$, consider random vectors $${\mathbf{p}}(t)$$ and $${\mathbf{q}}(t)$$ given by \begin{align} \label{eq:pqAwgn} \begin{split} &{\mathbf{p}}(t) = \gamma_u(t){\mathbf{u}}_0 + {\mathbf{z}}_u(t), \quad {\mathbf{u}}_0 \sim p_{{\mathbf{u}}_0}({\mathbf{u}}_0), \quad {\mathbf{z}}_u(t) \sim {\mathcal N}(0,\nu_u(t){\mathbf{I}}), \\ &{\mathbf{q}}(t) = \gamma_v(t){\mathbf{v}}_0 + {\mathbf{z}}_v(t), \quad {\mathbf{v}}_0 \sim p_{{\mathbf{v}}_0}({\mathbf{v}}_0), \quad {\mathbf{z}}_v(t) \sim {\mathcal N}(0,\nu_v(t){\mathbf{I}}). \end{split} \end{align} (2.3) The random variables $${\mathbf{p}}(t)$$ and $${\mathbf{q}}(t)$$ in (2.3) are simply scaled versions of the true vectors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ with additive white Gaussian noise (AWGN).
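For intuition, each selection function acts as a scalar denoiser for the AWGN models (2.3). As one concrete instance (a hypothetical choice for illustration, not a prior used in the paper), a Laplacian sparsity prior $$p_{U_0}(u) \propto e^{-\lambda |u|}$$ makes the scalar MAP denoiser $$\arg\min_u [\lambda|u| + (\gamma_u u - p)^2/(2\nu_u)]$$ a soft-thresholding rule:

```python
import numpy as np

def G_u_map(p, gamma_u, nu_u, lam):
    """Scalar MAP denoiser for a Laplace prior p(u) ~ exp(-lam*|u|):
    setting the subgradient to zero gives soft thresholding of p/gamma_u."""
    x = p / gamma_u
    return np.sign(x) * np.maximum(np.abs(x) - lam * nu_u / gamma_u**2, 0.0)
```

A direct grid minimization of the same objective confirms the closed form.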
Then, for the MAP problem, we take the factor selection functions to be the MAP estimates of $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$, given the observations $${\mathbf{p}}(t)$$ and $${\mathbf{q}}(t)$$: $$\label{eq:GuvMAP} G_u({\mathbf{p}}(t),\lambda_u(t)) := \mathop{\mathrm{arg\,max}}_{{\mathbf{u}}_0} p({\mathbf{u}}_0|{\mathbf{p}}(t),\gamma_u(t),\nu_u(t)), \quad G_v({\mathbf{q}}(t),\lambda_v(t)) = \mathop{\mathrm{arg\,max}}_{{\mathbf{v}}_0} p({\mathbf{v}}_0|{\mathbf{q}}(t),\gamma_v(t),\nu_v(t)).$$ (2.4) Importantly, due to the separability assumption on the priors (1.2), these MAP estimates can be computed componentwise: \begin{align} \label{eq:GuvMAPsep} \begin{split} & G_u(p_i,\lambda_u) = \mathop{\mathrm{arg\,min}}_{u_{0i}} \left[ -\log p_{U_0}(u_{0i}) + \frac{(\gamma_uu_{0i}-p_i)^2}{2\nu_u} \right]\!, \\ & G_v(q_j,\lambda_v) = \mathop{\mathrm{arg\,min}}_{v_{0j}} \left[ -\log p_{V_0}(v_{0j}) + \frac{(\gamma_vv_{0j}-q_j)^2}{2\nu_v} \right]\!. \end{split} \end{align} (2.5) Hence, the IterFac algorithm replaces the vector-valued MAP estimation of $${\mathbf{u}}_0,{\mathbf{v}}_0$$ from the joint density (1.3), with a sequence of scalar MAP estimation problems along with multiplications by $${\mathbf{A}}$$ and $${\mathbf{A}}^\top$$. MMSE estimation For the MMSE estimation problem, we simply take the factor selection functions to be the MMSE estimates with respect to the random variables (2.3): $$\label{eq:GuvMMSE} G_u({\mathbf{p}}(t),\lambda_u(t)) = {\mathbb{E}}({\mathbf{u}}_0|{\mathbf{p}}(t),\gamma_u(t),\nu_u(t)), \quad G_v({\mathbf{q}}(t),\lambda_v(t)) = {\mathbb{E}}({\mathbf{v}}_0|{\mathbf{q}}(t),\gamma_v(t),\nu_v(t)).$$ (2.6) Again, due to the separability assumption (1.2), these MMSE estimation problems can be performed componentwise. Hence, MMSE IterFac reduces the vector MMSE problem to a sequence of scalar problems in Gaussian noise. 3.
Intuitive algorithm derivation and analysis 3.1 Algorithm intuition Before we formally analyze the algorithm, it is instructive to understand the intuition behind the method. For both the MAP and MMSE versions of IterFac, first observe that $$\label{eq:pintuit} {\mathbf{p}}(t) \stackrel{(a)}{=} \frac{1}{m}{\mathbf{A}}{\mathbf{v}}(t)+\mu_u(t){\mathbf{u}}(t) \stackrel{(b)}{=} \frac{{\mathbf{v}}_0^\top{\mathbf{v}}(t)}{m} {\mathbf{u}}_0 + \frac{1}{\sqrt{m}}{\mathbf{W}}{\mathbf{v}}(t) + \mu_u(t){\mathbf{u}}(t) \stackrel{(c)}{=} \gamma_u(t){\mathbf{u}}_0 + {\mathbf{z}}_u(t),$$ (3.1) where (a) follows from line 6 in Algorithm 1, (b) follows from the assumptions on the measurements (1.1) and in (c), we have used the definitions $$\label{eq:gamzu} \gamma_u(t) = \frac{1}{m}{\mathbf{v}}_0^\top{\mathbf{v}}(t), \quad {\mathbf{z}}_u(t) = \frac{1}{\sqrt{m}}{\mathbf{W}}{\mathbf{v}}(t) + \mu_u(t){\mathbf{u}}(t).$$ (3.2) For the first iteration, $${\mathbf{W}}$$ is a large zero-mean matrix independent of the initial condition $${\mathbf{v}}(0)$$. Also, $$\mu_u(0)=0$$ and hence the components of $${\mathbf{z}}_u(0)$$ will be asymptotically Gaussian zero-mean variables due to the central limit theorem. Hence, the vector $${\mathbf{p}}(0)$$ will be distributed as the true vector $${\mathbf{u}}_0$$ with Gaussian noise as in the model (2.3). Therefore, we can take an initial MAP or MMSE estimate of $${\mathbf{u}}_0$$ from the scalar estimation functions in (2.4) or (2.6). Unfortunately, in subsequent iterations with $$t>0$$, $${\mathbf{v}}(t)$$ is no longer independent of $${\mathbf{W}}$$ and hence $${\mathbf{W}}{\mathbf{v}}(t)$$ will not, in general, be a Gaussian zero-mean random vector. However, remarkably, we will show that with the specific choice of $$\mu_u(t)$$ in Algorithm 1, the addition of the term $$\mu_u(t){\mathbf{u}}(t)$$ in (3.2) ‘debiases’ the noise term, so that $${\mathbf{z}}_u(t)$$ is asymptotically zero-mean Gaussian. 
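The first-iteration claim is easy to check numerically. The sketch below (hypothetical dimensions, for illustration) draws $${\mathbf{v}}(0)$$ independent of $${\mathbf{W}}$$ and verifies that the noise term $${\mathbf{z}}_u(0)$$ in (3.2) has approximately zero mean and the variance $$\tau_w\|{\mathbf{v}}(0)\|^2/m$$ predicted by the central limit theorem:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, tau_w = 2000, 1000, 0.5             # illustrative sizes and noise level
v_init = rng.normal(size=n)               # initial estimate v(0), independent of W
W = rng.normal(scale=np.sqrt(tau_w), size=(m, n))

# Noise term z_u(0) in (3.2); mu_u(0) = 0 in the first iteration.
z = W @ v_init / np.sqrt(m)

# Conditioned on v(0), each component of z is a sum of n independent terms,
# with mean zero and variance tau_w * ||v(0)||^2 / m.
pred_var = tau_w * (v_init @ v_init) / m
```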
Thus, $${\mathbf{p}}(t)$$ continues to be distributed as in (2.3), and we can construct MAP or MMSE estimates of $${\mathbf{u}}_0$$ from the vector $${\mathbf{p}}(t)$$. For example, using the MMSE estimation function (2.6) with appropriate selection of the parameters $$\lambda_u(t)$$, we obtain that the estimate for $${\mathbf{u}}({t\! + \! 1})$$ is given by the MMSE estimate ${\mathbf{u}}({t\! + \! 1}) = {\mathbb{E}}({\mathbf{u}}_0|{\mathbf{p}}(t)), \quad {\mathbf{p}}(t)=\gamma_u(t){\mathbf{u}}_0 + {\mathbf{z}}_u(t).$ For the MAP estimation problem, the MAP selection function (2.4) computes the MAP estimate for $${\mathbf{u}}_0$$. Similarly, for $${\mathbf{q}}(t)$$, we see that \begin{align} {\mathbf{q}}(t) &= \frac{1}{m}{\mathbf{A}}^\top{\mathbf{u}}({t\! + \! 1})+\mu_v(t){\mathbf{v}}(t) \nonumber \\ &= \frac{{\mathbf{u}}_0^\top{\mathbf{u}}({t\! + \! 1})}{m}{\mathbf{v}}_0 + \frac{1}{\sqrt{m}}{\mathbf{W}}^\top{\mathbf{u}}({t\! + \! 1}) + \mu_v(t){\mathbf{v}}(t) = \gamma_v(t){\mathbf{v}}_0 + {\mathbf{z}}_v(t), \label{eq:qintuit} \end{align} (3.3) for $$\gamma_v(t)={\mathbf{u}}_0^\top{\mathbf{u}}({t\! + \! 1})/m$$ and $${\mathbf{z}}_v(t)= (1/\sqrt{m}){\mathbf{W}}^\top{\mathbf{u}}({t\! + \! 1}) + \mu_v(t){\mathbf{v}}(t)$$. Then, $${\mathbf{v}}({t\! + \! 1})$$ is taken as the MMSE or MAP estimate of $${\mathbf{v}}_0$$ from the vector $${\mathbf{q}}(t)$$. Thus, we see that the algorithm attempts to produce a sequence of unbiased estimates $${\mathbf{p}}(t)$$ and $${\mathbf{q}}(t)$$ for scaled versions of $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$. Then, it constructs the MAP or MMSE estimates $${\mathbf{u}}({t\! + \! 1})$$ and $${\mathbf{v}}({t\! + \! 1})$$ from these unbiased estimates. 3.2 SE analysis The above ‘derivation’ of the algorithm suggests a possible method to analyze the IterFac algorithm.
Consider a sequence of problems indexed by $$n$$, and suppose that the dimension $$m=m(n)$$ grows linearly with $$n$$ in that $$\label{eq:betaDef} \lim_{n {\rightarrow} \infty} n/m(n) = \beta,$$ (3.4) for some $$\beta > 0$$. For each $$n$$ and iteration number $$t$$, define the sets $$\label{eq:thetauvi} \theta_{u}(t) = \bigl\{ (u_{0i},u_i(t)), i=1,\ldots,m \bigr\}, \quad \theta_{v}(t) = \bigl\{ (v_{0j},v_j(t)), j =1,\ldots,n \bigr\},$$ (3.5) which are the sets of the components of the true vectors $$u_{0i}$$ and $$v_{0j}$$ and their corresponding estimates $$u_i(t)$$ and $$v_j(t)$$. To characterize the quality of the estimates, we would like to describe the distributions of the pairs in $$\theta_u(t)$$ and $$\theta_v(t)$$. Our formal analysis below will provide an exact characterization of these distributions. Specifically, we will show that the sets have empirical limits of the form $$\label{eq:thetauvLim} \lim_{n {\rightarrow} \infty} \theta_{u}(t) \stackrel{d}{=} (U_0,U(t)), \quad \lim_{n {\rightarrow} \infty} \theta_{v}(t) \stackrel{d}{=} (V_0,V(t)),$$ (3.6) for some random variable pairs $$(U_0,U(t))$$ and $$(V_0,V(t))$$. The precise definition of empirical limits is given in Appendix A.1, but loosely, it means that the empirical distribution of the components in $$\theta_u(t)$$ and $$\theta_v(t)$$ converge to certain random variable pairs. In addition, we can inductively derive what the distributions are for the limits in (3.6). To make matters simple, suppose that the parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ are not selected adaptively but instead given by fixed sequences $${\overline{\lambda}}_u(t)$$ and $${\overline{\lambda}}_v(t)$$. 
Also, given random variables in the limits of (3.6), define the deterministic constants $$\label{eq:alphaSE} {\overline{\alpha}}_{u0}(t) = {\mathbb{E}}\left[ U(t)^2 \right]\!, \quad {\overline{\alpha}}_{u1}(t) = {\mathbb{E}}\left[ U_0U(t) \right]\!, \quad {\overline{\alpha}}_{v0}(t) = {\mathbb{E}}\left[ V(t)^2 \right]\!, \quad {\overline{\alpha}}_{v1}(t) = {\mathbb{E}}\left[ V_0V(t) \right]\!.$$ (3.7) Now suppose that the second limit in (3.6) holds for some $$t$$. Then, $$\gamma_u(t)$$ in (3.2) would have the following limit: $\lim_{n {\rightarrow} \infty} \gamma_u(t) = \lim_{n{\rightarrow} \infty} \frac{n}{m} \frac{1}{n} \sum_{j=1}^n v_{0j}v_j(t) = \beta {\overline{\alpha}}_{v1}(t).$ Also, if we ignore the debiasing term with $$\mu_u(t)$$ and assume that $${\mathbf{W}}$$ is independent of $${\mathbf{v}}(t)$$, the variance of the components of $${\mathbf{z}}_u(t)$$ in (3.2) would be ${\bf var}(z_{ui}(t)) = \frac{1}{m}\sum_{j=1}^n {\bf var}(W_{ij}v_j(t)) = \beta \tau_w {\overline{\alpha}}_{v0}(t).$ Thus, we would expect a typical component of $$p_i(t)$$ in (3.1) to be distributed as $P(t) = \beta {\overline{\alpha}}_{v1}(t)U_0+Z_u(t), \quad Z_u(t) \sim {\mathcal N}(0,\beta\tau_w{\overline{\alpha}}_{v0}(t)).$ Due to the separability assumption (2.1), each component is given by $$u_i({t\! + \! 1}) = G_u(p_i(t),{\overline{\lambda}}_u(t))$$. So, we would expect the components to follow the distribution $$\label{eq:USE} U({t\! + \! 1}) = G_u(P(t),{\overline{\lambda}}_u(t)), \quad P(t) = \beta {\overline{\alpha}}_{v1}(t)U_0+Z_u(t), \quad Z_u(t) \sim {\mathcal N}(0,\beta\tau_w{\overline{\alpha}}_{v0}(t)).$$ (3.8) This provides an exact description of the expected joint density of $$(U_0,U({t\! + \! 1}))$$. From this density, we can then compute $${\overline{\alpha}}_{u0}({t\! + \! 1})$$ and $${\overline{\alpha}}_{u1}({t\! + \! 1})$$ in (3.7). 
A similar calculation, again under the ‘debiasing’ and Gaussianity assumptions, shows that the limiting empirical distribution of $$\theta_v({t\! + \! 1})$$ in (3.6) should follow $$\label{eq:VSE} V({t\! + \! 1}) = G_v( Q(t),{\overline{\lambda}}_v(t)), \quad Q(t) = {\overline{\alpha}}_{u1}({t\! + \! 1})V_0+Z_v(t), \quad Z_v(t) \sim {\mathcal N}(0,\tau_w{\overline{\alpha}}_{u0}({t\! + \! 1}) ).$$ (3.9) This provides the joint density $$(V_0,V({t\! + \! 1}))$$ from which we can compute $${\overline{\alpha}}_{v0}({t\! + \! 1})$$ and $${\overline{\alpha}}_{v1}({t\! + \! 1})$$ in (3.7). Thus, we have provided a simple method to recursively compute the joint densities of the limits in (3.6) and their second-order statistics (3.7). 4. Asymptotic analysis under Gaussian noise 4.1 Main results In the above intuitive analysis, we did not formally describe the sense of convergence nor offer any formal justification for the Gaussianity assumptions. In addition, we assumed that the parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ were fixed sequences. In reality, we will need them to be data adaptive. We will make the above arguments rigorous under the following assumptions. Note that although we will be interested in MAP or MMSE estimation functions, our analysis will apply to arbitrary factor selection functions $$G_u(\cdot)$$ and $$G_v(\cdot)$$. Assumption 4.1 Consider a sequence of random realizations of the estimation problem in Section 1 indexed by the dimension $$n$$. The matrix $${\mathbf{A}}$$ and the parameters in Algorithm 1 satisfy the following: (a) For each $$n$$, the output dimension $$m=m(n)$$ is deterministic and scales linearly with $$n$$ as (3.4). (b) The matrix $${\mathbf{A}}$$ has the form (1.1) where $${\mathbf{u}}_0 \in {\mathbb{R}}^m$$ and $${\mathbf{v}}_0 \in {\mathbb{R}}^n$$ represent ‘true’ left and right factors of a rank-one term, and $${\mathbf{W}} \in {\mathbb{R}}^{m {\times} n}$$ is an i.i.d.
Gaussian matrix with components $$W_{ij} \sim {\mathcal N}(0,\tau_w)$$ for some $$\tau_w > 0$$. (c) The factor selection functions $$G_u({\mathbf{p}},\lambda_u)$$ and $$G_v({\mathbf{q}},\lambda_v)$$ in lines 7 and 12 are componentwise separable in that for all component indices $$i$$ and $$j$$, $$\label{eq:GuvSep} G_u({\mathbf{p}},\lambda_u)_i = G_u(p_i,\lambda_u), \quad G_v({\mathbf{q}},\lambda_v)_j = G_v(q_j,\lambda_v),$$ (4.1) for some scalar functions $$G_u(p,\lambda_u)$$ and $$G_v(q,\lambda_v)$$. The scalar functions must be differentiable in $$p$$ and $$q$$. Moreover, for each $$t$$, the functions $$G_u(p,\lambda_u)$$ and $$\partial G_u(p,\lambda_u)/\partial p$$ must be Lipschitz continuous in $$p$$ with a Lipschitz constant that is continuous in $$\lambda_u$$ and continuous in $$\lambda_u$$ uniformly over $$p$$. Similarly, for each $$t$$, the functions $$G_v(q,\lambda_v)$$ and $$\partial G_v(q,\lambda_v)/\partial q$$ must be Lipschitz continuous in $$q$$ with a Lipschitz constant that is continuous in $$\lambda_v$$, and continuous in $$\lambda_v$$ uniformly over $$q$$. (d) The parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ are computed via $$\label{eq:lamRankOne} \lambda_u(t) = \frac{1}{n} \sum_{j=1}^n \phi_{\lambda u}(t,v_{0j},v_j(t)), \quad \lambda_v(t) = \frac{1}{m} \sum_{i=1}^m \phi_{\lambda v}(t,u_{0i},u_i({t\! + \! 1})),$$ (4.2) for (possibly vector-valued) functions $$\phi_{\lambda u}(\cdot)$$ and $$\phi_{\lambda v}(\cdot)$$ that are pseudo-Lipschitz continuous of order $$p=2$$. (e) For $$t=0$$, the sets (3.6) empirically converge with bounded moments of order $$2$$ to the limits $$\label{eq:thetauvLimInit} \lim_{n {\rightarrow} \infty} \theta_u(0) \stackrel{d}{=} (U_0,U(0)), \quad \lim_{n {\rightarrow} \infty} \theta_v(0) \stackrel{d}{=} (V_0,V(0)),$$ (4.3) for some random variable pairs $$(U_0,U(0))$$ and $$(V_0,V(0))$$. See Appendix A.1 for a precise definition of the empirical convergence used here. The assumptions need some explanation.
Assumptions 4.1(a) and (b) simply state that we are considering an asymptotic analysis for certain large matrices $${\mathbf{A}}$$ consisting of a random rank-one matrix plus Gaussian noise. The analysis of Algorithm 1 for higher ranks is still not known, but we provide some possible ideas later. Assumption 4.1(c) is a mild condition on the factor selection functions. In particular, the separability assumption holds for the MAP or MMSE functions (2.4) and (2.6) under separable priors. Assumption 4.1(d) allows for the parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ in the factor selection functions to be data dependent, provided that they can each be determined via empirical averages of some function of the most recent data. Assumption 4.1(e) is the initial induction hypothesis. Under these assumptions, we can recursively define the sequences of random variables $$(U_0,U(t))$$ and $$(V_0,V(t))$$ as described above. For the parameters $${\overline{\lambda}}_u(t)$$ and $${\overline{\lambda}}_v(t)$$, define them from the expectations $$\label{eq:lambdaSE} {\overline{\lambda}}_{u}(t) = {\mathbb{E}}\left[ \phi_{\lambda u}(t,V_0,V(t)) \right]\!, \quad {\overline{\lambda}}_{v}(t) = {\mathbb{E}}\left[ \phi_{\lambda v}(t,U_0,U({t\! + \! 1})) \right]\!.$$ (4.4) These are the limiting values we would expect given the adaptation rules (4.2). Theorem 4.2 Under Assumption 4.1, the sets $$\theta_u(t)$$ and $$\theta_v(t)$$ in (3.5) converge empirically with bounded moments of order $$p=2$$ with the limits in (3.6). Proof. See Appendix A.3. □ 4.2 Scalar equivalent model The main contribution of Theorem 4.2 is that it provides a simple scalar equivalent model for the asymptotic behavior of the algorithm. The sets $$\theta_u(t) = \{(u_{0i},u_i(t))\}$$ and $$\theta_v(t) = \{ (v_{0j},v_j(t)) \}$$ in (3.5) are the components of the true vectors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ and their estimates $${\mathbf{u}}(t)$$ and $${\mathbf{v}}(t)$$.
The theorem shows that empirical distributions of these components are asymptotically equivalent to simple random variable pairs $$(U_0,U(t))$$ and $$(V_0,V(t))$$ given by (3.8) and (3.9). In this scalar system, the variable $$U({t\! + \! 1})$$ is the output of the factor selection function $$G_u(\cdot)$$ applied to a scaled and Gaussian-noise-corrupted version of the true variable $$U_0$$. Similarly, $$V({t\! + \! 1})$$ is the output of the factor selection function $$G_v(\cdot)$$ applied to a scaled and Gaussian-noise-corrupted version of the true variable $$V_0$$. Following [20], we can thus call the result a single-letter characterization of the algorithm. From this single-letter characterization, one can exactly compute a large class of performance metrics of the algorithm. Specifically, the empirical convergence of $$\theta_u(t)$$ shows that for any pseudo-Lipschitz function $$\phi(u_0,u)$$ of order $$p$$, the following limit exists almost surely: $$\label{eq:phiuLim} \lim_{n {\rightarrow} \infty} \frac{1}{m} \sum_{i=1}^m \phi(u_{0i},u_i(t)) = {\mathbb{E}}\left[ \phi(U_0,U(t)) \right]\!,$$ (4.5) where the expectation on the right-hand side is over the variables $$(U_0,U(t))$$ with $$U_0$$ identical to the variable in the limit in (4.3) and $$U(t)$$ given by (3.8). This expectation can thus be explicitly evaluated by a simple two-dimensional integral and consequently any component-separable performance metric based on a suitably continuous loss function $$\phi(u_0,u)$$ can be exactly computed. For example, if we take $$\phi(u_0,u) = (u-u_0)^2$$, we can compute the asymptotic mean-squared error of the estimate, \begin{eqnarray} \lim_{n {\rightarrow} \infty} \frac{1}{m} \|{\mathbf{u}}(t)-{\mathbf{u}}_0\|^2 = \lim_{n {\rightarrow} \infty} \frac{1}{m} \sum_{i=1}^m (u_i(t)-u_{0i})^2 \nonumber = {\mathbb{E}}\left[ (U_0-U(t))^2 \right]\!. 
\label{eq:mseuLim} \end{eqnarray} Also, for each $$t$$, define the empirical second-order statistics $$\label{eq:alphauv} \alpha_{u0}(t) = \frac{1}{m}\|{\mathbf{u}}(t)\|^2, ~ \alpha_{u1}(t) = \frac{1}{m}{\mathbf{u}}(t)^\top{\mathbf{u}}_0, ~ \alpha_{v0}(t) = \frac{1}{n}\|{\mathbf{v}}(t)\|^2, ~ \alpha_{v1}(t) = \frac{1}{n}{\mathbf{v}}(t)^\top{\mathbf{v}}_0.$$ (4.6) Since $$\|{\mathbf{u}}(t)\|^2 = \sum_i u_i(t)^2$$, it follows that $$\alpha_{u0}(t) {\rightarrow} {\mathbb{E}}(U(t)^2)$$ almost surely as $$n {\rightarrow} \infty$$. In this way, we obtain that the following limits hold almost surely: $$\label{eq:alphauvLim} \lim_{n {\rightarrow} \infty} \alpha_{u0}(t) = {\overline{\alpha}}_{u0}(t), ~ \lim_{n {\rightarrow} \infty} \alpha_{u1}(t) = {\overline{\alpha}}_{u1}(t), ~ \lim_{n {\rightarrow} \infty} \alpha_{v0}(t) = {\overline{\alpha}}_{v0}(t), ~ \lim_{n {\rightarrow} \infty} \alpha_{v1}(t) = {\overline{\alpha}}_{v1}(t).$$ (4.7) We will also use definitions $$\label{eq:tauuv} \tau_u := {\mathbb{E}}[U_0^2], \quad \tau_v := {\mathbb{E}}[V_0^2].$$ (4.8) From the second-order statistics, we can compute the asymptotic correlation coefficient between $${\mathbf{u}}_0$$ and its estimate $${\mathbf{u}}$$ given by \begin{eqnarray} \rho_u(t) &:=& \lim_{n {\rightarrow} \infty} \frac{|{\mathbf{u}}(t)^\top{\mathbf{u}}_0|^2} {\|{\mathbf{u}}(t)\|^2\|{\mathbf{u}}_0\|^2} = \lim_{n {\rightarrow} \infty} \frac{|({\mathbf{u}}(t)^\top{\mathbf{u}}_0)/m|^2} {(\|{\mathbf{u}}(t)\|^2/m)(\|{\mathbf{u}}_0\|^2/m)} \nonumber \\ &=& \frac{ \bigl[{\mathbb{E}}(U(t)U_0)\bigr]^2 }{ {\mathbb{E}} U(t)^2 {\mathbb{E}} U_0^2} = \frac{{\overline{\alpha}}_{u1}^2(t)}{{\overline{\alpha}}_{u0}(t)\tau_u}. 
\hspace{0.5in} \label{eq:corruLim} \end{eqnarray} (4.9) Similarly, the asymptotic correlation coefficient between $${\mathbf{v}}_0$$ and $${\mathbf{v}}$$ has a simple expression $$\label{eq:corrvLim} \rho_v(t) := \lim_{n {\rightarrow} \infty} \frac{|{\mathbf{v}}(t)^\top{\mathbf{v}}_0|^2} {\|{\mathbf{v}}(t)\|^2\|{\mathbf{v}}_0\|^2} = \frac{{\overline{\alpha}}_{v1}^2(t)}{{\overline{\alpha}}_{v0}(t)\tau_v}.$$ (4.10) The correlation coefficient is useful, since we know that, without additional constraints, the terms $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$ can be estimated only up to a scalar. The correlation coefficient is scale invariant. 5. Examples The SE analysis can be used to exactly predict the asymptotic behavior of the IterFac algorithm for any smooth scalar estimation functions, including the MAP or MMSE functions (2.4) and (2.6). There are, however, two cases, where the SE equations have particularly simple and interesting solutions: linear estimation functions and MMSE functions. 5.1 Linear selection functions Suppose we use linear selection functions of the form $$\label{eq:Guvlin} G_u({\mathbf{p}},\lambda_u) =\lambda_u{\mathbf{p}},\quad G_v({\mathbf{q}},\lambda_v)= \lambda_v{\mathbf{q}},$$ (5.1) where the parameters $$\lambda_u$$ and $$\lambda_v$$ allow for normalization or other scalings of the outputs. Linear selection functions of the form (5.1) arise when one selects $$G_u(\cdot)$$ and $$G_v(\cdot)$$ from the MAP or MMSE functions (2.4) or (2.6) with Gaussian priors. With Gaussian priors, the correct solution to the MAP estimate (1.4) is for $$({\widehat{\mathbf{u}}},{\widehat{\mathbf{v}}})$$ to be the (appropriately scaled) left and right maximal singular vectors of $${\mathbf{A}}$$. We will thus call the estimates $$({\mathbf{u}}(t),{\mathbf{v}}(t))$$ of Algorithm 1 and linear selection functions (5.1) the estimated maximal singular vectors. 
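The claim that expectations such as (4.5) reduce to a simple two-dimensional integral can be made concrete. The sketch below assumes $$U_0 \sim {\mathcal N}(0,1)$$ and a linear selection function $$U(t) = \lambda(\gamma U_0 + Z_u)$$ (both hypothetical choices, made so that a closed form is available for comparison) and evaluates the mean-squared error $${\mathbb{E}}[(U_0-U(t))^2]$$ by two-dimensional Gauss-Hermite quadrature over $$(U_0, Z_u)$$:

```python
import numpy as np

lam, gamma, nu_u, tau_u = 0.6, 0.8, 0.5, 1.0   # illustrative parameters

# Gauss-Hermite nodes/weights, rescaled for expectations over N(0, 1).
x, w = np.polynomial.hermite.hermgauss(40)
z, w = np.sqrt(2.0) * x, w / np.sqrt(np.pi)

# Two-dimensional quadrature over (U0, Zu) with Zu ~ N(0, nu_u).
U0 = z[:, None]
Zu = np.sqrt(nu_u) * z[None, :]
U = lam * (gamma * U0 + Zu)                    # scalar equivalent model (3.8)
mse = float(np.sum(np.outer(w, w) * (U0 - U) ** 2))
```

For this Gaussian, linear case the integral has the closed form $$(1-\lambda\gamma)^2\tau_u + \lambda^2\nu_u$$, which the quadrature reproduces; for non-Gaussian priors the same two-dimensional integral is simply evaluated numerically.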
Theorem 5.1 Consider the state evolution equation (3.7), (3.8) and (3.9) with the linear selection functions (5.1). Then we have the following properties: (a) The asymptotic correlation coefficients (4.9) and (4.10) satisfy the following recursive rules: $$\label{eq:corrLinSE} \rho_u({t\! + \! 1}) = \frac{\beta \tau_u\tau_v \rho_v(t)}{ \beta \tau_u\tau_v \rho_v(t) + \tau_w}, \quad \rho_v(t) = \frac{\tau_u\tau_v \rho_u(t)}{\tau_u\tau_v \rho_u(t) + \tau_w}.$$ (5.2) (b) For any positive initial condition $$\rho_v(0) > 0$$, the asymptotic correlation coefficients converge to the limits $$\label{eq:corrLinLimit} \lim_{t {\rightarrow} \infty} \rho_u(t) = \rho_u^* := \frac{\left[ \beta\tau_u^2\tau_v^2 - \tau_w^2 \right]_+}{ \tau_u\tau_v(\beta\tau_u\tau_v + \tau_w)}, \quad \lim_{t {\rightarrow} \infty} \rho_v(t) = \rho_v^* := \frac{\left[\beta\tau_u^2\tau_v^2 - \tau_w^2\right]_+}{ \beta \tau_u\tau_v(\tau_u\tau_v + \tau_w)},$$ (5.3) where $$[x]_+= \max\{0,x\}$$. Proof. See Appendix A.4. □ The theorem provides a set of recursive equations for the asymptotic correlation coefficients $$\rho_u(t)$$ and $$\rho_v(t)$$ along with simple expressions for the limiting values as $$t {\rightarrow} \infty$$. We thus obtain exactly how correlated the estimated maximal singular vectors of a matrix $${\mathbf{A}}$$ of the form (1.1) are to the rank-one factors $$({\mathbf{u}}_0,{\mathbf{v}}_0)$$. The proof of the theorem also provides expressions for the second-order statistics in (3.7) to be used in the scalar equivalent model. The fixed point expressions (5.3) agree with the more general results in [33] that derive the correlations for ranks greater than one and low-rank recovery with missing entries. Similar results can also be found in [2,7,28]. An interesting consequence of the expressions in (5.3) is that unless $$\label{eq:snrMinLin} \sqrt{\beta}\tau_u\tau_v > \tau_w,$$ (5.4) the asymptotic correlation coefficients are exactly zero. 
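The recursion (5.2) and its limit (5.3) are easy to check numerically. The following sketch (ours, in NumPy-style Python, with illustrative parameter values) iterates the recursion and compares it against the closed-form fixed point; above the threshold (5.4) the two agree, while below it both collapse to zero:

```python
import numpy as np

def linear_se(beta, tau_u, tau_v, tau_w, rho_v=0.1, n_iter=200):
    """Iterate the linear-selection state evolution (5.2)."""
    for _ in range(n_iter):
        rho_u = beta * tau_u * tau_v * rho_v / (beta * tau_u * tau_v * rho_v + tau_w)
        rho_v = tau_u * tau_v * rho_u / (tau_u * tau_v * rho_u + tau_w)
    return rho_u, rho_v

def linear_se_limit(beta, tau_u, tau_v, tau_w):
    """Closed-form limits (5.3); both are zero unless sqrt(beta)*tau_u*tau_v > tau_w."""
    gap = max(beta * tau_u**2 * tau_v**2 - tau_w**2, 0.0)
    rho_u = gap / (tau_u * tau_v * (beta * tau_u * tau_v + tau_w))
    rho_v = gap / (beta * tau_u * tau_v * (tau_u * tau_v + tau_w))
    return rho_u, rho_v

# Above the threshold (5.4): the iteration reaches the fixed point (5.3).
assert np.allclose(linear_se(0.5, 1.0, 1.0, 0.3), linear_se_limit(0.5, 1.0, 1.0, 0.3))
# Below the threshold: the correlation collapses to zero.
assert np.allclose(linear_se(0.5, 1.0, 1.0, 1.5), (0.0, 0.0))
```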
The ratio $$\tau_u\tau_v/\tau_w$$ can be interpreted as a scaled signal-to-noise ratio (SNR). 5.2 MMSE estimation Next, suppose we use the MMSE selection functions (2.6). Using the scalar equivalent models (3.8) and (3.9), we take the scaling factor and noise parameters as \begin{align} \label{eq:mmseParam} \begin{split} &\gamma_u(t) = \beta {\overline{\alpha}}_{v1}(t), \quad \nu_u(t)=\beta \tau_w {\overline{\alpha}}_{v0}(t), \\ &\gamma_v(t) = {\overline{\alpha}}_{u1}({t\! + \! 1}), \quad \nu_v(t)= \tau_w {\overline{\alpha}}_{u0}({t\! + \! 1}). \end{split} \end{align} (5.5) Observe that these parameters can be computed from the SE equations; hence they can be determined offline and are not data dependent. We can use the initial condition $$v_j(0) = {\mathbb{E}}[V_0]$$ for all $$j$$, so that the initial variable in (4.3) is $$V(0) = {\mathbb{E}}[V_0]$$. To analyze the algorithms, define $$\label{eq:mmseuv} {\mathcal E}_u(\eta_u) = {\bf var}(U_0 {\,|\,} Y=\sqrt{\eta_u}U_0 + D), \quad {\mathcal E}_v(\eta_v) = {\bf var}(V_0 {\,|\,} Y=\sqrt{\eta_v}V_0 + D),$$ (5.6) where $$D \sim \mathcal{N}(0,1)$$ is independent of $$U_0$$ and $$V_0$$. That is, $${\mathcal E}_u(\eta_u)$$ and $${\mathcal E}_v(\eta_v)$$ are the mean-squared errors of estimating $$U_0$$ and $$V_0$$ from observations $$Y$$ with SNRs of $$\eta_u$$ and $$\eta_v$$, respectively. The functions $${\mathcal E}_u(\cdot)$$ and $${\mathcal E}_v(\cdot)$$ arise in a range of estimation problems, and the analytic and functional properties of these functions can be found in [24,60]. Theorem 5.2 Consider the solutions to the SE equations (3.7), (3.8) and (3.9) under the MMSE selection functions (2.6) with parameters (5.5) and initial condition $$V(0) = {\mathbb{E}}[V_0]$$. Then we have the following properties: (a) For all $$t$$, the asymptotic correlation coefficients (4.9) and (4.10) satisfy the recursive relationships $$\label{eq:corrSEmmse} \rho_u({t\! + \! 1}) = 1 - \frac{1}{\tau_u}{\mathcal E}_u( \beta\tau_v\rho_v(t)/\tau_w), \quad \rho_v(t) = 1 - \frac{1}{\tau_v}{\mathcal E}_v( \tau_u\rho_u(t)/\tau_w),$$ (5.7) with the initial condition $$\rho_v(0)=({\mathbb{E}} V_0)^2/\tau_v$$. (b) If, in addition, $${\mathcal E}_u(\eta_u)$$ and $${\mathcal E}_v(\eta_v)$$ are continuous, then for any positive initial condition $$\rho_v(0)>0$$, as $$t {\rightarrow} \infty$$, the asymptotic correlation coefficients $$(\rho_u(t),\rho_v(t))$$ increase monotonically to fixed points $$(\rho_u^*, \rho_v^*)$$ of (5.7) with $$\rho_v^* > 0$$. Proof. See Appendix A.5. □ Again, we see that we can obtain simple, explicit recursions for the asymptotic correlations. Moreover, the asymptotic correlations provably converge to fixed points of the SE equations. The proof of the theorem also provides expressions for the second-order statistics in (3.7) to be used in the scalar equivalent model. 5.3 Zero initial conditions The limiting condition in part (b) of Theorem 5.2 requires that $$\rho_v(0) > 0$$, which occurs when $${\mathbb{E}}[V_0] \neq 0$$. Suppose, on the other hand, that $${\mathbb{E}}[U_0] = {\mathbb{E}}[V_0] = 0$$. Then, the initial condition will be $$V(0)= {\mathbb{E}}[V_0] = 0$$. Under this initial condition, a simple set of calculations shows that the SE equations (5.7) will generate a sequence with $$\rho_v(t) = \rho_u(t)=0$$ for all $$t$$. Thus, the IterFac algorithm will produce no useful estimates. Of course, with zero-mean random variables, a more sensible initial condition is to take $${\mathbf{v}}(0)$$ to be some non-zero random vector, as is commonly done in power-method recursions for computing maximal singular vectors. To understand the behavior of the algorithm under this random initial condition, let $$\label{eq:rhovn} \rho_v(t,n) := \frac{|{\mathbf{v}}(t)^\top{\mathbf{v}}_0|^2}{\|{\mathbf{v}}_0\|^2 \|{\mathbf{v}}(t)\|^2},$$ (5.8) where we have explicitly denoted the dependence on the problem dimension $$n$$.
From (4.10), we have that $$\lim_{n {\rightarrow}\infty} \rho_v(t,n) = \rho_v(t)$$ for all $$t$$. Also, with a random initial condition $${\mathbf{v}}(0)$$ independent of $${\mathbf{v}}_0$$, it can be checked that $$\rho_v(0,n) = {\mathcal O}(1/n)$$ so that $\rho_v(0) = \lim_{n {\rightarrow}\infty} \rho_v(0,n) = 0.$ Hence, from the SE equations (5.7), $$\rho_v(t) = \rho_u(t)=0$$ for all $$t$$. That is, $$\label{eq:corrLimZero} \lim_{t {\rightarrow} \infty} \lim_{n {\rightarrow}\infty} \rho_v(t,n) = 0.$$ (5.9) This limit suggests that, even with random initial conditions, the IterFac algorithm will not produce a useful estimate. However, it is still possible that the limit in the opposite order to (5.9) may be non-zero: $$\label{eq:rhontLim} \lim_{n {\rightarrow} \infty} \lim_{t {\rightarrow}\infty} \rho_v(t,n) > 0.$$ (5.10) That is, for each $$n$$, it may be possible to obtain a non-zero correlation, but the number of iterations for convergence increases with $$n$$, since the algorithm starts from a vanishingly small initial correlation. Unfortunately, our SE analysis cannot make predictions on limits in the order of (5.10). We can, however, analyze the following limit: Lemma 5.1 Consider the MMSE SE equations (5.7) with random variables $$U_0$$ and $$V_0$$ such that $${\mathbb{E}}[V_0] = {\mathbb{E}}[U_0] = 0$$. For each $$\epsilon > 0$$, let $$\rho_v^\epsilon(t)$$ be the solution to the SE equations with an initial condition $$\rho_v(0)= \epsilon$$. Then, we have the following results: (a) If $$\sqrt{\beta} \tau_u \tau_v > \tau_w$$, $$\label{eq:corrvLimEpsPos} \lim_{\epsilon {\rightarrow} 0} \lim_{t {\rightarrow} \infty} \rho^\epsilon_v(t) > 0.$$ (5.11) (b) Conversely, if $$\sqrt{\beta} \tau_u \tau_v < \tau_w$$, $$\label{eq:corrvLimEpsNeg} \lim_{\epsilon {\rightarrow} 0} \lim_{t {\rightarrow} \infty} \rho^\epsilon_v(t) = 0.$$ (5.12) Proof. See Appendix A.6. □ The result of the lemma is somewhat disappointing.
The lemma shows that $$\sqrt{\beta} \tau_u \tau_v > \tau_w$$ is essentially necessary and sufficient for the IterFac algorithm with MMSE estimates to be able to overcome arbitrarily small initial conditions and obtain an estimate with a non-zero correlation to the true vector. Unfortunately, this is identical to condition (5.4) for the linear estimator to obtain a non-zero correlation. Thus, the IterFac algorithm with MMSE estimates performs no better than simple linear estimation in the initial iterations when the priors have zero means. Since linear estimation is equivalent to finding maximal singular vectors without any particular constraints, we could interpret Lemma 5.1 as saying that the IterFac algorithm under MMSE estimation cannot exploit structure in the components in the initial iterations. As a result, at low SNRs it may be necessary to use other algorithms to produce an initial condition for IterFac; such procedures require further study. 6. Numerical simulation 6.1 Bernoulli–exponential To validate the SE analysis, we first consider a simple case where the left factor $${\mathbf{u}}_0 \in {\mathbb{R}}^m$$ is i.i.d. Gaussian, zero-mean and $${\mathbf{v}}_0 \in {\mathbb{R}}^n$$ has Bernoulli–exponential components:1 $$\label{eq:vBernExp} v_{0j} \sim \left\{ \begin{array}{ll} 0 & \mbox{with prob } 1-\lambda, \\ \mbox{Exp}(1) & \mbox{with prob } \lambda, \end{array} \right.$$ (6.1) which provides a simple model for a sparse, positive vector. The parameter $$\lambda$$ is the fraction of non-zero components and is set in this simulation to $$\lambda=0.1$$. Note that these components have a non-zero mean so the difficulties of Section 5.3 are avoided. 
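A quick sanity check on this prior is to sample (6.1) and verify its first two moments. The sketch below is ours (NumPy, illustrative only); it confirms that $${\mathbb{E}}[V_0] = \lambda > 0$$, so $$\rho_v(0) > 0$$ and the zero-mean difficulty of Section 5.3 does not arise:

```python
import numpy as np

def sample_bern_exp(n, lam, rng):
    """Draw v0 from (6.1): each component is 0 w.p. 1-lam, Exp(1) w.p. lam."""
    active = rng.random(n) < lam
    return np.where(active, rng.exponential(1.0, size=n), 0.0)

rng = np.random.default_rng(0)
v0 = sample_bern_exp(500_000, 0.1, rng)
# For Exp(1): E[X] = 1 and E[X^2] = 2, so E[V0] = lam = 0.1 and
# tau_v = E[V0^2] = 2 * lam = 0.2; the fraction of non-zeros is lam.
print(v0.mean(), (v0**2).mean(), (v0 > 0).mean())
```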
The dimensions are $$(m,n)=(1000,500)$$, and the noise level $$\tau_w$$ is set according to the scaled SNR defined as $$\label{eq:snrDef} {{\small \sf SNR}} = 10\log_{10}(\tau_u\tau_v/\tau_w).$$ (6.2) Estimating the vector $${\mathbf{v}}_0$$ in this setup is related to finding sparse principal vectors of the matrix $${\mathbf{A}}^\top{\mathbf{A}}$$, for which there are a large number of excellent methods, including [5,9,30,53,61,62]. These algorithms include methods based on thresholding, $$\ell_1$$-regularization and semidefinite programming. A comparison of IterFac against these methods would be an interesting avenue of future research. Here, we simply wish to verify the SE predictions of the IterFac method. The results of the simulation are shown in Fig. 1, which shows the simulated and SE-predicted performance of the IterFac algorithm with both the linear and MMSE selection functions for the priors on $${\mathbf{u}}$$ and $${\mathbf{v}}$$. The algorithm is run for $$t=10$$ iterations, and the plot shows the median of the final correlation coefficient $$\rho_v(t)$$ over 50 Monte Carlo trials at each value of SNR. It can be seen that the performance of the IterFac algorithm for both the linear and MMSE estimates is in excellent agreement with the SE predictions. The correlation coefficient of the linear estimator also matches the correlation of the estimates produced from the maximal singular vectors of $${\mathbf{A}}$$. This is not surprising since, with linear selection functions, the IterFac algorithm is essentially an iterative method to find the maximal singular vectors. The figure also shows the benefit of exploiting the prior on $${\mathbf{v}}_0$$, which is evident from the superior performance of the MMSE estimate over the linear reconstruction. Fig. 1.
Simulation of the IterFac algorithm for both the linear selection functions in Section 5.1 (labeled IterFac-lin) and MMSE selection functions in Section 5.2 (labeled IterFac-MMSE). Plotted are the correlation values after 10 iterations. The simulated values are compared against the SE predictions. Also shown is the simple estimate from the maximal singular vectors of $${\mathbf{A}}$$. Figure 2 shows the correlation coefficient as a function of the iteration number for the MMSE and linear methods for two values of the SNR. Again, we see that the SE predictions are closely matched to the median simulated values. In addition, we see that we get good convergence within 4 to 8 iterations. Based on the SE predictions, this number will not scale with dimension and hence the simulation suggests that only a small number of iterations will be necessary even for very large problems. All code for the simulation can be found in the public GAMP sourceforge project [54]. Fig. 2. Per iteration performance of the IterFac algorithm for both the linear selection functions in Section 5.1 (IF-lin) and MMSE selection functions in Section 5.2 (IF-MMSE). The simulated values are compared with the SE predictions.
6.2 Bernoulli–Gaussian The above experiment considers a positive random vector $$V_0$$, so that the initial estimate $${\mathbb{E}}(V_0) > 0$$ and, hence, has a non-zero correlation with $$V_0$$. As discussed in Section 5.3, predicting the performance of IterFac with zero initial conditions (i.e., $${\mathbb{E}}(V_0)={\mathbb{E}}(U_0)=0$$) and a small random initial condition requires that we exchange the limits in iteration and problem size. To validate these predictions, we took a Bernoulli–Gaussian (BG) prior where the components of $${\mathbf{v}}_0$$ are distributed as $$\label{eq:vBernGauss} v_{0j} \sim \left\{ \begin{array}{ll} 0 & \mbox{with prob } 1-\lambda, \\ {\mathcal N}(0,1) & \mbox{with prob } \lambda, \end{array} \right.$$ (6.3) where, as before, $$\lambda$$ represents the sparsity rate. Importantly, $${\mathbb{E}}(V_0)=0$$. All other parameters are identical to the previous section. We used the MMSE factor selection functions (2.6), but with normalization and fixed parameters $$\lambda_v(t)=(\gamma_v(t),\nu_v(t))$$. It is necessary to fix the parameters because, in the case of zero initial conditions, the estimator does not know the true scaling factors $$\gamma_v(t)$$ and noise level $$\nu_v(t)$$: they depend on the random initial condition and cannot be accurately predicted in the early iterations. To circumvent the problem, we used a procedure where, in each iteration $$t$$, $${\mathbf{p}}(t)$$ was first normalized so that its average second moment matched $$\tau_v$$. Then, we ran the MMSE estimator with parameters $$\gamma_v(t)=1$$ and $$\nu_v = \alpha_v \tau_v$$ for a scaling factor $$\alpha_v=0.5$$. The idea is that, in every iteration, the MMSE estimator assumes that the SNR is some particular value as opposed to using the predicted value from SE. This is not optimal, but interestingly, the performance of this suboptimal estimator can still be predicted.
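For the BG prior (6.3), the scalar posterior-mean denoiser has a well-known spike-and-slab closed form, which is consistent with what a fixed-parameter implementation of (2.6) with $$\gamma_v=1$$ would evaluate. The sketch below is ours (the function name and variable names are illustrative): it shrinks the observation by the posterior probability that the component is active, and for $$\lambda=1$$ it reduces to the linear estimator $$y/(1+\nu)$$:

```python
import numpy as np

def bg_mmse(y, lam, nu):
    """E[V0 | Y = y] for the spike-and-slab prior
    V0 ~ (1-lam) delta_0 + lam N(0,1) and Y = V0 + W, W ~ N(0, nu)."""
    # Likelihoods of y under the "active" (V0 ~ N(0,1)) and "zero" hypotheses;
    # the common 1/sqrt(2*pi) factor cancels in the posterior ratio.
    p_on = lam * np.exp(-y**2 / (2.0 * (1.0 + nu))) / np.sqrt(1.0 + nu)
    p_off = (1.0 - lam) * np.exp(-y**2 / (2.0 * nu)) / np.sqrt(nu)
    pi = p_on / (p_on + p_off)        # posterior P(V0 != 0 | y)
    return pi * y / (1.0 + nu)        # conditional mean given "active", weighted

# With lam = 1 the prior is Gaussian and the denoiser is linear in y.
y = np.linspace(-3, 3, 7)
assert np.allclose(bg_mmse(y, 1.0, 0.5), y / 1.5)
```

For $$\lambda < 1$$ the factor $$\pi$$ suppresses observations that are likely pure noise, which is the mechanism by which the MMSE estimator exploits sparsity.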
Figure 3 compares the simulated performance of IterFac with the theoretical prediction from SE for a BG prior with $$\lambda = 0.1$$ and Gaussian prior with $$\lambda=1$$. Note that for $$\lambda=1$$, the MMSE factor selection function is linear. We see an excellent match between simulations and SE predictions. Also, we see that the two curves share the same SNR point at which a non-zero correlation is first obtained. This latter point confirms that the phase-transition point for linear estimation is identical to that of MMSE estimation. Hence, the particular distribution on $$V_0$$ does not matter; only the SNR does. Fig. 3. Simulated performance of IterFac under Bernoulli–Gaussian (BG) and Gaussian priors. The performance is compared with the theoretical predicted performance from the limit of the SE. 7. Limitations and extensions There are several potential lines for future work, some of which have already been explored in works appearing since the original publication of this method in [47]. 7.1 Extensions to higher rank The algorithm presented in this article considers only rank-one matrices. The works [31,34,43,52] have proposed AMP-type algorithms for more general classes of matrix factorization problems and provided analyses of these methods based on replica methods and other ideas from statistical physics. 7.2 Unknown priors The MMSE estimator in Section 5.2 requires exact knowledge of the priors on $$U_0$$ and $$V_0$$ as well as the Gaussian noise level $$\tau_w$$. In many problems in statistics, these are not known. There are two possible solutions that may be investigated in the future.
One method is to parameterize the distributions of $$U_0$$ and $$V_0$$ and estimate these parameters in the MMSE selection functions (2.6), potentially through an expectation maximization (EM)-type procedure as in [10]. This EM-type approach with hidden hyperparameters has recently been used successfully in a related approximate message-passing method in [57]. The analysis of such parameter learning could possibly be accounted for through the adaptation parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$. A second approach is to assume that the distributions of $$U_0$$ and $$V_0$$ belong to a known family of distributions and then find a min-max solution. Such a min-max technique was proposed for AMP recovery of sparse vectors in [15]. See also [14]. 7.3 Optimality Although this article characterizes the performance of the IterFac algorithm, it remains open how far that performance is from optimal estimation, such as the joint MMSE estimates of $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$. AMP methods have been shown to be provably optimal in a wide range of scenarios [3,4,22,23,27,45,46]. More recently, [11] proved that, for sparse binary priors, the IterFac algorithm is Bayes optimal for certain sparsity levels. A potential line of future work is to see whether these results can be extended to more general settings. 8. Conclusions We have presented a computationally efficient method for estimating rank-one matrices in noise. The estimation problem is reduced to a sequence of scalar AWGN estimation problems, which can be performed easily for a large class of priors or regularization functions on the coefficients. In the case of Gaussian noise, the asymptotic performance of the algorithm is exactly characterized by a set of scalar state evolution equations that appear to match the performance well at moderate dimensions. Thus, the methodology is computationally simple, general and admits a precise analysis in certain asymptotic, random settings.
Future work includes extensions to higher-rank matrices and handling of the cases where the priors are not known. Acknowledgements The authors thank Vivek Goyal, Ulugbek Kamilov, Andrea Montanari and Phil Schniter for detailed comments on an earlier draft and the industrial affiliate members of NYU WIRELESS (http://wireless.engineering.nyu.edu). Funding National Science Foundation (in part) [1254204 and 1738286 to A.K.F.]; Office of Naval Research [N00014-15-1-2677 to A.K.F.]; National Science Foundation (in part) [1302336, 1564142 and 1547332 to S.R.]. Footnotes 1 All code for these experiments can be found in the gampmatlab repository https://sourceforge.net/projects/gampmatlab. References 1. Aharon M. , Elad M. & Bruckstein A. ( 2006 ) K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. , 54 , 4311 – 4322 . 2. Baik J. , Arous G. B. & Péché S. ( 2005 ) Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. , 33 , 1643 – 1697 . 3. Bayati M. & Montanari A. ( 2011 ) The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory , 57 , 764 – 785 . 4. Boutros J. & Caire G. ( 2002 ) Iterative multiuser joint decoding: unified framework and asymptotic analysis. IEEE Trans. Inform. Theory , 48 , 1772 – 1793 . 5. Cadima J. & Jolliffe I. T. ( 1995 ) Loadings and correlations in the interpretation of principal components. J. Appl. Stat. , 22 , 208 – 214 . 6. Caltagirone F. , Zdeborová L. & Krzakala F. ( 2014 ) On convergence of approximate message passing. IEEE International Symposium on Information Theory (ISIT), Honolulu, Hawaii, June 2014. IEEE , pp. 1812 – 1816 . 7. Capitaine M. , Donati-Martin C. & Féral D.
( 2009 ) The largest eigenvalues of finite rank deformation of large Wigner matrices: convergence and nonuniversality of the fluctuations. Ann. Probab. , 37 , 1 – 47 . 8. Chen S. Y. , Tong H. , Wang Z. , Liu S. , Li M. & Zhang B. ( 2011 ) Improved generalized belief propagation for vision processing. Math. Probl. Eng. , 2011 , 416963 . 9. D’Aspremont A. , El Ghaoui L. , Jordan M. I. & Lanckriet G. R. G. ( 2007 ) A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. , 49 , 434 – 448 . 10. Dempster A. , Laird N. M. & Rubin D. B. ( 1977 ) Maximum-likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. , 39 , 1 – 17 . 11. Deshpande Y. & Montanari A. ( 2014 ) Information-theoretically optimal sparse PCA. IEEE International Symposium on Information Theory (ISIT), Honolulu, Hawaii, June 2014. IEEE , pp. 2197 – 2201 . 12. Deshpande Y. & Montanari A. ( 2015 ) Finding hidden cliques of size $$\sqrt{N/e}$$ in nearly linear time. Found. Comput. Math. , 15 , 1069 – 1128 . 13. Ding Z. ( 1997 ) Matrix outer-product decomposition method for blind multiple channel identification. IEEE Trans. Signal Process. , 45 , 3053 – 3061 . 14. Donoho D. L. & Johnstone I. M. ( 1994 ) Minimax risk over $$l_p$$-balls for $$l_q$$-error. Probab. Theory Relat. Fields , 99 , 277 – 303 . 15. Donoho D. L. , Maleki A. & Montanari A. ( 2009 ) Message-passing algorithms for compressed sensing. Proc. Nat. Acad. Sci. , 106 , 18914 – 18919 . 16. Donoho D. L. , Maleki A. & Montanari A. ( 2010 ) Message passing algorithms for compressed sensing I: motivation and construction. Proc. IEEE Information Theory Workshop, Cairo, Jan 2010. IEEE , pp. 1 – 5 . 17. Donoho D. L. , Maleki A. & Montanari A.
( 2010 ) Message passing algorithms for compressed sensing II: analysis and validation. Proc. IEEE Information Theory Workshop, Cairo, Jan 2010. IEEE , pp. 1 – 5 . 18. Fletcher A. K. & Rangan S. ( 2014 ) Scalable inference for neuronal connectivity from calcium imaging. Advances in Neural Information Processing Systems ( Ghahramani Z. , Welling M. , Cortes C. , Lawrence N. D. & Weinberger K. Q. eds). Montreal, Canada : MIT Press , pp. 2843 – 2851 . 19. Fletcher A. K. , Rangan S. , Varshney L. R. & Bhargava A. ( 2011 ) Neural reconstruction with approximate message passing (NeuRAMP). Advances in Neural Information Processing Systems ( Shawe-Taylor J. , Zemel R. S. , Bartlett P. L. , Pereira F. & Weinberger K. Q. eds). MIT Press , pp. 2555 – 2563 . 20. Guo D. , Baron D. & Shamai S. ( 2009 ) A single-letter characterization of optimal noisy compressed sensing. Proc. Allerton Conference on Communication, Control, and Computing, Monticello, IL, Sep 2009. IEEE , pp. 52 – 59 . 21. Guo D. & Verdú S. ( 2005 ) Randomly spread CDMA: asymptotics via statistical physics. IEEE Trans. Inform. Theory , 51 , 1983 – 2010 . 22. Guo D. & Wang C-C. ( 2006 ) Asymptotic mean-square optimality of belief propagation for sparse linear systems. IEEE Information Theory Workshop (ITW ’06), Chengdu, Nov. 2006. IEEE , pp. 194 – 198 . 23. Guo D. & Wang C-C. ( 2007 ) Random sparse linear systems observed via arbitrary channels: a decoupling principle. IEEE International Symposium on Information Theory (ISIT), Nice, France, June 2007. IEEE , pp. 946 – 950 . 24. Guo D. , Wu Y. , Shamai S. & Verdú S. ( 2011 ) Estimation in Gaussian noise: properties of the minimum mean-square error. IEEE Trans. Inform. Theory , 57 , 2371 – 2385 . 25. Horn R. A. & Johnson C. R. ( 1985 ) Matrix Analysis. Cambridge, England : Cambridge University Press . Reprinted with corrections 1987 . 26. Jafar S. A. & Goldsmith A.
( 2004 ) Transmitter optimization and optimality of beamforming for multiple antenna systems. IEEE Trans. Wireless Commun. , 3 , 1165 – 1175 . 27. Javanmard A. & Montanari A. ( 2013 ) State evolution for general approximate message passing algorithms, with applications to spatial coupling. Inf. Infer. , 2 , 115 – 144 . 28. Johnstone I. M. & Lu A. Y. ( 2009 ) On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Stat. Assoc. , 104 , 682 – 703 . 29. Jolliffe I. T. ( 1986 ) Principal Component Analysis. New York, NY : Springer . 30. Jolliffe I. T. , Trendafilov N. & Uddin M. ( 2003 ) A modified principal component technique based on the LASSO. J. Comput. Graphical Stat. , 12 , 531 – 547 . 31. Kabashima Y. , Krzakala F. , Mézard M. , Sakata A. & Zdeborová L. ( 2016 ) Phase transitions and sample complexity in Bayes-optimal matrix factorization. IEEE Trans. Inform. Theory , 62 , 4228 – 4265 . 32. Katayama T. ( 2010 ) Subspace Methods for System Identification. New York : Springer . 33. Keshavan R. , Montanari A. & Oh S. ( 2010 ) Matrix completion from a few entries. IEEE Trans. Inform. Theory , 56 , 2980 – 2998 . 34. Krzakala F. , Mézard M. & Zdeborová L. ( 2013 ) Phase diagram and approximate message passing for blind calibration and dictionary learning. IEEE International Symposium on Information Theory (ISIT), Istanbul, Turkey, July 2013. IEEE , pp. 659 – 663 . 35. Lee D. D. & Seung H. S. ( 2001 ) Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems ( Dietterich T. G. , Becker S. & Ghahramani Z. eds). Vancouver, Canada : MIT Press , pp. 556 – 562 . 36. Lesieur T. , Krzakala F. & Zdeborová L.
( 2015 ) MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel. Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, Oct. 2015. IEEE . 37. Lewicki M. S. & Sejnowski T. J. ( 2000 ) Learning overcomplete representations. Neural Comp. , 12 , 337 – 365 . 38. Manoel A. , Krzakala F. , Tramel E. W. & Zdeborová L. ( 2015 ) Swept approximate message passing for sparse estimation. International Conference on Machine Learning, 7–9 July 2015 ( Bach F. & Blei D. eds). Lille, France , pp. 1123 – 1132 . 39. Marčenko V. A. & Pastur L. A. ( 1967 ) Distribution of eigenvalues for some sets of random matrices. Math. USSR–Sbornik , 1 , 457 – 483 . 40. Montanari A. & Tse D. ( 2006 ) Analysis of belief propagation for non-linear problems: the example of CDMA (or: how to prove Tanaka’s formula). arXiv:cs/0602028v1 [cs.IT] . 41. Olshausen B. A. & Field D. J. ( 1996 ) Natural image statistics and efficient coding. Network Comp. Neural Syst. , 7 , 333 – 339 . 42. Olshausen B. A. & Field D. J. ( 1997 ) Sparse coding with an overcomplete basis set: A strategy employed by V1? J. Vis. Res. , 37 , 3311 – 3325 . 43. Parker J. T. , Schniter P. & Cevher V. ( 2014 ) Bilinear generalized approximate message passing—Part I: Derivation. IEEE Trans. Signal Process. , 62 , 5839 – 5853 . 44. Parker J. T. , Schniter P. & Cevher V. ( 2014 ) Bilinear generalized approximate message passing—Part II: Applications. IEEE Trans. Signal Process. , 62 , 5854 – 5867 . 45. Rangan S. ( 2010 ) Estimation with random linear mixing, belief propagation and compressed sensing. IEEE Conference on Information Sciences and Systems (CISS), Princeton, NJ, March 2010. IEEE , pp. 1 – 6 . 46. Rangan S.
( 2011 ) Generalized approximate message passing for estimation with random linear mixing. IEEE International Symposium on Information Theory (ISIT), St. Petersburg, Russia, 2011. IEEE , pp. 2174 – 2178 . 47. Rangan S. & Fletcher A. K. ( 2012 ) Iterative estimation of constrained rank-one matrices in noise. IEEE International Symposium on Information Theory (ISIT), Cambridge, MA, July 2012. IEEE . 48. Rangan S. , Fletcher A. & Goyal V. K. ( 2012 ) Asymptotic analysis of MAP estimation via the replica method and applications to compressed sensing. IEEE Trans. Inform. Theory , 58 , 1902 – 1923 . 49. Rangan S. , Fletcher A. K. , Schniter P. & Kamilov U. S. ( 2017 ) Inference for generalized linear models via alternating directions and Bethe free energy minimization. IEEE Trans. Inform. Theory , 63 , 676 – 697 . 50. Rangan S. , Schniter P. & Fletcher A. ( 2014 ) On the convergence of approximate message passing with arbitrary matrices. IEEE International Symposium on Information Theory (ISIT), Honolulu, Hawaii, June 2014. IEEE , pp. 236 – 240 . 51. Rangan S. , Schniter P. , Riegler E. , Fletcher A. & Cevher V. ( 2016 ) Fixed points of generalized approximate message passing with arbitrary matrices. IEEE Trans. Inform. Theory , 62 , 7464 – 7474 . 52. Sakata A. & Kabashima Y. ( 2013 ) Sample complexity of Bayesian optimal dictionary learning. IEEE International Symposium on Information Theory (ISIT), Istanbul, Turkey, June 2013. IEEE , pp. 669 – 673 . 53. Shen H. & Huang J. Z. ( 2008 ) Sparse principal component analysis. J. Multivariate Anal. , 99 , 1015 – 1034 . 54. SourceForge ( 2015 ) Generalized approximate message passing. SourceForge.net project gampmatlab. Available on-line at http://gampmatlab.sourceforge.net/. 55. Tanaka T.
( 2002 ) A statistical-mechanics approach to large-system analysis of CDMA multiuser detectors. IEEE Trans. Inform. Theory , 48 , 2888 – 2910 . 56. Vidyasagar M. ( 1978 ) Nonlinear Systems Analysis. Englewood Cliffs, NJ : Prentice-Hall . 57. Vila J. P. & Schniter P. ( 2011 ) Expectation-maximization Bernoulli–Gaussian approximate message passing. IEEE Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Sept 2011. IEEE , pp. 799 – 803 . 58. Vila J. , Schniter P. & Meola J. ( 2013 ) Hyperspectral image unmixing via bilinear generalized approximate message passing. SPIE Defense, Security, and Sensing, Baltimore, MD, May 2013. SPIE , pp. 87430Y . 59. Vila J. , Schniter P. , Rangan S. , Krzakala F. & Zdeborová L. ( 2015 ) Adaptive damping and mean removal for the generalized approximate message passing algorithm. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, April 2015. IEEE , pp. 2021 – 2025 . 60. Wu Y. & Verdú S. ( 2012 ) Functional properties of minimum mean-square error and mutual information. IEEE Trans. Inform. Theory , 58 , 1289 – 1301 . 61. Zhang Z. , Zha H. & Simon H. ( 2002 ) Low rank approximations with sparse factors I: Basic algorithms and error analysis. SIAM J. Matrix Anal. Appl. , 23 , 706 – 727 . 62. Zou H. , Hastie T. & Tibshirani R. ( 2006 ) Sparse principal component analysis. J. Comput. Graphical Stat. , 15 , 265 – 286 . Appendix A. Appendices: proofs A.1 Empirical convergence of random variables Bayati and Montanari’s analysis in [3] employs certain deterministic models on the vectors and then proves convergence properties of related empirical distributions. To apply the same analysis here, we need to review some of their definitions.
We say a function $$\phi:{\mathbb{R}}^r {\rightarrow} {\mathbb{R}}^s$$ is pseudo-Lipschitz of order $$p>1$$, if there exists an $$L > 0$$ such that for any $${\mathbf{x}}$$, $${\mathbf{y}} \in {\mathbb{R}}^r$$, $\|\phi({\mathbf{x}}) - \phi({\mathbf{y}})\| \leq L(1+\|{\mathbf{x}}\|^{p-1}+\|{\mathbf{y}}\|^{p-1})\|{\mathbf{x}}-{\mathbf{y}}\|.$ Now suppose that for each $$n=1,2,\,{\ldots}\,$$, we have a set of vectors $\theta(n) = \left\{ {\mathbf{v}}_i(n), i=1,\ldots,\ell(n)\right\}\!,$ where the elements are vectors $${\mathbf{v}}_i(n) \in {\mathbb{R}}^s$$ and the size of the set is given by $$\ell(n)$$. We say that the set of components of $$\theta(n)$$ empirically converges with bounded moments of order $$p$$ as $$n{\rightarrow}\infty$$ to a random vector $${\mathbf{V}}$$ on $${\mathbb{R}}^s$$ if, for all pseudo-Lipschitz continuous functions, $$\phi$$, of order $$p$$, $$\label{eq:phiConv} \lim_{n {\rightarrow}\infty} \frac{1}{\ell(n)} \sum_{i=1}^{\ell(n)} \phi({\mathbf{v}}_i(n)) = {\mathbb{E}}[\phi({\mathbf{V}})] < \infty.$$ (A.1) When the nature of convergence is clear, we may write (with some abuse of notation) $\lim_{n {\rightarrow} \infty} \theta(n) \stackrel{d}{=} {\mathbf{V}}.$ A.2 Bayati–Montanari recursions with adaptation Our main result will need an adaptive version of the recursion theorem of Bayati and Montanari [3]. Let $$H_u(t,b,u_0,\nu_u)$$ and $$H_v(t,d,v_0,\nu_v)$$ be two functions defined on arguments $$t=0,1,2,\ldots$$ and $$b$$, $$d$$, $$u_0$$ and $$v_0 \in {\mathbb{R}}$$ as well as vectors $$\nu_u$$ and $$\nu_v$$. Given a matrix $${\mathbf{S}} \in {\mathbb{R}}^{m {\times} n}$$ and vectors $${\mathbf{u}}_0$$ and $${\mathbf{v}}_0$$, generate a sequence of vectors $${\mathbf{b}}(t)$$ and $${\mathbf{d}}(t)$$ by the iterations \begin{eqnarray} {\mathbf{b}}(t) &=& {\mathbf{S}} {\mathbf{v}}(t) + \xi_u(t){\mathbf{u}}(t), \label{eq:genAMPb} \\ \end{eqnarray} (A.2a) \begin{eqnarray} {\mathbf{d}}(t) &=& {\mathbf{S}}^\top{\mathbf{u}}({t\! + \! 
1}) + \xi_v(t){\mathbf{v}}(t), \label{eq:genAMPd} \end{eqnarray} (A.2b) where \begin{eqnarray} u_i({t\! + \! 1}) &=& H_u(t,b_i(t),u_{0i},\nu_u(t)), \\ \end{eqnarray} (A.3a) \begin{eqnarray} v_j({t\! + \! 1}) &=& H_v(t,d_j(t),v_{0j},\nu_v(t)), \end{eqnarray} (A.3b) and $$\xi_v(t)$$ and $$\xi_u(t)$$ are scalar step sizes given by \begin{eqnarray} \xi_v(t) &=& -\frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial b_i}H_u(t,b_i(t),u_{0i},\nu_u(t)), \\ \end{eqnarray} (A.4a) \begin{eqnarray} \xi_u({t\! + \! 1}) &=& -\frac{1}{m} \sum_{j=1}^n \frac{\partial}{\partial d_j} H_v(t,d_j(t),v_{0j},\nu_v(t)). \hspace{0.7cm} \label{eq:genAMPmuu} \end{eqnarray} (A.4b) The recursions (A.2) to (A.4) are identical to the recursions analyzed in [3], except for the introduction of the parameters $$\nu_u(t)$$ and $$\nu_v(t)$$. We will call these parameters adaptation parameters, since they enable the functions $$H_u(\cdot)$$ and $$H_v(\cdot)$$ to have some data dependence, not explicitly considered in [3]. Similarly to the selection of the parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ in (4.2), we assume that, in each iteration $$t$$, the adaptation parameters are selected by functions of the form $$\label{eq:genNu} \nu_u(t) = \frac{1}{n} \sum_{j=1}^n \phi_u(t,v_{0j},v_j(t)), \quad \nu_v(t) = \frac{1}{m} \sum_{i=1}^m \phi_v(t,u_{0i},u_i({t\! + \! 1})),$$ (A.5) where $$\phi_u(\cdot)$$ and $$\phi_v(\cdot)$$ are (possibly vector valued) pseudo-Lipschitz continuous of order $$p$$ for some $$p > 1$$. Thus, the values of $$\nu_u(t)$$ and $$\nu_v(t)$$ depend on the outputs $${\mathbf{v}}(t)$$ and $${\mathbf{u}}({t\! + \! 1})$$. Note that in equations (A.3) to (A.5), $$b_i$$, $$u_{0i}$$, $$d_j$$ and $$v_{0j}$$ are the components of the vectors $${\mathbf{b}}$$, $${\mathbf{u}}_0$$, $${\mathbf{d}}$$ and $${\mathbf{v}}_0$$, respectively. The algorithm is initialized with $$t=0$$, $$\xi_u(0)=0$$ and some vector $${\mathbf{v}}(0)$$. 
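Although the analysis below is purely analytic, the recursions (A.2) to (A.4) are straightforward to simulate. The following sketch is illustrative only: the linear choices $$H_u(t,b,u_0,\nu_u)=\nu_u b$$ and $$H_v(t,d,v_0,\nu_v)=\nu_v d$$, the fixed adaptation parameters and the dimensions are assumptions made for this example, chosen so that the step sizes (A.4) can be written in closed form.

```python
import numpy as np

# Illustrative simulation of the recursions (A.2)-(A.4).  The linear choices
# H_u(t, b, u0, nu_u) = nu_u * b and H_v(t, d, v0, nu_v) = nu_v * d are
# assumptions for this sketch; they make dH_u/db = nu_u and dH_v/dd = nu_v,
# so the step sizes (A.4) are exact sums of constants.
rng = np.random.default_rng(0)
m, n = 1000, 2000                              # beta = n / m = 2
S = rng.standard_normal((m, n)) / np.sqrt(m)   # s_ij ~ N(0, 1/m)

nu_u = nu_v = 0.5                              # fixed adaptation parameters
v = rng.standard_normal(n)                     # v(0), so E[V(0)^2] = 1
u = np.zeros(m)
xi_u = 0.0                                     # xi_u(0) = 0

b_first = None
for t in range(5):
    b = S @ v + xi_u * u                       # (A.2a)
    if b_first is None:
        b_first = b.copy()                     # keep b(0) for the check below
    u = nu_u * b                               # (A.3a)
    xi_v = -nu_u                               # (A.4a): -(1/m) * m * nu_u
    d = S.T @ u + xi_v * v                     # (A.2b)
    v = nu_v * d                               # (A.3b)
    xi_u = -(n / m) * nu_v                     # (A.4b): -(1/m) * n * nu_v

# The initialization (A.9) of the state evolution predicts that the empirical
# second moment of b(0) is close to tau_b(0) = beta * E[V(0)^2] = 2.
```

For large $$m$$ and $$n$$, the empirical second moment of the first vector $${\mathbf{b}}(0)$$ indeed concentrates around the state evolution prediction $$\tau^b(0)$$.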
Now, similarly to Section 4, consider a sequence of random realizations of the parameters indexed by the input dimension $$n$$. For each $$n$$, we assume that the output dimension $$m=m(n)$$ is deterministic and scales linearly as in (3.4) for some $$\beta \geq 0$$. Assume that the transform matrix $${\mathbf{S}}$$ has i.i.d. Gaussian components $$s_{ij} \sim \mathcal{N}(0,1/m)$$. Also assume that the empirical limits in (4.3) hold with bounded moments of order $$2p-2$$ for some limiting random variables $$(U_0,U(0))$$ and $$(V_0,V(0))$$. We will also assume the following continuity assumptions on $$H_u(\cdot)$$ and $$H_v(\cdot)$$: Assumption A1 The function $$H_u(t,b,u_0,\nu_u)$$ satisfies the following continuity conditions: (a) For every $$\nu_u$$ and $$t$$, $$H_u(t,b,u_0,\nu_u)$$ and its derivative $$\partial H_u(t,b,u_0,\nu_u)/\partial b$$ are Lipschitz continuous in $$b$$ and $$u_0$$ for some Lipschitz constant that is continuous in $$\nu_u$$; and (b) for every $$\nu_u$$ and $$t$$, $$H_u(t,b,u_0,\nu_u)$$ and $$\partial H_u(t,b,u_0,\nu_u)/\partial b$$ are continuous at $$\nu_u$$ uniformly over $$(b,u_0)$$. The function $$H_v(t,d,v_0,\nu_v)$$ satisfies the analogous continuity assumptions in $$d$$, $$v_0$$ and $$\nu_v$$. Under these assumptions, we will show, as in Section 4, that for any fixed iteration $$t$$, the sets $$\theta_u(t)$$ and $$\theta_v(t)$$ in (3.5) converge empirically to the limits (3.6) for some random variable pairs $$(U_0,U(t))$$ and $$(V_0,V(t))$$. The random variable $$U_0$$ is identical to the variable in the limit (4.3) and, for $$t \geq 0$$, $$U(t)$$ is given by $$\label{eq:UtGen} U({t\! + \! 1}) = H_u(t,B(t),U_0,{\overline{\nu}}_u(t)), \quad B(t) \sim {\mathcal N}(0,\tau^b(t)),$$ (A.6) for some deterministic constants $${\overline{\nu}}_u(t)$$ and $$\tau^b(t)$$ that will be defined below. 
Similarly, the random variable $$V_0$$ is identical to the variable in the limit (4.3) and, for $$t \geq 0$$, $$V(t)$$ is given by $$\label{eq:VtGen} V({t\! + \! 1}) = H_v(t,D(t),V_0,{\overline{\nu}}_v(t)), \quad D(t) \sim {\mathcal N}(0,\tau^d(t)),$$ (A.7) for some constants $${\overline{\nu}}_v(t)$$ and $$\tau^d(t)$$, also defined below. The constants $$\tau^b(t)$$, $$\tau^d(t)$$, $${\overline{\nu}}_u(t)$$ and $${\overline{\nu}}_v(t)$$ can be computed recursively through the following state evolution equations: \begin{eqnarray} \tau^d(t) &=& {\mathbb{E}}\left[ U({t\! + \! 1})^2 \right], \quad \tau^b(t) = \beta {\mathbb{E}}\left[ V(t)^2 \right], \label{eq:taub} \\ \end{eqnarray} (A.8a) \begin{eqnarray} {\overline{\nu}}_u(t) &=& {\mathbb{E}}\left[\phi_u(t,V_0,V(t))\right], \quad {\overline{\nu}}_v(t) = {\mathbb{E}}\left[\phi_v(t,U_0,U({t\! + \! 1}))\right], \label{eq:nuv} \end{eqnarray} (A.8b) where the expectations are over the random variables $$U(t)$$ and $$V(t)$$ above and initialized with $$\label{eq:SEGenInit} \tau^b(0) := \beta {\mathbb{E}}\left[ V(0)^2 \right].$$ (A.9) With these definitions, we can now state the adaptive version of the result from Bayati and Montanari [3]. Although the full proof requires that $$p=2$$, much of the proof is valid for $$p\geq 2$$. Hence, where possible, we provide the steps for the general $$p$$ case. Theorem A2 Consider the recursion in (A.2) to (A.5) satisfying the above assumptions for the case when $$p=2$$. Then, for any fixed iteration number $$t$$, the sets $$\theta_u(t)$$ and $$\theta_v(t)$$ in (3.5) converge empirically to the limits (3.6) with bounded moments of order $$p=2$$ to the random variable pairs $$(U_0,U(t))$$ and $$(V_0,V(t))$$ described above. Proof. We use an asterisk superscript to denote the outputs of the non-adaptive version of the recursions (A.2) to (A.5). 
That is, quantities such as $${\mathbf{u}}^*(t), {\mathbf{v}}^*(t), {\mathbf{b}}^*(t), {\mathbf{d}}^*(t),\,{\ldots}\,$$, will represent the outputs generated by recursions (A.2) to (A.5) with the same initial conditions ($${\mathbf{v}}^*(0)={\mathbf{v}}(0)$$ and $$\xi_u(0)=\xi_u^*(0)=0$$), but in (A.3) and (A.4), $$\nu_u(t)$$ and $$\nu_v(t)$$ are replaced by their deterministic limits $${\overline{\nu}}_u(t)$$ and $${\overline{\nu}}_v(t)$$. Therefore, \begin{eqnarray} u^*_i({t\! + \! 1}) &=& H_u(t,b^*_i(t),u_{0i},{\overline{\nu}}_u(t)), \\ \end{eqnarray} (A.10a) \begin{eqnarray} v^*_j({t\! + \! 1}) &=& H_v(t,d^*_j(t),v_{0j},{\overline{\nu}}_v(t)), \end{eqnarray} (A.10b) and \begin{eqnarray} \xi^*_v(t) &=& -\frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial b_i}H_u(t,b^*_i(t),u_{0i},{\overline{\nu}}_u(t)), \\ \end{eqnarray} (A.11a) \begin{eqnarray} \xi^*_u({t\! + \! 1}) &=& -\frac{1}{m} \sum_{j=1}^n \frac{\partial}{\partial d_j} H_v(t,d^*_j(t),v_{0j},{\overline{\nu}}_v(t)). \hspace{0.7cm} \end{eqnarray} (A.11b) Now, Bayati and Montanari’s result in [3] shows that this non-adaptive algorithm satisfies the required properties. That is, the following limits hold with bounded moments of order $$p$$: \begin{eqnarray} \lim_{n {\rightarrow} \infty} \{(u_{0i},u^*_i(t)), i=1,\ldots,m\} &=& (U_0,U(t)), \\ \end{eqnarray} (A.12a) \begin{eqnarray} \lim_{n {\rightarrow} \infty} \{(v_{0j},v^*_j(t)), j=1,\ldots,n\} &=& (V_0,V(t)). \end{eqnarray} (A.12b) So, the limits (3.6) will be shown if we can prove the following limits hold almost surely for all $$t$$: $$\label{eq:uvlimd} \lim_{n {\rightarrow} \infty} \frac{1}{m} \|{\mathbf{u}}(t)-{\mathbf{u}}^*(t)\|_p^p = 0, \quad \lim_{n {\rightarrow} \infty} \frac{1}{n} \|{\mathbf{v}}(t)-{\mathbf{v}}^*(t)\|_p^p = 0,$$ (A.13) where $$\|\cdot\|_p$$ is the $$p$$-norm. 
In the course of proving (A.13), we will also show the following limits hold almost surely: \begin{eqnarray} \lim_{n {\rightarrow} \infty} \frac{1}{m} \|{\mathbf{b}}(t)-{\mathbf{b}}^*(t)\|_p^p &=& 0, \label{eq:blimd} \\ \end{eqnarray} (A.14a) \begin{eqnarray} \lim_{n {\rightarrow} \infty} \frac{1}{n} \|{\mathbf{d}}(t)-{\mathbf{d}}^*(t)\|_p^p &=& 0, \label{eq:dlimd} \\ \end{eqnarray} (A.14b) \begin{eqnarray} \lim_{n {\rightarrow} \infty} |\xi_u(t)-\xi^*_u(t)| &=& 0, \label{eq:xiulimd} \\ \end{eqnarray} (A.14c) \begin{eqnarray} \lim_{n {\rightarrow} \infty} |\xi_v(t)-\xi^*_v(t)| &=& 0, \label{eq:xivlimd} \\ \end{eqnarray} (A.14d) \begin{eqnarray} \lim_{n {\rightarrow} \infty} \nu_u(t) &=& {\overline{\nu}}_u(t), \label{eq:nuulimd} \\ \end{eqnarray} (A.14e) \begin{eqnarray} \lim_{n {\rightarrow} \infty} \nu_v(t) &=& {\overline{\nu}}_v(t). \label{eq:nuvlimd} \end{eqnarray} (A.14f) The proofs of the limits (A.13) and (A.14) can be demonstrated via induction on $$t$$ with the following straightforward (but somewhat tedious) continuity argument. To begin the induction argument, first note that the non-adaptive algorithm has the same initial condition as the adaptive algorithm. That is, $${\mathbf{v}}^*(0)={\mathbf{v}}(0)$$ and $$\xi_u(0)=\xi_u^*(0)=0$$. Also, since $$\xi_u(0)=\xi_u^*(0)=0$$, from (A.2a), the initial value of $${\mathbf{u}}(t)$$ does not matter. So, without loss of generality, we can assume that the initial condition satisfies $${\mathbf{u}}(0)={\mathbf{u}}^*(0)$$. Thus, the limits (A.13) and (A.14c) hold for $$t=0$$. We now proceed by induction. Suppose that the limits (A.13) and (A.14c) hold almost surely for some $$t \geq 0$$. Since $${\mathbf{S}}$$ has i.i.d. components with mean zero and variance $$1/m$$, by the Marčenko–Pastur theorem [39], the maximum singular value of $${\mathbf{S}}$$ is bounded. 
For $$p=2$$, the maximum singular value is the $$p$$-norm operator norm, and therefore, there exists a constant $$C_S > 0$$ such that $$\label{eq:Snorm} \limsup_{n {\rightarrow} \infty} \|{\mathbf{S}}\|_p \leq C_S, \quad \limsup_{n {\rightarrow} \infty} \|{\mathbf{S}}^\top\|_p \leq C_S.$$ (A.15) Substituting the bound (A.15) into (A.2a), we obtain \begin{align} \|{\mathbf{b}}(t)-{\mathbf{b}}^*(t)\|_p &\leq \|{\mathbf{S}}\|_p\|{\mathbf{v}}(t)-{\mathbf{v}}^*(t)\|_p + |\xi_u(t)|\|{\mathbf{u}}(t)-{\mathbf{u}}^*(t)\|_p + |\xi_u(t)-\xi^*_u(t)|\|{\mathbf{u}}^*(t)\|_p \nonumber \\ & \leq C_S\|{\mathbf{v}}(t)-{\mathbf{v}}^*(t)\|_p + |\xi_u(t)|\|{\mathbf{u}}(t)-{\mathbf{u}}^*(t)\|_p + |\xi_u(t)-\xi^*_u(t)|\|{\mathbf{u}}^*(t)\|_p. \label{eq:blimbnd1} \end{align} (A.16) Now, since $$p \geq 1$$, we have that for any positive numbers $$a$$ and $$b$$, $$\label{eq:abpIneq} (a+b)^p \leq 2^{p-1}(a^p + b^p).$$ (A.17) Applying (A.17) to (A.16), and the fact that $$\lim_n n/m = \beta$$, we obtain that \begin{align} \frac{1}{m}\|{\mathbf{b}}(t)-{\mathbf{b}}^*(t)\|^p_p &\leq C'_S\Bigl[ \frac{1}{n}\|{\mathbf{v}}(t)-{\mathbf{v}}^*(t)\|^p_p \nonumber \\ &\quad + \frac{|\xi_u(t)|^p}{m}\|{\mathbf{u}}(t)-{\mathbf{u}}^*(t)\|^p_p +\frac{|\xi_u(t)-\xi^*_u(t)|^p}{m} \|{\mathbf{u}}^*(t)\|^p_p\Bigr], \label{eq:blimbnd2} \end{align} (A.18) for some other constant $$C'_S > 0$$. Now, since $${\mathbf{u}}^*(t)$$ is the output of the non-adaptive algorithm it satisfies the limit $$\label{eq:ubndNA} \lim_{n {\rightarrow} \infty} \frac{1}{m}\|{\mathbf{u}}^*(t)\|_p^p = \lim_{n {\rightarrow} \infty} \frac{1}{m}\sum_{i=1}^m|u_i^*(t)|^p = {\mathbb{E}}|U(t)|^p < \infty.$$ (A.19) Substituting the bound (A.19) along with the induction hypotheses (A.13) and (A.14c) into (A.18) shows (A.14a). 
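The bound (A.17) is simply convexity of $$x \mapsto x^p$$ applied to $$(a+b)/2$$. As a quick numerical spot check (illustrative only; the random test points are arbitrary):

```python
import numpy as np

# Spot check of (A.17): (a + b)^p <= 2^(p-1) * (a^p + b^p) for a, b >= 0
# and p >= 1.  It follows from convexity: ((a+b)/2)^p <= (a^p + b^p)/2.
rng = np.random.default_rng(0)
a, b = rng.uniform(0.0, 10.0, size=(2, 100_000))
for p in (1.0, 1.5, 2.0, 3.0):
    gap = 2 ** (p - 1) * (a ** p + b ** p) - (a + b) ** p
    assert gap.min() >= -1e-9    # holds up to floating-point round-off
```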
Next, to prove the limit (A.14e), first observe that since $$\phi_u(\cdot)$$ is pseudo-Lipschitz continuous of order $$p$$, we have that $${\overline{\nu}}_u(t)$$ in (A.8b) can be replaced by the limit of the empirical means $$\label{eq:nuuLimNA} {\overline{\nu}}_u(t) = \lim_{n {\rightarrow}\infty} \frac{1}{n} \sum_{j=1}^n \phi_u(t,v_{0j},v_j^*(t)),$$ (A.20) where the limit holds almost surely. Combining (A.20) with (A.5), \begin{eqnarray} \limsup_{n {\rightarrow} \infty} |\nu_u(t)-{\overline{\nu}}_u(t)| \leq \limsup_{n {\rightarrow}\infty} \frac{1}{n} \sum_{j=1}^n |\phi_u(t,v_{0j},v_j(t)) -\phi_u(t,v_{0j},v_j^*(t))|. \label{eq:nuuDiff1} \end{eqnarray} (A.21) Applying the fact that $$\phi_u(\cdot)$$ is pseudo-Lipschitz continuous of order $$p$$ to (A.21), we obtain that there exists a constant $$L_v > 0$$ such that \begin{align} \limsup_{n {\rightarrow} \infty}|\nu_u(t)-{\overline{\nu}}_u(t)| &\leq \limsup_{n {\rightarrow} \infty} \frac{L_v}{n}\sum_{j=1}^{n} \left[|v_j(t)|^{p-1}+|v_j^*(t)|^{p-1}\right]|v_j(t)-v_j^*(t)| \nonumber \\ &\leq \limsup_{n {\rightarrow} \infty} L_v\left[ \left( \frac{1}{n}\|{\mathbf{v}}(t)\|_p^p \right)^{(p-1)/p} + \left( \frac{1}{n}\|{\mathbf{v}}^*(t)\|_p^p \right)^{(p-1)/p} \right] \left( \frac{1}{n}\|{\mathbf{v}}(t)-{\mathbf{v}}^*(t)\|_p^p \right)^{1/p}, \end{align} (A.22) where the last step is due to Hölder’s inequality with the exponents $\frac{p-1}{p} + \frac{1}{p} = 1.$ Now, similarly to the proof of (A.19), one can show that the non-adaptive output satisfies the limit $$\label{eq:vbnd1} \lim_n \frac{1}{n} \|{\mathbf{v}}^*(t)\|_p^p = {\mathbb{E}}|V(t)|^p < \infty.$$ (A.23) Also, from the induction hypothesis (A.13), it follows that the adaptive output must satisfy the same limit $$\label{eq:vbnd2} \lim_n \frac{1}{n} \|{\mathbf{v}}(t)\|_p^p = {\mathbb{E}}|V(t)|^p < \infty.$$ (A.24) Applying the bounds (A.23) and 
(A.24) and the limit (A.22) shows that (A.14e) holds almost surely. Now the limit in (A.14e) and the second limit in (A.13) together with the continuity conditions on $$H_u(\cdot)$$ in Assumption A1 show that the first limit in (A.13) holds almost surely for $$t+1$$ and (A.14d) holds almost surely for $$t$$. Using (A.2b), the proof of the limit (A.14b) is similar to the proof of (A.14a). These limits in turn show that the second limit in (A.13) and the limits in (A.14c) hold almost surely for $$t+1$$. We have thus shown that if (A.13) and (A.14c) hold almost surely for some $$t$$, they hold for $$t+1$$. Thus, by induction they hold for all $$t$$. Finally, applying the limits (A.12), (A.13) and a continuity argument shows that the desired limits (3.6) hold almost surely. □ A.3 Proof of Theorem 4.2 The theorem directly follows from the adaptive Bayati–Montanari recursion theorem, Theorem A2 in Appendix A.2 above, with some changes of variables. Specifically, let $$\label{eq:Sdef} {\mathbf{S}} = \frac{1}{\sqrt{m\tau_w}} {\mathbf{W}},$$ (A.25) where $${\mathbf{W}}$$ is the Gaussian noise in the rank-one model in Assumption 4.1(b). Since $${\mathbf{W}}$$ has i.i.d. components with Gaussian distributions $${\mathcal N}(0,\tau_w)$$, the components of $${\mathbf{S}}$$ will be i.i.d. with distributions $${\mathcal N}(0,1/m)$$. Now, using the rank-one model for $${\mathbf{A}}$$ in (1.1), $$\label{eq:Avpf} {\mathbf{A}}{\mathbf{v}}(t) = {\mathbf{u}}_0{\mathbf{v}}_0^\top{\mathbf{v}}(t) + \sqrt{m}{\mathbf{W}}{\mathbf{v}}(t) = n\alpha_{v1}(t){\mathbf{u}}_0 + \sqrt{m}{\mathbf{W}}{\mathbf{v}}(t),$$ (A.26) where the last step is from the definition of $$\alpha_{v1}(t)$$ in (4.6). 
Substituting (A.26) into the update rule for $${\mathbf{p}}(t)$$ in line 6 of Algorithm 1, we obtain \begin{eqnarray} {\mathbf{p}}(t) &=& (1/m){\mathbf{A}}{\mathbf{v}}(t) + \mu_u(t){\mathbf{u}}(t) \nonumber \\ &=& (1/\sqrt{m}){\mathbf{W}}{\mathbf{v}}(t) + \beta \alpha_{v1}(t){\mathbf{u}}_0 + \mu_u(t){\mathbf{u}}(t). \label{eq:Wvpf} \end{eqnarray} (A.27) Note that we have used the fact that $$\beta=n/m$$. Hence, if we define $$\label{eq:bdefpf} {\mathbf{b}}(t) = \frac{1}{\sqrt{\tau_w}}({\mathbf{p}}(t)-\beta\alpha_{v1}(t){\mathbf{u}}_0),$$ (A.28) then (A.25) and (A.27) show that $$\label{eq:bSpf} {\mathbf{b}}(t) = {\mathbf{S}}{\mathbf{v}}(t) + \xi_u(t){\mathbf{u}}(t), \quad \xi_u(t) = \mu_u(t)/\sqrt{\tau_w}.$$ (A.29) Similarly, one can show that if we define $$\label{eq:ddefpf} {\mathbf{d}}(t) = \frac{1}{\sqrt{\tau_w}}({\mathbf{q}}(t)-\alpha_{u1}(t){\mathbf{v}}_0),$$ (A.30) then $$\label{eq:dSpf} {\mathbf{d}}(t) = {\mathbf{S}}^\top{\mathbf{u}}({t\! + \! 1}) + \xi_v(t){\mathbf{v}}(t), \quad  \xi_v(t) = \mu_v(t)/\sqrt{\tau_w}.$$ (A.31) Next define the adaptation functions $$\label{eq:phiuvpf} \phi_u(t,v_0,v) := (vv_0, \phi_{\lambda u}(t,v_0,v)), \quad \phi_v(t,u_0,u) := (uu_0, \phi_{\lambda v}(t,u_0,u)),$$ (A.32) which are the adaptation functions in (4.2) with additional components for the second-order statistics $$uu_0$$ and $$vv_0$$. Since $$\phi_{\lambda u}(t,\cdot)$$ and $$\phi_{\lambda v}(t,\cdot)$$ are pseudo-Lipschitz of order $$p$$, so are $$\phi_{u}(t,\cdot)$$ and $$\phi_{v}(t,\cdot)$$. Taking the empirical means over each of the two components of $$\phi_u(\cdot)$$ and $$\phi_v(\cdot)$$, and applying (4.2) and (4.6), we see that if $$\nu_u(t)$$ and $$\nu_v(t)$$ are defined as in (A.5), \begin{eqnarray} \nu_u(t) &=& \frac{1}{n}\sum_{j=1}^n \phi_u(t,v_{0j},v_j(t)) = (\alpha_{v1}(t),\lambda_u(t)), \\ \end{eqnarray} (A.33a) \begin{eqnarray} \nu_v(t) &=& \frac{1}{m}\sum_{i=1}^m \phi_v(t,u_{0i},u_i({t\! + \! 1})) = (\alpha_{u1}({t\! + \! 1}),\lambda_v(t)). 
\end{eqnarray} (A.33b) Therefore, $$\nu_u(t)$$ and $$\nu_v(t)$$ are vectors containing the parameters $$\lambda_u(t)$$ and $$\lambda_v(t)$$ for the factor selection functions in lines 7 and 12 of Algorithm 1 as well as the second-order statistics $$\alpha_{u1}(t)$$ and $$\alpha_{v1}(t)$$. Now, for $$\nu_u = (\alpha_{v1},\lambda_u)$$ and $$\nu_v = (\alpha_{u1},\lambda_v)$$ define the scalar functions \begin{eqnarray} H_u(t,b,u_0,\nu_u) &:=& G_u(\sqrt{\tau_w} b + \beta\alpha_{v1}u_0,\lambda_u), \label{eq:Hudefpf} \\ \end{eqnarray} (A.34a) \begin{eqnarray} H_v(t,d,v_0,\nu_v) &:=& G_v(\sqrt{\tau_w} d + \alpha_{u1}v_0,\lambda_v). \label{eq:Hvdefpf} \end{eqnarray} (A.34b) Since $$G_u(p,\lambda_u)$$ and $$G_v(q,\lambda_v)$$ satisfy the continuity conditions in Assumption 4.1(c), $$H_u(t,b,u_0,\nu_u)$$ and $$H_v(t,d,v_0,\nu_v)$$ satisfy Assumption A1. In addition, the componentwise separability assumption in (4.1) implies that the updates in lines 7 and 12 of Algorithm 1 can be rewritten as $$\label{eq:Guvpf} u_i({t\! + \! 1}) = G_u(p_i(t),\lambda_u(t)), \quad v_j({t\! + \! 1}) = G_v(q_j(t),\lambda_v(t)).$$ (A.35) Thus, combining (A.34) and (A.35) with the definitions of $${\mathbf{b}}(t)$$ and $${\mathbf{d}}(t)$$ in (A.28) and (A.30), we obtain $$\label{eq:Huvpf} u_i({t\! + \! 1}) = H_u(t,b_i(t),u_{0i},\nu_u(t)), \quad v_j({t\! + \! 1}) = H_v(t,d_j(t),v_{0j},\nu_v(t)).$$ (A.36) Next observe that \begin{eqnarray} \xi_u({t\! + \! 1}) &\stackrel{(a)}{=}& \mu_u({t\! + \! 1})/\sqrt{\tau_w} \stackrel{(b)}{=} -\frac{\sqrt{\tau_w}}{m}\sum_{j=1}^n \frac{\partial}{\partial q_j}G_v(q_j(t),\lambda_v(t)) \nonumber \\ &\stackrel{(c)}{=}& -\frac{1}{m}\sum_{j=1}^n \frac{\partial}{\partial d_j}H_v(t,d_j(t),v_{0j},\nu_v(t)), \hspace{3cm} \label{eq:xiuH} \end{eqnarray} (A.37) where ($$a$$) follows from the definition of $$\xi_u(t)$$ in (A.29), ($$b$$) is the setting for $$\mu_u({t\! + \! 1})$$ in line 8 and ($$c$$) follows from the definition of $$H_v$$ in (A.34b). 
Similarly, one can show that $$\label{eq:xivH} \xi_v(t) = -\frac{1}{m}\sum_{i=1}^m \frac{\partial}{\partial b_i}H_u(t,b_i(t),u_{0i},\nu_u(t)).$$ (A.38) Equations (A.29), (A.31), (A.33), (A.34), (A.37) and (A.38) exactly match the recursions in equations (A.2) to (A.5). Therefore, Theorem A2 shows that the limits (3.6) hold in the sense that the sets $$\theta_u(t)$$ and $$\theta_v(t)$$ converge empirically with bounded moments of order $$p$$. We next show that the limits $$U(t)$$ and $$V(t)$$ on the right-hand side of (3.6) match the descriptions in (3.8) and (3.9). First, define $${\overline{\alpha}}_{u1}(t)$$ and $${\overline{\alpha}}_{v1}(t)$$ as in (3.7) and $${\overline{\lambda}}_u(t)$$ and $${\overline{\lambda}}_v(t)$$ as in (4.4). Then, from (A.32), the expectations $${\overline{\nu}}_u(t)$$ and $${\overline{\nu}}_v(t)$$ in (A.8b) are given by $$\label{eq:nubarpf} {\overline{\nu}}_u(t) = ({\overline{\alpha}}_{v1}(t),{\overline{\lambda}}_u(t)), \quad {\overline{\nu}}_v(t) = ({\overline{\alpha}}_{u1}({t\! + \! 1}),{\overline{\lambda}}_v(t)).$$ (A.39) Using (A.6), (A.34a) and (A.39), we see that \begin{eqnarray} U({t\! + \! 1}) &=& H_u(t,B(t),U_0,{\overline{\nu}}_u(t)) \nonumber \\ &=& G_u\bigl(\beta{\overline{\alpha}}_{v1}(t)U_0 + \sqrt{\tau_w} B(t), {\overline{\lambda}}_u(t) \bigr), \label{eq:UtGB} \end{eqnarray} (A.40) where $$B(t) \sim {\mathcal N}(0,\tau^b(t))$$. Therefore, if we let $$Z_u(t) = \sqrt{\tau_w} B(t)$$, then $$Z_u(t)$$ is zero-mean Gaussian with variance ${\mathbb{E}}\left[Z_u^2(t)\right] = \tau_w \tau^b(t) \stackrel{(a)}{=} \beta \tau_w {\mathbb{E}}[V(t)^2] \stackrel{(b)}{=} \beta \tau_w {\overline{\alpha}}_{v0}(t),$ where ($$a$$) follows from (A.8a) and ($$b$$) follows from the definition of $${\overline{\alpha}}_{v0}(t)$$ in (3.7). Substituting $$Z_u(t) = \sqrt{\tau_w} B(t)$$ into (A.40) we obtain the model for $$U({t\! + \! 1})$$ in (3.8). Similarly, using (A.7) and (A.34b), we can obtain the model $$V({t\! + \! 1})$$ in (3.9). 
Thus, we have proved that the random variables $$(U_0,U(t))$$ and $$(V_0,V(t))$$ are described by (3.8) and (3.9), and this completes the proof. A.4 Proof of Theorem 5.1 The theorem is proven by simply evaluating the second-order statistics. We begin with $${\overline{\alpha}}_{u1}({t\! + \! 1})$$: \begin{eqnarray} {\overline{\alpha}}_{u1}({t\! + \! 1}) &\stackrel{(a)}{=}& {\mathbb{E}}[ U_0U({t\! + \! 1}) ] \stackrel{(b)}{=} {\overline{\lambda}}_u(t) {\mathbb{E}}[ U_0P(t) ] \nonumber \\ &\stackrel{(c)}{=}& {\overline{\lambda}}_u(t) {\mathbb{E}}\left[ U_0(\beta{\overline{\alpha}}_{v1}(t)U_0+Z_u(t)) \right] \stackrel{(d)}{=} {\overline{\lambda}}_u(t)\beta\tau_u{\overline{\alpha}}_{v1}(t), \label{eq:alphau1Lin} \end{eqnarray} (A.41) where ($$a$$) is the definition in (3.7), ($$b$$) follows from (3.8) and (5.1), ($$c$$) follows from (3.8) and ($$d$$) follows from the independence of $$Z_u(t)$$ and $$U_0$$ and the definition of $$\tau_u$$ in (4.8). Similarly, one can show that $$\label{eq:alphau0Lin} {\overline{\alpha}}_{u0}({t\! + \! 1}) = {\overline{\lambda}}_u(t)^2\left[ \beta^2\tau_u{\overline{\alpha}}^2_{v1}(t) + \beta\tau_w{\overline{\alpha}}_{v0}(t)\right]\!.$$ (A.42) Substituting (A.41) and (A.42) into (4.9), we obtain the asymptotic correlation coefficient \begin{align} \rho_u({t\! + \! 1}) &= \frac{{\overline{\lambda}}^2_u(t)\beta^2\tau_u^2{\overline{\alpha}}^2_{v1}(t)} {{\overline{\lambda}}^2_u(t)\left[ \beta^2\tau_u{\overline{\alpha}}^2_{v1}(t) + \beta\tau_w{\overline{\alpha}}_{v0}(t)\right]\tau_u} \nonumber \\ &= \frac{\beta\tau_u{\overline{\alpha}}_{v1}^2(t)} {\beta\tau_u{\overline{\alpha}}^2_{v1}(t) + \tau_w{\overline{\alpha}}_{v0}(t)} = \frac{\beta\tau_u\tau_v\rho_{v}(t)} {\beta\tau_u\tau_v\rho_{v}(t) + \tau_w}, \nonumber \end{align} where the last step follows from (4.10). This proves the first equation in (5.2). A similar set of calculations shows that \begin{eqnarray} {\overline{\alpha}}_{v1}({t\! + \! 
1}) &=& {\overline{\lambda}}_v(t)\tau_v{\overline{\alpha}}_{u1}({t\! + \! 1}), \\ \end{eqnarray} (A.43a) \begin{eqnarray} {\overline{\alpha}}_{v0}({t\! + \! 1}) &=& {\overline{\lambda}}_v(t)^2\left[ \tau_v{\overline{\alpha}}^2_{u1}({t\! + \! 1}) + \tau_w{\overline{\alpha}}_{u0}({t\! + \! 1})\right]\!. \end{eqnarray} (A.43b) Applying these equations to (4.9) and (4.10), we obtain the recursion (5.2). Hence, we have proved part (a) of the theorem. For part (b), we need the following simple lemma. Lemma A1 Suppose that $$H:[0,1] {\rightarrow} [0,1]$$ is continuous and monotonically increasing, and $$x(t)$$ is a sequence satisfying the recursive relation $$\label{eq:monoiter} x(t+1) = H(x(t)),$$ (A.44) for some initial condition $$x(0) \in [0,1]$$. Then, $$x(t)$$ monotonically either increases or decreases to some $$x^* = H(x^*)$$. Proof. This can be proved similarly to [45, Lemma 7]. □ To apply Lemma A1, observe that the recursions (5.2) show that $\rho_u({t\! + \! 1}) = \frac{\beta \tau_u\tau_v \rho_v(t)}{ \beta \tau_u\tau_v \rho_v(t) + \tau_w} = \frac{\beta \tau_u^2\tau_v^2 \rho_u(t)}{ \beta \tau_u^2\tau_v^2 \rho_u(t) + \tau_w(\tau_u\tau_v \rho_u(t) + \tau_w)}.$ So, if we define $$\label{eq:HdefLin} H(\rho_u) := \frac{\beta \tau_u^2\tau_v^2 \rho_u}{ (\beta \tau_u^2\tau_v^2 + \tau_w\tau_u\tau_v)\rho_u + \tau_w^2},$$ (A.45) then it follows from (5.2) that $$\rho_u({t\! + \! 1}) = H(\rho_u(t))$$ for all $$t$$. By taking the derivative, it can be checked that $$H(\rho_u)$$ is monotonically increasing. It follows from Lemma A1 that $$\rho_u(t) {\rightarrow} \rho_u^*$$ for some fixed point $$\rho_u^* = H(\rho_u^*)$$ with $$\rho_u^* \in [0,1]$$. 
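To make this fixed-point behavior concrete, the following sketch iterates the map $$H$$ in (A.45) numerically. The parameter values are arbitrary assumptions for illustration, chosen so that $$\beta\tau_u^2\tau_v^2 > \tau_w^2$$ and the iterates converge to a nonzero fixed point.

```python
# Numerical illustration of Lemma A1 applied to the map H in (A.45).
# Parameter values are assumptions for this sketch:
# beta = tau_u = tau_v = 1, tau_w = 0.5, so beta*tau_u^2*tau_v^2 = 1 > tau_w^2.
beta, tau_u, tau_v, tau_w = 1.0, 1.0, 1.0, 0.5

def H(rho_u):
    """The map (A.45): rho_u(t+1) = H(rho_u(t))."""
    a = beta * tau_u ** 2 * tau_v ** 2
    return a * rho_u / ((a + tau_w * tau_u * tau_v) * rho_u + tau_w ** 2)

rho = 0.1                      # any initial rho_u(0) in (0, 1]
for _ in range(200):
    rho = H(rho)               # monotone convergence to a fixed point

# For comparison, the nonzero fixed point of H in closed form:
rho_star = (beta * tau_u ** 2 * tau_v ** 2 - tau_w ** 2) / (
    tau_u * tau_v * (beta * tau_u * tau_v + tau_w))
# Here rho_star = 0.75 / 1.5 = 0.5, and the iterates approach it monotonically.
```

Since the iterates are monotone and the map is a contraction near the fixed point, the sequence settles to $$\rho_u^* = 0.5$$ for these parameter values.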
Now, there are only two fixed point solutions to $$\rho_u^* = H(\rho_u^*)$$: $$\rho_u^* = 0$$ and $$\label{eq:corruLinLimitPos} \rho_u^* = \frac{\beta\tau_u^2\tau_v^2 - \tau_w^2}{ \tau_u\tau_v(\beta\tau_u\tau_v + \tau_w)}.$$ (A.46) When $$\label{eq:snrMinPf} \beta\tau_u^2\tau_v^2 \leq \tau_w^2,$$ (A.47) then $$\rho_u^*$$ in (A.46) is not positive, so the only fixed point in $$[0,1]$$ is $$\rho_u^* = 0$$. Therefore, when (A.47) is satisfied, $$\rho_u(t)$$ must converge to the zero fixed point: $$\rho_u(t) {\rightarrow} 0$$. Now, suppose that (A.47) is not satisfied. In this case, we claim that $$\rho_u(t) {\rightarrow} \rho_u^*$$, where $$\rho_u^*$$ is in (A.46). We prove this claim by contradiction and suppose, instead, that $$\rho_u(t)$$ converges to the other fixed point: $$\rho_u(t) {\rightarrow} 0$$. Since Lemma A1 shows that $$\rho_u(t)$$ must be either monotonically increasing or decreasing, the only way $$\rho_u(t) {\rightarrow} 0$$ is that $$\rho_u(t)$$ monotonically decreases to zero. But, when (A.47) is not satisfied, it can be checked that for $$\rho_u(t)$$ sufficiently small and positive, $$\rho_u({t\! + \! 1}) = H(\rho_u(t)) > \rho_u(t)$$. This contradicts the fact that $$\rho_u(t)$$ is monotonically decreasing and, therefore, $$\rho_u(t)$$ must converge to the other fixed point $$\rho_u^*$$ in (A.46). Hence, we have shown that when (A.47) is satisfied, $$\rho_u(t) {\rightarrow} 0$$, and when (A.47) is not satisfied, $$\rho_u(t) {\rightarrow} \rho_u^*$$ in (A.46). This is equivalent to the first limit in (5.3). The second limit in (5.3) is proved similarly. A.5 Proof of Theorem 5.2 Similarly to the proof of Theorem 5.1, we begin by computing the second-order statistics of $$(U_0,U(t))$$. Since $$U(t) = {\mathbb{E}}[U_0 {\,|\,} P({t\! - \! 1})]$$, $$U(t)$$ must be uncorrelated with the error: $${\mathbb{E}}[U(t)(U_0-U(t))]=0$$. 
Hence, $${\overline{\alpha}}_{u1}(t) - {\overline{\alpha}}_{u0}(t) = {\mathbb{E}}[U(t)U_0 - U(t)^2] = 0, \label{eq:aueqmmse}$$ (A.48) and therefore $${\overline{\alpha}}_{u0}(t) = {\overline{\alpha}}_{u1}(t)$$. Now, consider the measurement $$P(t)$$ in (3.8). The SNR in this channel is $$\label{eq:snru} \eta_u(t) = \frac{\beta^2{\overline{\alpha}}^2_{v1}(t)}{\beta\tau_w{\overline{\alpha}}_{v0}(t)} = \frac{\beta \tau_v\rho_v(t)}{\tau_w}.$$ (A.49) Since $$U({t\! + \! 1})$$ is the conditional expectation of $$U_0$$ given $$P(t)$$, the mean-squared error is given by $${\mathcal E}_u(\eta_u(t))$$ defined in (5.6). Therefore, \begin{eqnarray} {\mathcal E}_u(\eta_u(t)) = {\mathbb{E}}\left[(U({t\! + \! 1})-U_0)^2\right] \stackrel{(a)}{=} {\overline{\alpha}}_{u0}({t\! + \! 1})-2{\overline{\alpha}}_{u1}({t\! + \! 1}) + \tau_u \stackrel{(b)}{=} \tau_u - {\overline{\alpha}}_{u0}({t\! + \! 1}), \label{eq:mmseau0} \end{eqnarray} (A.50) where ($$a$$) follows from expanding the square and substituting in the definitions in (3.7) and (4.8) and ($$b$$) follows from the fact that $${\overline{\alpha}}_{u0}({t\! + \! 1}) = {\overline{\alpha}}_{u1}({t\! + \! 1})$$ proven above. We have thus proved that $$\label{eq:alphaummse} {\overline{\alpha}}_{u0}({t\! + \! 1}) = {\overline{\alpha}}_{u1}({t\! + \! 1}) = \tau_u - {\mathcal E}_u(\eta_u(t)).$$ (A.51) Therefore, the asymptotic correlation coefficient is given by \begin{eqnarray} \rho_u({t\! + \! 1}) \stackrel{(a)}{=} \frac{{\overline{\alpha}}_{u1}^2({t\! + \! 1})}{ \tau_u{\overline{\alpha}}_{u0}({t\! + \! 1})} \stackrel{(b)}{=} 1 - \tau_u^{-1}{\mathcal E}_u(\eta_u(t)), \label{eq:corruMmsePf} \end{eqnarray} (A.52) where ($$a$$) follows from (4.9) and ($$b$$) follows from (A.51). Substituting (A.49) into (A.52) proves the first equation in (5.7). The second recursion in (5.7) can be proved similarly. 
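The recursion (5.7) can also be iterated numerically. The sketch below is illustrative only: it assumes Gaussian priors, for which the MMSE function $$\tau/(1+\eta\tau)$$ is exact, together with arbitrary parameter values and a positive initialization. With these choices the iteration reaches the same fixed point as the linear recursion of Theorem 5.1.

```python
# Illustrative iteration of the state evolution (5.7) in the Gaussian-prior
# case, where mmse(eta) = tau / (1 + eta * tau) is the exact MMSE.  The
# parameter values (beta = tau_u = tau_v = 1, tau_w = 0.5) and the positive
# starting point are assumptions for this sketch.
beta, tau_u, tau_v, tau_w = 1.0, 1.0, 1.0, 0.5

def mmse(tau, eta):
    """MMSE for estimating a N(0, tau) scalar from Y = sqrt(eta)*X + N(0, 1)."""
    return tau / (1.0 + eta * tau)

rho_v = 0.1                    # positive start (the theorem requires rho_v(0) > 0)
for _ in range(200):
    eta_u = beta * tau_v * rho_v / tau_w          # SNR (A.49)
    rho_u = 1.0 - mmse(tau_u, eta_u) / tau_u      # first equation of (5.7)
    eta_v = tau_u * rho_u / tau_w
    rho_v = 1.0 - mmse(tau_v, eta_v) / tau_v      # second equation of (5.7)

# With these parameters the iterates converge to rho_v = 0.5, which matches
# the nonzero fixed point (A.46) of the linear recursion.
```

Since the Gaussian prior makes the MMSE coincide with the linear MMSE, agreement with the fixed point of Theorem 5.1 is expected.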
For the initial condition in the recursion, observe that with $$V(0) = {\mathbb{E}}[V_0]$$, the second-order statistics are given by \begin{eqnarray*} {\overline{\alpha}}_{v0}(0) = {\mathbb{E}}[ V(0)^2] = ({\mathbb{E}}[V_0])^2, \quad {\overline{\alpha}}_{v1}(0) = {\mathbb{E}}[ V_0V(0)] = ({\mathbb{E}}[V_0])^2. \end{eqnarray*} Hence, from (4.10), the initial correlation coefficient is $\rho_v(0) = \frac{{\overline{\alpha}}^2_{v1}(0)}{\tau_v{\overline{\alpha}}_{v0}(0)} = \frac{({\mathbb{E}}[V_0])^2}{\tau_v},$ which agrees with the statement in the theorem. This proves part (a). To prove part (b), we again use Lemma A1. Define the functions $$\label{eq:Huvmmse} H_u(\rho_v) := 1 - \tau_u^{-1}{\mathcal E}_u(\beta \tau_v\rho_v/\tau_w), \quad H_v(\rho_u) := 1 - \tau_v^{-1}{\mathcal E}_v(\tau_u\rho_u/\tau_w),$$ (A.53) and their concatenation $$\label{eq:Hmmse} H(\rho_v) = H_v(H_u(\rho_v)).$$ (A.54) From (5.7), it follows that $$\rho_v({t\! + \! 1}) = H(\rho_v(t))$$. Now, $${\mathcal E}_u(\eta_u)$$ and $${\mathcal E}_v(\eta_v)$$ defined in (5.6) are the mean-squared errors of $$U_0$$ and $$V_0$$ from AWGN measurements with SNRs $$\eta_u$$ and $$\eta_v$$. Therefore, $${\mathcal E}_u(\eta_u)$$ and $${\mathcal E}_v(\eta_v)$$ must be monotonically decreasing in $$\eta_u$$ and $$\eta_v$$. Hence, $$H_u(\rho_v)$$ and $$H_v(\rho_u)$$ in (A.53) are monotonically increasing functions and thus so is the concatenated function $$H(\rho_v)$$ in (A.54). Also, since the assumption of part (b) is that $${\mathcal E}_u(\eta_u)$$ and $${\mathcal E}_v(\eta_v)$$ are continuous, $$H(\rho_v)$$ is also continuous. It follows from Lemma A1 that $$\rho_v(t) {\rightarrow} \rho_v^*$$ where $$\rho_v^*$$ is a fixed point of (5.7). It remains to show $$\rho_v^* > 0$$. 
Observe that \begin{eqnarray} {\mathcal E}_v(\eta_v) \stackrel{(a)}{=} {\mathbb{E}}\left[(V_0 - {\mathbb{E}}[V_0 {\,|\,} Y])^2\right] \leq {\bf var}(V_0) \stackrel{(b)}{=} {\mathbb{E}}[V_0^2] - ({\mathbb{E}}[V_0])^2 = \tau_v(1-\rho_v(0)), \nonumber \end{eqnarray} where $$Y=\sqrt{\eta_v}V_0+D$$ with $$D \sim {\mathcal N}(0,1)$$, ($$a$$) follows from the definition of $${\mathcal E}_v(\eta_v)$$ in (5.6) and ($$b$$) follows from the definition of $$\tau_v$$ in (4.8) and the initial condition $$\rho_v(0) = ({\mathbb{E}}[V_0])^2/\tau_v$$. It follows from (5.7) that $\rho_v({t\! + \! 1}) = 1-\frac{1}{\tau_v}{\mathcal E}_v(\eta_v(t)) \geq \rho_v(0).$ Therefore, the limit point $$\rho_v^*$$ of $$\rho_v(t)$$ must satisfy $$\rho_v^* \geq \rho_v(0) > 0$$. A.6 Proof of Lemma 5.1 Define the functions $$H_u$$, $$H_v$$ and $$H$$ as in (A.53) and (A.54) from the previous proof. We know that $$\rho_v({t\! + \! 1}) = H(\rho_v(t))$$. When $${\mathbb{E}}[U_0] = {\mathbb{E}}[V_0] = 0$$, then $$\rho_v = 0$$ is a fixed point of the update. We can determine the stability of this fixed point by computing the derivative of $$H(\rho_v)$$ at $$\rho_v=0$$. Towards this end, we first use a standard result (see, e.g., [21]) that, for any prior on $$U_0$$ with bounded second moments, the mean-squared error in (5.6) satisfies $$\label{eq:mmseuLow} {\mathcal E}_u(\eta_u) = \frac{\tau_u}{1 + \eta_u\tau_u} + O(\eta_u^2).$$ (A.55) The term $$\tau_u/(1+\eta_u\tau_u)$$ is the minimum mean-squared error with linear estimation of $$U_0$$ from an AWGN noise-corrupted measurement $$Y = \sqrt{\eta_u}U_0 + D$$, $$D \sim {\mathcal N}(0,1)$$. Equation (A.55) arises from the fact that linear estimation is optimal at low SNR; see [21] for details. Using (A.55), we can compute the derivative of $$H_u$$, \begin{eqnarray} H_u'(0) = -\frac{1}{\tau_u}\left. \frac{\partial}{\partial \rho_v} {\mathcal E}_u(\beta \tau_v\rho_v/\tau_w) \right|_{\rho_v = 0} = -\frac{\beta \tau_v}{\tau_u\tau_w}{\mathcal E}_u'(0) = \frac{\beta \tau_u \tau_v}{\tau_w}. 
\label{eq:mmseuDeriv} \end{eqnarray} (A.56) Similarly, one can show that $$\label{eq:mmsevDeriv} H_v'(0) = \frac{\tau_u \tau_v}{\tau_w},$$ (A.57) and hence $$\label{eq:HzeroDeriv} H'(0) = H_v'(0)H_u'(0) = \frac{\beta \tau_u^2 \tau_v^2}{\tau_w^2}.$$ (A.58) We now apply a standard linearization analysis of the nonlinear system $$\rho_v({t\! + \! 1}) = H(\rho_v(t))$$ around the fixed point $$\rho_v=0$$. See, for example, [56]. If $$\sqrt{\beta}\tau_u\tau_v < \tau_w$$, then $$H'(0) < 1$$ and the fixed point is stable. Thus, for any $$\rho_v(0)$$ sufficiently small, $$\rho_v(t) {\rightarrow} 0$$. This proves part (b) of the lemma. On the other hand, if $$\sqrt{\beta}\tau_u\tau_v > \tau_w$$ then $$H'(0) > 1$$ and the fixed point is unstable. This will imply that for any $$\rho_v(0) > 0$$, $$\rho_v(t)$$ will diverge from zero. But, we know from Theorem 5.2 that $$\rho_v(t)$$ must converge to some fixed point of $$\rho_v = H(\rho_v)$$. So, the limit point must be positive. This proves part (a) of the lemma. © The authors 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.

### Journal

Information and Inference: A Journal of the IMA, Oxford University Press

Published: Sep 19, 2018
