Regularized gradient descent: a non-convex recipe for fast joint blind deconvolution and demixing

Abstract

We study the question of extracting a sequence of functions $$\{\boldsymbol{f}_{\!i}, \boldsymbol{g}_{i}\}_{i=1}^{s}$$ from observing only the sum of their convolutions, i.e. from $$\boldsymbol{y} = \sum _{i=1}^{s} \boldsymbol{f}_{\!i}\ast \boldsymbol{g}_{i}$$. While convex optimization techniques are able to solve this joint blind deconvolution–demixing problem provably and robustly under certain conditions, for medium-size or large-size problems we need computationally faster methods that do not sacrifice the mathematical rigor that comes with convex methods. In this paper we present a non-convex algorithm which guarantees exact recovery under conditions that are competitive with those of convex optimization methods, with the additional advantage of being computationally much more efficient. Our two-step algorithm converges to the global minimum linearly and is also robust in the presence of additive noise. While the derived performance bounds are suboptimal in terms of the information-theoretic limit, numerical simulations show remarkable performance even if the number of measurements is close to the number of degrees of freedom. We discuss an application of the proposed framework in wireless communications in connection with the Internet-of-Things.

1. Introduction

The goal of blind deconvolution is to estimate two unknown functions from their convolution. While it is a highly ill-posed bilinear inverse problem, blind deconvolution is also an extremely important problem in signal processing [1], communications engineering [39], image processing [5], audio processing [24], etc. In this paper, we deal with an even more difficult and more general variation of the blind deconvolution problem, in which we have to extract multiple convolved signals mixed together in one observation signal.
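To make the observation model concrete, here is a minimal numpy sketch of the forward model $$\boldsymbol{y} = \sum _{i=1}^{s} \boldsymbol{f}_{\!i}\ast \boldsymbol{g}_{i}$$; the circular convolution matches the assumption made later in Section 2, while the dimensions L and s are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
L, s = 16, 3  # hypothetical signal length and number of users

# s pairs of unknown signals; only the sum of their convolutions is observed
f = [rng.standard_normal(L) for _ in range(s)]
g = [rng.standard_normal(L) for _ in range(s)]

def cconv(a, b):
    # circular convolution computed via the DFT convolution theorem
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

y = sum(cconv(f[i], g[i]) for i in range(s))
# the inverse problem: recover all pairs (f_i, g_i) from y alone
```
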
This joint blind deconvolution–demixing problem arises in a range of applications such as acoustics [24], dictionary learning [2] and wireless communications [39]. We briefly discuss one such application in more detail. Blind deconvolution/demixing problems are expected to play a vital role in the future Internet-of-Things. The Internet-of-Things will connect billions of wireless devices, which is far more than current wireless systems can technically and economically accommodate. One of the many challenges in the design of the Internet-of-Things will be its ability to manage the massive number of sporadic-traffic devices which are inactive most of the time, but regularly access the network for minor updates with no human interaction [43]. Among other things, this means that the overhead caused by the exchange of certain types of information between transmitter and receiver, such as channel estimation, assignment of data slots, etc., has to be avoided as much as possible [27,36]. Focusing on the underlying mathematical challenges, we consider a multi-user communication scenario where many different users/devices communicate with a common base station, as illustrated in Fig. 1. Suppose we have s users and each of them sends a signal gi through an unknown channel (which differs from user to user) to a common base station. We assume that the ith channel, represented by its impulse response fi, does not change during the transmission of the signal gi. Therefore fi acts as a convolution operator, i.e., the signal transmitted by the ith user arriving at the base station becomes fi * gi, where ‘*’ denotes convolution. Fig. 1. Single-antenna multi-user communication scenario without explicit channel estimation: each of the s users sends a signal gi through an unknown channel fi to a common base station.
The base station measures the superposition of all those signals, namely, $$\boldsymbol{y} = \sum _{i=1}^{s} {\boldsymbol{f}_{\!i}} \ast \boldsymbol{g}_{i} $$ (plus noise). The goal is to extract all pairs $$\{({\boldsymbol{f}_{\!i}},\boldsymbol{g}_{i})\}_{i=1}^{s}$$ simultaneously from y. The antenna at the base station, instead of receiving each individual component fi * gi, is only able to record the superposition of all those signals, namely   \begin{align} \boldsymbol{y} = \sum_{i=1}^{s} {\boldsymbol{f}_{\!\!i}}\ast \boldsymbol{g}_{i} +\boldsymbol{n}, \end{align} (1.1) where n represents noise. We aim to develop a fast algorithm to simultaneously extract all pairs $$\{({\boldsymbol{f}_{\!\!i}},\boldsymbol{g}_{i})\}_{i=1}^{s}$$ from y (i.e., estimating the channel impulse responses fi and the signals gi jointly) in a numerically efficient and robust way, while keeping the number of required measurements as small as possible.

1.1. State of the art and contributions of this paper

A thorough theoretical analysis concerning the solvability of demixing problems via convex optimization can be found in [26]. There, the authors derive explicit sharp bounds and phase transitions regarding the number of measurements required to successfully demix structured signals (such as sparse signals or low-rank matrices) from a single measurement vector. In principle we could recast the blind deconvolution/demixing problem as the demixing of a sum of rank-one matrices, see (2.3).
As such, it seems to fit into the framework analyzed by McCoy and Tropp. However, the setup in [26] differs from ours in a crucial manner. McCoy and Tropp consider full-rank random matrices as measurement matrices (see the matrices $$\mathcal{A}_{i}$$ in (2.3)), while in our setting the measurement matrices are rank-one. This difference fundamentally changes the theoretical analysis. The findings in [26] are therefore not applicable to the problem of joint blind deconvolution/demixing. The compressive principal component analysis in [42] is also a form of demixing problem, but its setting is only vaguely related to ours. There is a large amount of literature on demixing problems, but the vast majority does not have a ‘blind deconvolution component’; therefore, this body of work is only marginally related to the topic of our paper. Blind deconvolution/demixing problems also appear in convolutional dictionary learning, see e.g. [2]. There, the aim is to factorize an ensemble of input vectors into a linear combination of overcomplete basis elements which are modeled as shift-invariant; the latter property is why the factorization turns into a convolution. The setup is similar to (1.1), but with an additional penalty term to enforce sparsity of the convolving filters. The existing literature on convolutional dictionary learning is mainly focused on empirical results, so there is little overlap with our work. But it is an interesting challenge for future research to see whether the approach in this paper can be modified to provide a fast and theoretically sound solver for the sparse convolutional coding problem. There are numerous papers concerned with blind deconvolution/demixing problems in the area of wireless communications [20,31,37]. But the majority of these papers assumes the availability of multiple measurement vectors, which makes the problem significantly easier.
Those methods, however, cannot be applied to the case of a single measurement vector, which is the focus of this paper. Thus there is essentially no overlap between those papers and our work. Our previous paper [23] solves (1.1) under subspace conditions, i.e., assuming that both fi and gi belong to known linear subspaces. This generalizes the pioneering work by Ahmed et al. [1] from the ‘single-user’ scenario to the ‘multi-user’ scenario. Both [1] and [23] employ a two-step convex approach: first ‘lifting’ [9] is used, and then the lifted version of the original bilinear inverse problem is relaxed into a semidefinite program. An improvement of the theoretical bounds in [23] was announced in [29]. While the convex approach is certainly effective and elegant, it can hardly handle large-scale problems. This motivates us to apply a non-convex optimization approach [8,21] to this blind deconvolution–demixing problem. The mathematical challenge, when using non-convex methods, is to derive a rigorous convergence framework with conditions that are competitive with those in a convex framework. In the last few years several excellent articles have appeared on provably convergent non-convex optimization applied to various problems in signal processing and machine learning, e.g., matrix completion [16,17,34], phase retrieval [3,8,11,33], blind deconvolution [4,19,21], dictionary learning [32], super-resolution [12] and low-rank matrix recovery [35,41]. In this paper we derive the first non-convex optimization algorithm that solves (1.1) fast and with rigorous theoretical guarantees concerning exact recovery, convergence rates, as well as robustness to noisy data. Our work can be viewed as a generalization of blind deconvolution [21] (s = 1) to the multi-user scenario (s > 1). The idea behind our approach is strongly motivated by the non-convex optimization algorithm for phase retrieval proposed in [8].
In this foundational paper, the authors use a two-step approach: (i) construct a good initial guess with a numerically efficient algorithm; (ii) starting with this initial guess, prove that simple gradient descent will converge to the true solution. Our paper follows a similar two-step scheme. However, the techniques used here are quite different from [8]. As in the matrix completion problem [7], the performance of the algorithm relies heavily and inherently on how much the ground truth signals are aligned with the design matrix. Due to this so-called ‘incoherence’ issue, we need to impose extra constraints, which results in a different construction of the so-called basin of attraction. Therefore, influenced by [17,21,34], we add penalty terms to control the incoherence, and this leads to the regularized gradient descent method, which forms the core of our proposed algorithm. To the best of our knowledge, our algorithm is the first algorithm for the blind deconvolution/blind demixing problem that is numerically efficient, is robust against noise and comes with rigorous recovery guarantees.

1.2. Notation

For a matrix Z, ∥Z∥ denotes its operator norm and ∥Z∥F its Frobenius norm. For a vector z, ∥z∥ is its Euclidean norm and $$\|\boldsymbol{z}\|_{\infty }$$ is the $$\ell _{\infty }$$-norm. For both matrices and vectors, Z* and z* denote their complex conjugate transpose. $$\bar{\boldsymbol{z}}$$ is the complex conjugate of z. We equip the matrix space $$\mathbb{C}^{K\times N}$$ with the inner product defined by ⟨U, V⟩ := Tr(U*V). For a given vector z, diag(z) represents the diagonal matrix whose diagonal entries are given by z. For any $$z\in \mathbb{R}$$, let $$z_{+} = \frac{z + |z|}{2}.$$

2. Preliminaries

Obviously, without any further assumptions, it is impossible to solve (1.1). Therefore, we impose the following subspace assumptions throughout our discussion [1,23].
Channel subspace assumption: each finite impulse response $${\boldsymbol{f}_{\!\!i}}\in \mathbb{C}^{L}$$ is assumed to have maximum delay spread K, i.e.,   \begin{align*} {\boldsymbol{f}_{\!\!i}} = \left[\begin{array}{@{}c@{}} \boldsymbol{h}_{i} \\ \boldsymbol{0} \end{array}\right]. \end{align*} Here $$\boldsymbol{h}_{i} \in{\mathbb{C}}^{K}$$ is the non-zero part of fi and fi(n) = 0 for n > K.

Signal subspace assumption: let $$\boldsymbol{g}_{i} : = {\boldsymbol{C}_{\!i}}\ \bar{\boldsymbol{x}}_{i}$$ be the outcome of the signal $$\bar{\boldsymbol{x}}_{i}\in \mathbb{C}^{N}$$ encoded by a matrix $${\boldsymbol{C}_{\!i}}\in \mathbb{C}^{L\times N}$$ with L > N, where the encoding matrix Ci is known and assumed to have full rank.1

Remark 2.1 Both subspace assumptions are common in various applications. For instance, in wireless communications, the channel impulse response can always be modeled to have finite support (or maximum delay spread, as it is called in engineering jargon) due to the physical properties of wave propagation [14]; and the signal subspace assumption is a standard feature of many current communication systems [14], including code-division multiple access (CDMA), where Ci is known as the spreading matrix, and orthogonal frequency-division multiplexing (OFDM), where Ci is known as the precoding matrix. The specific choice of the encoding matrices Ci depends on a variety of conditions. In this paper, we derive our theory by assuming that Ci is a complex Gaussian random matrix, i.e., each entry in Ci is i.i.d. $$\mathcal{C}\mathcal{N}(0,1)$$. This assumption, while sometimes imposed in the wireless communications literature, is somewhat unrealistic in practice, due to the lack of a fast algorithm to apply Ci and due to storage requirements. In practice one would rather choose Ci to be something like the product of a Hadamard matrix and a diagonal matrix with random binary entries.
We hope to address such more structured encoding matrices in our future research. Our numerical simulations (see Section 4) show no difference in the performance of our algorithm for either choice. Under the two assumptions above, the model takes a simpler form in the frequency domain. We assume throughout the paper that the convolution of finite sequences is circular convolution.2 By applying the Discrete Fourier Transform (DFT) to (1.1) along with the two assumptions, we have   \begin{align*} \frac{1}{\sqrt{L}}\boldsymbol{F} \boldsymbol{y} = \sum_{i=1}^{s}\operatorname{diag}(\boldsymbol{F} {\boldsymbol{f}_{\!\!i}})(\boldsymbol{F}\boldsymbol{C}_{i} \bar{\boldsymbol{x}}_{i}) + \frac{1}{\sqrt{L}}\boldsymbol{F}\boldsymbol{n},\end{align*} where F is the L × L unitary DFT matrix with F*F = FF* = IL. The noise is assumed to be additive white complex Gaussian noise with $$\boldsymbol{n}\sim \mathcal{C}\mathcal{N}(\boldsymbol{0}, \sigma ^{2}{d_{0}^{2}}\boldsymbol{I}_{L})$$ where $$d_{0} = \sqrt{\sum _{i=1}^{s} \|\boldsymbol{h}_{i0}\|^{2} \|\boldsymbol{x}_{i0}\|^{2}}$$, and $$\{(\boldsymbol{h}_{i0}, \boldsymbol{x}_{i0})\}_{i=1}^{s}$$ is the ground truth. We define $$d_{i0} = \|\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}$$ and assume without loss of generality that ∥hi0∥ and ∥xi0∥ are equal, i.e., $$\|\boldsymbol{h}_{i0}\| = \|\boldsymbol{x}_{i0}\| = \sqrt{d_{i0}}$$, which is justified by the scaling ambiguity.3 In this way, $$\frac{1}{\sigma ^{2}}$$ is a measure of the signal-to-noise ratio (SNR). Let $$\boldsymbol{h}_{i}\in \mathbb{C}^{K}$$ be the first K non-zero entries of fi and $$\boldsymbol{B}\in \mathbb{C}^{L\times K}$$ be a low-frequency DFT matrix (the first K columns of an L × L unitary DFT matrix). Then a simple relation holds,   \begin{align*} \boldsymbol{F}{\boldsymbol{f}_{\!\!i}} ={\boldsymbol{B}}{\boldsymbol{h}_{\!i}}, \quad \boldsymbol{B}^{*}\boldsymbol{B} = \boldsymbol{I}_{K}. 
\end{align*} We also denote $$\boldsymbol{A}_{i} := \overline{\boldsymbol{F}{\boldsymbol{C}_{\!i}}}$$ and $$\boldsymbol{e} := \frac{1}{\sqrt{L}}\boldsymbol{F}\boldsymbol{n}$$. Due to the unitarity of F and the Gaussianity of Ci and n, Ai also has a complex Gaussian distribution, and so does e. From now on, instead of focusing on the original model, we consider (with a slight abuse of notation) the following equivalent formulation throughout our discussion:   \begin{align} \boldsymbol{y} = \sum_{i=1}^{s} \operatorname{diag}(\boldsymbol{B}{\boldsymbol{h}_{\!i}})\overline{\boldsymbol{A}_{i}\boldsymbol{x}_{i}} + \boldsymbol{e}, \end{align} (2.1) where $$\boldsymbol{e} \sim \mathcal{C}\mathcal{N}\big(\boldsymbol{0}, \frac{\sigma ^{2}{d_{0}^{2}}}{L}\boldsymbol{I}_{L}\big)$$. Our goal is to estimate all $$\{\boldsymbol{h}_{i}, \boldsymbol{x}_{i}\}_{i=1}^{s}$$ from y, B and $$\{\boldsymbol{A}_{i}\}_{i=1}^{s}$$. Obviously, this is a bilinear inverse problem: if all $$\{\boldsymbol{h}_{i}\}_{i=1}^{s}$$ are given, it is a linear inverse problem (the ordinary demixing problem) to recover all $$\{\boldsymbol{x}_{i}\}_{i=1}^{s}$$, and vice versa. We note that there is a scaling ambiguity in all blind deconvolution problems that cannot be resolved by any reconstruction method without further information. Therefore, when we talk about exact recovery in the following, it is understood modulo such a trivial scaling ambiguity. Before proceeding to our proposed algorithm we introduce some notation to facilitate a more convenient presentation of our approach. Let bl be the lth column of B* and ail be the lth column of $$\boldsymbol{A}_{i}^{*}$$. Based on our assumptions the following properties hold:   \begin{align*} \sum_{l=1}^{L}\boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*} = \boldsymbol{I}_{K}, \quad \|\boldsymbol{b}_{l}\|^{2} = \frac{K}{L}, \quad \boldsymbol{a}_{il}\sim \mathcal{C}\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{N}). 
\end{align*} Moreover, inspired by the well-known lifting idea [1,6,9,22], we define the useful matrix-valued linear operator $$\mathcal{A}_{i} : \mathbb{C}^{K\times N} \to \mathbb{C}^{L}$$ and its adjoint $$\mathcal{A}_{i}^{*}:\mathbb{C}^{L}\rightarrow \mathbb{C}^{K\times N}$$ by   \begin{align} \mathcal{A}_{i}(\boldsymbol{Z}) := \{\boldsymbol{b}_{l}^{*}\boldsymbol{Z}\boldsymbol{a}_{il}\}_{l=1}^{L}, \quad \mathcal{A}^{*}_{i}(\boldsymbol{z}) := \sum_{l=1}^{L} z_{l} \boldsymbol{b}_{l}\boldsymbol{a}_{il}^{*} = \boldsymbol{B}^{*}\operatorname{diag}(\boldsymbol{z})\boldsymbol{A}_{i} \end{align} (2.2) for each 1 ≤ i ≤ s under canonical inner product over $$\mathbb{C}^{K\times N}.$$ Therefore, (2.1) can be written in the following equivalent form   \begin{align} \boldsymbol{y} = \sum_{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}) + \boldsymbol{e}. \end{align} (2.3) Hence, we can think of y as the observation vector obtained from taking linear measurements with respect to a set of rank-one matrices $$\{\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}\}_{i=1}^{s}.$$ In fact, with a bit of linear algebra (and ignoring the noise term for the moment), the lth entry of y in (2.3) equals the inner product of two block-diagonal matrices:   \begin{align} y_{l} = \left \langle \underbrace{ \left[\begin{array}{@{}cccc@{}} \boldsymbol{h}_{1,0}\boldsymbol{x}_{1,0}^{*} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{h}_{2,0}\boldsymbol{x}_{2,0}^{*} & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{h}_{s0}\boldsymbol{x}_{s0}^{*} \end{array}\right]}_{\textrm{defined as}\ \boldsymbol{X}_{0} }, \left[\begin{array}{@{}cccc@{}} \boldsymbol{b}_{l}\boldsymbol{a}_{1l}^{*} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{b}_{l}\boldsymbol{a}_{2l}^{*} & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & 
\boldsymbol{b}_{l}\boldsymbol{a}_{sl}^{*} \end{array} \right] \right\rangle + e_{l}, \end{align} (2.4) where $$y_{l} = \sum _{i=1}^{s} \boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\boldsymbol{a}_{il} + e_{l}, 1\leq l\leq L$$ and X0 is defined as the ground truth matrix. In other words, we aim to recover such a block-diagonal matrix X0 from L linear measurements with block structure if e = 0. By stacking all $$\{\boldsymbol{h}_{i}\}_{i=1}^{s}$$ (and $$\{\boldsymbol{x}_{i}\}_{i=1}^{s}, \{\boldsymbol{h}_{i0}\}_{i=1}^{s},\{\boldsymbol{x}_{i0}\}_{i=1}^{s}$$) into a long column, we let   \begin{align} \boldsymbol{h} := \left[\begin{array}{@{}c@{}} \boldsymbol{h}_{1} \\ \vdots\\ \boldsymbol{h}_{s} \end{array}\right], \quad \boldsymbol{h}_{0} := \left[\begin{array}{@{}c@{}} \boldsymbol{h}_{1,0} \\ \vdots\\ \boldsymbol{h}_{s0} \end{array}\right]\in\mathbb{C}^{Ks} ,\quad \boldsymbol{x} := \left[\begin{array}{@{}c@{}} \boldsymbol{x}_{1} \\ \vdots\\ \boldsymbol{x}_{s} \end{array}\right],\quad \boldsymbol{x}_{0} := \left[\begin{array}{@{}c@{}} \boldsymbol{x}_{1,0} \\ \vdots\\ \boldsymbol{x}_{s0} \end{array}\right] \in\mathbb{C}^{Ns}. \end{align} (2.5) We define $$\mathcal{H}$$ as a bilinear operator which maps a pair $$(\boldsymbol{h}, \boldsymbol{x})\in \mathbb{C}^{Ks}\times \mathbb{C}^{Ns}$$ into a block diagonal matrix in $$\mathbb{C}^{Ks\times Ns}$$, i.e.,   \begin{align} \mathcal{H}(\boldsymbol{h}, \boldsymbol{x}) := \left[\begin{array}{@{}cccc@{}} \boldsymbol{h}_{1}\boldsymbol{x}_{1}^{*} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{h}_{2}\boldsymbol{x}_{2}^{*} & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{h}_{s}\boldsymbol{x}_{s}^{*} \end{array}\right]\in\mathbb{C}^{Ks\times Ns}. 
\end{align} (2.6) Let $$\boldsymbol{X} := \mathcal{H}(\boldsymbol{h}, \boldsymbol{x})$$ and $$\boldsymbol{X}_{0} := \mathcal{H}(\boldsymbol{h}_{0}, \boldsymbol{x}_{0})$$ where X0 is the ground truth as illustrated in (2.4). Define the linear map $$\mathcal{A}:\mathbb{C}^{Ks\times Ns}\rightarrow \mathbb{C}^{L}$$ as   \begin{align} \mathcal{A}(\boldsymbol{Z}) := \sum_{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{Z}_{i}), \end{align} (2.7) where Z = blkdiag(Z1, ⋯ , Zs) and blkdiag denotes block-diagonal concatenation (as in the MATLAB function of the same name). Therefore, $$\mathcal{A}(\mathcal{H}(\boldsymbol{h}, \boldsymbol{x})) = \sum _{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$ and $$\boldsymbol{y} = \mathcal{A}(\mathcal{H}(\boldsymbol{h}_{0}, \boldsymbol{x}_{0})) + \boldsymbol{e}.$$ The adjoint operator $$\mathcal{A}^{*}$$ is defined naturally as   \begin{align} \mathcal{A}^{*}(\boldsymbol{z}) : = \left[\begin{array}{@{}cccc@{}} \mathcal{A}_{1}^{*}(\boldsymbol{z}) & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \mathcal{A}_{2}^{*}(\boldsymbol{z}) & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \mathcal{A}_{s}^{*}(\boldsymbol{z}) \end{array}\right]\in\mathbb{C}^{Ks\times Ns}, \end{align} (2.8) which is a linear map from $$\mathbb{C}^{L}$$ to $$\mathbb{C}^{Ks\times Ns}.$$ To measure how well X approximates the ground truth X0, we define δ(h, x) as the global relative error:   \begin{align} \delta(\boldsymbol{h},\boldsymbol{x}) := \frac{\|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}}{\|\boldsymbol{X}_{0}\|_{F}} = \frac{\sqrt{\sum_{i=1}^{s} \|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}^{2}}}{d_{0}} = \sqrt{\frac{\sum_{i=1}^{s}{\delta_{i}^{2}} d_{i0}^{2}}{ \sum_{i=1}^{s} d_{i0}^{2}}}, \end{align} (2.9) where δi := δi(hi, xi) is the relative error within each component:   \begin{align*} \delta_{i}(\boldsymbol{h}_{i},\boldsymbol{x}_{i}) := 
\frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}}. \end{align*} Note that δ and δi are functions of (h, x) and (hi, xi), respectively; in most cases we simply write δ and δi if no confusion is possible.

2.1. Convex versus non-convex approaches

As indicated in (2.4), joint blind deconvolution–demixing can be recast as the task of recovering a rank-s block-diagonal matrix from linear measurements. In general, such a low-rank matrix recovery problem is NP-hard. In order to take advantage of the low-rank property of the ground truth, it is natural to adopt a convex relaxation by solving the nuclear norm minimization program   \begin{align} \min \sum_{i=1}^{s} \|\boldsymbol{Z}_{i}\|_{*}, \quad s.t. \quad\sum_{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{Z}_{i}) = \boldsymbol{y}. \end{align} (2.10) The question of when the solution of (2.10) yields exact recovery was first answered in our previous work [23]. Later, [15,29] improved this result to the near-optimal bound L ≥ C0s(K + N) up to $$\log $$-factors; the main theoretical result is informally summarized in the following theorem.

Theorem 2.2 (Theorem 1.1 in [15]). Suppose that Ai are L × N i.i.d. complex Gaussian matrices and B is an L × K partial DFT matrix with B*B = IK. Then solving (2.10) gives exact recovery with probability at least 1 − L−γ if the number of measurements satisfies   \begin{align*} L \geq C_{\gamma} s(K+N)\log^{3}L \end{align*} where Cγ is a constant depending only linearly on γ.

While the semidefinite programming (SDP) relaxation is definitely effective and comes with theoretical performance guarantees, the computational costs of solving an SDP become prohibitive already for moderate-size problems, let alone for large-scale problems. Therefore, we look for a more efficient non-convex approach such as gradient descent, which is ideally also reinforced by theory.
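The equivalence between the frequency-domain model (2.1) and its lifted, rank-one-measurement form (2.3), on which the relaxation (2.10) operates, can be sanity-checked numerically. The following numpy sketch (with small, hypothetical dimensions) builds B and the matrices Ai, and verifies that $$\sum _{i}\operatorname{diag}(\boldsymbol{B}\boldsymbol{h}_{i})\overline{\boldsymbol{A}_{i}\boldsymbol{x}_{i}}$$ equals $$\sum _{i}\mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$:

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, N, s = 32, 4, 4, 2  # hypothetical dimensions

F = np.fft.fft(np.eye(L)) / np.sqrt(L)  # unitary DFT matrix
B = F[:, :K]                            # low-frequency DFT matrix, B^*B = I_K

# A_i = conj(F C_i), with C_i i.i.d. complex Gaussian encoding matrices
A = [np.conj(F @ (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))))
     for _ in range(s)]
h = [rng.standard_normal(K) + 1j * rng.standard_normal(K) for _ in range(s)]
x = [rng.standard_normal(N) + 1j * rng.standard_normal(N) for _ in range(s)]

# frequency-domain model (2.1), noiseless: y = sum_i diag(B h_i) conj(A_i x_i)
y_model = sum((B @ h[i]) * np.conj(A[i] @ x[i]) for i in range(s))

def calA_i(i, Z):
    # lifted operator (2.2): A_i(Z) = {b_l^* Z a_il}_{l=1}^L, where b_l and a_il
    # are the l-th columns of B^* and A_i^*, so b_l^* = B[l] and a_il = conj(A_i[l])
    return np.array([B[l] @ Z @ np.conj(A[i][l]) for l in range(L)])

# lifted model (2.3), noiseless: y = sum_i A_i(h_i x_i^*)
y_lift = sum(calA_i(i, np.outer(h[i], np.conj(x[i]))) for i in range(s))

print(np.allclose(y_model, y_lift))  # → True
```
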
It seems quite natural to achieve this goal by minimizing the following nonlinear least squares objective function with respect to (h, x):   \begin{align} F(\boldsymbol{h}, \boldsymbol{x}) : = \|\mathcal{A} \left(\mathcal{H}(\boldsymbol{h},\boldsymbol{x})\right) - \boldsymbol{y}\|^{2} = \left\|\sum_{i=1}^{s}\mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}) - \boldsymbol{y}\right\|^{2}. \end{align} (2.11) In particular, if e = 0, we write   \begin{align} F_{0}(\boldsymbol{h}, \boldsymbol{x}) : = \left\|\sum_{i=1}^{s}\mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*})\right\|^{2}. \end{align} (2.12) As also pointed out in [21], this is a highly non-convex optimization problem. Many commonly used algorithms, such as gradient descent or alternating minimization, do not necessarily converge to the global minimum, so we cannot always hope to obtain the desired solution. Often, these simple algorithms get stuck in local minima.

2.2. The basin of attraction

Motivated by several excellent recent papers on non-convex optimization for various signal processing and machine learning problems, we propose our two-step algorithm: (i) compute a carefully chosen initial guess; (ii) apply gradient descent to the objective function, starting with that initial guess. One difficulty in understanding non-convex optimization lies in how to construct the so-called basin of attraction, i.e., a region such that if the starting point lies inside it, the iterates will always stay inside and converge to the global minimum. The construction of the basin of attraction varies for different problems [3,8,34]. For this problem, similar to [21], the construction follows from the following three observations.
Each of these observations suggests the definition of a certain neighborhood, and the basin of attraction is then defined as the intersection of these three neighborhoods, $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }.$$

Ambiguity of solution: we can only recover (hi, xi) up to a scalar, since (αhi, α−1xi) and (hi, xi) are both solutions for $$\alpha \neq 0$$. From a numerical perspective, we want to avoid the scenario where $$\|\boldsymbol{h}_{i}\|\rightarrow 0$$ and $$\|\boldsymbol{x}_{i}\|\rightarrow \infty $$ while ∥hi∥∥xi∥ is fixed, which potentially leads to numerical instability. To balance the norms ∥hi∥ and ∥xi∥ for all 1 ≤ i ≤ s, we define   \begin{align*} \mathcal{N}_d := \left\{\left\{(\boldsymbol{h}_{i}, \boldsymbol{x}_{i})\right\}_{i=1}^{s}: \| \boldsymbol{h}_{i}\| \leq 2\sqrt{d_{i0}}, \| \boldsymbol{x}_{i}\| \leq 2\sqrt{d_{i0}}, 1\leq i\leq s \right\}, \end{align*} which is a convex set.

Incoherence: the performance depends on how large or small the incoherence $${\mu ^{2}_{h}}$$ is, where $${\mu _{h}^{2}}$$ is defined by   \begin{align*} {\mu^{2}_{h}} : = \max_{1\leq i\leq s} \frac{L\|\boldsymbol{B}\boldsymbol{h}_{i0}\|^{2}_{\infty}}{\|\boldsymbol{h}_{i0}\|^{2}}. \end{align*} The idea is that the smaller $${\mu ^{2}_{h}}$$ is, the better the performance is. Consider an extreme case: if Bhi0 is highly sparse or spiky, we lose much information on the zero/small entries and cannot hope to recover the signals satisfactorily. In other words, we need the ground truth hi0 to have ‘spectral flatness’, i.e., hi0 should not be highly localized in the Fourier domain. A similar quantity is also introduced in the matrix completion problem [7,34]. The larger $${\mu ^{2}_{h}}$$ is, the more hi0 is aligned with one particular row of B.
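To illustrate the incoherence parameter, the following numpy sketch (with hypothetical dimensions) computes $$L\|\boldsymbol{B}\boldsymbol{h}\|^{2}_{\infty }/\|\boldsymbol{h}\|^{2}$$ for two extreme channels: an impulse, whose frequency response Bh is perfectly flat (value 1, the best case), and a channel aligned with one row of B (value K, the worst case):

```python
import numpy as np

L, K = 64, 8  # hypothetical dimensions

F = np.fft.fft(np.eye(L)) / np.sqrt(L)  # unitary DFT matrix
B = F[:, :K]                            # low-frequency DFT matrix, B^*B = I_K

def mu_h_sq(h):
    # incoherence of a single channel: L * ||B h||_inf^2 / ||h||^2
    return L * np.max(np.abs(B @ h))**2 / np.linalg.norm(h)**2

h_flat = np.zeros(K)
h_flat[0] = 1.0            # impulse -> flat spectrum, |(Bh)_l| = 1/sqrt(L) for all l
h_spiky = np.conj(B[5])    # aligned with the 5th row of B -> spiky spectrum

print(mu_h_sq(h_flat))   # ≈ 1 (best case)
print(mu_h_sq(h_spiky))  # ≈ K (worst case)
```
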
To control the incoherence between bl and hi, we define the second neighborhood,   \begin{align} \mathcal{N}_{\mu} := \left\{ \{\boldsymbol{h}_{i}\}_{i=1}^{s} : \sqrt{L} \|\boldsymbol{B}\boldsymbol{h}_{i}\|_{\infty} \leq 4\sqrt{d_{i0}}\mu, 1\leq i\leq s\right\}, \end{align} (2.13) where μ is a parameter and μ ≥ μh. Note that $$\mathcal{N}_{\mu }$$ is also a convex set. Close to the ground truth: we also want to construct an initial guess such that it is close to the ground truth, i.e.,   \begin{align} \mathcal{N}_{\epsilon} := \left\{\left\{(\boldsymbol{h}_{i}, \boldsymbol{x}_{i})\right\}_{i=1}^{s}: \delta_{i} = \frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}} \leq \varepsilon, 1\leq i\leq s \right\}\!, \end{align} (2.14) where ε is a predetermined parameter in $$(0, \frac{1}{15}]$$. Remark 2.3 To ensure δi ≤ ε, it suffices to ensure $$\delta \leq \frac{\varepsilon }{\sqrt{s}\kappa }$$ where $$\kappa := \frac{\max d_{i0}}{\min d_{i0}} \geq 1$$. This is because   \begin{align*} \frac{1}{s\kappa^{2}}\sum_{i=1}^{s}{\delta_{i}^{2}} \leq \delta^{2} \leq \frac{\varepsilon^{2}}{s\kappa^{2}} \end{align*} which implies $$\max _{1\leq i\leq s}\delta _{i} \leq \varepsilon .$$ Remark 2.4 When we say $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_{d}, \mathcal{N}_{\mu }$$ or $$\mathcal{N}_{\epsilon }$$, it means for all $$i=1,\dots ,s$$ we have $$(\boldsymbol{h}_{i},\boldsymbol{x}_{i}) \in \mathcal{N}_d$$, $$\mathcal{N}_{\mu }$$ or $$\mathcal{N}_{\epsilon }$$, respectively. In particular, $$(\boldsymbol{h}_{0}, \boldsymbol{x}_{0}) \in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ where h0 and x0 are defined in (2.5). 2.3. Objective function and Wirtinger derivative To implement the first two observations, we introduce the regularizer G(h, x), defined as the sum of s components   \begin{align} G(\boldsymbol{h}, \boldsymbol{x}):= \sum_{i=1}^{s} G_{i}(\boldsymbol{h}_{i},\boldsymbol{x}_{i}) . 
\end{align} (2.15) For each component Gi(hi, xi), we let ρ ≥ d2 + 2∥e∥2, 0.9d0 ≤ d ≤ 1.1d0, 0.9di0 ≤ di ≤ 1.1di0 for all 1 ≤ i ≤ s and   \begin{align} G_{i} := \rho \left[ \underbrace{G_{0}\left(\frac{\|\boldsymbol{h}_{i}\|^{2}}{2d_{i}}\right) + G_{0}\left(\frac{\|\boldsymbol{x}_{i}\|^{2}}{2d_{i}}\right)}_{ \mathcal{N}_d } + \underbrace{\sum_{l=1}^{L}G_{0}\left(\frac{L |\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}|^{2}}{8d_{i}\mu^{2} }\right)}_{\mathcal{N}_{\mu}} \right], \end{align} (2.16) where $$G_{0}(z) = \max \{z-1, 0\}^{2}$$. Here both d and $$\{d_{i}\}_{i=1}^{s}$$ are data-driven and well approximated by our spectral initialization procedure; and μ2 is a tuning parameter which could be estimated if we assume a specific statistical model for the channel (for example, in the widely used Rayleigh fading model, the channel coefficients are assumed to be complex Gaussian). The idea behind Gi is quite straightforward, even though the formulation looks complicated. For each Gi in (2.16), the first two terms force the iterates to lie in $$\mathcal{N}_d$$, and the third term encourages the iterates to lie in $$\mathcal{N}_{\mu }.$$ What about the neighborhood $$\mathcal{N}_{\epsilon }$$? A proper choice of the initialization, followed by gradient descent which keeps the objective function decreasing, will ensure that the iterates stay in $$\mathcal{N}_{\epsilon }$$. Finally, we consider the objective function as the sum of the nonlinear least squares objective F(h, x) in (2.11) and the regularizer G(h, x),   \begin{align} \widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) := F(\boldsymbol{h},\boldsymbol{x}) + G(\boldsymbol{h}, \boldsymbol{x}). \end{align} (2.17) Note that the input of the function $$\widetilde{F}(\boldsymbol{h},\boldsymbol{x})$$ consists of complex variables, but the output is real-valued.
Since the output of $$\widetilde{F}$$ is real-valued, the following simple relations hold   \begin{align*} \frac{\partial \widetilde{F}}{\partial \bar{\boldsymbol{h}}_{i}} = \overline{\frac{\partial \widetilde{F}}{\partial \boldsymbol{h}_{i}} }, \quad \frac{\partial \widetilde{F}}{\partial \bar{\boldsymbol{x}}_{i}} = \overline{\frac{\partial \widetilde{F}}{\partial \boldsymbol{x}_{i}} }. \end{align*} Similar properties also apply to both F(h, x) and G(h, x). Therefore, to minimize this function, it suffices to consider only the gradient of $$\widetilde{F}$$ with respect to $$\bar{\boldsymbol{h}}_{i}$$ and $$\bar{\boldsymbol{x}}_{i}$$, which is also called the Wirtinger derivative [8]. The Wirtinger derivatives of F(h, x) and G(h, x) w.r.t. $$\bar{\boldsymbol{h}}_{i}$$ and $$\bar{\boldsymbol{x}}_{i}$$ can be easily computed as follows   \begin{align} \nabla F_{\boldsymbol{h}_{i}} & = \mathcal{A}_{i}^{*}\left(\mathcal{A}(\boldsymbol{X})- \boldsymbol{y}\right)\boldsymbol{x}_{i} = \mathcal{A}_{i}^{*}\left(\mathcal{A}(\boldsymbol{X}-\boldsymbol{X}_{0})- \boldsymbol{e} \right)\boldsymbol{x}_{i}, \\ \nabla F_{\boldsymbol{x}_{i}} & = \left(\mathcal{A}_{i}^{*}\left(\mathcal{A}(\boldsymbol{X}) - \boldsymbol{y}\right)\right)^{*}\boldsymbol{h}_{i} = \left(\mathcal{A}_{i}^{*}\left(\mathcal{A}(\boldsymbol{X}-\boldsymbol{X}_{0}) - \boldsymbol{e}\right)\right)^{*}\boldsymbol{h}_{i}, \\ \nabla G_{\boldsymbol{h}_{i}} & = \frac{\rho}{2d_{i}}\left[G^{\prime}_{0}\left(\frac{\|\boldsymbol{h}_{i}\|^{2}}{2d_{i}}\right) \boldsymbol{h}_{i} + \frac{L}{4\mu^{2}} \sum_{l=1}^{L} G^{\prime}_{0}\left(\frac{L|\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}|^{2}}{8d_{i}\mu^{2}}\right) \boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i} \right], \\ \nabla G_{\boldsymbol{x}_{i}} & = \frac{\rho}{2d_{i}} G^{\prime}_{0}\left( \frac{\|\boldsymbol{x}_{i}\|^{2}}{2d_{i}}\right) \boldsymbol{x}_{i}, \end{align} (2.18) where $$\mathcal{A}(\boldsymbol{X}) = \sum _{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$ and $$\mathcal{A}^{*}$$ 
is defined in (2.8). In short, we denote   \begin{align} \nabla\widetilde{F}_{\boldsymbol{h}} : = \nabla F_{\boldsymbol{h}} + \nabla G_{\boldsymbol{h}}, \quad \nabla F_{\boldsymbol{h}} : =\left[ \begin{array}{@{}c@{}} \nabla F_{\boldsymbol{h}_{1}} \\ \vdots \\ \nabla F_{\boldsymbol{h}_{s}} \end{array}\right], \quad \nabla G_{\boldsymbol{h}} : =\left[ \begin{array}{@{}c@{}} \nabla G_{\boldsymbol{h}_{1}} \\ \vdots \\ \nabla G_{\boldsymbol{h}_{s}} \end{array}\right]. \end{align} (2.22) Similar definitions hold for $$\nabla \widetilde{F}_{\boldsymbol{x}},\nabla F_{\boldsymbol{x}}$$ and $$\nabla G_{\boldsymbol{x}}$$. It is easy to see that $$\nabla F_{\boldsymbol{h}} = \mathcal{A}^{*}(\mathcal{A}(\boldsymbol{X}) - \boldsymbol{y})\boldsymbol{x}$$ and $$\nabla F_{\boldsymbol{x}} = (\mathcal{A}^{*}(\mathcal{A}(\boldsymbol{X}) - \boldsymbol{y}))^{*}\boldsymbol{h}$$. 3. Algorithm and theory 3.1. Two-step algorithm As mentioned before, the first step is to find a good initial guess $$(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)})\in \mathbb{C}^{Ks}\times \mathbb{C}^{Ns}$$ such that it is inside the basin of attraction. 
The initialization follows from this key fact:   \begin{align*} \operatorname{\mathbb{E}}(\mathcal{A}_{i}^{*}(\boldsymbol{y})) = \operatorname{\mathbb{E}}\left(\mathcal{A}_{i}^{*}\left(\sum_{j=1}^{s}\mathcal{A}_{j}(\boldsymbol{h}_{j0}\boldsymbol{x}_{j0}^{*} )+\boldsymbol{e}\right)\right) = \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}, \end{align*} where we use $$\boldsymbol{B}^{*}\boldsymbol{B} = \sum _{l=1}^{L}\boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*} = \boldsymbol{I}_{K}$$, $$\operatorname{\mathbb{E}}(\boldsymbol{a}_{il}\boldsymbol{a}_{il}^{*}) = \boldsymbol{I}_{N}$$ and   \begin{align*} \operatorname{\mathbb{E}}(\mathcal{A}_{i}^{*}\mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*})) & = \sum_{l=1}^{L} \boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \operatorname{\mathbb{E}}(\boldsymbol{a}_{il}\boldsymbol{a}_{il}^{*}) = \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}, \\ \operatorname{\mathbb{E}}(\mathcal{A}_{j}^{*}\mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*})) & = \sum_{l=1}^{L} \boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \operatorname{\mathbb{E}}(\boldsymbol{a}_{il}\boldsymbol{a}_{jl}^{*}) = \boldsymbol{0},\quad \forall j\neq i. \end{align*} Therefore, it is natural to extract the leading singular value and associated left and right singular vectors from each $$\mathcal{A}_{i}^{*}(\boldsymbol{y})$$ and use them as (a hopefully good) approximation to (di0, hi0, xi0). This idea leads to Algorithm 1, whose theoretical guarantees are given in Section 6.5. 
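The expectation identity above suggests the following sketch of the spectral step. This is a simplified version of Algorithm 1: we take only the leading singular triple and omit the projection and scaling details of the full algorithm; `adjoint_ops` is a hypothetical list of callables returning the K × N matrices $$\mathcal{A}_{i}^{*}(\boldsymbol{y})$$.

```python
import numpy as np

def spectral_initialization(y, adjoint_ops):
    """For each i, approximate (d_i0, h_i0, x_i0) by the leading singular
    triple of A_i^*(y), whose expectation is the rank-one matrix h_i0 x_i0^*."""
    estimates = []
    for A_adj in adjoint_ops:
        M = A_adj(y)                            # K x N matrix A_i^*(y)
        U, S, Vh = np.linalg.svd(M)
        d_i = S[0]                              # leading singular value ~ d_i0
        u0 = np.sqrt(d_i) * U[:, 0]             # scaled left singular vector
        v0 = np.sqrt(d_i) * Vh[0, :].conj()     # scaled right singular vector
        estimates.append((u0, v0, d_i))
    return estimates
```

In the noiseless expectation $$\mathcal{A}_{i}^{*}(\boldsymbol{y})$$ is exactly rank one, so $$\boldsymbol{u}_{0}\boldsymbol{v}_{0}^{*}$$ recovers $$\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}$$ (up to the usual scaling ambiguity between the factors); for finite $$L$$ the matrix is only approximately rank one, which is what the guarantees referenced above control.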
The second step of the algorithm is simply to apply gradient descent to $$\widetilde{F}$$ with the initial guess $$\{(\boldsymbol{u}^{(0)}_{i}, \boldsymbol{v}^{(0)}_{i}, d_{i})\}_{i=1}^{s}$$ or $$(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)},\{ d_{i}\}_{i=1}^{s})$$, where u(0) stems from stacking all $$\boldsymbol{u}^{(0)}_{i}$$ into one long vector.4 Remark 3.1 For Algorithm 2, we can rewrite each iteration as  \begin{align*} \boldsymbol{u}^{(t)} = \boldsymbol{u}^{(t-1)} - \eta\nabla \widetilde{F}_{\boldsymbol{h}}(\boldsymbol{u}^{(t-1)}, \boldsymbol{v}^{(t-1)}), \quad\boldsymbol{v}^{(t)} = \boldsymbol{v}^{(t-1)} - \eta\nabla \widetilde{F}_{\boldsymbol{x}}(\boldsymbol{u}^{(t-1)}, \boldsymbol{v}^{(t-1)}), \end{align*} where $$\nabla \widetilde{F}_{\boldsymbol{h}}$$ and $$\nabla \widetilde{F}_{\boldsymbol{x}}$$ are in (2.22), and   \begin{align*} \boldsymbol{u}^{(t)} : =\left[ \begin{array}{@{}c@{}} \boldsymbol{u}_{1}^{(t)} \\ \vdots \\ \boldsymbol{u}_{s}^{(t)} \end{array}\right], \quad \boldsymbol{v}^{(t)} : =\left[ \begin{array}{@{}c@{}} \boldsymbol{v}_{1}^{(t)} \\ \vdots \\ \boldsymbol{v}_{s}^{(t)} \end{array}\right]. \end{align*} 3.2. Main results Our main findings are summarized as follows: Theorem 3.2 shows that the initial guess given by Algorithm 1 indeed belongs to the basin of attraction. Moreover, di also serves as a good approximation of di0 for each i. Theorem 3.3 demonstrates that the regularized Wirtinger gradient descent guarantees the linear convergence of the iterates, and that the recovery is exact in the noise-free case and stable in the presence of noise. 
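Before stating the theorems, the update rule from Remark 3.1, combined with the stepsize-halving backtracking used later in the experiments of Section 4, can be sketched generically. The callables `grad` and `objective`, and the stepsize floor `1e-12`, are our own illustrative assumptions rather than details fixed by the paper.

```python
def regularized_gradient_descent(z0, grad, objective, eta0, T):
    """Algorithm 2 sketch: z^(t+1) = z^(t) - eta * grad(z^(t)), where z stacks
    (u, v) and grad returns the stacked Wirtinger gradient of the regularized
    objective.  The stepsize is halved whenever the trial step would increase
    the objective, as in the backtracking scheme of the numerical section."""
    z = z0
    for _ in range(T):
        g = grad(z)
        eta = eta0
        # Halve eta until the objective does not increase (or eta is tiny).
        while objective(z - eta * g) > objective(z) and eta > 1e-12:
            eta /= 2.0
        z = z - eta * g
    return z
```

On a simple smooth objective this reduces to plain gradient descent with an adaptively shrunken stepsize; the theorems below make precise when the same scheme, applied to $$\widetilde{F}$$, converges linearly to the ground truth.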
Theorem 3.2 The initialization obtained via Algorithm 1 satisfies   \begin{align} (\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)}) \in \frac{1}{\sqrt{3}}\mathcal{N}_d\bigcap \frac{1}{\sqrt{3}} \mathcal{N}_{\mu}\bigcap \mathcal{N}_{\frac{2\varepsilon}{5\sqrt{s}\kappa}} \end{align} (3.1) and   \begin{align} 0.9d_{i0} \leq d_{i}\leq 1.1d_{i0},\quad 0.9d_{0} \leq d\leq 1.1d_{0}, \end{align} (3.2) hold with probability at least $$1 - L^{-\gamma +1}$$ if the number of measurements satisfies   \begin{align} L \geq C_{\gamma+\log(s)}({\mu_{h}^{2}} + \sigma^{2})s^{2} \kappa^{4} \max\{K,N\}\log^{2} L/\varepsilon^{2}. \end{align} (3.3) Here ε is any predetermined constant in $$(0, \frac{1}{15}]$$, and Cγ is a constant depending only linearly on γ with γ ≥ 1. Theorem 3.3 Starting with an initial value z(0) := (u(0), v(0)) satisfying (3.1), Algorithm 2 creates a sequence of iterates (u(t), v(t)) that converges to the global minimum linearly,   \begin{align} \|\mathcal{H}(\boldsymbol{u}^{(t)}, \boldsymbol{v}^{(t)}) - \mathcal{H}(\boldsymbol{h}_{0}, \boldsymbol{x}_{0}) \|_{F} \leq \frac{\varepsilon d_{0}}{\sqrt{2s\kappa^{2}}}(1 - \eta\omega)^{t/2} + 60\sqrt{s} \|\mathcal{A}^{*}(\boldsymbol{e})\| \end{align} (3.4) with probability at least $$1 - L^{-\gamma +1}$$, where $$\eta \omega = \mathcal{O}((s\kappa d_{0}(K+N)\log ^{2}L)^{-1})$$ and   \begin{align*} \|\mathcal{A}^{*}(\boldsymbol{e})\| \leq C_{0} \sigma d_{0}\sqrt{\frac{\gamma s(K + N)(\log^{2}L)}{L}} \end{align*} if the number of measurements L satisfies   \begin{align} L \geq C_{\gamma+\log (s)}(\mu^{2} + \sigma^{2})s^{2} \kappa^{4} \max\{K,N\}\log^{2} L/\varepsilon^{2}. \end{align} (3.5) Remark 3.4 Our previous work [23] shows that the convex approach via semidefinite programming (see (2.10)) requires $$L \geq C_{0}s^{2}(K +{\mu ^{2}_{h}} N)\log ^{3}(L)$$ to ensure exact recovery. Later, [15] improved this result to the near-optimal bound $$L\geq C_{0}s(K +{\mu ^{2}_{h}} N)$$ up to some $$\log $$-factors. 
The difference between the non-convex and convex methods lies in the appearance of the condition number κ in (3.5). This is not just an artifact of the proof—empirically we also observe that the value of κ affects the convergence rate of our non-convex algorithm, see Fig. 5. Remark 3.5 Our theory suggests an $$s^{2}$$-dependence for the number of measurements L, although numerically L in fact depends on s linearly, as shown in Section 4. The reason for the $$s^{2}$$-dependence will be addressed in detail in Section 5.2. Remark 3.6 In the theoretical analysis, we assume that Ai (or equivalently Ci) is a Gaussian random matrix. Numerical simulations suggest that this assumption is not necessary. For example, Ci may be chosen to be a Hadamard-type matrix, which is more appropriate and favorable for communications. Remark 3.7 If e = 0, (3.4) shows that (u(t), v(t)) converges to the ground truth at a linear rate. On the other hand, if noise exists, (u(t), v(t)) is guaranteed to converge to a point within a small neighborhood of (h0, x0). More importantly, as the number of measurements L gets larger, $$\|\mathcal{A}^{*}(\boldsymbol{e})\|$$ decays at the rate of $$\mathcal{O}(L^{-1/2})$$. 4. Numerical simulations In this section we present a range of numerical simulations to illustrate and complement different aspects of our theoretical framework. We will empirically analyze the number of measurements needed for perfect joint deconvolution/demixing to see how this compares to our theoretical bounds. We will also study the robustness to noisy data. In our simulations we use Gaussian encoding matrices, as in our theorems. But we also try more realistic structured encoding matrices that are more reminiscent of what one might come across in wireless communications. While Theorem 3.3 says that the number of measurements L depends quadratically on the number of sources s, numerical simulations suggest near-optimal performance. 
Figure 2 demonstrates that L actually depends linearly on s, i.e., the boundary between success (white) and failure (black) is approximately a linear function of s. In the experiment, K = N = 50 are fixed, all Ai are complex Gaussian matrices and all (hi, xi) are standard complex Gaussian vectors. For each pair of (L, s), 25 experiments are performed and we treat the recovery as a success if $$\frac{\|\hat{\boldsymbol{X}} - \boldsymbol{X}_{0}\|_{F}}{\|\boldsymbol{X}_{0}\|_{F}} \leq 10^{-3}.$$ For our algorithm, we use backtracking to determine the stepsize and the iteration stops either if $$\|\mathcal{A}(\mathcal{H}(\boldsymbol{h}^{(t+1)}, \boldsymbol{x}^{(t+1)}) - \mathcal{H}(\boldsymbol{h}^{(t)}, \boldsymbol{x}^{(t)})) \| < 10^{-6}\|\boldsymbol{y}\|$$ or if the number of iterations reaches 500. The backtracking is based on the Armijo–Goldstein condition [25]. The initial stepsize is chosen to be $$\eta = \frac{1}{K+N}$$. If $$\widetilde{F}(\boldsymbol{z}^{(t)} - \eta \nabla \widetilde{F}(\boldsymbol{z}^{(t)}))> \widetilde{F}(\boldsymbol{z}^{(t)})$$, we just divide η by two and use the smaller stepsize. Fig. 2. Phase transition plot for empirical recovery performance under different choices of (L, s) where K = N = 50 are fixed. Black region: failure; white region: success. The red solid line depicts the number of degrees of freedom and the green dashed line shows the empirical phase transition bound for Algorithm 2. Fig. 3. Empirical probability of successful recovery for different pairs of (L, s) when K = N = 50 are fixed. 
Fig. 4. Relative error vs. SNR (dB): SNR = $$20\log _{10}\left (\frac{\|\boldsymbol{y}\|}{\|\boldsymbol{e}\|}\right )$$. We see from Fig. 2 that the number of measurements for the proposed algorithm to succeed not only seems to depend linearly on the number of sensors, but is actually rather close to the information-theoretic limit s(K + N). Indeed, the green dashed line in Fig. 2, which represents the empirical boundary for the phase transition between success and failure, corresponds to $$L \approx \frac{3}{2} s(K+N)$$. It is interesting to compare this empirical performance to the sharp theoretical phase transition bounds one would obtain via convex optimization [10,26]. Considering the convex approach based on lifting in [23], we can adapt the theoretical framework in [10] to the blind deconvolution/demixing setting, but with one modification. The bounds in [10] rely on Gaussian widths of tangent cones related to the measurement matrices $$\mathcal{A}_{i}$$. Since simple analytic formulas for these expressions seem to be out of reach for the structured rank-one measurement matrices used in our paper, we instead compute the bounds for full-rank Gaussian random matrices, which yields a sharp bound of about 3s(K + N) (the corresponding bounds for rank-one sensing matrices will likely have a constant larger than 3). Note that these sharp theoretical bounds predict quite accurately the empirical behavior of convex methods. Thus our empirical bound for using a non-convex method compares rather favorably with that of the convex approach. Similar conclusions can be drawn from Fig. 
3; there, all Ai are of the form Ai = FDiH, where F is the unitary L × L DFT matrix, the Di are independent diagonal binary ±1 matrices and H is a fixed deterministic L × N partial Hadamard matrix. The purpose of the Di is to enhance the incoherence between the channels so that our algorithm is able to tell apart the individual signals and channels. As before we assume Gaussian channels, i.e., $$\boldsymbol{h}_{i}\sim \mathcal{C}\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{K})$$. Hence our approach not only works for Gaussian encoding matrices Ai, but also for matrices that are relevant in real-world applications, although no satisfactory theory has been derived yet for that case. Moreover, due to the structure of Ai and B, fast transform algorithms are available, potentially allowing for real-time deployment. Figure 4 shows the robustness of our algorithm under different levels of noise. We again run 25 samples for each level of SNR and each L, and then compute the average relative error. The relative error, measured in dB, scales linearly with the SNR: each additional dB of SNR results in roughly one dB of decrease in the relative error. Theorem 3.3 suggests that the performance and convergence rate actually depend on the condition number of $$\boldsymbol{X}_{0} = \mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0})$$, i.e., on $$\kappa = \frac{\max d_{i0}}{\min d_{i0}}$$ where di0 = ∥hi0∥∥xi0∥. Next we demonstrate that this dependence on the condition number is not an artifact of the proof, but is indeed also observed empirically. In this experiment, we let s = 2 and set $$d_{1,0} = 1$$ for the first component and $$d_{2,0} = \kappa$$ for the second, with κ ∈ {1, 2, 5}. Here, κ = 1 means that the received signals of both sensors have equal power, whereas κ = 5 means that the signal received from the second sensor is considerably stronger. The initial stepsize is chosen as η = 1, followed by the backtracking scheme. 
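The Hadamard-type encoding matrices used in Fig. 3 can be generated along the following lines. This is a sketch under our own assumptions: the exact normalization and the choice of columns for the partial Hadamard matrix are not specified in the text, so the choices below (Sylvester construction, first N columns, unitary scaling) are illustrative.

```python
import numpy as np

def partial_hadamard(L, N):
    """Deterministic L x N partial Hadamard matrix (Sylvester construction,
    L a power of two), scaled to have orthonormal columns."""
    H = np.array([[1.0]])
    while H.shape[0] < L:
        H = np.block([[H, H], [H, -H]])
    return H[:, :N] / np.sqrt(L)

def structured_encoder(L, N, rng):
    """A_i = F D_i H: unitary DFT times an independent random +/-1 diagonal
    times a partial Hadamard matrix.  The random D_i is what decorrelates
    the different users' encoders."""
    F = np.fft.fft(np.eye(L)) / np.sqrt(L)          # unitary L x L DFT matrix
    D = np.diag(rng.choice([-1.0, 1.0], size=L))    # random sign flips
    return F @ D @ partial_hadamard(L, N)
```

Since F, Di and the normalized H all have orthonormal columns, so does each Ai, and matrix–vector products can be applied via the FFT and fast Hadamard transform rather than dense matrices, which is the fast-transform advantage mentioned above.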
Figure 5 shows how the relative error decays with the number of iterations t for different condition numbers κ and values of L. The larger κ is, the slower the convergence, as we see from Fig. 5. There may be two reasons for this: our spectral initialization may not be able to give a good initial guess for the weak components; moreover, during the gradient descent procedure, the gradient directions for the weak components could be dominated/polluted by the strong components. Currently we have no effective way to deal with this issue of slow convergence when κ is not small, and we leave this topic for future investigations. 5. Convergence analysis Our convergence analysis relies on the following four conditions, the first three of which are local properties. We will also briefly discuss how they contribute to the proof of our main theorem. Note that our previous work [21] on blind deconvolution is actually a special case (s = 1) of (2.1). The proof of Theorem 3.3 follows in part the main ideas in [21], and the reader will find that the technical parts of [21] and this manuscript share many similarities. However, there are also important differences. After all, we are now dealing with a more complicated problem, where the ground truth matrix X0 and the measurement matrices are both rank-s block-diagonal matrices, as shown in (2.4), instead of the rank-one matrices in [21]. The key is to understand the properties of the linear operator $$\mathcal{A}$$ applied to different types of block-diagonal matrices. Therefore, many technical details are much more involved, while on the other hand some of the results in [21] can be used directly. Throughout the presentation, we will clearly point out both the similarities to and differences from [21]. Fig. 5. Relative error vs. number of iterations t. 5.1. 
Four key conditions Condition 5.1 Local regularity condition: let $$\boldsymbol{z} : = (\boldsymbol{h}, \boldsymbol{x})\in \mathbb{C}^{s(K+N)}$$ and $$\nabla \widetilde{F}(\boldsymbol{z}) := \left [ {{\nabla \widetilde{F}_{\boldsymbol{h}}(\boldsymbol{z})} \atop {\nabla \widetilde{F}_{\boldsymbol{x}}(\boldsymbol{z})}}\right ] \in \mathbb{C}^{s(K+N)}$$; then   \begin{align} \|\nabla \widetilde{F}(\boldsymbol{z})\|^{2} \geq \omega [\widetilde{F}(\boldsymbol{z}) - c]_{+} \end{align} (5.1) for $$\boldsymbol{z} \in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ where $$\omega = \frac{d_{0}}{7000}$$ and $$c = \|\boldsymbol{e}\|^{2} + 2000s\|\mathcal{A}^{*}(\boldsymbol{e})\|^{2}.$$ We will prove Condition 5.1 in Section 6.3. Condition 5.1 states that $$\widetilde{F}(\boldsymbol{z}) = 0$$ if $$\|\nabla \widetilde{F}(\boldsymbol{z})\| = 0$$ and e = 0, i.e., all the stationary points inside the basin of attraction are global minima. Condition 5.2 Local smoothness condition: let z = (h, x) and w = (u, v); then   \begin{align} \|\nabla\widetilde{F}(\boldsymbol{z} + \boldsymbol{w}) - \nabla\widetilde{F}(\boldsymbol{z})\| \leq C_{L} \|\boldsymbol{w}\| \end{align} (5.2) for z + w and z inside $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$, where $$C_{L} \approx \mathcal{O}(d_{0}s\kappa (1 + \sigma ^{2})(K + N)\log ^{2} L )$$ is the Lipschitz constant of $$\nabla\widetilde{F}$$ over $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$. The convergence rate is governed by CL. The proof of Condition 5.2 can be found in Section 6.4. Condition 5.3 Local restricted isometry property: denote $$\boldsymbol{X} = \mathcal{H}(\boldsymbol{h}, \boldsymbol{x})$$ and $$\boldsymbol{X}_{0} = \mathcal{H}(\boldsymbol{h}_{0}, \boldsymbol{x}_{0})$$. 
There holds   \begin{align} \frac{2}{3} \|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}^{2} \leq \left\| \mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) \right\|^{2} \leq \frac{3}{2} \|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}^{2} \end{align} (5.3) uniformly for all $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$. Condition 5.3 will be proven in Section 6.2. It says that the convergence of the objective function implies the convergence of the iterates. Remark 5.4 (Necessity of inter-user incoherence). Although Condition 5.3 is seemingly the same as the one in our previous work [21], it is in fact very different. Recall that $$\mathcal{A}$$ is a linear operator acting on block-diagonal matrices whose output is the sum of s different components involving $$\mathcal{A}_{i}$$. Therefore, the proof of Condition 5.3 heavily depends on the inter-user incoherence, whereas this notion of incoherence is not needed at all in the single-user scenario. At the beginning of Section 2, we discussed the choice of Ci (or Ai). In order to distinguish one user from another, it is essential to use sufficiently different5 encoding matrices Ci (or Ai). Here the independence and Gaussianity of all Ci (or Ai) guarantee that $$\|\mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}}\|$$ is sufficiently small for all $$i\neq j$$, where Ti is defined in (6.1). This is a key element in ensuring the validity of Condition 5.3, which in turn is an important component in the proof of Condition 5.1. On the other hand, due to recent progress on this joint deconvolution and demixing problem, one is also able to prove a local restricted isometry property with tools such as bounding the suprema of chaos processes [15], by assuming the $$\{\boldsymbol{A}_{i}\}_{i=1}^{s}$$ to be Gaussian matrices. Condition 5.5 Robustness condition: let $$\varepsilon \leq \frac{1}{15}$$ be a predetermined constant. 
We have   \begin{align} \|\mathcal{A}^{*}(\boldsymbol{e})\| = \max_{1\leq i\leq s}\|\mathcal{A}_{i}^{*}(\boldsymbol{e})\| \leq \frac{\varepsilon d_{0}}{10\sqrt{2}s \kappa}, \end{align} (5.4) where $$\boldsymbol{e}\sim \mathcal{C}\mathcal{N}\big(0, \frac{\sigma ^{2}{d_{0}^{2}}}{L}\big)$$, if $$L \geq C_{\gamma}\kappa^{2}s^{2}(K + N)/\varepsilon^{2}$$. We will prove Condition 5.5 in Section 6.5. We now extract one useful consequence of Conditions 5.3 and 5.5: together, they yield a good approximation of F(h, x), in terms of δ in (2.9), for all $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$. Namely, for $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$, the following inequality holds   \begin{align} \frac{2}{3}\delta^{2}{d_{0}^{2}} -\frac{\varepsilon\delta{d_{0}^{2}}}{5\sqrt{s}\kappa} + \|\boldsymbol{e}\|^{2} \leq F(\boldsymbol{h}, \boldsymbol{x}) \leq \frac{3}{2}\delta^{2}{d_{0}^{2}} + \frac{\varepsilon\delta{d_{0}^{2}}}{5\sqrt{s}\kappa} + \|\boldsymbol{e}\|^{2}. \end{align} (5.5) Note that (5.5) simply follows from   \begin{align*} F(\boldsymbol{h}, \boldsymbol{x}) = \| \mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) \|^{2} - 2\operatorname{Re}(\langle \boldsymbol{X}- \boldsymbol{X}_{0}, \mathcal{A}^{*}(\boldsymbol{e})\rangle) + \|\boldsymbol{e}\|^{2}, \end{align*} since (5.3) implies $$\frac{2}{3}\delta ^{2}{d_{0}^{2}}\leq \|\mathcal{A}(\boldsymbol{X}-\boldsymbol{X}_{0})\|^{2}\leq \frac{3}{2}\delta ^{2}{d_{0}^{2}}$$. 
Thus it suffices to estimate the cross-term,   \begin{align} |\!\operatorname{Re}(\langle \boldsymbol{X}- \boldsymbol{X}_{0}, \mathcal{A}^{*}(\boldsymbol{e})\rangle)| & \leq \|\mathcal{A}^{*}(\boldsymbol{e})\| \|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{*} = \|\mathcal{A}^{*}(\boldsymbol{e})\| \sum_{i=1}^{s}\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{*} \nonumber \\ & \leq \sqrt{2}\|\mathcal{A}^{*}(\boldsymbol{e})\| \sum_{i=1}^{s}\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F} \nonumber \\ & \leq \sqrt{2s} \|\mathcal{A}^{*}(\boldsymbol{e})\| \|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F} \leq \frac{\varepsilon \delta{d_{0}^{2}}}{10\sqrt{s}\kappa}, \end{align} (5.6) where ∥⋅∥* and ∥⋅∥ are a pair of dual norms and the bound on $$\|\mathcal{A}^{*}(\boldsymbol{e})\|$$ comes from (5.4). 5.2. Outline of the convergence analysis To ease the proof, we introduce another neighborhood:   \begin{align*} \mathcal{N}_{\widetilde{F}} = \left\{ (\boldsymbol{h},\boldsymbol{x}) : \widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) \leq \frac{\varepsilon^{2}{d_{0}^{2}}}{3s\kappa^{2}} + \|\boldsymbol{e}\|^{2}\right\}. \end{align*} A further reason to consider $$\mathcal{N}_{\widetilde{F}}$$ is that gradient descent decreases the objective function provided the stepsize is chosen appropriately. In other words, all the iterates z(t) generated by gradient descent stay inside $$\mathcal{N}_{\widetilde{F}}$$ as long as $$\boldsymbol{z}^{(0)}\in \mathcal{N}_{\widetilde{F}}.$$ On the other hand, it is crucial to note that the decrease of the objective function does not necessarily imply the decrease of the relative error of the iterates. Therefore, we want to construct an initial guess in $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ so that z(0) is sufficiently close to the ground truth, and then analyze the behavior of z(t). 
In the rest of this section, we essentially prove the following relation:   \begin{align*} \underbrace{ \frac{1}{\sqrt{3}}\mathcal{N}_d\cap \frac{1}{\sqrt{3}} \mathcal{N}_{\mu} \cap \mathcal{N}_{\frac{2\varepsilon}{5\sqrt{s}\kappa}}}_{\textrm{Initial guess}} \subset \underbrace{\mathcal{N}_{\epsilon}\cap \mathcal{N}_{\widetilde{F}}}_{ \{\boldsymbol{z}^{(t)}\}_{t\geq 0}\ \textrm{in}\ \mathcal{N}_{\epsilon}\cap \mathcal{N}_{\widetilde{F}} } \subset \underbrace{\mathcal{N}_d \cap \mathcal{N}_{\mu} \cap \mathcal{N}_{\epsilon}}_{\textrm{Key conditions hold over}\ \mathcal{N}_d \cap \mathcal{N}_{\mu} \cap \mathcal{N}_{\epsilon} }. \end{align*} Now we give a more detailed explanation of the relation above, which constitutes the main structure of the proof: We will show $$\frac{1}{\sqrt{3}}\mathcal{N}_d\cap \frac{1}{\sqrt{3}} \mathcal{N}_{\mu } \cap \mathcal{N}_{\frac{2\varepsilon }{5\sqrt{s}\kappa }} \subset \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ in the proof of Theorem 3.3 in Section 5.3, which is quite straightforward. Lemma 5.6 explains why it holds that $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}\subset \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ and where the $$s^{2}$$-bottleneck comes from. Lemma 5.8 implicitly shows that the iterates z(t) will remain in $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ if the initial guess z(0) is inside $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ and $$\widetilde{F}(\boldsymbol{z}^{(t)})$$ is monotonically decreasing (simply by induction). Lemma 5.9 makes this observation explicit by showing that $$\boldsymbol{z}^{(t)}\in \mathcal{N}_{\epsilon } \cap \mathcal{N}_{\widetilde{F}}$$ implies $$\boldsymbol{z}^{(t+1)} : = \boldsymbol{z}^{(t)} - \eta \nabla \widetilde{F}(\boldsymbol{z}^{(t)})\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ if the stepsize η obeys $$\eta \leq \frac{1}{C_{L}}$$. 
Moreover, Lemma 5.9 guarantees sufficient decrease of $$\widetilde{F}(\boldsymbol{z}^{(t)})$$ in each iteration, which paves the way toward the proof of linear convergence of $$\widetilde{F}(\boldsymbol{z}^{(t)})$$ and thus of z(t). Recall that $$\mathcal{N}_d$$ and $$\mathcal{N}_{\mu }$$ are both convex sets, and the purpose of introducing the regularizers Gi(hi, xi) is to approximately project the iterates onto $$\mathcal{N}_d\cap \mathcal{N}_{\mu }.$$ Moreover, we hope that once the iterates are inside $$\mathcal{N}_{\epsilon }$$ and inside the sublevel set $$\mathcal{N}_{\widetilde{F}}$$, they will never escape from $$\mathcal{N}_{\widetilde{F}}\cap \mathcal{N}_{\epsilon }$$. These ideas are reflected in the following lemma. Lemma 5.6 Assume 0.9di0 ≤ di ≤ 1.1di0 and 0.9d0 ≤ d ≤ 1.1d0. There holds $$\mathcal{N}_{\widetilde{F}} \subset \mathcal{N}_d \, \cap\, \mathcal{N}_{\mu }$$; moreover, under Conditions 5.3 and 5.5, we have $$\mathcal{N}_{\widetilde{F}} \cap \mathcal{N}_{\epsilon }\subset \mathcal{N}_d \cap \mathcal{N}_{\mu }\cap \mathcal{N}_{\frac{9}{10}\epsilon }$$. Proof. If $$(\boldsymbol{h}, \boldsymbol{x}) \notin \mathcal{N}_d \cap \mathcal{N}_{\mu }$$, then by the definition of G in (2.15), at least one component in G exceeds $$\rho G_{0}\left (\frac{2d_{i0}}{d_{i}}\right )$$. We have   \begin{align*} \widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) & \geq \rho G_{0}\left(\frac{2d_{i0}}{d_{i}}\right) \geq (d^{2} + 2\|\boldsymbol{e}\|^{2}) \left( \frac{2d_{i0}}{d_{i}} - 1\right)^{2} \\ & \geq (2/1.1 - 1)^{2} (d^{2} + 2\|\boldsymbol{e}\|^{2}) \\ & \geq \frac{1}{2}{d_{0}^{2}} + \|\boldsymbol{e}\|^{2}> \frac{\varepsilon^{2}{d_{0}^{2}}}{3s \kappa^{2}} + \|\boldsymbol{e}\|^{2}, \end{align*} where we used $$\rho \geq d^{2} + 2\|\boldsymbol{e}\|^{2}$$, $$0.9d_{0} \leq d \leq 1.1d_{0}$$ and $$0.9d_{i0} \leq d_{i} \leq 1.1d_{i0}$$. This implies $$(\boldsymbol{h}, \boldsymbol{x}) \notin \mathcal{N}_{\widetilde{F}}$$ and hence $$\mathcal{N}_{\widetilde{F}} \subset \mathcal{N}_d \cap \mathcal{N}_{\mu }$$. 
Note that $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ if $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_{\widetilde{F}} \cap \mathcal{N}_{\epsilon }$$. Applying (5.5) gives   \begin{align*} \frac{2}{3}\delta^{2}{d_{0}^{2}} -\frac{\varepsilon\delta{d_{0}^{2}}}{5\sqrt{s}\kappa} + \|\boldsymbol{e}\|^{2} \leq F(\boldsymbol{h}, \boldsymbol{x})\leq \widetilde{F}(\boldsymbol{h}, \boldsymbol{x})\leq\frac{\varepsilon^{2}{d_{0}^{2}}}{3s\kappa^{2}} + \|\boldsymbol{e}\|^{2} \end{align*} which implies that $$\delta \leq \frac{9}{10}\frac{\varepsilon }{\sqrt{s}\kappa }.$$ By the definition of δ in (2.9), there holds   \begin{align} \frac{81\varepsilon^{2}}{100s\kappa^{2}} \geq \delta^{2} = \frac{\sum_{i=1}^{s}{\delta_{i}^{2}}d_{i0}^{2}}{\sum_{i=1}^{s} d_{i0}^{2}} \geq \frac{\sum_{i=1}^{s}{\delta_{i}^{2}}}{s\kappa^{2}} \geq \frac{1}{s\kappa^{2}} \max_{1\leq i\leq s}{\delta_{i}^{2}}, \end{align} (5.7) which gives $$\delta _{i} \leq \frac{9}{10}\varepsilon $$ and $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_{\frac{9}{10}\varepsilon }.$$ Remark 5.7 The $$s^{2}$$-bottleneck comes from (5.7). If δ ≤ ε is small, we cannot guarantee that each δi is also smaller than ε. Just consider the simplest case when all di0 are the same: then $${d_{0}^{2}} = \sum _{i=1}^{s} d_{i0}^{2} = s d_{i0}^{2}$$ and there holds   \begin{align*} \varepsilon^{2}\geq \delta^{2} = \frac{1}{s}\sum_{i=1}^{s}{\delta_{i}^{2}}. \end{align*} Obviously, we cannot conclude that $$\max \delta _{i} \leq \varepsilon $$, but only that $$\delta _{i} \leq \sqrt{s}\varepsilon .$$ This is why we require $$\delta ={\cal O}\big(\frac{\varepsilon }{\sqrt{s}}\big)$$ to ensure δi ≤ ε, which gives the $$s^{2}$$-dependence in L. Lemma 5.8 Denote z1 = (h1, x1) and z2 = (h2, x2). Let z(λ) := (1 − λ)z1 + λz2. 
If $$\boldsymbol{z}_{1} \in \mathcal{N}_{\epsilon }$$ and $$\boldsymbol{z}(\lambda ) \in \mathcal{N}_{\widetilde{F}}$$ for all λ ∈ [0, 1], we have $$\boldsymbol{z}_{2} \in \mathcal{N}_{\epsilon }$$. Proof. Note that for $$\boldsymbol{z}_{1}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$, we have $$\boldsymbol{z}_{1}\in \mathcal{N}_d\cap \mathcal{N}_{\mu }\cap \mathcal{N}_{\frac{9}{10}\varepsilon }$$, which follows from the second part of Lemma 5.6. Now we prove $$\boldsymbol{z}_{2}\in \mathcal{N}_{\epsilon }$$ by contradiction. Suppose that $$\boldsymbol{z}_{2} \notin \mathcal{N}_{\epsilon }$$ while $$\boldsymbol{z}_{1} \in \mathcal{N}_{\epsilon }$$. Then by continuity there exists $$\boldsymbol{z}(\lambda _{0}):=(\boldsymbol{h}(\lambda _{0}), \boldsymbol{x}(\lambda _{0})) \in \mathcal{N}_{\epsilon }$$ for some λ0 ∈ [0, 1] such that $$\max _{1\leq i\leq s}\frac{\|\boldsymbol{h}_{i}(\lambda _{0})\boldsymbol{x}_{i}^{*}(\lambda _{0}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}} = \epsilon $$. Therefore, $$\boldsymbol{z}(\lambda _{0}) \in \mathcal{N}_{\widetilde{F}}\cap \mathcal{N}_{\epsilon }$$ and Lemma 5.6 implies $$\max _{1\leq i\leq s}\frac{\|\boldsymbol{h}_{i}(\lambda _{0})\boldsymbol{x}_{i}^{*}(\lambda _{0}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}} \leq \frac{9}{10}\epsilon $$, which contradicts the equality above. Lemma 5.9 Let the stepsize $$\eta \leq \frac{1}{C_{L}}$$, $$\boldsymbol{z}^{(t)} : = (\boldsymbol{u}^{(t)}, \boldsymbol{v}^{(t)})\in \mathbb{C}^{s(K + N)}$$ and CL be the Lipschitz constant of $$\nabla \widetilde{F}(\boldsymbol{z})$$ over $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ in (5.2). 
If $$\boldsymbol{z}^{(t)}\in \mathcal{N}_{\epsilon } \cap \mathcal{N}_{\widetilde{F}}$$, we have $$\boldsymbol{z}^{(t+1)} \in \mathcal{N}_{\epsilon } \cap \mathcal{N}_{\widetilde{F}}$$ and   \begin{align} \widetilde{F}(\boldsymbol{z}^{(t+1)}) \leq \widetilde{F}(\boldsymbol{z}^{(t)}) - \eta \|\nabla \widetilde{F}(\boldsymbol{z}^{(t)})\|^{2}, \end{align} (5.8) where $$\boldsymbol{z}^{(t+1)} = \boldsymbol{z}^{(t)} - \eta \nabla \widetilde{F}(\boldsymbol{z}^{(t)}).$$ Remark 5.10 This lemma tells us that once $$\boldsymbol{z}^{(t)}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$, the next iterate $$\boldsymbol{z}^{(t+1)} = \boldsymbol{z}^{(t)} - \eta \nabla \widetilde{F}(\boldsymbol{z}^{(t)})$$ is also inside $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ as long as the stepsize $$\eta \leq \frac{1}{C_{L}}$$. In other words, $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ is in fact a stronger version of the basin of attraction. Moreover, the objective function decreases sufficiently in each step as long as we can bound $$\|\nabla \widetilde{F}\|$$ from below, which is guaranteed by the Local Regularity Condition 5.3. Proof. Let $$\phi (\tau ) := \widetilde{F}(\boldsymbol{z}^{(t)} - \tau \nabla \widetilde{F}(\boldsymbol{z}^{(t)}))$$, $$\phi (0) = \widetilde{F}(\boldsymbol{z}^{(t)})$$ and consider the following quantity:   \begin{align*} \tau_{\max}: = \max \{\mu: \phi(\tau) \leq \widetilde{F}(\boldsymbol{z}^{(t)}), 0\leq\tau \leq \mu \}, \end{align*} where $$\tau _{\max }$$ is the largest stepsize such that the objective function $$\widetilde{F}(\boldsymbol{z})$$ evaluated at any point over the whole line segment $$\{\boldsymbol{z}^{(t)} -\tau \nabla \widetilde{F}(\boldsymbol{z}^{(t)}), 0\leq \tau \leq \tau _{\max }\}$$ is not greater than $$\widetilde{F}(\boldsymbol{z}^{(t)})$$. Now we will show $$\tau _{\max } \geq \frac{1}{C_{L}}$$. Obviously, if $$\|\nabla \widetilde{F}(\boldsymbol{z}^{(t)})\| = 0$$, it holds automatically.
Consider $$\|\nabla \widetilde{F}(\boldsymbol{z}^{(t)})\|\neq 0$$ and assume $$\tau _{\max } < \frac{1}{C_{L}}$$. First note that,   \begin{align*} \frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d} \tau} \phi(\tau)\Big|_{\tau = 0} < 0 \Longrightarrow\tau_{\max}> 0. \end{align*} By the definition of $$\tau _{\max }$$, there holds $$\phi (\tau _{\max }) = \phi (0)$$ since ϕ(τ) is a continuous function w.r.t. τ. Lemma 5.8 implies   \begin{align*} \{ \boldsymbol{z}^{(t)} - \tau \nabla\widetilde{F}(\boldsymbol{z}^{(t)}), 0\leq \tau \leq \tau_{\max} \} \subseteq \mathcal{N}_{\epsilon}\cap\mathcal{N}_{\widetilde{F}}. \end{align*} Now we apply Lemma 6.20, the modified descent lemma, and obtain   \begin{align*} \widetilde{F}(\boldsymbol{z}^{(t)} - \tau_{\max}\nabla\widetilde{F}(\boldsymbol{z}^{(t)})) \leq \widetilde{F}(\boldsymbol{z}^{(t)}) - (2\tau_{\max} - C_{L}\tau_{\max}^{2})\|\nabla\widetilde{F}(\boldsymbol{z}^{(t)})\|^{2} \leq \widetilde{F}(\boldsymbol{z}^{(t)}) - \tau_{\max}\|\nabla\widetilde{F}(\boldsymbol{z}^{(t)})\|^{2}, \end{align*} since $$C_{L}\tau _{\max } \leq 1.$$ In other words, $$\phi (\tau _{\max }) = \widetilde{F}(\boldsymbol{z}^{(t)} - \tau _{\max }\nabla \widetilde{F}(\boldsymbol{z}^{(t)})) < \widetilde{F}(\boldsymbol{z}^{(t)}) = \phi (0)$$, which contradicts $$\phi (\tau _{\max }) = \phi (0)$$. Therefore, we conclude that $$\tau _{\max } \geq \frac{1}{C_{L}}$$. For any $$\eta \leq \frac{1}{C_{L}}$$, Lemma 5.8 implies   \begin{align*} \{ \boldsymbol{z}^{(t)} - \tau \nabla\widetilde{F}(\boldsymbol{z}^{(t)}), 0\leq \tau \leq \eta \} \subseteq \mathcal{N}_{\epsilon}\cap\mathcal{N}_{\widetilde{F}} \end{align*} and applying Lemma 6.20 gives   \begin{align*} \widetilde{F}(\boldsymbol{z}^{(t)} - \eta \nabla\widetilde{F}(\boldsymbol{z}^{(t)})) \leq \widetilde{F}(\boldsymbol{z}^{(t)}) - (2\eta - C_{L}\eta^{2})\|\nabla\widetilde{F}(\boldsymbol{z}^{(t)})\|^{2} \leq \widetilde{F}(\boldsymbol{z}^{(t)}) - \eta\|\nabla\widetilde{F}(\boldsymbol{z}^{(t)})\|^{2}. \end{align*} 5.3.
Proof of Theorem 3.3 Combining all the considerations above, we now prove Theorem 3.3 to conclude this section. Proof. The proof consists of three parts: Part I: proof of $$\boldsymbol{z}^{(0)} : = (\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)}) \in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$. From the assumption of Theorem 3.3,   \begin{align*} \boldsymbol{z}^{(0)} \in \frac{1}{\sqrt{3}}\mathcal{N}_d \bigcap \frac{1}{\sqrt{3}}\mathcal{N}_{\mu}\cap \mathcal{N}_{\frac{2\varepsilon}{5\sqrt{s}\kappa}}. \end{align*} First we show G(u(0), v(0)) = 0: for 1 ≤ i ≤ s, by the definition of $$\mathcal{N}_d$$ and $$\mathcal{N}_{\mu }$$,   \begin{align*} \frac{\|\boldsymbol{u}^{(0)}_{i}\|^{2}}{2d_{i}} \leq \frac{2d_{i0}}{3d_{i}} < 1, \quad \frac{L|\boldsymbol{b}_{l}^{*} \boldsymbol{u}^{(0)}_{i}|^{2}}{8d_{i}\mu^{2}} \leq \frac{L}{8d_{i}\mu^{2}} \cdot\frac{16d_{i0}\mu^{2}}{3L} \leq \frac{2d_{i0}}{3d_{i}} < 1, \end{align*} where $$\|\boldsymbol{u}^{(0)}_{i}\| \leq \frac{2\sqrt{d_{i0}}}{\sqrt{3}}$$, $$\sqrt{L}\|\boldsymbol{B}\boldsymbol{u}^{(0)}_{i}\|_{\infty } \leq \frac{4 \sqrt{d_{i0}}\mu }{\sqrt{3}}$$ and $$\frac{9}{10}d_{i0} \leq d_{i}\leq \frac{11}{10}d_{i0}.$$ Therefore   \begin{align*} G_{0}\left( \frac{\|\boldsymbol{u}^{(0)}_{i}\|^{2}}{2d_{i}}\right) = G_{0}\left( \frac{\|\boldsymbol{v}^{(0)}_{i}\|^{2}}{2d_{i}}\right) = G_{0}\left(\frac{L|\boldsymbol{b}_{l}^{*}\boldsymbol{u}_{i}^{(0)}|^{2}}{8d_{i}\mu^{2}}\right) = 0\end{align*} for all 1 ≤ l ≤ L, and hence G(u(0), v(0)) = 0.
For $$\boldsymbol{z}^{(0)} = (\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)})\in \mathcal{N}_{\frac{2\varepsilon }{5\sqrt{s}\kappa }}$$, we have $$\delta (\boldsymbol{z}^{(0)}) := \frac{\sqrt{\sum _{i=1}^{s}{\delta _{i}^{2}}d_{i0}^{2} }}{d_{0}} \leq \frac{2\varepsilon }{5\sqrt{s}\kappa }.$$ Since $$\delta (\boldsymbol{z}^{(0)}) \leq \frac{2\varepsilon }{5\sqrt{s}\kappa }$$ and G(u(0), v(0)) = 0, (5.5) gives   \begin{align*} \widetilde{F}(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)}) = F(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)}) \leq \|\boldsymbol{e}\|^{2} + \frac{3}{2}\delta^{2}(\boldsymbol{z}^{(0)}){d_{0}^{2}} + \frac{\varepsilon \delta(\boldsymbol{z}^{(0)}){d_{0}^{2}}}{5\sqrt{s}\kappa} \leq \|\boldsymbol{e}\|^{2} + \frac{\varepsilon^{2}{d_{0}^{2}}}{3s\kappa^{2}} \end{align*} and hence $$\boldsymbol{z}^{(0)} = (\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)})\in \mathcal{N}_{\epsilon }\bigcap \mathcal{N}_{\widetilde{F}}.$$ Part II: the linear convergence of the objective function$$ \ \widetilde{F}(\boldsymbol{z}^{(t)})$$. Denote z(t) := (u(t), v(t)). Since $$\boldsymbol{z}^{(0)}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$, Lemma 5.9 implies $$\boldsymbol{z}^{(t)}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ for all t ≥ 0 by induction if $$\eta \leq \frac{1}{C_{L}}$$. Moreover, combining Condition 5.1 with Lemma 5.9 leads to   \begin{align*} \widetilde{F}(\boldsymbol{z}^{(t )}) \leq \widetilde{F}(\boldsymbol{z}^{(t-1)}) - \eta\omega \left[ \widetilde{F}(\boldsymbol{z}^{(t-1)}) - c \right]_{+}, \quad t\geq 1 \end{align*} with $$c = \|\boldsymbol{e}\|^{2} + a\|\mathcal{A}^{*}(\boldsymbol{e})\|^{2}$$ and a = 2000s.
Therefore, by induction, we have   \begin{align*} \left[ \widetilde{F}(\boldsymbol{z}^{(t)}) - c\right]_{+} \leq (1 - \eta\omega) \left[ \widetilde{F}(\boldsymbol{z}^{(t-1)}) - c \right]_{+} \leq \left(1 - \eta\omega\right)^{t} \left[ \widetilde{F}(\boldsymbol{z}^{(0)}) - c\right]_{+} \leq \frac{\varepsilon^{2}{d_{0}^{2}}}{3s\kappa^{2}} (1 - \eta\omega)^{t}, \end{align*} where $$\widetilde{F}(\boldsymbol{z}^{(0)}) \leq \frac{\varepsilon ^{2}{d_{0}^{2}}}{3s\kappa ^{2}} + \|\boldsymbol{e}\|^{2}$$ and $$\left [ \widetilde{F}(\boldsymbol{z}^{(0)}) - c \right ]_{+} \leq \left [ \frac{1}{3s\kappa ^{2}}\varepsilon ^{2}{d_{0}^{2}} - a\|\mathcal{A}^{*}(\boldsymbol{e})\|^{2} \right ]_{+} \leq \frac{\varepsilon ^{2}{d_{0}^{2}}}{3s\kappa ^{2}}.$$ Now we conclude that $$\left [ \widetilde{F}(\boldsymbol{z}^{(t)}) - c\right ]_{+}$$ converges to 0 linearly. Part III: the linear convergence of the iterates (u(t), v(t)). Denote   \begin{align*} \delta(\boldsymbol{z}^{(t)}) : = \frac{\|\mathcal{H}(\boldsymbol{u}^{(t)}, \boldsymbol{v}^{(t)}) - \mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0})\|_{F}}{d_{0}}. \end{align*} Note that $$\boldsymbol{z}^{(t)}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}\subseteq \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$; over $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$, there holds $$F_{0}(\boldsymbol{z}^{(t)}) \geq \frac{2}{3}\delta ^{2}(\boldsymbol{z}^{(t)}){d_{0}^{2}}$$, which follows from the local restricted isometry property (RIP) condition in (5.3) and the definition of F0(z(t)) in (2.12).
Moreover   \begin{align*} \widetilde{F}(\boldsymbol{z}^{(t)}) - \|\boldsymbol{e}\|^{2} \geq & F_{0}(\boldsymbol{z}^{(t)}) - 2\operatorname{Re}\left(\langle \mathcal{A}^{*}(\boldsymbol{e}), \mathcal{H}(\boldsymbol{u}^{(t)}, \boldsymbol{v}^{(t)}) - \mathcal{H}(\boldsymbol{h}_{0}, \boldsymbol{x}_{0}) \rangle\right) \\ & \geq \frac{2}{3} \delta^{2}(\boldsymbol{z}^{(t)}){d_{0}^{2}} - 2\sqrt{2s}\|\mathcal{A}^{*}(\boldsymbol{e})\| \delta(\boldsymbol{z}^{(t)})d_{0}, \end{align*} where G(z(t)) ≥ 0 and the second inequality follows from (5.6). There holds   \begin{align*} \frac{2}{3} \delta^{2}(\boldsymbol{z}^{(t)}){d_{0}^{2}} - 2\sqrt{2s}\|\mathcal{A}^{*}(\boldsymbol{e})\| \delta(\boldsymbol{z}^{(t)})d_{0} - a\|\mathcal{A}^{*}(\boldsymbol{e})\|^{2} \leq \left[ \widetilde{F}(\boldsymbol{z}^{(t)}) - c \right]_{+} \leq \frac{ \varepsilon^{2}{d_{0}^{2}}}{3s\kappa^{2}}(1 - \eta\omega)^{t} \end{align*} and, after completing the square,   \begin{align*} \left|\delta(\boldsymbol{z}^{(t)})d_{0} - \frac{3\sqrt{2s}}{2} \|\mathcal{A}^{*}(\boldsymbol{e})\| \right|{}^{2} \leq \frac{\varepsilon^{2}{d_{0}^{2}}}{2s\kappa^{2}} (1 - \eta\omega)^{t} + \left(\frac{3}{2}a + \frac{9s}{2}\right)\|\mathcal{A}^{*}(\boldsymbol{e})\|^{2}. \end{align*} Solving the inequality above for δ(z(t)), we have   \begin{align} \delta(\boldsymbol{z}^{(t)}) d_{0} & \leq \frac{\varepsilon d_{0}}{\sqrt{2s\kappa^{2}}}(1 - \eta\omega)^{t/2} +\left(\frac{3\sqrt{2s}}{2} + \sqrt{\frac{3}{2}a + \frac{9s}{2}} \right)\|\mathcal{A}^{*}(\boldsymbol{e})\| \nonumber \\ & \leq \frac{\varepsilon d_{0}}{\sqrt{2s\kappa^{2}}}(1 - \eta\omega)^{t/2} + 60\sqrt{s} \|\mathcal{A}^{*}(\boldsymbol{e})\|, \end{align} (5.9) where a = 2000s.
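The two mechanisms driving Parts II and III, sufficient decrease per gradient step as in (5.8) and the resulting geometric decay of the objective gap, can be checked on a toy problem. The sketch below is a real-valued analogue: gradient descent on a strongly convex quadratic, where the classical descent lemma gives the factor $$\eta (1 - C_{L}\eta /2) \geq \eta /2$$ in place of the paper's complex (Wirtinger) factor $$2\eta - C_{L}\eta ^{2}$$; the constant `omega` below is the smallest Hessian eigenvalue, standing in for the regularity constant ω of Condition 5.1, and all dimensions and seeds are arbitrary.

```python
import numpy as np

# Gradient descent on f(z) = 0.5 z^T A z with stepsize eta = 1/C_L, where
# C_L is the Lipschitz constant of the gradient (largest eigenvalue of A).
rng = np.random.default_rng(1)
n = 20
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                  # positive definite Hessian
eigs = np.linalg.eigvalsh(A)
C_L, omega = eigs[-1], eigs[0]           # smoothness / regularity constants

f = lambda z: 0.5 * z @ A @ z            # global minimum value is 0
grad = lambda z: A @ z

z = rng.standard_normal(n)
eta = 1.0 / C_L
f0 = f(z)
for t in range(1, 201):
    g = grad(z)
    z_next = z - eta * g
    # classical descent lemma: sufficient decrease at every step
    assert f(z_next) <= f(z) - 0.5 * eta * (g @ g) + 1e-10
    z = z_next
    # geometric decay of the objective gap, mirroring Part II
    assert f(z) <= (1 - eta * omega) ** t * f0 + 1e-10

print(f(z) < f0)
```

For the quadratic, the decay factor per step is exactly the analogue of $$(1 - \eta \omega )$$ in Part II; the iterate error then inherits the rate $$(1 - \eta \omega )^{t/2}$$, as in (5.9).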
Let $$d^{(t)} : = \sqrt{\sum _{i=1}^{s} \|\boldsymbol{u}_{i}^{(t)}\|^{2}\|\boldsymbol{v}_{i}^{(t)}\|^{2} }$$ for $$t\in \mathbb{Z}_{\geq 0}.$$ By (5.9) and the triangle inequality, we immediately obtain $$|d^{(t)} - d_{0}| \leq \frac{\varepsilon d_{0}}{\sqrt{2s\kappa ^{2}}}(1 - \eta \omega )^{t/2} + 60\sqrt{s} \|\mathcal{A}^{*}(\boldsymbol{e})\|.$$ 6. Proof of the four conditions This section is devoted to proving the four key conditions introduced in Section 5. The local smoothness condition and the robustness condition are relatively straightforward to handle. The more difficult part is to show the local regularity condition and the local isometry property. The key to solving these problems is to understand how the vector-valued linear operator $$\mathcal{A}$$ in (2.7) behaves on block-diagonal matrices, such as $$\mathcal{H}(\boldsymbol{h},\boldsymbol{x})$$, $$\mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0})$$ and $$\mathcal{H}(\boldsymbol{h},\boldsymbol{x}) - \mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0}).$$ In particular, when s = 1, all those matrices become rank-one matrices, which have been discussed in detail in our previous work [21].
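To fix ideas about how $$\mathcal{A}$$ acts on block-diagonal matrices, the sketch below implements the forward map $$y_{l} = \sum _{i=1}^{s}\boldsymbol{b}_{l}^{*}\boldsymbol{Z}_{i}\boldsymbol{a}_{il}$$ and its adjoint, and verifies the adjoint identity $$\langle \mathcal{A}(\boldsymbol{Z}), \boldsymbol{y}\rangle = \langle \boldsymbol{Z}, \mathcal{A}^{*}(\boldsymbol{y})\rangle $$ numerically. It is a minimal toy model: the dimensions are arbitrary, the rows of `B` stand in for $$\boldsymbol{b}_{l}^{*}$$, and `Amats[i][l]` stands in for $$\boldsymbol{a}_{il}$$ (here drawn Gaussian, whereas the paper imposes more structure on B).

```python
import numpy as np

rng = np.random.default_rng(2)
s, K, N, L = 3, 4, 5, 32

# B[l] plays the role of the row b_l^*; Amats[i][l] plays the column a_{il}.
B = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))
Amats = rng.standard_normal((s, L, N)) + 1j * rng.standard_normal((s, L, N))

def A_forward(Zs):
    """y_l = sum_i b_l^* Z_i a_{il}: the operator A applied to blkdiag(Z_1,...,Z_s)."""
    return sum(np.einsum('lk,kn,ln->l', B, Zs[i], Amats[i]) for i in range(s))

def A_adjoint(y):
    """i-th diagonal block of A^*(y): sum_l y_l b_l a_{il}^*."""
    return [np.einsum('l,lk,ln->kn', y, B.conj(), Amats[i].conj())
            for i in range(s)]

Zs = [rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))
      for _ in range(s)]
y = rng.standard_normal(L) + 1j * rng.standard_normal(L)

lhs = np.vdot(A_forward(Zs), y)                                    # <A(Z), y>
rhs = sum(np.vdot(Zs[i], W) for i, W in enumerate(A_adjoint(y)))   # <Z, A^*(y)>
print(np.allclose(lhs, rhs))
```

Working with the list of diagonal blocks rather than the full $$Ks\times Ns$$ matrix mirrors how the block-diagonal structure is exploited throughout this section.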
First of all, we define the linear subspace $$T_{i}\subset \mathbb{C}^{K\times N}$$ along with its orthogonal complement for 1 ≤ i ≤ s as   \begin{align} T_{i} & := \{ \boldsymbol{Z}_{i}\in\mathbb{C}^{K\times N} : \boldsymbol{Z}_{i} = \boldsymbol{h}_{i0}\boldsymbol{v}_{i}^{*} + \boldsymbol{u}_{i}\boldsymbol{x}_{i0}^{*}, \quad \boldsymbol{u}_{i}\in\mathbb{C}^{K},\boldsymbol{v}_{i}\in\mathbb{C}^{N} \},\nonumber \\ T^{\bot}_{i} & := \left\{ \left(\boldsymbol{I}_{K} - \frac{\boldsymbol{h}_{i0}\boldsymbol{h}_{i0}^{*}}{d_{i0}}\right) \boldsymbol{Z}_{i} \left(\boldsymbol{I}_{N} - \frac{\boldsymbol{x}_{i0}\boldsymbol{x}_{i0}^{*}}{d_{i0}}\right) :\boldsymbol{Z}_{i}\in\mathbb{C}^{K\times N} \right\}\!, \end{align} (6.1) where $$\|\boldsymbol{h}_{i0}\| = \|\boldsymbol{x}_{i0}\| = \sqrt{d_{i0}}.$$ In particular, $$\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \in T_{i}$$ for all 1 ≤ i ≤ s. The proof also requires us to consider block-diagonal matrices whose ith block belongs to Ti (or $$T^{\bot }_{i}$$). Let $$\boldsymbol{Z} = \operatorname{blkdiag}(\boldsymbol{Z}_{1},\cdots ,\boldsymbol{Z}_{s})\in \mathbb{C}^{Ks\times Ns}$$ be a block-diagonal matrix; we say Z ∈ T if   \begin{align*} T := \left\{\textrm{blkdiag}\ (\{\boldsymbol{Z}_{i}\}_{i=1}^{s}) | \boldsymbol{Z}_{i}\in T_{i} \right\} \end{align*} and Z ∈ T⊥ if   \begin{align*} T^{\bot} := \left\{\textrm{blkdiag}\ (\{\boldsymbol{Z}_{i}\}_{i=1}^{s}) | \boldsymbol{Z}_{i}\in T^{\bot}_{i} \right\}\!, \end{align*} where both T and T⊥ are subsets of $$\mathbb{C}^{Ks\times Ns}$$ and $$\mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0})\in T.$$ Now we take a closer look at a special case of block-diagonal matrices, namely $$\mathcal{H}(\boldsymbol{h}, \boldsymbol{x})$$, and calculate its projections onto T and T⊥; for this it suffices to consider $$\mathcal{P}_{T_{i}}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$ and $$\mathcal{P}_{T^{\bot }_{i}}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$.
For each block $$\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}$$ and 1 ≤ i ≤ s, there are unique orthogonal decompositions   \begin{align} \boldsymbol{h}_{i} := \alpha_{i1} \boldsymbol{h}_{i0} + \tilde{\boldsymbol{h}}_{i}, \quad \boldsymbol{x}_{i} := \alpha_{i2} \boldsymbol{x}_{i0} + \tilde{\boldsymbol{x}}_{i}, \end{align} (6.2) where $$\boldsymbol{h}_{i0} \perp \tilde{\boldsymbol{h}}_{i}$$ and $$\boldsymbol{x}_{i0} \perp \tilde{\boldsymbol{x}}_{i}$$. It is important to note that $$\alpha _{i1} = \alpha _{i1}(\boldsymbol{h}_{i}) = \frac{\langle \boldsymbol{h}_{i0}, \boldsymbol{h}_{i}\rangle }{d_{i0}}$$ and $$\alpha _{i2} = \alpha _{i2}(\boldsymbol{x}_{i}) = \frac{\langle \boldsymbol{x}_{i0}, \boldsymbol{x}_{i}\rangle }{d_{i0}},$$ and thus αi1 and αi2 are functions of hi and xi, respectively. Immediately, we have the following matrix orthogonal decomposition for $$\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}$$ onto Ti and $$T^{\bot }_{i}$$,   \begin{align} \boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0} \boldsymbol{x}_{i0}^{*} = \underbrace{(\alpha_{i1} \overline{\alpha_{i2}} - 1)\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} + \overline{\alpha_{i2}} \tilde{\boldsymbol{h}}_{i} \boldsymbol{x}_{i0}^{*} + \alpha_{i1} \boldsymbol{h}_{i0} \tilde{\boldsymbol{x}}_{i}^{*}}_{\textrm{belong to}\ T_{i}} + \underbrace{\tilde{\boldsymbol{h}}_{i} \tilde{\boldsymbol{x}}_{i}^{*}}_{\textrm{belongs to}\ T^{\bot}_{i}}, \end{align} (6.3) where the first three components are in Ti while $$\tilde{\boldsymbol{h}}_{i}\tilde{\boldsymbol{x}}_{i}^{*}\in T^{\bot }_{i}$$. 6.1. Key lemmata From the decomposition in (6.2) and (6.3), we want to analyze how $$\|\tilde{\boldsymbol{h}}_{i}\|$$, $$\|\tilde{\boldsymbol{x}}_{i}\|$$, αi1 and αi2 depend on $$\delta _{i} = \frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}}$$ if δi < 1.
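The decompositions (6.2) and (6.3) can be verified numerically; a minimal real-valued sketch (dimensions, the perturbation size and the choice $$d_{i0}=2$$ are arbitrary), which checks orthogonality of the splittings and that the four components reassemble $$\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}$$:

```python
import numpy as np

rng = np.random.default_rng(3)
K, N = 6, 5

h0 = rng.standard_normal(K); h0 *= np.sqrt(2.0) / np.linalg.norm(h0)
x0 = rng.standard_normal(N); x0 *= np.sqrt(2.0) / np.linalg.norm(x0)
d0 = np.linalg.norm(h0) * np.linalg.norm(x0)   # d_{i0} = ||h0|| ||x0|| = 2

h = h0 + 0.1 * rng.standard_normal(K)          # small perturbations of the
x = x0 + 0.1 * rng.standard_normal(N)          # ground truth

# orthogonal decompositions (6.2): alpha = <h0, h>/d0, tilde part orthogonal
a1 = h0 @ h / d0;  h_t = h - a1 * h0
a2 = x0 @ x / d0;  x_t = x - a2 * x0

# decomposition (6.3): T_i part plus T_i^perp part
T_part = (a1 * a2 - 1) * np.outer(h0, x0) \
         + a2 * np.outer(h_t, x0) + a1 * np.outer(h0, x_t)
Tperp_part = np.outer(h_t, x_t)
print(np.allclose(T_part + Tperp_part, np.outer(h, x) - np.outer(h0, x0)))
```

In the real case the conjugations in (6.3) are trivial; in the complex case `a2` would be conjugated in the first two terms.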
The following lemma, which can be viewed as an application of singular value/vector perturbation theory [40] to rank-one matrices, answers this question. It shows that if $$\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}$$ is close to $$\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}$$, then $$\mathcal{P}_{T^{\bot }_{i}}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$ is in fact very small (of order $${\cal O}({\delta _{i}^{2}} d_{i0})$$). Lemma 6.1 (Lemma 5.9 in [21]) Recall that $$\|\boldsymbol{h}_{i0}\| = \|\boldsymbol{x}_{i0}\| = \sqrt{d_{i0}}$$. If $$\delta _{i} := \frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0} \boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}}<1$$, we have the following useful bounds   \begin{align*} |\alpha_{i1}|\leq \frac{\|\boldsymbol{h}_{i}\|}{\|\boldsymbol{h}_{i0}\|}, \quad |\alpha_{i1}\overline{\alpha_{i2}} - 1|\leq \delta_{i}, \end{align*} and   \begin{align*} \|\tilde{\boldsymbol{h}}_{i}\| \leq \frac{\delta_{i}}{1 - \delta_{i}}\|\boldsymbol{h}_{i}\|,\quad \|\tilde{\boldsymbol{x}}_{i}\| \leq \frac{\delta_{i}}{1 - \delta_{i}}\|\boldsymbol{x}_{i}\|,\quad \|\tilde{\boldsymbol{h}}_{i}\| \|\tilde{\boldsymbol{x}}_{i}\| \leq \frac{{\delta_{i}^{2}}}{2(1 - \delta_{i})} d_{i0}. \end{align*} Moreover, if $$\|\boldsymbol{h}_{i}\| \leq 2\sqrt{d_{i0}}$$ and $$\sqrt{L}\|\boldsymbol{B} \boldsymbol{h}_{i}\|_{\infty } \leq 4\mu \sqrt{d_{i0}}$$, i.e., $$\boldsymbol{h}_{i}\in \mathcal{N}_d\bigcap \mathcal{N}_{\mu }$$, we have $$\sqrt{L}\|\boldsymbol{B} \tilde{\boldsymbol{h}}_{i}\|_{\infty } \leq 6 \mu \sqrt{d_{i0}}$$. Now we start to focus on several results related to the linear operator $$\mathcal{A}$$. Lemma 6.2 (Operator norm of $$\mathcal{A}$$). For $$\mathcal{A}$$ defined in (2.7), there holds   \begin{align} \|\mathcal{A}\| \leq \sqrt{s(N\log(NL/2) + (\gamma+\log s)\log L)} \end{align} (6.4) with probability at least 1 − L−γ. Proof.
Note that $$\mathcal{A}_{i}(\boldsymbol{Z}_{i}) : = \{\boldsymbol{b}_{l}^{*}\boldsymbol{Z}_{i}\boldsymbol{a}_{il}\}_{l=1}^{L}$$ in (2.2). Lemma 1 in [1] implies   \begin{align*} \|\mathcal{A}_{i}\| \leq \sqrt{N\log(NL/2) + \gamma^{\prime}\log L} \end{align*} with probability at least $$1 - L^{-\gamma ^{\prime }}.$$ By taking the union bound over 1 ≤ i ≤ s,   \begin{align*} \max\|\mathcal{A}_{i}\| \leq \sqrt{N\log(NL/2) + (\gamma+ \log s)\log L} \end{align*} with probability at least $$1 - sL^{-\gamma -\log s} \geq 1 - L^{-\gamma }.$$ For $$\mathcal{A}$$ defined in (2.7), applying the triangle and Cauchy–Schwarz inequalities gives   \begin{align*} \|\mathcal{A}(\boldsymbol{Z})\| = \left\|\sum_{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{Z}_{i})\right\| \leq \sum_{i=1}^{s} \|\mathcal{A}_{i}\|\|\boldsymbol{Z}_{i}\|_{F} \leq \max_{1\leq i\leq s} \|\mathcal{A}_{i}\| \sqrt{s \sum_{i=1}^{s} \|\boldsymbol{Z}_{i}\|_{F}^{2}} = \sqrt{s}\max_{1\leq i\leq s} \|\mathcal{A}_{i}\| \|\boldsymbol{Z}\|_{F}, \end{align*} where $$\boldsymbol{Z} = \operatorname{blkdiag}(\boldsymbol{Z}_{1},\cdots , \boldsymbol{Z}_{s})\in \mathbb{C}^{Ks\times Ns}.$$ Therefore,   \begin{align*} \|\mathcal{A}\| \leq \sqrt{s}\max_{1\leq i\leq s}\|\mathcal{A}_{i}\| \leq \sqrt{ s(N\log(NL/2) + (\gamma+\log s)\log L)} \end{align*} with probability at least 1 − L−γ. Lemma 6.3 (Restricted isometry property for $$\mathcal{A}$$ on T). The linear operator $$\mathcal{A}$$ restricted on T is well-conditioned, i.e.,   \begin{align} \|\mathcal{P}_{T}\mathcal{A}^{*}\mathcal{A}\mathcal{P}_{T} - \mathcal{P}_{T}\| \leq \frac{1}{10}, \end{align} (6.5) where $$\mathcal{P}_{T}$$ is the projection operator from $$\mathbb{C}^{Ks\times Ns}$$ onto T, given $$L \geq C_{\gamma }s^{2} \max \{K,{\mu _{h}^{2}} N\}\log ^{2}L$$ with probability at least 1 − L−γ.
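The deterministic inequality $$\|\mathcal{A}(\boldsymbol{Z})\| \leq \sqrt{s}\max _{i}\|\mathcal{A}_{i}\|\|\boldsymbol{Z}\|_{F}$$ used in the proof of Lemma 6.2 can be checked numerically by matricizing each $$\mathcal{A}_{i}$$ as an $$L\times KN$$ matrix acting on $$\operatorname{vec}(\boldsymbol{Z}_{i})$$. This is a real-valued toy model with Gaussian $$\boldsymbol{b}_{l}$$ (the paper's B is more structured), and all dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
s, K, N, L = 3, 4, 5, 64

# each A_i as an explicit L x (K*N) matrix acting on vec(Z_i)
A_ops = []
for _ in range(s):
    B = rng.standard_normal((L, K))
    Am = rng.standard_normal((L, N))
    # row l of A_i is the flattening of the rank-one matrix b_l a_{il}^T
    A_ops.append(np.stack([np.outer(B[l], Am[l]).ravel() for l in range(L)]))

op_norms = [np.linalg.svd(Ai, compute_uv=False)[0] for Ai in A_ops]

# random block-diagonal test matrix Z, stored block by block
Zs = [rng.standard_normal(K * N) for _ in range(s)]
AZ = sum(Ai @ z for Ai, z in zip(A_ops, Zs))
Z_fro = np.sqrt(sum(np.dot(z, z) for z in Zs))

# ||A(Z)|| <= sqrt(s) * max_i ||A_i|| * ||Z||_F
print(np.linalg.norm(AZ) <= np.sqrt(s) * max(op_norms) * Z_fro + 1e-9)
```

The $$\sqrt{s}$$ factor comes purely from Cauchy–Schwarz over the blocks, which is why it holds for any realization of the random matrices.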
Remark 6.4 Here $$\mathcal{A}\mathcal{P}_{T}$$ and $$\mathcal{P}_{T}\mathcal{A}^{*}$$ are defined as   \begin{align*} \mathcal{A}\mathcal{P}_{T}(\boldsymbol{Z}) = \sum_{i=1}^{s} \mathcal{A}_{i}(\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i})), \quad\mathcal{P}_{T}\mathcal{A}^{*}(\boldsymbol{z}) = \operatorname{blkdiag}( \mathcal{P}_{T_{1}}(\mathcal{A}_{1}^{*}(\boldsymbol{z})), \cdots, \mathcal{P}_{T_{s}}(\mathcal{A}_{s}^{*}(\boldsymbol{z})) ), \end{align*} respectively, where Z is a block-diagonal matrix and $$\boldsymbol{z}\in \mathbb{C}^{L}.$$ As shown in the remark above, the proof of Lemma 6.3 depends on the properties of both $$\mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{i}\mathcal{P}_{T_{i}}$$ and $$\mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}}$$ for $$i\neq j$$. Fortunately, we have already proven the relevant results in [23]; they are restated as follows: Lemma 6.5 (Inter-user incoherence, Corollaries 5.3 and 5.8 in [23]). There hold   \begin{align} \|\mathcal{P}_{T_{i}} \mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}}\| \leq \frac{1}{10s}, \quad \forall i\neq j; \qquad \|\mathcal{P}_{T_{i}} \mathcal{A}_{i}^{*}\mathcal{A}_{i}\mathcal{P}_{T_{i}} - \mathcal{P}_{T_{i}}\| \leq \frac{1}{10s}, \quad\forall 1\leq i\leq s \end{align} (6.6) with probability at least 1 − L−γ+1 if $$L\geq C_{\gamma }s^{2}\max \{K,{\mu ^{2}_{h}}N\}\log ^{2}L\log (s+1).$$ Note that $$\|\mathcal{P}_{T_{i}} \mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}}\| \leq \frac{1}{10s}$$ holds because of the independence of the individual random Gaussian matrices Ai. In particular, if s = 1, the inter-user incoherence $$\|\mathcal{P}_{T_{i}} \mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}}\| \leq \frac{1}{10s}$$ is not needed at all. With (6.6), it is easy to prove Lemma 6.3.
Proof of Lemma 6.3 For any block diagonal matrix $$\boldsymbol{Z} = \operatorname{blkdiag}(\boldsymbol{Z}_{1}, \cdots ,\boldsymbol{Z}_{s})\in \mathbb{C}^{Ks\times Ns}$$ and $$\boldsymbol{Z}_{i}\in \mathbb{C}^{K\times N}$$,   \begin{align} \langle \boldsymbol{Z}, \mathcal{P}_{T}\mathcal{A}^{*}\mathcal{A}\mathcal{P}_{T}(\boldsymbol{Z}) - \mathcal{P}_{T}(\boldsymbol{Z})\rangle & = \sum_{1\leq i,j\leq s} \langle \mathcal{A}_{i}\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i}), \mathcal{A}_{j}\mathcal{P}_{T_{j}}(\boldsymbol{Z}_{j})\rangle - \|\mathcal{P}_{T}(\boldsymbol{Z})\|_{F}^{2} \nonumber \\ & = \sum_{i=1}^{s} \langle \boldsymbol{Z}_{i}, \mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{i}\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i}) - \mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i})\rangle + \sum_{i\neq j} \langle \mathcal{A}_{i}\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i}), \mathcal{A}_{j}\mathcal{P}_{T_{j}}(\boldsymbol{Z}_{j})\rangle. \end{align} (6.7) Using (6.6), the following two inequalities hold,   \begin{align*} |\langle \boldsymbol{Z}_{i}, \mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{i}\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i}) - \mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i})\rangle| & \leq \|\mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{i}\mathcal{P}_{T_{i}} - \mathcal{P}_{T_{i}} \| \|\boldsymbol{Z}_{i}\|_{F}^{2} \leq \frac{\|\boldsymbol{Z}_{i}\|^{2}_{F}}{10s}, \\ |\langle \mathcal{A}_{i}\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i}), \mathcal{A}_{j}\mathcal{P}_{T_{j}}(\boldsymbol{Z}_{j})\rangle| & \leq \|\mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}} \| \|\boldsymbol{Z}_{i}\|_{F}\|\boldsymbol{Z}_{j}\|_{F} \leq \frac{\|\boldsymbol{Z}_{i}\|_{F}\|\boldsymbol{Z}_{j}\|_{F}}{10s}. 
\end{align*} After substituting both estimates into (6.7), we have   \begin{align*} |\langle \boldsymbol{Z}, \mathcal{P}_{T}\mathcal{A}^{*}\mathcal{A}\mathcal{P}_{T}(\boldsymbol{Z}) - \mathcal{P}_{T}(\boldsymbol{Z})\rangle| \leq \sum_{1\leq i, j\leq s} \frac{ \|\boldsymbol{Z}_{i}\|_{F}\|\boldsymbol{Z}_{j}\|_{F} }{10s} \leq \frac{1}{10s}\left(\sum_{i=1}^{s} \|\boldsymbol{Z}_{i}\|_{F}\right)^{2} \leq \frac{\|\boldsymbol{Z}\|_{F}^{2}}{10}. \end{align*} Finally, we show how $$\mathcal{A}$$ behaves when applied to block-diagonal matrices $$\boldsymbol{X} = \mathcal{H}(\boldsymbol{h},\boldsymbol{x})$$. In particular, the calculations will be much simplified for the case s = 1. Lemma 6.6 ($$\mathcal{A}$$ restricted on block-diagonal matrices with rank-one blocks). Consider $$\boldsymbol{X} = \mathcal{H}(\boldsymbol{h}, \boldsymbol{x})$$ and   \begin{align} \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}) := \max_{1\leq l\leq L} \sum_{i=1}^{s} |\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i}|^{2} \|\boldsymbol{x}_{i}\|^{2}. \end{align} (6.8) Conditioned on (6.4), we have   \begin{align} \|\mathcal{A}(\boldsymbol{X})\|^{2} \leq \frac{4}{3} \|\boldsymbol{X}\|_{F}^{2}+ 2 \sqrt{2s\|\boldsymbol{X}\|_{F}^{2} \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K+N)\log L} + 8s\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K+N) \log L, \end{align} (6.9) uniformly for any $$\boldsymbol{h}\in \mathbb{C}^{Ks}$$ and $$\boldsymbol{x}\in \mathbb{C}^{Ns}$$ with probability at least $$1 - \frac{1}{\gamma }\exp (-s(K+N))$$ if $$L\geq C_{\gamma }s(K+N)\log L$$. Here $$ \|\boldsymbol{X}\|_{F}^{2}= \|\mathcal{H}(\boldsymbol{h}, \boldsymbol{x})\|_{F}^{2} = \sum _{i=1}^{s} \|\boldsymbol{h}_{i}\|^{2}\|\boldsymbol{x}_{i}\|^{2}.$$ Remark 6.7 Here are a few more explanations and facts about $$\sigma ^{2}_{\max }(\boldsymbol{h},\boldsymbol{x})$$. 
Note that $$\|\mathcal{A}(\boldsymbol{X})\|^{2}$$ is the sum of L sub-exponential random variables, i.e.,   \begin{align} \|\mathcal{A}(\boldsymbol{X})\|^{2} = \sum_{l=1}^{L} \left|\sum_{i=1}^{s} \boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i} \boldsymbol{x}_{i}^{*}\boldsymbol{a}_{il}\right|{}^{2}. \end{align} (6.10) Here $$\sigma ^{2}_{\max }(\boldsymbol{h}, \boldsymbol{x})$$ corresponds to the largest expectation of all those components in $$\|\mathcal{A}(\boldsymbol{X})\|^{2}$$. For $$\sigma ^{2}_{\max }(\boldsymbol{h}, \boldsymbol{x})$$, without loss of generality, we assume ∥xi∥ = 1 for 1 ≤ i ≤ s and let $$\boldsymbol{h}\in \mathbb{C}^{Ks}$$ be a unit vector, i.e., $$\|\boldsymbol{h}\|^{2} = \sum _{i=1}^{s} \|\boldsymbol{h}_{i}\|^{2}= 1$$. The bound   \begin{align} \frac{1}{L} \leq \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}) \leq \frac{K}{L} \end{align} (6.11) follows from $$L \sigma ^{2}_{\max }(\boldsymbol{h}, \boldsymbol{x}) \geq \sum _{l=1}^{L} \sum _{i=1}^{s} |\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}|^{2} = \|\boldsymbol{h}\|^{2}=1.$$ Moreover, $$\sigma _{\max }^{2}(\boldsymbol{h},\boldsymbol{x})$$ and $$\sigma _{\max }(\boldsymbol{h},\boldsymbol{x})$$ are both Lipschitz functions w.r.t. h. Now we want to determine their Lipschitz constants. First note that for ∥xi∥ = 1, $$\sigma _{\max }(\boldsymbol{h},\boldsymbol{x})$$ equals   \begin{align*} \sigma_{\max}(\boldsymbol{h}, \boldsymbol{x}) = \max_{1\leq l\leq L} \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*})\boldsymbol{h}\|, \end{align*} where ⊗ denotes Kronecker product.
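The Kronecker rewriting of $$\sigma _{\max }(\boldsymbol{h}, \boldsymbol{x})$$ can be checked directly against the definition (6.8); a real-valued toy sketch with arbitrary dimensions (Gaussian stand-ins for the $$\boldsymbol{b}_{l}$$, unit ∥xi∥ as assumed above):

```python
import numpy as np

rng = np.random.default_rng(5)
s, K, L = 3, 4, 16
B = rng.standard_normal((L, K))           # row l stands in for b_l^*

h = rng.standard_normal(s * K)
h /= np.linalg.norm(h)                    # unit vector in R^{Ks}
h_blocks = h.reshape(s, K)
x_norms = np.ones(s)                      # ||x_i|| = 1 for all i

# definition (6.8): max_l sum_i |b_l^* h_i|^2 ||x_i||^2
sigma2_def = max(
    sum((B[l] @ h_blocks[i])**2 * x_norms[i]**2 for i in range(s))
    for l in range(L)
)

# Kronecker form: sigma_max = max_l ||(I_s kron b_l^*) h||
sigma2_kron = max(
    np.linalg.norm(np.kron(np.eye(s), B[l][None, :]) @ h)**2
    for l in range(L)
)
print(np.allclose(sigma2_def, sigma2_kron))
```

The identity holds because $$(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*})\boldsymbol{h}$$ simply stacks the s scalars $$\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}$$.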
Let $$\boldsymbol{u}\in \mathbb{C}^{Ks}$$ be another unit vector and we have   \begin{align} |\sigma_{\max}(\boldsymbol{h}, \boldsymbol{x}) - \sigma_{\max}(\boldsymbol{u}, \boldsymbol{x})| & = \left| \max_{1\leq l\leq L} \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*})\boldsymbol{h}\| - \max_{1\leq l\leq L} \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*})\boldsymbol{u}\| \right| \nonumber \\ & \leq \max_{1\leq l\leq L} \left| \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*}) \boldsymbol{h}\| - \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*}) \boldsymbol{u}\| \right| \nonumber \\ & \leq \max_{1\leq l\leq L} \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*}) (\boldsymbol{h} - \boldsymbol{u})\| \leq \|\boldsymbol{h}-\boldsymbol{u}\|, \end{align} (6.12) where $$\|\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*}\| = \|\boldsymbol{b}_{l}\| = \sqrt{\frac{K}{L}} \leq 1.$$ For $$\sigma ^{2}_{\max }(\boldsymbol{h},\boldsymbol{x}),$$  \begin{align} |\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}) - \sigma^{2}_{\max}(\boldsymbol{u}, \boldsymbol{x})| & \leq (\sigma_{\max}(\boldsymbol{h}, \boldsymbol{x}) + \sigma_{\max}(\boldsymbol{u}, \boldsymbol{x})) \cdot |\sigma_{\max}(\boldsymbol{h}, \boldsymbol{x}) - \sigma_{\max}(\boldsymbol{u}, \boldsymbol{x})| \nonumber \\ & \leq 2\sqrt{\frac{K}{L}}\|\boldsymbol{h}-\boldsymbol{u}\| \leq 2\|\boldsymbol{h}-\boldsymbol{u}\|. \end{align} (6.13) Proof of Lemma 6.6 Without loss of generality, let ∥xi∥ = 1 and $$\sum _{i=1}^{s} \|\boldsymbol{h}_{i}\|^{2} = 1$$. It suffices to prove $$f(\boldsymbol{h}, \boldsymbol{x}) \leq \frac{4}{3}$$ for all $$(\boldsymbol{h}, \boldsymbol{x})\in \mathbb{C}^{Ks}\times \mathbb{C}^{Ns}$$ in (2.5) where f(h, x) is defined as   \begin{align*} f(\boldsymbol{h}, \boldsymbol{x}) := \|\mathcal{A}(\boldsymbol{X})\|^{2} - 2 \sqrt{2s \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K+N)\log L} - 8s\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K+N) \log L.
\end{align*} Part I: bounds of$$\|\mathcal{A}(\boldsymbol{X})\|^{2}$$ for any fixed (h, x). From (47), we already know that $$Y = \|\mathcal{A}(\boldsymbol{X})\|_{F}^{2} = \sum _{i=1}^{2L} c_{i}{\xi _{i}^{2}}$$ where {ξi} are i.i.d. $${\chi ^{2}_{1}}$$ random variables and $$\boldsymbol{c} = (c_{1}, \cdots , c_{2L})^{T}\in \mathbb{R}^{2L}$$. More precisely, we can determine $$\{c_{i}\}_{i=1}^{2L}$$ as   \begin{align*} \left| \sum_{i=1}^{s} \boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}\boldsymbol{a}_{il}\right|{}^{2} = c_{2l-1} \xi_{2l-1}^{2} + c_{2l}\xi_{2l}^{2},\quad c_{2l-1} = c_{2l} = \frac{1}{2}\sum_{i=1}^{s} |\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}|^{2} \end{align*} because $$\sum _{i=1}^{s} \boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i} \boldsymbol{x}_{i}^{*}\boldsymbol{a}_{il} \sim \mathcal{C}\mathcal{N}\left (0, \sum _{i=1}^{s} |\boldsymbol{b}^{*}_{l} \boldsymbol{h}_{i}|^{2}\right )$$. By the Bernstein inequality, there holds   \begin{align} \mathbb{P}(Y - \mathbb{E}(Y) \geq t) \leq \exp\left(- \frac{t^{2}}{8\|\boldsymbol{c}\|^{2}}\right) \vee \exp\left(- \frac{t}{8\|\boldsymbol{c}\|_{\infty}}\right)\!, \end{align} (6.14) where $$\operatorname{\mathbb{E}}(Y) = \|\boldsymbol{X}\|_{F}^{2} = 1.$$ In order to apply the Bernstein inequality, we need to estimate ∥c∥2 and $$\|\boldsymbol{c}\|_{\infty }$$ as follows,   \begin{align*} \|\boldsymbol{c}\|_{\infty} & = \frac{1}{2}\max_{1\leq l\leq L}\sum_{i=1}^{s}|\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i}|^{2} = \frac{1}{2} \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}), \\ \|\boldsymbol{c}\|_{2}^{2} & = \frac{1}{2}\sum_{l=1}^{L} \left|\sum_{i=1}^{s}|\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i}|^{2} \right|{}^{2} \leq \frac{1}{2}\left( \sum_{i=1}^{s}\sum_{l=1}^{L}|\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i}|^{2} \right)\max_{1\leq l\leq L}\sum_{i=1}^{s}|\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i}|^{2} \leq \frac{1}{2} \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}).
\end{align*} Applying (6.14) gives   \begin{align*} \mathbb{P}( \|\mathcal{A}(\boldsymbol{X})\|^{2} \geq 1 + t)\leq \exp\left(- \frac{t^{2}}{4 \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})}\right) \vee \exp\left(- \frac{t}{4\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})}\right). \end{align*} In particular, by setting   \begin{align*} t = g(\boldsymbol{h},\boldsymbol{x}):= 2 \sqrt{2 s\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K+N)\log L} + 8s\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K + N)\log L, \end{align*} we have   \begin{align*} \mathbb{P}\left(\|\mathcal{A}(\boldsymbol{X})\|^{2} \geq 1 + g(\boldsymbol{h},\boldsymbol{x})\right) \leq \textrm{e}^{ - 2 s(K+N)(\log L)}. \end{align*} So far, we have shown that f(h, x) ≤ 1 with probability at least $$1 - \textrm{e}^{- 2 s(K+N)(\log L)}$$ for a fixed pair of (h, x). Part II: covering argument. Now we will use a covering argument to extend this result for all (h, x) and thus prove that $$f(\boldsymbol{h}, \boldsymbol{x})\leq \frac{4}{3}$$ uniformly for all (h, x). We start with defining $$\mathcal{K}$$ and $$\mathcal{N}_{i}$$ as ϵ0-nets of $$\mathcal{S}^{Ks-1}$$ and $$\mathcal{S}^{N-1}$$ for h and xi, 1 ≤ i ≤ s, respectively. The bounds $$|\mathcal{K}|\leq \big(1+\frac{2}{\epsilon _{0}}\big)^{2sK}$$ and $$|\mathcal{N}_{i}|\leq \big(1+\frac{2}{\epsilon _{0}}\big)^{2N}$$ follow from the covering numbers of the sphere (Lemma 5.2 in [38]). Here we let $$\mathcal{N} := \mathcal{N}_{1}\times \cdots \times \mathcal{N}_{s}.$$ By taking the union bound over $$\mathcal{K}\times \mathcal{N},$$ we have that f(h, x) ≤ 1 holds uniformly for all $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{K} \times \mathcal{N}$$ with probability at least   \begin{align*} 1- \left(1+ 2/\epsilon_{0}\right)^{2s(K + N)} \textrm{e}^{ - 2s(K+N)\log L } = 1- \textrm{e}^{-2s(K + N)\left(\log L - \log \left(1 + 2/\varepsilon_{0}\right)\right)}. 
\end{align*} For any $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{S}^{Ks-1}\times \underbrace{\mathcal{S}^{N-1}\times \cdots \times \mathcal{S}^{N-1}}_{s\ \textrm{times}}$$, we can find a point $$(\boldsymbol{u}, \boldsymbol{v}) \in \mathcal{K} \times \mathcal{N}$$ satisfying ∥h −u∥≤ ε0 and ∥xi −vi∥ ≤ ε0 for all 1 ≤ i ≤ s. Conditioned on (6.4), we know that   \begin{align*} \|\mathcal{A}\|^{2}\leq s(N\log(NL/2) + (\gamma + \log s)\log L) \leq s(N + \gamma + \log s)\log L. \end{align*} Now we aim to evaluate |f(h, x) − f(u, v)|. First we consider |f(u, x) − f(u, v)|. Since $$\sigma ^{2}_{\max }(\boldsymbol{u}, \boldsymbol{x}) = \sigma ^{2}_{\max }(\boldsymbol{u},\boldsymbol{v})$$ if ∥xi∥ = ∥vi∥ = ∥u∥ = 1 for 1 ≤ i ≤ s, we have   \begin{align*} |f(\boldsymbol{u}, \boldsymbol{x}) - f(\boldsymbol{u}, \boldsymbol{v})| & = \left|\left\| \mathcal{A}(\mathcal{H}(\boldsymbol{u}, \boldsymbol{x}))\right\|_{F}^{2} - \left\| \mathcal{A}(\mathcal{H}(\boldsymbol{u},\boldsymbol{v})) \right\|_{F}^{2} \right| \\ & \leq \left\| \mathcal{A}(\mathcal{H}(\boldsymbol{u}, \boldsymbol{x} - \boldsymbol{v}))\right\| \cdot \left\| \mathcal{A}(\mathcal{H}(\boldsymbol{u}, \boldsymbol{x} + \boldsymbol{v}))\right\| \\ & \leq \|\mathcal{A}\|^{2} \sqrt{\sum_{i=1}^{s} \|\boldsymbol{u}_{i}\|^{2}\|\boldsymbol{x}_{i} - \boldsymbol{v}_{i}\|^{2}} \sqrt{\sum_{i=1}^{s} \|\boldsymbol{u}_{i}\|^{2}\|\boldsymbol{x}_{i} + \boldsymbol{v}_{i}\|^{2}} \\ & \leq 2\|\mathcal{A}\|^{2} \varepsilon_{0} \leq 2s(N + \gamma + \log s)(\log L)\varepsilon_{0}, \end{align*} where the first inequality is due to ||z1|2 −|z2|2|≤|z1 − z2||z1 + z2| for any $$z_{1}, z_{2} \in \mathbb{C}$$. 
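Returning to the weights $$\{c_{i}\}$$ from Part I: the identities $$\|\boldsymbol{c}\|_{\infty } = \frac{1}{2}\sigma ^{2}_{\max }$$ and $$\|\boldsymbol{c}\|_{2}^{2} \leq \frac{1}{2}\sigma ^{2}_{\max }$$ are deterministic and can be checked numerically. The sketch below is a real-valued toy in which B is taken to have orthonormal columns, so that $$\sum _{l}|\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}|^{2} = \|\boldsymbol{h}_{i}\|^{2}$$ (an assumption mirroring the structure of B in the measurement model; dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
s, K, L = 3, 5, 32
# B with orthonormal columns, so that sum_l (b_l^* h_i)^2 = ||h_i||^2
B, _ = np.linalg.qr(rng.standard_normal((L, K)))

h = rng.standard_normal(s * K)
h /= np.linalg.norm(h)                    # unit vector, as in Part I
hb = h.reshape(s, K)

# per-measurement variance sum_i |b_l^* h_i|^2, split over two chi^2_1 terms
row = np.array([sum((B[l] @ hb[i])**2 for i in range(s)) for l in range(L)])
c = np.repeat(row / 2.0, 2)               # c_{2l-1} = c_{2l} = row_l / 2

sigma2_max = row.max()                    # sigma_max^2(h, x) with ||x_i|| = 1
print(np.isclose(c.max(), sigma2_max / 2))           # ||c||_inf = sigma_max^2/2
print(bool(np.sum(c**2) <= sigma2_max / 2 + 1e-12))  # ||c||_2^2 <= sigma_max^2/2
```

The second bound uses $$\sum _{l}\mathrm{row}_{l} = \|\boldsymbol{h}\|^{2} = 1$$, exactly the step taken in the ∥c∥2 estimate above.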
We proceed to estimate |f(h, x) − f(u, x)| by using (6.12) and (6.13),   \begin{align*} | f(\boldsymbol{h}, \boldsymbol{x}) - f(\boldsymbol{u}, \boldsymbol{x})| & \leq J_{1} + J_{2} + J_{3} \\ & \leq (2\|\mathcal{A}\|^{2} + 2\sqrt{2s(K+N)\log L}+ 16s(K+N) \log L) \varepsilon_{0}\\ & \leq 25s(K + N + \gamma + \log s)(\log L) \varepsilon_{0}, \end{align*} where (6.13) and (6.12) give   \begin{align*} J_{1} & = \left| \|\mathcal{A}(\mathcal{H}(\boldsymbol{h},\boldsymbol{x}))\|_{F}^{2} - \|\mathcal{A}(\mathcal{H}(\boldsymbol{u},\boldsymbol{x}))\|_{F}^{2}\right| \leq \left\| \mathcal{A}( \mathcal{H}(\boldsymbol{h} - \boldsymbol{u},\boldsymbol{x}) )\right\| \left\| \mathcal{A}( \mathcal{H}(\boldsymbol{h} + \boldsymbol{u},\boldsymbol{x}) )\right\| \leq 2\|\mathcal{A}\|^{2} \varepsilon_{0}, \\ J_{2} & = 2 \sqrt{2s(K+N)\log L}\cdot |\sigma_{\max}(\boldsymbol{h}, \boldsymbol{x}) - \sigma_{\max}(\boldsymbol{u}, \boldsymbol{x})| \leq 2 \sqrt{2s(K+N)\log L} \varepsilon_{0}, \\ J_{3} & = 8s(K+N) (\log L) \cdot |\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}) - \sigma^{2}_{\max}(\boldsymbol{u}, \boldsymbol{x})| \leq 16s(K+N)(\log L) \varepsilon_{0}. 
\end{align*} Therefore, if $$\epsilon _{0} = \frac{1}{81s(N + K + \gamma + \log s)\log L}$$, there holds   \begin{align*} f(\boldsymbol{h},\boldsymbol{x}) \leq f(\boldsymbol{u},\boldsymbol{v}) + \underbrace{|f(\boldsymbol{u}, \boldsymbol{x}) - f(\boldsymbol{u}, \boldsymbol{v})| + |f(\boldsymbol{h},\boldsymbol{x}) -f(\boldsymbol{u},\boldsymbol{x}) |}_{\leq 27s(K+N+\gamma + \log s)(\log L)\varepsilon_{0}\leq \frac{1}{3}} \leq \frac{4}{3} \end{align*} for all (h, x) uniformly with probability at least $$1- \textrm{e}^{-2s(K + N)\left (\log L - \log \left (1 + 2/\varepsilon _{0}\right )\right )}.$$ By letting $$L \geq C_{\gamma }s(K+N)\log L$$ with Cγ reasonably large and γ ≥ 1, we have $$\log L - \log \left (1 + 2/\varepsilon _{0}\right ) \geq \frac{1}{2}(1 + \log (\gamma ))$$, so the uniform bound holds with probability at least $$1 - \frac{1}{\gamma }\exp (-s(K+N))$$. 6.2. Proof of the local restricted isometry property Lemma 6.8 Conditioned on (6.5) and (6.9), the following RIP-type property holds:   \begin{align*} \frac{2}{3} \|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}^{2} \leq \|\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0})\|^{2} \leq \frac{3}{2}\|\boldsymbol{X}-\boldsymbol{X}_{0}\|_{F}^{2} \end{align*} uniformly for all $$(\boldsymbol{h},\boldsymbol{x})\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ with μ ≥ μh and $$\epsilon \leq \frac{1}{15}$$ if $$L \geq C_{\gamma }\mu ^{2} s(K+N)\log ^{2} L$$ for some numerical constant Cγ. Proof. The proof consists of two steps: decompose X − X0 onto T and T⊥, then apply (6.5) and (6.9) to $$\mathcal{P}_{T}(\boldsymbol{X}-\boldsymbol{X}_{0})$$ and $$\mathcal{P}_{T^{\bot }}(\boldsymbol{X}-\boldsymbol{X}_{0}),$$ respectively. 
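As a numerical illustration of this two-step strategy (a sketch under the standard convention that Ti is the tangent space of the rank-one matrix hi0xi0* and PT(Z) = PhZ + ZPx − PhZPx, with Ph, Px the orthogonal projections onto span(h0) and span(x0); this projector formula is assumed here, not quoted from the paper), one can verify that PT is an orthogonal projection and that the two pieces are orthogonal:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 5, 4
h0 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x0 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
h0 /= np.linalg.norm(h0)
x0 /= np.linalg.norm(x0)
Ph = np.outer(h0, h0.conj())   # orthogonal projection onto span(h0)
Px = np.outer(x0, x0.conj())   # orthogonal projection onto span(x0)

def P_T(Z):
    # assumed tangent-space projector at the rank-one matrix h0 x0^*
    return Ph @ Z + Z @ Px - Ph @ Z @ Px

Z = rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))
U = P_T(Z)          # component in T
V = Z - U           # component in T^perp
idempotent = bool(np.allclose(P_T(U), U))
orthogonal = bool(abs(np.vdot(U, V)) < 1e-10)
```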
For any $$\boldsymbol{X} =\mathcal{H}(\boldsymbol{h},\boldsymbol{x})\in \mathcal{N}_{\epsilon }$$ with $$\delta _{i} \leq \varepsilon \leq \frac{1}{15}$$, we can decompose X − X0 as the sum of two block-diagonal matrices U = blkdiag(Ui, 1 ≤ i ≤ s) and V = blkdiag(Vi, 1 ≤ i ≤ s) where each pair (Ui, Vi) corresponds to the orthogonal decomposition of $$\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}$$,   \begin{align} \boldsymbol{h}_{i}\boldsymbol{x}^{*}_{i} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} := \underbrace{(\alpha_{i1} \overline{\alpha_{i2}} - 1)\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} + \overline{\alpha_{i2}} \tilde{\boldsymbol{h}}_{i} \boldsymbol{x}_{i0}^{*} + \alpha_{i1} \boldsymbol{h}_{i0}\tilde{\boldsymbol{x}}_{i}^{*}}_{\boldsymbol{U}_{i}\in T_{i}} + \underbrace{ \tilde{\boldsymbol{h}}_{i} \tilde{\boldsymbol{x}}_{i}^{*}}_{\boldsymbol{V}_{i} \in T_{i}^{\perp}} \end{align} (6.15) which has been briefly discussed in (6.2) and (6.3). Note that $$\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) = \mathcal{A}(\boldsymbol{U} + \boldsymbol{V})$$ and   \begin{align*} \|\mathcal{A}(\boldsymbol{U})\| - \|\mathcal{A}(\boldsymbol{V})\| \leq \|\mathcal{A}(\boldsymbol{U} + \boldsymbol{V})\| \leq \|\mathcal{A}(\boldsymbol{U})\| + \|\mathcal{A}(\boldsymbol{V})\|. \end{align*} Therefore, it suffices to have a two-sided bound for $$\|\mathcal{A}(\boldsymbol{U})\|$$ and an upper bound for $$\|\mathcal{A}(\boldsymbol{V})\|$$ where U ∈ T and V ∈ T⊥ in order to establish the local isometry property. Estimation of $$\|\mathcal{A}(\boldsymbol{U})\|$$: for $$\|\mathcal{A}(\boldsymbol{U})\|$$, we know from Lemma 6.3 that   \begin{align} \sqrt{\frac{9}{10}}\|\boldsymbol{U}\|_{F}\leq \|\mathcal{A}(\boldsymbol{U})\| \leq \sqrt{\frac{11}{10}}\|\boldsymbol{U}\|_{F} \end{align} (6.16) and hence we only need to compute ∥U∥F. 
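The block-wise decomposition (6.15) is an exact algebraic identity, which can be checked numerically. In the sketch below, αi1 and αi2 are taken as the inner-product coefficients of the orthogonal decompositions hi = αi1hi0 + h̃i and xi = αi2xi0 + x̃i (our assumed reading of (6.2), which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)
K, N = 6, 5
h0 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x0 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
h = h0 + 0.1 * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
x = x0 + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

# orthogonal decompositions h = a1*h0 + ht, x = a2*x0 + xt (assumed form of (6.2))
a1 = np.vdot(h0, h) / np.vdot(h0, h0)
a2 = np.vdot(x0, x) / np.vdot(x0, x0)
ht = h - a1 * h0
xt = x - a2 * x0

# U_i + V_i as in (6.15)
U = (a1 * np.conj(a2) - 1) * np.outer(h0, x0.conj()) \
    + np.conj(a2) * np.outer(ht, x0.conj()) + a1 * np.outer(h0, xt.conj())
V = np.outer(ht, xt.conj())
lhs = np.outer(h, x.conj()) - np.outer(h0, x0.conj())
decomp_ok = bool(np.allclose(lhs, U + V))
orth_ok = bool(abs(np.vdot(h0, ht)) < 1e-9 and abs(np.vdot(x0, xt)) < 1e-9)
```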
By Lemma 6.1, there also hold $$\|\boldsymbol{V}_{i}\|_{F} \leq \frac{{\delta _{i}^{2}}}{2(1 - \delta _{i})} d_{i0}$$ and δi di0 − ∥Vi∥F ≤ ∥Ui∥F ≤ δi di0 + ∥Vi∥F, i.e.,   \begin{align*} \left(\delta_{i} - \frac{{\delta_{i}^{2}}}{2(1 - \delta_{i})}\right)d_{i0} \leq \|\boldsymbol{U}_{i}\|_{F} \leq \left(\delta_{i} + \frac{{\delta_{i}^{2}}}{2(1 - \delta_{i})}\right)d_{i0}, \quad 1\leq i\leq s. \end{align*} With $$\|\boldsymbol{U}\|_{F}^{2} = \sum _{i=1}^{s} \|\boldsymbol{U}_{i}\|_{F}^{2}$$, it is easy to get $$\delta d_{0}\left (1 - \frac{\varepsilon }{2(1-\varepsilon )}\right ) \leq \|\boldsymbol{U}\|_{F} \leq \delta d_{0} \left (1 + \frac{\varepsilon }{2(1-\varepsilon )}\right )$$. Combined with (6.16), we get   \begin{align} \sqrt{\frac{9}{10}}\left(1 - \frac{\varepsilon}{2(1-\varepsilon)}\right)\delta d_{0} \leq \|\mathcal{A}(\boldsymbol{U}) \| \leq \sqrt{\frac{11}{10}}\left(1 + \frac{\varepsilon}{2(1-\varepsilon)}\right)\delta d_{0}. \end{align} (6.17) Estimation of $$\|\mathcal{A}(\boldsymbol{V})\|$$: note that V is a block-diagonal matrix with rank-one blocks, so applying Lemma 6.6 gives   \begin{align} \|\mathcal{A}(\boldsymbol{V})\|^{2} &\leq \frac{4}{3} \|\boldsymbol{V}\|_{F}^{2}+ 2 \sqrt{2s\|\boldsymbol{V}\|_{F}^{2} \sigma^{2}_{\max}(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}})(K+N)\log L} + 8s\sigma^{2}_{\max}(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}})(K+N) \log L, \end{align} (6.18) where $$\boldsymbol{V} = \mathcal{H}(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}})$$ and $$ \tilde{\boldsymbol{h}} = \left[ \begin{array}{@{}c@{}} \tilde{\boldsymbol{h}}_{1} \\ \vdots \\ \tilde{\boldsymbol{h}}_{s} \end{array}\right]. $$ It suffices to get an estimation of ∥V∥F and $$\sigma ^{2}_{\max }(\tilde{\boldsymbol{h}},\tilde{\boldsymbol{x}})$$ to bound $$\|\mathcal{A}(\boldsymbol{V})\|$$ in (6.18). 
Lemma 6.1 says that $$\|\tilde{\boldsymbol{h}}_{i}\| \|\tilde{\boldsymbol{x}}_{i}\| \leq \frac{{\delta _{i}^{2}}}{2(1 - \delta _{i})} d_{i0} \leq \frac{\varepsilon }{2(1-\varepsilon )} \delta _{i} d_{i0}$$ if ε < 1. Moreover,   \begin{align} \|\tilde{\boldsymbol{x}}_{i}\| \leq \frac{\delta_{i}}{1 - \delta_{i}}\|\boldsymbol{x}_{i}\| \leq \frac{2\delta_{i}}{1 - \delta_{i}} \sqrt{d_{i0}}, \quad \sqrt{L}\|\boldsymbol{B} \tilde{\boldsymbol{h}}_{i} \|_{\infty} \leq 6 \mu \sqrt{d_{i0}}, \quad 1\leq i\leq s \end{align} (6.19) if (h, x) belongs to $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }.$$ For ∥V∥F,   \begin{align*} \|\boldsymbol{V}\|_{F} = \sqrt{\sum_{i=1}^{s} \|\boldsymbol{V}_{i}\|_{F}^{2}} = \sqrt{\sum_{i=1}^{s} \|\tilde{\boldsymbol{h}}_{i}\|^{2} \|\tilde{\boldsymbol{x}}_{i}\|^{2}} \leq \frac{\varepsilon\delta d_{0}}{2(1-\varepsilon)}. \end{align*} Now we aim to get an upper bound for $$\sigma ^{2}_{\max }(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}})$$ by using (6.19),   \begin{align*} \sigma_{\max}^{2}(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}}) = \max_{1\leq l\leq L}\sum_{i=1}^{s} |\boldsymbol{b}^{*}_{l}\tilde{\boldsymbol{h}}_{i}|^{2} \|\tilde{\boldsymbol{x}}_{i}\|^{2} \leq C_{0}\frac{\mu^{2} \sum_{i=1}^{s}{\delta_{i}^{2}} d_{i0}^{2}}{L} = C_{0}\frac{\mu^{2}\delta^{2}{d_{0}^{2}}}{L}. \end{align*} By substituting the estimations of ∥V∥F and $$\sigma ^{2}_{\max }(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}})$$ into (6.18)   \begin{align} \|\mathcal{A}(\boldsymbol{V})\|^{2} \leq \frac{\varepsilon^{2} \delta^{2}{d_{0}^{2}}}{3(1-\varepsilon)^{2}} + \frac{\sqrt{2}\varepsilon \delta^{2}{d_{0}^{2}}}{1-\varepsilon} \sqrt{\frac{C_{0}\mu^{2} s (K+N)\log L}{L}} + \frac{8C_{0} \mu^{2} \delta^{2}{d_{0}^{2}}s(K+N)\log L}{L}. 
\end{align} (6.20) By letting $$L \geq C_{\gamma }\mu ^{2} s(K + N)\log ^{2} L$$ with Cγ sufficiently large and combining (6.20) and (6.17), we have   \begin{align*} \sqrt{\frac{2}{3}}\delta d_{0} \leq \|\mathcal{A}(\boldsymbol{U})\| - \|\mathcal{A}(\boldsymbol{V})\| \leq \|\mathcal{A}(\boldsymbol{U}+\boldsymbol{V})\| \leq \|\mathcal{A}(\boldsymbol{U})\| + \|\mathcal{A}(\boldsymbol{V})\| \leq \sqrt{\frac{3}{2}}\delta d_{0}, \end{align*} which gives $$\frac{2}{3}\|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}^{2} \leq \|\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0})\|^{2} \leq \frac{3}{2}\|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}^{2}.$$ 6.3. Proof of the local regularity condition We first introduce some notation: for all $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_d \cap \mathcal{N}_{\epsilon }$$, consider $$\alpha _{i1}, \alpha _{i2}, \tilde{\boldsymbol{h}}_{i}$$ and $$\tilde{\boldsymbol{x}}_{i}$$ defined in (6.2) and define   \begin{align*} \varDelta\boldsymbol{h}_{i} = \boldsymbol{h}_{i} - \alpha_{i} \boldsymbol{h}_{i0}, \quad \varDelta\boldsymbol{x}_{i} = \boldsymbol{x}_{i} - \overline{\alpha}_{i}^{-1}\boldsymbol{x}_{i0}, \end{align*} where   \begin{align*} \alpha_{i} (\boldsymbol{h}_{i}, \boldsymbol{x}_{i})= \begin{cases} (1 - \delta_{0})\alpha_{i1}, & \textrm{if}\ \|\boldsymbol{h}_{i}\|_{2} \geq \|\boldsymbol{x}_{i}\|_{2} \\ \frac{1}{(1 - \delta_{0})\overline{\alpha_{i2}}}, & \textrm{if}\ \|\boldsymbol{h}_{i}\|_{2} < \|\boldsymbol{x}_{i}\|_{2}\end{cases} \end{align*} with   \begin{align} \delta_{0} := \frac{\delta}{10}. \end{align} (6.21) The function αi(hi, xi) is defined for each block of $$\boldsymbol{X} = \mathcal{H}(\boldsymbol{h}, \boldsymbol{x}).$$ The particular form of αi(hi, xi) serves primarily to prove Lemma 6.11, i.e., the local regularity condition of G(h, x). 
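For illustration, the case split defining αi(hi, xi) can be written as a small helper (a hypothetical sketch; alpha1 and alpha2 stand for the coefficients αi1, αi2 from the orthogonal decomposition (6.2)):

```python
def alpha_i(h_norm, x_norm, alpha1, alpha2, delta0):
    """Branch choice for the balancing scalar alpha_i; alpha1 and alpha2
    play the role of the coefficients from the orthogonal decomposition."""
    if h_norm >= x_norm:
        return (1 - delta0) * alpha1              # first branch: (1 - delta0) * alpha_{i1}
    return 1 / ((1 - delta0) * alpha2.conjugate())  # second branch

a = alpha_i(2.0, 1.0, 1.0 + 0.0j, 1.0 + 0.0j, 0.05)  # first branch applies
b = alpha_i(1.0, 2.0, 1.0 + 0.0j, 1.0 + 0.0j, 0.05)  # second branch applies
```

Note that the two branches are reciprocal conjugates when αi1 = αi2, reflecting that αi rescales hi0 and xi0 in opposite directions while keeping their outer product fixed.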
We also define   \begin{align*} \varDelta\boldsymbol{h} : =\left[ \begin{array}{@{}c@{}} \boldsymbol{h}_{1} - \alpha_{1} \boldsymbol{h}_{10} \\ \vdots \\ \boldsymbol{h}_{s} - \alpha_{s} \boldsymbol{h}_{s0} \end{array}\right]\in\mathbb{C}^{Ks}, \quad \varDelta\boldsymbol{x} : =\left[ \begin{array}{@{}c@{}} \boldsymbol{x}_{1} - \overline{\alpha}_{1}^{-1} \boldsymbol{x}_{10} \\ \vdots \\ \boldsymbol{x}_{s} - \overline{\alpha}_{s}^{-1}\boldsymbol{x}_{s0} \end{array}\right]\in\mathbb{C}^{Ns}. \end{align*} The following lemma gives bounds on Δhi and Δxi. Lemma 6.9 For all $$(\boldsymbol{h},\boldsymbol{x}) \in \mathcal{N}_d \cap \mathcal{N}_{\epsilon }$$ with $$\epsilon \leq \frac{1}{15}$$, there hold   \begin{align*} \max\{ \|\Delta\boldsymbol{h}_{i}\|_{2}^{2}, \|\Delta\boldsymbol{x}_{i}\|_{2}^{2}\} \leq (7.5{\delta_{i}^{2}} + 2.88{\delta_{0}^{2}}) d_{i0},\quad \|\Delta\boldsymbol{h}_{i}\|_{2}^{2} \|\Delta\boldsymbol{x}_{i}\|_{2}^{2} \leq \frac{1}{26}({\delta_{i}^{2}} +{\delta_{0}^{2}}) d_{i0}^{2}. \end{align*} Moreover, if we assume $$(\boldsymbol{h}_{i}, \boldsymbol{x}_{i}) \in \mathcal{N}_{\mu }$$ additionally, we have $$ \sqrt{L}\|\boldsymbol{B}(\Delta \boldsymbol{h}_{i})\|_{\infty } \leq 6\mu \sqrt{d_{i0}}$$. Proof. We only consider the case ∥hi∥2 ≥ ∥xi∥2 with αi = (1 − δ0)αi1; the other case follows by symmetry. 
For both Δhi and Δxi, by definition,   \begin{align} \Delta\boldsymbol{h}_{i} = \boldsymbol{h}_{i} - \alpha_{i}\boldsymbol{h}_{i0} = \delta_{0} \alpha_{i1} \boldsymbol{h}_{i0} + \tilde{\boldsymbol{h}}_{i} \end{align} (6.22) and   \begin{align} \Delta\boldsymbol{x}_{i} = \boldsymbol{x}_{i} - \frac{1}{(1 - \delta_{0})\overline{\alpha_{i1}}} \boldsymbol{x}_{i0} = \left(\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha}_{i1}}\right)\boldsymbol{x}_{i0} + \tilde{\boldsymbol{x}}_{i}, \end{align} (6.23) where $$\boldsymbol{h}_{i} = \alpha _{i1}\boldsymbol{h}_{i0} + \tilde{\boldsymbol{h}}_{i}$$ and $$\boldsymbol{x}_{i} = \alpha _{i2}\boldsymbol{x}_{i0} + \tilde{\boldsymbol{x}}_{i}$$ come from the orthogonal decomposition in (6.2). We start by estimating ∥Δhi∥2. Note that $$\|\boldsymbol{h}_{i}\|_{2}^{2} \leq 4d_{i0}$$ and $$\|\alpha _{i1} \boldsymbol{h}_{i0}\|_{2}^{2}\leq \|\boldsymbol{h}_{i}\|_{2}^{2}$$ since $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_d\cap \mathcal{N}_{\mu }$$. By Lemma 6.1, we have   \begin{align} \|\Delta\boldsymbol{h}_{i}\|_{2}^{2} = \|\tilde{\boldsymbol{h}}_{i}\|_{2}^{2} +{\delta_{0}^{2}}\|\alpha_{i1} \boldsymbol{h}_{i0}\|_{2}^{2} \leq \left(\left(\frac{\delta_{i}}{1-\delta_{i}}\right)^{2} +{\delta_{0}^{2}}\right)\|\boldsymbol{h}_{i}\|_{2}^{2} \leq ( 4.6{\delta_{i}^{2}} + 4{\delta_{0}^{2}}) d_{i0}. \end{align} (6.24) Then we calculate ∥Δxi∥: from (6.23), we have   \begin{align*} \|\Delta\boldsymbol{x}_{i}\|^{2} = \left|\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha}_{i1}} \right|{}^{2}d_{i0} + \|\tilde{\boldsymbol{x}}_{i}\|^{2} \leq \left|\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha}_{i1}} \right|{}^{2}d_{i0} + \frac{4{\delta_{i}^{2}} d_{i0}}{(1 - \delta_{i})^{2}}, \end{align*} where Lemma 6.1 gives $$\|\tilde{\boldsymbol{x}}_{i}\|_{2} \leq \frac{\delta _{i}}{1-\delta _{i}}\|\boldsymbol{x}_{i}\|_{2} \leq \frac{2\delta _{i}}{1-\delta _{i}} \sqrt{d_{i0}}$$ for $$(\boldsymbol{h},\boldsymbol{x})\in \mathcal{N}_d\cap \mathcal{N}_{\epsilon }$$. 
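The first equality in (6.24) is the Pythagorean identity for the orthogonal pieces δ0αi1hi0 and h̃i of Δhi; a quick numerical confirmation (with αi1 and h̃i computed as in our assumed reading of (6.2)):

```python
import numpy as np

rng = np.random.default_rng(3)
K, delta0 = 6, 0.05
h0 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
h = h0 + 0.1 * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
a1 = np.vdot(h0, h) / np.vdot(h0, h0)   # alpha_{i1}
ht = h - a1 * h0                        # tilde{h}_i, orthogonal to h0
dh = delta0 * a1 * h0 + ht              # Delta h_i = h_i - (1 - delta0) alpha_{i1} h_{i0}
pythagoras_ok = bool(np.isclose(
    np.linalg.norm(dh) ** 2,
    np.linalg.norm(ht) ** 2 + delta0 ** 2 * np.linalg.norm(a1 * h0) ** 2,
))
```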
So it suffices to estimate $$\left | \alpha _{i2} - \frac{1}{(1 - \delta _{0})\overline{\alpha }_{i1}} \right |$$, which satisfies   \begin{align} \left|\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha_{i1}}}\right| = \frac{1}{|\alpha_{i1}|} \left| \overline{\alpha_{i1}} \alpha_{i2}- 1 - \frac{\delta_{0}}{1 - \delta_{0}} \right| \leq \frac{1}{|\alpha_{i1}|} \left( \left|(\overline{\alpha_{i1}} \alpha_{i2}- 1)\right| + \frac{\delta_{0}}{1 - \delta_{0}} \right). \end{align} (6.25) Lemma 6.1 implies that $$| \overline{\alpha _{i1}} \alpha _{i2}- 1| \leq \delta _{i}$$, and (6.2) gives   \begin{align} |\alpha_{i1}|^{2} = \frac{1}{d_{i0}}(\|\boldsymbol{h}_{i}\|^{2} - \| \tilde{\boldsymbol{h}}_{i} \|^{2}) \geq \frac{1}{d_{i0}}\left(1 - \frac{{\delta_{i}^{2}}}{(1-\delta_{i})^{2}} \right)\|\boldsymbol{h}_{i}\|^{2} \geq \left(1 - \frac{{\delta_{i}^{2}}}{(1-\delta_{i})^{2}} \right)(1-\varepsilon), \end{align} (6.26) where $$\|\tilde{\boldsymbol{h}}_{i}\| \leq \frac{\delta _{i}}{1-\delta _{i}}\|\boldsymbol{h}_{i}\|$$ and ∥hi∥2 ≥ ∥hi∥∥xi∥ ≥ (1 − ε)di0 if ∥hi∥ ≥ ∥xi∥. Substituting (6.26) into (6.25) gives   \begin{align*} \left|\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha_{i1}}}\right| \leq \frac{1}{\sqrt{1-\varepsilon}} \left(1 - \frac{{\delta_{i}^{2}}}{(1-\delta_{i})^{2}} \right)^{-1/2}\left(\delta_{i} + \frac{\delta_{0}}{1-\delta_{0}}\right) \leq 1.2(\delta_{i} + \delta_{0}). \end{align*} Then we have   \begin{align} \|\Delta\boldsymbol{x}_{i}\|_{2}^{2} \leq \left(1.44(\delta_{i}+\delta_{0})^{2}+ \frac{4{\delta^{2}_{i}}}{(1 - \delta_{i})^{2}}\right) d_{i0} \leq (7.5{\delta_{i}^{2}} + 2.88{\delta_{0}^{2}})d_{i0}. \end{align} (6.27) Finally, we try to bound ∥Δhi∥2∥Δxi∥2. Lemma 6.1 gives $$\|\tilde{\boldsymbol{h}}_{i}\|_{2} \|\tilde{\boldsymbol{x}}_{i}\|_{2} \leq \frac{{\delta _{i}^{2}}d_{i0}}{2(1 - \delta _{i})}$$ and |αi1| ≤ 2. 
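The equality in (6.25) is elementary: it uses 1/(1 − δ0) = 1 + δ0/(1 − δ0) and factors out 1/ᾱi1. A numerical spot check on generic complex values (illustrative only; the test values are arbitrary):

```python
a1, a2, delta0 = 0.9 + 0.2j, 1.1 - 0.1j, 0.05

lhs = abs(a2 - 1 / ((1 - delta0) * a1.conjugate()))
rhs = abs(a1.conjugate() * a2 - 1 - delta0 / (1 - delta0)) / abs(a1)
identity_ok = abs(lhs - rhs) < 1e-12
```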
Combining them along with (6.22), (6.23), (6.24) and (6.27), we have   \begin{align*} \|\Delta\boldsymbol{h}_{i}\|_{2}^{2} \|\Delta\boldsymbol{x}_{i}\|_{2}^{2} &\leq \|\tilde{\boldsymbol{h}}_{i}\|_{2}^{2}\|\tilde{\boldsymbol{x}}_{i}\|_{2}^{2} +{\delta_{0}^{2}} |\alpha_{i1}|^{2} \|\boldsymbol{h}_{i0}\|_{2}^{2} \|\Delta\boldsymbol{x}_{i}\|_{2}^{2} + \left|\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha}_{i1}}\right|{}^{2} \|\boldsymbol{x}_{i0}\|_{2}^{2} \|\Delta\boldsymbol{h}_{i}\|_{2}^{2} \\ & \leq \left(\frac{{\delta_{i}^{4}}}{4(1 - \delta_{i})^{2}} + 4{\delta_{0}^{2}} (7.5{\delta_{i}^{2}} + 2.88{\delta_{0}^{2}}) + 1.44(\delta_{i} + \delta_{0})^{2} (4.6{\delta_{i}^{2}} + 4{\delta_{0}^{2}} )\right) d_{i0}^{2} \\ & \leq \frac{({\delta_{i}^{2}} +{\delta_{0}^{2}})d_{i0}^{2}}{26}. \end{align*} By symmetry, similar results hold for the case ∥hi∥2 < ∥xi∥2 and $$\max \{\|\Delta \boldsymbol{h}_{i}\|^{2}, \|\Delta \boldsymbol{x}_{i}\|^{2}\} \leq (7.5{\delta _{i}^{2}} + 2.88{\delta _{0}^{2}})d_{i0}.$$ Next, under the additional assumption $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_{\mu }$$, we now prove $$\sqrt{L}\|\boldsymbol{B}(\Delta \boldsymbol{h}_{i})\|_{\infty } \leq 6\mu \sqrt{d_{i0}}$$: Case 1: ∥hi∥2 ≥ ∥xi∥2 and αi = (1 − δ0)αi1. Lemma 6.1 gives |αi1| ≤ 2, which implies  \begin{align*} \sqrt{L}\|\boldsymbol{B}(\Delta\boldsymbol{h}_{i})\|_{\infty} &\leq \sqrt{L}\|\boldsymbol{B}\boldsymbol{h}_{i} \|_{\infty} + (1 - \delta_{0}) |\alpha_{i1}|\sqrt{L}\|\boldsymbol{B}\boldsymbol{h}_{i0}\|_{\infty} \\ &\leq 4\mu\sqrt{d_{i0}} + 2(1 - \delta_{0})\mu_{h} \sqrt{d_{i0}} \leq 6\mu\sqrt{d_{i0}}. \end{align*} Case 2: ∥hi∥2 < ∥xi∥2 and $$\alpha _{i} = \frac{1}{(1-\delta _{0})\overline{\alpha _{i2}}}$$. Using the same argument as in (6.26) gives  \begin{align*} |\alpha_{i2}|^{2}\geq \left(1 - \frac{{\delta_{i}^{2}}}{(1-\delta_{i})^{2}} \right)(1-\varepsilon). 
\end{align*} Therefore,   \begin{align*} \sqrt{L}\|\boldsymbol{B}(\Delta\boldsymbol{h}_{i})\|_{\infty} &\leq \sqrt{L}\|\boldsymbol{B}\boldsymbol{h}_{i}\|_{\infty} + \frac{1}{(1 - \delta_{0}) |\overline{\alpha_{i2}}|} \sqrt{L}\|\boldsymbol{B}\boldsymbol{h}_{i0}\|_{\infty} \\ &\leq 4\mu\sqrt{d_{i0}} + \left(1 - \frac{{\delta_{i}^{2}}}{(1-\delta_{i})^{2}} \right)^{-1/2} \frac{\mu_{h} \sqrt{d_{i0}}}{(1-\delta_{0})\sqrt{1-\varepsilon}} \leq 6 \mu\sqrt{d_{i0}}. \end{align*} Lemma 6.10 (Local Regularity for F(h, x)) Conditioned on (6.5) and (6.9), the following inequality holds   \begin{align*} \operatorname{Re}\left(\langle \nabla F_{\boldsymbol{h}}, \Delta\boldsymbol{h} \rangle + \langle \nabla F_{\boldsymbol{x}}, \Delta\boldsymbol{x}\rangle\right) \geq \frac{\delta^{2}{d_{0}^{2}}}{8} - 2\sqrt{s}\delta d_{0} \|\mathcal{A}^{*}(\boldsymbol{e})\|, \end{align*} uniformly for any $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ with $$\epsilon \leq \frac{1}{15}$$ if $$L \geq C\mu ^{2} s(K+N)\log ^{2} L$$ for some numerical constant C. Proof. First note that   \begin{align*} I_{0} = \langle \nabla F_{\boldsymbol{h}}, \Delta\boldsymbol{h} \rangle + \overline{\langle \nabla F_{\boldsymbol{x}}, \Delta\boldsymbol{x}\rangle } = \sum_{i=1}^{s} \langle \nabla F_{\boldsymbol{h}_{i}}, \Delta\boldsymbol{h}_{i} \rangle + \overline{\langle \nabla F_{\boldsymbol{x}_{i}}, \Delta\boldsymbol{x}_{i} \rangle}. 
\end{align*} For each component, recalling (2.18) and (2.19), we have   \begin{align*} \langle \nabla F_{\boldsymbol{h}_{i}}, \Delta\boldsymbol{h}_{i} \rangle + \overline{\langle \nabla F_{\boldsymbol{x}_{i}}, \Delta\boldsymbol{x}_{i} \rangle} & = \langle \mathcal{A}_{i}^{*}(\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) - \boldsymbol{e})\boldsymbol{x}_{i}, \Delta\boldsymbol{h}_{i} \rangle + \overline{\langle (\mathcal{A}_{i}^{*}(\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) - \boldsymbol{e}))^{*}\boldsymbol{h}_{i}, \Delta\boldsymbol{x}_{i} \rangle} \\ & = \left\langle \mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) - \boldsymbol{e}, \mathcal{A}_{i}((\Delta\boldsymbol{h}_{i})\boldsymbol{x}_{i}^{*} + \boldsymbol{h}_{i} (\Delta\boldsymbol{x}_{i})^{*}) \right\rangle. \end{align*} Define Ui and Vi as   \begin{align} \boldsymbol{U}_{i} := \alpha_{i}\boldsymbol{h}_{i0}(\Delta\boldsymbol{x}_{i})^{*} + \overline{\alpha_{i}}^{-1}(\Delta\boldsymbol{h}_{i})\boldsymbol{x}_{i0}^{*} \in T_{i}, \quad \boldsymbol{V}_{i} := \Delta\boldsymbol{h}_{i}(\Delta\boldsymbol{x}_{i})^{*}. \end{align} (6.28) Here Vi does not necessarily belong to $$T^{\bot }_{i}.$$ From the construction of Δhi, Δxi, Ui and Vi, two simple relations hold:   \begin{align*} \boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} = \boldsymbol{U}_{i} + \boldsymbol{V}_{i}, \qquad (\Delta\boldsymbol{h}_{i})\boldsymbol{x}_{i}^{*} + \boldsymbol{h}_{i} (\Delta\boldsymbol{x}_{i})^{*} = \boldsymbol{U}_{i} + 2\boldsymbol{V}_{i}. \end{align*} Define U := blkdiag(U1, ⋯ , Us) and V := blkdiag(V1, ⋯ , Vs). 
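Both relations are exact algebraic identities once one writes a = αihi0 and b = ᾱi−1xi0, so that ab* = hi0xi0*, Δhi = hi − a and Δxi = xi − b. A numerical confirmation (an illustrative sketch; Ui is formed directly from a and b):

```python
import numpy as np

rng = np.random.default_rng(4)
K, N = 5, 4
h0 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x0 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
h = h0 + 0.1 * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
x = x0 + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
alpha = 0.95 + 0.1j

a = alpha * h0            # alpha_i h_{i0}
b = x0 / np.conj(alpha)   # bar{alpha}_i^{-1} x_{i0}; note a b^* = h0 x0^*
dh, dx = h - a, x - b     # Delta h_i, Delta x_i
U = np.outer(a, dx.conj()) + np.outer(dh, b.conj())
V = np.outer(dh, dx.conj())

rel1 = bool(np.allclose(np.outer(h, x.conj()) - np.outer(h0, x0.conj()), U + V))
rel2 = bool(np.allclose(np.outer(dh, x.conj()) + np.outer(h, dx.conj()), U + 2 * V))
```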
I0 can be simplified to   \begin{align*} I_{0} & = \sum_{i=1}^{s} \langle \nabla F_{\boldsymbol{h}_{i}}, \Delta\boldsymbol{h}_{i} \rangle + \overline{\langle \nabla F_{\boldsymbol{x}_{i}}, \Delta\boldsymbol{x}_{i} \rangle} = \sum_{i=1}^{s} \langle \mathcal{A}(\boldsymbol{U}+\boldsymbol{V})- \boldsymbol{e}, \mathcal{A}_{i}(\boldsymbol{U}_{i} + 2\boldsymbol{V}_{i})\rangle \\ & = \underbrace{\langle \mathcal{A}(\boldsymbol{U}+\boldsymbol{V}), \mathcal{A}(\boldsymbol{U} + 2\boldsymbol{V})\rangle}_{I_{01}} - \underbrace{\langle \boldsymbol{e}, \mathcal{A}(\boldsymbol{U} + 2\boldsymbol{V})\rangle}_{I_{02}}. \end{align*} Now we will give a lower bound for Re(I01) and an upper bound for Re(I02) so that the lower bound of Re(I0) is obtained. By the Cauchy–Schwarz inequality, Re(I01) has the lower bound   \begin{align} \operatorname{Re}(I_{01}) \geq (\|\mathcal{A}(\boldsymbol{U})\| - \|\mathcal{A}(\boldsymbol{V})\|) (\|\mathcal{A}(\boldsymbol{U})\| - 2\|\mathcal{A}(\boldsymbol{V})\|). \end{align} (6.29) In the following, we will give an upper bound for $$\|\mathcal{A}(\boldsymbol{V})\|$$ and a lower bound for $$\|\mathcal{A}(\boldsymbol{U})\|$$. Upper bound for $$\|\mathcal{A}(\boldsymbol{V})\|$$: note that V is a block-diagonal matrix with rank-one blocks, and applying Lemma 6.6 results in   \begin{align*} \|\mathcal{A}(\boldsymbol{V})\|^{2} \leq \frac{4}{3}\sum_{i=1}^{s} \|\boldsymbol{V}_{i}\|_{F}^{2} + 2\sigma_{\max}(\Delta\boldsymbol{h},\Delta\boldsymbol{x}) \|\boldsymbol{V}\|_{F}\sqrt{2s(K+N)\log L} + 8s\sigma_{\max}^{2}(\Delta\boldsymbol{h}, \Delta\boldsymbol{x})(K+N) \log L. \end{align*} By using Lemma 6.9, we have $$\|\Delta \boldsymbol{h}_{i}\|^{2} \leq (7.5{\delta _{i}^{2}} + 2.88{\delta _{0}^{2}})d_{i0}$$ and $$\sqrt{L}\|\boldsymbol{B}(\Delta \boldsymbol{h}_{i})\|_{\infty } \leq 6\mu \sqrt{d_{i0}}$$. 
Substituting them into $$\sigma ^{2}_{\max }(\Delta \boldsymbol{h},\Delta \boldsymbol{x})$$ gives   \begin{align*} \sigma_{\max}^{2}(\Delta\boldsymbol{h}, \Delta\boldsymbol{x}) = \max_{1\leq l\leq L}\left(\sum_{i=1}^{s} |\boldsymbol{b}_{l}^{*}\Delta\boldsymbol{h}_{i}|^{2} \|\Delta\boldsymbol{x}_{i}\|^{2}\right) \leq \frac{36\mu^{2}}{L} \sum_{i=1}^{s} \left(7.5{\delta_{i}^{2}} + 2.88{\delta_{0}^{2}}\right)d_{i0}^{2} \leq \frac{272\mu^{2} \delta^{2}{d_{0}^{2}}}{L}. \end{align*} For ∥V∥F, note that $$\|\Delta \boldsymbol{h}_{i}\|^{2}\|\Delta \boldsymbol{x}_{i}\|^{2} \leq \frac{1}{26}({\delta _{i}^{2}} +{\delta _{0}^{2}})d_{i0}^{2}$$ and thus   \begin{align*} \|\boldsymbol{V}\|^{2}_{F} = \sum_{i=1}^{s} \|\Delta\boldsymbol{h}_{i}\|^{2}\|\Delta\boldsymbol{x}_{i}\|^{2} \leq\frac{1}{26} \sum_{i=1}^{s}\left({\delta_{i}^{2}} +{\delta_{0}^{2}}\right)d_{i0}^{2} \leq \frac{1}{26} \cdot 1.01\delta^{2}{d_{0}^{2}} \leq \frac{\delta^{2}{d_{0}^{2}}}{25}. \end{align*} Then by $$\delta \leq \varepsilon \leq \frac{1}{15}$$ and letting $$L \geq C\mu ^{2} s(K + N)\log ^{2} L$$ for a sufficiently large numerical constant C, there holds   \begin{align} \|\mathcal{A}(\boldsymbol{V})\|^{2} \leq \frac{\delta^{2}{d_{0}^{2}}}{16} \implies \|\mathcal{A}(\boldsymbol{V})\| \leq \frac{\delta d_{0}}{4}. \end{align} (6.30) Lower bound for $$\|\mathcal{A}(\boldsymbol{U})\|$$: by the triangle inequality, $$\|\boldsymbol{U}\|_{F} \geq \delta d_{0} - \frac{1}{5} \delta d_{0} \geq \frac{4}{5}\delta d_{0}$$ if $$\epsilon \leq \frac{1}{15}$$ since ∥V∥F ≤ 0.2δd0. Since U ∈ T, by Lemma 6.3, there holds   \begin{align} \|\mathcal{A}(\boldsymbol{U})\| \geq \sqrt{\frac{9}{10}}\|\boldsymbol{U}\|_{F} \geq \frac{3}{4} \delta d_{0}. 
\end{align} (6.31) With the upper bound of $$\mathcal{A}(\boldsymbol{V})$$ in (6.30), the lower bound of $$\mathcal{A}(\boldsymbol{U})$$ in (6.31), and (6.29), we get $$\operatorname{Re}(I_{01}) \geq \frac{\delta ^{2}{d_{0}^{2}}}{8}.$$ Now let us give an upper bound for Re(I02),   \begin{align*} | I_{02} | & \leq \|\mathcal{A}^{*}(\boldsymbol{e})\| \|\boldsymbol{U} + 2\boldsymbol{V}\|_{*} = \|\mathcal{A}^{*}(\boldsymbol{e})\| \sum_{i=1}^{s}\|\underbrace{\boldsymbol{U}_{i} + 2\boldsymbol{V}_{i}}_{\textrm{rank-2}} \|_{*} \\ & \leq \sqrt{2}\|\mathcal{A}^{*}(\boldsymbol{e})\| \sum_{i=1}^{s} \|\boldsymbol{U}_{i} + 2\boldsymbol{V}_{i}\|_{F} \\ & \leq \sqrt{2s}\|\mathcal{A}^{*}(\boldsymbol{e})\| \|\boldsymbol{U} + 2\boldsymbol{V}\|_{F} \leq 2\sqrt{s}\delta d_{0} \|\mathcal{A}^{*}(\boldsymbol{e})\|, \end{align*} where ∥⋅∥ and ∥⋅∥* are a pair of dual norms and   \begin{align*} \|\boldsymbol{U} + 2\boldsymbol{V}\|_{F} \leq \|\boldsymbol{U} + \boldsymbol{V}\|_{F} + \|\boldsymbol{V}\|_{F} \leq \delta d_{0}+ 0.2\delta d_{0} \leq 1.2\delta d_{0}. \end{align*} Combining the estimation of Re(I01) and Re(I02) above leads to   \begin{align*} \operatorname{Re}( \langle \nabla F_{\boldsymbol{h}}, \Delta\boldsymbol{h}\rangle + \langle \nabla F_{\boldsymbol{x}}, \Delta\boldsymbol{x}\rangle) \geq \frac{\delta^{2}{d_{0}^{2}}}{8} - 2\sqrt{s}\delta d_{0} \|\mathcal{A}^{*}(\boldsymbol{e})\|. 
\end{align*} Lemma 6.11 (Local Regularity for G(h, x)) For any $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_d \cap \mathcal{N}_{\epsilon }$$ with $$\epsilon \leq \frac{1}{15}$$ and $$\frac{9}{10}d_{0} \leq d \leq \frac{11}{10}d_{0}$$, $$\frac{9}{10}d_{i0} \leq d_{i} \leq \frac{11}{10}d_{i0}$$, the following inequality holds uniformly   \begin{align} \operatorname{Re}\left(\langle \nabla G_{\boldsymbol{h}_{i}}, \Delta\boldsymbol{h}_{i} \rangle + \langle \nabla G_{\boldsymbol{x}_{i}}, \Delta\boldsymbol{x}_{i} \rangle\right) \geq 2\delta_{0}\sqrt{ \rho G_{i}(\boldsymbol{h}_{i}, \boldsymbol{x}_{i})} = \frac{\delta}{5}\sqrt{ \rho G_{i}(\boldsymbol{h}_{i}, \boldsymbol{x}_{i})}, \end{align} (6.32) where ρ ≥ d2 + 2∥e∥2. Immediately, we have   \begin{align} \operatorname{Re}\left(\langle \nabla G_{\boldsymbol{h}}, \Delta\boldsymbol{h} \rangle + \langle \nabla G_{\boldsymbol{x}}, \Delta\boldsymbol{x} \rangle\right) =\sum_{i=1}^{s}\operatorname{Re}\left(\langle \nabla G_{\boldsymbol{h}_{i}}, \Delta\boldsymbol{h}_{i} \rangle + \langle \nabla G_{\boldsymbol{x}_{i}}, \Delta\boldsymbol{x}_{i} \rangle\right) \geq{\frac{\delta}{5}}\sqrt{ \rho G(\boldsymbol{h}, \boldsymbol{x})}. \end{align} (6.33) Remark 6.12 For the local regularity condition for G(h, x), we use the results from [21] when s = 1. This is because each component Gi(h, x) only depends on (hi, xi) by definition and thus the lower bound of $$\operatorname{Re}\left (\langle \nabla G_{\boldsymbol{h}_{i}}, \Delta \boldsymbol{h}_{i} \rangle + \langle \nabla G_{\boldsymbol{x}_{i}}, \Delta \boldsymbol{x}_{i} \rangle \right )$$ is completely determined by (hi, xi) and δ0, and is independent of s. Proof. 
For each 1 ≤ i ≤ s, ∇Ghi (respectively ∇Gxi) depends only on hi (respectively xi), and there holds   \begin{align*} \operatorname{Re}\left(\langle \nabla G_{\boldsymbol{h}_{i}}, \Delta\boldsymbol{h}_{i} \rangle + \langle \nabla G_{\boldsymbol{x}_{i}}, \Delta\boldsymbol{x}_{i} \rangle\right) \geq 2\delta_{0}\sqrt{ \rho G_{i}(\boldsymbol{h}_{i}, \boldsymbol{x}_{i})} = \frac{\delta}{5}\sqrt{ \rho G_{i}(\boldsymbol{h}_{i}, \boldsymbol{x}_{i})}, \end{align*} which follows exactly from Lemma 5.17 in [21]. For (6.33), by definition of ∇Gh and ∇Gx in (2.20) and (2.21),   \begin{align*} \operatorname{Re}\left(\langle \nabla G_{\boldsymbol{h}}, \Delta\boldsymbol{h} \rangle + \langle \nabla G_{\boldsymbol{x}}, \Delta\boldsymbol{x} \rangle\right) & = \sum_{i=1}^{s}\operatorname{Re}\left(\langle \nabla G_{\boldsymbol{h}_{i}}, \Delta\boldsymbol{h}_{i} \rangle + \langle \nabla G_{\boldsymbol{x}_{i}}, \Delta\boldsymbol{x}_{i} \rangle\right) \\ &\geq \frac{\delta}{5} \sum_{i=1}^{s}\sqrt{ \rho G_{i}(\boldsymbol{h}_{i}, \boldsymbol{x}_{i})} \geq \frac{\delta}{5}\sqrt{\rho G(\boldsymbol{h},\boldsymbol{x})}, \end{align*} where $$G(\boldsymbol{h},\boldsymbol{x}) = \sum _{i=1}^{s} G_{i}(\boldsymbol{h}_{i},\boldsymbol{x}_{i}).$$ Lemma 6.13 (Local Regularity Condition) Conditioned on (5.3), for the objective function $$\widetilde{F}(\boldsymbol{h},\boldsymbol{x})$$ in (2.17), there exists a positive constant ω such that   \begin{align} \|\nabla \widetilde{F}(\boldsymbol{h}, \boldsymbol{x})\|^{2} \geq \omega \left[ \widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) - c \right]_{+} \end{align} (6.34) with $$c = \|\boldsymbol{e}\|^{2} + 2000s \|\mathcal{A}^{*}(\boldsymbol{e})\|^{2}$$ and $$\omega = \frac{d_{0}}{7000}$$ for all $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$. Here we set ρ ≥ d2 + 2∥e∥2. Proof. 
Following from Lemma 6.10 and Lemma 6.11, we have   \begin{align*} &\operatorname{Re}( \langle \nabla F_{\boldsymbol{h}}, \Delta\boldsymbol{h}\rangle + \langle \nabla F_{\boldsymbol{x}}, \Delta\boldsymbol{x}\rangle) \geq \frac{\delta^{2}{d_{0}^{2}}}{8} - 2\sqrt{s}\delta d_{0} \|\mathcal{A}^{*}(\boldsymbol{e})\| \\ &\operatorname{Re}( \langle \nabla G_{\boldsymbol{h}}, \Delta\boldsymbol{h} \rangle + \langle \nabla G_{\boldsymbol{x}}, \Delta\boldsymbol{x} \rangle) \geq \frac{\delta d}{5} \sqrt{ G(\boldsymbol{h}, \boldsymbol{x})} \geq \frac{9\delta d_{0}}{50}\sqrt{G(\boldsymbol{h}, \boldsymbol{x})} \end{align*} for all $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ where ρ ≥ d2 + 2∥e∥2 ≥ d2 and $$\frac{9}{10}d_{0} \leq d \leq \frac{11}{10}d_{0}$$. Adding them together yields a lower bound with $$\operatorname{Re}\left (\langle \nabla \widetilde{F}_{\boldsymbol{h}}, \Delta \boldsymbol{h} \rangle + \langle \nabla \widetilde{F}_{\boldsymbol{x}}, \Delta \boldsymbol{x} \rangle \right )$$ on the left-hand side. Moreover, the Cauchy–Schwarz inequality implies   \begin{align*} \operatorname{Re}\left (\langle \nabla \widetilde{F}_{\boldsymbol{h}}, \Delta\boldsymbol{h} \rangle + \langle \nabla \widetilde{F}_{\boldsymbol{x}}, \Delta\boldsymbol{x} \rangle\right) \leq 4\delta\sqrt{d_{0}} \| \nabla \widetilde{F}(\boldsymbol{h}, \boldsymbol{x})\|, \end{align*} where both ∥Δh∥2 and ∥Δx∥2 are bounded by 8δ2d0 in Lemma 6.9 since   \begin{align*} \|\Delta\boldsymbol{h}\|^{2} = \sum_{i=1}^{s} \|\Delta\boldsymbol{h}_{i}\|^{2} \leq \sum_{i=1}^{s} (7.5{\delta_{i}^{2}} + 2.88{\delta_{0}^{2}}) d_{i0} \leq 8\delta^{2} d_{0}. \end{align*} Therefore,   \begin{align} \frac{\delta^{2}{d_{0}^{2}}}{8} + \frac{ 9\delta d_{0}\sqrt{ G(\boldsymbol{h}, \boldsymbol{x})}}{50} - 2\sqrt{s}\delta d_{0} \|\mathcal{A}^{*}(\boldsymbol{e}) \| \leq 4 \delta \sqrt{d_{0}} \| \nabla \widetilde{F}(\boldsymbol{h}, \boldsymbol{x})\|. 
\end{align} (6.35) Dividing both sides of (6.35) by δd0, we obtain   \begin{align*} \frac{4}{\sqrt{d_{0}}} \|\nabla \widetilde{F}(\boldsymbol{h}, \boldsymbol{x})\| & \geq \frac{\delta d_{0}}{12} + \frac{9}{50}\sqrt{G(\boldsymbol{h}, \boldsymbol{x})} + \frac{\delta d_{0} }{24} - 2\sqrt{s}\|\mathcal{A}^{*}(\boldsymbol{e})\| \\ & \geq \frac{1}{6\sqrt{6}}[\sqrt{F_{0}(\boldsymbol{h},\boldsymbol{x})} + \sqrt{G(\boldsymbol{h}, \boldsymbol{x})}] + \frac{\delta d_{0}}{24} - 2\sqrt{s}\|\mathcal{A}^{*}(\boldsymbol{e})\|, \end{align*} where the Local RIP condition (5.3) implies $$F_{0}(\boldsymbol{h}, \boldsymbol{x}) \leq \frac{3}{2}\delta ^{2}{d_{0}^{2}}$$ and hence $$\frac{\delta d_{0}}{12} \geq \frac{1}{6\sqrt{6}}\sqrt{F_{0}(\boldsymbol{h}, \boldsymbol{x})}$$, where F0(h, x) is defined in (2.12). Note that (5.6) gives   \begin{align} \sqrt{2\left[ \operatorname{Re}(\langle \mathcal{A}^{*}(\boldsymbol{e}), \boldsymbol{X} - \boldsymbol{X}_{0}\rangle) \right]_{+}} \leq \sqrt{ 2\sqrt{2s} \|\mathcal{A}^{*}(\boldsymbol{e})\| \delta d_{0}} \leq \frac{\sqrt{6}\delta d_{0}}{4} + \frac{4\sqrt{s}}{\sqrt{6}}\|\mathcal{A}^{*}(\boldsymbol{e})\|. \end{align} (6.36) By (6.36) and $$\widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) - \|\boldsymbol{e}\|^{2} \leq F_{0}(\boldsymbol{h}, \boldsymbol{x}) + 2 [\operatorname{Re}(\langle \mathcal{A}^{*}(\boldsymbol{e}), \boldsymbol{X} - \boldsymbol{X}_{0}\rangle )]_{+} + G(\boldsymbol{h}, \boldsymbol{x})$$, there holds   \begin{align*} \frac{4}{\sqrt{d_{0}}} \|\nabla \widetilde{F}(\boldsymbol{h}, \boldsymbol{x})\| \geq & \frac{1}{6\sqrt{6}} \left[ \left(\sqrt{F_{0}(\boldsymbol{h}, \boldsymbol{x})}\right. 
+\sqrt{2\left[ \operatorname{Re}(\langle \mathcal{A}^{*}(\boldsymbol{e}), \boldsymbol{X}-\boldsymbol{X}_{0}\rangle) \right]_{+}} + \sqrt{G(\boldsymbol{h}, \boldsymbol{x})}\right) \\ & + \frac{\delta d_{0}}{24} - \frac{1}{6\sqrt{6}} \left( \frac{\sqrt{6}\delta d_{0}}{4} + \frac{4\sqrt{s}}{\sqrt{6}}\|\mathcal{A}^{*}(\boldsymbol{e})\|\right) - 2\sqrt{s}\|\mathcal{A}^{*}(\boldsymbol{e})\|\\ \geq & \frac{1}{6\sqrt{6}} \left[ \sqrt{ \left[\widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) - \|\boldsymbol{e}\|^{2}\right]_{+}} - \sqrt{1000s}\|\mathcal{A}^{*}(\boldsymbol{e})\|\right]. \end{align*} For any non-negative real numbers a and b, we have $$[\sqrt{(x - a)_{+}} - b ]_{+} + b \geq \sqrt{(x - a)_{+}} $$ and it implies   \begin{align*} ( x - a)_{+} \leq 2 ( [\sqrt{(x - a)_{+}} - b ]_{+}^{2} + b^{2}) \Longrightarrow [\sqrt{(x - a)_{+}} - b ]_{+}^{2} \geq \frac{(x - a)_{+}}{2} - b^{2}. \end{align*} Therefore, by setting a = ∥e∥2 and $$b = \sqrt{1000s}\|\mathcal{A}^{*}(\boldsymbol{e})\|$$, there holds   \begin{align*} \|\nabla \widetilde{F}(\boldsymbol{h}, \boldsymbol{x})\|^{2} & \geq \frac{d_{0}}{3500} \left[ \frac{\widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) - \|\boldsymbol{e}\|^{2} }{2} - 1000s \|\mathcal{A}^{*}(\boldsymbol{e})\|^{2} \right]_{+} \\ & \geq \frac{d_{0}}{7000} \left[ \widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) - (\|\boldsymbol{e}\|^{2} + 2000s \|\mathcal{A}^{*}(\boldsymbol{e})\|^{2}) \right]_{+}. \end{align*} 6.4. 
Local smoothness Lemma 6.14 Conditioned on (5.3), (5.4) and (6.4), for any $$\boldsymbol{z} : = (\boldsymbol{h}, \boldsymbol{x})\in \mathbb{C}^{(K+N)s}$$ and $$\boldsymbol{w} : = (\boldsymbol{u}, \boldsymbol{v})\in \mathbb{C}^{(K+N)s}$$ such that z and $$\boldsymbol{z}+\boldsymbol{w} \in \mathcal{N}_{\epsilon } \cap \mathcal{N}_{\widetilde{F}}$$, there holds   \begin{align*} \| \nabla\widetilde{F}(\boldsymbol{z} + \boldsymbol{w}) - \nabla\widetilde{F}(\boldsymbol{z}) \| \leq C_{L} \|\boldsymbol{w}\|, \end{align*} with   \begin{align*} C_{L} \leq \left(10\|\mathcal{A}\|^{2}d_{0} + \frac{2\rho}{\min d_{i0}} \left( 5 + \frac{2L}{\mu^{2}} \right)\right)\!, \end{align*} where ρ ≥ d2 + 2∥e∥2 and $$\|\mathcal{A}\| \leq \sqrt{s(N\log (NL/2) + (\gamma +\log s)\log L)}$$ holds with probability at least 1 − L−γ from Lemma 6.2. In particular, $$L = \mathcal{O}((\mu ^{2} + \sigma ^{2})s(K + N)\log ^{2} L)$$ and $$\|\boldsymbol{e}\|^{2} = \mathcal{O}(\sigma ^{2}{d_{0}^{2}})$$ follows from $$\|\boldsymbol{e}\|^{2} \sim \frac{\sigma ^{2}{d_{0}^{2}}}{2L} \chi ^{2}_{2L}$$ and (6.14). Therefore, CL can be simplified to   \begin{align*} C_{L} = \mathcal{O}(d_{0}s\kappa(1 + \sigma^{2})(K + N)\log^{2} L ) \end{align*} by choosing ρ ≈ d2 + 2∥e∥2. Proof. By Lemma 5.6, we know that both z = (h, x) and $$\boldsymbol{z}+\boldsymbol{w}=(\boldsymbol{h}+\boldsymbol{u}, \boldsymbol{x}+\boldsymbol{v}) \in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$. Note that   \begin{align*} \nabla \widetilde{F} = (\nabla \widetilde{F}_{\boldsymbol{h}}, \nabla \widetilde{F}_{\boldsymbol{x}}) = (\nabla F_{\boldsymbol{h}} + \nabla G_{\boldsymbol{h}}, \nabla F_{\boldsymbol{x}} + \nabla G_{\boldsymbol{x}}), \end{align*} where (2.18), (2.19), (2.20) and (2.21) give ∇Fh, ∇Fx, ∇Gh and ∇Gx. It suffices to find out the Lipschitz constants for all of those four functions. Step 1: we first estimate the Lipschitz constant for ∇Fh and the result can be applied to ∇Fx due to symmetry.   
\begin{align*} \nabla F_{\boldsymbol{h}}(\boldsymbol{z} + \boldsymbol{w}) - \nabla F_{\boldsymbol{h}}(\boldsymbol{z}) &= \mathcal{A}^{*}\mathcal{A}( \mathcal{H}(\boldsymbol{h}+\boldsymbol{u},\boldsymbol{x} +\boldsymbol{v})) (\boldsymbol{x} + \boldsymbol{v}) - \left[ \mathcal{A}^{*}\mathcal{A}(\mathcal{H}(\boldsymbol{h},\boldsymbol{x}))\boldsymbol{x} + \mathcal{A}^{*}(\boldsymbol{y}) \boldsymbol{v}\right] \\ & = \mathcal{A}^{*}( \mathcal{A}(\mathcal{H}(\boldsymbol{h}+\boldsymbol{u}, \boldsymbol{x}+ \boldsymbol{v}) - \mathcal{H}(\boldsymbol{h},\boldsymbol{x})) )(\boldsymbol{x} + \boldsymbol{v}) \\ & \quad + \mathcal{A}^{*}\mathcal{A}( \mathcal{H}(\boldsymbol{h},\boldsymbol{x}) - \mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0}) )\boldsymbol{v} - \mathcal{A}^{*}(\boldsymbol{e}) \boldsymbol{v} \\ & = \mathcal{A}^{*}( \mathcal{A}( \mathcal{H}(\boldsymbol{h}+\boldsymbol{u},\boldsymbol{v}) + \mathcal{H}(\boldsymbol{u}, \boldsymbol{x}) ) )(\boldsymbol{x} + \boldsymbol{v}) \\ & \quad + \mathcal{A}^{*}\mathcal{A}( \mathcal{H}(\boldsymbol{h},\boldsymbol{x}) - \mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0}) )\boldsymbol{v} - \mathcal{A}^{*}(\boldsymbol{e}) \boldsymbol{v}. 
\end{align*} Note that $$\|\mathcal{H}(\boldsymbol{h}, \boldsymbol{x})\|_{F}\leq \sqrt{\sum _{i=1}^{s} \|\boldsymbol{h}_{i}\|^{2}\|\boldsymbol{x}_{i}\|^{2}} \leq \|\boldsymbol{h}\|\|\boldsymbol{x}\|$$ and $$\boldsymbol{z}, \boldsymbol{z}+\boldsymbol{w} \in \mathcal{N}_d$$ directly implies   \begin{align*} \|\mathcal{H}(\boldsymbol{u},\boldsymbol{x}) + \mathcal{H}(\boldsymbol{h}+\boldsymbol{u},\boldsymbol{v})\|_{F} \leq \|\boldsymbol{u}\| \|\boldsymbol{x}\| + \|\boldsymbol{h}+\boldsymbol{u}\|\|\boldsymbol{v}\| \leq 2\sqrt{d_{0}} (\|\boldsymbol{u}\| + \|\boldsymbol{v}\|), \end{align*} where $$\|\boldsymbol{h} + \boldsymbol{u}\| \leq 2\sqrt{d_{0}}.$$ Moreover, (31) implies   \begin{align*} \|\mathcal{H}(\boldsymbol{h},\boldsymbol{x}) - \mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0})\|_{F} \leq \epsilon d_{0} \end{align*} since $$\boldsymbol{z}\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }.$$ Combined with $$\|\mathcal{A}^{*}(\boldsymbol{e})\| \leq \varepsilon d_{0}$$ in (5.4) and $$\|\boldsymbol{x} + \boldsymbol{v}\| \leq 2\sqrt{d_{0}}$$, we have   \begin{align} \|\nabla F_{\boldsymbol{h}}(\boldsymbol{z} + \boldsymbol{w}) - \nabla F_{\boldsymbol{h}}(\boldsymbol{z}) \| & \leq 4d_{0} \|\mathcal{A}\|^{2}(\|\boldsymbol{u}\| + \|\boldsymbol{v}\|) + \varepsilon d_{0} \|\mathcal{A}\|^{2} \|\boldsymbol{v}\| + \varepsilon d_{0} \|\boldsymbol{v}\| \nonumber \\ & \leq 5d_{0} \|\mathcal{A}\|^{2} ( \|\boldsymbol{u}\| + \|\boldsymbol{v}\|). \end{align} (6.37) Due to the symmetry between ∇Fh and ∇Fx, we have,   \begin{align} \|\nabla F_{\boldsymbol{x}}(\boldsymbol{z} + \boldsymbol{w}) - \nabla F_{\boldsymbol{x}}(\boldsymbol{z}) \| \leq 5d_{0}\|\mathcal{A}\|^{2} ( \|\boldsymbol{u}\| + \|\boldsymbol{v}\|). 
\end{align} (6.38) In other words,   \begin{align*} \| \nabla F(\boldsymbol{z} + \boldsymbol{w}) - \nabla F(\boldsymbol{z}) \| \leq 5\sqrt{2}d_{0} \|\mathcal{A}\|^{2}(\|\boldsymbol{u}\| + \|\boldsymbol{v}\|) \leq 10d_{0}\|\mathcal{A}\|^{2}\|\boldsymbol{w}\|, \end{align*} where $$\|\boldsymbol{u}\| + \|\boldsymbol{v}\| \leq \sqrt{2}\|\boldsymbol{w}\|.$$ Step 2: we estimate the upper bound of ∥∇Gxi(zi +wi) −∇Gxi(zi)∥. Implied by Lemma 5.19 in [21], we have   \begin{align} \| \nabla G_{\boldsymbol{x}_{i}}(\boldsymbol{z}_{i} + \boldsymbol{w}_{i}) - \nabla G_{\boldsymbol{x}_{i}}(\boldsymbol{z}_{i}) \| \leq \frac{5d_{i0} \rho}{{d_{i}^{2}}} \|\boldsymbol{v}_{i}\|. \end{align} (6.39) Step 3: we estimate the upper bound of ∥∇Ghi(z + w) −∇Ghi(z)∥. Denote   \begin{align*} \nabla G_{\boldsymbol{h}_{i}}(\boldsymbol{z} + \boldsymbol{w}) - \nabla G_{\boldsymbol{h}_{i}}(\boldsymbol{z}) =& \underbrace{\frac{\rho}{2d_{i}}\left[G^{\prime}_{0}\left(\frac{\|\boldsymbol{h}_{i} + \boldsymbol{u}_{i}\|^{2}}{2d_{i}}\right) (\boldsymbol{h}_{i} + \boldsymbol{u}_{i}) - G^{\prime}_{0}\left(\frac{\|\boldsymbol{h}_{i}\|^{2}}{2d_{i}}\right) \boldsymbol{h}_{i}\right] }_{\boldsymbol{j}_{1}} \\ & \underbrace{+ \frac{\rho L}{8d_{i}\mu^{2} }\sum_{l=1}^{L} \left[\!G^{\prime}_{0}\left(\frac{L|\boldsymbol{b}_{l}^{*}(\boldsymbol{h}_{i} + \boldsymbol{u}_{i})|^{2}}{8d_{i}\mu^{2}}\right) \boldsymbol{b}_{l}^{*}(\boldsymbol{h}_{i} + \boldsymbol{u}_{i}) \!-\! G^{\prime}_{0}\left(\frac{L|\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}|^{2}}{8d_{i}\mu^{2}}\right) \boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i} \right]\boldsymbol{b}_{l}}_{\boldsymbol{j}_{2}}\!. \end{align*} Following the same estimation of j1 and j2 in Lemma 5.19 of [21], we have   \begin{align} \|\boldsymbol{j}_{1}\| \leq \frac{5d_{i0} \rho}{{d_{i}^{2}}} \|\boldsymbol{u}_{i}\|, \quad \|\boldsymbol{j}_{2}\| \leq \frac{3\rho Ld_{i0}}{2{d_{i}^{2}}\mu^{2}}\|\boldsymbol{u}_{i}\|. 
\end{align} (6.40) Therefore, combining (6.39) and (6.40) gives   \begin{align*} \|\nabla G(\boldsymbol{z} + \boldsymbol{w}) - \nabla G(\boldsymbol{z})\| & = \sqrt{\sum_{i=1}^{s} \left(\| \nabla G_{\boldsymbol{h}_{i}}(\boldsymbol{z} + \boldsymbol{w}) - \nabla G_{\boldsymbol{h}_{i}}(\boldsymbol{z}) \|^{2} + \| \nabla G_{\boldsymbol{x}_{i}}(\boldsymbol{z} + \boldsymbol{w}) - \nabla G_{\boldsymbol{x}_{i}}(\boldsymbol{z}) \|^{2}\right)} \\ & \leq \max\left\{\frac{5d_{i0} \rho}{{d_{i}^{2}}} + \frac{3\rho Ld_{i0}}{2{d_{i}^{2}}\mu^{2}}\right\} \sqrt{\sum_{i=1}^{s}\|\boldsymbol{u}_{i}\|^{2}} + \max\left\{\frac{5d_{i0} \rho}{{d_{i}^{2}}}\right\} \sqrt{\sum_{i=1}^{s}\|\boldsymbol{v}_{i}\|^{2}} \\ & \leq \max\left\{\frac{5d_{i0} \rho}{{d_{i}^{2}}} + \frac{3\rho Ld_{i0}}{2{d_{i}^{2}}\mu^{2}}\right\} \|\boldsymbol{u}\| + \max\left\{\frac{5d_{i0} \rho}{{d_{i}^{2}}}\right\} \|\boldsymbol{v}\| \\ & \leq \frac{2\rho}{\min d_{i0}} \left( 5 + \frac{2L}{\mu^{2}} \right)\|\boldsymbol{w}\|. \end{align*} In summary, the Lipschitz constant CL of $$\widetilde{F}(\boldsymbol{z})$$ has an upper bound as follows:   \begin{align*} \|\nabla \widetilde{F}(\boldsymbol{z} + \boldsymbol{w}) - \nabla \widetilde{F}(\boldsymbol{z})\| & \leq \|\nabla F(\boldsymbol{z} + \boldsymbol{w}) - \nabla F(\boldsymbol{z})\| + \|\nabla G(\boldsymbol{z} + \boldsymbol{w}) - \nabla G(\boldsymbol{z})\| \\ & \leq \left(10\|\mathcal{A}\|^{2}d_{0} + \frac{2\rho}{\min d_{i0}} \left( 5 + \frac{2L}{\mu^{2}} \right)\right) \|\boldsymbol{w}\|. \end{align*} 6.5. Robustness condition and spectral initialization In this section, we will prove the robustness condition (5.4) and also Theorem 3.2. To prove (5.4), it suffices to show the following lemma, which is a more general version of (5.4). Lemma 6.15 Consider a sequence of independent Gaussian random variables $$\boldsymbol{c} = (c_{1}, \cdots , c_{L})\in \mathbb{C}^{L}$$ where $$c_{l}\sim \mathcal{C}\mathcal{N}\big(0, \frac{{\lambda _{l}^{2}}}{L}\big)$$ with $$\lambda_{l} \leq \lambda$$.
Moreover, we assume $$\mathcal{A}_{i}$$ in (2.2) is independent of c. Then there holds   \begin{align*} \|\mathcal{A}^{*}(\boldsymbol{c})\| = \max_{1\leq i\leq s}\|\mathcal{A}_{i}^{*}(\boldsymbol{c} )\| \leq \xi \end{align*} with probability at least 1 − L−γ if $$L \geq C_{\gamma + \log (s)}\big( \frac{\lambda }{\xi } +\frac{\lambda ^{2}}{\xi ^{2}} \big)\max \{ K,N \}\log L.$$ Proof. It suffices to show that $$\max _{1\leq i\leq s}\|\mathcal{A}_{i}^{*}(\boldsymbol{c})\| \leq \xi $$. For each fixed i : 1 ≤ i ≤ s,   \begin{align*} \mathcal{A}_{i}^{*}(\boldsymbol{c}) = \sum_{l=1}^{L} c_{l}\boldsymbol{b}_{l}\boldsymbol{a}_{il}^{*}. \end{align*} The key is to apply the matrix Bernstein inequality (6.53) and we need to estimate $$\|\mathcal{Z}_{l}\|_{\psi _{1}}$$, and the variance of $$\sum _{l=1}^{L} \mathcal{Z}_{l}.$$ For each l, $$\| c_{l} \boldsymbol{b}_{l}\boldsymbol{a}_{il}^{*}\|_{\psi _{1}} \leq \frac{\lambda \sqrt{KN}}{L}$$ follows from (6.58). Moreover, the variance of $$\mathcal{A}_{i}^{*}(\boldsymbol{c})$$ is bounded by $$\frac{\lambda ^{2} \max \{K,N\}}{L}$$ since   \begin{align*} \operatorname{\mathbb{E}} [ \mathcal{A}_{i}^{*}(\boldsymbol{c}) (\mathcal{A}_{i}^{*}(\boldsymbol{c}) )^{*} ] &= \sum_{l=1}^{L} \operatorname{\mathbb{E}}(|c_{l}|^{2} \|\boldsymbol{a}_{il}\|^{2})\boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*} = \frac{N}{L} \sum_{l=1}^{L}{\lambda_{l}^{2}}\boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*} \preceq \frac{\lambda^{2} N}{L}, \\ \operatorname{\mathbb{E}} [ (\mathcal{A}_{i}^{*}(\boldsymbol{c}) )^{*} (\mathcal{A}_{i}^{*}(\boldsymbol{c})) ] &= \sum_{l=1}^{L} \|\boldsymbol{b}_{l}\|^{2} \operatorname{\mathbb{E}}(|c_{l}|^{2} \boldsymbol{a}_{il}\boldsymbol{a}_{il}^{*}) = \frac{K}{L^{2}} \sum_{l=1}^{L}{\lambda_{l}^{2}} \boldsymbol{I}_{N} \preceq \frac{\lambda^{2} K}{L}.
\end{align*} Letting $$t = \gamma \log L$$ and applying (6.53) lead to   \begin{align*} \|\mathcal{A}_{i}^{*}(\boldsymbol{c})\| \leq C_{0}\max\left\{ \frac{\lambda\sqrt{KN}\log^{2} L}{L},\sqrt{\frac{C_{\gamma}\lambda^{2}\max\{K,N\}\log L}{L}}\right\} \leq \xi. \end{align*} Therefore, by taking the union bound over 1 ≤ i ≤ s,   \begin{align*} \|\mathcal{A}_{i}^{*}(\boldsymbol{c})\| \leq \xi \end{align*} with probability at least 1 − L−γ if $$L\geq C_{\gamma +\log (s)} \big(\frac{\lambda }{\xi } +\frac{\lambda ^{2}}{\xi ^{2}} \big)\max \{K,N\}\log ^{2}L$$. The robustness condition is an immediate result of Lemma 6.15 by setting $$\xi = \frac{\varepsilon d_{0}}{10\sqrt{2}s\kappa }$$ and λ = σd0. Corollary 6.16 [Robustness Condition] For $$\boldsymbol{e} \sim \mathcal{C}\mathcal{N}\big(\boldsymbol{0}, \frac{\sigma ^{2}{d_{0}^{2}}}{L}\boldsymbol{I}_{L}\big) $$  \begin{align*} \|\mathcal{A}_{i}^{*}(\boldsymbol{e})\| \leq \frac{\varepsilon d_{0}}{10\sqrt{2} s\kappa}, \quad \forall 1\leq i\leq s \end{align*} with probability at least 1 − L−γ if $$L \geq C_{\gamma }\big(\frac{ s^{2}\kappa ^{2} \sigma ^{2}}{\varepsilon ^{2}} + \frac{s\kappa \sigma }{\varepsilon }\big)\max \{K, N\} \log L$$. Lemma 6.17 For $$\boldsymbol{e} \sim \mathcal{C}\mathcal{N}\big(\boldsymbol{0}, \frac{\sigma ^{2}{d_{0}^{2}}}{L}\boldsymbol{I}_{L}\big) $$, there holds   \begin{align} \| \mathcal{A}_{i}^{*}(\boldsymbol{y}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \| \leq \xi d_{i0}, \quad \forall 1\leq i\leq s \end{align} (6.41) with probability at least 1 − L−γ if $$L \geq C_{\gamma + \log (s)}s\kappa ^{2} ({\mu ^{2}_{h}} + \sigma ^{2}) \max \{K, N\} \log L /\xi ^{2}.$$ Remark 6.18 The success of the initialization algorithm completely relies on the lemma above. 
As mentioned in Section 3, $$\operatorname{\mathbb{E}}(\mathcal{A}_{i}^{*}(\boldsymbol{y})) = \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}$$ and Lemma 6.17 confirms that $$\mathcal{A}_{i}^{*}(\boldsymbol{y})$$ is close to $$\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}$$ in operator norm and hence the spectral method is able to give us a reliable initialization. Proof. Note that   \begin{align*} \mathcal{A}_{i}^{*}(\boldsymbol{y}) = \mathcal{A}_{i}^{*}\mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}) + \mathcal{A}_{i}^{*}(\boldsymbol{w}_{i}), \end{align*} where   \begin{align} \boldsymbol{w}_{i} = \boldsymbol{y} - \mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}) = \sum_{j\neq i} \mathcal{A}_{j}(\boldsymbol{h}_{j0}\boldsymbol{x}_{j0}^{*}) + \boldsymbol{e} \end{align} (6.42) is independent of $$\mathcal{A}_{i}.$$ The proof consists of two parts: 1. show that $$\|\mathcal{A}_{i}^{*}\mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\| \leq \frac{\xi d_{i0}}{2}$$; 2. prove that $$\|\mathcal{A}_{i}^{*}(\boldsymbol{w}_{i})\| \leq \frac{\xi d_{i0}}{2}$$. Part I: following from the definition of $$\mathcal{A}_{i}$$ and $$\mathcal{A}_{i}^{*}$$ in (2.2),   \begin{align*} \mathcal{A}_{i}^{*}\mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} = \sum_{l=1}^{L} \underbrace{\boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}(\boldsymbol{a}_{il}\boldsymbol{a}_{il}^{*} - \boldsymbol{I}_{N})}_{\textrm{defined as}\ \mathcal{Z}_{l} }, \end{align*} where $$\boldsymbol{B}^{*}\boldsymbol{B} = \boldsymbol{I}_{K}$$.
The sub-exponential norm of $$\mathcal{Z}_{l}$$ is bounded by   \begin{align*} \|\mathcal{Z}_{l}\|_{\psi_{1}} \leq \max_{1\leq l\leq L}\|\boldsymbol{b}_{l}\| |\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i0}| \| (\boldsymbol{a}_{il}\boldsymbol{a}_{il}^{*} - \boldsymbol{I}_{N}) \boldsymbol{x}_{i0}\|_{\psi_{1}} \leq \frac{\mu\sqrt{KN}d_{i0}}{L}, \end{align*} where $$\|\boldsymbol{b}_{l}\| = \sqrt{\frac{K}{L}}$$, $$\max _{l}|\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i0}|^{2} \leq \frac{\mu ^{2} d_{i0}}{L}$$ and $$\| (\boldsymbol{a}_{il}\boldsymbol{a}_{il}^{*} - \boldsymbol{I}_{N}) \boldsymbol{x}_{i0}\|_{\psi _{1}} \leq \sqrt{Nd_{i0}}$$ follows from (6.56). We proceed to estimate the variance of $$\sum _{l=1}^{L} \mathcal{Z}_{l}$$ by using (6.55) and (6.57):   \begin{align*} \left\| \sum_{l=1}^{L}\operatorname{\mathbb{E}}(\mathcal{Z}_{l}\mathcal{Z}_{l}^{*})\right\| & = \left\| \sum |\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i0}|^{2} \boldsymbol{x}_{i0}^{*}\operatorname{\mathbb{E}}(\boldsymbol{a}_{il} \boldsymbol{a}_{il}^{*} - \boldsymbol{I}_{N})^{2}\boldsymbol{x}_{i0}\boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*}\right\| \leq \frac{\mu^{2}N d_{i0}^{2}}{L}, \\ \left\| \sum_{l=1}^{L}\operatorname{\mathbb{E}}(\mathcal{Z}_{l}^{*}\mathcal{Z}_{l})\right\| & = \frac{K}{L}\left\| \sum_{l=1}^{L} |\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i0}|^{2} \operatorname{\mathbb{E}}\left[ (\boldsymbol{a}_{il}\boldsymbol{a}_{il}^{*} - \boldsymbol{I}_{N})\boldsymbol{x}_{i0}\boldsymbol{x}_{i0}^{*}(\boldsymbol{a}_{il}\boldsymbol{a}_{il}^{*} - \boldsymbol{I}_{N})\right] \right\| \leq \frac{Kd_{i0}^{2}}{L}. \end{align*} Therefore, the variance of $$\sum _{l=1}^{L}\mathcal{Z}_{l}$$ is bounded by $$\frac{\max \{K,{\mu ^{2}_{h}}N\}d_{i0}^{2}}{L}$$. 
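The deviation $$\|\sum_{l=1}^{L}\mathcal{Z}_{l}\| = \|\mathcal{A}_{i}^{*}\mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|$$ controlled by the estimates above can be checked numerically. The following is only an illustrative sketch (not part of the proof): the dimensions are ours, and B is taken as the first K columns of the unitary DFT, one common choice satisfying $$\boldsymbol{B}^{*}\boldsymbol{B} = \boldsymbol{I}_{K}$$ with row norms $$\sqrt{K/L}$$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 16, 16  # illustrative sizes, not taken from the theorem


def partial_dft(L, K):
    """First K columns of the unitary L x L DFT: B^* B = I_K, ||b_l||^2 = K/L."""
    l, k = np.arange(L)[:, None], np.arange(K)[None, :]
    return np.exp(-2j * np.pi * l * k / L) / np.sqrt(L)


def deviation(L, rng):
    """One draw of ||A^* A(h x^*) - h x^*|| / ||h x^*|| in operator norm."""
    B = partial_dft(L, K)
    # i.i.d. complex Gaussian sensing vectors a_l ~ CN(0, I_N), stored as rows
    A = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2)
    h = rng.standard_normal(K) + 1j * rng.standard_normal(K)
    x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    X0 = np.outer(h, x.conj())
    y = (B.conj() @ h) * (A @ x.conj())        # y_l = b_l^* h x^* a_l
    AstarY = B.T @ (y[:, None] * A.conj())     # A^*(y) = sum_l y_l b_l a_l^*
    return np.linalg.norm(AstarY - X0, 2) / np.linalg.norm(X0, 2)


r_small, r_large = deviation(512, rng), deviation(8192, rng)
print(f"L=512: {r_small:.3f}   L=8192: {r_large:.3f}")  # deviation shrinks as L grows
```

Since $$\operatorname{\mathbb{E}}[\mathcal{A}^{*}\mathcal{A}(\boldsymbol{h}\boldsymbol{x}^{*})] = \boldsymbol{h}\boldsymbol{x}^{*}$$, the printed deviation should decrease as L increases, consistent with the $$\sqrt{\max\{K,\mu_{h}^{2}N\}\log L/L}$$ scaling predicted by the Bernstein bound.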
By applying matrix Bernstein inequality (6.53) and taking the union bound over all i, we prove that   \begin{align*} \|\mathcal{A}_{i}^{*}\mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\| \leq \frac{\xi d_{i0}}{2}, \quad \forall 1\leq i\leq s \end{align*} holds with probability at least 1 − L−γ+1 if $$L \geq C_{\gamma +\log (s)} \max \{K,{\mu _{h}^{2}}N\}\log L/\xi ^{2}.$$ Part II: for each 1 ≤ l ≤ L, the lth entry of wi in (6.42), i.e., $$(\boldsymbol{w}_{i} )_{l} = \sum _{j\neq i} \boldsymbol{b}_{l}^{*}\boldsymbol{h}_{j0}\boldsymbol{x}_{j0}^{*}\boldsymbol{a}_{jl} + e_{l}$$, is independent of $$\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\boldsymbol{a}_{il}$$ and obeys $$\mathcal{C}\mathcal{N}\big(0, \frac{\sigma _{il}^{2}}{L}\big)$$. Here   \begin{align*} \sigma_{il}^{2} & = L\operatorname{\mathbb{E}}|(\boldsymbol{w}_{i})_{l}|^{2} = L\sum_{j\neq i} |\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{j0}|^{2} \| \boldsymbol{x}_{j0} \|^{2} + \sigma^{2}{d_{0}^{2}} \\ & \leq{\mu_{h}^{2}} \sum_{j\neq i}\|\boldsymbol{h}_{j0}\|^{2} \|\boldsymbol{x}_{j0}\|^{2} + \sigma^{2}\|\boldsymbol{X}_{0}\|_{F}^{2} \leq ({\mu_{h}^{2}} + \sigma^{2}) \|\boldsymbol{X}_{0}\|_{F}^{2}.
\end{align*} This gives $$\max _{i,l} \sigma _{il}^{2}\leq ({\mu ^{2}_{h}} + \sigma ^{2}) \|\boldsymbol{X}_{0}\|_{F}^{2}.$$ Thanks to the independence between wi and $$\mathcal{A}_{i}$$, applying Lemma 6.15 results in   \begin{align} \|\mathcal{A}_{i}^{*}(\boldsymbol{w}_{i})\| \leq \frac{\xi d_{i0}}{2} \end{align} (6.43) with probability 1 − L−γ+1 if $$L \geq C\max \left ( \frac{({\mu _{h}^{2}} + \sigma ^{2}) \|\boldsymbol{X}_{0}\|_{F}^{2} }{\xi ^{2}d_{i0}^{2}}, \frac{\sqrt{{\mu ^{2}_{h}} + \sigma ^{2}}\|\boldsymbol{X}_{0}\|_{F} }{\xi d_{i0}} \right )\max \{K,N\}\log L.$$ Therefore, combining (6.42) with (6.43), we get   \begin{align*} \|\mathcal{A}_{i}^{*}(\boldsymbol{y}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\| \leq \|\mathcal{A}_{i}^{*}\mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\| + \|\mathcal{A}_{i}^{*}(\boldsymbol{w}_{i})\| \leq \xi d_{i0} \end{align*} for all 1 ≤ i ≤ s with probability at least 1 − L−γ+1 if   \begin{align*} L \geq C_{\gamma+\log (s)}({\mu_{h}^{2}} + \sigma^{2})s \kappa^{2} \max\{K,N\}\log L/\xi^{2}, \end{align*} where $$\|\boldsymbol{X}_{0}\|_{F}/d_{i0} \leq \sqrt{s}\kappa .$$ Before moving to the proof of Theorem 3.2, we introduce a property about the projection onto a closed convex set. Lemma 6.19 (Theorem 2.8 in [13]). Let $$Q := \{ \boldsymbol{w}\in \mathbb{C}^{K} | \sqrt{L}\|\boldsymbol{B}\boldsymbol{w}\|_{\infty } \leq 2\sqrt{d}\mu \}$$ be a closed non-empty convex set. There holds   \begin{align*} \operatorname{Re}( \langle \boldsymbol{z} - \mathcal{P}_{Q}(\boldsymbol{z}), \boldsymbol{w} - \mathcal{P}_{Q}(\boldsymbol{z}) \rangle ) \leq 0, \quad \forall \, \boldsymbol{w} \in Q, \boldsymbol{z}\in \mathbb{C}^{K}, \end{align*} where $$\mathcal{P}_{Q}(\boldsymbol{z})$$ is the projection of z onto Q. With this lemma, we can easily see   \begin{align} \|\boldsymbol{z} - \boldsymbol{w}\|^{2} \!=\! 
\|\boldsymbol{z} - \mathcal{P}_{Q}(\boldsymbol{z})\|^{2} + \|\mathcal{P}_{Q}(\boldsymbol{z}) - \boldsymbol{w}\|^{2} + 2\operatorname{Re}(\langle \boldsymbol{z} - \mathcal{P}_{Q}(\boldsymbol{z}), \mathcal{P}_{Q}(\boldsymbol{z}) - \boldsymbol{w} \rangle) \!\geq\! \|\mathcal{P}_{Q}(\boldsymbol{z}) - \boldsymbol{w}\|^{2} \end{align} (6.44) for all $$\boldsymbol{z}\in \mathbb{C}^{K}$$ and w ∈ Q. It means that the projection onto a non-empty closed convex set is non-expansive. Now we present the proof of Theorem 3.2. Proof of Theorem 3.2 By choosing $$L \geq C_{\gamma +\log (s)}({\mu _{h}^{2}} + \sigma ^{2})s^{2} \kappa ^{4} \max \{K,N\}\log L/\varepsilon ^{2}$$, we have   \begin{align} \|\mathcal{A}_{i}^{*}(\boldsymbol{y}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\| \leq \xi d_{i0}, \quad \forall 1\leq i\leq s, \end{align} (6.45) where $$\xi = \frac{\varepsilon }{10\sqrt{2s}\kappa }$$. By applying the triangle inequality to (6.45), it is easy to see that   \begin{align} (1 - \xi)d_{i0} \leq d_{i} \leq (1 + \xi)d_{i0}, \quad |d_{i} - d_{i0}| \leq \xi d_{i0} \leq \frac{\varepsilon d_{i0}}{10\sqrt{2s}\kappa} < \frac{d_{i0}}{10}, \end{align} (6.46) which gives $$\frac{9}{10}d_{i0} \leq d_{i} \leq \frac{11}{10}d_{i0}$$ where $$d_{i} = \|\mathcal{A}_{i}^{*}(\boldsymbol{y})\|$$ according to Algorithm 1.
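The scaling estimate (6.46) and the spectral initialization it supports can be illustrated numerically. The following is only a hedged sketch of the noiseless single-component case (s = 1) with illustrative dimensions and our own variable names; the projection onto $$Q_{i}$$ from Algorithm 1 is omitted for simplicity, since for a generic h the constraint is typically inactive.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, L = 16, 16, 4096                            # illustrative sizes
l, k = np.arange(L)[:, None], np.arange(K)[None, :]
B = np.exp(-2j * np.pi * l * k / L) / np.sqrt(L)  # partial DFT: B^* B = I_K

# complex Gaussian sensing vectors a_l ~ CN(0, I_N), stored as rows
A = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2)
h0 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x0 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
d0 = np.linalg.norm(h0) * np.linalg.norm(x0)      # d_{i0} = ||h_{i0}|| ||x_{i0}||

y = (B.conj() @ h0) * (A @ x0.conj())             # noiseless measurements
M = B.T @ (y[:, None] * A.conj())                 # A_i^*(y), close to h0 x0^*

# spectral step: d_i = largest singular value, leading singular pair as in Algorithm 1
U, S, Vh = np.linalg.svd(M)
d = S[0]                                          # d_i = ||A_i^*(y)||
u0 = np.sqrt(d) * U[:, 0]                         # projection onto Q_i omitted here
v0 = np.sqrt(d) * Vh[0].conj()                    # leading right singular vector

rel = np.linalg.norm(np.outer(u0, v0.conj()) - np.outer(h0, x0.conj()), 2) / d0
print(f"d/d0 = {d / d0:.3f}, rank-one relative error = {rel:.3f}")
```

With these sizes d falls well within the 10%-type window of (6.46) and the rank-one initial guess $$\boldsymbol{u}^{(0)}(\boldsymbol{v}^{(0)})^{*}$$ is close to $$\boldsymbol{h}_{0}\boldsymbol{x}_{0}^{*}$$, mirroring the two parts of the proof below.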
Part I: proof of $$(\boldsymbol{u}^{(0)},\boldsymbol{v}^{(0)})\in \frac{1}{\sqrt{3}}\mathcal{N}_d \cap \frac{1}{\sqrt{3}}\mathcal{N}_{\mu }$$ Note that $$\boldsymbol{v}_{i}^{(0)} = \sqrt{d_{i}}\hat{\boldsymbol{x}}_{i0}$$ where $$\hat{\boldsymbol{x}}_{i0}$$ is the leading right singular vector of $$\mathcal{A}_{i}^{*}(\boldsymbol{y}).$$ Therefore,   \begin{align*} \| \boldsymbol{v}_{i}^{(0)} \| = \sqrt{d_{i}} \|\hat{\boldsymbol{x}}_{i0}\| = \sqrt{d_{i}} \leq \sqrt{(1 + \xi)d_{i0}} \leq \frac{2}{\sqrt{3}}\sqrt{d_{i0}}, \quad \forall 1\leq i\leq s \end{align*} which implies $$\{\boldsymbol{v}_{i}^{(0)}\}_{i=1}^{s} \in \frac{1}{\sqrt{3}}\mathcal{N}_d.$$ Now we will prove that $$\boldsymbol{u}_{i}^{(0)}\in \frac{1}{\sqrt{3}}\mathcal{N}_d\cap \frac{1}{\sqrt{3}}\mathcal{N}_{\mu }$$ by Lemma 6.19. By Algorithm 1, $$\boldsymbol{u}_{i}^{(0)}$$ is the minimizer to the function $$f(\boldsymbol{z}) = \frac{1}{2} \| \boldsymbol{z} - \sqrt{d_{i}} \hat{\boldsymbol{h}}_{i0} \|^{2}$$ over $$Q_{i} := \{ \boldsymbol{z} | \sqrt{L}\|\boldsymbol{B}\boldsymbol{z}\|_{\infty } \leq 2\sqrt{d_{i}}\mu \}.$$ Obviously, by definition, $$\boldsymbol{u}_{i}^{(0)}$$ is the projection of $$\sqrt{d_{i}} \hat{\boldsymbol{h}}_{i0}$$ onto Qi. Note that $$\boldsymbol{u}_{i}^{(0)}\in Q_{i}$$ implies $$\sqrt{L}\|\boldsymbol{B}\boldsymbol{u}_{i}^{(0)}\|_{\infty } \leq 2\sqrt{d_{i}}\mu \leq 2\sqrt{(1+\xi )d_{i0}}\mu \leq \frac{4\sqrt{d_{i0}}\mu }{\sqrt{3}}$$ and hence $$\boldsymbol{u}_{i}^{(0)}\in \frac{1}{\sqrt{3}} \mathcal{N}_{\mu }.$$ Moreover, due to (6.44), there holds   \begin{align} \|\sqrt{d_{i}}\hat{\boldsymbol{h}}_{i0} - \boldsymbol{w}\|^{2} \geq \|\boldsymbol{u}_{i}^{(0)} - \boldsymbol{w} \|^{2}, \quad \forall \boldsymbol{w}\in Q_{i} .\end{align} (6.47) In particular, let w = 0 ∈ Qi and immediately we have   \begin{align*} \|\boldsymbol{u}_{i}^{(0)}\|^{2} \leq d_{i} \leq \frac{4}{3}d_{i0} \Longrightarrow \boldsymbol{u}_{i}^{(0)}\in \frac{1}{\sqrt{3}}\mathcal{N}_d.
\end{align*} In other words, $$\{(\boldsymbol{u}_{i}^{(0)}, \boldsymbol{v}_{i}^{(0)})\}_{i=1}^{s} \in \frac{1}{\sqrt{3}}\mathcal{N}_d\cap \frac{1}{\sqrt{3}}\mathcal{N}_{\mu }$$. Part II: proof of $$(\boldsymbol{u}^{(0)},\boldsymbol{v}^{(0)})\in \mathcal{N}_{\frac{2\varepsilon }{5\sqrt{s}\kappa }}$$ We will show $$\|\boldsymbol{u}_{i}^{(0)}(\boldsymbol{v}_{i}^{(0)})^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F} \leq 5\xi d_{i0}$$ for 1 ≤ i ≤ s so that $$\frac{\|\boldsymbol{u}_{i}^{(0)}(\boldsymbol{v}_{i}^{(0)})^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}} \leq \frac{2\varepsilon }{5\sqrt{s}\kappa }$$. First note that $$\sigma _{j}(\mathcal{A}_{i}^{*}(\boldsymbol{y})) \leq \xi d_{i0}$$ for all j ≥ 2, which follows from Weyl’s inequality [28] for singular values where $$\sigma _{j}(\mathcal{A}_{i}^{*}(\boldsymbol{y}))$$ denotes the jth largest singular value of $$\mathcal{A}_{i}^{*}(\boldsymbol{y})$$. Hence there holds   \begin{align} \| d_{i} \hat{\boldsymbol{h}}_{i0}\hat{\boldsymbol{x}}_{i0}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \| \leq \|\mathcal{A}_{i}^{*}(\boldsymbol{y}) - d_{i} \hat{\boldsymbol{h}}_{i0}\hat{\boldsymbol{x}}_{i0}^{*} \| + \|\mathcal{A}_{i}^{*}(\boldsymbol{y}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \| \leq 2\xi d_{i0}.
\end{align} (6.48) On the other hand, for any i,   \begin{align*} \left\| \left(\boldsymbol{I}_{K} - \frac{\boldsymbol{h}_{i0}\boldsymbol{h}_{i0}^{*}}{d_{i0}}\right)\hat{\boldsymbol{h}}_{i0} \right\| &= \left\| \left(\boldsymbol{I}_{K} - \frac{\boldsymbol{h}_{i0}\boldsymbol{h}_{i0}^{*}}{d_{i0}}\right) \hat{\boldsymbol{h}}_{i0}\hat{\boldsymbol{x}}_{i0}^{*}\hat{\boldsymbol{x}}_{i0}\hat{\boldsymbol{h}}_{i0}^{*} \right\| \\ &= \left\| \left(\boldsymbol{I}_{K} - \frac{\boldsymbol{h}_{i0}\boldsymbol{h}_{i0}^{*}}{d_{i0}}\right)\left[ \frac{1}{d_{i0}}( \mathcal{A}_{i}^{*}(\boldsymbol{y}) - d_{i} \hat{\boldsymbol{h}}_{i0}\hat{\boldsymbol{x}}_{i0}^{*}) + \hat{\boldsymbol{h}}_{i0}\hat{\boldsymbol{x}}_{i0}^{*} - \frac{\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}}{d_{i0}} \right] \hat{\boldsymbol{x}}_{i0}\hat{\boldsymbol{h}}_{i0}^{*} \right\| \\ &\leq \frac{1}{d_{i0}} \| \mathcal{A}_{i}^{*}(\boldsymbol{y}) - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \| + \left|\frac{d_{i}}{d_{i0}}-1\right| \leq 2\xi, \end{align*} where $$ \big(\boldsymbol{I}_{K} - \frac{\boldsymbol{h}_{i0}\boldsymbol{h}_{i0}^{*}}{d_{i0}}\big) \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} = \boldsymbol{0}$$ and $$(\mathcal{A}_{i}^{*}(\boldsymbol{y}) - d_{i} \hat{\boldsymbol{h}}_{i0}\hat{\boldsymbol{x}}_{i0}^{*})\hat{\boldsymbol{x}}_{i0}\hat{\boldsymbol{h}}_{i0}^{*} = \boldsymbol{0}$$. Therefore, we have   \begin{align} \left\| \hat{\boldsymbol{h}}_{i0} - \frac{\boldsymbol{h}_{i0}^{*}\hat{\boldsymbol{h}}_{i0}}{d_{i0}} \boldsymbol{h}_{i0} \right\| \leq 2\xi, \quad \|\sqrt{d_{i}} \hat{\boldsymbol{h}}_{i0} - t_{i0} \boldsymbol{h}_{i0} \| \leq 2\sqrt{d_{i}}\xi, \end{align} (6.49) where $$t_{i0} = \frac{\sqrt{d_{i}}\boldsymbol{h}_{i0}^{*}\hat{\boldsymbol{h}}_{i0}}{d_{i0}}$$ and $$|t_{i0}| \leq \sqrt{d_{i}/d_{i0}} <\sqrt{2}$$.
If we substitute w by ti0hi0 ∈ Qi into (6.47),   \begin{align} \|\sqrt{d_{i}}\hat{\boldsymbol{h}}_{i0} - t_{i0} \boldsymbol{h}_{i0}\| \geq \| \boldsymbol{u}_{i}^{(0)} - t_{i0} \boldsymbol{h}_{i0}\|, \end{align} (6.50) where ti0hi0 ∈ Qi follows from $$\sqrt{L} |t_{i0}|\|\boldsymbol{B}\boldsymbol{h}_{i0}\|_{\infty } \leq |t_{i0}| \sqrt{d_{i0}}\mu _{h} \leq \sqrt{2d_{i0}}\mu $$. Now we are ready to estimate $$\|\boldsymbol{u}^{(0)}_{i}(\boldsymbol{v}_{i}^{(0)})^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \|_{F}$$ as follows,   \begin{align*} \|\boldsymbol{u}^{(0)}_{i}(\boldsymbol{v}_{i}^{(0)})^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \|_{F} & \leq \|\boldsymbol{u}^{(0)}_{i}(\boldsymbol{v}_{i}^{(0)})^{*} - t_{i0}\boldsymbol{h}_{i0}(\boldsymbol{v}_{i}^{(0)})^{*} \|_{F} + \|t_{i0}\boldsymbol{h}_{i0}(\boldsymbol{v}_{i}^{(0)})^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \|_{F} \\ & \leq \underbrace{\|\boldsymbol{u}_{i}^{(0)} - t_{i0}\boldsymbol{h}_{i0}\| \|\boldsymbol{v}_{i}^{(0)}\|}_{I_{1}} + \underbrace{\left\| \frac{d_{i}}{d_{i0}} \boldsymbol{h}_{i0} \boldsymbol{h}^{*}_{i0} \hat{\boldsymbol{h}}_{i0} \hat{\boldsymbol{x}}_{i0}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \right\|_{F}}_{I_{2}}. \end{align*} Here I1 ≤ 2ξdi because $$\|\boldsymbol{v}_{i}^{(0)}\| = \sqrt{d_{i}}$$ and $$\|\boldsymbol{u}_{i}^{(0)} - t_{i0}\boldsymbol{h}_{i0}\| \leq 2\sqrt{d_{i}}\xi $$ follows from (6.49) and (6.50). For I2, there holds   \begin{align*} I_{2} = \left\| \frac{\boldsymbol{h}_{i0}\boldsymbol{h}_{i0}^{*}}{d_{i0}} \left(d_{i} \hat{\boldsymbol{h}}_{i0} \hat{\boldsymbol{x}}_{i0}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\right) \right\|_{F} \leq \|d_{i} \hat{\boldsymbol{h}}_{i0} \hat{\boldsymbol{x}}_{i0}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F} \leq 2\sqrt{2}\xi d_{i0}, \end{align*} which follows from (6.48). 
Therefore,   \begin{align*} \|\boldsymbol{u}^{(0)}_{i}(\boldsymbol{v}_{i}^{(0)})^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \|_{F} & \leq 2\xi d_{i} + 2 \sqrt{2}\xi d_{i0} \leq 5\xi d_{i0}\leq \frac{2\varepsilon d_{i0}}{5\sqrt{s}\kappa }. \end{align*} Acknowledgments S. Ling would like to thank Felix Krahmer and Dominik Stöger for the discussion about [29], and also thanks Ju Sun for pointing out the connection between convolutional dictionary learning and this work. Footnotes 1  Here we use the conjugate $$\bar{\boldsymbol{x}}_{\!i}$$ instead of xi because it will simplify our notation in later derivations. 2  This circular convolution assumption can often be reinforced directly (e.g., in wireless communications the use of a cyclic prefix in OFDM renders the convolution circular) or indirectly (e.g., via zero-padding). In the first case replacing regular convolution by circular convolution does not introduce any errors at all. In the latter case one introduces an additional approximation error in the inversion which is negligible, since it decays exponentially for impulse responses of finite length [30]. 3  Namely, if the pair (hi, xi) is a solution, then so is (αhi, α−1xi) for any $$\alpha \neq 0$$. 4  It is clear that instead of gradient descent one could also use a second-order method to achieve faster convergence at the tradeoff of increased computational cost per iteration. The theoretical convergence analysis for a second-order method will require a very different approach from the one developed in this paper. 5  Suppose all Ci are the same, there is no hope to recover all pairs of $$\{(\boldsymbol{h}_{i},\boldsymbol{x}_{i})\}_{i=1}^{s}$$ simultaneously. 6  For the definition and properties of sub-exponential random variables, the readers can find all relevant information in [38]. References 1. Ahmed, A., Recht, B. & Romberg, J. ( 2014) Blind deconvolution using convex programming. IEEE Trans. Inf. Theory , 60, 1711-- 1732. 
2. Bristow, H., Eriksson, A. & Lucey, S. (2013) Fast convolutional sparse coding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 391--398. 3. Cai, T. T., Li, X. & Ma, Z. (2016) Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. Ann. Stat., 44, 2221--2251. 4. Cambareri, V. & Jacques, L. (2016) Through the haze: a non-convex approach to blind calibration for linear random sensing models. arXiv preprint arXiv:1610.09028. 5. Campisi, P. & Egiazarian, K. (2007) Blind Image Deconvolution: Theory and Applications. Boca Raton, FL: CRC Press. 6. Candès, E., Eldar, Y., Strohmer, T. & Voroninski, V. (2013) Phase retrieval via matrix completion. SIAM J. Imaging Sci., 6, 199--225. 7. Candès, E. & Recht, B. (2009) Exact matrix completion via convex optimization. Found. Comput. Math., 9, 717--772. 8. Candès, E. J., Li, X. & Soltanolkotabi, M. (2015) Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory, 61, 1985--2007. 9. Candès, E. J., Strohmer, T. & Voroninski, V. (2013) PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Commun. Pure Appl. Math., 66, 1241--1274. 10. Chandrasekaran, V., Recht, B., Parrilo, P. & Willsky, A. (2012) The convex geometry of linear inverse problems. Found. Comput. Math., 12, 805--849. 11. Chen, Y. & Candès, E. (2015) Solving random quadratic systems of equations is nearly as easy as solving linear systems. Commun. Pure Appl. Math., 70, 822--883. 12. Eftekhari, A. & Wakin, M. B.
( 2015) Greed is super: a fast algorithm for super-resolution. arXiv preprint arXiv: 1511.03385. 13. Escalante, R. & Raydan, M. ( 2011) Alternating Projection Methods,  vol 8. SIAM. Google Scholar CrossRef Search ADS   14. Goldsmith, A. ( 2005) Wireless Communications . Cambridge University Press. Google Scholar CrossRef Search ADS   15. Jung, P., Krahmer, F. & Stöger, D. ( 2017) Blind demixing and deconvolution at near-optimal rate. arXiv preprint arXiv: 1704.04178. 16. Keshavan, R., Montanari, A., Matrix, S. & Oh, S. ( 2009) Completion from noisy entries. Advances in Neural Information Processing Systems , pp. 952-- 960. 17. Keshavan, R. H., Montanari, A. & Oh, S. ( 2010) Matrix completion from a few entries. IEEE Trans. Inf. Theory , 56, 2980-- 2998. Google Scholar CrossRef Search ADS   18. Koltchinskii, V. ( 2011) Von Neumann entropy penalization and low-rank matrix estimation. Ann. Stat ., 39, 2936-- 2973. Google Scholar CrossRef Search ADS   19. Lee, K., Li, Y., Junge, M. & Bresler, Y. ( 2017) Blind recovery of sparse signals from subsampled convolution. IEEE Trans. Inf. Theory , 63, 802-- 821. Google Scholar CrossRef Search ADS   20. Li, X. & Fan, H. H. ( 2001) Direct blind multiuser detection for CDMA in multipath without channel estimation. IEEE Trans. Signal Process ., 49, 63-- 73. Google Scholar CrossRef Search ADS   21. Li, X., Ling, S., Strohmer, T. & Wei, K. ( 2016) Rapid, robust, and reliable blind deconvolution via nonconvex optimization. arXiv preprint arXiv: 1606.04933. 22. Ling, S. & Strohmer, T. ( 2015) Self-calibration and biconvex compressive sensing. Inverse Probl ., 31, 115002. 23. Ling, S. & Strohmer, T. ( 2017) Blind deconvolution meets blind demixing: algorithms and performance bounds. IEEE Trans. Inf. Theory , 63, 4497-- 4520. Google Scholar CrossRef Search ADS   24. Liu, J., Xin, J., Qi, Y. & Zheng, F.-G. 
( 2009) A time domain algorithm for blind separation of convolutive sound mixtures and L1 constrained minimization of cross correlations. Commun. Math. Sci ., 7, 109-- 128. Google Scholar CrossRef Search ADS   25. Luenberger, D. G. & Ye, Y. Linear and Nonlinear Programming , vol. 228. US: Springer, 2015. 26. McCoy, M. B. & Tropp, J. A. ( 2017) Achievable performance of convex demixing. Technical report, Caltech, Paper dated Feb. 2013. ACM Technical Report  2017-- 02. 27. Shafi, M., Molisch, A. F., Smith, P. J., Haustein, T., Zhu, P., De Silva, P., Tufvesson, F., Benjebbour, A. & Wunder, G. ( 2017) 5G: a tutorial overview of standards, trials, challenges, deployment, and practice. IEEE J. Selected Areas Commun ., 35, 1201-- 1221. Google Scholar CrossRef Search ADS   28. Stewart, G. W. ( 1990) Perturbation theory for the singular value decomposition. Technical Report CS-TR  2539, University of Maryland. 29. Stöger, D., Jung, P. & Krahmer, F. ( 2016) Blind deconvolution and compressed sensing. Compressed Sensing Theory and its Applications to Radar, Sonar and Remote Sensing (CoSeRa), 2016 4th International Workshop on , pp. 24-- 27. IEEE. 30. Strohmer, T. ( 2002) Four short stories about Toeplitz matrix calculations. Linear Algebra Appl. , 343/344: 321–344. Special issue on structured and infinite systems of linear equations.  31. Sudhakar, P., Arberet, S. & Gribonval, R. ( 2010) Double sparsity: towards blind estimation of multiple channels. Latent Variable Analysis and Signal Separation , pp. 571-- 578. Berlin Heidelberg: Springer. 32. Sun, J., Qu, Q. & Wright, J. ( 2017) Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Inf. Theory , 63, 853-- 884. Google Scholar CrossRef Search ADS   33. Sun, J., Qu, Q. & Wright, J. ( 2017) A geometric analysis of phase retrieval. Foundation of Computational Mathematics . Doi: 10.1007/s10208-017-9365-9, Issn: 1615-3383. 34. Sun, R. & Luo, Z.-Q. 
( 2016) Guaranteed matrix completion via non-convex factorization. IEEE Trans. Inf. Theory , 62, 6535-- 6579. Google Scholar CrossRef Search ADS   35. Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M. & Recht, B. ( 2016) Low-rank solutions of linear matrix equations via procrustes flow. Proceedings of The 33rd International Conference on Machine Learning , pp. 964-- 973. 36. Vannithamby, R. & Talwar, S. ( 2017) Towards 5G: Applications, Requirements and Candidate Technologies . United Kingdom: John Wiley & Sons. Google Scholar CrossRef Search ADS   37. Verdu, S. ( 1998) Multiuser Detection . Cambridge University Press. 38. Vershynin, R. ( 2012) Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing: Theory and Applications  (Y. C. Eldar & G. Kutyniok eds), chapter 5. Cambridge University Press. 39. Wang, X. & Poor, H. V. ( 1998) Blind equalization and multiuser detection in dispersive CDMA channels. IEEE Trans. Commun. , 46, 91-- 103. Google Scholar CrossRef Search ADS   40. Wedin, P.-Å. ( 1972) Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics , 12, 99-- 111. Google Scholar CrossRef Search ADS   41. Wei, K., Cai, J.-F., Chan, T. F. & Leung, S. ( 2016) Guarantees of riemannian optimization for low rank matrix recovery. SIAM J. Matrix Anal. Appl ., 37, 1198-- 1222. Google Scholar CrossRef Search ADS   42. Wright, J., Ganesh, A., Min, K. & Ma, Y. ( 2013) Compressive principal component pursuit. Inf. Inference , 2, 32-- 68. Google Scholar CrossRef Search ADS   43. Wunder, G., Boche, H., Strohmer, T. & Jung, P. ( 2015) Sparse signal processing concepts for efficient 5G system design. IEEE Access ., 3, 195-- 208. Google Scholar CrossRef Search ADS   Appendix Descent Lemma Lemma 6.20. (Lemma 6.1 in [21]). 
Let $$f(\boldsymbol{z}, \bar{\boldsymbol{z}})$$ be a continuously differentiable real-valued function of the two complex variables z and $$\bar{\boldsymbol{z}}$$ (for simplicity, we denote $$f(\boldsymbol{z}, \bar{\boldsymbol{z}})$$ by f(z) and keep in mind that f(z) only assumes real values), where $$\boldsymbol{z} := (\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$. Suppose that there exists a constant CL such that   \begin{align*} \|\nabla f(\boldsymbol{z} + t \Delta \boldsymbol{z}) - \nabla f(\boldsymbol{z})\| \leq C_{L} t\|\Delta\boldsymbol{z}\|, \quad \forall\, 0\leq t\leq 1, \end{align*} for all $$\boldsymbol{z}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ and all Δz such that $$\boldsymbol{z} + t\Delta \boldsymbol{z} \in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ for 0 ≤ t ≤ 1. Then   \begin{align*} f(\boldsymbol{z} + \Delta \boldsymbol{z}) \leq f(\boldsymbol{z}) + 2\operatorname{Re}( (\Delta \boldsymbol{z})^{T} \overline{\nabla} f(\boldsymbol{z})) + C_{L}\|\Delta \boldsymbol{z}\|^{2}, \end{align*} where $$\overline{\nabla } f(\boldsymbol{z}) := \frac{\partial f(\boldsymbol{z}, \bar{\boldsymbol{z}})}{\partial \boldsymbol{z}}$$ is the complex conjugate of $$\nabla f(\boldsymbol{z}) = \frac{\partial f(\boldsymbol{z}, \bar{\boldsymbol{z}})}{\partial \bar{\boldsymbol{z}}}$$.

Concentration inequality

We define the matrix ψ1-norm via   \begin{align*} \|\boldsymbol{Z}\|_{\psi_{1}} := \inf \{ u \geq 0 : \operatorname{\mathbb{E}}[ \exp(\|\boldsymbol{Z}\|/u)] \leq 2 \}. \end{align*}

Theorem 6.21 ([18]). Consider a finite sequence $$\{\mathcal{Z}_{l}\}_{l=1}^{L}$$ of independent, centered random matrices of dimension M1 × M2. Set $$R : = \max _{1\leq l\leq L}\|\mathcal{Z}_{l}\|_{\psi _{1}}$$ and introduce the random matrix   \begin{align} \mathcal{S} = \sum_{l=1}^{L} \mathcal{Z}_{l}.
\end{align} (A.1) Compute the variance parameter   \begin{align} {\sigma_{0}^{2}} = \max\left\{ \left\| \sum_{l=1}^{L} \operatorname{\mathbb{E}}(\mathcal{Z}_{l}\mathcal{Z}_{l}^{*})\right\|, \left\| \sum_{l=1}^{L} \operatorname{\mathbb{E}}(\mathcal{Z}_{l}^{*} \mathcal{Z}_{l})\right\| \right\}, \end{align} (A.2) then for all t ≥ 0   \begin{align} \|\mathcal{S}\| \leq C_{0} \max\left\{ \sigma_{0} \sqrt{t + \log(M_{1} + M_{2})}, R\log\left( \frac{\sqrt{L}R}{\sigma_{0}}\right)(t + \log(M_{1} + M_{2})) \right\} \end{align} (A.3) with probability at least 1 − e−t where C0 is an absolute constant. Lemma 6.22 (Lemma 10–13 in [1], Lemma 12.4 in [23]). Let $$\boldsymbol{u}\in \mathbb{C}^{n} \sim \mathcal{C}\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{n})$$, then $$\|\boldsymbol{u}\|^{2} \sim \frac{1}{2}\chi ^{2}_{2n}$$ and   \begin{align} \| \|\boldsymbol{u}\|^{2} \|_{\psi_{1}} = \| \langle\boldsymbol{u}, \boldsymbol{u}\rangle \|_{\psi_{1}} \leq C n \end{align} (A.4) and   \begin{align} \operatorname{\mathbb{E}} (\boldsymbol{u}\boldsymbol{u}^{*} - \boldsymbol{I}_{n})^{2} = n\boldsymbol{I}_{n}. \end{align} (A.5) Let $$\boldsymbol{q}\in \mathbb{C}^{n}$$ be any deterministic vector, then the following properties hold   \begin{align} \| (\boldsymbol{u}\boldsymbol{u}^{*} - \boldsymbol{I})\boldsymbol{q}\|_{\psi_{1}} \leq C\sqrt{n}\|\boldsymbol{q}\|, \end{align} (A.6)  \begin{align} \operatorname{\mathbb{E}}[ (\boldsymbol{u}\boldsymbol{u}^{*} - \boldsymbol{I})\boldsymbol{q}\boldsymbol{q}^{*} (\boldsymbol{u}\boldsymbol{u}^{*} - \boldsymbol{I})] = \|\boldsymbol{q}\|^{2} \boldsymbol{I}_{n}. \end{align} (A.7) Let $$\boldsymbol{v}\sim \mathcal{C}\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{m})$$ be a complex Gaussian random vector in $$\mathbb{C}^{m}$$, independent of u, then   \begin{align} \left\| \|\boldsymbol{u}\| \cdot \|\boldsymbol{v}\|\right\|_{\psi_{1}} \leq C\sqrt{mn}. \end{align} (A.8) © The Author(s) 2018. 
Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. Information and Inference: A Journal of the IMA.

ISSN: 2049-8764
eISSN: 2049-8772
DOI: 10.1093/imaiai/iax022
This joint blind deconvolution–demixing problem arises in a range of applications such as acoustics [24], dictionary learning [2] and wireless communications [39]. We briefly discuss one such application in more detail. Blind deconvolution/demixing problems are expected to play a vital role in the future Internet-of-Things. The Internet-of-Things will connect billions of wireless devices, which is far more than current wireless systems can technically and economically accommodate. One of the many challenges in the design of the Internet-of-Things will be its ability to manage the massive number of sporadic-traffic-generating devices, which are inactive most of the time but regularly access the network for minor updates with no human interaction [43]. This means, among other things, that the overhead caused by the exchange of certain types of information between transmitter and receiver, such as channel estimation, assignment of data slots, etc., has to be avoided as much as possible [27,36]. Focusing on the underlying mathematical challenges, we consider a multi-user communication scenario where many different users/devices communicate with a common base station, as illustrated in Fig. 1. Suppose we have s users and each of them sends a signal gi through an unknown channel (which differs from user to user) to a common base station. We assume that the ith channel, represented by its impulse response fi, does not change during the transmission of the signal gi. Therefore fi acts as a convolution operator, i.e., the signal transmitted by the ith user arriving at the base station becomes fi * gi, where '*' denotes convolution.

Fig. 1. Single-antenna multi-user communication scenario without explicit channel estimation: each of the s users sends a signal gi through an unknown channel fi to a common base station.
The base station measures the superposition of all those signals, namely, $$\boldsymbol{y} = \sum _{i=1}^{s} {\boldsymbol{f}_{\!i}} \ast \boldsymbol{g}_{i} $$ (plus noise). The goal is to extract all pairs $$\{({\boldsymbol{f}_{\!i}},\boldsymbol{g}_{i})\}_{i=1}^{s}$$ simultaneously from y. The antenna at the base station, instead of receiving each individual component fi * gi, is only able to record the superposition of all those signals, namely,   \begin{align} \boldsymbol{y} = \sum_{i=1}^{s} {\boldsymbol{f}_{\!\!i}}\ast \boldsymbol{g}_{i} +\boldsymbol{n}, \end{align} (1.1) where n represents noise. We aim to develop a fast algorithm to simultaneously extract all pairs $$\{({\boldsymbol{f}_{\!\!i}},\boldsymbol{g}_{i})\}_{i=1}^{s}$$ from y (i.e., to estimate the channel impulse responses fi and the signals gi jointly) in a numerically efficient and robust way, while keeping the number of required measurements as small as possible. 1.1. State of the art and contributions of this paper A thorough theoretical analysis concerning the solvability of demixing problems via convex optimization can be found in [26]. There, the authors derive explicit sharp bounds and phase transitions regarding the number of measurements required to successfully demix structured signals (such as sparse signals or low-rank matrices) from a single measurement vector. In principle we could recast the blind deconvolution/demixing problem as the demixing of a sum of rank-one matrices, see (4).
As such, it seems to fit into the framework analyzed by McCoy and Tropp. However, the setup in [26] differs from ours in a crucial manner. McCoy and Tropp consider full-rank random matrices as measurement matrices (see the matrices $$\mathcal{A}_{i}$$ in (4)), while in our setting the measurement matrices are rank-one. This difference fundamentally changes the theoretical analysis. The findings in [26] are therefore not applicable to the problem of joint blind deconvolution/demixing. The compressive principal component pursuit in [42] is also a form of demixing problem, but its setting is only vaguely related to ours. There is a large body of literature on demixing problems, but the vast majority does not have a 'blind deconvolution component'; therefore, this body of work is only marginally related to the topic of our paper. Blind deconvolution/demixing problems also appear in convolutional dictionary learning, see e.g. [2]. There, the aim is to factorize an ensemble of input vectors into a linear combination of overcomplete basis elements which are modeled as shift-invariant; the latter property is why the factorization turns into a convolution. The setup is similar to (1.1), but with an additional penalty term to enforce sparsity of the convolving filters. The existing literature on convolutional dictionary learning is mainly focused on empirical results, so there is little overlap with our work. But it is an interesting challenge for future research to see whether the approach in this paper can be modified to provide a fast and theoretically sound solver for the sparse convolutional coding problem. There are numerous papers concerned with blind deconvolution/demixing problems in the area of wireless communications [20,31,37]. But the majority of these papers assume the availability of multiple measurement vectors, which makes the problem significantly easier.
Those methods, however, cannot be applied to the case of a single measurement vector, which is the focus of this paper. Thus there is essentially no overlap between those papers and our work. Our previous paper [23] solves (1.1) under subspace conditions, i.e., assuming that both fi and gi belong to known linear subspaces, thereby generalizing the pioneering work by Ahmed et al. [1] from the 'single-user' scenario to the 'multi-user' scenario. Both [1] and [23] employ a two-step convex approach: first, 'lifting' [9] is used, and then the lifted version of the original bilinear inverse problem is relaxed into a semidefinite program. An improvement of the theoretical bounds in [23] was announced in [29]. While the convex approach is certainly effective and elegant, it can hardly handle large-scale problems. This motivates us to apply a non-convex optimization approach [8,21] to this joint blind deconvolution–demixing problem. The mathematical challenge, when using non-convex methods, is to derive a rigorous convergence framework with conditions that are competitive with those in a convex framework. In the last few years several excellent articles have appeared on provably convergent non-convex optimization applied to various problems in signal processing and machine learning, e.g., matrix completion [16,17,34], phase retrieval [3,8,11,33], blind deconvolution [4,19,21], dictionary learning [32], super-resolution [12] and low-rank matrix recovery [35,41]. In this paper we derive the first non-convex optimization algorithm that solves (1.1) fast and with rigorous theoretical guarantees concerning exact recovery, convergence rates and robustness to noisy data. Our work can be viewed as a generalization of blind deconvolution [21] (s = 1) to the multi-user scenario (s > 1). The idea behind our approach is strongly motivated by the non-convex optimization algorithm for phase retrieval proposed in [8].
In this foundational paper, the authors use a two-step approach: (i) construct a good initial guess with a numerically efficient algorithm; (ii) starting with this initial guess, prove that simple gradient descent converges to the true solution. Our paper follows a similar two-step scheme. However, the techniques used here are quite different from [8]. As in the matrix completion problem [7], the performance of the algorithm relies heavily and inherently on how well the ground truth signals are aligned with the design matrix. Due to this so-called 'incoherence' issue, we need to impose extra constraints, which results in a different construction of the so-called basin of attraction. Therefore, influenced by [17,21,34], we add penalty terms to control the incoherence, and this leads to the regularized gradient descent method, which forms the core of our proposed algorithm. To the best of our knowledge, our algorithm is the first algorithm for the joint blind deconvolution/demixing problem that is numerically efficient, is robust against noise and comes with rigorous recovery guarantees. 1.2. Notation For a matrix Z, ∥Z∥ denotes its operator norm and ∥Z∥F its Frobenius norm. For a vector z, ∥z∥ is its Euclidean norm and $$\|\boldsymbol{z}\|_{\infty }$$ its $$\ell _{\infty }$$-norm. For both matrices and vectors, Z* and z* denote their complex conjugate transpose. $$\bar{\boldsymbol{z}}$$ is the complex conjugate of z. We equip the matrix space $$\mathbb{C}^{K\times N}$$ with the inner product defined by ⟨U, V⟩ := Tr(U*V). For a given vector z, diag(z) represents the diagonal matrix whose diagonal entries are given by z. For any $$z\in \mathbb{R}$$, let $$z_{+} = \frac{z + |z|}{2}.$$ 2. Preliminaries Obviously, without any further assumptions, it is impossible to solve (1.1). Therefore, we impose the following subspace assumptions throughout our discussion [1,23].
Channel subspace assumption: each finite impulse response $${\boldsymbol{f}_{\!\!i}}\in \mathbb{C}^{L}$$ is assumed to have maximum delay spread K, i.e.,   \begin{align*} {\boldsymbol{f}_{\!\!i}} = \left[\begin{array}{@{}c@{}} \boldsymbol{h}_{i} \\ \boldsymbol{0} \end{array}\right]. \end{align*} Here $$\boldsymbol{h}_{i} \in{\mathbb{C}}^{K}$$ is the non-zero part of fi and fi(n) = 0 for n > K. Signal subspace assumption: let $$\boldsymbol{g}_{i} : = {\boldsymbol{C}_{\!i}}\ \bar{\boldsymbol{x}}_{i}$$ be the outcome of the signal $$\bar{\boldsymbol{x}}_{i}\in \mathbb{C}^{N}$$ encoded by a matrix $${\boldsymbol{C}_{\!i}}\in \mathbb{C}^{L\times N}$$ with L > N, where the encoding matrix Ci is known and assumed to have full rank.1 Remark 2.1 Both subspace assumptions are common in various applications. For instance in wireless communications, the channel impulse response can always be modeled to have finite support (or maximum delay spread, as it is called in engineering jargon) due to the physical properties of wave propagation [14]; and the signal subspace assumption is a standard feature found in many current communication systems [14], including Code-division multiple access (CDMA) where Ci is known as spreading matrix and Orthogonal frequency-division multiplexing (OFDM) where Ci is known as precoding matrix. The specific choice of the encoding matrices Ci depends on a variety of conditions. In this paper, we derive our theory by assuming that Ci is a complex Gaussian random matrix, i.e., each entry in Ci is i.i.d. $$\mathcal{C}\mathcal{N}(0,1)$$. This assumption, while sometimes imposed in the wireless communications literature, is somewhat unrealistic in practice, due to the lack of a fast algorithm to apply Ci and due to storage requirements. In practice one would rather choose Ci to be something like the product of a Hadamard matrix and a diagonal matrix with random binary entries. 
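As an illustration of this structured alternative, here is a minimal sketch of one way such an encoding matrix could be generated. The exact construction (how the Hadamard product is normalized and restricted to N columns) is our own choice for concreteness; the theory in this paper is derived for Gaussian Ci only.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 128, 32  # L must be a power of two for the Sylvester construction

# Build an L x L Hadamard matrix via the Sylvester recursion.
H = np.array([[1.0]])
while H.shape[0] < L:
    H = np.block([[H, H], [H, -H]])

# Encoding matrix: normalized Hadamard times a diagonal with random
# binary (+-1) entries, restricted to N columns so that C is L x N.
D = np.diag(rng.choice([-1.0, 1.0], size=L))
C = (H / np.sqrt(L) @ D)[:, :N]

# C has orthonormal columns, hence full column rank as required.
print(np.allclose(C.T @ C, np.eye(N)))  # True
```

Both factors can be applied in O(L log L) operations via the fast Walsh–Hadamard transform, and only the L random signs need to be stored, which addresses the two practical objections mentioned above.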
We hope to address such structured encoding matrices in our future research. Our numerical simulations (see Section 4) show no difference in the performance of our algorithm for either choice. Under the two assumptions above, the model actually has a simpler form in the frequency domain. We assume throughout the paper that the convolution of finite sequences is circular convolution.2 By applying the Discrete Fourier Transform (DFT) to (1.1) along with the two assumptions, we have   \begin{align*} \frac{1}{\sqrt{L}}\boldsymbol{F} \boldsymbol{y} = \sum_{i=1}^{s}\operatorname{diag}(\boldsymbol{F} {\boldsymbol{f}_{\!\!i}})(\boldsymbol{F}\boldsymbol{C}_{i} \bar{\boldsymbol{x}}_{i}) + \frac{1}{\sqrt{L}}\boldsymbol{F}\boldsymbol{n},\end{align*} where F is the L × L normalized unitary DFT matrix with F*F = FF* = IL. The noise is assumed to be additive white complex Gaussian noise with $$\boldsymbol{n}\sim \mathcal{C}\mathcal{N}(\boldsymbol{0}, \sigma ^{2}{d_{0}^{2}}\boldsymbol{I}_{L})$$ where $$d_{0} = \sqrt{\sum _{i=1}^{s} \|\boldsymbol{h}_{i0}\|^{2} \|\boldsymbol{x}_{i0}\|^{2}}$$, and $$\{(\boldsymbol{h}_{i0}, \boldsymbol{x}_{i0})\}_{i=1}^{s}$$ is the ground truth. We define $$d_{i0} = \|\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}$$ and assume without loss of generality that hi0 and xi0 have the same norm, i.e., $$\|\boldsymbol{h}_{i0}\| = \|\boldsymbol{x}_{i0}\| = \sqrt{d_{i0}}$$, which is due to the scaling ambiguity.3 In this way, $$\frac{1}{\sigma ^{2}}$$ is a measure of the signal-to-noise ratio (SNR). Let $$\boldsymbol{h}_{i}\in \mathbb{C}^{K}$$ be the first K non-zero entries of fi and $$\boldsymbol{B}\in \mathbb{C}^{L\times K}$$ be a low-frequency DFT matrix (the first K columns of an L × L unitary DFT matrix). Then a simple relation holds,   \begin{align*} \boldsymbol{F}{\boldsymbol{f}_{\!\!i}} ={\boldsymbol{B}}{\boldsymbol{h}_{\!i}}, \quad \boldsymbol{B}^{*}\boldsymbol{B} = \boldsymbol{I}_{K}.
\end{align*} We also denote $$\boldsymbol{A}_{i} := \overline{\boldsymbol{F}{\boldsymbol{C}_{\!i}}}$$ and $$\boldsymbol{e} := \frac{1}{\sqrt{L}}\boldsymbol{F}\boldsymbol{n}$$. Due to Gaussianity, Ai also has a complex Gaussian distribution, and so does e. From now on, instead of focusing on the original model, we consider (with a slight abuse of notation) the following equivalent formulation throughout our discussion:   \begin{align} \boldsymbol{y} = \sum_{i=1}^{s} \operatorname{diag}(\boldsymbol{B}{\boldsymbol{h}_{\!i}})\overline{\boldsymbol{A}_{i}\boldsymbol{x}_{i}} + \boldsymbol{e}, \end{align} (2.1) where $$\boldsymbol{e} \sim \mathcal{C}\mathcal{N}\big(\boldsymbol{0}, \frac{\sigma ^{2}{d_{0}^{2}}}{L}\boldsymbol{I}_{L}\big)$$. Our goal is to estimate all $$\{\boldsymbol{h}_{i}, \boldsymbol{x}_{i}\}_{i=1}^{s}$$ from y, B and $$\{\boldsymbol{A}_{i}\}_{i=1}^{s}$$. Obviously, this is a bilinear inverse problem: if all $$\{\boldsymbol{h}_{i}\}_{i=1}^{s}$$ are given, it is a linear inverse problem (the ordinary demixing problem) to recover all $$\{\boldsymbol{x}_{i}\}_{i=1}^{s}$$, and vice versa. We note that there is a scaling ambiguity in all blind deconvolution problems that cannot be resolved by any reconstruction method without further information. Therefore, when we talk about exact recovery in the following, this is understood modulo such a trivial scaling ambiguity. Before proceeding to our proposed algorithm we introduce some notation to facilitate a more convenient presentation of our approach. Let bl be the lth column of B* and ail be the lth column of $$\boldsymbol{A}_{i}^{*}$$. Based on our assumptions the following properties hold:   \begin{align*} \sum_{l=1}^{L}\boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*} = \boldsymbol{I}_{K}, \quad \|\boldsymbol{b}_{l}\|^{2} = \frac{K}{L}, \quad \boldsymbol{a}_{il}\sim \mathcal{C}\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{N}).
\end{align*} Moreover, inspired by the well-known lifting idea [1,6,9,22], we define the useful matrix-valued linear operator $$\mathcal{A}_{i} : \mathbb{C}^{K\times N} \to \mathbb{C}^{L}$$ and its adjoint $$\mathcal{A}_{i}^{*}:\mathbb{C}^{L}\rightarrow \mathbb{C}^{K\times N}$$ by   \begin{align} \mathcal{A}_{i}(\boldsymbol{Z}) := \{\boldsymbol{b}_{l}^{*}\boldsymbol{Z}\boldsymbol{a}_{il}\}_{l=1}^{L}, \quad \mathcal{A}^{*}_{i}(\boldsymbol{z}) := \sum_{l=1}^{L} z_{l} \boldsymbol{b}_{l}\boldsymbol{a}_{il}^{*} = \boldsymbol{B}^{*}\operatorname{diag}(\boldsymbol{z})\boldsymbol{A}_{i} \end{align} (2.2) for each 1 ≤ i ≤ s under canonical inner product over $$\mathbb{C}^{K\times N}.$$ Therefore, (2.1) can be written in the following equivalent form   \begin{align} \boldsymbol{y} = \sum_{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}) + \boldsymbol{e}. \end{align} (2.3) Hence, we can think of y as the observation vector obtained from taking linear measurements with respect to a set of rank-one matrices $$\{\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}\}_{i=1}^{s}.$$ In fact, with a bit of linear algebra (and ignoring the noise term for the moment), the lth entry of y in (2.3) equals the inner product of two block-diagonal matrices:   \begin{align} y_{l} = \left \langle \underbrace{ \left[\begin{array}{@{}cccc@{}} \boldsymbol{h}_{1,0}\boldsymbol{x}_{1,0}^{*} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{h}_{2,0}\boldsymbol{x}_{2,0}^{*} & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{h}_{s0}\boldsymbol{x}_{s0}^{*} \end{array}\right]}_{\textrm{defined as}\ \boldsymbol{X}_{0} }, \left[\begin{array}{@{}cccc@{}} \boldsymbol{b}_{l}\boldsymbol{a}_{1l}^{*} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{b}_{l}\boldsymbol{a}_{2l}^{*} & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & 
\boldsymbol{b}_{l}\boldsymbol{a}_{sl}^{*} \end{array} \right] \right\rangle + e_{l}, \end{align} (2.4) where $$y_{l} = \sum _{i=1}^{s} \boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\boldsymbol{a}_{il} + e_{l}, 1\leq l\leq L$$ and X0 is defined as the ground truth matrix. In other words, we aim to recover such a block-diagonal matrix X0 from L linear measurements with block structure if e = 0. By stacking all $$\{\boldsymbol{h}_{i}\}_{i=1}^{s}$$ (and $$\{\boldsymbol{x}_{i}\}_{i=1}^{s}, \{\boldsymbol{h}_{i0}\}_{i=1}^{s},\{\boldsymbol{x}_{i0}\}_{i=1}^{s}$$) into a long column, we let   \begin{align} \boldsymbol{h} := \left[\begin{array}{@{}c@{}} \boldsymbol{h}_{1} \\ \vdots\\ \boldsymbol{h}_{s} \end{array}\right], \quad \boldsymbol{h}_{0} := \left[\begin{array}{@{}c@{}} \boldsymbol{h}_{1,0} \\ \vdots\\ \boldsymbol{h}_{s0} \end{array}\right]\in\mathbb{C}^{Ks} ,\quad \boldsymbol{x} := \left[\begin{array}{@{}c@{}} \boldsymbol{x}_{1} \\ \vdots\\ \boldsymbol{x}_{s} \end{array}\right],\quad \boldsymbol{x}_{0} := \left[\begin{array}{@{}c@{}} \boldsymbol{x}_{1,0} \\ \vdots\\ \boldsymbol{x}_{s0} \end{array}\right] \in\mathbb{C}^{Ns}. \end{align} (2.5) We define $$\mathcal{H}$$ as a bilinear operator which maps a pair $$(\boldsymbol{h}, \boldsymbol{x})\in \mathbb{C}^{Ks}\times \mathbb{C}^{Ns}$$ into a block diagonal matrix in $$\mathbb{C}^{Ks\times Ns}$$, i.e.,   \begin{align} \mathcal{H}(\boldsymbol{h}, \boldsymbol{x}) := \left[\begin{array}{@{}cccc@{}} \boldsymbol{h}_{1}\boldsymbol{x}_{1}^{*} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{h}_{2}\boldsymbol{x}_{2}^{*} & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{h}_{s}\boldsymbol{x}_{s}^{*} \end{array}\right]\in\mathbb{C}^{Ks\times Ns}. 
\end{align} (2.6) Let $$\boldsymbol{X} := \mathcal{H}(\boldsymbol{h}, \boldsymbol{x})$$ and $$\boldsymbol{X}_{0} := \mathcal{H}(\boldsymbol{h}_{0}, \boldsymbol{x}_{0})$$ where X0 is the ground truth as illustrated in (2.4). Define $$\mathcal{A}(\boldsymbol{Z}):\mathbb{C}^{Ks\times Ns}\rightarrow \mathbb{C}^{L}$$ as   \begin{align} \mathcal{A}(\boldsymbol{Z}) := \sum_{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{Z}_{i}), \end{align} (2.7) where Z = blkdiag(Z1, ⋯ , Zs) and blkdiag is the standard MATLAB function to construct block diagonal matrix. Therefore, $$\mathcal{A}(\mathcal{H}(\boldsymbol{h}, \boldsymbol{x})) = \sum _{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$ and $$\boldsymbol{y} = \mathcal{A}(\mathcal{H}(\boldsymbol{h}_{0}, \boldsymbol{x}_{0})) + \boldsymbol{e}.$$ The adjoint operator $$\mathcal{A}^{*}$$ is defined naturally as   \begin{align} \mathcal{A}^{*}(\boldsymbol{z}) : = \left[\begin{array}{@{}cccc@{}} \mathcal{A}_{1}^{*}(\boldsymbol{z}) & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \mathcal{A}_{2}^{*}(\boldsymbol{z}) & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \mathcal{A}_{s}^{*}(\boldsymbol{z}) \end{array}\right]\in\mathbb{C}^{Ks\times Ns}, \end{align} (2.8) which is a linear map from $$\mathbb{C}^{L}$$ to $$\mathbb{C}^{Ks\times Ns}.$$ To measure the approximation error of X0 given by X, we define δ(h, x) as the global relative error:   \begin{align} \delta(\boldsymbol{h},\boldsymbol{x}) := \frac{\|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}}{\|\boldsymbol{X}_{0}\|_{F}} = \frac{\sqrt{\sum_{i=1}^{s} \|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}^{2}}}{d_{0}} = \sqrt{\frac{\sum_{i=1}^{s}{\delta_{i}^{2}} d_{i0}^{2}}{ \sum_{i=1}^{s} d_{i0}^{2}}}, \end{align} (2.9) where δi := δi(hi, xi) is the relative error within each component:   \begin{align*} \delta_{i}(\boldsymbol{h}_{i},\boldsymbol{x}_{i}) := 
\frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}}. \end{align*} Note that δ and δi are functions of (h, x) and (hi, xi), respectively; when no confusion can arise, we simply write δ and δi. 2.1. Convex versus non-convex approaches As indicated in (2.4), joint blind deconvolution–demixing can be recast as the task of recovering a rank-s block-diagonal matrix from linear measurements. In general, such a low-rank matrix recovery problem is NP-hard. In order to take advantage of the low-rank property of the ground truth, it is natural to adopt a convex relaxation by solving a convenient nuclear norm minimization program, i.e.,   \begin{align} \min \sum_{i=1}^{s} \|\boldsymbol{Z}_{i}\|_{*}, \quad s.t. \quad\sum_{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{Z}_{i}) = \boldsymbol{y}. \end{align} (2.10) The question of when the solution of (2.10) yields exact recovery was first answered in our previous work [23]. Later, [15,29] improved this result to the near-optimal bound L ≥ C0s(K + N) up to some $$\log $$-factors; the main theoretical result is informally summarized in the following theorem. Theorem 2.2 (Theorem 1.1 in [15]). Suppose that Ai are L × N i.i.d. complex Gaussian matrices and B is an L × K partial DFT matrix with B*B =IK. Then solving (2.10) gives exact recovery if the number of measurements L satisfies   \begin{align*} L \geq C_{\gamma} s(K+N)\log^{3}L \end{align*} with probability at least 1 − L−γ where Cγ is a constant depending only linearly on γ. While the semidefinite programming (SDP) relaxation is certainly effective and has theoretical performance guarantees, the computational cost of solving an SDP becomes prohibitive already for moderate-size problems, let alone large-scale ones. Therefore, we seek a more efficient non-convex approach, such as gradient descent, which is, ideally, also supported by theory.
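To make the setup concrete, here is a minimal numerical sketch of the measurement model (2.7) and the relative errors (2.9), assuming (as in Theorem 2.2) that B is an L × K partial DFT matrix with B*B = IK and that the Ai are i.i.d. complex Gaussian matrices; all dimensions and variable names are illustrative only.

```python
import numpy as np

# Illustrative sketch of y = A(H(h, x)) and the relative errors of (2.9).
# B is a partial DFT matrix (B^*B = I_K); the A_i are complex Gaussian.
rng = np.random.default_rng(0)
L_, K, N, s = 256, 8, 8, 3
B = np.exp(-2j * np.pi * np.outer(np.arange(L_), np.arange(K)) / L_) / np.sqrt(L_)
A = [(rng.standard_normal((L_, N)) + 1j * rng.standard_normal((L_, N))) / np.sqrt(2)
     for _ in range(s)]

def forward(h_list, x_list):
    # The l-th entry of A_i(h_i x_i^*) is (b_l^* h_i)(x_i^* a_il).
    return sum((B @ h) * (Ai @ x.conj()) for Ai, h, x in zip(A, h_list, x_list))

def rel_errors(h_list, x_list, h0_list, x0_list):
    # Per-component delta_i and global delta, as in (2.9).
    d0sq = [(np.linalg.norm(h0) * np.linalg.norm(x0))**2
            for h0, x0 in zip(h0_list, x0_list)]
    err = [np.linalg.norm(np.outer(h, x.conj()) - np.outer(h0, x0.conj()))**2
           for h, x, h0, x0 in zip(h_list, x_list, h0_list, x0_list)]
    deltas = np.sqrt(np.array(err) / np.array(d0sq))
    delta = float(np.sqrt(sum(err) / sum(d0sq)))
    return deltas, delta
```

Note the scaling ambiguity discussed below: replacing (hi, xi) by (αhi, α−1xi) for a (say, real) scalar α leaves the measurements unchanged.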
It seems quite natural to pursue this goal by minimizing the following nonlinear least squares objective function with respect to (h, x)   \begin{align} F(\boldsymbol{h}, \boldsymbol{x}) : = \|\mathcal{A} \left(\mathcal{H}(\boldsymbol{h},\boldsymbol{x})\right) - \boldsymbol{y}\|^{2} = \left\|\sum_{i=1}^{s}\mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}) - \boldsymbol{y}\right\|^{2}. \end{align} (2.11) In particular, if e = 0, we write   \begin{align} F_{0}(\boldsymbol{h}, \boldsymbol{x}) : = \left\|\sum_{i=1}^{s}\mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*})\right\|^{2}. \end{align} (2.12) As also pointed out in [21], this is a highly non-convex optimization problem. Commonly used algorithms, such as gradient descent or alternating minimization, need not converge to the global minimum, so we cannot always hope to obtain the desired solution; often, such simple algorithms get stuck in local minima. 2.2. The basin of attraction Motivated by several excellent recent papers on non-convex optimization for various signal processing and machine learning problems, we propose a two-step algorithm: (i) carefully compute an initial guess; (ii) apply gradient descent to the objective function, starting from that initial guess. One difficulty in understanding non-convex optimization lies in how to construct the so-called basin of attraction: if the starting point is inside this basin of attraction, the iterates will always stay inside the region and converge to the global minimum. The construction of the basin of attraction varies for different problems [3,8,34]. For this problem, similar to [21], the construction follows from the following three observations.
Each of these observations suggests the definition of a certain neighborhood, and the basin of attraction is then defined as the intersection of these three neighborhoods $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }.$$ Ambiguity of the solution: in fact, we can only recover (hi, xi) up to a scalar, since (αhi, α−1xi) and (hi, xi) are both solutions for $$\alpha \neq 0$$. From a numerical perspective, we want to avoid the scenario where $$\|\boldsymbol{h}_{i}\|\rightarrow 0$$ and $$\|\boldsymbol{x}_{i}\|\rightarrow \infty $$ while ∥hi∥∥xi∥ stays fixed, which potentially leads to numerical instability. To balance the norms ∥hi∥ and ∥xi∥ for all 1 ≤ i ≤ s, we define   \begin{align*} \mathcal{N}_d := \left\{\left\{(\boldsymbol{h}_{i}, \boldsymbol{x}_{i})\right\}_{i=1}^{s}: \| \boldsymbol{h}_{i}\| \leq 2\sqrt{d_{i0}}, \| \boldsymbol{x}_{i}\| \leq 2\sqrt{d_{i0}}, 1\leq i\leq s \right\}, \end{align*} which is a convex set. Incoherence: the performance depends on how large/small the incoherence parameter $${\mu ^{2}_{h}}$$ is, where $${\mu _{h}^{2}}$$ is defined by   \begin{align*} {\mu^{2}_{h}} : = \max_{1\leq i\leq s} \frac{L\|\boldsymbol{B}\boldsymbol{h}_{i0}\|^{2}_{\infty}}{\|\boldsymbol{h}_{i0}\|^{2}}. \end{align*} The smaller $${\mu ^{2}_{h}}$$ is, the better the performance. Consider an extreme case: if Bhi0 is highly sparse or spiky, we lose much of the information carried by the zero/small entries and cannot hope to recover the signals satisfactorily. In other words, we need the ground truth hi0 to exhibit 'spectral flatness', i.e., hi0 must not be highly localized in the Fourier domain. A similar quantity is also introduced in the matrix completion problem [7,34]. The larger $${\mu ^{2}_{h}}$$ is, the more hi0 is aligned with one particular row of B.
To control the incoherence between bl and hi, we define the second neighborhood,   \begin{align} \mathcal{N}_{\mu} := \left\{ \{\boldsymbol{h}_{i}\}_{i=1}^{s} : \sqrt{L} \|\boldsymbol{B}\boldsymbol{h}_{i}\|_{\infty} \leq 4\sqrt{d_{i0}}\mu, 1\leq i\leq s\right\}, \end{align} (2.13) where μ is a parameter with μ ≥ μh. Note that $$\mathcal{N}_{\mu }$$ is also a convex set. Closeness to the ground truth: we also want the initial guess to be close to the ground truth, i.e.,   \begin{align} \mathcal{N}_{\epsilon} := \left\{\left\{(\boldsymbol{h}_{i}, \boldsymbol{x}_{i})\right\}_{i=1}^{s}: \delta_{i} = \frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}} \leq \varepsilon, 1\leq i\leq s \right\}\!, \end{align} (2.14) where ε is a predetermined parameter in $$(0, \frac{1}{15}]$$. Remark 2.3 To ensure δi ≤ ε, it suffices to ensure $$\delta \leq \frac{\varepsilon }{\sqrt{s}\kappa }$$ where $$\kappa := \frac{\max d_{i0}}{\min d_{i0}} \geq 1$$. This is because   \begin{align*} \frac{1}{s\kappa^{2}}\sum_{i=1}^{s}{\delta_{i}^{2}} \leq \delta^{2} \leq \frac{\varepsilon^{2}}{s\kappa^{2}} \end{align*} which implies $$\max _{1\leq i\leq s}\delta _{i} \leq \varepsilon .$$ Remark 2.4 When we say $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_{d}, \mathcal{N}_{\mu }$$ or $$\mathcal{N}_{\epsilon }$$, it means for all $$i=1,\dots ,s$$ we have $$(\boldsymbol{h}_{i},\boldsymbol{x}_{i}) \in \mathcal{N}_d$$, $$\mathcal{N}_{\mu }$$ or $$\mathcal{N}_{\epsilon }$$, respectively. In particular, $$(\boldsymbol{h}_{0}, \boldsymbol{x}_{0}) \in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ where h0 and x0 are defined in (2.5). 2.3. Objective function and Wirtinger derivative To implement the first two observations, we introduce the regularizer G(h, x), defined as the sum of s components   \begin{align} G(\boldsymbol{h}, \boldsymbol{x}):= \sum_{i=1}^{s} G_{i}(\boldsymbol{h}_{i},\boldsymbol{x}_{i}) .
\end{align} (2.15) For each component Gi(hi, xi), we let $$\rho \geq d^{2} + 2\|\boldsymbol{e}\|^{2}$$, 0.9d0 ≤ d ≤ 1.1d0 and 0.9di0 ≤ di ≤ 1.1di0 for all 1 ≤ i ≤ s, and   \begin{align} G_{i} := \rho \left[ \underbrace{G_{0}\left(\frac{\|\boldsymbol{h}_{i}\|^{2}}{2d_{i}}\right) + G_{0}\left(\frac{\|\boldsymbol{x}_{i}\|^{2}}{2d_{i}}\right)}_{ \mathcal{N}_d } + \underbrace{\sum_{l=1}^{L}G_{0}\left(\frac{L |\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}|^{2}}{8d_{i}\mu^{2} }\right)}_{\mathcal{N}_{\mu}} \right], \end{align} (2.16) where $$G_{0}(z) = \max \{z-1, 0\}^{2}$$. Here both d and $$\{d_{i}\}_{i=1}^{s}$$ are data-driven and well approximated by our spectral initialization procedure, and μ2 is a tuning parameter which could be estimated if we assume a specific statistical model for the channel (for example, in the widely used Rayleigh fading model, the channel coefficients are assumed to be complex Gaussian). The idea behind Gi is quite straightforward even though its formulation looks complicated. For each Gi in (2.16), the first two terms force the iterates to lie in $$\mathcal{N}_d$$ and the third term encourages the iterates to lie in $$\mathcal{N}_{\mu }.$$ What about the neighborhood $$\mathcal{N}_{\epsilon }$$? A proper choice of the initialization, followed by gradient descent that keeps the objective function decreasing, will ensure that the iterates stay in $$\mathcal{N}_{\epsilon }$$. Finally, we consider the objective function given by the sum of the nonlinear least squares term F(h, x) in (2.11) and the regularizer G(h, x),   \begin{align} \widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) := F(\boldsymbol{h},\boldsymbol{x}) + G(\boldsymbol{h}, \boldsymbol{x}). \end{align} (2.17) Note that the input of the function $$\widetilde{F}(\boldsymbol{h},\boldsymbol{x})$$ consists of complex variables, but its output is real-valued.
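For concreteness, the regularizer Gi of (2.16) can be transcribed directly into code. The following sketch takes the parameters di, μ and ρ as inputs, exactly as described above; the function names are our own.

```python
import numpy as np

# Sketch of the regularizer G_i of (2.16). G0 penalizes only arguments
# exceeding 1, so G_i vanishes strictly inside N_d ∩ N_mu and grows
# quadratically as an iterate approaches (and crosses) the boundary.
def G0(z):
    return np.maximum(z - 1.0, 0.0) ** 2

def G_i(h, x, B, d_i, mu, rho):
    L_ = B.shape[0]
    norm_part = G0(np.linalg.norm(h)**2 / (2.0 * d_i)) \
              + G0(np.linalg.norm(x)**2 / (2.0 * d_i))       # pushes towards N_d
    incoh_part = np.sum(G0(L_ * np.abs(B @ h)**2
                           / (8.0 * d_i * mu**2)))           # pushes towards N_mu
    return rho * (norm_part + incoh_part)
```

With di ≈ di0, the penalty is inactive whenever ∥hi∥² ≤ 2di, ∥xi∥² ≤ 2di and L|bl*hi|² ≤ 8diμ², i.e., strictly inside the neighborhoods above. Recall also that, although h and x are complex, the value of G (like F̃) is real.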
As a result, the following simple relations hold   \begin{align*} \frac{\partial \widetilde{F}}{\partial \bar{\boldsymbol{h}}_{i}} = \overline{\frac{\partial \widetilde{F}}{\partial \boldsymbol{h}_{i}} }, \quad \frac{\partial \widetilde{F}}{\partial \bar{\boldsymbol{x}}_{i}} = \overline{\frac{\partial \widetilde{F}}{\partial \boldsymbol{x}_{i}} }. \end{align*} Similar properties also apply to both F(h, x) and G(h, x). Therefore, to minimize this function, it suffices to consider only the gradient of $$\widetilde{F}$$ with respect to $$\bar{\boldsymbol{h}}_{i}$$ and $$\bar{\boldsymbol{x}}_{i}$$, which is also called Wirtinger derivative [8]. The Wirtinger derivatives of F(h, x) and G(h, x) w.r.t. $$\bar{\boldsymbol{h}}_{i}$$ and $$\bar{\boldsymbol{x}}_{i}$$ can be easily computed as follows   \begin{align} \nabla F_{\boldsymbol{h}_{i}} & = \mathcal{A}_{i}^{*}\left(\mathcal{A}(\boldsymbol{X})- \boldsymbol{y}\right)\boldsymbol{x}_{i} = \mathcal{A}_{i}^{*}\left(\mathcal{A}(\boldsymbol{X}-\boldsymbol{X}_{0})- \boldsymbol{e} \right)\boldsymbol{x}_{i}, \\ \nabla F_{\boldsymbol{x}_{i}} & = \left(\mathcal{A}_{i}^{*}\left(\mathcal{A}(\boldsymbol{X}) - \boldsymbol{y}\right)\right)^{*}\boldsymbol{h}_{i} = \left(\mathcal{A}_{i}^{*}\left(\mathcal{A}(\boldsymbol{X}-\boldsymbol{X}_{0}) - \boldsymbol{e}\right)\right)^{*}\boldsymbol{h}_{i}, \\ \nabla G_{\boldsymbol{h}_{i}} & = \frac{\rho}{2d_{i}}\left[G^{\prime}_{0}\left(\frac{\|\boldsymbol{h}_{i}\|^{2}}{2d_{i}}\right) \boldsymbol{h}_{i} + \frac{L}{4\mu^{2}} \sum_{l=1}^{L} G^{\prime}_{0}\left(\frac{L|\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}|^{2}}{8d_{i}\mu^{2}}\right) \boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i} \right], \\ \nabla G_{\boldsymbol{x}_{i}} & = \frac{\rho}{2d_{i}} G^{\prime}_{0}\left( \frac{\|\boldsymbol{x}_{i}\|^{2}}{2d_{i}}\right) \boldsymbol{x}_{i}, \end{align} (2.18) where $$\mathcal{A}(\boldsymbol{X}) = \sum _{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$ and $$\mathcal{A}^{*}$$ 
is defined in (2.8). In short, we denote   \begin{align} \nabla\widetilde{F}_{\boldsymbol{h}} : = \nabla F_{\boldsymbol{h}} + \nabla G_{\boldsymbol{h}}, \quad \nabla F_{\boldsymbol{h}} : =\left[ \begin{array}{@{}c@{}} \nabla F_{\boldsymbol{h}_{1}} \\ \vdots \\ \nabla F_{\boldsymbol{h}_{s}} \end{array}\right], \quad \nabla G_{\boldsymbol{h}} : =\left[ \begin{array}{@{}c@{}} \nabla G_{\boldsymbol{h}_{1}} \\ \vdots \\ \nabla G_{\boldsymbol{h}_{s}} \end{array}\right]. \end{align} (2.22) Similar definitions hold for $$\nabla \widetilde{F}_{\boldsymbol{x}},\nabla F_{\boldsymbol{x}}$$ and $$\nabla G_{\boldsymbol{x}}$$. It is easy to see that $$\nabla F_{\boldsymbol{h}} = \mathcal{A}^{*}(\mathcal{A}(\boldsymbol{X}) - \boldsymbol{y})\boldsymbol{x}$$ and $$\nabla F_{\boldsymbol{x}} = (\mathcal{A}^{*}(\mathcal{A}(\boldsymbol{X}) - \boldsymbol{y}))^{*}\boldsymbol{h}$$. 3. Algorithm and theory 3.1. Two-step algorithm As mentioned before, the first step is to find a good initial guess $$(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)})\in \mathbb{C}^{Ks}\times \mathbb{C}^{Ns}$$ that lies inside the basin of attraction.
The initialization follows from this key fact:   \begin{align*} \operatorname{\mathbb{E}}(\mathcal{A}_{i}^{*}(\boldsymbol{y})) = \operatorname{\mathbb{E}}\left(\mathcal{A}_{i}^{*}\left(\sum_{j=1}^{s}\mathcal{A}_{j}(\boldsymbol{h}_{j0}\boldsymbol{x}_{j0}^{*} )+\boldsymbol{e}\right)\right) = \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}, \end{align*} where we use $$\boldsymbol{B}^{*}\boldsymbol{B} = \sum _{l=1}^{L}\boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*} = \boldsymbol{I}_{K}$$, $$\operatorname{\mathbb{E}}(\boldsymbol{a}_{il}\boldsymbol{a}_{il}^{*}) = \boldsymbol{I}_{N}$$ and   \begin{align*} \operatorname{\mathbb{E}}(\mathcal{A}_{i}^{*}\mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*})) & = \sum_{l=1}^{L} \boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \operatorname{\mathbb{E}}(\boldsymbol{a}_{il}\boldsymbol{a}_{il}^{*}) = \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}, \\ \operatorname{\mathbb{E}}(\mathcal{A}_{j}^{*}\mathcal{A}_{i}(\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*})) & = \sum_{l=1}^{L} \boldsymbol{b}_{l}\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \operatorname{\mathbb{E}}(\boldsymbol{a}_{il}\boldsymbol{a}_{jl}^{*}) = \boldsymbol{0},\quad \forall j\neq i. \end{align*} Therefore, it is natural to extract the leading singular value and the associated left and right singular vectors from each $$\mathcal{A}_{i}^{*}(\boldsymbol{y})$$ and use them as a (hopefully good) approximation to (di0, hi0, xi0). This idea leads to Algorithm 1, whose theoretical guarantees are given in Section 6.5.
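A simplified sketch of this spectral initialization (the core of Algorithm 1) in the noiseless case: form Ai*(y) and extract its leading singular triple. The projection step of the actual algorithm, which additionally enforces membership in N_μ, is omitted here, and all dimensions are illustrative.

```python
import numpy as np

# Spectral initialization sketch: M_i = A_i^*(y) concentrates around
# h_i0 x_i0^*, so its top singular triple approximates (d_i0, h_i0, x_i0).
rng = np.random.default_rng(3)
L_, K, N, s = 4096, 10, 10, 2
B = np.exp(-2j * np.pi * np.outer(np.arange(L_), np.arange(K)) / L_) / np.sqrt(L_)
A = [(rng.standard_normal((L_, N)) + 1j * rng.standard_normal((L_, N))) / np.sqrt(2)
     for _ in range(s)]

h0 = [rng.standard_normal(K) + 1j * rng.standard_normal(K) for _ in range(s)]
x0 = [rng.standard_normal(N) + 1j * rng.standard_normal(N) for _ in range(s)]
y = sum((B @ h) * (Ai @ x.conj()) for Ai, h, x in zip(A, h0, x0))

def spectral_init(y):
    inits = []
    for Ai in A:
        M = (B.conj().T * y) @ Ai.conj()       # A_i^*(y) = sum_l y_l b_l a_il^*
        U, S, Vh = np.linalg.svd(M)
        d_i = S[0]
        u_i = np.sqrt(d_i) * U[:, 0]           # balanced norms: ||u|| = ||v|| = sqrt(d_i)
        v_i = np.sqrt(d_i) * Vh[0].conj()
        inits.append((u_i, v_i, d_i))
    return inits
```

For L large enough, each di is a good estimate of di0 and (ui, vi) is well aligned with (hi0, xi0), up to the inherent scalar ambiguity.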
The second step of the algorithm is simply to apply gradient descent to $$\widetilde{F}$$ with the initial guess $$\{(\boldsymbol{u}^{(0)}_{i}, \boldsymbol{v}^{(0)}_{i}, d_{i})\}_{i=1}^{s}$$ or $$(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)},\{ d_{i}\}_{i=1}^{s})$$, where u(0) stems from stacking all $$\boldsymbol{u}^{(0)}_{i}$$ into one long vector. Remark 3.1 For Algorithm 2, we can rewrite each iteration as  \begin{align*} \boldsymbol{u}^{(t)} = \boldsymbol{u}^{(t-1)} - \eta\nabla \widetilde{F}_{\boldsymbol{h}}(\boldsymbol{u}^{(t-1)}, \boldsymbol{v}^{(t-1)}), \quad\boldsymbol{v}^{(t)} = \boldsymbol{v}^{(t-1)} - \eta\nabla \widetilde{F}_{\boldsymbol{x}}(\boldsymbol{u}^{(t-1)}, \boldsymbol{v}^{(t-1)}), \end{align*} where $$\nabla \widetilde{F}_{\boldsymbol{h}}$$ and $$\nabla \widetilde{F}_{\boldsymbol{x}}$$ are given in (2.22), and   \begin{align*} \boldsymbol{u}^{(t)} : =\left[ \begin{array}{@{}c@{}} \boldsymbol{u}_{1}^{(t)} \\ \vdots \\ \boldsymbol{u}_{s}^{(t)} \end{array}\right], \quad \boldsymbol{v}^{(t)} : =\left[ \begin{array}{@{}c@{}} \boldsymbol{v}_{1}^{(t)} \\ \vdots \\ \boldsymbol{v}_{s}^{(t)} \end{array}\right]. \end{align*} 3.2. Main results Our main findings are summarized as follows: Theorem 3.2 shows that the initial guess given by Algorithm 1 indeed belongs to the basin of attraction; moreover, di serves as a good approximation of di0 for each i. Theorem 3.3 demonstrates that regularized Wirtinger gradient descent guarantees linear convergence of the iterates, and that the recovery is exact in the noise-free case and stable in the presence of noise.
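The gradient update of Remark 3.1 is straightforward to implement. The following minimal sketch uses only the data-fidelity term F of (2.11) and its Wirtinger gradients; the gradients of the regularizer G would be added in the same way, and all dimensions, the stepsize and the variable names are illustrative.

```python
import numpy as np

# Plain Wirtinger gradient descent on F (regularizer omitted for brevity).
rng = np.random.default_rng(4)
L_, K, N, s = 512, 5, 5, 2
B = np.exp(-2j * np.pi * np.outer(np.arange(L_), np.arange(K)) / L_) / np.sqrt(L_)
A = [(rng.standard_normal((L_, N)) + 1j * rng.standard_normal((L_, N))) / np.sqrt(2)
     for _ in range(s)]

def forward(h_list, x_list):
    return sum((B @ h) * (Ai @ x.conj()) for Ai, h, x in zip(A, h_list, x_list))

def grad_F(h_list, x_list, y):
    # grad_h F = A_i^*(A(X) - y) x_i,   grad_x F = (A_i^*(A(X) - y))^* h_i
    r = forward(h_list, x_list) - y
    gh, gx = [], []
    for Ai, h, x in zip(A, h_list, x_list):
        M = (B.conj().T * r) @ Ai.conj()     # A_i^*(r), a K x N matrix
        gh.append(M @ x)
        gx.append(M.conj().T @ h)
    return gh, gx

def gradient_descent(h_list, x_list, y, eta, T):
    for _ in range(T):
        gh, gx = grad_F(h_list, x_list, y)
        h_list = [h - eta * g for h, g in zip(h_list, gh)]
        x_list = [x - eta * g for x, g in zip(x_list, gx)]
    return h_list, x_list
```

At the (noiseless) ground truth the residual vanishes, hence so do both gradients, consistent with exact recovery being a stationary point.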
Theorem 3.2 The initialization obtained via Algorithm 1 satisfies   \begin{align} (\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)}) \in \frac{1}{\sqrt{3}}\mathcal{N}_d\bigcap \frac{1}{\sqrt{3}} \mathcal{N}_{\mu}\bigcap \mathcal{N}_{\frac{2\varepsilon}{5\sqrt{s}\kappa}} \end{align} (3.1) and   \begin{align} 0.9d_{i0} \leq d_{i}\leq 1.1d_{i0},\quad 0.9d_{0} \leq d\leq 1.1d_{0}, \end{align} (3.2) with probability at least 1 − L−γ+1, provided the number of measurements satisfies   \begin{align} L \geq C_{\gamma+\log(s)}({\mu_{h}^{2}} + \sigma^{2})s^{2} \kappa^{4} \max\{K,N\}\log^{2} L/\varepsilon^{2}. \end{align} (3.3) Here ε is any predetermined constant in $$(0, \frac{1}{15}]$$, and Cγ is a constant depending only linearly on γ, with γ ≥ 1. Theorem 3.3 Starting with an initial value z(0) := (u(0), v(0)) satisfying (3.1), Algorithm 2 creates a sequence of iterates (u(t), v(t)) which converges to the global minimum linearly,   \begin{align} \|\mathcal{H}(\boldsymbol{u}^{(t)}, \boldsymbol{v}^{(t)}) - \mathcal{H}(\boldsymbol{h}_{0}, \boldsymbol{x}_{0}) \|_{F} \leq \frac{\varepsilon d_{0}}{\sqrt{2s\kappa^{2}}}(1 - \eta\omega)^{t/2} + 60\sqrt{s} \|\mathcal{A}^{*}(\boldsymbol{e})\| \end{align} (3.4) with probability at least 1 − L−γ+1, where $$\eta \omega = \mathcal{O}((s\kappa d_{0}(K+N)\log ^{2}L)^{-1})$$ and   \begin{align*} \|\mathcal{A}^{*}(\boldsymbol{e})\| \leq C_{0} \sigma d_{0}\sqrt{\frac{\gamma s(K + N)(\log^{2}L)}{L}} \end{align*} provided the number of measurements L satisfies   \begin{align} L \geq C_{\gamma+\log (s)}(\mu^{2} + \sigma^{2})s^{2} \kappa^{4} \max\{K,N\}\log^{2} L/\varepsilon^{2}. \end{align} (3.5) Remark 3.4 Our previous work [23] shows that the convex approach via semidefinite programming (see (2.10)) requires $$L \geq C_{0}s^{2}(K +{\mu ^{2}_{h}} N)\log ^{3}(L)$$ to ensure exact recovery. Later, [15] improved this result to the near-optimal bound $$L\geq C_{0}s(K +{\mu ^{2}_{h}} N)$$ up to some $$\log $$-factors.
The difference between the non-convex and convex methods lies in the appearance of the condition number κ in (3.5). This is not just an artifact of the proof: empirically we also observe that the value of κ affects the convergence rate of our non-convex algorithm, see Fig. 5. Remark 3.5 Our theory suggests an s2-dependence for the number of measurements L, although numerically L in fact depends on s linearly, as shown in Section 4. The reason for the s2-dependence will be addressed in detail in Section 5.2. Remark 3.6 In the theoretical analysis, we assume that Ai (or equivalently Ci) is a Gaussian random matrix. Numerical simulations suggest that this assumption is not necessary. For example, Ci may be chosen to be a Hadamard-type matrix, which is more appropriate and favorable for communications. Remark 3.7 If e = 0, (3.4) shows that (u(t), v(t)) converges to the ground truth at a linear rate. On the other hand, if noise is present, (u(t), v(t)) is guaranteed to converge to a point within a small neighborhood of (h0, x0). More importantly, as the number of measurements L grows, $$\|\mathcal{A}^{*}(\boldsymbol{e})\|$$ decays at the rate $$\mathcal{O}(L^{-1/2})$$. 4. Numerical simulations In this section we present a range of numerical simulations to illustrate and complement different aspects of our theoretical framework. We empirically analyze the number of measurements needed for perfect joint deconvolution/demixing and compare it to our theoretical bounds. We also study the robustness to noisy data. In our simulations we use Gaussian encoding matrices, as in our theorems, but we also try more realistic structured encoding matrices, reminiscent of those one might come across in wireless communications. While Theorem 3.3 says that the number of measurements L depends quadratically on the number of sources s, numerical simulations suggest near-optimal performance.
Figure 2 demonstrates that L actually depends linearly on s, i.e., the boundary between success (white) and failure (black) is approximately a linear function of s. In this experiment, K = N = 50 are fixed, all Ai are complex Gaussian matrices and all (hi, xi) are standard complex Gaussian vectors. For each pair (L, s), 25 experiments are performed and we treat the recovery as a success if $$\frac{\|\hat{\boldsymbol{X}} - \boldsymbol{X}_{0}\|_{F}}{\|\boldsymbol{X}_{0}\|_{F}} \leq 10^{-3}.$$ For our algorithm, we use backtracking to determine the stepsize, and the iteration stops either if $$\|\mathcal{A}(\mathcal{H}(\boldsymbol{h}^{(t+1)}, \boldsymbol{x}^{(t+1)}) - \mathcal{H}(\boldsymbol{h}^{(t)}, \boldsymbol{x}^{(t)})) \| < 10^{-6}\|\boldsymbol{y}\|$$ or if the number of iterations reaches 500. The backtracking is based on the Armijo–Goldstein condition [25]. The initial stepsize is chosen to be $$\eta = \frac{1}{K+N}$$. If $$\widetilde{F}(\boldsymbol{z}^{(t)} - \eta \nabla \widetilde{F}(\boldsymbol{z}^{(t)}))> \widetilde{F}(\boldsymbol{z}^{(t)})$$, we divide η by two and retry with the smaller stepsize. Fig. 2. Phase transition plot for empirical recovery performance under different choices of (L, s) where K = N = 50 are fixed. Black region: failure; white region: success. The red solid line depicts the number of degrees of freedom and the green dashed line shows the empirical phase transition bound for Algorithm 2. Fig. 3. Empirical probability of successful recovery for different pairs of (L, s) when K = N = 50 are fixed.
Fig. 4. Relative error vs. SNR (dB): SNR = $$20\log _{10}\left (\frac{\|\boldsymbol{y}\|}{\|\boldsymbol{e}\|}\right )$$. We see from Fig. 2 that the number of measurements needed for the proposed algorithm to succeed not only seems to depend linearly on the number of sensors, but is actually rather close to the information-theoretic limit s(K + N). Indeed, the green dashed line in Fig. 2, which represents the empirical boundary of the phase transition between success and failure, corresponds approximately to the line $$L = \frac{3}{2} s(K+N)$$. It is interesting to compare this empirical performance to the sharp theoretical phase transition bounds one would obtain via convex optimization [10,26]. Considering the convex approach based on lifting in [23], we can adapt the theoretical framework of [10] to the blind deconvolution/demixing setting, but with one modification. The bounds in [10] rely on Gaussian widths of tangent cones related to the measurement matrices $$\mathcal{A}_{i}$$. Since simple analytic formulas for these expressions seem to be out of reach for the structured rank-one measurement matrices used in our paper, we instead compute the bounds for full-rank Gaussian random matrices, which yields a sharp bound of about 3s(K + N) (the corresponding bounds for rank-one sensing matrices will likely have a constant larger than 3). Note that these sharp theoretical bounds predict the empirical behavior of convex methods quite accurately. Thus our empirical bound for the non-convex method compares rather favorably with that of the convex approach. Similar conclusions can be drawn from Fig.
3; there all Ai are of the form Ai = FDiH, where F is the unitary L × L DFT matrix, the Di are independent diagonal binary ±1 matrices and H is a fixed deterministic L × N partial Hadamard matrix. The purpose of using the Di is to enhance the incoherence between the channels so that our algorithm is able to tell apart each individual signal and channel. As before, we assume Gaussian channels, i.e., $$\boldsymbol{h}_{i}\sim \mathcal{C}\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{K})$$. Hence, our approach works not only for Gaussian encoding matrices Ai, but also for matrices that are relevant to real-world applications, although no satisfactory theory has been derived yet for that case. Moreover, due to the structure of Ai and B, fast transform algorithms are available, potentially allowing for real-time deployment. Figure 4 shows the robustness of our algorithm under different levels of noise. We run 25 samples for each level of SNR and each choice of L, and then compute the average relative error. The relative error (in dB) scales linearly with the SNR: one unit of increase in SNR (in dB) results in one unit of decrease in the relative error. Theorem 3.3 suggests that the performance and convergence rate actually depend on the condition number of $$\boldsymbol{X}_{0} = \mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0})$$, i.e., on $$\kappa = \frac{\max d_{i0}}{\min d_{i0}}$$ where di0 = ∥hi0∥∥xi0∥. Next we demonstrate that this dependence on the condition number is not an artifact of the proof, but is indeed observed empirically. In this experiment, we let s = 2 and set $$d_{1,0} = 1$$ for the first component and $$d_{2,0} = \kappa$$ for the second, with κ ∈ {1, 2, 5}. Here, κ = 1 means that the received signals of both sensors have equal power, whereas κ = 5 means that the signal received from the second sensor is considerably stronger. The initial stepsize is chosen as η = 1, followed by the backtracking scheme.
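The stepsize rule used in the experiments can be sketched generically as follows; F and grad stand for the objective and its Wirtinger gradient, and the function name and the max_halvings safeguard are our own additions.

```python
import numpy as np

# One gradient step with stepsize halving (Armijo-Goldstein style):
# halve eta until the objective decreases, then take the step.
def backtracking_step(F, grad, z, eta0, max_halvings=30):
    g, eta = grad(z), eta0
    for _ in range(max_halvings):
        if F(z - eta * g) < F(z):
            break
        eta /= 2.0          # divide eta by two and try again
    return z - eta * g, eta
```

In the experiments above the initial stepsize is η = 1/(K + N) (or η = 1 in the condition-number study), and the accepted stepsize is reused as the starting point for the next iteration.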
Figure 5 shows how the relative error decays with the number of iterations t for different condition numbers κ and numbers of measurements L. The larger κ is, the slower the convergence, as we see from Fig. 5. There are two likely reasons: our spectral initialization may not be able to give a good initial guess for the weak components; moreover, during the gradient descent procedure, the gradient directions for the weak components can be dominated/polluted by the strong components. Currently, we have no effective way to deal with this issue of slow convergence when κ is not small, and we leave this topic for future investigations. 5. Convergence analysis Our convergence analysis relies on the following four conditions, the first three of which are local properties. We will also briefly discuss how they contribute to the proof of our main theorem. Note that our previous work [21] on blind deconvolution is actually a special case (s = 1) of (2.1). The proof of Theorem 3.3 follows in part the main ideas of [21], and the reader will find that the technical parts of [21] and this manuscript share many similarities. However, there are also important differences. After all, we are now dealing with a more complicated problem, where the ground truth matrix X0 and the measurement matrices are both rank-s block-diagonal matrices, as shown in (2.4), instead of the rank-one matrices in [21]. The key is to understand the properties of the linear operator $$\mathcal{A}$$ applied to different types of block-diagonal matrices. Therefore, many technical details are much more involved, while on the other hand some of the results of [21] can be used directly. Throughout the presentation, we will clearly point out both the similarities to and differences from [21]. Fig. 5. Relative error vs. number of iterations t. 5.1.
Four key conditions Condition 5.1 Local regularity condition: let $$\boldsymbol{z} : = (\boldsymbol{h}, \boldsymbol{x})\in \mathbb{C}^{s(K+N)}$$ and $$\nabla \widetilde{F}(\boldsymbol{z}) := \left [ {{\nabla \widetilde{F}_{\boldsymbol{h}}(\boldsymbol{z})} \atop {\nabla \widetilde{F}_{\boldsymbol{x}}(\boldsymbol{z})}}\right ] \in \mathbb{C}^{s(K+N)}$$, then   \begin{align} \|\nabla \widetilde{F}(\boldsymbol{z})\|^{2} \geq \omega [\widetilde{F}(\boldsymbol{z}) - c]_{+} \end{align} (5.1) for $$\boldsymbol{z} \in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ where $$\omega = \frac{d_{0}}{7000}$$ and $$c = \|\boldsymbol{e}\|^{2} + 2000s\|\mathcal{A}^{*}(\boldsymbol{e})\|^{2}.$$ We will prove Condition 5.1 in Section 6.3. Condition 5.1 states that $$\widetilde{F}(\boldsymbol{z}) = 0$$ if $$\|\nabla \widetilde{F}(\boldsymbol{z})\| = 0$$ and e = 0, i.e., all the stationary points inside the basin of attraction are global minima. Condition 5.2 Local smoothness condition: let z = (h, x) and w = (u, v); then there holds   \begin{align} \|\nabla\widetilde{F}(\boldsymbol{z} + \boldsymbol{w}) - \nabla\widetilde{F}(\boldsymbol{z})\| \leq C_{L} \|\boldsymbol{w}\| \end{align} (5.2) for z + w and z inside $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ where $$C_{L} \approx \mathcal{O}(d_{0}s\kappa (1 + \sigma ^{2})(K + N)\log ^{2} L )$$ is the Lipschitz constant of $$\nabla\widetilde{F}$$ over $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$. The convergence rate is governed by CL. The proof of Condition 5.2 can be found in Section 6.4. Condition 5.3 Local restricted isometry property: denote $$\boldsymbol{X} = \mathcal{H}(\boldsymbol{h}, \boldsymbol{x})$$ and $$\boldsymbol{X}_{0} = \mathcal{H}(\boldsymbol{h}_{0}, \boldsymbol{x}_{0})$$.
There holds   \begin{align} \frac{2}{3} \|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}^{2} \leq \left\| \mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) \right\|^{2} \leq \frac{3}{2} \|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}^{2} \end{align} (5.3) uniformly for all $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$. Condition 5.3 will be proven in Section 6.2. It says that the convergence of the objective function implies the convergence of the iterates. Remark 5.4 (Necessity of inter-user incoherence). Although Condition 5.3 looks the same as the corresponding condition in our previous work [21], it is in fact quite different. Recall that $$\mathcal{A}$$ is a linear operator acting on block-diagonal matrices and that its output is the sum of s different components involving the $$\mathcal{A}_{i}$$. Therefore, the proof of Condition 5.3 heavily depends on the inter-user incoherence, whereas this notion of incoherence is not needed at all in the single-user scenario. At the beginning of Section 2, we discussed the choice of Ci (or Ai). In order to distinguish one user from another, it is essential to use sufficiently different encoding matrices Ci (or Ai). Here the independence and Gaussianity of all Ci (or Ai) guarantee that $$\|\mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}}\|$$ is sufficiently small for all $$i\neq j$$, where Ti is defined in (6.1). This is a key element in ensuring the validity of Condition 5.3, which in turn is an important component of the proof of Condition 5.1. On the other hand, due to recent progress on this joint deconvolution and demixing problem, one is also able to prove a local restricted isometry property with tools such as bounds on the suprema of chaos processes [15] by assuming the $$\{\boldsymbol{A}_{i}\}_{i=1}^{s}$$ to be Gaussian matrices. Condition 5.5 Robustness condition: let $$\varepsilon \leq \frac{1}{15}$$ be a predetermined constant.
We have   \begin{align} \|\mathcal{A}^{*}(\boldsymbol{e})\| = \max_{1\leq i\leq s}\|\mathcal{A}_{i}^{*}(\boldsymbol{e})\| \leq \frac{\varepsilon d_{0}}{10\sqrt{2}s \kappa}, \end{align} (5.4) where $$\boldsymbol{e}\sim \mathcal{C}\mathcal{N}\big(0, \frac{\sigma ^{2}{d_{0}^{2}}}{L}\big)$$, if $$L \geq C_{\gamma}\kappa^{2}s^{2}(K+N)/\varepsilon^{2}$$. We will prove Condition 5.5 in Section 6.5. We now derive a useful consequence of Conditions 5.3 and 5.5: together they yield a good approximation of F(h, x) for all $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ in terms of δ in (2.9). For $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$, the following inequality holds   \begin{align} \frac{2}{3}\delta^{2}{d_{0}^{2}} -\frac{\varepsilon\delta{d_{0}^{2}}}{5\sqrt{s}\kappa} + \|\boldsymbol{e}\|^{2} \leq F(\boldsymbol{h}, \boldsymbol{x}) \leq \frac{3}{2}\delta^{2}{d_{0}^{2}} + \frac{\varepsilon\delta{d_{0}^{2}}}{5\sqrt{s}\kappa} + \|\boldsymbol{e}\|^{2}. \end{align} (5.5) Note that (5.5) simply follows from   \begin{align*} F(\boldsymbol{h}, \boldsymbol{x}) = \| \mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) \|_{F}^{2} - 2\operatorname{Re}(\langle \boldsymbol{X}- \boldsymbol{X}_{0}, \mathcal{A}^{*}(\boldsymbol{e})\rangle) + \|\boldsymbol{e}\|^{2}. \end{align*} Note that (5.3) implies $$\frac{2}{3}\delta ^{2}{d_{0}^{2}}\leq \|\mathcal{A}(\boldsymbol{X}-\boldsymbol{X}_{0})\|_{F}^{2}\leq \frac{3}{2}\delta ^{2}{d_{0}^{2}}$$.
Thus it suffices to estimate the cross-term,   \begin{align} |\!\operatorname{Re}(\langle \boldsymbol{X}- \boldsymbol{X}_{0}, \mathcal{A}^{*}(\boldsymbol{e})\rangle)| & \leq \|\mathcal{A}^{*}(\boldsymbol{e})\| \|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{*} = \|\mathcal{A}^{*}(\boldsymbol{e})\| \sum_{i=1}^{s}\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{*} \nonumber \\ & \leq \sqrt{2}\|\mathcal{A}^{*}(\boldsymbol{e})\| \sum_{i=1}^{s}\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F} \nonumber \\ & \leq \sqrt{2s} \|\mathcal{A}^{*}(\boldsymbol{e})\| \|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F} \leq \frac{\varepsilon \delta{d_{0}^{2}}}{10\sqrt{s}\kappa}, \end{align} (5.6) where ∥⋅∥* and ∥⋅∥ are a pair of dual norms and the bound on $$\|\mathcal{A}^{*}(\boldsymbol{e})\|$$ comes from (5.4). 5.2. Outline of the convergence analysis For ease of exposition, we introduce another neighborhood:   \begin{align*} \mathcal{N}_{\widetilde{F}} = \left\{ (\boldsymbol{h},\boldsymbol{x}) : \widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) \leq \frac{\varepsilon^{2}{d_{0}^{2}}}{3s\kappa^{2}} + \|\boldsymbol{e}\|^{2}\right\}. \end{align*} One reason to consider $$\mathcal{N}_{\widetilde{F}}$$ is that gradient descent can only decrease the objective function if the step size is chosen appropriately. In other words, all the iterates z(t) generated by gradient descent stay inside $$\mathcal{N}_{\widetilde{F}}$$ as long as $$\boldsymbol{z}^{(0)}\in \mathcal{N}_{\widetilde{F}}.$$ On the other hand, it is crucial to note that the decrease of the objective function does not necessarily imply the decrease of the relative error of the iterates. Therefore, we want to construct an initial guess in $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ so that z(0) is sufficiently close to the ground truth, and then analyze the behavior of z(t). 
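The monotone-decrease property behind the sublevel set $$\mathcal{N}_{\widetilde{F}}$$ can be illustrated on a toy smooth objective. The sketch below uses a real-valued least-squares function, for which the classical descent lemma guarantees a per-step decrease of at least $$\frac{\eta }{2}\|\nabla f\|^{2}$$ when $$\eta \leq 1/C_{L}$$ (the complex Wirtinger version used in this paper gains an extra factor 2); it is an illustration only, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 10
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

def f(z):
    return np.sum((A @ z - y) ** 2)

def grad(z):
    return 2.0 * A.T @ (A @ z - y)

C_L = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of grad f
eta = 1.0 / C_L                         # stepsize at most 1 / C_L

z = rng.standard_normal(n)
vals = [f(z)]
for _ in range(100):
    z = z - eta * grad(z)
    vals.append(f(z))                   # objective value never increases
vals = np.array(vals)
```

Since each iterate has a lower objective value than its predecessor, every iterate stays in the sublevel set determined by the initial guess, which is the mechanism exploited in the analysis.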
In the rest of this section, we prove the following chain of inclusions:   \begin{align*} \underbrace{ \frac{1}{\sqrt{3}}\mathcal{N}_d\cap \frac{1}{\sqrt{3}} \mathcal{N}_{\mu} \cap \mathcal{N}_{\frac{2\varepsilon}{5\sqrt{s}\kappa}}}_{\textrm{Initial guess}} \subset \underbrace{\mathcal{N}_{\epsilon}\cap \mathcal{N}_{\widetilde{F}}}_{ \{\boldsymbol{z}^{(t)}\}_{t\geq 0}\ \textrm{in}\ \mathcal{N}_{\epsilon}\cap \mathcal{N}_{\widetilde{F}} } \subset \underbrace{\mathcal{N}_d \cap \mathcal{N}_{\mu} \cap \mathcal{N}_{\epsilon}}_{\textrm{Key conditions hold over}\ \mathcal{N}_d \cap \mathcal{N}_{\mu} \cap \mathcal{N}_{\epsilon} }. \end{align*} We now explain this chain in more detail; it constitutes the main structure of the proof: We will show $$\frac{1}{\sqrt{3}}\mathcal{N}_d\cap \frac{1}{\sqrt{3}} \mathcal{N}_{\mu } \cap \mathcal{N}_{\frac{2\varepsilon }{5\sqrt{s}\kappa }} \subset \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ in the proof of Theorem 3.3 in Section 5.3; this step is quite straightforward. Lemma 5.6 explains why $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}\subset \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ holds and where the s2-bottleneck comes from. Lemma 5.8 implicitly shows that the iterates z(t) will remain in $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ if the initial guess z(0) is inside $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ and $$\widetilde{F}(\boldsymbol{z}^{(t)})$$ is monotonically decreasing (simply by induction). Lemma 5.9 makes this observation explicit by showing that $$\boldsymbol{z}^{(t)}\in \mathcal{N}_{\epsilon } \cap \mathcal{N}_{\widetilde{F}}$$ implies $$\boldsymbol{z}^{(t+1)} : = \boldsymbol{z}^{(t)} - \eta \nabla \widetilde{F}(\boldsymbol{z}^{(t)})\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ if the stepsize η obeys $$\eta \leq \frac{1}{C_{L}}$$. 
Moreover, Lemma 5.9 guarantees sufficient decrease of $$\widetilde{F}(\boldsymbol{z}^{(t)})$$ in each iteration, which paves the way toward the proof of linear convergence of $$\widetilde{F}(\boldsymbol{z}^{(t)})$$ and thus of z(t). Remember that $$\mathcal{N}_d$$ and $$\mathcal{N}_{\mu }$$ are both convex sets, and the purpose of introducing the regularizers Gi(hi, xi) is to approximately project the iterates onto $$\mathcal{N}_d\cap \mathcal{N}_{\mu }.$$ Moreover, we hope that once the iterates are inside $$\mathcal{N}_{\epsilon }$$ and inside the sublevel set $$\mathcal{N}_{\widetilde{F}}$$, they will never escape from $$\mathcal{N}_{\widetilde{F}}\cap \mathcal{N}_{\epsilon }$$. These ideas are made precise in the following lemma. Lemma 5.6 Assume 0.9di0 ≤ di ≤ 1.1di0 and 0.9d0 ≤ d ≤ 1.1d0. There holds $$\mathcal{N}_{\widetilde{F}} \subset \mathcal{N}_d \, \cap\, \mathcal{N}_{\mu }$$; moreover, under Conditions 5.3 and 5.5, we have $$\mathcal{N}_{\widetilde{F}} \cap \mathcal{N}_{\epsilon }\subset \mathcal{N}_d \cap \mathcal{N}_{\mu }\cap \mathcal{N}_{\frac{9}{10}\epsilon }$$. Proof. If $$(\boldsymbol{h}, \boldsymbol{x}) \notin \mathcal{N}_d \cap \mathcal{N}_{\mu }$$, then by the definition of G in (2.15), at least one component of G exceeds $$\rho G_{0}\left (\frac{2d_{i0}}{d_{i}}\right )$$. We have   \begin{align*} \widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) & \geq \rho G_{0}\left(\frac{2d_{i0}}{d_{i}}\right) \geq (d^{2} + 2\|\boldsymbol{e}\|^{2}) \left( \frac{2d_{i0}}{d_{i}} - 1\right)^{2} \\ & \geq (2/1.1 - 1)^{2} (d^{2} + 2\|\boldsymbol{e}\|^{2}) \\ & \geq \frac{1}{2}{d_{0}^{2}} + \|\boldsymbol{e}\|^{2}> \frac{\varepsilon^{2}{d_{0}^{2}}}{3s \kappa^{2}} + \|\boldsymbol{e}\|^{2}, \end{align*} where we used ρ ≥ d2 + 2∥e∥2, 0.9d0 ≤ d ≤ 1.1d0 and 0.9di0 ≤ di ≤ 1.1di0. This implies $$(\boldsymbol{h}, \boldsymbol{x}) \notin \mathcal{N}_{\widetilde{F}}$$ and hence $$\mathcal{N}_{\widetilde{F}} \subset \mathcal{N}_d \cap \mathcal{N}_{\mu }$$. 
Note that $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ if $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_{\widetilde{F}} \cap \mathcal{N}_{\epsilon }$$. Applying (5.5) gives   \begin{align*} \frac{2}{3}\delta^{2}{d_{0}^{2}} -\frac{\varepsilon\delta{d_{0}^{2}}}{5\sqrt{s}\kappa} + \|\boldsymbol{e}\|^{2} \leq F(\boldsymbol{h}, \boldsymbol{x})\leq \widetilde{F}(\boldsymbol{h}, \boldsymbol{x})\leq\frac{\varepsilon^{2}{d_{0}^{2}}}{3s\kappa^{2}} + \|\boldsymbol{e}\|^{2} \end{align*} which implies that $$\delta \leq \frac{9}{10}\frac{\varepsilon }{\sqrt{s}\kappa }.$$ By definition of δ in (2.9), there holds   \begin{align} \frac{81\varepsilon^{2}}{100s\kappa^{2}} \geq \delta^{2} = \frac{\sum_{i=1}^{s}{\delta_{i}^{2}}d_{i0}^{2}}{\sum_{i=1}^{s} d_{i0}^{2}} \geq \frac{\sum_{i=1}^{s}{\delta_{i}^{2}}}{s\kappa^{2}} \geq \frac{1}{s\kappa^{2}} \max_{1\leq i\leq s}{\delta_{i}^{2}}, \end{align} (5.7) which gives $$\delta _{i} \leq \frac{9}{10}\varepsilon $$ and $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_{\frac{9}{10}\varepsilon }.$$ Remark 5.7 The s2-bottleneck comes from (5.7). If δ ≤ ε is small, we cannot guarantee that each δi is also smaller than ε. Just consider the simplest case when all di0 are the same: then $${d_{0}^{2}} = \sum _{i=1}^{s} d_{i0}^{2} = s d_{i0}^{2}$$ and there holds   \begin{align*} \varepsilon^{2}\geq \delta^{2} = \frac{1}{s}\sum_{i=1}^{s}{\delta_{i}^{2}}. \end{align*} Obviously, we cannot conclude that $$\max \delta _{i} \leq \varepsilon $$, but only say that $$\delta _{i} \leq \sqrt{s}\varepsilon .$$ This is why we require $$\delta ={\cal O}\big(\frac{\varepsilon }{\sqrt{s}}\big)$$ to ensure δi ≤ ε, which gives s2-dependence in L. Lemma 5.8 Denote z1 = (h1, x1) and z2 = (h2, x2). Let z(λ) := (1 − λ)z1 + λz2. 
If $$\boldsymbol{z}_{1} \in \mathcal{N}_{\epsilon }$$ and $$\boldsymbol{z}(\lambda ) \in \mathcal{N}_{\widetilde{F}}$$ for all λ ∈ [0, 1], we have $$\boldsymbol{z}_{2} \in \mathcal{N}_{\epsilon }$$. Proof. Note that for $$\boldsymbol{z}_{1}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$, we have $$\boldsymbol{z}_{1}\in \mathcal{N}_d\cap \mathcal{N}_{\mu }\cap \mathcal{N}_{\frac{9}{10}\varepsilon }$$, which follows from the second part of Lemma 5.6. Now we prove $$\boldsymbol{z}_{2}\in \mathcal{N}_{\epsilon }$$ by contradiction. Suppose that $$\boldsymbol{z}_{2} \notin \mathcal{N}_{\epsilon }$$ while $$\boldsymbol{z}_{1} \in \mathcal{N}_{\epsilon }$$. By continuity of $$\lambda \mapsto \boldsymbol{z}(\lambda )$$, there exists λ0 ∈ [0, 1] such that $$\boldsymbol{z}(\lambda _{0}):=(\boldsymbol{h}(\lambda _{0}), \boldsymbol{x}(\lambda _{0}))$$ lies on the boundary of $$\mathcal{N}_{\epsilon }$$, i.e. $$\max _{1\leq i\leq s}\frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}} = \epsilon $$. Therefore, $$\boldsymbol{z}(\lambda _{0}) \in \mathcal{N}_{\widetilde{F}}\cap \mathcal{N}_{\epsilon }$$ and Lemma 5.6 implies $$\max _{1\leq i\leq s}\frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}} \leq \frac{9}{10}\epsilon $$, which contradicts $$\max _{1\leq i\leq s}\frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}} = \epsilon $$. Lemma 5.9 Let the stepsize $$\eta \leq \frac{1}{C_{L}}$$, $$\boldsymbol{z}^{(t)} : = (\boldsymbol{u}^{(t)}, \boldsymbol{v}^{(t)})\in \mathbb{C}^{s(K + N)}$$ and CL be the Lipschitz constant of $$\nabla \widetilde{F}(\boldsymbol{z})$$ over $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ in (5.2). 
If $$\boldsymbol{z}^{(t)}\in \mathcal{N}_{\epsilon } \cap \mathcal{N}_{\widetilde{F}}$$, we have $$\boldsymbol{z}^{(t+1)} \in \mathcal{N}_{\epsilon } \cap \mathcal{N}_{\widetilde{F}}$$ and   \begin{align} \widetilde{F}(\boldsymbol{z}^{(t+1)}) \leq \widetilde{F}(\boldsymbol{z}^{(t)}) - \eta \|\nabla \widetilde{F}(\boldsymbol{z}^{(t)})\|^{2}, \end{align} (5.8) where $$\boldsymbol{z}^{(t+1)} = \boldsymbol{z}^{(t)} - \eta \nabla \widetilde{F}(\boldsymbol{z}^{(t)}).$$ Remark 5.10 This lemma tells us that once $$\boldsymbol{z}^{(t)}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$, the next iterate $$\boldsymbol{z}^{(t+1)} = \boldsymbol{z}^{(t)} - \eta \nabla \widetilde{F}(\boldsymbol{z}^{(t)})$$ is also inside $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ as long as the stepsize $$\eta \leq \frac{1}{C_{L}}$$. In other words, $$\mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ is in fact a stronger version of the basin of attraction. Moreover, the objective function decays sufficiently in each step as long as we can bound $$\|\nabla \widetilde{F}\|$$ from below, which is guaranteed by the Local Regularity Condition 5.1. Proof. Let $$\phi (\tau ) := \widetilde{F}(\boldsymbol{z}^{(t)} - \tau \nabla \widetilde{F}(\boldsymbol{z}^{(t)}))$$, $$\phi (0) = \widetilde{F}(\boldsymbol{z}^{(t)})$$ and consider the following quantity:   \begin{align*} \tau_{\max}: = \max \{\mu: \phi(\tau) \leq \widetilde{F}(\boldsymbol{z}^{(t)}), 0\leq\tau \leq \mu \}, \end{align*} where $$\tau _{\max }$$ is the largest stepsize such that the objective function $$\widetilde{F}(\boldsymbol{z})$$ evaluated at any point over the whole line segment $$\{\boldsymbol{z}^{(t)} -\tau \nabla \widetilde{F}(\boldsymbol{z}^{(t)}), 0\leq \tau \leq \tau _{\max }\}$$ is not greater than $$\widetilde{F}(\boldsymbol{z}^{(t)})$$. Now we will show $$\tau _{\max } \geq \frac{1}{C_{L}}$$. Obviously, if $$\|\nabla \widetilde{F}(\boldsymbol{z}^{(t)})\| = 0$$, it holds automatically. 
Consider $$\|\nabla \widetilde{F}(\boldsymbol{z}^{(t)})\|\neq 0$$ and assume $$\tau _{\max } < \frac{1}{C_{L}}$$. First note that, since $$\nabla \widetilde{F}(\boldsymbol{z}^{(t)})\neq 0$$,   \begin{align*} \frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d} \tau} \phi(\tau)\Big|_{\tau = 0} < 0 \Longrightarrow\tau_{\max}> 0. \end{align*} By the definition of $$\tau _{\max }$$, there holds $$\phi (\tau _{\max }) = \phi (0)$$ since ϕ(τ) is a continuous function w.r.t. τ. Lemma 5.8 implies   \begin{align*} \{ \boldsymbol{z}^{(t)} - \tau \nabla\widetilde{F}(\boldsymbol{z}^{(t)}), 0\leq \tau \leq \tau_{\max} \} \subseteq \mathcal{N}_{\epsilon}\cap\mathcal{N}_{\widetilde{F}}. \end{align*} Now we apply Lemma 6.20, the modified descent lemma, and obtain   \begin{align*} \widetilde{F}(\boldsymbol{z}^{(t)} - \tau_{\max}\nabla\widetilde{F}(\boldsymbol{z}^{(t)})) \leq \widetilde{F}(\boldsymbol{z}^{(t)}) - (2\tau_{\max} - C_{L}\tau_{\max}^{2})\|\nabla\widetilde{F}(\boldsymbol{z}^{(t)})\|^{2} \leq \widetilde{F}(\boldsymbol{z}^{(t)}) - \tau_{\max}\|\nabla\widetilde{F}(\boldsymbol{z}^{(t)})\|^{2}, \end{align*} where the second inequality uses $$C_{L}\tau _{\max } < 1.$$ In other words, $$\phi (\tau _{\max }) = \widetilde{F}(\boldsymbol{z}^{(t)} - \tau _{\max }\nabla \widetilde{F}(\boldsymbol{z}^{(t)})) < \widetilde{F}(\boldsymbol{z}^{(t)}) = \phi (0)$$, which contradicts $$\phi (\tau _{\max }) = \phi (0)$$. Therefore, we conclude that $$\tau _{\max } \geq \frac{1}{C_{L}}$$. For any $$\eta \leq \frac{1}{C_{L}}$$, Lemma 5.8 implies   \begin{align*} \{ \boldsymbol{z}^{(t)} - \tau \nabla\widetilde{F}(\boldsymbol{z}^{(t)}), 0\leq \tau \leq \eta \} \subseteq \mathcal{N}_{\epsilon}\cap\mathcal{N}_{\widetilde{F}} \end{align*} and applying Lemma 6.20 gives   \begin{align*} \widetilde{F}(\boldsymbol{z}^{(t)} - \eta \nabla\widetilde{F}(\boldsymbol{z}^{(t)})) \leq \widetilde{F}(\boldsymbol{z}^{(t)}) - (2\eta - C_{L}\eta^{2})\|\nabla\widetilde{F}(\boldsymbol{z}^{(t)})\|^{2} \leq \widetilde{F}(\boldsymbol{z}^{(t)}) - \eta\|\nabla\widetilde{F}(\boldsymbol{z}^{(t)})\|^{2}. \end{align*} 5.3. 
Proof of Theorem 3.3 Combining all the considerations above, we now prove Theorem 3.3 to conclude this section. Proof. The proof consists of three parts: Part I: proof of $$\boldsymbol{z}^{(0)} : = (\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)}) \in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$. From the assumption of Theorem 3.3,   \begin{align*} \boldsymbol{z}^{(0)} \in \frac{1}{\sqrt{3}}\mathcal{N}_d \bigcap \frac{1}{\sqrt{3}}\mathcal{N}_{\mu}\cap \mathcal{N}_{\frac{2\varepsilon}{5\sqrt{s}\kappa}}. \end{align*} First we show G(u(0), v(0)) = 0: for 1 ≤ i ≤ s, by the definition of $$\mathcal{N}_d$$ and $$\mathcal{N}_{\mu }$$,   \begin{align*} \frac{\|\boldsymbol{u}^{(0)}_{i}\|^{2}}{2d_{i}} \leq \frac{2d_{i0}}{3d_{i}} < 1, \quad \frac{L|\boldsymbol{b}_{l}^{*} \boldsymbol{u}^{(0)}_{i}|^{2}}{8d_{i}\mu^{2}} \leq \frac{L}{8d_{i}\mu^{2}} \cdot\frac{16d_{i0}\mu^{2}}{3L} \leq \frac{2d_{i0}}{3d_{i}} < 1, \end{align*} where $$\|\boldsymbol{u}^{(0)}_{i}\| \leq \frac{2\sqrt{d_{i0}}}{\sqrt{3}}$$, $$\sqrt{L}\|\boldsymbol{B}\boldsymbol{u}^{(0)}_{i}\|_{\infty } \leq \frac{4 \sqrt{d_{i0}}\mu }{\sqrt{3}}$$ and $$\frac{9}{10}d_{i0} \leq d_{i}\leq \frac{11}{10}d_{i0}.$$ Therefore   \begin{align*} G_{0}\left( \frac{\|\boldsymbol{u}^{(0)}_{i}\|^{2}}{2d_{i}}\right) = G_{0}\left( \frac{\|\boldsymbol{v}^{(0)}_{i}\|^{2}}{2d_{i}}\right) = G_{0}\left(\frac{L|\boldsymbol{b}_{l}^{*}\boldsymbol{u}_{i}^{(0)}|^{2}}{8d_{i}\mu^{2}}\right) = 0\end{align*} for all 1 ≤ l ≤ L and G(u(0), v(0)) = 0. 
For $$\boldsymbol{z}^{(0)} = (\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)})\in \mathcal{N}_{\frac{2\varepsilon }{5\sqrt{s}\kappa }}$$, we have $$\delta (\boldsymbol{z}^{(0)}) := \frac{\sqrt{\sum _{i=1}^{s}{\delta _{i}^{2}}d_{i0}^{2} }}{d_{0}} \leq \frac{2\varepsilon }{5\sqrt{s}\kappa }.$$ Since $$\delta (\boldsymbol{z}^{(0)}) \leq \frac{2\varepsilon }{5\sqrt{s}\kappa }$$ and G(u(0), v(0)) = 0, (5.5) yields   \begin{align*} \widetilde{F}(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)}) = F(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)}) \leq \|\boldsymbol{e}\|^{2} + \frac{3}{2}\delta^{2}(\boldsymbol{z}^{(0)}){d_{0}^{2}} + \frac{\varepsilon \delta(\boldsymbol{z}^{(0)}){d_{0}^{2}}}{5\sqrt{s}\kappa} \leq \|\boldsymbol{e}\|^{2} + \frac{\varepsilon^{2}{d_{0}^{2}}}{3s\kappa^{2}} \end{align*} and hence $$\boldsymbol{z}^{(0)} = (\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)})\in \mathcal{N}_{\epsilon }\bigcap \mathcal{N}_{\widetilde{F}}.$$ Part II: the linear convergence of the objective function$$ \ \widetilde{F}(\boldsymbol{z}^{(t)})$$. Denote z(t) := (u(t), v(t)). Since $$\boldsymbol{z}^{(0)}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$, Lemma 5.9 and induction imply $$\boldsymbol{z}^{(t)}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}$$ for all t ≥ 0, provided $$\eta \leq \frac{1}{C_{L}}$$. Moreover, combining Condition 5.1 with Lemma 5.9 leads to   \begin{align*} \widetilde{F}(\boldsymbol{z}^{(t )}) \leq \widetilde{F}(\boldsymbol{z}^{(t-1)}) - \eta\omega \left[ \widetilde{F}(\boldsymbol{z}^{(t-1)}) - c \right]_{+}, \quad t\geq 1 \end{align*} with $$c = \|\boldsymbol{e}\|^{2} + a\|\mathcal{A}^{*}(\boldsymbol{e})\|^{2}$$ and a = 2000s. 
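This recursion contracts the bracketed gap $$[\widetilde{F}(\boldsymbol{z}^{(t)}) - c]_{+}$$ by a factor (1 − ηω) per step. A toy scalar iteration (with hypothetical values for ηω and the noise floor c) reproduces the contraction exactly:

```python
import numpy as np

eta_omega = 0.1    # hypothetical value of eta * omega
c = 0.3            # hypothetical noise floor ||e||^2 + a ||A*(e)||^2
F = 5.0            # initial objective value F(z^(0))

gaps = []
for t in range(50):
    gaps.append(max(F - c, 0.0))           # the gap [F(z^(t)) - c]_+
    F = F - eta_omega * max(F - c, 0.0)    # the recursion above
gaps = np.array(gaps)
```

The gap decays geometrically toward 0, so the objective converges linearly to the noise floor c, matching the induction carried out next.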
Therefore, by induction, we have   \begin{align*} \left[ \widetilde{F}(\boldsymbol{z}^{(t)}) - c\right]_{+} \leq (1 - \eta\omega) \left[ \widetilde{F}(\boldsymbol{z}^{(t-1)}) - c \right]_{+} \leq \left(1 - \eta\omega\right)^{t} \left[ \widetilde{F}(\boldsymbol{z}^{(0)}) - c\right]_{+} \leq \frac{\varepsilon^{2}{d_{0}^{2}}}{3s\kappa^{2}} (1 - \eta\omega)^{t}, \end{align*} where $$\widetilde{F}(\boldsymbol{z}^{(0)}) \leq \frac{\varepsilon ^{2}{d_{0}^{2}}}{3s\kappa ^{2}} + \|\boldsymbol{e}\|^{2}$$ and $$\left [ \widetilde{F}(\boldsymbol{z}^{(0)}) - c \right ]_{+} \leq \left [ \frac{1}{3s\kappa ^{2}}\varepsilon ^{2}{d_{0}^{2}} - a\|\mathcal{A}^{*}(\boldsymbol{e})\|^{2} \right ]_{+} \leq \frac{\varepsilon ^{2}{d_{0}^{2}}}{3s\kappa ^{2}}.$$ Now we conclude that $$\left [ \widetilde{F}(\boldsymbol{z}^{(t)}) - c\right ]_{+}$$ converges to 0 linearly. Part III: the linear convergence of the iterates (u(t), v(t)). Denote   \begin{align*} \delta(\boldsymbol{z}^{(t)}) : = \frac{\|\mathcal{H}(\boldsymbol{u}^{(t)}, \boldsymbol{v}^{(t)}) - \mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0})\|_{F}}{d_{0}}. \end{align*} Note that $$\boldsymbol{z}^{(t)}\in \mathcal{N}_{\epsilon }\cap \mathcal{N}_{\widetilde{F}}\subseteq \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ and over $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$, there holds $$F_{0}(\boldsymbol{z}^{(t)}) \geq \frac{2}{3}\delta ^{2}(\boldsymbol{z}^{(t)}){d_{0}^{2}}$$, which follows from the local restricted isometry property (RIP) in (5.3) and the definition of F0(z(t)) in (2.12). 
Moreover   \begin{align*} \widetilde{F}(\boldsymbol{z}^{(t)}) - \|\boldsymbol{e}\|^{2} \geq & F_{0}(\boldsymbol{z}^{(t)}) - 2\operatorname{Re}\left(\langle \mathcal{A}^{*}(\boldsymbol{e}), \mathcal{H}(\boldsymbol{u}^{(t)}, \boldsymbol{v}^{(t)}) - \mathcal{H}(\boldsymbol{h}_{0}, \boldsymbol{x}_{0}) \rangle\right) \\ & \geq \frac{2}{3} \delta^{2}(\boldsymbol{z}^{(t)}){d_{0}^{2}} - 2\sqrt{2s}\|\mathcal{A}^{*}(\boldsymbol{e})\| \delta(\boldsymbol{z}^{(t)})d_{0}, \end{align*} where G(z(t)) ≥ 0 and the second inequality follows from (5.6). There holds   \begin{align*} \frac{2}{3} \delta^{2}(\boldsymbol{z}^{(t)}){d_{0}^{2}} - 2\sqrt{2s}\|\mathcal{A}^{*}(\boldsymbol{e})\| \delta(\boldsymbol{z}^{(t)})d_{0} - a\|\mathcal{A}^{*}(\boldsymbol{e})\|^{2} \leq \left[ \widetilde{F}(\boldsymbol{z}^{(t)}) - c \right]_{+} \leq \frac{ \varepsilon^{2}{d_{0}^{2}}}{3s\kappa^{2}}(1 - \eta\omega)^{t} \end{align*} and equivalently, after completing the square,   \begin{align*} \left|\delta(\boldsymbol{z}^{(t)})d_{0} - \frac{3\sqrt{2s}}{2} \|\mathcal{A}^{*}(\boldsymbol{e})\| \right|{}^{2} \leq \frac{\varepsilon^{2}{d_{0}^{2}}}{2s\kappa^{2}} (1 - \eta\omega)^{t} + \left(\frac{3}{2}a + \frac{9s}{2}\right)\|\mathcal{A}^{*}(\boldsymbol{e})\|^{2}. \end{align*} Solving the inequality above for δ(z(t)), we have   \begin{align} \delta(\boldsymbol{z}^{(t)}) d_{0} & \leq \frac{\varepsilon d_{0}}{\sqrt{2s\kappa^{2}}}(1 - \eta\omega)^{t/2} +\left(\frac{3\sqrt{2s}}{2} + \sqrt{\frac{3}{2}a + \frac{9s}{2}} \right)\|\mathcal{A}^{*}(\boldsymbol{e})\| \nonumber \\ & \leq \frac{\varepsilon d_{0}}{\sqrt{2s\kappa^{2}}}(1 - \eta\omega)^{t/2} + 60\sqrt{s} \|\mathcal{A}^{*}(\boldsymbol{e})\|, \end{align} (5.9) where a = 2000s. 
Let $$d^{(t)} : = \sqrt{\sum _{i=1}^{s} \|\boldsymbol{u}_{i}^{(t)}\|^{2}\|\boldsymbol{v}_{i}^{(t)}\|^{2} }$$ for $$t\in \mathbb{Z}_{\geq 0}.$$ By (5.9) and the triangle inequality, we immediately obtain $$|d^{(t)} - d_{0}| \leq \frac{\varepsilon d_{0}}{\sqrt{2s\kappa ^{2}}}(1 - \eta \omega )^{t/2} + 60\sqrt{s} \|\mathcal{A}^{*}(\boldsymbol{e})\|.$$ 6. Proof of the four conditions This section is devoted to proving the four key conditions introduced in Section 5. The local smoothness condition and the robustness condition are relatively straightforward to establish. The more difficult part is to show the local regularity condition and the local restricted isometry property. The key to solving these problems is to understand how the vector-valued linear operator $$\mathcal{A}$$ in (2.7) behaves on block-diagonal matrices, such as $$\mathcal{H}(\boldsymbol{h},\boldsymbol{x})$$, $$\mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0})$$ and $$\mathcal{H}(\boldsymbol{h},\boldsymbol{x}) - \mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0}).$$ In particular, when s = 1, all these matrices become rank-one matrices, which were discussed in detail in our previous work [21]. 
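To fix ideas, the action of $$\mathcal{A}$$ on block-diagonal matrices can be materialized as an explicit matrix with one Kronecker row per measurement. The sketch below (real Gaussian stand-ins for B and the Ai, small toy dimensions) checks the triangle-inequality estimate $$\|\mathcal{A}\| \leq \sqrt{s}\max _{1\leq i\leq s}\|\mathcal{A}_{i}\|$$ that appears in the proof of Lemma 6.2 below:

```python
import numpy as np

rng = np.random.default_rng(4)
s, K, N, L = 3, 4, 3, 64

B = rng.standard_normal((L, K)) / np.sqrt(L)   # stand-in rows b_l^*
A = rng.standard_normal((s, L, N))             # stand-in vectors a_{il}

def op_matrix(i):
    # matrix of Z_i -> {b_l^* Z_i a_{il}}_{l=1}^L acting on vec(Z_i):
    # the l-th row is kron(b_l, a_{il}), since b^* Z a = <kron(b, a), vec(Z)>
    # for row-major vec(Z)
    return np.stack([np.kron(B[l], A[i, l]) for l in range(L)])

M_i = [op_matrix(i) for i in range(s)]
M = np.hstack(M_i)      # A(Z) = sum_i A_i(Z_i) for block-diagonal Z

norm_A = np.linalg.norm(M, 2)
max_norm_Ai = max(np.linalg.norm(Mi, 2) for Mi in M_i)
```

The estimate holds because concatenating the s blocks horizontally can increase the spectral norm by at most a factor $$\sqrt{s}$$ over the largest block.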
First of all, we define the linear subspace $$T_{i}\subset \mathbb{C}^{K\times N}$$ along with its orthogonal complement for 1 ≤ i ≤ s as   \begin{align} T_{i} & := \{ \boldsymbol{Z}_{i}\in\mathbb{C}^{K\times N} : \boldsymbol{Z}_{i} = \boldsymbol{h}_{i0}\boldsymbol{v}_{i}^{*} + \boldsymbol{u}_{i}\boldsymbol{x}_{i0}^{*}, \quad \boldsymbol{u}_{i}\in\mathbb{C}^{K},\boldsymbol{v}_{i}\in\mathbb{C}^{N} \},\nonumber \\ T^{\bot}_{i} & := \left\{ \left(\boldsymbol{I}_{K} - \frac{\boldsymbol{h}_{i0}\boldsymbol{h}_{i0}^{*}}{d_{i0}}\right) \boldsymbol{Z}_{i} \left(\boldsymbol{I}_{N} - \frac{\boldsymbol{x}_{i0}\boldsymbol{x}_{i0}^{*}}{d_{i0}}\right) :\boldsymbol{Z}_{i}\in\mathbb{C}^{K\times N} \right\}\!, \end{align} (6.1) where $$\|\boldsymbol{h}_{i0}\| = \|\boldsymbol{x}_{i0}\| = \sqrt{d_{i0}}.$$ In particular, $$\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} \in T_{i}$$ for all 1 ≤ i ≤ s. The proof also requires us to consider block-diagonal matrices whose ith block belongs to Ti (or $$T^{\bot }_{i}$$). Let $$\boldsymbol{Z} = \operatorname{blkdiag}(\boldsymbol{Z}_{1},\cdots ,\boldsymbol{Z}_{s})\in \mathbb{C}^{Ks\times Ns}$$ be a block-diagonal matrix and say Z ∈ T if   \begin{align*} T := \left\{\textrm{blkdiag}\ (\{\boldsymbol{Z}_{i}\}_{i=1}^{s}) | \boldsymbol{Z}_{i}\in T_{i} \right\} \end{align*} and Z ∈ T⊥ if   \begin{align*} T^{\bot} := \left\{\textrm{blkdiag}\ (\{\boldsymbol{Z}_{i}\}_{i=1}^{s}) | \boldsymbol{Z}_{i}\in T^{\bot}_{i} \right\}\!, \end{align*} where both T and T⊥ are subsets of $$\mathbb{C}^{Ks\times Ns}$$ and $$\mathcal{H}(\boldsymbol{h}_{0},\boldsymbol{x}_{0})\in T.$$ Now we take a closer look at a special case of block-diagonal matrices, namely $$\mathcal{H}(\boldsymbol{h}, \boldsymbol{x})$$, and calculate its projections onto T and T⊥; it suffices to consider $$\mathcal{P}_{T_{i}}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$ and $$\mathcal{P}_{T^{\bot }_{i}}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$. 
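The projections onto Ti and its complement admit the closed forms implicit in (6.1): $$\mathcal{P}_{T^{\bot }_{i}}(\boldsymbol{Z}) = (\boldsymbol{I}_{K} - \boldsymbol{h}_{i0}\boldsymbol{h}_{i0}^{*}/d_{i0})\boldsymbol{Z}(\boldsymbol{I}_{N} - \boldsymbol{x}_{i0}\boldsymbol{x}_{i0}^{*}/d_{i0})$$ and $$\mathcal{P}_{T_{i}} = \mathcal{I} - \mathcal{P}_{T^{\bot }_{i}}$$. A small numerical sketch (real data, one block) verifying idempotence, complementarity and $$\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\in T_{i}$$:

```python
import numpy as np

rng = np.random.default_rng(2)
K, N, d0 = 6, 5, 2.0
h0 = rng.standard_normal(K); h0 *= np.sqrt(d0) / np.linalg.norm(h0)
x0 = rng.standard_normal(N); x0 *= np.sqrt(d0) / np.linalg.norm(x0)

Ph = np.outer(h0, h0) / d0        # rank-one projector h0 h0^* / d0
Px = np.outer(x0, x0) / d0        # rank-one projector x0 x0^* / d0

def proj_T_perp(Z):
    # (I - h0 h0^*/d0) Z (I - x0 x0^*/d0), cf. (6.1)
    return (np.eye(K) - Ph) @ Z @ (np.eye(N) - Px)

def proj_T(Z):
    # Z - P_{T^perp}(Z) = Ph Z + Z Px - Ph Z Px, of the form h0 v^* + u x0^*
    return Z - proj_T_perp(Z)

Z = rng.standard_normal((K, N))
PZ, QZ = proj_T(Z), proj_T_perp(Z)
```

Expanding proj_T shows its output is indeed of the form $$\boldsymbol{h}_{0}\boldsymbol{v}^{*} + \boldsymbol{u}\boldsymbol{x}_{0}^{*}$$, so it lands in Ti as defined above.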
For each block $$\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}$$ and 1 ≤ i ≤ s, there are unique orthogonal decompositions   \begin{align} \boldsymbol{h}_{i} := \alpha_{i1} \boldsymbol{h}_{i0} + \tilde{\boldsymbol{h}}_{i}, \quad \boldsymbol{x}_{i} := \alpha_{i2} \boldsymbol{x}_{i0} + \tilde{\boldsymbol{x}}_{i}, \end{align} (6.2) where $$\boldsymbol{h}_{i0} \perp \tilde{\boldsymbol{h}}_{i}$$ and $$\boldsymbol{x}_{i0} \perp \tilde{\boldsymbol{x}}_{i}$$. It is important to note that $$\alpha _{i1} = \alpha _{i1}(\boldsymbol{h}_{i}) = \frac{\langle \boldsymbol{h}_{i0}, \boldsymbol{h}_{i}\rangle }{d_{i0}}$$ and $$\alpha _{i2} = \alpha _{i2}(\boldsymbol{x}_{i}) = \frac{\langle \boldsymbol{x}_{i0}, \boldsymbol{x}_{i}\rangle }{d_{i0}},$$ and thus αi1 and αi2 are functions of hi and xi, respectively. Immediately, we have the following matrix orthogonal decomposition for $$\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}$$ onto Ti and $$T^{\bot }_{i}$$,   \begin{align} \boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0} \boldsymbol{x}_{i0}^{*} = \underbrace{(\alpha_{i1} \overline{\alpha_{i2}} - 1)\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} + \overline{\alpha_{i2}} \tilde{\boldsymbol{h}}_{i} \boldsymbol{x}_{i0}^{*} + \alpha_{i1} \boldsymbol{h}_{i0} \tilde{\boldsymbol{x}}_{i}^{*}}_{\textrm{belong to}\ T_{i}} + \underbrace{\tilde{\boldsymbol{h}}_{i} \tilde{\boldsymbol{x}}_{i}^{*}}_{\textrm{belongs to}\ T^{\bot}_{i}}, \end{align} (6.3) where the first three components are in Ti while $$\tilde{\boldsymbol{h}}_{i}\tilde{\boldsymbol{x}}_{i}^{*}\in T^{\bot }_{i}$$. 6.1. Key lemmata From the decomposition in (6.2) and (6.3), we want to analyze how $$\|\tilde{\boldsymbol{h}}_{i}\|$$, $$\|\tilde{\boldsymbol{x}}_{i}\|$$, αi1 and αi2 depend on $$\delta _{i} = \frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}}$$ if δi < 1. 
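The decomposition (6.2)–(6.3) is easy to verify numerically; the sketch below (real vectors, so the conjugates disappear, with $$d_{i0} = 1$$ and a hypothetical perturbation size) also isolates the $$T^{\bot }_{i}$$ part $$\tilde{\boldsymbol{h}}_{i}\tilde{\boldsymbol{x}}_{i}^{*}$$:

```python
import numpy as np

rng = np.random.default_rng(3)
K, N, d0 = 6, 5, 1.0
h0 = rng.standard_normal(K); h0 /= np.linalg.norm(h0)
x0 = rng.standard_normal(N); x0 /= np.linalg.norm(x0)

eps = 0.05                       # hypothetical perturbation size
h = h0 + eps * rng.standard_normal(K)
x = x0 + eps * rng.standard_normal(N)

a1 = (h0 @ h) / d0               # alpha_{i1} in (6.2)
a2 = (x0 @ x) / d0               # alpha_{i2} in (6.2)
h_t = h - a1 * h0                # \tilde h_i, orthogonal to h0
x_t = x - a2 * x0                # \tilde x_i, orthogonal to x0

lhs = np.outer(h, x) - np.outer(h0, x0)
rhs = ((a1 * a2 - 1) * np.outer(h0, x0)   # in T_i
       + a2 * np.outer(h_t, x0)           # in T_i
       + a1 * np.outer(h0, x_t)           # in T_i
       + np.outer(h_t, x_t))              # in T_i^perp
T_perp_norm = np.linalg.norm(np.outer(h_t, x_t), 'fro')
```

Note that $$\|\tilde{\boldsymbol{h}}\tilde{\boldsymbol{x}}^{*}\|_{F} = \|\tilde{\boldsymbol{h}}\|\|\tilde{\boldsymbol{x}}\|$$ is a product of two perturbation-sized quantities, which is the mechanism behind the quadratic smallness of the $$T^{\bot }$$ part quantified in the lemma below.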
The following lemma, which can be viewed as an application of singular value/vector perturbation theory [40] to rank-one matrices, answers this question. From the lemma below, we can see that if $$\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}$$ is close to $$\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}$$, then $$\mathcal{P}_{T^{\bot }_{i}}(\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*})$$ is in fact very small (of order $${\cal O}({\delta _{i}^{2}} d_{i0})$$). Lemma 6.1 (Lemma 5.9 in [21]) Recall that $$\|\boldsymbol{h}_{i0}\| = \|\boldsymbol{x}_{i0}\| = \sqrt{d_{i0}}$$. If $$\delta _{i} := \frac{\|\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0} \boldsymbol{x}_{i0}^{*}\|_{F}}{d_{i0}}<1$$, we have the following useful bounds   \begin{align*} |\alpha_{i1}|\leq \frac{\|\boldsymbol{h}_{i}\|}{\|\boldsymbol{h}_{i0}\|}, \quad |\alpha_{i1}\overline{\alpha_{i2}} - 1|\leq \delta_{i}, \end{align*} and   \begin{align*} \|\tilde{\boldsymbol{h}}_{i}\| \leq \frac{\delta_{i}}{1 - \delta_{i}}\|\boldsymbol{h}_{i}\|,\quad \|\tilde{\boldsymbol{x}}_{i}\| \leq \frac{\delta_{i}}{1 - \delta_{i}}\|\boldsymbol{x}_{i}\|,\quad \|\tilde{\boldsymbol{h}}_{i}\| \|\tilde{\boldsymbol{x}}_{i}\| \leq \frac{{\delta_{i}^{2}}}{2(1 - \delta_{i})} d_{i0}. \end{align*} Moreover, if $$\|\boldsymbol{h}_{i}\| \leq 2\sqrt{d_{i0}}$$ and $$\sqrt{L}\|\boldsymbol{B} \boldsymbol{h}_{i}\|_{\infty } \leq 4\mu \sqrt{d_{i0}}$$, i.e., $$\boldsymbol{h}_{i}\in \mathcal{N}_d\bigcap \mathcal{N}_{\mu }$$, we have $$\sqrt{L}\|\boldsymbol{B} \tilde{\boldsymbol{h}}_{i}\|_{\infty } \leq 6 \mu \sqrt{d_{i0}}$$. Now we start to focus on several results related to the linear operator $$\mathcal{A}$$. Lemma 6.2 (Operator norm of $$\mathcal{A}$$). For $$\mathcal{A}$$ defined in (2.7), there holds   \begin{align} \|\mathcal{A}\| \leq \sqrt{s(N\log(NL/2) + (\gamma+\log s)\log L)} \end{align} (6.4) with probability at least 1 − L−γ. Proof. 
Note that $$\mathcal{A}_{i}(\boldsymbol{Z}_{i}) : = \{\boldsymbol{b}_{l}^{*}\boldsymbol{Z}_{i}\boldsymbol{a}_{il}\}_{l=1}^{L}$$ in (2.2). Lemma 1 in [1] implies   \begin{align*} \|\mathcal{A}_{i}\| \leq \sqrt{N\log(NL/2) + \gamma^{\prime}\log L} \end{align*} with probability at least $$1 - L^{-\gamma ^{\prime }}.$$ By taking the union bound over 1 ≤ i ≤ s,   \begin{align*} \max\|\mathcal{A}_{i}\| \leq \sqrt{N\log(NL/2) + (\gamma+ \log s)\log L} \end{align*} with probability at least $$1 - sL^{-\gamma -\log s} \geq 1 - L^{-\gamma }.$$ For $$\mathcal{A}$$ defined in (2.7), applying the triangle inequality gives   \begin{align*} \|\mathcal{A}(\boldsymbol{Z})\| = \left\|\sum_{i=1}^{s} \mathcal{A}_{i}(\boldsymbol{Z}_{i})\right\| \leq \sum_{i=1}^{s} \|\mathcal{A}_{i}\|\|\boldsymbol{Z}_{i}\|_{F} \leq \max_{1\leq i\leq s} \|\mathcal{A}_{i}\| \sqrt{s \sum_{i=1}^{s} \|\boldsymbol{Z}_{i}\|_{F}^{2}} = \sqrt{s}\max_{1\leq i\leq s} \|\mathcal{A}_{i}\| \|\boldsymbol{Z}\|_{F}, \end{align*} where $$\boldsymbol{Z} = \operatorname{blkdiag}(\boldsymbol{Z}_{1},\cdots , \boldsymbol{Z}_{s})\in \mathbb{C}^{Ks\times Ns}.$$ Therefore,   \begin{align*} \|\mathcal{A}\| \leq \sqrt{s}\max_{1\leq i\leq s}\|\mathcal{A}_{i}\| \leq \sqrt{ s(N\log(NL/2) + (\gamma+\log s)\log L)} \end{align*} with probability at least 1 − L−γ. Lemma 6.3 (Restricted isometry property for $$\mathcal{A}$$ on T). The linear operator $$\mathcal{A}$$ restricted on T is well-conditioned, i.e.,   \begin{align} \|\mathcal{P}_{T}\mathcal{A}^{*}\mathcal{A}\mathcal{P}_{T} - \mathcal{P}_{T}\| \leq \frac{1}{10}, \end{align} (6.5) where $$\mathcal{P}_{T}$$ is the projection operator from $$\mathbb{C}^{Ks\times Ns}$$ onto T, given $$L \geq C_{\gamma }s^{2} \max \{K,{\mu _{h}^{2}} N\}\log ^{2}L$$ with probability at least 1 − L−γ. 
Remark 6.4 Here $$\mathcal{A}\mathcal{P}_{T}$$ and $$\mathcal{P}_{T}\mathcal{A}^{*}$$ are defined as   \begin{align*} \mathcal{A}\mathcal{P}_{T}(\boldsymbol{Z}) = \sum_{i=1}^{s} \mathcal{A}_{i}(\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i})), \quad\mathcal{P}_{T}\mathcal{A}^{*}(\boldsymbol{z}) = \operatorname{blkdiag}( \mathcal{P}_{T_{1}}(\mathcal{A}_{1}^{*}(\boldsymbol{z})), \cdots, \mathcal{P}_{T_{s}}(\mathcal{A}_{s}^{*}(\boldsymbol{z})) ), \end{align*} respectively, where Z is a block-diagonal matrix and $$\boldsymbol{z}\in \mathbb{C}^{L}.$$ As shown in the remark above, the proof of Lemma 6.3 depends on the properties of both $$\mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{i}\mathcal{P}_{T_{i}}$$ and $$\mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}}$$ for $$i\neq j$$. Fortunately, we have already proven related results in [23], which we restate as follows: Lemma 6.5 (Inter-user incoherence, Corollaries 5.3 and 5.8 in [23]). There hold   \begin{align} \|\mathcal{P}_{T_{i}} \mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}}\| \leq \frac{1}{10s}, \quad \forall i\neq j; \qquad \|\mathcal{P}_{T_{i}} \mathcal{A}_{i}^{*}\mathcal{A}_{i}\mathcal{P}_{T_{i}} - \mathcal{P}_{T_{i}}\| \leq \frac{1}{10s}, \quad\forall 1\leq i\leq s \end{align} (6.6) with probability at least 1 − L−γ+1 if $$L\geq C_{\gamma }s^{2}\max \{K,{\mu ^{2}_{h}}N\}\log ^{2}L\log (s+1).$$ Note that $$\|\mathcal{P}_{T_{i}} \mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}}\| \leq \frac{1}{10s}$$ holds because of the independence of the individual random Gaussian matrices Ai. In particular, if s = 1, the inter-user incoherence $$\|\mathcal{P}_{T_{i}} \mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}}\| \leq \frac{1}{10s}$$ is not needed at all. With (6.6), it is easy to prove Lemma 6.3. 
Proof of Lemma 6.3 For any block diagonal matrix $$\boldsymbol{Z} = \operatorname{blkdiag}(\boldsymbol{Z}_{1}, \cdots ,\boldsymbol{Z}_{s})\in \mathbb{C}^{Ks\times Ns}$$ and $$\boldsymbol{Z}_{i}\in \mathbb{C}^{K\times N}$$,   \begin{align} \langle \boldsymbol{Z}, \mathcal{P}_{T}\mathcal{A}^{*}\mathcal{A}\mathcal{P}_{T}(\boldsymbol{Z}) - \mathcal{P}_{T}(\boldsymbol{Z})\rangle & = \sum_{1\leq i,j\leq s} \langle \mathcal{A}_{i}\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i}), \mathcal{A}_{j}\mathcal{P}_{T_{j}}(\boldsymbol{Z}_{j})\rangle - \|\mathcal{P}_{T}(\boldsymbol{Z})\|_{F}^{2} \nonumber \\ & = \sum_{i=1}^{s} \langle \boldsymbol{Z}_{i}, \mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{i}\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i}) - \mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i})\rangle + \sum_{i\neq j} \langle \mathcal{A}_{i}\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i}), \mathcal{A}_{j}\mathcal{P}_{T_{j}}(\boldsymbol{Z}_{j})\rangle. \end{align} (6.7) Using (6.6), the following two inequalities hold,   \begin{align*} |\langle \boldsymbol{Z}_{i}, \mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{i}\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i}) - \mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i})\rangle| & \leq \|\mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{i}\mathcal{P}_{T_{i}} - \mathcal{P}_{T_{i}} \| \|\boldsymbol{Z}_{i}\|_{F}^{2} \leq \frac{\|\boldsymbol{Z}_{i}\|^{2}_{F}}{10s}, \\ |\langle \mathcal{A}_{i}\mathcal{P}_{T_{i}}(\boldsymbol{Z}_{i}), \mathcal{A}_{j}\mathcal{P}_{T_{j}}(\boldsymbol{Z}_{j})\rangle| & \leq \|\mathcal{P}_{T_{i}}\mathcal{A}_{i}^{*}\mathcal{A}_{j}\mathcal{P}_{T_{j}} \| \|\boldsymbol{Z}_{i}\|_{F}\|\boldsymbol{Z}_{j}\|_{F} \leq \frac{\|\boldsymbol{Z}_{i}\|_{F}\|\boldsymbol{Z}_{j}\|_{F}}{10s}. 
\end{align*} After substituting both estimates into (6.7), we have   \begin{align*} |\langle \boldsymbol{Z}, \mathcal{P}_{T}\mathcal{A}^{*}\mathcal{A}\mathcal{P}_{T}(\boldsymbol{Z}) - \mathcal{P}_{T}(\boldsymbol{Z})\rangle| \leq \sum_{1\leq i, j\leq s} \frac{ \|\boldsymbol{Z}_{i}\|_{F}\|\boldsymbol{Z}_{j}\|_{F} }{10s} \leq \frac{1}{10s}\left(\sum_{i=1}^{s} \|\boldsymbol{Z}_{i}\|_{F}\right)^{2} \leq \frac{\|\boldsymbol{Z}\|_{F}^{2}}{10}. \end{align*} Finally, we show how $$\mathcal{A}$$ behaves when applied to block-diagonal matrices $$\boldsymbol{X} = \mathcal{H}(\boldsymbol{h},\boldsymbol{x})$$. In particular, the calculations simplify considerably in the case s = 1. Lemma 6.6 ($$\mathcal{A}$$ restricted to block-diagonal matrices with rank-one blocks). Consider $$\boldsymbol{X} = \mathcal{H}(\boldsymbol{h}, \boldsymbol{x})$$ and   \begin{align} \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}) := \max_{1\leq l\leq L} \sum_{i=1}^{s} |\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i}|^{2} \|\boldsymbol{x}_{i}\|^{2}. \end{align} (6.8) Conditioned on (6.4), we have   \begin{align} \|\mathcal{A}(\boldsymbol{X})\|^{2} \leq \frac{4}{3} \|\boldsymbol{X}\|_{F}^{2}+ 2 \sqrt{2s\|\boldsymbol{X}\|_{F}^{2} \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K+N)\log L} + 8s\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K+N) \log L, \end{align} (6.9) uniformly for any $$\boldsymbol{h}\in \mathbb{C}^{Ks}$$ and $$\boldsymbol{x}\in \mathbb{C}^{Ns}$$ with probability at least $$1 - \frac{1}{\gamma }\exp (-s(K+N))$$ if $$L\geq C_{\gamma }s(K+N)\log L$$. Here $$ \|\boldsymbol{X}\|_{F}^{2}= \|\mathcal{H}(\boldsymbol{h}, \boldsymbol{x})\|_{F}^{2} = \sum _{i=1}^{s} \|\boldsymbol{h}_{i}\|^{2}\|\boldsymbol{x}_{i}\|^{2}.$$ Remark 6.7 Here are a few more explanations and facts about $$\sigma ^{2}_{\max }(\boldsymbol{h},\boldsymbol{x})$$.
Note that $$\|\mathcal{A}(\boldsymbol{X})\|^{2}$$ is the sum of L sub-exponential random variables, i.e.,   \begin{align} \|\mathcal{A}(\boldsymbol{X})\|^{2} = \sum_{l=1}^{L} \left|\sum_{i=1}^{s} \boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i} \boldsymbol{x}_{i}^{*}\boldsymbol{a}_{il}\right|{}^{2}. \end{align} (6.10) Here $$\sigma ^{2}_{\max }(\boldsymbol{h}, \boldsymbol{x})$$ is the largest of the expectations of these L summands of $$\|\mathcal{A}(\boldsymbol{X})\|^{2}$$. Without loss of generality, we assume ∥xi∥ = 1 for 1 ≤ i ≤ s and let $$\boldsymbol{h}\in \mathbb{C}^{Ks}$$ be a unit vector, i.e., $$\|\boldsymbol{h}\|^{2} = \sum _{i=1}^{s} \|\boldsymbol{h}_{i}\|^{2}= 1$$. The bound   \begin{align} \frac{1}{L} \leq \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}) \leq \frac{K}{L} \end{align} (6.11) follows from $$L \sigma ^{2}_{\max }(\boldsymbol{h}, \boldsymbol{x}) \geq \sum _{l=1}^{L} \sum _{i=1}^{s} |\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}|^{2} = \|\boldsymbol{h}\|^{2}=1.$$ Moreover, $$\sigma _{\max }^{2}(\boldsymbol{h},\boldsymbol{x})$$ and $$\sigma _{\max }(\boldsymbol{h},\boldsymbol{x})$$ are both Lipschitz functions w.r.t. h. Now we determine their Lipschitz constants. First note that for ∥xi∥ = 1, $$\sigma _{\max }(\boldsymbol{h},\boldsymbol{x})$$ equals   \begin{align*} \sigma_{\max}(\boldsymbol{h}, \boldsymbol{x}) = \max_{1\leq l\leq L} \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*})\boldsymbol{h}\|, \end{align*} where ⊗ denotes the Kronecker product.
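The two-sided bound (6.11) and the unit Lipschitz constant established next are easy to check numerically. The sketch below uses a partial-DFT B and random unit vectors; the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
s, K, N, L = 3, 8, 6, 256   # illustrative sizes

# Rows b_l^* of B: first K columns of the unitary DFT, so ||b_l||^2 = K/L and B^*B = I_K.
B = np.exp(-2j * np.pi * np.outer(np.arange(L), np.arange(K)) / L) / np.sqrt(L)

def unit(v):
    return v / np.linalg.norm(v)

def sigma_max(h_blocks):
    # sigma_max(h, x) = max_l ||(I_s kron b_l^*) h|| for unit x_i (Kronecker form above)
    return np.sqrt(max(sum(abs(B[l] @ hi) ** 2 for hi in h_blocks) for l in range(L)))

# h, u in C^{Ks} with unit norm, split into s blocks of length K.
h = unit(rng.standard_normal(K * s) + 1j * rng.standard_normal(K * s))
u = unit(rng.standard_normal(K * s) + 1j * rng.standard_normal(K * s))
hb, ub = np.split(h, s), np.split(u, s)

s2 = sigma_max(hb) ** 2
assert 1 / L - 1e-12 <= s2 <= K / L + 1e-12                               # the bound (6.11)
assert abs(sigma_max(hb) - sigma_max(ub)) <= np.linalg.norm(h - u) + 1e-12  # Lipschitz in h
```

The lower bound in (6.11) is attained when the energy of $$(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*})\boldsymbol{h}$$ is spread evenly over l, and the upper bound when it concentrates on a single row.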
Let $$\boldsymbol{u}\in \mathbb{C}^{Ks}$$ be another unit vector and we have   \begin{align} |\sigma_{\max}(\boldsymbol{h}, \boldsymbol{x}) - \sigma_{\max}(\boldsymbol{u}, \boldsymbol{x})| & = \left| \max_{1\leq l\leq L} \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*})\boldsymbol{h}\| - \max_{1\leq l\leq L} \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*})\boldsymbol{u}\| \right| \nonumber \\ & \leq \max_{1\leq l\leq L} \left| \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*}) \boldsymbol{h}\| - \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*}) \boldsymbol{u}\| \right| \nonumber \\ & \leq \max_{1\leq l\leq L} \|(\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*}) (\boldsymbol{h} - \boldsymbol{u})\| \leq \|\boldsymbol{h}-\boldsymbol{u}\|, \end{align} (6.12) where $$\|\boldsymbol{I}_{s}\otimes \boldsymbol{b}_{l}^{*}\| = \|\boldsymbol{b}_{l}\| = \sqrt{\frac{K}{L}} < 1.$$ For $$\sigma ^{2}_{\max }(\boldsymbol{h},\boldsymbol{x}),$$  \begin{align} |\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}) - \sigma^{2}_{\max}(\boldsymbol{u}, \boldsymbol{x})| & \leq (\sigma_{\max}(\boldsymbol{h}, \boldsymbol{x}) + \sigma_{\max}(\boldsymbol{u}, \boldsymbol{x})) \cdot |\sigma_{\max}(\boldsymbol{h}, \boldsymbol{x}) - \sigma_{\max}(\boldsymbol{u}, \boldsymbol{x})| \nonumber \\ & \leq 2\sqrt{\frac{K}{L}}\|\boldsymbol{h}-\boldsymbol{u}\| \leq 2\|\boldsymbol{h}-\boldsymbol{u}\|. \end{align} (6.13) Proof of Lemma 6.6 Without loss of generality, let ∥xi∥ = 1 and $$\sum _{i=1}^{s} \|\boldsymbol{h}_{i}\|^{2} = 1$$. It suffices to prove $$f(\boldsymbol{h}, \boldsymbol{x}) \leq \frac{4}{3}$$ for all $$(\boldsymbol{h}, \boldsymbol{x})\in \mathbb{C}^{Ks}\times \mathbb{C}^{Ns}$$ in (2.5) where f(h, x) is defined as   \begin{align*} f(\boldsymbol{h}, \boldsymbol{x}) := \|\mathcal{A}(\boldsymbol{X})\|^{2} - 2 \sqrt{2s \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K+N)\log L} - 8s\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K+N) \log L.
\end{align*} Part I: bounds of $$\|\mathcal{A}(\boldsymbol{X})\|^{2}$$ for any fixed (h, x). From (47), we already know that $$Y = \|\mathcal{A}(\boldsymbol{X})\|^{2} = \sum _{i=1}^{2L} c_{i}{\xi _{i}^{2}}$$ where {ξi} are i.i.d. $${\chi ^{2}_{1}}$$ random variables and $$\boldsymbol{c} = (c_{1}, \cdots , c_{2L})^{T}\in \mathbb{R}^{2L}$$. More precisely, we can determine $$\{c_{i}\}_{i=1}^{2L}$$ as   \begin{align*} \left| \sum_{i=1}^{s} \boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*}\boldsymbol{a}_{il}\right|{}^{2} = c_{2l-1} \xi_{2l-1}^{2} + c_{2l}\xi_{2l}^{2},\quad c_{2l-1} = c_{2l} = \frac{1}{2}\sum_{i=1}^{s} |\boldsymbol{b}_{l}^{*}\boldsymbol{h}_{i}|^{2} \end{align*} because $$\sum _{i=1}^{s} \boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i} \boldsymbol{x}_{i}^{*}\boldsymbol{a}_{il} \sim \mathcal{C}\mathcal{N}\left (0, \sum _{i=1}^{s} |\boldsymbol{b}^{*}_{l} \boldsymbol{h}_{i}|^{2}\right )$$. By the Bernstein inequality, there holds   \begin{align} \mathbb{P}(Y - \mathbb{E}(Y) \geq t) \leq \exp\left(- \frac{t^{2}}{8\|\boldsymbol{c}\|^{2}}\right) \vee \exp\left(- \frac{t}{8\|\boldsymbol{c}\|_{\infty}}\right)\!, \end{align} (6.14) where $$\operatorname{\mathbb{E}}(Y) = \|\boldsymbol{X}\|_{F}^{2} = 1.$$ In order to apply the Bernstein inequality, we need to estimate ∥c∥2 and $$\|\boldsymbol{c}\|_{\infty }$$ as follows,   \begin{align*} \|\boldsymbol{c}\|_{\infty} & = \frac{1}{2}\max_{1\leq l\leq L}\sum_{i=1}^{s}|\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i}|^{2} = \frac{1}{2} \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}), \\ \|\boldsymbol{c}\|_{2}^{2} & = \frac{1}{2}\sum_{l=1}^{L} \left|\sum_{i=1}^{s}|\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i}|^{2} \right|{}^{2} \leq \frac{1}{2}\left( \sum_{i=1}^{s}\sum_{l=1}^{L}|\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i}|^{2} \right)\max_{1\leq l\leq L}\sum_{i=1}^{s}|\boldsymbol{b}^{*}_{l}\boldsymbol{h}_{i}|^{2} \leq \frac{1}{2} \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}).
\end{align*} Applying (6.14) gives   \begin{align*} \mathbb{P}( \|\mathcal{A}(\boldsymbol{X})\|^{2} \geq 1 + t)\leq \exp\left(- \frac{t^{2}}{4 \sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})}\right) \vee \exp\left(- \frac{t}{4\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})}\right). \end{align*} In particular, by setting   \begin{align*} t = g(\boldsymbol{h},\boldsymbol{x}):= 2 \sqrt{2 s\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K+N)\log L} + 8s\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x})(K + N)\log L, \end{align*} we have   \begin{align*} \mathbb{P}\left(\|\mathcal{A}(\boldsymbol{X})\|^{2} \geq 1 + g(\boldsymbol{h},\boldsymbol{x})\right) \leq \textrm{e}^{ - 2 s(K+N)(\log L)}. \end{align*} So far, we have shown that f(h, x) ≤ 1 with probability at least $$1 - \textrm{e}^{- 2 s(K+N)(\log L)}$$ for a fixed pair of (h, x). Part II: covering argument. Now we will use a covering argument to extend this result for all (h, x) and thus prove that $$f(\boldsymbol{h}, \boldsymbol{x})\leq \frac{4}{3}$$ uniformly for all (h, x). We start with defining $$\mathcal{K}$$ and $$\mathcal{N}_{i}$$ as ϵ0-nets of $$\mathcal{S}^{Ks-1}$$ and $$\mathcal{S}^{N-1}$$ for h and xi, 1 ≤ i ≤ s, respectively. The bounds $$|\mathcal{K}|\leq \big(1+\frac{2}{\epsilon _{0}}\big)^{2sK}$$ and $$|\mathcal{N}_{i}|\leq \big(1+\frac{2}{\epsilon _{0}}\big)^{2N}$$ follow from the covering numbers of the sphere (Lemma 5.2 in [38]). Here we let $$\mathcal{N} := \mathcal{N}_{1}\times \cdots \times \mathcal{N}_{s}.$$ By taking the union bound over $$\mathcal{K}\times \mathcal{N},$$ we have that f(h, x) ≤ 1 holds uniformly for all $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{K} \times \mathcal{N}$$ with probability at least   \begin{align*} 1- \left(1+ 2/\epsilon_{0}\right)^{2s(K + N)} \textrm{e}^{ - 2s(K+N)\log L } = 1- \textrm{e}^{-2s(K + N)\left(\log L - \log \left(1 + 2/\varepsilon_{0}\right)\right)}. 
\end{align*} For any $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{S}^{Ks-1}\times \underbrace{\mathcal{S}^{N-1}\times \cdots \times \mathcal{S}^{N-1}}_{s\ \textrm{times}}$$, we can find a point $$(\boldsymbol{u}, \boldsymbol{v}) \in \mathcal{K} \times \mathcal{N}$$ satisfying $$\|\boldsymbol{h} - \boldsymbol{u}\| \leq \varepsilon _{0}$$ and $$\|\boldsymbol{x}_{i} - \boldsymbol{v}_{i}\| \leq \varepsilon _{0}$$ for all 1 ≤ i ≤ s. Conditioned on (6.4), we know that   \begin{align*} \|\mathcal{A}\|^{2}\leq s(N\log(NL/2) + (\gamma + \log s)\log L) \leq s(N + \gamma + \log s)\log L. \end{align*} Now we aim to evaluate |f(h, x) − f(u, v)|. First we consider |f(u, x) − f(u, v)|. Since $$\sigma ^{2}_{\max }(\boldsymbol{u}, \boldsymbol{x}) = \sigma ^{2}_{\max }(\boldsymbol{u},\boldsymbol{v})$$ if ∥xi∥ = ∥vi∥ = ∥u∥ = 1 for 1 ≤ i ≤ s, we have   \begin{align*} |f(\boldsymbol{u}, \boldsymbol{x}) - f(\boldsymbol{u}, \boldsymbol{v})| & = \left|\left\| \mathcal{A}(\mathcal{H}(\boldsymbol{u}, \boldsymbol{x}))\right\|_{F}^{2} - \left\| \mathcal{A}(\mathcal{H}(\boldsymbol{u},\boldsymbol{v})) \right\|_{F}^{2} \right| \\ & \leq \left\| \mathcal{A}(\mathcal{H}(\boldsymbol{u}, \boldsymbol{x} - \boldsymbol{v}))\right\| \cdot \left\| \mathcal{A}(\mathcal{H}(\boldsymbol{u}, \boldsymbol{x} + \boldsymbol{v}))\right\| \\ & \leq \|\mathcal{A}\|^{2} \sqrt{\sum_{i=1}^{s} \|\boldsymbol{u}_{i}\|^{2}\|\boldsymbol{x}_{i} - \boldsymbol{v}_{i}\|^{2}} \sqrt{\sum_{i=1}^{s} \|\boldsymbol{u}_{i}\|^{2}\|\boldsymbol{x}_{i} + \boldsymbol{v}_{i}\|^{2}} \\ & \leq 2\|\mathcal{A}\|^{2} \varepsilon_{0} \leq 2s(N + \gamma + \log s)(\log L)\varepsilon_{0}, \end{align*} where the first inequality is due to $$\big| |z_{1}|^{2} - |z_{2}|^{2} \big| \leq |z_{1} - z_{2}|\,|z_{1} + z_{2}|$$ for any $$z_{1}, z_{2} \in \mathbb{C}$$.
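The scalar Bernstein bound (6.14) that drives Part I can also be sanity-checked by simulation. The sketch below takes the hypothetical flat-weight case $$c_{1} = \cdots = c_{2L} = \frac{1}{2L}$$ (the case in which $$\sigma ^{2}_{\max }$$ attains its lower bound $$\frac{1}{L}$$), with illustrative sizes, and compares the empirical tail of $$Y = \sum c_{i}\xi _{i}^{2}$$ with the stated bound:

```python
import numpy as np

rng = np.random.default_rng(2)
L, trials, t = 100, 20000, 0.5   # illustrative choices

# Flat weights: c_1 = ... = c_{2L} = 1/(2L), so E[Y] = 1.
c = np.full(2 * L, 1.0 / (2 * L))
xi2 = rng.chisquare(df=1, size=(trials, 2 * L))   # i.i.d. chi^2_1 variables
Y = xi2 @ c

emp_tail = np.mean(Y - 1.0 >= t)
# Bernstein bound from (6.14): exp(-t^2/(8||c||^2)) v exp(-t/(8||c||_inf))
bound = max(np.exp(-t**2 / (8 * np.dot(c, c))), np.exp(-t / (8 * c.max())))
print(f"empirical tail {emp_tail:.1e} <= bound {bound:.1e}")
```

The empirical tail sits well below the bound, as expected: Bernstein-type inequalities are not tight, which is harmless here since only the exponent's scaling in s(K + N)log L matters for the covering argument.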
We proceed to estimate |f(h, x) − f(u, x)| by using (50) and (49),   \begin{align*} | f(\boldsymbol{h}, \boldsymbol{x}) - f(\boldsymbol{u}, \boldsymbol{x})| & \leq J_{1} + J_{2} + J_{3} \\ & \leq (2\|\mathcal{A}\|^{2} + 2\sqrt{2s(K+N)\log L}+ 16s(K+N) \log L) \varepsilon_{0}\\ & \leq 25s(K +N + \gamma + \log s)(\log L) \varepsilon_{0}, \end{align*} where the operator norm of $$\mathcal{A}$$, (6.12) and (6.13) give the bounds on $$J_{1}$$, $$J_{2}$$ and $$J_{3}$$, respectively:   \begin{align*} J_{1} & = \left| \|\mathcal{A}(\mathcal{H}(\boldsymbol{h},\boldsymbol{x}))\|_{F}^{2} - \|\mathcal{A}(\mathcal{H}(\boldsymbol{u},\boldsymbol{x}))\|_{F}^{2}\right| \leq \left\| \mathcal{A}( \mathcal{H}(\boldsymbol{h} - \boldsymbol{u},\boldsymbol{x}) )\right\| \left\| \mathcal{A}( \mathcal{H}(\boldsymbol{h} + \boldsymbol{u},\boldsymbol{x}) )\right\| \leq 2\|\mathcal{A}\|^{2} \varepsilon_{0}, \\ J_{2} & = 2 \sqrt{2s(K+N)\log L}\cdot |\sigma_{\max}(\boldsymbol{h}, \boldsymbol{x}) - \sigma_{\max}(\boldsymbol{u}, \boldsymbol{x})| \leq 2 \sqrt{2s(K+N)\log L} \varepsilon_{0}, \\ J_{3} & = 8s(K+N) (\log L) \cdot |\sigma^{2}_{\max}(\boldsymbol{h}, \boldsymbol{x}) - \sigma^{2}_{\max}(\boldsymbol{u}, \boldsymbol{x})| \leq 16s(K+N)(\log L) \varepsilon_{0}.
\end{align*} Therefore, if $$\epsilon _{0} = \frac{1}{81s(N + K + \gamma + \log s)\log L}$$, there holds   \begin{align*} f(\boldsymbol{h},\boldsymbol{x}) \leq f(\boldsymbol{u},\boldsymbol{v}) + \underbrace{|f(\boldsymbol{u}, \boldsymbol{x}) - f(\boldsymbol{u}, \boldsymbol{v})| + |f(\boldsymbol{h},\boldsymbol{x}) -f(\boldsymbol{u},\boldsymbol{x}) |}_{\leq 27s(K+N+\gamma + \log s)(\log L)\varepsilon_{0}\leq \frac{1}{3}} \leq \frac{4}{3} \end{align*} for all (h, x) uniformly with probability at least $$1- \textrm{e}^{-2s(K + N)\left (\log L - \log \left (1 + 2/\varepsilon _{0}\right )\right )}.$$ By letting $$L \geq C_{\gamma }s(K+N)\log L$$ with Cγ reasonably large and γ ≥ 1, we have $$\log L - \log \left (1 + 2/\varepsilon _{0}\right ) \geq \frac{1}{2}(1 + \log \gamma )$$, and hence the bound holds with probability at least $$1 - \frac{1}{\gamma }\exp (-s(K+N))$$. 6.2. Proof of the local restricted isometry property Lemma 6.8 Conditioned on (6.5) and (6.9), the following RIP-type property holds:   \begin{align*} \frac{2}{3} \|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}^{2} \leq \|\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0})\|^{2} \leq \frac{3}{2}\|\boldsymbol{X}-\boldsymbol{X}_{0}\|_{F}^{2} \end{align*} uniformly for all $$(\boldsymbol{h},\boldsymbol{x})\in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ with μ ≥ μh and $$\epsilon \leq \frac{1}{15}$$ if $$L \geq C_{\gamma }\mu ^{2} s(K+N)\log ^{2} L$$ for some numerical constant Cγ. Proof. The proof proceeds in two steps: decompose X − X0 onto T and T⊥, then apply (6.5) and (6.9) to $$\mathcal{P}_{T}(\boldsymbol{X}-\boldsymbol{X}_{0})$$ and $$\mathcal{P}_{T^{\bot }}(\boldsymbol{X}-\boldsymbol{X}_{0}),$$ respectively.
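Before carrying out these two steps, the conclusion of Lemma 6.8 is easy to probe numerically. The sketch below, with a partial-DFT B, Gaussian measurement vectors and a small perturbation of a random ground truth (all sizes and the perturbation level are illustrative assumptions, well outside the proven regime), checks that the ratio $$\|\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0})\|^{2} / \|\boldsymbol{X}-\boldsymbol{X}_{0}\|_{F}^{2}$$ falls inside [2/3, 3/2]:

```python
import numpy as np

rng = np.random.default_rng(3)
s, K, N, L = 2, 8, 8, 4096   # illustrative sizes

# Rows b_l^* from the first K columns of the unitary DFT; complex Gaussian a_{il}.
B = np.exp(-2j * np.pi * np.outer(np.arange(L), np.arange(K)) / L) / np.sqrt(L)
A = (rng.standard_normal((s, L, N)) + 1j * rng.standard_normal((s, L, N))) / np.sqrt(2)

def cplx(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Ground truth (h_{i0}, x_{i0}) and a nearby perturbed point (h_i, x_i).
h0, x0 = cplx(s, K), cplx(s, N)
h = h0 + 0.05 * cplx(s, K)
x = x0 + 0.05 * cplx(s, N)

# Block-wise difference Z_i = h_i x_i^* - h_{i0} x_{i0}^* and its image under A.
Z = [np.outer(h[i], x[i].conj()) - np.outer(h0[i], x0[i].conj()) for i in range(s)]
y = sum(np.einsum('lk,kn,ln->l', B, Z[i], A[i]) for i in range(s))

ratio = np.linalg.norm(y) ** 2 / sum(np.linalg.norm(Zi, 'fro') ** 2 for Zi in Z)
print(f"||A(X - X0)||^2 / ||X - X0||_F^2 = {ratio:.3f}")
assert 2 / 3 < ratio < 3 / 2
```

For L much larger than s(K + N) the ratio concentrates tightly around 1, which is consistent with the near-isometry that the lemma asserts on the neighbourhood $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$.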
For any $$\boldsymbol{X} =\mathcal{H}(\boldsymbol{h},\boldsymbol{x})\in \mathcal{N}_{\epsilon }$$ with $$\delta _{i} \leq \varepsilon \leq \frac{1}{15}$$, we can decompose X − X0 as the sum of two block diagonal matrices U = blkdiag(Ui, 1 ≤ i ≤ s) and V = blkdiag(Vi, 1 ≤ i ≤ s) where each pair of (Ui, Vi) corresponds to the orthogonal decomposition of $$\boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*}$$,   \begin{align} \boldsymbol{h}_{i}\boldsymbol{x}^{*}_{i} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} := \underbrace{(\alpha_{i1} \overline{\alpha_{i2}} - 1)\boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} + \overline{\alpha_{i2}} \tilde{\boldsymbol{h}}_{i} \boldsymbol{x}_{i0}^{*} + \alpha_{i1} \boldsymbol{h}_{i0}\tilde{\boldsymbol{x}}_{i}^{*}}_{\boldsymbol{U}_{i}\in T_{i}} + \underbrace{ \tilde{\boldsymbol{h}}_{i} \tilde{\boldsymbol{x}}_{i}^{*}}_{\boldsymbol{V}_{i} \in T_{i}^{\perp}} \end{align} (6.15) which has been briefly discussed in (6.2) and (6.3). Note that $$\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) = \mathcal{A}(\boldsymbol{U} + \boldsymbol{V})$$ and   \begin{align*} \|\mathcal{A}(\boldsymbol{U})\| - \|\mathcal{A}(\boldsymbol{V})\| \leq \|\mathcal{A}(\boldsymbol{U} + \boldsymbol{V})\| \leq \|\mathcal{A}(\boldsymbol{U})\| + \|\mathcal{A}(\boldsymbol{V})\|. \end{align*} Therefore, it suffices to have a two-sided bound for $$\|\mathcal{A}(\boldsymbol{U})\|$$ and an upper bound for $$\|\mathcal{A}(\boldsymbol{V})\|$$ where U ∈ T and V ∈ T⊥ in order to establish the local isometry property. Estimation of $$\|\mathcal{A}(\boldsymbol{U})\|$$: For $$\|\mathcal{A}(\boldsymbol{U})\|$$, we know from Lemma 6.3 that   \begin{align} \sqrt{\frac{9}{10}}\|\boldsymbol{U}\|_{F}\leq \|\mathcal{A}(\boldsymbol{U})\| \leq \sqrt{\frac{11}{10}}\|\boldsymbol{U}\|_{F} \end{align} (6.16) and hence we only need to compute ∥U∥F.
By Lemma 6.1, there also hold $$\|\boldsymbol{V}_{i}\|_{F} \leq \frac{{\delta _{i}^{2}}}{2(1 - \delta _{i})} d_{i0}$$ and $$\delta _{i} d_{i0} - \|\boldsymbol{V}_{i}\|_{F} \leq \|\boldsymbol{U}_{i}\|_{F} \leq \delta _{i} d_{i0} + \|\boldsymbol{V}_{i}\|_{F}$$, i.e.,   \begin{align*} \left(\delta_{i} - \frac{{\delta_{i}^{2}}}{2(1 - \delta_{i})}\right)d_{i0} \leq \|\boldsymbol{U}_{i}\|_{F} \leq \left(\delta_{i} + \frac{{\delta_{i}^{2}}}{2(1 - \delta_{i})}\right)d_{i0}, \quad 1\leq i\leq s. \end{align*} With $$\|\boldsymbol{U}\|_{F}^{2} = \sum _{i=1}^{s} \|\boldsymbol{U}_{i}\|_{F}^{2}$$, it is easy to get $$\delta d_{0}\left (1 - \frac{\varepsilon }{2(1-\varepsilon )}\right ) \leq \|\boldsymbol{U}\|_{F} \leq \delta d_{0} \left (1 + \frac{\varepsilon }{2(1-\varepsilon )}\right )$$. Combined with (6.16), we get   \begin{align} \sqrt{\frac{9}{10}}\left(1 - \frac{\varepsilon}{2(1-\varepsilon)}\right)\delta d_{0} \leq \|\mathcal{A}(\boldsymbol{U}) \| \leq \sqrt{\frac{11}{10}}\left(1 + \frac{\varepsilon}{2(1-\varepsilon)}\right)\delta d_{0}. \end{align} (6.17) Estimation of $$\|\mathcal{A}(\boldsymbol{V})\|$$: note that V is a block-diagonal matrix with rank-one blocks. So applying Lemma 6.6 gives us   \begin{align} \|\mathcal{A}(\boldsymbol{V})\|^{2} &\leq \frac{4}{3} \|\boldsymbol{V}\|_{F}^{2}+ 2 \sqrt{2s\|\boldsymbol{V}\|_{F}^{2} \sigma^{2}_{\max}(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}})(K+N)\log L} + 8s\sigma^{2}_{\max}(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}})(K+N) \log L, \end{align} (6.18) where $$\boldsymbol{V} = \mathcal{H}(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}})$$ and $$ \tilde{\boldsymbol{h}} = \left[ \begin{array}{@{}c@{}} \tilde{\boldsymbol{h}}_{1} \\ \vdots \\ \tilde{\boldsymbol{h}}_{s} \end{array}\right]. $$ It suffices to get an estimation of ∥V∥F and $$\sigma ^{2}_{\max }(\tilde{\boldsymbol{h}},\tilde{\boldsymbol{x}})$$ to bound $$\|\mathcal{A}(\boldsymbol{V})\|$$ in (6.18).
Lemma 6.1 says that $$\|\tilde{\boldsymbol{h}}_{i}\| \|\tilde{\boldsymbol{x}}_{i}\| \leq \frac{{\delta _{i}^{2}}}{2(1 - \delta _{i})} d_{i0} \leq \frac{\varepsilon }{2(1-\varepsilon )} \delta _{i} d_{i0}$$ if ε < 1. Moreover,   \begin{align} \|\tilde{\boldsymbol{x}}_{i}\| \leq \frac{\delta_{i}}{1 - \delta_{i}}\|\boldsymbol{x}_{i}\| \leq \frac{2\delta_{i}}{1 - \delta_{i}} \sqrt{d_{i0}}, \quad \sqrt{L}\|\boldsymbol{B} \tilde{\boldsymbol{h}}_{i} \|_{\infty} \leq 6 \mu \sqrt{d_{i0}}, \quad 1\leq i\leq s \end{align} (6.19) if (h, x) belongs to $$\mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }.$$ For ∥V∥F,   \begin{align*} \|\boldsymbol{V}\|_{F} = \sqrt{\sum_{i=1}^{s} \|\boldsymbol{V}_{i}\|_{F}^{2}} = \sqrt{\sum_{i=1}^{s} \|\tilde{\boldsymbol{h}}_{i}\|^{2} \|\tilde{\boldsymbol{x}}_{i}\|^{2}} \leq \frac{\varepsilon\delta d_{0}}{2(1-\varepsilon)}. \end{align*} Now we aim to get an upper bound for $$\sigma ^{2}_{\max }(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}})$$ by using (6.19),   \begin{align*} \sigma_{\max}^{2}(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}}) = \max_{1\leq l\leq L}\sum_{i=1}^{s} |\boldsymbol{b}^{*}_{l}\tilde{\boldsymbol{h}}_{i}|^{2} \|\tilde{\boldsymbol{x}}_{i}\|^{2} \leq C_{0}\frac{\mu^{2} \sum_{i=1}^{s}{\delta_{i}^{2}} d_{i0}^{2}}{L} = C_{0}\frac{\mu^{2}\delta^{2}{d_{0}^{2}}}{L}. \end{align*} By substituting the estimations of ∥V∥F and $$\sigma ^{2}_{\max }(\tilde{\boldsymbol{h}}, \tilde{\boldsymbol{x}})$$ into (6.18)   \begin{align} \|\mathcal{A}(\boldsymbol{V})\|^{2} \leq \frac{\varepsilon^{2} \delta^{2}{d_{0}^{2}}}{3(1-\varepsilon)^{2}} + \frac{\sqrt{2}\varepsilon \delta^{2}{d_{0}^{2}}}{1-\varepsilon} \sqrt{\frac{C_{0}\mu^{2} s (K+N)\log L}{L}} + \frac{8C_{0} \mu^{2} \delta^{2}{d_{0}^{2}}s(K+N)\log L}{L}. 
\end{align} (6.20) By letting $$L \geq C_{\gamma }\mu ^{2} s(K + N)\log ^{2} L$$ with Cγ sufficiently large and combining (6.20) and (6.17), we have   \begin{align*} \sqrt{\frac{2}{3}}\delta d_{0} \leq \|\mathcal{A}(\boldsymbol{U})\| - \|\mathcal{A}(\boldsymbol{V})\| \leq \|\mathcal{A}(\boldsymbol{U}+\boldsymbol{V})\| \leq \|\mathcal{A}(\boldsymbol{U})\| + \|\mathcal{A}(\boldsymbol{V})\| \leq \sqrt{\frac{3}{2}}\delta d_{0}, \end{align*} which gives $$\frac{2}{3}\|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}^{2} \leq \|\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0})\|^{2} \leq \frac{3}{2}\|\boldsymbol{X} - \boldsymbol{X}_{0}\|_{F}^{2}.$$ 6.3. Proof of the local regularity condition We first introduce some notation: for all $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_d \cap \mathcal{N}_{\epsilon }$$, consider $$\alpha _{i1}, \alpha _{i2}, \tilde{\boldsymbol{h}}_{i}$$ and $$\tilde{\boldsymbol{x}}_{i}$$ defined in (6.2) and define   \begin{align*} \varDelta\boldsymbol{h}_{i} = \boldsymbol{h}_{i} - \alpha_{i} \boldsymbol{h}_{i0}, \quad \varDelta\boldsymbol{x}_{i} = \boldsymbol{x}_{i} - \overline{\alpha}_{i}^{-1}\boldsymbol{x}_{i0}, \end{align*} where   \begin{align*} \alpha_{i} (\boldsymbol{h}_{i}, \boldsymbol{x}_{i})= \begin{cases} (1 - \delta_{0})\alpha_{i1}, & \textrm{if}\ \|\boldsymbol{h}_{i}\|_{2} \geq \|\boldsymbol{x}_{i}\|_{2} \\ \frac{1}{(1 - \delta_{0})\overline{\alpha_{i2}}}, & \textrm{if}\ \|\boldsymbol{h}_{i}\|_{2} < \|\boldsymbol{x}_{i}\|_{2}\end{cases} \end{align*} with   \begin{align} \delta_{0} := \frac{\delta}{10}. \end{align} (6.21) The function αi(hi, xi) is defined for each block of $$\boldsymbol{X} = \mathcal{H}(\boldsymbol{h}, \boldsymbol{x}).$$ The particular form of αi(hi, xi) serves primarily to prove Lemma 6.11, i.e., the local regularity condition of G(h, x).
We also define   \begin{align*} \varDelta\boldsymbol{h} : =\left[ \begin{array}{@{}c@{}} \boldsymbol{h}_{1} - \alpha_{1} \boldsymbol{h}_{10} \\ \vdots \\ \boldsymbol{h}_{s} - \alpha_{s} \boldsymbol{h}_{s0} \end{array}\right]\in\mathbb{C}^{Ks}, \quad \varDelta\boldsymbol{x} : =\left[ \begin{array}{@{}c@{}} \boldsymbol{x}_{1} - \overline{\alpha}_{1}^{-1} \boldsymbol{x}_{10} \\ \vdots \\ \boldsymbol{x}_{s} - \overline{\alpha}_{s}^{-1} \boldsymbol{x}_{s0} \end{array}\right]\in\mathbb{C}^{Ns}. \end{align*} The following lemma gives bounds on Δhi and Δxi. Lemma 6.9 For all $$(\boldsymbol{h},\boldsymbol{x}) \in \mathcal{N}_d \cap \mathcal{N}_{\epsilon }$$ with $$\epsilon \leq \frac{1}{15}$$, there hold   \begin{align*} \max\{ \|\Delta\boldsymbol{h}_{i}\|_{2}^{2}, \|\Delta\boldsymbol{x}_{i}\|_{2}^{2}\} \leq (7.5{\delta_{i}^{2}} + 2.88{\delta_{0}^{2}}) d_{i0},\quad \|\Delta\boldsymbol{h}_{i}\|_{2}^{2} \|\Delta\boldsymbol{x}_{i}\|_{2}^{2} \leq \frac{1}{26}({\delta_{i}^{2}} +{\delta_{0}^{2}}) d_{i0}^{2}. \end{align*} Moreover, if we assume $$(\boldsymbol{h}_{i}, \boldsymbol{x}_{i}) \in \mathcal{N}_{\mu }$$ additionally, we have $$ \sqrt{L}\|\boldsymbol{B}(\Delta \boldsymbol{h}_{i})\|_{\infty } \leq 6\mu \sqrt{d_{i0}}$$. Proof. We only consider the case ∥hi∥2 ≥ ∥xi∥2, for which $$\alpha _{i} = (1 - \delta _{0})\alpha _{i1}$$; the other case is exactly the same due to the symmetry.
For both Δhi and Δxi, by definition,   \begin{align} \Delta\boldsymbol{h}_{i} & = \boldsymbol{h}_{i} - \alpha_{i}\boldsymbol{h}_{i0} = \delta_{0} \alpha_{i1} \boldsymbol{h}_{i0} + \tilde{\boldsymbol{h}}_{i},\\ \Delta\boldsymbol{x}_{i} & = \boldsymbol{x}_{i} - \frac{1}{(1 - \delta_{0})\overline{\alpha}_{i1}} \boldsymbol{x}_{i0} = \left(\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha}_{i1}}\right)\boldsymbol{x}_{i0} + \tilde{\boldsymbol{x}}_{i} , \end{align} (6.22), (6.23) where $$\boldsymbol{h}_{i} = \alpha _{i1}\boldsymbol{h}_{i0} + \tilde{\boldsymbol{h}}_{i}$$ and $$\boldsymbol{x}_{i} = \alpha _{i2}\boldsymbol{x}_{i0} + \tilde{\boldsymbol{x}}_{i}$$ come from the orthogonal decomposition in (6.2). We start by estimating ∥Δhi∥2. Note that $$\|\boldsymbol{h}_{i}\|_{2}^{2} \leq 4d_{i0}$$ and $$\|\alpha _{i1} \boldsymbol{h}_{i0}\|_{2}^{2}\leq \|\boldsymbol{h}_{i}\|_{2}^{2}$$ since $$(\boldsymbol{h}, \boldsymbol{x})\in \mathcal{N}_d\cap \mathcal{N}_{\mu }$$. By Lemma 6.1, we have   \begin{align} \|\Delta\boldsymbol{h}_{i}\|_{2}^{2} = \|\tilde{\boldsymbol{h}}_{i}\|_{2}^{2} +{\delta_{0}^{2}}\|\alpha_{i1} \boldsymbol{h}_{i0}\|_{2}^{2} \leq \left(\left(\frac{\delta_{i}}{1-\delta_{i}}\right)^{2} +{\delta_{0}^{2}}\right)\|\boldsymbol{h}_{i}\|_{2}^{2} \leq ( 4.6{\delta_{i}^{2}} + 4{\delta_{0}^{2}}) d_{i0}. \end{align} (6.24) Then we calculate ∥Δxi∥: from (6.23), we have   \begin{align*} \|\Delta\boldsymbol{x}_{i}\|^{2} = \left|\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha}_{i1}} \right|{}^{2}d_{i0} + \|\tilde{\boldsymbol{x}}_{i}\|^{2} \leq \left|\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha}_{i1}} \right|{}^{2}d_{i0} + \frac{4{\delta_{i}^{2}} d_{i0}}{(1 - \delta_{i})^{2}}, \end{align*} where Lemma 6.1 gives $$\|\tilde{\boldsymbol{x}}_{i}\|_{2} \leq \frac{\delta _{i}}{1-\delta _{i}}\|\boldsymbol{x}_{i}\|_{2} \leq \frac{2\delta _{i}}{1-\delta _{i}} \sqrt{d_{i0}}$$ for $$(\boldsymbol{h},\boldsymbol{x})\in \mathcal{N}_d\cap \mathcal{N}_{\epsilon }$$.
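The absolute constants entering (6.24) can be confirmed by a quick grid check over the admissible range $$\delta _{i} \leq \varepsilon \leq \frac{1}{15}$$, with $$\delta _{0} = \delta /10$$ as in (6.21). This is a minimal arithmetic sketch; the grid resolution is an arbitrary choice:

```python
import numpy as np

# Grid over the admissible range delta_i <= epsilon <= 1/15, with delta_0 = delta/10.
d = np.linspace(1e-6, 1 / 15, 2000)
d0 = d / 10

# (6.24) uses ||h_i||^2 <= 4 d_{i0}:  4 ((d/(1-d))^2 + d0^2) <= 4.6 d^2 + 4 d0^2.
lhs = 4 * ((d / (1 - d)) ** 2 + d0 ** 2)
rhs = 4.6 * d ** 2 + 4 * d0 ** 2
assert np.all(lhs <= rhs)
```

The bound is nearly tight at the endpoint: for δ = 1/15 one gets 4/(1 − 1/15)² ≈ 4.59 ≤ 4.6, which is why the constant 4.6 (rather than, say, 4.5) appears in (6.24).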
So it suffices to estimate $$\left | \alpha _{i2} - \frac{1}{(1 - \delta _{0})\overline{\alpha }_{i1}} \right |$$, which satisfies   \begin{align} \left|\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha_{i1}}}\right| = \frac{1}{|\alpha_{i1}|} \left| \overline{\alpha_{i1}} \alpha_{i2}- 1 - \frac{\delta_{0}}{1 - \delta_{0}} \right| \leq \frac{1}{|\alpha_{i1}|} \left( \left|(\overline{\alpha_{i1}} \alpha_{i2}- 1)\right| + \frac{\delta_{0}}{1 - \delta_{0}} \right). \end{align} (6.25) Lemma 6.1 implies that $$| \overline{\alpha _{i1}} \alpha _{i2}- 1| \leq \delta _{i}$$, and (6.2) gives   \begin{align} |\alpha_{i1}|^{2} = \frac{1}{d_{i0}}(\|\boldsymbol{h}_{i}\|^{2} - \| \tilde{\boldsymbol{h}}_{i} \|^{2}) \geq \frac{1}{d_{i0}}\left(1 - \frac{{\delta_{i}^{2}}}{(1-\delta_{i})^{2}} \right)\|\boldsymbol{h}_{i}\|^{2} \geq \left(1 - \frac{{\delta_{i}^{2}}}{(1-\delta_{i})^{2}} \right)(1-\varepsilon), \end{align} (6.26) where $$\|\tilde{\boldsymbol{h}}_{i}\| \leq \frac{\delta _{i}}{1-\delta _{i}}\|\boldsymbol{h}_{i}\|$$ and ∥hi∥2 ≥ ∥hi∥∥xi∥ ≥ (1 − ε)di0 if ∥hi∥ ≥ ∥xi∥. Substituting (6.26) into (6.25) gives   \begin{align*} \left|\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha_{i1}}}\right| \leq \frac{1}{\sqrt{1-\varepsilon}} \left(1 - \frac{{\delta_{i}^{2}}}{(1-\delta_{i})^{2}} \right)^{-1/2}\left(\delta_{i} + \frac{\delta_{0}}{1-\delta_{0}}\right) \leq 1.2(\delta_{i} + \delta_{0}). \end{align*} Then we have   \begin{align} \|\Delta\boldsymbol{x}_{i}\|_{2}^{2} \leq \left(1.44(\delta_{i}+\delta_{0})^{2}+ \frac{4{\delta^{2}_{i}}}{(1 - \delta_{i})^{2}}\right) d_{i0} \leq (7.5{\delta_{i}^{2}} + 2.88{\delta_{0}^{2}})d_{i0}. \end{align} (6.27) Finally, we try to bound ∥Δhi∥2∥Δxi∥2. Lemma 6.1 gives $$\|\tilde{\boldsymbol{h}}_{i}\|_{2} \|\tilde{\boldsymbol{x}}_{i}\|_{2} \leq \frac{{\delta _{i}^{2}}d_{i0}}{2(1 - \delta _{i})}$$ and |αi1| ≤ 2. 
Combining them along with (6.22), (6.23), (6.24) and (6.27), we have   \begin{align*} \|\Delta\boldsymbol{h}_{i}\|_{2}^{2} \|\Delta\boldsymbol{x}_{i}\|_{2}^{2} &\leq \|\tilde{\boldsymbol{h}}_{i}\|_{2}^{2}\|\tilde{\boldsymbol{x}}_{i}\|_{2}^{2} +{\delta_{0}^{2}} |\alpha_{i1}|^{2} \|\boldsymbol{h}_{i0}\|_{2}^{2} \|\Delta\boldsymbol{x}_{i}\|_{2}^{2} + \left|\alpha_{i2} - \frac{1}{(1 - \delta_{0})\overline{\alpha}_{i1}}\right|{}^{2} \|\boldsymbol{x}_{i0}\|_{2}^{2} \|\Delta\boldsymbol{h}_{i}\|_{2}^{2} \\ & \leq \left(\frac{{\delta_{i}^{4}}}{4(1 - \delta_{i})^{2}} + 4{\delta_{0}^{2}} (7.5{\delta_{i}^{2}} + 2.88{\delta_{0}^{2}}) + 1.44(\delta_{i} + \delta_{0})^{2} (4.6{\delta_{i}^{2}} + 4{\delta_{0}^{2}} )\right) d_{i0}^{2} \\ & \leq \frac{({\delta_{i}^{2}} +{\delta_{0}^{2}})d_{i0}^{2}}{26}. \end{align*} By symmetry, similar results hold for the case ∥hi∥2 < ∥xi∥2 and $$\max \{\|\Delta \boldsymbol{h}_{i}\|_{2}^{2}, \|\Delta \boldsymbol{x}_{i}\|_{2}^{2}\} \leq (7.5{\delta _{i}^{2}} + 2.88{\delta _{0}^{2}})d_{i0}.$$ Next, under the additional assumption $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_{\mu }$$, we prove $$\sqrt{L}\|\boldsymbol{B}(\Delta \boldsymbol{h}_{i})\|_{\infty } \leq 6\mu \sqrt{d_{i0}}$$: Case 1: ∥hi∥2 ≥ ∥xi∥2 and αi = (1 − δ0)αi1. Lemma 6.1 gives |αi1| ≤ 2, which implies  \begin{align*} \sqrt{L}\|\boldsymbol{B}(\Delta\boldsymbol{h}_{i})\|_{\infty} &\leq \sqrt{L}\|\boldsymbol{B}\boldsymbol{h}_{i} \|_{\infty} + (1 - \delta_{0}) |\alpha_{i1}|\sqrt{L}\|\boldsymbol{B}\boldsymbol{h}_{i0}\|_{\infty} \\ &\leq 4\mu\sqrt{d_{i0}} + 2(1 - \delta_{0})\mu_{h} \sqrt{d_{i0}} \leq 6\mu\sqrt{d_{i0}}. \end{align*} Case 2: ∥hi∥2 < ∥xi∥2 and $$\alpha _{i} = \frac{1}{(1-\delta _{0})\overline{\alpha _{i2}}}$$. Using the same argument as in (6.26) gives  \begin{align*} |\alpha_{i2}|^{2}\geq \left(1 - \frac{{\delta_{i}^{2}}}{(1-\delta_{i})^{2}} \right)(1-\varepsilon).
\end{align*} Therefore,   \begin{align*} \sqrt{L}\|\boldsymbol{B}(\Delta\boldsymbol{h}_{i})\|_{\infty} &\leq \sqrt{L}\|\boldsymbol{B}\boldsymbol{h}_{i}\|_{\infty} + \frac{1}{(1 - \delta_{0}) |\overline{\alpha_{i2}}|} \sqrt{L}\|\boldsymbol{B}\boldsymbol{h}_{i0}\|_{\infty} \\ &\leq 4\mu\sqrt{d_{i0}} + \left(1 - \frac{{\delta_{i}^{2}}}{(1-\delta_{i})^{2}} \right)^{-1/2} \frac{\mu_{h} \sqrt{d_{i0}}}{(1-\delta_{0})\sqrt{1-\varepsilon}} \leq 6 \mu\sqrt{d_{i0}}. \end{align*} Lemma 6.10 (Local Regularity for F(h, x)) Conditioned on (31) and (46), the following inequality holds   \begin{align*} \operatorname{Re}\left(\langle \nabla F_{\boldsymbol{h}}, \Delta\boldsymbol{h} \rangle + \langle \nabla F_{\boldsymbol{x}}, \Delta\boldsymbol{x}\rangle\right) \geq \frac{\delta^{2}{d_{0}^{2}}}{8} - 2\sqrt{s}\delta d_{0} \|\mathcal{A}^{*}(\boldsymbol{e})\|, \end{align*} uniformly for any $$(\boldsymbol{h}, \boldsymbol{x}) \in \mathcal{N}_d \cap \mathcal{N}_{\mu } \cap \mathcal{N}_{\epsilon }$$ with $$\epsilon \leq \frac{1}{15}$$ if $$L \geq C\mu ^{2} s(K+N)\log ^{2} L$$ for some numerical constant C. Proof. First, write   \begin{align*} I_{0} = \langle \nabla F_{\boldsymbol{h}}, \Delta\boldsymbol{h} \rangle + \overline{\langle \nabla F_{\boldsymbol{x}}, \Delta\boldsymbol{x}\rangle } = \sum_{i=1}^{s} \langle \nabla F_{\boldsymbol{h}_{i}}, \Delta\boldsymbol{h}_{i} \rangle + \overline{\langle \nabla F_{\boldsymbol{x}_{i}}, \Delta\boldsymbol{x}_{i} \rangle}.
\end{align*} For each component, recalling (19) and (20), we have   \begin{align*} \langle \nabla F_{\boldsymbol{h}_{i}}, \Delta\boldsymbol{h}_{i} \rangle + \overline{\langle \nabla F_{\boldsymbol{x}_{i}}, \Delta\boldsymbol{x}_{i} \rangle} & = \langle \mathcal{A}_{i}^{*}(\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) - \boldsymbol{e})\boldsymbol{x}_{i}, \Delta\boldsymbol{h}_{i} \rangle + \overline{\langle (\mathcal{A}_{i}^{*}(\mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) - \boldsymbol{e}))^{*}\boldsymbol{h}_{i}, \Delta\boldsymbol{x}_{i} \rangle} \\ & = \left\langle \mathcal{A}(\boldsymbol{X} - \boldsymbol{X}_{0}) - \boldsymbol{e}, \mathcal{A}_{i}((\Delta\boldsymbol{h}_{i})\boldsymbol{x}_{i}^{*} + \boldsymbol{h}_{i} (\Delta\boldsymbol{x}_{i})^{*}) \right\rangle. \end{align*} Define Ui and Vi as   \begin{align} \boldsymbol{U}_{i} := \alpha_{i}\boldsymbol{h}_{i0}(\Delta\boldsymbol{x}_{i})^{*} + \overline{\alpha_{i}}^{-1}(\Delta\boldsymbol{h}_{i})\boldsymbol{x}_{i0}^{*} \in T_{i}, \quad \boldsymbol{V}_{i} := \Delta\boldsymbol{h}_{i}(\Delta\boldsymbol{x}_{i})^{*}. \end{align} (6.28) Here Vi does not necessarily belong to $$T^{\bot }_{i}.$$ From the construction of Δhi, Δxi, Ui and Vi, two simple relations hold:   \begin{align*} \boldsymbol{h}_{i}\boldsymbol{x}_{i}^{*} - \boldsymbol{h}_{i0}\boldsymbol{x}_{i0}^{*} = \boldsymbol{U}_{i} + \boldsymbol{V}_{i}, \qquad (\Delta\boldsymbol{h}_{i})\boldsymbol{x}_{i}^{*} + \boldsymbol{h}_{i} (\Delta\boldsymbol{x}_{i})^{*} = \boldsymbol{U}_{i} + 2\boldsymbol{V}_{i}. \end{align*} Define U := blkdiag(U1, ⋯ , Us) and V := blkdiag(V1, ⋯ , Vs).
I0 can be simplified to   \begin{align*} I_{0} & = \sum_{i=1}^{s} \langle \nabla F_{\boldsymbol{h}_{i}}, \Delta\boldsymbol{h}_{i} \rangle + \overline{\langle \nabla F_{\boldsymbol{x}_{i}}, \Delta\boldsymbol{x}_{i} \rangle} = \sum_{i=1}^{s} \langle \mathcal{A}(\boldsymbol{U}+\boldsymbol{V})- \boldsymbol{e}, \mathcal{A}_{i}(\boldsymbol{U}_{i} + 2\boldsymbol{V}_{i})\rangle \\ & = \underbrace{\langle \mathcal{A}(\boldsymbol{U}+\boldsymbol{V}), \mathcal{A}(\boldsymbol{U} + 2\boldsymbol{V})\rangle}_{I_{01}} - \underbrace{\langle \boldsymbol{e}, \mathcal{A}(\boldsymbol{U} + 2\boldsymbol{V})\rangle}_{I_{02}}. \end{align*} Now we will give a lower bound for Re(I01) and an upper bound for Re(I02) so that the lower bound of Re(I0) is obtained. By the Cauchy–Schwarz inequality, Re(I01) has the lower bound   \begin{align} \operatorname{Re}(I_{01}) \geq (\|\mathcal{A}(\boldsymbol{U})\| - \|\mathcal{A}(\boldsymbol{V})\|) (\|\mathcal{A}(\boldsymbol{U})\| - 2\|\mathcal{A}(\boldsymbol{V})\|). \end{align} (6.29) In the following, we will give an upper bound for $$\|\mathcal{A}(\boldsymbol{V})\|$$ and a lower bound for $$\|\mathcal{A}(\boldsymbol{U})\|$$. Upper bound for $$\|\mathcal{A}(\boldsymbol{V})\|$$: note that V is a block-diagonal matrix with rank-one blocks, and applying Lemma 6.6 results in   \begin{align*} \|\mathcal{A}(\boldsymbol{V})\|^{2} \leq \frac{4}{3} \|\boldsymbol{V}\|_{F}^{2} + 2\sigma_{\max}(\Delta\boldsymbol{h},\Delta\boldsymbol{x}) \|\boldsymbol{V}\|_{F}\sqrt{2s(K+N)\log L} + 8s\sigma_{\max}^{2}(\Delta\boldsymbol{h}, \Delta\boldsymbol{x})(K+N) \log L. \end{align*} By using Lemma 6.9, we have $$\|\Delta \boldsymbol{h}_{i}\|^{2} \leq (7.5{\delta _{i}^{2}} + 2.88{\delta _{0}^{2}})d_{i0}$$ and $$\sqrt{L}\|\boldsymbol{B}(\Delta \boldsymbol{h}_{i})\|_{\infty } \leq 6\mu \sqrt{d_{i0}}$$.
Substituting them into $$\sigma ^{2}_{\max }(\Delta \boldsymbol{h},\Delta \boldsymbol{x})$$ gives   \begin{align*} \sigma_{\max}^{2}(\Delta\boldsymbol{h}, \Delta\boldsymbol{x}) = \max_{1\leq l\leq L}\left(\sum_{i=1}^{s} |\boldsymbol{b}_{l}^{*}\Delta\boldsymbol{h}_{i}|^{2} \|\Delta\boldsymbol{x}_{i}\|^{2}\right) \leq \frac{36\mu^{2}}{L}