New approach to Bayesian high-dimensional linear regression

Abstract
Consider the problem of estimating parameters $$X^n \in \mathbb{R}^n$$ from $$m$$ response variables $$Y^m = AX^n+Z^m$$, under the assumption that the distribution of $$X^n$$ is known. The lack of computationally feasible algorithms that employ generic prior distributions and provide a good estimate of $$X^n$$ has limited the set of distributions researchers use to model the data. To address this challenge, in this article, a new estimation scheme named quantized maximum a posteriori (Q-MAP) is proposed. The new method has the following properties: (i) In the noiseless setting, it has similarities to maximum a posteriori (MAP) estimation. (ii) In the noiseless setting, when $$X_1,\ldots,X_n$$ are independent and identically distributed, asymptotically, as $$n$$ grows to infinity, its required sampling rate ($$m/n$$) for an almost zero-distortion recovery approaches the fundamental limits. (iii) It scales favorably with the dimensions of the problem and therefore is applicable to high-dimensional setups. (iv) The solution of the Q-MAP optimization can be found via a proposed iterative algorithm that is provably robust to error (noise) in the response variables.

1. Introduction
Consider the problem of Bayesian linear regression defined as follows: $${{\mathbb{\mathbf{X}}}}=\{X_i\}_{i=1}^{\infty}$$ denotes a stationary random process whose distribution is known. The goal is to estimate $$X^n$$ from $$m$$ response variables of the form $$Y^m=AX^n + Z^m$$, where $$A\in{\rm I}\kern-0.20em{\rm R}^{m\times n}$$ and $$Z^m\in{\rm I}\kern-0.20em{\rm R}^m$$ denote the design matrix and the noise vector, respectively. (Both cases of $$m<n$$ and $$m\geq n$$ are of interest and valid under our model.) To solve this problem, there are two fundamental questions that can be raised.

How should we use the distribution of $$X^n$$ to obtain an efficient estimator? To answer this question, there are two main criteria that should be taken into account: (i) computational complexity: how efficiently can the estimate be computed? (ii) accuracy: how well does the estimator perform? If we ignore the computational complexity constraint, then the answer to our first question is simple. An optimal Bayes estimator seeks to minimize the Bayes risk defined as $${\rm {E}}[\ell(\hat{X}^n, X^n)]$$, where $$\ell: {\rm I}\kern-0.20em{\rm R}^n\times{\rm I}\kern-0.20em{\rm R}^n\to{\rm I}\kern-0.20em{\rm R}^+$$ denotes the considered loss function. For instance, $$\ell(x^n,{\hat{X}}^n)=\left\|x^n-{\hat{X}}^n\right\|_2^2$$ leads to the minimum mean square error (MMSE) estimator. However, the computational complexity of the MMSE estimator for generic distributions is very high.

Can the performance of the estimator be analyzed in high-dimensional settings? The answer to this question is also complicated. Even the performance analysis of standard estimators such as the MMSE estimator is challenging. In fact, even if we assume that $$\mathbf{X}$$ is an independent and identically distributed (i.i.d.) process, the analysis is still complicated, and heuristic tools such as the replica method from statistical physics have been employed to achieve this goal.

In response to the above two questions, we propose an optimization problem, which we refer to as quantized maximum a posteriori (Q-MAP). We then show how this optimization problem can be analyzed and solved. Before presenting the Q-MAP optimization, we introduce some notation.
1.1 Notation
Calligraphic letters such as $$\mathcal{X}$$ and $$\mathcal{Y}$$ denote sets. The size of a set $$\mathcal{X}$$ is denoted by $$|\mathcal{X}|$$. Given vector $$(v_1,v_2,\ldots,v_n)\in{\rm I}\kern-0.20em{\rm R}^n$$ and integers $$i,j\in\{1,\ldots,n\}$$, where $$i\leq j$$, $$v_i^j\triangleq (v_i,v_{i+1},\ldots,v_j).$$ For simplicity, $$v_1^j$$ and $$v_j^j$$ are denoted by $$v^j$$ and $$v_j$$, respectively. For two vectors $$u^n$$ and $$v^n$$, both in $${\rm I}\kern-0.20em{\rm R}^n$$, let $$\langle u^n,v^n\rangle$$ denote their inner product defined as $$\langle u^n,v^n\rangle\triangleq \sum_{i=1}^nu_iv_i$$. The all-zero and all-one vectors of length $$n$$ are denoted by $$0^n$$ and $$1^n$$, respectively. Uppercase letters such as $$X$$ and $$Y$$ denote random variables. The alphabet of a random variable $$X$$ is denoted by $$\mathcal{X}$$. The entropy of a finite-alphabet random variable $$U$$ with probability mass function (pmf) $$p(u)$$, $$u\in\mathcal{U}$$, is defined as $$H(U)=-\sum_{u\in\mathcal{U}}p(u)\log{ p(u)}$$ [6]. Given finite-alphabet random variables $$U$$ and $$V$$ with joint pmf $$p(u,v)$$, $$(u,v)\in\mathcal{U}\times \mathcal{V}$$, the conditional entropy of $$U$$ given $$V$$ is defined as $$H(U|V)=-\sum_{(u,v)\in\mathcal{U}\times \mathcal{V}} p(u,v)\log p(u|v)$$. Matrices are also denoted by uppercase letters such as $$A$$ and $$B$$ and are differentiated from random variables by context. Throughout the article, $$\log$$ and $$\ln$$ refer to the logarithm in base 2 and the natural logarithm, respectively. For $$x\in{\rm I}\kern-0.20em{\rm R}$$, $$\lfloor x\rfloor$$ denotes the largest integer smaller than or equal to $$x$$. Therefore, $$0\leq x-\lfloor x\rfloor<1$$, for all $$x$$. The $$b$$-bit quantized version of $$x$$ is denoted by $$[x]_b$$ and is defined as \begin{align} [x]_b\triangleq\lfloor x\rfloor+\sum_{i=1}^b2^{-i}a_i, \end{align} (1.1) where for all $$i$$, $$a_i\in\{0,1\}$$, and $$0.a_1a_2\ldots$$ denotes the binary representation of $$x-\lfloor x\rfloor$$. When $$x-\lfloor x\rfloor$$ is a dyadic real number, which has two possible binary representations, let $$0.a_1a_2\ldots$$ denote the representation that has a finite number of ones. For a vector $$x^n\in{\rm I}\kern-0.20em{\rm R}^n$$, $$[x^n]_b\triangleq([x_1]_b,\ldots,[x_n]_b).$$ Consider a vector $$u^n\in\mathcal{U}^n$$, where $$|\mathcal{U}|<\infty$$. The $$(k+1){\rm th}$$ order empirical distribution of $$u^n$$ is denoted by $$\hat{p}^{(k+1)}$$ and is defined as follows: for any $$a^{k+1}\in \mathcal{U}^{k+1}$$, \begin{align} \hat{p}^{(k+1)}(a^{k+1}| u^n)&\triangleq {|\{i: u_{i-k}^i=a^{k+1}, k+1\leq i\leq n\}|\over n-k}\nonumber\\ &={1\over n-k}\sum_{i=k+1}^n1_{u_{i-k}^{i}=a^{k+1}},\label{eq:emp-dist} \end{align} (1.2) where $$1_{\mathcal{E}}$$ denotes the indicator function of event $$\mathcal{E}$$. In other words, $$\hat{p}^{(k+1)}(\cdot| u^n)$$ denotes the distribution of a randomly selected length-$$(k+1)$$ substring of $$u^n$$.

1.2 Contributions
Consider the stochastic process $${{\mathbb{\mathbf{X}}}}=\{X_i\}_{i=1}^{\infty}$$ discussed earlier. Let $$X_i\in\mathcal{X}$$, for all $$i$$. Assume that $$\mathcal{X}$$ is a bounded subset of $${\rm I}\kern-0.20em{\rm R}$$. Define the $$b$$-bit quantized version of $$\mathcal{X}$$ as \begin{align} \mathcal{X}_b\triangleq \{[x]_b: \;x\in\mathcal{X}\}. \end{align} (1.3) Note that since $$\mathcal{X}$$ is a bounded set, $$\mathcal{X}_b$$ is a finite set, i.e. $$|\mathcal{X}_b|<\infty$$.
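To make this notation concrete, the following minimal Python sketch (ours, purely for illustration; neither the function names nor the example values come from the article) implements the $$b$$-bit quantization of (1.1) and the $$(k+1){\rm th}$$ order empirical distribution of (1.2).

```python
import numpy as np
from collections import Counter

def quantize(x, b):
    """b-bit quantization [x]_b of (1.1): floor(x) plus the first b bits of the
    binary expansion of x - floor(x), i.e. x rounded down to a multiple of 2**-b."""
    return np.floor(np.asarray(x, dtype=float) * 2**b) / 2**b

def empirical_dist(u, k):
    """(k+1)-th order empirical distribution of (1.2): the fraction of the
    n - k length-(k+1) windows of u that equal each pattern a^{k+1}."""
    u = [float(x) for x in u]
    n = len(u)
    counts = Counter(tuple(u[i:i + k + 1]) for i in range(n - k))
    return {pattern: c / (n - k) for pattern, c in counts.items()}

# example: quantize a sequence to b = 2 bits and look at its first-order type (k = 0)
u = quantize([0.1, 0.9, 0.26, 0.0, 0.74], b=2)
print(u)                       # [0.   0.75 0.25 0.   0.5 ]
print(empirical_dist(u, k=0))  # {(0.0,): 0.4, (0.75,): 0.2, (0.25,): 0.2, (0.5,): 0.2}
```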
Given $$k\in{\rm I}\kern-0.20em{\rm N}^+$$ and a set of non-negative weights $${\bf w}=(w_{a^{k+1}}:\;a^{k+1}\in \mathcal{X}_b^{k+1})$$, define the function $$c_{{\bf w}}: \mathcal{X}_b^n\to {\rm I}\kern-0.20em{\rm R}$$ as follows. For $$u^n\in\mathcal{X}_b^n$$, \begin{align}\label{eq:def-cw} c_{{\bf w}}(u^n)\triangleq \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|u^n). \end{align} (1.4) As will be described later, for a proper choice of weights $${\bf w}$$, this function measures both the level of ‘structuredness’ of sequences in $$\mathcal{X}_b^n$$ and how closely they match the $$(k+1){\rm th}$$ order distribution of the process $${{\mathbb{\mathbf{X}}}}$$. For instance, as will be shown later in Section 3.1, for an i.i.d. process $${{\mathbb{\mathbf{X}}}}$$ with $$X_i\sim (1- p)\delta_0+ pf_c$$, where $$p\in(0,1)$$, $$\delta_0$$ denotes a point mass at zero, and $$f_c$$ denotes an absolutely continuous distribution over a bounded set, a bound of the form $$c_{{\bf w}}(u^n)\leq \gamma$$, with $$k=0$$, simplifies to a bound on the $$\ell_0$$-norm of the sequence $$u^n$$. Hence, intuitively speaking, the constraint $$ c_{{\bf w}}(u^n) \leq \gamma$$ holds for ‘structured’ sequences that comply with the known source model. For a stationary process $${{\mathbb{\mathbf{X}}}}$$, the Q-MAP optimization estimates $$X^n$$ from noisy linear measurements $$Y^m=AX^n+Z^m$$ by solving the following optimization: \begin{align}\label{eq:Q-MAP} {\hat{X}}^n&\;\;= \;\; \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n} \;\;\;\left\|Au^n-Y^m\right\|^2 \nonumber \\ & \;\;\;\;\;\; \;\;\;{\rm subject \ to}\;\;\;\ c_{{\bf w}}(u^n) \leq \gamma_n, \end{align} (1.5) where $$\gamma_n$$ is a number that may depend on $$n$$ and the distribution of $${{\mathbb{\mathbf{X}}}}$$; $$b$$ and $$k$$ are parameters that need to be set properly; and the non-negative weights $${\bf w}=(w_{a^{k+1}}:\;a^{k+1}\in \mathcal{X}_b^{k+1})$$ are defined as a function of the $$(k+1)$$th order marginal distribution of the stationary process $${{\mathbb{\mathbf{X}}}}$$ as \begin{align} w_{a^{k+1}}\triangleq \log {1\over {\rm {P}}([X_{k+1}]_b=a_{k+1}|[X^k]_b=a^k)}.\label{eq:coeffs-Q-MAP} \end{align} (1.6) For this specific choice of $${\bf w}$$, the function $$c_{{\bf w}}$$ can also be written as \begin{align}\label{eq:simplify-cw} c_{{\bf w}}(u^n)&= \sum_{a^k}\hat{q}_{k}(a^k) D_{\rm KL}(\hat{q}_{k+1}(\cdot|a^k)\| q_{k+1}(\cdot|a^k)) + \hat{H}_{k}(u^n), \end{align} (1.7) where $$\hat{q}_{k}$$, $$\hat{q}_{k+1}$$ and $${q}_{k+1}$$ denote the $$k{\rm th}$$ order empirical distribution induced by $$u^n$$, the $$(k+1){\rm th}$$ order empirical distribution induced by $$u^n$$ and the distribution of $$[X^{k+1}]_b$$, respectively, and $$\hat{H}_{k}(u^n)$$ denotes the $$k$$th order conditional empirical entropy induced by $$u^n$$. (The derivation of (1.7) can be found in (7.92), as part of the proof of Theorem 5.1.) This alternative expression reveals how $$c_{{\bf w}}(u^n)$$ captures both the structuredness of $$u^n$$, through $$ \hat{H}_{k}(u^n)$$, and the similarity between the empirical distribution (type) of $$u^n$$ and the distribution of the quantized source process $${{\mathbb{\mathbf{X}}}}$$, through $$\sum_{a^k}\hat{q}_{k}(a^k) D_{\rm KL}(\hat{q}_{k+1}(\cdot|a^k)\| q_{k+1}(\cdot|a^k))$$.

Remark 1.1
To better understand the weights $${\bf w}$$ specified in (1.6), consider an i.i.d. sparse process $${{\mathbb{\mathbf{X}}}}$$ distributed as $$p f_c+(1-p)\delta_0$$. Here, $$p\in[0,1]$$ and $$f_c$$ denotes an absolutely continuous distribution over a bounded interval $$(x_1,x_2)$$, where $$x_1<x_2$$.
In that case, since the process is memoryless, it is enough to study the weights for $$k=0$$. (Note that $${\rm {P}}([X_{k+1}]_b=a_{k+1}|[X^k]_b=a^k)={\rm {P}}([X_{k+1}]_b=a_{k+1})$$, for every $$a^{k+1}\in\mathcal{X}_b^{k+1}$$.) For $$k=0$$ and $$a\in\mathcal{X}_b$$, $$w_a= \log {1\over {\rm {P}}([X_{1}]_b=a)}$$. On the other hand, it is straightforward to check that for $$b$$ large, $${\rm {P}}([X_{1}]_b=0)\approx 1-p$$. For $$a\in\mathcal{X}_b$$, $$a\neq 0$$, by the mean value theorem,   \[ {\rm {P}}([X_{1}]_b=a)=2^{-b}f_c(x_a), \] for some $$x_a\in(a,a+2^{-b})$$. Therefore, $$w_0\approx -\log(1-p)$$, and, for $$a\in\mathcal{X}_b$$, $$a\neq 0$$,   \[ w_a=b-\log f_c(x_a). \] It can be observed that, as $$b$$ grows, all weights, except $$w_0$$, grow to infinity, at a rate linear in $$b$$. (This issue and its implications are discussed in further detail in Section 3.)

Note that minimizing $$\|Au^n - Y^m\|^2$$ is natural, since we would like to obtain a parameter vector that matches the response variables. However, with no constraint on potential solution vectors, the estimate will suffer from overfitting (unless $$m$$ is much larger than $$n$$). Hence, some constraints should be imposed on the set of potential solutions. In the Q-MAP optimization, this constraint requires a potential solution $$u^n\in\mathcal{X}_b^n$$ to satisfy \begin{align} c_{{\bf w}}(u^n) \leq \gamma_n. \end{align} (1.8) There are two other features of the above optimization that are worth emphasis and clarification at this point.

Quantized reconstructions: While the parameter vector $$X^n$$ and the response variables $$Y^m$$ are typically real valued, the estimate produced by the Q-MAP optimization lies in the quantized space $$\mathcal{X}_b^n$$. The motivation for this quantization will be explained later, but in a nutshell, this step helps both the theoretical analysis and the implementation of the optimization.

Memory parameter ($$k$$): Again, both for the convenience of the theoretical analysis and for the ease of implementation, only dependencies captured by the $$(k+1)$$th order probability distribution of the process $${{\mathbb{\mathbf{X}}}}$$ are taken into account in the Q-MAP optimization. This memory parameter is a free parameter that can be selected based on the source distribution. As shown later, for instance, in the noiseless setting, for an i.i.d. process, $$k=0$$ is enough to achieve the fundamental limits in terms of sampling rate ($$m/n$$).

Although the Q-MAP optimization provides a new approach to Bayesian compressed sensing, it is still not an easy optimization problem. For instance, for the i.i.d. distribution mentioned earlier, $$(1- p)\delta_0+ p f_c$$, the constraint becomes equivalent to having an upper bound on $$\|u^n\|_0$$. This is similar to the notoriously difficult optimal variable selection problem. Hence, although the Q-MAP optimization might appear simpler than other estimators such as the MMSE estimator, it can still be computationally infeasible. However, inspired by the projected gradient descent (PGD) method in convex optimization, we propose the following algorithm to solve the Q-MAP optimization. Define \begin{align} \mathcal{F}_o\triangleq \big\{u^n\in\mathcal{X}_b^n: \; c_{{\bf w}}(u^n)\leq \gamma_n \big\},\label{eq:def-Fc-rand} \end{align} (1.9) where the function $$c_{{\bf w}}$$ is defined in (1.4).
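For concreteness, the following minimal Python sketch (ours, purely illustrative; the helper names are not from the article) shows how the weights of (1.6), the cost $$c_{{\bf w}}$$ of (1.4) and membership in $$\mathcal{F}_o$$ could be computed, assuming the $$(k+1){\rm th}$$ order pmf of the quantized process is available as a dictionary keyed by the patterns $$a^{k+1}$$.

```python
import numpy as np
from collections import Counter

def qmap_weights(joint_pmf):
    """Weights of (1.6): w_{a^{k+1}} = -log2 P([X_{k+1}]_b = a_{k+1} | [X^k]_b = a^k).
    `joint_pmf` maps (k+1)-tuples a^{k+1} to P([X^{k+1}]_b = a^{k+1})."""
    marginal = Counter()                       # P([X^k]_b = a^k)
    for a, p in joint_pmf.items():
        marginal[a[:-1]] += p
    return {a: -np.log2(p / marginal[a[:-1]]) for a, p in joint_pmf.items() if p > 0}

def c_w(u, w, k):
    """Cost c_w(u^n) of (1.4): weights averaged under the (k+1)-th order empirical
    distribution of u^n; patterns of zero probability under the source get weight +inf."""
    n = len(u)
    counts = Counter(tuple(u[i:i + k + 1]) for i in range(n - k))
    return sum(cnt / (n - k) * w.get(a, np.inf) for a, cnt in counts.items())

def in_constraint_set(u, w, k, gamma):
    """Membership test for F_o = {u^n : c_w(u^n) <= gamma} of (1.9)."""
    return c_w(u, w, k) <= gamma

# toy example: i.i.d. spike-and-slab prior quantized to b = 2 bits, memory k = 0
b, p = 2, 0.2
levels = [i / 2**b for i in range(2**b)]
pmf = {(a,): p / 2**b + (1 - p) * (a == 0.0) for a in levels}
w = qmap_weights(pmf)
print(c_w((0.0, 0.0, 0.75, 0.0, 0.25), w, k=0))
```

In the toy example the weight of the zero symbol is small and the weights of the non-zero symbols are large, which is the $$\ell_0$$-like behavior made precise in Section 3.1.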
Note that the set $$\mathcal{F}_o$$ depends on the quantization level $$b$$, the memory parameter $$k$$, the weights $${\bf w}=\{w_{a^{k+1}}\}_{a^{k+1}\in\mathcal{X}_b^{k+1}}$$ and also the parameter $$\gamma_n$$. The PGD algorithm generates a sequence of estimates $${\hat{X}}^n(t)$$, $$t=0,1,\ldots$$, of the sequence $$X^n$$. It starts by setting $${\hat{X}}^n(0)=0^n$$, and proceeds by updating $${\hat{X}}^n(t)$$, its estimate at time $$t$$, as follows: \begin{align} S^n(t+1)&={\hat{X}}^n(t)+\mu A^{\top}(Y^m-A{\hat{X}}^n(t))\nonumber\\ {\hat{X}}^n(t+1)&=\mathop{\text{arg min}}\limits_{u^n\in\mathcal{F}_o}\left\|u^n-S^n(t+1)\right\|\!,\label{eq:PGD-update} \end{align} (1.10) where $$\mu$$ denotes the step-size and ensures that the algorithm does not diverge to infinity. Intuitively, the above procedure, at each step, moves the current estimate toward the $$Ax^n=Y^m$$ hyperplane and then projects the new estimate onto the set of structured vectors. As will be proved later, when $$m$$ is large enough, in the noiseless setting, the estimates provided by the PGD algorithm converge to $$X^n$$, with high probability. The challenging step in running the PGD method is the projection step, which requires finding the closest point to $$S^n(t+1)$$ in $$\mathcal{F}_o$$. For some special distributions, such as the sparse or piecewise-constant ones discussed in Section 3, the corresponding set $$\mathcal{F}_o$$ has a special form that simplifies the projection considerably. In general, while projection onto a non-convex discrete set can be complicated, as we will discuss in Section 6.2, we believe that because of the special structure of the set $$\mathcal{F}_o$$, a dynamic programming approach can be used for performing this projection. More specifically, we will explain how a Viterbi algorithm [31] with $$2^{bk}$$ states and $$n$$ stages can be used for this purpose. Hence, the complexity of the proposed method for doing the projection task required by the PGD grows linearly in $$n$$, but exponentially in $$kb$$. We expect that for ‘structured distributions’ the scaling with $$b$$ and $$k$$ can be improved much beyond this. We will describe our intuition in Section 6.2 but leave the formal exploration of this direction to future research.

The main question we have not addressed yet is how well the Q-MAP optimization and the proposed PGD method recover $$X^n$$ from $$Y^m$$. In the next few paragraphs, we informally state our main results regarding the performance of the Q-MAP optimization and the PGD-based algorithm. Before that, note that, to recover $$X^n$$ from $$m<n$$ response variables, intuitively, the process should be structured. Hence, first, we briefly review a measure of structuredness developed for real-valued stochastic processes. The $$k$$th order upper information dimension of a stationary process $${{\mathbb{\mathbf{X}}}}$$ is defined as \begin{equation}\label{eq:first_appearanced_k} \bar{d}_k({{\mathbb{\mathbf{X}}}})\triangleq \limsup_{b\to \infty} {H([X_{k+1}]_b|[X^k]_b) \over b}, \end{equation} (1.11) where $$H([X_{k+1}]_b|[X^k]_b) $$ denotes the conditional entropy of $$[X_{k+1}]_b$$ given $$[X^k]_b$$. Similarly, the $$k$$th order lower information dimension of $${{\mathbb{\mathbf{X}}}}$$ is defined as \begin{align} \underline{d}_k({{\mathbb{\mathbf{X}}}})\triangleq \liminf_{b\to \infty} {H([X_{k+1}]_b|[X^k]_b) \over b}.
\end{align} (1.12) If $$\bar{d}_k({{\mathbb{\mathbf{X}}}})=\underline{d}_k({{\mathbb{\mathbf{X}}}})$$, then the $$k$$th order information dimension of the process $${{\mathbb{\mathbf{X}}}}$$ is defined as $${d}_k({{\mathbb{\mathbf{X}}}})=\bar{d}_k({{\mathbb{\mathbf{X}}}})=\underline{d}_k({{\mathbb{\mathbf{X}}}})$$ [15]. For $$k=0$$, $$\bar{d}_k({{\mathbb{\mathbf{X}}}})$$ ($$\underline{d}_k({{\mathbb{\mathbf{X}}}})$$) is equal to the upper (lower) Rényi information dimension of $$X_1$$ [23], which is a well-known measure of structuredness for real-valued random variables or random vectors. It can be proved that for all stationary sources with $$H(\lfloor X_1 \rfloor)<\infty$$, $$\bar{d}_k({{\mathbb{\mathbf{X}}}})\leq 1$$ and $$\underline{d}_k({{\mathbb{\mathbf{X}}}})\leq 1$$ [15]. To gain some insight on these definitions, consider an i.i.d. process $${{\mathbb{\mathbf{X}}}}$$ with $$X_1 \sim (1-p) \delta_0 + p {\rm Unif}(0,1)$$, where $${\rm Unif}$$ denotes a uniform distribution. This is called the spike and slab prior [19]. It can be proved that for this process $${d}_0({{\mathbb{\mathbf{X}}}})=\bar{d}_0({{\mathbb{\mathbf{X}}}})=\underline{d}_0({{\mathbb{\mathbf{X}}}}) =p$$ [23]. For general stationary sources with infinite memory, the limit of $$\bar{d}_k({{\mathbb{\mathbf{X}}}})$$ as $$k$$ grows to infinity is defined as the upper information dimension of the process $${{\mathbb{\mathbf{X}}}}$$ and is denoted by $$\bar{d}_o({{\mathbb{\mathbf{X}}}})$$. As argued in [15], the information dimension of a process measures its level of structuredness, and is related to the number of response variables required for its accurate recovery. On the basis of these definitions and concepts, we state our results in the following. The exposition of our results is informal and lacks many details. All the details will be clarified later in the article.

Informal Result 1.
Consider the noiseless setting ($$Z^m=0^m$$), and assume that the elements of the design matrix $$A$$ are i.i.d. Gaussian. Further assume that the process $${{\mathbb{\mathbf{X}}}}$$ satisfies certain mixing conditions, and that, for a fixed $$k$$, $${m\over n}> \bar{d}_k({{\mathbb{\mathbf{X}}}})$$. Then, asymptotically, for a proper quantization level which grows with $$n$$, the Q-MAP optimization recovers $$X^n$$ with high probability.

There is an interesting feature of this result that we would like to emphasize here: if $$ \bar{d}_k({{\mathbb{\mathbf{X}}}})$$ is strictly smaller than $$1$$, then we can estimate $$X^n$$ accurately from $$m <n$$ response variables. In fact, the smaller $$\bar{d}_k({{\mathbb{\mathbf{X}}}})$$, the fewer response variables are required. In particular, we can consider the spike and slab prior discussed before, which corresponds to the sparse parameter vectors studied in the literature [1,4]. For this prior $$\bar{d}_0({{\mathbb{\mathbf{X}}}}) =p$$. Hence, as long as $$m> np$$, asymptotically, the estimate of Q-MAP with $$k=0$$ will be accurate. Note that $$np$$ is in fact the expected number of non-zero elements of $$X^n$$. We believe that even an MMSE estimator that employs only the $$k{\rm th}$$ order distribution of the source cannot recover $$X^n$$ from a smaller number of response variables. We present some examples that confirm our claim; however, the optimality of the result we obtain above is an open question that we leave for future research. The above result is for the Q-MAP optimization, which is still computationally complicated. Our next result is about our proposed PGD-based algorithm.

Informal Result 2.
Consider again the noiseless setting, and assume that the elements of $$A$$ are i.i.d. Gaussian. If the process $${{\mathbb{\mathbf{X}}}}$$ satisfies certain mixing conditions, and $${m\over n}> 80 b \bar{d}_k({{\mathbb{\mathbf{X}}}})$$, where $$k$$ is a fixed parameter, then the estimates derived by the PGD algorithm, with high probability, converge to $$X^n$$.

We will also characterize the convergence rate of the PGD-based algorithm and its performance in the presence of additive white Gaussian noise (AWGN). Compared with Informal Result 1, the number of response variables required in Informal Result 2 is a factor of $$80b$$ higher. As we will discuss later, we let $$b$$ grow as $$O(\log \log n)$$, and hence the difference between Informal Result 1 and Informal Result 2 is not substantial.

1.3 Related work and discussion
Bayesian linear regression has been the topic of extensive research in the past 50 years [10–12,16,17,19–21,26,27,30,32]. In all these papers, $$X^n$$ is considered as a random vector whose distribution is known. However, simple models are often considered for the distribution of $$X^n$$, either to simplify the posterior calculations or to apply Markov chain Monte Carlo methods such as Gibbs sampling. This article considers a different scenario. We ignore the computational issues at first and consider an arbitrary distribution for $$X^n$$. This is particularly useful for applications in which a complicated prior can be learned. (For instance, one might have access to a large database that has many different draws of the process $$\mathbf{X}$$.) We then present an optimization for estimating $$X^n$$ and prove the optimality of this approach under some conditions. This approach lets us avoid the limitations that are imposed by posterior calculations. On the other hand, one main advantage of having posterior distributions is that they can be used in calculating confidence intervals. Exploring confidence intervals and related topics remains an open question for future research. Our theoretical analyses are inspired by the recent surge of interest toward understanding the high-dimensional linear regression problem [1,4,5,8,24,28]. On this front, there has been very limited work on the theoretical analysis of Bayesian recovery algorithms, especially beyond memoryless sources. Two of the main tools that have been used for this purpose in the literature are the replica method [9] and state evolution [18]. Both methods have been employed to analyze the performance of MMSE and MAP estimators in the asymptotic setting where $$m,n \rightarrow \infty$$, while $$m/n$$ is fixed. They both have some limitations. For instance, replica-based methods are not fully rigorous. Moreover, while they work well for i.i.d. sequences, it is not clear how they can be applied to sources with memory. The state evolution framework suffers from similar issues. Our article presents the first result in this direction for processes with memory.

1.4 Organization of the article
The organization of the article is as follows. The Q-MAP estimator is developed in Section 2, and in Section 3, it is simplified for some simple distributions and shown to have connections to some well-known algorithms. In Section 4, two classes of stochastic processes are studied. The empirical distributions of the quantized versions of processes in each class have exponential convergence rates. The performance of the Q-MAP estimator is studied in Section 5.
An iterative method based on PGD is proposed and studied in Section 6. Section 7 presents the proofs of the main results of the article and, finally, Section 8 concludes the article.

2. Quantized MAP estimator
Consider the problem of estimating $$X^n$$ from noise-free response variables $$Y^m=AX^n$$, where $${\mathbb{\mathbf{X}}}=\{X_i\}_{i=1}^{\infty}$$ is a stationary process. Since there is no noise, one can employ a MAP estimator and find the most likely parameter vector given the response variables $$Y^m$$. Instead of solving the original MAP optimization, we consider finding the most probable sequence in the quantized space $$\mathcal{X}_b^n$$. That is, \begin{align} {\rm maximize} \;\;&{\rm{P}}([X^n]_b=u^n)\nonumber\\ {\rm subject\; to}\;\; & u^n\in\mathcal{X}_b^n,\nonumber\\ &[x^n]_b=u^n,\nonumber\\ &Ax^n=Y^m,\label{eq:Q-MAP-orig} \end{align} (2.1) where $${\rm{P}}$$ denotes the law of the process $$\mathbf{X}$$. The optimization described in (2.1) can be further simplified and made more amenable to both analysis and implementation. Note that if $$Ax^n=Y^m$$, and $$|x_i-u_i|\leq 2^{-b}$$, for all $$i$$, then \begin{align} \|Au^n-Y^m\|_2^2&\leq (\sigma_{\max}(A))^2 \|x^n-u^n\|^2\nonumber\\ &\leq 2^{-2b}n^2(\sigma_{\max}(A))^2 ,\label{eq:Q-MAP-orig-simlifies-s1} \end{align} (2.2) where $$\sigma_{\rm max}(A)$$ denotes the maximum singular value of the design matrix $$A$$. This provides an upper bound on $${1\over n^2}\|A[x^n]_b-Y^m\|_2^2$$ in terms of $$\sigma_{\max}(A)$$ and $$b$$. In other words, since $$u^n$$ is a quantized version of $$x^n$$, and $$Ax^n=Y^m$$, $$\|Au^n-Y^m\|_2^2$$ is also expected to be small. To further simplify (2.1), we next consider its objective, i.e. $$-\log{\rm{P}}([X^n]_b=u^n)$$. Assume that the process $${\mathbb{\mathbf{X}}}$$ is such that $${\rm{P}}([X^n]_b=u^n)$$ can be factored as \begin{align} {\rm{P}}([X^n]_b=u^n)= {\rm{P}}([X^k]_b=u^k ) \prod_{i=k+1}^n{\rm{P}}([X_i]_b=u_i|[X_{i-k}^{i-1}]_b=u_{i-k}^{i-1}), \end{align} (2.3) for some finite $$k$$. In other words, the $$b$$-bit quantized version of $${\mathbb{\mathbf{X}}}$$ is a Markov process of order $$k$$. Then define the coefficients $$(w_{a^{k+1}}:\;a^{k+1}\in \mathcal{X}_b^{k+1})$$ according to (1.6).
This assumption simplifies the term $$-\log{\rm{P}}([X^n]_b=u^n)$$ in the following way: \begin{align} &-\log {\rm{P}}([X^n]_b=u^n)\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) -\sum_{i=k+1}^n\log {\rm{P}}([X_i]_b=u_i|[X_{i-k}^{i-1}]_b=u_{i-k}^{i-1})\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) \nonumber\\ &\qquad-\sum_{i=k+1}^n\log {\rm{P}}([X_i]_b=u_i|[X_{i-k}^{i-1}]_b=u_{i-k}^{i-1})\sum_{a^{k+1}\in \mathcal{X}_b^{k+1}}\mathbb{1}_{u_{i-k}^i=a^{k+1}}\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) \nonumber\\ &\qquad-\sum_{i=k+1}^n \sum_{a^{k+1}\in \mathcal{X}_b^{k+1}}\mathbb{1}_{u_{i-k}^i=a^{k+1}}\log{\rm{P}}([X_{k+1}]_b=a_{k+1}|[X^{k}]_b=a^{k})\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) + \sum_{i=k+1}^n \sum_{a^{k+1}\in \mathcal{X}_b^{k+1}}w_{a^{k+1}}\mathbb{1}_{u_{i-k}^i=a^{k+1}}\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) + \sum_{a^{k+1}\in \mathcal{X}_b^{k+1}}w_{a^{k+1}}\sum_{i=k+1}^n\mathbb{1}_{u_{i-k}^i=a^{k+1}}\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) + (n-k)\sum_{a^{k+1}\in \mathcal{X}_b^{k+1}} w_{a^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|u^n)\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) + (n-k)c_{{\bf w} }(u^n), \end{align} (2.4) where the third equality uses the stationarity of $${\mathbb{\mathbf{X}}}$$, $$\hat{p}^{(k+1)}$$ denotes the $$(k+1){\rm th}$$ order empirical distribution of $$u^n$$ as defined in (1.2), and $$c_{{\bf w} }(u^n)$$ is defined in (1.4). Assuming $$k$$ is much smaller than $$n$$, and ignoring the negligible term of $$-\log {\rm{P}}([X^k]_b=u^k )/(n-k)$$, instead of minimizing $$-\log{\rm{P}}([X^n]_b=u^n)$$, subject to an upper bound on $$\|Au^n-Y^m\|_2^2$$, we consider the following optimization where the roles of the cost and constraint functions are flipped: \begin{align} {\hat{X}}^n&=\;\;\mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n} \;\; \|Au^n-Y^m\|_2^2 \nonumber \\ &\;\;\;\;\;\;{\rm subject \; to} \;\; c_{{\bf w} }(u^n)\leq \gamma, \end{align} (2.5) or its Lagrangian form \begin{align} {\hat{X}}^n&= \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n}\Big[c_{{\bf w} }(u^n)+{\lambda\over n^2}\|Au^n-Y^m\|^2\Big].\label{eq:Q-MAP-L} \end{align} (2.6) The choice of the parameters $$\lambda>0$$ and $$\gamma$$ is discussed later in our analysis. We refer to both (2.5) and (2.6) as quantized MAP (Q-MAP) estimators. Obtaining the Q-MAP estimator involved several approximation and relaxation steps. It is not clear how accurate these approximations are, and what the performance of the ultimate algorithm is. Also, solving the Q-MAP optimization requires specifying the parameters $$b$$ and $$\lambda$$, which significantly affect the performance of the estimator. These questions are all answered in Section 5. Before that, in the following section, we focus on two specific processes, which are well studied in the literature, and derive the Q-MAP formulation in each case. This will clarify some of the properties of our Q-MAP formulation.

3. Special distributions
To get a better understanding of the Q-MAP optimization described in (2.5) and (2.6), and especially the term   \[ c_{{\bf w} }(u^n)=\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|u^n), \] in this section we study two special distributions and derive a simpler statement of the Q-MAP optimization in each case.

3.1 Independent and identically distributed sparse processes
One of the most popular models for sparse parameter vectors is the spike and slab prior [19]. Consider an i.i.d. process $${\mathbb{\mathbf{X}}}$$, such that $$X_i\sim (1-p)\delta_0+pU_{[0,1]}$$.
Since the process is i.i.d., by setting $$k=0$$ the optimization stated in (2.5) can be simplified as \begin{align} {\hat{X}}^n&= \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n: \sum\limits_{a\in\mathcal{X}_b} w_{a} \hat{p}^{(1)}(a|u^n)\leq \gamma} \|Au^n-Y^m\|^2,\label{eq:QMAP-iid-1} \end{align} (3.1) where \begin{align} w_a=\log {1\over {\rm{P}}([X_{1}]_b=a)} = \log {1\over (1-p)\mathbb{1}_{a=0}+p{2^{-b}}}. \end{align} (3.2) Therefore, $$\sum_{a\in\mathcal{X}_b} w_{a} \hat{p}^{(1)}(a|u^n)$$ in (3.1) can be written as \begin{align} \sum_{a\in\mathcal{X}_b} w_{a} \hat{p}^{(1)}(a|u^n) &=\sum_{a\in\mathcal{X}_b} w_{a} \left({1\over n}\sum_{i=1}^n\mathbb{1}_{u_i=a}\right)\nonumber\\ &= {1\over n}\sum_{i=1}^n\sum_{a\in\mathcal{X}_b} w_a \mathbb{1}_{u_i=a}\nonumber\\ & \stackrel{(a)}{=} {1\over n}\sum_{i=1}^n w_{u_i} \nonumber\\ &=-{1\over n}\sum_{i=1}^n \log ((1-p)\mathbb{1}_{u_i=0}+p{2^{-b}})\nonumber\\ &=-\hat{p}(0|u^n) \log ((1-p)+p{2^{-b}})-(1-\hat{p}(0|u^n) )\log(p{2^{-b}})\nonumber\\ &=\hat{p}(0|u^n) \log {p{2^{-b}}\over (1-p)+p{2^{-b}}}-\log(p{2^{-b}}),\label{eq:simplified-cost-ell-0} \end{align} (3.3) where (a) holds because $$\sum_{a\in\mathcal{X}_b} w_a \mathbb{1}_{u_i=a}=w_{u_i}$$. Since $$-\log(p{2^{-b}})$$ is constant, and   \[ \log { (1-p)+p{2^{-b}}\over p{2^{-b}}} \] is positive, from (3.3), an upper bound on $$\sum_{a\in\mathcal{X}_b} w_{a} \hat{p}^{(1)}(a|u^n) $$ is in fact an upper bound on the $$\ell_0$$-norm of $$u^n$$ defined as   \begin{align*} \|u^n\|_0\triangleq |\{i:\; u_i\neq 0\}|. \end{align*} (Note that $$ \|u^n\|_0=(1-\hat{p}(0|u^n))n$$.) Therefore, given these simplifications, (3.1) can be written as \begin{align} {\hat{X}}^n&= \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n: \|u^n\|_0\leq \gamma'} \|Au^n-Y^m\|^2, \label{eq:QMAP-iid-2} \end{align} (3.4) where $$\gamma'$$ is a function of $$\gamma$$, $$b$$ and $$p$$.

3.2 Piecewise-constant processes
Another popular example is a piecewise-constant process. (Refer to [29] for some applications of this model.) As our second example, we introduce a first-order Markov process that can model piecewise-constant functions. Conditioned on $$X_i=x_i$$, $$X_{i+1}$$ is distributed as $$(1-p)\delta_{x_i}+p {\rm Unif}_{[0,1]}$$. In other words, at each time step, the Markov chain either stays at its previous value or jumps to a new value, which is drawn from a uniform distribution over $$[0,1]$$, independent of the past values. The jump process can be modeled as a $$\mathrm{Bern}(p)$$ process which is independent of the past values of the Markov chain. Then, since the process has a memory of order one, (2.6) can be written as \begin{align} {\hat{X}}^n&= \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n}\Bigg[\sum_{a^{2}\in\mathcal{X}_b^{2}} w_{a^{2}} \hat{p}^{(2)}(a^{2}|u^n) +{\lambda\over n^2}\|Au^n-Y^m\|^2\Bigg],\label{eq:QMAP-markov-1} \end{align} (3.5) where for $$a^2\in\mathcal{X}_b^2$$  \begin{align} w_{a^2}=\log {1\over {\rm{P}}([X_{2}]_b=a_{2}|[X_{1}]_b=a_{1})}. \end{align} (3.6) Given the kernel of the Markov chain, we have \begin{align} {\rm{P}}([X_{2}]_b=a_{2}|[X_{1}]_b=a_{1})=(1-p)\mathbb{1}_{a_2=a_1}+p{2^{-b}}. \end{align} (3.7) Let $$N_J(u^n)$$ denote the number of jumps in the sequence $$u^n$$, i.e. $$N_J(u^n)=\sum_{i=2}^n\mathbb{1}_{u_i\neq u_{i-1}}$$.
Then, the first term in the cost function in (3.5) can be rewritten as \begin{align} \sum_{a^2\in\mathcal{X}_b^2} w_{a^2} \hat{p}^{(2)}(a^2|u^n) &= {1\over n-1}\sum_{i=2}^n\sum_{a^2\in\mathcal{X}_b^2} w_{a^2} \mathbb{1}_{u_{i-1}^i=a^2}= {1\over n-1}\sum_{i=2}^n w_{u_{i-1}^i}\nonumber\\ &=-{1\over n-1}\sum_{i=2}^n \log\left((1-p)\mathbb{1}_{u_i=u_{i-1}}+p{2^{-b}}\right)\nonumber\\ &=-{N_J(u^n)\over n-1} \log(p{2^{-b}})-\left(1-{N_J(u^n)\over n-1}\right) \log(1-p+p{2^{-b}})\nonumber\\ &={N_J(u^n)\over n-1} \log{1-p+p{2^{-b}} \over p{2^{-b}}}-\log(1-p+p{2^{-b}}).\label{eq:cost-markov} \end{align} (3.8) Inserting (3.8) into (3.5), it follows that \begin{align} {\hat{X}}^n&= \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n}\left[\alpha_b\left({N_J(u^n)\over n-1}\right)+{\lambda\over n^2}\|Au^n-Y^m\|^2\right]\!, \end{align} (3.9) where $$\alpha_b= \log { (1-p)+p{2^{-b}}\over p{2^{-b}}}$$. Note that the term $${N_J(u^n)\over n-1}$$ counts the fraction of jumps in $$u^n$$, which is a natural regularizer here.

4. Exponential convergence rates
In our theoretical analysis, one of the main features required from the source process $${\mathbb{\mathbf{X}}}=\{X_i\}_{i=1}^{\infty}$$ is that the empirical statistics derived from $$[X^n]_b$$ converge, asymptotically, to their expected values. In all our analysis, we require this to hold even when $$b$$ grows with $$n$$. Intuitively, if this is not the case, we do not expect the Q-MAP estimator to be able to obtain a good estimate of $$X^n$$. In the following two sections, we study two important classes of stochastic processes which satisfy this property.

4.1 $${\it{\Psi}}^*$$-mixing processes
The first class of processes that satisfies our requirements is that of $${\it{\Psi}}^*$$-mixing processes. Consider a stationary process $${\mathbb{\mathbf{X}}}=\{X_i\}_{i=1}^\infty$$. Let $$\mathcal{F}_j^k$$ denote the $$\sigma$$-field of events generated by the random variables $$X_j^k$$, where $$j\leq k$$. Define \begin{equation} \psi^*(g) = \sup \frac{{\rm{P}}(\mathcal{A} \cap \mathcal{B})}{{\rm{P}}(\mathcal{A}) {\rm{P}}(\mathcal{B})},\label{eq:def-Psi-star} \end{equation} (4.1) where the supremum is taken over all events $$\mathcal{A} \in \mathcal{F}_{0}^{j}$$ and $$\mathcal{B} \in \mathcal{F}_{j+g}^\infty$$ with $${\rm{P}}(\mathcal{A})>0$$ and $${\rm{P}}(\mathcal{B})>0$$.

Definition 4.1
A stationary process $${\mathbb{\mathbf{X}}}$$ is called $$\psi^*$$-mixing if $$\psi^*(g) \to 1$$ as $$g$$ grows to infinity.

There are many examples of $${\it{\Psi}}^*$$-mixing processes. For instance, it is straightforward to check that any i.i.d. sequence is $${\it{\Psi}}^*$$-mixing. Also, every finite-state Markov chain is $${\it{\Psi}}^*$$-mixing [25]. (For further information on $${\it{\Psi}}^*$$-mixing processes, the reader may refer to [3].) Another example of a $${\it{\Psi}}^*$$-mixing process is one built by taking the moving average of an i.i.d. process. More specifically, consider an i.i.d. process $${\mathbb{\mathbf{Y}}}$$ and let the process $${\mathbb{\mathbf{X}}}$$ denote the moving average of the process $${\mathbb{\mathbf{Y}}}$$, defined as $$X_i={1\over r}\sum_{j=0}^{r-1}Y_{i-j}$$, for all $$i$$. Then, the process $${\mathbb{\mathbf{X}}}$$ is $${\it{\Psi}}^*$$-mixing. As mentioned earlier, the advantage of $${\it{\Psi}}^*$$-mixing processes is the fast convergence of their empirical distributions to their expected values. This is captured by the following result from [15], which is a straightforward extension of a similar result in [25] for finite-alphabet processes.
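To see this type of convergence empirically, the following small Monte Carlo sketch (ours, purely illustrative; the spike-and-slab parameters $$p=0.3$$ and $$b=4$$ are arbitrary) estimates the $$\ell_1$$ distance between the first-order empirical distribution of a quantized i.i.d. spike-and-slab sequence and its expectation; Theorem 4.2 below quantifies how fast this distance concentrates.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike_slab(n, p):
    """n i.i.d. draws from (1-p)*delta_0 + p*Unif(0,1), an example of a
    Psi*-mixing (indeed i.i.d.) process."""
    return np.where(rng.random(n) < p, rng.random(n), 0.0)

def l1_gap(n, p=0.3, b=4):
    """||p_hat^(1)(.|Z^n) - mu_1||_1 for the b-bit quantized sequence Z_i = [X_i]_b."""
    z = np.floor(sample_spike_slab(n, p) * 2**b) / 2**b       # quantize to the grid
    levels = np.arange(2**b) / 2**b
    emp = np.array([(z == a).mean() for a in levels])         # empirical pmf
    mu = np.full(2**b, p * 2.0**-b)                           # each cell of the slab
    mu[0] += 1 - p                                            # plus the spike at zero
    return np.abs(emp - mu).sum()

for n in (10**3, 10**4, 10**5):
    print(n, l1_gap(n))    # the gap shrinks as n grows, roughly like n**(-1/2)
```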
Theorem 4.2
Consider a $${\it{\Psi}}^*$$-mixing process $${\mathbb{\mathbf{X}}}=\{X_i\}_{i=1}^{\infty}$$, and its $$b$$-bit quantized version $${\mathbb{\mathbf{Z}}}=\{Z_i\}_{i=1}^{\infty}$$, where $$Z_i=[X_i]_b$$ and $$\mathcal{Z}=\mathcal{X}_b$$. Define the measure $$\mu_k$$, such that for $$a^k\in\mathcal{Z}^k$$, $$\mu_k(a^k)={\rm{P}}(Z^k=a^k)$$. Then, for any $$\epsilon>0$$, and any $$b$$ large enough, there exists $$g\in{\rm I}\kern-0.20em{\rm N}$$, depending only on $$\epsilon$$ and the function $$\psi^*(\cdot)$$ defined in (4.1), such that for any $$n>k+6(k+g)/\epsilon$$, \begin{equation}\label{eq:exponentialconvergence} {\rm{P}}(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k\|_1\geq \epsilon)\leq 2^{c\epsilon^2/8}(k+g)n^{|\mathcal{Z}|^k}2^{-{nc\epsilon^2\over 8(k+g)}}, \end{equation} (4.2) where $$c=1/(2\ln 2)$$.

Note that the upper bound in (4.2) depends on $$b$$ only through $$|\mathcal{Z}|=|\mathcal{X}_b|$$. Hence, if $$b=b_n$$ grows with $$n$$, it should grow slowly enough, such that overall $$2^{c\epsilon^2/8}(k+g)n^{|\mathcal{Z}|^k}2^{-{nc\epsilon^2\over 8(k+g)}}$$ still converges to zero, as $$n$$ grows to infinity. One such example, used in our results, is $$b=b_n=\lceil r \log\log n\rceil$$, $$r\geq 1$$. For this choice of $$b_n$$, Theorem 4.2 guarantees that the empirical distribution derived from the quantized sequence remains close to its expected value, with high probability.

4.2 Weak $${\it{\Psi}}^*_q$$-mixing Markov processes
Finite-alphabet Markov chains are known to be $${\it{\Psi}}^*$$-mixing, and therefore their empirical distributions have exponential convergence rates [25]. Continuous-space Markov processes, on the other hand, are not $${\it{\Psi}}^*$$-mixing in general, and hence the results of the previous section may not hold. However, for many such Markov processes, it is still possible to show that the empirical distributions of their quantized versions converge to their expected values, even if the quantization level $$b$$ grows with $$n$$. In this section, we show how such results can be proved by extending the definition of $${\it{\Psi}}^*$$-mixing processes to define weak $${\it{\Psi}}_q^*$$-mixing Markov processes. As shown at the end of this section, one important example of a continuous-space Markov process that is provably weak $${\it{\Psi}}_q^*$$-mixing is the piecewise-constant process discussed in Section 3.2. Consider a real-valued stationary stochastic process $${\mathbb{\mathbf{X}}}=\{X_i\}$$, with alphabet $$\mathcal{X}=[l,u]$$, where $$l,u\in{\rm I}\kern-0.20em{\rm R}$$. Let the process $${\mathbb{\mathbf{Z}}}=\{Z_i\}$$ denote the $$b$$-bit quantized version of the process $${\mathbb{\mathbf{X}}}$$. That is, $$Z_i=[X_i]_b$$, for all $$i$$. The alphabet of the process $${\mathbb{\mathbf{Z}}}$$ is clearly $$\mathcal{Z}=\mathcal{X}_b=\{[x]_b: x\in\mathcal{X}\}$$. Let $$\mu_k^{(b)}$$ denote the distribution of $$Z^k$$. That is, for any $$z^k\in\mathcal{Z}^k$$, \begin{align} \mu_k^{(b)}(z^k)={\rm{P}}(Z^k=z^k)={\rm{P}}([X^k]_b=z^k). \end{align} (4.3) The following lemma proves that if the process $${\mathbb{\mathbf{X}}}$$ has a property analogous to being $${\it{\Psi}}^*$$-mixing, then, potentially, it has exponential convergence rates.

Lemma 4.1
Suppose that the stationary process $${\mathbb{\mathbf{X}}}$$ is such that there exists a function $${\it{\Psi}}:{\rm I}\kern-0.20em{\rm N}\times {\rm I}\kern-0.20em{\rm N}\to{\rm I}\kern-0.20em{\rm R}^+$$, which satisfies the following.
For any $$(b,g,\ell_1,\ell_2)\in{\rm I}\kern-0.20em{\rm N}^4$$, $$u^{\ell_1}\in\mathcal{Z}^{\ell_1}$$ and $$v^{\ell_2}\in\mathcal{Z}^{\ell_2}$$: \begin{align} {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1},Z_{\ell_1+g+1}^{\ell_1+g+\ell_2}=v^{\ell_2}\right)\leq {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right) {\rm{P}}\left(Z_{\ell_1+g+1}^{\ell_1+g+\ell_2}=v^{\ell_2}\right){\it{\Psi}}(b,g),\label{eq:cond-Psi-b-g} \end{align} (4.4) where $$b$$ denotes the quantization level of the process $${\mathbb{\mathbf{Z}}}$$. Then for any given $$\epsilon>0$$, and any positive integers $$k$$ and $$g$$ such that $$4(k+g)/(n-k)<\epsilon$$, \begin{align} {\rm{P}}\left(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon\right)\leq (k+g){\it{\Psi}}^t(b,g)(t+1)^{|\mathcal{Z}|^k}2^{-c\epsilon^2t/4}, \end{align} (4.5) where $$t=\lfloor{n-k+1\over k+g}\rfloor$$ and $$c=1/(2\ln 2)$$.

The proof is presented in Section 7.3. Note that if a process is $${\it{\Psi}}^*$$-mixing, then it is straightforward to confirm the existence of a function $$\tilde{{\it{\Psi}}}(g)$$ that satisfies \begin{align} {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1},Z_{\ell_1+g+1}^{\ell_1+g+\ell_2}=v^{\ell_2}\right)\leq {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right) {\rm{P}}\left(Z_{\ell_1+g+1}^{\ell_1+g+\ell_2}=v^{\ell_2}\right)\tilde{{\it{\Psi}}}(g). \end{align} (4.6) However, in this section, we are interested in processes that are not necessarily $${\it{\Psi}}^*$$-mixing. Lemma 4.1 allows us to prove the convergence of the empirical distributions for many such processes. To justify our claims, we focus on the class of Markov processes. For notational simplicity, we consider first-order Markov processes. It is straightforward to extend these results to higher order Markov processes as well. Let $${\mathbb{\mathbf{X}}}$$ denote a first-order stationary Markov process with kernel function $$K: (\mathcal{X},2^{\mathcal{X}}) \rightarrow \mathbb{R}^+$$ and first-order stationary distribution $$\pi: 2^{\mathcal{X}}\rightarrow {\rm I}\kern-0.20em{\rm R}^+$$. (Here $$2^{\mathcal{X}}$$ denotes the set of subsets of $$\mathcal{X}$$.) In other words, for any $$x\in\mathcal{X}$$ and any measurable subset $$\mathcal{A}$$ of $$\mathcal{X}$$, \begin{align} K(x,\mathcal{A})={\rm{P}}(X_2\in\mathcal{A}|X_1=x) \end{align} (4.7) and \begin{align} \pi(\mathcal{A})={\rm{P}}(X_1\in\mathcal{A}). \end{align} (4.8) Also, for $$g\in{\rm I}\kern-0.20em{\rm N}^+$$, \begin{align} K^g(x,\mathcal{A})={\rm{P}}(X_{1+g}\in\mathcal{A}|X_1=x). \end{align} (4.9) Clearly, $$K^g$$ can be evaluated from the function $$K$$. Finally, with a slight overloading of notation, for $$x\in\mathcal{X}$$ and $$z\in\mathcal{X}_b$$, \begin{align} \pi(z)={\rm{P}}([X_1]_b=z) \end{align} (4.10) and \begin{align} K(x,z)={\rm{P}}([X_2]_b=z|X_1=x). \end{align} (4.11) Similarly, for $$g\in{\rm I}\kern-0.20em{\rm N}^+$$, $$K^g(x,z)={\rm{P}}([X_{1+g}]_b=z|X_1=x)$$. Again, with another slight overloading of notation, for $$x\in\mathcal{X}$$, and $$w^{l+1}\in\mathcal{Z}^{l+1}$$, \begin{align} \pi\left(w_2^{l+1}|x\right)={{\rm{P}}\left([X_2^{l+1}]_b=w^{l+1}_2|X_1=x\right)} \end{align} (4.12) and \begin{align} \pi\left(w^{l+1}_2|w_1\right)={{\rm{P}}\left([X_2^{l+1}]_b=w_2^{l+1}|[X_1]_b=w_1\right)}.
\end{align} (4.13) Define the functions $${\it{\Psi}}_1:{\rm I}\kern-0.20em{\rm N}\times{\rm I}\kern-0.20em{\rm N}\to {\rm I}\kern-0.20em{\rm R}^+$$ and $${\it{\Psi}}_2:{\rm I}\kern-0.20em{\rm N}\to {\rm I}\kern-0.20em{\rm R}^+$$ as \begin{align} {\it{\Psi}}_1(b,g) \triangleq \sup_{(x,z)\in\mathcal{X}\times\mathcal{X}_b} {K^{g}(x,z) \over \pi(z)}\label{eq:Psi1-def} \end{align} (4.14) and \begin{align} {\it{\Psi}}_2(b)\triangleq \sup_{(x,\ell_2,w^{\ell_2})\in\mathcal{X}\times{\rm I}\kern-0.20em{\rm N}\times \mathcal{Z}^{\ell_2}: [x]_b=w_1} {\pi(w_{2}^{\ell_2}|x)\over \pi(w_{2}^{\ell_2}|w_1)}.\label{eq:Psi2-def} \end{align} (4.15) Our next lemma shows how $${\it{\Psi}}(b,g)$$ in Lemma 4.1 can be calculated using $${\it{\Psi}}_1$$ and $${\it{\Psi}}_2$$.

Lemma 4.2
Consider a first-order aperiodic Markov process $${\mathbb{\mathbf{X}}}=\{X_i\}_{i=1}^{\infty}$$. Let $${\mathbb{\mathbf{Z}}}=\{Z_i\}$$ denote the $$b$$-bit quantized version of the process $${\mathbb{\mathbf{X}}}$$. That is, $$Z_i=[X_i]_b$$, and $$\mathcal{Z}=\mathcal{X}_b=\{[x]_b: x\in\mathcal{X}\}$$. Also, let $$\mu_b$$ denote the distribution associated with the finite-alphabet process $${\mathbb{\mathbf{Z}}}$$. Then, for all $$(\ell_1,g,\ell_2)\in{\rm I}\kern-0.20em{\rm N}^3$$, $$u^{\ell_1}\in\mathcal{Z}^{\ell_1}$$, $$v^g\in\mathcal{Z}^g$$ and $$w^{\ell_2}\in\mathcal{Z}^{\ell_2}$$, we have \begin{align} \mu_b\left(u^{\ell_1}v^{g}w^{\ell_2}\right)\leq \mu_b\left(u^{\ell_1}\right)\mu_b\left(w^{\ell_2}\right){\it{\Psi}}_1(b,g){\it{\Psi}}_2(b), \end{align} (4.16) where by definition $$\mu_b\left(u^{\ell_1}v^{g}w^{\ell_2}\right)={\rm{P}}\left(Z^{\ell_1+g+\ell_2}=[u^{\ell_1},v^{g},w^{\ell_2}]\right)$$, $$\mu_b\left(u^{\ell_1}\right)={\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right)$$ and $$\mu_b\left(w^{\ell_2}\right)={\rm{P}}\left(Z^{\ell_2}=w^{\ell_2}\right)$$. Furthermore, for any fixed $$b$$, $${\it{\Psi}}_1(b,g)$$ is a non-increasing function of $$g$$ that converges to $$1$$, as $$g$$ grows to infinity.

The proof is presented in Section 7.4. Combining Lemmas 4.1 and 4.2, with $${\it{\Psi}}(b,g)={\it{\Psi}}_1(b,g){\it{\Psi}}_2(b)$$, we obtain an upper bound of the form   \[ (k+g){\it{\Psi}}^t(b,g)(t+1)^{|\mathcal{Z}|^k}2^{-c\epsilon^2t/4} \] on $${\rm{P}}(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon)$$. To prove our desired convergence results, we need to ensure that this upper bound goes to zero as $$n$$ grows to infinity. It is straightforward to note that as $$n \rightarrow \infty$$, $$t \rightarrow \infty$$, and hence the term $$2^{-c\epsilon^2t/4}$$ converges to zero. However, if $${\it{\Psi}}^t(b,g)(t+1)^{|\mathcal{Z}|^k}$$ grows faster than $$2^{c\epsilon^2t/4}$$, then we do not reach the desired goal. Our next theorems prove that under some mild conditions on the Markov process, for slow enough growth of $$b=b_n$$ with $$n$$, $${\it{\Psi}}^t(b,g)$$ does not grow too fast. First note that since, for any $$z^n\in\mathcal{Z}^n$$, $$\| \mu_{k_1}-\hat{p}^{(k_1)}(\cdot|z^n)\|_1\leq \| \mu_{k_2}-\hat{p}^{(k_2)}(\cdot|z^n)\|_1$$, for all $$k_1\leq k_2$$, to prove fast convergence of $$\hat{p}^{(k)}(\cdot|Z^n)$$ it is enough to prove this statement for $$k$$ large.

Theorem 4.3
Consider an aperiodic stationary first-order Markov chain $${\mathbb{\mathbf{X}}}$$, and its $$b$$-bit quantized version $${\mathbb{\mathbf{Z}}}$$, where $$Z_i=[X_i]_b$$ and $$\mathcal{Z}=\mathcal{X}_b$$. Let $$\mu_k^{(b)}$$ denote the $$k$$th order probability distribution of the process $${\mathbb{\mathbf{Z}}}$$, i.e. for any $$z^k\in\mathcal{Z}^k$$, \begin{align} \mu_k^{(b)}(z^k)={\rm{P}}(Z^k=z^k).
\end{align} (4.17) Let $$b=b_n= \lceil r\log\log n \rceil$$, where $$r\geq 1$$. Assume that there exists a sequence $$g=g_n$$, such that $$g_n=o(n)$$, and process $${\mathbb{\mathbf{X}}}$$ satisfies the following conditions: $$\lim_{n\to\infty}{\it{\Psi}}_1(b_n,g_n)=1 $$ and $$\lim_{b\to\infty}{\it{\Psi}}_2(b)=1$$, where functions $${\it{\Psi}}_1$$ and $${\it{\Psi}}_2$$ are defined in (4.14) and (4.15), respectively. Then, given $$\epsilon>0$$ and positive integer $$k$$, for $$n$$ large enough, \begin{align} {\rm{P}}\left( \|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon\right)\leq 2^{c\epsilon^2/4} (k+g_n)n^{|\mathcal{Z}|^k}2^{-{c \epsilon^2n\over 8(k+g_n)}}, \end{align} (4.18) where $$c=1/(2\ln 2)$$.

The proof is presented in Section 7.5.

Remark 4.1
Lemma 4.2 proves that for any fixed $$b$$, $${\it{\Psi}}_1(b,g)$$ converges to one, as $$g$$ grows without bound. However, in this article we are mainly interested in the cases where $$ b_n=\lceil r\log\log n \rceil$$. The condition on $${\it{\Psi}}_1$$ specified in Theorem 4.3 ensures that even if $$b$$ also grows to infinity, there exists a proper choice of sequence $$g_n$$ as a function of $$n$$, for which $${\it{\Psi}}_1(b_n,g_n)$$ still converges to one, as $$n$$ grows without bound.

Theorem 4.3 proves that if $$\lim_{n\to\infty}{\it{\Psi}}_1(b_n,g_n)=1 $$ and $$\lim_{b\to\infty}{\it{\Psi}}_2(b)=1$$, then the quantized version of an analog Markov process also has fast convergence rates. We refer to a Markov process that satisfies these two conditions with $$b_n = \lceil r\log\log n \rceil$$ as a weak $${\it{\Psi}}^*_q$$-mixing Markov process. To better understand these conditions, we next consider the piecewise-constant source studied in Section 3.2 and prove that it is a weak $${\it{\Psi}}^*_q$$-mixing Markov process.

Theorem 4.4
Consider a first-order stationary Markov process $${\mathbb{\mathbf{X}}}$$, such that conditioned on $$X_i=x_i$$, $$X_{i+1}$$ is distributed as $$(1-p)\delta_{x_i}+pf_c$$, where $$f_c$$ denotes an absolutely continuous distribution over $$\mathcal{X}=[0,1]$$. Further assume that there exists $$f_{\min}>0$$, such that $$f_c(x)\geq f_{\min},$$ for all $$x\in(0,1)$$. Then, for $$b=b_n=\lceil r\log\log n\rceil $$ and $$g=g_n=\lfloor \gamma \, r\log\log n \rfloor$$, where $$\gamma>-{1\over \log(1-p)}$$, $$\lim_{n\to\infty}{\it{\Psi}}_1(b_n,g_n)=1$$, and $${\it{\Psi}}_2(b)=1$$, for all $$b$$.

The proof is presented in Section 7.6.

5. Theoretical analysis of Q-MAP
In this section, we formalize Informal Result 1 presented in Section 1. The following theorem provides conditions for the success of the Q-MAP estimator, for the case where the response variables are noise-free. We state all the results for $${\it{\Psi}}^*$$-mixing processes, but they also apply to weak $${\it{\Psi}}^*_q$$-mixing Markov processes.

Theorem 5.1
Consider a $${\it{\Psi}}^*$$-mixing stationary process $${{\mathbb{\mathbf{X}}}}$$, and let $$Y^m=AX^n$$. Assume that the entries of the design matrix $$A$$ are i.i.d. $$\mathcal{N}(0,1)$$. Choose $$k$$, $$r>1$$ and $${\delta}>0$$, and let $$b=b_n=\lceil r\log\log n\rceil$$, $${\gamma}={\gamma}_n= b_n(\bar{d}_k({{\mathbb{\mathbf{X}}}})+{\delta})$$ and $$m=m_n\geq (1+\delta)n\bar{d}_k({{\mathbb{\mathbf{X}}}})$$.
Assume that there exists a constant $$f_{k+1}>0$$, such that for any quantization level $$b$$ and any $$u^{k+1}\in\mathcal{X}_b^{k+1}$$ with $${\rm{P}}([X^{k+1}]_b=u^{k+1})\neq 0$$, \begin{align} {\rm{P}}\left([X^{k+1}]_b=u^{k+1}\right)\geq {f_{k+1} |\mathcal{X}_b|^{-(k+1)}}.\label{eq:cond-f-k} \end{align} (5.1) Further, assume that $${\hat{X}}^n$$ denotes the solution of (2.5), where the coefficients are computed according to (1.6). Then, for any $${\epsilon}>0$$, \begin{align} \lim_{n\to\infty} {\rm{P}}\left( { 1\over \sqrt{n}}\|X^n-{\hat{X}}^n\|_2>{\epsilon}\right)=0. \end{align} (5.2)

The proof is presented in Section 7.7.

Remark 5.1
A technical condition required by Theorem 5.1 is the existence of a constant $$f_{k+1}$$ for which (5.1) holds. It is straightforward to confirm that this condition holds for many distributions of interest. For instance, consider an i.i.d. process $$X_i\stackrel{\rm i.i.d.}{\sim} p f_c+(1-p)f_d$$, where $$f_c$$ and $$f_d$$ denote an absolutely continuous distribution and a discrete distribution, respectively. If $$f_c$$ has bounded support and $$\inf f_c$$ over its support is non-zero, then (5.1) holds. Intuitively, this condition guarantees that the probability of every quantized sequence with non-zero probability cannot get too small. The reason this condition is required is that it simplifies bounding the Kullback–Leibler distance between the empirical distribution and the underlying distribution. We believe that the theorem holds in a more general setting, but its proof is left for future work.

We remind the reader that we also introduced a Lagrangian version of Q-MAP in (2.6). It turns out that we can derive the same performance guarantees for the Lagrangian Q-MAP as well.

Theorem 5.2
Consider a $${\it{\Psi}}^*$$-mixing stationary process $${{\mathbb{\mathbf{X}}}}$$. Let $$Y^m=AX^n$$, where the entries of $$A$$ are i.i.d. $$\mathcal{N}(0,1)$$. Choose $$k$$, $$r>1$$ and $${\delta}>0$$, and let $$b=b_n=\lceil r\log\log n\rceil$$, $${\lambda}={\lambda}_n=(\log n)^{2r}$$ and $$m=m_n\geq (1+\delta)n\bar{d}_k({{\mathbb{\mathbf{X}}}})$$. Assume that there exists a constant $$f_{k+1}>0$$, such that for any quantization level $$b$$, and any $$u^{k+1}\in\mathcal{X}_b^{k+1}$$ with $${\rm{P}}([X^{k+1}]_b=u^{k+1})\neq 0$$, (5.1) holds. Further, assume that $${\hat{X}}^n$$ denotes the solution of (2.6), where the coefficients are computed according to (1.6). Then, for any $${\epsilon}>0$$, \begin{align} \lim_{n\to\infty} {\rm{P}}\left( { 1\over \sqrt{n}}\|X^n-{\hat{X}}^n\|_2>{\epsilon}\right)=0. \end{align} (5.3)

The proof is presented in Section 7.8. To better understand the implications of Theorems 5.1 and 5.2, consider the case where the process $${{\mathbb{\mathbf{X}}}}$$ is stationary and memoryless. All such processes are $${\it{\Psi}}^*$$-mixing, and satisfy $$\bar{d}_k({{\mathbb{\mathbf{X}}}})=\bar{d}_0({{\mathbb{\mathbf{X}}}})$$, for all $$k\geq 0$$. Therefore, as long as $$m_n \geq (1+\delta)n\bar{d}_0({{\mathbb{\mathbf{X}}}})$$, asymptotically, the Q-MAP algorithm provides an accurate estimate of the parameter vector. On the other hand, since the process is i.i.d., $$\bar{d}_0({{\mathbb{\mathbf{X}}}})=\bar{d}(X_1)$$, where $$\bar{d}(X_1)$$ denotes the upper Rényi information dimension of $$X_1$$ [23]. For an i.i.d.
process whose marginal distribution is a mixture of discrete and continuous distributions, asymptotically, the Rényi information dimension of the marginal distribution characterizes the minimum sampling rate ($$m/n$$) required for an accurate recovery of the parameter vector [33]. Hence, for such i.i.d. sources, in a noiseless setting, the algorithm presented in (3.4) achieves the fundamental limits in terms of sampling rate. Finally, another interesting implication of Theorem 5.2 is the following. The Q-MAP optimization mentioned in (2.5) is not a convex optimization. Hence, its solution does not necessarily coincide with the solution of (2.6). However, at least in the noiseless setting, we can derive similar performance bounds for both.

6. Solving Q-MAP
6.1 PGD
The goal of this section is to analyze the performance of the PGD algorithm introduced in Section 1.2. The results are presented for $${\it{\Psi}}^*$$-mixing processes, but they also apply to weak $${\it{\Psi}}_q^*$$-mixing Markov processes with no change. Note that even though PGD algorithms have been studied extensively for convex optimization problems, since our optimization is discrete and consequently not convex, those analyses do not apply to our problem. Given $${\delta}>0$$, consider the Q-MAP optimization characterized as \begin{align} {\hat{X}}^n=&\mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n}\;\;\;\;\;\;\;\; \|Y^m-Au^n\|^2 \nonumber\\ & {\rm subject \; to}\;\;\;\;\; c_{{\bf w}}(u^n) \leq b(\bar{d}_k({{\mathbb{\mathbf{X}}}})+{\delta}). \end{align} (6.1) The corresponding PGD algorithm proceeds as follows. For $$t=1,2,\ldots,t_n$$, \begin{align} S^n(t+1)&={\hat{X}}^n(t)+\mu A^{\top}(Y^m-A{\hat{X}}^n(t))\nonumber\\ {\hat{X}}^n(t+1)&=\mathop{\text{arg min}}\limits_{u^n\in\mathcal{F}_o}\left\|u^n-S^n(t+1)\right\|\!, \end{align} (6.2) where $$\mathcal{F}_o = \Big\{u^n\in\mathcal{X}_b^n: \; c_{{\bf w}}(u^n) \leq b(\bar{d}_k({{\mathbb{\mathbf{X}}}})+{\delta}) \Big\}$$, defined earlier in (1.9). Theorem 6.1 below proves that, given a sufficient number of response variables, the PGD-based algorithm recovers the parameters $$X^n$$ with high probability, even in the presence of measurement noise.

Theorem 6.1
Consider a $${\it{\Psi}}^{*}$$-mixing process $${{\mathbb{\mathbf{X}}}}$$. Let $$Y^m=AX^n+Z^m$$, where the elements of the matrix $$A$$ are i.i.d. $$\mathcal{N}(0,1)$$ and $$Z_i$$, $$i=1,\ldots,m$$, are i.i.d. $$\mathcal{N}(0,\sigma^2)$$. Choose $$k$$, $$r>1$$ and $${\delta}>0$$, and let $$b=b_n=\lceil r\log\log n\rceil$$, $${\gamma}={\gamma}_n= b_n(\bar{d}_k({{\mathbb{\mathbf{X}}}})+{\delta})$$ and $$m=m_n=80 nb_n(\bar{d}_k({{\mathbb{\mathbf{X}}}})+{\delta})$$. Assume that there exists a constant $$f_{k+1}>0$$, such that for any quantization level $$b$$, and any $$u^{k+1}\in\mathcal{X}_b^{k+1}$$ with $${\rm{P}}([X^{k+1}]_b=u^{k+1})\neq 0$$, (5.1) holds. Let $$\mu={1\over m}$$, and consider $${\hat{X}}^n(t)$$, $$t=0,1,\ldots,t_n$$, generated according to (6.2). Define the error vector at iteration $$t$$ as \begin{align} E^n(t)={\hat{X}}^n(t)-[X^n]_b. \end{align} (6.3) Then, with probability approaching one, \begin{align} {1\over \sqrt{n}}\|E^n(t+1)\|\leq {0.9\over \sqrt{n}}\|E^n(t)\|+2\left(2+\sqrt{n\over m}\;\right)^2 2^{-b}+ { \sigma\over 2}\sqrt{b(\bar{d}_k({{\mathbb{\mathbf{X}}}})+3{\delta})\over m}, \end{align} (6.4) for $$t=1,2,\ldots$$.
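For intuition (this unrolling is ours and is not part of the theorem), write $$a_t\triangleq {1\over \sqrt{n}}\|E^n(t)\|$$ and let $$c$$ denote the sum of the last two terms on the right-hand side of (6.4). Iterating the recursion $$a_{t+1}\leq 0.9\, a_t+c$$ and summing the geometric series gives   \[ a_{t}\;\leq\; 0.9^{\,t-1}\, a_1 + c\sum_{j=0}^{t-2}0.9^{\,j}\;\leq\; 0.9^{\,t-1}\, a_1 + 10\,c, \] so the normalized error decays geometrically until it reaches a floor of order $$2^{-b}$$ plus a term proportional to the noise level $$\sigma$$.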
The proof is presented in Section 7.9. Comparing this result with Theorem 5.2 reveals that the minimum value of $$m$$ required in this theorem is $$80b_n = 80\lceil r\log\log n\rceil $$ times higher than the number of response variables required in Theorem 5.2. One can also decrease (increase) the factor $$80$$ and slow down (speed up) the convergence rate of the algorithm. At this point it is not clear to us whether the factor $$r\log\log n$$ in Theorem 6.1 (for the number of response variables) is necessary in general or is an artifact of our proof technique. For some specific priors, such as the spike and slab distribution discussed earlier, it is known that, with a slight modification of the algorithm, this factor can be improved. In that case, given the special form of the coefficients, we may let $$b$$ grow to infinity for a fixed $$n$$. Then the algorithm becomes equivalent to the iterative hard thresholding (IHT) algorithm introduced in [2]. The analysis in [2] shows that the number of response variables $$m_n$$ required by the IHT algorithm is proportional to $$n$$ and does not have the $$\log \log n$$ factor that appears in Theorem 6.1. In Theorem 6.1 and all of the previous results, the elements of the design matrix $$A$$ were assumed to be generated according to a $$\mathcal{N}(0,1)$$ distribution. In the noisy setup, where the response variables are distorted by noise of variance $$\sigma^2$$, this model implies a per-response signal-to-noise ratio (SNR) that grows linearly with $$n$$. This can make the result of the previous theorem misleading: if we consider the per-element error, i.e. $${1\over \sqrt{n}}\|E^n(t+1)\|$$, then the error seems to go to zero. To fix this issue, we assume that the elements of $$A$$ are generated according to $$\mathcal{N}(0,{1\over n})$$. The following corollary restates the result of Theorem 6.1 under this scaling and an appropriate adjustment of the coefficient $$\mu$$. Corollary 6.1 Consider the setup of Theorem 6.1, where the elements of $$A$$ are generated i.i.d. $$\mathcal{N}(0,{1\over n})$$. Let   \begin{align} S^n(t+1)&={\hat{X}}^n(t)+{n\over m}A^{\top}(Y^m-A{\hat{X}}^n(t))\nonumber\\ {\hat{X}}^n(t+1)&=\mathop{\text{arg min}}\limits_{u^n\in\mathcal{F}_o}\left\|u^n-S^n(t+1)\right\|\!.\label{eq:PGD-noisy-update} \end{align} (6.5) Then, with probability approaching one,   \begin{align} {1\over \sqrt{n}}\|E^n(t+1)\|\leq {0.9\over \sqrt{n}}\|E^n(t)\|+{2(\sqrt{n}+2\sqrt{m})^2\over m}2^{-b}+ { \sigma\over 2}\sqrt{nb(\bar{d}_k({{\mathbb{\mathbf{X}}}})+3{\delta})\over m}, \end{align} (6.6) for $$t=1,2,\ldots$$. Note that for $$m=m_n=80 nb(\bar{d}_k({{\mathbb{\mathbf{X}}}})+3{\delta})$$,   \[ { \sigma\over 2}\sqrt{nb(\bar{d}_k({{\mathbb{\mathbf{X}}}})+3{\delta})\over m}\leq {\sigma\over 12}. \] 6.2 Discussion of computational complexity of PGD As explained earlier, at iteration $$t+1$$, the PGD-based algorithm updates its estimate $${\hat{X}}^n(t)$$ to $${\hat{X}}^n(t+1)$$ by performing the following two steps: $$S^n(t+1)={\hat{X}}^n(t)+\mu A^{\top}(Y^m-A{\hat{X}}^n(t))$$, $${\hat{X}}^n(t+1)={\text{arg min}}_{u^n\in\mathcal{F}_o}\left\|u^n-S^n(t+1)\right\|$$. Clearly, the challenging part is performing the second step, which is the projection onto the set $$\mathcal{F}_o$$. For some special distributions, such as the spike and slab prior discussed in Section 3.1, piecewise-constant processes discussed in Section 3.2, and their extensions, this projection step is not complicated.
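As a rough preview of the simplest case, the following minimal sketch shows what this projection can look like for a sparse (spike and slab) vector, assuming, purely for illustration, a bounded alphabet and a known sparsity budget; `quantize_b` and `k_sparse` are hypothetical stand-ins rather than quantities defined in this article.

```python
import numpy as np

def quantize_b(x, b):
    # b-bit quantizer: keep b bits after the binary point
    # (an illustrative stand-in for [x]_b, assuming a bounded alphabet)
    return np.floor(x * 2**b) / 2**b

def project_sparse(s, b, k_sparse):
    """Illustrative projection onto F_o for the sparse case:
    quantize every coordinate, then keep the k_sparse entries of largest
    magnitude and set the rest to zero (an IHT-style hard threshold)."""
    u = quantize_b(s, b)
    keep = np.argsort(np.abs(u))[-k_sparse:]   # indices of the largest entries
    out = np.zeros_like(u)
    out[keep] = u[keep]
    return out
```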
For instance, for the aforementioned sparse vector, $$\mathcal{F}_o$$ contains sparse quantized vectors, and hence the projection step is just keeping the quantized versions of the largest components of $$S^n(t+1)$$ and setting the rest to zero. This is very similar to the IHT algorithm [2]. However, for more general distributions this projection step may be challenging. Hence, to make the PGD method efficient, we need to be able to solve the following optimization efficiently:   \begin{align}\label{eq:1} {\hat{X}}^n\;=\;&\mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^{n}} \;\;\;\;\;\;\;\left\|u^n-x^n\right\| \nonumber \\ &{\rm subject \; to} \;\; \;c_{{\bf w}}(u^n) \leq \gamma, \end{align} (6.7) where $$x^n\in{\rm I}\kern-0.20em{\rm R}^n$$, the weights $${\bf w}=\{w_{a^{k+1}}: a^{k+1}\in \mathcal{X}_b^{k+1}\}$$ and $$\gamma\in{\rm I}\kern-0.20em{\rm R}^+$$ are all given input parameters. Equation (6.7) can be stated in the Lagrangian form as   \begin{align} {\hat{X}}^n\;=\;&\mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^{n}} \Big[{1\over n^2}\left\|u^n-x^n\right\|^2 +{\alpha} c_{{\bf w}}(u^n)\Big],\label{eq:lagrangian-eq-projection} \end{align} (6.8) where $${\alpha}>0$$ is a parameter that depends on $$\gamma$$. Since $$\|u^n-x^n\|^2=\sum_{i=1}^n(u_i-x_i)^2$$, the optimization stated in (6.8) is exactly the optimization studied in [14]. It has been proved in [14] that the solution of (6.8) can be found efficiently via standard dynamic programming (the Viterbi algorithm) [31]. (For further information, refer to [14].) The question is whether, for an appropriate choice of $${\alpha}$$, the minimizers of (6.7) and (6.8) are the same. If the answer to this question is affirmative, it implies that both steps of the PGD method can be implemented efficiently. In the following, we intuitively argue why we believe that this might be the case. Making the argument rigorous and a deeper investigation of this connection are left to future research. Consider partitioning the set of sequences in $$\mathcal{X}_b^n$$ based on their $$(k+1)$$th order empirical distributions, which are referred to as their $$(k+1)$$th order types. For a $$(k+1)$$th order type $$q_{k+1}(\cdot): \mathcal{X}_b^{k+1}\to {\rm I}\kern-0.20em{\rm R}^+$$, let $$\mathcal{T}_n(q_{k+1})$$ denote the set of sequences in $$\mathcal{X}_b^n$$ whose $$(k+1)$$th order types agree with $$q_{k+1}$$. That is,   \begin{align} \mathcal{T}_n(q_{k+1})\triangleq \left\{u^n: \hat{p}^{(k+1)}(a^{k+1}|u^n)=q_{k+1}(a^{k+1}), \forall \; a^{k+1}\in\mathcal{X}_b^{k+1}\right\}\!. \end{align} (6.9) Let $$\mathcal{P}_{n}^{k+1}$$ denote the set of all possible $$(k+1)$$th order types, for sequences in $$\mathcal{X}_b^n$$. In other words,   \begin{align} \mathcal{P}_{n}^{k+1}\triangleq\left \{ \hat{p}^{(k+1)}(\cdot|u^n): u^n\in\mathcal{X}_b^n\right\}\!. \end{align} (6.10) It can be proved (Theorem I.6.14 in [25]) that   \begin{align} |\mathcal{P}_{n}^{k+1}|\leq (n+1)^{|\mathcal{X}_b|^{k+1}}. \end{align} (6.11) Furthermore, we have   \begin{align} \mathcal{X}_b^n=\cup_{q_{k+1}\in\mathcal{P}_{n}^{k+1}} \mathcal{T}_n(q_{k+1}).
\end{align} (6.12) Therefore,   \begin{align} &\min_{u^n\in\mathcal{X}_b^{n}} \Bigg[{1\over n}\left\|u^n-x^n\right\|^2 +{\alpha} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|u^n)\Bigg]\nonumber\\ &\quad =\min_{q_{k+1} \in \mathcal{P}_n^{k+1}}\min_{u^n\in\mathcal{T}_n(q_{k+1}) } \Bigg[{1\over n}\left\|u^n-x^n\right\|^2 +{\alpha} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}q_{k+1}(a^{k+1})\Bigg]\nonumber\\ &\quad =\min_{q_{k+1} \in \mathcal{P}_n^{k+1}} \Bigg[ \Big[\min_{u^n\in\mathcal{T}_n(q_{k+1}) } {1\over n}\left\|u^n-x^n\right\|^2 \Big]+{\alpha} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}q_{k+1}(a^{k+1})\Bigg],\label{eq:2} \end{align} (6.13) where the last line follows because $$ \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|u^n)$$ only depends on the $$(k+1)$$th order type of sequence $$u^n$$. For any type $$q_{k+1} \in \mathcal{P}_n^{k+1}$$, define the minimum distortion attainable by sequences of that type as $$D(q_{k+1},x^n)$$, i.e.   \begin{align} D(q_{k+1},x^n) = \min_{u^n \in \mathcal{T}_n(q_{k+1}) }{1\over n} \left\|u^n-x^n\right\|^2. \end{align} (6.14) Then (6.13) and (6.7) can be written as   \begin{align} \min_{q_{k+1} \in \mathcal{P}_{n}^{k+1}} \Bigg[D(q_{k+1},x^n) +{\alpha}\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}q_{k+1}(a^{k+1}) \Bigg] \end{align} (6.15) and   \begin{align} \min_{q_{k+1} \in \mathcal{P}_{n}^{k+1}: \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}q_{k+1}(a^{k+1})\leq {\gamma}} D(q_{k+1},x^n), \end{align} (6.16) respectively. Both of these optimizations are discrete. However, since   \[ \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}q_{k+1}(a^{k+1}) \] is a convex function of $$q_{k+1}$$, if, in the high-dimensional setting, for input sequences $$x^n$$ of interest, $$D(q_{k+1},x^n)$$ also behaves almost as a convex function of $$q_{k+1}$$, then we expect the two optimizations to be the same, for a proper choice of the parameter $${\alpha}$$. In the remainder of this section, we argue why, in a high-dimensional setting, we conjecture that $$D(q_{k+1},x^n)$$ satisfies the mentioned property. We leave further investigation of the subject to future research. First, note that if $$x^n$$ is almost stationary, for instance if it is generated by a Markov process with finite memory, then $$D(q_{k+1},x^n)$$ depends only on $$q_{k+1}$$ and finite-order empirical distributions of $$x^n$$, and not otherwise on $$n$$ or $$x^n$$. Now, assuming that this is true, consider $$q^{(1)}_{k+1}$$ and $$q^{(2)}_{k+1}$$ in $$\mathcal{P}_n^{k+1}$$. Given $$\theta\in(0,1)$$, let $$n_1=\lfloor \theta n\rfloor$$ and $$n_2=n-n_1$$. Further, let $${\tilde{x}}^{n_1}$$ and $$\bar{x}^{n_2}$$ denote the minimizers of $$D(q^{(1)}_{k+1},x^{n_1})$$ and $$D(q^{(2)}_{k+1},x_{n_1+1}^{n})$$, respectively. Assume that $$\theta q^{(1)}_{k+1}+(1-\theta)q^{(2)}_{k+1}\in \mathcal{P}_n^{k+1}$$ and let $${\hat{X}}^n=[{\tilde{x}}^{n_1},\bar{x}^{n_2}]$$. Then, for large $$n$$, it is straightforward to check that $$\hat{p}^{(k+1)}(\cdot|{\hat{X}}^n)\approx \theta \hat{p}^{(k+1)}(\cdot|{\tilde{x}}^{n_1})+(1-\theta) \hat{p}^{(k+1)}(\cdot|\bar{x}^{n_2})=\theta q^{(1)}_{k+1}+(1-\theta)q^{(2)}_{k+1}$$. Therefore,   \begin{align} n D\left(\theta q^{(1)}_{k+1} + (1-\theta)q^{(2)}_{k+1},x^n\right) &\leq \|x^n-{\hat{X}}^n\|^2\nonumber\\ &= \|x^{n_1}-{\tilde{x}}^{n_1}\|^2+\|x_{n_1+1}^{n}-\bar{x}^{n_2}\|^2\nonumber\\ &= n_1D\left(q^{(1)}_{k+1},x^{n_1}\right)+n_2D\left(q^{(2)}_{k+1},x_{n_1+1}^{n}\right).
\end{align} (6.17) Dividing both sides by $$n$$, it follows that   \begin{align} D\left(\theta q^{(1)}_{k+1} + (1-\theta)q^{(2)}_{k+1},x^n\right)\leq \theta D\left(q^{(1)}_{k+1},x^{n_1}\right)+(1-\theta)D\left(q^{(2)}_{k+1},x_{n_1+1}^{n}\right). \label{eq:3} \end{align} (6.18) Therefore, if, as we expect, for large values of $$n$$ and stationary sequences $$x^n$$, $$D(q_{k+1},x^n)$$ depends on $$x^n$$ only through its empirical distribution, then in (6.18), $$x^{n_1}$$ and $$x_{n_1+1}^{n}$$, which have almost the same empirical distribution as $$x^n$$, can be replaced by $$x^n$$. This establishes our conjecture about the almost convexity of the function $$D$$. 7. Proofs 7.1 Preliminaries on information theory Before presenting the proofs, in this section, we review some preliminary definitions and concepts that are used in some of the proofs. Consider a stationary process $${\mathbb{\mathbf{U}}}=\{U_i\}_{i=1}^{\infty}$$, with finite alphabet $$\mathcal{U}$$. The entropy rate of process $${\mathbb{\mathbf{U}}}$$ is defined as   \begin{align} \bar{H}({\mathbb{\mathbf{U}}})\triangleq \lim_{k\to\infty}H\left(U_{k+1}|U^k\right)\!. \end{align} (7.1) Consider $$u^n\in\mathcal{U}^n$$, where $$\mathcal{U}$$ is a finite set. The $$(k+1)$$th order empirical distribution of $$u^n$$ is defined in (1.2). The $$k$$th order conditional empirical entropy of $$u^n$$ is defined as $$\hat{H}_k(u^n)=H(U_{k+1}|U^k)$$, where $$U^{k+1}$$ is distributed as $$\hat{p}^{(k+1)}(\cdot|u^n)$$. In other words,   \begin{align} \hat{H}_k(u^n)=-\sum_{a^{k+1}\in\mathcal{U}^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|u^n)\log{\hat{p}^{(k+1)}(a^{k+1}|u^n) \over \hat{p}^{(k)}(a^k|u^n)}. \end{align} (7.2) In some of the proofs, we employ a compression scheme called Lempel–Ziv. Compression schemes aim to represent a sequence $$u^n \in\mathcal{U}^n$$ in as few bits as possible. It turns out that if $$u^n$$ is a sample of a finite-alphabet stationary ergodic process $${\mathbb{\mathbf{U}}}$$, then, asymptotically, the smallest expected number of bits per symbol required to represent $$u^n$$ is $$ \bar{H}({\mathbb{\mathbf{U}}})$$. Compression algorithms that achieve this bound are called optimal. One of the well-known examples of optimal compression schemes is Lempel–Ziv (LZ) coding [34]. (LZ is also a universal compression code, since it does not use any information regarding the distribution of the process.) In summary, the LZ compression code operates as follows: it first incrementally parses the input sequence into unique phrases, such that each phrase is the shortest phrase that has not been seen earlier. Then, each phrase is encoded by (i) an index pointing to the location of the earlier phrase that consists of the current phrase without its last symbol, and (ii) the last symbol of the phrase. Given $$u^n\in\mathcal{U}^n$$, let $$\ell_{\rm LZ}(u^n)$$ denote the length of the binary coded sequence assigned to $$u^n$$ using the LZ compression code. Note that since the LZ algorithm assigns a unique coded sequence to every input sequence, we have   \begin{align} |\{u^n: \ell_{\rm LZ}(u^n)\leq r\}|\leq \sum_{i=1}^r2^i\leq 2^{r+1}.\label{eq:LZ-sequences} \end{align} (7.3) The LZ length function $$\ell_{\rm LZ}(\cdot)$$ is mentioned in some of the following proofs because of its connections with the conditional empirical entropy function $$\hat{H}_k(\cdot)$$.
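To make these quantities concrete, the following sketch computes the empirical distribution of a sequence, the conditional empirical entropy $$\hat{H}_k(u^n)$$ of (7.2) and the number of phrases produced by LZ-style incremental parsing. It is only an illustration of the definitions: the alphabet handling is simplified, the $$k$$th order marginal is taken directly from the $$(k+1)$$th order counts, and the closing comment on code length is a rough heuristic rather than the exact $$\ell_{\rm LZ}$$ used in the proofs.

```python
import math
from collections import Counter

def empirical_dist(u, k):
    """k-th order empirical distribution of the sequence u:
    relative frequencies of its length-k windows."""
    counts = Counter(tuple(u[i:i + k]) for i in range(len(u) - k + 1))
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def conditional_empirical_entropy(u, k):
    """hat H_k(u^n) = -sum p^(k+1) log2( p^(k+1) / p^(k) ), as in (7.2)."""
    p_k1 = empirical_dist(u, k + 1)
    p_k = Counter()                       # marginal over the first k symbols
    for a, p in p_k1.items():
        p_k[a[:k]] += p
    return -sum(p * math.log2(p / p_k[a[:k]]) for a, p in p_k1.items())

def lz_phrase_count(u):
    """Number of phrases in LZ-style incremental parsing: each new phrase is
    the shortest prefix of the remaining input that has not been seen before.
    A parse with c phrases over an alphabet of size 2^b can be described with
    roughly c*(log2(c) + b) bits, in the spirit of ell_LZ described above."""
    phrases, current, count = set(), (), 0
    for symbol in u:
        current += (symbol,)
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ()
    return count
```

For example, for a long sample of an i.i.d. equiprobable binary sequence, `conditional_empirical_entropy(u, k)` is close to $$1$$ bit per symbol for any small $$k$$, consistent with the entropy rate in (7.1).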
This connection, established in [22] for binary sources and extended in [15] to general sources with alphabet $$\mathcal{U}$$ such that $$|\mathcal{U}|=2^b$$, states that, for all $$u^n\in\mathcal{U}^n$$,   \begin{align} {1\over n}\ell_{\rm LZ}(u^n)\leq \hat{H}_k(u^n)+{b(kb+b+3)\over (1-\epsilon_n)\log n-b}+\gamma_n,\label{eq:connections-LZ-Hk} \end{align} (7.4) where   \begin{align} \epsilon_n={2b+\log\left(2^b+{2^b-1\over b}\log n -2\right)\over \log n}, \end{align} (7.5) and $$\gamma_n=o(1)$$ does not depend on the sequence $$u^n$$ or on $$b$$. Finally, we close this section with two lemmas related to the continuity properties of the entropy function and the Kullback–Leibler distance. Lemma 7.1 (Theorem 17.3.3 in [6]) Consider distributions $$p$$ and $$q$$ on a finite alphabet $$\mathcal{U}$$ such that $$\|p-q\|_1\leq \epsilon$$. Then,   \begin{align} |H(p)-H(q)|\leq -\epsilon\log \epsilon +\epsilon\log |\mathcal{U}|. \end{align} (7.6) Lemma 7.2 Consider distributions $$p$$ and $$q$$ over a discrete set $$\mathcal{U}$$ such that $$\|p-q\|_1\leq \epsilon$$. Further assume that $$p\ll q$$, and let $$q_{\min}=\min_{u\in\mathcal{U}: q(u)\neq 0} q(u)$$. Then,   \begin{align} D(p\| q)\leq -\epsilon\log \epsilon +\epsilon \log |\mathcal{U}|-\epsilon \log q_{\min}. \end{align} (7.7) Proof. Let $$\mathcal{U}^*\triangleq \{u\in \mathcal{U}:\;q(u)\neq 0\}.$$ Since $$p\ll q$$, if $$q(u)=0$$, then $$p(u)=0$$. Therefore, by definition   \begin{align} D(p\| q)&= \sum_{u\in\mathcal{U}} p(u)\log {p(u)\over q(u)}\nonumber\\ &= \sum_{u\in\mathcal{U}^*} p(u)\log {p(u)\over q(u)}\nonumber\\ &= \sum_{u\in\mathcal{U}^*} p(u)\log p(u)-\sum_{u\in\mathcal{U}^*} p(u)\log q(u)\nonumber\\ &= \sum_{u\in\mathcal{U}^*} p(u)\log p(u)-\sum_{u\in\mathcal{U}^*} (p(u)-q(u)+q(u))\log q(u)\nonumber\\ &= H(q)-H(p)-\sum_{u\in\mathcal{U}^*} (p(u)-q(u))\log q(u). \end{align} (7.8) Hence, by the triangle inequality,   \begin{align} D(p\| q)&\leq |H(q)-H(p)|-\sum_{u\in\mathcal{U}^*} |p(u)-q(u)|\log q(u)\nonumber\\ &\stackrel{(a)}{\leq} -\epsilon\log \epsilon +\epsilon \log |\mathcal{U}|-\sum_{u\in\mathcal{U}^*} |p(u)-q(u)|\log q(u)\nonumber\\ &\leq -\epsilon\log \epsilon +\epsilon \log |\mathcal{U}|+\log \left({1\over q_{\min}}\right)\sum_{u\in\mathcal{U}^*} |p(u)-q(u)|\nonumber\\ &\stackrel{(b)}{\leq} -\epsilon\log \epsilon +\epsilon \log |\mathcal{U}|-\epsilon\log q_{\min}, \end{align} (7.9) where $$(a)$$ and $$(b)$$ follow from Theorem 17.3.3 in [6] and $$\|p-q\|_1\leq \epsilon$$, respectively. □ 7.2 Useful concentration lemmas Lemma 7.3 Consider $$u^n\in {\rm I}\kern-0.20em{\rm R}^n$$ and $$v^n\in {\rm I}\kern-0.20em{\rm R}^n$$ such that $$\|u^n\|=\|v^n\|=1$$. Let $$\alpha\triangleq \langle u^n,v^n \rangle $$. Consider matrix $$A\in{\rm I}\kern-0.20em{\rm R}^{m\times n}$$ with i.i.d. standard normal entries. Then, for any $$\tau>0$$,   \begin{align} {\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -\tau\right)\leq {\rm e}^{m((\alpha-\tau)s)-{m\over 2}\ln ((1+s\alpha)^2-s^2)}, \end{align} (7.10) where $$s>0$$ is a free parameter smaller than $${1\over 1-\alpha}$$. Proof. Let $$A_i^n$$ denote the $$i$$th row of matrix $$A$$.
Then,   \begin{align} Au^n=\left[\begin{array}{c} \langle A_1^n,u^n \rangle\\ \langle A_2^n,u^n \rangle\\ \vdots\\ \langle A_m^n,u^n \rangle\\ \end{array} \right], \;\;\;\; Av^n=\left[\begin{array}{c} \langle A_1^n,v^n \rangle\\ \langle A_2^n,v^n \rangle\\ \vdots\\ \langle A_m^n,v^n \rangle\\ \end{array} \right] \end{align} (7.11) and   \begin{align} {1\over m}\langle Au^n,Av^n\rangle= {1\over m}\sum_{i=1}^m\langle A_i^n,u^n \rangle \langle A_i^n,v^n \rangle. \end{align} (7.12) Let   \begin{align} X_i=\langle A_i^n,u^n \rangle\end{align} (7.13) and   \begin{align} Y_i=\langle A_i^n,v^n \rangle. \end{align} (7.14) Since $$A$$ is generated by drawing its entries from an i.i.d. standard normal distribution, and $$\|u^n\|=\|v^n\|=1$$, $$\{(X_i,Y_i)\}_{i=1}^m$$ is a sequence of i.i.d. random vectors. To derive the joint distribution of $$(X_i,Y_i)$$, note that both $$X_i$$ and $$Y_i$$ are linear combination of Gaussian random variables. Therefore, they are also jointly distributed Gaussian random variables, and hence it suffices to find their first- and second-order moments. Note that   \begin{gather} {\rm{E}}[X_i]={\rm{E}}[Y_i]=0,\\ \end{gather} (7.15)  \begin{gather} {\rm{E}}[X_i^2]=\sum_{j,k}{\rm{E}}[A_{i,j}A_{i,k}]u_ju_k=\sum_{j}{\rm{E}}[A_{i,j}^2]u_j^2=\sum_{j}u_j^2=1 \end{gather} (7.16) and similarly $${\rm{E}}[Y_i^2]=1$$. Also,   \begin{align} {\rm{E}}[X_iY_i]&={\rm{E}}[\langle A_i^n,u^n \rangle\langle A_i^n,v^n \rangle]\nonumber\\ &=\sum_{j,k}{\rm{E}}[A_{i,j}A_{i,k}]u_jv_k\nonumber\\ &=\sum_{j}{\rm{E}}[A_{i,j}^2]u_jv_j\nonumber\\ &=\langle u^n,v^n \rangle=\alpha. \end{align} (7.17) Therefore, in summary,   \begin{align}(X_i,Y_i)\sim \mathcal{N}\left( \left[ \begin{array}{c} 0\\ 0 \end{array}\right],\left[ \begin{array}{cc} 1&\alpha\\ \alpha& 1 \end{array}\right]\right).\end{align} (7.18) For any $$s'>0$$, by the Chernoff bounding method, we have   \begin{align} {\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -\tau\right)&= {\rm{P}}\left({1\over m}\sum_{i=1}^m(X_iY_i-\alpha)\leq -\tau\right)\nonumber\\ &= {\rm{P}}\left({s'\over m}\sum_{i=1}^m(X_iY_i-\alpha)\leq -s'\tau\right)\nonumber\\ &= {\rm{P}}\left( {\rm e}^{s'(\tau-\alpha)}\leq {\rm e}^{-{s'\over m}\sum_{i=1}^m X_iY_i}\right)\nonumber\\ &\leq {\rm e}^{s'(\alpha-\tau)}{\rm{E}}\left[ {\rm e}^{-{s'\over m}\sum_{i=1}^m X_iY_i}\right]\nonumber\\ &= {\rm e}^{s'(\alpha-\tau)}\left({\rm{E}}\left[ {\rm e}^{-{s'\over m}X_1Y_1}\right]\right)^m,\label{eq:deviation-tau} \end{align} (7.19) where the last line follows because $$(X_i,Y_i)$$ is an i.i.d. sequence. In the following we compute $${\rm{E}}[ {\rm e}^{{s\over m}X_1Y_1}]$$. Let $$A=(X_1+Y_1)/2$$ and $$B=(X_1-Y_1)/2$$. Then, $${\rm{E}}[A]={\rm{E}}[B]=0$$ and   \begin{align} {\rm{E}}[A^2]&={1+\alpha\over 2},\\ \end{align} (7.20)  \begin{align} {\rm{E}}[B^2]&={1-\alpha\over 2} \end{align} (7.21) and $${\rm{E}}[AB]={\rm{E}}[(X_1^2-Y_1^2)/4]=0$$. Therefore, $$A$$ and $$B$$ are independent Gaussian random variables. Therefore,   \begin{align} {\rm{E}}[ {\rm e}^{-{s'\over m}X_1Y_1}] &= {\rm{E}}\left[ {\rm e}^{-{s'\over m}(A+B)(A-B)}\right]\nonumber\\ &= {\rm{E}}\left[ {\rm e}^{-{s'\over m}A^2}\right] {\rm{E}}\left[{\rm e}^{{s'\over m}B^2}\right]. \end{align} (7.22) Given $$Z\sim \mathcal{N}(0,\sigma^2)$$, it is straightforward to show that, for $$\lambda>-1/(2\sigma^2)$$,   \begin{align} {\rm{E}}\left[{\rm e}^{-\lambda Z^2}\right]={1\over \sqrt{1+2\lambda \sigma^2}}. 
\end{align} (7.23) Therefore, for $${s'\over m}<{1\over 1-\alpha}$$,   \begin{align} {\rm{E}}\left[ {\rm e}^{-{s'\over m}X_1Y_1}\right] &= {1\over \sqrt{(1+{s'\over m}(1+\alpha))(1-{s'\over m}(1-\alpha))}}\nonumber\\ &= {1\over \sqrt{(1+{s'\alpha\over m})^2 -{s'^2\over m^2}}}.\label{eq:E-XY} \end{align} (7.24) Therefore, combining (7.19) and (7.24), it follows that   \begin{align} {\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -\tau\right)&\leq {\rm e}^{s'(\alpha-\tau)}{\rm e}^{-{m\over 2}\ln\left( \left(1+{s'\alpha\over m}\right)^2-\left({s'\over m}\right)^2 \right)}.\label{eq:sp-s-last} \end{align} (7.25) Replacing $$s'/m$$ with $$s$$ in (7.25) yields the desired result. □ Corollary 7.1 Consider $$u^n\in {\rm I}\kern-0.20em{\rm R}^n$$ and $$v^n\in {\rm I}\kern-0.20em{\rm R}^n$$ such that $$\|u^n\|=\|v^n\|=1$$. Also, consider matrix $$A\in{\rm I}\kern-0.20em{\rm R}^{m\times n}$$ with i.i.d. standard normal entries. Then,   \begin{align} {\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -0.45\right)\leq 2^{-0.05m}. \end{align} (7.26) Proof. From Lemma 7.3, for $$\alpha=\langle u^n,v^n\rangle$$, and $$s<1/(1-\alpha)$$,   \begin{align} {\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\alpha\leq -0.45 \right)\leq {\rm e}^{m((\alpha-0.45)s)-{m\over 2}\ln ((1+s\alpha)^2-s^2)}=2^{-m f(\alpha,s)}, \end{align} (7.27) where   \begin{align} f(\alpha,s)=(\log {\rm e})\left({1\over 2}\ln((1+s\alpha)^2-s^2)-(\alpha-0.45)s\right). \end{align} (7.28) Figure 1 plots $$\max_{s\in(0,{1\over 1-\alpha})} f(\alpha,s)$$ and shows that   \begin{align} \min_{\alpha\in(-1,1)}\max_{s\in\left(0,{1\over 1-\alpha}\right)} f(\alpha,s)\; \geq \; 0.05. \end{align} (7.29) □ Fig. 1. $$\max_{s\in(0,{1\over 1-\alpha})} f(\alpha,s).$$ The following two lemmas are proved in [13]. Lemma 7.4 ($$\chi^2$$ concentration) Fix $$\tau>0$$, and let $$U_i\stackrel{\rm i.i.d.}{\sim}\mathcal{N}(0,1)$$, $$i=1,2,\ldots,m$$. Then,   \begin{align} {\rm{P}}\left( \sum_{i=1}^m U_i^2 <m(1- \tau) \right) \leq {\rm e} ^{\frac{m}{2}(\tau + \ln(1- \tau))} \end{align} (7.30) and   \begin{align}\label{eq:chisq} {\rm{P}}\left( \sum_{i=1}^m U_i^2 > m(1+\tau) \right) \leq {\rm e} ^{-\frac{m}{2}(\tau - \ln(1+ \tau))}. \end{align} (7.31) Lemma 7.5 Consider $$U^n$$ and $$V^n$$, where, for each $$i$$, $$U_i$$ and $$V_i$$ are two independent standard normal random variables. Then the distribution of $$\langle U^n,V^n \rangle=\sum_{i=1}^nU_iV_i$$ is the same as the distribution of $$\|U^n\|G$$, where $$G\sim\mathcal{N}(0,1)$$ is independent of $$\|U^n\|$$. 7.3 Proof of Lemma 4.1 Before presenting the proof, we establish some preliminary results. Consider an analog process $${\mathbb{\mathbf{X}}}=\{X_i\}$$ with alphabet $$\mathcal{X}=[l,u]$$, where $$l,u\in{\rm I}\kern-0.20em{\rm R}$$. Let process $${\mathbb{\mathbf{Z}}}=\{Z_i\}$$ denote the $$b$$-bit quantized version of process $${\mathbb{\mathbf{X}}}$$. That is, $$Z_i=[X_i]_b$$, and the alphabet of process $${\mathbb{\mathbf{Z}}}$$ is $$\mathcal{Z}=\mathcal{X}_b=\{[x]_b: x\in\mathcal{X}\}$$. For $$i=1,\ldots,k+g$$, define a sequence of length $$t$$, $$\{S^{(i)}_j\}_{j=1}^t$$ over super-alphabet $$\mathcal{Z}^k$$ as follows.
For $$i=1$$, $$\{S^{(1)}_j\}_{j=1}^t$$ is defined as   \begin{align} &\underbrace{Z_1,\ldots,Z_k}_{S_1^{(1)}},Z_{k+1},\ldots,Z_{k+g},\underbrace{Z_{k+g+1},\ldots,Z_{2k+g}}_{S_2^{(1)}},\\ \end{align} (7.32)  \begin{align} &{Z_{2k+g+1},\ldots \ldots}, \underbrace{Z_{(t-1)(k+g)+1},\ldots,Z_{(t-1)(k+g)+k}}_{S_t^{(1)}},Z_{t(k+g)-g+1},\ldots,Z_n. \end{align} (7.33) Similarly, $$\left\{S^{(i)}_j\right\}_{j=1}^t$$, $$i=1,\ldots,k+g$$, is defined by starting the grouping of the symbols at $$Z_i$$. In other words,   \begin{align} S^{(i)}_j\triangleq Z_{(k+g)(j-1)+i}^{(k+g)(j-1)+i+k-1}. \end{align} (7.34) For instance, the sequence $$\left\{S^{(k+g)}_j\right\}_{j=1}^t$$, which corresponds to the largest shift at the beginning, is defined as   \begin{align} &Z_1,\ldots,Z_{k+g-1},\underbrace{Z_{k+g},\ldots,Z_{2k+g-1}}_{S_1^{(k+g)}},Z_{2k+g},\ldots,Z_{2k+2g-1},\\ \end{align} (7.35)  \begin{align} &\underbrace{Z_{2k+2g},\ldots,Z_{3k+2g-1}}_{S_2^{(k+g)}},Z_{3k+2g}\ldots. \end{align} (7.36) This definition implies that $$n$$ satisfies   \begin{align} (t-1)(k+g)+2k+g-1\leq n<(t-1)(k+g)+2k+g-1+k+g.\label{eq:bound-n} \end{align} (7.37) That is, $$t$$ is the only integer in the $$\big({n-2k-g+1\over k+g},{n-k+1\over k+g} \big]$$ interval, or in other words, $$t=\lfloor{n-k+1\over k+g}\rfloor$$. Before we prove Lemma 4.1, we prove the following auxiliary lemma. Lemma 7.6 For any given $$\epsilon>0$$ and any positive integers $$g$$ and $$k$$ such that $$4(k+g)/(n-k)<\epsilon$$, if $$\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon$$, then there exists $$i\in\{1,\ldots,k+g\}$$ such that   \begin{align} \|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\|_1 \geq {\epsilon\over 2}, \end{align} (7.38) where $$S^{(i),t}$$ denotes the sequence $$S^{(i)}_1, S^{(i)}_2, \ldots, S^{(i)}_t$$. Note that in Lemma 7.6, $$\hat{p}^{(1)}(\cdot|S^{(i),t})$$ denotes the standard first-order empirical distribution of the super-alphabet sequence $$S^{(i),t}$$, i.e. for $$a^k\in\mathcal{Z}^k$$,   \begin{align} \hat{p}^{(1)}(a^k|S^{(i),t})={|\{j: S^{(i)}_j=a^k, 1\leq j \leq t\}|\over t}. \end{align} (7.39) Proof.
Note that by definition, for any $$a^k\in\mathcal{Z}^k$$,   \begin{align} \hat{p}^{(k)}(a^k|Z^n)&={1\over n-k}\sum_{i=k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}\nonumber\\ &={1\over n-k}\sum_{i=k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}\nonumber\\ &={1\over n-k}\left(\sum_{i=1}^{k+g}\sum_{j=1}^t \mathbb{1}_{S^{(i)}_j=a^k}+\sum_{t(k+g)+k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}\right)\nonumber\\ &={t\over n-k}\sum_{i=1}^{k+g}\left({1\over t}\sum_{j=1}^t \mathbb{1}_{S^{(i)}_j=a^k}\right)+{1\over n-k}\sum_{t(k+g)+k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}\nonumber\\ &={t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})+{1\over n-k}\sum_{t(k+g)+k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}.\label{eq:emp-total-to-overlapping} \end{align} (7.40) Therefore,   \begin{align} \|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1&=\sum_{a^k}|\hat{p}^{(k)}(a^k|Z^n)-\mu_k^{(b)}(a^k)|\nonumber\\ &\stackrel{(a)}{=}\sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})+{1\over n-k}\sum_{t(k+g)+k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}-\mu_k^{(b)}(a^k)\right|\nonumber\\ &\stackrel{(b)}{=}\sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right|+{1\over n-k}\sum_{a^k}\sum_{t(k+g)+k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}\nonumber\\ &\stackrel{(c)}{=}\sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right|+{n-t(k+g)-k+1\over n-k},\label{eq:ell1-phat-k-mu-k-vs-phat-shifted-extra} \end{align} (7.41) where $$(a)$$ follows from (7.40), $$(b)$$ follows from the triangle inequality and $$(c)$$ holds because   \begin{align} \sum_{a^k} \mathbb{1}_{Z_{i-k+1}^i=a^k}=1. \end{align} (7.42) But since $$n$$ satisfies the bounds of (7.37), the last term on the right-hand side of (7.40) can be upper-bounded as   \begin{align} {n-t(k+g)-k+1\over n-k}\leq {k+g\over n-k}. \end{align} (7.43) Therefore, from (7.41) we have   \begin{align} \|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1& \leq \sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right|+ {k+g\over n-k}.\label{eq:dist-phat-k-avg-phat-1} \end{align} (7.44) On the other hand, again by the triangle inequality,   \begin{align} &\sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right|\nonumber\\ &\quad=\sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\left(\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right)+\left({t(k+g)\over n-k}-1\right)\mu_k^{(b)}(a^k)\right| \nonumber\\ &\quad\leq \sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\left(\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right)\right|+\left|{t(k+g)\over n-k}-1\right|\sum_{a^k} \mu_k^{(b)}(a^k) \nonumber\\ &\quad= \sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\left(\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right)\right|+\left|{t(k+g)\over n-k}-1\right|\nonumber\\ &\quad\leq {t\over n-k} \sum_{i=1}^{k+g} \sum_{a^k}\left|\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right|+\left|{t(k+g)\over n-k}-1\right|\nonumber\\ &\quad= {t\over n-k} \sum_{i=1}^{k+g} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1+\left|{t(k+g)\over n-k}-1\right|.\label{eq:dist-phat-k-avg-phat-2} \end{align} (7.45) Therefore, combining (7.41) and (7.45) yields   \begin{align} \left\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\right\|_1& \leq {t\over n-k} \sum_{i=1}^{k+g} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1+\left|{t(k+g)\over n-k}-1\right|+{k+g\over n-k}.\label{eq:dist-phat-k-avg-phat-3} \end{align} (7.46) Since by construction $$t(k+g)\leq n-k$$, we have $$t/(n-k)\leq 1/(k+g)$$. 
Therefore, it follows from (7.46) that:   \begin{align} \left\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\right\|_1& \leq {1\over k+g} \sum_{i=1}^{k+g} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1+1-{t(k+g)\over n-k}+{k+g\over n-k}\nonumber\\ &= {1\over k+g} \sum_{i=1}^{k+g} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1+{n-t(k+g)+g\over n-k}.\label{eq:dist-phat-k-avg-phat-4} \end{align} (7.47) Notice that if   \begin{align} {n-t(k+g)+g\over n-k}\leq {\epsilon\over 2},\label{eq:cond-k-g-epsilon} \end{align} (7.48) and   \begin{align} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1\leq {\epsilon\over 2}, \end{align} (7.49) for all $$i$$, then, from (7.47), $$\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\leq {\epsilon}$$. But if $$4(k+g)/(n-k)<\epsilon$$, since $$t=\lfloor {n-k+1\over k+g} \rfloor$$, then (7.48) holds. This means that, to have $$\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1> {\epsilon}$$, we need $$\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\|_1> {\epsilon\over 2}$$, for at least one $$i$$ in $$\{1,\ldots,k+g\}$$. □ Now we can discuss the proof of Lemma 4.1. The proof is a straightforward extension of Lemma III.1.3 in [25]. However, we include a summary of the proof for completeness. By Lemma 7.6, if $$4(k+g)/(n-k)<\epsilon$$, then $$\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon$$ implies that there exists $$i\in\{1,\ldots,k+g\}$$ such that   \begin{align} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1\geq {\epsilon\over 2}. \end{align} (7.50) We next bound the probability that the above event happens. For each $$i\in\{1,\ldots,k+g\}$$, define event $$\mathcal{E}^{(i)}$$ as follows   \begin{align} \mathcal{E}^{(i)}\triangleq \left\{D_{\rm KL}(\hat{p}^{(1)}(\cdot|S^{(i),t}),\mu_k^{(b)})> \epsilon^2/2\right\}. \end{align} (7.51) By the Pinsker’s inequality, for any $$i$$,   \begin{align} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}\right\|_1 \leq \sqrt{(2\ln 2) D_{\rm KL}(\hat{p}^{(1)}(\cdot|S^{(i),t}),\mu_k^{(b)}(\cdot))}. \end{align} (7.52) Therefore, if $$D_{\rm KL}(\hat{p}^{(1)}(\cdot|S^{(i),t}),\mu_k^{(b)}(\cdot))\leq c\epsilon^2/4$$, where as defined earlier $$c=1/(2\ln 2)$$, then $$\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\|_1 \leq \epsilon/2$$. This implies that   \begin{align} {\rm{P}}\left(\left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}\right\|_1 >{\epsilon\over 2} \right)\leq {\rm{P}}\left(D_{\rm KL}\left(\hat{p}^{(1)}\left(\cdot|S^{(i),t}\right),\mu_k^{(b)}\right) >{c\epsilon^2\over 4}\right). \end{align} (7.53) On the other hand, for $$S^{(i),t}=s^t$$, where $$s^t\in (\mathcal{Z}^k)^t$$, we have   \begin{align} {\rm{P}}(S^{(i),t}=s^t)&={\rm{P}}\Big(Z_{(k+g)(j-1)+i}^{(k+g)(j-1)+i+k-1}=s_j, \;j=1,\ldots,t\Big)\nonumber\\ &\stackrel{(a)}{\leq} {\it{\Psi}}^t(b,g) \prod_{j=1}^t {\rm{P}}\Big(Z_{(k+g)(j-1)+i}^{(k+g)(j-1)+i+k-1}=s_j\Big), \end{align} (7.54) where $$(a)$$ follows from applying condition (4.4) $$t$$ times. However, by the standard method of types techniques [7], we have   \begin{align} \prod_{j=1}^t{\rm{P}}\Big(Z_{(k+g)(j-1)+i}^{(k+g)(j-1)+i+k-1}=s_j\Big)&= 2^{-t\left(\hat{H}_1(s^t)+D_{\rm KL}\left(\hat{p}^{(1)}\left(\cdot|s^t\right),\mu_k^{(b)}\right)\right)}. \end{align} (7.55) Therefore, if $$D_{\rm KL}(\hat{p}^{(1)}(\cdot|s^t),\mu_k^{(b)}) >{c\epsilon^2\over 4}$$, then   \begin{align} \prod_{j=1}^t{\rm{P}}\Big(Z_{(k+g)(j-1)+i}^{(k+g)(j-1)+i+k-1}=s_j\Big)&\leq 2^{-t\hat{H}_1(s^t)-c\epsilon^2t/4}. 
\end{align} (7.56) Hence,   \begin{align} {\rm{P}}\Big(D_{\rm KL}\left(\hat{p}^{(1)}\left(\cdot|S^{(i),t}\right),\mu_k^{(b)}\right)>{c\epsilon^2\over 4}\Big) &=\sum_{s^t:D_{\rm KL}\left(\hat{p}^{(1)}\left(\cdot|s^t\right),\mu_k^{(b)}\right) >{c\epsilon^2\over 4} }{\rm{P}}\left(S^{(i),t}=s^t\right)\nonumber \\ &\leq 2^{-c\epsilon^2t/4}{\it{\Psi}}^t(b,g) \sum_{s^t:D_{\rm KL}\left(\hat{p}^{(1)}(\cdot|s^t),\mu_k^{(b)}\right) >{c\epsilon^2\over 4} }2^{-t\hat{H}_1(s^t)}\nonumber\\ &\leq 2^{-c\epsilon^2t/4} {\it{\Psi}}^t(b,g) \sum_{s^t }2^{-t\hat{H}_1(s^t)}. \end{align} (7.57) Since $$\sum_{s^t }2^{-t\hat{H}_1(s^t)}$$ can be proven to be smaller than the total number of types of sequences $$s^t\in(\mathcal{Z}^k)^t$$, we have $$\sum_{s^t }2^{-t\hat{H}_1(s^t)}\leq (t+1)^{|\mathcal{Z}|^k}$$. This upper bound combined by the union bound on $$\mathcal{E}^{(i)}$$, $$i=1,\ldots,k+g$$, yields the desired result. 7.4 Proof of Lemma 4.2 For any $$u^{\ell_1}\in\mathcal{Z}^{\ell_1}$$, $$v^{g}\in\mathcal{Z}^{g}$$, and $$w^{\ell_2}\in\mathcal{Z}^{\ell_2}$$, we have   \begin{align}\label{eq:quant-markov} &{\rm{P}}\left(Z^{\ell_1+g+\ell_2}=[u^{\ell_1}v^{g}w^{\ell_2}]\right)\nonumber\\ &\quad\leq \sum_{v^{g}\in\mathcal{Z}^g} {\rm{P}}\left(Z^{\ell_1+g+\ell_2}=[u^{\ell_1}v^{g}w^{\ell_2}]\right)\nonumber\\ &\quad= {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}, Z_{\ell_1+g+1}^{\ell_1+g+\ell_2}=w^{\ell_2}\right)\nonumber\\ &\quad= {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right) {\rm{P}}\left(Z_{\ell_1+g+1}=w_1 | Z^{\ell_1}= u^{\ell_1}\right) \nonumber\\ &\qquad\times {\rm{P}}\left(Z_{\ell_1+g+2}^{\ell_1+g+\ell_2}=w_2^{\ell_2}|Z_{\ell_1+g+1}=w_1,Z^{\ell_1}=u^{\ell_1}\right)\nonumber\\ &\quad= {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right)\int_{\mathcal{X}} {\rm{P}}\left(Z_{\ell_1+g+1}=w_1 | X_{\ell_1}= x_{\ell_1}, Z^{\ell_1}= u^{\ell_1}\right) \,{\rm d}\mu\left(x_{\ell_1}| Z^{\ell_1}= u^{\ell_1}\right) \nonumber\\ &\qquad\times {\rm{P}}\left(Z_{\ell_1+g+2}^{\ell_1+g+\ell_2}=w_2^{\ell_2}|Z_{\ell_1+g+1}=w_1,Z^{\ell_1}=u^{\ell_1}\right)\nonumber\\ &\quad \stackrel{(a)}{=} {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right)\int_{\mathcal{X}} {\rm{P}}\left(Z_{\ell_1+g+1}=w_1 | X_{\ell_1}= x_{\ell_1}\right) \,{\rm d}\mu\left(x_{\ell_1}| Z^{\ell_1}= u^{\ell_1}\right) \nonumber\\ &\qquad\times {\rm{P}}\left(Z_{\ell_1+g+2}^{\ell_1+g+\ell_2}=w_{2}^{\ell_2}|Z_{\ell_1+g+1}=w_1,Z^{\ell_1}=u^{\ell_1}\right)\nonumber\\ &\quad \stackrel{(b)}{=} \mu_b(u^{\ell_1})\left({\int_{\mathcal{X}}{K^{g+1}(x_{\ell_1},w_1)} \,{\rm d}\mu\left(x_{\ell_1}| u^{\ell_1}\right) }\right) \mu_b\left(w_2^{\ell_2}|w_1,u^{\ell_1}\right)\nonumber\\ &\quad = \mu_b(u^{\ell_1})\Bigg({\int_{\mathcal{X}^l}{K^{g+1}(x_{\ell_1},w_1)\over \pi(w_1)} \,{\rm d}\mu(x_{\ell_1}| u^{\ell_1}) }\Bigg) \pi(w_1)\mu_b\left(w_2^{\ell_2}|w_1,u^{\ell_1}\right)\nonumber\\ &\quad\leq \mu_b(u^{\ell_1}) \pi(w_1)\mu_b(w_2^{\ell_2}|w_1,u^{\ell_1})\left(\sup_{x: [x]_b=u_{\ell_1}} {{K^{g+1}(x,w_1)} \over \pi(w_1)}\right)\nonumber\\ &\quad\leq \mu_b(u^{\ell_1}) \pi(w_1)\mu_b(w_2^{\ell_2}|w_1,u^{\ell_1})\left(\sup_{(x,z)\in\mathcal{X}\times \mathcal{Z}} {{K^{g+1}(x,z)} \over \pi(z)}\right)\nonumber\\ &\quad= \mu_b(u^{\ell_1}) \pi(w_1)\mu_b(w_2^{\ell_2}|w_1,u^{\ell_1}){\it{\Psi}}_1(b,g), \end{align} (7.58) where (a) holds because $${\mathbb{\mathbf{X}}}$$ is a first-order Markov chain and in (b),   \begin{align} \mu_b\left(u^{\ell_1}\right)&= {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right),\\ \end{align} (7.59)  \begin{align} 
\mu_b\left(w_2^{\ell_2}|w_1,u^{\ell_1}\right)&={\rm{P}}\left(Z_{\ell_1+g+2}^{\ell_1+g+\ell_2}=w_2^{\ell_2}|Z_{\ell_1+g+1}=w_1,Z^{\ell_1}=u^{\ell_1}\right) \end{align} (7.60) and $$\mu(x_{\ell_1}| u^{\ell_1})$$ denotes the probability measure of $$X_{\ell_1}$$ conditioned on $$Z^{\ell_1}=u^{\ell_1}$$. Also, since the Markov chain is a stationary process, we have   \begin{align} K^{g+1}(x_{\ell_1},w_1)={\rm{P}}([X_{g+1}]_b=w_1|X_0=x_{\ell_1}). \end{align} (7.61) Another term in (7.58) is $$ \pi(w_1)\mu_b(w_{2}^{\ell_2}|w_1,u^{\ell_1})$$. Since $$\pi(w^{\ell_2})= \pi(w_1) \pi(w_{2}^{\ell_2}|w_1)$$, we have   \begin{align} \pi(w^l)\mu_b(w_{l+1}^{\ell_2}|w^l,u^{\ell_1})&= \pi(w^{\ell_2}){\mu_b(w_{l+1}^{\ell_2}|w^l,u^{\ell_1})\over \pi(w_{l+1}^{\ell_2}|w^l) }.\label{eq:bayes-rule} \end{align} (7.62) But   \begin{align} \mu_b(w_2^{\ell_2}|w_1,u^{\ell_1})&=\int \mu_b(w_{2}^{\ell_2}|x,w_1,u^{\ell_1})\,{\rm d}\mu(x|w_1,u^{\ell_1})\nonumber\\ &=\int \mu_b(w_2^{\ell_2}|x)\,{\rm d}\mu(x|w_1,u^{\ell_1}),\label{eq:int-Markov} \end{align} (7.63) where the second equality holds because $$(Z^{\ell_1},Z_{\ell_1+1})\to X_{\ell_1+1}\to Z_{\ell_1+2}^{\ell_1+l+\ell_2}$$ forms a Markov chain. Therefore,   \begin{align} {\mu_b\left(w_{2}^{\ell_2}|w_1,u^{\ell_1}\right)\over \pi(w_{2}^{\ell_2}|w_1) } &={\int \mu_b(w_{2}^{\ell_2}|x)\,{\rm d}\mu(x|w_1,u^{\ell_1})\over \pi(w_2^{\ell_2}|w_1) },\nonumber\\ &\stackrel{(a)}{\leq } {\Big(\sup\limits_{[x]_b=w_1}\mu_b(w_2^{\ell_2}|x)\Big)\int {\rm d}\mu(x|w_1,u^{\ell_1})\over \pi(w_{2}^{\ell_2}|w_1) },\nonumber\\ &\stackrel{(b)}{= } \sup\limits_{[x]_b=w_1}{\pi(w_{2}^{\ell_2}|x)\over \pi(w_{2}^{\ell_2}|w_1)}\nonumber\\ &\leq \sup_{(x,w^{\ell_2}): [x]_b=w_1} {\pi(w_{2}^{\ell_2}|x)\over \pi(w_{2}^{\ell_2}|w_1)}\nonumber\\ &\leq {\it{\Psi}}_2(b),\label{eq:bd-realted-psi2} \end{align} (7.64) where (a) and (b) hold because $$\mu(x|w_1,u^{\ell_1})$$ is only non-zero when $$x$$ is such that $$[x]_b=w_1$$, and $$\int {\rm d}\mu(x|w_1,u^{\ell_1})=1$$, respectively. Finally, combining (7.58), (7.62) and (7.64) yields the desired result. We next prove that, for a fixed $$b$$, $${\it{\Psi}}_1(b,g)$$ is non-increasing function of $$g$$. For any $$x\in\mathcal{X}$$ and $$z\in\mathcal{X}_b$$, we have   \begin{align} {K^{g+1}(x,z) \over \pi(z)} &={{\rm{P}}([X_{g+1}]_b=z|X_0=x) \over {\rm{P}}([X_{g+1}]_b=z)} \nonumber\\ &={\int {\rm{P}}([X_{g+1}]_b=z|X_1=x',X_0=x)\,{\rm d}\mu(x'|X_0=x) \over {\rm{P}}([X_{g+1}]_b=z)} \nonumber\\ &\stackrel{(a)}{=}{\int {\rm{P}}([X_{g}]_b=z|X_0=x')\,{\rm d}\mu(x'|X_0=x) \over {\rm{P}}([X_{g}]_b=z)} \nonumber\\ &=\sup_{x'\in\mathcal{X}} {{\rm{P}}([X_{g}]_b=z|X_0=x') \over {\rm{P}}([X_{g}]_b=z)} \int {\rm d}\mu(x'|X_0=x) \nonumber\\ &\stackrel{(b)}{\leq} {\it{\Psi}}_1(g,b),\label{eq:Kg-plus-1} \end{align} (7.65) where $$(a)$$ follows because of the Markovity and stationarity assumptions and $$(b)$$ follows because $$\int {\rm d}\mu(x'|X_0=x)=1$$. Since the right-hand side of (7.65) only depends on $$g$$ and $$b$$, taking the supremum of the left-hand side over $$(x,z)\in\mathcal{X}\times\mathcal{X}_b$$ proves that   \begin{align} {\it{\Psi}}_1(g+1)\leq {\it{\Psi}}_1(g). \end{align} (7.66) Furthermore, since $${\mathbb{\mathbf{X}}}$$ is assumed to be an aperiodic Markov chain, $$\lim_{g\to\infty} K^{g}(x,z)=\pi(z)$$, for all $$x$$ and $$z$$. Therefore, $${\it{\Psi}}_1(g)$$ converges to one, as $$g\to\infty$$. 7.5 Proof of Theorem 4.3 Define $${\it{\Psi}}(b,g)={\it{\Psi}}_1(b,g){\it{\Psi}}_2(b)$$. 
Then it follows from Lemmas 4.1 and 4.2 that, given $$\epsilon>0$$, for any positive integers $$g$$ and $$k$$ that satisfy $$4(k+g)/(n-k)<\epsilon$$,   \begin{align} {\rm{P}}\left(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon\right)\leq (k+g){\it{\Psi}}_1^t(b,g){\it{\Psi}}_2^t(b)(t+1)^{|\mathcal{Z}|^k}2^{-c\epsilon^2t/4},\label{eq:ell-1-main} \end{align} (7.67) where $$t=\lfloor{n-k+1\over k+g}\rfloor$$ and $$c=1/(2\ln 2)$$. Since by assumption $$\lim_{b\to\infty}{\it{\Psi}}_2(b)=1$$, there exists $$b_{\epsilon}$$ such that for all $$b\geq b_{\epsilon}$$, $${\it{\Psi}}_2(b)\leq 2^{c\epsilon^2/16}$$. But $$ b_n=\lceil r\log\log n \rceil$$ is a diverging sequence of $$n$$. Therefore, there exists $$n_{\epsilon}>0$$, such that for all $$n\geq n_{\epsilon}$$,   \begin{align} {\it{\Psi}}_2(b_n)\leq 2^{c\epsilon^2/16}.\label{eq:bd-Psi-2} \end{align} (7.68) On the other hand, by the theorem’s assumption, there exists a sequence $$g=g_n$$, where $$g=o(n)$$, such that $$\lim_{n\to\infty}{\it{\Psi}}_1(b_n,g_n)=1 $$. Therefore, there exists $$n'_{\epsilon}$$ such that for all $$n\geq n'_{\epsilon}$$,   \begin{align} {\it{\Psi}}_1(b_n,g_n)\leq 2^{c\epsilon^2/16}.\label{eq:bd-Psi-1} \end{align} (7.69) Moreover, since $$g=g_n=o(n)$$ and $$k$$ is fixed, there exists $$n''_{\epsilon}>0$$ such that for all $$n\geq n''_{\epsilon}$$,   \begin{align} {4(k+g_n) \over n-k}<\epsilon.\label{eq:k-g-n} \end{align} (7.70) Therefore, for $$n>\max(n_{\epsilon},n'_{\epsilon},n''_{\epsilon})$$, from (7.67), (7.68), (7.69) and (7.70), we have   \begin{align} {\rm{P}}(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon)&\leq (k+g){\it{\Psi}}_1^t(b,g){\it{\Psi}}_2^t(b)(t+1)^{|\mathcal{Z}|^k}2^{-c\epsilon^2t/4}\nonumber\\ &\leq (k+g)(t+1)^{|\mathcal{Z}|^k}2^{-tc\epsilon^2/8}\nonumber\\ &\leq (k+g)n^{|\mathcal{Z}|^k}2^{-tc\epsilon^2/8}, \end{align} (7.71) where the last line follows from the fact that $$t+1\leq n$$. But since $$t=\lfloor{n-k+1\over k+g}\rfloor$$, $$t\geq {n-k\over k+g}-1$$. Hence,   \begin{align} {\rm{P}}(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon)&\leq 2^{c\epsilon^2(1+k/(k+g))/8} (k+g)n^{|\mathcal{Z}|^k}2^{-{c \epsilon^2n\over 8(k+g)}}\nonumber\\ &\leq 2^{c\epsilon^2/4} (k+g)n^{|\mathcal{Z}|^k}2^{-{c \epsilon^2n\over 8(k+g)}}, \end{align} (7.72) which is the desired result. 7.6 Proof of Theorem 4.4 For each $$i$$, let random variable $$J_i$$ be an indicator of a jump at time $$i$$. That is,   \begin{align} J_i=\mathbb{1}_{X_i\neq X_{i-1}}. \end{align} (7.73) Consider $$x\in \mathcal{X}$$ and $$z\in\mathcal{X}_b$$. Then, by definition,   \begin{align} {K^g(x,z)\over \pi(z)}&={{\rm{P}}([X_g]_b=z|X_0=x)\over {\rm{P}}([X_0]_b=z)}. \end{align} (7.74) But,   \begin{align} {\rm{P}}([X_g]_b=z|X_0=x)&= \sum_{d^g\in\{0,1\}^g}{\rm{P}}([X_g]_b=z,J^g=d^g|X_0=x)\nonumber\\ &\stackrel{(a)}{=}\sum_{d^g\in\{0,1\}^g}{\rm{P}}([X_g]_b=z|J^g=d^g,X_0=x){\rm{P}}(J^g=d^g), \end{align} (7.75) where $$(a)$$ follows from the independence of the jump events and the value of the Markov process at each time. Now if there is a jump between time $$1$$ and time $$g$$, then by definition of the transition probabilities the value of $$[X_g]_b$$ become independent of $$[X_1]_b$$ and also the jumps pattern. In other words, for any $$J^g\neq (0,\ldots,0)$$,   \begin{align} {\rm{P}}([X_g]_b=z|J^g=d^g,X_0=x)={\rm{P}}([X_g]_b=z)={\rm{P}}([X_0]_b=z), \end{align} (7.76) where the last equality follows from the stationarity of the Markov process. 
On the other hand, $$J^g= (0,\ldots,0)$$ means that there has been no jump from time $$0$$ up to time $$g$$, and therefore $$X_g=X_0$$. This implies that   \begin{align} {\rm{P}}([X_g]_b=z|J^g=0^g,X_0=x)=\mathbb{1}_{z=[x]_b}. \end{align} (7.77) Since $${\rm{P}}(J^g= (0,\ldots,0))=(1-p)^g$$, combining the intermediate steps, it follows that   \begin{align} {\rm{P}}([X_g]_b=z|X_0=x)&= (1-(1-p)^g){\rm{P}}([X_0]_b=z)+(1-p)^g\mathbb{1}_{z=[x]_b}, \end{align} (7.78) and as a result   \begin{align} {K^g(x,z)\over \pi(z)}&=(1-(1-p)^g)+ {(1-p)^g\mathbb{1}_{z=[x]_b}\over {\rm{P}}([X_0]_b=z)}. \end{align} (7.79) But given that by our assumption $$f(x)\geq f_{\min}$$, $${\rm{P}}([X_0]_b=z)\geq f_{\min} 2^{-b}$$. Therefore,   \begin{align} {\it{\Psi}}_1(b,g)& =\sup_{(x,z)\in\mathcal{X}\times \mathcal{X}_b} {K^g(x,z)\over \pi(z)}\nonumber\\ &\leq (1-(1-p)^g)+ {(1-p)^g 2^b\over f_{\min}}.\label{eq:bd-Psi-1-Markov} \end{align} (7.80) For $$b=b_n=\lceil r\log\log n \rceil $$ and $$g=g_n=\lfloor \gamma \, r\log\log n \rfloor$$, we have   \begin{align} \log ((1-p)^g 2^b)&=g\log (1-p)+b\\ \end{align} (7.81)  \begin{align} &\leq r(\gamma \log(1-p) +1)\log\log n. \end{align} (7.82) But since $$\gamma>-{1\over \log(1-p)}$$, $$\gamma \log(1-p) +1<0$$, which from (7.80) proves the desired result, i.e. $$\lim_{n\to\infty}{\it{\Psi}}_1(b_n,g_n)=1$$. It is easy to check that, due to its special distribution, the quantized version of the process $${\mathbb{\mathbf{X}}}$$ is also a first-order Markov process. Therefore, from (4.15), we have   \begin{align} {\it{\Psi}}_2(b)&=\sup_{(x,w^2)\in\mathcal{X}\times \mathcal{Z}^{2}: [x]_b=w_1} {\pi(w_{2}|x)\over \pi(w_{2}|w_1)}\nonumber\\ &=\sup_{(x,w_2)\in\mathcal{X}\times \mathcal{Z}} {{\rm{P}}([X_2]_b=w_2|X_1=x)\over {\rm{P}}([X_2]_b=w_2|[X_1]_b=[x]_b)}. \end{align} (7.83) But   \begin{align} {\rm{P}}([X_2]_b=w_2|X_1=x)&={\rm{P}}([X_2]_b=w_2|J_2=1,X_1=x)p\nonumber\\ &\;\;\;+{\rm{P}}([X_2]_b=w_2|J_2=0,X_1=x)(1-p)\nonumber\\ &=p{\rm{P}}([X_2]_b=w_2)+(1-p)\mathbb{1}_{w_2=[x]_b}, \end{align} (7.84) and similarly   \begin{align} {\rm{P}}([X_2]_b=w_2|[X_1]_b=[x]_b)&={\rm{P}}([X_2]_b=w_2|J_2=1,[X_1]_b=[x]_b)p\nonumber\\ &\;\;\;+{\rm{P}}([X_2]_b=w_2|J_2=0,[X_1]_b=[x]_b)(1-p)\nonumber\\ &=p{\rm{P}}([X_2]_b=w_2)+(1-p)\mathbb{1}_{w_2=[x]_b}, \end{align} (7.85) which proves that $${\it{\Psi}}_2(b)=1$$, for all $$b$$. 7.7 Proof of Theorem 5.1 By definition,   \begin{align} \bar{d}_k({\mathbb{\mathbf{X}}})=\limsup_{b\to\infty} {H([X_{k+1}]_b|[X^k]_b)\over b}, \end{align} (7.86) and therefore, for any $$\delta_1>0$$, there exists $$b_{\delta_1}$$ such that for all $$b\geq b_{\delta_1}$$, $${H([X_{k+1}]_b|[X^k]_b)\over b} \leq \bar{d}_k({\mathbb{\mathbf{X}}}) + \delta_1. $$ Since $$b=b_n=\lceil r\log\log n \rceil$$ converges to infinity as $$n\to\infty$$, for all $$n$$ large enough, $$b=b_n>b_{\delta_1}$$, and as a result   \begin{align} {H([X_{k+1}]_b|[X^k]_b)\over b} \leq \bar{d}_k({\mathbb{\mathbf{X}}}) + \delta_1. \end{align} (7.87) For the rest of the proof, assume that $$n$$ is large enough such that $$b_n>b_{\delta_1}$$. Define distribution $$q_{k+1}$$ over $$\mathcal{X}_b^{k+1}$$ as the $$(k+1)$$th order distribution of the quantized process $$[X_1]_b, \ldots, [X_n]_b$$.
That is, for $$a^{k+1}\in\mathcal{X}_b^{k+1}$$,   \begin{align} q_{k+1}(a_{k+1}|a^k)={\rm{P}}([X_{k+1}]_b=a_{k+1}|[X^k]_b=a^k),\label{eq:q-k+1-source} \end{align} (7.88) and   \begin{align} q_{k}(a^{k})=\sum_{a_{k+1}\in\mathcal{X}_b}q_{k+1}(a^{k+1})={\rm{P}}([X^k]_b=a^k).\label{eq:q-k-source} \end{align} (7.89) Also define distributions $$\hat{q}_{k+1}^{(1)}$$ and $$\hat{q}_{k+1}^{(2)}$$ as the empirical distributions induced by $${\hat{X}}^n$$ and $$[X^n]_b$$, respectively. In other words, $$\hat{q}_{k}^{(1)}(a^{k})=\hat{p}^{(k)}(a^k|{\hat{X}}^n)$$, $$\hat{q}_{k}^{(2)}(a^{k})=\hat{p}^{(k)}(a^k|[X^n]_b)$$ and   \begin{align} \hat{q}_{k+1}^{(1)}(a_{k+1}|a^k)={\hat{q}_{k+1}^{(1)}(a^{k+1})\over \hat{q}_{k}^{(1)}(a^{k}) }={\hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n)\over \hat{p}^{(k)}(a^k|{\hat{X}}^n)} \end{align} (7.90) and   \begin{align} \hat{q}^{(2)}_{k+1}(a_{k+1}|a^k)={\hat{q}_{k+1}^{(2)}(a^{k+1})\over \hat{q}_{k}^{(2)}(a^{k}) }={\hat{p}^{(k+1)}(a^{k+1}|[X^n]_b)\over \hat{p}^{(k)}(a^k|[X^n]_b)}. \end{align} (7.91) As the first step, we would like to prove that $${1\over b} \hat{H}_{k}({\hat{X}}^n)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta$$. Using the definitions above, we have   \begin{align}\label{eq:simplifyprobterm1} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) &= \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) \log{1\over q_{k+1}(a_{k+1}|a^k)} \nonumber \\ &= \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) \log{\hat{q}_{k+1}^{(1)}(a_{k+1}|a^k)\over q_{k+1}(a_{k+1}|a^k)}\nonumber\\ &\quad+ \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) \log{1\over \hat{q}_{k+1}^{(1)}(a_{k+1}|a^k)} \nonumber \\ &= \sum_{a^k}\hat{q}_{k}^{(1)}(a^k) D_{\rm KL}(\hat{q}_{k+1}^{(1)}(\cdot|a^k)\| q_{k+1}(\cdot|a^k)) + \hat{H}_{k}({\hat{X}}^n). \end{align} (7.92) Since $${\hat{X}}^n$$ is the minimizer of (2.5), we have   \begin{align} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) \leq (\bar{d}_k({\mathbb{\mathbf{X}}})+\delta)b. \end{align} (7.93) Combining this equation with (7.92) and the fact that $$D_{\rm KL}$$ is always positive, we obtain   \begin{align} {1\over b} \hat{H}_{k}({\hat{X}}^n)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta.\label{eq:bd-H-hat-X-hat-Q-MAP} \end{align} (7.94) As the second step of the proof, we show that with high probability   \begin{align} {1\over b} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) \leq \bar{d}_k({\mathbb{\mathbf{X}}}) +\delta. \end{align} (7.95) In other words, we would like to show that the vector $$[X^n]_b=([X_1]_b, [X_2]_b, \ldots, [X_n]_b)$$ satisfies the constraint of the optimization (2.5). 
Following the same steps as those used in deriving (7.92), we get   \begin{align} &\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b)\nonumber\\ &\quad= \sum_{a^k}\hat{q}_{k}^{(2)}(a^k) D_{\rm KL}(\hat{q}_{k+1}^{(2)}(\cdot|a^k)\| q_{k+1}(\cdot|a^k))+ \hat{H}_{k}([X^n]_b).\label{eq:simplifyprobterm2} \end{align} (7.96) Also, note that   \begin{align} \sum_{a^k}\hat{q}_{k}^{(2)}(a^k) D_{\rm KL}(\hat{q}_{k+1}^{(2)}(\cdot|a^k)\| q_{k+1}(\cdot|a^k)) &= \sum_{a^k}\hat{q}_{k}^{(2)}(a^k)\sum_{a_{k+1}} \hat{q}_{k+1}^{(2)}(a_{k+1}|a^k) \log {\hat{q}_{k+1}^{(2)}(a_{k+1}|a^k)\over q_{k+1}(a_{k+1}|a^k)}\nonumber\\ &=\sum_{a^{k+1}}\hat{q}_{k+1}^{(2)}(a^{k+1})\Bigg( \log {\hat{q}_{k+1}^{(2)}(a^{k+1})\over q_{k+1}(a^{k+1})}-\log {\hat{q}_{k}^{(2)}(a^{k})\over q_{k}(a^{k})}\Bigg)\nonumber\\ &= D_{\rm KL}(\hat{q}_{k+1}^{(2)}\| q_{k+1})- D_{\rm KL}(\hat{q}_{k}^{(2)}\| q_{k}).\label{eq:D-KL-cond-regular} \end{align} (7.97) Therefore, since $$0\leq D_{\rm KL}(\hat{q}_{k+1}^{(2)}\| q_{k+1})- D_{\rm KL}(\hat{q}_{k}^{(2)}\| q_{k})\leq D_{\rm KL}(\hat{q}_{k+1}^{(2)}\| q_{k+1})$$, from (7.96),   \begin{align}\label{eq:cost-vs-KL-distance} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) \leq \hat{H}_{k}([X^n]_b) +D_{\rm KL}(\hat{q}_{k+1}^{(2)}\| q_{k+1}). \end{align} (7.98) Given $$\delta_2>0$$, define event $$\mathcal{E}_1$$ as   \begin{equation}\label{eqdef:Ec1} \mathcal{E}_1\triangleq \{\|\hat{q}_{k+1}^{(2)}-q_{k+1}\|_1<\delta_2\}. \end{equation} (7.99) Consider random vector $$U^{k+1}$$ distributed according to $$\hat{q}_{k+1}^{(2)}$$, which denotes the empirical distribution of $$[X^n]_b$$. Then, by definition, $$\hat{H}_k([X^n]_b)= H(U_{k+1}|U^k)= H(U^{k+1})-H(U^k).$$ Therefore,   \begin{align} |\hat{H}_k([X^n]_b)-H([X_{k+1}]_b|[X^k]_b)|&=|H(U^{k+1})-H(U^k)-H([X^{k+1}]_b)+H([X^k]_b)|\nonumber\\ &\leq |H(U^{k+1})-H([X^{k+1}]_b)|+|H(U^k)-H([X^k]_b)|. \end{align} (7.100) Conditioned on $$\mathcal{E}_1$$, $$\|\hat{q}_{k}^{(2)}-q_{k}\|_1\leq \|\hat{q}_{k+1}^{(2)}-q_{k+1}\|_1\leq \delta_2$$, and therefore, from Lemma 7.1,   \begin{align} |\hat{H}_k([X^n]_b)-H([X_{k+1}]_b|[X^k]_b)|\leq -2\delta_2\log \delta_2 +2\delta_2(k+1)\log |\mathcal{X}_b| \end{align} (7.101) or   \begin{align} \left|{\hat{H}_k([X^n]_b)\over b}-{H([X_{k+1}]_b|[X^k]_b)\over b}\right|\leq -{2\delta_2\over b}\log \delta_2 +\left({2(k+1)\log |\mathcal{X}_b|\over b}\right)\delta_2.\label{eq:bd-dist-Hh-H-emp} \end{align} (7.102) Moreover, conditioned on $$\mathcal{E}_1$$, since $$\|\hat{q}_{k+1}^{(2)}-{q}_{k+1}\|_1\leq \delta_2$$ and $$\hat{q}_{k+1}^{(2)} \ll {q}_{k+1}$$, from Lemma 7.2, we have   \begin{align} D(\hat{q}_{k+1}^{(2)}\|{q}_{k+1} )\leq -\delta_2\log \delta_2 +\delta_2(k+1)\log |\mathcal{X}_b|-\delta_2 \log q_{\min}, \end{align} (7.103) where   \begin{align} q_{\min}=\min_{u^{k+1}\in\mathcal{X}_b^{k+1}: q_{k+1}(u^{k+1})\neq 0} {\rm{P}}([X^{k+1}]_b=u^{k+1})\geq f_{k+1} |\mathcal{X}_b|^{-(k+1)}, \end{align} (7.104) where the last line follows from (5.1). 
Therefore,   \begin{align} D(\hat{q}_{k+1}^{(2)}\|{q}_{k+1} )&\leq -\delta_2\log \delta_2 +\delta_2(k+1)\log |\mathcal{X}_b|\nonumber\\ &\;\;\;-\delta_2\log f_{k+1} +\delta_2(k+1) \log |\mathcal{X}_b| \end{align} (7.105) or   \begin{align} {1\over b}D(\hat{q}_{k+1}^{(2)}\|{q}_{k+1} )&\leq -{\delta_2\over b}(\log \delta_2 +\log f_{k+1}) +\left({2(k+1)\log |\mathcal{X}_b|\over b}\right)\delta_2.\label{eq:bd-D-q2-lemma5-used} \end{align} (7.106) Hence, combining (7.98), (7.102) and (7.106), it follows that, conditioned on $$\mathcal{E}_1$$,   \begin{align} &{1\over b} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) \nonumber\\ &\quad\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta_1 +\left({4(k+1)\log |\mathcal{X}_b|\over b}\right)\delta_2 -{\delta_2\over b}(3\log \delta_2 +\log f_{k+1}).\label{eq:ub-Hhat-Xo-delat-1-delta-2} \end{align} (7.107) Choosing $$\delta_1=\delta/2$$ and $$\delta_2$$ small enough such that   \begin{align} &\left({4(k+1)\log |\mathcal{X}_b|\over b}\right)\delta_2 -{\delta_2\over b}(3\log \delta_2 +\log f_{k+1})\leq {\delta\over 2}. \label{eq:cond-on-delta2} \end{align} (7.108) Note that while $$|\mathcal{X}_b|$$ grows exponentially in $$b$$, for all bounded sources, $${1\over b}\log |\mathcal{X}_b|<2$$. Therefore, it is always possible to make sure that the above condition is satisfied for an appropriate choice of parameter $$\delta_2$$. For this choice of parameters, from (7.107), conditioned on $$\mathcal{E}_1$$  \begin{equation}\label{eq:constraint} {1\over b} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) \leq \bar{d}_k({\mathbb{\mathbf{X}}}) +\delta, \end{equation} (7.109) and hence $$[X^n]_b$$ satisfies the constraint of the Q-MAP optimization described in (2.5). Hence, since $${\hat{X}}^n$$ is the minimizer of $$\|Au^n-Y^m\|^2$$, among all sequences that satisfy this constraint, we conclude that, conditioned on $$\mathcal{E}_1$$,   \begin{align} \|A{\hat{X}}^n-Y^m\|&\leq \|A[X^n]_b-Y^m\|\nonumber\\ &= \|A([X^n]_b-X^n)\|\nonumber\\ &\leq \sigma_{\max}(A)\|X^n-[X^n]_b\|\nonumber\\ &\leq \sigma_{\max}(A)2^{-b}\sqrt{n}.\label{eq:error-Xh-vs-error-Xob} \end{align} (7.110) Our goal is to use this equation to derive a bound for $$\|\hat{X}^n- X^n\|$$. The main challenge here is to find a lower bound for $$\|A{\hat{X}}^n-Y^m\|$$ in terms of $$\|\hat{X}^n- X^n\|$$. Given $$\delta_3>0$$ and $$\tau>0$$, define set $$\mathcal{C}_n$$ and events $$\mathcal{E}_2$$ and $$\mathcal{E}_3$$ as   \begin{gather}\label{eqdef:CCn} \mathcal{C}_n \triangleq \left\{u^n\in\mathcal{X}_b^n: {1\over nb}\ell_{\rm LZ}(u^n)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+2\delta\right\},\\ \end{gather} (7.111)  \begin{gather} \label{eqdef:E2n} \mathcal{E}_2 \triangleq \{\sigma_{\max}(A)<\sqrt{n}+2\sqrt{m}\} \end{gather} (7.112) and   \begin{equation}\label{eqdef:E3n} \mathcal{E}_3 \triangleq \{\|A(u^n-X^n)\|\geq \|u^n-X^n\|\sqrt{(1-\tau)m}: \forall u^n\in\mathcal{C}_n\}, \end{equation} (7.113) respectively. We will prove the following: $${\hat{X}}^n\in\mathcal{C}_n$$, for $$n$$ large enough. $${\rm{P}}(\mathcal{E}_1\cap \mathcal{E}_2 \cap \mathcal{E}_3)$$ converges to one as $$n$$ grows to infinity. For the moment we assume that these two statements are true and complete the proof. 
Therefore, conditioned on $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$, it follows from (7.110) that   \begin{align} \|{\hat{X}}^n-X^n\|\sqrt{(1-\tau)m} &\leq n\left(1+2\sqrt{m\over n}\right)2^{-b}\nonumber\\ &\leq 3n2^{-b}, \end{align} (7.114) where the last line follows from the fact that $$m\leq n$$. Therefore, conditioned on $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$,   \begin{align} {1\over \sqrt{n}}\|{\hat{X}}^n-X^n\|\leq \sqrt{9 n\over (1-\tau)m2^{2b}}.\label{eq:error-Xh-vs-error-Xob-per-symb} \end{align} (7.115) To prove that $${\hat{X}}^n\in\mathcal{C}_n$$, for $$n$$ large enough, note that, from (7.94), $${1\over b}\hat{H}_k({\hat{X}}^n)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta$$. On the other hand, from (7.4), for our choice of parameter $$b=b_n$$, for any given $$\delta''>0$$, for all $$n$$ large enough,   \begin{align} {1\over n}\ell_{\rm LZ}({\hat{X}}^n)\leq \hat{H}_k({\hat{X}}^n)+\delta''. \end{align} (7.116) Therefore, for all $$n$$ large enough,   \begin{align} {1\over nb}\ell_{\rm LZ}({\hat{X}}^n)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta+{\delta''\over b}. \end{align} (7.117) Choosing $$\delta''$$ such that $${\delta''\over b}\leq \delta$$ proves the desired result, i.e. $${\hat{X}}^n\in\mathcal{C}_n$$. Let $$\tau=1-(\log n)^{-2r/(1+f)}$$, where $$f>0$$ is a free parameter. For $$b=b_n=\lceil r\log \log n\rceil$$, $$2^{2b}\geq (\log n)^{2r}$$. Therefore, from (7.115),   \begin{align} {1\over \sqrt{n}}\|{\hat{X}}^n-X^n\|&\leq \sqrt{9 (\log n)^{2r\over 1+f} \over (1+\delta)\bar{d}_k({{\mathbb{\mathbf{X}}}}) (\log n)^{2r}} \nonumber\\ &= \sqrt{9 \over (1+\delta)\bar{d}_k({{\mathbb{\mathbf{X}}}}) (\log n)^{2rf \over 1+f}} . \end{align} (7.118) Therefore, for any $$\epsilon>0$$ and $$n$$ large enough, conditioned on $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$,   \begin{align} {1\over \sqrt{n}}\|{\hat{X}}^n-X^n\|\leq \epsilon. \end{align} (7.119) To finish the proof, we study the probability of $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$. By Theorem 4.2, there exists an integer $$g_{\delta_2}$$, depending only on the source distribution and $$\delta_2$$, such that for $$n>6(k+g_{\delta_2})/\delta_2+k$$,   \begin{align} {\rm{P}}(\mathcal{E}_1^c)\leq 2^{c\delta_2^2/8}(k+g_{\delta_2})n^{|\mathcal{X}_b|^k}2^{-{n\delta_2^2 \over 8(k+g_{\delta_2})}}, \end{align} (7.120) where $$c=1/(2\ln 2)$$. Also, as proved in [5],   \begin{align} {\rm{P}}(\mathcal{E}_2^c)\leq 2^{-m/2}. \end{align} (7.121) Finally, from (7.3), the size of $$\mathcal{C}_n$$ can be upper bounded as   \begin{align} |\mathcal{C}_n|\leq 2^{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)+1}. \end{align} (7.122) Now Lemma 7.4 combined with the union bound proves that, for a fixed vector $$X^n$$,   \begin{align} {\rm{P}}_A(\mathcal{E}_3^c)\leq 2^{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)+1} {\rm e}^{{m\over 2}(\tau +\ln (1-\tau))}, \end{align} (7.123) where $${\rm{P}}_A$$ reflects the fact that $$[X^n]_b$$ is fixed, and the randomness is in the generation of matrix $$A$$. For our choice of the parameter $$\tau$$, this bound, combined with Fubini's theorem and the Borel–Cantelli lemma, proves that $${\rm{P}}_{X^n}(\mathcal{E}_3^c)\to 0$$, almost surely. 7.8 Proof of Theorem 5.2 The proof is very similar to that of Theorem 5.1 and follows the same logic. As in the proof of Theorem 5.1, for $$\delta_1>0$$, we assume that $$n$$ is large enough such that   \begin{align} {H([X_{k+1}]_b|[X^k]_b)\over b} \leq \bar{d}_k({\mathbb{\mathbf{X}}}) + \delta_1.
\end{align} (7.124) Also, given $$\delta_2>0$$, $$\delta_3>0$$ and $$\tau>0$$, consider set $$\mathcal{C}_n$$ and events $$\mathcal{E}_2$$ and $$\mathcal{E}_3$$, defined in (7.111), (7.112) and (7.113), respectively. Since $${\hat{X}}^n$$ is a minimizer of $$\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|u^n) +{\lambda\over n^2}\|Au^n-Y^m\|^2$$, we have   \begin{align}\label{eq:cost-Xhat-Xnb} &\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) +{\lambda\over n^2}\|A{\hat{X}}^n-Y^m\|^2\nonumber\\ &\quad \leq \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) +{\lambda\over n^2}\|A[X^n]_b-Y^m\|^2\nonumber\\ &\quad \leq \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) +{\lambda(\sigma_{\max}(A))^2\over n^2}\|[X^n]_b-X^n\|^2\nonumber\\ &\quad \leq \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) +{\lambda(\sigma_{\max}(A))^22^{-2b}\over n}. \end{align} (7.125) Define distributions $$q_{k+1}$$, $$\hat{q}_{k+1}^{(1)}$$ and $$\hat{q}_{k+1}^{(2)}$$ over $$\mathcal{X}_b^{k+1}$$ as in the proof of Theorem 5.1. Then, given $$\delta>0$$, from (7.102) and (7.107), we set $$\delta_1=\delta/2$$ and $$\delta_2$$ small enough that (7.108) is satisfied. Then, following the same steps as the ones that led to (7.109), we obtain that, conditioned on $$\mathcal{E}_1$$ (defined in (7.99)),   \begin{align} {1\over b} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta. \end{align} (7.126) Moreover, by (7.92) and the non-negativity of the Kullback-Leibler divergence, $$\hat{H}_k({\hat{X}}^n)\leq \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n)$$. Hence, combining these bounds with (7.125), we have   \begin{align} \hat{H}_k({\hat{X}}^n) +{\lambda\over n^2}\|A{\hat{X}}^n-Y^m\|^2 \leq b(\bar{d}_k({\mathbb{\mathbf{X}}})+\delta ) +{\lambda(\sigma_{\max}(A))^22^{-2b}\over n}.\label{eq:ub-cost-L-QMAP} \end{align} (7.127) Since both terms on the left-hand side of (7.127) are non-negative, each of them is smaller than the bound on the right-hand side, i.e.   \begin{align}\label{eq:upper-bd-Hhat-k} {1\over b}\hat{H}_k({\hat{X}}^n) &\;\leq\; \bar{d}_k({\mathbb{\mathbf{X}}})+\delta +{\lambda(\sigma_{\max}(A))^22^{-2b}\over bn} \end{align} (7.128) and   \begin{align}\label{eq:upper-bd-meas-error} {\lambda\over b n^2}\|A{\hat{X}}^n-Y^m\|^2&\;\leq\; \bar{d}_k({\mathbb{\mathbf{X}}})+\delta +{\lambda(\sigma_{\max}(A))^22^{-2b}\over bn}. \end{align} (7.129) Since $$\lambda=\lambda_n=(\log n)^{2r}$$ and $$b=b_n=\lceil r\log\log n\rceil$$, $$\lambda 2^{-2b}\leq 1$$, and hence, conditioned on $$\mathcal{E}_2$$,   \begin{align} {\lambda(\sigma_{\max}(A))^22^{-2b}\over bn}\leq {(\sqrt{n}+2\sqrt{m})^2\over nb}\leq {9\over b}. \end{align} (7.130) Therefore, since $$b_n\to\infty$$ as $$n\to\infty$$, for all $$n$$ large enough, conditioned on $$\mathcal{E}_2$$,   \begin{align} {\lambda(\sigma_{\max}(A))^22^{-2b}\over bn}\leq \delta, \end{align} (7.131) and therefore, from (7.128) and (7.129),   \begin{align}\label{eq:upper-bd-Hhat-k-simple} {1\over b}\hat{H}_k({\hat{X}}^n) &\;\leq\; \bar{d}_k({\mathbb{\mathbf{X}}})+2\delta \end{align} (7.132) and   \begin{align}\label{eq:upper-bd-meas-error-simple} {\lambda\over b n^2}\|A{\hat{X}}^n-Y^m\|^2&\;\leq\; \bar{d}_k({\mathbb{\mathbf{X}}})+2\delta. \end{align} (7.133) Therefore, choosing $$\delta_3=3\delta$$, conditioned on $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$, $${\hat{X}}^n\in\mathcal{C}_n$$.
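The claim $$\lambda_n 2^{-2b_n}\leq 1$$ used above holds because $$2^{2b_n}\geq 2^{2r\log\log n}=(\log n)^{2r}=\lambda_n$$ (logarithms in base 2, as throughout the article). The following few lines, with an illustrative value of $$r$$ that is not taken from the article, verify this numerically.

import math

# Check lambda_n * 2^(-2*b_n) <= 1 for lambda_n = (log n)^(2r) and
# b_n = ceil(r * log log n), with logarithms in base 2.
r = 1.5  # illustrative choice
for n in [10 ** 3, 10 ** 5, 10 ** 7, 10 ** 9]:
    b = math.ceil(r * math.log2(math.log2(n)))
    lam = math.log2(n) ** (2 * r)
    assert lam * 2 ** (-2 * b) <= 1
    print(n, b, lam * 2 ** (-2 * b))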
Finally, combining (7.133) with the definition of $$\mathcal{E}_3$$, we have   \begin{align} {\lambda(1-\tau)m\over n^2 b} \|{\hat{X}}^n-X^n\|^2 \leq \bar{d}_k({\mathbb{\mathbf{X}}})+2\delta, \end{align} (7.134) or   \begin{align}\label{eq:bound-error-triangle} {1\over\sqrt{ n}} \|{\hat{X}}^n-X^n\| &\leq \sqrt{ (\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)bn\over \lambda (1-\tau)m}, \end{align} (7.135) which proves that, for our set of parameters, conditioned on $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$, $${1\over \sqrt{n}}\|X^n-{\hat{X}}^n\|$$ can be made arbitrarily small. The parameter $$\tau$$ is set, and the convergence of $${\rm{P}}(\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3)$$ to one is proved, exactly as in the proof of Theorem 5.1. 7.9 Proof of Theorem 6.1 As argued in the proof of Theorem 5.1, given $$\delta>0$$, there exists $$b_{\delta}$$ such that for $$b>b_{\delta}$$,   \begin{align} {H([X_{k+1}]_b|[X^k]_b)\over b}\leq \bar{d}_k({\mathbb{\mathbf{X}}})+{\delta\over 2}.\label{eq:conv-dk-cond-H} \end{align} (7.136) In the rest of the proof we assume that $$n$$ is large enough so that $$b=b_n> b_{\delta}$$. Define distributions $$q_{k+1}$$ and $$\hat{q}_{k+1}$$ over $$\mathcal{X}_b^{k+1}$$ as follows. Let $$q_{k+1}$$ and $$\hat{q}_{k+1}$$ denote the distribution of $$[X^{k+1}]_b$$ and the empirical distribution of $$[X^n]_b$$, respectively. From (7.96) and (7.97), it follows that   \begin{align} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b)\leq \hat{H}_k([X^n]_b)+D_{\rm KL}(\hat{q}_{k+1}\| q_{k+1}).\label{eq:bd-dist-Hhat-W-Phat} \end{align} (7.137) Define events $$\mathcal{E}_1$$ and $$\mathcal{E}_2$$ as   \begin{align} \mathcal{E}_1=\{\sigma_{\max}(A)<\sqrt{n}+2\sqrt{m}\} \end{align} (7.138) and   \begin{align} \mathcal{E}_2=\{\|\hat{q}_{k+1}-q_{k+1}\|_1<{\delta'}\}, \end{align} (7.139) where $$\delta'>0$$ is selected such that   \begin{align} {1\over b}(-2\delta'\log {\delta'} +2\delta'(k+1)\log |\mathcal{X}_b|)\leq {\delta\over 4}\label{eq:delta-p-vs-delta} \end{align} (7.140) and   \begin{align} {1\over b}(-\delta'\log {\delta'}-\delta' \log f_{k+1} +2\delta'(k+1)\log |\mathcal{X}_b|)\leq {\delta\over 4},\label{eq:delta-p-vs-delta-2} \end{align} (7.141) for all $$b$$ large enough. This is always possible, since $$ (-2\delta'\log {\delta'})/b$$ is a decreasing function of $$b$$ and $$\log|\mathcal{X}_b|/b$$ can be upper bounded by a constant not depending on $$b$$. Hence, by picking $$\delta'$$ small enough, both (7.140) and (7.141) hold. Given this choice of parameters, from (7.102), conditioned on $$\mathcal{E}_2$$, we have   \begin{align} {1\over b}\left|\hat{H}_k([X^n]_b)-H([X_{k+1}]_b|[X^k]_b)\right|\leq {\delta\over 4}. \end{align} (7.142) Also, from Lemma 7.2 and (7.106), conditioned on $$\mathcal{E}_2$$,   \begin{align} {1\over b}D(\hat{q}_{k+1}\|q_{k+1} )&\leq -{\delta'\over b}(\log \delta'+\log f_{k+1}) +\left({2(k+1)\log |\mathcal{X}_b|\over b}\right)\delta'. \end{align} (7.143) Therefore, from (7.141),   \begin{align} {1\over b}D(\hat{q}_{k+1}\|q_{k+1} )&\leq {\delta\over 4}.
\end{align} (7.144) Conditioned on $$\mathcal{E}_2$$, from (7.136), (7.137), (7.142) and (7.144), it follows that   \begin{align} {1\over b}\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b)&\leq {\hat{H}_k([X^n]_b)\over b}+{1\over b}D_{\rm KL}(\hat{q}_{k+1}\| q_{k+1})\nonumber\\ &\leq {H([X_{k+1}]_b|[X^k]_b)\over b}+{\delta\over 4}+{\delta\over 4}\nonumber\\ &\leq \bar{d}_k({\mathbb{\mathbf{X}}})+{\delta\over 2}+{\delta\over 4}+{\delta\over 4}=\bar{d}_k({\mathbb{\mathbf{X}}})+\delta, \end{align} (7.145) which implies that, conditioned on $$\mathcal{E}_2$$, $$[X^n]_b\in\mathcal{F}_o$$. Moreover, since $${\hat{X}}^n(t+1)$$ is the solution of (6.2), automatically $${\hat{X}}^n(t+1)\in\mathcal{F}_o$$. Therefore, conditioned on $$\mathcal{E}_2$$,   \begin{align} \|{\hat{X}}^n(t+1)-S^n(t+1)\|\leq \|[X^n]_b-S^n(t+1)\|, \end{align} (7.146) or equivalently   \begin{align} \|{\hat{X}}^n(t+1)-[X^n]_b+[X^n]_b-S^n(t+1)\|&\leq \|[X^n]_b-S^n(t+1)\|.\label{eq:add-sub-Xon} \end{align} (7.147) Squaring both sides of (7.147) and canceling the common term from both sides, we derive   \begin{align} \|{\hat{X}}^n(t+1)-[X^n]_b\|^2 +2\langle {\hat{X}}^n(t+1)-[X^n]_b,[X^n]_b-S^n(t+1) \rangle &\leq 0. \end{align} (7.148) Plugging in the expression for $$S^n(t+1)$$ and using $$Y^m=AX^n+Z^m$$, we obtain   \begin{align} \|{\hat{X}}^n(t+1)-[X^n]_b\|^2 &\leq \;2\langle {\hat{X}}^n(t+1)-[X^n]_b,-[X^n]_b+S^n(t+1) \rangle\nonumber\\ &=2\langle {\hat{X}}^n(t+1)-[X^n]_b,-[X^n]_b+{\hat{X}}^n(t)+\mu A^{\top}(Y^m-A{\hat{X}}^n(t)) \rangle\nonumber\\ &=2\langle {\hat{X}}^n(t+1)-[X^n]_b,{\hat{X}}^n(t)-[X^n]_b \rangle\nonumber\\ &\quad-2\mu \langle {\hat{X}}^n(t+1)-[X^n]_b,A^{\top}A({\hat{X}}^n(t)-X^n) \rangle\nonumber\\ &\quad+2\mu \langle {\hat{X}}^n(t+1)-[X^n]_b,A^{\top}Z^m\rangle\nonumber\\ &=2\langle {\hat{X}}^n(t+1)-[X^n]_b,{\hat{X}}^n(t)-[X^n]_b \rangle\nonumber\\ &\quad-2\mu \langle {\hat{X}}^n(t+1)-[X^n]_b,A^{\top}A({\hat{X}}^n(t)-[X^n]_b) \rangle\nonumber\\ &\quad+2\mu \langle {\hat{X}}^n(t+1)-[X^n]_b,A^{\top}A(X^n-[X^n]_b) \rangle\nonumber\\ &\quad+2\mu \langle {\hat{X}}^n(t+1)-[X^n]_b,A^{\top}Z^m\rangle.\label{eq:Et+1-vs-Et-step1} \end{align} (7.149) Define   \begin{align} E^n(t) \triangleq {\hat{X}}^n(t)-[X^n]_b \end{align} (7.150) and   \begin{align} {\tilde{E}}^n(t) \triangleq {E^n(t)\over \|E^n(t)\|}. \end{align} (7.151) Then, it follows from (7.149) that   \begin{align} \|E^n(t+1)\| &\leq\;2\langle {\tilde{E}}^n(t+1), {\tilde{E}}^n(t)\rangle \|E^n(t)\| -2\mu \langle {\tilde{E}}^n(t+1),A^{\top}A {\tilde{E}}^n(t) \rangle \|E^n(t)\| \nonumber\\ &\qquad+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}A(X^n-[X^n]_b) \rangle\nonumber\\ &\qquad+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle\nonumber\\ &\quad=\;2\Big(\langle {\tilde{E}}^n(t+1), {\tilde{E}}^n(t)\rangle -\mu \langle A{\tilde{E}}^n(t+1),A {\tilde{E}}^n(t) \rangle \Big)\|E^n(t)\| \nonumber\\ &\qquad+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}A(X^n-[X^n]_b) \rangle\nonumber\\ &\qquad+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle \nonumber\\ &\quad\leq 2\Big(\langle {\tilde{E}}^n(t+1), {\tilde{E}}^n(t)\rangle -\mu \langle A{\tilde{E}}^n(t+1),A {\tilde{E}}^n(t) \rangle \Big)\|E^n(t)\| \nonumber\\ &\qquad+2\mu\sigma_{\max}(A^{\top}A)\|X^n-[X^n]_b\|\nonumber\\ &\qquad+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle.
\label{eq:Et+1-vs-Et-step2} \end{align} (7.152) Note that for a fixed $$X^n$$, $${\tilde{E}}^n(t)$$ and $${\tilde{E}}^n(t+1)$$ can only take a finite number of different values. Let $$\mathcal{S}_e$$ denote the set of all possible normalized error vectors. That is,   \begin{align} \mathcal{S}_e \triangleq \left\{{u^n-v^n \over \|u^n-v^n\|}: u^n,v^n\in\mathcal{F}_o \right\}. \end{align} (7.153) Clearly, $${\tilde{E}}^n(t)$$ and $${\tilde{E}}^n(t+1)$$ are both members of $$\mathcal{S}_e$$. Define event $$\mathcal{E}_3$$ as follows   \begin{align} \mathcal{E}_3\triangleq \Big\{ \langle u^n, v^n \rangle -{1\over m} \langle Au^n,A v^n\rangle \leq 0.45 :\; \forall \; (u^n,v^n)\in\mathcal{S}_e^2\Big\}. \end{align} (7.154) Conditioned on $$\mathcal{E}_1\cap \mathcal{E}_2\cap \mathcal{E}_3$$, it follows from (7.152) that   \begin{align} \|E^n(t+1)\| &\leq 0.9 \|E^n(t)\|+{2 (\sqrt{n}+2\sqrt{m})^2\over m}(2^{-b}\sqrt{n})\nonumber\\ &\;\;\;+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle. \label{eq:Et+1-vs-Et-noisy-step3} \end{align} (7.155) The only remaining term on the right-hand side of (7.155) is $$2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle$$. To upper bound this term, we employ Lemma 7.5. Let $$A_i^m$$, $$i=1,\ldots,n$$, denote the $$i$$th column of matrix $$A$$. Then,   \begin{align} A^{\top}Z^m=\left[\begin{array}{c} \langle A_1^m,Z^m \rangle\\ \langle A_2^m,Z^m \rangle\\ \vdots\\ \langle A_n^m,Z^m \rangle\\ \end{array} \right], \end{align} (7.156) and   \begin{align} \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle =\sum_{i=1}^n {\tilde{E}}_i(t+1) \langle A_i^m,Z^m \rangle. \end{align} (7.157) By Lemma 7.5, for each $$i$$, $$\langle A_i^m,Z^m \rangle$$ is distributed as $$\|Z^m\| G_i $$, where $$G_i$$ is a standard normal random variable independent of $$\|Z^m\|$$. Therefore, since the columns of matrix $$A$$ are also independent, overall $$\langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle$$ is distributed as   \begin{align} \|Z^m\|\sum_{i=1}^n {\tilde{E}}_i(t+1) G_i, \end{align} (7.158) where the $$G_i$$ are i.i.d. standard normal random variables independent of $$\|Z^m\|$$. Given $$\tau_1>0$$ and $$\tau_2>0$$, define events $$\mathcal{E}_4$$ and $$\mathcal{E}_5$$ as follows:   \begin{align} \mathcal{E}_4\triangleq \left\{{1\over m}\|Z^m\|^2\leq (1+\tau_1)\sigma^2\right\} \end{align} (7.159) and   \begin{align} \mathcal{E}_5\triangleq\{ |\langle\tilde{e}^n,G^n\rangle|^2\leq 1+\tau_2: \;\forall \tilde{e}^n\in\mathcal{S}_e\}. \end{align} (7.160) By Lemma 7.4,   \begin{align} {\rm{P}}(\mathcal{E}_4^c)&={\rm{P}}\Big({1\over \sigma^2}\|Z^m\|^2> (1+\tau_1)m\Big)\nonumber\\ &\leq {\rm e} ^{-\frac{m}{2}(\tau_1 - \ln(1+ \tau_1))}. \end{align} (7.161) For a fixed vector $$\tilde{e}^n\in\mathcal{S}_e$$, since $$\|\tilde{e}^n\|=1$$, $$\langle\tilde{e}^n,G^n\rangle =\sum_{i=1}^n \tilde{e}_iG_i$$ has a standard normal distribution and therefore, by letting $$m=1$$ in Lemma 7.4, it follows that   \begin{align} {\rm{P}}(|\langle\tilde{e}^n,G^n\rangle|^2 >1+\tau_2 )\leq {\rm e} ^{-0.5(\tau_2 - \ln(1+ \tau_2))}. \end{align} (7.162) Hence, by the union bound,   \begin{align} {\rm{P}}(\mathcal{E}_5^c)&\leq |\mathcal{S}_e|\,{\rm e} ^{-0.5(\tau_2 - \ln(1+ \tau_2))}\nonumber\\ &\leq 2^{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)}{\rm e} ^{-0.5(\tau_2 - \ln(1+ \tau_2))}, \end{align} (7.163) where the last inequality follows from (7.173) and (7.179).
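The distributional fact supplied by Lemma 7.5 is easy to check empirically: when the entries of $$A$$ are i.i.d. standard normal and independent of the noise, $$\langle A_i^m,Z^m\rangle/\|Z^m\|$$ should be standard normal, and likewise $$\langle\tilde{e}^n,G^n\rangle$$ for any fixed unit vector $$\tilde{e}^n$$. The following minimal Monte Carlo sketch, with hypothetical dimensions chosen purely for illustration, checks the first of these claims.

import numpy as np

# Empirical check that <A_i, Z> / ||Z|| is approximately N(0,1) when the
# entries of A are i.i.d. N(0,1) and independent of Z.
rng = np.random.default_rng(0)
m, trials = 200, 20000
samples = np.empty(trials)
for t in range(trials):
    z = 0.5 * rng.normal(size=m)      # noise vector (any variance works)
    a_col = rng.normal(size=m)        # one column of A
    samples[t] = a_col @ z / np.linalg.norm(z)
print(samples.mean(), samples.std())  # should be close to 0 and 1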
Returning to the bound on $${\rm{P}}(\mathcal{E}_5^c)$$: for $$\tau_2>7$$,   \begin{align} {\rm e} ^{-0.5(\tau_2 - \ln(1+ \tau_2))}\leq 2^{-0.5 \tau_2}, \end{align} (7.164) which implies that for $$\tau_2>7$$,   \begin{align} {\rm{P}}(\mathcal{E}_5^c)\leq 2^{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)-0.5\tau_2}. \end{align} (7.165) Setting   \begin{align} \tau_2=2nb(\bar{d}_k({\mathbb{\mathbf{X}}})+3\delta)-1 \end{align} (7.166) ensures that   \begin{align} {\rm{P}}(\mathcal{E}_5^c)\leq 2^{-\delta bn+0.5}, \end{align} (7.167) which converges to zero as $$n$$ grows to infinity. Finally, setting $$\tau_1=1$$, conditioned on $$\mathcal{E}_4\cap\mathcal{E}_5$$, we have   \begin{align} 2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle& ={2\over m} \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle \nonumber\\ &\leq {2\over m}\sqrt{m(1+\tau_1)\sigma^2(1+\tau_2)}\nonumber\\ &= 2\sqrt{(2m\sigma^2) (2nb(\bar{d}_k({\mathbb{\mathbf{X}}})+3\delta))\over m^2}\nonumber\\ &= 4\sigma\sqrt{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+3\delta)\over m}.\label{eq:bd-noisy-proj} \end{align} (7.168) Inserting (7.168) into (7.155) yields the desired recursive bound on $$\|E^n(t+1)\|$$. To finish the proof, we need to show that $${\rm{P}}(\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3)$$ also approaches one, as $$n$$ grows without bound. Reference [5] proves that   \begin{align} {\rm{P}}(\mathcal{E}_1^c)\leq 2^{-m/2}. \end{align} (7.169) By Theorem 4.2, there exists integer $$g_{\delta'}$$, depending only on $$\delta'$$ and the source distribution, such that for any $$n>6(k+g_{\delta'})/(b\delta)+k$$,   \begin{align} {\rm{P}}(\mathcal{E}_2^c)\leq 2^{c\delta'^2/8} (k+g_{\delta'}+1)n^{|\mathcal{X}_b|^{k+1}}2^{-{nc \delta'^2 \over 8(k+g_{\delta'}+1)}}, \end{align} (7.170) where $$c=1/(2\ln 2)$$. This proves that for our choice of parameters, $${\rm{P}}(\mathcal{E}_2^c)$$ converges to zero. In the rest of the proof we bound $${\rm{P}}(\mathcal{E}_3^c)$$. From Corollary 7.1, for any $$u^n,v^n\in\mathcal{S}_e$$,   \begin{align} {\rm{P}}\Big({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -0.45\Big)\leq 2^{-0.05m}. \end{align} (7.171) Therefore, by the union bound,   \begin{align} {\rm{P}}(\mathcal{E}_3^c)&={\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -0.45\ {\rm for}\;{\rm some}\;(u^n,v^n)\in\mathcal{S}_e^2\right)\nonumber\\ &\leq |\mathcal{S}_e|^2 2^{-0.05m}.\label{eq:ub-Ec1-union} \end{align} (7.172) Note that   \begin{align} |\mathcal{S}_e|\leq |\mathcal{F}_o|^2.\label{eq:size-Se-vs-Fo} \end{align} (7.173) In the following, we derive an upper bound on the size of $$\mathcal{F}_o$$. For any $$u^n\in\mathcal{F}_o$$, by definition, we have   \begin{align} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|u^n)\leq b(\bar{d}_k({\mathbb{\mathbf{X}}})+\delta).\label{eq:compare-cost-linear-W} \end{align} (7.174) Let $$\hat{q}_{k+1}^{(u)}=\hat{p}^{(k+1)}(\cdot|u^n)$$ denote the $$(k+1)$$th order empirical distribution of $$u^n$$. Following the steps used in deriving (7.92), it follows that   \begin{align}\label{eq:bd-HXt} &\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|u^n)\nonumber\\ &\quad=\sum_{a^k}\hat{q}_{k}^{(u)}(a^k) D_{\rm KL}(\hat{q}_{k+1}^{(u)}(\cdot|a^k)\| q_{k+1}(\cdot|a^k))+ \hat{H}_{k}(u^n). \end{align} (7.175) Since all the terms on the right-hand side of (7.175) are non-negative, it follows from (7.174) that   \begin{align} \hat{H}_{k}(u^n)&\leq b(\bar{d}_k({\mathbb{\mathbf{X}}})+\delta).
\label{eq:bd-Hhat-Xh-Xo} \end{align} (7.176) On the other hand, given our choice of quantization level $$b$$, for $$n$$ large enough, for any $$v^n\in\mathcal{X}_b^n$$,   \begin{align} {1\over nb}\ell_{\rm LZ}(v^n)&\leq {1\over b} \hat{H}_k(v^n)+\delta.\label{eq:UB-LZ} \end{align} (7.177) Therefore, for any $$u^n\in\mathcal{F}_o$$, from (7.176) and (7.177), it follows that   \begin{align} {1\over nb}\ell_{\rm LZ}(u^n)&\leq {1\over b} \hat{H}_k(u^n)+\delta \nonumber\\ &\leq \bar{d}_k({\mathbb{\mathbf{X}}})+2\delta. \end{align} (7.178) Note that, from (7.3), we have   \begin{align} |\mathcal{F}_o|\leq \left|\left\{v^n:\; {1\over nb}\ell_{\rm LZ}(v^n)\leq\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta\right\}\right|\leq 2^{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)}.\label{eq:size-Fo} \end{align} (7.179) Hence, from (7.172) and (7.173),   \begin{align} &{\rm{P}}(\mathcal{E}_3^c)\leq |\mathcal{F}_o|^4 2^{-0.05m}\leq 2^{ 4nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)-0.05m}\leq 2^{-2nb \delta},\label{eq:bound-et-n} \end{align} (7.180) where the last inequality follows from the assumed lower bound on the number of response variables $$m$$. 8. Conclusion For a stationary process $${{\mathbb{\mathbf{X}}}}$$, we have studied the problem of estimating $$X^n$$ from $$m$$ response variables $$Y^m = AX^n+Z^m$$, under the assumption that the distribution of $${{\mathbb{\mathbf{X}}}}$$ is known. We have proposed the Q-MAP optimization, which estimates $$X^n$$ from $$Y^m$$. The new optimization satisfies the following properties: (i) It applies to generic classes of distributions, as long as they satisfy certain mixing conditions. (ii) Unlike other Bayesian approaches, in high-dimensional settings, the performance of the Q-MAP optimization can be characterized for generic distributions. Our analyses show that, for certain distributions such as spike-and-slab, asymptotically, Q-MAP achieves the minimum required sampling rate. (Whether Q-MAP achieves the optimal sampling rate for general distributions is still an open question.) (iii) PGD can be applied to approximate the solution of Q-MAP. While the optimization involved in Q-MAP is non-convex, we have characterized the performance of the corresponding PGD algorithm, under both noiseless and noisy settings. Our analysis has revealed that, with slightly more measurements than Q-MAP requires, the PGD-based method recovers $$X^n$$ accurately in the noiseless setting. Funding National Science Foundation (CCF-1420328).
References
1. Bickel P. J., Ritov Y. & Tsybakov A. B. (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37, 1705–1732.
2. Blumensath T. & Davies M. E. (2009) Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal., 27, 265–274.
3. Bradley R. C. (2005) Basic properties of strong mixing conditions. A survey and some open questions. Probab. Surv., 2, 107–144.
4. Candes E. & Tao T. (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35, 2313–2351.
5. Candès E., Romberg J. & Tao T. (2005) Decoding by linear programming. IEEE Trans. Inform. Theory, 51, 4203–4215.
6. Cover T. & Thomas J. (2006) Elements of Information Theory, 2nd edn. New York: Wiley.
7. Csiszar I. & Körner J. (2011) Information Theory: Coding Theorems for Discrete Memoryless Systems. New York, NY: Cambridge University Press.
8. Donoho D. & Montanari A. (2016) High dimensional robust m-estimation: asymptotic variance via approximate message passing. Probab. Theory Relat. Fields, 166, 935–969.
9. Guo D. & Verdú S. (2005) Randomly spread CDMA: Asymptotics via statistical physics. IEEE Trans. Inform. Theory, 51, 1983–2010.
10. Hans C. (2009) Bayesian Lasso regression. Biometrika, 96, 835–845.
11. Hans C. (2010) Model uncertainty and variable selection in Bayesian Lasso regression. Statist. Comput., 20, 221–229.
12. Hoadley B. (1970) A Bayesian look at inverse linear regression. J. Amer. Statist. Assoc., 65, 356–369.
13. Jalali S., Maleki A. & Baraniuk R. (2014) Minimum complexity pursuit for universal compressed sensing. IEEE Trans. Inform. Theory, 60, 2253–2268.
14. Jalali S., Montanari A. & Weissman T. (2012) Lossy compression of discrete sources via the Viterbi algorithm. IEEE Trans. Inform. Theory, 58, 2475–2489.
15. Jalali S. & Poor H. V. (2017) Universal compressed sensing for almost lossless recovery. IEEE Trans. Inform. Theory, 63, 2933–2953.
16. Lindley D. V. & Smith A. F. (1972) Bayes estimates for the linear model. J. R. Statist. Soc. Ser. B, 34, 1–41.
17. Liu C. (1996) Bayesian robust multivariate linear regression with incomplete data. J. Amer. Statist. Assoc., 91, 1219–1227.
18. Maleki A. (2010) Approximate message passing algorithm for compressed sensing. PhD Thesis, Stanford University.
19. Mitchell T. J. & Beauchamp J. J. (1988) Bayesian variable selection in linear regression. J. Amer. Statist. Assoc., 83, 1023–1032.
20. O'Hara R. B. & Sillanpää M. J. (2009) A review of Bayesian variable selection methods: what, how and which. Bayes. Anal., 4, 85–117.
21. Park T. & Casella G. (2008) The Bayesian Lasso. J. Amer. Statist. Assoc., 103, 681–686.
22. Plotnik E., Weinberger M. J. & Ziv J. (1992) Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel-Ziv algorithm. IEEE Trans. Inform. Theory, 38, 66–72.
23. Rényi A. (1959) On the dimension and entropy of probability distributions. Acta Math. Acad. Sci. Hung., 10, 193–215.
24. Rigollet P. & Tsybakov A. (2011) Exponential screening and optimal rates of sparse estimation. Ann. Statist., 39, 731–771.
25. Shields P. (1996a) The Ergodic Theory of Discrete Sample Paths. Providence, RI: American Mathematical Society.
26. Shihao J., Ya X. & Carin L. (2008) Bayesian Compressive Sensing. IEEE Trans. Signal Process., 56, 2346–2356.
27. Som S. & Schniter P. (2012) Compressive imaging using approximate message passing and a Markov-tree prior. IEEE Trans. Signal Process., 60, 3439–3448.
28. Su W. & Candes E. (2016) SLOPE is adaptive to unknown sparsity and asymptotically minimax. Ann. Statist., 44, 1038–1068.
29. Tibshirani R., Saunders M., Rosset S., Zhu J. & Knight K. (2005) Sparsity and smoothness via the fused Lasso. J. R. Statist. Soc. Ser. B, 67, 91–108.
30. Tipping M. E. (2001) Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1, 211–244.
31. Viterbi A. J. (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory, 13, 260–269.
32. West M. (1984) Outlier models and prior distributions in Bayesian linear regression. J. R. Statist. Soc. Ser. B, 46, 431–439.
33. Wu Y. & Verdú S. (2010) Rényi information dimension: fundamental limits of almost lossless analog compression. IEEE Trans. Inform. Theory, 56, 3721–3748.
34. Ziv J. & Lempel A. (1978) Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory, 24, 530–536.
© The authors 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.
Abstract

Abstract Consider the problem of estimating parameters $$X^n \in \mathbb{R}^n $$, from $$m$$ response variables $$Y^m = AX^n+Z^m$$, under the assumption that the distribution of $$X^n$$ is known. Lack of computationally feasible algorithms that employ generic prior distributions and provide a good estimate of $$X^n$$ has limited the set of distributions researchers use to model the data. To address this challenge, in this article, a new estimation scheme named quantized maximum a posteriori (Q-MAP) is proposed. The new method has the following properties: (i) In the noiseless setting, it has similarities to maximum a posteriori (MAP) estimation. (ii) In the noiseless setting, when $$X_1,\ldots,X_n$$ are independent and identically distributed, asymptotically, as $$n$$ grows to infinity, its required sampling rate ($$m/n$$) for an almost zero-distortion recovery approaches the fundamental limits. (iii) It scales favorably with the dimensions of the problem and therefore is applicable to high-dimensional setups. (iv) The solution of the Q-MAP optimization can be found via a proposed iterative algorithm that is provably robust to error (noise) in response variables. 1. Introduction Consider the problem of Bayesian linear regression defined as follows; $${{\mathbb{\mathbf{X}}}}=\{X_i\}_{i=1}^{\infty}$$ denotes a stationary random process whose distribution is known. The goal is to estimate $$X^n$$ from $$m$$ response variables of the form $$Y^m=AX^n + Z^m$$, where $$A\in{\rm I}\kern-0.20em{\rm R}^{m\times n}$$ and $$Z^m\in{\rm I}\kern-0.20em{\rm R}^m$$ denote the design matrix and the noise vector, respectively. (Both cases of $$m<n$$ and $$m\geq n$$ are of interest and valid under our model.) To solve this problem, there are two fundamental questions that can be raised: How should we use the distribution of $$X^n$$ to obtain an efficient estimator? To answer this question, there are two main criteria that should be taken into account: (i) computational complexity: how efficiently can the estimate be computed? (ii) accuracy: how well does the estimator perform? If we ignore the computational complexity constraint, then the answer to our first question is simple. An optimal Bayes estimator seeks to minimize the Bayes risk defined as $${\rm {E}}[\ell(\hat{X}^n, X^n)]$$, where $$\ell: {\rm I}\kern-0.20em{\rm R}^n\times{\rm I}\kern-0.20em{\rm R}^n\to{\rm I}\kern-0.20em{\rm R}^+$$ denotes the considered loss function. For instance, $$\ell(x^n,{\hat{X}}^n)=\left\|x^n-{\hat{X}}^n\right\|_2^2$$ leads to the minimum mean square error (MMSE) estimator. However, the computational complexity of MMSE estimator for generic distributions is very high. Can the performance of the estimator be analyzed in high-dimensional settings? The answer to this question is also complicated. Even the performance analysis of standard estimators such as MMSE estimator is challenging. In fact, even if we assume that $$\mathbf{X}$$ is an independent and identically distributed (i.i.d.) process, the analysis is still complicated and heuristic tools such as replica method from statistical physics have been employed to achieve this goal. In response to the above two questions, we propose an optimization problem, which we refer to as quantized maximum a posteriori (Q-MAP). We then show how this optimization problem can be analyzed and solved. Before presenting the Q-MAP optimization, we introduce some notation. 1.1 Notation Calligraphic letters such as $$\mathcal{X}$$ and $$\mathcal{Y}$$ denote sets. 
The size of a set $$\mathcal{X}$$ is denoted by $$|\mathcal{X}|$$. Given vector $$(v_1,v_2,\ldots,v_n)\in{\rm I}\kern-0.20em{\rm R}^n$$ and integers $$i,j\in\{1,\ldots,n\}$$, where $$i\leq j$$, $$v_i^j\triangleq (v_i,v_{i+1},\ldots,v_j).$$ For simplicity $$v_1^j$$ and $$v_j^j$$ are denoted by $$v^j$$ and $$v_j$$, respectively. For two vectors $$u^n$$ and $$v^n$$, both in $${\rm I}\kern-0.20em{\rm R}^n$$, let $$\langle u^n,v^n\rangle$$ denote their inner product defined as $$\langle u^n,v^n\rangle\triangleq \sum_{i=1}^nu_iv_i$$. The all-zero and all-one vectors of length $$n$$ are denoted by $$0^n$$ and $$1^n$$, respectively. Uppercase letters such as $$X$$ and $$Y$$ denote random variables. The alphabet of a random variable $$X$$ is denoted by $$\mathcal{X}$$. The entropy of a finite-alphabet random variable $$U$$ with probability mass function (pmf) $$p(u)$$, $$u\in\mathcal{U}$$, is defined as $$H(U)=-\sum_{u\in\mathcal{U}}p(u)\log{ p(u)}$$ [6]. Given finite-alphabet random variables $$U$$ and $$V$$ with joint pmf $$p(u,v)$$, $$(u,v)\in\mathcal{U}\times \mathcal{V}$$, the conditional entropy of $$U$$ given $$V$$ is defined as $$H(U|V)=-\sum_{(u,v)\in\mathcal{U}\times \mathcal{V}} p(u,v)\log p(u|v)$$. Matrices are also denoted by uppercase letters such as $$A$$ and $$B$$ and are differentiated from random variables by context. Throughout the article $$\log$$ and $$\ln$$ refer to logarithm in base 2 and the natural logarithm, respectively. For $$x\in{\rm I}\kern-0.20em{\rm R}$$, $$\lfloor x\rfloor$$ denotes the largest integer smaller than or equal to $$x$$. Therefore, $$0\leq x-\lfloor x\rfloor<1$$, for all $$x$$. The $$b$$-bit quantized version of $$x$$ is denoted by $$[x]_b$$ and is defined as   \begin{align} [x]_b\triangleq\lfloor x\rfloor+\sum_{i=1}^b2^{-i}a_i, \end{align} (1.1) where for all $$i$$, $$a_i\in\{0,1\}$$, and $$0.a_1a_2\ldots$$ denotes the binary representation of $$x-\lfloor x\rfloor$$. When $$x-\lfloor x\rfloor$$ is a dyadic real number, which have two possible binary representations, let $$0.a_1a_2\ldots$$ denote the representation which has a finite number of ones. For a vector $$x^n\in{\rm I}\kern-0.20em{\rm R}^n$$, $$[x^n]_b\triangleq([x_1]_b,\ldots,[x_n]_b).$$ Consider a vector $$u^n\in\mathcal{U}^n$$, where $$|\mathcal{U}|<\infty$$. The $$(k+1){\rm th}$$ order empirical distribution of $$u^n$$ is denoted by $$\hat{p}^{(k+1)}$$ and is defined as follows: for any $$a^{k+1}\in \mathcal{U}^{k+1}$$,   \begin{align} \hat{p}^{(k+1)}(a^{k+1}| u^n)&\triangleq {|\{i: u_{i-k}^i=a^{k+1}, k+1\leq i\leq n\}|\over n-k}\nonumber\\ &={1\over n-k}\sum_{i=k+1}^n1_{u_{i-k}^{i}=a^{k+1}},\label{eq:emp-dist} \end{align} (1.2) where $$1_{\mathcal{E}}$$ denotes the indicator function of event $$\mathcal{E}$$. In other words, $$\hat{p}^{(k+1)}(.| u^n)$$ denotes the distribution of a randomly selected length-$$(k+1)$$ substring of $$u^n$$. 1.2 Contributions Consider the stochastic process $${{\mathbb{\mathbf{X}}}}=\{X_i\}_{i=1}^{\infty}$$ discussed earlier. Let $$X_i\in\mathcal{X}$$, for all $$i$$. Assume that $$\mathcal{X}$$ is a bounded subset of $${\rm I}\kern-0.20em{\rm R}$$. Define the $$b$$-bit quantized version of $$\mathcal{X}$$ as   \begin{align} \mathcal{X}_b\triangleq \{[x]_b: \;x\in\mathcal{X}\}. \end{align} (1.3) Note that since $$\mathcal{X}$$ is a bounded set, $$\mathcal{X}_b$$ is a finite set, ie $$|\mathcal{X}_b|<\infty$$. 
Given $$k\in{\rm I}\kern-0.20em{\rm N}^+$$ and a set of non-negative weights $${\bf w}=(w_{a^{k+1}}:\;a^{k+1}\in \mathcal{X}_b^{k+1})$$, define function $$c_{{\bf w}}: {\bf X}_b^n\to {\rm I}\kern-0.20em{\rm R}$$ as follows. For $$u^n\in\mathcal{X}_b^n$$,   \begin{align}\label{eq:def-cw} c_{{\bf w}}(u^n)\triangleq \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|u^n). \end{align} (1.4) As it will be described later, for a proper choice of weights $${\bf w}$$, this function measures both the level of ‘structuredness’ of sequences in $$\mathcal{X}_b^n$$ and also how closely they match the $$(k+1)^{\rm th}$$ order distribution of the process $${{\mathbb{\mathbf{X}}}}$$. For instance, as will be shown later in Section 3.1, for an i.i.d. process $${{\mathbb{\mathbf{X}}}}$$ with $$X_i\sim (1- p)\delta_0+ pf_c$$, where $$p\in(0,1)$$, $$\delta_0$$ denotes a point mass at zero, and $$f_c$$ denotes an absolutely continuous distribution over a bounded set, a bound of the form $$c_{{\bf w}}(u^n)\leq \gamma$$, with $$k=0$$, simplifies to a bound on the $$\ell_0$$-norm of sequence $$u^n$$. Hence, intuitively speaking, the constraint $$ c_{{\bf w}}(u^n) \leq \gamma$$ holds for ‘structured’ sequences that comply with the known source model. For a stationary process $${{\mathbb{\mathbf{X}}}}$$, the Q-MAP optimization estimates $$X^n$$ from noisy linear measurements $$Y^m=AX^n+Z^m$$, by solving the following optimization:   \begin{align}\label{eq:Q-MAP} {\hat{X}}^n&\;\;= \;\; \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n} \;\;\;\left\|Au^n-Y^m\right\|^2 \nonumber \\ & \;\;\;\;\;\; \;\;\;{\rm subject \ to}\;\;\;\ c_{{\bf w}}(u^n) \leq \gamma_n, \end{align} (1.5) where $$\gamma_n$$ is a number that may depend on $$n$$ and the distribution of $${{\mathbb{\mathbf{X}}}}$$, $$b$$ and $$k$$ are parameters that need to be set properly, and the non-negative weights $${\bf w}=(w_{a^{k+1}}:\;a^{k+1}\in \mathcal{X}_b^{k+1})$$ are defined as a function of the $$(k+1)$$th order marginal distribution of stationary process $${{\mathbb{\mathbf{X}}}}$$ as:   \begin{align} w_{a^{k+1}}\triangleq \log {1\over {\rm {P}}([X_{k+1}]_b=a_{k+1}|[X^k]_b=a^k)}.\label{eq:coeffs-Q-MAP} \end{align} (1.6) For this specific choice of $${\bf w}$$, the function $$c_{{\bf w}}$$ can also be written as   \begin{align}\label{eq:simplify-cw} c_{{\bf w}}(u^n)&= \sum_{a^k}\hat{q}_{k}(a^k) D_{\rm KL}(\hat{q}_{k+1}(\cdot|a^k)\| q_{k+1}(\cdot|a^k)) + \hat{H}_{k}(u^n), \end{align} (1.7) where $$\hat{q}_{k}$$, $$\hat{q}_{k+1}$$ and $${q}_{k+1}$$ denote the $$k{\rm th}$$ order empirical distribution induced by $$u^n$$, the $$(k+1){\rm th}$$ order empirical distribution induced by $$u^n$$ and the distribution of $$[X^{k+1}]_b$$, respectively. (The derivation of (1.7) can be found in (7.92), as part of the proof of Theorem 5.1.) This alternative expression reveals how $$c_{{\bf w}}(u^n)$$ captures both the structuredness of $$u^n$$ through $$ \hat{H}_{k}(u^n)$$ and the similarity between the empirical distribution (type) of $$u^n$$ and the distribution of the quantized source process $${{\mathbb{\mathbf{X}}}}$$ through $$\sum_{a^k}\hat{q}_{k}(a^k) D_{\rm KL}(\hat{q}_{k+1}(\cdot|a^k)\| q_{k+1}(\cdot|a^k))$$. Remark 1.1 To better understand the weights $${\bf w}$$ specified in (1.6), consider an i.i.d. sparse process $${{\mathbb{\mathbf{X}}}}$$ distributed as $$p f_c+(1-p)\delta_0$$. Here, $$p\in[0,1]$$ and $$f_c$$ denotes an absolutely continuous distribution over a bounded interval $$(x_1,x_2)$$, where $$x_1<x_2$$. 
In that case, since the process is memoryless, it is enough to study the weights for $$k=0$$. (Note that $${\rm {P}}([X_{k+1}]_b=a_{k+1}|[X^k]_b=a^k)={\rm {P}}([X_{k+1}]_b=a_{k+1})$$, for every $$a^{k+1}\in\mathcal{X}_b^{k+1}$$.) For $$k=0$$, for $$a\in\mathcal{X}_b$$, $$w_a= \log {1\over {\rm {P}}([X_{1}]_b=a)}$$. On the other hand, it is straightforward to check that for $$b$$ large, $${\rm {P}}([X_{1}]_b=0)\approx 1-p$$. For $$a\in\mathcal{X}_b$$, $$a\neq 0$$, by the mean value theorem,   \[ {\rm {P}}([X_{1}]_b=a)=2^{-b}f_c(x_a), \] for some $$x_a\in(a,a+2^{-b}]$$. Therefore, $$w_0\approx -\log(1-p)$$, and, for $$a\in\mathcal{X}_b$$, $$a\neq 0$$,   \[ w_a=b-\log f_c(x_a). \] It can be observed that, as $$b$$ grows, all weights, except $$w_0$$, grow to infinity, at a rate linear in $$b$$. (This issue and its implications are discussed in further details in Section 3.) Note that minimizing $$\|Au^n - Y^m\|^2$$ is natural since we would like to obtain a parameter vector that matches the response variables. However, with no constraint on potential solution vectors, the estimate will suffer from overfitting (unless $$m$$ is much larger than $$n$$). Hence, some constraints should be imposed on the set of potential solutions. In Q-MAP optimization, this constraint requires a potential solution $$u^n\in\mathcal{X}_n^n$$ to satisfy   \begin{align} c_{{\bf w}}(u^n) \leq \gamma_n. \end{align} (1.8) There are two other features of the above optimization that are worth emphasis and clarification at this point: Quantized reconstructions: While the parameter vector $$X^n$$ and the response variables $$Y^m$$ are typically real valued, the estimate produced by the Q-MAP optimization lies in the quantized space $$\mathcal{X}_b^n$$. The motivation for this quantization will be explained later, but in a nutshell, this step helps both the theoretical analysis and also the implementation of the optimization. Memory parameter ($$k$$): again both for the convenience of the theoretical analysis and also for the ease of implementation, only dependencies captured by the $$(k+1)$$th order probability distribution of the process $${{\mathbb{\mathbf{X}}}}$$ are taken into account in the Q-MAP optimization. This memory parameter is an free parameter that can be selected based on the source distribution. As shown later, for instance, in the noiseless setting, for an i.i.d. process, $$k=0$$ is enough to achieve the fundamental limits in terms of sampling rate ($$m/n$$). Although the Q-MAP optimization provides a new approach to Bayesian compressed sensing, it is still not an easy optimization problem. For instance, for the i.i.d. distribution mentioned earlier, $$(1- p)\delta_0+ p f_c$$, the constraint becomes equivalent to having an upper bound on $$\|u^n\|_0$$. This is similar to the notoriously difficult optimal variable selection problem. Hence, despite the fact that the Q-MAP optimization problems might appear simpler than other estimators such as MMSE, in fact it can still be computationally infeasible. However, inspired by the projected gradient descent (PGD) method in convex optimization, we propose the following algorithm to solve the Q-MAP optimization. Define   \begin{align} \mathcal{F}_o\triangleq \big\{u^n\in\mathcal{X}_b^n: \; c_{{\bf w}}(u^n)\leq \gamma_n \big\},\label{eq:def-Fc-rand} \end{align} (1.9) where function $$c_{{\bf w}}$$ is defined in (1.4). 
Note that the set $$\mathcal{F}_o$$ depends on quantization level $$b$$, memory parameter $$k$$, weights $${\bf w}=\{w_{a^{k+1}}\}_{a^{k+1}\in\mathcal{X}_b^{k+1}}$$ and also parameter $$\gamma_n$$. The PGD algorithm generates a sequence of estimates $${\hat{X}}^n(t)$$, $$t=0,1,\ldots$$ of the sequence $$X^n$$. It starts by setting $${\hat{X}}^n(0)=0^n$$, and proceeds by updating $${\hat{X}}^n(t)$$, its estimate at time $$t$$, as follows   \begin{align} S^n(t+1)&={\hat{X}}^n(t)+\mu A^{\top}(Y^m-A{\hat{X}}^n(t))\nonumber\\ {\hat{X}}^n(t+1)&=\mathop{\text{arg min}}\limits_{u^n\in\mathcal{F}_o}\left\|u^n-S^n(t+1)\right\|\!,\label{eq:PGD-update} \end{align} (1.10) where $$\mu$$ denotes the step-size and ensures that the algorithm does not diverge to infinity. Intuitively, the above procedure, at each step, moves the current estimate toward the $$Ax^n=Y^m$$ hyperplane and then projects the new estimate to the set of structured vectors. As will be proved later, when $$m$$ is large enough, in the noiseless setting, the estimates provided by the PGD algorithm converge to $$X^n$$, with high probability. The challenging step in running the PGD method is the projection step, which requires finding the closest point to $$S^n(t+1)$$ in $$\mathcal{F}_o$$. For some special distributions such as sparse or piecewise-constant discussed in Section 3, the corresponding set $$\mathcal{F}_o$$ has a special form that simplifies the projection con-siderably. In general, while projection to a non-convex discrete set can be complicated, as we will discuss in Section 6.2, we believe that because of the special structure of the set $$\mathcal{F}_o$$, a dynamic programming approach can be used for performing this projection. More specifically, we will explain how a Viterbi algorithm [31] with $$2^{bk}$$ states and $$n$$ stages can be used for this purpose. Hence, the complexity of the proposed method for doing the projection task required by the PGD grows linearly in $$n$$, but exponentially in $$kb$$. We expect that for ‘structured distributions’ the scaling with $$b$$ and $$k$$ can be improved much beyond this. We will describe our intuition in Section 6.2 but leave the formal exploration of this direction to future research. The main question we have not addressed yet is how well the Q-MAP optimization and the proposed PGD method recover $$X^n$$ from $$Y^m$$. In the next few paragraphs, we informally state our main results regarding the performance of the QMAP optimization and the PGD-based algorithm. Before that, note that, to recover $$X^n$$ from $$m<n$$ response variables, intuitively, the process should be of structured. Hence, first, we briefly review a measure of structuredness developed for real-valued stochastic processes. The $$k$$th order upper information dimension of stationary process $${{\mathbb{\mathbf{X}}}}$$ is defined as   \begin{equation}\label{eq:first_appearanced_k} \bar{d}_k({{\mathbb{\mathbf{X}}}})\triangleq \limsup_{b\to \infty} {H([X_{k+1}]_b|[X^k]_b) \over b}, \end{equation} (1.11) where $$H([X_{k+1}]_b|[X^k]_b) $$ denotes the conditional entropy of $$[X_{k+1}]_b$$ given $$[X^k]_b$$. Similarly, the lower $$k$$th order upper information dimension of $${{\mathbb{\mathbf{X}}}}$$ is defined as   \begin{align} \underline{d}_k({{\mathbb{\mathbf{X}}}})\triangleq \liminf_{b\to \infty} {H([X_{k+1}]_b|[X^k]_b) \over b}. 
\end{align} (1.12) If $$\bar{d}_k({{\mathbb{\mathbf{X}}}})=\underline{d}_k({{\mathbb{\mathbf{X}}}})$$, then the $$k$$th order information dimension of process $${{\mathbb{\mathbf{X}}}}$$ is defined as $${d}_k({{\mathbb{\mathbf{X}}}})=\bar{d}_k({{\mathbb{\mathbf{X}}}})=\underline{d}_k({{\mathbb{\mathbf{X}}}})$$ [15]. For $$k=0$$, $$\bar{d}_k({{\mathbb{\mathbf{X}}}})$$ ($$\underline{d}_k({{\mathbb{\mathbf{X}}}})$$) is equal to the upper (lower) Rényi information dimension of $$X_1$$ [23], which is a well-known measure of structuredness for real-valued random variables or random vectors. It can be proved that for all stationary sources with $$H(\lfloor X_1 \rfloor)$$, $$\bar{d}_k({{\mathbb{\mathbf{X}}}})\leq 1$$ and $$\underline{d}_k({{\mathbb{\mathbf{X}}}})\leq 1$$ [15]. To gain some insight on these definitions, consider an i.i.d. process $${{\mathbb{\mathbf{X}}}}$$ with $$X_1 \sim (1-p) \delta_0(x) + p {\rm Unif}(0,1)$$, where $${\rm Unif}$$ denotes a uniform distribution. This is called the spike and slab prior [19]. It can be proved that for this process $${d}_0({{\mathbb{\mathbf{X}}}})=\bar{d}_0({{\mathbb{\mathbf{X}}}})=\underline{d}_0({{\mathbb{\mathbf{X}}}}) =p$$ [23]. For general stationary sources with infinite memory, the limit of $$\bar{d}_k({{\mathbb{\mathbf{X}}}})$$ as $$k$$ grows to infinity is defined as the upper information dimension of the process $${{\mathbb{\mathbf{X}}}}$$ and is denoted by $$\bar{d}_o({{\mathbb{\mathbf{X}}}})$$. As argued in [15], the information dimension of a process measures its level of structuredness, and is related to the number of response variables required for its accurate recovery. On the basis of these definitions and concepts, we state our results in the following. The exposition of our results is informal and lacks many details. All the details will be clarified later in the article. Informal Result 1. Consider the noiseless setting ($$Z^m=0^m$$), and assume that the elements of the design matrix $$A$$ are i.i.d. Gaussian. Further assume that the process $${{\mathbb{\mathbf{X}}}}$$ satisfies certain mixing conditions, and for a fixed $$k$$, $${m\over n}> \bar{d}_k({{\mathbb{\mathbf{X}}}})$$. Then, asymptotically, for a proper quantization level which grows with $$n$$, the Q-MAP optimization recovers $$X^n$$ with high probability. There is an interesting feature of this result that we would like to emphasize here: (i) if $$ \bar{d}_k({{\mathbb{\mathbf{X}}}})$$ is strictly smaller than $$1$$, then we can estimate $$X^n$$ accurately, from $$m <n$$ response variables. In fact the smaller $$\bar{d}_k({{\mathbb{\mathbf{X}}}})$$, the less response variables are required. In particular, we can consider the spike and slab prior we discussed before that corresponds to sparse parameter vectors that are studied in the literature [1,4]. For this prior $$\bar{d}_0({{\mathbb{\mathbf{X}}}}) =p$$. Hence, as long as $$m> np$$, asymptotically, the estimate of Q-MAP with $$k=0$$ will be accurate. Note that $$np$$ is in fact the expected number of non-zero elements in $$\beta$$. We believe that even an MMSE estimator that employs only the $$k{\rm th}$$ order distribution of the source cannot recover with a smaller number of response variables. We present some examples that confirm our claim; however, the optimality of the result we obtain above is an open question that we leave for future research. The above result is for Q-MAP that is still computationally complicated. Our next result is about our proposed PGD-based algorithm. Informal Result 2. 
Consider again the noiseless setting, and assume that the elements of $$A$$ are i.i.d. Gaussian. If the process $${{\mathbb{\mathbf{X}}}}$$ satisfies certain mixing conditions, and $${m\over n}> 80 b \bar{d}_k({{\mathbb{\mathbf{X}}}})$$, where $$k$$ is a fixed parameter, then the estimates derived by the PGD algorithm, with high probability, converge to $$X^n$$. We will also characterize the convergence rate of the PGD-based algorithm and its performance in the presence of an additive white Gaussian noise (AWGN). Compared with Informal Result 1, the number of response variables required in Informal Result 2 is a factor $$80b$$ higher. As we will discuss later we let $$b$$ grow as in $$O(\log \log n)$$, and hence the difference between Informal Result 1 and Informal Result 2 is not substantial. 1.3 Related work and discussion Bayesian linear regression has been the topic of extensive research in the past 50 years [10–12,16,17,19–21,26,27,30,32]. In all these papers, $$X^n$$ is considered as a random vector whose distribution is known. However, often simple models are considered for the distribution of $$X^n$$ to simplify either the posterior calculations or to apply Markov chain Monte Carlo methods such as Gibbs sampling. This article considers a different scenario. We ignore the computational issues at first and consider an arbitrary distribution for $$X^n$$. This is in particular useful for applications in which complicated prior can be learned. (For instance, one might have access to a large database that has many different draws of process $$\mathbf{X}$$.) We then present an optimization for estimating $$X^n$$ and prove the optimality of this approach under some conditions. This approach let us avoid the limitations that are imposed by posterior calculations. On the other hand, one main advantage of having posterior distributions is that they can be used in calculating confidence intervals. Exploring confidence intervals and related topics remains an open question for future research. Our theoretical analyses are inspired by the recent surge of interest toward understanding the high-dimensional linear regression problem [1,4,5,8,24,28]. In this front, there has been very limited work on the theoretical analysis of the Bayesian recovery algorithms, especially beyond memoryless sources. Two of the main tools that have been used for this purpose in the literature are the replica method [9] and state evolution [18]. Both methods have been employed to analyze the asymptotic performance of MMSE and MAP estimators under the asymptotic setting where $$m,n \rightarrow \infty$$, while $$m/n$$ is fixed. They both have some limitations. For instance, replica-based methods are not fully rigorous. Moreover, while they work well for i.i.d. sequences, it is not clear how they can be applied to sources with memory. The state evolution framework suffers from similar issues. Our article presents the first result in this direction for processes with memory. 1.4 Organization of the article The organization of the article is as follows. The Q-MAP estimator is developed in Section 2, and in Section 3, it is simplified for some simple distributions and shown to have connections to some well-known algorithms. In Section 4, two classes of stochastic processes are studied. The empirical distributions of the quantized versions of processes in each class have exponential convergence rates. The performance of the Q-MAP estimator is studied in Section 5. 
An iterative method based on PGD is proposed and studied in Section 6. Section 7 presents the proofs of the main results of the article and finally Section 8 concludes the article. 2. Quantized MAP estimator Consider the problem of estimating $$X^n$$ from noise-free response variables $$Y^m=AX^n$$, where $${\mathbb{\mathbf{X}}}=\{X_i\}_{i=1}^{\infty}$$ is a stationary process. Since there is no noise, one can employ a MAP estimator and find the most likely parameter vector given response variables $$Y^m$$. Instead of solving the original MAP estimator, we consider finding the most probable sequence in the quantized space $$\mathcal{X}_b^n$$. That is,   \begin{align} {\rm maximize} \;\;&{\rm{P}}([X^n]_b=u^n)\nonumber\\ {\rm subject\; to}\;\; & u^n\in\mathcal{X}_b^n,\nonumber\\ &[x^n]_b=u^n,\nonumber\\ &Ax^n=Y^m,\label{eq:Q-MAP-orig} \end{align} (2.1) where $${\rm{P}}$$ denotes the law of process $$\mathbf{X}$$. The optimization described in (2.1) can be further simplified and made more amenable to both analysis and implementation. Note that if $$Ax^n=Y^m$$, and $$|x_i-u_i|\leq 2^{-b}$$, for all $$i$$, then   \begin{align} \|Au^n-Y^m\|_2^2&\leq (\sigma_{\max}(A))^2 \|x^n-u^n\|^2\nonumber\\ &\leq 2^{-2b}n^2(\sigma_{\max}(A))^2 ,\label{eq:Q-MAP-orig-simlifies-s1} \end{align} (2.2) where $$\sigma_{\rm max}(A)$$ denotes the maximum singular value of the design matrix $$A$$. This provides an upper bound on $${1\over n^2}\|A[x^n]_b-Y^m\|_2^2$$ in terms of $$\sigma_{\max}(A)$$ and $$b$$. In other words, since $$u^n$$ is a quantized version of $$x^n$$, and $$Ax^n=Y^m$$, $$\|Au^n-Y^m\|_2^2$$ is also expected to be small. To further simplify (2.2), we focus on the other term in (2.2), i.e. $$-\log{\rm{P}}([X^n]_b=u^n)$$. Assume that the process $${\mathbb{\mathbf{X}}}$$ is such that $${\rm{P}}([X^n]_b=u^n)$$ can be factored as   \begin{align} {\rm{P}}([X^n]_b=u^n)= {\rm{P}}([X^k]_b=u^k ) \prod_{i=k+1}^n{\rm{P}}([X_i]_b=u_i|[X_{i-k}^{i-1}]_b=u_{i-k}^{i-1}), \end{align} (2.3) for some finite $$k$$. In other words, the $$b$$-bit quantized version of $${\mathbb{\mathbf{X}}}$$ is a Markov process of order $$k$$. Then define coefficients $$(w_{a^{k+1}}:\;a^{k+1}\in \mathcal{X}_b^{k+1})$$ according to (1.6). 
This assumption simplifies the term $$-\log{\rm{P}}([X^n]_b=u^n)$$ in the following way:   \begin{align} &-\log {\rm{P}}([X^n]_b=u^n)\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) -\sum_{i=k+1}^n\log {\rm{P}}([X_i]_b=u_i|[X_{i-k}^{i-1}]_b=u_{i-k}^{i-1})\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) \nonumber\\ &\qquad-\sum_{i=k+1}^n\log {\rm{P}}([X_i]_b=u_i|[X_{i-k}^{i-1}]_b=u_{i-k}^{i-1})\sum_{a^{k+1}\in \mathcal{X}_b^{k+1}}\mathbb{1}_{u_{i-k}^i=a^{k+1}}\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) \nonumber\\ &\qquad-\sum_{i=k+1}^n \sum_{a^{k+1}\in \mathcal{X}_b^{k+1}}\mathbb{1}_{u_{i-k}^i=a^{k+1}}\log{\rm{P}}([X_i]_b=a_i|[X_{i-k}^{i-1}]_b=a_{i-k}^{i-1})\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) + \sum_{i=k+1}^n \sum_{a^{k+1}\in \mathcal{X}_b^{k+1}}w_{a^{k+1}}\mathbb{1}_{u_{i-k}^i=a^{k+1}}\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) + \sum_{a^{k+1}\in \mathcal{X}_b^{k+1}}w_{a^{k+1}}\sum_{i=k+1}^n\mathbb{1}_{u_{i-k}^i=a^{k+1}}\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) + (n-k)\sum_{a^{k+1}\in \mathcal{X}_b^{k+1}} w_{a^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|u^n)\nonumber\\ &\quad=-\log {\rm{P}}([X^k]_b=u^k ) + (n-k)c_{{\bf w} }(u^n), \end{align} (2.4) where $$\hat{p}^{(k+1)}$$ denotes the $$(k+1){\rm th}$$ order empirical distribution of $$u^n$$ as defined in (1.2), and $$c_{{\bf w} }(u^n)$$ is defined in (1.4). Assuming $$k$$ is much smaller than $$n$$, and ignoring the negligible term of $$-\log {\rm{P}}([X^k]_b=u^k )/(n-k)$$, instead of minimizing $$-\log{\rm{P}}([X^n]_b=u^n)$$, subject to an upper bound on $$\|Au^n-Y^m\|_2^2$$, we consider the following optimization where the roles of the cost and constraint functions are flipped   \begin{align} {\hat{X}}^n&=\;\;\mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n} \;\; \|Au^n-Y^m\|_2^2 \nonumber \\ &\;\;\;\;\;\;{\rm subject \; to} \;\; c_{{\bf w} }(u^n)\leq \gamma, \end{align} (2.5) or its Lagrangian form   \begin{align} {\hat{X}}^n&= \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n}\Big[c_{{\bf w} }(u^n)+{\lambda\over n^2}\|Au^n-Y^m\|^2\Big].\label{eq:Q-MAP-L} \end{align} (2.6) The choice of parameters $$\lambda>0$$ and $$\gamma$$ is discussed later in our analysis. We refer to both (2.5) and (2.6) as quantized MAP (Q-MAP) estimators. Obtaining the Q-MAP estimator involved several approximation and relaxation steps. It is not clear how accurate these approximations are, and what the performance of the ultimate algorithm is. Also, solving Q-MAP estimator requires specifying parameters $$b$$ and $$\lambda$$, which significantly affect the performance of the estimator. These questions are all answered in Section 5. Before that, in the following section, we focus on two specific processes, which are well studied in the literature and derive the Q-MAP formulation in each case. This will clarify some of the properties of our Q-MAP formulation. 3. Special distributions To get a better understanding of Q-MAP optimization described in (2.5) and (2.6) and especially the term   \[ c_{{\bf w} }(u^n)=\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|u^n), \] in this section we study two special distributions and derive simpler statement for the Q-MAP optimization in each case. 3.1 Independent and identically distributed sparse processes One of the most popular models for sparse parameter vectors is the spike and slab prior [19]. Consider an i.i.d. process $${\mathbb{\mathbf{X}}}$$, such that $$X_i\sim (1-p)\delta_0+pU_{[0,1]}$$. 
Since the process is i.i.d., by setting $$k=0$$ the optimization stated in (2.5) can be simplified as   \begin{align} {\hat{X}}^n&= \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n: \sum\limits_{a\in\mathcal{X}_b} w_{a} \hat{p}^{(1)}(a|u^n)\leq \gamma} \|Au^n-Y^m\|^2,\label{eq:QMAP-iid-1} \end{align} (3.1) where   \begin{align} w_a=\log {1\over {\rm{P}}([X_{1}]_b=a_{1})} = \log {1\over (1-p)\mathbb{1}_{a_1=0}+p{2^{-b}}}. \end{align} (3.2) Therefore, $$\sum_{a\in\mathcal{X}_b} w_{a} \hat{p}^{(1)}(a|u^n)$$ in (3.1) can be written as   \begin{align} \sum_{a\in\mathcal{X}_b} w_{a} \hat{p}^{(1)}(a|u^n) &=\sum_{a\in\mathcal{X}_b} w_{a} \left({1\over n}\sum_{i=1}^n\mathbb{1}_{u_i=a}\right)\nonumber\\ &= {1\over n}\sum_{i=1}^n\sum_{a\in\mathcal{X}_b} w_a \mathbb{1}_{u_i=a}\nonumber\\ & \stackrel{(a)}{=} {1\over n}\sum_{i=1}^n w_{u_i} \nonumber\\ &=-{1\over n}\sum_{i=1}^n \log ((1-p)\mathbb{1}_{u_i=0}+p{2^{-b}})\nonumber\\ &=-\hat{p}(0|u^n) \log ((1-p)+p{2^{-b}})-(1-\hat{p}(0|u^n) )\log(p{2^{-b}})\nonumber\\ &=\hat{p}(0|u^n) \log {p{2^{-b}}\over (1-p)+p{2^{-b}}}-\log(p{2^{-b}}),\label{eq:simplified-cost-ell-0} \end{align} (3.3) where (a) holds because $$\sum_{a\in\mathcal{X}_b} w_a \mathbb{1}_{u_i=a}=w_{u_i}$$. Since $$-\log(p{2^{-b}})$$ is constant, and   \[ \log { (1-p)+p{2^{-b}}\over p{2^{-b}}} \] is positive, from (3.3), an upper bound on $$\sum_{a\in\mathcal{X}_b} w_{a} \hat{p}^{(1)}(a|u^n) $$ is in fact an upper bound on the $$\ell_0$$-norm of $$u^n$$ defined as   \begin{align*} \|u^n\|_0\triangleq |\{i:\; u_i\neq 0\}|. \end{align*} (Note that $$ \|u^n\|_0=(1-\hat{p}(0|u^n))n$$.) Therefore, given these simplifications, (3.1) can be written as   \begin{align} {\hat{X}}^n&= \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n: \|u^n\|_0\leq \gamma'} \|Au^n-Y^m\|^2, \label{eq:QMAP-iid-2} \end{align} (3.4) where $$\gamma'$$ is a function of $$\gamma$$, $$b$$ and $$p$$. 3.2 Piecewise-constant processes Another popular example is a piecewise-constant process. (Refer to [29] for some applications of this model.) As our second example, we introduce a first-order Markov process that can model piecewise-constant functions. Conditioned on $$X_i=x_i$$, $$X_{i+1}$$ is distributed as $$(1-p)\delta_{x_i}+p {\rm Unif}_{[0,1]}$$. In other words, at each time step, the Markov chain either stays at its previous value or jumps to a new value, which is drawn from a uniform distribution over $$[0,1]$$, independent of the past values. The jump process can be modeled as a $$\mathrm{Bern}(p)$$ process which is independent of the past values of the Markov chain. Then, since the process has a memory of order one, (2.6) can be written as   \begin{align} {\hat{X}}^n&= \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n}\Bigg[\sum_{a^{2}\in\mathcal{X}_b^{2}} w_{a^{2}} \hat{p}^{(2)}(a^{2}|u^n) +{\lambda\over n^2}\|Au^n-Y^m\|^2\Bigg],\label{eq:QMAP-markov-1} \end{align} (3.5) where for $$a^2\in\mathcal{X}_b^2$$  \begin{align} w_{a^2}=\log {1\over {\rm{P}}([X_{2}]_b=a_{2}|[X_{1}]_b=a_{1})}. \end{align} (3.6) Given the Kernel of the Markov chain, we have   \begin{align} {\rm{P}}([X_{2}]_b=a_{2}|[X_{1}]_b=a_{1})=(1-p)\mathbb{1}_{a_2=a_1}+p{2^{-b}}. \end{align} (3.7) Let $$N_J(u^n)$$ denote the number of jumps in sequence $$u^n$$, i.e. $$N_J(u^n)=\sum_{i=2}^n\mathbb{1}_{u_i\neq u_{i-1}}$$. 
Then, the first term in the cost function in (2.5) can be rewritten as   \begin{align} \sum_{a^2\in\mathcal{X}_b^2} w_{a^2} \hat{p}^{(2)}(a^2|u^n) &= {1\over n-1}\sum_{i=2}^n\sum_{a^2\in\mathcal{X}_b^2} w_{a^2} \mathbb{1}_{u_{i-1}^i=a^2}= {1\over n-1}\sum_{i=2}^n w_{u_{i-1}^i}\nonumber\\ &=-{1\over n-1}\sum_{i=2}^n \log\left((1-p)\mathbb{1}_{u_i=u_{i-1}}+p{2^{-b}}\right)\nonumber\\ &=-{N_J(u^n)\over n-1} \log(p{2^{-b}})-\left(1-{N_J(u^n)\over n-1}\right) \log(1-p+p{2^{-b}})\nonumber\\ &={N_J(u^n)\over n-1} \log{1-p+p{2^{-b}} \over p{2^{-b}}}-\log(1-p+p{2^{-b}}).\label{eq:cost-markov} \end{align} (3.8) Inserting (3.8) in (3.5), it follows that   \begin{align} {\hat{X}}^n&= \mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n}\left[\alpha_b\left({N_J(u^n)\over n-1}\right)+{\lambda\over n^2}\|Au^n-Y^m\|^2\right]\!, \end{align} (3.9) where $$\alpha_b= \log { (1-p)+p{2^{-b}}\over p{2^{-b}}}$$. Note that the term $${N_J(u^n)\over n-1}$$ counts the (normalized) number of jumps in $$u^n$$, which is a natural regularizer here. 4. Exponential convergence rates In our theoretical analysis, one of the main features required from the source process $${\mathbb{\mathbf{X}}}=\{X_i\}_{i=1}^{\infty}$$ is that the empirical statistics derived from $$[X^n]_b$$ converge, asymptotically, to their expected values. In all our analysis, we require this to hold even when $$b$$ grows with $$n$$. Intuitively, if this is not the case, we do not expect the Q-MAP estimator to be able to obtain a good estimate of $$X^n$$. In the following two sections, we study two important classes of stochastic processes which satisfy this property. 4.1 $${\it{\Psi}}^*$$-mixing processes The first class of processes that satisfy our requirements are $${\it{\Psi}}^*$$-mixing processes. Consider a stationary process $${\mathbb{\mathbf{X}}}=\{X_i\}_{i=1}^\infty$$. Let $$\mathcal{F}_j^k$$ denote the $$\sigma$$-field of events generated by random variables $$X_j^k$$, where $$j\leq k$$. Define   \begin{equation} \psi^*(g) = \sup \frac{{\rm{P}}(\mathcal{A} \cap \mathcal{B})}{{\rm{P}}(\mathcal{A}) {\rm{P}}(\mathcal{B})},\label{eq:def-Psi-star} \end{equation} (4.1) where the supremum is taken over all events $$\mathcal{A} \in \mathcal{F}_{0}^{j}$$ and $$\mathcal{B} \in \mathcal{F}_{j+g}^\infty$$ with $${\rm{P}}(\mathcal{A})>0$$ and $${\rm{P}}(\mathcal{B})>0$$. Definition 4.1 A stationary process $${\mathbb{\mathbf{X}}}$$ is called $$\psi^*$$-mixing, if $$\psi^*(g) \to 1$$, as $$g$$ grows to infinity. There are many examples of $${\it{\Psi}}^*$$-mixing processes. For instance, it is straightforward to check that any i.i.d. sequence is $${\it{\Psi}}^*$$-mixing. Also, every finite-state Markov chain is $${\it{\Psi}}^*$$-mixing [25]. (For further information on $${\it{\Psi}}^*$$-mixing processes, the reader may refer to [3].) Other examples of $${\it{\Psi}}^*$$-mixing processes are processes built by taking the moving average of an i.i.d. process. More specifically, consider an i.i.d. process $${\mathbb{\mathbf{Y}}}$$ and let process $${\mathbb{\mathbf{X}}}$$ denote the moving average of process $${\mathbb{\mathbf{Y}}}$$ defined as $$X_i={1\over r}\sum_{j=0}^{r-1}Y_{i-j}$$, for all $$i$$. Then, process $${\mathbb{\mathbf{X}}}$$ is $${\it{\Psi}}^*$$-mixing. As mentioned earlier, the advantage of $${\it{\Psi}}^*$$-mixing processes is the fast convergence of their empirical distributions to their expected values. This is captured by the following result from [15], which is a straightforward extension of a similar result in [25] for finite-alphabet processes.
Theorem 4.2 Consider a $${\it{\Psi}}^*$$-mixing process $${\mathbb{\mathbf{X}}}=\{X_i\}_{i=1}^{\infty}$$, and its $$b$$-bit quantized version $${\mathbb{\mathbf{Z}}}=\{Z_i\}_{i=1}^{\infty}$$, where $$Z_i=[X_i]_b$$ and $$\mathcal{Z}=\mathcal{X}_b$$. Define measure $$\mu_k$$, such that for $$a^k\in\mathcal{Z}^k$$, $$\mu_k(a^k)={\rm{P}}(Z^k=a^k)$$. Then, for any $$\epsilon>0$$, and any $$b$$ large enough, there exists $$g\in{\rm I}\kern-0.20em{\rm N}$$, depending only on $$\epsilon$$ and function $${\it{\Psi}}^*(g)$$ defined in (4.1), such that for any $$n>k+6(k+g)/\epsilon$$,   \begin{equation}\label{eq:exponentialconvergence} {\rm{P}}(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k\|_1\geq \epsilon)\leq 2^{c\epsilon^2/8}(k+g)n^{|\mathcal{Z}|^k}2^{-{nc\epsilon^2\over 8(k+g)}}, \end{equation} (4.2) where $$c=1/(2\ln 2)$$. Note that the upper bound in (4.2) only depends on $$b$$ through $$|\mathcal{Z}|=|\mathcal{X}_b|$$. Hence, if $$b=b_n$$ grows with $$n$$, it should grow slowly enough, such that overall $$2^{c\epsilon^2/8}(k+g)n^{|\mathcal{Z}|^k}2^{-{nc\epsilon^2\over 8(k+g)}}$$ still converges to zero, as $$n$$ grows to infinity. One such example, used in our results, is $$b=b_n=\lceil r \log\log n\rceil$$, $$r\geq 1$$. For this choice of $$b_n$$, Theorem 4.2 guarantees that the empirical distribution derived from the quantized sequence remains close to its expected value, with high probability. 4.2 Weak $${\it{\Psi}}^*_q$$-mixing Markov processes Finite-alphabet Markov chains are known to be $${\it{\Psi}}^*$$-mixing, and therefore their empirical distributions have exponential convergence rates [25]. Continuous space Markov processes on the other hand are not $${\it{\Psi}}^*$$-mixing in general, and hence the results of the previous section may not hold. However, for many such Markov processes, it is still possible to show that the empirical distributions of their quantized versions converge to their expected values, even if the quantization level $$b$$ grows with $$n$$. In this section, we show how such results can be proved by extending the definition of $${\it{\Psi}}^*$$-mixing processes to define weak $${\it{\Psi}}_q^*$$-mixing Markov processes. As shown at the end of this section, one important example of continuous-space Markov processes which is provably weak $${\it{\Psi}}_q^*$$-mixing is the piecewise-constant process discussed in Section 3.2. Consider a real-valued stationary stochastic process $${\mathbb{\mathbf{X}}}=\{X_i\}$$, with alphabet $$\mathcal{X}=[l,u]$$, where $$l,u\in{\rm I}\kern-0.20em{\rm R}$$. Let process $${\mathbb{\mathbf{Z}}}=\{Z_i\}$$ denote the $$b$$-bit quantized version of process $${\mathbb{\mathbf{X}}}$$. That is, $$Z_i=[X_i]_b$$, for all $$i$$. The alphabet of process $${\mathbb{\mathbf{Z}}}$$ is clearly $$\mathcal{Z}=\mathcal{X}_b=\{[x]_b: x\in\mathcal{X}_b\}$$. Let $$\mu_k^{(b)}$$ denote the distribution of $$Z^k$$. That is, for any $$z^k\in\mathcal{Z}^k$$,   \begin{align} \mu_k^{(b)}(z^k)={\rm{P}}(Z^k=z^k)={\rm{P}}([X^k]_b=z^k). \end{align} (4.3) The following lemma proves that if the process $${\mathbb{\mathbf{X}}}$$ has a property analogous to being $${\it{\Psi}}^*$$-mixing, then potentially it has exponential convergence rates. Lemma 4.1 Suppose that the stationary process $${\mathbb{\mathbf{X}}}$$ is such that there exists a function $${\it{\Psi}}:{\rm I}\kern-0.20em{\rm N}\times {\rm I}\kern-0.20em{\rm N}\to{\rm I}\kern-0.20em{\rm R}^+$$, which satisfies the following. 
For any $$(b,g,\ell_1,\ell_2)\in{\rm I}\kern-0.20em{\rm N}^4$$, $$u^{\ell_1}\in\mathcal{Z}^{\ell_1}$$ and $$v^{\ell_2}\in\mathcal{Z}^{\ell_2}$$:   \begin{align} {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1},Z_{\ell_1+g+1}^{\ell_1+g+\ell_2}=v^{\ell_2}\right)\leq {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right) {\rm{P}}\left(Z_{\ell_1+g+1}^{\ell_1+g+\ell_2}=v^{\ell_2}\right){\it{\Psi}}(b,g),\label{eq:cond-Psi-b-g} \end{align} (4.4) where $$b$$ denotes the quantization level of process $${\mathbb{\mathbf{Z}}}$$. Then for any given $$\epsilon>0$$, for any positive integers $$k$$ and $$g$$ such that $$4(k+g)/(n-k)<\epsilon$$,   \begin{align} {\rm{P}}\left(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon\right)\leq (k+g){\it{\Psi}}^t(b,g)(t+1)^{|\mathcal{Z}|^k}2^{-c\epsilon^2t/4}, \end{align} (4.5) where $$t=\lfloor{n-k+1\over k+g}\rfloor$$ and $$c=1/(2\ln 2)$$. The proof is presented in Section 7.3. Note that if a process is $${\it{\Psi}}^*$$-mixing, then it is straightforward to confirm the existence of $$\tilde{{\it{\Psi}}}(g)$$ that satisfies   \begin{align} {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1},Z_{\ell_1+g+1}^{\ell_1+g+\ell_2}=v^{\ell_2}\right)\leq {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right) {\rm{P}}\left(Z_{\ell_1+g+1}^{\ell_1+g+\ell_2}=v^{\ell_2}\right)\tilde{{\it{\Psi}}}(g). \end{align} (4.6) However, in this section, we are interested in processes that are not necessarily $${\it{\Psi}}^*$$-mixing. Lemma 4.1 allows us to prove the convergence of the empirical distributions for many such processes. To justify our claims, we focus on the class of Markov processes. For notational simplicity, we focus on first-order Markov processes. It is straightforward to extend these results to higher-order Markov processes as well. Let $${\mathbb{\mathbf{X}}}$$ denote a first-order stationary Markov process with kernel function $$K: (\mathcal{X},2^{\mathcal{X}}) \rightarrow \mathbb{R}^+$$ and first-order stationary distribution $$\pi: 2^{\mathcal{X}}\rightarrow {\rm I}\kern-0.20em{\rm R}^+$$. (Here $$2^{\mathcal{X}}$$ denotes the set of subsets of $$\mathcal{X}$$.) In other words, for any $$x\in\mathcal{X}$$ and any measurable subset $$\mathcal{A}$$ of $$\mathcal{X}$$,   \begin{align} K(x,\mathcal{A})={\rm{P}}(X_2\in\mathcal{A}|X_1=x) \end{align} (4.7) and   \begin{align} \pi(\mathcal{A})={\rm{P}}(X_1\in\mathcal{A}). \end{align} (4.8) Also, for $$g\in{\rm I}\kern-0.20em{\rm N}^+$$,   \begin{align} K^g(x,\mathcal{A})={\rm{P}}(X_{1+g}\in\mathcal{A}|X_1=x). \end{align} (4.9) Clearly $$K^g$$ can be evaluated from function $$K$$. Finally, with a slight overloading of notation, for $$x\in\mathcal{X}$$ and $$z\in\mathcal{X}_b$$,   \begin{align} \pi(z)={\rm{P}}([X_1]_b=z) \end{align} (4.10) and   \begin{align} K(x,z)={\rm{P}}([X_2]_b=z|X_1=x). \end{align} (4.11) Similarly, for $$g\in{\rm I}\kern-0.20em{\rm N}^+$$, $$K^g(x,z)={\rm{P}}([X_{1+g}]_b=z|X_1=x)$$. Again, with another slight overloading of notation, for $$x\in\mathcal{X}$$, and $$w^{l+1}\in\mathcal{Z}^{l+1}$$,   \begin{align} \pi\left(w_2^{l+1}|x\right)={{\rm{P}}\left([X_2^{l+1}]_b=w^{l+1}_2|X_1=x\right)} \end{align} (4.12) and   \begin{align} \pi\left(w^{l+1}_2|w_1\right)={{\rm{P}}\left([X_2^{l+1}]_b=w_2^{l+1}|[X_1]_b=w_1\right)}.
\end{align} (4.13) Define functions $${\it{\Psi}}_1:{\rm I}\kern-0.20em{\rm N}\times{\rm I}\kern-0.20em{\rm N}\to {\rm I}\kern-0.20em{\rm R}^+$$ and $${\it{\Psi}}_2:{\rm I}\kern-0.20em{\rm N}\to {\rm I}\kern-0.20em{\rm R}^+$$ as   \begin{align} {\it{\Psi}}_1(b,g) \triangleq \sup_{(x,z)\in\mathcal{X}\times\mathcal{X}_b} {K^{g}(x,z) \over \pi(z)}\label{eq:Psi1-def} \end{align} (4.14) and   \begin{align} {\it{\Psi}}_2(b)\triangleq \sup_{(x,\ell_2,w^{\ell_2})\in\mathcal{X}\times{\rm I}\kern-0.20em{\rm N}\times \mathcal{Z}^{\ell_2}: [x]_b=w_1} {\pi(w_{2}^{\ell_2}|x)\over \pi(w_{2}^{\ell_2}|w_1)}.\label{eq:Psi2-def} \end{align} (4.15) Our next lemma shows how $${\it{\Psi}}(b,g)$$ in Lemma 4.1 can be calculated using $${\it{\Psi}}_1$$ and $${\it{\Psi}}_2$$. Lemma 4.2 Consider a first-order aperiodic Markov process $${\mathbb{\mathbf{X}}}=\{X_i\}_{i=1}^{\infty}$$. Let $${\mathbb{\mathbf{Z}}}=\{Z_i\}$$ denote the $$b$$-bit quantized version of process $${\mathbb{\mathbf{X}}}$$. That is, $$Z_i=[X_i]_b$$, and $$\mathcal{Z}=\mathcal{X}_b=\{[x]_b: x\in\mathcal{X}\}$$. Also, let $$\mu_b$$ denote the distribution associated with the finite-alphabet process $${\mathbb{\mathbf{Z}}}$$. Then, for all $$(\ell_1,g,\ell_2)\in{\rm I}\kern-0.20em{\rm N}^3$$, $$u^{\ell_1}\in\mathcal{Z}^{\ell_1}$$, $$v^g\in\mathcal{Z}^g$$ and $$w^{\ell_2}\in\mathcal{Z}^{\ell_2}$$, we have   \begin{align} \mu_b\left(u^{\ell_1}v^{g}w^{\ell_2}\right)\leq \mu_b\left(u^{\ell_1}\right)\mu_b\left(w^{\ell_2}\right){\it{\Psi}}_1(b,g){\it{\Psi}}_2(b), \end{align} (4.16) where by definition $$\mu_b\left(u^{\ell_1}v^{g}w^{\ell_2}\right)={\rm{P}}\left(Z^{\ell_1+g+\ell_2}=[u^{\ell_1},v^{g},w^{\ell_2}]\right)$$, $$\mu_b\left(u^{\ell_1}\right)={\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right)$$ and $$\mu_b\left(w^{\ell_2}\right)={\rm{P}}\left(Z^{\ell_2}=w^{\ell_2}\right)$$. Furthermore, for any fixed $$b$$, $${\it{\Psi}}_1(b,g)$$ is a non-increasing function of $$g$$ that converges to $$1$$, as $$g$$ grows to infinity. The proof is presented in Section 7.4. Combining Lemmas 4.1 and 4.2, we obtain an upper bound of the form   \[ (k+g){\it{\Psi}}^t(b,g)(t+1)^{|\mathcal{Z}|^k}2^{-c\epsilon^2t/4} \] on $${\rm{P}}(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon)$$. To prove our desired convergence results, we need to ensure that this upper bound goes to zero as $$n$$ grows to infinity. It is straightforward to note that as $$n \rightarrow \infty$$, $$t \rightarrow \infty$$, and hence the term $$2^{-c\epsilon^2t/4}$$ converges to zero. However, if $${\it{\Psi}}^t(b,g)(t+1)^{|\mathcal{Z}|^k}$$ grows faster than $$2^{c\epsilon^2t/4}$$, then the overall bound does not converge to zero. Our next theorems prove that under some mild conditions on the Markov process, for slow enough growth of $$b=b_n$$ with $$n$$, $${\it{\Psi}}^t(b,g)$$ does not grow too fast. First note that since for any $$z^n\in\mathcal{Z}^n$$, $$\| \mu_{k_1}-\hat{p}^{(k_1)}(\cdot|z^n)\|_1\leq \| \mu_{k_2}-\hat{p}^{(k_2)}(\cdot|z^n)\|_1$$, for all $$k_1\leq k_2$$, to prove fast convergence of $$\hat{p}^{(k)}(\cdot|Z^n)$$ it is enough to prove this statement for $$k$$ large. Theorem 4.3 Consider an aperiodic stationary first-order Markov chain $${\mathbb{\mathbf{X}}}$$, and its $$b$$-bit quantized version $${\mathbb{\mathbf{Z}}}$$, where $$Z_i=[X_i]_b$$ and $$\mathcal{Z}=\mathcal{X}_b$$. Let $$\mu_k^{(b)}$$ denote the $$k$$th order probability distribution of process $${\mathbb{\mathbf{Z}}}$$, i.e. for any $$z^k\in\mathcal{Z}^k$$,   \begin{align} \mu_k^{(b)}(z^k)={\rm{P}}(Z^k=z^k).
\end{align} (4.17) Let $$b=b_n= \lceil r\log\log n \rceil$$, where $$r\geq 1$$. Assume that there exists a sequence $$g=g_n$$, such that $$g_n=o(n)$$, and process $${\mathbb{\mathbf{X}}}$$ satisfies the following conditions: $$\lim_{n\to\infty}{\it{\Psi}}_1(b_n,g_n)=1 $$ and $$\lim_{b\to\infty}{\it{\Psi}}_2(b)=1$$, where functions $${\it{\Psi}}_1$$ and $${\it{\Psi}}_2$$ are defined in (4.14) and (4.15), respectively. Then, given $$\epsilon>0$$ and positive integer $$k$$, for $$n$$ large enough,   \begin{align} {\rm{P}}\left( \|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon\right)\leq 2^{c\epsilon^2/4} (k+g_n)n^{|\mathcal{Z}|^k}2^{-{c \epsilon^2n\over 8(k+g_n)}}, \end{align} (4.18) where $$c=1/(2\ln 2)$$. The proof is presented in Section 7.5. Remark 4.1 Lemma 4.2 proves that for any fixed $$b$$, $${\it{\Psi}}_1(b,g)$$ converges to one, as $$g$$ grows without bound. However, in this article we are mainly interested in the cases where $$ b_n=\lceil r\log\log n \rceil$$. The condition on $${\it{\Psi}}_1$$ specified in Theorem 4.3 ensures that even if $$b$$ also grows to infinity, there exists a proper choice of sequence $$g_n$$ as a function of $$n$$, for which $${\it{\Psi}}_1(b_n,g_n)$$ still converges to one, as $$n$$ grows without bound. Theorem 4.3 proves that if $$\lim_{n\to\infty}{\it{\Psi}}_1(b_n,g_n)=1 $$ and $$\lim_{b\to\infty}{\it{\Psi}}_2(b)=1$$, then the quantized version of an analog Markov process also has fast convergence rates. We refer to a Markov process that satisfies these two conditions with $$b_n = \lceil r\log\log n \rceil$$ as a weak $${\it{\Psi}}^*_q$$-mixing Markov process. To better understand these conditions, we next consider the piecewise-constant source studied in Section 3.2 and prove that it is a weak $${\it{\Psi}}^*_q$$-mixing Markov process. Theorem 4.4 Consider a first-order stationary Markov process $${\mathbb{\mathbf{X}}}$$, such that conditioned on $$X_i=x_i$$, $$X_{i+1}$$ is distributed as $$(1-p)\delta_{x_i}+pf_c$$, where $$f_c$$ denotes an absolutely continuous distribution over $$\mathcal{X}=[0,1]$$. Further assume that there exists $$f_{\min}>0$$, such that $$f_c(x)\geq f_{\min},$$ for all $$x\in(0,1)$$. Then, for $$b=b_n=\lceil r\log\log n\rceil $$ and $$g=g_n=\lfloor \gamma \, r\log\log n \rfloor$$, where $$\gamma>-{1\over \log(1-p)}$$, $$\lim_{n\to\infty}{\it{\Psi}}_1(b_n,g_n)=1$$, and $${\it{\Psi}}_2(b)=1$$, for all $$b$$. The proof is presented in Section 7.6. 5. Theoretical analysis of Q-MAP In this section, we formalize Informal Result 1 presented in Section 1. The following theorem provides conditions for the success of the Q-MAP estimator, for the case where the response variables are noise-free. We state all the results for $${\it{\Psi}}^*$$-mixing processes, but they also apply to weak $${\it{\Psi}}^*_q$$-mixing Markov processes. Theorem 5.1 Consider a $${\it{\Psi}}^*$$-mixing stationary process $${{\mathbb{\mathbf{X}}}}$$, and let $$Y^m=AX^n$$. Assume that the entries of the design matrix $$A$$ are i.i.d. $$\mathcal{N}(0,1)$$. Choose $$k$$, $$r>1$$ and $${\delta}>0$$, and let $$b=b_n=\lceil r\log\log n\rceil$$, $${\gamma}={\gamma}_n= b_n(\bar{d}_k({{\mathbb{\mathbf{X}}}})+{\delta}$$) and $$m=m_n\geq (1+\delta)n\bar{d}_k({{\mathbb{\mathbf{X}}}})$$. 
Assume that there exists a constant $$f_{k+1}>0$$, such that for any quantization level $$b$$ and any $$u^{k+1}\in\mathcal{X}_b^{k+1}$$ with $${\rm{P}}([X^{k+1}]_b=u^{k+1})\neq 0$$,   \begin{align} {\rm{P}}\left([X^{k+1}]_b=u^{k+1}\right)\geq {f_{k+1} |\mathcal{X}_b|^{-(k+1)}}.\label{eq:cond-f-k} \end{align} (5.1) Further, let $${\hat{X}}^n$$ denote the solution of (2.5), where the coefficients are computed according to (1.6). Then, for any $${\epsilon}>0$$,   \begin{align} \lim_{n\to\infty} {\rm{P}}\left( { 1\over \sqrt{n}}\|X^n-{\hat{X}}^n\|_2>{\epsilon}\right)=0. \end{align} (5.2) The proof is presented in Section 7.7. Remark 5.1 A technical condition required by Theorem 5.1 is the existence of a constant $$f_{k+1}$$, for which (5.1) holds. It is straightforward to confirm that this condition holds for many distributions of interest. For instance, consider an i.i.d. process $$X_i\stackrel{\rm i.i.d.}{\sim} p f_c+(1-p)f_d$$, where $$f_c$$ and $$f_d$$ denote an absolutely continuous distribution and a discrete distribution, respectively. If $$f_c$$ has bounded support and $$\inf f_c$$ over its support is non-zero, then (5.1) holds. Intuitively, this condition guarantees that the probability of every quantized sequence with non-zero probability cannot get too small. The reason this condition is required is that it simplifies bounding the Kullback–Leibler distance between the empirical distribution and the underlying distribution. We believe that the theorem holds in a more general setting, but its proof is left for future work. We remind the reader that we also introduced a Lagrangian version of Q-MAP in (2.6). It turns out that we can derive the same performance guarantees for the Lagrangian Q-MAP as well. Theorem 5.2 Consider a $${\it{\Psi}}^*$$-mixing stationary process $${{\mathbb{\mathbf{X}}}}$$. Let $$Y^m=AX^n$$, where the entries of $$A$$ are i.i.d. $$\mathcal{N}(0,1)$$. Choose $$k$$, $$r>1$$ and $${\delta}>0$$, and let $$b=b_n=\lceil r\log\log n\rceil$$, $${\lambda}={\lambda}_n=(\log n)^{2r}$$ and $$m=m_n\geq (1+\delta)n\bar{d}_k({{\mathbb{\mathbf{X}}}})$$. Assume that there exists a constant $$f_{k+1}>0$$, such that for any quantization level $$b$$, and any $$u^{k+1}\in\mathcal{X}_b^{k+1}$$ with $${\rm{P}}([X^{k+1}]_b=u^{k+1})\neq 0$$, (5.1) holds. Further, let $${\hat{X}}^n$$ denote the solution of (2.6), where the coefficients are computed according to (1.6). Then, for any $${\epsilon}>0$$,   \begin{align} \lim_{n\to\infty} {\rm{P}}\left( { 1\over \sqrt{n}}\|X^n-{\hat{X}}^n\|_2>{\epsilon}\right)=0. \end{align} (5.3) The proof is presented in Section 7.8. To better understand the implications of Theorems 5.1 and 5.2, consider the case where the process $${{\mathbb{\mathbf{X}}}}$$ is stationary and memoryless. All such processes are $${\it{\Psi}}^*$$-mixing, and satisfy $$\bar{d}_k({{\mathbb{\mathbf{X}}}})=\bar{d}_0({{\mathbb{\mathbf{X}}}})$$, for all $$k\geq 0$$. Therefore, as long as $$m_n \geq (1+\delta)n\bar{d}_0({{\mathbb{\mathbf{X}}}})$$, asymptotically, the Q-MAP algorithm provides an accurate estimate of the parameters vector. On the other hand, since the process is i.i.d., $$\bar{d}_0({{\mathbb{\mathbf{X}}}})=\bar{d}(X_1)$$, where $$\bar{d}(X_1)$$ denotes the upper Rényi information dimension of $$X_1$$ [23]. For an i.i.d.
process whose marginal distribution is a mixture of discrete and continuous distributions, asymptotically, the Rényi information dimension of the marginal distribution characterizes the minimum sampling rate ($$m/n$$) required for an accurate recovery of the parameters vector [33]. Hence, for such i.i.d. sources, in a noiseless setting, the algorithm presented in (3.4) achieves the fundamental limits in terms of sampling rate. Finally, another interesting implication of Theorem 5.2 is the following. The Q-MAP optimization mentioned in (2.5) is not a convex optimization. Hence, its solution does not necessarily coincide with the solution of (2.6). However, at least in the noiseless setting we can derive similar performance bounds for both. 6. Solving Q-MAP 6.1 PGD The goal of this section is to analyze the performance of the PGD algorithm introduced in Section 1.2. The results are presented for $${\it{\Psi}}^*$$-mixing processes, but they also apply to weak $${\it{\Psi}}_q^*$$-mixing Markov processes with no change. Note that even though PGD algorithms have been studied extensively for convex optimization problems, since our optimization is discrete and consequently not convex, those analyses do not apply to our problem. Given $${\delta}>0$$, consider the Q-MAP optimization characterized as   \begin{align} {\hat{X}}^n=&\mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^n}\;\;\;\;\;\;\;\; \|Y^m-Au^n\|^2 \nonumber\\ & {\rm subject \; to}\;\;\;\;\; c_{{\bf w}}(u^n) \leq b(\bar{d}_k({{\mathbb{\mathbf{X}}}})+{\delta}). \end{align} (6.1) The corresponding PGD algorithm proceeds as follows. For $$t=1,2,\ldots,t_n$$,   \begin{align} S^n(t+1)&={\hat{X}}^n(t)+\mu A^{\top}(Y^m-A{\hat{X}}^n(t))\nonumber\\ {\hat{X}}^n(t+1)&=\mathop{\text{arg min}}\limits_{u^n\in\mathcal{F}_o}\left\|u^n-S^n(t+1)\right\|\!, \end{align} (6.2) where $$\mathcal{F}_o = \Big\{u^n\in\mathcal{X}_b^n: \; c_{{\bf w}}(u^n) \leq b(\bar{d}_k({{\mathbb{\mathbf{X}}}})+{\delta}) \Big\}$$, defined earlier in (1.9). Theorem 6.1 below proves that, given enough response variables, with probability approaching one, the PGD-based algorithm recovers the parameters $$X^n$$, even in the presence of measurement noise. Theorem 6.1 Consider a $${\it{\Psi}}^{*}$$-mixing process $${{\mathbb{\mathbf{X}}}}$$. Let $$Y^m=AX^n+Z^m$$, where the elements of matrix $$A$$ are i.i.d. $$\mathcal{N}(0,1)$$ and $$Z_i$$, $$i=1,\ldots,m$$, are i.i.d. $$\mathcal{N}(0,\sigma^2)$$. Choose $$k$$, $$r>1$$ and $${\delta}>0$$, and let $$b=b_n=\lceil r\log\log n\rceil$$, $${\gamma}={\gamma}_n= b_n(\bar{d}_k({{\mathbb{\mathbf{X}}}})+{\delta})$$ and $$m=m_n=80 nb(\bar{d}_k({{\mathbb{\mathbf{X}}}})+{\delta})$$. Assume that there exists a constant $$f_{k+1}>0$$, such that for any quantization level $$b$$, and any $$u^{k+1}\in\mathcal{X}_b^{k+1}$$ with $${\rm{P}}([X^{k+1}]_b=u^{k+1})\neq 0$$, (5.1) holds. Let $$\mu={1\over m}$$, and consider $${\hat{X}}^n(t)$$, $$t=0,1,\ldots,t_n$$, generated according to (6.2). Define the error vector at iteration $$t$$ as   \begin{align} E^n(t)={\hat{X}}^n(t)-[X^n]_b. \end{align} (6.3) Then, with probability approaching one,   \begin{align} {1\over \sqrt{n}}\|E^n(t+1)\|\leq {0.9\over \sqrt{n}}\|E^n(t)\|+2\left(2+\sqrt{n\over m}\;\right)^2 2^{-b}+ { \sigma\over 2}\sqrt{b(\bar{d}_k({{\mathbb{\mathbf{X}}}})+3{\delta})\over m}, \end{align} (6.4) for $$t=1,2,\ldots$$.
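To see what the contraction (6.4) gives over many iterations, one can simply iterate the recursion; the following sketch uses hypothetical numbers for the additive term (a stand-in for the quantization and noise contributions on the right-hand side of (6.4)), and shows the error settling at the noise-driven floor.

```python
# Iterate e_{t+1} = 0.9 e_t + delta, the form of the bound in (6.4).
delta = 0.01      # hypothetical stand-in for the quantization + noise terms
e = 1.0           # hypothetical initial normalized error (1/sqrt(n))||E^n(0)||
for t in range(50):
    e = 0.9 * e + delta
print(round(e, 4), delta / (1 - 0.9))   # about 0.1046 vs. the fixed point 0.1
```

The geometric factor $$0.9$$ dominates after a few tens of iterations, so the final accuracy is governed by the additive term, which shrinks as $$b$$ grows and as $$\sigma\to 0$$.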
The proof is presented in Section 7.9. Comparing this result with Theorem 5.2 reveals that the minimum value of $$m$$ required in this theorem is $$80b_n = 80\lceil r\log\log n\rceil $$ times higher than the number of response variables required in Theorem 5.2. One can also decrease (increase) the factor $$80$$ and slow down (speed up) the convergence rate of the algorithm. At this point it is not clear to us whether the factor $$r\log\log n$$ in Theorem 6.1 (for the number of response variables) is necessary in general or is an artifact of our proof technique. For some specific priors such as the spike and slab distribution discussed earlier, with a slight modification of the algorithm, it is known that this factor can be improved. In that case, given the special form of the coefficients, we may let $$b$$ grow to infinity for a fixed $$n$$. Then the algorithm becomes equivalent to the iterative hard thresholding (IHT) algorithm introduced in [2]. The analysis in [2] shows that the number of response variables $$m_n$$ required by the IHT algorithm is proportional to $$n$$ and does not have the $$\log \log n$$ factor that appears in Theorem 6.1. In Theorem 6.1 and all of the previous results, the elements of the design matrix $$A$$ were assumed to be generated according to a $$\mathcal{N}(0,1)$$ distribution. In the noisy setup, where the response variables are distorted by a noise of variance $$\sigma^2$$, this model implies a per-response signal-to-noise ratio (SNR) that grows linearly with $$n$$. This can make the result of the previous theorem misleading: if we consider the per-element error, i.e. $${1\over \sqrt{n}}\|E^n(t+1)\|$$, then the error seems to go to zero. To fix this issue, we assume that the elements of $$A$$ are generated according to $$\mathcal{N}(0,{1\over n})$$. The following corollary restates the result of Theorem 6.1 under this scaling and an appropriate adjustment of coefficient $$\mu$$. Corollary 6.1 Consider the setup of Theorem 6.1, where the elements of $$A$$ are generated i.i.d. $$\mathcal{N}(0,{1\over n})$$. Let   \begin{align} S^n(t+1)&={\hat{X}}^n(t)+{n\over m}A^{\top}(Y^m-A{\hat{X}}^n(t))\nonumber\\ {\hat{X}}^n(t+1)&=\mathop{\text{arg min}}\limits_{u^n\in\mathcal{F}_o}\left\|u^n-S^n(t+1)\right\|\!.\label{eq:PGD-noisy-update} \end{align} (6.5) Then, with probability approaching one,   \begin{align} {1\over \sqrt{n}}\|E^n(t+1)\|\leq {0.9\over \sqrt{n}}\|E^n(t)\|+{2(\sqrt{n}+2\sqrt{m})^2\over m}2^{-b}+ { \sigma\over 2}\sqrt{nb(\bar{d}_k({{\mathbb{\mathbf{X}}}})+3{\delta})\over m}, \end{align} (6.6) for $$t=1,2,\ldots$$. Note that for $$m=m_n=80 nb(\bar{d}_k({{\mathbb{\mathbf{X}}}})+3{\delta})$$,   \[ { \sigma\over 2}\sqrt{nb(\bar{d}_k({{\mathbb{\mathbf{X}}}})+3{\delta})\over m}\leq {\sigma\over 12}. \] 6.2 Discussion of computational complexity of PGD As explained earlier, at iteration $$t+1$$, the PGD-based algorithm updates its estimate $${\hat{X}}^n(t)$$ to $${\hat{X}}^n(t+1)$$ by performing the following two steps: $$S^n(t+1)={\hat{X}}^n(t)+\mu A^{\top}(Y^m-A{\hat{X}}^n(t))$$, $${\hat{X}}^n(t+1)={\text{arg min}}_{u^n\in\mathcal{F}_o}\left\|u^n-S^n(t+1)\right\|$$. Clearly, the challenging part is performing the second step, which is projection on the set $$\mathcal{F}_o$$. For some special distributions, such as the spike and slab prior, discussed in Section 3.1, and piecewise-constant processes, discussed in Section 3.2, and their extensions, this projection step is not complicated.
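For concreteness, a minimal sketch of the iteration (6.5) is given below (Python). The projection onto $$\mathcal{F}_o$$ is passed in as a user-supplied routine, since its implementation depends on the prior; the function and argument names are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def pgd_qmap(A, y, project_onto_Fo, num_iters=100, x0=None):
    """Projected-gradient iteration of (6.5): a gradient step on ||y - A u||^2
    followed by projection onto the feasible set F_o.  `project_onto_Fo` is a
    user-supplied routine (e.g. quantized hard thresholding for the sparse prior)."""
    m, n = A.shape
    x_hat = np.zeros(n) if x0 is None else x0.copy()
    step = n / m                                   # step size used in (6.5) for N(0, 1/n) entries
    for _ in range(num_iters):
        s = x_hat + step * A.T @ (y - A @ x_hat)   # gradient step S^n(t+1)
        x_hat = project_onto_Fo(s)                 # projection step
    return x_hat
```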
For instance, for the aforementioned sparse vector, $$\mathcal{F}_o$$ contains sparse quantized vectors, and hence the projection step is just keeping the quantized versions of the largest components of $$S^{t}$$ and setting the rest to zero. This is very similar to the IHT algorithm [2]. However, for more general distributions this projection step may be challenging. Hence, to make the PGD method efficient, we need to be able to solve the following optimization efficiently:   \begin{align}\label{eq:1} {\hat{X}}^n\;=\;&\mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^{n}} \;\;\;\;\;\;\;\left\|u^n-x^n\right\| \nonumber \\ &{\rm subject \; to} \;\; \;c_{{\bf w}}(u^n) \leq \gamma, \end{align} (6.7) where $$x^n\in{\rm I}\kern-0.20em{\rm R}^n$$, weights $${\bf w}=\{w_{a^{k+1}}: a^{k+1}\in \mathcal{X}_b^{k+1}\}$$ and $$\gamma\in{\rm I}\kern-0.20em{\rm R}^+$$ are all given input parameters. Equation (6.7) can be stated in the Lagrangian form as   \begin{align} {\hat{X}}^n\;=\;&\mathop{\text{arg min}}\limits_{u^n\in\mathcal{X}_b^{n}} \Big[{1\over n^2}\left\|u^n-x^n\right\|^2 +{\alpha} c_{{\bf w}}(u^n)\Big],\label{eq:lagrangian-eq-projection} \end{align} (6.8) where $${\alpha}>0$$ is a parameter that depends on $$\gamma$$. Since $$\|u^n-x^n\|^2=\sum_{i=1}^n(u_i-x_i)^2$$, the optimization stated in (6.8) is exactly the optimization studied in [14]. It has been proved in [14] that the solution of (6.8) can be found efficiently via the standard dynamic programming (Viterbi algorithm) [31]. (For further information, refer [14].) The question is whether, for an appropriate choice of $${\alpha}$$, the minimizers of (6.7) and (6.8) are the same. If the answer to this question is affirmative, it implies that both steps of the PGD method can be implemented efficiently. In the following, we intuitively argue why we believe that this might be the case. Making the argument rigorous and a deeper investigation of this connection is left to future research. Consider partitioning the set of sequences in $$\mathcal{X}_b^n$$ based on their $$(k+1)$$th order empirical distributions, which are referred to as their $$(k+1)$$th order types. For a $$(k+1)$$th order type $$q_{k+1}(\cdot): \mathcal{X}_b^{k+1}\to {\rm I}\kern-0.20em{\rm R}^+$$, let $$\mathcal{T}_n(q_{k+1})$$ denote the set of sequences in $$\mathcal{X}_b^n$$ whose $$(k+1)$$th order types agree with $$q_{k+1}$$. That is,   \begin{align} \mathcal{T}_n(q_{k+1})\triangleq \left\{u^n: \hat{p}^{(k+1)}(a^{k+1}|u^n)=q_{k+1}(a^{k+1}), \forall \; a^{k+1}\in\mathcal{U}_b^{k+1}\right\}\!. \end{align} (6.9) Let $$\mathcal{P}_{n}^{k+1}$$ denote the set of all possible $$(k+1)$$th order types, for sequences in $$\mathcal{X}_b^n$$. In other words,   \begin{align} \mathcal{P}_{n}^{k+1}\triangleq\left \{ \hat{p}^{(k+1)}(\cdot|u^n): u^n\in\mathcal{X}_b^n\right\}\!. \end{align} (6.10) It can be proved that (Theorem I.6.14 in [25])   \begin{align} |\mathcal{P}_{n}^{k+1}|\leq (n+1)^{|\mathcal{X}_b|^{k+1}}. \end{align} (6.11) Furthermore, we have   \begin{align} \mathcal{X}_b^n=\cup_{q_{k+1}\in\mathcal{P}_{n}^{k+1}} \mathcal{T}_n(q_{k+1}). 
\end{align} (6.12) Therefore,   \begin{align} &\min_{u^n\in\mathcal{X}_b^{n}} \Bigg[{1\over n}\left\|u^n-x^n\right\|^2 +{\alpha} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|u^n)\Bigg]\nonumber\\ &\quad =\min_{q_{k+1} \in \mathcal{P}_n^{k+1}}\min_{u^n\in\mathcal{T}_n(q_{k+1}) } \Bigg[{1\over n}\left\|u^n-x^n\right\|^2 +{\alpha} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}q_{k+1}(a^{k+1})\Bigg]\nonumber\\ &\quad =\min_{q_{k+1} \in \mathcal{P}_n^{k+1}} \Bigg[ \Big[\min_{u^n\in\mathcal{T}_n(q_{k+1}) } {1\over n}\left\|u^n-x^n\right\|^2 \Big]+{\alpha} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}q_{k+1}(a^{k+1})\Bigg],\label{eq:2} \end{align} (6.13) where the last line follows because $$ \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|u^n)$$ only depends on the $$(k+1)$$th order type of sequence $$u^n$$. For any type $$q_{k+1} \in \mathcal{P}_n^{k+1}$$, define the minimum distortion attainable by sequences of that type as $$D(q_{k+1},x^n)$$, i.e.   \begin{align} D(q_{k+1},x^n) = \min_{u^n \in \mathcal{T}_n(q_{k+1}) }{1\over n} \left\|u^n-x^n\right\|^2. \end{align} (6.14) Then (6.13) and (6.7) can be written as   \begin{align} \min_{q_{k+1} \in \mathcal{P}_{n}^{k+1}} \Bigg[D(q_{k+1},x^n) +{\alpha}\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}q_{k+1}(a^{k+1}) \Bigg] \end{align} (6.15) and   \begin{align} \min_{q_{k+1} \in \mathcal{P}_{n}^{k+1}: \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}q_{k+1}(a^{k+1})\leq {\gamma}} D(q_{k+1},x^n), \end{align} (6.16) respectively. Both of these are discrete optimizations. However, since   \[ \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}}q_{k+1}(a^{k+1}) \] is a convex function of $$q_{k+1}$$, if, in the high-dimensional setting, for input sequences $$x^n$$ of interest, $$D(q_{k+1},x^n)$$ also behaves almost as a convex function, then we expect the two optimizations to be the same, for a proper choice of parameter $${\alpha}$$. In the remainder of this section, we argue why, in a high-dimensional setting, we conjecture that $$D(q_{k+1},x^n)$$ satisfies the mentioned property. We leave further investigation of the subject to future research. First, note that if $$x^n$$ is almost stationary, for instance it is generated by a Markov process with finite memory, then $$D(q_{k+1},x^n)$$ only depends on $$q_{k+1}$$ and finite-order empirical distributions of $$x^n$$ and not on $$n$$ or $$x^n$$. Now assuming that this is true, consider $$q^{(1)}_{k+1}$$ and $$q^{(2)}_{k+1}$$ in $$\mathcal{P}_n^{k+1}$$. Also given $$\theta\in(0,1)$$, let $$n_1=\lfloor \theta n\rfloor$$ and $$n_2=n-n_1$$. Also, let $${\tilde{x}}^{n_1}$$ and $$\bar{x}^{n_2}$$ denote the minimizers of $$D(q^{(1)}_{k+1},x^{n_1})$$ and $$D(q^{(2)}_{k+1},x_{n_1+1}^{n})$$, respectively. Assume that $$\theta q^{(1)}_{k+1}+(1-\theta)q^{(2)}_{k+1}\in \mathcal{P}_n^{k+1}$$ and let $${\hat{X}}^n=[{\tilde{x}}^{n_1},\bar{x}^{n_2}]$$. Then, for large $$n$$, it is straightforward to check that $$\hat{p}^{(k+1)}(\cdot|{\hat{X}}^n)\approx \theta \hat{p}^{(k+1)}(\cdot|{\tilde{x}}^{n_1})+(1-\theta) \hat{p}^{(k+1)}(\cdot|\bar{x}^{n_2})=\theta q^{(1)}_{k+1}+(1-\theta)q^{(2)}_{k+1}$$. Therefore,   \begin{align} n D\left(\theta q^{(1)}_{k+1} + (1-\theta)q^{(2)}_{k+1},x^n\right) &\leq \|x^n-{\hat{X}}^n\|^2\nonumber\\ &= \|x^{n_1}-{\tilde{x}}^{n_1}\|^2+\|x_{n_1+1}^{n}-\bar{x}^{n_2}\|^2\nonumber\\ &= n_1D\left(q^{(1)}_{k+1},x^{n_1}\right)+n_2D\left(q^{(2)}_{k+1},x_{n_1+1}^{n}\right).
\end{align} (6.17) Dividing both sides by $$n$$, it follows that   \begin{align} D\left(\theta q^{(1)}_{k+1} + (1-\theta)q^{(2)}_{k+1},x^n\right)\leq \theta D\left(q^{(1)}_{k+1},x^{n_1}\right)+(1-\theta)D\left(q^{(2)}_{k+1},x_{n_1+1}^{n}\right). \label{eq:3} \end{align} (6.18) Therefore, since for large values of $$n$$ and stationary sequences $$x^n$$ the function $$D(q_{k+1},x^n)$$ depends on $$x^n$$ only through its empirical distribution, and since $$x^{n_1}$$ and $$x_{n_1+1}^{n}$$ have almost the same empirical distribution as $$x^n$$, they can be replaced by $$x^n$$ in (6.18). This establishes our conjecture about the almost convexity of the function $$D$$. 7. Proofs 7.1 Preliminaries on information theory Before presenting the proofs, in this section, we review some preliminary definitions and concepts that are used in some of the proofs. Consider stationary process $${\mathbb{\mathbf{U}}}=\{U_i\}_{i=1}^{\infty}$$, with finite alphabet $$\mathcal{U}$$. The entropy rate of process $${\mathbb{\mathbf{U}}}$$ is defined as   \begin{align} \bar{H}({\mathbb{\mathbf{U}}})\triangleq \lim_{k\to\infty}H\left(U_{k+1}|U^k\right)\!. \end{align} (7.1) Consider $$u^n\in\mathcal{U}^n$$, where $$\mathcal{U}$$ is a finite set. The $$(k+1)$$th order empirical distribution of $$u^n$$ is defined in (1.2). The $$k$$th order conditional empirical entropy of $$u^n$$ is defined as $$\hat{H}_k(u^n)=H(U_{k+1}|U^k)$$, where $$U^{k+1}$$ is distributed as $$\hat{p}^{(k+1)}(\cdot|u^n)$$. In other words,   \begin{align} \hat{H}_k(u^n)=-\sum_{a^{k+1}\in\mathcal{U}^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|u^n)\log{\hat{p}^{(k+1)}(a^{k+1}|u^n) \over \hat{p}^{(k)}(a^k|u^n)}. \end{align} (7.2) In some of the proofs, we employ the Lempel–Ziv (LZ) compression scheme. Compression schemes aim to represent a sequence $$u^n \in\mathcal{U}^n$$ in as few bits as possible. Intuitively, if $$u^n$$ is a sample of a finite-alphabet stationary ergodic process $${\mathbb{\mathbf{U}}}$$, then, asymptotically, the smallest expected number of bits per symbol required to represent $$u^n$$ is $$ \bar{H}({\mathbb{\mathbf{U}}})$$. Compression algorithms that achieve this bound are called optimal. One of the well-known examples of optimal compression schemes is Lempel–Ziv (LZ) [34] coding. (LZ is also a universal compression code, since it does not use any information regarding the distribution of the process.) In summary, the LZ compression code operates as follows: it first incrementally parses the input sequence into unique phrases such that each phrase is the shortest phrase that has not been seen earlier. Then, each phrase is encoded by (i) a pointer to the location of the previously seen phrase that consists of the current phrase without its last symbol, and (ii) the last symbol of the phrase. Given $$u^n\in\mathcal{U}^n$$, let $$\ell_{\rm LZ}(u^n)$$ denote the length of the binary coded sequence assigned to $$u^n$$ using the LZ compression code. Note that since the LZ algorithm assigns a unique coded sequence to every input sequence, we have   \begin{align} |\{u^n: \ell_{\rm LZ}(u^n)\leq r\}|\leq \sum_{i=1}^r2^i\leq 2^{r+1}.\label{eq:LZ-sequences} \end{align} (7.3) The LZ length function $$\ell_{\rm LZ}(\cdot)$$ appears in some of the following proofs because of its connection with the conditional empirical entropy function $$\hat{H}_k(\cdot)$$.
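As an aside, the incremental parsing step described above is easy to state in code. The following is a minimal LZ78-style Python sketch of the parsing rule only; the pointer/last-symbol encoding and the exact length function $$\ell_{\rm LZ}$$ used in the proofs are not reproduced here.

```python
def lz_incremental_parse(u):
    # Parse u into phrases, each being the shortest prefix of the remaining
    # sequence that has not appeared as a phrase before (LZ78-style parsing).
    phrases, seen, current = [], set(), ()
    for symbol in u:
        current = current + (symbol,)
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ()
    if current:                 # a final, possibly repeated, phrase
        phrases.append(current)
    return phrases

# Example over a binary alphabet:
print(lz_incremental_parse((1, 0, 1, 1, 0, 1, 0, 0, 1)))
# [(1,), (0,), (1, 1), (0, 1), (0, 0), (1,)]
```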
This connection established in [22] for binary sources and extended in [15] to general sources with alphabet $$\mathcal{U}$$ such that $$|\mathcal{U}|=2^b$$ states that, for all $$u^n\in\mathcal{U}^n$$,   \begin{align} {1\over n}\ell_{\rm LZ}(u^n)\leq \hat{H}_k(u^n)+{b(kb+b+3)\over (1-\epsilon_n)\log n-b}+\gamma_n,\label{eq:connections-LZ-Hk} \end{align} (7.4) where   \begin{align} \epsilon_n={2b+\log\left(2^b+{2^b-1\over b}\log n -2\right)\over \log n}, \end{align} (7.5) and $$\gamma_n=o(1)$$ does not depend on sequence $$u^n$$ or $$b$$. Finally, we finish this section, by two lemmas related to continuity properties of the entropy function and the Kullback–Leibler distance. Lemma 7.1 (Theorem 17.3.3 in [6]) Consider distributions $$p$$ and $$q$$ on finite alphabet $$\mathcal{U}$$ such that $$\|p-q\|_1\leq \epsilon$$. Then,   \begin{align} |H(p)-H(q)|\leq -\epsilon\log \epsilon +\epsilon\log |\mathcal{U}|. \end{align} (7.6) Lemma 7.2 Consider distributions $$p$$ and $$q$$ over discrete set $$\mathcal{U}$$ such that $$\|p-q\|_1\leq \epsilon$$. Further assume that $$p\ll q$$, and let $$q_{\min}=\min_{u\in\mathcal{U}: q(u)\neq 0} q(u)$$. Then,   \begin{align} D(p\| q)\leq -\epsilon\log \epsilon +\epsilon \log |\mathcal{U}|-\epsilon \log q_{\min}. \end{align} (7.7) Proof. Let $$\mathcal{U}^*\triangleq \{u\in \mathcal{U}:\;q(u)\neq 0\}.$$ Since $$p\ll q$$, if $$q(u)=0$$, then $$p(u)=0$$. Therefore, by definition   \begin{align} D(p\| q)&= \sum_{u\in\mathcal{U}} p(u)\log {p(u)\over q(u)}\nonumber\\ &= \sum_{u\in\mathcal{U}^*} p(u)\log {p(u)\over q(u)}\nonumber\\ &= \sum_{u\in\mathcal{U}^*} p(u)\log p(u)-\sum_{u\in\mathcal{U}^*} p(u)\log q(u)\nonumber\\ &= \sum_{u\in\mathcal{U}^*} p(u)\log p(u)-\sum_{u\in\mathcal{U}^*} (p(u)-q(u)+q(u))\log q(u)\nonumber\\ &= H(q)-H(p)-\sum_{u\in\mathcal{U}^*} (p(u)-q(u))\log q(u). \end{align} (7.8) Hence, by the triangle inequality,   \begin{align} D(p\| q)&\leq |H(q)-H(p)|-\sum_{u\in\mathcal{U}^*} |p(u)-q(u)|\log q(u)\nonumber\\ &\stackrel{(a)}{\leq} -\epsilon\log \epsilon +\epsilon \log |\mathcal{U}|-\sum_{u\in\mathcal{U}^*} |p(u)-q(u)|\log q(u)\nonumber\\ &\leq -\epsilon\log \epsilon +\epsilon \log |\mathcal{U}|+\log \left({1\over q_{\min}}\right)\sum_{u\in\mathcal{U}^*} |p(u)-q(u)|\nonumber\\ &\stackrel{(b)}{\leq} -\epsilon\log \epsilon +\epsilon \log |\mathcal{U}|-\epsilon\log q_{\min}, \end{align} (7.9) where $$(a)$$ and $$(b)$$ follow from Theorem 17.3.3 in [6] and $$\|p-q\|_1\leq \epsilon$$, respectively. □ 7.2 Useful concentration lemmas Lemma 7.3 Consider $$u^n\in {\rm I}\kern-0.20em{\rm R}^n$$ and $$v^n\in {\rm I}\kern-0.20em{\rm R}^n$$ such that $$\|u^n\|=\|v^n\|=1$$. Let $$\alpha\triangleq \langle u^n,v^n \rangle $$. Consider matrix $$A\in{\rm I}\kern-0.20em{\rm R}^{m\times n}$$ with i.i.d. standard normal entries. Then, for any $$\tau>0$$,   \begin{align} {\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -\tau\right)\leq {\rm e}^{m((\alpha-\tau)s)-{m\over 2}\ln ((1+s\alpha)^2-s^2)}, \end{align} (7.10) where $$s>0$$ is a free parameter smaller than $${1\over 1-\alpha}$$. Proof. Let $$A_i^n$$ denote the $$i$$th row of matrix $$A$$. 
Then,   \begin{align} Au^n=\left[\begin{array}{c} \langle A_1^n,u^n \rangle\\ \langle A_2^n,u^n \rangle\\ \vdots\\ \langle A_m^n,u^n \rangle\\ \end{array} \right], \;\;\;\; Av^n=\left[\begin{array}{c} \langle A_1^n,v^n \rangle\\ \langle A_2^n,v^n \rangle\\ \vdots\\ \langle A_m^n,v^n \rangle\\ \end{array} \right] \end{align} (7.11) and   \begin{align} {1\over m}\langle Au^n,Av^n\rangle= {1\over m}\sum_{i=1}^m\langle A_i^n,u^n \rangle \langle A_i^n,v^n \rangle. \end{align} (7.12) Let   \begin{align} X_i=\langle A_i^n,u^n \rangle\end{align} (7.13) and   \begin{align} Y_i=\langle A_i^n,v^n \rangle. \end{align} (7.14) Since $$A$$ is generated by drawing its entries from an i.i.d. standard normal distribution, and $$\|u^n\|=\|v^n\|=1$$, $$\{(X_i,Y_i)\}_{i=1}^m$$ is a sequence of i.i.d. random vectors. To derive the joint distribution of $$(X_i,Y_i)$$, note that both $$X_i$$ and $$Y_i$$ are linear combination of Gaussian random variables. Therefore, they are also jointly distributed Gaussian random variables, and hence it suffices to find their first- and second-order moments. Note that   \begin{gather} {\rm{E}}[X_i]={\rm{E}}[Y_i]=0,\\ \end{gather} (7.15)  \begin{gather} {\rm{E}}[X_i^2]=\sum_{j,k}{\rm{E}}[A_{i,j}A_{i,k}]u_ju_k=\sum_{j}{\rm{E}}[A_{i,j}^2]u_j^2=\sum_{j}u_j^2=1 \end{gather} (7.16) and similarly $${\rm{E}}[Y_i^2]=1$$. Also,   \begin{align} {\rm{E}}[X_iY_i]&={\rm{E}}[\langle A_i^n,u^n \rangle\langle A_i^n,v^n \rangle]\nonumber\\ &=\sum_{j,k}{\rm{E}}[A_{i,j}A_{i,k}]u_jv_k\nonumber\\ &=\sum_{j}{\rm{E}}[A_{i,j}^2]u_jv_j\nonumber\\ &=\langle u^n,v^n \rangle=\alpha. \end{align} (7.17) Therefore, in summary,   \begin{align}(X_i,Y_i)\sim \mathcal{N}\left( \left[ \begin{array}{c} 0\\ 0 \end{array}\right],\left[ \begin{array}{cc} 1&\alpha\\ \alpha& 1 \end{array}\right]\right).\end{align} (7.18) For any $$s'>0$$, by the Chernoff bounding method, we have   \begin{align} {\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -\tau\right)&= {\rm{P}}\left({1\over m}\sum_{i=1}^m(X_iY_i-\alpha)\leq -\tau\right)\nonumber\\ &= {\rm{P}}\left({s'\over m}\sum_{i=1}^m(X_iY_i-\alpha)\leq -s'\tau\right)\nonumber\\ &= {\rm{P}}\left( {\rm e}^{s'(\tau-\alpha)}\leq {\rm e}^{-{s'\over m}\sum_{i=1}^m X_iY_i}\right)\nonumber\\ &\leq {\rm e}^{s'(\alpha-\tau)}{\rm{E}}\left[ {\rm e}^{-{s'\over m}\sum_{i=1}^m X_iY_i}\right]\nonumber\\ &= {\rm e}^{s'(\alpha-\tau)}\left({\rm{E}}\left[ {\rm e}^{-{s'\over m}X_1Y_1}\right]\right)^m,\label{eq:deviation-tau} \end{align} (7.19) where the last line follows because $$(X_i,Y_i)$$ is an i.i.d. sequence. In the following we compute $${\rm{E}}[ {\rm e}^{{s\over m}X_1Y_1}]$$. Let $$A=(X_1+Y_1)/2$$ and $$B=(X_1-Y_1)/2$$. Then, $${\rm{E}}[A]={\rm{E}}[B]=0$$ and   \begin{align} {\rm{E}}[A^2]&={1+\alpha\over 2},\\ \end{align} (7.20)  \begin{align} {\rm{E}}[B^2]&={1-\alpha\over 2} \end{align} (7.21) and $${\rm{E}}[AB]={\rm{E}}[(X_1^2-Y_1^2)/4]=0$$. Therefore, $$A$$ and $$B$$ are independent Gaussian random variables. Therefore,   \begin{align} {\rm{E}}[ {\rm e}^{-{s'\over m}X_1Y_1}] &= {\rm{E}}\left[ {\rm e}^{-{s'\over m}(A+B)(A-B)}\right]\nonumber\\ &= {\rm{E}}\left[ {\rm e}^{-{s'\over m}A^2}\right] {\rm{E}}\left[{\rm e}^{{s'\over m}B^2}\right]. \end{align} (7.22) Given $$Z\sim \mathcal{N}(0,\sigma^2)$$, it is straightforward to show that, for $$\lambda>-1/(2\sigma^2)$$,   \begin{align} {\rm{E}}\left[{\rm e}^{-\lambda Z^2}\right]={1\over \sqrt{1+2\lambda \sigma^2}}. 
\end{align} (7.23) Therefore, for $${s'\over m}<{1\over 1-\alpha}$$,   \begin{align} {\rm{E}}[ {\rm e}^{-{s'\over m}X_1Y_1}] &= {1\over \sqrt{(1+{s'\over m}(1+\alpha))(1-{s'\over m}(1-\alpha))}}\nonumber\\ &= {1\over \sqrt{(1+{s'\alpha\over m})^2 -{s'^2\over m^2}}}.\label{eq:E-XY} \end{align} (7.24) Therefore, combining (7.19) and (7.24), it follows that   \begin{align} {\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -\tau\right)&\leq {\rm e}^{s'(\alpha-\tau)}{\rm e}^{-{m\over 2}\ln\left( \left(1+{s'\alpha\over m}\right)^2-\left({s'\over m}\right)^2 \right)}.\label{eq:sp-s-last} \end{align} (7.25) Replacing $$s'/m$$ with $$s$$ in (7.25) yields the desired result. □ Corollary 7.1 Consider $$u^n\in {\rm I}\kern-0.20em{\rm R}^n$$ and $$v^n\in {\rm I}\kern-0.20em{\rm R}^n$$ such that $$\|u^n\|=\|v^n\|=1$$. Also, consider matrix $$A\in{\rm I}\kern-0.20em{\rm R}^{m\times n}$$ with i.i.d. standard normal entries. Then,   \begin{align} {\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -0.45\right)\leq 2^{-0.05m}. \end{align} (7.26) Proof. From Lemma 7.3, for $$\alpha=\langle u^n,v^n\rangle$$, and $$s<1/(1-\alpha)$$,   \begin{align} {\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\alpha\leq -0.45 \right)\leq {\rm e}^{m((\alpha-0.45)s)-{m\over 2}\ln ((1+s\alpha)^2-s^2)}=2^{-m f(\alpha,s)}, \end{align} (7.27) where   \begin{align} f(\alpha,s)=(\log {\rm e})\left({1\over 2}\ln((1+s\alpha)^2-s^2)-(\alpha-0.45)s\right). \end{align} (7.28) Figure 1 plots $$\max_{s\in(0,{1\over 1-\alpha})} f(\alpha,s)$$ and shows that   \begin{align} \min_{\alpha\in(-1,1)}\max_{s\in\left(0,{1\over 1-\alpha}\right)} f(\alpha,s)\; \geq \; 0.05. \end{align} (7.29) □ Fig. 1. $$\max_{s\in(0,{1\over 1-\alpha})} f(\alpha,s).$$ The following two lemmas are proved in [13]. Lemma 7.4 ($$\chi^2$$ concentration) Fix $$\tau>0$$, and let $$U_i\stackrel{\rm i.i.d.}{\sim}\mathcal{N}(0,1)$$, $$i=1,2,\ldots,m$$. Then,   \begin{align} {\rm{P}}\left( \sum_{i=1}^m U_i^2 <m(1- \tau) \right) \leq {\rm e} ^{\frac{m}{2}(\tau + \ln(1- \tau))} \end{align} (7.30) and   \begin{align}\label{eq:chisq} {\rm{P}}\left( \sum_{i=1}^m U_i^2 > m(1+\tau) \right) \leq {\rm e} ^{-\frac{m}{2}(\tau - \ln(1+ \tau))}. \end{align} (7.31) Lemma 7.5 Consider $$U^n$$ and $$V^n$$, where, for each $$i$$, $$U_i$$ and $$V_i$$ are two independent standard normal random variables. Then the distribution of $$\langle U^n,V^n \rangle=\sum_{i=1}^nU_iV_i$$ is the same as the distribution of $$\|U^n\|G$$, where $$G\sim\mathcal{N}(0,1)$$ is independent of $$\|U^n\|_2$$. 7.3 Proof of Lemma 4.1 Before presenting the proof, we establish some preliminary results. Consider an analog process $${\mathbb{\mathbf{X}}}=\{X_i\}$$ with alphabet $$\mathcal{X}=[l,u]$$, where $$l,u\in{\rm I}\kern-0.20em{\rm R}$$. Let process $${\mathbb{\mathbf{Z}}}=\{Z_i\}$$ denote the $$b$$-bit quantized version of process $${\mathbb{\mathbf{X}}}$$. That is, $$Z_i=[X_i]_b$$, and the alphabet of process $${\mathbb{\mathbf{Z}}}$$ is $$\mathcal{Z}=\mathcal{X}_b=\{[x]_b: x\in\mathcal{X}\}$$. For $$i=1,\ldots,k+g$$, define a sequence of length $$t$$, $$\{S^{(i)}_j\}_{j=1}^t$$ over super-alphabet $$\mathcal{Z}^k$$ as follows.
For $$i=1$$, $$\{S^{(1)}_j\}_{j=1}^t$$ is defined as   \begin{align} &\underbrace{Z_1,\ldots,Z_k}_{S_1^{(1)}},Z_{k+1},\ldots,Z_{k+g},\underbrace{Z_{k+g+1},\ldots,Z_{2k+g}}_{S_2^{(1)}},\\ \end{align} (7.32)  \begin{align} &Z_{2k+g+1},\ldots, \underbrace{Z_{(t-1)(k+g)+1},\ldots,Z_{(t-1)(k+g)+k}}_{S_t^{(1)}},Z_{t(k+g)-g+1},\ldots,Z_n. \end{align} (7.33) Similarly, $$\left\{S^{(i)}_j\right\}_{j=1}^t$$, $$i=1,\ldots,k+g$$, is defined by starting the grouping of the symbols at $$Z_i$$. In other words,   \begin{align} S^{(i)}_j\triangleq Z_{(k+g)(j-1)+i}^{(k+g)(j-1)+i+k-1}. \end{align} (7.34) For instance, the sequence $$\left\{S^{(k+g)}_j\right\}_{j=1}^t$$, which corresponds to the largest shift at the beginning, is defined as   \begin{align} &Z_1,\ldots,Z_{k+g-1},\underbrace{Z_{k+g},\ldots,Z_{2k+g-1}}_{S_1^{(k+g)}},Z_{2k+g},\ldots,Z_{2k+2g-1},\\ \end{align} (7.35)  \begin{align} &\underbrace{Z_{2k+2g},\ldots,Z_{3k+2g-1}}_{S_2^{(k+g)}},Z_{3k+2g}\ldots. \end{align} (7.36) This definition implies that $$n$$ satisfies   \begin{align} (t-1)(k+g)+2k+g-1\leq n<(t-1)(k+g)+2k+g-1+k+g.\label{eq:bound-n} \end{align} (7.37) That is, $$t$$ is the only integer in the $$\big({n-2k-g+1\over k+g},{n-k+1\over k+g} \big]$$ interval, or in other words, $$t=\lfloor{n-k+1\over k+g}\rfloor$$. Before we prove Lemma 4.1, we prove the following auxiliary lemma. Lemma 7.6 For any given $$\epsilon>0$$ and any positive integers $$g$$ and $$k$$ such that $$4(k+g)/(n-k)<\epsilon$$, if $$\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon$$, then there exists $$i=1,\ldots,k+g$$, such that   \begin{align} \|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\|_1 \geq {\epsilon\over 2}, \end{align} (7.38) where $$S^{(i),t}$$ denotes the sequence $$S^{(i)}_1, S^{(i)}_2, \ldots, S^{(i)}_t$$. Note that in Lemma 7.6, $$\hat{p}^{(1)}(\cdot|S^{(i),t})$$ denotes the standard first-order empirical distribution of the super-alphabet sequence $$S^{(i),t}$$, i.e. for $$a^k\in\mathcal{Z}^k$$,   \begin{align} \hat{p}^{(1)}(a^k|S^{(i),t})={|\{j: S^{(i)}_j=a^k, 1\leq j \leq t\}|\over t}. \end{align} (7.39) Proof.
Note that by definition, for any $$a^k\in\mathcal{Z}^k$$,   \begin{align} \hat{p}^{(k)}(a^k|Z^n)&={1\over n-k}\sum_{i=k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}\nonumber\\ &={1\over n-k}\left(\sum_{i=1}^{k+g}\sum_{j=1}^t \mathbb{1}_{S^{(i)}_j=a^k}+\sum_{i=t(k+g)+k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}\right)\nonumber\\ &={t\over n-k}\sum_{i=1}^{k+g}\left({1\over t}\sum_{j=1}^t \mathbb{1}_{S^{(i)}_j=a^k}\right)+{1\over n-k}\sum_{i=t(k+g)+k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}\nonumber\\ &={t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})+{1\over n-k}\sum_{i=t(k+g)+k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}.\label{eq:emp-total-to-overlapping} \end{align} (7.40) Therefore,   \begin{align} \|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1&=\sum_{a^k}|\hat{p}^{(k)}(a^k|Z^n)-\mu_k^{(b)}(a^k)|\nonumber\\ &\stackrel{(a)}{=}\sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})+{1\over n-k}\sum_{i=t(k+g)+k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}-\mu_k^{(b)}(a^k)\right|\nonumber\\ &\stackrel{(b)}{\leq}\sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right|+{1\over n-k}\sum_{a^k}\sum_{i=t(k+g)+k}^n \mathbb{1}_{Z_{i-k+1}^i=a^k}\nonumber\\ &\stackrel{(c)}{=}\sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right|+{n-t(k+g)-k+1\over n-k},\label{eq:ell1-phat-k-mu-k-vs-phat-shifted-extra} \end{align} (7.41) where $$(a)$$ follows from (7.40), $$(b)$$ follows from the triangle inequality and $$(c)$$ holds because   \begin{align} \sum_{a^k} \mathbb{1}_{Z_{i-k+1}^i=a^k}=1. \end{align} (7.42) But since $$n$$ satisfies the bounds of (7.37), the last term on the right-hand side of (7.41) can be upper-bounded as   \begin{align} {n-t(k+g)-k+1\over n-k}\leq {k+g\over n-k}. \end{align} (7.43) Therefore, from (7.41) we have   \begin{align} \|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1& \leq \sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right|+ {k+g\over n-k}.\label{eq:dist-phat-k-avg-phat-1} \end{align} (7.44) On the other hand, again by the triangle inequality,   \begin{align} &\sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right|\nonumber\\ &\quad=\sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\left(\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right)+\left({t(k+g)\over n-k}-1\right)\mu_k^{(b)}(a^k)\right| \nonumber\\ &\quad\leq \sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\left(\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right)\right|+\left|{t(k+g)\over n-k}-1\right|\sum_{a^k} \mu_k^{(b)}(a^k) \nonumber\\ &\quad= \sum_{a^k}\left|{t\over n-k}\sum_{i=1}^{k+g}\left(\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right)\right|+\left|{t(k+g)\over n-k}-1\right|\nonumber\\ &\quad\leq {t\over n-k} \sum_{i=1}^{k+g} \sum_{a^k}\left|\hat{p}^{(1)}(a^k|S^{(i),t})-\mu_k^{(b)}(a^k)\right|+\left|{t(k+g)\over n-k}-1\right|\nonumber\\ &\quad= {t\over n-k} \sum_{i=1}^{k+g} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1+\left|{t(k+g)\over n-k}-1\right|.\label{eq:dist-phat-k-avg-phat-2} \end{align} (7.45) Therefore, combining (7.44) and (7.45) yields   \begin{align} \left\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\right\|_1& \leq {t\over n-k} \sum_{i=1}^{k+g} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1+\left|{t(k+g)\over n-k}-1\right|+{k+g\over n-k}.\label{eq:dist-phat-k-avg-phat-3} \end{align} (7.46) Since by construction $$t(k+g)\leq n-k$$, we have $$t/(n-k)\leq 1/(k+g)$$.
Therefore, it follows from (7.46) that   \begin{align} \left\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\right\|_1& \leq {1\over k+g} \sum_{i=1}^{k+g} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1+1-{t(k+g)\over n-k}+{k+g\over n-k}\nonumber\\ &= {1\over k+g} \sum_{i=1}^{k+g} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1+{n-t(k+g)+g\over n-k}.\label{eq:dist-phat-k-avg-phat-4} \end{align} (7.47) Notice that if   \begin{align} {n-t(k+g)+g\over n-k}\leq {\epsilon\over 2},\label{eq:cond-k-g-epsilon} \end{align} (7.48) and   \begin{align} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1\leq {\epsilon\over 2}, \end{align} (7.49) for all $$i$$, then, from (7.47), $$\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\leq {\epsilon}$$. But if $$4(k+g)/(n-k)<\epsilon$$, since $$t=\lfloor {n-k+1\over k+g} \rfloor$$, then (7.48) holds. This means that, to have $$\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1> {\epsilon}$$, we need $$\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\|_1> {\epsilon\over 2}$$, for at least one $$i$$ in $$\{1,\ldots,k+g\}$$. □ Now we can discuss the proof of Lemma 4.1. The proof is a straightforward extension of Lemma III.1.3 in [25]. However, we include a summary of the proof for completeness. By Lemma 7.6, if $$4(k+g)/(n-k)<\epsilon$$, then $$\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon$$ implies that there exists $$i\in\{1,\ldots,k+g\}$$ such that   \begin{align} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\right\|_1\geq {\epsilon\over 2}. \end{align} (7.50) We next bound the probability that the above event happens. For each $$i\in\{1,\ldots,k+g\}$$, define event $$\mathcal{E}^{(i)}$$ as follows   \begin{align} \mathcal{E}^{(i)}\triangleq \left\{D_{\rm KL}(\hat{p}^{(1)}(\cdot|S^{(i),t}),\mu_k^{(b)})> c\epsilon^2/4\right\}. \end{align} (7.51) By Pinsker’s inequality, for any $$i$$,   \begin{align} \left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}\right\|_1 \leq \sqrt{(2\ln 2) D_{\rm KL}(\hat{p}^{(1)}(\cdot|S^{(i),t}),\mu_k^{(b)}(\cdot))}. \end{align} (7.52) Therefore, if $$D_{\rm KL}(\hat{p}^{(1)}(\cdot|S^{(i),t}),\mu_k^{(b)}(\cdot))\leq c\epsilon^2/4$$, where as defined earlier $$c=1/(2\ln 2)$$, then $$\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}(\cdot)\|_1 \leq \epsilon/2$$. This implies that   \begin{align} {\rm{P}}\left(\left\|\hat{p}^{(1)}(\cdot|S^{(i),t})-\mu_k^{(b)}\right\|_1 >{\epsilon\over 2} \right)\leq {\rm{P}}\left(D_{\rm KL}\left(\hat{p}^{(1)}\left(\cdot|S^{(i),t}\right),\mu_k^{(b)}\right) >{c\epsilon^2\over 4}\right). \end{align} (7.53) On the other hand, for $$S^{(i),t}=s^t$$, where $$s^t\in (\mathcal{Z}^k)^t$$, we have   \begin{align} {\rm{P}}(S^{(i),t}=s^t)&={\rm{P}}\Big(Z_{(k+g)(j-1)+i}^{(k+g)(j-1)+i+k-1}=s_j, \;j=1,\ldots,t\Big)\nonumber\\ &\stackrel{(a)}{\leq} {\it{\Psi}}^t(b,g) \prod_{j=1}^t {\rm{P}}\Big(Z_{(k+g)(j-1)+i}^{(k+g)(j-1)+i+k-1}=s_j\Big), \end{align} (7.54) where $$(a)$$ follows from applying condition (4.4) $$t$$ times. However, by the standard method of types techniques [7], we have   \begin{align} \prod_{j=1}^t{\rm{P}}\Big(Z_{(k+g)(j-1)+i}^{(k+g)(j-1)+i+k-1}=s_j\Big)&= 2^{-t\left(\hat{H}_1(s^t)+D_{\rm KL}\left(\hat{p}^{(1)}\left(\cdot|s^t\right),\mu_k^{(b)}\right)\right)}. \end{align} (7.55) Therefore, if $$D_{\rm KL}(\hat{p}^{(1)}(\cdot|s^t),\mu_k^{(b)}) >{c\epsilon^2\over 4}$$, then   \begin{align} \prod_{j=1}^t{\rm{P}}\Big(Z_{(k+g)(j-1)+i}^{(k+g)(j-1)+i+k-1}=s_j\Big)&\leq 2^{-t\hat{H}_1(s^t)-c\epsilon^2t/4}.
\end{align} (7.56) Hence,   \begin{align} {\rm{P}}\Big(D_{\rm KL}\left(\hat{p}^{(1)}\left(\cdot|S^{(i),t}\right),\mu_k^{(b)}\right)>{c\epsilon^2\over 4}\Big) &=\sum_{s^t:D_{\rm KL}\left(\hat{p}^{(1)}\left(\cdot|s^t\right),\mu_k^{(b)}\right) >{c\epsilon^2\over 4} }{\rm{P}}\left(S^{(i),t}=s^t\right)\nonumber \\ &\leq 2^{-c\epsilon^2t/4}{\it{\Psi}}^t(b,g) \sum_{s^t:D_{\rm KL}\left(\hat{p}^{(1)}(\cdot|s^t),\mu_k^{(b)}\right) >{c\epsilon^2\over 4} }2^{-t\hat{H}_1(s^t)}\nonumber\\ &\leq 2^{-c\epsilon^2t/4} {\it{\Psi}}^t(b,g) \sum_{s^t }2^{-t\hat{H}_1(s^t)}. \end{align} (7.57) Since $$\sum_{s^t }2^{-t\hat{H}_1(s^t)}$$ can be proven to be smaller than the total number of types of sequences $$s^t\in(\mathcal{Z}^k)^t$$, we have $$\sum_{s^t }2^{-t\hat{H}_1(s^t)}\leq (t+1)^{|\mathcal{Z}|^k}$$. This upper bound combined by the union bound on $$\mathcal{E}^{(i)}$$, $$i=1,\ldots,k+g$$, yields the desired result. 7.4 Proof of Lemma 4.2 For any $$u^{\ell_1}\in\mathcal{Z}^{\ell_1}$$, $$v^{g}\in\mathcal{Z}^{g}$$, and $$w^{\ell_2}\in\mathcal{Z}^{\ell_2}$$, we have   \begin{align}\label{eq:quant-markov} &{\rm{P}}\left(Z^{\ell_1+g+\ell_2}=[u^{\ell_1}v^{g}w^{\ell_2}]\right)\nonumber\\ &\quad\leq \sum_{v^{g}\in\mathcal{Z}^g} {\rm{P}}\left(Z^{\ell_1+g+\ell_2}=[u^{\ell_1}v^{g}w^{\ell_2}]\right)\nonumber\\ &\quad= {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}, Z_{\ell_1+g+1}^{\ell_1+g+\ell_2}=w^{\ell_2}\right)\nonumber\\ &\quad= {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right) {\rm{P}}\left(Z_{\ell_1+g+1}=w_1 | Z^{\ell_1}= u^{\ell_1}\right) \nonumber\\ &\qquad\times {\rm{P}}\left(Z_{\ell_1+g+2}^{\ell_1+g+\ell_2}=w_2^{\ell_2}|Z_{\ell_1+g+1}=w_1,Z^{\ell_1}=u^{\ell_1}\right)\nonumber\\ &\quad= {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right)\int_{\mathcal{X}} {\rm{P}}\left(Z_{\ell_1+g+1}=w_1 | X_{\ell_1}= x_{\ell_1}, Z^{\ell_1}= u^{\ell_1}\right) \,{\rm d}\mu\left(x_{\ell_1}| Z^{\ell_1}= u^{\ell_1}\right) \nonumber\\ &\qquad\times {\rm{P}}\left(Z_{\ell_1+g+2}^{\ell_1+g+\ell_2}=w_2^{\ell_2}|Z_{\ell_1+g+1}=w_1,Z^{\ell_1}=u^{\ell_1}\right)\nonumber\\ &\quad \stackrel{(a)}{=} {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right)\int_{\mathcal{X}} {\rm{P}}\left(Z_{\ell_1+g+1}=w_1 | X_{\ell_1}= x_{\ell_1}\right) \,{\rm d}\mu\left(x_{\ell_1}| Z^{\ell_1}= u^{\ell_1}\right) \nonumber\\ &\qquad\times {\rm{P}}\left(Z_{\ell_1+g+2}^{\ell_1+g+\ell_2}=w_{2}^{\ell_2}|Z_{\ell_1+g+1}=w_1,Z^{\ell_1}=u^{\ell_1}\right)\nonumber\\ &\quad \stackrel{(b)}{=} \mu_b(u^{\ell_1})\left({\int_{\mathcal{X}}{K^{g+1}(x_{\ell_1},w_1)} \,{\rm d}\mu\left(x_{\ell_1}| u^{\ell_1}\right) }\right) \mu_b\left(w_2^{\ell_2}|w_1,u^{\ell_1}\right)\nonumber\\ &\quad = \mu_b(u^{\ell_1})\Bigg({\int_{\mathcal{X}^l}{K^{g+1}(x_{\ell_1},w_1)\over \pi(w_1)} \,{\rm d}\mu(x_{\ell_1}| u^{\ell_1}) }\Bigg) \pi(w_1)\mu_b\left(w_2^{\ell_2}|w_1,u^{\ell_1}\right)\nonumber\\ &\quad\leq \mu_b(u^{\ell_1}) \pi(w_1)\mu_b(w_2^{\ell_2}|w_1,u^{\ell_1})\left(\sup_{x: [x]_b=u_{\ell_1}} {{K^{g+1}(x,w_1)} \over \pi(w_1)}\right)\nonumber\\ &\quad\leq \mu_b(u^{\ell_1}) \pi(w_1)\mu_b(w_2^{\ell_2}|w_1,u^{\ell_1})\left(\sup_{(x,z)\in\mathcal{X}\times \mathcal{Z}} {{K^{g+1}(x,z)} \over \pi(z)}\right)\nonumber\\ &\quad= \mu_b(u^{\ell_1}) \pi(w_1)\mu_b(w_2^{\ell_2}|w_1,u^{\ell_1}){\it{\Psi}}_1(b,g), \end{align} (7.58) where (a) holds because $${\mathbb{\mathbf{X}}}$$ is a first-order Markov chain and in (b),   \begin{align} \mu_b\left(u^{\ell_1}\right)&= {\rm{P}}\left(Z^{\ell_1}=u^{\ell_1}\right),\\ \end{align} (7.59)  \begin{align} 
\mu_b\left(w_2^{\ell_2}|w_1,u^{\ell_1}\right)&={\rm{P}}\left(Z_{\ell_1+g+2}^{\ell_1+g+\ell_2}=w_2^{\ell_2}|Z_{\ell_1+g+1}=w_1,Z^{\ell_1}=u^{\ell_1}\right) \end{align} (7.60) and $$\mu(x_{\ell_1}| u^{\ell_1})$$ denotes the probability measure of $$X_{\ell_1}$$ conditioned on $$Z^{\ell_1}=u^{\ell_1}$$. Also, since the Markov chain is a stationary process, we have   \begin{align} K^{g+1}(x_{\ell_1},w_1)={\rm{P}}([X_{g+1}]_b=w_1|X_0=x_{\ell_1}). \end{align} (7.61) Another term in (7.58) is $$ \pi(w_1)\mu_b(w_{2}^{\ell_2}|w_1,u^{\ell_1})$$. Since $$\pi(w^{\ell_2})= \pi(w_1) \pi(w_{2}^{\ell_2}|w_1)$$, we have   \begin{align} \pi(w_1)\mu_b(w_{2}^{\ell_2}|w_1,u^{\ell_1})&= \pi(w^{\ell_2}){\mu_b(w_{2}^{\ell_2}|w_1,u^{\ell_1})\over \pi(w_{2}^{\ell_2}|w_1) }.\label{eq:bayes-rule} \end{align} (7.62) But   \begin{align} \mu_b(w_2^{\ell_2}|w_1,u^{\ell_1})&=\int \mu_b(w_{2}^{\ell_2}|x,w_1,u^{\ell_1})\,{\rm d}\mu(x|w_1,u^{\ell_1})\nonumber\\ &=\int \mu_b(w_2^{\ell_2}|x)\,{\rm d}\mu(x|w_1,u^{\ell_1}),\label{eq:int-Markov} \end{align} (7.63) where the second equality holds because $$(Z^{\ell_1},Z_{\ell_1+g+1})\to X_{\ell_1+g+1}\to Z_{\ell_1+g+2}^{\ell_1+g+\ell_2}$$ forms a Markov chain. Therefore,   \begin{align} {\mu_b\left(w_{2}^{\ell_2}|w_1,u^{\ell_1}\right)\over \pi(w_{2}^{\ell_2}|w_1) } &={\int \mu_b(w_{2}^{\ell_2}|x)\,{\rm d}\mu(x|w_1,u^{\ell_1})\over \pi(w_2^{\ell_2}|w_1) },\nonumber\\ &\stackrel{(a)}{\leq } {\Big(\sup\limits_{[x]_b=w_1}\mu_b(w_2^{\ell_2}|x)\Big)\int {\rm d}\mu(x|w_1,u^{\ell_1})\over \pi(w_{2}^{\ell_2}|w_1) },\nonumber\\ &\stackrel{(b)}{= } \sup\limits_{[x]_b=w_1}{\pi(w_{2}^{\ell_2}|x)\over \pi(w_{2}^{\ell_2}|w_1)}\nonumber\\ &\leq \sup_{(x,w^{\ell_2}): [x]_b=w_1} {\pi(w_{2}^{\ell_2}|x)\over \pi(w_{2}^{\ell_2}|w_1)}\nonumber\\ &\leq {\it{\Psi}}_2(b),\label{eq:bd-realted-psi2} \end{align} (7.64) where (a) and (b) hold because $$\mu(x|w_1,u^{\ell_1})$$ is only non-zero when $$x$$ is such that $$[x]_b=w_1$$, and $$\int {\rm d}\mu(x|w_1,u^{\ell_1})=1$$, respectively. Finally, combining (7.58), (7.62) and (7.64) yields the desired result. We next prove that, for a fixed $$b$$, $${\it{\Psi}}_1(b,g)$$ is a non-increasing function of $$g$$. For any $$x\in\mathcal{X}$$ and $$z\in\mathcal{X}_b$$, we have   \begin{align} {K^{g+1}(x,z) \over \pi(z)} &={{\rm{P}}([X_{g+1}]_b=z|X_0=x) \over {\rm{P}}([X_{g+1}]_b=z)} \nonumber\\ &={\int {\rm{P}}([X_{g+1}]_b=z|X_1=x',X_0=x)\,{\rm d}\mu(x'|X_0=x) \over {\rm{P}}([X_{g+1}]_b=z)} \nonumber\\ &\stackrel{(a)}{=}{\int {\rm{P}}([X_{g}]_b=z|X_0=x')\,{\rm d}\mu(x'|X_0=x) \over {\rm{P}}([X_{g}]_b=z)} \nonumber\\ &\leq\sup_{x'\in\mathcal{X}} {{\rm{P}}([X_{g}]_b=z|X_0=x') \over {\rm{P}}([X_{g}]_b=z)} \int {\rm d}\mu(x'|X_0=x) \nonumber\\ &\stackrel{(b)}{\leq} {\it{\Psi}}_1(b,g),\label{eq:Kg-plus-1} \end{align} (7.65) where $$(a)$$ follows because of the Markovity and stationarity assumptions and $$(b)$$ follows because $$\int {\rm d}\mu(x'|X_0=x)=1$$, together with the definition of $${\it{\Psi}}_1(b,g)$$. Since the right-hand side of (7.65) only depends on $$g$$ and $$b$$, taking the supremum of the left-hand side over $$(x,z)\in\mathcal{X}\times\mathcal{X}_b$$ proves that   \begin{align} {\it{\Psi}}_1(b,g+1)\leq {\it{\Psi}}_1(b,g). \end{align} (7.66) Furthermore, since $${\mathbb{\mathbf{X}}}$$ is assumed to be an aperiodic Markov chain, $$\lim_{g\to\infty} K^{g}(x,z)=\pi(z)$$, for all $$x$$ and $$z$$. Therefore, $${\it{\Psi}}_1(b,g)$$ converges to one, as $$g\to\infty$$.
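Before this monotone convergence is used in the next proof, it can be checked numerically. The following is a minimal sketch for a toy three-state chain standing in for the quantized source; the transition matrix and all numbers below are illustrative assumptions, not objects from the paper. It evaluates $$\sup_{x,z} K^g(x,z)/\pi(z)$$ for increasing $$g$$.

```python
import numpy as np

# Toy irreducible, aperiodic 3-state chain (an illustrative stand-in for the
# quantized source; the matrix is arbitrary and not taken from the paper).
P = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.70, 0.20],
              [0.25, 0.25, 0.50]])

# Stationary distribution pi: normalized left eigenvector for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

Kg = np.eye(3)                          # K^0 = identity
for g in range(1, 16):
    Kg = Kg @ P                         # K^g(x, z) = P(X_g = z | X_0 = x)
    psi1_g = np.max(Kg / pi[None, :])   # sup_{x, z} K^g(x, z) / pi(z)
    print(g, round(psi1_g, 6))
# The printed values are non-increasing in g and approach 1, as in Lemma 4.2.
```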
7.5 Proof of Theorem 4.3
Define $${\it{\Psi}}(b,g)={\it{\Psi}}_1(b,g){\it{\Psi}}_2(b)$$. Then it follows from Lemmas 4.1 and 4.2 that, given $$\epsilon>0$$, for any positive integers $$g$$ and $$k$$ that satisfy $$4(k+g)/(n-k)<\epsilon$$,   \begin{align} {\rm{P}}\left(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon\right)\leq (k+g){\it{\Psi}}_1^t(b,g){\it{\Psi}}_2^t(b)(t+1)^{|\mathcal{Z}|^k}2^{-c\epsilon^2t/4},\label{eq:ell-1-main} \end{align} (7.67) where $$t=\lfloor{n-k+1\over k+g}\rfloor$$ and $$c=1/(2\ln 2)$$. Since by assumption $$\lim_{b\to\infty}{\it{\Psi}}_2(b)=1$$, there exists $$b_{\epsilon}$$ such that for all $$b\geq b_{\epsilon}$$, $${\it{\Psi}}_2(b)\leq 2^{c\epsilon^2/16}$$. But $$ b_n=\lceil r\log\log n \rceil$$ diverges as $$n\to\infty$$. Therefore, there exists $$n_{\epsilon}>0$$, such that for all $$n\geq n_{\epsilon}$$,   \begin{align} {\it{\Psi}}_2(b_n)\leq 2^{c\epsilon^2/16}.\label{eq:bd-Psi-2} \end{align} (7.68) On the other hand, by the theorem’s assumption, there exists a sequence $$g=g_n$$, where $$g=o(n)$$, such that $$\lim_{n\to\infty}{\it{\Psi}}_1(b_n,g_n)=1 $$. Therefore, there exists $$n'_{\epsilon}$$ such that for all $$n\geq n'_{\epsilon}$$,   \begin{align} {\it{\Psi}}_1(b_n,g_n)\leq 2^{c\epsilon^2/16}.\label{eq:bd-Psi-1} \end{align} (7.69) Moreover, since $$g=g_n=o(n)$$ and $$k$$ is fixed, there exists $$n''_{\epsilon}>0$$ such that for all $$n\geq n''_{\epsilon}$$,   \begin{align} {4(k+g_n) \over n-k}<\epsilon.\label{eq:k-g-n} \end{align} (7.70) Therefore, for $$n>\max(n_{\epsilon},n'_{\epsilon},n''_{\epsilon})$$, from (7.67), (7.68), (7.69) and (7.70), we have   \begin{align} {\rm{P}}(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon)&\leq (k+g){\it{\Psi}}_1^t(b,g){\it{\Psi}}_2^t(b)(t+1)^{|\mathcal{Z}|^k}2^{-c\epsilon^2t/4}\nonumber\\ &\leq (k+g)(t+1)^{|\mathcal{Z}|^k}2^{-tc\epsilon^2/8}\nonumber\\ &\leq (k+g)n^{|\mathcal{Z}|^k}2^{-tc\epsilon^2/8}, \end{align} (7.71) where the last line follows from the fact that $$t+1\leq n$$. But since $$t=\lfloor{n-k+1\over k+g}\rfloor$$, $$t\geq {n-k\over k+g}-1$$. Hence,   \begin{align} {\rm{P}}(\|\hat{p}^{(k)}(\cdot|Z^n)-\mu_k^{(b)}\|_1\geq \epsilon)&\leq 2^{c\epsilon^2(1+k/(k+g))/8} (k+g)n^{|\mathcal{Z}|^k}2^{-{c \epsilon^2n\over 8(k+g)}}\nonumber\\ &\leq 2^{c\epsilon^2/4} (k+g)n^{|\mathcal{Z}|^k}2^{-{c \epsilon^2n\over 8(k+g)}}, \end{align} (7.72) which is the desired result.
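The phenomenon behind Theorem 4.3 — concentration of the $$k$$-block empirical distribution of the quantized process around $$\mu_k^{(b)}$$ — can also be illustrated by simulation. The sketch below uses a toy two-symbol Markov chain as a stand-in for the quantized sequence $$Z^n$$; the chain, the choice $$k=2$$ and the sample sizes are illustrative assumptions. It reports the $$\ell_1$$ distance between the empirical $$k$$-block distribution and the true $$k$$-block marginal as $$n$$ grows.

```python
import numpy as np
from collections import Counter
from itertools import product

rng = np.random.default_rng(0)

# Toy 2-symbol chain standing in for the quantized source (illustrative only).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])        # stationary distribution of P
k = 2

def true_k_block(P, pi, k):
    """mu_k(a^k) = pi(a_1) * prod_j P(a_j, a_{j+1})."""
    mu = {}
    for blk in product(range(2), repeat=k):
        p = pi[blk[0]]
        for a, b in zip(blk, blk[1:]):
            p *= P[a, b]
        mu[blk] = p
    return mu

def empirical_k_block(z, k):
    """Overlapping k-block empirical distribution of the sequence z."""
    cnt = Counter(tuple(z[i:i + k]) for i in range(len(z) - k + 1))
    tot = sum(cnt.values())
    return {blk: cnt.get(blk, 0) / tot for blk in product(range(2), repeat=k)}

mu_k = true_k_block(P, pi, k)
for n in (10**3, 10**4, 10**5):
    z = [int(rng.choice(2, p=pi))]
    for _ in range(n - 1):
        z.append(int(rng.choice(2, p=P[z[-1]])))
    ph = empirical_k_block(z, k)
    print(n, round(sum(abs(ph[blk] - mu_k[blk]) for blk in mu_k), 4))
# The l1 distance shrinks as n grows, in line with the theorem.
```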
7.6 Proof of Theorem 4.4
For each $$i$$, let random variable $$J_i$$ be an indicator of a jump at time $$i$$. That is,   \begin{align} J_i=\mathbb{1}_{X_i\neq X_{i-1}}. \end{align} (7.73) Consider $$x\in \mathcal{X}$$ and $$z\in\mathcal{X}_b$$. Then, by definition,   \begin{align} {K^g(x,z)\over \pi(z)}&={{\rm{P}}([X_g]_b=z|X_0=x)\over {\rm{P}}([X_0]_b=z)}. \end{align} (7.74) But,   \begin{align} {\rm{P}}([X_g]_b=z|X_0=x)&= \sum_{d^g\in\{0,1\}^g}{\rm{P}}([X_g]_b=z,J^g=d^g|X_0=x)\nonumber\\ &\stackrel{(a)}{=}\sum_{d^g\in\{0,1\}^g}{\rm{P}}([X_g]_b=z|J^g=d^g,X_0=x){\rm{P}}(J^g=d^g), \end{align} (7.75) where $$(a)$$ follows from the independence of the jump events and the value of the Markov process at each time. Now if there is a jump between time $$1$$ and time $$g$$, then by definition of the transition probabilities the value of $$[X_g]_b$$ becomes independent of $$X_0$$ and of the jump pattern. In other words, for any $$d^g\neq (0,\ldots,0)$$,   \begin{align} {\rm{P}}([X_g]_b=z|J^g=d^g,X_0=x)={\rm{P}}([X_g]_b=z)={\rm{P}}([X_0]_b=z), \end{align} (7.76) where the last equality follows from the stationarity of the Markov process. On the other hand, $$J^g= (0,\ldots,0)$$ means that there has been no jump from time $$0$$ up to time $$g$$, and therefore $$X_g=X_0$$. This implies that   \begin{align} {\rm{P}}([X_g]_b=z|J^g=0^g,X_0=x)=\mathbb{1}_{z=[x]_b}. \end{align} (7.77) Since $${\rm{P}}(J^g= (0,\ldots,0))=(1-p)^g$$, combining the intermediate steps, it follows that   \begin{align} {\rm{P}}([X_g]_b=z|X_0=x)&= (1-(1-p)^g){\rm{P}}([X_0]_b=z)+(1-p)^g\mathbb{1}_{z=[x]_b}, \end{align} (7.78) and as a result   \begin{align} {K^g(x,z)\over \pi(z)}&=(1-(1-p)^g)+ {(1-p)^g\mathbb{1}_{z=[x]_b}\over {\rm{P}}([X_0]_b=z)}. \end{align} (7.79) But given that by our assumption $$f(x)\geq f_{\min}$$, $${\rm{P}}([X_0]_b=z)\geq f_{\min} 2^{-b}$$. Therefore,   \begin{align} {\it{\Psi}}_1(b,g)& =\sup_{(x,z)\in\mathcal{X}\times \mathcal{X}_b} {K^g(x,z)\over \pi(z)}\nonumber\\ &\leq (1-(1-p)^g)+ {(1-p)^g 2^b\over f_{\min}}.\label{eq:bd-Psi-1-Markov} \end{align} (7.80) For $$b=b_n=\lceil r\log\log n \rceil $$ and $$g=g_n=\lfloor \gamma \, r\log\log n \rfloor$$, we have   \begin{align} \log ((1-p)^g 2^b)&=g\log (1-p)+b\\ \end{align} (7.81)  \begin{align} &\leq r(\gamma \log(1-p) +1)\log\log n. \end{align} (7.82) But since $$\gamma>-{1\over \log(1-p)}$$, $$\gamma \log(1-p) +1<0$$, which from (7.80) proves the desired result, i.e. $$\lim_{n\to\infty}{\it{\Psi}}_1(b_n,g_n)=1$$. It is easy to check that due to its special distribution, the quantized version of process $${\mathbb{\mathbf{X}}}$$ is also a first-order Markov process. Therefore, from (4.15), we have   \begin{align} {\it{\Psi}}_2(b)&=\sup_{(x,w^2)\in\mathcal{X}\times \mathcal{Z}^{2}: [x]_b=w_1} {\pi(w_{2}|x)\over \pi(w_{2}|w_1)}\nonumber\\ &=\sup_{(x,w_2)\in\mathcal{X}\times \mathcal{Z}} {{\rm{P}}([X_2]_b=w_2|X_1=x)\over {\rm{P}}([X_2]_b=w_2|[X_1]_b=[x]_b)}. \end{align} (7.83) But   \begin{align} {\rm{P}}([X_2]_b=w_2|X_1=x)&={\rm{P}}([X_2]_b=w_2,J_2=1|X_1=x)p\nonumber\\ &\;\;\;+{\rm{P}}([X_2]_b=w_2,J_2=0|X_1=x)(1-p)\nonumber\\ &=p{\rm{P}}([X_2]_b=w_2)+(1-p)\mathbb{1}_{w_2=[x]_b}, \end{align} (7.84) and similarly   \begin{align} {\rm{P}}([X_2]_b=w_2|[X_1]_b=[x]_b)&={\rm{P}}([X_2]_b=w_2,J_2=1|[X_1]_b=[x]_b)p\nonumber\\ &\;\;\;+{\rm{P}}([X_2]_b=w_2,J_2=0|[X_1]_b=[x]_b)(1-p)\nonumber\\ &=p{\rm{P}}([X_2]_b=w_2)+(1-p)\mathbb{1}_{w_2=[x]_b}, \end{align} (7.85) which proves that $${\it{\Psi}}_2(b)=1$$, for all $$b$$.
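The interplay between $$b_n$$, $$g_n$$ and $$p$$ in the bound (7.80) can be made concrete with a quick computation. The sketch below evaluates the right-hand side of (7.80) for a uniform stationary density (so $$f_{\min}=1$$), a jump probability $$p=0.3$$ and $$r=2$$; these values, and the assumption that $$\log$$ denotes $$\log_2$$, are illustrative, and $$\gamma$$ is taken strictly above the threshold $$-1/\log(1-p)$$ of the theorem.

```python
import math

# Illustrative parameters (not from the paper): uniform stationary density,
# so f_min = 1, jump probability p, and log interpreted as log base 2.
p, f_min, r = 0.3, 1.0, 2.0
gamma = -2.0 / math.log2(1 - p)      # twice the threshold -1/log(1-p)

for e in (2, 4, 8, 16, 32):
    n = 10.0 ** e
    llog = math.log2(math.log2(n))
    b = math.ceil(r * llog)                  # b_n = ceil(r log log n)
    g = math.floor(gamma * r * llog)         # g_n = floor(gamma r log log n)
    bound = (1 - (1 - p) ** g) + (1 - p) ** g * 2 ** b / f_min   # RHS of (7.80)
    print(f"n = 1e{e}:  b = {b}, g = {g}, bound on Psi_1 = {bound:.4f}")
# Because gamma * log2(1 - p) + 1 < 0, the bound tends to 1 as n grows,
# matching the conclusion lim Psi_1(b_n, g_n) = 1.
```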
7.7 Proof of Theorem 5.1
By definition   \begin{align} \bar{d}_k({\mathbb{\mathbf{X}}})=\limsup_{b\to\infty} {H([X_{k+1}]_b|[X^k]_b)\over b}, \end{align} (7.86) and therefore, for any $$\delta_1>0$$, there exists $$b_{\delta_1}$$ such that for all $$b\geq b_{\delta_1}$$, $${H([X_{k+1}]_b|[X^k]_b)\over b} \leq \bar{d}_k({\mathbb{\mathbf{X}}}) + \delta_1. $$ Since $$b=b_n=\lceil r\log\log n \rceil$$ converges to infinity as $$n\to\infty$$, for all $$n$$ large enough, $$b=b_n>b_{\delta_1}$$, and as a result   \begin{align} {H([X_{k+1}]_b|[X^k]_b)\over b} \leq \bar{d}_k({\mathbb{\mathbf{X}}}) + \delta_1. \end{align} (7.87) For the rest of the proof, assume that $$n$$ is large enough such that $$b_n>b_{\delta_1}$$. Define distribution $$q_{k+1}$$ over $$\mathcal{X}_b^{k+1}$$ as the $$(k+1)$$th order distribution of the quantized process $$[X_1]_b, \ldots, [X_n]_b$$. That is, for $$a^{k+1}\in\mathcal{X}_b^{k+1}$$,   \begin{align} q_{k+1}(a_{k+1}|a^k)={\rm{P}}([X_{k+1}]_b=a_{k+1}|[X^k]_b=a^k),\label{eq:q-k+1-source} \end{align} (7.88) and   \begin{align} q_{k}(a^{k})=\sum_{a_{k+1}\in\mathcal{X}_b}q_{k+1}(a^{k+1})={\rm{P}}([X^k]_b=a^k).\label{eq:q-k-source} \end{align} (7.89) Also define distributions $$\hat{q}_{k+1}^{(1)}$$ and $$\hat{q}_{k+1}^{(2)}$$ as the empirical distributions induced by $${\hat{X}}^n$$ and $$[X^n]_b$$, respectively. In other words, $$\hat{q}_{k}^{(1)}(a^{k})=\hat{p}^{(k)}(a^k|{\hat{X}}^n)$$, $$\hat{q}_{k}^{(2)}(a^{k})=\hat{p}^{(k)}(a^k|[X^n]_b)$$ and   \begin{align} \hat{q}_{k+1}^{(1)}(a_{k+1}|a^k)={\hat{q}_{k+1}^{(1)}(a^{k+1})\over \hat{q}_{k}^{(1)}(a^{k}) }={\hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n)\over \hat{p}^{(k)}(a^k|{\hat{X}}^n)} \end{align} (7.90) and   \begin{align} \hat{q}^{(2)}_{k+1}(a_{k+1}|a^k)={\hat{q}_{k+1}^{(2)}(a^{k+1})\over \hat{q}_{k}^{(2)}(a^{k}) }={\hat{p}^{(k+1)}(a^{k+1}|[X^n]_b)\over \hat{p}^{(k)}(a^k|[X^n]_b)}. \end{align} (7.91) As the first step, we would like to prove that $${1\over b} \hat{H}_{k}({\hat{X}}^n)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta$$. Using the definitions above, we have   \begin{align}\label{eq:simplifyprobterm1} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) &= \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) \log{1\over q_{k+1}(a_{k+1}|a^k)} \nonumber \\ &= \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) \log{\hat{q}_{k+1}^{(1)}(a_{k+1}|a^k)\over q_{k+1}(a_{k+1}|a^k)}\nonumber\\ &\quad+ \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}}\hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) \log{1\over \hat{q}_{k+1}^{(1)}(a_{k+1}|a^k)} \nonumber \\ &= \sum_{a^k}\hat{q}_{k}^{(1)}(a^k) D_{\rm KL}(\hat{q}_{k+1}^{(1)}(\cdot|a^k)\| q_{k+1}(\cdot|a^k)) + \hat{H}_{k}({\hat{X}}^n). \end{align} (7.92) Since $${\hat{X}}^n$$ is the minimizer of (2.5), we have   \begin{align} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) \leq (\bar{d}_k({\mathbb{\mathbf{X}}})+\delta)b. \end{align} (7.93) Combining this equation with (7.92) and the fact that $$D_{\rm KL}$$ is always non-negative, we obtain   \begin{align} {1\over b} \hat{H}_{k}({\hat{X}}^n)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta.\label{eq:bd-H-hat-X-hat-Q-MAP} \end{align} (7.94) As the second step of the proof, we show that with high probability   \begin{align} {1\over b} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) \leq \bar{d}_k({\mathbb{\mathbf{X}}}) +\delta. \end{align} (7.95) In other words, we would like to show that the vector $$[X^n]_b=([X_1]_b, [X_2]_b, \ldots, [X_n]_b)$$ satisfies the constraint of the optimization (2.5).
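The decomposition (7.92) is reused several times below, so a quick numerical check may be helpful. The sketch assumes a binary alphabet, $$k=1$$, an arbitrary conditional law $$q$$ and a short quantized sequence (all illustrative choices, not objects from the paper), with weights $$w_{a^{k+1}}=\log 1/q_{k+1}(a_{k+1}|a^k)$$ as in the Q-MAP cost; it verifies that the weighted empirical cost equals the conditional KL term plus the conditional empirical entropy.

```python
import numpy as np
from collections import Counter

# Illustrative setup (not from the paper): binary alphabet, k = 1, a
# conditional law q(a2 | a1) and Q-MAP weights w = log2 1/q(a2 | a1).
q_cond = np.array([[0.85, 0.15],
                   [0.30, 0.70]])
u = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # a toy quantized sequence

# Empirical (k+1)-block and k-block distributions (overlapping blocks).
c2 = Counter(zip(u, u[1:])); tot = len(u) - 1
p2 = {ab: c / tot for ab, c in c2.items()}
p1 = {a: c / tot for a, c in Counter(u[:-1]).items()}

# Left-hand side of (7.92): the weighted empirical cost.
lhs = sum(p2[ab] * np.log2(1.0 / q_cond[ab]) for ab in p2)

# Right-hand side: sum_a p1(a) D_KL(p2(.|a) || q(.|a)) + empirical H(U_2 | U_1).
rhs = 0.0
for (a, b), pab in p2.items():
    cond = pab / p1[a]                          # empirical p(b | a)
    rhs += pab * np.log2(cond / q_cond[a, b])   # conditional KL term
    rhs += -pab * np.log2(cond)                 # conditional entropy term
print(round(lhs, 10), round(rhs, 10))           # the two sides coincide
```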
Following the same steps as those used in deriving (7.92), we get   \begin{align} &\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b)\nonumber\\ &\quad= \sum_{a^k}\hat{q}_{k}^{(2)}(a^k) D_{\rm KL}(\hat{q}_{k+1}^{(2)}(\cdot|a^k)\| q_{k+1}(\cdot|a^k))+ \hat{H}_{k}([X^n]_b).\label{eq:simplifyprobterm2} \end{align} (7.96) Also, note that   \begin{align} \sum_{a^k}\hat{q}_{k}^{(2)}(a^k) D_{\rm KL}(\hat{q}_{k+1}^{(2)}(\cdot|a^k)\| q_{k+1}(\cdot|a^k)) &= \sum_{a^k}\hat{q}_{k}^{(2)}(a^k)\sum_{a_{k+1}} \hat{q}_{k+1}^{(2)}(a_{k+1}|a^k) \log {\hat{q}_{k+1}^{(2)}(a_{k+1}|a^k)\over q_{k+1}(a_{k+1}|a^k)}\nonumber\\ &=\sum_{a^{k+1}}\hat{q}_{k+1}^{(2)}(a^{k+1})\Bigg( \log {\hat{q}_{k+1}^{(2)}(a^{k+1})\over q_{k+1}(a^{k+1})}-\log {\hat{q}_{k}^{(2)}(a^{k})\over q_{k}(a^{k})}\Bigg)\nonumber\\ &= D_{\rm KL}(\hat{q}_{k+1}^{(2)}\| q_{k+1})- D_{\rm KL}(\hat{q}_{k}^{(2)}\| q_{k}).\label{eq:D-KL-cond-regular} \end{align} (7.97) Therefore, since $$0\leq D_{\rm KL}(\hat{q}_{k+1}^{(2)}\| q_{k+1})- D_{\rm KL}(\hat{q}_{k}^{(2)}\| q_{k})\leq D_{\rm KL}(\hat{q}_{k+1}^{(2)}\| q_{k+1})$$, from (7.96),   \begin{align}\label{eq:cost-vs-KL-distance} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) \leq \hat{H}_{k}([X^n]_b) +D_{\rm KL}(\hat{q}_{k+1}^{(2)}\| q_{k+1}). \end{align} (7.98) Given $$\delta_2>0$$, define event $$\mathcal{E}_1$$ as   \begin{equation}\label{eqdef:Ec1} \mathcal{E}_1\triangleq \{\|\hat{q}_{k+1}^{(2)}-q_{k+1}\|_1<\delta_2\}. \end{equation} (7.99) Consider random vector $$U^{k+1}$$ distributed according to $$\hat{q}_{k+1}^{(2)}$$, which denotes the empirical distribution of $$[X^n]_b$$. Then, by definition, $$\hat{H}_k([X^n]_b)= H(U_{k+1}|U^k)= H(U^{k+1})-H(U^k).$$ Therefore,   \begin{align} |\hat{H}_k([X^n]_b)-H([X_{k+1}]_b|[X^k]_b)|&=|H(U^{k+1})-H(U^k)-H([X^{k+1}]_b)+H([X^k]_b)|\nonumber\\ &\leq |H(U^{k+1})-H([X^{k+1}]_b)|+|H(U^k)-H([X^k]_b)|. \end{align} (7.100) Conditioned on $$\mathcal{E}_1$$, $$\|\hat{q}_{k}^{(2)}-q_{k}\|_1\leq \|\hat{q}_{k+1}^{(2)}-q_{k+1}\|_1\leq \delta_2$$, and therefore, from Lemma 7.1,   \begin{align} |\hat{H}_k([X^n]_b)-H([X_{k+1}]_b|[X^k]_b)|\leq -2\delta_2\log \delta_2 +2\delta_2(k+1)\log |\mathcal{X}_b| \end{align} (7.101) or   \begin{align} \left|{\hat{H}_k([X^n]_b)\over b}-{H([X_{k+1}]_b|[X^k]_b)\over b}\right|\leq -{2\delta_2\over b}\log \delta_2 +\left({2(k+1)\log |\mathcal{X}_b|\over b}\right)\delta_2.\label{eq:bd-dist-Hh-H-emp} \end{align} (7.102) Moreover, conditioned on $$\mathcal{E}_1$$, since $$\|\hat{q}_{k+1}^{(2)}-{q}_{k+1}\|_1\leq \delta_2$$ and $$\hat{q}_{k+1}^{(2)} \ll {q}_{k+1}$$, from Lemma 7.2, we have   \begin{align} D(\hat{q}_{k+1}^{(2)}\|{q}_{k+1} )\leq -\delta_2\log \delta_2 +\delta_2(k+1)\log |\mathcal{X}_b|-\delta_2 \log q_{\min}, \end{align} (7.103) where   \begin{align} q_{\min}=\min_{u^{k+1}\in\mathcal{X}_b^{k+1}: q_{k+1}(u^{k+1})\neq 0} {\rm{P}}([X^{k+1}]_b=u^{k+1})\geq f_{k+1} |\mathcal{X}_b|^{-(k+1)}, \end{align} (7.104) where the last line follows from (5.1). 
Therefore,   \begin{align} D(\hat{q}_{k+1}^{(2)}\|{q}_{k+1} )&\leq -\delta_2\log \delta_2 +\delta_2(k+1)\log |\mathcal{X}_b|\nonumber\\ &\;\;\;-\delta_2\log f_{k+1} +\delta_2(k+1) \log |\mathcal{X}_b| \end{align} (7.105) or   \begin{align} {1\over b}D(\hat{q}_{k+1}^{(2)}\|{q}_{k+1} )&\leq -{\delta_2\over b}(\log \delta_2 +\log f_{k+1}) +\left({2(k+1)\log |\mathcal{X}_b|\over b}\right)\delta_2.\label{eq:bd-D-q2-lemma5-used} \end{align} (7.106) Hence, combining (7.98), (7.102) and (7.106), it follows that, conditioned on $$\mathcal{E}_1$$,   \begin{align} &{1\over b} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) \nonumber\\ &\quad\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta_1 +\left({4(k+1)\log |\mathcal{X}_b|\over b}\right)\delta_2 -{\delta_2\over b}(3\log \delta_2 +\log f_{k+1}).\label{eq:ub-Hhat-Xo-delat-1-delta-2} \end{align} (7.107) Choose $$\delta_1=\delta/2$$ and $$\delta_2$$ small enough that   \begin{align} &\left({4(k+1)\log |\mathcal{X}_b|\over b}\right)\delta_2 -{\delta_2\over b}(3\log \delta_2 +\log f_{k+1})\leq {\delta\over 2}. \label{eq:cond-on-delta2} \end{align} (7.108) Note that while $$|\mathcal{X}_b|$$ grows exponentially in $$b$$, for all bounded sources, $${1\over b}\log |\mathcal{X}_b|<2$$. Therefore, it is always possible to satisfy the above condition by an appropriate choice of the parameter $$\delta_2$$. For this choice of parameters, from (7.107), conditioned on $$\mathcal{E}_1$$  \begin{equation}\label{eq:constraint} {1\over b} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) \leq \bar{d}_k({\mathbb{\mathbf{X}}}) +\delta, \end{equation} (7.109) and hence $$[X^n]_b$$ satisfies the constraint of the Q-MAP optimization described in (2.5). Since $${\hat{X}}^n$$ minimizes $$\|Au^n-Y^m\|^2$$ among all sequences that satisfy this constraint, we conclude that, conditioned on $$\mathcal{E}_1$$,   \begin{align} \|A{\hat{X}}^n-Y^m\|&\leq \|A[X^n]_b-Y^m\|\nonumber\\ &= \|A([X^n]_b-X^n)\|\nonumber\\ &\leq \sigma_{\max}(A)\|X^n-[X^n]_b\|\nonumber\\ &\leq \sigma_{\max}(A)2^{-b}\sqrt{n}.\label{eq:error-Xh-vs-error-Xob} \end{align} (7.110) Our goal is to use this equation to derive a bound for $$\|\hat{X}^n- X^n\|$$. The main challenge here is to find a lower bound for $$\|A{\hat{X}}^n-Y^m\|$$ in terms of $$\|\hat{X}^n- X^n\|$$. Given $$\delta_3>0$$ and $$\tau>0$$, define set $$\mathcal{C}_n$$ and events $$\mathcal{E}_2$$ and $$\mathcal{E}_3$$ as   \begin{gather}\label{eqdef:CCn} \mathcal{C}_n \triangleq \left\{u^n\in\mathcal{X}_b^n: {1\over nb}\ell_{\rm LZ}(u^n)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+2\delta\right\},\\ \end{gather} (7.111)  \begin{gather} \label{eqdef:E2n} \mathcal{E}_2 \triangleq \{\sigma_{\max}(A)<\sqrt{n}+2\sqrt{m}\} \end{gather} (7.112) and   \begin{equation}\label{eqdef:E3n} \mathcal{E}_3 \triangleq \{\|A(u^n-X^n)\|\geq \|u^n-X^n\|\sqrt{(1-\tau)m}: \forall u^n\in\mathcal{C}_n\}, \end{equation} (7.113) respectively. We will prove the following two statements: (i) $${\hat{X}}^n\in\mathcal{C}_n$$, for $$n$$ large enough; (ii) $${\rm{P}}(\mathcal{E}_1\cap \mathcal{E}_2 \cap \mathcal{E}_3)$$ converges to one as $$n$$ grows to infinity. For the moment we assume that these two statements are true and complete the proof.
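As an aside, the two quantities that drive (7.110)–(7.112) — the quantization error $$\|X^n-[X^n]_b\|$$ and $$\sigma_{\max}(A)$$ — are easy to sanity-check. The sketch below assumes the $$b$$-bit quantizer $$[x]_b=2^{-b}\lfloor 2^b x\rfloor$$ and i.i.d. standard normal entries for $$A$$; both choices are made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, b = 2000, 800, 4

# Per-coordinate quantization error is below 2^{-b}, so the vector error is
# at most 2^{-b} * sqrt(n), as used in (7.110).
x = rng.random(n)                           # a bounded signal in [0, 1)
xb = np.floor(x * 2**b) / 2**b              # assumed b-bit quantizer [x]_b
print(np.linalg.norm(x - xb), 2.0**-b * np.sqrt(n))   # first <= second

# Largest singular value of an i.i.d. N(0, 1) design versus the threshold
# sqrt(n) + 2 sqrt(m) defining the event E_2 in (7.112).
A = rng.standard_normal((m, n))
print(np.linalg.svd(A, compute_uv=False)[0], np.sqrt(n) + 2 * np.sqrt(m))
```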
Therefore, conditioned on $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$, it follows from (7.110) that   \begin{align} \|{\hat{X}}^n-X^n\|\sqrt{(1-\tau)m} &\leq n\left(1+2\sqrt{m\over n}\right)2^{-b}\nonumber\\ &\leq 3n2^{-b}, \end{align} (7.114) where the last line follows from the fact that $$m\leq n$$. Therefore, conditioned on $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$,   \begin{align} {1\over \sqrt{n}}\|{\hat{X}}^n-X^n\|\leq \sqrt{9 n\over (1-\tau)m2^{2b}}.\label{eq:error-Xh-vs-error-Xob-per-symb} \end{align} (7.115) To prove that $${\hat{X}}^n\in\mathcal{C}_n$$, for $$n$$ large enough, note that, from (7.94), $${1\over b}\hat{H}_k({\hat{X}}^n)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta$$. On the other hand, from (7.4), for our choice of parameter $$b=b_n$$, for any given $$\delta''>0$$, for all $$n$$ large enough,   \begin{align} {1\over n}\ell_{\rm LZ}({\hat{X}}^n)\leq \hat{H}_k({\hat{X}}^n)+\delta''. \end{align} (7.116) Therefore, for all $$n$$ large enough,   \begin{align} {1\over nb}\ell_{\rm LZ}({\hat{X}}^n)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta+{\delta''\over b}. \end{align} (7.117) Choosing $$\delta''$$ such that $${\delta''\over b}\leq \delta$$ proves the desired result, i.e. $${\hat{X}}^n\in\mathcal{C}_n$$. Let $$\tau=1-(\log n)^{-2r/(1+f)}$$, where $$f>0$$ is a free parameter. For $$b=b_n=\lceil r\log \log n\rceil$$, $$2^{2b}\geq (\log n)^{2r}$$. Therefore, from (7.115),   \begin{align} {1\over \sqrt{n}}\|{\hat{X}}^n-X^n\|&\leq \sqrt{9 (\log n)^{2r\over 1+f} \over (1+\delta)\bar{d}_k({\mathbb{\mathbf{X}}}) (\log n)^{2r}} \nonumber\\ &= \sqrt{9 \over (1+\delta)\bar{d}_k({\mathbb{\mathbf{X}}}) (\log n)^{2rf \over 1+f}} . \end{align} (7.118) Therefore, for any $$\epsilon>0$$, $$n$$ large enough, conditioned on $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$,   \begin{align} {1\over \sqrt{n}}\|{\hat{X}}^n-X^n\|\leq \epsilon. \end{align} (7.119) To finish the proof we study the probability of $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$. By Theorem 4.2, there exists integer $$g_{\delta_2}$$, only depending on the source distribution and $$\delta_2$$, such that for $$n>6(k+g_{\delta_2})/\delta_2+k$$,   \begin{align} {\rm{P}}(\mathcal{E}_1^c)\leq 2^{c\delta_2^2/8}(k+g_{\delta_2})n^{|\mathcal{X}_b|^k}2^{-{n\delta_2^2 \over 8(k+g_{\delta_2})}}, \end{align} (7.120) where $$c=1/(2\ln 2)$$. Also, as proved in [5],   \begin{align} {\rm{P}}(\mathcal{E}_2^c)\leq 2^{-m/2}. \end{align} (7.121) Finally, from (7.3), the size of $$\mathcal{C}_n$$ can be upper bounded as   \begin{align} |\mathcal{C}_n|\leq 2^{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)+1}. \end{align} (7.122) Now Lemma 7.4 combined with the union bound proves that, for a fixed vector $$X^n$$,   \begin{align} {\rm{P}}_A(\mathcal{E}_3^c)\leq 2^{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)+1} {\rm e}^{{m\over 2}(\tau +\ln (1-\tau))}, \end{align} (7.123) where $${\rm{P}}_A$$ reflects the fact that $$[X^n]_b$$ is fixed, and the randomness is in the generation of matrix $$A$$. For our choice of parameter $$\tau$$, this bound, combined with Fubini’s theorem and the Borel–Cantelli lemma, proves that $${\rm{P}}_{X^n}(\mathcal{E}_3^c)\to 0$$, almost surely.
7.8 Proof of Theorem 5.2
The proof is very similar to the proof of Theorem 5.1 and follows the same logic. Similar to the proof of Theorem 5.1, for $$\delta_1>0$$, we assume that $$n$$ is large enough such that   \begin{align} {H([X_{k+1}]_b|[X^k]_b)\over b} \leq \bar{d}_k({\mathbb{\mathbf{X}}}) + \delta_1. 
\end{align} (7.124) Also, given $$\delta_2>0$$, $$\delta_3>0$$ and $$\tau>0$$, consider set $$\mathcal{C}_n$$ and events $$\mathcal{E}_2$$ and $$\mathcal{E}_3$$, defined in (7.111), (7.112) and (7.113), respectively. Since $${\hat{X}}^n$$ is a minimizer of $$\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|u^n) +{\lambda\over n^2}\|Au^n-Y^m\|^2$$, we have   \begin{align}\label{eq:cost-Xhat-Xnb} &\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n) +{\lambda\over n^2}\|A{\hat{X}}^n-Y^m\|^2\nonumber\\ &\quad \leq \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) +{\lambda\over n^2}\|A[X^n]_b-Y^m\|^2\nonumber\\ &\quad \leq \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) +{\lambda(\sigma_{\max}(A))^2\over n^2}\|[X^n]_b-X^n\|^2\nonumber\\ &\quad \leq \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b) +{\lambda(\sigma_{\max}(A))^22^{-2b}\over n}. \end{align} (7.125) Define distributions $$q_{k+1}$$, $$\hat{q}_{k+1}^{(1)}$$ and $$\hat{q}_{k+1}^{(2)}$$ over $$\mathcal{X}_b^{k+1}$$ as in the proof of Theorem 5.1. Then, given $$\delta>0$$, from (7.102) and (7.107), we set $$\delta_1=\delta/2$$ and $$\delta_2$$ small enough such that (7.108) is satisfied. Then following the same steps as the ones that led to (7.109) we obtain that conditioned on $$\mathcal{E}_1$$ (defined in (7.99)),   \begin{align} {1\over b} \hat{H}_k([X^n]_b)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta. \end{align} (7.126) Moreover, by the same argument, conditioned on $$\mathcal{E}_1$$, $${1\over b}\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b)\leq \bar{d}_k({\mathbb{\mathbf{X}}})+\delta$$, and, by a decomposition analogous to (7.92), $$\hat{H}_k({\hat{X}}^n)\leq \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|{\hat{X}}^n)$$. Hence, from (7.125), we have   \begin{align} \hat{H}_k({\hat{X}}^n) +{\lambda\over n^2}\|A{\hat{X}}^n-Y^m\|^2 \leq b(\bar{d}_k({\mathbb{\mathbf{X}}})+\delta ) +{\lambda(\sigma_{\max}(A))^22^{-2b}\over n}.\label{eq:ub-cost-L-QMAP} \end{align} (7.127) Since both terms on the left-hand side of (7.127) are non-negative, each of them is bounded by the right-hand side, i.e.   \begin{align}\label{eq:upper-bd-Hhat-k} {1\over b}\hat{H}_k({\hat{X}}^n) &\;\leq\; \bar{d}_k({\mathbb{\mathbf{X}}})+\delta +{\lambda(\sigma_{\max}(A))^22^{-2b}\over bn} \end{align} (7.128) and   \begin{align}\label{eq:upper-bd-meas-error} {\lambda\over b n^2}\|A{\hat{X}}^n-Y^m\|^2&\;\leq\; \bar{d}_k({\mathbb{\mathbf{X}}})+\delta +{\lambda(\sigma_{\max}(A))^22^{-2b}\over bn}. \end{align} (7.129) Since $$\lambda=\lambda_n=(\log n)^{2r}$$ and $$b=b_n=\lceil r\log\log n\rceil$$, $$\lambda 2^{-2b}\leq 1$$, and hence, conditioned on $$\mathcal{E}_2$$,   \begin{align} {\lambda(\sigma_{\max}(A))^22^{-2b}\over bn}\leq {(\sqrt{n}+2\sqrt{m})^2\over nb}\leq {9\over b}. \end{align} (7.130) Therefore, since $$b_n\to\infty$$, as $$n\to\infty$$, for all $$n$$ large enough, conditioned on $$\mathcal{E}_2$$,   \begin{align} {\lambda(\sigma_{\max}(A))^22^{-2b}\over bn}\leq \delta, \end{align} (7.131) and therefore, from (7.128) and (7.129),   \begin{align}\label{eq:upper-bd-Hhat-k-simple} {1\over b}\hat{H}_k({\hat{X}}^n) &\;\leq\; \bar{d}_k({\mathbb{\mathbf{X}}})+2\delta \end{align} (7.132) and   \begin{align}\label{eq:upper-bd-meas-error-simple} {\lambda\over b n^2}\|A{\hat{X}}^n-Y^m\|^2&\;\leq\; \bar{d}_k({\mathbb{\mathbf{X}}})+2\delta. \end{align} (7.133) Therefore, choosing $$\delta_3=3\delta$$, conditioned on $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$, $${\hat{X}}^n\in\mathcal{C}_n$$.
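The step leading to (7.130) relies on $$\lambda_n 2^{-2b_n}\leq 1$$ for $$\lambda_n=(\log n)^{2r}$$ and $$b_n=\lceil r\log\log n\rceil$$. A quick check of this inequality, under the illustrative assumptions that $$\log$$ denotes $$\log_2$$ and $$r=1.5$$:

```python
import math

r = 1.5   # illustrative value of the parameter r
for e in (2, 3, 5, 8, 12):
    n = 10.0 ** e
    lam = math.log2(n) ** (2 * r)                  # lambda_n = (log n)^{2r}
    b = math.ceil(r * math.log2(math.log2(n)))     # b_n = ceil(r log log n)
    print(f"n = 1e{e}:  lambda_n * 2^(-2 b_n) = {lam * 2.0 ** (-2 * b):.4f}")
# Every printed value is at most 1, since 2^{2 ceil(r log log n)} >= (log n)^{2r}.
```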
Finally, from (7.129) we have   \begin{align} {\lambda(1-\tau)m\over n^2 b} \|{\hat{X}}^n-X^n\|^2 \leq \bar{d}_k({\mathbb{\mathbf{X}}})+2\delta, \end{align} (7.134) or   \begin{align}\label{eq:bound-error-triangle} {1\over\sqrt{ n}} \|{\hat{X}}^n-X^n\| &\leq \sqrt{ (\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)bn\over \lambda (1-\tau)m}, \end{align} (7.135) which proves that, for our set of parameters, conditioned on $$\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$$, $${1\over \sqrt{n}}\|X^n-{\hat{X}}^n\|$$ can be made arbitrarily small. The parameter $$\tau$$ can be set, and the convergence of $${\rm{P}}(\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3)$$ to one can be established, exactly as in the proof of Theorem 5.1.
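The proof that follows analyses the projected gradient descent (PGD) iteration (6.2): a gradient step $$S^n(t+1)={\hat{X}}^n(t)+\mu A^{\top}(Y^m-A{\hat{X}}^n(t))$$ followed by a projection onto the feasible set $$\mathcal{F}_o$$. For orientation, a minimal sketch of this loop is given below. The projection used here (clipping to $$[0,1]$$ and $$b$$-bit quantization) is only a placeholder for the actual projection onto $$\mathcal{F}_o$$, the toy dimensions take $$m\geq n$$ so that even the placeholder loop is contractive, and $$\mu=1/m$$ as in the analysis below; the signal, sizes and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, b, T, sigma = 128, 1024, 8, 200, 0.01   # toy sizes; m >= n keeps this loop contractive

def project_placeholder(u, b):
    """Placeholder for the projection onto F_o in (6.2): clip to [0, 1] and
    quantize to b bits. The true step also enforces the empirical-cost
    constraint that defines F_o."""
    return np.floor(np.clip(u, 0.0, 1.0) * 2**b) / 2**b

# Toy piecewise-constant signal in [0, 1), standing in for a source realization.
x = np.repeat(rng.random(8), n // 8)
A = rng.standard_normal((m, n))
y = A @ x + sigma * rng.standard_normal(m)

mu, xt = 1.0 / m, np.zeros(n)
for t in range(1, T + 1):
    s = xt + mu * A.T @ (y - A @ xt)     # gradient step: S^n(t+1)
    xt = project_placeholder(s, b)       # projection step
    if t % 50 == 0:
        print(t, np.linalg.norm(xt - x) / np.sqrt(n))   # per-symbol error
```

With these toy dimensions the per-symbol error settles near the quantization and noise floor; the content of Theorem 6.1 is that, with the true projection onto $$\mathcal{F}_o$$, the same contraction holds with far fewer measurements.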
7.9 Proof of Theorem 6.1
As argued in the proof of Theorem 5.1, given $$\delta>0$$, there exists $$b_{\delta}$$, such that for $$b>b_{\delta}$$,   \begin{align} {H([X_{k+1}]_b|[X^k]_b)\over b}\leq \bar{d}_k({\mathbb{\mathbf{X}}})+{\delta\over 2}.\label{eq:conv-dk-cond-H} \end{align} (7.136) In the rest of the proof we assume that $$n$$ is large enough so that $$b=b_n> b_{\delta}$$. Define distributions $$q_{k+1}$$ and $$\hat{q}_{k+1}$$ over $$\mathcal{X}_b^{k+1}$$ as follows. Let $$q_{k+1}$$ and $$\hat{q}_{k+1}$$ denote the distribution of $$[X^{k+1}]_b$$ and the empirical distribution of $$[X^n]_b$$, respectively. From (7.96) and (7.97), it follows that   \begin{align} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b)\leq \hat{H}_k([X^n]_b)+D_{\rm KL}(\hat{q}_{k+1}\| q_{k+1}).\label{eq:bd-dist-Hhat-W-Phat} \end{align} (7.137) Define events $$\mathcal{E}_1$$ and $$\mathcal{E}_2$$ as   \begin{align} \mathcal{E}_1=\{\sigma_{\max}(A)<\sqrt{n}+2\sqrt{m}\} \end{align} (7.138) and   \begin{align} \mathcal{E}_2=\{\|\hat{q}_{k+1}-q_{k+1}\|_1<{\delta'}\}, \end{align} (7.139) where $$\delta'>0$$ is selected such that   \begin{align} {1\over b}(-2\delta'\log {\delta'} +2\delta'(k+1)\log |\mathcal{X}_b|)\leq {\delta\over 4}\label{eq:delta-p-vs-delta} \end{align} (7.140) and   \begin{align} {1\over b}(-\delta'\log {\delta'}-\delta' \log f_{k+1} +2\delta'(k+1)\log |\mathcal{X}_b|)\leq {\delta\over 4},\label{eq:delta-p-vs-delta-2} \end{align} (7.141) for all $$b$$ large enough. This is always possible, since $$ (-2\delta'\log {\delta'})/b$$ is a decreasing function of $$b$$ and $$\log|\mathcal{X}_b|/b$$ can be upper bounded by a constant not depending on $$b$$. Hence, by picking $$\delta'$$ small enough, both (7.140) and (7.141) hold. Given this choice of parameters, from (7.102), conditioned on $$\mathcal{E}_2$$, we have   \begin{align} {1\over b}\left|\hat{H}_k([X^n]_b)-H([X_{k+1}]_b|[X^k]_b)\right|\leq {\delta\over 4}. \end{align} (7.142) Also, from Lemma 7.2 and (7.106), conditioned on $$\mathcal{E}_2$$,   \begin{align} {1\over b}D(\hat{q}_{k+1}\|q_{k+1} )&\leq -{\delta'\over b}(\log \delta'+\log f_{k+1}) +\left({2(k+1)\log |\mathcal{X}_b|\over b}\right)\delta'. \end{align} (7.143) Therefore, from (7.141),   \begin{align} {1\over b}D(\hat{q}_{k+1}\|q_{k+1} )&\leq {\delta\over 4}. \end{align} (7.144) Conditioned on $$\mathcal{E}_2$$, from (7.136) and (7.137), it follows that   \begin{align} {1\over b}\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|[X^n]_b)&\leq {\hat{H}_k([X^n]_b)\over b}+{1\over b}D_{\rm KL}(\hat{q}_{k+1}\| q_{k+1})\nonumber\\ &\leq {H([X_{k+1}]_b|[X^k]_b)\over b}+{\delta\over 4}+{\delta\over 4}\nonumber\\ &\leq \bar{d}_k({\mathbb{\mathbf{X}}})+{\delta\over 2}+{\delta\over 4}+{\delta\over 4}=\bar{d}_k({\mathbb{\mathbf{X}}})+\delta, \end{align} (7.145) which implies that, conditioned on $$\mathcal{E}_2$$, $$[X^n]_b\in\mathcal{F}_o$$. Since $${\hat{X}}^n(t+1)$$ is the solution of (6.2), automatically, $${\hat{X}}^n(t+1)\in\mathcal{F}_o$$. Therefore, conditioned on $$\mathcal{E}_2$$,   \begin{align} \|{\hat{X}}^n(t+1)-S^n(t+1)\|\leq \|[X^n]_b-S^n(t+1)\|, \end{align} (7.146) or equivalently   \begin{align} \|{\hat{X}}^n(t+1)-[X^n]_b+[X^n]_b-S^n(t+1)\|&\leq \|[X^n]_b-S^n(t+1)\|.\label{eq:add-sub-Xon} \end{align} (7.147) Squaring both sides of (7.147) and cancelling the common term $$\|[X^n]_b-S^n(t+1)\|^2$$ from both sides, we derive   \begin{align} \|{\hat{X}}^n(t+1)-[X^n]_b\|^2 +2\langle {\hat{X}}^n(t+1)-[X^n]_b,[X^n]_b-S^n(t+1) \rangle &\leq 0. \end{align} (7.148) If we plug in the expression for $$S^n(t+1)$$ we obtain   \begin{align} \|{\hat{X}}^n(t+1)-[X^n]_b\|^2 &\leq \;2\langle {\hat{X}}^n(t+1)-[X^n]_b,-[X^n]_b+S^n(t+1) \rangle\nonumber\\ &=2\langle {\hat{X}}^n(t+1)-[X^n]_b,-[X^n]_b+{\hat{X}}^n(t)+\mu A^{\top}(Y^m-A{\hat{X}}^n(t)) \rangle\nonumber\\ &=2\langle {\hat{X}}^n(t+1)-[X^n]_b,{\hat{X}}^n(t)-[X^n]_b \rangle\nonumber\\ &\quad-2\mu \langle {\hat{X}}^n(t+1)-[X^n]_b,A^{\top}A({\hat{X}}^n(t)-X^n) \rangle\nonumber\\ &\quad+2\mu \langle {\hat{X}}^n(t+1)-[X^n]_b,A^{\top}Z^m\rangle\nonumber\\ &=2\langle {\hat{X}}^n(t+1)-[X^n]_b,{\hat{X}}^n(t)-[X^n]_b \rangle\nonumber\\ &\quad-2\mu \langle {\hat{X}}^n(t+1)-[X^n]_b,A^{\top}A({\hat{X}}^n(t)-[X^n]_b) \rangle\nonumber\\ &\quad+2\mu \langle {\hat{X}}^n(t+1)-[X^n]_b,A^{\top}A(X^n-[X^n]_b) \rangle\nonumber\\ &\quad+2\mu \langle {\hat{X}}^n(t+1)-[X^n]_b,A^{\top}Z^m\rangle.\label{eq:Et+1-vs-Et-step1} \end{align} (7.149) Define   \begin{align} E^n(t) \triangleq {\hat{X}}^n(t)-[X^n]_b \end{align} (7.150) and   \begin{align} {\tilde{E}}^n(t) \triangleq {E^n(t)\over \|E^n(t)\|}. \end{align} (7.151) Then, it follows from (7.149) that   \begin{align} \|E^n(t+1)\| &\leq\;2\langle {\tilde{E}}^n(t+1), {\tilde{E}}^n(t)\rangle \|E^n(t)\| -2\mu \langle {\tilde{E}}^n(t+1),A^{\top}A {\tilde{E}}^n(t) \rangle \|E^n(t)\| \nonumber\\ &\qquad+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}A(X^n-[X^n]_b) \rangle\nonumber\\ &\qquad+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle\nonumber\\ &\quad=\;2\Big(\langle {\tilde{E}}^n(t+1), {\tilde{E}}^n(t)\rangle -\mu \langle A{\tilde{E}}^n(t+1),A {\tilde{E}}^n(t) \rangle \Big)\|E^n(t)\| \nonumber\\ &\qquad+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}A(X^n-[X^n]_b) \rangle\nonumber\\ &\qquad+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle \nonumber\\ &\quad\leq 2\Big(\langle {\tilde{E}}^n(t+1), {\tilde{E}}^n(t)\rangle -\mu \langle A{\tilde{E}}^n(t+1),A {\tilde{E}}^n(t) \rangle \Big)\|E^n(t)\| \nonumber\\ &\qquad+2\mu\sigma_{\max}(A^{\top}A)\|X^n-[X^n]_b\|\nonumber\\ &\qquad+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle. 
\label{eq:Et+1-vs-Et-step2} \end{align} (7.152) Note that for a fixed $$X^n$$, $${\tilde{E}}^n(t)$$ and $${\tilde{E}}^n(t+1)$$ can only take a finite number of different values. Let $$\mathcal{S}_e$$ denote the set of all possible normalized error vectors. That is,   \begin{align} \mathcal{S}_e \triangleq \left\{{u^n-v^n \over \|u^n-v^n\|}: u^n,v^n\in\mathcal{F}_o \right\}. \end{align} (7.153) Clearly, $${\tilde{E}}^n(t)$$ and $${\tilde{E}}^n(t+1)$$ are both members of $$\mathcal{S}_e$$. Define event $$\mathcal{E}_3$$ as follows   \begin{align} \mathcal{E}_3\triangleq \Big\{ \langle u^n, v^n \rangle -{1\over m} \langle Au^n,A v^n\rangle \leq 0.45 :\; \forall \; (u^n,v^n)\in\mathcal{S}_e^2\Big\}. \end{align} (7.154) Conditioned on $$\mathcal{E}_1\cap \mathcal{E}_2\cap \mathcal{E}_3$$, it follows from (7.152) that   \begin{align} \|E^n(t+1)\| &\leq 0.9 \|E^n(t)\|+{2 (\sqrt{n}+2\sqrt{m})^2\over m}(2^{-b}\sqrt{n})\nonumber\\ &\;\;\;+2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle. \label{eq:Et+1-vs-Et-noisy-step3} \end{align} (7.155) The only remaining term on the right-hand side of (7.155) is $$2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle$$. To upper bound this term, we employ Lemma 7.5. Let $$A_i^m$$, $$i=1,\ldots,n$$, denote the $$i$$th column of matrix $$A$$. Then,   \begin{align} A^{\top}Z^m=\left[\begin{array}{c} \langle A_1^m,Z^m \rangle\\ \langle A_2^m,Z^m \rangle\\ \vdots\\ \langle A_n^m,Z^m \rangle\\ \end{array} \right], \end{align} (7.156) and   \begin{align} \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle =\sum_{i=1}^n {\tilde{E}}_i(t+1) \langle A_i^m,Z^m \rangle. \end{align} (7.157) By Lemma 7.5, for each $$i$$, $$\langle A_i^m,Z^m \rangle$$ is distributed as $$\|Z^m\| G_i $$, where $$G_i$$ is a standard normal random variable independent of $$\|Z^m\|$$. Therefore, since the columns of matrix $$A$$ are also independent, overall $$\langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle$$ is distributed as   \begin{align} \|Z^m\|\sum_{i=1}^n {\tilde{E}}_i(t+1) G_i, \end{align} (7.158) where $$G_i$$ are i.i.d. standard normal independent of $$\|Z^m\|$$. Given $$\tau_1>0$$ and $$\tau_2>0$$, define events $$\mathcal{E}_4$$ and $$\mathcal{E}_5$$ as follows:   \begin{align} \mathcal{E}_4\triangleq \left\{{1\over m}\|Z^m\|^2\leq (1+\tau_1)\sigma^2\right\} \end{align} (7.159) and   \begin{align} \mathcal{E}_5\triangleq\{ |\langle\tilde{e}^n,G^n\rangle|^2\leq 1+\tau_2: \;\forall \tilde{e}^n\in\mathcal{S}_e\}. \end{align} (7.160) By Lemma 7.4,   \begin{align} {\rm{P}}(\mathcal{E}_4^c)&={\rm{P}}\Big({1\over \sigma^2}\|Z^m\|^2> (1+\tau_1)m\Big)\nonumber\\ &\leq {\rm e} ^{-\frac{m}{2}(\tau_1 - \ln(1+ \tau_1))}. \end{align} (7.161) For a fixed vector $$\tilde{e}^n$$, $$\langle\tilde{e}^n,G^n\rangle =\sum_{i=1}^n \tilde{e}_iG_i$$ has a standard normal distribution and therefore, by letting $$m=1$$ in Lemma 7.4, it follows that   \begin{align} {\rm{P}}(|\langle\tilde{e}^n,G^n\rangle|^2 >1+\tau_2 )\leq {\rm e} ^{-0.5(\tau_2 - \ln(1+ \tau_2))}. \end{align} (7.162) Hence, by the union bound,   \begin{align} {\rm{P}}(\mathcal{E}_5^c)&\leq |\mathcal{S}_e|\,{\rm e} ^{-0.5(\tau_2 - \ln(1+ \tau_2))}\nonumber\\ &\leq 2^{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)}{\rm e} ^{-0.5(\tau_2 - \ln(1+ \tau_2))}, \end{align} (7.163) where the last inequality follows from (7.173) and (7.179). 
But, for $$\tau_2>7$$,   \begin{align} {\rm e} ^{-0.5(\tau_2 - \ln(1+ \tau_2))}\leq 2^{-0.5 \tau_2}, \end{align} (7.164) which implies that for $$\tau_2>7$$,   \begin{align} {\rm{P}}(\mathcal{E}_5^c)\leq 2^{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)-0.5\tau_2}. \end{align} (7.165) Setting   \begin{align} \tau_2=2nb(\bar{d}_k({\mathbb{\mathbf{X}}})+3\delta)-1 \end{align} (7.166) ensures that   \begin{align} {\rm{P}}(\mathcal{E}_5^c)\leq 2^{-\delta bn+0.5}, \end{align} (7.167) which converges to zero as $$n$$ grows to infinity. Finally, setting $$\tau_1=1$$, conditioned on $$\mathcal{E}_4\cap\mathcal{E}_5$$, we have   \begin{align} 2\mu \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle& ={2\over m} \langle {\tilde{E}}^n(t+1),A^{\top}Z^m\rangle \nonumber\\ &\leq {1\over m}\sqrt{m(1+\tau_1)\sigma^2(1+\tau_2)}\nonumber\\ &= \sqrt{(2m)\sigma^2 (2nb(\bar{d}_k({\mathbb{\mathbf{X}}})+3\delta))\over m^2}\nonumber\\ &= { \sigma\over 2}\sqrt{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+3\delta)\over m}.\label{eq:bd-noisy-proj} \end{align} (7.168) Combining (7.168) and (7.155) yields the desired upper bound on the last term in (7.155). To finish the proof, we need to show that $${\rm{P}}(\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3)$$ also approaches one, as $$n$$ grows without bound. Reference [5] proves that   \begin{align} {\rm{P}}(\mathcal{E}_1^c)\leq 2^{-m/2}. \end{align} (7.169) By Theorem 4.2, there exists integer $$g_{\delta'}$$, depending only on $$\delta'$$ and the source distribution, such that for any $$n>6(k+g_{\delta'})/(b\delta)+k$$,   \begin{align} {\rm{P}}(\mathcal{E}_2^c)\leq 2^{c\delta'^2/8} (k+g_{\delta'}+1)n^{|\mathcal{X}_b|^{k+1}}2^{-{nc \delta'^2 \over 8(k+g_{\delta'}+1)}}, \end{align} (7.170) where $$c=1/(2\ln 2)$$. This proves that for our choice of parameters, $${\rm{P}}(\mathcal{E}_2^c)$$ converges to zero. In the rest of the proof we bound $${\rm{P}}(\mathcal{E}_3^c)$$. From Corollary 7.1, for any $$u^n,v^n\in\mathcal{S}_e$$,   \begin{align} {\rm{P}}\Big({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -0.45\Big)\leq 2^{-0.05m}. \end{align} (7.171) Therefore, by the union bound,   \begin{align} {\rm{P}}(\mathcal{E}_3^c)&={\rm{P}}\left({1\over m}\langle Au^n,Av^n\rangle-\langle u^n,v^n\rangle\leq -0.45: {\rm for}\;{\rm some}\;(u^n,v^n)\in\mathcal{S}_e^2\right)\nonumber\\ &\leq |\mathcal{S}_e|^2 2^{-0.05m}.\label{eq:ub-Ec1-union} \end{align} (7.172) Note that   \begin{align} |\mathcal{S}_e|\leq |\mathcal{F}_o|^2.\label{eq:size-Se-vs-Fo} \end{align} (7.173) In the following, we derive an upper bound on the size of $$\mathcal{F}_o$$. For any $$u^n\in\mathcal{F}_o$$, by definition, we have   \begin{align} \sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|u^n)\leq b(\bar{d}_k({\mathbb{\mathbf{X}}})+\delta).\label{eq:compare-cost-linear-W} \end{align} (7.174) Let $$\hat{q}_{k+1}^{(u)}=\hat{p}^{(k+1)}(\cdot|u^n)$$ denote the $$(k+1)$$th order empirical distribution of $$u^n$$. Following the steps used in deriving (7.92), it follows that   \begin{align}\label{eq:bd-HXt} &\sum_{a^{k+1}\in\mathcal{X}_b^{k+1}} w_{a^{k+1}} \hat{p}^{(k+1)}(a^{k+1}|u^n)\nonumber\\ &\quad=\sum_{a^k}\hat{q}_{k}^{(u)}(a^k) D_{\rm KL}(\hat{q}_{k+1}^{(u)}(\cdot|a^k)\| q_{k+1}(\cdot|a^k))+ \hat{H}_{k}(u^n). \end{align} (7.175) Since all the terms on the right-hand side of (7.175) are non-negative, it follows from (7.174) that   \begin{align} \hat{H}_{k}(u^n)&\leq b(\bar{d}_k({\mathbb{\mathbf{X}}})+\delta). 
\label{eq:bd-Hhat-Xh-Xo} \end{align} (7.176) On the other hand, given our choice of quantization level $$b$$, for $$n$$ large enough, for any $$v^n\in\mathcal{X}_b^n$$,   \begin{align} {1\over nb}\ell_{\rm LZ}(v^n)&\leq {1\over b} \hat{H}_k(v^n)+\delta.\label{eq:UB-LZ} \end{align} (7.177) Therefore, for any $$u^n\in\mathcal{F}_o$$, from (7.176) and (7.177), it follows that   \begin{align} {1\over nb}\ell_{\rm LZ}(u^n)&\leq {1\over b} \hat{H}_k(u^n)+\delta \nonumber\\ &\leq \bar{d}_k({\mathbb{\mathbf{X}}})+2\delta. \end{align} (7.178) Note that, from (7.3), we have   \begin{align} |\mathcal{F}_o|\leq \left|\left\{v^n:\; {1\over nb}\ell_{\rm LZ}(v^n)\leq\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta\right\}\right|\leq 2^{nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)}.\label{eq:size-Fo} \end{align} (7.179) Hence, from (7.172),   \begin{align} &{\rm{P}}(\mathcal{E}_3^c)\leq |\mathcal{F}_o|^4 2^{-0.05m}\leq 2^{ nb(\bar{d}_k({\mathbb{\mathbf{X}}})+2\delta)-0.05m}\leq 2^{-2nb \delta}.\label{eq:bound-et-n} \end{align} (7.180)
8. Conclusion
For a stationary process $${{\mathbb{\mathbf{X}}}}$$, we have studied the problem of estimating $$X^n$$ from $$m$$ response variables $$Y^m = AX^n+Z^m$$, under the assumption that the distribution of $${{\mathbb{\mathbf{X}}}}$$ is known. We have proposed the Q-MAP optimization, which estimates $$X^n$$ from $$Y^m$$. The new optimization satisfies the following properties: (i) It applies to generic classes of distributions, as long as they satisfy certain mixing conditions. (ii) Unlike other Bayesian approaches, in high-dimensional settings, the performance of the Q-MAP optimization can be characterized for generic distributions. Our analyses show that, for certain distributions such as spike-and-slab, asymptotically, Q-MAP achieves the minimum required sampling rate. (Whether Q-MAP achieves the optimal sampling rate for general distributions is still an open question.) (iii) PGD can be applied to approximate the solution of Q-MAP. While the optimization involved in Q-MAP is non-convex, we have characterized the performance of the corresponding PGD algorithm, under both noiseless and noisy settings. Our analysis has revealed that with slightly more measurements than Q-MAP, the PGD-based method recovers $$X^n$$ accurately in the noiseless setting.
Funding
National Science Foundation (CCF-1420328).
References
1. Bickel P. J., Ritov Y. & Tsybakov A. B. (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37, 1705–1732.
2. Blumensath T. & Davies M. E. (2009) Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal., 27, 265–274.
3. Bradley R. C. (2005) Basic properties of strong mixing conditions. A survey and some open questions. Probab. Surv., 2, 107–144.
4. Candes E. & Tao T. (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35, 2313–2351.
5. Candès E., Romberg J. & Tao T. (2005) Decoding by linear programming. IEEE Trans. Inform. Theory, 51, 4203–4215.
6. Cover T. & Thomas J. (2006) Elements of Information Theory, 2nd edn. New York: Wiley.
7. Csiszar I. & Körner J. (2011) Information Theory: Coding Theorems for Discrete Memoryless Systems. New York, NY: Cambridge University Press.
8. Donoho D. & Montanari A. (2016) High dimensional robust m-estimation: asymptotic variance via approximate message passing. Probab. Theory Relat. Fields, 166, 935–969.
9. Guo D. & Verdú S. (2005) Randomly spread CDMA: Asymptotics via statistical physics. IEEE Trans. Inform. Theory, 51, 1983–2010.
10. Hans C. (2009) Bayesian Lasso regression. Biometrika, 96, 835–845.
11. Hans C. (2010) Model uncertainty and variable selection in Bayesian Lasso regression. Statist. Comput., 20, 221–229.
12. Hoadley B. (1970) A Bayesian look at inverse linear regression. J. Amer. Statist. Assoc., 65, 356–369.
13. Jalali S., Maleki A. & Baraniuk R. (2014) Minimum complexity pursuit for universal compressed sensing. IEEE Trans. Inform. Theory, 60, 2253–2268.
14. Jalali S., Montanari A. & Weissman T. (2012) Lossy compression of discrete sources via the Viterbi algorithm. IEEE Trans. Inform. Theory, 58, 2475–2489.
15. Jalali S. & Poor H. V. (2017) Universal compressed sensing for almost lossless recovery. IEEE Trans. Inform. Theory, 63, 2933–2953.
16. Lindley D. V. & Smith A. F. (1972) Bayes estimates for the linear model. J. R. Statist. Soc. Ser. B, 34, 1–41.
17. Liu C. (1996) Bayesian robust multivariate linear regression with incomplete data. J. Amer. Statist. Assoc., 91, 1219–1227.
18. Maleki A. (2010) Approximate message passing algorithm for compressed sensing. PhD Thesis, Stanford University.
19. Mitchell T. J. & Beauchamp J. J. (1988) Bayesian variable selection in linear regression. J. Amer. Statist. Assoc., 83, 1023–1032.
20. O’Hara R. B. & Sillanpää M. J. (2009) A review of Bayesian variable selection methods: what, how and which. Bayes. Anal., 4, 85–117.
21. Park T. & Casella G. (2008) The Bayesian Lasso. J. Amer. Statist. Assoc., 103, 681–686.
22. Plotnik E., Weinberger M. J. & Ziv J. (1992) Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel-Ziv algorithm. IEEE Trans. Inform. Theory, 38, 66–72.
23. Rényi A. (1959) On the dimension and entropy of probability distributions. Acta Math. Acad. Sci. Hung., 10, 193–215.
24. Rigollet P. & Tsybakov A. (2011) Exponential screening and optimal rates of sparse estimation. Ann. Statist., 39, 731–771.
25. Shields P. (1996) The Ergodic Theory of Discrete Sample Paths. Providence, RI: American Mathematical Society.
26. Ji S., Xue Y. & Carin L. (2008) Bayesian compressive sensing. IEEE Trans. Signal Process., 56, 2346–2356.
27. Som S. & Schniter P. (2012) Compressive imaging using approximate message passing and a Markov-tree prior. IEEE Trans. Signal Process., 60, 3439–3448.
28. Su W. & Candes E. (2016) SLOPE is adaptive to unknown sparsity and asymptotically minimax. Ann. Statist., 44, 1038–1068.
29. Tibshirani R., Saunders M., Rosset S., Zhu J. & Knight K. (2005) Sparsity and smoothness via the fused Lasso. J. R. Statist. Soc. Ser. B, 67, 91–108.
30. Tipping M. E. (2001) Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1, 211–244.
31. Viterbi A. J. (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory, 13, 260–269.
32. West M. (1984) Outlier models and prior distributions in Bayesian linear regression. J. R. Statist. Soc. Ser. B, 46, 431–439.
33. Wu Y. & Verdú S. (2010) Rényi information dimension: fundamental limits of almost lossless analog compression. IEEE Trans. Inform. Theory, 56, 3721–3748.
34. Ziv J. & Lempel A. (1978) Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory, 24, 530–536.
© The authors 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.
