Entropic CLT and Phase Transition in High-dimensional Wishart Matrices

Abstract

We consider high dimensional Wishart matrices $$\mathbb{X} \mathbb{X}^{\top}$$, where the entries of $$\mathbb{X} \in \mathbb{R}^{n \times d}$$ are i.i.d. from a log-concave distribution. We prove an information theoretic phase transition: such matrices are close in total variation distance to the corresponding Gaussian ensemble if and only if $$d$$ is much larger than $$n^3$$. Our proof is entropy-based, making use of the chain rule for relative entropy along with the recursive structure in the definition of the Wishart ensemble. The proof crucially relies on the well known relation between Fisher information and entropy, a variational representation for Fisher information, concentration bounds for the spectral norm of a random matrix, and certain small ball probability estimates for log-concave measures.

1 Introduction

Let $$\mu$$ be a probability distribution supported on $$\mathbb{R}$$ with zero mean and unit variance. We consider a Wishart matrix (with removed diagonal) $$W = \left( \mathbb{X} \mathbb{X}^{\top} - \mathrm{diag}(\mathbb{X} \mathbb{X}^{\top}) \right) / \sqrt{d}$$, where $$\mathbb{X}$$ is an $$n \times d$$ random matrix with i.i.d. entries from $$\mu$$. The distribution of $$W$$, which we denote $$\mathcal{W}_{n,d}(\mu)$$, is of importance in many areas of mathematics. Perhaps most prominently it arises in statistics as the distribution of covariance matrices, and in this case $$n$$ can be thought of as the number of parameters and $$d$$ as the sample size.
Another application is in the theory of random graphs, where the thresholded matrix $$A_{i,j} = \mathbb{1}\{W_{i,j} >\tau\}$$ is the adjacency matrix of a random geometric graph on $$n$$ vertices, where each vertex is associated to a latent feature vector in $$\mathbb{R}^d$$ (namely the $$i^{th}$$ row of $$\mathbb{X}$$), and an edge is present between two vertices if the correlation between the underlying features is large enough. Wishart matrices also appear in physics, as a simple model of a random mixed quantum state where $$n$$ and $$d$$ are the dimensions of the observable and unobservable states, respectively. The measure $$\mathcal{W}_{n,d}(\mu)$$ becomes approximately Gaussian when $$d$$ goes to infinity and $$n$$ remains bounded (see Section 1.1). Thus in the classical regime of statistics where the sample size is much larger than the number of parameters one can use the well understood theory of Gaussian matrices to study the properties of $$\mathcal{W}_{n,d}(\mu)$$. In this article, we investigate the extent to which this Gaussian picture remains relevant in the high-dimensional regime, where the matrix size $$n$$ also goes to infinity. Our main result, stated informally, is the following universality of a critical dimension for sufficiently smooth measures $$\mu$$ (namely log-concave): the Wishart measure $$\mathcal{W}_{n,d}(\mu)$$ becomes approximately Gaussian if and only if $$d$$ is much larger than $$n^3$$. From a statistical perspective this means that analyses based on Gaussian approximation of a Wishart are valid as long as the number of samples is at least the cube of the number of parameters. In the random graph setting this gives a dimension barrier to the extraction of geometric information from a network, as our result shows that all geometry is lost when the dimension of the latent feature space is larger than the cube of the number of vertices. 
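As an illustration of the objects introduced above, the following minimal sketch draws one sample of $$W$$ and thresholds it into the adjacency matrix of the random geometric graph. The unit-variance Laplace distribution is used as one example of a log-concave $$\mu$$; the function names, the seed, and the values of $$n$$, $$d$$ and $$\tau$$ are assumptions made for the example, not choices taken from the article.

```python
import numpy as np

def sample_wishart_offdiag(n, d, sampler, rng):
    """Draw W = (X X^T - diag(X X^T)) / sqrt(d), X being n x d with i.i.d. entries."""
    X = sampler(rng, (n, d))
    G = X @ X.T
    G = (G + G.T) / 2.0          # remove floating-point asymmetry
    np.fill_diagonal(G, 0.0)     # the diagonal is removed by definition
    return G / np.sqrt(d)

def geometric_graph(W, tau):
    """Adjacency matrix A_ij = 1{W_ij > tau}, without self-loops."""
    A = (W > tau).astype(int)
    np.fill_diagonal(A, 0)
    return A

def laplace_unit_variance(rng, size):
    # A Laplace law with scale b has variance 2 b^2, so b = 1/sqrt(2) gives unit variance.
    return rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2.0), size=size)

rng = np.random.default_rng(0)   # hypothetical seed
W = sample_wishart_offdiag(n=5, d=2000, sampler=laplace_unit_variance, rng=rng)
A = geometric_graph(W, tau=0.0)
print(np.round(W, 2))
print(A)
```

Each row of the matrix `X` inside `sample_wishart_offdiag` plays the role of the latent feature vector of one vertex.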
1.1 Main result

Writing $$X_i \in \mathbb{R}^d$$ for the $$i^{th}$$ row of $$\mathbb{X}$$ one has for $$i \neq j$$, $$W_{i,j} = \frac{1}{\sqrt{d}} \langle X_i, X_j \rangle$$. In particular $$\mathbb{E} W _{i,j} = 0$$ and $$\mathbb{E} W_{i,j} W_{\ell, k} = \mathbb{1}\{(i, j) = (\ell, k) \ \text{and} \ i \neq j\}.$$ Thus for fixed $$n$$, by the multivariate central limit theorem one has, as $$d$$ goes to infinity, $$\mathcal{W}_{n,d}(\mu) \xrightarrow{\mathcal{D}} \mathcal{G}_n,$$ where $$\mathcal{G}_n$$ is the distribution of an $$n \times n$$ Wigner matrix with null diagonal and standard Gaussian entries off diagonal (recall that a Wigner matrix is symmetric and the entries above the main diagonal are i.i.d.). Recall that the total variation distance between two measures $$\lambda, \nu$$ is defined as $$\mathrm{TV}(\lambda, \nu) = \sup_A |\lambda(A) - \nu(A)|$$ where the supremum is over all measurable sets $$A$$. Our main result is the following:

Theorem 1. Assuming that $$\mu$$ is log-concave and $$d/ (n^3 \log^2(d)) \rightarrow +\infty$$, one has $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n) \rightarrow 0.$$ (1) □

(Recall that a measure $$\mu$$ with density $$f$$ is said to be log-concave if $$f(\cdot)=e^{-\phi(\cdot)}$$ for some convex function $$\phi.$$) Observe that for (1) to be true one needs some kind of smoothness assumption on $$\mu$$. Indeed if $$\mu$$ is purely atomic then so is $$\mathcal{W}_{n,d}(\mu)$$, and thus its total variation distance to $$\mathcal{G}_n$$ is $$1$$. We also remark that Theorem 1 is tight up to the logarithmic factor in the sense that if $$d/ n^3 \rightarrow 0$$, then $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n) \rightarrow 1,$$ (2) see Section 1.2 below for more details on this result. Finally, our proof in fact gives the following quantitative version of (1):

Theorem 2. There exists a universal constant $$C>1$$ such that for $$d \geq C n^2,$$ $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n)^2 \leq C \left( \frac{n^3 \log^2(d) + n^2 \log^4(d)}{d} + \sqrt{\frac{n^3}{d}} \right).$$
□

1.2 Related work and ideas of proof

In the case where $$\mu$$ is a standard Gaussian, Theorem 1 (without the logarithmic factor) was recently proven simultaneously and independently in [8, 15]. We also observe that prior to these results, certain properties of a Gaussian Wishart were already known to behave as those of a Gaussian matrix, even for values of $$d$$ much smaller than $$n^3$$; see for example [16] for the largest eigenvalue at $$d \approx n$$, and [4] on whether the quantum state represented by the Wishart is separable at $$d \approx n^{3/2}$$. The proof of Theorem 1 for the Gaussian case is simpler as both measures have a known density with a rather simple form, and one can then explicitly compute the total variation distance as the $$L_1$$ distance between the densities.

We now discuss how to lower bound $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n).$$ Bubeck et al. [8] implicitly prove (2) when $$\mu$$ is Gaussian. Taking inspiration from this, one can show that in the regime $$d/ n^3 \rightarrow 0$$, for any $$\mu$$ (zero mean, unit variance and finite fourth moment; log-concavity implies exponential tails and hence the existence of all moments, see (11)), one can distinguish $$\mathcal{G}_n$$ and $$\mathcal{W}_{n,d}(\mu)$$ by considering the statistic $$A \in \mathbb{R}^{n \times n} \mapsto \mathrm{Tr}(A^3)$$. Indeed it turns out that the means of $$\mathrm{Tr}(A^3)$$ under the two measures are zero and $$\Theta (\frac{n^3}{\sqrt{d}}),$$ respectively, whereas the variances are $$\Theta (n^3)$$ and $$\Theta (n^{3} + \frac{n^5}{{d}^2})$$. Since $$d=o(n^3)$$ implies $$\sqrt{n^{3}+\frac{n^5}{{d}^2}}=o(\frac{n^3}{\sqrt{d}}),$$ (2) follows by a simple application of Chebyshev’s inequality. We omit the details and refer the interested reader to [8].

Proving normal approximation results without the assumption of independence is a natural question and has been a subject of intense study over many years.
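The separation provided by the statistic $$\mathrm{Tr}(A^3)$$ discussed above can be explored numerically. The sketch below is only an illustration under stated assumptions: it uses Gaussian entries, the arbitrary sizes $$n = d = 30$$, and 200 Monte Carlo repetitions, and the helper names are hypothetical.

```python
import numpy as np

def trace_cubed(M):
    """Tr(M^3), computed as sum_{i,j} (M^2)_{ij} M_{ji}."""
    return float(np.sum((M @ M) * M.T))

def sample_wishart(n, d, rng):
    """Wishart matrix with removed diagonal, Gaussian entries (assumption of this sketch)."""
    X = rng.standard_normal((n, d))
    G = X @ X.T
    np.fill_diagonal(G, 0.0)
    return G / np.sqrt(d)

def sample_wigner(n, rng):
    """Symmetric matrix with null diagonal and i.i.d. N(0,1) entries above the diagonal."""
    upper = np.triu(rng.standard_normal((n, n)), k=1)
    return upper + upper.T

rng = np.random.default_rng(1)   # hypothetical seed
n, d, trials = 30, 30, 200       # d is far below n^3 here
wishart_stats = np.array([trace_cubed(sample_wishart(n, d, rng)) for _ in range(trials)])
wigner_stats = np.array([trace_cubed(sample_wigner(n, rng)) for _ in range(trials)])

print("Wishart: mean", wishart_stats.mean(), "std", wishart_stats.std())
print("Wigner : mean", wigner_stats.mean(), "std", wigner_stats.std())
```

In this regime the empirical mean under the Wishart ensemble should sit well above the Wigner one, which is centered near zero; increasing `d` towards and beyond `n**3` should shrink the gap relative to the standard deviations.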
One method that has found several applications in such settings is the so-called Stein’s method of exchangeable pairs. Since Stein’s original work (see [22]) the method has been considerably generalized to prove error bounds on convergence to the Gaussian distribution in various situations. The multidimensional case was treated first in [10]. For several applications of Stein’s method in proving central limit theorems (CLT) see [9] and the references therein. In our setting, note that $$W = \sum_{i=1}^d \left( \mathbb{X}_i \mathbb{X}_i^{\top} - \mathrm{diag}(\mathbb{X}_i \mathbb{X}_i^{\top}) \right) / \sqrt{d},$$ where the $$\mathbb{X}_i$$ are i.i.d. vectors in $$\mathbb{R}^n$$ whose coordinates are i.i.d. samples from a one-dimensional measure $$\mu.$$ Considering $$\mathbb{Y}_i=\mathbb{X}_i \mathbb{X}_i^{\top} - \mathrm{diag}(\mathbb{X}_i \mathbb{X}_i^{\top})$$ as a vector in $$\mathbb{R}^{n^2}$$ and noting that $$|\mathbb{Y}_{i}|^3 \sim n^3,$$ a straightforward application of Stein’s method using exchangeable pairs (see the proof of [10, Theorem 7]) provides the following suboptimal bound: the Wishart ensemble converges to the Gaussian ensemble (convergence of integrals against ‘smooth’ enough test functions) when $$d \gg n^6.$$ Whether there is a way to use Stein’s method to recover Theorem 1 in any reasonable metric (total variation metric, Wasserstein metric, etc.) remains an open problem (see Section 6 for more on this).

Our approach to proving (1) is information theoretic and hence completely different from [8, 15] (this is a necessity since for a general $$\mu$$ there is no simple expression for the density of $$\mathcal{W}_{n,d}(\mu)$$). The first step in our proof, described in Section 2, is to use Pinsker’s inequality to change the focus from total variation distance to the relative entropy (see also Section 2 for definitions). Together with the chain rule for relative entropy this allows us to bound the relative entropy of $$\mathcal{W}_{n,d}(\mu)$$ with respect to $$\mathcal{G}_n$$ by induction on the dimension $$n$$.
The base case essentially follows from the work of [3], who proved that the relative entropy between the standard one-dimensional Gaussian and $$\frac{1}{\sqrt{d}} \sum_{i=1}^d x_i$$, where $$x_1, \ldots, x_d \in \mathbb{R}$$ is an i.i.d. sequence from a log-concave measure $$\mu$$, goes to $$0$$ at a rate $$1/d$$. One of the main technical contributions of our work is a certain generalization of the latter result to higher dimensions, see Theorem 3 in Section 3. Recently [5] also studied a high dimensional generalization of the result in [6] (which contains the key elements for the proof in [3]) but it seems that Theorem 3 is not comparable to the main theorem in [5].

Another important part of the induction argument, which is carried out in Section 4, relies on controlling from above the expectation of $$-\mathrm{logdet}(\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})$$, which should be understood as the relative entropy between a centered Gaussian with covariance given by $$\frac{1}{d} \mathbb{X} \mathbb{X}^{\top}$$ and a standard Gaussian in $$\mathbb{R}^n$$. This leads us to study the probability that $$\mathbb{X} \mathbb{X}^{\top}$$ is close to being non-invertible. Denoting by $$s_{\mathrm{min}}$$ the smallest singular value of $$\mathbb{X}$$, it suffices to prove a ‘good enough’ upper bound for $$\mathbb{P}(s_{\mathrm{min}}(\mathbb{X}^{\top}) \leq \epsilon)$$ for all small $$\epsilon$$. The case when the entries of $$\mathbb{X}$$ are Gaussian allows one to work with exact formulas and was studied in [12, 21]. The last few years have seen tremendous progress in understanding the universality of the tail behavior of extreme singular values of random matrices with i.i.d. entries from general distributions. See [20] and the references therein for a detailed account of these results.
Such estimates are quite delicate, and it is worthwhile to mention that the following estimate was proved only recently in [19]: let $$A \in \mathbb{R}^{n \times d}$$ (with $$d\ge n$$) be a rectangular matrix with i.i.d. subgaussian entries; then for all $$\epsilon >0,$$ $$\mathbb{P}\left(s_{\mathrm{min}}(A^{\top}) \leq \epsilon (\sqrt{d} - \sqrt{n-1})\right) \leq (C \epsilon)^{d-n+1} + c^d,$$ where $$c,C$$ are independent of $$n,d$$. In full generality, such estimates are essentially sharp since in the case where the entries are random signs, $$s_{\mathrm{min}}$$ is zero with probability $$c^d$$. Unfortunately this type of bound is not useful for us, as we need to control $$\mathbb{P}(s_{\mathrm{min}}(\mathbb{X}^{\top}) \leq \epsilon)$$ for arbitrarily small scales $$\epsilon$$ (indeed $$\mathrm{logdet}(\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})$$ would blow up if $$s_{\mathrm{min}}$$ can be zero with non-zero probability). It turns out that the assumption of log-concavity of the distribution allows us to do that. To this end we use recent advances in [18] on small ball probability estimates for such distributions: let $$Y \in \mathbb{R}^n$$ be an isotropic centered log-concave random variable, and $$\epsilon \in (0,1/10)$$, then one has $$\mathbb{P}(|Y| \leq \epsilon \sqrt{n}) \leq (C \epsilon)^{\sqrt{n}}$$. This together with an $$\epsilon$$-net argument gives us the required control on $$\mathbb{P}(s_{\mathrm{min}}(\mathbb{X}^{\top}) \leq \epsilon)$$. We conclude the article with several open problems in Section 6.

2 An induction proof via the chain rule for relative entropy

Recall that the (differential) entropy of a measure $$\lambda$$ with a density $$f$$ (all densities are understood with respect to the Lebesgue measure unless stated otherwise) is defined as: $$\mathrm{Ent}(\lambda) = \mathrm{Ent}(f) = - \int f(x) \log f(x) \, {\rm d}x.$$ The relative entropy of a measure $$\lambda$$ (with density $$f$$) with respect to a measure $$\nu$$ (with density $$g$$) is defined as $$\mathrm{Ent}(\lambda \Vert \nu) = \int f(x) \log \frac{f(x)}{g(x)} \, {\rm d}x.$$
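For two centered Gaussians the relative entropy just defined has a standard closed form, which makes concrete why $$\mathrm{logdet}(\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})$$ enters the argument. The sketch below evaluates $$\mathrm{Ent}(N(0,\Sigma) \Vert \gamma_n) = \frac{1}{2}\left(\mathrm{Tr}(\Sigma) - n - \mathrm{logdet}(\Sigma)\right)$$ for an empirical covariance; the function name, the Gaussian entries, the seed and the sizes are assumptions of the example.

```python
import numpy as np

def gaussian_relative_entropy(Sigma):
    """Ent(N(0, Sigma) || N(0, I_n)) = (Tr(Sigma) - n - logdet(Sigma)) / 2."""
    n = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    return 0.5 * (np.trace(Sigma) - n - logdet)

rng = np.random.default_rng(2)   # hypothetical seed
n, d = 4, 400
X = rng.standard_normal((n, d))  # Gaussian entries, chosen only for illustration
Sigma = X @ X.T / d              # empirical covariance (1/d) X X^T
kl = gaussian_relative_entropy(Sigma)
print("relative entropy to the standard Gaussian:", kl)
```

The quantity is nonnegative and vanishes only when $$\Sigma = \mathrm{I}_n$$; it becomes large when $$\Sigma$$ is close to singular, which is the situation the small ball estimates are meant to rule out.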
With a slight abuse of notation we sometimes write $$\mathrm{Ent}(Y \Vert \nu)$$, where $$Y$$ is a random variable distributed according to some distribution $$\lambda$$. Pinsker’s inequality gives: $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n)^2 \leq \frac{1}{2} \mathrm{Ent}(\mathcal{W}_{n,d}(\mu) \Vert \mathcal{G}_n).$$ Next, recall that the chain rule for relative entropy states that for any random variables $$Y_1, Y_2, Z_1, Z_2$$, $$\mathrm{Ent}((Y_1, Y_2) \Vert (Z_1, Z_2)) = \mathrm{Ent}(Y_1 \Vert Z_1) + \mathbb{E}_{y \sim \lambda_1} \mathrm{Ent}(Y_2 \vert Y_1 = y \Vert Z_2 \vert Z_1 = y),$$ where $$\lambda_1$$ is the (marginal) distribution of $$Y_1$$, and $$Y_2 \vert Y_1=y$$ is used to denote the distribution of $$Y_2$$ conditionally on the event $$Y_1 = y$$ (and similarly for $$Z_2 \vert Z_1=y$$). Also observe that a sample from $$\mathcal{W}_{n+1,d}(\mu)$$ can be obtained by adjoining to $$\left( \mathbb{X} \mathbb{X}^{\top} - \mathrm{diag}(\mathbb{X} \mathbb{X}^{\top}) \right) / \sqrt{d}$$ (whose distribution is $$\mathcal{W}_{n,d}(\mu)$$) the column vector $$\mathbb{X} X /\sqrt{d}$$ (and the row vector $$(\mathbb{X} X)^{\top} /\sqrt{d}$$) where $$X \in \mathbb{R}^d$$ has i.i.d. entries from $$\mu$$. Thus denoting $$\gamma_n$$ for the standard Gaussian measure in $$\mathbb{R}^n$$ we obtain for all $$n\ge 1,$$ $$\mathrm{Ent}(\mathcal{W}_{n+1,d}(\mu) \Vert \mathcal{G}_{n+1}) = \mathrm{Ent}(\mathcal{W}_{n,d}(\mu) \Vert \mathcal{G}_{n}) + \mathbb{E}_{\mathbb{X}} \ \mathrm{Ent}(\mathbb{X} X / \sqrt{d} \ \vert \ \mathbb{X} \mathbb{X}^{\top} \Vert \gamma_n).$$ (3) By convexity of the relative entropy (see e.g., [11]) one also has: $$\mathbb{E}_{\mathbb{X}} \ \mathrm{Ent}(\mathbb{X} X / \sqrt{d} \ \vert \ \mathbb{X} \mathbb{X}^{\top} \Vert \gamma_n) \leq \mathbb{E}_{\mathbb{X}} \ \mathrm{Ent}(\mathbb{X} X / \sqrt{d} \ \vert \ \mathbb{X} \Vert \gamma_n).$$ (4) Also, since by definition both $$\mathcal{W}_{1,d}(\mu)$$ and $$\mathcal{G}_{1}$$ are zero, $$\mathrm{Ent}(\mathcal{W}_{1,d}(\mu) \Vert \mathcal{G}_{1})=0$$ as well. Next we need a simple lemma to rewrite the right hand side of (4):

Lemma 1. Let $$A \in \mathbb{R}^{n \times d}$$ and $$Q \in \mathbb{R}^{n \times n}$$ be such that $$Q A A^{\top} Q^{\top} = \mathrm{I}_n$$. Then one has for any isotropic random variable $$X \in \mathbb{R}^d$$, $$\mathrm{Ent}(A X \Vert \gamma_n) = \mathrm{Ent}(Q A X \Vert \gamma_n) + \frac{1}{2} \mathrm{Tr}(A A^{\top}) - \frac{n}{2} + \mathrm{logdet}(Q).$$ □

Proof.
Denote $$\Phi_{\Sigma}$$ for the density of a centered $$\mathbb{R}^n$$-valued Gaussian with covariance matrix $$\Sigma$$ (i.e., $$\Phi_{\Sigma}(x) = \frac{1}{\sqrt{(2 \pi)^n \mathrm{det}(\Sigma)}} \exp(- \frac{1}{2} x^{\top} \Sigma^{-1} x )$$), and let $$G \sim \gamma_n$$. Also let $$f$$ be the density of $$Q A X$$. Then one has (the first equality is a simple change of variables): $$\mathrm{Ent}(A X \Vert G) = \mathrm{Ent}(Q A X \Vert Q G) = \int f(x) \log \left( \frac{f(x)}{\Phi_{Q Q^{\top}}(x)} \right) {\rm d}x$$ $$= \int f(x) \log \left( \frac{f(x)}{\Phi_{\mathrm{I}_n}(x)} \right) {\rm d}x + \int f(x) \log \left( \frac{\Phi_{\mathrm{I}_n}(x)}{\Phi_{Q Q^{\top}}(x)} \right) {\rm d}x$$ $$= \mathrm{Ent}(Q A X \Vert G) + \int f(x) \left( \frac{1}{2} x^{\top} (Q Q^{\top})^{-1} x - \frac{1}{2} x^{\top} x + \frac{1}{2} \mathrm{logdet}(Q Q^{\top}) \right) {\rm d}x$$ $$= \mathrm{Ent}(Q A X \Vert G) + \frac{1}{2} \mathrm{Tr}\left( (Q Q^{\top})^{-1} \right) - \frac{n}{2} + \mathrm{logdet}(Q),$$ where for the last equality we used the fact that $$Q A X$$ is isotropic, that is $$\int f(x) x x^{\top} {\rm d}x = \mathrm{I}_n$$, and $$\mathrm{det}(QQ^{\top})=\mathrm{det}(Q)^2$$. Finally, it only remains to observe that $$\mathrm{Tr}\left( (Q Q^{\top})^{-1} \right) = \mathrm{Tr}(A A^{\top})$$. ■

Combining (3) and (4) with Lemma 1 (one can take $$Q = (\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})^{-1/2}$$), and using that $$\mathbb{E} \ \mathrm{Tr}(\mathbb{X} \mathbb{X}^{\top}) = nd$$, one obtains $$\mathrm{Ent}(\mathcal{W}_{n+1,d}(\mu) \Vert \mathcal{G}_{n+1}) \leq \mathrm{Ent}(\mathcal{W}_{n,d}(\mu) \Vert \mathcal{G}_{n}) + \mathbb{E}_{\mathbb{X}} \ \mathrm{Ent}\left( (\mathbb{X} \mathbb{X}^{\top})^{-1/2} \mathbb{X} X \ \vert \ \mathbb{X} \Vert \gamma_n \right) - \frac{1}{2} \mathbb{E}_{\mathbb{X}} \ \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right).$$ (5) In Section 3, we show how to bound the term $$\mathrm{Ent}(A X \Vert \gamma_n),$$ where $$A \in \mathbb{R}^{n \times d}$$ has orthonormal rows (i.e., $$A A^{\top} = \mathrm{I}_n$$), thereby proving a central limit theorem. In Section 4, we deal with the term $$\mathbb{E}_{\mathbb{X}} \ \mathrm{logdet} (\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})$$. The proof of Theorem 2, and hence Theorem 1, then follows by iterating (5) and using the results of these sections.

3 A high dimensional entropic CLT

The main goal of this section is to prove the following high dimensional generalization of the entropic CLT of [3].

Theorem 3. Let $$Y \in \mathbb{R}^d$$ be a random vector with i.i.d.
entries from a distribution $$\nu$$ with zero mean, unit variance, and spectral gap $$c\in (0,1]$$ (a probability measure $$\mu$$ is said to have spectral gap $$c$$ if for all smooth functions $$g$$ with $$\mathbb{E}_{\mu}(g)=0,$$ we have $$\mathbb{E}_{\mu}(g^2) \le \frac{1}{c}\mathbb{E}_{\mu}(g'^2)$$). Let $$A \in \mathbb{R}^{n \times d}$$ be a matrix such that $$A A^{\top} = \mathrm{I}_n$$. Let $$\epsilon = \max_{i\in [d]} (A^{\top} A)_{i,i}$$ and $$\zeta = \max_{i,j \in [d], i \neq j} |(A^{\top} A)_{i,j}|$$. Then one has, $$\mathrm{Ent}(A Y \Vert \gamma_n) \leq n \min\left( 2 (\epsilon + \zeta^2 d) / c, 1 \right) \mathrm{Ent}(\nu \Vert \gamma_1).$$ □

The assumption $$A A^{\top} = \mathrm{I}_n$$ means that the rows of $$A$$ form an orthonormal system. In particular if $$A$$ is built by picking rows one after the other uniformly at random on the Euclidean sphere in $$\mathbb{R}^d$$, conditionally on being orthogonal to the previous rows, then one expects that $$\epsilon \simeq n / d$$ and $$\zeta \simeq \sqrt{n} / d$$. Theorem 3 then yields $$\mathrm{Ent}( A Y \Vert \gamma_n) \lesssim n^2 / d$$. Thus we already see the term $$n^3 / d$$ from Theorem 1 appear, as we will sum the latter bound over the $$n$$ rounds of induction (see Section 2). For the special case $$n=1$$, Theorem 3 is slightly weaker than the result of [3], in which the $$\ell_4$$-norm of $$A$$ appears. Sections 3.1 and 3.2 are dedicated to the proof of Theorem 3. Then in Section 3.3, we show how to apply this result to bound the term $$\mathbb{E}_{\mathbb{X}} \ \mathrm{Ent}(Q \mathbb{X} X / \sqrt{d} \ \vert \ \mathbb{X} \Vert \gamma_{n})$$ from Section 2.
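The quantities $$\epsilon$$ and $$\zeta$$ of Theorem 3 are easy to compute for a given $$A$$. The sketch below draws a matrix with orthonormal rows through a QR factorization of a Gaussian matrix (one convenient construction, assumed here in place of the row-by-row sampling described above) and evaluates the factor $$n \min(2(\epsilon + \zeta^2 d)/c, 1)$$. The function names, the sizes and the value $$c = 1$$ are illustrative assumptions.

```python
import numpy as np

def random_orthonormal_rows(n, d, rng):
    """Return an n x d matrix A with A A^T = I_n, via QR of a Gaussian d x n matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, n)))  # Q is d x n, orthonormal columns
    return Q.T

def epsilon_zeta(A):
    """epsilon = max_i B_ii and zeta = max_{i != j} |B_ij| for B = A^T A."""
    B = A.T @ A
    eps = float(np.max(np.diag(B)))
    off_diagonal = np.abs(B - np.diag(np.diag(B)))
    return eps, float(np.max(off_diagonal))

def theorem3_factor(A, c):
    """The factor n * min(2 (eps + zeta^2 d) / c, 1) in front of Ent(nu || gamma_1)."""
    n, d = A.shape
    eps, zeta = epsilon_zeta(A)
    return n * min(2.0 * (eps + zeta**2 * d) / c, 1.0)

rng = np.random.default_rng(3)   # hypothetical seed
n, d = 5, 1500
A = random_orthonormal_rows(n, d, rng)
eps, zeta = epsilon_zeta(A)
factor = theorem3_factor(A, c=1.0)
print("eps:", eps, " n/d:", n / d)
print("zeta:", zeta, " sqrt(n)/d:", np.sqrt(n) / d)
print("factor:", factor)
```

Since $$\epsilon$$ and $$\zeta$$ are maxima over $$d$$ and $$d^2$$ entries respectively, the printed values can exceed the heuristic scales $$n/d$$ and $$\sqrt{n}/d$$ by logarithmic factors.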
3.1 From entropy to Fisher information

For a density function $$w : \mathbb{R}^n \rightarrow \mathbb{R}_+$$, let $$J(w) := \int_{\mathbb{R}^n} \frac{|\nabla w(x)|^2}{w(x)} {\rm d}x$$ denote its Fisher information (where $$\nabla w(\cdot)$$ denotes the gradient vector of $$w$$ and $$|\cdot|$$ denotes the Euclidean norm), and $$I(w) := \int \frac{\nabla w(x) \nabla w(x)^{\top}}{w(x)} {\rm d}x,$$ the Fisher information matrix (if $$\nu$$ denotes the measure whose density is $$w$$, we may also write $$J(\nu)$$ instead of $$J(w)$$). We use $$P_t$$ to denote the Ornstein–Uhlenbeck semigroup, that is, for a random variable $$Z$$ with density $$g$$, we define $$P_t Z := \exp(-t) Z + \sqrt{1 - \exp(-2t)} \, G,$$ where $$G \sim \gamma_n$$ (the standard Gaussian in $$\mathbb{R}^n$$) is independent of $$Z$$; we denote by $$P_t g$$ the density of $$P_t Z$$. The de Bruijn identity states that the Fisher information is the time derivative of the entropy along the Ornstein–Uhlenbeck semigroup; more precisely one has for any centered and isotropic density $$w$$: $$\mathrm{Ent}(w \Vert \gamma_n) = \mathrm{Ent}(\gamma_n) - \mathrm{Ent}(w) = \int_0^{\infty} \left( J(P_t w) - n \right) {\rm d}t,$$ (the first equality is a simple consequence of the form of the normal density). Our objective is to prove a bound of the form (for some constant $$C$$ depending on $$A$$) $$\mathrm{Ent}(A Y \Vert \gamma_n) \leq C \ \mathrm{Ent}(\nu \Vert \gamma_1),$$ (6) and thus given the above identity it suffices to show that for any $$t > 0$$, $$J(h_t) - n \leq C \ (J(\nu_t) - 1),$$ (7) where $$h_t$$ is the density of $$P_t A Y$$ (which is equal to the density of $$A P_t Y$$) and $$\nu_t$$ is such that $$P_t Y$$ has distribution $$\nu_t^{\otimes d}$$. Furthermore if $$e_1, \ldots, e_n$$ denotes the canonical basis of $$\mathbb{R}^n$$, then to prove (7) it is enough to show that for any $$i \in [n]$$, $$e_i^{\top} I(h_t) e_i - 1 \leq C_i \ (J(\nu_t) - 1),$$ (8) where $$\sum_{i=1}^n C_i = C$$.
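To see the Ornstein–Uhlenbeck semigroup at work, the following sketch applies $$P_t$$ to samples and tracks how fast they become Gaussian. It uses the excess kurtosis as a crude, easily computed proxy for the distance to the Gaussian; this proxy, the unit-variance Laplace input, the seed and the chosen times are all assumptions of the example and play no role in the proof.

```python
import numpy as np

def ou_step(Z, t, rng):
    """Sample P_t Z = exp(-t) Z + sqrt(1 - exp(-2t)) G, with G standard Gaussian independent of Z."""
    G = rng.standard_normal(Z.shape)
    return np.exp(-t) * Z + np.sqrt(1.0 - np.exp(-2.0 * t)) * G

rng = np.random.default_rng(5)   # hypothetical seed
# Unit-variance Laplace samples stand in for a log-concave nu (assumption of this sketch).
Z = rng.laplace(scale=1.0 / np.sqrt(2.0), size=200_000)

variance = {}
excess_kurtosis = {}
for t in (0.0, 0.5, 2.0):
    Zt = ou_step(Z, t, rng)
    variance[t] = float(np.mean(Zt**2))
    # Zero for a Gaussian; the samples are centered, so raw moments are used.
    excess_kurtosis[t] = float(np.mean(Zt**4) / np.mean(Zt**2) ** 2 - 3.0)

print(variance)
print(excess_kurtosis)
```

The variance stays close to one for every $$t$$, reflecting that $$P_t$$ preserves isotropy, while the non-Gaussianity proxy decays as $$t$$ grows.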
Recall that $$c$$ is the spectral gap of $$\nu.$$ We will show that one can take $$C_i = 1 - \frac{c U_i^2}{c W_i + 2 V_i},$$ where we denote $$B= A^{\top} A \in \mathbb{R}^{d \times d}$$, and $$U_i = \sum_{j=1}^d A_{i,j}^2 (1 - B_{j,j}), \quad W_i = \sum_{j=1}^d A_{i,j}^2 (1 - B_{j,j})^2, \quad V_i = \sum_{j,k \in [d], k \neq j} (A_{i,j} B_{j,k})^2.$$ Straightforward calculations (using that $$U_i \geq 1- \epsilon$$, $$W_i \leq 1$$, and $$V_i \leq \zeta^2 d$$) show that one has $$\sum_{i=1}^n \left( 1 - \frac{c U_i^2}{c W_i + 2 V_i} \right) \leq 2 n (\epsilon + \zeta^2 d) / c,$$ where $$\epsilon = \max_{i\in [d]} B_{i,i}$$ and $$\zeta = \max_{i,j \in [d], i \neq j} |B_{i,j}|$$, thus concluding the proof of Theorem 3. In the next subsection, we prove (8) for a given $$t>0$$ and $$i=1$$. We use the following well known but crucial fact: the spectral gap of $$\nu_t$$ is in $$[c,1]$$ (see [6, Proposition 1]). Denoting $$f$$ for the density of $$\nu_t$$, one has with $$\varphi = - \log f$$ that $$J:=J(\nu_t) = \int \varphi''(x) \, {\rm d}\nu_t(x)$$. The last equality easily follows from the fact that for any $$t > 0$$ one has $$\int f'' =0$$ (which itself follows from the smoothness of $$\nu_t$$ induced by the convolution of $$\nu$$ with a Gaussian).

3.2 Variational representation of Fisher information

Let $$Z \in \mathbb{R}^d$$ be a random variable with a twice continuously differentiable density $$w$$ such that $$\int \frac{|\nabla w|^2}{w} < \infty$$ and $$\int \Vert \nabla^2 w \Vert < \infty$$, and let $$h$$ be the density of $$A Z \in \mathbb{R}^n$$. Our main tool is a remarkable formula from [6], which states the following: for all $$e\in \mathbb{R}^n$$ and all sufficiently smooth maps $$p : \mathbb{R}^d \rightarrow \mathbb{R}^d$$ with $$A p(x) = e, \forall x \in \mathbb{R}^d$$, one has (with $$D p$$ denoting the Jacobian matrix of $$p$$), $$e^{\top} I(h) e \leq \int \left( \mathrm{Tr}(D p(x)^2) + p(x)^{\top} \nabla^2 (- \log w(x)) p(x) \right) w(x) \, {\rm d}x.$$ (9) For the sake of completeness, we include a short proof of this inequality in Section 5. Let $$(a_{1}, \ldots, a_{d})$$ be the first row of $$A$$.
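Following [3], to prove (7), we would like to use the above formula (the smoothness assumptions on $$w$$ are satisfied in our context since we consider a random variable convolved with a Gaussian) with $$p$$ of the form $$(a_{1} r(x_1), \ldots, a_{d} r(x_d))$$ for some map $$r : \mathbb{R} \rightarrow \mathbb{R}$$. Since we need to satisfy $$A p(x) = e_1$$, we adjust the formula accordingly and take $$p(x) = (\mathrm{I}_d - A^{\top} A) (a_1 r(x_1), \ldots, a_d r(x_d))^{\top} + A^{\top} e_1.$$ In particular we get, with $$B= A^{\top} A$$, $$p_i(x) = a_i + a_i (1 - B_{i,i}) r(x_i) - \sum_{j \in [d], j \neq i} B_{i,j} a_j r(x_j)$$ and $$\frac{\partial p_i}{\partial x_j}(x) = \begin{cases} a_i (1 - B_{i,i}) r'(x_i) & \text{if } i = j, \\ - B_{i,j} a_j r'(x_j) & \text{otherwise.} \end{cases}$$ Next recall that we apply (9) to prove (8) where $$w(x) = \prod_{i=1}^d f(x_i)$$, in which case we have (recall also the notation $$\varphi = - \log f$$): $$p(x)^{\top} \nabla^2 (- \log w(x)) p(x) = \sum_{i=1}^d p_i(x)^2 \varphi''(x_i) = \sum_{i=1}^d \varphi''(x_i) \left( a_i + a_i (1 - B_{i,i}) r(x_i) - \sum_{j \in [d], j \neq i} B_{i,j} a_j r(x_j) \right)^2.$$ We also have $$\mathrm{Tr}(D p(x)^2) = \sum_{i=1}^d a_i^2 (1 - B_{i,i})^2 r'(x_i)^2 + \sum_{i,j \in [d], i \neq j} B_{i,j}^2 a_i a_j r'(x_i) r'(x_j).$$ Putting the above together, we obtain (with a slightly lengthy but straightforward computation) that $$e_1^{\top} I(h) e_1$$ is upper bounded by (recall also that $$\sum_i a_i^2 =1$$ and $$\sum_{j} B_{i,j} a_j = a_i$$ since $$B A^{\top} = A^{\top}$$) $$J + W \left( \int f (r')^2 + \int f \varphi'' r^2 \right) + J V \int f r^2 + J (W - V) \left( \int f r \right)^2 + 2 U \left( \int f \varphi'' r - J \int f r \right) - 2 W \left( \int f r \right) \left( \int f \varphi'' r \right) + M \left( \int f r' \right)^2$$ (10) where $$U = \sum_{i=1}^d a_i^2 (1 - B_{i,i}), \quad W = \sum_{i=1}^d a_i^2 (1 - B_{i,i})^2, \quad V = \sum_{i,j \in [d], i \neq j} (B_{i,j} a_j)^2, \quad M = \sum_{i,j \in [d], i \neq j} B_{i,j}^2 a_i a_j.$$ Observe that by the Cauchy–Schwarz inequality one has $$M \le V$$, and furthermore following [3] one also has with $$m=\int f r$$, $$\left( \int f r' \right)^2 = \left( \int f' (r - m) \right)^2 = \left( \int \frac{f'}{\sqrt{f}} \sqrt{f} (r - m) \right)^2 \leq J \left( \int f r^2 - m^2 \right).$$ Thus we get from (10) and the above observations that $$e_1^{\top} I(h_t) e_1 - J \le T(r)$$, where $$T(r) = W \left( \int f (r')^2 + \int f \varphi'' r^2 \right) + 2 J V \left( \int f r^2 \right) + J (W - 2V) \left( \int f r \right)^2 + 2 U \left( \int f \varphi'' r - J \int f r \right) - 2 W \left( \int f r \right) \left( \int f \varphi'' r \right),$$ which is the exact same quantity as the one obtained in [3]. The goal now is to optimize over $$r$$ to make this quantity as negative as possible.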
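The constraint $$A p(x) = e_1$$ imposed on the test map $$p$$ in the construction above can be checked numerically. In the sketch below, the helper name `build_p`, the choice $$r = \tanh$$, the seed and the sizes are arbitrary assumptions; only the formula for $$p$$ is taken from the text.

```python
import numpy as np

def build_p(A, r):
    """p(x) = (I_d - A^T A) (a_1 r(x_1), ..., a_d r(x_d))^T + A^T e_1, with a the first row of A."""
    n, d = A.shape
    a = A[0]
    e1 = np.zeros(n)
    e1[0] = 1.0
    P_perp = np.eye(d) - A.T @ A   # orthogonal projector onto the kernel of A

    def p(x):
        return P_perp @ (a * r(x)) + A.T @ e1

    return p

rng = np.random.default_rng(4)   # hypothetical seed
n, d = 3, 50
Q, _ = np.linalg.qr(rng.standard_normal((d, n)))
A = Q.T                          # orthonormal rows: A A^T = I_n
p = build_p(A, r=np.tanh)        # np.tanh is an arbitrary smooth choice of r
x = rng.standard_normal(d)
residual = A @ p(x) - np.array([1.0, 0.0, 0.0])
print("max |A p(x) - e_1| =", np.max(np.abs(residual)))
```

The residual vanishes up to rounding because $$A (\mathrm{I}_d - A^{\top} A) = 0$$ and $$A A^{\top} e_1 = e_1$$, for any choice of $$r$$.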
Solving the above optimization problem is exactly the content of [3, Section 2.4] and it yields the following bound: $$e_1^{\top} I(h_t) e_1 - 1 \leq \left[ 1 - \frac{c U^2}{c W + 2 V} \right] (J - 1),$$ which is exactly the claimed bound in (8).

3.3 Using Theorem 3

Throughout this section, we will assume $$d\ge n,$$ to have cleaner expressions for some of the error bounds. Given (5) we want to apply Theorem 3 with $$A = (\mathbb{X} \mathbb{X}^{\top})^{-1/2} \mathbb{X}$$ (also observe that the spectral gap assumption of Theorem 3 is satisfied since log-concavity and isotropy of $$\mu$$ imply that $$\mu$$ has a spectral gap in $$[1/12,1]$$, [7]). In particular, we have $$A^{\top} A = \mathbb{X}^{\top} (\mathbb{X} \mathbb{X}^{\top})^{-1} \mathbb{X}$$, and thus denoting $$\mathbb{X}_i \in \mathbb{R}^n$$ for the $$i^{th}$$ column of $$\mathbb{X}$$ one has for any $$i, j \in [d]$$, $$(A^{\top} A)_{i,j} = \mathbb{X}_i^{\top} (\mathbb{X} \mathbb{X}^{\top})^{-1} \mathbb{X}_j = \frac{1}{d} \mathbb{X}_i^{\top} \left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right)^{-1} \mathbb{X}_j.$$ In particular this yields: $$|(A^{\top} A)_{i,j}| \leq \frac{1}{d} |\mathbb{X}_i^{\top} \mathbb{X}_j| + \frac{1}{d} |\mathbb{X}_i| \cdot |\mathbb{X}_j| \cdot \left\Vert \left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right)^{-1} - \mathrm{I}_n \right\Vert,$$ where $$\Vert \cdot \Vert$$ denotes the operator norm. We now recall two important results on log-concave random vectors (more classical inequalities could also be used here since the entries of $$\mathbb{X}$$ are independent; this would slightly improve the logarithmic factors but it would obscure the main message of this section, so we decided to use the more general inequalities for log-concave vectors). First, Paouris’ inequality ([17], [14, Theorem 2]) states that for an isotropic, centered, log-concave random variable $$Y \in \mathbb{R}^n$$ one has for any $$t \geq C$$, $$\mathbb{P}\left( |Y| \geq (1+t) \sqrt{n} \right) \leq \exp(- c t \sqrt{n}),$$ (11) where $$c, C$$ are universal constants. We also need an inequality proved by Adamczak et al. [1, Theorem 4.1] which states that for a sequence $$Y_1, \ldots, Y_d \in \mathbb{R}^n$$ of i.i.d. copies of $$Y$$, one has for any $$t \geq 1$$ and $$\epsilon \in (0,1)$$, $$\mathbb{P}\left( \left\Vert \frac{1}{d} \sum_{i=1}^d Y_i Y_i^{\top} - \mathrm{I}_n \right\Vert > \epsilon \right) \leq \exp(- c t \sqrt{n}),$$ (12) provided that $$d \geq C \frac{t^4}{\epsilon^2} \log^2\left(2 \frac{t^2}{\epsilon^2} \right) n$$.
Paouris’ inequality (11) directly yields that for any $$i \in [d]$$, with probability at least $$1-\delta$$, one has $$|\mathbb{X}_i| \leq \sqrt{n} + \frac{1}{c} \log(1/\delta).$$ Furthermore, by a well known consequence of Prékopa-Leindler’s inequality, conditionally on $$\mathbb{X}_j$$ one has for $$i \neq j$$ that $$\mathbb{X}_i^{\top} \frac{\mathbb{X}_j}{|\mathbb{X}_j|}$$ is a centered, isotropic, log-concave random variable. In particular using (11) and independence of $$\mathbb{X}_i$$ and $$\mathbb{X}_j$$ one obtains that for $$i \neq j$$, with probability at least $$1-\delta$$, $$|\mathbb{X}_i^{\top} \mathbb{X}_j| \leq |\mathbb{X}_j| \left( 1 + \frac{1}{c} \log(1/\delta) \right).$$ To use (12) we plug in $$t=C \log(1/\delta)/\sqrt{n}$$ for a suitable constant $$C$$ so that $$\exp( - c t \sqrt{n} )$$ is at most $$\delta.$$ Thus $$\epsilon$$ needs to be such that $$d \geq C \frac{t^4}{\epsilon^2} \log^2\left(2 \frac{t^2}{\epsilon^2} \right) n.$$ Also without loss of generality, by possibly choosing the value of $$t$$ to be a constant times larger, we can assume that $$\frac{d}{t^2n}$$ lies outside a fixed interval containing $$1.$$ A suitable value of $$\epsilon$$ can now be seen from the following string of inequalities, in which we use the fact that $$x\log^2x$$ is increasing outside a neighborhood of $$1$$ (the value of the constant $$C$$ will change from line to line): $$\frac{d}{t^2 n} \geq C \frac{t^2}{\epsilon^2} \log^2\left( 2 \frac{t^2}{\epsilon^2} \right), \quad \text{if} \quad \frac{\epsilon^2}{t^2} \geq C \log^2\left( \frac{d}{t^2 n} \right) \frac{t^2 n}{d}, \quad \text{if} \quad \epsilon \geq C t^2 \sqrt{\frac{n}{d}} \left| \log\left( \frac{d}{t^2 n} \right) \right|.$$ Plugging in our choice of $$t=C \log(1/\delta)/\sqrt{n},$$ we see that any $$\epsilon \geq C \frac{1}{\sqrt{dn}} \log^2(1/\delta) \left[ \log(d) + \log \log(1/\delta) \right]$$ works. Thus with probability at least $$1-\delta$$, $$\left\Vert \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} - \mathrm{I}_n \right\Vert \leq C' \frac{1}{\sqrt{dn}} \log^2(1/\delta) \left[ \log(d) + \log \log(1/\delta) \right].$$ (13) If $$\Vert A - \mathrm{I}_n \Vert \leq \epsilon < 1$$ then $$\Vert A^{-1} - \mathrm{I}_n \Vert \leq \frac{\epsilon}{1-\epsilon}$$. From now on $$C$$ denotes a universal constant whose value can change at each occurrence.
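Putting together all of the above with a union bound, we obtain for $$d \geq C n^2$$ that with probability at least $$1-1/d$$, simultaneously for all $$i \neq j$$, $$|\mathbb{X}_i| \leq C (\sqrt{n} + \log(d)), \quad |\mathbb{X}_i^{\top} \mathbb{X}_j| \leq C (\sqrt{n} \log(d) + \log^2(d)), \quad \left\Vert \left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right)^{-1} - \mathrm{I}_n \right\Vert \leq C \frac{1}{\sqrt{n}},$$ where the last inequality follows from (13) by plugging in $$\delta=1/d$$ and using the fact that $$\log^4(d)=o(\sqrt{d}).$$ This yields (using the bounds above) that with probability at least $$1-\frac{1}{d}$$ simultaneously for all $$i \neq j$$, $$|(A^{\top} A)_{i,j}| \leq C \frac{\sqrt{n} \log(d) + \log^2(d)}{d},$$ and $$|(A^{\top} A)_{i,i}| \leq C \frac{n + \log^2(d)}{d}.$$ Thus denoting $$\epsilon = \max_{i\in [d]} (A^{\top} A)_{i,i}$$ and $$\zeta = \max_{i,j \in [d], i \neq j} |(A^{\top} A)_{i,j}|$$ one has: $$\mathbb{E} \min(\epsilon + \zeta^2 d, 1) \leq C \frac{n \log^2(d) + \log^4(d)}{d}.$$ By Theorem 3, this bounds one of the terms in the upper bound in (5). Thus to complete the proof of Theorem 2 all that is left to do is to bound the term $$\mathbb{E}_{\mathbb{X}}[ \ -\mathrm{logdet} (\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})]$$. That is the goal of the next section.

4 Small ball probability estimates

Lemma 2. There exists a universal constant $$C>0$$ such that for $$d \geq C n^2,$$ $$\mathbb{E} \left( - \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \right) \leq C \left( \sqrt{\frac{n}{d}} + \frac{n^2}{d} \right).$$ (14) □

Proof. We decompose this expectation on the event (and its complement) that the smallest eigenvalue $$\lambda_{\mathrm{min}}$$ of $$\frac{1}{d} \mathbb{X} \mathbb{X}^{\top}$$ is less than $$1/2$$. We first write, using $$-\log(x) \leq 1-x + 2(1-x)^2$$ for $$x \geq 1/2$$, $$\mathbb{E} \left( - \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \mathbb{1}\{\lambda_{\mathrm{min}} \geq 1/2\} \right) \leq \mathbb{E} \left( \left| \mathrm{Tr}\left( \mathrm{I}_n - \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \right| + 2 \left\Vert \mathrm{I}_n - \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right\Vert_{\mathrm{HS}}^2 \right),$$ where $$\Vert \cdot \Vert_{\mathrm{HS}}$$ denotes the Hilbert-Schmidt norm. In this section, denote by $$\zeta$$ the $$4^{th}$$ moment of $$\mu$$. Then one has (recall that $$X_i \in \mathbb{R}^d$$ denotes the $$i^{th}$$ row of $$\mathbb{X}$$), $$\mathbb{E} \left| \mathrm{Tr}\left( \mathrm{I}_n - \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \right| \leq \sqrt{ \mathbb{E} \left( \mathrm{Tr}\left( \mathrm{I}_n - \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \right)^2 } = \sqrt{ \mathbb{E} \left( \sum_{i=1}^n (1 - |X_i|^2 / d) \right)^2 } = \sqrt{ \frac{(\zeta - 1) n}{d} }.$$ Similarly one can easily check that, $$\mathbb{E} \left\Vert \mathrm{I}_n - \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right\Vert_{\mathrm{HS}}^2 = \left( \sum_{i,j=1}^n \frac{1}{d^2} \mathbb{E} \langle X_i, X_j \rangle^2 \right) - n = \frac{n^2 - n}{d} + \frac{n}{d} (\zeta - 1) \leq \frac{\zeta n^2}{d}.$$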
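The two second-moment identities used at the start of the proof of Lemma 2 can be checked by simulation. The sketch below does so for Gaussian entries, for which the fourth moment equals $$3$$; the function name, the seed, the sizes and the number of repetitions are assumptions chosen for illustration.

```python
import numpy as np

def lemma2_moments(n, d, trials, rng):
    """Monte Carlo estimates of E (Tr(I - X X^T/d))^2 and E ||I - X X^T/d||_HS^2."""
    X = rng.standard_normal((trials, n, d))      # one n x d matrix per trial
    M = X @ np.transpose(X, (0, 2, 1)) / d       # batched (1/d) X X^T
    D = np.eye(n) - M
    trace_sq = np.trace(D, axis1=1, axis2=2) ** 2
    hs_sq = np.sum(D**2, axis=(1, 2))
    return float(trace_sq.mean()), float(hs_sq.mean())

n, d = 4, 50
fourth_moment = 3.0                              # fourth moment of a standard Gaussian
rng = np.random.default_rng(6)                   # hypothetical seed
trace_sq_mc, hs_sq_mc = lemma2_moments(n, d, trials=20000, rng=rng)

trace_sq_exact = (fourth_moment - 1.0) * n / d
hs_sq_exact = (n**2 - n) / d + (n / d) * (fourth_moment - 1.0)
print("E Tr^2    : Monte Carlo", trace_sq_mc, " formula", trace_sq_exact)
print("E ||.||^2 : Monte Carlo", hs_sq_mc, " formula", hs_sq_exact)
```

Replacing the Gaussian sampler by another unit-variance law only changes the value of `fourth_moment` in the two formulas.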
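By log-concavity of $$\mu$$ one has $$\zeta \leq 70$$, and thus we proved (for some universal constant $$C>0$$): $$\mathbb{E} \left( - \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \mathbb{1}\{\lambda_{\mathrm{min}} \geq 1/2\} \right) \leq C \left( \sqrt{\frac{n}{d}} + \frac{n^2}{d} \right).$$ (15) We now take care of the integral on the event $$\{\lambda_{\mathrm{min}} < 1/2\}$$. First observe that for a large enough constant $$C>0,$$ (13) gives for $$d \geq C$$, $$\mathbb{P}(\lambda_{\mathrm{min}} < 1/2)\leq \exp(- d^{1/10}).$$ In particular, we have for any $$\xi \in (0,1)$$: $$\mathbb{E} \left( - \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \mathbb{1}\{\lambda_{\mathrm{min}} < 1/2\} \right) \leq n \mathbb{E} \left( - \log(\lambda_{\mathrm{min}}) \mathbb{1}\{\lambda_{\mathrm{min}} < 1/2\} \right) = n \int_{\log(2)}^{\infty} \mathbb{P}(- \log(\lambda_{\mathrm{min}}) \geq t) \, {\rm d}t$$ $$= n \int_0^{1/2} \frac{1}{s} \mathbb{P}(\lambda_{\mathrm{min}} < s) \, {\rm d}s \leq \frac{n}{\xi} \exp(- d^{1/10}) + n \int_0^{\xi} \frac{1}{s} \mathbb{P}(\lambda_{\mathrm{min}} < s) \, {\rm d}s.$$ (16) We will choose $$\xi$$ to be a suitable power of $$d$$ and the proof will be complete once we control $$\mathbb{P}(\lambda_{\mathrm{min}} < s)$$ for $$s\le \xi$$. This essentially boils down to the estimation of certain small ball probabilities. We proceed by bounding the maximum eigenvalue, $$\lambda_{\mathrm{max}},$$ using a standard net argument. For any $$\epsilon$$-net $$\mathcal{N}_{\epsilon}$$ on $$\mathbb{S}^{n-1},$$ $$\lambda_{\mathrm{max}} = \sup_{\theta \in \mathbb{S}^{n-1}} \theta^{\top} \frac{\mathbb{X} \mathbb{X}^{\top}}{d} \theta \leq \frac{1}{(1-\epsilon)^2} \sup_{\theta \in \mathcal{N}_{\epsilon}} \theta^{\top} \frac{\mathbb{X} \mathbb{X}^{\top}}{d} \theta.$$ Choosing $$\epsilon=1/2$$ gives $$|\mathcal{N}_{\epsilon}| \le 5^n.$$ Putting everything together along with the subexponential tail of isotropic log-concave random variables (see (11)) we get $$\mathbb{P}(\lambda_{\mathrm{max}} > M) \le 5^n \exp(-c \sqrt{M d})$$ (for more details see [20]). Similarly observe, $$\mathbb{P}(\lambda_{\mathrm{min}} < s) = \mathbb{P}\left( \exists \ \theta \in \mathbb{S}^{n-1} : \theta^{\top} \frac{\mathbb{X} \mathbb{X}^{\top}}{d} \theta < s \right) = \mathbb{P}\left( \exists \ \theta \in \mathbb{S}^{n-1} : |\mathbb{X}^{\top} \theta| < \sqrt{s d} \right).$$ Furthermore, if $$|\frac{1}{\sqrt{d}} \mathbb{X}^{\top} \theta| < \sqrt{s}$$ for some $$\theta \in \mathbb{S}^{n-1}$$, then one has for any $$\varphi \in \mathbb{S}^{n-1}$$, $$|\frac{1}{\sqrt{d}} \mathbb{X}^{\top} \varphi| < \sqrt{s} + \sqrt{\lambda_{\mathrm{max}}} |\theta - \varphi|$$. Thus we get by choosing $$\varphi$$ to be in an $$s$$-net $$\mathcal{N}_{s}$$: $$\mathbb{P}(\lambda_{\mathrm{min}} < s) \leq \left( \frac{3}{s} \right)^n \sup_{\varphi \in \mathcal{N}_s} \mathbb{P}\left( |\mathbb{X}^{\top} \varphi| < 2 \sqrt{s d} \right) + \mathbb{P}(\lambda_{\mathrm{max}} > 1/s).$$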
We now use the Paouris small ball probability bound [14, Theorem 2] (see also [18]), which states that for an isotropic centered log-concave random variable $$Y \in \mathbb{R}^d$$, and any $$\epsilon \in (0,1/10)$$, one has $$\mathbb{P}(|Y| \leq \epsilon \sqrt{d}) \leq (c \epsilon)^{\sqrt{d}},$$ for some universal constant $$c>0.$$ As $$\mathbb{X}^{\top} \varphi$$ is an isotropic, centered, log-concave random variable, we obtain for $$d \geq C n^2,$$ $$\mathbb{P}(\lambda_{\mathrm{min}} < s) \leq (c s)^{C \sqrt{d}} + \exp(- C / \sqrt{s}).$$ Finally plugging this back in (16) and choosing $$\xi$$ to be a suitable negative power of $$d,$$ we obtain for $$d \geq C n^2$$, $$\mathbb{E} \left( - \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \mathbb{1}\{\lambda_{\mathrm{min}} < 1/2\} \right) \leq n \exp(- d^{1/20}),$$ and thus together with (15) it yields (14). ■

5 Proof of (9)

Recall that $$Z \in \mathbb{R}^d$$ is a random variable with a twice continuously differentiable density $$w$$ such that $$\int \frac{|\nabla w|^2}{w} < \infty$$ and $$\int \Vert \nabla^2 w \Vert < \infty$$, $$h$$ is the density of $$A Z \in \mathbb{R}^n$$ (with $$A A^{\top} = \mathrm{I}_n$$), and also we fix $$e\in \mathbb{R}^n$$ and a sufficiently smooth map $$p : \mathbb{R}^d \rightarrow \mathbb{R}^d$$ with $$A p(x) = e, \forall x \in \mathbb{R}^d$$ (for instance it is enough that $$p$$ is twice continuously differentiable, and that the coordinate functions $$p_i$$ and their derivatives $$\frac{\partial p_i}{\partial x_i}$$, $$\frac{\partial p_i}{\partial x_j}$$, $$\frac{\partial^2 p_i}{\partial x_i \partial x_j}$$ are bounded). We want to prove: $$e^{\top} I(h) e \leq \int_{\mathbb{R}^d} \left( \mathrm{Tr}(D p^2) + p^{\top} \nabla^2 (- \log w) p \right) w.$$ (17) First we rewrite the right hand side in (17) as follows: $$\int_{\mathbb{R}^d} \left( \mathrm{Tr}(D p^2) + p^{\top} \nabla^2 (- \log w) p \right) w = \int_{\mathbb{R}^d} \frac{(\nabla \cdot (p w))^2}{w}.$$ The above identity is a straightforward calculation (with several applications of the one-dimensional integration by parts, which are justified by the assumptions on $$p$$ and $$w$$), see [6] for more details. Now we rewrite the left hand side of (17). Using the notation $$g_x$$ for the partial derivative of a function $$g$$ in the direction $$x$$, we have $$e^{\top} I(h) e = \int_{\mathbb{R}^n} \frac{h_e^2}{h}.$$
Next observe that for any $$x \in \mathbb{R}^n$$ one can write $$h(x) = \int_{E^{\perp}} w(A^{\top} x + \cdot)$$, where $$E \subset \mathbb{R}^d$$ is the $$n$$-dimensional subspace generated by the orthonormal rows of $$A$$, and thus thanks to the assumptions on $$w$$ one has: $$h_e(x) = \int_{A^{\top} x + E^{\perp}} w_{A^{\top} e} = \int_{A^{\top} x + E^{\perp}} \nabla \cdot ((A^{\top} e) w).$$ The key step is now to remark that the condition $$\forall x, A p(x) = e$$ exactly means that the projection of $$p$$ on $$E$$ is $$A^{\top} e$$, and thus by the Divergence Theorem one has $$\int_{A^{\top} x + E^{\perp}} \nabla \cdot ((A^{\top} e) w) = \int_{A^{\top} x + E^{\perp}} \nabla \cdot (p w).$$ The proof is concluded with a simple Cauchy–Schwarz inequality: $$e^{\top} I(h) e = \int_{\mathbb{R}^n} \frac{\left(\int_{A^{\top} x + E^{\perp}} \nabla \cdot (p w)\right)^2}{\int_{A^{\top} x + E^{\perp}} w} \leq \int_{\mathbb{R}^n} \int_{A^{\top} x + E^{\perp}} \frac{(\nabla \cdot (p w))^2}{w} = \int_{\mathbb{R}^d} \frac{(\nabla \cdot (p w))^2}{w}.$$

6 Open problems

This work leaves many questions open. A basic question is whether one could get away with weaker independence assumptions on the matrix $$\mathbb{X}$$. Indeed several of the estimates in Sections 3 and 4 would work under the assumption that the rows (or the columns) of $$\mathbb{X}$$ are i.i.d. from a log-concave distribution in $$\mathbb{R}^d$$ (or $$\mathbb{R}^n$$). However, it seems that the core of the proof, namely the induction argument from Section 2, breaks without the independence assumption for the entries of $$\mathbb{X}$$. Thus it remains open whether Theorem 1 is true with only row (or column) independence for $$\mathbb{X}$$. The case of row independence is probably much harder than column independence. As we observed in Section 1.2, a natural alternative route to prove Theorem 1 (or possibly a variant of it with a different metric) would be to use Stein’s method. A straightforward application of existing results yields the suboptimal dimension dependency $$d \gg n^6$$ for convergence, and it is an intriguing open problem whether the optimal rate $$d \gg n^3$$ can be obtained with Stein’s method.
In this article, we consider Wishart matrices with zeroed out diagonal elements in order to avoid further technical difficulties (also, for many applications, such as the random geometric graph example, the diagonal elements do not contain relevant information). We believe that Theorem 1 remains true with the diagonal included (given an appropriate modification of the Gaussian ensemble). The main difficulty is that in the chain rule argument one will have to deal with the law of the diagonal elements conditionally on the other entries. We leave this to further works, but when $$\mu$$ is the standard Gaussian, it is easy to conclude the calculations with these conditional laws. In [13], it is proven that when $$\mu$$ is a standard Gaussian and $$d/n \rightarrow +\infty$$, one has $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{W}_{n,d+1}(\mu)) \rightarrow 0$$. It seems conceivable that the techniques developed in this article could be useful to prove such a result for a more general class of distributions $$\mu$$. However, a major obstacle is that the tools from Section 3 are strongly tied to measuring the relative entropy with respect to a standard Gaussian (because it maximizes the entropy), and it is not clear at all how to adapt this part of the proof. Finally, one may be interested in understanding CLTs of the form (1) for higher-order interactions. More precisely, recall that by denoting $$\mathbb{X}_i$$ for the $$i^{th}$$ column of $$\mathbb{X}$$ one can write $$\mathbb{X} \mathbb{X}^{\top} = \sum_{i=1}^d \mathbb{X}_i \otimes \mathbb{X}_i = \sum_{i=1}^d \mathbb{X}_i^{\otimes 2}$$. For $$p \in \mathbb{N},$$ we may now consider the distribution $$\mathcal{W}_{n,d}^{(p)}$$ of $$\frac{1}{\sqrt{d}} \sum_{i=1}^d \mathbb{X}_i^{\otimes p}$$ (for the sake of consistency we should remove the non-principal terms in this tensor). The measures $$\mathcal{W}_{n,d}^{(p)}$$ have recently gained interest in the machine learning community, see [2].
It would be interesting to see if the method described in this article can be used to understand how large $$d$$ needs to be as a function of $$n$$ and $$p$$ so that $$\mathcal{W}_{n,d}^{(p)}$$ is close to being a Gaussian distribution.

Acknowledgments

We are grateful to the anonymous referees for the suggestions that helped improve the article. This work was completed while S.G. was an intern at Microsoft Research in Redmond. He thanks the Theory group for its hospitality.

References

[1] Adamczak R., Litvak A., Pajor A., and Tomczak-Jaegermann N. “Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles.” Journal of the American Mathematical Society 23, no. 2 (2010): 535–61.
[2] Anandkumar A., Ge R., Hsu D., Kakade S. M., and Telgarsky M. “Tensor decompositions for learning latent variable models.” Journal of Machine Learning Research 15 (2014): 2773–832.
[3] Artstein S., Ball K. M., Barthe F., and Naor A. “On the rate of convergence in the entropic central limit theorem.” Probability Theory and Related Fields 129, no. 3 (2004): 381–90.
[4] Aubrun G., Szarek S. J., and Ye D. “Entanglement thresholds for random induced states.” Communications on Pure and Applied Mathematics 67, no. 1 (2014): 129–71.
[5] Ball K. and Nguyen V. H. “Entropy jumps for log-concave isotropic random vectors and spectral gap.” Studia Mathematica 213, no. 1 (2012): 81–96.
[6] Ball K., Barthe F., and Naor A. “Entropy jumps in the presence of a spectral gap.” Duke Mathematical Journal 119, no. 1 (2003): 41–63.
[7] Bobkov S. “Isoperimetric and analytic inequalities for log-concave probability measures.” Annals of Probability 27, no. 4 (1999): 1903–21.
[8] Bubeck S., Ding J., Eldan R., and Rácz M. Z. “Testing for high-dimensional geometry in random graphs.” Random Structures and Algorithms 49, no. 3 (2016): 503–32.
[9] Chatterjee S. “A short survey of Stein’s method.” Seoul, Proceeding of ICM, Volume IV, 2014. Pages: 1–24.
[10] Chatterjee S. and Meckes E. “Multivariate normal approximation using exchangeable pairs.” ALEA 4 (2008): 257–83.
[11] Cover T. M. and Thomas J. A. Elements of Information Theory. Wiley-Interscience, 1991.
[12] Edelman A. “Eigenvalues and condition numbers of random matrices.” SIAM Journal on Matrix Analysis and Applications 9, no. 4 (1988): 543–60.
[13] Eldan R. “An efficiency upper bound for inverse covariance estimation.” Israel Journal of Mathematics 207, no. 1 (2015): 1–9.
[14] Guédon O. “Concentration phenomena in high dimensional geometry.” In ESAIM: Proceedings, vol 44, 47–60. EDP Sciences, 2014.
[15] Jiang T. and Li D. “Approximation of rectangular beta-Laguerre ensembles and large deviations.” Journal of Theoretical Probability (2013): 1–44.
[16] Johnstone I. M. “On the distribution of the largest eigenvalue in principal components analysis.” The Annals of Statistics 29, no. 2 (2001): 295–327.
[17] Paouris G. “Concentration of mass on convex bodies.” Geometric & Functional Analysis GAFA 16, no. 5 (2006): 1021–49.
[18] Paouris G. “Small ball probability estimates for log-concave measures.” Transactions of the American Mathematical Society 364, no. 1 (2012): 287–308.
[19] Rudelson M. and Vershynin R. “The Littlewood–Offord problem and invertibility of random matrices.” Advances in Mathematics 218, no. 2 (2008): 600–33.
[20] Rudelson M. and Vershynin R. “Non-asymptotic theory of random matrices: extreme singular values.” Proceedings of the International Congress of Mathematicians, ICM2010 (2010): pp. 1576–602.
[21] Sankar A., Spielman D. A., and Teng S.-H. “Smoothed analysis of the condition numbers and growth factors of matrices.” SIAM Journal on Matrix Analysis and Applications 28, no. 2 (2006): 446–76.
[22] Stein C. “Approximate computation of expectations.” Lecture Notes-Monograph Series 7 (1986): i–164.

Communicated by Prof. Assaf Naor. © The Author(s) 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permission@oup.com. International Mathematics Research Notices, Oxford University Press

Volume 2018 (2), Jan 1, 2018. 19 pages. Publisher: Oxford University Press. ISSN: 1073-7928. eISSN: 1687-0247. DOI: 10.1093/imrn/rnw243

Abstract We consider high dimensional Wishart matrices $$\mathbb{X} \mathbb{X}^{\top}$$, where the entries of $$\mathbb{X} \in \mathbb{R}^{n \times d}$$ are i.i.d. from a log-concave distribution. We prove an information theoretic phase transition: such matrices are close in total variation distance to the corresponding Gaussian ensemble if and only if d is much larger than $$n^3$$. Our proof is entropy-based, making use of the chain rule for relative entropy along with the recursive structure in the definition of the Wishart ensemble. The proof crucially relies on the well known relation between Fisher information and entropy, a variational representation for Fisher information, concentration bounds for the spectral norm of a random matrix, and certain small ball probability estimates for log-concave measures. 1 Introduction Let $$\mu$$ be a probability distribution supported on $$\mathbb{R}$$ with zero mean and unit variance. We consider a Wishart matrix (with removed diagonal) $$W = \left( \mathbb{X} \mathbb{X}^{\top} - \mathrm{diag}(\mathbb{X} \mathbb{X}^{\top}) \right) / \sqrt{d}$$ , where $$\mathbb{X}$$ is an $$n \times d$$ random matrix with i.i.d. entries from $$\mu$$. The distribution of $$W$$, which we denote $$\mathcal{W}_{n,d}(\mu)$$, is of importance in many areas of mathematics. Perhaps most prominently it arises in statistics as the distribution of covariance matrices, and in this case $$n$$ can be thought of as the number of parameters and $$d$$ as the sample size. Another application is in the theory of random graphs, where the thresholded matrix $$A_{i,j} = \mathbb{1}\{W_{i,j} >\tau\}$$ is the adjacency matrix of a random geometric graph on $$n$$ vertices, where each vertex is associated to a latent feature vector in $$\mathbb{R}^d$$ (namely the $$i^{th}$$ row of $$\mathbb{X}$$), and an edge is present between two vertices if the correlation between the underlying features is large enough. 
Wishart matrices also appear in physics, as a simple model of a random mixed quantum state where $$n$$ and $$d$$ are the dimensions of the observable and unobservable states, respectively. The measure $$\mathcal{W}_{n,d}(\mu)$$ becomes approximately Gaussian when $$d$$ goes to infinity and $$n$$ remains bounded (see Section 1.1). Thus in the classical regime of statistics where the sample size is much larger than the number of parameters one can use the well understood theory of Gaussian matrices to study the properties of $$\mathcal{W}_{n,d}(\mu)$$. In this article, we investigate the extent to which this Gaussian picture remains relevant in the high-dimensional regime, where the matrix size $$n$$ also goes to infinity. Our main result, stated informally, is the following universality of a critical dimension for sufficiently smooth measures $$\mu$$ (namely log-concave): the Wishart measure $$\mathcal{W}_{n,d}(\mu)$$ becomes approximately Gaussian if and only if $$d$$ is much larger than $$n^3$$. From a statistical perspective this means that analyses based on Gaussian approximation of a Wishart are valid as long as the number of samples is at least the cube of the number of parameters. In the random graph setting this gives a dimension barrier to the extraction of geometric information from a network, as our result shows that all geometry is lost when the dimension of the latent feature space is larger than the cube of the number of vertices. 1.1 Main result Writing $$X_i \in \mathbb{R}^d$$ for the $$i^{th}$$ row of $$\mathbb{X}$$ one has for $$i \neq j$$, $$W_{i,j} = \frac{1}{\sqrt{d}} \langle X_i, X_j \rangle$$. 
In particular $$\mathbb{E} W _{i,j} = 0$$ and $$\mathbb{E} W_{i,j} W_{\ell, k} = \mathbb{1}\{(i, j) = (\ell, k) \ \text{and} \ i \neq j\}.$$ Thus for fixed $$n$$, by the multivariate central limit theorem one has, as $$d$$ goes to infinity, $$\mathcal{W}_{n,d}(\mu) \overset{\mathcal{D}}{\longrightarrow} \mathcal{G}_n,$$ where $$\mathcal{G}_n$$ is the distribution of an $$n \times n$$ Wigner matrix with null diagonal and standard Gaussian entries off diagonal (recall that a Wigner matrix is symmetric and the entries above the main diagonal are i.i.d.). Recall that the total variation distance between two measures $$\lambda, \nu$$ is defined as $$\mathrm{TV}(\lambda, \nu) = \sup_A |\lambda(A) - \nu(A)|$$ where the supremum is over all measurable sets $$A$$. Our main result is the following: Theorem 1. Assuming that $$\mu$$ is log-concave and $$d/ (n^3 \log^2(d)) \rightarrow +\infty$$, one has $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n) \rightarrow 0.$$ (1) □ (Recall that a measure $$\mu$$ with density $$f$$ is said to be log-concave if $$f(\cdot)=e^{-\phi(\cdot)}$$ for some convex function $$\phi.$$) Observe that for (1) to be true one needs some kind of smoothness assumption on $$\mu$$. Indeed if $$\mu$$ is purely atomic then so is $$\mathcal{W}_{n,d}(\mu)$$, and thus its total variation distance to $$\mathcal{G}_n$$ is $$1$$. We also remark that Theorem 1 is tight up to the logarithmic factor in the sense that if $$d/ n^3 \rightarrow 0$$, then $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n) \rightarrow 1,$$ (2) see Section 1.2 below for more details on this result. Finally, our proof in fact gives the following quantitative version of (1): Theorem 2. There exists a universal constant $$C>1$$ such that for $$d \geq C n^2,$$ $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n)^2 \leq C \left(\frac{n^3 \log^2(d) + n^2 \log^4(d)}{d} + \sqrt{\frac{n^3}{d}}\right).$$ □

1.2 Related work and ideas of proof

In the case where $$\mu$$ is a standard Gaussian, Theorem 1 (without the logarithmic factor) was recently proven simultaneously and independently in [8, 15].
We also observe that prior to these results, certain properties of a Gaussian Wishart were already known to behave as those of a Gaussian matrix, and for values of $$d$$ much smaller than $$n^3$$; see for example [16] for the largest eigenvalue at $$d \approx n$$, and [4] on whether the quantum state represented by the Wishart is separable at $$d \approx n^{3/2}$$. The proof of Theorem 1 for the Gaussian case is simpler as both measures have a known density with a rather simple form, and one can then explicitly compute the total variation distance as the $$L_1$$ distance between the densities. We now discuss how to lower bound $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n).$$ Bubeck et al. [8] implicitly prove (2) when $$\mu$$ is Gaussian. Taking inspiration from this, one can show that in the regime $$d/ n^3 \rightarrow 0$$, for any $$\mu$$ with zero mean, unit variance and finite fourth moment (log-concavity implies exponential tails and hence the existence of all moments, see (11)), one can distinguish $$\mathcal{G}_n$$ and $$\mathcal{W}_{n,d}(\mu)$$ by considering the statistic $$A \in \mathbb{R}^{n \times n} \mapsto \mathrm{Tr}(A^3)$$. Indeed it turns out that the means of $$\mathrm{Tr}(A^3)$$ under the two measures are zero and $$\Theta (\frac{n^3}{\sqrt{d}}),$$ respectively, whereas the variances are $$\Theta (n^3)$$ and $$\Theta (n^{3} + \frac{n^5}{{d}^2})$$. Since $$d=o(n^3)$$ implies $$\sqrt{n^{3}+\frac{n^5}{{d}^2}}=o(\frac{n^3}{\sqrt{d}}),$$ (2) follows by a simple application of Chebyshev’s inequality. We omit the details and refer the interested reader to [8]. Proving normal approximation results without the assumption of independence is a natural question and has been a subject of intense study over many years. One method that has found several applications in such settings is the so-called Stein’s method of exchangeable pairs.
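The $$\mathrm{Tr}(A^3)$$ test described above can be tried numerically. The following is a minimal sketch, not the procedure of [8]: it assumes $$\mu$$ is the standard Gaussian, uses only the Python standard library, and the function names (`wishart_offdiag`, `wigner_offdiag`, `trace_cubed`) and the sizes in the demo are hypothetical choices made for illustration. For i.i.d. entries with zero mean and unit variance, expanding the trace gives a Wishart mean of $$n(n-1)(n-2)/\sqrt{d}$$, consistent with the $$\Theta(n^3/\sqrt{d})$$ stated above, and the script prints that value for comparison.

```python
import math
import random


def wishart_offdiag(n, d, rng):
    """Sample W = (X X^T - diag(X X^T)) / sqrt(d), X having i.i.d. N(0, 1) entries."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = sum(a * b for a, b in zip(X[i], X[j])) / math.sqrt(d)
            W[i][j] = W[j][i] = w
    return W


def wigner_offdiag(n, rng):
    """Sample a symmetric matrix with null diagonal and i.i.d. N(0, 1) entries above it."""
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            G[i][j] = G[j][i] = rng.gauss(0.0, 1.0)
    return G


def trace_cubed(A):
    """Return Tr(A^3) = sum over i, j, k of A[i][j] * A[j][k] * A[k][i]."""
    n = len(A)
    return sum(A[i][j] * A[j][k] * A[k][i]
               for i in range(n) for j in range(n) for k in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, trials = 12, 20, 200  # hypothetical sizes with d well below n^3
    mean_w = sum(trace_cubed(wishart_offdiag(n, d, rng)) for _ in range(trials)) / trials
    mean_g = sum(trace_cubed(wigner_offdiag(n, rng)) for _ in range(trials)) / trials
    print("empirical mean of Tr(W^3), Wishart:", mean_w)
    print("empirical mean of Tr(G^3), Wigner: ", mean_g)
    print("n(n-1)(n-2)/sqrt(d):               ", n * (n - 1) * (n - 2) / math.sqrt(d))
```

Increasing `d` towards and beyond `n ** 3` in the demo is one way to watch the two empirical means become harder to tell apart.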
Since Stein’s original work (see [22]) the method has been considerably generalized to prove error bounds on convergence to Gaussian distribution in various situations. The multidimensional case was treated first in [10]. For several applications of Stein’s method in proving central limit theorems (CLT) see [9] and the references therein. In our setting, note that $$W = \sum_{i=1}^d \left(\mathbb{X}_i \mathbb{X}_i^{\top} - \mathrm{diag}(\mathbb{X}_i \mathbb{X}_i^{\top})\right) / \sqrt{d},$$ where the $$\mathbb{X}_i$$ are i.i.d. vectors in $$\mathbb{R}^n$$ whose coordinates are i.i.d. samples from a one-dimensional measure $$\mu.$$ Considering $$\mathbb{Y}_i=\mathbb{X}_i \mathbb{X}_i^{\top} - \mathrm{diag}(\mathbb{X}_i \mathbb{X}_i^{\top})$$ as a vector in $$\mathbb{R}^{n^2}$$ and noting that $$|\mathbb{Y}_{i}|^3 \sim n^3,$$ a straightforward application of Stein’s method using exchangeable pairs (see the proof of [10, Theorem 7]) provides the following suboptimal bound: the Wishart ensemble converges to the Gaussian ensemble (convergence of integrals against ‘smooth’ enough test functions) when $$d \gg n^6.$$ Whether there is a way to use Stein’s method to recover Theorem 1 in any reasonable metric (total variation metric, Wasserstein metric, etc.) remains an open problem (see Section 6 for more on this). Our approach to proving (1) is information theoretic and hence completely different from [8, 15] (this is a necessity since for a general $$\mu$$ there is no simple expression for the density of $$\mathcal{W}_{n,d}(\mu)$$). The first step in our proof, described in Section 2, is to use Pinsker’s inequality to change the focus from total variation distance to the relative entropy (see also Section 2 for definitions). Together with the chain rule for relative entropy this allows us to bound the relative entropy of $$\mathcal{W}_{n,d}(\mu)$$ with respect to $$\mathcal{G}_n$$ by induction on the dimension $$n$$.
The base case essentially follows from the work of [3], who proved that the relative entropy between the standard one-dimensional Gaussian and $$\frac{1}{\sqrt{d}} \sum_{i=1}^d x_i$$, where $$x_1, \ldots, x_d \in \mathbb{R}$$ is an i.i.d. sequence from a log-concave measure $$\mu$$, goes to $$0$$ at a rate $$1/d$$. One of the main technical contributions of our work is a certain generalization of the latter result to higher dimensions, see Theorem 3 in Section 3. Recently [5] also studied a high dimensional generalization of the result in [6] (which contains the key elements for the proof in [3]), but it seems that Theorem 3 is not comparable to the main theorem in [5]. Another important part of the induction argument, which is carried out in Section 4, relies on controlling from above the expectation of $$-\mathrm{logdet}(\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})$$, which should be understood as the relative entropy between a centered Gaussian with covariance given by $$\frac{1}{d} \mathbb{X} \mathbb{X}^{\top}$$ and a standard Gaussian in $$\mathbb{R}^n$$. This leads us to study the probability that $$\mathbb{X} \mathbb{X}^{\top}$$ is close to being non-invertible. Denoting by $$s_{\mathrm{min}}$$ the smallest singular value of $$\mathbb{X}$$, it suffices to prove a ‘good enough’ upper bound for $$\mathbb{P}(s_{\mathrm{min}}(\mathbb{X}^{\top}) \leq \epsilon)$$ for all small $$\epsilon$$. The case when the entries of $$\mathbb{X}$$ are Gaussian allows one to work with exact formulas and was studied in [12, 21]. The last few years have seen tremendous progress in understanding the universality of the tail behavior of extreme singular values of random matrices with i.i.d. entries from general distributions. See [20] and the references therein for a detailed account of these results.
Such estimates are quite delicate, and it is worthwhile to mention that the following estimate was proved only recently in [19]: Let $$A \in \mathbb{R}^{n \times d}$$ with $$d\ge n$$ be a rectangular matrix with i.i.d. subgaussian entries; then for all $$\epsilon >0,$$ $$\mathbb{P}\left(s_{\mathrm{min}}(A^{\top}) \leq \epsilon \left(\sqrt{d} - \sqrt{n-1}\right)\right) \leq (C \epsilon)^{d-n+1} + c^d,$$ where $$c,C$$ are independent of $$n,d$$. In full generality, such estimates are essentially sharp since in the case where the entries are random signs, $$s_{\mathrm{min}}$$ is zero with probability $$c^d$$. Unfortunately this type of bound is not useful for us, as we need to control $$\mathbb{P}(s_{\mathrm{min}}(\mathbb{X}^{\top}) \leq \epsilon)$$ for arbitrarily small scales $$\epsilon$$ (indeed $$\mathrm{logdet}(\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})$$ would blow up if $$s_{\mathrm{min}}$$ can be zero with non-zero probability). It turns out that the assumption of log-concavity of the distribution allows us to do that. To this end we use recent advances in [18] on small ball probability estimates for such distributions: Let $$Y \in \mathbb{R}^n$$ be an isotropic centered log-concave random variable, and $$\epsilon \in (0,1/10)$$, then one has $$\mathbb{P}(|Y| \leq \epsilon \sqrt{n}) \leq (C \epsilon)^{\sqrt{n}}$$. This together with an $$\epsilon$$-net argument gives us the required control on $$\mathbb{P}(s_{\mathrm{min}}(\mathbb{X}^{\top}) \leq \epsilon)$$. We conclude the article with several open problems in Section 6.

2 An induction proof via the chain rule for relative entropy

Recall that the (differential) entropy of a measure $$\lambda$$ with a density $$f$$ (all densities are understood with respect to the Lebesgue measure unless stated otherwise) is defined as: $$\mathrm{Ent}(\lambda) = \mathrm{Ent}(f) = - \int f(x) \log f(x) \, {\rm d}x.$$ The relative entropy of a measure $$\lambda$$ (with density $$f$$) with respect to a measure $$\nu$$ (with density $$g$$) is defined as $$\mathrm{Ent}(\lambda \Vert \nu) = \int f(x) \log \frac{f(x)}{g(x)} \, {\rm d}x.$$
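To make the definition of relative entropy concrete, the sketch below evaluates $$\mathrm{Ent}(\lambda \Vert \gamma_1)$$ for a one-dimensional Gaussian $$\lambda$$ in two ways: by numerical integration of $$\int f \log (f/g)$$ and by the standard closed form $$\frac{1}{2}(\sigma^2 + m^2 - 1) - \log \sigma$$. The function names, the integration window and the parameter values are assumptions made for this illustration only.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def relative_entropy_numeric(f, g, lo, hi, steps=20000):
    """Approximate the integral of f * log(f / g) over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        fx = f(x)
        if fx > 0.0:  # the integrand is taken to be 0 where f vanishes
            total += fx * math.log(fx / g(x)) * h
    return total


def relative_entropy_gaussian(mean, std):
    """Closed form of Ent(N(mean, std^2) || N(0, 1))."""
    return 0.5 * (std ** 2 + mean ** 2 - 1.0) - math.log(std)


if __name__ == "__main__":
    mean, std = 0.3, 0.5  # hypothetical parameters
    numeric = relative_entropy_numeric(
        lambda x: gaussian_pdf(x, mean, std),
        lambda x: gaussian_pdf(x, 0.0, 1.0),
        -12.0, 12.0,
    )
    print("numeric:    ", numeric)
    print("closed form:", relative_entropy_gaussian(mean, std))
```

The closed form has the same shape as the correction terms that appear when comparing a non-standard Gaussian to $$\gamma_n$$: a trace-like term, a dimension term, and a log-determinant term.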
With a slight abuse of notation we sometimes write $$\mathrm{Ent}(Y \Vert \nu)$$, where $$Y$$ is a random variable distributed according to some distribution $$\lambda$$. Pinsker’s inequality gives: $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n)^2 \leq \frac{1}{2} \mathrm{Ent}(\mathcal{W}_{n,d}(\mu) \Vert \mathcal{G}_n).$$ Next, recall that the chain rule for relative entropy states that for any random variables $$Y_1, Y_2, Z_1, Z_2$$, $$\mathrm{Ent}((Y_1, Y_2) \Vert (Z_1, Z_2)) = \mathrm{Ent}(Y_1 \Vert Z_1) + \mathbb{E}_{y \sim \lambda_1} \mathrm{Ent}(Y_2 \vert Y_1 = y \Vert Z_2 \vert Z_1 = y),$$ where $$\lambda_1$$ is the (marginal) distribution of $$Y_1$$, and $$Y_2 \vert Y_1=y$$ is used to denote the distribution of $$Y_2$$ conditionally on the event $$Y_1 = y$$ (and similarly for $$Z_2 \vert Z_1=y$$). Also observe that a sample from $$\mathcal{W}_{n+1,d}(\mu)$$ can be obtained by adjoining to $$\left( \mathbb{X} \mathbb{X}^{\top} - \mathrm{diag}(\mathbb{X} \mathbb{X}^{\top}) \right) / \sqrt{d}$$ (whose distribution is $$\mathcal{W}_{n,d}(\mu)$$) the column vector $$\mathbb{X} X /\sqrt{d}$$ (and the row vector $$(\mathbb{X} X)^{\top} /\sqrt{d}$$) where $$X \in \mathbb{R}^d$$ has i.i.d. entries from $$\mu$$. Thus denoting $$\gamma_n$$ for the standard Gaussian measure in $$\mathbb{R}^n$$ we obtain for all $$n\ge 1,$$ $$\mathrm{Ent}(\mathcal{W}_{n+1,d}(\mu) \Vert \mathcal{G}_{n+1}) = \mathrm{Ent}(\mathcal{W}_{n,d}(\mu) \Vert \mathcal{G}_n) + \mathbb{E}_{\mathbb{X}} \ \mathrm{Ent}(\mathbb{X} X / \sqrt{d} \ \vert \ \mathbb{X} \mathbb{X}^{\top} \Vert \gamma_n).$$ (3) By convexity of the relative entropy (see e.g., [11]) one also has: $$\mathbb{E}_{\mathbb{X}} \ \mathrm{Ent}(\mathbb{X} X / \sqrt{d} \ \vert \ \mathbb{X} \mathbb{X}^{\top} \Vert \gamma_n) \leq \mathbb{E}_{\mathbb{X}} \ \mathrm{Ent}(\mathbb{X} X / \sqrt{d} \ \vert \ \mathbb{X} \Vert \gamma_n).$$ (4) Also, since by definition both $$\mathcal{W}_{1,d}(\mu)$$ and $$\mathcal{G}_{1}$$ are zero, $$\mathrm{Ent}(\mathcal{W}_{1,d}(\mu) \Vert \mathcal{G}_{1})=0$$ as well. Next we need a simple lemma to rewrite the right hand side of (4): Lemma 1. Let $$A \in \mathbb{R}^{n \times d}$$ and $$Q \in \mathbb{R}^{n \times n}$$ be such that $$Q A A^{\top} Q^{\top} = \mathrm{I}_n$$. Then one has for any isotropic random variable $$X \in \mathbb{R}^d$$, $$\mathrm{Ent}(A X \Vert \gamma_n) = \mathrm{Ent}(Q A X \Vert \gamma_n) + \frac{1}{2} \mathrm{Tr}(A A^{\top}) - \frac{n}{2} + \mathrm{logdet}(Q).$$ □ Proof.
Denote $$\Phi_{\Sigma}$$ for the density of a centered $$\mathbb{R}^n$$-valued Gaussian with covariance matrix $$\Sigma$$ (i.e., $$\Phi_{\Sigma}(x) = \frac{1}{\sqrt{(2 \pi)^n \mathrm{det}(\Sigma)}} \exp(- \frac{1}{2} x^{\top} \Sigma^{-1} x )$$), and let $$G \sim \gamma_n$$. Also let $$f$$ be the density of $$Q A X$$. Then one has (the first equality is a simple change of variables): $$\mathrm{Ent}(A X \Vert G) = \mathrm{Ent}(Q A X \Vert Q G) = \int f(x) \log\left(\frac{f(x)}{\Phi_{Q Q^{\top}}(x)}\right) {\rm d}x = \int f(x) \log\left(\frac{f(x)}{\Phi_{\mathrm{I}_n}(x)}\right) {\rm d}x + \int f(x) \log\left(\frac{\Phi_{\mathrm{I}_n}(x)}{\Phi_{Q Q^{\top}}(x)}\right) {\rm d}x = \mathrm{Ent}(Q A X \Vert G) + \int f(x) \left(\frac{1}{2} x^{\top} (Q Q^{\top})^{-1} x - \frac{1}{2} x^{\top} x + \frac{1}{2} \mathrm{logdet}(Q Q^{\top})\right) {\rm d}x = \mathrm{Ent}(Q A X \Vert G) + \frac{1}{2} \mathrm{Tr}\left((Q Q^{\top})^{-1}\right) - \frac{n}{2} + \mathrm{logdet}(Q),$$ where for the last equality we used the fact that $$Q A X$$ is isotropic, that is $$\int f(x) x x^{\top} {\rm d}x = \mathrm{I}_n$$, and that $$\mathrm{det}(Q Q^{\top})=\mathrm{det}(Q)^2$$. Finally, it only remains to observe that $$\mathrm{Tr}\left( (Q Q^{\top})^{-1} \right) = \mathrm{Tr}(A A^{\top})$$. ■ Combining (3) and (4) with Lemma 1 (one can take $$Q = (\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})^{-1/2}$$), and using that $$\mathbb{E} \ \mathrm{Tr}(\mathbb{X} \mathbb{X}^{\top}) = nd$$, one obtains $$\mathrm{Ent}(\mathcal{W}_{n+1,d}(\mu) \Vert \mathcal{G}_{n+1}) \leq \mathrm{Ent}(\mathcal{W}_{n,d}(\mu) \Vert \mathcal{G}_n) + \mathbb{E}_{\mathbb{X}} \ \mathrm{Ent}\left((\mathbb{X} \mathbb{X}^{\top})^{-1/2} \mathbb{X} X \ \vert \ \mathbb{X} \Vert \gamma_n\right) - \frac{1}{2} \mathbb{E}_{\mathbb{X}} \ \mathrm{logdet}\left(\tfrac{1}{d} \mathbb{X} \mathbb{X}^{\top}\right).$$ (5) In Section 3, we show how to bound the term $$\mathrm{Ent}(A X \Vert \gamma_n),$$ where $$A \in \mathbb{R}^{n \times d}$$ has orthonormal rows (i.e., $$A A^{\top} = \mathrm{I}_n$$), and thereby prove a central limit theorem. In Section 4, we deal with the term $$\mathbb{E}_{\mathbb{X}} \ \mathrm{logdet} (\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})$$. The proof of Theorem 2, and hence of Theorem 1, then follows by iterating (5) and using the results of these sections.

3 A high dimensional entropic CLT

The main goal of this section is to prove the following high dimensional generalization of the entropic CLT of [3]. Theorem 3. Let $$Y \in \mathbb{R}^d$$ be a random vector with i.i.d.
entries from a distribution $$\nu$$ with zero mean, unit variance, and spectral gap $$c\in (0,1]$$ (a probability measure $$\mu$$ is said to have spectral gap $$c$$ if for all smooth functions $$g$$ with $$\mathbb{E}_{\mu}(g)=0,$$ we have $$\mathbb{E}_{\mu}(g^2) \le \frac{1}{c}\mathbb{E}_{\mu}(g'^2)$$). Let $$A \in \mathbb{R}^{n \times d}$$ be a matrix such that $$A A^{\top} = \mathrm{I}_n$$. Let $$\epsilon = \max_{i\in [d]} (A^{\top} A)_{i,i}$$ and $$\zeta = \max_{i,j \in [d], i \neq j} |(A^{\top} A)_{i,j}|$$. Then one has $$\mathrm{Ent}(A Y \Vert \gamma_n) \leq n \min\left(2 (\epsilon + \zeta^2 d) / c, 1\right) \mathrm{Ent}(\nu \Vert \gamma_1).$$ □ The assumption $$A A^{\top} = \mathrm{I}_n$$ implies that the rows of $$A$$ form an orthonormal system. In particular if $$A$$ is built by picking rows one after the other uniformly at random on the Euclidean sphere in $$\mathbb{R}^d$$, conditionally on being orthogonal to the previous rows, then one expects that $$\epsilon \simeq n / d$$ and $$\zeta \simeq \sqrt{n} / d$$. Theorem 3 then yields $$\mathrm{Ent}( A Y \Vert \gamma_n) \lesssim n^2 / d$$. Thus we already see the term $$n^3 / d$$ from Theorem 1 appearing, as we will sum the latter bound over the $$n$$ rounds of induction (see Section 2). For the special case $$n=1$$, Theorem 3 is slightly weaker than the result of [3], which involves the $$\ell_4$$-norm of $$A$$. Sections 3.1 and 3.2 are dedicated to the proof of Theorem 3. Then in Section 3.3, we show how to apply this result to bound the term $$\mathbb{E}_{\mathbb{X}} \ \mathrm{Ent}(Q \mathbb{X} X / \sqrt{d} \ \vert \ \mathbb{X} \Vert \gamma_{n})$$ from Section 2.
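The quantities $$\epsilon$$ and $$\zeta$$ of Theorem 3 are easy to compute for a concrete $$A$$. The sketch below is an illustration only: it builds $$A$$ by Gram–Schmidt on Gaussian vectors (one way to realise the "random orthonormal rows" heuristic above), computes $$\epsilon$$ and $$\zeta$$ from $$B = A^{\top} A$$, and evaluates the factor $$n \min(2(\epsilon + \zeta^2 d)/c, 1)$$. The function names, the sizes, and the value of `c` in the demo are hypothetical.

```python
import math
import random


def random_orthonormal_rows(n, d, rng):
    """Gram-Schmidt on n Gaussian vectors in R^d; returns A (n rows) with A A^T = I_n."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove the components along the rows already built
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def epsilon_zeta(A):
    """Return (max_i B[i][i], max over i != j of |B[i][j]|) for B = A^T A."""
    n, d = len(A), len(A[0])
    eps, zeta = 0.0, 0.0
    for i in range(d):
        for j in range(i, d):
            b = sum(A[k][i] * A[k][j] for k in range(n))
            if i == j:
                eps = max(eps, b)
            else:
                zeta = max(zeta, abs(b))
    return eps, zeta


def theorem3_factor(n, d, eps, zeta, c):
    """The factor n * min(2 (eps + zeta^2 d) / c, 1) in front of Ent(nu || gamma_1)."""
    return n * min(2.0 * (eps + zeta ** 2 * d) / c, 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, c = 5, 400, 1.0  # hypothetical sizes and spectral gap
    A = random_orthonormal_rows(n, d, rng)
    eps, zeta = epsilon_zeta(A)
    print("epsilon:", eps, " compare n/d:", n / d)
    print("zeta:   ", zeta, " compare sqrt(n)/d:", math.sqrt(n) / d)
    print("Theorem 3 factor:", theorem3_factor(n, d, eps, zeta, c))
```

Since $$\mathrm{Tr}(A^{\top} A) = n$$, the computed $$\epsilon$$ can never fall below $$n/d$$; the maxima over all coordinates typically sit somewhat above the heuristic orders quoted in the text.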
3.1 From entropy to Fisher information

For a density function $$w : \mathbb{R}^n \rightarrow \mathbb{R}_+$$, let $$J(w) := \int_{\mathbb{R}^n} \frac{|\nabla w(x)|^2}{w(x)} {\rm d}x$$ denote its Fisher information (where $$\nabla w(\cdot)$$ denotes the gradient vector of $$w$$ and $$|\cdot|$$ denotes the Euclidean norm), and $$I(w) := \int \frac{\nabla w(x) \nabla w(x)^{\top}}{w(x)} {\rm d}x$$ the Fisher information matrix (if $$\nu$$ denotes the measure whose density is $$w$$, we may also write $$J(\nu)$$ instead of $$J(w)$$). We use $$P_t$$ to denote the Ornstein–Uhlenbeck semigroup, that is, for a random variable $$Z$$ with density $$g$$, we define $$P_t Z := \exp(-t) Z + \sqrt{1 - \exp(-2t)} \, G,$$ where $$G \sim \gamma_n$$ (the standard Gaussian in $$\mathbb{R}^n$$) is independent of $$Z$$; we denote by $$P_t g$$ the density of $$P_t Z$$. The de Bruijn identity states that the Fisher information is the time derivative of the entropy along the Ornstein–Uhlenbeck semigroup; more precisely one has for any centered and isotropic density $$w$$: $$\mathrm{Ent}(w \Vert \gamma_n) = \mathrm{Ent}(\gamma_n) - \mathrm{Ent}(w) = \int_0^{\infty} \left(J(P_t w) - n\right) {\rm d}t$$ (the first equality is a simple consequence of the form of the normal density). Our objective is to prove a bound of the form (for some constant $$C$$ depending on $$A$$) $$\mathrm{Ent}(A Y \Vert \gamma_n) \leq C \ \mathrm{Ent}(\nu \Vert \gamma_1),$$ (6) and thus given the above identity it suffices to show that for any $$t > 0$$, $$J(h_t) - n \leq C \ (J(\nu_t) - 1),$$ (7) where $$h_t$$ is the density of $$P_t A Y$$ (which is equal to the density of $$A P_t Y$$) and $$\nu_t$$ is such that $$P_t Y$$ has distribution $$\nu_t^{\otimes d}$$. Furthermore if $$e_1, \ldots, e_n$$ denotes the canonical basis of $$\mathbb{R}^n$$, then to prove (7) it is enough to show that for any $$i \in [n]$$, $$e_i^{\top} I(h_t) e_i - 1 \leq C_i \ (J(\nu_t) - 1),$$ (8) where $$\sum_{i=1}^n C_i = C$$.
Recall that $$c$$ is the spectral gap of $$\nu.$$ We will show that one can take $$C_i = 1 - \frac{c U_i^2}{c W_i + 2 V_i},$$ where we denote $$B= A^{\top} A \in \mathbb{R}^{d \times d}$$, and $$U_i = \sum_{j=1}^d A_{i,j}^2 (1 - B_{j,j}), \quad W_i = \sum_{j=1}^d A_{i,j}^2 (1 - B_{j,j})^2, \quad V_i = \sum_{j,k \in [d], k \neq j} (A_{i,j} B_{j,k})^2.$$ Straightforward calculations (using that $$U_i \geq 1- \epsilon$$, $$W_i \leq 1$$, and $$V_i \leq \zeta^2 d$$) show that one has $$\sum_{i=1}^n \left( 1 - \frac{c U_i^2}{c W_i + 2 V_i} \right) \leq 2 n (\epsilon + \zeta^2 d) / c,$$ where $$\epsilon = \max_{i\in [d]} B_{i,i}$$ and $$\zeta = \max_{i,j \in [d], i \neq j} |B_{i,j}|$$, thus concluding the proof of Theorem 3. In the next subsection, we prove (8) for a given $$t>0$$ and $$i=1$$. We use the following well known but crucial fact: the spectral gap of $$\nu_t$$ is in $$[c,1]$$ (see [6, Proposition 1]). Denoting $$f$$ for the density of $$\nu_t$$, one has with $$\varphi = - \log f$$ that $$J:=J(\nu_t) = \int \varphi''(x) \, {\rm d}\nu_t(x)$$. The last equality easily follows from the fact that for any $$t > 0$$ one has $$\int f'' =0$$ (which itself follows from the smoothness of $$\nu_t$$ induced by the convolution of $$\nu$$ with a Gaussian).

3.2 Variational representation of Fisher information

Let $$Z \in \mathbb{R}^d$$ be a random variable with a twice continuously differentiable density $$w$$ such that $$\int \frac{|\nabla w|^2}{w} < \infty$$ and $$\int \Vert \nabla^2 w \Vert < \infty$$, and let $$h$$ be the density of $$A Z \in \mathbb{R}^n$$. Our main tool is a remarkable formula from [6], which states the following: for all $$e\in \mathbb{R}^n$$ and all sufficiently smooth maps $$p : \mathbb{R}^d \rightarrow \mathbb{R}^d$$ with $$A p(x) = e$$ for all $$x \in \mathbb{R}^d$$, one has (with $$D p$$ denoting the Jacobian matrix of $$p$$) $$e^{\top} I(h) e \leq \int \left(\mathrm{Tr}(D p(x)^2) + p(x)^{\top} \nabla^2(-\log w(x)) p(x)\right) w(x) \, {\rm d}x.$$ (9) For the sake of completeness, we include a short proof of this inequality in Section 5. Let $$(a_{1}, \ldots, a_{d})$$ be the first row of $$A$$.
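Following [3], to prove (7), we would like to use the above formula with $$p$$ of the form $$(a_{1} r(x_1), \ldots, a_{d} r(x_d))$$ for some map $$r : \mathbb{R} \rightarrow \mathbb{R}$$. (The smoothness assumptions on $$w$$ are satisfied in our context since we consider a random variable convolved with a Gaussian.) Since we need to satisfy $$A p(x) = e_1$$, we adjust the formula accordingly and take $$p(x) = (\mathrm{I}_d - A^{\top} A) (a_1 r(x_1), \ldots, a_d r(x_d))^{\top} + A^{\top} e_1.$$ In particular we get, with $$B= A^{\top} A$$, $$p_i(x) = a_i + a_i (1 - B_{i,i}) r(x_i) - \sum_{j \in [d], j \neq i} B_{i,j} a_j r(x_j)$$ and $$\frac{\partial p_i}{\partial x_j}(x) = \begin{cases} a_i (1 - B_{i,i}) r'(x_i) & \text{if } i = j, \\ - B_{i,j} a_j r'(x_j) & \text{otherwise.} \end{cases}$$ Next recall that we apply (9) to prove (8) where $$w(x) = \prod_{i=1}^d f(x_i)$$, in which case we have (recall also the notation $$\varphi = - \log f$$): $$p(x)^{\top} \nabla^2(-\log w(x)) p(x) = \sum_{i=1}^d p_i(x)^2 \varphi''(x_i) = \sum_{i=1}^d \varphi''(x_i) \Big(a_i + a_i (1 - B_{i,i}) r(x_i) - \sum_{j \in [d], j \neq i} B_{i,j} a_j r(x_j)\Big)^2.$$ We also have $$\mathrm{Tr}(D p(x)^2) = \sum_{i=1}^d a_i^2 (1 - B_{i,i})^2 r'(x_i)^2 + \sum_{i,j \in [d], i \neq j} B_{i,j}^2 a_i a_j r'(x_i) r'(x_j).$$ Putting the above together, we obtain (with a slightly lengthy but straightforward computation) that $$e_1^{\top} I(h) e_1$$ is upper bounded by (recall also that $$\sum_i a_i^2 =1$$ and $$\sum_{j} B_{i,j} a_j = a_i$$ since $$B A^{\top} = A^{\top}$$) $$J + W \left(\int f (r')^2 + \int f \varphi'' r^2\right) + J V \int f r^2 + J (W - V) \left(\int f r\right)^2 + 2 U \left(\int f \varphi'' r - J \int f r\right) - 2 W \left(\int f r\right) \left(\int f \varphi'' r\right) + M \left(\int f r'\right)^2$$ (10) where $$U = \sum_{i=1}^d a_i^2 (1 - B_{i,i}), \quad W = \sum_{i=1}^d a_i^2 (1 - B_{i,i})^2, \quad V = \sum_{i,j \in [d], i \neq j} (B_{i,j} a_j)^2, \quad M = \sum_{i,j \in [d], i \neq j} B_{i,j}^2 a_i a_j.$$ Observe that by the Cauchy–Schwarz inequality one has $$M \le V$$, and furthermore following [3] one also has with $$m=\int f r$$, $$\left(\int f r'\right)^2 = \left(\int f' (r - m)\right)^2 = \left(\int \frac{f'}{\sqrt{f}} \sqrt{f} (r - m)\right)^2 \leq J \left(\int f r^2 - m^2\right).$$ Thus we get from (10) and the above observations that $$e_1^{\top} I(h_t) e_1 - J \le T(r)$$, where $$T(r) = W \left(\int f (r')^2 + \int f \varphi'' r^2\right) + 2 J V \left(\int f r^2\right) + J (W - 2V) \left(\int f r\right)^2 + 2 U \left(\int f \varphi'' r - J \int f r\right) - 2 W \left(\int f r\right) \left(\int f \varphi'' r\right),$$ which is exactly the same quantity as the one obtained in [3]. The goal now is to optimize over $$r$$ to make this quantity as negative as possible.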
Solving the above optimization problem is exactly the content of [3, Section 2.4] and it yields the following bound:

$$e_1^{\top} I(h_t) e_1 - 1 \leq \left[ 1 - \frac{c U^2}{c W + 2 V} \right] (J - 1),$$

which is exactly the claimed bound in (8).

3.3 Using Theorem 3

Throughout this section, we will assume $$d\ge n$$, to have cleaner expressions for some of the error bounds. Given (5) we want to apply Theorem 3 with $$A = (\mathbb{X} \mathbb{X}^{\top})^{-1/2} \mathbb{X}$$ (also observe that the spectral gap assumption of Theorem 3 is satisfied since log-concavity and isotropy of $$\mu$$ imply that $$\mu$$ has a spectral gap in $$[1/12,1]$$, [7]). In particular, we have $$A^{\top} A = \mathbb{X}^{\top} (\mathbb{X} \mathbb{X}^{\top})^{-1} \mathbb{X}$$, and thus denoting by $$\mathbb{X}_i \in \mathbb{R}^n$$ the $$i^{th}$$ column of $$\mathbb{X}$$ one has for any $$i, j \in [d]$$,

$$(A^{\top} A)_{i,j} = \mathbb{X}_i^{\top} (\mathbb{X} \mathbb{X}^{\top})^{-1} \mathbb{X}_j = \frac{1}{d} \mathbb{X}_i^{\top} \left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right)^{-1} \mathbb{X}_j.$$

In particular this yields:

$$|(A^{\top} A)_{i,j}| \leq \frac{1}{d} |\mathbb{X}_i^{\top} \mathbb{X}_j| + \frac{1}{d} |\mathbb{X}_i| \cdot |\mathbb{X}_j| \cdot \left\Vert \left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right)^{-1} - \mathrm{I}_n \right\Vert,$$

where $$\Vert\cdot\Vert$$ denotes the operator norm.

We now recall two important results on log-concave random vectors. (More classical inequalities could also be used here since the entries of $$\mathbb{X}$$ are independent. This would slightly improve the logarithmic factors but it would obscure the main message of this section so we decided to use the more general inequalities for log-concave vectors.) First, Paouris’ inequality ([17], [14, Theorem 2]) states that for an isotropic, centered, log-concave random variable $$Y \in \mathbb{R}^n$$ one has for any $$t \geq C$$,

$$\mathbb{P}\left( |Y| \geq (1+t) \sqrt{n} \right) \leq \exp(- c t \sqrt{n}), \tag{11}$$

where $$c, C$$ are universal constants. We also need an inequality proved by Adamczak et al. [1, Theorem 4.1] which states that for a sequence $$Y_1, \ldots, Y_d \in \mathbb{R}^n$$ of i.i.d. copies of $$Y$$, one has for any $$t \geq 1$$ and $$\epsilon \in (0,1)$$,

$$\mathbb{P}\left( \left\Vert \frac{1}{d} \sum_{i=1}^d Y_i Y_i^{\top} - \mathrm{I}_n \right\Vert > \epsilon \right) \leq \exp(- c t \sqrt{n}), \tag{12}$$

provided that $$d \geq C \frac{t^4}{\epsilon^2} \log^2\left(2 \frac{t^2}{\epsilon^2} \right) n$$.
Paouris’ inequality (11) directly yields that for any $$i \in [d]$$, with probability at least $$1-\delta$$, one has

$$|\mathbb{X}_i| \leq \sqrt{n} + \frac{1}{c} \log(1/\delta).$$

Furthermore, by a well known consequence of Prékopa-Leindler’s inequality, conditionally on $$\mathbb{X}_j$$ one has for $$i \neq j$$ that $$\mathbb{X}_i^{\top} \frac{\mathbb{X}_j}{|\mathbb{X}_j|}$$ is a centered, isotropic, log-concave random variable. In particular using (11) and independence of $$\mathbb{X}_i$$ and $$\mathbb{X}_j$$ one obtains that for $$i \neq j$$, with probability at least $$1-\delta$$,

$$|\mathbb{X}_i^{\top} \mathbb{X}_j| \leq |\mathbb{X}_j| \left( 1 + \frac{1}{c} \log(1/\delta) \right).$$

To use (12) we plug in $$t=C \log(1/\delta)/\sqrt{n}$$ for a suitable constant $$C$$ so that $$\exp( - c t \sqrt{n} )$$ is at most $$\delta$$. Thus $$\epsilon$$ needs to be such that $$d \geq C \frac{t^4}{\epsilon^2} \log^2\left(2 \frac{t^2}{\epsilon^2} \right) n$$. Also, without loss of generality, by possibly choosing the value of $$t$$ to be a constant times larger, we can assume that $$\frac{d}{t^2n}$$ lies outside a fixed interval containing $$1$$. A suitable value of $$\epsilon$$ can now be seen from the following string of inequalities, in which we use the fact that $$x\log^2x$$ is increasing outside a neighborhood of $$1$$ (the value of the constant $$C$$ will change from line to line):

$$\frac{d}{t^2 n} \geq C \frac{t^2}{\epsilon^2} \log^2\left( 2 \frac{t^2}{\epsilon^2} \right), \quad \text{if} \quad \frac{\epsilon^2}{t^2} \geq C \log^2\left( \frac{d}{t^2 n} \right) \frac{t^2 n}{d}, \quad \text{if} \quad \epsilon \geq C t^2 \sqrt{\frac{n}{d}} \left| \log\left( \frac{d}{t^2 n} \right) \right|.$$

Plugging in our choice of $$t=C \log(1/\delta)/\sqrt{n}$$, we see that any

$$\epsilon \geq C \frac{1}{\sqrt{dn}} \log^2(1/\delta) \left[ \log(d) + \log \log(1/\delta) \right]$$

works. Thus with probability at least $$1-\delta$$,

$$\left\Vert \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} - \mathrm{I}_n \right\Vert \leq C' \frac{1}{\sqrt{dn}} \log^2(1/\delta) \left[ \log(d) + \log \log(1/\delta) \right]. \tag{13}$$

If $$\Vert A - \mathrm{I}_n \Vert \leq \epsilon < 1$$ then $$\Vert A^{-1} - \mathrm{I}_n \Vert \leq \frac{\epsilon}{1-\epsilon}$$. From now on $$C$$ denotes a universal constant whose value can change at each occurrence.
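Two of the steps above are deterministic and can be checked numerically on any realization of $$\mathbb{X}$$: the entrywise bound on $$(A^{\top} A)_{i,j}$$ and the implication $$\Vert S - \mathrm{I}_n \Vert \leq \epsilon < 1 \Rightarrow \Vert S^{-1} - \mathrm{I}_n \Vert \leq \epsilon/(1-\epsilon)$$. The sketch below does this with NumPy; the Gaussian entries, the sizes $$n=5$$, $$d=400$$, and the function name `gram_checks` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gram_checks(n=5, d=400, seed=0):
    """Return (eps, inv_dev, B, bound) for a Gaussian n x d matrix X.

    eps     = || (1/d) X X^T - I_n ||            (operator norm)
    inv_dev = || ((1/d) X X^T)^{-1} - I_n ||     (operator norm)
    B       = A^T A = X^T (X X^T)^{-1} X
    bound   = entrywise upper bound on |B| from the text
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    S = X @ X.T / d
    S_inv = np.linalg.inv(S)
    I = np.eye(n)
    eps = np.linalg.norm(S - I, 2)
    inv_dev = np.linalg.norm(S_inv - I, 2)
    B = X.T @ S_inv @ X / d
    col_norms = np.linalg.norm(X, axis=0)      # |X_i| for each column X_i
    bound = (np.abs(X.T @ X) + np.outer(col_norms, col_norms) * inv_dev) / d
    return eps, inv_dev, B, bound

if __name__ == "__main__":
    eps, inv_dev, B, bound = gram_checks()
    print("eps:", eps, "inverse deviation:", inv_dev)
    print("entrywise bound holds:", bool(np.all(np.abs(B) <= bound + 1e-12)))
```

The probabilistic content of the argument, namely how small these quantities are with high probability, is not tested here; the sketch only confirms the two algebraic inequalities on one sample.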
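Putting together all of the above with a union bound, we obtain for $$d \geq C n^2$$ that with probability at least $$1-1/d$$, simultaneously for all $$i \neq j$$,

$$|\mathbb{X}_i| \leq C \left( \sqrt{n} + \log(d) \right), \qquad |\mathbb{X}_i^{\top} \mathbb{X}_j| \leq C \left( \sqrt{n} \log(d) + \log^2(d) \right), \qquad \left\Vert \left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right)^{-1} - \mathrm{I}_n \right\Vert \leq C \frac{1}{\sqrt{n}},$$

where the last inequality follows from (13) by plugging in $$\delta=1/d$$ and using the fact that $$\log^4(d)=o(\sqrt{d})$$. This yields (using the bound on $$|(A^{\top} A)_{i,j}|$$ derived above) that with probability at least $$1-\frac{1}{d}$$, simultaneously for all $$i \neq j$$,

$$|(A^{\top} A)_{i,j}| \leq C \frac{\sqrt{n} \log(d) + \log^2(d)}{d},$$

and

$$|(A^{\top} A)_{i,i}| \leq C \frac{n + \log^2(d)}{d}.$$

Thus denoting $$\epsilon = \max_{i\in [d]} (A^{\top} A)_{i,i}$$ and $$\zeta = \max_{i,j \in [d], i \neq j} |(A^{\top} A)_{i,j}|$$ one has:

$$\mathbb{E} \min(\epsilon + \zeta^2 d, 1) \leq C \frac{n \log^2(d) + \log^4(d)}{d}.$$

By Theorem 3, this bounds one of the terms in the upper bound in (5). Thus to complete the proof of Theorem 2 all that is left to do is bound the term $$\mathbb{E}_{\mathbb{X}}[ -\mathrm{logdet} (\frac{1}{d} \mathbb{X} \mathbb{X}^{\top})]$$. That is the goal of the next section.

4 Small ball probability estimates

Lemma 2. There exists a universal constant $$C>0$$ such that for $$d \geq C n^2$$,

$$\mathbb{E}\left( - \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \right) \leq C \left( \sqrt{\frac{n}{d}} + \frac{n^2}{d} \right). \tag{14}$$

Proof. We decompose this expectation on the event (and its complement) that the smallest eigenvalue $$\lambda_{\mathrm{min}}$$ of $$\frac{1}{d} \mathbb{X} \mathbb{X}^{\top}$$ is less than $$1/2$$. We first write, using $$-\log(x) \leq 1-x + 2(1-x)^2$$ for $$x \geq 1/2$$,

$$\mathbb{E}\left( - \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \mathbb{1}\{\lambda_{\mathrm{min}} \geq 1/2\} \right) \leq \mathbb{E}\left( \left| \mathrm{Tr}\left( \mathrm{I}_n - \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \right| + 2 \left\Vert \mathrm{I}_n - \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right\Vert_{\mathrm{HS}}^2 \right),$$

where $$\Vert\cdot\Vert_{\mathrm{HS}}$$ denotes the Hilbert-Schmidt norm. Denote by $$\zeta$$ the $$4^{th}$$ moment of $$\mu$$. Then one has (recall that $$X_i \in \mathbb{R}^d$$ denotes the $$i^{th}$$ row of $$\mathbb{X}$$),

$$\mathbb{E} \left| \mathrm{Tr}\left( \mathrm{I}_n - \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \right| \leq \sqrt{\mathbb{E} \left( \mathrm{Tr}\left( \mathrm{I}_n - \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \right)^2} = \sqrt{\mathbb{E} \left( \sum_{i=1}^n \left( 1 - |X_i|^2/d \right) \right)^2} = \sqrt{(\zeta - 1) \frac{n}{d}}.$$

Similarly one can easily check that (using $$\zeta \geq 1$$ in the last step)

$$\mathbb{E} \left\Vert \mathrm{I}_n - \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right\Vert_{\mathrm{HS}}^2 = \left( \sum_{i,j=1}^n \frac{1}{d^2} \mathbb{E} \langle X_i, X_j \rangle^2 \right) - n = \frac{n^2 - n}{d} + \frac{n}{d} (\zeta - 1) \leq \zeta \frac{n^2}{d}.$$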
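The scalar inequality $$-\log(x) \leq 1-x + 2(1-x)^2$$ for $$x \geq 1/2$$, and the resulting bound on $$-\mathrm{logdet}$$ by a trace plus a Hilbert-Schmidt term, can be checked numerically. The following is a minimal sketch assuming NumPy; `scalar_gap` and `neg_logdet_bound` are hypothetical helper names introduced only for this illustration.

```python
import numpy as np

def scalar_gap(x):
    """Return 1 - x + 2(1 - x)^2 + log(x); nonnegative for x >= 1/2."""
    x = np.asarray(x, dtype=float)
    return 1.0 - x + 2.0 * (1.0 - x) ** 2 + np.log(x)

def neg_logdet_bound(S):
    """Return (-logdet S, Tr(I - S) + 2 ||I - S||_HS^2) for symmetric positive definite S."""
    n = S.shape[0]
    D = np.eye(n) - S
    _, logdet = np.linalg.slogdet(S)
    return -logdet, np.trace(D) + 2.0 * np.sum(D**2)

if __name__ == "__main__":
    xs = np.linspace(0.5, 10.0, 2000)
    print("min gap on [1/2, 10]:", scalar_gap(xs).min())
    # A symmetric matrix with prescribed eigenvalues, all at least 1/2.
    Q, _ = np.linalg.qr(np.arange(16.0).reshape(4, 4) + np.eye(4))
    S = Q @ np.diag([0.5, 0.8, 1.0, 1.7]) @ Q.T
    print(neg_logdet_bound(S))
```

The gap function is convex on $$[1/2, \infty)$$ with a minimum value of $$0$$ at $$x = 1$$, which is why applying the scalar inequality to each eigenvalue gives the matrix bound whenever $$\lambda_{\mathrm{min}} \geq 1/2$$.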
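By log-concavity of $$\mu$$ one has $$\zeta \leq 70$$, and thus we proved (for some universal constant $$C>0$$):

$$\mathbb{E}\left( - \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \mathbb{1}\{\lambda_{\mathrm{min}} \geq 1/2\} \right) \leq C \left( \sqrt{\frac{n}{d}} + \frac{n^2}{d} \right). \tag{15}$$

We now take care of the integral on the event $$\{\lambda_{\mathrm{min}} < 1/2\}$$. First observe that for a large enough constant $$C>0$$, (13) gives for $$d \geq C$$, $$\mathbb{P}(\lambda_{\mathrm{min}} < 1/2)\leq \exp(- d^{1/10})$$. In particular, we have for any $$\xi \in (0,1)$$:

$$\begin{aligned} \mathbb{E}\left( - \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \mathbb{1}\{\lambda_{\mathrm{min}} < 1/2\} \right) &\leq n \, \mathbb{E}\left( - \log(\lambda_{\mathrm{min}}) \mathbb{1}\{\lambda_{\mathrm{min}} < 1/2\} \right) \\ &= n \int_{\log(2)}^{\infty} \mathbb{P}\left( - \log(\lambda_{\mathrm{min}}) \geq t \right) dt \\ &= n \int_0^{1/2} \frac{1}{s} \mathbb{P}(\lambda_{\mathrm{min}} < s) \, ds \\ &\leq \frac{n}{\xi} \exp(- d^{1/10}) + n \int_0^{\xi} \frac{1}{s} \mathbb{P}(\lambda_{\mathrm{min}} < s) \, ds. \end{aligned} \tag{16}$$

We will choose $$\xi$$ to be a suitable power of $$d$$ and the proof will be complete once we control $$\mathbb{P}(\lambda_{\mathrm{min}} < s)$$ for $$s\le \xi$$. This essentially boils down to the estimation of certain small ball probabilities.

We proceed by bounding the maximum eigenvalue, $$\lambda_{\mathrm{max}}$$, using a standard net argument. For any $$\epsilon$$-net $$\mathcal{N}_{\epsilon}$$ on $$\mathbb{S}^{n-1}$$,

$$\lambda_{\mathrm{max}} = \sup_{\theta \in \mathbb{S}^{n-1}} \theta^{\top} \frac{\mathbb{X} \mathbb{X}^{\top}}{d} \theta \leq \frac{1}{(1-\epsilon)^2} \sup_{\theta \in \mathcal{N}_{\epsilon}} \theta^{\top} \frac{\mathbb{X} \mathbb{X}^{\top}}{d} \theta.$$

Choosing $$\epsilon=1/2$$ gives $$|\mathcal{N}_{\epsilon}| \le 5^n$$. Putting everything together along with the subexponential tail of isotropic log-concave random variables (see (11)) we get $$\mathbb{P}(\lambda_{\mathrm{max}} > M) \le 5^n \exp(-c \sqrt{M d})$$ (for more details see [20]). Similarly observe,

$$\mathbb{P}(\lambda_{\mathrm{min}} < s) = \mathbb{P}\left( \exists\, \theta \in \mathbb{S}^{n-1} : \theta^{\top} \frac{\mathbb{X} \mathbb{X}^{\top}}{d} \theta < s \right) = \mathbb{P}\left( \exists\, \theta \in \mathbb{S}^{n-1} : |\mathbb{X}^{\top} \theta| < \sqrt{s d} \right).$$

Furthermore, if $$|\frac{1}{\sqrt{d}} \mathbb{X}^{\top} \theta| < \sqrt{s}$$ for some $$\theta \in \mathbb{S}^{n-1}$$, then one has for any $$\varphi \in \mathbb{S}^{n-1}$$, $$|\frac{1}{\sqrt{d}} \mathbb{X}^{\top} \varphi| < \sqrt{s} + \sqrt{\lambda_{\mathrm{max}}} |\theta - \varphi|$$. Thus we get by choosing $$\varphi$$ to be in an $$s$$-net $$\mathcal{N}_{s}$$:

$$\mathbb{P}(\lambda_{\mathrm{min}} < s) \leq \left( \frac{3}{s} \right)^n \sup_{\varphi \in \mathcal{N}_s} \mathbb{P}\left( |\mathbb{X}^{\top} \varphi| < 2 \sqrt{s d} \right) + \mathbb{P}(\lambda_{\mathrm{max}} > 1/s).$$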
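The perturbation step used to pass from an arbitrary $$\theta$$ to a net point $$\varphi$$ is a triangle inequality, and can be illustrated numerically. The sketch below, which assumes NumPy and Gaussian entries and uses the hypothetical function name `perturbation_check`, takes $$\theta$$ to be a minimizing eigenvector (so that $$|\frac{1}{\sqrt{d}} \mathbb{X}^{\top} \theta| = \sqrt{\lambda_{\mathrm{min}}}$$) and compares both sides of the inequality over random unit vectors $$\varphi$$.

```python
import numpy as np

def perturbation_check(n=4, d=50, seed=0, trials=100):
    """Return (lam_min, lam_max, worst), where worst is the largest observed value of
    |X^T phi| / sqrt(d) - (sqrt(lam_min) + sqrt(lam_max) * |theta - phi|)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    evals, evecs = np.linalg.eigh(X @ X.T / d)   # eigenvalues in ascending order
    lam_min, lam_max = evals[0], evals[-1]
    theta = evecs[:, 0]                          # unit eigenvector for lam_min
    worst = -np.inf
    for _ in range(trials):
        phi = rng.standard_normal(n)
        phi /= np.linalg.norm(phi)               # random point on the unit sphere
        lhs = np.linalg.norm(X.T @ phi) / np.sqrt(d)
        rhs = np.sqrt(lam_min) + np.sqrt(lam_max) * np.linalg.norm(theta - phi)
        worst = max(worst, lhs - rhs)
    return lam_min, lam_max, worst

if __name__ == "__main__":
    print(perturbation_check())
```

A non-positive value of `worst` (up to rounding) is what the inequality predicts; the union bound over the net and the probability estimates themselves are not simulated here.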
We now use the Paouris small ball probability bound [14, Theorem 2] (see also [18]), which states that for an isotropic, centered, log-concave random variable $$Y \in \mathbb{R}^d$$, and any $$\epsilon \in (0,1/10)$$, one has

$$\mathbb{P}\left( |Y| \leq \epsilon \sqrt{d} \right) \leq (c \epsilon)^{\sqrt{d}},$$

for some universal constant $$c>0$$. As $$\mathbb{X}^{\top} \varphi$$ is an isotropic, centered, log-concave random variable, we obtain for $$d \geq C n^2$$,

$$\mathbb{P}(\lambda_{\mathrm{min}} < s) \leq (c s)^{C \sqrt{d}} + \exp(- C / \sqrt{s}).$$

Finally, plugging this back in (16) and choosing $$\xi$$ to be a suitable negative power of $$d$$, we obtain for $$d \geq C n^2$$,

$$\mathbb{E}\left( - \mathrm{logdet}\left( \frac{1}{d} \mathbb{X} \mathbb{X}^{\top} \right) \mathbb{1}\{\lambda_{\mathrm{min}} < 1/2\} \right) \leq n \exp(- d^{1/20}),$$

and thus together with (15) it yields (14). ■

5 Proof of (9)

Recall that $$Z \in \mathbb{R}^d$$ is a random variable with a twice continuously differentiable density $$w$$ such that $$\int \frac{|\nabla w|^2}{w} < \infty$$ and $$\int \Vert \nabla^2 w \Vert < \infty$$, and that $$h$$ is the density of $$A Z \in \mathbb{R}^n$$ (with $$A A^{\top} = \mathrm{I}_n$$). We also fix $$e\in \mathbb{R}^n$$ and a sufficiently smooth map $$p : \mathbb{R}^d \rightarrow \mathbb{R}^d$$ with $$A p(x) = e, \forall x \in \mathbb{R}^d$$. (For instance it is enough that $$p$$ is twice continuously differentiable, and that the coordinate functions $$p_i$$ and their derivatives $$\frac{\partial p_i}{\partial x_i}$$, $$\frac{\partial p_i}{\partial x_j}$$, $$\frac{\partial^2 p_i}{\partial x_i \partial x_j}$$ are bounded.) We want to prove:

$$e^{\top} I(h) e \leq \int_{\mathbb{R}^d} \left( \mathrm{Tr}(D p^2) + p^{\top} \nabla^2 (-\log w) p \right) w. \tag{17}$$

First we rewrite the right hand side in (17) as follows:

$$\int_{\mathbb{R}^d} \left( \mathrm{Tr}(D p^2) + p^{\top} \nabla^2 (-\log w) p \right) w = \int_{\mathbb{R}^d} \frac{(\nabla \cdot (p w))^2}{w}.$$

The above identity is a straightforward calculation (with several applications of the one-dimensional integration by parts, which are justified by the assumptions on $$p$$ and $$w$$), see [6] for more details. Now we rewrite the left hand side of (17). Using the notation $$g_x$$ for the partial derivative of a function $$g$$ in the direction $$x$$, we have

$$e^{\top} I(h) e = \int_{\mathbb{R}^n} \frac{h_e^2}{h}.$$
Next observe that for any $$x \in \mathbb{R}^n$$ one can write $$h(x) = \int_{E^{\perp}} w(A^{\top} x + \cdot)$$ where $$E \subset \mathbb{R}^d$$ is the $$n$$-dimensional subspace generated by the orthonormal rows of $$A$$, and thus thanks to the assumptions on $$w$$ one has:

$$h_e(x) = \int_{A^{\top} x + E^{\perp}} w_{A^{\top} e} = \int_{A^{\top} x + E^{\perp}} \nabla \cdot \left( (A^{\top} e) w \right).$$

The key step is now to remark that the condition $$\forall x, A p(x) = e$$ exactly means that the projection of $$p$$ on $$E$$ is $$A^{\top} e$$, and thus by the Divergence Theorem one has

$$\int_{A^{\top} x + E^{\perp}} \nabla \cdot \left( (A^{\top} e) w \right) = \int_{A^{\top} x + E^{\perp}} \nabla \cdot (p w).$$

The proof is concluded with a simple Cauchy–Schwarz inequality:

$$e^{\top} I(h) e = \int_{\mathbb{R}^n} \frac{\left( \int_{A^{\top} x + E^{\perp}} \nabla \cdot (p w) \right)^2}{\int_{A^{\top} x + E^{\perp}} w} \leq \int_{\mathbb{R}^n} \int_{A^{\top} x + E^{\perp}} \frac{(\nabla \cdot (p w))^2}{w} = \int_{\mathbb{R}^d} \frac{(\nabla \cdot (p w))^2}{w}.$$

6 Open problems

This work leaves many questions open. A basic question is whether one could get away with weaker independence assumptions on the matrix $$\mathbb{X}$$. Indeed several of the estimates in Sections 3 and 4 would work under the assumption that the rows (or the columns) of $$\mathbb{X}$$ are i.i.d. from a log-concave distribution in $$\mathbb{R}^d$$ (or $$\mathbb{R}^n$$). However, it seems that the core of the proof, namely the induction argument from Section 2, breaks without the independence assumption for the entries of $$\mathbb{X}$$. Thus it remains open whether Theorem 1 is true with only row (or column) independence for $$\mathbb{X}$$. The case of row independence is probably much harder than column independence.

As we observed in Section 1.2, a natural alternative route to prove Theorem 1 (or possibly a variant of it with a different metric) would be to use Stein’s method. A straightforward application of existing results yields the suboptimal dimension dependency $$d \gg n^6$$ for convergence, and it is an intriguing open problem whether the optimal rate $$d \gg n^3$$ can be obtained with Stein’s method.
In this article, we consider Wishart matrices with zeroed out diagonal elements in order to avoid further technical difficulties (also, for many applications, such as the random geometric graph example, the diagonal elements do not contain relevant information). We believe that Theorem 1 remains true with the diagonal included (given an appropriate modification of the Gaussian ensemble). The main difficulty is that in the chain rule argument one will have to deal with the law of the diagonal elements conditionally on the other entries. We leave this to future work, but when $$\mu$$ is the standard Gaussian, it is easy to conclude the calculations with these conditional laws.

In [13], it is proven that when $$\mu$$ is a standard Gaussian and $$d/n \rightarrow +\infty$$, one has $$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{W}_{n,d+1}(\mu)) \rightarrow 0$$. It seems conceivable that the techniques developed in this article could be useful to prove such a result for a more general class of distributions $$\mu$$. However, a major obstacle is that the tools from Section 3 are strongly tied to measuring the relative entropy with respect to a standard Gaussian (because it maximizes the entropy), and it is not clear at all how to adapt this part of the proof.

Finally, one may be interested in understanding a CLT of the form (1) for higher-order interactions. More precisely, recall that by denoting by $$\mathbb{X}_i$$ the $$i^{th}$$ column of $$\mathbb{X}$$ one can write $$\mathbb{X} \mathbb{X}^{\top} = \sum_{i=1}^d \mathbb{X}_i \otimes \mathbb{X}_i = \sum_{i=1}^d \mathbb{X}_i^{\otimes 2}$$. For $$p \in \mathbb{N}$$, we may now consider the distribution $$\mathcal{W}_{n,d}^{(p)}$$ of $$\frac{1}{\sqrt{d}} \sum_{i=1}^d \mathbb{X}_i^{\otimes p}$$ (for the sake of consistency we should remove the non-principal terms in this tensor). The measures $$\mathcal{W}_{n,d}^{(p)}$$ have recently gained interest in the machine learning community, see [2].
It would be interesting to see if the method described in this article can be used to understand how large $$d$$ needs to be as a function of $$n$$ and $$p$$ so that $$\mathcal{W}_{n,d}^{(p)}$$ is close to being a Gaussian distribution.

Acknowledgments

We are grateful to the anonymous referees for the suggestions that helped improve the article. This work was completed while S.G. was an intern at Microsoft Research in Redmond. He thanks the Theory group for its hospitality.

References

[1] Adamczak R., Litvak A., Pajor A., and Tomczak-Jaegermann N. “Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles.” Journal of the American Mathematical Society 23, no. 2 (2010): 535–61.

[2] Anandkumar A., Ge R., Hsu D., Kakade S. M., and Telgarsky M. “Tensor decompositions for learning latent variable models.” Journal of Machine Learning Research 15 (2014): 2773–832.

[3] Artstein S., Ball K. M., Barthe F., and Naor A. “On the rate of convergence in the entropic central limit theorem.” Probability Theory and Related Fields 129, no. 3 (2004): 381–90.

[4] Aubrun G., Szarek S. J., and Ye D. “Entanglement thresholds for random induced states.” Communications on Pure and Applied Mathematics 67, no. 1 (2014): 129–71.

[5] Ball K. and Nguyen V. H. “Entropy jumps for log-concave isotropic random vectors and spectral gap.” Studia Mathematica 213, no. 1 (2012): 81–96.

[6] Ball K., Barthe F., and Naor A. “Entropy jumps in the presence of a spectral gap.” Duke Mathematical Journal 119, no. 1 (2003): 41–63.

[7] Bobkov S. “Isoperimetric and analytic inequalities for log-concave probability measures.” Annals of Probability 27, no. 4 (1999): 1903–21.

[8] Bubeck S., Ding J., Eldan R., and Rácz M. Z. “Testing for high-dimensional geometry in random graphs.” Random Structures and Algorithms 49, no. 3 (2016): 503–32.

[9] Chatterjee S. “A short survey of Stein’s method.” Proceedings of ICM, Seoul, Volume IV, 2014, 1–24.

[10] Chatterjee S. and Meckes E. “Multivariate normal approximation using exchangeable pairs.” ALEA 4 (2008): 257–83.

[11] Cover T. M. and Thomas J. A. Elements of Information Theory. Wiley-Interscience, 1991.

[12] Edelman A. “Eigenvalues and condition numbers of random matrices.” SIAM Journal on Matrix Analysis and Applications 9, no. 4 (1988): 543–60.

[13] Eldan R. “An efficiency upper bound for inverse covariance estimation.” Israel Journal of Mathematics 207, no. 1 (2015): 1–9.

[14] Guédon O. “Concentration phenomena in high dimensional geometry.” In ESAIM: Proceedings, vol. 44, 47–60. EDP Sciences, 2014.

[15] Jiang T. and Li D. “Approximation of rectangular beta-Laguerre ensembles and large deviations.” Journal of Theoretical Probability (2013): 1–44.

[16] Johnstone I. M. “On the distribution of the largest eigenvalue in principal components analysis.” The Annals of Statistics 29, no. 2 (2001): 295–327.

[17] Paouris G. “Concentration of mass on convex bodies.” Geometric & Functional Analysis GAFA 16, no. 5 (2006): 1021–49.

[18] Paouris G. “Small ball probability estimates for log-concave measures.” Transactions of the American Mathematical Society 364, no. 1 (2012): 287–308.

[19] Rudelson M. and Vershynin R. “The Littlewood–Offord problem and invertibility of random matrices.” Advances in Mathematics 218, no. 2 (2008): 600–33.

[20] Rudelson M. and Vershynin R. “Non-asymptotic theory of random matrices: extreme singular values.” Proceedings of the International Congress of Mathematicians, ICM2010 (2010): 1576–602.

[21] Sankar A., Spielman D. A., and Teng S.-H. “Smoothed analysis of the condition numbers and growth factors of matrices.” SIAM Journal on Matrix Analysis and Applications 28, no. 2 (2006): 446–76.

[22] Stein C. “Approximate computation of expectations.” Lecture Notes-Monograph Series 7 (1986): i–164.

Communicated by Prof. Assaf Naor

© The Author(s) 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permission@oup.com.

Journal

International Mathematics Research Notices, Oxford University Press

Published: Jan 1, 2018
