# Sketching for large-scale learning of mixture models

Sketching for large-scale learning of mixture models
Keriven, Nicolas;Bourrier, Anthony;Gribonval, Rémi;Pérez, Patrick
2017-12-22 00:00:00
Abstract Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a ‘compressive learning’ framework, where we estimate model parameters from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass on the training set and is easily computable on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims at drastically reducing the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian mixture model (GMM), providing heuristics on the choice of the sketching procedure and theoretical guarantees of reconstruction. We experimentally show on synthetic data that the proposed algorithm yields results comparable to the classical expectation-maximization technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over $$10^{8}$$ training samples) for the task of model-based speaker verification. Finally, we draw some connections between the proposed framework and approximate Hilbert space embedding of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive preliminary information preservation guarantees, in the spirit of infinite-dimensional compressive sensing. 1. Introduction An essential challenge in machine learning is to develop scalable techniques able to cope with large-scale training collections comprised of numerous elements of high dimension. To achieve this goal, it is necessary to come up with approximate learning schemes that perform the learning task with a fair precision while reducing the memory and computational requirements compared with the standard techniques. A possible way to achieve such savings is to compress data beforehand, as illustrated in Fig. 1. Instead of the more classical individual compression of each element of the database with random projections [2,24,53,72,84], we consider in this article a framework corresponding to the top right scheme: a large collection of training data is compressed into a fixed-size representation called sketch. The dimension of the sketch—and therefore the cost of infering/learning the parameters of interest from this sketch—does not depend on the number of data items in the initial collection. A complementary path to handling large-scale collections is online learning [30]. Sketching, which leverages ideas originating from streaming algorithms [40], can trivially be turned into an online version and is amenable to distributed computing. Fig. 1. View largeDownload slide Paths to compressive learning. The training data is compressed into a smaller representation, typically through random projections. This can either consist of reducing the dimensions of each individual entry (left bottom) or in computing a more global compressed representation of the data, called sketch (top right). Parameters are then inferred from such a compressed representation by a learning algorithm adapted to it. The proposed approach explores the second sketch-based option. Fig. 1. View largeDownload slide Paths to compressive learning. The training data is compressed into a smaller representation, typically through random projections. This can either consist of reducing the dimensions of each individual entry (left bottom) or in computing a more global compressed representation of the data, called sketch (top right). Parameters are then inferred from such a compressed representation by a learning algorithm adapted to it. The proposed approach explores the second sketch-based option. 1.1 From compressive sensing to sketched learning As we will see, compressing a training collection into a sketch before learning is reminiscent of—and indeed inspired by—compressive sensing (CS) [52] and streaming algorithms [39,40]. The main goal of CS is to find a dimensionality-reducing linear operator $${\mathbf{M}}$$ such that certain high-dimensional vectors (or signals) can be reconstructed from their observations by $${\mathbf{M}}$$. Initial work on CS [29,46] showed that such a reconstruction is possible for $$k$$-sparse signals of dimension $$d$$; from only $$\mathcal{O}(k)$$ linear measurements by (theoretically) solving an intractable NP-complete problem [52, Chapter 2], and in practice from $$\mathcal{O}(k\ln(d))$$ linear measurements by solving a convex relaxation of this problem. Matrices $${\mathbf{M}}$$ with such reconstruction guarantees can be obtained as typical draws of certain random matrix ensembles. This reconstruction paradigm from few random measurements has subsequently been considered and proved to work for more general signal models [17]. Examples of such models include low-rank matrices [25], cosparse vectors [75] and dictionary models [26]. Reconstruction from compressive measurements for these models is made possible, among other properties, by the fact that they correspond to unions of subspaces [12], which have a much lower dimension than the ambient dimension. Low-dimensional models also intervene in learning procedures, where one aims at fitting a model of moderate ‘dimension’ to some training data $$\{{\mathbf{x}}_{1} \ldots {\mathbf{x}}_{n}\} \subset X$$ to prevent overfitting and ensure good generalization properties. In this article, we consider mixture models comprising probability distributions on the set $$X$$ of the form \begin{equation}\label{eq:MixtureModel} P = \sum_{k=1}^K \alpha_k P_k, \end{equation} (1.1) where the $$P_k$$s are probability distributions taken in a certain set and the $$\alpha_k$$s, $$\alpha_{k}\geq 0$$, $$\sum_{k}\alpha_{k}=1$$, are the weights of the mixture. Such a model can be considered as a generalized sparse model in the linear space $$E$$ of finite measures over the set $$X$$. Similarly to compressive sensing, one can define a linear compressive operator $$\mathcal{A}: E\rightarrow \mathbb{C}^m$$, which computes generalized moments [60] of a measure $$\nu$$: \begin{equation}\label{eq:generalized_moments} \mathcal{A}: \nu \mapsto \mathcal{A}\nu := \frac{1}{\sqrt{m}} \left[\int_{X} M_1 \,{\rm d}\nu, \ldots , \int_{X} M_m \,{\rm d}\nu\right]\!, \end{equation} (1.2) where the $$M_j$$s are well-chosen functions on $$X$$ and the constant $$\frac{1}{\sqrt{m}}$$ is used for normalization purposes. In the case where $$\nu$$ is a probability measure $$P$$, the integrals are the expectations of $$M_j({\mathbf{x}})$$ with $${\mathbf{x}}\sim P$$. Given some training data $$\mathcal{X}=\left\{{\mathbf{x}}_1,\ldots,{\mathbf{x}}_n\right\}$$ drawn from $$X$$, the corresponding empirical distribution is \begin{equation}\label{eq:empirical_sketch} \hat{P}=\frac{1}{n}\sum_{i=1}^n \delta_{{\mathbf{x}}_i}, \end{equation} (1.3) where $$\delta_{{\mathbf{x}}_i}$$ is a unit mass at $${\mathbf{x}}_i$$. A practical sketch of the data can then be defined1 and computed as \begin{equation}\label{eq:EmpSketch} \hat{\mathbf{z}} = \mathcal{A} \hat{P}=\frac{1}{n\sqrt{m}}\left[\sum_{i=1}^n M_j({\mathbf{x}}_i)\right]_{j=1,...,m}\!. \end{equation} (1.4) Denoting $${\it{\Sigma}} \subset E$$ the set of probability distributions satisfying (1.1), fitting a probability mixture to the training collection $$\mathcal{X}$$ in a compressive fashion can be expressed as the following optimization problem \begin{equation}\label{eq:problemset} \underset{P\in {\it{\Sigma}}}{\text{argmin}}~\left\lVert{\hat{\mathbf{z}}-\mathcal{A} P}\right\rVert_2\!, \end{equation} (1.5) which corresponds to the search for the probability mixture in the model set $${\it{\Sigma}}$$ whose sketch is closest to the empirical data sketch $$\hat{\mathbf{z}}$$. By analogy with sparse reconstruction, we propose an iterative greedy reconstruction algorithm to empirically address this problem and exemplify our framework on the estimation of GMMs. 1.2 Related works A large set of the existing literature on random projections for dimension reduction in the context of learning focuses on the scheme represented on the bottom left of Fig. 1: each item of the training collection is individually compressed with random projections [2,53] prior to learning for classification [24,84] or regression [72] or to fitting a Gaussian Mixture Model [43]. In contrast, we consider here a framework, where the whole training collection is compressed to a fixed-size sketch, corresponding to the top right scheme in Fig. 1. This framework builds on work initiated in [19,20]. Compared to [19,20], the algorithms we propose here: (a) are inspired by orthogonal matching pursuits (OMP) rather than iterative hard thresholding (IHT), (b) can handle GMMs with arbitrary diagonal variances, (c) yield much better empirical performance (see Section 5 for a thorough experimental comparison). Our approach bears connections with the generalized method of moments (GeMM) [60], where parameters are estimated by matching empirical generalized moments computed from the data with theoretical generalized moments of the distribution. Typically used in practice when the considered probability models do not yield explicit likelihoods, the GeMM also provides mathematical tools to study the identifiability and learnability of parametric models such as GMMs [9,10,63]. Using the empirical characteristic function is a natural way of obtaining moment conditions [49,50,103]. Following developments of GeMM with a continuum of moments instead of a finite number of them [31], powerful estimators can be derived when the characteristic function is available at all frequencies simultaneously [32,33,109]. Yet these estimators are rarely implementable in practice. This is naturally connected with the formulation of mixture model estimation as a linear inverse problem. In [11,22], for example, this is addressed by considering a finite and incoherent dictionary of densities and the unknown density is reconstructed from its scalar products with every density of the dictionary. These scalar products can be interpreted as generalized moments of the density of interest. Under these assumptions, the authors provide reconstruction guarantees for their cost functions. In our framework, we consider possibly infinite and even uncountable dictionaries and only collect a limited number of ‘measurements’, much smaller than the number of elements in the dictionary. Sketching is a classical technique in the database literature [40]. A sketch is a fixed-size data structure that is updated with each element of a data stream, allowing one to perform some tasks without keeping the data stored. Applications include the detection of frequent elements, called heavy hitters [38] and simple statistical estimations on the data [55]. The sketches used in these works typically involve quantization steps that we do not perform in our work. We also consider the density estimation problem, which is closer to machine learning than the typical application of sketches. Closer to our work, the authors in [102] consider the estimation of two-dimensional histograms from random linear sketches. Even though this last method is theoretically applicable in higher dimensions, the complexity would grow exponentially with the dimension of the problem. Such a ‘curse of dimensionality’ is also found in [11,22]: the size of the dictionary grows exponentially with the dimension of the data vectors and naturally impacts the cost of the estimation procedure. In our work, we rather consider parametric dictionaries that are described by a finite number of parameters. This enables us to empirically leverage the structure of iterative algorithms from sparse reconstruction and compressive sensing to optimize with respect to these parameters, offering better scalability to higher dimensions. This is reminiscent of generalized moments methods for the reconstruction of measures supported on a finite subset of the real line [44] and can be applied to much more general families of probability measures. The particular sketching operator that we propose to apply on GMMs (see Section 3 and further) is obtained by randomly sampling the (empirical) characteristic function of the distribution of the training data. This can be seen as a combination between two techniques from the reproducing kernel Hilbert space (RKHS) literature, namely embedding of probability distributions in RKHS using a feature map referred to as the ‘mean map’ [16,94,100] and random Fourier features (RFFs) for approximating translation-invariant reproducing kernels [82]. Embedding of probability distributions in RKHS with the mean map has been successfully used for a large variety of tasks, such as two-sample test [57], classification [74] or even performing algebraic operations on distributions [92]. In [95], the metric of the RKHS is combined to a sparsity-promoting penalty to estimate a mixture of distributions selected in a finite family. As mentioned above, if this finite set is defined by a grid in the space of parameters, the method is subject to the curse of dimensionality. Closer to our approach, in [97], the estimation of a mixture model with respect to the metric of the RKHS is considered with a greedy algorithm. The proposed algorithm is, however, designed to approximate the target distribution by a large mixture with many components, resulting in an approximation error that decreases as the number of components increases, while our approach considers (1.1) as a ‘sparse’ combination of a fixed, limited number of components that we aim at identifying. Furthermore, unlike our method that uses RFFs to obtain an efficient algorithm, the algorithm proposed in [97] does not seem to be directly implementable. Many kernel methods can be performed efficiently using finite-dimensional, nonlinear embeddings that approximate the feature map of the RKHS [82,105]. A popular method approximates translation-invariant kernels with RFFs [82]. There has been since many variants of RFFs that are faster [34,69,110], more precise [108] or designed for a different type of kernel [105]. Similar to our sketching operator, structures combining RFFs with mean map embedding of probability distributions have been recently used by the kernel community [15,77,101] to accelerate methods such as classification with the so-called support measure machine [74,77,101] or two-sample test [35,65,78,112]. Our point of view, i.e., that of generalized compressive sensing, is sensibly different: we consider the sketch as a compressed representation of the probability distribution, and demonstrate that it contains enough information to robustly recover the distribution from it, resulting in an effective ‘compressive learning’ alternative to usual mixture estimation algorithms. 1.3 Contributions and outline Our main contributions can be summarized as follows: Inspired by orthogonal matching pursuit (OMP) and its variant OMP with replacement (OMPR), we design in Section 2 an algorithmic framework for general compressive mixture estimation. In the specific context of GMMs, we design in Section 3.2 an algorithm that scales better with the number $$K$$ of mixture components. Inspired by random Fourier sampling for compressive sensing, we consider sketching operators $$\mathcal{A}$$ defined in terms of random sampling of the characteristic function [19,20]. However, we show that the sampling pattern of [19,20] is not adapted in high dimension. In Section 3.3, in the specific case of GMMs, we propose a new heuristic and devise a practical scheme for randomly drawing the frequencies that define $$\mathcal{A}$$. This is empirically demonstrated to yield significantly improved performance in Section 5. We establish in Section 4 a connection between the choice of the proposed sampling pattern and the design of a reproducing kernel on probability distributions. Compared to more traditional strategies that can be found in the literature for kernel design [58,77], our heuristic is relatively simpler, faster to perform and fully unsupervised. Adapting it beyond the current context of (nearly isotropic) mixture fitting would be an interesting challenge. Extensive tests on synthetic data in Section 5 demonstrate that our approach matches the estimation precision of a state-of-the-art C++ implementation of the expectation–maximization (EM) algorithm while enabling significant savings in time and memory. In the context of hypothesis testing-based speaker verification, we also report in Section 6 results on real data, where we exploit a corpus of 1000 h of speech at scales inaccessible to traditional methods, and match using a very limited number of measurements the results obtained with EM. We provide preliminary theoretical results (Theorem 7.1 in Section 7) on the information preservation guarantees of the sketching operator. The proofs of these results (Appendices B and C) introduce a new variant of the restricted isometry property (RIP) [28], connected here to kernel mean embedding and Random Features. Compared to usual guarantees in the GeMM literature, our results have less of a ‘statistics’ flavor and more of a ‘signal processing’ one, such as robustness to modeling error, i.e., the true distribution of the data is not exactly a GMM, but only close to a GMM. This preliminary analysis, which is conducted with important restrictions on the GMM parameters, does not permit yet to prove the efficiency of the GMM-learning procedure observed in practice in terms of sketch size, but introduces new connections between compressive sensing, GeMM and kernel methods. It is structured so as to outline important hypotheses that will be examined in the future for a complete theoretical understanding of the method. 2. A compressive mixture estimation framework In classical compressive sensing [52], a signal $${\mathbf{x}} \in \mathbb{R}^d$$ is encoded with a measurement matrix $${\mathbf{M}} \in \mathbb{R}^{m \times d}$$ into a compressed representation $${\mathbf{z}} \in \mathbb{R}^m$$: \begin{align}\label{eq:CS} {\mathbf{z}}={\mathbf{M}}{\mathbf{x}} \end{align} (2.1) and the goal is to recover $${\mathbf{x}}$$ from those linear measurements. Often the system is underdetermined ($$m<d$$) and recovery can only be done with additional assumptions, typically sparsity. The vector $${\mathbf{x}}=[x_\ell]_{\ell=1}^d$$ is said to be sparse if it has only a limited number $$k<d$$ of non-zero coefficients. Its support is the set of indices of non-zero entries: $${\it{\Gamma}}({\mathbf{x}})=\{\ell\ |\ x_\ell\neq 0\}$$. The notation $${\mathbf{M}}_{{\it{\Gamma}}}$$ (respectively $${\mathbf{x}}_{{\it{\Gamma}}}$$) denotes the restriction of matrix $${\mathbf{M}}$$ (respectively vector $${\mathbf{x}}$$) to columns (respectively entries) with indices in $${\it{\Gamma}}$$. A sparse vector can be seen as a combination of few basis elements: $${\mathbf{x}}=\sum_{\ell \in {\it{\Gamma}}} x_\ell {\mathbf{e}}_\ell$$, where $$\left\lbrace{\mathbf{e}}_\ell\right\rbrace_{\ell=1,...,d}$$ is the canonical basis of $$\mathbb{R}^d$$. The measurement vector $${\mathbf{z}}$$ is thus expressed as a combination of few atoms, corresponding to the columns of the measurement matrix: $${\mathbf{z}}=\sum_{\ell \in {\it{\Gamma}}}x_\ell{\mathbf{M}}{\mathbf{e}}_\ell$$. The set of all atoms is referred to as a dictionary$$\left\lbrace{\mathbf{M}}{\mathbf{e}}_\ell\right\rbrace_{\ell=1,...,d}$$. 2.1 Mixture model and generalized compressive sensing Let $$E$$ = $$E(X)$$ be a space of signed finite measures over a measurable space $$(X,\mathcal{B})$$ and $${\mathcal{P}}$$ the set of probability distributions over $$X$$, $${\mathcal{P}} := \left\lbrace P \in E; P\geq 0, \int_X \,{\rm d}P=1\right\rbrace$$. In our framework, a distribution $$P \in {\mathcal{P}}$$ is encoded with a linear measurement operator (or sketching operator) $$\mathcal{A}: {\mathcal{P}} \rightarrow \mathbb{C}^m$$: \begin{equation} {\mathbf{z}}=\mathcal{A} P. \end{equation} (2.2) As in classical compressive sensing, we define a ‘sparse’ model in $${\mathcal{P}}$$. As mentioned in the introduction, it is here assimilated to a mixture model (1.1), generated by combining elements from some given set $$\mathcal{G}=\left\lbrace P_{\boldsymbol{\theta}}\right\rbrace_{{\boldsymbol{\theta}} \in {\mathcal{T}}} \subset {\mathcal{P}}$$ of basic distributions indexed by a parameter $${\boldsymbol{\theta}} \in {\mathcal{T}}$$. For some finite $$K \in \mathbb{N}^*$$, a distribution is thus said to be K-sparse (in $$\mathcal{G}$$) if it is a convex combination of $$K$$ elements from $$\mathcal{G}$$: \begin{equation} P_{{\it{\Theta}},{\boldsymbol{\alpha}}}=\sum_{k=1}^K \alpha_k P_{{\boldsymbol{\theta}}_k}, \end{equation} (2.3) with $$P_{{\boldsymbol{\theta}}_k} \in \mathcal{G}$$, $$\alpha_k \geq 0$$ for all $$k$$s, and $$\sum_{k=1}^K \alpha_k=1$$. We name support of the representation2$$({\it{\Theta}},{\boldsymbol{\alpha}})$$ of such a sparse distribution the set of parameters $${\it{\Theta}}=\left\lbrace{\boldsymbol{\theta}}_1,...,{\boldsymbol{\theta}}_K\right\rbrace$$. The measurement vector $${\mathbf{z}}=\mathcal{A} P_{{\it{\Theta}},{\boldsymbol{\alpha}}}=\sum_{k=1}^K \alpha_k \mathcal{A} P_{{\boldsymbol{\theta}}_k}$$ of a sparse distribution is a combination of atoms selected from the dictionary$$\left\lbrace\mathcal{A} P_{\boldsymbol{\theta}}\right\rbrace_{{\boldsymbol{\theta}} \in {\mathcal{T}}}$$ indexed by the parameter $${\boldsymbol{\theta}}$$. Table 1 summarizes the notations used in the context of compressive mixture estimation and their correspondence with more classical notions from compressive sensing. Table 1 Correspondance between objects manipulated in usual compressive sensing of finite-dimensional signals and in the proposed compressive mixture estimation framework Usual compressive sensing Compressive mixture estimation Signal $${\mathbf{x}} \in \mathbb{R}^d$$ $$P \in {\mathcal{P}}$$ Model $$K$$-sparse vectors $$K$$-sparse mixtures $$P_{{\it{\Theta}},{\boldsymbol{\alpha}}} = \sum_{k=1}^K \alpha_k P_{{\boldsymbol{\theta}}_k}$$ Measurement operator $${\mathbf{M}} \in \mathbb{R}^{m \times d}$$ $$\mathcal{A}: {\mathcal{P}} \rightarrow \mathbb{C}^m$$ Support $${\it{\Gamma}}({\mathbf{x}})=\{\ell\ |\ x_\ell \neq 0\}$$ $${\it{\Gamma}}(P_{{\it{\Theta}},{\boldsymbol{\alpha}}})={\it{\Theta}}=\lbrace{\boldsymbol{\theta}}_1,...,{\boldsymbol{\theta}}_K\rbrace$$ Dictionary of atoms $$\left\lbrace{\mathbf{M}}{\mathbf{e}}_\ell\right\rbrace_{\ell=1,...,d}$$ $$\left\lbrace\mathcal{A} P_{\boldsymbol{\theta}}\right\rbrace_{{\boldsymbol{\theta}} \in {\mathcal{T}}}$$ Usual compressive sensing Compressive mixture estimation Signal $${\mathbf{x}} \in \mathbb{R}^d$$ $$P \in {\mathcal{P}}$$ Model $$K$$-sparse vectors $$K$$-sparse mixtures $$P_{{\it{\Theta}},{\boldsymbol{\alpha}}} = \sum_{k=1}^K \alpha_k P_{{\boldsymbol{\theta}}_k}$$ Measurement operator $${\mathbf{M}} \in \mathbb{R}^{m \times d}$$ $$\mathcal{A}: {\mathcal{P}} \rightarrow \mathbb{C}^m$$ Support $${\it{\Gamma}}({\mathbf{x}})=\{\ell\ |\ x_\ell \neq 0\}$$ $${\it{\Gamma}}(P_{{\it{\Theta}},{\boldsymbol{\alpha}}})={\it{\Theta}}=\lbrace{\boldsymbol{\theta}}_1,...,{\boldsymbol{\theta}}_K\rbrace$$ Dictionary of atoms $$\left\lbrace{\mathbf{M}}{\mathbf{e}}_\ell\right\rbrace_{\ell=1,...,d}$$ $$\left\lbrace\mathcal{A} P_{\boldsymbol{\theta}}\right\rbrace_{{\boldsymbol{\theta}} \in {\mathcal{T}}}$$ Table 1 Correspondance between objects manipulated in usual compressive sensing of finite-dimensional signals and in the proposed compressive mixture estimation framework Usual compressive sensing Compressive mixture estimation Signal $${\mathbf{x}} \in \mathbb{R}^d$$ $$P \in {\mathcal{P}}$$ Model $$K$$-sparse vectors $$K$$-sparse mixtures $$P_{{\it{\Theta}},{\boldsymbol{\alpha}}} = \sum_{k=1}^K \alpha_k P_{{\boldsymbol{\theta}}_k}$$ Measurement operator $${\mathbf{M}} \in \mathbb{R}^{m \times d}$$ $$\mathcal{A}: {\mathcal{P}} \rightarrow \mathbb{C}^m$$ Support $${\it{\Gamma}}({\mathbf{x}})=\{\ell\ |\ x_\ell \neq 0\}$$ $${\it{\Gamma}}(P_{{\it{\Theta}},{\boldsymbol{\alpha}}})={\it{\Theta}}=\lbrace{\boldsymbol{\theta}}_1,...,{\boldsymbol{\theta}}_K\rbrace$$ Dictionary of atoms $$\left\lbrace{\mathbf{M}}{\mathbf{e}}_\ell\right\rbrace_{\ell=1,...,d}$$ $$\left\lbrace\mathcal{A} P_{\boldsymbol{\theta}}\right\rbrace_{{\boldsymbol{\theta}} \in {\mathcal{T}}}$$ Usual compressive sensing Compressive mixture estimation Signal $${\mathbf{x}} \in \mathbb{R}^d$$ $$P \in {\mathcal{P}}$$ Model $$K$$-sparse vectors $$K$$-sparse mixtures $$P_{{\it{\Theta}},{\boldsymbol{\alpha}}} = \sum_{k=1}^K \alpha_k P_{{\boldsymbol{\theta}}_k}$$ Measurement operator $${\mathbf{M}} \in \mathbb{R}^{m \times d}$$ $$\mathcal{A}: {\mathcal{P}} \rightarrow \mathbb{C}^m$$ Support $${\it{\Gamma}}({\mathbf{x}})=\{\ell\ |\ x_\ell \neq 0\}$$ $${\it{\Gamma}}(P_{{\it{\Theta}},{\boldsymbol{\alpha}}})={\it{\Theta}}=\lbrace{\boldsymbol{\theta}}_1,...,{\boldsymbol{\theta}}_K\rbrace$$ Dictionary of atoms $$\left\lbrace{\mathbf{M}}{\mathbf{e}}_\ell\right\rbrace_{\ell=1,...,d}$$ $$\left\lbrace\mathcal{A} P_{\boldsymbol{\theta}}\right\rbrace_{{\boldsymbol{\theta}} \in {\mathcal{T}}}$$ 2.2 Principle for reconstruction: moment matching As mentioned in Section 1, usual reconstruction algorithms (also known as decoders [17,36]) in generalized compressive sensing, are designed with the purpose of minimizing the measurement error while enforcing sparsity [13], as formulated in equation (1.5). Here, it also corresponds to traditional parametric optimization in the GeMM [60]: \begin{equation}\label{eq:costfun} \underset{{\it{\Theta}},{\boldsymbol{\alpha}}}{\text{argmin}} \left\lVert{\hat{\mathbf{z}}-\mathcal{A} P_{{\it{\Theta}},{\boldsymbol{\alpha}}}}\right\rVert_2\!, \end{equation} (2.4) where $$\hat{\mathbf{z}}=\mathcal{A}\hat{P}=\tfrac{1}{n}\sum_{i=1}^n\mathcal{A}\delta_{{\mathbf{x}}_i}$$ is the empirical sketch. This problem is usually highly non-convex and does not allow for an efficient direct optimization, nevertheless we show in Section 7 that in some cases it yields a decoder robust to modeling errors and empirical estimation errors, with high probability. Convex relaxations of (2.4) based on sparsity-promoting penalization terms [11,22,44,80,95] can be envisioned in certain settings; however, their direct adaptation to general uncountable dictionaries of atoms (e.g., with GMMs) seems difficult. The main alternative is greedy algorithms. Using an algorithm inspired by iterative hard thresholding (IHT) [14], Bourrier et al. [19] estimate mixtures of isotropic Gaussian distributions with fixed variance, using a sketch formed by sampling the empirical characteristic function. As will be shown in Section 5.6, this IHT-like algorithm often yields an unsatisfying local minimum of (2.4) when the variance is estimated. Instead, we propose a greedy approach similar to OMP [73,79] and its extension OMPR [64]. Another intuitive solution would be to discretize the space of the parameter $${\boldsymbol{\theta}}$$ to obtain a finite dictionary of atoms and apply the classic convex relaxation or greedy methods mentioned above. However, one quickly encounters the well-known curse of dimensionality: for a grid with precision $$\varepsilon$$ and a parameter of dimension $$p$$, the size of the dictionary is as $$\mathcal{O}\left(\varepsilon^{-p}\right)$$, which is intractable even for moderate $$p$$. Initial experiments for learning GMMs in dimension $$d=2$$ with diagonal covariance (i.e., the dimension of the parameter $${\boldsymbol{\theta}}$$ is $$p=4$$) show that this approach is extremely slow and has a very limited precision. Instead, in the next section, we propose an adaptation of OMPR directly in the continuous domain. 2.3 Inspiration: OMPR for classical compressive sensing Matching pursuit (MP) [73] and OMP [79] deal with general sparse approximation problems. They gradually extend the sparse support by selecting atoms most correlated with the residual signal, until the desired sparsity is attained. An efficient variation of OMP called OMPR [64] exhibits better reconstruction guarantees. Inspired by IHT [14], and similar to CoSAMP or subspace pursuit [52], it increases the number of iterations of OMP and extends the size of the support further than the desired sparsity before reducing it with hard thresholding to suppress spurious atoms. 2.4 Proposed algorithms: compressive learning OMP/OMPR Adapting OMPR to the considered compressive mixture estimation framework requires several modifications. We detail them below and summarize them in Algorithm 1. Several aspects of this framework must be highlighted: Non-negativity. Unlike classical compressive sensing, the compressive mixture estimation framework imposes a non-negativity constraint on the weights $${\boldsymbol{\alpha}}$$ that we enforce at each iteration. Thus Step 1 is modified compared with the classical OMPR by replacing the modulus of the correlation by its real part, to avoid negative correlation between atom and residual. Similarly, in Step 4, we perform a non-negative least-squares (NNLS) [68] instead of a classical least squares. Normalization. Note that we do not enforce normalization $$\sum_{k=1}^K \alpha_k=1$$ at each iteration. Instead, a normalization of $${\boldsymbol{\alpha}}$$ is performed at the end of the algorithm to obtain a valid distribution. Enforcing the normalization constraint at each iteration was found on initial experiments to have a negligible effect while increasing computation time. Continuous dictionary. The set $$\mathcal{G}$$ of elementary distributions is often continuously indexed (as with GMMs, see Section 3.2) and cannot be exhaustively searched. Instead we propose to replace the maximization in Step 1 of classical OMPR with a randomly initialized gradient descent, denoted by a call to a sub-routine $$\texttt{maximize}_{\boldsymbol{\theta}}$$, leading to a—local—maximum of the correlation between atom and residual. Note that the atoms are normalized during the search, as is often the case with OMP. Global optimization step to handle coherent dictionaries. Compared with the classical OMPR, the proposed algorithm includes a new step at each iteration (Step 5), which further reduces the cost function with a gradient descent initialized with the current parameters $$({\it{\Theta}},{\boldsymbol{\alpha}})$$. This is denoted by a call to the sub-routine $$\texttt{minimize}_{{\it{\Theta}},{\boldsymbol{\alpha}}}$$. The need for this additional step stems from the lack of incoherence between the elements of the uncountable dictionary. For instance, in the case of GMM estimation, a $$(K+1)$$-GMM approximation of a distribution cannot be directly derived from a $$K$$-GMM by simply adding a Gaussian. This is reminiscent of a similar problem handled in high-resolution matching pursuit (HRMP) [59], which uses a multi-scale decomposition of atoms, while we handle here the more general case of a continuous dictionary using a global gradient descent that adjusts all atoms. Experiments show that this step is the most time-consuming of the algorithm, but that it is necessary. Algorithm 1: View largeDownload slide Compressive mixture learning à la OMP: CL-OMP ($$T=K$$) and CL-OMPR ($$T\,{=}\,2K$$) Algorithm 1: View largeDownload slide Compressive mixture learning à la OMP: CL-OMP ($$T=K$$) and CL-OMPR ($$T\,{=}\,2K$$) Similar to classical OMPR, Algorithm 1 yields two algorithms depending on the number of iterations: Compressive Learning OMP (CL-OMP) if run without hard thresholding (i.e., with $$T=K$$); CL-OMPR (with replacement) if run with $$T=2K$$ iterations. Learning the number of components? In the proposed framework, the number of components $$K$$ is known in advance and provided by the user. However, it is known that greedy approaches such as OMP are convenient to derive stopping conditions that could be readily applied to CL-OMP: when the residual falls below a fixed (or adaptive) threshold, stop the algorithm (additional strategies would be required for CL-OMPR however). In this article, however, we only compare the proposed method with classical approaches such as EM for GMMs that consider the number of components $$K$$known in advance. We leave for future work the implementation of a stopping condition for CL-OMP(R) and comparison with existing methods for model selection. 2.5 Complexity of CL-OMP(R) Just as OMP, which complexity scales quadratically with the sparsity parameter $$K$$, proposed greedy approaches CL-OMP or CL-OMPR have a computational cost of the order of $$\mathcal{O}(mdKT)$$, where $$T \geq K$$ is the number of iterations, resulting in a quadratic cost with respect to the number of components $$K$$. This is potentially a limiting factor for the estimation of mixtures of many basic distributions (large $$K$$). In classical sparse approximation, approximate least squares approaches such as gradient pursuit [13] or LocOMP [71] have been developed to overcome this computational bottleneck. One could probably get inspiration from these approaches to further scale-up compressive mixture estimation; however, in the context of GMMs, we propose in Section 3.2 to rather leverage ideas from existing fast EM algorithms that are specific to GMMs. 2.6 Sketching by randomly sampling the characteristic function Let us now assume $$X=\mathbb{R}^d$$. The reader will notice that in classical compressive sensing, the compressed object is a vector $${\mathbf{x}} \in \mathbb{R}^{d}$$, while in this context, a training collection of vectors $$\{{\mathbf{x}}_{1},\ldots,{\mathbf{x}}_{n}\} \subset \mathbb{R}^{d}$$ is considered as an empirical version of some probability distribution $$P \in E(\mathbb{R}^{d})$$, which is the compressed object. The proposed algorithms CL-OMP(R) are suitable for any sketching operator $$\mathcal{A}$$ and any mixture model of parametric densities $$P_{\boldsymbol{\theta}}$$, as long as the optimization schemes in Steps 1 and 5 can be performed. In the case of a continuous dictionary, the optimization steps can be performed with simple gradient descents implicitly represented by calls to the sub-routines $$\texttt{maximize}_{\boldsymbol{\theta}}$$ and $$\texttt{minimize}_{{\it{\Theta}},{\boldsymbol{\alpha}}}$$, provided $$\mathcal{A} P_{\boldsymbol{\theta}}$$ and its gradient with respect to $${\boldsymbol{\theta}}$$ have a closed-form expression. Characterictic function and Compressive sensing. The characteristic function $$\psi_P$$ of a probability distribution $$P$$ is a well-known object with many desirable properties. It completely defines any distribution with no ambiguity and often has a closed-form expression (even for distributions that may not have a probability density function with closed-form expression, e.g., for $$\alpha$$-stable distributions [91]). These properties make it a potentially suitable choice to build a sketch used with CL-OMP(R). The characteristic function $$\psi_\nu$$ of a finite measure $$\nu \in E(\mathbb{R}^d)$$ is defined as: \begin{equation}\label{eq:general_chrc} \psi_\nu({\boldsymbol{\omega}})=\int\left({\rm e}^{-\mathsf{i}{\boldsymbol{\omega}}^T{\mathbf{x}}}\right){\rm d}\nu({\mathbf{x}})\quad \forall {\boldsymbol{\omega}} \in \mathbb{R}^{d}. \end{equation} (2.5) For a sparse distribution $$P_{{\it{\Theta}},{\boldsymbol{\alpha}}}$$ (in some given set of basic distributions $$\mathcal{G} \subset {\mathcal{P}}$$), we also denote $$\psi_{{\it{\Theta}},{\boldsymbol{\alpha}}}=\psi_{P_{{\it{\Theta}},{\boldsymbol{\alpha}}}}$$ for simplicity. As mentioned in the Introduction, using the characteristic function of distributions for moment matching has a long history [49,50]. Let us note the early method of [103] that studies the properties of the characteristic function sampled on a grid of equally spaced frequencies to estimate univariate GMMs by moment matching, which is akin to the method presented in the rest of the article, but does not provide a practical estimation procedure. The characteristic function is also used in more recent developments of GeMM [32]. Interestingly, using the characteristic function (i.e., the Fourier transform) of a probability distribution also bears connection with compressive sensing, which is the main inspiration for this article. In many important applications, such as medical imaging (MRI and tomography), astrophysics or geophysical exploration, one wishes to reconstruct a signal from incomplete samples of its discrete Fourier transform. Random Fourier sampling was therefore one of the first problems to give rise to the classical notions of compressive sensing [27–29]. Indeed, a random uniform selection of rows of the full Fourier matrix, i.e., a random selection of frequencies, forms a partial Fourier matrix that satisfies a certain RIP with overwhelming probability [29], and is therefore appropriate for the encoding and recovery of sparse signals. For more details, we refer the reader to [5,27–29,52] and references therein. Inspired by this methodology, we form the sketch by randomly sampling the characteristic function of the probability distribution $$P$$. As we will see in Section 7, this random sampling sets forth a RIP-like theoretical analysis of the moment matching method. Definition of the sketch. In practice, given frequencies $${\it{\Omega}}=\left\lbrace{\boldsymbol{\omega}}_1,...,{\boldsymbol{\omega}}_m\right\rbrace$$ in $$\mathbb{R}^d$$, we define generalized moment functions: \begin{equation}\label{eq:chrc_moment} M_j({\mathbf{x}})=\exp\left(-\mathsf{i} {\boldsymbol{\omega}}_j^{\rm T}{\mathbf{x}}\right)\!,~j=1\cdots m, \end{equation} (2.6) and the sketching operator (1.2) is therefore expressed as \begin{equation} \label{eq:skop} \mathcal{A} \nu=\frac{1}{\sqrt{m}}\left[\psi_\nu({\boldsymbol{\omega}}_1),...,\psi_\nu({\boldsymbol{\omega}}_m)\right]^{\rm T}\!. \end{equation} (2.7) Given a training collection $$\mathcal{X}=\left\lbrace{\mathbf{x}}_1,...,{\mathbf{x}}_n\right\rbrace$$ in $$\mathbb{R}^d$$, we denote $$\hat\psi({\boldsymbol{\omega}})=\frac{1}{n} \sum_{i=1}^n {\rm e}^{-\mathsf{i}{\boldsymbol{\omega}}^{\rm T}{\mathbf{x}}_i}$$ the empirical characteristic function.3 The empirical sketch $$\hat{\mathbf{z}}=\mathcal{A} \hat P$$ is \begin{equation}\label{eq:empirical_sketch_chf} \hat{\mathbf{z}}=\frac{1}{\sqrt{m}}\left[\hat\psi({\boldsymbol{\omega}}_1),...,\hat\psi({\boldsymbol{\omega}}_m)\right]^{\rm T}\!. \end{equation} (2.8) To fully specify the sketching operator (2.7), one needs to choose the frequencies $${\boldsymbol{\omega}}_j$$ that define it. In the spirit of random Fourier sampling, we propose to define a probability distribution $${\it{\Lambda}} \in {\mathcal{P}}$$ to draw $$({\boldsymbol{\omega}}_1,...,{\boldsymbol{\omega}}_m) \overset{{\rm i.i.d.}}{\sim} {\it{\Lambda}}$$. Choosing this distribution is a problem of its own that will be discussed in details in Section 3.3. Connections with Random Neural Networks. It is possible to draw connections between the proposed sketching operation and neural networks. Denote $${\mathbf{W}}:=\left[{\boldsymbol{\omega}}_1,...,{\boldsymbol{\omega}}_m\right] \in \mathbb{R}^{d\times m}$$ and $${\mathbf{X}}:=\left[{\mathbf{x}}_1,...,{\mathbf{x}}_n\right]\in\mathbb{R}^{d\times n}$$. To derive the sketch, one needs to compute the matrix $${\mathbf{U}}:={\mathbf{W}}^T{\mathbf{X}} \in \mathbb{R}^{m\times n}$$, take the complex exponential of each individual entry $${\mathbf{V}}:=\bar{f}({\mathbf{U}})$$, where $$\bar{f}$$ is the pointwise application of the function $$f(x)={\rm e}^{-\mathsf{i} x}/\sqrt{m}$$ and finally pool the columns of $${\mathbf{V}}$$ to obtain the empirical sketch $$\hat{\mathbf{z}}=\left(\sum_{i=1}^n{\mathbf{v}}_i\right)/n$$. This procedure indeed shares similarities with a one-layer neural network: the output $${\mathbf{y}}\in\mathbb{R}^m$$ of such a network can be expressed as $${\mathbf{y}}=f({\mathbf{W}}^T{\mathbf{x}})$$, where $${\mathbf{x}}\in\mathbb{R}^d$$ is the input signal, $$f$$ is a pointwise nonlinear function and $${\mathbf{W}} \in \mathbb{R}^{d\times m}$$ is some weighting matrix. Therefore, in the proposed framework, such a one-layer network is applied to many inputs $${\mathbf{x}}_i$$, then the empirical average of the outputs $${\mathbf{z}}_i$$ is taken to obtain a representation of the underlying distribution of the $${\mathbf{x}}_i$$. Neural networks with weights $${\mathbf{W}}$$ chosen at random rather than learned on training data—as is done with the frequencies $${\boldsymbol{\omega}}_j$$—have been studied in the so-called random kitchen sinks [83] or in the context of deep neural networks (DNNs) with Gaussian weights [56]. In the latter, they have been shown to perform a stable embedding of the input $${\mathbf{x}}$$ when it lives on a low-dimensional manifold. Similar to [56], we show in Section 7 that with high probability the sketching procedure is a stable embedding of the probability distribution of $${\mathbf{x}}$$ when this distribution belongs to, e.g., a compact manifold. 2.7 Complexity of sketching The main computational load of the sketching operation is the computation of the matrix $${\mathbf{U}}={\mathbf{W}}^T{\mathbf{X}}$$, which theoretically scales in $$\mathcal{O}(dmn)$$. Large multiplications by random matrices are indeed a well-known computational bottleneck in CS, and some methods circumvent this issue by using approximated fast transforms [45,70]. Closer to our work (see Section 4), fast transforms have also been used in the context of random Fourier features and kernel methods [69,110]. We leave the study of a possible adaptation of these acceleration methods for future work and focus on simple practical remarks about the computation of the sketch. GPU computing. Matrix multiplication is one of the most studied problems in the context of large-scale computing. A classical way to drastically reduce its cost consists in using GPU computing [106]. Recent architectures can even leverage giant matrix multiplication using multiple GPUs in parallel [111]. Distributed/online computing. The computation of the sketch can also be performed in a distributed manner. One can divide the database $$\mathcal{X}$$ in $$T$$ subsets $$\mathcal{X}_t$$ containing $$n_t$$ items respectively, after which individual computing units can compute the sketches $$\hat{\mathbf{z}}_t$$ of each subset $$\mathcal{X}_t$$ in parallel, using the same frequencies. These sketches are then easily merged4 as $$\hat{\mathbf{z}}=\sum_{t=1}^T n_t\hat{\mathbf{z}}_t/\sum_{t=1}^T n_t$$. The cost of computing the sketch is thus divided by the number of units $$T$$. Similarly, this simple observation allows the sketch to be computed in an online fashion. 3. Sketching GMMs In this section, we instantiate the proposed framework in the context of GMMs. A more scalable algorithm specific to GMMs is first introduced as a possible alternative to CL-OMP(R). We then focus on the design of the sketching operator $$\mathcal{A}$$, i.e., on the design of the frequency distribution $${\it{\Lambda}}$$ (see Section 2.6). 3.1 Gaussian Mixture Models In the GMM framework, the basic distributions $$P_{\boldsymbol{\theta}} \in \mathcal{G}$$ are Gaussian distributions with density functions denoted $${p}_{\boldsymbol{\theta}}$$: \begin{equation}\label{eq:gaussdensity} {p}_{\boldsymbol{\theta}}({\mathbf{x}})=\mathcal{N}({\mathbf{x}};{\boldsymbol{\mu}},{\boldsymbol{{\it{\Sigma}}}})=\frac{1}{(2\pi)^{d/2}|{\boldsymbol{{\it{\Sigma}}}}|^{1/2}}\exp\left(-\frac{1}{2}({\mathbf{x}}-{\boldsymbol{\mu}})^{\rm T}{\boldsymbol{{\it{\Sigma}}}}^{-1}({\mathbf{x}}-{\boldsymbol{\mu}})\right)\!, \end{equation} (3.1) where $${\boldsymbol{\theta}}=({\boldsymbol{\mu}},{\boldsymbol{{\it{\Sigma}}}})$$ represents the parameters of the Gaussian with mean $${\boldsymbol{\mu}} \in \mathbb{R}^d$$ and covariance $${\boldsymbol{{\it{\Sigma}}}} \in \mathbb{R}^{d \times d}$$. A Gaussian is said to be isotropic, or spherical, if its covariance matrix is proportional to the identity: $${\boldsymbol{{\it{\Sigma}}}}=\sigma^2{\mathbf{I}}$$. Densities of GMMs are denoted $${p}_{{\it{\Theta}},{\boldsymbol{\alpha}}}=\sum_{k=1}^K \alpha_k {p}_{{\boldsymbol{\theta}}_k}$$. A $$K$$-GMM is then naturally parametrized by weight vector $${\boldsymbol{\alpha}} \in \mathbb{R}^K$$ and parameters $${\boldsymbol{\theta}}_k=({\boldsymbol{\mu}}_k,{\boldsymbol{{\it{\Sigma}}}}_k),~k=1\cdots K$$. Compressive density estimation with mixtures of isotropic Gaussians with fixed, known variance was considered in [19]. In this work, we consider mixtures of Gaussians with diagonal covariances, which is known to be sufficient for many applications [86,113] and is the default framework in well-known toolboxes such as VLFeat [104]. We denote $$(\sigma_{k,1}^2,...,\sigma_{k,d}^2)=\mathrm{diag}\left({\boldsymbol{{\it{\Sigma}}}}_k\right)$$. Depending on the context, we equivalently denote $${\boldsymbol{\theta}}_k=[{\boldsymbol{\mu}}_k;{\boldsymbol{\sigma}}_k] \in \mathbb{R}^{2d}$$, where $${\boldsymbol{\sigma}}_k=[\sigma^2_{k,\ell}]_{\ell=1\cdots d}$$ is the diagonal of the covariance of the $$k$$th Gaussian. The characteristic function of a Gaussian $$P_{\boldsymbol{\theta}}$$ has a closed-form expression: \begin{equation}\label{eq:chrc} \psi_{\boldsymbol{\theta}}({\boldsymbol{\omega}})=\exp\left( -\mathsf{i}{\boldsymbol{\omega}}^{\rm T}{\boldsymbol{\mu}}\right)\exp\left(-\frac{1}{2}{\boldsymbol{\omega}}^{\rm T}{\boldsymbol{{\it{\Sigma}}}}{\boldsymbol{\omega}}\right)\!, \end{equation} (3.2) from which we can easily derive the expression of the gradients necessary to perform the optimization schemes $$\texttt{maximize}_{\boldsymbol{\theta}}$$, $$\texttt{minimize}_{{\it{\Theta}},{\boldsymbol{\alpha}}}$$ in CL-OMP(R), with the sketching operator introduced in Section 2.6. 3.2 A faster algorithm specific to GMM estimation To handle mixtures of many Gaussians (large $$K$$), the fast hierarchical EM used in [90] alternates between binary splits of each Gaussian $$k$$ along its first principal component (in the case of Gaussians with diagonal covariance, this is the dimension $$\ell$$ with the highest variance $$\sigma^2_{k,\ell}$$) and a few EM iterations. Our compressive adaptation is summarized in Algorithm 2. The binary split is performed by calling the function $$\texttt{Split}$$, and the EM steps are replaced by Step 5 of Algorithm 1 to adjust all Gaussians. In the case where the targeted sparsity level $$K$$ is not a power of $$2$$, we split the GMM until the support reaches a size $$2^{\left\lceil \log_2 K \right\rceil} > K$$, then reduce it with a hard thresholding (Step 3 of Algorithm 1), similar to CL-OMPR. Function View largeDownload slide Split($${\it{\Theta}}$$): split each Gaussian in the support along its dimension of highest variance Function View largeDownload slide Split($${\it{\Theta}}$$): split each Gaussian in the support along its dimension of highest variance Algorithm 2: View largeDownload slide An algorithm with complexity $$O(K \log K)$$ for compressive GMM estimation Algorithm 2: View largeDownload slide An algorithm with complexity $$O(K \log K)$$ for compressive GMM estimation Since the number of iterations in Algorithm 2 is $$T=\left\lceil \log_2 K \right\rceil$$, the computational cost of this algorithm scales in $$\mathcal{O}(mdK\log K)$$, which is much faster for large $$K$$ than the quadratic cost of CL-OMPR. In practice, Algorithm 2 is seen to sometimes yield worse results than CL-OMPR (see Section 5.6), but be well-adapted to other tasks such as, e.g., GMM estimation for large-scale speaker verification (see Section 6). 3.3 Designing the frequency sampling pattern A key ingredient in designing the sketching operator $$\mathcal{A}$$ is the choice of a probability distribution $${\it{\Lambda}}$$ to draw the frequency sampling pattern $$\{{\boldsymbol{\omega}}_{1},\ldots,{\boldsymbol{\omega}}_{m}\}$$ as defined in Section 2.6. As we will see in Section 4, this corresponds to the design of a translation-invariant kernel in the data domain. Interestingly, working in the Fourier domain seems to make the heuristic design strategy more direct. Literature on designing kernels in this context usually focus on maximizing the distance between the sketch of two distributions [77,98], which cannot be readily applied in our context since we sketch only a single distribution. However, as we will see, the proposed approach follows the general idea of maximizing the capacity of the sketch to distinguish this distribution from others, which amounts to maximizing the variations of the sketch with respect to the parameters of the GMM at the selected frequencies. 3.3.1 Oracle frequency sampling pattern for a single known Gaussian We start by designing a heuristic for choosing frequencies adapted to the estimation of a single Gaussian $$P_{\boldsymbol{\theta}}$$, assuming the parameters $${\boldsymbol{\theta}}=({\boldsymbol{\mu}},{\boldsymbol{{\it{\Sigma}}}})$$ are available—which is obviously not the case in practice. We will deal in due time with mixtures, and with unknown parameters. Gaussian frequency distribution. Recall the expression (3.2) of the characteristic function $$\psi_{\boldsymbol{\theta}}({\boldsymbol{\omega}})={\rm e}^{ -\mathsf{i}{\boldsymbol{\omega}}^{\rm T}{\boldsymbol{\mu}}-\frac{1}{2}{\boldsymbol{\omega}}^{\rm T}{\boldsymbol{{\it{\Sigma}}}}{\boldsymbol{\omega}}}$$ of the Gaussian $$P_{\boldsymbol{\theta}}$$. It is an oscillating function with Gaussian amplitude of inverted variance with respect to the original Gaussian. Given that $$|\psi_{\boldsymbol{\theta}}| \propto \mathcal{N}\left(0,{\boldsymbol{{\it{\Sigma}}}}^{-1}\right)$$, choosing a Gaussian frequency distribution $${\it{\Lambda}}^{(G)}_{\boldsymbol{{\it{\Sigma}}}}=\mathcal{N}\left(0,{\boldsymbol{{\it{\Sigma}}}}^{-1}\right)$$ is a possible intuitive choice [19,20] to sample the characteristic function $$\psi_{\boldsymbol{\theta}}$$. It concentrates frequencies in the regions where the sampled characteristic function has high amplitude. However, points drawn from a high-dimensional Gaussian concentrate on an ellipsoid, which moves away from the origin as the dimension $$d$$ increases. Such a Gaussian sampling therefore ‘undersamples’ low or even middle frequencies. This phenomenon has long been one of the reasons for using dimensionality reduction for GMM estimation [43]. Hence, in high dimension the amplitude of the characteristic function becomes negligible (with high probability) at all selected frequencies. Folded Gaussian radial frequency distribution. In light of the problem observed with the Gaussian frequency distribution, we propose to draw frequencies from a distribution allowing for an accurate control of the quantity $${\boldsymbol{\omega}}^T{\boldsymbol{{\it{\Sigma}}}}{\boldsymbol{\omega}}$$ and thus of the amplitude $${\rm e}^{-\frac{1}{2}{\boldsymbol{\omega}}^T{\boldsymbol{{\it{\Sigma}}}}{\boldsymbol{\omega}}}$$ of the characteristic function. This is achieved by drawing \begin{equation} \label{eq:decomp} {\boldsymbol{\omega}}=R{\it{\Sigma}}^{-\frac{1}{2}}{\boldsymbol{\varphi}}, \end{equation} (3.3) where $${\boldsymbol{\varphi}} \in \mathbb{R}^d$$ is uniformly distributed on the $$\ell_2$$ unit sphere $$\mathcal{S}_{d-1}$$, and $$R \in \mathbb{R}_+$$ is a radius chosen independently from $${\boldsymbol{\varphi}}$$ with a distribution $$p_R$$ we will now specify. With the decomposition (3.3), the characteristic function $$\psi_{{\boldsymbol{\theta}}}$$ is now expressed as \begin{equation*} \psi_{{\boldsymbol{\theta}}}\left(R{\it{\Sigma}}^{-\frac{1}{2}}{\boldsymbol{\varphi}}\right)=\exp\left(-\mathsf{i} R{\boldsymbol{\varphi}}^{\rm T}{\it{\Sigma}}^{-\frac{1}{2}}{\boldsymbol{\mu}}\right)\exp\left(-\tfrac{1}{2}R^2\right) =\psi_{\theta}(R), \end{equation*} where $$\psi_{\theta}$$ is the characteristic function of a one-dimensional Gaussian with mean $$\mu={\boldsymbol{\varphi}}^{\rm T}{\it{\Sigma}}^{-\frac{1}{2}}{\boldsymbol{\mu}}$$ and variance $$\sigma^2=1$$. We thus consider the estimation of a one-dimensional Gaussian $$P_{\theta_0}=\mathcal{N}(\mu_0,\sigma_0^2)$$, with $$\sigma_0=1$$, as our baseline to design a radius distribution $$p_R$$. In this setting, we no longer suffer from unwanted concentration phenomena and can resort to the intuitive Gaussian radius distribution to sample $$\psi_{\theta_0}$$. It corresponds to a radius density function $$p_R= \mathcal{N}^+\left(0,\frac{1}{\sigma_0^2}\right)=\mathcal{N}^+(0,1)$$ (i.e., Gaussian with absolute value, referred to as folded Gaussian). Using this radius distribution with the decomposition (3.3) yields a frequency distribution $${\it{\Lambda}}^{(FGr)}_{\boldsymbol{{\it{\Sigma}}}}$$ referred to as Folded Gaussian radius frequency distribution. Note that, similar to the Gaussian frequency distribution, the folded Gaussian radius distribution only depends on the (oracle) covariance $${\boldsymbol{{\it{\Sigma}}}}$$ of the sketched distribution $$P_{\boldsymbol{\theta}}$$. Adapted radius distribution. Although we will see it yields decent results in Section 5, the folded Gaussian radius frequency distribution somehow produces too many frequencies with a low radius $$R$$. These carry a limited quantity of information about the original distribution, since all characteristic functions equal $$1$$ at the origin.5 We now present a heuristics that may avoid this ‘waste’ of frequencies. Intuitively, the chosen frequencies should properly discriminate Gaussians with different parameters, at least in the neighborhood of the true parameter $$\theta_0=(\mu_{0},\sigma_{0})=(\mu_0,1)$$. This corresponds to promoting frequencies $$\omega$$ leading to a large difference $$|\psi_\theta(\omega)-\psi_{\theta_0}(\omega)|$$ for parameters $$\theta$$ close to $$\theta_0$$. A way to achieve this is to promote frequencies where the norm of the gradient $$\left\lVert{\nabla_\theta \psi_{\theta}(\omega)}\right\rVert_2$$ is large at $$\theta = \theta_0$$. Recall that for a one-dimensional Gaussian $$\psi_\theta(\omega)={\rm e}^{-\mathsf{i}\mu\omega}{\rm e}^{-\frac{1}{2}\sigma^2\omega^2}$$. The norm of the gradient is expressed as: \begin{align} \left\lVert{\nabla_{\theta} \psi_{\theta}(\omega)}\right\rVert_2^2&=\left\lvert\nabla_\mu \psi_{\theta}(\omega)\right\rvert^2 + \left\lvert\nabla_{\sigma^2} \psi_{\theta}(\omega)\right\rvert^2 =\left\lvert-\mathsf{i}\omega\psi_{\theta}(\omega)\right\rvert^2 + \left\lvert-\frac{1}{2}\omega^2\psi_{\theta}(\omega)\right\rvert^2 =\left(R^2+\frac{R^4}{4}\right){\rm e}^{-\sigma^2 R^2} \notag \end{align} and therefore $$\left\lVert{\nabla_{\theta} \psi_{\theta_0}(\omega)}\right\rVert_2=\left(R^2+\tfrac{R^4}{4}\right)^{\frac{1}{2}}{\rm e}^{-\frac{1}{2}R^2}$$ since $$\sigma^2_0=1$$. This expression still has a Gaussian decrease (up to polynomial factors) and indeed avoids very low frequencies. It can be normalized to a density function: \begin{equation} \label{eq:adapted_density} p_R=C\left(R^2+\tfrac{R^4}{4}\right)^{\frac{1}{2}}{\rm e}^{-\frac{1}{2}R^2}, \end{equation} (3.4) with $$C$$ some normalization constant. Using this radius distribution with the decomposition (3.3) yields a distribution $${\it{\Lambda}}^{(Ar)}_{\boldsymbol{{\it{\Sigma}}}}$$ referred to as adapted radius frequency distribution. Once again, this distribution only depends on the covariance $${\boldsymbol{{\it{\Sigma}}}}$$. 3.3.2 Oracle frequency sampling pattern for a known mixture of Gaussians Any frequency distribution $${\it{\Lambda}}^{(.)}_{\boldsymbol{{\it{\Sigma}}}}$$ selected for sampling the characteristic function of a single known Gaussian $$P_{\boldsymbol{\theta}}$$ can be immediately and naturally extended to a frequency distribution $${\it{\Lambda}}^{(.)}_{{\it{\Theta}},{\boldsymbol{\alpha}}}$$ to sample the characteristic function of a known GMM $$P_{{\it{\Theta}},{\boldsymbol{\alpha}}}$$ by mixing the frequency distributions corresponding to each Gaussian: \begin{equation} {\it{\Lambda}}^{(.)}_{{\it{\Theta}},{\boldsymbol{\alpha}}}=\sum_{k=1}^K \alpha_k {\it{\Lambda}}^{(.)}_{{\boldsymbol{{\it{\Sigma}}}}_k}. \end{equation} (3.5) Each component $${\it{\Lambda}}^{(.)}_{{\boldsymbol{{\it{\Sigma}}}}_k}$$ has the same weight than its corresponding Gaussian $$P_{{\boldsymbol{\theta}}_k}$$. Indeed, a Gaussian with a high weight must be precisely estimated, as its influence on the overall reconstruction error (e.g., in terms of Kullback–Leibler divergence) is more important than the components with low weights. Thus, more frequencies adapted to this Gaussian are selected. The draw of frequencies with an oracle distribution $${\it{\Lambda}}^{(.)}_{{\it{\Theta}},{\boldsymbol{\alpha}}}$$ is summarized in Function $$\texttt{DrawFreq}$$. Function View largeDownload slide DrawFreq($$\left\lbrace {{\boldsymbol{\it{\Sigma}}}}_k\right\rbrace_{k=1,...,K},{\boldsymbol\alpha},m,f$$): drawing frequencies for a GMM with known variances and weights, choosing one of the three distributions described in Section 3.3 Function View largeDownload slide DrawFreq($$\left\lbrace {{\boldsymbol{\it{\Sigma}}}}_k\right\rbrace_{k=1,...,K},{\boldsymbol\alpha},m,f$$): drawing frequencies for a GMM with known variances and weights, choosing one of the three distributions described in Section 3.3 3.3.3 Choosing the frequency sampling pattern in practice In practice, the parameters $$({\it{\Theta}},{\boldsymbol{\alpha}})$$ of the GMM to be estimated are obviously unknown beforehand, so the oracle frequency distributions $${\it{\Lambda}}^{(.)}_{{\it{\Theta}},{\boldsymbol{\alpha}}}$$ introduced in the previous section cannot be computed. We propose a simple method to obtain an approximate distribution that yields good results in practice. The reader must also keep in mind that it is very easy to integrate some prior knowledge in this design, especially since the proposed frequency distributions only take into account the variances of the GMM components, not their means. The idea is to estimate the average variance $$\bar\sigma^2=\frac{1}{Kd}\sum_{k=1}^K \sum_{\ell=1}^d \sigma^2_{k,\ell}$$ of the components in the GMM—note that this parameter may be significantly different from the global variance of the data, for instance in the case of well-separated components with small variances. This estimation is performed using a light sketch $${\mathbf{z}}_0$$ with $$m_0$$ frequencies, computed on a small subset of $$n_0$$ items from the database, then a frequency distribution corresponding to a single isotropic Gaussian $${\it{\Lambda}}^{(.)}_{\bar\sigma^2{\mathbf{I}}}$$ is selected. Indeed, if the variances $$\sigma_{k,\ell}^2$$’s are not too different from each other, the amplitude of the empirical characteristic function $$|\hat\psi({\boldsymbol{\omega}})|$$approximately follows $${\rm e}^{-\frac{1}{2}\left\lVert{{\boldsymbol{\omega}}}\right\rVert_2^2\bar\sigma^2}$$ with high oscillations, allowing for a very simple amplitude estimation process: assuming the $$m_0$$ frequencies used to compute the sketch $${\mathbf{z}}_0$$ are ordered by increasing Euclidean radius, the sketch $${\mathbf{z}}_0$$ is divided into consecutive blocks, maximal peaks of its modulus are identified within each block forming a curve that approximately follows $${\rm e}^{-\frac{1}{2}R^2\bar\sigma^2}$$, then a simple regression is used to estimate $$\bar\sigma^2$$. This process is illustrated in Fig. 2. Fig. 2. View largeDownload slide Estimation of $$\bar{\sigma}^2$$ (Function $$\texttt{EstimMeanSigma}$$), for $$d=10$$, $$K=5$$, $$m_0=500$$ and $$n_0=5000$$. Modulus of the sketch with respect to the norm of frequencies (ordered by increasing radius, dots). Visualization of the peaks in each block of $$20$$ consecutive values (piecewise linear curve). Fitted curve $${\rm e}^{-\frac{1}{2}R^2\bar{\sigma}^2}$$ for the estimated $$\bar{\sigma}^2$$ (smooth curve). Fig. 2. View largeDownload slide Estimation of $$\bar{\sigma}^2$$ (Function $$\texttt{EstimMeanSigma}$$), for $$d=10$$, $$K=5$$, $$m_0=500$$ and $$n_0=5000$$. Modulus of the sketch with respect to the norm of frequencies (ordered by increasing radius, dots). Visualization of the peaks in each block of $$20$$ consecutive values (piecewise linear curve). Fitted curve $${\rm e}^{-\frac{1}{2}R^2\bar{\sigma}^2}$$ for the estimated $$\bar{\sigma}^2$$ (smooth curve). To cope with the fact that the ‘range’ of frequencies that must be considered to compute $${\mathbf{z}}_0$$ is also not known beforehand, we initialize $$\bar\sigma^2=1$$ and reiterate this procedure several times, each time drawing $$m_0$$ frequencies adapted to the current estimation of $$\bar\sigma^2$$, i.e., with some choice of frequency distribution $${\it{\Lambda}}^{(.)}_{\bar\sigma^2{\mathbf{I}}}$$, and update $$\bar\sigma^2$$ at each iteration. In practice, the procedure quickly converges in three iterations. The entire process is summarized in detail in Function $$\texttt{EstimMeanSigma}$$ and Algorithm 3. 3.4 Summary At this point, all procedures necessary for compressive GMM estimation have been defined. Given a database $$\mathcal{X}=\{{\mathbf{x}}_1,...,{\mathbf{x}}_n\}$$, a number of measurements $$m$$ and a number of components $$K$$, the entire process is as follows: In the absence of prior knowledge, draw $$m$$ frequencies using Algorithm 3 on (a fraction of) the data set. The proposed adapted radius frequency distribution is recommended, other parameters have default values (see Section 5.3), the effect of modifying them has been found to be negligible. Compute the empirical sketch $$\hat{\mathbf{z}}=\frac{1}{\sqrt{m}}\left[\frac{1}{n}\sum_{i=1}^n {\rm e}^{-\mathsf{i} {\boldsymbol{\omega}}_j^T{\mathbf{x}}_i}\right]_{j=1...m}$$. One may use GPU and/or distributed/online computing. Estimate a $$K$$-GMM using CL-OMP(R) or the more scalable, but less precise Algorithm 2. Connections with Distilled sensing. The reader may note that designing a measurement operator adapted to some particular data does not fit the classical paradigm of compressive sensing. The two-stage approaches used to choose the frequency distribution presented above can be related to a line of work referred to as adaptive (or distilled) sensing [61], in which a portion of the computational budget is used to crudely design the measurement operator while the rest is used to actually measure the signal. Most often these methods are extended to multistage approaches–where the measurement operator is refined at each iteration–and have been used in machine learning [37] or signal processing [7]. Allocating the resources and choosing between exploration (designing the measurement operator) and precision (actually measuring the signal) is a classic trade-off in areas such as reinforcement learning or game theory. Function View largeDownload slide EstimMeanSigma($${\mathcal{X}},n_0,m_0,c,T$$): Estimation of the mean variance $$\bar\sigma^2$$ Function View largeDownload slide EstimMeanSigma($${\mathcal{X}},n_0,m_0,c,T$$): Estimation of the mean variance $$\bar\sigma^2$$ Algorithm 3: View largeDownload slide Draw the frequencies in practice. Algorithm 3: View largeDownload slide Draw the frequencies in practice. 4. Kernel design and sketching It turns out that sketching operators as (2.7) are intimately related to RKHS embedding of probability distributions [94,100] and RFFs [82]. The proposed carefully designed choice of frequency distribution appears as an innovative method to design a reproducing kernel, which is faster and provides better results than traditional choices in the kernel literature [101], as experimentally shown in Section 5.4. Additionally, RKHS embedding of distributions turns out to be an appropriate framework to derive the information preservation guarantees of Section 7. 4.1 Reproducing Kernel Hilbert Spaces (RKHS) We refer the reader to Appendix A for definitions related to positive definite (p.d.) kernels and measures. Let $$\kappa: {X} \times {X} \rightarrow \mathbb{C}$$ be a p.d. kernel. By Moore–Aronszajn Theorem [4], to this kernel is associated a unique Hilbert space $${\mathcal{H}} \subset \mathbb{C}^{{X}}$$ that satisfies the following properties: for any $${\mathbf{x}} \in {X}$$ the function $$\kappa({\mathbf{x}},.)$$ belongs to $${\mathcal{H}}$$, and the kernel satisfies the reproducing property $$\forall f \in {\mathcal{H}}, \forall {\mathbf{x}} \in {X}, ~ \left\langle {\kappa({\mathbf{x}},.)},{f} \right\rangle_{\mathcal{H}}=f({\mathbf{x}})$$. The space $${\mathcal{H}}$$ is referred to as the RKHS associated with the kernel $$\kappa$$. We denote $$\left\langle.,.\right\rangle_{\mathcal{H}}$$ the scalar product of the RKHS $${\mathcal{H}}$$. We refer the reader to [62] and references therein for a review of RKHS and kernel methods. We focus here on the space $${X}=\mathbb{R}^d$$ and on translation-invariant kernels of the form \begin{equation}\label{eq:TIk} \kappa({\mathbf{x}},{\mathbf{y}})={\mathbf{K}}({\mathbf{x}}-{\mathbf{y}}), \end{equation} (4.1) where $${\mathbf{K}}:\mathbb{R}^d\rightarrow\mathbb{C}$$ is a positive definite function. Translation-invariant positive definite kernels are characterized by the Bochner theorem: Theorem 4.1 (Bochner [88], Theorem 1.4.3) A continuous function $${\mathbf{K}}:\mathbb{R}^d \rightarrow \mathbb{C}$$ is positive definite if and only if it is the Fourier transform of a finite (i.e., $${\it{\Lambda}}(\mathbb{R}^{d}) = \int_{\mathbb{R}^{d}} d{\it{\Lambda}}({\boldsymbol{\omega}}) < \infty$$) non-negative measure $${\it{\Lambda}}$$ on $$\mathbb{R}^d$$, that is: \begin{equation}\label{eq:freqdist} {\mathbf{K}}({\mathbf{x}})=\int_{\mathbb{R}^n} {\rm e}^{-\mathsf{i}{\boldsymbol{\omega}}^T{\mathbf{x}}} {\rm d}{\it{\Lambda}}({\boldsymbol{\omega}}). \end{equation} (4.2) This expression implies the normalization $$|\kappa({\mathbf{x}},{\mathbf{y}})|\leq |\kappa({\mathbf{x}},{\mathbf{x}})|={\it{\Lambda}}(\mathbb{R}^{d})$$. Hence, without loss of generality, up to a scaling factor, we suppose $${\it{\Lambda}}(\mathbb{R}^{d})=1$$. This means in particular that $${\it{\Lambda}}$$ is a probability distribution on $$\mathbb{R}^d$$. 4.2 Hilbert space embedding of probability distributions Embeddings of probability distributions [94,100] are obtained by defining the Mean Map $$\varphi:{\mathcal{P}} \rightarrow {\mathcal{H}}$$: \begin{equation} \varphi(P)=\mathbb{E}_{{\mathbf{x}} \sim P}(\kappa({\mathbf{x}},.)). \end{equation} (4.3) Using the mean map we can define the following p.d. kernel $$\mathcal{K}(P,{Q})=\left\langle {\varphi(P)},{\varphi({Q})} \right\rangle_{\mathcal{H}}$$ over the set of probability distributions $${\mathcal{P}}$$. Note that this expression is also equivalent to the following definition [74,100]: $$\mathcal{K}(P,{Q})=\mathbb{E}_{{\mathbf{x}}\sim P,{\mathbf{y}}\sim {Q}}(\kappa({\mathbf{x}},{\mathbf{y}}))$$. We denote $$\gamma_{\kappa}\left({P},{Q}\right)=\|\varphi(P)-\varphi({Q})\|_{\mathcal{H}}$$ the pseudometric induced by the mean map on the set of probability distributions, often referred to as maximum mean discrepancy (MMD) [101]. Fukumizu et al. introduced in [54] the concept of characteristic kernel, that is a kernel $$\kappa$$ for which the map $$\varphi$$ is injective and the pseudometric $$\gamma_{\kappa}$$ is a true metric. Translation-invariant characteristic kernels are characterized by the following theorem [100]. Theorem 4.2 (Sriperumbudur et al. [100]) Assume that $$\kappa({\mathbf{x}},{\mathbf{y}})={\mathbf{K}}({\mathbf{x}}-{\mathbf{y}})$$, where $${\mathbf{K}}$$ is a positive definite function on $$\mathbb{R}^d$$. Then $$\kappa$$ is characteristic if and only if $$\text{supp}({\it{\Lambda}})=\mathbb{R}^d$$, where $${\it{\Lambda}}$$ is defined as (4.2). Many classical translation-invariant kernels (Gaussian, Laplacian, Cauchy, etc.) indeed exhibit a Fourier transform with a support that is equal to $$\mathbb{R}^d$$ and are therefore characteristic. This is also the case for all kernels corresponding to the proposed frequency distributions $${\it{\Lambda}}^{(.)}_{\boldsymbol{{\it{\Sigma}}}}$$ that we defined in Section 3.3. 4.3 From kernel design to the design of a frequency sampling pattern In the case of a translation-invariant kernel, the metric $$\gamma_{\kappa}$$ can be expressed as $$\gamma^2_{\kappa}\left({P},{Q}\right) = \gamma^2_{{\it{\Lambda}}}\left({P},{Q}\right)$$, where by abuse of notation, for a given frequency distribution $${\it{\Lambda}}$$, we introduce \begin{align} \gamma^2_{{\it{\Lambda}}}\left({P},{Q}\right) &:=\int_{\mathbb{R}^d}\left\lvert\psi_P({\boldsymbol{\omega}})-\psi_{Q}({\boldsymbol{\omega}})\right\rvert^2 {\rm d}{\it{\Lambda}}({\boldsymbol{\omega}}), \label{eq:chrcdiff} \end{align} (4.4) where we recall that $$\psi_P({\boldsymbol{\omega}})$$ is the characteristic function of $$P$$. The proof of Theorem 4.2 is actually based on this reformulation [100]. Hence, given $$m$$ frequencies $$({\boldsymbol{\omega}}_1,...,{\boldsymbol{\omega}}_m) \overset{{\rm i.i.d.}}{\sim} {\it{\Lambda}}$$ and the corresponding sketching operator $$\mathcal{A}$$ (2.7), we can expect that for large enough $$m$$, with high probability, \begin{align} \gamma^2_{{\it{\Lambda}}}\left({P},{Q}\right) = \mathbb{E}_{{\boldsymbol{\omega}}\sim{\it{\Lambda}}}|\psi_P({\boldsymbol{\omega}})-\psi_{Q}({\boldsymbol{\omega}})|^2 \approx \frac{1}{m} \sum_{j=1}^m |\psi_P({\boldsymbol{\omega}}_{j})-\psi_{Q}({\boldsymbol{\omega}}_{j})|^2 = \left\lVert{\mathcal{A} P - \mathcal{A} {Q}}\right\rVert_2^2 \label{eq:emp_norms}. \end{align} (4.5) Building on this relation between the metric $$\gamma^2_{{\it{\Lambda}}}\left({\cdot},{\cdot}\right)$$ and the sketching operators considered in this article, we next derive some connections with kernel design. We further exploit these connections to draw a theoretical analysis of the information preservation guarantees of the sketching operator $$\mathcal{A}$$ in Section 7. Traditional kernel design. Given some learning task, the selection of an appropriate kernel—known as kernel design—is a difficult, open problem, usually done by cross-validation. Even when using RFFs to leverage the computational gain of explicit embeddings (compared with potentially costly implicit kernels), the considered frequency distribution $${\it{\Lambda}}$$ is usually derived from a translation-invariant kernel $$\kappa$$ chosen in a parametric family endowed with closed-form expressions for both $$\kappa$$ and $${\it{\Lambda}}$$ [82,101]. A typical example is the Gaussian kernel (chosen for the simplicity of its Fourier transform), the bandwidth of which is often selected by cross-validation [98,101]. Spectral kernel design: designing the frequency sampling pattern. Learning an appropriate kernel by directly deriving a spectral distribution of frequencies for random Fourier features has been an increasingly popular idea in the last few years. In the field of usual reproducing kernels on finite-dimensional objects, researchers have explored the possibility of modifying the matrix of frequencies to obtain a better approximation quality [108] or to accelerate the computation of the kernel [69]. Both ideas have been exploited for learning an appropriate frequency distribution, often modeled as a mixture of Gaussians [76,107,110] or by optimizing weights over a combination of many distributions [93]. In the context of using the mean map with random features, learning a kernel has been often explored for the two-sample test problem, mainly based on the idea of maximizing the discriminative power of the MMD [98]. Similar to our approach, such methods often divide the database into two parts to learn the kernel then perform the hypothesis test [58,65], or are done in a streaming context [78]. Variants of the random Fourier features have also been used [35]. Compared to these methods, the approach proposed in Section 3.3 is relatively simple, fast to perform and based on an approximate theoretical expression of the MMD for clustered data instead of a thorough statistical analysis of the problem. In the spirit of our sketching framework, it only requires a tiny portion of the database and is performed in a completely unsupervised manner, which is quite different from most existing literature. Furthermore, in this article, we only consider traditional random Fourier features for translation-invariant kernels. Using more exotic kernels and/or adapting more advanced learning methods to our framework is left for future work. Still, in Section 5.4, we empirically show that the estimation of $$\bar\sigma^2$$ (Function $$\texttt{EstimMeanSigma}$$) is much faster than estimation by cross-validation, and that the adapted radius distribution $${\it{\Lambda}}^{(Ar)}_{\bar\sigma^2{\mathbf{I}}}$$ performs better than a traditional Gaussian kernel with optimized bandwidth. 5. Experiments with synthetic data To validate the proposed framework, the algorithms CL-OMP(R) (Algorithm 1) are first extensively tested against synthetic problems for which the true parameters of the GMM are known. Experiments on real data will be conducted in Section 6, with the additional analysis of Algorithm 2. The full Matlab code is available at [67]. 5.1 Generating the data When dealing with synthetic data for various settings, it is particularly difficult to generate problems that are ‘equally difficult’ when varying the dimension $$d$$ and the level of sparsity $$K$$. We opted for a simple heuristic to draw $$K$$ Gaussians neither too close nor too far from one another, given their variances. Since the $$d$$-dimensional ball with radius $$r$$ has volume $$D_d r^d$$ (with $$D_d$$ a constant depending on $$d$$), we consider that any Gaussian which is approximately isotropic with variance $$\sigma^2$$ (i.e., $$\mathcal{N}(0,\mathbf{{\it{\Sigma}}})$$ with $$\mathbf{{\it{\Sigma}}} \approx \sigma^{2}{\mathbf{I}}$$) ‘occupies’ a volume $$V_{\sigma^2} =\sigma^d V_1,$$ where $$V_1$$ is a reference volume for $$\sigma=1$$. Variances $$\sigma_{k,\ell}^2$$ are randomly drawn uniformly from $$0.25$$ to $$1.75$$, so that $$\mathbb{E}(\bar\sigma^2)=1$$. The means $${\boldsymbol{\mu}}$$ are chosen so that they lie in a ball sufficiently large to accommodate for a volume $$K \times V_{\bar\sigma^2}$$. We therefore choose $${\boldsymbol{\mu}} \sim \mathcal{N}(0,\sigma_{{\boldsymbol{\mu}}}^2 {\mathbf{I}})$$ with $$\sigma_{\boldsymbol{\mu}}$$ that verifies $$V_{\sigma_{{\boldsymbol{\mu}}}^2}=KV_{\bar\sigma^2}$$, which yields $$\sigma_{{\boldsymbol{\mu}}}=K^{\frac{1}{d}}\bar\sigma,$$ i.e., $$\sigma_{{\boldsymbol{\mu}}}=K^{\frac{1}{d}}$$ by considering the expected value of $$\bar\sigma$$. In practice, this choice seems to offer problems that are neither too elementary nor too difficult. 5.2 Evaluation measure To evaluate reconstruction performance when the true distribution $$p$$ of the data is known, one can resort to the classic Kullback–Leibler divergence $$D(P||{Q})$$ [41]. A symmetric version of the KL divergence is more comfortable to work with: $$d(P,\bar P)=D(P||\bar P)+D(\bar P || P)$$, where $$P$$ is the true distribution and $$\bar P$$ is the estimated one (we still refer to this measure as ‘KL-divergence’). In our framework, $$P$$ and $$\bar P$$ are GMMs with density functions denoted $${p}$$ and $$\bar {p}$$, hence as in [19] to estimate the KL-divergence in practice we draw $$({\mathbf{y}}_1,...,{\mathbf{y}}_{n'}) \overset{i.i.d.}{\sim} P$$ with $$n'=5\times 10^5$$ and compute \begin{equation} d(P,\bar P)\approx \frac{1}{n'} \sum_{i=1}^{n'} \left[\ln\left(\frac{{p}({\mathbf{y}}_i)}{\bar {p}({\mathbf{y}}_i)}\right) + \frac{\bar {p}({\mathbf{y}}_i)}{{p}({\mathbf{y}}_i)}\ln\left(\frac{\bar {p}({\mathbf{y}}_i)}{{p}({\mathbf{y}}_i)}\right)\right]\!. \end{equation} (5.1) 5.3 Basic set-up The basic set-up is the same for all following experiments, given data $$({\mathbf{x}}_1,...,{\mathbf{x}}_n) \in \mathbb{R}^d$$ and parameters $$m,K$$. We suppose the data to be approximately centered (see Section 5.1). First, unless specified otherwise (e.g., in Section 5.4), we draw frequencies according to an adapted radius distribution $${\it{\Lambda}}^{(Ar)}_{\bar\sigma^2{\mathbf{I}}}$$, using Algorithm 3 with parameters $$m_0=500$$, $$n_0=\min(n,5000)$$, $$c=30$$, $$T=5$$. The empirical sketch $$\hat{\mathbf{z}}$$ is then computed. The compressive algorithms are performed with their respective number of iterations. For Step 1 of CL-OMP(R), the gradient descent is initialized with a centered isotropic Gaussian with random variance6$$\sigma^2 \sim \mathcal{U}\left([0.5\bar\sigma^2;1.5\bar\sigma^2]\right)$$. Furthermore, during all optimization steps in CL-OMP(R) or Algorithm 2, we constrain all variances to be larger than a small $$10^{-15}$$ for numerical stability. All continuous optimization schemes in CL-OMP(R) are performed with Stephen Becker’s adaptation of the L-BFGS-B algorithm [23] in C, with Matlab wrappers [8]. We compare our results with an adaptation of the previous IHT algorithm [19,20] for isotropic Gaussians with fixed variances, in which all optimization steps have been straightforwardly adapted for our non-isotropic framework with relaxed variances. The IHT is performed with $$10$$ iterations, which is the default value in the original article. We use the VLFeat toolbox [104] to perform the EM algorithm with diagonal variances. The algorithm is repeated 10 times with random initializations and the result yielding the best log-likelihood is kept. Each run is limited to 100 iterations. All following reconstruction results are computed by taking the geometric mean over 50 experiments (i.e., the regular mean in logarithmic scale). 5.4 Results: role of the choice of frequency distribution In this section, we compare different choices of frequency distributions. We draw $$N=300\,000$$ items in two settings, respectively low and high dimensional: $$d=2,~K=3$$ and $$d=20,~K=5$$. In each setting, we construct the sketch with $$m=10K(2d+1)$$ frequencies (see Section 5.5). Reconstruction is performed with CL-OMPR. In Table 2, we compare the three frequency distributions introduced in Section 3, both with the oracle frequency distribution $${\it{\Lambda}}^{(.)}_{{\it{\Theta}}_{0}, {\boldsymbol{\alpha}}_{0}}$$ (i.e., using Function $$\texttt{DrawFreq}$$ with the true parameters of the GMM)—we remind the reader that this setting unrealistically assumes that the variances and weights of the GMM are known beforehand, and with the approximate one $${\it{\Lambda}}^{(.)}_{\bar\sigma^2 {\mathbf{I}}}$$ (Algorithm 3). The results show that the Gaussian frequency distribution indeed yields poor reconstruction results in high dimension ($$d=20$$), while the Adapted radius frequency distribution outperforms the two others. The use of the approximate $${\it{\Lambda}}^{(.)}_{\bar\sigma^2 {\mathbf{I}}}$$ instead of the oracle $${\it{\Lambda}}^{(.)}_{{\it{\Theta}}_{0}, {\boldsymbol{\alpha}}_{0}}$$ is shown to have little effect. Table 2 Log-KL divergence on synthetic data using the CL-OMPR algorithm, for $$m=10(2d+1)K$$ frequencies and $$N=300\ 000$$ items. We compare the three proposed frequency distributions: Gaussian [19] (G), folded Gaussian radius (FGr) or adapted radius (Ar), using either the oracle distribution defined inSection 3.3.2or the approximate distribution used in practice, learned with $$\texttt{EstimMeanSigma}$$ (G) (FGr) (Ar) $$d=2,~K=3$$ $${\it{\Lambda}}_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}^{(.)}$$ $$-9.30$$ $$-8.90$$ $$\mathbf{-9.38}$$ $${\it{\Lambda}}_{\bar\sigma^2{\mathbf{I}}}^{(.)}$$ $$-8.93$$ $$-8.67$$ $$\mathbf{-9.20}$$ $$d=20,~K=5$$ $${\it{\Lambda}}_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}^{(.)}$$ $$-$$$$3.15$$ $$-6.17$$ $$\mathbf{-6.48}$$ $${\it{\Lambda}}_{\bar\sigma^2{\mathbf{I}}}^{(.)}$$ $$-$$$$3.41$$ $$-5.80$$ $$\mathbf{-6.32}$$ (G) (FGr) (Ar) $$d=2,~K=3$$ $${\it{\Lambda}}_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}^{(.)}$$ $$-9.30$$ $$-8.90$$ $$\mathbf{-9.38}$$ $${\it{\Lambda}}_{\bar\sigma^2{\mathbf{I}}}^{(.)}$$ $$-8.93$$ $$-8.67$$ $$\mathbf{-9.20}$$ $$d=20,~K=5$$ $${\it{\Lambda}}_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}^{(.)}$$ $$-$$$$3.15$$ $$-6.17$$ $$\mathbf{-6.48}$$ $${\it{\Lambda}}_{\bar\sigma^2{\mathbf{I}}}^{(.)}$$ $$-$$$$3.41$$ $$-5.80$$ $$\mathbf{-6.32}$$ Table 2 Log-KL divergence on synthetic data using the CL-OMPR algorithm, for $$m=10(2d+1)K$$ frequencies and $$N=300\ 000$$ items. We compare the three proposed frequency distributions: Gaussian [19] (G), folded Gaussian radius (FGr) or adapted radius (Ar), using either the oracle distribution defined inSection 3.3.2or the approximate distribution used in practice, learned with $$\texttt{EstimMeanSigma}$$ (G) (FGr) (Ar) $$d=2,~K=3$$ $${\it{\Lambda}}_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}^{(.)}$$ $$-9.30$$ $$-8.90$$ $$\mathbf{-9.38}$$ $${\it{\Lambda}}_{\bar\sigma^2{\mathbf{I}}}^{(.)}$$ $$-8.93$$ $$-8.67$$ $$\mathbf{-9.20}$$ $$d=20,~K=5$$ $${\it{\Lambda}}_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}^{(.)}$$ $$-$$$$3.15$$ $$-6.17$$ $$\mathbf{-6.48}$$ $${\it{\Lambda}}_{\bar\sigma^2{\mathbf{I}}}^{(.)}$$ $$-$$$$3.41$$ $$-5.80$$ $$\mathbf{-6.32}$$ (G) (FGr) (Ar) $$d=2,~K=3$$ $${\it{\Lambda}}_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}^{(.)}$$ $$-9.30$$ $$-8.90$$ $$\mathbf{-9.38}$$ $${\it{\Lambda}}_{\bar\sigma^2{\mathbf{I}}}^{(.)}$$ $$-8.93$$ $$-8.67$$ $$\mathbf{-9.20}$$ $$d=20,~K=5$$ $${\it{\Lambda}}_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}^{(.)}$$ $$-$$$$3.15$$ $$-6.17$$ $$\mathbf{-6.48}$$ $${\it{\Lambda}}_{\bar\sigma^2{\mathbf{I}}}^{(.)}$$ $$-$$$$3.41$$ $$-5.80$$ $$\mathbf{-6.32}$$ In Table 3, we compare the proposed Adapted radius frequency distribution with a frequency distribution $${\it{\Lambda}}=\mathcal{N}(0,{\mathbf{I}}\sigma^{-2}_{\rm best})$$ corresponding to a Gaussian kernel $$\kappa({\mathbf{x}},{\mathbf{y}}) = \exp\left(-\tfrac{\|{\mathbf{x}}-{\mathbf{y}}\|_2^2}{2\sigma^2_{\rm best}}\right)$$ [82,101], where the bandwidth $$\sigma^2_{\rm best}$$ is selected among $$15$$ values exponentially spread from $$10^{-1}$$ to $$10^2$$ to yield the best reconstruction results in each setting, which is akin to more traditional strategies where the kernel is selected by supervised learning procedures. It is seen that the adapted radius distribution $${\it{\Lambda}}^{(Ar)}_{\bar\sigma^2}$$ outperforms the Gaussian kernel in each case, and the estimation of the parameter $$\bar\sigma^2$$ is significantly faster than the tedious learning of the bandwidth $$\sigma^2_{\rm best}$$. Table 3 Log-KL divergence results on synthetic data using the CL-OMPR algorithm, for $$m=10(2d+1)K$$ frequencies and $$n=300\ 000$$ items, comparing the proposed adapted radius distribution $${\it{\Lambda}}^{(Ar)}_{\bar\sigma^2{\mathbf{I}}}$$ and a frequency distribution $${\it{\Lambda}}^{(Gk)}_{\sigma^2_{\rm best}}$$ corresponding to a Gaussian kernel (Gk) [82,101], with a bandwidth $$\sigma_{\rm best}^2$$ selected among $$15$$ values exponentially spread from $$10^{-1}$$ to $$10^2$$ yielding the best result in each case. Estimation times for parameters $$\bar\sigma^2$$ and $$\sigma^2_{\rm best}$$ are also given Reconstruction result Est. time $$d=2,~K=3$$ Adapted radius (proposed) $$\mathbf{-9.38}$$ ($$\bar\sigma^2$$) $$1.02s$$ Gaussian kernel [82,101] $$-9.14$$ ($$\sigma_{best}^2$$) $$23.10s$$ $$d=20,~K=5$$ Adapted radius (proposed) $$\mathbf{-6.32}$$ ($$\bar\sigma^2$$) $$0.98s$$ Gaussian kernel [82,101] $$-5.18$$ ($$\sigma_{best}^2$$) $$74.10s$$ Reconstruction result Est. time $$d=2,~K=3$$ Adapted radius (proposed) $$\mathbf{-9.38}$$ ($$\bar\sigma^2$$) $$1.02s$$ Gaussian kernel [82,101] $$-9.14$$ ($$\sigma_{best}^2$$) $$23.10s$$ $$d=20,~K=5$$ Adapted radius (proposed) $$\mathbf{-6.32}$$ ($$\bar\sigma^2$$) $$0.98s$$ Gaussian kernel [82,101] $$-5.18$$ ($$\sigma_{best}^2$$) $$74.10s$$ Table 3 Log-KL divergence results on synthetic data using the CL-OMPR algorithm, for $$m=10(2d+1)K$$ frequencies and $$n=300\ 000$$ items, comparing the proposed adapted radius distribution $${\it{\Lambda}}^{(Ar)}_{\bar\sigma^2{\mathbf{I}}}$$ and a frequency distribution $${\it{\Lambda}}^{(Gk)}_{\sigma^2_{\rm best}}$$ corresponding to a Gaussian kernel (Gk) [82,101], with a bandwidth $$\sigma_{\rm best}^2$$ selected among $$15$$ values exponentially spread from $$10^{-1}$$ to $$10^2$$ yielding the best result in each case. Estimation times for parameters $$\bar\sigma^2$$ and $$\sigma^2_{\rm best}$$ are also given Reconstruction result Est. time $$d=2,~K=3$$ Adapted radius (proposed) $$\mathbf{-9.38}$$ ($$\bar\sigma^2$$) $$1.02s$$ Gaussian kernel [82,101] $$-9.14$$ ($$\sigma_{best}^2$$) $$23.10s$$ $$d=20,~K=5$$ Adapted radius (proposed) $$\mathbf{-6.32}$$ ($$\bar\sigma^2$$) $$0.98s$$ Gaussian kernel [82,101] $$-5.18$$ ($$\sigma_{best}^2$$) $$74.10s$$ Reconstruction result Est. time $$d=2,~K=3$$ Adapted radius (proposed) $$\mathbf{-9.38}$$ ($$\bar\sigma^2$$) $$1.02s$$ Gaussian kernel [82,101] $$-9.14$$ ($$\sigma_{best}^2$$) $$23.10s$$ $$d=20,~K=5$$ Adapted radius (proposed) $$\mathbf{-6.32}$$ ($$\bar\sigma^2$$) $$0.98s$$ Gaussian kernel [82,101] $$-5.18$$ ($$\sigma_{best}^2$$) $$74.10s$$ We conduct an experiment to determine how robust is the isotropic distribution of frequencies $${\it{\Lambda}}^{(Ar)}_{\bar{\sigma}^2{\mathbf{I}}}$$ when treating strongly non-isotropic data. We first generate a GMM in dimension $$d=10$$ with $$K=10$$ components, with the process described in Section 5.3, but with identity covariances. Then, the first five dimensions of the parameters are divided by a factor $$A>0$$, meaning that: for $$\ell=1,...,5$$ and $$k=1,...,K$$, we do $$\mu_{k,\ell} \leftarrow \mu_{k,\ell}/A$$ and $$\sigma^2_{k,\ell} \leftarrow \sigma_{k,\ell}^2/A^2$$. It simulates data that do not have the same range in each dimension (Fig. 3, top). In Fig. 3 (bottom), for increasing values of $$A$$ we compare CL-OMPR with the learned frequency distribution $${\it{\Lambda}}^{(Ar)}_{\bar{\sigma}^2{\mathbf{I}}}$$, CL-OMPR using the ideal frequency distribution $${\it{\Lambda}}^{(Ar)}_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}$$, and EM. It is indeed seen that, in terms of KL divergence (left), the results produced by CL-OMPR with the learned frequency distribution deteriorates as the non-isotropy of the GMM increases. On the contrary, CL-OMPR with the oracle frequency distribution and EM do not seem to be affected. Hence, in the first case, the problem lies with the learned frequency distribution and not the CL-OMPR algorithm itself. To further confirm this fact, we examine the MMD $$\gamma_{\it{\Lambda}}$$ with the corresponding frequency distribution $${\it{\Lambda}}$$ in each case. It is seen that CL-OMPR with the learned frequency distribution performs as well as the oracle one, meaning that the MMD defined with an isotropic choice of frequencies $$\gamma_{{\it{\Lambda}}^{(Ar)}_{\bar{\sigma}^2{\mathbf{I}}}}$$ is indeed not adapted for strongly non-isotropic problems and that, despite the ability of CL-OMPR to approximately minimize the cost function (2.4) (which approximates the MMD, see (4.5)), it does not produce good results. Fig. 3. View largeDownload slide Top: isotropic GMM in dimension $$n=10$$ with $$K=10$$ components displayed along the first two dimensions (left), the same GMM with one dimension divided by $$A=5$$ (right). Bottom: reconstruction results for the KL-divergence (left) or MMD (right) with $$n=2.10^5$$ items, using $$m=2000$$ frequencies for CL-OMPR, with respect to the coefficient $$A$$. Fig. 3. View largeDownload slide Top: isotropic GMM in dimension $$n=10$$ with $$K=10$$ components displayed along the first two dimensions (left), the same GMM with one dimension divided by $$A=5$$ (right). Bottom: reconstruction results for the KL-divergence (left) or MMD (right) with $$n=2.10^5$$ items, using $$m=2000$$ frequencies for CL-OMPR, with respect to the coefficient $$A$$. In that particular case, the problem could be potentially resolved by a whitening of the data, e.g., by computing the empirical covariance of the fraction of the database used for the frequency distribution design phase and multiplying each data point by its inverse during the sketching phase. However, there are of course cases where the proposed methods would be further challenged, for instance if components are flat in a dimension, but far apart: the global covariance of the data along that dimension would be large, even if the variance of each Gaussian components is small (these cases are however rarely encountered in practice). As mentioned before (see Section 4.3), another solution would be to use more advanced methods for designing the MMD $$\gamma_{\it{\Lambda}}$$ than the proposed simple one. Overall, we leave treatment of strongly non-isotropic data for future work. Nevertheless, from now on, all experiments are performed with an approximate adapted radius distribution $${\it{\Lambda}}^{(Ar)}_{\bar\sigma^2 {\mathbf{I}}}$$. 5.5 Results: number of measurements We now evaluate the quality of the reconstruction with respect to the number $$m$$ of frequencies, for varying dimension $$d$$ and number of components $$K$$ in the GMM (Fig. 4). Fig. 4. View largeDownload slide Log KL-divergence reconstruction results for CL-OMPR with respect to the normalized number of frequencies $$m/((2d+1)K)$$ and the dimension $$d$$ (left) or number of components $$K$$ (right), using $$n=300\ 000$$ items. On the left $$K=10$$, and on the right $$d=10$$. Fig. 4. View largeDownload slide Log KL-divergence reconstruction results for CL-OMPR with respect to the normalized number of frequencies $$m/((2d+1)K)$$ and the dimension $$d$$ (left) or number of components $$K$$ (right), using $$n=300\ 000$$ items. On the left $$K=10$$, and on the right $$d=10$$. It is seen that the KL divergence exhibits a sharp phase transition with respect to $$m$$, whose occurrence seems to be proportional to the ‘dimension’ of the problem, i.e., the number of free parameters $$(2d+1)K$$. This phenomenon is akin to usual compressive sensing. In light of this observation, the number of frequencies in all following experiments is chosen as $$m=5(2d+1)K$$, beyond the empirically obtained phase transition. 5.6 Results: comparison of the algorithms We compare the compressive algorithms and EM in terms of reconstruction, computation time and memory usage, with respect to the number of samples $$N$$. Precision of the reconstruction. In Fig. 5, KL-divergence reconstruction results are shown for EM and all compressed algorithms: IHT [19], CL-OMP (Algorithm 1 with $$T=K$$), CL-OMPR (Algorithm 1 with $$T=2K$$) and Algorithm 2. We consider two settings, one with low $$K=5$$ (left) and one with high $$K=20$$ (right), in dimension $$n=10$$. The number of measurements is set at $$m=5(2d+1)K$$ in each case. Fig. 5. View largeDownload slide Reconstruction results on synthetic data in dimension $$d=10$$, with $$K=5$$ component (left) or $$K=20$$ (right), and number of frequencies $$m=5(2d+1)K$$, with respect to the number of items in the database $$n$$. Fig. 5. View largeDownload slide Reconstruction results on synthetic data in dimension $$d=10$$, with $$K=5$$ component (left) or $$K=20$$ (right), and number of frequencies $$m=5(2d+1)K$$, with respect to the number of items in the database $$n$$. With few Gaussians ($$K=5$$), all compressive algorithms yield similar results, close to those achieved by EM. The precision of the reconstruction is seen to improve steadily with the size $$n$$ of the database. With more Gaussians ($$K=20$$), CL-OMPR clearly outperforms the other compressive algorithms and even outperforms EM for very large $$n$$. The IHT algorithm [19] adapted to non-isotropic variances often fails to recover a satisfying GMM. Indeed, IHT fills the support of the GMM at the very first iteration and is seen to converge toward a local minimum of (2.4), in which all Gaussians in the GMM are equal to the same large Gaussian that encompasses all data. Note that this situation could not happen in [19], where all Gaussians have fixed, known variance. Computation time and memory. In Fig. 6, computation time and memory usage of the compressive methods and EM are presented with respect to the database size $$n$$, using an Intel Core i7-4600U 2.1 GHz CPU with 8 GB of RAM. In terms of time complexity (respectively memory usage), the EM algorithm scales in $$\mathcal{O}(dnKT_{EM})$$ for a fixed number of iterations $$T_{EM}=100$$ (respectively $$\mathcal{O}(nd)$$). The CL-OMP or CL-OMPR algorithms scale in $$\mathcal{O}(mdK^2)$$ (respectively $$\mathcal{O}(md)$$), while Algorithm 2 scales in $$\mathcal{O}(mdK\log K)$$ (respectively the same $$\mathcal{O}(md)$$). Fig. 6. View largeDownload slide Memory (left) and time (right) usage of all algorithms on synthetic data with dimension $$n=10$$, number of components $$K=5$$, and number of frequencies $$m=5(2d+1)K$$, with respect to the number of items in the database $$n$$. On the left, ‘sketching’ refers to the time of computing the sketch with a naive, direct computation, which must be added to the computation time of the recovery algorithm (that does not vary with $$n$$) to obtain the total computation time of the proposed method. However, the reader must keep in mind that the sketching step can be massively parallelized, is adapted to streams of data and so on. Fig. 6. View largeDownload slide Memory (left) and time (right) usage of all algorithms on synthetic data with dimension $$n=10$$, number of components $$K=5$$, and number of frequencies $$m=5(2d+1)K$$, with respect to the number of items in the database $$n$$. On the left, ‘sketching’ refers to the time of computing the sketch with a naive, direct computation, which must be added to the computation time of the recovery algorithm (that does not vary with $$n$$) to obtain the total computation time of the proposed method. However, the reader must keep in mind that the sketching step can be massively parallelized, is adapted to streams of data and so on. At large $$n$$ the EM algorithm indeed becomes substantially slower than all compressive methods (Fig. 6, left). We also keep in mind that we compare a MATLAB implementation of the compressive methods with a state-of-the-art C++ implementation of EM [104]. Similarly, at large $$n$$ the compressive algorithms outperform EM by several orders of magnitude in terms of memory usage (Fig. 6, right). We discussed the computational cost of the sketching operation in Section 2.7 and the possible use of parallelization and distributed computing. In our settings, even when it is done linearly this operation is still faster than EM (Fig. 6, left). 6. Large-scale proof of concept: speaker verification GMMs are popular for their capacity to smoothly approximate any distribution [87] by a large number of Gaussians. This is often the case with real data, and the problem of fitting a large GMM to data drawn from some distribution is somewhat different from that of clustering data and identifying reasonably well-separated components, as presented in the previous section. To try out compressive methods on this challenging task, we test them on a speaker verification problem, with a classical approach requiring GMM referred to as universal background model (GMM-UBM) [86]. 6.1 Overview of speaker verification Given a fragment of speech and a candidate speaker, the goal is to assess whether the fragment was indeed spoken by that person. We quickly describe GMM-UBM in this section. For more details, we refer the reader to the original article [86]. Similar to many speech processing tasks, this approach uses Mel Frequency Cepstrum Coefficients (MFCC) and their derivatives ($${\it{\Delta}}$$-MFCC) as features $${\mathbf{x}}_{i}$$. Those features have been often modeled with GMMs or more advanced Markov models. However, in our framework, we do not use $${\it{\Delta}}$$-MFCC; indeed, these coefficients typically have a negligible range in dynamic compared with the MFCC, which results in a difficult and unstable choice of frequencies. As mentioned before, this problem may be potentially solved by a pre-whitening of the data, which we leave for future work. In this configuration, the speaker verification results will indeed not be state of the art, but our goal is mainly to test our compressive approach on a different type of problem than that of clustering synthetic data, for which we have already observed excellent results. In the GMM-UBM model, each speaker $$S$$ is represented by one GMM $$({\it{\Theta}}_S,{\boldsymbol{\alpha}}_S)$$. The key point is the introduction of a model $$({\it{\Theta}}_{UBM},{\boldsymbol{\alpha}}_{UBM})$$ that represents a ‘generic’ speaker, referred to as UBM. Given the speech data $$\mathcal{X}$$ and a candidate speaker $$S$$, the statistic used for hypothesis testing is a likelihood ratio between the speaker and the generic model: \begin{equation} T(\mathcal{X})=\frac{{p}_{{\it{\Theta}}_{S},{\boldsymbol{\alpha}}_{S}}(\mathcal{X})}{{p}_{{\it{\Theta}}_{\rm UBM},{\boldsymbol{\alpha}}_{\rm UBM}}(\mathcal{X})}. \end{equation} (6.1) If $$T(\mathcal{X})$$ exceeds a threshold $$\tau$$, the data $$\mathcal{X}$$ are considered as being uttered by the speaker $$S$$. The GMMs corresponding to each speaker must somehow be ‘comparable’ to each other and to the UBM. Therefore, the UBM is learned prior to individual speaker models, using a large database of speech data uterred by many speakers. Then, given the training data $$\mathcal{X}_S$$ specific to one speaker, one M-step from the EM algorithm initialized with the UBM is used to adapt the UBM and derive the model $$({\it{\Theta}}_S,{\boldsymbol{\alpha}}_S)$$. We refer the reader to [86] for more details on this procedure. In our framework, the EM or compressive estimation algorithms are used to learn the UBM. We note that this type of signal processing task may fully benefit from the advantages of the sketch structure described in Section 2.7. For instance, in practice one can imagine collecting bit by bit the data to train the UBM in a real-life environment, in which case the sketch and the UBM may be progressively updated without having to keep the spoken fragments, possibly of sensitive nature. 6.2 Set-up The experiments were performed on the classical NIST05 speaker verification database. Both training and testing fragments are 5-min conversations between two speakers. The database contains approximately 650 speakers and 30 000 trials. The MFCCs are computed using the Voicebox toolbox [21]. After filtering the audio data by a speech activity detector, the MFCCs are computed on 23 ms frames with a $$50\%$$ overlap. The first coefficient is removed and we obtain 12-dimensional features ($$d=12$$). Results are presented by choosing the threshold $$\tau$$ that yields the same rates of false alarm and missed detection, referred to as equal error rate (EER). Each result is obtained as the mean of five experiments. In all experiments, except when indicated otherwise, the compressive methods are performed using a sketch obtained by compressing the entire database of $$n=2.10^8$$ MFCC vectors after voice activity detection. The compression is performed taking advantage of distributed computing, by dividing the database into $$200$$ parts that are then compressed simultaneously on a computer cluster. Hence, even for a high number of frequencies $$m=10^5$$ the compression of the $$n=2.10^8$$ items takes less than an hour. 6.3 Results Splitting algorithm. In the previous section, Algorithm 2 was observed to be less accurate than CL-OMPR (Fig. 5). However, as mentioned before, the estimation problem considered here is somehow not to identify well-separated components, but rather to fit a GMM with a large number of components to a smooth probability density. In the first case, on synthetic data, Algorithm 2 is indeed expected to sometimes yield poor results: unlike an MP-based approach such as CL-OMPR, at each iteration it locally divides the current Gaussians rather than ‘exploring’ elsewhere. In the second case, however, Algorithm 2 may yield a correct approximation of the smooth density, by successively approaching it with GMMs at increasingly finer scales. In Table 4, we compare the results obtained with CL-OMPR and Algorithm 2 on the speaker verification task using $$K=64$$ Gaussians in the UBM. Results are indeed similar when the number of frequencies $$m$$ is large, and even surprisingly better with Algorithm 2 for a low number of frequencies $$m=1000$$. Naturally, Algorithm 2 is much faster than CL-OMPR, with more than a $$10$$ times speed-up. Table 4 Comparison between CL-OMPR and Algorithm 2 for speaker verification, with $$K=64$$ EER (%) Time (s) CL-OMPR Algorithm 2 CL-OMPR Algorithm 2 $$m=10^3$$ $$40.3$$ $$32.5$$ $$7.10^2$$ $$5.10$$$$^{2}$$ $$m=10^4$$ $$29.4$$ $$29.0$$ $$7.10^3$$ $$5.10^2$$ $$m=10^5$$ $$28.8$$ $$28.6$$ $$7.10^4$$ $$5.10^3$$ EER (%) Time (s) CL-OMPR Algorithm 2 CL-OMPR Algorithm 2 $$m=10^3$$ $$40.3$$ $$32.5$$ $$7.10^2$$ $$5.10$$$$^{2}$$ $$m=10^4$$ $$29.4$$ $$29.0$$ $$7.10^3$$ $$5.10^2$$ $$m=10^5$$ $$28.8$$ $$28.6$$ $$7.10^4$$ $$5.10^3$$ Table 4 Comparison between CL-OMPR and Algorithm 2 for speaker verification, with $$K=64$$ EER (%) Time (s) CL-OMPR Algorithm 2 CL-OMPR Algorithm 2 $$m=10^3$$ $$40.3$$ $$32.5$$ $$7.10^2$$ $$5.10$$$$^{2}$$ $$m=10^4$$ $$29.4$$ $$29.0$$ $$7.10^3$$ $$5.10^2$$ $$m=10^5$$ $$28.8$$ $$28.6$$ $$7.10^4$$ $$5.10^3$$ EER (%) Time (s) CL-OMPR Algorithm 2 CL-OMPR Algorithm 2 $$m=10^3$$ $$40.3$$ $$32.5$$ $$7.10^2$$ $$5.10$$$$^{2}$$ $$m=10^4$$ $$29.4$$ $$29.0$$ $$7.10^3$$ $$5.10^2$$ $$m=10^5$$ $$28.8$$ $$28.6$$ $$7.10^4$$ $$5.10^3$$ Sketching a large database. In Table 5, we compare the EER results when using either $$n_1=3.10^5$$ items uniformly selected in the database to cover all speakers or all $$n_2=2.10^8$$ items in the database. The compressive Algorithm 2 is performed at both scales, while EM is only performed with $$n_1$$ items, since the whole database is too large to be handled by the VLFeat toolbox on a machine with 8 GB of RAM. For the compressive approach, the use of the entire database indeed improves the results when compared with using only $$n_1$$ items to compute the sketch. At low $$K=8$$ or $$K=64$$ and high number of frequencies $$m$$, the compressive approach using $$n_2$$ items outperforms EM using only $$n_1$$ items. Table 5 Comparison EM and Algorithm 2 for speaker verification, in terms of EER. For Algorithm 2, results that outperform these of EM are outlined $$K=8$$ $$K=64$$ $$K=512$$ $$n_1$$ $$n_2$$ $$n_1$$ $$n_2$$ $$n_1$$ $$n_2$$ EM $$31.4$$ n/a $$29.5$$ n/a $$27.5$$ n/a Algorithm 2 $$m=10^3$$ $$32.5$$ $$\mathbf{31.2}$$ $$31.1$$ $$32.5$$ $$31.2$$ $$29.4$$ $$m=10^4$$ $$32.1$$ $$\mathbf{30.7}$$ $$30.2$$ $$\mathbf{29.0}$$ $$30.3$$ $$29.1$$ $$m=10^5$$ $$32.5$$ $$\mathbf{30.7}$$ $$29.8$$ $$\mathbf{28.6}$$ $$29.4$$ $$29.2$$ $$K=8$$ $$K=64$$ $$K=512$$ $$n_1$$ $$n_2$$ $$n_1$$ $$n_2$$ $$n_1$$ $$n_2$$ EM $$31.4$$ n/a $$29.5$$ n/a $$27.5$$ n/a Algorithm 2 $$m=10^3$$ $$32.5$$ $$\mathbf{31.2}$$ $$31.1$$ $$32.5$$ $$31.2$$ $$29.4$$ $$m=10^4$$ $$32.1$$ $$\mathbf{30.7}$$ $$30.2$$ $$\mathbf{29.0}$$ $$30.3$$ $$29.1$$ $$m=10^5$$ $$32.5$$ $$\mathbf{30.7}$$ $$29.8$$ $$\mathbf{28.6}$$ $$29.4$$ $$29.2$$ Table 5 Comparison EM and Algorithm 2 for speaker verification, in terms of EER. For Algorithm 2, results that outperform these of EM are outlined $$K=8$$ $$K=64$$ $$K=512$$ $$n_1$$ $$n_2$$ $$n_1$$ $$n_2$$ $$n_1$$ $$n_2$$ EM $$31.4$$ n/a $$29.5$$ n/a $$27.5$$ n/a Algorithm 2 $$m=10^3$$ $$32.5$$ $$\mathbf{31.2}$$ $$31.1$$ $$32.5$$ $$31.2$$ $$29.4$$ $$m=10^4$$ $$32.1$$ $$\mathbf{30.7}$$ $$30.2$$ $$\mathbf{29.0}$$ $$30.3$$ $$29.1$$ $$m=10^5$$ $$32.5$$ $$\mathbf{30.7}$$ $$29.8$$ $$\mathbf{28.6}$$ $$29.4$$ $$29.2$$ $$K=8$$ $$K=64$$ $$K=512$$ $$n_1$$ $$n_2$$ $$n_1$$ $$n_2$$ $$n_1$$ $$n_2$$ EM $$31.4$$ n/a $$29.5$$ n/a $$27.5$$ n/a Algorithm 2 $$m=10^3$$ $$32.5$$ $$\mathbf{31.2}$$ $$31.1$$ $$32.5$$ $$31.2$$ $$29.4$$ $$m=10^4$$ $$32.1$$ $$\mathbf{30.7}$$ $$30.2$$ $$\mathbf{29.0}$$ $$30.3$$ $$29.1$$ $$m=10^5$$ $$32.5$$ $$\mathbf{30.7}$$ $$29.8$$ $$\mathbf{28.6}$$ $$29.4$$ $$29.2$$ Limitations due to coherence. Although increasing the number of components $$K$$ seems to consistently improve the results of EM, it is not the case with the compressive method for a fixed sketch size $$m$$. A possible intuitive explanation could be that by increasing the number of components, we also increase the coherence between them—i.e., the Gaussians in the GMM are increasingly overlapping each other—which makes it more and more difficult to handle for any sparsity-based approach. In practice, it results in many components in the GMM having weights $$\alpha\approx 0$$. In other words, the algorithm outputs a $$K'$$-GMM with $$K' < K$$: there seems to be a ‘limit’ number of components above which additional Gaussians are useless. It may be possible to deal with a higher level of sparsity by drastically increasing the number of frequencies $$m$$, at the cost of higher compression and estimation times. Number of components $$K$$ and compression. In Fig. 7, we study the effect of $$m$$ for various number of components $$K=8$$, $$64$$ and $$512$$. In each case, we observe a sharp phase transition going from an EER of $$50 \%$$, which corresponds to random guessing, to the results observed in Table 5. Somehow surprisingly, this phase transition does not seem to depend on $$K$$, unlike the one observed on synthetic data (Fig. 4). As mentioned before, it could be interesting to drastically increase $$m$$ to see whether the gap between results obtained EM and those obtained with Algorithm 2 can be bridged in the $$K=512$$ case; however, the phase transition pattern does not support this idea, but rather a limitation of the method itself, maybe in the algorithmic approach. Fig. 7. View largeDownload slide Effect of the number of frequencies $$m$$ on speaker verification results. Algorithm 2 is performed on the whole database with $$n_2=2\times 10^8$$ items, while EM can only be performed on $$n_1=3\times 10^5$$ items. Fig. 7. View largeDownload slide Effect of the number of frequencies $$m$$ on speaker verification results. Algorithm 2 is performed on the whole database with $$n_2=2\times 10^8$$ items, while EM can only be performed on $$n_1=3\times 10^5$$ items. Overall, the results on synthetic and real data show that the fitting problem is, as expected, more challenging than the clustering problem for the proposed sparsity-based approach. Indeed, while the clustering problem (synthetic data) is that of identifying well-separated components of a sparse distribution, the fitting problem is similar to a sparse approximation task, which is known to be challenging when the ‘signal’ (i.e., the true distribution of the data) is not sparse. Nevertheless, let us point out that in Fig. 7, results approaching those of EM are obtained for $$m=3000$$ frequencies only, which corresponds to a whopping $$33\,000$$-fold compression of the database. 7. Information preservation guarantees? In this section, we derive a number of information preservation guarantees for the proposed sketching operator. Let us come back to the ‘generic’ compressive learning inverse problem introduced in Section 1: \begin{equation}\label{eq:problem} \underset{P\in {\it{\Sigma}}}{\text{argmin}}~\left\lVert{\hat{\mathbf{z}}-\mathcal{A} P}\right\rVert_2\!, \end{equation} (7.1) where $${\it{\Sigma}}$$ is some ‘low-dimensional’ model. Compressive estimation algorithms such as CL-OMP(R) (Algorithm 1) or the more scalable Algorithm 2 (specifically designed for GMM) seek an approximate solution to this problem in the case $${\it{\Sigma}}={\mathcal{G}_{K}}$$, with $${\mathcal{G}_{K}}$$ the set of $$K$$-sparse distributions in $$\mathcal{G}$$. Precise recovery guarantees for CL-OMP(R) or Algorithm 2 are beyond the scope of this article due to the random nature of several steps in these algorithms and the many non-convex optimization schemes that they contain. Instead, we rather demonstrated empirically in Sections 5 and 6 that these algorithms perform well on a large range of GMM estimation problems, with synthetic and real data. In parallel, a fundamental question consists in asking whether the problem is well-posed, i.e., if a potential solution of (7.1) is somehow guaranteed to be a ‘good’ estimate of the distribution of the data. Namely, we ask the following questions: For a distribution $$P_{\it{\Sigma}} \in {\it{\Sigma}}$$, does the sketch $${\mathbf{z}}=\mathcal{A} P_{{\it{\Sigma}}}$$ contain ‘enough’ information to retrieve the distribution $$P_{\it{\Sigma}}$$? Is this retrieval robust to using the empirical sketch $$\hat{\mathbf{z}}$$ instead of the true sketch $${\mathbf{z}}$$? Is this retrieval robust if $${\mathbf{z}} = \mathcal{A} P$$ where the encoded distribution $$P$$ is not exactly in the model $${\it{\Sigma}}$$, but only well approximated by a distribution in the model (i.e., the distance $$d(P,{\it{\Sigma}})=\inf_{P_{\it{\Sigma}} \in {\it{\Sigma}}}d(P,P_{\it{\Sigma}})$$ is small for some metric $$d$$)? The answers to these questions are linked to the existence of a so-called instance-optimal decoder [17,36], i.e., a (not necessarily tractable) reconstruction paradigm that is robust to noise and modeling error. In this article, we show that with high probability, the decoder induced by solving (7.1) is instance optimal with respect to the MMD metric [101] (Equation (4.4)), provided the model and frequency distribution satisfy some assumptions. We then prove this result for GMMs with bounded parameters, starting with the toy example of single Gaussians ($$K=1$$). The proof of our main theorem (Theorem 7.1), given in Appendix B, introduces variants of usual tools in compressive sensing. The idea is to use the fact that a lower restricted isometry property (LRIP) induces the existence of a robust instance optimal decoder, as shown by Bourrier et al. [17]. To prove the LRIP for the measurement operator $$\mathcal{A}$$, we use the fact that the empirical mean $$\|\mathcal{A} P - \mathcal{A} {Q}\|_2^2$$ concentrates around its expectation $$\gamma^2_{\it{\Lambda}}\left({P},{Q}\right)$$ and use $$\epsilon$$-coverings to extend this concentration result uniformly over the whole model—this method is similar to the ‘simple proof’ of the RIP developed by Baraniuk et al. [6]. As we will see, the results are still in a preliminary state and suboptimal in some cases. In particular, they do not permit yet to ascertain that the GMM learning method is successful with high probability with a limited sketch size as observed in practice, and prove instead that the method is robust and stable when the sketch size is of the order of the size of the original database (which is fortunately not necessary in practice). However, we believe that the use of compressive sensing tools in the context of kernel mean embedding and random features is a promising lead for future work, as is the introduction of guarantees such as robustness to modeling error for a GeMM problem. This connection can be seen as the main theoretical contribution of this article. 7.1 Existence of instance optimal decoders for sketched distributions In this section, we formulate a general result that guarantees robust decoding of any model $${\it{\Sigma}}$$ (not necessarily restricted to mixture models) under some hypotheses. The reader should refer to Appendix A for definitions related to $$\epsilon$$-coverings. Assumption A1 $$({\it{\Sigma}})$$ (Compactness of the model) The model $${\it{\Sigma}}$$ is compact with respect to the total variation norm $$\|.\|_{TV}$$. In particular, it implies that it has finite covering numbers $$N_{{\it{\Sigma}},\|.\|_{TV}}(\epsilon) < \infty$$. In the context of mixture models, the following Lemma shows that compactness of any basic set of distributions $$\mathcal{G}$$ extends to its set of $$K$$-sparse distributions $${\mathcal{G}_{K}}$$ and that their covering numbers are related. Its proof is given in Appendix C.1. Lemma 7.1 Suppose the set of basic distributions $$\mathcal{G}$$ is compact with respect to some norm $$\|.\|$$, denote $$C=\max_{P\in\mathcal{G}}\|P\|$$. Then, for all $$K$$ the set of $$K$$-sparse distributions $${\mathcal{G}_{K}}$$ is also compact and satisfies, for all $$\epsilon>0$$ and $$0<\tau<1$$, \begin{equation} \label{eq:covnummix} N_{{\mathcal{G}_{K}},\|.\|}(\epsilon) \leq \left(\frac{8C \cdot N_{\mathcal{G},\|.\|}(\tau\epsilon)}{(1-\tau)\epsilon}\right)^K. \end{equation} (7.2) Note that, in the case where $$\|.\|=\|.\|_{TV}$$ as in Assumption $$\mathbf{A_1}({\it{\Sigma}})$$, we have $$C=1$$. The second assumption involves the model $${\it{\Sigma}}$$, the frequency distribution $${\it{\Lambda}}$$, a small non-negative constant $$\eta\geq 0$$ and a constant $${\mathrm{A}}>0$$. Assumption A2 $$(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ (Domination of the total variation norm) For all $$P_1,P_2 \in {\it{\Sigma}}$$, we have \begin{equation} \label{eq:domination} \Big[\gamma_{{\it{\Lambda}}}\left({P_1},{P_2}\right)\geq \eta\Big] \Rightarrow \Big[\|P_1-P_2\|_{TV}\leq {\mathrm{A}} \gamma_{{\it{\Lambda}}}\left({P_1},{P_2}\right)\Big], \end{equation} (7.3) where $$\gamma_{{\it{\Lambda}}}$$ is the MMD (4.4). Note that, since $$\|P_1-P_2\|_{TV} \leq 2$$ for all measures, if $$\eta>0$$ Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ is always verified with $${\mathrm{A}} = 2/\eta$$. ‘Ideal’ decoder. Given any sketch $${\mathbf{z}} \in \mathbb{C}^m$$ and measurement operator $$\mathcal{A}$$, for all $$P_1,P_2 \in {\mathcal{P}}$$, we have \begin{equation}\label{eq:continuityfordecoder} \Big| \|{\mathbf{z}}-\mathcal{A} P_1\|_2-\|{\mathbf{z}}-\mathcal{A} P_2\|_2 \Big| \leq \|\mathcal{A} (P_1-P_2)\|_2 \leq \|P_1-P_2\|_{TV}, \end{equation} (7.4) by Lemma A.1 in Appendix A.2. Hence, the function $$P \in {\it{\Sigma}} \mapsto \|{\mathbf{z}}-\mathcal{A} P\|_2$$ is continuous with respect to the total variation norm. If Assumption $$\mathbf{A_1}({\it{\Sigma}})$$ is satisfied and the model $${\it{\Sigma}}$$ is compact, this function reaches its minimum on it, which implies that the problem (7.1) has at least one solution. In light of this observation, under Assumption $$\mathbf{A_1}({\it{\Sigma}})$$, we analyse below the information-theoretic estimation guarantees of the idealized decoder $${\it{\Delta}}$$ that minimizes (7.1), i.e., return a distribution verifying: \begin{equation}\label{eq:decod} {\it{\Delta}}({\mathbf{z}},\mathcal{A}) \in \underset{P \in {\it{\Sigma}}}{\text{argmin}}\|{\mathbf{z}}-\mathcal{A} P\|_2. \end{equation} (7.5) We now turn to the main result of this section. Theorem 7.1 Consider a model $${\it{\Sigma}}$$, a frequency distribution $${\it{\Lambda}}$$, a small positive constant $$1 \geq \eta > 0$$ and a constant $${\mathrm{A}}\geq 1$$ such that Assumptions $$\mathbf{A_1}({\it{\Sigma}})$$ and $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ hold. Let $${\mathbf{x}}_i \in \mathbb{R}^d$$, $$i=1...n$$ be $$n$$ points drawn i.i.d. from an arbitrary distribution $$P^*\in{\mathcal{P}}$$, and $${\boldsymbol{\omega}}_{j} \in \mathbb{R}^d$$, $$j=1...m$$ be $$m$$ frequencies drawn i.i.d. from $${\it{\Lambda}}$$. Denote $$\bar{P}={\it{\Delta}}(\hat{\mathbf{z}},\mathcal{A})$$ the distribution reconstructed from the empirical sketch $$\hat{\mathbf{z}}$$. Let $$\rho>0$$. Suppose that \begin{equation}\label{eq:thmmbound} m\geq 12{\mathrm{A}}^2\log\left(\tfrac{2}{\rho} \cdot N_{{\it{\Sigma}},\|.\|_{TV}}\left(\tfrac{\eta^2}{24}\right)\right)\!. \end{equation} (7.6) Then, with probability at least $$1-\rho$$ on the drawing of the items $${\mathbf{x}}_i$$and sampling frequencies $${\boldsymbol{\omega}}_{j}$$, we have \begin{equation} \gamma_{{\it{\Lambda}}}\left({P^*},{\bar{P}}\right)\leq 5\ d_{TV}(P^*,{\it{\Sigma}})+\tfrac{4(1+\sqrt{2\log(2/\rho)})}{\sqrt{n}}+\eta, \end{equation} (7.7) where $$\gamma_{{\it{\Lambda}}}$$ is the MMD (4.4) and $$d_{TV}(P^*,{\it{\Sigma}})=\inf_{P \in {\it{\Sigma}}} \|P^*-P\|_{TV}$$ is the distance from $$P^*$$ to the model. Hence, the MMD between the distribution in the model recovered from the empirical sketch and the original distribution $$P^*$$ is controlled by the distance (measured by the total variation norm) between $$P^*$$ and the model and a small additive error. This proves that the decoding is robust both to the use of the empirical sketch instead of the true sketch and to the fact that $$P^*$$ may not be exactly in the model. The choice of the kernel for the MMD is crucial to obtain meaningful guarantees. It somehow justifies the general strategy of choosing a kernel that maximizes the discriminative power of the MMD [98], which as mentioned earlier is the idea behind the proposed adapted radius heuristic. Further work will examine relationship between the MMD and other classic metrics such as likelihood or KL-divergence [85]. Limitations and future work. Theorem 7.1 does not hold for an additive error $$\eta=0$$, which would be more akin to usual compressive sensing. In Appendix B.2, we give a version of the Theorem under a different set of Hypotheses $$\mathbf{H_i}$$ that include the $$\eta=0$$ case. However, exhibiting a model $${\it{\Sigma}}$$ and frequency distribution $${\it{\Lambda}}$$ that satisfy those hypotheses in the $$\eta=0$$ case is yet to be done. The control of the MMD $$\gamma_{{\it{\Lambda}}}\left({P^*},{\bar{P}}\right)$$ with the distance $$d_{TV}(P^*,{\it{\Sigma}})$$ is not optimal—ideally we would like to have the same metric on both sides of the inequality. In Appendix B, we formulate all results for a general metric $$d$$ under some assumptions, then specialize in the case $$d=\|.\|_{TV}$$, allowing for possible future generalizations. We now apply Theorem 7.1 to GMMs, starting with the case $$K=1$$ as a toy example. 7.2 Applications to single Gaussians (toy example) We consider the case $$K=1$$ where the model $${\it{\Sigma}}$$ is formed by single Gaussians as a toy example,7 i.e., $${\it{\Sigma}}=\mathcal{G}$$. We show that when the set $${\mathcal{T}}$$ of parameters of the Gaussians is compact, Assumption $$\mathbf{A_1}({\it{\Sigma}})$$ is verified with an explicit bound on the covering numbers of the model. Additionally, when the frequency distribution is Gaussian, Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ is verified with a bound $${\mathrm{A}}$$ that does not depend on the additive error $$\eta$$. Using the Gaussian kernel has the advantage of yielding closed-form expressions for the mean kernel, which at this point does not seem to be feasible with the proposed Adapted Radius distribution. Recall the parametrization $${\boldsymbol{\theta}}=[{\boldsymbol{\mu}},{\boldsymbol{\sigma}}]$$ with $${\boldsymbol{\mu}}\in \mathbb{R}^d$$ and $${\boldsymbol{\sigma}}\in\left(\mathbb{R}_+\setminus\{0\}\right)^d$$ of the set $$\mathcal{G}=\left\lbrace P_{\boldsymbol{\theta}}\right\rbrace_{{\boldsymbol{\theta}} \in {\mathcal{T}}}$$ of Gaussians with diagonal covariance. We suppose here that the set of parameters $${\mathcal{T}} \subset \mathbb{R}^{2d}$$ is compact, i.e., closed and bounded. In particular, the variances of the considered Gaussians are bounded, and we denote $$\sigma^2_{\min}:=\min_{[{\boldsymbol{\mu}},{\boldsymbol{\sigma}}] \in {\mathcal{T}}}\min_{\ell} \sigma^2_{\ell} >0$$ (respectively $$\sigma^2_{\max}:=\max_{[{\boldsymbol{\mu}},{\boldsymbol{\sigma}}] \in {\mathcal{T}}}\max_{\ell} \sigma^2_{\ell}<\infty$$) their minimum (respectively maximum) value. We also define $$M:=\max_{[{\boldsymbol{\mu}},{\boldsymbol{\sigma}}]} \|{\boldsymbol{\mu}}\|_2$$, and finally, we denote radius $$({\mathcal{T}}):=\min\{r>0;\exists {\mathbf{x}} \in \mathbb{R}^{2d}, {\mathcal{T}} \subset B({\mathbf{x}},r)\}$$ the Chebyshev radius [1] of $${\mathcal{T}}$$, i.e., the minimal radius $$r$$ such that $${\mathcal{T}}$$ is contained in a ball of radius $$r$$ for the Euclidean norm. Note that our framework is significantly different from many other works [57,66,94], which provide guarantees when the support of the distributions is compact, while here the Gaussian densities have infinite support, but their parameters belong to a compact set, which is a far more realistic setting. Theorem 7.2 Suppose the set of parameters $${\mathcal{T}} \subset \mathbb{R}^{2d}$$ is compact. Then, the set of Gaussians $$\mathcal{G}=\left\lbrace P_{\boldsymbol{\theta}}\right\rbrace_{{\boldsymbol{\theta}} \in {\mathcal{T}}}$$ is compact. Furthermore, for all $$\epsilon>0$$, we have \begin{equation} \label{eq:covnum_g} N_{\mathcal{G},\|.\|_{TV}}(\epsilon)\leq \left(\frac{B}{\epsilon}\right)^{2d}\!, \end{equation} (7.8) where $$B:=8\max\left(\sigma_{\min}^{-1},\sigma_{\min}^{-2}/\sqrt{2}\right)\text{radius}\left({{\mathcal{T}}}\right)$$. Thus, Assumption $$\mathbf{A_1}({\it{\Sigma}})$$ is verified for the model $${\it{\Sigma}} = \mathcal{G}$$. Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ is also verified, and for a Gaussian frequency distribution, we have the following: Theorem 7.3 Suppose the set of parameters $${\mathcal{T}} \subset \mathbb{R}^{2d}$$ is compact, and the frequency distribution is an isotropic Gaussian $${\it{\Lambda}}=\mathcal{N}\left(0,\frac{a}{d}{\mathbf{I}}\right)$$ for some $$a >0$$. Then, for all $$P,{Q} \in \mathcal{G}$$, we have \begin{equation} \|P-{Q}\|_{TV}\leq D\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)\!, \end{equation} (7.9) where \begin{equation} D=\max(\sigma_{\min}^{-1},\sigma_{\min}^{-2}/\sqrt{2})\sqrt{\frac{2dD_1\cdot {\rm e}^{3a\sigma_{\max}^2}}{a(1-{\rm e}^{-D_1})}} ~ \text{ with } D_1=\sigma_{\max}^2a\left(1+\frac{2M^2}{d}\right)\!. \end{equation} (7.10) The proof is given in Appendix C.2. As a consequence, Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ is verified for $${\it{\Sigma}} = \mathcal{G}$$, $${\it{\Lambda}}=\mathcal{N}(0,\sigma_{\it{\Lambda}}^2{\mathbf{I}})$$, any $$\eta\geq 0$$ and $${\mathrm{A}}=\min(D,2/\eta)$$. Theorems 7.2 and 7.3 lead to the following immediate corollary of Theorem 7.1. Corollary 7.1 In the case $${\it{\Sigma}}=\mathcal{G}$$ of single Gaussians with a compact set of parameters, for a Gaussian frequency distribution $${\it{\Lambda}}=\mathcal{N}\left(1,\frac{a}{d}{\mathbf{I}}\right)$$ and any constant $$0<\eta\leq 1$$, Theorem 7.1 is verified with (7.6) replaced by \begin{equation} m\geq 12{\mathrm{A}}^2\left(4d\log\left(\frac{C}{\eta}\right)+\log\frac{2}{\rho}\right)\!, \end{equation} (7.11) where $${\mathrm{A}}=\min(D,2/\eta)$$, $$C=\sqrt{24B}$$, $$B$$ is defined as in Theorem 7.2 and $$D$$ is defined as in Theorem 7.3. Hence, for a fixed model $${\it{\Sigma}}=\mathcal{G}$$, the additive error $$\eta$$ decreases exponentially with the number of measurements $$m$$. It is also interesting to note that conversely, for a small additive error $$\eta\leq 2/D=\mathcal{O}(1/\sqrt{d})$$, assuming that all parameters of the model appearing in the expression of $$D$$ are constant, we have $${\mathrm{A}}=\mathcal{O}(\sqrt{d})$$, and thus, the number of measurements $$m$$ must grow as $$\mathcal{O}(d^2)$$ with the dimension. It is sub-optimal in the sense that the ideal estimators of the mean and diagonal covariance of a single Gaussian, i.e., the empirical mean and covariance, have size $$\mathcal{O}(d)$$. This number of measurements may scale with the size of the empirical estimators for Gaussian with full covariance, although the results presented in this article do not directly apply to this case. 7.3 Application to GMMs Theorem 7.2 and Lemma 7.1 allow for an immediate extension of Assumption $$\mathbf{A_1}({\it{\Sigma}})$$ from the set of basic distributions to the corresponding mixture models. In the case of GMMs, we have the following corollary, whose proof is given in Appendix C.2. Corollary 7.2 Suppose the set of parameters $${\mathcal{T}} \subset \mathbb{R}^{2d}$$ is compact. Then, for all $$K>1$$ the set of GMMs $${\mathcal{G}_{K}}$$ is compact. Furthermore, for all $$\epsilon>0$$, we have \begin{equation} \label{eq:covnumgmm} N_{{\mathcal{G}_{K}},\|.\|_{TV}}(\epsilon)\leq \left(\frac{2(B+1)}{\epsilon}\right)^{(2d+1)K}, \end{equation} (7.12) where $$B$$ is defined as in Theorem 7.2. Unfortunately, unlike the compactness property, Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ cannot be immediately extended from the set of basic distributions $$\mathcal{G}$$ to the mixture model $${\mathcal{G}_{K}}$$, and it is not clear whether doing so would require some additional hypotheses or not. Although we strongly believe that a result similar to the $$K=1$$ case holds (see the discussion at the end of this section), here we use the fact that Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ is verified with $${\mathrm{A}}=2/\eta$$, regardless of the model and frequency distribution. This leads to the following corollary. Corollary 7.3 In the case $${\it{\Sigma}}={\mathcal{G}_{K}}$$ of GMMs with a compact set of parameters, for any frequency distribution and constant $$0<\eta\leq 1$$, Theorem 7.1 is verified with (7.6) replaced by \begin{equation} m\geq 48\eta^{-2}\left(2K(2d+1)\log\left(\frac{C}{\eta}\right)+\log\frac{2}{\rho}\right)\!, \end{equation} (7.13) where $$C=\sqrt{48(B+1)}$$, with $$B$$ defined as in Theorem 7.2. Conjecture. Corollary 7.3 suggests that the reconstruction error $$\eta$$ for GMMs decreases as $$\mathcal{O}\left(n^{-\frac{1}{2}}+m^{-\frac{1}{2}}\right)$$ (up to some inverse exponential factor), which seems to nullify the advantages of the ‘compressive’ approach. This is due to the use of the ‘worst’ bound $${\mathrm{A}}=2/\eta$$ in Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$. However, we strongly believe that Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ may hold with a better bound that does not depend on $$\eta$$, similar to the $$K=1$$ case. We support this claim by empirically evaluating reconstruction results of the CL-OMPR algorithm with respect to the MMD in Fig. 8. We observe a phase transition pattern similar to the one already noted in Section 5.5 for the KL-divergence, which is inconsistent with an additive error that scales in $$\mathcal{O}\left(m^{-1/2}\right)$$. On the contrary, the $$\mathcal{O}\left(n^{-1/2}\right)$$ decrease is indeed observed, which supports the theory, but also the capacity of CL-OMPR to approximate the ideal decoder (7.5). Fig. 8. View largeDownload slide Reconstruction results of CL-OMPR for the $$\gamma_{{\it{\Lambda}}}$$ metric with respect to $$m$$ (left) and $$n$$ (right), in dimension $$d=10$$ and $$K=5$$ components, using the true theoretical sketch $$\mathcal{A} P_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}$$ on the left and $$m=5K(2d+1)$$ frequencies on the right. In a similar fashion to the KL-divergence (Section 5.2), the MMD $$\gamma_{{\it{\Lambda}}}$$ is approximated by drawing $$5\times 10^5$$ frequencies from $${\it{\Lambda}}$$ and by empirically evaluating (4.4), using the closed-form expression of the characteristic function of GMMs. Fig. 8. View largeDownload slide Reconstruction results of CL-OMPR for the $$\gamma_{{\it{\Lambda}}}$$ metric with respect to $$m$$ (left) and $$n$$ (right), in dimension $$d=10$$ and $$K=5$$ components, using the true theoretical sketch $$\mathcal{A} P_{{\it{\Theta}}_0,{\boldsymbol{\alpha}}_0}$$ on the left and $$m=5K(2d+1)$$ frequencies on the right. In a similar fashion to the KL-divergence (Section 5.2), the MMD $$\gamma_{{\it{\Lambda}}}$$ is approximated by drawing $$5\times 10^5$$ frequencies from $${\it{\Lambda}}$$ and by empirically evaluating (4.4), using the closed-form expression of the characteristic function of GMMs. The proof of Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ for general GMMs seems complex and beyond the scope of this paper. A possible strategy would be to be able to directly extend Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ on the basic set of distributions $$\mathcal{G}$$ to the corresponding mixture model $${\mathcal{G}_{K}}$$, while another approach would be to use a different metric other than the total variation norm. The latter is particularly discussed in Appendix B. 8. Conclusion and outlooks We presented a method for probability mixture estimation on a large database exploiting a sketch of the data instead of the data itself. The sketch is an appropriate structure that leads to considerable gain in terms of memory. It can be computed in a distributed or streaming manner, and it can fully exploit the advantages of GPU computing. A typical greedy method for sparse reconstruction was defined, leading to reconstruction algorithms both efficient and stable, even when the dictionary of atoms is infinite and uncountable. In the case of GMM, an additional efficient algorithm based on hierarchical splitting of GMMs was described. A heuristic to select generalized moments based on a decomposition robust to high dimension and maximal variations of the characteristic function was designed. A procedure to estimate the parameter of this heuristic was described, resulting in a method that is faster than traditional kernel design by cross-validation, has the advantage of being unsupervised, and thus is probably suited for other tasks and yields better reconstruction results. Excellent results were observed on synthetic data, where the greedy algorithms approach the reconstruction results of EM, using less memory and computation time when the number of database elements is large. The method was successfully applied to a large-scale speaker verification task. The hierarchical approach proved to be the most efficient method for this challenge, illustrating the diversity of the problem and of the proposed solutions. As in usual compressive sensing, limitations of the method when the number of sparse components in the distribution is large were observed. Finally, information preservation guarantees were developed for the recovery of any compact set of distributions. The proof of Theorem 7.1 (Appendix B) introduced a weaker variant of the RIP for non-uniform recovery. We then applied this result to GMMs with bounded parameters, and observed a technical bottleneck between the toy $$K=1$$ case and general GMMs. Outlooks. As mentioned earlier, the method can readily be applied to other mixture models, such as mixtures of $$\alpha$$-stable distributions, which do not have explicit likelihood, but whose characteristic function is known [91]. Based on the principle of maximizing the variation of the characteristic function, suitable heuristics for the choice of the sampling pattern may be derived for other models. The method can also easily incorporate variants of random Fourier features that are faster to compute or more precise [69,108]. Existing methods to learn the kernel [58,78,93,110] may be adapted to our framework, and in return, the proposed unsupervised kernel learning procedure and adapted radius heuristic may be useful for other tasks. As mentioned earlier, technical difficulties on the domination between certain metrics were observed between the toy $$K=1$$ case and general GMMs, pointing to a promising lead for future work. The proof of Theorem 7.1 also uses innovative variants of several classical tools in compressive sensing, which may be useful in the study of other instances of generalized compressive sensing. Funding European Research Council, PLEASE project (ERC-StG- 2011-277906, in part). Footnotes 1 Any other unbiased empirical estimator of the moments, for example using the empirical median, can be envisioned. 2 Note that this representation might not be unique. 3 Other more robust estimators can be envisioned such as the empirical median. The empirical average allows more easy streaming or distributed computing. 4 A similar strategy can also be used on a single machine when the matrix $${\mathbf{U}}$$ is too large to be stored. 5 In a way, numerous measures of the characteristic function near the origin essentially measure its derivatives at various orders, which are associated to classical polynomial moments. 6 After trying many variants of initialization strategy, we came to the conclusion that the algorithm is very robust to initialization. This is one of the possible choices. 7 Obviously, fitting a single Gaussian to a data set can easily be done by direct empirical estimators. 8 For instance, consider a set $$A$$ formed by two points, included in set $$B$$ which is a ball of radius $$\epsilon$$. Suppose those two points diametrically opposed in $$B$$. We have $$A\subset B$$, but two balls of radius $$\epsilon$$ are required to cover $$A$$ (since their centers have to be in $$A$$), while only one such ball is sufficient to cover $$B$$. 9 Compactness of the model is still required here to define a projection operator onto it as well as to ensure at least one solution to the problem (7.1). This assumption could be relaxed, with the addition of an arbitrary small additive error, similar to [18]. 10 Note that the choice of $$\tau$$ is not optimal (indeed the minimum is attained for $$\tau=\frac{2d}{2d+1}$$); however, we choose this value for the simplicity and clarity of the resulting bound. References 1. Chebyshev Radius ( 2016) https://www.encyclopediaofmath.org/index.php/Chebyshe_vradius (accessed 25 November 2017). 2. Achlioptas D. ( 2001) Database-friendly random projections. Principles of Database Systems (PODS). Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. New York, NY, USA: ACM. pp. 274– 281. 3. Ahrendt P. ( 2005) The Multivariate Gaussian Probability Distribution. Discussion Paper, IMM, Technical University of Denmark. 4. Aronzajn N. ( 1950) Theory of reproducing kernels. Trans. Amer. Math. Soc. , 68, 337– 404. Google Scholar CrossRef Search ADS 5. Baraniuk R. ( 2007) Compressive sensing. IEEE Signal Process. Mag. , 24, 118– 121. Google Scholar CrossRef Search ADS 6. Baraniuk R., Davenport M., Devore R. A. & Wakin M. ( 2008) A simple proof of the restricted isometry property for random matrices. Constr. Approx. , 28, 253– 263. Google Scholar CrossRef Search ADS 7. Bashan E., Raich R. & Hero A. O. ( 2008) Optimal two-stage search for sparse targets using convex criteria. IEEE Trans. Signal Process. , 56, 5389– 5402. Google Scholar CrossRef Search ADS 8. Becker S. ( 2013) L-BFGSB-C. https://github.com/stephenbeckr/L-BFGS-B-C (accessed 25 November 2017). 9. Belkin M. & Sinha K. ( 2010a) Polynomial learning of distribution families. IEEE 51st Annual Symposium on Foundations of Computer Science . IEEE. 10. Belkin M. & Sinha K. ( 2010b) Toward learning Gaussian mixtures with arbitrary separation. COLT - 2010. The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010 ( Kalai A. T. & Mohri M. eds). Omnipress. 11. Bertin K., Le Pennec E. & Rivoirard V. ( 2011) Adaptive Dantzig density estimation. Ann. de Ins. H. Poincaré Probab. Stat. , 47, 43– 74. Google Scholar CrossRef Search ADS 12. Blumensath T. ( 2011) Sampling and reconstructing signals from a union of linear subspaces. IEEE Trans. Inform. Theory , 57, 4660– 4671. Google Scholar CrossRef Search ADS 13. Blumensath T. & Davies M. E. ( 2008) Iterative thresholding for sparse approximations. J. Fourier Anal. Appl. , 14, 629– 654. Google Scholar CrossRef Search ADS 14. Blumensath T. & Davies M. E. ( 2009) Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. , 27, 265– 274. Google Scholar CrossRef Search ADS 15. Bo L. & Sminchisescu C. ( 2009) Efficient match kernels between sets of features for visual recognition. Advances in Neural Information Processing System (NIPS) ( Bengio Y. Schuurmans D. Lafferty J. D. Williams C. K. I. & Culotta A. eds). Curran Associates, Inc. 16. Borgwardt K. M., Gretton A., Rasch M. J., Kriegel H. P., Schölkopf B. & Smola A. J. ( 2006) Integrating structured biological data by Kernel Maximum Mean Discrepancy. Bioinformatics , 22, 49– 57. Google Scholar CrossRef Search ADS 17. Bourrier A. ( 2014) Compressed sensing and dimensionality reduction for unsupervised learning. Ph.D. Thesis. http://www.theses.fr/2014REN1S023 (accessed 26 November 2017). 18. Bourrier A., Davies M. E., Peleg T. & Gribonval R. ( 2014) Fundamental performance limits for ideal decoders in high-dimensional linear inverse problems. IEEE Trans. Inform. Theory , 60, 7928– 7946. Google Scholar CrossRef Search ADS 19. Bourrier A., Gribonval R. & Pérez P. ( 2013) Compressive gaussian mixture estimation. IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) . IEEE, pp. 6024– 6028. 20. Bourrier A., Gribonval R. & Pérez P. ( 2015) Compressive Gaussian mixture estimation. Compressed Sensing and its Applications—MATHEON Workshop 2013 ( Boche H. Calderbank R. Kutyniok G. & Vybiral J. eds). Basel: Birkhäuser, pp. 6024– 6028. 21. Brooks M. ( 2005) VOICEBOX: Speech Processing Toolbox for MATLAB. http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html (accessed 25 November 2017). 22. Bunea F., Tsybakov A. B., Wegkamp M. H. & Barbu A. ( 2010) SPADES and mixture models. Ann. Stat. , 38, 2525– 2558. Google Scholar CrossRef Search ADS 23. Byrd R. H., Lu P., Nocedal J. & Zhu C. ( 1995) A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. , 16, 1190– 1208. Google Scholar CrossRef Search ADS 24. Calderbank R., Jafarpour S. & Schapire R. ( 2009) Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. Discussion Paper. Technical report. 25. Candès E. J., Eldar Y. C., Needell D. & Randall P. ( 2011) Compressed sensing with coherent and redundant dictionaries. Appl. Comput. Harmonic Anal. , 31, 59– 73. Google Scholar CrossRef Search ADS 26. Candès E. J. & Plan Y. ( 2011) Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory , 57, 2342– 2359. Google Scholar CrossRef Search ADS 27. Candès E. J., Romberg J. K. & Tao T. ( 2006) Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. , 59, 1207– 1223. Google Scholar CrossRef Search ADS 28. Candès E. J. & Tao T. ( 2004) Decoding by linear programming. IEEE Tran. Inform. Theory , 51, 4203– 4215. Google Scholar CrossRef Search ADS 29. Candès E. J. & Tao T. ( 2006) Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory , 52, 5406– 5425. Google Scholar CrossRef Search ADS 30. Cappé O. & Moulines E. ( 2009) Online EM algorithm for latent data models. J. Stat. Soc. , 71, 593– 613. Google Scholar CrossRef Search ADS 31. Carrasco M. & Florens J.-P. ( 2000) Generalization of GMM to a continuum of moment conditions. Econ. Theory ., 16, 797– 834. Google Scholar CrossRef Search ADS 32. Carrasco M. & Florens J.-P. ( 2002) Efficient GMM estimation using the empirical characteristic function. IDEI Working Paper. Toulouse: Institut d’Économie Industrielle (IDEI). Technical report. 33. Carrasco M. & Florens J.-P. ( 2014) On the asymptotic efficiency of GMM. Econ. Theory , 30, 372– 406. Google Scholar CrossRef Search ADS 34. Choromanski K. & Sindhwani V. ( 2016) Recycling randomness with structure for sublinear time kernel expansions. Proceedings of The 33rd International Conference on Machine Learning , vol. 48 ( Balcan M. F. & Weinberger K. Q. eds), 20– 22 June 2016. New York, New York: PMLR, pp. 2502– 2510. 35. Chwialkowski K., Ramdas A., Sejdinovic D. & Gretton A. ( 2015) Fast two-sample testing with analytic representations of probability measures. Advances in Neural Information Processing Systems (NIPS) ( Cortes C. Lawrence N. D. Lee D. D. Sugiyama M. & Garnett R. eds). Curran Associates, Inc., pp. 1981– 1989. 36. Cohen A., Dahmen W. & Devore R. A. ( 2009) Compressed sensing and best k-term approximation. J. Amer. Math. Soc. , 22, 211– 231. Google Scholar CrossRef Search ADS 37. Cohn D. A., Ghahramani Z. & Jordan M. I. ( 1996) Active Learning with Statistical Models. J. Artif. Intell. Res. , 4, 129– 145. 38. Cormode G. & Hadjieleftheriou M. ( 2009) Methods for finding frequent items in data streams. VLDB J. , 19, 3– 20. Google Scholar CrossRef Search ADS 39. Cormode G., Korn F., Muthukrishnan S. & Divesh S. ( 2004) Diamond in the rough: finding hierarchical heavy hitters in multi-dimensional data. International Conference on Management of Data . ACM Press, pp. 155– 166. 40. Cormode G. & Muthukrishnan S. ( 2005) An improved data stream summary: the count-min sketch and its applications. J. Algorithms , 55, 58– 75. Google Scholar CrossRef Search ADS 41. Cover T. M. & Thomas J. A. ( 2006) Elements of Information Theory . Hoboken, NJ: Wiley-Interscience. 42. Cucker F. & Smale S. ( 2002) On the mathematical foundations of learning. Bull. Amer. Math. Soc. , 39, 1– 49. Google Scholar CrossRef Search ADS 43. Dasgupta S. ( 1999) Learning mixtures of Gaussians. IEEE 51st Annual Symposium on Foundations of Computer Science . IEEE. 44. De Castro Y. & Gamboa F. ( 2012) Exact reconstruction using Beurling minimal extrapolation. J. Math. Anal. Appl. , 395, 336– 354. Google Scholar CrossRef Search ADS 45. Do T. T., Gan L., Nguyen N. H. & Tran T. D. ( 2012) Fast and Efficient Compressive Sensing using Structurally Random Matrices. IEEE Trans. Signal Process. , 60, 139– 154. Google Scholar CrossRef Search ADS 46. Donoho D. L. ( 2006) Compressed sensing. IEEE Trans. Inform. Theory , 52, 1289– 1306. Google Scholar CrossRef Search ADS 47. Duchi J. ( 2007) Derivations for linear algebra and optimization. Discussion Paper. http://web.stanford.edu/~jduchi/projects/general_notes.pdf (accessed 26 November 2017). 48. Fedotov A. A., Harremoës P. & Topsøe F. ( 2003) Refinements of Pinsker’s Inequality. IEEE Trans. Inform. Theory , 49, 1491– 1498. Google Scholar CrossRef Search ADS 49. Feuerverger A. & McDunnough P. ( 1981) On Some Fourier methods for Inference. J. Am. Stat. Assoc. , 76, 379– 387. Google Scholar CrossRef Search ADS 50. Feuerverger A. & Mureika R. ( 1977) The empirical characteristic function and its applications. Ann. Stat ., 5, 88– 97. Google Scholar CrossRef Search ADS 51. Fischer T. ( 2012) Existence, uniqueness, and minimality of the Jordan measure decomposition. arXiv preprint:1206.5449. 52. Foucart S. & Rauhut H. ( 2013) A Mathematical Introduction to Compressive Sensing , Applied and Numerical Harmonic Analysis. New York, NY: Springer. Google Scholar CrossRef Search ADS 53. Fradkin D. & Madigan D. ( 2003) Experiments with random projections for machine learning. International Conference on Knowledge Discovery and Data Mining (KDD) . New York, NY, USA: ACM, pp. 517– 522. 54. Fukumizu K., Gretton A., Sun X. & Schölkopf B. ( 2008) Kernel Measures of Conditional Dependence. Advances in Neural Information Processing System (NIPS) ( Platt J. C. Koller D. Singer Y. and Roweis S. T. eds). Curran Associates, Inc., pp. 489– 496. 55. Gilbert A. C., Kotidis Y., Muthukrishnan S. & Strauss M. J. ( 2002) How to summarize the universe: Dynamic maintenance of quantiles. International Conference on Very Large Data Bases (VLBD) . Hong Kong, China: VLDB Endowment, pp. 454– 465. Google Scholar CrossRef Search ADS 56. Giryes R., Sapiro G. & Bronstein A. M. ( 2015) Deep neural networks with random Gaussian weights: A universal classification strategy? IEEE Trans. Signal Processing , 64, 3444– 3457. Google Scholar CrossRef Search ADS 57. Gretton A., Borgwardt K. M., Rasch M. J., Schölkopf B. & Smola A. J. ( 2007) A Kernel method for the two-sample problem. Advances in Neural Information Processing Systems (NIPS) ( Schölkopf B. Platt J. C. & Hoffman T. eds). MIT Press, pp. 513– 520. 58. Gretton A., Sriperumbudur B. K., Sejdinovic D., Strathmann H. & Pontil M. ( 2012) Optimal kernel choice for large-scale two-sample tests. Advances in Neural Information Processing Systems (NIPS) ( Pereira F. Burges C. J. C. Bottou L. & Weinberger K. Q. eds). Curran Associates, Inc., pp. 1205– 1213. 59. Gribonval R., Bacry E., Mallat S., Depalle P. & Rodet X. ( 1996) Analysis of sound signals with high resolution matching pursuit. IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis . IEEE, pp. 125– 128. 60. Hall A. R. ( 2005) Generalized Method of Moments . Oxford: Oxford University press. 61. Haupt J., Castro R., Nowak R. & May S. T. ( 2011) Distilled sensing : adaptive sampling for sparse detection and estimation. IEEE Trans. Infor. Theory , 57, 6222– 6235. Google Scholar CrossRef Search ADS 62. Hofmann T., Schölkopf B. & Smola A. J. ( 2008) Kernel methods in machine learning. Ann. Stat. , 36, 1171– 1220. Google Scholar CrossRef Search ADS 63. Hsu D. & Kakade S. M. ( 2013) Learning mixtures of spherical gaussians: moment methods and spectral decompositions. Conference on Innovations in Theoretical Computer Science . New York, NY, USA: ACM. 64. Jain P., Tewari A. & Dhillon I. S. ( 2011) Orthogonal matching pursuit with replacement. Advances in Neural Information Processing Systems (NIPS) ( Shawe-Taylor J. Zemel R. S. Bartlett P. L. Pereira F. & Weinberger K. Q. eds). Curran Associates, Inc., pp. 1215– 1223. 65. Jitkrittum W., Szabó Z., Chwialkowski K. & Gretton A. ( 2016) Interpretable distribution features with maximum testing power. Advances in Neural Information and Processing Systems (NIPS) ( Lee D. D. Sugiyama M. Luxburg U. V. Guyon I. & Garnett R. eds). Curran Associates, Inc., pp. 181– 189. 66. Kanagawa M. & Fukumizu K. ( 2014) Recovering distributions from Gaussian RKHS embeddings. 17th International Conference on Artificial Intelligence and Statistics ( Kaski S. & Corander J. eds). PMLR, pp. 457– 465. 67. Keriven N. ( 2016) SketchMLbox : a Matlab toolbox for large-scale learning of mixture models. http://sketchml.gforge.inria.fr (accessed 25 November 2017). 68. Lawson C. L. & Hanson R. J. ( 1995) Solving Least Squares Problems . SIAM Classics in Applied Mathematics . Society for Industrial and Applied Mathematics. 69. Le Q., Sarlós T. & Smola A. J. ( 2013) Fastfood—approximating kernel expansions in loglinear time. Int. Conf. Mach. Learn. (ICML) , 28, pp. 1– 9. 70. Le Magoarou L. & Gribonval R. ( 2016) Flexible Multi-layer Sparse Approximations of matrices and applications. IEEE J. Sel. Top. Signal Process. , 10, 688– 700. Google Scholar CrossRef Search ADS 71. Mailhé B. & Gribonval R. ( 2009) LocOMP: algorithme localement orthogonal pour l’approximation parcimonieuse rapide de signaux longs sur des dictionnaires locaux. Actes du XXIIe colloque GRETSI . 72. Maillard O.-A. & Munos R. ( 2009) Compressed least-squares regression. Advances in Neural Information and Processing Systems (NIPS) ( Bengio Y. Schuurmans D. Lafferty J. D. Williams C. K. I. & Culotta A. eds). Curran Associates, Inc., pp. 1213– 1221. 73. Mallat S. & Zhang Z. ( 1993) Matching pursuit with time-frequency dictionaries. IEEE Trans. Signal Process. , 41, 3397– 3415. Google Scholar CrossRef Search ADS 74. Muandet K., Fukumizu K., Dinuzzo F. & Schölkopf B. ( 2012) Learning from distributions via support measure machines. Advances in Neural Information Processing Systems (NIPS) ( Pereira F. Burges C. J. C. Bottou L. & Weinberger K. Q. eds). Curran Associates, Inc., pp. 10– 18. 75. Nam S., Davies M. E., Elad M. & Gribonval R. ( 2013) The cosparse analysis model and algorithms. Appl. Comput. Harmon. Anal. , 34, 30– 56. Google Scholar CrossRef Search ADS 76. Oliva J. B., Dubey A., Poczos B., Schneider J. & Xing E. P. ( 2016) Bayesian nonparametric kernel-learning. International Conference on Artificial Intelligence and Statistics (AISTATS) . Proceedings of Machine Learning Research, Vol. 51 ( Gretton A. & Robert C. C. eds). Cadiz, Spain: PMLR, pp. 1078– 1086. 77. Oliva J. B., Sutherland D. J. & Schneider J. ( 2015) Deep Mean Maps. arxiv preprint: 1511.04150. 78. Paige B., Sejdinovic D. & Wood F. ( 2016) Super-sampling with a reservoir. Uncertainty in Artificial Intelligence . Arlington, Virginia: AUAI Press. 79. Pati Y., Rezaiifar R. & Krishnaprasad P. ( 1993) Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. Asilomar Conference on Signals, Systems and Computers . IEEE. 80. Pilanci M., El Ghaoui L. & Chandrasekaran V. ( 2012) Recovery of sparse probability measures via convex programming. Advances in Neural Information Processing Systems (NIPS) ( Pereira F. Burges C. J. C. Bottou L. & Weinberger K. Q. eds). Curran Associates, Inc., pp. 2420– 2428. 81. Puy G., Davies M. E. & Gribonval R. ( 2015) Linear embeddings of low-dimensional subsets of a Hilbert space to R m. European Signal Processing Conference (EUSIPCO) . Nice, France, pp. 469– 473. 82. Rahimi A. & Recht B. ( 2007) Random features for large scale kernel machines. Advances in Neural Information Processing Systems (NIPS) ( Platt J. C. Koller D. Singer Y. & Roweis S. T. eds). Curran Associates, Inc., pp. 1177– 1184. 83. Rahimi A. & Recht B. ( 2009) Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. Advances in Neural Information Processing Systems (NIPS) ( Koller D. Schuurmans D. Bengio Y. & Bottou L. eds). Curran Associates, Inc., pp. 1313– 1320. 84. Reboredo H., Renna F., Calderbank R. & Rodrigues M. R. D. ( 2013) Compressive Classification. IEEE Global Conference on Signal and Information Processing (GlobalSIP) Projections designs for compressive classification. IEEE, pp. 1029– 1032. 85. Reddi S. J., Ramdas A., Póczos B., Singh A. & Wasserman L. ( 2015) On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. AAAI Conference on Artifical Intelligence . AAAI Press. 86. Reynolds D. A., Quatieri T. F. & Dunn R. B. ( 2000) Speaker verification using adapted gaussian mixture models. Digital Signal Process. , 10, 19– 41. Google Scholar CrossRef Search ADS 87. Reynolds D. A. & Rose R. C. ( 1995) Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. , 3, 72– 83. Google Scholar CrossRef Search ADS 88. Rudin W. ( 1962) Fourier Analysis on Groups . New York; London: Interscience Publishers. 89. Rudin W. ( 1987) Real and Complex Analysis . Singapore: McGraw-Hill edn. 90. Sadjadi S., Slaney M. & Heck L. ( 2013) MSR Identity Toolbox v1.0: MATLAB toolbox for speaker-recognition Research.. Speech and Language Processing Technical Committee Newsletter . Microsoft Research Technical Report. 91. Salas-Gonzalez D., Kuruoglu E. E. & Ruiz D. P. ( 2009) Finite mixture of alpha-stable distributions. Digital Signal Process. , 19, 250– 264. Google Scholar CrossRef Search ADS 92. Schölkopf B. ( 2015) Computing functions of random variables via reproducing kernel Hilbert space representations. Stat. Comput. , 25, 755– 766. Google Scholar CrossRef Search ADS 93. Sinha A. & Duchi J. ( 2016) Learning kernels with random features. Advances in Neural Information and Processing Systems (NIPS) ( Lee D. D. Sugiyama M. Luxburg U. V. Guyon I. & Garnett R. eds). Curran Associates, Inc., pp. 1298– 1306. 94. Smola A., Gretton A., Song L. & Schölkopf B. ( 2007) A Hilbert space embedding for distributions. International Conference on Algorithmic Learning Theory . Algorithmic Learning Theory. ALT 2007. Lecture Notes in Computer Science, vol 4754 ( Hutter M. Servedio R. A. Takimoto E. eds). Berlin, Heidelberg: Springer, pp. 13– 31. 95. Song L., Zhang X., Smola A. J., Gretton A. & Schölkopf B. ( 2008) Tailoring density estimation via reproducing kernel moment matching. Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008) . New York, NY, USA: ACM, pp. 992– 999. 96. Sridharan K. ( 2002) A Gentle Introduction to Concentration Inequalities. Discussion Paper. Technical report. https://www.cs.cornell.edu/~sridharan/concentration.pdf (accessed 26 November 2017). 97. Sriperumbudur B. K. ( 2011) Mixture density estimation via hilbert space embedding of measures. IEEE International Symposium on Information Theory . IEEE, pp. 1027– 1030. 98. Sriperumbudur B. K., Fukumizu K., Gretton A., Lanckriet G. R. G. & Schölkopf B. ( 2009) Kernel choice and classifiability for RKHS embeddings of probability distributions. Advances in Neural Information and Processing Systems (NIPS) ( Bengio Y. Schuurmans D. Lafferty J. D. Williams C. K. I. & Culotta A. eds). Curran Associates, Inc., pp. 1750– 1758. 99. Sriperumbudur B. K., Fukumizu K., Gretton A., Schölkopf B. & Lanckriet G. R. ( 2012) On the empirical estimation of integral probability metrics. Electron. J. Stat. , 6, 1550– 1599. Google Scholar CrossRef Search ADS 100. Sriperumbudur B. K., Gretton A., Fukumizu K., Schölkopf B. & Lanckriet G. R. G. ( 2010) Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. , 11, 1517– 1561. 101. Sutherland D. J., Oliva J. B., Barnabas P. & Schneider J. ( 2016) Linear-time learning on distributions with approximate kernel embeddings. AAAI Press, pp. 2073– 2079. 102. Thaper N., Guha S., Indyk P. & Koudas N. ( 2002) Dynamic multidimensional histograms. In International conference on Management of data (SIGMOD) . New York, NY: ACM, pp. 428– 439. 103. Tran K. C. ( 1998) Estimating mixtures of normal distributions via empirical characteristic function. Econ. Rev. , 17, 167– 183. Google Scholar CrossRef Search ADS 104. Vedaldi A. & Fulkerson B. ( 2010) VLFeat—An open and portable library of computer vision algorithms. Discussion Paper. ACM Multimedia, pp. 1469– 1472. 105. Vedaldi A. & Zisserman A. ( 2012) Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell. , 34, 480– 492. Google Scholar CrossRef Search ADS PubMed 106. Volkov V. & Demmel J. W. ( 2008) Benchmarking GPUs to Tune Dense Linear Algebra. Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08 . Austin, Texas: IEEE Press, pp. 31:1– 31:11. 107. Wilson A. & Adams R. ( 2013) Gaussian process kernels for pattern discovery and extrapolation. International Conference on Machine Learning (ICML) , vol. 28. JMLR.org, pp. 1067– 1075. 108. Xinnan F., Ananda Y., Suresh T., Choromanski K., Holtmann-Rice D. & Kumar S. ( 2016) Orthogonal Random Features. Advances in Neural Information and Processing Systems (NIPS) ( Lee D. D. Sugiyama M. Luxburg U. V. Guyon I. & Garnett R. eds). Curran Associates, Inc., pp. 1975– 1983. 109. Xu D. & Knight J. ( 2010) Continuous empirical characteristic function estimation of mixtures of normal parameters. Econ. Rev. , 30, 25– 50. Google Scholar CrossRef Search ADS 110. Yang Z., Smola A. J., Song L. & Wilson A. G. ( 2015) A la Carte—learning fast kernels. Int. Conf. Artif. Intell. Stat. (AISTATS) , 38, 1098– 1106. 111. Zhang P. & Gao Y. ( 2015) Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations. ISC High Performance , vol. 9137 ( Kunkel J. M. & Ludwig T. eds). Springer, pp. 17– 30. Google Scholar CrossRef Search ADS 112. Zhao J. & Meng D. ( 2015) FastMMD: Ensemble of circular discrepancy for efficient two-sample test. J. Neural Comput. , 27, 1345– 1372. Google Scholar CrossRef Search ADS 113. Zhou X., Zhuang X., Yan S., Chnag S.-F., Hasegawa-Johnson M. & Huang T. S. ( 2008) SIFT-Bag kernel for video event analysis. ACM International Conference on Multimedia . New York, NY: ACM, pp. 229– 238. Appendix A. Definitions, preliminary results In this section, we group some definitions and useful results. A.1 Positive definite kernels We recall the definition of p.d. kernels. Definition A.1 (Positive definite kernel) Let $${X}$$ be an arbitrary set. A symmetric function (or kernel) $$\kappa:{X} \times {X} \rightarrow \mathbb{C}$$ is called positive definite (p.d.) if, for all $$n\in\mathbb{N}$$, $$c_1,...,c_n\in\mathbb{C}$$ and all $${\mathbf{x}}_1,...,{\mathbf{x}}_n\in X$$, we have \begin{equation*} \sum_{i,j=1}^n c_i\bar{c_j}\kappa({\mathbf{x}}_i,{\mathbf{x}}_j)\geq 0. \end{equation*} Note that strict positivity is not mandatory in the above equation. In terms of vocabulary, p.d. kernels bear connections with, e.g., positive semi-definite matrices (however they are indeed called positive definite kernels in the literature). Definition A.2 (Positive definite function) A function $${\mathbf{K}}:\mathbb{R}^d \rightarrow \mathbb{C}$$ is called positive definite if the kernel defined by $$\kappa({\mathbf{x}},{\mathbf{y}})={\mathbf{K}}({\mathbf{x}}-{\mathbf{y}})$$ is positive definite. A.2 Measures Definition A.3 (Non-negative measure) A measure $$\nu\in E$$ over a measurable space $$(X,\mathcal{B})$$ is said non-negative if: \begin{equation*} \forall B \in \mathcal{B},~\nu(B)\geq 0. \end{equation*} Definition A.4 (Support of a measure) The support of a signed measure $$\nu \in E$$ over a measurable, topological space $$X$$ is defined to be the closed set, \begin{equation*} \text{supp}(\nu):=X\backslash \bigcup\left\lbrace U\subset {X}:~U \text{ is open},~\nu(U)=0\right\rbrace\!. \end{equation*} Definition A.5 (Total variation norm, Finite measure) Let $$\nu\in E$$ be a signed measure over a measurable space $$(X,\mathcal{B})$$. Define the Jordan decomposition $$(\nu^-, \nu^+)$$ of $$\nu$$ where $$\nu^+$$ and $$\nu^-$$ are positive measures (see [51] and [89], Chapter 6, for more details). Denote $$|\nu|=\nu^+ +\nu^-$$. The total variation norm of $$\nu$$ is defined as: \[ \|\nu\|_{TV}=|\nu|({X})=\int_{X} {\rm d}|\nu|({\mathbf{x}}). \] The measure $$\nu$$ is said finite if $$\|\nu\|_{TV}< \infty$$. Note that if $$\nu$$ is totally continuous with respect to the Lebesgue measure, i.e., if there exists an integrable function $$f$$ such that $${\rm d}\nu({\mathbf{x}})=f({\mathbf{x}})\,{\rm d}{\mathbf{x}}$$, then the total variation norm is the classic $$L^1$$-norm of this function: $$\|\nu\|_{TV}=\|f\|_{L^1}$$. We have the following bounds. Lemma A.1 For any sketching operator $$\mathcal{A}$$ obtained by sampling the characteristic function, and any finite signed measure $$\nu \in E$$, we have \begin{equation} \label{eq:lemskopbound} \|\mathcal{A} \nu\|_2 \leq \|\nu\|_{TV}. \end{equation} (A.1) For any frequency sampling distribution $${\it{\Lambda}}$$ and any pair of probability distributions $$P,{Q} \in {\mathcal{P}}$$, we have \begin{equation} \label{eq:lemnormkbound} \gamma_{{\it{\Lambda}}}\left({P},{Q}\right) \leq \|P-{Q}\|_{TV}. \end{equation} (A.2) Proof. For any $$\nu \in E$$, we have \begin{equation*} \|\mathcal{A}\nu\|_2^2 = \frac{1}{m} \sum_{j=1}^m \left\lvert\int_{\mathbb{R}^d} {\rm e}^{-\mathsf{i}{\boldsymbol{\omega}}_j^T{\mathbf{x}}}\,{\rm d}\nu({\mathbf{x}})\right\rvert^2 \leq \frac{1}{m} \sum_{j=1}^m \left(\int_{\mathbb{R}^d} {\rm d}|\nu|({\mathbf{x}})\right)^2 =\|\nu\|_{TV}^2. \end{equation*} The second inequality is proved in [99], Proposition 5.1(ii), using here the fact that $$\|\kappa|\leq 1$$. □ A.3 Covering numbers Definition A.6 (Ball, $$\epsilon$$-covering, Covering number) Let $$(X,d)$$ be a metric space. For any $$\epsilon >0$$ and $$x\in X$$, we denote $$B_X(x,\epsilon)$$ the ball of radius $$\epsilon$$ centered at the point $$x$$: \[ B_X(x,\epsilon)=\left\lbrace y\in X,~d(x,y)\leq\epsilon\right\rbrace\!. \] Let $$Y \subseteq X$$ be a subset of $$X$$. A subset $$Z\subseteq Y$$ is an $$\epsilon$$-covering of $$Y$$ if $$Y\subseteq\bigcup_{z\in Z} B_X(z,\epsilon)$$. The covering number$$N_{Y,d}(\epsilon) \in \mathbb{N}\cup\lbrace +\infty \rbrace$$ is the smallest number of points $$y_i \in Y$$ such that the set $$\{y_i\}$$ is an $$\epsilon$$-covering of $$Y$$. Remark A.1 A subset $$Y$$ of a topological space $$(X,d)$$ that has finite covering numbers for any $$\epsilon>0$$ is called totally bounded and is not necessarily compact: a set is in fact compact if and only if it is totally bounded and complete. Hence, though in the rest of the article, we often focus on explicitly bounding the covering numbers of certain sets, if compactness of these sets is required it will have to be proved independently. Our definition of covering numbers is that of internal covering numbers, meaning that the centers of the covering balls are required to be included in the set being covered. Somewhat counter intuitively these covering numbers (for a fixed radius $$\epsilon$$) are not necessarily increasing with the inclusion of sets.8 We have instead the following property: Lemma A.2 Let $$A \subseteq B \subseteq X$$ be subsets of a metric space $$(X,d)$$, and $$\epsilon>0$$. Then, \begin{equation} \label{eq:covnumsub} N_{A,d}(\epsilon) \leq N_{B,d}(\epsilon/2). \end{equation} (A.3) Proof. Let $$b_1,...,b_N$$ be an $$\epsilon/2$$-covering of $$B$$. We construct an $$\epsilon$$-covering $$a_i$$ of $$A$$ in the following way. Each $$b_i$$ is either: (a) in the set $$A$$, in which case we take $$a_i=b_i$$, (b) at distance less than $$\epsilon/2$$ of a point $$a\in A$$, in which case we take $$a_i=a$$ and note that the ball centered on $$a_i$$ covers at least as much as the ball centered in $$b_i$$, i.e., $$B_X(b_i,\epsilon/2)\subset B_X(a_i,\epsilon)$$, (c) in none of these cases and we discard it. There are less $$a_i$$’s than $$b_i$$’s, and the union of balls of radius $$\epsilon$$ with centers $$a_i$$ covers at least as much as the balls of radius $$\epsilon/2$$ with centers $$b_i$$, and therefore the set of $$a_i$$’s is an $$\epsilon$$-covering of $$B$$ and of $$A$$. □ Another useful property is related to the embedding of sets by a Lipschitz function. Lemma A.3 Let $$(X,d)$$ and $$(X',d')$$ be two metric spaces, and $$Y \subseteq X$$, $$Y' \subseteq X'$$. If there exists a surjective function $$f: Y \rightarrow Y'$$ which is $$L$$-Lipschitz with $$L>0$$, i.e., such that \begin{equation*} \forall x,y\in Y,~d'(f(x),f(y)) \leq L d(x,y), \end{equation*} then for all $$\epsilon>0$$, we have \begin{equation} \label{eq:covnumlipschitz} N_{Y',d'}(\epsilon) \leq N_{Y,d}(\epsilon/L). \end{equation} (A.4) Proof. Define $$\epsilon_2=\epsilon/L$$, denote $$N=N_{Y,d}(\epsilon_2)$$, and let $$y_i \in Y$$, $$i=1,...,N$$ be an $$\epsilon_2$$-covering of $$Y$$. Let any $$y' \in Y'$$. There exists $$y \in Y$$ such that $$f(y)=y'$$ since $$f$$ is surjective. Let $$y_i$$ be a center of a ball in the $$\epsilon_2$$-covering of $$Y$$, we have \begin{equation*} d'(y',f(y_i)) = d'(f(y),f(y_i)) \leq Ld(y,y_i) \leq L\epsilon_2 = \epsilon. \end{equation*} Thus $$\{f(y_i)\}_{i=1,...,N}$$ is an $$\epsilon$$-covering of $$Y'$$, and we have $$N_{Y',d'}(\epsilon) \leq N$$. □ Finally, we report a property from [42]: Lemma A.4 ([42], Proposition 5) Let $$(X,\|.\|)$$ be a Banach space of finite dimension $$d$$. Then, for any $$\epsilon>0,~x\in X$$ and $$R>0$$, we have \begin{equation} N_{B_X(x,R),\|.\|}(\epsilon)\leq \left(\frac{4R}{\epsilon}\right)^d\!. \end{equation} (A.5) A.4 Concentration of averages We will use Bernstein’s inequality in the following simple version [96]: Lemma A.5 (Bernstein’s inequality ([96], Theorem 6)) Let $$x_i \in \mathbb{R}$$, $$i=1,...,n$$ be i.i.d. bounded random variables such that $$\mathbb{E}x_i=0$$, $$|x_i| \leq M$$ and $$Var(x_i) \leq \sigma^2$$ for all $$i$$’s. Then for all $$t>0$$, we have \begin{equation} P\left(\frac{1}{n}\sum_{i=1}^n x_i \geq t\right)\leq \exp\left(-\frac{nt^2}{2\sigma^2+2Mt/3}\right)\!. \end{equation} (A.6) We also report a concentration result in Hilbert spaces from [83]. Lemma A.6 ([83], Lemma 4) Let $${\mathbf{x}}_i \in \mathcal{H}$$, $$i=1,...,n$$ be i.i.d. random variables in a Hilbert space $$(\mathcal{H},\|.\|)$$ such that $$\|{\mathbf{x}}_i\|\leq M$$ with probability one. Denote $$\bar {\mathbf{x}}$$ their empirical average $$\bar {\mathbf{x}}=\left(\sum_{i=1}^n {\mathbf{x}}_i\right)/n$$. Then for any $$\rho>0$$, with probability at least $$1-\rho$$, \begin{equation} \|\bar {\mathbf{x}} - \mathbb{E} \bar {\mathbf{x}}\| \leq \frac{M}{\sqrt{n}}\left(1+\sqrt{2\log\frac{1}{\rho}}\right)\!. \end{equation} (A.7) Appendix B. Proof of Theorem 7.1 B.1 Lower RIP A measurement operator $$\mathcal{A}$$ satisfies the usual generalized LRIP [17] on the model $${\it{\Sigma}}$$ with constant $$\beta>0$$ if: \begin{equation} \label{eq:rip} \forall P,{Q} \in {\it{\Sigma}}, ~ \beta \gamma_{\it{\Lambda}}\left({P},{Q}\right)^2 \leq \|\mathcal{A} P - \mathcal{A} {Q}\|_2^2. \end{equation} (B.1) For a measurement operator $$\mathcal{A}$$ drawn at random (in our case by randomly drawing a set of frequencies $${\it{\Omega}}=\{{\boldsymbol{\omega}}_j\}_{j=1,...,m}$$), the usual approach from compressive sensing theory is to prove that, with high probability, (B.1) is satisfied: \begin{equation} \label{eq:unirip} {\mathbb{P}}_{\it{\Omega}}\left(\forall P,{Q} \in {\it{\Sigma}}, ~ \beta \gamma_{\it{\Lambda}}\left({P},{Q}\right)^2 \leq \|\mathcal{A} P - \mathcal{A} {Q}\|_2^2\right) \geq 1-\rho, \end{equation} (B.2) where $${\mathbb{P}}_{\it{\Omega}}$$ indicates probability with respect to the set of frequencies $${\it{\Omega}}$$. Defining the normalized secant set \begin{equation} {\mathcal{S}}:=\left\lbrace \frac{P-{Q}}{\gamma_{\it{\Lambda}}\left({P},{Q}\right)}; ~ P,{Q} \in {\it{\Sigma}}, ~ \gamma_{\it{\Lambda}}\left({P},{Q}\right)\neq 0\right\rbrace \subset E, \end{equation} (B.3) the LRIP (B.1) is equivalent to: $$\forall \nu \in {\mathcal{S}}, ~ \beta \leq \|\mathcal{A} \nu\|_2^2$$, and (B.2) is equivalent to \begin{equation} \label{eq:uniripsecant} {\mathbb{P}}_{\it{\Omega}}\left(\forall \nu \in {\mathcal{S}}, ~ \beta \leq \|\mathcal{A} \nu\|_2^2\right) \geq 1-\rho. \end{equation} (B.4) Hence, a typical proof of the (L)RIP [6,81] consists in defining an $$\epsilon$$-covering of the normalized secant set, proving a pointwise LRIP at the center of each ball using concentration results, then uniformly extending the result to the whole normalized secant set using Lipschitz continuity of the measurement operator. Semi-uniform LRIP. In our framework, we introduce a ‘non-uniform’ version of the LRIP, in which the inequality (B.2) will be verified for a given$$P \in {\it{\Sigma}}$$ with high probability, uniformly for all $${Q} \in {\it{\Sigma}}$$. It is expressed as: \begin{equation} \label{eq:semirip} \forall P \in {\it{\Sigma}}, ~ {\mathbb{P}}_{\it{\Omega}}\left(\forall {Q} \in {\it{\Sigma}}, ~ \beta \gamma_{\it{\Lambda}}\left({P},{Q}\right)^2 \leq \|\mathcal{A} P - \mathcal{A} {Q}\|_2^2\right) \geq 1-\rho. \end{equation} (B.5) We refer to this version of the LRIP as semi-uniform in probability. It holds with a smaller number of measurements $$m$$ than the uniform case, and we show in the next section that it is sufficient to obtain recovery guarantees with joint probability on the drawing of frequencies $$\{{\boldsymbol{\omega}}_j\}$$ and items $$\{{\mathbf{x}}_i\}$$. For more details on non-uniform compressive sensing results, we refer the reader to the book by Foucart and Rauhut [52, Chapters 9 and 11]. Similar to the uniform case, we introduce a family of ‘non-uniform’ normalized secant sets, defined for each $$P \in {\it{\Sigma}}$$ as: \begin{equation} {\mathcal{S}}_P := \left\lbrace \frac{P-{Q}}{\gamma_{\it{\Lambda}}\left({P},{Q}\right)}; ~ {Q} \in {\it{\Sigma}}, ~ \gamma_{\it{\Lambda}}\left({P},{Q}\right)\neq 0\right\rbrace \subset E. \end{equation} (B.6) The semi-uniform LRIP (B.5) is then equivalent to \begin{equation} \label{eq:uniripsecantnonuni} \forall P \in {\it{\Sigma}},~{\mathbb{P}}_{\it{\Omega}}\left(\forall \nu \in {\mathcal{S}}_P, ~ \beta \leq \|\mathcal{A} \nu\|_2^2\right) \geq 1-\rho. \end{equation} (B.7) A typical proof would therefore follow the exact same pattern than the uniform case, using non-uniform normalized secant sets instead of the normalized secant set. Restricted, semi-uniform LRIP. Unlike finite-dimensional frameworks, where normalized secant sets are contained in a unit ball that is necessarily compact, here it is in general challenging to prove the existence of finite covering numbers for this set. Under Assumption $$\mathbf{A_1}({\it{\Sigma}})$$, the model $${\it{\Sigma}}$$itself is compact, which suggests using the embedding $${Q} \in {\it{\Sigma}} \rightarrow \frac{P-{Q}}{\gamma_{\it{\Lambda}}\left({P},{Q}\right)} \in {\mathcal{S}}_P$$. However, the behavior of this function when $${Q}$$ gets close to $$P$$ may be delicate to analyze. Thus, for all $$\eta\geq0$$ and $$P\in{\it{\Sigma}}$$, we define the restricted non-uniform normalized secant set: \begin{equation} \label{eq:secant} {\mathcal{S}}^\eta_P := \left\lbrace \frac{P-{Q}}{\gamma_{\it{\Lambda}}\left({P},{Q}\right)}; ~ {Q} \in {\it{\Sigma}}, ~ \gamma_{\it{\Lambda}}\left({P},{Q}\right)>\eta\right\rbrace \subset E. \end{equation} (B.8) Note that, when we let $$\eta=0$$ the restricted non-uniform normalized secant set $${\mathcal{S}}^0_P$$ is just the previous non-uniform normalized secant set $${\mathcal{S}}_P$$. Hypotheses to establish the restricted, semi-uniform LRIP. We are going to prove the restricted semi-uniform LRIP (B.5) under two hypotheses. The first hypothesis depends on a model $${\it{\Sigma}}$$, a frequency distribution $${\it{\Lambda}}$$, a non-negative constant $$\eta \geq 0$$, and a choice of metric $$d(\cdot,\cdot)$$. Hypothesis H1 $$(\eta,{\it{\Sigma}},{\it{\Lambda}},d)$$ (Covering numbers of the secant set) For all $$P \in {\it{\Sigma}}$$, the restricted non-uniform normalized secant set $${\mathcal{S}}^\eta_P$$ has finite covering numbers $$N_{{\mathcal{S}}^\eta_P,d}(\epsilon)<\infty$$. In the case where the constant $$\eta>0$$ is positive and the metric $$d=\|.\|$$ is a norm, the covering numbers of the secant set are controlled by those of the model. Lemma B.1 Let $$\|.\|$$ be a norm on the space of finite signed measure $$E$$, and $${\it{\Lambda}}$$ a frequency distribution such that the model $${\it{\Sigma}}$$ has finite covering numbers with respect to some metric $$\bar{d}$$, which satisfies \[ \forall P,{Q}\in{\it{\Sigma}},~\bar{d}(P,{Q})\geq\max\Big(\|P-{Q}\|,\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)\Big). \] The model is in particular bounded for the norm $$\|.\|$$, denote $$C=\max\left(1,\sup_{P,{Q}\in{\it{\Sigma}}}\|P-{Q}\|\right)$$. Then, for any strictly positive constant $$1\geq \eta > 0$$, Assumption $$\mathbf{H_1}(\eta,{\it{\Sigma}},{\it{\Lambda}},d)$$ holds with $$d = \|.\|$$. Furthermore, for any $$P\in{\it{\Sigma}}$$ and $$\epsilon>0$$, we have \begin{equation} N_{{\mathcal{S}}^\eta_P,\|.\|}(\epsilon)\leq N_{{\it{\Sigma}},\bar{d}}\left(\frac{\epsilon\eta^2}{2(C+1)}\right)\!. \end{equation} (B.9) Proof. Let $$P\in{\it{\Sigma}}$$ be any distribution in the model. Consider the complement of the ball $$B_{{\it{\Sigma}},\gamma_{{\it{\Lambda}}}}(P,\eta)$$: \begin{equation} \mathcal{Q}^\eta_P=B_{{\it{\Sigma}},\gamma_{{\it{\Lambda}}}}^c(P,\eta)=\left\lbrace {Q} \in {\it{\Sigma}},~ \gamma_{{\it{\Lambda}}}\left({P},{Q}\right) >\eta \right\rbrace \subset {\it{\Sigma}}, \end{equation} (B.10) and the function $$f_P:\mathcal{Q}^\eta_P\rightarrow{\mathcal{S}}^\eta_P$$ such that $$f_P({Q})=\frac{P-{Q}}{\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)}$$, which is surjective by definition of $${\mathcal{S}}^\eta_P$$. Let us show that $$f_P$$ is $$(C+1)/\eta^{2}$$-Lipschitz continuous for the metric $$\bar{d}$$ and conclude with Lemma A.3. For any $${Q}_1,{Q}_2 \in \mathcal{Q}^\eta_P$$, we have \begin{align*} \hspace{-2.3pc}\|f_P({Q}_1)-f_P({Q}_2)\| &= \left\lVert \frac{P-{Q}_1}{\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_1} - \frac{P-{Q}_2}{\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_2} \right\rVert, \\ \hspace{-2.3pc}&\leq \left\lVert \frac{P-{Q}_1}{\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_1} - \frac{P-{Q}_2}{\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_1} \right\rVert + \left\lVert \frac{P-{Q}_2}{\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_1} - \frac{P-{Q}_2}{\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_2} \right\rVert,\\ \hspace{-4pc}&\leq \frac{1}{\eta}\|{Q}_2-{Q}_1\| + \|P-{Q}_2\|\left\lvert\frac{1}{\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_1}-\frac{1}{\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_2}\right\rvert, & \text{since } \gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_1> \eta\\ \hspace{-2.3pc}&\leq \frac{1}{\eta}\|{Q}_1-{Q}_2\| +\frac{C}{\eta^2}\Big| \gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_2-\gamma_{{\it{\Lambda}}}\left({P},{Q}\right)_1\Big|, & \text{since } \|P-{Q}_2\|\leq C\\ \hspace{-2.3pc}&\leq \frac{1}{\eta}\|{Q}_1-{Q}_2\| +\frac{C}{\eta^2}\gamma_{{\it{\Lambda}}}\left({{Q}_1},{{Q}_2}\right)\!, & \text{by the triangle inequality,}\\ \hspace{-2.3pc}&\leq \frac{C+1}{\eta^2}\bar{d}\left({Q}_1,{Q}_2\right)\!. &\text{ since } \eta\leq 1\\ \end{align*} Hence, the function $$f_P$$ is Lipshitz continuous with constant $$L=(C+1)/\eta^2$$, and therefore for all $$\epsilon>0$$: \begin{align*} N_{{\mathcal{S}}^\eta_P,\|.\|}(\epsilon) \stackrel{\text{Lemma A.3}}{\leq} N_{\mathcal{Q}^\eta_P,\bar{d}}(\epsilon/L) \stackrel{\text{Lemma A.2}}{\leq} N_{{\it{\Sigma}},\bar{d}}\left(\frac{\epsilon}{2L}\right)\!.\\[-3pc] \end{align*} □ To formulate the second hypothesis, we denote $$f$$ a function from $$\mathbb{N} \times \mathbb{R}_+$$ to $$\mathbb{R}_+$$. Hypothesis H2 $$(\eta,{\it{\Sigma}},{\it{\Lambda}},f)$$ (Probability of the pointwise LRIP) For any $$P \in {\it{\Sigma}}$$, any $$\nu \in {\mathcal{S}}^\eta_P$$, any $$t\geq 0$$ and any integer $$m>0$$, we have \begin{equation} {\mathbb{P}}_{\it{\Omega}}\left(1-\|\mathcal{A} \nu\|_2^2\geq t \right) \leq f(m,t), \end{equation} (B.11) where $$\mathcal{A}:E\rightarrow \mathbb{C}^m$$ is a sketching operator built by independently drawing $$m$$ frequencies according to $${\it{\Lambda}}$$. In Section B.2, under hypotheses $$\mathbf{H_1}(\eta,{\it{\Sigma}},{\it{\Lambda}},d)$$ and $$\mathbf{H_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},f)$$, we prove an extended version of Theorem 7.1 (referred to as Theorem 7.1 bis). Unlike Theorem 7.1 this extended version covers the case $$\eta=0$$. Then, in Section B.3 we prove that under the Assumptions $$\mathbf{A_1}({\it{\Sigma}})$$ and $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ used to state Theorem 7.1, $$\mathbf{H_1}(\eta,{\it{\Sigma}},{\it{\Lambda}},d)$$ holds with $$d(\nu,\nu') = \|\nu-\nu'\|_{TV}$$provided that $$\eta>0$$, and $$\mathbf{H_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},f)$$ holds for an appropriate choice of function $$f$$. Remark B.1 Ideally, one would like to exploit Theorem 7.1 bis with $$\eta=0$$ to obtain performance guarantees without the extra additive term $$\eta$$. However, putting this into practice would require characterizing covering numbers for the normalized secant set $${\mathcal{S}}_P={\mathcal{S}}_P^{0}$$, which can be tricky and is left to future work. It is indeed already not trivial to determine when this set has finite covering numbers. We can now state our version of the LRIP. Theorem B.1 Consider a model $${\it{\Sigma}}$$, a frequency distribution $${\it{\Lambda}}$$, a non-negative constant $$\eta\geq0$$, a metric $$d(\cdot,\cdot)$$ and a function $$f$$ such that Assumptions $$\mathbf{H_1}(\eta,{\it{\Sigma}},{\it{\Lambda}},d)$$ and $$\mathbf{H_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},f)$$ are satisfied. Let $$P^* \in {\it{\Sigma}}$$ be any distribution in the model. Assume that for all $$\nu,\nu' \in {\mathcal{S}}^{\eta}_{P^*}$$ \begin{equation}\label{eq:HypSkopLipschitz} \sup_\mathcal{A}\Big|\|\mathcal{A} \nu\|_{2} -\| \mathcal{A} \nu'\|_{2}\Big| \leq d(\nu,\nu'), \end{equation} (B.12) where the supremum is over all possible frequencies $${\boldsymbol{\omega}}$$ defining the sketching operator $$\mathcal{A}$$. Define \[ \rho=N_{{\mathcal{S}}^\eta_{P^*},d }\left(\tfrac{1}{4}\right)\cdot f\left(m,\tfrac{7}{16}\right)\!. \] Then, with probability at least $$1-\rho$$ on the drawing of the $$m$$ frequencies $${\boldsymbol{\omega}}_j$$’s, we have \begin{equation} \label{eq:riplem} \forall P \in {\it{\Sigma}} ~\text{s.t. } \gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)\geq \eta:\quad \frac{1}{4}\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)^2 \leq \|\mathcal{A} P^* - \mathcal{A} P\|_2^2. \end{equation} (B.13) Proof. The idea is to define an $$\epsilon$$-covering of the restricted non-uniform normalized secant set $${\mathcal{S}}_{P^*}^\eta$$ with respect to the metric $$d$$, to apply the concentration result of Assumption $$\mathbf{H_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},f)$$ at the center of each ball and to conclude with a union bound. Let $$\epsilon > 0$$, $$1>t>0$$ and denote $$N=N_{{\mathcal{S}}_{P^*}^\eta,d }(\epsilon)$$ for simplicity, which is finite by Assumption $$\mathbf{H_1}(\eta,{\it{\Sigma}},{\it{\Lambda}},d)$$. We consider $$\nu_1,...,\nu_{N}$$ an $$\epsilon$$-covering of $${\mathcal{S}}_{P^*}^\eta$$ with respect to the metric $$d$$. Considering Assumption $$\mathbf{H_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},f)$$, a union bound yields that, with probability greater than $$1-N \cdot f(m,t)$$, \begin{equation}\label{eq:rip_bernstein_union} \forall i \in [1,N],~ 1-\left\lVert \mathcal{A}\nu_i\right\rVert^2_2 < t. \end{equation} (B.14) Assuming (B.14) is satisfied, let any distribution $$P \in {\it{\Sigma}}$$ such that $$\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)>\eta$$. Denote $$\nu:=\frac{P^*-P}{\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)} \in {\mathcal{S}}_{P^*}^\eta$$, and let $$i\in[1,N]$$ such that $$\nu_{i}$$ is the center of the ball closest to $$\nu$$ in the covering. We have \begin{align} 1- \|\mathcal{A}\nu\|_2 =& 1 - \|\mathcal{A}\nu_i\|_2 + \|\mathcal{A}\nu_i\|_2 - \|\mathcal{A}\nu\|_2 \notag\\ \leq & 1-\sqrt{1-t} + \|\mathcal{A} \nu_i\|_2-\|\mathcal{A}\nu\|_2 \notag &\text{since (B.14) is verified}\\ \stackrel{(a)}\leq& 1-\sqrt{1-t} + {\rm d}(\nu_i,\nu) &\text{by (B.12)}\\ \leq& 1-\sqrt{1-t} + \epsilon\notag. \end{align} Choosing $$t>0$$ and $$\epsilon = \sqrt{1-t}-1/2$$ (e.g.: $$t = 7/16$$, $$\epsilon=1/4$$), we obtain $$\|\mathcal{A} \nu\|_{2}^{2} \geq 1/4$$, that is to say \begin{equation} \frac{1}{4}\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)^2 \leq \|\mathcal{A} P^*-\mathcal{A} P\|_2^2. \end{equation} (B.15) This shows that the LRIP (B.13) is verified except with probability at most \[ N_{{\mathcal{S}}^\eta_{P^*},d}\left(\sqrt{1-t}-1/2\right)\cdot f\left(m,t\right)\!. \] Specializing to $$t=7/16$$ yields the desired result. □ Remark B.2 (LRIP without the restricted Lipschitz property (B.12)?) In (a) we used Property (B.12), which could be called a restricted Lipschitz property for the function $$\nu \mapsto \|\mathcal{A} \nu\|_{2}$$. It is restricted to $${\mathcal{S}}^{\eta}_{P^*}$$, but assumed to hold uniformly over all possible draws of the sketching operator $$\mathcal{A}$$. Thanks to Lemma A.1 and the triangle inequality, \[ \Big| \|\mathcal{A} \nu\|_{2}-\|\mathcal{A} \nu'\|_{2}\Big| \leq \|\mathcal{A} (\nu-\nu')\|_{2} \leq \|\nu-\nu'\|_{TV}, \] hence property (B.12) indeed holds when $${\rm d}(\nu,\nu') = \|\nu-\nu'\|_{TV}$$, even without the restriction to $${\mathcal{S}}^{\eta}_{P^*}$$. In the rest of this article, we primarily concentrate on this setting. It would, however, be interesting for future work to consider metrics $$d$$ with much larger balls than those of the total variation norm, since they may lead to substantially smaller covering numbers $$N_{{\mathcal{S}}^\eta_{P^*},\|.\|} \ll N_{{\mathcal{S}}^\eta_{P^*},\|.\|_{TV}}$$, and thus LRIP guarantees for much smaller $$m$$. Theorem B.1 still yields guarantees with such metrics, provided the restricted Lipschitz property (B.12) holds. An analog of Theorem B.1 for even weaker metrics that do not satisfy (B.12), can also be envisioned using chaining arguments [81] provided that \[ {\mathbb{P}}_{{\it{\Omega}}}\left(\|\mathcal{A} \nu - \mathcal{A} \nu'\|_{2} \geq (1+t) d(\nu,\nu')\right) \leq g(m,t) \] with an appropriately decaying function $$g(m,t)$$. B.2 A version of Theorem 7.1 with weaker assumptions In this section, we formulate a version of Theorem 7.1, referred to as Theorem 7.1 bis, that allows for an additive error $$\eta=0$$, under the Hypotheses $$\mathbf{H_i}$$. In the next section, we deduce from it the version given in Section 7.1 that uses Assumptions $$\mathbf{A_i}$$. Theorem 7.1 bis Consider a model $${\it{\Sigma}}$$, a frequency distribution $${\it{\Lambda}}$$, a non-negative constant $$\eta\geq 0$$ and a function $$f$$ such that Assumptions $$\mathbf{H_1}(\eta,{\it{\Sigma}},{\it{\Lambda}},d)$$ and $$\mathbf{H_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},f)$$ hold with $$d(\nu,\nu') = \|\nu-\nu'\|_{TV}$$. Consider $$\tilde{d}(\cdot,\cdot)$$ a metric such that, for any $$P_{1},P_{2} \in {\mathcal{P}}$$, we have \begin{equation} \label{eq:DefMNormLikeCondition} \tilde{d}(P_{1},P_{2}) \geq \max\left(\gamma_{{\it{\Lambda}}}\left({P_{1}},{P_{2}}\right), \sup_{\mathcal{A}} \|\mathcal{A} P_{1}-\mathcal{A} P_{2}\|_{2}\right)\!, \end{equation} (B.16) where the supremum is over all possible frequencies $${\boldsymbol{\omega}}$$ defining the sketching operator $$\mathcal{A}$$. Assume that $${\it{\Sigma}}$$ is compact9 with respect to $$\tilde{d}$$, and note that under this assumption the decoder $${\it{\Delta}}$$ is still well-defined by (7.5), since one can replace the total variation norm by the metric $$\tilde{d}$$ in the right-hand side of (7.4). Let $${\mathbf{x}}_i \in \mathbb{R}^d$$, $$i=1...n$$ be $$n$$ points drawn i.i.d. from an arbitrary distribution $$P^*\in{\mathcal{P}}$$, and $${\boldsymbol{\omega}}_{j} \in \mathbb{R}^d$$, $$j=1...m$$ be $$m$$ frequencies drawn i.i.d. from $${\it{\Lambda}}$$. Denote $$\bar{P}={\it{\Delta}}(\hat{\mathbf{z}},\mathcal{A})$$ the distribution reconstructed from the empirical sketch $$\hat{\mathbf{z}}$$. Define $$P_\text{proj} \in {\it{\Sigma}}$$ as (one of) the projection(s) of the probability $$P^*$$ onto the model: \begin{equation}\label{eq:proj} P_\text{proj} \in \underset{P\in {\it{\Sigma}}}{\text{argmin}} \ \tilde{d}(P^{*},P), \end{equation} (B.17) which indeed exists since $${\it{\Sigma}}$$ is assumed to be compact. Define \begin{equation}\label{eq:thmbisprb} \rho = 2N_{{\mathcal{S}}^\eta_{P_{proj}},\|.\|_{TV}}\left(\tfrac{1}{4}\right)\cdot f\left(m,\tfrac{7}{16}\right)\!. \end{equation} (B.18) Then, with probability at least $$1-\rho$$ on the drawing of the items $${\mathbf{x}}_i$$and sampling frequencies $${\boldsymbol{\omega}}_{j}$$, we have \begin{equation} \gamma_{{\it{\Lambda}}}\left({P^*},{\bar{P}}\right)\leq 5\ \tilde{d}(P^{*},{\it{\Sigma}}) +\tfrac{4(1+\sqrt{2\log(2/\rho)})}{\sqrt{n}}+\eta, \end{equation} (B.19) where $$\tilde{d}(P^*,{\it{\Sigma}})=\inf_{P \in {\it{\Sigma}}}\ \tilde{d}(P^*,P)$$ is the distance from $$P^*$$ to the model. Proof. Recall that the target distribution $$P^*$$ and its projection $$P_\text{proj}$$ are fixed. By Lemma A.1, the restricted Lipschitz property (B.12) holds with $$d(\nu,\nu') = \|\nu-\nu'\|_{TV}$$. Considering (B.18), Theorem B.1 yields that since Assumptions $$\mathbf{H_1}(\eta,{\it{\Sigma}},{\it{\Lambda}},d)$$ and $$\mathbf{H_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},f)$$ hold, the LRIP applied to $$P_\text{proj}$$ is satisfied with probability at least $$1-\rho/2$$ on the drawing of frequencies: \begin{equation*} \forall P_{\it{\Sigma}} \in {\it{\Sigma}} ~\text{s.t. } \gamma_{{\it{\Lambda}}}\left({P_\text{proj}},{P_{\it{\Sigma}}}\right)\geq \eta:\quad \frac{1}{4}\gamma_{{\it{\Lambda}}}\left({P_\text{proj}},{P_{\it{\Sigma}}}\right)^2 \leq \|\mathcal{A} P_\text{proj} - \mathcal{A} P_{\it{\Sigma}}\|_2^2, \end{equation*} which can be reformulated in: \begin{equation} \label{eq:applilrip} \forall P_{\it{\Sigma}} \in {\it{\Sigma}},~\gamma_{{\it{\Lambda}}}\left({P_\text{proj}},{P_{\it{\Sigma}}}\right) \leq \max\Big(2\|\mathcal{A} P^* - \mathcal{A} P_{\it{\Sigma}}\|_2,\eta\Big) \leq 2\|\mathcal{A} P^* - \mathcal{A} P_{\it{\Sigma}}\|_2+\eta. \end{equation} (B.20) Let $$P \in {\mathcal{P}}$$ be any distribution. For some random draw of the operator $$\mathcal{A}$$, denote $$\bar{P}={\it{\Delta}}\left(\mathcal{A} P,\mathcal{A}\right) \in {\it{\Sigma}}$$ which, by definition of the decoder, belongs to the model. We have: \begin{align*} \gamma_{{\it{\Lambda}}}\left({\bar{P}},{P^*}\right) &\leq \gamma_{{\it{\Lambda}}}\left({\bar{P}},{P_\text{proj}}\right) + \gamma_{{\it{\Lambda}}}\left({P_\text{proj}},{P^*}\right) &\text{triangle ineq.}\\ &\leq 2\|\mathcal{A} \bar{P}-\mathcal{A} P_\text{proj}\|_2 + \gamma_{{\it{\Lambda}}}\left({P_\text{proj}},{P^*}\right) + \eta &\text{using (B.20)}\\ &\leq 2\|\mathcal{A} \bar{P}-\mathcal{A} P\|_2 + 2\|\mathcal{A} P-\mathcal{A} P^* \|_2 + 2\|\mathcal{A} P^*-\mathcal{A} P_\text{proj} \|_2 + \gamma_{{\it{\Lambda}}}\left({P_\text{proj}},{P^*}\right) + \eta &\text{triangle ineq.}\\ \end{align*} Given the definition (7.5) of the decoder $${\it{\Delta}}$$ and of the distribution $$\bar{P}={\it{\Delta}}\left({\mathcal{A}} P,\mathcal{A}\right)$$, we have \begin{equation} \|\mathcal{A} \bar{P}-\mathcal{A} P\|_2 = \min_{P_{\it{\Sigma}}\in{\it{\Sigma}}} \|\mathcal{A} P_{\it{\Sigma}}-\mathcal{A} P\|_2. \end{equation} (B.21) Since $$P_\text{proj}$$ is in the model $${\it{\Sigma}}$$, we thus have $$\|\mathcal{A} \bar{P}-\mathcal{A} P\|_2 \leq \|\mathcal{A} P_\text{proj}-\mathcal{A} P\|_2$$. Hence, \begin{align*} \gamma_{{\it{\Lambda}}}\left({\bar{P}},{P^*}\right) \leq& 2\|\mathcal{A} P_\text{proj}-\mathcal{A} P\|_2 + 2\|\mathcal{A} P-\mathcal{A} P^* \|_2 + 2\|\mathcal{A} P^*-\mathcal{A} P_\text{proj} \|_2 + \gamma_{{\it{\Lambda}}}\left({P_\text{proj}},{P^*}\right) +\eta\\ \leq& 4\|\mathcal{A} P-\mathcal{A} P^* \|_2 + 4\|\mathcal{A} P^*-\mathcal{A} P_\text{proj} \|_2 + \gamma_{{\it{\Lambda}}}\left({P_\text{proj}},{P^*}\right) + \eta &\text{triangle ineq.}\\ \stackrel{(b)}{\leq}& 4\|\mathcal{A} P-\mathcal{A} P^* \|_2 + 5\ \tilde{d}(P^*,{\it{\Sigma}}) +\eta. &\text{using (B.16) and (B.17)}\\ \end{align*} Thus, we proved that, with probability at least $$1-\rho/2$$ on the drawing of frequencies, we have: \begin{equation*} \forall P\in{\mathcal{P}},~ \gamma_{{\it{\Lambda}}}\left({P^*},{{\it{\Delta}}\left({\mathcal{A}} P,\mathcal{A}\right)}\right) \leq 5\ \tilde{d}(P^*,{\it{\Sigma}})+ 4\|\mathcal{A} P-\mathcal{A} P^* \|_2 +\eta. \end{equation*} In particular, with a joint probability of at least $$1-\rho/2$$ on the drawing of frequencies $${\boldsymbol{\omega}}_j$$and items $${\mathbf{x}}_i$$, we have \begin{equation} \label{eq:step1} \gamma_{{\it{\Lambda}}}\left({P^*},{{\it{\Delta}}\left(\mathcal{A}\hat{P},\mathcal{A}\right)}\right) \leq 5\ \tilde{d}(P^*,{\it{\Sigma}})+ 4\|\mathcal{A} \hat P-\mathcal{A} P^* \|_2 +\eta, \end{equation} (B.22) where $$\hat P=\frac{1}{n}\sum_{i=1}^n \delta_{{\mathbf{x}}_i}$$. We now show that with high probability the term $$\|\mathcal{A} \hat P-\mathcal{A} P^* \|_2$$ is bounded by $$\epsilon:=\frac{(1+\sqrt{2\log(2/\rho)})}{\sqrt{n}}$$. Denote $${\mathbb{P}}_{\it{\Omega}}$$ (respectively $${\mathbb{P}}_\mathcal{X}$$) the probability distribution of the set of frequencies $${\it{\Omega}}=\{{\boldsymbol{\omega}}_j\}_{j=1,...,m}$$ (respectively of the set of items in the database $$\mathcal{X}=\{{\mathbf{x}}_i\}_{i=1,...,n}$$). Their joint distribution is denoted $${\mathbb{P}}_{{\it{\Omega}},\mathcal{X}}$$ and is such that $${\rm d}{\mathbb{P}}_{{\it{\Omega}},\mathcal{X}}({\it{\Omega}},\mathcal{X})={\rm d}{\mathbb{P}}_{\it{\Omega}}({\it{\Omega}}){\rm d}{\mathbb{P}}_\mathcal{X}(\mathcal{X})$$ by independence. Consider the set: \begin{equation*} A:=\left\lbrace ({\it{\Omega}},\mathcal{X}) \text{ s.t. } \|\mathcal{A} \hat P - \mathcal{A} P^*\|_2 \leq \epsilon \right\rbrace\!. \end{equation*} For a fixed measurement operator $$\mathcal{A}$$, we use Lemma A.6 on the random variables $$\mathcal{A} \delta_{{\mathbf{x}}_i}$$ in $$\mathbb{C}^m$$. We observe that $$\|\mathcal{A}\delta_{{\mathbf{x}}_i}\|_2 = 1$$, $$\mathcal{A} \hat P=\left(\sum_{i=1}^n \mathcal{A} \delta_{{\mathbf{x}}_i}\right)/n$$ and $$\mathcal{A} P^*=\tfrac{1}{\sqrt{m}}\left[\mathbb{E}_{{\mathbf{x}} \sim P^{*}}{\rm e}^{-\mathsf{i} {\boldsymbol{\omega}}_j^T{\mathbf{x}}}\right]_{j=1...m}=\mathbb{E}_{{\mathbf{x}} \sim P^{*}}\mathcal{A} \delta_{{\mathbf{x}}}$$. Hence, with probability at least $$1-\rho/2$$ on the drawing of items: \begin{equation*} \|\mathcal{A} \hat P - \mathcal{A} P^*\|_2 \leq \epsilon. \end{equation*} In other words, we obtain the following result: \begin{equation}\label{eq:empbound} \forall {\it{\Omega}},~ \int_{\mathcal{X}} \mathbf{1}_A({\it{\Omega}},\mathcal{X})\, {\rm d}{\mathbb{P}}_{\mathcal{X}}(\mathcal{X}) \geq 1-\rho/2. \end{equation} (B.23) Hence, \begin{align*} \iint_{{\it{\Omega}},\mathcal{X}} \mathbf{1}_A({\it{\Omega}},\mathcal{X})\, {\rm d}{\mathbb{P}}_{{\it{\Omega}},\mathcal{X}}({\it{\Omega}},\mathcal{X}) &= \int_{\it{\Omega}} \left(\int_{\mathcal{X}} \mathbf{1}_A({\it{\Omega}},\mathcal{X}) \, {\rm d}{\mathbb{P}}_{\mathcal{X}}(\mathcal{X}) \right) {\rm d}{\mathbb{P}}_{\it{\Omega}}({\it{\Omega}}) \notag \\ \geq& (1-\rho/2)\int_{\it{\Omega}} {\rm d}{\mathbb{P}}_{\it{\Omega}}({\it{\Omega}}) = 1-\rho/2, \end{align*} meaning that, with probability at least $$1-\rho/2$$ on the drawing of frequencies and items, we have \begin{equation} \label{eq:step2} \|\mathcal{A} \hat P - \mathcal{A} P^*\|_2 \leq \epsilon. \end{equation} (B.24) We can now conclude: a union bound yields that (B.22) and (B.24) are simultaneously satisfied with probability at least $$1-\rho$$, which leads to the desired result. □ Remark B.3 In (b) we used inequality (B.16), in particular the assumption that the inequality \begin{equation}\label{eq:TmpAssumption} \tilde{d}(P_{1},P_{2}) \geq \|\mathcal{A} P_{1}-\mathcal{A} P_{2}\|_{2} \end{equation} (B.25) holds uniformly over all possible choices of frequencies defining $$\mathcal{A}$$. Thanks to Lemma A.1, the inequality (B.16) indeed holds with $$\tilde{d} = d_{TV}$$. In the rest of this article, we concentrate on this setting. It would, however, be interesting for future work to consider metrics $$\tilde{d}$$ much closer to the natural choice $$\gamma_{{\it{\Lambda}}}\left({\ },{}\right)$$, since they may lead to sharper upper bounds. A possibility would be to relax (B.16) and only assume that inequality (B.25) (possibly up to multiplicative constants) holds with high probability on the draw of $$\mathcal{A}$$, given a pair $$P_{1} = P^{*}$$ and $$P_{2}=P_{proj}$$. B.3 Proof of Theorem 7.1 We now turn to the proof of Theorem 7.1, which consists in proving that the Assumptions $$\mathbf{A_i}$$ imply the hypotheses $$\mathbf{H_i}$$ in the $$\eta>0$$ case, and applying Theorem 7.1 bis. For Hypothesis $$\mathbf{H_1}(\eta,{\it{\Sigma}},{\it{\Lambda}},d)$$, we apply Lemma B.1 with the total variation norm $$\|.\|=\bar{d}=\|.\|_{TV}$$ to obtain the following corollary. Corollary B.1 Let a model $${\it{\Sigma}}$$ such that Assumption $$\mathbf{A_1}({\it{\Sigma}})$$ is satisfied, and a positive constant $$\eta>0$$. Then, for any frequency distribution $${\it{\Lambda}}$$, Hypothesis $$\mathbf{H_1}(\eta,{\it{\Sigma}},{\it{\Lambda}},d)$$ is satisfied with $$d=\|.\|_{TV}$$, and we have \begin{equation} N_{{\mathcal{S}}^\eta_P,\|.\|_{TV}}(\epsilon)\leq N_{{\it{\Sigma}},\|.\|_{TV}}\left(\frac{\epsilon\eta^2}{6}\right)\!. \end{equation} (B.26) Hypothesis $$\mathbf{H_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},f)$$ is an application of Bernstein’s inequality. Lemma B.2 Consider a model $${\it{\Sigma}}$$, a frequency distribution $${\it{\Lambda}}$$, a non-negative constant $$\eta\geq0$$ and a constant $${\mathrm{A}}$$ such that Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ is satisfied. Then Assumption $$\mathbf{H_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},f)$$ is satisfied with the function $$f$$ defined as \begin{equation} f(m,t)=\exp\left(-\frac{m}{2{\mathrm{A}}^2}\cdot \frac{t^2}{1+t/3}\right)\!. \end{equation} (B.27) Proof. Fix $$P^*\in{\it{\Sigma}}$$. Suppose $${\boldsymbol{\omega}}_j$$, $$j=1...m$$ are drawn i.i.d. from $${\it{\Lambda}}$$ and let $$\mathcal{A}$$ be the corresponding sketching operator. Let any $$\nu \in {\mathcal{S}}^\eta_{P^*}$$, denote $$P\in{\it{\Sigma}}$$ such that $$\nu=\frac{P^*-P}{\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)}$$. Denote $$Z_j=1-\frac{|\psi_{P^*}({\boldsymbol{\omega}}_j)-\psi_P({\boldsymbol{\omega}}_j)|^2}{\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)^2}$$. Since Assumption $$\mathbf{A_2}(\eta,{\it{\Sigma}},{\it{\Lambda}},{\mathrm{A}})$$ is verified and $$|\psi_{P^*}({\boldsymbol{\omega}}_j)-\psi_P({\boldsymbol{\omega}}_j)|\leq \|P^*-P\|_{TV}$$ for all frequencies, the $$Z_j$$’s are i.i.d. random variables verifying $$Z_j \in [1-{\mathrm{A}}^2;1]$$. Furthermore, according to Lemma A.1 we have necessarily $${\mathrm{A}}\geq 1$$, and thus we have \begin{equation} |Z_j| \leq {\mathrm{A}}^2. \end{equation} (B.28) The $$Z_j$$’s are also centered: \begin{align*} &\mathbb{E}_{{\boldsymbol{\omega}}_j \sim {\it{\Lambda}}}Z_j=1-\frac{\mathbb{E}_{{\boldsymbol{\omega}}\sim{\it{\Lambda}}}|\psi_{P^*}({\boldsymbol{\omega}})-\psi_P({\boldsymbol{\omega}})|^2}{\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)^2}=0. &\text{using (4.4)} \end{align*} Furthermore, we have \begin{align*} Var(Z_j)&=Var\left(\frac{|\psi_{P^*}({\boldsymbol{\omega}})-\psi_P({\boldsymbol{\omega}})|^2}{\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)^2}\right) \leq\frac{\mathbb{E}|\psi_{P^*}({\boldsymbol{\omega}})-\psi_P({\boldsymbol{\omega}})|^4}{\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)^4} \leq \frac{\|P^{*}-P\|_{TV}^{2} \cdot \mathbb{E}|\psi_{P^*}({\boldsymbol{\omega}})-\psi_P({\boldsymbol{\omega}})|^2}{\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)^2 \cdot \gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)^2}\\ &\leq{\mathrm{A}}^2 \cdot \frac{\mathbb{E}|\psi_{P^*}({\boldsymbol{\omega}})-\psi_P({\boldsymbol{\omega}})|^2}{\gamma_{{\it{\Lambda}}}\left({P^*},{P}\right)^2} = {\mathrm{A}}^2. \end{align*} Since $$\frac{1}{m}\sum_{j=1}^m Z_j=1-\|\mathcal{A}\nu\|_2^2$$, applying Bernstein’s inequality (Lemma A.5) we get : for all $$t>0$$, \begin{equation}\label{eq:rip_bernstein} \forall \nu \in {\mathcal{S}}_{P^*}^\eta,~{\mathbb{P}}_{\it{\Omega}}\left(1-\|\mathcal{A} \nu \|^2_2 \geq t\right) = {\mathbb{P}}_{\it{\Omega}}\left(\tfrac{1}{m}\sum_{j=1}^m Z_j \geq t\right) \leq \exp\left(-\frac{m}{2{\mathrm{A}}^2} \cdot \frac{t^{2}}{1+t/3}\right)\!. \end{equation} (B.29) □ We can finally prove Theorem 7.1, by combining Corollary B.1, Lemma B.2 and Theorem 7.1 bis. For simplification, we also use the following bound on the function $$f$$ defined in Lemma B.2: \[ f(m,7/16)=\exp\left(-\frac{147 m}{1760{\mathrm{A}}^2}\right) \leq \exp\left(-\frac{m}{12{\mathrm{A}}^2}\right)\!. \] Appendix C. Application to GMMs C.1 Compactness of mixture models In this section, we prove Lemma 7.1, which extends the compactness of the set $$\mathcal{G}$$ of basic distributions to the corresponding mixture model $${\mathcal{G}_{K}}$$. Proof of Lemma 7.1. Recall that we have assumed compactness of the set $$\mathcal{G}$$ with respect to some norm $$\|.\|$$. In particular, it is bounded, and we note $$C=\max_{P\in\mathcal{G}}\|P\|$$. Let $$\epsilon>0$$ and $$\tau \,{\in}\, ]0;1[$$. Denote $$\epsilon_1=\tau\epsilon$$ and $$\epsilon_2=(1-\tau)\epsilon/C$$. Also denote $$N_1=N_{\mathcal{G},\|.\|}(\epsilon_1)$$ and let $${\mathcal{N}}_1=\{P_1,...,P_{N_1}\}$$ be an $$\epsilon_1$$-covering of $$\mathcal{G}$$. Similarly, let $$B_1^+=\{{\boldsymbol{\alpha}} \in \mathbb{R}_+^K, \sum_{k=1}^K \alpha_k=1\}$$, denote $$N_2=N_{B_1^+,\|.\|_1}(\epsilon_2)$$, let $${\mathcal{N}}_2=\{{\boldsymbol{\alpha}}_1,...,{\boldsymbol{\alpha}}_{N_2}\}$$ be an $$\epsilon_2$$-covering of $$B_1^+$$. Denote $$B_1:=B_{\mathbb{R},\|.\|_1}(0,1)$$ the unit $$\ell_1$$-ball in $$\mathbb{R}^K$$, note that $$B_1^+ \subset B_1$$, such that we have \begin{equation} N_2 = N_{B_1^+,\|.\|_1}(\epsilon_2)\stackrel{\text{Lemma A.2}}{\leq} N_{B_1,\|.\|_1}(\epsilon_2/2) \stackrel{\text{Lemma A.4}}{\leq} (8/\epsilon_2)^K. \label{eq:covnumalpha} \end{equation} (C.1) Define the following set: \begin{equation} {\it{\Gamma}}:=\left\lbrace P_{{\it{\Theta}},{\boldsymbol{\alpha}}}\in{\mathcal{G}_{K}};~ \forall k,~ P_{{\boldsymbol{\theta}}_k} \in {\mathcal{N}}_1,~ {\boldsymbol{\alpha}} \in {\mathcal{N}}_2\right\rbrace\!. \end{equation} (C.2) The cardinality of this set verifies $$\left\lvert {\it{\Gamma}}\right\rvert \leq \left(\left\lvert{\mathcal{N}}_1\right\rvert\right)^K \left\lvert{\mathcal{N}}_2\right\rvert=(N_1)^K N_2$$. Let us show that $${\it{\Gamma}}$$ is an $$\epsilon$$-covering of $${\mathcal{G}_{K}}$$. Let $$P_{{\it{\Theta}},{\boldsymbol{\alpha}}} \in {\mathcal{G}_{K}}$$ be any $$K$$-sparse distribution. For all $$k=1...K$$, let $$P_{\bar{\boldsymbol{\theta}}_k} \in {\mathcal{N}}_1$$ be the distribution in $${\mathcal{N}}_1$$, which is the closest to $$P_{{\boldsymbol{\theta}}_k}$$, and let $$\bar{\boldsymbol{\alpha}} \in {\mathcal{N}}_2$$ be the weight vector in $${\mathcal{N}}_2$$ that is the closest to $${\boldsymbol{\alpha}}$$. Denote $$\bar{\it{\Theta}}=\{\bar{\boldsymbol{\theta}}_1,...,\bar{\boldsymbol{\theta}}_K\}$$, and note that $$P_{\bar{\it{\Theta}},\bar{\boldsymbol{\alpha}}} \in {\it{\Gamma}}$$. We have \begin{align} \|P_{{\it{\Theta}},{\boldsymbol{\alpha}}}-P_{\bar{\it{\Theta}},\bar{\boldsymbol{\alpha}}}\|&= \left\lVert \sum_{k=1}^K \alpha_k P_{{\boldsymbol{\theta}}_k} - \sum_{k=1}^K \bar\alpha_k P_{\bar{\boldsymbol{\theta}}_k} \right\rVert, \notag \\ &\leq \left\lVert \sum_{k=1}^K \alpha_k P_{{\boldsymbol{\theta}}_k}- \sum_{k=1}^K \alpha_k P_{\bar{\boldsymbol{\theta}}_k} \right\rVert + \left\lVert \sum_{k=1}^K \alpha_k P_{\bar{\boldsymbol{\theta}}_k} - \sum_{k=1}^K \bar\alpha_k P_{\bar{\boldsymbol{\theta}}_k} \right\rVert, \notag \\ &\leq \sum_{k=1}^K \alpha_k \left\lVert P_{{\boldsymbol{\theta}}_k} - P_{\bar{\boldsymbol{\theta}}_k} \right\rVert + \sum_{k=1}^K |\alpha_k-\bar\alpha_k| \left\lVert P_{\bar{\boldsymbol{\theta}}_k} \right\rVert, \notag \\ &\leq \sum_{k=1}^K \alpha_k \|P_{{\boldsymbol{\theta}}_k}-P_{\bar{\boldsymbol{\theta}}_k}\| + C\|{\boldsymbol{\alpha}}-\bar{\boldsymbol{\alpha}}\|_1, \notag\\ &\leq \epsilon_1 \sum_{k=1}^K \alpha_k + C\epsilon_2 = \epsilon_1 + C\epsilon_2 = \epsilon ,\label{eq:lipgmm} \end{align} (C.3) and $${\it{\Gamma}}$$ is indeed an $$\epsilon$$-covering of $${\mathcal{G}_{K}}$$. Therefore, we have the bound (for all $$\tau$$) \begin{equation*} N_{{\mathcal{G}_{K}},\|.\|}(\epsilon) \leq \left\lvert {\it{\Gamma}}\right\rvert \leq (N_1)^K N_2 \stackrel{\text{by (C.1)}}{\leq} \left(\frac{8C\cdot N_{\mathcal{G},\|.\| }(\tau\epsilon)}{(1-\tau)\epsilon}\right)^K\!. \end{equation*} Furthermore, in equation (C.3), we have shown in particular that for all $$P_{{\it{\Theta}},{\boldsymbol{\alpha}}},P_{{\it{\Theta}}',{\boldsymbol{\alpha}}'} \in {\mathcal{G}_{K}}$$, \begin{align*} \|P_{{\it{\Theta}},{\boldsymbol{\alpha}}}-P_{{\it{\Theta}}',{\boldsymbol{\alpha}}'}\|\leq& \sum_{k=1}^K \alpha_k\|P_{{\boldsymbol{\theta}}_k}-P_{{\boldsymbol{\theta}}'_k}\| + \|{\boldsymbol{\alpha}}-{\boldsymbol{\alpha}}'\|_1, \end{align*} and therefore the embedding $$(P_{{\boldsymbol{\theta}}_1},...,P_{{\boldsymbol{\theta}}_K},{\boldsymbol{\alpha}})\rightarrow P_{{\it{\Theta}},{\boldsymbol{\alpha}}}$$ from $$(\mathcal{G})^K\times B_1^+$$ to $${\mathcal{G}_{K}}$$ is continuous. Hence, $${\mathcal{G}_{K}}$$ is the continuous image of the set $$(\mathcal{G})^K\times B_1^+$$, which is compact since $$\mathcal{G}$$ is compact, and therefore $${\mathcal{G}_{K}}$$ is compact. □ C.2 Covering numbers of Gaussians Proof of Theorem 7.2. Consider the embedding $$\phi: {\mathcal{T}} \rightarrow \mathcal{G}$$ defined as $$\phi({\boldsymbol{\theta}})= P_{\boldsymbol{\theta}}$$, which is surjective by definition of $$\mathcal{G}$$. We show that $$\phi$$ is Lipschitz continuous, for the Euclidean norm on $${\mathcal{T}} \subset \mathbb{R}^{2n}$$ and total variation norm on $$\mathcal{G} \subset E$$. We begin by the classical Pinsker’s inequality [48]: \begin{equation} \label{eq:pinsker} \|P-{Q}\|_{TV} \leq \sqrt{2D_{KL}(P||{Q})}, \end{equation} (C.4) where $$D_{KL}$$ is the Kullback–Leibler divergence. By symmetry, we have: \begin{equation} \label{eq:pinsker2} \|P-{Q}\|^2_{TV} \leq D_{KL}(P||{Q})+D_{KL}({Q}||P). \end{equation} (C.5) The Kullback–Leibler divergence has a closed-form expression in the case of multivariate Gaussians [47]: \begin{equation} \label{eq:kl_gauss} D_{KL}(P_{{\boldsymbol{\theta}}_1}||P_{{\boldsymbol{\theta}}_2}) = \frac{1}{2}\left[\log\frac{|{\boldsymbol{{\it{\Sigma}}}}_2|}{|{\boldsymbol{{\it{\Sigma}}}}_1|} + \mathrm{tr}\left({\boldsymbol{{\it{\Sigma}}}}_2^{-1}{\boldsymbol{{\it{\Sigma}}}}_1\right)-d + \left({\boldsymbol{\mu}}_2-{\boldsymbol{\mu}}_1\right)^T {\boldsymbol{{\it{\Sigma}}}}_2^{-1}\left({\boldsymbol{\mu}}_2-{\boldsymbol{\mu}}_1\right) \right]\!. \end{equation} (C.6) In our case, with diagonal Gaussians and bounded parameters, we have \begin{align*} D_{KL}(P_{{\boldsymbol{\theta}}_1}||P_{{\boldsymbol{\theta}}_2}) + D_{KL}(P_{{\boldsymbol{\theta}}_1}||P_{{\boldsymbol{\theta}}_2}) =& \frac{\mathrm{tr}\left({\boldsymbol{{\it{\Sigma}}}}_2^{-1}{\boldsymbol{{\it{\Sigma}}}}_1\right) + \mathrm{tr}\left({\boldsymbol{{\it{\Sigma}}}}_1^{-1}{\boldsymbol{{\it{\Sigma}}}}_2\right)}{2}-d + \left({\boldsymbol{\mu}}_2-{\boldsymbol{\mu}}_1\right)^T \frac{{\boldsymbol{{\it{\Sigma}}}}_2^{-1}+{\boldsymbol{{\it{\Sigma}}}}_1^{-1}}{2}\left({\boldsymbol{\mu}}_2-{\boldsymbol{\mu}}_1\right) \\ =& \frac12 \sum_{\ell=1}^d \left(\frac{\sigma^2_{1,\ell}}{\sigma^2_{2,\ell}}+\frac{\sigma^2_{2,\ell}}{\sigma^2_{1,\ell}}-2\right) + \sum_{\ell=1}^d \frac{\sigma_{1,\ell}^{-2}+\sigma_{2,\ell}^{-2}}{2}(\mu_{2,\ell}-\mu_{1,\ell})^2, \\ \leq& \frac12 \sum_{\ell=1}^d \left(\frac{\sigma^4_{2,\ell}+\sigma^4_{1,\ell}-2\sigma^2_{1,\ell}\sigma^2_{2,\ell}}{\sigma^2_{1,\ell}\sigma^2_{2,\ell}} \right)+ \frac{1}{\sigma^2_{\min}}\|{\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2\|_2^2, \\ \leq& \frac{1}{2\sigma_{\min}^4}\sum_{\ell=1}^d\left(\sigma^2_{1,\ell}-\sigma^2_{2,\ell}\right)^2+ \frac{1}{\sigma^2_{\min}}\|{\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2\|_2^2, \\ \leq& \frac{1}{2\sigma_{\min}^4}\|{\boldsymbol{\sigma}}_2-{\boldsymbol{\sigma}}_1\|_2^2 + \frac{1}{\sigma^2_{\min}}\|{\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2\|_2^2 \leq L^2\|{\boldsymbol{\theta}}_1-{\boldsymbol{\theta}}_2\|^2_2, \end{align*} where $$L:=\max\left(\sigma_{\min}^{-1},\sigma_{\min}^{-2}/\sqrt{2}\right)$$. Hence, \begin{equation}\label{eq:lipg} \|P_{{\boldsymbol{\theta}}_1}-P_{{\boldsymbol{\theta}}_2}\|_{TV} \leq L\|{\boldsymbol{\theta}}_1-{\boldsymbol{\theta}}_2\|_2, \end{equation} (C.7) and the embedding $$\phi$$ is $$L$$-Lipschitz. Hence, $$\mathcal{G}$$ is the continuous image of a compact set and is compact. Since $$\phi$$ is also surjective, we can apply Lemma A.3 and conclude: denote $$\mathcal{B} \subseteq \mathbb{R}^{2d}$$ a ball of radius $$\text{radius}\left({{\mathcal{T}}}\right)$$ for the Euclidean norm such that $${\mathcal{T}} \subseteq \mathcal{B}$$, and we have \begin{equation} N_{\mathcal{G},\|.\|_{TV}}(\epsilon) \stackrel{\text{Lemma A.3}}{\leq} N_{{\mathcal{T}},\|.\|_2}(\epsilon/L) \stackrel{\text{Lemma A.2}}{\leq} N_{\mathcal{B},\|.\|_2}\left(\frac{\epsilon}{2L}\right) \stackrel{\text{Lemma A.4}}{\leq} \left(\frac{8L\text{radius}\left({{\mathcal{T}}}\right)}{\epsilon}\right)^{2d} = \left(\frac{B}{\epsilon}\right)^{2d}, \label{eq:covnumgauss} \end{equation} (C.8) which is the desired result. □ Proof of Corollary 7.2. Combining Theorem 7.2 and Lemma 7.1 proves that the set of GMMs $${\mathcal{G}_{K}}$$ is compact. Furthermore, we obtain the following bound, for all $$0<\tau<1$$: \begin{equation*} N_{{\mathcal{G}_{K}},\|.\|_{TV}}(\epsilon) \leq \frac{B^{2dK}8^K}{\tau^{2dK}(1-\tau)^K \epsilon^{(2d+1)K}}, \end{equation*} where $$B$$ is defined as in Theorem 7.2. By choosing10$$\tau=\frac{B}{B+1}$$, we obtain \begin{equation*} N_{{\mathcal{G}_{K}},\|.\|_{TV}}(\epsilon) \leq \left(\frac{B+1}{\epsilon}\right)^{(2d+1)K}8^K \leq \left(\frac{2(B+1)}{\epsilon}\right)^{(2d+1)K}, \end{equation*} since $$8^{\frac{1}{2d+1}} \leq 8^{1/3} = 2$$. □ C.3 Domination between metrics on the set of Gaussians In this section, we aim at proving Theorem 7.3. We begin by an intermediate result. Lemma C.1 Suppose that $${\mathcal{T}} \subset \mathbb{R}^{2d}$$ is such that $$\|{\boldsymbol{\mu}}\|_2\leq M$$ and $$0<\sigma_{\min}^2\leq \sigma^2_i \leq \sigma_{\max}^2$$ for all $$[{\boldsymbol{\mu}},{\boldsymbol{\sigma}}]\in{\mathcal{T}}$$. For all $$P_{{\boldsymbol{\theta}}_1},~P_{{\boldsymbol{\theta}}_2} \in \mathcal{G}$$, \begin{equation} \|{\boldsymbol{\theta}}_1-{\boldsymbol{\theta}}_2\|_2^2\leq D'\|{p}_{{\boldsymbol{\theta}}_1}-{p}_{{\boldsymbol{\theta}}_2}\|_{L^2(\mathbb{R}^d)}^2, \end{equation} (C.9) where \[ D':=\frac{(4\pi\sigma_{\max}^2)^{d/2+1}D_1}{\pi(1-{\rm e}^{-D_1})} \text{ with } D_1=\frac{M^2}{\sigma_{\min}^2} + \frac{d}{2}\log\left(\frac{\sigma_{\max}^2}{\sigma_{\min}^2}\right). \] Proof. We use a property from [3] on product of Gaussians: \begin{equation} \int\! \mathcal{N}({\mathbf{x}};{\boldsymbol{\mu}}_a,{\boldsymbol{{\it{\Sigma}}}}_a)\mathcal{N}({\mathbf{x}};{\boldsymbol{\mu}}_b,{\boldsymbol{{\it{\Sigma}}}}_b)\,{\rm d}{\mathbf{x}}=\frac{1}{(2\pi)^{d/2}|{\boldsymbol{{\it{\Sigma}}}}_a+{\boldsymbol{{\it{\Sigma}}}}_b|^{1/2}}\exp\left(\!\!-\frac{1}{2}({\boldsymbol{\mu}}_a-{\boldsymbol{\mu}}_b)^T({\boldsymbol{{\it{\Sigma}}}}_a+{\boldsymbol{{\it{\Sigma}}}}_b)^{-1}({\boldsymbol{\mu}}_a-{\boldsymbol{\mu}}_b)\!\!\right)\!. \end{equation} (C.10) Hence, we have \begin{align} \|{p}_{{\boldsymbol{\theta}}_1}-{p}_{{\boldsymbol{\theta}}_2}\|_{L^2(\mathbb{R}^d)}^2&=\int ({p}_{{\boldsymbol{\theta}}_1}({\mathbf{x}})-{p}_{{\boldsymbol{\theta}}_2}({\mathbf{x}}))^2\,{\rm d}{\mathbf{x}} \notag \\ &=\int {p}_{{\boldsymbol{\theta}}_1}({\mathbf{x}})^2\,{\rm d}{\mathbf{x}}+\int {p}_{{\boldsymbol{\theta}}_2}({\mathbf{x}})^2\,{\rm d}{\mathbf{x}} -2\int {p}_{{\boldsymbol{\theta}}_1}({\mathbf{x}}){p}_{{\boldsymbol{\theta}}_2}({\mathbf{x}})\,{\rm d}{\mathbf{x}} \notag \\ &=\frac{1}{(2\pi)^{d/2}}\left[|2{\boldsymbol{{\it{\Sigma}}}}_1|^{-\frac{1}{2}}+|2{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}-2|{\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}{\rm e}^{-\frac12 ({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)^T({\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2)^{-1}({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)}\right] \notag \\ &=\frac{|2{\boldsymbol{{\it{\Sigma}}}}_1|^{-\frac{1}{2}}+|2{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}{(2\pi)^{d/2}}\left[1-{\rm e}^{-\left(\frac12 ({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)^T({\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2)^{-1}({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)+\log\left(\frac{|2{\boldsymbol{{\it{\Sigma}}}}_1|^{-\frac{1}{2}}+|2{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}{2|{\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}\right)\right)}\right] \notag \\ &\geq \frac{2}{(4\pi\sigma_{\max}^2)^{d/2}}\left[1-{\rm e}^{-\left(\frac12 ({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)^T({\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2)^{-1}({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)+\log\left(\frac{|2{\boldsymbol{{\it{\Sigma}}}}_1|^{-\frac{1}{2}}+|2{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}{2|{\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}\right)\right)}\right] \label{eq:toy_gauss_inter}. \end{align} (C.11) On the one hand, we have \begin{equation}\label{eq:toy_gauss_mu_bound} 0 \leq \frac{1}{4\sigma^2_{\max}}\|{\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2\|^2_2 \leq \frac12 ({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)^T({\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2)^{-1}({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2) \leq \frac12 \frac{4M^2}{2\sigma_{\min}^2} = \frac{M^2}{\sigma_{\min}^2}. \end{equation} (C.12) On the other hand, \begin{align*} \log\left(\frac{|2{\boldsymbol{{\it{\Sigma}}}}_1|^{-\frac{1}{2}}+|2{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}{2|{\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}\right)&=\log\left(\frac{|2{\boldsymbol{{\it{\Sigma}}}}_1|^{-\frac{1}{2}}+|2{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}{2}\right) + \frac12 \log|{\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2| \\ &\geq \frac12\log|2{\boldsymbol{{\it{\Sigma}}}}_1|^{-\frac{1}{2}}+\frac12 \log|2{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}} + \frac12 \log|{\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2| &\text{by concavity of the log}\\ &=\frac{1}{4}\left(\sum_{\ell=1}^d \log(\sigma_{1,\ell}^2+\sigma_{2,\ell}^2)^2-\sum_{\ell=1}^d \log(2\sigma_{1,\ell}^2) -\sum_{\ell=1}^d \log(2\sigma_{2,\ell}^2)\right) \\ &=\frac{1}{4}\sum_{\ell=1}^d \log\left(\frac{(\sigma_{1,\ell}^2+\sigma_{2,\ell}^2)^2}{4\sigma_{1,\ell}^2\sigma_{2,\ell}^2}\right)\\ &\geq \frac{1}{4}\sum_{\ell=1}^d\left(1-\frac{4\sigma_{1,\ell}^2\sigma_{2,\ell}^2}{(\sigma_{1,\ell}^2+\sigma_{2,\ell}^2)^2}\right) &\text{since $-\log(x)\geq 1-x$}\\ &= \sum_{\ell=1}^d \frac{(\sigma_{1,\ell}^2-\sigma_{2,\ell}^2)^2}{4(\sigma_{1,\ell}^2+\sigma_{2,\ell}^2)} \geq \frac{1}{8\sigma_{\max}^2}\|{\boldsymbol{\sigma}}_1-{\boldsymbol{\sigma}}_2\|_2^2 \end{align*} and therefore \begin{equation}\label{eq:toy_gauss_sig_bound} 0 \leq \frac{1}{8\sigma_{\max}^2}\|{\boldsymbol{\sigma}}_1-{\boldsymbol{\sigma}}_2\|_2^2 \leq \log\left(\frac{|2{\boldsymbol{{\it{\Sigma}}}}_1|^{-\frac{1}{2}}+|2{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}{2|{\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}\right) \leq \log\left(\left(\frac{2\sigma_{\max}^2}{2\sigma_{\min}^2}\right)^{d/2}\right) = \frac{d}{2}\log\left(\frac{\sigma_{\max}^2}{\sigma_{\min}^2}\right)\!. \end{equation} (C.13) We can now bound \begin{equation}\label{eq:toy_gauss_bounds} 0\leq \frac{1}{8\sigma_{\max}^2}\|{\boldsymbol{\theta}}_1-{\boldsymbol{\theta}}_2\|_2^2 \leq \left[\frac12 ({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)^T({\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2)^{-1}({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)+\log\left(\frac{|2{\boldsymbol{{\it{\Sigma}}}}_1|^{-\frac{1}{2}}+|2{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}{2|{\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}\right)\right] \leq D_1, \end{equation} (C.14) where \begin{equation*} D_1=\frac{M^2}{\sigma_{\min}^2} + \frac{d}{2}\log\left(\frac{\sigma_{\max}^2}{\sigma_{\min}^2}\right)\!. \end{equation*} By concavity of $$x\mapsto 1-{\rm e}^{-x}$$, the function $$\varphi: x\mapsto \frac{1-{\rm e}^{-x}}{x}$$ is decreasing. Hence, we have: \begin{equation*} \forall x\in[0;D_1],~1-{\rm e}^{-x}\geq \frac{1-{\rm e}^{-D_1}}{D_1}x. \end{equation*} Therefore, given (C.11) and (C.14), we have \begin{equation*} \|{p}_{{\boldsymbol{\theta}}_1}-{p}_{{\boldsymbol{\theta}}_2}\|_{L^2(\mathbb{R}^d)}^2 \geq D_2 \left(\frac12 ({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)^T({\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2)^{-1}({\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2)+\log\left(\frac{|2{\boldsymbol{{\it{\Sigma}}}}_1|^{-\frac{1}{2}}+|2{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}{2|{\boldsymbol{{\it{\Sigma}}}}_1+{\boldsymbol{{\it{\Sigma}}}}_2|^{-\frac{1}{2}}}\right)\right)\!, \end{equation*} where $$D_2:=\frac{2(1-{\rm e}^{-D_1})}{D_1(4\pi\sigma_{\max}^2)^{d/2}}$$. And we have, using (C.14) again: \begin{align*} \|{p}_{{\boldsymbol{\theta}}_1}-{p}_{{\boldsymbol{\theta}}_2}\|_{L^2(\mathbb{R}^d)}^2 \geq \frac{D_2}{8\sigma_{\max}^2}\|{\boldsymbol{\theta}}_1-{\boldsymbol{\theta}}_2\|_2^2, \end{align*} which leads to the desired result. □ Proof of Theorem 7.3. We denote $$\mathcal{F}$$ and $$\mathcal{F}^{-1}$$ the Fourier and inverse Fourier transform: \begin{align*} \mathcal{F}(f)({\boldsymbol{\omega}})&=\int_{\mathbb{R}^d}{\rm e}^{-\mathsf{i} {\boldsymbol{\omega}}^T{\mathbf{x}}}f({\mathbf{x}})\,{\rm d}{\mathbf{x}}, \\ \mathcal{F}^{-1}(F)({\mathbf{x}})&=\frac{1}{(2\pi)^d}\int_{\mathbb{R}^d}{\rm e}^{\mathsf{i} {\boldsymbol{\omega}}^T{\mathbf{x}}}F({\boldsymbol{\omega}})\,{\rm d}{\boldsymbol{\omega}}. \end{align*} We recall the classical Plancherel’s Theorem: \begin{equation} \label{eq:plancherel} \|f\|_2^2=\frac{1}{(2\pi)^d}\|\mathcal{F}(f)\|_2^2. \end{equation} (C.15) Let $$P_1,P_2\in \mathcal{G}$$. Recall that $${\it{\Lambda}}=\mathcal{N}\left(0,\frac{a}{n}{\mathbf{I}}\right)$$, denote $$\sigma_{\it{\Lambda}}^2:=\frac{a}{d}$$. The MMD is expressed: \begin{align*} \gamma^2_{{\it{\Lambda}}}\left({P_1},{P_2}\right)&=\int |\psi_1({\boldsymbol{\omega}})-\psi_2({\boldsymbol{\omega}})|^2\mathcal{N}({\boldsymbol{\omega}};0,\sigma_{\it{\Lambda}}^2{\mathbf{I}})\,{\rm d}{\boldsymbol{\omega}} \\ &=\int |\psi_1({\boldsymbol{\omega}})-\psi_2({\boldsymbol{\omega}})|^2\frac{1}{(2\pi\sigma_{\it{\Lambda}}^2)^{d/2}}\,{\rm e}^{-\frac{\|{\boldsymbol{\omega}}\|_2^2}{2\sigma_{\it{\Lambda}}^2}}{\rm d}{\boldsymbol{\omega}} \\ &=\frac{1}{(2\pi\sigma_{\it{\Lambda}}^2)^{d/2}}\int \left\lvert({\rm e}^{-\mathsf{i}{\boldsymbol{\omega}}^T{\boldsymbol{\mu}}_1}\,{\rm e}^{-\frac{1}{2}{\boldsymbol{\omega}}^T\left({\boldsymbol{{\it{\Sigma}}}}_1+\frac{{\mathbf{I}}}{2\sigma_{\it{\Lambda}}^2}\right){\boldsymbol{\omega}}}-{\rm e}^{-\mathsf{i}{\boldsymbol{\omega}}^T{\boldsymbol{\mu}}_2}\,{\rm e}^{-\frac{1}{2}{\boldsymbol{\omega}}^T\left({\boldsymbol{{\it{\Sigma}}}}_2+\frac{{\mathbf{I}}}{2\sigma_{\it{\Lambda}}^2}\right){\boldsymbol{\omega}}}\right\rvert^2{\rm d}{\boldsymbol{\omega}} \\ &=\frac{1}{(2\pi\sigma_{\it{\Lambda}}^2)^{d/2}}\int \left\lvert\psi_{P'_1}({\boldsymbol{\omega}})-\psi_{P'_2}({\boldsymbol{\omega}})\right\rvert^2{\rm d}{\boldsymbol{\omega}}, \end{align*} where $$P'_i$$ is a Gaussian with the same mean than $$P_i$$ and dilated variance $${\boldsymbol{{\it{\Sigma}}}}_i'={\boldsymbol{{\it{\Sigma}}}}_i+\frac{{\mathbf{I}}}{2\sigma_{\it{\Lambda}}^2}$$. Since $$\psi_{P'_i}=\mathcal{F}({p}'_i)$$, by Plancherel’s Theorem we have: \begin{equation*} \gamma^2_{{\it{\Lambda}}}\left({P_1},{P_2}\right)=\left(\frac{2\pi}{\sigma_{\it{\Lambda}}^2}\right)^{d/2}\int \left\lvert {p}'_1({\mathbf{x}})-{p}'_2({\mathbf{x}})\right\rvert^2\,{\rm d}{\mathbf{x}}=\left(\frac{2\pi}{\sigma_{\it{\Lambda}}^2}\right)^{d/2}\|{p}'_1-{p}'_2\|_2^2. \end{equation*} The parameters of the Gaussians $$P'_i$$ belong to a compact set $${\mathcal{T}}'=\left\lbrace\left({\boldsymbol{\mu}},{\boldsymbol{{\it{\Sigma}}}}+\frac{{\mathbf{I}}}{2\sigma_{\it{\Lambda}}^2}\right); ({\boldsymbol{\mu}},{\boldsymbol{{\it{\Sigma}}}})\in {\mathcal{T}}\right\rbrace$$. We can therefore apply Lemma C.1, such that: \begin{equation} D'\|{p}'_1-{p}'_2\|_2^2 \geq \|{\boldsymbol{\theta}}'_1-{\boldsymbol{\theta}}'_2\|_2^2=\|{\boldsymbol{\theta}}_1-{\boldsymbol{\theta}}_2\|_2^2. \end{equation} (C.16) The last equality comes from the fact that the variance between $${\boldsymbol{\theta}}_i$$ and $${\boldsymbol{\theta}}'_i$$ are just translated. The constant $$D'$$ is: \begin{equation*} D'=\frac{(4\pi(\sigma_{\max}^2+1/(2\sigma_{\it{\Lambda}}^2)))^{d/2+1}D_1}{\pi(1-{\rm e}^{-D_1})} \text{ with } D_1=\frac{M^2}{\sigma_{\min}^2+1/(2\sigma_{\it{\Lambda}}^2)} + \frac{d}{2}\log\left(\frac{\sigma_{\max}^2+1/(2\sigma_{\it{\Lambda}}^2)}{\sigma_{\min}^2+1/(2\sigma_{\it{\Lambda}}^2)}\right), \end{equation*} and thus \begin{equation*} \left(\frac{\sigma_{\it{\Lambda}}^2}{2\pi}\right)^{d/4}D'\gamma_{{\it{\Lambda}}}\left({P_1},{P_2}\right)\geq \|{\boldsymbol{\theta}}_1-{\boldsymbol{\theta}}_2\|_2. \end{equation*} Considering equation (C.7) in the proof of Theorem 7.2, we have \begin{equation} \|P_1-P_2\|_{TV}\leq \max(\sigma_{\min}^{-1},\sigma_{\min}^{-2}/\sqrt{2})\|{\boldsymbol{\theta}}_1-{\boldsymbol{\theta}}_2\|_2 \leq \bar D\gamma_{{\it{\Lambda}}}\left({P_1},{P_2}\right)\!, \end{equation} (C.17) with $$\bar D:=\max(\sigma_{\min}^{-1},\sigma_{\min}^{-2}/\sqrt{2})\left(\frac{\sigma_{\it{\Lambda}}^2}{2\pi}\right)^{d/4}D'=\max(\sigma_{\min}^{-1},\sigma_{\min}^{-2}/\sqrt{2})\sqrt{\frac{2D_1\cdot (2\sigma_{\it{\Lambda}}^2\sigma_{\max}^2+1)^{d/2+1}}{\sigma_{\it{\Lambda}}^2(1-{\rm e}^{-D_1})}}$$. We now use the fact that $$\sigma_{\it{\Lambda}}^2=\frac{a}{d}$$. We have \begin{align} \hspace{-1pc}D_1&=\frac{M^2}{\sigma_{\min}^2+d/(2a)} + \frac{d}{2}\log\left(\frac{2a\sigma_{\max}^2/d+1}{2a\sigma_{\min}^2/d+1}\right) \notag\\ \hspace{-1pc}&\leq \frac{M^2}{\sigma_{\min}^2+d/(2a)} + \frac{d}{2}\log\left(2a\sigma_{\max}^2/d+1\right) \notag\\ \tag{since $\log(1+x)\leq x$} \hspace{-1pc}&\leq \frac{M^2}{\sigma_{\min}^2+d/(2a)} + \frac{d}{2}2a\sigma_{\max}^2/d = \frac{M^2}{\sigma_{\min}^2+d/(2a)} + a\sigma_{\max}^2\hspace{3pc}\\ \hspace{-1pc}&\leq \sigma_{\max}^2a\left(1+\frac{2M^2}{d}\right) := D_2 \label{eq:final1} \end{align} (C.18) and similarly \begin{align} (2\sigma_{\it{\Lambda}}^2\sigma_{\max}^2+1)^{d/2+1}&=(2a\sigma_{\max}^2/d+1)\exp\left(\frac{d}{2}\log\left(2a\sigma_{\max}^2/d+1\right)\right) \notag\\ &\leq (2a\sigma_{\max}^2/d+1){\rm e}^{a\sigma_{\max}^2} \notag \\ &\leq {\rm e}^{a\sigma_{\max}^2\left(1+\frac{1}{d}\right)} \leq {\rm e}^{3a\sigma_{\max}^2}. \label{eq:final2} \end{align} (C.19) Using (C.18) with the fact that the function $$\varphi:x\mapsto \frac{1-{\rm e}^{-x}}{x}$$ is decreasing, and (C.19), we obtain \begin{equation} \|P_1-P_2\|_{TV} \leq D\gamma_{{\it{\Lambda}}}\left({P_1},{P_2}\right)\!, \end{equation} (C.20) with $$D:=\max(\sigma_{\min}^{-1},\sigma_{\min}^{-2}/\sqrt{2})\sqrt{\frac{2dD_2\cdot {\rm e}^{3a\sigma_{\max}^2}}{a(1-{\rm e}^{-D_2})}}$$. □ © The authors 2017. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.
http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png
Information and Inference: A Journal of the IMA
Oxford University Press
http://www.deepdyve.com/lp/oxford-university-press/sketching-for-large-scale-learning-of-mixture-models-pdYJzaSSj4