# Algorithms for learning sparse additive models with interactions in high dimensions

## Abstract

A function $$f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}$$ is a sparse additive model (SPAM) if it is of the form $$f(\mathbf x) = \sum_{l \in \mathscr{S}}\phi_{l}(x_l)$$, where $$\mathscr{S} \subset [{d}]$$, $${|{{\mathscr{S}}}|} \ll {d}$$. Assuming the $$\phi_l$$'s and $$\mathscr{S}$$ are unknown, there exists extensive work for estimating $$f$$ from its samples. In this work, we consider a generalized version of SPAMs that also allows for the presence of a sparse number of second-order interaction terms. For some $${\mathscr{S}_1} \subset [{d}], {\mathscr{S}_2} \subset {[d] \choose 2}$$, with $${|{{{\mathscr{S}_1}}}|} \ll {d}, {|{{{\mathscr{S}_2}}}|} \ll d^2$$, the function $$f$$ is now assumed to be of the form: $$\sum_{p \in {\mathscr{S}_1}}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}}\phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}})$$. Assuming we have the freedom to query $$f$$ anywhere in its domain, we derive efficient algorithms that provably recover $${\mathscr{S}_1},{\mathscr{S}_2}$$ with finite sample bounds. Our analysis covers the noiseless setting where exact samples of $$f$$ are obtained and also extends to the noisy setting where the queries are corrupted with noise. For the noisy setting in particular, we consider two noise models, namely i.i.d. Gaussian noise and arbitrary but bounded noise. Our main methods for identification of $${\mathscr{S}_2}$$ essentially rely on estimation of sparse Hessian matrices, for which we provide two novel compressed sensing-based schemes. Once $${\mathscr{S}_1}, {\mathscr{S}_2}$$ are known, we show how the individual components $$\phi_p$$, $$\phi_{(l,l^{\prime})}$$ can be estimated via additional queries of $$f$$, with uniform error bounds. Lastly, we provide simulation results on synthetic data that validate our theoretical findings.

## 1. Introduction

Many scientific problems involve estimating an unknown function $$f$$, defined over a compact subset of $${\mathbb{R}}^{{d}}$$, with $${d}$$ large. Such problems arise, for instance, in modeling complex physical processes [32,35,58]. Information about $$f$$ is typically available in the form of point values $$(x_i,f(x_i))_{i=1}^{n}$$, which are then used for learning $$f$$. It is well known that the problem suffers from the curse of dimensionality, if only smoothness assumptions are placed on $$f$$. For example, if $$f$$ is $$C^s$$ smooth ($$s$$ times continuously differentiable), then for uniformly approximating $$f$$ within error $$\delta \in (0,1)$$, one needs $$n = {\it{\Omega}}(\delta^{-{d}/s})$$ samples [51]. A popular line of work in recent times considers the setting where $$f$$ possesses an intrinsic low-dimensional structure, i.e. depends on only a small subset of the $${d}$$ variables. There exist algorithms for estimating such $$f$$—tailored to the underlying structural assumption—along with attractive theoretical guarantees, which do not suffer from the curse of dimensionality (cf. [9,15,18,53]). One such assumption leads to the class of sparse additive models (SPAMs) wherein $$f = \sum_{l \in {\mathscr{S}}} \phi_l$$ for some unknown $$\mathscr{S} \subset {\left\{{{1,\dots,{d}}}\right\}}$$ with $${|{{{\mathscr{S}}}}|} = {k} \ll {d}$$. There exist several algorithms for learning these models (cf. [23,33,43,45,54]). Here, we focus on a generalized SPAM, where $$f$$ can also contain a small number of second-order interaction terms, i.e. $$f(x_1,x_2,\dots,x_d) = \sum_{p \in {\mathscr{S}_1}}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}}\phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}}) \quad {\mathscr{S}_1} \subset [{d}], {\mathscr{S}_2} \subset {[d] \choose 2},$$ (1.1) with $${|{{{\mathscr{S}_1}}}|} \ll {d},{|{{{\mathscr{S}_2}}}|} \ll {d}^2$$. 
Here, $$\phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}}) \not\equiv g_l(x_l) + h_{{l^{\prime}}}(x_{{l^{\prime}}})$$ for any univariates $$g_l, h_{{l^{\prime}}}$$, meaning that $$\partial_l \partial_{l^{\prime}} \phi_{(l,l^{\prime})} \not\equiv 0$$. As opposed to SPAMs, the problem is significantly harder now—allowing interactions leads to an additional $${d}({d}-1)/2$$ unknowns, out of which only a few terms (i.e. those in $${\mathscr{S}_2}$$) are relevant. In the sequel, we will denote $${\mathscr{S}}$$ to be the support of $$f$$ consisting of variables that are part of $${\mathscr{S}_1}$$ or $${\mathscr{S}_2}$$ and $$k$$ to be the size of $${\mathscr{S}}$$. Moreover, we will denote $${\rho_{m}}$$ to be the maximum number of occurrences of a variable in $${\mathscr{S}_2}$$—this parameter captures the underlying complexity of the interactions. There exist relatively few results for learning models of the form (1.1), with the existing work being mostly in the regression framework in statistics (cf. [31, 42, 49]). Here, $$(x_i,f(x_i))_{i=1}^{n}$$ are typically samples from an unknown probability measure $$\mathbb{P}$$, with the samples moreover assumed to be corrupted with (i.i.d.) stochastic noise. In this article, we consider the approximation theoretic setting, where we have the freedom to query $$f$$ at any desired set of points (cf. [15, 18, 54]). We propose strategies for querying $$f$$, along with efficient recovery algorithms, which lead to much stronger guarantees than those known in the regression setting. In particular, we provide the first finite sample bounds for exactly recovering $${\mathscr{S}_1}$$ and $${\mathscr{S}_2}$$. This is shown for (i) the noiseless setting where exact samples are observed, as well as (ii) the noisy setting, where the samples are corrupted with noise (either i.i.d. Gaussian or arbitrary but bounded noise models). 
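For illustration, here is a minimal numerical instance of a function of the form (1.1). The dimension, the active indices and the sin/cos components are purely illustrative choices of ours, not taken from the analysis:

```python
import numpy as np

d = 100
S1 = [3, 17]        # indices with purely univariate effects (illustrative)
S2 = [(5, 42)]      # a single interacting pair (illustrative)

def f(x):
    """A generalized SPAM: sparse univariate plus sparse bivariate terms."""
    val = sum(np.sin(x[p]) for p in S1)                  # phi_p(x_p)
    val += sum(x[l] * np.cos(x[lp]) for (l, lp) in S2)   # phi_(l,l')(x_l, x_l')
    return val

x = np.zeros(d)
x[3], x[5] = 0.5, 1.0
print(f(x))   # sin(0.5) + 1.0*cos(0) ~ 1.4794; only 4 of 100 coordinates matter
```

Although $$f$$ is defined on $${\mathbb{R}}^{100}$$, its value depends on just four coordinates, which is precisely the structure the algorithms below exploit.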
Once $${\mathscr{S}_1}$$, $${\mathscr{S}_2}$$ are identified, we show in Section 6 how the individual components $$\phi_p, \phi_{(l,l^{\prime})}$$ of the model can be estimated, with uniform error bounds. This is shown for both the noiseless and noisy query settings. It is accomplished by additionally sampling $$f$$ along the identified one/two-dimensional subspaces corresponding to $${\mathscr{S}_1},{\mathscr{S}_2}$$, respectively, and by employing standard estimators from approximation theory and statistics.

### 1.1 Our contributions

We make the following contributions for learning models of the form (1.1).

1. First, we provide an efficient algorithm, namely Algorithm 3, which provably recovers $${\mathscr{S}_1}, {\mathscr{S}_2}$$ exactly with high probability (w.h.p.), with $$O({k} {\rho_{m}} (\log {d})^3)$$ noiseless queries. When the point queries are corrupted with (i.i.d.) Gaussian noise, we show that Algorithm 3 identifies $${\mathscr{S}_1}$$, $${\mathscr{S}_2}$$ w.h.p., with $$O({\rho_{m}}^5 {k}^2 (\log {d})^4)$$ noisy queries of $$f$$. We also analyze the setting of arbitrary but bounded noise and derive sufficient conditions on the noise magnitude that enable recovery of $${\mathscr{S}_1}, {\mathscr{S}_2}$$.
2. Second, we provide another efficient algorithm, namely Algorithm 4, which provably recovers $${\mathscr{S}_1}, {\mathscr{S}_2}$$ exactly w.h.p., with (i) $$O({k} {\rho_{m}} (\log {d})^2)$$ noiseless queries and (ii) $$O({\rho_{m}}^5 {k}^5 (\log {d})^3)$$ noisy queries (i.i.d. Gaussian noise). We also analyze the setting of arbitrary but bounded noise.
3. We provide an algorithm tailored to the special case where the underlying interaction graph corresponding to $${\mathscr{S}_2}$$ is known to be a perfect matching, i.e. each variable interacts with at most one other variable (so $${\rho_{m}} = 1$$). We show that the algorithm identifies $${\mathscr{S}_1},{\mathscr{S}_2}$$ w.h.p., with (i) $$O({k} (\log d)^2)$$ noiseless queries and (ii) $$O({k}^2 (\log d)^3)$$ noisy queries (i.i.d. Gaussian noise). We also analyze the setting of arbitrary but bounded noise.
4. An important part of Algorithms 3 and 4 is a pair of novel compressive sensing (CS)-based methods for estimating sparse, $$d \times d$$ Hessian matrices. These might be of independent interest.

We also provide simulation results on synthetic data that validate our theoretical findings concerning the identification of $${\mathscr{S}_1},{\mathscr{S}_2}$$. Algorithm 3 appeared in AISTATS 2016 [55], in a preliminary version of this article. The results in Section 6 (estimating individual components of $$f$$) were part of the supplementary material in [55].

### 1.2 Related work

We now provide a brief overview of related work, followed by an outline of our main contributions and an overview of the methods. A more detailed comparison with related work is provided in Section 8.

**Learning SPAMs.** This model was introduced in the non-parametric regression setting by Lin & Zhang [31], who proposed the component selection and smoothing (COSSO) method—an extension of the lasso to the reproducing kernel Hilbert space (RKHS) setting. It essentially performs least squares minimization with a sparsity inducing penalty term involving the sum of norms of the function components. In fact, this method is designed to handle the more general smoothing spline analysis of variance model [21,56]. It has since been studied extensively in the regression framework with a multitude of results involving estimation of $$f$$ (cf. [23,25,26,33,43,45]) and/or variable selection, i.e. identifying the support $${\mathscr{S}}$$ (cf. [23, 45, 57]). A common theme behind (nearly all of) these approaches is to first (approximately) represent each $$\phi_j$$; $$1 \leq j \leq d$$, in a suitable basis of finite size. 
This is done, for example, via B-splines (cf. [23,33]) and finite combinations of kernel functions (cf. [26,43]). Thereafter, the problem reduces to a finite-dimensional one that involves finding the values of the coefficients in the corresponding basis representation. This is accomplished by performing least squares minimization subject to sparsity and smoothness inducing penalty terms—the optimization problem is convex on account of the choice of the penalty terms and hence can be solved efficiently. With regard to the problem of estimating $$f$$, Koltchinskii & Yuan [26] and Raskutti et al. [43] proposed a convex program for estimating $$f$$ in the RKHS setting along with $$L_2$$ error rates. These error rates were shown to be minimax optimal by Raskutti et al. [43]. For example, functions $$f$$ lying in a Sobolev space with smoothness parameter $$\alpha > 1/2$$ are estimated at the optimal $$L_2$$ rate: $$\frac{{k} \log d}{n} + {k} n^{-\frac{2\alpha}{2\alpha+1}}$$, where $$n$$ denotes the number of samples. There also exist results for the variable selection problem, i.e. for estimating the support $${\mathscr{S}}$$. In contrast to the setting of sparse linear models, for which non-asymptotic sample complexity bounds are known [58,59], the corresponding results in the non-parametric setting are usually asymptotic, i.e. derived in the limit of large $$n$$. This property is referred to as sparsistency in the statistics literature; an estimator is called sparsistent if $${\widehat{{{\mathscr{S}}}}} = {\mathscr{S}}$$ with probability approaching one as $$n \rightarrow \infty$$. Variable selection results for SPAMs in the non-parametric regression setting can be found, for instance, in [23, 45, 57]. Recently, Tyagi et al. [54] considered this problem in the approximation theoretic setting; they proposed a method that identifies $${\mathscr{S}}$$ w.h.p. with sample complexities $$O({k} \log d)$$, $$O({k}^3 (\log {d})^2)$$ in the absence/presence of Gaussian noise, respectively. 
Although there exists a significant amount of work in the literature for SPAMs, the aforementioned methods are designed specifically for learning SPAMs and cannot handle generalized SPAMs of the form (1.1) containing interaction terms.

**Learning generalized SPAMs.** There exist fewer results for generalized SPAMs of the form (1.1) in the regression setting. The COSSO algorithm [31] can handle (1.1); however, its convergence rates are shown only for the case of no interactions. Radchenko & James [42] proposed the VANISH algorithm, a least squares method with sparsity constraints, and showed that their method is sparsistent. Storlie et al. [49] proposed ACOSSO—an adaptive version of the COSSO algorithm—which can also handle (1.1). They derived convergence rates and sparsistency results for their method, albeit for the case of no interactions. Recently, Dalalyan et al. [13] and Yang & Tokdar [61] studied a generalization of (1.1) that allows for the presence of a sparse number of $$m$$-wise interaction terms for some additional sparsity parameter $$m$$. Although they derive non-asymptotic $$L_2$$ error rates for estimating $$f$$ in such a general setting, they do not guarantee unique identification of the interaction terms for any value of $$m$$. A special case of (1.1)—where $$\phi_p$$'s are linear and each $$\phi_{(l,l^{\prime})}$$ is of the form $$x_l x_{{l^{\prime}}}$$—has been studied considerably. Within this setting, there exist algorithms that recover $${\mathscr{S}_1},{\mathscr{S}_2}$$, along with convergence rates for estimating $$f$$ in the limit of large $$n$$ [3, 8, 42]. There also exist non-asymptotic sampling bounds for identifying the interaction terms in the noiseless setting (cf. [24,37]). However, finite sample bounds for the nonlinear model (1.1) are not known in general.

**Other low-dimensional function models.**
There exist results for other, more general classes of intrinsically low-dimensional functions that we now mention, starting with the approximation theoretic setting. DeVore et al. [15] consider functions depending on an unknown subset $${\mathscr{S}}$$ of the variables with $${|{{{\mathscr{S}}}}|} = {k} \ll d$$. The functions do not necessarily possess an additive structure, so the function class is more general than (1.1). They provide algorithms that recover $${\mathscr{S}}$$ exactly w.h.p., with $$O(c^k {k} \log d)$$ noiseless queries of $$f$$, for some constant $$c > 0$$. Schnass & Vybiral [46] derived a simpler algorithm for this problem in the noiseless setting. This function class was also studied by Comminges & Dalalyan [11,12] in the non-parametric regression setting wherein they analyzed an estimator that identifies $${\mathscr{S}}$$ w.h.p., with $$O(c^k {k} \log d)$$ samples of $$f$$. Fornasier et al. [18] and Tyagi & Cevher [53] considered a generalization of the above function class where $$f$$ is now of the form $$f(\mathbf x) = g({\mathbf{A}}\mathbf x)$$, for unknown $${\mathbf{A}} \in {\mathbb{R}}^{k \times d}$$. They derived algorithms that approximately recover the row span of $${\mathbf{A}}$$, with sample complexities typically polynomial in $${k}, {d}$$. Although the above methods could possibly recover the underlying support $${\mathscr{S}}$$ for the SPAM (1.1), their sample complexities are either exponential in $${k}$$ [11, 12, 15] or polynomial in $${d}$$ [18,53]. As explained in Section 8, the algorithm of Schnass & Vybiral [46] would recover $${\mathscr{S}}$$ w.h.p., with $$O({\rho_{m}}^4 {k} (\log {d})^2)$$ noiseless queries, with potentially large constants (depending on smoothness of $$f$$) within the $$O(\cdot)$$ term. Moreover, we note that the aforementioned methods are not designed for identifying interactions among the variables. 
### 1.3 Overview of methods used

We now describe the main underlying ideas behind the algorithms described in this article, for identifying $${\mathscr{S}_1},{\mathscr{S}_2}$$. On a top level, our methods are based on two simple observations for the model (1.1), namely that for any $$\mathbf x \in {\mathbb{R}}^d$$:

- The gradient $${\nabla} f(\mathbf x) \in {\mathbb{R}}^d$$ is $${k}$$-sparse.
- The Hessian $${\nabla^2} f(\mathbf x) \in {\mathbb{R}}^{d \times d}$$ is at most $${k} ({\rho_{m}} + 1)$$-sparse. In particular, it has $${k}$$ non-zero rows, with each such row having at most $${\rho_{m}} + 1$$ non-zero entries.

For the special case of no overlap, i.e. $${\rho_{m}} = 1$$, we proceed in two phases. In the first phase—outlined as Algorithm 1—we identify all variables in $${\mathscr{S}}$$ by estimating $${\nabla} f(\mathbf x)$$ via $$\ell_1$$ minimization, for each $$\mathbf x$$ lying within a carefully constructed finite set $${\chi} \subset {\mathbb{R}}^d$$. The set $${\chi}$$ in particular is constructed so that it provides a uniform discretization of all possible two-dimensional canonical subspaces in $${\mathbb{R}}^d$$. In the second phase—outlined as Algorithm 2—we identify the sets $${\mathscr{S}_1},{\mathscr{S}_2}$$ via a simple (deterministic) binary search-based procedure, over the rows of the corresponding $${k} \times {k}$$ sub-matrix of the Hessian of $$f$$. For the general case, however, where $${\rho_{m}} \geq 1$$, the above scheme does not guarantee identification of $${\mathscr{S}}$$; see discussion at the beginning of Section 4.1. Therefore, we consider a different ‘two-phase’ approach, where in the first phase we query $$f$$ with the goal of identifying the set of interactions $${\mathscr{S}_2}$$. This in fact entails estimating the sparse Hessian $${\nabla^2} f(\mathbf x)$$, at each $$\mathbf x$$ lying within $${\chi}$$. We propose two different methods for estimating $${\nabla^2} f(\mathbf x)$$, utilizing tools from CS. 
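To make the two sparsity observations above concrete, the following sketch checks them numerically with finite differences on a toy function of the form (1.1). The function, dimension and support are illustrative assumptions of ours, and the brute-force differencing here is only a sanity check, not the paper's query-efficient scheme:

```python
import numpy as np

d, h = 20, 1e-4
# toy support: S_1 = {0}, S_2 = {(2, 7)}, so k = 3 active variables, rho_m = 1
def f(x):
    return np.sin(x[0]) + x[2] * np.cos(x[7])

x = np.full(d, 0.5)   # an arbitrary query point in [-1,1]^d

def grad(f, x):
    """Central finite-difference gradient (2d queries of f)."""
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hess_entry(f, x, i, j):
    """Central finite-difference estimate of the (i,j) Hessian entry."""
    ei = np.zeros(d); ei[i] = h
    ej = np.zeros(d); ej[j] = h
    return (f(x + ei + ej) - f(x + ei - ej)
            - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)

support = set(np.flatnonzero(np.abs(grad(f, x)) > 1e-6))
print(sorted(support))          # [0, 2, 7]: the gradient is k-sparse
print(hess_entry(f, x, 2, 7))   # close to -sin(0.5): the interacting pair
print(hess_entry(f, x, 0, 7))   # close to 0: no (0,7) interaction
```

The gradient's support is exactly the active set, and the only non-zero off-diagonal Hessian entry sits at the interacting pair, which is what both phases of the approach exploit.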
The first method is a part of Algorithm 3 where we estimate each row of $${\nabla^2} f (\mathbf x)$$ separately, via a ‘difference of gradients’ approach. This is motivated by the following identity, based on the Taylor expansion of $${\nabla} f$$ at $$\mathbf x$$, for suitable $${\mathbf v^{\prime}} \in {\mathbb{R}}^d$$, $${\mu_1} > 0$$: $$\frac{{\nabla} f(\mathbf x + {\mu_1}{\mathbf v^{\prime}}) - {\nabla} f(\mathbf x)}{{\mu_1}} = {\nabla^2} f(\mathbf x) {\mathbf v^{\prime}} + O({\mu_1}).$$ (1.2) We can see from (1.2) that a difference of gradient vectors corresponds to obtaining a perturbed linear measurement of each $$({\rho_{m}} + 1)$$-sparse row of $${\nabla^2} f(\mathbf x)$$. CS theory tells us that by collecting $$O({\rho_{m}} \log d)$$ such ‘gradient differences’—each difference term corresponding to a random choice of $${\mathbf v^{\prime}}$$ from a suitable distribution—we can estimate each row of $${\nabla^2} f (\mathbf x)$$ via $$\ell_1$$ minimization. Since $${\nabla} f$$ is $${k}$$-sparse, it can also be estimated via $$O({k} \log d)$$ queries of $$f$$—this yields an estimate of $${\nabla^2} f (\mathbf x)$$ with $$O({k} {\rho_{m}} (\log d)^2)$$ queries of $$f$$ in total. The second method is a part of Algorithm 4 where we estimate all entries of $${\nabla^2} f(\mathbf x)$$ in ‘one go’. This is motivated by the following identity, based on the Taylor expansion of $$f$$ at $$\mathbf x$$, for suitable $${\mathbf v} \in {\mathbb{R}}^d$$, $${\mu} > 0$$: $$\frac{f(\mathbf x + 2{\mu}{\mathbf v}) + f(\mathbf x - 2{\mu}{\mathbf v}) - 2f(\mathbf x)}{4{\mu}^2} = \langle{\mathbf v} {\mathbf v}^{\rm T},{\nabla^2} f(\mathbf x)\rangle + O({\mu}).$$ (1.3) We see from (1.3) that the L.H.S. corresponds to a perturbed linear measurement of the Hessian, with a rank 1 matrix. By leveraging recent results in CS—most notably the work of Chen et al. 
[7]—we recover an estimate of $${\nabla^2} f(\mathbf x)$$ through $$\ell_1$$ minimization, by choosing $${\mathbf v}$$’s randomly from a suitable distribution. As described in detail in Section 5, this requires us to make $$O({k} {\rho_{m}} \log d)$$ queries of $$f$$. Once $${\mathscr{S}_2}$$ is estimated, we estimate $${\mathscr{S}_1}$$ by invoking (a slightly improved version of) the method of Tyagi et al. [54] for learning SPAMs, on the reduced variable set.

**Outline of the article.** The rest of the article is organized as follows. Section 2 contains a formal description of the problem along with notation used. We begin by analyzing the special case of no overlap between the elements of $${\mathscr{S}_2}$$ (i.e. $${\rho_{m}} = 1$$) in Section 3. Section 4 then considers the general setting, where $${\rho_{m}} \geq 1$$. In particular, it describes Algorithm 3 wherein the underlying sparse Hessian of $$f$$ is estimated via a difference of sparse gradients mechanism. Section 5 also handles the general overlap setting, albeit with a different method for estimating the sparse Hessian of $$f$$. Once $${\mathscr{S}_1},{\mathscr{S}_2}$$ are estimated, we describe how the individual components of $$f$$ can be estimated via standard tools from approximation theory and statistics in Section 6. Section 7 contains simulation results on synthetic examples. We provide a detailed discussion of related work in Section 8 and conclude with directions for future work in Section 9. All proofs are deferred to the appendix.

## 2. Notation and problem setup

**Notation.** Scalars are mostly denoted by plain letters (e.g. $${k_1}$$, $${k_2}$$ and $${d}$$), vectors by lowercase boldface letters (e.g., $${\mathbf x}$$) or by lowercase Greek letters (e.g. $$\zeta$$), matrices by uppercase boldface letters (e.g. $${{\mathbf{A}}}$$) and sets by uppercase calligraphic letters (e.g. $$\mathscr{S}$$), with the exception of $$[{d}]$$ which denotes the index set $${\left\{{{1, \ldots, {d}}}\right\}}$$. 
Given a set $$\mathscr{S} \subseteq [{d}]$$, we denote its complement by $$\mathscr{S}^c := [{d}] \setminus \mathscr{S}$$ and for vector $$\mathbf x = (x_1,\dots,x_{{d}}) \in {\mathbb{R}}^{{d}}$$, $$(\mathbf x)_{\mathscr{S}}$$ denotes the restriction of $$\mathbf x$$ onto $$\mathscr{S}$$, i.e. $$((\mathbf x)_{\mathscr{S}})_l = x_l$$ if $$l \in \mathscr{S}$$ and $$0$$ otherwise. We use $${|{{\mathscr{S}}}|}$$ to denote the cardinality of a set $$\mathscr{S}$$. The $$\ell_p$$ norm of a vector $$\mathbf x \in {\mathbb{R}}^{{d}}$$ is defined as $$\parallel{\mathbf x}\parallel_p := \left ( \sum_{l=1}^{d} {|{{x_l}}|}^p \right )^{1/p}$$. Let $$g$$ be a function of $$n$$ variables, $$g(x_1,\dots,x_n)$$. $${\mathbb{E}}_p[g]$$, $${\mathbb{E}}_{(l,l^{\prime})}[g]$$ denote expectation w.r.t. uniform distributions over $$x_p$$ and $$(x_l, x_{l^{\prime}})$$, respectively. $${\mathbb{E}}[g]$$ denotes expectation w.r.t. uniform distribution over $$(x_1,\dots,x_n)$$. For any compact $${\it{\Omega}} \subset {\mathbb{R}}^n$$, we denote by $$\parallel{g}\parallel_{{L_{\infty}} ({\it{\Omega}})}$$, the $${L_{\infty}}$$ norm of $$g$$ in $${\it{\Omega}}$$. The partial derivative operator $$\frac{\partial}{\partial x_i}$$ is denoted by $$\partial_i$$, for $$i=1,\dots,n$$. So, for instance, $$\frac{\partial^3 g}{\partial x_1^2 \partial x_2}$$ will be denoted by $$\partial_1^2 \partial_2 g$$. We are interested in the problem of approximating functions $$f:{\mathbb{R}}^d \rightarrow {\mathbb{R}}$$ from point queries. For some unknown sets $${\mathscr{S}_1} \subset [{d}], {\mathscr{S}_2} \subset {[{d}] \choose 2}$$, the function $$f$$ is assumed to have the following form. $$f(x_1,\dots,x_d) = \sum_{p \in {\mathscr{S}_1}}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}}\phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}}).$$ (2.1) Hence, $$f$$ is considered to be a sum of a sparse number of univariate and bivariate functions, denoted by $$\phi_p$$ and $$\phi_{(l,l^{\prime})}$$, respectively. 
Here, $$\phi_{(l,l^{\prime})}$$ is considered to be ‘truly bivariate’, meaning that $$\partial_l \partial_{{l^{\prime}}} \phi_{(l,l^{\prime})} \not\equiv 0$$. The set of coordinate variables that are in $${\mathscr{S}_2}$$ is denoted by $${{\mathscr{S}}_2^{\text{var}}} := {\left\{{{l \in [{d}]: \exists l^{\prime} \in [d] \ \text{s.t.} \ (l,l^{\prime}) \in {\mathscr{S}_2} \ \text{or} \ (l^{\prime},l) \in {\mathscr{S}_2}}}\right\}}.$$ (2.2) For each $$l \in {\mathscr{S}_2^{\text{var}}}$$, we refer to the number of occurrences of $$l$$ in $${\mathscr{S}_2}$$, as the degree of $$l$$, formally denoted as follows. $${\rho}(l) := {|{{{\left\{{{l^{\prime} \in {{\mathscr{S}}_2^{\text{var}}} : {(l,l^{\prime})} \in {{\mathscr{S}}_2} \ \text{or} \ {(l^{\prime},l)} \in {{\mathscr{S}}_2}}}\right\}}}}|} \quad l \in {{\mathscr{S}}_2^{\text{var}}}.$$ (2.3)

**Model uniqueness.** We first note that representation (2.1) is not unique. First, we could add constants to each $$\phi_l, \phi_{(l,l^{\prime})}$$, which sum up to zero. Furthermore, for each $$l \in {\mathscr{S}_2^{\text{var}}}$$ with $${\rho}(l) > 1,$$ we could add univariates that sum to zero. The same holds for each $$l \in {\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}}$$ with $${\rho}(l) = 1$$. These ambiguities are avoided by re-writing (2.1) uniquely in the following ANOVA form. $$f(x_1,\dots,x_d) = c + \sum_{p \in {\mathscr{S}_1}}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}} \phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}}) + \sum_{q \in {\mathscr{S}_2^{\text{var}}}: {\rho}(q) > 1} \phi_{q} (x_q) \quad {\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}} = \emptyset.$$ (2.4) Here, $$c = {\mathbb{E}}[f]$$ and $${\mathbb{E}}_p[\phi_p] = {\mathbb{E}}_{(l,l^{\prime})}[\phi_{(l,l^{\prime})}] = 0$$, $$\forall p \in {\mathscr{S}_1}, (l,l^{\prime}) \in {\mathscr{S}_2}$$, with expectations being over uniform distributions w.r.t. variable range $$[-1,1]$$. 
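As a quick numerical sanity check of the centering in (2.4), the sketch below takes a hypothetical uncentered univariate component $$\phi(x) = x^2$$ (our own illustrative choice) and subtracts its mean under the uniform measure on $$[-1,1]$$, approximated by a fine grid average; the subtracted mean is precisely what gets absorbed into the constant $$c$$:

```python
import numpy as np

# Expectation w.r.t. the uniform distribution on [-1,1], approximated by a
# grid average (a simple stand-in for exact integration).
t = np.linspace(-1, 1, 200001)

phi = t**2                    # a raw, uncentered univariate component
c = phi.mean()                # E[phi] under uniform on [-1,1]; equals 1/3
phi_centered = phi - c        # the zero-mean component appearing in (2.4)

print(round(c, 4))                       # ~ 0.3333
print(round(abs(phi_centered.mean()), 8))   # 0.0: centered as required
```

The same centering applies coordinate-wise to the bivariate components, whose marginal means are pushed into the univariate terms and the constant.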
In addition, certain bivariate components have zero marginal mean with respect to either $$l$$ or $${l^{\prime}}$$. In particular, $${\mathbb{E}}_{l}[\phi_{(l,l^{\prime})}] = 0$$ if $${\rho}({l^{\prime}}) > 1$$ and $${\mathbb{E}}_{{l^{\prime}}}[\phi_{(l,l^{\prime})}] = 0$$ if $${\rho}(l) > 1$$. The univariate $$\phi_q$$ corresponding to $$q \in {\mathscr{S}_2^{\text{var}}}$$ with $${\rho}(q) > 1$$ represents the net marginal effect of the variable and has $${\mathbb{E}}_q[\phi_q] = 0$$. We note that $${\mathscr{S}_1}, {\mathscr{S}_2^{\text{var}}}$$ are disjoint in (2.4). This is due to the fact that each $$p \in {\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}}$$ with $${\rho}(p) = 1$$ can be merged into the corresponding bivariate component, while each $$p \in {\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}}$$ with $${\rho}(p) > 1$$ can be merged into its net marginal univariate component. The uniqueness of (2.4) is shown formally in the appendix. We assume the setting $${|{{{\mathscr{S}_1}}}|} = {k_1} \ll {d}$$, $${|{{{\mathscr{S}_2}}}|} = {k_2} \ll d^2$$. Clearly, $${|{{{\mathscr{S}_2^{\text{var}}}}}|} \leq 2{k_2}$$ with equality iff elements in $${\mathscr{S}_2}$$ are pairwise disjoint. The set of all active variables, i.e. $${\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$, will be denoted by $${\mathscr{S}}$$. We then define $${k}: = {|{{{\mathscr{S}}}}|} = {k_1} + {|{{{\mathscr{S}_2^{\text{var}}}}}|}$$ to be the total sparsity of the problem. The largest degree of a variable in $${\mathscr{S}_2^{\text{var}}}$$ is defined to be $${\rho_{m}} := \max_{l \in {\mathscr{S}_2^{\text{var}}}} {\rho}(l)$$. Clearly, $$1 \leq {\rho_{m}} \leq {k_2}$$.

**Goals.** Assuming that we have the freedom to query $$f$$ within its domain, our goal is now twofold. First, we would like to exactly recover the unknown sets $${\mathscr{S}_1},{\mathscr{S}_2}$$. 
Second, we would like to estimate $$c$$ as well as each: (i) $$\phi_p, p \in {\mathscr{S}_1}$$, (ii) $$\phi_{(l,l^{\prime})}, (l,l^{\prime}) \in {\mathscr{S}_2}$$ and (iii) $$\phi_{q}, q \in {\mathscr{S}_2^{\text{var}}}, {\rho}(q) > 1$$, in (2.4). In particular, we would like to estimate the univariate and bivariate components within compact domains $$[-1,1]$$, $$[-1,1]^2$$, respectively. If $${\mathscr{S}_1}, {\mathscr{S}_2}$$ were known beforehand, then one can estimate $$f$$ via standard results from approximation theory or non-parametric regression. Hence, our primary focus in the article is to recover $${\mathscr{S}_1}, {\mathscr{S}_2}$$. Our main assumptions for this problem are listed below.

**Assumption 1.** We assume that $$f$$ can be queried from the slight enlargement: $$[-(1+r),(1+r)]^d$$ of $$[-1,1]^d$$ for some small $$r > 0$$. As will be seen later, the enlargement $$r$$ can be made arbitrarily close to $$0$$.

**Assumption 2.** We assume each $$\phi_{(l,l^{\prime})},\phi_p$$ to be three times continuously differentiable, within $$[-(1+r),(1+r)]^2$$ and $$[-(1+r),(1+r)]$$, respectively. Since these domains are compact, there then exist constants $${B}_m \geq 0$$, $$m=0,1,2,3$$, so that \begin{align} \parallel{\partial_l^{m_1} \partial_{l^{\prime}}^{m_2} \phi_{(l,l^{\prime})}}\parallel_{{L_{\infty}}[-(1+r),(1+r)]^2} \leq {B}_m \quad (l,l^{\prime}) \in {\mathscr{S}_2}, \ m_1 + m_2 = m, \\ \end{align} (2.5) \begin{align} \parallel{\partial_p^{m} \phi_{p}}\parallel_{{L_{\infty}}[-(1+r),(1+r)]} \leq {B}_m \quad p \in {\mathscr{S}_1} \ \text{or}, \ p \in {\mathscr{S}_2^{\text{var}}} \ \& \ {\rho}(p) > 1. \end{align} (2.6) Our next assumption is for the purpose of identification of active variables, i.e. the elements of $${\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$. 
**Assumption 3.** For some constants $${D}_1, {\lambda}_1 > 0$$, we assume that for each $$(l,l^{\prime}) \in {\mathscr{S}_2}$$, $$\exists$$ connected $${\mathscr{I}}_{l,1}, {\mathscr{I}}_{l^{\prime},1}, {\mathscr{I}}_{l,2}, {\mathscr{I}}_{l^{\prime},2} \subset [-1,1]$$, each of Lebesgue measure at least $${\lambda}_1 > 0$$, so that \begin{align} {|{{\partial_l \phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}})}}|} &> {D}_1 \quad \forall (x_{l},x_{l^{\prime}}) \in {\mathscr{I}}_{l,1} \times {\mathscr{I}}_{l^{\prime},1}, \\ \end{align} (2.7) \begin{align} {|{{\partial_{l^{\prime}} \phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}})}}|} &> {D}_1 \quad \forall (x_{l},x_{l^{\prime}}) \in {\mathscr{I}}_{l,2} \times {\mathscr{I}}_{l^{\prime},2}. \end{align} (2.8) Similarly, we assume that for each $$p \in {\mathscr{S}_1}$$, $$\exists$$ connected $${\mathscr{I}}_p \subset [-1,1]$$, of Lebesgue measure at least $${\lambda}_1 > 0$$, such that $${|{{\partial_p \phi_p(x_p)}}|} > {D}_1$$, $$\forall x_p \in {\mathscr{I}}_p$$. These assumptions essentially serve to distinguish an active variable from a non-active one and are also in a sense necessary. For instance, if say $$\partial_l \phi_{(l,l^{\prime})}$$ was zero throughout $$[-1,1]^2$$, then it equivalently means that $$\phi_{(l,l^{\prime})}$$ is only a function of $$x_{l^{\prime}}$$. If $$\partial_l \phi_{(l,l^{\prime})} = \partial_{l^{\prime}} \phi_{(l,l^{\prime})} = 0$$ in $$[-1,1]^2$$, then $$\phi_{(l,l^{\prime})} \equiv 0$$ in $$[-1,1]^2$$. The same reasoning applies for $$\phi_p$$’s. Our last assumption concerns the identification of $${\mathscr{S}_2}$$. 
**Assumption 4.** For some constants $${D}_2, {\lambda}_2 > 0$$, we assume that for each $$(l,l^{\prime}) \in {\mathscr{S}_2}$$, $$\exists$$ connected $${\mathscr{I}}_{l}, {\mathscr{I}}_{l^{\prime}} \subset [-1,1]$$, each interval of Lebesgue measure at least $${\lambda}_2 > 0$$, such that $${|{{\partial_l \partial_{l^{\prime}} \phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}})}}|} > {D}_2, \quad \forall (x_{l},x_{l^{\prime}}) \in {\mathscr{I}}_{l} \times {\mathscr{I}}_{l^{\prime}}$$. Our problem-specific parameters are (i) $${B}_i$$, $$i=0,\dots,3$$, (ii) $${D}_j,{\lambda}_j$$, $$j=1,2$$ and (iii) $${k}, {\rho_{m}}$$. We do not assume $${k_1}, {k_2}$$ to be known but instead assume that $${k}$$ is known. Furthermore, it suffices to use estimates for the problem parameters instead of exact values. In particular, we can use upper bounds for: $${k}, {\rho_{m}}$$, $${B}_i$$, $$i=0,\dots,3$$ and lower bounds for $${D}_j,{\lambda}_j$$, $$j=1,2$$.

**Underlying interaction graph.** One might intuitively guess that the underlying ‘structure’ of interactions between the elements in $${\mathscr{S}_2^{\text{var}}}$$ shapes the difficulty of the problem. More formally, consider the graph $$G = (V,E)$$ where $$V = [{d}]$$ and $$E = {\mathscr{S}_2} \subset {V \choose 2}$$ denote the set of vertices and edges, respectively. We refer to the induced subgraph $$I_G = ({\mathscr{S}_2^{\text{var}}}, {\mathscr{S}_2})$$ of $$G$$, as the interaction graph. We consider not only the general setting—where no assumption is made on $$I_G$$—but also a special case, where $$I_G$$ is a perfect matching. This is illustrated in Fig. 1. In Fig. 1a, $$I_G$$ is a perfect matching, meaning that each vertex is of degree 1. In other words, there is no overlap between the elements of $${\mathscr{S}_2}$$. In terms of the difficulty of interactions, this corresponds to the easiest setting. Fig. 1b corresponds to the general setting where no structural assumption is placed on $$I_G$$. 
Therefore, we can now potentially have overlaps between the elements of $${\mathscr{S}_2}$$, since each element in $${\mathscr{S}_2^{\text{var}}}$$ can be paired with up to $${\rho_{m}}$$ other elements. This corresponds to the hardest setting as far as the difficulty of interactions is concerned. Fig. 1 Isolated disks represent elements of $$\mathscr{S}_1$$, while disks that are part of an edge are elements of $$\mathscr{S}_2^{\text{var}}$$. Circles denote elements of $$[{d}] \setminus \{{\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}} \}$$. On the left, we have the special setting, where $$I_G$$ is a perfect matching. On the right, we have the most general setting where no assumption is made on $$I_G$$. 3. Sampling scheme for the non-overlap case In this section, we consider the special case where all elements in $${\mathscr{S}_2}$$ are pair-wise disjoint. In other words, $${\rho}(i) = 1$$, for each $$i \in {\mathscr{S}_2^{\text{var}}}$$. We first treat the noiseless setting in Section 3.1, wherein the exact function values are obtained at each query. We then handle the noisy setting in Section 3.2, where the function values are corrupted with external noise. 3.1 Analysis for noiseless setting Our approach essentially consists of two phases. In the first phase, we sample the function $$f$$ appropriately and recover the complete set of active variables $${\mathscr{S}}$$. In the second phase, we focus on the reduced $${k}$$-dimensional subspace corresponding to $${\mathscr{S}}$$.
We sample $$f$$ at appropriate points in this subspace and consequently identify $${\mathscr{S}_1}$$ and $${\mathscr{S}_2}$$. Let us now elaborate on these two phases in more detail. 3.1.1 First phase: recovering all active variables The crux of this phase is based on the following observation. On account of the structure of $$f$$, we see that at any $$\mathbf x \in {\mathbb{R}}^d$$, the gradient $${\nabla} f(\mathbf x) \in {\mathbb{R}}^{{d}}$$ has the following form: \begin{equation*} ({\nabla} f(\mathbf x))_q = \left\{ \begin{array}{@{}rl} \partial_q \phi_q(x_q) \quad , & q \in {\mathscr{S}_1} \\ \partial_q \phi_{(q,q^{\prime})} (x_{q},x_{q^{\prime}}) \quad , & (q,q^{\prime}) \in {\mathscr{S}_2}\\ \partial_q \phi_{(q^{\prime},q)} (x_{q^{\prime}},x_{q}) \quad , & (q^{\prime},q) \in {\mathscr{S}_2}\\ 0 \quad , & \text{otherwise} \end{array} \right. \quad q=1,\dots,{d}. \end{equation*} Hence, $${\nabla} f(\mathbf x)$$ is at most $${k}$$-sparse, i.e. has at most $$k$$ non-zero entries, for any $$\mathbf x$$. Note that the $$q{\text{th}}$$ component of $${\nabla} f(\mathbf x)$$ is zero if $$q \notin {\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$. Say we somehow recover $${\nabla} f(\mathbf x)$$ at sufficiently many $$\mathbf x$$'s within $$[-1,1]^{{d}}$$. Then, we would also have suitably many samples of the functions $$\partial_p \phi_{p}, \partial_{l} \phi_{(l,l^{\prime})}, \partial_{l^{\prime}} \phi_{(l,l^{\prime})}$$, for all $$p \in {\mathscr{S}_1}, (l,l^{\prime}) \in {\mathscr{S}_2}$$. Specifically, if the number of samples is large enough, then we would have sampled each of $$\partial_p \phi_p, \partial_{l} \phi_{(l,l^{\prime})}, \partial_{l^{\prime}} \phi_{(l,l^{\prime})}$$ within their respective 'critical intervals', as defined in Assumption 3.
Provided that the estimation noise is sufficiently small, this suggests that we should then, via a threshold operation, be able to detect all variables in $${\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$. We now proceed to formalize our above discussion in a systematic manner. Compressive sensing formulation. We begin by discussing how a sparse gradient $${\nabla} f$$ can be estimated at any point $$\mathbf x$$, via CS. Since $$f$$ is $${\mathscr{C}}^3$$ smooth, the Taylor expansion of $$f$$ at $$\mathbf x$$, along $${\mathbf v},-{\mathbf v} \in {\mathbb{R}}^{{d}}$$, with step size $${\mu} > 0$$, and $$\zeta = \mathbf x + \theta {\mathbf v}$$, $$\zeta^{\prime} = \mathbf x - \theta^{\prime} {\mathbf v}$$, $$\theta,\theta^{\prime} \in (0,{\mu})$$, gives us: \begin{align} f(\mathbf x + {\mu}{\mathbf v}) &= f(\mathbf x) + {\mu} \langle{\mathbf v},{\nabla} f(\mathbf x)\rangle + \frac{1}{2}{\mu}^2 {\mathbf v}^{\rm T} {\nabla^2} {f}(\mathbf x) {\mathbf v} + {R}_3(\zeta), \\ \end{align} (3.1) \begin{align} f(\mathbf x - {\mu}{\mathbf v}) &= f(\mathbf x) + {\mu} \langle -{\mathbf v},{\nabla} f(\mathbf x)\rangle + \frac{1}{2}{\mu}^2 {\mathbf v}^{\rm T} {\nabla^2} {f}(\mathbf x) {\mathbf v} + {R}_3(\zeta^{\prime}). \end{align} (3.2) Subtracting (3.2) from (3.1) and dividing by $$2{\mu}$$ leads to the standard 'central difference' estimate of $$\langle{\mathbf v},{\nabla} f(\mathbf x)\rangle$$: \begin{align} \frac{f(\mathbf x + {\mu}{\mathbf v}) - f(\mathbf x - {\mu}{\mathbf v})}{2{\mu}} = \langle{\mathbf v},{\nabla} f(\mathbf x)\rangle + \underbrace{\frac{{R}_3(\zeta) - {R}_3(\zeta^{\prime})}{2{\mu}}}_{O({\mu}^2)}. \end{align} (3.3) Notice that in (3.3), the expression on the left-hand side corresponds to a noisy linear measurement of $${\nabla} f(\mathbf x)$$, with measurement vector $${\mathbf v}$$. The 'noise' here arises on account of the third-order terms $${R}_3(\zeta),{R}_3(\zeta^{\prime}) = O({\mu}^3)$$ in the Taylor expansion.
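The $$O({\mu}^2)$$ behaviour of the perturbation in (3.3) is easy to check numerically. The following sketch (with a hypothetical $${\mathscr{C}}^3$$-smooth test function in $$d = 4$$, not taken from the paper) compares the central-difference estimate with the exact value of $$\langle{\mathbf v},{\nabla} f(\mathbf x)\rangle$$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

def f(x):
    # hypothetical smooth test function: phi_1(x_1) + phi_{(2,3)}(x_2, x_3)
    return np.sin(x[0]) + x[1] ** 2 * np.cos(x[2])

def grad_f(x):
    # exact (sparse) gradient of the test function above
    g = np.zeros(d)
    g[0] = np.cos(x[0])
    g[1] = 2 * x[1] * np.cos(x[2])
    g[2] = -x[1] ** 2 * np.sin(x[2])
    return g

x = rng.uniform(-1, 1, size=d)
v = rng.choice([-0.5, 0.5], size=d)  # entries +/- 1/sqrt(m_v) with m_v = 4

for mu in [1e-1, 1e-2, 1e-3]:
    est = (f(x + mu * v) - f(x - mu * v)) / (2 * mu)   # left-hand side of (3.3)
    print(f"mu = {mu:.0e}   |error| = {abs(est - v @ grad_f(x)):.2e}")
```

Shrinking $${\mu}$$ by a factor of $$10$$ should shrink the error by roughly a factor of $$100$$, consistent with the $$O({\mu}^2)$$ perturbation.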
Now let the $${\mathbf v}$$’s be chosen from the set: \begin{align} {\mathscr{V}} &:= {\left\{{{v_j \in {\mathbb{R}}^{{d}} : v_{j,q} = \pm\frac{1}{\sqrt{{m_{v}}}} \ \text{w.p.} \ 1/2 \ \text{each}; \ j=1,\dots,{m_{v}} \ \text{and} \ q=1,\dots,{{d}}}}\right\}}. \end{align} (3.4) Then, employing (3.3) at each $${\mathbf v}_j \in {\mathscr{V}}$$ gives us the linear system: $$\underbrace{\frac{f(\mathbf x + {\mu}{\mathbf v}_j) - f(\mathbf x - {\mu}{\mathbf v}_j)}{2{\mu}}}_{y_j} = \langle{\mathbf v}_j,{\nabla} f(\mathbf x)\rangle + \underbrace{\frac{{R}_3(\zeta_{j}) - {R}_3(\zeta^{\prime}_{j})}{2{\mu}}}_{{n}_j} \quad j=1,\dots,{m_{v}}.$$ (3.5) Denoting $$\mathbf y = [y_1 \dots y_{{m_{v}}}]$$, $${\mathbf n} = [{n}_1 \dots {n}_{{m_{v}}}]$$ and $${\mathbf{V}} = [{\mathbf v}_1 \dots {\mathbf v}_{{m_{v}}}]^{\rm T} \in {\mathbb{R}}^{{m_{v}} \times {{d}}}$$, we can re-write (3.5) succinctly as follows $$\mathbf y = {\mathbf{V}}{\nabla} f(\mathbf x) + {\mathbf n}.$$ (3.6) As we know $$\mathbf y, {\mathbf{V}}$$, therefore we can estimate the unknown $$k$$-sparse vector $${\nabla} f(\mathbf x)$$ via standard $$\ell_1$$ minimization [6,16]: $${\widehat{{{\nabla}}}} f (\mathbf x) := \underset{\mathbf y = {\mathbf{V}}\mathbf z}{\mathrm{argmin}} \parallel{\mathbf z}\parallel_1.$$ (3.7) Remark 3.1 Estimating sparse gradients via CS was—to the best of our knowledge—first considered by Fornasier et al. [18] for learning functions of the form: $$f(\mathbf x) = g({\mathbf{A}}\mathbf x)$$. It was then also employed by Tyagi et al. [54] for learning SPAMs (without interaction terms). However, [18,54] consider a ‘forward difference’ estimate of $$\langle{\mathbf v},{\nabla} f(\mathbf x)\rangle$$, via $$(f(\mathbf x + {\mu}{\mathbf v}) - f(\mathbf x))/{\mu}$$, resulting in $$O({\mu})$$ perturbation error in (3.3). Remark 3.2 The above sampling mechanism is related to the ‘simultaneous perturbation’ gradient approximation method of [48]. 
Specifically in [48], for a random $${\mathbf v} = (v_1,\dots,v_{{d}})^{\rm T} \in {\mathbb{R}}^{{d}}$$, $${\widehat{{{\nabla}}}} f(\mathbf x)$$ is defined to be: $$\left(\frac{f(\mathbf x + {\mu}{\mathbf v}) - f(\mathbf x - {\mu}{\mathbf v})}{2{\mu} v_1},\dots,\frac{f(\mathbf x + {\mu}{\mathbf v}) - f(\mathbf x - {\mu}{\mathbf v})}{2{\mu} v_{{d}}} \right)^{\rm T}.$$ (3.8) The bias of the above estimate can be shown to be $$O({\mu}^2)$$ for $${\mathscr{C}}^3$$ smooth $$f$$. The following theorem from [18] provides guarantees for stable recovery via $$\ell_1$$ minimization: $$\triangle(\mathbf y) := \underset{\mathbf y = {\mathbf{V}} \mathbf z}{\mathrm{argmin}} \parallel{\mathbf z}\parallel_1$$. While the first part is by now standard (see, for example, [2]), the second result was stated in [18] as a specialization of Theorem 1.2 from [60] to the case of Bernoulli measurement matrices. Theorem 3.1 ([18,60]). Let $${\mathbf{V}}$$ be an $${m_{v}} \times {d}$$ random matrix with all entries being i.i.d. Bernoulli random variables scaled by $$1/\sqrt{{m_{v}}}$$. Then the following results hold. Let $$0 < {\kappa} < 1$$. Then there are two positive constants $$c_1,c_2 > 0$$, such that the matrix $${\mathbf{V}}$$ has the Restricted Isometry Property (RIP) $$(1-{\kappa}) \parallel{\mathbf w}\parallel_2^2 \leq \parallel{{\mathbf{V}}\mathbf w}\parallel_2^2 \leq (1+{\kappa}) \parallel{\mathbf w}\parallel_2^2$$ (3.9) for all $$\mathbf w \in {\mathbb{R}}^{{d}}$$ such that $$|\text{supp}(\mathbf w)| \leq c_2 {m_{v}}/ \log({d}/{m_{v}})$$ with probability at least $$1-\,{\rm e}^{-c_1 {m_{v}}}$$. Let us suppose $${d} > (\log 6)^2 {m_{v}}$$. Then there are positive constants $$C, c_1^{\prime}, c_2^{\prime} > 0$$ such that with probability at least $$1 -\,{\rm e}^{-c_1^{\prime} {m_{v}}} - \,{\rm e}^{-\sqrt{{m_{v}}{d}}}$$ the matrix $${\mathbf{V}}$$ has the following property.
For every $$\mathbf w \in {\mathbb{R}}^{{d}}$$, $${\mathbf n} \in {\mathbb{R}}^{{m_{v}}}$$ and every natural number $$K \leq c_2^{\prime} {m_{v}} / \log({d}/{m_{v}})$$, we have $$\parallel{\triangle({\mathbf{V}}\mathbf w + {\mathbf n}) - \mathbf w}\parallel_2 \leq C \left(K^{-1/2} \sigma_{K}(\mathbf w)_{1} + \max{\left\{{{\parallel{{\mathbf n}}\parallel_2, \sqrt{\log {d}}\parallel{{\mathbf n}}\parallel_{\infty}}}\right\}}\right),$$ (3.10) where \begin{equation*} \sigma_{K}(\mathbf w)_{1} := \inf{\left\{{{\parallel{\mathbf w - \mathbf z}\parallel_1 : |\text{supp}(\mathbf z)| \leq K}}\right\}} \end{equation*} is the error of the best $$K$$-term approximation of $$\mathbf w$$. Remark 3.3 The proof of the second part of Theorem 3.1 requires (3.9) to hold, which is the case in our setting with high probability. Remark 3.4 Since $${m_{v}} \geq K$$ is necessary, note that $$K \leq c_2^{\prime} {m_{v}} / \log(d/{m_{v}})$$ is satisfied if $${m_{v}} > (1 / c_2^{\prime}) K \log(d/K)$$. Also note that $$K \log (d/K) > \log d$$ in the regime $$K \ll d$$. As pointed out by a reviewer, a slight improvement over Theorem 3.1 is given by [19, Theorem 11.10] where the $$\log d$$ term in (3.10) is replaced with $$\log (d/{m_{v}})$$. Estimating sufficiently many gradients. Given the discussion above, the next natural question is how one should choose the points $$\mathbf x$$ at which the gradient $${\nabla} f(\mathbf x)$$ is to be estimated. Note that $$f$$ is composed of the sum of univariate and bivariate functions, residing on mutually orthogonal one- or two-dimensional canonical subspaces of $${\mathbb{R}}^{{d}}$$. Therefore, this suggests that it is sufficient if our set of points—let us call it $${\chi}$$—has the property that it provides a two-dimensional discretization of any canonical two-dimensional subspace of $${\mathbb{R}}^d$$. To construct $${\chi}$$, we will make use of hash functions, or more specifically a family of hash functions, defined as follows.
Definition 3.2 For some $$t \in \mathbb{N}$$ and $$j=1,2,\dots$$, let $$h_j : [{d}] \rightarrow {\left\{{{1,2,\dots,t}}\right\}}$$. We then call the set $${{\mathscr{H}}_{t}^{d}} = {\left\{{{{h}_1,{h}_2,\dots}}\right\}}$$ a $$({d},t)$$-hash family if for any distinct $$i_1,i_2,\dots,i_t \in [{d}]$$, $$\exists$$ $${h} \in {{\mathscr{H}}_{t}^{d}}$$ such that $$h$$ is an injection when restricted to $$i_1,i_2,\dots,i_t$$. Hash functions are common in theoretical computer science and are widely used, for example in the problem of finding juntas [34]. There exists a fairly simple probabilistic method for constructing an $${{\mathscr{H}}_{t}^{d}}$$ of size $$O(t \,{\rm e}^t \log {d})$$ with high probability. The reader is, for instance, referred to Section $$5$$ in [15] where for any constant $$C_1 > 1$$, the probabilistic construction yields $${{\mathscr{H}}_{t}^{d}}$$ of size $${|{{{{\mathscr{H}}_{t}^{d}}}}|} \leq (C_1 + 1)t \,{\rm e}^t \log {d}$$ with probability at least $$1 - {d}^{-C_1 t}$$, in time linear in the output size. We note that the size of $${{\mathscr{H}}_{t}^{d}}$$ is nearly optimal—it is known that the size of any such family is $${\it{\Omega}}(\,{\rm e}^t \log {d} / \sqrt{t})$$ [20, 27, 40]. There also exist efficient deterministic constructions of such hash families, with the size of the family being $$O(t^{O(\log t)} \,{\rm e}^t \log {d})$$ and which take time linear in the output size [36]. For our purposes, we consider the probabilistic construction of the family due to its smaller resulting size. Specifically, we consider the family $${{\mathscr{H}}_{2}^{d}}$$ so that for any distinct $$i,j$$, there exists $$h \in {{\mathscr{H}}_{2}^{d}}$$ s.t. $$h(i) \neq h(j)$$. Let us first define for any $${h} \in {{\mathscr{H}}_{2}^{d}}$$, the vectors $${\mathbf{e}}_1({h}), {\mathbf{e}}_2({h}) \in {\mathbb{R}}^{{d}}$$, where $$({\mathbf{e}}_i({h}))_q := \left\{ \begin{array}{rl} 1 \quad , & h(q) = i, \\ 0 \quad , & {\text{otherwise}} \end{array} \right .
\quad \text{for} \ i=1,2 \ \text{and} \ q = 1,\dots,{d}.$$ (3.11) With $${{\mathscr{H}}_{2}^{d}}$$ at hand, we construct our set $${\chi}$$ using the procedure in [15]. Specifically, for some integer $${m_{x}} > 0$$, we construct for each $$h \in {{\mathscr{H}}_{2}^{d}}$$ the set $${\chi}({h})$$ as: $${\chi}({h}) := {\left\{{{\mathbf x({h}) \in [-1,1]^{{d}}: \mathbf x({h}) = \sum_{i=1}^{2} c_i {\mathbf{e}}_i({h}); c_1,c_2 \in {\left\{{{-1,-\frac{{m_{x}}-1}{{m_{x}}},\dots,\frac{{m_{x}}-1}{{m_{x}}},1}}\right\}}}}\right\}}.$$ (3.12) Note that $${\chi}({h})$$ consists of $$(2{m_{x}}+1)^2$$ points that discretize span$$({\mathbf{e}}_1(h),{\mathbf{e}}_2(h))$$, within $$[-1,1]^{{d}}$$, with a spacing of $$1/{m_{x}}$$ along each $${\mathbf{e}}_i$$. Given this, we obtain the complete set as $${\chi} = \cup_{h \in {{\mathscr{H}}_{2}^{d}}} {\chi}(h)$$ so that $${|{{{\chi}}}|} \leq (2{m_{x}}+1)^2 {|{{{{\mathscr{H}}_{2}^{d}}}}|}$$. Clearly, $${\chi}$$ discretizes any two-dimensional canonical subspace within $$[-1,1]^{{d}}$$. Recovering set of active variables. Our scheme for recovering the set of active variables is outlined formally in the form of Algorithm 1. At each $$\mathbf x \in {\chi}$$, we obtain the estimate $${\widehat{{{\nabla}}}} f(\mathbf x)$$ via $$\ell_1$$ minimization. We then perform a thresholding operation, i.e. set to zero those components of $${\widehat{{{\nabla}}}} f(\mathbf x)$$ whose magnitude is below a certain threshold. All indices corresponding to non-zero components are then identified as active variables. Algorithm 1 Sub-routine for estimating $${{\mathscr{S}}}$$ 1. Construct $$({d},2)$$-hash family $${{\mathscr{H}}_{2}^{d}}$$ and the set $${\mathscr{V}}$$ for suitable $${m_{v}} \in {\mathbb{Z}}^{+}$$. Choose suitable step size $${\mu} > 0$$ and initialize $$\widehat{{\mathscr{S}}} = \emptyset$$. 2. Choose suitable $${m_{x}} \in {\mathbb{Z}}^{+}$$. For each $${h} \in {{\mathscr{H}}_{2}^{d}}$$ do: Create the set $${\chi}({h})$$.
For $$\mathbf x_i \in {\chi}(h)$$; $$i=1,\dots,(2{m_{x}}+1)^2$$ do: (a) Construct $$\mathbf y_i$$ where $$(\mathbf y_i)_j = \frac{f(\mathbf x_i + {\mu} \mathbf v_j)-f(\mathbf x_i - {\mu} \mathbf v_j)}{2{\mu}}$$; $$j=1,\dots,{m_{v}}$$. (b) Set $$\widehat{{{\nabla}}} f(\mathbf x_i) := \underset{{\mathbf y_i = {\mathbf{V}}\mathbf z}}{\operatorname{argmin}} {\parallel{{\mathbf z}}\parallel}_1$$. For suitable $${\tau} > 0$$, update: \begin{equation*} \widehat{{\mathscr{S}}} = \widehat{{\mathscr{S}}} \cup {\left\{{{q \in {\left\{{{1,\dots,d}}\right\}}: {|{{(\widehat{{{\nabla}}} f(\mathbf x_i))_q}}|} > {\tau}}}\right\}}. \end{equation*} The following lemma provides sufficient conditions on the sampling parameters $${m_{x}},{m_{v}},{\mu}$$ and the threshold $${\tau}$$, which guarantee that $${\widehat{{{\mathscr{S}}}}} = {\mathscr{S}}$$ holds. Lemma 3.1 Let $${{\mathscr{H}}_{2}^{d}}$$ be of size $${|{{{{\mathscr{H}}_{2}^{d}}}}|} \leq 2(C_1 + 1)\,{\rm e}^2 \log {d}$$ for some constant $$C_1 > 1$$. Then there exist constants $$c_3^{\prime} \geq 1$$ and $$C, c_1^{\prime} > 0$$ such that for any $${m_{x}}, {m_{v}}, {\mu}$$ satisfying $$c_3^{\prime} {k} \log({d}/{k}) < {m_{v}} < {d}/(\log 6)^2, \quad {m_{x}} \geq {\lambda}_1^{-1} \quad \text{and} \quad {\mu} < \left( \frac{3{D}_1 {m_{v}}}{4 C {B}_3 {k}} \right)^{1/2},$$ (3.13) the choice $${\tau} = \frac{2 C {\mu}^2 {B}_3 {k}}{3{m_{v}}}$$ implies that $$\widehat{\mathscr{S}} = {\mathscr{S}}$$ holds with probability at least $$1 - \,{\rm e}^{-c_1^{\prime} {m_{v}}} - \,{\rm e}^{-\sqrt{{m_{v}} {d}}} - {d}^{-2C_1}$$. Here, $${\lambda}_1, {D}_1, {B}_3 > 0$$ are problem-specific constants defined in Section 2. Query complexity. We estimate $${\nabla} f$$ at $$(2{m_{x}} + 1)^2 {|{{{{\mathscr{H}}_{2}^{d}}}}|}$$ many points. For each such estimate, we query $$f$$ at $$2{m_{v}}$$ points, leading to a total of $$2{m_{v}}(2{m_{x}} + 1)^2 {|{{{{\mathscr{H}}_{2}^{d}}}}|}$$ queries.
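Steps 2(a)–(b) of Algorithm 1 can be sketched as follows, with the $$\ell_1$$ minimization of step (b) cast as a linear program. The test function, dimensions and active set below are hypothetical choices for illustration only:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
d, m_v, mu = 30, 16, 1e-3

def f(x):
    # hypothetical SPAM: phi_3(x_3) + phi_{(7,8)}(x_7, x_8), variables 0-indexed
    return np.sin(x[3]) + x[7] * np.cos(x[8])

# Step (a): central-difference measurements y_j at a query point x.
x = np.full(d, 0.5)
V = rng.choice([-1.0, 1.0], size=(m_v, d)) / np.sqrt(m_v)
y = np.array([(f(x + mu * v) - f(x - mu * v)) / (2 * mu) for v in V])

# Step (b): min ||z||_1 s.t. V z = y, as an LP over (z, t):
# minimize sum(t) subject to z <= t, -z <= t and V z = y.
c = np.concatenate([np.zeros(d), np.ones(d)])
A_ub = np.block([[np.eye(d), -np.eye(d)], [-np.eye(d), -np.eye(d)]])
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * d),
              A_eq=np.hstack([V, np.zeros((m_v, d))]), b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d)
grad_hat = res.x[:d]

# Thresholding as in Algorithm 1 should flag the active variables {3, 7, 8}.
print(np.flatnonzero(np.abs(grad_hat) > 0.05))
```

Here the gradient is $$3$$-sparse while $$m_v = 16 \ll d$$ queries per point suffice; the threshold is chosen below the smallest active partial derivative at this particular $$\mathbf x$$.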
From Lemma 3.1, we then obtain a query complexity of $$O({k} (\log {d})^2 {\lambda}_1^{-2})$$ for exact recovery of the set of active variables, i.e. $${\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$. Computational complexity. The family $${{\mathscr{H}}_{2}^{d}}$$ can be constructed in time polynomial in $$d$$. Step 2 involves solving a linear program in $$O(d)$$ variables, which can be done efficiently up to arbitrary accuracy, in time polynomial in $$({m_{v}},d)$$ (using, for instance, interior point methods; cf. [39]). Since we solve $$O({\lambda}_1^{-2} \log d)$$ such linear programs, the overall computation time is polynomial in the number of queries and the dimension $$d$$. Remark 3.5 It is worth noting that in practice, it might be preferable to replace the $$\ell_1$$ minimization step with a non-convex algorithm such as 'Iterative hard thresholding' (IHT) (cf. [4,5,28–30]). Such methods consider solving the non-convex optimization problem: \begin{equation*} \min_{\mathbf z} \parallel{{\mathbf{V}}\mathbf z - \mathbf y}\parallel_2^2 \quad \text{s.t.} \quad \parallel{\mathbf z}\parallel_0 \leq K \end{equation*} for finding a $$K$$-sparse solution to an under-determined linear system of equations and generally have a lower computational complexity than their convex analogues. Moreover, provided $${\mathbf{V}}$$ also satisfies the RIP (as stated in (3.9)), they then also enjoy strong theoretical guarantees, similar to those for convex approaches. Remark 3.6 Algorithm 1 essentially estimates $${\nabla} f$$ at $$O(\log d)$$ points. The method of Fornasier et al. [18] is designed for a more general function class than ours and hence involves estimating $${\nabla} f$$ on points sampled uniformly at random from the unit sphere $$\mathbb{S}^{d-1}$$—the size of such a set is typically polynomial in $$d$$. The method of Tyagi et al.
[54] is tailored toward SPAMs without interactions; it essentially estimates $${\nabla} f$$ along a uniform one-dimensional grid (hence at constantly many points). Hence, conceptually, Algorithm 1 is a simple generalization of the scheme of Tyagi et al. [54]. 3.1.2 Second phase: recovering individual sets Given that we have recovered $${\mathscr{S}} = {\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$, we now proceed to see how we can recover the individual sets $${\mathscr{S}_1}$$ and $${\mathscr{S}_2}$$. Let us denote, w.l.o.g., $${\mathscr{S}}$$ to be $${\left\{{{1,2,\dots,{k}}}\right\}}$$ and also denote $$g : {\mathbb{R}}^{{k}} \rightarrow {\mathbb{R}}$$ to be $$g(x_1,x_2,\dots,x_{{k}}) = c + \sum_{p \in {\mathscr{S}_1}}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}}\phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}}).$$ (3.14) Here, $${\mathscr{S}_2} \subset {[{k}] \choose 2}$$ with $${\mathscr{S}_2^{\text{var}}} \cap {\mathscr{S}_1} = \emptyset$$. We have reduced our problem to that of querying some unknown $${k}$$-variate function $$g$$, of the form (3.14), with queries $$\mathbf x \in {\mathbb{R}}^{{k}}$$. Indeed, this is equivalent to querying $$f$$ at $$(\mathbf x)_{\mathscr{S}}$$, i.e. the restriction of $$\mathbf x$$ onto $${\mathscr{S}}$$. To identify $${\mathscr{S}_1}$$ and $${\mathscr{S}_2}$$, let us recall the discussion in Assumption 4: for any $$(l,l^{\prime}) \in {\mathscr{S}_2}$$, we will have that $$\exists (x_{l},x_{l^{\prime}}) \in [-1,1]^2$$ such that $$\partial_l \partial_{l^{\prime}} g(\mathbf x) = \partial_l \partial_{l^{\prime}} \phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}}) \neq 0$$. Furthermore, for $$p \in {\mathscr{S}_1}$$ and any $$p^{\prime} \neq p$$, we know that $$\partial_p \partial_{p^{\prime}} g(\mathbf x) \equiv 0$$, $$\forall \mathbf x \in {\mathbb{R}}^{{k}}$$. In light of this, our goal will now be to query $$g$$ to estimate the off-diagonal entries of its Hessian $${\nabla^2} g$$.
This is a natural approach as these entries contain information about the mixed second-order partial derivatives of $$g$$. We now proceed toward motivating our sampling scheme. Motivation behind sampling scheme. At any $$\mathbf x \in {\mathbb{R}}^{{k}}$$, the Hessian $${\nabla^2} g (\mathbf x)$$ is a $${k} \times {k}$$ symmetric matrix with the following structure. \begin{equation*} ({\nabla^2} g(\mathbf x))_{i,j} = \left\{ \begin{array}{@{}rl} \partial^2_i \phi_i(x_i) \quad , & i \in {\mathscr{S}_1}, \ i=j \\ \partial^2_i \phi_{(i,i^{\prime})}(x_i,x_{i^{\prime}}) \quad , & (i,i^{\prime}) \in {\mathscr{S}_2}, \ j = i \\ \partial^2_i \phi_{(i^{\prime},i)}(x_{i^{\prime}},x_i) \quad , & (i^{\prime},i) \in {\mathscr{S}_2}, \ j = i \\ \partial_i \partial_j \phi_{(i,j)}(x_i,x_{j}) \quad , & (i,j) \in {\mathscr{S}_2}\\ \partial_i \partial_j \phi_{(j,i)}(x_j,x_{i}) \quad , & (j,i) \in {\mathscr{S}_2}\\ 0 \quad , & \text{otherwise} \end{array} \right. \end{equation*} Note that each row of $${\nabla^2} g$$ has at most $$2$$ non-zero entries. If $$i \in {\mathscr{S}_1}$$, then the non-zero entry can only be the $$(i,i)$$th entry of $${\nabla^2} g$$. If $$i \in {\mathscr{S}_2^{\text{var}}}$$, then the $$i$$th row can have two non-zero entries. In this case, the non-zero entries will be the $$(i,i)$$th and $$(i,j)$$th entries of $${\nabla^2} g$$, if $$(i,j) \in {\mathscr{S}_2}$$ or $$(j,i) \in {\mathscr{S}_2}$$. Now, for $$\mathbf x,{\mathbf v} \in {\mathbb{R}}^{{k}}$$, $${\mu_1} > 0$$, consider the Taylor expansion of $${\nabla} g$$ at $$\mathbf x$$ along $${\mathbf v}$$, with step size $${\mu_1}$$. 
For $$\zeta_i = \mathbf x + \theta_i {\mathbf v}$$, for some $$\theta_i \in (0,{\mu_1})$$, $$i=1,\dots,{k}$$, we have $$\frac{{\nabla} g(\mathbf x + {\mu_1}{\mathbf v}) - {\nabla} g(\mathbf x)}{{\mu_1}} = {\nabla^2} g(\mathbf x) {\mathbf v} + \frac{{\mu_1}}{2} {\begin{pmatrix} \mathbf v^{\mathrm{T}} {\nabla^2} \partial_1 g(\zeta_1) \mathbf v \\ \vdots \\ \mathbf v^{\mathrm{T}} {\nabla^2} \partial_{{k}} g(\zeta_{{k}}) \mathbf v \end{pmatrix}}.$$ (3.15) Alternatively, we have the following identity for each individual $$\partial_i g$$: $$\frac{\partial_i g(\mathbf x + {\mu_1}{\mathbf v}) - \partial_i g(\mathbf x)}{{\mu_1}} = \langle{\nabla} \partial_i g(\mathbf x),{\mathbf v}\rangle + \frac{{\mu_1}}{2} {\mathbf v}^{\rm T} {\nabla^2} \partial_i g(\zeta_i) {\mathbf v} \quad i = 1,\dots,{k}.$$ (3.16) Say we estimate $$\partial_i g(\mathbf x),\partial_i g(\mathbf x + {\mu_1}{\mathbf v})$$ with $${\widehat{{\partial_i}}} g(\mathbf x),{\widehat{{\partial_i}}} g(\mathbf x + {\mu_1}{\mathbf v})$$, respectively, using finite differences with step size parameter $${\beta} > 0$$. Then we can write $${\widehat{{\partial_i}}} g(\mathbf x) = \partial_i g(\mathbf x) + {\eta}_i(\mathbf x,{\beta}), \quad {\widehat{{\partial_i}}} g(\mathbf x + {\mu_1}{\mathbf v}) = \partial_i g(\mathbf x + {\mu_1}{\mathbf v}) + {\eta}_i(\mathbf x + {\mu_1}{\mathbf v},{\beta})$$ (3.17) with $${\eta}_i(\mathbf x,{\beta}),{\eta}_i(\mathbf x + {\mu_1}{\mathbf v},{\beta}) = O({\beta}^2)$$ being the corresponding estimation errors. Plugging these estimates in (3.16), we finally obtain the following.
$$\frac{{\widehat{{\partial_i}}} g(\mathbf x + {\mu_1}{\mathbf v}) - {\widehat{{\partial_i}}} g(\mathbf x)}{{\mu_1}} = \langle{\nabla} \partial_i g(\mathbf x),{\mathbf v}\rangle + \underbrace{\frac{{\mu_1}}{2} {\mathbf v}^{\rm T} {\nabla^2} \partial_i g(\zeta_i) {\mathbf v} + \frac{{\eta}_i(\mathbf x + {\mu_1}{\mathbf v},{\beta}) - {\eta}_i(\mathbf x,{\beta})}{{\mu_1}}}_{\text{Error term}}.$$ (3.18) We see in (3.18) that the L.H.S. can be viewed as taking a noisy linear measurement of the $$i$$th row of $${\nabla^2} g(\mathbf x)$$ with measurement vector $${\mathbf v}$$. Hence, for any $$i \in {\mathscr{S}}$$, we can via (3.18) hope to recover the $$2$$-sparse vector $${\nabla} \partial_i g(\mathbf x) \in {\mathbb{R}}^{{k}}$$. In fact, we are only interested in estimating the off-diagonal entries of $${\nabla^2} g$$. Therefore, while testing for $$i \in {\mathscr{S}}$$, we can fix the $$i$$th component of $${\mathbf v}$$ to be zero. This means that $${\nabla} \partial_i g$$ can in fact be considered as a $$1$$-sparse vector, and our task is to find the location of its non-zero entry. We now describe our sampling scheme that accomplishes this by performing a binary search over $${\nabla} \partial_i g$$. Sampling scheme. Say that we are currently testing for variable $$i \in {\mathscr{S}}$$, i.e. we would like to determine whether it is in $${\mathscr{S}_1}$$ or $${\mathscr{S}_2^{\text{var}}}$$. Denote $${\mathscr{T}}$$ as the set of variables that have been classified so far. We will first create our set of points $${\chi}_i$$ at which $${\nabla} \partial_i g$$ will be estimated, as follows.
Consider $${\mathbf{e}}_1(i), {\mathbf{e}}_2(i) \in {\mathbb{R}}^{{k}}$$ where for $$j=1,\dots,{k}$$: $$({\mathbf{e}}_1(i))_j := \left\{ \begin{array}{rl} 1 \quad , & j = i, \\ 0 \quad , & \text{otherwise} \end{array} \right ., \quad ({\mathbf{e}}_2(i))_j := \left\{ \begin{array}{rl} 0 \quad , & j = i \ \text{or} \ j \in {\mathscr{T}}, \\ 1 \quad , & \text{otherwise} \end{array}. \right .$$ (3.19) We then form the following set of points which corresponds to a discretization of the two-dimensional space spanned by $${\mathbf{e}}_1(i),{\mathbf{e}}_2(i)$$, within $$[-1,1]^{{k}}$$. $${\chi}_i := {\left\{{{\mathbf x \in [-1,1]^{{k}} : \mathbf x = c_1{\mathbf{e}}_1(i) + c_2{\mathbf{e}}_2(i); c_1,c_2 \in {\left\{{{-1,-\frac{{m^{\prime}_{x}}-1}{{m^{\prime}_{x}}},\dots,\frac{{m^{\prime}_{x}}-1}{{m^{\prime}_{x}}},1}}\right\}}}}\right\}}.$$ (3.20) Now for each $$\mathbf x \in {\chi}_i$$ and suitable step size parameter $${\beta} > 0$$, we will obtain the samples $$g(\mathbf x + {\beta} {\mathbf{e}}_1(i)), g(\mathbf x - {\beta} {\mathbf{e}}_1(i))$$. Then, we obtain via central differences the estimate: $${\widehat{{\partial_i}}} g(\mathbf x) = (g(\mathbf x + {\beta} {\mathbf{e}}_1(i)) - g(\mathbf x - {\beta} {\mathbf{e}}_1(i)))/(2{\beta})$$. For our choice of $${\mathbf v}$$ and parameter $${\mu_1} > 0$$, we can similarly obtain $${\widehat{{\partial_i}}} g(\mathbf x + {\mu_1} {\mathbf v})$$. We now describe how the measurement vectors $${\mathbf v}$$ can be chosen in an adaptive fashion, to identify $${\mathscr{S}_1},{\mathscr{S}_2}$$. First, we create a vector $${\mathbf v}_0(i)$$ that enables us to test, whether there exists a variable $$j \neq i$$ such that $$(i,j) \in {\mathscr{S}_2}$$ (if $$i > j$$) or $$(j,i) \in {\mathscr{S}_2}$$ (if $$j > i$$). To this end, we set $${\mathbf v}_0(i) = {\mathbf{e}}_2(i)$$. 
Clearly, $$i \in {\mathscr{S}_2^{\text{var}}}$$ iff there exists $$\mathbf x \in [-1,1]^{{k}}$$ such that $$\langle{\nabla} \partial_i g(\mathbf x),{\mathbf v}_0(i)\rangle \neq 0$$. This suggests the following strategy. For each $$\mathbf x \in {\chi}_i$$, we compute $$({\widehat{{\partial_i}}} g(\mathbf x + {\mu_1}{\mathbf v}_0(i)) - {\widehat{{\partial_i}}} g(\mathbf x))/({\mu_1})$$—this will be a noisy estimate of $$\langle{\nabla} \partial_i g(\mathbf x),{\mathbf v}_0(i)\rangle$$. Provided that the number of points is large enough and the noise is made suitably small, we see that via a threshold-based procedure as in the previous phase, one would be able to correctly classify the variable $$i$$ as belonging to either $${\mathscr{S}_1}$$ or $${\mathscr{S}_2^{\text{var}}}$$. In case the above procedure classifies $$i$$ as being a part of $${\mathscr{S}_2^{\text{var}}}$$, then we would still need to identify the other variable $$j \in {\mathscr{S}_2^{\text{var}}}$$ forming the pair. This can be handled via a binary search-based procedure, as follows. The measurement vectors $${\mathbf v}_1(i), {\mathbf v}_2(i), \dots$$ are chosen adaptively, meaning that the choice of $${\mathbf v}_j(i)$$ depends on the past choices $${\mathbf v}_1(i), \dots,{\mathbf v}_{j-1}(i)$$. $${\mathbf v}_1(i)$$ is constructed as follows. We construct an equipartition $${{\mathscr{P}}}_1(i), {{\mathscr{P}}}_2(i) \subset {\mathscr{S}} \setminus {\left\{{{{\mathscr{T}} \cup {\left\{{{i}}\right\}}}}\right\}}$$ such that $${{\mathscr{P}}}_1(i) \cup {{\mathscr{P}}}_2(i) = {\mathscr{S}} \setminus {\left\{{{{\mathscr{T}} \cup {\left\{{{i}}\right\}}}}\right\}}$$, $${{\mathscr{P}}}_1(i) \cap {{\mathscr{P}}}_2(i) = \emptyset$$, $${|{{{{\mathscr{P}}}_1(i)}}|} = {\left\lfloor{\frac{{k} - 1 - {|{{{\mathscr{T}}}}|}}{2}}\right\rfloor}$$ and $${|{{{{\mathscr{P}}}_2(i)}}|} = {k} - 1 - {|{{{\mathscr{T}}}}|} - {|{{{{\mathscr{P}}}_1(i)}}|}$$.
Then $${\mathbf v}_1(i)$$ is chosen to be such that: $$({\mathbf v}_1(i))_l := \left\{ \begin{array}{rl} 1 \quad , & l \in {{\mathscr{P}}}_1(i), \\ 0 \quad , & \text{otherwise} \end{array} \right . \quad l = 1,\dots,{k}.$$ (3.21) Let $$\mathbf x^{*} \in {\chi}_i$$ be the point at which $${\mathbf v}_0(i)$$ detects $$i$$. We now compute $$({\widehat{{\partial_i}}} g(\mathbf x^{*} + {\mu_1}{\mathbf v}_1(i)) - {\widehat{{\partial_i}}} g(\mathbf x^{*}))/{\mu_1}$$ and test whether its magnitude exceeds a certain threshold. This tells us whether the other active variable $$j$$ belongs to $${{\mathscr{P}}}_1(i)$$ or to $${{\mathscr{P}}}_2(i)$$. Then, we create $${\mathbf v}_2(i)$$ by partitioning the identified subset in the same manner as for $${\mathbf v}_1(i)$$, and perform the same test again. It is clear that we would need at most $$\lceil{\log ({k}-{|{{{\mathscr{T}}}}|})}\rceil$$ many $${\mathbf v}(i)$$'s in this process. Hence, if $$i \in {\mathscr{S}_2^{\text{var}}}$$, then we would need at most $$\lceil{\log ({k}-{|{{{\mathscr{T}}}}|})}\rceil + 1$$ measurement vectors to find the other member of the pair in $${\mathscr{S}_2^{\text{var}}}$$. In case $$i \in {\mathscr{S}_1}$$, then $${\mathbf v}_0(i)$$ by itself suffices. The above procedure is outlined formally in Algorithm 2. We now provide sufficient conditions on the parameters $${m^{\prime}_{x}} > 0, {\beta}$$ and $${\mu_1} > 0$$, along with a corresponding threshold, that together guarantee recovery of $${\mathscr{S}_1}$$ and $${\mathscr{S}_2}$$. This is stated in the following lemma. Lemma 3.2 Let $${m^{\prime}_{x}} > 0, {\beta}$$ and $${\mu_1} > 0$$ be chosen to satisfy: \begin{align} {m^{\prime}_{x}} \geq {\lambda}_2^{-1}, \quad {\beta} < \frac{\sqrt{3} {D}_2}{4\sqrt{2} {B}_3}, \quad {\mu_1} \in \left(\frac{{D}_2 - \sqrt{{D}_2^2 - (32/3){\beta}^2 {B}_3^2}}{8{B}_3}, \frac{{D}_2 + \sqrt{{D}_2^2 - (32/3){\beta}^2 {B}_3^2}}{8{B}_3} \right).
\end{align} (3.22) Then for the choice $${\tau^{\prime}} = \frac{{\beta}^2 {B}_3}{3{\mu_1}} + 2{\mu_1}{B}_3$$, we have for Algorithm 2 that $${\widehat{{{\mathscr{S}_1}}}} = {\mathscr{S}_1}$$ and $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$. Here, $${B}_3, {D}_2, {\lambda}_2 > 0$$ are problem-specific constants, defined in Section 2. Query complexity. Note that for each $$i \in {\mathscr{S}_1}$$ we make at most $$4 {{m^{\prime}_{x}}}^{2}$$ queries. This is clear from Step 4: four queries are made for estimating the two partial derivatives and this is done at most $${{m^{\prime}_{x}}}^2$$ times. If $$i \in {\mathscr{S}_2^{\text{var}}}$$, then we notice that in Step 9, we make two queries for each $${\mathbf v}(i)$$, leading to at most $$2\lceil{\log {k}}\rceil$$ queries during Steps 8–9. In addition, we still make at most $$4 {{m^{\prime}_{x}}}^{2}$$ queries during Step 4, as discussed earlier. Hence, the total number of queries made is at most: $${k_1} \cdot 4 {{m^{\prime}_{x}}}^{2} + {k_2} \cdot \left(4 {{m^{\prime}_{x}}}^{2} + 2\lceil{\log {k}}\rceil \right) < {k}(4 {{m^{\prime}_{x}}}^{2} + 2\lceil{\log {k}}\rceil).$$ (3.23) Since $${m^{\prime}_{x}} \geq {\lambda}_2^{-1}$$, the query complexity for this phase is $$O({k}({\lambda}_2^{-2} + \log {k}))$$. Computational complexity. It is clear that the overall computation time is linear in the number of queries and hence at most polynomial in $${k}$$. 3.2 Analysis for noisy setting We now analyze the noisy setting where at each query $$\mathbf x$$, we observe $$f(\mathbf x) + {z^{\prime}}$$, where $${z^{\prime}} \in {\mathbb{R}}$$ denotes external noise. To see how this affects Algorithm 1, note that (3.6) now changes to $$\mathbf y = {\mathbf{V}}{\nabla} f(\mathbf x) + {\mathbf n} + {\mathbf{{z}}}$$, where $${z}_{j} = ({z^{\prime}}_{j,1} - {z^{\prime}}_{j,2})/(2{\mu})$$. Therefore, while the Taylor remainder term $${|{{{n}_j}}|} = O({\mu}^2)$$, the external noise term $${|{{{z}_j}}|}$$ scales as $${\mu}^{-1}$$.
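This trade-off between the $$O({\mu}^2)$$ Taylor perturbation and the $${\mu}^{-1}$$ noise amplification can be illustrated numerically. In the sketch below, the univariate function and the noise level $${\varepsilon}$$ are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 1e-6  # hypothetical bound on the external noise |z'|

def f_noisy(x):
    # noisy point query: f(x) + z', with f = sin and |z'| < eps
    return np.sin(x) + rng.uniform(-eps, eps)

x0, true_deriv = 0.3, np.cos(0.3)
for mu in [1e-1, 1e-2, 1e-3, 1e-4]:
    ests = [(f_noisy(x0 + mu) - f_noisy(x0 - mu)) / (2 * mu) for _ in range(50)]
    err = max(abs(e - true_deriv) for e in ests)
    print(f"mu = {mu:.0e}   worst-case error = {err:.1e}")
```

The error first decreases with $${\mu}$$ (the $$O({\mu}^2)$$ Taylor term dominates) and then increases again (the $${\varepsilon}/{\mu}$$ noise term dominates); in this example, the best step size is around $${\mu} \approx 10^{-2}$$, i.e. of order $${\varepsilon}^{1/3}$$.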
Hence, in contrast to Lemma 3.1 the step-size $${\mu}$$ needs to be chosen carefully now—a value which is too small would blow up the external noise component, whereas a large value would increase perturbation due to higher order Taylor’s terms. A similar problem would occur in the next phase when we try to identify $${\mathscr{S}_1}, {\mathscr{S}_2}$$. Indeed, due to the introduction of noise, we now observe $$g(\mathbf x + {\beta} {\mathbf{e}}_1(i)) + {z^{\prime}}_{i,1}$$, $$g(\mathbf x - {\beta} {\mathbf{e}}_1(i)) + {z^{\prime}}_{i,2}$$. This changes the expression for $${\widehat{{\partial_i}}} g(\mathbf x)$$ in (3.17) to: $${\widehat{{\partial_i}}} g(\mathbf x) = \partial_i g(\mathbf x) + {\eta}_i(\mathbf x,{\beta}) + {z}_i(\mathbf x,{\beta})$$ where $${z}_i(\mathbf x,{\beta}) = ({z^{\prime}}_{i,1} - {z^{\prime}}_{i,2})/(2{\beta})$$. Recall that $${\eta}_i(\mathbf x,{\beta}) = O({\beta}^2)$$ corresponds to the Taylor’s remainder term. Hence we again see that in contrast to Lemma 3.2, the step $${\beta}$$ cannot be chosen too small now, as it would blow up the external noise component. Arbitrary bounded noise. In this scenario, we assume the external noise to be arbitrary and bounded, meaning that $${|{{{z^{\prime}}}}|} < {\varepsilon}$$, for some finite $${\varepsilon} \geq 0$$. Clearly, if $${\varepsilon}$$ is too large, then we would expect recovery of $${\mathscr{S}} = {\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$ to be impossible, as the structure of $$f$$ would be destroyed. However, we show that if $${\varepsilon} = O(\frac{{D}_1^{3/2}}{\sqrt{{B}_3 {k}}})$$, then Algorithm 1 recovers the total support $${\mathscr{S}}$$, with appropriate choice of sampling parameters. Furthermore, assuming $${\mathscr{S}}$$ is recovered exactly, and provided $${\varepsilon}$$ additionally satisfies $${\varepsilon} = O(\frac{{D}_2^{3}}{{B}_3^2})$$, then with proper choice of sampling parameters, Algorithm 1 identifies $${\mathscr{S}_1}, {\mathscr{S}_2}$$. 
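The tension between the two error terms can be made concrete with a toy numerical sketch (the constants below are illustrative, not the paper's exact expressions): the total perturbation behaves like $$B_3 \mu^2 + \varepsilon/\mu$$, which blows up both for very small and very large step sizes and is minimized at an interior value.

```python
def error_bound(mu, eps, b3=1.0):
    """Toy model of the total perturbation of a finite-difference estimate:
    Taylor remainder ~ b3 * mu^2 plus noise amplified by 1/mu."""
    return b3 * mu**2 + eps / mu

eps = 1e-3
# setting the derivative 2*b3*mu - eps/mu^2 to zero gives mu* = (eps/2)**(1/3)
mu_star = (eps / 2.0) ** (1.0 / 3.0)
# the bound degrades both for mu much smaller and much larger than mu*
```

This is the reason the admissible step sizes above form an interval bounded away from both zero and infinity.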
This is stated formally in the following Theorem. Theorem 3.3 Let the constants $$c_3^{\prime}, C, c_1^{\prime}, C_1$$ and $${{\mathscr{H}}_{2}^{d}}, {m_{x}}, {m_{v}}$$ be as defined in Lemma 3.1. Say $${\varepsilon} < {\varepsilon}_1 = \frac{{D}_1^{3/2}}{3C\sqrt{4{B}_3 {k} C}}$$. Then for $$\theta_1 = \cos^{-1}(-{\varepsilon} / {\varepsilon}_1)$$, let $${\mu}$$ be chosen to satisfy: $${\mu} \in \left(2\sqrt{\frac{{D}_1{m_{v}}}{4{B}_3 {k}}}\cos(\theta_1/3 - 2\pi/3) , 2\sqrt{\frac{{D}_1{m_{v}}}{4{B}_3 {k}}}\cos(\theta_1/3)\right)$$ (3.24) We then have in Algorithm 1 for the choice: $${\tau} = C\left(\frac{2{\mu}^2 {B}_3 {k}}{3{m_{v}}} + \frac{{\varepsilon}\sqrt{{m_{v}}}}{{\mu}}\right)$$ that $$\widehat{\mathscr{S}} = {\mathscr{S}}$$ holds with probability at least $$1 - \,{\rm e}^{-c_1^{\prime} {m_{v}}} - \,{\rm e}^{-\sqrt{{m_{v}} {d}}} - {d}^{-2C_1}$$. Given that $$\widehat{\mathscr{S}} = {\mathscr{S}}$$, let $${m^{\prime}_{x}}$$ be as defined in Lemma 3.2. Assuming $${\varepsilon} < \frac{{D}_2^3}{384\sqrt{2}{B}_3^2} = {\varepsilon}_2$$ holds, then for $$\theta_2 = \cos^{-1}(-{\varepsilon} / {\varepsilon}_2)$$ let $${\beta}, {\mu_1}$$ be chosen to satisfy: \begin{align} {\mu_1} &\in \left(\frac{{D}_2 - \sqrt{{D}_2^2 - \frac{32}{3{\beta}}{B}_3({\beta}^3{B}_3 + 6{\varepsilon})}}{8{B}_3}, \frac{{D}_2 + \sqrt{{D}_2^2 - \frac{32}{3{\beta}}{B}_3({\beta}^3{B}_3 + 6{\varepsilon})}}{8{B}_3}\right), \\ {\beta} &\in \left(\frac{{D}_2}{2\sqrt{2}{B}_3}\cos(\theta_2/3 - 2\pi/3), \frac{{D}_2}{2\sqrt{2}{B}_3}\cos(\theta_2/3) \right). \end{align} (3.25) Then the choice $${\tau^{\prime}} = \frac{{\beta}^2 {B}_3}{3{\mu_1}} + 2{\mu_1}{B}_3 + \frac{2{\varepsilon}}{{\beta}{\mu_1}}$$ implies in Algorithm 2 that $${\widehat{{{\mathscr{S}_1}}}} = {\mathscr{S}_1}$$ and $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$. Stochastic noise. We now assume that the point queries are corrupted with i.i.d. 
Gaussian noise, so that $${z^{\prime}} \sim {\mathscr{N}}(0,\sigma^2)$$ for $$\sigma^2 < \infty$$. To reduce $$\sigma$$, we consider resampling each point query a sufficient number of times and averaging the values. In Algorithm 1, i.e. during the estimation of $${\mathscr{S}}$$, we resample each query $$N_1$$ times so that $${z^{\prime}} \sim {\mathscr{N}}(0,\sigma^2/N_1)$$. For any $${\varepsilon} > 0$$, if $$N_1$$ is chosen large enough, then we can obtain a uniform bound $${|{{{z^{\prime}}}}|} < {\varepsilon}$$—via standard tail bounds for Gaussians—over all noise samples, with high probability. Consequently, the noise model transforms into a bounded-noise one, which means that by choosing $${\varepsilon} < {\varepsilon}_1$$, we can use the result of Theorem 3.3 for estimating $${\mathscr{S}}$$. Similarly, in Algorithm 2, we resample each query $$N_2$$ times so that now $${z^{\prime}} \sim {\mathscr{N}}(0,\sigma^2/N_2)$$. For any $${\varepsilon^{\prime}} > 0$$ and $$N_2$$ large enough, we can again uniformly bound $${|{{{z^{\prime}}}}|} < {\varepsilon^{\prime}}$$ with high probability. By now choosing $${\varepsilon^{\prime}} < {\varepsilon}_2$$, we can then use the result of Theorem 3.3 for estimating $${\mathscr{S}_1}, {\mathscr{S}_2}$$. These conditions are stated formally in the following Theorem. Theorem 3.4 Let the constants $$c_3^{\prime}, C, c_1^{\prime}, C_1$$ and $${{\mathscr{H}}_{2}^{d}}, {m_{x}}, {m_{v}}$$ be as defined in Lemma 3.1. For any $${\varepsilon} < {\varepsilon}_1 = \frac{{D}_1^{3/2}}{3C\sqrt{4{B}_3 {k} C}}$$, $$0 < p_1 < 1$$, $$\theta_1 = \cos^{-1}(-{\varepsilon} / {\varepsilon}_1)$$, say we resample each query in Algorithm 1, $$N_1 > \frac{\sigma^2}{{\varepsilon}^2} {{\log} (\frac{2}{p_1}{m_{v}}(2{m_{x}}+1)^2{|{{{{\mathscr{H}}_{2}^{d}}}}|})}$$ times, and average the values.
Then by choosing $${\mu}$$ and $${\tau}$$ as in Theorem 3.3, we have that $$\widehat{\mathscr{S}} = {\mathscr{S}}$$ holds with probability at least $$1 - p_1 - \,{\rm e}^{-c_1^{\prime} {m_{v}}} - \,{\rm e}^{-\sqrt{{m_{v}} {d}}} - {d}^{-2C_1}$$. Given that $$\widehat{\mathscr{S}} = {\mathscr{S}}$$, let $${m^{\prime}_{x}}$$ be as defined in Lemma 3.2. For any $${\varepsilon^{\prime}} < \frac{{D}_2^3}{384\sqrt{2}{B}_3^2} = {\varepsilon}_2$$, $$0 < p_2 < 1$$, $$\theta_2 = \cos^{-1}(-{\varepsilon^{\prime}} / {\varepsilon}_2)$$, say we resample each query in Algorithm 2, $$N_2 > { \frac{\sigma^2}{{{\varepsilon^{\prime}}}^2} \log\left(\frac{2}{p_2}({k}(2{{m^{\prime}_{x}}}^2 + \lceil\log {k} \rceil)) \right)}$$ times. Then by choosing $${\beta}, {\mu_1}, {\tau^{\prime}}$$ as in Theorem 3.3, we have that $${\widehat{{{\mathscr{S}_1}}}} = {\mathscr{S}_1}$$ and $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$, with probability at least $$1-p_2$$. We now analyze the query complexity for the i.i.d. Gaussian noise case. One can verify that $${\varepsilon}_1 = O({k}^{-1/2})$$. Since $${m_{v}} = O({k} \log {d})$$, $${|{{{{\mathscr{H}}_{2}^{d}}}}|} = O(\log {d})$$ and $${m_{x}} = O({\lambda}_1^{-1})$$, by choosing $$p_1 = O({d}^{-\delta})$$ for any constant $$\delta > 0$$ we arrive at $$N_1 = {O({k} \log(({d}^{\delta}) ({k} \log d) ({\lambda}_1^{-2} \log {d})))} = O({k} \log {d})$$. This leads to a total sample complexity of $$O(N_1 {k} (\log {d})^2 {\lambda}_1^{-2}) = O({k}^2 (\log {d})^3 {\lambda}_1^{-2})$$ for guaranteeing $$\widehat{\mathscr{S}} = {\mathscr{S}}$$, with high probability. Next, we see that $${\varepsilon^{\prime}} = O(1)$$ and thus $$N_2 = O(\log ({k}({\lambda}_2^{-2} + \log {k})/p_2))$$.
Therefore, with an additional $$O(N_2 {k}({\lambda}_2^{-2} + \log {k})) = O({k}({\lambda}_2^{-2} + \log {k}) \log({k}/p_2))$$ samples, we are guaranteed with probability at least $$1-p_2$$ that $${\widehat{{{\mathscr{S}_1}}}} = {\mathscr{S}_1}$$ and $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$. 4 Sampling scheme for the general overlap case We now analyze the general scenario where overlaps can occur among the elements of $${\mathscr{S}_2}$$. Therefore, the degrees of the variables occurring in $${\mathscr{S}_2^{\text{var}}}$$ can be greater than one. Contrary to the non-overlap case, we now sample $$f$$ in order to directly estimate its $${d} \times {d}$$ Hessian $${\nabla^2}{f}$$, at suitably chosen points. In particular, this enables us to subsequently identify $${\mathscr{S}_2}$$. Once $${\mathscr{S}_2}$$ is identified, we are left with a SPAM, with no variable interactions, on the set $$[{d}] \setminus {\mathscr{S}_2^{\text{var}}}$$. We then identify $${\mathscr{S}_1}$$ by employing the sampling scheme from [54] on this reduced space. 4.1 Analysis for noiseless setting In this section, we consider the noiseless scenario, i.e. we assume the exact sample $$f(\mathbf x)$$ is obtained for any query $$\mathbf x$$. To begin with, we explain why the sampling scheme for the non-overlap case does not directly apply here. To this end, note that the gradient of $$f$$ has the following structure for each $$q \in [{d}]$$.
\begin{equation*} ({\nabla} f(\mathbf x))_q = \left\{ \begin{array}{@{}rl} \partial_q \phi_q(x_q) \quad , & q \in {\mathscr{S}_1} \\[4pt] \partial_q \phi_{(q,q^{\prime})} (x_{q},x_{q^{\prime}}) \quad , & (q,q^{\prime}) \in {\mathscr{S}_2} \ \& \ {\rho}(q) = 1, \\[4pt] \partial_q \phi_{(q^{\prime},q)} (x_{q^{\prime}},x_{q}) \quad , & (q^{\prime},q) \in {\mathscr{S}_2} \ \& \ {\rho}(q) = 1, \\[4pt] \partial_q \phi_q(x_q) + \sum\limits_{(q,q^{\prime}) \in {\mathscr{S}_2}} \partial_q \phi_{(q,q^{\prime})} (x_{q},x_{q^{\prime}}) \\ [8pt] + \sum\limits_{(q^{\prime},q) \in {\mathscr{S}_2}} \partial_q \phi_{(q^{\prime},q)} (x_{q^{\prime}},x_{q}) \quad , & q \in {\mathscr{S}_2^{\text{var}}} \ \& \ {\rho}(q) > 1, \\[4pt] 0 \quad , & \text{otherwise.} \end{array} \right. \end{equation*} Therefore, for any $$q \in {\mathscr{S}_2^{\text{var}}}$$ with $${\rho}(q) > 1$$, we notice that $$({\nabla} f(\mathbf x))_q$$ is by itself the sum of $${\rho}(q)$$ many bivariate functions and $$\partial_q \phi_q$$. This causes an issue as far as identifying $$q$$—via estimating $${\nabla}{f}$$ followed by thresholding—is concerned, as was done for the non-overlap case. While we assume the magnitudes of $$\partial_q \phi_{(q,q^{\prime})}$$ to be sufficiently large within respective subsets of $$[-1,1]^2$$, it is not clear what that implies for $${|{{({\nabla} f(\mathbf x))_q}}|}$$. Note that $$({\nabla} f(\mathbf x))_q \not\equiv 0$$ since $$q$$ is an active variable. However, a lower bound on: $${|{{({\nabla} f(\mathbf x))_q}}|}$$, and also on the measure of the interval where it is attained, appears to be non-trivial to obtain. Estimating sparse Hessian matrices In light of the above discussion, we consider an alternative approach, wherein we directly estimate the Hessian $${\nabla^2}{f}(\mathbf x) \in {\mathbb{R}}^{{d} \times {d}}$$, at suitably chosen $$\mathbf x \in [-1,1]^{{d}}$$. 
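The cancellation issue above can be made concrete with a hypothetical two-pair example (not from the paper): take $$f(\mathbf x) = x_1 x_2 - x_1 x_3$$, so that $${\rho}(1) = 2$$. Each interaction contributes a large partial derivative in $$x_1$$, yet the contributions can sum to zero:

```python
def d1_f(x):
    """partial derivative in x1 of the toy f(x) = x1*x2 - x1*x3: the
    contributions x2 and -x3 from the two interactions can cancel."""
    return x[1] - x[2]

x = (0.0, 0.9, 0.9)   # x2 = x3: each bivariate term contributes +/- 0.9 ...
total = d1_f(x)       # ... yet the sum is exactly 0, defeating a threshold test
```

Away from the plane $$x_2 = x_3$$ the same entry of the gradient is large again, which is why a direct lower bound on its magnitude (and on where it is attained) is delicate.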
Observe that $${\nabla^2}{f}(\mathbf x)$$ has the following structure for $$i \in {\mathscr{S}_2^{\text{var}}}$$ and $$j=1,\dots,{d}$$: \begin{equation*} ({\nabla^2}{f}(\mathbf x))_{i,j} = \left\{ \begin{array}{rl} \partial_i^2 \phi_{(i,i^{\prime})} (x_i,x_{i^{\prime}}) \quad \quad , & {\rho}(i) = 1, (i,i^{\prime}) \in {\mathscr{S}_2}, i = j, \\ \partial_i^2 \phi_{(i^{\prime},i)} (x_{i^{\prime}},x_i) \quad \quad , & {\rho}(i) = 1, (i^{\prime},i) \in {\mathscr{S}_2}, i = j, \\ \partial_i^2 \phi_i(x_i) + \sum\limits_{(i,i^{\prime}) \in {\mathscr{S}_2}} \partial_i^2 \phi_{(i,i^{\prime})} (x_i,x_{i^{\prime}}) \\ + \sum\limits_{(i^{\prime},i) \in {\mathscr{S}_2}} \partial_i^2 \phi_{(i^{\prime},i)} (x_{i^{\prime}},x_i) \quad \quad , & {\rho}(i) > 1, i = j, \\ \partial_i \partial_j \phi_{(i,j)} (x_i,x_j) \quad \quad , & (i,j) \in {{\mathscr{S}_2}}, \\ \partial_i \partial_j \phi_{(j,i)} (x_j,x_i) \quad \quad , & (j,i) \in {{\mathscr{S}_2}}, \\ 0 \quad , & \text{otherwise} \end{array} \right. , \end{equation*} while if $$i \in {\mathscr{S}_1}$$, we have for $$j=1,\dots,{d}$$: \begin{equation*} ({\nabla^2}{f}(\mathbf x))_{i,j} = \left\{ \begin{array}{rl} \partial_i^2 \phi_{i} (x_i) \quad , & i = j, \\ 0 \quad , & \text{otherwise} \end{array} \right. \end{equation*} The $$l$$th row of $${\nabla^2}{f}(\mathbf x)$$ can be denoted by $${\nabla} \partial_l f(\mathbf x)^{\rm T} \in {\mathbb{R}}^{{d}}$$. If $$l \in {\mathscr{S}_1}$$, then $${\nabla} \partial_l f(\mathbf x)^{\rm T}$$ has at most one non-zero entry, namely the $$l$$th entry, and has all other entries equal to zero. In other words, $${\nabla} \partial_l f(\mathbf x)^{\rm T}$$ is $$1$$-sparse for $$l \in {\mathscr{S}_1}$$. If $$l \in {\mathscr{S}_2^{\text{var}}}$$, then we see that $${\nabla} \partial_l f(\mathbf x)^{\rm T}$$ will have at most $$({\rho}(l) + 1)$$ non-zero entries, implying that it is $$({\rho}(l) + 1) \leq ({\rho_{m}} + 1)$$-sparse. 
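The row-sparsity pattern just described can be tabulated with a small helper (the function and variable names are ours, for illustration; as in the model, off-diagonal entries arise only from pairs in $${\mathscr{S}_2}$$, and $${\rho}(l)$$ counts the pairs containing $$l$$):

```python
def hessian_row_supports(d, S1, S2):
    """Per-row support of the d x d Hessian implied by the model:
    l in S1 contributes at most the diagonal entry (l, l); each pair
    in S2 containing l adds (l, l) plus one off-diagonal entry."""
    rows = {l: set() for l in range(d)}
    for p in S1:
        rows[p].add(p)
    for (l, lp) in S2:
        rows[l].update((l, lp))
        rows[lp].update((lp, l))
    return rows

d, S1, S2 = 8, [0], [(1, 2), (1, 3)]        # here rho(1) = 2
rows = hessian_row_supports(d, S1, S2)
# row 0 is 1-sparse, row 1 is (rho(1) + 1) = 3-sparse, inactive rows are empty
```

This matches the bound stated above: every row is at most $$({\rho_{m}} + 1)$$-sparse.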
At suitably chosen $$\mathbf x$$’s, our aim specifically is to detect the non-zero off-diagonal entries of $${\nabla^2}{f}(\mathbf x)$$, since they correspond precisely to $${\mathscr{S}_2}$$. To this end, we consider the ‘difference of gradients’-based approach used in Section 3.1.2. Contrary to the setting in Section 3.1.2, however, we now have a $${d} \times {d}$$ Hessian and have no knowledge about the set of active variables: $${\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$. Therefore, the Hessian estimation problem is harder now and requires a different sampling scheme. Sampling scheme for estimating $${\mathscr{S}_2}$$. For $$\mathbf x,{\mathbf v^{\prime}} \in {\mathbb{R}}^{{d}}$$, $${\mu_1} > 0$$, consider the Taylor expansion of $${\nabla} f$$ at $$\mathbf x$$ along $${\mathbf v^{\prime}}$$, with step size $${\mu_1}$$. With $$\zeta_i = \mathbf x + \theta_i {\mathbf v^{\prime}}$$ for some $$\theta_i \in (0,{\mu_1})$$, $$i=1,\dots,{d}$$, we obtain the following identity. \begin{align} \frac{{\nabla} f(\mathbf x + {\mu_1}{\mathbf v^{\prime}}) - {\nabla} f(\mathbf x)}{{\mu_1}} &= {\nabla^2} f(\mathbf x) {\mathbf v^{\prime}} + \frac{{\mu_1}}{2} \begin{pmatrix} {{\mathbf v^{\prime}}}^{\mathrm{T}} {\nabla^2} \partial_1 f(\zeta_1) {\mathbf v^{\prime}} \\ \vdots \\ {{\mathbf v^{\prime}}}^{\mathrm{T}} {\nabla^2} \partial_{{d}} f(\zeta_{{d}}) {\mathbf v^{\prime}} \end{pmatrix} \notag\\ &= \begin{pmatrix} \langle{{\nabla} \partial_1 f({\mathbf x})},{{\mathbf v^{\prime}}}\rangle \\ \vdots \\ \langle{{\nabla} \partial_{{d}} f({\mathbf x})},{{\mathbf v^{\prime}}}\rangle \end{pmatrix} + \frac{{\mu_1}}{2} \begin{pmatrix} {{\mathbf v^{\prime}}}^{\mathrm{T}} {\nabla^2} \partial_1 f(\zeta_1) {\mathbf v^{\prime}} \\ \vdots \\ {{\mathbf v^{\prime}}}^{\mathrm{T}} {\nabla^2} \partial_{{d}} f(\zeta_{{d}}) {\mathbf v^{\prime}} \end{pmatrix}. \end{align} (4.1) Fig.
2. (a) $${\nabla^2} f(\mathbf x)$$ estimated using $${\widehat{{{\nabla}}}}f(\mathbf x)$$ (disk at center) and neighborhood gradient estimates. (b) Geometric picture: $${d} = 3$$, $${h} \in {\mathscr{H}}_2^3$$ with $${h}(1) = {h}(3) \neq {h}(2)$$. Disks are points in $${\chi}({h})$$. We see from (4.1) that the $$l$$th entry of $$({\nabla} f(\mathbf x + {\mu_1}{\mathbf v^{\prime}}) - {\nabla} f(\mathbf x))/{\mu_1}$$ corresponds to a linear measurement of the $$l$$th row of $${\nabla^2} f(\mathbf x)$$ with $${\mathbf v^{\prime}}$$. From the preceding discussion, we also know that each row of $${\nabla^2} f(\mathbf x)$$ is at most $$({\rho_{m}} + 1)$$-sparse. This suggests the following idea: for any $$\mathbf x$$, if we obtain sufficiently many linear measurements of each row of $${\nabla^2} f(\mathbf x)$$, then we can estimate each row separately via $$\ell_1$$ minimization. To this end, we first need an efficient way for estimating $${\nabla} f(\mathbf x) \in {\mathbb{R}}^{{d}}$$, at any point $$\mathbf x$$. Note that $${\nabla} f(\mathbf x)$$ is $${k}$$-sparse; therefore, we can estimate it via the randomized scheme, explained in Section 3.1.1, with $$O({k} \log {d})$$ queries of $$f$$. This gives us: $${\widehat{{{\nabla}}}} f(\mathbf x) = {\nabla} f(\mathbf x) + \mathbf w(\mathbf x)$$, where $$\mathbf w(\mathbf x) \in {\mathbb{R}}^{{d}}$$ denotes the estimation noise. Plugging this in (4.1) results in the following identity.
\begin{align} \frac{{\widehat{{{\nabla}}}} f(\mathbf x + {\mu_1}{\mathbf v^{\prime}}) - {\widehat{{{\nabla}}}} f(\mathbf x)}{{\mu_1}} = \begin{pmatrix} \langle{{\nabla} \partial_1 f({\mathbf x})},{{\mathbf v^{\prime}}}\rangle \\ \vdots \\ \langle{{\nabla} \partial_{{d}} f({\mathbf x})},{{\mathbf v^{\prime}}}\rangle \end{pmatrix} + \underbrace{\frac{{\mu_1}}{2} \begin{pmatrix} {{\mathbf v^{\prime}}}^{\mathrm{T}} {\nabla^2} \partial_1 f(\zeta_1) {\mathbf v^{\prime}} \\ \vdots \\ {{\mathbf v^{\prime}}}^{\mathrm{T}} {\nabla^2} \partial_{{d}} f(\zeta_{{d}}) {\mathbf v^{\prime}} \end{pmatrix} + \frac{\mathbf w(\mathbf x + {\mu_1}{\mathbf v^{\prime}}) - \mathbf w(\mathbf x)}{{\mu_1}}}_{\text{`Noise'}}. \end{align} (4.2) Now let $${\mathbf v^{\prime}}$$ be chosen from the set: \begin{align} {\mathscr{V}^{\prime}} &:= \left\{{{\mathbf v^{\prime}}_j \in {\mathbb{R}}^{{d}} : {v^{\prime}}_{j,q} = \pm\frac{1}{\sqrt{{m_{v^{\prime}}}}} \ \text{w.p.} \ 1/2 \ \text{each}; \ j=1,\dots,{m_{v^{\prime}}} \ \text{and} \ q=1,\dots,{{d}}}\right\}.
\end{align} (4.3) Then, employing (4.2) at each $${\mathbf v^{\prime}}_j \in {\mathscr{V}^{\prime}}$$ and denoting $${\mathbf{V}}^{\prime} = [{\mathbf v^{\prime}}_1 \dots {\mathbf v^{\prime}}_{{m_{v^{\prime}}}}]^{\rm T} \in {\mathbb{R}}^{{m_{v^{\prime}}} \times {d}}$$, we obtain $${d}$$ linear systems for $$q=1,\dots,{d}$$: \begin{align} \label{eq:hessrow_est_linfin} \underbrace{\frac{1}{{\mu_1}}\begin{pmatrix} (\widehat{{\nabla}} f({\mathbf x} + {\mu_1}{\mathbf v^{\prime}}_1) - \widehat{{\nabla}} f({\mathbf x}))_q \\ \vdots \\ (\widehat{{\nabla}} f({\mathbf x} + {\mu_1}{\mathbf v^{\prime}}_{{m_{v^{\prime}}}}) - \widehat{{\nabla}} f({\mathbf x}))_q \end{pmatrix}}_{\mathbf y_q} &= {\mathbf{V}}^{\prime} {\nabla} \partial_q f(\mathbf x) + \underbrace{\frac{{\mu_1}}{2} \begin{pmatrix} {{\mathbf v^{\prime}}_1}^{\mathrm{T}} {\nabla^2} \partial_q f(\zeta_{1}) {\mathbf v^{\prime}}_1 \\ \vdots \\ {{\mathbf v^{\prime}}_{{m_{v^{\prime}}}}}^{\mathrm{T}} {\nabla^2} \partial_q f(\zeta_{{m_{v^{\prime}}}}) {\mathbf v^{\prime}}_{{m_{v^{\prime}}}} \end{pmatrix}}_{{\mathbf{\eta_{q,1}}}}\notag\\ &\quad + \underbrace{\frac{1}{{\mu_1}}\begin{pmatrix} w_q({\mathbf x} + {\mu_1}{\mathbf v^{\prime}}_1) - w_q({\mathbf x}) \\ \vdots \\ w_q({\mathbf x} + {\mu_1}{\mathbf v^{\prime}}_{{m_{v^{\prime}}}}) - w_q({\mathbf x}) \end{pmatrix}}_{{\mathbf{\eta_{q,2}}}}. 
\end{align} (4.4) Given the measurement vector $$\mathbf y_q$$, we can obtain the estimate $${\widehat{{{\nabla}}}} \partial_q f(\mathbf x)$$ individually for each $$q$$, via $$\ell_1$$ minimization: $${\widehat{{{\nabla}}}} \partial_q f(\mathbf x) := \underset{\mathbf y_q = {\mathbf{V}}^{\prime} \mathbf z}{\mathrm{argmin}} \parallel{\mathbf z}\parallel_1 \quad q=1,\dots,{d}.$$ (4.5) Hence, we have obtained an estimate $${\widehat{{{\nabla^2}}}}f(\mathbf x) := [{\widehat{{{\nabla}}}} \partial_1 f(\mathbf x) \cdots {\widehat{{{\nabla}}}} \partial_d f(\mathbf x)]^{\rm T}$$ of the Hessian $${\nabla^2} f(\mathbf x)$$, at the point $$\mathbf x$$. Next, we would like a suitable set of points $$\mathbf x$$, in the sense that it provides a sufficiently fine discretization of any canonical two-dimensional subspace of $${\mathbb{R}}^{{d}}$$. To this end, we can simply consider the set $${\chi}$$ as defined in (3.12), for the same reasons as before. Sampling scheme for estimating $${\mathscr{S}_1}$$. Having recovered $${\mathscr{S}_2}$$ with the above sampling scheme, we can recover $${\mathscr{S}_1}$$ as follows. Let $$\widehat{{\mathscr{S}_2^{\text{var}}}}$$ denote the set of variables in the estimated set $${\widehat{{{\mathscr{S}_2}}}}$$, and let $${\mathscr{P}} := [{d}] \setminus \widehat{{\mathscr{S}_2^{\text{var}}}}$$. Assuming $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$, we have $${\mathscr{S}_1} \subset {\mathscr{P}}$$. Therefore, the model we are left with now is a SPAM with no variable interactions on the reduced variable set $${\mathscr{P}}$$. For identification of $${\mathscr{S}_1}$$, we employ the sampling scheme of [54], wherein the gradient of $$f$$ is estimated along a discrete set of points on the line: $$\left\{{(x,\dots,x) \in {\mathbb{R}}^{{d}} : x \in [-1,1]}\right\}$$.
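Each row estimate in (4.5) is a standard basis pursuit program. A minimal sketch of one such solve (the sizes and the solver choice are illustrative; we use the usual split $$\mathbf z = \mathbf u - \mathbf w$$ with $$\mathbf u, \mathbf w \geq 0$$ to obtain a linear program):

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(V, y):
    """min ||z||_1 subject to V z = y, via the LP reformulation
    z = u - w with u, w >= 0, minimizing sum(u) + sum(w)."""
    m, d = V.shape
    res = linprog(c=np.ones(2 * d), A_eq=np.hstack([V, -V]), b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:d] - res.x[d:]

rng = np.random.default_rng(0)
d, m = 30, 15                                            # ambient dim, measurements
V = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)    # the +/- 1/sqrt(m) design
row = np.zeros(d)
row[3], row[7] = 1.5, -2.0                               # a 2-sparse "Hessian row"
z_hat = l1_min(V, V @ row)                               # recover it from m << d measurements
```

With $$m = O(s \log(d/s))$$ random $$\pm 1/\sqrt{m}$$ measurements, an $$s$$-sparse row is recovered exactly with high probability, which is precisely what the conditions on $${m_{v^{\prime}}}$$ above encode.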
For some $${m^{\prime}_{x}} \in \mathbb{Z}^{+}$$, we denote this discrete set by: $${\chi}_{\text{diag}} := \left\{{\mathbf x = (x \ x \ \cdots \ x) \in {\mathbb{R}}^{{d}}: x \in \left\{{-1,-\frac{{m^{\prime}_{x}}-1}{{m^{\prime}_{x}}},\dots,\frac{{m^{\prime}_{x}}-1}{{m^{\prime}_{x}}},1}\right\}}\right\}.$$ (4.6) Note that $${|{{{\chi}_{\text{diag}}}}|} = 2{m^{\prime}_{x}}+1$$. The motivation for estimating $${\nabla} f$$ at $$\mathbf x \in {\chi}_{\text{diag}}$$ is that we obtain estimates of $$\partial_p \phi_p$$ at equispaced points within $$[-1,1]$$, for $$p \in {\mathscr{S}_1}$$. With a sufficiently fine discretization, we would ‘hit’ the critical regions associated with each $$\partial_p \phi_p$$, as defined in Assumption 3. By applying a thresholding operation, we would then be able to identify each $$p \in {\mathscr{S}_1}$$. Let $${\mathscr{V}^{\prime\prime}}$$ denote the set of sampling directions in $${\mathbb{R}}^{{d}}$$—analogous to $${\mathscr{V}},{\mathscr{V}^{\prime}}$$ defined in (3.4), (4.3), respectively—with $${|{{{\mathscr{V}^{\prime\prime}}}}|} = {m_{v^{\prime\prime}}}$$: $${\mathscr{V}^{\prime\prime}} := \left\{{{\mathbf v^{\prime\prime}}_j \in {\mathbb{R}}^{{d}} : {v^{\prime\prime}}_{j,q} = \pm\frac{1}{\sqrt{{m_{v^{\prime\prime}}}}} \ \text{w.p.} \ 1/2 \ \text{each}; \ j=1,\dots,{m_{v^{\prime\prime}}} \ \text{and} \ q=1,\dots,{{d}}}\right\}.$$ (4.7) For each $$\mathbf x \in {\chi}_{\text{diag}}$$, we will query $$f$$ at points $$(\mathbf x + {\mu^{\prime}} {\mathbf v^{\prime\prime}}_j)_{{\mathscr{P}}}, (\mathbf x - {\mu^{\prime}} {\mathbf v^{\prime\prime}}_j)_{{\mathscr{P}}}$$; $${\mathbf v^{\prime\prime}}_j \in {\mathscr{V}^{\prime\prime}}$$, restricted to $${\mathscr{P}}$$.
Then by obtaining the measurements: $$y_j = (f((\mathbf x + {\mu^{\prime}} {\mathbf v^{\prime\prime}}_j)_{{\mathscr{P}}}) - f((\mathbf x - {\mu^{\prime}} {\mathbf v^{\prime\prime}}_j)_{{\mathscr{P}}}))/(2{\mu^{\prime}}), \ j=1,\dots,{m_{v^{\prime\prime}}}$$, and denoting $$({\mathbf{V^{\prime\prime}}})_{{\mathscr{P}}} = [({\mathbf v^{\prime\prime}}_1)_{{\mathscr{P}}} \cdots ({\mathbf v^{\prime\prime}}_{{m_{v^{\prime\prime}}}})_{{\mathscr{P}}}]^{\rm T}$$, we obtain the estimate $$({\widehat{{{\nabla}}}} f((\mathbf x)_{{\mathscr{P}}}))_{{\mathscr{P}}} := \underset{\mathbf y = ({\mathbf{V^{\prime\prime}}})_{{\mathscr{P}}}(\mathbf z)_{{\mathscr{P}}}}{\mathrm{argmin}} \parallel{(\mathbf z)_{{\mathscr{P}}}}\parallel_1$$. This notation simply means that we search over $$\mathbf z \in {\mathbb{R}}^{{{\mathscr{P}}}}$$ to form the estimate $$({\widehat{{{\nabla}}}} f((\mathbf x)_{{\mathscr{P}}}))_{{\mathscr{P}}}$$. The complete procedure for estimating $${\mathscr{S}_1},{\mathscr{S}_2}$$ is described formally in Algorithm 3. Next, we provide sufficient conditions on our sampling parameters that guarantee exact recovery of $${\mathscr{S}_1}, {\mathscr{S}_2}$$ by the algorithm. This is stated in the following Theorem. Theorem 4.1 Let $${{\mathscr{H}}_{2}^{d}}$$ be of size $${|{{{{\mathscr{H}}_{2}^{d}}}}|} \leq 2(C + 1)\,{\rm e}^2 \log {d}$$ for some constant $$C > 1$$. Then $$\exists$$ constants $$c_1^{\prime},c_2^{\prime} \geq 1$$ and $$C_1, C_2, c_4^{\prime}, c_5^{\prime} > 0$$, such that the following is true. Let $${m_{x}}, {m_{v}}, {m_{v^{\prime}}}$$ satisfy \begin{equation*} {m_{x}} \geq {\lambda}_2^{-1}, \quad c_1^{\prime} {k} \log\left(\frac{{d}}{{k}}\right) < {m_{v}} < \frac{{d}}{(\log 6)^2}, \quad c_2^{\prime} {\rho_{m}} \log\left(\frac{{d}}{{\rho_{m}}}\right) < {m_{v^{\prime}}} < \frac{{d}}{(\log 6)^2}. 
\end{equation*} Denoting $$a = \frac{(4{\rho_{m}}+1){B}_3}{2\sqrt{{m_{v^{\prime}}}}}$$, $$b = \frac{C_1\sqrt{{m_{v^{\prime}}}}((4{\rho_{m}}+1){k}){B}_3}{3{m_{v}}}$$, let $${\mu}, {\mu_1}$$ satisfy \begin{align*} &{\mu}^2 < \frac{{D}_2^2}{16 a b C_2^2}, \\[8pt] &{\mu_1} \in \left(({D}_2/(4aC_2)) - \sqrt{({D}_2/(4aC_2))^2 - (b{\mu}^2/a)}, ({D}_2/(4aC_2)) + \sqrt{({D}_2/(4aC_2))^2 - (b{\mu}^2/a)} \right). \end{align*} We then have that the choice \begin{align*} {\tau^{\prime}} = C_2 \left(\frac{{\mu_1}(4{\rho_{m}} + 1){B}_3}{2\sqrt{{m_{v^{\prime}}}}} + \frac{C_1\sqrt{{m_{v^{\prime}}}} {\mu}^2((4{\rho_{m}}+1){k}){B}_3}{3{\mu_1}{m_{v}}} \right), \end{align*} implies $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$ with probability at least $$1 - \,{\rm e}^{-c_4^{\prime} {m_{v}}} - \,{\rm e}^{-\sqrt{{m_{v}} {d}}} - \,{\rm e}^{-c_5^{\prime} {m_{v^{\prime}}}} - \,{\rm e}^{-\sqrt{{m_{v^{\prime}}} {d}}} - {d}^{-2C}$$. Given that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$, then $$\exists$$ constants $$c_3^{\prime} \geq 1$$ and $$C_3, c_6^{\prime} > 0$$, such that for $${m^{\prime}_{x}}, {m_{v^{\prime\prime}}}, {\mu^{\prime}}$$ satisfying \begin{equation*} {m^{\prime}_{x}} \geq {\lambda}_1^{-1}, \quad c_3^{\prime} ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) \log\left(\frac{|{{\mathscr{P}}}|}{{k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}}\right) < {m_{v^{\prime\prime}}} < \frac{|{{\mathscr{P}}}|}{(\log 6)^2}, \quad {{\mu^{\prime}}}^2 < \frac{3{m_{v^{\prime\prime}}} {D}_1}{C_3 ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) {B}_3}, \end{equation*} the choice: $${\tau^{\prime\prime}} = \frac{C_3 ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) {{\mu^{\prime}}}^2 {B}_3}{6{m_{v^{\prime\prime}}}}$$, implies $${\widehat{{{\mathscr{S}_1}}}} = {\mathscr{S}_1}$$ with probability at least $$1 - \,{\rm e}^{-c_6^{\prime} {m_{v^{\prime\prime}}}} - \,{\rm e}^{-\sqrt{{m_{v^{\prime\prime}}} |{{\mathscr{P}}}|}}$$. Query complexity. 
Estimating $${\nabla} f(\mathbf x)$$ at some fixed $$\mathbf x$$ requires $$2{m_{v}} = O({k}\log {d})$$ queries. Estimating $${\nabla^2} f(\mathbf x)$$ involves the estimation of $${\nabla} f(\mathbf x)$$—along with an additional $${m_{v^{\prime}}}$$ gradient vectors in a neighborhood of $$\mathbf x$$—implying $$O({m_{v}}{m_{v^{\prime}}}) = O({k}{\rho_{m}}(\log {d})^2)$$ point queries of $$f$$. Since $${\nabla^2} f$$ is estimated at all points in $${\chi}$$ in the worst case, this consequently implies a total query complexity of $$O({k}{\rho_{m}}(\log {d})^2 {|{{{\chi}}}|}) = O({\lambda}_2^{-2}{k}{\rho_{m}}(\log {d})^3)$$, for estimating $${\mathscr{S}_2}$$. We make an additional $$O({\lambda}_1^{-1} ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) \log ({d} - {|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}))$$ queries of $$f$$, to estimate $${\mathscr{S}_1}$$. Therefore, the overall query complexity for estimating $${\mathscr{S}_1},{\mathscr{S}_2}$$ is $$O({\lambda}_2^{-2}{k}{\rho_{m}}(\log {d})^3)$$. Computational complexity. The family $${{\mathscr{H}}_{2}^{d}}$$ can be constructed in time polynomial in $$d$$. For each $$\mathbf x \in {\chi}$$, we first solve $${m_{v^{\prime}}} + 1$$ linear programs in $$O(d)$$ variables (Steps 10 and 13), each solvable in time polynomial in $$({m_{v}}, d)$$. We then solve $$d$$ linear programs in $$O(d)$$ variables (Step 17), each of which takes time polynomial in $$({m_{v^{\prime}}}, d)$$. Since this is done at $${|{{{\chi}}}|} = O({\lambda}_2^{-2} \log d)$$ many points, the overall computation time for estimation of $${\mathscr{S}_2}$$ (and subsequently $${\mathscr{S}_1}$$) is polynomial in the number of queries, and in $$d$$. Remark 4.1 In Algorithm 3, we could have optimized the procedure for identifying $${\mathscr{S}_1}$$ as follows. Observe that for each $${h} \in {{\mathscr{H}}_{2}^{d}}$$, we always have a subset of points (i.e.
$$\subset {\chi}({h})$$) that discretize $$\left\{{(x,\dots,x) \in {\mathbb{R}}^{{d}} : x \in [-1,1]}\right\}$$. Therefore, for each $$\mathbf x$$ lying in this subset, we could go through $${\widehat{{{\nabla}}}} f(\mathbf x)$$ and check, via a thresholding operation, whether it reveals any variable(s) in $${\mathscr{S}_1}$$. If $${m_{x}}$$ is large enough ($$\geq {\lambda}_1^{-1}$$), then it would also enable us to recover $${\mathscr{S}_1}$$ completely. A downside of this approach is that we would require additional, stronger conditions on the step size parameter $${\mu}$$ to guarantee identification of $${\mathscr{S}_1}$$. Since the estimation procedure for $${\mathscr{S}_1}$$ in Algorithm 3 comes at the same order-wise sampling cost, we choose to query $$f$$ again in order to identify $${\mathscr{S}_1}$$. Remark 4.2 We also note that the condition on $${\mu^{\prime}}$$ is less strict than in [54] for identifying $${\mathscr{S}_1}$$. This is because in [54], the gradient is estimated via a forward difference procedure, while we perform a central difference procedure in (3.3). 4.2 Analysis for noisy setting We now consider the case where at each query $$\mathbf x$$, we observe $$f(\mathbf x) + {z^{\prime}}$$, with $${z^{\prime}} \in {\mathbb{R}}$$ denoting external noise. To estimate $${\nabla} f(\mathbf x)$$, we obtain the samples: $$f(\mathbf x + {\mu} {\mathbf v}_j) + {z^{\prime}}_{j,1}$$ and $$f(\mathbf x - {\mu} {\mathbf v}_j) + {z^{\prime}}_{j,2}$$, $$j = 1,\dots,{m_{v}}$$. This changes (3.6) to the linear system $$\mathbf y = {\mathbf{V}}{\nabla} f(\mathbf x) + {\mathbf n} + {\mathbf{{z}}}$$, where $${z}_{j} = ({z^{\prime}}_{j,1} - {z^{\prime}}_{j,2})/(2{\mu})$$. Arbitrary bounded noise. In this scenario, we assume the external noise to be arbitrary and bounded, meaning that $${|{{{z^{\prime}}}}|} < {\varepsilon}$$, for some finite $${\varepsilon} \geq 0$$.
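The effect of bounded noise on the central-difference measurements can be checked on a toy function (everything below, including $$f$$ and the sizes, is illustrative). For $$f(\mathbf x) = \sum_i x_i^3$$ one has, exactly, $$(f(\mathbf x + \mu\mathbf v) - f(\mathbf x - \mu\mathbf v))/(2\mu) = \langle \mathbf v, \nabla f(\mathbf x)\rangle + \mu^2 \sum_i v_i^3$$, so each entry of $$\mathbf y - \mathbf V\nabla f(\mathbf x)$$ is bounded by $$\mu^2 d/m_v^{3/2} + \varepsilon/\mu$$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, mu, eps = 20, 8, 0.05, 1e-4

f = lambda x: np.sum(x**3)                       # toy f, known gradient 3*x**2
x = rng.uniform(-0.5, 0.5, size=d)
V = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)

noise = rng.uniform(-eps, eps, size=(m, 2))      # bounded external noise |z'| < eps
y = np.array([(f(x + mu * v) + n1 - (f(x - mu * v) + n2)) / (2 * mu)
              for v, (n1, n2) in zip(V, noise)])

err = np.max(np.abs(y - V @ (3 * x**2)))
# err <= mu^2 * d / m**1.5 + eps / mu  (Taylor term + amplified noise)
```

The second term grows as $$\mu$$ shrinks, which is why the noisy analysis below keeps the step sizes bounded away from zero.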
Theorem 4.2 shows that Algorithm 3 recovers $${\mathscr{S}_1},{\mathscr{S}_2}$$ with appropriate choice of sampling parameters, provided $${\varepsilon}$$ is not too large. Theorem 4.2 Assuming the notation in Theorem 4.1, let $$a, b, {m_{x}}, {m_{v}}, {m_{v^{\prime}}}, {{\mathscr{H}}_{2}^{d}}$$ be as defined in Theorem 4.1. Say $${\varepsilon} < {\varepsilon}_1 = \frac{{D}_2^{3}}{192\sqrt{3} C_1 C_2^3 \sqrt{a^3 b {m_{v^{\prime}}} {m_{v}}}}$$. Then for $$\theta_1 = \cos^{-1}(-{\varepsilon} / {\varepsilon}_1)$$, let $${\mu}, {\mu_1}$$ satisfy: \begin{align} {\mu} &\in \left(\sqrt{\frac{{D}_2^2}{12 a b C_2^2}}\cos(\theta_1/3 - 2\pi/3) , \sqrt{\frac{{D}_2^2}{12 a b C_2^2}}\cos(\theta_1/3)\right), \\ \end{align} (4.8) \begin{align} {\mu_1} &\in \left(\frac{{D}_2}{4aC_2} - \sqrt{\left(\frac{{D}_2}{4aC_2}\right)^2 - \left(\frac{b{\mu}^2 + 2C_1\sqrt{{m_{v}}{m_{v^{\prime}}}}{\varepsilon}}{a}\right)}, \frac{{D}_2}{4aC_2}\right.\notag\\ &\quad\left. + \sqrt{\left(\frac{{D}_2}{4aC_2}\right)^2 - \left(\frac{b{\mu}^2 + 2C_1\sqrt{{m_{v}}{m_{v^{\prime}}}}{\varepsilon}}{a}\right)} \right). \end{align} (4.9) We then have in Algorithm 3 for the choice $${\tau^{\prime}} = C_2 \left(\frac{{\mu_1}(4{\rho_{m}} + 1){B}_3}{2\sqrt{{m_{v^{\prime}}}}} + \frac{C_1\sqrt{{m_{v^{\prime}}}} {\mu}^2((4{\rho_{m}}+1){k}){B}_3}{3{\mu_1}{m_{v}}} + \frac{2C_1{\varepsilon}\sqrt{{m_{v^{\prime}}}{m_{v}}}}{{\mu}{\mu_1}}\right),$$ (4.10) that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$ with probability at least $$1 - \,{\rm e}^{-c_4^{\prime} {m_{v}}} - \,{\rm e}^{-\sqrt{{m_{v}} {d}}} - \,{\rm e}^{-c_5^{\prime} {m_{v^{\prime}}}} - \,{\rm e}^{-\sqrt{{m_{v^{\prime}}} {d}}} - {d}^{-2C}$$. Given that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$, let $${m^{\prime}_{x}}, {m_{v^{\prime\prime}}}$$ be chosen as in Theorem 4.1. 
Let $$a_1 = \frac{({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) {B}_3}{6{m_{v^{\prime\prime}}}}$$, $$b_1 = \sqrt{{m_{v^{\prime\prime}}}}$$ and assume $${\varepsilon} < {\varepsilon}_2 = \frac{{D}_1^{3/2}}{3\sqrt{6 a_1 C_3^3 b_1^2}}$$. For $$\theta_2 = \cos^{-1}(-{\varepsilon} / {\varepsilon}_2)$$, let $${\mu^{\prime}} \in (2\sqrt{{D}_1/(6 a_1 C_3)} \cos(\theta_2/3 - 2\pi/3), 2\sqrt{{D}_1/(6 a_1 C_3)} \cos(\theta_2/3))$$. We then have in Algorithm 3 for the choice $${\tau^{\prime\prime}} = C_3\left(\frac{({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) {{\mu^{\prime}}}^2 {B}_3}{6{m_{v^{\prime\prime}}}} + \frac{b_1{\varepsilon}}{{\mu^{\prime}}}\right)$$ that $${\widehat{{{\mathscr{S}_1}}}} = {\mathscr{S}_1}$$ with probability at least $$1 - \,{\rm e}^{-c_6^{\prime} {m_{v^{\prime\prime}}}} - \,{\rm e}^{-\sqrt{{m_{v^{\prime\prime}}} |{{\mathscr{P}}}|}}$$. We see that in contrast to Theorem 4.1, the step sizes $${\mu}, {\mu^{\prime}}$$ cannot be chosen too small now, on account of external noise. Also note that the parameters $$\pi/2 \leq \theta_1,\theta_2 \leq \pi$$ arising due to $${\varepsilon}$$, affect the size of the intervals from which $${\mu}, {\mu^{\prime}}$$ can be chosen, respectively. One can verify that plugging $${\varepsilon} = 0$$ in Theorem 4.2 (implying $$\theta_1,\theta_2 = \pi/2$$), gives us the sampling conditions of Theorem 4.1. Stochastic noise. We now consider i.i.d. Gaussian noise, so that $${z^{\prime}} \sim {\mathscr{N}}(0,\sigma^2)$$ for variance $$\sigma^2 < \infty$$. As in Section 3.2, we resample each point query a sufficient number of times and average, to reduce $$\sigma$$. Doing this $$N_1$$ times in Steps 9 and 12, and $$N_2$$ times in Step 25, for $$N_1, N_2$$ large enough, we can recover $${\mathscr{S}_1},{\mathscr{S}_2}$$ as shown formally in the following theorem. Theorem 4.3 Assuming the notation in Theorem 4.1, let $$a, b, {m_{x}}, {m_{v}}, {m_{v^{\prime}}}, {{\mathscr{H}}_{2}^{d}}$$ be as defined in Theorem 4.1.
For any $${\varepsilon} < {\varepsilon}_1 = \frac{{D}_2^{3}}{192\sqrt{3} C_1 C_2^3 \sqrt{a^3 b {m_{v^{\prime}}} {m_{v}}}}$$, $$0 < p_1 < 1$$ and $$\theta_1 = \cos^{-1}(-{\varepsilon} / {\varepsilon}_1)$$, say we resample each query in Steps 9 and 12 of Algorithm 3, $$N_1 > { \frac{\sigma^2}{{\varepsilon}^2} \log (\frac{2}{p_1}{m_{v}}({m_{v^{\prime}}}+1)(2{m_{x}}+1)^2{|{{{{\mathscr{H}}_{2}^{d}}}}|})}$$ times and average the values. Let $${\mu}, {\mu_1}, {\tau^{\prime}}$$ be chosen to satisfy (4.8), (4.9) and (4.10), respectively. We then have in Algorithm 3 that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$ with probability at least $$1 - p_1 - \,{\rm e}^{-c_4^{\prime} {m_{v}}} - \,{\rm e}^{-\sqrt{{m_{v}} {d}}} - \,{\rm e}^{-c_5^{\prime} {m_{v^{\prime}}}} - \,{\rm e}^{-\sqrt{{m_{v^{\prime}}} {d}}} - {d}^{-2C}$$. Given that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$, let $${m^{\prime}_{x}}, {m_{v^{\prime\prime}}}, a_1, b_1$$ be as stated in Theorem 4.2. For any $${\varepsilon^{\prime}} < {\varepsilon}_2 = \frac{{D}_1^{3/2}}{3\sqrt{6 a_1 C_3^3 b_1^2}}$$, $$0 < p_2 < 1$$, and $$\theta_2 = \cos^{-1}(-{\varepsilon^{\prime}} / {\varepsilon}_2)$$, say we resample each query in Step 25 of Algorithm 3, $$N_2 > \frac{\sigma^2}{{{\varepsilon^{\prime}}}^2} {{\log}(\frac{2 (2{m^{\prime}_{x}}+1){m_{v^{\prime\prime}}}}{p_2})}$$ times. Furthermore, let $${\mu^{\prime}}, {\tau^{\prime\prime}}$$ be chosen as stated in Theorem 4.2. We then have in Algorithm 3 that $${\widehat{{{\mathscr{S}_1}}}} = {\mathscr{S}_1}$$ with probability at least $$1 - p_2 - \,{\rm e}^{-c_6^{\prime} {m_{v^{\prime\prime}}}} - \,{\rm e}^{-\sqrt{{m_{v^{\prime\prime}}} |{{\mathscr{P}}}|}}$$. Query complexity. Let us analyze the query complexity when the noise is i.i.d. Gaussian. For estimating $${\mathscr{S}_2}$$, we have $${\varepsilon} = O({\rho_{m}}^{-2} {k}^{-1/2})$$.
Furthermore, $$(2{m_{x}}+1)^2 = {\lambda}_2^{-2}$$, $${|{{{{\mathscr{H}}_{2}^{d}}}}|} = O(\log d)$$, $${m_{v}} = O({k} \log {d})$$ and $${m_{v^{\prime}}} = O({\rho_{m}} \log {d})$$. Choosing $$p_1 = {d}^{-\delta}$$ for any constant $$\delta > 0$$ gives us $$N_1 = {O({\rho_{m}}^4 {k} \log({d}^{\delta}{k} {\rho_{m}} (\log {d})^3 ))} = O({\rho_{m}}^4 {k} \log {d}).$$ This means that our total sample complexity for estimating $${\mathscr{S}_2}$$ is $${O(N_1 {k}{\rho_{m}}(\log {d})^2 {|{{{\chi}}}|})} = O({\rho_{m}}^5 {k}^2 (\log {d})^4 {\lambda}_2^{-2}).$$ This ensures $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$ with high probability. Next, for estimating $${\mathscr{S}_1}$$, we have $${\varepsilon^{\prime}} = O(({k} - {|{{{\mathscr{S}_2^{\text{var}}}}}|})^{-1/2})$$. Choosing $$p_2 = ({d} - {|{{{\mathscr{S}_2^{\text{var}}}}}|})^{-\delta}$$ for any constant $$\delta > 0$$, we get $$N_2 = O(({k} - {|{{{\mathscr{S}_2^{\text{var}}}}}|}) \log ({d} - {|{{{\mathscr{S}_2^{\text{var}}}}}|}))$$. This means the total sample complexity for estimating $${\mathscr{S}_1}$$ is $$O(N_2 {\lambda}_1^{-1} ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) \log ({d} - {|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|})) = O({\lambda}_1^{-1} ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|})^2 (\log ({d} - {|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}))^2)$$. Putting it together, we have that in the case of i.i.d. Gaussian noise, the sampling complexity of Algorithm 3 for estimating $${\mathscr{S}_1}, {\mathscr{S}_2}$$ is $$O({\rho_{m}}^5 {k}^2 (\log {d})^4 {\lambda}_2^{-2})$$. Remark 4.3 We saw above that $$O({k}^2 (\log {d})^2)$$ samples are sufficient for estimating $${\mathscr{S}_1}$$ in the presence of i.i.d. Gaussian noise. This improves the corresponding bound in [54] by an $$O({k})$$ factor and is due to the less strict condition on $${\mu^{\prime}}$$ (cf. Remark 4.2). 5.
Alternate sampling scheme for the general overlap case We now derive an alternate algorithm for estimating the sets $${\mathscr{S}_1}, {\mathscr{S}_2}$$ for the general overlap case. This algorithm differs from Algorithm 3 with respect to the scheme for estimating $${\mathscr{S}_2}$$; the procedure for estimating $${\mathscr{S}_1}$$ is the same as in Algorithm 3. To estimate $${\mathscr{S}_2}$$, we now make use of recent results from CS for recovering sparse symmetric matrices from few linear measurements. More precisely, we leverage these results for estimating the sparse Hessian $${\nabla^2} f(\mathbf x)$$ at any fixed $$\mathbf x \in {\mathbb{R}}^d$$. This is in stark contrast to the approaches we proposed so far, wherein each row of the Hessian $${\nabla^2} f(\mathbf x)$$ was approximated separately. As we will show, this results in slightly improved sampling bounds for estimating $${\mathscr{S}_2}$$ in the noiseless setting as opposed to those stated in Theorem 4.1. 5.1 Analysis for noiseless setting We begin with the setting of noiseless point queries and show how the problem of estimating $${\nabla^2} f(\mathbf x)$$ at any $$\mathbf x \in {\mathbb{R}}^d$$ can be formulated as one of recovering an unknown sparse symmetric matrix from linear measurements. To this end, first note that for $$\mathbf x, {\mathbf v} \in {\mathbb{R}}^{{d}}$$, step size $${\mu} > 0$$ and $$\zeta = \mathbf x + \theta {\mathbf v}$$, $$\zeta^{\prime} = \mathbf x - \theta^{\prime} {\mathbf v}$$, $$\theta,\theta^{\prime} \in (0,2{\mu})$$, one obtains, via Taylor expansion of the $$C^3$$ smooth $$f$$, the following identity: \begin{align} \frac{f(\mathbf x + 2{\mu}{\mathbf v}) + f(\mathbf x - 2{\mu}{\mathbf v}) - 2f(\mathbf x)}{4{\mu}^2} = {\mathbf v}^{\rm T}{\nabla^2} f(\mathbf x){\mathbf v} + \underbrace{\frac{{R}_3(\zeta) + {R}_3(\zeta^{\prime})}{4{\mu}^2}}_{O({\mu})}. \end{align} (5.1) Here, $${R}_3(\zeta), {R}_3(\zeta^{\prime}) = O({\mu}^3)$$ denote the third-order Taylor terms.
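The identity (5.1) is easy to verify numerically. In the sketch below, the function `f` is a hypothetical toy model with one interaction term (our own choice, not an example from the paper); we compare the symmetric second difference along a direction $$\mathbf v$$ with the exact quadratic form $$\mathbf v^{\rm T}{\nabla^2} f(\mathbf x)\mathbf v$$:

```python
import numpy as np

# Numerical check of identity (5.1): the symmetric second difference of f
# along a direction v is a "noisy" linear measurement of the quadratic form
# v^T (Hessian f)(x) v, the noise coming from the Taylor remainder terms.
# The model f below is a hypothetical toy example.

d, mu = 10, 1e-3

def f(x):
    # f(x) = phi_1(x_1) + phi_2(x_2) + phi_{(3,4)}(x_3, x_4)
    return x[0] ** 3 + np.sin(x[1]) + x[2] * x[3] ** 2

def hessian(x):
    # Exact (sparse, symmetric) Hessian of the toy f, for comparison.
    H = np.zeros((d, d))
    H[0, 0] = 6.0 * x[0]
    H[1, 1] = -np.sin(x[1])
    H[2, 3] = H[3, 2] = 2.0 * x[3]
    H[3, 3] = 2.0 * x[2]
    return H

rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, size=d)
v = rng.standard_normal(d)

lhs = (f(x + 2 * mu * v) + f(x - 2 * mu * v) - 2.0 * f(x)) / (4.0 * mu ** 2)
rhs = v @ hessian(x) @ v
assert abs(lhs - rhs) < 1e-4 * max(1.0, abs(rhs))
```

Since the symmetric difference cancels the odd-order Taylor terms, the discrepancy for this smooth toy $$f$$ is in fact $$O(\mu^2)$$; for a general $$C^3$$ function only the $$O(\mu)$$ bound of (5.1) is guaranteed.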
Importantly, (5.1) corresponds to a ‘noisy’ linear measurement of $${\nabla^2} f(\mathbf x)$$, i.e. $${\mathbf v}^{\rm T}{\nabla^2} f(\mathbf x){\mathbf v} = \langle{\mathbf v}{\mathbf v}^{\rm T},{\nabla^2} f(\mathbf x)\rangle$$, via the measurement matrix $${\mathbf v}{\mathbf v}^{\rm T}$$. The noise arises on account of the Taylor remainder terms. We now present a recent result for recovering sparse symmetric matrices [7] that we leverage for estimating $${\nabla^2} f(\mathbf x)$$. Recovering sparse symmetric matrices via $$\ell_1$$ minimization. Let $${\mathbf v}$$ be composed of i.i.d. sub-Gaussian entries with $$v_i = a_i/\sqrt{{m_{v}}}$$, and the $$a_i$$’s drawn in an i.i.d. manner from a distribution satisfying: $${\mathbb{E}}[a_i] = 0, \ {\mathbb{E}}[a_i^2] = 1 \ \text{and} \ {\mathbb{E}}[a_i^4] > 1.$$ (5.2) For concreteness, we will consider the following set whose elements clearly meet these moment conditions: \begin{align} {\mathscr{V}} &:= \left\{{{\mathbf v}_j \in {\mathbb{R}}^{{d}} : v_{j,q} = \left\{ \begin{array}{rl} \pm \sqrt{\frac{3}{{m_{v}}}} ; & \text{w.p} \ 1/6 \ \text{each}, \\ 0 ; & \text{w.p} \ 2/3 \end{array} \right\}; \ j=1,\dots,{m_{v}} \ \text{and} \ q=1,\dots,{{d}}}\right\}. \end{align} (5.3) Note that a symmetric Bernoulli distribution does not meet the aforementioned fourth-order moment condition. Furthermore, let $${\mathscr{M}}: {\mathbb{R}}^{d \times d} \rightarrow {\mathbb{R}}^{{m_{v}}}$$ denote a linear operator acting on square matrices, with \begin{align} {\mathscr{M}}({\mathbf{H}}) := [\langle{\mathbf v}_1{\mathbf v}_1^{\rm T},{\mathbf{H}}\rangle \cdots \langle{\mathbf v}_{{m_{v}}}{\mathbf v}_{{m_{v}}}^{\rm T},{\mathbf{H}}\rangle]^{\rm T} \quad {\mathbf{H}} \in {\mathbb{R}}^{d \times d}.
\end{align} (5.4) For an unknown symmetric matrix $${\mathbf{H}}_0 \in {\mathbb{R}}^{d \times d}$$, say we have at hand $${m_{v}}$$ linear measurements $$\mathbf y = {\mathscr{M}}({\mathbf{H}}_0) + {\mathbf n} \quad \mathbf y, {\mathbf n} \in {\mathbb{R}}^{{m_{v}}}, \parallel{{\mathbf n}}\parallel_1 \leq {\eta}.$$ (5.5) Then as shown in [7, Section C], we can recover an estimate $$\widehat{{\mathbf{H}}_0}$$ to $${\mathbf{H}}_0$$ via $$\ell_1$$ minimization, by solving: $$\widehat{{\mathbf{H}}}_0 = \underset{{\mathbf{H}}}{\mathrm{argmin}} \parallel{{\mathbf{H}}}\parallel_1 \quad \text{s.t.} \quad {\mathbf{H}}^{\rm T} = {\mathbf{H}}, \quad \parallel{\mathbf y - {\mathscr{M}}({\mathbf{H}})}\parallel_1 \leq {\eta}.$$ (5.6) Remark 5.1 Equation (5.6) was proposed in [7, Section C] for recovering sparse covariance matrices (which are positive semidefinite (PSD)) with the symmetry constraint replaced by a PSD constraint. However, as noted in the discussion in [7, Section E], one can replace the PSD constraint by a symmetry constraint, in order to recover more general symmetric matrices (which are not necessarily PSD). Remark 5.2 Note that (5.6) can be reformulated as a linear program in $$O(d^2)$$ variables and hence can be solved efficiently up to arbitrary accuracy (using, for instance, interior point methods (cf. [39])). The estimation property of (5.6) is captured in the following theorem. Theorem 5.1 [7, Theorem 3] Consider the sampling model in (5.5) with $${\mathbf v}_i$$’s satisfying (5.2), and let $$({\mathbf{H}}_0)_{{\it{\Omega}}}$$ denote the best $${K}$$ term approximation of $${\mathbf{H}}_0$$. 
Then there exist constants $$c_1, c_1^{\prime}, c_2, C_1, C_2 > 0$$ such that with probability exceeding $$1 - c_1\,{\rm e}^{-c_2 {m_{v}}}$$, the solution $$\widehat{{\mathbf{H}}}_0$$ to (5.6) satisfies $$\parallel{\widehat{{\mathbf{H}}}_0 - {\mathbf{H}}_0}\parallel_F \leq C_2 \frac{\parallel{{\mathbf{H}}_0 - ({\mathbf{H}}_0)_{{\it{\Omega}}}}\parallel_1}{\sqrt{{K}}} + C_1 {\eta},$$ (5.7) simultaneously for all (symmetric) $${\mathbf{H}}_0 \in {\mathbb{R}}^{d \times d}$$, provided $${m_{v}} > c_1^{\prime} {K} \log(d^2/{K})$$. The proof of Theorem 5.1 relies on the $$\ell_2 / \ell_1$$ RIP for sparse symmetric matrices, introduced by Chen et al. [7]: Definition 5.2 [7] For the set of symmetric $$K$$-sparse matrices, the operator $${\mathscr{B}}$$ is said to satisfy the $$\ell_2 / \ell_1$$ RIP with constants $$\gamma_1,\gamma_2 > 0$$, if for all such matrices $${\mathbf{X}}$$: \begin{equation*} (1-\gamma_1)\parallel{{\mathbf{X}}}\parallel_F \leq \parallel{{\mathscr{B}}({\mathbf{X}})}\parallel_1 \leq (1+\gamma_2)\parallel{{\mathbf{X}}}\parallel_F. \end{equation*} While the operator $${\mathscr{M}}$$ defined in (5.4) does not satisfy $$\ell_2 / \ell_1$$ RIP (since each $${\mathbf v}_i{\mathbf v}_i^{\rm T}$$ has non-zero mean), one could consider instead a set of debiased measurement matrices $${\mathbf{B}}_i := {\mathbf v}_{2i-1}{\mathbf v}_{2i-1}^{\rm T} - {\mathbf v}_{2i}{\mathbf v}_{2i}^{\rm T}$$, with $${\mathscr{B}}_i({\mathbf{X}}) := \langle{\mathbf{B}}_i,{\mathbf{X}}\rangle$$ for $$i=1,\dots,m$$. Chen et al. [7, Corollary 2] then show that the linear map $${\mathscr{B}}: {\mathbb{R}}^{d \times d} \rightarrow {\mathbb{R}}^m$$ satisfies $$\ell_2 / \ell_1$$ RIP, for $${\mathbf v}_i$$’s satisfying (5.2), provided $$m = {\it{\Omega}}(K \log(d^2/K))$$. Remark 5.3 Observe that the $$\ell_1$$ norm constraint in (5.6) arises due to the $$\ell_2 / \ell_1$$ RIP in Definition 5.2.
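To make the recovery step concrete, the following is a toy instance of the program (5.6), reformulated as a linear program and solved with `scipy.optimize.linprog`. This is only a hedged sketch: the matrix `H0`, all dimensions and the details of the LP reformulation are our own choices, and at this small scale the $${m_{v}}$$ measurements overdetermine the $$d(d+1)/2$$ unknowns, so the example illustrates the mechanics of (5.3)-(5.6) rather than the compressive regime of Theorem 5.1.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
d, m_v = 6, 80

# Unknown sparse symmetric matrix H0 (a stand-in for a sparse Hessian).
H0 = np.zeros((d, d))
H0[0, 0] = 2.0
H0[1, 2] = H0[2, 1] = 1.5
H0[4, 4] = -1.0

# Measurement vectors drawn as in (5.3): entries are +-sqrt(3/m_v)
# with probability 1/6 each, and 0 with probability 2/3.
s3 = np.sqrt(3.0 / m_v)
V = rng.choice([s3, -s3, 0.0], size=(m_v, d), p=[1 / 6, 1 / 6, 2 / 3])
y = np.array([v @ H0 @ v for v in V])       # exact measurements M(H0), cf. (5.4)

# Parametrize H by its upper triangle; <v v^T, H> is linear in these entries.
idx = [(p, q) for p in range(d) for q in range(p, d)]
n_h = len(idx)
A = np.zeros((m_v, n_h))
for j, v in enumerate(V):
    for c_, (p, q) in enumerate(idx):
        A[j, c_] = v[p] ** 2 if p == q else 2.0 * v[p] * v[q]

# LP form of (5.6): minimize ||H||_1 s.t. H symmetric, ||y - M(H)||_1 <= eta.
# Variables [h, t, s]: |h| <= t entrywise, |y - A h| <= s entrywise, sum(s) <= eta.
eta = 1e-8
c = np.concatenate([np.zeros(n_h),
                    np.array([1.0 if p == q else 2.0 for (p, q) in idx]),
                    np.zeros(m_v)])
I, Z = np.eye(n_h), np.zeros
A_ub = np.block([
    [ I,          -I,            Z((n_h, m_v))],     #  h - t <= 0
    [-I,          -I,            Z((n_h, m_v))],     # -h - t <= 0
    [ A,          Z((m_v, n_h)), -np.eye(m_v)],      #  A h - y <= s
    [-A,          Z((m_v, n_h)), -np.eye(m_v)],      #  y - A h <= s
    [Z((1, n_h)), Z((1, n_h)),   np.ones((1, m_v))]  #  sum(s) <= eta
])
b_ub = np.concatenate([np.zeros(2 * n_h), y, -y, [eta]])
bounds = [(None, None)] * n_h + [(0, None)] * (n_h + m_v)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")

H_hat = np.zeros((d, d))
for c_, (p, q) in enumerate(idx):
    H_hat[p, q] = H_hat[q, p] = res.x[c_]
assert res.status == 0 and np.allclose(H_hat, H0, atol=1e-4)
```

The symmetry constraint of (5.6) is enforced implicitly through the upper-triangular parametrization; off-diagonal entries carry weight $$2$$ in the objective so that it equals $$\parallel{\mathbf H}\parallel_1$$.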
It is unclear whether the linear map $${\mathscr{B}}$$ also satisfies the conventional $$\ell_2 /\ell_2$$ RIP.9 However, assuming it were to do so, the $$\ell_1$$ norm constraint in (5.6) could then be replaced by $$\parallel{\mathbf y - {\mathscr{M}}({\mathbf{H}})}\parallel_2 \leq {\eta}$$. In particular, it might then be possible to use faster non-convex IHT-based methods (cf. Remark 3.5). Estimating $${\mathscr{S}_2}, {\mathscr{S}_1}$$. Given the linear program defined in (5.6), we can estimate $${\nabla^2} f(\mathbf x)$$ in a straightforward manner at any fixed $$\mathbf x \in {\mathbb{R}}^d$$. Indeed, for some suitable step size $${\mu} > 0$$, we first collect the samples: $$f(\mathbf x), \left\{{f(\mathbf x - 2{\mu}{\mathbf v}_j)}\right\}_{j=1}^{{m_{v}}}, \left\{{f(\mathbf x + 2{\mu}{\mathbf v}_j)}\right\}_{j=1}^{{m_{v}}}$$, with $${\mathbf v}_j \in {\mathscr{V}}$$. Then, we form the linear system $$\mathbf y = {\mathscr{M}}({\nabla^2} f(\mathbf x)) + {\mathbf n}$$, where $$y_j = \frac{f(\mathbf x + 2{\mu}{\mathbf v}_j) + f(\mathbf x - 2{\mu}{\mathbf v}_j) - 2f(\mathbf x)}{4{\mu}^2}, \quad {n}_j = \frac{{R}_3(\zeta_j) + {R}_3(\zeta_j^{\prime})}{4{\mu}^2}, \quad j=1,\dots,{m_{v}}.$$ (5.8) Since $${\nabla^2} f(\mathbf x)$$ is at most $${k} ({\rho_{m}}+1)$$-sparse, we obtain an estimate $${\widehat{{{\nabla^2}}}} f(\mathbf x)$$ to $${\nabla^2} f(\mathbf x)$$ with $$2{m_{v}} + 1$$ queries of $$f$$ with $${m_{v}} > c_1^{\prime}{k} {\rho_{m}} \log(\frac{d^2}{{k}{\rho_{m}}})$$. Thereafter, we proceed as in Section 4, i.e. we estimate $${\nabla^2} f$$ at each $$\mathbf x \in {\chi} = \cup_{{h} \in {{\mathscr{H}}_{2}^{d}}} {\chi}({h})$$, with $${\chi}({h})$$ as defined in (3.12). Remark 5.4 Note that $${\nabla^2} f(\mathbf x)$$ actually has at most $${k} + 2{|{{{\mathscr{S}_2}}}|}$$ non-zero entries.
Therefore, if we had assumed $${|{{{\mathscr{S}_2}}}|}$$ to be known as part of our problem setup (in Section 2), then the choice $${m_{v}} > c_1^{\prime} ({k} + 2{|{{{\mathscr{S}_2}}}|}) \log(\frac{d^2}{{k} + 2{|{{{\mathscr{S}_2}}}|}})$$ would suffice for estimating $${\nabla^2} f(\mathbf x)$$. We can bound $$2{|{{{\mathscr{S}_2}}}|} \leq {k}{\rho_{m}}$$—this is also tight in the worst case—however, in certain settings, this would be pessimistic.10 Once $${\mathscr{S}_2}$$ is identified, we can simply reuse the procedure in Algorithm 3, for estimating $${\mathscr{S}_1}$$. The above discussion for identifying $${\mathscr{S}_1},{\mathscr{S}_2}$$ is formally outlined in Algorithm 4. The following theorem provides sufficient conditions on the sampling parameters in Algorithm 4, which guarantee $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$, $${\widehat{{{\mathscr{S}_1}}}} = {\mathscr{S}_1}$$ with high probability. Theorem 5.3 Let $${{\mathscr{H}}_{2}^{d}}$$ be of size $${|{{{{\mathscr{H}}_{2}^{d}}}}|} \leq 2(C + 1)\,{\rm e}^2 \log {d}$$ for some constant $$C > 1$$. Then $$\exists$$ constants $$c_1, c_1^{\prime}, c_2, C_1 > 0$$, such that the following is true. Let $${m_{x}}, {m_{v}}, {\mu}$$ satisfy $${m_{x}} \geq {\lambda}_2^{-1}, \ {m_{v}} > c_1^{\prime}{k} {\rho_{m}} \log\left(\frac{d^2}{{k}{\rho_{m}}}\right), \ {\mu} < {\frac{\sqrt{{m_{v}}}{D}_2}{2\sqrt{6} C_1 {B}_3(4{\rho_{m}}+1){k}}}.$$ (5.9) We then have for the choices $${\eta} = {\frac{2\sqrt{3}{\mu}{B}_3(4{\rho_{m}}+1){k}}{\sqrt{{m_{v}}}}}, \ {\tau} = C_1{\eta}$$ that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$ with probability at least $$1 - c_1 \,{\rm e}^{-c_2 {m_{v}}} - {d}^{-2C}$$. Given that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$, the sampling conditions for estimating $${\widehat{{{\mathscr{S}_1}}}}$$ are identical to Theorem 4.1. Query complexity. 
Estimating $${\nabla^2} f(\mathbf x)$$ at some fixed $$\mathbf x$$ requires $$2{m_{v}} + 1 = O({k}{\rho_{m}}\log(\frac{d^2}{{k}{\rho_{m}}}))$$ queries. Since $${\nabla^2} f$$ is estimated at all points in $${\chi}$$ in the worst case, this implies a total query complexity of $$O({k}{\rho_{m}}\log(\frac{d^2}{{k}{\rho_{m}}}) {|{{{\chi}}}|}) = O({\lambda}_2^{-2}{k}{\rho_{m}}(\log {d})^2)$$ for estimating $${\mathscr{S}_2}$$. As seen in Theorem 4.1, we make an additional $$O({\lambda}_1^{-1} ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) \log ({d} - {|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}))$$ queries of $$f$$ in order to estimate $${\mathscr{S}_1}$$. Therefore, the overall query complexity for estimating $${\mathscr{S}_1},{\mathscr{S}_2}$$ is $$O({\lambda}_2^{-2}{k}{\rho_{m}}(\log {d})^2)$$. Observe that this is better by a $$\log d$$ factor as compared to the sampling bound for Algorithm 3 (in the noiseless setting). Computational complexity. The family $${{\mathscr{H}}_{2}^{d}}$$ can be constructed11 in time polynomial in $$d$$. At each $$\mathbf x \in {\chi}$$, we solve a linear program (Step 11) in $$O(d^2)$$ variables, which can be done up to arbitrary accuracy in time polynomial in $$({m_{v}},d)$$. Since this is done at $${|{{{\chi}}}|} = O({\lambda}_2^{-2} \log d)$$ many points, the overall computation time for estimation of $${\mathscr{S}_2}$$ (and subsequently $${\mathscr{S}_1}$$) is polynomial in the number of queries and in $$d$$. 5.2 Analysis for noisy setting We now consider the case where at each query $$\mathbf x$$, we observe $$f(\mathbf x) + {z^{\prime}}$$, with $${z^{\prime}} \in {\mathbb{R}}$$ denoting external noise. In order to estimate $${\nabla^2} f(\mathbf x)$$, we obtain the samples: $$f(\mathbf x + 2{\mu}{\mathbf v}_j) + {z^{\prime}}_{j,1}$$, $$f(\mathbf x - 2{\mu} {\mathbf v}_j) + {z^{\prime}}_{j,2}$$ and $$f(\mathbf x) + {z^{\prime}}_{3}$$, $$j = 1,\dots,{m_{v}}$$.
This changes (5.8) to the linear system $$\mathbf y = {\mathscr{M}}({\nabla^2} f(\mathbf x)) + {\mathbf n} + {\mathbf{{z}}}$$, where $${z}_{j} = ({z^{\prime}}_{j,1} + {z^{\prime}}_{j,2} - 2{z^{\prime}}_{3})/(4{\mu}^2)$$. Arbitrary bounded noise. Assuming the external noise to be arbitrary and bounded, meaning that $${|{{{z^{\prime}}}}|} < {\varepsilon}$$, Theorem 5.4 shows that Algorithm 4 recovers $${\mathscr{S}_1},{\mathscr{S}_2}$$ with appropriate choice of sampling parameters provided $${\varepsilon}$$ is not too large. Theorem 5.4 Assuming the notation in Theorem 5.3, let $${m_{x}},{m_{v}}$$ and $${{\mathscr{H}}_{2}^{d}}$$ be as defined in Theorem 5.3. Denoting $$a = {\frac{\sqrt{6}{B}_3(4{\rho_{m}}+1){k}}{\sqrt{{m_{v}}}}}$$, say $${\varepsilon}$$ satisfies $${\varepsilon} < {\varepsilon}_1 = \frac{\sqrt{2} {D}_2^{3}}{54 a^2 C_1^3 {m_{v}}}$$ and $$\theta_1 = \cos^{-1}\left(1-\frac{2{\varepsilon}}{{\varepsilon}_1}\right)$$. Let $${\mu} \in \left(-\frac{{D}_2}{3 a C_1}\cos\left(\frac{\theta_1}{3} + \frac{\pi}{3}\right) + \frac{{D}_2}{6 a C_1}, \frac{{D}_2}{3 a C_1}\cos\left(\frac{\theta_1}{3}\right) + \frac{{D}_2}{6 a C_1} \right).$$ (5.10) We then have in Algorithm 4 for the choices $${\eta} = \left({\frac{2\sqrt{3}{\mu}{B}_3(4{\rho_{m}}+1){k}}{\sqrt{{m_{v}}}}} + \frac{{\varepsilon}{m_{v}}}{{\mu}^2}\right)$$, $${\tau} = C_1 {\eta}$$, that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$ with probability at least $$1 - c_1 \,{\rm e}^{-c_2 {m_{v}}} - {d}^{-2C}$$. Given that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$, the sampling conditions for estimating $${\widehat{{{\mathscr{S}_1}}}}$$ are identical to Theorem 4.2. Stochastic noise. We now consider i.i.d. Gaussian noise, so that $${z^{\prime}} \sim {\mathscr{N}}(0,\sigma^2)$$ for variance $$\sigma^2 < \infty$$. As in Sections 3.2 and 4.2, we reduce $$\sigma$$ via resampling and averaging. 
Doing this $$N_1$$ times in Step 10, and $$N_2$$ times during estimation of $${\mathscr{S}_1}$$, for $$N_1, N_2$$ large enough, we can recover $${\mathscr{S}_1},{\mathscr{S}_2}$$ as shown formally in the following theorem. Theorem 5.5 Assuming the notation in Theorem 5.3, let $${m_{x}},{m_{v}}$$ and $${{\mathscr{H}}_{2}^{d}}$$ be as defined in Theorem 5.3. For any $${\varepsilon} < {\varepsilon}_1 = \frac{\sqrt{2} {D}_2^{3}}{54 a^2 C_1^3 {m_{v}}}$$, $$0 < p_1 < 1$$, say we resample each query in Step 10 of Algorithm 4, $$N_1 > \frac{3\sigma^2}{4{\varepsilon}^2} \log (\frac{2}{p_1}{m_{v}}(2{m_{x}}+1)^2{|{{{{\mathscr{H}}_{2}^{d}}}}|})$$ times, and average the values. We then have in Algorithm 4 for the choices of $${\eta}$$, $${\tau}$$, $${\mu}$$ as in Theorem 5.4, that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$ with probability at least $$1 - c_1 \,{\rm e}^{-c_2 {m_{v}}} - {d}^{-2C} - p_1$$. Given that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$, the sampling conditions for estimating $${\widehat{{{\mathscr{S}_1}}}}$$ are identical to Theorem 4.2. Query complexity. We now analyze the query complexity for Algorithm 4 when the noise is i.i.d. Gaussian. For estimating $${\mathscr{S}_2}$$, we have $${\varepsilon} = O({\rho_{m}}^{-2} {k}^{-2})$$. Furthermore, $$(2{m_{x}}+1)^2 = {\lambda}_2^{-2}$$, $${|{{{{\mathscr{H}}_{2}^{d}}}}|} = O(\log d)$$, $${m_{v}} = O({k}{\rho_{m}} \log {d})$$.
Choosing $$p_1 = {d}^{-\delta}$$ for any constant $$\delta > 0$$ gives us $$N_1 = O({\rho_{m}}^4 {k}^4 \log({d}^{\delta} ({k}{\rho_{m}} \log d) {\lambda}_2^{-2} \log d)) = O({\rho_{m}}^4 {k}^4 \log {d}).$$ This means that our total sample complexity for ensuring $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$ with high probability is $${O(N_1 {k}{\rho_{m}} \log {d} {|{{{\chi}}}|})} = O({\rho_{m}}^5 {k}^5 (\log {d})^3 {\lambda}_2^{-2}).$$ Lastly, by noting the sample complexity for estimating $${\mathscr{S}_1}$$ from Theorem 4.3, we conclude that the overall sample complexity for ensuring $${\widehat{{{\mathscr{S}_1}}}} = {\mathscr{S}_1}$$ and $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$, in the presence of i.i.d. Gaussian noise, is $$O({\rho_{m}}^5 {k}^5 (\log {d})^3 {\lambda}_2^{-2})$$. Observe that this bound has a relatively worse scaling w.r.t. $${k}$$ compared to that for Algorithm 3 (derived after Theorem 4.3); specifically, by a factor of $${k}^3$$. On the other hand, the scaling w.r.t. $$d$$ is better by a logarithmic factor compared with that for Algorithm 3. 6. Learning individual components of model Recall from (2.4) the unique representation of the model: $$f(x_1,\dots,x_d) = c + \sum_{p \in {\mathscr{S}_1}}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}} \phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}}) + \sum_{q \in {\mathscr{S}_2^{\text{var}}}: {\rho}(q) > 1} \phi_{q} (x_q),$$ (6.1) where $${\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}} = \emptyset$$. Having estimated the sets $${\mathscr{S}_1}$$ and $${\mathscr{S}_2}$$, we now show how the individual univariate and bivariate functions in the model can be estimated. We will see this for the settings of noiseless as well as noisy (arbitrary bounded noise and stochastic noise) point queries. 6.1 Noiseless queries In this scenario, we obtain the exact value $$f(\mathbf x)$$ at each query $$\mathbf x \in {\mathbb{R}}^{{d}}$$.
Let us first see how each $$\phi_p$$, $$p \in {\mathscr{S}_1}$$, can be estimated. For some $$-1 = t_1 < t_2 < \dots < t_n = 1$$, consider the set $${\chi}_p := \left\{\mathbf x_i \in {\mathbb{R}}^{{d}}: (\mathbf x_i)_j = \left\{ \begin{array}{rl} t_i ; & j = p, \\ 0 ; & j \neq p \end{array} \right\} ; 1 \leq i \leq n; 1 \leq j \leq {d}\right\} \quad p \in {\mathscr{S}_1}.$$ (6.2) We obtain the samples $$\left\{{f(\mathbf x_i)}\right\}_{i=1}^{n}$$, $$\mathbf x_i \in {\chi}_p$$. Here, $$f(\mathbf x_i) = \phi_p(t_i) + C$$ with $$C$$ being a constant that depends on the other components in the model. Given the samples, one can then employ spline-based ‘quasi interpolant operators’ [14] to obtain an estimate $${\tilde{\phi}}_p :[-1,1] \rightarrow {\mathbb{R}}$$ to $$\phi_p + C$$. Construction of such operators can be found, for instance, in [14] (see also [22]). One can suitably choose the $$t_i$$’s and construct quasi interpolants that approximate any $$C^m$$ smooth univariate function with optimal $${L_{\infty}}[-1,1]$$ error rate $$O(n^{-m})$$ [14,22]. Having obtained $${\tilde{\phi}}_p$$, we then define $$\widehat{\phi}_p := {\tilde{\phi}}_p -{\mathbb{E}}_p[{\tilde{\phi}}_p] \quad p \in {\mathscr{S}_1},$$ (6.3) to be the estimate of $$\phi_p$$. The bivariate components corresponding to each $$(l,l^{\prime}) \in {\mathscr{S}_2}$$ can be estimated in a similar manner as above.
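The univariate stage just described admits a minimal numerical sketch. Here an ordinary least-squares cubic fit stands in for the spline quasi-interpolant of [14], $${\mathbb{E}}_p[\cdot]$$ is taken to be the uniform average over $$[-1,1]$$, and the model $$f$$ is a hypothetical example of our own:

```python
import numpy as np

# Sketch of the univariate stage: query f on the set chi_p of (6.2), fit the
# samples, and center the fit as in (6.3).  A least-squares cubic fit stands
# in for the spline quasi-interpolant; f below is a hypothetical model.

d, p, n = 10, 2, 25
phi_p = np.polynomial.Polynomial([0.0, 1.0, 0.0, -0.5])  # x - 0.5 x^3, zero mean

def f(x):
    # phi_p(x_p) plus other components, all of which are constant on chi_p
    # (every coordinate other than x_p is held at zero there).
    return phi_p(x[p]) + 0.3 + np.sin(x[5]) * x[7]

t = np.linspace(-1.0, 1.0, n)
samples = []
for ti in t:
    xq = np.zeros(d)              # query point from chi_p, cf. (6.2)
    xq[p] = ti
    samples.append(f(xq))

# tilde approximates phi_p + C; subtracting its average over [-1,1], as in
# (6.3), removes the unknown constant C.
tilde = np.polynomial.Polynomial.fit(t, samples, 3).convert()
avg = (tilde.integ()(1.0) - tilde.integ()(-1.0)) / 2.0
hat = tilde - avg

grid = np.linspace(-1.0, 1.0, 201)
assert np.max(np.abs(hat(grid) - phi_p(grid))) < 1e-8
```

Since the toy $$\phi_p$$ is itself a cubic with zero mean on $$[-1,1]$$, the centered fit recovers it exactly up to floating-point error; for a general $$C^m$$ component only the $$O(n^{-m})$$ rate applies.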
To this end, for some strictly increasing sequences: $$(-1 = {t^{\prime}}_1,{t^{\prime}}_2,\dots,{t^{\prime}}_{n_1} = 1)$$, $$(-1 = t_1,t_2,\dots,t_{n_1} = 1)$$, consider the set $${\chi}_{(l,l^{\prime})} := \left\{\mathbf x_{i,j} \in {\mathbb{R}}^{{d}}: (\mathbf x_{i,j})_q = \left\{ \begin{array}{rl} {t^{\prime}}_i ; & q = l, \\ t_j ; & q = l^{\prime}, \\ 0 ; & q \neq l,l^{\prime} \end{array} \right\} ; 1 \leq i,j \leq n_1; 1 \leq q \leq {d} \right\} \quad (l,l^{\prime}) \in {\mathscr{S}_2}.$$ (6.4) We then obtain the samples $$\left\{{f(\mathbf x_{i,j})}\right\}_{i,j=1}^{n_1}$$, $$\mathbf x_{i,j} \in {\chi}_{(l,l^{\prime})}$$, where \begin{align} f(\mathbf x_{i,j}) &= \phi_{(l,l^{\prime})}({t^{\prime}}_i,t_j) + \sum_{\substack{l_1:(l,l_1) \in {\mathscr{S}_2} \\ l_1 \neq {l^{\prime}}}} \phi_{(l,l_1)} ({t^{\prime}}_i,0) + \sum_{\substack{l_1:(l_1,l) \in {\mathscr{S}_2} \\ l_1 \neq {l^{\prime}}}} \phi_{(l_1,l)} (0,{t^{\prime}}_i) \nonumber \\ &+ \sum_{\substack{{l^{\prime}}_1:({l^{\prime}},{l^{\prime}}_1) \in {\mathscr{S}_2} \\ {l^{\prime}}_1 \neq l}} \phi_{({l^{\prime}},{l^{\prime}}_1)} (t_j,0) + \sum_{\substack{{l^{\prime}}_1:({l^{\prime}}_1,{l^{\prime}}) \in {\mathscr{S}_2} \\ {l^{\prime}}_1 \neq l}} \phi_{({l^{\prime}}_1,{l^{\prime}})} (0,t_j) + \phi_l({t^{\prime}}_i) + \phi_{{l^{\prime}}}(t_j) + C, \\ \end{align} (6.5) \begin{align} &= g_{(l,l^{\prime})}({t^{\prime}}_i,t_j) + C, \end{align} (6.6) with $$C$$ being a constant. (6.5) is a general expression—if, for example, $${\rho}(l) = 1$$, then the terms $$\phi_l,\phi_{(l,l_1)},\phi_{(l_1,l)}$$ will be zero. Given this, we can again obtain estimates $${\tilde{\phi}}_{(l,l^{\prime})}:[-1,1]^2 \rightarrow {\mathbb{R}}$$ to $$g_{(l,l^{\prime})} + C$$, via spline-based quasi interpolants. Let us denote $$n = n_1^2$$ to be the total number of samples of $$f$$. 
For an appropriate choice of $$({t^{\prime}}_i,t_j)$$’s, one can construct bivariate quasi interpolants that approximate any $$C^m$$ smooth bivariate function, with optimal $${L_{\infty}}[-1,1]^2$$ error rate $$O(n^{-m/2})$$ [14,22]. Subsequently, we define the final estimates $$\widehat{\phi}_{(l,l^{\prime})}$$ to $$\phi_{(l,l^{\prime})}$$ as follows. $$\widehat{\phi}_{(l,l^{\prime})} := \left\{ \begin{array}{rl} {\tilde{\phi}}_{(l,l^{\prime})} - {\mathbb{E}}_{(l,l^{\prime})}[{\tilde{\phi}}_{(l,l^{\prime})}] , & {\rho}(l), {\rho}(l^{\prime}) = 1, \\ {\tilde{\phi}}_{(l,l^{\prime})} - {\mathbb{E}}_{l}[{\tilde{\phi}}_{(l,l^{\prime})}] , & {\rho}(l) = 1, {\rho}(l^{\prime}) > 1, \\ {\tilde{\phi}}_{(l,l^{\prime})} - {\mathbb{E}}_{l^{\prime}}[{\tilde{\phi}}_{(l,l^{\prime})}] , & {\rho}(l) > 1, {\rho}(l^{\prime}) = 1, \\ {\tilde{\phi}}_{(l,l^{\prime})} - {\mathbb{E}}_{l}[{\tilde{\phi}}_{(l,l^{\prime})}] - {\mathbb{E}}_{l^{\prime}}[{\tilde{\phi}}_{(l,l^{\prime})}] + {\mathbb{E}}_{(l,l^{\prime})}[{\tilde{\phi}}_{(l,l^{\prime})}] , & {\rho}(l) > 1, {\rho}(l^{\prime}) > 1. \end{array} \right.$$ (6.7) Lastly, it remains to estimate the univariate components $$\phi_l$$ for each $$l \in {\mathscr{S}_2^{\text{var}}}$$ such that $${\rho}(l) > 1$$. As above, for some strictly increasing sequences: $$(-1 = {t^{\prime}}_1,{t^{\prime}}_2,\dots,{t^{\prime}}_{n_1} = 1)$$, $$(-1 = t_1,t_2,\dots,t_{n_1} = 1)$$, consider the set \begin{align} {\chi}_{l} &:= \Biggl\{\mathbf x_{i,j} \in {\mathbb{R}}^{{d}}: (\mathbf x_{i,j})_q = \left\{ \begin{array}{rl} {t^{\prime}}_i ; & q = l, \\ t_j ; & q \neq l \ \& \ q \in {\mathscr{S}_2^{\text{var}}}, \\ 0; & q \notin {\mathscr{S}_2^{\text{var}}}, \end{array} \right\} ; 1 \leq i,j \leq n_1; 1 \leq q \leq {d} \Biggr\}\notag\\ &\quad l \in {\mathscr{S}_2^{\text{var}}}: {\rho}(l) > 1.
\end{align} (6.8) We obtain $$\left\{{f(\mathbf x_{i,j})}\right\}_{i,j=1}^{n_1}$$, $$\mathbf x_{i,j} \in {\chi}_{l}$$ where this time \begin{align} f(\mathbf x_{i,j}) &= \phi_l({t^{\prime}}_i) + \sum_{{\rho}({l^{\prime}}) > 1, {l^{\prime}} \neq l} \phi_{{l^{\prime}}}(t_j) + \sum_{{l^{\prime}}:(l,l^{\prime}) \in {\mathscr{S}_2}} \phi_{(l,l^{\prime})}({t^{\prime}}_i,t_j) \\ \end{align} (6.9) \begin{align} &+ \sum_{{l^{\prime}}:(l^{\prime},l) \in {\mathscr{S}_2}} \phi_{(l^{\prime},l)}(t_j,{t^{\prime}}_i) + \sum_{(q,q^{\prime}) \in {\mathscr{S}_2} : q,{q^{\prime}} \neq l} \phi_{(q,q^{\prime})}(t_j,t_j) + C \\ \end{align} (6.10) \begin{align} &= g_l({t^{\prime}}_i,t_j) + C \end{align} (6.11) for a constant $$C$$. Denoting $$n = n_1^2$$ to be the total number of samples of $$f$$, we can again obtain an estimate $${\tilde{\phi}}_l(x_l,x)$$ to $$g_l(x_l,x) + C$$, with $${L_{\infty}}[-1,1]^2$$ error rate $$O(n^{-3/2})$$. Then with $${\tilde{\phi}}_l$$ at hand, we define the estimate $$\widehat{\phi}_l: [-1,1] \rightarrow {\mathbb{R}}$$ as $$\widehat{\phi}_l := {\mathbb{E}}_x[{\tilde{\phi}}_l] - {\mathbb{E}}_{(l,x)}[{\tilde{\phi}}_l] \quad l \in {\mathscr{S}_2^{\text{var}}} : {\rho}(l) > 1.$$ (6.12) The following proposition formally describes the error rates for the aforementioned estimates. Proposition 6.1 For $$C^3$$ smooth components $$\phi_p, \phi_{(l,l^{\prime})},\phi_{l}$$, let $$\widehat{\phi}_p$$, $$\widehat{\phi}_{(l,l^{\prime})}, \widehat{\phi}_{l}$$ be the respective estimates as defined in (6.3), (6.7) and (6.12). Also, let $$n$$ denote the number of queries (of $$f$$) made per component.
We then have that: $$\parallel{\widehat{\phi}_p - \phi_p}\parallel_{{L_{\infty}}[-1,1]} = O(n^{-3}) \ \forall \ p \in {\mathscr{S}_1}$$, $$\parallel{\widehat{\phi}_{(l,l^{\prime})} - \phi_{(l,l^{\prime})}}\parallel_{{L_{\infty}}[-1,1]^2} = O(n^{-3/2}) \ \forall \ (l,l^{\prime}) \in {\mathscr{S}_2}$$ and $$\parallel{\widehat{\phi}_{l} - \phi_{l}}\parallel_{{L_{\infty}}[-1,1]} = O(n^{-3/2}) \ \forall \ l \in {\mathscr{S}_2^{\text{var}}}: {\rho}(l) > 1$$. 6.2 Noisy queries We now look at the case where for each query $$\mathbf x \in {\mathbb{R}}^d$$, we obtain a noisy value $$f(\mathbf x) + {z^{\prime}}$$. Arbitrary bounded noise. We begin with the scenario where $${z^{\prime}}_i$$ is arbitrary and bounded with $$|{{z^{\prime}}_i}| < {\varepsilon}, \ \forall i$$. Since the noise is arbitrary in nature, we simply proceed as in the noiseless case, i.e. by approximating each component via a quasi-interpolant. As the magnitude of the noise is bounded by $${\varepsilon}$$, it results in an additional $$O({\varepsilon})$$ term in the approximation error rates of Proposition 6.1. To see this for the univariate case, let us denote $$Q: C({\mathbb{R}}) \rightarrow {{\mathscr{H}}}$$ to be a quasi-interpolant operator. This is a linear operator, with $$C({\mathbb{R}})$$ denoting the space of continuous functions defined over $${\mathbb{R}}$$ and $${{\mathscr{H}}}$$ denoting a univariate spline space. Consider $$u \in C^m[-1,1]$$ for some positive integer $$m$$, and let $$g:[-1,1] \rightarrow {\mathbb{R}}$$ be an arbitrary continuous function with $$\parallel{g}\parallel_{{L_{\infty}}[-1,1]} < {\varepsilon}$$. Denote $$\widehat{u} = u + g$$ to be the ‘corrupted’ version of $$u$$, and let $$n$$ be the number of samples of $$\widehat{u}$$ used by $$Q$$.
We then have by linearity of $$Q$$ that: $$\parallel{Q(\widehat{u}) - u}\parallel_{{L_{\infty}}[-1,1]} = \parallel{Q(u) + Q(g) - u}\parallel_{{L_{\infty}}[-1,1]} \leq \underbrace{\parallel{Q(u) - u}\parallel_{{L_{\infty}}[-1,1]}}_{= O(n^{-m})} + \underbrace{\parallel{Q}\parallel \parallel{g}\parallel_{{L_{\infty}}[-1,1]}}_{\leq \parallel{Q}\parallel{\varepsilon}},$$ (6.13) with $$\parallel{Q}\parallel$$ being the operator norm of $$Q$$. One can construct $$Q$$ with $$\parallel{Q}\parallel$$ bounded12 from above by a constant depending only on $$m$$. The above argument can be extended easily to the multivariate case. We state this for the bivariate case for completeness. Denote $$Q_1: C({\mathbb{R}}^2) \rightarrow {{\mathscr{H}}}$$ to be a quasi-interpolant operator, with $${{\mathscr{H}}}$$ denoting a bivariate spline space. Consider $$u_1 \in C^m[-1,1]^2$$ for some positive integer $$m$$, and let $$g_1:[-1,1]^2 \rightarrow {\mathbb{R}}$$ be an arbitrary continuous function with $$\parallel{g_1}\parallel_{{L_{\infty}}[-1,1]^2} < {\varepsilon}$$. Let $$\widehat{u}_1 = u_1 + g_1$$ and let $$n$$ be the number of samples of $$\widehat{u_1}$$ used by $$Q_1$$. We then have by linearity of $$Q_1$$ that: \begin{align} &\parallel{Q_1(\widehat{u_1}) - u_1}\parallel_{{L_{\infty}}[-1,1]^2} = \parallel{Q_1(u_1) + Q_1(g_1) - u_1}\parallel_{{L_{\infty}}[-1,1]^2}\notag\\ &\quad \leq \underbrace{\parallel{Q_1(u_1) - u_1}\parallel_{{L_{\infty}}[-1,1]^2}}_{= O(n^{-m/2})} + \underbrace{\parallel{Q_1}\parallel \parallel{g_1}\parallel_{{L_{\infty}}[-1,1]^2}}_{\leq \parallel{Q_1}\parallel{\varepsilon}}, \end{align} (6.14) with $$\parallel{Q_1}\parallel$$ being the operator norm of $$Q_1$$. As for the univariate case, one can construct $$Q_1$$ with $$\parallel{Q_1}\parallel$$ bounded12 from above by a constant depending only on $$m$$.
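The stability argument behind (6.13) can be checked numerically. In the sketch below, a least-squares cubic fit on a fixed design plays the role of the linear operator $$Q$$ (our own stand-in for a quasi-interpolant, chosen for illustration only), and the perturbation $$g$$ is a fixed bounded function:

```python
import numpy as np

# Numerical illustration of (6.13): for a *linear* approximation operator Q,
# a perturbation g with sup|g| <= eps shifts the output by at most ||Q||*eps
# on top of the noiseless approximation error.  A least-squares cubic fit on
# a fixed design stands in for the quasi-interpolant Q.

m, n, eps = 3, 50, 1e-3
t = np.linspace(-1.0, 1.0, n)
u = lambda s: s ** 3 - 0.2 * s        # u in C^m[-1,1]; itself a cubic here
g = lambda s: eps * np.cos(5.0 * s)   # bounded perturbation, sup|g| <= eps

def Q(values):
    # The linear operator: degree-m least-squares fit on the fixed design t.
    return np.polynomial.Polynomial.fit(t, values, m).convert()

grid = np.linspace(-1.0, 1.0, 401)
err_clean = np.max(np.abs(Q(u(t))(grid) - u(grid)))        # the O(n^{-m}) term
err_noisy = np.max(np.abs(Q(u(t) + g(t))(grid) - u(grid)))

assert err_clean < 1e-12                    # a cubic is reproduced exactly
assert err_noisy < err_clean + 10.0 * eps   # extra error is <= ||Q|| * eps
```

The factor $$10$$ in the final assertion is a generous stand-in for the operator norm $$\parallel Q \parallel$$ of this particular fit, which is in fact much closer to $$1$$.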
Let us define our final estimates $$\widehat{\phi}_p$$, $$\widehat{\phi}_{(l,l^{\prime})}$$ and $$\widehat{\phi}_{l}$$ as in (6.3), (6.7) and (6.12), respectively. The following proposition formally states the error bounds for this particular noise model. Proposition 6.2 (Arbitrary bounded noise) For $$C^3$$ smooth components $$\phi_p, \phi_{(l,l^{\prime})},\phi_{l}$$, let $$\widehat{\phi}_p$$, $$\widehat{\phi}_{(l,l^{\prime})}, \widehat{\phi}_{l}$$ be the respective estimates as defined in (6.3), (6.7) and (6.12). Also, let $$n$$ denote the number of noisy queries (of $$f$$) made per component, with the external noise magnitude being bounded by $${\varepsilon}$$. We then have that $$\parallel{\widehat{\phi}_p - \phi_p}\parallel_{{L_{\infty}}[-1,1]} = O(n^{-3}) + O({\varepsilon}) \ \forall p \in {\mathscr{S}_1}$$, $$\parallel{\widehat{\phi}_{(l,l^{\prime})} - \phi_{(l,l^{\prime})}}\parallel_{{L_{\infty}}[-1,1]^2} = O(n^{-3/2}) + O({\varepsilon}) \ \forall (l,l^{\prime}) \in {\mathscr{S}_2}$$ and $$\parallel{\widehat{\phi}_{l} - \phi_{l}}\parallel_{{L_{\infty}}[-1,1]} = O(n^{-3/2}) + O({\varepsilon}) \ \forall l \in {\mathscr{S}_2^{\text{var}}}: {\rho}(l) > 1$$. The proof is similar to that of Proposition 6.1 and hence skipped. Stochastic noise. We now consider the setting where the $${z^{\prime}}_i \sim {\mathscr{N}}(0,\sigma^2)$$ are i.i.d. Gaussian random variables. As in the noiseless case, estimating the individual components again involves sampling $$f$$ along the subspaces corresponding to $${\mathscr{S}_1}$$, $${\mathscr{S}_2}$$. Because of the presence of stochastic noise, however, we now make use of non-parametric regression techniques to compute the estimates. While there exist a number of methods that could be used for this purpose (cf. [52]), we only discuss a specific one for clarity of exposition. To elaborate, we again construct the sets defined in (6.2), (6.4) and (6.8).
In particular, we uniformly discretize the domains $$[-1,1]$$ and $$[-1,1]^2$$, by choosing the respective $$t_i$$’s and $$({t^{\prime}}_i,t_j)$$’s accordingly. This is the so-called ‘fixed design’ setting in non-parametric statistics. Upon collecting the samples $$\left\{{f(\mathbf x_i) + {z^{\prime}}_i}\right\}_{i=1}^{n}$$, one can then derive estimates $${\tilde{\phi}}_p$$, $${\tilde{\phi}}_{(l,l^{\prime})}, {\tilde{\phi}}_l$$ of $$\phi_p + C$$, $$g_{(l,l^{\prime})} + C$$ and $$g_l + C$$, respectively, by using local polynomial estimators (cf. [17,52] and references within). It is known that these estimators achieve the minimax-optimal $${L_{\infty}}$$ error rate $$(n^{-1} \log n)^{\frac{m}{2m+{d}}}$$ for estimating $$d$$-variate, $$C^m$$ smooth functions over compact domains. Translated to our setting, we then have that $$\phi_p + C$$ is estimated at the rate $$O((n^{-1} \log n)^{\frac{3}{7}})$$, while $$g_{(l,l^{\prime})} + C$$ and $$g_l + C$$ are estimated at the rate $$O((n^{-1} \log n)^{\frac{3}{8}})$$. Given these intermediate estimates $${\tilde{\phi}}_{p}$$, $${\tilde{\phi}}_{(l,l^{\prime})}$$, $${\tilde{\phi}}_l$$, we define our final estimates $$\widehat{\phi}_p$$, $$\widehat{\phi}_{(l,l^{\prime})}$$ and $$\widehat{\phi}_{l}$$ as in (6.3), (6.7) and (6.12), respectively. The following proposition describes the error rates of these estimates. Proposition 6.3 (i.i.d. Gaussian noise) For $$C^3$$ smooth components $$\phi_p, \phi_{(l,l^{\prime})},\phi_{l}$$, let $$\widehat{\phi}_p$$, $$\widehat{\phi}_{(l,l^{\prime})}, \widehat{\phi}_{l}$$ be the respective estimates as defined in (6.3), (6.7) and (6.12). Let $$n$$ denote the number of noisy queries (of $$f$$) made per component, with noise samples $${z^{\prime}}_1,{z^{\prime}}_2,\dots,{z^{\prime}}_n$$ being i.i.d. Gaussian. Furthermore, let $${\mathbb{E}}_{z}[\cdot]$$ denote expectation w.r.t.
the joint distribution of $${z^{\prime}}_1,{z^{\prime}}_2,\dots,{z^{\prime}}_n$$. We then have that $${\mathbb{E}}_{z}[\parallel{\widehat{\phi}_p - \phi_p}\parallel_{{L_{\infty}}[-1,1]}] = O((n^{-1} \log n)^{\frac{3}{7}}) \ \forall p \in {\mathscr{S}_1}$$, $${\mathbb{E}}_{z}[\parallel{\widehat{\phi}_{(l,l^{\prime})} - \phi_{(l,l^{\prime})}}\parallel_{{L_{\infty}}[-1,1]^2}] = O((n^{-1} \log n)^{\frac{3}{8}}) \ \forall (l,l^{\prime}) \in {\mathscr{S}_2}$$ and $${\mathbb{E}}_{z}[\parallel{\widehat{\phi}_{l} - \phi_{l}}\parallel_{{L_{\infty}}[-1,1]}] = O((n^{-1} \log n)^{\frac{3}{8}}) \ \forall l \in {\mathscr{S}_2^{\text{var}}}: {\rho}(l) > 1$$. 7. Simulation results We now provide some simulation results for our methods on synthetic examples. The main goal of our experiments is to provide a proof of concept, validating some of the theoretical results derived earlier. We consider both the non-overlapping (Section 7.1) and the overlapping (Section 7.2) settings. In our experiments, we use the ALPS algorithm [28]—an efficient first-order method—as our CS solver. Starting with the non-overlapping case, we present phase transition results and show how the number of samples needed for recovery of $${\mathscr{S}_1}$$, $${\mathscr{S}_2}$$ depends on $$d$$. We then empirically demonstrate the dependence of the number of samples on $$k$$. In both cases, our findings support our theory for sample complexities. We conduct similar experiments for the overlapping case and additionally demonstrate empirically the dependence of the number of samples on the parameter $$\rho_m$$. 7.1 Non-overlapping setting We consider the following experimental setup: $${\mathscr{S}_1} = \left\{{1, 2}\right\}$$ and $${\mathscr{S}_2} = \left\{{(3, 4), (5, 6)}\right\}$$, which implies $${k_1} = 2$$, $${k_2} = 2$$ and $${k} = 6$$.
Moreover, we consider three different types of $$f$$, namely: (i) $$f_1(\mathbf x) = 2x_1 - 3x_2^2 + 4x_3x_4 - 5x_5x_6$$, (ii) $$f_2(\mathbf x) = 10 \sin(\pi \cdot x_1) + 5 \,{\rm e}^{-2x_2} + 10 \sin(\pi \cdot x_3 x_4) + 5 \,{\rm e}^{-2x_5 x_6}$$, (iii) $$f_3(\mathbf x) = \frac{10}{3} \cos(\pi \cdot x_1) + 8 x_1^2 + 5 ( x_2^4 - x_2^2 + \frac{4}{5} x_2) + \frac{10}{3} \cos(\pi \cdot x_3x_4) + 8 (x_3x_4)^2 + 5 ( (x_5x_6)^4 - (x_5x_6)^2 + \frac{4}{5} x_5x_6)$$. For all cases, we use Algorithms 1 and 2. For $$f_1$$, the problem parameters are set to $${\lambda}_1 = 0.3$$, $${\lambda}_2 = 1$$, $${D}_1 = 2$$, $${D}_2 = 3$$, $${B}_3 = 6$$, while for $$f_2, f_3$$: $${\lambda}_1 = {\lambda}_2 = 0.3$$, $${D}_1 = 8$$, $${D}_2 = 4$$, $${B}_3 = 35$$. Given these constants, we obtain $${m_{x}} = 1$$, $${m^{\prime}_{x}} = 4$$ for $$f_1$$ and $${m_{x}} = {m^{\prime}_{x}} = 4$$ for $$f_2, f_3$$. For a constant $$\widetilde{C}$$ (specified below), we set $${m_{v}} := \widetilde{C} {k} \log\left({d}/{k}\right)$$. For the construction of the hash functions, we set the size to $$|{{\mathscr{H}}_{2}^{d}}| = C^{\prime} \log d$$ with $$C^{\prime} = 1.7$$, leading to $$|\mathscr{H}_2^d| \in [8, 12]$$ for $$10^2 \leq d \leq 10^3$$. For the noiseless setting, we choose the step sizes $${\mu}, {\mu_1}, {\beta}$$ and thresholds $${\tau^{\prime}}, {\tau}$$ as in Lemmas 3.1 and 3.2. For the noisy setting, we consider the function values to be corrupted with i.i.d. Gaussian noise. We reduce the noise variance by repeating each query $$N_1$$ and $$N_2$$ times, respectively, and averaging.
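The variance-reduction step above is standard: averaging $$N$$ repeated queries at the same point shrinks the effective noise standard deviation from $$\sigma$$ to $$\sigma/\sqrt{N}$$. A minimal sketch (our own illustration, not the paper's code), using the pairing $$\sigma^2 = 10^{-4}$$ with $$N_1 = 40$$ from the experiments:

```python
import numpy as np

# Averaging N repeated noisy queries of f at the same point x reduces the
# effective noise standard deviation from sigma to sigma / sqrt(N).
rng = np.random.default_rng(0)

def averaged_query(f_x, sigma, N):
    # f_x is the noiseless value f(x); each repeat adds fresh N(0, sigma^2) noise.
    return np.mean(f_x + sigma * rng.standard_normal(N))

sigma, N = 1e-2, 40   # sigma^2 = 10^-4 with N_1 = 40 repeats, as for f_1
trials = np.array([averaged_query(0.0, sigma, N) for _ in range(20000)])
print(trials.std())   # close to sigma / sqrt(N)
```
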
The noise variance values considered are $$\sigma^2 \in \left\{10^{-4}, 10^{-3}, 10^{-2}\right\}$$, for which we choose: \begin{align*} (N_1,N_2) &\in \left\{{(40,15), (75,31), (80,35)}\right\} \quad \text{for} \quad f_1, \\ (N_1,N_2) &\in \left\{{(60,30), (85,36), (90,40)}\right\} \quad \text{for} \quad f_2, \\ \text{and} \quad (N_1,N_2) &\in \left\{{(59,30), (85,35), (90,40)}\right\} \quad \text{for} \quad f_3. \end{align*} Moreover, we now choose the parameters $${\mu}, {\mu_1}, {\beta}, {\tau^{\prime}}, {\tau}$$ as in Theorem 3.3. Dependence on $${d}$$. We see in Fig. 3 that, for $$\widetilde{C} \approx 3.8$$, the probability of successful identification (noiseless case) undergoes a phase transition and becomes close to $$1$$, for different values of $${d}$$. This validates the statements of Lemmas 3.1–3.2. Fixing $$\widetilde{C} = 3.8$$, we then see that successful identification is achieved with the total number of queries growing only slowly with $${d}$$. For the noisy case, the total number of queries is roughly $$10^2$$ times that in the noiseless setting; however, the scaling with $${d}$$ is similar to that for the noiseless case. For the function models $$f_2$$ and $$f_3$$, the number of queries is slightly larger than that for $$f_1$$ in the noisy settings; this becomes more obvious in the overlapping case later on. Fig. 3. First (respectively, second and third) column is for $$f_1$$ (respectively, $$f_2$$ and $$f_3$$). Top row depicts the success probability of identifying exactly $${\mathscr{S}_1},{\mathscr{S}_2}$$ in the noiseless case; the $$x$$-axis represents the constant $$\widetilde{C}$$. The bottom row depicts total queries versus $${d}$$ for exact recovery, with $$\widetilde{C} = 3.8$$ and various noise settings. All results are over five independent Monte Carlo trials. Dependence on $${k}$$. We now demonstrate the scaling of the total number of queries versus the sparsity $${k}$$ for identification of $${\mathscr{S}_1},{\mathscr{S}_2}$$. Consider the model \begin{align} f(\mathbf x) &= \sum_{i = 1}^T \Big(\alpha_1 \mathbf x_{(i-1)6 + 1} - \alpha_2\mathbf x_{(i-1)6 + 2}^2 + \alpha_3 \mathbf x_{(i-1)6 + 3} \mathbf x_{(i-1)6 + 4} - \alpha_4 \mathbf x_{(i-1)6 + 5} \mathbf x_{(i-1)6 + 6}\Big), \end{align} (7.1) where $$\mathbf x \in {\mathbb{R}}^{{d}}$$ with $$d = 500$$. Here, $$\alpha_i \in [2, 5] \ \forall i$$; i.e. we randomly selected the $$\alpha_i$$’s within this range and kept the values fixed for all five Monte Carlo iterations. Note that the sparsity is $${k} = 6T$$; we consider $$T \in \left\{1, 2, \dots, 10\right\}$$. We set $${\lambda}_1 = 0.3$$, $${\lambda}_2 = 1$$, $${D}_1 = 2$$, $${D}_2 = 3$$, $${B}_3 = 6$$ and $$\widetilde{C} = 3.8$$, i.e. the same setting as for model $$f_1$$ above. For the noisy cases, we consider $$\sigma^2$$ as before and choose the same values for $$(N_1,N_2)$$ as for $$f_1$$. In Fig. 4, we see that the number of queries scales as $$\sim {k} \log({d}/{k})$$ and is roughly $$10^2$$ times more in the noisy case when compared with the noiseless setting. Fig. 4. Total number of queries versus different sparsity values $$k$$, for (7.1). This is for both the noiseless and noisy cases (i.i.d. Gaussian) with variances $$\sigma^2 \in \left\{10^{-4}, 10^{-3}, 10^{-2}\right\}$$. 7.2 Overlapping setting For the overlapping case, we set $${\mathscr{S}_1} = \left\{1, 2 \right\}$$ and $${\mathscr{S}_2} = \{(3, 4), (4, 5)\}$$, which implies $${k_1} = 2$$, $${k_2} = 2$$, $${\rho_{m}} = 2$$ and $${k} = 5$$. Because of the overlap between the elements of $${\mathscr{S}_2}$$, we now employ Algorithm 3 for identifying $${\mathscr{S}_1},{\mathscr{S}_2}$$. Remark 7.1 We deliberately avoid using Algorithm 4 on account of Remark 5.3: it is unclear to us whether IHT-based methods could be employed for solving (5.6) with provable recovery guarantees. Although we could instead use standard interior point solvers, they would be slow, especially for the range of values of the dimension $$d$$ that we consider. For an easier comparison with the non-overlapping case, we consider models similar to those of the previous subsection; observe that there are now common variables across the components of $$f$$: (i) $$f_1(\mathbf x) = 2x_1 - 3x_2^2 + 4x_3x_4 - 5x_4x_5$$, (ii) $$f_2(\mathbf x) = 10 \sin(\pi \cdot x_1) + 5 \,{\rm e}^{-2x_2} + 10 \sin(\pi \cdot x_3 x_4) + 5 \,{\rm e}^{-2x_4 x_5}$$, (iii) $$f_3(\mathbf x) = \frac{10}{3} \cos(\pi \cdot x_1) + 8 x_1^2 + 5 ( x_2^4 - x_2^2 + \frac{4}{5} x_2) + \frac{10}{3} \cos(\pi \cdot x_3x_4) + 8 (x_3x_4)^2 + 5 ( (x_4x_5)^4 - (x_4x_5)^2 + \frac{4}{5} x_4x_5)$$. The parameters $${\lambda}_1, {\lambda}_2, {D}_1, {D}_2, {B}_3$$ are set as in the previous subsection. For a constant $$\widetilde{C}$$ (chosen later), we set $${m_{v}} := \widetilde{C} {k} \log\left({d}/{k}\right)$$, $${m_{v^{\prime}}} := \widetilde{C} {\rho_{m}} \log({d}/{\rho_{m}})$$ and $${m_{v^{\prime\prime}}} := \widetilde{C} ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) \log(\frac{|{{\mathscr{P}}}|}{{k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}})$$.
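For concreteness, the measurement-vector counts implied by these formulas can be computed directly. A small sketch with illustrative values ($$d = 500$$, $$k = 5$$, $$\rho_m = 2$$ as in this subsection; $$\widetilde{C} = 5.6$$; rounding up to an integer is our assumption):

```python
import math

# Measurement-vector counts from m_v := C~ k log(d/k) and
# m_v' := C~ rho_m log(d/rho_m); parameter values are illustrative.
d, k, rho_m, C_tilde = 500, 5, 2, 5.6

m_v       = math.ceil(C_tilde * k * math.log(d / k))          # ~ k log(d/k)
m_v_prime = math.ceil(C_tilde * rho_m * math.log(d / rho_m))  # ~ rho_m log(d/rho_m)
print(m_v, m_v_prime)   # 129 62
```

Both counts are small compared with $$d = 500$$, which is the point of the compressed sensing approach.
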
The size of the hash family $${|{{{{\mathscr{H}}_{2}^{d}}}}|}$$ for different values of $$d$$ is set as before in the non-overlapping setting. For the noiseless setting, we choose the step sizes $${\mu}, {\mu_1}, {\mu^{\prime}}$$ and thresholds $${\tau^{\prime}}, {\tau^{\prime\prime}}$$ as in Theorem 4.1. For the noisy setting, we consider the function values to be corrupted with i.i.d. Gaussian noise. We reduce the noise variance by repeating each query $$N_1$$ and $$N_2$$ times, respectively, and averaging. The noise variance values considered are $$\sigma^2 \in \left\{10^{-4}, 10^{-3}, 10^{-2}\right\}$$, for which we choose: \begin{align*} (N_1,N_2) &\in \left\{{(50,20), (85,36), (90,40)}\right\} \quad \text{for} \quad f_1, \\ (N_1,N_2) &\in \left\{{(60,30), (90,40), (95,43)}\right\} \quad \text{for} \quad f_2, \\ \text{and} \quad (N_1,N_2) &\in \left\{{(59,30), (89,40), (93,43)}\right\} \quad \text{for} \quad f_3. \end{align*} Moreover, we now choose the parameters $${\mu}, {\mu_1}, {\mu^{\prime}}, {\tau^{\prime}}, {\tau^{\prime\prime}}$$ as in Theorem 4.2. Dependence on $${d}$$. We see in Fig. 5 that, for $$\widetilde{C} \approx 5.6$$, the probability of successful identification (noiseless case) undergoes a phase transition and becomes close to $$1$$, for different values of $${d}$$, as in the non-overlapping case. This validates the statement of Theorem 4.1. As in the non-overlapping case, in the presence of noise, the total number of queries is roughly $$10^2$$ times that in the noiseless setting; however, the scaling with $${d}$$ is similar to that for the noiseless setting. Fig. 5. First (respectively, second and third) column is for $$f_1$$ (respectively, $$f_2$$ and $$f_3$$). Top row depicts the success probability of identifying exactly $${\mathscr{S}_1},{\mathscr{S}_2}$$ in the noiseless case; the $$x$$-axis represents the constant $$\widetilde{C}$$.
The bottom row depicts total queries versus $${d}$$ for exact recovery, with $$\widetilde{C} = 5.6$$ and various noise settings. All results are over five independent Monte Carlo trials. Dependence on $${k}$$. We now demonstrate the scaling of the total number of queries versus the sparsity $${k}$$ for identification of $${\mathscr{S}_1},{\mathscr{S}_2}$$. Consider the model \begin{align} f(\mathbf x) &= \sum_{i = 1}^T \Big(\alpha_1 \mathbf x_{(i-1)5 + 1} - \alpha_2\mathbf x_{(i-1)5 + 2}^2 + \alpha_3 \mathbf x_{(i-1)5 + 3} \mathbf x_{(i-1)5 + 4} - \alpha_4 \mathbf x_{(i-1)5 + 4} \mathbf x_{(i-1)5 + 5}\Big), \end{align} (7.2) where $$\mathbf x \in {\mathbb{R}}^{{d}}$$ with $$d = 500$$. Here, $$\alpha_i \in [2, 5] \ \forall i$$; i.e. we randomly selected the $$\alpha_i$$’s within this range and kept the values fixed for all five Monte Carlo iterations. Note that $${\rho_{m}} = 2$$ and the sparsity is $${k} = 5T$$; we consider $$T \in \left\{1, 2, \dots, 10\right\}$$. We set $${\lambda}_1 = 0.3$$, $${\lambda}_2 = 1$$, $${D}_1 = 2$$, $${D}_2 = 3$$, $${B}_3 = 6$$ and $$\widetilde{C} = 5.6$$. For the noisy cases, we consider $$\sigma^2$$ as before and choose the same values for $$(N_1,N_2)$$ as for $$f_1$$. In Fig. 6 (left panel), we again see that the number of queries scales as $$\sim {k} \log({d}/{k})$$ and is roughly $$10^2$$ times more in the noisy case when compared with the noiseless setting. Fig. 6. Left panel: total number of queries versus different sparsity values $$k$$, for (7.2). Right panel: total number of queries versus $${\rho_{m}}$$ for (7.3). This is for both the noiseless and noisy cases (i.i.d. Gaussian) with variances $$\sigma^2 \in \left\{10^{-4}, 10^{-3}, 10^{-2}\right\}$$. Dependence on $${\rho_{m}}$$. We now demonstrate the scaling of the total queries versus the maximum degree $${\rho_{m}}$$ for identification of $${\mathscr{S}_1},{\mathscr{S}_2}$$. Consider the model \begin{align} f(\mathbf x) = \alpha_1 \mathbf x_{1} - \alpha_2\mathbf x_{2}^2 + \sum_{i = 1}^{T} \left(\alpha_{3,i} \mathbf x_{3} \mathbf x_{i+3}\right) + \sum_{i=1}^{5}\left(\alpha_{4,i}\mathbf x_{2+2i}\mathbf x_{3+2i} \right). \end{align} (7.3) We choose $$d = 500$$, $$\widetilde{C} = 6$$, $$\alpha_i \in [2, 5] \ \forall i$$ (as earlier) and set $${\lambda}_1 = 0.3$$, $${\lambda}_2 = 1$$, $${D}_1 = 2$$, $${D}_2 = 3$$, $${B}_3 = 6$$. For $$T \geq 2$$, we have $${\rho_{m}} = T$$; we choose $$T \in \left\{{2,3,\dots,10}\right\}$$. Also note that $${k} = 13$$ throughout. For the noisy cases, we consider $$\sigma^2$$ as before and choose $$(N_1,N_2) \in \left\{{(70,40), (90,50), (100,70)}\right\}$$. In Fig. 6 (right panel), we see that the number of queries scales as $$\sim {\rho_{m}} \log({d}/{\rho_{m}})$$ and is roughly $$10^2$$ times more in the noisy case when compared with the noiseless setting. 8. Discussion We now provide a more detailed discussion of related work, starting with results for learning SPAMs. Learning SPAMs. Ravikumar et al. [45] and Meier et al.
[33] proposed methods based on a least squares loss regularized with sparsity and smoothness constraints. Ravikumar et al. show their method to be sparsistent for second-order Sobolev smooth $$f$$, and one can obtain a rough estimate of how the number of samples $$n$$ behaves with respect to $${k}, d$$. Indeed, from Corollary 1 of Theorem 2 in [45], we see that the probability of incorrect identification of $${\mathscr{S}}$$ approximately scales as: $$\frac{\log d}{(\log n)^2} + \frac{{k}}{\log n} + \frac{\log d\sqrt{{k}}}{n^{1/6}}$$. This means that $$n$$ roughly scales as $$\max\{{k}^3 (\log d)^6, \,{\rm e}^{{k}}, \,{\rm e}^{\sqrt{\log d}}\}$$ for a constant probability of error. In contrast, our $$O(k^2 (\log d)^2)$$ bound (recall Theorem 4.3) has a clearly better scaling. Meier et al. [33] derive error rates of $$O({k} (\log d / n)^{2/5})$$ for estimating $$C^2$$ smooth $$f$$ in the empirical $$L_2(\mathbb{P}_n)$$ norm. They also give conditions under which their method is guaranteed to recover $${\widehat{{{\mathscr{S}}}}} \subset {\mathscr{S}}$$. Huang et al. [23] proposed a method based on the adaptive group Lasso and show that it is sparsistent. In contrast to [45], it is unclear here how exactly $$n$$ scales with $${k}, d$$. They also derive $$L_2$$ error rates for estimating the individual components of the SPAM. Wahl [57] considers the variable selection problem for SPAMs. They propose an estimator that essentially involves looking at all subsets of $$\{1,\dots,d\}$$ of size $${k}$$ and is hence practically infeasible. They show that for the periodic Sobolev class of functions (with smoothness parameter $$\alpha > 1/2$$), their estimator recovers $${\mathscr{S}}$$ w.h.p. with $$O({k}^{\frac{2\alpha+1}{2\alpha}} (\log d)^4)$$ samples [57, Corollary 3]. Consequently, they are also able to estimate each individual component of the model in the $$L_2(\mathbb{P})$$ norm.
We observe that the dependency of their bound on $$d$$ is worse than ours by a factor of $$(\log d)^2$$; however, the scaling with $$k$$ is better for all $$\alpha > 1/2$$. Learning generalized SPAMs. Radchenko & James [42] proposed the VANISH algorithm—a least squares method with sparsity constraints. Assuming $$f$$ to be second-order Sobolev smooth, they show their method to be sparsistent. They also show a consistency result for estimating $$f$$, similar to [45]. One can obtain a rough estimate of how their sampling bounds scale with $$d,{|{{{\mathscr{S}_1}}}|},{|{{{\mathscr{S}_2}}}|}$$ for exact identification of $${\mathscr{S}_1},{\mathscr{S}_2}$$. Denoting by $$m = {|{{{\mathscr{S}_1}}}|} + {|{{{\mathscr{S}_2}}}|}$$ and by $$n$$ the number of samples, we see from Corollary 1 of [42, Theorem 2] that the probability of failure, i.e. incorrect identification of $${\mathscr{S}_1},{\mathscr{S}_2}$$, approximately scales as $$\frac{\sqrt{m}}{\log n} + \frac{(\log d)^3}{n^{3/5}}$$. This implies that $$n$$ roughly scales as $$\max\{e^{m}, (\log d)^5\}$$ for a constant probability of error. In contrast, as seen from Theorems 4.3 and 5.5, our bounds are polynomial in $$m$$ and have a better scaling with the dimension $$d$$. Dalalyan et al. [13] studied a generalization of (1.1) that allows for the presence of a sparse number ($$m$$) of $$s$$-wise interaction terms, for some additional sparsity parameter $$s$$. Specifically, they studied this in the Gaussian white noise model. Assuming $$f$$ to lie in a Sobolev space with smoothness parameters $$\beta,L > 0$$, and for some $$\epsilon \in (0,1)$$, they derive a non-asymptotic $$L_2$$ error rate (in expectation) of $$\max\{m L^{\frac{s}{2\beta+s}}\epsilon^{\frac{4\beta}{2\beta+s}}, ms\epsilon^2 \log(d/(sm^{1/s}))\}$$, which is also shown to be minimax optimal. However, they do not guarantee unique identification of the interaction terms for any value of $$s$$.
Furthermore, the computational complexity of their estimator is exponential in $$d,s,m$$, although they discuss possible ways to reduce this complexity. The above model was also recently studied by Yang et al. [61]; they consider Bayesian estimation of $$f$$ in the Gaussian process (GP) setting, wherein a GP prior is placed on $$f$$, and inference on $$f$$ is carried out by summarizing the resulting posterior distribution given the data. They derived minimax estimation rates for Hölder smooth $$f$$ in the $$L_2$$ norm, along with a method that nearly achieves the optimal estimation rate (modulo some log factors) in the empirical $$L_2(\mathbb{P}_n)$$ norm. However, they do not guarantee unique identification of the interaction terms. Suzuki [50] studied a special case where $$[d]$$ is pre-divided into $$m$$ disjoint subsets, with an additive component defined on each subset. Assuming a sparse number of components, they derived PAC-Bayesian bounds for estimation of $$f$$ in the $$L_2(\mathbb{P}_n)$$ norm. A special case of (1.1)—where the $$\phi_p$$’s are linear and each $$\phi_{(l,l^{\prime})}$$ is of the form $$x_l x_{{l^{\prime}}}$$—has been studied considerably. Within this setting, there exist algorithms that recover $${\mathscr{S}_1},{\mathscr{S}_2}$$, along with convergence rates for estimating $$f$$, in the limit of large $$n$$ [3,8,42]. Kekatos & Giannakis [24] show that exact recovery is possible (w.h.p.) via $$\ell_1$$ minimization with $$O(({|{{{\mathscr{S}_1}}}|} + {|{{{\mathscr{S}_2}}}|}) (\log d)^4)$$ noiseless point queries. This is based on the RIP for structured random matrices, as developed in [44]. Nazer & Nowak [37] generalized this to the setting of sparse multilinear systems—albeit in the noiseless setting—and derived non-asymptotic sampling bounds for identifying the interaction terms via $$\ell_1$$ minimization.
Upon translating Theorem 1.1 from their article into our setting, with general overlap (so $${\rho_{m}} \geq 1$$), we obtain a sample complexity of $$O(({|{{{\mathscr{S}_1}}}|} + {|{{{\mathscr{S}_2}}}|})^2 \log(d/({|{{{\mathscr{S}_1}}}|} + {|{{{\mathscr{S}_2}}}|}))) = O({k}^2{\rho_{m}}^2 \log(d/({k}{\rho_{m}})))$$. On the other hand, for the case of no overlap, their sample complexity turns out to be $$O(({|{{{\mathscr{S}_1}}}|} + {|{{{\mathscr{S}_2}}}|}) \log(d/({|{{{\mathscr{S}_1}}}|} + {|{{{\mathscr{S}_2}}}|}))) = O({k}\log(d/{k}))$$ for recovering $${\mathscr{S}_1},{\mathscr{S}_2}$$ w.h.p. However, finite sample bounds for the nonlinear model (1.1) are not known in general. We also note that it is common in the statistics literature to impose a heredity constraint on the interactions, wherein an interaction term is present only if the corresponding main effect terms (i.e. those in $${\mathscr{S}_1}$$) are present (cf. [3,8,42]). This is typically done to make the model interpretable, as interaction terms are harder to interpret than main effect terms. Other low-dimensional function models. We now provide a comparison with existing work related to other low-dimensional models from the literature, starting with the approximation theoretic setting. DeVore et al. [15] consider functions depending on a small subset $${\mathscr{S}}$$ of the variables. The functions do not necessarily possess an additive structure; thus, the setting is more general than (1.1). They provide algorithms that recover $${\mathscr{S}}$$ exactly w.h.p., with $$O(c^{{k}} {k} \log d)$$ noiseless queries of $$f$$, for some constant $$c > 1$$. Their methods essentially make use of a $$({d},{k})$$-hash family $${\mathscr{H}_{k}^{d}}$$ (cf. Definition 3.2) for constructing their sampling sets; while these methods could be used for identifying $${\mathscr{S}}$$, the sample complexity would be exponential in $${k}$$.
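To make the hash-family idea concrete: a $$({d},{k})$$-hash family is, roughly, a set of maps $$h:[d] \rightarrow [k]$$ such that every $$k$$-subset of $$[d]$$ is perfectly separated (each element assigned a distinct value) by at least one member; we do not reproduce Definition 3.2 exactly here. A sketch of the standard random construction, under which a uniform map separates a fixed $$k$$-subset with probability $$k!/k^k \approx e^{-k}$$, which is one way to see where a $$c^{k}$$ factor in such bounds comes from:

```python
import math
import random

# Random hash-family sketch (our illustration): roughly e^k * k * log d
# uniform maps [d] -> [k] suffice for every fixed k-subset to be
# perfectly hashed by some member with high probability.
random.seed(1)
d, k = 100, 4
family_size = math.ceil(math.e ** k * k * math.log(d))   # ~ c^k k log d maps
family = [[random.randrange(k) for _ in range(d)] for _ in range(family_size)]

subset = [3, 17, 42, 99]   # an arbitrary k-subset of [d]
# h separates subset if its k elements land in k distinct buckets.
separated = any(len({h[i] for i in subset}) == k for h in family)
print(separated)
```
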
Schnass & Vybiral [46] consider the same model for $$f$$ in the noiseless setting and derive a simple algorithm that recovers $${\mathscr{S}}$$ w.h.p. with $$O(\frac{C_1^4}{\alpha^4} {k} (\log {d})^2)$$ noiseless queries. Here, $$C_1 = \max_{i \in {\mathscr{S}}} \parallel{\partial_i f}\parallel_{\infty}$$ and $$\alpha = \min_{i \in {\mathscr{S}}} \parallel{\partial_i f}\parallel_1$$, with $$\parallel{\cdot}\parallel_1$$ denoting the $$L_1$$ norm. While the $$C_1$$ term is a constant depending on the smoothness of $$f$$, one can construct examples of $$f$$ for which $$\alpha = c^{-{k}}$$, for some constant $$c > 1$$. This implies that the sample bounds could be exponential in $${k}$$ for general $${k}$$-variate functions (as one would expect). This method could be applied to (1.1) to learn the set of active variables. In particular, for the general overlap case ($${\rho_{m}} \geq 1$$), their algorithm identifies the support $${\mathscr{S}}$$ w.h.p. with $$O(\frac{{C_1^{\prime}}^4 {\rho_{m}}^4}{\alpha^4} {k} (\log {d})^2)$$ noiseless queries, where now $$C_1 = \max_{i \in {\mathscr{S}}} \parallel{\partial_i f}\parallel_{\infty} \leq C_1^{\prime}{\rho_{m}}$$, with $$C_1^{\prime}$$ a constant depending on the smoothness of $$f$$. For the general overlap case, we see that their bounds in the noiseless setting are worse by a $${\rho_{m}}^3$$ factor compared with those for Algorithms 3 and 4, but better by a $$\log d$$ factor compared with Algorithm 3. Moreover, it is not clear how the $$\alpha$$ term scales with respect to $${\rho_{m}}$$ here. For the non-overlap case, the scaling of their sampling bounds with respect to $${k}, d$$ matches ours for the noiseless setting, up to an additional $$\frac{C_1^4}{\alpha^4}$$ factor. While $$\alpha$$ does not depend on $${k}$$ now, their sampling bound increases for small values of $$\alpha$$ or large values of $$C_1$$.
The dependence of the sampling bound on the parameters $$C_1,\alpha$$ is not necessary in the noiseless setting, as seen from our sampling bounds, which in the noiseless case depend on the measure of the region where $$\partial_i f$$ ($$i \in {\mathscr{S}}$$) and/or $$\partial_l \partial_{{l^{\prime}}} f$$ ($$(l,l^{\prime}) \in {\mathscr{S}_2}$$) are large. This model was also considered by Comminges & Dalalyan [11,12] in the regression setting. Assuming $$f$$ to be differentiable and the joint density of the covariates to be known, they propose an estimator that identifies the unknown subset $${\mathscr{S}}$$ w.h.p. with sample complexity $$O(c^{{k}} {k} \log d)$$. This bound is shown to be tight, although the estimator that achieves it is impractical—in the worst case, it looks at all subsets of $$\left\{{1,\dots,d}\right\}$$ of size $${k}$$. Fornasier et al. [18] and Tyagi & Cevher [53] generalized this model class to functions $$f$$ of the form $$f(\mathbf x) = g({\mathbf{A}}\mathbf x)$$, for unknown $${\mathbf{A}} \in {\mathbb{R}}^{k \times d}$$. They derive algorithms that approximately recover the row span of $${\mathbf{A}}$$, with sample complexities typically polynomial in $${k}, {d}$$. Specifically, [18] considers the setting where the rows of $${\mathbf{A}}$$ are sparse. They propose a method that essentially estimates the gradient of $$f$$—via $$\ell_1$$ minimization—at suitably (typically polynomially in $$d$$) many points on the unit sphere $$\mathbb{S}^{d-1}$$. Tyagi & Cevher [53] generalized this result to the setting where $${\mathbf{A}}$$ is not necessarily sparse, by making use of low rank matrix recovery techniques. Estimation of sparse Hessian matrices. There exists related work for estimating sparse Hessian matrices in the optimization literature.
Powell & Toint [41] and Coleman & Moré [10] consider the setting where the sparsity structure of $${\nabla^2} f(\mathbf x)$$ is known and aim to estimate $${\nabla^2} f(\mathbf x)$$ via gradient differences. Their aim is to minimize the number of gradient evaluations needed for this purpose. In particular, Coleman & Moré [10] approach the problem from a graph theoretic point of view and provide a graph coloring interpretation. Bandeira et al. [1] consider derivative-free optimization (DFO) problems, wherein they approximate the underlying objective function $$f$$ by a quadratic polynomial interpolation model. Specifically, they build such a model by assuming $${\nabla^2} f$$ to be sparse, but without assuming the sparsity pattern to be known. Their approach is to minimize the $$\ell_1$$ norm of the entries of the model Hessian, subject to interpolation conditions. As they do not assume $${\nabla} f$$ to be sparse, they arrive at a sampling bound of $$O({d} (\log {d})^4)$$ [1, Corollary 4.1] for recovering $${\nabla} f(\mathbf x)$$, $${\nabla^2} f(\mathbf x)$$ with high probability. If $${\nabla} f$$ were also sparse, one can verify that their bound changes to $$O(({|{{{\mathscr{S}}}}|} + 2{|{{{\mathscr{S}_2}}}|}) (\log({|{{{\mathscr{S}}}}|} + 2{|{{{\mathscr{S}_2}}}|}))^2 (\log d)^2) = O({k}{\rho_{m}} (\log({k}{\rho_{m}}))^2 (\log d)^2)$$. They essentially make use of the RIP for structured random matrices, as outlined in Theorem 4.4 of [44]. Bounded orthonormal systems. One of the reviewers pointed out another interesting approach that could be used for identifying $${\mathscr{S}_1},{\mathscr{S}_2}$$, which we now discuss. Note that this is only a rough sketch and verifying the details is left for future work. Let $$\psi_k(\mathbf x)$$ be a bounded orthonormal system (BOS) in $$L_2([-1,1]^d)$$, for $$k = 0,1,\dots,N$$, consisting of univariate and bivariate functions.
This could, for example, be constructed using a subset of the real trigonometric basis functions (see [13, Section 1.2]), with the $$\psi_k$$’s satisfying the zero (marginal) mean conditions. In our model, there are a total of $$d$$ univariate and $${d \choose 2}$$ bivariate functions. Say we take $$N_1$$ basis functions per coordinate, and $$N_2$$ basis functions per coordinate tuple, so that $$N = d N_1 + {d \choose 2}N_2$$. Now, $$f(\mathbf x) = \sum_{k=0}^{N} \alpha_k \psi_k(\mathbf x) + r(\mathbf x),$$ where $$r$$ denotes the remainder term. Since $$f$$ is $$C^3$$ smooth, we can uniformly approximate each univariate and bivariate $$\phi$$ with error rates $$N_1^{-p_1}$$ (for some $$p_1 > 0$$) and $$N_2^{-p_2}$$ (for some $$p_2 > 0$$), respectively. Using the triangle inequality, we then obtain for any $$\mathbf x \in [-1,1]^d$$ the bound: \begin{eqnarray} |{r(\mathbf x)}| \lesssim {|{{{\mathscr{S}_1}}}|} N_1^{-p_1} + {|{{{\mathscr{S}_2}}}|} N_2^{-p_2}. \end{eqnarray} (8.1) So to bound $$|{r(\mathbf x)}|$$ by a sufficiently small constant, we require $$N_1 \sim {|{{{\mathscr{S}_1}}}|}^{\frac{1}{p_1}}$$ and $$N_2 \sim {|{{{\mathscr{S}_2}}}|}^{\frac{1}{p_2}}$$. By querying $$f$$ at $$\mathbf x_1,\dots,\mathbf x_m$$ (sampled uniformly at random), we get $$y_l = f(\mathbf x_l) + z_l$$, $$l=1,\dots,m$$, which in matrix form can be written as $$\mathbf y = {\mathbf{A}} \alpha + {\mathbf e}$$. Here, $$e_l = z_l + r(\mathbf x_l)$$ and $$\alpha \in {\mathbb{R}}^{N}$$ is $$N_1 {|{{{\mathscr{S}_1}}}|} + N_2 {|{{{\mathscr{S}_2}}}|}$$ sparse.
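The sparse-recovery step in this sketch is a standard basis-pursuit problem. A minimal numerical illustration follows; as simplifying assumptions, an i.i.d. bounded random matrix stands in for an actual BOS measurement matrix, the samples are noiseless, and all dimensions are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, N, s = 40, 120, 3  # samples, basis size, sparsity (illustrative)

# Stand-in for a BOS measurement matrix: entries sqrt(2)*cos(pi*u) with
# u ~ U(-1,1) are bounded, zero mean and unit variance.
U = rng.uniform(-1.0, 1.0, size=(m, N))
A = np.sqrt(2.0) * np.cos(np.pi * U)

alpha = np.zeros(N)
alpha[rng.choice(N, size=s, replace=False)] = rng.normal(size=s)
y = A @ alpha  # noiseless queries

# Basis pursuit: min ||z||_1 s.t. A z = y, posed as an LP over z = z+ - z-.
c = np.ones(2 * N)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=[(0, None)] * (2 * N))
alpha_hat = res.x[:N] - res.x[N:]
print(np.max(np.abs(alpha_hat - alpha)))  # should be near zero
```

With $$m$$ of the order $$s \log N$$, exact recovery of the sparse coefficient vector is expected with high probability, after which the active blocks reveal the support.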
Since the rows of $${\mathbf{A}}$$ correspond to a BOS, one can recover $$\alpha$$ via $$\ell_1$$ minimization;22 using the RIP result for BOS [44, Theorem 4.4], we obtain the bound: \begin{align} m &\geq C_1 ({|{{{\mathscr{S}_1}}}|}N_1 + {|{{{\mathscr{S}_2}}}|}N_2) \log^2({|{{{\mathscr{S}_1}}}|}N_1 + {|{{{\mathscr{S}_2}}}|}N_2) \log^2(d N_1 + {d \choose 2} N_2) \\ \end{align} (8.2) \begin{align} &\gtrsim ({|{{{\mathscr{S}_1}}}|}^{\frac{1}{p_1} + 1} + {|{{{\mathscr{S}_2}}}|}^{\frac{1}{p_2} + 1}) \log^2({|{{{\mathscr{S}_1}}}|}^{\frac{1}{p_1} + 1} + {|{{{\mathscr{S}_2}}}|}^{\frac{1}{p_2} + 1}) \log^2(d). \end{align} (8.3) Note that the above bound is super-linear in the sparsity $${|{{{\mathscr{S}_1}}}|} + {|{{{\mathscr{S}_2}}}|}$$, and this would be the case even when the samples are noiseless. In contrast, our bounds for Algorithms 1–4 are linear in the sparsity, for the noiseless and bounded noise cases. Also, observe that $$\alpha$$ is actually block sparse: it has $${d \choose 2}$$ ‘blocks’, each of length $$N_2$$, out of which exactly $${|{{{\mathscr{S}_2}}}|}$$ blocks are non-zero. Moreover, there are $$d$$ blocks, each of length $$N_1$$, out of which $${|{{{\mathscr{S}_1}}}|}$$ blocks are non-zero. While we are not aware of a RIP result for BOS with block sparsity,23 we would nevertheless require $$m \gtrsim ({|{{{\mathscr{S}_1}}}|}N_1 + {|{{{\mathscr{S}_2}}}|}N_2) \sim {|{{{\mathscr{S}_1}}}|}^{\frac{1}{p_1} + 1} + {|{{{\mathscr{S}_2}}}|}^{\frac{1}{p_2} + 1}$$, which is super-linear in the sparsity. For the setting of Gaussian noise, however, it is possible that the above approach might give a better scaling with $${k},{\rho_{m}}$$ compared to our results. 9. Concluding remarks In this article, we considered a generalization of SPAMs, of the form (1.1), now also allowing for the presence of a small number of bivariate components.
We started with the special case where each variable interacts with at most one other variable, and then moved on to the general setting where variables can possibly be part of more than one interaction term. For each of these settings, we derived algorithms with sample complexity bounds, in both the noiseless and the noisy query settings. For the general overlap case, the identification of the interaction set $${\mathscr{S}_2}$$ essentially involved the estimation of the $$d \times d$$ Hessian of $$f$$ at carefully chosen points. In fact, these points were simply part of a collection of canonical two-dimensional uniform grids within $$[-1,1]^d$$. Upon identifying $${\mathscr{S}_2}$$, the estimation of $${\mathscr{S}_1}$$ was subsequently performed by employing the sampling scheme of Tyagi et al. [54] on the reduced set of variables. Furthermore, once $${\mathscr{S}_1}, {\mathscr{S}_2}$$ are identified, we showed how one can recover uniform approximations to the individual components of the model, by additionally querying $$f$$ along the one/two-dimensional subspaces corresponding to $${\mathscr{S}_1}, {\mathscr{S}_2}$$. For the setting of noiseless queries, we observed that the sample complexity of Algorithm 4 is close to optimal. However, for the noisy setting, in particular the setting of Gaussian noise, we saw that the sample complexity of Algorithm 4 has a worse dependence on $${k}, {\rho_{m}}$$ compared to Algorithm 3. In general, the sample complexity bounds of our algorithms, in the presence of Gaussian noise, have a sub-optimal dependence on $${k},{\rho_{m}}$$. This is mainly due to the localized nature of our sampling schemes: the external noise gets amplified by the inverse of the (small) step size parameter, leading to an increase in the noise variance. Hence, the number of samples required to reduce the noise variance (by resampling and averaging) increases, leading to an increase in the total sample complexity.
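The noise-amplification effect described above is easy to see in a one-dimensional toy experiment; the test function, step size and noise level below are illustrative choices, not quantities from our analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                       # toy function (illustrative)
x0, mu, sigma = 0.3, 1e-2, 0.1   # query point, step size, noise std

def fd_estimate(n_avg):
    """Central-difference derivative estimate from noisy queries of f,
    each query resampled n_avg times and averaged before differencing."""
    yp = f(x0 + mu) + rng.normal(0.0, sigma, n_avg).mean()
    ym = f(x0 - mu) + rng.normal(0.0, sigma, n_avg).mean()
    return (yp - ym) / (2.0 * mu)

# The noise term in the estimate has standard deviation
# sigma / (sqrt(2) * mu * sqrt(n_avg)): dividing by the step size mu
# inflates the variance, and resampling shrinks it back at rate 1/n_avg.
std_single = np.std([fd_estimate(1) for _ in range(2000)])
std_avged = np.std([fd_estimate(100) for _ in range(2000)])
print(std_single, std_avged)  # averaging 100 queries is ~10x less noisy
```

The number of repeated queries needed to bring the amplified noise back down is exactly what drives up the total sample complexity in the Gaussian noise setting.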
An interesting direction for future work would be to consider alternative, possibly non-localized, sampling schemes with improved non-asymptotic sampling bounds for identifying $${\mathscr{S}_1},{\mathscr{S}_2}$$ in the setting of Gaussian noise. Another limitation of our analysis is that it is restricted to $$C^3$$ smooth functions. It would be interesting to extend the results to more general $$C^{r}$$ smooth functions, $$r \geq 1$$, and also to other smoothness classes such as Hölder/Lipschitz continuous functions. Lastly, we only consider pairwise interactions between the variables; a natural generalization would be to consider a model that can include components which are at most $$m$$-variate. The goal would then be to query $$f$$, to identify all interaction terms. Acknowledgments This work was mostly done while HT was affiliated with the Department of Computer Science, ETH Zürich. HT would like to thank Yuxin Chen for helpful discussions related to the recovery of sparse symmetric matrices in Section 5.1, and Jan Vybiral for helpful discussions related to bounded orthonormal systems in Section 8. The authors would like to thank the anonymous reviewers for helpful comments and suggestions that greatly helped to improve a preliminary version of the article. Funding SNSF grant (CRSII2_147633); and The Alan Turing Institute under the EPSRC grant (EP/N510129/1). Footnotes 1With probability $$1 - O(d^{-c})$$ for some constant $$c > 0$$. 2We note that the idea of estimating a sparse gradient via $$\ell_1$$ minimization is motivated by Fornasier et al. [18]; their algorithm, however, is for a more general function class than ours. 3See Definition 3.2 and the ensuing discussion. 4This is discussed later. 5More precisely, if $$d > K^{\frac{K}{K-1}}$$. 6Such sets were used in [15] for a more general problem involving functions that are intrinsically $${k}$$ variate and do not necessarily have an additive structure. 7Recall the discussion following Definition 3.2.
8Recall the discussion following Definition 3.2. 9We are not aware of a formal proof of this fact in the literature. 10For example, when $$O(1)$$ variables have degree $${\rho_{m}}$$, and the remaining variables have degree $$1$$, leading to $$|{{\mathscr{S}_2}}| = O({k} + {\rho_{m}})$$. 11Recall the discussion following Definition 3.2. 12For instance, see Theorems 14.4 and 15.2 in [22]. 13See [52] for $${d}=1$$, and [38] for $${d} \geq 1.$$ 14Here, we set the term $$\rho^{*}_n$$ capturing the minimum magnitude of the univariate components (as defined in [45, Theorem 2]) to $$O(1)$$. 15Here, we set the term $$b$$ capturing the minimum magnitude of the univariate and bivariate components (as defined in [42, Section 3.2]) to $$O(1)$$. 16This is known to be asymptotically equivalent to the non-parametric regression model as the number of samples $$n \rightarrow \infty.$$ 17$$\epsilon$$ corresponds to $$\sigma^2/\sqrt{n}$$ in regression, where $$\sigma^2$$ denotes the variance of the noise. 18Thus, for $$m = d$$, we obtain an SPAM. 19This sample complexity implies exact recovery of $${\mathscr{S}_1},{\mathscr{S}_2}$$ w.h.p. 20These were derived predominantly in the noiseless setting, with some discussion in [53] about handling Gaussian noise via resampling and averaging. 21$$\psi_0 \equiv 1$$, i.e. it is the constant function. 22Consequently, we would be able to identify $${\mathscr{S}_1},{\mathscr{S}_2}$$ by thresholding. 23The existing ones seem to be only for matrices with i.i.d. sub-Gaussian entries. A preliminary version of this article appeared in the proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016 [55]. The present draft is an expanded version containing additional results. References 1. Bandeira A., Scheinberg K. & Vicente L. (2012) Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization. Math. Program., 134, 223–257.
2. Baraniuk R., Davenport M., DeVore R. & Wakin M. (2008) A simple proof of the restricted isometry property for random matrices. Constr. Approx., 28, 253–263. 3. Bien J., Taylor J. & Tibshirani R. (2013) A Lasso for hierarchical interactions. Ann. Statist., 41, 1111–1141. 4. Blumensath T. & Davies M. (2009) Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal., 27, 265–274. 5. Blumensath T. & Davies M. (2010) Normalized iterative hard thresholding: guaranteed stability and performance. IEEE J. Sel. Topics Signal Process., 4, 298–309. 6. Candès E., Romberg J. & Tao T. (2006) Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math., 59, 1207–1223. 7. Chen Y., Chi Y. & Goldsmith A. (2015) Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Trans. Inform. Theory, 61, 4034–4059. 8. Choi N., Li W. & Zhu J. (2010) Variable selection with the strong heredity constraint and its oracle property. J. Am. Stat. Assoc., 105, 354–364. 9. Cohen A., Daubechies I., DeVore R., Kerkyacharian G. & Picard D. (2012) Capturing ridge functions in high dimensions from point queries. Constr. Approx., 35, 225–243. 10. Coleman T. & Moré J. (1984) Estimation of sparse Hessian matrices and graph coloring problems. Math. Program., 28, 243–270. 11. Comminges L. & Dalalyan A. (2012) Tight conditions for consistency of variable selection in the context of high dimensionality. Ann. Statist., 40, 2667–2696. 12. Comminges L.
& Dalalyan A. (2012) Tight conditions for consistent variable selection in high dimensional nonparametric regression. J. Mach. Learn. Res., 19, 187–206. 13. Dalalyan A., Ingster Y. & Tsybakov A. (2014) Statistical inference in compound functional models. Probab. Theory Related Fields, 158, 513–532. 14. de Boor C. (1978) A Practical Guide to Splines. New York: Springer. 15. DeVore R., Petrova G. & Wojtaszczyk P. (2011) Approximation of functions of few variables in high dimensions. Constr. Approx., 33, 125–143. 16. Donoho D. (2006) Compressed sensing. IEEE Trans. Inform. Theory, 52, 1289–1306. 17. Fan J. & Gijbels I. (1996) Local Polynomial Modeling and Its Applications. London, New York: Chapman & Hall. 18. Fornasier M., Schnass K. & Vybíral J. (2012) Learning functions of few arbitrary linear parameters in high dimensions. Found. Comput. Math., 12, 229–262. 19. Foucart S. & Rauhut H. (2013) A Mathematical Introduction to Compressive Sensing. New York: Birkhäuser/Springer. 20. Fredman M. & Komlós J. (1984) On the size of separating systems and families of perfect hash functions. SIAM J. Algebr. Discrete Methods, 5, 61–68. 21. Gu C. (2002) Smoothing Spline ANOVA Models. New York: Springer. 22. Györfi L., Kohler M., Krzyzak A. & Walk H. (2002) A Distribution-Free Theory of Nonparametric Regression. New York: Springer. 23. Huang J., Horowitz J. & Wei F. (2010) Variable selection in nonparametric additive models. Ann. Statist., 38, 2282–2313. 24. Kekatos V. & Giannakis G. (2011) Sparse Volterra and polynomial regression models: recoverability and estimation. Trans.
Signal Process., 59, 5907–5920. 25. Koltchinskii V. & Yuan M. (2008) Sparse recovery in large ensembles of kernel machines. 21st Annual Conference on Learning Theory (COLT). pp. 229–238. 26. Koltchinskii V. & Yuan M. (2010) Sparsity in multiple kernel learning. Ann. Statist., 38, 3660–3695. 27. Körner J. & Marton K. (1988) New bounds for perfect hashing via information theory. Eur. J. Combin., 9, 523–530. 28. Kyrillidis A. & Cevher V. (2011) Recipes on hard thresholding methods. 4th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). pp. 353–356. 29. Kyrillidis A. & Cevher V. (2012) Combinatorial selection and least absolute shrinkage via the CLASH algorithm. IEEE International Symposium on Information Theory (ISIT). pp. 2216–2220. 30. Kyrillidis A., Puy G. & Cevher V. (2012) Hard thresholding with norm constraints. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3645–3648. 31. Lin Y. & Zhang H. (2006) Component selection and smoothing in multivariate nonparametric regression. Ann. Statist., 34, 2272–2297. 32. Maathuis M., Kalisch M. & Bühlmann P. (2009) Estimating high-dimensional intervention effects from observational data. Ann. Statist., 37, 3133–3164. 33. Meier L., Van De Geer S. & Bühlmann P. (2009) High-dimensional additive modeling. Ann. Statist., 37, 3779–3821. 34. Mossel E., O’Donnell R. & Servedio R. (2003) Learning juntas. 35th Annual ACM Symposium on Theory of Computing (STOC). pp. 206–212. 35. Müller-Gronbach T. & Ritter K. (2008) Minimal errors for strong and weak approximation of stochastic differential equations. Monte Carlo and Quasi-Monte Carlo Methods. pp. 53–82.
36. Naor M., Schulman L. & Srinivasan A. (1995) Splitters and near-optimal derandomization. Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995. pp. 182–191. 37. Nazer B. & Nowak R. (2010) Sparse interactions: identifying high-dimensional multilinear systems via compressed sensing. 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton). pp. 1589–1596. 38. Nemirovski A. (2000) Topics in non-parametric statistics. École d’Été de Probabilités de Saint-Flour XXVIII, 1998. New York: Springer, pp. 85–277. 39. Nesterov Y. & Nemirovskii A. (1994) Interior-Point Polynomial Algorithms in Convex Programming. Philadelphia: Society for Industrial and Applied Mathematics. 40. Nilli A. (1994) Perfect hashing and probability. Combin. Probab. Comput., 3, 407–409. 41. Powell M. & Toint P. L. (1979) On the estimation of sparse Hessian matrices. SIAM J. Numer. Anal., 16, 1060–1074. 42. Radchenko P. & James G. M. (2010) Variable selection using adaptive nonlinear interaction structures in high dimensions. J. Amer. Statist. Assoc., 105, 1541–1553. 43. Raskutti G., Wainwright M. & Yu B. (2012) Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res., 13, 389–427. 44. Rauhut H. (2010) Compressive sensing and structured random matrices. Theoretical Foundations and Numerical Methods for Sparse Recovery, vol. 9. 1–92. 45. Ravikumar P., Lafferty J., Liu H. & Wasserman L. (2009) Sparse additive models. J. R. Stat. Soc. Ser. B. Stat. Methodol., 71, 1009–1030. 46. Schnass K. & Vybiral J. (2011) Compressed learning of high-dimensional sparse functions.
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3924–3927. 47. Smirnov V. (1964) A Course of Higher Mathematics. Reading, MA: Addison-Wesley. 48. Spall J. (1992) Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Automat. Control, 37, 332–341. 49. Storlie C. B., Bondell H. D., Reich B. J. & Zhang H. H. (2011) Surface estimation, variable selection, and the nonparametric oracle property. Statist. Sinica, 21, 679–705. 50. Suzuki T. (2012) PAC-Bayesian bound for Gaussian process regression and multiple kernel additive model. 25th Annual Conference on Learning Theory (COLT). pp. 8.1–8.20. 51. Traub J., Wasilkowski G. & Wozniakowski H. (1988) Information-Based Complexity. New York: Academic Press. 52. Tsybakov A. (2008) Introduction to Nonparametric Estimation. New York: Springer. 53. Tyagi H. & Cevher V. (2012) Active learning of multi-index function models. Advances in Neural Information Processing Systems (NIPS) 25. pp. 1475–1483. 54. Tyagi H., Krause A. & Gärtner B. (2014) Efficient sampling for learning sparse additive models in high dimensions. Advances in Neural Information Processing Systems (NIPS) 27. pp. 514–522. 55. Tyagi H., Kyrillidis A., Gärtner B. & Krause A. (2016) Learning sparse additive models with interactions in high dimensions. 19th International Conference on Artificial Intelligence and Statistics (AISTATS). pp. 111–120. 56. Wahba G. (2003) An introduction to (smoothing spline) ANOVA models in RKHS, with examples in geographical data, medicine, atmospheric science and machine learning. 13th IFAC Symposium on System Identification, Rotterdam. pp. 549–559. 57. Wahl M. (2015) Variable selection in high-dimensional additive models based on norms of projections. ArXiv e-prints, arXiv:1406.0052, 2015. 58.
Wainwright M. (2009a) Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inform. Theory, 55, 5728–5741. 59. Wainwright M. (2009b) Sharp thresholds for high-dimensional and noisy sparsity recovery using L1-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory, 55, 2183–2202. 60. Wojtaszczyk P. (2012) $$\ell_1$$ Minimization with noisy data. SIAM J. Numer. Anal., 50, 458–467. 61. Yang Y. & Tokdar S. (2015) Minimax-optimal nonparametric regression in high dimensions. Ann. Statist., 43, 652–674. Appendix A. Model uniqueness We show here that the model representation (2.4) is a unique representation for $$f$$ of the form (2.1). We first note that any measurable $$f:[-1,1]^{{d}} \rightarrow {\mathbb{R}}$$ admits a unique ANOVA decomposition [21,56] of the form: $$f(x_1,\dots,x_{{d}}) = c + \sum_{\alpha} f_{\alpha}(x_{\alpha}) + \sum_{\alpha < \beta} f_{\alpha\beta}(x_{\alpha},x_{\beta}) + \sum_{\alpha < \beta < \gamma} f_{\alpha\beta\gamma}(x_{\alpha},x_{\beta},x_{\gamma}) + \ldots$$ (A.1) Indeed, for any probability measure $$\mu_{\alpha}$$ on $$[-1,1]$$, let $${\mathscr{E}}_{\alpha}$$ denote the averaging operator along coordinate $$\alpha$$, defined as $${\mathscr{E}}_{\alpha}(f)(\mathbf x) := \int_{[-1,1]} f(x_1,\dots,x_{{d}})\,{\rm d}\mu_{\alpha}(x_{\alpha}).$$ (A.2) Then the components of the model can be written as: $${c = (\prod_{\alpha} {\mathscr{E}}_{\alpha}) f}$$, $$f_{\alpha} = (I-{\mathscr{E}}_{\alpha})\prod_{\beta \neq \alpha} {\mathscr{E}}_{\beta} f$$, $$f_{\alpha\beta} = ((I-{\mathscr{E}}_{\alpha})(I-{\mathscr{E}}_{\beta})\prod_{\gamma \neq \alpha,\beta}{\mathscr{E}}_{\gamma}) f$$ and so on. For our purpose, $$\mu_{\alpha}$$ is taken to be the uniform probability measure on $$[-1,1]$$. Given this, we now find the ANOVA decomposition of $$f$$ defined in (2.1).
As a sanity check, let us verify that $$f_{\alpha\beta\gamma} \equiv 0$$ for all $$\alpha < \beta < \gamma$$. Indeed, if $$p \in {\mathscr{S}_1}$$, then at least two of $$\alpha,\beta,\gamma$$ will not be equal to $$p$$. Similarly, for any $$(l,l^{\prime}) \in {\mathscr{S}_2}$$, at least one of $$\alpha,\beta,\gamma$$ will equal neither $$l$$ nor $${l^{\prime}}$$. This implies $$f_{\alpha\beta\gamma} \equiv 0$$. The same reasoning trivially applies to higher-order components of the ANOVA decomposition. That $$c = {\mathbb{E}}[f] = \sum_{p \in {\mathscr{S}_1}} {\mathbb{E}}_p[\phi_p] + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}} {\mathbb{E}}_{(l,l^{\prime})}[\phi_{(l,l^{\prime})}]$$ is readily seen. Next, we have that \begin{gather} (I-{\mathscr{E}}_{\alpha})\prod_{\beta \neq \alpha} {\mathscr{E}}_{\beta} \phi_p = \left\{ \begin{array}{rl} 0 \quad ; & \alpha \neq p, \\ \phi_p - {\mathbb{E}}_p[\phi_p] \quad ; & \alpha = p \end{array} \right\} \quad p \in {\mathscr{S}_1}.\\ \end{gather} (A.3) \begin{gather} (I-{\mathscr{E}}_{\alpha})\prod_{\beta \neq \alpha} {\mathscr{E}}_{\beta} \phi_{(l,l^{\prime})} = \left\{ \begin{array}{rl} {\mathbb{E}}_{{l^{\prime}}}[\phi_{(l,l^{\prime})}] - {\mathbb{E}}_{(l,l^{\prime})}[\phi_{(l,l^{\prime})}] \quad ; & \alpha = l, \\ {\mathbb{E}}_{l}[\phi_{(l,l^{\prime})}] - {\mathbb{E}}_{(l,l^{\prime})}[\phi_{(l,l^{\prime})}] \quad ; & \alpha = {l^{\prime}}, \\ 0 \quad ; & \alpha \neq l,{l^{\prime}}, \end{array} \right\} \quad (l,l^{\prime}) \in {\mathscr{S}_2}. \end{gather} (A.4) (A.3) and (A.4) give us the first-order components of $$\phi_p, \phi_{(l,l^{\prime})}$$, respectively.
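The averaging-operator calculus above is easy to check numerically. A minimal sketch follows, with toy components chosen purely for illustration (they are not from the text):

```python
import numpy as np

# Toy instance of the model (illustrative): f(x1,x2,x3) = phi_1(x1) + phi_23(x2,x3)
# on [-1,1]^3, i.e. S_1 = {1} and S_2 = {(2,3)}.
phi_1 = lambda x: x**2
phi_23 = lambda x, y: x * y

# Averaging w.r.t. the uniform probability measure on [-1,1] via
# Gauss-Legendre quadrature: E[g] = (1/2) * sum_i w_i * g(t_i).
t, w = np.polynomial.legendre.leggauss(20)
aw = w / 2.0

E_phi1 = aw @ phi_1(t)                              # E_1[phi_1] = 1/3
E_phi23 = aw @ phi_23(t[:, None], t[None, :]) @ aw  # E_{2,3}[phi_23] = 0
c = E_phi1 + E_phi23                                # constant ANOVA term

# First-order component along coordinate 2, as in (A.4):
# E_3[phi_23] - E_{2,3}[phi_23]; here E_3[x2*x3] = x2 * E[x3] vanishes.
x = np.linspace(-1.0, 1.0, 5)
f_2 = phi_23(x[:, None], t[None, :]) @ aw - E_phi23
print(c, np.max(np.abs(f_2)))  # ~ 1/3 and ~ 0
```

For this toy model the bivariate component $$x_2 x_3$$ contributes no univariate part, so only the constant and the $$x_1$$ component survive the first-order projections, consistent with (A.3) and (A.4).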
One can next verify using the same arguments as earlier that for any $$\alpha < \beta$$: $$(I-{\mathscr{E}}_{\alpha})(I-{\mathscr{E}}_{\beta})\prod_{\gamma \neq \alpha,\beta} {\mathscr{E}}_{\gamma} \phi_p = 0 \quad \forall p \in {\mathscr{S}_1}.$$ (A.5) Lastly, we have for any $$\alpha < \beta$$ that the corresponding second-order component of $$\phi_{(l,l^{\prime})}$$ is given by: $$(I-{\mathscr{E}}_{\alpha})(I-{\mathscr{E}}_{\beta})\prod_{\gamma \neq \alpha,\beta} {\mathscr{E}}_{\gamma} \phi_{(l,l^{\prime})} = \left\{ \begin{array}{rl} \phi_{(l,l^{\prime})} - {\mathbb{E}}_l[\phi_{(l,l^{\prime})}] \\ - {\mathbb{E}}_{{l^{\prime}}}[\phi_{(l,l^{\prime})}] + {\mathbb{E}}_{(l,l^{\prime})}[\phi_{(l,l^{\prime})}] \quad ; & \alpha = l, \beta = {l^{\prime}}, \\ 0 \quad ; & \text{otherwise} \end{array} \right\} \quad (l,l^{\prime}) \in {\mathscr{S}_2}.$$ (A.6) We now make the following observations regarding the variables in $${\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}}$$. For each $$l \in {\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}}$$ such that $${\rho}(l) = 1$$ and $$(l,l^{\prime}) \in {\mathscr{S}_2}$$, we can simply merge $$\phi_l$$ with $$\phi_{(l,l^{\prime})}$$. Thus, $$l$$ is no longer in $${\mathscr{S}_1}$$. For each $$l \in {\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}}$$ such that $${\rho}(l) > 1$$, we can add the first-order component for $$\phi_l$$ to the total first-order component corresponding to all $$\phi_{(l,l^{\prime})}$$’s and $$\phi_{(l^{\prime},l)}$$’s. Hence again, $$l$$ will no longer be in $${\mathscr{S}_1}$$. Therefore, all $$q \in {\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}}$$ can essentially be merged with $${\mathscr{S}_2}$$. Keeping this re-arrangement in mind, we may assume from the outset in (2.1) that $${\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}} = \emptyset$$.
Then with the help of (A.3), (A.4), (A.5) and (A.6), we have that any $$f$$ of the form (2.1) (with $${\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}} = \emptyset$$), can be uniquely written as: $$f(x_1,\dots,x_d) = c + \sum_{p \in {\mathscr{S}_1}}{\tilde{\phi}}_{p} (x_p) + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}} {\tilde{\phi}}_{(l,l^{\prime})} (x_{l},x_{l^{\prime}}) + \sum_{q \in {\mathscr{S}_2^{\text{var}}}: {\rho}(q) > 1} {\tilde{\phi}}_{q} (x_q) \quad {{\mathscr{S}_1} \cap {\mathscr{S}_2^{\text{var}}}} = \emptyset,$$ (A.7) where \begin{align} c &= \sum_{p \in {\mathscr{S}_1}} {\mathbb{E}}_p[\phi_p] + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}} {\mathbb{E}}_{(l,l^{\prime})}[\phi_{(l,l^{\prime})}], \\ \end{align} (A.8) \begin{align} {\tilde{\phi}}_{p} &= \phi_p - {\mathbb{E}}_p[\phi_p] \quad \forall p \in {\mathscr{S}_1}, \end{align} (A.9) $${\tilde{\phi}}_{(l,l^{\prime})} = \left\{ \begin{array}{rl} \phi_{(l,l^{\prime})} - {\mathbb{E}}_{(l,l^{\prime})}[\phi_{(l,l^{\prime})}] , & {\rho}(l), {\rho}(l^{\prime}) = 1, \\ \phi_{(l,l^{\prime})} - {\mathbb{E}}_{l}[\phi_{(l,l^{\prime})}] , & {\rho}(l) = 1, {\rho}(l^{\prime}) > 1, \\ \phi_{(l,l^{\prime})} - {\mathbb{E}}_{l^{\prime}}[\phi_{(l,l^{\prime})}] , & {\rho}(l) > 1, {\rho}(l^{\prime}) = 1, \\ \phi_{(l,l^{\prime})} - {\mathbb{E}}_{l}[\phi_{(l,l^{\prime})}] - {\mathbb{E}}_{l^{\prime}}[\phi_{(l,l^{\prime})}] + {\mathbb{E}}_{(l,l^{\prime})}[\phi_{(l,l^{\prime})}] , & {\rho}(l) > 1, {\rho}(l^{\prime}) > 1, \end{array} \right.$$ (A.10) \begin{align} \text{and} \quad {\tilde{\phi}}_{q} &= \sum_{q^{\prime}: (q,q^{\prime}) \in {\mathscr{S}_2}} ({\mathbb{E}}_{q^{\prime}}[\phi_{(q,q^{\prime})}] - {\mathbb{E}}_{q,q^{\prime}}[\phi_{(q,q^{\prime})}]) \nonumber \\ &+ \sum_{q^{\prime}: (q^{\prime},q) \in {\mathscr{S}_2}} ({\mathbb{E}}_{q^{\prime}}[\phi_{(q^{\prime},q)}] - {\mathbb{E}}_{q^{\prime},q}[\phi_{(q^{\prime},q)}]) \quad \forall q \in {\mathscr{S}_2^{\text{var}}}: {\rho}(q) > 1. \end{align} (A.11) Appendix B. 
Real roots of a cubic equation in trigonometric form Before proceeding with the proofs, we briefly recall the conditions under which a cubic equation possesses real roots, along with explicit expressions for them. The material in this section is taken from [47, Chapter 18 (Sections 191 and 192)]. To begin with, given any cubic equation: $$y^3 + a_1y^2 +a_2y + a_3 = 0,$$ (B.1) one can make the substitution $$x = y + (a_1/3)$$ to change (B.1) to the form: $$x^3 + px + q = 0, \quad \text{where} \quad p = a_2 - \frac{a_1^2}{3} \ \text{and} \ q = \frac{2a_1^3}{27} - \frac{a_1a_2}{3} + a_3.$$ (B.2) If $$p,q$$ are real (which is the case if $$a_1$$, $$a_2$$ and $$a_3$$ are real), then (B.2) has three real and distinct roots if its discriminant satisfies $$(q^2/4) + (p^3/27) < 0$$. Denoting $$r = \sqrt{-\frac{p^3}{27}}, \quad \cos \phi = -\frac{q}{2r},$$ (B.3) we then have that the real roots of (B.2) are given by $$x = 2 \sqrt[3]{r} \cos \frac{\phi + 2j\pi}{3} = 2 \sqrt{-\frac{p}{3}} \cos \frac{\phi + 2j\pi}{3} \quad j = 0,1,2.$$ (B.4) Consequently, the roots of (B.1) are then given by: $$y = 2 \sqrt{-\frac{p}{3}} \cos \frac{\phi + 2j\pi}{3} - \frac{a_1}{3} \quad j = 0,1,2.$$ (B.5) Appendix C. Proofs for Section 3 C.1 Proof of Lemma 3.1 Recall that for $$\mathbf x \in {\chi}$$, we recover a stable approximation $${\widehat{{{\nabla}}}} f(\mathbf x)$$ to $${\nabla} f(\mathbf x)$$ via $$\ell_1$$ minimization [6,16]: $${\widehat{{{\nabla}}}} f(\mathbf x) = \triangle(\mathbf y) := \underset{\mathbf y = {\mathbf{V}}\mathbf z}{\mathrm{argmin}} \parallel{\mathbf z}\parallel_1.$$ (C.1) Applying Theorem 3.1 to our setting yields the following corollary.
Corollary C.1 There exist constants $$c_3^{\prime} \geq 1$$ and $$C,c_1^{\prime} > 0$$ such that for $${m_{v}}$$ satisfying $$c_3^{\prime} {k} \log({d}/{k}) < {m_{v}} < {d}/(\log 6)^2$$ we have with probability at least $$1-\,{\rm e}^{-c_1^{\prime}{m_{v}}} - \,{\rm e}^{-\sqrt{{m_{v}} {d}}}$$ that $${\widehat{{{\nabla}}}} f(\mathbf x)$$ satisfies for all $$\mathbf x \in {\chi}$$: $$\parallel{{\widehat{{{\nabla}}}} f(\mathbf x) - {\nabla} f(\mathbf x)}\parallel_2 \leq \frac{2 C {\mu}^2 {B}_3 {k}}{3{m_{v}}},$$ (C.2) where $${B}_3 > 0$$ is the constant defined in Assumption 2. Proof. Since $${\nabla} f(\mathbf x)$$ is at most $${k}$$-sparse for any $$\mathbf x \in {\mathbb{R}}^{{d}}$$, we immediately have from (3.10) that $$\parallel{{\widehat{{{\nabla}}}} f(\mathbf x) - {\nabla} f(\mathbf x)}\parallel_2 \leq C \max\left\{{\parallel{{\mathbf n}}\parallel_2, \sqrt{\log {d}}\parallel{{\mathbf n}}\parallel_{\infty}}\right\} \quad \forall \mathbf x \in {\chi}.$$ (C.3) It remains to bound $$\parallel{{\mathbf n}}\parallel_2, \parallel{{\mathbf n}}\parallel_{\infty}$$. To this end, recall that $${\mathbf n} = [{n}_1 \dots {n}_{{m_{v}}}]$$, where $${n}_j = \frac{{R}_3(\zeta_{j}) - {R}_3(\zeta^{\prime}_{j})}{2{\mu}}$$, for some $$\zeta_j,\zeta^{\prime}_j \in {\mathbb{R}}^{{d}}$$. Here, $${R}_3(\zeta)$$ denotes the third-order Taylor remainder term. By taking the structure of $$f$$ into account, we can uniformly bound $$|{{R}_3(\zeta_j)}|$$ as follows (so the same bound holds for $$|{{R}_3(\zeta^{\prime}_j)}|$$). 
\begin{align} |{{R}_3(\zeta_j)}| &= \frac{{\mu}^3}{6} |\sum_{p \in {\mathscr{S}_1}} \partial_p^3 \phi_p(\zeta_{j,p}) v_p^3 + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}} ( \partial_l^3 \phi_{(l,l^{\prime})}(\zeta_{j,l}, \zeta_{j,{l^{\prime}}}) v_l^3 + \partial_{l^{\prime}}^3 \phi_{(l,l^{\prime})}(\zeta_{j,l}, \zeta_{j,{l^{\prime}}}) v_{l^{\prime}}^3) \\ &+ \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}} (3\partial_l \partial_{l^{\prime}}^2 \phi_{(l,l^{\prime})}(\zeta_{j,l}, \zeta_{j,{l^{\prime}}}) v_l v_{l^{\prime}}^2 + 3\partial_l^2 \partial_{l^{\prime}} \phi_{(l,l^{\prime})}(\zeta_{j,l}, \zeta_{j,{l^{\prime}}}) v_l^2 v_{l^{\prime}})|, \nonumber \\ \end{align} (C.4) \begin{align} &\leq \frac{{\mu}^3}{6} \left[\left(\frac{1}{\sqrt{{m_{v}}}}\right)^3 {k_1} {B}_3 + \left(\frac{1}{\sqrt{{m_{v}}}}\right)^3 {k_2} (2{B}_3) + \left(\frac{1}{\sqrt{{m_{v}}}}\right)^3 {k_2} (6{B}_3)\right], \\ \end{align} (C.5) \begin{align} &= \frac{{\mu}^3{B}_3 ({k_1} + 8{k_2})}{6{m_{v}}^{3/2}}. \end{align} (C.6) Using the fact that $${k_1} + 8{k_2} \leq 4({k_1} + 2{k_2}) = 4{k}$$, we consequently obtain \begin{align} \parallel{{\mathbf n}}\parallel_{\infty} &= \max_{j} {|{{{n}_j}}|} \leq \frac{{\mu}^2{B}_3 ({k_1} + 8{k_2})}{6{m_{v}}^{3/2}} \leq \frac{2 {\mu}^2{B}_3{k}}{3{m_{v}}^{3/2}}, \\ \end{align} (C.7) \begin{align} \text{and} \quad \parallel{{\mathbf n}}\parallel_2 &\leq \sqrt{{m_{v}}} \parallel{{\mathbf n}}\parallel_{\infty} \leq \frac{2 {\mu}^2{B}_3{k}}{3 {m_{v}}}. \end{align} (C.8) Using (C.7) and (C.8) in (C.3), we finally obtain for the stated choice of $${m_{v}}$$ (cf. Remark 3.4), the bound in (C.2). □ Let us denote $${\tau} = \frac{2 C {\mu}^2 {B}_3 {k}}{3{m_{v}}}$$. 
To prove the lemma, we first observe that (C.2) trivially implies that $$\widehat{\partial_q} f(\mathbf x) \in [\partial_q f(\mathbf x) - {\tau}, \partial_q f(\mathbf x) + {\tau}] \quad q=1,\dots,{d}.$$ (C.9) Now, if $$q \notin {\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$, then $$\partial_q f(\mathbf x) = 0$$ $$\forall \mathbf x \in {\mathbb{R}}^{{d}}$$, meaning that $$\widehat{\partial_q} f(\mathbf x) \in [-{\tau},{\tau}]$$. If $${m_{x}} \geq {\lambda}_1^{-1}$$, then for every $$q \in {\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$, $$\exists {h} \in {{\mathscr{H}}_{2}^{d}}$$ and at least one $$\mathbf x \in {\chi}({h})$$, so that $$|{\partial_q f(\mathbf x)}| > {D}_1$$. Indeed, this follows from the definition of $${{\mathscr{H}}_{2}^{d}}$$ and by construction of $${\chi}({h})$$ for $${h} \in {{\mathscr{H}}_{2}^{d}}$$. Furthermore, for such $$\mathbf x$$, we have from (C.9) that $$|{\widehat{\partial_q} f(\mathbf x)}| \geq {D}_1 - {\tau}$$. Therefore, if $${\tau} < \frac{{D}_1}{2}$$ holds, then clearly we would have $$|{\widehat{\partial_q} f(\mathbf x)}| > \frac{{D}_1}{2} > {\tau}$$, meaning that we will be able to identify $$q$$. Lastly, we observe that the condition $${\tau} < \frac{{D}_1}{2}$$ translates to an equivalent condition on the step size $${\mu}$$ as follows: $${\tau} < \frac{{D}_1}{2} \Leftrightarrow \frac{2 C {\mu}^2 {B}_3 {k}}{3{m_{v}}} < \frac{{D}_1}{2} \Leftrightarrow {\mu} < \left( \frac{3{D}_1 {m_{v}}}{4 C {B}_3 {k}} \right)^{1/2}.$$ (C.10) C.2 Proof of Lemma 3.2 We proceed by first bounding the error term that arises in the estimation of $$\partial_i g(\mathbf x)$$. As $$g$$ is $${\mathscr{C}}^3$$ smooth, consider the Taylor expansion of $$g$$ at $$\mathbf x$$, along $${\mathbf{e}}_1(i),-{\mathbf{e}}_1(i) \in {\mathbb{R}}^{{k}}$$, with step size $${\beta} > 0$$.
For some $$\zeta = \mathbf x + \theta {\mathbf{e}}_1(i)$$, $$\zeta^{\prime} = \mathbf x - \theta^{\prime} {\mathbf{e}}_1(i)$$ with $$\theta,\theta^{\prime} \in (0,{\beta})$$, we obtain the identities: \begin{align} g(\mathbf x + {\beta} {\mathbf{e}}_1(i)) &= g(\mathbf x) + {\beta} \langle{\mathbf{e}}_1(i),{\nabla} g(\mathbf x)\rangle + \frac{{\beta}^2}{2}{\mathbf{e}}_1(i)^{\rm T} {\nabla^2} g(\mathbf x) {\mathbf{e}}_1(i) + {R}_3(\zeta), \\ \end{align} (C.11) \begin{align} g(\mathbf x - {\beta} {\mathbf{e}}_1(i)) &= g(\mathbf x) - {\beta} \langle{\mathbf{e}}_1(i),{\nabla} g(\mathbf x)\rangle + \frac{{\beta}^2}{2}{\mathbf{e}}_1(i)^{\rm T} {\nabla^2} g(\mathbf x) {\mathbf{e}}_1(i) + {R}_3(\zeta^{\prime}), \end{align} (C.12) with $${R}_3(\zeta),{R}_3(\zeta^{\prime}) = O({\beta}^3)$$ being the third-order remainder terms. Subtracting the above leads to the following identity: $$\underbrace{\frac{g(\mathbf x + {\beta} {\mathbf{e}}_1(i)) - g(\mathbf x - {\beta} {\mathbf{e}}_1(i))}{2{\beta}}}_{{\widehat{{\partial_i}}} g(\mathbf x)} = \underbrace{\langle{\mathbf{e}}_1(i),{\nabla} g(\mathbf x)\rangle}_{\partial_i g(\mathbf x)} + \underbrace{\frac{{R}_3(\zeta) - {R}_3(\zeta^{\prime})}{2{\beta}}}_{{\eta}_i(\mathbf x,{\beta}) = O({\beta}^2)}.$$ (C.13) We now uniformly bound $$|{{R}_3(\zeta)}|$$, so the same bound holds for $$|{{R}_3(\zeta^{\prime})}|$$. 
Because of the structure of $$g$$, we have that \begin{align} |{{R}_3(\zeta)}| &= \frac{{\beta}^3}{6}|\sum_{p \in {\mathscr{S}_1}} \partial_p^3 \phi_p(\zeta_p) ({\mathbf{e}}_1(i))^3_p + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}}(\partial_l^3 \phi_{(l,l^{\prime})}(\zeta_l,\zeta_{l^{\prime}}) ({\mathbf{e}}_1(i))^3_l + \partial_{l^{\prime}}^3 \phi_{(l,l^{\prime})}(\zeta_l,\zeta_{l^{\prime}}) ({\mathbf{e}}_1(i))^3_{l^{\prime}}) \\ \end{align} (C.14) \begin{align} &+ \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}}(3 \partial_l^2 \partial_{l^{\prime}} \phi_{(l,l^{\prime})}(\zeta_l,\zeta_{l^{\prime}}) ({\mathbf{e}}_1(i))^2_l ({\mathbf{e}}_1(i))_{l^{\prime}} + 3\partial_{l^{\prime}}^2 \partial_l \phi_{(l,l^{\prime})}(\zeta_l,\zeta_{l^{\prime}}) ({\mathbf{e}}_1(i))^2_{l^{\prime}}({\mathbf{e}}_1(i))_l)| \\ \end{align} (C.15) \begin{align} &= \left\{ \begin{array}{rl} \frac{{\beta}^3}{6} |{\partial_i^3 \phi_i(\zeta_i)}| , \quad & i \in {\mathscr{S}_1}, \\ \frac{{\beta}^3}{6} |{\partial_i^3 \phi_{i,j} (\zeta_i,\zeta_j)}| , \quad & i \in {\mathscr{S}_2^{\text{var}}}, (i,j) \in {\mathscr{S}_2}, \\ \frac{{\beta}^3}{6} |{\partial_i^3 \phi_{j,i} (\zeta_j,\zeta_i)}| , \quad & i \in {\mathscr{S}_2^{\text{var}}}, (j,i) \in {\mathscr{S}_2} \end{array} \right . \\ \end{align} (C.16) \begin{align} &\leq \frac{{\beta}^3 {B}_3}{6}. \end{align} (C.17) The above consequently implies that $$|{{\eta}_i(\mathbf x,{\beta})}| \leq \frac{{\beta}^2 {B}_3}{6}$$. This in turn means, for any $${\mathbf v} \in {\mathbb{R}}^{{k}}, {\mu_1} > 0$$, that $$\left|{\frac{{\eta}_i(\mathbf x + {\mu_1}{\mathbf v},{\beta}) - {\eta}_i(\mathbf x,{\beta})}{{\mu_1}}}\right| \leq \frac{{\beta}^2 {B}_3}{3{\mu_1}}.$$ (C.18) Thus, we have a uniform bound on the magnitude of one of the contributors of the error term in (3.18). We can bound the magnitude of the other term as follows. 
For $${\mathbf v} \in {\mathbb{R}}^{{k}}$$ and $$\zeta = \mathbf x + \theta{\mathbf v}$$, $$\theta \in (0,{\mu_1})$$, we have \begin{align} {\mathbf v}^{\rm T} {\nabla^2} \partial_i g(\zeta) {\mathbf v} = \left\{ \begin{array}{rl} v_i^2 \partial_i^3 \phi_i(\zeta_i), \quad & i \in {\mathscr{S}_1}, \\ v_i^2 \partial_i^3 \phi_{i,i^{\prime}} (\zeta_i, \zeta_{i^{\prime}}) + v_{i^{\prime}}^2 \partial_{i^{\prime}}^2 \partial_i \phi_{i,i^{\prime}} (\zeta_i, \zeta_{i^{\prime}}) + 2 v_i v_{i^{\prime}} \partial_{i^{\prime}} \partial_i^2 \phi_{i,i^{\prime}} (\zeta_i, \zeta_{i^{\prime}}), \quad & i \in {\mathscr{S}_2^{\text{var}}}, (i,i^{\prime}) \in {\mathscr{S}_2}. \end{array} \right. \end{align} (C.19) Since in our scheme we employ only $${\mathbf v} \in \left\{{0,1}\right\}^{{k}}$$, this leads to the following uniform bound. $$\left|{\frac{{\mu_1}}{2} {\mathbf v}^{\rm T} {\nabla^2} \partial_i g(\zeta) {\mathbf v}}\right| \leq 4 {B}_3 \frac{{\mu_1}}{2} = 2 {\mu_1} {B}_3 \quad \forall i \in {\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}.$$ (C.20) Denoting by $${\tau^{\prime}}$$, the upper bound on the magnitude of the error term in (3.18), we thus obtain: $${\tau^{\prime}} = 2 {\mu_1} {B}_3 + \frac{{\beta}^2 {B}_3}{3{\mu_1}}.$$ (C.21) Now in case $$i \in {\mathscr{S}_1}$$, we have $$\langle{\nabla} \partial_i g(\mathbf x),{\mathbf v}_0(i)\rangle = 0$$, $$\forall \mathbf x \in {\mathbb{R}}^{{k}}$$. 
This in turn implies that $$\left|{\frac{\widehat{\partial}_i g(\mathbf x + {\mu_1} {\mathbf v}_0(i)) - \widehat{\partial}_i g(\mathbf x)}{{\mu_1}}}\right| \leq {\tau^{\prime}} \quad \forall \mathbf x \in {\mathbb{R}}^{{k}}.$$ (C.22) If $$i \in {\mathscr{S}_2^{\text{var}}}$$ with $$(i,i^{\prime}) \in {\mathscr{S}_2}$$, then $$\langle{\nabla} \partial_i g(\mathbf x),{\mathbf v}_0(i)\rangle = \partial_{i} \partial_{i^{\prime}} g(\mathbf x) = \partial_{i} \partial_{i^{\prime}} \phi_{(i,i^{\prime})}(x_i, x_{i^{\prime}}).$$ (C.23) For the choice $${m^{\prime}_{x}} > {\lambda}_2^{-1}$$, $$\exists \mathbf x^{*} \in {\chi}_i$$ such that $$|{\partial_{i} \partial_{i^{\prime}} \phi_{(i,i^{\prime})}(x^{*}_i, x^{*}_{i^{\prime}})}| > {D}_2$$. This is clear from the construction of $${\chi}_i$$ and on account of Assumption 3. If we guarantee that $${\tau^{\prime}} < {D}_2/2$$ holds, then consequently $$\left|{\frac{\widehat{\partial}_i g(\mathbf x^{*} + {\mu_1} {\mathbf v}_0(i)) - \widehat{\partial}_i g(\mathbf x^{*})}{{\mu_1}}}\right| > {D}_2 - {\tau^{\prime}} > {\tau^{\prime}},$$ (C.24) meaning that the pair $$(i,i^{\prime})$$ can be identified. Lastly, it is easily verified that the requirement $${\tau^{\prime}} < {D}_2/2$$ translates to the stated conditions on $${\beta}, {\mu_1}$$. C.3 Proof of Theorem 3.3 We begin by first establishing the conditions that guarantee $${\widehat{{{\mathscr{S}}}}} = {\mathscr{S}}$$ and then derive conditions that guarantee exact recovery of $${\mathscr{S}_1},{\mathscr{S}_2}$$. Estimation of $${\mathscr{S}}$$. We first note that (3.6) now changes to $$\mathbf y = {\mathbf{V}}{\nabla} f(\mathbf x) + {\mathbf n} + {\mathbf{{z}}},$$ where $${z}_{j} = ({z^{\prime}}_{j,1} - {z^{\prime}}_{j,2})/(2{\mu})$$ represents the external noise component, for $$j=1,\dots,{m_{v}}$$.
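The structure of these noisy linear measurements can be illustrated on a toy instance. The sketch below (the function $f(\mathbf x) = x_1^3 + x_2 x_3$, the dimensions and the bounded uniform noise are illustrative assumptions) forms $\mathbf y = \mathbf V \nabla f(\mathbf x) + \mathbf n + \mathbf z$ with Bernoulli directions and checks that each coordinate deviates from $(\mathbf V \nabla f(\mathbf x))_j$ by at most the Taylor remainder plus $\varepsilon/\mu$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m_v, mu, eps = 6, 8, 0.1, 1e-4

# Toy sparse additive model with one interaction: f(x) = x1^3 + x2*x3,
# so the only nonzero third derivative is d^3/dx1^3 f = 6.
f = lambda x: x[0] ** 3 + x[1] * x[2]
x = rng.uniform(-1, 1, d)
grad = np.zeros(d); grad[:3] = 3 * x[0] ** 2, x[2], x[1]

V = rng.choice([-1.0, 1.0], (m_v, d)) / np.sqrt(m_v)   # Bernoulli directions
z_noise = rng.uniform(-eps, eps, (m_v, 2))              # bounded query noise
y = np.array([(f(x + mu * v) + z1 - f(x - mu * v) - z2) / (2 * mu)
              for v, (z1, z2) in zip(V, z_noise)])

# Per-coordinate error: Taylor remainder (here exactly mu^2 * v_1^3) plus eps/mu.
err = np.abs(y - V @ grad)
assert err.max() <= mu**2 * m_v**-1.5 + eps / mu + 1e-12
```

For this particular $f$ the third derivative is constant, so the remainder term can be computed exactly, making the bound easy to confirm.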
Since $$\parallel{{\mathbf{{z}}}}\parallel_{\infty} \leq {\varepsilon}/{\mu}$$, therefore using the bounds on $$\parallel{{\mathbf n}}\parallel_{\infty}$$ from Section C.1 one can verify that (C.2) in Corollary C.1 changes to $$\parallel{{\widehat{{{\nabla}}}} f(\mathbf x) - {\nabla} f(\mathbf x)}\parallel_2 \leq C\left(\frac{2 {\mu}^2 {B}_3 {k}}{3{m_{v}}} + \frac{{\varepsilon}\sqrt{{m_{v}}}}{{\mu}} \right).$$ (C.25) Following the same arguments mentioned in Section C.1, we observe that if $${\tau} < {D}_1/2$$ holds, then it implies that $${\widehat{{{\mathscr{S}}}}} = {\mathscr{S}}$$. Now, $${\tau} < {D}_1/2$$ is equivalent to \begin{align} \underbrace{\frac{2 {\mu}^2 {B}_3 {k}}{3{m_{v}}}}_{a{\mu}^2} + \underbrace{\frac{{\varepsilon}\sqrt{{m_{v}}}}{{\mu}}}_{\frac{b{\varepsilon}}{{\mu}}} < \frac{{D}_1}{2C} \quad \Leftrightarrow \quad {\mu}^3 - {\frac{{D}_1}{2aC}{\mu}} + \frac{b{\varepsilon}}{a} < 0. \end{align} (C.26) Equation (C.26) is a cubic inequality. Recall from Section B that a cubic equation of the form: $$y^3 + py + q = 0$$, has three distinct real roots if its discriminant $$\frac{p^3}{27} + \frac{q^2}{4} < 0$$. Note that for this to be possible, $$p$$ must be negative, which is the case in (C.26). Applying the discriminant condition on (C.26) leads to \begin{align} -\frac{{D}_1^3}{27 \cdot 8 a^3 C^3} + \frac{b^2}{4 a^2}{\varepsilon}^2 < 0 \quad \Leftrightarrow \quad {\varepsilon} < \frac{{D}_1^{3/2}}{3 b C\sqrt{6a C}}. \end{align} (C.27) Also, recall from (B.4) that the $$3$$ distinct real roots of the cubic equation are then given by: $$y_1 = 2\sqrt{-p/3}\cos(\theta/3), \ y_2 = -2\sqrt{-p/3}\cos(\theta/3 + \pi/3), \ y_3 = -2\sqrt{-p/3}\cos(\theta/3 - \pi/3),$$ (C.28) where $$\theta = \cos^{-1}\left(\frac{-q/2}{\sqrt{-p^3/27}}\right)$$. In particular, if $$q > 0$$, then one can verify that $$y^3 + py + q < 0$$ holds if $$y \in (y_2,y_1)$$. 
Applying this to the cubic equation corresponding to (C.26), we consequently obtain $${\mu} \in \left(2\sqrt{\frac{{D}_1}{6aC}}\cos(\theta_1/3 - 2\pi/3), 2\sqrt{\frac{{D}_1}{6aC}}\cos(\theta_1/3)\right),$$ (C.29) where $$\theta_1 = \cos^{-1}(-{\varepsilon}/{\varepsilon}_1)$$. Estimation of $${\mathscr{S}_1}, {\mathscr{S}_2}$$. On account of noise, we first note that (C.13) changes to $$\underbrace{\frac{g(\mathbf x + {\beta} {\mathbf{e}}_1(i)) - g(\mathbf x - {\beta} {\mathbf{e}}_1(i))}{2{\beta}}}_{{\widehat{{\partial_i}}} g(\mathbf x)} = \underbrace{\langle{\mathbf{e}}_1(i),{\nabla} g(\mathbf x)\rangle}_{\partial_i g(\mathbf x)} + \underbrace{\frac{{R}_3(\zeta) - {R}_3(\zeta^{\prime})}{2{\beta}}}_{{\eta}_i(\mathbf x,{\beta}) = O({\beta}^2)} + \underbrace{\frac{{z^{\prime}}_{i,1} - {z^{\prime}}_{i,2}}{2 {\beta}}}_{{z}_i(\mathbf x,{\beta})}.$$ (C.30) This in turn results in (3.18) changing to \begin{align} &\frac{{\widehat{{\partial_i}}} g(\mathbf x + {\mu_1}{\mathbf v}) - {\widehat{{\partial_i}}} g(\mathbf x)}{{\mu_1}}\notag\\ &\quad = \langle{\nabla} \partial_i g(\mathbf x),{\mathbf v}\rangle + \underbrace{\frac{{\mu_1}}{2} {\mathbf v}^{\rm T} {\nabla^2} \partial_i g(\zeta_i) {\mathbf v} + \frac{{\eta}_i(\mathbf x + {\mu_1}{\mathbf v},{\beta}) - {\eta}_i(\mathbf x,{\beta})}{{\mu_1}} + \frac{{z}_i(\mathbf x+{\mu_1}{\mathbf v},{\beta}) - {z}_i(\mathbf x,{\beta})}{{\mu_1}}}_{\text{Error term}}. \end{align} (C.31) Using (C.18) and (C.20) and noting that $$|{({z}_i(\mathbf x+{\mu_1}{\mathbf v},{\beta}) - {z}_i(\mathbf x,{\beta}))/{\mu_1}}| \leq 2{\varepsilon}/({\beta} {\mu_1})$$, then by denoting $${\tau^{\prime}}$$ to be an upper bound on the magnitude of the error term in (C.31), we have that $${\tau^{\prime}} = 2 {\mu_1} {B}_3 + \frac{{\beta}^2 {B}_3}{3{\mu_1}} + \frac{2{\varepsilon}}{{\beta}{\mu_1}}$$. 
Following the same argument as in Section C.2, we have that $${\tau^{\prime}} < {D}_2/2$$ implies $${\widehat{{{\mathscr{S}_1}}}} = {\mathscr{S}_1}$$ and $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$. The condition $${\tau^{\prime}} < {D}_2/2$$ is equivalent to \begin{align} 2 {\mu_1} {B}_3 + \frac{{\beta}^2 {B}_3}{3{\mu_1}} + \frac{2{\varepsilon}}{{\beta}{\mu_1}} &< \frac{{D}_2}{2} \\ \end{align} (C.32) \begin{align} \Leftrightarrow 6{B}_3 {\beta} {\mu_1}^2 - \frac{3{D}_2}{2} {\beta}{\mu_1} + ({\beta}^3{B}_3 + 6{\varepsilon}) &< 0. \end{align} (C.33) Solving (C.33) in terms of $${\mu_1}$$ leads to \begin{align} {\mu_1} &\in \left(\frac{\frac{3{D}_2}{2}{\beta} - \sqrt{\frac{9{D}_2^2}{4}{\beta}^2 - 24{\beta}{B}_3({\beta}^3{B}_3 + 6{\varepsilon})}}{12{B}_3{\beta}}, \frac{\frac{3{D}_2}{2}{\beta} + \sqrt{\frac{9{D}_2^2}{4}{\beta}^2 - 24{\beta}{B}_3({\beta}^3{B}_3 + 6{\varepsilon})}}{12{B}_3{\beta}}\right) \\ \end{align} (C.34) \begin{align} \Leftrightarrow {\mu_1} &\in \left(\frac{{D}_2 - \sqrt{{D}_2^2 - \frac{32}{3{\beta}}{B}_3({\beta}^3{B}_3 + 6{\varepsilon})}}{8{B}_3}, \frac{{D}_2 + \sqrt{{D}_2^2 - \frac{32}{3{\beta}}{B}_3({\beta}^3{B}_3 + 6{\varepsilon})}}{8{B}_3}\right) \end{align} (C.35) Now in order for the above condition on $${\mu_1}$$ to be meaningful, we require $${D}_2^2 - \frac{32}{3{\beta}}{B}_3({\beta}^3{B}_3 + 6{\varepsilon}) > 0 \quad \Leftrightarrow \quad {\beta}^3 - \frac{3{D}_2^2}{32{B}_3^2} {\beta} + \frac{6{\varepsilon}}{{B}_3} < 0.$$ (C.36) Since (C.36) is a cubic inequality, therefore by following the steps described earlier (for identification of $${\mathscr{S}}$$), one readily obtains the stated conditions on $${\mu_1}, {\varepsilon}$$ and $${\beta}$$. C.4 Proof of Theorem 3.4 We first derive conditions for estimating $${\mathscr{S}}$$ and then for estimating $${\mathscr{S}_1}, {\mathscr{S}_2}$$. Estimating $${\mathscr{S}}$$. 
Upon resampling $$N_1$$ times and averaging, we have for the noise vector $$\mathbf z \in {\mathbb{R}}^{{m_{v}}}$$ that $$\mathbf z = \left[\frac{({z^{\prime}}_{1,1} - {z^{\prime}}_{1,2})}{2{\mu}} \cdots \frac{({z^{\prime}}_{{m_{v}},1} - {z^{\prime}}_{{m_{v}},2})}{2{\mu}} \right],$$ (C.37) where $${z^{\prime}}_{j,1}, {z^{\prime}}_{j,2} \sim {\mathscr{N}}(0,\sigma^2/N_1)$$ are i.i.d. Our aim is to guarantee that $$|{{z^{\prime}}_{j,1} - {z^{\prime}}_{j,2}}| < 2{\varepsilon}$$ holds $$\forall j=1,\dots,{m_{v}}$$ and across all points where $${\nabla} f$$ is estimated. Indeed, we then obtain a bounded noise model and can simply use the analysis for the setting of arbitrary bounded noise. To this end, note that $${z^{\prime}}_{j,1} - {z^{\prime}}_{j,2} \sim {\mathscr{N}}(0,\frac{2\sigma^2}{N_1})$$. It can be shown that for any $$X \sim {\mathscr{N}}(0,1)$$ we have $$\mathbb{P}(|{X}| > t) \leq { 2 \,{\rm e}^{-t^2/2}}, \quad \forall t > 0.$$ (C.38) Since $${z^{\prime}}_{j,1} - {z^{\prime}}_{j,2} = \sigma\sqrt{\frac{2}{N_1}} X$$, therefore for any $${\varepsilon} > 0$$ we have \begin{align} \mathbb{P}(|{{z^{\prime}}_{j,1} - {z^{\prime}}_{j,2}}| > 2{\varepsilon}) &= \mathbb{P}\left(|{X}| > \frac{2{\varepsilon}}{\sigma}\sqrt{\frac{N_1}{2}}\right) \\ \end{align} (C.39) \begin{align} &\leq {2} \exp\left(-\frac{{\varepsilon}^2 N_1}{\sigma^2}\right). \end{align} (C.40) Now to estimate $${\nabla} f(\mathbf x),$$ we have $${m_{v}}$$ many ‘difference’ terms: $${z^{\prime}}_{j,1} - {z^{\prime}}_{j,2}$$. As this is done for each $$\mathbf x \in {\chi}$$, therefore we have a total of $${m_{v}}(2{m_{x}}+1)^2{|{{{{\mathscr{H}}_{2}^{d}}}}|}$$ many difference terms. 
Taking a union bound over all of them, we have for any $$p_1 \in (0,1)$$ that the choice $$N_1 > \frac{\sigma^2}{{\varepsilon}^2} \log (\frac{2}{p_1}{m_{v}}(2{m_{x}}+1)^2{|{{{{\mathscr{H}}_{2}^{d}}}}|})$$ implies that the magnitudes of all difference terms are bounded by $$2{\varepsilon}$$, with probability at least $$1-p_1$$. Estimating $${\mathscr{S}_1}, {\mathscr{S}_2}$$. In this case, we resample each query $$N_2$$ times and average—therefore the variance of the noise terms gets scaled by $$N_2$$. Note that for each $$i \in {\mathscr{S}}$$ and $$\mathbf x \in {\chi}_i$$, we have two difference terms corresponding to external noise—one corresponding to $${\widehat{{\partial_i}}} g(\mathbf x)$$ and the other corresponding to $${\widehat{{\partial_i}}} g(\mathbf x + {\mu_1} {\mathbf v})$$. This means that in total we have at most $${k}(2 {{m^{\prime}_{x}}}^{2} + \lceil{\log {k}}\rceil)$$ many difference terms arising. Therefore, taking a union bound over all of them, we have for any $$p_2 \in (0,1)$$ that the choice $$N_2 > \frac{\sigma^2}{{{\varepsilon^{\prime}}}^2} {{\log}(\frac{2 {k}(2 {{m^{\prime}_{x}}}^{2} + \lceil{\log {k}}\rceil)}{p_2})}$$ implies that the magnitudes of all difference terms are bounded by $$2{\varepsilon^{\prime}}$$, with probability at least $$1-p_2$$. Appendix D. Proofs for Section 4 D.1 Proof of Theorem 4.1 The proof is divided into the following steps. Bounding the $${\mathbf{\eta_{q,2}}}$$ term. The proof of this step is similar to that of Corollary C.1. 
Since $${\nabla} f(\mathbf x)$$ is at most $${k}$$ sparse, therefore for any $$\mathbf x \in {\mathbb{R}}^{{d}}$$ we immediately have from Theorem 3.1 and (3.10) the following: $$\exists C_1, c_4^{\prime}> 0, c_1^{\prime} \geq 1$$ such that for $$c_1^{\prime} {k} \log(\frac{{d}}{{k}}) < {m_{v}} < \frac{{d}}{(\log 6)^2}$$ we have with probability at least $$1 - \,{\rm e}^{-c_4^{\prime}{m_{v}}} - \,{\rm e}^{-\sqrt{{m_{v}}{d}}}$$ that $$\parallel{{\widehat{{{\nabla}}}} f(\mathbf x) - {\nabla} f(\mathbf x)}\parallel_2 \leq C_1 \max\left\{{\parallel{{\mathbf n}}\parallel_2, \sqrt{\log {d}}\parallel{{\mathbf n}}\parallel_{\infty}}\right\}.$$ (D.1) Recall from (3.5) that $${\mathbf n} = [{n}_1 \dots {n}_{{m_{v}}}],$$ where $${n}_j = \frac{{R}_3(\zeta_{j}) - {R}_3(\zeta^{\prime}_{j})}{2{\mu}}$$, for some $$\zeta_j,\zeta^{\prime}_j \in {\mathbb{R}}^{{d}}$$. Here, $${R}_3(\zeta)$$ denotes the third-order Taylor remainder terms of $$f$$. By taking the structure of $$f$$ into account, we can uniformly bound $$|{{R}_3(\zeta_j)}|$$ as follows (so the same bound holds for $$|{{R}_3(\zeta^{\prime}_j)}|$$). Let us define $${\alpha} := |{\left\{{q \in {\mathscr{S}_2^{\text{var}}}: {\rho}(q) > 1}\right\}}|$$ to be the number of variables in $${\mathscr{S}_2^{\text{var}}}$$, with degree greater than one. 
\begin{align} |{{R}_3(\zeta_j)}| &= \frac{{\mu}^3}{6} |\sum_{p \in {\mathscr{S}_1}} \partial_p^3 \phi_p(\zeta_{j,p}) v_p^3 + \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}} ( \partial_l^3 \phi_{(l,l^{\prime})}(\zeta_{j,l}, \zeta_{j,{l^{\prime}}}) v_l^3 + \partial_{l^{\prime}}^3 \phi_{(l,l^{\prime})}(\zeta_{j,l}, \zeta_{j,{l^{\prime}}}) v_{l^{\prime}}^3) \nonumber \\ &+ \sum_{(l,l^{\prime}) \in {\mathscr{S}_2}} (3\partial_l \partial_{l^{\prime}}^2 \phi_{(l,l^{\prime})}(\zeta_{j,l}, \zeta_{j,{l^{\prime}}}) v_l v_{l^{\prime}}^2 + 3\partial_l^2 \partial_{l^{\prime}} \phi_{(l,l^{\prime})}(\zeta_{j,l}, \zeta_{j,{l^{\prime}}}) v_l^2 v_{l^{\prime}}) + \sum_{q \in {\mathscr{S}_2^{\text{var}}}: {\rho}(q)>1} \partial_q^3 \phi_q(\zeta_{j,q}) v_q^3| \\ \end{align} (D.2) \begin{align} &\leq \frac{{\mu}^3}{6} \left(\frac{{k_1} {B}_3}{{m_{v}}^{3/2}} + \frac{2{k_2}{B}_3}{{m_{v}}^{3/2}} + \frac{\alpha{B}_3}{{m_{v}}^{3/2}} + \frac{6{k_2}{B}_3}{{m_{v}}^{3/2}} \right) \\ \end{align} (D.3) \begin{align} &= \frac{{\mu}^3}{6}\frac{({k_1} + \alpha + 8{k_2}){B}_3}{{m_{v}}^{3/2}}. \end{align} (D.4) Using the fact $$2{k_2} = \sum_{l \in {\mathscr{S}_2^{\text{var}}}: {\rho}(l) > 1} {\rho}(l) + ({|{{{\mathscr{S}_2^{\text{var}}}}}|} - {\alpha})$$, we can observe that $$2{k_2} \leq {\rho_{m}}{\alpha} + ({|{{{\mathscr{S}_2^{\text{var}}}}}|} - {\alpha}) = {|{{{\mathscr{S}_2^{\text{var}}}}}|} + ({\rho_{m}}-1){\alpha}$$. Plugging this in (D.4), and using the fact $$\alpha \leq {k}$$ (since we do not assume $$\alpha$$ to be known), we obtain \begin{align} |{{R}_3(\zeta_j)}| \leq \frac{{\mu}^3}{6}\frac{({k_1} + {\alpha} + 4{|{{{\mathscr{S}_2^{\text{var}}}}}|} + 4({\rho_{m}}-1){\alpha}){B}_3}{{m_{v}}^{3/2}} \leq \frac{{\mu}^3(4{k} + (4{\rho_{m}}-3){\alpha}){B}_3}{6{m_{v}}^{3/2}} \leq \frac{{\mu}^3((4{\rho_{m}}+1){k}){B}_3}{6{m_{v}}^{3/2}}. \end{align} (D.5) This in turn implies that $$\parallel{{\mathbf n}}\parallel_{\infty} \leq \frac{{\mu}^2((4{\rho_{m}}+1){k}){B}_3}{6{m_{v}}^{3/2}}$$. 
Using the fact $$\parallel{{\mathbf n}}\parallel_{2} \leq \sqrt{{m_{v}}}\parallel{{\mathbf n}}\parallel_{\infty}$$, we thus obtain for the stated choice of $${m_{v}}$$ (cf. Remark 3.4) that $$\parallel{{\widehat{{{\nabla}}}} f(\mathbf x) - {\nabla} f(\mathbf x)}\parallel_2 \leq \frac{C_1{\mu}^2((4{\rho_{m}}+1){k}){B}_3}{6{m_{v}}} \quad \forall \mathbf x \in [-(1+r),1+r]^{{d}}.$$ (D.6) Recall that $$[-(1+r),1+r]^{{d}}, r > 0$$, denotes the enlargement around $$[-1,1]^{{d}}$$, in which the smoothness properties of $$\phi_p, \phi_{(l,l^{\prime})}$$ are defined in Section 2 (as Assumption 1). Also recall $$\mathbf w(\mathbf x) \in {\mathbb{R}}^{{d}}, {\mathbf{\eta_{q,2}}} \in {\mathbb{R}}^{{m_{v^{\prime}}}}$$ from (4.4). Since $$\parallel{\mathbf w(\mathbf x)}\parallel_{\infty} \leq \parallel{{\widehat{{{\nabla}}}} f(\mathbf x) - {\nabla} f(\mathbf x)}\parallel_2$$, this then implies that $$\parallel{{\mathbf{\eta_{q,2}}}}\parallel_{\infty} \leq \frac{C_1{\mu}^2((4{\rho_{m}}+1){k}){B}_3}{3{m_{v}}{\mu_1}}$$. Bounding the $${\mathbf{\eta_{q,1}}}$$ term. We will bound $$\parallel{{\mathbf{\eta_{q,1}}}}\parallel_{\infty}$$. To this end, we see from (4.4) that it suffices to uniformly bound $$|{{{\mathbf v^{\prime}}}^{{\rm T}} {\nabla^2} \partial_q f(\zeta) {\mathbf v^{\prime}}}|$$, over all: $$q \in {\mathscr{S}_1} \cup {\mathscr{S}_2^{\text{var}}}$$, $${\mathbf v^{\prime}} \in {\mathscr{V}}p$$, $$\zeta \in [-(1+r),(1+r)]^{{d}}$$. Note that $${{\mathbf v^{\prime}}}^{\rm T} {\nabla^2} \partial_q f(\zeta) {\mathbf v^{\prime}} = \sum_{l=1}^{{d}} {v_l^{\prime}}^2 ({\nabla^2} \partial_q f)_{l,l} + \sum_{i \neq j = 1}^{{d}} v_i^{\prime}v_j^{\prime} ({\nabla^2} \partial_q f)_{i,j}.$$ (D.7) We have the following three cases, depending on the type of $$q$$. 
$$\mathbf{q \in {\mathscr{S}_1}.}$$ $${{\mathbf v^{\prime}}}^{\rm T} {\nabla^2} \partial_q f(\zeta) {\mathbf v^{\prime}} = {v^{\prime}_{q}}^2 \partial_q^3 \phi_q(\zeta_q) \Rightarrow |{{{\mathbf v^{\prime}}}^{\rm T} {\nabla^2} \partial_q f(\zeta) {\mathbf v^{\prime}}}| \leq \frac{{B}_3}{{m_{v^{\prime}}}}.$$ (D.8) $$\mathbf{(q,q^{\prime}) \in {\mathscr{S}_2}}$$, $$\mathbf{{\rho}(q) = 1.}$$ \begin{align} {{\mathbf v^{\prime}}}^{\rm T} {\nabla^2} \partial_q f(\zeta) {\mathbf v^{\prime}} &= {v^{\prime}_{q}}^2 \partial_q^3 \phi_{(q,q^{\prime})}(\zeta_q,\zeta_{q^{\prime}}) + {v^{\prime}_{q^{\prime}}}^2 \partial_{q^{\prime}}^2 \partial_q \phi_{(q,q^{\prime})}(\zeta_q,\zeta_{q^{\prime}}) + 2 v^{\prime}_{q} v^{\prime}_{q^{\prime}} \partial_{q^{\prime}} \partial_q^2 \phi_{(q,q^{\prime})}(\zeta_q,\zeta_{q^{\prime}}), \\ \end{align} (D.9) \begin{align} \Rightarrow |{{{\mathbf v^{\prime}}}^{\rm T} {\nabla^2} \partial_q f(\zeta) {\mathbf v^{\prime}}}| &\leq \frac{4{B}_3}{{m_{v^{\prime}}}}. \end{align} (D.10) $$\mathbf{q \in {\mathscr{S}_2^{\text{var}}}}$$, $$\mathbf{{\rho}(q) > 1.}$$ \begin{align} {{\mathbf v^{\prime}}}^{\rm T} {\nabla^2} \partial_q f(\zeta) {\mathbf v^{\prime}} &= {v^{\prime}_{q}}^2(\partial_q^3 \phi_q(\zeta_q) + \sum_{(q,q^{\prime}) \in {\mathscr{S}_2}} \partial_q^3 \phi_{(q,q^{\prime})}(\zeta_{q},\zeta_{q^{\prime}}) + \sum_{(q^{\prime},q) \in {\mathscr{S}_2}} \partial_q^3 \phi_{(q^{\prime},q)}(\zeta_{q^{\prime}},\zeta_{q})) \nonumber \\ &+ \sum_{(q,q^{\prime}) \in {\mathscr{S}_2}} {v^{\prime}_{q^{\prime}}}^2 \partial_{q^{\prime}}^2 \partial_q \phi_{(q,q^{\prime})}(\zeta_{q},\zeta_{q^{\prime}}) + \sum_{(q^{\prime},q) \in {\mathscr{S}_2}} {v^{\prime}_{q^{\prime}}}^2 \partial_{q^{\prime}}^2 \partial_q \phi_{(q^{\prime},q)}(\zeta_{q^{\prime}},\zeta_{q}) \nonumber \\ &+ 2\sum_{(q,q^{\prime}) \in {\mathscr{S}_2}} v^{\prime}_{q} v^{\prime}_{q^{\prime}} \partial_{q^{\prime}} \partial_q^2 \phi_{(q,q^{\prime})}(\zeta_{q},\zeta_{q^{\prime}}) + 2\sum_{(q^{\prime},q) \in 
{\mathscr{S}_2}} v^{\prime}_{q} v^{\prime}_{q^{\prime}} \partial_{q^{\prime}} \partial_q^2 \phi_{(q^{\prime},q)}(\zeta_{q^{\prime}},\zeta_{q}), \\ \end{align} (D.11) \begin{align} \Rightarrow |{{{\mathbf v^{\prime}}}^{\rm T} {\nabla^2} \partial_q f(\zeta) {\mathbf v^{\prime}}}| &\leq \frac{1}{{m_{v^{\prime}}}}(({\rho_{m}}+1){B}_3 + {\rho_{m}}{B}_3 + 2{\rho_{m}}{B}_3) = \frac{(4{\rho_{m}}+1){B}_3}{{m_{v^{\prime}}}}. \end{align} (D.12) We can now uniformly bound $$\parallel{{\mathbf{\eta_{q,1}}}}\parallel_{\infty}$$ as follows. $$\parallel{{\mathbf{\eta_{q,1}}}}\parallel_{\infty} := \max_{j=1,\dots,{m_{v^{\prime}}}} \frac{{\mu_1}}{2} |{{{{\mathbf v^{\prime}}_j}^{\rm T} {\nabla^2} \partial_q f(\zeta_j) {\mathbf v^{\prime}}_j}}| \leq \frac{{\mu_1}(4{\rho_{m}}+1){B}_3}{2{m_{v^{\prime}}}}.$$ (D.13) Estimating $${\mathscr{S}_2}$$. We now proceed toward estimating $${\mathscr{S}_2}$$. To this end, we estimate $${\nabla} \partial_q f(\mathbf x)$$ for each $$q=1,\dots,{d}$$ and $$\mathbf x \in {\chi}$$. 
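The underlying two-stage finite-difference idea, first estimating $\partial_q$ by central differences and then differencing that estimate along a second coordinate as in (C.13) and (3.18), can be sketched on a toy interaction. The function $g(\mathbf x) = x_1 x_2 + x_3^3$ and the step sizes below are illustrative assumptions, and the sketch omits the compressed-sensing step:

```python
def g(x):
    # Toy model: interaction between coordinates 1 and 2, univariate term in 3.
    return x[0] * x[1] + x[2] ** 3

def partial_hat(q, x, beta=1e-3):
    # Central-difference estimate of the partial derivative along coordinate q.
    xp, xm = list(x), list(x)
    xp[q] += beta; xm[q] -= beta
    return (g(xp) - g(xm)) / (2 * beta)

def mixed_hat(q, qp, x, mu1=1e-3):
    # Forward difference of the estimated partial along coordinate qp:
    # approximates the mixed partial d^2 g / (dx_q dx_qp).
    xs = list(x); xs[qp] += mu1
    return (partial_hat(q, xs) - partial_hat(q, x)) / mu1

x = (0.3, -0.5, 0.2)
est = mixed_hat(0, 1, x)        # true mixed partial of x1*x2 is identically 1
assert abs(est - 1.0) < 1e-7
```

Since $g$ is bilinear in $(x_1, x_2)$, the two-stage estimate recovers the mixed partial essentially exactly here; for general $\mathscr{C}^3$ functions the error terms are the ones bounded above.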
Since $${\nabla} \partial_q f(\mathbf x)$$ is at most $$({\rho_{m}}+1)$$sparse, therefore Theorem 3.1 and (3.10), immediately yield the following: $$\exists C_2, c_5^{\prime}> 0, c_2^{\prime} \geq 1$$ such that for $$c_2^{\prime} {\rho_{m}} \log(\frac{{d}}{{\rho_{m}}}) < {m_{v^{\prime}}} < \frac{{d}}{(\log 6)^2}$$ we have with probability at least $$1 - \,{\rm e}^{-c_5^{\prime}{m_{v^{\prime}}}} - \,{\rm e}^{-\sqrt{{m_{v^{\prime}}}{d}}}$$ that $$\parallel{{\widehat{{{\nabla}}}} \partial_q f(\mathbf x) - {\nabla} \partial_q f(\mathbf x)}\parallel_2 \leq C_2 \max\left\{{\parallel{{\mathbf{\eta_{q,1}}}+{\mathbf{\eta_{q,2}}}}\parallel_2, \sqrt{\log {d}}\parallel{{\mathbf{\eta_{q,1}}}+{\mathbf{\eta_{q,2}}}}\parallel_{\infty}}\right\}.$$ (D.14) Since $$\parallel{{\mathbf{\eta_{q,1}}}+{\mathbf{\eta_{q,2}}}}\parallel_{\infty} \leq \parallel{{\mathbf{\eta_{q,1}}}}\parallel_{\infty} + \parallel{{\mathbf{\eta_{q,2}}}}\parallel_{\infty}$$, therefore using the bounds on $$\parallel{{\mathbf{\eta_{q,1}}}}\parallel_{\infty},\parallel{{\mathbf{\eta_{q,2}}}}\parallel_{\infty}$$ and noting that $$\parallel{{\mathbf{\eta_{q,1}}}+{\mathbf{\eta_{q,2}}}}\parallel_2 \leq \sqrt{{m_{v^{\prime}}}}\parallel{{\mathbf{\eta_{q,1}}}+{\mathbf{\eta_{q,2}}}}\parallel_{\infty}$$, we obtain for the stated choice of $${m_{v^{\prime}}}$$ (cf. Remark 3.4) that \begin{align} \parallel{{\widehat{{{\nabla}}}} \partial_q f(\mathbf x) - {\nabla} \partial_q f(\mathbf x)}\parallel_2 &\leq \underbrace{C_2\left(\frac{{\mu_1}(4{\rho_{m}}+1){B}_3}{2\sqrt{{m_{v^{\prime}}}}} + \frac{C_1\sqrt{{m_{v^{\prime}}}}{\mu}^2((4{\rho_{m}}+1){k}){B}_3}{3{m_{v}}{\mu_1}} \right)}_{{\tau^{\prime}}} \quad q=1,\dots,{d}, \notag\\ &\quad \forall \mathbf x \in [-1,1]^{{d}}. 
\end{align} (D.15) We next note that (D.15) trivially leads to the bound $$\widehat{\partial_q\partial_{q^{\prime}}} f(\mathbf x) \in [\partial_q\partial_{q^{\prime}} f(\mathbf x) -{\tau^{\prime}}, \partial_q\partial_{q^{\prime}} f(\mathbf x) + {\tau^{\prime}}] \quad q,q^{\prime} = 1,\dots,{d}.$$ (D.16) Now if $$q \notin {\mathscr{S}_2^{\text{var}}}$$, then clearly $$\widehat{\partial_q\partial_{q^{\prime}}} f(\mathbf x) \in [-{\tau^{\prime}}, {\tau^{\prime}}]$$ $$\forall \mathbf x \in [-1,1]^{{d}}, q \neq q^{\prime}$$. On the other hand, if $$(q,q^{\prime}) \in {\mathscr{S}_2}$$, then $$\widehat{\partial_q\partial_{q^{\prime}}} f(\mathbf x) \in [\partial_q\partial_{q^{\prime}} \phi_{(q,q^{\prime})}(x_q,x_{q^{\prime}}) -{\tau^{\prime}}, \partial_q\partial_{q^{\prime}} \phi_{(q,q^{\prime})}(x_q,x_{q^{\prime}}) + {\tau^{\prime}}].$$ (D.17) If furthermore $${m_{x}} \geq {\lambda}_2^{-1}$$, then due to the construction of $${\chi}$$, $$\exists \mathbf x \in {\chi}$$ so that $$|{\widehat{\partial_q\partial_{q^{\prime}}} f(\mathbf x)}| \geq {D}_2 - {\tau^{\prime}}$$. Hence, if $${\tau^{\prime}} < {D}_2/2$$ holds, then we would have $$|{\widehat{\partial_q\partial_{q^{\prime}}} f(\mathbf x)}| > {D}_2/2$$, leading to the identification of $$(q,q^{\prime})$$. Since this is true for each $$(q,q^{\prime}) \in {\mathscr{S}_2}$$, it follows that $${\widehat{{{\mathscr{S}_2}}}} = {\mathscr{S}_2}$$. Now, $${\tau^{\prime}} < {D}_2/2$$ is equivalent to \begin{align} \underbrace{\frac{(4{\rho_{m}}+1){B}_3}{2\sqrt{{m_{v^{\prime}}}}}}_{a} {\mu_1} + \underbrace{\left(\frac{C_1\sqrt{{m_{v^{\prime}}}}((4{\rho_{m}}+1){k}){B}_3}{3{m_{v}}}\right)}_{b}\frac{{\mu}^2}{{\mu_1}} < \frac{{D}_2}{2C_2} \Leftrightarrow a {\mu_1}^2 - \frac{{D}_2}{2C_2}{\mu_1} + b{\mu}^2 < 0 \\ \end{align} (D.18) \begin{align} \Leftrightarrow {\mu_1} \in \left(({D}_2/(4aC_2)) - \sqrt{({D}_2/(4aC_2))^2 - (b{\mu}^2/a)} , ({D}_2/(4aC_2)) + \sqrt{({D}_2/(4aC_2))^2 - (b{\mu}^2/a)} \right).
\end{align} (D.19) Lastly, we see that the bounds in (D.19) are valid if: $${\mu}^2 < \frac{{D}_2^2}{16 a b C_2^2} = \frac{3 {D}_2^2 {m_{v}}}{8C_1 C_2^2 {B}_3^2(4{\rho_{m}}+1)((4{\rho_{m}}+1){k})}.$$ (D.20) Estimating $${\mathscr{S}_1}$$. With $${\mathscr{P}} := [{d}] \setminus \widehat{{\mathscr{S}_2^{\text{var}}}}$$, we have via Taylor’s expansion of $$f$$: \begin{align} &\frac{f((\mathbf x + {\mu^{\prime}}{\mathbf v^{\prime\prime}}_j)_{{\mathscr{P}}}) - f((\mathbf x - {\mu^{\prime}}{\mathbf v^{\prime\prime}}_j)_{{\mathscr{P}}})}{2{\mu^{\prime}}}\notag\\ &\quad = \langle({\mathbf v^{\prime\prime}}_j)_{{\mathscr{P}}},({\nabla} f((\mathbf x)_{{\mathscr{P}}}))_{{\mathscr{P}}}\rangle + \underbrace{\frac{{R}_3((\zeta_j)_{{\mathscr{P}}}) - {R}_3((\zeta_j^{\prime})_{{\mathscr{P}}})}{2{\mu^{\prime}}}}_{{n}_j}\quad j=1,\dots,{m_{v^{\prime\prime}}}. \end{align} (D.21) Equation (D.21) corresponds to linear measurements of the $$({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|})$$-sparse vector: $$({\nabla} f((\mathbf x)_{{\mathscr{P}}}))_{{\mathscr{P}}}$$. We now proceed similarly to the proof of Corollary C.1. Note that we effectively perform $$\ell_1$$ minimization over $${\mathbb{R}}^{|{{\mathscr{P}}}|}$$.
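The $\ell_1$ minimization step can be sketched generically via the standard linear-programming formulation of basis pursuit (a minimal illustration using scipy.optimize.linprog; the dimensions, the Gaussian measurement matrix and the sparse vector are arbitrary assumptions, not the paper's construction):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, m = 30, 18
g = np.zeros(d); g[3], g[11] = 1.5, -2.0     # a 2-sparse "gradient" vector
V = rng.standard_normal((m, d)) / np.sqrt(m)
y = V @ g                                    # noiseless linear measurements

# Basis pursuit  min ||u||_1  s.t.  V u = y,  via the split u = u+ - u-.
c = np.ones(2 * d)
res = linprog(c, A_eq=np.hstack([V, -V]), b_eq=y, bounds=(0, None))
assert res.success
u_hat = res.x[:d] - res.x[d:]

assert np.abs(V @ u_hat - y).max() < 1e-6               # measurements matched
assert np.abs(u_hat).sum() <= np.abs(g).sum() + 1e-6    # l1-minimality
print(np.abs(u_hat - g).max())     # typically near 0 at this sample size
```

The two assertions hold by construction of the LP (the true vector is feasible); exact recovery of the support is what the sample-complexity conditions of Theorem 3.1 guarantee with high probability.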
Therefore, for any $$\mathbf x \in {\mathbb{R}}^{{d}}$$ we immediately have from Theorem 3.1 and (3.10), the following: $$\exists C_3, c_6^{\prime}> 0, c_3^{\prime} \geq 1$$ such that for $$c_3^{\prime} ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) \log(\frac{|{{\mathscr{P}}}|}{{k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}}) < {m_{v^{\prime\prime}}} < \frac{|{{\mathscr{P}}}|}{(\log 6)^2}$$, we have with probability at least $$1 - \,{\rm e}^{-c_6^{\prime}{m_{v^{\prime\prime}}}} - \,{\rm e}^{-\sqrt{{m_{v^{\prime\prime}}}|{{\mathscr{P}}}|}}$$ that $$\parallel{({\widehat{{{\nabla}}}} f((\mathbf x)_{{\mathscr{P}}}))_{{\mathscr{P}}} - ({\nabla} f((\mathbf x)_{{\mathscr{P}}}))_{{\mathscr{P}}}}\parallel_2 \leq C_3 \max\left\{{\parallel{{\mathbf n}}\parallel_2, \sqrt{\log |{{\mathscr{P}}}|} \parallel{{\mathbf n}}\parallel_{\infty}}\right\}.$$ (D.22) We now uniformly bound $${R}_3((\zeta_j)_{{\mathscr{P}}})$$ for all $$j=1,\dots,{m_{v^{\prime\prime}}}$$ and $$\zeta_j \in [-(1+r),1+r]^{{d}}$$ as follows. \begin{align} {R}_3((\zeta_j)_{{\mathscr{P}}}) = \frac{{{\mu^{\prime}}}^3}{6}\sum_{p \in {\mathscr{S}_1} \cap {\mathscr{P}}} \partial_p^3 \phi_p(\zeta_{j,p}){{v^{\prime\prime}}_{j,p}}^3 \quad \Rightarrow |{{R}_3((\zeta_j)_{{\mathscr{P}}})}| \leq \frac{({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) {{\mu^{\prime}}}^3{B}_3}{6 {m_{v^{\prime\prime}}}^{3/2}}. \end{align} (D.23) This in turn implies that $$\parallel{{\mathbf n}}\parallel_{\infty} \leq \frac{({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) {{\mu^{\prime}}}^2{B}_3}{6 {m_{v^{\prime\prime}}}^{3/2}}$$ and $$\parallel{{\mathbf n}}\parallel_2 \leq \sqrt{{m_{v^{\prime\prime}}}}\parallel{{\mathbf n}}\parallel_{\infty} \leq \frac{({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) {{\mu^{\prime}}}^2{B}_3}{6 {m_{v^{\prime\prime}}}}$$. Plugging these bounds in (D.22), we obtain for the stated choice of $${m_{v^{\prime\prime}}}$$ (cf. 
Remark 3.4) that $$\parallel{({\widehat{{{\nabla}}}} f((\mathbf x)_{{\mathscr{P}}}))_{{\mathscr{P}}} - ({\nabla} f((\mathbf x)_{{\mathscr{P}}}))_{{\mathscr{P}}}}\parallel_2 \leq \underbrace{\frac{C_3 ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) {{\mu^{\prime}}}^2{B}_3}{6 {m_{v^{\prime\prime}}}}}_{{\tau^{\prime\prime}}} \quad \mathbf x \in [-1,1]^{{d}}.$$ (D.24) Finally, using the same arguments as before, we have that $${\tau^{\prime\prime}} < {D}_1/2$$ or equivalently $${{\mu^{\prime}}}^2 < \frac{3{m_{v^{\prime\prime}}} {D}_1}{C_3 ({k}-{|{{\widehat{{\mathscr{S}_2^{\text{var}}}}}}|}) {B}_3}$$ is sufficient to recover $${\mathscr{S}_1}$$. This completes the proof. D.2 Proof of Theorem 4.2 We begin by establishing the conditions pertaining to the estimation of $${\mathscr{S}_2}$$. Then, we prove the conditions for estimation of $${\mathscr{S}_1}$$. Estimation of $${\mathscr{S}_2}$$. We first note that the linear system (3.6) now has the form: $$\mathbf y = {\mathbf{V}}{\nabla} f(\mathbf x) + {\mathbf n} + {\mathbf{{z}}}$$, where $${z}_{j} = ({z^{\prime}}_{j,1} - {z^{\prime}}_{j,2})/(2{\mu})$$ represents the external noise component for $$j=1,\dots,{m_{v}}$$. Observe that $$\parallel{{\mathbf{{z}}}}\parallel_{\infty} \leq {\varepsilon}/{\mu}$$. 
Using the bounds on $$\parallel{{\mathbf n}}\parallel_{\infty}, \parallel{{\mathbf n}}\parallel_{2}$$ from Section D.1, we then observe that (D.6) changes to $$\parallel{{\widehat{{{\nabla}}}} f(\mathbf x) - {\nabla} f(\mathbf x)}\parallel_2 \leq C_1\left(\frac{{\mu}^2((4{\rho_{m}}+1){k}){B}_3}{6{m_{v}}} + \frac{{\varepsilon}\sqrt{{m_{v}}}}{{\mu}}\right) \quad \forall \mathbf x \in [-(1+r),1+r]^{{d}}.$$ (D.25) As a result, we then have that $$\parallel{{\mathbf{\eta_{q,2}}}}\parallel_{\infty} \leq C_1\left(\frac{{\mu}^2((4{\rho_{m}}+1){k}){B}_3}{3{m_{v}}{\mu_1}} + \frac{2{\varepsilon}\sqrt{{m_{v}}}}{{\mu}{\mu_1}}\right).$$ (D.26) Now note that the bound on $$\parallel{{\mathbf{\eta_{q,1}}}}\parallel_{\infty}$$ is unchanged from Section D.1, i.e. $$\parallel{{\mathbf{\eta_{q,1}}}}\parallel_{\infty} \leq \frac{{\mu_1}(4{\rho_{m}}+1){B}_3}{2{m_{v^{\prime}}}}$$. As a consequence, we see that (D.15) changes to: $$\parallel{{\widehat{{{\nabla}}}} \partial_q f(\mathbf x) - {\nabla} \partial_q f(\mathbf x)}\parallel_2 \leq \underbrace{C_2\left(\frac{{\mu_1}(4{\rho_{m}}+1){B}_3}{2\sqrt{{m_{v^{\prime}}}}} + C_1\frac{\sqrt{{m_{v^{\prime}}}}{\mu}^2((4{\rho_{m}}+1){k}){B}_3}{3{m_{v}}{\mu_1}} + \frac{2C_1{\varepsilon}\sqrt{{m_{v}}{m_{v^{\prime}}}}}{{\mu}{\mu_1}}\right)}_{{\tau^{\prime}}}.$$ (D.27) With $$a$$ and $$b$$ as stated in the Theorem, we then see that $${\tau^{\prime}} < {D}_2/2$$ is equivalent to $$a{\mu_1}^2 - \frac{{D}_2}{2C_2}{\mu_1} + \left(b{\mu}^2 + \frac{2C_1{\varepsilon}\sqrt{{m_{v}}{m_{v^{\prime}}}}}{{\mu}}\right) < 0,$$ (D.28) which in turn is equivalent to \begin{align} &{\mu_1} \in \left(\frac{{D}_2}{4aC_2} - \sqrt{\left(\frac{{D}_2}{4aC_2}\right)^2 - \left(\frac{b{\mu}^3 + 2C_1{\varepsilon}\sqrt{{m_{v}}{m_{v^{\prime}}}}}{a{\mu}}\right)} , \frac{{D}_2}{4aC_2} \right.\notag\\ &\quad \left.+ \sqrt{\left(\frac{{D}_2}{4aC_2}\right)^2 - \left(\frac{b{\mu}^3 + 2C_1{\varepsilon}\sqrt{{m_{v}}{m_{v^{\prime}}}}}{a{\mu}}\right)}\right). 
\end{align} (D.29) For the above bound to be valid, we require \begin{align} \frac{b{\mu}^2}{a} + \frac{2C_1{\varepsilon}\sqrt{{m_{v}}{m_{v^{\prime}}}}}{a{\mu}} &< \frac{{D}_2^2}{16a^2C_2^2} \\ \end{align} (D.30) \begin{align} \Leftrightarrow {\mu}^3 - \frac{{D}_2^2}{16abC_2^2}{\mu} + \frac{2C_1{\varepsilon}\sqrt{{m_{v}}{m_{v^{\prime}}}}}{b} &< 0 \end{align} (D.31) to hold. Equation (D.31) is a cubic inequality. Recall from Section B that a cubic equation of the form: $$y^3 + py + q = 0$$ has three distinct real roots if its discriminant $$\frac{p^3}{27} + \frac{q^2}{4} < 0$$. Note that for this to be possible, $$p$$ must be negative, which is the case in (D.31). Applying this to (D.31) leads to the condition: $${\varepsilon} < \frac{{D}_2^{3}}{192\sqrt{3}\, C_1 C_2^3 \sqrt{a^3 b\, {m_{v^{\prime}}} {m_{v}}}}.$$