# Conditional expectation estimation through attributable components

Conditional expectation estimation through attributable components Abstract A general methodology is proposed for the explanation of variability in a quantity of interest x in terms of covariates z = (z1, …, zL). It provides the conditional mean $$\bar{x}(z)$$ as a sum of components, where each component is represented as a product of non-parametric one-dimensional functions of each covariate zl that are computed through an alternating projection procedure. Both x and the zl can be real or categorical variables; in addition, some or all values of each zl can be unknown, providing a general framework for multi-clustering, classification and covariate imputation in the presence of confounding factors. The procedure can be considered as a preconditioning step for the more general determination of the full conditional distribution $$\boldsymbol{\rho}(x|z)$$ through a data-driven optimal-transport barycenter problem. In particular, just iterating the procedure once yields the second order structure (i.e. the covariance) of $$\boldsymbol{\rho}(x|z)$$. The methodology is illustrated through examples that include the explanation of variability of ground temperature across the continental United States and the prediction of book preference among potential readers. 1. Introduction A broad class of problems in data analysis focus on explaining variability in a quantity of interest x in terms of the values of a set of covariates z. For instance, one can attribute part of the variability in blood pressure to factors such as age and exercise. One can go one step further and seek hidden factors: new covariates z with values to be determined as part of the problem so as to account for a significant share of the variability in x. Examples are clustering, which explains variability in terms of discrete classes, and principal component analysis, which does so in terms of continuous variables z of dimension lower than the original x. The authors have recently developed (Tabak & Trigila, 2017) a unified framework for the explanation of variability, based on extensions of the mathematical theory of optimal transport. In this framework, one estimates the conditional probability distributions $$\boldsymbol{\rho}(x|z)$$ by mapping them to their Wasserstein barycenter (Agueh & Carlier, 2011). It was shown in the study by Tabak & Trigila (2017) that principal components emerge naturally from the methodology’s simplest setting, with maps restricted to rigid translations, and hence capturing not the full conditional probability distribution $$\boldsymbol{\rho}(x|z)$$, but only its conditional expectation $$\bar{x}(z)$$. From here stems one of this article’s legs: if one thinks of principal components in terms of explanation of variability, it is natural to consider more general scenarios, where the covariates, though still with values a priori unknown, are associated with particular attributes such as space, time or activity networks, and hence required to be smooth in the topologies associated with these attributes. For explaining variability, it is the subspace spanned by the principal components that matters, not the individual components. Seeking such subspace is a low-rank factorization problem, which can be computed with particular ease through alternating projections. This is the article’s second leg: we develop far-reaching extensions of principal component analysis that can be computed efficiently through variations of the alternating projection methodology. A third leg is the occurrence of missing and unstructured data. Frequently some—and sometimes most—of the observations xj are incomplete. This is for instance the case of the Netflix problem, where each viewer only ranks a small subset of all movies, and each movie is only rated by a small fraction of viewers. In addition, observations may not take place on regular grids. For instance, a person—the ‘observation’ in a census—has individual, values of age, height and weight drawn from a continuum, not a lattice. It turns out that both the extensions of principal components discussed here and their implementation in terms of alternating projections adapt very naturally to such scenarios, so they are developed without assuming that the data are complete or structured. At the core of the methodology proposed lies a formal analogy between the low-rank factorization of tensors and the expansion of multivariate functions as sums of products of single-variable ones, as in the classical construction of solutions to linear partial differential equations via separation of variables and in multivariate Fourier and Taylor expansions. Thus, when a variable x depends on many covariates z, we write it as a sum of ‘components’, each consisting of the product of functions of the individual covariates. In low-rank factorization, these one-variable functions are vectors (which can be thought of as functions of the row or column of the matrix entry being approximated), and hence completely unrestricted in the values they can adopt. When explaining variability, this is still the case for categorical covariates, such as sex or blood type. By contrast, in the separated variable solution to PDEs (Partial Differential Equation), the individual functions are typically given (sines, exponentials, special functions) and it is only the coefficients multiplying them that are fit to the data. Our procedure adopts free vectors, fitted to the data through the alternating projection methodology as in low-rank tensor factorization, but penalizes their non-smoothness as functions of those covariates z for which a distance can be defined (which includes continuous variables such as time or blood pressure, and ordinal ones such as rankings). Thus the methodology unifies the handling of categorical and non-categorical covariates. Describing multivariate functions in terms of sums of products of single-variable ones avoids the curse of dimensionality as the number of explanatory variables grows. Describing each of these single-variable functions by the values that it adopts on a grid further makes the computational cost largely independent of the number of observations available, particularly when combined with online learning and stochastic descent. In addition, the attribution of observations to grid points has a probabilistic interpretation that leads to a natural classification and clustering procedure, also applicable to the assignment of missing covariate values. The simultaneous consideration of the other covariates turns this into a general approach to multi-clustering and classification in the presence of confounding factors. Finding the conditional expectation is a regression problem, with a vast literature to which we cannot possibly do justice within the short frame of this introduction. Hence we shall only comment on methodologies that overlap significantly with the attributable components developed here. Least-square error regression (Friedman et al., 2001) is the natural starting point, as we also employ its characterization of the conditional mean as the minimizer of the variance (though in our case the computation of the conditional mean is framed within the general setting of optimal transport, where least-squares emerges naturally from the use of the 2-Wasserstein distance). Least-square regression is typically parametric, with linear models being by far the most commonly used, though these can be extended to handle nonlinearity, either explicitly through the introduction of nonlinear feature functions, or implicitly via the ‘kernel trick’ (Mohri et al., 2012). Smoothing splines (Hastie & Tibshirani, 1990) are a less parametric alternative with a penalization of non-smoothness similar to ours, though without the penalization weighting system based on the number of observations partially attributed to each node. Principal component analysis and, more generally, low-rank matrix factorization can be regarded as least-square regressions in terms of two categorical variables, the row and column of a matrix entry. Their calculation through alternate projections (Golub & Van Loan, 2012) is the computational basis for our extension to higher dimensional and less structured scenarios, while low-rank tensor factorization (Grasedyck et al., 2013) underlies our treatment of more than two categorical covariates. Our subsequent extension to real covariates has formal similarities to the multivariable adaptive regression splines (MARS) (Friedman, 1991), though its elementary components are more general than MARS’ hinges and the conceptual and computational framework of the two methodologies are substantially different. For clarity of exposition, rather than posing the proposed procedure from the start, this article follows its conceptual evolution through a series of simple steps. Thus, after this Introduction, Section 2 first briefly summarizes the relation between explanation of variability and optimal transport developed in the study by Tabak & Trigila (2017) (subsection 2.1), and restricts the maps defining the Wasserstein barycenter to rigid translations (subsection 2.2). Section 3 first considers the estimation of conditional means for tensors and arbitrary sets of categorical covariates, yielding a matrix completion problem solved through a low-rank factorization procedure. Then we consider covariates with an underlying metric, with components that are required to be smooth functions of these covariates, an extension motivated by the formal analogy between the components of a low-rank factorization and the single-variable factors in multivariate series such as Taylor’s and Fourier’s. Subsection 3.1 further parameterizes these single-variable functions in terms of the values they adopt not on the observed samples, but on an externally prescribed grid. Among the advantages of this switch is that it allows us to refine our characterization of smoothness to include higher derivatives (Section 4). Perhaps more importantly, it permits the inclusion of latent and not fully available covariates, extending to procedure to handle clustering, multi-clustering, classification and covariate inference, all in the presence of confounding factors (Section 5). The description so far addresses the explanation of variability in real variables x. Section 6 demonstrates that the procedure remains fundamentally unchanged when the x are categorical instead, as a simple embedding into Euclidean space allows one to define their conditional expected value $$\bar{x}(z)$$. Section 7 addresses the simple, but important issue of post-processing of the results, so as to make them interpretable by the user, providing the dependence of x on a selected number of covariates z through marginalization over all the others. Section 8 shows how iterating the procedure once allows one to capture not just the conditional expectation, but also the full conditional covariance structure of x. Section 9 includes a number of examples, some synthetic and some real applications, which display the versatility and power of the attributable component methodology. Finally, some conclusions and further extensions are drawn in Section 10. 2. Removal of variability attributable to known factors 2.1. The optimal transport barycenter problem Consider a variable x ∈ Rn and a set of d covariates z. Removing the variability in x explainable by z is equivalent to transforming x through a z-dependent map y = $$\boldsymbol{Y}$$(x;z), so that y is independent of z. In order not to remove additional variability in x not attributable to z, one imposes the additional condition that the maps deform the data minimally. This process finds a natural formulation in the language of optimal transport (Tabak & Trigila, 2017):   \begin{align} \min_{Y, \mu} C = \int \left[\int c\left(x, \boldsymbol{Y}(x; z) \right) \boldsymbol{\rho}(x|z) \ \textrm{d}x \right] \boldsymbol{\nu}(z) \ \textrm{d}z, \quad y = \boldsymbol{Y}(x; z) \sim \boldsymbol{\mu}. \end{align} (2.1) Here the cost c(x, y) is a measure of the deformation brought about by the map, for which we will adopt the squared distance   \begin{align} c(x,y) = \|x - y\|^{2}, \end{align} (2.2) and $$\boldsymbol{\nu}(z)$$ represents the distribution of the covariates z. The resulting optimal $$\boldsymbol{\mu}(y)$$, independent of z, is the $$\boldsymbol{\nu}$$-weighted c- barycenter of the conditional distributions $$\boldsymbol{\rho}(x|z)$$(Tabak & Trigila, 2017). For this particular cost function, the maps $$\boldsymbol{Y}$$ are given by the gradient of a potential function: $$\boldsymbol{Y}$$(x;z) = ∇xϕ(x;z) (Brenier, 1991). In applications, one does not know the distributions $$\boldsymbol{\rho}(x|z)$$ and $$\boldsymbol{\nu}(z)$$: the data consist solely of a finite set of samples {xi, zi}, i ∈ [1…m], in terms of which the objective function becomes   $$C = \frac{1}{m} \sum_{i=1}^{m} c\left(x\,^{i}, \boldsymbol{Y}(x\,^{i}; z\,^{i}) \right) .$$ Then one needs to restrict the space of allowable maps $$\boldsymbol{Y}$$ (x;z)—else one would overfit the data—and weaken the requirement that all $$\boldsymbol{\rho}(x|z)$$ be pushed into a single $$\boldsymbol{\mu}(y)$$, since (a) this is not achievable within a restricted family of maps, and (b) the $$\boldsymbol{\rho}(x|z)$$ are only known through samples and $$\boldsymbol{\mu}(y)$$ through their corresponding transformed points. 2.2. Map restriction to rigid translations Arguably, the simplest setting restricts the maps to rigid translations depending on z:   \begin{align} \boldsymbol{Y}(x; z) = x - \boldsymbol{\beta}(z). \end{align} (2.3) For rigid translations, the only constraint that one can impose on the distribution of the range of $$\boldsymbol{Y}$$ is that its expected value $$\bar{y}$$ should be independent of z, with the implication that   \begin{align} \boldsymbol{\beta}(z) = \bar{x}(z)-\bar{y}, \quad \textrm{with} \ \bar{y} = \int \bar{x}(z) \ \boldsymbol{\nu}(z) \ \textrm{d}z. \end{align} (2.4) In addition to its explanatory value, removing the expectation $$\bar{x}(z)$$ from the conditional distribution $$\boldsymbol{\rho}(x|z)$$ is an effective pre-conditioner for the full optimal transport barycenter problem, as the following theorem shows: Theorem Consider the solution $$\boldsymbol{Y}$$full(x;z) to the optimal transport barycenter problem (2.1) with the quadratic cost function (2.2), and the family of rigid translations $$\boldsymbol{Y}$$mean(x;z) given by (2.3, 2.4). The latter pushes forward the conditional distributions $$\boldsymbol{\rho}(x|z)$$ into new conditional distributions $${\boldsymbol{\rho}_{\ast}}(x|z)$$, with common mean $$\bar{y}$$ (by contrast, $$\boldsymbol{Y}$$full(x;z) pushes forward the $$\boldsymbol{\rho}(x|z)$$ into the z-independent barycenter $$\boldsymbol{\mu}$$). Solve now again the full problem (2.1), this time using the $${\boldsymbol{\rho}_{\ast}}(x|z)$$ as input distributions, and denote the corresponding optimal map $$\boldsymbol{Y}$$nl(x;z). Then   \begin{align} \boldsymbol{Y}_{\!full}(y; z) = \boldsymbol{Y}_{\!nl}(\boldsymbol{Y}_{\!mean}(y; z); z). \end{align} (2.5) In other words, the barycenter $$\boldsymbol{\mu}(y)$$ of the $$\boldsymbol{\rho}_{\ast}(x|z)$$ agrees with that of the original distributions $$\boldsymbol{\rho}(x|z)$$, and the optimal maps pushing forward the $$\boldsymbol{\rho}(x|z)$$ onto $$\boldsymbol{\mu}(y)$$ agree with the composition of the rigid translations and the optimal maps from the $$\boldsymbol{\rho}_{\ast}(x|z)$$ to $$\boldsymbol{\mu}(y)$$. Finding $$\boldsymbol{Y}$$nl is typically much easier computationally than finding $$\boldsymbol{Y}$$full, as the $$\boldsymbol{\rho}_{\ast}(x|z)$$ are closer to each other than the $$\boldsymbol{\rho}(x|z)$$, since they share the same expected value for all values of z. Proof. The statement above follows from two basic ingredients: a characterization of the barycenter distribution $$\boldsymbol{\mu}(y)$$ and the inverse $$\boldsymbol{X}$$(y;z) of $$\boldsymbol{Y}$$(x;z) as the unique satisfiers of the two properties (Álvarez-Esteban et al., 2016); (Kuang & Tabak, 2017): Each point y is the barycenter of its images under $$\boldsymbol{X}$$:   $$y = \int \boldsymbol{X}(y; z) \ \boldsymbol{\nu}(z) \ \textrm{d}z.$$ For each value of z, $$\boldsymbol{X}$$(y;z) maps optimally $$\boldsymbol{\mu}(y)$$ onto $$\boldsymbol{\rho}(x|z)$$ and a characterization of optimal maps under the square-distance cost as gradients of a convex function (Caffarelli, 2003):   $$\boldsymbol{X}(y; z) = \boldsymbol{\nabla}_{y} \psi(y; z), \quad \boldsymbol{\psi}(\cdot ; z) \ \textrm{convex}.$$ The composition of the two inverse maps is   \begin{align} \boldsymbol{X}_{full}(y; z) = \boldsymbol{X}_{mean}(\boldsymbol{X}_{nl}(y; z); z) = \boldsymbol{X}_{nl}(y; z) + \boldsymbol{\beta}(z) = \boldsymbol{\nabla}_{y} \boldsymbol{\psi}_{nl} (y; z) + \boldsymbol{\beta}(z), \nonumber \end{align} with $${\boldsymbol{\psi}_{\!nl}}(y;z)$$ convex in y, and therefore   $$\boldsymbol{X}_{\!full}(y; z) = \boldsymbol{\nabla}_{\!y} \boldsymbol{\psi}_{\!full}(y; z), \quad \textrm{with } \ \boldsymbol{\psi}_{\!full}(y; z) = \boldsymbol{\psi}_{\!nl} (y; z) + \boldsymbol{\beta}(z) y \ \textrm{ also convex},$$ so it is optimal, as required by property 2. Also   \begin{align*} \int \boldsymbol{X}_{full}(y; z) \ \boldsymbol{\nu}(z) \ \textrm{d}z &=\int \left[\boldsymbol{X}_{\!nl}(y; z) + \boldsymbol{\beta}(z)\right] \ \boldsymbol{\nu}(z) \ \textrm{d}z \\ &=y + \int \left[\bar{x}(z)-\bar{y} \right] \ \boldsymbol{\nu}(z) \ \textrm{d}z = y, \end{align*} thus satisfying property 1 and concluding the proof. 3. A low rank tensor factorization, separated variable procedure Given a set of m observations xi and L corresponding covariates $${z_{l}^{i}}$$, the task of removing the variability associated with z from x through the z-dependent rigid translation   $$y = x - \bar{x}(z) + \bar{y}, \quad \bar{y} = \bar{x}$$ reduces to regression, i.e. to estimating the conditional mean $$\bar{x}(z)$$. This can be characterized as the minimizer of the variance:   \begin{align} \bar{x}(z) = \arg \min_{f} \sum_{i} \left\|x\,^{i} - f(z\,^{i}) \right\|^{2} \end{align} (3.1) over a proposed family of functions f(z). Any multivariable function f(z) can be approximated to arbitrary accuracy by the superposition of products of functions fl of the individual components zl of z:   $$f(z) \approx \sum_{k} \prod_{l} {f_{l}^{k}}\left(z_{l}\right)\!.$$ Classical examples are the power series when $$z \in \mathbb{R}^{L}$$,   $$f(z) \approx \sum_{k} a_{k} \prod_{l=1}^{L}{z_{l}}^{{s^{k}_{l}}}, \quad s^{k} \in \mathbb{N}^{L},$$ the Fourier series when z is in the 2π-periodic L-dimensional torus,   $$f(z) \approx \sum_{k} a_{k} \textrm{e}^{\,i \sum_{l} {\xi^{k}_{l}} z_{l}}, \quad \xi^{k} \in \mathbb{Z}^{L}$$ and the singular value decomposition when $$z=(i,j) \in \mathbb{N}^{2}$$,   $$x(i,j) \approx \sum_{k} \sigma_{k} \ u^{k}(i) v^{k}(j).$$ This suggests proposing the approximation   \begin{align} \bar{x}(z) \approx \sum_{k=1}^{d} \prod_{l=1}^{L} V(l)^{k}(z_{l}). \end{align} (3.2) If x is vectorial, we can make it scalar by including the index as one more factor zl. Then (3.1) reduces to minimizing   \begin{align} L = \sum_{i} \left(x^{i} - \sum_{k=1}^{d} \prod_{l=1}^{L} V(l)^{k}\left({z_{l}^{i}}\right) \right)^{2} \end{align} (3.3) over the degrees of freedom available in the specification of the functions V(l)k(z). Consider first the particular case where all zl are categorical variables—such as the vector index above—with a finite number of possible values. Then, for each l, V(l) is a matrix with components $$V(l\,)^{k}_{z\,_{l}}$$, and the minimization of L becomes the low-rank factorization of the tensor x(z1, …, zL) from the available—possibly repeated—entries {xi}. This problem can be solved through an alternating direction methodology, minimizing L alternatively over each V (l). This minimization yields, for each value j of zl,   \begin{align} V(l\,)_{\,j} = \left(\sum_{i \in I_{\,j}} x^{i} \prod_{b\ne l} V(b)_{{z_{b}^{i}}} \right) \left[\sum_{i \in I_{\,j}} \left(\prod_{b\ne l} V(b)_{{z_{b}^{i}}} \right)^{T} \left(\prod_{b\ne l} V(b)_{{z_{b}^{i}}} \right) \right]^{-1}, \end{align} (3.4) where   $$I_{\,j} = \left\{i : {z_{l}^{i}} = j\right\}.$$ Such tensor-completion through low-rank factorization has a broad range of applications. Consider the following two typical examples: A blood test, measuring various quantities such as cholesterol level and white cell counts. Here the variable x—a scalar—is the value of a measurement. Some of the categorical covariates zl that one would use to account for variability in x are: (a) The quantity being measured (the row in regular low-rank matrix factorization). (b) The facility where the test was performed. (c) Individual characteristics of the patient (sex, ethnicity, treatment, etc.). Excluded so far are covariates with continuous values, such as age, weight, time of the day and results of prior tests; these will be discussed below. Further down we will also include latent covariates, such as distinct groups into which the patients can be divided in light of their test results, a case of clustering in the presence of confounding factors. 2. The ‘Netflix problem’: x is the rating given by viewers to movies, z1 the viewer and z2 the movie. Further extensions below allow for the inclusion of more granular characteristics, such as movie type and year of release, and viewer’s age and location. In tensor-completion, there is only a finite set of possible arguments $$j={z_{l}^{i}}$$ for V(l)(zl), and the corresponding values $$V(l)_{\,j}^{k}$$ are free for the minimization to determine. By contrast, when using separation of variables for solving PDE’s, the form of the V(l)k(z) is known ab-initio (sines, exponentials, special functions), with only their amplitudes available to fit initial or boundary data. In a data-analysis context, one would call such model parametric. Without a specific principle to guide the selection of the functions V(l)(z), this approach suffers from the arbitrariness of any such choice, which could lead to poor, overfitted or uninterpretable results. Instead, we can preserve the freedom in the assignment of values to each $$V(l)_{{z_{l}^{i}}}^{k}$$ while enforcing a smooth dependence of V(l)k on zl by adding to the objective function a penalization for non-smoothness, proportional to   $$\sum_{i,\,j>i} C_{i,\,j}^{\,l} \left(V(l\,)_{{z_{l}^{i}}}^{k}-V(l\,)_{{z_{l}^{\,j}}}^{k}\right)^{2},$$ with $$C_{i,j}^{l}$$ inversely proportional to a measure of the distance between $${z_{l}^{i}}$$ and $${z_{l}^{\,j}}$$. The logic behind this choice is the following: When $${z_{l}^{i}}$$ and $${z_{l}^{\,j}}$$ are close to each other, the corresponding V(l)k need also be close, as otherwise they would incur a penalty. This is how one enforces smoothness of V(l)k(z). As the penalty term is quadratic in each V(l), solving for each V(l) with all other V(b) fixed amounts to inverting a linear system, as in conventional alternating projections. The new objective function has the form   \begin{align} L = \sum_{i} \left(x^{i} - \sum_{k} \prod_{l} V(l)^{k}_{{z_{l}^{i}}} \right)^{2} + \sum_{l \in M} \lambda_{l} \sum_{k=1}^{d} \left(\prod_{b \in L, b\ne l} \|V(b)^{k}\|^{2}\right) \sum_{i,\,j>i} C_{i,\,j}^{l} \left(V(l)_{{z_{l}^{i}}}^{k}-V(l)_{{z_{l}^{\,j}}}^{k}\right)^{2}, \end{align} (3.5) where M is the set of factors with an underlying metric, for which the smoothness of functions is required. The pre-factors to the penalty terms, products of squares of the norms of the V(b)k, are included so that the objective function is invariant under re-scalings of the V(l)k that preserve their product. Otherwise, the penalty terms could be made arbitrarily small, not by enforcing smoothness, but by multiplying each V(l) by a small factor and absorbing this factor into other V’s less constrained by the smoothness requirement. The constants λl quantify the amount of penalization associated to smoothness for each factor. 3.1. Semi-parameterization Now we extend the procedure above to simultaneously address various issues: As the number of observations grows, so does the number of unknowns $$V(l)_{{z_{l}^{i}}}^{k}$$, turning the solution to the linear system from each step into a computational bottleneck. Yet the underlying functions V(l)k(z) do not change in any fundamental way; we are just seeking their values at a larger set of points $$\{{z_{l}^{i}}\}$$. Ultimately, we want to resolve these functions to a level of detail determined by our needs, not by the number of observations available. Also, the observed values $${z_{l}^{i}}$$ may not provide an adequate grid for resolving the V(l)k(z), as they may cluster in some locations and under-resolve others. As a consequence, the cost matrix $$C_{ij}^{l}$$ may be highly unbalanced. Frequently, values of a covariate zl are repeated systematically across the dataset. Two typical scenarios are the following: The variable zl can only adopt a finite number of values (as in rankings), which are then necessarily repeated many times. The dataset consists of observations drawn from various individuals, each observed many times. The characteristics z of each individual then appear repeatedly, once for each such observation. Examples: patient’s age in medical studies where each patient is followed over time, a meteorological station’s latitude and elevation in datasets comprising observations from more than one station. In such situations, the most sensible choice is to adopt just one value of V(z) for each value of z, rather than penalizing their variation. When some values $${z_{l}^{i}}$$ are missing from the dataset—even though xi and possibly other covariates $${z_{b}^{i}}$$ have been recorded—applying the procedure requires a way to assign values to the corresponding $$V(l)_{{z_{l}^{i}}}^{k}$$. Sometimes the $${z_{l}^{i}}$$ are unknown ab-initio, as they are latent variables to be assigned by the algorithm to further explain variability in an unsupervised fashion. (We will address latent variables in Section 5 using the tools developed here.) The procedure is non-parametric, with the functions Vl(z) defined only by their values on the sample points $${z_{l}^{i}}$$ and with no specified functional form, just constrained by the penalization on non-smoothness. Yet there are reasons why one may want to have a more explicit description of these functions, both within and after the procedure: (a) Prediction and imputation: one typical goal of data analysis is to predict the value of x or its probability distribution for a new set of values of the covariates z. This requires evaluating the V(l)k(zl) at the new values provided. (b) On-line learning: when new data keep flowing in, as in weather forecasting, one should not start the procedure from scratch every time. Avoiding this requires a current estimation of the V(l)k(zl) for the newly arrived values of zl. 5. Even though the procedure as described so far combines naturally categorical and non-categorical variables, the latter are penalized for non-smoothness, while the former are not. Hence the algorithm will be biased to explain as much variability as possible in terms of the categorical variables, a possibly undesired feature. In fact, the exact opposite is often required, as some of these categorical variables represent ‘idiosyncratic’ variability (as do the ‘viewer’ variable in the Netflix problem, the ‘patient’ variable in a medical study and the ‘stock’ variable in finance), which one would like to use to explain only the variability left once more descriptive factors (age, weight, industry) have had their say. In order to address the issues above, we introduce for each l a grid (not necessarily regular) of values zg(l)j, and the linear interpolation scheme   \begin{align} {z_{l}^{i}} = \sum_{\,j} \alpha(l)_{i}^{\,j}{z_{g}}(l)^{\,j}, \quad \textrm{so that we can posit} \quad \tilde{V}(l)_{{z_{l}^{i}}}^{k} = \sum_{\,j} \alpha(l)_{i}^{\,j} V(l)_{\,j}^{k}. \end{align} (3.6) Here the $$\tilde{V}_{{z_{l}^{i}}}$$ are the values associated with observation i, while the Vj are the smaller set of unknowns associated with the grid zg(l)j. Then the problem becomes   \begin{align} \min_{V} \sum_{i} \left(x^{i} - \sum_{k} \prod_{l \in L} \sum_{\,j} \alpha(l)_{i}^{\,j} V(l)_{\,j}^{k} \right)^{2}+\sum_{l=1}^{L} \lambda_{l} \sum_{k} \left(\prod_{b \in L, b\ne l} \|V(b)^{k}\|^{2}\right) \sum_{i,\,j>i} C_{i,\,j}^{l} \left(V(l)_{i}^{k}-V(l)_{\,j}^{k}\right)^{2}, \end{align} (3.7) which includes (3.5) as a particular case with the α ∈ {0, 1}. Let us see how this modification addresses each of the issues raised before: The number of unknowns $$V(l)_{\,j}^{k}$$ is controlled by the externally supplied grid, not the number of observations available. Moreover, this grid can be designed so as to be well-balanced, with neither clusters nor holes, and with as fine a mesh as desired in regions where higher resolution is sought. A repeated value of $${z_{l}^{i}}$$ is assigned through the same coefficients α to the same grid points zg(l)j. When the number of values that zl can adopt is finite, those are by default the values given as grid points, with corresponding α ∈ {0, 1}. When the value of zl for observation i is missing, one should marginalize over it, adopting for $$V(l)_{{z_{l}^{i}}}^{k}$$ the expected value of V(l)k(z). This corresponds to assigning as $$\alpha (l)_{i}^{\,j}$$ the mean value $$\alpha (l)_{i}^{\,j} = \frac{1}{n_{t}} \sum _{t=1}^{n_{t}} \alpha (l)_{t}^{\,j}$$ of α over the observations $${z_{l}^{t}}$$ for which zl is known. If there is a surrogate for zl that allows us to infer that the missing $${z_{l}^{i}}$$ is closer to some of the observed $${z_{l}^{\,j}}$$ than others—often the time t adopts this role—a correspondingly weighted mean of α should be adopted. In order to evaluate $$\tilde{V}(l)^{k}(z)$$ for a new value of z, one just needs to find the values of α that interpolate z in the grid and apply (3.6). One can extend the penalization to categorical variables by adding a new, dummy value of z to the grid, with corresponding unknown $$V(l)_{o}^{k}$$, and making each $$C_{oj}^{l} = 1$$ and all other $$C_{ij}^{l}=0$$. This penalizes variability in the V(l)k without reference to any underlying metric. The final value of $$V(l)_{o}^{k}$$ will be the empirical mean of V(l)k(z), and the quantity penalized, its variance. 4. Higher orders of smoothness Our characterization of smoothness has been restricted so far to continuity, enforced through the penalization of large differences in V(l) when the corresponding zl are close. With the procedure extended to incorporate well-balanced grids for each zl, we can refine this characterization, penalizing higher derivatives of V(l). In addition to providing a smoother fit to the data, this refinement is also useful for extrapolation. Consider a situation where one would like to estimate V(z) for a value of z outside the range of the data, say $$z> z_{\max }$$. The choice most consistent with the current algorithm is to set $$V(z) = V(z_{\max })$$, as this is the most ‘continuous’ choice. If instead we were minimizing ∥V″∥2, then we would extrapolate linearly, following the slope at $$z_{\max }$$. An intermediate choice follows from minimizing a2∥V′∥2 + |V′′∥2, as now beyond $$z_{\max }$$, not being required to explain any data, one would be solving the ordinary differential equation   $$\frac{\delta}{\delta V^{\prime}} \int \left(a^{2} \|V^{\prime}\|^{2} + |V^{\prime\prime}\|^{2}\right) \textrm{d}z = 0 \rightarrow \frac{d^{2} V^{\prime}}{d z^{2}} - a^{2} V^{\prime} = 0,$$ with solution   $$V^{\prime}(z) = V^{\prime}(z_{\max}) \textrm{e}^{-a (z-z_{\max})},$$ where the slope V′ agrees with its value at $$z_{\max }$$, but then decays to zero. One simple recipe for higher-order smoothness is to penalize the squared norm of a finite difference approximation to a derivative, as in the early versions of smoothing splines (Hastie & Tibshirani, 1990). In addition, it is convenient to weight this norm, so that each value $$V(l)^{k}_{\,j}$$ is rewarded for its explanatory value and penalized for non-smoothness in a balanced way. In particular, if values of zg are added beyond the range of the data, these should not affect the fit within the range. For concreteness, we develop here the procedure to penalize the first and second derivatives of V(l); extending this to higher derivatives is straightforward. Given a sorted grid {zj}, the three-point finite-difference approximation to the first and second derivatives of V at point zj is given by   $$V_{\,j}^{\prime} \approx a V_{\,j-1} + b V_{\,j} + c V_{\,j+1} , \quad V_{\,j}^{\prime\prime} \approx A V_{\,j-1} + B V_{\,j} + C V_{\,j+1},$$ with   \begin{align*} a =&\, -\frac{{\Delta_{+}}^{2}}{D}, \quad b = -\frac{{\Delta_{+}}^{2}-{\Delta_{-}}^{2}}{D}, \quad c = \frac{{\Delta_{-}}^{2}}{D},\\ A =&\, \frac{{\Delta_{+}}}{D}, \quad B = -\frac{{\Delta_{+}}-{\Delta_{-}}}{D}, \quad C = -\frac{{\Delta_{-}}}{D}, \\ \Delta_{+} =&\ z^{\,j+1}-z^{\,j}, \quad \Delta_{-} = z^{\,j-1}-z^{\,j}, \quad D = \frac{1}{2} \Delta_{+} \Delta_{-} \left(\Delta_{-} - \Delta_{+} \right)\!. \end{align*} Defining the weight wj of grid point $${z_{g}^{\,j}}$$ by the sum of its contributions to the data points {zi}:   $$w_{\,j} = \varepsilon + \sum_{i=1}^{m} {\alpha_{i}^{\,j}},$$ where ε ≪ 1 is the weight assigned to grid values away from the data, one can propose a penalty for non-smoothness of the form   $$\lambda \sum_{j} w_{\,j} \left(\delta \left\|V_{\,j}^{\prime}\right\|^{2} + (1-\delta) \left\|V_{\,j}^{\prime\prime}\right\|^{2} \right) = \lambda V^{\prime} C V.$$ (The C so defined is a non-negative-definite tridiagonal matrix.) Here δ ∈ [0, 1] measures the relative weight given to the first and second derivatives of V. For notational clarity, we are absorbing into the constant λ the product of the square norms of the V(b)k with b ≠ l in (3.7). When written in full, the problem becomes   \begin{align} \min_{V} \sum_{i} \left(x^{i} - \sum_{k} \prod_{l \in L} \sum_{\,j} \alpha(l)_{i}^{\,j} V(l)_{\,j}^{k} \right)^{2}+\sum_{l=1}^{L} \lambda_{l} \sum_{k} \left(\prod_{b \in L, b\ne l} \left\|V(b)^{k}\right\|^{2}\right) {V(l)^{k}}^{\prime} C^{l} V(l)^{k}. \end{align} (4.1) 5. Clustering, classification and posterior assignment of missing covariate values The previous section discussed how to handle situations in which some, but not all values of a given covariate zl were missing. When no value is known, zl is a latent variable, whose values should be assigned by the algorithm to further explain the variability in the data. This section extends the procedure to handle the discovery of latent categorical variables. Assigning a value $${z_{l}^{\,i}}$$ to data point xi is a clustering problem, which the presence of other covariates zb, b ≠ l turns it into clustering in the presence of confounding variables. In our setting, this gives rise to a mechanism analogous to k-means and Expectation Maximization (Friedman et al., 2001). In order to assign $${z_{l}^{i}}$$, one first sets a number ncl of clusters, and defines the corresponding ‘grid’ zg(l)j, with j = 0…ncl (with 0 standing for the dummy element to which no point is assigned for the penalization of functions of categorical variables). From eq. (3.6), the $${z_{l}^{i}}$$ sought follows from the values of the $$\alpha (l)_{i}^{\,j}$$ used to interpolate on the grid zg(l). These determine automatically to which l-variable cluster the observation xi belongs. When clustering through hard assignments, all entries $$\alpha (l)_{i}^{\,j}$$, but one equal zero, except the one specifying to which class j the data point xi is assigned. The values of $$\alpha (l)_{i}^{\,j}$$ are linked to the corresponding values of $$V(l)_{\,j}^{k}$$ in a two step procedure similar to Lloyd’s algorithm for k-means: Given the values of $$V(l)^{k}_{\,j}$$, we assign zl by choosing the j for which $$\alpha (l)_{i}^{\,j}=1$$ so that the resulting predicted mean $$\bar{x}(z^{\,i})$$ is closest to the observation xi (in other words, to decrease as much as possible the first term in 4.1). This corresponds to the reassignment step based on proximity within a given centroid in Lloyd’s algorithm. Once the alpha are known the values of V(l)j are updated as before by minimizing L (eq. 4.1). This step finds the mean of x within each class defined by the values of zl and corresponds to the update of the centroid in the Lloyd’s algorithm. A version with soft assignments can be developed along similar lines, with each observation xi having a probability $${\alpha _{i}^{\,j}}$$ of belonging to class j. Then the V(l) are updated as before, and the $${\alpha _{i}^{\,j}}$$ can be assigned through the following Bayesian procedure: For each cluster j, compute the square distances $${d_{i}^{\,j}}$$ between the cluster’s mean $$\bar{x}({z_{1}^{i}}, \ldots , {z_{l}^{\,j}}, \ldots{z_{L}^{i}})$$ and each observation xi:   $${d_{i}^{\,j}} = \left(x^{i} - \sum_{k} \left(\prod_{b \ne l} \sum_{\,j} \alpha(b)_{i}^{\,j} V(b)_{\,j}^{k}\right) V(l)_{\,j}^{k}\right)^{2}$$ and the standard deviation   $$\sigma_{\,j} = \left(\frac{\sum_{i} \alpha(l)_{i}^{\,j} {d_{i}^{\,j}}}{\sum_{i} \alpha(l)_{i}^{\,j}} \right)^{\frac{1}{2}} .$$ Denoting Q(l)s the sth set of ns observations {i} constrained to share the same assignments α(l)i, compute the average square distance to each cluster’s mean:   $${D_{s}^{\,j}} = \frac{1}{n_{s}}\sum_{i\in Q(l)_{s}} {d_{i}^{\,j}}.$$ Update $${\alpha _{i}^{\,j}} = {P_{s}^{\,j}}$$ (i ∈ Q(l)s) through Bayes formula, modeling the clusters as one-dimensional Gaussians and adopting $${P_{s}^{\,j}}$$ itself as prior:   $$\rho^{\,j} \propto \frac{1}{{\sigma_{\,j}}} \exp\left(-\frac{{D_{s}^{\,j}}}{2{\sigma_{\,j}}^{2}}\right),$$  \begin{equation*} {\hskip-35pt}{P_{s}^{\,j}} = \frac{{P_{s}^{\,j}} \rho^{\,j}}{\sum_{h} {P_{s}^{h}} \rho^{h}}, \end{equation*} a choice with positive feedback that leads to convergence to rigid assignments $${P_{s}^{\,j}} \in \{0, 1\}$$. If soft assignments are desired throughout, one should use in Bayes formula the non-informative prior $$P_{prior}^{\,j} = \frac{1}{n_{cl}}$$ instead of $${P_{s}^{\,j}}$$. In classification problems, the αi are known (and fixed) for the training set, while they evolve as above for the testing set. To be consistent with the Baysian approach, one should in this case use either the fixed prior   $$P_{prior}^{\,j} = \frac{1}{m_{k}} \sum_{i=1}^{m_{k}} {\alpha_{i}^{\,j}},$$ where the sum is taken over the mk observations for which zi is known, or, in the spirit the Expectation Maximization algorithm, the evolving prior   $$P_{prior}^{\,j} = \frac{1}{m} \sum_{i=1}^{m} {\alpha_{i}^{\,j}}$$ computed over all observations, including the ones for with α is being updated. Notice that the procedure just described applies without changes when the latent variable zl is non-categorical. Here we would not call the procedure ‘classification’, as this refers to categorical classes; a more appropriate name would be ‘posterior assignment of unknown covariate values’. It applies, for instance, to softly assign an age to individuals for which it has not been recorded. 5.1. Use for complexity reduction In large datasets, an idiosyncratic variable zl (the viewer, the movie, the patient, the station, the stock) can adopt a very large number of values. Since idiosyncratic variables are categorical, we cannot reduce the number of unknowns V(l)k simply by interpolating on a grid zg(l) via (3.6), since one would not know, for instance, how to interpolate one patient among others when the more informative variables—age, weight, etc.—are already considered separately. However, one can still use (3.6) to the same effect, but with the values of α unknown beforehand, i.e. performing clustering. In the example just mentioned, we would replace the idiosyncratic variable ‘patient’ by ‘patient type’, with the number of types decided based on the level of resolution sought. Below, we will show an application of this clustering procedure to a matrix completion problem for a book recommendation system, where the idiosyncratic variables ‘reader’ and ‘book’ are clustered into ‘reader type’ and ‘book type’ (different from the book genre or any other covariates taken into account explicitly). 6. Categorical dependent variables The methodology allows for covariates z of any type: categorical, ordinal, real, periodic. Does a comparable level of generality extend to the variable x whose unexplained variability one attempts to minimize? In particular, there are many scenarios of interest where x is categorical: has person z1 read book z2, will patient z1 develop illness z2, which party does person z1 vote for, etc.? The object that the procedure seeks, the conditional expectation $$\bar{x}(z)$$, can still be assigned a meaning when x is categorical. For instance, for a binary x, one can assign the values {0, 1} to the two possible outcomes, and define   $$\bar{x}(z) = 0 \cdot P(0|z) + 1 \cdot P(1|z) = P(1|z),$$ where the P(x|z) are the z-conditional probabilities for the outcome x. In doing so, we have embedded the space of outcomes x into a metric space, i.e. the real line. The resulting $$\bar{x}$$ has a clear meaning in terms of the underlying probability distribution P. Similarly, if x has three possible outcomes, we can embed those into the two-dimensional plane, as vertices of an equilateral triangle. The important notion here is that all the points so assigned be equidistant, not to affect the categorical nature of the variable through a metric-induced preferential ordering. Then, following the attributed component methodology, the two dimensions of x in this example are captured through a covariate z with values in {1, 2}. More generally, a number n of outcomes is embedded into an n − 1-dimensional space. A simpler alternative would embed a variable x with n outcomes into an n-dimensional space, as vertices of the corresponding simplex, i.e. the 1 of K encoding. This has more straightforward interpretability, as each component of $$\bar{x}$$ represents the probability of the corresponding outcome. On the other hand, it has one more dimension to consider and it requires imposing the constraint that $$\sum _{i} \bar{x}_{i}(z) = 1$$, which increases—albeit slightly—the computational complexity of the algorithm. 7. Interpretabilitythrough marginalization The procedure provides an estimation of the conditional mean of the form   $$\bar{x}(z_{1}, \ldots, z_{L}) = \sum_{k=1}^{d} \prod_{l \in L} \sum_{\,j} \alpha(l)^{\,j}(z_{l}) V(l)_{\,j}^{k},$$ where the $$V(l)_{\,j}^{k}$$ are known, and α(l)j(zl) are the coefficients that interpolate zl in the given grid zg(l)j. It is straightforward to use this expression to find the estimate for $$\bar{x}$$ for a new value of z. On the other hand, determining how x depends on each individual zl under average conditions for the other covariates zb is a question of marginalization: find   \begin{align} \bar{x}(z_{l}) &= \int x \ \rho(x|z) \ \textrm{d}z_{1} \ldots \ \textrm{d}z_{l-1} \ \textrm{d}z_{l+1} \ldots \ \textrm{d}z_{L} \nonumber \\ &= \int \bar{x}(z) \ \textrm{d}z_{1} \ldots \ \textrm{d}z_{l-1} \ \textrm{d}z_{l+1} \ldots \ \textrm{d}z_{L} . \end{align} (7.1) Evaluating at $$z_{l} = z_{l}^{\ast }$$ and replacing expectations by the corresponding empirical means, one has   \begin{align*} \bar{x}(z_{l}^{\ast}) &= \frac{1}{m} \sum_{i=1}^{m} \bar{x}({z_{1}^{i}}, \ldots, z_{l}^{\ast}, \ldots, {z_{L}^{i}}) \\ &=\sum_{k=1}^{d} \frac{1}{m}\sum_{i=1}^{m} \left[\prod_{b \ne l} \sum_{\,j} \alpha(b)_{i}^{\,j} V(b)_{\,j}^{k}\right] \sum_{\,j} \alpha(l)^{\,j}(z_{l}^{\ast}) V(l)_{\,j}^{k}. \end{align*} Similarly, one can compute the marginal of $$\bar{x}$$ over s variables $$\{z_{l_{h}}\}$$, through   \begin{align} \bar{x}(z_{l_{1}}^{\ast}, \ldots, z_{l_{s}}^{\ast}) = \sum_{k=1}^{d} \frac{1}{m}\sum_{i=1}^{m} \left[\prod_{b \not{\in} \{l_{h}\}} \sum_{\,j} \alpha(b)_{i}^{\,j} V(b)_{\,j}^{k}\right]\prod_{h=1}^{s} \sum_{\,j} \alpha(l_{h})^{\,j}(z_{l_{h}}^{\ast}) V(l_{h})_{\,j}^{k} . \end{align} (7.2) 8. Further explanation of variability: the variance The estimation of the conditional mean $$\bar{x}(z)$$ extends effortlessly to second order statistics. Introducing the filtered variable y through   \begin{align} y = x - \bar{x}(z), \end{align} (8.1) one can for instance estimate the z-dependent variance σ2(z) by applying the procedure not to x, but to w = y2, since   $$\sigma^{2}(z) = \bar{w}(z).$$ Often it is a covariance matrix we are after, with one of the zl representing the ‘entry’ or ‘variable’ in x. If two entries zl = i and zl = j are always observed simultaneously—as are, for instance, two quantities in a blood test—then one writes w = y(zl = i, zb)y(zl = j, zb) with zb comprising all covariates except zl, in order to estimate $${\Sigma _{i}^{\,j}} = \bar{w}$$ as a function of zb. Rather than estimating each entry (i, j) at a time though, it is better to estimate the whole covariance matrix Σ(i, j, zb) at once, as this will use all the information available and link the various entries of Σ, so that in particular for each value of zb, Σ(:, :, zb) is a non-negative definite matrix. Yet this procedure extends to much more general situations. One might seek, for instance, the correlation between x at two locations/times zl as a function of the distance/interval Δzl between them. For this, we introduce all products $$w = x({z_{l}^{i}}) x({z_{l}^{\,j}})$$ and a new covariate $$\varDelta z = {z_{l}^{i}}-{z_{l}^{\,j}}$$. For the remaining covariates zb, one may take the average between their values at the two observations. Alternatively, one may restrict the products considered to those where the two values of zb are not too far from each other, or include the Δzb as extra covariates. Similar examples abound: correlation between height or weight at different ages, between precipitation at different locations as a function of time-lag, etc. 9. Examples This section includes examples of application of the attributable component methodology. Though one of the examples uses real data—hourly ground temperatures across the continental US—the purpose here is not to highlight any particular application, but to illustrate various aspects of the workings of the procedure. Thus we start with a couple of simple synthetic examples that directly compare the components discovered by the algorithm with the actual variable-separated solution underlying the data. Then we consider the real example of ground temperatures, which combines different kinds of covariates: categorical, real and periodic, is big enough to require the use of stochastic descent, and also illustrates different uses of the iteration of the procedure that captures second order structure underlying the data. Finally, we consider a synthetic Netflix-like problem, posed in terms of book ratings by readers, that illustrates other capabilities of the methodology, including bi-clustering for complexity reduction in the presence of confounding factors and the use of idiosyncratic variables (reader, book) in combination with more descriptive—but less specific—covariates such as age, location and book type. 9.1. Numerical separation of variables Our first two examples analyze data generated using synthetic models where the distribution underlying the samples takes the variable-separated form that our algorithm adopts as ansatz. The first of these models has three covariates zl, two periodic and one real, and takes the form   $$x^{i} = \bar{x}(z^{i}) + w^{i}, \quad \bar{x}(z) = \frac{1}{5} \cos\left(z_{1}\right) \sin\left(z_{2}\right) z_{3}, \quad w \sim \mathscr{N}(0, 1),$$ where the m = 1000 samples from z1, z2 and z3 are drawn uniformly in the interval [π, π]. Figure 1 displays the results of applying the procedure with d = 1 component and λl = 0.01, l = {1, 2, 3}. The initial values of $$V(l)_{\,j}^{k}$$ (here and in all examples in this paper) are taken as small, symmetry-breaking perturbations of the uniform $$V(l)_{\,j}^{k}=1$$. Fig. 1. View largeDownload slide Upper left: original data x (in light gray -blue on the web version of the manuscript-) and the filtered signal y (in dark gray -red on the web version of the manuscript-) as a function of the noise w. As one can see, y is a function of the noise, i.e. of the unexplained variability, as all explainable variability has been filtered away. Upper right: first (and only) component as a function of z1. Lower left: same component as a function of z2. Lower right: same component as a function of z3. As one can see, the true separated solution underlying the data has been fully recovered. Fig. 1. View largeDownload slide Upper left: original data x (in light gray -blue on the web version of the manuscript-) and the filtered signal y (in dark gray -red on the web version of the manuscript-) as a function of the noise w. As one can see, y is a function of the noise, i.e. of the unexplained variability, as all explainable variability has been filtered away. Upper right: first (and only) component as a function of z1. Lower left: same component as a function of z2. Lower right: same component as a function of z3. As one can see, the true separated solution underlying the data has been fully recovered. The second model, with one less covariate and one more component, is   \begin{align} x^{i} = \bar{x}(z^{i}) + w^{i}, \quad \bar{x}(z) = 4 \cos\left(z_{1}\right) \sin\left(z_{2}\right) + 0.3\ z_{1}{z_{2}}^{2}, \quad w \sim \mathscr{N}(0, 1). \end{align} (9.1) Here we also draw m = 1000 samples of z1 and z2 uniformly in the interval [π, π] and adopt λl = 0.01, l = {1, 2}, setting this time the number of components to d = 2. The purpose of this example is to display the non-uniqueness of the separation into components: as shown in Fig. 2, the two components recovered by the algorithm, though fitting $$\bar{x}(z)$$ to perfection, do not agree with those used to generate the data in (9.1). This non-uniqueness of the decomposition into components is general whenever L ≥ 2. For instance, for any f1, f2, g1, g2, we can write  $$f_{1}(x) g_{1}(y) + f_{2}(x) g_{2}(y) = \tilde{f}_{1}(x) \tilde{g}_{1}(y) + \tilde{f}_{2}(x) \tilde{g}_{2}(y),$$ Fig. 2. View largeDownload slide Upper left panel: original data x (in light gray -blue on the web version of the manuscript-) and the filtered signal y (in dark gray -red on the web version of the manuscript-); the latter is a function of the unexplainable variability, i.e. the noise w. Upper right panel: the two components as functions of z1. Lower left: same as functions of z2. This separated solution does not agree with the one proposed to generate the data, yet it is completely equivalent, though smoother from the perspective of the objective function. Fig. 2. View largeDownload slide Upper left panel: original data x (in light gray -blue on the web version of the manuscript-) and the filtered signal y (in dark gray -red on the web version of the manuscript-); the latter is a function of the unexplainable variability, i.e. the noise w. Upper right panel: the two components as functions of z1. Lower left: same as functions of z2. This separated solution does not agree with the one proposed to generate the data, yet it is completely equivalent, though smoother from the perspective of the objective function. with   $$\tilde{f}_{1}(x) = f_{1}(x), \quad \tilde{g}_{1}(y) = g_{1}(y) + g_{2}(y), \quad \tilde{f}_{2}(x) = f_{2}(x)-f_{1}(x), \quad \tilde{g}_{2}(y) = g_{2}(y).$$ The algorithm picks, among all equivalent decompositions, the ‘smoothest’ one, in the sense of minimizing the second term in (4.1) penalizing non-smoothness. Notice though that $$\bar{x}(z)$$ itself and all marginals are still uniquely determined, so the non-uniqueness of the decomposition affects neither the numerical predictions nor the interpretability of the results. 9.2. Ground temperature variability in the continental US For a second example, we use hourly measurements of the ground-level temperature at 47 stations located across the continental United States and one in Canada, publicly available from NOAA.1 We picked an array of stations that covers the US roughly uniformly and have data available since at least the year 2005 to the present. In this example, the variable x to explain is the temperature itself, measured in degrees Celsius, and we have chosen as covariates zl the following four quantities: The station itself, a categorical variable with values z1 ∈ [1, 2, …, 48]. The local time of the day z2 ∈ [0, 24], periodic. We adopted a grid with 24 elements, with one grid point per hour. The season, described as day of the year, z3 ∈ [0, 365.25], periodic, also with a grid of 24 points. Time in years z4 ∈ [2005, 2017.3], real, included to account for climate variability in a multi-year scale, with a grid of 75 points. The total number of observations m ≈ 5.7 ⋅ 106 was large enough to justify the use of stochastic gradient descent, which we performed by choosing at each step of the algorithm a random subset of 105 entries of x. We adopted d = 8 components and penalization parameters λ = 0.0001 for the station and λ = 0.005 for all other covariates. The reduction in variability can be quantified by the variance, which moved from 151.7 for x to 23.8 for the transformed y, corresponding to a reduction in standard deviation from 12.3 to 4.9 degrees Celsius. Figure 3 (upper left panel) shows the eight periodic components associated with z2, the time of the day. Similarly, there are eight components, not displayed, associated with each of the other three covariates. The set of all these components constitute the output of the algorithm, from which all predictions are made. Fig. 3. View largeDownload slide Upper left panel: mean temperature components as functions of the time of the day (see eq. (3.6)). Upper right panel: mean temperature as function of the time of the day, marginalized over time of the year, climate variability and station. Lower left panel: mean temperature as a function of day of the year, marginalized over station, climate variability and time of the day. Lower right panel: mean temperature as a function of the station marginalized over time of day, day of year and climate variability (darker colors corresponds to lower temperatures). Fig. 3. View largeDownload slide Upper left panel: mean temperature components as functions of the time of the day (see eq. (3.6)). Upper right panel: mean temperature as function of the time of the day, marginalized over time of the year, climate variability and station. Lower left panel: mean temperature as a function of day of the year, marginalized over station, climate variability and time of the day. Lower right panel: mean temperature as a function of the station marginalized over time of day, day of year and climate variability (darker colors corresponds to lower temperatures). Figure 3 (upper right panel) displays the easier to interpret marginalized prediction, where the dependence of $$\bar{x}$$ on the time of the day is marginalized over time of year, station and multi-year variability. Similarly, the lower left panel of the same figure displays the marginalized dependence of $$\bar{x}$$ on the time of the year, with cold winters and hot summers, and the lower right panel displays through color the marginalized dependence of $$\bar{x}$$ on the station (this dependence is marked mostly by latitude, but also by other station characteristics such as elevation and distance from the sea). Figure 4 (left panel) shows the data (in blue) and predicted mean (in red) individualized for a particular station: Manhattan, Kansas, at four levels of detail: global, yearly, monthly and weekly. We see how the prediction ‘explains’ the cycle of seasons and hours of the day, but not the less regular weather systems, of typically around five days, that flow through the region. The right panel displays similar results for Barrow, at the northern tip of Alaska. An interesting observation here is that the estimated mean daily cycle captures two peaks. It is winter in Alaska, so the sun does not rise at all. What we are observing is a manifestation of the thermal tides (Chapman & Lindzen, 1970), with their two characteristic peaks, most often observed in the pressure, but manifested here through their temperature signal. Fig. 4. View largeDownload slide Upper 4 panels: data (in light gray -blue on the web version of the manuscript-) and predicted mean (in dark gray -red on the web version of the manuscript-) for Manhattan, Kansas, displayed over the length of one year, one month and a week. Unlike the diurnal and yearly cycle, the weather systems (most noticeable on the second row, left column panel) are not ‘explained’ by the procedure, as no covariate z representing them has been included. Lower 4 panels: data (in light gray -blue on the web version of the manuscript-) and predicted mean (in dark gray -red on the web version of the manuscript-) for Barrow, Alaska. Notice the two-peaked thermal tide in the predicted diurnal signal (4th row, right column panel). Fig. 4. View largeDownload slide Upper 4 panels: data (in light gray -blue on the web version of the manuscript-) and predicted mean (in dark gray -red on the web version of the manuscript-) for Manhattan, Kansas, displayed over the length of one year, one month and a week. Unlike the diurnal and yearly cycle, the weather systems (most noticeable on the second row, left column panel) are not ‘explained’ by the procedure, as no covariate z representing them has been included. Lower 4 panels: data (in light gray -blue on the web version of the manuscript-) and predicted mean (in dark gray -red on the web version of the manuscript-) for Barrow, Alaska. Notice the two-peaked thermal tide in the predicted diurnal signal (4th row, right column panel). Finally, Fig. 5 displays in blue the climate variability captured through the marginalized dependence of $$\bar{x}$$ on time in a multi-year scale. To illuminate its possible meaning, we have superimposed in red the El Niño Index, which measures the moving 3-month average temperature of the Indian Ocean. Noticeably, the two signals have a very similar amplitude of two to three degrees and time-scale of roughly 3–4 years. Clearly, this factor has captured a global phenomenon in its manifestation on the north American continent. Fig. 5. View largeDownload slide El Niño index (in dark gray -red on the web version of the manuscript-) and $$\bar{x}$$ (in light gray -blue on the web version of the manuscript-), marginalized over time of day, day of year and station. Fig. 5. View largeDownload slide El Niño index (in dark gray -red on the web version of the manuscript-) and $$\bar{x}$$ (in light gray -blue on the web version of the manuscript-), marginalized over time of day, day of year and station. Fig. 6. View largeDownload slide Standard deviation as a function of the time of the year, marginalized over station, time of day and climate variability. Early winters have about twice the variability of the late summer–early falls. Fig. 6. View largeDownload slide Standard deviation as a function of the time of the year, marginalized over station, time of day and climate variability. Early winters have about twice the variability of the late summer–early falls. As described in Section 8, we can iterate the procedure once to capture the variance as function of the same factors. Figure 6 displays the marginalized dependence of the standard deviation σ on the time of the year, with maximal variability in the winter and smallest in the late summer–early fall. After normalizing the y through the transformation $$y^{i} \rightarrow \frac{y^{i}}{\sigma (z^{i})}$$, using their standard deviation σ(z) so estimated, we can also estimate the time-lagged correlation between two stations in the following way: we input as a new variable x the product x = y1y2, where y1, 2 are the normalized observations at stations 1 and 2, respectively. As explanatory factors, we take the time of day and of year at station 1, and the time lag Δt between the two observations. Figure 7 shows the results of applying this procedure to Ithaca, NY and Des Moines, IA, through the dependence of the correlation on the time-lag, marginalized over time of day, but drawn specifically for one summer and one winter day. In both cases, we see the correlation peaking at a lag of two days, consistently with the time it takes for an average weather system to travel from the midwest to the eastern US. This suggests improving the prediction at Ithaca by explaining its already normalized signal y using as an additional explanatory factor the normalized y at Des Moines two days before, and possibly also the normalized temperatures at other stations in the past. Figure 8 shows the result of explaining the temperature in Ithaca using those in Des Moines (IA), Crossville (TN), Egbert (ON), Monahans (TX), Selma (AL) and Ithaca itself two days before, and that in Boulder (CO) four days before. We can see that the weather systems, not captured at all through the previously considered cofactors, are now accounted for to a significant degree. Fig. 7. View largeDownload slide Time-lagged correlation between Ithaca, NY and Des Moines, IA as a function of the time-lag, marginalized over time of day and climate variability, but drawn specifically for one summer and one winter day. Both peak at a lag of two days, but the correlation is significantly bigger in the winter. Fig. 7. View largeDownload slide Time-lagged correlation between Ithaca, NY and Des Moines, IA as a function of the time-lag, marginalized over time of day and climate variability, but drawn specifically for one summer and one winter day. Both peak at a lag of two days, but the correlation is significantly bigger in the winter. Fig. 8. View largeDownload slide Observed (in light gray -blue on the web version of the manuscript-) and predicted (in dark gray -red on the web version of the manuscript-) temperature at Ithaca, NY over 45 days. Left panel: prediction using time of day, day of year and climate variability. Right panel: prediction using as additional cofactors the temperature in various stations 2 and 4 days before, thus capturing much of the weather systems that dominate the weather at this particular location and time. Fig. 8. View largeDownload slide Observed (in light gray -blue on the web version of the manuscript-) and predicted (in dark gray -red on the web version of the manuscript-) temperature at Ithaca, NY over 45 days. Left panel: prediction using time of day, day of year and climate variability. Right panel: prediction using as additional cofactors the temperature in various stations 2 and 4 days before, thus capturing much of the weather systems that dominate the weather at this particular location and time. Clearly, many more things can be done with these data along similar lines; our purpose here though is not to perform a systematic data-based study of the weather and climate or to develop a complete methodology for time-series analysis, but just to illustrate the workings of the attributable component methodology in a realistic setting. 9.3. Book preference prediction In this section we test the algorithm on a preference learning problem. We create a synthetic data set representing book ratings by readers, where each book is assigned an integer score between 1 and 5 by each person that reads it. Each reader and book are characterized by categorical and non-categorical cofactors: we know each reader’s age (a real cofactor) and location through its geographical longitude (a periodic cofactor), and each book’s type (a categorical cofactor with 10 possible values) and number of pages (a real cofactor). In addition, each reader and book have idiosyncratic traits, i.e. individual characteristics not captured by the descriptive cofactors just mentioned. The goal is to learn each reader’s book preferences based on previously rated books, so as to be able to predict the rating of a book that has not been read yet by a particular reader. This problem belongs to the category of matrix completion problems (Candes & Recht, 2012): the matrix with entries Rij containing the rating that reader i assigns to book j is only partially filled, since not every reader has read every book in the data set. The task is to predict the expected value of the missing entries. To build the dataset, we first input the number of readers and books nusers and nitems. For each reader, the cofactors are defined as follows: z1: the reader, z1 ∈ {1, …, nusers}, z2: the age, drawn uniformly in the interval [10, 70], z3: the longitude, drawn from a binary Gaussian mixture with centers located on the American and Euro-Asian continents, z4: the book, z4 ∈ {1, …, nitems}, z5: the book type, an integer drawn uniformly from [1, …, 10], z6: the number of pages, drawn from a Gaussian distribution centered around 70 and truncated below at z6 = 10. In order to create data that do not have the same form as the components sought by the algorithm, we first decompose z3 into two: $$c_{3} = \cos (z_{3})$$ and $$s_{3} = \sin (z_{3})$$ and re-center and normalize all zi, c3 and s3 so that they lie in the interval [1/2, 3/2]. Then we define a set of functions gj of the zi as follows: $$g_{1}=\sin (\pi * z_{1} * z_{2} * c_{3})$$ $$g_{2}=\cos (\pi /(s_{3} * z_{4} * z_{5} * z_{6}))$$ g3 = 1/(z1 + z2 + c3 + s3 + z4 + z5 + z6) g4 = z1/z2 + z4/c3 + z5/z6. These functions gi are used to create a preliminary version of the data set through $$x=\sum _{i=1}^{4}a_{i}\frac{g_{i}-\mu _{i}}{\sigma _{i}}$$, where μi, σi are the mean and standard deviation of gi, and the coefficients ai are drawn from the normal distribution. Then we map x onto a uniform distribution in [1, 5] through sorting, and add to it normally distributed noise so as to displace in average each value of x by one unit. Finally, we truncate each value of x to obtained integer numbers from 1 to 5 in equal proportions. The result is a matrix $$R_{z_{1=}i}^{z_{4}=j}$$ with values between 1 to 5, depending on zi, i = 1, …, 6 in a coupled and nonlinear way. Filling the entries of R for which the pair (z1, z4) is absent from the data set is most challenging when R is very sparse, a setting where the methodology of this article is particularly useful. On the one hand, the availability of qualifying cofactors, such as a reader’s age or a book’s type, groups entries together, thus partially overcoming the sparsity of entries per reader and per book. On the other, even in the absence of such additional cofactors, one can still group books and readers through bi-clustering, again overcoming to a certain degree the sparsity of data for individual readers or books. In order to test this, we have created data sets with both small and large numbers of readers and books with an average of about two book ratings per reader. To measure the quality of the results, we divided the dataset into two parts, one used for training the model and one for testing it. We then computed the root mean squared deviation (RMSD) on both the training set (in-sample error) and in the testing set (out of sample error). The sample used for testing consists of 100 entries randomly selected from the dataset, after making sure that no rows or columns were left empty. Figure 10 shows predictions for a dataset of 500 readers and 100 books. The rating matrix R, with the sparsity pattern displayed in Fig. 9, contains 955 non-zero entries, or roughly 2 books per reader. To assess the effect of clustering, we run the algorithm by using as predictors only the row (z1) and column (z4) indices with and without clustering. The upper panels of Fig. 10 show in-sample and out of sample error without clustering, while to obtain the two lower panels of the figure the algorithm partitioned both readers and books into 20 possible classes. Notice that both the in-sample and out of sample RMSD are smaller in the run where clustering was used. Fig. 9. View largeDownload slide Sparsity structure of R. Fig. 9. View largeDownload slide Sparsity structure of R. Fig. 10. View largeDownload slide Synthetic example obtained with 100 books, 500 users and about 955 preferences. The out of sample error is computed on 100 entries randomly selected among the 955. Upper left panels: in-sample error for each value of the score (1 to 5) obtained without clustering and by considering only the idiosyncratic covariates z1 and z4; the overall root mean square deviation is reported on top of the figure. Upper right panel: out-of-sample error distribution computed on the 100 entries previously eliminated. Lower left panels: same as upper left, but using clustering. Lower right panel: out-of-sample error when clustering rows and columns into 20 classes each. Fig. 10. View largeDownload slide Synthetic example obtained with 100 books, 500 users and about 955 preferences. The out of sample error is computed on 100 entries randomly selected among the 955. Upper left panels: in-sample error for each value of the score (1 to 5) obtained without clustering and by considering only the idiosyncratic covariates z1 and z4; the overall root mean square deviation is reported on top of the figure. Upper right panel: out-of-sample error distribution computed on the 100 entries previously eliminated. Lower left panels: same as upper left, but using clustering. Lower right panel: out-of-sample error when clustering rows and columns into 20 classes each. Figure 11 shows results for a larger dataset with 50000 users, 10000 books and about 106000 preferences. The purpose of this experiment is to show how information beyond the idiosyncratic row and column index can improve the prediction, substantially reducing the out of sample error. This is a case in which clustering the users and books (into 20 classes each as before), besides improving the predictions, is also useful from a computational viewpoint, as the number of unknowns associated for instance to the readers decreases by a factor of 250, thus challenging the linear solvers to a much smaller degree. Stochastic gradient descent is not strictly required for the roughly m = 105 entries available, but we have applied it nonetheless, with m/5 random entries used per time-step, since for even larger datasets this would be the only practical option available. Fig. 11. View largeDownload slide Synthetic dataset with 10000 books, 50000 users and about 106000 preferences. The out-of-sample error is computed on 100 entries randomly selected among the 106000. First row: in-sample (left) and out-of-sample (right) errors when using only the reader and the book index. Second row: in and out-of-sample errors when using only the age, longitude, book’s type and number of pages. Third row: in and out of sample error when using all the available cofactors. In all three scenarios, the readers (z1) and books (z4) were clustered into 20 classes each. Fig. 11. View largeDownload slide Synthetic dataset with 10000 books, 50000 users and about 106000 preferences. The out-of-sample error is computed on 100 entries randomly selected among the 106000. First row: in-sample (left) and out-of-sample (right) errors when using only the reader and the book index. Second row: in and out-of-sample errors when using only the age, longitude, book’s type and number of pages. Third row: in and out of sample error when using all the available cofactors. In all three scenarios, the readers (z1) and books (z4) were clustered into 20 classes each. As one can see in the results, using as predictors only reader and book (via clustering) or only their qualifying cofactors (age, location, type, length) yields poorer results than combining the two into a maximally explanatory set. In order to assess the degree of accuracy of the predictions obtained including reader, book and all their qualifying cofactors, the RMSD obtained in this case (0.79) should be compared with the RMSD obtained in two extreme cases: All the entries to predict are assigned a value of 3, right in the middle of the rating scale. In this case, one can easily see that $$\textrm{RMSD}=\sqrt{2}=1.4142\ldots$$ All the entries to predict are assigned the value obtained from the model underlying the data, but without adding the normally distributed noise described in the last step of the construction of our dataset, which cannot possibly be predicted. In this case, we obtain RMSD = 0.4. Thus the root mean square error of the predictions, RMSD ≈ 0.8, is only twice the amount directly attributable to the noise. 10. Conclusions This article develops ‘attributable components’, a methodology for the non-parametric estimation of the conditional mean of a variable x in terms of a set of covariates {zl}. This is framed within a more general setting: the estimation of the full conditional probability $$\boldsymbol{\rho}(x|z)$$ through a family of maps y = $$\boldsymbol{Y}$$(x;z) that push forward the $$\boldsymbol{\rho}(x|z)$$ onto their Wasserstein barycenter $$\boldsymbol{\mu}(y)$$. Estimating the conditional mean results from restricting these maps to z-dependent rigid translations $$\boldsymbol{Y}=x-\boldsymbol{\beta}(z)$$. We prove that these act as pre-conditioners to the full conditional density estimation, in the sense that, if one performs the latter after removing the conditional mean, the composition of the resulting optimal maps and the previous rigid translations yields the solution to the original barycenter problem. Extending the methodology of this article to handle the full conditional density estimation problem is the subject of current work. On the other hand, the conditional second order structure of the data, including the conditional covariance matrix, can be found by a straightforward iteration of the procedure. The conditional mean $$\bar{x}(z)$$ is a function of the possibly many co-variates {zl}, which may in addition have different types, such as categorical, ordinal and real. The procedure represents this multivariable function as a sum of components, each of which is the product of single-variable functions. Each of these single-variable functions, in turn, is represented by the values that it adopts on a grid, for which smoothness is enforced through penalization. Every observation is assigned a weighted sum of these grid values, through interpolation when the zl is continuous and through straightforward assignment when it is discrete. When some or all values of zl are unknown, finding these weights becomes part of the problem, which then produces clustering, classification or continuous covariate assignments. The functions of each covariate zl are found through alternating descent of the objective function. Since this is quadratic in the functions of each zl, the descent step consists of the solution to a linear, Sylvester-like system. The methodology scales well for big datasets, as it adapts straightforwardly to stochastic descent and online learning. Despite its complex, non-parametric form, the predicted $$\bar{x}(z)$$ is easily rendered interpretable through marginalization over subsets of covariates. Among the many extensions that one could foresee, those currently being pursued include, in addition to the full conditional probability estimation mentioned before, specific extensions directed at the analysis of time series. As for any methodology for the analysis of data, a wealth of interesting problems and extensions arises from considering particular applications. Not to overstretch the size of this methodological article, we have included here only a handful of representative examples, though we believe that they are diverse enough to convey a feeling for the methodology’s versatility and broad applicability. Acknowledgements This work was partially supported by grants from the Office of Naval Research and from the NYU-AIG Partnership on Global Resilience. Footnotes 1  https://www1.ncdc.noaa.gov/pub/data/uscrn/products/hourly02/ References Agueh, M. & Carlier, G. ( 2011) Barycenter in the Wasserstein space. SIAM J. Math. Anal. , 43 , 094-- 924. Google Scholar CrossRef Search ADS   Álvarez-Esteban, P. C., del Barrio, E., Cuesta-Albertos, J. & Matrán, C. ( 2016) A fixed-point approach to barycenters in wasserstein space. J. Math. Anal. Appl. , 441 , 744-- 762. Google Scholar CrossRef Search ADS   Brenier, Y. ( 1991) Polar factorization and monotone rearrangement of vector-valued functions. Commun. Pure Appl. Mathematics , 44 , 375-- 417. Google Scholar CrossRef Search ADS   Caffarelli, L. A. ( 2003) The Monge-Ampère equation and optimal transportation, an elementary review. Optimal Transportation and Applications. Springer International Publishing AG, pp. 1–10. Candes, E. & Recht, B. ( 2012) Exact matrix completion via convex optimization. Commun. ACM , 55, 111-- 119. Google Scholar CrossRef Search ADS   Chapman, S. & Lindzen, R. ( 1970) Atmospheric Tides . D. Reidel, Norwell, Mass, p. 200. Friedman, J., Hastie, T. & Tibshirani, R. ( 2001) The Elements of Statistical Learning,  vol. 1. Springer series in statistics Springer, Berlin. Google Scholar CrossRef Search ADS   Friedman, J. H. ( 1991) Multivariate adaptive regression splines. The Annals of Statistics , pp. 1-- 67. Golub, G. H. & Van Loan, C. F. ( 2012) Matrix Computations , vol. 3. Baltimore, MD: JHU Press. Grasedyck, L., Kressner, D. & Tobler, C. ( 2013) A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen , 36, 53-- 78. Google Scholar CrossRef Search ADS   Hartley, R. & Schaffalitzky, F. ( 2003) Powerfactorization: 3d reconstruction with missing or uncertain data. Australia-Japan Advanced Workshop on Computer Vision , vol. 74, pp. 76-- 85. Hastie, T. J. & Tibshirani, R. J. ( 1990) Generalized Additive Models , vol. 43. Boca Rator, FL: CRC press. Kuang, M. & Tabak, E. G. ( 2017) Sample-based optimal transport and barycenter problems. Commun. Pure Appl. Mathematics (in press). Mohri, M., Rostamizadeh, A. & Talwalkar, A. ( 2012) Foundations of Machine Learning . Cambridge, MA: MIT press. Tabak, E. G. & Trigila, G. ( 2017) Explanation of variability and removal of confounding factors from data through optimal transport. Commun. Pure Appl. Mathematics  (in press). Google Scholar CrossRef Search ADS   © The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) For permissions, please e-mail: journals. permissions@oup.com http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Information and Inference: A Journal of the IMA Oxford University Press

# Conditional expectation estimation through attributable components

, Volume Advance Article – Mar 15, 2018
28 pages

/lp/ou_press/conditional-expectation-estimation-through-attributable-components-JJsE0JASmF
Publisher
Oxford University Press
ISSN
2049-8764
eISSN
2049-8772
D.O.I.
10.1093/imaiai/iax023
Publisher site
See Article on Publisher Site

### Abstract

Abstract A general methodology is proposed for the explanation of variability in a quantity of interest x in terms of covariates z = (z1, …, zL). It provides the conditional mean $$\bar{x}(z)$$ as a sum of components, where each component is represented as a product of non-parametric one-dimensional functions of each covariate zl that are computed through an alternating projection procedure. Both x and the zl can be real or categorical variables; in addition, some or all values of each zl can be unknown, providing a general framework for multi-clustering, classification and covariate imputation in the presence of confounding factors. The procedure can be considered as a preconditioning step for the more general determination of the full conditional distribution $$\boldsymbol{\rho}(x|z)$$ through a data-driven optimal-transport barycenter problem. In particular, just iterating the procedure once yields the second order structure (i.e. the covariance) of $$\boldsymbol{\rho}(x|z)$$. The methodology is illustrated through examples that include the explanation of variability of ground temperature across the continental United States and the prediction of book preference among potential readers. 1. Introduction A broad class of problems in data analysis focus on explaining variability in a quantity of interest x in terms of the values of a set of covariates z. For instance, one can attribute part of the variability in blood pressure to factors such as age and exercise. One can go one step further and seek hidden factors: new covariates z with values to be determined as part of the problem so as to account for a significant share of the variability in x. Examples are clustering, which explains variability in terms of discrete classes, and principal component analysis, which does so in terms of continuous variables z of dimension lower than the original x. The authors have recently developed (Tabak & Trigila, 2017) a unified framework for the explanation of variability, based on extensions of the mathematical theory of optimal transport. In this framework, one estimates the conditional probability distributions $$\boldsymbol{\rho}(x|z)$$ by mapping them to their Wasserstein barycenter (Agueh & Carlier, 2011). It was shown in the study by Tabak & Trigila (2017) that principal components emerge naturally from the methodology’s simplest setting, with maps restricted to rigid translations, and hence capturing not the full conditional probability distribution $$\boldsymbol{\rho}(x|z)$$, but only its conditional expectation $$\bar{x}(z)$$. From here stems one of this article’s legs: if one thinks of principal components in terms of explanation of variability, it is natural to consider more general scenarios, where the covariates, though still with values a priori unknown, are associated with particular attributes such as space, time or activity networks, and hence required to be smooth in the topologies associated with these attributes. For explaining variability, it is the subspace spanned by the principal components that matters, not the individual components. Seeking such subspace is a low-rank factorization problem, which can be computed with particular ease through alternating projections. This is the article’s second leg: we develop far-reaching extensions of principal component analysis that can be computed efficiently through variations of the alternating projection methodology. A third leg is the occurrence of missing and unstructured data. Frequently some—and sometimes most—of the observations xj are incomplete. This is for instance the case of the Netflix problem, where each viewer only ranks a small subset of all movies, and each movie is only rated by a small fraction of viewers. In addition, observations may not take place on regular grids. For instance, a person—the ‘observation’ in a census—has individual, values of age, height and weight drawn from a continuum, not a lattice. It turns out that both the extensions of principal components discussed here and their implementation in terms of alternating projections adapt very naturally to such scenarios, so they are developed without assuming that the data are complete or structured. At the core of the methodology proposed lies a formal analogy between the low-rank factorization of tensors and the expansion of multivariate functions as sums of products of single-variable ones, as in the classical construction of solutions to linear partial differential equations via separation of variables and in multivariate Fourier and Taylor expansions. Thus, when a variable x depends on many covariates z, we write it as a sum of ‘components’, each consisting of the product of functions of the individual covariates. In low-rank factorization, these one-variable functions are vectors (which can be thought of as functions of the row or column of the matrix entry being approximated), and hence completely unrestricted in the values they can adopt. When explaining variability, this is still the case for categorical covariates, such as sex or blood type. By contrast, in the separated variable solution to PDEs (Partial Differential Equation), the individual functions are typically given (sines, exponentials, special functions) and it is only the coefficients multiplying them that are fit to the data. Our procedure adopts free vectors, fitted to the data through the alternating projection methodology as in low-rank tensor factorization, but penalizes their non-smoothness as functions of those covariates z for which a distance can be defined (which includes continuous variables such as time or blood pressure, and ordinal ones such as rankings). Thus the methodology unifies the handling of categorical and non-categorical covariates. Describing multivariate functions in terms of sums of products of single-variable ones avoids the curse of dimensionality as the number of explanatory variables grows. Describing each of these single-variable functions by the values that it adopts on a grid further makes the computational cost largely independent of the number of observations available, particularly when combined with online learning and stochastic descent. In addition, the attribution of observations to grid points has a probabilistic interpretation that leads to a natural classification and clustering procedure, also applicable to the assignment of missing covariate values. The simultaneous consideration of the other covariates turns this into a general approach to multi-clustering and classification in the presence of confounding factors. Finding the conditional expectation is a regression problem, with a vast literature to which we cannot possibly do justice within the short frame of this introduction. Hence we shall only comment on methodologies that overlap significantly with the attributable components developed here. Least-square error regression (Friedman et al., 2001) is the natural starting point, as we also employ its characterization of the conditional mean as the minimizer of the variance (though in our case the computation of the conditional mean is framed within the general setting of optimal transport, where least-squares emerges naturally from the use of the 2-Wasserstein distance). Least-square regression is typically parametric, with linear models being by far the most commonly used, though these can be extended to handle nonlinearity, either explicitly through the introduction of nonlinear feature functions, or implicitly via the ‘kernel trick’ (Mohri et al., 2012). Smoothing splines (Hastie & Tibshirani, 1990) are a less parametric alternative with a penalization of non-smoothness similar to ours, though without the penalization weighting system based on the number of observations partially attributed to each node. Principal component analysis and, more generally, low-rank matrix factorization can be regarded as least-square regressions in terms of two categorical variables, the row and column of a matrix entry. Their calculation through alternate projections (Golub & Van Loan, 2012) is the computational basis for our extension to higher dimensional and less structured scenarios, while low-rank tensor factorization (Grasedyck et al., 2013) underlies our treatment of more than two categorical covariates. Our subsequent extension to real covariates has formal similarities to the multivariable adaptive regression splines (MARS) (Friedman, 1991), though its elementary components are more general than MARS’ hinges and the conceptual and computational framework of the two methodologies are substantially different. For clarity of exposition, rather than posing the proposed procedure from the start, this article follows its conceptual evolution through a series of simple steps. Thus, after this Introduction, Section 2 first briefly summarizes the relation between explanation of variability and optimal transport developed in the study by Tabak & Trigila (2017) (subsection 2.1), and restricts the maps defining the Wasserstein barycenter to rigid translations (subsection 2.2). Section 3 first considers the estimation of conditional means for tensors and arbitrary sets of categorical covariates, yielding a matrix completion problem solved through a low-rank factorization procedure. Then we consider covariates with an underlying metric, with components that are required to be smooth functions of these covariates, an extension motivated by the formal analogy between the components of a low-rank factorization and the single-variable factors in multivariate series such as Taylor’s and Fourier’s. Subsection 3.1 further parameterizes these single-variable functions in terms of the values they adopt not on the observed samples, but on an externally prescribed grid. Among the advantages of this switch is that it allows us to refine our characterization of smoothness to include higher derivatives (Section 4). Perhaps more importantly, it permits the inclusion of latent and not fully available covariates, extending to procedure to handle clustering, multi-clustering, classification and covariate inference, all in the presence of confounding factors (Section 5). The description so far addresses the explanation of variability in real variables x. Section 6 demonstrates that the procedure remains fundamentally unchanged when the x are categorical instead, as a simple embedding into Euclidean space allows one to define their conditional expected value $$\bar{x}(z)$$. Section 7 addresses the simple, but important issue of post-processing of the results, so as to make them interpretable by the user, providing the dependence of x on a selected number of covariates z through marginalization over all the others. Section 8 shows how iterating the procedure once allows one to capture not just the conditional expectation, but also the full conditional covariance structure of x. Section 9 includes a number of examples, some synthetic and some real applications, which display the versatility and power of the attributable component methodology. Finally, some conclusions and further extensions are drawn in Section 10. 2. Removal of variability attributable to known factors 2.1. The optimal transport barycenter problem Consider a variable x ∈ Rn and a set of d covariates z. Removing the variability in x explainable by z is equivalent to transforming x through a z-dependent map y = $$\boldsymbol{Y}$$(x;z), so that y is independent of z. In order not to remove additional variability in x not attributable to z, one imposes the additional condition that the maps deform the data minimally. This process finds a natural formulation in the language of optimal transport (Tabak & Trigila, 2017):   \begin{align} \min_{Y, \mu} C = \int \left[\int c\left(x, \boldsymbol{Y}(x; z) \right) \boldsymbol{\rho}(x|z) \ \textrm{d}x \right] \boldsymbol{\nu}(z) \ \textrm{d}z, \quad y = \boldsymbol{Y}(x; z) \sim \boldsymbol{\mu}. \end{align} (2.1) Here the cost c(x, y) is a measure of the deformation brought about by the map, for which we will adopt the squared distance   \begin{align} c(x,y) = \|x - y\|^{2}, \end{align} (2.2) and $$\boldsymbol{\nu}(z)$$ represents the distribution of the covariates z. The resulting optimal $$\boldsymbol{\mu}(y)$$, independent of z, is the $$\boldsymbol{\nu}$$-weighted c- barycenter of the conditional distributions $$\boldsymbol{\rho}(x|z)$$(Tabak & Trigila, 2017). For this particular cost function, the maps $$\boldsymbol{Y}$$ are given by the gradient of a potential function: $$\boldsymbol{Y}$$(x;z) = ∇xϕ(x;z) (Brenier, 1991). In applications, one does not know the distributions $$\boldsymbol{\rho}(x|z)$$ and $$\boldsymbol{\nu}(z)$$: the data consist solely of a finite set of samples {xi, zi}, i ∈ [1…m], in terms of which the objective function becomes   $$C = \frac{1}{m} \sum_{i=1}^{m} c\left(x\,^{i}, \boldsymbol{Y}(x\,^{i}; z\,^{i}) \right) .$$ Then one needs to restrict the space of allowable maps $$\boldsymbol{Y}$$ (x;z)—else one would overfit the data—and weaken the requirement that all $$\boldsymbol{\rho}(x|z)$$ be pushed into a single $$\boldsymbol{\mu}(y)$$, since (a) this is not achievable within a restricted family of maps, and (b) the $$\boldsymbol{\rho}(x|z)$$ are only known through samples and $$\boldsymbol{\mu}(y)$$ through their corresponding transformed points. 2.2. Map restriction to rigid translations Arguably, the simplest setting restricts the maps to rigid translations depending on z:   \begin{align} \boldsymbol{Y}(x; z) = x - \boldsymbol{\beta}(z). \end{align} (2.3) For rigid translations, the only constraint that one can impose on the distribution of the range of $$\boldsymbol{Y}$$ is that its expected value $$\bar{y}$$ should be independent of z, with the implication that   \begin{align} \boldsymbol{\beta}(z) = \bar{x}(z)-\bar{y}, \quad \textrm{with} \ \bar{y} = \int \bar{x}(z) \ \boldsymbol{\nu}(z) \ \textrm{d}z. \end{align} (2.4) In addition to its explanatory value, removing the expectation $$\bar{x}(z)$$ from the conditional distribution $$\boldsymbol{\rho}(x|z)$$ is an effective pre-conditioner for the full optimal transport barycenter problem, as the following theorem shows: Theorem Consider the solution $$\boldsymbol{Y}$$full(x;z) to the optimal transport barycenter problem (2.1) with the quadratic cost function (2.2), and the family of rigid translations $$\boldsymbol{Y}$$mean(x;z) given by (2.3, 2.4). The latter pushes forward the conditional distributions $$\boldsymbol{\rho}(x|z)$$ into new conditional distributions $${\boldsymbol{\rho}_{\ast}}(x|z)$$, with common mean $$\bar{y}$$ (by contrast, $$\boldsymbol{Y}$$full(x;z) pushes forward the $$\boldsymbol{\rho}(x|z)$$ into the z-independent barycenter $$\boldsymbol{\mu}$$). Solve now again the full problem (2.1), this time using the $${\boldsymbol{\rho}_{\ast}}(x|z)$$ as input distributions, and denote the corresponding optimal map $$\boldsymbol{Y}$$nl(x;z). Then   \begin{align} \boldsymbol{Y}_{\!full}(y; z) = \boldsymbol{Y}_{\!nl}(\boldsymbol{Y}_{\!mean}(y; z); z). \end{align} (2.5) In other words, the barycenter $$\boldsymbol{\mu}(y)$$ of the $$\boldsymbol{\rho}_{\ast}(x|z)$$ agrees with that of the original distributions $$\boldsymbol{\rho}(x|z)$$, and the optimal maps pushing forward the $$\boldsymbol{\rho}(x|z)$$ onto $$\boldsymbol{\mu}(y)$$ agree with the composition of the rigid translations and the optimal maps from the $$\boldsymbol{\rho}_{\ast}(x|z)$$ to $$\boldsymbol{\mu}(y)$$. Finding $$\boldsymbol{Y}$$nl is typically much easier computationally than finding $$\boldsymbol{Y}$$full, as the $$\boldsymbol{\rho}_{\ast}(x|z)$$ are closer to each other than the $$\boldsymbol{\rho}(x|z)$$, since they share the same expected value for all values of z. Proof. The statement above follows from two basic ingredients: a characterization of the barycenter distribution $$\boldsymbol{\mu}(y)$$ and the inverse $$\boldsymbol{X}$$(y;z) of $$\boldsymbol{Y}$$(x;z) as the unique satisfiers of the two properties (Álvarez-Esteban et al., 2016); (Kuang & Tabak, 2017): Each point y is the barycenter of its images under $$\boldsymbol{X}$$:   $$y = \int \boldsymbol{X}(y; z) \ \boldsymbol{\nu}(z) \ \textrm{d}z.$$ For each value of z, $$\boldsymbol{X}$$(y;z) maps optimally $$\boldsymbol{\mu}(y)$$ onto $$\boldsymbol{\rho}(x|z)$$ and a characterization of optimal maps under the square-distance cost as gradients of a convex function (Caffarelli, 2003):   $$\boldsymbol{X}(y; z) = \boldsymbol{\nabla}_{y} \psi(y; z), \quad \boldsymbol{\psi}(\cdot ; z) \ \textrm{convex}.$$ The composition of the two inverse maps is   \begin{align} \boldsymbol{X}_{full}(y; z) = \boldsymbol{X}_{mean}(\boldsymbol{X}_{nl}(y; z); z) = \boldsymbol{X}_{nl}(y; z) + \boldsymbol{\beta}(z) = \boldsymbol{\nabla}_{y} \boldsymbol{\psi}_{nl} (y; z) + \boldsymbol{\beta}(z), \nonumber \end{align} with $${\boldsymbol{\psi}_{\!nl}}(y;z)$$ convex in y, and therefore   $$\boldsymbol{X}_{\!full}(y; z) = \boldsymbol{\nabla}_{\!y} \boldsymbol{\psi}_{\!full}(y; z), \quad \textrm{with } \ \boldsymbol{\psi}_{\!full}(y; z) = \boldsymbol{\psi}_{\!nl} (y; z) + \boldsymbol{\beta}(z) y \ \textrm{ also convex},$$ so it is optimal, as required by property 2. Also   \begin{align*} \int \boldsymbol{X}_{full}(y; z) \ \boldsymbol{\nu}(z) \ \textrm{d}z &=\int \left[\boldsymbol{X}_{\!nl}(y; z) + \boldsymbol{\beta}(z)\right] \ \boldsymbol{\nu}(z) \ \textrm{d}z \\ &=y + \int \left[\bar{x}(z)-\bar{y} \right] \ \boldsymbol{\nu}(z) \ \textrm{d}z = y, \end{align*} thus satisfying property 1 and concluding the proof. 3. A low rank tensor factorization, separated variable procedure Given a set of m observations xi and L corresponding covariates $${z_{l}^{i}}$$, the task of removing the variability associated with z from x through the z-dependent rigid translation   $$y = x - \bar{x}(z) + \bar{y}, \quad \bar{y} = \bar{x}$$ reduces to regression, i.e. to estimating the conditional mean $$\bar{x}(z)$$. This can be characterized as the minimizer of the variance:   \begin{align} \bar{x}(z) = \arg \min_{f} \sum_{i} \left\|x\,^{i} - f(z\,^{i}) \right\|^{2} \end{align} (3.1) over a proposed family of functions f(z). Any multivariable function f(z) can be approximated to arbitrary accuracy by the superposition of products of functions fl of the individual components zl of z:   $$f(z) \approx \sum_{k} \prod_{l} {f_{l}^{k}}\left(z_{l}\right)\!.$$ Classical examples are the power series when $$z \in \mathbb{R}^{L}$$,   $$f(z) \approx \sum_{k} a_{k} \prod_{l=1}^{L}{z_{l}}^{{s^{k}_{l}}}, \quad s^{k} \in \mathbb{N}^{L},$$ the Fourier series when z is in the 2π-periodic L-dimensional torus,   $$f(z) \approx \sum_{k} a_{k} \textrm{e}^{\,i \sum_{l} {\xi^{k}_{l}} z_{l}}, \quad \xi^{k} \in \mathbb{Z}^{L}$$ and the singular value decomposition when $$z=(i,j) \in \mathbb{N}^{2}$$,   $$x(i,j) \approx \sum_{k} \sigma_{k} \ u^{k}(i) v^{k}(j).$$ This suggests proposing the approximation   \begin{align} \bar{x}(z) \approx \sum_{k=1}^{d} \prod_{l=1}^{L} V(l)^{k}(z_{l}). \end{align} (3.2) If x is vectorial, we can make it scalar by including the index as one more factor zl. Then (3.1) reduces to minimizing   \begin{align} L = \sum_{i} \left(x^{i} - \sum_{k=1}^{d} \prod_{l=1}^{L} V(l)^{k}\left({z_{l}^{i}}\right) \right)^{2} \end{align} (3.3) over the degrees of freedom available in the specification of the functions V(l)k(z). Consider first the particular case where all zl are categorical variables—such as the vector index above—with a finite number of possible values. Then, for each l, V(l) is a matrix with components $$V(l\,)^{k}_{z\,_{l}}$$, and the minimization of L becomes the low-rank factorization of the tensor x(z1, …, zL) from the available—possibly repeated—entries {xi}. This problem can be solved through an alternating direction methodology, minimizing L alternatively over each V (l). This minimization yields, for each value j of zl,   \begin{align} V(l\,)_{\,j} = \left(\sum_{i \in I_{\,j}} x^{i} \prod_{b\ne l} V(b)_{{z_{b}^{i}}} \right) \left[\sum_{i \in I_{\,j}} \left(\prod_{b\ne l} V(b)_{{z_{b}^{i}}} \right)^{T} \left(\prod_{b\ne l} V(b)_{{z_{b}^{i}}} \right) \right]^{-1}, \end{align} (3.4) where   $$I_{\,j} = \left\{i : {z_{l}^{i}} = j\right\}.$$ Such tensor-completion through low-rank factorization has a broad range of applications. Consider the following two typical examples: A blood test, measuring various quantities such as cholesterol level and white cell counts. Here the variable x—a scalar—is the value of a measurement. Some of the categorical covariates zl that one would use to account for variability in x are: (a) The quantity being measured (the row in regular low-rank matrix factorization). (b) The facility where the test was performed. (c) Individual characteristics of the patient (sex, ethnicity, treatment, etc.). Excluded so far are covariates with continuous values, such as age, weight, time of the day and results of prior tests; these will be discussed below. Further down we will also include latent covariates, such as distinct groups into which the patients can be divided in light of their test results, a case of clustering in the presence of confounding factors. 2. The ‘Netflix problem’: x is the rating given by viewers to movies, z1 the viewer and z2 the movie. Further extensions below allow for the inclusion of more granular characteristics, such as movie type and year of release, and viewer’s age and location. In tensor-completion, there is only a finite set of possible arguments $$j={z_{l}^{i}}$$ for V(l)(zl), and the corresponding values $$V(l)_{\,j}^{k}$$ are free for the minimization to determine. By contrast, when using separation of variables for solving PDE’s, the form of the V(l)k(z) is known ab-initio (sines, exponentials, special functions), with only their amplitudes available to fit initial or boundary data. In a data-analysis context, one would call such model parametric. Without a specific principle to guide the selection of the functions V(l)(z), this approach suffers from the arbitrariness of any such choice, which could lead to poor, overfitted or uninterpretable results. Instead, we can preserve the freedom in the assignment of values to each $$V(l)_{{z_{l}^{i}}}^{k}$$ while enforcing a smooth dependence of V(l)k on zl by adding to the objective function a penalization for non-smoothness, proportional to   $$\sum_{i,\,j>i} C_{i,\,j}^{\,l} \left(V(l\,)_{{z_{l}^{i}}}^{k}-V(l\,)_{{z_{l}^{\,j}}}^{k}\right)^{2},$$ with $$C_{i,j}^{l}$$ inversely proportional to a measure of the distance between $${z_{l}^{i}}$$ and $${z_{l}^{\,j}}$$. The logic behind this choice is the following: When $${z_{l}^{i}}$$ and $${z_{l}^{\,j}}$$ are close to each other, the corresponding V(l)k need also be close, as otherwise they would incur a penalty. This is how one enforces smoothness of V(l)k(z). As the penalty term is quadratic in each V(l), solving for each V(l) with all other V(b) fixed amounts to inverting a linear system, as in conventional alternating projections. The new objective function has the form   \begin{align} L = \sum_{i} \left(x^{i} - \sum_{k} \prod_{l} V(l)^{k}_{{z_{l}^{i}}} \right)^{2} + \sum_{l \in M} \lambda_{l} \sum_{k=1}^{d} \left(\prod_{b \in L, b\ne l} \|V(b)^{k}\|^{2}\right) \sum_{i,\,j>i} C_{i,\,j}^{l} \left(V(l)_{{z_{l}^{i}}}^{k}-V(l)_{{z_{l}^{\,j}}}^{k}\right)^{2}, \end{align} (3.5) where M is the set of factors with an underlying metric, for which the smoothness of functions is required. The pre-factors to the penalty terms, products of squares of the norms of the V(b)k, are included so that the objective function is invariant under re-scalings of the V(l)k that preserve their product. Otherwise, the penalty terms could be made arbitrarily small, not by enforcing smoothness, but by multiplying each V(l) by a small factor and absorbing this factor into other V’s less constrained by the smoothness requirement. The constants λl quantify the amount of penalization associated to smoothness for each factor. 3.1. Semi-parameterization Now we extend the procedure above to simultaneously address various issues: As the number of observations grows, so does the number of unknowns $$V(l)_{{z_{l}^{i}}}^{k}$$, turning the solution to the linear system from each step into a computational bottleneck. Yet the underlying functions V(l)k(z) do not change in any fundamental way; we are just seeking their values at a larger set of points $$\{{z_{l}^{i}}\}$$. Ultimately, we want to resolve these functions to a level of detail determined by our needs, not by the number of observations available. Also, the observed values $${z_{l}^{i}}$$ may not provide an adequate grid for resolving the V(l)k(z), as they may cluster in some locations and under-resolve others. As a consequence, the cost matrix $$C_{ij}^{l}$$ may be highly unbalanced. Frequently, values of a covariate zl are repeated systematically across the dataset. Two typical scenarios are the following: The variable zl can only adopt a finite number of values (as in rankings), which are then necessarily repeated many times. The dataset consists of observations drawn from various individuals, each observed many times. The characteristics z of each individual then appear repeatedly, once for each such observation. Examples: patient’s age in medical studies where each patient is followed over time, a meteorological station’s latitude and elevation in datasets comprising observations from more than one station. In such situations, the most sensible choice is to adopt just one value of V(z) for each value of z, rather than penalizing their variation. When some values $${z_{l}^{i}}$$ are missing from the dataset—even though xi and possibly other covariates $${z_{b}^{i}}$$ have been recorded—applying the procedure requires a way to assign values to the corresponding $$V(l)_{{z_{l}^{i}}}^{k}$$. Sometimes the $${z_{l}^{i}}$$ are unknown ab-initio, as they are latent variables to be assigned by the algorithm to further explain variability in an unsupervised fashion. (We will address latent variables in Section 5 using the tools developed here.) The procedure is non-parametric, with the functions Vl(z) defined only by their values on the sample points $${z_{l}^{i}}$$ and with no specified functional form, just constrained by the penalization on non-smoothness. Yet there are reasons why one may want to have a more explicit description of these functions, both within and after the procedure: (a) Prediction and imputation: one typical goal of data analysis is to predict the value of x or its probability distribution for a new set of values of the covariates z. This requires evaluating the V(l)k(zl) at the new values provided. (b) On-line learning: when new data keep flowing in, as in weather forecasting, one should not start the procedure from scratch every time. Avoiding this requires a current estimation of the V(l)k(zl) for the newly arrived values of zl. 5. Even though the procedure as described so far combines naturally categorical and non-categorical variables, the latter are penalized for non-smoothness, while the former are not. Hence the algorithm will be biased to explain as much variability as possible in terms of the categorical variables, a possibly undesired feature. In fact, the exact opposite is often required, as some of these categorical variables represent ‘idiosyncratic’ variability (as do the ‘viewer’ variable in the Netflix problem, the ‘patient’ variable in a medical study and the ‘stock’ variable in finance), which one would like to use to explain only the variability left once more descriptive factors (age, weight, industry) have had their say. In order to address the issues above, we introduce for each l a grid (not necessarily regular) of values zg(l)j, and the linear interpolation scheme   \begin{align} {z_{l}^{i}} = \sum_{\,j} \alpha(l)_{i}^{\,j}{z_{g}}(l)^{\,j}, \quad \textrm{so that we can posit} \quad \tilde{V}(l)_{{z_{l}^{i}}}^{k} = \sum_{\,j} \alpha(l)_{i}^{\,j} V(l)_{\,j}^{k}. \end{align} (3.6) Here the $$\tilde{V}_{{z_{l}^{i}}}$$ are the values associated with observation i, while the Vj are the smaller set of unknowns associated with the grid zg(l)j. Then the problem becomes   \begin{align} \min_{V} \sum_{i} \left(x^{i} - \sum_{k} \prod_{l \in L} \sum_{\,j} \alpha(l)_{i}^{\,j} V(l)_{\,j}^{k} \right)^{2}+\sum_{l=1}^{L} \lambda_{l} \sum_{k} \left(\prod_{b \in L, b\ne l} \|V(b)^{k}\|^{2}\right) \sum_{i,\,j>i} C_{i,\,j}^{l} \left(V(l)_{i}^{k}-V(l)_{\,j}^{k}\right)^{2}, \end{align} (3.7) which includes (3.5) as a particular case with the α ∈ {0, 1}. Let us see how this modification addresses each of the issues raised before: The number of unknowns $$V(l)_{\,j}^{k}$$ is controlled by the externally supplied grid, not the number of observations available. Moreover, this grid can be designed so as to be well-balanced, with neither clusters nor holes, and with as fine a mesh as desired in regions where higher resolution is sought. A repeated value of $${z_{l}^{i}}$$ is assigned through the same coefficients α to the same grid points zg(l)j. When the number of values that zl can adopt is finite, those are by default the values given as grid points, with corresponding α ∈ {0, 1}. When the value of zl for observation i is missing, one should marginalize over it, adopting for $$V(l)_{{z_{l}^{i}}}^{k}$$ the expected value of V(l)k(z). This corresponds to assigning as $$\alpha (l)_{i}^{\,j}$$ the mean value $$\alpha (l)_{i}^{\,j} = \frac{1}{n_{t}} \sum _{t=1}^{n_{t}} \alpha (l)_{t}^{\,j}$$ of α over the observations $${z_{l}^{t}}$$ for which zl is known. If there is a surrogate for zl that allows us to infer that the missing $${z_{l}^{i}}$$ is closer to some of the observed $${z_{l}^{\,j}}$$ than others—often the time t adopts this role—a correspondingly weighted mean of α should be adopted. In order to evaluate $$\tilde{V}(l)^{k}(z)$$ for a new value of z, one just needs to find the values of α that interpolate z in the grid and apply (3.6). One can extend the penalization to categorical variables by adding a new, dummy value of z to the grid, with corresponding unknown $$V(l)_{o}^{k}$$, and making each $$C_{oj}^{l} = 1$$ and all other $$C_{ij}^{l}=0$$. This penalizes variability in the V(l)k without reference to any underlying metric. The final value of $$V(l)_{o}^{k}$$ will be the empirical mean of V(l)k(z), and the quantity penalized, its variance. 4. Higher orders of smoothness Our characterization of smoothness has been restricted so far to continuity, enforced through the penalization of large differences in V(l) when the corresponding zl are close. With the procedure extended to incorporate well-balanced grids for each zl, we can refine this characterization, penalizing higher derivatives of V(l). In addition to providing a smoother fit to the data, this refinement is also useful for extrapolation. Consider a situation where one would like to estimate V(z) for a value of z outside the range of the data, say $$z> z_{\max }$$. The choice most consistent with the current algorithm is to set $$V(z) = V(z_{\max })$$, as this is the most ‘continuous’ choice. If instead we were minimizing ∥V″∥2, then we would extrapolate linearly, following the slope at $$z_{\max }$$. An intermediate choice follows from minimizing a2∥V′∥2 + |V′′∥2, as now beyond $$z_{\max }$$, not being required to explain any data, one would be solving the ordinary differential equation   $$\frac{\delta}{\delta V^{\prime}} \int \left(a^{2} \|V^{\prime}\|^{2} + |V^{\prime\prime}\|^{2}\right) \textrm{d}z = 0 \rightarrow \frac{d^{2} V^{\prime}}{d z^{2}} - a^{2} V^{\prime} = 0,$$ with solution   $$V^{\prime}(z) = V^{\prime}(z_{\max}) \textrm{e}^{-a (z-z_{\max})},$$ where the slope V′ agrees with its value at $$z_{\max }$$, but then decays to zero. One simple recipe for higher-order smoothness is to penalize the squared norm of a finite difference approximation to a derivative, as in the early versions of smoothing splines (Hastie & Tibshirani, 1990). In addition, it is convenient to weight this norm, so that each value $$V(l)^{k}_{\,j}$$ is rewarded for its explanatory value and penalized for non-smoothness in a balanced way. In particular, if values of zg are added beyond the range of the data, these should not affect the fit within the range. For concreteness, we develop here the procedure to penalize the first and second derivatives of V(l); extending this to higher derivatives is straightforward. Given a sorted grid {zj}, the three-point finite-difference approximation to the first and second derivatives of V at point zj is given by   $$V_{\,j}^{\prime} \approx a V_{\,j-1} + b V_{\,j} + c V_{\,j+1} , \quad V_{\,j}^{\prime\prime} \approx A V_{\,j-1} + B V_{\,j} + C V_{\,j+1},$$ with   \begin{align*} a =&\, -\frac{{\Delta_{+}}^{2}}{D}, \quad b = -\frac{{\Delta_{+}}^{2}-{\Delta_{-}}^{2}}{D}, \quad c = \frac{{\Delta_{-}}^{2}}{D},\\ A =&\, \frac{{\Delta_{+}}}{D}, \quad B = -\frac{{\Delta_{+}}-{\Delta_{-}}}{D}, \quad C = -\frac{{\Delta_{-}}}{D}, \\ \Delta_{+} =&\ z^{\,j+1}-z^{\,j}, \quad \Delta_{-} = z^{\,j-1}-z^{\,j}, \quad D = \frac{1}{2} \Delta_{+} \Delta_{-} \left(\Delta_{-} - \Delta_{+} \right)\!. \end{align*} Defining the weight wj of grid point $${z_{g}^{\,j}}$$ by the sum of its contributions to the data points {zi}:   $$w_{\,j} = \varepsilon + \sum_{i=1}^{m} {\alpha_{i}^{\,j}},$$ where ε ≪ 1 is the weight assigned to grid values away from the data, one can propose a penalty for non-smoothness of the form   $$\lambda \sum_{j} w_{\,j} \left(\delta \left\|V_{\,j}^{\prime}\right\|^{2} + (1-\delta) \left\|V_{\,j}^{\prime\prime}\right\|^{2} \right) = \lambda V^{\prime} C V.$$ (The C so defined is a non-negative-definite tridiagonal matrix.) Here δ ∈ [0, 1] measures the relative weight given to the first and second derivatives of V. For notational clarity, we are absorbing into the constant λ the product of the square norms of the V(b)k with b ≠ l in (3.7). When written in full, the problem becomes   \begin{align} \min_{V} \sum_{i} \left(x^{i} - \sum_{k} \prod_{l \in L} \sum_{\,j} \alpha(l)_{i}^{\,j} V(l)_{\,j}^{k} \right)^{2}+\sum_{l=1}^{L} \lambda_{l} \sum_{k} \left(\prod_{b \in L, b\ne l} \left\|V(b)^{k}\right\|^{2}\right) {V(l)^{k}}^{\prime} C^{l} V(l)^{k}. \end{align} (4.1) 5. Clustering, classification and posterior assignment of missing covariate values The previous section discussed how to handle situations in which some, but not all values of a given covariate zl were missing. When no value is known, zl is a latent variable, whose values should be assigned by the algorithm to further explain the variability in the data. This section extends the procedure to handle the discovery of latent categorical variables. Assigning a value $${z_{l}^{\,i}}$$ to data point xi is a clustering problem, which the presence of other covariates zb, b ≠ l turns it into clustering in the presence of confounding variables. In our setting, this gives rise to a mechanism analogous to k-means and Expectation Maximization (Friedman et al., 2001). In order to assign $${z_{l}^{i}}$$, one first sets a number ncl of clusters, and defines the corresponding ‘grid’ zg(l)j, with j = 0…ncl (with 0 standing for the dummy element to which no point is assigned for the penalization of functions of categorical variables). From eq. (3.6), the $${z_{l}^{i}}$$ sought follows from the values of the $$\alpha (l)_{i}^{\,j}$$ used to interpolate on the grid zg(l). These determine automatically to which l-variable cluster the observation xi belongs. When clustering through hard assignments, all entries $$\alpha (l)_{i}^{\,j}$$, but one equal zero, except the one specifying to which class j the data point xi is assigned. The values of $$\alpha (l)_{i}^{\,j}$$ are linked to the corresponding values of $$V(l)_{\,j}^{k}$$ in a two step procedure similar to Lloyd’s algorithm for k-means: Given the values of $$V(l)^{k}_{\,j}$$, we assign zl by choosing the j for which $$\alpha (l)_{i}^{\,j}=1$$ so that the resulting predicted mean $$\bar{x}(z^{\,i})$$ is closest to the observation xi (in other words, to decrease as much as possible the first term in 4.1). This corresponds to the reassignment step based on proximity within a given centroid in Lloyd’s algorithm. Once the alpha are known the values of V(l)j are updated as before by minimizing L (eq. 4.1). This step finds the mean of x within each class defined by the values of zl and corresponds to the update of the centroid in the Lloyd’s algorithm. A version with soft assignments can be developed along similar lines, with each observation xi having a probability $${\alpha _{i}^{\,j}}$$ of belonging to class j. Then the V(l) are updated as before, and the $${\alpha _{i}^{\,j}}$$ can be assigned through the following Bayesian procedure: For each cluster j, compute the square distances $${d_{i}^{\,j}}$$ between the cluster’s mean $$\bar{x}({z_{1}^{i}}, \ldots , {z_{l}^{\,j}}, \ldots{z_{L}^{i}})$$ and each observation xi:   $${d_{i}^{\,j}} = \left(x^{i} - \sum_{k} \left(\prod_{b \ne l} \sum_{\,j} \alpha(b)_{i}^{\,j} V(b)_{\,j}^{k}\right) V(l)_{\,j}^{k}\right)^{2}$$ and the standard deviation   $$\sigma_{\,j} = \left(\frac{\sum_{i} \alpha(l)_{i}^{\,j} {d_{i}^{\,j}}}{\sum_{i} \alpha(l)_{i}^{\,j}} \right)^{\frac{1}{2}} .$$ Denoting Q(l)s the sth set of ns observations {i} constrained to share the same assignments α(l)i, compute the average square distance to each cluster’s mean:   $${D_{s}^{\,j}} = \frac{1}{n_{s}}\sum_{i\in Q(l)_{s}} {d_{i}^{\,j}}.$$ Update $${\alpha _{i}^{\,j}} = {P_{s}^{\,j}}$$ (i ∈ Q(l)s) through Bayes formula, modeling the clusters as one-dimensional Gaussians and adopting $${P_{s}^{\,j}}$$ itself as prior:   $$\rho^{\,j} \propto \frac{1}{{\sigma_{\,j}}} \exp\left(-\frac{{D_{s}^{\,j}}}{2{\sigma_{\,j}}^{2}}\right),$$  \begin{equation*} {\hskip-35pt}{P_{s}^{\,j}} = \frac{{P_{s}^{\,j}} \rho^{\,j}}{\sum_{h} {P_{s}^{h}} \rho^{h}}, \end{equation*} a choice with positive feedback that leads to convergence to rigid assignments $${P_{s}^{\,j}} \in \{0, 1\}$$. If soft assignments are desired throughout, one should use in Bayes formula the non-informative prior $$P_{prior}^{\,j} = \frac{1}{n_{cl}}$$ instead of $${P_{s}^{\,j}}$$. In classification problems, the αi are known (and fixed) for the training set, while they evolve as above for the testing set. To be consistent with the Baysian approach, one should in this case use either the fixed prior   $$P_{prior}^{\,j} = \frac{1}{m_{k}} \sum_{i=1}^{m_{k}} {\alpha_{i}^{\,j}},$$ where the sum is taken over the mk observations for which zi is known, or, in the spirit the Expectation Maximization algorithm, the evolving prior   $$P_{prior}^{\,j} = \frac{1}{m} \sum_{i=1}^{m} {\alpha_{i}^{\,j}}$$ computed over all observations, including the ones for with α is being updated. Notice that the procedure just described applies without changes when the latent variable zl is non-categorical. Here we would not call the procedure ‘classification’, as this refers to categorical classes; a more appropriate name would be ‘posterior assignment of unknown covariate values’. It applies, for instance, to softly assign an age to individuals for which it has not been recorded. 5.1. Use for complexity reduction In large datasets, an idiosyncratic variable zl (the viewer, the movie, the patient, the station, the stock) can adopt a very large number of values. Since idiosyncratic variables are categorical, we cannot reduce the number of unknowns V(l)k simply by interpolating on a grid zg(l) via (3.6), since one would not know, for instance, how to interpolate one patient among others when the more informative variables—age, weight, etc.—are already considered separately. However, one can still use (3.6) to the same effect, but with the values of α unknown beforehand, i.e. performing clustering. In the example just mentioned, we would replace the idiosyncratic variable ‘patient’ by ‘patient type’, with the number of types decided based on the level of resolution sought. Below, we will show an application of this clustering procedure to a matrix completion problem for a book recommendation system, where the idiosyncratic variables ‘reader’ and ‘book’ are clustered into ‘reader type’ and ‘book type’ (different from the book genre or any other covariates taken into account explicitly). 6. Categorical dependent variables The methodology allows for covariates z of any type: categorical, ordinal, real, periodic. Does a comparable level of generality extend to the variable x whose unexplained variability one attempts to minimize? In particular, there are many scenarios of interest where x is categorical: has person z1 read book z2, will patient z1 develop illness z2, which party does person z1 vote for, etc.? The object that the procedure seeks, the conditional expectation $$\bar{x}(z)$$, can still be assigned a meaning when x is categorical. For instance, for a binary x, one can assign the values {0, 1} to the two possible outcomes, and define   $$\bar{x}(z) = 0 \cdot P(0|z) + 1 \cdot P(1|z) = P(1|z),$$ where the P(x|z) are the z-conditional probabilities for the outcome x. In doing so, we have embedded the space of outcomes x into a metric space, i.e. the real line. The resulting $$\bar{x}$$ has a clear meaning in terms of the underlying probability distribution P. Similarly, if x has three possible outcomes, we can embed those into the two-dimensional plane, as vertices of an equilateral triangle. The important notion here is that all the points so assigned be equidistant, not to affect the categorical nature of the variable through a metric-induced preferential ordering. Then, following the attributed component methodology, the two dimensions of x in this example are captured through a covariate z with values in {1, 2}. More generally, a number n of outcomes is embedded into an n − 1-dimensional space. A simpler alternative would embed a variable x with n outcomes into an n-dimensional space, as vertices of the corresponding simplex, i.e. the 1 of K encoding. This has more straightforward interpretability, as each component of $$\bar{x}$$ represents the probability of the corresponding outcome. On the other hand, it has one more dimension to consider and it requires imposing the constraint that $$\sum _{i} \bar{x}_{i}(z) = 1$$, which increases—albeit slightly—the computational complexity of the algorithm. 7. Interpretabilitythrough marginalization The procedure provides an estimation of the conditional mean of the form   $$\bar{x}(z_{1}, \ldots, z_{L}) = \sum_{k=1}^{d} \prod_{l \in L} \sum_{\,j} \alpha(l)^{\,j}(z_{l}) V(l)_{\,j}^{k},$$ where the $$V(l)_{\,j}^{k}$$ are known, and α(l)j(zl) are the coefficients that interpolate zl in the given grid zg(l)j. It is straightforward to use this expression to find the estimate for $$\bar{x}$$ for a new value of z. On the other hand, determining how x depends on each individual zl under average conditions for the other covariates zb is a question of marginalization: find   \begin{align} \bar{x}(z_{l}) &= \int x \ \rho(x|z) \ \textrm{d}z_{1} \ldots \ \textrm{d}z_{l-1} \ \textrm{d}z_{l+1} \ldots \ \textrm{d}z_{L} \nonumber \\ &= \int \bar{x}(z) \ \textrm{d}z_{1} \ldots \ \textrm{d}z_{l-1} \ \textrm{d}z_{l+1} \ldots \ \textrm{d}z_{L} . \end{align} (7.1) Evaluating at $$z_{l} = z_{l}^{\ast }$$ and replacing expectations by the corresponding empirical means, one has   \begin{align*} \bar{x}(z_{l}^{\ast}) &= \frac{1}{m} \sum_{i=1}^{m} \bar{x}({z_{1}^{i}}, \ldots, z_{l}^{\ast}, \ldots, {z_{L}^{i}}) \\ &=\sum_{k=1}^{d} \frac{1}{m}\sum_{i=1}^{m} \left[\prod_{b \ne l} \sum_{\,j} \alpha(b)_{i}^{\,j} V(b)_{\,j}^{k}\right] \sum_{\,j} \alpha(l)^{\,j}(z_{l}^{\ast}) V(l)_{\,j}^{k}. \end{align*} Similarly, one can compute the marginal of $$\bar{x}$$ over s variables $$\{z_{l_{h}}\}$$, through   \begin{align} \bar{x}(z_{l_{1}}^{\ast}, \ldots, z_{l_{s}}^{\ast}) = \sum_{k=1}^{d} \frac{1}{m}\sum_{i=1}^{m} \left[\prod_{b \not{\in} \{l_{h}\}} \sum_{\,j} \alpha(b)_{i}^{\,j} V(b)_{\,j}^{k}\right]\prod_{h=1}^{s} \sum_{\,j} \alpha(l_{h})^{\,j}(z_{l_{h}}^{\ast}) V(l_{h})_{\,j}^{k} . \end{align} (7.2) 8. Further explanation of variability: the variance The estimation of the conditional mean $$\bar{x}(z)$$ extends effortlessly to second order statistics. Introducing the filtered variable y through   \begin{align} y = x - \bar{x}(z), \end{align} (8.1) one can for instance estimate the z-dependent variance σ2(z) by applying the procedure not to x, but to w = y2, since   $$\sigma^{2}(z) = \bar{w}(z).$$ Often it is a covariance matrix we are after, with one of the zl representing the ‘entry’ or ‘variable’ in x. If two entries zl = i and zl = j are always observed simultaneously—as are, for instance, two quantities in a blood test—then one writes w = y(zl = i, zb)y(zl = j, zb) with zb comprising all covariates except zl, in order to estimate $${\Sigma _{i}^{\,j}} = \bar{w}$$ as a function of zb. Rather than estimating each entry (i, j) at a time though, it is better to estimate the whole covariance matrix Σ(i, j, zb) at once, as this will use all the information available and link the various entries of Σ, so that in particular for each value of zb, Σ(:, :, zb) is a non-negative definite matrix. Yet this procedure extends to much more general situations. One might seek, for instance, the correlation between x at two locations/times zl as a function of the distance/interval Δzl between them. For this, we introduce all products $$w = x({z_{l}^{i}}) x({z_{l}^{\,j}})$$ and a new covariate $$\varDelta z = {z_{l}^{i}}-{z_{l}^{\,j}}$$. For the remaining covariates zb, one may take the average between their values at the two observations. Alternatively, one may restrict the products considered to those where the two values of zb are not too far from each other, or include the Δzb as extra covariates. Similar examples abound: correlation between height or weight at different ages, between precipitation at different locations as a function of time-lag, etc. 9. Examples This section includes examples of application of the attributable component methodology. Though one of the examples uses real data—hourly ground temperatures across the continental US—the purpose here is not to highlight any particular application, but to illustrate various aspects of the workings of the procedure. Thus we start with a couple of simple synthetic examples that directly compare the components discovered by the algorithm with the actual variable-separated solution underlying the data. Then we consider the real example of ground temperatures, which combines different kinds of covariates: categorical, real and periodic, is big enough to require the use of stochastic descent, and also illustrates different uses of the iteration of the procedure that captures second order structure underlying the data. Finally, we consider a synthetic Netflix-like problem, posed in terms of book ratings by readers, that illustrates other capabilities of the methodology, including bi-clustering for complexity reduction in the presence of confounding factors and the use of idiosyncratic variables (reader, book) in combination with more descriptive—but less specific—covariates such as age, location and book type. 9.1. Numerical separation of variables Our first two examples analyze data generated using synthetic models where the distribution underlying the samples takes the variable-separated form that our algorithm adopts as ansatz. The first of these models has three covariates zl, two periodic and one real, and takes the form   $$x^{i} = \bar{x}(z^{i}) + w^{i}, \quad \bar{x}(z) = \frac{1}{5} \cos\left(z_{1}\right) \sin\left(z_{2}\right) z_{3}, \quad w \sim \mathscr{N}(0, 1),$$ where the m = 1000 samples from z1, z2 and z3 are drawn uniformly in the interval [π, π]. Figure 1 displays the results of applying the procedure with d = 1 component and λl = 0.01, l = {1, 2, 3}. The initial values of $$V(l)_{\,j}^{k}$$ (here and in all examples in this paper) are taken as small, symmetry-breaking perturbations of the uniform $$V(l)_{\,j}^{k}=1$$. Fig. 1. View largeDownload slide Upper left: original data x (in light gray -blue on the web version of the manuscript-) and the filtered signal y (in dark gray -red on the web version of the manuscript-) as a function of the noise w. As one can see, y is a function of the noise, i.e. of the unexplained variability, as all explainable variability has been filtered away. Upper right: first (and only) component as a function of z1. Lower left: same component as a function of z2. Lower right: same component as a function of z3. As one can see, the true separated solution underlying the data has been fully recovered. Fig. 1. View largeDownload slide Upper left: original data x (in light gray -blue on the web version of the manuscript-) and the filtered signal y (in dark gray -red on the web version of the manuscript-) as a function of the noise w. As one can see, y is a function of the noise, i.e. of the unexplained variability, as all explainable variability has been filtered away. Upper right: first (and only) component as a function of z1. Lower left: same component as a function of z2. Lower right: same component as a function of z3. As one can see, the true separated solution underlying the data has been fully recovered. The second model, with one less covariate and one more component, is   \begin{align} x^{i} = \bar{x}(z^{i}) + w^{i}, \quad \bar{x}(z) = 4 \cos\left(z_{1}\right) \sin\left(z_{2}\right) + 0.3\ z_{1}{z_{2}}^{2}, \quad w \sim \mathscr{N}(0, 1). \end{align} (9.1) Here we also draw m = 1000 samples of z1 and z2 uniformly in the interval [π, π] and adopt λl = 0.01, l = {1, 2}, setting this time the number of components to d = 2. The purpose of this example is to display the non-uniqueness of the separation into components: as shown in Fig. 2, the two components recovered by the algorithm, though fitting $$\bar{x}(z)$$ to perfection, do not agree with those used to generate the data in (9.1). This non-uniqueness of the decomposition into components is general whenever L ≥ 2. For instance, for any f1, f2, g1, g2, we can write  $$f_{1}(x) g_{1}(y) + f_{2}(x) g_{2}(y) = \tilde{f}_{1}(x) \tilde{g}_{1}(y) + \tilde{f}_{2}(x) \tilde{g}_{2}(y),$$ Fig. 2. View largeDownload slide Upper left panel: original data x (in light gray -blue on the web version of the manuscript-) and the filtered signal y (in dark gray -red on the web version of the manuscript-); the latter is a function of the unexplainable variability, i.e. the noise w. Upper right panel: the two components as functions of z1. Lower left: same as functions of z2. This separated solution does not agree with the one proposed to generate the data, yet it is completely equivalent, though smoother from the perspective of the objective function. Fig. 2. View largeDownload slide Upper left panel: original data x (in light gray -blue on the web version of the manuscript-) and the filtered signal y (in dark gray -red on the web version of the manuscript-); the latter is a function of the unexplainable variability, i.e. the noise w. Upper right panel: the two components as functions of z1. Lower left: same as functions of z2. This separated solution does not agree with the one proposed to generate the data, yet it is completely equivalent, though smoother from the perspective of the objective function. with   $$\tilde{f}_{1}(x) = f_{1}(x), \quad \tilde{g}_{1}(y) = g_{1}(y) + g_{2}(y), \quad \tilde{f}_{2}(x) = f_{2}(x)-f_{1}(x), \quad \tilde{g}_{2}(y) = g_{2}(y).$$ The algorithm picks, among all equivalent decompositions, the ‘smoothest’ one, in the sense of minimizing the second term in (4.1) penalizing non-smoothness. Notice though that $$\bar{x}(z)$$ itself and all marginals are still uniquely determined, so the non-uniqueness of the decomposition affects neither the numerical predictions nor the interpretability of the results. 9.2. Ground temperature variability in the continental US For a second example, we use hourly measurements of the ground-level temperature at 47 stations located across the continental United States and one in Canada, publicly available from NOAA.1 We picked an array of stations that covers the US roughly uniformly and have data available since at least the year 2005 to the present. In this example, the variable x to explain is the temperature itself, measured in degrees Celsius, and we have chosen as covariates zl the following four quantities: The station itself, a categorical variable with values z1 ∈ [1, 2, …, 48]. The local time of the day z2 ∈ [0, 24], periodic. We adopted a grid with 24 elements, with one grid point per hour. The season, described as day of the year, z3 ∈ [0, 365.25], periodic, also with a grid of 24 points. Time in years z4 ∈ [2005, 2017.3], real, included to account for climate variability in a multi-year scale, with a grid of 75 points. The total number of observations m ≈ 5.7 ⋅ 106 was large enough to justify the use of stochastic gradient descent, which we performed by choosing at each step of the algorithm a random subset of 105 entries of x. We adopted d = 8 components and penalization parameters λ = 0.0001 for the station and λ = 0.005 for all other covariates. The reduction in variability can be quantified by the variance, which moved from 151.7 for x to 23.8 for the transformed y, corresponding to a reduction in standard deviation from 12.3 to 4.9 degrees Celsius. Figure 3 (upper left panel) shows the eight periodic components associated with z2, the time of the day. Similarly, there are eight components, not displayed, associated with each of the other three covariates. The set of all these components constitute the output of the algorithm, from which all predictions are made. Fig. 3. View largeDownload slide Upper left panel: mean temperature components as functions of the time of the day (see eq. (3.6)). Upper right panel: mean temperature as function of the time of the day, marginalized over time of the year, climate variability and station. Lower left panel: mean temperature as a function of day of the year, marginalized over station, climate variability and time of the day. Lower right panel: mean temperature as a function of the station marginalized over time of day, day of year and climate variability (darker colors corresponds to lower temperatures). Fig. 3. View largeDownload slide Upper left panel: mean temperature components as functions of the time of the day (see eq. (3.6)). Upper right panel: mean temperature as function of the time of the day, marginalized over time of the year, climate variability and station. Lower left panel: mean temperature as a function of day of the year, marginalized over station, climate variability and time of the day. Lower right panel: mean temperature as a function of the station marginalized over time of day, day of year and climate variability (darker colors corresponds to lower temperatures). Figure 3 (upper right panel) displays the easier to interpret marginalized prediction, where the dependence of $$\bar{x}$$ on the time of the day is marginalized over time of year, station and multi-year variability. Similarly, the lower left panel of the same figure displays the marginalized dependence of $$\bar{x}$$ on the time of the year, with cold winters and hot summers, and the lower right panel displays through color the marginalized dependence of $$\bar{x}$$ on the station (this dependence is marked mostly by latitude, but also by other station characteristics such as elevation and distance from the sea). Figure 4 (left panel) shows the data (in blue) and predicted mean (in red) individualized for a particular station: Manhattan, Kansas, at four levels of detail: global, yearly, monthly and weekly. We see how the prediction ‘explains’ the cycle of seasons and hours of the day, but not the less regular weather systems, of typically around five days, that flow through the region. The right panel displays similar results for Barrow, at the northern tip of Alaska. An interesting observation here is that the estimated mean daily cycle captures two peaks. It is winter in Alaska, so the sun does not rise at all. What we are observing is a manifestation of the thermal tides (Chapman & Lindzen, 1970), with their two characteristic peaks, most often observed in the pressure, but manifested here through their temperature signal. Fig. 4. View largeDownload slide Upper 4 panels: data (in light gray -blue on the web version of the manuscript-) and predicted mean (in dark gray -red on the web version of the manuscript-) for Manhattan, Kansas, displayed over the length of one year, one month and a week. Unlike the diurnal and yearly cycle, the weather systems (most noticeable on the second row, left column panel) are not ‘explained’ by the procedure, as no covariate z representing them has been included. Lower 4 panels: data (in light gray -blue on the web version of the manuscript-) and predicted mean (in dark gray -red on the web version of the manuscript-) for Barrow, Alaska. Notice the two-peaked thermal tide in the predicted diurnal signal (4th row, right column panel). Fig. 4. View largeDownload slide Upper 4 panels: data (in light gray -blue on the web version of the manuscript-) and predicted mean (in dark gray -red on the web version of the manuscript-) for Manhattan, Kansas, displayed over the length of one year, one month and a week. Unlike the diurnal and yearly cycle, the weather systems (most noticeable on the second row, left column panel) are not ‘explained’ by the procedure, as no covariate z representing them has been included. Lower 4 panels: data (in light gray -blue on the web version of the manuscript-) and predicted mean (in dark gray -red on the web version of the manuscript-) for Barrow, Alaska. Notice the two-peaked thermal tide in the predicted diurnal signal (4th row, right column panel). Finally, Fig. 5 displays in blue the climate variability captured through the marginalized dependence of $$\bar{x}$$ on time in a multi-year scale. To illuminate its possible meaning, we have superimposed in red the El Niño Index, which measures the moving 3-month average temperature of the Indian Ocean. Noticeably, the two signals have a very similar amplitude of two to three degrees and time-scale of roughly 3–4 years. Clearly, this factor has captured a global phenomenon in its manifestation on the north American continent. Fig. 5. View largeDownload slide El Niño index (in dark gray -red on the web version of the manuscript-) and $$\bar{x}$$ (in light gray -blue on the web version of the manuscript-), marginalized over time of day, day of year and station. Fig. 5. View largeDownload slide El Niño index (in dark gray -red on the web version of the manuscript-) and $$\bar{x}$$ (in light gray -blue on the web version of the manuscript-), marginalized over time of day, day of year and station. Fig. 6. View largeDownload slide Standard deviation as a function of the time of the year, marginalized over station, time of day and climate variability. Early winters have about twice the variability of the late summer–early falls. Fig. 6. View largeDownload slide Standard deviation as a function of the time of the year, marginalized over station, time of day and climate variability. Early winters have about twice the variability of the late summer–early falls. As described in Section 8, we can iterate the procedure once to capture the variance as function of the same factors. Figure 6 displays the marginalized dependence of the standard deviation σ on the time of the year, with maximal variability in the winter and smallest in the late summer–early fall. After normalizing the y through the transformation $$y^{i} \rightarrow \frac{y^{i}}{\sigma (z^{i})}$$, using their standard deviation σ(z) so estimated, we can also estimate the time-lagged correlation between two stations in the following way: we input as a new variable x the product x = y1y2, where y1, 2 are the normalized observations at stations 1 and 2, respectively. As explanatory factors, we take the time of day and of year at station 1, and the time lag Δt between the two observations. Figure 7 shows the results of applying this procedure to Ithaca, NY and Des Moines, IA, through the dependence of the correlation on the time-lag, marginalized over time of day, but drawn specifically for one summer and one winter day. In both cases, we see the correlation peaking at a lag of two days, consistently with the time it takes for an average weather system to travel from the midwest to the eastern US. This suggests improving the prediction at Ithaca by explaining its already normalized signal y using as an additional explanatory factor the normalized y at Des Moines two days before, and possibly also the normalized temperatures at other stations in the past. Figure 8 shows the result of explaining the temperature in Ithaca using those in Des Moines (IA), Crossville (TN), Egbert (ON), Monahans (TX), Selma (AL) and Ithaca itself two days before, and that in Boulder (CO) four days before. We can see that the weather systems, not captured at all through the previously considered cofactors, are now accounted for to a significant degree. Fig. 7. View largeDownload slide Time-lagged correlation between Ithaca, NY and Des Moines, IA as a function of the time-lag, marginalized over time of day and climate variability, but drawn specifically for one summer and one winter day. Both peak at a lag of two days, but the correlation is significantly bigger in the winter. Fig. 7. View largeDownload slide Time-lagged correlation between Ithaca, NY and Des Moines, IA as a function of the time-lag, marginalized over time of day and climate variability, but drawn specifically for one summer and one winter day. Both peak at a lag of two days, but the correlation is significantly bigger in the winter. Fig. 8. View largeDownload slide Observed (in light gray -blue on the web version of the manuscript-) and predicted (in dark gray -red on the web version of the manuscript-) temperature at Ithaca, NY over 45 days. Left panel: prediction using time of day, day of year and climate variability. Right panel: prediction using as additional cofactors the temperature in various stations 2 and 4 days before, thus capturing much of the weather systems that dominate the weather at this particular location and time. Fig. 8. View largeDownload slide Observed (in light gray -blue on the web version of the manuscript-) and predicted (in dark gray -red on the web version of the manuscript-) temperature at Ithaca, NY over 45 days. Left panel: prediction using time of day, day of year and climate variability. Right panel: prediction using as additional cofactors the temperature in various stations 2 and 4 days before, thus capturing much of the weather systems that dominate the weather at this particular location and time. Clearly, many more things can be done with these data along similar lines; our purpose here though is not to perform a systematic data-based study of the weather and climate or to develop a complete methodology for time-series analysis, but just to illustrate the workings of the attributable component methodology in a realistic setting. 9.3. Book preference prediction In this section we test the algorithm on a preference learning problem. We create a synthetic data set representing book ratings by readers, where each book is assigned an integer score between 1 and 5 by each person that reads it. Each reader and book are characterized by categorical and non-categorical cofactors: we know each reader’s age (a real cofactor) and location through its geographical longitude (a periodic cofactor), and each book’s type (a categorical cofactor with 10 possible values) and number of pages (a real cofactor). In addition, each reader and book have idiosyncratic traits, i.e. individual characteristics not captured by the descriptive cofactors just mentioned. The goal is to learn each reader’s book preferences based on previously rated books, so as to be able to predict the rating of a book that has not been read yet by a particular reader. This problem belongs to the category of matrix completion problems (Candes & Recht, 2012): the matrix with entries Rij containing the rating that reader i assigns to book j is only partially filled, since not every reader has read every book in the data set. The task is to predict the expected value of the missing entries. To build the dataset, we first input the number of readers and books nusers and nitems. For each reader, the cofactors are defined as follows: z1: the reader, z1 ∈ {1, …, nusers}, z2: the age, drawn uniformly in the interval [10, 70], z3: the longitude, drawn from a binary Gaussian mixture with centers located on the American and Euro-Asian continents, z4: the book, z4 ∈ {1, …, nitems}, z5: the book type, an integer drawn uniformly from [1, …, 10], z6: the number of pages, drawn from a Gaussian distribution centered around 70 and truncated below at z6 = 10. In order to create data that do not have the same form as the components sought by the algorithm, we first decompose z3 into two: $$c_{3} = \cos (z_{3})$$ and $$s_{3} = \sin (z_{3})$$ and re-center and normalize all zi, c3 and s3 so that they lie in the interval [1/2, 3/2]. Then we define a set of functions gj of the zi as follows: $$g_{1}=\sin (\pi * z_{1} * z_{2} * c_{3})$$ $$g_{2}=\cos (\pi /(s_{3} * z_{4} * z_{5} * z_{6}))$$ g3 = 1/(z1 + z2 + c3 + s3 + z4 + z5 + z6) g4 = z1/z2 + z4/c3 + z5/z6. These functions gi are used to create a preliminary version of the data set through $$x=\sum _{i=1}^{4}a_{i}\frac{g_{i}-\mu _{i}}{\sigma _{i}}$$, where μi, σi are the mean and standard deviation of gi, and the coefficients ai are drawn from the normal distribution. Then we map x onto a uniform distribution in [1, 5] through sorting, and add to it normally distributed noise so as to displace in average each value of x by one unit. Finally, we truncate each value of x to obtained integer numbers from 1 to 5 in equal proportions. The result is a matrix $$R_{z_{1=}i}^{z_{4}=j}$$ with values between 1 to 5, depending on zi, i = 1, …, 6 in a coupled and nonlinear way. Filling the entries of R for which the pair (z1, z4) is absent from the data set is most challenging when R is very sparse, a setting where the methodology of this article is particularly useful. On the one hand, the availability of qualifying cofactors, such as a reader’s age or a book’s type, groups entries together, thus partially overcoming the sparsity of entries per reader and per book. On the other, even in the absence of such additional cofactors, one can still group books and readers through bi-clustering, again overcoming to a certain degree the sparsity of data for individual readers or books. In order to test this, we have created data sets with both small and large numbers of readers and books with an average of about two book ratings per reader. To measure the quality of the results, we divided the dataset into two parts, one used for training the model and one for testing it. We then computed the root mean squared deviation (RMSD) on both the training set (in-sample error) and in the testing set (out of sample error). The sample used for testing consists of 100 entries randomly selected from the dataset, after making sure that no rows or columns were left empty. Figure 10 shows predictions for a dataset of 500 readers and 100 books. The rating matrix R, with the sparsity pattern displayed in Fig. 9, contains 955 non-zero entries, or roughly 2 books per reader. To assess the effect of clustering, we run the algorithm by using as predictors only the row (z1) and column (z4) indices with and without clustering. The upper panels of Fig. 10 show in-sample and out of sample error without clustering, while to obtain the two lower panels of the figure the algorithm partitioned both readers and books into 20 possible classes. Notice that both the in-sample and out of sample RMSD are smaller in the run where clustering was used. Fig. 9. View largeDownload slide Sparsity structure of R. Fig. 9. View largeDownload slide Sparsity structure of R. Fig. 10. View largeDownload slide Synthetic example obtained with 100 books, 500 users and about 955 preferences. The out of sample error is computed on 100 entries randomly selected among the 955. Upper left panels: in-sample error for each value of the score (1 to 5) obtained without clustering and by considering only the idiosyncratic covariates z1 and z4; the overall root mean square deviation is reported on top of the figure. Upper right panel: out-of-sample error distribution computed on the 100 entries previously eliminated. Lower left panels: same as upper left, but using clustering. Lower right panel: out-of-sample error when clustering rows and columns into 20 classes each. Fig. 10. View largeDownload slide Synthetic example obtained with 100 books, 500 users and about 955 preferences. The out of sample error is computed on 100 entries randomly selected among the 955. Upper left panels: in-sample error for each value of the score (1 to 5) obtained without clustering and by considering only the idiosyncratic covariates z1 and z4; the overall root mean square deviation is reported on top of the figure. Upper right panel: out-of-sample error distribution computed on the 100 entries previously eliminated. Lower left panels: same as upper left, but using clustering. Lower right panel: out-of-sample error when clustering rows and columns into 20 classes each. Figure 11 shows results for a larger dataset with 50000 users, 10000 books and about 106000 preferences. The purpose of this experiment is to show how information beyond the idiosyncratic row and column index can improve the prediction, substantially reducing the out of sample error. This is a case in which clustering the users and books (into 20 classes each as before), besides improving the predictions, is also useful from a computational viewpoint, as the number of unknowns associated for instance to the readers decreases by a factor of 250, thus challenging the linear solvers to a much smaller degree. Stochastic gradient descent is not strictly required for the roughly m = 105 entries available, but we have applied it nonetheless, with m/5 random entries used per time-step, since for even larger datasets this would be the only practical option available. Fig. 11. View largeDownload slide Synthetic dataset with 10000 books, 50000 users and about 106000 preferences. The out-of-sample error is computed on 100 entries randomly selected among the 106000. First row: in-sample (left) and out-of-sample (right) errors when using only the reader and the book index. Second row: in and out-of-sample errors when using only the age, longitude, book’s type and number of pages. Third row: in and out of sample error when using all the available cofactors. In all three scenarios, the readers (z1) and books (z4) were clustered into 20 classes each. Fig. 11. View largeDownload slide Synthetic dataset with 10000 books, 50000 users and about 106000 preferences. The out-of-sample error is computed on 100 entries randomly selected among the 106000. First row: in-sample (left) and out-of-sample (right) errors when using only the reader and the book index. Second row: in and out-of-sample errors when using only the age, longitude, book’s type and number of pages. Third row: in and out of sample error when using all the available cofactors. In all three scenarios, the readers (z1) and books (z4) were clustered into 20 classes each. As one can see in the results, using as predictors only reader and book (via clustering) or only their qualifying cofactors (age, location, type, length) yields poorer results than combining the two into a maximally explanatory set. In order to assess the degree of accuracy of the predictions obtained including reader, book and all their qualifying cofactors, the RMSD obtained in this case (0.79) should be compared with the RMSD obtained in two extreme cases: All the entries to predict are assigned a value of 3, right in the middle of the rating scale. In this case, one can easily see that $$\textrm{RMSD}=\sqrt{2}=1.4142\ldots$$ All the entries to predict are assigned the value obtained from the model underlying the data, but without adding the normally distributed noise described in the last step of the construction of our dataset, which cannot possibly be predicted. In this case, we obtain RMSD = 0.4. Thus the root mean square error of the predictions, RMSD ≈ 0.8, is only twice the amount directly attributable to the noise. 10. Conclusions This article develops ‘attributable components’, a methodology for the non-parametric estimation of the conditional mean of a variable x in terms of a set of covariates {zl}. This is framed within a more general setting: the estimation of the full conditional probability $$\boldsymbol{\rho}(x|z)$$ through a family of maps y = $$\boldsymbol{Y}$$(x;z) that push forward the $$\boldsymbol{\rho}(x|z)$$ onto their Wasserstein barycenter $$\boldsymbol{\mu}(y)$$. Estimating the conditional mean results from restricting these maps to z-dependent rigid translations $$\boldsymbol{Y}=x-\boldsymbol{\beta}(z)$$. We prove that these act as pre-conditioners to the full conditional density estimation, in the sense that, if one performs the latter after removing the conditional mean, the composition of the resulting optimal maps and the previous rigid translations yields the solution to the original barycenter problem. Extending the methodology of this article to handle the full conditional density estimation problem is the subject of current work. On the other hand, the conditional second order structure of the data, including the conditional covariance matrix, can be found by a straightforward iteration of the procedure. The conditional mean $$\bar{x}(z)$$ is a function of the possibly many co-variates {zl}, which may in addition have different types, such as categorical, ordinal and real. The procedure represents this multivariable function as a sum of components, each of which is the product of single-variable functions. Each of these single-variable functions, in turn, is represented by the values that it adopts on a grid, for which smoothness is enforced through penalization. Every observation is assigned a weighted sum of these grid values, through interpolation when the zl is continuous and through straightforward assignment when it is discrete. When some or all values of zl are unknown, finding these weights becomes part of the problem, which then produces clustering, classification or continuous covariate assignments. The functions of each covariate zl are found through alternating descent of the objective function. Since this is quadratic in the functions of each zl, the descent step consists of the solution to a linear, Sylvester-like system. The methodology scales well for big datasets, as it adapts straightforwardly to stochastic descent and online learning. Despite its complex, non-parametric form, the predicted $$\bar{x}(z)$$ is easily rendered interpretable through marginalization over subsets of covariates. Among the many extensions that one could foresee, those currently being pursued include, in addition to the full conditional probability estimation mentioned before, specific extensions directed at the analysis of time series. As for any methodology for the analysis of data, a wealth of interesting problems and extensions arises from considering particular applications. Not to overstretch the size of this methodological article, we have included here only a handful of representative examples, though we believe that they are diverse enough to convey a feeling for the methodology’s versatility and broad applicability. Acknowledgements This work was partially supported by grants from the Office of Naval Research and from the NYU-AIG Partnership on Global Resilience. Footnotes 1  https://www1.ncdc.noaa.gov/pub/data/uscrn/products/hourly02/ References Agueh, M. & Carlier, G. ( 2011) Barycenter in the Wasserstein space. SIAM J. Math. Anal. , 43 , 094-- 924. Google Scholar CrossRef Search ADS   Álvarez-Esteban, P. C., del Barrio, E., Cuesta-Albertos, J. & Matrán, C. ( 2016) A fixed-point approach to barycenters in wasserstein space. J. Math. Anal. Appl. , 441 , 744-- 762. Google Scholar CrossRef Search ADS   Brenier, Y. ( 1991) Polar factorization and monotone rearrangement of vector-valued functions. Commun. Pure Appl. Mathematics , 44 , 375-- 417. Google Scholar CrossRef Search ADS   Caffarelli, L. A. ( 2003) The Monge-Ampère equation and optimal transportation, an elementary review. Optimal Transportation and Applications. Springer International Publishing AG, pp. 1–10. Candes, E. & Recht, B. ( 2012) Exact matrix completion via convex optimization. Commun. ACM , 55, 111-- 119. Google Scholar CrossRef Search ADS   Chapman, S. & Lindzen, R. ( 1970) Atmospheric Tides . D. Reidel, Norwell, Mass, p. 200. Friedman, J., Hastie, T. & Tibshirani, R. ( 2001) The Elements of Statistical Learning,  vol. 1. Springer series in statistics Springer, Berlin. Google Scholar CrossRef Search ADS   Friedman, J. H. ( 1991) Multivariate adaptive regression splines. The Annals of Statistics , pp. 1-- 67. Golub, G. H. & Van Loan, C. F. ( 2012) Matrix Computations , vol. 3. Baltimore, MD: JHU Press. Grasedyck, L., Kressner, D. & Tobler, C. ( 2013) A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen , 36, 53-- 78. Google Scholar CrossRef Search ADS   Hartley, R. & Schaffalitzky, F. ( 2003) Powerfactorization: 3d reconstruction with missing or uncertain data. Australia-Japan Advanced Workshop on Computer Vision , vol. 74, pp. 76-- 85. Hastie, T. J. & Tibshirani, R. J. ( 1990) Generalized Additive Models , vol. 43. Boca Rator, FL: CRC press. Kuang, M. & Tabak, E. G. ( 2017) Sample-based optimal transport and barycenter problems. Commun. Pure Appl. Mathematics (in press). Mohri, M., Rostamizadeh, A. & Talwalkar, A. ( 2012) Foundations of Machine Learning . Cambridge, MA: MIT press. Tabak, E. G. & Trigila, G. ( 2017) Explanation of variability and removal of confounding factors from data through optimal transport. Commun. Pure Appl. Mathematics  (in press). Google Scholar CrossRef Search ADS   © The Author(s) 2018. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) For permissions, please e-mail: journals. permissions@oup.com

### Journal

Information and Inference: A Journal of the IMAOxford University Press

Published: Mar 15, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly ### Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. ### Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve ### Freelancer DeepDyve ### Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations