Separable spatio‐temporal kriging for fast virtual sensing

INTRODUCTION

Environmental monitoring systems rely on virtual sensing logic to predict relevant variables of their target environment. While information on the whole system is of interest, it is typically based on sensor readings, which are limited in both space and time, so the missing values must be surrogated by some suitable statistical method.1 Variables of interest may include, for instance, room temperature,2 energy consumption,3 and air quality.4 We consider the case of enclosed environments,2,4,5 as contrasted to other applications aimed at larger environments like ecosystems.6,7 Moreover, our focus is on applications that need real-time control.8

The motivating example for this work comes from a virtual sensing project at Silicon Austria Labs GmbH, a European research center for electronic-based systems.9 The data relate to an office room in Villach, Austria, that was monitored for 19 weeks between October 2019 and March 2020. The temperature (in °C) is reported by 12 sensors every 10 s, along with other physical measurements like pressure (in Pa) and light (in lx). The room is 127 m² large and is structured as reported in Figure 1. The picture also shows the locations of sensors and windows, along with the cardinal points.

FIGURE 1 Room with windows, cardinal points, and sensor locations, modified from the original map9

The sensors are all Raspberry Pi Zero boards. Their measurements are broadcast over a wireless network to a database server, which is a Raspberry Pi 3 instead. Raspberry Pis are popular and affordable single-board computers that are widely used in home automation and smart systems.10 This example has some key aspects: the data are referenced both temporally and spatially, and the measurements are high-frequency, resulting in multivariate time series with as many as $10^6$ observations for each sensor. The server has a limited yet non-negligible computational power, which is also important, as it allows data to be processed locally if this task is planned carefully, considering the limitations of the monitoring system.

As common in modern data analysis, there are at least two main and opposite approaches to dealing with sensor data, represented by interpretable models and black-box algorithms, respectively. The former include specifications based on actual knowledge about physical aspects of the system,8 which are often hard to formulate; the latter include neural networks and other machine learning techniques that achieve remarkable performance levels and are readily available in general software. Other authors addressed the same dataset as our present analysis,9 but they used techniques such as XGBoost regression11 and LSTM recurrent neural networks.12 These methods do not provide spatial interpolation by design; they can do so only after some suitable engineering, so generalization issues emerge.

Here we advocate for an approach lying between the two extremes, which is statistically sound without compromising predictive performance. We are interested in simple models and distributed computing, thus in scalable methods that leverage the proximity principle.
It is then natural to resort to distance-based prediction methods,13 which include, for instance, inverse distance weighted regression (IDWR), $k$-nearest-neighbors ($k$-NN),14 and spatio-temporal kriging.15 These methods are somewhat related to pure spatial data analysis.16 The focus of the present article is on the kriging method. This approach relies on a correlation model, which depends on distances between measurements in time and space. A crucial assumption for distributed calculus is spatio-temporal separability, which implies two separate models for spatial and temporal correlations.7 This assumption is hardly suitable for large environments, where some locations can anticipate events that will occur somewhere else. In smaller environments, separability provides instead a useful approximation that may perform similarly to more complicated, nonseparable models.17

While involving a simple model, kriging is cursed by the enormous cost of computation, mostly due to the inversion of large correlation matrices. Some approximations have been devised to make kriging tractable, for instance, covariance tapering.18 A composite likelihood (CL) approach can be used to estimate a separable model, which allows the estimation of the spatial model to be separated from the estimation of the temporal model. Also, some models in the time domain can spare the cumbersome matrix inversions and thus simplify both estimation and prediction. For instance, autoregressive (AR) models can be estimated just by minimizing the conditional sum of squares, and they come with compact forecasting rules.19 As to spatio-temporal predictions, we show that, under separability, these can be seen as a spatial interpolation of temporal forecasts, which allows one to leverage the specific advantages of time series and spatial models. These possibilities seem somewhat overlooked in the literature, in spite of the attention received by separability itself.

All in all, the main contributions of this article can be summarized as follows. First, a CL-based fitting procedure is proposed that is both memory- and compute-efficient. An affordable implementation of the maximum likelihood (ML) estimator is also presented, which is slower to compute but serves as the natural benchmark for our proposal. Second, this article provides significant simplifications of computational order in both estimation and prediction related to Gaussian processes (GPs) with separable covariance structures, making it possible to carry out scalable sensor data analysis. As a result, it expands the range of models that can be addressed in spatio-temporal kriging, as we point out that the allegedly required covariance matrices need not be evaluated directly. Thus, even rich seasonal models can be addressed without directly evaluating the autocorrelation function.

The plan of this article is as follows. In Section 2, we briefly review the literature as it concerns GPs and their applications to large datasets. In Section 3, we recall the basic kriging formulation, while emphasizing some correlation structures of practical importance. In Section 4, we detail our inferential and predictive framework, focusing on distributed computing. Section 5 illustrates the application of the proposed methodology to the motivating example, whereas Section 6 is devoted to some possible twists and extensions.
Finally, Section 7 presents some concluding remarks.

LITERATURE REVIEW

We deal with the tasks of surrogate modeling, estimation, and prediction.1 Consider a function or a process defined over a domain. This can represent the temperature at given times and locations,9 groundwater levels over a region,20 material properties of different portions of a solid,21 wind speed at turbine locations,22 turbulence around airships,23 or robots' reliability over time,24 to name a few examples. The process is feasibly observed only at selected points in the domain. For instance, room temperature may be known only at sensor locations and at discrete times. In many of these examples, inputs and outputs of some production systems can be thought of, respectively, as domain points and process outcomes. The unavailable information, that is, the process at unobserved locations and times, has thus to be surrogated. Optimization tasks may involve some interpolation steps,23,25 for which a proximity principle is implicitly trusted. From a statistical viewpoint, a stochastic process is assumed to rule the target of knowledge, and suitable regularity assumptions are made: these reflect the heuristic that similarity comes with proximity and independence comes with distance. For instance, atmospheric events and air conditions often exhibit such regularity.26 From an operational standpoint, a distance measure must be defined, which must be meaningful within the specific geometry of the domain in focus.27 For example, a domain can even be made up of functions: even in such a case, a meaningful distance measure has been proposed recently.28,29

Gaussianity, coupled with a suitable correlation structure, is a common assumption in statistical modeling that is central to kriging, as it translates straightforwardly into first- and second-order conditions.30,31 Moreover, GPs arise naturally in the case of spatially and temporally referenced data. The term kriging in broad generality refers to the task of making predictions based on conditional normal distributions, where the covariance structure depends on a suitably defined distance measure.32 By extension, the model underlying kriging has been called the kriging model.33 Kriging has become an umbrella name covering many variants, like ordinary, simple, and universal kriging, which differ in the specification of the mean. Moreover, co-kriging involves regression on covariates that are known beforehand; on the contrary, semantic kriging involves covariates that are not known in advance and thus need to be predicted in turn.34 In machine learning, kriging often uses rich and nonlinear trend formulations;35 see, for instance, the polynomial chaos approach.36 Intrinsic kriging revolves around the process of differences, which is approximately de-trended if the mean varies slowly with distance.37 The main point of all these variants of kriging is the way of formulating or estimating the mean of the process.

The estimation of kriging models often involves two subsequent steps: the first one is devoted to estimating the mean, which may depend on covariates, and the second one is devoted to estimating the correlation structure parameters, based on the residual or de-trended process that results from the first step. This procedure is known as residual kriging.38 When the mean model is complicated, Kaufman et al.39 advocate for separating the estimation of trend and covariance structures.
The residual process is indeed hard to model in terms of correlations.40 Some kriging applications may actually fail due to little spatial regularity being left in the process after accounting for covariates.41 Conversely, missing covariates can be thought of as a source of residual correlation, which leaves room for spatio-temporal modeling.42 Similar issues may arise when attempting to estimate both spatial and temporal correlations. One may thus generalize Kaufman et al.'s approach to separate the estimation of dependence for each domain, be it space, time, or covariates, thus borrowing predictive power from all the domains.43

The scalability of GPs is a concern in the case of large datasets. Accessible reviews are available on this topic, focusing on big data applications.44 All scalable GPs generally involve approximating the information that could be extracted from a full dataset, provided a sufficient computational power. Data subsetting, partitioning,45 or subsampling46 involve retaining only a fraction of the training data points: moving kriging,47 as a local estimation method, belongs to this category. Related but distinct are local approximations such as Markov assumptions,48 which are based on selecting a smaller training dataset that is relevant for a specific prediction task. Covariance tapering18 involves bounded-support kernels49 that induce sparsity in large correlation matrices, which are then more manageable both memory- and compute-wise. Other algebraic approximations allow for parsimonious, low-rank representations of large matrices: for instance, separability assumptions involve writing large matrices as a Kronecker product of multiple smaller matrices, which are easier to invert.7

Computational concerns make it difficult to deploy an otherwise interesting, functional variant of kriging called GP regression (GPR).50 However, even plain GPs can be troubled within the same setting. The aforementioned approximations can be useful, but they require tuning of some sort. For instance, in Markov modeling, the extent of the Markov neighborhoods is a tuning aspect of the problem.

The focus of this article is on separability assumptions, which may fit into the category of parsimonious, low-rank approximations for spatio-temporal correlation matrices.17 Separability assumptions arise naturally in multivariate normal modeling. In the literature, one may find this model stated in terms of a normally distributed random vector with covariance matrices in Kronecker form, which come in handy even in problems with fewer data than the one we present here.42 An iterative procedure attributed to Arendt et al.51 allows the correlation structures in the time and space domains to be estimated. This technique consists in estimating the spatial correlation of data that have been temporally decorrelated, then estimating the temporal correlation of data that have been (conversely) spatially decorrelated; the two steps are iterated until convergence. What we highlight in our contribution is that the practitioner may find it useful to adopt some suitable time series models19 for which the matrix transformations, which may be cumbersome, need not be evaluated explicitly. The advantages of separability have not yet been exploited in full.
A likelihood function for separable kriging models in many domains has already been considered and maximized with iterative procedures, though without leveraging convenient correlation models.52

Separable models have already been addressed using CLs, which are nongenuine likelihoods based on working independence assumptions.53 Pairwise CLs help avoid matrix inversion.54 Basically, a CL allows inference even on under-specified models, based on trusted assumptions, thus making the analysis more robust55 to model misspecification. Furthermore, under a separable spatio-temporal model, temporal and spatial parameters can be estimated separately for added robustness with respect to each other's misspecification.56 The composite estimator provides stronger consistency guarantees in this sense.57 The potential of composite estimation has also been investigated in terms of the other kinds of robustness that are common in statistics, with respect to outliers and data contamination.58 Given these interesting results, the present article aims at presenting a composite estimation approach to the reader, as it promises to make kriging applications more computationally sustainable and, hopefully, robust.

MODEL SPECIFICATION

We deal with the case of data that are both spatially and temporally referenced, so we use a space index $s$ and a time index $t$. The space index $s$ takes its value in $\mathcal{S}=\{1,\dots,S\}$ and is a pointer to one out of $S$ locations in space, while the time index $t$ takes its value in $\mathcal{T}=\{1,\dots,T\}$ and is a pointer to one out of $T$ time frames. A joint index $ts$ denotes location $s$ at time $t$. Since the sampling rate is constant, we consider a discrete-time system with equispaced time frames. In the long run, it holds that $T \gg S$.

Let $d_{s,s'}$ be a spatial distance, defined for all pairs of locations $s, s' \in \mathcal{S}$, thus endowed with non-negativity, symmetry, and the triangle inequality. Distance is ordinarily evaluated along straight lines in the absence of physical obstacles; otherwise, the length of the shortest path is considered. We choose the Euclidean distance for this purpose, namely,

(1) $d_{s,s'} = \sqrt{(x_s - x_{s'})^2 + (y_s - y_{s'})^2}$,

where $(x_s, y_s)$ and $(x_{s'}, y_{s'})$ are the two-dimensional Cartesian coordinates of $s$ and $s'$, respectively. Temporal distance is defined analogously59 for all pairs $t, t' \in \mathcal{T}$ as

(2) $d_{t,t'} = |t - t'|$,

with $|\cdot|$ denoting the absolute value. Here, $d_{s,s'}$ is the generic $(s,s')$th entry of the $S \times S$ spatial distance matrix $d_S$, and $d_{t,t'}$ is the generic $(t,t')$th entry of the $T \times T$ temporal distance matrix $d_T$.
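A minimal R sketch of these two distance matrices follows; the coordinates and the number of time frames are illustrative stand-ins, not values from the SAL dataset.

```r
## Distance matrices of Equations (1) and (2).
coords <- cbind(x = c(0, 4, 8), y = c(0, 3, 6))  # hypothetical S = 3 sensor coordinates (m)
d_S <- as.matrix(dist(coords))                   # S x S Euclidean distances, Equation (1)

T_len <- 5                                       # toy number of time frames
d_T <- abs(outer(1:T_len, 1:T_len, "-"))         # T x T temporal distances, Equation (2)
```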
The observed data $y$ are structured as follows:

(3) $y = \begin{bmatrix} y_{11} & \dots & y_{1S} \\ \vdots & y_{ts} & \vdots \\ y_{T1} & \dots & y_{TS} \end{bmatrix}$,

so the data related to location $s$ are all stored in the same $s$th column, while those related to time frame $t$ are all stored in the same $t$th row. As new data are observed, they are appended to $y$ as a new row. The data $y$ are modeled by the random matrix $Y$ and the mean matrix $\mu$, both with the same number of rows and columns as $y$.

Let $\mathrm{vec}$ be a unary operator defined for matrices that stacks their columns into a single vector.60 $Y$ is assumed to be multivariate normal with scale parameter $\sigma > 0$ and correlation matrix $R$, in the sense that $R$ is the correlation matrix of $\mathrm{vec}(Y)$. More formally, we assume that $Y$ has the following density function:

(4) $f(y; \mu, \sigma, R) = \frac{1}{\sqrt{|2\pi\sigma^2 R|}} \cdot \exp\left\{ -\frac{1}{2}\, \mathrm{vec}\!\left(\frac{y-\mu}{\sigma}\right)^{\!\top} R^{-1}\, \mathrm{vec}\!\left(\frac{y-\mu}{\sigma}\right) \right\}$,

where $|\cdot|$ is the determinant of a square matrix. In kriging-related applications, $R$ is a positive definite square matrix that depends on spatial and temporal distances through spatial and temporal autocorrelation functions, discussed further in this section. The definition in use throughout this article will be given later in Equation (11).

We assume that the expected value of $Y$ at time $t$ and location $s$, denoted by $\mu_{ts}$, is a smooth function of $t$ that is shared across locations, so it makes sense to estimate it with the asymmetric moving average $m_t$ defined below, which pools data from all the observed locations, limited to the $w$ time frames preceding $t$. The trend $\mu_{ts}$ is thus estimated via $\hat{\mu}_{ts}$, defined as

(5) $\hat{\mu}_{ts} = m_t$, for all $s$, with $m_t = \frac{1}{S \cdot w} \sum_{s=1}^{S} \sum_{i=1}^{w} Y_{(t-i)s}$.

Here, $Y_{(t-i)s}$ is the process at time $t-i$ and location $s$. We set $w$ equal to the number of observations per sensor in 24 h, to remove the observed daily seasonality. The latest estimate available for $\mu_{ts}$ can also serve as an estimate for the future trend $\mu_{t's}$, $t' > t$, assuming stability in the short term. Such an assumption is credible when the univariate time series agree on a single latent factor ruling all of them.
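The following R sketch implements the pooled asymmetric moving average of Equation (5); `y` stands for the $T \times S$ data matrix and is assumed to be available.

```r
## Asymmetric moving average of Equation (5): m_t pools all S series
## over the w frames preceding t; the first w entries remain NA.
trend_ma <- function(y, w) {
  T_len <- nrow(y)
  pooled <- rowMeans(y)                       # (1/S) * sum over locations
  m <- rep(NA_real_, T_len)
  for (t in (w + 1):T_len) {
    m[t] <- mean(pooled[(t - w):(t - 1)])     # (1/w) * sum over preceding frames
  }
  m
}

## mu_hat replicates m_t across the S columns, as in Equation (5):
## mu_hat <- matrix(trend_ma(y, w), nrow = nrow(y), ncol = ncol(y))
```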
The parameter $\sigma > 0$ contributes only to making predictions probabilistically calibrated,61 because it is just a scale parameter, like the error standard deviation in classical linear regression, and is thus involved in prediction variances but not in mean predictions; see Appendix A.

We resort to the classical kriging approach, which belongs to frequentist statistics, but this methodology also has a Bayesian counterpart involving a prior distribution on the parameters.62 As per the kriging approach, we assume that correlations between components of $Y$ are stationary and thus depend on their distances in space and time. The covariance between any two components of $Y$, say, $Y_{ts}$ and $Y_{t's'}$, is modeled as

(6) $\mathrm{cov}(Y_{ts}, Y_{t's'}) = \sigma^2 \cdot \mathrm{cor}_S(d_{s,s'}) \cdot \mathrm{cor}_T(d_{t,t'})$,

where $\mathrm{cor}_S(\cdot)$ is the spatial autocorrelation function (ACF)63 and $\mathrm{cor}_T(\cdot)$ is the temporal ACF;19 a few example definitions of $\mathrm{cor}_S(\cdot)$ and $\mathrm{cor}_T(\cdot)$ are provided in the next paragraph. The product between spatial and temporal correlations is implied by the separability assumption. ACFs depending on distances and not directions are implied by an isotropy assumption. Both separability and isotropy can simplify modeling and computing.1,2,6,7,64

For the sake of illustration, we recall a few possible definitions for the spatial ACF $\mathrm{cor}_S(\cdot)$ and the temporal ACF $\mathrm{cor}_T(\cdot)$. The spatial ACF can be, for instance, one of the following:13,16,59

Matérn ACF65

(7) $\mathrm{cor}_S(d) = \frac{2^{1-\alpha}}{\Gamma(\alpha)} (d/\lambda)^\alpha K_\alpha(d/\lambda)$,

with $\Gamma(\cdot)$ the gamma function and $K_\alpha(\cdot)$ the modified Bessel function of the second kind; $\alpha > 0$ is a smoothing parameter, $\lambda > 0$ is a range parameter.

Power exponential ACF18

(8) $\mathrm{cor}_S(d) = \exp\{-(d/\lambda)^\beta\}$.

Here, $\beta > 0$ is a smoothing parameter, $\lambda > 0$ is a range parameter.
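The two spatial ACFs can be coded directly in R; this sketch also illustrates the exponential sub-case discussed next, in which the two families coincide. Parameter values are illustrative only.

```r
## Spatial ACFs of Equations (7) and (8).
cor_matern <- function(d, lambda, alpha) {
  out <- 2^(1 - alpha) / gamma(alpha) * (d / lambda)^alpha *
    besselK(d / lambda, nu = alpha)
  out[d == 0] <- 1                 # limit at zero distance (0 * Inf otherwise)
  out
}
cor_powexp <- function(d, lambda, beta) exp(-(d / lambda)^beta)

## Matern with alpha = 1/2 and power exponential with beta = 1
## both reduce to the exponential ACF: the two curves overlap.
curve(cor_matern(x, lambda = 25, alpha = 0.5), from = 0, to = 50, ylab = "cor")
curve(cor_powexp(x, lambda = 25, beta = 1), add = TRUE, lty = 2)
```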
Both the Matérn and power exponential ACFs include two notable sub-cases. When $\alpha = 1/2$ and $\beta = 1$, the exponential ACF is implied.6 When $\alpha \to \infty$ and $\beta = 2$, the Gaussian ACF is implied,1,66 which is also known as the squared-exponential ACF3,4,67 and is involved in the double-stable model.2

The temporal ACF $\mathrm{cor}_T(\cdot)$ in discrete time can be, for instance, one of the following:19

ACF of a stationary AR model of order 1

(9) $\mathrm{cor}_T(d) = \phi^{|d|}$,

where $\phi \in \left]-1, +1\right[$ is the correlation parameter.

ACF of an invertible moving-average model of order 1

(10) $\mathrm{cor}_T(d) = \begin{cases} 1, & d = 0, \\ \eta, & d = \pm 1, \\ 0, & \text{otherwise}, \end{cases}$

where $\eta \in \left]-1/2, +1/2\right[$ is the correlation parameter.

More complicated ACFs are possible when looking at more flexible time series models, such as the multiplicative seasonal AR models used in the empirical application presented later. Some approaches assume a weak or no correlation structure, like empirical kriging.15 These approaches are necessarily less scalable but may still work for suitably targeted tasks.

Let $\mathrm{cor}_S(\cdot)$ and $\mathrm{cor}_T(\cdot)$ be vectorized functions, that is, they transform matrices in an entry-wise fashion. Then, $R_S = \mathrm{cor}_S(d_S)$ will be a spatial correlation matrix and $R_T = \mathrm{cor}_T(d_T)$ a temporal correlation matrix. We call

(11) $R = R_S \otimes R_T$,

the spatio-temporal correlation matrix, this expression also denoting a Kronecker correlation structure due to the Kronecker product $\otimes$.

Both temporal and spatial ACFs can be modified to account for noisy data by including a so-called nugget effect.1,65 This means that the spatial ACF $\mathrm{cor}_S(d)$ and the temporal ACF $\mathrm{cor}_T(d)$ are multiplied by $\beta_S$ and $\beta_T$ when $d > 0$, respectively, the parameters $\beta_S$ and $\beta_T$ taking values in the interval $\left]0, 1\right]$.15 We refer to $1 - \beta_S$ and $1 - \beta_T$ as the nugget parameters, in the space and time domains, respectively. Some authors apply the nugget directly to the spatio-temporal covariance function, but this would break separability.64,68 The latter way of modeling is more natural in the case of additive covariance models.69
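Assembling the separable correlation matrix is then a matter of evaluating the two ACFs on the distance matrices and taking a Kronecker product. The sketch below reuses the hypothetical objects from the previous sketches; parameter values are again illustrative.

```r
## Nugget effect: multiply off-diagonal correlations by a factor in ]0, 1].
add_nugget <- function(R, b) {
  off <- row(R) != col(R)
  R[off] <- b * R[off]
  R
}

R_S <- add_nugget(cor_powexp(d_S, lambda = 25, beta = 1.3), b = 0.8)
R_T <- add_nugget(0.98^d_T, b = 0.95)   # AR(1) ACF of Equation (9)
R   <- kronecker(R_S, R_T)              # Equation (11), a (TS) x (TS) matrix
```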
INFERENTIAL ASPECTS

This section presents two strategies that allow estimation and prediction to be performed under a separable model at a low computational cost. In particular, we base estimation on a novel CL approach. Then, leveraging the peculiar expression of the kriging mean formula, we show how to compute predictions efficiently under separability.

Estimation

The distribution of $Y$ in Equation (4) is assumed to be indexed by $\mu, \sigma, R$. As per residual kriging, $\mu$ is replaced with its estimate $\hat{\mu}$ via Equation (5). The parameters left to be estimated are $\theta = (\sigma, \psi^\top)^\top$, with $\psi$ the correlation parameters ruling $R$. Moreover, $\psi$ can be partitioned as $\psi = (\psi_S^\top, \psi_T^\top)^\top$, where $\psi_S$ and $\psi_T$ are the spatial and temporal correlation parameters, respectively. In particular, $R_S$ depends only on $\psi_S$, while $R_T$ depends only on $\psi_T$. As a starting point, we consider the pseudo likelihood function $\mathcal{L}(\theta)$, defined as

(12) $\mathcal{L}(\theta) = \mathcal{L}(\theta; \tilde{y}) = f(\tilde{y}; 0, \sigma, R)$,

with $\tilde{y} = y - \hat{\mu}$ the de-trended data. The maximizer of $\mathcal{L}(\theta)$ with respect to $\theta$ is referred to as the ML estimate. Under the separability assumption in Equation (11), it is possible to evaluate the ML estimate efficiently, as illustrated in Appendix C. The ML approach is the natural benchmark for any other estimator, due to its asymptotic efficiency. Nevertheless, robustness concerns and computational issues may arise in practical data analysis, for which the following CL estimation approach can be better suited.

Let the sample correlation matrix be defined as the rank-1 matrix

(13) $\hat{M} = \frac{1}{\hat{\sigma}^2}\, \mathrm{vec}(y - \hat{\mu})\, \mathrm{vec}(y - \hat{\mu})^\top$,

where $\hat{\sigma}$ is a working estimate of $\sigma$, defined as

(14) $\hat{\sigma} = \sqrt{\frac{1}{TS} \sum_{s=1}^{S} \sum_{t=1}^{T} (y_{ts} - \hat{\mu}_{ts})^2}$.

The likelihood function is thus replaced with a so-called pseudo likelihood $\mathcal{L}^p(\psi)$, defined as follows, that allows inference on $\psi$ alone:65,70

(15) $\mathcal{L}^p(\psi) = |R|^{-1/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left(R^{-1} \hat{M}\right) \right\}$.

Here, $\mathrm{tr}(\cdot)$ is the trace operator.

Kriging is often cumbersome due to the inversion of $R$, and indeed the pseudo likelihood in Equation (15) is intractable with high-dimensional data. Separability reduces the dimensionality of the problem, as it holds that

(16) $R^{-1} = (R_S \otimes R_T)^{-1} = R_S^{-1} \otimes R_T^{-1}$,

so two smaller matrices must be inverted instead of a single, larger one. However, as $T \gg S$, inverting $R_T$ alone can also be difficult.
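Equation (16) is easy to verify numerically; the following R snippet checks the identity on small symmetric positive definite stand-ins for $R_S$ and $R_T$.

```r
## Numerical check of Equation (16): the inverse of a Kronecker product
## is the Kronecker product of the inverses.
set.seed(1)
A <- crossprod(matrix(rnorm(9), 3))    # small SPD stand-in for R_S
B <- crossprod(matrix(rnorm(16), 4))   # small SPD stand-in for R_T
max(abs(solve(kronecker(A, B)) - kronecker(solve(A), solve(B))))
## zero up to floating point error
```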
Our proposal is two-fold. First, a marginal CL approach56 can be used, exploiting separability more in depth so that the tasks of estimating $\psi_S$ and $\psi_T$ can even be addressed separately. Second, a suitable time series model can help handle the temporal correlation implicitly, as $R_T$ need not be evaluated at all, making it possible to address tall data and high sampling rates.

CLs are known in spatial statistics mainly as tools that simplify estimation and inference, like the pairwise likelihood,53 and they can also be used in model selection.71 CLs allow inference on under-specified models but, even in the case of fully specified models, as in kriging, a suitable CL can reduce the computational cost of estimation. The estimator based on the full model can be computed in some cases, but it would generally be cumbersome with large datasets, without the convenient temporal models discussed here. We demonstrate the tractability of our estimator by carrying out inference based on the bootstrap. Cheaper computation comes at a price, since the estimator is naturally sub-optimal with respect to the ML estimator. Nonetheless, the loss of efficiency might not be significant when dealing with high-frequency data.

With CLs, as with any so-called pseudo likelihood, estimation variance cannot be assessed as with classical likelihood functions, based on the Hessian matrix. Parametric bootstrap can be used72 in the case of kriging because the model is fully specified, and one can simulate datasets based on it. It is straightforward to simulate datasets under separability and some temporal models. We detail a bootstrap strategy in Appendix B.

Within a single time frame $t \in \mathcal{T}$, it holds that $R = R_S$, as per Equation (11). Let the spatial pseudo CL be as follows,

(17) $\mathcal{L}_S(\psi_S) = \prod_{t \in \mathcal{T}} f(\tilde{y}_t; 0, \hat{\sigma}, R_S) = |R_S|^{-T/2} \exp\left\{ -\frac{T}{2} \mathrm{tr}\left(R_S^{-1} \hat{M}_S\right) \right\}$,

with $\tilde{y}_t$ the de-trended data vector at time frame $t$. This allows inference on the spatial correlation parameters $\psi_S$ alone. The expression is the same as that of a "small blocks" marginal CL.56 Here, $\hat{M}_S$ is the sample correlation matrix between the univariate time series at distinct locations, defined as

(18) $\hat{M}_S = \frac{1}{T \hat{\sigma}^2} (y - \hat{\mu})^\top (y - \hat{\mu})$.

The estimator $\hat{\psi}_S$ for the spatial correlation parameters $\psi_S$ is defined as the maximizer of (17). If $R_S$ is a smooth function of $\psi_S$, the estimator can be seen as the solution of a system of estimating equations, the equation for the generic parameter $\alpha \in \psi_S$ being defined as follows:

(19) $\alpha:\ \mathrm{tr}\left\{ \left(R_S - \hat{M}_S\right) \frac{\partial R_S^{-1}}{\partial \alpha} \right\} = 0$.

The above equation looks like a weighted average of the equations from the over-determined system $R_S - \hat{M}_S = 0$. It is worth noticing the close resemblance between the pseudo likelihoods in Equations (15) and (17), but the correlation matrices involved are strikingly different in size, as $R_S$ is much smaller than $R$ in practical applications.
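A minimal R sketch of the maximum spatial pseudo-CL estimator follows, for the power exponential ACF with a nugget; it reuses `cor_powexp` and `add_nugget` from the earlier sketches, and `y_til` stands for the $T \times S$ de-trended data matrix, assumed to be available.

```r
## Negative log of Equation (17), up to the constant factor T/2;
## parameters are transformed so that the search is unconstrained.
neg_log_clS <- function(par, M_S, d_S) {
  lambda <- exp(par[1]); beta <- exp(par[2]); b <- plogis(par[3])
  R_S <- add_nugget(cor_powexp(d_S, lambda, beta), b)
  as.numeric(determinant(R_S)$modulus) +   # log|R_S|
    sum(diag(solve(R_S, M_S)))             # tr(R_S^{-1} M_S)
}

## sigma2_hat <- mean(y_til^2)                            # Equation (14), squared
## M_S <- crossprod(y_til) / (nrow(y_til) * sigma2_hat)   # Equation (18)
## fit <- optim(c(log(10), 0, 2), neg_log_clS, M_S = M_S, d_S = d_S)
```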
The temporal correlation parameters $\psi_T$ can be estimated analogously to $\psi_S$, by defining a temporal pseudo CL $\mathcal{L}_T$ as

(20) $\mathcal{L}_T(\psi_T) = \prod_{s \in \mathcal{S}} f(\tilde{y}_s; 0, \hat{\sigma}, R_T) = |R_T|^{-S/2} \exp\left\{ -\frac{S}{2} \mathrm{tr}\left(R_T^{-1} \hat{M}_T\right) \right\}$,

with $\tilde{y}_s$ the de-trended univariate time series at location $s$. Here, $\hat{M}_T$ is the sample temporal correlation matrix, defined as

(21) $\hat{M}_T = \frac{1}{S \hat{\sigma}^2} (y - \hat{\mu}) (y - \hat{\mu})^\top$.

This operation completely disentangles the estimators of the spatial and temporal parameters from each other. As contrasted to $R_S$, $R_T$ will be high-dimensional, so the need for a convenient correlation structure in the time domain is even more pressing, to simplify the estimation of $\psi_T$, which would otherwise be based on directly maximizing (20). In particular, we resort to an AR model with multiple overlapping seasonal lags. For its definition, the classical lag operator $B$ can be introduced, such that

(22) $B Y_{ts} = Y_{(t-1)s}$,

and, more generally, $B^{\Delta} Y_{ts} = Y_{(t-\Delta)s}$, where $Y_{(t-\Delta)s}$ is the process $Y$ at time $t - \Delta$ and location $s$. Then, letting $\epsilon_s$ be a white noise time series process, the multiplicative seasonal AR model is defined as follows:19

(23) $\left[ \prod_{k=1}^{K} \left(1 - \phi_k B^{\Delta_k}\right) \right] \cdot (Y_s - \mu_s) = \epsilon_s$,

where $\Delta_1, \dots, \Delta_k, \dots, \Delta_K$ are the structural lags, and $\phi_1, \dots, \phi_k, \dots, \phi_K \in \left]-1, +1\right[$ are the model coefficients.

Equation (23) reduces the temporal correlation parameters to $\psi_T = (\phi_1, \dots, \phi_K)^\top$. AR modeling also makes it easy to approximately maximize the likelihood by minimizing the conditional sum of squares,19 which can be attained via the coordinate descent method.73 In this sense, we define

(24) $V_s^{-h} = \left[ \prod_{k \neq h} \left(1 - \phi_k B^{\Delta_k}\right) \right] \cdot (Y_s - \mu_s)$,

and, by Equation (23), it holds that

(25) $\left(1 - \phi_h B^{\Delta_h}\right) V_s^{-h} = \epsilon_s$,

so the transformed time series $V_s^{-h}$ satisfies

(26) $\phi_h = \mathrm{cor}\left(V_s^{-h}, B^{\Delta_h} V_s^{-h}\right)$.

This relation motivates an iterative procedure that loops over the correlation parameters and repeats until convergence. For each $\phi_h$, one evaluates the current estimate of the process $V_s^{-h}$ and then updates $\phi_h$ as the sample ACF of $V_s^{-h}$ at lag $\Delta_h$.
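The following R sketch implements this coordinate descent for one de-trended univariate series `z`; a fixed number of sweeps stands in for a proper convergence check.

```r
## Coordinate descent for the multiplicative seasonal AR of Equation (23).
## z: de-trended series; lags: structural lags (Delta_1, ..., Delta_K).
fit_seasonal_ar <- function(z, lags, n_sweeps = 50) {
  K <- length(lags)
  phi <- rep(0, K)
  lag_by <- function(x, k) c(rep(NA, k), head(x, -k))   # B^k applied to x
  for (sweep in 1:n_sweeps) {
    for (h in 1:K) {
      v <- z
      for (k in setdiff(1:K, h))                        # Equation (24)
        v <- v - phi[k] * lag_by(v, lags[k])
      phi[h] <- cor(v, lag_by(v, lags[h]),              # Equation (26)
                    use = "complete.obs")
    }
  }
  phi
}
```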
An online learning approach may be considered as an alternative. For instance, estimates can be updated continually via batch learning14 or exponentially weighted moving means and covariances,74,75 which require additional tuning.

Prediction

Separability assumptions also simplify prediction, as we show in this section. To this end, we must expand our notation. Let $Y'$ be from unobserved locations or times, with mean matrix $\mu'$, while the previously introduced matrices $Y$ and $\mu$ are related to the available data $y$. The correlation matrix of $\mathrm{vec}(Y')$ is $R'$, whereas $R$ is the correlation matrix of $\mathrm{vec}(Y)$. The cross-correlation matrix between $\mathrm{vec}(Y')$ and $\mathrm{vec}(Y)$ is generally not square; it is denoted by $\rho$ and is based on a suitable joint distance matrix. Like $R$ in Equation (11), $R'$ and $\rho$ are also defined in terms of Kronecker products as, respectively,

(27) $R' = R_S' \otimes R_T'$, $\quad \rho = \rho_S \otimes \rho_T$,

where $R_S'$ and $R_T'$ are the spatial and temporal correlation matrices of $Y'$, analogous to $R_S$ and $R_T$, while $\rho_S$ and $\rho_T$ are the spatial and temporal cross-correlation matrices between $Y'$ and $Y$. These in turn depend on suitable spatial and temporal distance matrices. As typical in kriging, we treat $Y$ and $Y'$ as jointly normally distributed.

Kriging revolves around the conditional distribution of $Y'$ given $Y = y$,1 though it was originally motivated as the linear unbiased prediction that is optimal with respect to the squared prediction error criterion.76 Conditional on $Y = y$, the distribution of $Y'$ is multivariate normal. The conditional mean matrix of $Y'$, denoted by $\hat{y}'$, and the conditional variance-covariance matrix of $\mathrm{vec}(Y')$, denoted by $R'_{\mathrm{cond}}$, satisfy the following conditions:

(28) $\mathrm{vec}(\hat{y}') = \mathrm{vec}(\mu') + \rho R^{-1} \mathrm{vec}(y - \mu)$, $\quad R'_{\mathrm{cond}} = R' - \rho R^{-1} \rho^\top$.

Let $\beta_S$ and $\beta_T$ be spatial and temporal regression coefficients, defined as

(29) $\beta_S = \rho_S R_S^{-1}$, $\quad \beta_T = \rho_T R_T^{-1}$,

then the kriging mean formula in Equation (28) can be written as

(30) $\hat{y}' - \mu' = \hat{z} \beta_S^\top$,

where $\hat{z}$ is a matrix of centered temporal forecasts only, defined as

(31) $\hat{z} = \beta_T (y - \mu)$.

We prove this fact in Appendix A.1. The proof also covers existing applications of multivariate normal models.77

The above result allows for at least two interesting uses. First, our result allows for distributed calculus in kriging. Indeed, spatio-temporal prediction under separability can be carried out in two steps. The first step, in Equation (31), is temporal forecasting only within sensor locations. The second step, in Equation (30), is the spatial interpolation of such forecasts for the needed locations. Then, spatio-temporal predictions can be seen as a spatial interpolation of temporal forecasts. So, a separability assumption allows domains to be separated also in prediction. Distributed calculus can be used for evaluating Equation (31), as each univariate time series is transformed separately by the matrix product.

This insight into the meaning of the matrix quantities cannot be found in other works,77 as they lack the interpretation of both $\beta_S$ and $\beta_T$. Moreover, an automatic application of their findings would be too compute-intensive with large datasets, because it would require evaluating $\beta_T$ explicitly.

Correlation models affect prediction only through $\beta_S$ and $\beta_T$.
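The two-step structure is easy to express in R. In the sketch below, the seasonal AR of Equation (23) makes the temporal step explicit, so $\beta_T$ is never formed; `phi` and `lags` come from the previous sketch, while `y_til`, `rho_S`, `R_S`, and `mu_new` are hypothetical stand-ins for the de-trended data, the spatial (cross-)correlation matrices, and the trend at the new locations.

```r
## Coefficients of the product polynomial prod_k (1 - phi_k B^Delta_k):
## convolve(a, rev(b), type = "open") multiplies two polynomials.
ar_poly <- function(phi, lags) {
  p <- 1
  for (k in seq_along(phi)) {
    q <- c(1, rep(0, lags[k] - 1), -phi[k])
    p <- convolve(p, rev(q), type = "open")
  }
  p
}

## One-step-ahead forecast implied by Equation (23): since p[1] = 1,
## E(z_{n+1} | past) = -sum_{j >= 1} p[j+1] * z_{n+1-j}.
## Assumes length(z) is at least the degree of the polynomial.
forecast_one <- function(z, p) {
  n <- length(z); J <- length(p) - 1
  -sum(p[-1] * z[n + 1 - (1:J)])
}

## z_hat  <- apply(y_til, 2, forecast_one, p = ar_poly(phi, lags))  # Equation (31)
## beta_S <- rho_S %*% solve(R_S)                                   # Equation (29)
## y_hat  <- mu_new + z_hat %*% t(beta_S)                           # Equation (30)
```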
So, any specification of time or space models can be employed, provided that it implies tractable prediction. This view motivates, for instance, the use of general interpolators, state-space models,78 or integrated AR time series models,19 which allow for simple prediction since $\beta_T$ in Equation (29) is sparse and explicit. The rather general autoregressive moving-average (ARMA) model has already been considered by some authors,79 though the MA component makes the model harder to estimate.

In applications, one may consider adding covariates to the analysis. Assume that there are $p$ variables indexed by $j = 1, \dots, p$, so that $Y_{jts}$ denotes the $j$th variable at spatio-temporal coordinates $ts$. In this case, the marginal means $\mu_{jts}$ will likely depend on $j$. The correlation structure can be simplified according to a fully factored model,7 such that

(32) $\mathrm{cov}(Y_{jts}, Y_{j't's'}) = \gamma_{j,j'} \cdot \mathrm{cor}_S(d_{s,s'}) \cdot \mathrm{cor}_T(d_{t,t'})$.

Here, the new quantity $\gamma_{j,j'}$ represents the cross-sectional covariance between the $j$th and $j'$th variables at the same time and location. Thus, under a fully factored model, our kriging mean formula scales easily, as an additional set of regression coefficients $\beta_C$, implied by the newly added domain of covariates, is defined besides $\beta_S$ and $\beta_T$.

We provided a simple expression for mean predictions, along with some optimization strategies. For the sake of completeness, we now illustrate how to compute prediction variances. In our view, the model can be used to design a prediction rule, resulting in mean prediction, while prediction variance is a performance measure, which can, for instance, be evaluated on a test set. This approach may be favored when the focus is especially on prediction.

Let $V$ be a matrix with the same size as $Y'$ containing the entrywise variances of $Y'$ conditional on $Y = y$. Let $\mathrm{diag}(M)$ be the operator that returns the diagonal entries of a square matrix $M$ as a column vector. Then, $\mathrm{vec}(V) = \sigma^2 \mathrm{diag}(R'_{\mathrm{cond}})$, so it holds that

(33) $V = \sigma^2 \left\{ \mathrm{diag}(R_T')\, \mathrm{diag}(R_S')^\top - \mathrm{diag}\left(R_T' - R'_{T,\mathrm{cond}}\right)\, \mathrm{diag}\left(R_S' - R'_{S,\mathrm{cond}}\right)^\top \right\}$,

where $R'_{S,\mathrm{cond}}$ and $R'_{T,\mathrm{cond}}$ are defined as follows:

(34) $R'_{S,\mathrm{cond}} = R_S' - \rho_S R_S^{-1} \rho_S^\top$, $\quad R'_{T,\mathrm{cond}} = R_T' - \rho_T R_T^{-1} \rho_T^\top$.

We prove this result in Appendix A.2. Equation (33) has an advantage over the direct evaluation of (28), as the expression for $R'_{T,\mathrm{cond}}$ is simple to retrieve under some time series models.
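In R, Equations (33) and (34) amount to two outer products of diagonals; a small sketch, with the (cross-)correlation matrices assumed available from the previous steps:

```r
## Conditional correlation blocks of Equation (34).
cond_block <- function(Rp, rho, R) Rp - rho %*% solve(R, t(rho))

## Entrywise prediction variances of Equation (33); %o% is the outer product.
pred_var <- function(sigma2, RT_p, RS_p, RT_cond, RS_cond) {
  sigma2 * (diag(RT_p) %o% diag(RS_p) -
            diag(RT_p - RT_cond) %o% diag(RS_p - RS_cond))
}
```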
For instance, with stationary AR processes, in one-step-ahead forecasting, $R'_{T,\mathrm{cond}}$ is the ratio between the variance of the innovation term and the marginal variance of the process.

EMPIRICAL APPLICATION

We now illustrate the proposed methodology on the SAL dataset presented in the introduction. The dataset is publicly available on GitHub, as submitted by the authors of the first paper addressing it,9 at https://github.com/dslab-uniud/virtual-sensing. We performed all our analyses in the statistical computing environment R.80 Both the ML and CL estimators were given a custom implementation that minimizes the conditional sum of squares in the estimation of the temporal model and maximizes a spatial pseudo likelihood in the estimation of the spatial model. All the analyses were carried out using typical laptop computers.

Explorative analysis

The SAL dataset consists of temperature sensor readings from an office room and has been briefly described in the introductory section. The goal is to develop a spatio-temporal prediction rule for temperature based on these data. Some candidate spatial models are estimated on a training set, and the best one is selected based on a test set. The full dataset comprises 19 weeks of data, and it is partitioned accordingly, with the leading 8 weeks of data for training and the trailing 11 weeks as the test set. This choice is challenging for our method, as it is more exposed to shifts in the regime, but a longer training phase might not be reasonable for some applications, where a monitoring system must be calibrated within a short amount of time.

Data were missing less than 1% of the time, resulting from communication faults unrelated to the data and thus statistically random. Kriging can handle data missing at random, but we used a simpler imputation method called last observation carried forward (LOCF), which uses the last valid reading to impute missing values.81 In fact, we require gridded data, and the LOCF approach solves the problem rapidly and efficiently, so we can focus on other aspects of the problem.
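LOCF is a one-liner per sensor; a base-R sketch, with `y_raw` a hypothetical $T \times S$ matrix containing NAs:

```r
## Last observation carried forward: NAs before the first valid reading remain.
locf <- function(x) {
  ok  <- !is.na(x)
  idx <- cumsum(ok)             # 0 until the first observation is seen
  c(NA, x[ok])[idx + 1]         # last valid value so far
}
y_filled <- apply(y_raw, 2, locf)   # column-wise, that is, per sensor
```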
Figure 2 presents the training set. The sensors are numbered from 1 to 12 as in Figure 1. Troughs are concentrated in the mornings, as windows are opened and cold air flows into the room during routine cleaning. Peaks are concentrated around noon, as direct sunlight overheats the sensors facing south, namely those numbered 5, 10, and 11. Weekly trends are highlighted in Figure 3, with the temperature median and other percentiles reported throughout the weekdays. Troughs seem to occur mostly on Mondays and Fridays, that is, on the first and last workdays of the week. Peaks instead concentrate on Fridays and weekends.

FIGURE 2 Univariate time series of temperature per sensor, training set (October through December). Peaks and troughs are thresholded and highlighted with dots.

FIGURE 3 Sample quantiles of temperature, aggregated by weekday, training set (October through December)

Similar conditions between subsequent days motivate using time series models with seasonal components, the period being one day long. Similar events occurring on the same weekday motivate considering one more seasonal component, whose period should be one week long.

One may consider adding covariates to the model, in particular the physical variables mentioned in the introduction. Some preliminary analyses show that their explanatory power is limited, since spatial and temporal dependencies seem able to explain most of the variability of the temperature data. As a consequence, they will not be considered further.

Model estimates

We compared our proposed CL estimator with the ML approach. The model chosen was separable, as discussed in the previous sections. We considered four candidate models in the space domain, namely, the exponential, Gaussian, power exponential, and Matérn ACFs. In the time domain, we considered a few candidate models, all multiplicative AR as in Equation (23), with a short-term lag $\Delta_1$ and two seasonal lags, $\Delta_2 = 1$ day and $\Delta_3 = 1$ week. We probed some alternative values for the short-term lag $\Delta_1$, ranging over 10, 20, 30, 40 min and 1, 2, 3, 4, 6, 12 h. The longer lags were of interest because they implied higher estimates of $\phi_2$ and $\phi_3$, so that one could leverage seasonal dynamics and spatial regularities. All the candidate spatio-temporal models were then compared on the test set in terms of their performance in the spatial interpolation of each sensor based on the other ones; see Figure 5 in the following paragraphs.

The estimation of all the spatio-temporal models via ML took no more than 10 min. The estimation of the spatial models via CL was essentially instantaneous by comparison, as the small size of the sample spatial correlation matrix made the computation fast. The estimation of the temporal models via CL took a few seconds, namely about one-fourth of the time required by ML. The four spatial ACFs estimated with either ML or CL appear to be very similar. In Figure 4, this similarity is illustrated based on just the CL estimate.

FIGURE 4 Spatial ACFs estimated via CL (lines) with sample correlations (dots). See Table 1 for the estimated parameters

Figure 5 summarizes the predictive performance of the estimated spatio-temporal models on the test set. The performance is assessed in terms of the spatial interpolation of each sensor, surrogated via all the others in turn. Three performance metrics are used, namely, the mean absolute prediction error (L1), the root mean square error (L2), and the 95th percentile of the absolute prediction error (P95). In comparing the ML and CL methods, these are seen to provide model estimates that turn out to be rather interchangeable for predictive tasks. Sensor 10 is uniformly poorly surrogated by the other sensors, which marks its outlying nature in the example. This sensor is better accommodated by the CL estimate, in line with its higher degree of robustness. CL otherwise seems a good approximation to ML in general.

FIGURE 5 Performance in the spatial interpolation of each sensor based on the other ones, for ML and CL (linetype) with different metrics (labels, $y$-axis) depending on the short-term lag $\Delta_1$ ($x$-axis). Multiple lines refer to distinct spatial models, essentially overlapped. Metrics in use are: mean absolute prediction error (L1), root mean square error (L2), and 95th percentile of the absolute prediction error (P95).
TABLE 1 Point estimates and standard errors for spatial and temporal correlation parameters, case $\Delta_1 = 10$ min

Spatial model        Parameter    Est.     Std. err. (×10³)
Exponential          ϕ1           0.977    0.206
                     ϕ2           0.078    0.950
                     ϕ3           0.047    0.990
                     Nugget       0.191    2.307
                     Range        24.692   446.329
Gaussian             ϕ1           0.977    0.203
                     ϕ2           0.078    0.936
                     ϕ3           0.047    0.975
                     Nugget       0.250    2.470
                     Range        15.011   140.078
Matérn               ϕ1           0.977    0.206
                     ϕ2           0.078    0.950
                     ϕ3           0.047    0.991
                     Nugget       0.187    4.805
                     Range        25.993   1547.047
                     Smoothness   0.479    22.544
Power exponential    ϕ1           0.977    0.205
                     ϕ2           0.078    0.946
                     ϕ3           0.047    0.986
                     Nugget       0.217    3.057
                     Range        19.883   424.822
                     Smoothness   1.312    28.654

Lastly, Table 1 summarizes the CL estimates and standard errors for the four spatio-temporal models we are going to consider in the next section, with $\Delta_1 = 10$ min. Point estimates of the temporal parameters are shared across the four spatio-temporal models. Their standard errors are based on parametric bootstrap, as outlined in Appendix B. The estimation variance does, instead, depend on the joint spatio-temporal model, so it differs across them. We carried out bootstrap inference by simulating and analyzing 1000 datasets for each joint spatio-temporal model. Performing the bootstrap took about 3 h on a single laptop computer with the CL approach. The standard errors, all multiplied by 10³ in the table, are all rather small, as expected given the large sample size.

Sensor network optimization

In the previous subsection, CL turned out to be capable of surrogating the efficiency of ML, with the help of a large training dataset. Now a practical concern is the selection of a few operational sensors from the initial set, as twelve of them are too many for a room that is 127 m² large. Under the proximity principle, some sensors could be dropped, and their locations could be just virtually sensed, since their data can be surrogated82 with predictions from the remaining sensors.

Different network configurations can be evaluated and compared according to a metric, which should reflect the priorities and objectives of the stakeholders. The 95th percentile of the absolute prediction errors on all but the active sensors9 can be used for an approximate minimax decision. For comparison, we illustrate a sensor selection based on this criterion alongside one that uses the more classical mean absolute prediction error. The performance of spatial models and sensor configurations is evaluated and compared on the test set. For each sensor configuration, we interpolate data from the selected locations to the unselected ones within time frames, as implied by separability. Prediction errors are then summarized according to the metrics. We perform selection in a forward fashion, starting with the best performer alone and then adding the sensor that leads to the best improvement at each step; see the sketch below. Adding sensors can worsen the performance because we are evaluating models on the test set. In Figure 6, a summary of the selection process is reported. The sensor added at each step appears within a box and is numbered as in Figure 1. An alternative prediction is given by the simple mean, which assumes that a single latent temperature rules the whole room. The selection took no more than 10 min in total, so it would be easy to perform it multiple times ad interim to check on the quality of predictions.

FIGURE 6 Forward selection of sensors, distinct per metric, prediction error versus number of selected sensors. The sensor added at each step appears within a box. Run on the test set (December through March)
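The forward selection itself is a short greedy loop; an R sketch, where `interp_error(sel)` is a hypothetical helper returning the chosen test-set metric when the sensors in `sel` surrogate all the others:

```r
## Greedy forward selection of sensors, as described above.
forward_select <- function(S, interp_error) {
  selected <- integer(0); remaining <- 1:S; trace <- numeric(0)
  while (length(remaining) > 0) {
    errs <- sapply(remaining, function(s) interp_error(c(selected, s)))
    best <- remaining[which.min(errs)]
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
    trace     <- c(trace, min(errs))   # metric after each addition
  }
  list(order = selected, error = trace)
}
```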
We show only the power exponential ACF. The result based on the Matérn function is very similar, while the exponential and Gaussian ones are slightly outperformed. The percentile performance seems in line with the $k$-NN and IDWR benchmarks.9 Based on the performances in Figure 6, the power exponential ACF may be preferred over the mean prediction, because this choice seems more robust with respect to the metric. Moreover, the mean prediction yields some narrowly spaced sensor configurations under both metrics.

Behavior of the monitoring system

Using different sensor configurations, we show a few examples of actual situations and the behavior of a possible monitoring system. The environment in focus has some peculiarities that somehow affect the performance of the system. As an illustration, we combine interpolation and forecasting by predicting temperatures 10 min ahead for the whole floor plan, as in Figure 7.

FIGURE 7 Examples of prediction via four sensors, based on the power exponential model. Forecasts for February 17, 12:00, based on data from 10 min before. Circles are colored based on observed temperatures, the floor based on predicted temperatures; the selected sensors only are numbered as in Figure 1. (A) Anomalous sensors, (B) regular sensors, (C) selection via 95th percentile absolute error, and (D) selection via mean absolute error

We anticipated that there would be differences between regular and anomalous sensors. Concerns relate to sensors facing direct sunlight around noon (numbered 5, 10) or close to other sources of anomalies (7, 12). If these were used to make predictions, the results would hardly reflect the normalcy ruling the interior of the room; see Figure 7A. On the contrary, a system based on the more regular sensors only, as in Figure 7B, would yield more constant predictions that are likely exchangeable with the mean prediction. The sensor selection illustrated in the previous section is aimed at selecting a solution between these two extremes.

The two metrics considered suggest similar solutions. Figure 7C reports the prediction provided under the power exponential ACF by the four best sensors according to the percentile metric. Figure 7D shows the selection based on the mean absolute error instead. Both configurations attempt to replicate the north-south gradient, which requires choosing between sensors 5 and 10. However, neither of these choices can surrogate sensors 11 and 3. Selecting sensors 5 and 10 would fail the whole interior of the room and, conversely, the sensors in the interior of the room cannot surrogate 5 and 10.

SOME EXTENSIONS

In the previous section, we presented an application of our compute-efficient approach to spatio-temporal kriging. Here we describe two possible extensions that are legitimate under our prediction-oriented view of kriging. For instance, it is possible to include spatial interpolators or temporal forecasting methods that do not necessarily underlie any stationary ACF. It is also possible to use distinct temporal models for each sensor location to provide more specialized forecasting. Both these extensions share at least the advantage of further distributing calculus across the sensor network and simplifying server-side computations.

Nonstationary modeling

The kriging approach relies on stationary ACF models, which offer a wide variety of possibilities, but some problems may be addressed only with nonstationary models. For instance, integrated AR models need no trend formulation.
Another alternative is $k$-NN, which returns the sample average of the response values from the $k$ sensors closest to the desired location. This solution could be useful in cases like ours, where distinct sensors might have different equilibria. We used a moving average to accommodate nonstationarity in the mean, which is compatible with a stationary ACF, but one can use integrated AR and $k$-NN as alternatives. With high-frequency data, long-term stationarity may coexist with short-term nonstationarity resulting from locally linear trends, so it might be useful to consider a nonstationary model that copes with both aspects.

We considered an integrated AR model in our analysis, but without obtaining any significant improvement. Indeed, the chosen AR model was already able to cope with nonstationarity, due to both a moving average trend and a near-unit first AR coefficient, which implied integration de facto.

Sensor-specific temporal correlation parameters

A limitation of separable ACFs is that they imply the same marginal dynamics for all locations. As an extension, the temporal correlation parameters can be made sensor-specific: each sensor can estimate and update a distinct temporal model that is valid at least for its location, and approximately also for a neighborhood.

Distinct temporal models can each be based on a different CL and use different portions of the data, so they will not affect each other. In mean prediction, as per Equation (31), $\hat{z}$ can be replaced with a matrix where each column is made up of the temporal forecasts based on a model with limited scope, which works for just one location or a neighborhood. When interpolating these forecasts spatially, via Equation (30), more weight is given to forecasts close to the needed locations. This implies that all temporal models are involved, but to a varied extent, depending on the distance.

This extension with distinct temporal models per location adds flexibility to monitoring in at least two ways. First, it adds flexibility to network management. Each sensor has to estimate and update its own temporal model, so this need not be handled by the server, which is thus in charge only of the spatial interpolation task. Second, statistical modeling becomes more flexible too. Prior to this, a single overall temporal model was formulated that had to fit all locations forcefully. Distinct temporal models may now be used to address subsets of locations, so that they can cope with more local and specific dynamics.

Anomalous sensors may be dealt with more effectively by allowing them to make predictions based on a more specific model targeted at them only. The office room in our example is too small to allow for a variety of temporal models. Larger environments will likely be more heterogeneous and will thus need many local models to provide better forecasts. Indeed, since spatio-temporal prediction is made up of both interpolation and forecasting, the quality of the latter is a necessary ingredient of joint predictive performance.

DISCUSSION

We have proposed a separable kriging approach that allows large datasets to be analyzed by exploiting some overlooked aspects of separability. Even high-frequency data can be processed in a reasonable amount of time using a maximum CL estimator and optimized calculus in prediction. The assumption of separability allows calculus to be distributed across the sensor network by delegating as many operations as possible to the components that gather the relevant data.
Separability is a strong assumption, though it can be trusted at least in settings close to ours, with sensor data from indoor environments. In less controlled environments, and especially with weather data, this assumption is likely restrictive and unrealistic, and nonseparable models may be considered instead. Future research may investigate the viability of CL approaches in such settings as well.

To our knowledge, the use of marginal CL is novel to sensor data analysis. Its most appealing aspect is that the spatial and temporal models under separability can be estimated in parallel without affecting each other. The spatial model must be estimated in a centralized way, but the temporal model may be addressed in a decentralized way by allowing sensors to estimate a temporal model valid for their location or neighborhood, as described in Section 6.2. This idea relates to stratified variograms,16,83 but it has even more in common with the estimation of a single variogram with data pairs sharing some identical conditions.84 The CL approach is thus computationally convenient and potentially more robust to model misspecification, as a wrong temporal model does not affect the estimation of the spatial model, and vice versa. There is necessarily a loss of efficiency, which may be more noticeable in small samples but should be less relevant in big data applications. In challenging estimation problems, CL estimates are also readily available and may help initialize other iterative estimation procedures, if a more efficient estimate is of interest.

The predictive part of our approach was already common in climate and weather sciences, but mostly confined to spatial interpolation.85 We instead provide a formal motivation for this way of computing predictions, based on separability. We found a related simplification in joint spatio-temporal prediction, which we conjecture can be extended to separability in more than two domains via Tucker products instead of Kronecker ones. For instance, covariates could be included in kriging via fully factored modeling,7 although feature engineering seemed necessary in our case,9 which is not typical in kriging. For the sake of completeness, alternative modeling strategies include additive covariances,86 process convolution,87 and linear mixed models,69 for which similar simplifications, where possible, might take different forms.

In developing our proposal, we require the data to be gridded, meaning that all sensors provide simultaneous readings. This requirement can be weakened, however, since data can at least be projected onto a grid.88

Kriging computation is hugely simplified by assuming separability and choosing a suitable spatial or (especially) temporal model. Both estimation and prediction can bypass the evaluation and inversion of large correlation matrices by treating the data as univariate time series or cross-sections, thus splitting a generally complicated calculation into simpler operations. For instance, AR models may have an intractable ACF, but they can be estimated easily by minimizing a conditional sum of squares, and their forecasts based on Equation (23) are simple to calculate as well.
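For instance, under the multiplicative seasonal AR of Equation (23), forecasting reduces to a short recursion once the lag polynomial is expanded; the sketch below uses hypothetical coefficients and structural lags:

    # expand prod_k (1 - phi_k B^{Delta_k}) into a single AR lag polynomial
    phi  <- c(0.9, 0.5)     # hypothetical coefficients
    lags <- c(1, 144)       # hypothetical structural lags, e.g., one step and one day
    poly <- 1
    for (k in seq_along(phi)) {
      pk   <- c(1, rep(0, lags[k] - 1), -phi[k])
      poly <- convolve(poly, rev(pk), type = "open")   # polynomial product
    }
    poly <- zapsmall(poly)                             # clean FFT round-off
    # recursive h-step-ahead forecasts of a de-trended series z; no T x T matrix needed
    z  <- as.numeric(arima.sim(list(ar = 0.9), n = 1000))  # stand-in for de-trended data
    p  <- length(poly) - 1; H <- 6
    zf <- c(z, numeric(H))
    for (h in seq_len(H)) {
      tt     <- length(z) + h
      zf[tt] <- -sum(poly[-1] * zf[tt - seq_len(p)])   # future innovations set to zero
    }
    tail(zf, H)   # forecasts; the trend estimate is added back to predict y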
Our approach blends together and leverages known strengths of time series analysis and spatial statistics, without the need to outline a joint spatio-temporal framework from scratch.

ACKNOWLEDGMENTS

This work was supported by the Competence Centre ASSIC – Austrian Smart Systems Integration Research Center, co-funded by the Austrian Federal Ministry for Transport, Innovation and Technology (BMVIT), the Austrian Federal Ministry for Education, Science and Research (BMWFW), and the Austrian Federal Provinces of Carinthia and Styria within the Competence Centres for Excellent Technologies Programme (COMET). The research of Michele Lambardi di San Miniato was supported by the European Social Fund (Investimenti in favore della crescita e dell'occupazione, Programma Operativo del Friuli Venezia Giulia 2014/2020) – Programma specifico 89/2019 – Sostegno alla realizzazione di dottorati e assegni di ricerca, operazione PS 89/2019 ASSEGNI DI RICERCA – UNIUD (FP1956292002, canale di finanziamento 1420_SRDAR8919). Open Access Funding provided by Universita degli Studi di Udine within the CRUI-CARE Agreement.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available on GitHub at https://github.com/dslab-uniud/Virtual-sensing.

REFERENCES

1. Gramacy RB. Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences. 1st ed. Chapman and Hall/CRC; 2020.
2. Nguyen L, Hu G, Spanos CJ. Spatio-temporal environmental monitoring for smart buildings. Proceedings of the 13th IEEE International Conference on Control & Automation; 2017; IEEE, Ohrid, North Macedonia.
3. Carpenter J, Woodbury KA, O'Neill Z. Using change-point and Gaussian process models to create baseline energy models in industrial facilities: a comparison. Appl Energy. 2018;213:415-425. doi:10.1016/j.apenergy.2018.01.043
4. Liu H, Yang C, Huang M, Wang D, Yoo C. Modeling of subway indoor air quality using Gaussian process regression. J Hazard Mater. 2018;359:266-273. doi:10.1016/j.jhazmat.2018.07.034
5. Li H, Yu D, Braun JE. A review of virtual sensing technology and application in building systems. HVAC&R Res. 2011;17(5):619-645.
6. Rodríguez-Iturbe I, Mejía JM. The design of rainfall networks in time and space. Water Resour Res. 1974;10(4):713-728. doi:10.1029/wr010i004p00713
7. Mardia KV, Goodall CR. Spatial-temporal analysis of multivariate environmental monitoring data. In: Patil GP, Rao CR, eds. Multivariate Environmental Statistics. Elsevier; 1993:347-385.
8. Dong B, Lam KP. A real-time model predictive control for building heating and cooling systems based on the occupancy behavior pattern detection and local weather forecasting. Build Simul. 2013;7(1):89-106. doi:10.1007/s12273-013-0142-7
9. Brunello A, Urgolo A, Pittino F, Montvay A, Montanari A. Virtual sensing and sensors selection for efficient temperature monitoring in indoor environments. Sensors. 2021;21(8):2728. doi:10.3390/s21082728
10. Ferdoush S, Li X. Wireless sensor network system design using Raspberry Pi and Arduino for environmental monitoring applications. Proc Comput Sci. 2014;34:103-110. doi:10.1016/j.procs.2014.07.059
11. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery; 2016:785-794; New York.
12. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735-1780. doi:10.1162/neco.1997.9.8.1735
13. Oktavia E, Widyawan, Mustika IW.
Inverse distance weighting and kriging spatial interpolation for data center thermal monitoring. Proceedings of the 1st International Conference on Information Technology, Information Systems and Electrical Engineering; 2016; IEEE, Yogyakarta, Indonesia.
14. Azzalini A, Scarpa B. Data Analysis and Data Mining: An Introduction. Illustrated ed. Oxford University Press; 2012.
15. Aryaputera AW, Yang D, Zhao L, Walsh WM. Very short-term irradiance forecasting at unobserved locations using spatio-temporal kriging. Sol Energy. 2015;122:1266-1278. doi:10.1016/j.solener.2015.10.023
16. Bivand RS, Pebesma E, Gómez-Rubio V. Applied Spatial Data Analysis with R. Springer; 2013.
17. Genton MG. Separable approximations of space-time covariance matrices. Environmetrics. 2007;18(7):681-695. doi:10.1002/env.854
18. Furrer R, Genton MG, Nychka D. Covariance tapering for interpolation of large spatial datasets. J Comput Graph Stat. 2006;15(3):502-523. doi:10.1198/106186006x132178
19. Box GEP, Jenkins GM, Reinsel GC, Ljung GM. Time Series Analysis: Forecasting and Control. 5th ed. Wiley; 2015.
20. Ruybal CJ, Hogue TS, McCray JE. Evaluation of groundwater levels in the Arapahoe aquifer using spatiotemporal regression kriging. Water Resour Res. 2019;55(4):2820-2837. doi:10.1029/2018wr023437
21. Thai CH, Tran TD, Phung-Van P. A size-dependent moving kriging meshfree model for deformation and free vibration analysis of functionally graded carbon nanotube-reinforced composite nanoplates. Eng Anal Bound Elem. 2020;115:52-63. doi:10.1016/j.enganabound.2020.02.008
22. Slot RMM, Sørensen JD, Sudret B, Svenningsen L, Thøgersen ML. Surrogate model uncertainty in wind turbine reliability assessment. Renew Energy. 2020;151:1150-1162. doi:10.1016/j.renene.2019.11.101
23. García-Gutiérrez A, Gonzalo J, Domínguez D, López D. Stochastic optimization of high-altitude airship envelopes based on kriging method. Aerosp Sci Technol. 2022;120:107251. doi:10.1016/j.ast.2021.107251
24. Qian HM, Li YF, Huang HZ. Time-variant reliability analysis for industrial robot RV reducer under multiple failure modes using kriging model. Reliab Eng Syst Saf. 2020;199:106936. doi:10.1016/j.ress.2020.106936
25. Luo Y, Xing J, Kang Z. Topology optimization using material-field series expansion and kriging-based algorithm: an effective non-gradient method. Comput Methods Appl Mech Eng. 2020;364:112966. doi:10.1016/j.cma.2020.112966
26. Yang Y, Christakos G, Yang X, He J. Spatiotemporal characterization and mapping of PM2.5 concentrations in southern Jiangsu province, China. Environ Pollut. 2018;234:794-803. doi:10.1016/j.envpol.2017.11.077
27. Mohammadi V, Dehghan M, Khodadadian A, Wick T. Numerical investigation on the transport equation in spherical coordinates via generalized moving least squares and moving kriging least squares approximations. Eng Comput. 2019;37(2):1231-1249. doi:10.1007/s00366-019-00881-3
28. Menafoglio A, Gaetani G, Secchi P. Random domain decompositions for object-oriented kriging over complex domains. Stoch Env Res Risk A. 2018;32(12):3421-3437. doi:10.1007/s00477-018-1596-z
29. Chen J, Mak S, Joseph VR, Zhang C. Function-on-function kriging, with applications to three-dimensional printing of aortic tissues. Technometrics. 2020;63(3):384-395. doi:10.1080/00401706.2020.1801255
30. Stein ML. Space-time covariance functions. J Am Stat Assoc. 2005;100(469):310-321. doi:10.1198/016214504000000854
31. Politis DN. Model-Free Prediction and Regression. 1st ed. Springer; 2016.
32. Chilès JP, Desassis N. Fifty years of kriging. In: Sagar BSD, Cheng Q, Agterberg F, eds. Handbook of Mathematical Geosciences.
Springer; 2018:589-612.
33. You MY. Multi-objective optimal design of permanent magnet synchronous motor for electric vehicle based on deep learning. Appl Sci. 2020;10(2):482. doi:10.3390/app10020482
34. Bhattacharjee S, Mitra P, Ghosh SK. Spatial interpolation to predict missing attributes in GIS using semantic kriging. IEEE Trans Geosci Remote Sens. 2014;52(8):4771-4780. doi:10.1109/tgrs.2013.2284489
35. de Medeiros ES, de Lima RR, de Olinda RA, Dantas LG, de Santos CAC. Space-time kriging of precipitation: modeling the large-scale variation with model GAMLSS. Water. 2019;11(11):2368. doi:10.3390/w11112368
36. Totis G, Sortino M. Polynomial chaos-kriging approaches for an efficient probabilistic chatter prediction in milling. Int J Mach Tools Manuf. 2020;157:103610. doi:10.1016/j.ijmachtools.2020.103610
37. Delfiner P, Chilès JP. Geostatistics: Modeling Spatial Uncertainty. 2nd ed. Wiley; 2012.
38. Prudhomme C, Reed DW. Mapping extreme rainfall in a mountainous region using geostatistical techniques: a case study in Scotland. Int J Climatol. 1999;19(12):1337-1356. doi:10.1002/(sici)1097-0088(199910)19:12<1337::aid-joc421>3.0.co;2-g
39. Kaufman CG, Bingham D, Habib S, Heitmann K, Frieman JA. Efficient emulators of computer experiments using compactly supported correlation functions, with an application to cosmology. Ann Appl Stat. 2011;5(4):2470-2492. doi:10.1214/11-aoas489
40. Griffith DA. Hidden negative spatial autocorrelation. J Geogr Syst. 2006;8(4):335-355. doi:10.1007/s10109-006-0034-9
41. dos Reis AA, Carvalho MC, de Mello JM, Gomide LR, Ferraz Filho AC, Acerbi Junior FW. Spatial prediction of basal area and volume in Eucalyptus stands using Landsat TM data: an assessment of prediction methods. N Z J For Sci. 2018;48(1). doi:10.1186/s40490-017-0108-0
42. Angelini ME, Heuvelink GBM. Including spatial correlation in structural equation modelling of soil properties. Spatial Stat. 2018;25:35-51. doi:10.1016/j.spasta.2018.04.003
43. McLean MI, Evers L, Bowman AW, Bonte M, Jones WR. Statistical modelling of groundwater contamination monitoring data: a comparison of spatial and spatiotemporal methods. Sci Total Environ. 2019;652:1339-1346. doi:10.1016/j.scitotenv.2018.10.231
44. Liu H, Ong YS, Shen X, Cai J. When Gaussian process meets big data: a review of scalable GPs. IEEE Trans Neural Netw Learn Syst. 2020;31(11):4405-4423. doi:10.1109/tnnls.2019.2957109
45. Guhaniyogi R, Banerjee S. Meta-kriging: scalable Bayesian modeling and inference for massive spatial datasets. Technometrics. 2018;60(4):430-444. doi:10.1080/00401706.2018.1437474
46. Opitz T, Bonneu F, Gabriel E. Point-process based Bayesian modeling of space-time structures of forest fire occurrences in Mediterranean France. Spatial Stat. 2020;40:100429. doi:10.1016/j.spasta.2020.100429
47. Gu L. Moving kriging interpolation and element-free Galerkin method. Int J Numer Methods Eng. 2002;56(1):1-11. doi:10.1002/nme.553
48. Hartman L, Hössjer O. Fast kriging of large data sets with Gaussian Markov random fields. Comput Stat Data Anal. 2008;52(5):2331-2349. doi:10.1016/j.csda.2007.09.018
49. Hristopulos DT, Agou VD. Stochastic local interaction model with sparse precision matrix for space-time interpolation. Spatial Stat. 2020;40:100403. doi:10.1016/j.spasta.2019.100403
50. Strandberg J, de Luna SS, Mateu J. Prediction of spatial functional random processes: comparing functional and spatio-temporal kriging approaches. Stoch Env Res Risk A. 2019;33(10):1699-1719. doi:10.1007/s00477-019-01705-y
51. Arendt PD, Apley DW, Chen W, Lamb D, Gorsich D. Improving identifiability in model calibration using multiple responses. J Mech Des.
2012;134(10):100909-1-100909-9. doi:10.1115/1.4007573
52. Welch WJ, Buck RJ, Sacks J, Wynn HP, Mitchell TJ, Morris MD. Screening, predicting, and computer experiments. Technometrics. 1992;34(1):15-25. doi:10.1080/00401706.1992.10485229
53. Varin C, Reid N, Firth D. An overview of composite likelihood methods. Stat Sin. 2011;21:5-42.
54. Bevilacqua M, Gaetan C, Mateu J, Porcu E. Estimating space and space-time covariance functions for large data sets: a weighted composite likelihood approach. J Am Stat Assoc. 2012;107(497):268-280. doi:10.1080/01621459.2011.646928
55. Maronna RA, Martin RD, Yohai VJ. Robust Statistics: Theory and Methods. Wiley; 2006.
56. Caragea PC, Smith RL. Asymptotic properties of computationally efficient alternative estimators for a class of multivariate normal models. J Multivar Anal. 2007;98(7):1417-1440. doi:10.1016/j.jmva.2006.08.010
57. Xu X, Reid N. On the robustness of maximum composite likelihood estimate. J Stat Plan Inference. 2011;141(9):3047-3054. doi:10.1016/j.jspi.2011.03.026
58. Agostinelli C, Yohai VJ. Composite robust estimators for linear mixed models. J Am Stat Assoc. 2016;111(516):1764-1774. doi:10.1080/01621459.2015.1115358
59. Franceschetti G, Riccio D. Surface classical models. Ch. 6 in: Scattering, Natural Surfaces, and Fractals. 1st ed. Elsevier; 2007:21-59.
60. Hartwig RE. AX − XB = C, resultants and generalized inverses. SIAM J Appl Math. 1975;28(1):154-183. doi:10.1137/0128014
61. Berrocal VJ, Raftery AE, Gneiting T. Combining spatial statistical and ensemble information in probabilistic weather forecasts. Mon Weather Rev. 2007;135(4):1386-1402. doi:10.1175/mwr3341.1
62. Diggle PJ, Ribeiro PJ. Model-based Geostatistics. 1st ed. Springer; 2007.
63. Cressie N. Statistics for Spatial Data. Revised ed. Wiley; 1993.
64. Zhang B, Sang H, Huang JZ. Full-scale approximations of spatio-temporal covariance models for large datasets. Stat Sin. 2015;25(1):99-114. doi:10.5705/ss.2013.260w
65. Gaetan C, Guyon X. Spatial Statistics and Modeling. 1st ed. Springer; 2010.
66. Maes MA, Breitung K, Dann MR. At issue: the Gaussian autocorrelation function. Proceedings of the 18th International Probabilistic Workshop; 2021:191-203; Springer, Cham.
67. Stachniss C, Plagemann C, Lilienthal AJ. Learning gas distribution models using sparse Gaussian process mixtures. Auton Robot. 2009;26(2-3):187-202. doi:10.1007/s10514-009-9111-5
68. Banerjee S, Gelfand AE, Finley AO, Sang H. Gaussian predictive process models for large spatial data sets. J R Stat Soc Ser B (Stat Methodol). 2008;70(4):825-848. doi:10.1111/j.1467-9868.2008.00663.x
69. Dumelle M, Hoef JMV, Fuentes C, Gitelman A. A linear mixed model formulation for spatio-temporal random processes with computational advances for the product, sum, and product-sum covariance functions. Spatial Stat. 2021;43:100510. doi:10.1016/j.spasta.2021.100510
70. Gong G, Samaniego FJ. Pseudo maximum likelihood estimation: theory and applications. Ann Stat. 1981;9(4):861-869. doi:10.1214/aos/1176345526
71. Varin C, Vidoni P. A note on composite likelihood inference and model selection. Biometrika. 2005;92(3):519-528. doi:10.1093/biomet/92.3.519
72. Davison AC, Hinkley DV. Bootstrap Methods and Their Application. Illustrated ed. Cambridge University Press; 1997.
73. Wright SJ. Coordinate descent algorithms. Math Program. 2015;151(1):3-34. doi:10.1007/s10107-015-0892-3
74. Hunter JS. The exponentially weighted moving average. J Qual Technol. 1986;18(4):203-210. doi:10.1080/00224065.1986.11979014
75. Huwang L, Yeh AB, Wu CW. Monitoring multivariate process variability for individual observations. J Qual Technol. 2007;39(3):258-278.
doi:10.1080/00224065.2007.11917692
76. Cressie N. The origins of kriging. Math Geol. 1990;22(3):239-252. doi:10.1007/bf00889887
77. Qian HM, Huang T, Huang HZ. A single-loop strategy for time-variant system reliability analysis under multiple failure modes. Mech Syst Signal Process. 2021;148:107159. doi:10.1016/j.ymssp.2020.107159
78. Hyndman RJ, Koehler AB, Snyder RD, Grose S. A state space framework for automatic forecasting using exponential smoothing methods. Int J Forecast. 2002;18(3):439-454. doi:10.1016/s0169-2070(01)00110-8
79. Ma C. Semiparametric spatio-temporal covariance models with the ARMA temporal margin. Ann Inst Stat Math. 2005;57(2):221-233. doi:10.1007/bf02507023
80. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2021.
81. Zhou H, Yu KM, Lee MG, Han CC. The application of last observation carried forward method for missing data estimation in the context of industrial wireless sensor networks. Proceedings of the 7th IEEE Asia-Pacific Conference on Antennas and Propagation; 2018; IEEE, Auckland, New Zealand.
82. Gramacy RB, Niemi J, Weiss RM. Massively parallel approximate Gaussian process regression. SIAM/ASA J Uncertainty Quantif. 2014;2(1):564-584. doi:10.1137/130941912
83. Courault D, Monestiez P. Spatial interpolation of air temperature according to atmospheric circulation patterns in southeast France. Int J Climatol. 1999;19(4):365-378. doi:10.1002/(sici)1097-0088(19990330)19:4<365::aid-joc369>3.0.co;2-e
84. Monestiez P, Courault D, Allard D, Ruget F. Spatial interpolation of air temperature using environmental context: application to a crop model. Environ Ecol Stat. 2001;8(4):297-309. doi:10.1023/a:1012726317935
85. Holdaway M. Spatial modeling and interpolation of monthly temperature using kriging. Clim Res. 1996;6:215-225. doi:10.3354/cr006215
86. Ma P, Konomi BA, Kang EL. An additive approximate Gaussian process model for large spatio-temporal data. Environmetrics. 2019;30(8):e2569. doi:10.1002/env.2569
87. Higdon D. Space and space-time modeling using process convolutions. In: Anderson CW, Barnett V, Chatwin PC, El-Shaarawi AH, eds. Quantitative Methods for Current Environmental Issues. Springer; 2002:37-56.
88. Paciorek CJ. Computational techniques for spatial logistic regression with large data sets. Comput Stat Data Anal. 2007;51(8):3631-3653. doi:10.1016/j.csda.2006.11.008
89. Steeb WH. Problems and Solutions in Introductory and Advanced Matrix Calculus. 1st ed. World Scientific; 2006.

APPENDIX A: PROOFS

A.1 Kriging mean formula

Predictions can be computed in vectorized form, as follows, after Equation (28):

(A1) $$ \mathrm{vec}(\hat{y}' - \mu') = \rho R^{-1}\, \mathrm{vec}(y - \mu). $$

Using the definitions in Equations (11) and (27), it follows that

(A2) $$ \mathrm{vec}(\hat{y}' - \mu') = (\rho_S \otimes \rho_T)\,(R_S \otimes R_T)^{-1}\, \mathrm{vec}(y - \mu). $$

Next, we use the inversion behavior of the Kronecker product:

(A3) $$ \mathrm{vec}(\hat{y}' - \mu') = (\rho_S \otimes \rho_T)\,(R_S^{-1} \otimes R_T^{-1})\, \mathrm{vec}(y - \mu). $$
Then, the mixed-product property of the Kronecker product comes in handy:

(A4) $$ \mathrm{vec}(\hat{y}' - \mu') = \{(\rho_S R_S^{-1}) \otimes (\rho_T R_T^{-1})\}\, \mathrm{vec}(y - \mu). $$

At this point, the regression coefficients of Equation (29) can be recognized:

(A5) $$ \mathrm{vec}(\hat{y}' - \mu') = (\beta_S \otimes \beta_T)\, \mathrm{vec}(y - \mu). $$

Lastly, we use Roth's column lemma,60 which can be found, uncredited, in recent algebra handbooks; see Steeb's book, Chapter 9, Problem 22.89

(A6) $$ \mathrm{vec}(\hat{y}' - \mu') = \mathrm{vec}\{\beta_T\,(y - \mu)\,\beta_S^{\top}\}. $$

The vec operator can then be dropped, and the matrix ŷ' is obtained.

A.2 Kriging variance formula

Using Equation (28) as a starting point, it holds that

(A7) $$ \operatorname{diag}(R'_{\mathrm{cond}}) = \operatorname{diag}(R') - \operatorname{diag}(\rho R^{-1} \rho^{\top}). $$

Similarly to the proof in Appendix A.1, we again exploit the inversion behavior and the mixed-product property of the Kronecker product, together with Equations (11) and (27), to obtain

(A8) $$ \operatorname{diag}(R'_{\mathrm{cond}}) = \operatorname{diag}(R'_S \otimes R'_T) - \operatorname{diag}\{(\rho_S R_S^{-1} \rho_S^{\top}) \otimes (\rho_T R_T^{-1} \rho_T^{\top})\}. $$

After Equation (34), it follows that

(A9) $$ \operatorname{diag}(R'_{\mathrm{cond}}) = \operatorname{diag}(R'_S \otimes R'_T) - \operatorname{diag}\{(R'_S - R'_{S,\mathrm{cond}}) \otimes (R'_T - R'_{T,\mathrm{cond}})\}. $$

One can use the self-evident property diag(A ⊗ B) = diag(A) ⊗ diag(B), for square matrices A and B, which implies

(A10) $$ \operatorname{diag}(R'_{\mathrm{cond}}) = \operatorname{diag}(R'_S) \otimes \operatorname{diag}(R'_T) - \operatorname{diag}(R'_S - R'_{S,\mathrm{cond}}) \otimes \operatorname{diag}(R'_T - R'_{T,\mathrm{cond}}). $$

Now, the components of diag(R'_cond) can be partitioned into vectors with the same length as diag(R'_T). Such vectors can be taken as the columns of the matrix V, which is thus defined as in our claims.

APPENDIX B: BOOTSTRAP

Parametric bootstrap72 under separable kriging is particularly convenient, because the implied model is easy to simulate and its parameters are simple to estimate with the approach proposed in this article. The temporal and spatial models together identify the full model, under which one can simulate artificial datasets.
In particular, separability allows a dataset Y to be simulated as

(B1) $$ Y \sim \mu + \sigma \cdot R_T^{1/2} \cdot \epsilon \cdot R_S^{1/2}, $$

where ε is a Gaussian white noise structured into a T × S matrix, and R_T^{1/2} and R_S^{1/2} are matrix square roots of R_T and R_S, respectively.

Assuming T ≫ S, R_S^{1/2} may be tractable, while R_T^{1/2} will hardly be so. The operator R_T^{1/2} just makes R_T^{1/2} · ε a matrix with independent columns that share the same correlation structure R_T. So, as an alternative to evaluating R_T^{1/2} directly, one can generate each column of σ · R_T^{1/2} · ε according to the temporal model. AR(p) processes can be simulated efficiently according to the factorized MA(∞) form.19 The first observations should be initialized according to the stationary distribution of the process, but with complicated models one may instead provide an arbitrary initialization and then simulate additional observations as a burn-in. We adopted the latter strategy: in simulating 8 weeks of data, we simulated 32 more leading weeks as a burn-in.

APPENDIX C: MAXIMUM LIKELIHOOD ESTIMATION

The pseudo ML estimator θ̂ for θ is defined as a maximizer of (12). An iterative optimization procedure can be devised to compute θ̂ by leveraging the convenient separability assumption in Equation (16). Operationally, the components of the parameter vector θ can be partitioned into groups, which are updated one at a time until convergence. In particular, we define three groups of parameters: the scale parameter σ, the spatial correlation parameters ψ_S, and the temporal correlation parameters ψ_T.

In Equation (B1), R_S^{1/2} and R_T^{1/2} were matrix square roots of R_S and R_T, respectively. Now, let R_S^{-1/2} and R_T^{-1/2} be their inverses. These matrices can be feasibly estimated, as R̂_S^{-1/2} and R̂_T^{-1/2}, based on available estimates of ψ_S and ψ_T.
An approximately efficient estimator can be computed iteratively by:

1. updating σ̂² as the mean of the squared entries of the matrix R̂_T^{-1/2} (y − μ̂) R̂_S^{-1/2}, based on the current estimates of ψ_S and ψ_T;

2. updating ψ̂_S as the maximizer of

(C1) $$ \Omega(\psi_S) = -\frac{T}{2}\left\{\log |R_S| + \mathrm{tr}\left(R_S^{-1}\hat{M}_S\right)\right\}, $$

with respect to ψ_S, where M̂_S = z_S^⊤ z_S / T and z_S = R̂_T^{-1/2} (y − μ̂) / σ̂, based on the current estimates of ψ_T and σ; notice the resemblance with Equations (17) and (18): this operation is equivalent to estimating ψ_S based on T independent cross-sections;

3. updating ψ̂_T as the maximizer of

(C2) $$ \Omega(\psi_T) = -\frac{S}{2}\left\{\log |R_T| + \mathrm{tr}\left(R_T^{-1}\hat{M}_T\right)\right\}, $$

with respect to ψ_T, where M̂_T = z_T z_T^⊤ / S and z_T = (y − μ̂) R̂_S^{-1/2} / σ̂, based on the current estimates of ψ_S and σ; notice the resemblance with Equations (20) and (21): this operation is equivalent to estimating ψ_T based on S independent time series.

In Appendix B, we stressed that some convenient spatial or temporal models allow for sparse formulations of the matrices R_S^{-1/2} and R_T^{-1/2}. In evaluating z_S, there is thus no need to allocate in memory any large matrix like R_T^{-1/2}.
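To make the procedure concrete, the following self-contained sketch, which is not the authors' code, simulates a small dataset as in Equation (B1) and runs the three updates above in a convenient order, specialized to an exponential spatial ACF and an AR(1) temporal model; sizes and true parameter values are hypothetical:

    # simulate Y = sigma * RT^{1/2} eps RS^{1/2} with mu = 0 (Appendix B)
    set.seed(2)
    S <- 5; Tn <- 2000
    coords <- matrix(runif(2 * S, 0, 10), ncol = 2)
    dS   <- as.matrix(dist(coords))
    lam0 <- 4; phi0 <- 0.8; sig0 <- 1.5               # true parameters
    E <- sapply(1:S, function(s)                      # unit-variance AR(1) columns,
      sqrt(1 - phi0^2) * arima.sim(list(ar = phi0), n = Tn))  # in place of RT^{1/2} eps
    Y <- sig0 * (E %*% chol(exp(-dS / lam0)))         # right factor is a root of RS
    # sparse action of RT^{-1/2} for an AR(1): first row kept, then scaled differences
    whiten <- function(Z, phi)
      rbind(Z[1, , drop = FALSE],
            (Z[-1, , drop = FALSE] - phi * Z[-nrow(Z), , drop = FALSE]) / sqrt(1 - phi^2))
    lam <- 1; phi <- 0; sig <- sd(as.vector(Y))       # rough initial values
    for (it in 1:20) {
      W  <- whiten(Y, phi)                            # temporally decorrelated data
      MS <- crossprod(W) / (Tn * sig^2)               # sample spatial correlation, M_S
      obj <- function(l) {                            # objective of Equation (C1)
        R <- exp(-dS / l)
        -(log(det(R)) + sum(diag(solve(R, MS))))
      }
      lam <- optimize(obj, c(0.1, 50), maximum = TRUE)$maximum
      Ui  <- backsolve(chol(exp(-dS / lam)), diag(S)) # inverse root of RS
      V   <- Y %*% Ui                                 # spatially decorrelated series
      phi <- sum(V[-1, ] * V[-nrow(V), ]) / sum(V[-nrow(V), ]^2)  # AR(1) conditional LS
      sig <- sqrt(mean((whiten(Y, phi) %*% Ui)^2))    # scale from doubly whitened data
    }
    round(c(lambda = lam, phi = phi, sigma = sig), 2) # should approach (4, 0.80, 1.50)

Here the temporal step uses the AR(1) conditional least squares update in place of a direct maximization of (C2), consistently with the matrix-free spirit of the article.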

Abstract

INTRODUCTIONEnvironmental monitoring systems rely on virtual sensing logic to predict relevant variables of their target environment. While the information on the whole system is of interest, this is typically based on sensor readings, which are limited in both space and time, so it is necessary to surrogate them based on some suitable statistical method.1 Variables of interest may include, for instance, room temperature,2 energy consumption,3 and air quality.4 We consider the case of enclosed environments,2,4,5 as contrasted to other applications that are aimed at larger environments like ecosystems.6,7 Moreover, our focus is on applications that need real‐time control.8The motivating example for this work comes from a virtual sensing project at Silicon Austria Labs GmbH, a European research center for electronic‐based systems.9 The data relate to an office room in Villach, Austria, that has been monitored for 19 weeks between October 2019 and March 2020. The temperature (in ∘C$$ {}^{\circ \mathrm{C}} $$) is reported by 12 sensors every 10 s, along with other physical measurements like pressure (in Pa) and light (in lx). The room is 127 m2$$ {\mathrm{m}}^2 $$ large and it is structured as reported in Figure 1. The picture also shows the locations of sensors and windows, along with the cardinal points.1FIGURERoom with windows, cardinal points, and sensor locations, modified from the original map9The sensors are all Raspberry Pi Zero boards. Their measurements are broadcast over a wireless network to a database server, which is a Raspberry Pi 3 instead. Raspberry Pis are popular and affordable single‐board computers that are widely used in home automation and smart systems.10 This example has some key aspects, including data referenced both temporally and spatially, high‐frequency measurements, resulting in multivariate times series with as many as 106$$ 1{0}^6 $$ observations for each sensor. The server has a limited yet non‐negligible computational power, which is also important, as it allows to process data locally if this task is planned carefully, considering the limitations of the monitoring system.As common in modern data analysis, there are at least two main and opposite approaches to deal with sensor data. These two opposites are represented by interpretable models and black‐box algorithms, respectively. The former include specifications based on actual knowledge about physical aspects of the system,8 often hard to formulate; the latter include neural networks and other machine learning techniques that achieve remarkable performance levels and are readily available in general software. Other authors addressed the same datasets of our present analysis,9 but they used techniques such as XGBoost regression11 and LSTM recurrent neural networks.12 These methods do not provide spatial interpolation by design, but they can do so only after some suitable engineering, thus some generalization issues emerge.Here we advocate for an approach lying between the two extremes, which is statistically sound without compromising the predictive performance. We are interested in simple models and distributed computing, thus on scalable methods that leverage the proximity principle. 
It is then natural to resort to distance‐based prediction methods,13 which include for instance inverse distance weighted regression (IDWR), k$$ k $$‐nearest‐neighbors (k$$ k $$‐NN),14 and spatio‐temporal kriging.15 These methods are somewhat related to pure spatial data analysis.16 The focus of the present article is on the kriging method. This approach relies on a correlation model, which depends on distances between measurements in time and space. A crucial assumption for distributed calculus is spatio‐temporal separability, which implies two separate models for spatial and temporal correlations.7 This assumption is hardly suitable for large environments, where some locations can anticipate events that will occur somewhere else. In smaller environments, separability provides instead a useful approximation that may perform similarly to more complicated, nonseparable models.17While involving a simple model, kriging is cursed by the enormous cost of computation, mostly due to the inversion of large correlation matrices. Some approximations have been devised to make kriging tractable like, for instance, covariance tapering.18 A composite likelihood (CL) approach can be used to estimate a separable model, which allows to separate the estimation of the spatial model from the estimation of the temporal model. Also, some models in the time domain can spare the cumbersome matrix inversions and thus simplify both estimation and prediction. For instance, autoregressive (AR) models can be estimated just by minimizing the conditional sum of squares, and they come with compact forecasting rules.19 As to spatio‐temporal predictions, we show that these can be seen as a spatial interpolation of temporal forecasts under separability, which allows to leverage specific advantages of time series and spatial models. These possibilities seem somewhat overlooked in the literature, in spite of the attention received by separability itself.All in all, the main contributions of this article can be summarized as follows. First, a CL‐based fitting procedure is proposed that is both memory‐ and compute‐efficient. An affordable implementation of the maximum likelihood (ML) estimator is also presented, which is slower to compute but serves as the natural benchmark for our proposal. Second, this article provides significant simplifications of computational order in both estimation and prediction related to Gaussian processes (GPs), in the case of separable covariance structures, making it possible to carry out scalable sensor data analysis. As a result, it expands the range of models that can be addressed in spatio‐temporal kriging, as we point out that the allegedly required covariance matrices need not to be evaluated directly. Thus, even rich seasonal models can be addressed without directly evaluating the autocorrelation function.The plan of this article is as follows. In Section 2, we review the literature in brief as it concerns GPs and their applications to large datasets. In Section 3, we recall basic kriging formulation, while emphasizing some correlation structures of practical importance. In Section 4, we detail our inferential and predictive framework, focusing on distributed computing. Section 5 illustrates the application of the proposed methodology to the motivating example, whereas Section 6 is devoted to some possible twists and extensions. 
Finally, Section 7 presents some concluding remarks.LITERATURE REVIEWWe deal with the task of surrogate modeling, estimation and prediction.1 Consider a function or a process defined over a domain. This can represent the temperature at given times and locations,9 groundwater levels over a region,20 material properties of different portions of a solid,21 wind speed at turbine locations,22 turbulence around airships,23 or robots' reliability over time,24 to name a few examples. The process is feasibly observed only at selected points in the domain. For instance, room temperature may be known only at sensor locations and at discrete times. In many of these examples, inputs and outputs of some production systems can be thought of, respectively, as domain points and process outcomes. The unavailable information, that is, the process at unobserved locations and times, has thus to be surrogated. Optimization tasks may involve some interpolation steps,23,25 for which a proximity principle is implicitly trusted. From a statistical viewpoint, a stochastic process is assumed to rule the target of knowledge, and suitable regularity assumptions will be made: these reflect the heuristic that similarity comes with proximity and independence comes with distance. For instance, atmospheric events and air conditions often exhibit such regularity.26 From an operational standpoint, a distance measure must be defined, which must be meaningful within the specific geometry of the domain on focus.27 For example, a domain can even be made up of functions: even in such a case, a meaningful distance measure has been proposed recently.28,29Gaussianity, coupled with a suitable correlation structure, is a common assumption in statistical modeling that is central to kriging, as it translates straightforwardly into first‐ and second‐ order conditions.30,31 Moreover, GPs arise naturally in the case of spatially and temporally referenced data. The term kriging in broad generality refers to the task of making predictions based on conditional normal distributions, where the covariance structure depends on a suitably defined distance measure.32 By extension, the model underlying kriging has been called the kriging model.33 Kriging has become an umbrella name to include many variants, like ordinary, simple, and universal kriging, which differ in the specification of the mean. Moreover, co‐kriging involves regression on covariates that are known beforehand; on the contrary, semantic kriging involves covariates that are not known in advance and thus need to be predicted in turn.34 In machine learning, kriging often uses rich and nonlinear trend formulations,35 see, for instance, the polynomial chaos approach.36 Intrinsic kriging revolves around the process of differences, which is approximately de‐trended if the mean varies slowly with distance.37 The main point of all these variants of kriging is in the way of formulating or estimating the mean of the process.The estimation of kriging models often involves two subsequent steps: the first one is devoted to estimating the mean, which may depend on covariates, and the second one is devoted to estimating the correlation structure parameters, based on the residual or de‐trended process that results from the first step. This procedure is known as residual kriging.38 When the mean model is complicated, Kaufman et al.39 advocate for separating the estimation of trend and covariance structures. 
The residual process is indeed hard to model in terms of correlations.40 Some kriging applications may actually fail due to little spatial regularities left in the process after accounting for covariates.41 By converse, missing covariates can be thought of as a source of residual correlation, which leaves room for spatio‐temporal modeling.42 Similar issues may arise when attempting to estimate both spatial and temporal correlations. One may thus generalize Kaufman et al.'s approach to separate the estimation of dependence for each domain, being it either space, time or covariates, thus borrowing predictive power from all the domains.43The scalability of GPs is concerning in the case of large datasets. Accessible reviews are available on this topic, focusing on big data applications.44 All scalable GPs generally involve approximating the information that could be extracted from a full dataset, provided a sufficient computational power. Data subsetting, partitioning,45 or subsampling46 involve retaining only a fraction of training data points: moving kriging,47 as a local estimation method, belongs to this category. Related but distinct are local approximations such as Markov assumptions,48 which are based on selecting a smaller training dataset that is relevant for a specific prediction task. Covariance tapering18 involves bounded support kernels49 that induce sparsity in large correlation matrices, which are then more manageable both memory‐ and compute‐ wise. Other algebraic approximations allow for parsimonious, low‐rank representations for large matrices: for instance, separability assumptions involve writing large matrices as a Kronecker product of multiple smaller matrices, which are easier to invert.7Computational concerns make it difficult to deploy an otherwise interesting, functional variant of kriging called GP regression (GPR).50 However, even plain GPs can be troubled within the same setting. The aforementioned approximations can be useful, but they require tuning of some sort. For instance, in Markov modeling, the extent of Markov neighborhoods is a tuning aspect of the problem.The focus of this article is on separability assumptions, which may fit into the category of parsimonious, low‐rank approximations for spatio‐temporal correlation matrices.17 Separability assumptions arise naturally in multivariate normal modeling. In the literature, one may find this model stated in terms of a normally distributed random vector with covariance matrices in Kronecker form, which come in handy even in problems with fewer data than the one we present here.42 An iterative procedure attributed to Arendt et al.51 allows to estimate the correlation structures in time and space domains. This technique consists in estimating the spatial correlation of data that have been temporally decorrelated, then estimating the temporal correlation of data that have been (by converse) spatially decorrelated; the two steps are iterated until convergence. What we highlight in our contribution is that the practitioner may find it useful to adopt some suitable time series models19 for which the matrix transformations need not be evaluated explicitly, as it may be cumbersome. The advantages of separability have not been exploited in full yet. 
A likelihood function for separable kriging models in many domains has already been considered and maximized with iterative procedures, though without leveraging convenient correlation models.52Separable models have already been addressed using CLs, which are nongenuine likelihoods based on working independence assumptions.53 Pairwise CLs help avoid matrix inversion.54 Basically, a CL allows to make inference even based on under‐specified models, on trusted assumptions, thus making the analysis more robust55 to model misspecification. Furthermore, under a separable spatio‐temporal model, temporal and spatial parameters can be estimated separately for added robustness with respect to each other's misspecification.56 The composite estimator provides stronger consistency guarantees in this sense.57 The potential of composite estimation has also been investigated in terms of the other kinds of robustness that are common in statistics, with respect to outliers and data contamination.58 Given these interesting results, the present article aims at presenting a composite estimation approach to the reader, as it promises to make kriging applications more computationally sustainable and, hopefully, robust.MODEL SPECIFICATIONWe deal with the case of data that are both spatially and temporally referenced, so we use a space index s$$ s $$ and a time index t$$ t $$. The space index s$$ s $$ takes its value in 𝒮={1,…,S} and is a pointer to one out of S$$ S $$ locations in space, while the time index t$$ t $$ takes its value in 𝒯={1,…,T} and is a pointer to one out of T$$ T $$ time frames. A joint index ts$$ ts $$ denotes location s$$ s $$ at time t$$ t $$. Since dealing with a constant sampling rate, we consider a discrete‐time system with equispaced time frames. In the long run, it holds T≫S$$ T\gg S $$.Let ds,s′$$ {d}_{s,{s}^{\prime }} $$ be a spatial distance, defined for all pairs of locations s,s′∈𝒮, thus endowed with non‐negativity, symmetry and triangle inequality. Distance is ordinarily evaluated along straight lines in the absence of physical obstacles; otherwise, the length of the shortest path is considered. We choose the Euclidean distance for this purpose, namely,1ds,s′=(xs−xs′)2+(ys−ys′)2,$$ {d}_{s,{s}^{\prime }}=\sqrt{{\left({\mathtt{x}}_s-{\mathtt{x}}_{s^{\prime }}\right)}^2+{\left({\mathtt{y}}_s-{\mathtt{y}}_{s^{\prime }}\right)}^2}\kern0.3em , $$where (xs,ys)$$ \left({\mathtt{x}}_s,{\mathtt{y}}_s\right) $$ and (xs′,ys′)$$ \left({\mathtt{x}}_{s^{\prime }},{\mathtt{y}}_{s^{\prime }}\right) $$ are the two‐dimensional Cartesian coordinates of s,s′∈𝒮, respectively. Temporal distance is defined analogously59 for all pairs of t,t′∈𝒯 as2dt,t′=|t−t′|,$$ {d}_{t,{t}^{\prime }}=\mid t-{t}^{\prime}\mid \kern0.3em , $$with |·|$$ \mid \cdotp \mid $$ denoting the absolute value. 
Here, ds,s′$$ {d}_{s,{s}^{\prime }} $$ is the generic (s,s′)$$ \left(s,{s}^{\prime}\right) $$th entry of the S×S$$ S\times S $$ spatial distance matrix dS$$ {d}_S $$ and dt,t′$$ {d}_{t,{t}^{\prime }} $$ is the generic (t,t′)$$ \left(t,{t}^{\prime}\right) $$th entry of the T×T$$ T\times T $$ temporal distance matrix dT$$ {d}_T $$, respectively.The observed data y$$ y $$ are structured as follows:3y=y11…y1S⋮yts⋮yT1…yTS,$$ y=\left[\begin{array}{ccc}{y}_{11}& \dots & {y}_{1S}\\ {}\vdots & {y}_{ts}& \vdots \\ {}{y}_{T1}& \dots & {y}_{TS}\end{array}\right]\kern0.3em , $$so the data related to location s$$ s $$ are all stored in the same s$$ s $$th column, while those related to the time frame t$$ t $$ are all stored in the same t$$ t $$th row. As new data are observed, they are appended to y$$ y $$ as a new row. The data y$$ y $$ are modeled by the random matrix Y$$ Y $$ and the mean matrix μ$$ \mu $$ with the same number of rows and columns of y$$ y $$.Let vec$$ \mathrm{vec} $$ be a unary operator defined for matrices that stacks their columns into a single vector.60 Y$$ Y $$ is assumed to be a multivariate normal with scale parameter σ>0$$ \sigma >0 $$ and correlation matrix R$$ R $$, in the sense that R$$ R $$ is the correlation matrix of vec(Y)$$ \mathrm{vec}(Y) $$. More formally, we assume that Y$$ Y $$ has the following density function.4f(y;μ,σ,R)=1|2πσ2R|·exp−12vecy−μσ⊤R−1vecy−μσ,$$ f\left(y;\mu, \sigma, R\right)=\frac{1}{\sqrt{\mid 2\pi {\sigma}^2R\mid }}\cdotp \exp \left\{-\frac{1}{2}\mathrm{vec}{\left(\frac{y-\mu }{\sigma}\right)}^{\top }{R}^{-1}\kern0.3em \mathrm{vec}\left(\frac{y-\mu }{\sigma}\right)\right\}\kern0.3em , $$where |·|$$ \mid \cdotp \mid $$ is the determinant of a square matrix. In kriging‐related applications, R$$ R $$ is a positive definite square matrix that depends on spatial and temporal distances through spatial and temporal autocorrelation functions discussed further in this section. The definition in use throughout this article will be given later in Equation (11).We assume that the expected value of Y$$ Y $$ at time t$$ t $$ and location s$$ s $$, denoted by μts$$ {\mu}_{ts} $$, is a smooth function of t$$ t $$ and is shared across locations, so it makes sense to estimate it with the asymmetric moving average mt$$ {m}_t $$ defined below, which pools data from across all the observed locations, limited to w$$ w $$ time frames preceding t$$ t $$. The trend μts$$ {\mu}_{ts} $$ is thus estimated via μ^ts$$ {\hat{\mu}}_{ts} $$, defined as5μ^ts=mt,for alls,withmt=1S·w∑s=1S∑i=1wY(t−i)s.$$ {\hat{\mu}}_{ts}={m}_t,\kern1em \mathrm{for}\ \mathrm{all}\kern0.3em s,\mathrm{with}\kern1em {m}_t=\frac{1}{S\cdotp w}\sum \limits_{s=1}^S\sum \limits_{i=1}^w{Y}_{\left(t-i\right)s}\kern0.3em . $$Here, Y(t−i)s$$ {Y}_{\left(t-i\right)s} $$ is the process at time t−i$$ t-i $$ and location s$$ s $$. We set w$$ w $$ equal to the number of observations per sensor in 24 h to remove the observed daily seasonality. The latest estimate available for μts$$ {\mu}_{ts} $$ can also serve as an estimate for future trend μt′s$$ {\mu}_{t^{\prime }s} $$, t′>t$$ {t}^{\prime }>t $$, assuming stability in the short term. 
Such an assumption can be credible when the univariate time series may agree on a single latent factor ruling all of them.The parameter σ>0$$ \sigma >0 $$ contributes only to making predictions probabilistically calibrated,61 because it is just a scale parameter, like the error standard deviation in classical linear regression, thus involved in prediction variance but not in mean predictions; see Appendix A.We resort to the classical kriging approach, which belongs to frequentist statistics, but this methodology also has a Bayesian counterpart involving a prior distribution on parameters.62 As per kriging approach, we assume that correlations between components of Y$$ Y $$ are stationary and thus depend on their distances in space and time. The covariance between any two components of Y$$ Y $$, say, Yts$$ {Y}_{ts} $$ and Yt′s′$$ {Y}_{t^{\prime }{s}^{\prime }} $$, is modeled as6cov(Yts,Yt′s′)=σ2·corS(ds,s′)·corT(dt,t′),$$ \operatorname{cov}\left({Y}_{ts},{Y}_{t^{\prime }{s}^{\prime }}\right)={\sigma}^2\cdotp {\mathrm{cor}}_S\left({d}_{s,{s}^{\prime }}\right)\cdotp {\mathrm{cor}}_T\left({d}_{t,{t}^{\prime }}\right)\kern0.3em , $$where corS(·)$$ {\mathrm{cor}}_S\left(\cdotp \right) $$ is the spatial autocorrelation function (ACF),63 while corT(·)$$ {\mathrm{cor}}_T\left(\cdotp \right) $$ is the temporal ACF;19 few examples of definition for corS(·)$$ {\mathrm{cor}}_S\left(\cdotp \right) $$ and corT(·)$$ {\mathrm{cor}}_T\left(\cdotp \right) $$ are provided in the next paragraph. The product between spatial and temporal correlations is implied by the separability assumption. ACFs depending on distances and not directions are implied by an isotropy assumption. Both separability and isotropy can simplify modeling and computing.1,2,6,7,64For the sake of illustration, we recall few possible definitions for the spatial ACF corS(·)$$ {\mathrm{cor}}_S\left(\cdotp \right) $$ and the temporal ACF corT(·)$$ {\mathrm{cor}}_T\left(\cdotp \right) $$. The spatial ACF can be, for instance, one of the following:13,16,59Matérn ACF657corS(d)=21−αΓ(α)(d/λ)αKα(d/λ),$$ {\mathrm{cor}}_S(d)=\frac{2^{1-\alpha }}{\Gamma \left(\alpha \right)}{\left(d/\lambda \right)}^{\alpha }{\mathrm{K}}_{\alpha}\left(d/\lambda \right)\kern0.3em , $$with Γ(·)$$ \Gamma \left(\cdotp \right) $$ the gamma function and Kα(·)$$ {\mathrm{K}}_{\alpha}\left(\cdotp \right) $$ the modified Bessel function of the second kind; α>0$$ \alpha >0 $$ is a smoothing parameter, λ>0$$ \lambda >0 $$ is a range parameter.Power exponential ACF18corS(d)=exp{−(d/λ)β}.$$ {\mathrm{cor}}_S(d)=\exp \left\{-{\left(d/\lambda \right)}^{\beta}\right\}\kern0.3em . 
$$Here, β>0$$ \beta >0 $$ is a smoothing parameter, λ>0$$ \lambda >0 $$ is a range parameter.Both Matérn and power exponential ACFs include two notable sub‐cases:When α=1/2$$ \alpha =1/2 $$ and β=1$$ \beta =1 $$, the exponential ACF is implied.6When α→∞$$ \alpha \to \infty $$ and β=2$$ \beta =2 $$, the Gaussian ACF is implied,1,66 which is also known as squared‐exponential ACF3,4,67 and involved in the double‐stable model.2The temporal ACF corT(·)$$ {\mathrm{cor}}_T\left(\cdotp \right) $$ in discrete time can be, for instance, one of the following:19ACF of a stationary AR of order 19corT(d)=ϕ|d|,$$ {\mathrm{cor}}_T(d)={\phi}^{\mid d\mid}\kern0.3em , $$where ϕ∈]−1,+1[$$ \phi \in \kern0.3em \left]-1,+1\right[ $$ is the correlation parameter.ACF of an invertible moving‐average model of order 110corT(d)=1,d=0,η,d=±1,0,otherwise,$$ {\mathrm{cor}}_T(d)=\left\{\begin{array}{cc}1,& d=0,\\ {}\eta, & d=\pm 1,\\ {}0,& \mathrm{otherwise},\end{array}\right. $$where η∈]−1/2,+1/2[$$ \eta \in \kern0.3em \left]-1/2,+1/2\right[ $$ is the correlation parameter.More complicated ACFs are possible when looking at more flexible time series models, such as the multiplicative seasonal AR models that are used in the empirical application presented later. Some approaches assume weak or no correlation structure, like empirical kriging.15 These approaches are necessarily less scalable but may still work for suitably targeted tasks.Let corS(·)$$ {\mathrm{cor}}_S\left(\cdotp \right) $$ and corT(·)$$ {\mathrm{cor}}_T\left(\cdotp \right) $$ be vectorized functions, that is, they transform matrices in an entry‐wise fashion. Then, RS=corS(dS)$$ {R}_S={\mathrm{cor}}_S\left({d}_S\right) $$ will be a spatial correlation matrix and RT=corT(dT)$$ {R}_T={\mathrm{cor}}_T\left({d}_T\right) $$ a temporal correlation matrix. We call11R=RS⊗RT,$$ R={R}_S\otimes {R}_T\kern0.3em , $$the spatio‐temporal correlation matrix, this expression denoting also a Kronecker correlation structure due to the Kronecker product ⊗$$ \otimes $$.Both temporal and spatial ACFs can be modified to account for noisy data by including a so‐called nugget effect.1,65 This means that the spatial ACF corS(d)$$ {\mathrm{cor}}_S(d) $$ and the temporal ACF corT(d)$$ {\mathrm{cor}}_T(d) $$ are multiplied by βS$$ {\beta}_S $$ and βT$$ {\beta}_T $$ when d>0$$ d>0 $$, respectively, the parameters βS$$ {\beta}_S $$ and βT$$ {\beta}_T $$ taking values in the interval ]0,1]$$ \left]0,1\right] $$.15 We refer to 1−βS$$ 1-{\beta}_S $$ and 1−βT$$ 1-{\beta}_T $$ as the nugget parameters, in space and time domains, respectively. Some authors apply the nugget directly to the spatio‐temporal covariance function, but this will break up separability.64,68 The latter way of modeling is more natural in the case of additive covariance models.69INFERENTIAL ASPECTSThis section presents two strategies that allow to perform estimation and prediction under a separable model with a low computational cost. In particular, we base estimation on a novel CL approach. Then, leveraging the peculiar expression of the kriging mean formula, we show how to compute predictions efficiently under separability.EstimationThe distribution of Y$$ Y $$ in Equation (4) is assumed to be indexed by μ,σ,R$$ \mu, \sigma, R $$. As per residual kriging, μ$$ \mu $$ is replaced with its estimate μ^$$ \hat{\mu} $$ via Equation (5). The parameters left to be estimated are θ=(σ,ψ⊤)⊤$$ \theta ={\left(\sigma, {\psi}^{\top}\right)}^{\top } $$, with ψ$$ \psi $$ the correlation parameters ruling R$$ R $$. 
Moreover, ψ$$ \psi $$ can be partitioned as ψ=(ψS⊤,ψT⊤)⊤$$ \psi ={\left({\psi}_S^{\top },{\psi}_T^{\top}\right)}^{\top } $$, where ψS$$ {\psi}_S $$ and ψT$$ {\psi}_T $$ are the spatial and temporal correlation parameters, respectively. In particular, RS$$ {R}_S $$ depends only on ψS$$ {\psi}_S $$, while RT$$ {R}_T $$ depends only on ψT$$ {\psi}_T $$. As a starting point, we consider the pseudo likelihood function ℒ(θ)$$ \mathcal{L}\left(\theta \right) $$, defined as12ℒ(θ)=ℒ(θ;y˜)=f(y˜;0,σ,R),$$ \mathcal{L}\left(\theta \right)=\mathcal{L}\left(\theta; \tilde{y}\right)=f\left(\tilde{y};0,\sigma, R\right)\kern0.3em , $$with y˜=y−μ^$$ \tilde{y}=y-\hat{\mu} $$ the de‐trended data. The maximizer of ℒ(θ)$$ \mathcal{L}\left(\theta \right) $$ with respect to θ$$ \theta $$ is referred to as the ML estimate. Under the separability assumption in Equation (11), it is possible to evaluate ML efficiently as illustrated in Appendix C. The ML approach is the natural benchmark for any other estimator, due to its asymptotic efficiency. Nevertheless, robustness concerns and computational issues may arise in practical data analysis, for which the following CL estimation approach can be more suited.Let the sample correlation matrix be defined as rank‐1 matrix13M^=1σ^2vec(y−μ^)vec(y−μ^)⊤,$$ \hat{M}=\frac{1}{{\hat{\sigma}}^2}\kern0.3em \mathrm{vec}\left(y-\hat{\mu}\right)\kern0.3em \mathrm{vec}{\left(y-\hat{\mu}\right)}^{\top}\kern0.3em , $$where σ^$$ \hat{\sigma} $$ is a working estimate of σ$$ \sigma $$, defined as14σ^=1TS∑s=1S∑t=1T(yts−μ^ts)2.$$ \hat{\sigma}=\sqrt{\frac{1}{TS}\sum \limits_{s=1}^S\sum \limits_{t=1}^T{\left({y}_{ts}-{\hat{\mu}}_{ts}\right)}^2}\kern0.3em . $$The likelihood function is thus replaced with a so‐called pseudo likelihood ℒp(ψ)$$ {\mathcal{L}}^p\left(\psi \right) $$, defined as follows, that allows to make inference on ψ$$ \psi $$ alone.65,7015ℒp(ψ)=|R|−1/2exp−12trR−1M^.$$ {\mathcal{L}}^p\left(\psi \right)={\left|R\right|}^{-1/2}\exp \left\{-\frac{1}{2}\mathrm{tr}\left({R}^{-1}\hat{M}\right)\right\}\kern0.3em . $$Here, tr(·)$$ \mathrm{tr}\left(\cdotp \right) $$ is the trace operator.Kriging is often cumbersome due to the inversion of R$$ R $$, and actually the pseudo likelihood in Equation (15) is intractable with high‐dimensional data. Separability reduces the dimensionality of the problem, as it holds16R−1=(RS⊗RT)−1=RS−1⊗RT−1,$$ {R}^{-1}={\left({R}_S\otimes {R}_T\right)}^{-1}={R}_S^{-1}\otimes {R}_T^{-1}\kern0.3em , $$so two smaller inverse matrices must be computed instead of a single and larger one. However, as T≫S$$ T\gg S $$, inverting RT$$ {R}_T $$ alone can also be difficult.Our proposal is two‐fold. First, a marginal CL approach56 can be used, exploiting separability more in depth so that the tasks of estimating ψS$$ {\psi}_S $$ and ψT$$ {\psi}_T $$ can even be addressed separately. Second, a suitable time series model can help handle the temporal correlation implicitly, as RT$$ {R}_T $$ must not be evaluated at all, and make it possible to address tall data and high sampling rates.CLs are known in spatial statistics mainly as tools that simplify estimation and inference, like the pairwise likelihood,53 and they can also be used in model selection.71 CLs allow to make inference on under‐specified models but, even in the case of fully specified models, like in kriging, a suitable CL can reduce the computational cost of estimation. 
Our proposal is two-fold. First, a marginal CL approach56 can be used, exploiting separability more deeply, so that the tasks of estimating $\psi_S$ and $\psi_T$ can be addressed separately. Second, a suitable time series model can handle the temporal correlation implicitly, so that $R_T$ need not be evaluated at all, making it possible to address tall data and high sampling rates.

CLs are known in spatial statistics mainly as tools that simplify estimation and inference, like the pairwise likelihood,53 and they can also be used in model selection.71 CLs allow inference on under-specified models but, even in the case of fully specified models, as in kriging, a suitable CL can reduce the computational cost of estimation. The estimator based on the full model can be computed in some cases, but it would generally be cumbersome with large datasets, absent the convenient temporal models discussed here. We demonstrate the tractability of our estimator by carrying out inference based on the bootstrap. Cheaper computation comes at a price, since the estimator is naturally sub-optimal with respect to the ML estimator. Nonetheless, the loss of efficiency may not be significant when dealing with high-frequency data.

With CLs, as with any so-called pseudo likelihood, the estimation variance cannot be assessed as with classical likelihood functions, based on the Hessian matrix. Parametric bootstrap can be used72 in the case of kriging because the model is fully specified, so one can simulate datasets from it. Simulating datasets is straightforward under separability and some temporal models. We detail a bootstrap strategy in Appendix B.

Within a single time frame $t \in \mathcal{T}$, it holds that $R = R_S$, as per Equation (11). Let the spatial pseudo CL be

$$ \mathcal{L}_S(\psi_S) = \prod_{t \in \mathcal{T}} f(\tilde{y}_t; 0, \hat{\sigma}, R_S) = |R_S|^{-T/2} \exp\left\{-\frac{T}{2} \mathrm{tr}\left(R_S^{-1} \hat{M}_S\right)\right\}, \tag{17} $$

with $\tilde{y}_t$ the de-trended data vector at time frame $t$. This allows inference on the spatial correlation parameters $\psi_S$ alone. The expression is the same as that of a "small blocks" marginal CL.56 Here, $\hat{M}_S$ is the sample correlation matrix between the univariate time series at distinct locations, defined as

$$ \hat{M}_S = \frac{1}{T \hat{\sigma}^2} (y - \hat{\mu})^{\top} (y - \hat{\mu}). \tag{18} $$

The estimator $\hat{\psi}_S$ of the spatial correlation parameters $\psi_S$ is defined as the maximizer of (17). If $R_S$ is a smooth function of $\psi_S$, the estimator can be seen as the solution of a system of estimating equations, the equation for the generic parameter $\alpha \in \psi_S$ being

$$ \alpha:\quad \mathrm{tr}\left\{\left(R_S - \hat{M}_S\right) \frac{\partial R_S^{-1}}{\partial \alpha}\right\} = 0. \tag{19} $$

The above equation looks like a weighted average of the equations from the over-determined system $R_S - \hat{M}_S = 0$. It is worth noticing the close resemblance between the pseudo likelihoods in Equations (15) and (17), but the correlation matrices involved are strikingly different in size, as $R_S$ is much smaller than $R$ in practical applications.
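As an illustration, maximizing Equation (17) reduces to a low-dimensional optimization once $\hat{M}_S$ has been computed. The R sketch below does so for the exponential ACF with nugget; Y_c, d_S, and sigma2_hat are assumed given, and the parameterization $\psi_S = (\text{nugget}, \text{range})$ is a choice of ours for the example.

    neg_log_cl_S <- function(psi, M_S, d_S, T) {
      R_S <- ifelse(d_S > 0, (1 - psi[1]) * exp(-d_S / psi[2]), 1)
      # minus the log of Equation (17), up to additive constants
      as.numeric(0.5 * T * (determinant(R_S)$modulus + sum(diag(solve(R_S, M_S)))))
    }
    M_S <- crossprod(Y_c) / (nrow(Y_c) * sigma2_hat)   # Equation (18)
    fit <- optim(c(0.2, 10), neg_log_cl_S, M_S = M_S, d_S = d_S, T = nrow(Y_c),
                 method = "L-BFGS-B", lower = c(1e-6, 1e-3), upper = c(1 - 1e-6, Inf))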
The temporal correlation parameters $\psi_T$ can be estimated analogously to $\psi_S$, by defining a temporal pseudo CL $\mathcal{L}_T$ as

$$ \mathcal{L}_T(\psi_T) = \prod_{s \in \mathcal{S}} f(\tilde{y}_s; 0, \hat{\sigma}, R_T) = |R_T|^{-S/2} \exp\left\{-\frac{S}{2} \mathrm{tr}\left(R_T^{-1} \hat{M}_T\right)\right\}, \tag{20} $$

with $\tilde{y}_s$ the de-trended univariate time series at location $s$. Here, $\hat{M}_T$ is the sample temporal correlation matrix, defined as

$$ \hat{M}_T = \frac{1}{S \hat{\sigma}^2} (y - \hat{\mu}) (y - \hat{\mu})^{\top}. \tag{21} $$

This operation completely disentangles the estimators of the spatial and temporal parameters from each other. As contrasted to $R_S$, $R_T$ is high-dimensional, so the need for a convenient correlation structure in the time domain is even more pressing, to simplify the estimation of $\psi_T$, which would otherwise rely on directly maximizing (20). In particular, we resort to an AR model with multiple overlapping seasonal lags.

For its definition, the classical lag operator $B$ can be introduced, such that

$$ B Y_{ts} = Y_{(t-1)s}, \tag{22} $$

and, more generally, $B^{\Delta} Y_{ts} = Y_{(t-\Delta)s}$, where $Y_{(t-\Delta)s}$ is the process $Y$ at time $t - \Delta$ and location $s$. Then, letting $\epsilon_s$ be a white noise time series process, the multiplicative seasonal AR model is defined as follows:19

$$ \left[\prod_{k=1}^{K} \left(1 - \phi_k B^{\Delta_k}\right)\right] \cdot (Y_s - \mu_s) = \epsilon_s, \tag{23} $$

where $\Delta_1, \dots, \Delta_k, \dots, \Delta_K$ are the structural lags, and $\phi_1, \dots, \phi_k, \dots, \phi_K \in \left]-1,+1\right[$ the model coefficients.

Equation (23) reduces the temporal correlation parameters to $\psi_T = (\phi_1, \dots, \phi_K)^{\top}$. AR modeling also makes it easy to approximately maximize the likelihood by minimizing the conditional sum of squares,19 which can be attained via the coordinate descent method.73 In this sense, we define

$$ V_s^{-h} = \left[\prod_{k \neq h} \left(1 - \phi_k B^{\Delta_k}\right)\right] \cdot (Y_s - \mu_s), \tag{24} $$

and, by Equation (23), it holds that

$$ \left(1 - \phi_h B^{\Delta_h}\right) V_s^{-h} = \epsilon_s, \tag{25} $$

so the transformed time series $V_s^{-h}$ satisfies

$$ \phi_h = \mathrm{cor}\left(V_s^{-h}, B^{\Delta_h} V_s^{-h}\right). \tag{26} $$

This relation motivates an iterative procedure that loops over the correlation parameters and repeats until convergence. For each $\phi_h$, one evaluates the current estimate of the process $V_s^{-h}$ and then updates $\phi_h$ as the sample ACF of $V_s^{-h}$ at lag $\Delta_h$; a compact sketch of this loop closes the subsection.

An online learning approach may be considered as an alternative. For instance, estimates can be updated continually via batch learning14 or via exponentially weighted moving means and covariances,74,75 which require additional tuning.
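A compact R version of the coordinate descent just described might read as follows; the filtering helper and the use of cor() as the sample ACF at the given lag are simplifications of ours, and x is a single de-trended series with structural lags Delta, for example c(1, 144, 1008) under an assumed 10-min time step.

    fit_seasonal_ar <- function(x, Delta, tol = 1e-6, max_iter = 100) {
      K <- length(Delta)
      phi <- numeric(K)
      # applies one factor (1 - phi_k B^d) of Equation (23) to a series z
      apply_factor <- function(z, phi_k, d) {
        n <- length(z)
        out <- rep(NA_real_, n)
        out[(d + 1):n] <- z[(d + 1):n] - phi_k * z[1:(n - d)]
        out
      }
      for (iter in 1:max_iter) {
        phi_old <- phi
        for (h in 1:K) {
          v <- x                                   # build V^{-h} of Equation (24)
          for (k in setdiff(1:K, h)) v <- apply_factor(v, phi[k], Delta[k])
          n <- length(v)                           # update via Equation (26)
          phi[h] <- cor(v[(Delta[h] + 1):n], v[1:(n - Delta[h])],
                        use = "complete.obs")
        }
        if (max(abs(phi - phi_old)) < tol) break
      }
      phi
    }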
Prediction

Separability assumptions also simplify prediction, as we show in this section. To this end, we must expand our notation. Let $Y'$ refer to unobserved locations or times, with mean matrix $\mu'$, while the previously introduced matrices $Y$ and $\mu$ now relate to the available data $y$. The correlation matrix of $\mathrm{vec}(Y')$ is $R'$, whereas $R$ is the correlation matrix of $\mathrm{vec}(Y)$. The cross-correlation matrix between $\mathrm{vec}(Y')$ and $\mathrm{vec}(Y)$ is generally not square; it is denoted by $\rho$ and is based on a suitable joint distance matrix. Like $R$ in Equation (11), $R'$ and $\rho$ are also defined in terms of Kronecker products as, respectively,

$$ R' = R_S' \otimes R_T', \qquad \rho = \rho_S \otimes \rho_T, \tag{27} $$

where $R_S'$ and $R_T'$ are the spatial and temporal correlation matrices of $Y'$, analogous to $R_S$ and $R_T$, while $\rho_S$ and $\rho_T$ are the spatial and temporal cross-correlation matrices between $Y'$ and $Y$. These in turn depend on suitable spatial and temporal distance matrices. As is typical in kriging, we treat $Y$ and $Y'$ as jointly normally distributed.

Kriging revolves around the conditional distribution of $Y'$ given $Y = y$,1 though it was originally motivated as the linear unbiased prediction that is optimal with respect to the squared prediction error criterion.76 Conditional on $Y = y$, the distribution of $Y'$ is multivariate normal. The conditional mean matrix of $Y'$, denoted by $\hat{y}'$, and the conditional variance-covariance matrix of $\mathrm{vec}(Y')$, denoted by $R'_{\mathrm{cond}}$, satisfy the following conditions:

$$ \mathrm{vec}(\hat{y}') = \mathrm{vec}(\mu') + \rho R^{-1} \mathrm{vec}(y - \mu), \qquad R'_{\mathrm{cond}} = R' - \rho R^{-1} \rho^{\top}. \tag{28} $$

Let $\beta_S$ and $\beta_T$ be spatial and temporal regression coefficients, defined as

$$ \beta_S = \rho_S R_S^{-1}, \qquad \beta_T = \rho_T R_T^{-1}; \tag{29} $$

then the kriging mean formula in Equation (28) can be written as

$$ \hat{y}' - \mu' = \hat{z} \beta_S^{\top}, \tag{30} $$

where $\hat{z}$ is a matrix of centered temporal forecasts only, defined as

$$ \hat{z} = \beta_T (y - \mu). \tag{31} $$

We prove this fact in Appendix A.1. The proof also covers existing applications of multivariate normal models.77

The above result allows for at least two interesting uses. First, it enables distributed calculus in kriging. Indeed, spatio-temporal prediction under separability can be carried out in two steps. The first step, Equation (31), is temporal forecasting only within sensor locations. The second step, Equation (30), is the spatial interpolation of such forecasts at the needed locations. Spatio-temporal predictions can thus be seen as spatial interpolations of temporal forecasts, so a separability assumption separates the domains in prediction, too. Distributed calculus can be used for evaluating Equation (31), as each univariate time series is transformed separately by the matrix product.

This insight into the meaning of the matrix quantities cannot be found in other works,77 as they lack the interpretation of $\beta_S$ and $\beta_T$. Moreover, an automatic application of their findings would be too compute-intensive with large datasets, because it would require evaluating $\beta_T$ explicitly.
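In R, the two steps and their agreement with the direct formula in Equation (28) can be sketched as follows, on toy matrices rho_S, rho_T, R_S, R_T and de-trended data Y_c assumed to be available from earlier sketches.

    beta_S <- rho_S %*% solve(R_S)      # Equation (29)
    beta_T <- rho_T %*% solve(R_T)
    z_hat <- beta_T %*% Y_c             # Equation (31): temporal forecasts per location
    y_hat <- z_hat %*% t(beta_S)        # Equation (30): spatial interpolation, up to mu'

    # direct (and much more expensive) evaluation of Equation (28) for comparison
    y_vec <- (beta_S %x% beta_T) %*% as.vector(Y_c)
    max(abs(y_vec - as.vector(y_hat)))  # numerically zero, by Roth's column lemma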
Correlation models affect prediction only through $\beta_S$ and $\beta_T$. Hence, any specification of the time or space model can be employed, provided it implies tractable prediction. This view motivates, for instance, the use of general interpolators, state-space models,78 or integrated AR time series models,19 which allow for simple prediction since $\beta_T$ in Equation (29) is sparse and explicit. The rather general autoregressive moving-average (ARMA) model has already been considered by some authors,79 though the MA component makes the model harder to estimate.

In applications, one may consider adding covariates to the analysis. Assume that there are $p$ variables indexed by $j = 1, \dots, p$, so that $Y_{jts}$ denotes the $j$th variable at spatio-temporal coordinates $ts$. In this case, the marginal means $\mu_{jts}$ will likely depend on $j$. The correlation structure can be simplified according to a fully factored model,7 such that

$$ \mathrm{cov}\left(Y_{jts}, Y_{j't's'}\right) = \gamma_{j,j'} \cdot \mathrm{cor}_S\left(d_{s,s'}\right) \cdot \mathrm{cor}_T\left(d_{t,t'}\right). \tag{32} $$

Here, the new quantity $\gamma_{j,j'}$ represents the cross-sectional covariance between the $j$th and $j'$th variables at the same time and location. Thus, under a fully factored model, our kriging mean formula scales easily, as an additional set of regression coefficients $\beta_C$ is defined besides $\beta_S$ and $\beta_T$, implied by the newly added domain of covariates.

We have provided a simple expression for mean predictions, along with some optimization strategies. For the sake of completeness, we now illustrate how to compute prediction variances. In our view, the model can be used to design a prediction rule, resulting in mean prediction, while the prediction variance is a performance measure, which can, for instance, be evaluated on a test set. This approach may be favored when the focus is especially on prediction.

Let $V$ be a matrix of the same size as $Y'$ with the entry-wise variances of $Y'$ conditional on $Y = y$, and let $\mathrm{diag}(M)$ be the operator that returns the diagonal entries of a square matrix $M$ as a column vector. Then, $\mathrm{vec}(V) = \sigma^2 \mathrm{diag}(R'_{\mathrm{cond}})$, so it holds that

$$ V = \sigma^2 \left\{\mathrm{diag}(R_T')\, \mathrm{diag}(R_S')^{\top} - \mathrm{diag}\left(R_T' - R'_{T,\mathrm{cond}}\right) \mathrm{diag}\left(R_S' - R'_{S,\mathrm{cond}}\right)^{\top}\right\}, \tag{33} $$

where $R'_{S,\mathrm{cond}}$ and $R'_{T,\mathrm{cond}}$ are defined as follows:

$$ R'_{S,\mathrm{cond}} = R_S' - \rho_S R_S^{-1} \rho_S^{\top}, \qquad R'_{T,\mathrm{cond}} = R_T' - \rho_T R_T^{-1} \rho_T^{\top}. \tag{34} $$

We prove this result in Appendix A.2. Equation (33) has an advantage over the direct evaluation of (28), as the expression for $R'_{T,\mathrm{cond}}$ is simple to retrieve under some time series models. For instance, with stationary AR processes, in one-step-forward forecasting, $R'_{T,\mathrm{cond}}$ is the ratio between the variance of the innovation term and the marginal variance of the process.
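For small examples, Equations (33) and (34) translate directly into R; in the sketch below, R_S_new and R_T_new are our placeholder names for $R_S'$ and $R_T'$, and the remaining inputs are as in the prediction sketch above.

    R_S_cond <- R_S_new - rho_S %*% solve(R_S, t(rho_S))   # Equation (34)
    R_T_cond <- R_T_new - rho_T %*% solve(R_T, t(rho_T))
    # Equation (33): an outer product of per-domain diagonals, same size as Y'
    V <- sigma2 * (diag(R_T_new) %o% diag(R_S_new) -
                   diag(R_T_new - R_T_cond) %o% diag(R_S_new - R_S_cond))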
EMPIRICAL APPLICATION

We now illustrate the proposed methodology on the SAL dataset presented in the introduction. The dataset is publicly available on GitHub, as submitted by the authors of the first paper addressing it,9 at https://github.com/dslab-uniud/virtual-sensing. We performed all our analyses in the statistical computing environment R.80 Both the ML and CL estimators were given a custom implementation that minimizes the conditional sum of squares in the estimation of the temporal model and maximizes a spatial pseudo likelihood in the estimation of the spatial model. All the analyses were carried out using typical laptop computers.

Explorative analysis

The SAL dataset consists of temperature sensor readings from an office room and has been briefly described in the introductory section. The goal is to develop a spatio-temporal prediction rule for temperature based on these data. Some candidate spatial models are estimated on a training set, and the best one is selected based on a test set. The full dataset comprises 19 weeks of data, and it is partitioned accordingly, with the leading 8 weeks used for training and the trailing 11 weeks as the test set. This choice is challenging for our method, as it is more exposed to regime shifts, but a longer training phase might not be reasonable for some applications, where a monitoring system must be calibrated within a short amount of time.

Data were missing less than 1% of the time, as a result of communication faults unrelated to the measured values, and thus statistically at random. Kriging can handle data missing at random, but we used a simpler imputation method called last observation carried forward (LOCF), which imputes each missing value with the last valid reading.81 In fact, we require gridded data, and the LOCF approach solves the problem rapidly and efficiently (a minimal sketch closes this subsection), so we can focus on other aspects of the problem.

Figure 2 presents the training set. The sensors are numbered from 1 to 12 as in Figure 1. Troughs are concentrated in the mornings, when windows are opened and cold air flows into the room during routine cleaning. Peaks are concentrated around noon, as direct sunlight overheats the sensors facing south, namely those numbered 5, 10, and 11. Weekly trends are highlighted in Figure 3, with the temperature median and other percentiles reported across weekdays. Troughs seem to occur mostly on Mondays and Fridays, that is, on the first and last workdays of the week. Peaks instead concentrate on Fridays and weekends.

FIGURE 2 Univariate time series of temperature per sensor, training set (October through December). Peaks and troughs are thresholded and highlighted with dots.

FIGURE 3 Sample quantiles of temperature, aggregated by weekday, training set (October through December).

Similar conditions on subsequent days motivate the use of time series models with seasonal components, the period being one day long. Similar events occurring on the same weekday motivate one more seasonal component, whose period is 1 week long.

One may consider adding covariates to the model, in particular the physical variables mentioned in the introduction. Some preliminary analyses show that their explanatory power is limited, since spatial and temporal dependencies seem able to explain most of the variability of the temperature data. As a consequence, they will not be considered further.
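As anticipated, LOCF imputation amounts to a one-pass loop; the sketch below is a minimal version of ours, and packaged implementations such as zoo::na.locf offer the same service.

    locf <- function(x) {
      # carry the last valid reading forward; a leading NA is left as-is
      for (t in seq_along(x)[-1]) {
        if (is.na(x[t])) x[t] <- x[t - 1]
      }
      x
    }
    locf(c(21.3, NA, NA, 21.5))   # 21.3 21.3 21.3 21.5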
Model estimates

We compared our proposed CL estimator with the ML approach. The model chosen was separable, as discussed in the previous sections. We considered four candidate models in the space domain, namely, the exponential, Gaussian, power exponential, and Matérn ACFs. In the time domain, we considered a few candidate models, all multiplicative AR as in Equation (23), with a short-term lag $\Delta_1$ and two seasonal lags, $\Delta_2 = 1$ day and $\Delta_3 = 1$ week. We probed several alternative values for the short-term lag $\Delta_1$, ranging over 10, 20, 30, and 40 min, and 1, 2, 3, 4, 6, and 12 h. Longer $\Delta_1$ lags were of interest because they implied higher estimates of $\phi_2$ and $\phi_3$, so that one could leverage seasonal dynamics and spatial regularities. All the candidate spatio-temporal models were then compared on the test set in terms of their performance in the spatial interpolation of each sensor based on the other ones; see Figure 5 in the following paragraphs.

The estimation of all the spatio-temporal models via ML took no more than 10 min. The estimation of the spatial models via CL was essentially instantaneous by comparison, as the small size of the sample spatial correlation matrix made the computation fast. The estimation of the temporal models via CL takes a few seconds, namely about one-fourth of the time required by ML. The four spatial ACFs estimated with either ML or CL appear very similar. In Figure 4, this similarity is illustrated based on the CL estimate alone.

FIGURE 4 Spatial ACFs estimated via CL (lines) with sample correlations (dots). See Table 1 for the estimated parameters.

Figure 5 summarizes the predictive performance of the estimated spatio-temporal models on the test set. The performance is assessed in terms of the spatial interpolation of each sensor, surrogated via all the others in turn. Three performance metrics are used, namely, the mean absolute prediction error (L1), the root mean square error (L2), and the 95th percentile of the absolute prediction error (P95). In comparing the ML and CL methods, the two are seen to provide model estimates that are rather interchangeable for predictive tasks. Sensor 10 is uniformly poorly surrogated by the other sensors, which marks its outlying nature in this example. This sensor is better accommodated by the CL estimate, in line with its higher degree of robustness. CL otherwise seems a good approximation to ML in general.

FIGURE 5 Performance in the spatial interpolation of each sensor based on the other ones, for ML and CL (linetype) with different metrics (labels, y-axis) depending on the short-term lag $\Delta_1$ (x-axis). Multiple lines refer to distinct spatial models, essentially overlapped. Metrics in use: mean absolute prediction error (L1), root mean square error (L2), and 95th percentile of the absolute prediction error (P95).
TABLE 1 Point estimates and standard errors for the spatial and temporal correlation parameters, case $\Delta_1 = 10$ min. Standard errors are multiplied by $10^3$.

Spatial model      | Parameter  | Est.   | Std. err. (×10³)
-------------------|------------|--------|-----------------
Exponential        | $\phi_1$   | 0.977  | 0.206
                   | $\phi_2$   | 0.078  | 0.950
                   | $\phi_3$   | 0.047  | 0.990
                   | Nugget     | 0.191  | 2.307
                   | Range      | 24.692 | 446.329
Gaussian           | $\phi_1$   | 0.977  | 0.203
                   | $\phi_2$   | 0.078  | 0.936
                   | $\phi_3$   | 0.047  | 0.975
                   | Nugget     | 0.250  | 2.470
                   | Range      | 15.011 | 140.078
Matérn             | $\phi_1$   | 0.977  | 0.206
                   | $\phi_2$   | 0.078  | 0.950
                   | $\phi_3$   | 0.047  | 0.991
                   | Nugget     | 0.187  | 4.805
                   | Range      | 25.993 | 1547.047
                   | Smoothness | 0.479  | 22.544
Power exponential  | $\phi_1$   | 0.977  | 0.205
                   | $\phi_2$   | 0.078  | 0.946
                   | $\phi_3$   | 0.047  | 0.986
                   | Nugget     | 0.217  | 3.057
                   | Range      | 19.883 | 424.822
                   | Smoothness | 1.312  | 28.654

Lastly, Table 1 summarizes the CL point estimates and standard errors for the four spatio-temporal models considered in the next section, with $\Delta_1 = 10$ min. The point estimates of the temporal parameters are shared across the four spatio-temporal models. Their standard errors are based on the parametric bootstrap, as outlined in Appendix B; the estimation variance does, instead, depend on the joint spatio-temporal model, so it differs across them. We carried out the bootstrap inference by simulating and analyzing 1000 datasets for each joint spatio-temporal model, which took about 3 h on a single laptop computer with the CL approach. The standard errors, all multiplied by $10^3$ in the table, are rather small, as expected given the large sample size.

Sensor network optimization

In the previous subsection, CL turned out to be capable of matching the efficiency of ML, with the help of a large training dataset. A practical concern is now the selection of a few operational sensors from the initial set, as twelve of them are too many for a room of 127 m². Under the proximity principle, some sensors could be dropped and their locations virtually sensed, since their data can be surrogated82 with predictions from the remaining sensors.

Different network configurations can be evaluated and compared according to a metric, which should reflect the priorities and objectives of the stakeholders. The 95th percentile of the absolute prediction errors on all but the active sensors9 can be used for an approximate minimax decision. For comparison, we illustrate a sensor selection based on this criterion alongside one that uses the more classical mean absolute prediction error. The performance of spatial models and sensor configurations is evaluated and compared on the test set. For each sensor configuration, we interpolate data from the selected locations to the unselected ones within time frames, as implied by separability. Prediction errors are then summarized according to the metrics. We perform selection in a forward fashion, by starting with the best performer alone and then adding, at each step, the sensor that leads to the best improvement; a sketch of this search follows below. Adding sensors can worsen the performance because the models are evaluated on the test set. Figure 6 summarizes the selection process; the sensor added at each step appears within a box and is numbered as in Figure 1. An alternative prediction is given by the simple mean, which assumes that a single latent temperature rules the whole room. The selection took no more than 10 min in total, so it would be easy to perform it multiple times ad interim to check on the quality of predictions.
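The greedy search just described can be sketched in R as follows; metric(sel) is a placeholder of ours for the test-set error obtained when the sensors in sel surrogate all the others.

    forward_select <- function(sensors, metric, k_max = 6) {
      sel <- integer(0)
      for (k in 1:k_max) {
        cand <- setdiff(sensors, sel)
        scores <- sapply(cand, function(s) metric(c(sel, s)))
        sel <- c(sel, cand[which.min(scores)])   # add the best improver
      }
      sel
    }
    # e.g., forward_select(1:12, metric) with the P95 or L1 metric of Figure 6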
FIGURE 6 Forward selection of sensors, distinct per metric, prediction error versus number of selected sensors. The sensor added at each step appears within a box. Run on the test set (December through March).

We show only the power exponential ACF. The result based on the Matérn function is very similar, while the exponential and Gaussian ones are slightly outperformed. The percentile performance seems in line with the k-NN and IDW benchmarks.9 Based on the performances in Figure 6, the power exponential ACF may be preferred over the mean prediction, as this choice seems more robust with respect to the metric. Moreover, the mean prediction yields some narrowly spaced sensor configurations under both metrics.

Behavior of the monitoring system

Using different sensor configurations, we show a few examples of actual situations and the behavior of a possible monitoring system. The environment in focus has some peculiarities that affect the performance of the system. As an illustration, we combine interpolation and forecasting by predicting temperatures 10 min forward for the whole floor plan, as in Figure 7.

FIGURE 7 Examples of prediction via four sensors, based on the power exponential model. Forecasts for February 17, 12:00, based on data from 10 min before. Circles are colored based on observed temperatures, the floor based on predicted temperatures; only the selected sensors are numbered, as in Figure 1. (A) Anomalous sensors, (B) regular sensors, (C) selection via 95th percentile of absolute error, and (D) selection via mean absolute error.

We anticipated that there would be differences between regular and anomalous sensors. Concerns relate to the sensors facing direct sunlight around noon (numbered 5 and 10) or close to other sources of anomalies (7 and 12). If these were used to make predictions, the results would hardly reflect the normalcy ruling the interior of the room; see Figure 7A. On the contrary, a system based only on the more regular sensors, as in Figure 7B, would yield more constant predictions that are likely exchangeable with the mean prediction. The sensor selection illustrated in the previous section aims at a solution between these two extremes.

The two metrics considered suggest similar solutions. Figure 7C reports the prediction provided under the power exponential ACF by the four best sensors according to the percentile metric, while Figure 7D shows the selection based on the mean absolute error instead. Both configurations attempt to replicate the north-south gradient, which requires choosing between sensors 5 and 10. However, neither of these choices can surrogate sensors 11 and 3. Selecting sensors 5 and 10 would fail the whole interior of the room and, conversely, the sensors in the interior of the room cannot surrogate 5 and 10.

SOME EXTENSIONS

In the previous section, we presented an application of our compute-efficient approach to spatio-temporal kriging. Here we describe two possible extensions that look legitimate under our prediction-oriented view of kriging. For instance, it is possible to include spatial interpolators or temporal forecasting methods that do not necessarily underlie any stationary ACF. It is also possible to use distinct temporal models for each sensor location, to provide more specialized forecasting. Both these extensions share at least the advantage of further distributing calculus across the sensor network and simplifying server-side computations.
Nonstationary modeling

The kriging approach relies on stationary ACF models, which offer a wide variety of possibilities, but some problems may be addressed only with nonstationary models. For instance, integrated AR models need no trend formulation. Another alternative is k-NN, which returns the sample average of the response values from the $k$ sensors closest to the desired location. This solution could be useful in cases like ours, where distinct sensors might have different equilibria. We used a moving average to accommodate nonstationarity in the mean, which is compatible with stationarity in the ACF, but one can use integrated AR and k-NN as alternatives. With high-frequency data, long-term stationarity may coexist with short-term nonstationarity resulting from locally linear trends, so it might be useful to consider a nonstationary model that copes with both aspects.

We considered an integrated AR model in our analysis, but without obtaining any significant improvement. Indeed, the chosen AR model was already able to cope with nonstationarity, due to both a moving average trend and a near-unit first AR coefficient, which implied integration de facto.

Sensor-specific temporal correlation parameters

A limitation of separable ACFs is that they imply the same marginal dynamics for all locations. As an extension, the temporal correlation parameters can be made sensor-specific: each sensor can estimate and update a distinct temporal model that is valid at least for its location, and approximately also for a neighborhood.

Distinct temporal models can each be based on a different CL and use different portions of data, so they will not affect each other. In mean prediction, as per Equation (31), $\hat{z}$ can be replaced with a matrix where each column consists of the temporal forecasts based on a model with limited scope, valid for just one location or a neighborhood; a sketch is given at the end of this subsection. When these forecasts are interpolated spatially, via Equation (30), more weight is given to forecasts close to the needed locations. This implies that all the temporal models are involved, but to a varying extent, depending on distance.

This extension with distinct temporal models per location adds flexibility to monitoring in at least two ways:

- It adds flexibility to network management. Each sensor estimates and updates its own temporal model, so this task need not be handled by the server, which is then in charge of the spatial interpolation task only.
- Statistical modeling becomes more flexible, too. Without this extension, a single overall temporal model is formulated that has to fit all locations forcefully. Distinct temporal models may instead address subsets of locations, so that they can cope with more local and specific dynamics.

Anomalous sensors may be dealt with more effectively by allowing them to make predictions based on a more specific model targeted at them only. The office room in our example is too small to call for a variety of temporal models. Larger environments will likely be more heterogeneous and will thus need many local models to provide better forecasts. Indeed, since spatio-temporal prediction is made up of both interpolation and forecasting, the quality of the latter is a necessary ingredient of joint predictive performance.
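Operationally, the change only affects how $\hat{z}$ is filled. In the sketch below, forecast_fun is a hypothetical list of per-sensor forecasting functions, each returning the same number of future time points from its own column of de-trended data, while the server-side interpolation of Equation (30) is untouched.

    # each sensor contributes its own column of temporal forecasts
    z_hat <- sapply(seq_len(ncol(Y_c)),
                    function(s) forecast_fun[[s]](Y_c[, s]))
    y_hat <- z_hat %*% t(beta_S)   # spatial interpolation, as in Equation (30)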
DISCUSSION

We have proposed a separable kriging approach that makes it possible to analyze large datasets by exploiting some overlooked aspects of separability. Even high-frequency data can be processed in a reasonable amount of time, using a maximum CL estimator and optimized calculus in prediction. The assumption of separability allows calculus to be distributed across the sensor network, by delegating as many operations as possible to the components that gather the relevant data. Separability is a strong assumption, though, which can be trusted at least in settings close to ours, with sensor data from indoor environments. When considering less controlled environments, or even weather data, this assumption is probably restrictive and unrealistic, and nonseparable models may be considered instead. Future research may investigate the viability of CL approaches in such settings as well.

To our knowledge, the use of marginal CL is novel to sensor data analysis. Its most appealing aspect is that the spatial and temporal models under separability can be estimated in parallel without affecting each other. The spatial model must be estimated in a centralized way, but the temporal model may be addressed in a decentralized way, by allowing sensors to estimate a temporal model valid for their location or neighborhood, as described in Section 6.2. This idea relates to stratified variograms,16,83 but it has even more in common with the estimation of a single variogram from data pairs sharing some identical conditions.84 The CL approach is thus computationally convenient and potentially more robust to model misspecification, as a wrong temporal model does not affect the estimation of the spatial model, and vice versa. There is, necessarily, a loss in efficiency, which may be more noticeable in small samples, but it should be less relevant in big data applications. In challenging estimation problems, CL estimates are also readily available and may help initialize other iterative estimation procedures, if a more efficient estimate is of interest.

The predictive part of our approach was already common in climate and weather sciences, but mostly confined to spatial interpolation.85 We instead provide a formal motivation for this way of computing predictions, based on separability. We found a related simplification in jointly spatio-temporal prediction, which we surmise can be extended to separability in more than two domains via Tucker products instead of Kronecker ones. For instance, covariates could be included in kriging via fully factored modeling,7 but feature engineering seemed necessary in our case,9 which is not typical in kriging. For the sake of completeness, alternative modeling strategies include additive covariances,86 process convolution,87 and linear mixed models,69 for which the available simplifications might differ.

In developing our proposal, we require the data to be gridded, which means that all sensors provide simultaneous readings. This requirement can be weakened, however, since data can at least be projected onto a grid.88

Kriging computation is hugely simplified by assuming separability and choosing a suitable spatial or (especially) temporal model. Both estimation and prediction can bypass the evaluation and inversion of large correlation matrices by treating the data as univariate time series or cross-sections, thus splitting a generally complicated calculation into simpler operations. For instance, AR models may have an intractable ACF, but they can be estimated easily by minimizing a conditional sum of squares, and their forecasts, based on Equation (23), are simple to calculate as well.
Our approach blends together and leverages known strengths of time series analysis and spatial statistics, without the need to outline a joint spatio-temporal framework from scratch.

ACKNOWLEDGMENTS

This work was supported by the Competence Centre ASSIC – Austrian Smart Systems Integration Research Center, co-funded by the Austrian Federal Ministry for Transport, Innovation and Technology (BMVIT), the Austrian Federal Ministry for Education, Science and Research (BMWFW), and the Austrian Federal Provinces of Carinthia and Styria within the Competence Centres for Excellent Technologies Programme (COMET). The research of Michele Lambardi di San Miniato was supported by the European Social Fund (Investimenti in favore della crescita e dell'occupazione, Programma Operativo del Friuli Venezia Giulia 2014/2020) – Programma specifico 89/2019 – Sostegno alla realizzazione di dottorati e assegni di ricerca, operazione PS 89/2019 ASSEGNI DI RICERCA – UNIUD (FP1956292002, canale di finanziamento 1420_SRDAR8919). Open Access Funding provided by Universita degli Studi di Udine within the CRUI-CARE Agreement.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available on GitHub at https://github.com/dslab-uniud/Virtual-sensing.

REFERENCES

1. Gramacy RB. Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences. 1st ed. Chapman and Hall/CRC; 2020.
2. Nguyen L, Hu G, Spanos CJ. Spatio-temporal environmental monitoring for smart buildings. Proceedings of the 13th IEEE International Conference on Control & Automation; 2017; IEEE, Ohrid, North Macedonia.
3. Carpenter J, Woodbury KA, O'Neill Z. Using change-point and Gaussian process models to create baseline energy models in industrial facilities: a comparison. Appl Energy. 2018;213:415-425. doi:10.1016/j.apenergy.2018.01.043
4. Liu H, Yang C, Huang M, Wang D, Yoo C. Modeling of subway indoor air quality using Gaussian process regression. J Hazard Mater. 2018;359:266-273. doi:10.1016/j.jhazmat.2018.07.034
5. Li H, Yu D, Braun JE. A review of virtual sensing technology and application in building systems. HVAC&R Res. 2011;17(5):619-645.
6. Rodríguez-Iturbe I, Mejía JM. The design of rainfall networks in time and space. Water Resour Res. 1974;10(4):713-728. doi:10.1029/wr010i004p00713
7. Mardia KV, Goodall CR. Spatial-temporal analysis of multivariate environmental monitoring data. In: Patil GP, Rao CR, eds. Multivariate Environmental Statistics. Elsevier; 1993:347-385.
8. Dong B, Lam KP. A real-time model predictive control for building heating and cooling systems based on the occupancy behavior pattern detection and local weather forecasting. Build Simul. 2013;7(1):89-106. doi:10.1007/s12273-013-0142-7
9. Brunello A, Urgolo A, Pittino F, Montvay A, Montanari A. Virtual sensing and sensors selection for efficient temperature monitoring in indoor environments. Sensors. 2021;21(8):2728. doi:10.3390/s21082728
10. Ferdoush S, Li X. Wireless sensor network system design using Raspberry Pi and Arduino for environmental monitoring applications. Proc Comput Sci. 2014;34:103-110. doi:10.1016/j.procs.2014.07.059
11. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery; 2016:785-794; New York.
12. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735-1780. doi:10.1162/neco.1997.9.8.1735
13. Oktavia E, Widyawan MIW. Inverse distance weighting and kriging spatial interpolation for data center thermal monitoring. Proceedings of the 1st International Conference on Information Technology, Information Systems and Electrical Engineering; 2016; IEEE, Yogyakarta, Indonesia.
14. Azzalini A, Scarpa B. Data Analysis and Data Mining: An Introduction. Illustrated ed. Oxford University Press; 2012.
15. Aryaputera AW, Yang D, Zhao L, Walsh WM. Very short-term irradiance forecasting at unobserved locations using spatio-temporal kriging. Sol Energy. 2015;122:1266-1278. doi:10.1016/j.solener.2015.10.023
16. Bivand RS, Pebesma E, Gómez-Rubio V. Applied Spatial Data Analysis with R. Springer; 2013.
17. Genton MG. Separable approximations of space-time covariance matrices. Environmetrics. 2007;18(7):681-695. doi:10.1002/env.854
18. Furrer R, Genton MG, Nychka D. Covariance tapering for interpolation of large spatial datasets. J Comput Graph Stat. 2006;15(3):502-523. doi:10.1198/106186006x132178
19. Box GEP, Jenkins GM, Reinsel GC, Ljung GM. Time Series Analysis: Forecasting and Control. 5th ed. Wiley; 2015.
20. Ruybal CJ, Hogue TS, McCray JE. Evaluation of groundwater levels in the Arapahoe aquifer using spatiotemporal regression kriging. Water Resour Res. 2019;55(4):2820-2837. doi:10.1029/2018wr023437
21. Thai CH, Tran TD, Phung-Van P. A size-dependent moving kriging meshfree model for deformation and free vibration analysis of functionally graded carbon nanotube-reinforced composite nanoplates. Eng Anal Bound Elem. 2020;115:52-63. doi:10.1016/j.enganabound.2020.02.008
22. Slot RMM, Sørensen JD, Sudret B, Svenningsen L, Thøgersen ML. Surrogate model uncertainty in wind turbine reliability assessment. Renew Energy. 2020;151:1150-1162. doi:10.1016/j.renene.2019.11.101
23. García-Gutiérrez A, Gonzalo J, Domínguez D, López D. Stochastic optimization of high-altitude airship envelopes based on kriging method. Aerosp Sci Technol. 2022;120:107251. doi:10.1016/j.ast.2021.107251
24. Qian HM, Li YF, Huang HZ. Time-variant reliability analysis for industrial robot RV reducer under multiple failure modes using kriging model. Reliab Eng Syst Saf. 2020;199:106936. doi:10.1016/j.ress.2020.106936
25. Luo Y, Xing J, Kang Z. Topology optimization using material-field series expansion and kriging-based algorithm: an effective non-gradient method. Comput Methods Appl Mech Eng. 2020;364:112966. doi:10.1016/j.cma.2020.112966
26. Yang Y, Christakos G, Yang X, He J. Spatiotemporal characterization and mapping of PM2.5 concentrations in southern Jiangsu province, China. Environ Pollut. 2018;234:794-803. doi:10.1016/j.envpol.2017.11.077
27. Mohammadi V, Dehghan M, Khodadadian A, Wick T. Numerical investigation on the transport equation in spherical coordinates via generalized moving least squares and moving kriging least squares approximations. Eng Comput. 2019;37(2):1231-1249. doi:10.1007/s00366-019-00881-3
28. Menafoglio A, Gaetani G, Secchi P. Random domain decompositions for object-oriented kriging over complex domains. Stoch Env Res Risk A. 2018;32(12):3421-3437. doi:10.1007/s00477-018-1596-z
29. Chen J, Mak S, Joseph VR, Zhang C. Function-on-function kriging, with applications to three-dimensional printing of aortic tissues. Technometrics. 2020;63(3):384-395. doi:10.1080/00401706.2020.1801255
30. Stein ML. Space-time covariance functions. J Am Stat Assoc. 2005;100(469):310-321. doi:10.1198/016214504000000854
31. Politis DN. Model-Free Prediction and Regression. 1st ed. Springer; 2016.
32. Chilès JP, Desassis N. Fifty years of kriging. In: Sagar BSD, Cheng Q, Agterberg F, eds. Handbook of Mathematical Geosciences. Springer; 2018:589-612.
33. You MY. Multi-objective optimal design of permanent magnet synchronous motor for electric vehicle based on deep learning. Appl Sci. 2020;10(2):482. doi:10.3390/app10020482
34. Bhattacharjee S, Mitra P, Ghosh SK. Spatial interpolation to predict missing attributes in GIS using semantic kriging. IEEE Trans Geosci Remote Sens. 2014;52(8):4771-4780. doi:10.1109/tgrs.2013.2284489
35. de Medeiros ES, de Lima RR, de Olinda RA, Dantas LG, de Santos CAC. Space-time kriging of precipitation: modeling the large-scale variation with model GAMLSS. Water. 2019;11(11):2368. doi:10.3390/w11112368
36. Totis G, Sortino M. Polynomial chaos-kriging approaches for an efficient probabilistic chatter prediction in milling. Int J Mach Tools Manuf. 2020;157:103610. doi:10.1016/j.ijmachtools.2020.103610
37. Delfiner P, Chilès JP. Geostatistics: Modeling Spatial Uncertainty. 2nd ed. Wiley; 2012.
38. Prudhomme C, Reed DW. Mapping extreme rainfall in a mountainous region using geostatistical techniques: a case study in Scotland. Int J Climatol. 1999;19(12):1337-1356. doi:10.1002/(sici)1097-0088(199910)19:12<1337::aid-joc421>3.0.co;2-g
39. Kaufman CG, Bingham D, Habib S, Heitmann K, Frieman JA. Efficient emulators of computer experiments using compactly supported correlation functions, with an application to cosmology. Ann Appl Stat. 2011;5(4):2470-2492. doi:10.1214/11-aoas489
40. Griffith DA. Hidden negative spatial autocorrelation. J Geogr Syst. 2006;8(4):335-355. doi:10.1007/s10109-006-0034-9
41. dos Reis AA, Carvalho MC, de Mello JM, Gomide LR, ACF F, FWA J. Spatial prediction of basal area and volume in Eucalyptus stands using Landsat TM data: an assessment of prediction methods. N Z J For Sci. 2018;48(1). doi:10.1186/s40490-017-0108-0
42. Angelini ME, Heuvelink GBM. Including spatial correlation in structural equation modelling of soil properties. Spatial Stat. 2018;25:35-51. doi:10.1016/j.spasta.2018.04.003
43. McLean MI, Evers L, Bowman AW, Bonte M, Jones WR. Statistical modelling of groundwater contamination monitoring data: a comparison of spatial and spatiotemporal methods. Sci Total Environ. 2019;652:1339-1346. doi:10.1016/j.scitotenv.2018.10.231
44. Liu H, Ong YS, Shen X, Cai J. When Gaussian process meets big data: a review of scalable GPs. IEEE Trans Neural Netw Learn Syst. 2020;31(11):4405-4423. doi:10.1109/tnnls.2019.2957109
45. Guhaniyogi R, Banerjee S. Meta-kriging: scalable Bayesian modeling and inference for massive spatial datasets. Technometrics. 2018;60(4):430-444. doi:10.1080/00401706.2018.1437474
46. Opitz T, Bonneu F, Gabriel E. Point-process based Bayesian modeling of space-time structures of forest fire occurrences in Mediterranean France. Spatial Stat. 2020;40:100429. doi:10.1016/j.spasta.2020.100429
47. Gu L. Moving kriging interpolation and element-free Galerkin method. Int J Numer Methods Eng. 2002;56(1):1-11. doi:10.1002/nme.553
48. Hartman L, Hössjer O. Fast kriging of large data sets with Gaussian Markov random fields. Comput Stat Data Anal. 2008;52(5):2331-2349. doi:10.1016/j.csda.2007.09.018
49. Hristopulos DT, Agou VD. Stochastic local interaction model with sparse precision matrix for space-time interpolation. Spatial Stat. 2020;40:100403. doi:10.1016/j.spasta.2019.100403
50. Strandberg J, de Luna SS, Mateu J. Prediction of spatial functional random processes: comparing functional and spatio-temporal kriging approaches. Stoch Env Res Risk A. 2019;33(10):1699-1719. doi:10.1007/s00477-019-01705-y
51. Arendt PD, Apley DW, Chen W, Lamb D, Gorsich D. Improving identifiability in model calibration using multiple responses. J Mech Des. 2012;134(10):100909-1-100909-9. doi:10.1115/1.4007573
52. Welch WJ, Buck RJ, Sacks J, Wynn HP, Mitchell TJ, Morris MD. Screening, predicting, and computer experiments. Technometrics. 1992;34(1):15-25. doi:10.1080/00401706.1992.10485229
53. Varin C, Reid N, Firth D. An overview of composite likelihood methods. Stat Sin. 2011;21:5-42.
54. Bevilacqua M, Gaetan C, Mateu J, Porcu E. Estimating space and space-time covariance functions for large data sets: a weighted composite likelihood approach. J Am Stat Assoc. 2012;107(497):268-280. doi:10.1080/01621459.2011.646928
55. Maronna RA, Martin RD, Yohai VJ. Robust Statistics: Theory and Methods. Wiley; 2006.
56. Caragea PC, Smith RL. Asymptotic properties of computationally efficient alternative estimators for a class of multivariate normal models. J Multivar Anal. 2007;98(7):1417-1440. doi:10.1016/j.jmva.2006.08.010
57. Xu X, Reid N. On the robustness of maximum composite likelihood estimate. J Stat Plan Inference. 2011;141(9):3047-3054. doi:10.1016/j.jspi.2011.03.026
58. Agostinelli C, Yohai VJ. Composite robust estimators for linear mixed models. J Am Stat Assoc. 2016;111(516):1764-1774. doi:10.1080/01621459.2015.1115358
59. Franceschetti G, Riccio D. Ch. 6: Surface classical models. In: Scattering, Natural Surfaces, and Fractals. 1st ed. Elsevier; 2007:21-59.
60. Hartwig RE. AX − XB = C, resultants and generalized inverses. SIAM J Appl Math. 1975;28(1):154-183. doi:10.1137/0128014
61. Berrocal VJ, Raftery AE, Gneiting T. Combining spatial statistical and ensemble information in probabilistic weather forecasts. Mon Weather Rev. 2007;135(4):1386-1402. doi:10.1175/mwr3341.1
62. Diggle PJ, Ribeiro PJ. Model-Based Geostatistics. 1st ed. Springer; 2007.
63. Cressie N. Statistics for Spatial Data. Revised ed. Wiley; 1993.
64. Zhang B, Sang H, Huang JZ. Full-scale approximations of spatio-temporal covariance models for large datasets. Stat Sin. 2015;25(1):99-114. doi:10.5705/ss.2013.260w
65. Gaetan C, Guyon X. Spatial Statistics and Modeling. 1st ed. Springer; 2010.
66. Maes MA, Breitung K, Dann MR. At issue: the Gaussian autocorrelation function. Proceedings of the 18th International Probabilistic Workshop; 2021:191-203; Springer, Cham.
67. Stachniss C, Plagemann C, Lilienthal AJ. Learning gas distribution models using sparse Gaussian process mixtures. Auton Robot. 2009;26(2-3):187-202. doi:10.1007/s10514-009-9111-5
68. Banerjee S, Gelfand AE, Finley AO, Sang H. Gaussian predictive process models for large spatial data sets. J Royal Stat Soc Ser B (Stat Methodol). 2008;70(4):825-848. doi:10.1111/j.1467-9868.2008.00663.x
69. Dumelle M, Hoef JMV, Fuentes C, Gitelman A. A linear mixed model formulation for spatio-temporal random processes with computational advances for the product, sum, and product-sum covariance functions. Spatial Stat. 2021;43:100510. doi:10.1016/j.spasta.2021.100510
70. Gong G, Samaniego FJ. Pseudo maximum likelihood estimation: theory and applications. Ann Stat. 1981;9(4):861-869. doi:10.1214/aos/1176345526
71. Varin C, Vidoni P. A note on composite likelihood inference and model selection. Biometrika. 2005;92(3):519-528. doi:10.1093/biomet/92.3.519
72. Davison AC, Hinkley DV. Bootstrap Methods and Their Application. Illustrated ed. Cambridge University Press; 1997.
73. Wright SJ. Coordinate descent algorithms. Math Program. 2015;151(1):3-34. doi:10.1007/s10107-015-0892-3
74. Hunter JS. The exponentially weighted moving average. J Qual Technol. 1986;18(4):203-210. doi:10.1080/00224065.1986.11979014
75. Huwang L, Yeh AB, Wu CW. Monitoring multivariate process variability for individual observations. J Qual Technol. 2007;39(3):258-278. doi:10.1080/00224065.2007.11917692
76. Cressie N. The origins of kriging. Math Geol. 1990;22(3):239-252. doi:10.1007/bf00889887
77. Qian HM, Huang T, Huang HZ. A single-loop strategy for time-variant system reliability analysis under multiple failure modes. Mech Syst Signal Process. 2021;148:107159. doi:10.1016/j.ymssp.2020.107159
78. Hyndman RJ, Koehler AB, Snyder RD, Grose S. A state space framework for automatic forecasting using exponential smoothing methods. Int J Forecast. 2002;18(3):439-454. doi:10.1016/s0169-2070(01)00110-8
79. Ma C. Semiparametric spatio-temporal covariance models with the ARMA temporal margin. Ann Inst Stat Math. 2005;57(2):221-233. doi:10.1007/bf02507023
80. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2021.
81. Zhou H, Yu KM, Lee MG, Han CC. The application of last observation carried forward method for missing data estimation in the context of industrial wireless sensor networks. Proceedings of the 7th IEEE Asia-Pacific Conference on Antennas and Propagation; 2018; IEEE, Auckland, New Zealand.
82. Gramacy RB, Niemi J, Weiss RM. Massively parallel approximate Gaussian process regression. SIAM/ASA J Uncertainty Quantif. 2014;2(1):564-584. doi:10.1137/130941912
83. Courault D, Monestiez P. Spatial interpolation of air temperature according to atmospheric circulation patterns in southeast France. Int J Climatol. 1999;19(4):365-378. doi:10.1002/(sici)1097-0088(19990330)19:4<365::aid-joc369>3.0.co;2-e
84. Monestiez P, Courault D, Allard D, Ruget F. Spatial interpolation of air temperature using environmental context: application to a crop model. Environ Ecol Stat. 2001;8(4):297-309. doi:10.1023/a:1012726317935
85. Holdaway M. Spatial modeling and interpolation of monthly temperature using kriging. Clim Res. 1996;6:215-225. doi:10.3354/cr006215
86. Ma P, Konomi BA, Kang EL. An additive approximate Gaussian process model for large spatio-temporal data. Environmetrics. 2019;30(8):e2569. doi:10.1002/env.2569
87. Higdon D. Space and space-time modeling using process convolutions. In: Anderson CW, Barnett V, Chatwin PC, El-Shaarawi AH, eds. Quantitative Methods for Current Environmental Issues. Springer; 2002:37-56.
88. Paciorek CJ. Computational techniques for spatial logistic regression with large data sets. Comput Stat Data Anal. 2007;51(8):3631-3653. doi:10.1016/j.csda.2006.11.008
89. Steeb WH. Problems and Solutions in Introductory and Advanced Matrix Calculus. 1st ed. World Scientific; 2006.

APPENDIX A: PROOFS

A.1 Kriging mean formula

Predictions can be computed in vectorized form, as follows, after Equation (28):

$$ \mathrm{vec}(\hat{y}' - \mu') = \rho R^{-1} \mathrm{vec}(y - \mu). \tag{A1} $$

By using the definitions in Equations (11) and (27), it follows that

$$ \mathrm{vec}(\hat{y}' - \mu') = (\rho_S \otimes \rho_T)(R_S \otimes R_T)^{-1} \mathrm{vec}(y - \mu). \tag{A2} $$

Next, we use the inversion behavior of the Kronecker product:

$$ \mathrm{vec}(\hat{y}' - \mu') = (\rho_S \otimes \rho_T)\left(R_S^{-1} \otimes R_T^{-1}\right) \mathrm{vec}(y - \mu). \tag{A3} $$
Then, the mixed-product property of the Kronecker product comes in handy:

$$ \mathrm{vec}(\hat{y}' - \mu') = \left\{\left(\rho_S R_S^{-1}\right) \otimes \left(\rho_T R_T^{-1}\right)\right\} \mathrm{vec}(y - \mu). \tag{A4} $$

At this point, the regression coefficients of Equation (29) can be recognized:

$$ \mathrm{vec}(\hat{y}' - \mu') = (\beta_S \otimes \beta_T) \mathrm{vec}(y - \mu). \tag{A5} $$

Lastly, we use Roth's column lemma.60 It can be found uncredited in recent algebra handbooks; see Steeb's book, Chapter 9, Problem 22.89 This yields

$$ \mathrm{vec}(\hat{y}' - \mu') = \mathrm{vec}\left\{\beta_T (y - \mu) \beta_S^{\top}\right\}. \tag{A6} $$

Then, the $\mathrm{vec}$ operator can be dropped and the matrix $\hat{y}'$ is obtained.

A.2 Kriging variance formula

Using Equation (28) as a starting point, it holds that

$$ \mathrm{diag}(R'_{\mathrm{cond}}) = \mathrm{diag}(R') - \mathrm{diag}\left(\rho R^{-1} \rho^{\top}\right). \tag{A7} $$

Similarly to the proof in Appendix A.1, we again exploit the inversion behavior and the mixed-product property of the Kronecker product. We also use Equations (11) and (27) to obtain

$$ \mathrm{diag}(R'_{\mathrm{cond}}) = \mathrm{diag}\left(R_S' \otimes R_T'\right) - \mathrm{diag}\left\{\left(\rho_S R_S^{-1} \rho_S^{\top}\right) \otimes \left(\rho_T R_T^{-1} \rho_T^{\top}\right)\right\}. \tag{A8} $$

After Equation (34), it follows that

$$ \mathrm{diag}(R'_{\mathrm{cond}}) = \mathrm{diag}\left(R_S' \otimes R_T'\right) - \mathrm{diag}\left\{\left(R_S' - R'_{S,\mathrm{cond}}\right) \otimes \left(R_T' - R'_{T,\mathrm{cond}}\right)\right\}. \tag{A9} $$

One can use the self-evident property $\mathrm{diag}(A \otimes B) = \mathrm{diag}(A) \otimes \mathrm{diag}(B)$, for $A$ and $B$ square matrices, which implies

$$ \mathrm{diag}(R'_{\mathrm{cond}}) = \mathrm{diag}(R_S') \otimes \mathrm{diag}(R_T') - \mathrm{diag}\left(R_S' - R'_{S,\mathrm{cond}}\right) \otimes \mathrm{diag}\left(R_T' - R'_{T,\mathrm{cond}}\right). \tag{A10} $$

Now, the components of $\mathrm{diag}(R'_{\mathrm{cond}})$ can be partitioned into vectors with the same length as $\mathrm{diag}(R_T')$. Such vectors can be taken as the columns of the matrix $V$, which is thus defined as claimed.

APPENDIX B: BOOTSTRAP

Parametric bootstrap72 under separable kriging is particularly convenient because the implied model is easy to simulate, and its parameters are simple to estimate with the approach proposed in this article. The temporal and spatial models together identify the full model, under which one can simulate artificial datasets.
In particular, separability makes it possible to simulate a dataset $Y$ as

$$ Y \sim \mu + \sigma \cdot R_T^{1/2} \cdot \epsilon \cdot R_S^{1/2}, \tag{B1} $$

where $\epsilon$ is Gaussian white noise structured as a $T \times S$ matrix, and $R_T^{1/2}$ and $R_S^{1/2}$ are the matrix square roots of $R_T$ and $R_S$, respectively.

Assuming $T \gg S$, $R_S^{1/2}$ may be tractable, while $R_T^{1/2}$ will hardly be so. The operator $R_T^{1/2}$ just makes $R_T^{1/2} \cdot \epsilon$ a matrix with independent columns that share the same correlation structure $R_T$. So, as an alternative to directly evaluating $R_T^{1/2}$, one can generate each column of $\sigma \cdot R_T^{1/2} \cdot \epsilon$ according to the temporal model. AR(p) processes can be simulated efficiently according to the factorized MA($\infty$) form.19 The first observations should be initialized according to the stationary distribution of the process but, with complicated models, one may instead provide an arbitrary initialization and then simulate additional observations as a burn-in. We adopted this latter strategy: in simulating 8 weeks of data, we needed to simulate 32 more leading weeks as a burn-in.
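On a small grid, Equation (B1) can be implemented with dense matrix square roots, as in the following sketch, where mu is a $T \times S$ mean matrix and R_T and R_S are assumed given; for long series, each column of the noise term would instead be simulated from the temporal model with a burn-in, as described above.

    sqrt_mat <- function(M) {   # symmetric matrix square root via eigendecomposition
      e <- eigen(M, symmetric = TRUE)
      e$vectors %*% (sqrt(pmax(e$values, 0)) * t(e$vectors))
    }
    eps <- matrix(rnorm(nrow(R_T) * nrow(R_S)), nrow(R_T), nrow(R_S))
    Y <- mu + sigma * sqrt_mat(R_T) %*% eps %*% sqrt_mat(R_S)   # Equation (B1)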
APPENDIX C: MAXIMUM LIKELIHOOD ESTIMATION

The pseudo ML estimator $\hat{\theta}$ of $\theta$ is defined as a maximizer of (12). An iterative optimization procedure can be devised to compute $\hat{\theta}$, by leveraging the convenient separability property in Equation (16). Operationally, the components of the parameter vector $\theta$ can be partitioned into groups, which are updated one at a time until convergence. In particular, we define three groups of parameters, which coincide with the scale parameter $\sigma$ and the spatial and temporal correlation parameters $\psi_S$ and $\psi_T$.

In Equation (B1), $R_S^{1/2}$ and $R_T^{1/2}$ were the matrix square roots of $R_S$ and $R_T$, respectively. Now, let $R_S^{-1/2}$ and $R_T^{-1/2}$ be their inverses. These matrices can be feasibly estimated, as $\hat{R}_S^{-1/2}$ and $\hat{R}_T^{-1/2}$, based on available estimates of $\psi_S$ and $\psi_T$. An approximately efficient estimator can then be computed iteratively by:

1. updating $\hat{\sigma}^2$ as the mean of the squared entries of the matrix $\hat{R}_T^{-1/2}(y - \hat{\mu})\hat{R}_S^{-1/2}$, based on the current estimates of $\psi_S$ and $\psi_T$;

2. updating $\hat{\psi}_S$ as the maximizer of

$$ \Omega(\psi_S) = -\frac{T}{2}\left\{\log|R_S| + \mathrm{tr}\left(R_S^{-1}\hat{M}_S\right)\right\} \tag{C1} $$

with respect to $\psi_S$, where $\hat{M}_S = z_S^{\top} z_S / T$ and $z_S = \hat{R}_T^{-1/2}(y - \hat{\mu})/\hat{\sigma}$, based on the current estimates of $\psi_T$ and $\sigma$; notice the resemblance to Equations (17) and (18): this operation is equivalent to estimating $\psi_S$ based on $T$ independent cross-sections;

3. updating $\hat{\psi}_T$ as the maximizer of

$$ \Omega(\psi_T) = -\frac{S}{2}\left\{\log|R_T| + \mathrm{tr}\left(R_T^{-1}\hat{M}_T\right)\right\} \tag{C2} $$

with respect to $\psi_T$, where $\hat{M}_T = z_T z_T^{\top} / S$ and $z_T = (y - \hat{\mu})\hat{R}_S^{-1/2}/\hat{\sigma}$, based on the current estimates of $\psi_S$ and $\sigma$; notice the resemblance to Equations (20) and (21): this operation is equivalent to estimating $\psi_T$ based on $S$ independent time series.

As in Appendix B, we stress that some convenient spatial or temporal models allow for sparse formulations of the matrices $R_S^{-1/2}$ and $R_T^{-1/2}$. In evaluating $z_S$, there is thus no need to allocate in memory any large matrix like $R_T^{-1/2}$.
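For completeness, one sweep of these updates can be written as follows on small dense toy inputs; inv_sqrt() is a helper of ours, and in real use $R_T^{-1/2}$ would never be formed explicitly but replaced by the sparse filtering implied by the AR model.

    inv_sqrt <- function(M) {   # inverse symmetric square root via eigendecomposition
      e <- eigen(M, symmetric = TRUE)
      e$vectors %*% ((1 / sqrt(e$values)) * t(e$vectors))
    }
    Z <- inv_sqrt(R_T) %*% Y_c %*% inv_sqrt(R_S)
    sigma2_hat <- mean(Z^2)                        # step 1: update of sigma^2
    z_S <- inv_sqrt(R_T) %*% Y_c / sqrt(sigma2_hat)
    M_S <- crossprod(z_S) / nrow(Y_c)              # feeds the maximization of (C1)
    z_T <- Y_c %*% inv_sqrt(R_S) / sqrt(sigma2_hat)
    M_T <- tcrossprod(z_T) / ncol(Y_c)             # feeds the maximization of (C2)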