Papaspiliopoulos, O., Roberts, G. O. & Zanella, G.

Summary
We develop methodology and complexity theory for Markov chain Monte Carlo algorithms used in inference for crossed random effects models in modern analysis of variance. We consider a plain Gibbs sampler and propose a simple modification, referred to as a collapsed Gibbs sampler. Under some balancedness conditions on the data designs and assuming that precision hyperparameters are known, we demonstrate that the plain Gibbs sampler is not scalable, in the sense that its complexity is worse than proportional to the number of parameters and data, but the collapsed Gibbs sampler is scalable. In simulated and real datasets we show that the explicit convergence rates predicted by our theory closely match the computable, but nonexplicit, rates even in cases where the design assumptions are violated. We also show empirically that the collapsed Gibbs sampler, extended to sample precision hyperparameters, significantly outperforms alternative state-of-the-art algorithms.

1. Introduction
Crossed random effects models are additive models that relate a response variable to categorical predictors. In the literature they appear under various names, such as cross-classified data, variance component models or multi-way analysis of variance. They provide a canonical framework for understanding the relative importance of different sources of variation in a dataset, as argued in Gelman (2005). For the purposes of this article we focus on linear models according to which $$\begin{align}\label{eq:anova}y_{i_1\cdots i_K} \sim \mathcal{N}\bigl\{a^{(0)}+a^{(1)}_{i_1}+\dots+a^{(K)}_{i_K},\,(n_{i_1\cdots i_K} \tau_0)^{-1}\bigr\} \quad(i_k =1,\ldots,I_k;\, k=1,\ldots,K),\end{align}$$ (1) where |$a^{(k)}$| is a vector of |$I_k$| levels, |$a^{(k)}_{i_k}$|, corresponding to the |$k$|th categorical factor, and |$a^{(0)}$| is a global mean with one level (|$I_0=1$|). These factors might correspond to both main and interaction effects in categorical data analysis. We work with exchangeable Gaussian random effects, |$a^{(k)}_j \sim \mathcal{N}(0,1/\tau_k)$| for |$k>0$|, which is by far the most standard choice, although interesting alternative priors exist as in Volfovsky & Hoff (2014). The precision terms |$(\tau_0,\dots,\tau_K)$| are typically unknown and are estimated using either fully Bayesian or optimization-based methods. In the notation of (1) we allow |$n_{i_1\cdots i_K}=0$|, which corresponds to empty cells. The total number of factor levels, and hence the number of regression parameters, is denoted by |$p=\sum_{k=0}^K I_k$|, and the number of observations is denoted by |$N= \sum_{i_1\cdots i_K}n_{i_1\cdots i_K}$|. Crossed random effects models adapt naturally to modern high-dimensional but sparse data. For example, they are used in the context of recommender systems, where in the simplest set-up there are two factors, customers and products, and the response is a rating; the examples in Gao & Owen (2017) are such that |$p \ll N \ll I_1 \times I_2$|. We say that a data design has balanced levels if the same number of observations is made at each level of each factor, though this number can vary across factors; see § 2.1 for a mathematical definition. The design is said to have balanced cells if the same number of observations is available for each combination of factor levels, i.e., for each cell of the contingency table defined by the categorical predictors. By construction, a design with balanced cells also has balanced levels.
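Model (1) is straightforward to simulate from, which is convenient for the numerical experiments discussed later in the article. The following R sketch is ours and not part of the original article; it generates data from (1) in the simplest two-factor, one-observation-per-cell set-up, and the factor sizes and precision values are arbitrary illustrative choices.

```r
# Illustrative simulation from model (1) with K = 2 and one observation per
# cell (a balanced-cells design); all numerical values are arbitrary choices.
set.seed(1)
I1 <- 50; I2 <- 40                 # numbers of levels of the two factors
tau0 <- 1; tau1 <- 2; tau2 <- 2    # observation and random-effect precisions
a0 <- 0.5                          # global mean a^(0)
a1 <- rnorm(I1, 0, 1 / sqrt(tau1)) # a^(1): e.g., customer effects
a2 <- rnorm(I2, 0, 1 / sqrt(tau2)) # a^(2): e.g., product effects

# Long format: one row per observed cell (i1, i2), so N = I1 * I2 here
cells <- expand.grid(i1 = seq_len(I1), i2 = seq_len(I2))
cells$y <- rnorm(nrow(cells),
                 mean = a0 + a1[cells$i1] + a2[cells$i2],
                 sd   = 1 / sqrt(tau0))
N <- nrow(cells)   # total number of observations
p <- 1 + I1 + I2   # total number of parameters, p = sum_k I_k
```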
We are interested in scalable methods of performing likelihood-based inferences for crossed random effects models. The main computational bottleneck is the need to perform an integration over the high-dimensional space of factors. This is needed both for a fully Bayesian inference that places priors on the precision hyperparameters and for optimization methods using the marginal likelihood. An exact marginalization is possible in the linear model owing to the joint Gaussian distribution of responses and factors. However, this involves matrix operations whose cost is |${O}(p^3)$|⁠, which can be prohibitively large in modern applications. For example, for typical recommender systems, |$I_1 \times I_2\geq N$| and hence |$p\geq \max\{I_1,I_2\}\geq \surd{N}$|⁠, meaning that the cost of these operations is at least |${O}(N^{3/2})$|⁠, which can be infeasible for large datasets; see also Gao & Owen (2019, § 1) for related discussion and references. Depending on the data design, the precision matrices of the Gaussian distributions involved may be sparse, in which case black-box sparse linear algebra algorithms can be used for the matrix operations to reduce the computational cost. However, we are not aware of theoretical results that can be applied in the context of crossed random effects models and which guarantee that the resulting computational complexity can be reduced to, say, |${O}(N)$|⁠. Additionally, we are interested in exploring methods that could be extended to non-Gaussian models, such as those with categorical or count observations. Markov chain Monte Carlo algorithms can be used to carry out the integration over factors. A popular and convenient algorithm for updating the factors given the precision terms is the Gibbs sampler, which samples the factors |$a^{(k)}$| iteratively from their full conditional distributions. Recently, Gao & Owen (2017) showed that in the special case with |$a^{(0)}$| assumed to be known, in the context of recommendation where |$K=2$| with balanced-cells design, the Gibbs sampler has complexity |${O}(N^{3/2})$|⁠. The complexity of a Markov chain Monte Carlo algorithm can be defined as the product of the computational time per iteration and the number of iterations the algorithm needs to mix. The argument in Gao & Owen (2017) suggests that the Gibbs sampler is not scalable for crossed random effects models because of its superlinear cost in the number of observations. In this article we develop theory for analysing the complexity of the Gibbs sampler for crossed random effects models under different designs. We propose a small modification of the basic algorithm, which we call the collapsed Gibbs sampler, and establish rigorously its superior performance and scalability. We obtain explicit results on the spectral gap of the algorithms and analyse their computational cost. Many aspects of our results are clearly shown in Fig. 1 of § 6.1. We consider modern big data asymptotic regimes where both the number of parameters and the number of observations grow. The relaxation times of the Gibbs sampler and the collapsed Gibbs sampler can be computed numerically and are plotted against the size of the datasets. The slowing down of the Gibbs sampler and the improvement of the collapsed Gibbs sampler with increasing dataset size are evident. Our theory is not applicable in these cases because the resultant designs, which have been generated randomly, do not have balanced levels; nonetheless, the rate predicted by our theory matches the correct rates remarkably well. 
In § 6.2 we present comparable results for a well-known real dataset of student evaluations with five factors.

Fig. 1. Comparison of the Gibbs sampler, GS, and the collapsed Gibbs sampler, cGS. (a) Relaxation time plotted against size of dataset in the case of |$K=2$|, |$I_1=I_2$| and |$\tau_0=\tau_1=\tau_2$|, for GS (circles and black solid line), cGS (triangles and red solid line) and predictions of our theory (plus signs and blue dotted line); the data are missing completely at random with probability of missingness 0.9. (b) Autocorrelation function of |$a^{(0)}$| and |$\sigma_3=\tau_3^{-1/2}$| for GS (circles and black line), cGS (triangles and red line), GS combined with parameter expansion (plus signs and blue line) and cGS combined with parameter expansion (crosses and green line) when applied to the InstEval dataset; see § 6.2 for details of the set-up.

The complexity theory developed in the first part of the article relies on assuming that the precision hyperparameters are known. However, the collapsed Gibbs sampler method can easily be extended to produce an algorithm for fully Bayesian inference when the precision hyperparameters are given prior distributions. In § 6.2 we compare numerically the performance of the resulting sampler with state-of-the-art Markov chain Monte Carlo algorithms, such as parameter expansion and Hamiltonian Monte Carlo. We find that the collapsed Gibbs sampler proposed here has far superior performance: the improvement in effective sample size per unit of computation time ranges from one to three orders of magnitude depending on the parameter under consideration. The theory is based on a multigrid decomposition of the Markov chain generated by the sampler, which allows us to identify the slowest-mixing components, and capitalizes on existing theory for the convergence of Gaussian Markov chains. The multigrid decomposition of a Markov chain is a powerful theoretical tool for studying the Markov chain’s spectral gap, as it provides a decomposition into independent processes. Identifying such a decomposition is a kind of art; a previously successful example in the context of multilevel nested linear models can be found in Zanella & Roberts (2019).

2. Decompositions of the posterior distribution

2.1. Notation
The statistical model we work with is described in (1). In accordance with standard practice, Gaussian priors are used for the factor levels, |$a^{(k)}_j \sim \mathcal{N}(0,1/\tau_k)$|, and an improper prior is used for the global mean, |$p(a^{(0)}) \propto 1$|. When convenient we write |$a^{(0)}_{i_0}$|, which is the same as |$a^{(0)}$|.
We allow |$n_{i_1\cdots i_K}=0$|⁠, which corresponds to empty cells in the contingency table defined by the outer product of the categorical factors. Let |$n$| denote the data incidence array, a multi-dimensional array with elements |$n_{i_1\cdots i_K}$|⁠. Two-dimensional marginal tables extracted from the data incidence matrix are denoted by |$n^{(l,k)}$| and have elements |$n^{(l,k)}_{i_l,\,i_k}$|⁠, the total number of observations on level |$i_l$| of factor |$l$| and level |$i_k$| of factor |$k$|⁠; margins of this table are denoted by |$n^{(k)}$| and are vectors of size |$I_k$| with elements |$n^{(k)}_j$|⁠, the total number of observations with level |$j$| on the |$k$|th factor. By definition |$\sum_j n^{(k)}_j=N$|⁠, where |$N$| is the total number of observations. A data design has balanced levels if |$n^{(k)}_j=N/I_k$| for every |$k$| and |$j$|⁠, and has balanced cells if |$n_{i_1\cdots i_K}=N/\prod_k I_k$| for all combinations of factor levels. Averages of vectors are indicated by an overbar, e.g., |$\bar{a}^{(k)}$|⁠; weighted averages are indicated by a tilde, e.g., |$\tilde{y} = \sum_{i_1\cdots i_K} y_{i_1\cdots i_K} n_{i_1\cdots i_K} /N$|⁠. The vector of all factor averages is denoted by |$\bar{a}$|⁠, the first element of which is trivially |$a^{(0)}$|⁠. We use |$a^{(-k)}$| to denote the vector of all factor levels except those of |$a^{(k)}$|⁠, and use |$a^{(k)}_{-j}$| to denote the vector of all levels of factor |$k$| except the |$j$|th level; |$a$| denotes the vector of all levels of all factors. We define |$\delta$| to be a residual operator that when applied to a vector returns the difference between its elements and their sample average, e.g., |$\delta a^{(k)}$| has elements |$a^{(k)}_j - \bar{a}^{(k)}$| and is referred to as the factor’s level increments; |$\delta a$| denotes the vector of all such increments except for |$\delta a^{(0)}$|⁠, which is 0 trivially. The law of a random variable |$X$| is denoted by |$\mathcal{L}(X)$|⁠, e.g., |$\mathcal{L} (a^{(k)}_j) = \mathcal{N}(0,1/\tau_k)$|⁠, and the law of |$X$| conditionally on |$Y$| is denoted by |$\mathcal{L}(X \mid Y)$|⁠. When a joint distribution has been specified for |$X$| and other random variables, |$\mathcal{L}( X \mid \cdot \, )$| denotes the full conditional distribution of |$X$| conditionally on the rest. In the following sections up to § 6, we assume the precision terms to be fixed without explicitly writing the conditioning on |$\tau$| in all expressions. 2.2. 
Full conditional distributions
Fairly standard Bayesian linear model calculations yield that the conditional distribution of |$a^{(0)}$| given all other parameters and data is $$\begin{align} \label{eq:fcondmu} \mathcal{L}( a^{(0)} \mid \cdot \, ) & = \mathcal{N}\!\left \{\tilde{y} - {\sum_k \sum_i a^{(k)}_i n^{(k)}_i \over N},\, (N \tau_0)^{-1} \right \}\text{.}\end{align}$$ (2) With balanced levels this simplifies to $$\begin{align} \label{eq:fcondmu-bl} \mathcal{L}( a^{(0)} \mid \cdot \, ) & = \mathcal{N}\!\left \{\tilde{y} - \sum_k \bar{a}^{(k)} ,\, (N \tau_0)^{-1} \right \}\text{.} \end{align}$$ (3) Similarly we obtain that for |$k>0$|, $$\begin{align} \label{eq:fconda} \mathcal{L}( a^{(k)}_{j} \mid \cdot \, ) & = \mathcal{N}\!\left \{{n^{(k)}_j \tau_0 \over n^{(k)}_j \tau_0 + \tau_k} \left (\tilde{y}^{(k)}_j - a^{(0)} - {\sum_{l \neq k, l \neq 0} \sum_i a^{(l)}_i n^{(k,l)}_{j,\,i} \over n^{(k)}_j}\right), \, (n^{(k)}_j \tau_0 + \tau_k)^{-1} \right \},\end{align}$$ (4) where |$\tilde{y}^{(k)}_j$| is the weighted average of all observations whose level on factor |$k$| is |$j$|. With balanced cells this simplifies to $$\begin{align} \label{eq:fconda-cells} \mathcal{L}( a^{(k)}_{j} \mid \cdot \, ) & = \mathcal{N}\!\left \{{N \tau_0 \over N \tau_0 + I_k \tau_k} \biggl(\tilde{y}^{(k)}_j - a^{(0)} - \sum_{l \neq k,\, l \neq 0} \bar{a}^{(l)} \biggr), \, I_k (N \tau_0 + I_k\tau_k)^{-1} \right \}\text{.}\end{align}$$ (5)

2.3. Factorizations
In designs with balanced levels the posterior distribution of regression parameters admits certain factorizations, which are collected together in the following proposition. Throughout the paper, we use products of laws to represent independence.

Proposition 1. For designs with balanced levels, $$\begin{align*}\mathcal{L}(\bar{a},\delta a \mid {y})& =\mathcal{L}(\bar{a} \mid {y}) \mathcal{L}(\delta a \mid {y})\end{align*}$$ and $$\begin{align}\label{eq:collapsed_iid} \mathcal{L}(\bar{a}^{(-0)} \mid {y})= \prod_{k=1}^K \mathcal{L}(\bar{a}^{(k)} \mid {y})\text{.}\end{align}$$ (6) For designs with balanced cells, we moreover have that $$\begin{align*}\mathcal{L}(\delta a \mid {y})& = \prod_{k} \mathcal{L}(\delta a^{(k)} \mid {y})\text{.}\end{align*}$$

The factorization in (6) is particularly relevant to the collapsed Gibbs sampler introduced later in the article. A sketch of the proof is as follows. For the first factorization, we obtain directly from (4) with the assumption of balanced levels that $$\begin{align} \label{eq:fcondma} \mathcal{L}(\bar{a}^{(k)} \mid y,a^{(-k)},\delta a^{(k)})= \mathcal{N} \!\left\{{N \tau_0 \over N \tau_0 + I_k \tau_k} \biggl(\tilde{y} - a^{(0)} - \sum_{l\neq k} \bar{a}^{(l)} \biggr) , \: (N \tau_0+I_k \tau_k)^{-1} \right \}\text{.}\end{align}$$ (7) We use the fact that global and local Markovian properties are equivalent; see, for instance, Besag (1974, § 3). This yields the first independence statement in the proposition. The proof of (6) follows by similar arguments using |$\mathcal{L}(\bar{a}^{(-0,-k)} \mid {y})=\mathcal{N}\{0,(I_k\tau_k)^{-1}\}$|. The third factorization is derived in the same way, upon noting that (5) implies $$\begin{align*}\mathcal{L}( \delta a^{(k)} \mid y,a^{(-k)}, \delta a^{(-k)} )& = \mathcal{N}\bigl\{ (0,\dots,0)^{ \mathrm{\scriptscriptstyle T} },\, (N \tau_0 + I_k\tau_k)^{-1} (I_k\mathbb{I}_{I_k}-\mathbb{H}_{I_k}) \bigr\},\end{align*}$$ where |$\mathbb{I}_{I_k}$| denotes the |$I_k\times I_k$| identity matrix and |$\mathbb{H}_{I_k}$| the |$I_k\times I_k$| matrix with each entry equal to |$1$|.
3. Gibbs samplers for inference
We consider two main algorithms. The first is a block Gibbs sampler that updates in a single block the levels of a given factor conditional on everything else. Owing to the dependence structure in the model, the levels of a given factor conditional on the rest are independent, so the sampling is done separately for each factor level, i.e., iteratively from |$\mathcal{L}( a^{(k)}_{i_k} \mid \cdot \, )$| for |$i_k =1,\ldots,I_k$| and |$k=0,\ldots,K$|; these distributions are specified in § 2.2. We refer to this algorithm as the Gibbs sampler, although it should be understood that it is just one implementation of the scheme. We also consider a collapsed Gibbs sampler that samples from |$\mathcal{L}(a^{(-0)} \mid {y})$|, i.e., the algorithm obtained by first analytically integrating out the global mean |$a^{(0)}$| and then sampling in blocks the levels of each of the remaining factors. In practice, we implement this algorithm by sampling iteratively from |$\mathcal{L}( a^{(0)},a^{(k)} \mid \cdot \, )$| for |$k=1,\ldots,K$|. Updating |$a^{(0)}$| together with each block is equivalent to integrating it out before sampling starts, in the sense that the resulting transition kernel for |$a^{(-0)}$| is the same, with the only difference being whether the values of |$a^{(0)}$| are stored or not. In our implementation we first sample from |$\mathcal{L}(a^{(0)} \mid y,a^{(-0,-k)})$| and then from |$\mathcal{L}( a^{(k)}_{i_k} \mid \cdot \, )$| for |$i_k =1,\ldots,I_k$| as in the original Gibbs sampler. The implementation of the collapsed Gibbs sampler relies on the following result.

Proposition 2. Writing |$s^{(k)}_j=n^{(k)}_j\tau_0/(\tau_k+n^{(k)}_j\tau_0)$|, we have that $$\begin{align} \label{eq:fcondmu_collapsed} \mathcal{L}(a^{(0)} \mid y,a^{(-0,-k)}) & = \mathcal{N}\!\left \{ \frac{1}{\sum_j s^{(k)}_j} \sum_j s^{(k)}_j \!\left(\tilde{y}^{(k)}_j-\frac{\sum_{l\neq k}\sum_{i} a^{(l)}_i n^{(k,l)}_{j,i}}{n^{(k)}_j} \right) , \: \frac{1}{\tau_k\sum_j s^{(k)}_j}\right \}\! \text{.}\end{align}$$ (8)

We prefer to use the version of the collapsed Gibbs sampler in which |$a^{(0)}$| is updated together with each block, because this version remains realizable in more elaborate models, such as generalized linear crossed random effects models; in such extensions exact sampling from |$\mathcal{L}( a^{(0)},a^{(k)} \mid \cdot \, )$| may not be feasible, but a Metropolis–Hastings step can be used instead. Additionally, implementation requires only a minimal modification of the Gibbs sampler code, as shown in the Supplementary Material.
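To make the two algorithms concrete, the following R sketch, which is ours and not the code referred to in the Supplementary Material, implements one sweep of the Gibbs sampler and of the collapsed Gibbs sampler for |$K=2$| with known precisions, using the full conditionals (2) and (4) and the collapsed update (8). The data-frame format, the function name and the assumption that every level is observed at least once are our choices, so this is only a minimal sketch.

```r
# Minimal sketch, not the authors' Supplementary Material code: one sweep of
# the Gibbs sampler (collapsed = FALSE) or of the collapsed Gibbs sampler
# (collapsed = TRUE) for K = 2 with known precisions. 'dat' holds one row per
# observation with integer columns i1, i2 and numeric y; every level is
# assumed observed at least once, so the weighted averages of § 2 reduce to
# plain sums and averages of the raw observations.
sweep_once <- function(a0, a1, a2, dat, tau0, tau1, tau2, collapsed = FALSE) {
  I1 <- length(a1); I2 <- length(a2); N <- nrow(dat)
  n1 <- tabulate(dat$i1, I1); n2 <- tabulate(dat$i2, I2)

  if (!collapsed)   # update a^(0) from its full conditional, equation (2)
    a0 <- rnorm(1, mean(dat$y - a1[dat$i1] - a2[dat$i2]), 1 / sqrt(N * tau0))

  if (collapsed) {  # update a^(0) jointly with the a^(1) block, equation (8)
    s1 <- n1 * tau0 / (tau1 + n1 * tau0)
    r1 <- as.vector(rowsum(dat$y - a2[dat$i2], dat$i1))  # residual sums by level of factor 1
    a0 <- rnorm(1, sum(s1 * r1 / n1) / sum(s1), 1 / sqrt(tau1 * sum(s1)))
  }
  # update the a^(1) block from (4); its levels are conditionally independent
  r1 <- as.vector(rowsum(dat$y - a0 - a2[dat$i2], dat$i1))
  a1 <- rnorm(I1, tau0 * r1 / (n1 * tau0 + tau1), 1 / sqrt(n1 * tau0 + tau1))

  if (collapsed) {  # update a^(0) jointly with the a^(2) block, equation (8)
    s2 <- n2 * tau0 / (tau2 + n2 * tau0)
    r2 <- as.vector(rowsum(dat$y - a1[dat$i1], dat$i2))  # residual sums by level of factor 2
    a0 <- rnorm(1, sum(s2 * r2 / n2) / sum(s2), 1 / sqrt(tau2 * sum(s2)))
  }
  # update the a^(2) block from (4)
  r2 <- as.vector(rowsum(dat$y - a0 - a1[dat$i1], dat$i2))
  a2 <- rnorm(I2, tau0 * r2 / (n2 * tau0 + tau2), 1 / sqrt(n2 * tau0 + tau2))

  list(a0 = a0, a1 = a1, a2 = a2)
}
```

Iterating sweep_once with collapsed = TRUE and storing the draws gives the collapsed Gibbs sampler analysed below; setting collapsed = FALSE recovers the plain Gibbs sampler.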
4. Complexity of Markov chain Monte Carlo

4.1. Notation
For stochastic processes generated by Markov chain Monte Carlo, the time index corresponds to iteration and is generically denoted by |$t$| in parentheses, e.g., |$x(t)$|. In such cases the stochastic process over |$T$| iterations is denoted by |$\{x(t)\}_{t=1}^T$|, and we write simply |$\{x(t)\}$| when |$T=\infty$|. Let |$\{(x,z)(t)\}$| denote a stochastic process that at each time |$t$| takes as a value the vector composed of |$x(t)$| and |$z(t)$|. We say that the stochastic process |$\{x(t)\}$| is a timewise transformation of another process |$\{y(t)\}$| if there is a function |$\phi$| such that |$x(t) = \phi\{y(t)\}$| for all |$t$|.

4.2. Spectral gap and relaxation time
In this article we focus on |$L^2(\pi)$| convergence; this relies on functional-analytic concepts, of which we give a very high-level description below. For a given target distribution |$\pi$| defined on a state space |$\mathcal{X}$|, we define |$L^2(\pi)$| to be the space of complex-valued functions that are square-integrable with respect to |$\pi$|. The inner product on this space is such that the associated norm of a function |$f: \mathcal{X} \to \mathbb{C}$| is |$\|f\|^2=\int_{\mathcal{X}} |f(x)|^2 \pi({\rm d} x)$|. For a Markov chain |$\{x(t)\}$| defined on |$\mathcal{X}$| with transition kernel |$P$| that is invariant with respect to |$\pi$|, we view |$P$| as an integral operator acting on the subspace of |$L^2(\pi)$| of functions with mean zero under |$\pi$|, and denote its spectrum by |$S$|. We say that |$P$| converges geometrically fast to |$\pi$| in the |$L^2(\pi)$|-norm, also known as the operator norm, if and only if its geometric rate of convergence, defined as |$\sup_{\lambda\in S}|\lambda|$|, is less than 1. The spectral gap of |$P$| is defined as the difference between 1 and the geometric rate of convergence; hence a Markov chain converges in the |$L^2(\pi)$|-norm if and only if it has a positive spectral gap. In the remainder of the paper we refer to the geometric rate of convergence simply as the rate of convergence for brevity. Define the relaxation time of a Markov chain as the reciprocal of its spectral gap. This can be interpreted as the number of iterations needed to subsample the Markov chain so that the resultant draws are roughly independent of each other. The complexity of a Markov chain Monte Carlo algorithm can be defined as the product of the relaxation time and the cost per iteration.

4.3. Spectral gap of the Gibbs sampler on Gaussian distributions
The Markov chain |$\{x(t)\}$| generated by a Gibbs sampler targeting a multivariate Gaussian distribution |$\mathcal{N}(\mu,\Sigma)$| is a Gaussian autoregressive process evolving as |$x(t+1)\mid x(t)\sim \mathcal{N}\{Bx(t)+b,\Sigma-B\Sigma B^{ \mathrm{\scriptscriptstyle T} } \}$|; see, for example, Lemma 1 in Roberts & Sahu (1997). The details of the Gibbs sampler, such as the order in which its components are updated or blocked together, are reflected in the precise form of |$B$|. This representation implies that the rate of convergence of the Gibbs sampler is |$\rho(B)$|, the largest absolute eigenvalue of the matrix |$B$|; see Theorem 1 of Roberts & Sahu (1997). This characterization of the |$L^2(\pi)$| rate of convergence has provided useful insights into the performance of the Gibbs sampler and has led to much more efficient modifications of the basic algorithm; see, for example, Papaspiliopoulos et al. (2003, 2007). However, in high-dimensional scenarios it is often very challenging to compute |$\rho(B)$| explicitly as a function of the important parameters of the model, such as |$p$| and |$N$| in the crossed effects models considered here. Thus, as a tool for understanding the complexity of the Gibbs sampler in difficult problems, this approach has limited scope. In the present article we make the approach useful by combining it with the multigrid decomposition developed below, which collapses the problem to studying the spectral gaps of two Gaussian subchains, |$\{\bar{a}(t)\}$| and |$\{\delta a(t)\}$|, that turn out to be amenable to direct analysis.

5. Complexity analysis for crossed random effects models

5.1. Multigrid decomposition of the Gibbs samplers
The results derived in this paper stem from the following theorem, the proof of which is given in the Appendix.

Theorem 1 (Multigrid decomposition).
Let |$\{a(t)\}$| be the Markov chain generated by either the Gibbs sampler or the collapsed Gibbs sampler for designs with balanced levels. Then the timewise transformations |$\{\bar{a}(t)\}$| and |$\{\delta a(t)\}$| obtained from |$\{a(t)\}$| are each a Markov chain and are independent of one another. A crucial point here is that the independence of |$\bar{a}$| and |$\delta a$| under the posterior distribution, see Proposition 1, does not imply that the corresponding chains |$\{\bar{a}(t)\}$| and |$\{\delta a(t)\}$| are independent of each other. The following very simple example makes this point clear. Consider a Gibbs sampler that targets a bivariate Gaussian for |$(x,y)$| with correlation |$\rho$| and standard Gaussian marginals. Then the transformation |$x$| and |$z=y-\rho x$| orthogonalizes the target, but the corresponding stochastic processes |$\{x(t)\}$| and |$\{z(t)\}$| obtained by timewise transformation of the original chain |$\{(x,y)(t)\}$| are not independent Markov chains; see, for example, the cross-correlogram in the Supplementary Material. Although this is a toy example, there are many situations in which an independence factorization of the target distribution does not require that of the Markov chain Monte Carlo algorithm used. In the following sections we apply Theorem 1 in conjunction with the theory from § 4.3 to characterize the complexity of the two samplers. 5.2. Timewise transformations and convergence of Markov chains The multigrid decomposition in Theorem 1 identifies two timewise transformations of the Markov chain |$\{a(t)\}$| produced by either of the algorithms considered in this article, each of which evolves independently of the other as a Markov chain. We can relate the rate of convergence of the Markov chains involved in this decomposition using the following two technical lemmas, which are proved in the Supplementary Material. Lemma 1. Let |$\{x(t)\}$| be a Markov chain with invariant distribution |$\pi$|⁠, and let |$\{y(t)\}$| be a timewise transformation given by |$y(t)=\phi\{x(t)\}$| where |$\phi$| is an injective function. Then |$\{y(t)\}$| is a Markov chain with the same rate of convergence as |$\{x(t)\}$|⁠. Lemma 1 is similar in spirit to Theorem 6 of Johnson & Geyer (2012), which shows that various properties of Markov chains are preserved under one-to-one timewise transformations. Lemma 2. Let |$\{x(t)\}$| be a Markov chain with state space |$\mathcal{X}_1\times \mathcal{X}_2$| and target distribution |$\pi_1\otimes\pi_2$|⁠. If the stochastic processes |$\{x_1(t)\}$| and |$\{x_2(t)\}$| obtained by projection onto the |$\mathcal{X}_1$| and |$\mathcal{X}_2$| components are independent Markov chains, then the rate of convergence of |$\{x(t)\}$| equals the maximum of the rates of convergence of |$\{x_1(t)\}$| and |$\{x_2(t)\}$|⁠. Therefore, for designs with balanced levels, the rate of convergence of the Markov chain |$\{a(t)\}$|⁠, generated by either the Gibbs sampler or the collapsed Gibbs sampler, is the larger of the rates of the two chains |$\{\bar{a}(t)\}$| and |$\{\delta a(t)\}$|⁠. In the rest of this section we analyse these two chains using the theory summarized in § 4.3 and use the results to characterize the complexity of the Gibbs sampler and its collapsed version. 5.3. Complexity analysis for designs with balanced cells The following result characterizes the rate of convergence of one of the two timewise transformations involved in the multigrid decomposition. Proposition 3. 
For designs with balanced levels, the rate of convergence of the Markov chain |$\{\bar{a}(t)\}$| defined in Theorem 1 equals |$\max_{k}{N\tau_{0}}/({N\tau_{0}+I_k\tau_{k}})$| for the Gibbs sampler and |$0$| for the collapsed Gibbs sampler, and this rate is the same for any order in which the different blocks are updated. Proof. For the Gibbs sampler, the subchain |$\{\bar{a}(t)\}$| is a Gaussian Gibbs sampler with |$K+1$| one-dimensional components. We can explicitly work out that its autoregressive matrix |$B$| takes the form $$\begin{equation}\label{eq:B_matrix_form} B= \!\left( \begin{array}{c|ccc} 0 & -1 & \cdots & -1\\\hline 0 & & & \\ \vdots & & L &\\ 0 & & & \end{array} \right), \end{equation}$$ (9) where |$L$| is a |$K\times K$| lower triangular matrix whose diagonal elements are |$r_1,\dots,r_K$|⁠, with $$\begin{align*} r_k = \frac{N\tau_{0}}{N\tau_{0}+I_k\tau_{k}}\,\text{.} \end{align*}$$ The correctness of (9) is shown in the Supplementary Material, by verifying directly with an induction argument that |${E}\{\bar{a}(t+1)\mid \bar{a}(t)\}=B\bar{a}(t)+b$|⁠. Since |$L$| is a lower triangular matrix, its spectrum coincides with its diagonal elements |$r_1,\dots,r_K$|⁠. For each |$k=1,\dots,K$|⁠, let |$v^{(k)}$| be the eigenvector with eigenvalue |$r_k$|⁠. It is easy to check that the |$(K+1)$|-dimensional vector |$w^{(k)}=(-r_k^{-1}\sum_{\ell=1}^K v^{(k)}_\ell,v^{(k)}_1,\dots,v^{(k)}_K)$| is an eigenvector of |$B$| with eigenvalue |$r_k$|⁠. Hence |$r_1,\dots,r_k$| are also eigenvalues of |$B$|⁠. Finally, |$(1,0,\dots,0)$| is an eigenvector of |$B$| with eigenvalue 0. With these ingredients, the claim for the Gibbs sampler follows immediately. For the collapsed Gibbs sampler, |$\bar{a}(t)$| is obtained from |$\bar{a}(t-1)$| by simulating |$\bar{a}^{(k)}(t)$| from $$\begin{align*} & \mathcal{L}\{\bar{a}^{(k)}(t) \mid y,\bar{a}^{(1)}(t),\ldots,\bar{a}^{(k-1)}(t),\bar{a}^{(k+1)}(t-1),\ldots,\bar{a}^{(K)}(t-1)\} \end{align*}$$ for |$k=1,\dots,K$|⁠. By Proposition 1, this procedure produces independent and identically distributed draws from |$\mathcal{L}(\bar{a}^{(-0)} \mid {y})$| or, equivalently, |$\mathcal{L}(\bar{a} \mid {y})$| if |$\bar{a}^{(0)}$| is jointly updated with |$\bar{a}^{(k)}$|⁠. These rates do not depend on the order in which the different components are updated. This is trivially true for the collapsed Gibbs sampler since the components are independent. For the Gibbs sampler the justification is as follows. The Gibbs sampler’s rate of convergence is invariant under cyclic permutations of the order of update of the components; see, for instance, Roberts & Sahu (1997, p. 297). Thus we can always assume |$a^{(0)}$| to be the first component that is updated. Then the result follows upon relabelling the components |$a^{(1)}$| to |$a^{(K)}$| according to their update order and repeating the argument in the previous paragraphs. □ The main result of this section follows rather easily from Proposition 3. Theorem 2. For designs with balanced cells, the relaxation time of the Gibbs sampler is |$1+\max_{k=1,\dots,K}\{ {N\tau_0}/({I_k\tau_k})\}$|⁠, and that of the collapsed Gibbs sampler is |$1$|⁠, i.e., the sampler produces independent and identically distributed draws from the target, and these rates do not depend on the order in which different components are updated. Proof. Let |$\{a(t)\}$| be the Markov chain generated by the Gibbs sampler or its collapsed version. 
Lemma 1 implies that |$\{(\bar{a},\delta a)(t)\}$| is a Markov chain with the same rate of convergence as |$\{a(t)\}$|⁠. Thus, by Theorem 1 and Lemma 2, the rate of convergence of |$\{a(t)\}$| equals the maximum of the rates of convergence of |$\{\bar{a}(t)\}$| and |$\{\delta{a}(t)\}$|⁠. Proposition 1 implies that |$\{\delta{a}(t)\}$| performs independent sampling from |$\mathcal{L}(\delta a \mid {y})$| and so its rate of convergence is 0, and the rate of convergence of |$\{a(t)\}$| equals that of |$\{\bar{a}(t)\}$|⁠. Proposition 3 and the definition of relaxation time as the reciprocal of the spectral gap then imply the statement to be proved. □ The theorem completely characterizes the relaxation times of the Gibbs sampler and the collapsed Gibbs sampler for designs with balanced cells. As to the computational cost of the algorithms, we find that each requires an |${O}(N)$| computation at initialization to precompute data averages. In the Appendix we show that the two algorithms have the same cost per iteration, which is proportional to the number of parameters, |$p$|⁠. Therefore, the collapsed Gibbs sampler is an |${O}(p)$| implementation of exact sampling from the posterior. We now consider asymptotic regimes. The more classical regime, which we refer to as infill asymptotics, keeps the numbers of factors and levels fixed, hence keeping |$K$| and |$p$| fixed, and increases the number of observations per cell, so that |$N$| grows. The other, more modern, asymptotic regime, referred to as outfill asymptotics, has |$p$| increasing with |$N$|⁠, for example treating the observations per cell as bounded while increasing the number of levels and/or number of factors. It is this type of asymptotic regime that is of greater interest in recommendation applications. Regardless of the asymptotic regime considered, the relaxation time of the collapsed Gibbs sampler is |${O}(1)$|⁠. On the other hand, that of the Gibbs sampler depends on the regime under consideration. In infill asymptotics, Theorem 2 implies that the relaxation time of the algorithm is |${O}(N)$|⁠. Intuition for this deterioration of the algorithm with increasing dataset size can be obtained by considering the analysis of noncentred parameterizations for hierarchical models as in Papaspiliopoulos et al. (2007, § 2); the parameterization of the crossed effects model is noncentred and the infill asymptotics regime makes the data increasingly informative per random effect, so one should anticipate the deterioration. Therefore, in this regime the complexity of both algorithms is |${O}(N)$|⁠, but in practice the collapsed version will be much more efficient. In outfill asymptotics, both |$N$| and the number of factor levels |$I_k$| are growing, so by Theorem 2 the relaxation time of the Gibbs sampler is no worse than |${O}(N)$|⁠, but no better than |${O}(N^{1-1/K})$|⁠. The lower bound on the relaxation time can be deduced from the balanced-cells design assumption, which implies |$\prod_{k=1}^KI_k\leq N$| and |$\min_{k}I_k\leq N^{1/K}$|⁠; the bound is achievable when |$I_1=\dots=I_K$|⁠. On the other hand, the number of parameters can grow as different powers of |$N$|⁠. 
For example, if the numbers of levels for all but one factor are fixed and those of the remaining factor are increasing, e.g., a fixed number of customers and an increasing number of products, then |$p$| is |${O}(N)$| and the relaxation time of the Gibbs sampler is also |${O}(N)$|⁠, resulting in a Gibbs sampler complexity of |${O}(N^{2})$|⁠, compared with |${O}(N)$| for the collapsed Gibbs sampler. 5.4. Complexity analysis for designs with balanced levels The strategy of obtaining complexity results for designs with balanced levels is the same as for designs with balanced cells, and Proposition 3 is again instrumental. However, in this case the analysis is much more complicated because the second timewise transformation, |$\{\delta a(t)\}$|⁠, no longer samples independently from its invariant distribution; in fact, its invariant distribution does not factorize as in the case of balanced cells. Nonetheless, Lemma 2 and Proposition 3 immediately imply the following lower bound on the relaxation time of the Gibbs sampler. Theorem 3. For designs with balanced levels, the relaxation time of the Gibbs sampler is at least |$1+\max_{k=1,\dots,K}\{ N\tau_0/(I_k\tau_k)\}$|⁠. From Proposition 3 we also know that the rate of the collapsed Gibbs sampler is that of |$\{\delta a(t)\}$|⁠. Therefore, obtaining explicit rates of convergence for |$\{\delta a(t)\}$| is a step needed for characterizing the relaxation times of both algorithms in designs with balanced levels. This is done for |$K=2$| in Proposition 4 below. Our theory is based on an auxiliary process |$\{i(t)\}$| with discrete state space |$\{1,\dots,I_1\}\times \{1,\dots,I_2\}$| that evolves according to a two-component Gibbs sampler, iteratively updating |$i_1\mid i_2$| and |$i_2\mid i_1$|⁠, with invariant distribution |$p(i_1,i_2)= n_{i_1i_2}/N$|⁠. Proposition 4. For designs with balanced levels where |$K=2$|⁠, the rate of convergence of the Markov chain |$\{\delta a(t)\}$| is $$\begin{align*} \frac{N\tau_{0}}{N\tau_{0}+I_1\tau_{1}} \frac{N\tau_{0}}{N\tau_{0}+I_2\tau_{2}}\, \rho_{\rm aux}, \end{align*}$$where |$\rho_{\rm aux}$| is the rate of convergence of the auxiliary Gibbs sampler |$\{i(t)\}$|⁠. Proof. The chain |$\{\delta{a}(t)\}$| is a two-component Gibbs sampler that alternates updates from |$\mathcal{L}(\delta{a}^{(1)} \mid y,\delta{a}^{(2)})$| and |$\mathcal{L}(\delta{a}^{(2)} \mid y,\delta{a}^{(1)})$|⁠. Thus, |$\{\delta{a}^{(1)}(t)\}$| is marginally a Markov chain and its rate of convergence equals that of |$\{\delta{a}(t)\}$|⁠; see, for instance, Roberts & Rosenthal (2001). Let |$B_1$| and |$B_2$| be defined by |${E}(\delta{a}^{(1)}\mid \delta{a}^{(2)},y)=B_1 \delta{a}^{(2)}+b_1$| and |${E}(\delta{a}^{(2)}\mid \delta{a}^{(1)},y)=B_2\delta{a}^{(1)}+b_2$|⁠. Then a simple computation shows that |$\delta{a}^{(1)}(t)$| is a Gaussian autoregressive process with autoregression matrix |$B_1B_2$|⁠. Since for designs with balanced levels we have that |$n^{(k)}_j \tau_0 /( n^{(k)}_j \tau_0 + \tau_k) =r_k$|⁠, it can be deduced from (4) and (7) that |$B_1=-r_1 P_1$|⁠, where the |$I_1\times I_2$| matrix |$P_1$| is the transition kernel of the update |$i_2\mid i_1$| of the auxiliary process. Similarly, one can show that |$B_2=-r_2P_2$|⁠, where the |$I_2\times I_1$| matrix |$P_2$| is the transition kernel of the update |$i_1\mid i_2$| of the auxiliary process. Hence, the autoregressive matrix of |$\delta{a}^{(1)}(t)$| is |$r_1r_2 P_1P_2$|⁠, where |$P_1P_2$| is the transition kernel of the auxiliary Gibbs sampler |$\{i(t)\}$|⁠. 
Consequently, the spectrum of the autoregressive matrix consists of the values |$r_1 r_2 \lambda_i$|, where |$\lambda_i$| are the eigenvalues of |$P_1 P_2$|. The largest |$|\lambda_i|$| is of course 1, since |$P_1P_2$| is a stochastic matrix. However, because |$\delta a^{(1)}$| is constrained to have zero sum, by Lemma A1 in the Appendix the rate of convergence of |$\delta{a}^{(1)}(t)$| is not given by the eigenvalue of largest modulus of the autoregressive matrix, but rather by the eigenvalue of largest modulus whose eigenvector has zero sum, i.e., we need only consider the subspace orthogonal to the vector of ones. Therefore, the rate of convergence of |$\{\delta{a}(t)\}$| equals |$r_1r_2$| times the eigenvalue of |$P_1P_2$| with second largest modulus, which is |$\rho_{\rm aux}$| by definition. □

With Proposition 4 in place, the main result follows immediately.

Theorem 4. For designs with balanced levels where |$K=2$|, the rates of convergence of the Gibbs sampler and the collapsed Gibbs sampler are, respectively, $$\begin{align*} \max\!\left\{ \frac{N\tau_{0}}{N\tau_{0}+I_1\tau_{1}} ,\: \frac{N\tau_{0}}{N\tau_{0}+I_2\tau_{2}} \right\} ,\quad \frac{N\tau_{0}}{N\tau_{0}+I_1\tau_{1}} \frac{N\tau_{0}}{N\tau_{0}+I_2\tau_{2}} \,\rho_{\rm aux}, \end{align*}$$ where |$\rho_{\rm aux}$| is the rate of convergence of the auxiliary Gibbs sampler |$\{i(t)\}$| with invariant distribution |$p(i_1,i_2)= n_{i_1i_2}/N$|.

If the design is in fact one with balanced cells, the rates given in Theorem 4 match those of Theorem 2, since |$\rho_{\rm aux}=0$| in this case. A corollary of this theorem is that the relaxation time of the Gibbs sampler is |$1+\max_{k=1,2}\{N\tau_0/(I_k\tau_k)\}$| and that of the collapsed Gibbs sampler is no larger than |$1+\min\{N\tau_0/(I_1\tau_1),\,N\tau_0/(I_2\tau_2),\,T_{\rm aux}\}$|, where |$T_{\rm aux}$| is the relaxation time of the auxiliary process |$\{i(t)\}$|. An implication of this is that the collapsed Gibbs sampler is never slower than the standard Gibbs sampler, and has good mixing both when the amount of data per level is low and when it is high. To see this, observe first that the ratios |$N/I_1$| and |$N/I_2$| coincide with the numbers of data points per column and row, respectively, in the data incidence matrix with entries |$n_{i_1i_2}$|, and thus their values increase as the amount of data per level increases. In contrast, the relaxation time |$T_{\rm aux}$| of the auxiliary process |$\{i(t)\}$| tends to decrease as the amount of data per level increases, because the latter corresponds to adding more edges to the conditional independence graph, and hence to greater connectivity in the state space of the auxiliary process. Unfortunately, it is not true in general that the minimum of |$N\tau_0/(I_1\tau_1)$|, |$N\tau_0/(I_2\tau_2)$| and |$T_{\rm aux}$| is uniformly bounded in |$N$|. Consider, for example, a design where users and items are split into two communities of equal size, and users inside each community have rated all items from their community and no item from the other community. In this case the random walk |$\{i(t)\}$| is reducible. Therefore |$T_{\rm aux}=\infty$|, and provided both |$N/I_1$| and |$N/I_2$| go to infinity, the relaxation time of the collapsed Gibbs sampler diverges as |$N$| goes to infinity. We now address the case where the number of factors |$K$| is greater than 2, which is not covered by Theorem 4. A conjecture we make is that |$1+\max_{k=1,\dots,K}\{N\tau_0/(I_k\tau_k)\}$| is the relaxation time of the Gibbs sampler also for |$K>2$|.
We have conducted extensive numerical experiments, since for specific examples we can compute the relaxation time by calculating numerically the largest eigenvalue of an explicit matrix, and have not been able to find a counterexample. The missing step in obtaining a generic result would be to show that |$\{\delta a(t)\}$| always mixes faster than |$\{\bar{a}(t)\}$|. Such a result would also immediately prove, owing to Proposition 3, that the collapsed Gibbs sampler has a lower relaxation time than the Gibbs sampler for an arbitrary number of factors for designs with balanced levels. On the other hand, numerical experimentation has also shown that certain extensions of Theorem 4 are not true. We know that the convergence rate of the collapsed Gibbs sampler can be greater than |$\prod_{k=1,\dots,K}\{N\tau_0/(N\tau_0+I_k\tau_k)\}$| when |$K>2$|; we also know that the rate will depend on the order in which the different components are updated. We shall return to these points in § 7.

We close this section with some asymptotic considerations on the complexity. In the following arguments we assume that the relaxation time of the Gibbs sampler is the conjectured |$1+\max_{k=1,\dots,K}\{N\tau_0/(I_k\tau_k)\}$|; we do not consider the collapsed Gibbs sampler here since we do not have a conjecture about its rate in the |$K>2$| case. The asymptotic behaviour of the Gibbs sampler relaxation time depends on the regime under consideration, as was the case for designs with balanced cells. The relaxation time can be as bad as |${O}(N)$|, for example if the number of levels of at least one factor is fixed as |$N$| grows; it can be |${O}(N^{1-1/K})$| in the regime where |$I_1=\cdots=I_K$| and |$N={O}(I_1^K)$|; but it can also be |${O}(1)$| in the sparse observation regime where |$N=I_1=\cdots=I_K$|. In the Appendix we discuss the computational cost per iteration, which for these designs can grow quadratically with the number of parameters, as opposed to linearly in the case of balanced cells. Its growth with the number of observations can be |${O}(1)$| in infill asymptotic regimes where the number of levels of factors does not grow with |$N$|; it can be |${O}(N^{2/K})$| when |$I_1=\cdots=I_K$| and |$N={O}(I_1^K)$|; but it can also be |${O}(N)$| in the sparse regime where |$N=I_1=\cdots=I_K$|. Making a connection to the observation in Gao & Owen (2017), we obtain that for |$K=2$|, when |$N={O}(I_1^2)$| and |$I_1=I_2$|, the complexity of the Gibbs sampler is |${O}(N^{3/2})$| and so the algorithm is not scalable.

6. Simulation studies

6.1. Simulated data with missingness completely at random
First we consider simulated data with |$K=2$| and |$I_1=I_2$|. We assume the data to be missing completely at random, where for each combination of factors we observe a data point, i.e., |$n_{i_1i_2}=1$|, with probability 0.1 independently of the rest, and otherwise we have a missing observation, i.e., |$n_{i_1i_2}=0$|. Since the relaxation time of the samplers under consideration does not depend on the value of the observations |$y$|, but only on their presence or absence, we can set |$y_{i_1i_2}=0$| without affecting the computed convergence rates. Our theory does not apply directly to this context because the designs under consideration are not balanced in general. However, we can still compute numerically the convergence rate of the Gibbs sampler and its collapsed version in the context of known precisions, using the results discussed in § 4.3, and thus explore to what extent the qualitative findings of our theory still hold.
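The following R sketch illustrates this numerical computation for a single missing-completely-at-random design. It is our own illustration rather than the authors' code: the sizes, the seed and the assumption that no level is empty are ours, and the theoretical predictions are evaluated from Theorem 4 as if the design had balanced levels.

```r
# Sketch (our illustration, not the authors' code): exact relaxation time of
# the Gibbs sampler via the autoregressive matrix of § 4.3, compared with the
# balanced-levels predictions of Theorem 4, for one MCAR design with K = 2.
set.seed(2)
I1 <- 50; I2 <- 50; tau0 <- tau1 <- tau2 <- 1
n <- matrix(rbinom(I1 * I2, 1, 0.1), I1, I2)          # incidence matrix n_{i1 i2}
stopifnot(all(rowSums(n) > 0), all(colSums(n) > 0))   # assume no empty levels
N <- sum(n); n1 <- rowSums(n); n2 <- colSums(n)

# Posterior precision matrix Q of (a^(0), a^(1), a^(2)) with fixed precisions
j1 <- 1 + (1:I1); j2 <- 1 + I1 + (1:I2)               # parameter indices
Q <- matrix(0, 1 + I1 + I2, 1 + I1 + I2)
Q[1, 1] <- N * tau0
Q[1, j1] <- Q[j1, 1] <- n1 * tau0
Q[1, j2] <- Q[j2, 1] <- n2 * tau0
Q[j1, j2] <- n * tau0; Q[j2, j1] <- t(n) * tau0
Q[cbind(j1, j1)] <- n1 * tau0 + tau1
Q[cbind(j2, j2)] <- n2 * tau0 + tau2

# Autoregressive matrix of the systematic-scan Gibbs sampler (Roberts & Sahu, 1997):
# writing Q = L + D + U (strict lower, diagonal, strict upper), B = -(L + D)^{-1} U
L <- Q; L[upper.tri(L, diag = TRUE)] <- 0
D <- diag(diag(Q))
B <- -solve(L + D, Q - L - D)
rate_gs  <- max(Mod(eigen(B, only.values = TRUE)$values))
relax_gs <- 1 / (1 - rate_gs)                         # numerically computed relaxation time

# Predictions of Theorem 4, evaluated as if the design had balanced levels
r1 <- N * tau0 / (N * tau0 + I1 * tau1)
r2 <- N * tau0 / (N * tau0 + I2 * tau2)
P1 <- n / n1; P2 <- t(n) / n2                         # kernels of i2 | i1 and i1 | i2
lam <- sort(Mod(eigen(P1 %*% P2, only.values = TRUE)$values), decreasing = TRUE)
rho_aux <- lam[2]                                     # second-largest modulus eigenvalue
c(gs_numerical  = relax_gs,
  gs_predicted  = 1 / (1 - max(r1, r2)),
  cgs_predicted = 1 / (1 - r1 * r2 * rho_aux))
```

This is the comparison underlying Fig. 1(a), here for a single design rather than a sequence of growing ones.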
Figure 1(a) displays the behaviour of the relaxation time of the Gibbs sampler and of its collapsed version in an outfill asymptotics regime, where both the number of data points and the number of factor levels increase. For the simulations we fixed the precision terms |$\tau_k$| at 1 and took |$I_1\in \{50,500,1000,2000\}$|⁠. The results suggest that the relaxation time of the Gibbs sampler diverges with |$N$|⁠, while the relaxation time of its collapsed version converges to 1 as |$N$| increases. This is consistent with the theoretical results of the previous section. In fact, we can compare the relaxation times computed numerically with the theoretical values calculated as if the design had balanced levels, which of course it does not here. Figure 1(a) shows an extremely close match, illustrating that the applicability of our theory actually extends beyond the specific designs that motivated the analysis. This suggests that the theory is relevant beyond situations that have strictly balanced levels. Since the cost per iteration of both samplers is |${O}(N)$|⁠, the results in Fig. 1(a) indicate that for the asymptotic regime considered in this section, the computational complexity of the Gibbs sampler is |${O}(N^{3/2})$| and that of the collapsed Gibbs sampler is |${O}(N)$|⁠. 6.2. ETH instructor evaluations dataset We now consider a real dataset containing evaluations of university lectures by students at ETH Zürich. The dataset is freely available as part of the R (R Development Core Team, 2020) package lme4 (Bates et al., 2015) under the name InstEval. It contains 73 421 observations, each consisting of a score from 1 to 5 assigned to a lecture together with six factors that potentially affect the score, such as the identity of the student giving the rating or the department offering the course; see the lme4 help material for more details about the dataset. We fit model (1) to the InstEval data. Following the notation in (1), we have |$N=73\,421$|⁠, |$K=6$| and |$(I_1,\dots,I_K)=(2972,1128,4,6,2,14)$|⁠. Clearly, a categorical response calls for a generalized linear model extension of (1); however, the point of this analysis is to test the algorithms, and (1) is not an outright unreasonable model to fit for this dataset. First we consider the case where the values of the precisions |$\{\tau_k\}$| are assumed to be known. As in § 6.1, we compute the true relaxation times numerically and compare them with the theoretical predictions implied by Theorem 4. The results for various combinations of factors are reported in the Supplementary Material. The relaxation time of the collapsed Gibbs sampler is up to three orders of magnitude smaller than that of the Gibbs sampler. In all cases considered, the theoretical predictions closely match the actual values computed numerically. Next consider the case of unknown precisions, where the hyperparameters |$\tau_k$| are given a prior distribution and the posterior of interest is the joint distribution of |$a$| and |$\tau=(\tau_0,\dots,\tau_K)$|⁠. We consider two popular prior specifications for |$\tau$|⁠, the first being a flat prior |$p(\tau_k^{-1/2})\propto 1$| and the second a half-Cauchy prior |$\tau_k^{-1/2}\,{\sim}\,\hbox{Cauchy}^+(0,1)$|⁠; see Gelman (2006) and Polson & Scott (2012) for a discussion. Under both prior specifications, we consider five Markov chain Monte Carlo schemes. 
The first two schemes alternate sampling |$\tau$| from the conditional distribution |$\mathcal{L}(\tau \mid a)$| and updating |$a$| with the Gibbs sampler and its collapsed version, respectively. In the flat prior case, the exact update |$\tau\sim \mathcal{L}(\tau \mid a)$| is straightforward, whereas in the half-Cauchy case the latter is replaced by a Metropolis–Hastings update. The third and fourth schemes add parameter expansion (Liu & Wu, 1999; Meng & Van Dyk, 1999). See the Supplementary Material for full details on the implementation of these first four schemes. Finally, the fifth scheme is the no-U-turn sampler (Hoffman & Gelman, 2014), a state-of-the-art Hamiltonian Monte Carlo scheme implemented in the R package RStan (Stan Development Team, 2018). To avoid potential issues related to using flat priors with a very low number of factor levels, we excluded the factor with only two levels from the analysis, resulting in |$K=5$| and |$(I_1,\dots,I_K)=(2972, 1128, 4, 6, 14)$|. Table 1 reports runtimes for the five schemes together with effective sample sizes. It can be seen that the first four schemes have similar runtimes, but those using the collapsed method proposed in this paper induce much faster mixing than the others. In this example the use of parameter expansion has little effect on mixing, providing some improvement in the flat prior case and some deterioration in the half-Cauchy case. Hamiltonian Monte Carlo has a cost per iteration that is two orders of magnitude greater than the other schemes, resulting in the lowest effective sample sizes per unit of computation time. Figure 1(b) displays autocorrelation functions versus CPU time for the Gibbs samplers for two parameters, |$a^{(0)}$| and |$\sigma_3$|, again showcasing the much improved mixing of the collapsed method. The Supplementary Material contains figures displaying autocorrelation functions for all the other parameters.

Table 1. Comparison of sampling schemes on the InstEval data: the upper five rows correspond to flat priors for |$\sigma_k=1/\surd{\tau_k}$| and the lower five rows to half-Cauchy priors; values are averaged over 10 runs of 10 000 iterations for each scheme, with the first 1000 samples discarded as burn-in. The second column gives the time (s) per 1000 iterations; the third and fourth columns give effective sample sizes per unit time (s|$^{-1}$|) for |$(a^{(0)},\,\bar{a}^{(1)},\,\bar{a}^{(2)},\,\bar{a}^{(3)},\,\bar{a}^{(4)},\,\bar{a}^{(5)})$| and for |$(\sigma_0,\sigma_1,\sigma_2,\sigma_3,\sigma_4,\sigma_5)$|, respectively.

Flat priors:
GS       13.2    (0.07, 11.0, 2.12, 0.16, 0.21, 0.87)   (60.9, 15.9, 36.8, 3.56, 2.53, 2.14)
GS+PX    13.5    (0.06, 10.7, 2.02, 0.08, 0.11, 0.95)   (59.6, 41.2, 43.9, 0.85, 0.58, 2.33)
HMC      1112.6  (0.11, 0.78, 0.19, 0.08, 0.25, 0.68)   (0.99, 0.51, 0.99, 0.10, 0.41, 0.18)
cGS      14.2    (65.9, 42.1, 18.1, 70.5, 62.3, 35.0)   (55.1, 14.7, 34.5, 17.7, 33.9, 2.51)
cGS+PX   14.4    (62.5, 44.1, 19.9, 62.9, 63.2, 34.6)   (55.1, 38.0, 41.9, 19.2, 33.0, 2.96)

Half-Cauchy priors:
GS       15.0    (0.07, 9.87, 1.86, 0.16, 0.23, 0.85)   (51.6, 14.1, 31.5, 1.33, 3.84, 2.19)
GS+PX    13.8    (0.07, 10.9, 1.85, 0.07, 0.18, 0.84)   (56.9, 39.9, 43.6, 0.80, 0.70, 1.71)
HMC      1186.9  (0.08, 0.30, 0.08, 0.10, 0.08, 0.24)   (0.79, 0.27, 0.55, 0.18, 0.14, 0.09)
cGS      15.5    (74.0, 39.0, 17.7, 77.6, 57.2, 31.6)   (51.8, 13.7, 30.1, 4.84, 15.0, 2.40)
cGS+PX   14.6    (59.1, 43.7, 18.8, 63.2, 61.6, 33.2)   (54.7, 38.0, 40.4, 0.72, 0.92, 1.88)

GS, Gibbs sampler; cGS, collapsed Gibbs sampler with precision updates; +PX, combination with the parameter expansion method; HMC, the RStan implementation of the no-U-turn sampler.

Finally, to gain a higher-level sense of the practicality of the approach proposed in this article, we also fitted the same crossed effects model in a frequentist fashion using the R package lme4, which took 40.9 seconds to run. All computations were performed on the same desktop computer with 16 GB of RAM and an i7 Intel processor. The first four schemes were implemented directly in the high-level language R, so significant further speed-ups could be expected if one were to use a low-level language and distributed computing for the precomputations needed for the Gibbs samplers.
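For readers who wish to reproduce the frequentist benchmark, the following sketch shows the kind of lme4 call involved; the formula and the choice of InstEval columns are our reconstruction of the set-up with the five factors retained in this section, not the authors' exact code.

```r
# Illustrative sketch of the frequentist fit mentioned above; the formula and
# the InstEval column names used are our assumptions about the set-up.
library(lme4)
data(InstEval)   # 73,421 lecture evaluations shipped with lme4
fit <- lmer(y ~ 1 + (1 | s) + (1 | d) + (1 | studage) + (1 | lectage) + (1 | dept),
            data = InstEval)
summary(fit)     # estimated variance components correspond to 1/tau_k in (1)
```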
7. Discussion
There are many future directions this work can take. One of the most pressing questions is to investigate the conjecture made in § 5.4 that the relaxation time of the Gibbs sampler for designs with balanced levels is |$1+\max_{k=1,\dots,K}\{N\tau_0/(I_k\tau_k)\}$|. If this is true, we would also obtain that the collapsed Gibbs sampler always has a smaller rate for such designs. Another interesting direction is to obtain a characterization of the rate of the collapsed Gibbs sampler for such designs when |$K>2$|. From numerical experimentation we know that the natural extension of the expression in Theorem 4 does not hold for |$K>2$|, so a different line of attack is needed. Two other important directions of future research, which would help provide a clearer picture of the scalability of likelihood-based inference for crossed effects models, are to gain a theoretical understanding of the extent to which sparse linear algebra methods can reduce the computational complexity of the matrix operations involved in the exact marginalizations of the space of factors, and to develop rigorous complexity results for the case of fully Bayesian inferences with unknown variances.

Acknowledgement
The authors acknowledge helpful discussions with Art B. Owen. The first author was supported by the Spanish Ministry of Economics, the second author was supported by the U.K. Engineering and Physical Sciences Research Council, and the third author was supported by the European Research Council and by the Ministry of Education, Universities and Research.

Supplementary material
Supplementary material available at Biometrika online includes proofs of Lemmas 1, 2 and A1, additional figures and details for the simulation study in § 6.2, and R code to implement the Markov chain Monte Carlo schemes.
Appendix

Proof of Theorem 1

For concreteness, and without affecting the validity of the argument, we assume that the algorithm updates factors and their levels in ascending order, i.e., it first simulates |$a^{(0)}$|, then |$a^{(1)}_1$|, |$a^{(1)}_2$| and so on. We first establish the result for the Gibbs sampler. Owing to the conditional independence structure, the algorithm can be represented equivalently as one that samples in blocks according to the conditional laws |$\mathcal{L}(a^{(k)} \mid y,a^{(-k)})$|. For each iteration |$t$|, each such draw |$a^{(k)}(t)$| can be transformed to |$\bar{a}^{(k)}(t)$| and |$\delta a^{(k)}(t)$|. Proposition 1 gives $$\begin{align*} & \mathcal{L}\{\bar{a}^{(k)}(t),\delta a^{(k)}(t) \mid y,a^{(0)}(t),\ldots,a^{(k-1)}(t),a^{(k+1)}(t-1),\ldots,a^{(K)}(t-1)\} \nonumber\\ & = \mathcal{L}\{\bar{a}^{(k)}(t) \mid y,\bar{a}^{(0)}(t),\ldots,\bar{a}^{(k-1)}(t),\bar{a}^{(k+1)}(t-1),\ldots,\bar{a}^{(K)}(t-1)\}\\ & \quad \times \mathcal{L}\{\delta a^{(k)}(t) \mid y,\delta a^{(1)}(t),\ldots,\delta a^{(k-1)}(t),\delta a^{(k+1)}(t-1),\ldots,\delta a^{(K)}(t-1)\}\text{.}\nonumber \end{align*}$$ Appealing to the equivalence of local and global Markov properties, as in Besag (1974, § 3), the processes |$\{\bar{a}(t)\}$| and |$\{\delta a(t)\}$|, obtained as functions of |$\{a(t)\}$|, are each a Markov chain with respect to its own filtration and are independent of each other.

The case of the collapsed Gibbs sampler is analogous. Here the sampler iterates the updates of |$\mathcal{L}(a^{(k)} \mid y,a^{(-0,-k)})$| for |$k=1,\dots,K$|. It follows easily from Proposition 1 that |$\mathcal{L}(\bar{a}^{(-0)},\delta a \mid y)=\mathcal{L}(\bar{a}^{(-0)} \mid {y})\, \mathcal{L}(\delta a \mid {y})$|. Therefore, by transforming each draw |$a^{(k)}(t)$| to |$\bar{a}^{(k)}(t)$| and |$\delta a^{(k)}(t)$|, we obtain $$\begin{align*} & \mathcal{L}\{\bar{a}^{(k)}(t),\delta a^{(k)}(t) \mid y,a^{(1)}(t),\ldots,a^{(k-1)}(t),a^{(k+1)}(t-1),\ldots,a^{(K)}(t-1)\} \\ & =\mathcal{L}\{\bar{a}^{(k)}(t) \mid y,\bar{a}^{(1)}(t),\ldots,\bar{a}^{(k-1)}(t),\bar{a}^{(k+1)}(t-1),\ldots,\bar{a}^{(K)}(t-1)\} \\ &\quad\times \mathcal{L}\{\delta a^{(k)}(t) \mid y,\delta a^{(1)}(t),\ldots,\delta a^{(k-1)}(t),\delta a^{(k+1)}(t-1),\ldots,\delta a^{(K)}(t-1)\}\text{.} \end{align*}$$ It follows that the processes |$\{\bar{a}^{(-0)}(t)\}$| and |$\{\delta a(t)\}$|, obtained as functions of |$\{a(t)\}$|, are each a Markov chain with respect to its own filtration and are independent of each other.

Lemma A1. Let |$\{x(t)\}$| be a |$d$|-dimensional Gaussian AR|$(1)$| process with |${E}\{x(t+1)\mid x(t)\}=Bx(t)+b$| for some fixed |$b$| and stationary distribution |$\mathcal{N}(\mu,\Sigma)$| concentrated on the hyperplane |$\sum_{i} x_i=0$|, with |$\Sigma$| of rank |$d-1$|. Then the rate of convergence of |$\{x(t)\}$| equals the eigenvalue of |$B$| of largest modulus among those whose eigenvectors have zero sum.

Cost per iteration of the Gibbs sampler and its collapsed version

Implementing the Gibbs sampler requires computing the one- and two-dimensional marginals |$\{n^{(k)}\}$| and |$\{n^{(l,k)}\}$| of the data incidence table, as well as the weighted averages |$\{\tilde{y}^{(k)}_j\}$| of the data. This precomputation needs to be performed only once and requires |${O}(N)$| operations in general.
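As an illustration of this precomputation, the sketch below performs a single |$O(N)$| pass over data stored in long format. The data frame dat, the factor column names and the use of plain per-level means for the |$\tilde{y}^{(k)}_j$| quantities are assumptions made for illustration; the exact weighting follows the definitions in the main text.

```r
# Minimal sketch of the O(N) precomputation for the Gibbs samplers, assuming the
# data are in long format: a data frame 'dat' with one row per observation, factor
# columns named in 'factors' and a numeric response y. All names are hypothetical.
factors <- c("f1", "f2")                                            # K = 2 factors for illustration

n_k  <- lapply(factors, function(f) table(dat[[f]]))                # one-dimensional marginals n^(k)
n_kl <- table(dat[[factors[1]]], dat[[factors[2]]])                 # two-dimensional marginal n^(1,2)
ytil <- lapply(factors, function(f) tapply(dat$y, dat[[f]], mean))  # per-level averages of the data
names(n_k) <- names(ytil) <- factors
```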
Then, at each iteration of the Gibbs sampler, the update of |$a^{(0)}$| and of each |$a^{(k)}_j$| can be accomplished in |${O}(\sum_{l}I_l)$| and |${O}(\sum_{l\neq k}I_l)$| operations using (2) and (4), respectively, resulting in a total of |${O}(\sum_{k}I_k\sum_{l\neq k}I_l)$| operations per Gibbs sweep. The latter can be as bad as |${O}(p^2)$|, where |$p=\sum_k I_k$| is the number of parameters; its relationship with |$N$| depends on the asymptotic regime under consideration. For the collapsed Gibbs sampler one additionally needs to precompute |$\{s^{(k)}\}$|, defined in Proposition 2, which can be done in |${O}(p)$| operations given |$\{n^{(k)}\}$|. Therefore the collapsed Gibbs sampler has a precomputation cost of order |${O}(N)$|, as for the standard Gibbs sampler. Moreover, the updates of |$a^{(0)}$| and |$a^{(k)}$| from (8) for |$k=1,\dots,K$| require |${O}(\sum_{k}I_k\sum_{l\neq k}I_l)$| operations in total, which is at most |${O}(p^2)$|. Thus the collapsed Gibbs sampler has the same cost per iteration as the standard Gibbs sampler. In the balanced-cells case, the only precomputation required is that of |$\{\tilde{y}^{(k)}_j\}$|, which has |${O}(N)$| cost. Furthermore, each Gibbs or collapsed Gibbs sweep can then be accomplished in |${O}(p)$| rather than |${O}(p^2)$| operations, using (3), (5) and the balanced-cells version of (8).

References

Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015). Fitting linear mixed-effects models using lme4. J. Statist. Software 67, 1–48.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. R. Statist. Soc. B 36, 192–236. With discussion by D. R. Cox, A. G. Hawkes, P. Clifford, P. Whittle, K. Ord, R. Mead, J. M. Hammersley and M. S. Bartlett, and a reply by the author.

Gao, K. & Owen, A. (2017). Efficient moment calculations for variance components in large unbalanced crossed random effects models. Electron. J. Statist. 11, 1235–96.

Gao, K. & Owen, A. B. (2019). Estimation and inference for very large linear mixed effects models. Statist. Sinica, to appear.

Gelman, A. (2005). Analysis of variance—why it is more important than ever. Ann. Statist. 33, 1–53. With discussions and a rejoinder by the author.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 1, 515–33.

Hoffman, M. D. & Gelman, A. (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15, 1593–623.

Johnson, L. T. & Geyer, C. J. (2012). Variable transformation to obtain geometric ergodicity in the random-walk Metropolis algorithm. Ann. Statist. 40, 3050–76.

Liu, J. S. & Wu, Y. N. (1999). Parameter expansion for data augmentation. J. Am. Statist. Assoc. 94, 1264–74.

Meng, X.-L. & Van Dyk, D. A. (1999). Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika 86, 301–20.

Papaspiliopoulos, O., Roberts, G. O. & Sköld, M. (2003). Non-centered parameterizations for hierarchical models and data augmentation.
In Bayesian Statistics 7 (Tenerife, 2002). New York: Oxford University Press, pp. 307–26. With a discussion by A. E. Gelfand, O. F. Christensen and D. J. Wilkinson, and a reply by the authors.

Papaspiliopoulos, O., Roberts, G. O. & Sköld, M. (2007). A general framework for the parametrization of hierarchical models. Statist. Sci. 22, 59–73.

Polson, N. G. & Scott, J. G. (2012). On the half-Cauchy prior for a global scale parameter. Bayesian Anal. 7, 887–902.

R Development Core Team (2020). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0. http://www.R-project.org.

Roberts, G. O. & Rosenthal, J. S. (2001). Markov chains and de-initializing processes. Scand. J. Statist. 28, 489–504.

Roberts, G. O. & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. J. R. Statist. Soc. B 59, 291–317.

Stan Development Team (2018). RStan: The R Interface to Stan. R package version 2.17.3. http://mc-stan.org/.

Volfovsky, A. & Hoff, P. D. (2014). Hierarchical array priors for ANOVA decompositions of cross-classified data. Ann. Appl. Statist. 8, 19–47.

Zanella, G. & Roberts, G. (2019). Multilevel linear models, Gibbs samplers and multigrid decomposition. arXiv: 1703.06098v2.

© 2019 Biometrika Trust. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).

TI - Scalable inference for crossed random effects models
JF - Biometrika
DO - 10.1093/biomet/asz058
DA - 2020-03-01
UR - https://www.deepdyve.com/lp/oxford-university-press/scalable-inference-for-crossed-random-effects-models-dDD6h81hqL
SP - 25
VL - 107
IS - 1
DP - DeepDyve
ER -