# Monte Carlo local likelihood approximation

Monte Carlo local likelihood approximation Summary We propose the Monte Carlo local likelihood (MCLL) method for approximating maximum likelihood estimation (MLE). MCLL initially treats model parameters as random variables, sampling them from the posterior distribution as in a Bayesian model. The likelihood function is then approximated up to a constant by fitting a density to the posterior samples and dividing the approximate posterior density by the prior. In the MCLL algorithm, the posterior density is estimated using local likelihood density estimation, in which the log-density is locally approximated by a polynomial function. We also develop a new method that allows users to efficiently compute standard errors and the Bayes factor. Two empirical and three simulation studies are provided to demonstrate the performance of the MCLL method. 1. Introduction Likelihood-based inference for latent variable and generalized linear mixed models (GLMMs) is challenging as it involves solving (often high-dimensional) integrals that have no closed-form solutions. Various methods have been developed, but each has its own limitations. For instance, the Laplace approximation (Wolfinger, 1993) is computationally efficient but performs poorly when cluster sizes are small and random-effect variances are large (Joe, 2008). Adaptive quadrature (Rabe-Hesketh and others, 2005) is more accurate than the Laplace approximation but computationally infeasible when there are many random effects. Monte Carlo (MC) methods using many samples (e.g. MC expectation maximization (McCulloch, 1997)) tend to be computationally inefficient whereas the efficiency of single sample methods (e.g. MC maximum likelihood (Sung and Geyer, 2007)) depends critically on the choice of the importance sampling distribution. Data cloning (Lele and others, 2010) has been presented as a way to obtain approximate MLEs by utilizing a Bayesian framework, but this approach is usually computationally impractical because of the necessity to create very large datasets through cloning. In this article, we propose a new estimation technique utilizing Bayesian computations for approximate ML inference. Bayesian Markov chain Monte Carlo (MCMC) methods do not require integration of the likelihood over the random effects and therefore circumvent the major challenge of ML estimation. Our proposed algorithm requires a sufficiently large number of samples of model parameters from the posterior distribution corresponding to a particular prior distribution. We then approximate the likelihood function by estimating the density of the posterior samples of the parameters and dividing this approximate posterior density by the prior. Since we utilize local likelihood density estimation (Hjort and Jones, 1996; Loader, 1996) for approximating the posterior distribution (the log-density is locally approximated by a polynomial), we refer to the proposed method as Monte Carlo local likelihood (MCLL). Our MCLL algorithm builds upon the MC kernel likelihood (MCKL; De Valpine, 2004) method with an important difference that MCLL utilizes local likelihood density estimation instead of kernel density estimation of the posterior density to approximate the likelihood function. Local likelihood density estimation has several advantages over kernel density estimation. Importantly, we show that MCLL allows estimation of standard errors using an efficient method that is insensitive to bandwidth choice. Note that for MCKL, standard error estimation has not been discussed. Whereas MCKL was developed specifically for fitting dynamic population models, our MCLL method is designed to be applicable to a wide class of models; and we demonstrate its use and accuracy for various model types using both real data and simulation studies. Although MCLL was developed for approximate likelihood inference, it turns out that it can easily produce Bayes factors as a by-product of the algorithm. MCLL has been implemented in an R package $$\texttt{mcll}$$ (Jeon and others, 2014) that is easy to use and offers practical default options (such a software implementation is not available for MCKL). After MCMC draws from the posterior have been obtained, $$\texttt{mcll}$$ typically requires little additional time. The rest of this article is structured as follows: in Section 2, we describe the MCLL algorithm with implementation details. We then explain similarities and differences between MCLL and MCKL. In Sections 3 and 4, we present new ways of computing standard errors and Bayes factors, respectively. In Section 5, we provide two empirical applications; a GLMM for binary responses with crossed random effects and a spatial Poisson-gamma convolution model. Simulation studies are presented in Section 6 to demonstrate the performance and utility of our proposed MCLL algorithm compared with exact MLE in a linear mixed model, compared with adaptive quadrature in a GLMM for longitudinal data, and compared with the Laplace approximation for a GLMM with crossed random effects where adaptive quadrature is no longer feasible. We end in Section 7 with some concluding remarks. 2. Monte Carlo local likelihood method 2.1. MCLL procedure For a $$d$$-dimensional parameter $$\boldsymbol{\theta}$$ and observed data vector $$\boldsymbol{y}$$, MCLL begins by obtaining the posterior samples $$\boldsymbol{\theta}^{(s)}$$ ($$s=1,...,S$$) using any MCMC methods given the specified likelihood and prior distributions. Given the posterior samples $$\boldsymbol{\theta}^{(s)}$$ ($$s=1,...,S$$), possibly after sphering (explained below), the next step is to estimate the posterior density by maximizing the local likelihood function,   \begin{align} \hat{l} (\boldsymbol{\theta}, \boldsymbol{a} ) = \sum_{s=1}^{S} K_{ \boldsymbol{h}} (\boldsymbol{\theta}^{(s)} -\boldsymbol{\theta} ) P_{\boldsymbol{a}}(\boldsymbol{\theta}^{(s)} - \boldsymbol{\theta}) - m \int K_{ \boldsymbol{h}} ( \boldsymbol{u} -\boldsymbol{\theta} ) \exp{(P_{\boldsymbol{a}}(\boldsymbol{u}-\boldsymbol{\theta}))} d \boldsymbol{u} , \label{eq:loc2} \end{align} (2.1) with respect to the parameters $$\boldsymbol{a}$$ for a particular $$\boldsymbol{\theta}$$. Here, $$K_{\boldsymbol{h}} (\cdot )$$ is a multivariate kernel and $$P_{\boldsymbol{a}}(\boldsymbol{\theta}^{(s)}-\boldsymbol{\theta})$$ is a quadratic function with parameters $$\boldsymbol{a}$$, given for $$d=2$$ by   \begin{align} \nonumber P_{\boldsymbol{a}}(\boldsymbol{\theta}^{(s)} -\boldsymbol{\theta}) &= a_0 + a_1 (\theta^{(s)}_1 - \theta_1) + a_2 (\theta^{(s)}_2- \theta_2) + \frac{1}{2}a_{1,1} (\theta^{(s)}_1 - \theta_1 )^2 + \frac{1}{2}a_{2,2} (\theta^{(s)}_2 - \theta_2 )^2 \\ &\quad{} + a_{1,2} (\theta^{(s)}_1 - \theta_1) (\theta^{(s)}_2 - \theta_2 ). \label{eq:quad} \end{align} (2.2) Higher-order polynomial functions can also be used, but a quadratic function is sufficient in most situations (Loader, 1999, Ch. 2); and a quadratic function has the same asymptotic bias as a boundary kernel estimator, which corrects for boundary bias of a regular kernel density estimator (Shealther, 2004). We recommend applying a linear transformation to $$\boldsymbol{\theta}^{(s)}$$, called sphering (Wand and Jones, 1993; Duong and Hazelton, 2003). Specifically, $$\boldsymbol{\theta}^{(s)}$$ is mean-centered and premultiplied by the Cholesky decomposition of its empirical covariance matrix so that the transformed parameter draws have an identity covariance matrix and a zero mean vector. For notational simplicity, we refer to the transformed parameter draws simply as $${\boldsymbol{\theta}}^{(s)}$$. Sphering is performed mainly for making the elements of $$\boldsymbol{\theta}$$ approximately independent, so that we can set cross-product terms to zero in $$P_{\boldsymbol{a}}(\boldsymbol{\theta}^{(s)}-\boldsymbol{\theta})$$. How well the resulting quadratic function approximates the log posterior density of the sphered samples near the mode will depend on the extent to which the posterior is multivariate normal, and therefore we expect the approximation to work better for larger sample sizes as the posterior approaches multivariate normality (Walker, 1969). Sphering allows us to simplify the integral in (2.1) as a product of one-dimensional integrals if we also factorize the multivariate kernel function such that   $$\label{eq:prodker} K_{ \boldsymbol{h}} (\boldsymbol{\theta}^{(s)} - \boldsymbol{\theta} ) = \prod_{m=1}^d K_0 \left(\frac{\theta^{(s)}_m - \theta_m }{h_m}\right)\!,$$ (2.3) where we choose a tricube weight function for the univariate kernel, $$K_0 (u) = \frac{ \exp{(-|u|)}} { 1 + |u|^3}$$, but other weight functions can also be used. We utilize a nearest neighbor bandwidth $$h_m$$ in (2.3) to ensure that each kernel evaluation incorporates the same number of observations, namely a nearest neigbor fraction $$\lambda$$. We use $$\lambda= 0.7$$ as the default value (suggested by Loader (1999)), but users can select another value if desirable. Fortunately, local likelihood density estimation is less sensitive to the bandwidth choice (here $$\lambda$$) than kernel density estimation (e.g. Otneim and others, 2013) and our empirical and simulated examples in Sections 5 and 6 also show that the estimates are not overly sensitive to the choice of $$\lambda$$. Let the local likelihood estimate of the posterior density be denoted $$\text{Ps}_p (\boldsymbol{\theta} | \boldsymbol{y})$$. The next step of MCLL is to obtain an approximate likelihood function $$\hat{L}(\boldsymbol{\theta} |\boldsymbol{y} )$$, defined up to a constant $$C$$, as   $$\label{eq:locunweight} \hat{L}(\boldsymbol{\theta} |\boldsymbol{y} ) = \frac{1} {p(\boldsymbol{\theta})} \text{Ps}_p (\boldsymbol{\theta}{\,|\,}\boldsymbol{y}),$$ (2.4) where $$p (\boldsymbol{\theta})$$ is the prior. Then we maximize the approximate likelihood function $$\hat{L}(\boldsymbol{\theta} |\boldsymbol{y} )$$ with respect to the $$d$$-dimensional parameter $$\boldsymbol{\theta}$$. For maximization, we utilize the limited memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS; Byrd and others, 1995) as a default method. 2.2. MCLL versus MCKL The MCLL algorithm is closely related to the unweighted MCKL method proposed by DeValpine (2004). A weighted version of MCKL can also be formulated as an unnormalized, importance-sampled kernel estimate of the true likelihood $$L( \boldsymbol{\theta}|\boldsymbol{y})$$. When a diffuse prior is used, there is reportedly little difference in performance between the weighted and unweighted MCKL versions. De Valpine (2004) explained that MCKL estimates are obtained by scaling kernel mode estimates by a bounded normalizing constant; hence, the mode convergence of kernel estimators can be transferred to MCKL convergence. Similarly, the mode convergence of the local likelihood density estimator can be transferred to MCLL convergence. Although no documented proof of general convergence of mode estimators is yet available for local likelihood density estimation, researchers have shown the asymptotic convergence of a local likelihood density (e.g. Otneim and others, 2013) and the asymptotic mode convergence of a boundary kernel (e.g. Hall and others, 2006). The latter is especially encouraging because a boundary kernel is identical to the local likelihood density estimator with a quadratic polynomial function in terms of asymptotic bias (Shealther, 2004). With a zero-degree polynomial, the convergence for MCKL can directly be applied to MCLL. The essential technical difference between MCLL and MCKL lies in how the posterior density is approximated: MCLL utilizes local likelihood estimation, while MCKL employs kernel density estimation. Local likelihood estimation has important advantages over kernel density estimation (Loader, 1996; Shealther, 2004). The most useful and important strength of local likelihood is that it generates less bias near maxima compared with kernel density estimation (Loader, 1999). This is a critical point because the ultimate goal of both MCKL and MCLL algorithms comes down to accurately finding modes. In fact, De Valpine (2004) proposed two methods for reducing mode bias stemming from kernel smoothing—utilizing a focused prior based on an initial set of parameter estimates (so-called “zooming-in”) and a correction based on posterior cumulants. In an empirical example (in Section 5.1) we show that MCLL estimates are indeed comparable to the MCKL estimates obtained after bias correction. 3. Standard errors In MLE, standard error estimates are usually based on the Hessian of the log-likelihood function evaluated at the MLE. For MCLL, it is also feasible to utilize the Hessian matrix of $$\text{log} \, \hat{L}(\boldsymbol{\theta} |\boldsymbol{y})$$, obtained through numerical differentiation of the estimated log-likelihood function. However, in this case the standard error estimates can be greatly affected by bandwidth choice. For instance, with a small bandwidth, the approximate likelihood surface is “wiggly” and the standard errors are likely to be underestimated, while with a large bandwidth, the approximate likelihood surface is oversmoothed and the standard errors may be overestimated. To avoid this instability, we propose an alternative way of approximating the Hessian for MCLL. First, we write the log-likelihood function as   $$\text{log} \, L(\boldsymbol{\theta}|\boldsymbol{y}) = \text{log} \, p(\boldsymbol{\theta} |\boldsymbol{y}) - \text{log} \, p(\boldsymbol{\theta}) + C, \label{eq:se1}$$ (3.1) where $$\text{log} \, p(\boldsymbol{\theta} |\boldsymbol{y})$$ is the log-posterior, $$\text{log} \, p(\boldsymbol{\theta})$$ is the log-prior, and $$C$$ is a constant. We then obtain the required Hessian as follows:   \begin{align*} \nonumber \frac{\partial^2}{\partial \boldsymbol{\theta}^2} \text{log} \, L(\boldsymbol{\theta}|\boldsymbol{y}) \bigg |_{\hat{\boldsymbol{\theta}}} &= \frac{\partial^2}{\partial \boldsymbol{\theta}^2} \text{log} \, p(\boldsymbol{\theta} |\boldsymbol{y}) \bigg|_{\hat{\boldsymbol{\theta}}} - \frac{\partial^2}{\partial \boldsymbol{\theta}^2} \text{log} \, p(\boldsymbol{\theta}) \bigg|_{\hat{\boldsymbol{\theta}}}. \end{align*} By using $$\hat{L}(\boldsymbol{\theta} |\boldsymbol{y})$$ instead of $$L(\boldsymbol{\theta}|\boldsymbol{y})$$, we obtain   \begin{equation*}\label{eq:hessian} H_{\hat{L}} = H_{\text{Ps}} - H_{\text{Pr}}, \end{equation*} where $$H_{\hat{L}}$$, $$H_{\text{Ps}}$$, and $$H_{\text{Pr}}$$ are the Hessian matrices of the approximate log-likelihood $$\text{log} \, \hat{L}(\boldsymbol{\theta} | \boldsymbol{y} )$$, the log-posterior $$\text{log} \, p(\boldsymbol{\theta} |\boldsymbol{y})$$, and the log-prior $$\text{log} \, p(\boldsymbol{\theta})$$, respectively. An analytical solution is usually available for $$H_{\text{Pr}}$$. $$H_{\text{Ps}}$$ is easy to obtain by differentiating the quadratic function (2.2). If sphering was used, the MCLL estimates and their standard errors need to be back-transformed to the original scale. Because the back-transformation is linear, transformation of the standard errors, which resembles the delta method, does not rely on asymptotic normality of $$\boldsymbol{\theta}$$. Alternatively, a parametric bootstrap method can be applied which does not assume that the specified model is correct or that our approximation of the Hessian is accurate. 4. Bayes factors Bayes factors can be computed as a by-product of the MCLL algorithm. Suppose, we compare the fit of model $$M_1$$ and model $$M_2$$. For $$M_k$$ ($$k=1,2$$), the posterior density of the model parameters at the MCLL estimates $$\hat{\boldsymbol{\theta}}_k$$ is   \begin{align}\label{eq:post_se} p(\hat{\boldsymbol{\theta}}_k| \boldsymbol{y}, M_k) &= p(\hat{\boldsymbol{\theta}}_k| M_k) \frac{ p(\boldsymbol{y}| \hat{\boldsymbol{\theta}}_k, M_k)} {p(\boldsymbol{y} | M_k)}, \end{align} (4.1) where $$p(\hat{\boldsymbol{\theta}}_k| M_k)$$ is the prior density for model $$M_k$$, $$p(\boldsymbol{y}| \hat{\boldsymbol{\theta}}_k, M_k)$$ is the corresponding likelihood, and $$p(\boldsymbol{y} | M_k)$$ is the marginal likelihood, which can be expressed as $$p(\boldsymbol{y} | M_k) = \int p (\boldsymbol{y} | \boldsymbol{\theta}_k, M_k ) p (\boldsymbol{\theta}_k | M_k) d \boldsymbol{\theta}_k$$. By dividing both sides of Equation (4.1) by the prior $$p(\hat{\boldsymbol{\theta}}_k| M_k)$$, we obtain   \begin{align} \hat{L}(\hat{\boldsymbol{\theta}}_k | \boldsymbol{y}, M_k) = \frac {p(\hat{\boldsymbol{\theta}}_k| \boldsymbol{y} , M_k)}{p( \hat{\boldsymbol{\theta}}_k | M_k)} &= \frac{ p (\boldsymbol{y}| \hat{\boldsymbol{\theta}}_k , M_k)} {p(\boldsymbol{y} | M_k)}. \label{eq:bayes_m1} \end{align} (4.2) Based on (4.2), we can compute the ratio for two models ($$k=1,2$$) as   \begin{align*} \frac{\hat{L}(\hat{\boldsymbol{\theta}}_2 | \boldsymbol{y}, M_2)}{\hat{L}(\hat{\boldsymbol{\theta}}_1 | \boldsymbol{y}, M_1)} = \frac{p(\boldsymbol{y} | M_1)} {p(\boldsymbol{y} | M_2)} \times \frac{ p(\boldsymbol{y}|\hat{\boldsymbol{\theta}}_2, M_2)} { p(\boldsymbol{y}|\hat{\boldsymbol{\theta}}_1, M_1)}, \end{align*} and the Bayes factor $$\widehat{\text{BF}}_{12}$$ can then be computed as   $$\label{eq:bf} \widehat{\text{BF}}_{12} = \frac{\hat{L}(\hat{\boldsymbol{\theta}}_2 | \boldsymbol{y}, M_2)}{\hat{L}(\hat{\boldsymbol{\theta}}_1 | \boldsymbol{y}, M_1)} \times \frac{ p(\boldsymbol{y}| \hat{\boldsymbol{\theta}}_1 , M_1)} { p(\boldsymbol{y}| \hat{\boldsymbol{\theta}}_2 , M_2)}.$$ (4.3) Note that $$\hat{L}(\hat{\boldsymbol{\theta}}_k | \boldsymbol{y}, M_k)$$ ($$k=1,2$$) are by-products of the MCLL algorithm and the likelihood $$p(\boldsymbol{y}| \hat{\boldsymbol{\theta}}_k , M_k)$$ can readily be obtained using importance sampling (e.g. Shephard and Pitt, 1997) as shown in detail in the Appendix A of supplementary material available at Biostatistics online. 5. Empirical illustrations To illustrate the flexibility and utility of the MCLL algorithm, we provide two empirical examples where approximate ML estimation is infeasible in practice. 5.1. GLMM with crossed random effects for binary data The salamander mating dataset, described and analyzed by McCullagh and Nelder (1989) and many others, consists of three separate experiments, each involving matings among salamanders of two different populations, called Rough Butt (R) and White Side (W). The two salamander populations are geographically isolated from one another. The main research question is whether barriers to interbreeding have evolved so that matings within a population are more successful than those between populations. Each experiment included two groups of 10 female and 10 male salamanders from each population. Within each group, each salamander was paired with three salamanders of the opposite sex from the same population and three from the other population, which led to a total of 120 matings of four kinds of pairings (RR, RW, WR, and WW) per experiment. The response $$y_{ij}$$ is a binary variable indicating whether the mating was successful between female $$i$$ ($$i=1,...,60$$) and male $$j$$ ($$\,j=1,...,60$$). Specific aims for the data analysis are: (1) to find out the mating probabilities of the four kinds of pairings and (2) to investigate heterogeneity of mating success across female and male salamanders. We adopt model (A) by Karim and Zeger (1992) which can be written as:   $$\label{md:sal} \text{logit}(p (y_{ij} = 1 | z^f_i, z^m_j, x_{1i},x_{2j})) = \beta_1 + \beta_2 x_{1i} + \beta_3 x_{2j} + \beta_4 x_{1i}x_{2j} + z^f_i + z^m_j,$$ (5.1) where $$\beta_1$$ is the intercept, $$\beta_2$$, $$\beta_3$$, and $$\beta_4$$ are the regression coefficients of a dummy variable for a W female ($$x_{1i}$$), a dummy variable for a W male ($$x_{2j}$$), and the interaction between $$x_{1i}$$ and $$x_{2j}$$ ($$x_{1i}x_{2j}$$), respectively. The final two terms are random effects $$z^f_i \sim N(0, \sigma^2_f)$$ for females and $$z^m_j \sim N(0, \sigma^2_m)$$ for males. ML estimation of the parameters of Model (5.1) is challenging because the two random effects ($$z^f_i$$ and $$z^m_j$$) are cross-classified, rather than nested. For instance, adaptive quadrature, which can be viewed as the gold standard, nearly exact ML method for GLMMs, requires evaluating $$r^{61}$$ terms (with $$r$$ quadrature points). Even with $$r=3$$, adaptive quadrature is computationally intensive. 5.1.1. Implementation To obtain posterior samples of the model parameters, we considered three sets of priors: (1) a normal prior $$N(0,10)$$ for the regression coefficients and a truncated normal prior $$N(0,10)T(0,)$$ for the standard deviation parameters (where $$T(a,b)$$ is a truncation function with a lower bound $$a$$ and an upper bound $$b$$), (2) a normal prior $$N(0,4)$$ for the regression coefficients and a truncated normal prior $$N(0,4)T(0,)$$ for the standard deviation parameters, and (3) a normal prior $$N(0,4)$$ for the regression coefficients and a log-normal prior $$log N(-0.9870,0.5878)$$ for the standard deviation parameters, which gives a distribution with mean 0.5 and variance 0.2. The first set of priors is relatively diffuse, while the second set is weakly informative. The third set uses a more informative prior than the second set for the standard deviation parameters. The goal was to evaluate the sensitivity of the MCLL estimates, compared to the sensitivity of the posterior mean and posterior mode estimates, to different prior choices. The posterior samples were obtained using the WinBUGS software package (Lunn and others, 2000). The code is provided in the Appendix B of supplementary material available at Biostatistics online. Three chains with relatively diffuse starting values were run for 1000 iterations after a 2000 burn-in period. For convergence assessment, the Gelman–Rubin statistic (Gelman and Rubin, 1992) was used in addition to graphical checks such as trace plots and autocorrelation plots. We found that the posterior distributions were close to symmetric with skewness coefficients within $$\pm$$.25 for the regression parameters and smaller than 0.50 for the standard deviation parameters for all three prior specifications. The default nearest neighbor fraction of $$\lambda = 0.7$$ was chosen; to check for sensitivity, we also tried $$\lambda$$ values between 0.5 and 0.8 and found that average differences in parameter estimates were less than 10%. In terms of computational time, MCLL required only three seconds after MCMC samples were obtained (using a Intel Pentium Dual-Core 2.5-GHz processor computer with 3.2 GB of memory). We compared the MCLL parameter estimates with published estimates from various estimators such as MCEM (Booth and Hobert, 1999) and Monte Carlo maximum likelihood (MCMLE; Sung and Geyer, 2007). In addition, we obtained estimates using the Laplace approximation (using $$\texttt{lme4}$$ (Bates and others, 2014)) because it is perhaps the most widely used approximate ML estimator for fitting models with crossed random effects. We also compared MCLL with MCKL (discussed in Section 2.2). No software package is currently available for MCKL; hence, we implemented a customized MCKL algorithm for fitting model (5.1) using a diagonal bandwidth matrix; in this case, the $$(m,m)$$th bandwidth element $$h_m$$ for the $$m$$-th parameter was determined based on $$h_m = q \sigma_m$$, where $$q$$ is a proportionality constant and $$\sigma_m$$ is the posterior standard deviation of the $$m$$-th parameter. We considered ten different bandwidth values, ranging from $$q=$$ 0.1 to 1.0, for checking MCKL’s sensitivity to bandwidth choice. Additionally, we applied a posterior cumulant-based bias correction (De Valpine, 2004). The MCKL estimates varied to some degree across the ten sets of bandwidth values. The parameter estimates were somewhat shrunken after applying the bias correction. All MCKL estimates, before and after bias correction, are provided in the Appendix C of supplementary material available at Biostatistics online. Since no default $$q$$ value is available, we selected the mean value $$q=0.5$$ for performance comparison with other estimators. Lastly, for MCLL, we illustrate the approximate Hessian-based and bootstrap standard error estimation methods (see Section 3) and Bayes factor computation (see Section 4). 5.1.2. Results The first block of Table 1 lists the parameter estimates obtained from other approximate ML estimators. Standard errors are included when they were reported in the original papers. The second to fourth blocks of Table 1 list the MCLL parameter estimates, posterior means, and posterior modes obtained from three different prior choices. Table 1. Comparison of several estimators for the salamander mating data. Standard errors are given in parentheses if reported. MCEM: Booth and Hobert (1999); Laplace: lme4; MCMLE: Geyer (2007); MCKL: MCKL method ($$q=0.5$$); MCKL (adjusted): MCKL method after cumulant bias correction Method  $$\beta_{1}$$  $$\beta_{2}$$  $$\beta_{3}$$  $$\beta_{4}$$  $$\sigma_m$$  $$\sigma_f$$  $$\text{MCEM}^\text{a}$$  1.02  –2.96  –0.69  3.63  1.18  1.12  $$\text{MCMLE}^\text{a}$$  1.00  –2.78  –0.47  3.52  1.17  1.10  Laplace  1.00  –2.90  –0.70  3.59  1.08  1.02     (0.39)  (0.56)  (0.46)  (0.64)        MCKL  1.05  –3.01  –0.76  3.75  1.24  1.18  MCKL(adjusted)  1.03  –2.97  –0.74  3.69  1.23  1.17  MCLL                    $$\hspace{.1 in}$$ Prior 1$$^\text{b}$$  1.04  –3.04  –0.69  3.67  1.17  1.15     (0.41)  (0.58)  (0.45)  (0.62)        $$\hspace{.2 in}$$Boot.SE$$^\text{c}$$  [0.57]  [0.66]  [0.52]  [0.69]        $$\hspace{.2 in}$$(MCE)$$^\text{c}$$  (0.03)  (0.05)  (0.05)  (0.04)        $$\hspace{.1 in}$$ Prior 2$$^\text{b}$$  0.95  –2.85  –0.73  3.63  1.11  1.10     (0.38)  (0.53)  (0.44)  (0.58)        $$\hspace{.1 in}$$ Prior 3$$^\text{b}$$  0.97  –2.79  –0.64  3.50  1.03  1.04     (0.41)  (0.56)  (0.48)  (0.67)        Post mean                    $$\hspace{.1 in}$$ Prior 1  0.97  –2.89  –0.62  3.56  1.24  1.19     (0.43)  (0.59)  (0.47)  (0.64)        $$\hspace{.1 in}$$ Prior 2  0.84  –2.61  –0.49  3.25  1.18  1.12     (0.37)  (0.50)  (0.43)  (0.55)        $$\hspace{.1 in}$$ Prior 3  0.80  –2.45  –0.45  3.02  0.99  0.93     (0.35)  (0.47)  (0.42)  (0.55)        Post mode                    $$\hspace{.1 in}$$ Prior 1  0.95  –2.81  –0.60  3.40  1.13  1.07     (0.42)  (0.57)  (0.46)  (0.62)        $$\hspace{.1 in}$$ Prior 2  0.81  –2.52  –0.58  3.22  1.06  1.04     (0.39)  (0.52)  (0.45)  (0.56)        $$\hspace{.1 in}$$ Prior 3  0.75  –2.28  –0.45  2.86  0.89  0.90     (0.32)  (0.45)  (0.39)  (0.53)        Method  $$\beta_{1}$$  $$\beta_{2}$$  $$\beta_{3}$$  $$\beta_{4}$$  $$\sigma_m$$  $$\sigma_f$$  $$\text{MCEM}^\text{a}$$  1.02  –2.96  –0.69  3.63  1.18  1.12  $$\text{MCMLE}^\text{a}$$  1.00  –2.78  –0.47  3.52  1.17  1.10  Laplace  1.00  –2.90  –0.70  3.59  1.08  1.02     (0.39)  (0.56)  (0.46)  (0.64)        MCKL  1.05  –3.01  –0.76  3.75  1.24  1.18  MCKL(adjusted)  1.03  –2.97  –0.74  3.69  1.23  1.17  MCLL                    $$\hspace{.1 in}$$ Prior 1$$^\text{b}$$  1.04  –3.04  –0.69  3.67  1.17  1.15     (0.41)  (0.58)  (0.45)  (0.62)        $$\hspace{.2 in}$$Boot.SE$$^\text{c}$$  [0.57]  [0.66]  [0.52]  [0.69]        $$\hspace{.2 in}$$(MCE)$$^\text{c}$$  (0.03)  (0.05)  (0.05)  (0.04)        $$\hspace{.1 in}$$ Prior 2$$^\text{b}$$  0.95  –2.85  –0.73  3.63  1.11  1.10     (0.38)  (0.53)  (0.44)  (0.58)        $$\hspace{.1 in}$$ Prior 3$$^\text{b}$$  0.97  –2.79  –0.64  3.50  1.03  1.04     (0.41)  (0.56)  (0.48)  (0.67)        Post mean                    $$\hspace{.1 in}$$ Prior 1  0.97  –2.89  –0.62  3.56  1.24  1.19     (0.43)  (0.59)  (0.47)  (0.64)        $$\hspace{.1 in}$$ Prior 2  0.84  –2.61  –0.49  3.25  1.18  1.12     (0.37)  (0.50)  (0.43)  (0.55)        $$\hspace{.1 in}$$ Prior 3  0.80  –2.45  –0.45  3.02  0.99  0.93     (0.35)  (0.47)  (0.42)  (0.55)        Post mode                    $$\hspace{.1 in}$$ Prior 1  0.95  –2.81  –0.60  3.40  1.13  1.07     (0.42)  (0.57)  (0.46)  (0.62)        $$\hspace{.1 in}$$ Prior 2  0.81  –2.52  –0.58  3.22  1.06  1.04     (0.39)  (0.52)  (0.45)  (0.56)        $$\hspace{.1 in}$$ Prior 3  0.75  –2.28  –0.45  2.86  0.89  0.90     (0.32)  (0.45)  (0.39)  (0.53)        $$^{\rm a}$$The parameter estimates for MCEM and MCMLE in the respective original papers were reported based on the estimates in the respective original papers that used a different parameterization of the same model. Standard errors for MCMLE could not be derived because the covariance matrix of the model parameter estimates was not provided in the original paper. $$^{\rm b}$$Prior 1: diffuse normal priors; Prior 2: weakly informative normal priors; Prior 3: more informative priors for standard deviation parameters. $$^{\rm c}$$Boot.SE: bootstrap standard errors. MCE: Monte Carlo errors for the bootstrap standard errors. Table 1. Comparison of several estimators for the salamander mating data. Standard errors are given in parentheses if reported. MCEM: Booth and Hobert (1999); Laplace: lme4; MCMLE: Geyer (2007); MCKL: MCKL method ($$q=0.5$$); MCKL (adjusted): MCKL method after cumulant bias correction Method  $$\beta_{1}$$  $$\beta_{2}$$  $$\beta_{3}$$  $$\beta_{4}$$  $$\sigma_m$$  $$\sigma_f$$  $$\text{MCEM}^\text{a}$$  1.02  –2.96  –0.69  3.63  1.18  1.12  $$\text{MCMLE}^\text{a}$$  1.00  –2.78  –0.47  3.52  1.17  1.10  Laplace  1.00  –2.90  –0.70  3.59  1.08  1.02     (0.39)  (0.56)  (0.46)  (0.64)        MCKL  1.05  –3.01  –0.76  3.75  1.24  1.18  MCKL(adjusted)  1.03  –2.97  –0.74  3.69  1.23  1.17  MCLL                    $$\hspace{.1 in}$$ Prior 1$$^\text{b}$$  1.04  –3.04  –0.69  3.67  1.17  1.15     (0.41)  (0.58)  (0.45)  (0.62)        $$\hspace{.2 in}$$Boot.SE$$^\text{c}$$  [0.57]  [0.66]  [0.52]  [0.69]        $$\hspace{.2 in}$$(MCE)$$^\text{c}$$  (0.03)  (0.05)  (0.05)  (0.04)        $$\hspace{.1 in}$$ Prior 2$$^\text{b}$$  0.95  –2.85  –0.73  3.63  1.11  1.10     (0.38)  (0.53)  (0.44)  (0.58)        $$\hspace{.1 in}$$ Prior 3$$^\text{b}$$  0.97  –2.79  –0.64  3.50  1.03  1.04     (0.41)  (0.56)  (0.48)  (0.67)        Post mean                    $$\hspace{.1 in}$$ Prior 1  0.97  –2.89  –0.62  3.56  1.24  1.19     (0.43)  (0.59)  (0.47)  (0.64)        $$\hspace{.1 in}$$ Prior 2  0.84  –2.61  –0.49  3.25  1.18  1.12     (0.37)  (0.50)  (0.43)  (0.55)        $$\hspace{.1 in}$$ Prior 3  0.80  –2.45  –0.45  3.02  0.99  0.93     (0.35)  (0.47)  (0.42)  (0.55)        Post mode                    $$\hspace{.1 in}$$ Prior 1  0.95  –2.81  –0.60  3.40  1.13  1.07     (0.42)  (0.57)  (0.46)  (0.62)        $$\hspace{.1 in}$$ Prior 2  0.81  –2.52  –0.58  3.22  1.06  1.04     (0.39)  (0.52)  (0.45)  (0.56)        $$\hspace{.1 in}$$ Prior 3  0.75  –2.28  –0.45  2.86  0.89  0.90     (0.32)  (0.45)  (0.39)  (0.53)        Method  $$\beta_{1}$$  $$\beta_{2}$$  $$\beta_{3}$$  $$\beta_{4}$$  $$\sigma_m$$  $$\sigma_f$$  $$\text{MCEM}^\text{a}$$  1.02  –2.96  –0.69  3.63  1.18  1.12  $$\text{MCMLE}^\text{a}$$  1.00  –2.78  –0.47  3.52  1.17  1.10  Laplace  1.00  –2.90  –0.70  3.59  1.08  1.02     (0.39)  (0.56)  (0.46)  (0.64)        MCKL  1.05  –3.01  –0.76  3.75  1.24  1.18  MCKL(adjusted)  1.03  –2.97  –0.74  3.69  1.23  1.17  MCLL                    $$\hspace{.1 in}$$ Prior 1$$^\text{b}$$  1.04  –3.04  –0.69  3.67  1.17  1.15     (0.41)  (0.58)  (0.45)  (0.62)        $$\hspace{.2 in}$$Boot.SE$$^\text{c}$$  [0.57]  [0.66]  [0.52]  [0.69]        $$\hspace{.2 in}$$(MCE)$$^\text{c}$$  (0.03)  (0.05)  (0.05)  (0.04)        $$\hspace{.1 in}$$ Prior 2$$^\text{b}$$  0.95  –2.85  –0.73  3.63  1.11  1.10     (0.38)  (0.53)  (0.44)  (0.58)        $$\hspace{.1 in}$$ Prior 3$$^\text{b}$$  0.97  –2.79  –0.64  3.50  1.03  1.04     (0.41)  (0.56)  (0.48)  (0.67)        Post mean                    $$\hspace{.1 in}$$ Prior 1  0.97  –2.89  –0.62  3.56  1.24  1.19     (0.43)  (0.59)  (0.47)  (0.64)        $$\hspace{.1 in}$$ Prior 2  0.84  –2.61  –0.49  3.25  1.18  1.12     (0.37)  (0.50)  (0.43)  (0.55)        $$\hspace{.1 in}$$ Prior 3  0.80  –2.45  –0.45  3.02  0.99  0.93     (0.35)  (0.47)  (0.42)  (0.55)        Post mode                    $$\hspace{.1 in}$$ Prior 1  0.95  –2.81  –0.60  3.40  1.13  1.07     (0.42)  (0.57)  (0.46)  (0.62)        $$\hspace{.1 in}$$ Prior 2  0.81  –2.52  –0.58  3.22  1.06  1.04     (0.39)  (0.52)  (0.45)  (0.56)        $$\hspace{.1 in}$$ Prior 3  0.75  –2.28  –0.45  2.86  0.89  0.90     (0.32)  (0.45)  (0.39)  (0.53)        $$^{\rm a}$$The parameter estimates for MCEM and MCMLE in the respective original papers were reported based on the estimates in the respective original papers that used a different parameterization of the same model. Standard errors for MCMLE could not be derived because the covariance matrix of the model parameter estimates was not provided in the original paper. $$^{\rm b}$$Prior 1: diffuse normal priors; Prior 2: weakly informative normal priors; Prior 3: more informative priors for standard deviation parameters. $$^{\rm c}$$Boot.SE: bootstrap standard errors. MCE: Monte Carlo errors for the bootstrap standard errors. The MCLL, posterior mean and posterior mode estimates all vary between different choices of priors. However, the degree of variation is smaller for MCLL than for the posterior mean and posterior mode estimates. For example, comparing estimates of $$\beta_4$$ based on Prior 3 with those based on Prior 1, the MCLL estimates change by approximately 4.6% ($$=$$(3.67$$-$$3.50)$$/$$3.67), while the posterior mean and the posterior mode change by approximately 15.1% ($$=$$(3.56$$-$$3.02)$$/$$3.56) and 15.9% ($$=$$(3.40$$-$$2.86)$$/$$3.40), respectively. This result indicates that MCLL is less sensitive to the choice of priors than the posterior mean and mode estimates. For comparison with other estimators, we chose the MCLL estimates based on Prior 1 (relatively diffuse prior). These MCLL estimates are quite close to the MCEM and Laplace estimates and similar to the MCMLEs. In comparison to MCKL, the MCLL parameter estimates are somewhat closer to the bias-corrected MCKL estimates than to the original MCKL estimates. The approximate Hessian-based standard errors for MCLL are comparable to the standard error estimates from other estimators, whereas the bootstrap standard errors for MCLL are somewhat larger. Next, in order to illustrate the Bayes factor computation with our proposed procedure (described in Section 4), we considered an additional, reduced model that does not include the interaction term $$\beta_4 x_{1i}x_{2j}$$. The Bayes factor comparing the full and reduced models was computed as 1.12, indicating some evidence for the inclusion of the interaction term. Our Bayes factor value is comparable to the Bayes factor obtained based on the Laplace approximation (Raftery, 1996) which is 1.83. 5.2. Spatial model for count data As the second empirical example, we consider Best and others (2000)’s spatial Poisson-gamma convolution model for traffic-related air pollution and respiratory illness in 7–9 year-old children living in Northern England’s Huddersfield region. For analysis, we utilize a randomly perturbed dataset provided by Thomas and others (2014, pp 26–29). Spatial regions $$A_i$$ were defined by arbitrarily partitioning Huddersfield into 605 regular 750 $$m^2$$ grid cells covering a rectangle containing the Huddersfield study region plus a 2 km surrounding buffer zone (Best and others, 2000). Let $$Y_i$$ ($$i=1,...,I$$) be the number of reported cases of frequent coughing among children in area $$A_i$$. The variable $$Y_i$$ follows a Poisson distribution, $$Y_i \sim \text{Po} (\lambda_i)$$, with means $$\lambda_i = N_i \times p_i$$ given by the product of the population size (in hundreds) $$N_i$$ of children in area $$A_i$$ and the disease rate $$p_i$$ in the area. Risk factors can be viewed as incrementally contributing to the risk. Hence, a linear model can be specified, as in Best and others (2000), by assuming that coefficients of the (positive) risk factors are strictly non-negative, which can be achieved by specifying gamma priors for the coefficients. Specifically, a spatial moving average Poisson regression model was formulated for the disease rate $$p_i$$ in area $$A_i$$ as follows:   \begin{align}\label{eq:best} p_i &= \beta_0 + \beta_1 X_i + \beta_2 \sum_{j \in J}k_{ij} \gamma_j. \end{align} (5.2) In the first two terms, $$\beta_0$$ is the baseline disease rate, $$X_i$$ is the excess $$\text{NO}_2$$ concentration in area $$A_i$$ (measured in mg $$m^{-3}$$ above the Huddersfield baseline $$\text{NO}_2$$ concentration of approximately 20mg $$m^{-3}$$), and $$\beta_1$$ is the corresponding coefficient. In the final term, the sum is over areas $$A_j$$ that are near area $$A_i$$, $$\gamma_j$$ is a latent quantity representing the effect in area $$A_i$$ of unobserved spatially-distributed risk factors for frequent cough associated with area $$A_j$$, and $$k_{ij}$$ are elements of a Gaussian kernel matrix that govern the influence in $$A_i$$ arising from latent risk factors in $$A_j$$:   $$k_{ij} = \frac{1}{2\pi \rho^2} \exp \left(-d_{ij}^2 / 2 \rho^2 \right)\!,$$ (5.3) where $$d_{ij}$$ represents the Euclidean distance between the centroids of the areas $$A_i$$ and $$A_j$$. The distance scale $$\rho >0$$ governs how rapidly $$k_{ij}$$ declines with increasing distance $$d_{ij}$$. As in Best and others (2000), a point mass prior at $$\rho = 1\,$$km is used in the current analysis. The coefficient $$\beta_2$$ scales the latent variables $$\gamma_j$$ and indicates how much variation is explained by other spatial risk factors besides excess $$\text{NO}_2$$ concentration. 5.2.1. Implementation To obtain posterior samples, independent gamma prior distributions were specified for $$\beta_k$$ ($$k=0,1,2$$) and $$\gamma_j$$. Specifically, gamma prior distributions $$\Gamma (\alpha_k, \delta_k)$$ (with a shape parameter $$\alpha_k$$ and a rate parameter $$\delta_k$$, $$k=0,1,2$$) were chosen so that the number of frequent coughing cases associated with each of the three risk categories lie between one-tenth and ten times the nominal equal attribution with a 90% prior probability; this was achieved by choosing $$\alpha_k = 0.575$$ for all three parameters ($$k=0,1,2$$) and setting the value of the prior mean ($$\alpha_k / \delta_k$$) for $$\beta_k$$ such that the contribution of each risk factor would be equal to one third of the overall observed disease rate in the Huddersfield region; that is $$\delta_0 = 3 \alpha_0 / \bar{Y}$$, $$\delta_1 = 3 \alpha_1 \bar{X}/ \bar{Y}$$, and $$\delta_2 = 3 \alpha_2 / \bar{Y}$$, which correspond to $$\Gamma (0.575, 0.169)$$ for $$\beta_0$$, $$\Gamma (0.575, 1.533)$$ for $$\beta_1$$, and $$\Gamma (0.575, 0.169)$$ for $$\beta_2$$. For the latent risk factors $$\{\gamma_j \}_{j \in J}$$, the gamma prior $$\Gamma (\alpha_{\gamma_j}, \delta_{\gamma})$$ was chosen so that $$\alpha_{\gamma_j}$$ has prior mean $$\alpha_{\gamma_j} /\delta_{\gamma} = |A_j|$$, the area in km$$^2$$ of the region $$A_j$$; this makes the model spatially extensible on the grounds that fine or coarse partitions $$\{A_j\}$$ would lead to identical probability distributions for the sums of $$\gamma_j$$’s over any set, and for kernel sums, $$\sum_{j \in J}k_{ij} \gamma_j$$. The degree of uncertainly about the spatially-varying latent random effect can be adjusted by the value of $$\delta_{\gamma}$$. For moderate spatial variability, we set $$\delta_{\gamma} = 1.0/\text{km}^2$$ to give 1.0 $$\times$$ 30 $$\times$$ 20 $$=$$ 6000 prior points over the 30 km $$\times$$ 20 km Huddersfield region. For additional details on the choice of the prior distributions, see Best and others (2000). To evaluate how MCLL behaves with priors that slightly contradict evidence of the data, we specified an additional set of gamma priors (for the three model parameters) with a mean and a variance that were somewhat larger than those of the originally specified priors. In the Appendix D of supplementary material available at Biostatistics online, we provide graphs of two sets of prior distributions (the original priors and the intentionally “misleading” priors) for each of the three parameters as well as the corresponding posterior distributions. This comparison shows that the posterior distribution appears somewhat skewed especially for the $$\beta_2$$ parameter with the original prior specification. The skewness coefficients are within $$\pm0$$.10 for $$\beta_0$$ and $$\beta_1$$, but it is 0.77 for $$\beta_2$$, indicating that the $$\beta_2$$ posterior distribution is somewhat positively skewed. With the “misleading” prior, the posterior distribution looks less skewed for the $$\beta_2$$ parameter with a skewness coefficient of 0.50. The posterior samples of the model parameters with the specified priors were obtained using the WinBUGS software with three chains and 20 000 iterations after a 10 000 burn-in period. The code is provided in the Appendix B of supplementary material available at Biostatistics online. We used the default nearest neighbor fraction $$\lambda = 0.7$$ and compared the results with other values that ranged between 0.5 and 0.8. They all produced similar results, with a less than 10% average variation in the parameter estimates. For this example, MCLL took approximately two seconds to produce parameter estimates (post MCMC sampling). 5.2.2. Results In Table 2, we observe that the MCLL, posterior mean, and posterior mode estimates vary to some degree depending on which prior is chosen. In particular, the $$\beta_2$$ parameter shows marked differences with the misleading prior (Prior 2) when compared with the original prior (Prior 1). The degree of change is much smaller for the MCLL estimates compared with the posterior mean and posterior mode estimates. The MCLL estimates for $$\beta_2$$ change by approximately $$-$$39% ($$=$$(0.61$$-$$0.81)$$/$$0.61), while the posterior mean and posterior mode change by approximately $$-$$44% ($$=$$(0.93$$-$$1.34)/0.93) and $$-$$152% ($$=$$(0.44$$-$$1.11)/0.44), respectively. Table 2. Parameter estimates for regression coefficients $$\beta_0$$, $$\beta_1$$, and$$\beta_2$$ for the Huddersfield data with the original prior and the poorly chosen prior    $$\beta_0$$  $$\beta_1$$  $$\beta_2$$     Est  (SE)  Est  (SE)  Est  (SE)  MCLL                    $$\hspace{.1 in}$$Prior 1$$^\text{a}$$  6.33  (1.07)  0.36  (0.08)  0.61  (0.49)  $$\hspace{.1 in}$$Prior 2$$^\text{b}$$  6.05  (1.11)  0.36  (0.07)  0.85  (0.63)  Posterior mean                    $$\hspace{.1 in}$$Prior 1  6.14  (1.14)  0.34  (0.11)  0.93  (0.74)  $$\hspace{.1 in}$$Prior 2  5.66  (1.08)  0.35  (0.10)  1.34  (0.70)  Posterior mode                    $$\hspace{.1 in}$$Prior 1  6.51  (1.05)  0.35  (0.10)  0.44  (0.66)  $$\hspace{.1 in}$$Prior 2  5.75  (1.10)  0.37  (0.11)  1.11  (0.71)     $$\beta_0$$  $$\beta_1$$  $$\beta_2$$     Est  (SE)  Est  (SE)  Est  (SE)  MCLL                    $$\hspace{.1 in}$$Prior 1$$^\text{a}$$  6.33  (1.07)  0.36  (0.08)  0.61  (0.49)  $$\hspace{.1 in}$$Prior 2$$^\text{b}$$  6.05  (1.11)  0.36  (0.07)  0.85  (0.63)  Posterior mean                    $$\hspace{.1 in}$$Prior 1  6.14  (1.14)  0.34  (0.11)  0.93  (0.74)  $$\hspace{.1 in}$$Prior 2  5.66  (1.08)  0.35  (0.10)  1.34  (0.70)  Posterior mode                    $$\hspace{.1 in}$$Prior 1  6.51  (1.05)  0.35  (0.10)  0.44  (0.66)  $$\hspace{.1 in}$$Prior 2  5.75  (1.10)  0.37  (0.11)  1.11  (0.71)  $$^{\rm a}$$Prior 1: Original prior. $$^{\rm b}$$Prior 2: (Intentionally) Misleading prior (with a somewhat diffuse prior choice). Table 2. Parameter estimates for regression coefficients $$\beta_0$$, $$\beta_1$$, and$$\beta_2$$ for the Huddersfield data with the original prior and the poorly chosen prior    $$\beta_0$$  $$\beta_1$$  $$\beta_2$$     Est  (SE)  Est  (SE)  Est  (SE)  MCLL                    $$\hspace{.1 in}$$Prior 1$$^\text{a}$$  6.33  (1.07)  0.36  (0.08)  0.61  (0.49)  $$\hspace{.1 in}$$Prior 2$$^\text{b}$$  6.05  (1.11)  0.36  (0.07)  0.85  (0.63)  Posterior mean                    $$\hspace{.1 in}$$Prior 1  6.14  (1.14)  0.34  (0.11)  0.93  (0.74)  $$\hspace{.1 in}$$Prior 2  5.66  (1.08)  0.35  (0.10)  1.34  (0.70)  Posterior mode                    $$\hspace{.1 in}$$Prior 1  6.51  (1.05)  0.35  (0.10)  0.44  (0.66)  $$\hspace{.1 in}$$Prior 2  5.75  (1.10)  0.37  (0.11)  1.11  (0.71)     $$\beta_0$$  $$\beta_1$$  $$\beta_2$$     Est  (SE)  Est  (SE)  Est  (SE)  MCLL                    $$\hspace{.1 in}$$Prior 1$$^\text{a}$$  6.33  (1.07)  0.36  (0.08)  0.61  (0.49)  $$\hspace{.1 in}$$Prior 2$$^\text{b}$$  6.05  (1.11)  0.36  (0.07)  0.85  (0.63)  Posterior mean                    $$\hspace{.1 in}$$Prior 1  6.14  (1.14)  0.34  (0.11)  0.93  (0.74)  $$\hspace{.1 in}$$Prior 2  5.66  (1.08)  0.35  (0.10)  1.34  (0.70)  Posterior mode                    $$\hspace{.1 in}$$Prior 1  6.51  (1.05)  0.35  (0.10)  0.44  (0.66)  $$\hspace{.1 in}$$Prior 2  5.75  (1.10)  0.37  (0.11)  1.11  (0.71)  $$^{\rm a}$$Prior 1: Original prior. $$^{\rm b}$$Prior 2: (Intentionally) Misleading prior (with a somewhat diffuse prior choice). 6. Simulation studies We conducted three simulation studies to evaluate the performance of MCLL in various situations. In the first study, we deliberately selected a linear mixed model so that we could compare the MCLL estimates with exact MLEs. In the second study, we chose a longitudinal logistic mixed model for which MLE with adaptive quadrature is feasible and provides a useful “gold-standard” comparison to MCLL. In the third study, we selected a GLMM with crossed random effects, for which adaptive quadrature is not feasible, to demonstrate the utility of MCLL for such difficult problems. 6.1. Linear mixed model for continuous data In the first simulation example, we utilize the model proposed by Rabe-Hesketh and others (2008) for birth weight data on 1000 families (with mother, father, and one child). For family members $$i$$ nested in family $$j$$ the model is   $$\label{eq:bw} y_{ij} = \boldsymbol{x}_{ij}'\boldsymbol{\beta} + \alpha_{1j}[M_i +K_i/2] + \alpha_{2j}[F_i +K_i/2] + \alpha_{3j}[K_i/\sqrt{2}] + \epsilon_{ij},$$ (6.1) where $$\boldsymbol{x}_{ij}$$ is a vector of covariates with regression coefficients $$\boldsymbol{\beta}$$, $$M_i$$, $$K_i$$, and $$F_i$$ are dummy variables for mother, child, and father, respectively, and the three random effects at level 2, $$\alpha_{1j}$$, $$\alpha_{2j}$$, and $$\alpha_{3j}$$ are i.i.d as $$\alpha_{kj} \sim N(0, \sigma_A^2)$$ with $$k=1,2,3$$. The level-1 residuals have a normal distribution, $$\epsilon_{ij}^{(2)} \sim N(0, \sigma_E^2)$$ and are independent of the random effects. Here $$\sigma_A$$ can be interpreted as the additive genetic standard deviation and $$\sigma_E$$ as the unique environment standard deviation. The covariates include dummy variables for male, and mother being aged 20–35, younger than 35, or older than 35 at the time of the birth. We generated 100 datasets based on model (6.1) using the MLEs of the original data as parameter values, $$\boldsymbol{\beta}=(3368.09, 155.34, 126.94, 213.43)'$$ and $$(\sigma_A, \sigma_E)' = (311.21, 374.66)'$$. To obtain posterior samples, diffuse normal priors were specified for the regression coefficients (mean 0, standard deviation 1000). A log-normal prior $$logN(6.1761, 0.07696)$$ was specified for both $$\sigma_A$$ and $$\sigma_E$$, which gives a distribution with mean 500 and variance 20 000. The 0.05 quantile of this distribution is 307 g, which corresponds to the random effect falling between $$-$$614 and 614 g with 95% probability. The 0.95 quantile is 745 g. This choice of prior gives more weight to large values of $$\sigma_A$$ and $$\sigma_E$$ (compared with the true values) than a diffuse prior and the posterior distributions for $$\sigma_A$$ and $$\sigma_E$$ are centered in the left tail of the prior. However, this choice allows us to evaluate how well MCLL performs when an informative prior is poorly specified. WinBUGS was used to obtain the posterior samples using three chains with 5000 iterations after a 3000 burn-in. The default nearest neighbor fraction $$\lambda=0.7$$ was chosen. The MLEs were obtained using the R function $$\texttt{lme}$$ of the package $$\texttt{nlme}$$ (Pinheiro and others, 2012). 6.1.1. Results Figure 1 provides summary plots that compare the distances from the MLE between the MCLL and Bayesian (posterior mean) estimates for each parameter. The figure shows that the Bayesian estimates display a marked bias (defined relative to the MLEs), which is evident in the point clouds being shifted away from zero on the x-axis. Although the MCLL estimates are more variable, the mean squared distances to the MLEs are smaller than for the Bayesian estimates except for $$\beta_2$$. This is strong evidence that MCLL estimates are closer to the MLEs than the Bayesian estimates. The mean distances between the Bayesian estimates and MLEs differ significantly from zero for all parameters at the 1% level. But for the MCLL estimates, the mean distances are not significantly different from zero at the 5% level except for the parameter $$\sigma_A$$ ($$t = 3.20, p=0.002$$). Fig. 1. View largeDownload slide Distances from the MLE for MCLL estimates (MCLL–MLE) and for posterior mean estimates (Post.m-MLE) for simulated birth weight datasets. The Monte Carlo errors are within 10% of the means of the parameter estimates for all methods. Fig. 1. View largeDownload slide Distances from the MLE for MCLL estimates (MCLL–MLE) and for posterior mean estimates (Post.m-MLE) for simulated birth weight datasets. The Monte Carlo errors are within 10% of the means of the parameter estimates for all methods. In the Appendix E of supplementary material available at Biostatistics online, we compare the average standard error estimates over replicates (denoted SE) with the empirical standard errors (denoted SD) and report the ratios (SE/SD). The results show that both the ML and MCLL standard errors are good approximations of the empirical standard deviations, with SE/SD ratios between 0.91 and 1.14 for MLE and between 0.92 and 1.16 for MCLL. The MCLL standard error estimates tend to be somewhat more conservative (larger) than the ML standard errors. 6.2. Longitudinal model for binary data In the second simulation study, we consider a longitudinal model for repeated measurements of $$N=1000$$ subjects. Let $$y_{it}$$ denote a binary response for person $$i$$ ($$i = 1, ..., N$$) at time $$t$$ ($$t=1, ..., T$$). Then a logistic mixed model for $$y_{it}$$ can be written as   $$\label{eq:binary} logit[\text{Pr}(y_{it}=1|\text{time}_{it},x_i,\epsilon_i)] = \beta_0 + \beta_1 \text{time}_{it} + \beta_2 x_i + \epsilon_{i},$$ (6.2) where $$\beta_0$$ is the intercept, $$\beta_1$$ is the regression coefficient of $$\text{time}_{it}$$ ($$\text{time}_{it}$$ has $$T$$ categories from 0 to 1, equally spaced), and $$\beta_2$$ is the regression coefficient of the time-invariant covariate $$x_i$$ (a dichotomous variable equal to 0 for half of the subjects and 1 for the other half). The subject-specific random effect $$\epsilon_{i}$$ follows a normal distribution, $$\epsilon_{i} \sim N(0, \sigma^2)$$. Three conditions were considered for $$T=(3, 4, 50)$$ to cover small to large cluster sizes. Four conditions were considered for $$\sigma = (0.5, 0.9,1.5, 2.7)$$ to represent small to large intra class correlations for the latent responses (0.07, 0.2, 0.4, and 0.7, respectively). For the regression coefficients, we considered fixed true values, $$\beta_0=0.6,\beta_1=1.0$$, and $$\beta_2=-1.0$$. We generated 200 datasets under each of the twelve conditions. For MCLL, diffuse normal priors were specified for $$\boldsymbol{\beta}$$ with a mean of 0 and variance of 10 and a log-normal prior $$logN(-0.9870, 0.5878)$$ was specified for $$\sigma$$, which gives a distribution with mean 0.5 and variance 0.2. WinBUGS was used to obtain the posterior samples using three chains with 5000 iterations after a 3000 iteration burn-in. The default nearest neighbor fraction $$\lambda=0.7$$ was used. To evaluate the performance of MCLL, we considered adaptive quadrature, which can be seen as the gold standard approximate ML method; for a relatively simple model like (6.2) adaptive quadrature can produce estimates close to true MLEs with a sufficient number of quadrature points (Rabe-Hesketh and others, 2005). As an alternative approximate ML method, we also considered the Laplace approximation because it is the most widely available approximate ML method that can be applied when adaptive quadrature becomes infeasible. We used the $$\texttt{melogit}$$ command in Stata (with seven default quadrature points) for adaptive quadrature and the $$\texttt{lme4}$$ package in R for the Laplace approximation. 6.2.1. Results We provide the bias and MSE estimates of all model parameters under the twelve conditions in the Appendix F of supplementary material available at Biostatistics online. For the regression coefficients ($$\boldsymbol{\beta}$$), MCLL generally has smaller bias than the Laplace approximation especially when $$\sigma \geq 0.9$$. The bias of MCLL and the Laplace approximation are similar to or slightly larger than the bias of adaptive quadrature and the MSE estimates appear similar across the three estimators. We note that MCLL and the Laplace approximation show somewhat larger bias for $$\sigma$$ than for the regression coefficients in some conditions. Hence, we provide Table 3 to further discuss the performance differences between the methods with respect to $$\sigma$$. Table 3 suggests that MCLL has smaller bias and smaller MSE than the Laplace approximation in most conditions, specifically when $$\sigma =0.9$$, $$\sigma =1.5$$, and $$\sigma =2.7$$ across all $$T$$ ($$T=3,4,50$$). When the cluster sizes are small ($$T=3,4$$) and the variance components are large ($$\sigma =1.5, 2.7$$), the MSE of the Laplace approximation is substantially larger than for MCLL. It is important to note that the MCLL estimates are closer to the adaptive quadrature estimates under those conditions. Only for two conditions, when $$\sigma = 0.5$$ with $$T=4$$ and $$T=50$$, MCLL shows somewhat larger bias and MSE than the Laplace approximation. Table 3. Estimated bias and MSE for $$\sigma$$. Shading indicates statistical significance at the 5% level ($$H_0\,{:}\,\text{bias}=0$$). Laplace: Laplace approximation; AQ: adaptive quadrature     Table 3. Estimated bias and MSE for $$\sigma$$. Shading indicates statistical significance at the 5% level ($$H_0\,{:}\,\text{bias}=0$$). Laplace: Laplace approximation; AQ: adaptive quadrature     We also compared average standard error estimates (SE) with the empirical standard errors (SD) for the regression coefficients (see Appendix F, Tables S7–S9 of supplementary material available at Biostatistics online). For MCLL, the means of the standard error estimates (SE) are close to the empirical standard errors (SD) under all conditions. 6.3. Crossed random effects model for binary data In the third simulation study, we consider a crossed random effects model for binary data analogous to the salamander mating data example in Section 5.1. We generated 100 datasets based on model (5.1), while we set $$\boldsymbol{\beta} =(1.06, -3.05, -0.72, 3.77)'$$ and $$(\sigma^2_f, \sigma^2_m)' = (.50,.50)'$$ as data generating parameter values. A diffuse normal prior $$N(0,10)$$ was specified for the regression coefficients ($$\boldsymbol{\beta}$$) and a truncated normal prior $$N(0,10)T(0,)$$ for the standard deviation parameters ($$\sigma$$). WinBUGS was used to obtain the posterior samples using three chains with 5000 iterations after a 3000 burn-in. The default nearest neighbor fraction $$\lambda=0.7$$ was used. We compared the performance of MCLL with the Laplace approximation and MCMLE (Sung and Geyer, 2007). These two methods seem to be the only approximate ML methods available in standard software for estimating crossed random effects models for binary data. The Laplace approximation estimates were obtained using the $$\texttt{lme4}$$ package in R (Bates and others, 2014) and the MCMLEs were obtained using the $$\texttt{bernor}$$ R package (Geyer and Sung, 2006) with 10 000 samples. 6.3.1. Results Table 4 presents estimated bias and mean squared error (MSE) for the point estimates, as well as ratios (SE/SD) of the average standard errors divided by the empirical standard deviations. For the regression parameters, the bias and MSE are quite similar between the methods except that MCMLE has somewhat larger bias than the other methods. For the standard deviation parameters, MCLL has the smallest bias and MSE among all methods. The SE/SD ratios do not differ much from one for any of three methods and are closest to one for MCLL, followed by Laplace and then MCMLE. The MCLL standard error estimates tend to be more conservative than those from the Laplace approximation. Table 4. Estimated bias and MSE of the MCLL, Laplace approximation, and MCMLE point estimates and ratios of mean SEs to empirical SDs for 100 simulated salamander datasets    Bias  MSE  SE/SD $$^{\text{a,b}}$$     Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  $$\beta_1$$  –0.03  –0.07  –0.03  0.12  0.12  0.12  0.86  0.89  0.93  $$\beta_2$$  0.04  0.12  0.01  0.22  0.20  0.21  0.95  1.12  1.07  $$\beta_3$$  0.06  0.08  0.05  0.17  0.17  0.16  0.90  1.08  0.97  $$\beta_4$$  –0.04  –0.14  –0.04  0.35  0.27  0.35  0.90  1.33  1.03  $$\sigma_m$$  –0.17  –0.16  –0.13  0.10  0.08  0.07  –  –     $$\sigma_f$$  –0.15  –0.20  –0.12  0.12  0.10  0.08  –  –  –     Bias  MSE  SE/SD $$^{\text{a,b}}$$     Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  $$\beta_1$$  –0.03  –0.07  –0.03  0.12  0.12  0.12  0.86  0.89  0.93  $$\beta_2$$  0.04  0.12  0.01  0.22  0.20  0.21  0.95  1.12  1.07  $$\beta_3$$  0.06  0.08  0.05  0.17  0.17  0.16  0.90  1.08  0.97  $$\beta_4$$  –0.04  –0.14  –0.04  0.35  0.27  0.35  0.90  1.33  1.03  $$\sigma_m$$  –0.17  –0.16  –0.13  0.10  0.08  0.07  –  –     $$\sigma_f$$  –0.15  –0.20  –0.12  0.12  0.10  0.08  –  –  –  $$^{\rm a}$$The Monte Carlo errors for the means of the standard error estimates are less than 10%. $$^{\rm b}$$SE/SD not provided for $$\sigma_f$$ and $$\sigma_m$$ because Wald-type tests and confidence intervals are inappropriate for these parameters (e.g., Stram and Lee, 1994). Table 4. Estimated bias and MSE of the MCLL, Laplace approximation, and MCMLE point estimates and ratios of mean SEs to empirical SDs for 100 simulated salamander datasets    Bias  MSE  SE/SD $$^{\text{a,b}}$$     Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  $$\beta_1$$  –0.03  –0.07  –0.03  0.12  0.12  0.12  0.86  0.89  0.93  $$\beta_2$$  0.04  0.12  0.01  0.22  0.20  0.21  0.95  1.12  1.07  $$\beta_3$$  0.06  0.08  0.05  0.17  0.17  0.16  0.90  1.08  0.97  $$\beta_4$$  –0.04  –0.14  –0.04  0.35  0.27  0.35  0.90  1.33  1.03  $$\sigma_m$$  –0.17  –0.16  –0.13  0.10  0.08  0.07  –  –     $$\sigma_f$$  –0.15  –0.20  –0.12  0.12  0.10  0.08  –  –  –     Bias  MSE  SE/SD $$^{\text{a,b}}$$     Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  $$\beta_1$$  –0.03  –0.07  –0.03  0.12  0.12  0.12  0.86  0.89  0.93  $$\beta_2$$  0.04  0.12  0.01  0.22  0.20  0.21  0.95  1.12  1.07  $$\beta_3$$  0.06  0.08  0.05  0.17  0.17  0.16  0.90  1.08  0.97  $$\beta_4$$  –0.04  –0.14  –0.04  0.35  0.27  0.35  0.90  1.33  1.03  $$\sigma_m$$  –0.17  –0.16  –0.13  0.10  0.08  0.07  –  –     $$\sigma_f$$  –0.15  –0.20  –0.12  0.12  0.10  0.08  –  –  –  $$^{\rm a}$$The Monte Carlo errors for the means of the standard error estimates are less than 10%. $$^{\rm b}$$SE/SD not provided for $$\sigma_f$$ and $$\sigma_m$$ because Wald-type tests and confidence intervals are inappropriate for these parameters (e.g., Stram and Lee, 1994). 7. Concluding remarks In this article, we proposed a novel approximate ML method for complex models where ML estimation is practically infeasible. Our MCLL algorithm requires MCMC samples of model parameters to approximate a likelihood function. A major attraction of MCLL is that it can be applied to ML inference of any complex models where ML estimation is difficult but Bayesian estimation is feasible. In this article, we proposed a novel approximate ML method referred to as MCLL which builds upon the MCKL algorithm that was previously developed for fitting dynamic population models. MCLL has several technical and practical advantages compared with MCKL: first, by utilizing local likelihood density estimation instead of kernel density methods, MCLL reduces mode bias without needing to apply additional bias correction methods. Second, our MCLL method provides approximate standard errors that are less sensitive to bandwidth choice and allows approximate computation of the Bayes factors. Finally, our MCLL algorithm is flexible and can be applied to a broader class of models. An accompanying R package is also provided which is freely available, easy to use, and supplies convenient default options that generally work well. MCLL leverages Bayesian methods to approximate MLEs. One may wonder why we want to obtain MLEs when Bayesian estimates are available. Although Bayesian methods are a pragmatic solution for complex inferential problems, several issues of Bayesian solutions have been discussed (e.g., Lele and others, 2007); first, Bayesian inferences depend on the choice of prior distributions, while there is a debate as to how to define a diffuse or an objective prior (e.g. Press, 2003). In addition, Lele and others (2007) argued that how to interpret the credible intervals with diffuse priors is a controversial issue (Barnett, 1999). Because of these unresolved issues, some researchers may prefer approximate MLEs. We showed that MCLL was less sensitive to the choice of prior distributions than Bayesian methods; we further demonstrated that MCLL generated solutions closer to the MLE even when poorly-chosen priors were specified. Furthermore, we showed in several settings that MCLL performed very favorably when compared with existing methods that are feasible for complex models such as the Laplace approximation. MCLL required only a few seconds after obtaining MCMC samples. All together, our study results confirm that the proposed MCLL algorithm can be an effective and practical computational option for estimating complex models when ML inference is desirable. Supplementary material Supplementary material is available at http://biostatistics.oxfordjournals.org. Conflict of Interest: None declared. References Barnett V. ( 1999). Comparative Statistical Inference . New York: Wiley. Google Scholar CrossRef Search ADS   Bates D., Maechler M., Bolker B. and Walker S. ( 2014). lme4: Linear Mixed-Effects Models using Eigen and S4 . R package version 1.1-6. http://CRAN.R-project.org/package=lme4. Best N. G., Ickstadt K., Wolpert R. L. and Briggs D. J. ( 2000). Combining models of health and exposure data: the SAVIAH study. In: Elliott P., Wakefield J. C., Best N. G. and Briggs D. J. (editors), Spatial Epidemiology: Methods and Applications . Oxford: Oxford University Press, pp. 393– 414. Google Scholar CrossRef Search ADS   Booth J. G. and Hobert J. P. ( 1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society Series B  61, 265– 285. Google Scholar CrossRef Search ADS   Byrd R. H., Lu P., Nocedal J. and Zhu C. ( 1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing  16, 1190– 1208. Google Scholar CrossRef Search ADS   De Valpine P. ( 2004). Monte Carlo state-space likelihoods by weighted posterior kernel density estimation. Journal of the American Statistical Association  99, 523– 535. Google Scholar CrossRef Search ADS   Duong T. and Hazelton M. L. ( 2003). Plug-in bandwidth matrices for bivariate kernel density estimation. Journal of Nonparametric Statistics  15, 17– 30. Google Scholar CrossRef Search ADS   Gelman A. and Rubin D. B. ( 1992). Inference from iterative simulation using multiple sequences. Statistical Science  7, 457– 472. Google Scholar CrossRef Search ADS   Geyer C. and Sung Y. ( 2006). bernor: Bernoulli Regression with Normal Random Effects . R package version 0.3-8. http://www.stat.umn.edu/geyer/bernor. Hall P., Müller H. and Wu P. ( 2006). Real-time density and mode estimation with application to time-dynamic mode tracking. Journal of Computational and Graphical Statistics  15, 82– 100. Google Scholar CrossRef Search ADS   Hjort N. L. and Jones M. C. ( 1996). Locally parametric nonparametric density estimation. The Annals of Statistics  24, 1619– 1647. Google Scholar CrossRef Search ADS   Jeon M., Kaufman C. and Rabe-Hesketh S. ( 2014). mcll: Monte Carlo Local Likelihood for Estimating Generalized Linear Mixed Models. R package version 1.2 . http://CRAN.R-project.org/package=mcll. Joe H. ( 2008). Accuracy of Laplace approximation for discrete response mixed models. Computational Statistics and Data Analysis  52, 5066– 5074. Google Scholar CrossRef Search ADS   Karim M. R. and Zeger S. L. ( 1992). Generalized linear models with random effects: salamander mating revisited. Biometrics  48, 631– 644. Google Scholar CrossRef Search ADS PubMed  Lele S. R., Dennis B. and Lutscher F. ( 2007). Data cloning: easy maximum likelihood estimation for complex ecological models using Bayesian Markov chain Monte Carlo methods. Ecology Letters  10, 551– 563. Google Scholar CrossRef Search ADS PubMed  Lele S. R., Nadeem K. and Schmuland B. ( 2010). Estimability and likelihood inference for generalized linear mixed models using data cloning. Journal of the American Statistical Association  105, 1617– 1625. Google Scholar CrossRef Search ADS   Loader C. ( 1999). Local Regression and Likelihood . New York: Springer. Loader C. R. ( 1996). Local likelihood density estimation. The Annals of Statistics  24, 1602– 1618. Google Scholar CrossRef Search ADS   Lunn D., Thomas A., Best N. and Spiegelhalter D. ( 2000). WinBUGS—a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing  10, 325– 337. Google Scholar CrossRef Search ADS   McCullagh P. and Nelder J. A. ( 1989). Generalized Linear Models . New York: Chapman and Hall. Google Scholar CrossRef Search ADS   McCulloch C. E. ( 1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association  92, 162– 170. Google Scholar CrossRef Search ADS   Otneim H., Karlsen H. A. and Tjøstheim D. ( 2013). Bias and bandwidth for local likelihood density estimation. Statistics and Probability Letters  83, 1382– 1387. Google Scholar CrossRef Search ADS   Pinheiro Jose, Bates Douglas, DebRoy Saikat, Sarkar Deepayan and R Development Core Team. ( 2012). nlme: Linear and Nonlinear Mixed Effects Models . R package version 3.1–104. Press S. J. ( 2003). Subjective and Objective Bayesian Statistics . New York: Wiley. Rabe-Hesketh S., Skrondal A. and Gjessing H. K. ( 2008). Biometrical modeling of twin and family data using standard mixed model software. Biometrics  64, 280– 288. Google Scholar CrossRef Search ADS PubMed  Rabe-Hesketh S., Skrondal A. and Pickles A. ( 2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics  128, 301– 323. Google Scholar CrossRef Search ADS   Raftery, Adrian E. ( 1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika  83, 251– 266. Google Scholar CrossRef Search ADS   Shealther S. ( 2004). Density estimation. Statistical Science  19, 588– 597. Google Scholar CrossRef Search ADS   Shephard N. and Pitt M. K. ( 1997). Likelihood analysis of non-Gaussian measurement time series. Biometrika  84, 653– 667. Google Scholar CrossRef Search ADS   Stram D. O. and Lee J. W. ( 1994). Variance components testing in the longitudinal mixed effects model. Biometrics  50, 1171– 1177. Google Scholar CrossRef Search ADS PubMed  Sung Y. J. and Geyer C. J. ( 2007). Monte Carlo likelihood inference for missing data models. The Annals of Statistics  35, 990– 1011. Google Scholar CrossRef Search ADS   Thomas A., Best N. G., Lunn D., Arnold R. and Spiegelhalter D. J. ( 2014). GeoBugs User Manual. Version 3.2.3 . London: Imperial College School of Medicine. Walker A. M. ( 1969). On the asymptotic behavior of posterior distributions. Journal of the Royal Statistical Society, Series B  31, 80– 88. Wand M. P. and Jones M. C. ( 1993). Comparison of smoothing parameterizations in bivariate kernel density estimation. Journal of the American Statistical Association  88, 520– 528. Google Scholar CrossRef Search ADS   Wolfinger R. ( 1993). Laplace’s approximation for nonlinear mixed models. Biometrika  80, 791– 795. Google Scholar CrossRef Search ADS   © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Biostatistics Oxford University Press

# Monte Carlo local likelihood approximation

, Volume Advance Article – Jan 4, 2018
16 pages

/lp/ou_press/monte-carlo-local-likelihood-approximation-GoGimHqaWB
Publisher
Oxford University Press
ISSN
1465-4644
eISSN
1468-4357
D.O.I.
10.1093/biostatistics/kxx070
Publisher site
See Article on Publisher Site

### Abstract

Summary We propose the Monte Carlo local likelihood (MCLL) method for approximating maximum likelihood estimation (MLE). MCLL initially treats model parameters as random variables, sampling them from the posterior distribution as in a Bayesian model. The likelihood function is then approximated up to a constant by fitting a density to the posterior samples and dividing the approximate posterior density by the prior. In the MCLL algorithm, the posterior density is estimated using local likelihood density estimation, in which the log-density is locally approximated by a polynomial function. We also develop a new method that allows users to efficiently compute standard errors and the Bayes factor. Two empirical and three simulation studies are provided to demonstrate the performance of the MCLL method. 1. Introduction Likelihood-based inference for latent variable and generalized linear mixed models (GLMMs) is challenging as it involves solving (often high-dimensional) integrals that have no closed-form solutions. Various methods have been developed, but each has its own limitations. For instance, the Laplace approximation (Wolfinger, 1993) is computationally efficient but performs poorly when cluster sizes are small and random-effect variances are large (Joe, 2008). Adaptive quadrature (Rabe-Hesketh and others, 2005) is more accurate than the Laplace approximation but computationally infeasible when there are many random effects. Monte Carlo (MC) methods using many samples (e.g. MC expectation maximization (McCulloch, 1997)) tend to be computationally inefficient whereas the efficiency of single sample methods (e.g. MC maximum likelihood (Sung and Geyer, 2007)) depends critically on the choice of the importance sampling distribution. Data cloning (Lele and others, 2010) has been presented as a way to obtain approximate MLEs by utilizing a Bayesian framework, but this approach is usually computationally impractical because of the necessity to create very large datasets through cloning. In this article, we propose a new estimation technique utilizing Bayesian computations for approximate ML inference. Bayesian Markov chain Monte Carlo (MCMC) methods do not require integration of the likelihood over the random effects and therefore circumvent the major challenge of ML estimation. Our proposed algorithm requires a sufficiently large number of samples of model parameters from the posterior distribution corresponding to a particular prior distribution. We then approximate the likelihood function by estimating the density of the posterior samples of the parameters and dividing this approximate posterior density by the prior. Since we utilize local likelihood density estimation (Hjort and Jones, 1996; Loader, 1996) for approximating the posterior distribution (the log-density is locally approximated by a polynomial), we refer to the proposed method as Monte Carlo local likelihood (MCLL). Our MCLL algorithm builds upon the MC kernel likelihood (MCKL; De Valpine, 2004) method with an important difference that MCLL utilizes local likelihood density estimation instead of kernel density estimation of the posterior density to approximate the likelihood function. Local likelihood density estimation has several advantages over kernel density estimation. Importantly, we show that MCLL allows estimation of standard errors using an efficient method that is insensitive to bandwidth choice. Note that for MCKL, standard error estimation has not been discussed. Whereas MCKL was developed specifically for fitting dynamic population models, our MCLL method is designed to be applicable to a wide class of models; and we demonstrate its use and accuracy for various model types using both real data and simulation studies. Although MCLL was developed for approximate likelihood inference, it turns out that it can easily produce Bayes factors as a by-product of the algorithm. MCLL has been implemented in an R package $$\texttt{mcll}$$ (Jeon and others, 2014) that is easy to use and offers practical default options (such a software implementation is not available for MCKL). After MCMC draws from the posterior have been obtained, $$\texttt{mcll}$$ typically requires little additional time. The rest of this article is structured as follows: in Section 2, we describe the MCLL algorithm with implementation details. We then explain similarities and differences between MCLL and MCKL. In Sections 3 and 4, we present new ways of computing standard errors and Bayes factors, respectively. In Section 5, we provide two empirical applications; a GLMM for binary responses with crossed random effects and a spatial Poisson-gamma convolution model. Simulation studies are presented in Section 6 to demonstrate the performance and utility of our proposed MCLL algorithm compared with exact MLE in a linear mixed model, compared with adaptive quadrature in a GLMM for longitudinal data, and compared with the Laplace approximation for a GLMM with crossed random effects where adaptive quadrature is no longer feasible. We end in Section 7 with some concluding remarks. 2. Monte Carlo local likelihood method 2.1. MCLL procedure For a $$d$$-dimensional parameter $$\boldsymbol{\theta}$$ and observed data vector $$\boldsymbol{y}$$, MCLL begins by obtaining the posterior samples $$\boldsymbol{\theta}^{(s)}$$ ($$s=1,...,S$$) using any MCMC methods given the specified likelihood and prior distributions. Given the posterior samples $$\boldsymbol{\theta}^{(s)}$$ ($$s=1,...,S$$), possibly after sphering (explained below), the next step is to estimate the posterior density by maximizing the local likelihood function,   \begin{align} \hat{l} (\boldsymbol{\theta}, \boldsymbol{a} ) = \sum_{s=1}^{S} K_{ \boldsymbol{h}} (\boldsymbol{\theta}^{(s)} -\boldsymbol{\theta} ) P_{\boldsymbol{a}}(\boldsymbol{\theta}^{(s)} - \boldsymbol{\theta}) - m \int K_{ \boldsymbol{h}} ( \boldsymbol{u} -\boldsymbol{\theta} ) \exp{(P_{\boldsymbol{a}}(\boldsymbol{u}-\boldsymbol{\theta}))} d \boldsymbol{u} , \label{eq:loc2} \end{align} (2.1) with respect to the parameters $$\boldsymbol{a}$$ for a particular $$\boldsymbol{\theta}$$. Here, $$K_{\boldsymbol{h}} (\cdot )$$ is a multivariate kernel and $$P_{\boldsymbol{a}}(\boldsymbol{\theta}^{(s)}-\boldsymbol{\theta})$$ is a quadratic function with parameters $$\boldsymbol{a}$$, given for $$d=2$$ by   \begin{align} \nonumber P_{\boldsymbol{a}}(\boldsymbol{\theta}^{(s)} -\boldsymbol{\theta}) &= a_0 + a_1 (\theta^{(s)}_1 - \theta_1) + a_2 (\theta^{(s)}_2- \theta_2) + \frac{1}{2}a_{1,1} (\theta^{(s)}_1 - \theta_1 )^2 + \frac{1}{2}a_{2,2} (\theta^{(s)}_2 - \theta_2 )^2 \\ &\quad{} + a_{1,2} (\theta^{(s)}_1 - \theta_1) (\theta^{(s)}_2 - \theta_2 ). \label{eq:quad} \end{align} (2.2) Higher-order polynomial functions can also be used, but a quadratic function is sufficient in most situations (Loader, 1999, Ch. 2); and a quadratic function has the same asymptotic bias as a boundary kernel estimator, which corrects for boundary bias of a regular kernel density estimator (Shealther, 2004). We recommend applying a linear transformation to $$\boldsymbol{\theta}^{(s)}$$, called sphering (Wand and Jones, 1993; Duong and Hazelton, 2003). Specifically, $$\boldsymbol{\theta}^{(s)}$$ is mean-centered and premultiplied by the Cholesky decomposition of its empirical covariance matrix so that the transformed parameter draws have an identity covariance matrix and a zero mean vector. For notational simplicity, we refer to the transformed parameter draws simply as $${\boldsymbol{\theta}}^{(s)}$$. Sphering is performed mainly for making the elements of $$\boldsymbol{\theta}$$ approximately independent, so that we can set cross-product terms to zero in $$P_{\boldsymbol{a}}(\boldsymbol{\theta}^{(s)}-\boldsymbol{\theta})$$. How well the resulting quadratic function approximates the log posterior density of the sphered samples near the mode will depend on the extent to which the posterior is multivariate normal, and therefore we expect the approximation to work better for larger sample sizes as the posterior approaches multivariate normality (Walker, 1969). Sphering allows us to simplify the integral in (2.1) as a product of one-dimensional integrals if we also factorize the multivariate kernel function such that   $$\label{eq:prodker} K_{ \boldsymbol{h}} (\boldsymbol{\theta}^{(s)} - \boldsymbol{\theta} ) = \prod_{m=1}^d K_0 \left(\frac{\theta^{(s)}_m - \theta_m }{h_m}\right)\!,$$ (2.3) where we choose a tricube weight function for the univariate kernel, $$K_0 (u) = \frac{ \exp{(-|u|)}} { 1 + |u|^3}$$, but other weight functions can also be used. We utilize a nearest neighbor bandwidth $$h_m$$ in (2.3) to ensure that each kernel evaluation incorporates the same number of observations, namely a nearest neigbor fraction $$\lambda$$. We use $$\lambda= 0.7$$ as the default value (suggested by Loader (1999)), but users can select another value if desirable. Fortunately, local likelihood density estimation is less sensitive to the bandwidth choice (here $$\lambda$$) than kernel density estimation (e.g. Otneim and others, 2013) and our empirical and simulated examples in Sections 5 and 6 also show that the estimates are not overly sensitive to the choice of $$\lambda$$. Let the local likelihood estimate of the posterior density be denoted $$\text{Ps}_p (\boldsymbol{\theta} | \boldsymbol{y})$$. The next step of MCLL is to obtain an approximate likelihood function $$\hat{L}(\boldsymbol{\theta} |\boldsymbol{y} )$$, defined up to a constant $$C$$, as   $$\label{eq:locunweight} \hat{L}(\boldsymbol{\theta} |\boldsymbol{y} ) = \frac{1} {p(\boldsymbol{\theta})} \text{Ps}_p (\boldsymbol{\theta}{\,|\,}\boldsymbol{y}),$$ (2.4) where $$p (\boldsymbol{\theta})$$ is the prior. Then we maximize the approximate likelihood function $$\hat{L}(\boldsymbol{\theta} |\boldsymbol{y} )$$ with respect to the $$d$$-dimensional parameter $$\boldsymbol{\theta}$$. For maximization, we utilize the limited memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS; Byrd and others, 1995) as a default method. 2.2. MCLL versus MCKL The MCLL algorithm is closely related to the unweighted MCKL method proposed by DeValpine (2004). A weighted version of MCKL can also be formulated as an unnormalized, importance-sampled kernel estimate of the true likelihood $$L( \boldsymbol{\theta}|\boldsymbol{y})$$. When a diffuse prior is used, there is reportedly little difference in performance between the weighted and unweighted MCKL versions. De Valpine (2004) explained that MCKL estimates are obtained by scaling kernel mode estimates by a bounded normalizing constant; hence, the mode convergence of kernel estimators can be transferred to MCKL convergence. Similarly, the mode convergence of the local likelihood density estimator can be transferred to MCLL convergence. Although no documented proof of general convergence of mode estimators is yet available for local likelihood density estimation, researchers have shown the asymptotic convergence of a local likelihood density (e.g. Otneim and others, 2013) and the asymptotic mode convergence of a boundary kernel (e.g. Hall and others, 2006). The latter is especially encouraging because a boundary kernel is identical to the local likelihood density estimator with a quadratic polynomial function in terms of asymptotic bias (Shealther, 2004). With a zero-degree polynomial, the convergence for MCKL can directly be applied to MCLL. The essential technical difference between MCLL and MCKL lies in how the posterior density is approximated: MCLL utilizes local likelihood estimation, while MCKL employs kernel density estimation. Local likelihood estimation has important advantages over kernel density estimation (Loader, 1996; Shealther, 2004). The most useful and important strength of local likelihood is that it generates less bias near maxima compared with kernel density estimation (Loader, 1999). This is a critical point because the ultimate goal of both MCKL and MCLL algorithms comes down to accurately finding modes. In fact, De Valpine (2004) proposed two methods for reducing mode bias stemming from kernel smoothing—utilizing a focused prior based on an initial set of parameter estimates (so-called “zooming-in”) and a correction based on posterior cumulants. In an empirical example (in Section 5.1) we show that MCLL estimates are indeed comparable to the MCKL estimates obtained after bias correction. 3. Standard errors In MLE, standard error estimates are usually based on the Hessian of the log-likelihood function evaluated at the MLE. For MCLL, it is also feasible to utilize the Hessian matrix of $$\text{log} \, \hat{L}(\boldsymbol{\theta} |\boldsymbol{y})$$, obtained through numerical differentiation of the estimated log-likelihood function. However, in this case the standard error estimates can be greatly affected by bandwidth choice. For instance, with a small bandwidth, the approximate likelihood surface is “wiggly” and the standard errors are likely to be underestimated, while with a large bandwidth, the approximate likelihood surface is oversmoothed and the standard errors may be overestimated. To avoid this instability, we propose an alternative way of approximating the Hessian for MCLL. First, we write the log-likelihood function as   $$\text{log} \, L(\boldsymbol{\theta}|\boldsymbol{y}) = \text{log} \, p(\boldsymbol{\theta} |\boldsymbol{y}) - \text{log} \, p(\boldsymbol{\theta}) + C, \label{eq:se1}$$ (3.1) where $$\text{log} \, p(\boldsymbol{\theta} |\boldsymbol{y})$$ is the log-posterior, $$\text{log} \, p(\boldsymbol{\theta})$$ is the log-prior, and $$C$$ is a constant. We then obtain the required Hessian as follows:   \begin{align*} \nonumber \frac{\partial^2}{\partial \boldsymbol{\theta}^2} \text{log} \, L(\boldsymbol{\theta}|\boldsymbol{y}) \bigg |_{\hat{\boldsymbol{\theta}}} &= \frac{\partial^2}{\partial \boldsymbol{\theta}^2} \text{log} \, p(\boldsymbol{\theta} |\boldsymbol{y}) \bigg|_{\hat{\boldsymbol{\theta}}} - \frac{\partial^2}{\partial \boldsymbol{\theta}^2} \text{log} \, p(\boldsymbol{\theta}) \bigg|_{\hat{\boldsymbol{\theta}}}. \end{align*} By using $$\hat{L}(\boldsymbol{\theta} |\boldsymbol{y})$$ instead of $$L(\boldsymbol{\theta}|\boldsymbol{y})$$, we obtain   \begin{equation*}\label{eq:hessian} H_{\hat{L}} = H_{\text{Ps}} - H_{\text{Pr}}, \end{equation*} where $$H_{\hat{L}}$$, $$H_{\text{Ps}}$$, and $$H_{\text{Pr}}$$ are the Hessian matrices of the approximate log-likelihood $$\text{log} \, \hat{L}(\boldsymbol{\theta} | \boldsymbol{y} )$$, the log-posterior $$\text{log} \, p(\boldsymbol{\theta} |\boldsymbol{y})$$, and the log-prior $$\text{log} \, p(\boldsymbol{\theta})$$, respectively. An analytical solution is usually available for $$H_{\text{Pr}}$$. $$H_{\text{Ps}}$$ is easy to obtain by differentiating the quadratic function (2.2). If sphering was used, the MCLL estimates and their standard errors need to be back-transformed to the original scale. Because the back-transformation is linear, transformation of the standard errors, which resembles the delta method, does not rely on asymptotic normality of $$\boldsymbol{\theta}$$. Alternatively, a parametric bootstrap method can be applied which does not assume that the specified model is correct or that our approximation of the Hessian is accurate. 4. Bayes factors Bayes factors can be computed as a by-product of the MCLL algorithm. Suppose, we compare the fit of model $$M_1$$ and model $$M_2$$. For $$M_k$$ ($$k=1,2$$), the posterior density of the model parameters at the MCLL estimates $$\hat{\boldsymbol{\theta}}_k$$ is   \begin{align}\label{eq:post_se} p(\hat{\boldsymbol{\theta}}_k| \boldsymbol{y}, M_k) &= p(\hat{\boldsymbol{\theta}}_k| M_k) \frac{ p(\boldsymbol{y}| \hat{\boldsymbol{\theta}}_k, M_k)} {p(\boldsymbol{y} | M_k)}, \end{align} (4.1) where $$p(\hat{\boldsymbol{\theta}}_k| M_k)$$ is the prior density for model $$M_k$$, $$p(\boldsymbol{y}| \hat{\boldsymbol{\theta}}_k, M_k)$$ is the corresponding likelihood, and $$p(\boldsymbol{y} | M_k)$$ is the marginal likelihood, which can be expressed as $$p(\boldsymbol{y} | M_k) = \int p (\boldsymbol{y} | \boldsymbol{\theta}_k, M_k ) p (\boldsymbol{\theta}_k | M_k) d \boldsymbol{\theta}_k$$. By dividing both sides of Equation (4.1) by the prior $$p(\hat{\boldsymbol{\theta}}_k| M_k)$$, we obtain   \begin{align} \hat{L}(\hat{\boldsymbol{\theta}}_k | \boldsymbol{y}, M_k) = \frac {p(\hat{\boldsymbol{\theta}}_k| \boldsymbol{y} , M_k)}{p( \hat{\boldsymbol{\theta}}_k | M_k)} &= \frac{ p (\boldsymbol{y}| \hat{\boldsymbol{\theta}}_k , M_k)} {p(\boldsymbol{y} | M_k)}. \label{eq:bayes_m1} \end{align} (4.2) Based on (4.2), we can compute the ratio for two models ($$k=1,2$$) as   \begin{align*} \frac{\hat{L}(\hat{\boldsymbol{\theta}}_2 | \boldsymbol{y}, M_2)}{\hat{L}(\hat{\boldsymbol{\theta}}_1 | \boldsymbol{y}, M_1)} = \frac{p(\boldsymbol{y} | M_1)} {p(\boldsymbol{y} | M_2)} \times \frac{ p(\boldsymbol{y}|\hat{\boldsymbol{\theta}}_2, M_2)} { p(\boldsymbol{y}|\hat{\boldsymbol{\theta}}_1, M_1)}, \end{align*} and the Bayes factor $$\widehat{\text{BF}}_{12}$$ can then be computed as   $$\label{eq:bf} \widehat{\text{BF}}_{12} = \frac{\hat{L}(\hat{\boldsymbol{\theta}}_2 | \boldsymbol{y}, M_2)}{\hat{L}(\hat{\boldsymbol{\theta}}_1 | \boldsymbol{y}, M_1)} \times \frac{ p(\boldsymbol{y}| \hat{\boldsymbol{\theta}}_1 , M_1)} { p(\boldsymbol{y}| \hat{\boldsymbol{\theta}}_2 , M_2)}.$$ (4.3) Note that $$\hat{L}(\hat{\boldsymbol{\theta}}_k | \boldsymbol{y}, M_k)$$ ($$k=1,2$$) are by-products of the MCLL algorithm and the likelihood $$p(\boldsymbol{y}| \hat{\boldsymbol{\theta}}_k , M_k)$$ can readily be obtained using importance sampling (e.g. Shephard and Pitt, 1997) as shown in detail in the Appendix A of supplementary material available at Biostatistics online. 5. Empirical illustrations To illustrate the flexibility and utility of the MCLL algorithm, we provide two empirical examples where approximate ML estimation is infeasible in practice. 5.1. GLMM with crossed random effects for binary data The salamander mating dataset, described and analyzed by McCullagh and Nelder (1989) and many others, consists of three separate experiments, each involving matings among salamanders of two different populations, called Rough Butt (R) and White Side (W). The two salamander populations are geographically isolated from one another. The main research question is whether barriers to interbreeding have evolved so that matings within a population are more successful than those between populations. Each experiment included two groups of 10 female and 10 male salamanders from each population. Within each group, each salamander was paired with three salamanders of the opposite sex from the same population and three from the other population, which led to a total of 120 matings of four kinds of pairings (RR, RW, WR, and WW) per experiment. The response $$y_{ij}$$ is a binary variable indicating whether the mating was successful between female $$i$$ ($$i=1,...,60$$) and male $$j$$ ($$\,j=1,...,60$$). Specific aims for the data analysis are: (1) to find out the mating probabilities of the four kinds of pairings and (2) to investigate heterogeneity of mating success across female and male salamanders. We adopt model (A) by Karim and Zeger (1992) which can be written as:   $$\label{md:sal} \text{logit}(p (y_{ij} = 1 | z^f_i, z^m_j, x_{1i},x_{2j})) = \beta_1 + \beta_2 x_{1i} + \beta_3 x_{2j} + \beta_4 x_{1i}x_{2j} + z^f_i + z^m_j,$$ (5.1) where $$\beta_1$$ is the intercept, $$\beta_2$$, $$\beta_3$$, and $$\beta_4$$ are the regression coefficients of a dummy variable for a W female ($$x_{1i}$$), a dummy variable for a W male ($$x_{2j}$$), and the interaction between $$x_{1i}$$ and $$x_{2j}$$ ($$x_{1i}x_{2j}$$), respectively. The final two terms are random effects $$z^f_i \sim N(0, \sigma^2_f)$$ for females and $$z^m_j \sim N(0, \sigma^2_m)$$ for males. ML estimation of the parameters of Model (5.1) is challenging because the two random effects ($$z^f_i$$ and $$z^m_j$$) are cross-classified, rather than nested. For instance, adaptive quadrature, which can be viewed as the gold standard, nearly exact ML method for GLMMs, requires evaluating $$r^{61}$$ terms (with $$r$$ quadrature points). Even with $$r=3$$, adaptive quadrature is computationally intensive. 5.1.1. Implementation To obtain posterior samples of the model parameters, we considered three sets of priors: (1) a normal prior $$N(0,10)$$ for the regression coefficients and a truncated normal prior $$N(0,10)T(0,)$$ for the standard deviation parameters (where $$T(a,b)$$ is a truncation function with a lower bound $$a$$ and an upper bound $$b$$), (2) a normal prior $$N(0,4)$$ for the regression coefficients and a truncated normal prior $$N(0,4)T(0,)$$ for the standard deviation parameters, and (3) a normal prior $$N(0,4)$$ for the regression coefficients and a log-normal prior $$log N(-0.9870,0.5878)$$ for the standard deviation parameters, which gives a distribution with mean 0.5 and variance 0.2. The first set of priors is relatively diffuse, while the second set is weakly informative. The third set uses a more informative prior than the second set for the standard deviation parameters. The goal was to evaluate the sensitivity of the MCLL estimates, compared to the sensitivity of the posterior mean and posterior mode estimates, to different prior choices. The posterior samples were obtained using the WinBUGS software package (Lunn and others, 2000). The code is provided in the Appendix B of supplementary material available at Biostatistics online. Three chains with relatively diffuse starting values were run for 1000 iterations after a 2000 burn-in period. For convergence assessment, the Gelman–Rubin statistic (Gelman and Rubin, 1992) was used in addition to graphical checks such as trace plots and autocorrelation plots. We found that the posterior distributions were close to symmetric with skewness coefficients within $$\pm$$.25 for the regression parameters and smaller than 0.50 for the standard deviation parameters for all three prior specifications. The default nearest neighbor fraction of $$\lambda = 0.7$$ was chosen; to check for sensitivity, we also tried $$\lambda$$ values between 0.5 and 0.8 and found that average differences in parameter estimates were less than 10%. In terms of computational time, MCLL required only three seconds after MCMC samples were obtained (using a Intel Pentium Dual-Core 2.5-GHz processor computer with 3.2 GB of memory). We compared the MCLL parameter estimates with published estimates from various estimators such as MCEM (Booth and Hobert, 1999) and Monte Carlo maximum likelihood (MCMLE; Sung and Geyer, 2007). In addition, we obtained estimates using the Laplace approximation (using $$\texttt{lme4}$$ (Bates and others, 2014)) because it is perhaps the most widely used approximate ML estimator for fitting models with crossed random effects. We also compared MCLL with MCKL (discussed in Section 2.2). No software package is currently available for MCKL; hence, we implemented a customized MCKL algorithm for fitting model (5.1) using a diagonal bandwidth matrix; in this case, the $$(m,m)$$th bandwidth element $$h_m$$ for the $$m$$-th parameter was determined based on $$h_m = q \sigma_m$$, where $$q$$ is a proportionality constant and $$\sigma_m$$ is the posterior standard deviation of the $$m$$-th parameter. We considered ten different bandwidth values, ranging from $$q=$$ 0.1 to 1.0, for checking MCKL’s sensitivity to bandwidth choice. Additionally, we applied a posterior cumulant-based bias correction (De Valpine, 2004). The MCKL estimates varied to some degree across the ten sets of bandwidth values. The parameter estimates were somewhat shrunken after applying the bias correction. All MCKL estimates, before and after bias correction, are provided in the Appendix C of supplementary material available at Biostatistics online. Since no default $$q$$ value is available, we selected the mean value $$q=0.5$$ for performance comparison with other estimators. Lastly, for MCLL, we illustrate the approximate Hessian-based and bootstrap standard error estimation methods (see Section 3) and Bayes factor computation (see Section 4). 5.1.2. Results The first block of Table 1 lists the parameter estimates obtained from other approximate ML estimators. Standard errors are included when they were reported in the original papers. The second to fourth blocks of Table 1 list the MCLL parameter estimates, posterior means, and posterior modes obtained from three different prior choices. Table 1. Comparison of several estimators for the salamander mating data. Standard errors are given in parentheses if reported. MCEM: Booth and Hobert (1999); Laplace: lme4; MCMLE: Geyer (2007); MCKL: MCKL method ($$q=0.5$$); MCKL (adjusted): MCKL method after cumulant bias correction Method  $$\beta_{1}$$  $$\beta_{2}$$  $$\beta_{3}$$  $$\beta_{4}$$  $$\sigma_m$$  $$\sigma_f$$  $$\text{MCEM}^\text{a}$$  1.02  –2.96  –0.69  3.63  1.18  1.12  $$\text{MCMLE}^\text{a}$$  1.00  –2.78  –0.47  3.52  1.17  1.10  Laplace  1.00  –2.90  –0.70  3.59  1.08  1.02     (0.39)  (0.56)  (0.46)  (0.64)        MCKL  1.05  –3.01  –0.76  3.75  1.24  1.18  MCKL(adjusted)  1.03  –2.97  –0.74  3.69  1.23  1.17  MCLL                    $$\hspace{.1 in}$$ Prior 1$$^\text{b}$$  1.04  –3.04  –0.69  3.67  1.17  1.15     (0.41)  (0.58)  (0.45)  (0.62)        $$\hspace{.2 in}$$Boot.SE$$^\text{c}$$  [0.57]  [0.66]  [0.52]  [0.69]        $$\hspace{.2 in}$$(MCE)$$^\text{c}$$  (0.03)  (0.05)  (0.05)  (0.04)        $$\hspace{.1 in}$$ Prior 2$$^\text{b}$$  0.95  –2.85  –0.73  3.63  1.11  1.10     (0.38)  (0.53)  (0.44)  (0.58)        $$\hspace{.1 in}$$ Prior 3$$^\text{b}$$  0.97  –2.79  –0.64  3.50  1.03  1.04     (0.41)  (0.56)  (0.48)  (0.67)        Post mean                    $$\hspace{.1 in}$$ Prior 1  0.97  –2.89  –0.62  3.56  1.24  1.19     (0.43)  (0.59)  (0.47)  (0.64)        $$\hspace{.1 in}$$ Prior 2  0.84  –2.61  –0.49  3.25  1.18  1.12     (0.37)  (0.50)  (0.43)  (0.55)        $$\hspace{.1 in}$$ Prior 3  0.80  –2.45  –0.45  3.02  0.99  0.93     (0.35)  (0.47)  (0.42)  (0.55)        Post mode                    $$\hspace{.1 in}$$ Prior 1  0.95  –2.81  –0.60  3.40  1.13  1.07     (0.42)  (0.57)  (0.46)  (0.62)        $$\hspace{.1 in}$$ Prior 2  0.81  –2.52  –0.58  3.22  1.06  1.04     (0.39)  (0.52)  (0.45)  (0.56)        $$\hspace{.1 in}$$ Prior 3  0.75  –2.28  –0.45  2.86  0.89  0.90     (0.32)  (0.45)  (0.39)  (0.53)        Method  $$\beta_{1}$$  $$\beta_{2}$$  $$\beta_{3}$$  $$\beta_{4}$$  $$\sigma_m$$  $$\sigma_f$$  $$\text{MCEM}^\text{a}$$  1.02  –2.96  –0.69  3.63  1.18  1.12  $$\text{MCMLE}^\text{a}$$  1.00  –2.78  –0.47  3.52  1.17  1.10  Laplace  1.00  –2.90  –0.70  3.59  1.08  1.02     (0.39)  (0.56)  (0.46)  (0.64)        MCKL  1.05  –3.01  –0.76  3.75  1.24  1.18  MCKL(adjusted)  1.03  –2.97  –0.74  3.69  1.23  1.17  MCLL                    $$\hspace{.1 in}$$ Prior 1$$^\text{b}$$  1.04  –3.04  –0.69  3.67  1.17  1.15     (0.41)  (0.58)  (0.45)  (0.62)        $$\hspace{.2 in}$$Boot.SE$$^\text{c}$$  [0.57]  [0.66]  [0.52]  [0.69]        $$\hspace{.2 in}$$(MCE)$$^\text{c}$$  (0.03)  (0.05)  (0.05)  (0.04)        $$\hspace{.1 in}$$ Prior 2$$^\text{b}$$  0.95  –2.85  –0.73  3.63  1.11  1.10     (0.38)  (0.53)  (0.44)  (0.58)        $$\hspace{.1 in}$$ Prior 3$$^\text{b}$$  0.97  –2.79  –0.64  3.50  1.03  1.04     (0.41)  (0.56)  (0.48)  (0.67)        Post mean                    $$\hspace{.1 in}$$ Prior 1  0.97  –2.89  –0.62  3.56  1.24  1.19     (0.43)  (0.59)  (0.47)  (0.64)        $$\hspace{.1 in}$$ Prior 2  0.84  –2.61  –0.49  3.25  1.18  1.12     (0.37)  (0.50)  (0.43)  (0.55)        $$\hspace{.1 in}$$ Prior 3  0.80  –2.45  –0.45  3.02  0.99  0.93     (0.35)  (0.47)  (0.42)  (0.55)        Post mode                    $$\hspace{.1 in}$$ Prior 1  0.95  –2.81  –0.60  3.40  1.13  1.07     (0.42)  (0.57)  (0.46)  (0.62)        $$\hspace{.1 in}$$ Prior 2  0.81  –2.52  –0.58  3.22  1.06  1.04     (0.39)  (0.52)  (0.45)  (0.56)        $$\hspace{.1 in}$$ Prior 3  0.75  –2.28  –0.45  2.86  0.89  0.90     (0.32)  (0.45)  (0.39)  (0.53)        $$^{\rm a}$$The parameter estimates for MCEM and MCMLE in the respective original papers were reported based on the estimates in the respective original papers that used a different parameterization of the same model. Standard errors for MCMLE could not be derived because the covariance matrix of the model parameter estimates was not provided in the original paper. $$^{\rm b}$$Prior 1: diffuse normal priors; Prior 2: weakly informative normal priors; Prior 3: more informative priors for standard deviation parameters. $$^{\rm c}$$Boot.SE: bootstrap standard errors. MCE: Monte Carlo errors for the bootstrap standard errors. Table 1. Comparison of several estimators for the salamander mating data. Standard errors are given in parentheses if reported. MCEM: Booth and Hobert (1999); Laplace: lme4; MCMLE: Geyer (2007); MCKL: MCKL method ($$q=0.5$$); MCKL (adjusted): MCKL method after cumulant bias correction Method  $$\beta_{1}$$  $$\beta_{2}$$  $$\beta_{3}$$  $$\beta_{4}$$  $$\sigma_m$$  $$\sigma_f$$  $$\text{MCEM}^\text{a}$$  1.02  –2.96  –0.69  3.63  1.18  1.12  $$\text{MCMLE}^\text{a}$$  1.00  –2.78  –0.47  3.52  1.17  1.10  Laplace  1.00  –2.90  –0.70  3.59  1.08  1.02     (0.39)  (0.56)  (0.46)  (0.64)        MCKL  1.05  –3.01  –0.76  3.75  1.24  1.18  MCKL(adjusted)  1.03  –2.97  –0.74  3.69  1.23  1.17  MCLL                    $$\hspace{.1 in}$$ Prior 1$$^\text{b}$$  1.04  –3.04  –0.69  3.67  1.17  1.15     (0.41)  (0.58)  (0.45)  (0.62)        $$\hspace{.2 in}$$Boot.SE$$^\text{c}$$  [0.57]  [0.66]  [0.52]  [0.69]        $$\hspace{.2 in}$$(MCE)$$^\text{c}$$  (0.03)  (0.05)  (0.05)  (0.04)        $$\hspace{.1 in}$$ Prior 2$$^\text{b}$$  0.95  –2.85  –0.73  3.63  1.11  1.10     (0.38)  (0.53)  (0.44)  (0.58)        $$\hspace{.1 in}$$ Prior 3$$^\text{b}$$  0.97  –2.79  –0.64  3.50  1.03  1.04     (0.41)  (0.56)  (0.48)  (0.67)        Post mean                    $$\hspace{.1 in}$$ Prior 1  0.97  –2.89  –0.62  3.56  1.24  1.19     (0.43)  (0.59)  (0.47)  (0.64)        $$\hspace{.1 in}$$ Prior 2  0.84  –2.61  –0.49  3.25  1.18  1.12     (0.37)  (0.50)  (0.43)  (0.55)        $$\hspace{.1 in}$$ Prior 3  0.80  –2.45  –0.45  3.02  0.99  0.93     (0.35)  (0.47)  (0.42)  (0.55)        Post mode                    $$\hspace{.1 in}$$ Prior 1  0.95  –2.81  –0.60  3.40  1.13  1.07     (0.42)  (0.57)  (0.46)  (0.62)        $$\hspace{.1 in}$$ Prior 2  0.81  –2.52  –0.58  3.22  1.06  1.04     (0.39)  (0.52)  (0.45)  (0.56)        $$\hspace{.1 in}$$ Prior 3  0.75  –2.28  –0.45  2.86  0.89  0.90     (0.32)  (0.45)  (0.39)  (0.53)        Method  $$\beta_{1}$$  $$\beta_{2}$$  $$\beta_{3}$$  $$\beta_{4}$$  $$\sigma_m$$  $$\sigma_f$$  $$\text{MCEM}^\text{a}$$  1.02  –2.96  –0.69  3.63  1.18  1.12  $$\text{MCMLE}^\text{a}$$  1.00  –2.78  –0.47  3.52  1.17  1.10  Laplace  1.00  –2.90  –0.70  3.59  1.08  1.02     (0.39)  (0.56)  (0.46)  (0.64)        MCKL  1.05  –3.01  –0.76  3.75  1.24  1.18  MCKL(adjusted)  1.03  –2.97  –0.74  3.69  1.23  1.17  MCLL                    $$\hspace{.1 in}$$ Prior 1$$^\text{b}$$  1.04  –3.04  –0.69  3.67  1.17  1.15     (0.41)  (0.58)  (0.45)  (0.62)        $$\hspace{.2 in}$$Boot.SE$$^\text{c}$$  [0.57]  [0.66]  [0.52]  [0.69]        $$\hspace{.2 in}$$(MCE)$$^\text{c}$$  (0.03)  (0.05)  (0.05)  (0.04)        $$\hspace{.1 in}$$ Prior 2$$^\text{b}$$  0.95  –2.85  –0.73  3.63  1.11  1.10     (0.38)  (0.53)  (0.44)  (0.58)        $$\hspace{.1 in}$$ Prior 3$$^\text{b}$$  0.97  –2.79  –0.64  3.50  1.03  1.04     (0.41)  (0.56)  (0.48)  (0.67)        Post mean                    $$\hspace{.1 in}$$ Prior 1  0.97  –2.89  –0.62  3.56  1.24  1.19     (0.43)  (0.59)  (0.47)  (0.64)        $$\hspace{.1 in}$$ Prior 2  0.84  –2.61  –0.49  3.25  1.18  1.12     (0.37)  (0.50)  (0.43)  (0.55)        $$\hspace{.1 in}$$ Prior 3  0.80  –2.45  –0.45  3.02  0.99  0.93     (0.35)  (0.47)  (0.42)  (0.55)        Post mode                    $$\hspace{.1 in}$$ Prior 1  0.95  –2.81  –0.60  3.40  1.13  1.07     (0.42)  (0.57)  (0.46)  (0.62)        $$\hspace{.1 in}$$ Prior 2  0.81  –2.52  –0.58  3.22  1.06  1.04     (0.39)  (0.52)  (0.45)  (0.56)        $$\hspace{.1 in}$$ Prior 3  0.75  –2.28  –0.45  2.86  0.89  0.90     (0.32)  (0.45)  (0.39)  (0.53)        $$^{\rm a}$$The parameter estimates for MCEM and MCMLE in the respective original papers were reported based on the estimates in the respective original papers that used a different parameterization of the same model. Standard errors for MCMLE could not be derived because the covariance matrix of the model parameter estimates was not provided in the original paper. $$^{\rm b}$$Prior 1: diffuse normal priors; Prior 2: weakly informative normal priors; Prior 3: more informative priors for standard deviation parameters. $$^{\rm c}$$Boot.SE: bootstrap standard errors. MCE: Monte Carlo errors for the bootstrap standard errors. The MCLL, posterior mean and posterior mode estimates all vary between different choices of priors. However, the degree of variation is smaller for MCLL than for the posterior mean and posterior mode estimates. For example, comparing estimates of $$\beta_4$$ based on Prior 3 with those based on Prior 1, the MCLL estimates change by approximately 4.6% ($$=$$(3.67$$-$$3.50)$$/$$3.67), while the posterior mean and the posterior mode change by approximately 15.1% ($$=$$(3.56$$-$$3.02)$$/$$3.56) and 15.9% ($$=$$(3.40$$-$$2.86)$$/$$3.40), respectively. This result indicates that MCLL is less sensitive to the choice of priors than the posterior mean and mode estimates. For comparison with other estimators, we chose the MCLL estimates based on Prior 1 (relatively diffuse prior). These MCLL estimates are quite close to the MCEM and Laplace estimates and similar to the MCMLEs. In comparison to MCKL, the MCLL parameter estimates are somewhat closer to the bias-corrected MCKL estimates than to the original MCKL estimates. The approximate Hessian-based standard errors for MCLL are comparable to the standard error estimates from other estimators, whereas the bootstrap standard errors for MCLL are somewhat larger. Next, in order to illustrate the Bayes factor computation with our proposed procedure (described in Section 4), we considered an additional, reduced model that does not include the interaction term $$\beta_4 x_{1i}x_{2j}$$. The Bayes factor comparing the full and reduced models was computed as 1.12, indicating some evidence for the inclusion of the interaction term. Our Bayes factor value is comparable to the Bayes factor obtained based on the Laplace approximation (Raftery, 1996) which is 1.83. 5.2. Spatial model for count data As the second empirical example, we consider Best and others (2000)’s spatial Poisson-gamma convolution model for traffic-related air pollution and respiratory illness in 7–9 year-old children living in Northern England’s Huddersfield region. For analysis, we utilize a randomly perturbed dataset provided by Thomas and others (2014, pp 26–29). Spatial regions $$A_i$$ were defined by arbitrarily partitioning Huddersfield into 605 regular 750 $$m^2$$ grid cells covering a rectangle containing the Huddersfield study region plus a 2 km surrounding buffer zone (Best and others, 2000). Let $$Y_i$$ ($$i=1,...,I$$) be the number of reported cases of frequent coughing among children in area $$A_i$$. The variable $$Y_i$$ follows a Poisson distribution, $$Y_i \sim \text{Po} (\lambda_i)$$, with means $$\lambda_i = N_i \times p_i$$ given by the product of the population size (in hundreds) $$N_i$$ of children in area $$A_i$$ and the disease rate $$p_i$$ in the area. Risk factors can be viewed as incrementally contributing to the risk. Hence, a linear model can be specified, as in Best and others (2000), by assuming that coefficients of the (positive) risk factors are strictly non-negative, which can be achieved by specifying gamma priors for the coefficients. Specifically, a spatial moving average Poisson regression model was formulated for the disease rate $$p_i$$ in area $$A_i$$ as follows:   \begin{align}\label{eq:best} p_i &= \beta_0 + \beta_1 X_i + \beta_2 \sum_{j \in J}k_{ij} \gamma_j. \end{align} (5.2) In the first two terms, $$\beta_0$$ is the baseline disease rate, $$X_i$$ is the excess $$\text{NO}_2$$ concentration in area $$A_i$$ (measured in mg $$m^{-3}$$ above the Huddersfield baseline $$\text{NO}_2$$ concentration of approximately 20mg $$m^{-3}$$), and $$\beta_1$$ is the corresponding coefficient. In the final term, the sum is over areas $$A_j$$ that are near area $$A_i$$, $$\gamma_j$$ is a latent quantity representing the effect in area $$A_i$$ of unobserved spatially-distributed risk factors for frequent cough associated with area $$A_j$$, and $$k_{ij}$$ are elements of a Gaussian kernel matrix that govern the influence in $$A_i$$ arising from latent risk factors in $$A_j$$:   $$k_{ij} = \frac{1}{2\pi \rho^2} \exp \left(-d_{ij}^2 / 2 \rho^2 \right)\!,$$ (5.3) where $$d_{ij}$$ represents the Euclidean distance between the centroids of the areas $$A_i$$ and $$A_j$$. The distance scale $$\rho >0$$ governs how rapidly $$k_{ij}$$ declines with increasing distance $$d_{ij}$$. As in Best and others (2000), a point mass prior at $$\rho = 1\,$$km is used in the current analysis. The coefficient $$\beta_2$$ scales the latent variables $$\gamma_j$$ and indicates how much variation is explained by other spatial risk factors besides excess $$\text{NO}_2$$ concentration. 5.2.1. Implementation To obtain posterior samples, independent gamma prior distributions were specified for $$\beta_k$$ ($$k=0,1,2$$) and $$\gamma_j$$. Specifically, gamma prior distributions $$\Gamma (\alpha_k, \delta_k)$$ (with a shape parameter $$\alpha_k$$ and a rate parameter $$\delta_k$$, $$k=0,1,2$$) were chosen so that the number of frequent coughing cases associated with each of the three risk categories lie between one-tenth and ten times the nominal equal attribution with a 90% prior probability; this was achieved by choosing $$\alpha_k = 0.575$$ for all three parameters ($$k=0,1,2$$) and setting the value of the prior mean ($$\alpha_k / \delta_k$$) for $$\beta_k$$ such that the contribution of each risk factor would be equal to one third of the overall observed disease rate in the Huddersfield region; that is $$\delta_0 = 3 \alpha_0 / \bar{Y}$$, $$\delta_1 = 3 \alpha_1 \bar{X}/ \bar{Y}$$, and $$\delta_2 = 3 \alpha_2 / \bar{Y}$$, which correspond to $$\Gamma (0.575, 0.169)$$ for $$\beta_0$$, $$\Gamma (0.575, 1.533)$$ for $$\beta_1$$, and $$\Gamma (0.575, 0.169)$$ for $$\beta_2$$. For the latent risk factors $$\{\gamma_j \}_{j \in J}$$, the gamma prior $$\Gamma (\alpha_{\gamma_j}, \delta_{\gamma})$$ was chosen so that $$\alpha_{\gamma_j}$$ has prior mean $$\alpha_{\gamma_j} /\delta_{\gamma} = |A_j|$$, the area in km$$^2$$ of the region $$A_j$$; this makes the model spatially extensible on the grounds that fine or coarse partitions $$\{A_j\}$$ would lead to identical probability distributions for the sums of $$\gamma_j$$’s over any set, and for kernel sums, $$\sum_{j \in J}k_{ij} \gamma_j$$. The degree of uncertainly about the spatially-varying latent random effect can be adjusted by the value of $$\delta_{\gamma}$$. For moderate spatial variability, we set $$\delta_{\gamma} = 1.0/\text{km}^2$$ to give 1.0 $$\times$$ 30 $$\times$$ 20 $$=$$ 6000 prior points over the 30 km $$\times$$ 20 km Huddersfield region. For additional details on the choice of the prior distributions, see Best and others (2000). To evaluate how MCLL behaves with priors that slightly contradict evidence of the data, we specified an additional set of gamma priors (for the three model parameters) with a mean and a variance that were somewhat larger than those of the originally specified priors. In the Appendix D of supplementary material available at Biostatistics online, we provide graphs of two sets of prior distributions (the original priors and the intentionally “misleading” priors) for each of the three parameters as well as the corresponding posterior distributions. This comparison shows that the posterior distribution appears somewhat skewed especially for the $$\beta_2$$ parameter with the original prior specification. The skewness coefficients are within $$\pm0$$.10 for $$\beta_0$$ and $$\beta_1$$, but it is 0.77 for $$\beta_2$$, indicating that the $$\beta_2$$ posterior distribution is somewhat positively skewed. With the “misleading” prior, the posterior distribution looks less skewed for the $$\beta_2$$ parameter with a skewness coefficient of 0.50. The posterior samples of the model parameters with the specified priors were obtained using the WinBUGS software with three chains and 20 000 iterations after a 10 000 burn-in period. The code is provided in the Appendix B of supplementary material available at Biostatistics online. We used the default nearest neighbor fraction $$\lambda = 0.7$$ and compared the results with other values that ranged between 0.5 and 0.8. They all produced similar results, with a less than 10% average variation in the parameter estimates. For this example, MCLL took approximately two seconds to produce parameter estimates (post MCMC sampling). 5.2.2. Results In Table 2, we observe that the MCLL, posterior mean, and posterior mode estimates vary to some degree depending on which prior is chosen. In particular, the $$\beta_2$$ parameter shows marked differences with the misleading prior (Prior 2) when compared with the original prior (Prior 1). The degree of change is much smaller for the MCLL estimates compared with the posterior mean and posterior mode estimates. The MCLL estimates for $$\beta_2$$ change by approximately $$-$$39% ($$=$$(0.61$$-$$0.81)$$/$$0.61), while the posterior mean and posterior mode change by approximately $$-$$44% ($$=$$(0.93$$-$$1.34)/0.93) and $$-$$152% ($$=$$(0.44$$-$$1.11)/0.44), respectively. Table 2. Parameter estimates for regression coefficients $$\beta_0$$, $$\beta_1$$, and$$\beta_2$$ for the Huddersfield data with the original prior and the poorly chosen prior    $$\beta_0$$  $$\beta_1$$  $$\beta_2$$     Est  (SE)  Est  (SE)  Est  (SE)  MCLL                    $$\hspace{.1 in}$$Prior 1$$^\text{a}$$  6.33  (1.07)  0.36  (0.08)  0.61  (0.49)  $$\hspace{.1 in}$$Prior 2$$^\text{b}$$  6.05  (1.11)  0.36  (0.07)  0.85  (0.63)  Posterior mean                    $$\hspace{.1 in}$$Prior 1  6.14  (1.14)  0.34  (0.11)  0.93  (0.74)  $$\hspace{.1 in}$$Prior 2  5.66  (1.08)  0.35  (0.10)  1.34  (0.70)  Posterior mode                    $$\hspace{.1 in}$$Prior 1  6.51  (1.05)  0.35  (0.10)  0.44  (0.66)  $$\hspace{.1 in}$$Prior 2  5.75  (1.10)  0.37  (0.11)  1.11  (0.71)     $$\beta_0$$  $$\beta_1$$  $$\beta_2$$     Est  (SE)  Est  (SE)  Est  (SE)  MCLL                    $$\hspace{.1 in}$$Prior 1$$^\text{a}$$  6.33  (1.07)  0.36  (0.08)  0.61  (0.49)  $$\hspace{.1 in}$$Prior 2$$^\text{b}$$  6.05  (1.11)  0.36  (0.07)  0.85  (0.63)  Posterior mean                    $$\hspace{.1 in}$$Prior 1  6.14  (1.14)  0.34  (0.11)  0.93  (0.74)  $$\hspace{.1 in}$$Prior 2  5.66  (1.08)  0.35  (0.10)  1.34  (0.70)  Posterior mode                    $$\hspace{.1 in}$$Prior 1  6.51  (1.05)  0.35  (0.10)  0.44  (0.66)  $$\hspace{.1 in}$$Prior 2  5.75  (1.10)  0.37  (0.11)  1.11  (0.71)  $$^{\rm a}$$Prior 1: Original prior. $$^{\rm b}$$Prior 2: (Intentionally) Misleading prior (with a somewhat diffuse prior choice). Table 2. Parameter estimates for regression coefficients $$\beta_0$$, $$\beta_1$$, and$$\beta_2$$ for the Huddersfield data with the original prior and the poorly chosen prior    $$\beta_0$$  $$\beta_1$$  $$\beta_2$$     Est  (SE)  Est  (SE)  Est  (SE)  MCLL                    $$\hspace{.1 in}$$Prior 1$$^\text{a}$$  6.33  (1.07)  0.36  (0.08)  0.61  (0.49)  $$\hspace{.1 in}$$Prior 2$$^\text{b}$$  6.05  (1.11)  0.36  (0.07)  0.85  (0.63)  Posterior mean                    $$\hspace{.1 in}$$Prior 1  6.14  (1.14)  0.34  (0.11)  0.93  (0.74)  $$\hspace{.1 in}$$Prior 2  5.66  (1.08)  0.35  (0.10)  1.34  (0.70)  Posterior mode                    $$\hspace{.1 in}$$Prior 1  6.51  (1.05)  0.35  (0.10)  0.44  (0.66)  $$\hspace{.1 in}$$Prior 2  5.75  (1.10)  0.37  (0.11)  1.11  (0.71)     $$\beta_0$$  $$\beta_1$$  $$\beta_2$$     Est  (SE)  Est  (SE)  Est  (SE)  MCLL                    $$\hspace{.1 in}$$Prior 1$$^\text{a}$$  6.33  (1.07)  0.36  (0.08)  0.61  (0.49)  $$\hspace{.1 in}$$Prior 2$$^\text{b}$$  6.05  (1.11)  0.36  (0.07)  0.85  (0.63)  Posterior mean                    $$\hspace{.1 in}$$Prior 1  6.14  (1.14)  0.34  (0.11)  0.93  (0.74)  $$\hspace{.1 in}$$Prior 2  5.66  (1.08)  0.35  (0.10)  1.34  (0.70)  Posterior mode                    $$\hspace{.1 in}$$Prior 1  6.51  (1.05)  0.35  (0.10)  0.44  (0.66)  $$\hspace{.1 in}$$Prior 2  5.75  (1.10)  0.37  (0.11)  1.11  (0.71)  $$^{\rm a}$$Prior 1: Original prior. $$^{\rm b}$$Prior 2: (Intentionally) Misleading prior (with a somewhat diffuse prior choice). 6. Simulation studies We conducted three simulation studies to evaluate the performance of MCLL in various situations. In the first study, we deliberately selected a linear mixed model so that we could compare the MCLL estimates with exact MLEs. In the second study, we chose a longitudinal logistic mixed model for which MLE with adaptive quadrature is feasible and provides a useful “gold-standard” comparison to MCLL. In the third study, we selected a GLMM with crossed random effects, for which adaptive quadrature is not feasible, to demonstrate the utility of MCLL for such difficult problems. 6.1. Linear mixed model for continuous data In the first simulation example, we utilize the model proposed by Rabe-Hesketh and others (2008) for birth weight data on 1000 families (with mother, father, and one child). For family members $$i$$ nested in family $$j$$ the model is   $$\label{eq:bw} y_{ij} = \boldsymbol{x}_{ij}'\boldsymbol{\beta} + \alpha_{1j}[M_i +K_i/2] + \alpha_{2j}[F_i +K_i/2] + \alpha_{3j}[K_i/\sqrt{2}] + \epsilon_{ij},$$ (6.1) where $$\boldsymbol{x}_{ij}$$ is a vector of covariates with regression coefficients $$\boldsymbol{\beta}$$, $$M_i$$, $$K_i$$, and $$F_i$$ are dummy variables for mother, child, and father, respectively, and the three random effects at level 2, $$\alpha_{1j}$$, $$\alpha_{2j}$$, and $$\alpha_{3j}$$ are i.i.d as $$\alpha_{kj} \sim N(0, \sigma_A^2)$$ with $$k=1,2,3$$. The level-1 residuals have a normal distribution, $$\epsilon_{ij}^{(2)} \sim N(0, \sigma_E^2)$$ and are independent of the random effects. Here $$\sigma_A$$ can be interpreted as the additive genetic standard deviation and $$\sigma_E$$ as the unique environment standard deviation. The covariates include dummy variables for male, and mother being aged 20–35, younger than 35, or older than 35 at the time of the birth. We generated 100 datasets based on model (6.1) using the MLEs of the original data as parameter values, $$\boldsymbol{\beta}=(3368.09, 155.34, 126.94, 213.43)'$$ and $$(\sigma_A, \sigma_E)' = (311.21, 374.66)'$$. To obtain posterior samples, diffuse normal priors were specified for the regression coefficients (mean 0, standard deviation 1000). A log-normal prior $$logN(6.1761, 0.07696)$$ was specified for both $$\sigma_A$$ and $$\sigma_E$$, which gives a distribution with mean 500 and variance 20 000. The 0.05 quantile of this distribution is 307 g, which corresponds to the random effect falling between $$-$$614 and 614 g with 95% probability. The 0.95 quantile is 745 g. This choice of prior gives more weight to large values of $$\sigma_A$$ and $$\sigma_E$$ (compared with the true values) than a diffuse prior and the posterior distributions for $$\sigma_A$$ and $$\sigma_E$$ are centered in the left tail of the prior. However, this choice allows us to evaluate how well MCLL performs when an informative prior is poorly specified. WinBUGS was used to obtain the posterior samples using three chains with 5000 iterations after a 3000 burn-in. The default nearest neighbor fraction $$\lambda=0.7$$ was chosen. The MLEs were obtained using the R function $$\texttt{lme}$$ of the package $$\texttt{nlme}$$ (Pinheiro and others, 2012). 6.1.1. Results Figure 1 provides summary plots that compare the distances from the MLE between the MCLL and Bayesian (posterior mean) estimates for each parameter. The figure shows that the Bayesian estimates display a marked bias (defined relative to the MLEs), which is evident in the point clouds being shifted away from zero on the x-axis. Although the MCLL estimates are more variable, the mean squared distances to the MLEs are smaller than for the Bayesian estimates except for $$\beta_2$$. This is strong evidence that MCLL estimates are closer to the MLEs than the Bayesian estimates. The mean distances between the Bayesian estimates and MLEs differ significantly from zero for all parameters at the 1% level. But for the MCLL estimates, the mean distances are not significantly different from zero at the 5% level except for the parameter $$\sigma_A$$ ($$t = 3.20, p=0.002$$). Fig. 1. View largeDownload slide Distances from the MLE for MCLL estimates (MCLL–MLE) and for posterior mean estimates (Post.m-MLE) for simulated birth weight datasets. The Monte Carlo errors are within 10% of the means of the parameter estimates for all methods. Fig. 1. View largeDownload slide Distances from the MLE for MCLL estimates (MCLL–MLE) and for posterior mean estimates (Post.m-MLE) for simulated birth weight datasets. The Monte Carlo errors are within 10% of the means of the parameter estimates for all methods. In the Appendix E of supplementary material available at Biostatistics online, we compare the average standard error estimates over replicates (denoted SE) with the empirical standard errors (denoted SD) and report the ratios (SE/SD). The results show that both the ML and MCLL standard errors are good approximations of the empirical standard deviations, with SE/SD ratios between 0.91 and 1.14 for MLE and between 0.92 and 1.16 for MCLL. The MCLL standard error estimates tend to be somewhat more conservative (larger) than the ML standard errors. 6.2. Longitudinal model for binary data In the second simulation study, we consider a longitudinal model for repeated measurements of $$N=1000$$ subjects. Let $$y_{it}$$ denote a binary response for person $$i$$ ($$i = 1, ..., N$$) at time $$t$$ ($$t=1, ..., T$$). Then a logistic mixed model for $$y_{it}$$ can be written as   $$\label{eq:binary} logit[\text{Pr}(y_{it}=1|\text{time}_{it},x_i,\epsilon_i)] = \beta_0 + \beta_1 \text{time}_{it} + \beta_2 x_i + \epsilon_{i},$$ (6.2) where $$\beta_0$$ is the intercept, $$\beta_1$$ is the regression coefficient of $$\text{time}_{it}$$ ($$\text{time}_{it}$$ has $$T$$ categories from 0 to 1, equally spaced), and $$\beta_2$$ is the regression coefficient of the time-invariant covariate $$x_i$$ (a dichotomous variable equal to 0 for half of the subjects and 1 for the other half). The subject-specific random effect $$\epsilon_{i}$$ follows a normal distribution, $$\epsilon_{i} \sim N(0, \sigma^2)$$. Three conditions were considered for $$T=(3, 4, 50)$$ to cover small to large cluster sizes. Four conditions were considered for $$\sigma = (0.5, 0.9,1.5, 2.7)$$ to represent small to large intra class correlations for the latent responses (0.07, 0.2, 0.4, and 0.7, respectively). For the regression coefficients, we considered fixed true values, $$\beta_0=0.6,\beta_1=1.0$$, and $$\beta_2=-1.0$$. We generated 200 datasets under each of the twelve conditions. For MCLL, diffuse normal priors were specified for $$\boldsymbol{\beta}$$ with a mean of 0 and variance of 10 and a log-normal prior $$logN(-0.9870, 0.5878)$$ was specified for $$\sigma$$, which gives a distribution with mean 0.5 and variance 0.2. WinBUGS was used to obtain the posterior samples using three chains with 5000 iterations after a 3000 iteration burn-in. The default nearest neighbor fraction $$\lambda=0.7$$ was used. To evaluate the performance of MCLL, we considered adaptive quadrature, which can be seen as the gold standard approximate ML method; for a relatively simple model like (6.2) adaptive quadrature can produce estimates close to true MLEs with a sufficient number of quadrature points (Rabe-Hesketh and others, 2005). As an alternative approximate ML method, we also considered the Laplace approximation because it is the most widely available approximate ML method that can be applied when adaptive quadrature becomes infeasible. We used the $$\texttt{melogit}$$ command in Stata (with seven default quadrature points) for adaptive quadrature and the $$\texttt{lme4}$$ package in R for the Laplace approximation. 6.2.1. Results We provide the bias and MSE estimates of all model parameters under the twelve conditions in the Appendix F of supplementary material available at Biostatistics online. For the regression coefficients ($$\boldsymbol{\beta}$$), MCLL generally has smaller bias than the Laplace approximation especially when $$\sigma \geq 0.9$$. The bias of MCLL and the Laplace approximation are similar to or slightly larger than the bias of adaptive quadrature and the MSE estimates appear similar across the three estimators. We note that MCLL and the Laplace approximation show somewhat larger bias for $$\sigma$$ than for the regression coefficients in some conditions. Hence, we provide Table 3 to further discuss the performance differences between the methods with respect to $$\sigma$$. Table 3 suggests that MCLL has smaller bias and smaller MSE than the Laplace approximation in most conditions, specifically when $$\sigma =0.9$$, $$\sigma =1.5$$, and $$\sigma =2.7$$ across all $$T$$ ($$T=3,4,50$$). When the cluster sizes are small ($$T=3,4$$) and the variance components are large ($$\sigma =1.5, 2.7$$), the MSE of the Laplace approximation is substantially larger than for MCLL. It is important to note that the MCLL estimates are closer to the adaptive quadrature estimates under those conditions. Only for two conditions, when $$\sigma = 0.5$$ with $$T=4$$ and $$T=50$$, MCLL shows somewhat larger bias and MSE than the Laplace approximation. Table 3. Estimated bias and MSE for $$\sigma$$. Shading indicates statistical significance at the 5% level ($$H_0\,{:}\,\text{bias}=0$$). Laplace: Laplace approximation; AQ: adaptive quadrature     Table 3. Estimated bias and MSE for $$\sigma$$. Shading indicates statistical significance at the 5% level ($$H_0\,{:}\,\text{bias}=0$$). Laplace: Laplace approximation; AQ: adaptive quadrature     We also compared average standard error estimates (SE) with the empirical standard errors (SD) for the regression coefficients (see Appendix F, Tables S7–S9 of supplementary material available at Biostatistics online). For MCLL, the means of the standard error estimates (SE) are close to the empirical standard errors (SD) under all conditions. 6.3. Crossed random effects model for binary data In the third simulation study, we consider a crossed random effects model for binary data analogous to the salamander mating data example in Section 5.1. We generated 100 datasets based on model (5.1), while we set $$\boldsymbol{\beta} =(1.06, -3.05, -0.72, 3.77)'$$ and $$(\sigma^2_f, \sigma^2_m)' = (.50,.50)'$$ as data generating parameter values. A diffuse normal prior $$N(0,10)$$ was specified for the regression coefficients ($$\boldsymbol{\beta}$$) and a truncated normal prior $$N(0,10)T(0,)$$ for the standard deviation parameters ($$\sigma$$). WinBUGS was used to obtain the posterior samples using three chains with 5000 iterations after a 3000 burn-in. The default nearest neighbor fraction $$\lambda=0.7$$ was used. We compared the performance of MCLL with the Laplace approximation and MCMLE (Sung and Geyer, 2007). These two methods seem to be the only approximate ML methods available in standard software for estimating crossed random effects models for binary data. The Laplace approximation estimates were obtained using the $$\texttt{lme4}$$ package in R (Bates and others, 2014) and the MCMLEs were obtained using the $$\texttt{bernor}$$ R package (Geyer and Sung, 2006) with 10 000 samples. 6.3.1. Results Table 4 presents estimated bias and mean squared error (MSE) for the point estimates, as well as ratios (SE/SD) of the average standard errors divided by the empirical standard deviations. For the regression parameters, the bias and MSE are quite similar between the methods except that MCMLE has somewhat larger bias than the other methods. For the standard deviation parameters, MCLL has the smallest bias and MSE among all methods. The SE/SD ratios do not differ much from one for any of three methods and are closest to one for MCLL, followed by Laplace and then MCMLE. The MCLL standard error estimates tend to be more conservative than those from the Laplace approximation. Table 4. Estimated bias and MSE of the MCLL, Laplace approximation, and MCMLE point estimates and ratios of mean SEs to empirical SDs for 100 simulated salamander datasets    Bias  MSE  SE/SD $$^{\text{a,b}}$$     Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  $$\beta_1$$  –0.03  –0.07  –0.03  0.12  0.12  0.12  0.86  0.89  0.93  $$\beta_2$$  0.04  0.12  0.01  0.22  0.20  0.21  0.95  1.12  1.07  $$\beta_3$$  0.06  0.08  0.05  0.17  0.17  0.16  0.90  1.08  0.97  $$\beta_4$$  –0.04  –0.14  –0.04  0.35  0.27  0.35  0.90  1.33  1.03  $$\sigma_m$$  –0.17  –0.16  –0.13  0.10  0.08  0.07  –  –     $$\sigma_f$$  –0.15  –0.20  –0.12  0.12  0.10  0.08  –  –  –     Bias  MSE  SE/SD $$^{\text{a,b}}$$     Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  $$\beta_1$$  –0.03  –0.07  –0.03  0.12  0.12  0.12  0.86  0.89  0.93  $$\beta_2$$  0.04  0.12  0.01  0.22  0.20  0.21  0.95  1.12  1.07  $$\beta_3$$  0.06  0.08  0.05  0.17  0.17  0.16  0.90  1.08  0.97  $$\beta_4$$  –0.04  –0.14  –0.04  0.35  0.27  0.35  0.90  1.33  1.03  $$\sigma_m$$  –0.17  –0.16  –0.13  0.10  0.08  0.07  –  –     $$\sigma_f$$  –0.15  –0.20  –0.12  0.12  0.10  0.08  –  –  –  $$^{\rm a}$$The Monte Carlo errors for the means of the standard error estimates are less than 10%. $$^{\rm b}$$SE/SD not provided for $$\sigma_f$$ and $$\sigma_m$$ because Wald-type tests and confidence intervals are inappropriate for these parameters (e.g., Stram and Lee, 1994). Table 4. Estimated bias and MSE of the MCLL, Laplace approximation, and MCMLE point estimates and ratios of mean SEs to empirical SDs for 100 simulated salamander datasets    Bias  MSE  SE/SD $$^{\text{a,b}}$$     Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  $$\beta_1$$  –0.03  –0.07  –0.03  0.12  0.12  0.12  0.86  0.89  0.93  $$\beta_2$$  0.04  0.12  0.01  0.22  0.20  0.21  0.95  1.12  1.07  $$\beta_3$$  0.06  0.08  0.05  0.17  0.17  0.16  0.90  1.08  0.97  $$\beta_4$$  –0.04  –0.14  –0.04  0.35  0.27  0.35  0.90  1.33  1.03  $$\sigma_m$$  –0.17  –0.16  –0.13  0.10  0.08  0.07  –  –     $$\sigma_f$$  –0.15  –0.20  –0.12  0.12  0.10  0.08  –  –  –     Bias  MSE  SE/SD $$^{\text{a,b}}$$     Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  Laplace  MCMLE  MCLL  $$\beta_1$$  –0.03  –0.07  –0.03  0.12  0.12  0.12  0.86  0.89  0.93  $$\beta_2$$  0.04  0.12  0.01  0.22  0.20  0.21  0.95  1.12  1.07  $$\beta_3$$  0.06  0.08  0.05  0.17  0.17  0.16  0.90  1.08  0.97  $$\beta_4$$  –0.04  –0.14  –0.04  0.35  0.27  0.35  0.90  1.33  1.03  $$\sigma_m$$  –0.17  –0.16  –0.13  0.10  0.08  0.07  –  –     $$\sigma_f$$  –0.15  –0.20  –0.12  0.12  0.10  0.08  –  –  –  $$^{\rm a}$$The Monte Carlo errors for the means of the standard error estimates are less than 10%. $$^{\rm b}$$SE/SD not provided for $$\sigma_f$$ and $$\sigma_m$$ because Wald-type tests and confidence intervals are inappropriate for these parameters (e.g., Stram and Lee, 1994). 7. Concluding remarks In this article, we proposed a novel approximate ML method for complex models where ML estimation is practically infeasible. Our MCLL algorithm requires MCMC samples of model parameters to approximate a likelihood function. A major attraction of MCLL is that it can be applied to ML inference of any complex models where ML estimation is difficult but Bayesian estimation is feasible. In this article, we proposed a novel approximate ML method referred to as MCLL which builds upon the MCKL algorithm that was previously developed for fitting dynamic population models. MCLL has several technical and practical advantages compared with MCKL: first, by utilizing local likelihood density estimation instead of kernel density methods, MCLL reduces mode bias without needing to apply additional bias correction methods. Second, our MCLL method provides approximate standard errors that are less sensitive to bandwidth choice and allows approximate computation of the Bayes factors. Finally, our MCLL algorithm is flexible and can be applied to a broader class of models. An accompanying R package is also provided which is freely available, easy to use, and supplies convenient default options that generally work well. MCLL leverages Bayesian methods to approximate MLEs. One may wonder why we want to obtain MLEs when Bayesian estimates are available. Although Bayesian methods are a pragmatic solution for complex inferential problems, several issues of Bayesian solutions have been discussed (e.g., Lele and others, 2007); first, Bayesian inferences depend on the choice of prior distributions, while there is a debate as to how to define a diffuse or an objective prior (e.g. Press, 2003). In addition, Lele and others (2007) argued that how to interpret the credible intervals with diffuse priors is a controversial issue (Barnett, 1999). Because of these unresolved issues, some researchers may prefer approximate MLEs. We showed that MCLL was less sensitive to the choice of prior distributions than Bayesian methods; we further demonstrated that MCLL generated solutions closer to the MLE even when poorly-chosen priors were specified. Furthermore, we showed in several settings that MCLL performed very favorably when compared with existing methods that are feasible for complex models such as the Laplace approximation. MCLL required only a few seconds after obtaining MCMC samples. All together, our study results confirm that the proposed MCLL algorithm can be an effective and practical computational option for estimating complex models when ML inference is desirable. Supplementary material Supplementary material is available at http://biostatistics.oxfordjournals.org. Conflict of Interest: None declared. References Barnett V. ( 1999). Comparative Statistical Inference . New York: Wiley. Google Scholar CrossRef Search ADS   Bates D., Maechler M., Bolker B. and Walker S. ( 2014). lme4: Linear Mixed-Effects Models using Eigen and S4 . R package version 1.1-6. http://CRAN.R-project.org/package=lme4. Best N. G., Ickstadt K., Wolpert R. L. and Briggs D. J. ( 2000). Combining models of health and exposure data: the SAVIAH study. In: Elliott P., Wakefield J. C., Best N. G. and Briggs D. J. (editors), Spatial Epidemiology: Methods and Applications . Oxford: Oxford University Press, pp. 393– 414. Google Scholar CrossRef Search ADS   Booth J. G. and Hobert J. P. ( 1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society Series B  61, 265– 285. Google Scholar CrossRef Search ADS   Byrd R. H., Lu P., Nocedal J. and Zhu C. ( 1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing  16, 1190– 1208. Google Scholar CrossRef Search ADS   De Valpine P. ( 2004). Monte Carlo state-space likelihoods by weighted posterior kernel density estimation. Journal of the American Statistical Association  99, 523– 535. Google Scholar CrossRef Search ADS   Duong T. and Hazelton M. L. ( 2003). Plug-in bandwidth matrices for bivariate kernel density estimation. Journal of Nonparametric Statistics  15, 17– 30. Google Scholar CrossRef Search ADS   Gelman A. and Rubin D. B. ( 1992). Inference from iterative simulation using multiple sequences. Statistical Science  7, 457– 472. Google Scholar CrossRef Search ADS   Geyer C. and Sung Y. ( 2006). bernor: Bernoulli Regression with Normal Random Effects . R package version 0.3-8. http://www.stat.umn.edu/geyer/bernor. Hall P., Müller H. and Wu P. ( 2006). Real-time density and mode estimation with application to time-dynamic mode tracking. Journal of Computational and Graphical Statistics  15, 82– 100. Google Scholar CrossRef Search ADS   Hjort N. L. and Jones M. C. ( 1996). Locally parametric nonparametric density estimation. The Annals of Statistics  24, 1619– 1647. Google Scholar CrossRef Search ADS   Jeon M., Kaufman C. and Rabe-Hesketh S. ( 2014). mcll: Monte Carlo Local Likelihood for Estimating Generalized Linear Mixed Models. R package version 1.2 . http://CRAN.R-project.org/package=mcll. Joe H. ( 2008). Accuracy of Laplace approximation for discrete response mixed models. Computational Statistics and Data Analysis  52, 5066– 5074. Google Scholar CrossRef Search ADS   Karim M. R. and Zeger S. L. ( 1992). Generalized linear models with random effects: salamander mating revisited. Biometrics  48, 631– 644. Google Scholar CrossRef Search ADS PubMed  Lele S. R., Dennis B. and Lutscher F. ( 2007). Data cloning: easy maximum likelihood estimation for complex ecological models using Bayesian Markov chain Monte Carlo methods. Ecology Letters  10, 551– 563. Google Scholar CrossRef Search ADS PubMed  Lele S. R., Nadeem K. and Schmuland B. ( 2010). Estimability and likelihood inference for generalized linear mixed models using data cloning. Journal of the American Statistical Association  105, 1617– 1625. Google Scholar CrossRef Search ADS   Loader C. ( 1999). Local Regression and Likelihood . New York: Springer. Loader C. R. ( 1996). Local likelihood density estimation. The Annals of Statistics  24, 1602– 1618. Google Scholar CrossRef Search ADS   Lunn D., Thomas A., Best N. and Spiegelhalter D. ( 2000). WinBUGS—a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing  10, 325– 337. Google Scholar CrossRef Search ADS   McCullagh P. and Nelder J. A. ( 1989). Generalized Linear Models . New York: Chapman and Hall. Google Scholar CrossRef Search ADS   McCulloch C. E. ( 1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association  92, 162– 170. Google Scholar CrossRef Search ADS   Otneim H., Karlsen H. A. and Tjøstheim D. ( 2013). Bias and bandwidth for local likelihood density estimation. Statistics and Probability Letters  83, 1382– 1387. Google Scholar CrossRef Search ADS   Pinheiro Jose, Bates Douglas, DebRoy Saikat, Sarkar Deepayan and R Development Core Team. ( 2012). nlme: Linear and Nonlinear Mixed Effects Models . R package version 3.1–104. Press S. J. ( 2003). Subjective and Objective Bayesian Statistics . New York: Wiley. Rabe-Hesketh S., Skrondal A. and Gjessing H. K. ( 2008). Biometrical modeling of twin and family data using standard mixed model software. Biometrics  64, 280– 288. Google Scholar CrossRef Search ADS PubMed  Rabe-Hesketh S., Skrondal A. and Pickles A. ( 2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics  128, 301– 323. Google Scholar CrossRef Search ADS   Raftery, Adrian E. ( 1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika  83, 251– 266. Google Scholar CrossRef Search ADS   Shealther S. ( 2004). Density estimation. Statistical Science  19, 588– 597. Google Scholar CrossRef Search ADS   Shephard N. and Pitt M. K. ( 1997). Likelihood analysis of non-Gaussian measurement time series. Biometrika  84, 653– 667. Google Scholar CrossRef Search ADS   Stram D. O. and Lee J. W. ( 1994). Variance components testing in the longitudinal mixed effects model. Biometrics  50, 1171– 1177. Google Scholar CrossRef Search ADS PubMed  Sung Y. J. and Geyer C. J. ( 2007). Monte Carlo likelihood inference for missing data models. The Annals of Statistics  35, 990– 1011. Google Scholar CrossRef Search ADS   Thomas A., Best N. G., Lunn D., Arnold R. and Spiegelhalter D. J. ( 2014). GeoBugs User Manual. Version 3.2.3 . London: Imperial College School of Medicine. Walker A. M. ( 1969). On the asymptotic behavior of posterior distributions. Journal of the Royal Statistical Society, Series B  31, 80– 88. Wand M. P. and Jones M. C. ( 1993). Comparison of smoothing parameterizations in bivariate kernel density estimation. Journal of the American Statistical Association  88, 520– 528. Google Scholar CrossRef Search ADS   Wolfinger R. ( 1993). Laplace’s approximation for nonlinear mixed models. Biometrika  80, 791– 795. Google Scholar CrossRef Search ADS   © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

### Journal

BiostatisticsOxford University Press

Published: Jan 4, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly ### Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. ### Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve ### Freelancer DeepDyve ### Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations