Add Journal to My Library
Biometrika
, Volume Advance Article (2) – Jan 20, 2018

15 pages

/lp/ou_press/on-the-asymptotic-efficiency-of-approximate-bayesian-computation-FRQadh1gRG

- Publisher
- Oxford University Press
- Copyright
- © 2018 Biometrika Trust
- ISSN
- 0006-3444
- eISSN
- 1464-3510
- D.O.I.
- 10.1093/biomet/asx078
- Publisher site
- See Article on Publisher Site

SUMMARY Many statistical applications involve models for which it is difficult to evaluate the likelihood, but from which it is relatively easy to sample. Approximate Bayesian computation is a likelihood-free method for implementing Bayesian inference in such cases. We present results on the asymptotic variance of estimators obtained using approximate Bayesian computation in a large data limit. Our key assumption is that the data are summarized by a fixed-dimensional summary statistic that obeys a central limit theorem. We prove asymptotic normality of the mean of the approximate Bayesian computation posterior. This result also shows that, in terms of asymptotic variance, we should use a summary statistic that is of the same dimension as the parameter vector, $$p$$, and that any summary statistic of higher dimension can be reduced, through a linear transformation, to dimension $$p$$ in a way that can only reduce the asymptotic variance of the posterior mean. We look at how the Monte Carlo error of an importance sampling algorithm that samples from the approximate Bayesian computation posterior affects the accuracy of estimators. We give conditions on the importance sampling proposal distribution such that the variance of the estimator will be of the same order as that of the maximum likelihood estimator based on the summary statistics used. This suggests an iterative importance sampling algorithm, which we evaluate empirically on a stochastic volatility model. 1. Introduction Many statistical applications involve inference about models that are easy to simulate from, but for which it is difficult, or impossible, to calculate likelihoods. In such situations it is possible to use the fact that we can simulate from the model to enable us to perform inference. There is a wide class of such likelihood-free methods of inference, including indirect inference (Gouriéroux & Ronchetti, 1993; Heggland & Frigessi, 2004), the bootstrap filter (Gordon et al., 1993), simulated methods of moments (Duffie & Singleton, 1993), and synthetic likelihood (Wood, 2010). We consider a Bayesian version of these methods, termed approximate Bayesian computation. This involves defining an approximation to the posterior distribution in such a way that it is possible to sample from this approximate posterior using only the ability to sample from the model. Arguably the first approximate Bayesian computation method was that of Pritchard et al. (1999), and these methods have been popular within population genetics (Beaumont et al., 2002), ecology (Beaumont, 2010) and systems biology (Toni et al., 2009). More recently, there have been applications to areas including stereology (Bortot et al., 2007), finance (Peters et al., 2011) and cosmology (Ishida et al., 2015). Let $$K({x})$$ be a density kernel, scaled, without loss of generality, so that $$\max_{{x}}K({x})=1$$. Further, let $$\varepsilon>0$$ be a bandwidth. Denote the data by $${Y}_{\rm obs}=(\,y_{\rm obs,1},\ldots,y_{{\rm obs},n})$$. Assume we have chosen a finite-dimensional summary statistic $${s}_{n}({Y})$$, and write $${s}_{\rm obs}={s}_{n}({Y}_{\rm obs})$$. If we model the data as a draw from a parametric density, $$f_{n}({y} \mid{\theta})$$, and assume a prior $$\pi({\theta})$$, then we define the approximate Bayesian computation posterior as \begin{equation} \pi_{\rm ABC}({\theta} \mid {s}_{\rm obs},\varepsilon)\propto\pi({\theta})\int f_{n}({s}_{\rm obs}+\varepsilon{v} \mid {\theta})K({v})\,{\rm d}{v},\label{eq:piABC} \end{equation} (1) where $$f_{n}({s} \mid {\theta})$$ is the density for the summary statistic implied by $$f_{n}({y} \mid {\theta})$$. Let $$f_{\rm ABC}({s}_{\rm obs}\mid {\theta},\varepsilon)=\int f_{n}({s}_{\rm obs}+\varepsilon{v} \mid {\theta})K({v})\,{\rm d}{v}$$. This framework encompasses most implementations of approximate Bayesian computation. In particular, the use of the uniform kernel corresponds to the popular rejection-based rule (Beaumont et al., 2002). The idea is that $$f_{\rm ABC}({s}_{\rm obs}\,{\mid}\, {\theta},\varepsilon)$$ is an approximation of the likelihood. The approximate Bayesian computation posterior, which is proportional to the prior multiplied by this likelihood approximation, is an approximation of the true posterior. The likelihood approximation can be interpreted as a measure of how close, on average, the summary $${s}_{n}$$ simulated from the model is to the summary for the observed data, $${s}_{\rm obs}$$. The choices of kernel and bandwidth determine the definition of closeness. By defining the approximate posterior in this way, we can simulate samples from it using standard Monte Carlo methods. One approach, which we will focus on later, uses importance sampling. Let $$K_{\varepsilon}({x})=K({x}/\varepsilon)$$. Given a proposal density $$q_{n}({\theta})$$, a bandwidth $$\varepsilon$$, and a Monte Carlo sample size $$N$$, an importance sampler would proceed as in Algorithm 1. The set of accepted parameters and their associated weights provides a Monte Carlo approximation to $$\pi_{\rm ABC}$$. If we set $$q_{n}({\theta})=\pi({\theta})$$ then this is just a rejection sampler. In practice sequential importance sampling methods are often used to learn a good proposal distribution (Beaumont et al., 2009). Algorithm 1. Importance and rejection sampling approximate Bayesian computation. Step 1. Simulate $${\theta}_{1},\ldots,{\theta}_{N}\sim q_{n}({\theta})$$. Step 2. For each $$i=1,\ldots,N$$, simulate $${Y}^{(i)}=\big\{\,y_{1}^{(i)},\ldots,y_{n}^{(i)}\big\}\sim f_{n}(\,y\mid {\theta}_{i})$$. Step 3. For each $$i=1,\ldots,N$$, accept $${\theta}_{i}$$ with probability $$K_{\varepsilon}\big\{{s}_{n}^{(i)}-{s}_{\rm obs}\big\}$$, where $${s}_{n}^{(i)}={s}_{n}\{{Y}^{(i)}\}$$, and define the associated weight as $$w_{i}=\pi({\theta}_{i})/q_{n}({\theta}_{i})$$. There are three choices in implementing approximate Bayesian computation: the choice of summary statistic, the choice of bandwidth, and the Monte Carlo algorithm. For importance sampling, the last of these involves specifying the Monte Carlo sample size, $$N$$, and the proposal density, $$q_{n}({\theta})$$. These, roughly, relate to three sources of approximation. To see this, note that as $$\varepsilon\rightarrow0$$ we would expect (1) to converge to the posterior given $${s}_{\rm obs}$$ (Fearnhead & Prangle, 2012). Thus the choice of summary statistic governs the approximation, or loss of information, between using the full posterior distribution and using the posterior given the summary. The value $$\varepsilon$$ then affects how close the approximate Bayesian computation posterior is to the posterior given the summary. Finally there is Monte Carlo error from approximating the approximate Bayesian computation posterior with a Monte Carlo sample. The Monte Carlo error is affected not only by the Monte Carlo algorithm, but also by the choices of summary statistic and bandwidth, which together affect the probability of acceptance in Step 3 of Algorithm 1. Having a higher-dimensional summary statistic, or a smaller value of $$\varepsilon$$, will tend to reduce this acceptance probability and hence increase the Monte Carlo error. This work studies the interaction between the three sources of error, when the summary statistics obey a central limit theorem for large $$n$$. We are interested in the efficiency of approximate Bayesian computation, where by efficiency we mean that an estimator obtained from running Algorithm 1 has the same rate of convergence as the maximum likelihood estimator for the parameter given the summary statistic. In particular, this work is motivated by the question of whether approximate Bayesian computation can be efficient as $$n\rightarrow\infty$$ if we have a fixed Monte Carlo sample size. Intuitively this appears unlikely. For efficiency we will need $$\varepsilon\rightarrow0$$ as $$n\rightarrow\infty$$, and this corresponds to an increasingly strict condition for acceptance. Thus we may imagine that the acceptance probability will necessarily tend to zero as $$n$$ increases, and we will need an increasing Monte Carlo sample size to compensate for this. However, our results show that Algorithm 1 can be efficient if we choose a proposal distribution with a suitable scale and location and appropriately heavy tails. If we use such a proposal distribution and have a summary statistic of the same dimension as the parameter vector, then the posterior mean of approximate Bayesian computation is asymptotically unbiased with a variance that is $$1+O(1/N)$$ times that of the estimator maximizing the likelihood of the summary statistic. This is similar to asymptotic results for indirect inference (Gouriéroux & Ronchetti, 1993; Heggland & Frigessi, 2004). Our results also lend theoretical support to methods that choose the bandwidth indirectly by specifying the proportion of samples that are accepted, as this leads to a bandwidth which is of the optimal order in $$n$$. We first prove a Bernstein–von Mises-type theorem for the posterior mean of approximate Bayesian computation. This is a nonstandard convergence result, as it is based on the partial information contained in the summary statistics. For related convergence results see Clarke & Ghosh (1995) and Yuan & Clarke (2004), though these do not consider the case where the dimension of the summary statistic is larger than that of the parameter. Dealing with this case introduces extra challenges. Our convergence result for the posterior mean of approximate Bayesian computation has practically important consequences. It shows that any $$d$$-dimensional summary with $$d>p$$ can be projected to a $$p$$-dimensional summary statistic without any loss of information. Furthermore it shows that using a summary statistic of dimension $$d>p$$ can lead to an increased bias, so the asymptotic variance can be reduced if the optimal $$p$$-dimensional projected summary is used instead. If a $$d$$-dimensional summary is used, with $$d>p$$, it suggests choosing the variance of the kernel to match the variance of the summary statistics. This paper adds to a growing literature on the theoretical properties of approximate Bayesian computation. Initial results focused on comparing the bias of approximate Bayesian computation to the Monte Carlo error, and studying how these depend on the choice of $$\varepsilon$$. The convergence rate of the bias is shown to be $$O(\varepsilon^{2})$$ in various settings (e.g., Barber et al., 2015). This can then be used to consider how the choice of $$\varepsilon$$ should depend on the Monte Carlo sample size so as to balance bias and Monte Carlo variability (Blum, 2010; Barber et al., 2015; Biau et al., 2015). There has also been work on consistency of approximate Bayesian computation estimators. Marin et al. (2014) consider consistency when performing model choice and Frazier et al. (2016) consider consistency for parameter estimation. The latter work, which appeared after the first version of this paper, includes a result on the asymptotic normality of the posterior mean that is similar to our Theorem 1, albeit under different conditions, and also gives results on the asymptotic form of the posterior obtained using approximate Bayesian computation. This shows that for many implementations of approximate Bayesian computation, the posterior will overestimate the uncertainty in the parameter estimate that it gives. Finally, a number of papers have looked at the choice of summary statistics (e.g., Wegmann et al., 2009; Blum, 2010; Prangle et al., 2014). Our Theorem 1 gives insight into this. As mentioned above, this theorem shows that, in terms of minimizing the asymptotic variance, we should use a summary statistic of the same dimension as the number of parameters. In particular it supports the suggestion in Fearnhead & Prangle (2012) of having one summary per parameter, with that summary approximating the maximum likelihood estimator for that parameter. 2. Notation and set-up Denote the data by $${Y}_{\rm obs}=(\,y_{{\rm obs},1},\ldots,$$$$y_{{\rm obs},n})$$, where $$n$$ is the sample size and each observation $$y_{{\rm obs},i}$$ can be of arbitrary dimension. We make no assumption directly on the data, but make assumptions on the distribution of the summary statistics. We consider the asymptotics as $$n\rightarrow\infty$$, and denote the density of $${Y}_{\rm obs}$$ by $$\,f_{n}({y} \mid {\theta})$$, where $${\theta}\in\mathcal{P}\subset\mathbb{R}^p$$. We let $${\theta}_{0}$$ denote the true parameter value and $$\pi({\theta})$$ its prior distribution. For a set $$A$$, let $$A^{\rm c}$$ be its complement with respect to the whole space. We assume that $${\theta}_{0}$$ is in the interior of the parameter space, and that the prior is differentiable in a neighbourhood of the true parameter: Condition 1. There exists some $$\delta_{0}>0$$ such that $$\mathcal{P}_{0}=\{{\theta}:|{\theta}-{\theta}_{0}|<\delta_{0}\}\subset\mathcal{P}$$, $$\pi({\theta})\in C^{1}(\mathcal{P}_{0})$$ and $$\pi({\theta}_{0})>0$$. To implement approximate Bayesian computation we will use a $$d$$-dimensional summary statistic, $${s}_{n}({Y})\in\mathbb{R}^{d}$$, such as a vector of sample means of appropriately chosen functions. We assume that $${s}_{n}({Y})$$ has a density function which depends on $$n$$, and we denote this by $$f_{n}({s} \mid {\theta})$$. We will use the shorthand $${S}_{n}$$ to denote the random variable with density $$f_{n}({s} \mid {\theta})$$. In approximate Bayesian computation we use a kernel, $$K({x})$$, with $$\max_{{x}}K({x})=1$$, and a bandwidth $$\varepsilon>0$$. As we vary $$n$$ we will often wish to vary $$\varepsilon$$, and in these situations we denote the bandwidth by $$\varepsilon_{n}$$. For Algorithm 1 we require a proposal distribution, $$q_{n}({\theta})$$, and allow this to depend on $$n$$. We assume the following conditions on the kernel, which are satisfied by all commonly-used kernels, Condition 2. The kernel satisfies (i) $$\int{v} K({v})\,{\rm d}{v}=0$$; (ii) $$\int\prod_{k=1}^{l}v_{i_{k}}K({v})\,{\rm d}{v}<\infty$$ for any coordinates $$(v_{i_{1}},\ldots,v_{i_{l}})$$ of $${v}$$ and $$l\leq p+6$$; (iii) $$K({v})\propto\bar{K}(\|{v}\|_{\Lambda}^{2})$$ where $$\|{v}\|_{\Lambda}^{2}={v}^{{\rm T}}\Lambda{v}$$ with $$\Lambda$$ a positive-definite matrix, and $$K({v})$$ is a decreasing function of $$\|{v}\|_{\Lambda}$$; (iv) $$K({v})=O\{\exp(-c_1 \|v\|_{\Gamma}^{-1})\}$$ for some $$\alpha_{1}>0$$ and $$c_{1}>0$$ as $$\|{v}\|_{\Lambda}\rightarrow\infty$$. For a real function $$g({x})$$ denote its $$k$$th partial derivative at $${x}={x}_{0}$$ by $$D_{x_{k}}g({x}_{0})$$, the gradient function by $$D_{{x}}g({x}_{0})$$ and the Hessian matrix by $$H_{{x}}g({x}_{0})$$. To simplify notation, $$D_{\theta_{k}}$$, $$D_{{\theta}}$$ and $$H_{{\theta}}$$ are written as $$D_{k}$$, $$D$$ and $$H$$, respectively. For a series $$x_{n}$$ we use the notation that for large enough $$n$$, $$x_{n}=\Theta(a_{n})$$ if there exist constants $$m$$ and $$M$$ such that $$0<m<|x_{n}/a_{n}|<M<\infty$$, and $$x_{n}=\Omega(a_{n})$$ if $$|x_{n}/a_{n}|\rightarrow\infty$$. For two square matrices $$A$$ and $$B$$, we say that $$A\leq B$$ if $$B-A$$ is semipositive definite, and $$A<B$$ if $$B-A$$ is positive definite. Our theory will focus on estimates of some function $${h}(\theta)$$ of $$\theta$$, which satisfies differentiability and moment conditions that will control the remainder terms in Taylor expansions. Condition 3. The $$k$$th coordinate of $${h}({\theta})$$, $$h_{k}({\theta})$$, satisfies (i) $$h_{k}({\theta})\in C^{1}(\mathcal{P}_{0})$$; (ii) $$D_{k}h({\theta}_{0})\neq0$$; and (iii) $$\int h_{k}({\theta})^{2}\pi({\theta})\,{\rm d}{\theta}<\infty$$. The asymptotic results presuppose a central limit theorem for the summary statistic. Condition 4. There exists a sequence $$a_{n}$$, with $$a_{n}\rightarrow\infty$$ as $$n\rightarrow\infty$$, a $$d$$-dimensional vector $${s}({\theta})$$ and a $$d\times d$$ matrix $$A({\theta})$$ such that for all $${\theta}\in\mathcal{P}_0$$, \[ a_{n}\{{S}_{n}-{s}({\theta})\}\rightarrow \mathcal{N}\{0,A({\theta})\}, \quad {n\rightarrow\infty}, \] with convergence in distribution. We also assume that $$s_{\rm obs}\rightarrow s(\theta_0)$$ in probability. Furthermore, assume that (i) $${s}({\theta})\in C^{1}(\mathcal{P}_{0})$$ and $$A({\theta})\in C^{1}(\mathcal{P}_{0})$$, and $$A({\theta})$$ is positive definite for $${\theta}\in\mathcal{P}_{0}$$; (ii) for any $$\delta>0$$ there exists a $$\delta'>0$$ such that $$\|s(\theta)-s(\theta_0)\|>\delta'$$ for all $$\theta$$ satisfying $$\|\theta-\theta_0\|>\delta$$; (iii) $$I({\theta})=D{s}({\theta})^{{\rm T}}A^{-1}({\theta})D{s}({\theta})$$ has full rank at $${\theta}={\theta}_{0}$$. Under Condition 4, $$a_{n}$$ is the rate of convergence in the central limit theorem. If the data are independent and identically distributed, and the summaries are sample means of functions of the data or of quantiles, then $$a_{n}=n^{1/2}$$. In most applications the data will be dependent, but if summaries are sample means (Wood, 2010), quantiles (Peters et al., 2011; Allingham et al., 2009; Blum & François, 2010) or linear combinations thereof (Fearnhead & Prangle, 2012), then a central limit theorem will often still hold, though $$a_n$$ may increase more slowly than $$n^{1/2}$$. Part (ii) of Condition 4 is required for the true parameter to be identifiable given only the summary of data. The asymptotic variance of the summary-based maximum likelihood estimator for $${\theta}$$ is $$I^{-1}({\theta}_{0})/a_{n}^{2}$$. Condition (iii) ensures that this variance is valid at the true parameter. We next require a condition that controls the difference between $$f_{n}({s}\mid {\theta})$$ and its limiting distribution for $${\theta}\in\mathcal{P}_{0}$$. Let $$\mathcal{N}({x};{\mu},\Sigma)$$ be the normal density at $${x}$$ with mean $${\mu}$$ and variance $$\Sigma$$. Define $$\tilde{f}_{n}({s} \mid {\theta})=\mathcal{N}\{{s};{s}({\theta}),A({\theta})/a_{n}^{2}\}$$ and the standardized random variable $$W_{n}({s})=a_{n}A({\theta})^{-1/2}\{{s}-{s}({\theta})\}$$. Let $$\tilde{f}_{W_{n}}({w} \mid {\theta})$$ and $$f_{W_{n}}({w} \mid {\theta})$$ be the densities of $$W_{n}({s})$$ when $${s}\sim\tilde{f}_{n}({s} \mid {\theta})$$ and $$f_{n}({s} \mid {\theta})$$ respectively. The condition below requires that the difference between $$f_{W_{n}}({w} \mid {\theta})$$ and its Edgeworth expansion $$\tilde{f}_{W_{n}}({w} \mid {\theta})$$ be of $$o(a_n^{-2/5})$$ and can be bounded by a density with exponentially decreasing tails. This is weaker than the standard requirement, $$o(a_n^{-1})$$, for the remainder in the Edgeworth expansion. Condition 5. There exists $$\alpha_{n}$$ satisfying $$\alpha_{n}/a_n^{2/5}\rightarrow\infty$$ and a density $$r_{\rm max}({w})$$ satisfying Condition 2(ii) and (iii) where $$K({v})$$ is replaced with $$r_{\rm max}({w})$$, such that $$\sup_{{\theta}\in\mathcal{P}_{0}}\alpha_{n}| f_{W_{n}}({w} \mid {\theta})-\tilde{f}_{W_{n}}({w} \mid {\theta})|\leq c_{3}r_{\rm max}({w})$$ for some positive constant $$c_{3}$$. The following condition further assumes that $$f_{n}(s \mid \theta)$$ has exponentially decreasing tails with rate uniform in the support of $$\pi(\theta)$$. Condition 6. The following statements hold: (i) $$r_{\rm max}(w)$$ satisfies Condition 2(iv); and (ii) $$\sup_{\theta\in\mathcal{P}_{0}^{\rm c}}f_{W_{n}}(w \mid \theta)=O\{\exp(-c_2 \|w\|^{\alpha_2})\}$$ as $$\|w\|\rightarrow\infty$$ for some positive constants $$c_{2}$$ and $$\alpha_{2}$$, and $$A(\theta)$$ is bounded in $$\mathcal{P}$$. 3. Posterior mean asymptotics We first ignore any Monte Carlo error, and focus on the ideal estimator of the true posterior mean from approximate Bayesian computation. This is the posterior mean, $${h}_{\rm ABC}$$, where \[ {h}_{\rm ABC}={E}_{\pi_{\rm ABC}}\{{h}({\theta})\mid {s}_{\rm obs}\}= \int {h}({\theta}) \pi_{\rm ABC}({\theta}\mid {s}_{\rm obs},\varepsilon_n)\text{.} \] This estimator depends on $$\varepsilon_n$$, but we suppress this from the notation. As an approximation to the true posterior mean $${E}\{{h}({\theta})\mid {Y}_{\rm obs}\}$$, $${h}_{\rm ABC}$$ contains errors from the choice of bandwidth $$\varepsilon_{n}$$ and summary statistic $${s}_{\rm obs}$$. To understand the effects of these two sources of error, we derive results for the asymptotic distributions of $${h}_{\rm ABC}$$ and the likelihood-based estimators, including the summary-based maximum likelihood estimator and the summary-based posterior mean, where we consider randomness solely due to the randomness of the data. Let $$T_{{\rm obs}}=a_{n}A(\theta_{0})^{-1/2}\{s_{\rm obs}-s(\theta_{0})\}$$. Theorem 1. Assume Conditions 1–6. (i) Let $$\hat{{{\theta}}}_{_{{ \rm MLES}}}=\mathop{\arg\max}\limits_{\theta\in\mathcal{P}}\log f_{n}({s}_{\rm obs}\mid {\theta})$$. For $${h}_{{s}}={h}(\hat{{{\theta}}}_{_{{ \rm MLES}}})$$ or $${E}\{{h}({\theta})\mid {s}_{\rm obs}\}$$, \[a_{n}\{{h}_{{s}}-{h}({\theta}_{0})\}\rightarrow \mathcal{N}\{0,D{h}({\theta}_{0})^{\mathrm{T}}I^{-1}({\theta}_{0})D{h}({\theta}_{0})\},\quad n\rightarrow\infty,\] with convergence in distribution. (ii) Define $$c_{\infty}= \lim_{n\rightarrow\infty}a_{n}\varepsilon_{n}$$. Let $$Z$$ be the weak limit of $$T_{\rm obs}$$, which has a standard normal distribution, and let $$R(c_{\infty},Z)$$ be a random vector with mean zero that is defined in the Supplementary Material. If $$\varepsilon_{n}=o(a_{n}^{-3/5})$$, then \[ a_{n}\{h_{{\rm ABC}}-h(\theta_{0})\}\rightarrow Dh(\theta_{0})^{\mathrm{T}}\{I(\theta_{0})^{-1/2}Z+R(c_\infty,Z)\}, \quad n\rightarrow\infty, \] with convergence in distribution. If either $$\varepsilon_{n}=o(a_{n}^{-1})$$, $$d=p$$, or the covariance matrix of $$K(v)$$ is proportional to $$A(\theta_{0})$$, then $$R(c_{\infty},Z)=0$$. For other cases, the variance of $$I(\theta_{0})^{-1/2}Z+R(c_{\infty},Z)$$ is no less than $$I^{-1}(\theta_{0})$$. Theorem 1 (i) shows the validity of posterior inference based on the summary statistics. Regardless of the sufficiency and dimension of $${s}_{\rm obs}$$, the posterior mean based on the summary statistics is consistent and asymptotically normal with the same variance as the summary-based maximum likelihood estimator. Denote the bias of approximate Bayesian computation, $${h}_{\rm ABC}-{E}\{{h}({\theta})\mid {s}_{\rm obs}\}$$, by $$\mbox{bias}_{\rm ABC}$$. The choice of bandwidth affects the size of the bias. Theorem 1(ii) indicates two regimes for the bandwidth for which the posterior mean of approximate Bayesian computation has good properties. The first case is when $$\varepsilon_{n}$$ is $$o(1/a_{n})$$. For this regime the posterior mean of approximate Bayesian computation always has the same asymptotic distribution as that of the true posterior given the summaries. The other case is when $$\varepsilon_{n}$$ is $$o(a_n^{-3/5})$$ but not $$o(n^{-1})$$. We obtain the same asymptotic distribution if either $$d=p$$ or we choose the kernel variance to be proportional to the variance of the summary statistics. In general for this regime of $$\varepsilon_n$$, $${h}_{\rm ABC}$$ will be less efficient than the summary-based maximum likelihood estimator. When $$d>p$$, Theorem 1(ii) shows that $$\mbox{bias}_{\rm ABC}$$ is nonnegligible and can increase the asymptotic variance. This is because the leading term of $$\mbox{bias}_{\rm ABC}$$ is proportional to the average of $${v}={s}-{s}_{\rm obs}$$, the difference between the simulated and observed summary statistics. If $$d>p$$, the marginal density of $${v}$$ is generally asymmetric, and thus is no longer guaranteed to have a mean of zero. One way to ensure that there is no increase in the asymptotic variance is to choose the variance of the kernel to be proportional to the variance of the summary statistics. The loss of efficiency we observe in Theorem 1(ii) for $$d>p$$ gives an advantage for choosing a summary statistic with $$d=p$$. The following proposition shows that for any summary statistic of dimension $$d>p$$ we can find a new $$p$$-dimensional summary statistic without any loss of information. The proof of the proposition is trivial and hence omitted. Proposition 1. Assume the conditions of Theorem 1. If $$d>p$$, define $$C=D{s}({\theta}_{0})^{\mathrm{T}}A({\theta}_{0})^{-1}$$. The $$p$$-dimensional summary statistic $$C{S}_{n}$$ has the same information matrix, $$I({\theta})$$, as $${S}_{n}$$. Therefore the asymptotic variance of $${h}_{\rm ABC}$$ based on $$C{s}_{\rm obs}$$ is smaller than or equal to that based on $${s}_{\rm obs}$$. Theorem 1 leads to the following natural definition. Definition 1. Assume that the conditions of Theorem 1 hold. Then the asymptotic variance of $${h}_{\rm ABC}$$ is \[ {\small{\text{AV}}}_{{h}_{\rm ABC}}=\frac{1}{a_{n}^{2}}D{h}({\theta}_{0})^{\mathrm{T}}I_{\rm ABC}^{-1}({\theta}_{0})D{h}({\theta}_{0})\text{.} \] 4. Asymptotic properties of rejection and the importance sampling algorithm 4.1. Asymptotic Monte Carlo error We now consider the Monte Carlo error involved in estimating $${h}_{\rm ABC}$$. Here we fix the data and consider solely the stochasticity of the Monte Carlo algorithm. We focus on Algorithm 1. Remember that $$N$$ is the Monte Carlo sample size. For $$i=1,\ldots,N$$, $${\theta}_{i}$$ is the proposed parameter value and $$w_{i}$$ is its importance sampling weight. Let $$\phi_{i}$$ be the indicator that is 1 if and only if $$\theta_{i}$$ is accepted in Step 3 of Algorithm 1 and let $$N_{\rm acc}=\sum_{i=1}^{N}\phi_{i}$$ be the number of accepted parameters. Provided $$N_{\rm acc}\geq1$$, we can estimate $${h}_{\rm ABC}$$ from the output of Algorithm 1 with \[ {\hat{h}}=\sum_{i=1}^{N}{h}({\theta}_{i})w_{i}\phi_{i}\Big/\sum_{i=1}^{N}w_{i}\phi_{i}\text{.} \] Define the acceptance probability \[ p_{{\rm acc},q}=\int q({\theta})\int f_{n}({s}\mid {\theta})K_{\varepsilon}({s}-{s}_{\rm obs})\,\mbox{d}{s}\,\mbox{d}{\theta} \] and the density of the accepted parameter \[ q_{\rm ABC}({\theta} \mid {s}_{\rm obs},\varepsilon)=\frac{q_{n}({\theta})f_{\rm ABC}({s}_{\rm obs}\mid {\theta},\varepsilon)}{\int q_{n}({\theta})f_{\rm ABC}({s}_{\rm obs}\mid {\theta},\varepsilon)\,{\rm d}\theta}\text{.} \] Finally, define \begin{align*} &\Sigma_{{\rm IS},n}=E_{\pi_{\rm ABC}}\left\{({h}({\theta})-{h}_{\rm ABC})^{2}\frac{\pi_{\rm ABC}({\theta} \mid {s}_{\rm obs},\varepsilon_{n})}{q_{\rm ABC}({\theta} \mid {s}_{\rm obs},\varepsilon_{n})}\right\}\!,\quad \Sigma_{{\rm ABC},n}=p_{{\rm acc},q_{n}}^{-1}\Sigma_{{\rm IS},n},\label{ISABC_var} \end{align*} where $$\Sigma_{{\rm IS},n}$$ is the importance sampling variance with $$\pi_{\rm ABC}$$ as the target density and $$q_{\rm ABC}$$ as the proposal density. Note that $$p_{{\rm acc},q_{n}}$$ and $$\Sigma_{{\rm IS},n}$$, and hence $$\Sigma_{{\rm ABC},n}$$, depend on $${s}_{\rm obs}$$. Standard results give the following asymptotic distribution of $${\hat{h}}$$. Proposition 2. For a given $$n$$ and $${s}_{\rm obs}$$, if $${h}_{\rm ABC}$$ and $$\Sigma_{{\rm ABC},n}$$ are finite, then \[ {N}^{1/2}({\hat{h}}-{h}_{\rm ABC})\rightarrow \mathcal{N}(0,\Sigma_{{\rm ABC},n}), \] in distribution as $$N\rightarrow\infty$$. This proposition motivates the following definition. Definition 2. For a given $$n$$ and $${s}_{\rm obs}$$, assume that the conditions of Proposition 2 hold. Then the asymptotic Monte Carlo variance of $${\hat{h}}$$ is $${\small{\text{MCV}}}_{{\hat{h}}}={N}^{-1}\Sigma_{{\rm ABC},n}$$. 4.2. Asymptotic efficiency We have defined the asymptotic variance as $$n\rightarrow\infty$$ of $${h}_{\rm ABC}$$, and the asymptotic Monte Carlo variance as $$N\rightarrow\infty$$ of $${\hat{h}}$$. The error of $${h}_{\rm ABC}$$ when estimating $${h}({\theta}_{0})$$ and the Monte Carlo error of $${\hat{h}}$$ when estimating $${h}_{\rm ABC}$$ are independent, which suggests the following definition. Definition 3. Assume the conditions of Theorem 1, and that $${h}_{\rm ABC}$$ and $$\Sigma_{{\rm ABC},n}$$ are bounded in probability for any $$n$$. Then the asymptotic variance of $${\hat{h}}$$ is \[ {\small{\text{AV}}}_{{\hat{h}}}=\frac{1}{a_{n}^{2}}{h}({\theta}_{0})^{\mathrm{T}}I_{\rm ABC}^{-1}({\theta}_{0})D{h}({\theta}_{0})+\frac{1}{N}\Sigma_{{\rm ABC},n}\text{.} \] We can interpret the asymptotic variance of $${\hat{h}}$$ as a first-order approximation to the variance of our Monte Carlo estimator for both large $$n$$ and $$N$$. We wish to investigate the properties of this asymptotic variance, for large but fixed $$N$$, as $$n\rightarrow\infty$$. The asymptotic variance itself depends on $$n$$, and we would hope it would tend to zero as $$n$$ increases. Thus we will study the ratio of $${\small{\text{AV}}}_{{\hat{h}}}$$ to $${\small{\text{AV}}}_{{ \rm MLES}}$$, where, by Theorem 1, the latter is $$a_{n}^{-2}{h}({\theta}_{0})^{\mathrm{T}}I^{-1}({\theta}_{0})D{h}({\theta}_{0})$$. This ratio measures the efficiency of our Monte Carlo estimator relative to the maximum likelihood estimator based on the summaries; it quantifies the loss of efficiency from using a nonzero bandwidth and a finite Monte Carlo sample size. We will consider how this ratio depends on the choice of $$\varepsilon_{n}$$ and $$q_{n}({\theta})$$. Thus we introduce the following definition. Definition 4. For a choice of $$\varepsilon_{n}$$ and $$q_{n}({\theta})$$, we define the asymptotic efficiency of $${\hat{h}}$$ as \[ {\small{\text{AE}}}_{{\hat{h}}}=\lim_{n\rightarrow\infty}\frac{{\small{\text{AV}}}_{{ \rm MLES}}}{{\small{\text{AV}}}_{{\hat{h}}}}\text{.} \] If this limiting value is zero, we say that $${\hat{h}}$$ is asymptotically inefficient. We will investigate the asymptotic efficiency of $${\hat{h}}$$ under the assumption of Theorem 1 that $$\varepsilon_{n}=o(a_{n}^{-3/5})$$. We shall see that the convergence rate of the importance sampling variance $$\Sigma_{{\rm IS},n}$$ depends on how large $$\varepsilon_{n}$$ is relative to $$a_n$$, and so we further define $$a_{n,\varepsilon}=a_n$$ if $$\lim_{n\rightarrow\infty} a_n\varepsilon_n<\infty$$ and $$a_{n,\varepsilon}=\varepsilon_n^{-1}$$ otherwise. If our proposal distribution in Algorithm 1 is either the prior or the posterior, then the estimator is asymptotically inefficient. Theorem 2. Assume the conditions of Theorem 1. (i) If $$q_{n}({\theta})=\pi({\theta})$$, then $$p_{{\rm acc},q_{n}}=\Theta_{\rm p}(\varepsilon_{n}^{d}a_{n,\varepsilon}^{d-p})$$ and $$\Sigma_{{\rm IS},n}=\Theta_{\rm p}(a_{n,\varepsilon}^{-2})$$. (ii) If $$q_{n}({\theta})=\pi_{\rm ABC}({\theta}\mid {s}_{\rm obs},\varepsilon_{n})$$, then $$p_{{\rm acc},q_{n}}=\Theta_{\rm p}(\varepsilon_{n}^{d}a_{n,\varepsilon}^{d})$$ and $$\Sigma_{{\rm IS},n}=\Theta_{\rm p}(a_{n,\varepsilon}^{p})$$. In both cases $${\hat{h}}$$ is asymptotically inefficient. The result in part (ii) shows a difference from standard importance sampling settings, where using the target distribution as the proposal leads to an estimator with no Monte Carlo error. The estimator $${\hat{h}}$$ is asymptotically inefficient because the Monte Carlo variance decays more slowly than $$1/a_{n}^{2}$$ as $$n\rightarrow\infty$$. However, this is caused by different factors in each case. To see this, consider the acceptance probability of a value of $$\theta$$ and the corresponding summary $${s}_{n}$$ simulated in one iteration of Algorithm 1. This acceptance probability depends on \begin{equation} \frac{{s}_{n}-{s}_{\rm obs}}{\varepsilon_{n}}=\frac{1}{\varepsilon_{n}}\left[\{{s}_{n}-{s}({\theta})\}+\{{s}({\theta})-{s}({\theta}_{0})\}+\{{s}({\theta}_{0})-{s}_{\rm obs}\}\right],\label{eq:accp} \end{equation} (2) where $${s}({\theta})$$, defined in Condition 4, is the limiting value of $${s}_{n}$$ as $$n\rightarrow\infty$$ if data are sampled from the model for parameter value $$\theta$$. By Condition 4, the first and third bracketed terms within the square brackets on the right-hand side are $$O_{\rm p}(a_{n}^{-1})$$. If we sample $$\theta$$ from the prior, the middle term is $$O_{\rm p}(1)$$, and thus (2) will blow up as $$\varepsilon_{n}$$ goes to zero. Hence $$p_{{\rm acc},\pi}$$ goes to zero as $$\varepsilon_{n}$$ goes to zero, which causes the estimate to be inefficient. If we sample from the posterior, then by Theorem 1 we expect the middle term to also be $$O_{\rm p}(a_{n}^{-1})$$. Hence (2) is well behaved as $$n\rightarrow\infty$$, and $$p_{{\rm acc},\pi}$$ is bounded away from zero, provided either $$\varepsilon_{n}=\Theta(a_{n}^{-1})$$ or $$\varepsilon_{n}=\Omega(a_{n}^{-1})$$. However, if we use $$\pi_{\rm ABC}({\theta}\mid {s}_{\rm obs},\varepsilon_{n})$$ as a proposal distribution, the estimates are still inefficient due to an increasing variance of the importance weights: as $$n$$ increases the proposal distribution is more and more concentrated around $${\theta}_{0}$$, while $$\pi$$ does not change. 4.3. Efficient proposal distributions Consider proposing the parameter value from a location-scale family, i.e., our proposal is of the form $$\sigma_{n}\Sigma^{1/2}{X}+\mu_{n}$$, where $${X}\sim q(\cdot)$$, $$E({X})=0$$ and $$\mbox{var}({X})=I_{p}$$. This defines a general form of proposal density, where the centre $$\mu_{n}$$, the scale rate $$\sigma_{n}$$, the scale matrix $$\Sigma$$ and the base density $$q(\cdot)$$ all need to be specified. We will give conditions under which such a proposal density results in estimators that are efficient. Our results are based on an expansion of $$\pi_{\rm ABC}({\theta}\mid {s}_{\rm obs},\varepsilon_{n})$$. Consider the rescaled random variables $${t}=a_{n,\varepsilon}({\theta}-{\theta}_{0})$$ and $${v}=\varepsilon_{n}^{-1}({s}-{s}_{\rm obs})$$. Recall that $${T}_{\rm obs}=a_{n}A({\theta}_{0})^{-1/2}\{{s}_{\rm obs}-{s}({\theta}_{0})\}$$. Define an unnormalized joint density of $${t}$$ and $${v}$$ as \[ g_{n}({t},{v};\tau)=\begin{cases} \begin{array}{@{}ll} \mathcal{N}\big[\{D{s}({\theta}_{0})+\tau\}{t};a_{n}\varepsilon_{n}{v}+A({\theta}_{0})^{1/2}{T}_{\rm obs},A({\theta}_{0})\big]K({v}), & a_{n}\varepsilon_{n}\rightarrow c<\infty,\\[5pt] \mathcal{N}\left[\{D{s}({\theta}_{0})+\tau\}{t};{v}+\frac{1}{a_{n}\varepsilon_{n}}A({\theta}_{0})^{1/2}{T}_{\rm obs},\frac{1}{a_{n}^{2}\varepsilon_{n}^{2}}A({\theta}_{0})\right]K({v}), & a_{n}\varepsilon_{n}\rightarrow\infty, \end{array}\end{cases} \] and further define $$g_{n}({t};\tau)=\int g_{n}({t},{v};\tau)\,{\rm d}{v}$$. For large $$n$$, and for the rescaled variable $${t}$$, the leading term of $$\pi_{\rm ABC}$$ is then proportional to $$g_{n}({t};0)$$. For both limits of $$a_{n}\varepsilon_{n}$$, $$g_{n}({t};\tau)$$ is a continuous mixture of normal densities with the kernel density determining the mixture weights. Our main theorem requires conditions on the proposal density. First, it requires that $$\sigma_{n}=a_{n,\varepsilon}^{-1}$$ and that $$c_{\mu}=\sigma_{n}^{-1}({\mu}_{n}-{\theta}_{0})$$ is $$O_{\rm p}(1)$$. This ensures that under the scaling of $$t$$, as $$n\rightarrow\infty$$, the proposal is not increasingly overdispersed compared to the target density, and the acceptance probability can be bounded away from zero. Second, the proposal distribution is required to be sufficiently heavy-tailed. Condition 7. There exist positive constants $$m_{1}$$ and $$m_{2}$$ satisfying $$m_{1}^{2}I_{p}<Ds(\theta_{0})^{\mathrm{T}}Ds(\theta_{0})$$ and $$m_{2}I_{d}<A(\theta_{0})$$, $$\alpha\in(0,1)$$, $$\gamma\in(0,1)$$ and $$c\in(0,\infty)$$ such that for any $$\lambda>0$$, \begin{align*} &\sup_{t\in\mathbb{R}^{p}}\frac{\mathcal{N}(t;0,m_{1}^{-2}m_{2}^{-2}\gamma^{-1})}{q\{\Sigma^{-1/2}(t{-}c)\}}<\infty, \quad \sup_{t\in\mathbb{R}^{p}}\frac{\bar{K}^{\alpha}(\|\lambda t\|^2)}{q\{\Sigma^{-1/2}(t{-}c)\}}<\infty,\quad \sup_{t\in\mathbb{R}^{p}}\frac{\bar{r}_{\rm max}(\|m_{1}m_{2}{\gamma}^{1/2}t\|^2)}{q\{\Sigma^{-1/2}(t{-}c)\}}<\infty,& \end{align*} where $$\bar{r}_{\rm max}(\cdot)$$ satisfies $$r_{\rm max}(v)=\bar{r}_{\rm max}(\|v\|_{\Lambda}^2)$$, and for any random series $$c_{n}$$ in $$\mathbb{R}^{p}$$ satisfying $$c_{n}=O_{\rm p}(1)$$, \[ \sup_{t\in\mathbb{R}^{p}}\frac{q(t)}{q(t+c_{n})}=O_{\rm p}(1)\text{.} \] If we choose $$\varepsilon_{n}=\Theta(a_{n}^{-1})$$, the Monte Carlo importance sampling variance for the accepted parameter values is $$\Theta(a_{n}^{-2})$$ and has the same order as the variance of the summary-based maximum likelihood estimator. Theorem 3. Assume the conditions of Theorem 1. If the proposal density $$q_{n}({\theta})$$ is \[ \beta\pi({\theta})+(1-\beta)\frac{1}{\sigma_{n}^{p}|\Sigma|^{1/2}}q\left\{\sigma_{n}^{-1}\Sigma^{-1/2}({\theta}-{\mu}_{n})\right\}\!, \] where $$\beta\in(0,1)$$, $$q(\cdot)$$ and $$\Sigma$$ satisfy Condition 7, $$\sigma_{n}=a_{n,\varepsilon}^{-1}$$ and $$c_{\mu}$$ is $$O_{\rm p}(1)$$, then $$p_{{\rm acc},q_{n}}=\Theta_{\rm p}(\varepsilon_{n}^{d}a_{n,\varepsilon}^{d})$$ and $$\Sigma_{{\rm IS},n}=O_{\rm p}(a_{n,\varepsilon}^{-2})$$. Then if $$\varepsilon_{n}=\Theta(a_{n}^{-1})$$, $${\small{\text{AE}}}_{{\hat{h}}}=\Theta_{\rm p}(1)$$. Furthermore, if $$d=p$$, $${\small{\text{AE}}}_{{\hat{h}}}=1-K/(N+K)$$ for some constant $$K$$. The mixture with $$\pi({\theta})$$ here is to control the importance weight in the tail area (Hesterberg, 1995). It is not clear whether this is needed in practice, or is just a consequence of the approach taken in the proof. Theorem 3 shows that with a good proposal distribution, if the acceptance probability is bounded away from zero as $$n$$ increases, the threshold $$\varepsilon_{n}$$ will have the preferred rate $$\Theta(a_{n}^{-1})$$. This supports using the acceptance rate to choose the threshold based on aiming for an appropriate proportion of acceptances (Del Moral et al., 2012; Biau et al., 2015). In practice, $$\sigma_{n}$$ and $$\mu_{n}$$ need to be adaptive to the observations since they depend on $$n$$. For $$q(\cdot)$$ and $$\Sigma$$, the following proposition gives a practical suggestion that satisfies Condition 7. Let $$T(\cdot;\gamma)$$ be the multivariate $$t$$ density with degree of freedom $$\gamma$$. The following result says that it is theoretically valid to choose any $$\Sigma$$ if a $$t$$ distribution is chosen as the base density. Proposition 3. Condition 7 is satisfied for $$q(\theta)=T({\theta};\gamma)$$ with any $$\gamma>0$$ and any $$\Sigma$$. Proof. The first part of Condition 7 follows as the $$t$$-density is heavy tailed relative to the normal density, $$\bar{K}(\cdot)$$ and $$\bar{r}_{\rm max}(\cdot)$$. The second part can be verified easily. □ 4.4. Iterative importance sampling Taken together, Theorem 3 and Proposition 3 suggest proposing from the mixture of $$\pi({\theta})$$ and a $$t$$ distribution with the scale matrix and centre approximating those of $$\pi_{\rm ABC}({\theta})$$. We suggest the following iterative procedure, similar in spirit to that of Beaumont et al. (2009). Algorithm 2. Iterative importance sampling approximate Bayesian computation. Input a mixture weight $$\beta$$, a sequence of acceptance rates $$\{p_{k}\}$$, and a location-scale family. Set $$q_1({\theta})=\pi({\theta})$$. For $$k=1,\ldots,K$$: Step 1. Run Algorithm 1 with simulation size $$N_{0}$$, proposal density $$\beta\pi({\theta})+(1-\beta)q_{k}({\theta})$$ and acceptance rate $$p_{k}$$, and record the bandwidth $$\varepsilon_{k}$$. Step 2. If $$\varepsilon_{k-1}-\varepsilon_{k}$$ is smaller than some positive threshold, stop. Otherwise, let $$\mu_{k+1}$$ and $$\Sigma_{k+1}$$ be the empirical mean and variance matrix of the weighted sample from Step 1, and let $$q_{k+1}({\theta})$$ be the density with centre $$\mu_{k+1}$$ and variance matrix $$2\Sigma_{k+1}$$. Step 3. If $$q_{k}({\theta})$$ is close to $$q_{k+1}({\theta})$$ or $$K=K_{\rm max}$$, stop. Otherwise, return to Step $$1$$. After the iteration stops at the $$K$$th step, run Algorithm 1 with the proposal density $$\beta\pi({\theta})+(1-\beta)q_{K+1}({\theta})$$, $$N-KN_{0}$$ simulations and $$p_{K+1}$$. In this algorithm, $$N$$ is the number of simulations allowed by the computing budget, $$N_{0}<N$$ and $$\{p_{k}\}$$ is a sequence of acceptance rates, which we use to choose the bandwidth. The maximum value $$K_{\rm max}$$ of $$K$$ is set such that $$K_{\rm max}N_{0}=N/2$$. The rule for choosing the new proposal distribution is based on approximating the mean and variance of the density proportional to $$\pi(\theta)f_{\rm ABC}({s}_{\rm obs}\mid \theta,\varepsilon)^{1/2}$$, which is optimal (Fearnhead & Prangle, 2012). It can be shown that these two moments are approximately equal to the mean and twice the variance of $$\pi_{\rm ABC}({\theta})$$ respectively. For the mixture weight, $$\beta$$, we suggest a small value, and use $$0.05$$ in the simulation study below. 5. Numerical examples 5.1. Gaussian likelihood with sample quantiles This example illustrates the results in § 3 with an analytically tractable problem. Assume that the observations $${Y}_{\rm obs}=(\,y_{1},\ldots,y_{n})$$ follow the univariate normal distribution $$\mathcal{N}(\mu,\sigma)$$ with true parameter values $$(1,2^{1/2})$$. Consider estimating the unknown parameter $$(\mu,\sigma)$$ with the uniform prior in the region $$[-10,10]\times[0,10]$$ using Algorithm 1. The summary statistic is $$\{\exp(\hat{q}_{\alpha_1}/2),\ldots,\exp(\hat{q}_{\alpha_d}/2)\}$$ where $${\hat{q}}_{\alpha}$$ is the sample quantile of $${Y}_{\rm obs}$$ for probability $$\alpha$$. We present results for $$n=10^{5}$$. Smaller sizes from $$10^{2}$$ to $$10^{4}$$ show similar patterns. The probabilities $$\alpha_{1},\ldots,\alpha_{d}$$ for calculating quantiles are selected with equal intervals in $$(0,1)$$, and $$d=2,9$$ and $$19$$ were tested. In order to investigate the Monte Carlo error-free performance, $$N$$ is chosen to be large enough that the Monte Carlo errors were negligible. We compare the performances of the approximate Bayesian computation estimator $${\hat{\theta}}$$, the maximum likelihood estimator based on the summary statistics and the maximum likelihood estimator based on the full dataset. Since the dimension reduction matrix $$C$$ in Proposition 1 can be obtained analytically, the performance of $${\hat{\theta}}$$ using the original $$d$$-dimensional summary is compared with that using the two-dimensional summary. The results of mean square error are presented in Fig. 1. Fig. 1. View largeDownload slide Mean square errors, MSE, of point estimates for $$200$$ datasets. Point estimates compared include $$\theta_{\rm ABC}$$ using the original summary statistic (solid) and the transformed summary statistic (dashed), the dimension of which is reduced to $$2$$ according to Proposition 1, and the maximum likelihood estimates based on the original summary statistic (dotted) and the full dataset (dash-dotted). Fig. 1. View largeDownload slide Mean square errors, MSE, of point estimates for $$200$$ datasets. Point estimates compared include $$\theta_{\rm ABC}$$ using the original summary statistic (solid) and the transformed summary statistic (dashed), the dimension of which is reduced to $$2$$ according to Proposition 1, and the maximum likelihood estimates based on the original summary statistic (dotted) and the full dataset (dash-dotted). The phenomena implied by Theorem 1 and Proposition 1 can be seen in this example, together with the limitations of these results. First, $${E}\{{h}({\theta})\mid {s}_{\rm obs}\}$$, equivalent to $${\hat{\theta}}$$ with small enough $$\varepsilon$$, and the maximum likelihood estimator based on the same summaries have similar accuracy. Second, when $$\varepsilon$$ is small, the mean square error of $${\hat{\theta}}$$ equals that of the maximum likelihood estimator based on the summary. When $$\varepsilon$$ becomes larger, for $$d>2$$ the mean square error increases more quickly than for $$d=2$$. This corresponds to the impact of the additional bias when $$d>p$$. For all cases, the two-dimensional summary obtained by projecting the original $$d$$ summaries is, for small $$\varepsilon$$, as accurate as the maximum likelihood estimator given the original $$d$$ summaries. This indicates that the lower-dimensional summary contains the same information as the original one. For larger $$\varepsilon$$, the performance of the reduced-dimension summaries is not stable, and is in fact worse than the original summaries for estimating $$\mu$$. This deterioration is caused by the bias of $${\hat{\theta}}$$, which for larger $$\varepsilon$$ is dominated by higher-order terms in $$\varepsilon$$ which could be ignored in our asymptotic results. 5.2. Stochastic volatility with ar$$(1)$$ dynamics We consider a stochastic volatility model from Sandmann & Koopman (1998) for the de-meaned returns of a portfolio. Denote this return for the $$t$$th time period by $$y_t$$. Then \[ x_{t} =\phi x_{t-1}+\eta_{t},\ \eta_{t}\sim \mathcal{N}(0,\sigma_{\eta}^{2}); \quad y_{t} =\bar{\sigma}\exp(x_t/2)\xi_{n},\ \xi_{t}\sim \mathcal{N}(0,1), \] where $$\eta_{t}$$ and $$\xi_{t}$$ are independent, and $$x_t$$ is a latent state that quantifies the level of volatility for time period $$t$$. By the transformation $$y_{t}^{*}=\log y_{t}^{2}$$ and $$\xi_{t}^{*}=\log\xi_{t}^{2}$$, the observation equation in the state-space model can be transformed to \begin{equation*} y_{n}^{*} =2\log\bar{\sigma}+x_{n}+\xi_{n}^{*},\quad \exp(\xi_{n}^{*})\sim\chi_{1}^{2}, \end{equation*} which is linear and non-Gaussian. Approximate Bayesian computation can be used to obtain an off-line estimator for the unknown parameters of this model. Here we illustrate the effectiveness of iteratively choosing the importance proposal for large $$n$$ by comparing with rejection sampling. In the iterative algorithm, a $$t$$ distribution with five degrees of freedom is used to construct $$q_{k}$$. Consider estimating the parameter $$(\phi,\sigma_{\eta},\log\bar{\sigma})$$ under a uniform prior in the region $$[0,1)\times[0.1,3]\times[-10,-1]$$. The setting with the true parameter $$(\phi,\sigma_{\eta},\log\bar{\sigma})=(0.9,0.675,-4.1)$$ is studied. We use a three-dimensional summary statistic that stores the mean, variance and lag-one autocovariance of the transformed data. If there were no noise in the state equation for $$\xi_{n}^{*}$$, then this would be a sufficient statistic of $${Y}^{*}$$, and hence is a natural choice for the summary statistic. The uniform kernel is used in the accept-reject step. We evaluate rejection sampling and iterative importance sampling methods on data of length $$n=100,500,2000$$ and $$10\,000$$, and use $$N=40\,000$$ Monte Carlo simulations. For iterative importance sampling, the sequence $$\{p_{k}\}$$ has the first five values decreasing linearly from $$5\%$$ to $$1\%$$, and later values being $$1\%$$. We further set $$N_0=2000$$ and $$K_{\rm max}=10$$. For the rejection sampler, acceptance probabilities of both $$5\%$$ and $$1\%$$ were tried and $$5\%$$ was chosen as it gave better performance. The simulation results are shown in Fig. 2. Fig. 2. View largeDownload slide Comparisons of rejection (solid) and iterative importance sampling (dashed) versions of approximate Bayesian computation. For each $$n$$, the logarithm of the average mean square error, MSE, across $$100$$ datasets is reported. For each dataset, the Monte Carlo sample size is $$40\,000$$. Ratios of mean square errors of the two methods are given in the table, and smaller values indicate better performance of iterative importance sampling. For each method in the plots, a line is fitted to the connected lines and the slope is reported in the table. Smaller values indicate faster decrease of the mean square error. Fig. 2. View largeDownload slide Comparisons of rejection (solid) and iterative importance sampling (dashed) versions of approximate Bayesian computation. For each $$n$$, the logarithm of the average mean square error, MSE, across $$100$$ datasets is reported. For each dataset, the Monte Carlo sample size is $$40\,000$$. Ratios of mean square errors of the two methods are given in the table, and smaller values indicate better performance of iterative importance sampling. For each method in the plots, a line is fitted to the connected lines and the slope is reported in the table. Smaller values indicate faster decrease of the mean square error. For all parameters, iterative importance sampling shows increasing advantage over rejection sampling as $$n$$ increases. For larger $$n$$, the iterative procedure obtains a centre for proposals closer to the true parameter and a bandwidth that is smaller than those used for rejection sampling. These contribute to the more accurate estimators. It is easy to estimate $$\log\bar{\sigma}$$, since the expected summary statistic $$\tilde{E}({Y}^{*})$$ is roughly linear in $$\log\bar{\sigma}$$. Thus iterative importance sampling has less of an advantage over rejection sampling when estimating this parameter. 6. Discussion Our results suggest that one can obtain efficient estimates using approximate Bayesian computation with a fixed Monte Carlo sample size as $$n$$ increases. Thus the computational complexity of approximate Bayesian computation will just be the complexity of simulating a sample of size $$n$$ from the underlying model. Our results on the Monte Carlo accuracy of approximate Bayesian computation considered the importance sampling implementation given in Algorithm 1. If we do not use the uniform kernel, then there is a simple improvement on this algorithm, which absorbs the accept-reject probability within the importance sampling weight. A simple Rao–Blackwellization argument then shows that this leads to a reduction in Monte Carlo variance, so our positive results about the scaling of approximate Bayesian computation with $$n$$ will also immediately apply to this implementation. Similar positive Monte Carlo results are likely to apply to Markov chain Monte Carlo implementations of approximate Bayesian computation. A Markov chain Monte Carlo version will be efficient provided the acceptance probability does not degenerate to zero as $$n$$ increases. However, at stationarity it will propose parameter values from a distribution close to the approximate Bayesian computation posterior density, and Theorems 2 and 3 suggest that for such a proposal distribution the acceptance probability will be bounded away from zero. Whilst our theoretical results suggest that point estimates based on approximate Bayesian computation have good properties, they do not suggest that the approximate Bayesian computation posterior is a good approximation to the true posterior. In fact, Frazier et al. (2016) show that it will overestimate uncertainty if $$\varepsilon_n=O(a_n^{-1})$$. However, Li & Fearnhead (2018) show that using regression methods (Beaumont et al., 2002) to post-process approximate Bayesian computation output can lead to both efficient point estimation and accurate quantification of uncertainty. Acknowledgement This work was supported by the U.K. Engineering and Physical Sciences Research Council. Supplementary material Supplementary material available at Biometrika online contains proofs of the main results. References Allingham D., King R. A. R. & Mengersen K. L. ( 2009). Bayesian estimation of quantile distributions. Statist. Comp. 19, 189– 201. Google Scholar CrossRef Search ADS Barber S., Voss J. & Webster M. ( 2015). The rate of convergence for approximate Bayesian computation. Electron. J. Statist. 9, 80– 105. Google Scholar CrossRef Search ADS Beaumont M. A. ( 2010). Approximate Bayesian computation in evolution and ecology. Ann. Rev. Ecol. Evol. Syst. 41, 379– 406. Google Scholar CrossRef Search ADS Beaumont M. A., Cornuet J.-M., Marin J.-M. & Robert C. P. ( 2009). Adaptive approximate Bayesian computation. Biometrika 96, 983– 90. Google Scholar CrossRef Search ADS Beaumont M. A., Zhang W. & Balding D. J. ( 2002). Approximate Bayesian computation in population genetics. Genetics 162, 2025– 35. Google Scholar PubMed Biau G., Cérou F. & Guyader A. ( 2015). New insights into approximate Bayesian computation. Ann. Inst. Henri Poincaré Prob. Statist. 51, 376– 403. Google Scholar CrossRef Search ADS Blum M. G. ( 2010). Approximate Bayesian computation: A nonparametric perspective. J. Am. Statist. Assoc. 105, 1178– 87. Google Scholar CrossRef Search ADS Blum M. G. & François O. ( 2010). Non-linear regression models for approximate Bayesian computation. Statist. Comp. 20, 63– 73. Google Scholar CrossRef Search ADS Bortot P., Coles S. G. & Sisson S. A. ( 2007). Inference for stereological extremes. J. Am. Statist. Assoc. 102, 84– 92. Google Scholar CrossRef Search ADS Clarke B. & Ghosh J. ( 1995). Posterior convergence given the mean. Ann. Statist. 23, 2116– 44. Google Scholar CrossRef Search ADS Del Moral P., Doucet A. & Jasra A. ( 2012). An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statist. Comp. 22, 1009– 20. Google Scholar CrossRef Search ADS Duffie D. & Singleton K. J. ( 1993). Simulated moments estimation of Markov models of asset prices. Econometrica 61, 929– 52. Google Scholar CrossRef Search ADS Fearnhead P. & Prangle D. ( 2012). Constructing summary statistics for approximate Bayesian computation: Semi-automatic approximate Bayesian computation (with Discussion). J. R. Statist. Soc. B 74, 419– 74. Google Scholar CrossRef Search ADS Frazier D. T., Martin G. M., Robert C. P. & Rousseau J. ( 2016). Asymptotic properties of approximate Bayesian computation. arXiv:1607.06903. Gordon N., Salmond D. & Smith A. F. M. ( 1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEEE Proc. Radar Sig. Proces. 140, 107– 13. Google Scholar CrossRef Search ADS Gouriéroux C. & Ronchetti E. ( 1993). Indirect inference. J. Appl. Economet. 8, s85– 118. Google Scholar CrossRef Search ADS Heggland K. & Frigessi A. ( 2004). Estimating functions in indirect inference. J. R. Statist. Soc. B 66, 447– 62. Google Scholar CrossRef Search ADS Hesterberg T. ( 1995). Weighted average importance sampling and defensive mixture distributions. Technometrics 37, 185– 94. Google Scholar CrossRef Search ADS Ishida E., Vitenti S., Penna-Lima M., Cisewski J., de Souza R., Trindade A., Cameron E. & Busti V. ( 2015). COSMOABC: Likelihood-free inference via population Monte Carlo approximate Bayesian computation. Astron. Comp. 13, 1– 11. Google Scholar CrossRef Search ADS Li W. & Fearnhead P. ( 2018). Convergence of regression adjusted approximate Bayesian computation. Biometrika , https://doi.org/10.1093/biomet/asx081. Marin J.-M., Pillai N. S., Robert C. P. & Rousseau J. ( 2014). Relevant statistics for Bayesian model choice. J. R. Statist. Soc. B 76, 833– 59. Google Scholar CrossRef Search ADS Peters G. W., Kannan B., Lasscock B., Mellen C. & Godsill S. ( 2011). Bayesian cointegrated vector autoregression models incorporating alpha-stable noise for inter-day price movements via approximate Bayesian computation. Bayesian Anal. 6, 755– 92. Google Scholar CrossRef Search ADS Prangle D., Fearnhead P., Cox M. P., Biggs P. J. & French N. P. ( 2014). Semi-automatic selection of summary statistics for ABC model choice. Statist. Appl. Genet. Molec. Biol. 13, 67– 82. Google Scholar CrossRef Search ADS Pritchard J. K., Seielstad M. T., Perez-Lezaun A. & Feldman M. W. ( 1999). Population growth of human Y chromosomes: A study of Y chromosome microsatellites. Molec. Biol. Evol. 16, 1791– 8. Google Scholar CrossRef Search ADS Sandmann G. & Koopman S. ( 1998). Estimation of stochastic volatility models via Monte Carlo maximum likelihood. J. Economet. 87, 271– 301. Google Scholar CrossRef Search ADS Toni T., Welch D., Strelkowa N., Ipsen A. & Stumpf M. P. ( 2009). Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface 6, 187– 202. Google Scholar CrossRef Search ADS PubMed Wegmann D., Leuenberger C. & Excoffier L. ( 2009). Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics 182, 1207– 18. Google Scholar CrossRef Search ADS PubMed Wood S. N. ( 2010). Statistical inference for noisy nonlinear ecological dynamic systems. Nature 466, 1102– 4. Google Scholar CrossRef Search ADS PubMed Yuan A. & Clarke B. ( 2004). Asymptotic normality of the posterior given a statistic. Can. J. Statist. 32, 119– 37. Google Scholar CrossRef Search ADS © 2018 Biometrika Trust This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Biometrika – Oxford University Press

**Published: ** Jan 20, 2018

Loading...

personal research library

It’s your single place to instantly

**discover** and **read** the research

that matters to you.

Enjoy **affordable access** to

over 18 million articles from more than

**15,000 peer-reviewed journals**.

All for just $49/month

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Read from thousands of the leading scholarly journals from *SpringerNature*, *Elsevier*, *Wiley-Blackwell*, *Oxford University Press* and more.

All the latest content is available, no embargo periods.

## “Hi guys, I cannot tell you how much I love this resource. Incredible. I really believe you've hit the nail on the head with this site in regards to solving the research-purchase issue.”

Daniel C.

## “Whoa! It’s like Spotify but for academic articles.”

@Phil_Robichaud

## “I must say, @deepdyve is a fabulous solution to the independent researcher's problem of #access to #information.”

@deepthiw

## “My last article couldn't be possible without the platform @deepdyve that makes journal papers cheaper.”

@JoseServera

DeepDyve ## Freelancer | DeepDyve ## Pro | |
---|---|---|

Price | FREE | $49/month |

Save searches from | ||

Create lists to | ||

Export lists, citations | ||

Read DeepDyve articles | Abstract access only | Unlimited access to over |

20 pages / month | ||

PDF Discount | 20% off | |

Read and print from thousands of top scholarly journals.

System error. Please try again!

or

By signing up, you agree to DeepDyve’s Terms of Service and Privacy Policy.

Already have an account? Log in

Bookmark this article. You can see your Bookmarks on your DeepDyve Library.

To save an article, **log in** first, or **sign up** for a DeepDyve account if you don’t already have one.