Asymptotic properties of approximate Bayesian computation

Summary

Approximate Bayesian computation allows for statistical analysis using models with intractable likelihoods. In this paper we consider the asymptotic behaviour of the posterior distribution obtained by this method. We give general results on the rate at which the posterior distribution concentrates on sets containing the true parameter, the limiting shape of the posterior distribution, and the asymptotic distribution of the posterior mean. These results hold under given rates for the tolerance used within the method, mild regularity conditions on the summary statistics, and a condition linked to identification of the true parameters. Implications for practitioners are discussed.

1. Introduction

Interest in approximate Bayesian computation has begun to shift from an initial focus on using it as a computational tool towards validation of it as a statistical inference procedure; see, for example, Fearnhead & Prangle (2012), Marin et al. (2014), Creel & Kristensen (2015), Creel et al. (2015), Drovandi et al. (2015), Martin et al. (2017) and Li & Fearnhead (2018a,b). We study large-sample properties of posterior distributions and posterior means obtained from approximate Bayesian computation algorithms. Under mild regularity conditions on the underlying summary statistics, we characterize the rate of posterior concentration and show that the limiting posterior shape depends crucially on the interplay between the rate at which the summaries converge and the rate at which the tolerance used to select parameters shrinks to zero. Bayesian consistency places a less stringent condition on the speed with which the tolerance declines to zero than does asymptotic normality of the posterior distribution. Furthermore, and in contrast to textbook Bernstein–von Mises results, asymptotic normality of the posterior mean does not require asymptotic normality of the posterior distribution, the former being attainable under weaker conditions on the tolerance than required for the latter. Validity of these results requires that the summaries converge to a well-defined limit and that this limit, viewed as a mapping from parameters to summaries, be injective. These conditions have a close correspondence to those required for theoretical validity of indirect inference and related frequentist estimators (see, e.g., Gouriéroux et al., 1993). We focus on three aspects of asymptotic behaviour: posterior consistency, limiting posterior shape, and the asymptotic distribution of the posterior mean. Our focus is broader than that of existing studies on large-sample properties of approximate Bayesian computation algorithms, in which the asymptotic properties of resulting point estimators have been of primary interest; see Creel et al. (2015) and Li & Fearnhead (2018a). Our approach allows both weaker conditions and a complete characterization of the limiting posterior shape. We distinguish between the conditions, on both the summaries and the tolerance, required for concentration and those required for distributional results. Our results suggest how the tolerance in approximate Bayesian computation should be chosen to ensure posterior concentration, valid coverage levels for credible sets, and asymptotically normal and efficient point estimators. 2.
Preliminaries and background We observe data $${y}=(y_{1},\dots,y_{T})^{{\rm T}}$$, $$T\geqslant 1$$, drawn from the model $$\{P_{{\theta }}:{\theta \in \Theta }\}$$, where $$P_{{\theta }}$$ admits the corresponding conditional density $$p(\cdot \mid{\theta })$$ and $${\theta }\in {\Theta }\subset \mathbb{R}^{k_{\theta }}$$. Given a prior measure $$\Pi(\theta)$$ with density $$\pi({\theta })$$, the aim of the algorithms under study is to produce draws from an approximation to the exact posterior density $$\pi({\theta \mid y})\propto p({y\mid\theta })\pi({\theta })$$, in situations where both parameters and pseudo-data $$({\theta },{z})$$ can easily be simulated, from $$\pi({\theta })p({z\mid\theta })$$, but $$p({z\mid\theta })$$ is intractable. The simplest accept/reject form of the algorithm (Tavaré et al., 1997; Pritchard et al., 1999) is detailed in Algorithm 1 below. Algorithm 1. Approximate Bayesian computation. Step 1. Simulate $${\theta }^{i}$$$$(i=1,\dots,N)$$ from $$\pi({\theta })$$. Step 2. Simulate $${z}^{i}=(z_{1}^{i},\dots,z_{T}^{i})^{{\rm T}}$$$$(i=1,\dots,N)$$ from the likelihood $$p(\cdot\mid{\theta }^{i})$$. Step 3. Select $${\theta }^{i}$$ such that $$d\{{\eta }({y}),{\eta }({z}^{i})\}\leqslant \varepsilon ,$$ where $${\eta }({\cdot })$$ is a statistic, $$d(\cdot\, , \cdot )$$ is a distance function, and $$\varepsilon>0$$ is the tolerance level. Algorithm 1 thus samples $${\theta }$$ and $${z}$$ from the joint posterior density $$ \pi_{\varepsilon }\{{\theta },{z\mid\eta }({y})\}={\pi({\theta })p({z\mid\theta } ){\rm 1}\kern-0.24em{\rm I}_{\varepsilon }({z})}\Big/\int\pi({\theta })p({z\mid\theta }){\rm 1}\kern-0.24em{\rm I}_{\varepsilon }({z})\,{\rm d}{z}\,{\rm d}{\theta}, $$ where $${\rm 1}\kern-0.24em{\rm I}_{\varepsilon }({z})={\rm 1}\kern-0.24em{\rm I}[d\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon ]=1$$ if $$d\{ {\eta }({y}),{\eta }({z} )\} \leqslant \varepsilon$$ and 0 otherwise. The approximate Bayesian computation posterior density is defined as $$ \pi_{\varepsilon }\{{\theta \mid\eta }({y})\}=\int_{{}}\pi_{\varepsilon }\{{\theta },{z\mid\eta }({y})\}\,{\rm d}{z}\text{.} $$ In what follows, we refer to $$\pi_{\varepsilon }\{{\theta \mid\eta }({y})\}$$ as the approximate posterior density. Likewise, the posterior probability of a set $$A\subset\Theta$$ associated with Algorithm 1 is \begin{equation*} \Pi _{\varepsilon }\{A\mid \eta (y)\}=\Pi [ A\mid d_{}\{{\eta }({y}),{\eta }({z} )\}\leqslant \varepsilon ] =\int_{A}\pi_{\varepsilon }\{{\theta }\mid {\eta }({y} )\}\,{\rm d}{\theta }, \end{equation*} and we refer to $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ as the approximate posterior distribution. When $$\eta(\cdot)$$ is sufficient for the observed data $$y$$ and $$\varepsilon$$ is close to zero, $$\pi_{\varepsilon }\{{\theta \mid\eta }({y})\}$$ will be a good approximation to $$\pi({\theta \mid y})$$, and draws of $${\theta }$$ from $$\pi_{\varepsilon }\{\theta \mid\eta({y})\}$$ can be used to estimate features of $$\pi(\theta \mid y)$$. In practice $$\eta(y)$$ is rarely sufficient for $$y$$, and draws of $$\theta$$ can only be used to approximate $$\pi\{\theta \mid \eta(y)\}=\lim_{\varepsilon\rightarrow0}\pi_{\varepsilon}\{\theta \mid \eta(y)\}$$. 
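To make Algorithm 1 concrete, the following minimal sketch (our own illustration, not code from the paper) implements the three accept/reject steps for a user-supplied prior sampler, data simulator and summary map; the toy Gaussian location model, the function names, the tolerance and the number of draws are ours.

```python
import numpy as np

def abc_accept_reject(y, prior_sampler, simulator, summary, eps, N, rng):
    """Accept/reject approximate Bayesian computation (Algorithm 1).

    Draws theta from the prior, simulates pseudo-data z of the same length
    as y, and keeps the draws whose summaries lie within eps of eta(y).
    """
    eta_y = summary(y)
    accepted = []
    for _ in range(N):
        theta = prior_sampler(rng)              # Step 1: draw theta from the prior
        z = simulator(theta, len(y), rng)       # Step 2: simulate pseudo-data given theta
        if np.linalg.norm(summary(z) - eta_y) <= eps:   # Step 3: accept or reject
            accepted.append(theta)
    return np.array(accepted)

# Toy illustration (our choice): y_t ~ N(mu, 1) with a uniform prior on mu.
rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=200)
draws = abc_accept_reject(
    y,
    prior_sampler=lambda g: g.uniform(-5.0, 5.0),
    simulator=lambda mu, T, g: g.normal(mu, 1.0, size=T),
    summary=lambda x: np.array([x.mean()]),
    eps=0.05, N=20000, rng=rng,
)
print(draws.mean(), draws.std())   # crude posterior mean and spread for mu
```

The accepted draws form a sample from $$\pi_{\varepsilon }\{{\theta \mid\eta }({y})\}$$; shrinking the tolerance towards zero while increasing the number of draws trades Monte Carlo cost against the approximation error studied in the remainder of the paper.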
Given the general lack of sufficient statistics, we need to assess the behaviour of the approximate posterior distribution $$\Pi_{\varepsilon }\{\cdot\mid \eta(y)\}$$ and establish whether or not $$\Pi_{\varepsilon }\{\cdot\mid \eta(y)\}$$ behaves in a manner that is appropriate for statistical inference, with asymptotic theory being one obvious approach. Establishing the large-sample behaviour of $$\Pi_{\varepsilon }\{\cdot\mid \eta(y)\}$$, including point and interval estimates derived from this distribution, gives practitioners guarantees on the reliability of approximate Bayesian computation. Furthermore, these results enable us to propose guidelines for choosing the tolerance $$\varepsilon$$ so that $$\Pi_{\varepsilon }\{\cdot\mid \eta(y)\}$$ possesses desirable statistical properties. Before presenting our results, we define some notation used throughout the paper. Let $$\mathcal{B}\subset \mathbb{R}^{k_{\eta }}$$ denote the range of the simulated summaries $$\eta(z)$$. Let $$d_{1}(\cdot \,,\cdot )$$ be a metric on $${\Theta }$$ and $$d_{2}(\cdot\, ,\cdot)$$ a metric on $$\mathcal{B}$$. Take $$\Vert \cdot \Vert$$ to be the Euclidean norm. Throughout, $$C$$ denotes a generic positive constant. For real-valued sequences $$\{a_{T}\}_{T\geqslant 1}$$ and $$\{b_{T}\}_{T\geqslant 1}$$, $$\:a_{T}\lesssim b_{T}$$ means that $$a_{T}\leqslant Cb_{T}$$ for some finite $$C>0$$ and $$T$$ large, $$a_{T}\asymp b_{T}$$ means that $$a_{T}\lesssim b_{T} \lesssim a_{T}$$, and $$a_{T}\gg b_{T}$$ indicates a greater order of magnitude. For $$x_{T}$$ a random variable, $$x_{T}=o_{\rm p}(a_{T})$$ if $$\lim_{T\rightarrow \infty }\text{pr} (|x_{T}/a_{T}|\geqslant C)=0$$ for any $$C>0$$, and $$x_{T}=O_{\rm p}(a_{T})$$ if for any $$C>0$$ there exists a finite $$M>0$$ and a finite $$T$$ such that $$\text{pr}(|x_{t}/a_{t}|\geqslant M)\leqslant C$$ for all $$t>T$$. All limits are taken as $$T\rightarrow\infty$$. When no confusion is likely to result, we will write $$\lim_{T}$$ for $$\lim_{T\rightarrow\infty}$$. 3. Concentration of the approximate Bayesian computation posterior We assume throughout that the model is correctly specified: for some $$\theta_0$$ in the interior of $$\Theta$$, we have $$P_{{\theta}_0}=P_{0}$$. Asymptotic validity of any Bayesian procedure requires posterior concentration, which is often referred to as Bayesian consistency. In our context, this equates to the following posterior concentration property: for any $$\delta >0$$ and for some $$\varepsilon>0$$, \begin{align*} \Pi _{\varepsilon }\{d_{1}(\theta ,\theta _{0})>\delta\mid \eta (y)\}& =\Pi \big[ d_{1}(\theta ,\theta _{0})>\delta \mid d_{2}\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon \big] \\ & =\int_{d_{1}(\theta ,\theta _{0})>\delta }\pi_{\varepsilon }\{\theta \mid \eta (y)\}\,{\rm d}\theta =o_{\rm p}(1)\text{.} \end{align*} This property is paramount since, for any $$A\subset \Theta$$, $$\:\Pi[ A\mid d_{2}\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon]$$ will differ from the exact posterior probability. Without the guarantees of exact posterior inference, knowing that $$\Pi_{\varepsilon}\{\cdot\mid\eta(y)\}$$ will concentrate on $$\theta _{0}$$ justifies its use as a means of expressing our uncertainty about $$\theta$$. Posterior concentration is related to the rate at which information about $$\theta _{0}$$ accumulates in the sample. 
The amount of information Algorithm 1 provides depends on the rate at which the observed summaries $$\eta(y)$$ and the simulated summaries $$\eta(z)$$ converge to well-defined limit counterparts $${b}({\theta }_{0})$$ and $${b}({\theta })$$, as well as on the rate at which information about $$\theta _{0}$$ accumulates within the algorithm, governed by the rate at which $$\varepsilon$$ goes to $$0$$. To link the two factors, we consider $$\varepsilon$$ as a $$T$$-dependent sequence $$\varepsilon _{T}\rightarrow 0$$ as $${T\rightarrow \infty }$$. We can now state the technical assumptions used to establish our first result. These assumptions are applicable to a broad range of data structures, including weakly dependent data. Assumption 1. There exist a nonrandom map $${b}:{\Theta } \rightarrow \mathcal{B}$$ and a sequence of functions $$\rho _{T}(u)$$ that are monotone nonincreasing in $$u$$ for any $$T$$ and which satisfy $$\rho _{T}(u)\rightarrow 0$$ as $$T\rightarrow \infty$$. For fixed $$u$$ and for all $$\theta\in\Theta$$, \begin{equation*} P_{{\theta }}\big[ d_{2}\{{\eta }({z}),{b}({\theta })\}>u\big] \leqslant c({ \theta })\rho _{T}(u),\quad \int_{\Theta }c({\theta })\,{\rm d}\Pi ({\theta } )<\infty \end{equation*} under either of the following assumptions on $$c(\cdot )$$: (i) there exist $$c_{0}<\infty$$ and $$\delta >0$$ such that for all $${\theta }$$ satisfying $$d_{2}\{{b}({\theta } ),{b}({\theta}_{0})\}\leqslant \delta$$, $$c({\theta })\leqslant c_{0}$$; (ii) there exists $$a>0$$ such that $$\int_{\Theta }c({\theta })^{1+a}\,{\rm d}\Pi ({\theta })<\infty$$. Assumption 2. There exists some $$D>0$$ such that for all $$\xi >0$$ and some $$C>0$$, the prior probability satisfies $$\Pi [ d_{2}\{{b}({\theta }),{b}({\theta }_{0})\}\leqslant \xi ]\geqslant C \xi ^{D}$$. Assumption 3. (i) The map $${b}$$ is continuous. (ii) The map $${b}$$ is injective and satisfies $$\Vert {\theta }-{\theta }_{0}\Vert \leqslant L\Vert {b}({\theta })-{b}({\theta } _{0})\Vert ^{\alpha }$$ in some open neighbourhood of $${\theta }_{0}$$ with $$L>0$$ and $$\alpha >0$$. Remark 1. The convergence of $$\eta(z)$$ to $$b(\theta)$$ in Assumption 1 is the key to posterior concentration, and without it or a similar assumption, Bayesian consistency will not occur. The function $$\rho_{T}(u)$$ in Assumption 1 typically takes the form $$\rho_{T}(u)=\rho_{}(uv_{T})$$ for $$v_{T}$$ a sequence such that $$d_{2}\{\eta(z),b(\theta)\}=O_{\rm p}(1/v_{T})$$ and where $$\rho(u v_{T})$$ controls the tail behaviour of $$d_{2}\{\eta(z),b(\theta)\}$$. The specific structure of $$\rho(uv_{T})$$ will depend on what is assumed about the properties of the underlying summaries $$\eta(z)$$. In most cases, $$\rho(uv_{T})$$ will have either a polynomial or an exponential structure in $$uv_{T}$$, and thus satisfy one of the following rates. (a) Polynomial: there exist a diverging positive sequence $$\{v_{T}\}_{T\ge 1}$$ and $$u_{0},\kappa>0$$ such that \begin{equation} P_{\theta}\big[d_{2}\{\eta(z),b(\theta)\}>u\big]\leqslant c(\theta)\rho_{T}(u),\quad\rho _{T}(u)=1/(uv_{T})^{\kappa },\quad u\leqslant u_{0}, \label{rho_poly} \end{equation} (1) where for some $$c_0>0$$ and $$\delta>0$$, $$\int_{\Theta}c(\theta)\,{\rm d}\Pi(\theta)<\infty$$ and if $$d_{2}\{b(\theta),b(\theta_0)\}\leqslant\delta$$ then $$c(\theta)\leqslant c_0$$. 
(b) Exponential: there exist $$h_{\theta}(\cdot)>0$$ and $$u_0>0$$ such that \begin{equation} P_{\theta}\big[d_{2}\{\eta(z),b(\theta)\}>u\big]\leqslant c(\theta)\rho_{T}(u),\quad\rho _{T}(u)=\exp\{-h_{{\theta }}(uv_{T})\},\quad u\leqslant u_{0}, \label{rho_exp} \end{equation} (2) where for some $$c,C>0$$, $$\:\int_{\Theta} c(\theta)\exp\{-h_{\theta}(u v_{T})\}\,{\rm d}\Pi(\theta)\leqslant C\exp\{-c(u v_{T})^{\tau}\}$$. To illustrate these cases for $$\rho_{T}(\cdot)$$, consider the summary statistic $$\eta ({z})=T^{-1}\sum_{i=1}^{T}g(z_{i})$$, where for simplicity we consider $$\{g(z_{i})\}_{i\leqslant T}$$ to be independent and identically distributed, and $$b({\theta })=E_{{\theta }}\{g(Z)\}$$. If $$g(z_{i})-b(\theta)$$ has a finite moment of order $$\kappa$$, then $$\rho_{T}(u)$$ will satisfy (1): from Markov’s inequality, \begin{equation*} P_{\theta}\big\{\Vert {\eta }({z})-b({\theta })\Vert >u\big\}\leqslant { C_{}E_{{\theta }}\{ | g(Z)|^{\kappa }\} }/{({uT^{1/2}})^{\kappa } }\text{.} \end{equation*} With reference to (1), $$\rho_{T}(u)=1/(uv_{T})^{\kappa}$$, $$v_{T}=T^{1/2}$$ and $$c(\theta)=C\mathbb{E}_{\theta}\{|g(Z)|^{\kappa}\}<\infty$$. If the map $${\theta }\mapsto E_{{\theta }}\{ |g(Z)|^{\kappa }\}$$ is continuous at $${\theta }_{0}$$ and positive, Assumption 1 is satisfied. If $$\{g(z_i)-b(\theta)\}$$ has a finite exponential moment, $$\rho_{T}(u)$$ will satisfy (2): from a version of the Bernstein inequality, \begin{equation*} \begin{split} P_{\theta}\big\{ \Vert {\eta }({z})-b({\theta })\Vert >u\big\} &\leqslant \exp\bigl[-u^{2}T/\{2c(\theta )\}\bigr]\text{.} \end{split} \end{equation*} With reference to (2), $$\rho_{T}(u)=\exp\{-h_{\theta}(uv_{T})\}$$, $$h_{{\theta }}(uv_{T})=u^{2}v_{T}^{2}/\{2c(\theta )\}$$ and $$v_{T}=T^{1/2}$$. If the map $${\theta }\mapsto c(\theta )$$ is continuous at $${\theta }_{0}$$ and positive, Assumption 1 is satisfied. Remark 2. Assumption 2 controls the degree of prior mass in a neighbourhood of $${\theta }_{0}$$ and is standard in Bayesian asymptotics. For $$\xi$$ small, the larger $$D$$ is, the smaller is the amount of prior mass near $$\theta _{0}$$. If the prior measure $$\Pi(\theta)$$ is absolutely continuous with prior density $$\pi(\theta)$$ and if $$\pi$$ is bounded, above and below, near $$\theta _{0}$$, then $$D=\dim (\theta )=k_{\theta }$$. Assumption 3 is an identification condition that is critical for obtaining posterior concentration around $${ \theta }_{0}$$. Injectivity of $$b$$ depends on both the true structural model and the particular choice of $${\eta }$$. Without this identification condition, posterior concentration at $$\theta_0$$ cannot occur. Theorem 1. If Assumptions 1 and 2 are satisfied, then for $$M$$ large enough, as $$T\rightarrow\infty$$ and $$\varepsilon_T=o(1)$$, with $$P_{0}$$ probability going to $$1$$, \begin{equation} \Pi \big[ d_{2}\{{b}({\theta }),{b}({\theta }_{0})\}>\lambda_{T}\;\big|\; d_{2}\{{\eta }({y}),{\eta }({z} )\}\leqslant \varepsilon _{T}\big] \lesssim 1/M, \label{i} \end{equation} (3)with $$\lambda_{T}=4\varepsilon_{T}/3 + \rho_{T}^{-1}(\varepsilon_{T}^{D}/M)$$. 
Moreover, if Assumption 3 also holds, then as $$T\rightarrow\infty$$, \begin{equation} \Pi \big[ d_{1}({\theta },{\theta }_{0})>L\lambda_{T}^{\alpha }\;\big|\; d_{2}\{{\eta }({y}),{\eta }({z} )\}\leqslant \varepsilon _{T}\big] \lesssim 1/M\text{.} \label{ii} \end{equation} (4) Since equations (3) and (4) hold for any $$M$$ large enough, we can conclude that the posterior distribution behaves like an $$o_{\rm p}(1)$$ random variable on sets that do not include $$\theta_0$$, and Bayesian consistency of $$\Pi_{\varepsilon}\{\cdot \mid \eta(y)\}$$ follows. More generally, (3) and (4) give a posterior concentration rate, denoted by $$\lambda_{T}$$ in Theorem 1, that depends on $$\varepsilon_{T}$$ and on the underlying behaviour of $$\eta(z)$$, as described by $$\rho_{T}(u)$$. We must consider the nature of this concentration rate in order to understand which choices for $$\varepsilon_{T}$$ are appropriate under different assumptions on the summary statistics. As mentioned above, the deviation control function $$\rho_{T}(u)$$ will often be of a polynomial form, (1), or an exponential form, (2). Under these two assumptions, $$\rho_{T}(u)$$ has an explicit representation and the concentration rate $$\lambda_{T}$$ can be obtained by solving the equation $$\lambda_{T}= 4\varepsilon_{T}/3 + \rho_{T}^{-1}(\varepsilon_{T}^{D}/M)\text{.}$$ Case (a): polynomial. From (1), the deviation control function is $$\rho_{T}(u)=1/(uv_{T})^{\kappa}$$. To obtain the posterior concentration rate, we invert $$\rho_{T}(u)$$ to obtain $$\rho _{T}^{-1}(\varepsilon _{T}^{D})=1/(\varepsilon _{T}^{D/\kappa }v_{T})$$, and then equate $$\varepsilon _{T}$$ and $$\rho _{T}^{-1}(\varepsilon _{T}^{D})$$ to obtain $$\varepsilon _{T}\asymp v_{T}^{-\kappa /(\kappa +D)}$$. This choice of $$\varepsilon_{T}$$ implies concentration of the approximate posterior distribution at the rate \begin{equation*} \lambda _{T}\asymp v_{T}^{-\kappa /(\kappa +D)}\text{.} \end{equation*} Case (b): exponential. If the summary statistics admit an exponential moment, a faster rate of posterior concentration is obtained. From (2), $$\rho _{T}(u)=\exp\{-h_{{\theta }}(uv_{T})\}$$ and there exist finite $$u_{0},c,C>0$$ such that \begin{equation*} \int_{\Theta }c({\theta })\exp\{-h_{\theta }(uv_{T})\}\,{\rm d}\Pi ({\theta })\leqslant C\exp\{-c(uv_{T})^{\tau }\},\quad u\leqslant u_{0}\text{.} \end{equation*} Hence, if $$c({\theta })$$ is bounded from above and if $$h_{{\theta }}(u)\ge u^{\tau }$$ for $${\theta }$$ in a neighbourhood of $$\theta_0$$, then $$\rho _{T}(u)\asymp \exp\{-c_{0}(uv_{T})^{\tau }\}$$; thus, $$\rho _{T}^{-1}(\varepsilon _{T}^{D})\asymp (-\log \varepsilon _{T})^{1/\tau }/v_{T}$$. Following arguments similar to those used in case (a), if we take $$\varepsilon _{T}\asymp (\log v_{T})^{1/\tau }/v_{T}$$, the approximate posterior distribution concentrates at the rate \begin{equation*} \lambda _{T}\asymp {(\log v_{T})^{1/\tau }}/{v_{T}}\text{.} \end{equation*} Example 1. We now illustrate the conditions of Theorem 1 in a moving average model of order two, $$ y_{t}=e_{t}+\theta _{1}e_{t-1}+\theta _{2}e_{t-2}\quad (t=1,\dots,T), $$ where $$\{e_{t}\}_{t=1}^{T}$$ is a sequence of white noise random variables such that $$E(e_{t}^{4+\delta })<\infty$$ for some $$\delta >0$$. 
Our prior for $$\theta=(\theta _{1},\theta _{2})^{{\rm T}}$$ is uniform over the invertibility region \begin{equation} -2\leqslant \theta _{1}\leqslant 2,\quad \theta _{1}+\theta _{2}\geqslant -1,\quad \theta _{1}-\theta _{2}\leqslant 1\text{.} \label{const1} \end{equation} (5) Following Marin et al. (2011), we choose as summary statistics for Algorithm 1 the sample autocovariances $$\eta _{j}({y})=T^{-1}\sum_{t=1+j}^{T}y_{t}y_{t-j}$$ ($$j=0,1,2$$). For this choice, the $$j$$th component of $$b(\theta )$$ is $$b_{j}(\theta )=E_{\theta }(z_{t}z_{t-j})$$. Now take $$d_{2}\{\eta (z),b(\theta )\}=\Vert \eta (z)-b(\theta )\Vert$$. Under the moment condition for $$e_{t}$$ above, it follows that $$V(\theta )=E[\{\eta (z)-b(\theta )\}\{\eta (z)-b(\theta )\}^{{\rm T}}]$$ satisfies $$\text{tr}\{V(\theta )\}<\infty$$ for all $$\theta$$ in (5). By an application of Markov’s inequality, we can conclude that \begin{align*} P_{\theta}\big\{\|\eta(z)-b(\theta)\|>u\big\}=P_{\theta}\big\{\|\eta(z)-b(\theta)\|^{2}>u^2\big\}&\leqslant \frac{\text{tr}\{V(\theta)\}}{u^2 T}+o(1/T), \end{align*} where the $$o(1/T)$$ term comes from the fact that there are finitely many nonzero covariance terms due to the $$m$$-dependence of the series, and Assumption 1 is satisfied. Given the structure of $$b(\theta )$$, the uniform prior $$\pi(\theta )$$ over (5) satisfies Assumption 2. Furthermore, $$\theta\mapsto b(\theta)=\{1+\theta _{1}^{2}+\theta _{2}^{2},(1+\theta _{2})\theta _{1},\theta _{2}\}^{{\rm T}}$$ is injective and satisfies Assumption 3. As noted in Remark 2, the injectivity of $$\theta\mapsto b(\theta)$$ is required for posterior concentration, and without it there is no guarantee that the posterior will concentrate on $$\theta_0$$. Since the sufficient conditions for Theorem 1 are satisfied, approximate Bayesian computation based on this choice of statistics will yield an approximate posterior density that concentrates on $$\theta_0$$. Theorem 1 can also be visualized by fixing a particular value of $$\theta$$, say $$\tilde{\theta}$$, generating sets of observed data $$\tilde{y}$$ of increasing length, and then running Algorithm 1 on these datasets. If the conditions of Theorem 1 are satisfied, the approximate posterior density will become increasingly peaked at $$\tilde{\theta}$$ as $$T$$ increases. Using Example 1, we demonstrate this behaviour in the Supplementary Material. 4. Shape of the asymptotic posterior distribution 4.1. Assumptions and theorem While posterior concentration states that $$\Pi [ d_{1}(\theta ,\theta _{0})>\delta \mid d_{2}\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon _{T}] =o_{\rm p}(1)$$ for an appropriate choice of $$\varepsilon _{T}$$, it does not indicate precisely how this mass accumulates, or the approximate amount of posterior probability within any neighbourhood of $$\theta _{0}$$. This information is needed to obtain accurate expressions of uncertainty about point estimators of $$\theta _{0}$$ and to ensure that credible regions have proper frequentist coverage. To this end, we now analyse the limiting shape of $$\Pi [ \cdot \mid d_{2}\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon _{T} ]$$ for various relationships between $$\varepsilon _{T}$$ and the rate at which summary statistics satisfy a central limit theorem. In this and the following sections, we denote $$\Pi [ \cdot\mid d_{2}\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon _{T}]$$ by $$\Pi _{\varepsilon }\{\cdot \mid {\eta }(y)\}$$. Let $$\Vert \cdot \Vert _{\ast }$$ denote the spectral norm. 
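Before stating the additional assumptions, the following minimal sketch (our own code, not from the paper) illustrates Example 1 and the concentration behaviour described by Theorem 1. The sample sizes, seed and acceptance proportion are arbitrary, and the prior sampler imposes $$|\theta_{2}|\leqslant 1$$, which invertibility implies but (5) leaves implicit.

```python
import numpy as np

def simulate_ma2(theta, T, rng):
    """Simulate y_t = e_t + theta1 * e_{t-1} + theta2 * e_{t-2} with N(0, 1) noise."""
    e = rng.normal(size=T + 2)
    return e[2:] + theta[0] * e[1:-1] + theta[1] * e[:-2]

def autocov_summaries(y):
    """Sample autocovariances eta_j(y) = T^{-1} sum_{t=1+j}^T y_t y_{t-j}, j = 0, 1, 2."""
    T = len(y)
    return np.array([y @ y, y[1:] @ y[:-1], y[2:] @ y[:-2]]) / T

def sample_prior(rng):
    """Uniform draw over the invertibility triangle (5); |theta2| <= 1 is implied."""
    while True:
        t1, t2 = rng.uniform(-2, 2), rng.uniform(-1, 1)
        if t1 + t2 >= -1 and t1 - t2 <= 1:
            return np.array([t1, t2])

def abc_ma2(y, N, accept_frac, rng):
    """Nearest-neighbour ABC: keep the accept_frac proportion of smallest distances."""
    eta_y = autocov_summaries(y)
    thetas = np.array([sample_prior(rng) for _ in range(N)])
    dists = np.array([np.linalg.norm(autocov_summaries(simulate_ma2(t, len(y), rng)) - eta_y)
                      for t in thetas])
    keep = np.argsort(dists)[:max(1, int(accept_frac * N))]
    return thetas[keep]

rng = np.random.default_rng(1)
theta0 = np.array([0.6, 0.2])
for T in (500, 5000):   # the accepted draws should tighten around theta0 as T grows
    post = abc_ma2(simulate_ma2(theta0, T, rng), N=10000, accept_frac=0.01, rng=rng)
    print(T, post.mean(axis=0).round(3), post.std(axis=0).round(3))
```

Under the conditions verified above, the spread of the accepted draws shrinks as $$T$$ increases, which is the concentration property formalized in Theorem 1.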
In addition to Assumption 2, the following conditions are needed to establish the results of this section. Assumption 4. Assumption 1 holds. Moreover, there exist a sequence of positive-definite matrices $$\{\Sigma _{T}({\theta } _{0})\}_{T\ge 1}$$ and $$c_0>0$$, $$\kappa >1$$ and $$\delta >0$$ such that for all $$\Vert {\theta }-{ \theta }_{0}\Vert \leqslant \delta$$, $$\:P_{{\theta }}[ \Vert \Sigma _{T}({ \theta }_{0})\{{\eta }({z})-{b}({\theta })\}\Vert >u] \leqslant {c_{0} }{u^{-\kappa }}$$ for all $$0<u\leqslant \delta \|\Sigma_T(\theta_0)\|_*$$, uniformly in $$T$$. Assumption 5. Assumption 3 holds. Moreover, the map $$\theta\mapsto{b}(\theta)$$ is continuously differentiable at $${\theta _{0}}$$ and the Jacobian $$\nabla _{\theta }b({\theta }_{0})$$ has full column rank $$k_{\theta }$$. Assumption 6. The value $$\theta_0$$ is in the interior of $$\Theta$$. For some $$\delta >0$$ and for all $$\Vert { \theta }-{\theta }_{0}\Vert \leqslant \delta$$, there exists a sequence of positive-definite $$k_{\eta }\times k_{\eta }$$ matrices $$\{\Sigma_{T}({\theta })\}_{T\ge 1}$$, with $$k_{\eta }=\dim \{\eta (z)\}$$, such that for all open sets $$B$$, \begin{equation*} \sup_{|\theta - \theta_0|\leqslant \delta} \Big| P_\theta\big[ {\Sigma }_{T}({\theta })\{{\eta }({z})-{b}({\theta })\} \in B\big] - P\{ \mathcal{N }(0,I_{k_{\eta }}) \in B\} \Big| \rightarrow 0 \end{equation*} in distribution as $$T\rightarrow\infty$$, where $$I_{k_{\eta }}$$ is the $$k_{\eta }\times k_{\eta }$$ identity matrix. Assumption 7. There exists $$v_T\rightarrow\infty$$ such that for all $$\Vert \theta -\theta _{0}\Vert \leqslant \delta$$ , the sequence of functions $${\theta }\mapsto {\Sigma }_{T}({\theta } )v_{T}^{-1}$$ converges to some positive-definite $$A(\theta )$$ and is equicontinuous at $${\theta }_{0}$$. Assumption 8. For some positive $$\delta$$, all $$\Vert {\theta }-{\theta }_{0}\Vert \leqslant \delta$$, all ellipsoids $$B_{T}=\{ (t_{1},\dots ,t_{k_{\eta }}):\sum_{j=1}^{k_{\eta }}t_{j}^{2}/h_{T}^{2}\leqslant 1 \}$$ and all $$u\in \mathbb{R}^{k_{\eta }}$$ fixed, for all $$h_{T}\rightarrow 0$$, as $$T\rightarrow \infty$$, \begin{equation*} \begin{split} & \lim_{T} \sup_{|\theta - \theta_0|\leqslant \delta } \Big| h_{T}^{-k_{\eta }} P_{\theta }\big[ {\Sigma }_{T}({\theta })\{{\eta }({z})-{b}( {\theta })\}-u\in B_{T}\big] - \varphi _{k_{\eta}}(u) \Big| = 0 , \\ & {h_{T}^{-k_{\eta }}}\,{P_{{\theta }}\big[ {\Sigma }_{T}({\theta })\{{\eta }({z})-{b}( {\theta })\}-u\in B_{T}\big] } \leqslant H(u),\quad \int H(u)\,{\rm d} u<\infty , \end{split} \end{equation*} for $$\varphi _{k_{\eta }}(\cdot )$$ the density of a $$k_{\eta }$$-dimensional normal random variate. Remark 3. Assumption 4 is similar to Assumption 1 but for $$\Sigma _{T}(\theta _{0})\{\eta (z)-b(\theta )\}$$. Assumption 6 is a central limit theorem for $$\{\eta (z)-b(\theta )\}$$ and, as such, requires the existence of a positive-definite matrix $$\Sigma _{T}(\theta )$$. In simple cases, such as independent and identically distributed data with $$\eta(z)=T^{-1}\sum_{i=1}^{T}g(z_{i})$$, $$\:\Sigma _{T}(\theta )=v_{T}A_{T}^{}(\theta )$$ with $$A_{T}(\theta)=A( \theta)+o_{\rm p}(1)$$ and $$V(\theta )=E[\{g(Z)-b(\theta )\}\{g(Z)-b(\theta )\}^{{\rm T}}]=\{A(\theta)^{{\rm T}}A(\theta)\}^{-1}$$. 
Assumptions 5 and 8 ensure that $$\theta \mapsto b(\theta )$$ and the covariance matrix of $$\{\eta (z)-b(\theta )\}$$ are well behaved, which allows the posterior behaviour of a normalized version of $$(\theta -\theta _{0})$$ to be governed by that of $$\Sigma _{T}(\theta _{0})\{b(\theta )-b(\theta _{0})\}$$. Assumption 8 concerns the pointwise convergence of a normalized version of the measure $$P_{\theta }$$, therein dominated by $$H(u)$$, and allows application of the dominated convergence theorem in case (iii) of the following result. Theorem 2. Under Assumptions 2 and 4–7, with $$\kappa >k_{\theta }$$, the following hold with probability going to $$1$$. (i) If $$\lim_{T}v_{T}\varepsilon _{T}=\infty$$, the posterior distribution of $$\varepsilon_{T}^{-1}({\theta }-{\theta }_{0})$$ converges to the uniform distribution over the ellipsoid $$\{w:w^{{\rm T}}B_{0}w\leqslant 1\}$$ with $$B_{0}=\nabla _{\theta}\,b({\theta _{0}})^{{\rm T}}\nabla _{\theta }b({\theta _{0}})$$, meaning that for $$f(\cdot)$$ continuous and bounded, $$ \int f\{\varepsilon _{T}^{-1}({\theta }-{\theta }_{0})\}\,{\rm d}\Pi _{\varepsilon }\{ {\theta }\mid {\eta(y) }\}\overset{}{ \rightarrow }{\int_{u^{{\rm T} }B_{0}u\leqslant 1}f(u)\,{\rm d} u}\Big/{\int_{u^{{\rm T}}B_{0}u\leqslant 1}\,{\rm d} u} $$ as $$T\rightarrow\infty$$. (ii) If $$\lim_{T}v_{T}\varepsilon _{T}=c>0$$, there exists a non-Gaussian distribution $$Q_{c}$$ on $$\mathbb{R}^{k_{\eta }}$$ such that $$ \Pi _{\varepsilon }\big[ {\Sigma }_{T}({\theta }_{0})\nabla_{\theta}b(\theta_0)({\theta }- \theta_{0})-{\Sigma }_{T}({\theta }_{0})\{\eta(y)-b(\theta_0)\}\in B\;\big|\; {\eta }(y)\big] \rightarrow Q_{c}(B) $$ as $$T\rightarrow\infty$$. In particular, $$Q_{c}(B) \propto \int_{B}\int_{\mathbb{R}^{k_{\eta}} }{\rm 1}\kern-0.24em{\rm I} \{(z-x)^{{\rm T}}A(\theta_0)^{{\rm T}}A(\theta_0)(z-x)\leqslant c\}{\varphi}_{k_\eta}(z) \,{\rm d} z\,{\rm d} x\text{.}$$ (iii) If $$\lim_{T}v_{T}\varepsilon _{T}=0$$ and Assumption 8 holds, then $$ \Pi _{\varepsilon }\big[ {\Sigma }_{T}({\theta }_{0})\nabla_{\theta}b(\theta_0)({\theta }- \theta_{0})-\Sigma_{T}({\theta }_{0})\{\eta(y)-b(\theta_0)\}\in B\;\big|\;{\eta }(y)\big] \rightarrow\int_{B}\varphi_{k_{\eta}}(x)\,{\rm d} x $$ as $$T\rightarrow\infty$$. Remark 4. Theorem 2 generalizes to the case where the components of $$\eta (z)$$ have different rates of convergence. The statement and proof of this more general result are deferred to the Supplementary Material. Furthermore, as with Theorem 1, the behaviour of $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ described by Theorem 2 can be visualized. This is demonstrated in the Supplementary Material for the case of Example 1. Formal verification of the conditions underpinning Theorem 2 is quite challenging, even in this case. Numerical results nevertheless suggest that for this particular choice of model and summaries a Bernstein–von Mises result holds, conditional on $$\varepsilon_{T}=o(1/v_{T})$$ with $$v_{T}={T}^{1/2}$$. 4.2. Discussion of the result Theorem 2 asserts that the crucial feature in determining the limiting shape of $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ is the behaviour of $$v_{T}\varepsilon _{T}$$. An implication of Theorem 2 is that only in the regime where $$\lim_{T} v_{T}\varepsilon_{T}=0$$ will $$100(1-\alpha)\%$$ Bayesian credible regions calculated from $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ have frequentist coverage of $$100(1-\alpha)\%$$. 
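To fix ideas, take the common parametric case $$v_{T}=T^{1/2}$$ mentioned in Remark 4; the display below is only a restatement of cases (i)–(iii) of Theorem 2 for this rate, and summarizes how the tolerance determines the limiting shape.
\begin{align*}
\varepsilon_{T}T^{1/2}\rightarrow\infty:\ & \varepsilon_{T}^{-1}(\theta-\theta_{0})\ \text{is asymptotically uniform on the ellipsoid}\ \{w: w^{{\rm T}}B_{0}w\leqslant 1\};\\
\varepsilon_{T}T^{1/2}\rightarrow c\in(0,\infty):\ & \text{the limit is the non-Gaussian distribution } Q_{c};\\
\varepsilon_{T}=o(T^{-1/2}):\ & \text{a Bernstein--von Mises result holds, with limit } \mathcal{N}(0,I_{k_{\eta}})\ \text{after centring at } \eta(y) \text{ and scaling by } \Sigma_{T}(\theta_{0}).
\end{align*}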
If $$\lim_{T}v_{T}\varepsilon _{T}=c>0$$ for $$c$$ finite, $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ is not asymptotically Gaussian and credible regions will have incorrect magnitude, i.e., the coverage will not be at the nominal level. If $$\lim_T v_{T}\varepsilon_{T}=\infty$$, i.e., $$\varepsilon _{T}\gg v_{T}^{-1}$$, credible regions will have coverage that converges to 100%. In case (i), which corresponds to a large tolerance $$\varepsilon _{T}$$, the approximate posterior distribution has nonstandard asymptotic behaviour. In this case $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ behaves like the prior distribution over $$\{\theta:\Vert \nabla _{\theta }b({\theta }_{0})({\theta }-{\theta }_{0})\Vert \leqslant \varepsilon _{T}\{1+o_{\rm p}(1)\}\}$$, which, by prior continuity, implies that $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ is equivalent to a uniform distribution over this set. Li & Fearnhead (2018a) also established this behaviour, and observed that asymptotically $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ behaves like a convolution of a Gaussian distribution with variance of order $$1/v_T^2$$ and a uniform distribution over a ball of radius $$\varepsilon_T$$, where, depending on the order $$v_T\varepsilon_T$$, one distribution will dominate. Assumption 8 applies to random variables $${\eta }({z})$$ that are absolutely continuous with respect to Lebesgue measure or, in the case of sums of random variables, to sums that are nonlattice; see Bhattacharya & Rao (1986). For discrete $${\eta }({z})$$, Assumption 8 must be adapted for Theorem 2 to be satisfied. One such modification is the following. Assumption 9. There exist $$\delta >0$$ and a countable set $$E_{T}$$ such that for all $$\Vert {\theta }-{\theta }_{0}\Vert <\delta$$, for all $$x\in E_{T}$$ such that $$\text{pr}\{ {\eta }({z})={x}\} >0$$, $$\:\text{pr}\{ {\eta }({z})\in E_{T}\} =1$$ and \begin{equation*} \sup_{\Vert {\theta }-{\theta }_{0}\Vert \leqslant \delta }\sum_{x\in E_{T}}\Big\vert p[{\Sigma }_{T}({\theta })\{{x}-{b}({\theta })\}\mid {\theta } ]-v_{T}^{-k_{\eta}} | A(\theta_0) |^{-1/2}\varphi_{k_{\eta}} \lbrack {\Sigma }_{T}({\theta })\{{x }-{b}({\theta })\}]\Big\vert =o(1)\text{.} \end{equation*} This is satisfied when $${\eta }({z})$$ is a sum of independent lattice random variables, as in the population genetics experiment detailed in Marin et al. (2014, § 3.3), which compares evolution scenarios of separated populations from a most recent common ancestor. Furthermore, this example satisfies Assumptions 2 and 4–7. Thus the conclusions of both Theorem 1 and Theorem 2 apply to this example. 5. Asymptotic distribution of the posterior mean 5.1. Main result The literature on the asymptotics of approximate Bayesian computation has so far focused primarily on asymptotic normality of the posterior mean. The posterior normality result in Theorem 2 is not weaker, nor stronger, than the asymptotic normality of an approximate point estimator, as the results concern different objects. However, existing proofs of asymptotic normality of the posterior mean all require asymptotic normality of the posterior distribution $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$. We demonstrate that this is not a necessary condition. For clarity we focus on the case of a scalar parameter $$\theta$$ and a scalar summary $$\eta ({y})$$, i.e., $$k_{\theta}=k_{\eta}=1$$, but present an extension to the multivariate case in § 5.2. This result requires a further assumption on the prior in addition to Assumption 2. Assumption 10. 
The prior density $$\pi(\theta)$$ is such that: (i) for $$\theta _{0}$$ in the interior of $${\Theta }$$, $$\:\pi(\theta _{0})>0$$; (ii) the density function $$\pi(\theta)$$ is $$\beta$$-Hölder in a neighbourhood of $$\theta _{0}$$, i.e., there exist $$\delta ,L>0$$ such that for all $$|\theta -\theta _{0}|\leqslant \delta$$, $$\:\nabla_{\theta}^{(j)}\pi(\theta _{0})$$, the $$j$$th derivative of $$\pi(\theta_0)$$, satisfies \begin{equation*} \left\vert \pi(\theta )-\sum_{j=0}^{\lfloor \beta\rfloor }(\theta -\theta _{0})^{j}\frac{\nabla_{\theta}^{(j)}\pi(\theta _{0})}{j!}\right\vert \leqslant L|\theta -\theta _{0}|^{\beta }; \end{equation*} (iii) for $$\Theta \subset \mathbb{R}$$, $$\int_{\Theta }|\theta |^{\beta }\pi(\theta )\,{\rm d}\theta <\infty$$. Theorem 3. Suppose that Assumptions 2, 4–7 and 10 with $$\kappa >\beta +1$$ hold. Furthermore, let $$\theta\mapsto b(\theta)$$ be $$\beta$$-Hölder in a neighbourhood of $$\theta_{0}$$. Denoting by $$E_{\Pi _{\varepsilon }}(\theta )$$ the posterior mean of $$\theta$$, the following characterization holds with probability going to $$1$$. (i) If $$\lim_{T}v_{T}\varepsilon _{T}=\infty$$ and $$v_{T}\varepsilon _{T}^{2\wedge (1+\beta )}=o(1)$$, then \begin{equation}E_{\Pi _{\varepsilon }}\{ v_{T}(\theta -\theta _{0})\} \rightarrow \mathcal{N}\big[0,V(\theta _{0})/\{\nabla _{\theta }b(\theta _{0})\}^{2}\big] \label{normality:postmean} \end{equation} (6)in distribution as $$T\rightarrow\infty$$, where $$V(\theta _{0})=\lim_{T}{\rm var}[v_{T}\{\eta ({y})-b(\theta _{0})\}]$$. (ii) If $$\lim_{T}v_{T}\varepsilon _{T}=c\geqslant 0$$, and if Assumption 8 holds when $$c= 0$$, then (6) also holds. There are two immediate consequences of Theorem 3: first, part (i) states that if one is only interested in obtaining accurate point estimators for $$\theta_0$$, all one needs is a tolerance $$\varepsilon_{T}$$ satisfying $$v_{T}\varepsilon^2_{T}=o(1)$$, which can significantly reduce the computational burden of approximate Bayesian computation; secondly, if one wants accurate point estimators of $$\theta_0$$ and accurate expressions for the uncertainty associated with such a point estimator, one requires $$\varepsilon_{T}=o(1/v_{T})$$. The first statement follows directly from Theorem 3(i), while the second statement follows from Theorem 3(ii) and by recalling that, from Theorem 2, credible regions constructed from $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ will have proper frequentist coverage only if $$\varepsilon_{T}=o(1/v_{T})$$. For $$\varepsilon_{T}\asymp v_T^{-1}$$ or $$\varepsilon_{T}\gg v_{T}^{-1}$$, the frequentist coverage of credible balls centred at $$E_{\Pi_\varepsilon}(\theta)$$ will not be equal to the nominal level. As an intermediate step in the proof of Theorem 3, we demonstrate the following expansion for the posterior mean, where $$k$$ denotes the integer part of $$(\beta+1)/2$$: \begin{equation} \begin{split}\label{expan:postmean} E_{\Pi _{\varepsilon }}( \theta -\theta _{0}) & =\frac{\eta(y)-b(\theta_0)}{\nabla _{\theta }b(\theta _{0})}+\sum_{j=1}^{\lfloor \beta \rfloor } \frac{\nabla_{b}^{(j)}b^{-1}(b_{0})}{j!}\sum_{l=\lceil j/2\rceil }^{\lfloor (j+k)/2\rfloor }\frac{\varepsilon _{T}^{2l}\nabla_{b}^{(2l-j)}\pi(b_{0})}{\pi(b_{0})(2l-j)! 
}\\ & \quad +O(\varepsilon _{T}^{1+\beta })+o_{\rm p}(1/v_{T})\text{.} \end{split} \end{equation} (7) This highlights a potential deviation from the expected asymptotic behaviour of the posterior mean $$E_{\Pi _{\varepsilon }}(\theta )$$, i.e., the behaviour corresponding to $$T\rightarrow \infty$$ and $$\varepsilon _{T}\rightarrow0$$. Indeed, the posterior mean is asymptotically normal for all values of $$\varepsilon _{T}=o(1)$$ , but is asymptotically unbiased only if the leading term in (7) is $$[\nabla _{\theta }b(\theta _{0})]^{-1}\{\eta(y)-b(\theta_0)\}$$, which is satisfied in case (ii) and in case (i) if $$v_{T}\varepsilon _{T}^{2}=o(1)$$, given $$\beta \geqslant 1$$. However, in case (i), if $$\lim\inf_{T}v_{T}\varepsilon _{T}^{2}>0$$, then when $$\beta \geqslant 3$$ the posterior mean has a bias of \begin{equation*} \varepsilon _{T}^{2}\left[ \frac{\nabla_{b}\pi(b_{0})}{3\pi(b_{0})\nabla _{\theta }b(\theta _{0})}-\frac{\nabla^{(2)}_{\theta}b(\theta _{0})}{2\{\nabla _{\theta }b(\theta _{0})\}^{2}}\right] +O(\varepsilon _{T}^{4})+o_{\rm p}(1/v_{T})\text{.} \end{equation*} 5.2. Comparison with existing results Li & Fearnhead (2018a) analysed the asymptotic properties of the posterior mean and functions thereof. Under the assumption of a central limit theorem for the summary statistic and further regularity assumptions on the convergence of the density of the summary statistics to this normal limit, including the existence of an Edgeworth expansion with exponential controls on the tails, Li & Fearnhead (2018a) demonstrated asymptotic normality, with no bias, of the posterior mean if $$\varepsilon_T = o (1/v_{T}^{3/5})$$. Heuristically, they derived this result using an approximation of the posterior density $$\pi_{\varepsilon }\{\theta \mid \eta ({y})\}$$, based on the Gaussian approximation of the density of $$\eta ({z})$$ given $$\theta$$ and using properties of the maximum likelihood estimator conditional on $$\eta ({y})$$. In contrast to our analysis, Li & Fearnhead (2018a) allowed the acceptance probability defining the algorithm to be an arbitrary density kernel in $$\|\eta(y)-\eta(z)\|$$. Consequently, their approach is more general than the accept/reject version in Theorem 3. However, the conditions required of $$\eta ({y})$$ in Li & Fearnhead (2018a) are stronger than ours. In particular, our results on asymptotic normality for the posterior mean require only weak convergence of $$v_{T}\{\eta ({z}) - {b}(\theta)\}$$ under $$P_\theta$$, with polynomial deviations that need not be uniform in $$\theta$$. These assumptions allow for the explicit treatment of models where the parameter space $$\Theta$$ is not compact. In addition, asymptotic normality of the posterior mean requires Assumption 8 only if $$\varepsilon_T = o(1/v_T)$$. Hence, if $$\varepsilon_T \gg v^{-1}_T$$, then only deviation bounds and weak convergence are required, which are much weaker than convergence of the densities. When $$\varepsilon_T = o(1/v_T)$$, Assumption 8 essentially implies local, in $$\theta$$, convergence of the density of $$v_{T}\{\eta ({z}) - {b}(\theta)\}$$, but imposes no requirement on the rate of this convergence. This assumption is weaker than the uniform convergence required in Li & Fearnhead (2018a). Our results also allow for an explicit representation of the bias that obtains for the posterior mean when $$\lim\inf_{T}v_{T}\varepsilon^{2}_T>0$$. 
In further contrast to Li & Fearnhead (2018a), Theorem 2 completely characterizes the asymptotic behaviour of the approximate posterior distribution for all $$\varepsilon _{T}=o(1)$$ that admit posterior concentration. This general characterization allows us to demonstrate, via Theorem 3(i), that asymptotic normality and unbiasedness of the posterior mean remain achievable even if $$\lim_{T}v_{T}\varepsilon _{T}=\infty$$, provided the tolerance satisfies $$\varepsilon _{T}=o(1/v_{T}^{1/2})$$. Li & Fearnhead (2018a) presented the interesting result that if $$k_\eta > k_\theta\geqslant1$$ and $$\varepsilon_{T}=o(1/v_T^{3/5})$$, the posterior mean is asymptotically normal, and unbiased, but is not asymptotically efficient. To help shed light on this phenomenon, the following result gives an alternative to Theorem 3.1 of Li & Fearnhead (2018a) and contains an explicit asymptotic expansion for the posterior mean when $$k_\eta > k_\theta\geqslant1$$. Theorem 4. Suppose that Assumptions 2, 4–7 and 10 hold. Assume that $$v_T\varepsilon_T \rightarrow \infty$$ and $$v_T\varepsilon_T^2 = o(1)$$. Assume also that $$b(\cdot)$$ and $$\pi(\cdot)$$ are Lipschitz in a neighbourhood of $$\theta _{0}$$. Then, for $$k_\eta > k_\theta\geqslant1$$, $$ E_{\Pi_{\varepsilon}}\{ v_T(\theta-\theta_0) \} = \{\nabla_\theta b(\theta_0)^{{\rm T}} \nabla_\theta b(\theta_0)\}^{-1} \nabla_\theta b(\theta_0)^{{\rm T}}v_{T}\{\eta(y)-b(\theta_0)\}+ o_{\rm p}(1)\text{.} $$ In addition, if $$\{\nabla_\theta b(\theta_0)^{{\rm T}} \nabla_\theta b(\theta_0)\}^{-1}\nabla_\theta b(\theta_0)^{{\rm T}} \neq \nabla_\theta b(\theta_0)^{{\rm T}}$$, the matrix $$ {\rm var}\big[\{\nabla_\theta b(\theta_0)^{{\rm T}}\nabla_\theta b(\theta_0)\}^{-1}\nabla_\theta b(\theta_0)^{{\rm T}}v_{T}\{\eta(y)-b(\theta_0)\}\big]- \{\nabla_\theta b(\theta_0)^{{\rm T}}V^{-1}(\theta_0)\nabla_\theta b(\theta_0)^{}\}^{-1} $$ is positive semidefinite, where $$\{\nabla_\theta b(\theta_0)^{{\rm T}}V^{-1}(\theta_0)\nabla_\theta b(\theta_0)^{}\}^{-1}$$ is the optimal asymptotic variance achievable given $$\eta(y)$$. A consequence of Theorem 4 is that, for a fixed choice of summaries, the two-stage procedure advocated by Fearnhead & Prangle (2012) will not reduce the asymptotic variance over a point estimator produced via Algorithm 1. However, this two-stage procedure does reduce the Monte Carlo error inherent in estimating the approximate posterior distribution $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ by reducing the dimension of the statistics on which the matching in approximate Bayesian computation is based. 6. Practical implications of the results 6.1. General implications The approximate Bayesian computation approach in Algorithm 1 is typically not used in practice. Instead, the acceptance step in Algorithm 1 is commonly replaced by the nearest-neighbour selection step and with $$d_{2}\{\eta(z),\eta(y)\}=\|\eta(z)-\eta(y)\|$$ (see, e.g., Biau et al., 2015) as follows. Step 3$$'$$. Select all $$\theta ^{i}$$ associated with the $$\alpha =\delta /N$$ smallest distances $$\|\eta(z)-\eta(y)\|$$ for some $$\delta$$. This nearest-neighbour version accepts draws of $$\theta$$ associated with an empirical quantile over the simulated distances $$\|\eta(z)-\eta(y)\|$$ and defines the acceptance probability for Algorithm 1. 
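A minimal sketch of Step 3$$'$$ follows (our own code and function names): the acceptance proportion $$\alpha$$ implicitly determines the tolerance as the $$\alpha$$ empirical quantile of the simulated distances.

```python
import numpy as np

def nearest_neighbour_select(thetas, etas_z, eta_y, alpha):
    """Step 3': keep the draws attached to the alpha * N smallest summary distances.

    thetas : (N, k_theta) array of prior draws.
    etas_z : (N, k_eta) array of summaries of the corresponding pseudo-data sets.
    eta_y  : (k_eta,) array of observed summaries.
    alpha  : acceptance proportion in (0, 1).
    Returns the accepted draws and the implied tolerance, i.e. the largest
    accepted distance, which plays the role of eps_T in Algorithm 1.
    """
    dists = np.linalg.norm(etas_z - eta_y, axis=1)
    n_keep = max(1, int(np.floor(alpha * len(thetas))))
    order = np.argsort(dists)[:n_keep]
    return thetas[order], dists[order[-1]]
```

Choosing $$\alpha$$ is thus equivalent to choosing the tolerance, and the corollary that follows describes how this implied tolerance, and hence the acceptance probability, scales with $$T$$.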
A key practical insight gained from our asymptotic results is that the acceptance probability, $$\alpha _{T}=\text{pr}\{\|\eta(z)-\eta(y)\|\leqslant \varepsilon _{T}\}$$, is affected only by the dimension of $$\theta$$, as formalized in the following corollary. Corollary 1. Under the conditions of Theorem 2, the following hold. (i) If $$\varepsilon _{T}\asymp v_{T}^{-1}$$ or $$\varepsilon_{T}=o(1/v_{T})$$, then the acceptance rate associated with the threshold $$\varepsilon _{T}$$ is \begin{equation*} \alpha _{T}={\rm pr}\big\{\|\eta(z)-\eta(y)\| \leqslant \varepsilon _{T}\big\} \asymp (v_{T}\varepsilon _{T})^{k_{\eta }}\times v_{T}^{-k_{\theta }}\lesssim v_{T}^{-k_{\theta}}\text{.} \end{equation*} (ii) If $$\varepsilon _{T}\gg v_{T}^{-1}$$ , then \begin{equation*} \alpha _{T}={\rm pr}\big\{\|\eta(z)-\eta(y)\|\leqslant \varepsilon _{T}\big\} \asymp \varepsilon _{T}^{k_{\theta }}\gg v_{T}^{-k_{\theta}}\text{.} \end{equation*} This shows that choosing a tolerance $$\varepsilon _{T}=o(1)$$ is equivalent to choosing an $$\alpha _{T}=o(1)$$ quantile of $$\|\eta(z)-\eta(y)\|$$. It also demonstrates the role played by the dimension of $$\theta$$ in determining the rate at which $$\alpha _{T}$$ declines to zero. In case (i), if $$\varepsilon _{T}\asymp v_{T}^{-1}$$, then $$\alpha _{T}\asymp v_{T}^{-k_{\theta }}$$. On the other hand, if $$\varepsilon _{T}=o(1/v_{T})$$ , as is required for the Bernstein–von Mises result in Theorem 2, the associated acceptance probability goes to zero at the faster rate, $$\alpha _{T}=o(1/v_{T}^{k_{\theta }})$$. In case (ii), where $$\varepsilon _{T}\gg v_{T}^{-1}$$, it follows that $$\alpha _{T}\gg v_{T}^{-k_{\theta }}$$. Linking $$\varepsilon _{T}$$ and $$\alpha _{T}$$ provides a way of choosing the $$\alpha _{T}$$ quantile of the simulations, or equivalently the tolerance $$\varepsilon _{T}$$, in such a way that a particular type of posterior behaviour occurs for large $$T$$: choosing $$\smash{\alpha _{T}\gtrsim v_{T}^{-k_{\theta}}}$$ gives an approximate posterior distribution that concentrates; under the more stringent condition of $$\smash{\alpha_{T}=o(1/v_{T}^{k_{\theta }})}$$, the approximate posterior distribution both concentrates and is approximately Gaussian in large samples. These results can help give practitioners an understanding of what to expect from this procedure, as well as a means of detecting potential issues if this expected behaviour is not in evidence. Moreover, given that there is no direct connection between $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ and the exact posterior distribution, these results provide some insight into the statistical properties that $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ should display when it is obtained from the popular nearest-neighbour version of the algorithm. Corollary 1 demonstrates that to obtain reasonable statistical behaviour, the rate at which $$\alpha _{T}$$ declines to zero must be faster the larger the dimension of $$\theta$$ is, with the order of $$\alpha _{T}$$ unaffected by the dimension of $$\eta$$. This result provides theoretical evidence of a curse-of-dimensionality encountered in these algorithms as the dimension of the parameters increases, and to our knowledge this is the first work linking the dimension of $$\theta$$ to certain asymptotic properties of $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$. 
This result provides theoretical justification for dimension reduction methods that process parameter dimensions individually and independently of the other dimensions; see, for example, the regression adjustment approaches of Beaumont et al. (2002), Blum (2010) and Fearnhead & Prangle (2012), as well as the integrated auxiliary likelihood approach of Martin et al. (2017). While Corollary 1 demonstrates that the order of $$\alpha_{T}$$ is unaffected by the dimension of the summaries, $$\alpha_{T}$$ cannot be accessed in practice and so the nearest-neighbour version of Algorithm 1 is implemented using a Monte Carlo approximation to $$\alpha_{T}$$, which is based on the accepted draws of $$\theta$$. This approximation of $$\alpha_{T}$$ is a Monte Carlo estimate of a conditional expectation and, as such, will be sensitive to the dimension of $${\eta}(\cdot)$$ for any fixed number of Monte Carlo draws $$N$$; see Biau et al. (2015) for further discussion of this point. In addition, it can also be shown that if $$\varepsilon_T$$ becomes much smaller than $$1/v_T$$, the dimension of $$\eta(\cdot)$$ will affect the behaviour of Monte Carlo estimators for this acceptance probability. Specifically, when considering inference on $$\theta_0$$ using the accept/reject approximate Bayesian computation algorithm, we require a sequence of Monte Carlo trials $$N_{T}\rightarrow\infty$$ as $$T\rightarrow\infty$$ that diverges faster the larger $$k_{\eta}$$, the dimension of $$\eta(\cdot)$$, is. Such a feature highlights the lack of efficiency of the accept/reject approach when the sample size is large or when the dimension of the summaries is large. However, we note here that more efficient sampling approaches exist and could be applied in these settings. For example, Li & Fearnhead (2018a) considered an importance sampling approach to approximate Bayesian computation that yields acceptance rates satisfying $$\alpha_{T}=O(1)$$, so long as $$\varepsilon_{T}=O(1/v_{T})$$. Therefore, in cases where the Monte Carlo error is likely to be large, these alternative sampling approaches should be employed. Regardless of whether one uses a more efficient sampling procedure than the simple accept/reject approach, Corollary 1 demonstrates that taking a tolerance sequence as small as possible will not necessarily yield more accurate results; that is, Corollary 1 questions the persistent opinion that the tolerance in Algorithm 1 should always be taken as small as the computing budget allows. Once $$\varepsilon _{T}$$ is chosen small enough to satisfy case (iii) of Theorem 2, which leads to the most stringent requirement on the tolerance, $$v_{T}\varepsilon_{T}=o(1)$$, there may well be no gain in pushing $$\varepsilon _{T}$$ or, equivalently $$\alpha_{T}$$, any closer to zero, especially since pushing $$\varepsilon_{T}$$ closer to zero can drastically increase the required computational burden. In the following subsection we numerically demonstrate this result in a simple example. In particular, we show that for a choice of tolerance $$\varepsilon_{T}$$ that admits a Bernstein–von Mises result, there is no gain in taking a tolerance that is smaller than this value, while the computational cost associated with such a choice, for a fixed level of Monte Carlo error, drastically increases. 6.2. Numerical illustration of quantile choice Consider the simple example where we observe a sample $$\{y_{t}\}_{t=1}^{T}$$ from $$y_{t}\sim \mathcal{N}(\mu ,\sigma )$$ with $$T=100$$. 
Our goal is posterior inference on $$\theta=(\mu ,\sigma)^{{\rm T}}$$. We use as summaries the sample mean and variance, $$\bar{x}$$ and $$s_{T}^{2}$$, which satisfy a central limit theorem at rate $${T}^{1/2}$$. In order to guarantee asymptotic normality of the approximate posterior distribution, we must choose an $$\alpha _{T}$$ quantile of the simulated distances according to $$\alpha _{T}=o(1/T)$$, because of the joint inference on $$\mu$$ and $$\sigma$$. For the purpose of this illustration, we will compare inference based on the nearest-neighbour version of Algorithm 1 using four different choices of $$\alpha _{T}$$, namely $$\alpha _{1}=1/T^{1{\cdot}1}$$, $$\alpha _{2}=1/T^{3/2}$$, $$\alpha _{3}=1/T^{2}$$ and $$\alpha _{4}=1/T^{5/2}$$. Draws for $$(\mu ,\sigma )$$ are simulated on $$[0{\cdot}5,1{\cdot}5]\times \lbrack 0{\cdot}5,1{\cdot}5]$$ according to independent uniform distributions $${\rm Un}[0{\cdot}5,1{\cdot}5]$$. The number of draws, $$N$$, is chosen so that we retain 250 accepted draws for each of the different choices ($$\alpha _{1},\dots,\alpha _{4}$$). The exact finite-sample marginal posterior densities of $$\mu$$ and $$\sigma$$ are produced by numerically evaluating the likelihood function, normalizing over the support of the prior, and marginalizing with respect to each parameter. Given the sufficiency of $$(\bar{x},\, s_{T}^{2})$$, the exact marginal posterior densities for $$\mu$$ and $$\sigma$$ are equal to those based directly on the summaries themselves. Thus, we are able to assess the impact of the choice of $$\alpha$$, per se, on the ability of the nearest-neighbour version of Algorithm 1 to replicate the exact marginal posteriors. We summarize the accuracy of the resulting approximate posterior density estimates, across the four quantile choices, using root mean squared error. In particular, over 50 simulated replications and for the parameter $$\mu$$, we estimate the root mean squared error between the marginal posterior density obtained from Algorithm 1 using $$\alpha _{j}$$, denoted by $$\hat{\pi}_{\alpha _{j}}\{\mu \mid{\eta }({y})\}$$, and the exact marginal posterior density, $$\pi(\mu \mid {y})$$, by \begin{equation} {\small{\text{RMSE}}}_{\mu }(\alpha _{j})=\left[\frac{1}{G}\sum_{g=1}^{G}\big\{ \hat{\pi} _{\alpha _{j}}^{g}\{\mu \mid {\eta }({y})\}-\pi^{g}(\mu \mid {y})\big\} ^{2}\right]^{1/2}\text{.} \label{RMSE_def} \end{equation} (8) The term $$\hat{\pi}^{g}_{\alpha_{j}}$$ is the ordinate of the density estimate from the nearest-neighbour version of Algorithm 1, and $$\pi^{g}$$ is the ordinate of the exact posterior density, at the $$g$$th grid point upon which the density is estimated. $${\small{\text{RMSE}}}_{\sigma }(\alpha _{j})$$ is computed analogously. The value of $${\small{\text{RMSE}}}_{\mu }(\alpha _{j})$$ is averaged over 50 replications to account for sampling variability. For each replication, we fix $$T=100$$ and generate observations using the parameter values $$\mu _{0}=1$$ and $$\sigma _{0}=1$$. Before presenting the replication results, it is instructive to consider the graphical output of one particular run of the algorithm for each of the $$\alpha _{j}$$ values. Figure 1 plots the resulting marginal posterior estimates and compares these with the exact finite-sample marginal posterior densities of $$\mu$$ and $$\sigma$$. 
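The sketch below (our own code, not the authors') reproduces the spirit of this experiment: nearest-neighbour approximate Bayesian computation for $$(\mu,\sigma)$$ with the sample mean and variance as summaries. The seed is arbitrary, we retain 100 rather than 250 draws to keep the run light, $$\sigma$$ is treated as the standard deviation, and only the two largest quantile choices are shown because the smaller ones require orders of magnitude more draws, as reported below.

```python
import numpy as np

rng = np.random.default_rng(3)
T, mu0, sigma0 = 100, 1.0, 1.0
y = rng.normal(mu0, sigma0, size=T)
eta_y = np.array([y.mean(), y.var(ddof=1)])        # summaries: sample mean and variance

def abc_normal(alpha, n_keep=100):
    """Nearest-neighbour ABC for (mu, sigma), uniform priors on [0.5, 1.5]."""
    N = int(np.ceil(n_keep / alpha))               # prior draws needed to retain n_keep
    mu = rng.uniform(0.5, 1.5, size=N)
    sigma = rng.uniform(0.5, 1.5, size=N)          # sigma used as the standard deviation
    z = rng.normal(mu[:, None], sigma[:, None], size=(N, T))
    eta_z = np.column_stack([z.mean(axis=1), z.var(axis=1, ddof=1)])
    dists = np.linalg.norm(eta_z - eta_y, axis=1)
    keep = np.argsort(dists)[:n_keep]
    return mu[keep], sigma[keep]

# The two largest quantile choices from the text; alpha_3 and alpha_4 follow the
# same pattern but require far larger N, as discussed after Table 1.
for alpha in (T ** -1.1, T ** -1.5):
    mu_post, sigma_post = abc_normal(alpha)
    print(f"alpha = {alpha:.2e}: posterior mean of mu = {mu_post.mean():.3f}, "
          f"of sigma = {sigma_post.mean():.3f}")
```

Comparing the retained draws with the exact marginal posteriors, as in Fig. 1 and Table 1 below, then quantifies the effect of the quantile choice.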
At the end of § 6.1 we argued that for large enough $$T$$, once $$\varepsilon _{T}$$ reaches a certain threshold, decreasing the tolerance further will not necessarily result in more accurate estimates of these exact posterior densities. This implication is evident in Fig. 1: in the case of $$\mu$$, there is a clear visual decline in the accuracy with which approximate Bayesian computation estimates the exact marginal posterior densities when choosing quantiles smaller than $$\alpha _{2}$$, while in the case of $$\sigma$$, the worst performing estimate is that associated with the smallest value of $$\alpha _{j}$$.

Figure 1. Comparison of exact and approximate posterior densities of (a) $$\mu$$ and (b) $$\sigma$$ for various tolerances: exact marginal posterior densities (—) and approximate Bayesian computation posterior densities based on $$\alpha_{1}=1/T^{1{\cdot}1}$$ ($$\cdots$$), $$\alpha_{2}=1/T^{3/2}$$ (- - -), $$\alpha_{3}=1/T^{2}$$ (— $$\cdot$$ —) and $$\alpha_{4}=1/T^{5/2}$$ (—*—).

Table 1 reports average root mean squared errors, relative to the average value associated with $$\alpha _{4}=1/T^{5/2}$$. Values smaller than 1 indicate that the larger, and less computationally burdensome, value of $$\alpha _{j}$$ yields a more accurate estimate than that obtained using $$\alpha_{4}$$. In brief, Table 1 reveals a similar picture to that of Fig. 1: for $$\sigma$$, the estimates based on $$\alpha _{j}$$ for $$j=1,2,3$$ are all more accurate than those based on $$\alpha _{4}$$; for $$\mu$$, estimates based on $$\alpha _{2}$$ and $$\alpha _{3}$$ are both more accurate than those based on $$\alpha _{4}$$.

Table 1. Ratio of the average root mean squared error for marginal approximate posterior density estimates to the average root mean squared error based on the smallest quantile, $$\alpha_{4}=1/T^{5/2}$$

                                                     $$\alpha _{1}=1/T^{1{\cdot}1}$$   $$\alpha _{2}=1/T^{3/2}$$   $$\alpha _{3}=1/T^{2}$$
$${\small{\text{AVG-RMSE}}}_{\mu }(\alpha _{j})$$        1·17                             0·99                        0·98
$${\small{\text{AVG-RMSE}}}_{\sigma }(\alpha _{j})$$     0·86                             0·87                        0·91

avg-rmse, the ratio of the average root mean squared errors as defined in (8).
These numerical results have important implications for implementation of approximate Bayesian computation. In particular, keeping the level of Monte Carlo error constant across the $$\alpha_{j}$$ quantile choices, as we have done in this simulation setting via the retention of 250 draws, requires taking $$N=210{\rm e}03$$ for $$\alpha _{1}$$, $$1{\cdot}4{\rm e}06$$ for $$\alpha _{2}$$, $$13{\cdot}5{\rm e}06$$ for $$\alpha _{3}$$, and $$41{\cdot}0{\rm e}06$$ for $$\alpha _{4}$$. In other words, the computational burden associated with decreasing the quantile in the manner indicated increases dramatically: approximate posterior densities based on $$\alpha _{4}$$, for example, require a value of $$N$$ that is three orders of magnitude greater than those based on $$\alpha _{1}$$, but this increase in computational burden yields no, or only minimal, gain in accuracy. The extension of such explorations to additional scenarios is beyond the scope of this paper; however, we speculate that, with due consideration given to the properties of both the true data-generating process and the chosen summary statistics, and hence to the sample sizes for which Theorem 2 has practical relevance, similar qualitative results will continue to hold.

Acknowledgement

This research was supported by the Australian Research Council and l'Institut Universitaire de France. Robert is also affiliated with the University of Warwick and Rousseau with the Université Paris Dauphine. Université Paris Dauphine is a PSL Research University. We are grateful to Wentao Li and Paul Fearnhead for spotting a mistake in a previous version of Theorem 3 and to the editorial team for its help.

Supplementary material

Supplementary Material available at Biometrika online includes proofs of all results and numerical examples that illustrate the theoretical results.

References

Beaumont, M. A., Zhang, W. & Balding, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics 162, 2025–35.
Bhattacharya, R. N. & Rao, R. R. (1986). Normal Approximation and Asymptotic Expansions. Philadelphia: SIAM.
Biau, G., Cérou, F. & Guyader, A. (2015). New insights into approximate Bayesian computation. Ann. Inst. Henri Poincaré Prob. Statist. 51, 376–403.
Blum, M. (2010). Approximate Bayesian computation: A non-parametric perspective. J. Am. Statist. Assoc. 105, 1178–87.
Creel, M., Gao, J., Hong, H. & Kristensen, D. (2015). Bayesian indirect inference and the ABC of GMM. arXiv: 1512.07385.
Creel, M. & Kristensen, D. (2015). ABC of SV: Limited information likelihood inference in stochastic volatility jump-diffusion models. J. Empir. Finan. 31, 85–108.
Drovandi, C. C., Pettitt, A. N. & Lee, A. (2015). Bayesian indirect inference using a parametric auxiliary model. Statist. Sci. 30, 72–95.
Fearnhead, P. & Prangle, D. (2012). Constructing summary statistics for approximate Bayesian computation: Semi-automatic approximate Bayesian computation (with Discussion). J. R. Statist. Soc. B 74, 419–74.
Gouriéroux, C., Monfort, A. & Renault, E. (1993). Indirect inference. J. Appl. Economet. 8, 85–118.
Li, W. & Fearnhead, P. (2018a). On the asymptotic efficiency of approximate Bayesian computation estimators. Biometrika 105, 285–99.
Li, W. & Fearnhead, P. (2018b). Convergence of regression adjusted approximate Bayesian computation. Biometrika 105, 301–18.
Marin, J.-M., Pillai, N., Robert, C. P. & Rousseau, J. (2014). Relevant statistics for Bayesian model choice. J. R. Statist. Soc. B 76, 833–59.
Marin, J.-M., Pudlo, P., Robert, C. P. & Ryder, R. (2011). Approximate Bayesian computation methods. Statist. Comp. 21, 289–91.
Martin, G. M., McCabe, B. P. M., Frazier, D. T., Maneesoonthorn, W. & Robert, C. P. (2017). Auxiliary likelihood-based approximate Bayesian computation in state space models. arXiv: 1604.07949v2.
Pritchard, J. K., Seilstad, M. T., Perez-Lezaun, A. & Feldman, M. W. (1999). Population growth of human Y chromosomes: A study of Y chromosome microsatellites. Molec. Biol. Evol. 16, 1791–8.
Tavaré, S., Balding, D. J., Griffiths, R. C. & Donnelly, P. (1997). Inferring coalescence times from DNA sequence data. Genetics 145, 505–18.

© 2018 Biometrika Trust

Given the general lack of sufficient statistics, we need to assess the behaviour of the approximate posterior distribution $$\Pi_{\varepsilon }\{\cdot\mid \eta(y)\}$$ and establish whether or not $$\Pi_{\varepsilon }\{\cdot\mid \eta(y)\}$$ behaves in a manner that is appropriate for statistical inference, with asymptotic theory being one obvious approach. Establishing the large-sample behaviour of $$\Pi_{\varepsilon }\{\cdot\mid \eta(y)\}$$, including point and interval estimates derived from this distribution, gives practitioners guarantees on the reliability of approximate Bayesian computation. Furthermore, these results enable us to propose guidelines for choosing the tolerance $$\varepsilon$$ so that $$\Pi_{\varepsilon }\{\cdot\mid \eta(y)\}$$ possesses desirable statistical properties. Before presenting our results, we define some notation used throughout the paper. Let $$\mathcal{B}\subset \mathbb{R}^{k_{\eta }}$$ denote the range of the simulated summaries $$\eta(z)$$. Let $$d_{1}(\cdot \,,\cdot )$$ be a metric on $${\Theta }$$ and $$d_{2}(\cdot\, ,\cdot)$$ a metric on $$\mathcal{B}$$. Take $$\Vert \cdot \Vert$$ to be the Euclidean norm. Throughout, $$C$$ denotes a generic positive constant. For real-valued sequences $$\{a_{T}\}_{T\geqslant 1}$$ and $$\{b_{T}\}_{T\geqslant 1}$$, $$\:a_{T}\lesssim b_{T}$$ means that $$a_{T}\leqslant Cb_{T}$$ for some finite $$C>0$$ and $$T$$ large, $$a_{T}\asymp b_{T}$$ means that $$a_{T}\lesssim b_{T} \lesssim a_{T}$$, and $$a_{T}\gg b_{T}$$ indicates a greater order of magnitude. For $$x_{T}$$ a random variable, $$x_{T}=o_{\rm p}(a_{T})$$ if $$\lim_{T\rightarrow \infty }\text{pr} (|x_{T}/a_{T}|\geqslant C)=0$$ for any $$C>0$$, and $$x_{T}=O_{\rm p}(a_{T})$$ if for any $$C>0$$ there exists a finite $$M>0$$ and a finite $$T$$ such that $$\text{pr}(|x_{t}/a_{t}|\geqslant M)\leqslant C$$ for all $$t>T$$. All limits are taken as $$T\rightarrow\infty$$. When no confusion is likely to result, we will write $$\lim_{T}$$ for $$\lim_{T\rightarrow\infty}$$. 3. Concentration of the approximate Bayesian computation posterior We assume throughout that the model is correctly specified: for some $$\theta_0$$ in the interior of $$\Theta$$, we have $$P_{{\theta}_0}=P_{0}$$. Asymptotic validity of any Bayesian procedure requires posterior concentration, which is often referred to as Bayesian consistency. In our context, this equates to the following posterior concentration property: for any $$\delta >0$$ and for some $$\varepsilon>0$$, \begin{align*} \Pi _{\varepsilon }\{d_{1}(\theta ,\theta _{0})>\delta\mid \eta (y)\}& =\Pi \big[ d_{1}(\theta ,\theta _{0})>\delta \mid d_{2}\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon \big] \\ & =\int_{d_{1}(\theta ,\theta _{0})>\delta }\pi_{\varepsilon }\{\theta \mid \eta (y)\}\,{\rm d}\theta =o_{\rm p}(1)\text{.} \end{align*} This property is paramount since, for any $$A\subset \Theta$$, $$\:\Pi[ A\mid d_{2}\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon]$$ will differ from the exact posterior probability. Without the guarantees of exact posterior inference, knowing that $$\Pi_{\varepsilon}\{\cdot\mid\eta(y)\}$$ will concentrate on $$\theta _{0}$$ justifies its use as a means of expressing our uncertainty about $$\theta$$. Posterior concentration is related to the rate at which information about $$\theta _{0}$$ accumulates in the sample. 
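As an informal numerical check of this concentration property, the following Python sketch, which is our illustration and not part of the formal analysis, applies the accept/reject selection of Algorithm 1 to a toy normal location model with $$\theta_{0}=1$$, the sample mean as summary and an illustrative shrinking tolerance, and estimates the approximate posterior mass outside a $$\delta$$-ball around $$\theta_{0}$$ as $$T$$ grows; the model, prior and tuning choices are assumptions made only for this example.

# Minimal sketch (illustration only): empirical check of posterior concentration
# for a toy normal location model with true value theta_0 = 1.
import numpy as np

rng = np.random.default_rng(0)
theta0, delta = 1.0, 0.25      # true parameter and radius of the ball around theta_0

def mass_outside_ball(T, eps, n_sims=200_000):
    y = rng.normal(theta0, 1.0, size=T)
    eta_y = y.mean()                                # summary eta(y): sample mean, v_T = T^{1/2}
    theta = rng.uniform(-10.0, 10.0, size=n_sims)   # Step 1: draws from a flat prior
    eta_z = rng.normal(theta, 1.0 / np.sqrt(T))     # Step 2: eta(z) | theta is exactly N(theta, 1/T) here
    keep = np.abs(eta_z - eta_y) <= eps             # Step 3: accept/reject with tolerance eps
    accepted = theta[keep]
    return np.mean(np.abs(accepted - theta0) > delta), keep.sum()

for T in (100, 1_000, 10_000):
    eps_T = T ** -0.4                               # illustrative tolerance with eps_T -> 0
    mass, n_acc = mass_outside_ball(T, eps_T)
    print(f"T = {T:6d}: eps_T = {eps_T:.4f}, mass outside ball = {mass:.3f} ({n_acc} draws kept)")

In runs of this kind the reported mass outside the ball should shrink towards zero as $$T$$ increases, which is the behaviour formalized in the results below.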
The amount of information Algorithm 1 provides depends on the rate at which the observed summaries $$\eta(y)$$ and the simulated summaries $$\eta(z)$$ converge to well-defined limit counterparts $${b}({\theta }_{0})$$ and $${b}({\theta })$$, as well as on the rate at which information about $$\theta _{0}$$ accumulates within the algorithm, governed by the rate at which $$\varepsilon$$ goes to $$0$$. To link the two factors, we consider $$\varepsilon$$ as a $$T$$-dependent sequence $$\varepsilon _{T}\rightarrow 0$$ as $${T\rightarrow \infty }$$. We can now state the technical assumptions used to establish our first result. These assumptions are applicable to a broad range of data structures, including weakly dependent data. Assumption 1. There exist a nonrandom map $${b}:{\Theta } \rightarrow \mathcal{B}$$ and a sequence of functions $$\rho _{T}(u)$$ that are monotone nonincreasing in $$u$$ for any $$T$$ and which satisfy $$\rho _{T}(u)\rightarrow 0$$ as $$T\rightarrow \infty$$. For fixed $$u$$ and for all $$\theta\in\Theta$$, \begin{equation*} P_{{\theta }}\big[ d_{2}\{{\eta }({z}),{b}({\theta })\}>u\big] \leqslant c({ \theta })\rho _{T}(u),\quad \int_{\Theta }c({\theta })\,{\rm d}\Pi ({\theta } )<\infty \end{equation*} under either of the following assumptions on $$c(\cdot )$$: (i) there exist $$c_{0}<\infty$$ and $$\delta >0$$ such that for all $${\theta }$$ satisfying $$d_{2}\{{b}({\theta } ),{b}({\theta}_{0})\}\leqslant \delta$$, $$c({\theta })\leqslant c_{0}$$; (ii) there exists $$a>0$$ such that $$\int_{\Theta }c({\theta })^{1+a}\,{\rm d}\Pi ({\theta })<\infty$$. Assumption 2. There exists some $$D>0$$ such that for all $$\xi >0$$ and some $$C>0$$, the prior probability satisfies $$\Pi [ d_{2}\{{b}({\theta }),{b}({\theta }_{0})\}\leqslant \xi ]\geqslant C \xi ^{D}$$. Assumption 3. (i) The map $${b}$$ is continuous. (ii) The map $${b}$$ is injective and satisfies $$\Vert {\theta }-{\theta }_{0}\Vert \leqslant L\Vert {b}({\theta })-{b}({\theta } _{0})\Vert ^{\alpha }$$ in some open neighbourhood of $${\theta }_{0}$$ with $$L>0$$ and $$\alpha >0$$. Remark 1. The convergence of $$\eta(z)$$ to $$b(\theta)$$ in Assumption 1 is the key to posterior concentration, and without it or a similar assumption, Bayesian consistency will not occur. The function $$\rho_{T}(u)$$ in Assumption 1 typically takes the form $$\rho_{T}(u)=\rho_{}(uv_{T})$$ for $$v_{T}$$ a sequence such that $$d_{2}\{\eta(z),b(\theta)\}=O_{\rm p}(1/v_{T})$$ and where $$\rho(u v_{T})$$ controls the tail behaviour of $$d_{2}\{\eta(z),b(\theta)\}$$. The specific structure of $$\rho(uv_{T})$$ will depend on what is assumed about the properties of the underlying summaries $$\eta(z)$$. In most cases, $$\rho(uv_{T})$$ will have either a polynomial or an exponential structure in $$uv_{T}$$, and thus satisfy one of the following rates. (a) Polynomial: there exist a diverging positive sequence $$\{v_{T}\}_{T\ge 1}$$ and $$u_{0},\kappa>0$$ such that \begin{equation} P_{\theta}\big[d_{2}\{\eta(z),b(\theta)\}>u\big]\leqslant c(\theta)\rho_{T}(u),\quad\rho _{T}(u)=1/(uv_{T})^{\kappa },\quad u\leqslant u_{0}, \label{rho_poly} \end{equation} (1) where for some $$c_0>0$$ and $$\delta>0$$, $$\int_{\Theta}c(\theta)\,{\rm d}\Pi(\theta)<\infty$$ and if $$d_{2}\{b(\theta),b(\theta_0)\}\leqslant\delta$$ then $$c(\theta)\leqslant c_0$$. 
(b) Exponential: there exist $$h_{\theta}(\cdot)>0$$ and $$u_0>0$$ such that \begin{equation} P_{\theta}\big[d_{2}\{\eta(z),b(\theta)\}>u\big]\leqslant c(\theta)\rho_{T}(u),\quad\rho _{T}(u)=\exp\{-h_{{\theta }}(uv_{T})\},\quad u\leqslant u_{0}, \label{rho_exp} \end{equation} (2) where for some $$c,C>0$$, $$\:\int_{\Theta} c(\theta)\exp\{-h_{\theta}(u v_{T})\}\,{\rm d}\Pi(\theta)\leqslant C\exp\{-c(u v_{T})^{\tau}\}$$. To illustrate these cases for $$\rho_{T}(\cdot)$$, consider the summary statistic $$\eta ({z})=T^{-1}\sum_{i=1}^{T}g(z_{i})$$, where for simplicity we consider $$\{g(z_{i})\}_{i\leqslant T}$$ to be independent and identically distributed, and $$b({\theta })=E_{{\theta }}\{g(Z)\}$$. If $$g(z_{i})-b(\theta)$$ has a finite moment of order $$\kappa$$, then $$\rho_{T}(u)$$ will satisfy (1): from Markov’s inequality, \begin{equation*} P_{\theta}\big\{\Vert {\eta }({z})-b({\theta })\Vert >u\big\}\leqslant { C_{}E_{{\theta }}\{ | g(Z)|^{\kappa }\} }/{({uT^{1/2}})^{\kappa } }\text{.} \end{equation*} With reference to (1), $$\rho_{T}(u)=1/(uv_{T})^{\kappa}$$, $$v_{T}=T^{1/2}$$ and $$c(\theta)=C\mathbb{E}_{\theta}\{|g(Z)|^{\kappa}\}<\infty$$. If the map $${\theta }\mapsto E_{{\theta }}\{ |g(Z)|^{\kappa }\}$$ is continuous at $${\theta }_{0}$$ and positive, Assumption 1 is satisfied. If $$\{g(z_i)-b(\theta)\}$$ has a finite exponential moment, $$\rho_{T}(u)$$ will satisfy (2): from a version of the Bernstein inequality, \begin{equation*} \begin{split} P_{\theta}\big\{ \Vert {\eta }({z})-b({\theta })\Vert >u\big\} &\leqslant \exp\bigl[-u^{2}T/\{2c(\theta )\}\bigr]\text{.} \end{split} \end{equation*} With reference to (2), $$\rho_{T}(u)=\exp\{-h_{\theta}(uv_{T})\}$$, $$h_{{\theta }}(uv_{T})=u^{2}v_{T}^{2}/\{2c(\theta )\}$$ and $$v_{T}=T^{1/2}$$. If the map $${\theta }\mapsto c(\theta )$$ is continuous at $${\theta }_{0}$$ and positive, Assumption 1 is satisfied. Remark 2. Assumption 2 controls the degree of prior mass in a neighbourhood of $${\theta }_{0}$$ and is standard in Bayesian asymptotics. For $$\xi$$ small, the larger $$D$$ is, the smaller is the amount of prior mass near $$\theta _{0}$$. If the prior measure $$\Pi(\theta)$$ is absolutely continuous with prior density $$\pi(\theta)$$ and if $$\pi$$ is bounded, above and below, near $$\theta _{0}$$, then $$D=\dim (\theta )=k_{\theta }$$. Assumption 3 is an identification condition that is critical for obtaining posterior concentration around $${ \theta }_{0}$$. Injectivity of $$b$$ depends on both the true structural model and the particular choice of $${\eta }$$. Without this identification condition, posterior concentration at $$\theta_0$$ cannot occur. Theorem 1. If Assumptions 1 and 2 are satisfied, then for $$M$$ large enough, as $$T\rightarrow\infty$$ and $$\varepsilon_T=o(1)$$, with $$P_{0}$$ probability going to $$1$$, \begin{equation} \Pi \big[ d_{2}\{{b}({\theta }),{b}({\theta }_{0})\}>\lambda_{T}\;\big|\; d_{2}\{{\eta }({y}),{\eta }({z} )\}\leqslant \varepsilon _{T}\big] \lesssim 1/M, \label{i} \end{equation} (3)with $$\lambda_{T}=4\varepsilon_{T}/3 + \rho_{T}^{-1}(\varepsilon_{T}^{D}/M)$$. 
Moreover, if Assumption 3 also holds, then as $$T\rightarrow\infty$$, \begin{equation} \Pi \big[ d_{1}({\theta },{\theta }_{0})>L\lambda_{T}^{\alpha }\;\big|\; d_{2}\{{\eta }({y}),{\eta }({z} )\}\leqslant \varepsilon _{T}\big] \lesssim 1/M\text{.} \label{ii} \end{equation} (4) Since equations (3) and (4) hold for any $$M$$ large enough, we can conclude that the posterior distribution behaves like an $$o_{\rm p}(1)$$ random variable on sets that do not include $$\theta_0$$, and Bayesian consistency of $$\Pi_{\varepsilon}\{\cdot \mid \eta(y)\}$$ follows. More generally, (3) and (4) give a posterior concentration rate, denoted by $$\lambda_{T}$$ in Theorem 1, that depends on $$\varepsilon_{T}$$ and on the underlying behaviour of $$\eta(z)$$, as described by $$\rho_{T}(u)$$. We must consider the nature of this concentration rate in order to understand which choices for $$\varepsilon_{T}$$ are appropriate under different assumptions on the summary statistics. As mentioned above, the deviation control function $$\rho_{T}(u)$$ will often be of a polynomial form, (1), or an exponential form, (2). Under these two assumptions, $$\rho_{T}(u)$$ has an explicit representation and the concentration rate $$\lambda_{T}$$ can be obtained by solving the equation $$\lambda_{T}= 4\varepsilon_{T}/3 + \rho_{T}^{-1}(\varepsilon_{T}^{D}/M)\text{.}$$ Case (a): polynomial. From (1), the deviation control function is $$\rho_{T}(u)=1/(uv_{T})^{\kappa}$$. To obtain the posterior concentration rate, we invert $$\rho_{T}(u)$$ to obtain $$\rho _{T}^{-1}(\varepsilon _{T}^{D})=1/(\varepsilon _{T}^{D/\kappa }v_{T})$$, and then equate $$\varepsilon _{T}$$ and $$\rho _{T}^{-1}(\varepsilon _{T}^{D})$$ to obtain $$\varepsilon _{T}\asymp v_{T}^{-\kappa /(\kappa +D)}$$. This choice of $$\varepsilon_{T}$$ implies concentration of the approximate posterior distribution at the rate \begin{equation*} \lambda _{T}\asymp v_{T}^{-\kappa /(\kappa +D)}\text{.} \end{equation*} Case (b): exponential. If the summary statistics admit an exponential moment, a faster rate of posterior concentration is obtained. From (2), $$\rho _{T}(u)=\exp\{-h_{{\theta }}(uv_{T})\}$$ and there exist finite $$u_{0},c,C>0$$ such that \begin{equation*} \int_{\Theta }c({\theta })\exp\{-h_{\theta }(uv_{T})\}\,{\rm d}\Pi ({\theta })\leqslant C\exp\{-c(uv_{T})^{\tau }\},\quad u\leqslant u_{0}\text{.} \end{equation*} Hence, if $$c({\theta })$$ is bounded from above and if $$h_{{\theta }}(u)\ge u^{\tau }$$ for $${\theta }$$ in a neighbourhood of $$\theta_0$$, then $$\rho _{T}(u)\asymp \exp\{-c_{0}(uv_{T})^{\tau }\}$$; thus, $$\rho _{T}^{-1}(\varepsilon _{T}^{D})\asymp (-\log \varepsilon _{T})^{1/\tau }/v_{T}$$. Following arguments similar to those used in case (a), if we take $$\varepsilon _{T}\asymp (\log v_{T})^{1/\tau }/v_{T}$$, the approximate posterior distribution concentrates at the rate \begin{equation*} \lambda _{T}\asymp {(\log v_{T})^{1/\tau }}/{v_{T}}\text{.} \end{equation*} Example 1. We now illustrate the conditions of Theorem 1 in a moving average model of order two, $$ y_{t}=e_{t}+\theta _{1}e_{t-1}+\theta _{2}e_{t-2}\quad (t=1,\dots,T), $$ where $$\{e_{t}\}_{t=1}^{T}$$ is a sequence of white noise random variables such that $$E(e_{t}^{4+\delta })<\infty$$ for some $$\delta >0$$. 
Our prior for $$\theta=(\theta _{1},\theta _{2})^{{\rm T}}$$ is uniform over the invertibility region \begin{equation} -2\leqslant \theta _{1}\leqslant 2,\quad \theta _{1}+\theta _{2}\geqslant -1,\quad \theta _{1}-\theta _{2}\leqslant 1\text{.} \label{const1} \end{equation} (5) Following Marin et al. (2011), we choose as summary statistics for Algorithm 1 the sample autocovariances $$\eta _{j}({y})=T^{-1}\sum_{t=1+j}^{T}y_{t}y_{t-j}$$ ($$j=0,1,2$$). For this choice, the $$j$$th component of $$b(\theta )$$ is $$b_{j}(\theta )=E_{\theta }(z_{t}z_{t-j})$$. Now take $$d_{2}\{\eta (z),b(\theta )\}=\Vert \eta (z)-b(\theta )\Vert$$. Under the moment condition for $$e_{t}$$ above, it follows that $$V(\theta )=E[\{\eta (z)-b(\theta )\}\{\eta (z)-b(\theta )\}^{{\rm T}}]$$ satisfies $$\text{tr}\{V(\theta )\}<\infty$$ for all $$\theta$$ in (5). By an application of Markov’s inequality, we can conclude that \begin{align*} P_{\theta}\big\{\|\eta(z)-b(\theta)\|>u\big\}=P_{\theta}\big\{\|\eta(z)-b(\theta)\|^{2}>u^2\big\}&\leqslant \frac{\text{tr}\{V(\theta)\}}{u^2 T}+o(1/T), \end{align*} where the $$o(1/T)$$ term comes from the fact that there are finitely many nonzero covariance terms due to the $$m$$-dependence of the series, and Assumption 1 is satisfied. Given the structure of $$b(\theta )$$, the uniform prior $$\pi(\theta )$$ over (5) satisfies Assumption 2. Furthermore, $$\theta\mapsto b(\theta)=\{1+\theta _{1}^{2}+\theta _{2}^{2},(1+\theta _{2})\theta _{1},\theta _{2}\}^{{\rm T}}$$ is injective and satisfies Assumption 3. As noted in Remark 2, the injectivity of $$\theta\mapsto b(\theta)$$ is required for posterior concentration, and without it there is no guarantee that the posterior will concentrate on $$\theta_0$$. Since the sufficient conditions for Theorem 1 are satisfied, approximate Bayesian computation based on this choice of statistics will yield an approximate posterior density that concentrates on $$\theta_0$$. Theorem 1 can also be visualized by fixing a particular value of $$\theta$$, say $$\tilde{\theta}$$, generating sets of observed data $$\tilde{y}$$ of increasing length, and then running Algorithm 1 on these datasets. If the conditions of Theorem 1 are satisfied, the approximate posterior density will become increasingly peaked at $$\tilde{\theta}$$ as $$T$$ increases. Using Example 1, we demonstrate this behaviour in the Supplementary Material. 4. Shape of the asymptotic posterior distribution 4.1. Assumptions and theorem While posterior concentration states that $$\Pi [ d_{1}(\theta ,\theta _{0})>\delta \mid d_{2}\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon _{T}] =o_{\rm p}(1)$$ for an appropriate choice of $$\varepsilon _{T}$$, it does not indicate precisely how this mass accumulates, or the approximate amount of posterior probability within any neighbourhood of $$\theta _{0}$$. This information is needed to obtain accurate expressions of uncertainty about point estimators of $$\theta _{0}$$ and to ensure that credible regions have proper frequentist coverage. To this end, we now analyse the limiting shape of $$\Pi [ \cdot \mid d_{2}\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon _{T} ]$$ for various relationships between $$\varepsilon _{T}$$ and the rate at which summary statistics satisfy a central limit theorem. In this and the following sections, we denote $$\Pi [ \cdot\mid d_{2}\{{\eta }({y}),{\eta }({z})\}\leqslant \varepsilon _{T}]$$ by $$\Pi _{\varepsilon }\{\cdot \mid {\eta }(y)\}$$. Let $$\Vert \cdot \Vert _{\ast }$$ denote the spectral norm. 
In addition to Assumption 2, the following conditions are needed to establish the results of this section. Assumption 4. Assumption 1 holds. Moreover, there exist a sequence of positive-definite matrices $$\{\Sigma_{T}({\theta}_{0})\}_{T\ge 1}$$ and $$c_0>0$$, $$\kappa >1$$ and $$\delta >0$$ such that for all $$\Vert {\theta }-{\theta }_{0}\Vert \leqslant \delta$$, $$\:P_{{\theta }}[ \Vert \Sigma _{T}({\theta }_{0})\{{\eta }({z})-{b}({\theta })\}\Vert >u] \leqslant {c_{0} }{u^{-\kappa }}$$ for all $$0<u\leqslant \delta \|\Sigma_T(\theta_0)\|_*$$, uniformly in $$T$$. Assumption 5. Assumption 3 holds. Moreover, the map $$\theta\mapsto{b}(\theta)$$ is continuously differentiable at $${\theta _{0}}$$ and the Jacobian $$\nabla _{\theta }b({\theta }_{0})$$ has full column rank $$k_{\theta }$$. Assumption 6. The value $$\theta_0$$ is in the interior of $$\Theta$$. For some $$\delta >0$$ and for all $$\Vert {\theta }-{\theta }_{0}\Vert \leqslant \delta$$, there exists a sequence of positive-definite $$k_{\eta }\times k_{\eta }$$ matrices $$\{\Sigma_{T}({\theta })\}_{T\ge 1}$$, with $$k_{\eta }=\dim \{\eta (z)\}$$, such that for all open sets $$B$$, \begin{equation*} \sup_{\Vert\theta - \theta_0\Vert\leqslant \delta} \Big| P_\theta\big[ {\Sigma }_{T}({\theta })\{{\eta }({z})-{b}({\theta })\} \in B\big] - P\{ \mathcal{N}(0,I_{k_{\eta }}) \in B\} \Big| \rightarrow 0 \end{equation*} as $$T\rightarrow\infty$$, where $$I_{k_{\eta }}$$ is the $$k_{\eta }\times k_{\eta }$$ identity matrix. Assumption 7. There exists $$v_T\rightarrow\infty$$ such that for all $$\Vert \theta -\theta _{0}\Vert \leqslant \delta$$, the sequence of functions $${\theta }\mapsto {\Sigma }_{T}({\theta })v_{T}^{-1}$$ converges to some positive-definite $$A(\theta )$$ and is equicontinuous at $${\theta }_{0}$$. Assumption 8. For some $$\delta>0$$, all $$\Vert {\theta }-{\theta }_{0}\Vert \leqslant \delta$$, all ellipsoids $$B_{T}=\{ (t_{1},\dots ,t_{k_{\eta }}):\sum_{j=1}^{k_{\eta }}t_{j}^{2}/h_{T}^{2}\leqslant 1 \}$$ and all $$u\in \mathbb{R}^{k_{\eta }}$$ fixed, for all $$h_{T}\rightarrow 0$$, as $$T\rightarrow \infty$$, \begin{equation*} \begin{split} & \lim_{T} \sup_{\Vert\theta - \theta_0\Vert\leqslant \delta } \Big| h_{T}^{-k_{\eta }} P_{\theta }\big[ {\Sigma }_{T}({\theta })\{{\eta }({z})-{b}({\theta })\}-u\in B_{T}\big] - \varphi _{k_{\eta}}(u) \Big| = 0 , \\ & {h_{T}^{-k_{\eta }}}\,{P_{{\theta }}\big[ {\Sigma }_{T}({\theta })\{{\eta }({z})-{b}({\theta })\}-u\in B_{T}\big] } \leqslant H(u),\quad \int H(u)\,{\rm d} u<\infty , \end{split} \end{equation*} for $$\varphi _{k_{\eta }}(\cdot )$$ the density of a $$k_{\eta }$$-dimensional normal random variate. Remark 3. Assumption 4 is similar to Assumption 1 but for $$\Sigma _{T}(\theta _{0})\{\eta (z)-b(\theta )\}$$. Assumption 6 is a central limit theorem for $$\{\eta (z)-b(\theta )\}$$ and, as such, requires the existence of a positive-definite matrix $$\Sigma _{T}(\theta )$$. In simple cases, such as independent and identically distributed data with $$\eta(z)=T^{-1}\sum_{i=1}^{T}g(z_{i})$$, $$\:\Sigma _{T}(\theta )=v_{T}A_{T}(\theta )$$ with $$A_{T}(\theta)=A(\theta)+o_{\rm p}(1)$$ and $$V(\theta )=E[\{g(Z)-b(\theta )\}\{g(Z)-b(\theta )\}^{{\rm T}}]=\{A(\theta)^{{\rm T}}A(\theta)\}^{-1}$$. 
Assumptions 5 and 8 ensure that $$\theta \mapsto b(\theta )$$ and the covariance matrix of $$\{\eta (z)-b(\theta )\}$$ are well behaved, which allows the posterior behaviour of a normalized version of $$(\theta -\theta _{0})$$ to be governed by that of $$\Sigma _{T}(\theta _{0})\{b(\theta )-b(\theta _{0})\}$$. Assumption 8 concerns the pointwise convergence of a normalized version of the measure $$P_{\theta }$$, therein dominated by $$H(u)$$, and allows application of the dominated convergence theorem in case (iii) of the following result. Theorem 2. Under Assumptions 2 and 4–7, with $$\kappa >k_{\theta }$$, the following hold with probability going to $$1$$. (i) If $$\lim_{T}v_{T}\varepsilon _{T}=\infty$$, the posterior distribution of $$\varepsilon_{T}^{-1}({\theta }-{\theta }_{0})$$ converges to the uniform distribution over the ellipsoid $$\{w:w^{{\rm T}}B_{0}w\leqslant 1\}$$ with $$B_{0}=\nabla _{\theta}\,b({\theta _{0}})^{{\rm T}}\nabla _{\theta }b({\theta _{0}})$$, meaning that for $$f(\cdot)$$ continuous and bounded, $$ \int f\{\varepsilon _{T}^{-1}({\theta }-{\theta }_{0})\}\,{\rm d}\Pi _{\varepsilon }\{ {\theta }\mid {\eta(y) }\}\overset{}{ \rightarrow }{\int_{u^{{\rm T} }B_{0}u\leqslant 1}f(u)\,{\rm d} u}\Big/{\int_{u^{{\rm T}}B_{0}u\leqslant 1}\,{\rm d} u} $$ as $$T\rightarrow\infty$$. (ii) If $$\lim_{T}v_{T}\varepsilon _{T}=c>0$$, there exists a non-Gaussian distribution $$Q_{c}$$ on $$\mathbb{R}^{k_{\eta }}$$ such that $$ \Pi _{\varepsilon }\big[ {\Sigma }_{T}({\theta }_{0})\nabla_{\theta}b(\theta_0)({\theta }- \theta_{0})-{\Sigma }_{T}({\theta }_{0})\{\eta(y)-b(\theta_0)\}\in B\;\big|\; {\eta }(y)\big] \rightarrow Q_{c}(B) $$ as $$T\rightarrow\infty$$. In particular, $$Q_{c}(B) \propto \int_{B}\int_{\mathbb{R}^{k_{\eta}} }{\rm 1}\kern-0.24em{\rm I} \{(z-x)^{{\rm T}}A(\theta_0)^{{\rm T}}A(\theta_0)(z-x)\leqslant c\}{\varphi}_{k_\eta}(z) \,{\rm d} z\,{\rm d} x\text{.}$$ (iii) If $$\lim_{T}v_{T}\varepsilon _{T}=0$$ and Assumption 8 holds, then $$ \Pi _{\varepsilon }\big[ {\Sigma }_{T}({\theta }_{0})\nabla_{\theta}b(\theta_0)({\theta }- \theta_{0})-\Sigma_{T}({\theta }_{0})\{\eta(y)-b(\theta_0)\}\in B\;\big|\;{\eta }(y)\big] \rightarrow\int_{B}\varphi_{k_{\eta}}(x)\,{\rm d} x $$ as $$T\rightarrow\infty$$. Remark 4. Theorem 2 generalizes to the case where the components of $$\eta (z)$$ have different rates of convergence. The statement and proof of this more general result are deferred to the Supplementary Material. Furthermore, as with Theorem 1, the behaviour of $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ described by Theorem 2 can be visualized. This is demonstrated in the Supplementary Material for the case of Example 1. Formal verification of the conditions underpinning Theorem 2 is quite challenging, even in this case. Numerical results nevertheless suggest that for this particular choice of model and summaries a Bernstein–von Mises result holds, conditional on $$\varepsilon_{T}=o(1/v_{T})$$ with $$v_{T}={T}^{1/2}$$. 4.2. Discussion of the result Theorem 2 asserts that the crucial feature in determining the limiting shape of $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ is the behaviour of $$v_{T}\varepsilon _{T}$$. An implication of Theorem 2 is that only in the regime where $$\lim_{T} v_{T}\varepsilon_{T}=0$$ will $$100(1-\alpha)\%$$ Bayesian credible regions calculated from $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ have frequentist coverage of $$100(1-\alpha)\%$$. 
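The role of $$v_{T}\varepsilon _{T}$$ in this coverage statement can be illustrated numerically. The Python sketch below, again our own construction for a toy normal location model rather than an analysis from the paper, compares the estimated frequentist coverage of 95% credible intervals built from the accepted draws when the tolerance is chosen so that $$v_{T}\varepsilon _{T}$$ diverges, converges to a positive constant, or vanishes; the specific tolerance sequences, prior and sample size are assumptions made for the illustration.

# Minimal sketch (illustration only): coverage of 95% credible intervals under
# three tolerance regimes for a toy normal location model, v_T = T^{1/2}.
import numpy as np

rng = np.random.default_rng(1)
theta0, T, n_reps, n_sims = 1.0, 500, 200, 100_000
v_T = np.sqrt(T)
regimes = {"v_T * eps_T -> infinity": T ** -0.25,
           "v_T * eps_T -> c > 0   ": 1.0 / v_T,
           "v_T * eps_T -> 0       ": T ** -0.75}

for label, eps in regimes.items():
    covered = 0
    for _ in range(n_reps):
        eta_y = rng.normal(theta0, 1.0 / v_T)          # observed summary (sample mean)
        theta = rng.uniform(0.0, 2.0, size=n_sims)     # prior draws
        eta_z = rng.normal(theta, 1.0 / v_T)           # simulated summaries
        acc = theta[np.abs(eta_z - eta_y) <= eps]      # accept/reject step
        lo, hi = np.quantile(acc, [0.025, 0.975])      # 95% credible interval
        covered += (lo <= theta0 <= hi)
    print(f"{label}: eps_T = {eps:.4f}, estimated coverage = {covered / n_reps:.2f}")

In runs of this kind the first two regimes tend to over-cover, and coverage is close to the nominal 95% only when $$v_{T}\varepsilon _{T}\rightarrow 0$$, in line with the discussion that follows.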
If $$\lim_{T}v_{T}\varepsilon _{T}=c>0$$ for $$c$$ finite, $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ is not asymptotically Gaussian and credible regions will have incorrect magnitude, i.e., the coverage will not be at the nominal level. If $$\lim_T v_{T}\varepsilon_{T}=\infty$$, i.e., $$\varepsilon _{T}\gg v_{T}^{-1}$$, credible regions will have coverage that converges to 100%. In case (i), which corresponds to a large tolerance $$\varepsilon _{T}$$, the approximate posterior distribution has nonstandard asymptotic behaviour. In this case $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ behaves like the prior distribution over $$\{\theta:\Vert \nabla _{\theta }b({\theta }_{0})({\theta }-{\theta }_{0})\Vert \leqslant \varepsilon _{T}\{1+o_{\rm p}(1)\}\}$$, which, by prior continuity, implies that $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ is equivalent to a uniform distribution over this set. Li & Fearnhead (2018a) also established this behaviour, and observed that asymptotically $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ behaves like a convolution of a Gaussian distribution with variance of order $$1/v_T^2$$ and a uniform distribution over a ball of radius $$\varepsilon_T$$, where, depending on the order $$v_T\varepsilon_T$$, one distribution will dominate. Assumption 8 applies to random variables $${\eta }({z})$$ that are absolutely continuous with respect to Lebesgue measure or, in the case of sums of random variables, to sums that are nonlattice; see Bhattacharya & Rao (1986). For discrete $${\eta }({z})$$, Assumption 8 must be adapted for Theorem 2 to be satisfied. One such modification is the following. Assumption 9. There exist $$\delta >0$$ and a countable set $$E_{T}$$ such that for all $$\Vert {\theta }-{\theta }_{0}\Vert <\delta$$, for all $$x\in E_{T}$$ such that $$\text{pr}\{ {\eta }({z})={x}\} >0$$, $$\:\text{pr}\{ {\eta }({z})\in E_{T}\} =1$$ and \begin{equation*} \sup_{\Vert {\theta }-{\theta }_{0}\Vert \leqslant \delta }\sum_{x\in E_{T}}\Big\vert p[{\Sigma }_{T}({\theta })\{{x}-{b}({\theta })\}\mid {\theta } ]-v_{T}^{-k_{\eta}} | A(\theta_0) |^{-1/2}\varphi_{k_{\eta}} \lbrack {\Sigma }_{T}({\theta })\{{x }-{b}({\theta })\}]\Big\vert =o(1)\text{.} \end{equation*} This is satisfied when $${\eta }({z})$$ is a sum of independent lattice random variables, as in the population genetics experiment detailed in Marin et al. (2014, § 3.3), which compares evolution scenarios of separated populations from a most recent common ancestor. Furthermore, this example satisfies Assumptions 2 and 4–7. Thus the conclusions of both Theorem 1 and Theorem 2 apply to this example. 5. Asymptotic distribution of the posterior mean 5.1. Main result The literature on the asymptotics of approximate Bayesian computation has so far focused primarily on asymptotic normality of the posterior mean. The posterior normality result in Theorem 2 is not weaker, nor stronger, than the asymptotic normality of an approximate point estimator, as the results concern different objects. However, existing proofs of asymptotic normality of the posterior mean all require asymptotic normality of the posterior distribution $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$. We demonstrate that this is not a necessary condition. For clarity we focus on the case of a scalar parameter $$\theta$$ and a scalar summary $$\eta ({y})$$, i.e., $$k_{\theta}=k_{\eta}=1$$, but present an extension to the multivariate case in § 5.2. This result requires a further assumption on the prior in addition to Assumption 2. Assumption 10. 
The prior density $$\pi(\theta)$$ is such that: (i) for $$\theta _{0}$$ in the interior of $${\Theta }$$, $$\:\pi(\theta _{0})>0$$; (ii) the density function $$\pi(\theta)$$ is $$\beta$$-Hölder in a neighbourhood of $$\theta _{0}$$, i.e., there exist $$\delta ,L>0$$ such that for all $$|\theta -\theta _{0}|\leqslant \delta$$, $$\:\nabla_{\theta}^{(j)}\pi(\theta _{0})$$, the $$j$$th derivative of $$\pi(\theta_0)$$, satisfies \begin{equation*} \left\vert \pi(\theta )-\sum_{j=0}^{\lfloor \beta\rfloor }(\theta -\theta _{0})^{j}\frac{\nabla_{\theta}^{(j)}\pi(\theta _{0})}{j!}\right\vert \leqslant L|\theta -\theta _{0}|^{\beta }; \end{equation*} (iii) for $$\Theta \subset \mathbb{R}$$, $$\int_{\Theta }|\theta |^{\beta }\pi(\theta )\,{\rm d}\theta <\infty$$. Theorem 3. Suppose that Assumptions 2, 4–7 and 10 with $$\kappa >\beta +1$$ hold. Furthermore, let $$\theta\mapsto b(\theta)$$ be $$\beta$$-Hölder in a neighbourhood of $$\theta_{0}$$. Denoting by $$E_{\Pi _{\varepsilon }}(\theta )$$ the posterior mean of $$\theta$$, the following characterization holds with probability going to $$1$$. (i) If $$\lim_{T}v_{T}\varepsilon _{T}=\infty$$ and $$v_{T}\varepsilon _{T}^{2\wedge (1+\beta )}=o(1)$$, then \begin{equation}E_{\Pi _{\varepsilon }}\{ v_{T}(\theta -\theta _{0})\} \rightarrow \mathcal{N}\big[0,V(\theta _{0})/\{\nabla _{\theta }b(\theta _{0})\}^{2}\big] \label{normality:postmean} \end{equation} (6)in distribution as $$T\rightarrow\infty$$, where $$V(\theta _{0})=\lim_{T}{\rm var}[v_{T}\{\eta ({y})-b(\theta _{0})\}]$$. (ii) If $$\lim_{T}v_{T}\varepsilon _{T}=c\geqslant 0$$, and if Assumption 8 holds when $$c= 0$$, then (6) also holds. There are two immediate consequences of Theorem 3: first, part (i) states that if one is only interested in obtaining accurate point estimators for $$\theta_0$$, all one needs is a tolerance $$\varepsilon_{T}$$ satisfying $$v_{T}\varepsilon^2_{T}=o(1)$$, which can significantly reduce the computational burden of approximate Bayesian computation; secondly, if one wants accurate point estimators of $$\theta_0$$ and accurate expressions for the uncertainty associated with such a point estimator, one requires $$\varepsilon_{T}=o(1/v_{T})$$. The first statement follows directly from Theorem 3(i), while the second statement follows from Theorem 3(ii) and by recalling that, from Theorem 2, credible regions constructed from $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ will have proper frequentist coverage only if $$\varepsilon_{T}=o(1/v_{T})$$. For $$\varepsilon_{T}\asymp v_T^{-1}$$ or $$\varepsilon_{T}\gg v_{T}^{-1}$$, the frequentist coverage of credible balls centred at $$E_{\Pi_\varepsilon}(\theta)$$ will not be equal to the nominal level. As an intermediate step in the proof of Theorem 3, we demonstrate the following expansion for the posterior mean, where $$k$$ denotes the integer part of $$(\beta+1)/2$$: \begin{equation} \begin{split}\label{expan:postmean} E_{\Pi _{\varepsilon }}( \theta -\theta _{0}) & =\frac{\eta(y)-b(\theta_0)}{\nabla _{\theta }b(\theta _{0})}+\sum_{j=1}^{\lfloor \beta \rfloor } \frac{\nabla_{b}^{(j)}b^{-1}(b_{0})}{j!}\sum_{l=\lceil j/2\rceil }^{\lfloor (j+k)/2\rfloor }\frac{\varepsilon _{T}^{2l}\nabla_{b}^{(2l-j)}\pi(b_{0})}{\pi(b_{0})(2l-j)! 
}\\ & \quad +O(\varepsilon _{T}^{1+\beta })+o_{\rm p}(1/v_{T})\text{.} \end{split} \end{equation} (7) This highlights a potential deviation from the expected asymptotic behaviour of the posterior mean $$E_{\Pi _{\varepsilon }}(\theta )$$, i.e., the behaviour corresponding to $$T\rightarrow \infty$$ and $$\varepsilon _{T}\rightarrow0$$. Indeed, the posterior mean is asymptotically normal for all values of $$\varepsilon _{T}=o(1)$$ , but is asymptotically unbiased only if the leading term in (7) is $$[\nabla _{\theta }b(\theta _{0})]^{-1}\{\eta(y)-b(\theta_0)\}$$, which is satisfied in case (ii) and in case (i) if $$v_{T}\varepsilon _{T}^{2}=o(1)$$, given $$\beta \geqslant 1$$. However, in case (i), if $$\lim\inf_{T}v_{T}\varepsilon _{T}^{2}>0$$, then when $$\beta \geqslant 3$$ the posterior mean has a bias of \begin{equation*} \varepsilon _{T}^{2}\left[ \frac{\nabla_{b}\pi(b_{0})}{3\pi(b_{0})\nabla _{\theta }b(\theta _{0})}-\frac{\nabla^{(2)}_{\theta}b(\theta _{0})}{2\{\nabla _{\theta }b(\theta _{0})\}^{2}}\right] +O(\varepsilon _{T}^{4})+o_{\rm p}(1/v_{T})\text{.} \end{equation*} 5.2. Comparison with existing results Li & Fearnhead (2018a) analysed the asymptotic properties of the posterior mean and functions thereof. Under the assumption of a central limit theorem for the summary statistic and further regularity assumptions on the convergence of the density of the summary statistics to this normal limit, including the existence of an Edgeworth expansion with exponential controls on the tails, Li & Fearnhead (2018a) demonstrated asymptotic normality, with no bias, of the posterior mean if $$\varepsilon_T = o (1/v_{T}^{3/5})$$. Heuristically, they derived this result using an approximation of the posterior density $$\pi_{\varepsilon }\{\theta \mid \eta ({y})\}$$, based on the Gaussian approximation of the density of $$\eta ({z})$$ given $$\theta$$ and using properties of the maximum likelihood estimator conditional on $$\eta ({y})$$. In contrast to our analysis, Li & Fearnhead (2018a) allowed the acceptance probability defining the algorithm to be an arbitrary density kernel in $$\|\eta(y)-\eta(z)\|$$. Consequently, their approach is more general than the accept/reject version in Theorem 3. However, the conditions required of $$\eta ({y})$$ in Li & Fearnhead (2018a) are stronger than ours. In particular, our results on asymptotic normality for the posterior mean require only weak convergence of $$v_{T}\{\eta ({z}) - {b}(\theta)\}$$ under $$P_\theta$$, with polynomial deviations that need not be uniform in $$\theta$$. These assumptions allow for the explicit treatment of models where the parameter space $$\Theta$$ is not compact. In addition, asymptotic normality of the posterior mean requires Assumption 8 only if $$\varepsilon_T = o(1/v_T)$$. Hence, if $$\varepsilon_T \gg v^{-1}_T$$, then only deviation bounds and weak convergence are required, which are much weaker than convergence of the densities. When $$\varepsilon_T = o(1/v_T)$$, Assumption 8 essentially implies local, in $$\theta$$, convergence of the density of $$v_{T}\{\eta ({z}) - {b}(\theta)\}$$, but imposes no requirement on the rate of this convergence. This assumption is weaker than the uniform convergence required in Li & Fearnhead (2018a). Our results also allow for an explicit representation of the bias that obtains for the posterior mean when $$\lim\inf_{T}v_{T}\varepsilon^{2}_T>0$$. 
In further contrast to Li & Fearnhead (2018a), Theorem 2 completely characterizes the asymptotic behaviour of the approximate posterior distribution for all $$\varepsilon _{T}=o(1)$$ that admit posterior concentration. This general characterization allows us to demonstrate, via Theorem 3(i), that asymptotic normality and unbiasedness of the posterior mean remain achievable even if $$\lim_{T}v_{T}\varepsilon _{T}=\infty$$, provided the tolerance satisfies $$\varepsilon _{T}=o(1/v_{T}^{1/2})$$. Li & Fearnhead (2018a) presented the interesting result that if $$k_\eta > k_\theta\geqslant1$$ and $$\varepsilon_{T}=o(1/v_T^{3/5})$$, the posterior mean is asymptotically normal, and unbiased, but is not asymptotically efficient. To help shed light on this phenomenon, the following result gives an alternative to Theorem 3.1 of Li & Fearnhead (2018a) and contains an explicit asymptotic expansion for the posterior mean when $$k_\eta > k_\theta\geqslant1$$. Theorem 4. Suppose that Assumptions 2, 4–7 and 10 hold. Assume that $$v_T\varepsilon_T \rightarrow \infty$$ and $$v_T\varepsilon_T^2 = o(1)$$. Assume also that $$b(\cdot)$$ and $$\pi(\cdot)$$ are Lipschitz in a neighbourhood of $$\theta _{0}$$. Then, for $$k_\eta > k_\theta\geqslant1$$, $$ E_{\Pi_{\varepsilon}}\{ v_T(\theta-\theta_0) \} = \{\nabla_\theta b(\theta_0)^{{\rm T}} \nabla_\theta b(\theta_0)\}^{-1} \nabla_\theta b(\theta_0)^{{\rm T}}v_{T}\{\eta(y)-b(\theta_0)\}+ o_{\rm p}(1)\text{.} $$ In addition, if $$\{\nabla_\theta b(\theta_0)^{{\rm T}} \nabla_\theta b(\theta_0)\}^{-1}\nabla_\theta b(\theta_0)^{{\rm T}} \neq \nabla_\theta b(\theta_0)^{{\rm T}}$$, the matrix $$ {\rm var}\big[\{\nabla_\theta b(\theta_0)^{{\rm T}}\nabla_\theta b(\theta_0)\}^{-1}\nabla_\theta b(\theta_0)^{{\rm T}}v_{T}\{\eta(y)-b(\theta_0)\}\big]- \{\nabla_\theta b(\theta_0)^{{\rm T}}V^{-1}(\theta_0)\nabla_\theta b(\theta_0)^{}\}^{-1} $$ is positive semidefinite, where $$\{\nabla_\theta b(\theta_0)^{{\rm T}}V^{-1}(\theta_0)\nabla_\theta b(\theta_0)^{}\}^{-1}$$ is the optimal asymptotic variance achievable given $$\eta(y)$$. A consequence of Theorem 4 is that, for a fixed choice of summaries, the two-stage procedure advocated by Fearnhead & Prangle (2012) will not reduce the asymptotic variance over a point estimator produced via Algorithm 1. However, this two-stage procedure does reduce the Monte Carlo error inherent in estimating the approximate posterior distribution $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ by reducing the dimension of the statistics on which the matching in approximate Bayesian computation is based. 6. Practical implications of the results 6.1. General implications The approximate Bayesian computation approach in Algorithm 1 is typically not used in practice. Instead, the acceptance step in Algorithm 1 is commonly replaced by the nearest-neighbour selection step and with $$d_{2}\{\eta(z),\eta(y)\}=\|\eta(z)-\eta(y)\|$$ (see, e.g., Biau et al., 2015) as follows. Step 3$$'$$. Select all $$\theta ^{i}$$ associated with the $$\alpha =\delta /N$$ smallest distances $$\|\eta(z)-\eta(y)\|$$ for some $$\delta$$. This nearest-neighbour version accepts draws of $$\theta$$ associated with an empirical quantile over the simulated distances $$\|\eta(z)-\eta(y)\|$$ and defines the acceptance probability for Algorithm 1. 
A key practical insight gained from our asymptotic results is that the acceptance probability, $$\alpha _{T}=\text{pr}\{\|\eta(z)-\eta(y)\|\leqslant \varepsilon _{T}\}$$, is affected only by the dimension of $$\theta$$, as formalized in the following corollary. Corollary 1. Under the conditions of Theorem 2, the following hold. (i) If $$\varepsilon _{T}\asymp v_{T}^{-1}$$ or $$\varepsilon_{T}=o(1/v_{T})$$, then the acceptance rate associated with the threshold $$\varepsilon _{T}$$ is \begin{equation*} \alpha _{T}={\rm pr}\big\{\|\eta(z)-\eta(y)\| \leqslant \varepsilon _{T}\big\} \asymp (v_{T}\varepsilon _{T})^{k_{\eta }}\times v_{T}^{-k_{\theta }}\lesssim v_{T}^{-k_{\theta}}\text{.} \end{equation*} (ii) If $$\varepsilon _{T}\gg v_{T}^{-1}$$ , then \begin{equation*} \alpha _{T}={\rm pr}\big\{\|\eta(z)-\eta(y)\|\leqslant \varepsilon _{T}\big\} \asymp \varepsilon _{T}^{k_{\theta }}\gg v_{T}^{-k_{\theta}}\text{.} \end{equation*} This shows that choosing a tolerance $$\varepsilon _{T}=o(1)$$ is equivalent to choosing an $$\alpha _{T}=o(1)$$ quantile of $$\|\eta(z)-\eta(y)\|$$. It also demonstrates the role played by the dimension of $$\theta$$ in determining the rate at which $$\alpha _{T}$$ declines to zero. In case (i), if $$\varepsilon _{T}\asymp v_{T}^{-1}$$, then $$\alpha _{T}\asymp v_{T}^{-k_{\theta }}$$. On the other hand, if $$\varepsilon _{T}=o(1/v_{T})$$ , as is required for the Bernstein–von Mises result in Theorem 2, the associated acceptance probability goes to zero at the faster rate, $$\alpha _{T}=o(1/v_{T}^{k_{\theta }})$$. In case (ii), where $$\varepsilon _{T}\gg v_{T}^{-1}$$, it follows that $$\alpha _{T}\gg v_{T}^{-k_{\theta }}$$. Linking $$\varepsilon _{T}$$ and $$\alpha _{T}$$ provides a way of choosing the $$\alpha _{T}$$ quantile of the simulations, or equivalently the tolerance $$\varepsilon _{T}$$, in such a way that a particular type of posterior behaviour occurs for large $$T$$: choosing $$\smash{\alpha _{T}\gtrsim v_{T}^{-k_{\theta}}}$$ gives an approximate posterior distribution that concentrates; under the more stringent condition of $$\smash{\alpha_{T}=o(1/v_{T}^{k_{\theta }})}$$, the approximate posterior distribution both concentrates and is approximately Gaussian in large samples. These results can help give practitioners an understanding of what to expect from this procedure, as well as a means of detecting potential issues if this expected behaviour is not in evidence. Moreover, given that there is no direct connection between $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ and the exact posterior distribution, these results provide some insight into the statistical properties that $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$ should display when it is obtained from the popular nearest-neighbour version of the algorithm. Corollary 1 demonstrates that to obtain reasonable statistical behaviour, the rate at which $$\alpha _{T}$$ declines to zero must be faster the larger the dimension of $$\theta$$ is, with the order of $$\alpha _{T}$$ unaffected by the dimension of $$\eta$$. This result provides theoretical evidence of a curse-of-dimensionality encountered in these algorithms as the dimension of the parameters increases, and to our knowledge this is the first work linking the dimension of $$\theta$$ to certain asymptotic properties of $$\Pi_{\varepsilon}\{\cdot\mid \eta(y)\}$$. 
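Corollary 1 can also be checked by simulation. The sketch below, which is our own toy construction and not an example from the paper, uses a model with two mean parameters and two sample-mean summaries, so that $$k_{\theta}=k_{\eta}=2$$ and $$v_{T}=T^{1/2}$$; with $$\varepsilon _{T}=1/v_{T}$$, the estimated acceptance rate multiplied by $$v_{T}^{k_{\theta}}=T$$ should remain roughly constant as $$T$$ grows, consistent with case (i), whereas it would grow without bound for a fixed tolerance, as in case (ii).

# Minimal sketch (illustration only): with eps_T = 1/v_T and k_theta = 2, the
# acceptance rate should scale like v_T^{-k_theta} = 1/T.
import numpy as np

rng = np.random.default_rng(3)
theta0 = np.array([1.0, 1.0])
n_sims = 400_000

for T in (100, 400, 1600):
    v_T = np.sqrt(T)
    eps_T = 1.0 / v_T
    eta_y = rng.normal(theta0, 1.0 / v_T)            # observed summaries (two sample means)
    theta = rng.uniform(0.0, 2.0, size=(n_sims, 2))  # prior draws, k_theta = 2
    eta_z = rng.normal(theta, 1.0 / v_T)             # simulated summaries, exact for sample means
    alpha_hat = np.mean(np.linalg.norm(eta_z - eta_y, axis=1) <= eps_T)
    print(f"T = {T:5d}: alpha_hat = {alpha_hat:.2e}, alpha_hat * T = {alpha_hat * T:.2f}")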
This result provides theoretical justification for dimension reduction methods that process parameter dimensions individually and independently of the other dimensions; see, for example, the regression adjustment approaches of Beaumont et al. (2002), Blum (2010) and Fearnhead & Prangle (2012), as well as the integrated auxiliary likelihood approach of Martin et al. (2017). While Corollary 1 demonstrates that the order of $$\alpha_{T}$$ is unaffected by the dimension of the summaries, $$\alpha_{T}$$ cannot be accessed in practice and so the nearest-neighbour version of Algorithm 1 is implemented using a Monte Carlo approximation to $$\alpha_{T}$$, which is based on the accepted draws of $$\theta$$. This approximation of $$\alpha_{T}$$ is a Monte Carlo estimate of a conditional expectation and, as such, will be sensitive to the dimension of $${\eta}(\cdot)$$ for any fixed number of Monte Carlo draws $$N$$; see Biau et al. (2015) for further discussion of this point. In addition, it can also be shown that if $$\varepsilon_T$$ becomes much smaller than $$1/v_T$$, the dimension of $$\eta(\cdot)$$ will affect the behaviour of Monte Carlo estimators for this acceptance probability. Specifically, when considering inference on $$\theta_0$$ using the accept/reject approximate Bayesian computation algorithm, we require a sequence of Monte Carlo trials $$N_{T}\rightarrow\infty$$ as $$T\rightarrow\infty$$ that diverges faster the larger $$k_{\eta}$$, the dimension of $$\eta(\cdot)$$, is. Such a feature highlights the lack of efficiency of the accept/reject approach when the sample size is large or when the dimension of the summaries is large. However, we note here that more efficient sampling approaches exist and could be applied in these settings. For example, Li & Fearnhead (2018a) considered an importance sampling approach to approximate Bayesian computation that yields acceptance rates satisfying $$\alpha_{T}=O(1)$$, so long as $$\varepsilon_{T}=O(1/v_{T})$$. Therefore, in cases where the Monte Carlo error is likely to be large, these alternative sampling approaches should be employed. Regardless of whether one uses a more efficient sampling procedure than the simple accept/reject approach, Corollary 1 demonstrates that taking a tolerance sequence as small as possible will not necessarily yield more accurate results; that is, Corollary 1 questions the persistent opinion that the tolerance in Algorithm 1 should always be taken as small as the computing budget allows. Once $$\varepsilon _{T}$$ is chosen small enough to satisfy case (iii) of Theorem 2, which leads to the most stringent requirement on the tolerance, $$v_{T}\varepsilon_{T}=o(1)$$, there may well be no gain in pushing $$\varepsilon _{T}$$ or, equivalently $$\alpha_{T}$$, any closer to zero, especially since pushing $$\varepsilon_{T}$$ closer to zero can drastically increase the required computational burden. In the following subsection we numerically demonstrate this result in a simple example. In particular, we show that for a choice of tolerance $$\varepsilon_{T}$$ that admits a Bernstein–von Mises result, there is no gain in taking a tolerance that is smaller than this value, while the computational cost associated with such a choice, for a fixed level of Monte Carlo error, drastically increases. 6.2. Numerical illustration of quantile choice Consider the simple example where we observe a sample $$\{y_{t}\}_{t=1}^{T}$$ from $$y_{t}\sim \mathcal{N}(\mu ,\sigma )$$ with $$T=100$$. 
Our goal is posterior inference on $$\theta=(\mu ,\sigma)^{{\rm T}}$$. We use as summaries the sample mean and variance, $$\bar{x}$$ and $$s_{T}^{2}$$, which satisfy a central limit theorem at rate $${T}^{1/2}$$. In order to guarantee asymptotic normality of the approximate posterior distribution, we must choose an $$\alpha _{T}$$ quantile of the simulated distances according to $$\alpha _{T}=o(1/T)$$, because of the joint inference on $$\mu$$ and $$\sigma$$. For the purpose of this illustration, we will compare inference based on the nearest-neighbour version of Algorithm 1 using four different choices of $$\alpha _{T}$$, namely $$\alpha _{1}=1/T^{1{\cdot}1}$$, $$\alpha _{2}=1/T^{3/2}$$, $$\alpha _{3}=1/T^{2}$$ and $$\alpha _{4}=1/T^{5/2}$$. Draws for $$(\mu ,\sigma )$$ are simulated on $$[0{\cdot}5,1{\cdot}5]\times \lbrack 0{\cdot}5,1{\cdot}5]$$ according to independent uniform distributions $${\rm Un}[0{\cdot}5,1{\cdot}5]$$. The number of draws, $$N$$, is chosen so that we retain 250 accepted draws for each of the different choices ($$\alpha _{1},\dots,\alpha _{4}$$). The exact finite-sample marginal posterior densities of $$\mu$$ and $$\sigma$$ are produced by numerically evaluating the likelihood function, normalizing over the support of the prior, and marginalizing with respect to each parameter. Given the sufficiency of $$(\bar{x},\, s_{T}^{2})$$, the exact marginal posterior densities for $$\mu$$ and $$\sigma$$ are equal to those based directly on the summaries themselves. Thus, we are able to assess the impact of the choice of $$\alpha$$, per se, on the ability of the nearest-neighbour version of Algorithm 1 to replicate the exact marginal posteriors. We summarize the accuracy of the resulting approximate posterior density estimates, across the four quantile choices, using root mean squared error. In particular, over 50 simulated replications and for the parameter $$\mu$$, we estimate the root mean squared error between the marginal posterior density obtained from Algorithm 1 using $$\alpha _{j}$$, denoted by $$\hat{\pi}_{\alpha _{j}}\{\mu \mid{\eta }({y})\}$$, and the exact marginal posterior density, $$\pi(\mu \mid {y})$$, by \begin{equation} {\small{\text{RMSE}}}_{\mu }(\alpha _{j})=\left[\frac{1}{G}\sum_{g=1}^{G}\big\{ \hat{\pi} _{\alpha _{j}}^{g}\{\mu \mid {\eta }({y})\}-\pi^{g}(\mu \mid {y})\big\} ^{2}\right]^{1/2}\text{.} \label{RMSE_def} \end{equation} (8) The term $$\hat{\pi}^{g}_{\alpha_{j}}$$ is the ordinate of the density estimate from the nearest-neighbour version of Algorithm 1, and $$\pi^{g}$$ is the ordinate of the exact posterior density, at the $$g$$th grid point upon which the density is estimated. $${\small{\text{RMSE}}}_{\sigma }(\alpha _{j})$$ is computed analogously. The value of $${\small{\text{RMSE}}}_{\mu }(\alpha _{j})$$ is averaged over 50 replications to account for sampling variability. For each replication, we fix $$T=100$$ and generate observations using the parameter values $$\mu _{0}=1$$ and $$\sigma _{0}=1$$. Before presenting the replication results, it is instructive to consider the graphical output of one particular run of the algorithm for each of the $$\alpha _{j}$$ values. Figure 1 plots the resulting marginal posterior estimates and compares these with the exact finite-sample marginal posterior densities of $$\mu$$ and $$\sigma$$. 
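To make the design concrete, the following sketch is a loose reconstruction of this experiment in Python; it is not the authors' code. The summaries are drawn from their exact sampling distributions, the sample mean from $$\mathcal{N}(\mu,\sigma^{2}/T)$$ and the sample variance from a scaled $$\chi^{2}_{T-1}$$, so that full datasets need not be stored, and $$\sigma$$ is treated as the standard deviation. The number of prior draws is set here to $$N=\lceil 250/\alpha_{j}\rceil$$ so that 250 draws are retained, and $$\alpha_{4}$$, which would require roughly $$2{\cdot}5\times10^{7}$$ draws under this scheme, is omitted; these choices are assumptions of the sketch and need not match the settings used to produce Fig. 1 and Table 1.

# Minimal sketch (illustration only) of the Section 6.2 design: y_t ~ N(mu, sigma^2)
# with T = 100, summaries (sample mean, sample variance), independent Un[0.5, 1.5]
# priors, and nearest-neighbour selection retaining 250 draws.
import numpy as np

rng = np.random.default_rng(4)
T, n_keep, mu0, sigma0 = 100, 250, 1.0, 1.0
y = rng.normal(mu0, sigma0, size=T)
eta_y = np.array([y.mean(), y.var(ddof=1)])

def nn_abc(alpha):
    n_sims = int(np.ceil(n_keep / alpha))                            # so that ceil(alpha * N) = 250
    mu = rng.uniform(0.5, 1.5, size=n_sims)
    sigma = rng.uniform(0.5, 1.5, size=n_sims)
    z_mean = rng.normal(mu, sigma / np.sqrt(T))                      # sample mean ~ N(mu, sigma^2 / T)
    z_var = sigma**2 * rng.chisquare(T - 1, size=n_sims) / (T - 1)   # (T-1) s^2 / sigma^2 ~ chi^2_{T-1}
    dist = np.hypot(z_mean - eta_y[0], z_var - eta_y[1])
    idx = np.argsort(dist)[:n_keep]
    return mu[idx], sigma[idx], n_sims

for name, alpha in [("alpha_1", T ** -1.1), ("alpha_2", T ** -1.5), ("alpha_3", T ** -2.0)]:
    mu_acc, sigma_acc, n_sims = nn_abc(alpha)
    print(f"{name}: N = {n_sims:>8d}, accepted posterior means: "
          f"mu = {mu_acc.mean():.3f}, sigma = {sigma_acc.mean():.3f}")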
At the end of § 6.1 we argued that for large enough $$T$$, once $$\varepsilon _{T}$$ reaches a certain threshold, decreasing the tolerance further will not necessarily result in more accurate estimates of these exact posterior densities. This implication is evident in Fig. 1: in the case of $$\mu$$, there is a clear visual decline in the accuracy with which approximate Bayesian computation estimates the exact marginal posterior densities when choosing quantiles smaller than $$\alpha _{2}$$, while in the case of $$\sigma$$, the worst performing estimate is that associated with the smallest value of $$\alpha _{j}$$.

Fig. 1. Comparison of exact and approximate posterior densities of (a) $$\mu$$ and (b) $$\sigma$$ for various tolerances: exact marginal posterior densities (—) and approximate Bayesian computation posterior densities based on $$\alpha_{1}=1/T^{1{\cdot}1}$$ ($$\cdots$$), $$\alpha_{2}=1/T^{3/2}$$ (- - -), $$\alpha_{3}=1/T^{2}$$ (— $$\cdot$$ —) and $$\alpha_{4}=1/T^{5/2}$$ (—*—).

Table 1 reports average root mean squared errors, relative to the average value associated with $$\alpha _{4}=1/T^{5/2}$$. Values smaller than 1 indicate that the larger, and less computationally burdensome, value of $$\alpha _{j}$$ yields a more accurate estimate than that obtained using $$\alpha_{4}$$. In brief, Table 1 reveals a similar picture to that of Fig. 1: for $$\sigma$$, the estimates based on $$\alpha _{j}$$ for $$j=1,2,3$$ are all more accurate than those based on $$\alpha _{4}$$; for $$\mu$$, estimates based on $$\alpha _{2}$$ and $$\alpha _{3}$$ are both more accurate than those based on $$\alpha _{4}$$.

Table 1. Ratio of the average root mean squared error for marginal approximate posterior density estimates to the average root mean squared error based on the smallest quantile, $$\alpha_{4}=1/T^{5/2}$$

                                                       $$\alpha _{1}=1/T^{1.1}$$   $$\alpha _{2}=1/T^{1.5}$$   $$\alpha _{3}=1/T^{2}$$
$${\small{\text{AVG-RMSE}}}_{\mu }(\alpha _{j})$$      1$$\cdot$$17               0$$\cdot$$99               0$$\cdot$$98
$${\small{\text{AVG-RMSE}}}_{\sigma }(\alpha _{j})$$   0$$\cdot$$86               0$$\cdot$$87               0$$\cdot$$91

avg-rmse, the ratio of the average root mean squared errors as defined in (8).
These numerical results have important implications for the implementation of approximate Bayesian computation. Keeping the level of Monte Carlo error constant across the $$\alpha_{j}$$ quantile choices, as we have done in this simulation setting by retaining 250 draws in every case, requires taking $$N=2{\cdot}1\times 10^{5}$$ for $$\alpha _{1}$$, $$1{\cdot}4\times 10^{6}$$ for $$\alpha _{2}$$, $$1{\cdot}35\times 10^{7}$$ for $$\alpha _{3}$$ and $$4{\cdot}1\times 10^{7}$$ for $$\alpha _{4}$$. In other words, the computational burden associated with decreasing the quantile in this manner increases dramatically: approximate posterior densities based on $$\alpha _{4}$$, for example, require a value of $$N$$ nearly two hundred times greater than those based on $$\alpha _{1}$$, yet this increase in computational burden yields little or no gain in accuracy. The extension of such explorations to additional scenarios is beyond the scope of this paper; however, we speculate that, with due consideration given to the properties of both the true data-generating process and the chosen summary statistics, and hence to the sample sizes for which Theorem 2 has practical relevance, similar qualitative results will continue to hold.

Acknowledgement

This research was supported by the Australian Research Council and l'Institut Universitaire de France. Robert is also affiliated with the University of Warwick, and Rousseau with Université Paris Dauphine, PSL Research University. We are grateful to Wentao Li and Paul Fearnhead for spotting a mistake in a previous version of Theorem 3, and to the editorial team for its help.

Supplementary material

Supplementary material available at Biometrika online includes proofs of all results and numerical examples that illustrate the theoretical results.

References

Beaumont M. A., Zhang W. & Balding D. J. (2002). Approximate Bayesian computation in population genetics. Genetics 162, 2025–35.

Bhattacharya R. N. & Rao R. R. (1986). Normal Approximation and Asymptotic Expansions. Philadelphia: SIAM.

Biau G., Cérou F. & Guyader A. (2015). New insights into approximate Bayesian computation. Ann. Inst. Henri Poincaré Prob. Statist. 51, 376–403.

Blum M. (2010). Approximate Bayesian computation: A non-parametric perspective. J. Am. Statist. Assoc. 105, 1178–87.

Creel M., Gao J., Hong H. & Kristensen D. (2015). Bayesian indirect inference and the ABC of GMM. arXiv:1512.07385.

Creel M. & Kristensen D. (2015). ABC of SV: Limited information likelihood inference in stochastic volatility jump-diffusion models. J. Empir. Finan. 31, 85–108.
Drovandi C. C., Pettitt A. N. & Lee A. (2015). Bayesian indirect inference using a parametric auxiliary model. Statist. Sci. 30, 72–95.

Fearnhead P. & Prangle D. (2012). Constructing summary statistics for approximate Bayesian computation: Semi-automatic approximate Bayesian computation (with Discussion). J. R. Statist. Soc. B 74, 419–74.

Gouriéroux C., Monfort A. & Renault E. (1993). Indirect inference. J. Appl. Economet. 8, 85–118.

Li W. & Fearnhead P. (2018a). On the asymptotic efficiency of approximate Bayesian computation estimators. Biometrika 105, 285–99.

Li W. & Fearnhead P. (2018b). Convergence of regression adjusted approximate Bayesian computation. Biometrika 105, 301–18.

Marin J.-M., Pillai N., Robert C. P. & Rousseau J. (2014). Relevant statistics for Bayesian model choice. J. R. Statist. Soc. B 76, 833–59.

Marin J.-M., Pudlo P., Robert C. P. & Ryder R. (2011). Approximate Bayesian computation methods. Statist. Comp. 21, 289–91.

Martin G. M., McCabe B. P. M., Frazier D. T., Maneesoonthorn W. & Robert C. P. (2017). Auxiliary likelihood-based approximate Bayesian computation in state space models. arXiv:1604.07949v2.

Pritchard J. K., Seielstad M. T., Perez-Lezaun A. & Feldman M. W. (1999). Population growth of human Y chromosomes: A study of Y chromosome microsatellites. Molec. Biol. Evol. 16, 1791–8.

Tavaré S., Balding D. J., Griffiths R. C. & Donnelly P. (1997). Inferring coalescence times from DNA sequence data. Genetics 145, 505–18.
