Asymptotic post-selection inference for the Akaike information criterion

Summary

Ignoring the model selection step in inference after selection is harmful. In this paper we study the asymptotic distribution of estimators after model selection using the Akaike information criterion. First, we consider the classical setting in which a true model exists and is included in the candidate set of models. We exploit the overselection property of this criterion in constructing a selection region, and we obtain the asymptotic distribution of estimators and linear combinations thereof conditional on the selected model. The limiting distribution depends on the set of competitive models and on the smallest overparameterized model. Second, we relax the assumption on the existence of a true model and obtain uniform asymptotic results. We use simulation to study the resulting post-selection distributions and to calculate confidence regions for the model parameters, and we also apply the method to a diabetes dataset.

1. Introduction

Variable selection, model selection and estimation with a sparsity-enforcing penalty all induce uncertainty due to the process of selection, and they complicate subsequent inference. We investigate post-selection inference for the Akaike information criterion, AIC (Akaike 1973). The method is valid for variable selection in any likelihood-based model. We construct confidence intervals for regression parameters, or linear combinations thereof, conditional on the selected model, which have the correct coverage probabilities. The method involves rewriting the event of selection asymptotically as a set of inequalities that involve multivariate normal random variables. While the calculation of critical values might proceed exactly for one or two parameters, we develop a numerical approach that is more generally applicable. We focus explicitly on the classical low-dimensional setting, for which no such post-selection results are yet available.

The need to address selection uncertainty has been pointed out many times (e.g., Kabaila 1995, 1998; Hjort & Claeskens, 2003; Leeb & Pötscher, 2003, 2005, 2006; Danilov & Magnus, 2004; Kabaila & Leeb, 2006). Claeskens & Hjort (2008) approached the post-selection issue via model averaging, using simulation in a local misspecification framework. For model selection via sequential testing in nested models, Pötscher (1991) calculated the asymptotic distribution of the parameter estimator. Several advances have been made recently. The post-selection inference method of Berk et al. (2013) yields, for linear models, valid confidence intervals irrespective of the selection procedure, which can also be informal. Bachoc et al. (2015) generalized this method to prediction intervals. Since these methods are not specific to any selection procedure, the resulting confidence intervals can be quite conservative. Efron (2014) proposed using a bagging, i.e., bootstrap aggregating, estimator and derived its variance, using normal quantiles to obtain confidence intervals. Ferrari & Yang (2014) assessed model uncertainty when performing F-tests in linear models via a so-called variable selection confidence set. Kabaila et al. (2016) investigated the exact coverage and scaled expected length of certain model-averaged confidence intervals for a parameter of a linear regression model. In selective inference one lets the data determine the selected model and the target of the parameter estimators. For the lasso, Lee et al.
(2016) obtained exact post-selection inference by relating the selected set of active coefficients to a union of polyhedra. For forward selection and least angle regression in normal linear regression models, Taylor et al. (2016) studied selective hypothesis tests and confidence intervals. Jansen (2014) investigated the effect of the optimization on the expected values of the Akaike information criterion and Mallows' $$C_p$$ in high-dimensional sparse models. Belloni et al. (2015) obtained uniformly valid confidence intervals in the presence of a sparse high-dimensional nuisance parameter.

We explain our approach first in the traditional simple case of selection using the Akaike information criterion in a sequence of nested models, the so-called order selection problem. Next, we extend this to the practically more relevant setting of selection from a general set of models, not necessarily nested and possibly all misspecified. When a true parametric model exists, only pointwise results can be obtained, while under misspecification, working with pseudo-true values that change with the model, stronger, uniformly valid confidence intervals are constructed.

2. Post-AIC selection in nested models

2.1. Selection properties of the AIC

Consider first a nested sequence of $$K+1$$ likelihood models $$M_0 \subseteq \cdots \subseteq M_K$$, for which the likelihood function $$L_{n}$$ depends on a parameter vector $${\theta}^{{ \mathrm{\scriptscriptstyle T} }}=({\theta}_0^{{ \mathrm{\scriptscriptstyle T} }}, \theta_1, \ldots, \theta_K)\in\Omega\subseteq \mathbb{R}^{a+K}$$, where $${\theta}_0\in\mathbb{R}^{a}$$ denotes the parameter vector that is common to all models, and hence not subject to variable selection, and $$n$$ denotes the sample size. For ease of notation we assume that model $$M_i$$ adds a single parameter to model $$M_{i-1}$$; generalizations are straightforward. We start by assuming that there is a single minimal true model $$M_{p_0}$$ in the set of models $$\mathcal{M}_{{\rm nest}}=\{M_i: i=0, \ldots, K\}$$, in the sense that $$p_0$$ is the smallest model order for which all nonzero components of the true parameter vector $$\vartheta$$ are included. This assumption is relaxed in § 4, where we do not require the existence of a true model and we allow for nonnested models and for model misspecification. In the current setting, models with indices $$i<p_0$$ are underparameterized, while models with $$i>p_0$$ are overparameterized. We denote by $$\hat{{\theta}}'(i)$$ the maximum likelihood estimator of the parameter vector $${\theta}^{{ \mathrm{\scriptscriptstyle T} }}(i)=({\theta}_0^{{ \mathrm{\scriptscriptstyle T} }}, \ldots, \theta_i)\in \mathbb{R}^{a+i}$$ in model $$M_i$$, write $$\hat{{\theta}}(i)=({\hat{\theta}}'(i)^{{ \mathrm{\scriptscriptstyle T} }},{0}_{K-i}^{{ \mathrm{\scriptscriptstyle T} }})^{{ \mathrm{\scriptscriptstyle T} }}$$, and let $${\vartheta}={\vartheta}(p_0)$$ denote the corresponding true value, where $$\vartheta_j=0$$ for $$j>p_0$$. Here $$0_l$$ stands for the zero vector of length $$l$$. The Akaike information criterion for model $$M_j$$ in the model list $$\mathcal{M}_{{\rm nest}}$$ is $$\small{\text{AIC}}(M_j)=-2\ell_n\{\hat{{\theta}}(\,j)\} +2 (a+j)$$, where $$\ell_n=\log L_n$$.
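To make the criterion concrete, here is a minimal sketch (ours, not part of the paper; the toy data, the Gaussian working likelihood and the parameter count including the error variance are assumptions made purely for illustration). It evaluates $$\small{\text{AIC}}(M_j)$$ over a nested sequence of linear models and returns the smallest index attaining the minimum.

```python
import numpy as np

def gaussian_aic(y, X):
    """AIC of a Gaussian linear model fitted by maximum likelihood."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / n                       # maximum likelihood variance estimate
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                     # regression coefficients plus error variance
    return -2 * loglik + 2 * k

rng = np.random.default_rng(0)
n, K = 200, 5
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(K)])
y = 1.0 + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(size=n)   # true order p0 = 2

# Nested sequence M_0 ⊆ ... ⊆ M_K: model M_j uses the intercept and the first j covariates.
aic = [gaussian_aic(y, X[:, : 1 + j]) for j in range(K + 1)]
p_hat = int(np.argmin(aic))               # smallest index attaining the minimum
print(aic, p_hat)
```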
The index of the selected model is $$ \hat p_0=\min\{\,j:\small{\text{AIC}}(M_j)=\min_{0 \leqslant i \leqslant K} \small{\text{AIC}}(M_i)\}\text{.} $$ The idea behind the construction of post-selection inference is to rewrite the selection procedure in terms of a set of inequalities, which define a geometrical region in terms of random variables that can easily be simulated. For this purpose, we redefine $$\hat p_0 = \min\{\,j\in\{0,\ldots,K\}: j=\arg\max_{j=0,\ldots,K}\small{\text{AIC}}^*(M_j)\}$$, with $$\small{\text{AIC}}^*(M_j)=2[\ell_n\{\hat{{\theta}}(\,j)\}-\ell_n({\vartheta})]-2j=2 \ell_{n,j}^*-2j$$. Asymptotically, the probability of underselection is zero (Woodroofe 1982; Lemma A1 in the Appendix); see also Shibata (1976). Conditioning on $$\hat p_0=p$$, we have that $$\small{\text{AIC}}^*(M_{p})-\small{\text{AIC}}^*(M_{j})>0$$ for $$j=p_0,\ldots,p-1$$ and $$\small{\text{AIC}}^*(M_{p})-\small{\text{AIC}}^*(M_{j})\ge 0$$ for $$j=p+1,\ldots,K$$. For $$n\to\infty$$, there is joint convergence in distribution of $$(\ell^*_{n,p_0},\ldots,\ell^*_{n,K})$$ to $$(\sum_{i=1}^{a+p_0}Z_i^2,\ldots,\sum_{i=1}^{a+K}Z_i^2)/2$$, where $$Z_1,\ldots,Z_{a+K}$$ are independent and identically distributed $$N(0,1)$$ variables (Woodroofe 1982). By the continuous mapping theorem, asymptotically, when $$\hat p_0=p$$, $$(Z_1,\ldots,Z_{a+K})\in \mathcal{A}_p(\mathcal{M}_{{\rm nest}})$$, which is called the selection region for nested models and is defined by
\begin{align*} \mathcal{A}_p(\mathcal{M}_{{\rm nest}}) & = \mathcal{B}_{1,p} \cap \mathcal{B}_{2,p}, \text{ with } \\ \mathcal{B}_{1,p} & = \bigcap_{j=p_0+1,\ldots,p}\left\{ {z}\in \mathbb{R}^{a+K}: \sum_{i=j}^{p}(z_{a+i}^2-2)>0\right\}\!, \\ \mathcal{B}_{2,p} & = \bigcap_{j=p+1,\ldots,K}\left\{{z}\in \mathbb{R}^{a+K}: \sum_{i=p+1}^{j}(z_{a+i}^2-2)\le0\right\}\! \text{.} \end{align*}
Geometrically, the first set of $$p-p_0-1$$ strict inequalities specifies regions outside spheres, the last set of $$K-p$$ inequalities indicates regions inside certain other spheres, and the remaining inequality $$z_{a+p}^2>2$$ determines the union of two half-spaces, with $$z_{a+p}\in(-\infty,-2^{1/2})\cup(2^{1/2},+\infty)$$. The specific structure of the Akaike information criterion determines the form of the regions. Other selection methods define other regions; see § 7 for examples. Lemma 5.1 and Theorem 5.2 in Lee et al. (2016) characterize the lasso selection procedure, for a given value of the $$\ell_1$$-penalty, in terms of polyhedral sets; see also Taylor et al. (2016).

2.2. Distributional results

Inference post-selection deals with the distribution of the estimators in the selected model, conditional on the selection. In this paper, selection always means selection of the model with the smallest Akaike information criterion value, and by the post-selection estimator we mean the maximum likelihood estimator based on the selected model. We show that the limiting cumulative distribution function of $$n^{1/2}\{\hat{{\theta}}(\hat p_0)-{\vartheta}\}$$ conditional on the selected model can be described by a multivariate normal random variable $$Z$$ that is, for nested models, conditioned on $$Z\in \mathcal{A}_p(\mathcal{M}_{{\rm nest}})$$. Owing to the nature of selection using the Akaike information criterion, and by the results of Pötscher (1991) and Leeb & Pötscher (2003), it can be shown that the selection of an overspecified model does not happen in a uniform way, but depends on the true parameter value $$\vartheta$$. Hence, in §§ 2 and 3, the results are pointwise.
All proofs and assumptions are given in the Appendix. Define for model $$M_i$$ the submatrix $${J}_{M_i}({\vartheta})$$ of the Fisher information matrix $${J}({\vartheta})$$ in the model with all parameters (see Assumption A1(iv)), and for an $$(a+K)$$-vector $${\nu}$$ define its subvector $$\tilde{{\nu}}(i)=(\nu_1,\ldots,\nu_{a+i})^{{ \mathrm{\scriptscriptstyle T} }}$$. The indicator function $$I(\cdot)$$ is defined by $$I(A)=1$$ if $$A$$ is true and $$I(A)=0$$ otherwise.

Proposition 1. Suppose that conditions $${\rm (i)-(iv)}$$ of Assumption A1 in the Appendix hold. For a sequence of nested models $$\mathcal{M}_{{\rm nest}}$$, with $$p_0$$ denoting the true model order, the asymptotic conditional cumulative distribution function of the post-selection estimator is
\begin{align} F_p({t}) & = \mathop{\lim}_{n \rightarrow \infty } \mathrm{pr}\big[n^{1/2}\{\hat{{\theta}}(p)-{\vartheta}\}\leqslant {t} \mid \hat{p}_0=p,\mathcal{M}_{{\rm nest}}\big] \nonumber\\ & = \mathrm{pr}\{{J}_{p}^{-1/2}({\vartheta}) \tilde{{Z}}(p) \leqslant \tilde{{t}}(p) \mid \tilde{{Z}}(p)\in\mathcal{A}_p^{({\rm s})}(\mathcal{M}_{{\rm nest}})\} I({t}\in \mathcal{T}_{p}), \end{align} (1)
where $$p\geqslant p_0$$ by Lemma 1, $${Z}=(Z_1,\ldots,Z_{a+K})^{{ \mathrm{\scriptscriptstyle T} }}$$, the region with simplified constraints is $$\mathcal{A}_p^{({\rm s})}(\mathcal{M}_{{\rm nest}})= \bigcap_{j=p_0+1,\ldots,p}\left\{\tilde{{z}}(p) \in \mathbb{R}^{a+p}: \sum_{i=j}^p\big(z_{a+i}^2-2\big)>0\right\}$$, and $$\mathcal{T}_p=\mathbb{R}^{a+p}\times (\mathbb{R}^+)^{K-p}$$.

By the forms of $$\mathcal{A}_p$$ and $$\mathcal{A}_p^{({\rm s})}$$, the limiting distribution of $$n^{1/2}\{\hat{{\theta}}(p)-{\vartheta}\}$$ conditional on selection in the set $$\mathcal{M}_{{\rm nest}}$$ is symmetric and its density function is that of a truncated normal random variable. Let $$\phi_p(\cdot\mid\mathcal{A};{V})$$ denote the density of $${V}^{-1/2} \tilde{{Z}}(p)$$, where $$\tilde{{Z}}(p) \sim N_{a+p}({0},{I}_{a+p})$$ is truncated such that $$\tilde{{Z}}(p) \in \mathcal{A}$$. In the case of selecting the true model, the conditioning event contains only random variables that are independent of $$\tilde{Z}(p_0)$$ and hence may be omitted. Figure 1 depicts some of the limiting post-selection densities for an example of selecting the largest of a sequence of three nested models, when the smallest model is the true one. This example is continued in § 3.1. For more details, see the Supplementary Material.

Fig. 1. Marginal asymptotic densities $$f_{j|3}$$ ($$j=1,2,3$$) of $$n^{1/2}(\hat{\theta}_j-\vartheta_j)$$ conditional on $$\hat{p}_0=3$$ when $$p_0=1$$ and $$J_3^{-1}(\vartheta)$$ is a diagonal matrix with diagonal elements $$(1,4,4)$$.

Corollary 1. Under the assumptions of Proposition 1, the limiting density of $$n^{1/2}\{\hat{{\theta}}(\hat p_0)-{\vartheta}\}$$ conditional on $$\small{\text{AIC}}$$-selection with $$\hat p_0 = p$$ from the set of nested models $$\mathcal{M}_{{\rm nest}}$$ is $$f_p({t})= \phi_p\{\tilde{{t}}(p)\mid\mathcal{A}_p^{({\rm s})}(\mathcal{M}_{{\rm nest}});{J}_{p}^{-1}({\vartheta})\} I({t}\in \mathcal{T}_{p})$$. When the true model is selected, i.e., $$\hat p_0=p_0$$, we have $$f_{p_0}({t})=\phi_{p_0}\{\tilde{{t}}(p_0)\} I({t}\in \mathcal{T}_{p})$$.
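To illustrate Corollary 1, the following minimal sketch (ours; it assumes the setting of Fig. 1 with $$a=0$$, $$p_0=1$$, $$\hat p_0=3$$ and $$J_3^{-1}(\vartheta)={\rm diag}(1,4,4)$$, and uses simple rejection sampling) draws from the limiting post-selection distribution by sampling $$\tilde Z(3)$$ on the simplified region $$\mathcal{A}_3^{({\rm s})}(\mathcal{M}_{\rm nest})$$ and rescaling by $$J_3^{-1/2}$$.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_post_aic(n_draws, j_inv_sqrt):
    """Rejection sampler for the truncated normal of Corollary 1 (nested case,
    p0 = 1, p = 3): keep Z with Z3^2 > 2 and (Z2^2 - 2) + (Z3^2 - 2) > 0."""
    out = []
    while len(out) < n_draws:
        z = rng.standard_normal((n_draws, 3))
        keep = (z[:, 2] ** 2 > 2) & (z[:, 1] ** 2 + z[:, 2] ** 2 > 4)
        out.extend(z[keep])
    z_acc = np.array(out[:n_draws])
    return z_acc @ j_inv_sqrt.T            # limiting draws of n^{1/2}{theta_hat(3) - theta}

J3_inv_sqrt = np.diag([1.0, 2.0, 2.0])     # square root of diag(1, 4, 4), assumed as in Fig. 1
draws = sample_post_aic(100_000, J3_inv_sqrt)

# Marginal 95% post-selection quantiles (compare with Fig. 1 and Corollary 3).
print(np.quantile(np.abs(draws), 0.95, axis=0))
```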
2.3. Confidence regions

A correct post-selection analysis incorporates the uncertainty associated with variable selection; we obtain confidence regions conditional on the selected model.

Corollary 2. Under the assumptions of Proposition 1, an asymptotic $$100(1-\alpha)\%$$ Wald confidence ellipsoid conditional on having selected a model with $$\hat p_0=p$$ is
\begin{eqnarray*} \big\{{\vartheta}\in\mathbb{R}^{a+K}: n\{{\hat{\theta}}'(p)-\tilde{{\vartheta}}(p)\}^{{ \mathrm{\scriptscriptstyle T} }}{J}_{p}({\vartheta}) \{{\hat{\theta}}'(p)-\tilde{{\vartheta}}(p)\} \leqslant q_{\alpha}\big\}, \end{eqnarray*}
where $$q_{\alpha}$$ is defined such that $$1-\alpha$$ equals
\begin{equation} \int_{2(p-p_0)}^{q_\alpha} \!\int_{2(p-p_0)}^{w_{1}}\!\!\!\! \cdots\!\! \int_4^{w_{p-2}}\!\! \!\int_2^{w_{p-1}} \!\!\frac{f(w_p,\ldots, w_{p_0+1}, w_1)}{\mathrm{pr}\{\tilde{{Z}}(p)\in\mathcal{A}_p^{({\rm s})}(\mathcal{M}_{{\rm nest}})\}} \,\mathrm{d} w_p\,\mathrm{d} w_{p-1}\ldots \,\mathrm{d} w_{p_0+1} \,\mathrm{d} w_{1}, \end{equation} (2)
with
\begin{align*} & f(w_p,\ldots, w_{p_0+1}, w_1)\\ &\quad{} = \frac{\exp(-{w_1}/{2})w_p^{-1/2}(w_1-w_{p_0+1})^{-(a+p_0)/2-1}\prod^{p-p_0+1}_{i=1}(w_i-w_{i-1})^{-1/2}} {2^{({a+p})/{2}}\{\Gamma(1/2)\}^{p-p_0}\Gamma(\frac{a+p_0}{2})}\text{.} \end{align*}

In § 2.4 we propose an accurate method for estimating $$q_{\alpha}$$ when exact computation is cumbersome. Clearly, the naive approach of using the quantile of a chi-squared distribution gives coverage that is too low. Confidence intervals for single components of $${\vartheta}$$ require the calculation of marginal distributions.

Corollary 3. Under the assumptions of Proposition 1, with $$\mathcal{R}_{\alpha}=\mathbb{R}^{j-1}\times [-q_{\alpha/2}, q_{\alpha/2}] \times \mathbb{R}^{a+p-j} \times (\mathbb{R}^+)^{K-p}$$, the asymptotic $$100(1-\alpha)\%$$ quantiles of the marginal distributions of $$\vartheta_j$$ $$(\,j=1,2,\ldots, a+p)$$ satisfy $$ \int_{\mathcal{R}_{\alpha}}f_p(t) \,\mathrm{d} t = 1-\alpha$$.

2.4. Simulation-based inference

Since the exact calculations are tedious, even in low dimensions, we present a method to simulate this conditional distribution, from which quantiles can then be obtained. When $${J}({\vartheta})$$ is unknown, we use a consistent estimator $$\skew6\hat{{J}}\{\hat{{\theta}}(K)\}$$. We use a Hamiltonian Monte Carlo method (Pakman & Paninski 2014) to sample from an $$(a+K)$$-variate standard normal distribution subject to quadratic constraints, which are themselves expressed in terms of standard normal random variables. The resulting $$n'$$ samples drawn from this density are placed in the $$n' \times (a+K)$$ matrix $${\mathcal{Z}}_{\mathcal{A}}$$. Next, we multiply each row of $$\tilde{{\mathcal{Z}}}_{\mathcal{A}}(p)$$ by $$\skew6\hat{{J}}^{-1/2}_{p}\{\hat{{\theta}}(K)\}$$, which yields $$n'$$ samples from the limiting distribution of $$n^{1/2}\{\hat{{\theta}}(\hat p_0)-{\vartheta}\}$$; see Corollary 1. The example in the Supplementary Material demonstrates the close agreement between the 95% quantiles $$q_{\alpha}$$ in (2) simulated via constrained $$\chi^2$$ distributions and their exact values.
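As a small numerical check of § 2.4, the sketch below (ours; rejection sampling is used in place of the Hamiltonian Monte Carlo sampler of Pakman & Paninski (2014), and the nested setting $$a=0$$, $$p_0=1$$, $$p=3$$ is an assumption made for the example) approximates $$q_\alpha$$ in (2) as the empirical $$(1-\alpha)$$ quantile of the constrained $$\chi^2$$ statistic.

```python
import numpy as np

rng = np.random.default_rng(2)

def q_alpha_nested(alpha, p0=1, p=3, a=0, n_draws=500_000):
    """Monte Carlo approximation of q_alpha in (2): the (1 - alpha) quantile of
    sum_{i=1}^{a+p} Z_i^2 conditional on Z lying in the simplified region A_p^(s)."""
    z = rng.standard_normal((n_draws, a + p))
    # Simplified nested-model constraints: sum_{i=j}^{p} (z_{a+i}^2 - 2) > 0, j = p0+1,...,p.
    keep = np.ones(n_draws, dtype=bool)
    for j in range(p0 + 1, p + 1):
        keep &= np.sum(z[:, a + j - 1 : a + p] ** 2 - 2, axis=1) > 0
    stat = np.sum(z[keep] ** 2, axis=1)
    return np.quantile(stat, 1 - alpha)

print(q_alpha_nested(0.05))   # larger than the chi-squared quantile, which gives too low coverage
```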
3. Post-selection inference in general models

3.1. AIC selection in a set of nonnested models

Lemma 1 generalizes a result of Woodroofe (1982), which is repeated as Lemma S1 in the Supplementary Material, to an arbitrary set of models that contains at least one overparameterized model.

Lemma 1. Under Assumption A1$${\rm (i)-(iv)}$$, the asymptotic probability that selection using the Akaike information criterion results in an underparameterized model from a set of models $$\mathcal{M}$$ that contains at least one overparameterized model is equal to zero.

The distributional properties of the post-selection estimators depend on the candidate set of models $$\mathcal{M}$$. Indeed, another set $$\mathcal{M}$$ could have led to a different selection. We define the selection matrix to indicate which variables appear in the set of models.

Definition 1. The selection matrix $${\zeta}_{\mathcal{M}}$$ is a $$|\mathcal{M}| \times (a+K)$$ matrix with $$\{0, 1\}$$ elements, constructed as $$ {\zeta}_{\mathcal{M}}=(1^{T}_{a+K} {\pi}^{T}_{1} {\pi}_{1}, \ldots, 1^{T}_{a+K} {\pi}^{T}_{M} {\pi}_{M})^{{ \mathrm{\scriptscriptstyle T} }}, $$ where $$|\mathcal{M}|$$ is the number of models and $${\pi}_{m}$$ is a $$|m|\times(a+K)$$ projection matrix that selects those covariates which belong to model $$m$$.

First consider $$\mathcal{M}=\mathcal{M}_{{\rm all}}$$, the set of all possible submodels of a largest model. Denote by $$\mathcal{M}_O \subseteq \mathcal{M}_{{\rm all}}$$ the set of all overparameterized models, including the true model, so the models in $$\mathcal{M}_O$$ are overlapping. In model $$M$$ the estimator of $${\vartheta}$$ is denoted by $$\hat{{\theta}}(M)$$, with zeros added for components not in $$M$$. For any vector $$\nu$$, let $$\tilde\nu(M)$$ denote its subvector corresponding to the variables in model $$M$$. Under the orthogonality condition, Assumption A1(v), Proposition 2 is similar to the nested model case. Otherwise, we follow Vuong (1989) for testing in overlapping models. Define $$\Sigma(\theta)$$ as a partitioned matrix with $$(i,j)$$th block equal to $$\Sigma_{M_i,M_j}={Q}_{M_i}^{-1}({\theta}){J}_{ij}({\theta}, {\theta}){Q}_{M_j}^{-1}({\theta})$$.

Proposition 2. Assume conditions $${\rm (i)-(iv)}$$ of Assumption A1 and selection from $$\mathcal{M}_{{\rm all}}$$.

(I) If Assumption A1$${\rm (v)}$$ holds, the selection region for model $$M$$ is
\begin{align*} & \mathcal{A}_M(\mathcal{M}_O) \\ &\quad{} = \!\bigl\{{z}\in\mathbb{R}^{a+K}\!: \!\{{1}_{(|\mathcal{M}_O|-1)}\otimes ({1}^{T}_{K} {\pi}^{T}_{M} {\pi}_{{M}}) - \zeta_{{\mathcal{M}}_O\setminus M}\}\{(z_1^2-2),\ldots,(z_{a+K}^2-2)\}^{{ \mathrm{\scriptscriptstyle T} }} > 0\bigr\}\text{.} \end{align*}
The conditional limiting cumulative distribution function of the post-selection estimator is
\begin{align} F_M({t}) & = \mathop{\lim}_{n \rightarrow \infty } \mathrm{pr}\big[n^{1/2}\{\hat{{\theta}}(M)-{\vartheta}\}\leqslant {t} \mid M_{{\small{\text{AIC}}}}=M, \mathcal{M}_{{\rm all}}\big] \nonumber \\ & = \mathrm{pr}\{{J}_{M}^{-1/2}({\vartheta}) \tilde{{Z}}(M) \leqslant \tilde{{t}}(M) \mid {Z} \in \mathcal{A}_M({\mathcal{M}}_O)\}I({t}\in \mathcal{T}_{M}), \end{align} (3)
where $$\mathcal{T}_{M}$$ is $$\mathbb{R}^{|M|} \times (\mathbb{R}^+)^{K-|M|}$$ and $${J}_{M}(\vartheta)$$, $$\tilde{{Z}}(M)$$ and $$\tilde{{t}}(M)$$ are submatrices of, respectively, $${J}({\vartheta})$$, $${Z}=(Z_1, \dots, Z_{a+K})$$ and $${t}$$, corresponding to the variables in model $$M$$.
(II) If Assumption A1$${\rm (v)}$$ does not hold, define $$m=\sum_{M\in \mathcal{M}_O}|M|$$ and let $$W_{{\small{\text{AIC}}},i}$$ be a matrix partitioned in the same way as $$\Sigma(\vartheta)$$ with diagonal blocks that correspond to $$M_{{\small{\text{AIC}}}}$$ and $$M_i$$ equal to $$Q_{M_{\small{\text{AIC}}}}(\vartheta)$$ and $$-Q_{M_i}(\vartheta)$$, respectively, and with zeros elsewhere. The selection region for model $$M_{{\small{\text{AIC}}}}$$ is
\begin{equation*} \mathcal{A}_M(\mathcal{M}_O)\!=\! \bigl\{{z}\in\mathbb{R}^{m}\!:\!{{{z}}}^{{ \mathrm{\scriptscriptstyle T} }} \Sigma^{1/2}(\vartheta) W_{{\small{\text{AIC}}},i} \Sigma^{1/2}(\vartheta){z} \geqslant 2 (|M_{\small{\text{AIC}}}|-|M_i|), \, M_i \in {\mathcal{M}}_O\!\!\setminus\!\!M_{{\small{\text{AIC}}}}\bigr\}\text{.} \end{equation*}
Let $$\tilde{{Z}}(M)$$ denote the subvector of $${Z}\sim N_m(0,I)$$, $$Z\in \mathcal{A}_M(\mathcal{M}_O)$$, which contains only those components that correspond to components in the selected model $$M$$; then
\begin{equation} F_M({t}) = \mathrm{pr}\{{J}_{M}^{-1/2}({\vartheta}) \tilde{{Z}}(M) \leqslant \tilde{{t}}(M) \mid {Z} \in \mathcal{A}_M({\mathcal{M}}_O)\} I({t}\in \mathcal{T}_{M}), \end{equation} (4)
where $$\mathcal{T}_{M}$$ is $$\mathbb{R}^{|M|} \times (\mathbb{R}^+)^{m-|M|}$$.

The choice of $$\mathcal{M}$$ is important. Regarding (I), the constraint involves those $$Z_i$$ corresponding to the parameters in the selected model $$M_{{\small{\text{AIC}}}}$$ that are not in the smallest true model $$M_{{\rm pars}}$$; hence no constraints are placed on the $$Z_i$$ corresponding to parameters that occur in every model. Nevertheless, the selection affects the distribution of all parameters, even those common to all models. The effect of the set of models is illustrated by the following example. Let $$K=2$$ and $$a=1$$, and let $$M_0$$ be the smallest true model, containing only $$\theta_1$$. Suppose that Assumption A1(v) holds and that the full model $$M_{{\small{\text{AIC}}}}=(\theta_1 , \theta_2, \theta_3)$$ is selected in both $$\mathcal{M}_{{\rm nest}}$$ and $$\mathcal{M}_{{\rm all}}$$. Then $$\mathcal{A}_M(\mathcal{M}_{{\rm all}})=\{{z}\in\mathbb{R}^{3}: z_2^2>2,\ z_3^2>2,\ z_2^2+z_3^2>4 \}$$, while $$\mathcal{A}_M(\mathcal{M}_{{\rm nest}})=\{{z}\in\mathbb{R}^{3}: z_3^2>2,\ z_2^2+z_3^2>4 \}$$. Figure 2 depicts these regions for both $$\mathcal{M}_{{\rm nest}}$$, shaded area, and $$\mathcal{M}_{\mathrm{all}}$$, double-shaded area. If one selects the full model in $$\mathcal{M}_{{\rm nest}}$$, then $$Z_2$$ may take any value in $$\mathbb{R}$$ as long as $$Z_2^2+Z_3^2>4$$, while selection in $$\mathcal{M}_{{\rm all}}$$ requires both $$Z_2$$ and $$Z_3$$ to lie in $$(-\infty, - 2^{1/2})\cup (2^{1/2}, \infty)$$. The distribution of the parameter estimators is obtained by premultiplying $$Z=(Z_1, Z_2, Z_3)$$ by $$J^{-1/2}_{M_{\small{\text{AIC}}}}(\vartheta)$$. For normal linear models $${Y}\sim N_n({X}{\vartheta},\sigma^2{I})$$ with $$M_{\small{\text{AIC}}} \in \mathcal{M}_{O}$$, the distributional results are also exact in finite samples. In such models $${J}({\vartheta})=n^{-1}{X}^{{ \mathrm{\scriptscriptstyle T} }}{X}/\sigma^2$$, which does not depend on $${\vartheta}$$. For (II) the main difference is that we need the joint distribution of the estimators in the different models, and the constraints apply to the full vector.
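The following minimal sketch (ours, for illustration only) contrasts the two regions of the example above by Monte Carlo: it estimates the probability of each selection event and the conditional spread of $$Z_2$$, showing how the all-subsets region constrains $$Z_2$$ more strongly than the nested one.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal((1_000_000, 3))            # (Z1, Z2, Z3), a = 1, K = 2

in_nest = (z[:, 2] ** 2 > 2) & (z[:, 1] ** 2 + z[:, 2] ** 2 > 4)
in_all = in_nest & (z[:, 1] ** 2 > 2)

print("pr(select full model | M_nest):", in_nest.mean())
print("pr(select full model | M_all): ", in_all.mean())
# Conditional spread of Z2: larger under M_all, where |Z2| > 2^{1/2} is forced.
print("sd(Z2 | nested region):     ", z[in_nest, 1].std())
print("sd(Z2 | all-subsets region):", z[in_all, 1].std())
```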
Fig. 2. Allowable domain of $$Z_2$$ and $$Z_3$$ for nested model selection (shaded) and all-subsets selection (double-shaded) when $$\small{\text{AIC}}$$ selects the full model.

3.2. Confidence regions

For an arbitrary set of models $$\mathcal{M}_{{\rm arb}}$$ with $$\mathcal{M}_{{\rm arb}} \cap \mathcal{M}_O\not=\emptyset$$, by Assumption A1(i), (3) still holds upon replacing $$\mathcal{A}_M(\mathcal{M}_O)$$ with $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}} \cap \mathcal{M}_O)$$. With $$M_{{\small{\text{AIC}}}}=M$$ selected from $$\mathcal{M}_{{\rm arb}}$$, the confidence region for $${\vartheta}$$ is
\begin{equation} C(q_\alpha) = \big\{{\theta}\in\mathbb{R}^{a+K}: n\{{\hat{\theta}}'(M)-\tilde{{\theta}}(M)\}^{{ \mathrm{\scriptscriptstyle T} }}{J}_{M}({\theta}) \{{\hat{\theta}}'(M)-\tilde{{\theta}}(M)\} \leqslant q_{\alpha}\big\}, \end{equation} (5)
where $${\hat{\theta}}'(M)$$ is the $$|M|$$-vector of nonzero values of $$\hat{{\theta}}(M)$$ and $$q_{\alpha}$$ is determined by solving
\begin{equation} \frac{\mathrm{pr}\{(\sum_{i \in M} Z_i^2 \leqslant q_{\alpha}) \cap {Z} \in \mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O) \} } {\mathrm{pr}\{{Z} \in \mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O)\}}=1-\alpha\text{.} \end{equation} (6)
Let $$f_M\{\tilde{{t}}(M)\}=\phi_M\{\tilde{{t}}(M)\mid \mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O);J_M^{-1}({\vartheta})\}$$ denote the density of $$n^{1/2}\{{\hat{\theta}}'(M)-\tilde{{\vartheta}}(M)\}$$, a truncated $$|M|$$-dimensional normal density. The quantile of its $$j$$th component is obtained via
\begin{eqnarray*} \int_{\mathcal{R}_{\alpha}}f_M\{\tilde{{t}}(M)\} \,\mathrm{d} \tilde{{t}}(M) = 1-\alpha, \end{eqnarray*}
where $$\mathcal{R}_{\alpha}\subset\mathbb{R}^{|M|}$$ restricts only the $$j$$th component to $$[-q_{\alpha/2},q_{\alpha/2}]$$. The confidence interval for $$\vartheta_j$$ is $$\hat{\theta}_j(M)\pm q_{\alpha/2}n^{-1/2}$$.

While there is no uniform convergence of the distribution function in all settings (Leeb & Pötscher 2003), for normal linear models using rectangular confidence regions and sequential testing, a uniform result regarding coverage has been obtained by Pötscher (1995). The following result holds for overspecified models. For models in the set $$\mathcal{M}_{O}$$, all parameter components that appear in the true model are nonzero, but there may be additional parameter components, which could be zero or nonzero. However, the set $$\mathcal{M}_{O}$$ does not depend on the value of the true parameter $$\vartheta$$. After conditioning on $$M_{\small{\text{AIC}}}\in \mathcal{M}_{O}$$, the set $$C(q_\alpha)$$ is random due to maximum likelihood estimation in the selected model.
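A sketch of how (6) can be solved numerically (ours; the all-subsets region of the example in § 3.1, with $$a=1$$ and $$K=2$$, is an assumption made for concreteness): $$q_\alpha$$ is approximated by the empirical $$(1-\alpha)$$ quantile of $$\sum_{i\in M} Z_i^2$$ over draws satisfying the selection constraints.

```python
import numpy as np

rng = np.random.default_rng(4)

def q_alpha_from_region(alpha, region, dim, n_draws=500_000):
    """Solve (6) by Monte Carlo: (1 - alpha) quantile of sum_i Z_i^2 given Z in the region."""
    z = rng.standard_normal((n_draws, dim))
    keep = region(z)
    return np.quantile(np.sum(z[keep] ** 2, axis=1), 1 - alpha)

# Selection region of the full model in M_all for the example of § 3.1 (a = 1, K = 2);
# the constraint z2^2 + z3^2 > 4 is implied by the two marginal constraints.
region_all = lambda z: (z[:, 1] ** 2 > 2) & (z[:, 2] ** 2 > 2)
q = q_alpha_from_region(0.05, region_all, dim=3)
print(q)   # exceeds the chi-squared(3) 95% quantile of about 7.81
```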
Proposition 3. Assume conditions $${\rm (i)-(iv)}$$ of Assumption A1 and that $${Q}_n({\theta})$$ in $${\rm (ii)}$$ is continuous over a compact set $$\Theta$$ that contains $${\vartheta}$$. The confidence region $$C(q_\alpha)$$ from (5) is such that $$ \lim_{n\rightarrow \infty} \inf_{{\vartheta} \in \Theta} \mathrm{pr}_{\vartheta} \{{\vartheta} \in C(q_\alpha)\mid M_{{\small{\text{AIC}}}}\in \mathcal{M}_{O}\} = 1- \alpha\text{.} $$ When $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}})$$ replaces $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O)$$ in (6) to obtain a value $$\tilde q_\alpha$$, $$\:\lim_{n\rightarrow \infty} \inf_{{\vartheta} \in \Theta} \mathrm{pr}\{{\vartheta} \in C(\tilde q_\alpha) \mid M_{\small{\text{AIC}}}\in \mathcal{M}_{O}\} \ge 1- \alpha$$.

One limitation of the Akaike information criterion is that the selection of an overspecified model does not happen in a uniform way (Leeb & Pötscher 2003); hence this result cannot be strengthened. If the selected model is underparameterized, correct inference can be obtained for the pseudo-true values instead; see § 4. For a predetermined number of steps in forward selection, least angle regression or the lasso in linear additive error models, Tibshirani et al. (2015) obtained asymptotic results which are uniformly valid for a specific class of nonnormal errors. For a comparison between two models, Andrews & Guggenberger (2009) used a local neighbourhood to deal with the overselection and to obtain uniform results for parameters that were not subject to selection. Chernozhukov et al. (2015) performed uniformly valid inference on a low-dimensional parameter when there is selection in a high-dimensional vector of nuisance parameters. See also Belloni et al. (2015) for the use of least absolute deviation in high-dimensional regression.

Inference after selection depends on (i) the set of models $$\mathcal{M}$$ specified by the researcher and (ii) the smallest true model $$M_{{\rm pars}}$$, in nested models $$p_0$$, via $$\mathcal{A}_M(\mathcal{M} \cap \mathcal{M}_O)$$. In $$\mathcal{M}_{{\rm nest}}$$ and $$\mathcal{M}_{{\rm all}}$$ one could take the smallest model for $$M_{{\rm pars}}$$. If this model is true or overparameterized, Propositions 1 and 2 hold and the asymptotic confidence intervals can be calculated exactly. If the smallest model is underparameterized, the structure of the additional constraints $$\mathcal{A}_M(\mathcal{M})\setminus\mathcal{A}_M(\mathcal{M} \cap \mathcal{M}_O)$$ is such that the resulting distribution of the parameters is longer-tailed. This leads to conservative confidence intervals, especially for the parameters that are truly nonzero. In practice we calculate the constraints based on the selected model and $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}})$$. For case (i), in $$\mathcal{M}_{{\rm all}}$$ the number of constraints equals $$2^{K-|M_0|}-1$$. Here, we show that $$\mathcal{A}_M({\mathcal{M}}_O)$$ can be reduced to the set $$ \bigcap_{i \in M_{{\small{\text{AIC}}}}\setminus M_{{\rm pars}}} \left\{{z}\in\mathbb{R}^{a+K}: z_i^2 > 2\right\} \cap \bigcap_{i \in \{1,\ldots,a+K\}\setminus M_{{\small{\text{AIC}}}}} \left\{{z}\in\mathbb{R}^{a+K}: z_i^2 < 2\right\}$$ without losing information. Let $$\mathcal{I}_{M_{{\small{\text{AIC}}}}}$$ denote the set consisting of all subsets of the indices in $$M_{{\small{\text{AIC}}}}\setminus M_{{\rm pars}}$$, referring to the redundant selected parameters, and denote by $$\mathcal{I}_{M_{{\small{\text{AIC}}}}}^{\rm c}$$ the set of all subsets of the indices in $$\{1,\ldots,a+K\}\setminus M_{{\small{\text{AIC}}}}$$, referring to the variables that were not selected.
Then
\begin{eqnarray*} \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M}\cap \mathcal{M}_O)= \bigcap_{I \in \mathcal{I}_{M_{\small{\text{AIC}}}}} \bigcap_{J \in \mathcal{I}^c_{M_{\small{\text{AIC}}}}}\left\{{z}\in\mathbb{R}^{a+K}: \sum_{i \in I} z_i^2 -\sum_{j \in J} z_j^2 > 2(|I|- |J|)\right\} \\ \cap \bigcap_{i\in M_{\small{\text{AIC}}}\backslash M_{\rm pars}} \left\{{z}\in\mathbb{R}^{a+K}: (z_i^2>2) \right\} \cap \bigcap_{i \in \{1,\ldots,a+K\}\backslash M_{\small{\text{AIC}}}}\left\{{z}\in\mathbb{R}^{a+K}: (-z_i^2>-2)\right\}\!\text{.} \end{eqnarray*}
The last two sets of constraints consist, respectively, of $$|M_{{\small{\text{AIC}}}}|-|M_{{\rm pars}}|$$ and $$a+K-|M_{{\small{\text{AIC}}}}|$$ elements. The first set only involves constraints that are summations of the constraints in the last two sets and does not add any new restrictions on $${z}$$. The constraint set for any $$\mathcal{M}_{{\rm arb}}$$ can be simplified whenever some constraints are implied by sums of other constraints. Removing redundant constraints is not always possible, for instance for $$\mathcal{M}_{{\rm nest}}$$.

3.3. Inference for linear combinations

For inference on linear combinations $${x}^{T}{\vartheta}$$ after model selection, we rewrite (3) as
\begin{align} F(t)&=\mathop{\lim}_{n \rightarrow \infty} \mathrm{pr}\big[n^{1/2}\,\tilde{{x}}^{T}(M) \{{\hat{\theta}}'(M)-\tilde{{\vartheta}}(M)\}\leqslant t \mid M_{{\small{\text{AIC}}}}=M, \mathcal{M}\big]\nonumber\\ &=\mathrm{pr} \{\tilde{{x}}^{T}(M) \,{J}_{M}^{-1/2}({\vartheta}) \tilde{{Z}}(M) \leqslant t \mid \mathcal{A}_M(\mathcal{M}\cap\mathcal{M}_O)\}, \end{align} (7)
where $$\tilde{{x}}(M)$$ contains the components of $${x}$$ corresponding to the variables in model $$M$$. The asymptotic distribution of the estimated linear combination is simulated via (7). When the sample size is small and the diagonal entries of $$J(\hat{{\theta}})$$ are large, it may happen that an underparameterized model is selected. In this case the coverage probability of confidence regions for a linear combination of the parameters, or a transformation thereof in generalized linear models, may be smaller than the nominal value. In cases of suspected underselection, one can use
\begin{equation} \mathop{\lim}_{n \rightarrow \infty}\mathrm{pr}[n^{1/2}{x}^{T}\{\hat{{\theta}}(M_{{\rm full}})-{\vartheta}\}\leqslant t \mid M_{{\small{\text{AIC}}}}\!=\!M, \mathcal{M}] =\mathrm{pr}\{{x}^{T}{J}^{-1/2}({\vartheta}) {Z}_{a+K} \leqslant t \mid \mathcal{A}_M(\mathcal{M})\},\quad \end{equation} (8)
where $$M_{{\rm full}}$$ is the full model. This differs from (7) in using all parameters, not just the selected ones. This procedure is different from assuming that the full model is selected since, for example in $$\mathcal{M}_{\rm all}$$, $$\:\mathcal{A}_M(\mathcal{M}_{\rm all})$$ contains $$z_i^2>2$$ for the parameters which are selected and $$z_i^2<2$$ for those which are not, whereas $$\mathcal{A}_{M_{{\rm full}}}(\mathcal{M}_{{\rm all}})$$ contains $$z_i^2>2$$ for all parameters, leading to a longer-tailed distribution. The probability of underselection disappears asymptotically. The valid confidence intervals of Bachoc et al. (2015) target the true value for the selected model, not the true value $${x}^{T}{\vartheta}$$. While in their case underparameterized selection is not an issue, there is no guarantee that their proposed confidence interval is valid for the true value.
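The following minimal sketch (ours; the region, $$J_M$$ and $$x$$ below are hypothetical placeholders) simulates (7): draws of $$Z$$ restricted to the selection region are mapped to $$\tilde{x}^{T}(M)J_M^{-1/2}\tilde{Z}(M)$$, and an empirical quantile gives the half-width of the post-selection confidence interval for $$x^{T}\vartheta$$.

```python
import numpy as np

rng = np.random.default_rng(5)

def linear_combo_quantile(x_M, J_M, region, alpha, n_draws=500_000):
    """Simulate (7): quantile of |x^T J_M^{-1/2} Z| given Z in the selection region."""
    dim = len(x_M)
    z = rng.standard_normal((n_draws, dim))
    keep = region(z)
    vals, vecs = np.linalg.eigh(J_M)                 # J_M symmetric positive definite
    J_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    combo = z[keep] @ J_inv_sqrt @ x_M
    return np.quantile(np.abs(combo), 1 - alpha)

# Hypothetical selected model with three parameters; all-subsets-type constraints on Z2, Z3.
J_M = np.array([[1.0, 0.2, 0.1], [0.2, 1.0, 0.2], [0.1, 0.2, 1.0]])
x_M = np.array([1.0, 0.5, -0.5])
region = lambda z: (z[:, 1] ** 2 > 2) & (z[:, 2] ** 2 > 2)

q = linear_combo_quantile(x_M, J_M, region, alpha=0.05)
# With estimate theta_hat(M) and sample size n, the interval is x^T theta_hat +/- q / sqrt(n).
print(q)
```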
4. Confidence regions when all models are misspecified

4.1. Limiting distribution of estimators

The results in this section do not require any assumption about the existence of a true model, are uniformly valid, and apply to general parametric likelihood models. In order to obtain uniformly valid results, we consider the setting where there is no true parameter vector, either because the true density of the data does not belong to a parametric family or because all models are misspecified. We assume that the observations are represented by a triangular array $$\{Y_{ni}: i=1,\ldots,n,\: n\in\mathbb{N}\}$$, with independence between the rows, i.e., for different sample sizes $$n$$, and within the rows, i.e., $$Y_{ni}$$ and $$Y_{nj}$$ are independent for $$i\neq j$$. Regression models are included, as observations may have different distributions. The true joint density of $$(Y_{n1},\ldots,Y_{nn})$$ is $$g_n$$, with distribution function $$G_n$$. All probabilities are computed under the true distribution, so $$\mathrm{pr}=\mathrm{pr}_{G_n}$$. The data are modelled via models $$M_{n,j}=\{\prod_{i=1}^nf_{j,i}(y_i;\theta_j): \theta_j\in\Theta_j \subset\mathbb{R}^{m_j}\}$$. Thus $$m_j$$ is the number of parameters in model $$M_{n,j}$$. All models are collected in the set $$\mathcal{M}_{n}=\{M_{n,1},\ldots,M_{n,J}\}$$. When confusion is unlikely, we omit the subscript $$n$$ from the notation. We assume for each $$n\in\mathbb{N}$$ that $$\int g_n(y)\log g_n(y) \,{\rm d} y <\infty$$; this defines the class of true distributions $${\mathcal{G}}_n$$.

Regarding the models, assume that for each $$i\in\mathbb{N}$$ and each $$j=1, \ldots, J$$, $$\:f_{j,i}(\cdot\,;\theta_j)$$ is measurable for all $$\theta_j\in\Theta_j$$, a compact set, and $$f_{j,i}(y_i;\cdot)$$ is continuous on $$\Theta_j$$ almost surely and continuously differentiable on $$\Theta_j$$. Then for every model there exists (White 1994, Theorem 2.12) an estimator $$\hat\theta_{n,j}$$ maximizing $$\prod_{i=1}^nf_{j,i}(y_i;\theta_j)$$ over $$\Theta_j$$. If $$E_{G_n}\{n^{-1}\sum_{i=1}^n \log f_{j,i}(y_i;\theta_j)\}$$ has an identifiable unique maximizer over $$\Theta_j$$, this maximizer is called the pseudo-true value $$\vartheta_{n}^*(M_j)$$. This value depends on the true joint density, the model densities, and the sample size. We define two vectors of length $$m'=\sum_{j=1}^J m_j$$: $$\vartheta_{n,\mathcal{M}}^* =\{\vartheta_{n}^{*T}(M_1),\ldots,\vartheta_{n}^{*T}(M_J)\}^{{ \mathrm{\scriptscriptstyle T} }}$$ and $$\hat\theta_{n,\mathcal{M}} =\{\hat\theta_{n}^{T}(M_1),\ldots,\hat\theta_{n}^{T}(M_J)\}^{{ \mathrm{\scriptscriptstyle T} }}$$.

Lemma 2. Let $$\{Y_{ni}: i=1,\ldots,n, \: n\in\mathbb{N}\setminus 0\}$$ form a triangular array consisting of independent random variables.
(i) For all components of the vector $$\vartheta_{n,\mathcal{M}}^*$$, stated here for the $$k$$th such component of $$\theta_j$$ corresponding to model $$M_j$$, assume that for all $$G_n\in\mathcal{G}_n$$, with $$ A=\{y_i\in\mathbb{R}: | (\partial/\partial\theta_k)\log f_{j,i}\{y_i;\theta_{n}^*(M_j)\}| >\varepsilon nQ_{M_j,kk}\{\vartheta_{n}^*(M_j)\}\} $$ and for all $$\varepsilon>0$$, $$ \lim_{n\to\infty}\sum_{i=1}^n\int_A \left[\frac{\partial}{\partial\theta_k}\log f_{j,i}\{y_i;\vartheta_{n}^*(M_j)\}\right]^2\Big/ \big[nQ_{M_j,kk}\{\vartheta_{n}^*(M_j)\}\big] \,{\rm d} G_{ni}(y_i) = 0\text{.} $$

(ii) Write $$\Sigma_{M_j}\{\vartheta_{n}^*(M_j)\} = Q_{M_j}^{-1}\{\vartheta_{n}^*(M_j)\} J_{jj}\{\vartheta_{n}^*(M_j),\vartheta_{n}^*(M_j)\}Q_{M_j}^{-1}\{\vartheta_{n}^*(M_j)\}$$, and assume that
\begin{align*} &\lim_{n\to\infty}\max_{i=1,\ldots,n}\mathrm{pr}_{G_n}\!\!\left(\! (\Sigma_{M_j,kk})^{-1/2}n^{-1/2} \bigl[Q^{-1}_{M_j}\{\vartheta_{n}^*(M_j)\}\bigr]_{kk} \left|\frac{\partial}{\partial\theta_k}\log f_{j,i}\{y_i;\vartheta_{n}^*(M_j)\}\right|>\varepsilon\!\right) = 0\text{.} \end{align*}
Define $$\mathcal{W}_n\sim N_{m'}\{0,\Sigma(\vartheta_{n,\mathcal{M}}^*)\}$$, where $$\Sigma(\vartheta_{n,\mathcal{M}}^*)$$ is an $$m'\times m'$$ matrix with $$(i,j)$$th block, of size $$m_i\times m_j$$, equal to $$Q_{M_i}^{-1}\{\vartheta_{n}^*(M_i)\} J_{ij}\{\vartheta_{n}^*(M_i),\vartheta_{n}^*(M_j)\} Q_{M_j}^{-1}\{\vartheta_{n}^*(M_j)\}$$. Then $$ \lim_{n\to\infty}\sup_{t\in \mathbb{R}^{m'}}\sup_{G_n\in\mathcal{G}_n} \Big| \mathrm{pr}\{n^{1/2}(\hat\theta_{n,\mathcal{M}} - \vartheta_{n,\mathcal{M}}^*) \le t\} - \mathrm{pr}(\mathcal{W}_n\le t)\Big| = 0\text{.} $$

A pivot is needed in order to construct confidence regions. In general, the variance $$\Sigma(\vartheta^*_{n,\mathcal{M}})$$ of $$\mathcal{W}_n$$ may depend on $$\vartheta^*_{n,\mathcal{M}}$$. When there is an estimator $$\hat\Sigma$$ of $$\Sigma$$ such that $$ \lim_{n\to\infty}\sup_{G_n\in\mathcal{G}_n} \mathrm{pr}_{G_n}(\|\hat\Sigma_n-\Sigma\|>\varepsilon)=0, $$ where $$\|A\|$$ denotes the Euclidean matrix operator norm of $$A$$, then, with $$\mathcal{Z}_{m'}\sim N_{m'}(0,I_{m'})$$, $$ \lim_{n\to\infty}\sup_{G_n\in\mathcal{G}_n}\sup_{t\in\mathbb{R}^{m'}} \Big| \mathrm{pr}\{\hat\Sigma_n^{-1/2}n^{1/2}(\hat\theta_{n,\mathcal{M}} - \vartheta_{n,\mathcal{M}}^*)\le t\}-\mathrm{pr}(\mathcal{Z}_{m'}\le t)\Big|=0\text{.} $$

The model determines whether or not the variance can be estimated well. White (1994, § 8.3) gives some general conditions for consistent estimation of the variance. One requirement is that $$n^{-1}\sum_{i=1}^nE(s)E(s^{{ \mathrm{\scriptscriptstyle T} }}) \to 0$$, where $$s$$ is the vector of length $$m'$$ consisting of subvectors $$(\partial/\partial\theta_k)\log f_{k,i}(Y_i;\vartheta_k^{*})$$ for $$k=1,\ldots,J$$. This assumption holds, for example, when the models are correctly specified. Under misspecification, White (1994, § 8.3) showed that the empirical estimator for $$\Sigma(\vartheta_{n,\mathcal{M}}^*)$$ could overestimate the covariance matrix, leading to conservative confidence intervals.
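For completeness, here is a minimal sketch of the standard empirical sandwich estimator $$Q^{-1}JQ^{-1}$$ that the discussion above refers to (ours; a Gaussian working likelihood with unit working variance and the toy heteroscedastic data are assumptions made purely for illustration, not the paper's estimator).

```python
import numpy as np

def sandwich_covariance(X, y, beta_hat):
    """Empirical sandwich estimator Q^{-1} J Q^{-1} for a Gaussian working likelihood
    with unit working variance; a standard plug-in, used here only as an illustration."""
    n = len(y)
    resid = y - X @ beta_hat
    scores = X * resid[:, None]              # per-observation score contributions
    Q = X.T @ X / n                          # minus the average Hessian
    J = scores.T @ scores / n                # average outer product of the scores
    Q_inv = np.linalg.inv(Q)
    return Q_inv @ J @ Q_inv                 # plug-in for Sigma_{M_j} in Lemma 2(ii)

# Toy misspecified fit: heteroscedastic data, homoscedastic working model.
rng = np.random.default_rng(6)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 2.0 * X[:, 1] + np.abs(X[:, 1]) * rng.normal(size=n)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(sandwich_covariance(X, y, beta_hat))
```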
4.2. Selection region in a misspecified setting

When $$\mathcal{M}$$ consists of misspecified models, calculating the selection event requires additional care. Define $$\ell_{n,M_j}(y,\theta_j)=\sum_{i=1}^n \log f_{j,i}(y_i,\theta_j)$$. When model $$M_{\small{\text{AIC}}}$$ is selected, then for all $$M \in {\mathcal{M}}\setminus M_{\small{\text{AIC}}}$$, $$\:2[\ell_{n,M_{\small{\text{AIC}}}}\{y,\hat{\theta}_n(M_{\small{\text{AIC}}})\}-\ell_{n,M}\{y,\hat{\theta}_n(M)\}]\geqslant 2(|M_{\small{\text{AIC}}}|-|M|)$$. When both models $$M_{\small{\text{AIC}}}$$ and $$M$$ are correctly specified, the difference of the loglikelihoods can be characterized asymptotically by chi-squared random variables. However, when there is misspecification this difference can diverge to $$+\infty$$ or $$-\infty$$, depending on the assumptions about the models. For strictly nonnested models, the difference always diverges (Vuong 1989, Theorem 5.1). When the selected model is always the best one, there is no restriction on the parameter estimators. See also Cox & Hinkley (1974, § 9.3) for the asymptotic behaviour of likelihood ratio tests in nonnested settings. For overlapping models having some common parameters, the loglikelihood difference converges to some random variable if one of the models is correctly specified, and diverges otherwise.

Under misspecification of all models, the only situation where the asymptotic distribution can be used to characterize the selection event is the case of nested models under similarity of the likelihoods (Vuong 1989, Assumption A8). This means that $$\ell_{n,M_k}\{y,\vartheta_n^*(M_k)\}=\ell_{n,M_l}\{y,\vartheta_n^*(M_l)\}$$ for $$k,l=1,\ldots,K$$. For an arbitrary set of models we impose the same similarity condition and assume that $$\mathcal{M}$$ includes a model $$M_{{\rm s}}=M_{{\rm small}}$$ which is nested in all other models. If we were to perform a likelihood ratio test under this assumption, it would correspond to testing whether the smaller model can be considered equal to or worse than the larger model (Vuong 1989, Lemma 7.1). We first compare each model with the smallest model and then use the regions obtained from each comparison to compute the final selection region using pairwise comparisons. By imposing similarity, the calculated quantiles to be used in the confidence regions are larger than without similarity since, as explained above, the loglikelihood difference diverges otherwise and there is no restriction on the parameter estimators. For all $$M \in \mathcal{M}\setminus M_{{\rm s}}$$,
\begin{align} &2[\ell_{n,M}\{y,\hat{\theta}_n(M)\}-\ell_{n,M_{\rm s}}\{y,\hat{\theta}_n(M_{\rm s})\}] \nonumber\\ &\quad = n\{\hat{\theta}_n(M)-\vartheta_n^*(M)\}^{{ \mathrm{\scriptscriptstyle T} }}Q_M\{\vartheta_n^*(M)\}\{\hat{\theta}_n(M)-\vartheta_n^*(M)\} \nonumber\\ &\qquad - n\{\hat{\theta}_n(M_{\rm s})-\vartheta_n^*(M_{\rm s})\}^{{ \mathrm{\scriptscriptstyle T} }}Q_{M_{\rm s}}\{\vartheta_n^*(M_{\rm s})\}\{\hat{\theta}_n(M_{\rm s})-\vartheta_n^*(M_{\rm s})\}+o_{\rm p}(1) \nonumber\\ &\quad = n(\hat{\theta}_{n,\mathcal{M}}-\vartheta_{n,\mathcal{M}}^*)^{{ \mathrm{\scriptscriptstyle T} }}W_{M,M_{\rm s}}(\hat{\theta}_{n,\mathcal{M}}-\vartheta_{n,\mathcal{M}}^*)+o_{\rm p}(1), \end{align} (9)
where $$W_{M,M_{{\rm s}}}$$ is a block-diagonal matrix partitioned in the same way as $$\Sigma$$, whose diagonal block referring to model $$M$$ equals $$Q_{M}\{\vartheta^*_n(M)\}$$, whose diagonal block referring to model $$M_{\rm s}$$ equals $$-Q_{M_{\rm s}}\{\vartheta^*_n(M_{\rm s})\}$$, and whose other entries are all zero. If the models are already nested, there is no need to compare each model with the smallest model.
The asymptotic counterpart of the selection event is
\begin{align} \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})=\Big\{z \in \mathbb{R}^{m'}: z^{{ \mathrm{\scriptscriptstyle T} }} \Sigma^{1/2} (W_{M_{\small{\text{AIC}}},M_{\rm s}}-W_{M,M_{\rm s}}) \Sigma^{1/2}z \geqslant 2(|M_{\small{\text{AIC}}}|-|M|),\nonumber\\ M \in \mathcal{M}\setminus M_{\small{\text{AIC}}}\Big\}\text{.} \end{align} (10)

Proposition 4. Suppose that the assumptions of Lemma 2 hold. Then, for a set of models with $$\mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})$$ from (10),
\begin{align*} \begin{split} \lim_{n\rightarrow\infty} \sup_{G_n \in \mathcal{G}_n} \sup_{t \in \mathbb{R}^{|M_{\small{\text{AIC}}}|}} \Big|\mathrm{pr}[n^{1/2}\{\hat{\theta}(M_{\small{\text{AIC}}})-\vartheta^*(M_{\small{\text{AIC}}})\} \leqslant t \mid M_{\small{\text{AIC}}}] & \\ \nonumber -\,\mathrm{pr}\{\Sigma^{1/2}Z \leqslant t \mid \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})\}\Big|& =0\text{.} \end{split} \end{align*}

As noted by Tibshirani et al. (2015), uniform convergence in distribution can be translated into uniformly valid confidence sets. The following proposition clarifies this statement. The proof is similar to that of Proposition 4, using the fact that a continuous mapping preserves uniform convergence.

Proposition 5. Suppose that the assumptions of Lemma 2 hold and that the set of models $$\mathcal{M}$$ contains a smallest model which is nested in all the models. Define the set
\[ C^*(q_{\alpha})\! =\! \bigl\{\theta \in \mathbb{R}^{|M_{\small{\text{AIC}}}|}\! : n \{\hat{\theta}(M_{\small{\text{AIC}}})-\theta(M_{\small{\text{AIC}}})\}^{{ \mathrm{\scriptscriptstyle T} }} \Sigma_{M_{\small{\text{AIC}}}}(\vartheta^*_{M_{\small{\text{AIC}}}})^{-1} \{\hat{\theta}(M_{\small{\text{AIC}}})-\theta(M_{\small{\text{AIC}}})\}\leqslant q_{\alpha}\bigr\}, \]
where $$q_{\alpha}$$ is determined by solving
\begin{align*} &\mathrm{pr}\bigl[ \{\tilde{{Z}}^{{ \mathrm{\scriptscriptstyle T} }}({M_{{\small{\text{AIC}}}}}) \Sigma_{M_{\small{\text{AIC}}}}(\vartheta^*_{M_{\small{\text{AIC}}}})^{-1} \tilde{{Z}}({M_{{\small{\text{AIC}}}}}) \leqslant q_{\alpha}\} \, \cap \{Z \in \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})\}\bigr]\\ &\quad = \mathrm{pr}\{Z \in \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})\}(1-\alpha)\text{.} \end{align*}
Then $$ \lim_{n\rightarrow\infty} \sup_{G_n \in \mathcal{G}_n} \sup_{\alpha \in [0,1]} \bigl| \mathrm{pr} _{G_n}\{ \vartheta^{*}(M_{\small{\text{AIC}}}) \in C^*(q_{\alpha})\mid M_{\small{\text{AIC}}} \} - (1-\alpha)\bigr|=0 $$.

5. Simulation study

5.1. Parameters in linear models

While the proposed method is applicable to general likelihood models, in order to compare it with existing methods we present simulation results for linear models only. Results for generalized linear models and other settings can be found in the Supplementary Material. The data are generated from the regression model $$ Y_i=\sum_{j=1}^{10} \vartheta_j x_{ji} + \varepsilon_i$$ $$(i=1,\ldots, n) $$ with $$\varepsilon_i\sim N(0,1)$$. The true value of the parameter vector is $${\vartheta}^{{ \mathrm{\scriptscriptstyle T} }}=(2{\cdot}25,-1{\cdot}1,2{\cdot}43,-2{\cdot}24,2{\cdot}5,{0}_5^{{ \mathrm{\scriptscriptstyle T} }})$$, with $${0}_5$$ denoting the vector of length 5 whose entries are all zero. We set $$x_{1i}=1$$ and $$(x_{2i}, \ldots, x_{10,i})^{{ \mathrm{\scriptscriptstyle T} }} \sim N({0}_{9}, {\Omega})$$, where $${\Omega}$$ is a positive-definite matrix with diagonal elements equal to 1 and off-diagonal entries equal to $$0{\cdot}25$$; a sketch of this data-generating process is given below.
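A minimal sketch of this data-generating process (ours; the paper's own code is in the Supplementary Material):

```python
import numpy as np

def simulate_data(n, rng):
    """One dataset from the § 5.1 design: 10 covariates, equicorrelated N(0, Omega)."""
    theta = np.array([2.25, -1.1, 2.43, -2.24, 2.5, 0, 0, 0, 0, 0])
    Omega = np.full((9, 9), 0.25) + 0.75 * np.eye(9)   # unit variances, 0.25 correlations
    X = np.column_stack([np.ones(n), rng.multivariate_normal(np.zeros(9), Omega, size=n)])
    y = X @ theta + rng.normal(size=n)
    return X, y

rng = np.random.default_rng(7)
X, y = simulate_data(30, rng)   # the study uses n = 30 and n = 100
```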
The sample size is either 30 or 100. Three different model sets were considered. Let $${\zeta}_{{\rm all}}^i$$ be the selection matrix when the first $$i$$ parameters are present in all models. We take $${\zeta}_{{\rm all}}^3$$, which is a $$2^7 \times 10$$ matrix, $${\zeta}_{{\rm all}}^6$$, which is a $$2^4 \times 10$$ matrix, and $${\zeta}_{{\rm arb}}$$, which contains 14 rows arbitrarily chosen from $${\zeta}_{{\rm all}}^3$$. We are interested in inference for the parameters in the selected model. In order to facilitate comparison, the simulations were run until the model $$M$$ with parameters $$(\vartheta_1,\ldots,\vartheta_6, \vartheta_8)$$ had been selected 3000 times. For each of those simulation runs, the Fisher information matrix was estimated in the full model by $$\skew6\hat{{J}}(\hat{{\theta}})$$, leading to the submatrix $$\skew6\hat{{J}}_M(\hat{{\theta}})$$. When Assumption A1(v) does not hold, one should use (4) to calculate the confidence intervals; however, we used (3) instead, and this resulted in good approximations. Quantiles of the limiting asymptotic distribution for each setting were obtained via simulation; see the Supplementary Material for the code. In each simulation run we computed the lower and upper limits of the confidence interval and report, in Table 1, the averaged confidence intervals along with the coverage percentages for $$\vartheta_4$$, $$\vartheta_6$$ and $$\vartheta_8$$. Results for the other parameters are omitted to save space.

Table 1. Results of the simulation study with $$3000$$ runs of AIC selection: average confidence intervals and coverage percentages for $$\vartheta_4, \vartheta_6$$ and $$\vartheta_8$$, using different selection matrices $${\zeta}$$ corresponding to different model sets $$\mathcal{M}$$ and two sample sizes $$n$$, for the proposed method, along with results obtained by the method of Berk et al. (2013) and by a naive approach that treats the selected model as given and ignores selection
n    Method    ϑ_j   ζ_all^3               ζ_all^6               ζ_arb
30   PostAIC   ϑ_4   [-2·85, -1·64]  98    [-2·68, -1·78]  92    [-2·85, -1·64]  97
               ϑ_6   [-0·60,  0·62]  94    [-0·45,  0·45]  93    [-0·60,  0·62]  96
               ϑ_8   [-0·60,  0·61]  94    [-0·60,  0·60]  95    [-0·61,  0·62]  96
     PoSI      ϑ_4   [-2·98, -1·51]  99    [-2·89, -1·57]  99    [-2·97, -1·52]  99
               ϑ_6   [-0·73,  0·75]  99    [-0·66,  0·66]  99    [-0·71,  0·73]  99
               ϑ_8   [-0·73,  0·74]  98    [-0·66,  0·67]  97    [-0·72,  0·73]  99
     Naive     ϑ_4   [-2·67, -1·82]  89    [-2·68, -1·79]  91    [-2·66, -1·83]  89
               ϑ_6   [-0·42,  0·43]  69    [-0·44,  0·44]  92    [-0·41,  0·42]  71
               ϑ_8   [-0·42,  0·43]  70    [-0·44,  0·45]  75    [-0·41,  0·43]  71
100  PostAIC   ϑ_4   [-2·54, -1·94]  99    [-2·46, -2·02]  94    [-2·55, -1·93]  99
               ϑ_6   [-0·30,  0·31]  95    [-0·22,  0·22]  95    [-0·31,  0·32]  96
               ϑ_8   [-0·30,  0·31]  95    [-0·29,  0·30]  95    [-0·31,  0·31]  97
     PoSI      ϑ_4   [-2·58, -1·90] 100    [-2·54, -1·94]  99    [-2·57, -1·90]  99
               ϑ_6   [-0·33,  0·34]  98    [-0·30,  0·30]  99    [-0·33,  0·34]  98
               ϑ_8   [-0·34,  0·34]  98    [-0·29,  0·31]  95    [-0·33,  0·34]  98
     Naive     ϑ_4   [-2·46, -2·02]  93    [-2·46, -2·02]  93    [-2·46, -2·02]  92
               ϑ_6   [-0·22,  0·22]  66    [-0·22,  0·22]  94    [-0·21,  0·22]  67
               ϑ_8   [-0·22,  0·22]  66    [-0·22,  0·23]  69    [-0·22,  0·22]  65

PostAIC, our proposed method; PoSI, the method of Berk et al. (2013); Naive, a naive approach that treats the selected model as given and ignores selection.
, \phantom{-}0{\cdot}73]$$ 99 $$\vartheta_8$$ $$[-0{\cdot}73 , \phantom{-}0{\cdot}74]$$ 98 $$[-0{\cdot}66 , \phantom{-}0{\cdot}67]$$ 97 $$[-0{\cdot}72 , \phantom{-}0{\cdot}73]$$ 99 Naive $$\vartheta_4$$ $$[-2{\cdot}67 , -1{\cdot}82]$$ 89 $$[-2{\cdot}68 , -1{\cdot}79]$$ 91 $$[-2{\cdot}66 , -1{\cdot}83]$$ 89 $$\vartheta_6$$ $$[-0{\cdot}42 , \phantom{-}0{\cdot}43]$$ 69 $$[-0{\cdot}44 , \phantom{-}0{\cdot}44]$$ 92 $$[-0{\cdot}41 , \phantom{-}0{\cdot}42]$$ 71 $$\vartheta_8$$ $$[-0{\cdot}42 , \phantom{-}0{\cdot}43]$$ 70 $$[-0{\cdot}44 , \phantom{-}0{\cdot}45]$$ 75 $$[-0{\cdot}41 , \phantom{-}0{\cdot}43]$$ 71 100 PostAIC $$\vartheta_4$$ $$[-2{\cdot}54 , -1{\cdot}94]$$ 99 $$[-2{\cdot}46 , -2{\cdot}02]$$ 94 $$[-2{\cdot}55 , -1{\cdot}93]$$ 99 $$\vartheta_6$$ $$[-0{\cdot}30 , \phantom{-}0{\cdot}31]$$ 95 $$[-0{\cdot}22 , \phantom{-}0{\cdot}22]$$ 95 $$[-0{\cdot}31 , \phantom{-}0{\cdot}32]$$ 96 $$\vartheta_8$$ $$[-0{\cdot}30 , \phantom{-}0{\cdot}31]$$ 95 $$[-0{\cdot}29 , \phantom{-}0{\cdot}30]$$ 95 $$[-0{\cdot}31 , \phantom{-}0{\cdot}31]$$ 97 PoSI $$\vartheta_4$$ $$[-2{\cdot}58 , -1{\cdot}90]$$ 100 $$[-2{\cdot}54 , -1{\cdot}94]$$ 99 $$[-2{\cdot}57 , -1{\cdot}90]$$ 99 $$\vartheta_6$$ $$[-0{\cdot}33 , \phantom{-}0{\cdot}34]$$ 98 $$[-0{\cdot}30 , \phantom{-}0{\cdot}30]$$ 99 $$[-0{\cdot}33 , \phantom{-}0{\cdot}34]$$ 98 $$\vartheta_8$$ $$[-0{\cdot}34 , \phantom{-}0{\cdot}34]$$ 98 $$[-0{\cdot}29 , \phantom{-}0{\cdot}31]$$ 95 $$[-0{\cdot}33 , \phantom{-}0{\cdot}34]$$ 98 Naive $$\vartheta_4$$ $$[-2{\cdot}46 , -2{\cdot}02]$$ 93 $$[-2{\cdot}46 , -2{\cdot}02]$$ 93 $$[-2{\cdot}46 , -2{\cdot}02]$$ 92 $$\vartheta_6$$ $$[-0{\cdot}22 , \phantom{-}0{\cdot}22]$$ 66 $$[-0{\cdot}22 , \phantom{-}0{\cdot}22]$$ 94 $$[-0{\cdot}21 , \phantom{-}0{\cdot}22]$$ 67 $$\vartheta_8$$ $$[-0{\cdot}22 , \phantom{-}0{\cdot}22]$$ 66 $$[-0{\cdot}22 , \phantom{-}0{\cdot}23]$$ 69 $$[-0{\cdot}22 , \phantom{-}0{\cdot}22]$$ 65 PostAIC, our proposed method; PoSI, the method of Berk et al. (2013); Naive, a naive approach that treats the selected model as given and ignores selection. Confidence intervals from the method of Berk et al. (2013) are given for the sake of comparison. Their target for inference is the so-called nonstandard target (Bachoc et al. 2015), namely the best coefficients within the selected model, in contrast to the standard target, i.e., the true values of the parameters (Berk et al. 2013, equation (3.2)). Simulation results in Leeb et al. (2015) have shown that the coverage probability of such intervals for the standard target is lower than the nominal value in certain situations. For $${\zeta}_{{\rm all}}^3$$ where $$\vartheta_4$$ and $$\vartheta_5$$ are truly nonzero, the conditional confidence intervals for the proposed method have simulated coverage probabilities higher than the nominal 95%. This is because we have $$\mathcal{A}_M(\mathcal{M}_{{\rm all}}^3)$$, $$Z_4^2>2$$ and $$Z_5^2>2$$ in the constraint set, while $$Z_4$$ and $$Z_5$$ are truly unconstrained when taking $$\mathcal{A}_M(\mathcal{M}_{{\rm all}}^3\cap \mathcal{M}_O)$$. For $$\vartheta_6$$ and $$\vartheta_8$$ which are truly zero, $$Z_6^2>2$$ and $$Z_8^2>2$$ are correct constraints. One may expect conservative confidence intervals for $$\vartheta_6$$ and $$\vartheta_8$$ because they are defined by multiplication of the corresponding rows in $$\skew6\hat{{J}}^{1/2}_M(\hat{{\theta}})$$ by $$\tilde{{Z}}(M)$$. 
The latter vector satisfies the constraints $$\mathcal{A}_M(\mathcal{M}_{{\rm all}}^3)$$ rather than $$\mathcal{A}_M(\mathcal{M}_{{\rm all}}^3\cap \mathcal{M}_O)$$, so the distribution is longer-tailed than needed. For the current simulation, the settings considered lead to $$\skew6\hat{{J}}^{1/2}_M(\hat{{\theta}})$$ with small off-diagonal elements, so the distribution of an estimator is mainly determined by its corresponding $$Z_i$$. For $${\zeta}_{{\rm all}}^6$$ the coverages almost equal the nominal values, especially for $$n=100$$. Using $${\zeta}_{{\rm arb}}$$ leads to conservative confidence intervals for all parameters because of the additional constraints in $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}})$$, while theoretically the constraints should be those in $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}} \cap \mathcal{M}_O)$$. The method of Berk et al. (2013) always yields conservative confidence intervals, although there is no guarantee that it leads to valid confidence intervals for the true parameters. Naive confidence intervals for $$\vartheta_4$$ have coverages almost equal to the nominal value, whereas for $$\vartheta_6$$ with $${\zeta}_{{\rm arb}}$$ or $${\zeta}_{{\rm all}}^3$$, and for $$\vartheta_8$$ in all settings, the coverage percentages are around 70%. This is the result of wrongly treating the selected model as given. For settings with small off-diagonal elements of $$\skew6\hat{{J}}^{1/2}_M(\hat{{\theta}})$$, the confidence intervals for the truly nonzero parameters remain valid. Other simulation results are presented in the Supplementary Material. We find that the proposed method can be used even in underparameterized situations, where Assumption A1(i) does not hold.

5.2. Linear combinations in linear models

The performance of the proposed method for linear combinations of the parameters was investigated via simulation. Let $${\vartheta}^{{ \mathrm{\scriptscriptstyle T} }}=(2{\cdot}25,-1{\cdot}1,2{\cdot}43,-1{\cdot}24,2{\cdot}5,{0}_8^{{ \mathrm{\scriptscriptstyle T} }})$$ be the true parameter values in a linear model, with the error standard deviation equal to either 1 or 3. Four different selection matrices are considered, $${\zeta}_{{\rm all}}^i$$ for $$i\in\{3,5,8,10\}$$, indicating that the first $$i$$ covariates are common to all candidate models. The data-generating processes are as in § 5.1. For this simulation we do not condition on a particular selected model, because the quantity of interest is a linear combination of the selected parameters. Table 2 shows the results. We compare the post-selection intervals with the smoothed bootstrap confidence intervals (Efron 2014) and with the intervals for post-selection prediction (Bachoc et al. 2015). Each bootstrap sample consists of $$n$$ draws with replacement from the original dataset, and this is replicated $$B=1000$$ times. The nonideal bootstrap, in which the number of replications is smaller than $$n^n$$, biases the variance of the smoothed bootstrap estimator upwards, so we use the bias-corrected version (Efron 2014, Remark J). The post-selection intervals for prediction target a quantity defined by the selected model, which can differ from the true prediction. A minimal sketch of the bootstrap comparator is given below.
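The following is a minimal R sketch, not the authors' code, of the bagging step used for the bootstrap comparison: in each resample the model is re-selected by AIC before the linear combination is estimated. The reduced design (five forced and three free covariates), the weight vector x0 and the value of B are illustrative assumptions, and the simple percentile summary at the end stands in for the bias-corrected variance formula of Efron (2014), which is not reproduced here.

```r
## Illustrative sketch only: smoothed (bagged) bootstrap of a linear combination
## x0' theta after AIC selection; the design, x0 and B are hypothetical choices.
set.seed(1)
n <- 100; sigma <- 1; B <- 200
theta <- c(2.25, -1.1, 2.43, -1.24, 2.5, 0, 0, 0)     # true coefficients
X <- matrix(rnorm(n * length(theta)), n)
y <- drop(X %*% theta) + rnorm(n, sd = sigma)
x0 <- rep(0.5, length(theta))                         # weights of the linear combination

forced <- 1:5                                         # covariates common to all models
free   <- 6:8                                         # covariates subject to selection
subsets <- lapply(0:(2^length(free) - 1),             # all subsets of the free covariates
                  function(k) free[bitwAnd(k, 2^(seq_along(free) - 1)) > 0])

select_and_estimate <- function(Xb, yb) {
  fits <- lapply(subsets, function(s)
    lm(yb ~ Xb[, c(forced, s), drop = FALSE] - 1))    # no intercept in this toy design
  best <- which.min(vapply(fits, AIC, numeric(1)))    # AIC selection over the candidate set
  vars <- c(forced, subsets[[best]])
  sum(x0[vars] * coef(fits[[best]]))                  # x0' theta-hat in the selected model
}

boot_est <- replicate(B, {
  id <- sample(n, replace = TRUE)                     # n draws with replacement
  select_and_estimate(X[id, ], y[id])
})
c(bagged = mean(boot_est), sd = sd(boot_est),
  quantile(boot_est, c(0.025, 0.975)))                # crude percentile interval
```

Re-selecting the model inside every resample is what makes the bagged estimator smooth; it also means that all candidate models are refitted in each of the $$B$$ resamples.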
Table 2. Results of the simulation study with $$3000$$ runs of selection with AIC: average length of $$95\%$$ confidence intervals and coverage percentages for a linear combination of the parameters, for different methods and model sets, using the selection matrices $${\zeta}$$ for two sample sizes

$$\sigma$$   $$n$$    Method    $${\zeta}_{{\rm all}}^3$$   $${\zeta}_{{\rm all}}^5$$   $${\zeta}_{{\rm all}}^8$$   $${\zeta}_{{\rm all}}^{10}$$   (each column: length, coverage %)
1   30    PostAIC   $$3{\cdot}11$$ 97    $$2{\cdot}61$$ 95    $$2{\cdot}90$$ 94    $$3{\cdot}08$$ 94
1   30    Boot      $$3{\cdot}67$$ 92    $$3{\cdot}32$$ 92    $$3{\cdot}31$$ 92    $$3{\cdot}79$$ 92
1   30    PoSIp     $$4{\cdot}38$$ 100   $$4{\cdot}39$$ 100   $$5{\cdot}36$$ 100   $$6{\cdot}00$$ 100
1   100   PostAIC   $$1{\cdot}42$$ 98    $$1{\cdot}17$$ 95    $$1{\cdot}30$$ 96    $$1{\cdot}37$$ 95
1   100   Boot      $$1{\cdot}25$$ 94    $$1{\cdot}25$$ 94    $$1{\cdot}30$$ 94    $$1{\cdot}33$$ 93
1   100   PoSIp     $$1{\cdot}83$$ 100   $$1{\cdot}83$$ 100   $$2{\cdot}20$$ 100   $$2{\cdot}42$$ 100
3   30    PostAIC   $$11{\cdot}76$$ 98   $$7{\cdot}82$$ 94    $$8{\cdot}68$$ 94    $$9{\cdot}24$$ 94
3   30    Boot      $$11{\cdot}46$$ 92   $$9{\cdot}95$$ 92    $$9{\cdot}94$$ 92    $$11{\cdot}37$$ 92
3   30    PoSIp     $$12{\cdot}65$$ 99   $$13{\cdot}16$$ 100  $$16{\cdot}08$$ 100  $$17{\cdot}99$$ 100
3   100   PostAIC   $$4{\cdot}25$$ 98    $$3{\cdot}50$$ 95    $$3{\cdot}90$$ 96    $$4{\cdot}12$$ 95
3   100   Boot      $$3{\cdot}77$$ 94    $$3{\cdot}74$$ 94    $$3{\cdot}90$$ 94    $$4{\cdot}00$$ 93
3   100   PoSIp     $$5{\cdot}47$$ 100   $$5{\cdot}48$$ 100   $$6{\cdot}60$$ 100   $$7{\cdot}26$$ 100

Cov %, coverage percentage; PostAIC, our proposed method; Boot, smoothed bootstrap (Efron 2014); PoSIp, method of post-selection prediction (Bachoc et al. 2015).

The choice of models with $${\zeta}_{{\rm all}}^3$$ as a selection matrix results in conservative confidence intervals due to conditioning on $$\mathcal{A}_M(\mathcal{M}^3_{{\rm all}})$$, similar to before. For this selection matrix, the confidence intervals obtained by the bootstrap method are shorter than those from the proposed post-selection method. The bootstrap confidence intervals are not directly based on the model selected for the original data, because a model is selected for each bootstrap sample. The ideal situation is when the selection matrix is $${\zeta}_{{\rm all}}^5$$, since all truly nonzero parameters are then forced to be in the model.
For $${\zeta}_{{\rm all}}^5$$, the confidence intervals for the proposed method are always shorter than those of the competing methods, and their coverages are almost equal to the nominal value. For $${\zeta}_{{\rm all}}^8$$ and $${\zeta}_{{\rm all}}^{10}$$ the situation is the same, though with wider intervals than for $${\zeta}_{{\rm all}}^{5}$$ in all methods, because more parameters are forced to be in the model, which increases the variability of the predictions. These confidence intervals are not wider than for $${\zeta}_{{\rm all}}^3$$. Thus the variability of the prediction is affected more by the conditioning than by forcing more variables into the model. The post-selection method for prediction (Bachoc et al. 2015) always leads to wider confidence intervals than the bootstrap method and the proposed method. The coverages of the confidence intervals for the proposed method are always close to or higher than the nominal values, while the bootstrap method can have coverage probabilities below the nominal values. Moreover, the bootstrap method applied to all possible models is computationally intensive, because it needs $$B$$ bootstrap samples and in each one all candidate models are fitted. For the setting with $$\sigma=3$$ and $$n=30$$ in $${\zeta}_{{\rm all}}^3$$, we used the results in (8) instead of (7). In this setting the probability of selecting an underparameterized model is not zero, due to the small sample size and large variance. The average length of the confidence interval was 9$$\cdot$$9 and the coverage was around 90% when we used (7).

6. Pima Indian diabetes data

We construct confidence intervals conditional on the selected model for a logistic regression model applied to the Pima Indian diabetes dataset (Lichman 2013). This dataset consists of women aged 21 years and over of Pima Indian heritage living near Phoenix, Arizona. We used 332 complete observations. The response is 0 if a test for diabetes is negative and 1 if positive. We include seven covariates in the model: npreg, number of pregnancies; glu, plasma glucose concentration in an oral glucose tolerance test; bp, diastolic blood pressure; skin, triceps skin fold thickness in millimetres; bmi, body mass index; ped, diabetes pedigree function; and age in years. See Smith et al. (1988) for more details about the data. First, we consider bootstrap percentile and naive confidence intervals for the parameters in the full model when no selection is involved; see Table 3. We used 5000 bootstrap runs, each resampling the 332 women uniformly with replacement; a minimal code sketch is given below. Several intervals contain zero, which indicates the possibility of using a smaller model.
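As a point of reference for Table 3, the following is a minimal R sketch of the full-model fit and the percentile bootstrap, not the code used for the paper; it assumes a hypothetical data frame pima holding the 332 complete cases, with a 0–1 response named type and covariates named as in the text.

```r
## Hypothetical data frame `pima`: 332 complete cases, response `type` in {0, 1}.
full_formula <- type ~ npreg + glu + bp + skin + bmi + ped + age
fit_full <- glm(full_formula, family = binomial, data = pima)
naive_ci <- confint.default(fit_full)                       # Wald intervals, no selection

B <- 5000                                                   # bootstrap runs, as in the text
boot_coef <- replicate(B, {
  idx <- sample(nrow(pima), replace = TRUE)                 # resample the women uniformly
  coef(glm(full_formula, family = binomial, data = pima[idx, ]))
})
boot_ci <- apply(boot_coef, 1, quantile, c(0.025, 0.975))   # percentile interval per coefficient
```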
Table 3. Pima Indian diabetes data: $$95\%$$ naive and bootstrap confidence intervals in the full model without selection

Method      npreg   glu   bp   skin   bmi   ped   age
Naive       $$[0{\cdot}03, 0{\cdot}26]$$   $$[0{\cdot}03, 0{\cdot}05]$$   $$[-0{\cdot}03, 0{\cdot}02]$$   $$[-0{\cdot}03, 0{\cdot}05]$$   $$[0{\cdot}02, 0{\cdot}14]$$   $$[0{\cdot}24, 2{\cdot}00]$$   $$[-0{\cdot}02, 0{\cdot}05]$$
Bootstrap   $$[-0{\cdot}003, 0{\cdot}30]$$   $$[0{\cdot}03, 0{\cdot}05]$$   $$[-0{\cdot}03, 0{\cdot}16]$$   $$[-0{\cdot}03, 0{\cdot}06]$$   $$[0{\cdot}02, 0{\cdot}15]$$   $$[0{\cdot}005, 2{\cdot}41]$$   $$[-0{\cdot}02, 0{\cdot}07]$$

Selection uses the set $$\mathcal{M}_{{\rm all}}$$; an intercept is present in all models. This results in the selection of four variables: npreg, glu, bmi and ped; the selection step itself can be reproduced with the short sketch below. Table 4 presents the unconditional confidence intervals for these parameters obtained by the naive method, together with the post-selection confidence intervals that condition on the model selected by the Akaike information criterion. The naive method ignores the selection procedure, leading to the significance of the covariate ped, whereas the proposed method, which takes the selection uncertainty into account, concludes that this covariate is not individually significant at the 5% level. For logistic regression, to the best of our knowledge, there are no other post-selection methods to compare with.
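The all-subsets AIC selection can be sketched as follows, again assuming the hypothetical pima data frame from the previous sketch; an intercept is kept in every model, and the criterion is minimized over the $$2^7$$ candidate models. In the analysis reported above, this step selects npreg, glu, bmi and ped.

```r
## AIC selection over all subsets of the seven covariates, intercept always included.
covs <- c("npreg", "glu", "bp", "skin", "bmi", "ped", "age")
all_subsets <- lapply(0:(2^length(covs) - 1),
                      function(k) covs[bitwAnd(k, 2^(seq_along(covs) - 1)) > 0])
aic_of <- function(s) {
  f <- if (length(s)) reformulate(s, response = "type") else type ~ 1
  AIC(glm(f, family = binomial, data = pima))
}
aics <- vapply(all_subsets, aic_of, numeric(1))
all_subsets[[which.min(aics)]]    # the selected covariates
```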
Table 4. Pima Indian diabetes data: confidence intervals with nominal level $$95\%$$ ignoring (Naive) and including (PostAIC) model selection using the Akaike information criterion

Method     npreg   glu   bmi   ped
Naive      $$[0{\cdot}091, 0{\cdot}269]$$   $$[0{\cdot}028, 0{\cdot}049]$$   $$[0{\cdot}042, 0{\cdot}129]$$   $$[0{\cdot}305, 2{\cdot}050]$$
PostAIC    $$[0{\cdot}058, 0{\cdot}299]$$   $$[0{\cdot}022, 0{\cdot}054]$$   $$[0{\cdot}027, 0{\cdot}142]$$   $$[-0{\cdot}027, 2{\cdot}358]$$

7. Discussion and extensions

For one of the classic model selection methods, the Akaike information criterion (Akaike 1973), we have developed a method to deal with the selection uncertainty by performing inference conditional on the selected model. Our results demonstrate that this inference depends not only on the selected model but also on the set of models from which the selection takes place, as well as on the smallest overparameterized model. The dependence on the set of models is not surprising, though it has not received much attention so far. The proposed method explicitly uses the overselection properties of the Akaike information criterion; see Claeskens & Hjort (2004) for some selection properties under local misspecification. For consistent selection criteria, such as the Bayesian information criterion, other approaches should be used, although effects of the selection remain present (Leeb & Pötscher 2005). Other selection methods that are similar to the Akaike information criterion can be approached in the same way. Consider, for example, selection in an arbitrary set of models allowing for model misspecification, as in § 4, using Takeuchi’s information criterion (Takeuchi 1976), $$ {\small{\text{TIC}}}(M)=2 \ell_n\{\hat{{\theta}}(M)\}-2\, {\rm tr} \{{Q}_{M}({\vartheta}^*)^{-1}{J}_{M}({\vartheta}^*)\} $$. In most practical settings the information matrices are estimated by their empirical counterparts $$\hat{{Q}}_{M}(\hat{{\theta}}_M)$$ and $$\skew6\hat{{J}}_{M}(\hat{{\theta}}_M)$$. We rewrite (9) for an arbitrary set of models containing $$M_{{\rm s}}$$ by replacing $$|M|$$ with $${\rm tr}\{{Q}_{M}({\vartheta}^*)^{-1}{J}_{M}({\vartheta}^*)\}$$ and proceed to calculate the asymptotic distribution of the parameters conditional on the constraint set. Another such example is the generalized information criterion introduced by Konishi & Kitagawa (1996).
It considers functional estimators, such as M-estimators, and uses the influence function as part of the criterion, $$ {\small{\text{GIC}}}(M)=-2 \ell_n\{\hat{{\theta}}(M)\}+(2/n)\sum_{i=1}^n {\rm tr}\big\{{\rm Infl}(Y_i)(\partial/\partial{\theta}_M^{{ \mathrm{\scriptscriptstyle T} }}) \log f(Y_i; \hat{{\theta}}_M)\big\} $$. Under some regularity conditions, the functional estimator has an asymptotic normal distribution, allowing us to extend the results in § 4. Mallows’ $$C_p$$ (Mallows 1973) for linear regression is $$C_p(M)=\hat{\sigma}^{-2} \hat{\sigma}^{2}(M) +2|M|-n$$, where $$\hat{\sigma}^2$$ is the estimated variance in the full model and $$\hat{\sigma}^{2}(M)$$ uses model $$M$$. The model with the smallest $$C_p$$ value is the best. In nested models one can easily show that as $$n$$ tends to infinity, $$C_p(M)-C_p(M^*) \sim \chi^2_{q}/q+2q$$ where $$q={|M^*|-|M|}$$. In the same manner as for the Akaike information criterion, one can calculate the constraint set and hence the distribution of estimators for parameters in the selected model. In forward stepwise selection, we start from a small model and embed it in a larger model containing one additional parameter. This procedure continues until adding a parameter does not decrease the Akaike information criterion. To be precise, in step $$t$$ we embed model $$M_t$$ in a number of bigger models, each adding one parameter. Define $$\mathcal{M}_t$$ to be this set of models. Model $$M_{t+1}\in\mathcal{M}_t$$ is selected when this model has a smaller criterion value than model $$M_t$$ and has the smallest criterion value among all models in $$\mathcal{M}_t$$. This means that $$\small{\text{AIC}}(M_{t+1})<\small{\text{AIC}}(M_{t})$$ and $$\small{\text{AIC}}(M_{t+1})< \small{\text{AIC}}(M)$$ for all $$M \in \mathcal{M}_t\setminus M_{t+1}$$. These inequalities can be translated to constraints. The constraint set is the collection of all the constraints from all the steps. We explicitly dealt with low-dimensional parameters for which maximum likelihood estimators exist and the Akaike information criterion is well-defined. Other criteria are more suitable for high-dimensional parameters.

Acknowledgement

The authors thank the editor and reviewers. Support from the Research Foundation of Flanders, the University of Leuven, and the Interuniversity Attraction Poles Research Network is acknowledged. The computational resources and services used in this work were provided by the Flemish Supercomputer Center, funded by the Hercules Foundation and the Flemish Government.

Supplementary material

Supplementary material available at Biometrika online contains a rewriting of some results of Woodroofe (1982), exact calculations for an example, the selection matrix for one of the simulation settings, additional simulation results, and R code to produce the results in the paper.

Appendix

Assumption A1. Let $$\mathcal{B}_{K}(\epsilon)$$ denote an $$(a+K)$$-dimensional sphere centred at $${\vartheta}$$ with radius $$\epsilon$$, and let $$\mathcal{B}_{K}^{\rm c}(\epsilon)$$ denote its complementary set. (i) For each $$\epsilon>0$$, as $$n\rightarrow\infty$$, $$\mathop{\sup}_{{\theta}\in\mathcal{B}_{K}^{\rm c}(\epsilon)} \{\ell_n({\theta})-\ell_n({\vartheta})\} {\rightarrow} -\infty$$ in probability. (ii) There exists an $$\epsilon_0 >0$$ such that $$\ell_n ({\theta})$$ is twice continuously differentiable in $$\mathcal{B}_K(\epsilon_0)$$ for all $$n$$ large enough.
Define the score vector $${U}_n({\theta}) = (\partial/ \partial{\theta})\ell_n({\theta})$$ and the negative Hessian matrix $${Q}_n({\theta}) = -(\partial^2/(\partial{\theta}\,\partial{\theta}^{{ \mathrm{\scriptscriptstyle T} }}))\ell_n({\theta})$$. (iii) For some $$0<\epsilon_1 <\epsilon_0$$, as $$n\rightarrow \infty$$, there exists a nonrandom positive-definite continuous matrix $${Q}({\theta})$$, for $${\theta}$$ in $$\mathcal{B}_K(\epsilon_1)$$, such that $$\sup_{{\theta}\in\mathcal{B}_{K}(\epsilon)} {\rm tr}\{{Q}_{n}({\theta})/n - {Q}({\theta})\} {\rightarrow} 0$$ in probability. (iv) As $$n \rightarrow \infty$$, $$\:n^{1/2} {U}_n({\vartheta})$$ is asymptotically $$N\{{0},{J}({\vartheta})\}$$. (v) For $$i\neq j$$ and $$M_i, M_j \in \mathcal{M}_O$$, if the expectation is taken with respect to the true distribution, we have $$J_{ij}\{{\theta}(i),{\theta}(\,j)\}=E(\{\partial / \partial{\theta}(M_i)\}[\ell_n\{{\theta}(M_i)\}]\{\partial/ \partial{\theta}(M_j)^{{ \mathrm{\scriptscriptstyle T} }}\}[\ell_n\{{\theta}(M_j)\}])= 0_{|M_i|\times|M_j|}$$. In Assumption A1, (i)–(iv) are from Woodroofe (1982). Assumption A1(i) leads to the consistency of maximum likelihood estimators for $${\theta}$$ in the model considered and its submodels. For the nonnested case, Assumption A1(v) provides a simplification (Vuong 1989). In linear regression, Assumption A1(v) is equivalent to having an orthogonal design matrix. The next lemma is an extension of Lemma A in Vuong (1989) to more than two models. Lemma A1. Suppose that conditions $${\rm (i)-(iv)}$$ in Assumption A1 hold. Fix any ordering of the models in $${\mathcal{M}}_O$$ and write $$o=|{\mathcal{M}}_O|$$. Then, as $$n\to\infty$$, $$\:n^{1/2}(\hat{\theta}_{\mathcal{M}_o}-\vartheta_{\mathcal{M}_o})= n^{1/2}\{{\hat{\theta}}'(M_{1})^{{ \mathrm{\scriptscriptstyle T} }}-{\vartheta}(M_1)^{{ \mathrm{\scriptscriptstyle T} }}, \ldots, {\hat{\theta}}'(M_o)^{{ \mathrm{\scriptscriptstyle T} }}-{\vartheta}(M_o)^{{ \mathrm{\scriptscriptstyle T} }}\}^{{ \mathrm{\scriptscriptstyle T} }} \to N\{0,\Sigma(\vartheta)\}$$ in distribution. Proof. As in Vuong (1989), a Taylor series expansion leads to \begin{eqnarray*} 0=n^{-1/2}U_{n,M_i}({\vartheta})+{Q}_{M_i}({\vartheta})n^{1/2} \{\hat{{\theta}^{'}}(M_{i})-{\vartheta}\}+o_{\rm p}(1)\quad (M_i \in \mathcal{M}_O)\text{.} \end{eqnarray*} By the multivariate central limit theorem, we have convergence in distribution as $$n\to\infty$$, \begin{eqnarray*} n^{-1/2}(U_{n,M_1}^{{ \mathrm{\scriptscriptstyle T} }}, \ldots, U_{n,M_o}^{{ \mathrm{\scriptscriptstyle T} }})^{{ \mathrm{\scriptscriptstyle T} }} \rightarrow N(0, \Sigma _u) \end{eqnarray*} where $$\Sigma_u$$ is a partitioned matrix with $$(i,j)$$th block equal to $${J}_{ij}({\vartheta},{\vartheta})$$. The distribution of the estimators follows. □ When the models are correctly specified, $$J_{ii}({\vartheta},{\vartheta})={J}_{M_i}({\vartheta})={Q}_{M_i}({\vartheta})$$. Lemma A1 is also valid for misspecified models and for models not in $$\mathcal{M}_O$$. In such cases the true parameter is replaced by the pseudo-true parameter corresponding to the considered model. Proof of Proposition 1. 
We show that (1) equals \begin{eqnarray*} \lim_{n \rightarrow \infty } \frac{\mathrm{pr}([n^{1/2}\{{\hat{\theta}}'(p)-\tilde{{\vartheta}}(p)\}\leqslant \tilde{{t}}(p)] \cap [2 \ell_{n,p}^*-2p \geqslant 2\ell_{n,j}^*-2j, \, j \in \{{p_0},{p_0+1},\ldots,{K}\}])}{\mathrm{pr}[2 \ell_{n,p}^*-2p \geqslant 2\ell_{n,j}^*-2j, \, j \in \{{p_0},{p_0+1},\ldots,{K}\}]}\text{.} \end{eqnarray*} By Lemma A1 there is joint convergence of the estimators in the different models. Next, since $$\ell_{n,j}^*$$ is a function of $${\hat{\theta}}'(\,j)$$, namely \begin{eqnarray*} \ell_{n,j}^*=\frac{n}{2}\{{\hat{\theta}}'(\,j)-{\vartheta}(\,j)\}^{{ \mathrm{\scriptscriptstyle T} }}{J}_j({\vartheta})\{{\hat{\theta}}'(\,j)-{\vartheta}(\,j)\}+o_{\rm p}(1), \end{eqnarray*} and since the probability of the event in the denominator is strictly positive, Slutsky’s theorem and the continuous mapping theorem give joint convergence for both the numerator and the denominator of the above expression to their asymptotic counterparts. To obtain the selection set, let $$S_j =\{{s} \in \mathbb{R}^{a+K}: s_i=0 \text{ for } i=a+j,\ldots,a+K \}$$, for $$j=p_0,\ldots,K$$. Woodroofe (1982) showed that $$(\ell^*_{n,p_0}, \ldots, \ell^*_{n,K})$$ converges in distribution to $$(\ell^*_{p_0}, \ldots, \ell^*_{K})$$ as $$n\rightarrow \infty$$, where for $$j=p_0,\ldots,K$$, $$ \:\ell_j^*=\mathop{\sup}_{{s} \in S_j} \{{s}'{Y}-{s}'{J}({\vartheta}){s}/2\} $$ with $${Y}\sim N\{{0}, {J}({\vartheta})\}$$. Then $$\ell_j^*=0{\cdot}5\sum_{i=1}^{a+j} Z_{i}^2$$ ( $$j= p_0, \ldots, K$$), where $$Z_1,\ldots,Z_{a+j}$$ are independent and identically distributed standard normal random variables. Lemma 1 and Assumption A1(i)–(iv) imply that $$n^{1/2}{J}_{j}^{1/2}({\vartheta})\{{\hat{\theta}}'(\,j)-\tilde{{\vartheta}}(\,j)\}$$ converges in distribution to $$\tilde{{Z}}(\,j)$$ as $$n\to \infty$$. Parameters not in the selected model are set to zero, which gives the region $$\mathcal{T}_p$$. Since $$\tilde{{Z}}(p)$$ and $$(Z_{a+p+1},\ldots,Z_{a+K})$$ are independent, for $${t}\in\mathcal{T}_p$$, \begin{eqnarray*} F_p({t})&=& \mathrm{pr}\{J_{p}^{-1/2}(\vartheta) \tilde{{Z}}(p) \leqslant \tilde{{t}}(p) \mid {Z} \in \mathcal{A}_p(\mathcal{M}_{{\rm nest}})\} \nonumber \\ &=& \mathrm{pr}\left[J_{p}^{-1/2}(\vartheta) \tilde{{Z}}(p) \leqslant \tilde{{t}}(p) \;\bigg|\; \bigcap_{j=p_0,\ldots,p-1}\left\{\sum_{i=j+1}^{p} Z_{a+i}^2 > 2(p-j)\right\}\right]\!\text{.} \end{eqnarray*} □ Proof of Corollary 2. From Proposition 1, with $$\hat{p}_0=p$$, $$\:q_{\alpha}$$ is equivalently found via \begin{eqnarray*} {\mathrm{pr}\left[\left(\sum_{i=1}^{a+p} Z_i^2 \leqslant q_{\alpha}\right) \cap \bigcap_{j=p_0,\ldots,p-1}\left\{\sum_{i=j+1}^{p} Z_{a+i}^2 > 2(p-j) \right\} \right]} \Big/ {\mathrm{pr}\{\tilde{{Z}}(p)\in\mathcal{A}_p^{({\rm s})}(\mathcal{M}_{{\rm nest}})\}}=1-\alpha\text{.} \end{eqnarray*} The denominator can be calculated by Lemma S1 in the Supplementary Material. To calculate the numerator, we first find the joint density of $$(W_p,\ldots,W_{p_0+1},W_1)$$, where $$W_j = \sum_{i=a+j}^{a+p}Z_{i}^2$$ and $$W_1 = \sum_{i=1}^{a+p}Z_{i}^2$$ with $$Z_i^2 \sim \chi^2_{1}$$ for all $$i=1,\ldots,a+p$$. So $$Z_{a+i}^2=W_{i-1}-W_{i}$$ for $$i=p_0+1, \ldots, p-1$$ and $$Z_{a+p}^2=W_p$$ with $$\sum_{i=1}^{a+p_0}Z_i^2=W_1-W_{p_0+1}\sim \chi^2_{a+p_0}$$.
The joint distribution of $$(W_p,\ldots,W_{p_0+1},W_1)$$ is obtained via a transformation of the distribution of $$(Z^2_{a+p}, Z^2_{a+p-1}, \ldots, Z^2_{a+p_0+1}, \sum_{i=1}^{a+p_0}Z_i^2)$$, \begin{eqnarray*} f(w_p,\ldots, w_{p_0+1}, w_1)= \frac{\exp(-{w_1}/{2})w_p^{-1/2}(w_1-w_{p_0+1})^{-(a+p_0)/2-1} \prod^{p-p_0+1}_{i=1}(w_i-w_{i-1})^{-1/2}} {2^{({a+p})/{2}}\{\Gamma(1/2)\}^{p-p_0}\Gamma(\frac{a+p_0}{2})}\text{.} \end{eqnarray*} The region of integration follows from $$\mathcal{A}_p^{({\rm s})}(\mathcal{M}_{{\rm nest}})$$ and the fact that $$W_i \leqslant W_{j}$$ for $$i>j$$. □ Proof of Lemma 1. Denote the smallest true model by $$M_{\rm pars}$$. For all $$M'\not\in\mathcal{M}_O$$, by Assumption A1(i), \begin{align*} &\mathrm{pr}(M_{{\small{\text{AIC}}}} = M') \\ & \quad \leqslant \mathrm{pr}\bigl\{\small{\text{AIC}}^*(M')\ge\mathop{\max}_{M\in\mathcal{M}_O}\small{\text{AIC}}^*(M)\bigr\}\leqslant \mathrm{pr}\bigl\{\small{\text{AIC}}^*(M')\ge\small{\text{AIC}}^*(M_{{\rm pars}})\bigr\} \\ &\quad= \mathrm{pr}\big[\ell_n\{{\hat{\theta}}(M')\} - |M'| \ge \ell_n\{{\hat{\theta}}(M_{{\rm pars}})\} - |M_{{\rm pars}}|\big]\\ &\quad= \mathrm{pr}\big[\ell_n\{{\hat{\theta}}(M')\}-\ell_n\{{\vartheta}(M_{{\rm pars}})\} - |M'| \ge \ell_n\{{\hat{\theta}}(M_{{\rm pars}})\} - \ell_n\{{{\vartheta}}(M_{{\rm pars}})\}-|M_{{\rm pars}}|\big]\\ &\quad \rightarrow 0\text{.}\\[-44pt] \end{align*} □ Proof of Proposition 2. (I) Define $$S_j =\{{s} \in \mathbb{R}^{a+K}: s_i=0,\, i\notin M \}$$ and $$\ell^*_{n,M_i}=\ell_n\{\hat{{\theta}}(M_i)\}-\ell_n({\vartheta})$$ where $$M_i \in \mathcal{M}_O$$. Similar to Proposition 1, we can show that for $$M_i \in \mathcal{M}_O$$, $$\:\ell^*_{n,M_i}\to 0{\cdot}5\sum_{j \in M_i}Z_j^2$$ in distribution. Now, the condition part can be calculated by $$\sum_{j \in M} Z_j^2 - 2 |M| > \sum_{j \in M_i} Z_j^2 - 2 |M_i|, \quad M_i \in {\mathcal{M}}_O\setminus M ,$$ which is equivalent to $${Z} \in \mathcal{A}_M({\mathcal{M}}_O)$$. (II) By Lemma A1 there is joint convergence in distribution of the estimators in the different models. The constraint set can be calculated by pairwise comparisons of the $$\small{\text{AIC}}^*$$ values. To do so, write \begin{eqnarray*} &&\ell_n\{\hat{{\theta}}(M_{\rm i})\}=\ell_n({{\vartheta}})+\frac{n}{2}\{\hat{{\theta}} (M_{i})-{\vartheta}\}^{{ \mathrm{\scriptscriptstyle T} }}{Q}_{M_{i}}({\vartheta})\{\hat{{\theta}}(M_{i}) -{\vartheta}\}+o_{\rm p}(1) \end{eqnarray*} from which it follows that $$ \ell_{n,i}^*=({n}/{2})\{\hat{{\theta}}(M_{i})-{\vartheta}\}^{{ \mathrm{\scriptscriptstyle T} }} {Q}_{M_{i}}({\vartheta})\{\hat{{\theta}}(M_{i})-{\vartheta}\}+o_{\rm p}(1)\text{.} $$ Then, since $$\small{\text{AIC}}^*(M_{\small{\text{AIC}}}) \geqslant \small{\text{AIC}}^*(M_{i})$$ is equivalent to $$2 (\ell_{n,{\small{\text{AIC}}}}^{*} -\ell_{n,{ i}}^*) \geqslant 2(|M_{\small{\text{AIC}}}|-|M_{\rm i}|) $$, it follows that \begin{eqnarray} n(\hat{{\theta}}_{{\mathcal{M}}_O}-{\vartheta}_{{\mathcal{M}}_O})^{{ \mathrm{\scriptscriptstyle T} }} {W}_{\rm AIC,i}(\hat{{\theta}}_{{\mathcal{M}}_O}-{\vartheta}_{{\mathcal{M}}_O})+o_{\rm p}(1) - 2(|M_{\small{\text{AIC}}}|-|M_i|) \geqslant 0\text{.} \end{eqnarray} (A1) By using Lemma A1 and the continuous mapping theorem, the asymptotic counterpart of (A1) can be written as $$ {{{Z}}}^{{ \mathrm{\scriptscriptstyle T} }} \Sigma^{1/2} {W}_{\rm AIC,i} \Sigma^{1/2}{Z} \geqslant 2 (|M_{\small{\text{AIC}}}|-|M_i|)$$$$(M_i \in {\mathcal{M}}_O) $$, which results in the stated selection region and limiting distribution. 
□ Proof of Proposition 3. By Theorems 1 and 2 of Sweeting (1980), $$ n^{1/2}\{{\hat{\theta}}'(M)-\tilde{{\vartheta}}(M)\}^{{ \mathrm{\scriptscriptstyle T} }}{J}_{M}^{1/2} ({\vartheta})\to \tilde{{Z}}(M) $$ uniformly in distribution over the compact set $$\Theta$$. This yields $$\lim_{n\rightarrow \infty} \inf_{\vartheta \in \Theta} \mathrm{pr}_{\vartheta}\{\vartheta \in C_{\alpha}(\vartheta)\} = 1- \alpha$$. When $$\mathcal{M}_O$$ is not known, we use $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}})$$ in (6) instead of $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O)$$, which defines the value $$\tilde q_\alpha$$. Since $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}}) \subset \mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O)$$, we have $$\tilde q_\alpha\ge q_\alpha$$, which leads to a conservative confidence region. □

Proof of Lemma 2. For every $$j=1,\ldots,J$$ and every component $$k$$ of the vector $$\hat\theta_{n,\mathcal{M}}(M_j)$$, we have $$ n^{1/2}([\hat\theta_{n,\mathcal{M}}(M_j)]_k - [\vartheta_{n,\mathcal{M}}^*(M_j)]_k) = \sum_{i=1}^n Q^{-1}_{M_j}\{\vartheta_{n}^*(M_j)\} n^{-1/2} \frac{\partial}{\partial\theta_k}\log f_{j,i}\{Y_i,\vartheta_{n}^*(M_j)\} +o_{\rm p}(1)\text{.} $$ By assumption (i) in the lemma, which is a Lindeberg assumption for all $$G_n\in\mathcal{G}_n$$, we obtain a uniform limiting normality result for each of the components of $$n^{1/2}(\hat\theta_{n,\mathcal{M}} - \vartheta_{n,\mathcal{M}}^*)$$. Under assumption (ii) in the lemma, the data are in a so-called null triangular array format, to which Corollary 2 of Pollak (1972) applies, resulting in joint asymptotic normality for the vector combining all such components. □

Proof of Proposition 4. Define the events $$B= [n^{1/2}\{\hat{\theta}(M_{\small{\text{AIC}}})-\vartheta^*(M_{\small{\text{AIC}}})\} \leqslant t]$$ and $$ C=\bigcap_{M \in \mathcal{M}} \bigl\{n(\hat{\theta}_{{\mathcal{M}}}-\vartheta^*_{{\mathcal{M}}})^{{ \mathrm{\scriptscriptstyle T} }}(W_{M_{\small{\text{AIC}}},M_{{\rm s}}}-W_{M,M_{{\rm s}}}) (\hat{\theta}_{{\mathcal{M}}}-\vartheta^*_{{\mathcal{M}}})\geqslant 2(|M_{\small{\text{AIC}}}|-|M|) \bigr\} +o_{\rm p}(1)\text{.} $$ Using the results of Lemma 2 and the continuous mapping theorem, the difference between \begin{eqnarray*} \mathrm{pr}\big[n^{1/2}\{\hat{\theta}(M_{\small{\text{AIC}}})-\vartheta^*(M_{\small{\text{AIC}}})\} \leqslant t \mid \hat{M}=M_{\small{\text{AIC}}}\big] = \mathrm{pr}(B\cap C)/\mathrm{pr}(C) \end{eqnarray*} and $$ \mathrm{pr}[\{ \Sigma_{M_{\small{\text{AIC}}}}(\vartheta^*_{M_{\small{\text{AIC}}}})^{1/2} \tilde{{Z}}({M_{{\small{\text{AIC}}}}}) \leqslant t \} \, \cap \{Z \in \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})\}] /\mathrm{pr}\{Z \in \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})\} $$ converges uniformly to 0. □

References

Akaike H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. Info. Theory, Tsahkadsor, Armenia, USSR, September 2–8, 1971, Petrov B. & Csáki F., eds. Budapest: Akadémiai Kiadó, pp. 267–81.
Andrews D. W. K. & Guggenberger P. (2009). Hybrid and size-corrected subsampling methods. Econometrica 77, 721–62.
Bachoc F., Leeb H. & Pötscher B. (2015). Valid confidence intervals for post-model-selection predictors. arXiv: 1412.4605.
Belloni A., Chernozhukov V. & Kato K. (2015). Uniform post selection inference for least absolute deviation regression models and other Z-estimation problems. Biometrika 102, 77–94.
Berk R., Brown L., Buja A., Zhang K. & Zhao L. (2013). Valid post-selection inference. Ann. Statist. 41, 802–37.
Chernozhukov V., Hansen C. & Spindler M. (2015). Valid post-selection and post-regularization inference: An elementary, general approach. Ann. Rev. Econ. 7, 649–88.
Claeskens G. & Hjort N. L. (2004). Goodness of fit via nonparametric likelihood ratios. Scand. J. Statist. 31, 487–513.
Claeskens G. & Hjort N. L. (2008). Model Selection and Model Averaging. Cambridge: Cambridge University Press.
Cox D. R. & Hinkley D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
Danilov D. & Magnus J. R. (2004). On the harm that ignoring pretesting can cause. J. Economet. 122, 27–46.
Efron B. (2014). Estimation and accuracy after model selection. J. Am. Statist. Assoc. 109, 991–1007.
Ferrari D. & Yang Y. (2014). Confidence sets for model selection by F-testing. Statist. Sinica 25, 1637–58.
Hjort N. L. & Claeskens G. (2003). Frequentist model average estimators. J. Am. Statist. Assoc. 98, 879–99.
Jansen M. (2014). Information criteria for variable selection under sparsity. Biometrika 101, 37–55.
Kabaila P. (1995). The effect of model selection on confidence regions and prediction regions. Economet. Theory 11, 537–49.
Kabaila P. (1998). Valid confidence intervals in regression after variable selection. Economet. Theory 14, 463–82.
Kabaila P. & Leeb H. (2006). On the large-sample minimal coverage probability of confidence intervals after model selection. J. Am. Statist. Assoc. 101, 619–29.
Kabaila P., Welsh A. H. & Abeysekera W. (2016). Model-averaged confidence intervals. Scand. J. Statist. 43, 35–48.
Konishi S. & Kitagawa G. (1996). Generalized information criteria in model selection. Biometrika 83, 875–90.
Lee J. D., Sun D. L., Sun Y. & Taylor J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist. 44, 907–27.
Leeb H. & Pötscher B. (2005). Model selection and inference: Facts and fiction. Economet. Theory 21, 22–59.
Leeb H., Pötscher B. & Ewald K. (2015). On various confidence intervals post-model-selection. Statist. Sci. 30, 216–27.
Leeb H. & Pötscher B. M. (2003). The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations. Economet. Theory 19, 100–42.
Leeb H. & Pötscher B. M. (2006). Can one estimate the conditional distribution of post-model-selection estimators? Ann. Statist. 34, 2554–91.
Lichman M. (2013). UCI Machine Learning Repository, Center for Machine Learning and Intelligent Systems, School of Information and Computer Sciences, University of California, Irvine, https://archive.ics.uci.edu/ml/.
Mallows C. (1973). Some comments on $${C}_p$$. Technometrics 15, 661–75.
Pakman A. & Paninski L. (2014). Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comp. Graph. Statist. 23, 518–42.
Pollak M. (1972). A note on infinitely divisible random vectors. Ann. Math. Statist. 43, 673–5.
Pötscher B. (1991). Effects of model selection on inference. Economet. Theory 7, 163–85.
Pötscher B. (1995). Comment on "The effect of model selection on confidence regions and prediction regions" by P. Kabaila. Economet. Theory 11, 550–9.
Shibata R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63, 117–26.
Smith J. W., Everhart J. E., Dickson W. C., Knowler W. C. & Johannes R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Proc. Ann. Symp. Comp. Appl. Med. Care, 1988 Nov 9, 261–5.
Sweeting T. J. (1980). Uniform asymptotic normality of the maximum likelihood estimator. Ann. Statist. 8, 1375–81.
Takeuchi K. (1976). Distribution of information statistics and criteria for adequacy of models. Suri Kagaku (Math. Sci.) 153, 12–8.
Taylor J., Lockhart R., Tibshirani R. J. & Tibshirani R. (2016). Exact post-selection inference for sequential regression procedures. J. Am. Statist. Assoc. 111, 600–20.
Tibshirani R. J., Rinaldo A., Tibshirani R. & Wasserman L. (2015). Uniform asymptotic inference and the bootstrap after model selection. arXiv: 1506.06266.
Vuong Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 307–33.
White H. (1994). Estimation, Inference and Specification Analysis. New York: Cambridge University Press.
Woodroofe M. (1982). On model selection and the arc-sine laws. Ann. Statist. 10, 1182–94.

© 2018 Biometrika Trust. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).
(II) If Assumption A1$${\rm (v)}$$ does not hold, define $$m=\sum_{M\in \mathcal{M}_O}|M|$$ and let $$W_{{\small{\text{AIC}}},i}$$ be a matrix partitioned in the same way as $$\Sigma(\vartheta)$$ with diagonal blocks that correspond to $$M_{{\small{\text{AIC}}}}$$ and $$M_i$$ equal to $$Q_{M_{\small{\text{AIC}}}}(\vartheta)$$ and $$-Q_{M_i}(\vartheta)$$, respectively, and with zeros elsewhere. The selection region for model $$M_{{\small{\text{AIC}}}}$$ is \begin{equation*} \mathcal{A}_M(\mathcal{M}_O)\!=\! \bigl\{{z}\in\mathbb{R}^{m}\!:\!{{{z}}}^{{ \mathrm{\scriptscriptstyle T} }} \Sigma^{1/2}(\vartheta) W_{{\small{\text{AIC}}},i} \Sigma^{1/2}(\vartheta){z} \geqslant 2 (|M_{\small{\text{AIC}}}|-|M_i|), \, M_i \in {\mathcal{M}}_O\!\!\setminus\!\!M_{{\small{\text{AIC}}}}\bigr\}\text{.} \end{equation*} Let $$\tilde{{Z}}(M)$$ denote the subvector of $${Z}\sim N_m(0,I)$$, $$Z\in \mathcal{A}_M(\mathcal{M}_O)$$, which contains only those components that correspond to components in the selected model $$M$$; then \begin{equation} F_M({t}) = \mathrm{pr}\{{J}_{M}^{-1/2}({\vartheta}) \tilde{{Z}}(M) \leqslant \tilde{{t}}(M) \mid {Z} \in \mathcal{A}_M({\mathcal{M}}_O)\} I({t}\in \mathcal{T}_{M}), \end{equation} (4)where $$\mathcal{T}_{M}$$ is $$\mathbb{R}^{|M|} \times (\mathbb{R}^+)^{m-|M|}$$. The choice of $$\mathcal{M}$$ is important. Regarding (I), the constraint involves those $$Z_i$$ corresponding to the parameters in the selected model $$M_{{\small{\text{AIC}}}}$$ that are not in the smallest true model $$M_{{\rm pars}}$$; hence no constraints are placed on the $$Z_i$$ corresponding to parameters that occur in every model. Obviously, the selection affects the distribution of all parameters, even those common to all models. The effect of the set of models is illustrated by the following example. Let $$K=2$$ and $$a=1$$, and let $$M_0$$ be the smallest true model containing only $$\theta_1$$. Suppose that Assumption A1(v) holds and that the full model $$M_{{\small{\text{AIC}}}}=(\theta_1 , \theta_2, \theta_3)$$ is selected in both $$\mathcal{M}_{{\rm nest}}$$ and $$\mathcal{M}_{{\rm all}}$$. So $$\mathcal{A}_M(\mathcal{M}_{{\rm all}})=\{{z}\in\mathbb{R}^{3}: z_2^2>2,\ z_3^2>2,\ z_2^2+z_3^2>4 \}$$ while $$\mathcal{A}_M(\mathcal{M}_{{\rm nest}})=\{{z}\in\mathbb{R}^{3}: z_3^2>2,\ z_2^2+z_3^2>4 \}$$. Figure 2 depicts these regions for both $$\mathcal{M}_{{\rm nest}}$$, shaded area, and $$\mathcal{M}_{\mathrm{all}}$$, double-shaded area. If one selects the full model in $$\mathcal{M}_{{\rm nest}}$$, then $$Z_2$$ is defined in $$\mathbb{R}$$ as long as $$Z_2^2+Z_3^2>4$$, while selection in $$\mathcal{M}_{{\rm all}}$$ requires both $$Z_2$$ and $$Z_3$$ to be in $$(-\infty, - 2^{1/2})\cup (2^{1/2}, \infty)$$. The distribution of parameter estimators can be obtained by premultiplying $$Z=(Z_1, Z_2, Z_3)$$ by $$J^{1/2}_{M_{\small{\text{AIC}}}}(\vartheta)$$. For the normal linear models $${Y}\sim N_n({X}{\vartheta},\sigma^2{I})$$ and $$M_{\small{\text{AIC}}} \in \mathcal{M}_{O}$$, the distribution results are also exact for finite samples. In such models $${J}({\vartheta})=n^{-1}{X}^{{ \mathrm{\scriptscriptstyle T} }}{X}/\sigma^2$$, which does not depend on $${\vartheta}$$. For (II) the main difference is that we need the joint distribution of the estimators in the different models and the constraints apply to the full vector. Fig. 2. 
Allowable domain of $$Z_2$$ and $$Z_3$$ for nested model selection (shaded) and all-subsets selection (double-shaded) when $$\small{\text{AIC}}$$ selects the full model. 3.2. Confidence regions For an arbitrary set of models $$\mathcal{M}_{{\rm arb}}$$ with $$\mathcal{M}_{{\rm arb}} \cap \mathcal{M}_O\not=\emptyset$$, by Assumption A1(i), (3) still holds upon replacing $$\mathcal{A}_M(\mathcal{M}_O)$$ with $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}} \cap \mathcal{M}_O)$$. With $$M_{{\small{\text{AIC}}}}=M$$ selected from $$\mathcal{M}_{{\rm arb}}$$, the confidence region for $${\vartheta}$$ is \begin{equation} C(q_\alpha) = \big\{{\theta}\in\mathbb{R}^{a+K}: n\{{\hat{\theta}}'(M)-\tilde{{\theta}}(M)\}^{{ \mathrm{\scriptscriptstyle T} }}{J}_{M}({\theta}) \{{\hat{\theta}}'(M)-\tilde{{\theta}}(M)\} \leqslant q_{\alpha}\big\}, \end{equation} (5) where $${\hat{\theta}}'(M)$$ is the $$|M|$$-vector of nonzero values of $$\hat{{\theta}}(M)$$ and $$q_{\alpha}$$ is determined by solving \begin{equation} \frac{\mathrm{pr}\{(\sum_{i \in M} Z_i^2 \leqslant q_{\alpha}) \cap {Z} \in \mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O) \} } {\mathrm{pr}\{{Z} \in \mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O)\}}=1-\alpha\text{.} \end{equation} (6) Let $$f_M\{\tilde{{t}}(M)\}=\phi_M\{\tilde{{t}}(M)\mid \mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O);J_M^{-1}({\vartheta})\}$$ denote the density of $$n^{1/2}\{{\hat{\theta}}'(M)-\tilde{{\vartheta}}(M)\}$$, a truncated $$|M|$$-dimensional normal density. The quantile of its $$j$$th component is obtained via \begin{eqnarray*} \int_{\mathcal{R}_{\alpha}}f_M\{\tilde{{t}}(M)\} \,\mathrm{d} \tilde{{t}}(M) = 1-\alpha, \end{eqnarray*} where $$\mathcal{R}_{\alpha}\subset\mathbb{R}^{|M|}$$ restricts only the $$j$$th component to $$[-q_{\alpha/2},q_{\alpha/2}]$$. The confidence interval for $$\vartheta_j$$ is $$\hat{\theta}_j(M)\pm q_{\alpha/2}n^{-1/2}$$. While there is no uniform convergence of the distribution function in all settings (Leeb & Pötscher 2003), for normal linear models using rectangular confidence regions and sequential testing, a uniform result regarding coverage has been obtained by Pötscher (1995). The following result holds for overspecified models. For models in the set $$\mathcal{M}_{O}$$, all parameter components that appear in the true model are nonzero, but there may be additional parameter components which could be zero or nonzero. However, the set $$\mathcal{M}_{O}$$ does not depend on the value of the true parameter $$\vartheta$$. After conditioning on $$M_{\small{\text{AIC}}}\in \mathcal{M}_{O}$$, the set $$C(q_\alpha)$$ is random due to maximum likelihood estimation in the selected model.
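To illustrate how the quantile $$q_{\alpha}$$ in (6) can be approximated by simulation, the following sketch uses a small hypothetical configuration: five parameters in the largest model, the first two forced into every candidate model, all-subsets selection, and the constraint set computed from the selected model as discussed later in this section. The indices, the number of draws and the use of rejection sampling are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

# Hypothetical all-subsets example: a + K = 5 parameters in the largest model,
# the first two present in every candidate model, and AIC selects the model {0, 1, 2}.
full, forced, selected = 5, [0, 1], [0, 1, 2]
extra = [i for i in selected if i not in forced]        # selected but not forced
unsel = [i for i in range(full) if i not in selected]   # not selected

Z = rng.standard_normal((500000, full))

# Constraint set computed from the selected model: z_i^2 > 2 for the extra selected
# parameters and z_i^2 < 2 for the unselected ones (the practical version of A_M).
mask = np.all(Z[:, extra] ** 2 > 2, axis=1) & np.all(Z[:, unsel] ** 2 < 2, axis=1)
Zc = Z[mask]

# Monte Carlo version of (6): the conditional (1 - alpha) quantile of sum_{i in M} Z_i^2.
alpha = 0.05
q_alpha = np.quantile(np.sum(Zc[:, selected] ** 2, axis=1), 1 - alpha)

# The naive chi-squared quantile with |M| degrees of freedom ignores the selection.
print(q_alpha, chi2.ppf(1 - alpha, df=len(selected)))
```

The simulated $$q_{\alpha}$$ exceeds the naive chi-squared quantile, consistent with the earlier remark that the naive quantile yields coverage that is too low.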
Proposition 3. Assume conditions $${\rm (i)-(iv)}$$ of Assumption A1 and that $${Q}_n({\theta})$$ in $${\rm (ii)}$$ is continuous over a compact set $$\Theta$$ that contains $${\vartheta}$$. The confidence region $$C(q_\alpha)$$ from (5) is such that $$ \lim_{n\rightarrow \infty} \inf_{{\vartheta} \in \Theta} \mathrm{pr}_{\vartheta} \{{\vartheta} \in C(q_\alpha)\mid M_{{\small{\text{AIC}}}}\in \mathcal{M}_{O}\} = 1- \alpha\text{.} $$ When $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}})$$ replaces $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O)$$ in (6) to obtain a value $$\tilde q_\alpha$$, $$\:\lim_{n\rightarrow \infty} \inf_{{\vartheta} \in \Theta} \mathrm{pr}\{{\vartheta} \in C(\tilde q_\alpha) \mid M_{\small{\text{AIC}}}\in \mathcal{M}_{O}\} \ge 1- \alpha$$. One limitation of the Akaike information criterion is that the selection of an overspecified model does not happen in a uniform way (Leeb & Pötscher 2003). Hence, this result cannot be strengthened. If the selected model is underparameterized, correct inference can be obtained for the pseudo-true values instead; see § 4. For a predetermined number of steps in a forward selection, least angle regression or lasso in linear additive error models, Tibshirani et al. (2015) obtained asymptotic results which are uniformly valid for a specific class of nonnormal errors. For a comparison between two models, Andrews & Guggenberger (2009) used a local neighbourhood to deal with the overselection and to obtain uniform results for parameters that were not subject to selection. Chernozhukov et al. (2015) performed uniformly valid inference on a low-dimensional parameter when there is selection in a high-dimensional vector of nuisance parameters. See also Belloni et al. (2015) for the use of least absolute deviation in high-dimensional regression. Inference after selection depends on (i) the set of models $$\mathcal{M}$$ specified by the researcher and (ii) the smallest true model $$M_{{\rm pars}}$$, in nested models $$p_0$$, via $$\mathcal{A}_M(\mathcal{M} \cap \mathcal{M}_O)$$. In $$\mathcal{M}_{{\rm nest}}$$ and $$\mathcal{M}_{{\rm all}}$$ one could take the smallest model for $$M_{{\rm pars}}$$. If this model is true or overparameterized, Propositions 1 and 2 hold and the asymptotic confidence intervals can be calculated exactly. If the smallest model is underparameterized, the structure of the additional constraints $$\mathcal{A}_M(\mathcal{M})\setminus\mathcal{A}_M(\mathcal{M} \cap \mathcal{M}_O)$$ is such that the resulting distribution of the parameters is longer-tailed. This leads to conservative confidence intervals, especially for the parameters which are truly nonzero. In practice we calculate the constraints based on the selected model and $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}})$$. For case (i), in $$\mathcal{M}_{{\rm all}}$$ the number of constraints equals $$2^{K-|M_0|}-1$$. Here, we show that $$\mathcal{A}_M({\mathcal{M}}_O)$$ can be reduced to the set $$ \bigcap_{i \in M_{{\small{\text{AIC}}}}\setminus M_{{\rm pars}}} \left\{{z}\in\mathbb{R}^{a+K}: (z_i^2 > 2)\right\} \cap \bigcap_{i \notin M_{{\small{\text{AIC}}}}} \left\{{z}\in\mathbb{R}^{a+K}: (z_i^2 < 2)\right\}$$ without losing information. Let $$\mathcal{I}_{M_{{\small{\text{AIC}}}}}$$ denote the set consisting of all subsets of the indices in $$M_{{\small{\text{AIC}}}}\setminus M_{{\rm pars}}$$, referring to the redundant selected parameters, and denote by $$\mathcal{I}_{M_{{\small{\text{AIC}}}}}^{\rm c}$$ the set of all subsets of the indices in $$\{1,\ldots,a+K\}\setminus M_{{\small{\text{AIC}}}}$$, referring to the variables that were not selected.
Then \begin{eqnarray*} \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M}\cap \mathcal{M}_O)= \bigcap_{I \in \mathcal{I}_{M_{\small{\text{AIC}}}}} \bigcap_{J \in \mathcal{I}^c_{M_{\small{\text{AIC}}}}}\left\{{z}\in\mathbb{R}^{a+K}: \sum_{i \in I} z_i^2 -\sum_{j \in J} z_j^2 > 2(|I|- |J|)\right\} \\ \cap \bigcap_{i\in M_{\small{\text{AIC}}}\backslash M_{\rm pars}} \left\{{z}\in\mathbb{R}^{a+K}: (z_i^2>2) \right\} \cap \bigcap_{i \in \{1,\ldots,a+K\}\backslash M_{\small{\text{AIC}}}}\left\{{z}\in\mathbb{R}^{a+K}: (-z_i^2>-2)\right\}\text{.} \end{eqnarray*} The last two sets of constraints consist, respectively, of $$|M_{{\small{\text{AIC}}}}|-|M_{{\rm pars}}|$$ and $$a+K-|M_{{\small{\text{AIC}}}}|$$ elements. The first set only involves constraints that are summations of the constraints in the last two sets and does not add any new restrictions on $${z}$$. The constraint set for any $$\mathcal{M}_{{\rm arb}}$$ can be simplified as long as some constraints can be implied by summing other constraints. Removing redundant constraints is not always possible, for example for $$\mathcal{M}_{{\rm nest}}$$. 3.3. Inference for linear combinations For inference for linear combinations $${x}^{T}{\vartheta}$$ after model selection, we rewrite (3) as \begin{align} F(t)&=\mathop{\lim}_{n \rightarrow \infty} \mathrm{pr}\big[n^{1/2}\,\tilde{{x}}^{T}(M) \{{\hat{\theta}}'(M)-\tilde{{\vartheta}}(M)\}\leqslant t \mid M_{{\small{\text{AIC}}}}=M, \mathcal{M}\big]\nonumber\\ &=\mathrm{pr} \{\tilde{{x}}^{T}(M) \,{J}_{M}^{-1/2}({\vartheta}) \tilde{{Z}}(M) \leqslant t \mid \mathcal{A}_M(\mathcal{M}\cap\mathcal{M}_O)\}, \end{align} (7) where $$\tilde{{x}}(M)$$ is the subvector of $${x}$$ corresponding to the covariates in model $$M$$. The asymptotic distribution of the estimated linear combination $${x}^{T}{\vartheta}$$ is simulated via (7). When the sample size is small and the diagonal entries of $$J(\hat{{\theta}})$$ are large, it may happen that an underparameterized model is selected. In this case the coverage probability of confidence regions of a linear combination of the parameters, or a transformation thereof in generalized linear models, may be smaller than the nominal value. In cases of suspected underselection, one can use \begin{equation} \mathop{\lim}_{n \rightarrow \infty}\mathrm{pr}[n^{1/2}{x}^{T}\{\hat{{\theta}}(M_{{\rm full}})-{\vartheta}\}\leqslant t \mid M_{{\small{\text{AIC}}}}\!=\!M, \mathcal{M}] =\mathrm{pr}\{{x}^{T}{J}^{-1/2}({\vartheta}) {Z}_{a+K} \leqslant t \mid \mathcal{A}_M(\mathcal{M})\},\quad \end{equation} (8) where $$M_{{\rm full}}$$ is the full model. This differs from (7) in using all parameters, not just the selected parameters. This procedure is different from assuming that the full model is selected, since, for example in $$\mathcal{M}_{\rm all}$$, $$\:\mathcal{A}_M(\mathcal{M}_{\rm all})$$ contains $$z_i^2>2$$ for the parameters which are selected and $$z_i^2<2$$ for those which are not selected, whereas $$\mathcal{A}_{M_{{\rm full}}}(\mathcal{M}_{{\rm all}})$$ contains $$z_i^2>2$$ for all parameters, leading to a long-tailed distribution. The probability of underselection disappears asymptotically. The valid confidence intervals of Bachoc et al. (2015) target the true value for the selected model, not the true value $${x}^{T}{\vartheta}$$. While in their case underparameterized selection is not an issue, there is no guarantee that their proposed confidence interval is valid for the true value.
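The following is a minimal sketch of how the conditional law in (7) can be simulated for a selected model; the information matrix, the parameter estimates, the vector $${x}$$ and the single active constraint are hypothetical values chosen only for illustration. Constraints on unselected components are omitted because, for standard normal $${Z}$$, they involve components independent of $$\tilde{{Z}}(M)$$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical selected model with three parameters; J_hat_M, theta_hat_M and x_M
# are illustrative values, not output from the paper's examples.
n = 100
J_hat_M = np.array([[1.0, 0.2, 0.1],
                    [0.2, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])
theta_hat_M = np.array([2.1, -1.0, 0.6])
x_M = np.array([1.0, 2.0, 0.5])

# Symmetric inverse square root J_M^{-1/2} via the eigendecomposition.
vals, vecs = np.linalg.eigh(J_hat_M)
J_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T

# Suppose only the third selected parameter lies outside M_pars, so the active
# constraint is z_3^2 > 2; rejection sampling again stands in for an exact sampler.
Z = rng.standard_normal((200000, 3))
Zc = Z[Z[:, 2] ** 2 > 2]

# Conditional draws of the limiting law of n^{1/2} x^T {theta_hat(M) - theta} in (7).
lin = Zc @ J_inv_sqrt @ x_M
lo, hi = np.quantile(lin, [0.025, 0.975])
ci = (x_M @ theta_hat_M - hi / np.sqrt(n), x_M @ theta_hat_M - lo / np.sqrt(n))
print('95% post-selection interval for x^T theta:', ci)
```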
4. Confidence regions when all models are misspecified 4.1. Limiting distribution of estimators The results in this section do not require any assumption about the existence of a true model, are uniformly valid, and apply to general parametric likelihood models. In order to obtain uniformly valid results, we consider the setting where there is no true parameter vector, either because the true density of the data does not belong to a parametric family or because all models are misspecified. We assume the observations to be represented by a triangular array $$\{Y_{ni}: i=1,\ldots,n,\: n\in\mathbb{N}\}$$, where there is independence between the rows, i.e., different sample sizes $$n$$, and within the rows, i.e., $$Y_{ni}$$ and $$Y_{nj}$$ are independent for $$i\neq j$$. Regression models are included, as observations may have different distributions. The true joint density of $$(Y_{n1},\ldots,Y_{nn})$$ is $$g_n$$, with distribution function $$G_n$$. All probabilities are computed under the true distribution, so $$\mathrm{pr}=\mathrm{pr}_{G_n}$$. The data are modelled via models $$M_{n,j}=\{\prod_{i=1}^nf_{j,i}(y_i;\theta_j): \theta_j\in\Theta_j \subset\mathbb{R}^{m_j}\}$$. Thus $$m_j$$ is the number of parameters in model $$M_{n,j}$$. All models are collected in the set $$\mathcal{M}_{n}=\{M_{n,1},\ldots,M_{n,J}\}$$. When confusion is unlikely, we omit the subscript $$n$$ from the notation. We assume for each $$n\in\mathbb{N}$$ that $$\int g_n(y)\log g_n(y) \,{\rm d} y <\infty$$. This defines the class of true distributions $${\mathcal{G}}_n$$. Regarding the models, assume that for each $$i\in\mathbb{N}$$ and each $$j=1,\ldots, J$$, $$\:f_{j,i}(\cdot\,;\theta_j)$$ is measurable for all $$\theta_j\in\Theta_j$$, a compact set, and $$f_{j,i}(y_i;\cdot)$$ is continuous on $$\Theta_j$$ almost surely and continuously differentiable on $$\Theta_j$$. Then for every model there exists (White 1994, Theorem 2.12) an estimator $$\hat\theta_{n,j}$$ maximizing $$\prod_{i=1}^nf_{j,i}(y_i;\theta_j)$$ over $$\Theta_j$$. If $$E_{G_n}\{n^{-1}\sum_{i=1}^n \log f_{j,i}(y_i;\theta_j)\}$$ has an identifiable unique maximizer over $$\Theta_j$$, this maximizer is called the pseudo-true value $$\vartheta_{n}^*(M_j)$$. This value depends on the true joint density, the model densities, and the sample size. We define two vectors of length $$m'=\sum_{j=1}^J m_j$$: $$\vartheta_{n,\mathcal{M}}^* =\{\vartheta_{n}^{*T}(M_1),\ldots,\vartheta_{n}^{*T}(M_K)\}^{{ \mathrm{\scriptscriptstyle T} }}$$ and $$\hat\theta_{n,\mathcal{M}} =\{\hat\theta_{n}^{T}(M_1),\ldots,\hat\theta_{n}^{T}(M_K)\}^{{ \mathrm{\scriptscriptstyle T} }}$$. Lemma 2. Let $$\{Y_{ni}: i=1,\ldots,n, \: n\in\mathbb{N}\setminus 0\}$$ form a triangular array consisting of independent random variables.
(i) For all components of the vector $$\vartheta_{n,\mathcal{M}}^*$$, stated here for the $$k$$th such component of $$\theta_j$$ corresponding to model $$M_j$$, assume that for all $$G_n\in\mathcal{G}_n$$ with $$ A=\{y_i\in\mathbb{R}: | (\partial/\partial\theta_k)\log f_{j,i}\{y_i;\theta_{n}^*(M_j)\}| >\varepsilon nQ_{M_j,kk}\{\vartheta_{n}^*(M_j)\}\} $$ and for all $$\varepsilon>0$$, $$ \lim_{n\to\infty}\sum_{i=1}^n\int_A \left[\frac{\partial}{\partial\theta_k}\log f_{j,i}\{y_i;\vartheta_{n}^*(M_j)\}\right]^2\Big/ \big[nQ_{M_j,kk}\{\vartheta_{n}^*(M_j)\}\big] \,{\rm d} G_{ni}(y_i) = 0\text{.} $$ (ii) Write $$\Sigma_{M_j}\{\vartheta_{n}^*(M_j)\} = Q_{M_j}^{-1}\{\vartheta_{n}^*(M_j)\} J_{jj}\{\vartheta_{n}^*(M_j),\vartheta_{n}^*(M_j)\}Q_{M_j}^{-1}\{\vartheta_{n}^*(M_j)\}$$, and assume that \begin{align*} &\lim_{n\to\infty}\max_{i=1,\ldots,n}\mathrm{pr}_{G_n}\!\!\left(\! (\Sigma_{M_j,kk})^{-1/2}n^{-1/2} \bigl[Q^{-1}_{M_j}\{\vartheta_{n}^*(M_j)\}\bigr]_{kk} \left|\frac{\partial}{\partial\theta_k}\log f_{j,i}\{y_i;\vartheta_{n}^*(M_j)\}\right|>\varepsilon\!\right) = 0\text{.} \end{align*} Define $$\mathcal{W}_n\sim N_{m'}\{0,\Sigma(\vartheta_{n,\mathcal{M}}^*)\}$$ where $$\Sigma(\vartheta_{n,\mathcal{M}}^*)$$ is an $$m'\times m'$$ matrix with $$(i,j)$$th block, of size $$m_i\times m_j$$, equal to $$Q_{M_i}^{-1}\{\vartheta_{n}^*(M_i)\} J_{ij}\{\vartheta_{n}^*(M_i),\vartheta_{n}^*(M_j)\} Q_{M_j}^{-1}\{\vartheta_{n}^*(M_j)\}$$. Then $$ \lim_{n\to\infty}\sup_{t\in \mathbb{R}^{m'}}\sup_{G_n\in\mathcal{G}_n} \Big| \mathrm{pr}\{n^{1/2}(\hat\theta_{n,\mathcal{M}} - \vartheta_{n,\mathcal{M}}^*) \le t\} - \mathrm{pr}(\mathcal{W}_n\le t)\Big| = 0\text{.} $$ A pivot is needed in order to construct confidence regions. In general, the variance $$\Sigma(\vartheta^*_{n,\mathcal{M}})$$ of $$\mathcal{W}_n$$ may depend on $$\vartheta^*_{n,\mathcal{M}}$$. When there is an estimator $$\hat\Sigma$$ of $$\Sigma$$ such that $$ \lim_{n\to\infty}\sup_{G_n\in\mathcal{G}_n} \mathrm{pr}_{G_n}(\|\hat\Sigma_n-\Sigma\|>\varepsilon)=0, $$ where $$\|A\|$$ denotes the Euclidean matrix operator norm of $$A$$, then, with $$\mathcal{Z}_{m'}\sim N_{m'}(0,I_{m'})$$, $$ \lim_{n\to\infty}\sup_{G_n\in\mathcal{G}_n}\sup_{t\in\mathbb{R}^{m'}} \Big| \mathrm{pr}\{\hat\Sigma_n^{-1/2}n^{-1/2}(\hat\theta_{n,\mathcal{M}} - \vartheta_{n,\mathcal{M}}^*)\le t\}-\mathrm{pr}(\mathcal{Z}_{m'}\le t)\Big|=0\text{.} $$ The model determines whether or not the variance can be estimated well. (White 1994, § 8.3) gives some general conditions for consistent estimation of the variance. One requirement is that $$n^{-1}\sum_{i=1}^nE(s)E(s^{{ \mathrm{\scriptscriptstyle T} }}) \to 0$$, where $$s$$ is the vector of length $$m'$$ consisting of subvectors $$(\partial/\partial\theta_k)\log f_{ki}(Y_i;\vartheta_k^{*})$$ for $$k=1,\ldots,K$$. This assumption holds, for example, when the models are correctly specified. Under misspecification, (White 1994, § 8.3) showed that the empirical estimator for $$\Sigma(\vartheta_{n,\mathcal{M}}^*)$$ could overestimate the covariance matrix, leading to conservative confidence intervals. 4.2. Selection region in a misspecified setting When $$\mathcal{M}$$ consists of misspecified models, calculating the selection event requires additional care. Define $$\ell_{n,M_j}(y,\theta_j)=\sum_{i=1}^n \log f_{j,i}(y_i,\theta_j)$$. 
When model $$M_{\small{\text{AIC}}}$$ is selected, then for all $$M \in {\mathcal{M}}\setminus M_{\small{\text{AIC}}}$$, $$\:2[\ell_{n,M_{\small{\text{AIC}}}}\{y,\hat{\theta}_n(M_{\small{\text{AIC}}})\}-\ell_{n,M}\{y,\hat{\theta}_n(M)\}]\geqslant 2(|M_{\small{\text{AIC}}}|-|M|)$$. When both models $$M_{\small{\text{AIC}}}$$ and $$M$$ are correctly specified, the difference of the loglikelihoods can be characterized asymptotically by chi-squared random variables. However, when there is misspecification this difference can diverge to $$+\infty$$ or $$-\infty$$, depending on the assumptions about the models. For strictly nonnested models, the difference always diverges (Vuong 1989, Theorem 5.1). If the selected model is asymptotically always the best one, the selection event places no restriction on the parameter estimators. See also Cox & Hinkley (1974, § 9.3) for the asymptotic behaviour of likelihood ratio tests in nonnested settings. For overlapping models having some common parameters, the loglikelihood difference converges to some random variable if one of the models is correctly specified, and diverges otherwise. Under misspecification of all models, the only situation where the asymptotic distribution can be used to characterize the selection event is the case of nested models under similarity of the likelihoods (Vuong 1989, Assumption A8). This means that $$\ell_{n,M_k}\{y,\vartheta_n^*(M_k)\}=\ell_{n,M_l}\{y,\vartheta_n^*(M_l)\}$$ for $$k,l=1,\ldots,K$$. For an arbitrary set of models we impose the same similarity condition and assume that $$\mathcal{M}$$ includes a model $$M_{{\rm s}}=M_{{\rm small}}$$ which is nested in all other models. If we were to perform a likelihood ratio test under this assumption, it would correspond to testing whether the smaller model can be considered equal to or worse than the larger model (Vuong 1989, Lemma 7.1). We first compare each model with the smallest model and then use the regions obtained from each comparison to compute the final selection region using pairwise comparisons. By imposing similarity, the calculated quantiles to be used in the confidence regions are larger than without similarity since, as explained earlier, the loglikelihood difference diverges otherwise and there is no restriction on the parameter estimators. For all $$M \in \mathcal{M}\setminus M_{{\rm s}}$$, \begin{align} &2[\ell_{n,M}\{y,\hat{\theta}_n(M)\}-\ell_{n,M_{\rm s}}\{y,\hat{\theta}_n(M_{\rm s})\}] \nonumber\\ &\quad = n\{\hat{\theta}_n(M)-\vartheta_n^*(M)\}^{{ \mathrm{\scriptscriptstyle T} }}Q_M\{\vartheta_n^*(M)\}\{\hat{\theta}_n(M)-\vartheta_n^*(M)\} \nonumber\\ &\qquad - n\{\hat{\theta}_n(M_{\rm s})-\vartheta_n^*(M_{\rm s})\}^{{ \mathrm{\scriptscriptstyle T} }}Q_{M_{\rm s}}\{\vartheta_n^*(M_{\rm s})\}\{\hat{\theta}_n(M_{\rm s})-\vartheta_n^*(M_{\rm s})\} + o_{\rm p}(1) \nonumber\\ &\quad = n(\hat{\theta}_{n,\mathcal{M}}-\vartheta_{n,\mathcal{M}}^*)^{{ \mathrm{\scriptscriptstyle T} }}W_{M,M_{\rm s}}(\hat{\theta}_{n,\mathcal{M}}-\vartheta_{n,\mathcal{M}}^*) + o_{\rm p}(1), \end{align} (9) where $$W_{M,M_{{\rm s}}}$$ is a block-diagonal matrix partitioned in the same way as $$\Sigma$$, whose diagonal block referring to model $$M$$ equals $$Q_{M}\{\vartheta^*_n(M)\}$$, that referring to model $$M_{\rm s}$$ equals $$-Q_{M_{\rm s}}\{\vartheta^*_n(M_{\rm s})\}$$, and other entries are all zero. If the models are already nested, there is no need to compare each model with the smallest model.
The asymptotic counterpart of the selection event is \begin{align} \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})=\Big\{z \in \mathbb{R}^{m'}: z^{{ \mathrm{\scriptscriptstyle T} }} \Sigma^{1/2} (W_{M_{\small{\text{AIC}}},M_{\rm s}}-W_{M,M_{\rm s}}) \Sigma^{1/2}z \geqslant 2(|M_{\small{\text{AIC}}}|-|M|),\nonumber\\ M \in \mathcal{M}\setminus M_{\small{\text{AIC}}}\Big\}\text{.} \end{align} (10) Proposition 4. Suppose that the assumptions of Lemma 2 hold. Then, for a set of models with $$\mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})$$ from (10), \begin{align*} \begin{split} \lim_{n\rightarrow\infty} \sup_{G_n \in \mathcal{G}_n} \sup_{t \in \mathbb{R}^{|M_{\small{\text{AIC}}}|}} \Big|\mathrm{pr}[n^{1/2}\{\hat{\theta}(M_{\small{\text{AIC}}})-\vartheta^*(M_{\small{\text{AIC}}})\} \leqslant t \mid M_{\small{\text{AIC}}}] & \\ \nonumber -\,\mathrm{pr}\{\Sigma^{1/2}Z \leqslant t \mid \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})\}\Big|& =0\text{.} \end{split} \end{align*} As noted by Tibshirani et al. (2015), uniform convergence in distribution can be translated to uniformly valid confidence sets. The following proposition clarifies this statement. The proof is similar to the proof of Proposition 4, using the fact that a continuous mapping preserves uniform convergence. Proposition 5. Suppose that the assumptions of Lemma 2 hold and that the set of models $$\mathcal{M}$$ contains a smallest model which is nested in all the models. Define the set \[ C^*(q_{\alpha})\! =\! \bigl\{\theta \in \mathbb{R}^{|M_{\small{\text{AIC}}}|}\! : n \{\hat{\theta}(M_{\small{\text{AIC}}})-\theta(M_{\small{\text{AIC}}})\}^{{ \mathrm{\scriptscriptstyle T} }} \Sigma_{M_{\small{\text{AIC}}}}(\vartheta^*_{M_{\small{\text{AIC}}}})^{-1} \{\hat{\theta}(M_{\small{\text{AIC}}})-\theta(M_{\small{\text{AIC}}})\}\leqslant q_{\alpha}\bigr\}, \] where $$q_{\alpha}$$ is determined by solving \begin{align*} &\mathrm{pr}\bigl[ \{\tilde{{Z}}^{{ \mathrm{\scriptscriptstyle T} }}({M_{{\small{\text{AIC}}}}}) \Sigma_{M_{\small{\text{AIC}}}}(\vartheta^*_{M_{\small{\text{AIC}}}})^{-1} \tilde{{Z}}({M_{{\small{\text{AIC}}}}}) \leqslant q_{\alpha}\} \, \cap \{Z \in \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})\}\bigr]\\ &\quad = \mathrm{pr}\{Z \in \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})\}(1-\alpha)\text{.} \end{align*} Then $$ \lim_{n\rightarrow\infty} \sup_{G_n \in \mathcal{G}_n} \sup_{\alpha \in [0,1]} \bigl| \mathrm{pr} _{G_n}\{ \vartheta^{*}(M_{\small{\text{AIC}}}) \in C^*(q_{\alpha})\mid M_{\small{\text{AIC}}} \} - (1-\alpha)\bigr|=0 $$. 5. Simulation study 5.1. Parameters in linear models While the proposed method is applicable to general likelihood models, in order to compare it with existing methods, we present simulation results for linear models only. Results for generalized linear models and other settings can be found in the Supplementary Material. The data are generated from a regression model $$ Y_i=\sum_{j=1}^{10} \vartheta_j x_{ji} + \varepsilon_i$$$$(i=1,\ldots, n) $$ with $$\varepsilon_i\sim N(0,1)$$. The true value of the parameters is $${\vartheta}^{{ \mathrm{\scriptscriptstyle T} }}=(2{\cdot}25,-1{\cdot}1,2{\cdot}43,-2{\cdot}24,2{\cdot}5,{0}_5^{{ \mathrm{\scriptscriptstyle T} }})$$, with $${0}_5$$ denoting the vector of length 5 whose entries are all zero. We set $$x_{1i}=1$$ and $$(x_{2i}, \ldots, x_{10,i})^{{ \mathrm{\scriptscriptstyle T} }} \sim N({0}_{9}, {\Omega})$$, where $${\Omega}$$ is a positive-definite matrix with diagonal elements equal to 1 and off-diagonal entries equal to $$0{\cdot}25$$.
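The following sketch generates one dataset from this design and performs a single AIC-selection step over a candidate set in the spirit of the $${\zeta}_{{\rm all}}^3$$ sets introduced next; the sample size $$n=100$$, the Gaussian AIC helper aic_gaussian and the exhaustive enumeration are illustrative choices, not the implementation used for the reported simulations.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Data-generating process of Section 5.1: intercept plus nine correlated covariates.
n = 100
theta = np.array([2.25, -1.1, 2.43, -2.24, 2.5, 0, 0, 0, 0, 0])
Omega = np.full((9, 9), 0.25) + 0.75 * np.eye(9)   # unit diagonal, off-diagonal 0.25
X = np.column_stack([np.ones(n), rng.multivariate_normal(np.zeros(9), Omega, size=n)])
y = X @ theta + rng.standard_normal(n)

def aic_gaussian(y, Xm):
    """AIC for a Gaussian linear model with unknown variance (additive constants dropped)."""
    beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
    rss = np.sum((y - Xm @ beta) ** 2)
    k = Xm.shape[1] + 1                     # regression coefficients + error variance
    return len(y) * np.log(rss / len(y)) + 2 * k

# Candidate set: the first three covariates are in every model and all subsets of
# the remaining seven are allowed, mimicking a zeta_all^3-type selection matrix.
forced, free = [0, 1, 2], [3, 4, 5, 6, 7, 8, 9]
best = None
for r in range(len(free) + 1):
    for extra in itertools.combinations(free, r):
        cols = forced + list(extra)
        score = aic_gaussian(y, X[:, cols])
        if best is None or score < best[0]:
            best = (score, cols)

print('AIC-selected columns:', best[1])
```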
The sample size is either 30 or 100. Three different model sets were considered. Let $${\zeta}_{{\rm all}}^i$$ be the selection matrix when the first $$i$$ parameters are present in all models. We take $${\zeta}_{{\rm all}}^3$$, which is a $$2^7 \times 10$$ matrix, $${\zeta}_{{\rm all}}^6$$, which is a $$2^4 \times 10$$ matrix, and $${\zeta}_{{\rm arb}}$$, which contains 14 rows arbitrarily chosen from $${\zeta}_{{\rm all}}^3$$. We are interested in inference for the parameters in the selected model. In order to facilitate comparison, the simulations were run until model $$M$$ with parameters $$(\vartheta_1,\ldots,\vartheta_6, \vartheta_8)$$ had been selected 3000 times. For each of those simulation runs, the Fisher information matrix was estimated in the full model by $$\skew6\hat{{J}}(\hat{{\theta}})$$, leading to the submatrix $$\skew6\hat{{J}}_M(\hat{{\theta}})$$. When Assumption A1(v) does not hold, one should use (4) to calculate the confidence intervals. However, we used (3) instead, and it resulted in good approximations. Quantiles of the limiting asymptotic distribution for each setting were obtained via simulation; see the Supplementary Material for the code. In each simulation run we computed the lower and upper limits of the confidence interval and report, in Table 1, the averaged confidence intervals along with the coverage percentages for $$\vartheta_4$$, $$\vartheta_6$$ and $$\vartheta_8$$. Results for the other parameters are omitted to save space. Table 1 Results of the simulation study with $$3000$$ runs of AIC selection: average confidence intervals and coverage percentages for $$\vartheta_4, \vartheta_6$$ and $$\vartheta_8$$ using different selection matrices $${\zeta}$$ corresponding to different model sets $$\mathcal{M}$$ and two sample sizes $$n$$ for the proposed method, along with results obtained by the method of Berk et al. 
(2013) and by a naive approach that treats the selected model as given and ignores selection
n = 30
Method   $$\vartheta_j$$   $${\zeta}_{{\rm all}}^3$$   $${\zeta}_{{\rm all}}^6$$   $${\zeta}_{{\rm arb}}$$
PostAIC  $$\vartheta_4$$   [-2·85, -1·64] 98   [-2·68, -1·78] 92   [-2·85, -1·64] 97
PostAIC  $$\vartheta_6$$   [-0·60, 0·62] 94   [-0·45, 0·45] 93   [-0·60, 0·62] 96
PostAIC  $$\vartheta_8$$   [-0·60, 0·61] 94   [-0·60, 0·60] 95   [-0·61, 0·62] 96
PoSI     $$\vartheta_4$$   [-2·98, -1·51] 99   [-2·89, -1·57] 99   [-2·97, -1·52] 99
PoSI     $$\vartheta_6$$   [-0·73, 0·75] 99   [-0·66, 0·66] 99   [-0·71, 0·73] 99
PoSI     $$\vartheta_8$$   [-0·73, 0·74] 98   [-0·66, 0·67] 97   [-0·72, 0·73] 99
Naive    $$\vartheta_4$$   [-2·67, -1·82] 89   [-2·68, -1·79] 91   [-2·66, -1·83] 89
Naive    $$\vartheta_6$$   [-0·42, 0·43] 69   [-0·44, 0·44] 92   [-0·41, 0·42] 71
Naive    $$\vartheta_8$$   [-0·42, 0·43] 70   [-0·44, 0·45] 75   [-0·41, 0·43] 71
n = 100
Method   $$\vartheta_j$$   $${\zeta}_{{\rm all}}^3$$   $${\zeta}_{{\rm all}}^6$$   $${\zeta}_{{\rm arb}}$$
PostAIC  $$\vartheta_4$$   [-2·54, -1·94] 99   [-2·46, -2·02] 94   [-2·55, -1·93] 99
PostAIC  $$\vartheta_6$$   [-0·30, 0·31] 95   [-0·22, 0·22] 95   [-0·31, 0·32] 96
PostAIC  $$\vartheta_8$$   [-0·30, 0·31] 95   [-0·29, 0·30] 95   [-0·31, 0·31] 97
PoSI     $$\vartheta_4$$   [-2·58, -1·90] 100   [-2·54, -1·94] 99   [-2·57, -1·90] 99
PoSI     $$\vartheta_6$$   [-0·33, 0·34] 98   [-0·30, 0·30] 99   [-0·33, 0·34] 98
PoSI     $$\vartheta_8$$   [-0·34, 0·34] 98   [-0·29, 0·31] 95   [-0·33, 0·34] 98
Naive    $$\vartheta_4$$   [-2·46, -2·02] 93   [-2·46, -2·02] 93   [-2·46, -2·02] 92
Naive    $$\vartheta_6$$   [-0·22, 0·22] 66   [-0·22, 0·22] 94   [-0·21, 0·22] 67
Naive    $$\vartheta_8$$   [-0·22, 0·22] 66   [-0·22, 0·23] 69   [-0·22, 0·22] 65
Each cell gives the average confidence interval followed by the coverage percentage. PostAIC, our proposed method; PoSI, the method of Berk et al. (2013); Naive, a naive approach that treats the selected model as given and ignores selection.
Confidence intervals from the method of Berk et al. (2013) are given for the sake of comparison. Their target for inference is the so-called nonstandard target (Bachoc et al. 2015), namely the best coefficients within the selected model, in contrast to the standard target, i.e., the true values of the parameters (Berk et al. 2013, equation (3.2)). Simulation results in Leeb et al. (2015) have shown that the coverage probability of such intervals for the standard target is lower than the nominal value in certain situations. For $${\zeta}_{{\rm all}}^3$$, where $$\vartheta_4$$ and $$\vartheta_5$$ are truly nonzero, the conditional confidence intervals for the proposed method have simulated coverage probabilities higher than the nominal 95%. This is because, with $$\mathcal{A}_M(\mathcal{M}_{{\rm all}}^3)$$, the constraint set contains $$Z_4^2>2$$ and $$Z_5^2>2$$, while $$Z_4$$ and $$Z_5$$ are truly unconstrained when taking $$\mathcal{A}_M(\mathcal{M}_{{\rm all}}^3\cap \mathcal{M}_O)$$. For $$\vartheta_6$$ and $$\vartheta_8$$, which are truly zero, $$Z_6^2>2$$ and $$Z_8^2>2$$ are correct constraints. One may expect conservative confidence intervals for $$\vartheta_6$$ and $$\vartheta_8$$ because they are defined by multiplication of the corresponding rows in $$\skew6\hat{{J}}^{1/2}_M(\hat{{\theta}})$$ by $$\tilde{{Z}}(M)$$.
The latter vector satisfies the constraints $$\mathcal{A}_M(\mathcal{M}_{{\rm all}}^3)$$ rather than $$\mathcal{A}_M(\mathcal{M}_{{\rm all}}^3\cap \mathcal{M}_O)$$, so the distribution is longer-tailed than needed. For the current simulation, the settings considered lead to $$\skew6\hat{{J}}^{1/2}_M(\hat{{\theta}})$$ with small off-diagonal elements, so the distribution of an estimator is mainly determined by its corresponding $$Z_i$$. For $${\zeta}_{{\rm all}}^6$$ the coverages almost equal the nominal values, especially for $$n=100$$. Using $${\zeta}_{{\rm arb}}$$ leads to conservative confidence intervals for all parameters because of the additional constraints in $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}})$$, while theoretically the constraints should be $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}} \cap \mathcal{M}_O)$$. The method of Berk et al. (2013) always yields conservative confidence intervals, although there is no guarantee that it will always lead to valid confidence intervals for the true parameters. Naive confidence intervals for $$\vartheta_4$$ have coverages almost equal to the nominal value, whereas for $$\vartheta_6$$ using $${\zeta}_{{\rm arb}}$$ or $${\zeta}_{{\rm all}}^3$$ and for $$\vartheta_8$$ in all settings the coverage percentages are around 70%. This is the result of wrongly treating the selected model as given. For settings with small off-diagonal elements of $$\skew6\hat{{J}}^{1/2}_M(\hat{{\theta}})$$, the confidence intervals for the truly nonzero parameters are valid. Other simulation results are presented in the Supplementary Material. We find that the proposed method can be used even in underparameterized situations, where Assumption A1(i) does not hold. 5.2. Linear combinations in linear models The performance of the proposed method for linear combinations was investigated via simulations. Let $${\vartheta}^{{ \mathrm{\scriptscriptstyle T} }}=(2{\cdot}25,-1{\cdot}1,2{\cdot}43,-1{\cdot}24,2{\cdot}5,{0}_8^{{ \mathrm{\scriptscriptstyle T} }})$$ be the true values for the parameters in a linear model, with the error standard deviation being either 1 or 3. Four different selection matrices are considered, $${\zeta}_{{\rm all}}^i$$ for $$i\in\{3,5,8,10\}$$, indicating that the first $$i$$ covariates are common to each model. The data-generation processes are as in § 5.1. For this simulation, we do not control the selected model because we are interested in a linear combination of the selected parameters. Table 2 shows the results. We compare the post-selection intervals with the smoothed bootstrap confidence intervals (Efron 2014) and the intervals for post-selection predictions (Bachoc et al. 2015). The bootstrap samples consist of $$n$$ draws with replacement from the main dataset and we replicate this $$B=1000$$ times. When the number of bootstrap replications is smaller than the ideal $$n^n$$, the variance of the smoothed bootstrap estimator is biased upwards, so we use the bias-corrected version (Efron 2014, Remark J). The post-selection intervals for prediction have a target based on the selected model, which may differ from the true prediction. Table 2.
Results of the simulation study with $$3000$$ runs of selection with AIC: average length of $$95\%$$ confidence intervals and coverage percentages for a linear combination of the parameters for different methods and model sets using the selection matrices $${\zeta}$$ for two sample sizes
$$\sigma=1$$, $$n=30$$
Method    $${\zeta}_{{\rm all}}^3$$   $${\zeta}_{{\rm all}}^5$$   $${\zeta}_{{\rm all}}^8$$   $${\zeta}_{{\rm all}}^{10}$$
PostAIC   3·11 (97)    2·61 (95)    2·90 (94)    3·08 (94)
Boot      3·67 (92)    3·32 (92)    3·31 (92)    3·79 (92)
PoSIp     4·38 (100)   4·39 (100)   5·36 (100)   6·00 (100)
$$\sigma=1$$, $$n=100$$
PostAIC   1·42 (98)    1·17 (95)    1·30 (96)    1·37 (95)
Boot      1·25 (94)    1·25 (94)    1·30 (94)    1·33 (93)
PoSIp     1·83 (100)   1·83 (100)   2·20 (100)   2·42 (100)
$$\sigma=3$$, $$n=30$$
PostAIC   11·76 (98)   7·82 (94)    8·68 (94)    9·24 (94)
Boot      11·46 (92)   9·95 (92)    9·94 (92)    11·37 (92)
PoSIp     12·65 (99)   13·16 (100)  16·08 (100)  17·99 (100)
$$\sigma=3$$, $$n=100$$
PostAIC   4·25 (98)    3·50 (95)    3·90 (96)    4·12 (95)
Boot      3·77 (94)    3·74 (94)    3·90 (94)    4·00 (93)
PoSIp     5·47 (100)   5·48 (100)   6·60 (100)   7·26 (100)
Each entry gives the average interval length with the coverage percentage in parentheses. PostAIC, our proposed method; Boot, smoothed bootstrap (Efron 2014); PoSIp, method of post-selection prediction (Bachoc et al. 2015).
The choice of models with $${\zeta}_{{\rm all}}^3$$ as a selection matrix results in conservative confidence intervals due to conditioning on $$\mathcal{A}_M(\mathcal{M}^3_{{\rm all}})$$, similar to before. For this selection matrix, the confidence intervals obtained by the bootstrap method are shorter than those from the proposed post-selection method. The bootstrap confidence intervals are not directly based on the selected model for the original data, because a model is selected for each bootstrap sample. The ideal situation is when the selection matrix is $${\zeta}_{{\rm all}}^5$$, since all truly nonzero parameters are then forced to be in the model.
For $${\zeta}_{{\rm all}}^5$$ the confidence intervals for the proposed method are always shorter than those for the competing methods, and their coverages are close to the nominal value. For $${\zeta}_{{\rm all}}^8$$ and $${\zeta}_{{\rm all}}^{10}$$ the situation is the same, though with wider intervals than for $${\zeta}_{{\rm all}}^{5}$$ for all methods, because more parameters are forced to be in the model, which increases the variability of the predictions. These confidence intervals are, however, not wider than those for $${\zeta}_{{\rm all}}^3$$. Thus the variability of the prediction is affected more by the conditioning event than by forcing more variables into the model. The post-selection method for prediction (Bachoc et al. 2015) always leads to wider confidence intervals than the bootstrap method and the proposed method. The coverages of the confidence intervals for the proposed method are always close to or higher than the nominal values, whereas the bootstrap method can have coverage below the nominal value. Moreover, the bootstrap method for all possible models is computationally intensive, because it requires $$B$$ bootstrap samples and all candidate models are fitted for each of them. For the setting $$\sigma=3$$ and $$n=30$$ with $${\zeta}_{{\rm all}}^3$$, we used the results in (8) instead of (7), because in this setting the probability of selecting an underparameterized model is not zero, owing to the small sample size and the large variance. When we used (7), the average length of the confidence interval was 9$$\cdot$$9 and the coverage was around 90%.

6. Pima Indian diabetes data

We construct confidence intervals conditional on the selected model for a logistic regression model applied to the Pima Indian diabetes dataset (Lichman 2013). This dataset consists of women aged 21 years and over of Pima Indian heritage living near Phoenix, Arizona. We used 332 complete observations. The response is 0 if a test for diabetes is negative and 1 if it is positive. We include seven covariates in the model: npreg, number of pregnancies; glu, plasma glucose concentration in an oral glucose tolerance test; bp, diastolic blood pressure; skin, triceps skin fold thickness in millimetres; bmi, body mass index; ped, diabetes pedigree function; and age in years. See Smith et al. (1988) for more details about the data. First, we consider bootstrap percentile and naive confidence intervals for the parameters in the full model when no selection is involved; see Table 3. We used 5000 bootstrap runs, each resampling the 332 women uniformly with replacement. Several intervals contain zero, which indicates the possibility of using a smaller model.
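A minimal sketch of this full-model computation is given below. It uses the Pima.te data frame from the R package MASS as a stand-in for the data (332 complete records with the same seven covariates); whether it coincides exactly with the extract used here is an assumption, so the resulting intervals need not match Table 3.

```r
# A minimal sketch, assuming MASS::Pima.te stands in for the 332-record extract.
library(MASS)

pima <- Pima.te
pima$y <- as.integer(pima$type == "Yes")
full <- glm(y ~ npreg + glu + bp + skin + bmi + ped + age,
            family = binomial, data = pima)

# Naive Wald intervals in the full model (no selection involved).
naive_ci <- confint.default(full, level = 0.95)

# Nonparametric bootstrap: resample the women with replacement, refit the
# full model, and take percentile intervals of the coefficients.
set.seed(1)
B <- 5000
boot_coef <- replicate(B, {
  idx <- sample(nrow(pima), replace = TRUE)
  coef(glm(y ~ npreg + glu + bp + skin + bmi + ped + age,
           family = binomial, data = pima[idx, ]))
})
boot_ci <- t(apply(boot_coef, 1, quantile, probs = c(0.025, 0.975)))

round(cbind(naive_ci, boot_ci), 3)
```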
Table 3. Pima Indian diabetes data: $$95\%$$ naive and bootstrap confidence intervals in the full model, without selection.

Method      npreg            glu            bp              skin            bmi            ped             age
Naive       [0·03, 0·26]     [0·03, 0·05]   [-0·03, 0·02]   [-0·03, 0·05]   [0·02, 0·14]   [0·24, 2·00]    [-0·02, 0·05]
Bootstrap   [-0·003, 0·30]   [0·03, 0·05]   [-0·03, 0·16]   [-0·03, 0·06]   [0·02, 0·15]   [0·005, 2·41]   [-0·02, 0·07]

Selection uses the set $$\mathcal{M}_{{\rm all}}$$; an intercept is present in all models. This results in the selection of four variables: npreg, glu, bmi and ped. Table 4 presents the unconditional confidence intervals for these parameters obtained by the naive method, together with the post-selection confidence intervals that condition on the model selected by the Akaike information criterion. The naive method ignores the selection procedure and finds the covariate ped to be significant, whereas the proposed method, which takes the selection uncertainty into account, concludes that this covariate is not individually significant at the 5% level. For logistic regression, to the best of our knowledge, there are no other post-selection methods to compare with.
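The selection step itself can be sketched as follows, again with MASS::Pima.te standing in for the data. This is an illustration of exhaustive AIC search over the $$2^7$$ candidate models with an intercept always included, not the authors' code; the PostAIC intervals of Table 4 additionally require the conditional quantile computation of § 4, which is not reproduced here.

```r
# A minimal sketch of all-subsets AIC selection for the logistic model,
# assuming MASS::Pima.te as a stand-in for the 332-record extract.
library(MASS)

pima <- Pima.te
pima$y <- as.integer(pima$type == "Yes")
covs <- c("npreg", "glu", "bp", "skin", "bmi", "ped", "age")

# Enumerate all 2^7 = 128 candidate models and record their AIC values.
fits <- lapply(0:(2^length(covs) - 1), function(code) {
  keep <- covs[bitwAnd(code, 2^(seq_along(covs) - 1)) > 0]
  rhs  <- if (length(keep)) paste(keep, collapse = " + ") else "1"
  glm(as.formula(paste("y ~", rhs)), family = binomial, data = pima)
})
aics <- vapply(fits, AIC, numeric(1))
best <- fits[[which.min(aics)]]

formula(best)          # selected covariates (npreg, glu, bmi and ped in the paper)
confint.default(best)  # naive intervals that ignore the selection step
```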
Table 4. Pima Indian diabetes data: confidence intervals with nominal level $$95\%$$ ignoring (Naive) and accounting for (PostAIC) model selection with the Akaike information criterion.

Method     npreg             glu              bmi              ped
Naive      [0·091, 0·269]    [0·028, 0·049]   [0·042, 0·129]   [0·305, 2·050]
PostAIC    [0·058, 0·299]    [0·022, 0·054]   [0·027, 0·142]   [-0·027, 2·358]

7. Discussion and extensions

For one of the classic model selection methods, the Akaike information criterion (Akaike 1973), we have developed a method to deal with the selection uncertainty by performing inference conditional on the selected model. Our results demonstrate that this inference depends not only on the selected model but also on the set of models from which the selection takes place, as well as on the smallest overparameterized model. The dependence on the set of models is not surprising, though it has not received much attention so far. The proposed method explicitly uses the overselection properties of the Akaike information criterion; see Claeskens & Hjort (2004) for some selection properties under local misspecification. For consistent selection criteria, such as the Bayesian information criterion, other approaches should be used, although effects of the selection remain present (Leeb & Pötscher 2005). Other selection methods that are similar to the Akaike information criterion can be approached in the same way. Consider, for example, selection in an arbitrary set of models allowing for model misspecification, as in § 4, using Takeuchi’s information criterion (Takeuchi 1976), $$ {\small{\text{TIC}}}(M)=2 \ell_n\{\hat{{\theta}}(M)\}-2\, {\rm tr} \{{Q}_{M}({\vartheta}^*)^{-1}{J}_{M}({\vartheta}^*)\}\text{.} $$ In most practical settings the information matrices are estimated by their empirical counterparts $$\hat{{Q}}_{M}(\hat{{\theta}}_M)$$ and $$\hat{{J}}_{M}(\hat{{\theta}}_M)$$. We rewrite (9) for an arbitrary set of models containing $$M_{{\rm s}}$$ by replacing $$|M|$$ with $${\rm tr}\{{Q}_{M}({\vartheta}^*)^{-1}{J}_{M}({\vartheta}^*)\}$$ and proceed to calculate the asymptotic distribution of the parameters conditional on the constraint set.
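As an illustration of the empirical plug-in just described, the following sketch computes $${\rm tr}(\hat{Q}_{M}^{-1}\hat{J}_{M})$$ and the resulting criterion value for a logistic regression fitted with glm(); for the canonical logit link the observed and expected information coincide, so $$\hat{Q}_{M}$$ is the usual weighted cross-product. The function name, the simulated data and the model below are illustrative and are not taken from the paper.

```r
# A minimal sketch, not the authors' code, of the empirical plug-in version of
# Takeuchi's penalty tr{Q^{-1} J} for a logistic regression fitted by glm().
tic <- function(fit) {
  X  <- model.matrix(fit)
  y  <- fit$y
  mu <- fitted(fit)
  # Observed information (negative Hessian) for the canonical logit link.
  Qhat <- crossprod(X * (mu * (1 - mu)), X)
  # Outer product of the score contributions.
  Jhat <- crossprod(X * (y - mu))
  as.numeric(2 * logLik(fit) - 2 * sum(diag(solve(Qhat, Jhat))))
}

# Illustrative use on simulated data.
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 * x1))
fit <- glm(y ~ x1 + x2, family = binomial)
tic(fit)   # larger values are better, matching the sign convention above
```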
Another such example is the generalized information criterion introduced by Konishi & Kitagawa (1996). It considers functional estimators, such as M-estimators, and uses the influence function as part of the criterion, $$ {\small{\text{GIC}}}(M)=-2 \ell_n\{\hat{{\theta}}(M)\}+(2/n)\sum_{i=1}^n {\rm tr}\big\{{\rm Infl}(Y_i)(\partial/\partial{\theta}_M^{{ \mathrm{\scriptscriptstyle T} }}) \log f(Y_i; \hat{{\theta}}_M)\big\}\text{.} $$ Under some regularity conditions, the functional estimator has an asymptotic normal distribution, which allows us to extend the results in § 4. Mallows’ $$C_p$$ (Mallows 1973) for linear regression is $$C_p(M)=\hat{\sigma}^{-2} \hat{\sigma}^{2}(M) +2|M|-n$$, where $$\hat{\sigma}^2$$ is the estimated variance in the full model and $$\hat{\sigma}^{2}(M)$$ uses model $$M$$. The model with the smallest $$C_p$$ value is selected. In nested models one can easily show that, as $$n$$ tends to infinity, $$C_p(M)-C_p(M^*) \sim \chi^2_{q}/q+2q$$, where $$q={|M^*|-|M|}$$. In the same manner as for the Akaike information criterion, one can calculate the constraint set and hence the distribution of the estimators of the parameters in the selected model.

In forward stepwise selection, we start from a small model and embed it in larger models, each containing one additional parameter. This procedure continues until adding a parameter no longer decreases the Akaike information criterion. To be precise, in step $$t$$ we embed model $$M_t$$ in a number of bigger models, each adding one parameter, and define $$\mathcal{M}_t$$ to be this set of models. Model $$M_{t+1}\in\mathcal{M}_t$$ is selected when it has a smaller criterion value than model $$M_t$$ and the smallest criterion value among all models in $$\mathcal{M}_t$$. This means that $$\small{\text{AIC}}(M_{t+1})<\small{\text{AIC}}(M_{t})$$ and $$\small{\text{AIC}}(M_{t+1})< \small{\text{AIC}}(M)$$ for all $$M \in \mathcal{M}_t\setminus M_{t+1}$$. These inequalities can be translated into constraints, and the constraint set is the collection of all the constraints from all the steps. We explicitly dealt with low-dimensional parameters, for which maximum likelihood estimators exist and the Akaike information criterion is well defined. Other criteria are more suitable for high-dimensional parameters.

Acknowledgement

The authors thank the editor and reviewers. Support from the Research Foundation of Flanders, the University of Leuven, and the Interuniversity Attraction Poles Research Network is acknowledged. The computational resources and services used in this work were provided by the Flemish Supercomputer Center, funded by the Hercules Foundation and the Flemish Government.

Supplementary material

Supplementary material available at Biometrika online contains a rewriting of some results of Woodroofe (1982), exact calculations for an example, the selection matrix for one of the simulation settings, additional simulation results, and R code to produce the results in the paper.

Appendix

Assumption A1. Let $$\mathcal{B}_{K}(\epsilon)$$ denote an $$(a+K)$$-dimensional sphere centred at $${\vartheta}$$ with radius $$\epsilon$$, and let $$\mathcal{B}_{K}^{\rm c}(\epsilon)$$ denote its complement. (i) For each $$\epsilon>0$$, as $$n\rightarrow\infty$$, $$\mathop{\sup}_{{\theta}\in\mathcal{B}_{K}^{\rm c}(\epsilon)} \{\ell_n({\theta})-\ell_n({\vartheta})\} {\rightarrow} -\infty$$ in probability. (ii) There exists an $$\epsilon_0 >0$$ such that $$\ell_n ({\theta})$$ is twice continuously differentiable in $$\mathcal{B}_K(\epsilon_0)$$ for all $$n$$ large enough.
Define the score vector $${U}_n({\theta}) = (\partial/ \partial{\theta})\ell_n({\theta})$$ and the negative Hessian matrix $${Q}_n({\theta}) = -(\partial^2/(\partial{\theta}\,\partial{\theta}^{{ \mathrm{\scriptscriptstyle T} }}))\ell_n({\theta})$$. (iii) For some $$0<\epsilon_1 <\epsilon_0$$, as $$n\rightarrow \infty$$, there exists a nonrandom positive-definite continuous matrix $${Q}({\theta})$$, for $${\theta}$$ in $$\mathcal{B}_K(\epsilon_1)$$, such that $$\sup_{{\theta}\in\mathcal{B}_{K}(\epsilon)} {\rm tr}\{{Q}_{n}({\theta})/n - {Q}({\theta})\} {\rightarrow} 0$$ in probability. (iv) As $$n \rightarrow \infty$$, $$\:n^{1/2} {U}_n({\vartheta})$$ is asymptotically $$N\{{0},{J}({\vartheta})\}$$. (v) For $$i\neq j$$ and $$M_i, M_j \in \mathcal{M}_O$$, if the expectation is taken with respect to the true distribution, we have $$J_{ij}\{{\theta}(i),{\theta}(\,j)\}=E(\{\partial / \partial{\theta}(M_i)\}[\ell_n\{{\theta}(M_i)\}]\{\partial/ \partial{\theta}(M_j)^{{ \mathrm{\scriptscriptstyle T} }}\}[\ell_n\{{\theta}(M_j)\}])= 0_{|M_i|\times|M_j|}$$. In Assumption A1, (i)–(iv) are from Woodroofe (1982). Assumption A1(i) leads to the consistency of maximum likelihood estimators for $${\theta}$$ in the model considered and its submodels. For the nonnested case, Assumption A1(v) provides a simplification (Vuong 1989). In linear regression, Assumption A1(v) is equivalent to having an orthogonal design matrix. The next lemma is an extension of Lemma A in Vuong (1989) to more than two models. Lemma A1. Suppose that conditions $${\rm (i)-(iv)}$$ in Assumption A1 hold. Fix any ordering of the models in $${\mathcal{M}}_O$$ and write $$o=|{\mathcal{M}}_O|$$. Then, as $$n\to\infty$$, $$\:n^{1/2}(\hat{\theta}_{\mathcal{M}_o}-\vartheta_{\mathcal{M}_o})= n^{1/2}\{{\hat{\theta}}'(M_{1})^{{ \mathrm{\scriptscriptstyle T} }}-{\vartheta}(M_1)^{{ \mathrm{\scriptscriptstyle T} }}, \ldots, {\hat{\theta}}'(M_o)^{{ \mathrm{\scriptscriptstyle T} }}-{\vartheta}(M_o)^{{ \mathrm{\scriptscriptstyle T} }}\}^{{ \mathrm{\scriptscriptstyle T} }} \to N\{0,\Sigma(\vartheta)\}$$ in distribution. Proof. As in Vuong (1989), a Taylor series expansion leads to \begin{eqnarray*} 0=n^{-1/2}U_{n,M_i}({\vartheta})+{Q}_{M_i}({\vartheta})n^{1/2} \{\hat{{\theta}^{'}}(M_{i})-{\vartheta}\}+o_{\rm p}(1)\quad (M_i \in \mathcal{M}_O)\text{.} \end{eqnarray*} By the multivariate central limit theorem, we have convergence in distribution as $$n\to\infty$$, \begin{eqnarray*} n^{-1/2}(U_{n,M_1}^{{ \mathrm{\scriptscriptstyle T} }}, \ldots, U_{n,M_o}^{{ \mathrm{\scriptscriptstyle T} }})^{{ \mathrm{\scriptscriptstyle T} }} \rightarrow N(0, \Sigma _u) \end{eqnarray*} where $$\Sigma_u$$ is a partitioned matrix with $$(i,j)$$th block equal to $${J}_{ij}({\vartheta},{\vartheta})$$. The distribution of the estimators follows. □ When the models are correctly specified, $$J_{ii}({\vartheta},{\vartheta})={J}_{M_i}({\vartheta})={Q}_{M_i}({\vartheta})$$. Lemma A1 is also valid for misspecified models and for models not in $$\mathcal{M}_O$$. In such cases the true parameter is replaced by the pseudo-true parameter corresponding to the considered model. Proof of Proposition 1. 
We show that (1) equals \begin{eqnarray*} \lim_{n \rightarrow \infty } \frac{\mathrm{pr}([n^{1/2}\{\hat{{\theta}^{'}}(p)-\tilde{{\vartheta}}(p)\}\leqslant \tilde{{t}}(p)] \cap [2 \ell_{n,p}^*-2p \geqslant 2\ell_{n,j}^*-2j, \, j \in \{{p_0},{p_0+1},\ldots,{K}\}])}{\mathrm{pr}[2 \ell_{n,p}^*-2p \geqslant 2\ell_{n,j}^*-2j, \, j \in \{{p_0},{p_0+1},\ldots,{K}\}]}\text{.} \end{eqnarray*} By Lemma A1 there is joint convergence of the estimators in the different models. Next, since $$\ell_{n,j}^*$$ is a function of $$\hat{{\theta}^{'}}(\,j)$$, namely \begin{eqnarray*} \ell_{n,j}^*=\frac{n}{2}\{\hat{{\theta}^{'}}(\,j)-{\vartheta}(\,j)\}^{{ \mathrm{\scriptscriptstyle T} }}{J}_j({\vartheta})\{\hat{{\theta}^{'}}(\,j)-{\vartheta}(\,j)\}+o_{\rm p}(1), \end{eqnarray*} and since the probability of the event in the denominator is strictly positive, Slutsky’s theorem and the continuous mapping theorem give joint convergence for both the numerator and the denominator of the above expression to their asymptotic counterparts. To obtain the selection set, let $$S_j =\{{s} \in \mathbb{R}^{a+K}: s_i=0 \text{ for } i=a+j+1,\ldots,a+K \}$$, for $$j=p_0,\ldots,K$$. Woodroofe (1982) showed that $$(\ell^*_{n,p_0}, \ldots, \ell^*_{n,K})$$ converges in distribution to $$(\ell^*_{p_0}, \ldots, \ell^*_{K})$$ as $$n\rightarrow \infty$$, where for $$j=p_0,\ldots,K$$, $$ \:\ell_j^*=\mathop{\sup}_{{s} \in S_j} \{{s}'{Y}-{s}'{J}({\vartheta}){s}/2\} $$ with $${Y}\sim N\{{0}, {J}({\vartheta})\}$$. Then $$\ell_j^*=0{\cdot}5\sum_{i=1}^{a+j} Z_{i}^2$$ ( $$j= p_0, \ldots, K$$), where $$Z_1,\ldots,Z_{a+j}$$ are independent and identically distributed standard normal random variables. Lemma 1 and Assumption A1(i)–(iv) imply that $$n^{1/2}{J}_{j}^{1/2}({\vartheta})\{{\hat{\theta}}'(\,j)-\tilde{{\vartheta}}(\,j)\}$$ converges in distribution to $$\tilde{{Z}}(\,j)$$ as $$n\to \infty$$. Parameters not in the selected model are set to zero, which gives the region $$\mathcal{T}_p$$. Since $$\tilde{{Z}}(p)$$ and $$(Z_{p+1},\ldots,Z_K)$$ are independent, for $${t}\in\mathcal{T}_p$$, \begin{eqnarray*} F_p({t})&=& \mathrm{pr}\{J_{p}^{-1/2}(\vartheta) \tilde{{Z}}(p) \leqslant \tilde{{t}}(p) \mid {Z} \in \mathcal{A}_p(\mathcal{M}_{{\rm nest}})\} \nonumber \\ &=& \mathrm{pr}\left[J_{p}^{-1/2}(\vartheta) \tilde{{Z}}(p) \leqslant \tilde{{t}}(p) \;\bigg|\; \bigcap_{j=p_0,\ldots,p-1}\left\{\sum_{i=j+1}^{p} Z_{a+i}^2 > 2(p-j)\right\}\right]\!\text{.}\\[-42pt] \end{eqnarray*} □ Proof of Corollary 2. From Proposition 1, with $$\hat{p}_0=p$$, $$\:q_{\alpha}$$ is equivalently found via \begin{eqnarray*} {\mathrm{pr}\left[\left(\sum_{i=1}^{a+p} Z_i^2 \leqslant q_{\alpha}\right) \cap \bigcap_{j=p_0,\ldots,p-1}\left\{\sum_{i=j+1}^{p} Z_{a+i}^2 > 2(p-j) \right\} \right]} \Big/ {\mathrm{pr}\{\tilde{{Z}}(p)\in\mathcal{A}_p^{({\rm s})}(\mathcal{M}_{{\rm nest}})\}}=1-\alpha\text{.} \end{eqnarray*} The denominator can be calculated by Lemma S1 in the Supplementary Material. To calculate the numerator, we first find the joint density of $$(W_p,\ldots,W_{p_0+1},W_1)$$, where $$W_j = \sum_{i=a+j}^{a+p}Z_{i}^2$$ and $$W_1 = \sum_{i=1}^{p}Z_{i}^2$$ with $$Z_i^2 \sim \chi^2_{1}$$ for all $$i=1,\ldots,a+p$$. So $$Z_{a+i}^2=W_{i-1}-W_{i}$$ for $$i=p_0+1, \ldots, p-1$$ and $$Z_{a+p}^2=W_p$$ with $$\sum_{i=1}^{a+p_0}Z_i^2=W_1-W_{p_0+1}\sim \chi^2_{a+p_0}$$.
The joint distribution of $$(W_p,\ldots,W_{p_0+1},W_1)$$ is obtained via a transformation of the distribution of $$(Z^2_{a+p}, Z^2_{a+p-1}, \ldots, Z^2_{a+p_0+1}, \sum_{i=1}^{a+p_0}Z_i^2)$$, \begin{eqnarray*} f(w_p,\ldots, w_{p_0+1}, w_1)= \frac{\exp(-{w_1}/{2})w_p^{-1/2}(w_1-w_{p_0+1})^{-(a+p_0)/2-1} \prod^{p-p_0+1}_{i=1}(w_i-w_{i-1})^{-1/2}} {2^{({a+p})/{2}}\{\Gamma(1/2)\}^{p-p_0}\Gamma(\frac{a+p_0}{2})}\text{.} \end{eqnarray*} The region of integration follows from $$\mathcal{A}_p^{({\rm s})}(\mathcal{M}_{{\rm nest}})$$ and the fact that $$W_i \leqslant W_{j}$$ for $$i>j$$. □ Proof of Lemma 1. Denote the smallest true model by $$M_{\rm pars}$$. For all $$M'\not\in\mathcal{M}_O$$, by Assumption A1(i), \begin{align*} &\mathrm{pr}(M_{{\small{\text{AIC}}}} = M') \\ & \quad \leqslant \mathrm{pr}\bigl\{\small{\text{AIC}}^*(M')\ge\mathop{\max}_{M\in\mathcal{M}_O}\small{\text{AIC}}^*(M)\bigr\}\leqslant \mathrm{pr}\bigl\{\small{\text{AIC}}^*(M')\ge\small{\text{AIC}}^*(M_{{\rm pars}})\bigr\} \\ &\quad= \mathrm{pr}\big[\ell_n\{{\hat{\theta}}(M')\} - |M'| \ge \ell_n\{{\hat{\theta}}(M_{{\rm pars}})\} - |M_{{\rm pars}}|\big]\\ &\quad= \mathrm{pr}\big[\ell_n\{{\hat{\theta}}(M')\}-\ell_n\{{\vartheta}(M_{{\rm pars}})\} - |M'| \ge \ell_n\{{\hat{\theta}}(M_{{\rm pars}})\} - \ell_n\{{{\vartheta}}(M_{{\rm pars}})\}-|M_{{\rm pars}}|\big]\\ &\quad \rightarrow 0\text{.}\\[-44pt] \end{align*} □ Proof of Proposition 2. (I) Define $$S_j =\{{s} \in \mathbb{R}^{a+K}: s_i=0,\, i\notin M \}$$ and $$\ell^*_{n,M_i}=\ell_n\{\hat{{\theta}}(M_i)\}-\ell_n({\vartheta})$$ where $$M_i \in \mathcal{M}_O$$. Similar to Proposition 1, we can show that for $$M_i \in \mathcal{M}_O$$, $$\:\ell^*_{n,M_i}\to 0{\cdot}5\sum_{j \in M_i}Z_j^2$$ in distribution. Now, the condition part can be calculated by $$\sum_{j \in M} Z_j^2 - 2 |M| > \sum_{j \in M_i} Z_j^2 - 2 |M_i|, \quad M_i \in {\mathcal{M}}_O\setminus M ,$$ which is equivalent to $${Z} \in \mathcal{A}_M({\mathcal{M}}_O)$$. (II) By Lemma A1 there is joint convergence in distribution of the estimators in the different models. The constraint set can be calculated by pairwise comparisons of the $$\small{\text{AIC}}^*$$ values. To do so, write \begin{eqnarray*} &&\ell_n\{\hat{{\theta}}(M_{\rm i})\}=\ell_n({{\vartheta}})+\frac{n}{2}\{\hat{{\theta}} (M_{i})-{\vartheta}\}^{{ \mathrm{\scriptscriptstyle T} }}{Q}_{M_{i}}({\vartheta})\{\hat{{\theta}}(M_{i}) -{\vartheta}\}+o_{\rm p}(1) \end{eqnarray*} from which it follows that $$ \ell_{n,i}^*=({n}/{2})\{\hat{{\theta}}(M_{i})-{\vartheta}\}^{{ \mathrm{\scriptscriptstyle T} }} {Q}_{M_{i}}({\vartheta})\{\hat{{\theta}}(M_{i})-{\vartheta}\}+o_{\rm p}(1)\text{.} $$ Then, since $$\small{\text{AIC}}^*(M_{\small{\text{AIC}}}) \geqslant \small{\text{AIC}}^*(M_{i})$$ is equivalent to $$2 (\ell_{n,{\small{\text{AIC}}}}^{*} -\ell_{n,{ i}}^*) \geqslant 2(|M_{\small{\text{AIC}}}|-|M_{\rm i}|) $$, it follows that \begin{eqnarray} n(\hat{{\theta}}_{{\mathcal{M}}_O}-{\vartheta}_{{\mathcal{M}}_O})^{{ \mathrm{\scriptscriptstyle T} }} {W}_{\rm AIC,i}(\hat{{\theta}}_{{\mathcal{M}}_O}-{\vartheta}_{{\mathcal{M}}_O})+o_{\rm p}(1) - 2(|M_{\small{\text{AIC}}}|-|M_i|) \geqslant 0\text{.} \end{eqnarray} (A1) By using Lemma A1 and the continuous mapping theorem, the asymptotic counterpart of (A1) can be written as $$ {{{Z}}}^{{ \mathrm{\scriptscriptstyle T} }} \Sigma^{1/2} {W}_{\rm AIC,i} \Sigma^{1/2}{Z} \geqslant 2 (|M_{\small{\text{AIC}}}|-|M_i|)$$$$(M_i \in {\mathcal{M}}_O) $$, which results in the stated selection region and limiting distribution. 
□

Proof of Proposition 3. By Theorems 1 and 2 of Sweeting (1980), $$ n^{1/2}\{{\hat{\theta}}'(M)-\tilde{{\vartheta}}(M)\}^{{ \mathrm{\scriptscriptstyle T} }}{J}_{M}^{1/2} ({\vartheta})\to \tilde{{Z}}(M) $$ uniformly in distribution over the compact set $$\Theta$$. This yields $$\lim_{n\rightarrow \infty} \inf_{\vartheta \in \Theta} \mathrm{pr}_{\vartheta}\{\vartheta \in C_{\alpha}(\vartheta)\} = 1- \alpha$$. When $$\mathcal{M}_O$$ is not known, we use $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}})$$ in (6) instead of $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O)$$, which defines the value $$\tilde q_\alpha$$. Since $$\mathcal{A}_M(\mathcal{M}_{{\rm arb}}) \subset \mathcal{A}_M(\mathcal{M}_{{\rm arb}}\cap \mathcal{M}_O)$$, we have $$\tilde q_\alpha\ge q_\alpha$$, which leads to a conservative confidence region.

Proof of Lemma 2. For every $$j=1,\ldots,J$$ and every component $$k$$ of the vector $$\hat\theta_{n,\mathcal{M}}(M_j)$$, we have $$ n^{1/2}([\hat\theta_{n,\mathcal{M}}(M_j)]_k - [\vartheta_{n,\mathcal{M}}^*(M_j)]_k) = \sum_{i=1}^n Q^{-1}_{M_j}\{\vartheta_{n}^*(M_j)\} n^{-1/2} \frac{\partial}{\partial\theta_k}\log f_{j,i}\{Y_i,\vartheta_{n}^*(M_j)\} +o_{\rm p}(1)\text{.} $$ By assumption (i) in the lemma, which is a Lindeberg assumption for all $$G_n\in\mathcal{G}_n$$, we obtain a uniform limiting normality result for each of the components of $$n^{1/2}(\hat\theta_{n,\mathcal{M}} - \vartheta_{n,\mathcal{M}}^*)$$. Under assumption (ii) in the lemma, the data are in a so-called null triangular array format, to which Corollary 2 of Pollak (1972) applies, resulting in joint asymptotic normality for the vector combining all such components. □

Proof of Proposition 4. Define the events $$B= [n^{1/2}\{\hat{\theta}(M_{\small{\text{AIC}}})-\vartheta^*(M_{\small{\text{AIC}}})\} \leqslant t]$$ and $$ C=\bigcap_{M \in \mathcal{M}} \bigl\{n(\hat{\theta}_{{\mathcal{M}}}-\vartheta^*_{{\mathcal{M}}})^{{ \mathrm{\scriptscriptstyle T} }}(W_{M_{\small{\text{AIC}}},M_{{\rm s}}}-W_{M_{\rm i},M_{{\rm s}}}) (\hat{\theta}_{{\mathcal{M}}}-\vartheta^*_{{\mathcal{M}}})\geqslant 2(|M_{\small{\text{AIC}}}|-|M|) \bigr\} +o_{\rm p}(1)\text{.} $$ Using the results of Lemma 2 and the continuous mapping theorem, the difference between \begin{eqnarray*} \mathrm{pr}\big[n^{1/2}\{\hat{\theta}(M_{\small{\text{AIC}}})-\vartheta^*(M_{\small{\text{AIC}}})\} \leqslant t \mid \hat{M}=M_{\small{\text{AIC}}}\big] = \mathrm{pr}(B\cap C)/\mathrm{pr}(C) \end{eqnarray*} and $$ \mathrm{pr}[\{ \Sigma_{M_{\small{\text{AIC}}}}(\vartheta^*_{M_{\small{\text{AIC}}}})^{1/2} \tilde{{Z}}({M_{{\small{\text{AIC}}}}}) \leqslant t \} \, \cap \{Z \in \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})\}] /\mathrm{pr}\{Z \in \mathcal{A}_{M_{\small{\text{AIC}}}}(\mathcal{M})\} $$ converges uniformly to 0. □

References

Akaike H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. Info. Theory, Tsahkadsor, Armenia, USSR, September 2–8, 1971, Petrov B. N. & Csáki F., eds. Budapest: Akadémiai Kiadó, pp. 267–81.
Andrews D. W. K. & Guggenberger P. (2009). Hybrid and size-corrected subsampling methods. Econometrica 77, 721–62.
Bachoc F., Leeb H. & Pötscher B. (2015). Valid confidence intervals for post-model-selection predictors. arXiv: 1412.4605.
Belloni A., Chernozhukov V. & Kato K. (2015). Uniform post selection inference for least absolute deviation regression models and other Z-estimation problems. Biometrika 102, 77–94.
Berk R., Brown L., Buja A., Zhang K. & Zhao L. (2013). Valid post-selection inference. Ann. Statist. 41, 802–37.
Chernozhukov V., Hansen C. & Spindler M. (2015). Valid post-selection and post-regularization inference: An elementary, general approach. Ann. Rev. Econ. 7, 649–88.
Claeskens G. & Hjort N. L. (2004). Goodness of fit via nonparametric likelihood ratios. Scand. J. Statist. 31, 487–513.
Claeskens G. & Hjort N. L. (2008). Model Selection and Model Averaging. Cambridge: Cambridge University Press.
Cox D. R. & Hinkley D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
Danilov D. & Magnus J. R. (2004). On the harm that ignoring pretesting can cause. J. Economet. 122, 27–46.
Efron B. (2014). Estimation and accuracy after model selection. J. Am. Statist. Assoc. 109, 991–1007.
Ferrari D. & Yang Y. (2014). Confidence sets for model selection by F-testing. Statist. Sinica 25, 1637–58.
Hjort N. L. & Claeskens G. (2003). Frequentist model average estimators. J. Am. Statist. Assoc. 98, 879–99.
Jansen M. (2014). Information criteria for variable selection under sparsity. Biometrika 101, 37–55.
Kabaila P. (1995). The effect of model selection on confidence regions and prediction regions. Economet. Theory 11, 537–49.
Kabaila P. (1998). Valid confidence intervals in regression after variable selection. Economet. Theory 14, 463–82.
Kabaila P. & Leeb H. (2006). On the large-sample minimal coverage probability of confidence intervals after model selection. J. Am. Statist. Assoc. 101, 619–29.
Kabaila P., Welsh A. H. & Abeysekera W. (2016). Model-averaged confidence intervals. Scand. J. Statist. 43, 35–48.
Konishi S. & Kitagawa G. (1996). Generalized information criteria in model selection. Biometrika 83, 875–90.
Lee J. D., Sun D. L., Sun Y. & Taylor J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist. 44, 907–27.
Leeb H. & Pötscher B. (2005). Model selection and inference: Facts and fiction. Economet. Theory 21, 22–59.
Leeb H., Pötscher B. & Ewald K. (2015). On various confidence intervals post-model-selection. Statist. Sci. 30, 216–27.
Leeb H. & Pötscher B. M. (2003). The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations. Economet. Theory 19, 100–42.
Leeb H. & Pötscher B. M. (2006). Can one estimate the conditional distribution of post-model-selection estimators? Ann. Statist. 34, 2554–91.
Lichman M. (2013). UCI Machine Learning Repository. Center for Machine Learning and Intelligent Systems, School of Information and Computer Sciences, University of California, Irvine, https://archive.ics.uci.edu/ml/.
Mallows C. (1973). Some comments on $${C}_p$$. Technometrics 15, 661–75.
Pakman A. & Paninski L. (2014). Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comp. Graph. Statist. 23, 518–42.
Pollak M. (1972). A note on infinitely divisible random vectors. Ann. Math. Statist. 43, 673–5.
Pötscher B. (1991). Effects of model selection on inference. Economet. Theory 7, 163–85.
Pötscher B. (1995). Comment on “The effect of model selection on confidence regions and prediction regions” by P. Kabaila. Economet. Theory 11, 550–9.
Shibata R. (1976). Selection of the order of an autoregressive model by Akaike’s information criterion. Biometrika 63, 117–26.
Smith J. W., Everhart J. E., Dickson W. C., Knowler W. C. & Johannes R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Proc. Ann. Symp. Comp. Appl. Med. Care, 261–5.
Sweeting T. J. (1980). Uniform asymptotic normality of the maximum likelihood estimator. Ann. Statist. 8, 1375–81.
Takeuchi K. (1976). Distribution of information statistics and criteria for adequacy of models. Suri-Kagaku (Mathematical Sciences) 153, 12–8 (in Japanese).
Taylor J., Lockhart R., Tibshirani R. J. & Tibshirani R. (2016). Exact post-selection inference for sequential regression procedures. J. Am. Statist. Assoc. 111, 600–20.
Tibshirani R. J., Rinaldo A., Tibshirani R. & Wasserman L. (2015). Uniform asymptotic inference and the bootstrap after model selection. arXiv: 1506.06266.
Vuong Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 307–33.
White H. (1994). Estimation, Inference and Specification Analysis. New York: Cambridge University Press.
Woodroofe M. (1982). On model selection and the arc-sine laws. Ann. Statist. 10, 1182–94.

© 2018 Biometrika Trust
