Biometrika, Volume 105 (1) – Mar 1, 2018

4 pages

/lp/ou_press/on-overfitting-and-post-selection-uncertainty-assessments-IJ0UPNvFqv

- Publisher: Oxford University Press
- Copyright: © 2018 Biometrika Trust
- ISSN: 0006-3444
- eISSN: 1464-3510
- DOI: 10.1093/biomet/asx083

Summary

In a regression context, when the relevant subset of explanatory variables is uncertain, it is common to use a data-driven model selection procedure. Classical linear model theory, applied naively to the selected submodel, may not be valid because it ignores the selected submodel’s dependence on the data. We provide an explanation of this phenomenon, in terms of overfitting, for a class of model selection criteria.

1. Introduction

Consider the classical multiple linear regression model

\begin{equation} \label{eq:reg} y = X \beta + \sigma \varepsilon, \tag{1} \end{equation}

where $$y$$ is an $$n$$-vector of response variables, $$X$$ is an $$n \times p$$ matrix of explanatory variables, $$\beta$$ is a $$p$$-vector of slope coefficients, and $$\varepsilon$$ is an $$n$$-vector of independent Gaussian noise. We assume that $$p < n$$ and that $$y$$ and the columns of $$X$$ are centred so that the intercept term can be ignored. Formally, the model corresponds to the family of distributions (1) indexed by $$\theta=(\beta,\sigma)$$ in $$\Theta = \mathbb{R}^p \times (0,\infty)$$.

In practice, there is often uncertainty about the set of explanatory variables to be included. In such cases, it is common to express the parameter $$\theta$$ as $$(S, \beta_S, \sigma_S)$$, where $$S \subseteq \{1,\ldots,p\}$$ represents a subset of the explanatory variables, $$\beta_S \in \mathbb{R}^{|S|}$$ represents the coefficients corresponding to the specific set $$S$$, and $$\sigma_S > 0$$. This amounts to decomposing the full parameter space $$\Theta$$ as $$\Theta = \bigcup_S \Theta(S)$$, where $$\Theta(S) = \mathbb{R}^{|S|} \times (0,\infty)$$. Then the model selection problem boils down to choosing a satisfactory submodel $$\Theta(S)$$ or, equivalently, a subset $$S$$. Standard tools for carrying out this selection step include the Akaike information criterion, AIC (Akaike, 1973), and the Bayesian information criterion, BIC (Schwarz, 1978).
These are designed to produce models that suitably balance parsimony and fit.

After a subset $$S \subseteq \{1,\ldots,p\}$$ of explanatory variables is selected, a secondary goal is to make inference on the $$S$$-specific model parameters $$(\beta_S, \sigma_S)$$, or functions thereof, or to predict future values of the response. A naive approach, recommended in textbooks and commonly used by practitioners, is to replace $$X$$ in (1) with $$X_S$$, the matrix containing only the columns corresponding to $$S$$, and apply classical normal linear model theory. For example, for a given $$x \in \mathbb{R}^p$$, the classical $$100(1-\alpha)$$% confidence interval

\begin{equation} \label{eq:ci} C_\alpha(x; S) = x_S^{{ \mathrm{\scriptscriptstyle T} }} \hat\beta_S \pm t_{n-|S|-1}(\alpha/2) \hat \sigma_S \{x_S^{{ \mathrm{\scriptscriptstyle T} }} (X_S^{{ \mathrm{\scriptscriptstyle T} }} X_S)^{-1} x_S\}^{1/2} \tag{2} \end{equation}

can be used for inference on the mean response at the given $$x$$. However, as is now well known (Berk et al., 2013), the properties that these classical procedures enjoy for a fixed/true $$S$$ may not hold for a data-dependent choice $$\hat S$$. For example, $$C_\alpha(x; \hat S)$$ may not have coverage probability equal to $$1-\alpha$$.

In this note we provide an explanation of this lack-of-validity phenomenon by showing that when the submodel is selected according to information criteria such as AIC and BIC, if the selected submodel overfits, i.e., includes all the explanatory variables in the true model together with some superfluous ones, then the corresponding estimate of the error variance will be smaller than that for the true model. This explains the empirical findings in Hong et al. (2017), where prediction intervals based on the submodel minimizing AIC tend to be too short compared with those based on the true model and, consequently, tend to undercover; see § 3.
Moreover, our Theorem 1, together with the dilation phenomenon described in Efron (2003), explains why the bootstrap may not correct the selection effect for methods that tend to overfit.

2. Result

For a given submodel $$\Theta(S)$$ corresponding to a subset $$S \subseteq \{1,\ldots,p\}$$, let $$(\hat\beta_S, \hat\sigma_S)$$ denote the least squares estimators of the $$\Theta(S)$$-specific parameters $$(\beta_S, \sigma_S)$$. We consider a selection procedure that chooses the subset $$S$$ by minimizing the function

\begin{equation} \label{eq:gamma} \gamma_n(S) = n \log {\small{\text{SSE}}}(S) + c_n |S|, \quad S \subseteq \{1,\ldots,p\}, \tag{3} \end{equation}

where $${\small{\text{SSE}}}(S) = \|y - X_S \hat\beta_S\|^2$$ is the error sum of squares for submodel $$\Theta(S)$$, which is proportional to the corresponding least squares estimator $$\hat\sigma_S^2$$, $$c_n=o(n)$$ is a user-specified sequence of constants, and $$|S|$$ denotes the cardinality of the set $$S$$. The AIC and BIC set $$c_n \equiv 2$$ and $$c_n = \log n$$, respectively.

Suppose that there exists a subset $$S^\star$$ corresponding to the truly nonzero regression coefficients, i.e., $$\beta_i \neq 0$$ for $$i \in S^\star$$ and $$\beta_i = 0$$ for $$i \notin S^\star$$. We write $$(\hat\beta_{S^\star}, \hat\sigma_{S^\star})$$ for the oracle estimators, i.e., those based on knowledge of the true submodel $$\Theta(S^\star)$$. Of course, if $$\hat S$$ is the subset chosen by minimizing $$\gamma_n$$ in (3), then $$\gamma_n(\hat S) \leqslant \gamma_n(S^\star)$$ or, equivalently,

\begin{equation} \label{eq:selection} n \log {\small{\text{SSE}}}(\hat S) + c_n|\hat S| \leqslant n \log {\small{\text{SSE}}}(S^\star) + c_n|S^\star| ; \tag{4} \end{equation}

if $$\hat S \neq S^\star$$, then the inequality in (4) would be strict.

For the purpose of inference or prediction, it is common to naively use the classical normal linear model theory, based on the selected subset $$\hat S$$, to derive uncertainty assessments.
However, using the data to select $$\hat S$$ introduces bias, violating the assumptions of that classical theory and thereby invalidating the conclusions. The next result provides an explanation for this general phenomenon in cases where the selected submodel $$\Theta(\hat S)$$ overfits in the sense that $$\hat S \supset S^\star$$. In such cases, we find that $$\hat\sigma_{\hat S}$$ is smaller than the oracle estimator $$\hat\sigma_{S^\star}$$. Since the error variance estimate is involved in all uncertainty assessment calculations, and since it is common for selection methods to overfit, especially those based on AIC (Hurvich & Tsai, 1989), this systematic underestimation explains the general lack of validity of the classical inferential tools applied naively in a post-selection context.

Theorem 1. Suppose $$\hat S \supset S^\star$$. If

\begin{equation} \label{eq:condition} 1 - \exp(-a_n D_n) > D_n, \tag{5} \end{equation}

where $$a_n = (c_n / n)(n - |S^\star|-1)$$ and $$D_n = (|\hat S| - |S^\star|)/(n-|S^\star|-1)$$, then $$\hat\sigma_{\hat S} < \hat\sigma_{S^\star}$$.

To gain some intuition about the condition (5), first note that $$a_n D_n$$ will tend to be small. In particular, a very conservative bound is $$a_n D_n \leqslant c_n p / n$$, which is small for moderate $$c_n$$ and $$n \gg p$$. Next, since $$x \mapsto 1-\exp(-ax)$$ is concave for $$x > 0$$ and $$a > 0$$, we have $$1-\exp(-a_n D_n) > D_n$$ for all $$D_n$$ in an interval $$(0, d)$$, where $$d=d(a_n) \in [0, 1)$$. So, to meet condition (5) we need $$a_n > 1$$ and, again, we have a conservative bound $$a_n \geqslant c_n (n-p-1)/n$$, which itself is greater than 1 for $$n \gg p$$ and $$c_n$$ not too small. In particular, if $$n \gg p$$ and $$c_n \equiv 2$$ as in the AIC, then (5) holds.

Proof of Theorem 1. Start by writing $${\small{\text{SSE}}}(\hat S)$$ in terms of $${\small{\text{SSE}}}(S^\star)$$.
Let $$X_{\hat S}$$ and $$X_{S^\star}$$ denote the submatrices corresponding to the indicated subsets, and write $$P_{\hat S}$$ and $$P_{S^\star}$$ for the respective projections onto their column spaces. Then Pythagoras’ theorem implies that

\begin{equation*} {\small{\text{SSE}}}(\hat S) = {\small{\text{SSE}}}(S^\star) + y^{{ \mathrm{\scriptscriptstyle T} }} (P_{S^\star} - P_{\hat S}) y = (1-r_n) {\small{\text{SSE}}}(S^\star), \end{equation*}

where

\[ r_n = r_n(S^\star, \hat S) = \frac{|\hat S|-|S^\star|}{n - |\hat S|} F_n(S^\star, \hat S), \]

with $$F_n(S^\star, \hat S)$$ being the usual F-statistic for testing the larger $$\Theta(\hat S)$$ against the smaller $$\Theta(S^\star)$$. Consequently, we choose $$\hat S$$ over the strictly smaller $$S^\star$$, according to (4), if and only if $$r_n > 1 - \exp(-a_n D_n)$$. Then the above connection between $${\small{\text{SSE}}}(\hat S)$$ and $${\small{\text{SSE}}}(S^\star)$$ immediately gives a comparison between the corresponding variance estimates:

\[ \hat\sigma_{\hat S}^2 = \frac{{\small{\text{SSE}}}(\hat S)}{n-|\hat S|-1} = \frac{(1 - r_n) {\small{\text{SSE}}}(S^\star)}{n-|\hat S|-1} = \frac{n-|S^\star|-1}{n-|\hat S|-1}\, (1-r_n) \hat\sigma_{S^\star}^2\text{.} \]

As above, we find that $$\hat \sigma_{\hat S} < \hat\sigma_{S^\star}$$ if and only if $$r_n > D_n$$. By condition (5), it follows that the lower bound on $$r_n$$ derived from overfitting is greater than that derived from the underestimation. Therefore, overfitting implies underestimation, proving the claim. □

3. Illustration

Consider the model (1) with $$n=50$$, $$p=10$$ and variance $$\sigma^2=1$$. Set $$S^\star=\{1,2,3\}$$, with corresponding coefficients $$\beta_1^\star=1$$, $$\beta_2^\star=2$$ and $$\beta_3^\star=3$$. The rows of the $$X$$ matrix are independent and $$p$$-variate normal, with mean zero, AR(1) dependence structure and one-step correlation $$\rho=0{\cdot}5$$.
We simulated 1000 datasets and, for each, evaluated $$\hat\sigma_{\hat S}$$ and $$\hat\sigma_{S^\star}$$, where $$\hat S$$ is chosen based on the AIC. The scatterplot shown in Fig. 1(a) demonstrates the systematic underestimation based on the AIC-selected submodel, as predicted by Theorem 1. In all 1000 cases, we have $$\hat S \supseteq S^\star$$, and points on the diagonal line correspond to $$\hat S = S^\star$$. To further illustrate the difference between the estimates, Fig. 1(b) plots a histogram of the ratio $$\hat\sigma_{S^\star}/\hat\sigma_{\hat S}$$, only for the cases of strict overfitting. In particular, the mean from this histogram is $$1{\cdot}06$$.

Fig. 1. Plots from the simulations described in § 3: (a) scatterplot of $$\hat\sigma_{\hat S}$$ versus $$\hat\sigma_{S^\star}$$; (b) histogram of the ratio $$\hat\sigma_{S^\star}/\hat\sigma_{\hat S} \,(\,>1)$$, for overfit cases only.

While the relative difference between the two estimates does not seem remarkable, even so small a difference can impact the quality of inference. For example, consider using the confidence interval (2) for inference on the mean response at a particular setting $$x$$ of the explanatory variables; here, $$x$$ is an independent sample from the distribution that generated the rows of $$X$$. The oracle 95% confidence interval $$C_{0{\cdot}05}(x; S^\star)$$ has coverage exactly equal to $$0{\cdot}95$$, but in the 1000 simulations above, the coverage probability of $$C_{0{\cdot}05}(x; \hat S)$$ is roughly $$0{\cdot}86$$.
It happens that the $$\hat S$$-based intervals tend to be shorter than the oracle, suggesting that valid post-selection inference on the mean response requires $$\hat\sigma_{\hat S}$$ to be strictly larger than $$\hat\sigma_{S^\star}$$, which is impossible given Theorem 1 and the AIC’s tendency to overfit.

Acknowledgement

The authors are grateful to the editor and two referees whose comments have greatly enhanced the clarity of our presentation. Kuffner was supported by the U.S. National Science Foundation.

References

Akaike H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. Info. Theory, Petrov B. and Csáki F. eds. Budapest: Akadémiai Kiadó.

Berk R., Brown L., Buja A., Zhang K. & Zhao L. (2013). Valid post-selection inference. Ann. Statist. 41, 802–37.

Efron B. (2003). Second thoughts on the bootstrap. Statist. Sci. 18, 135–40.

Hong L., Kuffner T. A. & Martin R. G. (2017). On prediction of future insurance claims when the model is uncertain. SSRN: 2883574.

Hurvich C. M. & Tsai C.-L. (1989). Regression and time series model selection in small samples. Biometrika 76, 297–307.

Schwarz G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–4.
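
Supplementary sketch. The simulation of § 3 and condition (5) of Theorem 1 can be explored with a short Python sketch along the following lines. This is not the authors' code: the function names (`condition_5`, `select_subset`, `simulate`), the random seed, the default number of replications and the use of an exhaustive search over all $$2^p$$ subsets to minimize (3) are assumptions made for illustration, since the note does not say how the AIC minimizer was computed. The sketch only compares $$\hat\sigma_{S^\star}$$ with $$\hat\sigma_{\hat S}$$ in strictly overfit cases and does not attempt to reproduce the reported coverage figures.

```python
import itertools
import math

import numpy as np


def condition_5(n, size_true, size_hat, c_n):
    """Check condition (5): 1 - exp(-a_n D_n) > D_n."""
    dof_true = n - size_true - 1
    a_n = (c_n / n) * dof_true
    d_n = (size_hat - size_true) / dof_true
    return 1.0 - math.exp(-a_n * d_n) > d_n


def sse(gram, xty, yty, subset):
    """Error sum of squares for the least squares fit on the columns in `subset`."""
    if not subset:
        return yty
    idx = list(subset)
    coef = np.linalg.solve(gram[np.ix_(idx, idx)], xty[idx])
    return yty - xty[idx] @ coef


def select_subset(X, y, c_n):
    """Minimize gamma_n(S) = n log SSE(S) + c_n |S| by exhaustive search (assumption)."""
    n, p = X.shape
    gram, xty, yty = X.T @ X, X.T @ y, y @ y
    best, best_val = (), math.inf
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            val = n * math.log(sse(gram, xty, yty, subset)) + c_n * k
            if val < best_val:
                best, best_val = subset, val
    return best


def sigma_hat(X, y, subset):
    """Variance estimate with n - |S| - 1 degrees of freedom, as in the proof."""
    n = X.shape[0]
    s = sse(X.T @ X, X.T @ y, y @ y, subset)
    return math.sqrt(s / (n - len(subset) - 1))


def simulate(reps=200, n=50, p=10, rho=0.5, c_n=2.0, seed=0):
    """Return sigma_hat(S*) / sigma_hat(S_hat) for the strictly overfit replications."""
    rng = np.random.default_rng(seed)
    true_set = (0, 1, 2)  # S* = {1, 2, 3} in the paper's 1-based indexing
    beta = np.zeros(p)
    beta[:3] = [1.0, 2.0, 3.0]
    lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    cov = rho ** lags  # AR(1) correlation matrix
    ratios = []
    for _ in range(reps):
        X = rng.multivariate_normal(np.zeros(p), cov, size=n)
        y = X @ beta + rng.standard_normal(n)  # sigma = 1
        X = X - X.mean(axis=0)  # centre columns so the intercept can be ignored
        y = y - y.mean()
        chosen = select_subset(X, y, c_n)
        if set(chosen) > set(true_set):  # strict superset: overfit
            ratios.append(sigma_hat(X, y, true_set) / sigma_hat(X, y, chosen))
    return ratios


if __name__ == "__main__":
    ratios = simulate(reps=100)
    print("overfit cases:", len(ratios))
    if ratios:
        print("smallest ratio:", min(ratios), "mean ratio:", float(np.mean(ratios)))
```

With $$n=50$$, $$|S^\star|=3$$ and $$c_n \equiv 2$$, `condition_5` returns true for every possible amount of overfitting, $$|\hat S| - |S^\star| = 1,\ldots,7$$, so by Theorem 1 every ratio returned by `simulate` should exceed 1. The mean ratio will vary with the seed and the number of replications and need not match the value $$1{\cdot}06$$ reported in § 3.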

Biometrika – Oxford University Press

**Published:** Mar 1, 2018
