
On overfitting and post-selection uncertainty assessments
Hong, L.; Kuffner, T. A.; Martin, R.
2018-03-01
Summary

In a regression context, when the relevant subset of explanatory variables is uncertain, it is common to use a data-driven model selection procedure. Classical linear model theory, applied naively to the selected submodel, may not be valid because it ignores the selected submodel’s dependence on the data. We provide an explanation of this phenomenon, in terms of overfitting, for a class of model selection criteria.

1. Introduction

Consider the classical multiple linear regression model \begin{equation} \label{eq:reg} y = X \beta + \sigma \varepsilon, \end{equation} (1) where $$y$$ is an $$n$$-vector of response variables, $$X$$ is an $$n \times p$$ matrix of explanatory variables, $$\beta$$ is a $$p$$-vector of slope coefficients, and $$\varepsilon$$ is an $$n$$-vector of independent Gaussian noise. We assume that $$p < n$$ and that $$y$$ and the columns of $$X$$ are centred so that the intercept term can be ignored. Formally, the model corresponds to the family of distributions (1) indexed by $$\theta=(\beta,\sigma)$$ in $$\Theta = \mathbb{R}^p \times (0,\infty)$$.

In practice, there is often uncertainty about the set of explanatory variables to be included. In such cases, it is common to express the parameter $$\theta$$ as $$(S, \beta_S, \sigma_S)$$, where $$S \subseteq \{1,\ldots,p\}$$ represents a subset of the explanatory variables, $$\beta_S \in \mathbb{R}^{|S|}$$ represents the coefficients corresponding to the specific set $$S$$, and $$\sigma_S > 0$$. This amounts to decomposing the full parameter space $$\Theta$$ as $$\Theta = \bigcup_S \Theta(S)$$, where $$\Theta(S) = \mathbb{R}^{|S|} \times (0,\infty)$$. Then the model selection problem boils down to choosing a satisfactory submodel $$\Theta(S)$$ or, equivalently, a subset $$S$$. Standard tools for carrying out this selection step include the Akaike information criterion, aic (Akaike, 1973), and the Bayesian information criterion, bic (Schwarz, 1978).
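As a concrete reference point, data from model (1) might be generated as follows (a minimal sketch; the dimensions and coefficients are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 5, 1.5
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.0, 0.0, 3.0])

# model (1): y = X beta + sigma * eps, with independent Gaussian noise
y = X @ beta + sigma * rng.standard_normal(n)

# centre y and the columns of X so the intercept term can be ignored
y = y - y.mean()
X = X - X.mean(axis=0)
```

After centring, every submodel can be fitted without an intercept, matching the setup above.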
These are designed to produce models that suitably balance parsimony and fit. After a subset $$S \subseteq \{1,\ldots,p\}$$ of explanatory variables is selected, a secondary goal is to make inference on $$S$$-specific model parameters $$(\beta_S, \sigma_S)$$, or functions thereof, or to predict future values of the response. A naive approach, recommended in textbooks and commonly used by practitioners, is to replace $$X$$ in (1) with $$X_S$$, the matrix containing only the columns corresponding to $$S$$, and apply classical normal linear model theory. For example, for a given $$x \in \mathbb{R}^p$$, the classical $$100(1-\alpha)$$% confidence interval \begin{equation} \label{eq:ci} C_\alpha(x; S) = x_S^{{ \mathrm{\scriptscriptstyle T} }} \hat\beta_S \pm t_{n-|S|-1}(\alpha/2) \hat \sigma_S \{x_S^{{ \mathrm{\scriptscriptstyle T} }} (X_S^{{ \mathrm{\scriptscriptstyle T} }} X_S)^{-1} x_S\}^{1/2} \end{equation} (2) can be used for inference on the mean response at the given $$x$$.

However, as is now well known (Berk et al., 2013), the properties that these classical procedures enjoy for a fixed/true $$S$$ may not hold for a data-dependent choice $$\hat S$$. For example, $$C_\alpha(x; \hat S)$$ may not have coverage probability equal to $$1-\alpha$$. In this note we provide an explanation of this lack-of-validity phenomenon by showing that when the submodel is selected according to information criteria such as aic and bic, if the selected submodel overfits, i.e., contains a superset of the explanatory variables in the true model, then the corresponding estimate of the error variance will be smaller than that for the true model. This explains the empirical findings in Hong et al. (2017), where prediction intervals based on the submodel minimizing aic tend to be too short compared with those based on the true model and, consequently, tend to undercover; see § 3.
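For concreteness, interval (2) translates directly into code. The sketch below (the helper name `naive_ci` is hypothetical; the subset `S` is simply taken as given, which is exactly the naive step when `S` is data-driven) applies the classical formula to the columns in $$S$$:

```python
import numpy as np
from scipy import stats

def naive_ci(y, X, S, x, alpha=0.05):
    """Classical interval (2) for the mean response x_S' beta_S,
    fitted on a given subset S of columns (naive if S is data-driven)."""
    XS = X[:, S]
    n, k = XS.shape
    beta_hat, *_ = np.linalg.lstsq(XS, y, rcond=None)
    resid = y - XS @ beta_hat
    df = n - k - 1                       # degrees of freedom n - |S| - 1, as in the text
    sigma_hat = np.sqrt(resid @ resid / df)
    xS = np.asarray(x)[S]
    half = (stats.t.ppf(1 - alpha / 2, df) * sigma_hat
            * np.sqrt(xS @ np.linalg.solve(XS.T @ XS, xS)))
    centre = xS @ beta_hat
    return centre - half, centre + half
```

For a fixed, correctly specified $$S$$ this interval has exact $$1-\alpha$$ coverage; the point of the note is that the same code applied to a selected $$\hat S$$ does not.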
Moreover, our Theorem 1, together with the dilation phenomenon described in Efron (2003), explains why the bootstrap may not correct the selection effect for methods that tend to overfit.

2. Result

For a given submodel $$\Theta(S)$$ corresponding to a subset $$S \subseteq \{1,\ldots,p\}$$, let $$(\hat\beta_S, \hat\sigma_S)$$ denote the least squares estimators of the $$\Theta(S)$$-specific parameters $$(\beta_S, \sigma_S)$$. We consider a selection procedure that chooses the subset $$S$$ by minimizing the function \begin{equation} \label{eq:gamma} \gamma_n(S) = n \log {\small{\text{SSE}}}(S) + c_n |S|, \quad S \subseteq \{1,\ldots,p\}, \end{equation} (3) where $${\small{\text{SSE}}}(S) = \|y - X_S \hat\beta_S\|^2$$ is the error sum of squares for submodel $$\Theta(S)$$, which is proportional to the corresponding least squares estimator $$\hat\sigma_S^2$$, $$c_n=o(n)$$ is a user-specified sequence of constants, and $$|S|$$ denotes the cardinality of the set $$S$$. The aic and bic set $$c_n \equiv 2$$ and $$c_n = \log n$$, respectively. Suppose that there exists a subset $$S^\star$$ corresponding to the truly nonzero regression coefficients, i.e., $$\beta_i \neq 0$$ for $$i \in S^\star$$ and $$\beta_i = 0$$ for $$i \notin S^\star$$. We write $$(\hat\beta_{S^\star}, \hat\sigma_{S^\star})$$ for the oracle estimators, i.e., those based on knowledge of the true submodel $$\Theta(S^\star)$$. Of course, if $$\hat S$$ is the subset chosen by minimizing $$\gamma_n$$ in (3), then $$\gamma_n(\hat S) \leqslant \gamma_n(S^\star)$$ or, equivalently, \begin{equation} \label{eq:selection} n \log {\small{\text{SSE}}}(\hat S) + c_n|\hat S| \leqslant n \log {\small{\text{SSE}}}(S^\star) + c_n|S^\star| ; \end{equation} (4) if $$\hat S \neq S^\star$$, then the inequality in (4) is strict. For the purpose of inference or prediction, it is common to naively use the classical normal linear model theory, based on the selected subset $$\hat S$$, to derive uncertainty assessments.
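The criterion (3) is also easy to code. The following sketch (hypothetical names `gamma_n` and `select`; an exhaustive search over subsets, so only sensible for small $$p$$) recovers the aic with $$c_n \equiv 2$$ and the bic with $$c_n = \log n$$:

```python
import itertools
import numpy as np

def gamma_n(y, X, S, c_n):
    """Criterion (3): n * log SSE(S) + c_n * |S|."""
    XS = X[:, list(S)]
    beta_hat, *_ = np.linalg.lstsq(XS, y, rcond=None)
    resid = y - XS @ beta_hat
    return len(y) * np.log(resid @ resid) + c_n * len(S)

def select(y, X, c_n):
    """Choose S-hat by minimizing gamma_n over all nonempty subsets."""
    p = X.shape[1]
    subsets = (S for r in range(1, p + 1)
               for S in itertools.combinations(range(p), r))
    return min(subsets, key=lambda S: gamma_n(y, X, S, c_n))
```

For example, `select(y, X, 2.0)` performs aic selection and `select(y, X, np.log(len(y)))` performs bic selection.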
However, using the data to select $$\hat S$$ introduces bias, violating the assumptions of that classical theory and thereby invalidating the conclusions. The next result provides an explanation for this general phenomenon in cases where the selected submodel $$\Theta(\hat S)$$ overfits in the sense that $$\hat S \supset S^\star$$. In such cases, we find that $$\hat\sigma_{\hat S}$$ is smaller than the oracle estimator $$\hat\sigma_{S^\star}$$. Since the error variance estimate is involved in all uncertainty assessment calculations, and since it is common for selection methods to overfit, especially those based on aic (Hurvich & Tsai, 1989), this systematic underestimation explains the general lack of validity of the classical inferential tools applied naively in a post-selection context.

Theorem 1. Suppose $$\hat S \supset S^\star$$. If \begin{equation} \label{eq:condition} 1 - \exp(-a_n D_n) > D_n, \end{equation} (5) where $$a_n = (c_n / n)(n - |S^\star|-1)$$ and $$D_n = (|\hat S| - |S^\star|)/(n-|S^\star|-1)$$, then $$\hat\sigma_{\hat S} < \hat\sigma_{S^\star}$$.

To gain some intuition about condition (5), first note that $$a_n D_n$$ will tend to be small. In particular, a very conservative bound is $$a_n D_n \leqslant c_n p / n$$, which is small for moderate $$c_n$$ and $$n \gg p$$. Next, since $$x \mapsto 1-\exp(-ax)$$ is concave for $$x > 0$$ and $$a > 0$$, with slope $$a$$ at the origin, we have $$1-\exp(-a_n x) > x$$ for all $$x$$ in an interval $$(0, d)$$, where $$d=d(a_n) \in [0, 1)$$ is positive if and only if $$a_n > 1$$. So, to meet condition (5) we need $$a_n > 1$$ and, again, we have a conservative bound, $$a_n \geqslant c_n (n-p-1)/n$$, which is greater than 1 for $$n \gg p$$ and $$c_n$$ not too small. In particular, if $$n \gg p$$ and $$c_n \equiv 2$$ as in the aic, then (5) holds.

Proof of Theorem 1. Start by writing $${\small{\text{SSE}}}(\hat S)$$ in terms of $${\small{\text{SSE}}}(S^\star)$$.
Let $$X_{\hat S}$$ and $$X_{S^\star}$$ denote the submatrices corresponding to the indicated subsets, and write $$P_{\hat S}$$ and $$P_{S^\star}$$ for the respective projections onto their column spaces. Then Pythagoras’ theorem implies that \begin{equation*} {\small{\text{SSE}}}(\hat S) = {\small{\text{SSE}}}(S^\star) + y^{{ \mathrm{\scriptscriptstyle T} }} (P_{S^\star} - P_{\hat S}) y = (1-r_n) {\small{\text{SSE}}}(S^\star), \end{equation*} where \[ r_n = r_n(S^\star, \hat S) = \frac{(|\hat S|-|S^\star|)\, F_n(S^\star, \hat S)}{n - |\hat S| + (|\hat S|-|S^\star|)\, F_n(S^\star, \hat S)}, \] with $$F_n(S^\star, \hat S)$$ being the usual F-statistic for testing the larger $$\Theta(\hat S)$$ against the smaller $$\Theta(S^\star)$$; in particular, $$r_n$$ is increasing in $$F_n$$. Consequently, by (4), we choose $$\hat S$$ over the strictly smaller $$S^\star$$ if and only if $$r_n > 1 - \exp(-a_n D_n)$$. Then the above connection between $${\small{\text{SSE}}}(\hat S)$$ and $${\small{\text{SSE}}}(S^\star)$$ immediately gives a comparison between the corresponding variance estimates: \[ \hat\sigma_{\hat S}^2 = \frac{{\small{\text{SSE}}}(\hat S)}{n-|\hat S|-1} = \frac{(1 - r_n) {\small{\text{SSE}}}(S^\star)}{n-|\hat S|-1} = \frac{n-|S^\star|-1}{n-|\hat S|-1}\, (1-r_n) \hat\sigma_{S^\star}^2\text{.} \] From this we find that $$\hat \sigma_{\hat S} < \hat\sigma_{S^\star}$$ if and only if $$r_n > D_n$$. By condition (5), the lower bound $$1 - \exp(-a_n D_n)$$ on $$r_n$$ implied by the selection of $$\hat S$$ exceeds $$D_n$$, so $$r_n > D_n$$. Therefore, overfitting implies underestimation, proving the claim. □

3. Illustration

Consider the model (1) with $$n=50$$, $$p=10$$ and variance $$\sigma^2=1$$. Set $$S^\star=\{1,2,3\}$$, with corresponding coefficients $$\beta_1^\star=1$$, $$\beta_2^\star=2$$ and $$\beta_3^\star=3$$. The rows of the $$X$$ matrix are independent and $$p$$-variate normal, with mean zero, ar(1) dependence structure and one-step correlation $$\rho=0{\cdot}5$$.
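This design might be generated as follows (a sketch with hypothetical variable names; note the zero-based indices, so $$S^\star=\{1,2,3\}$$ corresponds to columns 0–2):

```python
import numpy as np

rng = np.random.default_rng(2025)
n, p, rho = 50, 10, 0.5
beta = np.zeros(p)
beta[:3] = [1.0, 2.0, 3.0]           # true subset S* (columns 0-2 here)

# ar(1) dependence: Sigma[i, j] = rho ** |i - j|
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
L = np.linalg.cholesky(Sigma)

# rows of X are independent N(0, Sigma); errors are standard normal
X = rng.standard_normal((n, p)) @ L.T
y = X @ beta + rng.standard_normal(n)
```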
We simulated 1000 datasets and, for each, evaluated $$\hat\sigma_{\hat S}$$ and $$\hat\sigma_{S^\star}$$, where $$\hat S$$ is chosen based on the aic. The scatterplot shown in Fig. 1(a) demonstrates the systematic underestimation based on the aic-selected submodel, as predicted by Theorem 1. In all 1000 cases, we have $$\hat S \supseteq S^\star$$, and points on the diagonal line correspond to $$\hat S = S^\star$$. To further illustrate the difference between the estimates, Fig. 1(b) plots a histogram of the ratio $$\hat\sigma_{S^\star}/\hat\sigma_{\hat S}$$, only for the cases of strict overfitting; the mean of this ratio is 1$$\cdot$$06.

Fig. 1. Plots from the simulations described in § 3: (a) scatterplot of $$\hat\sigma_{\hat S}$$ versus $$\hat\sigma_{S^\star}$$; (b) histogram of the ratio $$\hat\sigma_{S^\star}/\hat\sigma_{\hat S} \,(\,>1)$$, for overfit cases only.

While the relative difference between the two estimates may seem unremarkable, even this small a difference can affect the quality of inference. For example, consider using the confidence interval (2) for inference on the mean response at a particular setting $$x$$ of the explanatory variables; here, $$x$$ is an independent sample from the distribution that generated the rows of $$X$$. The oracle 95% confidence interval $$C_{0{\cdot}05}(x; S^\star)$$ has coverage exactly equal to $$0{\cdot}95$$, but in the 1000 simulations above, the coverage probability of $$C_\alpha(x; \hat S)$$ is roughly $$0{\cdot}86$$.
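A scaled-down version of this coverage experiment can be sketched as follows (smaller design, fewer replications and independent design columns rather than the ar(1) structure above, so the numbers will not match the reported 0·86; all names are hypothetical):

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, alpha, reps = 50, 5, 0.05, 500
beta = np.array([1.0, 2.0, 0.0, 0.0, 0.0])     # true subset S* = {0, 1}
subsets = [list(S) for r in range(1, p + 1)
           for S in itertools.combinations(range(p), r)]

def sse(y, X, S):
    b, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    r = y - X[:, S] @ b
    return r @ r

def covers(y, X, S, x, truth):
    """Does the classical interval (2), fitted on columns S, cover the truth?"""
    XS, xS = X[:, S], x[S]
    b, *_ = np.linalg.lstsq(XS, y, rcond=None)
    df = n - len(S) - 1
    half = (stats.t.ppf(1 - alpha / 2, df)
            * np.sqrt((sse(y, X, S) / df) * (xS @ np.linalg.solve(XS.T @ XS, xS))))
    return abs(xS @ b - truth) <= half

hits_star = hits_hat = 0
for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    x = rng.standard_normal(p)
    truth = x @ beta
    # aic-style selection: minimize n * log SSE(S) + 2 |S|, as in (3)
    S_hat = min(subsets, key=lambda S: n * np.log(sse(y, X, S)) + 2 * len(S))
    hits_star += covers(y, X, [0, 1], x, truth)
    hits_hat += covers(y, X, S_hat, x, truth)

cov_star, cov_hat = hits_star / reps, hits_hat / reps
```

The oracle coverage `cov_star` should sit near the nominal 0·95, while `cov_hat` tends to fall below it whenever selection overfits.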
It happens that the $$\hat S$$-based intervals tend to be shorter than the oracle, suggesting that valid post-selection inference on the mean response requires $$\hat\sigma_{\hat S}$$ to be strictly larger than $$\hat\sigma_{S^\star}$$, which is impossible given Theorem 1 and the aic’s tendency to overfit.

Acknowledgement

The authors are grateful to the editor and two referees whose comments have greatly enhanced the clarity of our presentation. Kuffner was supported by the U.S. National Science Foundation.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. Info. Theory, Ed. B. Petrov and F. Csáki. Budapest: Akadémiai Kiadó.

Berk, R., Brown, L., Buja, A., Zhang, K. & Zhao, L. (2013). Valid post-selection inference. Ann. Statist. 41, 802–37.

Efron, B. (2003). Second thoughts on the bootstrap. Statist. Sci. 18, 135–40.

Hong, L., Kuffner, T. A. & Martin, R. G. (2017). On prediction of future insurance claims when the model is uncertain. SSRN: 2883574.

Hurvich, C. M. & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika 76, 297–307.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–4.

© 2018 Biometrika Trust
Biometrika
Oxford University Press