Shrinking characteristics of precision matrix estimators

Shrinking characteristics of precision matrix estimators Summary We propose a framework to shrink a user-specified characteristic of a precision matrix estimator that is needed to fit a predictive model. Estimators in our framework minimize the Gaussian negative loglikelihood plus an $$L_1$$ penalty on a linear or affine function evaluated at the optimization variable corresponding to the precision matrix. We establish convergence rate bounds for these estimators and propose an alternating direction method of multipliers algorithm for their computation. Our simulation studies show that our estimators can perform better than competitors when they are used to fit predictive models. In particular, we illustrate cases where our precision matrix estimators perform worse at estimating the population precision matrix but better at prediction. 1. Introduction The estimation of precision matrices is required to fit many statistical models. Many papers written in the last decade have proposed shrinkage estimators of the precision matrix when the number of variables $$p$$ is large. Pourahmadi (2013) and Fan et al. (2016) provide comprehensive reviews of large covariance and precision matrix estimation. The main strategy used in many of these papers is to minimize the Gaussian negative loglikelihood plus a penalty on the off-diagonal entries of the optimization variable corresponding to the precision matrix. For example, Yuan & Lin (2007) proposed the $$L_1$$-penalized Gaussian likelihood precision matrix estimator defined by \begin{align}\label{graphical_lasso} \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}^p_+} \left\{ {\rm tr}(S\Omega) - \log {\rm det}(\Omega) + \lambda \sum_{i \neq j}|\Omega_{ij}|\right\}\!, \end{align} (1) where $$S$$ is the sample covariance matrix, $$\lambda > 0$$ is a tuning parameter, $$\mathbb{S}^p_+$$ is the set of $$p \times p$$ symmetric and positive-definite matrices, $$\Omega_{ij}$$ is the $$(i,j)$$th entry of $$\Omega$$, and $${\rm tr}$$ and $${\rm det}$$ denote the trace and determinant, respectively. Other authors have replaced the $$L_1$$ penalty in (1) with the squared Frobenius norm (Witten & Tibshirani, 2009; Rothman & Forzani, 2014) or nonconvex penalties that also encourage zeros in the estimator (Lam & Fan, 2009; Fan et al., 2009). To fit many predictive models, only a characteristic of the population precision matrix need be estimated. For example, in binary linear discriminant analysis, the population precision matrix is needed for prediction only through the product of the precision matrix and the difference between the two conditional distribution mean vectors. Many authors have proposed methods that directly estimate this characteristic (Cai & Liu, 2011; Fan et al., 2012; Mai et al., 2012). We propose to estimate the precision matrix by shrinking the characteristic of the estimator that is needed for prediction. The characteristic we consider is a linear or affine function evaluated at the precision matrix. The goal is to improve prediction performance. Unlike methods that estimate the characteristic directly, our approach provides the practitioner with an estimate of the entire precision matrix, not just the characteristic. In our simulation studies and data example, we show that penalizing the characteristic needed for prediction can improve prediction performance over competing sparse precision estimators like (1), even when the true precision matrix is very sparse. In addition, estimators in our framework can be used in applications other than linear discriminant analysis. 2. Proposed method 2.1. Penalized likelihood estimator We propose to estimate the population precision matrix $$\Omega_*$$ with $$\label{eq:estimator} \hat{\Omega} = \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}_+^{p}}\left\{ {\rm tr}(S\Omega) - \log {\rm det}(\Omega) + \lambda |A \Omega B - C |_1 \right\}\!,$$ (2) where $$A \in \mathbb{R}^{a \times p}$$, $$B \in \mathbb{R}^{p \times b}$$ and $$C \in \mathbb{R}^{a \times b}$$ are user-specified matrices, and $$|M|_1 = \sum_{i,j} |M_{ij}|$$. Our estimator exploits the assumption that $$A \Omega_* B - C$$ is sparse. When $$A$$, $$B$$ and $$C$$ need to be estimated, we replace them in (2) with their estimators. The matrix $$C$$ can serve as a shrinkage target for $$A \hat{\Omega}B$$, allowing practitioners to incorporate prior information. Dalal & Rajaratnam (2017) proposed a class of estimators similar to (2) which replaces $$A\Omega B-C$$ with $$T(\Omega)$$, where $$T$$ is a symmetric linear transform. Fitting the discriminant analysis model requires the estimation of one or more precision matrices. In particular, the linear discriminant analysis model assumes that the data are independent copies of the random pair $$(X,Y)$$, where the support of $$Y$$ is $$\left\{1, \dots, J\right\}$$ and $$\label{Normal_Model_LDA} X \mid Y= j \sim {\it N}_p \left(\mu_{*j}, \Omega_*^{-1}\right)\quad (\,j=1, \dots, J),$$ (3) where $$\mu_{*j} \in \mathbb{R}^p$$ and $$\Omega_*^{-1} \in \mathbb{S}^p_+$$ are unknown. To discriminate between response categories $$l$$ and $$m$$, only the characteristic $$\Omega_*(\mu_{*l} - \mu_{*m})$$ is needed. Methods that estimate this characteristic directly have been proposed (Cai & Liu, 2011; Mai et al., 2012; Fan et al., 2012; Mai et al., 2018). These methods are useful in high dimensions because they perform variable selection. For the $$j$$th variable to be noninformative for discriminating between response categories $$l$$ and $$m$$, it must be that the $$j$$th element of $$\Omega_*(\mu_{*l} - \mu_{*m})$$ is zero. While these methods can perform well in classification and variable selection, they do not actually fit the model in (3). Methods for fitting (3) specifically for linear discriminant analysis either assume $$\Omega_*$$ is diagonal (Bickel & Levina, 2004) or that both $$\mu_{*l} - \mu_{*m}$$ and $$\Omega_*$$ are sparse (Guo, 2010; Xu et al., 2015). A method for fitting (3) and performing variable selection was proposed by Witten & Tibshirani (2009). They suggest a two-step procedure where one first estimates $$\Omega_*$$ and then, with the estimate $$\bar{\Omega}$$ fixed, estimates each $$\mu_{*j}$$ by penalizing the characteristic $$\bar{\Omega} \mu_j$$, where $$\mu_j$$ is the optimization variable corresponding to $$\mu_{*j}$$. 2.2. Example applications To apply our method to the linear discriminant analysis problem, we use (2) with $$A = I_p$$, $$C$$ equal to the matrix of zeros, and $$B$$ equal to the matrix whose columns are $$\bar{x}_j - \bar{x}_{k}$$ for all $$1 \leq j < k \leq J$$, where $$\bar{x}_j$$ is the unpenalized maximum likelihood estimator of $$\mu_{*j}$$. For large values of the tuning parameter, this would lead to an estimator of $$\Omega_*$$ such that $$\hat{\Omega}(\bar{x}_j -\bar{x}_{k})$$ is sparse. Thus our approach simultaneously fits (3) and performs variable selection. Precision and covariance matrix estimators are also needed for portfolio allocation. The optimal allocation based on the Markowitz (1952) minimum-variance portfolio is proportional to $$\Omega_*\mu_*$$, where $$\mu_*$$ is the vector of expected returns for $$p$$ assets and $$\Omega_*$$ is the precision matrix for the returns. In practice, one would estimate $$\Omega_*$$ and $$\mu_*$$ with their usual sample estimators $$\hat\Omega$$ and $$\hat{\mu}$$. However, when $$p$$ is larger than the sample size, the usual sample estimator of $$\Omega_*$$ does not exist, so regularization is necessary. Moreover, Brodie et al. (2009) argue that sparse portfolios, i.e., portfolios with fewer than $$p$$ active positions, are often desirable when $$p$$ is large. While many have proposed using sparse or shrinkage estimators of $$\Omega_*$$ or $$\Omega_*^{-1}$$ inserted in the Markowitz criterion, e.g., Xue et al. (2012), this would not necessarily lead to sparse estimators of $$\Omega_* \mu_*$$. To achieve a sparse portfolio, Chen et al. (2016) proposed a method for estimating the characteristic $$\Omega_*\mu_*$$ directly, but like the direct linear discriminant methods, it does not lead to an estimate of $$\Omega_*$$. For the sparse portfolio allocation application, we propose to estimate $$\Omega_*$$ using (2) with $$A=I_p$$, $$C$$ equal to the vector of zeros, and $$B = \hat{\mu}$$. Another application is in linear regression where the $$q$$-variate response and $$p$$-variate predictors have a joint multivariate normal distribution. In this case, the regression coefficient matrix is $$\Omega_* \Sigma_{*XY}$$, where $$\Omega_*$$ is the marginal precision matrix for the predictors and $$\Sigma_{*XY}$$ is the cross-covariance matrix between predictors and response. We propose to estimate $$\Omega_*$$ using (2) with $$A = I_p$$, $$C$$ equal to the matrix of zeros, and $$B$$ equal to the usual sample estimator of $$\Sigma_{*XY}$$. Similar to the proposal of Witten & Tibshirani (2009), this approach provides an alternative method for estimating regression coefficients using shrinkage estimators of the marginal precision matrix for the predictors. There are also applications where neither $$A$$ nor $$B$$ is equal to $$I_p$$. For example, in quadratic discriminant analysis, if one assumes that $$\mu_{*j}^{\rm T} \Omega_{*j} \mu_{*j}$$ is small, e.g., $$\mu_{*j}$$ is in the span of a set of eigenvectors of $$\Omega_{*j}$$ corresponding to small eigenvalues, then it may be appropriate to shrink entries of the estimates of $$\Omega_{j*}$$, $$\Omega_{*j}\mu_{*j}$$ and $$\mu_{*j}^{\rm T} \Omega_{*j} \mu_{*j}$$. For this application, we propose to estimate $$\Omega_*$$ using (2) with $$C$$ equal to the matrix of zeros and $$A^{\rm T} = B = (\bar{x}_j, \gamma I_p)$$ for some tuning parameter $$\gamma > 0$$. 3. Computation 3.1. Alternating direction method of multipliers algorithm To solve the optimization in (2), we propose an alternating direction method of multipliers algorithm with a modification based on the majorize-minimize principle (Lange, 2016). Following the standard alternating direction method of multipliers approach (Boyd et al., 2011), we rewrite (2) as a constrained optimization problem: $$\label{eq: constrained_primal} \mathop {\rm \arg\,min}\limits_{\left(\Theta, \Omega\right) \in \mathbb{R}^{a \times b} \times \mathbb{S}^p_+} \left\{ {\rm tr}(S\Omega) - \log{\rm det}(\Omega) + \lambda |\Theta|_1 \right\} \quad \text{subject to } A\Omega B - \Theta = C\text{.}$$ (4) The augmented Lagrangian for (4) is defined by \begin{align*} \mathcal{F}_\rho(\Omega, \Theta, \Gamma) = {\rm tr}(S \Omega) & - \log{\rm det}(\Omega) + \lambda |\Theta|_1 \\ & - {\rm tr} \left\{ \Gamma^{\rm T}(A\Omega B - \Theta - C)\right\} + \frac{\rho}{2}\|A\Omega B - \Theta - C\|_{\rm F}^2, \end{align*} where $$\rho > 0$$, $$\Gamma \in \mathbb{R}^{a \times b}$$ is the Lagrangian dual variable, and $$\|\cdot\|_{\rm F}$$ is the Frobenius norm. Let the subscript $$k$$ denote the $$k$$th iterate. From Boyd et al. (2011), to solve (4), the alternating direction method of multipliers algorithm uses the updating equations \begin{align} \Omega_{k+1} &= \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}^p_+} \mathcal{F}_{\rho}(\Omega, \Theta_{k}, \Gamma_{k}), \label{M_update_original} \\ \end{align} (5) \begin{align} \Theta_{k+1} &= \mathop {\rm \arg\,min}\limits_{\Theta \in \mathbb{R}^{a \times b}} \mathcal{F}_{\rho}(\Omega_{k+1}, \Theta, \Gamma_{k}), \label{soft_thresholding_step} \end{align} (6) \begin{gather*} \Gamma_{k+1} = \Gamma_{k} - \rho \left(A \Omega_{k+1}B - \Theta_{k+1} - C \right)\text{.} \notag \end{gather*} The update in (5) requires its own iterative algorithm, which is complicated by the positive definiteness of the optimization variable. To avoid this computation, we replace (5) with an approximation based on the majorize-minimize principle. In particular, we replace $$\mathcal{F}_\rho(\cdot, \Theta_k, \Gamma_k)$$ with a majorizing function at the current iterate $$\Omega_{k}$$: $$\Omega_{k+1} = \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}^+_p} \left\{\mathcal{F}_\rho(\Omega, \Theta_{k}, \Gamma_{k}) + \frac{\rho}{2}{\rm vec}(\Omega - \Omega_{k})^{\rm T} Q\,{\rm vec}(\Omega - \Omega_{k})\right\}\!,\label{majorizer}$$ (7) where $$Q = \tau I - \left(A^{\rm T} A \otimes BB^{\rm T}\right)\!,$$$$\tau$$ is selected so that $$Q \in \mathbb{S}^p_+$$, $$\otimes$$ is the Kronecker product, and $${\rm vec}$$ forms a vector by stacking the columns of its matrix argument. Since $${\rm vec}(\Omega - \Omega_{k})^{\rm T}\left(A^{\rm T} A \otimes BB^{\rm T}\right) {\rm vec}(\Omega - \Omega_{k}) = {\rm tr}\left\{ A^{\rm T} A (\Omega - \Omega_{k}) BB^{\rm T} (\Omega - \Omega_{k})\right\}\!,$$ we can rewrite (7) as $$\Omega_{k+1} = \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}^+_p} \left[ \mathcal{F}_\rho(\Omega, \Theta_{k}, \Gamma_{k}) + \frac{\rho\tau}{2} \|\Omega - \Omega_{k}\|_{\rm F}^2 - \frac{\rho}{2} {\rm tr}\left\{ A^{\rm T} A (\Omega - \Omega_{k}) BB^{\rm T} (\Omega - \Omega_{k}) \right\} \right]\!,$$ which is equivalent to $$\label{eq:majorized_LINADMM} \Omega_{k+1} = \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}^+_p} \left[ {\rm tr}\left\{ \left(S + G_{k} \right) \Omega\right\} - \log {\rm det}(\Omega) + \frac{\rho\tau}{2} \|\Omega - \Omega_{k}\|_{\rm F}^2 \right]\!,$$ (8) where $$G_{k} = \rho A^{\rm T}(A\Omega_{k} B-\rho^{-1}\Gamma_{k}- \Theta_{k} - C)B^{\rm T}$$. The zero-gradient equation for (8) is \begin{equation*} \label{gradient_equation} S - \Omega_{k+1}^{-1} + \frac{1}{2} \left( G_{k} + G_{k}^{\rm T} \right) + \rho \tau \left(\Omega_{k+1} - \Omega_{k}\right)=0, \end{equation*} whose solution is (Witten & Tibshirani, 2009; Price et al., 2015) $$\Omega_{k+1} = \frac{1}{2\rho\tau} U \left\{ -\Psi + \left(\Psi^2 + 4 \rho \tau I_p\right)^{1/2}\right\} U^{\rm T},$$ where $$U \Psi U^{\rm T}$$ is the eigendecomposition of $$S + (G_{k} + G_{k}^{\rm T})/2 - \rho \tau \Omega_{k}$$. Our majorize-minimize approximation is a special case of the prox-linear alternating direction method of multiplier algorithm (Chen & Teboulle, 1994; Deng & Yin, 2016). Using the majorize-minimize approximation of (5) guarantees that $$\mathcal{F}_\rho(\Omega_{k+1}, \Theta_k, \Gamma_k) \leq \mathcal{F}_\rho(\Omega_{k}, \Theta_k, \Gamma_k)$$ and maintains the convergence properties of our algorithm (Deng & Yin, 2016). Finally, (6) also has a closed-form solution: $$\label{soft_update} \Theta_{k+1} = {\rm soft}\left( A\Omega_{k+1}B- \rho^{-1}\Gamma_{k} - C,\rho^{-1}\lambda \right)\!\notag,$$ where $${\rm soft}(x, \phi) = \max \left(|x|- \phi, 0\right) {\rm sign}(x)$$. To summarize, we solve (2) with the following algorithm. Algorithm 1. Initialize $$\Omega_{0} \in \mathbb{S}_+^p$$, $$\Theta_{0}\in \mathbb{R}^{a \times b}$$, $$\rho > 0$$, and $$\tau$$ such that $$Q$$ is positive definite. Set $$k=0$$. Repeat Steps 1–6 until convergence: Step $$1$$. Compute $$G_{k}= \rho A^{\rm T}( A\Omega_{k} B-\rho^{-1}\Gamma_k - \Theta_{k} - C)B^{\rm T}$$. Step $$2$$. Decompose $$S + (G_{k} + G_{k}^{\rm T})/2 - \rho \tau\Omega_{k} = U \Psi U^{\rm T}$$ where $$U$$ is orthogonal and $$\Psi$$ is diagonal. Step $$3$$. Set $$\Omega_{k+1} = (2\rho\tau)^{-1} U \left\{ -\Psi + (\Psi^2 + 4 \rho \tau I_p)^{1/2}\right\} U^{\rm T}$$. Step $$4$$. Set $$\Theta_{k+1} = {\rm soft}(A\Omega_{k+1}B- \rho^{-1}\Gamma_{k} - C, \rho^{-1}\lambda )$$. Step $$5$$. Set $$\Gamma_{k+1} = \Gamma_{k} - \rho\left(A\Omega_{k+1} B - \Theta_{k+1} - C\right)$$. Step $$6$$. Replace $$k$$ with $$k+1$$. 3.2. Convergence and implementation Using the same proof technique as in Deng & Yin (2016), one can show that the iterates from Algorithm 1 converge to their optimal values when a solution to (4) exists. In our implementation, we set $$\tau = \varphi_{1}(A^{\rm T} A)\varphi_{1}(BB^{\rm T}) + 10^{-8}$$, where $$\varphi_{1}(\cdot)$$ denotes the largest eigenvalue of its argument. This computation is only needed once at the initialization of our algorithm. We expect that in practice, the computational complexity of our algorithm will be dominated by the eigendecomposition in Step 2, which requires $$O(p^3)$$ flops. To select the tuning parameter to use in practice, we recommend using some type of crossvalidation procedure. For example, in the linear discriminant analysis case, one could select the tuning parameter that minimizes the validation misclassification rate or maximizes a validation likelihood. 4. Statistical properties We now show that by using the penalty in (2), we can estimate $$\Omega_*$$ and $$A \Omega_* B$$ consistently in the Frobenius and $$L_1$$-norms, respectively. We focus on the case where $$C$$ is the $$a \times b$$ matrix of zeros. A nonzero $$C$$ substantially complicates the theoretical analysis and is left as a direction for future research. Our results rely on assuming that $$A \Omega_* B$$ is sparse. Define the set $$\mathcal{G}$$ as the indices of $$A\Omega_* B$$ that are nonzero, i.e., $$\mathcal{G} = \left\{(i,j)\in \left\{1, \dots, a\right\} \times \left\{1 ,\dots, b\right\}: \left[ A \Omega_* B\right]_{ij} \neq 0 \right\}\!\text{.}$$ Let $$[A \Omega_* B]_\mathcal{G} \in \mathbb{R}^{a \times b}$$ denote the matrix whose $$(i,j)$$th entry is equal to the $$(i,j)$$th entry of $$A \Omega_* B$$ if $$(i,j) \in \mathcal{G}$$ and is equal to zero if $$(i,j) \notin \mathcal{G}$$. We generalize our results to the case where $$A$$ and $$B$$ are unknown, and we use plug-in estimators of them in (2). We first establish convergence rate bounds for known $$A$$ and $$B$$. Let $$\sigma_{k}(\cdot)$$ and $$\varphi_{k}(\cdot)$$ denote the $$k$$th largest singular value and eigenvalue of their arguments, respectively. Suppose that the sample covariance matrix used in (2) is $$S_n = n^{-1} \sum_{i=1}^n X_i X_i^{\rm T},$$ where $$X_1, \dots, X_n$$ are independent and identically distributed $$p_n$$-dimensional random vectors with mean zero and covariance matrix $$\Omega_*^{-1}$$. We will make the following assumptions. Assumption 1. For all $$n$$, there exists a constant $$k_1$$ such that $$0 < k_1^{-1} \leq \varphi_{p_n}(\Omega_*) \leq \varphi_{1}(\Omega_*) \leq k_1 < \infty\text{.}$$ Assumption 2. For all $$n$$, there exists a constant $$k_2$$ such that $$\min \left\{ \sigma_{p_n}(A), \sigma_{p_n}(B) \right\} \geq k_2 > 0$$. Assumption 3. For all $$n$$, there exist positive constants $$k_3$$ and $$k_4$$ such that $$\max_{j \in \left\{ 1, \dots, p_n \right\} } E\left\{ {\rm exp}(tX_{1j}^2)\right\} \leq k_3 < \infty, \quad t \in (-k_4, k_4)\text{.}$$ Assumptions 1 and 3 are common in the regularized precision matrix estimation literature; Assumption 1 was made by Bickel & Levina (2008), Rothman et al. (2008) and Lam & Fan (2009), and Assumption 3 holds if $$X_1$$ is multivariate normal. Assumption 2 requires that $$A$$ and $$B$$ both have rank $$p_n$$, which has the effect of shrinking every entry of $$\hat{\Omega}$$. The convergence rate bounds we establish also depend on the quantity $$\xi(p_n, \mathcal{G}) = \sup_{M \in \mathbb{S}^{p_n}, M \neq 0 } \frac{|\!\left[A M B\right]_{\mathcal{G}}\!|_1}{\|M\|_{\rm F}},$$ where $$\mathbb{S}^{p_n}$$ is the set of symmetric $$p_n \times p_n$$ matrices. Negahban et al. (2012) defined a similar and more general quantity and called it a compatibility constant. Theorem 1. Under Assumptions 1–3, if $$\lambda_n = K_1 (n^{-1}\log p_n)^{1/2}$$ with $$K_1$$ sufficiently large and $$\xi^2(p_n, \mathcal{G}) \log p_n = o(n)$$, then (i) $$\| \hat{\Omega} - \Omega_* \|_{\rm F} = O_{\rm P} \{ \xi(p_n, \mathcal{G}) (\log p_n/n)^{1/2}\}$$ and (ii) $$|A\hat{\Omega}B - A\Omega_*B |_1 = O_{\rm P}\{ \xi^2(p_n, \mathcal{G}) ( \log p_n/n)^{1/2} \}$$. The quantity $$\xi(p_n, \mathcal{G})$$ can be used to recover known results for special cases of (2). For example, when $$A$$ and $$B$$ are identity matrices, $$\xi(p_n,\mathcal{G}) = {s_n}^{1/2}$$, where $$s_n$$ is the number of nonzero entries in $$\Omega_*$$. This special case was established by Rothman et al. (2008). We can simplify the results of Theorem 1 for the case where $$A \Omega_* B$$ has $$g_n$$ nonzero entries by introducing an additional assumption. Assumption 4. For all $$n$$, there exists a constant $$k_5$$ such that $$\sup_{M \in \mathbb{S}^{p_n}, M \neq 0} \frac{ \| [A M B]_\mathcal{G} \|_{\rm F}}{\|M\|_{\rm F}} \leq k_5 < \infty\text{.}$$ Assumption 4 is not the same as bounding $$\xi(p_n,\mathcal{G}),$$ because the numerator uses the Frobenius norm instead of the $$L_1$$-norm. This requires that for those entries of $$A \Omega_* B$$ which are nonzero, the corresponding rows and columns of $$A$$ and $$B$$ do not have magnitudes that are too large as $$p_n$$ grows. Corollary 1. Under the conditions of Theorem 1 and Assumption 4, if $$A \Omega_* B$$ has $$g_n$$ nonzero entries, then (i) $$\| \hat{\Omega} - \Omega_* \|_{\rm F} = O_{\rm P}\{ (g_n \log p_n/n)^{1/2}\}$$ and (ii) $$| A\hat{\Omega}B - A\Omega_*B|_1 = O_{\rm P}\{ (g_n^2 \log p_n/n)^{1/2}\}$$. In practice, $$A$$ and $$B$$ are often unknown and must be estimated by $$\skew6\hat{A}_n$$ and $$\hat{B}_n$$, say. In this case, we estimate $$A\Omega_*B$$ with $$\skew6\hat{A}_n \tilde{\Omega} \hat{B}_n$$, where $$\label{estimator_AB_unknown} \tilde{\Omega} = \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}_+^{p_n}} \left\{ {\rm tr}(S_n\Omega) - \log {\rm det}(\Omega) + \lambda_n |\skew6\hat{A}_n \Omega \hat{B}_n|_1 \right\}\!\text{.}$$ (9) To establish convergence rate bounds for $$\skew6\hat{A}_n \tilde{\Omega} \hat{B}_n$$, we require an assumption on the asymptotic properties of $$\skew6\hat{A}_n$$ and $$\hat{B}_n$$. Assumption 5. There exist sequences $$a_n = o(1)$$ and $$b_n = o(1)$$ such that $$|(A - \skew6\hat{A}_n)A^{+} |_1 = O_{\rm P}\left(a_n\right)$$ and $$|B^{+}(B - \hat{B}_n)|_1 = O_{\rm P}\left(b_n\right)$$, where $$A^{+}$$ and $$B^{+}$$ are the Moore–Penrose pseudoinverses of $$A$$ and $$B$$. The convergence rate bounds we establish for (9) also depend on the quantity $$\Phi_n = \max (a_n, b_n) |\left[A \Omega_* B\right]_\mathcal{G}|_1,$$ which partly controls the additional error incurred by using estimates of $$A$$ and $$B$$ in (9). Theorem 2. Under Assumptions 1–3 and 5, if $$\lambda_n = K_2 (\log p_n/n)^{1/2}$$ with $$K_2$$ sufficiently large, $$\xi^2(p_n, \mathcal{G})\log p_n = o(n)$$, and $$\Phi_n^2 \log p_n = o(n)$$, then: (i) $$\| \tilde{\Omega} - \Omega\|_{\rm F} = O_{\rm P} \{ \xi(p_n, \mathcal{G})(\log p_n/n)^{1/2} + \Phi_n^{1/2} (\log p_n/n)^{1/4}\}$$; (ii) $$|\skew6\hat{A}_n \tilde{\Omega} \hat{B}_n - A\Omega_* B|_1 = O_{\rm P}\{ \xi^2(p_n, \mathcal{G}) (\log p_n/n)^{1/2} + \Phi_n^{1/2} \xi(p_n, \mathcal{G}) (\log p_n/n)^{1/4} + \Phi_n\}$$. The convergence rate bounds in Theorem 2 are the sums of the statistical errors from Theorem 1 plus additional errors from estimating $$A$$ and $$B$$. 5. Simulation studies 5.1. Models We compare our precision matrix estimator with competing estimators when they are used to fit the linear discriminant analysis model. For 100 independent replications, we generated a realization of $$n$$ independent copies of $$(X,Y)$$ defined in (3), where $$\mu_{*j} = \Omega_*^{-1} \beta_{*j}$$ and $${\rm pr}(Y = j) = 1/J$$ for $$j = 1, \dots, J$$. Using this construction, because $$\beta_{*l} - \beta_{*m} = \Omega_*(\mu_{*l} - \mu_{*m})$$, if the $$k$$th element of $$\beta_{*l} - \beta_{*m}$$ is zero, then the $$k$$th variable is noninformative for discriminating between response categories $$l$$ and $$m$$. For each $$J \in \left\{3, \dots, 10\right\}$$, we partition our $$n$$ observations into a training set of size $$25 J$$, a validation set of size $$200$$, and a test set of size 1000. We considered two models for $$\Omega_*^{-1}$$ and $$\beta_{*j}$$. Let $${\rm 1}\kern-0.24em{\rm I}(\cdot)$$ be the indicator function. Model 1. We set $$\beta_{*j,k} = 1.5 \hspace{2pt}{\rm 1}\kern-0.24em{\rm I} \left[ k \in \left\{4(j-1) + 1, \dots, 4j\right\}\right],$$ so that for any pair of response categories, only eight variables were informative for discrimination. We set $$\Omega^{-1}_{*a,b} = 0.9^{|a-b|}$$, so that $$\Omega_*$$ was tridiagonal. Model 2. We set $$\beta_{*j,k} = 2 \hspace{2pt}{\rm 1}\kern-0.24em{\rm I} \left[k \in \left\{5(j-1) + 1, \dots, 5j\right\} \right],$$ so that for any pair of response categories, only ten variables were informative for discrimination. We set $$\Omega_*^{-1}$$ to be block diagonal: the block corresponding to the informative variables, i.e., the first $$5J$$ variables, had off-diagonal entries equal to $$0.5$$ and diagonal entries equal to 1. The block submatrix corresponding to the $$p - 5J$$ noninformative variables had $$(a,b)$$th entry equal to $$0.5^{|a-b|}$$. For both models, sparse estimators of $$\Omega_*$$ should perform well because the population precision matrices are very sparse and invertible. The total number of informative variables is $$4J$$ and $$5J$$ in Models 1 and 2 respectively, so a method like that proposed by Mai et al. (2018), which selects variables that are informative for all pairwise response category comparisons, may perform poorly when $$J$$ is large. 5.2. Methods We compared several methods in terms of classification accuracy on the test set. We fitted (3) using the sparse naïve Bayes estimator proposed by Guo (2010), with a tuning parameter chosen to minimize the misclassification rate on the validation set, and the Bayes rule, i.e., $$\Omega_*$$, $$\mu_{*j}$$ and $${\rm pr}(Y = j)$$ known for $$j = 1, \dots, J$$. We also fitted (3) using the ordinary sample means and using the precision matrix estimator proposed in § 2 with a tuning parameter chosen to minimize the misclassification rate on the validation set and $$B$$ estimated using the sample means; the $$L_1$$-penalized Gaussian likelihood precision matrix estimator (Yuan & Lin, 2007; Rothman et al., 2008; Friedman et al., 2008) with tuning parameter chosen to minimize the misclassification rate on the validation set; and a covariance matrix estimator similar to that proposed by Ledoit & Wolf (2004), which is defined by $$\hat{\Omega}^{-1}_{\rm LW} = \alpha S + \gamma (1 - \alpha) I_p,$$ where $$(\alpha, \gamma) \in (0,1) \times (0, \infty)$$ were chosen to minimize the misclassification rate on the validation set. The $$L_1$$-penalized Gaussian likelihood precision matrix estimator we used penalized the diagonals. With our data-generating models, we found this performed better at classification than (1), which does not penalize the diagonals. We also tried two Fisher linear discriminant-based methods applicable to multi-category linear discriminant analysis: the sparse linear discriminant method proposed by Witten & Tibshirani (2011) with tuning parameter and dimension chosen to minimize the misclassification rate on the validation set; and the multi-category sparse linear discriminant method proposed by Mai et al. (2018) with tuning parameter chosen to minimize the misclassification rate on the validation set. We could also have selected tuning parameters for the model-based methods by maximizing a validation likelihood or using an information criterion, but minimizing the misclassification rate on a validation set made it fairer to compare the model-based methods and the Fisher linear discriminant-based methods in terms of classification accuracy. 5.3. Performance measures We measured classification accuracy on the test set for each replication for the methods described in § 5.2. For the methods that produced a precision matrix estimator, we also measured this estimator’s Frobenius norm error, $$\| \bar{\Omega} - \Omega_*\|_{\rm F}$$, where $$\bar{\Omega}$$ is the estimator. To measure variable selection accuracy, we used both the true positive rate and the true negative rate, which are respectively defined by $$\dfrac{{\rm card} \left\{(m,k): \hat{\Delta}_{m, k} \neq 0 \cap \Delta_{*m, k} \neq 0 \right\}}{ {\rm card}\left\{(m,k):\Delta_{*m, k} \neq 0 \right\}}, \quad\dfrac{{\rm card} \left\{(m,k): \hat{\Delta}_{m, k} = 0 \cap \Delta_{*m, k} = 0 \right\}}{ {\rm card}\left\{(m,k):\Delta_{*m, k} = 0 \right\}},$$ where $$(m, k) \in \left\{2, \dots, J\right\} \times \left\{ 1,\dots, p\right\}$$, $$\Delta_{*m} = \beta_{*1} - \beta_{*m}$$, $$\hat{\Delta}_{m}$$ is an estimator of $$\Delta_{*m}$$, and $${\rm card}$$ denotes the cardinality of a set. 5.4. Results We display average misclassification rates and Frobenius norm error averages for both models with $$p=200$$ in Fig. 1, and display variable selection accuracy averages in Fig. 2. For both models, our method outperformed all competitors in terms of classification accuracy for all $$J$$, except the Bayes rule, which uses population parameter values unknown in practice. In terms of precision matrix estimation, for Model 1, our method did better than the $$L_1$$-penalized Gaussian likelihood precision matrix estimator when the sample size was small, but became worse than the $$L_1$$-penalized Gaussian likelihood precision matrix estimator as the sample size increased. For Model 2, our method was worse than the $$L_1$$-penalized Gaussian likelihood precision matrix estimator in Frobenius norm error for precision matrix estimation, but was better in terms of classification accuracy. Fig. 1. View largeDownload slide Misclassification rates, (a) and (b), and Frobenius norm error, (c) and (d), averaged over 100 replications with $$p=200$$ for Model 1, (a) and (c), and Model 2, (b) and (d). The methods displayed are the estimator we proposed in § 2 (dashed and $$\blacksquare$$), the $$L_1$$-penalized Gaussian likelihood estimator (dashed and $$\blacktriangle$$), the Ledoit–Wolf-type estimator $$\hat{\Omega}^{-1}_{\rm LW}$$ (dashed and ⚫), the Bayes rule (solid and $$\ast$$), the method proposed by Guo (2010) (dots and ⚪), the method proposed by Mai et al. (2018) (dots and $$\triangle$$), and the method proposed by Witten & Tibshirani (2011) (dots and $$\square$$). Fig. 1. View largeDownload slide Misclassification rates, (a) and (b), and Frobenius norm error, (c) and (d), averaged over 100 replications with $$p=200$$ for Model 1, (a) and (c), and Model 2, (b) and (d). The methods displayed are the estimator we proposed in § 2 (dashed and $$\blacksquare$$), the $$L_1$$-penalized Gaussian likelihood estimator (dashed and $$\blacktriangle$$), the Ledoit–Wolf-type estimator $$\hat{\Omega}^{-1}_{\rm LW}$$ (dashed and ⚫), the Bayes rule (solid and $$\ast$$), the method proposed by Guo (2010) (dots and ⚪), the method proposed by Mai et al. (2018) (dots and $$\triangle$$), and the method proposed by Witten & Tibshirani (2011) (dots and $$\square$$). Fig. 2. View largeDownload slide True positive and true negative rates averaged over 100 replications with $$p=200$$ for Model 1 in (a) and (c), and for Model 2 in (b) and (d). The methods displayed are the estimator we proposed in § 2 (dashed and $$\blacksquare$$), the method proposed by Guo (2010) (dots and ⚪), and the method proposed by Mai et al. (2018) (dots and $$\triangle$$). Fig. 2. View largeDownload slide True positive and true negative rates averaged over 100 replications with $$p=200$$ for Model 1 in (a) and (c), and for Model 2 in (b) and (d). The methods displayed are the estimator we proposed in § 2 (dashed and $$\blacksquare$$), the method proposed by Guo (2010) (dots and ⚪), and the method proposed by Mai et al. (2018) (dots and $$\triangle$$). The precision matrix estimation results are consistent with our theory. For Model 1, as $$J$$ increases, the number of nonzero entries in $$A \Omega_* B$$ also increases. Our convergence rate bound for estimating $$\Omega_*$$ gets worse as the number of nonzero entries in $$A\Omega_*B$$ increases, which may explain why our estimator’s Frobenius norm error gets worse as $$J$$ increases. Because the sample size increases with $$J$$ and the number of nonzero entries in $$\Omega_*$$ is fixed for all $$J$$, it is expected that the $$L_1$$-penalized Gaussian likelihood estimator improves as $$J$$ increases. For Model 2, as $$J$$ increases, more entries in $$\Omega_*$$ become close to zero and so the Frobenius norm error for any shrinkage estimator should decrease. However, an important point is illustrated by Model 2: although the $$L_1$$-penalized Gaussian likelihood estimator is more accurate in terms of estimating the precision matrix, our proposed estimator still performs better in terms of classification accuracy. In terms of variable selection, our method was competitive with the methods proposed by Guo (2010) and Mai et al. (2018). For Model 1, our method tended to have a higher average true negative rate than the method of Guo (2010) and a lower average true positive rate than the method of Mai et al. (2018). For Model 2, all methods tended to have relatively high average true positive rates, while our method had a higher average true negative rate than the method of Mai et al. (2018). Although the method proposed by Guo (2010) had a higher average true negative rate for Model 2 than our proposed method, our method performed better in terms of classification accuracy. The performance of the method proposed by Mai et al. (2018) can be partially explained by the method’s variable selection properties. Their method either includes or excludes variables for discriminating between all pairwise response category comparisons. As $$J$$ increases, many variables are informative for only a small number of pairwise comparisons. Their method’s low true negative rate for Model 2 suggests that it selects a large number of uninformative variables as $$J$$ increases. 6. Genomic data example We used our method to fit the linear discriminant analysis model in a real data application. The data are gene expression profiles consisting of $$p=22\,238$$ genes from 127 subjects, who have either Crohn’s disease, ulcerative colitis, or neither. This dataset comes from Burczynski et al. (2006). The goal of our analysis was to fit a linear discriminant analysis model that could be used to identify which genes are informative for discriminating between each pair of the response categories. These data were also analysed in Mai et al. (2018). To measure the classification accuracy of our method and its competitors, we randomly split the data into training sets of size 100 and test sets of size 27 for 100 independent replications. Within each replication, we first applied a screening rule to the training set as in Rothman et al. (2009) and Mai et al. (2018), based on $$F$$-test statistics, and then restricted our discriminant analysis model to the genes with the $$k$$ largest $$F$$-test statistic values. We chose tuning parameters with five-fold crossvalidation that minimized the validation classification error rate. Misclassification rates are shown in Fig. 3(b), where we compared our method to those of Mai et al. (2018), Witten & Tibshirani (2011), and Guo (2010), as well as the method that used the $$L_1$$-penalized Gaussian likelihood precision matrix estimator. Our method was at least as accurate in terms of classification as the competing methods. The only method that performed nearly as well was that of Mai et al. (2018) with $$k=100$$ screened genes. However, the best out-of-sample classification accuracy was achieved with $$k=300,$$ where our method was significantly better than the competitors. Fig. 3. View largeDownload slide (a) Model sizes and (b) misclassification rates from 100 random training/testing splits with $$k=100$$ (dark grey), $$k=200$$ (medium grey), and $$k=300$$ (light grey). Guo is the method proposed by Guo (2010), Mai is the method proposed by Mai et al. (2018), Glasso is the $$L_1$$-penalized Gaussian likelihood precision matrix estimator, Ours is the estimator we propose § 2, and Witten is the method proposed by Witten & Tibshirani (2011). Model size is the number of nonzero entries summed across the estimated discriminant vectors. Fig. 3. View largeDownload slide (a) Model sizes and (b) misclassification rates from 100 random training/testing splits with $$k=100$$ (dark grey), $$k=200$$ (medium grey), and $$k=300$$ (light grey). Guo is the method proposed by Guo (2010), Mai is the method proposed by Mai et al. (2018), Glasso is the $$L_1$$-penalized Gaussian likelihood precision matrix estimator, Ours is the estimator we propose § 2, and Witten is the method proposed by Witten & Tibshirani (2011). Model size is the number of nonzero entries summed across the estimated discriminant vectors. Although the method of Mai et al. (2018) tended to estimate smaller models, our method, which performs best in classification, selects only slightly more variables. Moreover, unlike the method of Mai et al. (2018), our method can be used to identify a distinct subset of genes that are informative specifically for discriminating between patients with Crohn’s disease and ulcerative colitis. This was of interest in the study of Burczynski et al. (2006). In the Supplementary Material, we provide additional details and further results. Acknowledgement We thank the associate editor and two referees for helpful comments. A. J. Molstad’s research was supported in part by a Doctoral Dissertation Fellowship from the University of Minnesota. A. J. Rothman’s research was supported in part by the U.S. National Science Foundation. Supplementary material Supplementary material available at Biometrika online includes the proofs of Theorems 1 and 2 and additional information about the genomic data example. References Bickel P. J. & Levina E. ( 2004 ). Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli 10 , 989 – 1010 . Google Scholar CrossRef Search ADS Bickel P. J. & Levina E. ( 2008 ). Regularized estimation of large covariance matrices. Ann. Statist. 36 , 199 – 227 . Google Scholar CrossRef Search ADS Boyd S. , Parikh N. , Chu E. , Peleato B. & Eckstein J. ( 2011 ). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundat. Trends Mach. Learn. 3 , 1 – 122 . Brodie J. , Daubechies I. , De Mol C. , Giannone D. & Loris I. ( 2009 ). Sparse and stable Markowitz portfolios. Proc. Nat. Acad. Sci. 106 , 12267 – 72 . Google Scholar CrossRef Search ADS Burczynski M. E. , Peterson R. L. , Twine N. C. , Zuberek K. A. , Brodeur B. J. , Casciotti L. , Maganti V. , Reddy P. S. , Strahs A. , Immermann F. et al. ( 2006 ). Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. J. Molec. Diagnos. 8 , 51 – 61 . Google Scholar CrossRef Search ADS Cai T. & Liu W. ( 2011 ). A direct estimation approach to sparse linear discriminant analysis. J. Am. Statist. Assoc. 106 , 1566 – 77 . Google Scholar CrossRef Search ADS Chen G. & Teboulle M. ( 1994 ). A proximal-based decomposition method for convex minimization problems. Math. Programming 64 , 81 – 101 . Google Scholar CrossRef Search ADS Chen X. , Xu M. & Wu W. B. ( 2016 ). Regularized estimation of linear functionals of precision matrices for high-dimensional time series. IEEE Trans. Sig. Proces. 64 , 6459 – 70 . Google Scholar CrossRef Search ADS Dalal O. & Rajaratnam B. ( 2017 ). Sparse Gaussian graphical model estimation via alternating minimization. Biometrika 104 , 379 – 95 . Google Scholar CrossRef Search ADS Deng W. & Yin W. ( 2016 ). On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comp. 66 , 889 – 916 . Google Scholar CrossRef Search ADS Fan J. , Feng Y. & Tong X. ( 2012 ). A road to classification in high dimensional space: The regularized optimal affine discriminant. J. R. Statist. Soc. B 74 , 745 – 71 . Google Scholar CrossRef Search ADS Fan J. , Feng Y. & Wu Y. ( 2009 ). Network exploration via the adaptive LASSO and SCAD penalties. Ann. Appl. Statist. 3 , 521 – 41 . Google Scholar CrossRef Search ADS Fan J. , Liao Y. & Liu H. ( 2016 ). An overview of the estimation of large covariance and precision matrices. Economet. J. 19 , C1 – 32 . Google Scholar CrossRef Search ADS Friedman J. H. , Hastie T. J. & Tibshirani R. J. ( 2008 ). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 , 432 – 41 . Google Scholar CrossRef Search ADS PubMed Guo J. ( 2010 ). Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis. Biostatistics 11 , 599 – 608 . Google Scholar CrossRef Search ADS PubMed Lam C. & Fan J. ( 2009 ). Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. 37 , 4254 – 78 . Google Scholar CrossRef Search ADS Lange K. ( 2016 ). MM Optimization Algorithms . Philadelphia : SIAM . Google Scholar CrossRef Search ADS Ledoit O. & Wolf M. ( 2004 ). A well-conditioned estimator for large-dimensional covariance matrices. J. Mult. Anal. 88 , 365 – 411 . Google Scholar CrossRef Search ADS Mai Q. , Yang Y. & Zou H. ( 2018 ). Multiclass sparse discriminant analysis. Statist. Sinica to appear, https://doi.org/10.5705/ss.202016.0117 . Mai Q. , Zou H. & Yuan M. ( 2012 ). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika 99 , 29 – 42 . Google Scholar CrossRef Search ADS Markowitz H. ( 1952 ). Portfolio selection. J. Finan. 7 , 77 – 91 . Negahban S. N. , Yu B. , Wainwright M. J. & Ravikumar P. K. ( 2012 ). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci. 27 , 538 – 57 . Google Scholar CrossRef Search ADS Pourahmadi M. ( 2013 ). High-Dimensional Covariance Estimation: With High-Dimensional Data. Hoboken, New Jersey : Wiley . Google Scholar CrossRef Search ADS Price B. S. , Geyer C. J. & Rothman A. J. ( 2015 ). Ridge fusion in statistical learning. J. Comp. Graph. Statist. 24 , 439 – 54 . Google Scholar CrossRef Search ADS Rothman A. J. , Bickel P. J. , Levina E. & Zhu J. ( 2008 ). Sparse permutation invariant covariance estimation. Electron. J. Statist. 2 , 494 – 515 . Google Scholar CrossRef Search ADS Rothman A. J. & Forzani L. ( 2014 ). On the existence of the weighted bridge penalized Gaussian likelihood precision matrix estimator. Electron. J. Statist. 8 , 2693 – 700 . Google Scholar CrossRef Search ADS Rothman A. J. , Levina E. & Zhu J. ( 2009 ). Generalized thresholding of large covariance matrices. J. Am. Statist. Assoc. 104 , 177 – 86 . Google Scholar CrossRef Search ADS Witten D. M. & Tibshirani R. ( 2011 ). Penalized classification using Fisher’s linear discriminant. J. R. Statist. Soc. B 73 , 753 – 72 . Google Scholar CrossRef Search ADS Witten D. M. & Tibshirani R. J. ( 2009 ). Covariance-regularized regression and classification for high dimensional problems. J. R. Statist. Soc. B 71 , 615 – 36 . Google Scholar CrossRef Search ADS Xu P. , Zhu J. , Zhu L. & Li Y. ( 2015 ). Covariance-enhanced discriminant analysis. Biometrika 102 , 33 – 45 . Google Scholar CrossRef Search ADS PubMed Xue L. , Ma S. & Zou H. ( 2012 ). Positive-definite $$\ell_1$$-penalized estimation of large covariance matrices. J. Am. Statist. Assoc. 107 , 1480 – 91 . Google Scholar CrossRef Search ADS Yuan M. & Lin Y. ( 2007 ). Model selection and estimation in the Gaussian graphical model. Biometrika 93 , 19 – 35 . Google Scholar CrossRef Search ADS © 2018 Biometrika Trust This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Biometrika Oxford University Press

Shrinking characteristics of precision matrix estimators

, Volume Advance Article – May 10, 2018
12 pages

/lp/ou_press/shrinking-characteristics-of-precision-matrix-estimators-KiEQy3UqMi
Publisher
Oxford University Press
ISSN
0006-3444
eISSN
1464-3510
D.O.I.
10.1093/biomet/asy023
Publisher site
See Article on Publisher Site

Abstract

Summary We propose a framework to shrink a user-specified characteristic of a precision matrix estimator that is needed to fit a predictive model. Estimators in our framework minimize the Gaussian negative loglikelihood plus an $$L_1$$ penalty on a linear or affine function evaluated at the optimization variable corresponding to the precision matrix. We establish convergence rate bounds for these estimators and propose an alternating direction method of multipliers algorithm for their computation. Our simulation studies show that our estimators can perform better than competitors when they are used to fit predictive models. In particular, we illustrate cases where our precision matrix estimators perform worse at estimating the population precision matrix but better at prediction. 1. Introduction The estimation of precision matrices is required to fit many statistical models. Many papers written in the last decade have proposed shrinkage estimators of the precision matrix when the number of variables $$p$$ is large. Pourahmadi (2013) and Fan et al. (2016) provide comprehensive reviews of large covariance and precision matrix estimation. The main strategy used in many of these papers is to minimize the Gaussian negative loglikelihood plus a penalty on the off-diagonal entries of the optimization variable corresponding to the precision matrix. For example, Yuan & Lin (2007) proposed the $$L_1$$-penalized Gaussian likelihood precision matrix estimator defined by \begin{align}\label{graphical_lasso} \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}^p_+} \left\{ {\rm tr}(S\Omega) - \log {\rm det}(\Omega) + \lambda \sum_{i \neq j}|\Omega_{ij}|\right\}\!, \end{align} (1) where $$S$$ is the sample covariance matrix, $$\lambda > 0$$ is a tuning parameter, $$\mathbb{S}^p_+$$ is the set of $$p \times p$$ symmetric and positive-definite matrices, $$\Omega_{ij}$$ is the $$(i,j)$$th entry of $$\Omega$$, and $${\rm tr}$$ and $${\rm det}$$ denote the trace and determinant, respectively. Other authors have replaced the $$L_1$$ penalty in (1) with the squared Frobenius norm (Witten & Tibshirani, 2009; Rothman & Forzani, 2014) or nonconvex penalties that also encourage zeros in the estimator (Lam & Fan, 2009; Fan et al., 2009). To fit many predictive models, only a characteristic of the population precision matrix need be estimated. For example, in binary linear discriminant analysis, the population precision matrix is needed for prediction only through the product of the precision matrix and the difference between the two conditional distribution mean vectors. Many authors have proposed methods that directly estimate this characteristic (Cai & Liu, 2011; Fan et al., 2012; Mai et al., 2012). We propose to estimate the precision matrix by shrinking the characteristic of the estimator that is needed for prediction. The characteristic we consider is a linear or affine function evaluated at the precision matrix. The goal is to improve prediction performance. Unlike methods that estimate the characteristic directly, our approach provides the practitioner with an estimate of the entire precision matrix, not just the characteristic. In our simulation studies and data example, we show that penalizing the characteristic needed for prediction can improve prediction performance over competing sparse precision estimators like (1), even when the true precision matrix is very sparse. In addition, estimators in our framework can be used in applications other than linear discriminant analysis. 2. Proposed method 2.1. Penalized likelihood estimator We propose to estimate the population precision matrix $$\Omega_*$$ with $$\label{eq:estimator} \hat{\Omega} = \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}_+^{p}}\left\{ {\rm tr}(S\Omega) - \log {\rm det}(\Omega) + \lambda |A \Omega B - C |_1 \right\}\!,$$ (2) where $$A \in \mathbb{R}^{a \times p}$$, $$B \in \mathbb{R}^{p \times b}$$ and $$C \in \mathbb{R}^{a \times b}$$ are user-specified matrices, and $$|M|_1 = \sum_{i,j} |M_{ij}|$$. Our estimator exploits the assumption that $$A \Omega_* B - C$$ is sparse. When $$A$$, $$B$$ and $$C$$ need to be estimated, we replace them in (2) with their estimators. The matrix $$C$$ can serve as a shrinkage target for $$A \hat{\Omega}B$$, allowing practitioners to incorporate prior information. Dalal & Rajaratnam (2017) proposed a class of estimators similar to (2) which replaces $$A\Omega B-C$$ with $$T(\Omega)$$, where $$T$$ is a symmetric linear transform. Fitting the discriminant analysis model requires the estimation of one or more precision matrices. In particular, the linear discriminant analysis model assumes that the data are independent copies of the random pair $$(X,Y)$$, where the support of $$Y$$ is $$\left\{1, \dots, J\right\}$$ and $$\label{Normal_Model_LDA} X \mid Y= j \sim {\it N}_p \left(\mu_{*j}, \Omega_*^{-1}\right)\quad (\,j=1, \dots, J),$$ (3) where $$\mu_{*j} \in \mathbb{R}^p$$ and $$\Omega_*^{-1} \in \mathbb{S}^p_+$$ are unknown. To discriminate between response categories $$l$$ and $$m$$, only the characteristic $$\Omega_*(\mu_{*l} - \mu_{*m})$$ is needed. Methods that estimate this characteristic directly have been proposed (Cai & Liu, 2011; Mai et al., 2012; Fan et al., 2012; Mai et al., 2018). These methods are useful in high dimensions because they perform variable selection. For the $$j$$th variable to be noninformative for discriminating between response categories $$l$$ and $$m$$, it must be that the $$j$$th element of $$\Omega_*(\mu_{*l} - \mu_{*m})$$ is zero. While these methods can perform well in classification and variable selection, they do not actually fit the model in (3). Methods for fitting (3) specifically for linear discriminant analysis either assume $$\Omega_*$$ is diagonal (Bickel & Levina, 2004) or that both $$\mu_{*l} - \mu_{*m}$$ and $$\Omega_*$$ are sparse (Guo, 2010; Xu et al., 2015). A method for fitting (3) and performing variable selection was proposed by Witten & Tibshirani (2009). They suggest a two-step procedure where one first estimates $$\Omega_*$$ and then, with the estimate $$\bar{\Omega}$$ fixed, estimates each $$\mu_{*j}$$ by penalizing the characteristic $$\bar{\Omega} \mu_j$$, where $$\mu_j$$ is the optimization variable corresponding to $$\mu_{*j}$$. 2.2. Example applications To apply our method to the linear discriminant analysis problem, we use (2) with $$A = I_p$$, $$C$$ equal to the matrix of zeros, and $$B$$ equal to the matrix whose columns are $$\bar{x}_j - \bar{x}_{k}$$ for all $$1 \leq j < k \leq J$$, where $$\bar{x}_j$$ is the unpenalized maximum likelihood estimator of $$\mu_{*j}$$. For large values of the tuning parameter, this would lead to an estimator of $$\Omega_*$$ such that $$\hat{\Omega}(\bar{x}_j -\bar{x}_{k})$$ is sparse. Thus our approach simultaneously fits (3) and performs variable selection. Precision and covariance matrix estimators are also needed for portfolio allocation. The optimal allocation based on the Markowitz (1952) minimum-variance portfolio is proportional to $$\Omega_*\mu_*$$, where $$\mu_*$$ is the vector of expected returns for $$p$$ assets and $$\Omega_*$$ is the precision matrix for the returns. In practice, one would estimate $$\Omega_*$$ and $$\mu_*$$ with their usual sample estimators $$\hat\Omega$$ and $$\hat{\mu}$$. However, when $$p$$ is larger than the sample size, the usual sample estimator of $$\Omega_*$$ does not exist, so regularization is necessary. Moreover, Brodie et al. (2009) argue that sparse portfolios, i.e., portfolios with fewer than $$p$$ active positions, are often desirable when $$p$$ is large. While many have proposed using sparse or shrinkage estimators of $$\Omega_*$$ or $$\Omega_*^{-1}$$ inserted in the Markowitz criterion, e.g., Xue et al. (2012), this would not necessarily lead to sparse estimators of $$\Omega_* \mu_*$$. To achieve a sparse portfolio, Chen et al. (2016) proposed a method for estimating the characteristic $$\Omega_*\mu_*$$ directly, but like the direct linear discriminant methods, it does not lead to an estimate of $$\Omega_*$$. For the sparse portfolio allocation application, we propose to estimate $$\Omega_*$$ using (2) with $$A=I_p$$, $$C$$ equal to the vector of zeros, and $$B = \hat{\mu}$$. Another application is in linear regression where the $$q$$-variate response and $$p$$-variate predictors have a joint multivariate normal distribution. In this case, the regression coefficient matrix is $$\Omega_* \Sigma_{*XY}$$, where $$\Omega_*$$ is the marginal precision matrix for the predictors and $$\Sigma_{*XY}$$ is the cross-covariance matrix between predictors and response. We propose to estimate $$\Omega_*$$ using (2) with $$A = I_p$$, $$C$$ equal to the matrix of zeros, and $$B$$ equal to the usual sample estimator of $$\Sigma_{*XY}$$. Similar to the proposal of Witten & Tibshirani (2009), this approach provides an alternative method for estimating regression coefficients using shrinkage estimators of the marginal precision matrix for the predictors. There are also applications where neither $$A$$ nor $$B$$ is equal to $$I_p$$. For example, in quadratic discriminant analysis, if one assumes that $$\mu_{*j}^{\rm T} \Omega_{*j} \mu_{*j}$$ is small, e.g., $$\mu_{*j}$$ is in the span of a set of eigenvectors of $$\Omega_{*j}$$ corresponding to small eigenvalues, then it may be appropriate to shrink entries of the estimates of $$\Omega_{j*}$$, $$\Omega_{*j}\mu_{*j}$$ and $$\mu_{*j}^{\rm T} \Omega_{*j} \mu_{*j}$$. For this application, we propose to estimate $$\Omega_*$$ using (2) with $$C$$ equal to the matrix of zeros and $$A^{\rm T} = B = (\bar{x}_j, \gamma I_p)$$ for some tuning parameter $$\gamma > 0$$. 3. Computation 3.1. Alternating direction method of multipliers algorithm To solve the optimization in (2), we propose an alternating direction method of multipliers algorithm with a modification based on the majorize-minimize principle (Lange, 2016). Following the standard alternating direction method of multipliers approach (Boyd et al., 2011), we rewrite (2) as a constrained optimization problem: $$\label{eq: constrained_primal} \mathop {\rm \arg\,min}\limits_{\left(\Theta, \Omega\right) \in \mathbb{R}^{a \times b} \times \mathbb{S}^p_+} \left\{ {\rm tr}(S\Omega) - \log{\rm det}(\Omega) + \lambda |\Theta|_1 \right\} \quad \text{subject to } A\Omega B - \Theta = C\text{.}$$ (4) The augmented Lagrangian for (4) is defined by \begin{align*} \mathcal{F}_\rho(\Omega, \Theta, \Gamma) = {\rm tr}(S \Omega) & - \log{\rm det}(\Omega) + \lambda |\Theta|_1 \\ & - {\rm tr} \left\{ \Gamma^{\rm T}(A\Omega B - \Theta - C)\right\} + \frac{\rho}{2}\|A\Omega B - \Theta - C\|_{\rm F}^2, \end{align*} where $$\rho > 0$$, $$\Gamma \in \mathbb{R}^{a \times b}$$ is the Lagrangian dual variable, and $$\|\cdot\|_{\rm F}$$ is the Frobenius norm. Let the subscript $$k$$ denote the $$k$$th iterate. From Boyd et al. (2011), to solve (4), the alternating direction method of multipliers algorithm uses the updating equations \begin{align} \Omega_{k+1} &= \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}^p_+} \mathcal{F}_{\rho}(\Omega, \Theta_{k}, \Gamma_{k}), \label{M_update_original} \\ \end{align} (5) \begin{align} \Theta_{k+1} &= \mathop {\rm \arg\,min}\limits_{\Theta \in \mathbb{R}^{a \times b}} \mathcal{F}_{\rho}(\Omega_{k+1}, \Theta, \Gamma_{k}), \label{soft_thresholding_step} \end{align} (6) \begin{gather*} \Gamma_{k+1} = \Gamma_{k} - \rho \left(A \Omega_{k+1}B - \Theta_{k+1} - C \right)\text{.} \notag \end{gather*} The update in (5) requires its own iterative algorithm, which is complicated by the positive definiteness of the optimization variable. To avoid this computation, we replace (5) with an approximation based on the majorize-minimize principle. In particular, we replace $$\mathcal{F}_\rho(\cdot, \Theta_k, \Gamma_k)$$ with a majorizing function at the current iterate $$\Omega_{k}$$: $$\Omega_{k+1} = \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}^+_p} \left\{\mathcal{F}_\rho(\Omega, \Theta_{k}, \Gamma_{k}) + \frac{\rho}{2}{\rm vec}(\Omega - \Omega_{k})^{\rm T} Q\,{\rm vec}(\Omega - \Omega_{k})\right\}\!,\label{majorizer}$$ (7) where $$Q = \tau I - \left(A^{\rm T} A \otimes BB^{\rm T}\right)\!,$$$$\tau$$ is selected so that $$Q \in \mathbb{S}^p_+$$, $$\otimes$$ is the Kronecker product, and $${\rm vec}$$ forms a vector by stacking the columns of its matrix argument. Since $${\rm vec}(\Omega - \Omega_{k})^{\rm T}\left(A^{\rm T} A \otimes BB^{\rm T}\right) {\rm vec}(\Omega - \Omega_{k}) = {\rm tr}\left\{ A^{\rm T} A (\Omega - \Omega_{k}) BB^{\rm T} (\Omega - \Omega_{k})\right\}\!,$$ we can rewrite (7) as $$\Omega_{k+1} = \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}^+_p} \left[ \mathcal{F}_\rho(\Omega, \Theta_{k}, \Gamma_{k}) + \frac{\rho\tau}{2} \|\Omega - \Omega_{k}\|_{\rm F}^2 - \frac{\rho}{2} {\rm tr}\left\{ A^{\rm T} A (\Omega - \Omega_{k}) BB^{\rm T} (\Omega - \Omega_{k}) \right\} \right]\!,$$ which is equivalent to $$\label{eq:majorized_LINADMM} \Omega_{k+1} = \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}^+_p} \left[ {\rm tr}\left\{ \left(S + G_{k} \right) \Omega\right\} - \log {\rm det}(\Omega) + \frac{\rho\tau}{2} \|\Omega - \Omega_{k}\|_{\rm F}^2 \right]\!,$$ (8) where $$G_{k} = \rho A^{\rm T}(A\Omega_{k} B-\rho^{-1}\Gamma_{k}- \Theta_{k} - C)B^{\rm T}$$. The zero-gradient equation for (8) is \begin{equation*} \label{gradient_equation} S - \Omega_{k+1}^{-1} + \frac{1}{2} \left( G_{k} + G_{k}^{\rm T} \right) + \rho \tau \left(\Omega_{k+1} - \Omega_{k}\right)=0, \end{equation*} whose solution is (Witten & Tibshirani, 2009; Price et al., 2015) $$\Omega_{k+1} = \frac{1}{2\rho\tau} U \left\{ -\Psi + \left(\Psi^2 + 4 \rho \tau I_p\right)^{1/2}\right\} U^{\rm T},$$ where $$U \Psi U^{\rm T}$$ is the eigendecomposition of $$S + (G_{k} + G_{k}^{\rm T})/2 - \rho \tau \Omega_{k}$$. Our majorize-minimize approximation is a special case of the prox-linear alternating direction method of multiplier algorithm (Chen & Teboulle, 1994; Deng & Yin, 2016). Using the majorize-minimize approximation of (5) guarantees that $$\mathcal{F}_\rho(\Omega_{k+1}, \Theta_k, \Gamma_k) \leq \mathcal{F}_\rho(\Omega_{k}, \Theta_k, \Gamma_k)$$ and maintains the convergence properties of our algorithm (Deng & Yin, 2016). Finally, (6) also has a closed-form solution: $$\label{soft_update} \Theta_{k+1} = {\rm soft}\left( A\Omega_{k+1}B- \rho^{-1}\Gamma_{k} - C,\rho^{-1}\lambda \right)\!\notag,$$ where $${\rm soft}(x, \phi) = \max \left(|x|- \phi, 0\right) {\rm sign}(x)$$. To summarize, we solve (2) with the following algorithm. Algorithm 1. Initialize $$\Omega_{0} \in \mathbb{S}_+^p$$, $$\Theta_{0}\in \mathbb{R}^{a \times b}$$, $$\rho > 0$$, and $$\tau$$ such that $$Q$$ is positive definite. Set $$k=0$$. Repeat Steps 1–6 until convergence: Step $$1$$. Compute $$G_{k}= \rho A^{\rm T}( A\Omega_{k} B-\rho^{-1}\Gamma_k - \Theta_{k} - C)B^{\rm T}$$. Step $$2$$. Decompose $$S + (G_{k} + G_{k}^{\rm T})/2 - \rho \tau\Omega_{k} = U \Psi U^{\rm T}$$ where $$U$$ is orthogonal and $$\Psi$$ is diagonal. Step $$3$$. Set $$\Omega_{k+1} = (2\rho\tau)^{-1} U \left\{ -\Psi + (\Psi^2 + 4 \rho \tau I_p)^{1/2}\right\} U^{\rm T}$$. Step $$4$$. Set $$\Theta_{k+1} = {\rm soft}(A\Omega_{k+1}B- \rho^{-1}\Gamma_{k} - C, \rho^{-1}\lambda )$$. Step $$5$$. Set $$\Gamma_{k+1} = \Gamma_{k} - \rho\left(A\Omega_{k+1} B - \Theta_{k+1} - C\right)$$. Step $$6$$. Replace $$k$$ with $$k+1$$. 3.2. Convergence and implementation Using the same proof technique as in Deng & Yin (2016), one can show that the iterates from Algorithm 1 converge to their optimal values when a solution to (4) exists. In our implementation, we set $$\tau = \varphi_{1}(A^{\rm T} A)\varphi_{1}(BB^{\rm T}) + 10^{-8}$$, where $$\varphi_{1}(\cdot)$$ denotes the largest eigenvalue of its argument. This computation is only needed once at the initialization of our algorithm. We expect that in practice, the computational complexity of our algorithm will be dominated by the eigendecomposition in Step 2, which requires $$O(p^3)$$ flops. To select the tuning parameter to use in practice, we recommend using some type of crossvalidation procedure. For example, in the linear discriminant analysis case, one could select the tuning parameter that minimizes the validation misclassification rate or maximizes a validation likelihood. 4. Statistical properties We now show that by using the penalty in (2), we can estimate $$\Omega_*$$ and $$A \Omega_* B$$ consistently in the Frobenius and $$L_1$$-norms, respectively. We focus on the case where $$C$$ is the $$a \times b$$ matrix of zeros. A nonzero $$C$$ substantially complicates the theoretical analysis and is left as a direction for future research. Our results rely on assuming that $$A \Omega_* B$$ is sparse. Define the set $$\mathcal{G}$$ as the indices of $$A\Omega_* B$$ that are nonzero, i.e., $$\mathcal{G} = \left\{(i,j)\in \left\{1, \dots, a\right\} \times \left\{1 ,\dots, b\right\}: \left[ A \Omega_* B\right]_{ij} \neq 0 \right\}\!\text{.}$$ Let $$[A \Omega_* B]_\mathcal{G} \in \mathbb{R}^{a \times b}$$ denote the matrix whose $$(i,j)$$th entry is equal to the $$(i,j)$$th entry of $$A \Omega_* B$$ if $$(i,j) \in \mathcal{G}$$ and is equal to zero if $$(i,j) \notin \mathcal{G}$$. We generalize our results to the case where $$A$$ and $$B$$ are unknown, and we use plug-in estimators of them in (2). We first establish convergence rate bounds for known $$A$$ and $$B$$. Let $$\sigma_{k}(\cdot)$$ and $$\varphi_{k}(\cdot)$$ denote the $$k$$th largest singular value and eigenvalue of their arguments, respectively. Suppose that the sample covariance matrix used in (2) is $$S_n = n^{-1} \sum_{i=1}^n X_i X_i^{\rm T},$$ where $$X_1, \dots, X_n$$ are independent and identically distributed $$p_n$$-dimensional random vectors with mean zero and covariance matrix $$\Omega_*^{-1}$$. We will make the following assumptions. Assumption 1. For all $$n$$, there exists a constant $$k_1$$ such that $$0 < k_1^{-1} \leq \varphi_{p_n}(\Omega_*) \leq \varphi_{1}(\Omega_*) \leq k_1 < \infty\text{.}$$ Assumption 2. For all $$n$$, there exists a constant $$k_2$$ such that $$\min \left\{ \sigma_{p_n}(A), \sigma_{p_n}(B) \right\} \geq k_2 > 0$$. Assumption 3. For all $$n$$, there exist positive constants $$k_3$$ and $$k_4$$ such that $$\max_{j \in \left\{ 1, \dots, p_n \right\} } E\left\{ {\rm exp}(tX_{1j}^2)\right\} \leq k_3 < \infty, \quad t \in (-k_4, k_4)\text{.}$$ Assumptions 1 and 3 are common in the regularized precision matrix estimation literature; Assumption 1 was made by Bickel & Levina (2008), Rothman et al. (2008) and Lam & Fan (2009), and Assumption 3 holds if $$X_1$$ is multivariate normal. Assumption 2 requires that $$A$$ and $$B$$ both have rank $$p_n$$, which has the effect of shrinking every entry of $$\hat{\Omega}$$. The convergence rate bounds we establish also depend on the quantity $$\xi(p_n, \mathcal{G}) = \sup_{M \in \mathbb{S}^{p_n}, M \neq 0 } \frac{|\!\left[A M B\right]_{\mathcal{G}}\!|_1}{\|M\|_{\rm F}},$$ where $$\mathbb{S}^{p_n}$$ is the set of symmetric $$p_n \times p_n$$ matrices. Negahban et al. (2012) defined a similar and more general quantity and called it a compatibility constant. Theorem 1. Under Assumptions 1–3, if $$\lambda_n = K_1 (n^{-1}\log p_n)^{1/2}$$ with $$K_1$$ sufficiently large and $$\xi^2(p_n, \mathcal{G}) \log p_n = o(n)$$, then (i) $$\| \hat{\Omega} - \Omega_* \|_{\rm F} = O_{\rm P} \{ \xi(p_n, \mathcal{G}) (\log p_n/n)^{1/2}\}$$ and (ii) $$|A\hat{\Omega}B - A\Omega_*B |_1 = O_{\rm P}\{ \xi^2(p_n, \mathcal{G}) ( \log p_n/n)^{1/2} \}$$. The quantity $$\xi(p_n, \mathcal{G})$$ can be used to recover known results for special cases of (2). For example, when $$A$$ and $$B$$ are identity matrices, $$\xi(p_n,\mathcal{G}) = {s_n}^{1/2}$$, where $$s_n$$ is the number of nonzero entries in $$\Omega_*$$. This special case was established by Rothman et al. (2008). We can simplify the results of Theorem 1 for the case where $$A \Omega_* B$$ has $$g_n$$ nonzero entries by introducing an additional assumption. Assumption 4. For all $$n$$, there exists a constant $$k_5$$ such that $$\sup_{M \in \mathbb{S}^{p_n}, M \neq 0} \frac{ \| [A M B]_\mathcal{G} \|_{\rm F}}{\|M\|_{\rm F}} \leq k_5 < \infty\text{.}$$ Assumption 4 is not the same as bounding $$\xi(p_n,\mathcal{G}),$$ because the numerator uses the Frobenius norm instead of the $$L_1$$-norm. This requires that for those entries of $$A \Omega_* B$$ which are nonzero, the corresponding rows and columns of $$A$$ and $$B$$ do not have magnitudes that are too large as $$p_n$$ grows. Corollary 1. Under the conditions of Theorem 1 and Assumption 4, if $$A \Omega_* B$$ has $$g_n$$ nonzero entries, then (i) $$\| \hat{\Omega} - \Omega_* \|_{\rm F} = O_{\rm P}\{ (g_n \log p_n/n)^{1/2}\}$$ and (ii) $$| A\hat{\Omega}B - A\Omega_*B|_1 = O_{\rm P}\{ (g_n^2 \log p_n/n)^{1/2}\}$$. In practice, $$A$$ and $$B$$ are often unknown and must be estimated by $$\skew6\hat{A}_n$$ and $$\hat{B}_n$$, say. In this case, we estimate $$A\Omega_*B$$ with $$\skew6\hat{A}_n \tilde{\Omega} \hat{B}_n$$, where $$\label{estimator_AB_unknown} \tilde{\Omega} = \mathop {\rm \arg\,min}\limits_{\Omega \in \mathbb{S}_+^{p_n}} \left\{ {\rm tr}(S_n\Omega) - \log {\rm det}(\Omega) + \lambda_n |\skew6\hat{A}_n \Omega \hat{B}_n|_1 \right\}\!\text{.}$$ (9) To establish convergence rate bounds for $$\skew6\hat{A}_n \tilde{\Omega} \hat{B}_n$$, we require an assumption on the asymptotic properties of $$\skew6\hat{A}_n$$ and $$\hat{B}_n$$. Assumption 5. There exist sequences $$a_n = o(1)$$ and $$b_n = o(1)$$ such that $$|(A - \skew6\hat{A}_n)A^{+} |_1 = O_{\rm P}\left(a_n\right)$$ and $$|B^{+}(B - \hat{B}_n)|_1 = O_{\rm P}\left(b_n\right)$$, where $$A^{+}$$ and $$B^{+}$$ are the Moore–Penrose pseudoinverses of $$A$$ and $$B$$. The convergence rate bounds we establish for (9) also depend on the quantity $$\Phi_n = \max (a_n, b_n) |\left[A \Omega_* B\right]_\mathcal{G}|_1,$$ which partly controls the additional error incurred by using estimates of $$A$$ and $$B$$ in (9). Theorem 2. Under Assumptions 1–3 and 5, if $$\lambda_n = K_2 (\log p_n/n)^{1/2}$$ with $$K_2$$ sufficiently large, $$\xi^2(p_n, \mathcal{G})\log p_n = o(n)$$, and $$\Phi_n^2 \log p_n = o(n)$$, then: (i) $$\| \tilde{\Omega} - \Omega\|_{\rm F} = O_{\rm P} \{ \xi(p_n, \mathcal{G})(\log p_n/n)^{1/2} + \Phi_n^{1/2} (\log p_n/n)^{1/4}\}$$; (ii) $$|\skew6\hat{A}_n \tilde{\Omega} \hat{B}_n - A\Omega_* B|_1 = O_{\rm P}\{ \xi^2(p_n, \mathcal{G}) (\log p_n/n)^{1/2} + \Phi_n^{1/2} \xi(p_n, \mathcal{G}) (\log p_n/n)^{1/4} + \Phi_n\}$$. The convergence rate bounds in Theorem 2 are the sums of the statistical errors from Theorem 1 plus additional errors from estimating $$A$$ and $$B$$. 5. Simulation studies 5.1. Models We compare our precision matrix estimator with competing estimators when they are used to fit the linear discriminant analysis model. For 100 independent replications, we generated a realization of $$n$$ independent copies of $$(X,Y)$$ defined in (3), where $$\mu_{*j} = \Omega_*^{-1} \beta_{*j}$$ and $${\rm pr}(Y = j) = 1/J$$ for $$j = 1, \dots, J$$. Using this construction, because $$\beta_{*l} - \beta_{*m} = \Omega_*(\mu_{*l} - \mu_{*m})$$, if the $$k$$th element of $$\beta_{*l} - \beta_{*m}$$ is zero, then the $$k$$th variable is noninformative for discriminating between response categories $$l$$ and $$m$$. For each $$J \in \left\{3, \dots, 10\right\}$$, we partition our $$n$$ observations into a training set of size $$25 J$$, a validation set of size $$200$$, and a test set of size 1000. We considered two models for $$\Omega_*^{-1}$$ and $$\beta_{*j}$$. Let $${\rm 1}\kern-0.24em{\rm I}(\cdot)$$ be the indicator function. Model 1. We set $$\beta_{*j,k} = 1.5 \hspace{2pt}{\rm 1}\kern-0.24em{\rm I} \left[ k \in \left\{4(j-1) + 1, \dots, 4j\right\}\right],$$ so that for any pair of response categories, only eight variables were informative for discrimination. We set $$\Omega^{-1}_{*a,b} = 0.9^{|a-b|}$$, so that $$\Omega_*$$ was tridiagonal. Model 2. We set $$\beta_{*j,k} = 2 \hspace{2pt}{\rm 1}\kern-0.24em{\rm I} \left[k \in \left\{5(j-1) + 1, \dots, 5j\right\} \right],$$ so that for any pair of response categories, only ten variables were informative for discrimination. We set $$\Omega_*^{-1}$$ to be block diagonal: the block corresponding to the informative variables, i.e., the first $$5J$$ variables, had off-diagonal entries equal to $$0.5$$ and diagonal entries equal to 1. The block submatrix corresponding to the $$p - 5J$$ noninformative variables had $$(a,b)$$th entry equal to $$0.5^{|a-b|}$$. For both models, sparse estimators of $$\Omega_*$$ should perform well because the population precision matrices are very sparse and invertible. The total number of informative variables is $$4J$$ and $$5J$$ in Models 1 and 2 respectively, so a method like that proposed by Mai et al. (2018), which selects variables that are informative for all pairwise response category comparisons, may perform poorly when $$J$$ is large. 5.2. Methods We compared several methods in terms of classification accuracy on the test set. We fitted (3) using the sparse naïve Bayes estimator proposed by Guo (2010), with a tuning parameter chosen to minimize the misclassification rate on the validation set, and the Bayes rule, i.e., $$\Omega_*$$, $$\mu_{*j}$$ and $${\rm pr}(Y = j)$$ known for $$j = 1, \dots, J$$. We also fitted (3) using the ordinary sample means and using the precision matrix estimator proposed in § 2 with a tuning parameter chosen to minimize the misclassification rate on the validation set and $$B$$ estimated using the sample means; the $$L_1$$-penalized Gaussian likelihood precision matrix estimator (Yuan & Lin, 2007; Rothman et al., 2008; Friedman et al., 2008) with tuning parameter chosen to minimize the misclassification rate on the validation set; and a covariance matrix estimator similar to that proposed by Ledoit & Wolf (2004), which is defined by $$\hat{\Omega}^{-1}_{\rm LW} = \alpha S + \gamma (1 - \alpha) I_p,$$ where $$(\alpha, \gamma) \in (0,1) \times (0, \infty)$$ were chosen to minimize the misclassification rate on the validation set. The $$L_1$$-penalized Gaussian likelihood precision matrix estimator we used penalized the diagonals. With our data-generating models, we found this performed better at classification than (1), which does not penalize the diagonals. We also tried two Fisher linear discriminant-based methods applicable to multi-category linear discriminant analysis: the sparse linear discriminant method proposed by Witten & Tibshirani (2011) with tuning parameter and dimension chosen to minimize the misclassification rate on the validation set; and the multi-category sparse linear discriminant method proposed by Mai et al. (2018) with tuning parameter chosen to minimize the misclassification rate on the validation set. We could also have selected tuning parameters for the model-based methods by maximizing a validation likelihood or using an information criterion, but minimizing the misclassification rate on a validation set made it fairer to compare the model-based methods and the Fisher linear discriminant-based methods in terms of classification accuracy. 5.3. Performance measures We measured classification accuracy on the test set for each replication for the methods described in § 5.2. For the methods that produced a precision matrix estimator, we also measured this estimator’s Frobenius norm error, $$\| \bar{\Omega} - \Omega_*\|_{\rm F}$$, where $$\bar{\Omega}$$ is the estimator. To measure variable selection accuracy, we used both the true positive rate and the true negative rate, which are respectively defined by $$\dfrac{{\rm card} \left\{(m,k): \hat{\Delta}_{m, k} \neq 0 \cap \Delta_{*m, k} \neq 0 \right\}}{ {\rm card}\left\{(m,k):\Delta_{*m, k} \neq 0 \right\}}, \quad\dfrac{{\rm card} \left\{(m,k): \hat{\Delta}_{m, k} = 0 \cap \Delta_{*m, k} = 0 \right\}}{ {\rm card}\left\{(m,k):\Delta_{*m, k} = 0 \right\}},$$ where $$(m, k) \in \left\{2, \dots, J\right\} \times \left\{ 1,\dots, p\right\}$$, $$\Delta_{*m} = \beta_{*1} - \beta_{*m}$$, $$\hat{\Delta}_{m}$$ is an estimator of $$\Delta_{*m}$$, and $${\rm card}$$ denotes the cardinality of a set. 5.4. Results We display average misclassification rates and Frobenius norm error averages for both models with $$p=200$$ in Fig. 1, and display variable selection accuracy averages in Fig. 2. For both models, our method outperformed all competitors in terms of classification accuracy for all $$J$$, except the Bayes rule, which uses population parameter values unknown in practice. In terms of precision matrix estimation, for Model 1, our method did better than the $$L_1$$-penalized Gaussian likelihood precision matrix estimator when the sample size was small, but became worse than the $$L_1$$-penalized Gaussian likelihood precision matrix estimator as the sample size increased. For Model 2, our method was worse than the $$L_1$$-penalized Gaussian likelihood precision matrix estimator in Frobenius norm error for precision matrix estimation, but was better in terms of classification accuracy. Fig. 1. View largeDownload slide Misclassification rates, (a) and (b), and Frobenius norm error, (c) and (d), averaged over 100 replications with $$p=200$$ for Model 1, (a) and (c), and Model 2, (b) and (d). The methods displayed are the estimator we proposed in § 2 (dashed and $$\blacksquare$$), the $$L_1$$-penalized Gaussian likelihood estimator (dashed and $$\blacktriangle$$), the Ledoit–Wolf-type estimator $$\hat{\Omega}^{-1}_{\rm LW}$$ (dashed and ⚫), the Bayes rule (solid and $$\ast$$), the method proposed by Guo (2010) (dots and ⚪), the method proposed by Mai et al. (2018) (dots and $$\triangle$$), and the method proposed by Witten & Tibshirani (2011) (dots and $$\square$$). Fig. 1. View largeDownload slide Misclassification rates, (a) and (b), and Frobenius norm error, (c) and (d), averaged over 100 replications with $$p=200$$ for Model 1, (a) and (c), and Model 2, (b) and (d). The methods displayed are the estimator we proposed in § 2 (dashed and $$\blacksquare$$), the $$L_1$$-penalized Gaussian likelihood estimator (dashed and $$\blacktriangle$$), the Ledoit–Wolf-type estimator $$\hat{\Omega}^{-1}_{\rm LW}$$ (dashed and ⚫), the Bayes rule (solid and $$\ast$$), the method proposed by Guo (2010) (dots and ⚪), the method proposed by Mai et al. (2018) (dots and $$\triangle$$), and the method proposed by Witten & Tibshirani (2011) (dots and $$\square$$). Fig. 2. View largeDownload slide True positive and true negative rates averaged over 100 replications with $$p=200$$ for Model 1 in (a) and (c), and for Model 2 in (b) and (d). The methods displayed are the estimator we proposed in § 2 (dashed and $$\blacksquare$$), the method proposed by Guo (2010) (dots and ⚪), and the method proposed by Mai et al. (2018) (dots and $$\triangle$$). Fig. 2. View largeDownload slide True positive and true negative rates averaged over 100 replications with $$p=200$$ for Model 1 in (a) and (c), and for Model 2 in (b) and (d). The methods displayed are the estimator we proposed in § 2 (dashed and $$\blacksquare$$), the method proposed by Guo (2010) (dots and ⚪), and the method proposed by Mai et al. (2018) (dots and $$\triangle$$). The precision matrix estimation results are consistent with our theory. For Model 1, as $$J$$ increases, the number of nonzero entries in $$A \Omega_* B$$ also increases. Our convergence rate bound for estimating $$\Omega_*$$ gets worse as the number of nonzero entries in $$A\Omega_*B$$ increases, which may explain why our estimator’s Frobenius norm error gets worse as $$J$$ increases. Because the sample size increases with $$J$$ and the number of nonzero entries in $$\Omega_*$$ is fixed for all $$J$$, it is expected that the $$L_1$$-penalized Gaussian likelihood estimator improves as $$J$$ increases. For Model 2, as $$J$$ increases, more entries in $$\Omega_*$$ become close to zero and so the Frobenius norm error for any shrinkage estimator should decrease. However, an important point is illustrated by Model 2: although the $$L_1$$-penalized Gaussian likelihood estimator is more accurate in terms of estimating the precision matrix, our proposed estimator still performs better in terms of classification accuracy. In terms of variable selection, our method was competitive with the methods proposed by Guo (2010) and Mai et al. (2018). For Model 1, our method tended to have a higher average true negative rate than the method of Guo (2010) and a lower average true positive rate than the method of Mai et al. (2018). For Model 2, all methods tended to have relatively high average true positive rates, while our method had a higher average true negative rate than the method of Mai et al. (2018). Although the method proposed by Guo (2010) had a higher average true negative rate for Model 2 than our proposed method, our method performed better in terms of classification accuracy. The performance of the method proposed by Mai et al. (2018) can be partially explained by the method’s variable selection properties. Their method either includes or excludes variables for discriminating between all pairwise response category comparisons. As $$J$$ increases, many variables are informative for only a small number of pairwise comparisons. Their method’s low true negative rate for Model 2 suggests that it selects a large number of uninformative variables as $$J$$ increases. 6. Genomic data example We used our method to fit the linear discriminant analysis model in a real data application. The data are gene expression profiles consisting of $$p=22\,238$$ genes from 127 subjects, who have either Crohn’s disease, ulcerative colitis, or neither. This dataset comes from Burczynski et al. (2006). The goal of our analysis was to fit a linear discriminant analysis model that could be used to identify which genes are informative for discriminating between each pair of the response categories. These data were also analysed in Mai et al. (2018). To measure the classification accuracy of our method and its competitors, we randomly split the data into training sets of size 100 and test sets of size 27 for 100 independent replications. Within each replication, we first applied a screening rule to the training set as in Rothman et al. (2009) and Mai et al. (2018), based on $$F$$-test statistics, and then restricted our discriminant analysis model to the genes with the $$k$$ largest $$F$$-test statistic values. We chose tuning parameters with five-fold crossvalidation that minimized the validation classification error rate. Misclassification rates are shown in Fig. 3(b), where we compared our method to those of Mai et al. (2018), Witten & Tibshirani (2011), and Guo (2010), as well as the method that used the $$L_1$$-penalized Gaussian likelihood precision matrix estimator. Our method was at least as accurate in terms of classification as the competing methods. The only method that performed nearly as well was that of Mai et al. (2018) with $$k=100$$ screened genes. However, the best out-of-sample classification accuracy was achieved with $$k=300,$$ where our method was significantly better than the competitors. Fig. 3. View largeDownload slide (a) Model sizes and (b) misclassification rates from 100 random training/testing splits with $$k=100$$ (dark grey), $$k=200$$ (medium grey), and $$k=300$$ (light grey). Guo is the method proposed by Guo (2010), Mai is the method proposed by Mai et al. (2018), Glasso is the $$L_1$$-penalized Gaussian likelihood precision matrix estimator, Ours is the estimator we propose § 2, and Witten is the method proposed by Witten & Tibshirani (2011). Model size is the number of nonzero entries summed across the estimated discriminant vectors. Fig. 3. View largeDownload slide (a) Model sizes and (b) misclassification rates from 100 random training/testing splits with $$k=100$$ (dark grey), $$k=200$$ (medium grey), and $$k=300$$ (light grey). Guo is the method proposed by Guo (2010), Mai is the method proposed by Mai et al. (2018), Glasso is the $$L_1$$-penalized Gaussian likelihood precision matrix estimator, Ours is the estimator we propose § 2, and Witten is the method proposed by Witten & Tibshirani (2011). Model size is the number of nonzero entries summed across the estimated discriminant vectors. Although the method of Mai et al. (2018) tended to estimate smaller models, our method, which performs best in classification, selects only slightly more variables. Moreover, unlike the method of Mai et al. (2018), our method can be used to identify a distinct subset of genes that are informative specifically for discriminating between patients with Crohn’s disease and ulcerative colitis. This was of interest in the study of Burczynski et al. (2006). In the Supplementary Material, we provide additional details and further results. Acknowledgement We thank the associate editor and two referees for helpful comments. A. J. Molstad’s research was supported in part by a Doctoral Dissertation Fellowship from the University of Minnesota. A. J. Rothman’s research was supported in part by the U.S. National Science Foundation. Supplementary material Supplementary material available at Biometrika online includes the proofs of Theorems 1 and 2 and additional information about the genomic data example. References Bickel P. J. & Levina E. ( 2004 ). Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli 10 , 989 – 1010 . Google Scholar CrossRef Search ADS Bickel P. J. & Levina E. ( 2008 ). Regularized estimation of large covariance matrices. Ann. Statist. 36 , 199 – 227 . Google Scholar CrossRef Search ADS Boyd S. , Parikh N. , Chu E. , Peleato B. & Eckstein J. ( 2011 ). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundat. Trends Mach. Learn. 3 , 1 – 122 . Brodie J. , Daubechies I. , De Mol C. , Giannone D. & Loris I. ( 2009 ). Sparse and stable Markowitz portfolios. Proc. Nat. Acad. Sci. 106 , 12267 – 72 . Google Scholar CrossRef Search ADS Burczynski M. E. , Peterson R. L. , Twine N. C. , Zuberek K. A. , Brodeur B. J. , Casciotti L. , Maganti V. , Reddy P. S. , Strahs A. , Immermann F. et al. ( 2006 ). Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. J. Molec. Diagnos. 8 , 51 – 61 . Google Scholar CrossRef Search ADS Cai T. & Liu W. ( 2011 ). A direct estimation approach to sparse linear discriminant analysis. J. Am. Statist. Assoc. 106 , 1566 – 77 . Google Scholar CrossRef Search ADS Chen G. & Teboulle M. ( 1994 ). A proximal-based decomposition method for convex minimization problems. Math. Programming 64 , 81 – 101 . Google Scholar CrossRef Search ADS Chen X. , Xu M. & Wu W. B. ( 2016 ). Regularized estimation of linear functionals of precision matrices for high-dimensional time series. IEEE Trans. Sig. Proces. 64 , 6459 – 70 . Google Scholar CrossRef Search ADS Dalal O. & Rajaratnam B. ( 2017 ). Sparse Gaussian graphical model estimation via alternating minimization. Biometrika 104 , 379 – 95 . Google Scholar CrossRef Search ADS Deng W. & Yin W. ( 2016 ). On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comp. 66 , 889 – 916 . Google Scholar CrossRef Search ADS Fan J. , Feng Y. & Tong X. ( 2012 ). A road to classification in high dimensional space: The regularized optimal affine discriminant. J. R. Statist. Soc. B 74 , 745 – 71 . Google Scholar CrossRef Search ADS Fan J. , Feng Y. & Wu Y. ( 2009 ). Network exploration via the adaptive LASSO and SCAD penalties. Ann. Appl. Statist. 3 , 521 – 41 . Google Scholar CrossRef Search ADS Fan J. , Liao Y. & Liu H. ( 2016 ). An overview of the estimation of large covariance and precision matrices. Economet. J. 19 , C1 – 32 . Google Scholar CrossRef Search ADS Friedman J. H. , Hastie T. J. & Tibshirani R. J. ( 2008 ). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 , 432 – 41 . Google Scholar CrossRef Search ADS PubMed Guo J. ( 2010 ). Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis. Biostatistics 11 , 599 – 608 . Google Scholar CrossRef Search ADS PubMed Lam C. & Fan J. ( 2009 ). Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. 37 , 4254 – 78 . Google Scholar CrossRef Search ADS Lange K. ( 2016 ). MM Optimization Algorithms . Philadelphia : SIAM . Google Scholar CrossRef Search ADS Ledoit O. & Wolf M. ( 2004 ). A well-conditioned estimator for large-dimensional covariance matrices. J. Mult. Anal. 88 , 365 – 411 . Google Scholar CrossRef Search ADS Mai Q. , Yang Y. & Zou H. ( 2018 ). Multiclass sparse discriminant analysis. Statist. Sinica to appear, https://doi.org/10.5705/ss.202016.0117 . Mai Q. , Zou H. & Yuan M. ( 2012 ). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika 99 , 29 – 42 . Google Scholar CrossRef Search ADS Markowitz H. ( 1952 ). Portfolio selection. J. Finan. 7 , 77 – 91 . Negahban S. N. , Yu B. , Wainwright M. J. & Ravikumar P. K. ( 2012 ). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci. 27 , 538 – 57 . Google Scholar CrossRef Search ADS Pourahmadi M. ( 2013 ). High-Dimensional Covariance Estimation: With High-Dimensional Data. Hoboken, New Jersey : Wiley . Google Scholar CrossRef Search ADS Price B. S. , Geyer C. J. & Rothman A. J. ( 2015 ). Ridge fusion in statistical learning. J. Comp. Graph. Statist. 24 , 439 – 54 . Google Scholar CrossRef Search ADS Rothman A. J. , Bickel P. J. , Levina E. & Zhu J. ( 2008 ). Sparse permutation invariant covariance estimation. Electron. J. Statist. 2 , 494 – 515 . Google Scholar CrossRef Search ADS Rothman A. J. & Forzani L. ( 2014 ). On the existence of the weighted bridge penalized Gaussian likelihood precision matrix estimator. Electron. J. Statist. 8 , 2693 – 700 . Google Scholar CrossRef Search ADS Rothman A. J. , Levina E. & Zhu J. ( 2009 ). Generalized thresholding of large covariance matrices. J. Am. Statist. Assoc. 104 , 177 – 86 . Google Scholar CrossRef Search ADS Witten D. M. & Tibshirani R. ( 2011 ). Penalized classification using Fisher’s linear discriminant. J. R. Statist. Soc. B 73 , 753 – 72 . Google Scholar CrossRef Search ADS Witten D. M. & Tibshirani R. J. ( 2009 ). Covariance-regularized regression and classification for high dimensional problems. J. R. Statist. Soc. B 71 , 615 – 36 . Google Scholar CrossRef Search ADS Xu P. , Zhu J. , Zhu L. & Li Y. ( 2015 ). Covariance-enhanced discriminant analysis. Biometrika 102 , 33 – 45 . Google Scholar CrossRef Search ADS PubMed Xue L. , Ma S. & Zou H. ( 2012 ). Positive-definite $$\ell_1$$-penalized estimation of large covariance matrices. J. Am. Statist. Assoc. 107 , 1480 – 91 . Google Scholar CrossRef Search ADS Yuan M. & Lin Y. ( 2007 ). Model selection and estimation in the Gaussian graphical model. Biometrika 93 , 19 – 35 . Google Scholar CrossRef Search ADS © 2018 Biometrika Trust This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

BiometrikaOxford University Press

Published: May 10, 2018

DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month Explore the DeepDyve Library Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve Freelancer DeepDyve Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations