Robust estimation of high-dimensional covariance and precision matrices

SUMMARY

High-dimensional data are often most plausibly generated from distributions with complex structure and leptokurtosis in some or all components. Covariance and precision matrices provide a useful summary of such structure, yet the performance of popular matrix estimators typically hinges upon a sub-Gaussianity assumption. This paper presents robust matrix estimators whose performance is guaranteed for a much richer class of distributions. The proposed estimators, under a bounded fourth moment assumption, achieve the same minimax convergence rates as do existing methods under a sub-Gaussianity assumption. Consistency of the proposed estimators is also established under the weak assumption of bounded $$2+\epsilon$$ moments for $$\epsilon\in (0,2)$$. The associated convergence rates depend on $$\epsilon$$.

1. Introduction

Covariance and precision matrices play a central role in summarizing linear relationships among variables. Our focus is on estimating these matrices when their dimension is large relative to the number of observations. Besides being of interest in themselves, estimates of covariance and precision matrices are used in numerous procedures from classical multivariate analysis, including linear regression.

Consistency is achievable under structural assumptions provided regularity conditions are met. For instance, under the assumption that all rows or columns of the covariance matrix belong to a sufficiently small $$\ell_{q}$$-ball around zero, thresholding (Bickel & Levina, 2008; Rothman et al., 2008) or its adaptive counterpart (Cai & Liu, 2011) gives consistent estimators of the covariance matrix in the spectral norm for data from a distribution with sub-Gaussian tails. For precision matrix estimation, the same sparsity assumption on the precision matrix motivates the use of the constrained $$\ell_1$$-minimizer of Cai et al. (2011) or its adaptive counterpart (Cai et al., 2016), both of which are consistent in spectral norm under the same sub-Gaussianity condition. Under sub-Gaussianity, Cai & Liu (2011) and Cai et al. (2016) showed that in high-dimensional regimes the adaptive thresholding estimator and adaptive constrained $$\ell_1$$-minimization estimator are minimax optimal within the classes of covariance or precision matrices satisfying their sparsity constraint.

Since sub-Gaussianity is often too restrictive in practice, we seek new procedures that can achieve the same minimax optimality when data are leptokurtic. Inspection of the proofs of Bickel & Levina (2008), Cai & Liu (2011) and Cai et al. (2016) reveals that sub-Gaussianity is needed because their methods are built on the sample covariance matrix, which requires the assumption to guarantee its optimal performance. Here we show that minimax optimality is achievable within a larger class of distributions if the sample covariance matrix is replaced by a robust pilot estimator, thus providing a unified theory for covariance and precision matrix estimation based on general pilot estimators. We also show how to construct pilot estimators that have the required elementwise convergence rates of (1) and (2) below. Within a much larger class of distributions with bounded fourth moment, it is shown that an estimator obtained by regularizing a robust pilot estimator attains the minimax rate achieved by existing methods under sub-Gaussianity.
The analysis is extended to show that when only bounded $$2+\epsilon$$ moments exist for $$\epsilon\in(0,2)$$, matrix estimators with satisfactory convergence rates are still attainable. Related work includes that of Liu et al. (2012) and Xue & Zou (2012), who considered robust estimation of graphical models when the underlying distribution is elliptically symmetric; Fan et al. (2015, 2016a,b), who studied robust matrix estimation in the context of factor models; and Chen et al. (2015) and Loh & Tan (2015), who investigated matrix estimation when the data are contaminated by outliers. The present paper is concerned with efficient estimation of general sparse covariance and precision matrices when only certain moment conditions are assumed.

For a $$p$$-dimensional random vector $${X}$$ with mean $${\mu}$$, let $$\Sigma^{*}=E\{({X}-{\mu})({X}-{\mu})^{\mathrm{T}}\}$$ and let $$\tilde{\Sigma}=(\tilde{\sigma}_{uv})$$ denote an arbitrary pilot estimator of $$\Sigma^{*}=(\sigma^{*}_{uv})$$, where $$u,v \in [p]$$ with $$[p]$$ standing for $$\{1,\dots,p \}$$. The key requirement on $$\tilde{\Sigma}$$ for optimal covariance estimation is that
\begin{equation}\label{eqConditionC} \text{pr}\Bigl[\max_{u,v}|\tilde{\sigma}_{uv} - \sigma_{uv}^{*}|\leq C_{0}\{(\log p)/n\}^{1/2}\Bigr] \geq 1 - \epsilon_{n,p}, \end{equation} (1)
where $$C_{0}$$ is a positive constant and $$\epsilon_{n,p}$$ is a deterministic sequence converging to zero as $$n,p\rightarrow \infty$$ such that $$n^{-1}\log p\rightarrow 0$$. This delivers rates of convergence that match the minimax rates of Cai & Liu (2011) even under violations of their sub-Gaussianity condition, which entails the existence of $$b>0$$ such that $$E(\exp[t\{X_{u}-E(X_u) \}])\leq \exp(b^{2}t^{2}/2)$$ for every $$t\in \mathbb{R}$$ and every $$u\in[p]$$.

Introduce the sample covariance matrix
\begin{equation*} \hat{\Sigma}=(\hat{\sigma}_{uv}) = n^{-1}\sum_{i=1}^{n}\bigl({X}_{i} - \bar{{X}}\bigr) \bigl({X}_{i} - \bar{{X}}\bigr)^{\mathrm{T}}\!, \end{equation*}
where $${X}_1,\ldots,{X}_n$$ are independent and identically distributed copies of $${X}$$ and $$\bar{{X}}=n^{-1}\sum_{i=1}^{n}{X}_{i}$$. Proposition 1 shows that $$\hat{\Sigma}$$ violates (1) when $${X}$$ is not sub-Gaussian. In other words, the sample covariance matrix does not concentrate exponentially fast in an elementwise sense if sub-Gaussianity is violated.

Similarly, for estimation of the precision matrix $$\Omega^{*}=(\Sigma^{*})^{-1}$$, the optimality of the adaptive constrained $$\ell_1$$-minimization estimator is retained under a pilot estimator satisfying
\begin{equation}\label{eqConditionC2} \text{pr}\Bigl[\max_{u,v}\,\bigl|(\tilde{\Sigma}\Omega^{*} - I_p)_{uv}\bigr|\leq C_{0}\{(\log p)/n\}^{1/2}\Bigr] \geq 1 - \epsilon_{n,p}, \end{equation} (2)
where $$C_{0}$$ and $$\epsilon_{n,p}$$ are as in (1) and $$I_p$$ denotes the $$p\times p$$ identity matrix. While (2) holds with $$\tilde{\Sigma}=\hat{\Sigma}$$ under sub-Gaussianity of $${X}$$, it fails otherwise. The following proposition gives a more formal illustration of the unsuitability of $$\tilde{\Sigma}=\hat{\Sigma}$$ as a pilot estimator in the absence of sub-Gaussianity.

Proposition 1. Let $$E(|X_{u}X_{v}-\sigma^{\ast}_{uv}|^{1+\gamma})\leq 2$$ for $$u\neq v$$ and some $$\gamma>0$$.
Then there exists a distribution of $$X$$ satisfying this assumption, with associated probability measure $${\rm pr}$$, such that for some $$\epsilon<1/2$$,
\begin{equation*} {\rm pr}\bigl\{|\hat{\sigma}_{uv} - \sigma_{uv}^{*}|>2^{-1}\epsilon^{-1/(1+\gamma)}n^{-\gamma/(1+\gamma)}\bigr\} \geq \epsilon\text{.} \end{equation*}
This implies that taking the sample covariance matrix as the pilot estimator $$\tilde{\Sigma}$$ can result in a polynomial rate of convergence, which is slower than the exponential rate of concentration in (1). Instead, we introduce robust pilot estimators in § 4 that satisfy conditions (1) and (2). These estimators require only $$\max_{1\leq u \leq p} E(X_u^4)<\infty$$.

Throughout the paper, for a vector $${a}=(a_1,\ldots,a_p)^{\mathrm{T}}\in\mathbb{R}^p$$, $$\|{a}\|_1=\sum_{v=1}^p|a_v|$$, $$\|{a}\|_2=(\sum_{v=1}^pa_v^2)^{1/2}$$ and $$\|{a}\|_\infty=\max_{1\leq v \leq p}|a_v|$$. For a matrix $$A=(a_{uv})\in\mathbb{R}^{p\times q}$$, $$\|A\|_{\max}=\max_{1\leq u\leq p, 1\leq v\leq q}|a_{uv}|$$ is the elementwise maximum norm, $$\|A\|_2=\sup_{\|{x}\|_2\leq 1}\|A{x}\|_2$$ is the spectral norm, and $$\|A\|_{1}=\max_{1\leq v\leq q}\sum_{u=1}^p|a_{uv}|$$ is the matrix $$\ell_1$$-norm. We let $$I_p$$ denote the $$p\times p$$ identity matrix; $$A\succ 0$$ and $$A \succeq 0$$ mean that $$A$$ is positive definite and positive semidefinite, respectively. For a square matrix $$A$$, we denote its maximum and minimum eigenvalues by $$\lambda_{\max}(A)$$ and $$\lambda_{\min}(A)$$, respectively. We also assume that $$E(X)=0$$.

2. Broadening the scope of the adaptive thresholding estimator

Let $$\tau_{\lambda}(\cdot)$$ be a general thresholding function for which: (i) $$|\tau_\lambda(z)|\leq |y|$$ for all $$z$$ and $$y$$ that satisfy $$|z-y|\leq\lambda$$; (ii) $$\tau_\lambda(z)=0$$ for $$|z|\leq \lambda$$; (iii) $$|\tau_\lambda(z)-z|\leq \lambda$$ for all $$z\in \mathbb{R}$$. Similar properties are set forth in Antoniadis & Fan (2001) and were proposed in the context of covariance estimation via thresholding by Rothman et al. (2009) and Cai & Liu (2011). Examples of thresholding functions satisfying these three conditions are the soft thresholding rule $$\tau_\lambda(z)={\rm sgn}(z)(|z|-\lambda)_+$$, the adaptive lasso rule $$\tau_\lambda(z)=z(1-|\lambda/z|^\eta)_+$$ with $$\eta\geq 1$$, and the smoothly clipped absolute deviation thresholding rule (Rothman et al., 2009). Although the hard thresholding rule $$\tau_\lambda(z)=z\,\mathbb{1}(|z|>\lambda)$$ does not satisfy (i), the results presented in this section also hold for hard thresholding.

The adaptive thresholding estimator is defined as
\[ \hat{\Sigma}^{\mathcal{T}} = (\hat{\sigma}_{uv}^{\mathcal{T}}) = \bigl\{\tau_{\lambda_{uv}}(\hat{\sigma}_{uv})\bigr\}, \]
where $$\hat{\sigma}_{uv}$$ is the $$(u,v)$$th entry of $$\hat{\Sigma}$$ and the threshold $$\lambda_{uv}$$ is entry-dependent. Equipped with these adaptive thresholds, Cai & Liu (2011) established optimal rates of convergence of the resulting estimator under sub-Gaussianity of $${X}$$. To accommodate data drawn from distributions violating sub-Gaussianity, we replace the sample covariance matrix $$\hat{\Sigma}$$ by a pilot estimator $$\tilde{\Sigma}$$ satisfying (1). The resulting adaptive thresholding estimator is denoted by $$\tilde{\Sigma}^{\mathcal{T}}$$.
As suggested by Fan et al. (2013), the entry-dependent threshold
\begin{equation}\label{thresholdAdaptive} \lambda_{uv}=\lambda \left(\frac{\tilde{\sigma}_{uu}\tilde{\sigma}_{vv}\log p}{n}\right)^{1/2} \end{equation} (3)
is used, where $$\lambda>0$$ is a constant. This threshold is simpler than that of Cai & Liu (2011), as it does not require estimation of $${\rm var}(\tilde{\sigma}_{uv})$$, yet it achieves the same optimality.

Let $$\mathcal{S}^{+}(\mathbb{R},p)$$ denote the class of positive-definite symmetric matrices with elements in $$\mathbb{R}$$. Theorem 1 relies on the following conditions on the pilot estimator and the sparsity of $$\Sigma^{*}$$.

Condition 1. The pilot estimator $$\tilde{\Sigma}=(\tilde{\sigma}_{uv})$$ satisfies (1).

Condition 2. The matrix $$\Sigma^{*}=(\sigma_{uv}^{*})$$ belongs to the class
\begin{equation*} \mathcal{U}_q=\mathcal{U}_q\{s_0(p)\}=\left\{\Sigma=(\sigma_{uv}):\Sigma\in \mathcal{S}^{+}(\mathbb{R},p),\:\max_u\sum_{v=1}^p(\sigma_{uu}\sigma_{vv})^{(1-q)/2}|\sigma_{uv}|^q\leq s_0(p)\right\}\!\text{.} \end{equation*}

The class of weakly sparse matrices $$\mathcal{U}_q$$ was introduced by Cai & Liu (2011). The columns of a covariance matrix in $$\mathcal{U}_q$$ are required to lie in a weighted $$\ell_q$$-ball, with weights determined by the variances of the entries of the population covariance matrix.

Theorem 1. Suppose that Conditions 1 and 2 hold, $$\log p=o(n)$$, and $$\min_{u}\sigma_{uu}^{*}=\gamma > 0$$. There exists a positive constant $$C_0$$ such that
\[ \underset{\Sigma^{*}\in\mathcal{U}_q}{\inf}{\rm pr}\biggl\{\|\tilde{\Sigma}^{\mathcal{T}}-\Sigma^*\|_2\leq C_0s_0(p)\left(\frac{\log p}{n}\right)^{(1-q)/2}\biggr\}\geq 1-\epsilon_{n,p}, \]
where $$\epsilon_{n,p}$$ is a deterministic sequence that decreases to zero as $$n,p \to \infty$$.

The constant $$C_0$$ in Theorem 1 depends on $$q$$, $$\lambda$$ and the unknown distribution of $$X$$, as do the constants appearing in Theorem 2 and Propositions 2–4. Theorem 1 generalizes Theorem 1 of Cai & Liu (2011): the rate it attains matches their minimax lower bound, implying that our procedure is minimax optimal over a wider class of distributions containing the sub-Gaussian distributions. Cai & Liu (2011) also give convergence rates under bounded moments in all components of $$X$$. In that case, a much more stringent scaling condition on $$n$$ and $$p$$ is required, as shown in Theorem 1(ii) of Cai & Liu (2011). Their result does not cover the high-dimensional case where $$p>n$$ when fewer than $$8+\epsilon$$ finite moments exist; if a larger number of finite moments is assumed, $$p$$ is allowed to increase polynomially with $$n$$. By contrast, we allow $$\log p = o(n)$$. For the three pilot estimators given in § 4, even universal thresholding achieves the same minimax optimal rate under the bounded fourth moments assumed there; however, adaptive thresholding as formulated in (3) results in better numerical performance.
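To make the construction concrete, the following minimal Python sketch applies the entry-dependent threshold (3) to a generic pilot estimate. The array name sigma_tilde, the tuning constant lam and the treatment of the diagonal, which is left unthresholded, are illustrative choices of this sketch rather than part of the formal procedure.

```python
import numpy as np

def adaptive_threshold(sigma_tilde, n, lam=2.0, rule="hard"):
    """Apply the entry-dependent thresholds (3) to a pilot covariance estimate.

    sigma_tilde : (p, p) symmetric pilot estimate, e.g., one of the robust
                  pilot estimators of Section 4.
    n           : sample size used to form sigma_tilde.
    lam         : tuning constant lambda > 0, in practice chosen by
                  cross-validation.
    """
    p = sigma_tilde.shape[0]
    d = np.diag(sigma_tilde)                        # pilot variances sigma_uu
    # lambda_uv = lam * sqrt(sigma_uu * sigma_vv * log(p) / n), as in (3)
    thresh = lam * np.sqrt(np.outer(d, d) * np.log(p) / n)
    out = sigma_tilde.copy()
    if rule == "hard":
        out[np.abs(out) <= thresh] = 0.0            # hard thresholding rule
    else:
        # soft thresholding rule sgn(z)(|z| - lambda)_+
        out = np.sign(out) * np.maximum(np.abs(out) - thresh, 0.0)
    np.fill_diagonal(out, d)   # keep the diagonal unthresholded (sketch choice)
    return out
```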
Unfortunately, $$\tilde{\Sigma}^{\mathcal{T}}$$ may not be positive semidefinite, but it can be projected onto the cone of positive-semidefinite matrices through the convex optimization
\begin{equation} \label{eq:L2-projection} \tilde{\Sigma}^{\mathcal{T}}_{+} = {\rm arg\,min}_{\Sigma \succeq 0} \|\Sigma-\tilde{\Sigma}^{\mathcal{T}}\|_2\text{.} \end{equation} (4)
Since $$\Sigma^{*}$$ is itself positive semidefinite and hence feasible in (4), $$\|\tilde{\Sigma}^{\mathcal{T}}_+-\tilde{\Sigma}^{\mathcal{T}}\|_2 \leq\|{\Sigma}^{\ast}-\tilde{\Sigma}^{\mathcal{T}}\|_2$$, so the triangle inequality yields
\begin{equation*} \| \tilde{\Sigma}^{\mathcal{T}}_{+} - \Sigma^{*}\|_2 \leq \|\tilde{\Sigma}^{\mathcal{T}}_{+} - \tilde{\Sigma}^{\mathcal{T}}\|_2 + \| \tilde{\Sigma}^{\mathcal{T}} - \Sigma^{*}\|_2 \leq 2 \, \|\tilde{\Sigma}^{\mathcal{T}} - \Sigma^{*}\|_2\text{.} \end{equation*}
Hence the price to pay for the projection is no more than a factor of two, which does not affect the convergence rate. The projection can be computed by semidefinite programming (Boyd & Vandenberghe, 2004).

3. Broadening the scope of the adaptively constrained $$\ell_1$$-minimization estimator

We consider a robust modification of the adaptively constrained $$\ell_1$$-minimization estimator of Cai et al. (2016). In a spirit similar to § 2, our robust modification relies on the existence of a pilot estimator $$\tilde{\Sigma}$$ satisfying (2). Construction of the robust adaptive constrained $$\ell_1$$-minimizer relies on a preliminary projection, resulting in the positive-definite estimator
\begin{equation}\label{projection} \tilde{\Sigma}^{+}=\mathop {\arg \min }\limits_{\Sigma\succeq \epsilon I_p} \|\Sigma-\tilde{\Sigma}\|_{\max} \end{equation} (5)
for an arbitrarily small positive number $$\epsilon$$. The minimization problem in (5) can be rewritten as minimizing $$\vartheta$$ subject to $$|(\tilde{\Sigma} - \Sigma)_{uv}| \leq \vartheta$$ for all $$1\leq u,v \leq p$$ and $$\Sigma \succeq\epsilon I_p$$; this can be solved in Matlab with the cvx solver (Grant & Boyd, 2014).

Given $$\tilde{\Sigma}^{+}=(\tilde{\sigma}_{uv}^{+})$$, our estimator of $$\Omega^{*}$$ is constructed by replacing $$(\hat{\Sigma}+n^{-1}I_p)$$ with $$\tilde{\Sigma}^{+}$$ in the original constrained $$\ell_1$$-minimization procedure. For ease of reference, the steps are reproduced below; a linear-programming sketch of the column-by-column solves follows the first-stage definition. Define the first-stage estimator $$\check{\Omega}^{(1)}$$ of $$\Omega^{*}$$ through the vectors
\begin{equation} \check{{\omega}}_{v}^{\dagger}=\mathop {\arg \min }\limits_{{\omega}_v\in \mathbb{R}^p}\Big\{\|{\omega}_v\|_1:\|\tilde{\Sigma}^{+}{\omega}_v-{e}_v\|_\infty\leq \delta_{n,p} \max_v( \tilde{\sigma}^{+}_{vv})\,\omega_{vv}, \; \omega_{vv}>0\Big\}, \quad v\in[p], \end{equation} (6)
with $${\omega}_v=(\omega_{1v},\ldots,\omega_{pv})^{\mathrm{T}}$$, $$\delta_{n,p}=\delta\{(\log p)/n\}^{1/2}$$ for $$\delta>0$$, and $${e}_v$$ the vector with value $$1$$ in the $$v$$th coordinate and zeros elsewhere. More specifically, define $$\check{{\omega}}^{(1)}_{v}$$ as an adjustment of $$\check{{\omega}}_{v}^{\dagger}$$ whose $$v$$th entry is
\begin{equation}\label{eqACLIME2} \check{\omega}^{(1)}_{vv}=\check{\omega}_{vv}^{\dagger}\,\mathbb{1}\Bigl\{\tilde{\sigma}_{vv}^{+}\leq (n/\log p)^{1/2}\Bigr\}+\{(\log p)/n\}^{1/2}\,\mathbb{1}\Bigl\{\tilde{\sigma}_{vv}^{+}> (n/\log p)^{1/2}\Bigr\}, \end{equation} (7)
and define the first-stage estimator as $$\check{\Omega}^{(1)}=(\check{{\omega}}_{1}^{(1)}, \ldots, \check{{\omega}}_{p}^{(1)})$$.
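For intuition on the computations in (6) and in the second-stage problem (8) below, note that each column problem is a linear program once $$\omega_v$$ is split into its positive and negative parts. The following minimal scipy sketch, with illustrative names, solves the generic form $$\min \|\omega\|_1$$ subject to $$|(A\omega - e_v)_u|\leq r_u$$ for all $$u$$, which is the form of (8) with $$r_u = \lambda_{n,p}(\tilde{\sigma}^{+}_{uu}\check{\omega}^{(1)}_{vv})^{1/2}$$; the first-stage constraint in (6), whose bound is proportional to $$\omega_{vv}$$, is also linear in $$\omega_v$$ and can be handled analogously.

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(A, v, r):
    """Solve min ||w||_1 subject to |(A w - e_v)_u| <= r_u for all u.

    A : (p, p) pilot matrix (e.g., the projected Sigma_tilde_plus).
    v : column index in {0, ..., p-1}.
    r : (p,) vector of nonnegative constraint bounds.
    Writing w = w_pos - w_neg with w_pos, w_neg >= 0 gives
    ||w||_1 = sum(w_pos + w_neg), so the problem is a linear program.
    """
    p = A.shape[0]
    e = np.zeros(p)
    e[v] = 1.0
    c = np.ones(2 * p)              # objective: sum of w_pos and w_neg
    B = np.hstack([A, -A])          # A w expressed in the split variables
    # Two-sided constraint:  A w - e <= r  and  -(A w - e) <= r
    A_ub = np.vstack([B, -B])
    b_ub = np.concatenate([r + e, r - e])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]    # recover w = w_pos - w_neg
```

One such linear program is solved for each column $$v\in[p]$$.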
A second-stage adaptive estimator $$\check{\Omega}^{(2)}=(\check{{\omega}}_{1}^{(2)},\ldots, \check{{\omega}}_{p}^{(2)})$$ is defined by solving, for each column,
\begin{equation} \check{{\omega}}_{v}^{(2)}=\mathop {\arg \min }\limits_{{\omega}_v\in \mathbb{R}^p} \Big\{\|{\omega}_v\|_1: \bigl|\left(\tilde{\Sigma}^{+}{\omega}_v-{e}_v\right)_{u}\bigr|\leq \lambda_{n,p}(\tilde{\sigma}^{+}_{uu} \check{\omega}^{(1)}_{vv})^{1/2} \; (u =1,\ldots,p)\Big\}, \end{equation} (8)
where $$\lambda_{n,p}=\lambda\{(\log p)/n\}^{1/2}$$ for $$\lambda>0$$. In practice, the optimal values of $$\delta$$ and $$\lambda$$ are chosen by cross-validation. The final estimator, $$\tilde{\Omega}$$, of $$\Omega^{*}$$ is a symmetrized version of $$\check{\Omega}^{(2)}$$ constructed as
\begin{equation}\label{eqACLIME4} \tilde{\Omega} = (\tilde{\omega}_{uv}),\quad \tilde{\omega}_{uv}=\tilde{\omega}_{vu} = \check{\omega}_{uv}^{(2)}\,\mathbb{1}(| \check{\omega}_{uv}^{(2)}|\leq | \check{\omega}_{vu}^{(2)}|) + \check{\omega}_{vu}^{(2)}\,\mathbb{1}(| \check{\omega}_{uv}^{(2)}| > | \check{\omega}_{vu}^{(2)}|)\text{.} \end{equation} (9)

The theoretical properties of $$\tilde{\Omega}$$ are derived under Conditions 3 and 4.

Condition 3. The pilot estimator $$\tilde{\Sigma}=(\tilde{\sigma}_{uv})$$ satisfies (2).

Condition 4. The matrix $$\Omega^{*}=(\omega_{uv}^{*})$$ belongs to the class
\begin{equation*} \begin{split} \mathcal{G}_q=\mathcal{G}_q(c_{n,p},M_{n,p}) =\bigg\{\Omega \in \mathcal{S}^{+}(\mathbb{R},p): & \; \max_v\sum_{u=1}^p|\omega_{uv}|^q\leq c_{n,p},\; \|\Omega\|_{1}\leq M_{n,p}, \\ & \; \frac{1}{M_1}\leq{\lambda_{\min}(\Omega)}\leq\lambda_{\max}(\Omega)\leq M_1\bigg\}, \end{split} \end{equation*}
where $$0\leq q\leq 1$$, $$M_1>0$$ is a constant, and $$M_{n,p}$$ and $$c_{n,p}$$ are positive deterministic sequences that are bounded away from zero and allowed to diverge as $$n$$ and $$p$$ grow.

In this class of precision matrices, sparsity is imposed by restricting the columns of $$\Omega^{*}$$ to lie in an $$\ell_q$$-ball of radius $$c_{n,p}$$ ($$0\leq q<1$$).

Theorem 2. Suppose that Conditions 1, 3 and 4 are satisfied with $$c_{n,p}=o(n/\log p)$$. Under the scaling condition $$\log p = O(n^{1/2})$$ we have, for a positive constant $$C_0$$,
\[ \underset{\Omega^{*}\in{\mathcal{G}_q}}{\inf}{\rm pr}\bigg\{\|\tilde{\Omega}-\Omega^{*}\|_2\leq C_0 M_{n,p}^{1-q}c_{n,p}\left(\frac{\log p}{n}\right)^{(1-q)/2}\bigg\} \geq 1-\epsilon_{n,p}, \]
where $$\epsilon_{n,p}$$ is a deterministic sequence that decreases to zero as $$n, p\rightarrow \infty$$ and $$\tilde{\Omega}$$ is the robust adaptively constrained $$\ell_1$$-minimization estimator described in (6)–(9).

Remark 1. Our class of precision matrices is slightly more restrictive than that considered in Cai et al. (2016), since we require $$1/M_1\leq\lambda_{\min}(\Omega^{*})\leq\lambda_{\max}(\Omega^{*})\leq M_1$$ instead of $$\lambda_{\max}(\Omega^{*})/\lambda_{\min}(\Omega^{*})\leq M_1$$. The difference is marginal since $$\sigma^{*}_{uu}={e}_u^{\mathrm{T}}\Sigma^{*}{e}_u=\|\Sigma^{*1/2}{e}_u\|_2^2\leq\lambda_{\max}(\Sigma^{*})=1/\lambda_{\min}(\Omega^{*})$$ and $$\lambda_{\max}(\Omega^{*})/\lambda_{\min}(\Omega^{*})\leq M_1$$ implies that $$0<M_1^{-1}\lambda_{\max}(\Omega^{*})\leq\lambda_{\min}(\Omega^{*})\leq \lambda_{\max}(\Omega^{*})\leq M_1\lambda_{\min}(\Omega^{*})<\infty$$.
We therefore only exclude precision matrices associated with either exploding or imploding covariance matrices, i.e., we exclude $$\sigma^{*}_{uu}\to 0$$ and $$\sigma^{*}_{uu}\to\infty$$ for all $$u\in[p]$$. Ren et al. (2015) also require $$1/M_1\leq\lambda_{\min}(\Omega^{*})\leq\lambda_{\max}(\Omega^{*})\leq M_1$$.

A positive-semidefinite estimator with the same convergence rate as $$\tilde{\Omega}$$ can be constructed by projecting the symmetric matrix $$\tilde{\Omega}$$ onto the cone of positive-semidefinite matrices, as in (4).

Next, we present three pilot estimators whose performance compares favourably with that of the sample covariance matrix when the sub-Gaussianity assumption is violated, and we verify Conditions 1 and 3 for these estimators. Condition 1 will be verified for all three pilot estimators. When $$\|\Omega^*\|_1$$ is bounded, Condition 1 implies Condition 3 because $$\|\tilde{\Sigma}\Omega^*-I_p\|_{\max}=\|(\tilde{\Sigma}-\Sigma^*)\Omega^*\|_{\max}\leq \|\Omega^*\|_{1}\|\tilde{\Sigma}-\Sigma^*\|_{\max}$$. When $$\|\Omega^*\|_1 \to \infty$$, Condition 3 is verified for the adaptive Huber estimator. We emphasize that Condition 3 is needed only if the goal is a minimax optimal estimator of $$\Omega^{*}$$; a consistent estimator is still attainable when only Condition 1 holds and $$\|\Omega^{\ast}\|_1\to \infty$$. A more thorough discussion appears in the Supplementary Material.

4. Robust pilot estimators

4.1. A rank-based estimator

The rank-based estimator requires only the existence of second moments. However, it makes arguably more restrictive distributional assumptions, as it requires the distribution of $${X}$$ to be elliptically symmetric.

Definition 1. A random vector $${Z} = (Z_1,\ldots,Z_p)^{\mathrm{T}}$$ follows an elliptically symmetric distribution if and only if $${Z}= \mu+\xi A U$$, where $$\mu\in \mathbb{R}^p$$, $$A\in \mathbb{R}^{p\times q}$$ with $$q = \mathrm{rank}(A)$$, $$U$$ is uniformly distributed on the unit sphere in $$\mathbb{R}^q$$, and $$\xi$$ is a positive random variable independent of $$U$$.

Observe that $$\Sigma^{*}=D^{*}R^{*}D^{*}$$, where $$R^{*}=(r^{*}_{uv})$$ denotes the correlation matrix and $$D^{*}=\text{diag}\{(\sigma^{*}_{11})^{1/2},\ldots,(\sigma^{*}_{pp})^{1/2}\}$$. Liu et al. (2012) and Xue & Zou (2012) both proposed rank-based estimation of $$R^{*}$$, exploiting a bijective mapping, valid for elliptical distributions, between the Pearson correlation and Kendall's tau or Spearman's rho dependence measures. More specifically, Kendall's tau concordance between $$X_{u}$$ and $$X_{v}$$ is defined as
\[ \tau^{*}_{uv} = \text{pr}\bigl\{(X_{u} - Y_{u})(X_{v} - Y_{v})>0\bigr\} - \text{pr}\bigl\{(X_{u} - Y_{u})(X_{v} - Y_{v})<0\bigr\}, \]
where $${Y}$$ is an independent copy of $${X}$$. With $${X}_{i}=(X_{i,1},\ldots, X_{i,p})^{\mathrm{T}}$$, the empirical analogue of $$\tau^{*}_{uv}$$ is
\[ \tilde{\tau}_{uv} = {n \choose 2}^{-1}\sum_i\sum_{j<i} \Bigl[\mathbb{1}\bigl\{(X_{iu} - X_{ju})(X_{iv} - X_{jv})>0\bigr\} - \mathbb{1}\bigl\{(X_{iu} - X_{ju})(X_{iv} - X_{jv})<0\bigr\}\Bigr]\text{.} \]
Since $$\tau^{*}_{uv} = 2\pi^{-1}\arcsin(r^{*}_{uv})$$, an estimator of $$R^{*}$$ is $$\tilde{R}=(\tilde{r}_{uv})=\{\sin(\pi\tilde{\tau}_{uv}/2)\}$$. An analogous bijection exists between Spearman's rho and the Pearson correlation; see Xue & Zou (2012) for details.
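A minimal sketch of the rank-based correlation step, assuming scipy's kendalltau for the pairwise empirical concordances; the diagonal scaling $$\tilde{D}$$ constructed in the next paragraph completes the estimator $$\tilde{\Sigma}_R=\tilde{D}\tilde{R}\tilde{D}$$.

```python
import numpy as np
from scipy.stats import kendalltau

def rank_correlation(X):
    """Rank-based estimate of the correlation matrix R*.

    X : (n, p) data matrix. Uses the elliptical-family identity
    tau*_uv = (2/pi) * arcsin(r*_uv), inverted entrywise as
    r_tilde_uv = sin(pi * tau_tilde_uv / 2).
    """
    n, p = X.shape
    R = np.eye(p)
    for u in range(p):
        for v in range(u + 1, p):
            tau, _ = kendalltau(X[:, u], X[:, v])  # empirical Kendall's tau
            R[u, v] = R[v, u] = np.sin(0.5 * np.pi * tau)
    return R
```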
We propose to estimate the elements of the diagonal matrix $$D^{*}$$ using a median absolute deviation estimator, $$\tilde{D}=\text{diag}\{(\tilde{\sigma}_{11})^{1/2}, \ldots, (\tilde{\sigma}_{pp})^{1/2}\}$$, where $$\tilde{\sigma}_{uu} = C_u \,\text{med}_{i\in[n]}\{|X_{iu} - \text{med}_{j\in[n]}(X_{ju})|\}$$. Here $$\text{med}_{i\in [n]}(\cdot)$$ denotes the median over the index set $$[n]$$ and $$C_u=F_{u}^{-1}(3/4)$$ is the Fisher consistency constant, where $$F_{u}$$ is the distribution function of $$X_{1u}-\nu_u$$ and $$\nu_u$$ is the median of $$X_{1u}$$. Finally, the rank-based estimator is defined as $$\tilde{\Sigma}_R=(\tilde{\sigma}_{uv}^R)=\tilde{D}\tilde{R}\tilde{D}$$.

Proposition 2. Let $${X}_1,\ldots,{X}_n$$ be independent and identically distributed copies of the elliptically symmetric random vector $${X}$$ with covariance matrix $${\Sigma^{*}=D^{*}R^{*}D^{*}}$$. Assume that $${\max_{u}\sigma_{uu}^{*} = \varsigma <\infty}$$ and $${\min_{u}\sigma_{uu}^{*}> C_{2}\{(\log p)/n\}^{1/2} + 1/U^{4}}$$, where $$U<C_{1}\pi\surd 2/(4\varsigma)$$ for $$C_{1}>0$$ and $$C_{2}>0$$. Then
\[ {\rm pr}\Bigl[\max_{u,v}|\tilde{\sigma}^{R}_{uv} - \sigma^{*}_{uv}| \leq c \{(\log p)/n\}^{1/2}\Bigr] \geq 1- \epsilon_{n,p}, \]
with $$\epsilon_{n,p}\leq C_{0}p^{-L}$$ for positive constants $$c$$, $$C_{0}$$ and $$L$$.

In estimating the marginal variances, we use median absolute deviation estimators to avoid higher moment assumptions. This assumes knowledge of $$F_u^{-1}(3/4)$$; without it, the marginal variances can instead be estimated by the adaptive Huber estimator or the median of means estimator given in the next two subsections, at the cost of requiring the existence of fourth moments; see Propositions 3 and 5.

4.2. An adaptive Huber estimator

The Huber-type M-estimator requires only the existence of fourth moments. Let $$Y_{uv}=(X_{u} - \mu^{*}_{u})(X_{v} - \mu^{*}_{v})$$. Then $$\sigma_{uv}^{*}=E(Y_{uv})=\mu^{*}_{uv} - \mu^{*}_{u}\mu^{*}_{v}$$, where $$\mu_{u}^{*}=E(X_{u})$$, $$\mu^{*}_{v}=E(X_{v})$$ and $$\mu^{*}_{uv}=E(X_{u}X_{v})$$. We propose to estimate $$\sigma_{uv}^{*}$$ robustly through robust estimators of $$\mu^{*}_{u}$$, $$\mu^{*}_{v}$$ and $$\mu^{*}_{uv}$$. For independent and identically distributed copies $$Z_1,\ldots,Z_n$$ of a real random variable $$Z$$ with mean $$\mu$$, Huber's (1964) M-estimator of $$\mu$$ is defined as the solution to
\begin{align}\label{eqHuber} \sum_{i=1}^n\psi_H(Z_i-\mu)=0, \end{align} (10)
where $$\psi_H(z)=\min\{H,\max(-H,z)\}$$ is the Huber function. Replacing $$Z_{i}$$ in (10) by $$X_{i,u}$$, $$X_{i,v}$$ and $$X_{i,u}X_{i,v}$$ gives the Huber estimators $$\tilde{\mu}_{u}^{H}$$, $$\tilde{\mu}_{v}^{H}$$ and $$\tilde{\mu}_{uv}^{H}$$ of $$\mu_{u}^{*}$$, $$\mu_{v}^{*}$$ and $$\mu_{uv}^{*}$$, respectively, from which the Huber-type estimator of $$\Sigma^{*}$$ is defined as $$\tilde{\Sigma}_{H}=(\tilde{\sigma}^{H}_{uv})=(\tilde{\mu}_{uv}^{H}- \tilde{\mu}_{u}^{H}\tilde{\mu}_{v}^{H})$$.

We depart from Huber (1964) by allowing $$H$$ to grow to infinity as $$n$$ increases, as our objectives differ from those of Huber (1964) and of classical robust statistics (Huber & Ronchetti, 2009). There, the distribution generating the data is assumed to be a contaminated version of a given parametric model, where the contamination level is small, and the objective is to estimate features of the parametric model as if no contamination were present. Our goal is instead to estimate the mean of the underlying distribution, allowing departures from sub-Gaussianity.
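The following sketch computes the Huber-type pilot estimate by solving (10) with a one-dimensional root search. The bracketing endpoints and the default constant K are illustrative choices of this sketch; the rate $$H=K(n/\log p)^{1/2}$$ is taken from Proposition 3 below, which requires $$K\geq 4\kappa(2+L)^{1/2}$$.

```python
import numpy as np
from scipy.optimize import brentq

def huber_mean(z, H):
    """Solve sum_i psi_H(z_i - mu) = 0 in mu, psi_H(x) = min(H, max(-H, x)).

    The left-hand side is continuous and nonincreasing in mu; it is positive
    at mu = min(z) - 1 and negative at mu = max(z) + 1, so a root is bracketed.
    """
    f = lambda mu: np.mean(np.clip(z - mu, -H, H))
    return brentq(f, np.min(z) - 1.0, np.max(z) + 1.0)

def huber_covariance(X, K=2.0):
    """Huber-type pilot estimate sigma_uv = mu_uv - mu_u * mu_v, each mean
    estimated by (10) with diverging truncation H = K * sqrt(n / log p).
    """
    n, p = X.shape
    H = K * np.sqrt(n / np.log(p))
    mu = np.array([huber_mean(X[:, u], H) for u in range(p)])
    S = np.empty((p, p))
    for u in range(p):
        for v in range(u, p):
            muv = huber_mean(X[:, u] * X[:, v], H)   # estimate of E(X_u X_v)
            S[u, v] = S[v, u] = muv - mu[u] * mu[v]
    return S
```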
In related work, Fan et al. (2017) showed that when $$H$$ diverges at an appropriate rate, the Huber estimator of the mean concentrates exponentially fast around the true mean when only a finite second moment exists. In a similar spirit, we allow $$H$$ to grow with $$n$$ in order to alleviate bias: an appropriate choice of $$H$$ trades off bias against robustness. We build on Fan et al. (2017) and Catoni (2012), showing that our proposed Huber-type estimator satisfies Conditions 1 and 3.

Proposition 3. Assume $$\max_{1\leq u\leq p}E(X_u^4)=\kappa^2<\infty$$. Let $$\tilde{\Sigma}_{H}=(\tilde{\sigma}^{H}_{uv})$$ be the Huber-type estimator with $$H=K(n/\log p)^{1/2}$$ for $$K\geq 4\kappa(2+L)^{1/2}$$ and $$L>0$$ satisfying $$(2 + L)(\log p)/n < 1/8$$. Under the scaling condition $$(\log p)/n \rightarrow 0$$ we have, for large $$n$$ and a constant $$C>\kappa(1+ 2\max_{u}|\mu^{*}_{u}|)$$,
\begin{align*} {\rm pr}\Big[\underset{u,v}{\max}\: |\tilde{\sigma}_{uv}^H-\sigma_{uv}^{*}|\leq C\{(\log p)/n\}^{1/2}\Big]\geq 1 - \epsilon_{n,p}, \end{align*}
where $$\epsilon_{n,p}\leq C_{0}p^{-L}$$ for positive constants $$C_{0}$$ and $$L$$.

Proposition 3 verifies Condition 1 for $$\tilde{\Sigma}_{H}$$, provided $$H$$ diverges at the appropriate rate. As quantified in Proposition 4, $$\tilde{\Sigma}_{H}$$ also satisfies Condition 3 when $$H$$ grows at the same rate as in Proposition 3. The proof of this result entails extending a large deviation result of Petrov (1995).

Proposition 4. Assume that $$\max_{1\leq u\leq p}E(X_u^4)=\kappa^2<\infty$$ and $$\Omega^{*} \in \mathcal{G}_{q}(c_{n,p},M_{n,p})$$ with $$c_{n,p}=O\{n^{(1-q)/{2}}/(\log p)^{{(3-q)}/{2}} \}$$. Let $$\tilde{\Sigma}_{H}=(\tilde{\sigma}^{H}_{uv})$$ be the Huber-type estimator defined below (10) with $$H=K(n/\log p)^{1/2}$$ for $$K\geq 4\kappa(2+L)^{1/2}$$ and $$L>0$$. Assume that the truncated population covariance matrix $$\Sigma_H$$, with $$(u,v)$$th entry $$E\{\mathbb{1}(|X_{u}X_{v}|\leq H)X_uX_v\}$$, satisfies $$\|\Sigma_H\Omega^*-I_p\|_{\max}=O[\{(\log p)/n\}^{1/2}]$$. Under the scaling condition $$(\log p)/n^{1/3} =O(1)$$ we have, for large $$n$$ and a constant $$C>\kappa(1+ 2\max_{u}|\mu^{*}_{u}|)$$,
\[ {\rm pr}\Bigl[\max_{u,v}\:\bigl|(\tilde{\Sigma}_{H}\Omega^{*} - I_p)_{uv}\bigr|\leq C\{(\log p)/n\}^{1/2}\Bigr] \geq 1 - \epsilon_{n,p}, \]
where $$\epsilon_{n,p}\leq C_0(\log p)^{-1/2}p^{-L}$$ for positive constants $$C_0$$ and $$L$$.

4.3. A median of means estimator

The median of means estimator was proposed by Nemirovsky & Yudin (1983) and has been further studied by Lerasle & Oliveira (2011), Bubeck et al. (2013) and Joly & Lugosi (2016). It is defined as the median of $$M$$ means obtained by partitioning the data into $$M$$ subsamples. A heuristic explanation for its success is that taking means within subsamples yields a more symmetric sample, while taking the median makes the resulting estimator concentrate faster. Our median of means estimator of $$\Sigma^{*}$$ is constructed as $$\tilde{\Sigma}_{M}=(\tilde{\sigma}_{uv}^{M})=(\tilde{\mu}_{uv}^{M} - \tilde{\mu}_{u}^{M}\tilde{\mu}_{v}^{M})$$, where $$\tilde{\mu}_{uv}^{M}$$, $$\tilde{\mu}_{u}^{M}$$ and $$\tilde{\mu}_{v}^{M}$$ are median of means estimators of $$\mu^{*}_{uv}$$, $$\mu^{*}_{u}$$ and $$\mu^{*}_{v}$$ computed from the samples $$(X_{iu}X_{iv})_{i=1}^{n}$$, $$(X_{iu})_{i=1}^{n}$$ and $$(X_{iv})_{i=1}^{n}$$, respectively; in each case, each of the $$M$$ means is computed on one block of a regular partition $$B_{1},\ldots, B_{M}$$ of $$[n]$$. It is assumed that $$M$$ is a factor of $$n$$.
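A minimal sketch of the median of means pilot estimator follows. The choice $$M=\lceil(2+L)\log p\rceil$$ anticipates Proposition 5 below; dropping the $$n \bmod M$$ trailing observations, rather than insisting that $$M$$ divide $$n$$, is a simplification of this sketch.

```python
import numpy as np

def median_of_means(z, M):
    """Median of M block means over a regular partition of the sample.

    z : (n,) sample; M : number of blocks. For simplicity this sketch
    drops the n mod M trailing observations.
    """
    n = len(z)
    m = n // M                           # block size
    blocks = z[: m * M].reshape(M, m)    # regular partition B_1, ..., B_M
    return np.median(blocks.mean(axis=1))

def mom_covariance(X, L=1.0):
    """Median-of-means pilot estimate sigma_uv = mu_uv - mu_u * mu_v."""
    n, p = X.shape
    M = int(np.ceil((2 + L) * np.log(p)))
    mu = np.array([median_of_means(X[:, u], M) for u in range(p)])
    S = np.empty((p, p))
    for u in range(p):
        for v in range(u, p):
            S[u, v] = S[v, u] = (median_of_means(X[:, u] * X[:, v], M)
                                 - mu[u] * mu[v])
    return S
```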
The value of $$M$$ is a tuning parameter that affects the accuracy of the median of means estimator, and its choice involves a compromise between bias and variance. In the extreme cases $$M=n$$ and $$M=1$$ we obtain, respectively, the sample median and the sample mean: the latter is asymptotically unbiased but does not concentrate exponentially fast in the presence of heavy tails, while the former concentrates exponentially fast but not around the population mean under asymmetric distributions. Proposition 5 gives the range of $$M$$ for which both goals are achieved simultaneously.

Proposition 5. Assume $$\max_{1\leq u\leq p}E(X_u^4)=\kappa^2<\infty$$. Let $$\tilde{\Sigma}_{M}=(\tilde{\sigma}^{M}_{uv})$$ be the median of means estimator described above based on a regular partition $$B_1,\ldots,B_M$$ with $$M=\lceil(2+L)\log p\rceil$$ for a positive constant $$L$$. Under the scaling condition $$(\log p)/n \rightarrow 0$$ we have, for large $$n$$ and a constant $$C>2(6{\rm e})^{1/2}\{\kappa+2\max_{u,v}\mu^{*}_u(\sigma_{vv}^{*})^{1/2}\}$$,
\[ {\rm pr}\Big[\max_{u,v}|\tilde{\sigma}_{uv}^M-\sigma_{uv}^{*}|\leq C\{(\log p)/n\}^{1/2}\Big]\geq 1 - \epsilon_{n,p}, \]
where $$\epsilon_{n,p} \leq C_{0} p^{-L}$$ for positive constants $$C_{0}$$ and $$L$$.

5. Infinite kurtosis

In the previous sections we assumed the existence of fourth moments of $${X}$$ for the Huber-type estimator of § 4. We now relax the boundedness of $$E(X_u^4)$$ to that of $$E(|X_u|^{2+\varepsilon})$$ for some $$\varepsilon>0$$ and all $$u\in[p]$$. The following proposition lays the foundations for the analysis of high-dimensional covariance or precision matrix estimation with infinite kurtosis. It extends Theorem 5 of Fan et al. (2017) and gives rates of convergence for Huber's estimator of $$E(X_u)$$ assuming a bounded $$1+\varepsilon$$ moment for $$\varepsilon\in(0,1]$$. The result is optimal in the sense that our rates match the minimax lower bound given in Theorem 3.1 of Devroye et al. (2016). The rates depend on $$\varepsilon$$, and when $$\varepsilon=1$$ they match those of Catoni (2012) and Fan et al. (2017).

Proposition 6. Let $$\delta\in(0,1)$$, $$\varepsilon\in(0,1]$$ and $$n>12\log(2\delta^{-1})$$, and let $$Z_1,\ldots,Z_n$$ be independent and identically distributed random variables with mean $$\mu$$ and bounded $$1+\varepsilon$$ moment, i.e., $$E(|Z_1-\mu|^{1+\varepsilon})=v<\infty$$. Take $${H=\{vn/\log(2\delta^{-1})\}^{1/(1+\varepsilon)}}$$. Then, with probability at least $$1-\delta$$,
\begin{align*} \tilde{\mu}^H-\mu\leq \frac{7+\surd 2}{2}\,v^{1/(1+\varepsilon)}\left\{\frac{\log(2\delta^{-1})}{n}\right\}^{\varepsilon/(1+\varepsilon)}\!, \end{align*}
where $$\tilde{\mu}^H$$ is as defined in § 4.2.

Corollary 1. Under the conditions of Proposition 6, the Huber estimator satisfies
\[ {\rm pr}\!\left[|\tilde{\mu}^H-\mu|\leq \frac{7+\surd 2}{2}v^{1/(1+\varepsilon)}\left\{\frac{\log(2\delta^{-1})}{n}\right\}^{\varepsilon/(1+\varepsilon)}\right]\geq 1-2\delta\text{.} \]

Corollary 1 allows us to generalize the upper bounds for the Huber-type estimator. The following two theorems establish rates of convergence for the adaptive thresholding and adaptively constrained $$\ell_1$$-minimization estimators. While we do not prove that these rates are minimax optimal under $$2+\varepsilon$$ finite moments, the proof builds on the elementwise maximum norm convergence of the pilot estimator, which is optimal by Theorem 3.1 of Devroye et al. (2016), and the resulting rates for adaptive thresholding match the minimax rates of Cai & Liu (2011) when $$\varepsilon=2$$. This is a strong indication that the rates are sharp.
Theorem 3. Suppose that Condition 2 is satisfied and assume $$\max_{1\leq u\leq p}E(|X_u|^{2+\varepsilon})\leq \kappa_\varepsilon^2$$. Let $$\hat{\Sigma}^{\mathcal{T}}_H$$ be the adaptive thresholding estimator defined in § 2, based on the Huber pilot estimator $$\tilde{\Sigma}_{H}$$ with $$H=K(n/\log p)^{1/(2+\varepsilon)}$$ for $$K\geq 2^{-1}(7+\surd 2)\kappa_\varepsilon(2+L)^{\varepsilon/(2+\varepsilon)}$$ and $$L>0$$. Under the scaling condition $$\log p=O(n^{1/2})$$ and choosing $$\lambda_{uv}=\lambda\{\tilde{\sigma}^H_{uu}\tilde{\sigma}_{vv}^H(\log p)/n\}^{\varepsilon/(2+\varepsilon)}$$ for some $$\lambda>0$$, we have, for sufficiently large $$n$$,
\[ \inf_{\Sigma^{*}\in\mathcal{U}_q}{\rm pr}\!\left\{\|\hat{\Sigma}_H^{\mathcal{T}}-\Sigma^{*}\|_2\leq Cs_0(p)\left(\frac{\log p}{n}\right)^{{\varepsilon(1-q)}/{(2+\varepsilon)}}\right\}\geq 1-\epsilon_{n,p}, \]
where $$\epsilon_{n,p}\leq C_{0}p^{-L}$$ for positive constants $$C_{0}$$ and $$L$$.

Theorem 4. Suppose that Condition 4 is satisfied, $$\max_{1\leq u\leq p}E(|X_u|^{2+\varepsilon})\leq \kappa_\varepsilon^2$$ and $$c_{n,p}=O\{n^{({1-q})/{2}}/(\log p)^{({3-q})/{2}} \}$$. Let $$\tilde{\Omega}_H$$ be the adaptively constrained $$\ell_1$$-minimization estimator defined in § 3, based on the Huber pilot estimator $$\tilde{\Sigma}_{H}$$ with $$H=K(n/\log p)^{1/(2+\varepsilon)}$$ for $$K\geq 2^{-1}(7+\surd 2)\kappa_\varepsilon(2+L)^{\varepsilon/(2+\varepsilon)}$$ and $$L>0$$. Assume that the truncated population covariance matrix $$\Sigma_H$$, with $$(u,v)$$th entry $$E\{\mathbb{1}(|X_{u}X_{v}|\leq H)X_uX_v\}$$, satisfies $$\|\Sigma_H\Omega^*-{I}_p\|_{\max}=O[\{(\log p)/n\}^{\varepsilon/(2+\varepsilon)}]$$. Under the scaling condition $$(\log p)/n^{1/3} = O(1)$$, we have, for sufficiently large $$n$$,
\[ \inf_{\Omega^{*}\in\mathcal{G}_q}{\rm pr}\!\left\{\|\tilde{\Omega}_H-\Omega^{*}\|_2\leq C M_{n,p}^{1-q}c_{n,p}\left(\frac{\log p}{n}\right)^{{\varepsilon(1-q)}/{(2+\varepsilon)}}\right\} \geq 1-\epsilon_{n,p}, \]
where $$\epsilon_{n,p}\leq C_{0}p^{-L}$$ for positive constants $$C_{0}$$ and $$L$$.

A result similar to Proposition 6 was obtained in Lemma 2 of Bubeck et al. (2013) for the median of means estimator. Expanding on it, we obtain a result analogous to Theorem 3 for the median of means matrix estimator.

6. Finite-sample performance

We illustrate the performance of the estimators discussed in §§ 2 and 3 under a range of data-generating scenarios and for every choice of pilot estimator discussed in § 4. For the adaptive thresholding estimator of $$\Sigma^{*}$$, we use a hard thresholding rule with the entry-dependent thresholds of (3). In each of 500 Monte Carlo replications, $$n=200$$ independent copies of a random vector $${X}$$ of dimension $$p=400$$ are drawn from a model with either a sparse covariance matrix $$\Sigma^{*}$$ or a sparse precision matrix $$\Omega^{*}$$, depending on the experiment. We consider four scenarios for the distribution of $${X}$$: the zero-mean multivariate normal distribution; the $$t$$ distribution with 3$$\cdot$$5 degrees of freedom and infinite kurtosis; the skewed $$t$$ distribution with four degrees of freedom and skew parameter equal to 20; and the contaminated skewed $$t$$ distribution (Azzalini, 2005) with four degrees of freedom and skew parameter equal to 10.
Data in the last scenario are generated as $${X}=(1-b){Z}_{1} + b{Z}_{2}$$, where $${Z}_{1}\sim P$$, $${Z}_{2}\sim Q$$ and $$b\sim\text{Bi}(1, 0{\cdot}05)$$; here $$P$$ is the skewed $$t$$ distribution generating most of the data, while $$Q$$ is a normal distribution with a mean vector of $$-8$$ and covariance matrix equal to the identity. Any unspecified tuning parameters of the adaptive thresholding estimator and the adaptively constrained $$\ell_1$$-minimization estimator are chosen by cross-validation to minimize the spectral norm error. Unspecified constants in the tuning parameters of the robust pilot estimators are conservatively chosen as those that would be optimal if the true distribution were a Student $$t$$ distribution with five degrees of freedom.

We consider the following two structures for $$\Sigma^{*}$$ and $$\Omega^{*}$$; a construction sketch is given below.

(i) Sparse covariance matrix: similar to Model 2 in the simulation section of Cai & Liu (2011), we take the true covariance model to be the block-diagonal matrix $$\Sigma^{*}=\text{blockdiag}(\Sigma^{*}_1,\Sigma^{*}_2)$$, where $$\Sigma^{\ast}_2=4{I}_{p/2 \times p/2}$$, $$\Sigma^{\ast}_1=A+\epsilon {I}_{p/2 \times p/2}$$, $$A=(a_{uv})$$ with independent $$a_{uv}=\text{Un}(0{\cdot}3,0{\cdot}8)\times\text{Bi}(1,0{\cdot}2)$$, and $$\epsilon=\max\{-\lambda_{\min}(A),0\}+0{\cdot}01$$ to ensure that $$\Sigma^{\ast}_1$$ is positive definite.

(ii) Banded precision matrix: following Cai et al. (2016), we take the true precision matrix to be of the banded form $${\Omega}_0=(\omega_{ij})$$, where $$\omega_{ii}=1$$, $$\omega_{i,i+1}=\omega_{i+1,i}=0{\cdot}6$$, $$\omega_{i,i+2}=\omega_{i+2,i}=0{\cdot}3$$ and $$\omega_{ij}=0$$ for $$|i-j|\geq 3$$.

Table 1 shows that while the sample covariance estimator performs well in the normally distributed case, thresholding it gives poor performance when the true model departs from normality, reflected in elevated estimation errors in both the maximum norm and the spectral norm. By contrast, thresholding any of our proposed robust pilot estimators does not suffer under these heavy-tailed distributions. Table 2 shows a similar pattern for the precision matrix estimators. The gains are apparent for all robust pilot estimators, as predicted by our theory.
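For reproducibility, here is a sketch of the two population models (i) and (ii); symmetrizing $$A$$ and zeroing its diagonal before the eigenvalue shift are assumptions of this sketch, as the construction in (i) does not spell them out.

```python
import numpy as np

def sparse_covariance(p, rng):
    """Model (i): Sigma* = blockdiag(Sigma1, Sigma2), Sigma2 = 4*I and
    Sigma1 = A + eps*I with a_uv ~ Un(0.3, 0.8) x Bi(1, 0.2) independently.
    rng : a numpy Generator, e.g., np.random.default_rng(0).
    """
    m = p // 2
    A = rng.uniform(0.3, 0.8, (m, m)) * rng.binomial(1, 0.2, (m, m))
    A = np.triu(A, 1)
    A = A + A.T                                    # symmetrize, zero diagonal
    eps = max(-np.linalg.eigvalsh(A).min(), 0.0) + 0.01  # positive definiteness
    S1 = A + eps * np.eye(m)
    return np.block([[S1, np.zeros((m, p - m))],
                     [np.zeros((p - m, m)), 4.0 * np.eye(p - m)]])

def banded_precision(p):
    """Model (ii): omega_ii = 1, omega_{i,i+1} = 0.6, omega_{i,i+2} = 0.3."""
    O = np.eye(p)
    i = np.arange(p - 1)
    O[i, i + 1] = O[i + 1, i] = 0.6
    i = np.arange(p - 2)
    O[i, i + 2] = O[i + 2, i] = 0.3
    return O
```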
Table 1. Estimation errors (with standard errors in parentheses) of the adaptive thresholding estimator of $$\Sigma^{\ast}$$ based on four different pilot estimators; values are averaged over 500 replications

| Distribution | Error | Sample covariance | Adaptive Huber | Median of means | Rank-based |
| --- | --- | --- | --- | --- | --- |
| MVN | $$\|\hat{\Sigma} - \Sigma^{*}\|_{2}$$ | 2·88 (0·04) | 2·86 (0·04) | 3·31 (0·05) | 3·01 (0·07) |
| MVN | $$\|\tilde{\Sigma} - \Sigma^{*}\|_{\max}$$ | 0·98 (0·09) | 0·92 (0·09) | 1·50 (0·14) | 1·61 (0·23) |
| T | $$\|\hat{\Sigma} - \Sigma^{*}\|_{2}$$ | 8·95 (0·53) | 3·92 (0·06) | 4·46 (0·24) | 5·02 (0·06) |
| T | $$\|\tilde{\Sigma} - \Sigma^{*}\|_{\max}$$ | 8·72 (0·55) | 1·87 (0·05) | 3·35 (0·74) | 2·54 (0·04) |
| ST | $$\|\hat{\Sigma} - \Sigma^{*}\|_{2}$$ | 7·12 (0·17) | 4·88 (0·05) | 4·96 (0·06) | 5·16 (0·06) |
| ST | $$\|\tilde{\Sigma} - \Sigma^{*}\|_{\max}$$ | 6·89 (0·18) | 2·41 (0·04) | 2·43 (0·04) | 2·57 (0·04) |
| CST | $$\|\hat{\Sigma} - \Sigma^{*}\|_{2}$$ | 5·47 (0·23) | 4·14 (0·06) | 4·60 (0·05) | 5·13 (0·06) |
| CST | $$\|\tilde{\Sigma} - \Sigma^{*}\|_{\max}$$ | 5·07 (0·27) | 2·02 (0·05) | 2·27 (0·05) | 2·56 (0·04) |

MVN, the normal distribution; T, the $$t$$ distribution; ST, the skewed $$t$$ distribution; CST, the contaminated skewed $$t$$ distribution.

Table 2. Estimation errors (with standard errors in parentheses) of the adaptively constrained $$\ell_1$$-minimizers of $$\Omega^{\ast}$$ based on four different pilot estimators; values are averaged over 500 replications

| Distribution | Error | Sample covariance | Adaptive Huber | Median of means | Rank-based |
| --- | --- | --- | --- | --- | --- |
| MVN | $$\|\hat{\Omega} - \Omega^{*}\|_{2}$$ | 2·62 (0·01) | 2·61 (0·01) | 2·59 (0·01) | 2·59 (0·01) |
| MVN | $$\|\tilde{\Sigma}\Omega^{*}-I_p\|_{\max}$$ | 1·05 (0·09) | 1·02 (0·09) | 1·85 (0·28) | 2·90 (0·55) |
| T | $$\|\hat{\Omega} - \Omega^{*}\|_{2}$$ | 2·54 (0·03) | 2·26 (0·02) | 2·43 (0·02) | 2·41 (0·02) |
| T | $$\|\tilde{\Sigma}\Omega^{*}-I_p\|_{\max}$$ | 2·66 (3·96) | 0·81 (0·03) | 1·02 (0·19) | 1·01 (0·19) |
| ST | $$\|\hat{\Omega} - \Omega^{*}\|_{2}$$ | 2·27 (0·15) | 1·97 (0·05) | 2·08 (0·08) | 2·12 (0·08) |
| ST | $$\|\tilde{\Sigma}\Omega^{*}-I_p\|_{\max}$$ | 1·40 (1·59) | 0·97 (0·02) | 1·05 (0·03) | 0·96 (0·02) |
| CST | $$\|\hat{\Omega} - \Omega^{*}\|_{2}$$ | 2·65 (0·02) | 2·01 (0·04) | 2·12 (0·06) | 2·10 (0·06) |
| CST | $$\|\tilde{\Sigma}\Omega^{*}-I_p\|_{\max}$$ | 9·65 (3·76) | 0·97 (0·04) | 2·16 (2·19) | 0·92 (0·03) |

MVN, the normal distribution; T, the $$t$$ distribution; ST, the skewed $$t$$ distribution; CST, the contaminated skewed $$t$$ distribution.

7. Real-data example

A gene regulatory network, also known as a pathway, is a set of genes that interact with each other to control a specific cell function. With recent advances in genomic research, many such networks have been discovered and their functions thoroughly studied. Certain pathways are now known and available in public databases such as KEGG (Ogata et al., 2000). One popular way to infer a gene regulatory network is through estimation of the precision matrix associated with gene expression (Wit & Abbruzzo, 2015).
However, such data often contain outliers. To assess whether our robust estimator can improve inference on gene regulatory networks, we use a microarray dataset and compare our findings with generally acknowledged scientific truth from the genomics literature. The microarray data come from a study by Huang et al. (2011) on the inflammation process of cardiovascular disease, which identified the toll-like receptor signalling pathway as playing a key role in the inflammation process. Their study involves $$n=48$$ patients and the data are available from the Gene Expression Omnibus under the accession name GSE20129. We consider 95 genes from the toll-like receptor signalling pathway and another 62 genes from the peroxisome proliferator-activated receptor signalling pathway, which is known to be unrelated to cardiovascular disease. A good method should discover connections for genes within each of the pathways but not across them.

We use both the original version of the adaptively constrained $$\ell_1$$-minimization estimator and our robustified version based on the Huber pilot estimator to estimate the precision matrix and therefore the gene regulatory network. We first choose the tuning parameters that deliver the top 100 connections for each method. Table 3 reports the selection results, which are also displayed in Fig. 1. Our robust method identifies more connections within each pathway and fewer connections across the pathways.

Fig. 1. Connections estimated by the adaptively constrained $$\ell_1$$-minimization estimator using (a) the sample covariance and (b) the Huber-type pilot estimator; blue lines represent within-pathway connections and red lines between-pathway connections.

Table 3. Number of connections detected by two types of methods

| Tuning | Method | Within | Between | Total |
| --- | --- | --- | --- | --- |
| Top 100 connections | Huber estimator | 60 | 40 | 100 |
| Top 100 connections | Sample covariance | 55 | 45 | 100 |
| Equal tuning parameters | Huber estimator | 27 | 15 | 42 |
| Equal tuning parameters | Sample covariance | 55 | 45 | 100 |

We also tried taking the same tuning parameter in the constrained $$\ell_1$$-minimization step (8) for each procedure; Table 3 gives these results as well. Our estimator then detects fewer connections; however, the percentage of within-pathway connections estimated using the Huber pilot estimator is much higher than that of the sample covariance estimator. If the genomics literature is correct, our results show that use of the Huber pilot estimator improves inference for this example, in which heavy tails and skewness are present.
Acknowledgement

This work was partially supported by the U.S. National Science Foundation and National Institutes of Health. Avella-Medina was partially supported by the Swiss National Science Foundation. Battey was partially supported by the U.K. Engineering and Physical Sciences Research Council. The authors thank the editor, the associate editor and three referees for valuable comments.

Supplementary material

Supplementary Material available at Biometrika online includes the proofs of all propositions and theorems and additional plots for the real-data example.

References

Antoniadis, A. & Fan, J. (2001). Regularization of wavelet approximations. J. Am. Statist. Assoc. 96, 939–57.
Azzalini, A. (2005). The skew-normal distribution and related multivariate families. Scand. J. Statist. 32, 159–200.
Bickel, P. J. & Levina, E. (2008). Covariance regularization by thresholding. Ann. Statist. 36, 2577–604.
Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. Cambridge: Cambridge University Press.
Bubeck, S., Cesa-Bianchi, N. & Lugosi, G. (2013). Bandits with heavy tail. IEEE Trans. Info. Theory 59, 7711–7.
Cai, T. T. & Liu, W. (2011). Adaptive thresholding for sparse covariance matrix estimation. J. Am. Statist. Assoc. 106, 672–84.
Cai, T. T., Liu, W. & Luo, X. (2011). A constrained $$\ell_1$$-minimization approach to sparse precision matrix estimation. J. Am. Statist. Assoc. 106, 594–607.
Cai, T. T., Liu, W. & Zhou, H. (2016). Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. Ann. Statist. 44, 455–88.
Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. Henri Poincaré Prob. Statist. 48, 1148–85.
Chen, M., Gao, C. & Ren, Z. (2015). Robust covariance matrix estimation via matrix depth. arXiv: 1506.00691.
Devroye, L., Lerasle, M., Lugosi, G. & Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Ann. Statist. 44, 2695–725.
Fan, J., Han, F., Liu, H. & Vickers, B. (2016a). Robust inference of risks of large portfolios. J. Economet. 194, 298–308.
Fan, J., Li, Q. & Wang, Y. (2017). Estimation of high-dimensional mean regression in absence of symmetry and light-tail assumptions. J. R. Statist. Soc. B 79, 247–65.
Fan, J., Liao, Y. & Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. J. R. Statist. Soc. B 75, 603–80.
Fan, J., Liu, H. & Wang, W. (2015). Large covariance estimation through elliptical factor models. arXiv: 1507.08377.
Fan, J., Wang, W. & Zhong, Y. (2016b). Robust covariance estimation for approximate factor models. arXiv: 1602.00719.
Grant, M. & Boyd, S. (2014). CVX: Matlab software for disciplined convex programming, version 2.1.
Huang, C.-C., Liu, K., Pope, R. M., Du, P., Lin, S., Rajamannan, N. M., Huang, Q.-Q., Jafari, N., Burke, G. L., Post, W. et al. (2011). Activated TLR signaling in atherosclerosis among women with lower Framingham risk score: The multi-ethnic study of atherosclerosis. PloS One 6, e21067.
Huber, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35, 73–101.
Huber, P. & Ronchetti, E. (2009). Robust Statistics. Hoboken, New Jersey: Wiley, 2nd edn.
Joly, E. & Lugosi, G. (2016). Robust estimation of U-statistics. Stoch. Proces. Appl. 126, 3760–73.
Lerasle, M. & Oliveira, R. (2011). Robust empirical mean estimators. arXiv: 1112.3914.
Liu, H., Han, F., Yuan, M., Lafferty, J. & Wasserman, L. (2012). High-dimensional semiparametric Gaussian copula graphical models. Ann. Statist. 40, 2293–326.
Loh, P. L. & Tan, X. L. (2015). High-dimensional robust precision matrix estimation: Cellwise corruption under $$\epsilon$$-contamination. arXiv: 1509.07229.
Nemirovsky, A. S. & Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Hoboken, New Jersey: Wiley.
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. & Kanehisa, M. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30.
Petrov, V. (1995). Limit Theorems of Probability Theory. Oxford: Clarendon Press.
Ren, Z., Sun, T., Zhang, C.-H. & Zhou, H. (2015). Asymptotic normality and optimalities in estimation of large Gaussian graphical models. Ann. Statist. 43, 991–1026.
Rothman, A., Bickel, P., Levina, E. & Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Statist. 2, 494–515.
Rothman, A., Levina, E. & Zhu, J. (2009). Generalized thresholding of large covariance matrices. J. Am. Statist. Assoc. 104, 177–86.
Wit, E. C. & Abbruzzo, A. (2015). Inferring slowly-changing dynamic gene-regulatory networks. BMC Bioinformatics 16, S5.
Xue, L. & Zou, H. (2012). Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Ann. Statist. 40, 2541–71.

Robust estimation of high-dimensional covariance and precision matrices

Loading next page...
 
/lp/ou_press/robust-estimation-of-high-dimensional-covariance-and-precision-237Q6BayYY
Publisher
Oxford University Press
Copyright
© 2018 Biometrika Trust
ISSN
0006-3444
eISSN
1464-3510
D.O.I.
10.1093/biomet/asy011
Publisher site
See Article on Publisher Site

Abstract

SUMMARY High-dimensional data are often most plausibly generated from distributions with complex structure and leptokurtosis in some or all components. Covariance and precision matrices provide a useful summary of such structure, yet the performance of popular matrix estimators typically hinges upon a sub-Gaussianity assumption. This paper presents robust matrix estimators whose performance is guaranteed for a much richer class of distributions. The proposed estimators, under a bounded fourth moment assumption, achieve the same minimax convergence rates as do existing methods under a sub-Gaussianity assumption. Consistency of the proposed estimators is also established under the weak assumption of bounded $$2+\epsilon$$ moments for $$\epsilon\in (0,2)$$. The associated convergence rates depend on $$\epsilon$$. 1. Introduction Covariance and precision matrices play a central role in summarizing linear relationships among variables. Our focus is on estimating these matrices when their dimension is large relative to the number of observations. Besides being of interest in themselves, estimates of covariance and precision matrices are used for numerous procedures from classical multivariate analysis, including linear regression. Consistency is achievable under structural assumptions provided regularity conditions are met. For instance, under the assumption that all rows or columns of the covariance matrix belong to a sufficiently small $$\ell_{q}$$-ball around zero, thresholding (Bickel & Levina, 2008; Rothman et al., 2008) or its adaptive counterpart (Cai & Liu, 2011) gives consistent estimators of the covariance matrix in the spectral norm for data from a distribution with sub-Gaussian tails. For precision matrix estimation, the same sparsity assumption on the precision matrix motivates the use of the constrained $$\ell_1$$-minimizer of Cai et al. (2011) or its adaptive counterpart (Cai et al., 2016), both of which are consistent in spectral norm under the same sub-Gaussianity condition. Under sub-Gaussianity, Cai & Liu (2011) and Cai et al. (2016) showed that in high-dimensional regimes the adaptive thresholding estimator and adaptive constrained $$\ell_1$$-minimization estimator are minimax optimal within the classes of covariance or precision matrices satisfying their sparsity constraint. Since sub-Gaussianity is often too restrictive in practice, we seek new procedures that can achieve the same minimax optimality when data are leptokurtic. Inspection of the proofs of Bickel & Levina (2008), Cai & Liu (2011) and Cai et al. (2016) reveals that sub-Gaussianity is needed because their methods are built on the sample covariance matrix, which requires the assumption to guarantee its optimal performance. Here we show that minimax optimality is achievable within a larger class of distributions if the sample covariance matrix is replaced by a robust pilot estimator, thus providing a unified theory for covariance and precision matrix estimation based on general pilot estimators. We also show how to construct pilot estimators that have the required elementwise convergence rates of (1) and (2) below. Within a much larger class of distributions with bounded fourth moment, it is shown that an estimator obtained by regularizing a robust pilot estimator attains the minimax rate achieved by existing methods under sub-Gaussianity. 
The analysis is extended to show that when only bounded $$2+\epsilon$$ moments exist for $$\epsilon\in(0,2)$$, matrix estimators with satisfactory convergence rates are still attainable. Some related work includes that of Liu et al. (2012) and Xue & Zou (2012), who considered robust estimation of graphical models when the underlying distribution is elliptically symmetric, Fan et al. (2015, 2016a,b), who studied robust matrix estimation in the context of factor models, and Chen et al. (2015) and Loh & Tan (2015), who investigated matrix estimation when the data are contaminated by outliers. The present paper is concerned with efficient estimation of general sparse covariance and precision matrices when only certain moment conditions are assumed. For a $$p$$-dimensional random vector $${X}$$ with mean $${\mu}$$, let $$\Sigma^{*}=E\{({X}-{\mu})({X}-{\mu})^{\mathrm{T}}\}$$ and let $$\tilde{\Sigma}=(\tilde{\sigma}_{uv})$$ denote an arbitrary pilot estimator of $$\Sigma^{*}=(\sigma^{*}_{uv})$$, where $$u,v \in [p]$$ with $$[p]$$ standing for $$\{1,\dots,p \}$$. The key requirement on $$\tilde{\Sigma}$$ for optimal covariance estimation is that   \begin{equation}\label{eqConditionC} \text{pr}\Bigl[\max_{u,v}|\tilde{\sigma}_{uv} - \sigma_{uv}^{*}|\leq C_{0}\{(\log p)/n\}^{1/2}\Bigr] \geq 1 - \epsilon_{n,p}, \end{equation} (1) where $$C_{0}$$ is a positive constant and $$\epsilon_{n,p}$$ is a deterministic sequence converging to zero as $$n,p\rightarrow \infty$$ such that $$n^{-1}\log p\rightarrow 0$$. This delivers rates of convergence that match the minimax rates of Cai & Liu (2011) even under violations of their sub-Gaussianity condition, which entails the existence of $$b>0$$ such that $$E(\exp[t\{X_{u}-E(X_u) \}])\leq \exp(b^{2}t^{2}/2)$$ for every $$t\in \mathbb{R}$$ and every $$u\in[p]$$. Introduce the sample covariance matrix   \begin{equation*} \hat{\Sigma}=(\hat{\sigma}_{uv}) = n^{-1}\sum_{i=1}^{n}\bigl({X}_{i} - \bar{{X}}\bigr) \bigl({X}_{i} - \bar{{X}}\bigr)^{\mathrm{T}}\!, \end{equation*} where $${X}_1,\ldots,{X}_n$$ are independent and identically distributed copies of $${X}$$ and $$\bar{{X}}=n^{-1}\sum_{i=1}^{n}{X}_{i}$$. Proposition 1 shows that $$\hat{\Sigma}$$ violates (1) when $${X}$$ is not sub-Gaussian. In other words, the sample covariance does not concentrate exponentially fast in an elementwise sense if sub-Gaussianity is violated. Similarly, for estimation of the precision matrix $$\Omega^{*}=(\Sigma^{*})^{-1}$$, the optimality of the adaptive constrained $$\ell_1$$-minimization estimator is retained under a pilot estimator satisfying   \begin{equation}\label{eqConditionC2} \text{pr}\Bigl[\max_{u,v}\,\bigl|(\tilde{\Sigma}\Omega^{*} - I_p)_{uv}\bigr|\leq C_{0}\{(\log p)/n\}^{1/2}\Bigr] \geq 1 - \epsilon_{n,p}, \end{equation} (2) where $$C_{0}$$ and $$\epsilon_{n,p}$$ are as in (1) and $$I_p$$ denotes the $$p\times p$$ identity matrix. While (2) holds with $$\tilde{\Sigma}=\hat{\Sigma}$$ under sub-Gaussianity of $${X}$$, it fails otherwise. The following proposition provides a more formal illustration of the unsuitability of $$\tilde{\Sigma}=\hat{\Sigma}$$ as a pilot estimator in the absence of sub-Gaussianity. Proposition 1. Let $$E(|X_{u}X_{v}-\sigma^{\ast}_{uv}|^{1+\gamma})\leq 2$$ for $$u\neq v$$ and some $$\gamma>0$$. 
Then there is a distribution $${\rm pr}$$ of $$X$$ satisfying this assumption such that, for some $$\epsilon<1/2$$,   \begin{equation*} {\rm pr}\bigl\{|\hat{\sigma}_{uv} - \sigma_{uv}^{*}|>2^{-1}\epsilon^{-1/(1+\gamma)}n^{-\gamma/(1+\gamma)}\bigr\} \geq \epsilon\text{.} \end{equation*} This implies that taking the sample covariance matrix as the pilot estimator $$\tilde{\Sigma}$$ results in a polynomial rate of convergence, which is slower than the exponential rate of concentration in (1). Instead, we introduce robust pilot estimators in § 4 that satisfy conditions (1) and (2). These estimators require only $$\max_{1\leq u \leq p} E(|X_u|^4)<\infty$$.

Throughout the paper, for a vector $${a}=(a_1,\ldots,a_p)^{\mathrm{T}}\in\mathbb{R}^p$$, $$\:\|{a}\|_1=\sum_{v=1}^p|a_v|$$, $$\|{a}\|_2=(\sum_{v=1}^pa_v^2)^{1/2}$$ and $$\|{a}\|_\infty=\max_{1\leq v \leq p}|a_v|$$. For a matrix $$A=(a_{uv})\in\mathbb{R}^{p\times q}$$, $$\:\|A\|_{\max}=\max_{1\leq u\leq p, 1\leq v\leq q}|a_{uv}|$$ is the elementwise maximum norm, $$\|A\|_2=\sup_{\|{x}\|_2\leq 1}\|A{x}\|_2$$ is the spectral norm, and $$\|A\|_{1}=\max_{1\leq v\leq q}\sum_{u=1}^p|a_{uv}|$$ is the matrix $$\ell_1$$-norm. We let $$I_p$$ denote the $$p\times p$$ identity matrix; $$A\succ 0$$ and $$A \succeq 0$$ mean that $$A$$ is positive definite and positive semidefinite, respectively. For a square matrix $$A$$, we denote its maximum and minimum eigenvalues by $$\lambda_{\max}(A)$$ and $$\lambda_{\min}(A)$$, respectively. We also assume throughout that $$E({X})=0$$.

2. Broadening the scope of the adaptive thresholding estimator

Let $$\tau_{\lambda}(\cdot)$$ be a general thresholding function for which: (i) $$|\tau_\lambda(z)|\leq |y|$$ for all $$z$$ and $$y$$ that satisfy $$|z-y|\leq\lambda$$; (ii) $$\tau_\lambda(z)=0$$ for $$|z|\leq \lambda$$; (iii) $$|\tau_\lambda(z)-z|\leq \lambda$$ for all $$z\in \mathbb{R}$$. Similar properties are set forth in Antoniadis & Fan (2001) and were proposed in the context of covariance estimation via thresholding by Rothman et al. (2009) and Cai & Liu (2011). Examples of thresholding functions satisfying these three conditions are the soft thresholding rule $$\tau_\lambda(z)={\rm sgn}(z)(|z|-\lambda)_+$$, the adaptive lasso rule $$\tau_\lambda(z)=z(1-|\lambda/z|^\eta)_+$$ with $$\eta\geq 1$$, and the smoothly clipped absolute deviation thresholding rule (Rothman et al., 2009). Although the hard thresholding rule $$\tau_\lambda(z)=z {\rm 1}\kern-0.24em{\rm I} (|z|>\lambda)$$ does not satisfy (i), the results presented in this section also hold for hard thresholding.

The adaptive thresholding estimator is defined as   \[ \hat{\Sigma}^{\mathcal{T}} = (\hat{\sigma}_{uv}^{\mathcal{T}}) = \bigl\{\tau_{\lambda_{uv}}(\hat{\sigma}_{uv})\bigr\}\!, \] where $$\hat{\sigma}_{uv}$$ is the $$(u,v)$$th entry of $$\hat{\Sigma}$$ and the threshold $$\lambda_{uv}$$ is entry-dependent. Equipped with these adaptive thresholds, Cai & Liu (2011) established optimal rates of convergence of the resulting estimator under sub-Gaussianity of $${X}$$. To accommodate data drawn from distributions violating sub-Gaussianity, we replace the sample covariance matrix $$\hat{\Sigma}$$ by a pilot estimator $$\tilde{\Sigma}$$ satisfying (1). The resulting adaptive thresholding estimator is denoted by $$\tilde{\Sigma}^{\mathcal{T}}$$. As suggested by Fan et al. (2013),
the entry-dependent threshold   \begin{equation}\label{thresholdAdaptive} \lambda_{uv}=\lambda \left(\frac{\tilde{\sigma}_{uu}\tilde{\sigma}_{vv}\log p}{n}\right)^{1/2} \end{equation} (3) is used, where $$\lambda>0$$ is a constant. This is simpler than the threshold used by Cai & Liu (2011), as it does not require estimation of $${\rm var}(\tilde{\sigma}_{uv})$$, and it achieves the same optimality.

Let $$\mathcal{S}^{+}(\mathbb{R},p)$$ denote the class of positive-definite symmetric matrices with elements in $$\mathbb{R}$$. Theorem 1 relies on the following conditions on the pilot estimator and the sparsity of $$\Sigma^{*}$$.

Condition 1. The pilot estimator $$\tilde{\Sigma}=(\tilde{\sigma}_{uv})$$ satisfies (1).

Condition 2. The matrix $$\Sigma^{*}=(\sigma_{uv}^{*})$$ belongs to the class   \begin{equation*} \mathcal{U}_q=\mathcal{U}_q\{s_0(p)\}=\left\{\Sigma:\Sigma\in \mathcal{S}^{+}(\mathbb{R},p),\:\max_u\sum_{v=1}^p(\sigma_{uu}^{*}\sigma_{vv}^{*})^{(1-q)/2}|\sigma_{uv}^{*}|^q\leq s_0(p)\right\}\!\text{.} \end{equation*}

The class of weakly sparse matrices $$\mathcal{U}_q$$ was introduced by Cai & Liu (2011). The columns of a covariance matrix in $$\mathcal{U}_q$$ are required to lie in a weighted $$\ell_q$$-ball, where the weights are determined by the variances of the population covariance matrix.

Theorem 1. Suppose that Conditions 1 and 2 hold, $$\log p=o(n)$$, and $$\min_{u}\sigma_{uu}^{*}\geq\gamma > 0$$. There exists a positive constant $$C_0$$ such that  \[ \underset{\Sigma^{*}\in\mathcal{U}_q}{\inf}{\rm pr}\biggl\{\|\tilde{\Sigma}^{\mathcal{T}}-\Sigma^*\|_2\leq C_0s_0(p)\left(\frac{\log p}{n}\right)^{(1-q)/2}\biggr\}\geq 1-\epsilon_{n,p}, \] where $$\epsilon_{n,p}$$ is a deterministic sequence that decreases to zero as $$n,p \to \infty$$.

The constant $$C_0$$ in Theorem 1 depends on $$q$$, $$\lambda$$ and the unknown distribution of $$X$$, as do the constants appearing in Theorem 2 and Propositions 2–4. Our result generalizes Theorem 1 of Cai & Liu (2011): the rate in our Theorem 1 matches their minimax lower bound, implying that our procedure is minimax optimal over a wider class of distributions, one containing the sub-Gaussian distributions. Cai & Liu (2011) also give convergence rates under bounded moments in all components of $$X$$; in that case, a much more stringent scaling condition on $$n$$ and $$p$$ is required, as shown in Theorem 1(ii) of Cai & Liu (2011). Their result does not cover the high-dimensional case where $$p>n$$ when fewer than $$8+\epsilon$$ finite moments exist, and even when a larger number of finite moments is assumed, $$p$$ is only allowed to increase polynomially with $$n$$. By contrast, we allow $$\log p = o(n)$$. For the three pilot estimators to be given in § 4, even universal thresholding can achieve the same minimax optimal rate given the bounded fourth moments assumed there; however, adaptive thresholding as formulated in (3) results in better numerical performance.
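To make the construction concrete, the following is a minimal NumPy sketch, not the authors' code, of adaptive thresholding with the entry-dependent threshold (3) applied to a generic pilot estimate; the default constant lam and the convention of leaving the diagonal unthresholded are our assumptions, with tuning in practice done by crossvalidation.

```python
import numpy as np

def soft_threshold(z, lam):
    # Soft thresholding rule: tau_lambda(z) = sgn(z) (|z| - lambda)_+.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard_threshold(z, lam):
    # Hard thresholding rule: tau_lambda(z) = z 1(|z| > lambda).
    return z * (np.abs(z) > lam)

def adaptive_threshold(Sigma_tilde, n, lam=2.0, rule=soft_threshold):
    # Entry-dependent thresholds (3):
    # lambda_uv = lam * {sigma_uu * sigma_vv * log(p) / n}^{1/2}.
    p = Sigma_tilde.shape[0]
    d = np.diag(Sigma_tilde).copy()
    lam_uv = lam * np.sqrt(np.outer(d, d) * np.log(p) / n)
    out = rule(Sigma_tilde, lam_uv)
    np.fill_diagonal(out, d)  # keep the variances unthresholded (our convention)
    return out
```

Passing rule=hard_threshold reproduces the hard thresholding variant used in the simulations of § 6.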
Unfortunately, $$\tilde{\Sigma}^{\mathcal{T}}$$ may not be positive semidefinite, but it can be projected onto the cone of positive-semidefinite matrices through the convex optimization   \begin{equation} \label{eq:L2-projection} \tilde{\Sigma}^{\mathcal{T}}_{+} = {\rm arg\,min}_{\Sigma \succeq 0} \|\Sigma-\tilde{\Sigma}^{\mathcal{T}}\|_2\text{.} \end{equation} (4) By definition, $$\|\tilde{\Sigma}^{\mathcal{T}}_+-\tilde{\Sigma}^{\mathcal{T}}\|_2 \leq\|{\Sigma}^{\ast}-\tilde{\Sigma}^{\mathcal{T}}\|_2$$, so the triangle inequality yields   \begin{equation*} \| \tilde{\Sigma}^{\mathcal{T}}_{+} - \Sigma^{*}\|_2 \leq \|\tilde{\Sigma}^{\mathcal{T}}_{+} - \tilde{\Sigma}^{\mathcal{T}}\|_2 + \| \tilde{\Sigma}^{\mathcal{T}} - \Sigma^{*}\|_2 \leq 2 \, \|\tilde{\Sigma}^{\mathcal{T}} - \Sigma^{*}\|_2\text{.} \end{equation*} Hence, the price to pay for the projection is no more than a factor of two, which does not affect the convergence rate. The projection is a standard convex optimization problem (Boyd & Vandenberghe, 2004).

3. Broadening the scope of the adaptively constrained $$\ell_1$$-minimization estimator

We consider a robust modification of the adaptively constrained $$\ell_1$$-minimization estimator of Cai et al. (2016). In a spirit similar to § 2, our robust modification relies on the existence of a pilot estimator $$\tilde{\Sigma}$$ satisfying (2). Construction of the robust adaptive constrained $$\ell_1$$-minimizer relies on a preliminary projection, resulting in the positive-definite estimator   \begin{equation}\label{projection} \tilde{\Sigma}^{+}=\mathop {\arg \min }\limits_{\Sigma\succeq \epsilon I_p} \|\Sigma-\tilde{\Sigma}\|_{\max} \end{equation} (5) for an arbitrarily small positive number $$\epsilon$$. The minimization problem in (5) can be rewritten as minimizing $$\vartheta$$ subject to $$|(\tilde{\Sigma} - \Sigma)_{uv}| \leq \vartheta$$ for all $$1\leq u,v \leq p$$ and $$\Sigma \succeq\epsilon I_p$$. This problem can be solved in Matlab using the cvx solver (Grant & Boyd, 2014).

Given $$\tilde{\Sigma}^{+}=(\tilde{\sigma}_{uv}^{+})$$, our estimator of $$\Omega^{*}$$ is constructed by replacing $$(\hat{\Sigma}+n^{-1}I_p)$$ with $$\tilde{\Sigma}^{+}$$ in the original constrained $$\ell_1$$-minimization procedure. For ease of reference, the steps are reproduced below. Define the first-stage estimator $$\check{\Omega}^{(1)}$$ of $$\Omega^{*}$$ through the vectors   \begin{equation} \check{{\omega}}_{v}^{\dagger}=\mathop {\arg \min }\limits_{{\omega}_v\in \mathbb{R}^p}\Big\{\|{\omega}_v\|_1:\|\tilde{\Sigma}^{+}{\omega}_v-{e}_v\|_\infty\leq \delta_{n,p} \max_v( \tilde{\sigma}^{+}_{vv})\,\omega_{vv}, \; \omega_{vv}>0\Big\}, \quad v\in[p], \end{equation} (6) with $${\omega}_v=(\omega_{1v},\ldots,\omega_{pv})^{\mathrm{T}}$$, $$\delta_{n,p}=\delta\{(\log p)/n\}^{1/2}$$ for $$\delta>0$$, and $${e}_v$$ the vector with value $$1$$ in the $$v$$th coordinate and zeros elsewhere. More specifically, define $$\check{{\omega}}^{(1)}_{v}$$ as an adjustment of $$\check{{\omega}}_{v}^{\dagger}$$ such that the $$v$$th entry is   \begin{equation}\label{eqACLIME2} \check{\omega}^{(1)}_{vv}=\check{\omega}_{vv}^{\dagger}{\rm 1}\kern-0.24em{\rm I} \Bigl\{\tilde{\sigma}_{vv}^{+}\leq (n/\log p)^{1/2}\Bigr\}+\{(\log p)/n\}^{1/2}{\rm 1}\kern-0.24em{\rm I} \Bigl\{\tilde{\sigma}_{vv}^{+}> (n/\log p)^{1/2}\Bigr\}, \end{equation} (7) and define the first-stage estimator as $$\check{\Omega}^{(1)}=(\check{{\omega}}_{1}^{(1)}, \ldots, \check{{\omega}}_{p}^{(1)})$$.
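The Matlab cvx formulation just mentioned carries over directly to Python's cvxpy. The sketch below is a simplification under our own choices, not the authors' implementation: it solves the max-norm projection (5) and a stylized version of the column problem (6) in which the data-driven right-hand side is frozen at a fixed bound lam.

```python
import cvxpy as cp
import numpy as np

def project_max_norm(Sigma_tilde, eps=1e-4):
    # Projection (5): nearest matrix to Sigma_tilde in the elementwise
    # maximum norm, subject to Sigma >= eps * I (positive definiteness).
    p = Sigma_tilde.shape[0]
    S = cp.Variable((p, p), symmetric=True)
    problem = cp.Problem(cp.Minimize(cp.max(cp.abs(S - Sigma_tilde))),
                         [S >> eps * np.eye(p)])
    problem.solve()
    return S.value

def l1_column(Sigma_plus, v, lam):
    # Stylized column problem: min ||w||_1 subject to
    # ||Sigma_plus w - e_v||_inf <= lam, with lam a fixed bound
    # (our simplification of the adaptive constraint in (6)).
    p = Sigma_plus.shape[0]
    w = cp.Variable(p)
    e_v = np.zeros(p)
    e_v[v] = 1.0
    problem = cp.Problem(cp.Minimize(cp.norm(w, 1)),
                         [cp.norm(Sigma_plus @ w - e_v, 'inf') <= lam])
    problem.solve()
    return w.value
```

Solving the column problem for each $$v\in[p]$$ and assembling the solutions columnwise gives a first-stage estimate in the spirit of $$\check{\Omega}^{(1)}$$.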
A second-stage adaptive estimator $$\check{\Omega}^{(2)}=(\check{{\omega}}_{1}^{(2)},\ldots, \check{{\omega}}_{p}^{(2)})$$ is defined by solving, for each column,   \begin{equation} \check{{\omega}}_{v}^{(2)}=\mathop {\arg \min }\limits_{{\omega}_v\in \mathbb{R}^p} \Big\{\|{\omega}_v\|_1: \bigl|\left(\tilde{\Sigma}^{+}{\omega}_v-{e}_v\right)_{u}\bigr|\leq \lambda_{n,p}(\tilde{\sigma}^{+}_{uu} \check{\omega}^{(1)}_{vv})^{1/2} \; (u =1,\ldots,p)\Big\}, \end{equation} (8) where $$\lambda_{n,p}=\lambda\{(\log p)/n\}^{1/2}$$ for $$\lambda>0$$. In practice, the optimal values of $$\delta$$ and $$\lambda$$ are chosen by crossvalidation. The final estimator, $$\tilde{\Omega}$$, of $$\Omega^{*}$$ is a symmetrized version of $$\check{\Omega}^{(2)}$$ constructed as   \begin{equation}\label{eqACLIME4} \tilde{\Omega} = (\tilde{\omega}_{uv}),\quad \tilde{\omega}_{uv}=\tilde{\omega}_{vu} = \check{\omega}_{uv}^{(2)} {\rm 1}\kern-0.24em{\rm I} (| \check{\omega}_{uv}^{(2)}|\leq | \check{\omega}_{vu}^{(2)}|) + \check{\omega}_{vu}^{(2)}{\rm 1}\kern-0.24em{\rm I} (| \check{\omega}_{uv}^{(2)}| > | \check{\omega}_{vu}^{(2)}|)\text{.} \end{equation} (9)

The theoretical properties of $$\tilde{\Omega}$$ are derived under Conditions 3 and 4.

Condition 3. The pilot estimator $$\tilde{\Sigma}=(\tilde{\sigma}_{uv})$$ satisfies (2).

Condition 4. The matrix $$\Omega^{*}=(\omega_{uv}^{*})$$ belongs to the class   \begin{equation*} \begin{split} \mathcal{G}_q=\mathcal{G}_q(c_{n,p},M_{n,p}) =\bigg\{\Omega \in \mathcal{S}^{+}(\mathbb{R},p): & \; \max_v\sum_{u=1}^p|\omega_{uv}|^q\leq c_{n,p},\; \|\Omega\|_{1}\leq M_{n,p}, \\ & \; \frac{1}{M_1}\leq{\lambda_{\min}(\Omega)}\leq\lambda_{\max}(\Omega)\leq M_1\bigg\}, \end{split} \end{equation*} where $$0\leq q\leq 1$$, $$M_1>0$$ is a constant, and $$M_{n,p}$$ and $$c_{n,p}$$ are positive deterministic sequences that are bounded away from zero and allowed to diverge as $$n$$ and $$p$$ grow. In this class of precision matrices, sparsity is imposed by restricting the columns of $$\Omega^{*}$$ to lie in an $$\ell_q$$-ball of radius $$c_{n,p}$$ ($$0\leq q<1$$).

Theorem 2. Suppose that Conditions 1, 3 and 4 are satisfied with $$c_{n,p}=o(n/\log p)$$. Under the scaling condition $$\log p = O(n^{1/2})$$ we have, for a positive constant $$C_0$$,   \[ \underset{\Omega^{*}\in{\mathcal{G}_q}}{\inf}{\rm pr}\bigg\{\|\tilde{\Omega}-\Omega^{*}\|_2\leq C_0 M_{n,p}^{1-q}c_{n,p}\left(\frac{\log p}{n}\right)^{(1-q)/2}\bigg\} \geq 1-\epsilon_{n,p}, \] where $$\epsilon_{n,p}$$ is a deterministic sequence that decreases to zero as $$n, p\rightarrow \infty$$ and $$\tilde{\Omega}$$ is the robust adaptively constrained $$\ell_1$$-minimization estimator described in (6)–(9).

Remark 1. Our class of precision matrices is slightly more restrictive than that considered in Cai et al. (2016), since we require $$1/M_1\leq\lambda_{\min}(\Omega^{*})\leq\lambda_{\max}(\Omega^{*})\leq M_1$$ instead of $$\lambda_{\max}(\Omega^{*})/\lambda_{\min}(\Omega^{*})\leq M_1$$. The difference is marginal since $$\sigma^{*}_{uu}={e}_u^{\mathrm{T}}\Sigma^{*}{e}_u=\|\Sigma^{*1/2}{e}_u\|_2^2\leq\lambda_{\max}(\Sigma^{*})=1/\lambda_{\min}(\Omega^{*})$$ and $$\lambda_{\max}(\Omega^{*})/\lambda_{\min}(\Omega^{*})\leq M_1$$ implies that $$0<M_1^{-1}\lambda_{\max}(\Omega^{*})\leq\lambda_{\min}(\Omega^{*})\leq \lambda_{\max}(\Omega^{*})\leq M_1\lambda_{\min}(\Omega^{*})<\infty$$.
We therefore only exclude precision matrices associated with either exploding or imploding covariance matrices, that is, we exclude $$\sigma^{*}_{uu}\to 0$$ and $$\sigma^{*}_{uu}\to\infty$$ for any $$u\in[p]$$. Ren et al. (2015) also require $$1/M_1\leq\lambda_{\min}(\Omega^{*})\leq\lambda_{\max}(\Omega^{*})\leq M_1$$. A positive-semidefinite estimator with the same convergence rate as $$\tilde{\Omega}$$ can be constructed by projecting the symmetric matrix $$\tilde{\Omega}$$ onto the cone of positive-semidefinite matrices, as in (4).

Next, we present three pilot estimators whose performance compares favourably with that of the sample covariance matrix when the sub-Gaussianity assumption is violated, and we verify Conditions 1 and 3 for these estimators. Condition 1 will be verified for all three pilot estimators. When $$\|\Omega^*\|_1$$ is bounded, Condition 1 implies Condition 3 because $$\|\tilde{\Sigma}\Omega^*-I_p\|_{\max}=\|(\tilde{\Sigma}-\Sigma^*)\Omega^*\|_{\max}\leq \|\Omega^*\|_{1}\|\tilde{\Sigma}-\Sigma^*\|_{\max}$$. When $$\|\Omega^*\|_1 \to \infty$$, Condition 3 is verified for the adaptive Huber estimator. We emphasize that Condition 3 is only needed if the goal is to obtain a minimax optimal estimator of $$\Omega^{*}$$; a consistent estimator is still attainable if only Condition 1 holds when $$\|\Omega^{\ast}\|_1\to \infty$$. A more thorough discussion appears in the Supplementary Material.

4. Robust pilot estimators

4.1. A rank-based estimator

The rank-based estimator requires only the existence of second moments. However, it makes arguably more restrictive assumptions, as it requires the distribution of $${X}$$ to be elliptically symmetric.

Definition 1. A random vector $${Z} = (Z_1,\ldots,Z_p)^{\mathrm{T}}$$ follows an elliptically symmetric distribution if and only if $${Z}= \mu+\xi A U$$, where $$\mu\in \mathbb{R}^p$$, $$A\in \mathbb{R}^{p\times q}$$ with $$q = \mathrm{rank}(A)$$, $$U$$ is uniformly distributed on the unit sphere in $$\mathbb{R}^q$$, and $$\xi$$ is a positive random variable independent of $$U$$.

Observe that $$\Sigma^{*}=D^{*}R^{*}D^{*}$$, where $$R^{*}=(r^{*}_{uv})$$ denotes the correlation matrix and $$D^{*}=\text{diag}\{(\sigma^{*}_{11})^{1/2},\ldots,(\sigma^{*}_{pp})^{1/2}\}$$. Liu et al. (2012) and Xue & Zou (2012) both proposed rank-based estimation of $$R^{*}$$, exploiting a bijective mapping between the Pearson correlation and Kendall's tau or Spearman's rho dependence measures that holds for elliptical distributions. More specifically, Kendall's tau concordance between $$X_{u}$$ and $$X_{v}$$ is defined as   \[ \tau^{*}_{uv} = \text{pr}\bigl\{(X_{u} - Y_{u})(X_{v} - Y_{v})>0\bigr\} - \text{pr}\bigl\{(X_{u} - Y_{u})(X_{v} - Y_{v})<0\bigr\}, \] where $${Y}$$ is an independent copy of $${X}$$. With $${X}_{i}=(X_{i,1},\ldots, X_{i,p})^{\mathrm{T}}$$, the empirical analogue of $$\tau^{*}_{uv}$$ is   \[ \tilde{\tau}_{uv} = {n \choose 2}^{-1}\sum_i\sum_{j<i} \Bigl[\!{\rm 1}\kern-0.24em{\rm I} \bigl\{(X_{iu} - X_{ju})(X_{iv} - X_{jv})>0\bigr\} - \!{\rm 1}\kern-0.24em{\rm I} \bigl\{(X_{iu} - X_{ju})(X_{iv} - X_{jv})<0\bigr\}\Bigr]\text{.} \] Since $$\tau^{*}_{uv} = 2\pi^{-1}\arcsin(r^{*}_{uv})$$, an estimator of $$R^{*}$$ is $$\tilde{R}=(\tilde{r}_{uv})=\{\sin(\pi\tilde{\tau}_{uv}/2)\}$$. An analogous bijection exists between Spearman's rho and Pearson's correlation; see Xue & Zou (2012) for details.
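As an illustration, here is a small NumPy/SciPy sketch of the rank-based pilot estimator, our paraphrase rather than the authors' code. The correlation part applies the sine transform to pairwise Kendall's tau; the marginal scales anticipate the median absolute deviation construction described next, with the Gaussian consistency constant 1.4826 standing in, as an assumption, for the $$F_u^{-1}(3/4)$$ correction of the text.

```python
import numpy as np
from scipy.stats import kendalltau

def rank_based_correlation(X):
    # Kendall's tau mapped to the Pearson scale: R_uv = sin(pi * tau_uv / 2),
    # a bijection valid under elliptical symmetry.
    n, p = X.shape
    R = np.eye(p)
    for u in range(p):
        for v in range(u + 1, p):
            tau, _ = kendalltau(X[:, u], X[:, v])
            R[u, v] = R[v, u] = np.sin(0.5 * np.pi * tau)
    return R

def mad_scales(X, c=1.4826):
    # Median absolute deviation estimates of the marginal standard deviations;
    # c = 1.4826 is the Gaussian Fisher-consistency constant (an assumption).
    med = np.median(X, axis=0)
    return c * np.median(np.abs(X - med), axis=0)

def rank_based_covariance(X):
    d = mad_scales(X)                       # estimates of (sigma_uu)^{1/2}
    return np.outer(d, d) * rank_based_correlation(X)   # D R D elementwise
```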
We propose to estimate the elements of the diagonal matrix $$D^{*}$$ using a median absolute deviation estimator, $$\tilde{D}=\text{diag}\{(\tilde{\sigma}_{11})^{1/2}, \ldots, (\tilde{\sigma}_{pp})^{1/2}\}$$, where $$\tilde{\sigma}_{uu} = C_u \,\text{med}_{i\in[n]}\{|X_{iu} - \text{med}_{j\in[n]}(X_{ju})|\}$$. Here $$\text{med}_{i\in [n]}(\cdot)$$ denotes the median within the index set $$[n]$$ and $$C_u=F_{u}^{-1}(3/4)$$ is the Fisher consistency constant, where $$F_{u}$$ is the distribution function of $$X_{1u}-\nu_u$$ and $$\nu_u$$ is the median of $$X_{1u}$$. Finally, the rank-based estimator is defined as $$\tilde{\Sigma}_R=(\tilde{\sigma}_{uv}^R)=\tilde{D}\tilde{R}\tilde{D}$$.

Proposition 2. Let $${X}_1,\ldots,{X}_n$$ be independent and identically distributed copies of the elliptically symmetric random vector $${X}$$ with covariance matrix $${\Sigma^{*}=D^{*}R^{*}D^{*}}$$. Assume that $${\max_{u}\sigma_{uu}^{*} = \varsigma <\infty}$$ and $${\min_{u}\sigma_{uu}^{*}> C_{2}\{(\log p)/n\}^{1/2} + 1/U^{4}}$$, where $$U<C_{1}\pi\surd 2/(4\varsigma)$$ for $$C_{1}>0$$ and $$C_{2}>0$$. Then  \[ {\rm pr}\Bigl[\max_{u,v}|\tilde{\sigma}^{R}_{uv} - \sigma^{*}_{uv}| \leq c \{(\log p)/n\}^{1/2}\Bigr] \geq 1- \epsilon_{n,p}, \] with $$\epsilon_{n,p}\leq C_{0}p^{-L}$$ for positive constants $$c$$, $$C_{0}$$ and $$L$$.

In estimating the marginal variances, we use median absolute deviation estimators to avoid higher moment assumptions. This assumes knowledge of $$F_u^{-1}(3/4)$$; without it, the marginal variances can instead be estimated by the adaptive Huber estimator or the median of means estimator given in the next two subsections, at the cost of requiring a fourth moment; see Propositions 3 and 5.

4.2. An adaptive Huber estimator

The Huber-type M-estimator requires only the existence of fourth moments. Let $$Y_{uv}=(X_{u} - \mu^{*}_{u})(X_{v} - \mu^{*}_{v})$$. Then $$\sigma_{uv}^{*}=E(Y_{uv})=\mu^{*}_{uv} - \mu^{*}_{u}\mu^{*}_{v}$$, where $$\mu_{u}^{*}=E(X_{u})$$, $$\mu^{*}_{v}=E(X_{v})$$ and $$\mu^{*}_{uv}=E(X_{u}X_{v})$$. We propose to estimate $$\sigma_{uv}^{*}$$ robustly through robust estimators of $$\mu^{*}_{u}$$, $$\mu^{*}_{v}$$ and $$\mu^{*}_{uv}$$. For independent and identically distributed copies $$Z_1,\ldots,Z_n$$ of a real random variable $$Z$$ with mean $$\mu$$, Huber's (1964) M-estimator of $$\mu$$ is defined as the solution to   \begin{align}\label{eqHuber} \sum_{i=1}^n\psi_H(Z_i-\mu)=0, \end{align} (10) where $$\psi_H(z)=\min\{H,\max(-H,z)\}$$ is the Huber function. Replacing $$Z_{i}$$ in (10) by $$X_{i,u}$$, $$X_{i,v}$$ and $$X_{i,u}X_{i,v}$$ gives the Huber estimators $$\tilde{\mu}_{u}^{H}$$, $$\tilde{\mu}_{v}^{H}$$ and $$\tilde{\mu}_{uv}^{H}$$ of $$\mu_{u}^{*}$$, $$\mu_{v}^{*}$$ and $$\mu_{uv}^{*}$$, respectively, from which the Huber-type estimator of $$\Sigma^{*}$$ is defined as $$\tilde{\Sigma}_{H}=(\tilde{\sigma}^{H}_{uv})=(\tilde{\mu}_{uv}^{H}- \tilde{\mu}_{u}^{H}\tilde{\mu}_{v}^{H})$$.

We depart from Huber (1964) by allowing $$H$$ to grow to infinity as $$n$$ increases, since our objectives differ from those of Huber (1964) and of classical robust statistics (Huber & Ronchetti, 2009). There, the distribution generating the data is assumed to be a contaminated version of a given parametric model, where the contamination level is small, and the objective is to estimate features of the parametric model as if no contamination were present. Our goal is instead to estimate the mean of the underlying distribution, allowing departures from sub-Gaussianity.
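A minimal sketch of the adaptive Huber pilot estimator follows, assuming the tuning constant K of Proposition 3 below is supplied by the user. The estimating function in (10) is continuous and nonincreasing in $$\mu$$, so its root can be bracketed between the sample minimum and maximum and found numerically.

```python
import numpy as np
from scipy.optimize import brentq

def huber_mean(z, H):
    # Solve sum_i psi_H(z_i - mu) = 0 for mu, as in (10); psi is the
    # Huber function psi_H(x) = min{H, max(-H, x)}.
    psi = lambda mu: np.sum(np.clip(z - mu, -H, H))
    lo, hi = z.min(), z.max()
    if lo == hi:            # degenerate sample: all observations equal
        return lo
    return brentq(psi, lo, hi)   # psi(lo) >= 0 >= psi(hi), so a root exists

def huber_covariance(X, K=1.0):
    # Adaptive Huber pilot estimator with H = K (n / log p)^{1/2}, the rate
    # of Proposition 3 below; K is a tuning constant chosen in practice.
    n, p = X.shape
    H = K * np.sqrt(n / np.log(p))
    mu = np.array([huber_mean(X[:, u], H) for u in range(p)])
    S = np.empty((p, p))
    for u in range(p):
        for v in range(u, p):
            m_uv = huber_mean(X[:, u] * X[:, v], H)
            S[u, v] = S[v, u] = m_uv - mu[u] * mu[v]
    return S
```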
In related work, Fan et al. (2017) have shown that when $$H$$ is allowed to diverge at an appropriate rate, the Huber estimator of the mean concentrates exponentially fast around the true mean when only a finite second moment exists. In a similar spirit, we allow $$H$$ to grow with $$n$$ in order to alleviate the bias; an appropriate choice of $$H$$ trades off bias against robustness. We build on Fan et al. (2017) and Catoni (2012) in showing that our proposed Huber-type estimator satisfies Conditions 1 and 3.

Proposition 3. Assume $$\max_{1\leq u\leq p}E(X_u^4)=\kappa^2<\infty$$. Let $$\tilde{\Sigma}_{H}=(\tilde{\sigma}^{H}_{uv})$$ be the Huber-type estimator with $$H=K(n/\log p)^{1/2}$$ for $$K\geq 4\kappa(2+L)^{1/2}$$ and $$L>0$$ satisfying $$(2 + L)(\log p)/n < 1/8$$. Under the scaling condition $$(\log p)/n \rightarrow 0$$ we have, for large $$n$$ and a constant $$C>\kappa(1+ 2\max_{u}|\mu^{*}_{u}|)$$,   \begin{align*} {\rm pr}\Big[\underset{u,v}{\max}\: |\tilde{\sigma}_{uv}^H-\sigma_{uv}^{*}|\leq C\{(\log p)/n\}^{1/2}\Big]\geq 1 - \epsilon_{n,p}, \end{align*} where $$\epsilon_{n,p}\leq C_{0}p^{-L}$$ for positive constants $$C_{0}$$ and $$L$$.

Proposition 3 verifies Condition 1 for $$\tilde{\Sigma}_{H}$$, provided that $$H$$ diverges at the appropriate rate. As quantified in Proposition 4, $$\tilde{\Sigma}_{H}$$ also satisfies Condition 3 when $$H$$ grows at the same rate as in Proposition 3. The proof of this result entails extending a large deviation result of Petrov (1995).

Proposition 4. Assume that $$\max_{1\leq u\leq p}E(X_u^4)=\kappa^2<\infty$$ and $$\Omega^{*} \in \mathcal{G}_{q}(c_{n,p},M_{n,p})$$ with $$c_{n,p}=O\{n^{(1-q)/{2}}/(\log p)^{{(3-q)}/{2}} \}$$. Let $$\tilde{\Sigma}_{H}=(\tilde{\sigma}^{H}_{uv})$$ be the Huber-type estimator defined below (10) with $$H=K(n/\log p)^{1/2}$$ for $$K\geq 4\kappa(2+L)^{1/2}$$ and $$L>0$$. Assume that the truncated population covariance matrix $$\Sigma_H$$, with entries $$E\{\!{\rm 1}\kern-0.24em{\rm I} (|X_{u}X_{v}|\leq H)X_uX_v\}$$, satisfies $$\|\Sigma_H\Omega^*-{I}_p\|_{\max}=O[\{(\log p)/n\}^{1/2}]$$. Under the scaling condition $$(\log p)/n^{1/3} =O(1)$$ we have, for large $$n$$ and a constant $$C>\kappa(1+ 2\max_{u}|\mu^{*}_{u}|)$$,   \[ {\rm pr}\Bigl[\max_{u,v}\:\bigl|(\tilde{\Sigma}_{H}\Omega^{*} - I_p)_{uv}\bigr|\leq C\{(\log p)/n\}^{1/2}\Bigr] \geq 1 - \epsilon_{n,p}, \] where $$\epsilon_{n,p}\leq C_0(\log p)^{-1/2}p^{-L}$$ for positive constants $$C_0$$ and $$L$$.

4.3. A median of means estimator

The median of means estimator was proposed by Nemirovsky & Yudin (1983) and has been further studied by Lerasle & Oliveira (2011), Bubeck et al. (2013) and Joly & Lugosi (2016). It is defined as the median of $$M$$ means obtained by partitioning the data into $$M$$ subsamples. A heuristic explanation for its success is that taking means within subsamples yields a more symmetric sample, while taking the median of those means makes the resulting estimator concentrate faster. Our median of means estimator of $$\Sigma^{*}$$ is constructed as $$\tilde{\Sigma}_{M}=(\tilde{\sigma}_{uv}^{M})=(\tilde{\mu}_{uv}^{M} - \tilde{\mu}_{u}^{M}\tilde{\mu}_{v}^{M})$$, where $$\tilde{\mu}_{uv}^{M}$$, $$\tilde{\mu}_{u}^{M}$$ and $$\tilde{\mu}_{v}^{M}$$ are median of means estimators computed from $$(X_{iu}X_{iv})_{i=1}^{n}$$, $$(X_{iu})_{i=1}^{n}$$ and $$(X_{iv})_{i=1}^{n}$$, respectively; in each case, each of the $$M$$ means is computed on a block of a regular partition $$B_{1},\ldots, B_{M}$$ of $$[n]$$. It is assumed that $$M$$ is a factor of $$n$$.
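A compact sketch of this construction, assuming $$M$$ divides $$n$$ as in the text; the blocks of consecutive indices used here are one possible regular partition.

```python
import numpy as np

def median_of_means(z, M):
    # Median of M subsample means over a regular partition of [n] into
    # consecutive blocks of equal size n / M.
    n = z.size
    assert n % M == 0, 'M must be a factor of n'
    return np.median(z.reshape(M, n // M).mean(axis=1))

def mom_covariance(X, M):
    # Median of means pilot estimator: entries mu_uv^M - mu_u^M * mu_v^M.
    n, p = X.shape
    mu = np.array([median_of_means(X[:, u], M) for u in range(p)])
    S = np.empty((p, p))
    for u in range(p):
        for v in range(u, p):
            m = median_of_means(X[:, u] * X[:, v], M)
            S[u, v] = S[v, u] = m - mu[u] * mu[v]
    return S
```

Proposition 5 below suggests taking $$M$$ of order $$\log p$$, rounded so that it divides $$n$$.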
The value of $$M$$ is a tuning parameter that affects the accuracy of the median of means estimator, and its choice involves a compromise between bias and variance. In the extreme cases $$M=n$$ and $$M=1$$, we obtain respectively the sample median and the sample mean. The latter is asymptotically unbiased but does not concentrate exponentially fast in the presence of heavy tails, while the former concentrates exponentially fast but not around the population mean under asymmetric distributions. Proposition 5 gives the range of $$M$$ for which both goals are achieved simultaneously.

Proposition 5. Assume $$\max_{1\leq u\leq p}E(X_u^4)=\kappa^2<\infty$$. Let $$\tilde{\Sigma}_{M}=(\tilde{\sigma}^{M}_{uv})$$ be the median of means estimator described above based on a regular partition $$B_1,\ldots,B_M$$ with $$M=\lceil(2+L)\log p\rceil$$ for a positive constant $$L$$. Under the scaling condition $$(\log p)/n \rightarrow 0$$ we have, for large $$n$$ and a constant $$C>2(6{\rm e})^{1/2}\{\kappa+2\max_{u,v}\mu^{*}_u(\sigma_{vv}^{*})^{1/2}\}$$,   \[ {\rm pr}\Big[\max_{u,v}|\tilde{\sigma}_{uv}^M-\sigma_{uv}^{*}|\leq C\{(\log p)/n\}^{1/2}\Big]\geq 1 - \epsilon_{n,p}, \] where $$\epsilon_{n,p} \leq C_{0} p^{-L}$$ for positive constants $$C_{0}$$ and $$L$$.

5. Infinite kurtosis

In the previous discussion we assumed the existence of fourth moments of $${X}$$ for the Huber-type estimator of § 4. We now relax the condition of boundedness of $$E(X_u^4)$$ to that of $$E(|X_u|^{2+\varepsilon})$$ for some $$\varepsilon>0$$ and all $$u\in[p]$$. The following proposition lays the foundations for the analysis of high-dimensional covariance or precision matrix estimation with infinite kurtosis. It extends Theorem 5 of Fan et al. (2017) and gives rates of convergence for Huber's estimator of $$E(X_u)$$ assuming a bounded $$1+\varepsilon$$ moment for $$\varepsilon\in(0,1]$$. The result is optimal in the sense that our rates match the minimax lower bound given in Theorem 3.1 of Devroye et al. (2016). The rates depend on $$\varepsilon$$, and when $$\varepsilon=1$$ they match those of Catoni (2012) and Fan et al. (2017).

Proposition 6. Let $$\delta\in(0,1)$$, $$\varepsilon\in(0,1]$$ and $$n>12\log(2\delta^{-1})$$, and let $$Z_1,\ldots,Z_n$$ be independent and identically distributed random variables with mean $$\mu$$ and bounded $$1+\varepsilon$$ moment, i.e., $$E(|Z_1-\mu|^{1+\varepsilon})=v<\infty$$. Take $${H=\{vn/\log(2\delta^{-1})\}^{1/(1+\varepsilon)}}$$. Then, with probability at least $$1-\delta$$,   \begin{align*} \tilde{\mu}^H-\mu\leq \frac{7+\surd 2}{2}\,v^{1/(1+\varepsilon)}\left\{\frac{\log(2\delta^{-1})}{n}\right\}^{\varepsilon/(1+\varepsilon)}\!, \end{align*} where $$\tilde{\mu}^H$$ is as defined in § 4.2.

Corollary 1. Under the conditions of Proposition 6, the Huber estimator satisfies  \[ {\rm pr}\!\left[|\tilde{\mu}^H-\mu|\leq \frac{7+\surd 2}{2}v^{1/(1+\varepsilon)}\left\{\frac{\log(2\delta^{-1})}{n}\right\}^{\varepsilon/(1+\varepsilon)}\right]\geq 1-2\delta\text{.} \]

Corollary 1 allows us to generalize the upper bounds for the Huber-type estimator. The following two theorems establish rates of convergence for the adaptive thresholding and adaptively constrained $$\ell_1$$-minimization estimators. While we do not prove that these rates are minimax optimal under $$2+\varepsilon$$ finite moments, the proof builds on the elementwise maximum norm convergence of the pilot estimator, which is optimal by Theorem 3.1 of Devroye et al. (2016),
and the resulting rates for adaptive thresholding match the minimax rates of Cai & Liu (2011) when $$\varepsilon=2$$, a strong indication that the rates are sharp.

Theorem 3. Suppose that Condition 2 is satisfied and assume $$\max_{1\leq u\leq p}E(|X_u|^{2+\varepsilon})\leq \kappa_\varepsilon^2$$. Let $$\hat{\Sigma}^{\mathcal{T}}_H$$ be the adaptive thresholding estimator defined in § 2, based on the Huber pilot estimator $$\tilde{\Sigma}_{H}$$ with $$H=K(n/\log p)^{1/(2+\varepsilon)}$$ for $$K\geq 2^{-1}(7+\surd 2)\kappa_\varepsilon(2+L)^{\varepsilon/(2+\varepsilon)}$$ and $$L>0$$. Under the scaling condition $$\log p=O(n^{1/2})$$ and choosing $$\lambda_{uv}=\lambda\{\tilde{\sigma}^H_{uu}\tilde{\sigma}_{vv}^H(\log p)/n\}^{\varepsilon/(2+\varepsilon)}$$ for some $$\lambda>0$$, we have, for sufficiently large $$n$$,   \[ \inf_{\Sigma^{*}\in\mathcal{U}_q}{\rm pr}\!\left\{\|\hat{\Sigma}_H^{\mathcal{T}}-\Sigma^{*}\|_2\leq Cs_0(p)\left(\frac{\log p}{n}\right)^{{\varepsilon(1-q)}/{(2+\varepsilon)}}\right\}\geq 1-\epsilon_{n,p}, \] where $$\epsilon_{n,p}\leq C_{0}p^{-L}$$ for positive constants $$C_{0}$$ and $$L$$.

Theorem 4. Suppose that Condition 4 is satisfied, $$\max_{1\leq u\leq p}E(|X_u|^{2+\varepsilon})\leq \kappa_\varepsilon^2$$ and $$c_{n,p}=O\{n^{({1-q})/{2}}/(\log p)^{({3-q})/{2}} \}$$. Let $$\tilde{\Omega}_H$$ be the adaptively constrained $$\ell_1$$-minimization estimator defined in § 3, based on the Huber pilot estimator $$\tilde{\Sigma}_{H}$$ with $$H=K(n/\log p)^{1/(2+\varepsilon)}$$ for $$K\geq 2^{-1}(7+\surd 2)\kappa_\varepsilon(2+L)^{\varepsilon/(2+\varepsilon)}$$ and $$L>0$$. Assume that the truncated population covariance matrix $$\Sigma_H$$, with entries $$E\{\!{\rm 1}\kern-0.24em{\rm I}(|X_{u}X_{v}|\leq H)X_uX_v\}$$, satisfies $$\|\Sigma_H\Omega^*-{I}_p\|_{\max}=O[\{(\log p)/n\}^{\varepsilon/(2+\varepsilon)}]$$. Under the scaling condition $$(\log p)/n^{1/3} = O(1)$$, we have, for sufficiently large $$n$$,   \[ \inf_{\Omega^{*}\in\mathcal{G}_q}{\rm pr}\!\left\{\|\tilde{\Omega}_H-\Omega^{*}\|_2\leq C M_{n,p}^{1-q}c_{n,p}\left(\frac{\log p}{n}\right)^{{\varepsilon(1-q)}/{(2+\varepsilon)}}\right\} \geq 1-\epsilon_{n,p}, \] where $$\epsilon_{n,p}\leq C_{0}p^{-L}$$ for positive constants $$C_{0}$$ and $$L$$.

A result similar to Proposition 6 was obtained in Lemma 2 of Bubeck et al. (2013) for the median of means estimator. Expanding on it, we obtain a result analogous to Theorem 3 for the median of means matrix estimator.

6. Finite-sample performance

We illustrate the performance of the estimators discussed in §§ 2 and 3 under a range of data-generating scenarios and for every choice of pilot estimator discussed in § 4. For the adaptive thresholding estimator of $$\Sigma^{*}$$, we use a hard thresholding rule with the entry-dependent thresholds of (3). In each of 500 Monte Carlo replications, $$n=200$$ independent copies of a random vector $${X}$$ of dimension $$p=400$$ are drawn from a model with either a sparse covariance matrix $$\Sigma^{*}$$ or a sparse precision matrix $$\Omega^{*}$$, depending on the experiment. We consider four scenarios for the distribution of $${X}$$: the zero-mean multivariate normal distribution; the $$t$$ distribution with 3$$\cdot$$5 degrees of freedom, which has infinite kurtosis; the skewed $$t$$ distribution with four degrees of freedom and skew parameter equal to 20; and the contaminated skewed $$t$$ distribution (Azzalini, 2005) with four degrees of freedom and skew parameter equal to 10.
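For the heavy-tailed scenario, a sketch of how a multivariate $$t$$ draw with a prescribed covariance might be generated; the scale correction $$(\nu-2)/\nu$$ is the standard adjustment so that the output has covariance Sigma, and the skewed and contaminated variants are omitted.

```python
import numpy as np

def multivariate_t(n, Sigma, df, rng):
    # Multivariate t with df > 2 degrees of freedom and covariance Sigma:
    # the t with scale matrix Sigma * (df - 2) / df has covariance Sigma.
    p = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma * (df - 2.0) / df)
    Z = rng.standard_normal((n, p)) @ L.T      # rows ~ N(0, scale)
    W = rng.chisquare(df, size=n)              # chi-squared mixing variables
    return Z * np.sqrt(df / W)[:, None]

# Example: df = 3.5 reproduces the infinite-kurtosis scenario above.
# X = multivariate_t(200, Sigma_star, 3.5, np.random.default_rng(0))
```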
Data in the last scenario are generated as $${X}=(1-b){Z}_{1} + b{Z}_{2}$$, where $${Z}_{1}\sim P$$, $${Z}_{2}\sim Q$$ and $$b\sim\text{Bi}(1, 0{\cdot}05)$$; here $$P$$ is the skewed $$t$$ distribution generating most of the data, while $$Q$$ is a normal distribution with all components of the mean vector equal to $$-8$$ and covariance matrix equal to the identity. Any unspecified tuning parameters of the adaptive thresholding estimator and the adaptively constrained $$\ell_1$$-minimization estimator are chosen by crossvalidation to minimize the spectral norm error. Unspecified constants in the tuning parameters of the robust pilot estimators are conservatively chosen to be those that would be optimal if the true distribution were a Student $$t$$ distribution with five degrees of freedom.

We consider the following two structures for $$\Sigma^{*}$$ and $$\Omega^{*}$$.

(i) Sparse covariance matrix: similar to Model 2 in the simulation section of Cai & Liu (2011), we take the true covariance matrix to be the block-diagonal matrix $$\Sigma^{*}=\text{blockdiag}(\Sigma^{*}_1,\Sigma^{*}_2)$$, where $$\Sigma^{\ast}_2=4{I}_{p/2 \times p/2}$$, $$\Sigma^{\ast}_1=A+\epsilon {I}_{p/2 \times p/2}$$, $$A=(a_{uv})$$ with independent $$a_{uv}=\text{Un}(0{\cdot}3,0{\cdot}8)\times\text{Bi}(1,0{\cdot}2)$$, and $$\epsilon=\max\{-\lambda_{\min}(A),0\}+0{\cdot}01$$ to ensure that $$\Sigma^{\ast}_1$$ is positive definite; a generation sketch is given after the discussion below.

(ii) Banded precision matrix: following Cai et al. (2016), we take the true precision matrix to be of the banded form $${\Omega}_0=(\omega_{ij})$$, where $$\omega_{ii}=1$$, $$\omega_{i,i+1}=\omega_{i+1,i}=0{\cdot}6$$, $$\omega_{i,i+2}=\omega_{i+2,i}=0{\cdot}3$$ and $$\omega_{ij}=0$$ for $$|i-j|\geq 3$$.

Table 1 shows that, while the sample covariance estimator performs well in the normally distributed case, when the true model departs from normality thresholding this estimator gives poor performance, reflected in elevated estimation error in both the maximum norm and the spectral norm. By contrast, thresholding any of our proposed robust pilot estimators does not suffer under these heavy-tailed distributions. Table 2 shows a similar pattern for the precision matrix estimators; the gains are apparent for all robust pilot estimators, as predicted by our theory.
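As promised above, a sketch of one way to generate the sparse covariance model (i); the symmetrization of $$A$$ and its zero diagonal are our assumptions, since the text leaves these details implicit.

```python
import numpy as np

def sparse_block_covariance(p, rng):
    # Scenario (i): Sigma* = blockdiag(Sigma*_1, Sigma*_2) with
    # Sigma*_1 = A + eps * I, a_uv ~ Un(0.3, 0.8) * Bi(1, 0.2), and
    # Sigma*_2 = 4 * I; eps = max{-lambda_min(A), 0} + 0.01.
    m = p // 2
    A = rng.uniform(0.3, 0.8, (m, m)) * rng.binomial(1, 0.2, (m, m))
    A = np.triu(A, 1)
    A = A + A.T                                # symmetric, zero diagonal (assumed)
    eps = max(-np.linalg.eigvalsh(A).min(), 0.0) + 0.01
    S = np.zeros((p, p))
    S[:m, :m] = A + eps * np.eye(m)
    S[m:, m:] = 4.0 * np.eye(m)
    return S
```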
Table 1. Estimation errors (with standard errors in parentheses) of the adaptive thresholding estimator of $$\Sigma^{\ast}$$ based on four different pilot estimators; values are averaged over 500 replications

Distribution  Error                                       Sample covariance  Adaptive Huber  Median of means  Rank-based
MVN           $$\|\hat{\Sigma} - \Sigma^{*}\|_{2}$$        2·88 (0·04)        2·86 (0·04)     3·31 (0·05)      3·01 (0·07)
MVN           $$\|\tilde{\Sigma} - \Sigma^{*}\|_{\max}$$   0·98 (0·09)        0·92 (0·09)     1·50 (0·14)      1·61 (0·23)
T             $$\|\hat{\Sigma} - \Sigma^{*}\|_{2}$$        8·95 (0·53)        3·92 (0·06)     4·46 (0·24)      5·02 (0·06)
T             $$\|\tilde{\Sigma} - \Sigma^{*}\|_{\max}$$   8·72 (0·55)        1·87 (0·05)     3·35 (0·74)      2·54 (0·04)
ST            $$\|\hat{\Sigma} - \Sigma^{*}\|_{2}$$        7·12 (0·17)        4·88 (0·05)     4·96 (0·06)      5·16 (0·06)
ST            $$\|\tilde{\Sigma} - \Sigma^{*}\|_{\max}$$   6·89 (0·18)        2·41 (0·04)     2·43 (0·04)      2·57 (0·04)
CST           $$\|\hat{\Sigma} - \Sigma^{*}\|_{2}$$        5·47 (0·23)        4·14 (0·06)     4·60 (0·05)      5·13 (0·06)
CST           $$\|\tilde{\Sigma} - \Sigma^{*}\|_{\max}$$   5·07 (0·27)        2·02 (0·05)     2·27 (0·05)      2·56 (0·04)

MVN, the normal distribution; T, the $$t$$ distribution; ST, the skewed $$t$$ distribution; CST, the contaminated skewed $$t$$ distribution.
Table 2. Estimation errors (with standard errors in parentheses) of the adaptively constrained $$\ell_1$$-minimization estimators of $$\Omega^{\ast}$$ based on four different pilot estimators; values are averaged over 500 replications

Distribution  Error                                         Sample covariance  Adaptive Huber  Median of means  Rank-based
MVN           $$\|\hat{\Omega} - \Omega^{*}\|_{2}$$          2·62 (0·01)        2·61 (0·01)     2·59 (0·01)      2·59 (0·01)
MVN           $$\|\tilde{\Sigma}\Omega^{*}-I_p\|_{\max}$$    1·05 (0·09)        1·02 (0·09)     1·85 (0·28)      2·90 (0·55)
T             $$\|\hat{\Omega} - \Omega^{*}\|_{2}$$          2·54 (0·03)        2·26 (0·02)     2·43 (0·02)      2·41 (0·02)
T             $$\|\tilde{\Sigma}\Omega^{*}-I_p\|_{\max}$$    2·66 (3·96)        0·81 (0·03)     1·02 (0·19)      1·01 (0·19)
ST            $$\|\hat{\Omega} - \Omega^{*}\|_{2}$$          2·27 (0·15)        1·97 (0·05)     2·08 (0·08)      2·12 (0·08)
ST            $$\|\tilde{\Sigma}\Omega^{*}-I_p\|_{\max}$$    1·40 (1·59)        0·97 (0·02)     1·05 (0·03)      0·96 (0·02)
CST           $$\|\hat{\Omega} - \Omega^{*}\|_{2}$$          2·65 (0·02)        2·01 (0·04)     2·12 (0·06)      2·10 (0·06)
CST           $$\|\tilde{\Sigma}\Omega^{*}-I_p\|_{\max}$$    9·65 (3·76)        0·97 (0·04)     2·16 (2·19)      0·92 (0·03)

Abbreviations as in Table 1.

7. Real-data example

A gene regulatory network, also known as a pathway, is a set of genes that interact with each other to control a specific cell function. With recent advances in genomic research, many such networks have been discovered and their functions thoroughly studied, and numerous pathways are now catalogued in public databases such as KEGG (Ogata et al., 2000). One popular way to infer a gene regulatory network is through estimation of the precision matrix associated with gene expression (Wit & Abbruzzo, 2015).
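Since network edges are read off the estimated precision matrix, here is a tiny sketch of the kind of selection used below, where the $$k$$ largest off-diagonal entries of an estimate are reported as connections; the function name and the choice to rank by absolute value are ours.

```python
import numpy as np

def top_k_edges(Omega, k=100):
    # Report the k strongest connections: pairs (u, v), u < v, with the
    # largest |omega_uv| in the estimated precision matrix.
    p = Omega.shape[0]
    iu, iv = np.triu_indices(p, k=1)
    strength = np.abs(Omega[iu, iv])
    order = np.argsort(strength)[::-1][:k]
    return list(zip(iu[order], iv[order]))
```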
However, such data often contain outliers. To assess whether our robust estimator can improve inference on gene regulatory networks, we use a microarray dataset and compare our findings with generally acknowledged scientific truth from the genomics literature. The microarray data come from a study by Huang et al. (2011) on the inflammation process of cardiovascular disease, which identified the toll-like receptor signalling pathway as playing a key role in that process. The study involves $$n=48$$ patients, and the data are available from the Gene Expression Omnibus under accession name GSE20129. We consider 95 genes from the toll-like receptor signalling pathway and another 62 genes from the peroxisome proliferator-activated receptor signalling pathway, which is known to be unrelated to cardiovascular disease. A good method should discover connections between genes within each of the pathways but not across them.

We use both the original version of the adaptively constrained $$\ell_1$$-minimization estimator and our robustified version based on the Huber pilot estimator to estimate the precision matrix, and hence the gene regulatory network. We first choose the tuning parameters that deliver the top 100 connections for each method. Table 3 reports the selection results, which are also displayed in Fig. 1. Our robust method identifies more connections within each pathway and fewer connections across the pathways.

Fig. 1. Connections estimated by the adaptively constrained $$\ell_1$$-minimization estimator using (a) the sample covariance and (b) the Huber-type pilot estimator; blue lines represent within-pathway connections and red lines between-pathway connections.

Table 3. Number of connections detected by two types of methods

                     Top 100 connections        Equal tuning parameters
                     Within  Between  Total     Within  Between  Total
Huber estimator        60      40      100        27      15       42
Sample covariance      55      45      100        55      45      100

We also ran both procedures with the same tuning parameter in the constrained $$\ell_1$$-minimization step (8); Table 3 gives the results. Our estimator detects fewer connections overall, but the proportion of within-pathway connections obtained with the Huber pilot estimator is much higher than that obtained with the sample covariance. If the genomics literature is correct, our results show that use of the Huber pilot estimator improves inference for this example, in which heavy tails and skewness are present.
Acknowledgement

This work was partially supported by the U.S. National Science Foundation and the National Institutes of Health. Avella-Medina was partially supported by the Swiss National Science Foundation, and Battey by the U.K. Engineering and Physical Sciences Research Council. The authors thank the editor, the associate editor and three referees for valuable comments.

Supplementary material

Supplementary material available at Biometrika online includes the proofs of all propositions and theorems and additional plots for the real-data example.

References

Antoniadis, A. & Fan, J. (2001). Regularization of wavelet approximations. J. Am. Statist. Assoc. 96, 939–57.
Azzalini, A. (2005). The skew-normal distribution and related multivariate families. Scand. J. Statist. 32, 159–200.
Bickel, P. J. & Levina, E. (2008). Covariance regularization by thresholding. Ann. Statist. 36, 2577–604.
Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. Cambridge: Cambridge University Press.
Bubeck, S., Cesa-Bianchi, N. & Lugosi, G. (2013). Bandits with heavy tail. IEEE Trans. Info. Theory 59, 7711–7.
Cai, T. T. & Liu, W. (2011). Adaptive thresholding for sparse covariance matrix estimation. J. Am. Statist. Assoc. 106, 672–84.
Cai, T. T., Liu, W. & Luo, X. (2011). A constrained $$\ell_1$$-minimization approach to sparse precision matrix estimation. J. Am. Statist. Assoc. 106, 594–607.
Cai, T. T., Liu, W. & Zhou, H. (2016). Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. Ann. Statist. 44, 455–88.
Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. Henri Poincaré Prob. Statist. 48, 1148–85.
Chen, M., Gao, C. & Ren, Z. (2015). Robust covariance matrix estimation via matrix depth. arXiv:1506.00691.
Devroye, L., Lerasle, M., Lugosi, G. & Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Ann. Statist. 44, 2695–725.
Fan, J., Han, F., Liu, H. & Vickers, B. (2016a). Robust inference of risks of large portfolios. J. Economet. 194, 298–308.
Fan, J., Li, Q. & Wang, Y. (2017). Estimation of high-dimensional mean regression in the absence of symmetry and light-tail assumptions. J. R. Statist. Soc. B 79, 247–65.
Fan, J., Liao, Y. & Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. J. R. Statist. Soc. B 75, 603–80.
Fan, J., Liu, H. & Wang, W. (2015). Large covariance estimation through elliptical factor models. arXiv:1507.08377.
Fan, J., Wang, W. & Zhong, Y. (2016b). Robust covariance estimation for approximate factor models. arXiv:1602.00719.
Grant, M. & Boyd, S. (2014). CVX: Matlab software for disciplined convex programming, version 2.1.
Huang, C.-C., Liu, K., Pope, R. M., Du, P., Lin, S., Rajamannan, N. M., Huang, Q.-Q., Jafari, N., Burke, G. L., Post, W. et al. (2011). Activated TLR signaling in atherosclerosis among women with lower Framingham risk score: The multi-ethnic study of atherosclerosis. PLoS One 6, e21067.
Huber, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35, 73–101.
Huber, P. & Ronchetti, E. (2009). Robust Statistics, 2nd edn. Hoboken, New Jersey: Wiley.
Joly, E. & Lugosi, G. (2016). Robust estimation of U-statistics. Stoch. Proces. Appl. 126, 3760–73.
Lerasle, M. & Oliveira, R. (2011). Robust empirical mean estimators. arXiv:1112.3914.
Liu, H., Han, F., Yuan, M., Lafferty, J. & Wasserman, L. (2012). High-dimensional semiparametric Gaussian copula graphical models. Ann. Statist. 40, 2293–326.
Loh, P. L. & Tan, X. L. (2015). High-dimensional robust precision matrix estimation: Cellwise corruption under $$\epsilon$$-contamination. arXiv:1509.07229.
Nemirovsky, A. S. & Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Hoboken, New Jersey: Wiley.
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. & Kanehisa, M. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30.
Petrov, V. (1995). Limit Theorems of Probability Theory. Oxford: Clarendon Press.
Ren, Z., Sun, T., Zhang, C.-H. & Zhou, H. (2015). Asymptotic normality and optimalities in estimation of large Gaussian graphical models. Ann. Statist. 43, 991–1026.
Rothman, A., Bickel, P., Levina, E. & Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Statist. 2, 494–515.
Rothman, A., Levina, E. & Zhu, J. (2009). Generalized thresholding of large covariance matrices. J. Am. Statist. Assoc. 104, 177–86.
Wit, E. C. & Abbruzzo, A. (2015). Inferring slowly-changing dynamic gene-regulatory networks. BMC Bioinformatics 16, S5.
Xue, L. & Zou, H. (2012). Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Ann. Statist. 40, 2541–71.

© 2018 Biometrika Trust
