# Two-sample tests of high-dimensional means for compositional data

Two-sample tests of high-dimensional means for compositional data Summary Compositional data are ubiquitous in many scientific endeavours. Motivated by microbiome and metagenomic research, we consider a two-sample testing problem for high-dimensional compositional data and formulate a testable hypothesis of compositional equivalence for the means of two latent log basis vectors. We propose a test through the centred log-ratio transformation of the compositions. The asymptotic null distribution of the test statistic is derived and its power against sparse alternatives is investigated. A modified test for paired samples is also considered. Simulations show that the proposed tests can be significantly more powerful than tests that are applied to the raw and log-transformed compositions. The usefulness of our tests is illustrated by applications to gut microbiome composition in obesity and Crohn’s disease. 1. Introduction Compositional data, which belong to a simplex, are ubiquitous in scientific disciplines such as geology, economics and genomics. This paper is motivated by microbiome and metagenomic research, where relative abundances of hundreds to thousands of bacterial taxa on a few tens or hundreds of individuals are available for analysis (The Human Microbiome Project Consortium, 2012). Due to varying amounts of DNA-generating material across different samples, sequencing read counts are often normalized to relative abundances; the resulting data are therefore compositional (Li, 2015). One fundamental problem in microbiome data analysis is to test whether two populations have the same microbiome composition, which can be viewed as a two-sample testing problem for high-dimensional compositional data. Since the components of a composition must sum to unity, directly applying standard multivariate statistical methods intended for unconstrained data to compositional data may result in inappropriate or misleading inferences (Aitchison, 2003). Various methods for compositional data analysis have been developed since the seminal work of Aitchison (1982). Most existing methods for two-sample testing, however, deal only with the low-dimensional setting where the dimensionality is smaller than the sample size; see, for example, the generalized likelihood ratio tests discussed in Aitchison (2003, § 7$$.$$5). In this paper, we consider the two-sample testing problem for high-dimensional compositional data, where compositions in the $$(p-1)$$-dimensional simplex $$\mathcal{S}^{p-1}$$ are thought of as arising from latent basis vectors in the $$p$$-dimensional positive orthant $$\mathbb{R}_+^p$$. In microbiome studies, the basis components may represent the true abundances of bacterial taxa in a microbial community such as the gut of a healthy individual (Li, 2015). To circumvent the nonidentifiability issue associated with the basis vectors, we formulate a testable hypothesis of compositional equivalence for the means of two log basis vectors. We then propose a test through the centred log-ratio transformation of the compositions. The proposed test is therefore scale-invariant, which is crucial for compositional data analysis. We emphasize here the extrinsic analysis point of view in compositional data analysis (Aitchison, 1982), which leads to biologically meaningful interpretations, in contrast to intrinsic analysis, where interest lies solely in the composition. Classical extrinsic analysis however, primarily concerns problems where the bases are observed, and thus differs radically from the focus of this paper. The development of tests for the equality of two high-dimensional means has received much attention; see, for instance, Bai & Saranadasa (1996), Srivastava & Du (2008), Srivastava (2009), Chen & Qin (2010) and Cai et al. (2014). Such tests, however, are not directly applicable to high-dimensional compositional data because the required regularity conditions are generally not met. For example, the covariance matrix of compositional data is singular, thereby violating the usual assumptions on the eigenvalues of the covariance matrix, such as those in Cai et al. (2014). Our assumptions are imposed on the latent log basis vectors, which are free of the simplex constraint. We show that, under mild conditions, the centred log-ratio variables satisfy certain desired properties, which guarantee the validity of the proposed test. Then the asymptotic null distribution of the test statistic is derived and the power of the test against sparse alternatives is investigated. The proposed two-sample test is further modified to accommodate paired samples. All proofs are deferred to the Appendix. 2. A testable hypothesis of compositional equivalence Denote by $$X^{(k)}=(X_1^{(k)},\dots,X_{n_k}^{(k)})^{{\mathrm{\scriptscriptstyle T}}}$$ the observed $$n_k\times p$$ data matrices for group $$k$$ ($$k=1,2$$), where the $$X_i^{(k)}$$ represent compositions that lie in the $$(p-1)$$-dimensional simplex $$\mathcal{S}^{p-1}=\{(x_1,\dots,x_p):x_j>0\:(\,j=1,\dots,p),\,\sum_{j=1}^p x_j=1\}$$. We assume that the compositional variables arise from a vector of latent variables, which we call the basis. For microbiome data, the basis components may refer to the true abundances of bacterial taxa in a microbial community. Let $$W^{(k)}=(W_1^{(k)},\dots,W_{n_k}^{(k)})^{{\mathrm{\scriptscriptstyle T}}}$$ denote the $$n_k\times p$$ matrices of unobserved bases, which generate the observed compositional data via the normalization   \begin{align}\label{eq:norm} X_{ij}^{(k)}=W_{ij}^{(k)}\bigg/\sum_{\ell=1}^p W_{i\ell}^{(k)}\quad (i=1,\dots,n_k;\: j=1,\dots,p;\: k=1,2), \end{align} (1) where $$X_{ij}^{(k)}$$ and $$W_{ij}^{(k)}>0$$ are the $$j$$th components of $$X_i^{(k)}$$ and $$W_i^{(k)}$$, respectively. Denote by $$Z_i^{(k)}=(Z_{i1}^{(k)},\dots,Z_{ip}^{(k)})^{{\mathrm{\scriptscriptstyle T}}}$$ the log basis vectors, where $$Z_{ij}^{(k)}=\log W_{ij}^{(k)}$$. Suppose that $$Z_1^{(k)},\dots,Z_{n_k}^{(k)}$$ ($$k=1,2$$) are two independent samples, each from a distribution with mean $$\mu_k=(\mu_{k1},\dots,\mu_{kp})^{{\mathrm{\scriptscriptstyle T}}}$$ and having common covariance matrix $$\Omega=(\omega_{ij})$$. One might attempt to test the hypotheses   $$\label{eq:hypo_basis} H_0:\mu_1=\mu_2\quad\text{versus}\quad H_1:\mu_1=\hspace{-8pt}|\,\,\mu_2\text{.}$$ (2) These hypotheses, however, are not testable through the observed compositional data $$X^{(k)}$$ ($$k=1,2$$). Clearly, a basis is determined by its composition only up to a multiplicative factor, and the set of bases giving rise to a composition $$x\in\mathcal{S}^{p-1}$$ forms the equivalence class $$\mathcal{W}(x)=\{(tx_1,\dots,tx_p):t>0\}$$ (Aitchison, 2003, p. 32). As an immediate consequence, a log basis vector is determined by the resulting composition only up to an additive constant, and the set of log basis vectors corresponding to $$x$$ constitutes the equivalence class $$\mathcal{Z}(x)=\{(\log x_1+c,\dots,\log x_p+c):c\in\mathbb{R}\}$$. We therefore introduce the following definition. Definition 1. Two log basis vectors $$z_1$$ and $$z_2$$ are said to be compositionally equivalent if their components differ by a constant $$c\in\mathbb{R}$$, i.e., $$z_1=z_2+c1_p$$, where $$1_p$$ is the $$p$$-vector of ones.  Now, instead of testing the hypotheses in (2), we propose to test   $$\label{eq:hypo_equiv} H_0:\mu_1=\mu_2+c1_p\text{ for some }c\in\mathbb{R}\quad\text{versus}\quad H_1:\mu_1=\hspace{-8pt}|\,\,\mu_2+c1_p\text{ for any }c\in\mathbb{R},$$ (3) which are testable using only the observed compositional data. Clearly, $$H_0$$ in (2) implies $$H_0$$ in (3), so that rejecting the latter would lead to rejection of the former. Note, however, that $$H_0$$ in (3) neither implies nor is implied by $$E(X_1^{(1)})=E(X_1^{(2)})$$ or $$E(\log X_1^{(1)})=E(\log X_1^{(2)})$$. We do not consider the latter two hypotheses because they are not scale-invariant, whereas we will derive in § 3.1 an equivalent form of $$H_0$$ in (3) from which its scale-invariance will be obvious. Moreover, these two hypotheses do not allow us to obtain biological interpretations in terms of the true underlying abundances. 3. Tests for compositional equivalence 3.1. The centred log-ratio transformation and an equivalent hypothesis The unit-sum constraint entails that compositional variables must not vary independently, making many covariance-based multivariate analysis methods inapplicable. Aitchison (1982) proposed to relax the constraint by performing statistical analysis through log ratios. Among various forms of log-ratio transformations, the centred log-ratio transformation has attractive features and has been widely used. For the observed compositional data $$X^{(k)}$$ ($$k=1,2$$), the centred log-ratio matrices $$Y^{(k)}=(Y_1^{(k)},\dots,Y_{n_k}^{(k)})^{{\mathrm{\scriptscriptstyle T}}}$$ are defined by   $$\label{eq:clr_entry} Y_{ij}^{(k)}=\log\bigl\{X_{ij}^{(k)}/g(X_i^{(k)})\bigr\}\quad (i=1,\dots,n_k;\: j=1,\dots,p;\: k=1,2),$$ (4) where $$g(x)=\big(\prod_{i=1}^p x_i\big)^{1/p}$$ denotes the geometric mean of a vector $$x=(x_1,\dots,x_p)^{{\mathrm{\scriptscriptstyle T}}}$$. The relationship (4) can be expressed in the matrix form   $$\label{eq:clr_matrix} Y_i^{(k)}=G\log X_i^{(k)},$$ (5) where $$G=I_p-p^{-1}1_p1_p^{{\mathrm{\scriptscriptstyle T}}}$$. Let $$\nu_k=E(Y_1^{(k)})$$ ($$k=1,2$$). In view of the scale-invariance of the centred log ratios, we can replace $$X_i^{(k)}$$ by $$W_i^{(k)}$$ in (5) and obtain   $$\label{eq:clr_basis} Y_i^{(k)}=GZ_i^{(k)}$$ (6) and hence   $$\label{eq:clr_mean} \nu_k=G\mu_k\text{.}$$ (7) The matrix $$G$$ has rank $$p-1$$ and hence has a null space of dimension 1, i.e., $$\mathcal{N}(G)\equiv\{x\in\mathbb{R}^p:Gx=0\}=\{c1_p:c\in\mathbb{R}\}$$. As a result, $$\nu_1=\nu_2$$ if and only if $$\mu_1=\mu_2+c1_p$$ for some $$c\in\mathbb{R}$$. Therefore, testing the hypotheses in (3) is equivalent to testing   $$\label{eq:hypo_clr} H_0:\nu_1=\nu_2\quad\text{versus}\quad H_1:\nu_1=\hspace{-8pt}|\,\,\nu_2\text{.}$$ (8) Despite this equivalence, the hypotheses in (3) are meaningful only when the bases exist, which is the case in microbiome studies. On the other hand, the hypotheses in (8) concern only the compositions through the centred log ratios, from which their scale-invariance and testability using the observed compositional data are evident. 3.2. A two-sample test for compositional equivalence A natural test statistic for testing $$H_0$$ in (8), and hence $$H_0$$ in (3), would be based on the differences $$\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)}$$, where $$\bar{Y}_j^{(k)}=n_k^{-1}\sum_{i=1}^{n_k}Y_{ij}^{(k)}$$ are the sample means of the centred log ratios. Moreover, it is well known that tests against sparse alternatives based on maximum-type statistics are generally more powerful than those based on sum-of-squares-type statistics (Cai et al., 2014). Since in microbiome studies we are mainly interested in the sparse setting where only a small number of taxa may have different mean abundances, we consider the test statistic   $$\label{eq:M_n} M_n=\frac{n_1n_2}{n_1+n_2}\max_{1{\leqslant} j{\leqslant} p}\frac{(\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)})^2}{\hat\gamma_{jj}},$$ (9) where $$\hat\gamma_{jj}=(n_1+n_2)^{-1}\sum_{k=1}^2\sum_{i=1}^{n_k}\big(Y_{ij}^{(k)}-\bar{Y}_j^{(k)}\big)^2$$ are the pooled-sample centred log-ratio variances. The asymptotic properties of $$M_n$$ will be investigated in detail in § 4. Under suitable conditions on the log basis variables $$Z_{1j}^{(k)}$$, we will show that the centred log-ratio variables $$Y_{1j}^{(k)}$$ are only weakly dependent and satisfy certain desired properties. As a result, $$M_n-2\log p+\log\log p$$ converges in distribution to a Type I extreme value or Gumbel variable; see Theorem 1. We can then define the asymptotic $$\alpha$$-level test   $$\label{eq:test} \Phi_\alpha=I(M_n{\geqslant} q_\alpha+2\log p-\log\log p),$$ (10) where $$q_{\alpha}=-\log\pi-2\log\log(1-\alpha)^{-1}$$ is the $$(1-\alpha)$$-quantile of the Gumbel distribution. The null hypothesis $$H_0$$ in (3), or equivalently (8), is rejected whenever $$\Phi_\alpha=1$$. Although $$M_n$$ is similar to the test statistic $$M_I$$ defined in Cai et al. (2014), the theoretical analysis is radically different, in that our assumptions are not imposed on the observed variables. Besides, the test statistic based on a linear transformation by the precision matrix that Cai et al. (2014) proposed is not considered here, because the covariance matrix of $$Y_1^{(k)}$$ is singular and so its precision matrix is not well defined. 3.3. A paired test for compositional equivalence So far we have been concerned with two independent samples. In practice, however, one may be interested in comparing compositions on the same sample before and after treatment. For such paired samples, the proposed test requires only slight modifications. Now we observe a paired sample $$(X_{ij}^{(1)},X_{ij}^{(2)})$$ ($$i=1,\dots,n;\, j=1,\dots,p$$), which is generated by the basis $$(W_{ij}^{(1)},W_{ij}^{(2)})$$; the log basis $$(Z_{ij}^{(1)},Z_{ij}^{(2)})$$ and the centred log ratios $$(Y_{ij}^{(1)},Y_{ij}^{(2)})$$ are the same as defined previously. Write $$D_{ij}=Y_{ij}^{(1)}-Y_{ij}^{(2)}$$ and $$\bar{D}_j=n^{-1}\sum_{i=1}^nD_{ij}$$. To test $$H_0$$ in (3) or equivalently (8), we propose the test statistic   $\tilde{M}_n=n\max_{1{\leqslant} j{\leqslant} p}\bar{D}_j^2/\tilde\gamma_{jj},$ where $$\tilde\gamma_{jj}=n^{-1}\sum_{i=1}^n(D_{ij}-\bar{D}_j)^2$$ are the sample variances of $$D_{ij}$$. Note that $$\tilde{M}_n$$ is different from $$M_n$$ defined in (9) only in the variance estimates. Under appropriate assumptions on the log basis differences $$\Delta_j=Z_{1j}^{(1)}-Z_{1j}^{(2)}$$, similar to Conditions 1–5 below, we can show that $$\tilde{M}_n-2\log p+\log\log p$$ converges in distribution to the same Gumbel variable as in Theorem 1. Hence the test $$\Phi_\alpha$$ defined in (10) is still valid with $$M_n$$ replaced by $$\tilde{M}_n$$. 4. Theoretical results 4.1. Assumptions and implications As we are interested in testing the latent basis structures, we will impose conditions directly on the log basis variables. Under the assumption of a common basis covariance matrix, the two populations have a common centred log-ratio covariance matrix $$\Gamma={\rm cov}(Y_1^{(k)})$$ ($$k=1,2$$), which, in light of (6), is given by   \begin{equation*}\label{eq:cov_rel} \Gamma=G\Omega G^{{\mathrm{\scriptscriptstyle T}}}\text{.} \end{equation*} Denote the correlation matrices of $$Z_1^{(k)}$$ and $$Y_1^{(k)}$$ by $$R=(\rho_{ij})$$ and $$T=(\tau_{ij})$$, respectively. We first impose the following conditions on the covariance structures of the log basis variables: Condition 1. $$1/\kappa_1{\leqslant}\omega_{jj}{\leqslant}\kappa_1$$ for $$j=1,\dots,p$$ and some constant $$\kappa_1>0$$; Condition 2. $$\max_{1{\leqslant} i<j{\leqslant} p}|\rho_{ij}|{\leqslant} r_1$$ for some constant $$0<r_1<1$$; Condition 3. $$\max_{1{\leqslant} j{\leqslant} p}\sum_{i=1}^p\rho_{ij}^2{\leqslant} r_2$$ for some constant $$r_2>0$$. Conditions 1–3 are mild and standard in the high-dimensional testing literature. Condition 1 requires that the variances be bounded away from zero and infinity. Condition 2 is mild since $$\max_{1{\leqslant} i<j{\leqslant} p}|\rho_{ij}|=1$$ would imply that $$\Omega$$ is singular. Condition 3 is weaker than the usual assumption that the maximum eigenvalue of $$R$$ is bounded. Under Conditions 1–3, the following proposition shows that similar properties are satisfied by the centred log-ratio covariance matrix $$\Gamma$$ and correlation matrix $$T$$. Proposition 1. Assume that Conditions 1–3 hold. Then, for sufficiently large $$p$$, the centred log-ratio covariance matrix $$\Gamma$$ and correlation matrix $$T$$ satisfy the following properties: (i) $$1/\kappa_2{\leqslant}\gamma_{jj}{\leqslant}\kappa_2$$for$$j=1,\dots,p$$and some constant$$\kappa_2>0$$; (ii) $$\max_{1{\leqslant} i<j{\leqslant} p}|\tau_{ij}|{\leqslant} r_3$$for some constant$$0<r_3<1$$; and  (iii) $$\max_{1{\leqslant} j{\leqslant} p}\sum_{i=1}^p\tau_{ij}^2{\leqslant} r_4$$for some constant$$r_4>0$$. We also need a moment condition on the log basis variables and a restriction on the dimensionality. Condition 4. There exist constants $$\eta,K>0$$ such that   $E\bigl[\exp\bigl\{\eta(Z_{1j}^{(k)}-\mu_{kj})^2/\omega_{jj}\bigr\}\bigr]{\leqslant} K\quad (\,j=1,\dots,p;\: k=1,2)\text{.}$ Condition 5. We have $$n_1\asymp n_2\asymp n$$ and $$\log p=o(n^{1/3})$$, where $$n=n_1n_2/(n_1+n_2)$$. Condition 4 is a popular sub-Gaussian tail assumption that can easily be relaxed to the case of polynomial tails. It allows us to establish the following concentration properties for the centred log-ratio variables and the pooled-sample variances. Proposition 2. Under Conditions 1 and 3–5, the centred log-ratio variables satisfy  $$\label{eq:rate1} \max_{i,j,k}\big|Y_{ij}^{(k)}-\nu_{kj}\big|/\gamma_{jj}^{1/2}=o_{\rm p}(n^{1/2}/\log p),$$ (11)and the pooled-sample centred log-ratio variances $$\hat\gamma_{jj}$$ satisfy  $$\label{eq:rate2} \max_j|\hat\gamma_{jj}-\gamma_{jj}|/\gamma_{jj}=O_{\rm p}\{(\log p/n)^{1/2}\}\text{.}$$ (12) 4.2. Asymptotic properties of the two-sample test We are now in a position to state our main results concerning the asymptotic properties of the proposed two-sample test. The validity of the test relies on the fact that certain desired properties of the centred log-ratio variables can be related to those of the log basis variables, which have been established in Propositions 1 and 2. The following theorem derives the asymptotic null distribution of $$M_n$$ defined in (9). Theorem 1. Under Conditions 1–5 we have, under$$H_0$$in (3) or equivalently (8),   ${\rm pr}(M_n-2\log p+\log\log p{\leqslant} t)\to\exp\{-\pi^{-1/2}\exp(-t/2)\},\quad t \in \mathbb{R},$ as$$n,p\to\infty$$. Theorem 1 shows that the test $$\Phi_\alpha$$ defined in (10) is indeed asymptotically of level $$\alpha$$. To study the asymptotic power of the test, we consider the alternative   $$\label{eq:alt_basis} H_1:\mu_{1j}=\hspace{-8pt}|\,\,\mu_{2j}+c,\quad j\in S;\quad\mu_{1j}=\mu_{2j}+c,\quad j\in S^{\rm c}$$ (13) for some $$c\in\mathbb{R}$$ and $$S\subset\{1,\dots,p\}$$ with cardinality $$s$$, where $$S^{\rm c}$$ denotes the complement of $$S$$. This alternative, however, is difficult to analyse since $$c$$ is unknown. We now eliminate the constant $$c$$ and connect $$H_1$$ in (13) to a more convenient form in terms of $$\nu_1$$ and $$\nu_2$$. Without loss of generality, define the signal vector $$\delta=(\delta_1,\dots,\delta_p)^{{\mathrm{\scriptscriptstyle T}}}$$ by   \label{eq:alt_basis_diff} \begin{aligned} \mu_{1j}-\mu_{2j}-c=\delta_j\omega_{jj}^{1/2}\left(\frac{\log p}{n}\right)^{1/2}\quad(\,j=1,\dots,p), \end{aligned} (14) where the scaling factor $$\omega_{jj}^{1/2}(\log p/n)^{1/2}$$ is introduced for technical reasons which will become clear in the proof of Theorem 2. Under $$H_1$$ in (13), we have $$\delta_j=\hspace{-8pt}|\,\,0$$ if and only if $$j\in S$$. Summing the equations in (14) and rearranging, we obtain   $c=\bar\mu_1-\bar\mu_2-\frac{1}{p}\sum_{j=1}^p\delta_j\omega_{jj}^{1/2}\left(\frac{\log p}{n}\right)^{1/2} =\bar\mu_1-\bar\mu_2-O\!\left\{\frac{\|\delta\|_1}{p}\left(\frac{\log p}{n}\right)^{1/2}\right\}\!,$ where $$\bar\mu_k=p^{-1}\sum_{j=1}^p\mu_{kj}$$ ($$k=1,2$$), $$\|\delta\|_1=\sum_{j=1}^p|\delta_j|$$, and we have used the fact that $$\max_j\omega_{jj}=O(1)$$ by Condition 1. Since $$\nu_{kj}=\mu_{kj}-\bar\mu_k$$ ($$k=1,2$$) by (7), we see that $$H_1$$ in (13) implies   \label{eq:alt_clr} \begin{aligned} \nu_{1j}-\nu_{2j}&=\left\{\delta_j\omega_{jj}^{1/2}+O\!\left(\frac{\|\delta\|_1}{p}\right)\right\}\left(\frac{\log p}{n}\right)^{1/2},&&j\in S,\\ \nu_{1j}-\nu_{2j}&=O\!\left\{\frac{\|\delta\|_1}{p}\left(\frac{\log p}{n}\right)^{1/2}\right\}\!,&&j\in S^{\rm c}\text{.} \end{aligned} (15) Compared with the usual sparse alternatives in the literature, such as in Cai et al. (2014), all components in the alternative (15) are shifted by a term of order $$O\{\|\delta\|_1p^{-1}(\log p/n)^{1/2}\}$$. To prevent this term from interfering with signals of order at least $$O\{(\log p/n)^{1/2}\}$$, it suffices to assume that $$\|\delta\|_1=o(p)$$. This key observation leads to the following theorem concerning the asymptotic power of $$\Phi_\alpha$$ defined in (10). Theorem 2. Assume that Conditions 1 and 3–5 hold. Under$$H_1$$in (13), if$$\|\delta\|_1=o(p)$$and$$\max_{j\in S}|\delta_j|{\geqslant}\surd{2}+{\varepsilon}$$for some constant$${\varepsilon}>0$$, then$${\rm pr}(\Phi_\alpha=1)\to1$$as$$n,p\to\infty$$. Two remarks on Theorem 2 are in order. First, if the signals $$\delta_i$$ are bounded, then the condition $$\|\delta\|_1=o(p)$$ holds provided the alternative (13) is sparse in the sense that $$s=o(p)$$. Second, by Theorem 3 of Cai et al. (2014), the condition $$\max_{j\in S}|\delta_j|{\geqslant}\surd{2}+{\varepsilon}$$ is minimax rate-optimal for testing sparse alternatives in the classical two-sample testing problem. Thus, the proposed test achieves the best possible rate even though the bases are not observed. 5. Simulation studies We conducted simulation studies to evaluate the numerical performance of the proposed two-sample and paired tests. For comparison, we consider counterparts applied to the raw and log-transformed compositions, which are obtained by replacing $$Y^{(k)}$$ in the proposed tests with $$X^{(k)}$$ and $$\log X^{(k)}$$, respectively. The oracle tests based on the unobserved $$W^{(k)}$$, though impracticable, serve as the benchmarks for comparison. We first examine the case of two independent samples. The simulated data were generated as follows. We first generated $$Z^{(k)}$$ from the following distributions: (i) multivariate normal distribution, $$Z_i^{(k)}\sim N_p(\tilde\mu_k,\Omega)$$; (ii) multivariate gamma distribution, $$Z_i^{(k)}=\tilde\mu_k+FU_i^{(k)}/\surd10$$, where $$F$$ is a $$p\times p$$ matrix $$F=QS^{1/2}$$ with $$Q$$ and $$S$$ obtained from the singular value decomposition $$\Omega=QSQ^{{\mathrm{\scriptscriptstyle T}}}$$, and the components of $$U_k$$ were generated independently from the standard gamma distribution with shape parameter 10. Then $$W^{(k)}$$ and $$X^{(k)}$$ were generated through the transformation $$W_{ij}^{(k)}=\exp(Z_{ij}^{(k)})$$ and (1). Note that $$\mu_k=\tilde\mu_k$$ in case (i) and $$\mu_k=\tilde\mu_k+\surd10\, F1_p$$ in case (ii). The location parameters $$\tilde\mu_k$$ were set as follows. The components of $$\tilde\mu_1$$ were drawn from a uniform distribution $${\rm Un}(0,10)$$. Under $$H_0$$, we took $$\tilde\mu_2=\tilde\mu_1$$; under $$H_1$$, we took   $\tilde\mu_{2j}=\tilde\mu_{1j}-\delta_j\omega_{jj}^{1/2}\left(\frac{\log p}{n}\right)^{1/2},$ where the signal vector $$\delta$$ has support of size $$s=\lfloor0{\cdot}05p\rfloor, \lfloor0{\cdot}1p\rfloor$$ or $$\lfloor0{\cdot}5p\rfloor$$, with the indices chosen uniformly from $$\{1,\dots,p\}$$ and the nonzero $$\delta_j$$ drawn from $${\rm Un}[-2\surd2,2\surd2]$$. We considered the following covariance structures: (i) banded covariance, $$\Omega=D^{1/2}AD^{1/2}$$, where $$A$$ has nonzero entries $$a_{jj}=1$$ and $$a_{j-1,j}=a_{j,j+1}=-0{\cdot}5$$, and $$D$$ is a diagonal matrix with entries drawn from $${\rm Un}(1,3)$$; (ii) sparse covariance, $$\Omega={\rm diag}(A_1,A_2)$$ with $$A_1=B+{\varepsilon} I_q$$ and $$A_2=I_{p-q}$$, where $$q=\lfloor 3p^{1/2}\rfloor$$, $$B$$ is a symmetric matrix with lower-triangular entries drawn from the uniform distribution on $$[-1,-0{\cdot}5]\cup[0{\cdot}5,1]$$ with probability $$0{\cdot}5$$ and set to 0 with probability $$0{\cdot}5$$, and $${\varepsilon}=\max\{-\lambda_{\min}(B),0\}+0{\cdot}05$$ with $$\lambda_{\min}(\cdot)$$ denoting the smallest eigenvalue of a matrix. The simulation settings for the case of paired samples are similar, except that $$Z_i^{(1)}$$ and $$Z_i^{(2)}$$ are correlated and must be generated from a $$2p$$-dimensional joint distribution. The parameters $$(\tilde\mu^*,\Omega^*)$$ of the joint distribution were specified by $$\tilde\mu^*=(\tilde\mu_1^{{\mathrm{\scriptscriptstyle T}}},\tilde\mu_2^{{\mathrm{\scriptscriptstyle T}}})^{{\mathrm{\scriptscriptstyle T}}}$$ and   $\Omega^*=\begin{pmatrix} 1 & 0{\cdot}3\\ 0{\cdot}3 & 1 \end{pmatrix}\otimes\Omega\text{.}$ We took the sample sizes to be $$n_1=n_2=100$$ for two independent samples and $$n=100$$ for paired samples, with varying dimensions $$p=50,100$$ and $$200$$. We repeated the simulation 1000 times for each setting and calculated the empirical sizes and powers of four tests with significance level $$\alpha=0{\cdot}05$$. The results for two independent samples and paired samples are summarized in Tables 1 and 2. The proposed test has higher power than the tests applied to the raw and log-transformed compositions, and it controls the size reasonably well around the nominal level 0$$\cdot$$05 and closely mimics the performance of the oracle test. Its power gains over the tests based on log-transformed and raw compositions tend to be more pronounced in the more challenging scenarios with moderate dimensions and sparse signals. Its superiority does not seem to depend on the distributions or covariance structures. Table 1. Empirical sizes and powers (%) of two-sample tests with $$\alpha=0{\cdot}05$$ and $$n_1=n_2=100$$ based on $$1000$$ replications        Banded covariance  Sparse covariance     Method  $$p=50$$  $$p=100$$  $$p=200$$  $$p=50$$  $$p=100$$  $$p=200$$  Normal, $$H_0$$  Oracle  4$$\cdot$$7  5$$\cdot$$3  4$$\cdot$$8  4$$\cdot$$0  4$$\cdot$$8  5$$\cdot$$2     Proposed  4$$\cdot$$6  4$$\cdot$$9  5$$\cdot$$1  3$$\cdot$$7  4$$\cdot$$5  5$$\cdot$$3     Log  3$$\cdot$$9  5$$\cdot$$1  5$$\cdot$$3  3$$\cdot$$5  3$$\cdot$$3  5$$\cdot$$2     Raw  0$$\cdot$$9  1$$\cdot$$0  0$$\cdot$$3  1$$\cdot$$5  1$$\cdot$$0  1$$\cdot$$3  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  38$$\cdot$$2  70$$\cdot$$7  92$$\cdot$$5  40$$\cdot$$1  70$$\cdot$$7  91$$\cdot$$8     Proposed  36$$\cdot$$5  70$$\cdot$$5  92$$\cdot$$2  38$$\cdot$$0  70$$\cdot$$2  91$$\cdot$$0     Log  26$$\cdot$$1  60$$\cdot$$8  84$$\cdot$$7  25$$\cdot$$5  51$$\cdot$$4  70$$\cdot$$7     Raw  4$$\cdot$$0  5$$\cdot$$5  8$$\cdot$$2  7$$\cdot$$0  16$$\cdot$$8  23$$\cdot$$7  Normal, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  68$$\cdot$$7  90$$\cdot$$6  99$$\cdot$$1  69$$\cdot$$4  91$$\cdot$$0  99$$\cdot$$5     Proposed  66$$\cdot$$9  89$$\cdot$$9  98$$\cdot$$9  67$$\cdot$$6  91$$\cdot$$0  99$$\cdot$$5     Log  53$$\cdot$$7  79$$\cdot$$7  97$$\cdot$$3  50$$\cdot$$0  77$$\cdot$$1  91$$\cdot$$7     Raw  9$$\cdot$$1  10$$\cdot$$1  14$$\cdot$$1  16$$\cdot$$6  31$$\cdot$$7  49$$\cdot$$2  Normal $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  99$$\cdot$$3  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$4  100$$\cdot$$0  100$$\cdot$$0     Proposed  99$$\cdot$$2  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$7  100$$\cdot$$0  100$$\cdot$$0     Log  96$$\cdot$$5  99$$\cdot$$9  100$$\cdot$$0  96$$\cdot$$6  99$$\cdot$$9  100$$\cdot$$0     Raw  39$$\cdot$$2  37$$\cdot$$1  61$$\cdot$$5  55$$\cdot$$0  84$$\cdot$$0  96$$\cdot$$0  Gamma, $$H_0$$  Oracle  5$$\cdot$$6  4$$\cdot$$4  4$$\cdot$$7  5$$\cdot$$9  4$$\cdot$$8  4$$\cdot$$8     Proposed  5$$\cdot$$3  4$$\cdot$$9  4$$\cdot$$8  5$$\cdot$$0  4$$\cdot$$9  5$$\cdot$$1     Log  4$$\cdot$$7  3$$\cdot$$6  3$$\cdot$$7  5$$\cdot$$0  4$$\cdot$$7  4$$\cdot$$5     Raw  1$$\cdot$$6  0$$\cdot$$8  0$$\cdot$$2  1$$\cdot$$6  0$$\cdot$$6  1$$\cdot$$1  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  35$$\cdot$$7  70$$\cdot$$0  91$$\cdot$$3  36$$\cdot$$7  70$$\cdot$$5  92$$\cdot$$3     Proposed  36$$\cdot$$7  71$$\cdot$$5  91$$\cdot$$9  36$$\cdot$$3  68$$\cdot$$0  92$$\cdot$$0     Log  27$$\cdot$$0  52$$\cdot$$6  82$$\cdot$$8  23$$\cdot$$6  49$$\cdot$$9  66$$\cdot$$0     Raw  5$$\cdot$$1  4$$\cdot$$4  10$$\cdot$$2  4$$\cdot$$2  6$$\cdot$$0  9$$\cdot$$4  Gamma, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  68$$\cdot$$5  91$$\cdot$$8  99$$\cdot$$6  69$$\cdot$$0  91$$\cdot$$7  99$$\cdot$$5     Proposed  66$$\cdot$$8  91$$\cdot$$5  99$$\cdot$$5  66$$\cdot$$9  90$$\cdot$$8  99$$\cdot$$7     Log  52$$\cdot$$4  78$$\cdot$$4  96$$\cdot$$2  50$$\cdot$$9  75$$\cdot$$6  90$$\cdot$$7     Raw  11$$\cdot$$6  9$$\cdot$$8  17$$\cdot$$3  10$$\cdot$$3  13$$\cdot$$2  17$$\cdot$$4  Gamma $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  99$$\cdot$$9  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  99$$\cdot$$9  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$5  100$$\cdot$$0  100$$\cdot$$0     Log  96$$\cdot$$5  99$$\cdot$$7  100$$\cdot$$0  96$$\cdot$$9  99$$\cdot$$9  100$$\cdot$$0     Raw  42$$\cdot$$7  53$$\cdot$$1  61$$\cdot$$9  40$$\cdot$$4  50$$\cdot$$7  62$$\cdot$$9        Banded covariance  Sparse covariance     Method  $$p=50$$  $$p=100$$  $$p=200$$  $$p=50$$  $$p=100$$  $$p=200$$  Normal, $$H_0$$  Oracle  4$$\cdot$$7  5$$\cdot$$3  4$$\cdot$$8  4$$\cdot$$0  4$$\cdot$$8  5$$\cdot$$2     Proposed  4$$\cdot$$6  4$$\cdot$$9  5$$\cdot$$1  3$$\cdot$$7  4$$\cdot$$5  5$$\cdot$$3     Log  3$$\cdot$$9  5$$\cdot$$1  5$$\cdot$$3  3$$\cdot$$5  3$$\cdot$$3  5$$\cdot$$2     Raw  0$$\cdot$$9  1$$\cdot$$0  0$$\cdot$$3  1$$\cdot$$5  1$$\cdot$$0  1$$\cdot$$3  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  38$$\cdot$$2  70$$\cdot$$7  92$$\cdot$$5  40$$\cdot$$1  70$$\cdot$$7  91$$\cdot$$8     Proposed  36$$\cdot$$5  70$$\cdot$$5  92$$\cdot$$2  38$$\cdot$$0  70$$\cdot$$2  91$$\cdot$$0     Log  26$$\cdot$$1  60$$\cdot$$8  84$$\cdot$$7  25$$\cdot$$5  51$$\cdot$$4  70$$\cdot$$7     Raw  4$$\cdot$$0  5$$\cdot$$5  8$$\cdot$$2  7$$\cdot$$0  16$$\cdot$$8  23$$\cdot$$7  Normal, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  68$$\cdot$$7  90$$\cdot$$6  99$$\cdot$$1  69$$\cdot$$4  91$$\cdot$$0  99$$\cdot$$5     Proposed  66$$\cdot$$9  89$$\cdot$$9  98$$\cdot$$9  67$$\cdot$$6  91$$\cdot$$0  99$$\cdot$$5     Log  53$$\cdot$$7  79$$\cdot$$7  97$$\cdot$$3  50$$\cdot$$0  77$$\cdot$$1  91$$\cdot$$7     Raw  9$$\cdot$$1  10$$\cdot$$1  14$$\cdot$$1  16$$\cdot$$6  31$$\cdot$$7  49$$\cdot$$2  Normal $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  99$$\cdot$$3  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$4  100$$\cdot$$0  100$$\cdot$$0     Proposed  99$$\cdot$$2  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$7  100$$\cdot$$0  100$$\cdot$$0     Log  96$$\cdot$$5  99$$\cdot$$9  100$$\cdot$$0  96$$\cdot$$6  99$$\cdot$$9  100$$\cdot$$0     Raw  39$$\cdot$$2  37$$\cdot$$1  61$$\cdot$$5  55$$\cdot$$0  84$$\cdot$$0  96$$\cdot$$0  Gamma, $$H_0$$  Oracle  5$$\cdot$$6  4$$\cdot$$4  4$$\cdot$$7  5$$\cdot$$9  4$$\cdot$$8  4$$\cdot$$8     Proposed  5$$\cdot$$3  4$$\cdot$$9  4$$\cdot$$8  5$$\cdot$$0  4$$\cdot$$9  5$$\cdot$$1     Log  4$$\cdot$$7  3$$\cdot$$6  3$$\cdot$$7  5$$\cdot$$0  4$$\cdot$$7  4$$\cdot$$5     Raw  1$$\cdot$$6  0$$\cdot$$8  0$$\cdot$$2  1$$\cdot$$6  0$$\cdot$$6  1$$\cdot$$1  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  35$$\cdot$$7  70$$\cdot$$0  91$$\cdot$$3  36$$\cdot$$7  70$$\cdot$$5  92$$\cdot$$3     Proposed  36$$\cdot$$7  71$$\cdot$$5  91$$\cdot$$9  36$$\cdot$$3  68$$\cdot$$0  92$$\cdot$$0     Log  27$$\cdot$$0  52$$\cdot$$6  82$$\cdot$$8  23$$\cdot$$6  49$$\cdot$$9  66$$\cdot$$0     Raw  5$$\cdot$$1  4$$\cdot$$4  10$$\cdot$$2  4$$\cdot$$2  6$$\cdot$$0  9$$\cdot$$4  Gamma, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  68$$\cdot$$5  91$$\cdot$$8  99$$\cdot$$6  69$$\cdot$$0  91$$\cdot$$7  99$$\cdot$$5     Proposed  66$$\cdot$$8  91$$\cdot$$5  99$$\cdot$$5  66$$\cdot$$9  90$$\cdot$$8  99$$\cdot$$7     Log  52$$\cdot$$4  78$$\cdot$$4  96$$\cdot$$2  50$$\cdot$$9  75$$\cdot$$6  90$$\cdot$$7     Raw  11$$\cdot$$6  9$$\cdot$$8  17$$\cdot$$3  10$$\cdot$$3  13$$\cdot$$2  17$$\cdot$$4  Gamma $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  99$$\cdot$$9  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  99$$\cdot$$9  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$5  100$$\cdot$$0  100$$\cdot$$0     Log  96$$\cdot$$5  99$$\cdot$$7  100$$\cdot$$0  96$$\cdot$$9  99$$\cdot$$9  100$$\cdot$$0     Raw  42$$\cdot$$7  53$$\cdot$$1  61$$\cdot$$9  40$$\cdot$$4  50$$\cdot$$7  62$$\cdot$$9  Table 2. Empirical sizes and powers (%) of paired tests with$$\alpha=0{\cdot}05$$and$$n=100$$based on$$1000$$replications        Banded covariance  Sparse covariance     Method  $$p=50$$  $$p=100$$  $$p=200$$  $$p=50$$  $$p=100$$  $$p=200$$  Normal, $$H_0$$  Oracle  4$$\cdot$$8  5$$\cdot$$6  6$$\cdot$$3  5$$\cdot$$9  6$$\cdot$$6  6$$\cdot$$0     Proposed  4$$\cdot$$9  5$$\cdot$$5  6$$\cdot$$8  5$$\cdot$$9  6$$\cdot$$1  6$$\cdot$$5     Log  5$$\cdot$$4  4$$\cdot$$4  7$$\cdot$$1  5$$\cdot$$2  3$$\cdot$$7  4$$\cdot$$1     Raw  1$$\cdot$$1  0$$\cdot$$4  0$$\cdot$$2  1$$\cdot$$5  1$$\cdot$$1  1$$\cdot$$2  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  55$$\cdot$$3  86$$\cdot$$8  98$$\cdot$$3  54$$\cdot$$4  85$$\cdot$$8  99$$\cdot$$0     Proposed  52$$\cdot$$9  86$$\cdot$$5  98$$\cdot$$4  54$$\cdot$$0  84$$\cdot$$8  98$$\cdot$$9     Log  39$$\cdot$$6  75$$\cdot$$0  95$$\cdot$$6  35$$\cdot$$7  68$$\cdot$$7  87$$\cdot$$3     Raw  5$$\cdot$$2  7$$\cdot$$3  13$$\cdot$$7  9$$\cdot$$2  21$$\cdot$$6  34$$\cdot$$3  Normal, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  85$$\cdot$$1  98$$\cdot$$6  99$$\cdot$$9  83$$\cdot$$9  98$$\cdot$$6  99$$\cdot$$9     Proposed  82$$\cdot$$8  98$$\cdot$$4  99$$\cdot$$9  82$$\cdot$$8  98$$\cdot$$3  99$$\cdot$$9     Log  71$$\cdot$$2  94$$\cdot$$5  99$$\cdot$$6  65$$\cdot$$4  90$$\cdot$$0  98$$\cdot$$3     Raw  13$$\cdot$$8  14$$\cdot$$9  22$$\cdot$$1  22$$\cdot$$7  43$$\cdot$$7  64$$\cdot$$5  Normal $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  99$$\cdot$$5  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$7  100$$\cdot$$0  100$$\cdot$$0     Raw  50$$\cdot$$0  50$$\cdot$$0  74$$\cdot$$0  77$$\cdot$$9  94$$\cdot$$4  99$$\cdot$$3  Gamma, $$H_0$$  Oracle  4$$\cdot$$2  6$$\cdot$$1  7$$\cdot$$2  5$$\cdot$$1  6$$\cdot$$3  7$$\cdot$$3     Proposed  4$$\cdot$$3  6$$\cdot$$2  7$$\cdot$$2  5$$\cdot$$8  6$$\cdot$$1  7$$\cdot$$1     Log  4$$\cdot$$7  4$$\cdot$$7  5$$\cdot$$6  5$$\cdot$$2  4$$\cdot$$5  5$$\cdot$$5     Raw  1$$\cdot$$2  0$$\cdot$$5  0$$\cdot$$4  1$$\cdot$$4  1$$\cdot$$7  2$$\cdot$$0  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  55$$\cdot$$5  86$$\cdot$$1  98$$\cdot$$4  51$$\cdot$$3  84$$\cdot$$2  98$$\cdot$$3     Proposed  53$$\cdot$$9  86$$\cdot$$2  98$$\cdot$$4  50$$\cdot$$1  84$$\cdot$$0  98$$\cdot$$0     Log  42$$\cdot$$4  77$$\cdot$$3  94$$\cdot$$8  35$$\cdot$$9  67$$\cdot$$2  88$$\cdot$$1     Raw  6$$\cdot$$1  8$$\cdot$$4  11$$\cdot$$1  9$$\cdot$$6  22$$\cdot$$6  39$$\cdot$$4  Gamma, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  83$$\cdot$$8  97$$\cdot$$7  100$$\cdot$$0  87$$\cdot$$9  98$$\cdot$$2  100$$\cdot$$0     Proposed  82$$\cdot$$6  97$$\cdot$$1  100$$\cdot$$0  86$$\cdot$$5  98$$\cdot$$3  99$$\cdot$$9     Log  70$$\cdot$$8  93$$\cdot$$0  99$$\cdot$$8  67$$\cdot$$8  90$$\cdot$$8  98$$\cdot$$4     Raw  12$$\cdot$$0  13$$\cdot$$2  20$$\cdot$$7  24$$\cdot$$0  44$$\cdot$$0  61$$\cdot$$5  Gamma $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  99$$\cdot$$4  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$8  100$$\cdot$$0  100$$\cdot$$0     Raw  51$$\cdot$$5  52$$\cdot$$7  75$$\cdot$$0  80$$\cdot$$3  94$$\cdot$$5  99$$\cdot$$0        Banded covariance  Sparse covariance     Method  $$p=50$$  $$p=100$$  $$p=200$$  $$p=50$$  $$p=100$$  $$p=200$$  Normal, $$H_0$$  Oracle  4$$\cdot$$8  5$$\cdot$$6  6$$\cdot$$3  5$$\cdot$$9  6$$\cdot$$6  6$$\cdot$$0     Proposed  4$$\cdot$$9  5$$\cdot$$5  6$$\cdot$$8  5$$\cdot$$9  6$$\cdot$$1  6$$\cdot$$5     Log  5$$\cdot$$4  4$$\cdot$$4  7$$\cdot$$1  5$$\cdot$$2  3$$\cdot$$7  4$$\cdot$$1     Raw  1$$\cdot$$1  0$$\cdot$$4  0$$\cdot$$2  1$$\cdot$$5  1$$\cdot$$1  1$$\cdot$$2  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  55$$\cdot$$3  86$$\cdot$$8  98$$\cdot$$3  54$$\cdot$$4  85$$\cdot$$8  99$$\cdot$$0     Proposed  52$$\cdot$$9  86$$\cdot$$5  98$$\cdot$$4  54$$\cdot$$0  84$$\cdot$$8  98$$\cdot$$9     Log  39$$\cdot$$6  75$$\cdot$$0  95$$\cdot$$6  35$$\cdot$$7  68$$\cdot$$7  87$$\cdot$$3     Raw  5$$\cdot$$2  7$$\cdot$$3  13$$\cdot$$7  9$$\cdot$$2  21$$\cdot$$6  34$$\cdot$$3  Normal, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  85$$\cdot$$1  98$$\cdot$$6  99$$\cdot$$9  83$$\cdot$$9  98$$\cdot$$6  99$$\cdot$$9     Proposed  82$$\cdot$$8  98$$\cdot$$4  99$$\cdot$$9  82$$\cdot$$8  98$$\cdot$$3  99$$\cdot$$9     Log  71$$\cdot$$2  94$$\cdot$$5  99$$\cdot$$6  65$$\cdot$$4  90$$\cdot$$0  98$$\cdot$$3     Raw  13$$\cdot$$8  14$$\cdot$$9  22$$\cdot$$1  22$$\cdot$$7  43$$\cdot$$7  64$$\cdot$$5  Normal $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  99$$\cdot$$5  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$7  100$$\cdot$$0  100$$\cdot$$0     Raw  50$$\cdot$$0  50$$\cdot$$0  74$$\cdot$$0  77$$\cdot$$9  94$$\cdot$$4  99$$\cdot$$3  Gamma, $$H_0$$  Oracle  4$$\cdot$$2  6$$\cdot$$1  7$$\cdot$$2  5$$\cdot$$1  6$$\cdot$$3  7$$\cdot$$3     Proposed  4$$\cdot$$3  6$$\cdot$$2  7$$\cdot$$2  5$$\cdot$$8  6$$\cdot$$1  7$$\cdot$$1     Log  4$$\cdot$$7  4$$\cdot$$7  5$$\cdot$$6  5$$\cdot$$2  4$$\cdot$$5  5$$\cdot$$5     Raw  1$$\cdot$$2  0$$\cdot$$5  0$$\cdot$$4  1$$\cdot$$4  1$$\cdot$$7  2$$\cdot$$0  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  55$$\cdot$$5  86$$\cdot$$1  98$$\cdot$$4  51$$\cdot$$3  84$$\cdot$$2  98$$\cdot$$3     Proposed  53$$\cdot$$9  86$$\cdot$$2  98$$\cdot$$4  50$$\cdot$$1  84$$\cdot$$0  98$$\cdot$$0     Log  42$$\cdot$$4  77$$\cdot$$3  94$$\cdot$$8  35$$\cdot$$9  67$$\cdot$$2  88$$\cdot$$1     Raw  6$$\cdot$$1  8$$\cdot$$4  11$$\cdot$$1  9$$\cdot$$6  22$$\cdot$$6  39$$\cdot$$4  Gamma, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  83$$\cdot$$8  97$$\cdot$$7  100$$\cdot$$0  87$$\cdot$$9  98$$\cdot$$2  100$$\cdot$$0     Proposed  82$$\cdot$$6  97$$\cdot$$1  100$$\cdot$$0  86$$\cdot$$5  98$$\cdot$$3  99$$\cdot$$9     Log  70$$\cdot$$8  93$$\cdot$$0  99$$\cdot$$8  67$$\cdot$$8  90$$\cdot$$8  98$$\cdot$$4     Raw  12$$\cdot$$0  13$$\cdot$$2  20$$\cdot$$7  24$$\cdot$$0  44$$\cdot$$0  61$$\cdot$$5  Gamma $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  99$$\cdot$$4  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$8  100$$\cdot$$0  100$$\cdot$$0     Raw  51$$\cdot$$5  52$$\cdot$$7  75$$\cdot$$0  80$$\cdot$$3  94$$\cdot$$5  99$$\cdot$$0  To further examine the performance of the proposed test in very high-dimensional settings, we carried out simulations for two independent samples with dimension $$p=2000$$ and sample sizes $$n_1=n_2=100,200$$. The results are summarized in Table 3 and indicate that the proposed test still has approximately correct size and improved power over the two competing tests. Table 3. Empirical sizes and powers (%) of two-sample tests with$$\alpha=0{\cdot}05$$and$$p=2000$$based on$$1000$$replications        Banded covariance  Sparse covariance     Method  $$n_1=n_2=100$$  $$n_1=n_2=200$$  $$n_1=n_2=100$$  $$n_1=n_2=200$$  Normal, $$H_0$$  Oracle  6$$\cdot$$5  3$$\cdot$$7  7$$\cdot$$6  4$$\cdot$$1     Proposed  6$$\cdot$$4  3$$\cdot$$7  7$$\cdot$$5  4$$\cdot$$1     Log  5$$\cdot$$0  4$$\cdot$$7  2$$\cdot$$4  1$$\cdot$$8     Raw  0$$\cdot$$2  0$$\cdot$$1  0$$\cdot$$5  0$$\cdot$$9  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  100$$\cdot$$0  100$$\cdot$$0  98$$\cdot$$7  98$$\cdot$$0     Raw  48$$\cdot$$4  55$$\cdot$$0  60$$\cdot$$7  68$$\cdot$$6  Gamma, $$H_0$$  Oracle  7$$\cdot$$0  6$$\cdot$$1  6$$\cdot$$4  6$$\cdot$$7     Proposed  6$$\cdot$$9  6$$\cdot$$1  6$$\cdot$$6  7$$\cdot$$1     Log  4$$\cdot$$9  4$$\cdot$$5  2$$\cdot$$7  2$$\cdot$$2     Raw  0$$\cdot$$2  0$$\cdot$$2  0$$\cdot$$4  0$$\cdot$$1  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  100$$\cdot$$0  100$$\cdot$$0  98$$\cdot$$5  98$$\cdot$$9     Raw  36$$\cdot$$1  45$$\cdot$$3  22$$\cdot$$1  22$$\cdot$$0        Banded covariance  Sparse covariance     Method  $$n_1=n_2=100$$  $$n_1=n_2=200$$  $$n_1=n_2=100$$  $$n_1=n_2=200$$  Normal, $$H_0$$  Oracle  6$$\cdot$$5  3$$\cdot$$7  7$$\cdot$$6  4$$\cdot$$1     Proposed  6$$\cdot$$4  3$$\cdot$$7  7$$\cdot$$5  4$$\cdot$$1     Log  5$$\cdot$$0  4$$\cdot$$7  2$$\cdot$$4  1$$\cdot$$8     Raw  0$$\cdot$$2  0$$\cdot$$1  0$$\cdot$$5  0$$\cdot$$9  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  100$$\cdot$$0  100$$\cdot$$0  98$$\cdot$$7  98$$\cdot$$0     Raw  48$$\cdot$$4  55$$\cdot$$0  60$$\cdot$$7  68$$\cdot$$6  Gamma, $$H_0$$  Oracle  7$$\cdot$$0  6$$\cdot$$1  6$$\cdot$$4  6$$\cdot$$7     Proposed  6$$\cdot$$9  6$$\cdot$$1  6$$\cdot$$6  7$$\cdot$$1     Log  4$$\cdot$$9  4$$\cdot$$5  2$$\cdot$$7  2$$\cdot$$2     Raw  0$$\cdot$$2  0$$\cdot$$2  0$$\cdot$$4  0$$\cdot$$1  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  100$$\cdot$$0  100$$\cdot$$0  98$$\cdot$$5  98$$\cdot$$9     Raw  36$$\cdot$$1  45$$\cdot$$3  22$$\cdot$$1  22$$\cdot$$0  6. Applications to microbiome data 6.1. Analysis of obesity microbiome data We illustrate the application of the proposed tests by analysing two microbiome datasets. The first is a dataset from Wu et al. (2011) collected in a cross-sectional study of 98 subjects to investigate the effects of habitual diet on the human gut microbiome. The dataset was analysed by regression in Lin et al. (2014), where it was found to suggest an association between obesity and changes in gut microbiome composition. For each subject, DNA collected from stool samples was analysed by 454/Roche pyrosequencing of 16S rRNA gene segments from the V1–V2 region. An average of 9265 reads per sample were obtained, with a standard deviation of 3864, by denoising the pyrosequences prior to taxonomic assignment. The resulting 3068 operational taxonomic units were further merged into 87 genera that were observed in at least one sample. As suggested by Aitchison (2003) and Lin et al. (2014), zero counts were replaced by 0$$\cdot$$5 before the count data were converted into compositional data by normalization. Demographic information, including body mass index, BMI, was recorded on the subjects. We are interested in testing whether lean and obese individuals have the same gut microbiome composition. To this end, we divided the subjects into a lean group, BMI $$<25$$, $$n_1=63$$, and an obese group, BMI $${\geqslant}25$$, $$n_2=35$$, on which we performed various two-sample tests. The proposed test yielded a $$p$$-value of 0$$\cdot$$001, indicating a marked difference between the two groups. In contrast, the tests based on the log-transformed and raw compositions gave $$p$$-values of 0$$\cdot$$129 and 0$$\cdot$$261, and hence failed to detect the difference at the 0$$\cdot$$05 level. To assess the stability of our proposed test and to perform power comparisons, we generated 5000 bootstrap subsamples within each group, with the subsampling proportion varying from 0$$\cdot$$2 to 1. For each subsampling proportion, we obtained the empirical power as the proportion of subsamples for which the null hypothesis was rejected at the 0$$\cdot$$05 level. The empirical power curves based on the bootstrap subsamples, presented in Fig. 1(a), show that the proposed test greatly outperforms the competitors. We further conducted back-testing to check whether the signal disappears upon breaking the association. We generated 1000 bootstrap samples from the pooled data and then randomly divided each sample into two groups with the same sizes as before. The histogram of $$p$$-values from our test based on the bootstrap samples is displayed in Fig. 1(b). The $$p$$-values are distributed quite evenly, indicating good accuracy of the asymptotics. Overall, our results confirm previous findings that the microbiomes of lean and obese individuals differ at the taxonomic and functional levels (Turnbaugh et al., 2009). Fig. 1. View largeDownload slide Analysis of two microbiome datasets: empirical power curves of the proposed test (triangles) and the tests based on log-transformed (circles) and raw (squares) compositions with $$\alpha=0{\cdot}05$$ are shown in (a) for the obesity data and (c) for the Crohn’s disease data; histograms of $$p$$-values from the proposed test in the back-testing are shown in (b) for the obesity data and (d) for the Crohn’s disease data, for 1000 replicates. Fig. 1. View largeDownload slide Analysis of two microbiome datasets: empirical power curves of the proposed test (triangles) and the tests based on log-transformed (circles) and raw (squares) compositions with $$\alpha=0{\cdot}05$$ are shown in (a) for the obesity data and (c) for the Crohn’s disease data; histograms of $$p$$-values from the proposed test in the back-testing are shown in (b) for the obesity data and (d) for the Crohn’s disease data, for 1000 replicates. To further assess the sensitivity of the results to zero replacements, we repeated the analysis with the zero counts replaced by 0$$\cdot$$1 before normalization. The proposed test resulted in a $$p$$-value of 0$$\cdot$$0001, while the tests based on the log-transformed and raw compositions gave $$p$$-values of 0$$\cdot$$015 and 0$$\cdot$$080, respectively. In this case only the proposed test rejects the null hypothesis at the 0$$\cdot$$01 level, and the inference does not seem sensitive to the zero-replacement values. 6.2. Analysis of Crohn’s disease microbiome data Crohn’s disease is a type of inflammatory bowel disease characterized by altered gut bacterial composition, whose etiology appears multifactorial and remains poorly understood. We analyse a dataset from a longitudinal study of 90 pediatric Crohn’s disease patients reported by Lewis et al. (2015). Among these patients, 26 were classified as responders to anti-tumour necrosis factor therapy, where response to therapy was defined as a reduction in faecal calprotectin, FCP, concentration to $$250\,\mu$$g/g or below among those with baseline FCP greater than $$250\,\mu$$g/g. Twenty-four of the responders had stool samples collected at four time-points: baseline, and then 1 week, 4 weeks and 8 weeks into therapy. The bacterial composition was quantified using shotgun metagenomic sequencing and the MetaPhlAn package (Segata et al., 2012), yielding 43 genera that appeared in at least three samples across all time-points. As the read counts were not available, zero proportions were replaced by half or 10% of the minimum nonzero proportions in the dataset. To determine the effect of the therapy among responders, we applied various paired tests to test for changes in gut microbiome composition between baseline and the three later time-points. As shown in Table 4, the $$p$$-values for the comparison between baseline and week 8 from all tests were significant or close to significant, with the strongest evidence provided by the proposed test. The comparisons at the two earlier time-points did not yield decisive conclusions. These inferences do not seem sensitive to the zero-replacement strategies. The empirical power curves based on bootstrap subsamples in Fig. 1(c) exhibit more substantial power gains of the proposed test over the competitors with smaller sample sizes. Moreover, the histogram of $$p$$-values in Fig. 1(d) indicates that the proposed test survives the back-testing, where the observations at two time-points were randomly interchanged for each subject in the bootstrap samples. Our results provide further support for the effect of the therapy on gut microbiome composition through reduced inflammation and suggest that it may take longer for the intestinal dysbiosis to be resolved. Table 4. The $$p$$-values of paired tests applied to the Crohn’s disease microbiome data with zeros replaced by half or $$10\%$$ of the minimum nonzero proportions in the dataset     Zero replacement by half  Zero replacement by 10%     Proposed  Log  Raw  Proposed  Log  Raw  Baseline versus week 1  0$$\cdot$$119  0$$\cdot$$605  0$$\cdot$$757  0$$\cdot$$141  0$$\cdot$$611  0$$\cdot$$757  Baseline versus week 4  0$$\cdot$$460  0$$\cdot$$553  0$$\cdot$$468  0$$\cdot$$373  0$$\cdot$$684  0$$\cdot$$468  Baseline versus week 8  0$$\cdot$$014  0$$\cdot$$033  0$$\cdot$$082  0$$\cdot$$018  0$$\cdot$$058  0$$\cdot$$082     Zero replacement by half  Zero replacement by 10%     Proposed  Log  Raw  Proposed  Log  Raw  Baseline versus week 1  0$$\cdot$$119  0$$\cdot$$605  0$$\cdot$$757  0$$\cdot$$141  0$$\cdot$$611  0$$\cdot$$757  Baseline versus week 4  0$$\cdot$$460  0$$\cdot$$553  0$$\cdot$$468  0$$\cdot$$373  0$$\cdot$$684  0$$\cdot$$468  Baseline versus week 8  0$$\cdot$$014  0$$\cdot$$033  0$$\cdot$$082  0$$\cdot$$018  0$$\cdot$$058  0$$\cdot$$082  7. Discussion We have shown that it is possible to develop tests for high-dimensional parameters of the log basis variables from which compositional data are derived, even though the bases are not observed. In this regard, our method extends the scope of the log-ratio transformation methodology due to Aitchison (1982). The mild assumption that $$\|\delta\|_1=o(p)$$ for the proposed test to achieve the minimax optimal rate is due to the use of centred log-ratio variables as a proxy for the latent log basis variables, and bears a resemblance to an approximate identifiability condition for large covariance estimation from compositional data considered in Cao et al. (2016). Our testing framework may be extended in at least two directions. First, it would be worthwhile to exploit the covariance structure of compositional data for power enhancement, by borrowing ideas of Cai et al. (2014). Such an extension, however, seems nontrivial owing to the singularity of the centred log-ratio covariance matrix. Second, in addition to the global test developed in this paper, a multiple testing procedure with accurate error control would be helpful for identifying specific taxa that differ significantly between groups and contribute to the outcome of interest. Acknowledgement This research was supported in part by the U.S. National Institutes of Health and the National Natural Science Foundation of China. We thank the associate editor and three reviewers for their very helpful comments. Appendix Proofs of the theoretical results We first introduce some notation. For a matrix $$A=(a_{ij})_{p\times p}$$, denote by $$\|A\|_1$$ and $$\|A\|_{\max}$$ the matrix 1-norm and entrywise $$\ell_\infty$$-norm, respectively, i.e., $$\|A\|_1=\max_{1{\leqslant} j{\leqslant} p}\sum_{i=1}^p|a_{ij}|$$ and $$\|A\|_{\max}=\max_{1{\leqslant} i,j{\leqslant} p}|a_{ij}|$$. Write $$a_{i\cdot}=p^{-1}\sum_{j=1}^p a_{ij}$$ and $$a_{\cdot\cdot}=p^{-2}\sum_{i=1}^p\sum_{j=1}^p a_{ij}$$. We will use $$C_1,C_2,\ldots>0$$ to denote generic constants, whose values may vary from line to line. Proof of Proposition 1 By Condition 3, we have   $$\label{eq:R_1norm} \|R\|_1{\leqslant} p^{1/2}\max_{1{\leqslant} j{\leqslant} p}\left(\sum_{i=1}^p\rho_{ij}^2\right)^{1/2}{\leqslant} p^{1/2}r_2^{1/2}=O(p^{1/2})\text{.}$$ (A1) We write $$\gamma_{ij}=\omega_{ij}-\omega_{i\cdot}-\omega_{j\cdot}+\omega_{\cdot\cdot}$$. It follows from Condition 1 and (A1) that   $$\label{eq:omega_i} |\omega_{i\cdot}|{\leqslant}\frac{1}{p}\sum_{j=1}^p|\omega_{ij}|{\leqslant}\frac{1}{p}\max_{1{\leqslant} j{\leqslant} p}|\omega_{jj}|\sum_{j=1}^p|\rho_{ij}|{\leqslant}\frac{1}{p}\kappa_1\|R\|_1 =O(p^{-1/2})$$ (A2) and, similarly,   $$\label{eq:omega_j} |\omega_{j\cdot}|=O(p^{-1/2}),\quad |\omega_{\cdot\cdot}|=O(p^{-1/2})\text{.}$$ (A3) Hence   $$\label{eq:gamma_omega} \|\Gamma-\Omega\|_{\max}{\leqslant}\max_{1{\leqslant} i,j{\leqslant} p}\big(|\omega_{i\cdot}|+|\omega_{j\cdot}|+|\omega_{\cdot\cdot}|\big)=O(p^{-1/2})\text{.}$$ (A4) This and Condition 1 imply (i). To show (ii), we write   $\tau_{ij}=\frac{\gamma_{ij}}{(\gamma_{ii}\gamma_{jj})^{1/2}}=\frac{\omega_{ij}+{\varepsilon}_1}{\{(\omega_{ii}+{\varepsilon}_2)(\omega_{jj}+{\varepsilon}_3)\}^{1/2}},$ where $${\varepsilon}_1=-\omega_{i\cdot}-\omega_{j\cdot}+\omega_{\cdot\cdot}$$, $${\varepsilon}_2=-2\omega_{i\cdot}+\omega_{\cdot\cdot}$$ and $${\varepsilon}_3=-2\omega_{j\cdot}+\omega_{\cdot\cdot}$$. By (A2) and (A3), we have $${\varepsilon}_i=O(p^{-1/2})$$ ($$i=1,2,3$$). Therefore, by Condition 1,   \begin{align} \tau_{ij}&=\frac{\omega_{ij}+{\varepsilon}_1}{(\omega_{ii}\omega_{jj})^{1/2}}\left\{\frac{(\omega_{ii}+{\varepsilon}_2)(\omega_{jj}+{\varepsilon}_3)}{\omega_{ii}\omega_{jj}} \right\}^{-1/2}=\frac{\rho_{ij}+O(p^{-1/2})}{[\{1+O(p^{-1/2})\}\{1+O(p^{-1/2})\}]^{1/2}}\notag\\ &=\rho_{ij}+O(p^{-1/2})\label{eq:tau-rho}, \end{align} (A5) which, together with Condition 2, implies (ii). To show (iii), noting that $$\tau_{ij}^2-\rho_{ij}^2=(\tau_{ij}-\rho_{ij})^2+2\rho_{ij}(\tau_{ij}-\rho_{ij})$$ and using (A1) and (A5), we have   $\sum_{i=1}^p(\tau_{ij}^2-\rho_{ij}^2)=\sum_{i=1}^p(\tau_{ij}-\rho_{ij})^2+2\sum_{i=1}^p\rho_{ij}(\tau_{ij}-\rho_{ij})=O(1)+2\|R\|_1O(p^{-1/2})=O(1)\text{.}$ This and Condition 3 imply (iii) and thus complete the proof. Proof of Proposition 2 We first write   $Y_{ij}^{(k)}-\nu_{kj}=Z_{ij}^{(k)}-\mu_{kj}+\frac{1}{p}\sum_{j=1}^p(Z_{ij}^{(k)}-\mu_{kj})\text{.}$ It follows from Condition 1 and Proposition 1 that   $$\label{eq:y_to_z} \frac{|Y_{ij}^{(k)}-\nu_{kj}|}{\gamma_{jj}^{1/2}}{\leqslant}\frac{\omega_{jj}^{1/2}}{\gamma_{jj}^{1/2}} \Biggl(\frac{|Z_{ij}^{(k)}-\mu_{kj}|}{\omega_{jj}^{1/2}} +\frac{1}{p}\sum_{j=1}^p\frac{|Z_{ij}^{(k)}-\mu_{kj}|}{\omega_{jj}^{1/2}}\Biggr) {\leqslant}2(\kappa_1\kappa_2)^{1/2}\max_{i,j,k}\frac{|Z_{ij}^{(k)}-\mu_{kj}|}{\omega_{jj}^{1/2}}\text{.}$$ (A6) Using Condition 4 and applying Markov’s inequality and the union bound, we obtain   ${\rm pr}\!\left(\max_{i,j,k}|Z_{ij}^{(k)}-\mu_{kj}|/\omega_{jj}^{1/2}{\geqslant} t\right){\leqslant}(n_1+n_2)pK\exp(-\eta t^2)$ for all $$t>0$$. Hence, by Condition 5,   $$\label{eq:z_rate_exp} \max_{i,j,k}\big|Z_{ij}^{(k)}-\mu_{kj}\big|/\omega_{jj}^{1/2}=O_{\rm p}\big\{(\log n+\log p)^{1/2}\big\}=o_{\rm p}(n^{1/2}/\log p)\text{.}$$ (A7) Combining (A6) with (A7), we arrive at (11). To prove (12), without loss of generality we assume $$\mu_1=\mu_2=0$$. Let $$\hat\gamma_{jj}^{(k)}$$ denote the sample centred log-ratio variances for population $$k$$ ($$k=1,2$$). Observe that   $\big|\hat\gamma_{jj}^{(k)}-\gamma_{jj}\big|=\big|\hat\omega_{jj}^{(k)}-2\hat\omega_{j\cdot}^{(k)}+\hat\omega_{\cdot\cdot}^{(k)}-(\omega_{jj}-2\omega_{j\cdot} +\omega_{\cdot\cdot})\big|{\leqslant}4\max_{i,j}\big|\hat\omega_{ij}^{(k)}-\omega_{ij}\big|\text{.}$ It follows from Condition 1 and Proposition 1 that   $\frac{|\hat\gamma_{jj}^{(k)}-\gamma_{jj}|}{\gamma_{jj}}{\leqslant}\frac{4}{\gamma_{jj}}\max_{i,j}\frac{|\hat\omega_{ij}^{(k)}-\omega_{ij}|} {(\omega_{ii}\omega_{jj})^{1/2}}\,(\omega_{ii}\omega_{jj})^{1/2} {\leqslant}4\kappa_1\kappa_2\max_{i,j}\frac{|\hat\omega_{ij}^{(k)}-\omega_{ij}|}{(\omega_{ii}\omega_{jj})^{1/2}}\text{.}$ The proof is completed by invoking the following lemma, which recaps a concentration result in Bickel & Levina (2008), and using the fact that $$\hat\gamma_{jj}=n_1\hat\gamma_{jj}^{(1)}/(n_1+n_2)+n_2\hat\gamma_{jj}^{(2)}/(n_1+n_2)$$. Lemma A1. Under Condition 4, there exist constants $$C_1,C_2,C_3,C_4>0$$ such that  ${\rm pr}\!\left\{\max_{i,j}\frac{|\hat\omega_{ij}^{(k)}-\omega_{ij}|}{(\omega_{ii}\omega_{jj})^{1/2}}{\geqslant} t\right\}{\leqslant} C_1p\exp(-C_2n_kt/2)+C_3p^2\exp(-C_4n_kt^2/4)\quad (t>0;\, k=1,2)\text{.}$ Proof of Theorem 1 Let $$t_p=t+2\log p-\log\log p$$ and   $$\label{eq:M_nstar} M_n^*=n\max_{1{\leqslant} j{\leqslant} p}\frac{(\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)})^2}{\gamma_{jj}}\text{.}$$ (A8) We first show that under $$H_0$$ in (3), or equivalently (8), for any fixed $$t\in\mathbb{R}$$ we have   $$\label{eq:limit} {\rm pr}(M_n^*{\leqslant} t_p)\to\exp\{-\pi^{-1/2}\exp(-t/2)\}$$ (A9) as $$n,p\to\infty$$. By the Bonferroni inequality, for any fixed integer $$m$$ with $$1{\leqslant} m{\leqslant} p/2$$,   $\sum_{d=1}^{2m}(-1)^{d-1}\!\!\!\sum_{1{\leqslant} j_1<\cdots<j_d{\leqslant} p}{\rm pr}\!\left(\bigcap_{k=1}^d E_{j_k}\right){\leqslant}{\rm pr}(M_n^*{\geqslant} t_p){\leqslant} \sum_{d=1}^{2m-1}(-1)^{d-1}\!\!\!\sum_{1{\leqslant} j_1<\cdots<j_d{\leqslant} p}{\rm pr}\!\left(\bigcap_{k=1}^d E_{j_k}\right)\!,$ where   $E_j=\left\{n\frac{(\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)})^2}{\gamma_{jj}}{\geqslant} t_p\right\}\!\text{.}$ Under $$H_0$$, we write   $n^{1/2}\frac{\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)}}{\gamma_{jj}^{1/2}}=\frac{n^{1/2}}{n_1}\sum_{i=1}^{n_1}\frac{Y_{ij}^{(1)}-\nu_{1j}}{\gamma_{jj}^{1/2}} -\frac{n^{1/2}}{n_2}\sum_{i=1}^{n_2}\frac{Y_{ij}^{(2)}-\nu_{2j}}{\gamma_{jj}^{1/2}}\equiv\sum_{i=1}^{n_1+n_2}\xi_{ij}\text{.}$ By Proposition 2, it suffices to consider the event $$\{\max_{i,j}|\xi_{ij}|{\leqslant} C_1(\log p)^{-1}\}$$ for some constant $$C_1>0$$, which occurs with probability tending to 1. Let $$N=(N_1,\dots,N_p)^{{\mathrm{\scriptscriptstyle T}}}$$ be multivariate normal with mean zero and covariance matrix $$nR/n_1+nR/n_2=R$$. Applying Theorem 1.1 of Zaĭtsev (1987), for any sequence $${\varepsilon}_n=o(1)$$ we have   \begin{align*} {\rm pr}\!\left(\bigcap_{k=1}^d E_{j_k}\right)&={\rm pr}\!\left(\min_{1{\leqslant} k{\leqslant} d}\left|\sum_{i=1}^{n_1+n_2}\xi_{ij_k}\right|{\geqslant} t_p^{1/2}\right)\\ &{\leqslant}{\rm pr}\!\left(\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}-{\varepsilon}_n\right)+O\!\left\{d^{5/2}\exp\left(-C_2\frac{\log p}{d^3}\right)\right\}\\ &{\leqslant}{\rm pr}\!\left(\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}-{\varepsilon}_n\right)+O(p^{-C_3}) \end{align*} and, similarly,   ${\rm pr}\!\left(\bigcap_{k=1}^d E_{j_k}\right){\geqslant}{\rm pr}\!\left(\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}+{\varepsilon}_n\right)+O(p^{-C_3})\text{.}$ Hence   \begin{align*} \sum_{d=1}^{2m}(-1)^{d-1}\!\!\!\sum_{1{\leqslant} j_1<\cdots<j_d{\leqslant} p}&{\rm pr}\!\left\{\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}+(-1)^{d-1}{\varepsilon}_n\right\}+o(1)\\ &{\leqslant}{\rm pr}(M_n^*{\geqslant} t_p)\\ &{\leqslant}\sum_{d=1}^{2m-1}(-1)^{d-1}\!\!\!\sum_{1{\leqslant} j_1<\cdots<j_d{\leqslant} p}{\rm pr}\!\left\{\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}+(-1)^d{\varepsilon}_n\right\}+o(1)\text{.} \end{align*} Then (A9) is proved by applying the following lemma, which follows from the same arguments as those for Lemma 6 of Cai et al. (2014), and letting $$m\to\infty$$. Lemma A2. Under Conditions 2 and 3, we have  $\sum_{1{\leqslant} j_1<\cdots<j_d{\leqslant} p}{\rm pr}\!\left(\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}\pm{\varepsilon}_n\right) =\frac{1}{d!}\,\pi^{-d/2}\exp\!\left(-\frac{dt}{2}\right)\{1+o(1)\}\text{.}$ Finally, consider the event $$\{\max_j|\hat\gamma_{jj}-\gamma_{jj}|/\gamma_{jj}{\leqslant} C_4(\log p/n)^{1/2}\}$$ for some constant $$C_4>0$$, which occurs with probability tending to 1 by Proposition 2. Then   $$\label{eq:M_nineq} |M_n-M_n^*|{\leqslant} n\max_{1{\leqslant} j{\leqslant} p}\frac{(\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)})^2}{\hat\gamma_{jj}}\max_{1{\leqslant} j{\leqslant} p}\frac{|\hat\gamma_{jj}-\gamma_{jj}|}{\gamma_{jj}}{\leqslant} C_4M_n\left(\frac{\log p}{n}\right)^{1/2}=M_n o\!\left(\frac{1}{\log p}\right)$$ (A10) by Condition 5. This, together with (A9), completes the proof. Proof of Theorem 2 In view of (A10) with $$M_n^*$$ defined in (A8), it suffices to prove that under $$H_1$$ in (13),   $$\label{eq:M_nconv} {\rm pr}(M_n^*{\geqslant} q_\alpha+2\log p-\log\log p)\to1\text{.}$$ (A11) By assumption, there exists some $$j_0\in S$$ such that $$|\delta_{j_0}|{\geqslant}\surd{2}+{\varepsilon}$$. We write   $n^{1/2}\frac{\bar{Y}_{j_0}^{(1)}-\bar{Y}_{j_0}^{(2)}}{\gamma_{j_0j_0}^{1/2}}=n^{1/2}\frac{\bar{Y}_{j_0}^{(1)}-\nu_{1j_0}}{\gamma_{j_0j_0}^{1/2}} -n^{1/2}\frac{\bar{Y}_{j_0}^{(2)}-\nu_{2j_0}}{\gamma_{j_0j_0}^{1/2}}+n^{1/2}\frac{\nu_{1j_0}-\nu_{2j_0}}{\gamma_{j_0j_0}^{1/2}}\equiv T_1+T_2+T_3\text{.}$ Note that $$T_1=O_{\rm p}(1)$$ and $$T_2=O_{\rm p}(1)$$ by the central limit theorem. Define   $T_4=n^{1/2}\frac{\nu_{1j_0}-\nu_{2j_0}}{\omega_{j_0j_0}^{1/2}}\text{.}$ It follows from (A4) and Condition 1 that   $|T_3-T_4|=n^{1/2}\frac{|\nu_{1j_0}-\nu_{2j_0}|}{\gamma_{j_0j_0}^{1/2}}\frac{|\gamma_{j_0j_0}^{1/2}-\omega_{j_0j_0}^{1/2}|}{\omega_{j_0j_0}^{1/2}} =|T_3|O(p^{-1/2})\text{.}$ Then, using (15) and the assumption $$\|\delta\|_1=o(p)$$, we have   \begin{align*} T_3&=T_4\{1+O(p^{-1/2})\}=n^{1/2}\frac{\delta_{j_0}\omega_{j_0j_0}^{1/2}+o(1)}{\omega_{j_0j_0}^{1/2}}\left(\frac{\log p}{n}\right)^{1/2} \{1+O(p^{-1/2})\}\\ &=\{\delta_{j_0}+o(1)\}(\log p)^{1/2}{\geqslant}\left(\surd{2}+\frac{{\varepsilon}}{2}\right)(\log p)^{1/2} \end{align*} for sufficiently large $$p$$. Combining these bounds, we conclude that, with probability tending to 1,   $n^{1/2}\frac{|\bar{Y}_{j_0}^{(1)}-\bar{Y}_{j_0}^{(2)}|}{\gamma_{j_0j_0}^{1/2}}{\geqslant}(q_\alpha+2\log p-\log\log p)^{1/2}\text{.}$ This implies (A11) and completes the proof. References Aitchison J. ( 1982). The statistical analysis of compositional data (with Discussion). J. R. Statist. Soc. B  44, 139– 77. Aitchison J. ( 2003). The Statistical Analysis of Compositional Data . Caldwell: Blackburn Press. Bai Z. & Saranadasa H. ( 1996). Effect of high dimension: By an example of a two sample problem. Statist. Sinica  6, 311– 29. Bickel P. J. & Levina E. ( 2008). Covariance regularization by thresholding. Ann. Statist.  36, 2577– 604. Google Scholar CrossRef Search ADS   Cai T. T. , Liu W. & Xia Y. ( 2014). Two-sample test of high dimensional means under dependence. J. R. Statist. Soc. B  76, 349– 72. Google Scholar CrossRef Search ADS   Cao Y. , Lin W. & Li H. ( 2016). Large covariance estimation for compositional data via composition-adjusted thresholding. arXiv: 1601.04397. Chen S. X. & Qin Y.-L. ( 2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist.  38, 808– 35. Google Scholar CrossRef Search ADS   Lewis J. D. , Chen E. Z., Baldassano R. N., Otley A. R., Griffiths A. M., Lee D., Bittinger K., Bailey A., Friedman E. S., Hoffmann C. et al.   ( 2015). Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn’s disease. Cell Host Microbe  18, 489– 500. Google Scholar CrossRef Search ADS PubMed  Li H. ( 2015). Microbiome, metagenomics, and high-dimensional compositional data analysis. Ann. Rev. Statist. Appl.  2, 73– 94. Google Scholar CrossRef Search ADS   Lin W. , Shi P., Feng R. & Li H. ( 2014). Variable selection in regression with compositional covariates. Biometrika  101, 785– 97. Google Scholar CrossRef Search ADS   Segata N. , Waldron L., Ballarini A., Narasimhan V., Jousson O. & Huttenhower C. ( 2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Meth.  9, 811– 4. Google Scholar CrossRef Search ADS   Srivastava M. S. ( 2009). A test for the mean vector with fewer observations than the dimension under non-normality. J. Mult. Anal.  100, 518– 32. Google Scholar CrossRef Search ADS   Srivastava M. S. & Du M. ( 2008). A test for the mean vector with fewer observations than the dimension. J. Mult. Anal.  99, 386– 402. Google Scholar CrossRef Search ADS   The Human Microbiome Project Consortium ( 2012). Structure, function and diversity of the healthy human microbiome. Nature  486, 207– 14. Turnbaugh P. J. , Hamady M., Yatsunenko T., Cantarel B. L., Duncan A., Ley R. E., Sogin M. L., Jones W. J., Roe B. A., Affourtit J. P. et al.   ( 2009). A core gut microbiome in obese and lean twins. Nature  457, 480– 4. Google Scholar CrossRef Search ADS PubMed  Wu G. D. , Chen J., Hoffmann C., Bittinger K., Chen Y.-Y., Keilbaugh S. A., Bewtra M., Knights D., Walters W. A., Knight R. et al.   ( 2011). Linking long-term dietary patterns with gut microbial enterotypes. Science  334, 105– 8. Google Scholar CrossRef Search ADS PubMed  Zaĭtsev A. Yu. ( 1987). On the Gaussian approximation of convolutions under multidimensional analogues of S. N. Bernstein’s inequality conditions. Prob. Theory Rel. Fields  74, 535– 66. Google Scholar CrossRef Search ADS   © 2017 Biometrika Trust http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Biometrika Oxford University Press

# Two-sample tests of high-dimensional means for compositional data

, Volume 105 (1) – Mar 1, 2018
18 pages

/lp/ou_press/two-sample-tests-of-high-dimensional-means-for-compositional-data-WYdpLXtQYH
Publisher
Oxford University Press
ISSN
0006-3444
eISSN
1464-3510
D.O.I.
10.1093/biomet/asx060
Publisher site
See Article on Publisher Site

### Abstract

Summary Compositional data are ubiquitous in many scientific endeavours. Motivated by microbiome and metagenomic research, we consider a two-sample testing problem for high-dimensional compositional data and formulate a testable hypothesis of compositional equivalence for the means of two latent log basis vectors. We propose a test through the centred log-ratio transformation of the compositions. The asymptotic null distribution of the test statistic is derived and its power against sparse alternatives is investigated. A modified test for paired samples is also considered. Simulations show that the proposed tests can be significantly more powerful than tests that are applied to the raw and log-transformed compositions. The usefulness of our tests is illustrated by applications to gut microbiome composition in obesity and Crohn’s disease. 1. Introduction Compositional data, which belong to a simplex, are ubiquitous in scientific disciplines such as geology, economics and genomics. This paper is motivated by microbiome and metagenomic research, where relative abundances of hundreds to thousands of bacterial taxa on a few tens or hundreds of individuals are available for analysis (The Human Microbiome Project Consortium, 2012). Due to varying amounts of DNA-generating material across different samples, sequencing read counts are often normalized to relative abundances; the resulting data are therefore compositional (Li, 2015). One fundamental problem in microbiome data analysis is to test whether two populations have the same microbiome composition, which can be viewed as a two-sample testing problem for high-dimensional compositional data. Since the components of a composition must sum to unity, directly applying standard multivariate statistical methods intended for unconstrained data to compositional data may result in inappropriate or misleading inferences (Aitchison, 2003). Various methods for compositional data analysis have been developed since the seminal work of Aitchison (1982). Most existing methods for two-sample testing, however, deal only with the low-dimensional setting where the dimensionality is smaller than the sample size; see, for example, the generalized likelihood ratio tests discussed in Aitchison (2003, § 7$$.$$5). In this paper, we consider the two-sample testing problem for high-dimensional compositional data, where compositions in the $$(p-1)$$-dimensional simplex $$\mathcal{S}^{p-1}$$ are thought of as arising from latent basis vectors in the $$p$$-dimensional positive orthant $$\mathbb{R}_+^p$$. In microbiome studies, the basis components may represent the true abundances of bacterial taxa in a microbial community such as the gut of a healthy individual (Li, 2015). To circumvent the nonidentifiability issue associated with the basis vectors, we formulate a testable hypothesis of compositional equivalence for the means of two log basis vectors. We then propose a test through the centred log-ratio transformation of the compositions. The proposed test is therefore scale-invariant, which is crucial for compositional data analysis. We emphasize here the extrinsic analysis point of view in compositional data analysis (Aitchison, 1982), which leads to biologically meaningful interpretations, in contrast to intrinsic analysis, where interest lies solely in the composition. Classical extrinsic analysis however, primarily concerns problems where the bases are observed, and thus differs radically from the focus of this paper. The development of tests for the equality of two high-dimensional means has received much attention; see, for instance, Bai & Saranadasa (1996), Srivastava & Du (2008), Srivastava (2009), Chen & Qin (2010) and Cai et al. (2014). Such tests, however, are not directly applicable to high-dimensional compositional data because the required regularity conditions are generally not met. For example, the covariance matrix of compositional data is singular, thereby violating the usual assumptions on the eigenvalues of the covariance matrix, such as those in Cai et al. (2014). Our assumptions are imposed on the latent log basis vectors, which are free of the simplex constraint. We show that, under mild conditions, the centred log-ratio variables satisfy certain desired properties, which guarantee the validity of the proposed test. Then the asymptotic null distribution of the test statistic is derived and the power of the test against sparse alternatives is investigated. The proposed two-sample test is further modified to accommodate paired samples. All proofs are deferred to the Appendix. 2. A testable hypothesis of compositional equivalence Denote by $$X^{(k)}=(X_1^{(k)},\dots,X_{n_k}^{(k)})^{{\mathrm{\scriptscriptstyle T}}}$$ the observed $$n_k\times p$$ data matrices for group $$k$$ ($$k=1,2$$), where the $$X_i^{(k)}$$ represent compositions that lie in the $$(p-1)$$-dimensional simplex $$\mathcal{S}^{p-1}=\{(x_1,\dots,x_p):x_j>0\:(\,j=1,\dots,p),\,\sum_{j=1}^p x_j=1\}$$. We assume that the compositional variables arise from a vector of latent variables, which we call the basis. For microbiome data, the basis components may refer to the true abundances of bacterial taxa in a microbial community. Let $$W^{(k)}=(W_1^{(k)},\dots,W_{n_k}^{(k)})^{{\mathrm{\scriptscriptstyle T}}}$$ denote the $$n_k\times p$$ matrices of unobserved bases, which generate the observed compositional data via the normalization   \begin{align}\label{eq:norm} X_{ij}^{(k)}=W_{ij}^{(k)}\bigg/\sum_{\ell=1}^p W_{i\ell}^{(k)}\quad (i=1,\dots,n_k;\: j=1,\dots,p;\: k=1,2), \end{align} (1) where $$X_{ij}^{(k)}$$ and $$W_{ij}^{(k)}>0$$ are the $$j$$th components of $$X_i^{(k)}$$ and $$W_i^{(k)}$$, respectively. Denote by $$Z_i^{(k)}=(Z_{i1}^{(k)},\dots,Z_{ip}^{(k)})^{{\mathrm{\scriptscriptstyle T}}}$$ the log basis vectors, where $$Z_{ij}^{(k)}=\log W_{ij}^{(k)}$$. Suppose that $$Z_1^{(k)},\dots,Z_{n_k}^{(k)}$$ ($$k=1,2$$) are two independent samples, each from a distribution with mean $$\mu_k=(\mu_{k1},\dots,\mu_{kp})^{{\mathrm{\scriptscriptstyle T}}}$$ and having common covariance matrix $$\Omega=(\omega_{ij})$$. One might attempt to test the hypotheses   $$\label{eq:hypo_basis} H_0:\mu_1=\mu_2\quad\text{versus}\quad H_1:\mu_1=\hspace{-8pt}|\,\,\mu_2\text{.}$$ (2) These hypotheses, however, are not testable through the observed compositional data $$X^{(k)}$$ ($$k=1,2$$). Clearly, a basis is determined by its composition only up to a multiplicative factor, and the set of bases giving rise to a composition $$x\in\mathcal{S}^{p-1}$$ forms the equivalence class $$\mathcal{W}(x)=\{(tx_1,\dots,tx_p):t>0\}$$ (Aitchison, 2003, p. 32). As an immediate consequence, a log basis vector is determined by the resulting composition only up to an additive constant, and the set of log basis vectors corresponding to $$x$$ constitutes the equivalence class $$\mathcal{Z}(x)=\{(\log x_1+c,\dots,\log x_p+c):c\in\mathbb{R}\}$$. We therefore introduce the following definition. Definition 1. Two log basis vectors $$z_1$$ and $$z_2$$ are said to be compositionally equivalent if their components differ by a constant $$c\in\mathbb{R}$$, i.e., $$z_1=z_2+c1_p$$, where $$1_p$$ is the $$p$$-vector of ones.  Now, instead of testing the hypotheses in (2), we propose to test   $$\label{eq:hypo_equiv} H_0:\mu_1=\mu_2+c1_p\text{ for some }c\in\mathbb{R}\quad\text{versus}\quad H_1:\mu_1=\hspace{-8pt}|\,\,\mu_2+c1_p\text{ for any }c\in\mathbb{R},$$ (3) which are testable using only the observed compositional data. Clearly, $$H_0$$ in (2) implies $$H_0$$ in (3), so that rejecting the latter would lead to rejection of the former. Note, however, that $$H_0$$ in (3) neither implies nor is implied by $$E(X_1^{(1)})=E(X_1^{(2)})$$ or $$E(\log X_1^{(1)})=E(\log X_1^{(2)})$$. We do not consider the latter two hypotheses because they are not scale-invariant, whereas we will derive in § 3.1 an equivalent form of $$H_0$$ in (3) from which its scale-invariance will be obvious. Moreover, these two hypotheses do not allow us to obtain biological interpretations in terms of the true underlying abundances. 3. Tests for compositional equivalence 3.1. The centred log-ratio transformation and an equivalent hypothesis The unit-sum constraint entails that compositional variables must not vary independently, making many covariance-based multivariate analysis methods inapplicable. Aitchison (1982) proposed to relax the constraint by performing statistical analysis through log ratios. Among various forms of log-ratio transformations, the centred log-ratio transformation has attractive features and has been widely used. For the observed compositional data $$X^{(k)}$$ ($$k=1,2$$), the centred log-ratio matrices $$Y^{(k)}=(Y_1^{(k)},\dots,Y_{n_k}^{(k)})^{{\mathrm{\scriptscriptstyle T}}}$$ are defined by   $$\label{eq:clr_entry} Y_{ij}^{(k)}=\log\bigl\{X_{ij}^{(k)}/g(X_i^{(k)})\bigr\}\quad (i=1,\dots,n_k;\: j=1,\dots,p;\: k=1,2),$$ (4) where $$g(x)=\big(\prod_{i=1}^p x_i\big)^{1/p}$$ denotes the geometric mean of a vector $$x=(x_1,\dots,x_p)^{{\mathrm{\scriptscriptstyle T}}}$$. The relationship (4) can be expressed in the matrix form   $$\label{eq:clr_matrix} Y_i^{(k)}=G\log X_i^{(k)},$$ (5) where $$G=I_p-p^{-1}1_p1_p^{{\mathrm{\scriptscriptstyle T}}}$$. Let $$\nu_k=E(Y_1^{(k)})$$ ($$k=1,2$$). In view of the scale-invariance of the centred log ratios, we can replace $$X_i^{(k)}$$ by $$W_i^{(k)}$$ in (5) and obtain   $$\label{eq:clr_basis} Y_i^{(k)}=GZ_i^{(k)}$$ (6) and hence   $$\label{eq:clr_mean} \nu_k=G\mu_k\text{.}$$ (7) The matrix $$G$$ has rank $$p-1$$ and hence has a null space of dimension 1, i.e., $$\mathcal{N}(G)\equiv\{x\in\mathbb{R}^p:Gx=0\}=\{c1_p:c\in\mathbb{R}\}$$. As a result, $$\nu_1=\nu_2$$ if and only if $$\mu_1=\mu_2+c1_p$$ for some $$c\in\mathbb{R}$$. Therefore, testing the hypotheses in (3) is equivalent to testing   $$\label{eq:hypo_clr} H_0:\nu_1=\nu_2\quad\text{versus}\quad H_1:\nu_1=\hspace{-8pt}|\,\,\nu_2\text{.}$$ (8) Despite this equivalence, the hypotheses in (3) are meaningful only when the bases exist, which is the case in microbiome studies. On the other hand, the hypotheses in (8) concern only the compositions through the centred log ratios, from which their scale-invariance and testability using the observed compositional data are evident. 3.2. A two-sample test for compositional equivalence A natural test statistic for testing $$H_0$$ in (8), and hence $$H_0$$ in (3), would be based on the differences $$\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)}$$, where $$\bar{Y}_j^{(k)}=n_k^{-1}\sum_{i=1}^{n_k}Y_{ij}^{(k)}$$ are the sample means of the centred log ratios. Moreover, it is well known that tests against sparse alternatives based on maximum-type statistics are generally more powerful than those based on sum-of-squares-type statistics (Cai et al., 2014). Since in microbiome studies we are mainly interested in the sparse setting where only a small number of taxa may have different mean abundances, we consider the test statistic   $$\label{eq:M_n} M_n=\frac{n_1n_2}{n_1+n_2}\max_{1{\leqslant} j{\leqslant} p}\frac{(\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)})^2}{\hat\gamma_{jj}},$$ (9) where $$\hat\gamma_{jj}=(n_1+n_2)^{-1}\sum_{k=1}^2\sum_{i=1}^{n_k}\big(Y_{ij}^{(k)}-\bar{Y}_j^{(k)}\big)^2$$ are the pooled-sample centred log-ratio variances. The asymptotic properties of $$M_n$$ will be investigated in detail in § 4. Under suitable conditions on the log basis variables $$Z_{1j}^{(k)}$$, we will show that the centred log-ratio variables $$Y_{1j}^{(k)}$$ are only weakly dependent and satisfy certain desired properties. As a result, $$M_n-2\log p+\log\log p$$ converges in distribution to a Type I extreme value or Gumbel variable; see Theorem 1. We can then define the asymptotic $$\alpha$$-level test   $$\label{eq:test} \Phi_\alpha=I(M_n{\geqslant} q_\alpha+2\log p-\log\log p),$$ (10) where $$q_{\alpha}=-\log\pi-2\log\log(1-\alpha)^{-1}$$ is the $$(1-\alpha)$$-quantile of the Gumbel distribution. The null hypothesis $$H_0$$ in (3), or equivalently (8), is rejected whenever $$\Phi_\alpha=1$$. Although $$M_n$$ is similar to the test statistic $$M_I$$ defined in Cai et al. (2014), the theoretical analysis is radically different, in that our assumptions are not imposed on the observed variables. Besides, the test statistic based on a linear transformation by the precision matrix that Cai et al. (2014) proposed is not considered here, because the covariance matrix of $$Y_1^{(k)}$$ is singular and so its precision matrix is not well defined. 3.3. A paired test for compositional equivalence So far we have been concerned with two independent samples. In practice, however, one may be interested in comparing compositions on the same sample before and after treatment. For such paired samples, the proposed test requires only slight modifications. Now we observe a paired sample $$(X_{ij}^{(1)},X_{ij}^{(2)})$$ ($$i=1,\dots,n;\, j=1,\dots,p$$), which is generated by the basis $$(W_{ij}^{(1)},W_{ij}^{(2)})$$; the log basis $$(Z_{ij}^{(1)},Z_{ij}^{(2)})$$ and the centred log ratios $$(Y_{ij}^{(1)},Y_{ij}^{(2)})$$ are the same as defined previously. Write $$D_{ij}=Y_{ij}^{(1)}-Y_{ij}^{(2)}$$ and $$\bar{D}_j=n^{-1}\sum_{i=1}^nD_{ij}$$. To test $$H_0$$ in (3) or equivalently (8), we propose the test statistic   $\tilde{M}_n=n\max_{1{\leqslant} j{\leqslant} p}\bar{D}_j^2/\tilde\gamma_{jj},$ where $$\tilde\gamma_{jj}=n^{-1}\sum_{i=1}^n(D_{ij}-\bar{D}_j)^2$$ are the sample variances of $$D_{ij}$$. Note that $$\tilde{M}_n$$ is different from $$M_n$$ defined in (9) only in the variance estimates. Under appropriate assumptions on the log basis differences $$\Delta_j=Z_{1j}^{(1)}-Z_{1j}^{(2)}$$, similar to Conditions 1–5 below, we can show that $$\tilde{M}_n-2\log p+\log\log p$$ converges in distribution to the same Gumbel variable as in Theorem 1. Hence the test $$\Phi_\alpha$$ defined in (10) is still valid with $$M_n$$ replaced by $$\tilde{M}_n$$. 4. Theoretical results 4.1. Assumptions and implications As we are interested in testing the latent basis structures, we will impose conditions directly on the log basis variables. Under the assumption of a common basis covariance matrix, the two populations have a common centred log-ratio covariance matrix $$\Gamma={\rm cov}(Y_1^{(k)})$$ ($$k=1,2$$), which, in light of (6), is given by   \begin{equation*}\label{eq:cov_rel} \Gamma=G\Omega G^{{\mathrm{\scriptscriptstyle T}}}\text{.} \end{equation*} Denote the correlation matrices of $$Z_1^{(k)}$$ and $$Y_1^{(k)}$$ by $$R=(\rho_{ij})$$ and $$T=(\tau_{ij})$$, respectively. We first impose the following conditions on the covariance structures of the log basis variables: Condition 1. $$1/\kappa_1{\leqslant}\omega_{jj}{\leqslant}\kappa_1$$ for $$j=1,\dots,p$$ and some constant $$\kappa_1>0$$; Condition 2. $$\max_{1{\leqslant} i<j{\leqslant} p}|\rho_{ij}|{\leqslant} r_1$$ for some constant $$0<r_1<1$$; Condition 3. $$\max_{1{\leqslant} j{\leqslant} p}\sum_{i=1}^p\rho_{ij}^2{\leqslant} r_2$$ for some constant $$r_2>0$$. Conditions 1–3 are mild and standard in the high-dimensional testing literature. Condition 1 requires that the variances be bounded away from zero and infinity. Condition 2 is mild since $$\max_{1{\leqslant} i<j{\leqslant} p}|\rho_{ij}|=1$$ would imply that $$\Omega$$ is singular. Condition 3 is weaker than the usual assumption that the maximum eigenvalue of $$R$$ is bounded. Under Conditions 1–3, the following proposition shows that similar properties are satisfied by the centred log-ratio covariance matrix $$\Gamma$$ and correlation matrix $$T$$. Proposition 1. Assume that Conditions 1–3 hold. Then, for sufficiently large $$p$$, the centred log-ratio covariance matrix $$\Gamma$$ and correlation matrix $$T$$ satisfy the following properties: (i) $$1/\kappa_2{\leqslant}\gamma_{jj}{\leqslant}\kappa_2$$for$$j=1,\dots,p$$and some constant$$\kappa_2>0$$; (ii) $$\max_{1{\leqslant} i<j{\leqslant} p}|\tau_{ij}|{\leqslant} r_3$$for some constant$$0<r_3<1$$; and  (iii) $$\max_{1{\leqslant} j{\leqslant} p}\sum_{i=1}^p\tau_{ij}^2{\leqslant} r_4$$for some constant$$r_4>0$$. We also need a moment condition on the log basis variables and a restriction on the dimensionality. Condition 4. There exist constants $$\eta,K>0$$ such that   $E\bigl[\exp\bigl\{\eta(Z_{1j}^{(k)}-\mu_{kj})^2/\omega_{jj}\bigr\}\bigr]{\leqslant} K\quad (\,j=1,\dots,p;\: k=1,2)\text{.}$ Condition 5. We have $$n_1\asymp n_2\asymp n$$ and $$\log p=o(n^{1/3})$$, where $$n=n_1n_2/(n_1+n_2)$$. Condition 4 is a popular sub-Gaussian tail assumption that can easily be relaxed to the case of polynomial tails. It allows us to establish the following concentration properties for the centred log-ratio variables and the pooled-sample variances. Proposition 2. Under Conditions 1 and 3–5, the centred log-ratio variables satisfy  $$\label{eq:rate1} \max_{i,j,k}\big|Y_{ij}^{(k)}-\nu_{kj}\big|/\gamma_{jj}^{1/2}=o_{\rm p}(n^{1/2}/\log p),$$ (11)and the pooled-sample centred log-ratio variances $$\hat\gamma_{jj}$$ satisfy  $$\label{eq:rate2} \max_j|\hat\gamma_{jj}-\gamma_{jj}|/\gamma_{jj}=O_{\rm p}\{(\log p/n)^{1/2}\}\text{.}$$ (12) 4.2. Asymptotic properties of the two-sample test We are now in a position to state our main results concerning the asymptotic properties of the proposed two-sample test. The validity of the test relies on the fact that certain desired properties of the centred log-ratio variables can be related to those of the log basis variables, which have been established in Propositions 1 and 2. The following theorem derives the asymptotic null distribution of $$M_n$$ defined in (9). Theorem 1. Under Conditions 1–5 we have, under$$H_0$$in (3) or equivalently (8),   ${\rm pr}(M_n-2\log p+\log\log p{\leqslant} t)\to\exp\{-\pi^{-1/2}\exp(-t/2)\},\quad t \in \mathbb{R},$ as$$n,p\to\infty$$. Theorem 1 shows that the test $$\Phi_\alpha$$ defined in (10) is indeed asymptotically of level $$\alpha$$. To study the asymptotic power of the test, we consider the alternative   $$\label{eq:alt_basis} H_1:\mu_{1j}=\hspace{-8pt}|\,\,\mu_{2j}+c,\quad j\in S;\quad\mu_{1j}=\mu_{2j}+c,\quad j\in S^{\rm c}$$ (13) for some $$c\in\mathbb{R}$$ and $$S\subset\{1,\dots,p\}$$ with cardinality $$s$$, where $$S^{\rm c}$$ denotes the complement of $$S$$. This alternative, however, is difficult to analyse since $$c$$ is unknown. We now eliminate the constant $$c$$ and connect $$H_1$$ in (13) to a more convenient form in terms of $$\nu_1$$ and $$\nu_2$$. Without loss of generality, define the signal vector $$\delta=(\delta_1,\dots,\delta_p)^{{\mathrm{\scriptscriptstyle T}}}$$ by   \label{eq:alt_basis_diff} \begin{aligned} \mu_{1j}-\mu_{2j}-c=\delta_j\omega_{jj}^{1/2}\left(\frac{\log p}{n}\right)^{1/2}\quad(\,j=1,\dots,p), \end{aligned} (14) where the scaling factor $$\omega_{jj}^{1/2}(\log p/n)^{1/2}$$ is introduced for technical reasons which will become clear in the proof of Theorem 2. Under $$H_1$$ in (13), we have $$\delta_j=\hspace{-8pt}|\,\,0$$ if and only if $$j\in S$$. Summing the equations in (14) and rearranging, we obtain   $c=\bar\mu_1-\bar\mu_2-\frac{1}{p}\sum_{j=1}^p\delta_j\omega_{jj}^{1/2}\left(\frac{\log p}{n}\right)^{1/2} =\bar\mu_1-\bar\mu_2-O\!\left\{\frac{\|\delta\|_1}{p}\left(\frac{\log p}{n}\right)^{1/2}\right\}\!,$ where $$\bar\mu_k=p^{-1}\sum_{j=1}^p\mu_{kj}$$ ($$k=1,2$$), $$\|\delta\|_1=\sum_{j=1}^p|\delta_j|$$, and we have used the fact that $$\max_j\omega_{jj}=O(1)$$ by Condition 1. Since $$\nu_{kj}=\mu_{kj}-\bar\mu_k$$ ($$k=1,2$$) by (7), we see that $$H_1$$ in (13) implies   \label{eq:alt_clr} \begin{aligned} \nu_{1j}-\nu_{2j}&=\left\{\delta_j\omega_{jj}^{1/2}+O\!\left(\frac{\|\delta\|_1}{p}\right)\right\}\left(\frac{\log p}{n}\right)^{1/2},&&j\in S,\\ \nu_{1j}-\nu_{2j}&=O\!\left\{\frac{\|\delta\|_1}{p}\left(\frac{\log p}{n}\right)^{1/2}\right\}\!,&&j\in S^{\rm c}\text{.} \end{aligned} (15) Compared with the usual sparse alternatives in the literature, such as in Cai et al. (2014), all components in the alternative (15) are shifted by a term of order $$O\{\|\delta\|_1p^{-1}(\log p/n)^{1/2}\}$$. To prevent this term from interfering with signals of order at least $$O\{(\log p/n)^{1/2}\}$$, it suffices to assume that $$\|\delta\|_1=o(p)$$. This key observation leads to the following theorem concerning the asymptotic power of $$\Phi_\alpha$$ defined in (10). Theorem 2. Assume that Conditions 1 and 3–5 hold. Under$$H_1$$in (13), if$$\|\delta\|_1=o(p)$$and$$\max_{j\in S}|\delta_j|{\geqslant}\surd{2}+{\varepsilon}$$for some constant$${\varepsilon}>0$$, then$${\rm pr}(\Phi_\alpha=1)\to1$$as$$n,p\to\infty$$. Two remarks on Theorem 2 are in order. First, if the signals $$\delta_i$$ are bounded, then the condition $$\|\delta\|_1=o(p)$$ holds provided the alternative (13) is sparse in the sense that $$s=o(p)$$. Second, by Theorem 3 of Cai et al. (2014), the condition $$\max_{j\in S}|\delta_j|{\geqslant}\surd{2}+{\varepsilon}$$ is minimax rate-optimal for testing sparse alternatives in the classical two-sample testing problem. Thus, the proposed test achieves the best possible rate even though the bases are not observed. 5. Simulation studies We conducted simulation studies to evaluate the numerical performance of the proposed two-sample and paired tests. For comparison, we consider counterparts applied to the raw and log-transformed compositions, which are obtained by replacing $$Y^{(k)}$$ in the proposed tests with $$X^{(k)}$$ and $$\log X^{(k)}$$, respectively. The oracle tests based on the unobserved $$W^{(k)}$$, though impracticable, serve as the benchmarks for comparison. We first examine the case of two independent samples. The simulated data were generated as follows. We first generated $$Z^{(k)}$$ from the following distributions: (i) multivariate normal distribution, $$Z_i^{(k)}\sim N_p(\tilde\mu_k,\Omega)$$; (ii) multivariate gamma distribution, $$Z_i^{(k)}=\tilde\mu_k+FU_i^{(k)}/\surd10$$, where $$F$$ is a $$p\times p$$ matrix $$F=QS^{1/2}$$ with $$Q$$ and $$S$$ obtained from the singular value decomposition $$\Omega=QSQ^{{\mathrm{\scriptscriptstyle T}}}$$, and the components of $$U_k$$ were generated independently from the standard gamma distribution with shape parameter 10. Then $$W^{(k)}$$ and $$X^{(k)}$$ were generated through the transformation $$W_{ij}^{(k)}=\exp(Z_{ij}^{(k)})$$ and (1). Note that $$\mu_k=\tilde\mu_k$$ in case (i) and $$\mu_k=\tilde\mu_k+\surd10\, F1_p$$ in case (ii). The location parameters $$\tilde\mu_k$$ were set as follows. The components of $$\tilde\mu_1$$ were drawn from a uniform distribution $${\rm Un}(0,10)$$. Under $$H_0$$, we took $$\tilde\mu_2=\tilde\mu_1$$; under $$H_1$$, we took   $\tilde\mu_{2j}=\tilde\mu_{1j}-\delta_j\omega_{jj}^{1/2}\left(\frac{\log p}{n}\right)^{1/2},$ where the signal vector $$\delta$$ has support of size $$s=\lfloor0{\cdot}05p\rfloor, \lfloor0{\cdot}1p\rfloor$$ or $$\lfloor0{\cdot}5p\rfloor$$, with the indices chosen uniformly from $$\{1,\dots,p\}$$ and the nonzero $$\delta_j$$ drawn from $${\rm Un}[-2\surd2,2\surd2]$$. We considered the following covariance structures: (i) banded covariance, $$\Omega=D^{1/2}AD^{1/2}$$, where $$A$$ has nonzero entries $$a_{jj}=1$$ and $$a_{j-1,j}=a_{j,j+1}=-0{\cdot}5$$, and $$D$$ is a diagonal matrix with entries drawn from $${\rm Un}(1,3)$$; (ii) sparse covariance, $$\Omega={\rm diag}(A_1,A_2)$$ with $$A_1=B+{\varepsilon} I_q$$ and $$A_2=I_{p-q}$$, where $$q=\lfloor 3p^{1/2}\rfloor$$, $$B$$ is a symmetric matrix with lower-triangular entries drawn from the uniform distribution on $$[-1,-0{\cdot}5]\cup[0{\cdot}5,1]$$ with probability $$0{\cdot}5$$ and set to 0 with probability $$0{\cdot}5$$, and $${\varepsilon}=\max\{-\lambda_{\min}(B),0\}+0{\cdot}05$$ with $$\lambda_{\min}(\cdot)$$ denoting the smallest eigenvalue of a matrix. The simulation settings for the case of paired samples are similar, except that $$Z_i^{(1)}$$ and $$Z_i^{(2)}$$ are correlated and must be generated from a $$2p$$-dimensional joint distribution. The parameters $$(\tilde\mu^*,\Omega^*)$$ of the joint distribution were specified by $$\tilde\mu^*=(\tilde\mu_1^{{\mathrm{\scriptscriptstyle T}}},\tilde\mu_2^{{\mathrm{\scriptscriptstyle T}}})^{{\mathrm{\scriptscriptstyle T}}}$$ and   $\Omega^*=\begin{pmatrix} 1 & 0{\cdot}3\\ 0{\cdot}3 & 1 \end{pmatrix}\otimes\Omega\text{.}$ We took the sample sizes to be $$n_1=n_2=100$$ for two independent samples and $$n=100$$ for paired samples, with varying dimensions $$p=50,100$$ and $$200$$. We repeated the simulation 1000 times for each setting and calculated the empirical sizes and powers of four tests with significance level $$\alpha=0{\cdot}05$$. The results for two independent samples and paired samples are summarized in Tables 1 and 2. The proposed test has higher power than the tests applied to the raw and log-transformed compositions, and it controls the size reasonably well around the nominal level 0$$\cdot$$05 and closely mimics the performance of the oracle test. Its power gains over the tests based on log-transformed and raw compositions tend to be more pronounced in the more challenging scenarios with moderate dimensions and sparse signals. Its superiority does not seem to depend on the distributions or covariance structures. Table 1. Empirical sizes and powers (%) of two-sample tests with $$\alpha=0{\cdot}05$$ and $$n_1=n_2=100$$ based on $$1000$$ replications        Banded covariance  Sparse covariance     Method  $$p=50$$  $$p=100$$  $$p=200$$  $$p=50$$  $$p=100$$  $$p=200$$  Normal, $$H_0$$  Oracle  4$$\cdot$$7  5$$\cdot$$3  4$$\cdot$$8  4$$\cdot$$0  4$$\cdot$$8  5$$\cdot$$2     Proposed  4$$\cdot$$6  4$$\cdot$$9  5$$\cdot$$1  3$$\cdot$$7  4$$\cdot$$5  5$$\cdot$$3     Log  3$$\cdot$$9  5$$\cdot$$1  5$$\cdot$$3  3$$\cdot$$5  3$$\cdot$$3  5$$\cdot$$2     Raw  0$$\cdot$$9  1$$\cdot$$0  0$$\cdot$$3  1$$\cdot$$5  1$$\cdot$$0  1$$\cdot$$3  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  38$$\cdot$$2  70$$\cdot$$7  92$$\cdot$$5  40$$\cdot$$1  70$$\cdot$$7  91$$\cdot$$8     Proposed  36$$\cdot$$5  70$$\cdot$$5  92$$\cdot$$2  38$$\cdot$$0  70$$\cdot$$2  91$$\cdot$$0     Log  26$$\cdot$$1  60$$\cdot$$8  84$$\cdot$$7  25$$\cdot$$5  51$$\cdot$$4  70$$\cdot$$7     Raw  4$$\cdot$$0  5$$\cdot$$5  8$$\cdot$$2  7$$\cdot$$0  16$$\cdot$$8  23$$\cdot$$7  Normal, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  68$$\cdot$$7  90$$\cdot$$6  99$$\cdot$$1  69$$\cdot$$4  91$$\cdot$$0  99$$\cdot$$5     Proposed  66$$\cdot$$9  89$$\cdot$$9  98$$\cdot$$9  67$$\cdot$$6  91$$\cdot$$0  99$$\cdot$$5     Log  53$$\cdot$$7  79$$\cdot$$7  97$$\cdot$$3  50$$\cdot$$0  77$$\cdot$$1  91$$\cdot$$7     Raw  9$$\cdot$$1  10$$\cdot$$1  14$$\cdot$$1  16$$\cdot$$6  31$$\cdot$$7  49$$\cdot$$2  Normal $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  99$$\cdot$$3  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$4  100$$\cdot$$0  100$$\cdot$$0     Proposed  99$$\cdot$$2  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$7  100$$\cdot$$0  100$$\cdot$$0     Log  96$$\cdot$$5  99$$\cdot$$9  100$$\cdot$$0  96$$\cdot$$6  99$$\cdot$$9  100$$\cdot$$0     Raw  39$$\cdot$$2  37$$\cdot$$1  61$$\cdot$$5  55$$\cdot$$0  84$$\cdot$$0  96$$\cdot$$0  Gamma, $$H_0$$  Oracle  5$$\cdot$$6  4$$\cdot$$4  4$$\cdot$$7  5$$\cdot$$9  4$$\cdot$$8  4$$\cdot$$8     Proposed  5$$\cdot$$3  4$$\cdot$$9  4$$\cdot$$8  5$$\cdot$$0  4$$\cdot$$9  5$$\cdot$$1     Log  4$$\cdot$$7  3$$\cdot$$6  3$$\cdot$$7  5$$\cdot$$0  4$$\cdot$$7  4$$\cdot$$5     Raw  1$$\cdot$$6  0$$\cdot$$8  0$$\cdot$$2  1$$\cdot$$6  0$$\cdot$$6  1$$\cdot$$1  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  35$$\cdot$$7  70$$\cdot$$0  91$$\cdot$$3  36$$\cdot$$7  70$$\cdot$$5  92$$\cdot$$3     Proposed  36$$\cdot$$7  71$$\cdot$$5  91$$\cdot$$9  36$$\cdot$$3  68$$\cdot$$0  92$$\cdot$$0     Log  27$$\cdot$$0  52$$\cdot$$6  82$$\cdot$$8  23$$\cdot$$6  49$$\cdot$$9  66$$\cdot$$0     Raw  5$$\cdot$$1  4$$\cdot$$4  10$$\cdot$$2  4$$\cdot$$2  6$$\cdot$$0  9$$\cdot$$4  Gamma, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  68$$\cdot$$5  91$$\cdot$$8  99$$\cdot$$6  69$$\cdot$$0  91$$\cdot$$7  99$$\cdot$$5     Proposed  66$$\cdot$$8  91$$\cdot$$5  99$$\cdot$$5  66$$\cdot$$9  90$$\cdot$$8  99$$\cdot$$7     Log  52$$\cdot$$4  78$$\cdot$$4  96$$\cdot$$2  50$$\cdot$$9  75$$\cdot$$6  90$$\cdot$$7     Raw  11$$\cdot$$6  9$$\cdot$$8  17$$\cdot$$3  10$$\cdot$$3  13$$\cdot$$2  17$$\cdot$$4  Gamma $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  99$$\cdot$$9  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  99$$\cdot$$9  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$5  100$$\cdot$$0  100$$\cdot$$0     Log  96$$\cdot$$5  99$$\cdot$$7  100$$\cdot$$0  96$$\cdot$$9  99$$\cdot$$9  100$$\cdot$$0     Raw  42$$\cdot$$7  53$$\cdot$$1  61$$\cdot$$9  40$$\cdot$$4  50$$\cdot$$7  62$$\cdot$$9        Banded covariance  Sparse covariance     Method  $$p=50$$  $$p=100$$  $$p=200$$  $$p=50$$  $$p=100$$  $$p=200$$  Normal, $$H_0$$  Oracle  4$$\cdot$$7  5$$\cdot$$3  4$$\cdot$$8  4$$\cdot$$0  4$$\cdot$$8  5$$\cdot$$2     Proposed  4$$\cdot$$6  4$$\cdot$$9  5$$\cdot$$1  3$$\cdot$$7  4$$\cdot$$5  5$$\cdot$$3     Log  3$$\cdot$$9  5$$\cdot$$1  5$$\cdot$$3  3$$\cdot$$5  3$$\cdot$$3  5$$\cdot$$2     Raw  0$$\cdot$$9  1$$\cdot$$0  0$$\cdot$$3  1$$\cdot$$5  1$$\cdot$$0  1$$\cdot$$3  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  38$$\cdot$$2  70$$\cdot$$7  92$$\cdot$$5  40$$\cdot$$1  70$$\cdot$$7  91$$\cdot$$8     Proposed  36$$\cdot$$5  70$$\cdot$$5  92$$\cdot$$2  38$$\cdot$$0  70$$\cdot$$2  91$$\cdot$$0     Log  26$$\cdot$$1  60$$\cdot$$8  84$$\cdot$$7  25$$\cdot$$5  51$$\cdot$$4  70$$\cdot$$7     Raw  4$$\cdot$$0  5$$\cdot$$5  8$$\cdot$$2  7$$\cdot$$0  16$$\cdot$$8  23$$\cdot$$7  Normal, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  68$$\cdot$$7  90$$\cdot$$6  99$$\cdot$$1  69$$\cdot$$4  91$$\cdot$$0  99$$\cdot$$5     Proposed  66$$\cdot$$9  89$$\cdot$$9  98$$\cdot$$9  67$$\cdot$$6  91$$\cdot$$0  99$$\cdot$$5     Log  53$$\cdot$$7  79$$\cdot$$7  97$$\cdot$$3  50$$\cdot$$0  77$$\cdot$$1  91$$\cdot$$7     Raw  9$$\cdot$$1  10$$\cdot$$1  14$$\cdot$$1  16$$\cdot$$6  31$$\cdot$$7  49$$\cdot$$2  Normal $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  99$$\cdot$$3  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$4  100$$\cdot$$0  100$$\cdot$$0     Proposed  99$$\cdot$$2  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$7  100$$\cdot$$0  100$$\cdot$$0     Log  96$$\cdot$$5  99$$\cdot$$9  100$$\cdot$$0  96$$\cdot$$6  99$$\cdot$$9  100$$\cdot$$0     Raw  39$$\cdot$$2  37$$\cdot$$1  61$$\cdot$$5  55$$\cdot$$0  84$$\cdot$$0  96$$\cdot$$0  Gamma, $$H_0$$  Oracle  5$$\cdot$$6  4$$\cdot$$4  4$$\cdot$$7  5$$\cdot$$9  4$$\cdot$$8  4$$\cdot$$8     Proposed  5$$\cdot$$3  4$$\cdot$$9  4$$\cdot$$8  5$$\cdot$$0  4$$\cdot$$9  5$$\cdot$$1     Log  4$$\cdot$$7  3$$\cdot$$6  3$$\cdot$$7  5$$\cdot$$0  4$$\cdot$$7  4$$\cdot$$5     Raw  1$$\cdot$$6  0$$\cdot$$8  0$$\cdot$$2  1$$\cdot$$6  0$$\cdot$$6  1$$\cdot$$1  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  35$$\cdot$$7  70$$\cdot$$0  91$$\cdot$$3  36$$\cdot$$7  70$$\cdot$$5  92$$\cdot$$3     Proposed  36$$\cdot$$7  71$$\cdot$$5  91$$\cdot$$9  36$$\cdot$$3  68$$\cdot$$0  92$$\cdot$$0     Log  27$$\cdot$$0  52$$\cdot$$6  82$$\cdot$$8  23$$\cdot$$6  49$$\cdot$$9  66$$\cdot$$0     Raw  5$$\cdot$$1  4$$\cdot$$4  10$$\cdot$$2  4$$\cdot$$2  6$$\cdot$$0  9$$\cdot$$4  Gamma, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  68$$\cdot$$5  91$$\cdot$$8  99$$\cdot$$6  69$$\cdot$$0  91$$\cdot$$7  99$$\cdot$$5     Proposed  66$$\cdot$$8  91$$\cdot$$5  99$$\cdot$$5  66$$\cdot$$9  90$$\cdot$$8  99$$\cdot$$7     Log  52$$\cdot$$4  78$$\cdot$$4  96$$\cdot$$2  50$$\cdot$$9  75$$\cdot$$6  90$$\cdot$$7     Raw  11$$\cdot$$6  9$$\cdot$$8  17$$\cdot$$3  10$$\cdot$$3  13$$\cdot$$2  17$$\cdot$$4  Gamma $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  99$$\cdot$$9  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  99$$\cdot$$9  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$5  100$$\cdot$$0  100$$\cdot$$0     Log  96$$\cdot$$5  99$$\cdot$$7  100$$\cdot$$0  96$$\cdot$$9  99$$\cdot$$9  100$$\cdot$$0     Raw  42$$\cdot$$7  53$$\cdot$$1  61$$\cdot$$9  40$$\cdot$$4  50$$\cdot$$7  62$$\cdot$$9  Table 2. Empirical sizes and powers (%) of paired tests with$$\alpha=0{\cdot}05$$and$$n=100$$based on$$1000$$replications        Banded covariance  Sparse covariance     Method  $$p=50$$  $$p=100$$  $$p=200$$  $$p=50$$  $$p=100$$  $$p=200$$  Normal, $$H_0$$  Oracle  4$$\cdot$$8  5$$\cdot$$6  6$$\cdot$$3  5$$\cdot$$9  6$$\cdot$$6  6$$\cdot$$0     Proposed  4$$\cdot$$9  5$$\cdot$$5  6$$\cdot$$8  5$$\cdot$$9  6$$\cdot$$1  6$$\cdot$$5     Log  5$$\cdot$$4  4$$\cdot$$4  7$$\cdot$$1  5$$\cdot$$2  3$$\cdot$$7  4$$\cdot$$1     Raw  1$$\cdot$$1  0$$\cdot$$4  0$$\cdot$$2  1$$\cdot$$5  1$$\cdot$$1  1$$\cdot$$2  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  55$$\cdot$$3  86$$\cdot$$8  98$$\cdot$$3  54$$\cdot$$4  85$$\cdot$$8  99$$\cdot$$0     Proposed  52$$\cdot$$9  86$$\cdot$$5  98$$\cdot$$4  54$$\cdot$$0  84$$\cdot$$8  98$$\cdot$$9     Log  39$$\cdot$$6  75$$\cdot$$0  95$$\cdot$$6  35$$\cdot$$7  68$$\cdot$$7  87$$\cdot$$3     Raw  5$$\cdot$$2  7$$\cdot$$3  13$$\cdot$$7  9$$\cdot$$2  21$$\cdot$$6  34$$\cdot$$3  Normal, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  85$$\cdot$$1  98$$\cdot$$6  99$$\cdot$$9  83$$\cdot$$9  98$$\cdot$$6  99$$\cdot$$9     Proposed  82$$\cdot$$8  98$$\cdot$$4  99$$\cdot$$9  82$$\cdot$$8  98$$\cdot$$3  99$$\cdot$$9     Log  71$$\cdot$$2  94$$\cdot$$5  99$$\cdot$$6  65$$\cdot$$4  90$$\cdot$$0  98$$\cdot$$3     Raw  13$$\cdot$$8  14$$\cdot$$9  22$$\cdot$$1  22$$\cdot$$7  43$$\cdot$$7  64$$\cdot$$5  Normal $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  99$$\cdot$$5  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$7  100$$\cdot$$0  100$$\cdot$$0     Raw  50$$\cdot$$0  50$$\cdot$$0  74$$\cdot$$0  77$$\cdot$$9  94$$\cdot$$4  99$$\cdot$$3  Gamma, $$H_0$$  Oracle  4$$\cdot$$2  6$$\cdot$$1  7$$\cdot$$2  5$$\cdot$$1  6$$\cdot$$3  7$$\cdot$$3     Proposed  4$$\cdot$$3  6$$\cdot$$2  7$$\cdot$$2  5$$\cdot$$8  6$$\cdot$$1  7$$\cdot$$1     Log  4$$\cdot$$7  4$$\cdot$$7  5$$\cdot$$6  5$$\cdot$$2  4$$\cdot$$5  5$$\cdot$$5     Raw  1$$\cdot$$2  0$$\cdot$$5  0$$\cdot$$4  1$$\cdot$$4  1$$\cdot$$7  2$$\cdot$$0  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  55$$\cdot$$5  86$$\cdot$$1  98$$\cdot$$4  51$$\cdot$$3  84$$\cdot$$2  98$$\cdot$$3     Proposed  53$$\cdot$$9  86$$\cdot$$2  98$$\cdot$$4  50$$\cdot$$1  84$$\cdot$$0  98$$\cdot$$0     Log  42$$\cdot$$4  77$$\cdot$$3  94$$\cdot$$8  35$$\cdot$$9  67$$\cdot$$2  88$$\cdot$$1     Raw  6$$\cdot$$1  8$$\cdot$$4  11$$\cdot$$1  9$$\cdot$$6  22$$\cdot$$6  39$$\cdot$$4  Gamma, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  83$$\cdot$$8  97$$\cdot$$7  100$$\cdot$$0  87$$\cdot$$9  98$$\cdot$$2  100$$\cdot$$0     Proposed  82$$\cdot$$6  97$$\cdot$$1  100$$\cdot$$0  86$$\cdot$$5  98$$\cdot$$3  99$$\cdot$$9     Log  70$$\cdot$$8  93$$\cdot$$0  99$$\cdot$$8  67$$\cdot$$8  90$$\cdot$$8  98$$\cdot$$4     Raw  12$$\cdot$$0  13$$\cdot$$2  20$$\cdot$$7  24$$\cdot$$0  44$$\cdot$$0  61$$\cdot$$5  Gamma $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  99$$\cdot$$4  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$8  100$$\cdot$$0  100$$\cdot$$0     Raw  51$$\cdot$$5  52$$\cdot$$7  75$$\cdot$$0  80$$\cdot$$3  94$$\cdot$$5  99$$\cdot$$0        Banded covariance  Sparse covariance     Method  $$p=50$$  $$p=100$$  $$p=200$$  $$p=50$$  $$p=100$$  $$p=200$$  Normal, $$H_0$$  Oracle  4$$\cdot$$8  5$$\cdot$$6  6$$\cdot$$3  5$$\cdot$$9  6$$\cdot$$6  6$$\cdot$$0     Proposed  4$$\cdot$$9  5$$\cdot$$5  6$$\cdot$$8  5$$\cdot$$9  6$$\cdot$$1  6$$\cdot$$5     Log  5$$\cdot$$4  4$$\cdot$$4  7$$\cdot$$1  5$$\cdot$$2  3$$\cdot$$7  4$$\cdot$$1     Raw  1$$\cdot$$1  0$$\cdot$$4  0$$\cdot$$2  1$$\cdot$$5  1$$\cdot$$1  1$$\cdot$$2  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  55$$\cdot$$3  86$$\cdot$$8  98$$\cdot$$3  54$$\cdot$$4  85$$\cdot$$8  99$$\cdot$$0     Proposed  52$$\cdot$$9  86$$\cdot$$5  98$$\cdot$$4  54$$\cdot$$0  84$$\cdot$$8  98$$\cdot$$9     Log  39$$\cdot$$6  75$$\cdot$$0  95$$\cdot$$6  35$$\cdot$$7  68$$\cdot$$7  87$$\cdot$$3     Raw  5$$\cdot$$2  7$$\cdot$$3  13$$\cdot$$7  9$$\cdot$$2  21$$\cdot$$6  34$$\cdot$$3  Normal, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  85$$\cdot$$1  98$$\cdot$$6  99$$\cdot$$9  83$$\cdot$$9  98$$\cdot$$6  99$$\cdot$$9     Proposed  82$$\cdot$$8  98$$\cdot$$4  99$$\cdot$$9  82$$\cdot$$8  98$$\cdot$$3  99$$\cdot$$9     Log  71$$\cdot$$2  94$$\cdot$$5  99$$\cdot$$6  65$$\cdot$$4  90$$\cdot$$0  98$$\cdot$$3     Raw  13$$\cdot$$8  14$$\cdot$$9  22$$\cdot$$1  22$$\cdot$$7  43$$\cdot$$7  64$$\cdot$$5  Normal $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  99$$\cdot$$5  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$7  100$$\cdot$$0  100$$\cdot$$0     Raw  50$$\cdot$$0  50$$\cdot$$0  74$$\cdot$$0  77$$\cdot$$9  94$$\cdot$$4  99$$\cdot$$3  Gamma, $$H_0$$  Oracle  4$$\cdot$$2  6$$\cdot$$1  7$$\cdot$$2  5$$\cdot$$1  6$$\cdot$$3  7$$\cdot$$3     Proposed  4$$\cdot$$3  6$$\cdot$$2  7$$\cdot$$2  5$$\cdot$$8  6$$\cdot$$1  7$$\cdot$$1     Log  4$$\cdot$$7  4$$\cdot$$7  5$$\cdot$$6  5$$\cdot$$2  4$$\cdot$$5  5$$\cdot$$5     Raw  1$$\cdot$$2  0$$\cdot$$5  0$$\cdot$$4  1$$\cdot$$4  1$$\cdot$$7  2$$\cdot$$0  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  55$$\cdot$$5  86$$\cdot$$1  98$$\cdot$$4  51$$\cdot$$3  84$$\cdot$$2  98$$\cdot$$3     Proposed  53$$\cdot$$9  86$$\cdot$$2  98$$\cdot$$4  50$$\cdot$$1  84$$\cdot$$0  98$$\cdot$$0     Log  42$$\cdot$$4  77$$\cdot$$3  94$$\cdot$$8  35$$\cdot$$9  67$$\cdot$$2  88$$\cdot$$1     Raw  6$$\cdot$$1  8$$\cdot$$4  11$$\cdot$$1  9$$\cdot$$6  22$$\cdot$$6  39$$\cdot$$4  Gamma, $$H_1$$$$s=\lfloor0{\cdot}1p\rfloor$$  Oracle  83$$\cdot$$8  97$$\cdot$$7  100$$\cdot$$0  87$$\cdot$$9  98$$\cdot$$2  100$$\cdot$$0     Proposed  82$$\cdot$$6  97$$\cdot$$1  100$$\cdot$$0  86$$\cdot$$5  98$$\cdot$$3  99$$\cdot$$9     Log  70$$\cdot$$8  93$$\cdot$$0  99$$\cdot$$8  67$$\cdot$$8  90$$\cdot$$8  98$$\cdot$$4     Raw  12$$\cdot$$0  13$$\cdot$$2  20$$\cdot$$7  24$$\cdot$$0  44$$\cdot$$0  61$$\cdot$$5  Gamma $$s=\lfloor0{\cdot}5p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  99$$\cdot$$4  100$$\cdot$$0  100$$\cdot$$0  99$$\cdot$$8  100$$\cdot$$0  100$$\cdot$$0     Raw  51$$\cdot$$5  52$$\cdot$$7  75$$\cdot$$0  80$$\cdot$$3  94$$\cdot$$5  99$$\cdot$$0  To further examine the performance of the proposed test in very high-dimensional settings, we carried out simulations for two independent samples with dimension $$p=2000$$ and sample sizes $$n_1=n_2=100,200$$. The results are summarized in Table 3 and indicate that the proposed test still has approximately correct size and improved power over the two competing tests. Table 3. Empirical sizes and powers (%) of two-sample tests with$$\alpha=0{\cdot}05$$and$$p=2000$$based on$$1000$$replications        Banded covariance  Sparse covariance     Method  $$n_1=n_2=100$$  $$n_1=n_2=200$$  $$n_1=n_2=100$$  $$n_1=n_2=200$$  Normal, $$H_0$$  Oracle  6$$\cdot$$5  3$$\cdot$$7  7$$\cdot$$6  4$$\cdot$$1     Proposed  6$$\cdot$$4  3$$\cdot$$7  7$$\cdot$$5  4$$\cdot$$1     Log  5$$\cdot$$0  4$$\cdot$$7  2$$\cdot$$4  1$$\cdot$$8     Raw  0$$\cdot$$2  0$$\cdot$$1  0$$\cdot$$5  0$$\cdot$$9  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  100$$\cdot$$0  100$$\cdot$$0  98$$\cdot$$7  98$$\cdot$$0     Raw  48$$\cdot$$4  55$$\cdot$$0  60$$\cdot$$7  68$$\cdot$$6  Gamma, $$H_0$$  Oracle  7$$\cdot$$0  6$$\cdot$$1  6$$\cdot$$4  6$$\cdot$$7     Proposed  6$$\cdot$$9  6$$\cdot$$1  6$$\cdot$$6  7$$\cdot$$1     Log  4$$\cdot$$9  4$$\cdot$$5  2$$\cdot$$7  2$$\cdot$$2     Raw  0$$\cdot$$2  0$$\cdot$$2  0$$\cdot$$4  0$$\cdot$$1  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  100$$\cdot$$0  100$$\cdot$$0  98$$\cdot$$5  98$$\cdot$$9     Raw  36$$\cdot$$1  45$$\cdot$$3  22$$\cdot$$1  22$$\cdot$$0        Banded covariance  Sparse covariance     Method  $$n_1=n_2=100$$  $$n_1=n_2=200$$  $$n_1=n_2=100$$  $$n_1=n_2=200$$  Normal, $$H_0$$  Oracle  6$$\cdot$$5  3$$\cdot$$7  7$$\cdot$$6  4$$\cdot$$1     Proposed  6$$\cdot$$4  3$$\cdot$$7  7$$\cdot$$5  4$$\cdot$$1     Log  5$$\cdot$$0  4$$\cdot$$7  2$$\cdot$$4  1$$\cdot$$8     Raw  0$$\cdot$$2  0$$\cdot$$1  0$$\cdot$$5  0$$\cdot$$9  Normal, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  100$$\cdot$$0  100$$\cdot$$0  98$$\cdot$$7  98$$\cdot$$0     Raw  48$$\cdot$$4  55$$\cdot$$0  60$$\cdot$$7  68$$\cdot$$6  Gamma, $$H_0$$  Oracle  7$$\cdot$$0  6$$\cdot$$1  6$$\cdot$$4  6$$\cdot$$7     Proposed  6$$\cdot$$9  6$$\cdot$$1  6$$\cdot$$6  7$$\cdot$$1     Log  4$$\cdot$$9  4$$\cdot$$5  2$$\cdot$$7  2$$\cdot$$2     Raw  0$$\cdot$$2  0$$\cdot$$2  0$$\cdot$$4  0$$\cdot$$1  Gamma, $$H_1$$$$s=\lfloor0{\cdot}05p\rfloor$$  Oracle  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Proposed  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0  100$$\cdot$$0     Log  100$$\cdot$$0  100$$\cdot$$0  98$$\cdot$$5  98$$\cdot$$9     Raw  36$$\cdot$$1  45$$\cdot$$3  22$$\cdot$$1  22$$\cdot$$0  6. Applications to microbiome data 6.1. Analysis of obesity microbiome data We illustrate the application of the proposed tests by analysing two microbiome datasets. The first is a dataset from Wu et al. (2011) collected in a cross-sectional study of 98 subjects to investigate the effects of habitual diet on the human gut microbiome. The dataset was analysed by regression in Lin et al. (2014), where it was found to suggest an association between obesity and changes in gut microbiome composition. For each subject, DNA collected from stool samples was analysed by 454/Roche pyrosequencing of 16S rRNA gene segments from the V1–V2 region. An average of 9265 reads per sample were obtained, with a standard deviation of 3864, by denoising the pyrosequences prior to taxonomic assignment. The resulting 3068 operational taxonomic units were further merged into 87 genera that were observed in at least one sample. As suggested by Aitchison (2003) and Lin et al. (2014), zero counts were replaced by 0$$\cdot$$5 before the count data were converted into compositional data by normalization. Demographic information, including body mass index, BMI, was recorded on the subjects. We are interested in testing whether lean and obese individuals have the same gut microbiome composition. To this end, we divided the subjects into a lean group, BMI $$<25$$, $$n_1=63$$, and an obese group, BMI $${\geqslant}25$$, $$n_2=35$$, on which we performed various two-sample tests. The proposed test yielded a $$p$$-value of 0$$\cdot$$001, indicating a marked difference between the two groups. In contrast, the tests based on the log-transformed and raw compositions gave $$p$$-values of 0$$\cdot$$129 and 0$$\cdot$$261, and hence failed to detect the difference at the 0$$\cdot$$05 level. To assess the stability of our proposed test and to perform power comparisons, we generated 5000 bootstrap subsamples within each group, with the subsampling proportion varying from 0$$\cdot$$2 to 1. For each subsampling proportion, we obtained the empirical power as the proportion of subsamples for which the null hypothesis was rejected at the 0$$\cdot$$05 level. The empirical power curves based on the bootstrap subsamples, presented in Fig. 1(a), show that the proposed test greatly outperforms the competitors. We further conducted back-testing to check whether the signal disappears upon breaking the association. We generated 1000 bootstrap samples from the pooled data and then randomly divided each sample into two groups with the same sizes as before. The histogram of $$p$$-values from our test based on the bootstrap samples is displayed in Fig. 1(b). The $$p$$-values are distributed quite evenly, indicating good accuracy of the asymptotics. Overall, our results confirm previous findings that the microbiomes of lean and obese individuals differ at the taxonomic and functional levels (Turnbaugh et al., 2009). Fig. 1. View largeDownload slide Analysis of two microbiome datasets: empirical power curves of the proposed test (triangles) and the tests based on log-transformed (circles) and raw (squares) compositions with $$\alpha=0{\cdot}05$$ are shown in (a) for the obesity data and (c) for the Crohn’s disease data; histograms of $$p$$-values from the proposed test in the back-testing are shown in (b) for the obesity data and (d) for the Crohn’s disease data, for 1000 replicates. Fig. 1. View largeDownload slide Analysis of two microbiome datasets: empirical power curves of the proposed test (triangles) and the tests based on log-transformed (circles) and raw (squares) compositions with $$\alpha=0{\cdot}05$$ are shown in (a) for the obesity data and (c) for the Crohn’s disease data; histograms of $$p$$-values from the proposed test in the back-testing are shown in (b) for the obesity data and (d) for the Crohn’s disease data, for 1000 replicates. To further assess the sensitivity of the results to zero replacements, we repeated the analysis with the zero counts replaced by 0$$\cdot$$1 before normalization. The proposed test resulted in a $$p$$-value of 0$$\cdot$$0001, while the tests based on the log-transformed and raw compositions gave $$p$$-values of 0$$\cdot$$015 and 0$$\cdot$$080, respectively. In this case only the proposed test rejects the null hypothesis at the 0$$\cdot$$01 level, and the inference does not seem sensitive to the zero-replacement values. 6.2. Analysis of Crohn’s disease microbiome data Crohn’s disease is a type of inflammatory bowel disease characterized by altered gut bacterial composition, whose etiology appears multifactorial and remains poorly understood. We analyse a dataset from a longitudinal study of 90 pediatric Crohn’s disease patients reported by Lewis et al. (2015). Among these patients, 26 were classified as responders to anti-tumour necrosis factor therapy, where response to therapy was defined as a reduction in faecal calprotectin, FCP, concentration to $$250\,\mu$$g/g or below among those with baseline FCP greater than $$250\,\mu$$g/g. Twenty-four of the responders had stool samples collected at four time-points: baseline, and then 1 week, 4 weeks and 8 weeks into therapy. The bacterial composition was quantified using shotgun metagenomic sequencing and the MetaPhlAn package (Segata et al., 2012), yielding 43 genera that appeared in at least three samples across all time-points. As the read counts were not available, zero proportions were replaced by half or 10% of the minimum nonzero proportions in the dataset. To determine the effect of the therapy among responders, we applied various paired tests to test for changes in gut microbiome composition between baseline and the three later time-points. As shown in Table 4, the $$p$$-values for the comparison between baseline and week 8 from all tests were significant or close to significant, with the strongest evidence provided by the proposed test. The comparisons at the two earlier time-points did not yield decisive conclusions. These inferences do not seem sensitive to the zero-replacement strategies. The empirical power curves based on bootstrap subsamples in Fig. 1(c) exhibit more substantial power gains of the proposed test over the competitors with smaller sample sizes. Moreover, the histogram of $$p$$-values in Fig. 1(d) indicates that the proposed test survives the back-testing, where the observations at two time-points were randomly interchanged for each subject in the bootstrap samples. Our results provide further support for the effect of the therapy on gut microbiome composition through reduced inflammation and suggest that it may take longer for the intestinal dysbiosis to be resolved. Table 4. The $$p$$-values of paired tests applied to the Crohn’s disease microbiome data with zeros replaced by half or $$10\%$$ of the minimum nonzero proportions in the dataset     Zero replacement by half  Zero replacement by 10%     Proposed  Log  Raw  Proposed  Log  Raw  Baseline versus week 1  0$$\cdot$$119  0$$\cdot$$605  0$$\cdot$$757  0$$\cdot$$141  0$$\cdot$$611  0$$\cdot$$757  Baseline versus week 4  0$$\cdot$$460  0$$\cdot$$553  0$$\cdot$$468  0$$\cdot$$373  0$$\cdot$$684  0$$\cdot$$468  Baseline versus week 8  0$$\cdot$$014  0$$\cdot$$033  0$$\cdot$$082  0$$\cdot$$018  0$$\cdot$$058  0$$\cdot$$082     Zero replacement by half  Zero replacement by 10%     Proposed  Log  Raw  Proposed  Log  Raw  Baseline versus week 1  0$$\cdot$$119  0$$\cdot$$605  0$$\cdot$$757  0$$\cdot$$141  0$$\cdot$$611  0$$\cdot$$757  Baseline versus week 4  0$$\cdot$$460  0$$\cdot$$553  0$$\cdot$$468  0$$\cdot$$373  0$$\cdot$$684  0$$\cdot$$468  Baseline versus week 8  0$$\cdot$$014  0$$\cdot$$033  0$$\cdot$$082  0$$\cdot$$018  0$$\cdot$$058  0$$\cdot$$082  7. Discussion We have shown that it is possible to develop tests for high-dimensional parameters of the log basis variables from which compositional data are derived, even though the bases are not observed. In this regard, our method extends the scope of the log-ratio transformation methodology due to Aitchison (1982). The mild assumption that $$\|\delta\|_1=o(p)$$ for the proposed test to achieve the minimax optimal rate is due to the use of centred log-ratio variables as a proxy for the latent log basis variables, and bears a resemblance to an approximate identifiability condition for large covariance estimation from compositional data considered in Cao et al. (2016). Our testing framework may be extended in at least two directions. First, it would be worthwhile to exploit the covariance structure of compositional data for power enhancement, by borrowing ideas of Cai et al. (2014). Such an extension, however, seems nontrivial owing to the singularity of the centred log-ratio covariance matrix. Second, in addition to the global test developed in this paper, a multiple testing procedure with accurate error control would be helpful for identifying specific taxa that differ significantly between groups and contribute to the outcome of interest. Acknowledgement This research was supported in part by the U.S. National Institutes of Health and the National Natural Science Foundation of China. We thank the associate editor and three reviewers for their very helpful comments. Appendix Proofs of the theoretical results We first introduce some notation. For a matrix $$A=(a_{ij})_{p\times p}$$, denote by $$\|A\|_1$$ and $$\|A\|_{\max}$$ the matrix 1-norm and entrywise $$\ell_\infty$$-norm, respectively, i.e., $$\|A\|_1=\max_{1{\leqslant} j{\leqslant} p}\sum_{i=1}^p|a_{ij}|$$ and $$\|A\|_{\max}=\max_{1{\leqslant} i,j{\leqslant} p}|a_{ij}|$$. Write $$a_{i\cdot}=p^{-1}\sum_{j=1}^p a_{ij}$$ and $$a_{\cdot\cdot}=p^{-2}\sum_{i=1}^p\sum_{j=1}^p a_{ij}$$. We will use $$C_1,C_2,\ldots>0$$ to denote generic constants, whose values may vary from line to line. Proof of Proposition 1 By Condition 3, we have   $$\label{eq:R_1norm} \|R\|_1{\leqslant} p^{1/2}\max_{1{\leqslant} j{\leqslant} p}\left(\sum_{i=1}^p\rho_{ij}^2\right)^{1/2}{\leqslant} p^{1/2}r_2^{1/2}=O(p^{1/2})\text{.}$$ (A1) We write $$\gamma_{ij}=\omega_{ij}-\omega_{i\cdot}-\omega_{j\cdot}+\omega_{\cdot\cdot}$$. It follows from Condition 1 and (A1) that   $$\label{eq:omega_i} |\omega_{i\cdot}|{\leqslant}\frac{1}{p}\sum_{j=1}^p|\omega_{ij}|{\leqslant}\frac{1}{p}\max_{1{\leqslant} j{\leqslant} p}|\omega_{jj}|\sum_{j=1}^p|\rho_{ij}|{\leqslant}\frac{1}{p}\kappa_1\|R\|_1 =O(p^{-1/2})$$ (A2) and, similarly,   $$\label{eq:omega_j} |\omega_{j\cdot}|=O(p^{-1/2}),\quad |\omega_{\cdot\cdot}|=O(p^{-1/2})\text{.}$$ (A3) Hence   $$\label{eq:gamma_omega} \|\Gamma-\Omega\|_{\max}{\leqslant}\max_{1{\leqslant} i,j{\leqslant} p}\big(|\omega_{i\cdot}|+|\omega_{j\cdot}|+|\omega_{\cdot\cdot}|\big)=O(p^{-1/2})\text{.}$$ (A4) This and Condition 1 imply (i). To show (ii), we write   $\tau_{ij}=\frac{\gamma_{ij}}{(\gamma_{ii}\gamma_{jj})^{1/2}}=\frac{\omega_{ij}+{\varepsilon}_1}{\{(\omega_{ii}+{\varepsilon}_2)(\omega_{jj}+{\varepsilon}_3)\}^{1/2}},$ where $${\varepsilon}_1=-\omega_{i\cdot}-\omega_{j\cdot}+\omega_{\cdot\cdot}$$, $${\varepsilon}_2=-2\omega_{i\cdot}+\omega_{\cdot\cdot}$$ and $${\varepsilon}_3=-2\omega_{j\cdot}+\omega_{\cdot\cdot}$$. By (A2) and (A3), we have $${\varepsilon}_i=O(p^{-1/2})$$ ($$i=1,2,3$$). Therefore, by Condition 1,   \begin{align} \tau_{ij}&=\frac{\omega_{ij}+{\varepsilon}_1}{(\omega_{ii}\omega_{jj})^{1/2}}\left\{\frac{(\omega_{ii}+{\varepsilon}_2)(\omega_{jj}+{\varepsilon}_3)}{\omega_{ii}\omega_{jj}} \right\}^{-1/2}=\frac{\rho_{ij}+O(p^{-1/2})}{[\{1+O(p^{-1/2})\}\{1+O(p^{-1/2})\}]^{1/2}}\notag\\ &=\rho_{ij}+O(p^{-1/2})\label{eq:tau-rho}, \end{align} (A5) which, together with Condition 2, implies (ii). To show (iii), noting that $$\tau_{ij}^2-\rho_{ij}^2=(\tau_{ij}-\rho_{ij})^2+2\rho_{ij}(\tau_{ij}-\rho_{ij})$$ and using (A1) and (A5), we have   $\sum_{i=1}^p(\tau_{ij}^2-\rho_{ij}^2)=\sum_{i=1}^p(\tau_{ij}-\rho_{ij})^2+2\sum_{i=1}^p\rho_{ij}(\tau_{ij}-\rho_{ij})=O(1)+2\|R\|_1O(p^{-1/2})=O(1)\text{.}$ This and Condition 3 imply (iii) and thus complete the proof. Proof of Proposition 2 We first write   $Y_{ij}^{(k)}-\nu_{kj}=Z_{ij}^{(k)}-\mu_{kj}+\frac{1}{p}\sum_{j=1}^p(Z_{ij}^{(k)}-\mu_{kj})\text{.}$ It follows from Condition 1 and Proposition 1 that   $$\label{eq:y_to_z} \frac{|Y_{ij}^{(k)}-\nu_{kj}|}{\gamma_{jj}^{1/2}}{\leqslant}\frac{\omega_{jj}^{1/2}}{\gamma_{jj}^{1/2}} \Biggl(\frac{|Z_{ij}^{(k)}-\mu_{kj}|}{\omega_{jj}^{1/2}} +\frac{1}{p}\sum_{j=1}^p\frac{|Z_{ij}^{(k)}-\mu_{kj}|}{\omega_{jj}^{1/2}}\Biggr) {\leqslant}2(\kappa_1\kappa_2)^{1/2}\max_{i,j,k}\frac{|Z_{ij}^{(k)}-\mu_{kj}|}{\omega_{jj}^{1/2}}\text{.}$$ (A6) Using Condition 4 and applying Markov’s inequality and the union bound, we obtain   ${\rm pr}\!\left(\max_{i,j,k}|Z_{ij}^{(k)}-\mu_{kj}|/\omega_{jj}^{1/2}{\geqslant} t\right){\leqslant}(n_1+n_2)pK\exp(-\eta t^2)$ for all $$t>0$$. Hence, by Condition 5,   $$\label{eq:z_rate_exp} \max_{i,j,k}\big|Z_{ij}^{(k)}-\mu_{kj}\big|/\omega_{jj}^{1/2}=O_{\rm p}\big\{(\log n+\log p)^{1/2}\big\}=o_{\rm p}(n^{1/2}/\log p)\text{.}$$ (A7) Combining (A6) with (A7), we arrive at (11). To prove (12), without loss of generality we assume $$\mu_1=\mu_2=0$$. Let $$\hat\gamma_{jj}^{(k)}$$ denote the sample centred log-ratio variances for population $$k$$ ($$k=1,2$$). Observe that   $\big|\hat\gamma_{jj}^{(k)}-\gamma_{jj}\big|=\big|\hat\omega_{jj}^{(k)}-2\hat\omega_{j\cdot}^{(k)}+\hat\omega_{\cdot\cdot}^{(k)}-(\omega_{jj}-2\omega_{j\cdot} +\omega_{\cdot\cdot})\big|{\leqslant}4\max_{i,j}\big|\hat\omega_{ij}^{(k)}-\omega_{ij}\big|\text{.}$ It follows from Condition 1 and Proposition 1 that   $\frac{|\hat\gamma_{jj}^{(k)}-\gamma_{jj}|}{\gamma_{jj}}{\leqslant}\frac{4}{\gamma_{jj}}\max_{i,j}\frac{|\hat\omega_{ij}^{(k)}-\omega_{ij}|} {(\omega_{ii}\omega_{jj})^{1/2}}\,(\omega_{ii}\omega_{jj})^{1/2} {\leqslant}4\kappa_1\kappa_2\max_{i,j}\frac{|\hat\omega_{ij}^{(k)}-\omega_{ij}|}{(\omega_{ii}\omega_{jj})^{1/2}}\text{.}$ The proof is completed by invoking the following lemma, which recaps a concentration result in Bickel & Levina (2008), and using the fact that $$\hat\gamma_{jj}=n_1\hat\gamma_{jj}^{(1)}/(n_1+n_2)+n_2\hat\gamma_{jj}^{(2)}/(n_1+n_2)$$. Lemma A1. Under Condition 4, there exist constants $$C_1,C_2,C_3,C_4>0$$ such that  ${\rm pr}\!\left\{\max_{i,j}\frac{|\hat\omega_{ij}^{(k)}-\omega_{ij}|}{(\omega_{ii}\omega_{jj})^{1/2}}{\geqslant} t\right\}{\leqslant} C_1p\exp(-C_2n_kt/2)+C_3p^2\exp(-C_4n_kt^2/4)\quad (t>0;\, k=1,2)\text{.}$ Proof of Theorem 1 Let $$t_p=t+2\log p-\log\log p$$ and   $$\label{eq:M_nstar} M_n^*=n\max_{1{\leqslant} j{\leqslant} p}\frac{(\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)})^2}{\gamma_{jj}}\text{.}$$ (A8) We first show that under $$H_0$$ in (3), or equivalently (8), for any fixed $$t\in\mathbb{R}$$ we have   $$\label{eq:limit} {\rm pr}(M_n^*{\leqslant} t_p)\to\exp\{-\pi^{-1/2}\exp(-t/2)\}$$ (A9) as $$n,p\to\infty$$. By the Bonferroni inequality, for any fixed integer $$m$$ with $$1{\leqslant} m{\leqslant} p/2$$,   $\sum_{d=1}^{2m}(-1)^{d-1}\!\!\!\sum_{1{\leqslant} j_1<\cdots<j_d{\leqslant} p}{\rm pr}\!\left(\bigcap_{k=1}^d E_{j_k}\right){\leqslant}{\rm pr}(M_n^*{\geqslant} t_p){\leqslant} \sum_{d=1}^{2m-1}(-1)^{d-1}\!\!\!\sum_{1{\leqslant} j_1<\cdots<j_d{\leqslant} p}{\rm pr}\!\left(\bigcap_{k=1}^d E_{j_k}\right)\!,$ where   $E_j=\left\{n\frac{(\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)})^2}{\gamma_{jj}}{\geqslant} t_p\right\}\!\text{.}$ Under $$H_0$$, we write   $n^{1/2}\frac{\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)}}{\gamma_{jj}^{1/2}}=\frac{n^{1/2}}{n_1}\sum_{i=1}^{n_1}\frac{Y_{ij}^{(1)}-\nu_{1j}}{\gamma_{jj}^{1/2}} -\frac{n^{1/2}}{n_2}\sum_{i=1}^{n_2}\frac{Y_{ij}^{(2)}-\nu_{2j}}{\gamma_{jj}^{1/2}}\equiv\sum_{i=1}^{n_1+n_2}\xi_{ij}\text{.}$ By Proposition 2, it suffices to consider the event $$\{\max_{i,j}|\xi_{ij}|{\leqslant} C_1(\log p)^{-1}\}$$ for some constant $$C_1>0$$, which occurs with probability tending to 1. Let $$N=(N_1,\dots,N_p)^{{\mathrm{\scriptscriptstyle T}}}$$ be multivariate normal with mean zero and covariance matrix $$nR/n_1+nR/n_2=R$$. Applying Theorem 1.1 of Zaĭtsev (1987), for any sequence $${\varepsilon}_n=o(1)$$ we have   \begin{align*} {\rm pr}\!\left(\bigcap_{k=1}^d E_{j_k}\right)&={\rm pr}\!\left(\min_{1{\leqslant} k{\leqslant} d}\left|\sum_{i=1}^{n_1+n_2}\xi_{ij_k}\right|{\geqslant} t_p^{1/2}\right)\\ &{\leqslant}{\rm pr}\!\left(\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}-{\varepsilon}_n\right)+O\!\left\{d^{5/2}\exp\left(-C_2\frac{\log p}{d^3}\right)\right\}\\ &{\leqslant}{\rm pr}\!\left(\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}-{\varepsilon}_n\right)+O(p^{-C_3}) \end{align*} and, similarly,   ${\rm pr}\!\left(\bigcap_{k=1}^d E_{j_k}\right){\geqslant}{\rm pr}\!\left(\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}+{\varepsilon}_n\right)+O(p^{-C_3})\text{.}$ Hence   \begin{align*} \sum_{d=1}^{2m}(-1)^{d-1}\!\!\!\sum_{1{\leqslant} j_1<\cdots<j_d{\leqslant} p}&{\rm pr}\!\left\{\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}+(-1)^{d-1}{\varepsilon}_n\right\}+o(1)\\ &{\leqslant}{\rm pr}(M_n^*{\geqslant} t_p)\\ &{\leqslant}\sum_{d=1}^{2m-1}(-1)^{d-1}\!\!\!\sum_{1{\leqslant} j_1<\cdots<j_d{\leqslant} p}{\rm pr}\!\left\{\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}+(-1)^d{\varepsilon}_n\right\}+o(1)\text{.} \end{align*} Then (A9) is proved by applying the following lemma, which follows from the same arguments as those for Lemma 6 of Cai et al. (2014), and letting $$m\to\infty$$. Lemma A2. Under Conditions 2 and 3, we have  $\sum_{1{\leqslant} j_1<\cdots<j_d{\leqslant} p}{\rm pr}\!\left(\min_{1{\leqslant} k{\leqslant} d}|N_{j_k}|{\geqslant} t_p^{1/2}\pm{\varepsilon}_n\right) =\frac{1}{d!}\,\pi^{-d/2}\exp\!\left(-\frac{dt}{2}\right)\{1+o(1)\}\text{.}$ Finally, consider the event $$\{\max_j|\hat\gamma_{jj}-\gamma_{jj}|/\gamma_{jj}{\leqslant} C_4(\log p/n)^{1/2}\}$$ for some constant $$C_4>0$$, which occurs with probability tending to 1 by Proposition 2. Then   $$\label{eq:M_nineq} |M_n-M_n^*|{\leqslant} n\max_{1{\leqslant} j{\leqslant} p}\frac{(\bar{Y}_j^{(1)}-\bar{Y}_j^{(2)})^2}{\hat\gamma_{jj}}\max_{1{\leqslant} j{\leqslant} p}\frac{|\hat\gamma_{jj}-\gamma_{jj}|}{\gamma_{jj}}{\leqslant} C_4M_n\left(\frac{\log p}{n}\right)^{1/2}=M_n o\!\left(\frac{1}{\log p}\right)$$ (A10) by Condition 5. This, together with (A9), completes the proof. Proof of Theorem 2 In view of (A10) with $$M_n^*$$ defined in (A8), it suffices to prove that under $$H_1$$ in (13),   $$\label{eq:M_nconv} {\rm pr}(M_n^*{\geqslant} q_\alpha+2\log p-\log\log p)\to1\text{.}$$ (A11) By assumption, there exists some $$j_0\in S$$ such that $$|\delta_{j_0}|{\geqslant}\surd{2}+{\varepsilon}$$. We write   $n^{1/2}\frac{\bar{Y}_{j_0}^{(1)}-\bar{Y}_{j_0}^{(2)}}{\gamma_{j_0j_0}^{1/2}}=n^{1/2}\frac{\bar{Y}_{j_0}^{(1)}-\nu_{1j_0}}{\gamma_{j_0j_0}^{1/2}} -n^{1/2}\frac{\bar{Y}_{j_0}^{(2)}-\nu_{2j_0}}{\gamma_{j_0j_0}^{1/2}}+n^{1/2}\frac{\nu_{1j_0}-\nu_{2j_0}}{\gamma_{j_0j_0}^{1/2}}\equiv T_1+T_2+T_3\text{.}$ Note that $$T_1=O_{\rm p}(1)$$ and $$T_2=O_{\rm p}(1)$$ by the central limit theorem. Define   $T_4=n^{1/2}\frac{\nu_{1j_0}-\nu_{2j_0}}{\omega_{j_0j_0}^{1/2}}\text{.}$ It follows from (A4) and Condition 1 that   $|T_3-T_4|=n^{1/2}\frac{|\nu_{1j_0}-\nu_{2j_0}|}{\gamma_{j_0j_0}^{1/2}}\frac{|\gamma_{j_0j_0}^{1/2}-\omega_{j_0j_0}^{1/2}|}{\omega_{j_0j_0}^{1/2}} =|T_3|O(p^{-1/2})\text{.}$ Then, using (15) and the assumption $$\|\delta\|_1=o(p)$$, we have   \begin{align*} T_3&=T_4\{1+O(p^{-1/2})\}=n^{1/2}\frac{\delta_{j_0}\omega_{j_0j_0}^{1/2}+o(1)}{\omega_{j_0j_0}^{1/2}}\left(\frac{\log p}{n}\right)^{1/2} \{1+O(p^{-1/2})\}\\ &=\{\delta_{j_0}+o(1)\}(\log p)^{1/2}{\geqslant}\left(\surd{2}+\frac{{\varepsilon}}{2}\right)(\log p)^{1/2} \end{align*} for sufficiently large $$p$$. Combining these bounds, we conclude that, with probability tending to 1,   $n^{1/2}\frac{|\bar{Y}_{j_0}^{(1)}-\bar{Y}_{j_0}^{(2)}|}{\gamma_{j_0j_0}^{1/2}}{\geqslant}(q_\alpha+2\log p-\log\log p)^{1/2}\text{.}$ This implies (A11) and completes the proof. References Aitchison J. ( 1982). The statistical analysis of compositional data (with Discussion). J. R. Statist. Soc. B  44, 139– 77. Aitchison J. ( 2003). The Statistical Analysis of Compositional Data . Caldwell: Blackburn Press. Bai Z. & Saranadasa H. ( 1996). Effect of high dimension: By an example of a two sample problem. Statist. Sinica  6, 311– 29. Bickel P. J. & Levina E. ( 2008). Covariance regularization by thresholding. Ann. Statist.  36, 2577– 604. Google Scholar CrossRef Search ADS   Cai T. T. , Liu W. & Xia Y. ( 2014). Two-sample test of high dimensional means under dependence. J. R. Statist. Soc. B  76, 349– 72. Google Scholar CrossRef Search ADS   Cao Y. , Lin W. & Li H. ( 2016). Large covariance estimation for compositional data via composition-adjusted thresholding. arXiv: 1601.04397. Chen S. X. & Qin Y.-L. ( 2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist.  38, 808– 35. Google Scholar CrossRef Search ADS   Lewis J. D. , Chen E. Z., Baldassano R. N., Otley A. R., Griffiths A. M., Lee D., Bittinger K., Bailey A., Friedman E. S., Hoffmann C. et al.   ( 2015). Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn’s disease. Cell Host Microbe  18, 489– 500. Google Scholar CrossRef Search ADS PubMed  Li H. ( 2015). Microbiome, metagenomics, and high-dimensional compositional data analysis. Ann. Rev. Statist. Appl.  2, 73– 94. Google Scholar CrossRef Search ADS   Lin W. , Shi P., Feng R. & Li H. ( 2014). Variable selection in regression with compositional covariates. Biometrika  101, 785– 97. Google Scholar CrossRef Search ADS   Segata N. , Waldron L., Ballarini A., Narasimhan V., Jousson O. & Huttenhower C. ( 2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Meth.  9, 811– 4. Google Scholar CrossRef Search ADS   Srivastava M. S. ( 2009). A test for the mean vector with fewer observations than the dimension under non-normality. J. Mult. Anal.  100, 518– 32. Google Scholar CrossRef Search ADS   Srivastava M. S. & Du M. ( 2008). A test for the mean vector with fewer observations than the dimension. J. Mult. Anal.  99, 386– 402. Google Scholar CrossRef Search ADS   The Human Microbiome Project Consortium ( 2012). Structure, function and diversity of the healthy human microbiome. Nature  486, 207– 14. Turnbaugh P. J. , Hamady M., Yatsunenko T., Cantarel B. L., Duncan A., Ley R. E., Sogin M. L., Jones W. J., Roe B. A., Affourtit J. P. et al.   ( 2009). A core gut microbiome in obese and lean twins. Nature  457, 480– 4. Google Scholar CrossRef Search ADS PubMed  Wu G. D. , Chen J., Hoffmann C., Bittinger K., Chen Y.-Y., Keilbaugh S. A., Bewtra M., Knights D., Walters W. A., Knight R. et al.   ( 2011). Linking long-term dietary patterns with gut microbial enterotypes. Science  334, 105– 8. Google Scholar CrossRef Search ADS PubMed  Zaĭtsev A. Yu. ( 1987). On the Gaussian approximation of convolutions under multidimensional analogues of S. N. Bernstein’s inequality conditions. Prob. Theory Rel. Fields  74, 535– 66. Google Scholar CrossRef Search ADS   © 2017 Biometrika Trust

### Journal

BiometrikaOxford University Press

Published: Mar 1, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly ### Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. ### Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve ### Freelancer DeepDyve ### Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations