# A randomization-based perspective on analysis of variance: a test statistic robust to treatment effect heterogeneity

Summary Fisher randomization tests for Neyman’s null hypothesis of no average treatment effect are considered in a finite-population setting associated with completely randomized experiments involving more than two treatments. The consequences of using the $$F$$ statistic to conduct such a test are examined, and we argue that under treatment effect heterogeneity, use of the $$F$$ statistic in the Fisher randomization test can severely inflate the Type I error under Neyman’s null hypothesis. We propose to use an alternative test statistic, derive its asymptotic distributions under Fisher’s and Neyman’s null hypotheses, and demonstrate its advantages through simulations. 1. Introduction One-way analysis of variance (Fisher, 1925; Scheffé, 1959) is perhaps the most commonly used tool to analyse completely randomized experiments with more than two treatments. The standard $$F$$ test for testing equality of mean treatment effects can be justified either by assuming a linear additive superpopulation model with identically and independently distributed normal error terms, or by using the asymptotic randomization distribution of the $$F$$ statistic. Units in real-life experiments are rarely random samples from a superpopulation, making a finite-population randomization-based perspective on inference important (e.g., Rosenbaum, 2010; Dasgupta et al., 2015; Imbens & Rubin, 2015). Fisher randomization tests are useful tools for such inference, because they pertain to a finite population of units and assess the statistical significance of treatment effects without any assumptions about the underlying outcome distribution.
In causal inference from a finite population, two hypotheses are of interest: Fisher’s sharp null hypothesis of no treatment effect on any experimental unit (Fisher, 1935; Rubin, 1980), and Neyman’s null hypothesis of no average treatment effect (Neyman, 1923, 1935). These hypotheses are equivalent when there is no treatment effect heterogeneity (Ding et al., 2016) or, equivalently, under the assumption of strict additivity of treatment effects, i.e., the same treatment effect for each unit (Kempthorne, 1952). In the context of a multi-treatment completely randomized experiment, Neyman’s null hypothesis allows for treatment effect heterogeneity, which is weaker than Fisher’s null hypothesis and is sometimes of greater interest. We find that the Fisher randomization test using the $$F$$ statistic can inflate the Type I error under Neyman’s null hypothesis, when the sample sizes and variances of the outcomes under different treatment levels are negatively associated. We propose to use the $$X^2$$ statistic defined in § 5, a statistic that is robust with respect to treatment effect heterogeneity, because the resulting Fisher randomization test is exact under Fisher’s null hypothesis and controls asymptotic Type I error under Neyman’s null hypothesis. 2. Completely randomized experiment with $$J$$ treatments Consider a finite population of $$N$$ experimental units, each of which can be exposed to any one of $$J$$ treatments. Let $$Y_i(j)$$ denote the potential outcome (Neyman, 1923; Rubin, 1974) of unit $$i$$ when assigned to treatment level $$j$$ ($$i=1,\ldots, N;\,j=1,\ldots, J)$$. 
For two different treatment levels $$j$$ and $$j'$$, we define the unit-level treatment effect as $$\tau_i(j,j') = Y_i(j) - Y_i(j')$$ and the population-level treatment effect as   \begin{equation*} \tau(j,j') = N^{-1}\sum_{i=1}^N \tau_i(j,j') = N^{-1} \sum_{i=1}^N \{ Y_i(j) - Y_i(j')\} \equiv \bar{Y}_{\cdot}(j) - \bar{Y}_{\cdot}(j'), \end{equation*} where $$\bar{Y}_{\cdot}(j) = N^{-1} \sum_{i=1}^N Y_i(j)$$ is the average of the $$N$$ potential outcomes for treatment $$j$$. For treatment level $$j = 1, \ldots, J$$, define $$p_j = N_j/N$$ as the proportion of the units assigned to treatment $$j$$ and $$S_{\cdot}^2(j) = (N -1)^{-1} \sum_{i=1}^N \{ Y_i(j) - \bar{Y}_{\cdot}(j) \}^2$$ as the finite-population variance of the potential outcomes. The treatment assignment mechanism can be represented by the binary random variable $$W_i(j),$$ which equals $$1$$ if the $$i$$th unit is assigned to treatment $$j$$ and $$0$$ otherwise. Equivalently, it can be represented by the discrete random variable $$W_i = \sum_{j=1}^J j W_i(j)$$, the treatment received by unit $$i$$. Let $$(W_1, \ldots, W_N)$$ be the treatment assignment vector, and let $$(w_1,\ldots, w_N)$$ denote its realization. For the $$N=\sum_{j=1}^J N_j$$ units, $$(N_1,\ldots, N_J)$$ are assigned at random to treatments $$(1,\ldots, J)$$, respectively, and the treatment assignment mechanism satisfies $${\rm{pr}}\{ (W_1, \ldots, W_N) = (w_1,\ldots, w_N) \} = \prod_{j=1}^J N_j!/N!$$ if $$\sum_{i=1}^N W_i(j) = N_j$$ and $$0$$ otherwise. The observed outcome of unit $$i$$ is a deterministic function of the treatment it has received and the potential outcomes, given by $$Y_i^{\rm{obs}} = \sum_{j=1}^J W_i(j) Y_i(j)$$. 3. 
The Fisher randomization test under the sharp null hypothesis Fisher (1935) was interested in testing the following sharp null hypothesis of zero individual treatment effects:   \begin{equation*} H_{0{\rm{F}}}: Y_i(1) = \cdots = Y_i(J) \quad (i=1,\ldots, N)\text{.} \end{equation*} Under $$H_{0{\rm{F}}}$$, all $$J$$ potential outcomes $$Y_i(1), \ldots, Y_i(J)$$ equal the observed outcome $$Y_i^{\rm{obs}}$$, for all units $$i = 1, \ldots, N$$. Thus any possible realization of the treatment assignment vector would generate the same vector of observed outcomes. This means that under $$H_{0{\rm{F}}}$$ and given any realization $$(W_1,\ldots, W_N) = (w_1, \ldots, w_N)$$, the observed outcomes are fixed. Consequently, the randomization distribution or null distribution of any test statistic, which is a function of the observed outcomes and the treatment assignment vector, is its distribution over all possible realizations of the treatment assignment. The $$p$$-value is the tail probability measuring the extremeness of the test statistic with respect to its randomization distribution. Computationally, we can enumerate or simulate a subset of all possible randomizations to obtain the randomization distribution of any test statistic and thus perform the Fisher randomization test (Fisher, 1935; Imbens & Rubin, 2015). Fisher (1925) suggested using the $$F$$ statistic to test the departure from $$H_{0{\rm{F}}}$$. Define $$\bar{Y}_{\cdot}^{\rm{obs}}(j) = N_j^{-1} \sum_{i=1}^N W_i(j)Y_i^{\rm{obs}}$$ as the sample average of the observed outcomes within treatment level $$j$$, and define $$\bar{Y}_{\cdot}^{\rm{obs}} = N^{-1} \sum_{i=1}^N Y_i^{\rm{obs}}$$ as the sample average of all the observed outcomes. 
Let $$s^2_{{\rm{obs}}}(j) = ( N_j - 1 )^{-1} \sum_{i=1}^N W_i(j) \{ Y_i^{\rm{obs}} - \bar{Y}_{\cdot}^{\rm{obs}} (j) \}^2$$ and $$s^2_{\rm{obs}} = (N-1)^{-1} \sum_{i=1}^N (Y_i^{\rm{obs}} - \bar{Y}_{\cdot}^{\rm{obs}})^2$$ be the corresponding sample variances with divisors $$N_j-1$$ and $$N-1$$, respectively. Let   \begin{equation*} {\small{\rm{SS}}}_{\rm{T}} = \sum_{j=1}^J N_j \{ \bar{Y}_{\cdot}^{\rm{obs}}(j) - \bar{Y}_{\cdot}^{\rm{obs}} \}^2 \end{equation*} be the treatment sum of squares, and let   \begin{equation*} {\small{\rm{SS}}}_{\rm{R}} = \sum_{j=1}^J \:\sum_{i:\,W_i(j) = 1} \{ Y_i^{\rm{obs}} - \bar{Y}_{\cdot}^{\rm{obs}} (j) \}^2 = \sum_{j=1}^J (N_j - 1) s^2_{{\rm{obs}}}(j) \end{equation*} be the residual sum of squares. The treatment and residual sums of squares add up to the total sum of squares $$\sum_{i=1}^N (Y_i^{\rm{obs}} - \bar{Y}_{\cdot}^{\rm{obs}})^2 = (N-1)s^2_{\rm{obs}}$$. The $$F$$ statistic   $$F = \frac{ {\small{\rm{SS}}}_{\rm{T}} / (J-1) }{ {\small{\rm{SS}}}_{\rm{R}} / (N-J) } \equiv \frac{ {\small{\rm{MS}}}_{\rm{T}} }{ {\small{\rm {MS}}}_{\rm{R}}} \label{eq:F}$$ (1) is defined as the ratio of the treatment mean square $${\small{\rm{MS}}}_{\rm{T}} = {\small{\rm{SS}}}_{\rm{T}} / (J-1)$$ to the residual mean square $${\small{\rm {MS}}}_{\rm{R}}= {\small{\rm{SS}}}_{\rm{R}} / (N-J)$$. The distribution of (1) under $$H_{0{\rm{F}}}$$ can be well approximated by an $$F_{J-1, N-J}$$ distribution with degrees of freedom $$J-1$$ and $$N-J$$, as is often used in the analysis of variance table obtained from fitting a normal linear model. Although it is relatively easy to show that (1) follows $$F_{J-1, N-J}$$ if the observed outcomes follow a normal linear model drawn from a superpopulation, arriving at such a result via a purely randomization-based argument is nontrivial. 
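The computation described in this section can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names `f_statistic` and `randomization_p_value`, the default of 2000 draws, and the toy data are assumptions made here for illustration and are not part of the original analysis. The first function evaluates (1) from the observed outcomes and treatment labels; the second approximates the Fisher randomization $$p$$-value by reshuffling the labels while holding the outcomes fixed, as $$H_{0{\rm{F}}}$$ permits.

```python
import random


def f_statistic(y, w):
    """F statistic (1): treatment mean square over residual mean square."""
    groups = {}
    for yi, wi in zip(y, w):
        groups.setdefault(wi, []).append(yi)
    n, j = len(y), len(groups)
    grand_mean = sum(y) / n
    ss_t = 0.0  # treatment sum of squares
    ss_r = 0.0  # residual sum of squares
    for ys in groups.values():
        m = sum(ys) / len(ys)
        ss_t += len(ys) * (m - grand_mean) ** 2
        ss_r += sum((v - m) ** 2 for v in ys)
    return (ss_t / (j - 1)) / (ss_r / (n - j))


def randomization_p_value(y_obs, w_obs, statistic, n_draws=2000, seed=0):
    """Monte Carlo Fisher randomization p-value (hypothetical helper).

    Under the sharp null the observed outcomes are fixed, so only the
    labels are re-drawn; shuffling preserves the sizes (N_1, ..., N_J).
    """
    rng = random.Random(seed)
    t_obs = statistic(y_obs, w_obs)
    w = list(w_obs)
    n_extreme = 0
    for _ in range(n_draws):
        rng.shuffle(w)
        if statistic(y_obs, w) >= t_obs:
            n_extreme += 1
    return n_extreme / n_draws


# Toy data (made up for illustration): three well-separated groups.
y_toy = [0.1, 0.2, 0.3, 5.1, 5.2, 5.3, 10.1, 10.2, 10.3]
w_toy = [1, 1, 1, 2, 2, 2, 3, 3, 3]
p_toy = randomization_p_value(y_toy, w_toy, f_statistic)
```

Any other function of the outcomes and labels can be passed as `statistic`, which is what makes the choice of test statistic the only design decision in the procedure.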
Below, we state a known result on the approximate randomization distribution of (1), and throughout our discussion we assume the following regularity conditions required by the finite-population central limit theorem for causal inference (Li & Ding, 2017). Condition 1. As $$N\rightarrow \infty$$, for all $$j$$, $$\:N_j/N$$ has a positive limit, $$\bar{Y}_\cdot (j)$$ has a finite limit, $$S_\cdot^2(j)$$ has a finite and positive limit, and $$N^{-1}\max_{1\leq i\leq N} | Y_i(j) - \bar{Y}_{\cdot}(j) |^2 \rightarrow 0$$. Theorem 1. Assume $$H_{0{\rm{F}}}$$. Over repeated sampling of $$(W_1,\ldots, W_N)$$, the expectations of the treatment and residual sums of squares are $$E({\small{\rm{SS}}}_{\rm{T}}) = (J-1) s^2_{\rm{obs}}$$ and $$E({\small{\rm{SS}}}_{\rm{R}}) = (N-J) s^2_{{\rm{obs}}}$$, and as $$N\rightarrow \infty$$, the asymptotic distribution of (1) is  $$F\overset{.}{\sim} \frac{ \chi^2_{J-1}/ (J-1) }{ \{ (N-1) - \chi^2_{J-1} \}/(N-J) } \overset{.}{\sim} \chi^2_{J-1}/(J-1) \overset{.}{\sim} F_{J-1, N-J}\text{.}$$ Remark 1. In Theorem 1 and the following discussion, we use the notation $$A_N \overset{.}{\sim} B_N$$ to represent two sequences of random variables $$\{A_N\}_{N=1}^\infty$$ and $$\{B_N\}_{N=1}^\infty$$ that have the same asymptotic distribution as $$N \rightarrow \infty$$. The original $$F$$ approximation for randomization inference for a finite population was derived by cumbersome moment matching between the statistic (1) and the corresponding $$F_{J-1, N-J}$$ distribution (Welch, 1937; Pitman, 1938; Kempthorne, 1952). In the Supplementary Material, we give a simpler proof based on the finite-population central limit theorem, similar to Silvey (1954). Remark 2. 
Under $$H_{0{\rm{F}}}$$, the total sum of squares is fixed, but its components $${\small{\rm{SS}}}_{\rm{T}}$$ and $${\small{\rm{SS}}}_{\rm{R}}$$ are random through the treatment assignment $$(W_1,\ldots,W_N)$$, and their expectations are calculated with respect to the distribution of the treatment assignment. Also, the ratio of the expectations of the numerator $${\small{\rm{MS}}}_{\rm{T}}$$ and the denominator $${\small{\rm {MS}}}_{\rm{R}}$$ of (1) is $$1$$ under $$H_{0{\rm{F}}}$$. 4. Sampling properties of the $$F$$ statistic under Neyman’s null hypothesis In § 3 we discussed the randomization distribution, i.e., the sampling distribution under $$H_{0{\rm{F}}}$$, of the $$F$$ statistic in (1). However, the sampling distribution of the $$F$$ statistic under Neyman’s null hypothesis of no average treatment effect (Neyman, 1923, 1935),   \begin{equation*} H_{0{\rm{N}}}: \bar{Y}_{\cdot}(1) = \cdots = \bar{Y}_{\cdot}(J), \end{equation*} is often of interest but has received limited attention (Imbens & Rubin, 2015). This hypothesis imposes weaker restrictions on the potential outcomes than $$H_{0{\rm{F}}}$$, making it impossible to compute the corresponding exact, or even approximate, distribution of $$F$$. However, analytical expressions for $$E({\small{\rm{SS}}}_{\rm{T}})$$ and $$E({\small{\rm{SS}}}_{\rm{R}})$$ can be derived under $$H_{0{\rm{N}}}$$ along the lines of Theorem 1, and can be used to gain insights into the consequences of testing $$H_{0{\rm{N}}}$$ using the Fisher randomization test with $$F$$. Let $$\bar{Y}_{\cdot}(\cdot) = \sum_{j=1}^J p_j \bar{Y}_{\cdot}(j)$$ and $$S^2 = \sum_{j=1}^J p_j S_{\cdot}^2(j)$$ be the weighted averages of the finite-population means and variances. The sampling distribution of $$F$$ depends crucially on the finite-population variance of the unit-level treatment effects,   $$S_{\tau}^2(j, j') = (N-1)^{-1} \sum_{i=1}^N \{ \tau_i(j, j') - \tau(j, j') \}^2 \text{.}$$ Definition 1. 
The potential outcomes $$\{ Y_i(j): i=1, \ldots, N; \: j = 1, \ldots, J \}$$ have strictly additive treatment effects if for all $$j \ne j'$$ the unit-level treatment effects $$\tau_i(j, j')$$ are the same for $$i=1, \ldots, N$$ or, equivalently, if $$S_{\tau}^2(j, j') = 0$$ for all $$j \ne j'$$.  Kempthorne (1955) obtained the following result for balanced designs with $$p_j=1/J$$ under the assumption of strict additivity:   $$E({\small{\rm{SS}}}_{\rm{R}}) = (N-J )S^2,\quad E({\small{\rm{SS}}}_{\rm{T}}) = \frac{N}{J } \sum_{j=1}^J \{ \bar{Y}_{\cdot}(j) - \bar{Y}_{\cdot}(\cdot) \}^2 + (J-1) S^2\text{.} \label{eq::kempthorne}$$ (2) This result implies that with balanced treatment assignments and strict additivity, $$E({\small{\rm{MS}}}_{\rm{R}}{-}{\small{\rm{MS}}}_{\rm{T}}){=}0$$ under $$H_{0{\rm{N}}}$$, and it provides a heuristic justification for testing $$H_{0{\rm{N}}}$$ using the Fisher randomization test with the $$F$$ statistic. However, strict additivity combined with $$H_{0{\rm{N}}}$$ implies $$H_{0{\rm{F}}}$$, for which this result is already known by Theorem 1. We will now derive results that do not require strict additivity, and thus are more general than those in Kempthorne (1955). For this purpose, we introduce a measure of deviation from additivity. Let   \begin{equation*} \Delta = \mathop{\sum\sum}_{j < j'} p_j p_{j'} S_{\tau}^2(j, j') \end{equation*} be a weighted average of the variances of unit-level treatment effects. By Definition 1, $$\Delta = 0$$ under strict additivity. If strict additivity does not hold, i.e., if there is treatment effect heterogeneity, then $$\Delta \neq 0$$. Thus $$\Delta$$ is a measure of the deviation from additivity and plays a crucial role in the following results on the sampling distribution of the $$F$$ statistic. Theorem 2. 
Over repeated sampling of $$(W_1, \ldots, W_N)$$, the expectation of the residual sum of squares is $$E({\small{\rm{SS}}}_{\rm{R}}) = \sum_{j=1}^J (N_j - 1) S_{\cdot}^2(j) ,$$ and the expectation of the treatment sum of squares is  $$E({\small{\rm{SS}}}_{\rm{T}}) = \sum_{j=1}^J N_j \bigl\{ \bar{Y}_{\cdot}(j) - \bar{Y}_{\cdot}(\cdot) \bigr\}^2+ \sum_{j=1}^J (1-p_j) S_{\cdot}^2(j) - \Delta,$$ which reduces to $$E({\small{\rm{SS}}}_{\rm{T}}) = \sum_{j=1}^J (1-p_j) S_{\cdot}^2(j) - \Delta$$ under $$H_{0{\rm{N}}}$$.  Corollary 1. Under $$H_{0{\rm{N}}}$$ with strict additivity in Definition 1 or, equivalently, under $$H_{0{\rm{F}}}$$, the results in Theorem 2 reduce to $$E({\small{\rm{SS}}}_{\rm{R}}) = (N-J) S^2$$ and $$E({\small{\rm{SS}}}_{\rm{T}}) = (J-1) S^2,$$ which coincide with the statements in Theorem 1.  Corollary 2. For a balanced design with $$p_j = 1/J$$,  $$E({\small{\rm{SS}}}_{\rm{R}}) = (N-J )S^2,\quad E({\small{\rm{SS}}}_{\rm{T}}) = \frac{N}{J } \sum_{j=1}^J \{ \bar{Y}_{\cdot}(j) - \bar{Y}_{\cdot}(\cdot) \}^2 + (J-1) S^2 - \Delta\text{.}$$ Furthermore, under $$H_{0{\rm{N}}}$$, $$E({\small{\rm{SS}}}_{\rm{R}}) = (N-J )S^2$$ and $$E({\small{\rm{SS}}}_{\rm{T}}) = (J-1) S^2 - \Delta,$$ implying that the difference between the residual mean square and treatment mean square is $$E ( {\small{\rm{MS}}}_{\rm{R}} - {\small{\rm{MS}}}_{\rm{T}} ) =\Delta / (J - 1) \geq 0$$.  The result in (2) is a special case of Corollary 2 for $$\Delta=0$$. Corollary 2 implies that, for balanced designs, if the assumption of strict additivity does not hold, then testing $$H_{0{\rm{N}}}$$ using the Fisher randomization test with the $$F$$ statistic may be conservative, in the sense that it may reject a null hypothesis less often than the nominal level. However, for unbalanced designs, the conclusion is not definite, as can be seen from the following corollary. Corollary 3. 
Under $$H_{0{\rm{N}}}$$, the difference between the residual and treatment mean square is  $$E( {\small{\rm{MS}}}_{\rm{R}} - {\small{\rm{MS}}}_{\rm{T}}) = \frac{ (N-1)J }{ (J-1)(N-J) } \sum_{j=1}^J (p_j - J^{-1})S_{\cdot}^2(j) + \frac{ \Delta} { J-1}\text{.}$$ Corollary 3 shows that the residual mean square may be larger or smaller than that of the treatment, depending on the balance or lack thereof in the experiment and the variances of the potential outcomes. Under $$H_{0{\rm{N}}}$$, when the $$p_j$$ and $$S_{\cdot}^2(j)$$ are positively associated, the Fisher randomization test using $$F$$ tends to be conservative; when the $$p_j$$ and $$S_{\cdot}^2(j)$$ are negatively associated, the Fisher randomization test using $$F$$ may fail to control the Type I error. 5. A test statistic that controls Type I error more precisely than $$F$$ To address the failure of the $$F$$ statistic to control Type I error of the Fisher randomization test under $$H_{0{\rm{N}}}$$ in unbalanced experiments, we propose to use (3) for the Fisher randomization test. Let $$\hat{Q}_j = N_j / s^2_{{\rm{obs}}}(j)$$, and define the weighted average of the sample means as $$\bar{Y}^{\rm{obs}}_w = \sum_{j=1}^J \hat{Q}_j \bar{Y}_{\cdot}^{\rm{obs}}(j) / \sum_{j=1}^J \hat{Q}_j$$. Define   $$X^2 = \sum_{j=1}^J \hat{Q}_j \bigl\{ \bar{Y}_{\cdot}^{\rm{obs}}(j) - \bar{Y}^{\rm{obs}}_w \bigr\}^2\text{.} \label{eq:X^2}$$ (3) This test statistic has been exploited in classical analysis of variance (e.g., James, 1951; Welch, 1951; Johansen, 1980; Rice & Gaines, 1989; Weerahandi, 1995; Krishnamoorthy et al., 2007) based on the normal linear model with heteroskedasticity, and a similar idea called studentization has been adopted in permutation tests (e.g., Neuhaus, 1993; Janssen, 1997, 1999; Janssen & Pauls, 2003; Chung & Romano, 2013; Pauly et al., 2015). 
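As a minimal sketch of how (3) might be computed, the following Python function evaluates $$X^2$$ from observed outcomes and treatment labels using only the standard library. The name `x2_statistic` and the toy data are assumptions introduced for illustration; the sketch also assumes every treatment group has at least two units and a nonzero sample variance, since $$\hat{Q}_j$$ is otherwise undefined.

```python
from statistics import mean, variance


def x2_statistic(y, w):
    """X^2 statistic (3): precision-weighted spread of the group means."""
    groups = {}
    for yi, wi in zip(y, w):
        groups.setdefault(wi, []).append(yi)
    means = [mean(ys) for ys in groups.values()]
    # Q_hat_j = N_j / s^2_obs(j); statistics.variance uses divisor N_j - 1.
    q_hat = [len(ys) / variance(ys) for ys in groups.values()]
    # Weighted average of the group means with weights Q_hat_j.
    y_w = sum(q * m for q, m in zip(q_hat, means)) / sum(q_hat)
    return sum(q * (m - y_w) ** 2 for q, m in zip(q_hat, means))


# Toy data (made up): balanced groups, each with sample variance 1.
y_toy = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
w_toy = [1, 1, 1, 2, 2, 2, 3, 3, 3]
x2_toy = x2_statistic(y_toy, w_toy)
```

In this toy example the three sample variances are equal, so $$X^2$$ equals $${\small{\rm{SS}}}_{\rm{T}}$$ divided by the common variance; the two statistics differ only when the group variances differ.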
Replacing $$F$$ with (3) does not affect the validity of the Fisher randomization test for testing $$H_{0{\rm{F}}}$$, because we always have an exact test for $$H_{0{\rm{F}}}$$ no matter which test statistic we use. We show below that the Fisher randomization test using $$X^2$$ can also control the asymptotic Type I error for testing $$H_{0{\rm{N}}}$$, so the Fisher randomization test using $$X^2$$ can control the Type I error under both $$H_{0{\rm{F}}}$$ and $$H_{0{\rm{N}}}$$ asymptotically, making $$X^2$$ a more attractive choice than the classical $$F$$ statistic for the Fisher randomization test. Theorem 3. Under $$H_{0{\rm{F}}}$$, the asymptotic distribution of $$X^2$$ is $$\chi^2_{J-1}$$ as $$N\rightarrow \infty$$. Under $$H_{0{\rm{N}}}$$, the asymptotic distribution of $$X^2$$ is stochastically dominated by $$\chi^2_{J-1}$$, i.e., for any constant $$a>0$$, $$\:\lim_{N\rightarrow \infty} {\rm{pr}}(X^2 \ge a) \le {\rm{pr}}(\chi^2_{J-1} \ge a)$$.  Remark 3. Under $$H_{0{\rm{F}}}$$, the randomization distribution of $${\small{\rm{SS}}}_{\rm{T}}/s_{\rm{obs}}^2$$ follows $$\chi^2_{J-1}$$ asymptotically, as shown in the Supplementary Material. Under $$H_{0{\rm{N}}}$$, however, the asymptotic distribution of $${\small{\rm{SS}}}_{\rm{T}}/s_{\rm{obs}}^2$$ is not $$\chi^2_{J-1}$$ and, as Corollary 3 suggests, the asymptotic distribution of $$F$$ is not $$F_{J-1,N-J}$$. Fortunately, if we weight each treatment square by the inverse of the sample variance of the outcomes, the resulting $$X^2$$ statistic preserves the asymptotic $$\chi^2_{J-1}$$ randomization distribution under $$H_{0{\rm{F}}}$$ and has an asymptotic distribution that is stochastically dominated by $$\chi^2_{J-1}$$ under $$H_{0{\rm{N}}}$$. Therefore, under $$H_{0{\rm{N}}}$$, the asymptotic Type I error of the Fisher randomization test using $$X^2$$ does not exceed the nominal level. 
Although we can perform the Fisher randomization test by enumerating or simulating from all possible realizations of the treatment assignment, Theorem 3 suggests that an asymptotic rejection rule against $$H_{0{\rm{F}}}$$ or $$H_{0{\rm{N}}}$$ is $$X^2 > x_{1-\alpha}$$, the $$1-\alpha$$ quantile of the $$\chi^2_{J-1}$$ distribution. Because the asymptotic distribution of $$X^2$$ under $$H_{0{\rm{N}}}$$ is stochastically dominated by $$\chi^2_{J-1}$$, its true $$1-\alpha$$ quantile is asymptotically smaller than $$x_{1-\alpha}$$, and the corresponding Fisher randomization test is conservative in the sense of having smaller Type I error than the nominal level asymptotically. Remark 4. The asymptotic conservativeness described above is not particular to our test statistic, but rather a feature of the finite-population inference (Neyman, 1923; Aronow et al., 2014; Imbens & Rubin, 2015). It distinguishes Theorem 3 from previous results on permutation tests (e.g., Chung & Romano, 2013; Pauly et al., 2015), where the conservativeness did not appear and the correlation between the potential outcomes played no role in the theory. The form of (3) suggests its difference from $$F$$ when the potential outcomes have different variances under different treatment levels. Otherwise we show that they are asymptotically equivalent in the following sense. Corollary 4. If $$S_{\cdot}^2(1) = \cdots = S_{\cdot}^2(J)$$, then $$(J-1) F \overset{.}{\sim} X^2$$.  Under strict additivity in Definition 1, the condition $$S_{\cdot}^2(1) = \cdots = S_{\cdot}^2(J)$$ holds, and the equivalence between $$(J-1) F$$ and $$X^2$$ guarantees that the Fisher randomization tests using $$F$$ and $$X^2$$ have the same asymptotic Type I error and power. However, Corollary 4 is a large-sample result; we evaluate it in finite samples in the Supplementary Material. 6. 
Simulation 6.1 Type I error of the Fisher randomization test using $$F$$ In this subsection, we use simulation to evaluate the finite-sample performance of the Fisher randomization test using $$F$$ under $$H_{0{\rm{N}}}$$. We consider the following three cases, where $$\mathcal{N}(\mu, \sigma^2)$$ denotes a normal distribution with mean $$\mu$$ and variance $$\sigma^2$$. We choose significance level $$0{\cdot}05$$ for all tests. Case 1. For balanced experiments with sample sizes $$N=45$$ and $$N=120$$, we generate potential outcomes under two settings: (1A) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2)\sim \mathcal{N}(0, 1{\cdot}2^2)$$ and $$Y_i(3)\sim \mathcal{N}(0, 1{\cdot}5^2)$$; (1B) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2)\sim \mathcal{N}(0, 2^2)$$ and $$Y_i(3)\sim \mathcal{N}(0, 3^2)$$. These potential outcomes are independently generated and standardized to have zero mean. Case 2. For unbalanced experiments with sample sizes $$(N_1, N_2, N_3) = (10,20,30)$$ and $$(N_1, N_2, N_3) = (20,30,50)$$, we generate potential outcomes under two settings: (2A) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2) = 2Y_i(1)$$ and $$Y_i(3) = 3Y_i(1)$$; (2B) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2) = 3Y_i(1)$$ and $$Y_i(3) = 5Y_i(1)$$. These potential outcomes are standardized to have zero mean. In this case, $$p_1<p_2<p_3$$ and $$S^2_{\cdot}(1) < S^2_{\cdot}(2) < S^2_{\cdot}(3)$$. Case 3. For unbalanced experiments with sample sizes $$(N_1, N_2, N_3) = (30,20,10)$$ and $$(N_1, N_2, N_3) = (50,30,20)$$, we generate potential outcomes under two settings: (3A) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2) = 2Y_i(1)$$ and $$Y_i(3) = 3Y_i(1)$$; (3B) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2) = 3Y_i(1)$$ and $$Y_i(3) = 5Y_i(1)$$. These potential outcomes are standardized to have zero mean. In this case, $$p_1>p_2>p_3$$ and $$S^2_{\cdot}(1) <S^2_{\cdot}(2) < S^2_{\cdot}(3)$$. Once generated, the potential outcomes are treated as fixed constants. 
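To connect these designs with Corollary 3, the sketch below generates potential outcomes in the manner of setting (3A) and evaluates the expected gap $$E({\small{\rm{MS}}}_{\rm{R}} - {\small{\rm{MS}}}_{\rm{T}})$$ directly from the finite population. It is an illustration under stated assumptions rather than the simulation code behind the reported results: the function name `expected_ms_gap`, the random seed, and the use of Python's standard library generator are all choices made here.

```python
import random
from statistics import mean, variance


def expected_ms_gap(potential, sizes):
    """E(MS_R - MS_T) under Neyman's null, following Corollary 3.

    potential[j] holds Y_i(j) for all N units; sizes holds (N_1, ..., N_J).
    """
    n, j = sum(sizes), len(sizes)
    p = [nj / n for nj in sizes]
    s2 = [variance(col) for col in potential]  # S^2(j), divisor N - 1
    # Delta: weighted sum of the variances of unit-level treatment effects.
    delta = 0.0
    for a in range(j):
        for b in range(a + 1, j):
            tau = [ya - yb for ya, yb in zip(potential[a], potential[b])]
            delta += p[a] * p[b] * variance(tau)
    imbalance = sum((pj - 1 / j) * s for pj, s in zip(p, s2))
    return (n - 1) * j / ((j - 1) * (n - j)) * imbalance + delta / (j - 1)


rng = random.Random(1)  # arbitrary seed, chosen for reproducibility
sizes = [30, 20, 10]    # case 3: larger groups get the smaller variances
y1 = [rng.gauss(0, 1) for _ in range(sum(sizes))]
centre = mean(y1)
y1 = [v - centre for v in y1]  # centring makes Neyman's null hold exactly
potential = [y1, [2 * v for v in y1], [3 * v for v in y1]]
gap = expected_ms_gap(potential, sizes)
```

With group sizes decreasing in the variances, `gap` is negative, so the treatment mean square exceeds the residual mean square in expectation, which is consistent with an inflated rejection rate for $$F$$. Reversing the sizes to `[10, 20, 30]`, as in case 2, makes the gap positive.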
Over $$2000$$ simulated randomizations, we calculate the observed outcomes and then perform the Fisher randomization test using $$F$$ to approximate the $$p$$-values by $$2000$$ draws of the treatment assignment. The histograms of the $$p$$-values are shown in Figs. 1(a)–(c) corresponding to cases 1–3 above. In the next few paragraphs we report the rejection rates associated with these cases along with their standard errors. Fig. 1. Histograms of the $$p$$-values under $$H_{0{\rm{N}}}$$ based on the Fisher randomization tests using $$F$$: (a) balanced experiments, case 1; (b) unbalanced experiments, case 2; (c) unbalanced experiments, case 3. Grey and white histograms correspond to the subcases A and B, respectively. In Fig. 1(a), the Fisher randomization test using $$F$$ is conservative with $$p$$-values distributed towards $$1$$. With greater heterogeneity in the potential outcomes, the histograms of the $$p$$-values have larger masses near $$1$$. For case (1A) the rejection rates are $$0{\cdot}010$$ and $$0{\cdot}018$$, and for case (1B) the rejection rates are $$0{\cdot}023$$ and $$0{\cdot}016$$, for sample sizes $$N=45$$ and $$N=120$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}003$$. In Fig. 1(b), the sample sizes under each treatment level are increasing in the variances of the potential outcomes. The Fisher randomization test using $$F$$ is conservative with $$p$$-values distributed towards $$1$$. Similar to Fig. 1(a), when there is greater heterogeneity in the potential outcomes, the $$p$$-values have larger masses near $$1$$. 
For case (2A) the rejection rates are $$0{\cdot}016$$ and $$0{\cdot}014$$, and for case (2B) the rejection rates are $$0{\cdot}015$$ and $$0{\cdot}011$$, for sample sizes $$N=60$$ and $$N=100$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}003$$. In Fig. 1(c), the sample sizes under different treatment levels are decreasing in the variances of the potential outcomes. For case (3A) the rejection rates are $$0{\cdot}133$$ and $$0{\cdot}126$$, and for case (3B) the rejection rates are $$0{\cdot}189$$ and $$0{\cdot}146$$, for sample sizes $$N=60$$ and $$N=100$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}009$$. The Fisher randomization test using $$F$$ does not preserve correct Type I error, with $$p$$-values distributed towards $$0$$. With greater heterogeneity in the potential outcomes, the $$p$$-values have larger masses near $$0$$. These empirical findings agree with our theory in § 4; that is, if the sample sizes under different treatment levels are decreasing in the sample variances of the observed outcomes, then the Fisher randomization test using $$F$$ may not yield correct Type I error under $$H_{0{\rm{N}}}$$. 6.2. Type I error of the Fisher randomization test using $$X^2$$ Figure 2 shows the same simulation as in Fig. 1, but with test statistic $$X^2$$. Fig. 2. Histograms of the $$p$$-values under $$H_{0{\rm{N}}}$$ based on the Fisher randomization tests using $$X^2$$: (a) balanced experiments, case 1; (b) unbalanced experiments, case 2; (c) unbalanced experiments, case 3. Grey and white histograms correspond to the subcases A and B, respectively. 
Figure 2(a) is similar to Fig. 1(a). For case (1A) the rejection rates are $$0{\cdot}016$$ and $$0{\cdot}012$$, and for case (1B) the rejection rates are $$0{\cdot}014$$ and $$0{\cdot}010$$, for sample sizes $$N=45$$ and $$N=120$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}003$$. Figure 2(b) shows better performance of the Fisher randomization test using $$X^2$$ than in Fig. 1(b), with $$p$$-values closer to uniform. For case (2A) the rejection rates are $$0{\cdot}032$$ and $$0{\cdot}038$$, and for case (2B) the rejection rates are $$0{\cdot}026$$ and $$0{\cdot}030$$, for sample sizes $$N=60$$ and $$N=100$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}004$$. Figure 2(c) shows much better performance of the Fisher randomization test using $$X^2$$ than in Fig. 1(c), because the $$p$$-values are much closer to uniform. For case (3A) the rejection rates are $$0{\cdot}052$$ and $$0{\cdot}042$$, and for case (3B) the rejection rates are $$0{\cdot}048$$ and $$0{\cdot}040$$, for sample sizes $$N=60$$ and $$N=100$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}005$$. This agrees with our theory that the Fisher randomization test using $$X^2$$ can control the asymptotic Type I error under $$H_{0{\rm{N}}}$$. 6.3. Power comparison of the Fisher randomization tests using $$F$$ and $$X^2$$ In this subsection, we compare the powers of the Fisher randomization tests using $$F$$ and $$X^2$$ under alternative hypotheses. We consider the following cases. Case 4. For balanced experiments with sample sizes $$N=30$$ and $$N=45$$, we generate potential outcomes from $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2)\sim \mathcal{N}(0, 2^2)$$ and $$Y_i(3)\sim \mathcal{N}(0, 3^2)$$. These potential outcomes are independently generated and transformed to have means $$(0,1,2)$$. Case 5. 
For unbalanced experiments with sample sizes $$(N_1, N_2, N_3) = (10,20,30)$$ and $$(N_1, N_2, N_3) = (20,30,50)$$, we first generate $$Y_i(1)\sim \mathcal{N}(0,1)$$ and standardize them to have mean zero; we then generate $$Y_i(2) = 3Y_i(1)+1$$ and $$Y_i(3) = 5Y_i(1) + 2$$. In this case, $$p_1<p_2<p_3$$ and $$S^2_{\cdot}(1) < S^2_{\cdot}(2) < S^2_{\cdot}(3)$$. Case 6. For unbalanced experiments with sample sizes $$(N_1, N_2, N_3) = (30,20,10)$$ and $$(N_1, N_2, N_3) = (50,30,20)$$, we generate potential outcomes in the same way as in case 5 above. In this case, $$p_1>p_2>p_3$$ and $$S^2_{\cdot}(1) <S^2_{\cdot}(2) < S^2_{\cdot}(3)$$. Over $$2000$$ simulated datasets, we perform the Fisher randomization test using $$F$$ and $$X^2$$ and obtain the $$p$$-values by $$2000$$ draws of the treatment assignment. The histograms of the $$p$$-values, shown in Figs. 3(a)–(c), correspond to cases 4–6 above. The Monte Carlo standard errors for the rejection rates are all close to but no greater than $$0{\cdot}011$$. Fig. 3. Histograms of the $$p$$-values under alternative hypotheses based on the Fisher randomization tests using $$F$$ and $$X^2$$: (a) balanced experiments, case 4; (b) unbalanced experiments, case 5; (c) unbalanced experiments, case 6. Grey histograms correspond to $$X^2$$ and white histograms to $$F$$. For case 4, the rejection rates using $$X^2$$ and $$F$$ are respectively $$0{\cdot}290$$ and $$0{\cdot}376$$ with sample size $$N=30$$, and $$0{\cdot}576$$ and $$0{\cdot}692$$ with sample size $$N=45$$. 
For case 5, the powers using $$X^2$$ and $$F$$ are respectively $$0{\cdot}178$$ and $$0{\cdot}634$$ with sample size $$N=60$$, and $$0{\cdot}288$$ and $$0{\cdot}794$$ with sample size $$N=100$$. Therefore, when the experiments are balanced or when the sample sizes are positively associated with the variances of the potential outcomes, the Fisher randomization test using $$F$$ has higher power than that using $$X^2$$. For case 6, the rejection rates using $$X^2$$ and $$F$$ are respectively $$0{\cdot}494$$ and $$0{\cdot}355$$ with sample size $$N=60$$, and $$0{\cdot}642$$ and $$0{\cdot}576$$ with sample size $$N=100$$. Therefore, when the sample sizes are negatively associated with the variances of the potential outcomes, the Fisher randomization test using $$F$$ has lower power than that using $$X^2$$. 6.4 Simulation studies under other distributions and applications In the Supplementary Material, we give more numerical examples. First, we conduct simulation studies that parallel those in §§ 6.1–6.3 but have outcomes generated from exponential distributions. The conclusions are nearly identical to those in §§ 6.1–6.3, because the finite-population central limit theorem holds under mild moment conditions without distributional assumptions. Second, we use two numerical examples to illustrate the conservativeness issue in Theorem 3. Third, we compare the different behaviours of the Fisher randomization tests using $$F$$ and $$X^2$$ in two real-life examples. 7. Discussion As shown in the proofs of Theorems 1 and 3 in the Supplementary Material, we need to analyse the eigenvalues of the covariance matrix of $$\{ \bar{Y}_{\cdot}^{\rm{obs}}(1), \ldots, \bar{Y}_{\cdot}^{\rm{obs}}(J) \}$$ to obtain the properties of $$F$$ and $$X^2$$ for general $$J> 2$$. Moreover, by considering the case of $$J=2$$ we can gain more insight and make connections with existing literature. 
For $$j\neq j'$$, an unbiased estimator for $$\tau(j,j')$$ is $$\hat{\tau}(j,j') = \bar{Y}_{\cdot}^{\rm{obs}}(j) - \bar{Y}_{\cdot}^{\rm{obs}}(j')$$, which has sampling variance $${\rm {var}}\{\hat{\tau}(j,j') \} = S_{\cdot}^2(j) / N_j + S_{\cdot}^2(j') / N_{j'} - S_{\tau}^2(j, j') /N$$ and a conservative variance estimator $$s_{\rm{obs}}^2(j)/N_j + s_{\rm{obs}}^2(j') / N_{j'}$$ (Neyman, 1923).

Corollary 5. When $$J=2$$, the $$F$$ and $$X^2$$ statistics reduce to

$$F\approx \frac{ \hat{\tau}^2(1,2) }{ s_{\rm{obs}}^2(1)/N_2 + s_{\rm{obs}}^2(2) / N_1 },\quad X^2 = \frac{ \hat{\tau}^2(1,2) }{ s_{\rm{obs}}^2(1)/N_1 + s_{\rm{obs}}^2(2) / N_2 },$$

where the approximation for $$F$$ is due to ignoring the difference between $$N$$ and $$N-2$$ and the difference between $$N_j$$ and $$N_j-1$$ $$(j=1,2)$$. Under $$H_{0{\rm{F}}}$$, $$\: F\overset{.}{\sim} \chi^2_1$$ and $$X^2\overset{.}{\sim} \chi^2_1$$. Under $$H_{0{\rm{N}}}$$, $$\:F\overset{.}{\sim} C_1 \chi^2_1$$ and $$X^2\overset{.}{\sim} C_2 \chi^2_1$$, where

$$C_1 = \lim_{N\rightarrow + \infty} \frac{ {\rm {var}}\{\hat{\tau}(1,2) \} } { S_{\cdot}^2(1)/N_2 + S_{\cdot}^2(2) / N_1 }, \quad C_2 = \lim_{N\rightarrow + \infty} \frac{ {\rm {var}}\{\hat{\tau}(1,2) \} } { S_{\cdot}^2(1)/N_1 + S_{\cdot}^2(2) / N_2 } \leq 1\text{.}$$ (4)

Depending on the sample sizes and the finite-population variances, $$C_1$$ can be either larger or smaller than $$1$$. Consequently, using $$F$$ in the Fisher randomization test can be conservative or anticonservative for testing $$H_{0{\rm{N}}}$$. In contrast, $$C_2$$ is always no larger than $$1$$, and therefore using $$X^2$$ in the Fisher randomization test is conservative for testing $$H_{0{\rm{N}}}$$. Neyman (1923) proposed using the square root of $$X^2$$ to test $$H_{0{\rm{N}}}$$ based on a normal approximation, which is asymptotically equivalent to the Fisher randomization test using $$X^2$$. Both are conservative unless the unit-level treatment effects are constant.
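The two-arm reductions in Corollary 5 can be checked numerically. The sketch below is illustrative only: the function names `f_statistic` and `x2_statistic`, the simulated data, and the seed are assumptions introduced here, not part of the paper. It computes $$F$$ from (1) and $$X^2$$ from (3) for a two-arm dataset and compares them with the closed forms; the $$X^2$$ expression should match exactly, whereas the $$F$$ expression is only approximate because it replaces $$N-2$$ by $$N$$ and $$N_j-1$$ by $$N_j$$.

```python
import numpy as np

def f_statistic(y, w):
    """F statistic of (1): treatment mean square over residual mean square."""
    levels = np.unique(w)
    n, n_levels = y.size, levels.size
    ss_t = sum((w == j).sum() * (y[w == j].mean() - y.mean()) ** 2 for j in levels)
    ss_r = sum(((y[w == j] - y[w == j].mean()) ** 2).sum() for j in levels)
    return (ss_t / (n_levels - 1)) / (ss_r / (n - n_levels))

def x2_statistic(y, w):
    """X^2 statistic of (3), with weights Q_j = N_j / s_obs^2(j)."""
    levels = np.unique(w)
    means = np.array([y[w == j].mean() for j in levels])
    q = np.array([(w == j).sum() / y[w == j].var(ddof=1) for j in levels])
    y_w = (q * means).sum() / q.sum()  # precision-weighted grand mean
    return (q * (means - y_w) ** 2).sum()

# Hypothetical two-arm data (an assumption for illustration):
# the larger arm has the smaller variance.
rng = np.random.default_rng(0)
n1, n2 = 30, 10
n = n1 + n2
w = np.repeat([1, 2], [n1, n2])
y = np.concatenate([rng.normal(0.0, 1.0, n1), rng.normal(0.5, 3.0, n2)])

tau_hat = y[w == 1].mean() - y[w == 2].mean()
s1, s2 = y[w == 1].var(ddof=1), y[w == 2].var(ddof=1)

# Closed forms for J = 2.
x2_closed = tau_hat ** 2 / (s1 / n1 + s2 / n2)
ms_r = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n - 2)
f_exact = tau_hat ** 2 / (n * ms_r / (n1 * n2))   # exact rewriting of (1)
f_approx = tau_hat ** 2 / (s1 / n2 + s2 / n1)     # approximation in Corollary 5

print("X^2:", x2_statistic(y, w), "closed form:", x2_closed)
print("F:", f_statistic(y, w), "exact J=2 form:", f_exact, "approximation:", f_approx)
```

Note how the denominators differ: $$X^2$$ pairs each sample variance with its own arm size, while $$F$$ pairs it with the other arm's size, which is why unbalanced designs with unequal variances separate the two statistics.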
In practice, for treatment-control experiments, the difference-in-means statistic $$\hat{\tau}(1,2)$$ has been widely used in the Fisher randomization test (Imbens & Rubin, 2015); however, it can yield either conservative or anticonservative tests for $$H_{0{\rm{N}}}$$, as shown by Gail et al. (1996), Lin et al. (2017) and Ding (2017) using numerical examples. We formally state this result below, recognizing the equivalence between $$\hat{\tau}(1,2)$$ and $$F$$ in a two-sided test.

Corollary 6. When $$J=2$$, the two-sided Fisher randomization test using $$\hat{\tau}(1,2)$$ is equivalent to using

$$T^2 = \frac{ \hat{\tau}^2(1,2) }{ N s_{\rm{obs}}^2/(N_1N_2) } \approx \frac{\hat{\tau}^2(1,2) }{ s_{\rm{obs}}^2(1)/N_2 + s_{\rm{obs}}^2(2) / N_1 + \hat{\tau}^2(1,2)/N },$$

where the approximation is due to ignoring the difference between $$(N-1,N_1-1,N_2-1)$$ and $$(N,N_1,N_2)$$. Under $$H_{0{\rm{F}}}$$, $$\:T^2\overset{.}{\sim} F\overset{.}{\sim} \chi^2_1$$, and under $$H_{0{\rm{N}}}$$, $$\:T^2\overset{.}{\sim} F \overset{.}{\sim} C_1\chi^2_1$$ with $$C_1$$ defined in (4).

Remark 5. Analogously, under the superpopulation model, Romano (1990) showed that the Fisher randomization test using $$\hat{\tau}(1,2)$$ can be conservative or anticonservative for testing the hypothesis of equal means of two samples. Janssen (1997, 1999) and Chung & Romano (2013) suggested using the studentized statistic, or equivalently $$X^2$$, to remedy the problem of possibly inflated Type I error; the resulting test is asymptotically exact under the superpopulation model.

After rejecting either $$H_{0{\rm{F}}}$$ or $$H_{0{\rm{N}}}$$, it is often of interest to test pairwise hypotheses; that is, for $$j\neq j'$$, $$\:Y_i(j) = Y_i(j')$$ for all $$i$$, or $$\bar{Y}_{\cdot}(j) = \bar{Y}_{\cdot}(j')$$.
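A pairwise test of this kind might be sketched as a Fisher randomization test with the studentized two-arm statistic from Corollary 5. The code below is a minimal illustration under stated assumptions rather than the authors' implementation: the names `studentized_stat` and `pairwise_frt`, the choice to condition on the units observed in arms $$j$$ and $$j'$$ and re-randomize labels only among them, the Monte Carlo size, and the simulated data are all introduced here for illustration.

```python
import numpy as np

def studentized_stat(y, z):
    """tau_hat^2 / (s_a^2 / N_a + s_b^2 / N_b) for a boolean arm indicator z.

    Assumes both arms have at least two units and nonzero sample variance.
    """
    ya, yb = y[z], y[~z]
    tau_hat = ya.mean() - yb.mean()
    return tau_hat ** 2 / (ya.var(ddof=1) / ya.size + yb.var(ddof=1) / yb.size)

def pairwise_frt(y, w, j, j_prime, n_draws=2000, seed=0):
    """Monte Carlo randomization p-value for comparing arms j and j_prime.

    Assumption: only units observed in the two arms are used, and their
    labels are re-randomized with the two arm sizes held fixed.
    """
    rng = np.random.default_rng(seed)
    keep = (w == j) | (w == j_prime)
    y_pair = y[keep]
    z = w[keep] == j
    observed = studentized_stat(y_pair, z)
    draws = np.empty(n_draws)
    for b in range(n_draws):
        # A random permutation of z is one draw of the two-arm assignment.
        draws[b] = studentized_stat(y_pair, rng.permutation(z))
    return observed, float(np.mean(draws >= observed))

# Hypothetical three-arm data with unequal arm sizes and variances.
rng = np.random.default_rng(1)
w = np.repeat([1, 2, 3], [30, 20, 10])
y = np.concatenate([rng.normal(0, 1, 30), rng.normal(0, 2, 20), rng.normal(1, 3, 10)])

stat, p_value = pairwise_frt(y, w, 1, 3)
print("statistic:", stat, "randomization p-value:", p_value)
```

The statistic is symmetric in the two arms, so `pairwise_frt(y, w, 1, 3)` and `pairwise_frt(y, w, 3, 1)` give the same observed value; any adjustment for testing several pairs at once is outside this sketch.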
According to Corollaries 5 and 6, we recommend using the Fisher randomization test with test statistic $$\hat{\tau}^2(j,j') /\{ s_{\rm{obs}}^2(j)/N_j + s_{\rm{obs}}^2(j') / N_{j'} \}$$, which will yield a conservative Type I error even if the experiment is unbalanced and the variances of the potential outcomes vary across treatment groups.

The analogy between our finite-population theory and the superpopulation theory of Chung & Romano (2013) suggests that similar results may also hold for layouts of higher order and other test statistics (Pauly et al., 2015; Chung & Romano, 2016a,b; Friedrich et al., 2017). In more complex experimental designs, multiple effects are often of interest simultaneously, giving rise to the problem of multiple testing (Chung & Romano, 2016b). We leave these questions to future work.

## Acknowledgement

Ding was partially funded by the Institute of Education Sciences and the National Science Foundation, U.S.A. Dasgupta was partially funded by the National Science Foundation, U.S.A. The authors thank Xinran Li, Zhichao Jiang, Lo-Hua Yuan and Robin Gong for comments on earlier versions of the paper. We thank a reviewer and the associate editor for helpful comments.

## Supplementary material

Supplementary material available at Biometrika online includes proofs and more examples.

## References

Aronow P. M., Green D. P. & Lee D. K. (2014). Sharp bounds on the variance in randomized experiments. Ann. Statist. 42, 850–71.

Chung E. & Romano J. P. (2013). Exact and asymptotically robust permutation tests. Ann. Statist. 41, 484–507.

Chung E. & Romano J. P. (2016a). Asymptotically valid and exact permutation tests based on two-sample U-statistics. J. Statist. Plan. Infer. 168, 97–105.

Chung E. & Romano J. P. (2016b). Multivariate and multiple permutation tests. J. Economet. 193, 76–91.

Dasgupta T., Pillai N. S. & Rubin D. B. (2015). Causal inference from $$2^{K}$$ factorial designs using the potential outcomes model. J. R. Statist. Soc. B 77, 727–53.

Ding P. (2017). A paradox from randomization-based causal inference (with Discussion). Statist. Sci. 32, 331–45.

Ding P., Feller A. & Miratrix L. (2016). Randomization inference for treatment effect variation. J. R. Statist. Soc. B 78, 655–71.

Fisher R. A. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.

Fisher R. A. (1935). The Design of Experiments. Edinburgh: Oliver & Boyd.

Friedrich S., Brunner E. & Pauly M. (2017). Permuting longitudinal data in spite of the dependencies. J. Mult. Anal. 153, 255–65.

Gail M. H., Mark S. D., Carroll R. J., Green S. B. & Pee D. (1996). On design considerations and randomization-based inference for community intervention trials. Statist. Med. 15, 1069–92.

Imbens G. W. & Rubin D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. New York: Cambridge University Press.

James G. (1951). The comparison of several groups of observations when the ratios of the population variances are unknown. Biometrika 38, 324–9.

Janssen A. (1997). Studentized permutation tests for non-i.i.d. hypotheses and the generalized Behrens-Fisher problem. Statist. Prob. Lett. 36, 9–21.

Janssen A. (1999). Testing nonparametric statistical functionals with applications to rank tests. J. Statist. Plan. Infer. 81, 71–93.

Janssen A. & Pauls T. (2003). How do bootstrap and permutation tests work? Ann. Statist. 31, 768–806.

Johansen S. (1980). The Welch–James approximation to the distribution of the residual sum of squares in a weighted linear regression. Biometrika 67, 85–92.

Kempthorne O. (1952). The Design and Analysis of Experiments. London: Chapman & Hall.

Kempthorne O. (1955). The randomization theory of experimental inference. J. Am. Statist. Assoc. 50, 946–67.

Krishnamoorthy K., Lu F. & Mathew T. (2007). A parametric bootstrap approach for ANOVA with unequal variances: Fixed and random models. Comp. Statist. Data Anal. 51, 5731–42.

Li X. & Ding P. (2017). General forms of finite population central limit theorems with applications to causal inference. J. Am. Statist. Assoc. 112, 1759–69.

Lin W., Halpern S. D., Prasad K. M. & Small D. S. (2017). A “placement of death” approach for studies of treatment effects on ICU length of stay. Statist. Meth. Med. Res. 26, 292–311.

Neuhaus G. (1993). Conditional rank tests for the two-sample problem under random censorship. Ann. Statist. 21, 1760–79.

Neyman J. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statist. Sci. 5, 465–72.

Neyman J. (1935). Statistical problems in agricultural experimentation (with Discussion). Suppl. J. R. Statist. Soc. 2, 107–80.

Pauly M., Brunner E. & Konietschke F. (2015). Asymptotic permutation tests in general factorial designs. J. R. Statist. Soc. B 77, 461–73.

Pitman E. J. (1938). Significance tests which may be applied to samples from any populations: III. The analysis of variance test. Biometrika 29, 322–35.

Rice W. R. & Gaines S. D. (1989). One-way analysis of variance with unequal variances. Proc. Nat. Acad. Sci. 86, 8183–4.

Romano J. P. (1990). On the behavior of randomization tests without a group invariance assumption. J. Am. Statist. Assoc. 85, 686–92.

Rosenbaum P. R. (2010). Design of Observational Studies. New York: Springer.

Rubin D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66, 688–701.

Rubin D. B. (1980). Comment on “Randomization analysis of experimental data: The Fisher randomization test” by D. Basu. J. Am. Statist. Assoc. 75, 591–3.

Scheffe H. (1959). The Analysis of Variance. New York: John Wiley & Sons.

Silvey S. D. (1954). The asymptotic distributions of statistics arising in certain non-parametric tests. Glasgow Math. J. 2, 47–51.

Weerahandi S. (1995). ANOVA under unequal error variances. Biometrics 51, 589–99.

Welch B. (1937). On the $$z$$-test in randomized blocks and Latin squares. Biometrika 29, 21–52.

Welch B. (1951). On the comparison of several mean values: An alternative approach. Biometrika 38, 330–6.

© 2017 Biometrika Trust

Biometrika, Volume 105 (1) – Mar 1, 2018
12 pages

Publisher
Oxford University Press
ISSN
0006-3444
eISSN
1464-3510
D.O.I.
10.1093/biomet/asx059

### Abstract

Summary Fisher randomization tests for Neyman’s null hypothesis of no average treatment effect are considered in a finite-population setting associated with completely randomized experiments involving more than two treatments. The consequences of using the $$F$$ statistic to conduct such a test are examined, and we argue that under treatment effect heterogeneity, use of the $$F$$ statistic in the Fisher randomization test can severely inflate the Type I error under Neyman’s null hypothesis. We propose to use an alternative test statistic, derive its asymptotic distributions under Fisher’s and Neyman’s null hypotheses, and demonstrate its advantages through simulations. 1. Introduction One-way analysis of variance (Fisher, 1925; Scheffe, 1959) is perhaps the most commonly used tool to analyse completely randomized experiments with more than two treatments. The standard $$F$$ test for testing equality of mean treatment effects can be justified either by assuming a linear additive superpopulation model with identically and independently distributed normal error terms, or by using the asymptotic randomization distribution of the $$F$$ statistic. Units in real-life experiments are rarely random samples from a superpopulation, making a finite-population randomization-based perspective on inference important (e.g., Rosenbaum, 2010; Dasgupta et al., 2015; Imbens & Rubin, 2015). Fisher randomization tests are useful tools for such inference, because they pertain to a finite population of units and assess the statistical significance of treatment effects without any assumptions about the underlying outcome distribution. In causal inference from a finite population, two hypotheses are of interest: Fisher’s sharp null hypothesis of no treatment effect on any experimental unit (Fisher, 1935; Rubin, 1980), and Neyman’s null hypothesis of no average treatment effect (Neyman, 1923, 1935). 
These hypotheses are equivalent when there is no treatment effect heterogeneity (Ding et al., 2016) or, equivalently, under the assumption of strict additivity of treatment effects, i.e., the same treatment effect for each unit (Kempthorne, 1952). In the context of a multi-treatment completely randomized experiment, Neyman’s null hypothesis allows for treatment effect heterogeneity, which is weaker than Fisher’s null hypothesis and is sometimes of greater interest. We find that the Fisher randomization test using the $$F$$ statistic can inflate the Type I error under Neyman’s null hypothesis, when the sample sizes and variances of the outcomes under different treatment levels are negatively associated. We propose to use the $$X^2$$ statistic defined in § 5, a statistic that is robust with respect to treatment effect heterogeneity, because the resulting Fisher randomization test is exact under Fisher’s null hypothesis and controls asymptotic Type I error under Neyman’s null hypothesis. 2. Completely randomized experiment with $$J$$ treatments Consider a finite population of $$N$$ experimental units, each of which can be exposed to any one of $$J$$ treatments. Let $$Y_i(j)$$ denote the potential outcome (Neyman, 1923; Rubin, 1974) of unit $$i$$ when assigned to treatment level $$j$$ ($$i=1,\ldots, N;\,j=1,\ldots, J)$$. For two different treatment levels $$j$$ and $$j'$$, we define the unit-level treatment effect as $$\tau_i(j,j') = Y_i(j) - Y_i(j')$$ and the population-level treatment effect as   \begin{equation*} \tau(j,j') = N^{-1}\sum_{i=1}^N \tau_i(j,j') = N^{-1} \sum_{i=1}^N \{ Y_i(j) - Y_i(j')\} \equiv \bar{Y}_{\cdot}(j) - \bar{Y}_{\cdot}(j'), \end{equation*} where $$\bar{Y}_{\cdot}(j) = N^{-1} \sum_{i=1}^N Y_i(j)$$ is the average of the $$N$$ potential outcomes for treatment $$j$$. 
For treatment level $$j = 1, \ldots, J$$, define $$p_j = N_j/N$$ as the proportion of the units and $$S_{\cdot}^2(j) = (N -1)^{-1} \sum_{i=1}^N \{ Y_i(j) - \bar{Y}_{\cdot}(j) \}^2$$ as the finite-population variance of the potential outcomes. The treatment assignment mechanism can be represented by the binary random variable $$W_i(j),$$ which equals $$1$$ if the $$i$$th unit is assigned to treatment $$j$$ and $$0$$ otherwise. Equivalently, it can be represented by the discrete random variable $$W_i = \sum_{j=1}^J j W_i(j)$$, the treatment received by unit $$i$$. Let $$(W_1, \ldots, W_N)$$ be the treatment assignment vector, and let $$(w_1,\ldots, w_N)$$ denote its realization. For the $$N=\sum_{j=1}^J N_j$$ units, $$(N_1,\ldots, N_J)$$ are assigned at random to treatments $$(1,\ldots, J)$$, respectively, and the treatment assignment mechanism satisfies $${\rm{pr}}\{ (W_1, \ldots, W_N) = (w_1,\ldots, w_N) \} = {\rm{pr}}od_{j=1}^J N_j!/N!$$ if $$\sum_{i=1}^N W_i(j) = N_j$$ and $$0$$ otherwise. The observed outcome of unit $$i$$ is a deterministic function of the treatment it has received and the potential outcomes, given by $$Y_i^{\rm{obs}} = \sum_{j=1}^J W_i(j) Y_i(j)$$. 3. The Fisher randomization test under the sharp null hypothesis Fisher (1935) was interested in testing the following sharp null hypothesis of zero individual treatment effects:   \begin{equation*} H_{0{\rm{F}}}: Y_i(1) = \cdots = Y_i(J) \quad (i=1,\ldots, N)\text{.} \end{equation*} Under $$H_{0{\rm{F}}}$$, all $$J$$ potential outcomes $$Y_i(1), \ldots, Y_i(J)$$ equal the observed outcome $$Y_i^{\rm{obs}}$$, for all units $$i = 1, \ldots, N$$. Thus any possible realization of the treatment assignment vector would generate the same vector of observed outcomes. This means that under $$H_{0{\rm{F}}}$$ and given any realization $$(W_1,\ldots, W_N) = (w_1, \ldots, w_N)$$, the observed outcomes are fixed. 
Consequently, the randomization distribution or null distribution of any test statistic, which is a function of the observed outcomes and the treatment assignment vector, is its distribution over all possible realizations of the treatment assignment. The $$p$$-value is the tail probability measuring the extremeness of the test statistic with respect to its randomization distribution. Computationally, we can enumerate or simulate a subset of all possible randomizations to obtain the randomization distribution of any test statistic and thus perform the Fisher randomization test (Fisher, 1935; Imbens & Rubin, 2015). Fisher (1925) suggested using the $$F$$ statistic to test the departure from $$H_{0{\rm{F}}}$$. Define $$\bar{Y}_{\cdot}^{\rm{obs}}(j) = N_j^{-1} \sum_{i=1}^N W_i(j)Y_i^{\rm{obs}}$$ as the sample average of the observed outcomes within treatment level $$j$$, and define $$\bar{Y}_{\cdot}^{\rm{obs}} = N^{-1} \sum_{i=1}^N Y_i^{\rm{obs}}$$ as the sample average of all the observed outcomes. Let $$s^2_{{\rm{obs}}}(j) = ( N_j - 1 )^{-1} \sum_{i=1}^N W_i(j) \{ Y_i^{\rm{obs}} - \bar{Y}_{\cdot}^{\rm{obs}} (j) \}^2$$ and $$s^2_{\rm{obs}} = (N-1)^{-1} \sum_{i=1}^N (Y_i^{\rm{obs}} - \bar{Y}_{\cdot}^{\rm{obs}})^2$$ be the corresponding sample variances with divisors $$N_j-1$$ and $$N-1$$, respectively. Let   \begin{equation*} {\small{\rm{SS}}}_{\rm{T}} = \sum_{j=1}^J N_j \{ \bar{Y}_{\cdot}^{\rm{obs}}(j) - \bar{Y}_{\cdot}^{\rm{obs}} \}^2 \end{equation*} be the treatment sum of squares, and let   \begin{equation*} {\small{\rm{SS}}}_{\rm{R}} = \sum_{j=1}^J \:\sum_{i:\,W_i(j) = 1} \{ Y_i^{\rm{obs}} - \bar{Y}_{\cdot}^{\rm{obs}} (j) \}^2 = \sum_{j=1}^J (N_j - 1) s^2_{{\rm{obs}}}(j) \end{equation*} be the residual sum of squares. The treatment and residual sums of squares add up to the total sum of squares $$\sum_{i=1}^N (Y_i^{\rm{obs}} - \bar{Y}_{\cdot}^{\rm{obs}})^2 = (N-1)s^2_{\rm{obs}}$$. 
The $$F$$ statistic   $$F = \frac{ {\small{\rm{SS}}}_{\rm{T}} / (J-1) }{ {\small{\rm{SS}}}_{\rm{R}} / (N-J) } \equiv \frac{ {\small{\rm{MS}}}_{\rm{T}} }{ {\small{\rm {MS}}}_{\rm{R}}} \label{eq:F}$$ (1) is defined as the ratio of the treatment mean square $${\small{\rm{MS}}}_{\rm{T}} = {\small{\rm{SS}}}_{\rm{T}} / (J-1)$$ to the residual mean square $${\small{\rm {MS}}}_{\rm{R}}= {\small{\rm{SS}}}_{\rm{R}} / (N-J)$$. The distribution of (1) under $$H_{0{\rm{F}}}$$ can be well approximated by an $$F_{J-1, N-J}$$ distribution with degrees of freedom $$J-1$$ and $$N-J$$, as is often used in the analysis of variance table obtained from fitting a normal linear model. Although it is relatively easy to show that (1) follows $$F_{J-1, N-J}$$ if the observed outcomes follow a normal linear model drawn from a superpopulation, arriving at such a result via a purely randomization-based argument is nontrivial. Below, we state a known result on the approximate randomization distribution of (1), and throughout our discussion we assume the following regularity conditions required by the finite-population central limit theorem for causal inference Li & Ding, 2017. Condition 1. As $$N\rightarrow \infty$$, for all $$j$$, $$\:N_j/N$$ has a positive limit, $$\bar{Y}_\cdot (j)$$ has a finite limit, $$S_\cdot^2(j)$$ has a finite and positive limit, and $$N^{-1}\max_{1\leq i\leq N} | Y_i(j) - \bar{Y}_{\cdot}(j) |^2 \rightarrow 0$$. Theorem 1. Assume $$H_{0{\rm{F}}}$$. Over repeated sampling of $$(W_1,\ldots, W_N)$$, the expectations of the residual and treatment sums of squares are $$E({\small{\rm{SS}}}_{\rm{T}}) = (J-1) s^2_{\rm{obs}}$$ and $$E({\small{\rm{SS}}}_{\rm{R}}) = (N-J) s^2_{{\rm{obs}}}$$, and as $$N\rightarrow \infty$$, the asymptotic distribution of (1) is  $$F\overset{.}{\sim} \frac{ \chi^2_{J-1}/ (J-1) }{ \{ (N-1) - \chi^2_{J-1} \}/(N-J) } \overset{.}{\sim} \chi^2_{J-1}/(J-1) \overset{.}{\sim} F_{J-1, N-J}\text{.}$$ Remark 1. 
In Theorem 1 and the following discussion, we use the notation $$A_N \overset{.}{\sim} B_N$$ to represent two sequences of random variables $$\{A_N\}_{N=1}^\infty$$ and $$\{B_N\}_{N=1}^\infty$$ that have the same asymptotic distribution as $$N \rightarrow \infty$$. The original $$F$$ approximation for randomization inference for a finite population was derived by cumbersome moment matching between the statistic (1) and the corresponding $$F_{J-1, N-J}$$ distribution (Welch, 1937; Pitman, 1938; Kempthorne, 1952). In the Supplementary Material, we give a simpler proof based on the finite-population central limit theorem, similar to Silvey (1954). Remark 2. Under $$H_{0{\rm{F}}}$$, the total sum of squares is fixed, but its components $${\small{\rm{SS}}}_{\rm{T}}$$ and $${\small{\rm{SS}}}_{\rm{R}}$$ are random through the treatment assignment $$(W_1,\ldots,W_N)$$, and their expectations are calculated with respect to the distribution of the treatment assignment. Also, the ratio of the expectations of the numerator $${\small{\rm{MS}}}_{\rm{T}}$$ and the denominator $${\small{\rm {MS}}}_{\rm{R}}$$ of (1) is $$1$$ under $$H_{0{\rm{F}}}$$. 4. Sampling properties of the $$F$$ statistic under Neyman’s null hypothesis In § 3 we discussed the randomization distribution, i.e., the sampling distribution under $$H_{0{\rm{F}}}$$, of the $$F$$ statistic in (1). However, the sampling distribution of the $$F$$ statistic under Neyman’s null hypothesis of no average treatment effect (Neyman, 1923, 1935),   \begin{equation*} H_{0{\rm{N}}}: \bar{Y}_{\cdot}(1) = \cdots = \bar{Y}_{\cdot}(J), \end{equation*} is often of interest but has received limited attention (Imbens & Rubin, 2015). This hypothesis imposes weaker restrictions on the potential outcomes than $$H_{0{\rm{F}}}$$, making it impossible to compute the corresponding exact, or even approximate, distribution of $$F$$. 
However, analytical expressions for $$E({\small{\rm{SS}}}_{\rm{T}})$$ and $$E({\small{\rm{SS}}}_{\rm{R}})$$ can be derived under $$H_{0{\rm{N}}}$$ along the lines of Theorem 1, and can be used to gain insights into the consequences of testing $$H_{0{\rm{N}}}$$ using the Fisher randomization test with $$F$$. Let $$\bar{Y}_{\cdot}(\cdot) = \sum_{j=1}^J p_j \bar{Y}_{\cdot}(j)$$ and $$S^2 = \sum_{j=1}^J p_j S_{\cdot}^2(j)$$ be the weighted averages of the finite-population means and variances. The sampling distribution of $$F$$ depends crucially on the finite-population variance of the unit-level treatment effects,   $$S_{\tau}^2(j, j') = (N-1)^{-1} \sum_{i=1}^N \{ \tau_i(j, j') - \tau(j, j') \}^2 \text{.}$$ Definition 1. The potential outcomes $$\{ Y_i(j): i=1, \ldots, N; \: j = 1, \ldots, J \}$$ have strictly additive treatment effects if for all $$j \ne j'$$ the unit-level treatment effects $$\tau_i(j, j')$$ are the same for $$i=1, \ldots, N$$ or, equivalently, if $$S_{\tau}^2(j, j') = 0$$ for all $$j \ne j'$$.  Kempthorne (1955) obtained the following result for balanced designs with $$p_j=1/J$$ under the assumption of strict additivity:   $$E({\small{\rm{SS}}}_{\rm{R}}) = (N-J )S^2,\quad E({\small{\rm{SS}}}_{\rm{T}}) = \frac{N}{J } \sum_{j=1}^J \{ \bar{Y}_{\cdot}(j) - \bar{Y}_{\cdot}(\cdot) \}^2 + (J-1) S^2\text{.} \label{eq::kempthorne}$$ (2) This result implies that with balanced treatment assignments and strict additivity, $$E({\small{\rm{MS}}}_{\rm{R}}{-}{\small{\rm{MS}}}_{\rm{T}}){=}0$$ under $$H_{0{\rm{N}}}$$, and it provides a heuristic justification for testing $$H_{0{\rm{N}}}$$ using the Fisher randomization test with the $$F$$ statistic. However, strict additivity combined with $$H_{0{\rm{N}}}$$ implies $$H_{0{\rm{F}}}$$, for which this result is already known by Theorem 1. We will now derive results that do not require strict additivity, and thus are more general than those in Kempthorne (1955). 
For this purpose, we introduce a measure of deviation from additivity. Let   \begin{equation*} \Delta = \mathop{\sum\sum}_{j < j'} p_j p_{j'} S_{\tau}^2(j, j') \end{equation*} be a weighted average of the variances of unit-level treatment effects. By Definition 1, $$\Delta = 0$$ under strict additivity. If strict additivity does not hold, i.e., if there is treatment effect heterogeneity, then $$\Delta \neq 0$$. Thus $$\Delta$$ is a measure of the deviation from additivity and plays a crucial role in the following results on the sampling distribution of the $$F$$ statistic. Theorem 2. Over repeated sampling of $$(W_1, \ldots, W_N)$$, the expectation of the residual sum of squares is $$E({\small{\rm{SS}}}_{\rm{R}}) = \sum_{j=1}^J (N_j - 1) S_{\cdot}^2(j) ,$$ and the expectation of the treatment sum of squares is  $$E({\small{\rm{SS}}}_{\rm{T}}) = \sum_{j=1}^J N_j \bigl\{ \bar{Y}_{\cdot}(j) - \bar{Y}_{\cdot}(\cdot) \bigr\}^2+ \sum_{j=1}^J (1-p_j) S_{\cdot}^2(j) - \Delta,$$ which reduces to $$E({\small{\rm{SS}}}_{\rm{T}}) = \sum_{j=1}^J (1-p_j) S_{\cdot}^2(j) - \Delta$$ under $$H_{0{\rm{N}}}$$.  Corollary 1. Under $$H_{0{\rm{N}}}$$ with strict additivity in Definition 1 or, equivalently, under $$H_{0{\rm{F}}}$$, the results in Theorem 2 reduce to $$E({\small{\rm{SS}}}_{\rm{R}}) = (N-J) S^2$$ and $$E({\small{\rm{SS}}}_{\rm{T}}) = (J-1) S^2,$$ which coincide with the statements in Theorem 1.  Corollary 2. For a balanced design with $$p_j = 1/J$$,  $$E({\small{\rm{SS}}}_{\rm{R}}) = (N-J )S^2,\quad E({\small{\rm{SS}}}_{\rm{T}}) = \frac{N}{J } \sum_{j=1}^J \{ \bar{Y}_{\cdot}(j) - \bar{Y}_{\cdot}(\cdot) \}^2 + (J-1) S^2 - \Delta\text{.}$$ Furthermore, under $$H_{0{\rm{N}}}$$, $$E({\small{\rm{SS}}}_{\rm{R}}) = (N-J )S^2$$ and $$E({\small{\rm{SS}}}_{\rm{T}}) = (J-1) S^2 - \Delta,$$ implying that the difference between the residual mean square and treatment mean square is $$E ( {\small{\rm{MS}}}_{\rm{R}} - {\small{\rm{MS}}}_{\rm{T}} ) =\Delta / (J - 1) \geq 0$$.  
The result in (2) is a special case of Corollary 2 for $$\Delta=0$$. Corollary 2 implies that, for balanced designs, if the assumption of strict additivity does not hold, then testing $$H_{0{\rm{N}}}$$ using the Fisher randomization test with the $$F$$ statistic may be conservative, in the sense that it may reject a null hypothesis less often than the nominal level. However, for unbalanced designs, the conclusion is not definite, as can be seen from the following corollary. Corollary 3. Under $$H_{0{\rm{N}}}$$, the difference between the residual and treatment mean square is  $$E( {\small{\rm{MS}}}_{\rm{R}} - {\small{\rm{MS}}}_{\rm{T}}) = \frac{ (N-1)J }{ (J-1)(N-J) } \sum_{j=1}^J (p_j - J^{-1})S_{\cdot}^2(j) + \frac{ \Delta} { J-1}\text{.}$$ Corollary 3 shows that the residual mean square may be larger or smaller than that of the treatment, depending on the balance or lack thereof in the experiment and the variances of the potential outcomes. Under $$H_{0{\rm{N}}}$$, when the $$p_j$$ and $$S_{\cdot}^2(j)$$ are positively associated, the Fisher randomization test using $$F$$ tends to be conservative; when the $$p_j$$ and $$S_{\cdot}^2(j)$$ are negatively associated, the Fisher randomization test using $$F$$ may not control correct Type I error. 5. A test statistic that controls type I error more precisely than $$F$$ To address the failure of the $$F$$ statistic to control Type I error of the Fisher randomization test under $$H_{0{\rm{N}}}$$ in unbalanced experiments, we propose to use (3) for the Fisher randomization test. Let $$\hat{Q}_j = N_j / s^2_{{\rm{obs}}}(j)$$, and define the weighted average of the sample means as $$\bar{Y}^{\rm{obs}}_w = \sum_{j=1}^J \hat{Q}_j \bar{Y}_{\cdot}^{\rm{obs}}(j) / \sum_{j=1}^J \hat{Q}_j$$. 
Define   $$X^2 = \sum_{j=1}^J \hat{Q}_j \bigl\{ \bar{Y}_{\cdot}^{\rm{obs}}(j) - \bar{Y}^{\rm{obs}}_w \bigr\}^2\text{.} \label{eq:X^2}$$ (3) This test statistic has been exploited in classical analysis of variance (e.g., James, 1951; Welch, 1951; Johansen, 1980; Rice & Gaines, 1989; Weerahandi, 1995; Krishnamoorthy et al., 2007) based on the normal linear model with heteroskedasticity, and a similar idea called studentization has been adopted in permutation tests has been adopted in permutation tests (e.g., Neuhaus, 1993; Janssen, 1997, 1999; Janssen & Pauls, 2003; Chung & Romano, 2013; Pauly et al., 2015). Replacing $$F$$ with (3) does not affect the validity of the Fisher randomization test for testing $$H_{0{\rm{F}}}$$, because we always have an exact test for $$H_{0{\rm{F}}}$$ no matter which test statistic we use. We show below that the Fisher randomization test using $$X^2$$ can also control the asymptotic Type I error for testing $$H_{0{\rm{N}}}$$, so the Fisher randomization test using $$X^2$$ can control the Type I error under both $$H_{0{\rm{F}}}$$ and $$H_{0{\rm{N}}}$$ asymptotically, making $$X^2$$ a more attractive choice than the classical $$F$$ statistic for the Fisher randomization test. Theorem 3. Under $$H_{0{\rm{F}}}$$, the asymptotic distribution of $$X^2$$ is $$\chi^2_{J-1}$$ as $$N\rightarrow \infty$$. Under $$H_{0{\rm{N}}}$$, the asymptotic distribution of $$X^2$$ is stochastically dominated by $$\chi^2_{J-1}$$, i.e., for any constant $$a>0$$, $$\:\lim_{N\rightarrow \infty} {\rm{pr}}(X^2 \ge a) \le {\rm{pr}}(\chi^2_{J-1} \ge a)$$.  Remark 3. Under $$H_{0{\rm{F}}}$$, the randomization distribution of $${\small{\rm{SS}}}_{\rm{T}}/s_{\rm{obs}}^2$$ follows $$\chi^2_{J-1}$$ asymptotically, as shown in the Supplementary Material. 
Under $$H_{0{\rm{N}}}$$, however, the asymptotic distribution of $${\small{\rm{SS}}}_{\rm{T}}/s_{\rm{obs}}^2$$ is not $$\chi^2_{J-1}$$, and the asymptotic distribution of $$F$$ is not $$F_{N-J,J-1}$$ as suggested by Corollary 3. Fortunately, if we weight each treatment square by the inverse of the sample variance of the outcomes, the resulting $$X^2$$ statistic preserves the asymptotic $$\chi^2_{J-1}$$ randomization distribution under $$H_{0{\rm{F}}}$$ and has an asymptotic distribution that is stochastically dominated by $$\chi^2_{J-1}$$ under $$H_{0{\rm{N}}}$$. Therefore, under $$H_{0{\rm{N}}}$$, the Type I error of the Fisher randomization test using $$X^2$$ does not exceed the nominal level. Although we can perform the Fisher randomization test by enumerating or simulating from all possible realizations of the treatment assignment, Theorem 3 suggests that an asymptotic rejection rule against $$H_{0{\rm{F}}}$$ or $$H_{0{\rm{N}}}$$ is $$X^2 > x_{1-\alpha}$$, the $$1-\alpha$$ quantile of the $$\chi^2_{J-1}$$ distribution. Because the asymptotic distribution of $$X^2$$ under $$H_{0{\rm{N}}}$$ is stochastically dominated by $$\chi^2_{J-1}$$, its true $$1-\alpha$$ quantile is asymptotically smaller than $$x_{1-\alpha}$$, and the corresponding Fisher randomization test is conservative in the sense of having smaller Type I error than the nominal level asymptotically. Remark 4. The asymptotic conservativeness described above is not particular to our test statistic, but rather a feature of the finite-population inference (Neyman, 1923; Aronow et al., 2014; Imbens & Rubin, 2015). It distinguishes Theorem 3 from previous results on permutation tests (e.g., Chung & Romano, 2013; Pauly et al., 2015), where the conservativeness did not appear and the correlation between the potential outcomes played no role in the theory. The form of (3) suggests its difference from $$F$$ when the potential outcomes have different variances under different treatment levels. 
Otherwise we show that they are asymptotically equivalent in the following sense. Corollary 4. If $$S_{\cdot}^2(1) = \cdots = S_{\cdot}^2(J)$$, then $$(J-1) F \overset{.}{\sim} X^2$$.  Under strict additivity in Definition 1, the condition $$S_{\cdot}^2(1) = \cdots = S_{\cdot}^2(J)$$ holds, and the equivalence between $$(J-1) F$$ and $$X^2$$ guarantees that the Fisher randomization tests using $$F$$ and $$X^2$$ have the same asymptotic Type I error and power. However, Corollary 4 is a large-sample result; we evaluate it in finite samples in the Supplementary Material. 6. Simulation 6.1 Type I error of the Fisher randomization test using $$F$$ In this subsection, we use simulation to evaluate the finite-sample performance of the Fisher randomization test using $$F$$ under $$H_{0{\rm{N}}}$$. We consider the following three cases, where $$\mathcal{N}(\mu, \sigma^2)$$ denotes a normal distribution with mean $$\mu$$ and variance $$\sigma^2$$. We choose significance level $$0{\cdot}05$$ for all tests. Case 1. For balanced experiments with sample sizes $$N=45$$ and $$N=120$$, we generate potential outcomes under two settings: (1A) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2)\sim \mathcal{N}(0, 1{\cdot}2^2)$$ and $$Y_i(3)\sim \mathcal{N}(0, 1{\cdot}5^2)$$; (1B) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2)\sim \mathcal{N}(0, 2^2)$$ and $$Y_i(3)\sim \mathcal{N}(0, 3^2)$$. These potential outcomes are independently generated and standardized to have zero mean. Case 2. For unbalanced experiments with sample sizes $$(N_1, N_2, N_3) = (10,20,30)$$ and $$(N_1, N_2, N_3) = (20,30,50)$$, we generate potential outcomes under two settings: (2A) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2) = 2Y_i(1)$$ and $$Y_i(3) = 3Y_i(1)$$; (2B) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2) = 3Y_i(1)$$ and $$Y_i(3) = 5Y_i(1)$$. These potential outcomes are standardized to have zero mean. In this case, $$p_1<p_2<p_3$$ and $$S^2_{\cdot}(1) < S^2_{\cdot}(2) < S^2_{\cdot}(3)$$. Case 3. 
For unbalanced experiments with sample sizes $$(N_1, N_2, N_3) = (30,20,10)$$ and $$(N_1, N_2, N_3) = (50,30,20)$$, we generate potential outcomes under two settings: (3A) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2) = 2Y_i(1)$$ and $$Y_i(3) = 3Y_i(1)$$; (3B) $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2) = 3Y_i(1)$$ and $$Y_i(3) = 5Y_i(1)$$. These potential outcomes are standardized to have zero mean. In this case, $$p_1>p_2>p_3$$ and $$S^2_{\cdot}(1) <S^2_{\cdot}(2) < S^2_{\cdot}(3)$$. Once generated, the potential outcomes are treated as fixed constants. Over $$2000$$ simulated randomizations, we calculate the observed outcomes and then perform the Fisher randomization test using $$F$$ to approximate the $$p$$-values by $$2000$$ draws of the treatment assignment. The histograms of the $$p$$-values are shown in Figs. 1(a)–(c) corresponding to cases 1–3 above. In the next few paragraphs we report the rejection rates associated with these cases along with their standard errors. Fig. 1. Histograms of the $$p$$-values under $$H_{0{\rm{N}}}$$ based on the Fisher randomization tests using $$F$$: (a) balanced experiments, case 1; (b) unbalanced experiments, case 2; (c) unbalanced experiments, case 3. Grey and white histograms correspond to the subcases A and B, respectively. In Fig. 1(a), the Fisher randomization test using $$F$$ is conservative with $$p$$-values distributed towards $$1$$. With greater heterogeneity in the potential outcomes, the histograms of the $$p$$-values have larger masses near $$1$$.
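The simulation design described above (potential outcomes generated once and then held fixed, repeated complete randomizations, and a Monte Carlo Fisher randomization test for each realized assignment) can be sketched in Python as follows. This is only an illustrative sketch for a setting like case (3A): the weights $$\hat{Q}_j = N_j/s_{\rm{obs}}^2(j)$$ are an assumption, all function names are hypothetical, and the default numbers of replications and draws are far smaller than the $$2000$$ used in the paper so that the example runs quickly.

```python
import random

def group_stats(y, z):
    """Return (size, mean, sample variance) for each treatment group."""
    groups = {}
    for yi, zi in zip(y, z):
        groups.setdefault(zi, []).append(yi)
    stats = []
    for g in groups.values():
        n = len(g)
        mean = sum(g) / n
        stats.append((n, mean, sum((v - mean) ** 2 for v in g) / (n - 1)))
    return stats

def f_statistic(y, z):
    """Classical one-way F statistic."""
    stats = group_stats(y, z)
    n_total = sum(n for n, _, _ in stats)
    grand = sum(n * mean for n, mean, _ in stats) / n_total
    ss_t = sum(n * (mean - grand) ** 2 for n, mean, _ in stats)
    ss_r = sum((n - 1) * var for n, _, var in stats)
    j = len(stats)
    return (ss_t / (j - 1)) / (ss_r / (n_total - j))

def x2_statistic(y, z):
    """X^2 with assumed weights Q_j = N_j / s_obs^2(j)."""
    stats = group_stats(y, z)
    q = [n / var for n, _, var in stats]
    means = [mean for _, mean, _ in stats]
    ybar_w = sum(qj * m for qj, m in zip(q, means)) / sum(q)
    return sum(qj * (m - ybar_w) ** 2 for qj, m in zip(q, means))

def frt_pvalue(y, z, stat, draws, rng):
    """Monte Carlo p-value from `draws` permutations of the assignment."""
    observed = stat(y, z)
    z_perm = list(z)
    hits = 0
    for _ in range(draws):
        rng.shuffle(z_perm)
        if stat(y, z_perm) >= observed:
            hits += 1
    return hits / draws

def make_potential_outcomes(n, scales, rng):
    """Y_i(1) ~ N(0, 1), centred at zero; Y_i(j) = scales[j] * Y_i(1)."""
    base = [rng.gauss(0.0, 1.0) for _ in range(n)]
    centre = sum(base) / n
    base = [b - centre for b in base]
    return [[s * b for b in base] for s in scales]

def rejection_rates(group_sizes, scales, reps, draws, alpha=0.05, seed=0):
    """Share of randomizations in which each test rejects at level alpha."""
    rng = random.Random(seed)
    n = sum(group_sizes)
    potential = make_potential_outcomes(n, scales, rng)  # fixed from here on
    z = [j for j, size in enumerate(group_sizes) for _ in range(size)]
    rejections = {"F": 0, "X2": 0}
    for _ in range(reps):
        rng.shuffle(z)  # one complete randomization
        y_obs = [potential[zi][i] for i, zi in enumerate(z)]
        for name, stat in (("F", f_statistic), ("X2", x2_statistic)):
            if frt_pvalue(y_obs, z, stat, draws, rng) <= alpha:
                rejections[name] += 1
    return {name: count / reps for name, count in rejections.items()}

# Group sizes and scales as in case (3A); reps and draws reduced for speed.
rates = rejection_rates((30, 20, 10), (1.0, 2.0, 3.0), reps=100, draws=200)
print(rates)
```

With so few replications the estimated rates are noisy; increasing `reps` and `draws` towards $$2000$$ brings the sketch closer to the design used for the reported rejection rates.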
For case (1A) the rejection rates are $$0{\cdot}010$$ and $$0{\cdot}018$$, and for case (1B) the rejection rates are $$0{\cdot}023$$ and $$0{\cdot}016$$, for sample sizes $$N=45$$ and $$N=120$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}003$$. In Fig. 1(b), the sample sizes under each treatment level are increasing in the variances of the potential outcomes. The Fisher randomization test using $$F$$ is conservative with $$p$$-values distributed towards $$1$$. Similar to Fig. 1(a), when there is greater heterogeneity in the potential outcomes, the $$p$$-values have larger masses near $$1$$. For case (2A) the rejection rates are $$0{\cdot}016$$ and $$0{\cdot}014$$, and for case (2B) the rejection rates are $$0{\cdot}015$$ and $$0{\cdot}011$$, for sample sizes $$N=60$$ and $$N=100$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}003$$. In Fig. 1(c), the sample sizes under different treatment levels are decreasing in the variances of the potential outcomes. For case (3A) the rejection rates are $$0{\cdot}133$$ and $$0{\cdot}126$$, and for case (3B) the rejection rates are $$0{\cdot}189$$ and $$0{\cdot}146$$, for sample sizes $$N=60$$ and $$N=100$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}009$$. The Fisher randomization test using $$F$$ does not preserve correct Type I error, with $$p$$-values distributed towards $$0$$. With greater heterogeneity in the potential outcomes, the $$p$$-values have larger masses near $$0$$. These empirical findings agree with our theory in § 4; that is, if the sample sizes under different treatment levels are decreasing in the sample variances of the observed outcomes, then the Fisher randomization test using $$F$$ may not yield correct Type I error under $$H_{0{\rm{N}}}$$. 6.2. Type I error of the Fisher randomization test using $$X^2$$ Figure 2 shows the same simulation as in Fig. 1, but with test statistic $$X^2$$. Fig. 2.
Histograms of the $$p$$-values under $$H_{0{\rm{N}}}$$ based on the Fisher randomization tests using $$X^2$$: (a) balanced experiments, case 1; (b) unbalanced experiments, case 2; (c) unbalanced experiments, case 3. Grey and white histograms correspond to the subcases A and B, respectively. Figure 2(a) is similar to Fig. 1(a). For case (1A) the rejection rates are $$0{\cdot}016$$ and $$0{\cdot}012$$, and for case (1B) the rejection rates are $$0{\cdot}014$$ and $$0{\cdot}010$$, for sample sizes $$N=45$$ and $$N=120$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}003$$. Figure 2(b) shows better performance of the Fisher randomization test using $$X^2$$ than in Fig. 1(b), with $$p$$-values closer to uniform. For case (2A) the rejection rates are $$0{\cdot}032$$ and $$0{\cdot}038$$, and for case (2B) the rejection rates are $$0{\cdot}026$$ and $$0{\cdot}030$$, for sample sizes $$N=60$$ and $$N=100$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}004$$. Figure 2(c) shows much better performance of the Fisher randomization test using $$X^2$$ than in Fig. 1(c), because the $$p$$-values are much closer to uniform. For case (3A) the rejection rates are $$0{\cdot}052$$ and $$0{\cdot}042$$, and for case (3B) the rejection rates are $$0{\cdot}048$$ and $$0{\cdot}040$$, for sample sizes $$N=60$$ and $$N=100$$ respectively, with all Monte Carlo standard errors no greater than $$0{\cdot}005$$. This agrees with our theory that the Fisher randomization test using $$X^2$$ can control the asymptotic Type I error under $$H_{0{\rm{N}}}$$. 6.3.
Power comparison of the Fisher randomization tests using $$F$$ and $$X^2$$ In this subsection, we compare the powers of the Fisher randomization tests using $$F$$ and $$X^2$$ under alternative hypotheses. We consider the following cases. Case 4. For balanced experiments with sample sizes $$N=30$$ and $$N=45$$, we generate potential outcomes from $$Y_i(1)\sim \mathcal{N}(0,1)$$, $$Y_i(2)\sim \mathcal{N}(0, 2^2)$$ and $$Y_i(3)\sim \mathcal{N}(0, 3^2)$$. These potential outcomes are independently generated and transformed to have means $$(0,1,2)$$. Case 5. For unbalanced experiments with sample sizes $$(N_1, N_2, N_3) = (10,20,30)$$ and $$(N_1, N_2, N_3) = (20,30,50)$$, we first generate $$Y_i(1)\sim \mathcal{N}(0,1)$$ and standardize them to have mean zero; we then generate $$Y_i(2) = 3Y_i(1)+1$$ and $$Y_i(3) = 5Y_i(1) + 2$$. In this case, $$p_1<p_2<p_3$$ and $$S^2_{\cdot}(1) < S^2_{\cdot}(2) < S^2_{\cdot}(3)$$. Case 6. For unbalanced experiments with sample sizes $$(N_1, N_2, N_3) = (30,20,10)$$ and $$(N_1, N_2, N_3) = (50,30,20)$$, we generate potential outcomes in the same way as in case 5 above. In this case, $$p_1>p_2>p_3$$ and $$S^2_{\cdot}(1) <S^2_{\cdot}(2) < S^2_{\cdot}(3)$$. Over $$2000$$ simulated datasets, we perform the Fisher randomization test using $$F$$ and $$X^2$$ and obtain the $$p$$-values by $$2000$$ draws of the treatment assignment. The histograms of the $$p$$-values, shown in Figs. 3(a)–(c), correspond to cases 4–6 above. The Monte Carlo standard errors for the rejection rates are all close to but no greater than $$0{\cdot}011$$. Fig. 3. Histograms of the $$p$$-values under alternative hypotheses based on the Fisher randomization tests using $$F$$ and $$X^2$$: (a) balanced experiments, case 4; (b) unbalanced experiments, case 5; (c) unbalanced experiments, case 6. Grey histograms correspond to $$X^2$$ and white histograms to $$F$$.
For case 4, the rejection rates using $$X^2$$ and $$F$$ are respectively $$0{\cdot}290$$ and $$0{\cdot}376$$ with sample size $$N=30$$, and $$0{\cdot}576$$ and $$0{\cdot}692$$ with sample size $$N=45$$. For case 5, the powers using $$X^2$$ and $$F$$ are respectively $$0{\cdot}178$$ and $$0{\cdot}634$$ with sample size $$N=60$$, and $$0{\cdot}288$$ and $$0{\cdot}794$$ with sample size $$N=100$$. Therefore, when the experiments are balanced or when the sample sizes are positively associated with the variances of the potential outcomes, the Fisher randomization test using $$F$$ has higher power than that using $$X^2$$. For case 6, the rejection rates using $$X^2$$ and $$F$$ are respectively $$0{\cdot}494$$ and $$0{\cdot}355$$ with sample size $$N=60$$, and $$0{\cdot}642$$ and $$0{\cdot}576$$ with sample size $$N=100$$. Therefore, when the sample sizes are negatively associated with the variances of the potential outcomes, the Fisher randomization test using $$F$$ has lower power than that using $$X^2$$. 6.4 Simulation studies under other distributions and applications In the Supplementary Material, we give more numerical examples. First, we conduct simulation studies that parallel those in §§ 6.1–6.3 but have outcomes generated from exponential distributions. The conclusions are nearly identical to those in §§ 6.1–6.3, because the finite-population central limit theorem holds under mild moment conditions without distributional assumptions. Second, we use two numerical examples to illustrate the conservativeness issue in Theorem 3.
Third, we compare the different behaviours of the Fisher randomization tests using $$F$$ and $$X^2$$ in two real-life examples. 7. Discussion As shown in the proofs of Theorems 1 and 3 in the Supplementary Material, we need to analyse the eigenvalues of the covariance matrix of $$\{ \bar{Y}_{\cdot}^{\rm{obs}}(1), \ldots, \bar{Y}_{\cdot}^{\rm{obs}}(J) \}$$ to obtain the properties of $$F$$ and $$X^2$$ for general $$J> 2$$. Moreover, by considering the case of $$J=2$$ we can gain more insight and make connections with existing literature. For $$j\neq j'$$, an unbiased estimator for $$\tau(j,j')$$ is $$\hat{\tau}(j,j') = \bar{Y}_{\cdot}^{\rm{obs}}(j) - \bar{Y}_{\cdot}^{\rm{obs}}(j')$$, which has sampling variance $${\rm {var}}\{\hat{\tau}(j,j') \} = S_{\cdot}^2(j) / N_j + S_{\cdot}^2(j') / N_{j'} - S_{\tau}^2(j, j') /N$$ and a conservative variance estimator $$s_{\rm{obs}}^2(j)/N_j + s_{\rm{obs}}^2(j') / N_{j'}$$ (Neyman, 1923). Corollary 5. When $$J=2$$, the $$F$$ and $$X^2$$ statistics reduce to $$F\approx \frac{ \hat{\tau}^2(1,2) }{ s_{\rm{obs}}^2(1)/N_2 + s_{\rm{obs}}^2(2) / N_1 },\quad X^2 = \frac{ \hat{\tau}^2(1,2) }{ s_{\rm{obs}}^2(1)/N_1 + s_{\rm{obs}}^2(2) / N_2 },$$ where the approximation for $$F$$ is due to ignoring the difference between $$N$$ and $$N-2$$ and the difference between $$N_j$$ and $$N_j-1$$ $$(j=1,2)$$. Under $$H_{0{\rm{F}}}$$, $$\: F\overset{.}{\sim} \chi^2_1$$ and $$X^2\overset{.}{\sim} \chi^2_1$$. Under $$H_{0{\rm{N}}}$$, $$\:F\overset{.}{\sim} C_1 \chi^2_1$$ and $$X^2\overset{.}{\sim} C_2 \chi^2_1$$, where $$\label{eq::constants} C_1 = \lim_{N\rightarrow + \infty} \frac{ {\rm {var}}\{\hat{\tau}(1,2) \} } { S_{\cdot}^2(1)/N_2 + S_{\cdot}^2(2) / N_1 }, \quad C_2 = \lim_{N\rightarrow + \infty} \frac{ {\rm {var}}\{\hat{\tau}(1,2) \} } { S_{\cdot}^2(1)/N_1 + S_{\cdot}^2(2) / N_2 } \leq 1\text{.}$$ (4) Depending on the sample sizes and the finite-population variances, $$C_1$$ can be either larger or smaller than $$1$$.
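As a rough numerical illustration of (4), the Python sketch below evaluates the finite-$$N$$ ratios that appear inside the limits defining $$C_1$$ and $$C_2$$, for a fixed set of potential outcomes with $$Y_i(2) = 3Y_i(1)$$. It assumes that $$S_{\cdot}^2(j)$$ and $$S_{\tau}^2(1,2)$$ are finite-population variances with divisor $$N-1$$, the latter taken over the unit-level differences $$Y_i(1) - Y_i(2)$$; the function names `sample_var` and `c_ratios` are hypothetical.

```python
import random

def sample_var(values):
    """Variance with divisor len(values) - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def c_ratios(y1, y2, n1, n2):
    """Finite-N analogues of C_1 and C_2 in (4) for J = 2.

    y1, y2 hold the potential outcomes of all N = n1 + n2 units under
    treatments 1 and 2; n1 and n2 are the group sizes.
    """
    n = n1 + n2
    if len(y1) != n or len(y2) != n:
        raise ValueError("need potential outcomes for all n1 + n2 units")
    s1 = sample_var(y1)
    s2 = sample_var(y2)
    s_tau = sample_var([a - b for a, b in zip(y1, y2)])
    var_tau = s1 / n1 + s2 / n2 - s_tau / n  # sampling variance of tau-hat
    c1 = var_tau / (s1 / n2 + s2 / n1)       # denominator implied by F
    c2 = var_tau / (s1 / n1 + s2 / n2)       # denominator implied by X^2
    return c1, c2

rng = random.Random(1)
y1 = [rng.gauss(0.0, 1.0) for _ in range(40)]
y2 = [3.0 * v for v in y1]  # heterogeneous unit-level effects, as in case (3B)

# Larger group under the low-variance treatment.
print(c_ratios(y1, y2, n1=30, n2=10))  # approximately (2.083, 0.893)
# Larger group under the high-variance treatment.
print(c_ratios(y1, y2, n1=10, n2=30))  # approximately (0.321, 0.750)
```

In this example the ratio associated with $$F$$ exceeds $$1$$ when the larger group receives the low-variance treatment and falls below $$1$$ when the group sizes are swapped, whereas the ratio associated with $$X^2$$ stays below $$1$$ in both configurations.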
Consequently, using $$F$$ in the Fisher randomization test can be conservative or anticonservative for testing $$H_{0{\rm{N}}}$$. In contrast, $$C_2$$ is always no larger than $$1$$, and therefore using $$X^2$$ in the Fisher randomization test is conservative for testing $$H_{0{\rm{N}}}$$. Neyman (1923) proposed using the square root of $$X^2$$ to test $$H_{0{\rm{N}}}$$ based on a normal approximation, which is asymptotically equivalent to the Fisher randomization test using $$X^2$$. Both are conservative unless the unit-level treatment effects are constant. In practice, for treatment-control experiments, the difference-in-means statistic $$\hat{\tau}(1,2)$$ has been widely used in the Fisher randomization test (Imbens & Rubin, 2015); it can, however, yield either conservative or anticonservative tests for $$H_{0{\rm{N}}}$$, as shown by Gail et al. (1996), Lin et al. (2017) and Ding (2017) using numerical examples. We formally state this result below, recognizing the equivalence between $$\hat{\tau}(1,2)$$ and $$F$$ in a two-sided test. Corollary 6. When $$J\,{=}\,2$$, the two-sided Fisher randomization test using $$\hat{\tau}(1,2)$$ is equivalent to using $$T^2 = \frac{ \hat{\tau}^2(1,2) }{ N s_{\rm{obs}}^2/(N_1N_2) } \approx \frac{\hat{\tau}^2(1,2) }{ s_{\rm{obs}}^2(1)/N_2 + s_{\rm{obs}}^2(2) / N_1 + \hat{\tau}^2(1,2)/N },$$ where the approximation is due to ignoring the difference between $$(N-1,N_1-1,N_2-1)$$ and $$(N,N_1,N_2)$$. Under $$H_{0{\rm{F}}}$$, $$\:T^2\overset{.}{\sim} F\overset{.}{\sim} \chi^2_1$$, and under $$H_{0{\rm{N}}}$$, $$\:T^2\overset{.}{\sim} F \overset{.}{\sim} C_1\chi^2_1$$ with $$C_1$$ defined in (4). Remark 5. Analogously, under the superpopulation model, Romano (1990) showed that the Fisher randomization test using $$\hat{\tau}(1,2)$$ can be conservative or anticonservative for testing the hypothesis of equal means of two samples.
Janssen (1997, 1999) and Chung & Romano (2013) suggested using the studentized statistic, or equivalently $$X^2$$, to remedy the problem of possibly inflated Type I error; the resulting test is asymptotically exact under the superpopulation model. After rejecting either $$H_{0{\rm{F}}}$$ or $$H_{0{\rm{N}}}$$, it is often of interest to test pairwise hypotheses; that is, for $$j\neq j'$$, $$\:Y_i(j) = Y_i(j')$$ for all $$i$$, or $$\bar{Y}_{\cdot}(j) = \bar{Y}_{\cdot}(j')$$. According to Corollaries 5 and 6, we recommend using the Fisher randomization test with test statistic $$\hat{\tau}^2(j,j') /\{ s_{\rm{obs}}^2(j)/N_j + s_{\rm{obs}}^2(j') / N_{j'} \} ,$$ which will yield conservative Type I error even if the experiment is unbalanced and the variances of the potential outcomes vary across treatment groups. The analogy between our finite-population theory and the superpopulation theory of Chung & Romano (2013) suggests that similar results may also hold for layouts of higher order and other test statistics (Pauly et al., 2015; Chung & Romano, 2016a,b; Friedrich et al., 2017). In more complex experimental designs, multiple effects are often of interest simultaneously, giving rise to the problem of multiple testing (Chung & Romano, 2016b). We leave these questions to future work. Acknowledgement Ding was partially funded by the Institute of Education Sciences and the National Science Foundation, U.S.A. Dasgupta was partially funded by the National Science Foundation, U.S.A. The authors thank Xinran Li, Zhichao Jiang, Lo-Hua Yuan and Robin Gong for comments on earlier versions of the paper. We thank a reviewer and the associate editor for helpful comments. Supplementary material Supplementary material available at Biometrika online includes proofs and more examples. References Aronow P. M., Green D. P. & Lee D. K. (2014). Sharp bounds on the variance in randomized experiments. Ann. Statist. 42, 850–71. Chung E. & Romano J. P. (2013).
Exact and asymptotically robust permutation tests. Ann. Statist. 41, 484–507. Chung E. & Romano J. P. (2016a). Asymptotically valid and exact permutation tests based on two-sample U-statistics. J. Statist. Plan. Infer. 168, 97–105. Chung E. & Romano J. P. (2016b). Multivariate and multiple permutation tests. J. Economet. 193, 76–91. Dasgupta T., Pillai N. S. & Rubin D. B. (2015). Causal inference from $$2^{K}$$ factorial designs using the potential outcomes model. J. R. Statist. Soc. B 77, 727–53. Ding P. (2017). A paradox from randomization-based causal inference (with Discussion). Statist. Sci. 32, 331–45. Ding P., Feller A. & Miratrix L. (2016). Randomization inference for treatment effect variation. J. R. Statist. Soc. B 78, 655–71. Fisher R. A. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd. Fisher R. A. (1935). The Design of Experiments. Edinburgh: Oliver & Boyd. Friedrich S., Brunner E. & Pauly M. (2017). Permuting longitudinal data in spite of the dependencies. J. Mult. Anal. 153, 255–65. Gail M. H., Mark S. D., Carroll R. J., Green S. B. & Pee D. (1996). On design considerations and randomization-based inference for community intervention trials. Statist. Med. 15, 1069–92. Imbens G. W. & Rubin D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. New York: Cambridge University Press. James G. (1951). The comparison of several groups of observations when the ratios of the population variances are unknown. Biometrika 38, 324–9. Janssen A. (1997). Studentized permutation tests for non-i.i.d.
hypotheses and the generalized Behrens-Fisher problem. Statist. Prob. Lett. 36, 9–21. Janssen A. (1999). Testing nonparametric statistical functionals with applications to rank tests. J. Statist. Plan. Infer. 81, 71–93. Janssen A. & Pauls T. (2003). How do bootstrap and permutation tests work? Ann. Statist. 31, 768–806. Johansen S. (1980). The Welch–James approximation to the distribution of the residual sum of squares in a weighted linear regression. Biometrika 67, 85–92. Kempthorne O. (1952). The Design and Analysis of Experiments. London: Chapman & Hall. Kempthorne O. (1955). The randomization theory of experimental inference. J. Am. Statist. Assoc. 50, 946–67. Krishnamoorthy K., Lu F. & Mathew T. (2007). A parametric bootstrap approach for ANOVA with unequal variances: Fixed and random models. Comp. Statist. Data Anal. 51, 5731–42. Li X. & Ding P. (2017). General forms of finite population central limit theorems with applications to causal inference. J. Am. Statist. Assoc. 112, 1759–69. Lin W., Halpern S. D., Prasad K. M. & Small D. S. (2017). A “placement of death” approach for studies of treatment effects on ICU length of stay. Statist. Meth. Med. Res. 26, 292–311. Neuhaus G. (1993). Conditional rank tests for the two-sample problem under random censorship. Ann. Statist. 21, 1760–79. Neyman J. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statist. Sci. 5, 465–72. Neyman J. (1935). Statistical problems in agricultural experimentation (with Discussion). Suppl. J. R. Statist. Soc. 2, 107–80.
Pauly M., Brunner E. & Konietschke F. (2015). Asymptotic permutation tests in general factorial designs. J. R. Statist. Soc. B 77, 461–73. Pitman E. J. (1938). Significance tests which may be applied to samples from any populations: III. The analysis of variance test. Biometrika 29, 322–35. Rice W. R. & Gaines S. D. (1989). One-way analysis of variance with unequal variances. Proc. Nat. Acad. Sci. 86, 8183–4. Romano J. P. (1990). On the behavior of randomization tests without a group invariance assumption. J. Am. Statist. Assoc. 85, 686–92. Rosenbaum P. R. (2010). Design of Observational Studies. New York: Springer. Rubin D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66, 688–701. Rubin D. B. (1980). Comment on “Randomization analysis of experimental data: The Fisher randomization test” by D. Basu. J. Am. Statist. Assoc. 75, 591–3. Scheffe H. (1959). The Analysis of Variance. New York: John Wiley & Sons. Silvey S. D. (1954). The asymptotic distributions of statistics arising in certain non-parametric tests. Glasgow Math. J. 2, 47–51. Weerahandi S. (1995). ANOVA under unequal error variances. Biometrics 51, 589–99. Welch B. (1937). On the $$z$$-test in randomized blocks and Latin squares. Biometrika 29, 21–52. Welch B. (1951). On the comparison of several mean values: An alternative approach. Biometrika 38, 330–6. © 2017 Biometrika Trust

### Journal

Biometrika, Oxford University Press

Published: Mar 1, 2018
